Crawl Websites Using Python

In this notebook, I will go over a simple Python scraper.

How to Parse a Website Using Python lxml

We will use the Python 'requests' module to open the website URL.

In [1]:
import requests
In [2]:
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}

For this example, I will crawl the crosswordclue.co website. It is a relatively easy website to crawl. We will crawl four items from this website:

  1. Crossword Clue
  2. Crossword Hint
  3. Crossword Publisher
  4. Crossword Published Date

Let us start with a random URL. In the next few lines of code, we will extract the HTML text from this URL and pass it to the lxml module. Python lxml is a package for querying the DOM of a website.

In [3]:
url = 'https://crosswordclue.co/does-a-farriers-work-crossword-clue'
In [4]:
response = requests.get(url,headers=headers)
In [5]:
response.status_code
Out[5]:
200
In [6]:
import lxml.html
In [76]:
root = lxml.html.fromstring(response.text)

Once we have the root using lxml, we can query any html component. Let us find out the title of this webpage.

In [80]:
print(root.xpath('.//title')[0].text_content())
Does a farrier's work crossword clue • Crossword Tracker

Ok, we got the title. Now let us work on extracting the four fields we described above.

Let us first extract the crossword clue...

To find out the location of the clue, open the above webpage URL in Google Chrome, locate the clue, right-click on the text, and then click Inspect; Chrome will show the corresponding HTML tag location.

We can access the above tag as shown below.

In [82]:
r = root.xpath('.//h1/strong')
In [83]:
clue = r[0].text
print(clue)
Does a farrier's work

Let us get the remaining fields: the publisher, the publish date, and the solution.

In [84]:
publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
publisher
Out[84]:
'washington-post'
In [85]:
strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")
In [86]:
strdate
Out[86]:
'february-28-2017'

Let us convert the above string date to Python datetime object.

In [87]:
import datetime
In [88]:
publishdate = datetime.datetime.strptime(strdate,'%B-%d-%Y')
In [89]:
publishdate
Out[89]:
datetime.datetime(2017, 2, 28, 0, 0)
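
Note that the .replace("th","") above only strips a 'th' suffix; a date such as 'march-1st-2017' or 'may-23rd-2017' would still trip up strptime. Below is a small defensive sketch, assuming the URL keeps the same month-day-year layout (the parse_url_date helper name is mine, not part of this notebook):

import re
import datetime

def parse_url_date(strdate):
    # Strip any ordinal suffix (st, nd, rd, th) that follows the day number.
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', strdate)
    return datetime.datetime.strptime(cleaned, '%B-%d-%Y')

parse_url_date('february-28th-2017')   # datetime.datetime(2017, 2, 28, 0, 0)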
In [91]:
solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
print(solution)
SHOES

That's it, we are done with the parsing part. Let us put all the code above into the following two functions.

In [92]:
def getroot(url):
    response = requests.get(url,headers=headers)
    root = lxml.html.fromstring(response.text)
    return(root)
In [97]:
def parsehtml(root):
    publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
    solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
    strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")
    publishdate = datetime.datetime.strptime(strdate,'%B-%d-%Y')
    clue = root.xpath('.//h1/strong')[0].text_content().replace('"',"")
    
    return({'publisher':publisher, \
            'publishdate':publishdate, \
            'solution':solution, \
            'clue':clue
           })

Let us check if we can use the above two functions.

In [98]:
root = getroot(url)
resp = parsehtml(root)
print(resp)
{'publisher': 'washington-post', 'publishdate': datetime.datetime(2017, 2, 28, 0, 0), 'solution': 'SHOES', 'clue': "Does a farrier's work"}

Ok, that works for one webpage. Let us now assume that we want to scrape the entire website, starting from the home page. To do that, we will have to get all the links and then pass those links to our lxml parser.

Let us start with the home page.

In [21]:
parenturl = 'https://crosswordclue.co/'

To keep track of visited URLs, let us create a small function which will check for that.

In [99]:
urlsvisited = {}
def checkIfUrlVisited(url):
    if url in urlsvisited:
        return(True)
    return(False)

The function below will get all the links from a given URL.

In [49]:
def getPageLinks(url):
    proot = getroot(url)
    for alink in proot.xpath('.//a'):
        # internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                # process the new link here, e.g. parse its HTML
Ok, in the above function we are getting all the links for a given URL using proot.xpath('.//a') and checking whether each link ends with 'crossword-answers'. If it does and we have not seen it before, we process it and add it to our urlsvisited dictionary so that we don't crawl it again. The snippets so far should be sufficient for you to build a crawler for the entire website; a rough sequential sketch follows.
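
For reference, here is what such a crawler could look like without threads. It reuses getroot(), checkIfUrlVisited(), urlsvisited and parenturl from above; the breadth-first to_visit list is only illustrative and error handling is omitted.

# Minimal sequential crawl sketch: walk the listing pages starting
# from the home page, recording every link we have already seen.
to_visit = [parenturl]
while to_visit:
    page = to_visit.pop(0)
    proot = getroot(page)
    for alink in proot.xpath('.//a'):
        ilink = alink.attrib.get('href', '')
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/", "")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                to_visit.append(parenturl + ilink)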

But let us take the above learnings one step further and see how we can accelerate the process of crawling the entire website.

Crawl the Entire Site Using Python Threads and Queues

In [50]:
import threading
import time
from multiprocessing import JoinableQueue

The function below will start the Python threads and assign each thread a worker. We will define the worker in a bit.

In [51]:
threads = []
def start_threads(num_threads):
    try:
        for i in range(num_threads):
            t = threading.Thread(target=worker,args=())
            t.start()
            threads.append(t)
    except Exception as e:
        print("Error: unable to start thread",e)

Let us fix our getPageLinks function above so that we can use it with Python threads. We will use a Python queue to crawl as well as to track the number of URLs. Check out the use of the following call in the snippet below:
my_queue.put(parenturl + ilink)

In [106]:
def getPageLinks(url):
    proot = getroot(url)
    for alink in proot.xpath('.//a'):
        # internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                my_queue.put(parenturl + ilink)
            else:
                continue

Now let us define our worker as well.

In [64]:
def worker():
    while(my_queue.qsize() > 0):
        pageurl = my_queue.get()
        root = getroot(pageurl)
        crosswordinfo = parsehtml(root)
        print(crosswordinfo)
        my_queue.task_done()
        print("done")
    print("exiting")

Let us look at our worker snippet above. The worker has a while loop which keeps running as long as the queue is not empty. It gets a URL from the queue, fetches the page, and passes the result to the parsehtml() function, which returns the fields of interest from the webpage. In the function above I am printing the output to the console with print(crosswordinfo), but in a real scenario you would store this data in your database.
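
As a rough illustration of that last point, here is a sketch that writes each result to a local SQLite database instead of printing it. The file name, table layout and the savecrosswordinfo() helper are assumptions for illustration, not part of the notebook.

import sqlite3
import threading

# Illustrative persistence layer: one shared connection guarded by a lock,
# because several worker threads will be writing at the same time.
dblock = threading.Lock()
conn = sqlite3.connect('crosswords.db', check_same_thread=False)
conn.execute('CREATE TABLE IF NOT EXISTS clues '
             '(publisher TEXT, publishdate TEXT, solution TEXT, clue TEXT)')

def savecrosswordinfo(info):
    with dblock:
        conn.execute('INSERT INTO clues VALUES (?,?,?,?)',
                     (info['publisher'], info['publishdate'].isoformat(),
                      info['solution'], info['clue']))
        conn.commit()

Inside worker(), you would then call savecrosswordinfo(crosswordinfo) in place of print(crosswordinfo).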

Ok, most of the code is done. Let us write a few lines of code to start our crawler.

In the snippet below, we define a queue first. getPageLinks(parenturl) starts the crawler from the parent URL, crosswordclue.co, and then we loop through our URL queue with a while loop.

In [108]:
import time
urlsvisited = {}
my_queue = JoinableQueue()
getPageLinks(parenturl)
print("sleeping")
time.sleep(5)
count=0
while(my_queue.qsize() > 0):
    print(count,my_queue.get())
    count+=1
sleeping
0 https://crosswordclue.co/daily-crossword-answers
1 https://crosswordclue.co/eugene-sheffer-crossword-answers
2 https://crosswordclue.co/thomas-joseph-crossword-answers
3 https://crosswordclue.co/evening-standard-classic-crossword-answers
4 https://crosswordclue.co/evening-standard-cryptic-crossword-answers
5 https://crosswordclue.co/evening-standard-quick-crossword-answers
6 https://crosswordclue.co/evening-standard-easy-crossword-answers
7 https://crosswordclue.co/evening-standard-mini-crossword-answers
8 https://crosswordclue.co/star-tribune-crossword-answers
9 https://crosswordclue.co/universal-crossword-answers
10 https://crosswordclue.co/metro-crossword-answers
11 https://crosswordclue.co/nz-herald-crossword-answers
12 https://crosswordclue.co/new-york-times-crossword-answers
13 https://crosswordclue.co/usa-today-crossword-answers
14 https://crosswordclue.co/7-little-words-crossword-answers
15 https://crosswordclue.co/penny-dell-crossword-answers
16 https://crosswordclue.co/the-tennesean-crossword-answers
17 https://crosswordclue.co/the-independent-concise-crossword-answers
18 https://crosswordclue.co/the-independent-quick-crossword-answers
19 https://crosswordclue.co/aussies-crossword-answers
20 https://crosswordclue.co/crusader-crossword-answers
21 https://crosswordclue.co/irish-news-cryptic-crossword-answers
22 https://crosswordclue.co/irish-news-quick-crossword-answers
23 https://crosswordclue.co/irish-news-prize-crossword-answers
24 https://crosswordclue.co/7-little-words-bonus-crossword-answers
25 https://crosswordclue.co/crossword-champ-daily-crossword-answers
26 https://crosswordclue.co/crossword-champ-pro-crossword-answers
27 https://crosswordclue.co/crossword-champ-premium-crossword-answers
28 https://crosswordclue.co/7-little-words-bonus-2-crossword-answers
29 https://crosswordclue.co/7-little-words-bonus-3-crossword-answers
30 https://crosswordclue.co/7-little-words-bonus-4-crossword-answers
31 https://crosswordclue.co/premier-sunday-crossword-answers
32 https://crosswordclue.co/the-guardian-speedy-crossword-answers
33 https://crosswordclue.co/penny-dell-sunday-crossword-answers
34 https://crosswordclue.co/the-guardian-quick-crossword-answers
35 https://crosswordclue.co/the-guardian-weekend-crossword-answers
36 https://crosswordclue.co/the-guardian-cryptic-crossword-answers
37 https://crosswordclue.co/the-guardian-quiptic-crossword-answers
38 https://crosswordclue.co/the-guardian-everyman-crossword-answers
39 https://crosswordclue.co/the-guardian-prize-crossword-answers
40 https://crosswordclue.co/irish-times-crosaire-crossword-answers
41 https://crosswordclue.co/irish-times-simplex-crossword-answers
42 https://crosswordclue.co/the-sunday-crossword-by-evan-birnholz-crossword-answers
43 https://crosswordclue.co/newsday-crossword-answers
44 https://crosswordclue.co/classic-crossword-by-merl-reagle-crossword-answers
45 https://crosswordclue.co/la-times-crossword-answers
46 https://crosswordclue.co/mirror-classic-crossword-answers
47 https://crosswordclue.co/mirror-quiz-crossword-answers
48 https://crosswordclue.co/mirror-quick-crossword-answers
49 https://crosswordclue.co/mirror-cryptic-crossword-answers
50 https://crosswordclue.co/family-times-crossword-answers
51 https://crosswordclue.co/the-denver-post-crossword-answers
52 https://crosswordclue.co/jumble-crossword-answers
53 https://crosswordclue.co/canadian-crossword-answers
54 https://crosswordclue.co/daily-themed-crossword-answers
55 https://crosswordclue.co/boston-globe-crossword-answers
56 https://crosswordclue.co/wall-street-journal-crossword-answers
57 https://crosswordclue.co/rock-and-roll-crossword-answers
58 https://crosswordclue.co/mystic-words-crossword-answers
59 https://crosswordclue.co/washington-post-crossword-answers

Ok, it has worked as we expected. Let us check the length of our queue now. It should be zero because we drained it with my_queue.get() in the loop above.

In [109]:
print(my_queue.qsize())
0

Ok, let us put all the pieces together and run it for the entire website.

In [110]:
import time
my_queue = JoinableQueue()
urlsvisited = {}
getPageLinks(parenturl)
print("sleeping")
time.sleep(5)
num_threads = 10
start_threads(num_threads)
count=0
while(my_queue.qsize() > 0):
    suburl = my_queue.get()
    my_queue.put(suburl)
    getPageLinks(suburl)
    count+=1
    if my_queue.qsize() > 500:
        print("sleeping",my_queue.qsize())
        time.sleep(5)

In the above snippet, we have added code to start our 10 threads. Inside the while loop, we get a URL from our queue, put it back for the workers, and pass it to our getPageLinks() function, which extracts the additional links from each page and adds them to our queue. To make sure our queue size doesn't grow too big, we pause for a few seconds whenever it exceeds 500 entries.

Note that the code can be optimized further. We are opening the same URL twice: once inside the main loop and a second time in our worker function. Let us fix that.

In [103]:
import time
my_queue = JoinableQueue()
urlsvisited = {}
getPageLinks(getroot(parenturl))
print("sleeping")
time.sleep(5)
num_threads = 10
start_threads(num_threads)
count=0
while(my_queue.qsize() > 0):
    suburl = my_queue.get()
    my_queue.put(suburl)
    root = getroot(suburl)
    getPageLinks(root)
    count+=1
    if my_queue.qsize() > 500:
        print("sleeping",my_queue.qsize())
        time.sleep(5)

We need to fix our getPageLinks() function too so that it takes an lxml root as input.

In [54]:
def getPageLinks(proot):
    for alink in proot.xpath('.//a'):
        # internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                my_queue.put(parenturl + ilink)
            else:
                continue
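
Finally, once the main loop exits you will usually want to wait for the worker threads to finish draining the queue before the program ends. A minimal sketch using the threads list we filled in start_threads():

# Each worker exits on its own once my_queue is empty,
# so joining the threads is enough to wait for the crawl to finish.
for t in threads:
    t.join()
print("all workers done")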