# Crawl Websites Using Python

In this notebook, I will go over a simple Python scraper.

## How to Parse a Website Using Python lxml

We will use the Python 'requests' module to fetch the website URL.

In [1]:
import requests

In [2]:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}


For this example, I will crawl the crosswordclue.co website. It is a relatively easy website to crawl. We will extract four items from it:

1. Crossword Clue
2. Crossword Hint
3. Crossword Publisher
4. Crossword Published Date

Let us start with a random URL. In the next few lines of code, we will extract the HTML text from this URL and pass it to the lxml module. Python lxml is a package for querying the DOM of a website.

In [3]:
url = 'https://crosswordclue.co/does-a-farriers-work-crossword-clue'

In [4]:
response = requests.get(url,headers=headers)

In [5]:
response.status_code

Out[5]:
200
In [6]:
import lxml.html

In [76]:
root = lxml.html.fromstring(response.text)


Once we have the root using lxml, we can query any html component. Let us find out the title of this webpage.

In [80]:
print(root.xpath('.//title')[0].text_content())

Does a farrier's work crossword clue • Crossword Tracker


Ok, we got the title. Now let us work on extracting the four fields described above.

Let us first extract the crossword clue...

To find the location of the clue, open the above URL in Google Chrome, right-click the clue text, and then click Inspect. Chrome will show the corresponding HTML tag.

We can access the above tag as shown below.

In [82]:
r = root.xpath('.//h1/strong')

In [83]:
clue = r[0].text
print(clue)

Does a farrier's work


Let us get the remaining fields: publisher, publish date, and hint.

In [84]:
publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
publisher

Out[84]:
'washington-post'
In [85]:
strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")

In [86]:
strdate

Out[86]:
'february-28-2017'
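A small caveat: the `.replace("th","")` trick above only handles dates like "28th"; suffixes such as "1st", "2nd", or "3rd" would slip through. A regex that strips any ordinal suffix directly following digits is a bit more robust — a minimal sketch (the helper name `strip_ordinal` is my own):

```python
import re

def strip_ordinal(datestr):
    # Remove an ordinal suffix (st/nd/rd/th) that directly follows digits,
    # e.g. 'february-28th-2017' -> 'february-28-2017'
    return re.sub(r'(\d+)(st|nd|rd|th)', r'\1', datestr)

print(strip_ordinal('february-28th-2017'))  # february-28-2017
print(strip_ordinal('march-1st-2017'))      # march-1-2017
```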

Let us convert the above string date to Python datetime object.

In [87]:
import datetime

In [88]:
publishdate = datetime.datetime.strptime(strdate,'%B-%d-%Y')

In [89]:
publishdate

Out[89]:
datetime.datetime(2017, 2, 28, 0, 0)
In [91]:
solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
print(solution)

SHOES


That's it, we are done with the parsing part. Let us put all the code above into the following two functions.

In [92]:
def getroot(url):
    # Fetch the given URL and parse it; the original version reused a stale
    # global `response` instead of fetching the url passed in
    response = requests.get(url, headers=headers)
    root = lxml.html.fromstring(response.text)
    return(root)

In [97]:
def parsehtml(root):
    solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
    strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")
    publishdate = datetime.datetime.strptime(strdate, '%B-%d-%Y')
    publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
    clue = root.xpath('.//h1/strong')[0].text_content().replace('"',"")

    return({'publisher': publisher,
            'publishdate': publishdate,
            'solution': solution,
            'clue': clue})


Let us check if we can use the above two functions.

In [98]:
root = getroot(url)
resp = parsehtml(root)
print(resp)

{'publisher': 'washington-post', 'publishdate': datetime.datetime(2017, 2, 28, 0, 0), 'solution': 'SHOES', 'clue': "Does a farrier's work"}


Ok, that works for one webpage. Let us now assume that we want to scrape the entire website, starting from the home page. To do that, we will have to get all the links and then pass those links to our lxml parser.

In [21]:
parenturl = 'https://crosswordclue.co/'


To keep track of visited URLs, let us create a small function to check for that.

In [99]:
urlsvisited = {}
def checkIfUrlVisited(url):
    if url in urlsvisited:
        return(True)
    return(False)
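As a side note, a Python set is the more idiomatic container for this kind of membership check, and it behaves the same as the dictionary version. A quick sketch (the rest of this notebook sticks with the dictionary):

```python
urlsvisited = set()

def checkIfUrlVisited(url):
    # Membership test on a set is O(1), same behaviour as the dict version
    return url in urlsvisited

urlsvisited.add('/washington-post-crossword-answers')
print(checkIfUrlVisited('/washington-post-crossword-answers'))  # True
print(checkIfUrlVisited('/some-unseen-url'))                    # False
```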


The function below will get all the links on the page at a given URL.

In [49]:
def getPageLinks(url):
    proot = getroot(url)
    for link in proot.xpath('.//a'):
        href = link.attrib.get('href', '')
        if href.endswith('crossword-answers') and not checkIfUrlVisited(href):
            urlsvisited[href] = True
            #do something ex: parseHTML


Ok, in the above function we are getting all the links for a given URL using proot.xpath('.//a') and checking whether each link ends with 'crossword-answers'. If it does, we process it and at the same time add the link to our urlsvisited dictionary so that we don't crawl it again. The snippets so far should be sufficient for you to build a crawler for the entire website.
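To make the link-filtering step concrete, here is a self-contained sketch on a tiny hand-written HTML snippet (the markup below is made up for illustration, it is not the site's real HTML):

```python
import lxml.html

# Made-up markup standing in for a page from the site
html = '''<div><ol>
  <li><a href="/washington-post-crossword-answers">Washington Post</a></li>
  <li><a href="/about">About</a></li>
</ol></div>'''

proot = lxml.html.fromstring(html)
# Keep only hrefs that end with 'crossword-answers'
links = [a.attrib['href'] for a in proot.xpath('.//a')
         if a.attrib.get('href', '').endswith('crossword-answers')]
print(links)  # ['/washington-post-crossword-answers']
```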

But let us take the above learnings one step further and see how we can accelerate the process of crawling the entire website.

## Crawl the entire site using Python threads and Queues

In [50]:
import threading
import time
from multiprocessing import JoinableQueue


The function below will start the Python threads and assign each thread a worker. We will define our worker in a bit.

In [51]:
def startThreads():
    threads = []
    for i in range(5):
        t = threading.Thread(target=worker)
        try:
            t.start()
            threads.append(t)
        except Exception as e:
            print(e)


Let us fix our getPageLinks function above so that we can use it with Python threads. We will use Python queues both to crawl and to track the number of URLs. Note in the snippet below how my_queue is used to hand discovered links to the workers.

In [106]:
def getPageLinks(url):
    proot = getroot(url)
    for link in proot.xpath('.//a'):
        href = link.attrib.get('href', '')
        if href.endswith('crossword-answers') and not checkIfUrlVisited(href):
            urlsvisited[href] = True
            my_queue.put(href)
        else:
            continue


Now let us define our worker as well.

In [64]:
def worker():
    while(my_queue.qsize() > 0):
        pageurl = my_queue.get()
        root = getroot(pageurl)
        crosswordinfo = parsehtml(root)
        print(crosswordinfo)
        print("done")
    print("exiting")
    raise KeyboardInterrupt


Let us look at our worker snippet above. The worker has a while loop which keeps working until our queue is empty. The worker gets a URL from the queue and passes it to the parsehtml() function, which returns the concerned fields from a webpage. In the above function, I am printing the output to the console using print(crosswordinfo), but in a real scenario you would store this data in your database.
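For example, the parsed fields could be written to SQLite. A minimal sketch, with a made-up table name and an in-memory database standing in for a real one:

```python
import sqlite3

# Table and column names are made up for illustration; ':memory:' stands
# in for a real database file
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE crosswords
                (clue TEXT, solution TEXT, publisher TEXT, publishdate TEXT)''')

def savecrossword(info):
    # info is the dict returned by parsehtml()
    conn.execute('INSERT INTO crosswords VALUES (?, ?, ?, ?)',
                 (info['clue'], info['solution'],
                  info['publisher'], str(info['publishdate'])))
    conn.commit()

savecrossword({'clue': "Does a farrier's work", 'solution': 'SHOES',
               'publisher': 'washington-post', 'publishdate': '2017-02-28'})
print(conn.execute('SELECT COUNT(*) FROM crosswords').fetchone()[0])  # 1
```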

Ok, most of the code is done. Let us write a few lines of code to start our crawler.

In the snippet below, we define a queue first. getPageLinks(parenturl) starts the crawler from the parent URL, crosswordclue.co, and then we loop through our URL queue with a while loop.

In [108]:
import time
urlsvisited = {}
my_queue = JoinableQueue()
getPageLinks(parenturl)
print("sleeping")
time.sleep(5)
count=0
while(my_queue.qsize() > 0):
    print(count, my_queue.get())
    count+=1

sleeping


Ok, it worked as we expected. Let us check the length of our queue now. It should be zero because we just ran it for the home page.

In [109]:
print(my_queue.qsize())

0


Ok, let us put all the pieces together and run it for the entire website.

In [110]:
import time
my_queue = JoinableQueue()
urlsvisited = {}
getPageLinks(parenturl)
for i in range(5):
    threading.Thread(target=worker).start()
count=0
while(my_queue.qsize() > 0):
    suburl = parenturl + my_queue.get()
    getPageLinks(suburl)
    count+=1
    if my_queue.qsize() > 500:
        print("sleeping", my_queue.qsize())
        time.sleep(5)


In the above snippet, we have added code to start our 5 threads. Inside the while loop, we get a URL from our queue and pass it to our getPageLinks() function. getPageLinks() extracts the additional links for each URL and adds them to our queue. To make sure our queue size doesn't become too big, we pause when the queue grows beyond 500 entries.

Note that the code can be optimized further. We are opening the same URL twice: once inside the main loop and a second time inside our worker function. Let us fix that.

In [103]:
import time
my_queue = JoinableQueue()
urlsvisited = {}
getPageLinks(getroot(parenturl))
for i in range(5):
    threading.Thread(target=worker).start()
count=0
while(my_queue.qsize() > 0):
    suburl = parenturl + my_queue.get()
    root = getroot(suburl)
    getPageLinks(root)
    count+=1
    if my_queue.qsize() > 500:
        print("sleeping", my_queue.qsize())
        time.sleep(5)


We need to fix our getPageLinks() function too so that it takes (lxml) root as input.

In [54]:
def getPageLinks(proot):
    for link in proot.xpath('.//a'):
        href = link.attrib.get('href', '')
        if href.endswith('crossword-answers') and not checkIfUrlVisited(href):
            urlsvisited[href] = True
            my_queue.put(href)