In this notebook, I will go over a simple Python scraper.
We will use the Python 'requests' module to open the website URL.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}
For this example, I will crawl the crosswordclue.co website. It is a relatively easy website to crawl. We will extract four items from it:
- Crossword Clue
- Crossword Hint
- Crossword Publisher
- Crossword Published Date
Let us start with a random URL. In the next few lines of code, we will extract the HTML text from the URL below and pass it to the lxml module. Python lxml is a package for querying the DOM of a website.
url = 'https://crosswordclue.co/does-a-farriers-work-crossword-clue'
response = requests.get(url,headers=headers)
response.status_code
import lxml.html
root = lxml.html.fromstring(response.text)
Once we have the root using lxml, we can query any HTML element. Let us find out the title of this webpage.
print(root.xpath('.//title')[0].text_content())
Ok, we got the title above. Now let us work on extracting the four fields we described above.
Let us first extract the crossword clue...
To find the location of the clue, open the above webpage URL in Google Chrome, locate the clue, right-click the text and then click Inspect; Chrome will show the corresponding HTML tag location.
We can access the above tag as shown below.
r = root.xpath('.//h1/strong')
clue = r[0].text
print(clue)
Let us get the remaining fields: publisher, publish date and the hint (called 'solution' in the code below).
publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
publisher
strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")
strdate
Let us convert the above string date to a Python datetime object.
import datetime
publishdate = datetime.datetime.strptime(strdate,'%B-%d-%Y')
publishdate
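One caveat worth noting: the .replace("th","") trick above only strips 'th' day suffixes (like 15th). As a small optional sketch on top of the strdate variable above (not part of the original code), a regular expression also handles days ending in st, nd or rd:
import re
# Strip any ordinal suffix (1st, 2nd, 3rd, 4th, ...) before parsing the date.
strdate_clean = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', strdate)
publishdate = datetime.datetime.strptime(strdate_clean, '%B-%d-%Y')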
solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
print(solution)
That's it, we are done with the parsing part. Let us put all the code above into the following two functions.
def getroot(url):
    response = requests.get(url, headers=headers)
    root = lxml.html.fromstring(response.text)
    return(root)
def parsehtml(root):
    publisher = root.xpath('//div//ol/li/a')[1].attrib['href'].replace("/","").replace("-crossword-answers","")
    solution = root.xpath('//div//*[@class="clearfix solution"]')[0].text_content().replace(" ","").strip('\n')
    strdate = root.xpath('//div//ol/li/a')[2].attrib['href'].split("/")[2].replace("th","")
    publishdate = datetime.datetime.strptime(strdate,'%B-%d-%Y')
    clue = root.xpath('.//h1/strong')[0].text_content().replace('"',"")
    return({'publisher':publisher,
            'publishdate':publishdate,
            'solution':solution,
            'clue':clue
            })
Let us check if we can use the above two functions.
root = getroot(url)
resp = parsehtml(root)
print(resp)
Ok, that works for one webpage. Let us now assume that we want to scrape the entire website starting from the home page. To do that, we will have to get all the links and then pass those links to our lxml parser.
Let us start with the home page.
parenturl = 'https://crosswordclue.co/'
To keep track of visited URLs, let us create a small function that checks for that.
urlsvisited = {}
def checkIfUrlVisited(url):
    if url in urlsvisited:
        return(True)
    return(False)
The function below will get all the links of a given URL.
def getPageLinks(url):
    proot = getroot(url)
    for alink in proot.xpath('.//a'):
        #internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                #do something ex: parseHTML
Ok, in the above function we are getting all the links for a given URL using proot.xpath('.//a') and checking whether each link ends with 'crossword-answers'. If it does and we haven't seen it before, we process it and add the link to our urlsvisited dictionary so that we don't crawl it again. The above snippets should already be sufficient for you to build a crawler for the entire website, for example along the lines of the sketch below.
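For reference, a minimal single-threaded crawler built from those pieces could look like this. It is only an illustration (the crawl_site() name and the error guard around parsehtml() are my additions, not part of the original notebook); the threaded version we build next replaces it.
# A minimal single-threaded sketch: reuses getroot(), parsehtml(),
# checkIfUrlVisited() and the urlsvisited dictionary defined above.
def crawl_site(parenturl):
    tovisit = [parenturl]
    while tovisit:
        pageurl = tovisit.pop()
        root = getroot(pageurl)
        # collect new internal links, the same way getPageLinks() does
        for alink in root.xpath('.//a'):
            ilink = alink.attrib.get('href', '')
            if ilink.endswith('crossword-answers') and not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                tovisit.append(parenturl + ilink.replace("/", ""))
        # not every page has the clue layout, so guard the parsing step
        try:
            print(parsehtml(root))
        except (IndexError, ValueError):
            pass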
But let us take the above learnings one step further and see how we can accelerate the process of crawling the entire website.
import threading
import time
from multiprocessing import JoinableQueue
The function below will start the Python threads and assign each thread a worker. We will define our worker in a bit.
threads = []
def start_threads(num_threads):
    try:
        for i in range(num_threads):
            t = threading.Thread(target=worker, args=())
            t.start()
            threads.append(t)
    except Exception as e:
        print("Error: unable to start thread", e)
Let us fix our getPageLinks() function above so that we can use it with Python threads. We will use a Python queue to crawl as well as to track the number of URLs. Check out, in the snippet below, the use of
my_queue.put(parenturl + ilink)
def getPageLinks(url):
    proot = getroot(url)
    for alink in proot.xpath('.//a'):
        #internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                my_queue.put(parenturl + ilink)
            else:
                continue
Now let us define our worker as well.
def worker():
    while(my_queue.qsize() > 0):
        pageurl = my_queue.get()
        root = getroot(pageurl)
        crosswordinfo = parsehtml(root)
        print(crosswordinfo)
        my_queue.task_done()
        print("done")
    print("exiting")
Let us look at our worker snippet above. The worker has a while loop which keeps running as long as our queue is not empty. The worker gets a URL from the queue, opens it, and passes the result to the parsehtml() function, which returns the fields we need from the webpage. In the above function, I am printing the output to the console using print(crosswordinfo), but in a real scenario you would store this data in your database.
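As a rough sketch of that idea (the database file and the table and column names here are assumptions for illustration, not part of the original code), the print call could be swapped for a small helper that writes each record to a local SQLite file:
import sqlite3
# Hypothetical storage helper: writes one parsed record to a local SQLite file.
# A fresh connection per call keeps it safe to use from multiple worker threads.
def save_crosswordinfo(crosswordinfo, dbfile='crosswords.db'):
    conn = sqlite3.connect(dbfile)
    conn.execute("""CREATE TABLE IF NOT EXISTS clues
                    (publisher TEXT, publishdate TEXT, solution TEXT, clue TEXT)""")
    conn.execute("INSERT INTO clues VALUES (?,?,?,?)",
                 (crosswordinfo['publisher'],
                  crosswordinfo['publishdate'].isoformat(),
                  crosswordinfo['solution'],
                  crosswordinfo['clue']))
    conn.commit()
    conn.close()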
Ok, most of the code is done. Let us write a few lines of code to start our crawler.
In the snippet below, we are defining a queue first.
getPageLinks(parenturl) will start the crawler from the parent URL, crosswordclue.co, and then we loop through our URL queue with a while loop.
import time
urlsvisited = {}
my_queue = JoinableQueue()
getPageLinks(parenturl)
print("sleeping")
time.sleep(5)
count=0
while(my_queue.qsize() > 0):
    print(count, my_queue.get())
    count += 1
Ok, it has worked as we expected. Let us check the length of our queue now. It should be zero because the loop above drained everything we collected from the home page.
print(my_queue.qsize())
Ok, let us put all the pieces together and run it for the entire website.
import time
my_queue = JoinableQueue()
getPageLinks(parenturl)
print("sleeping")
time.sleep(5)
num_threads = 10
start_threads(num_threads)
count=0
while(my_queue.qsize() > 0):
    suburl = my_queue.get()
    my_queue.put(suburl)
    getPageLinks(suburl)
    count += 1
    if my_queue.qsize() > 500:
        print("sleeping", my_queue.qsize())
        time.sleep(5)
    else:
        break
In the above snippet, we have added code to start our 10 threads. Inside the while loop, we get a URL from our queue and pass it to our getPageLinks() function, which extracts the additional links on that page and adds them to our queue. To make sure our queue size doesn't become too big, we sleep for a few seconds whenever the queue holds more than 500 entries.
Note that the code can be optimized further: we are opening the same URL twice, once inside the main loop and a second time inside our worker function. Let us fix that.
import time
my_queue = JoinableQueue()
urlsvisited = {}
getPageLinks(getroot(parenturl))
print("sleeping")
time.sleep(5)
num_threads = 10
start_threads(num_threads)
count=0
while(my_queue.qsize() > 0):
    suburl = my_queue.get()
    my_queue.put(suburl)
    root = getroot(suburl)
    getPageLinks(root)
    count += 1
    if my_queue.qsize() > 500:
        print("sleeping", my_queue.qsize())
        time.sleep(5)
    else:
        break
We need to fix our getPageLinks() function too, so that it takes the lxml root as input.
def getPageLinks(proot):
    for alink in proot.xpath('.//a'):
        #internal link
        ilink = alink.attrib['href']
        if ilink.endswith('crossword-answers'):
            ilink = ilink.replace("/","")
            if not checkIfUrlVisited(ilink):
                urlsvisited[ilink] = 1
                my_queue.put(parenturl + ilink)
            else:
                continue
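One detail the snippets above leave implicit is how to know when the crawl is over. A minimal sketch, assuming the threads list populated by start_threads() and the urlsvisited dictionary defined earlier, is to simply join the worker threads once the main loop exits; each worker returns when the queue runs empty.
# Wait for all worker threads to finish; each one exits when the queue is empty.
for t in threads:
    t.join()
print("crawl finished, pages visited:", len(urlsvisited))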