A Study of the TextRank Algorithm in Python
TextRank is a graph-based algorithm for keyword and sentence extraction. It is similar in nature to Google's PageRank algorithm.
In this post we will walk through how to install TextRank and use it to extract keywords from Android reviews.
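To make the PageRank analogy concrete, here is a minimal pure-Python sketch of the idea, not PyTextRank's actual implementation. It assumes a simple undirected word co-occurrence graph; the function name textrank_keywords and the parameters window, damping and iterations are illustrative choices, and the sample sentence is made up.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Score words by running PageRank on an undirected co-occurrence graph."""
    # Link each word to the words that appear up to `window` positions after it.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start with equal scores, then repeatedly let every word
    # share its score evenly among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up token list, used only for illustration.
tokens = "battery drains fast battery life poor battery".split()
for word, score in textrank_keywords(tokens):
    print(f"{score:.3f} {word}")
```

In this toy input, "battery" ends up with the highest score because it co-occurs with the most distinct words. PyTextRank works on noun phrases and lemmas produced by spaCy rather than raw tokens, so its scores will differ.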
Requirements:
- Python 3.5+
- spaCy
- PyTextRank
!pip install spacy
!pip install pytextrank
import pytextrank
import spacy
import pandas as pd
For this exercise I will be using a CSV file of Android reviews.
!ls data/sample_data.csv
Let us read the CSV file using pandas read_csv().
df = pd.read_csv('data/sample_data.csv')
Let us take a peek at our data.
df.head(2)
Let's get rid of the Unnamed: 0 column by setting index_col=0 when calling pd.read_csv.
df = pd.read_csv('data/sample_data.csv',index_col=0)
Set display.max_colwidth to -1 so that the data is not truncated in our Python notebook. (pandas 1.0 and later deprecate -1 for this option; pass None instead.)
pd.set_option('display.max_colwidth', -1)
df.head(1)
Let's try to find the keywords in a few of these reviews.
review1 = df.iloc[0]['review']
Before we do that, we need to load our spaCy model.
nlp = spacy.load('en_core_web_sm')
Let's initialize PyTextRank now.
tr = pytextrank.TextRank(logger=None)
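Next we need to add TextRank as a pipeline component to our spaCy model. This call uses the PyTextRank 2.x API; PyTextRank 3.x instead registers the component by name, as in nlp.add_pipe("textrank").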
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
Now we are ready to use our model. Let's run the review text through our spaCy pipeline.
doc = nlp(review1)
for phrase in doc._.phrases:
    print("%s %s %s"%(phrase.rank, phrase.count, phrase.text))
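As we can see above, the first column is the TextRank score, the second is the number of times the phrase occurs, and the third is the phrase text. The higher the rank, the better the quality of the extracted keyword.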
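Since a higher rank signals a better keyword, a natural next step is to keep only the top few phrases. The sketch below is one possible way to do that. The helper top_phrases and its limit and min_rank parameters are hypothetical, and the Phrase namedtuple is a stand-in with made-up values so the example runs without spaCy; in the notebook you would pass doc._.phrases instead.

```python
from collections import namedtuple

# Hypothetical stand-in for the phrase objects in doc._.phrases;
# only the rank, count and text attributes used in this post are modeled.
Phrase = namedtuple("Phrase", ["rank", "count", "text"])

def top_phrases(phrases, limit=5, min_rank=0.0):
    """Return up to `limit` phrase texts with rank >= min_rank, best first."""
    kept = [p for p in phrases if p.rank >= min_rank]
    kept.sort(key=lambda p: p.rank, reverse=True)
    return [p.text for p in kept[:limit]]

# Made-up sample values, for illustration only.
sample = [
    Phrase(0.05, 1, "app"),
    Phrase(0.18, 2, "battery life"),
    Phrase(0.11, 1, "latest update"),
]
print(top_phrases(sample, limit=2, min_rank=0.1))
```

A sensible value for min_rank depends on the text, so it is worth inspecting the printed ranks for a few reviews before choosing one.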
Let's do another example.
df.iloc[1]['review']
doc = nlp(df.iloc[1]['review'])
for phrase in doc._.phrases:
    print(phrase.rank,phrase.count,phrase.chunks)
Commonly encountered errors while installing spaCy
You might run into the following error while loading the spaCy model with spacy.load("en_core_web_sm"):
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Run the following to fix it.
!python3 -m spacy download en_core_web_sm
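If you would rather check for the model before calling spacy.load, the sketch below shows one way to do it using only the standard library. It relies on the fact that spaCy models such as en_core_web_sm are installed as regular Python packages; the helper name model_installed is hypothetical.

```python
import importlib.util

def model_installed(name="en_core_web_sm"):
    """Return True if a top-level package called `name` can be found by Python."""
    return importlib.util.find_spec(name) is not None

if not model_installed():
    print("Model missing; run: python3 -m spacy download en_core_web_sm")
```

This only confirms that the package is importable. It does not check that the model version is compatible with the installed spaCy version.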
Wrap Up!
This tutorial is only an introduction to the TextRank algorithm. In the next tutorial, I will go over how to improve its results.