A Study of the TextRank Algorithm in Python

TextRank is a graph-based algorithm for keyword and sentence extraction. It is similar in nature to Google's PageRank algorithm.
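
To make the PageRank connection concrete, here is a minimal pure-Python sketch of the idea, not PyTextRank's actual implementation. It links words that occur near each other and then applies the PageRank update to that graph. The function name, the window size, the damping factor and the toy token list are all assumptions chosen for illustration.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Rank words by running PageRank on an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start from a uniform score and apply the PageRank update repeatedly.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy token list (an assumption, loosely based on the first review below).
tokens = "fm tuner missing launcher great launcher real fm tuner launcher".split()
ranking = textrank_keywords(tokens)
for word, score in ranking:
    print(f"{score:.3f} {word}")
```

Words that co-occur with many different words end up with the highest scores. PyTextRank works on noun phrases produced by spaCy rather than on raw tokens, so its output will differ from this sketch.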

In this post we will walk through how to install PyTextRank and use it to extract keywords from Android app reviews.

Requirements:

  1. Python 3.5+
  2. spaCy
  3. PyTextRank
In [1]:
!pip install spacy
!pip install pytextrank
In [2]:
import pytextrank
import spacy
import pandas as pd

For this exercise I will be using a CSV file of Android reviews.

In [3]:
!ls data/sample_data.csv
data/sample_data.csv

Let us read the CSV file using pandas' read_csv().

In [4]:
df = pd.read_csv('data/sample_data.csv')

Let us take a peek at our data.

In [5]:
df.head(2)
Out[5]:
Unnamed: 0 rating review
0 0 4 anyone know how to get FM tuner on this launch...
1 1 2 Developers of this app need to work hard to fi...

Let's get rid of the Unnamed: 0 column by setting index_col=0 when calling pd.read_csv().

In [6]:
df = pd.read_csv('data/sample_data.csv',index_col=0)
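
The effect of index_col=0 can be checked on a small in-memory CSV. The two-row data below is made up for illustration and only mimics the shape of the real file, where the first header cell is empty because the index was saved along with the data.

```python
import io
import pandas as pd

# Hypothetical two-row CSV shaped like the file in this post: the first
# header cell is empty, which pandas names "Unnamed: 0" by default.
csv_text = ',rating,review\n0,4,"great launcher"\n1,2,"many issues"\n'

with_extra = pd.read_csv(io.StringIO(csv_text))
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))     # ['Unnamed: 0', 'rating', 'review']
print(list(without_extra.columns))  # ['rating', 'review']
```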

Set display.max_colwidth to -1 so that the review text is not truncated in our Python notebook. (On pandas 1.0 and later, -1 is deprecated for this option; pass None instead.)

In [7]:
pd.set_option('display.max_colwidth', -1)
In [8]:
df.head(1)
Out[8]:
rating review
0 4 anyone know how to get FM tuner on this launcher? It is available in the dafault launcher but does not show up in app list to add to this one. Otherwise.. great launcher! All I can find on the store are apps for streaming stations but the original launcher did have a real FM tuner which is the only thing missing from this launcher.

Let's try to find the keywords in a few of these reviews.

In [9]:
review1 = df.iloc[0]['review']

Before we do that, we need to load our spaCy model.

In [10]:
nlp = spacy.load('en_core_web_sm')

Let's initialize PyTextRank now.

In [11]:
tr = pytextrank.TextRank(logger=None)

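Next we need to add TextRank as a component in our spaCy pipeline. Note that the calls in this post follow the PyTextRank 2.x API; from PyTextRank 3.0 on, with spaCy 3, the component is added by name with nlp.add_pipe("textrank") instead.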

In [12]:
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

Now we are ready to use our model. Let's run the review text through our spaCy pipeline.

In [13]:
doc = nlp(review1)
In [15]:
for phrase in doc._.phrases:
    print("%s %s %s"%(phrase.rank, phrase.count, phrase.text))
0.1643258973249535 1 app list
0.14870405163352085 1 fm tuner
0.10002872204845309 1 a real fm tuner
0.09741561461611117 1 stations
0.09562079838741741 1 the dafault launcher
0.094116179868447 1 the original launcher
0.07679311366536046 2 this launcher
0.07303293766844456 1 the only thing
0.06477630351859456 1 otherwise.. great launcher
0.053698883087075634 1 the store
0.03965858602000139 1 this one
0.0 3 anyone

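As we can see above, the first column is the TextRank score, the second is the number of times the phrase occurs, and the third is the phrase itself. The higher the rank, the better the quality of the extracted keyword.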
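
Since a higher rank means a better keyword, a natural next step is to keep only the top few phrases above some cutoff. The sketch below uses a hypothetical Phrase namedtuple as a stand-in for the objects in doc._.phrases, with rounded values copied from the output above. The threshold of 0.05 and the limit of 3 are arbitrary assumptions, not recommended settings.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects in doc._.phrases; only the three
# attributes printed in this post are modelled.
Phrase = namedtuple("Phrase", ["rank", "count", "text"])

# A few rows copied (rounded) from the output above.
phrases = [
    Phrase(0.1643, 1, "app list"),
    Phrase(0.1487, 1, "fm tuner"),
    Phrase(0.0768, 2, "this launcher"),
    Phrase(0.0397, 1, "this one"),
    Phrase(0.0, 3, "anyone"),
]

def top_keywords(phrases, min_rank=0.05, limit=3):
    """Return the text of the highest-ranked phrases above a rank threshold."""
    kept = [p for p in phrases if p.rank >= min_rank]
    kept.sort(key=lambda p: p.rank, reverse=True)
    return [p.text for p in kept[:limit]]

print(top_keywords(phrases))  # ['app list', 'fm tuner', 'this launcher']
```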

Let's do another example.

In [16]:
df.iloc[1]['review']
Out[16]:
'Developers of this app need to work hard to fine tune. There are many issues in this app. I sent an email to developers but they don\'t bother to reply the email. I can not add system widgets to the screen. If added one, it only displays \\recover\\". Weather is nit displayed on home screen. Doesn\'t support built-in music player and it\'s control. Speed is not accurate. Please try to work on these issues if you really want to make this app the one of its kind."'
In [21]:
doc = nlp(df.iloc[1]['review'])
for phrase in doc._.phrases:
    print(phrase.rank,phrase.count,phrase.chunks)
0.11430978384935088 1 [system widgets]
0.11159252187593624 1 [home screen]
0.10530999092027488 1 [many issues]
0.0979183266371772 1 [fine tune]
0.08643261057360326 1 [nit]
0.08563916592311799 1 [Speed]
0.08201697027034136 2 [Developers, developers]
0.07255614913054882 1 [Weather]
0.06461967687026247 3 [this app, this app, this app]
0.06362587300087594 1 [built-in music player]
0.055491039197743064 2 [an email, the email]
0.05137598599688147 1 [these issues]
0.04561572496611145 1 [the screen]
0.033167906340332974 1 [control]
0.0175899386182573 1 [its kind]
0.0 8 [I, they, I, it, it, you, one, one]
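
In both outputs the entries with rank 0.0 are pronouns such as "I", "they" and "anyone", which are not useful keywords. One way to handle this, sketched below, is to wrap the steps in a small helper that skips zero-rank phrases. The helper name extract_keywords and the fake_nlp stand-in are assumptions made so the example runs without a spaCy model; with the setup from this post you would pass the real nlp object.

```python
from types import SimpleNamespace

def extract_keywords(nlp, text, limit=5):
    """Run text through the pipeline and return up to `limit` ranked phrases."""
    doc = nlp(text)
    # Rank 0.0 entries in the outputs above are pronouns, so skip them.
    return [p.text for p in doc._.phrases if p.rank > 0][:limit]

# Stand-in pipeline so the sketch runs without a spaCy model; it mimics
# only the doc._.phrases attribute used in this post.
def fake_nlp(text):
    phrases = [
        SimpleNamespace(rank=0.1143, text="system widgets"),
        SimpleNamespace(rank=0.1116, text="home screen"),
        SimpleNamespace(rank=0.0, text="I"),
    ]
    return SimpleNamespace(_=SimpleNamespace(phrases=phrases))

print(extract_keywords(fake_nlp, "any review text"))
# ['system widgets', 'home screen']
```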

Commonly encountered errors while installing spaCy

You might run into the following error while loading the spaCy model with spacy.load("en_core_web_sm"):

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Do the following to fix it.

In [22]:
!python3 -m spacy download en_core_web_sm
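
If you want to check for the model before calling spacy.load(), one option is the small sketch below. It relies on the fact that the download command installs the model as a regular Python package. The helper name model_installed is an assumption, and this is only a convenience check, not part of spaCy's API.

```python
import importlib.util

def model_installed(name="en_core_web_sm"):
    """Return True if a package with the model's name can be imported."""
    return importlib.util.find_spec(name) is not None

if not model_installed():
    print("Model missing; run: python3 -m spacy download en_core_web_sm")
```

In a notebook you may need to restart the kernel after the download before the model can be loaded.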

Wrap Up!

This tutorial just introduces the TextRank algorithm. In the next tutorial, I will go over how to improve its results.
