How to Sort Pandas DataFrame with Examples

For this exercise I will be using Movie database which I have downloaded from Kaggle.

In [1]:
import pandas as pd

For this exercise we will be using ratings.csv file which comes with movie database.

In [4]:
ls ratings.csv
ratings.csv
In [5]:
df = pd.read_csv('ratings.csv')
In [6]:
df.head(2)
Out[6]:
userId movieId rating timestamp
0 1 110 1.0 1425941529
1 1 147 4.5 1425942435

Lets check the size of our dataframe

In [12]:
df.count()
Out[12]:
userId       26024289
movieId      26024289
rating       26024289
timestamp    26024289
dtype: int64

The dataframe has roughly 26 million entries. Each User has multiple entries. Lets check how many unique users are there.

In [15]:
len(df.userId.unique())
Out[15]:
270896

So there are 270k Unique users are present in our db. Lets checkout how many unique movies are there.

In [16]:
len(df.movieId.unique())
Out[16]:
45115

Ok, roughly around 45k movies are present for which we have the ratings present.

Lets also check the range of our rating column. For that we can use describe() method on the entire dataframe. Describe() method gives us the nice statistics.

In [22]:
df.describe()
Out[22]:
userId movieId rating timestamp
count 2.602429e+07 2.602429e+07 2.602429e+07 2.602429e+07
mean 1.350371e+05 1.584911e+04 3.528090e+00 1.171258e+09
std 7.817620e+04 3.108526e+04 1.065443e+00 2.052889e+08
min 1.000000e+00 1.000000e+00 5.000000e-01 7.896520e+08
25% 6.716400e+04 1.073000e+03 3.000000e+00 9.907545e+08
50% 1.351630e+05 2.583000e+03 3.500000e+00 1.151716e+09
75% 2.026930e+05 6.503000e+03 4.000000e+00 1.357578e+09
max 2.708960e+05 1.762750e+05 5.000000e+00 1.501830e+09

As we see above, rating has minimum value of 0.5 and max value of 5.0.

Now we know our dataframe much better, lets do sort operations now.

How To Do Sort By Values

Lets say we want to sort the movies by ratings from high to low that is in descending order.

In [19]:
df.sort_values(by=['rating'],ascending=False).head(2)
Out[19]:
userId movieId rating timestamp
18471899 191698 1617 5.0 879251202
8646727 89135 491 5.0 860830376

As we see above, the first two rows are of rating 5 now. Remember above command has not saved the data in the descending order. If we want to save the data, use the option inplace=True

In [29]:
df.sort_values(by=['rating'],ascending=False,inplace=True)
In [30]:
df.head(2)
Out[30]:
userId movieId rating timestamp
18471899 191698 1617 5.0 879251202
15579334 162194 94959 5.0 1387729076

For ascending just use the option, ascending=True, as shown below...

In [31]:
df.sort_values(by=['rating'],ascending=True).head(2)
Out[31]:
userId movieId rating timestamp
16757505 174227 3968 0.5 1112330388
19166883 199049 1777 0.5 1075598940

We can also sort by multiple columns in Pandas, lets say we want to sort highest to lowest rating and then by highest to lowest movieId.

In [32]:
df.sort_values(by=['rating','movieId'],ascending=False).head(2)
Out[32]:
userId movieId rating timestamp
83813 837 176271 5.0 1501765780
24906715 258955 176219 5.0 1501539818

As we see above, rating and movieId are both showing as highest to lowest. Lets add another column userId, it doesnt make sense to sort movieId and userId in this example but for the sake of exercise we will anyway do it.

In [33]:
df.sort_values(by=['rating','movieId','userId'],ascending=False).head(2)
Out[33]:
userId movieId rating timestamp
83813 837 176271 5.0 1501765780
24906715 258955 176219 5.0 1501539818

We saw that nothing much changed vs our previous command. Lets do a little change now, lets sort userId before movieId and see what happens

In [34]:
df.sort_values(by=['rating','userId','movieId'],ascending=False).head(2)
Out[34]:
userId movieId rating timestamp
26024285 270896 60069 5.0 1257032032
26024284 270896 58559 5.0 1257031564

We see a different result, what just happened. We sorted first rating then userId. Since 270896 has give more than one movie ratings, the code above sorts the movieId of user 270896 further from highest to lowest.

Lets do a meaniful example, Lets find the average rating by a user and sort it by highest to lowest. For this query, we will have to groupby first user and apply average operation on the column rating and then will have to sort it by rating. Lets try this.

In [35]:
import numpy as np
In [37]:
df1 = df.groupby('userId')['rating'].apply(np.average)

Lets check our dataframe df1. Remember we have not sorted the dataframe yet.

In [38]:
df1.head()
Out[38]:
userId
1    4.277778
2    3.318182
3    3.100000
4    3.500000
5    4.269231
Name: rating, dtype: float64

Lets check the datatype of df1

In [43]:
df1.__class__
Out[43]:
pandas.core.series.Series

It has changed to Pandas Series. Lets change it back to dataframe by using reset_index()

In [44]:
df2 = df1.reset_index()
In [45]:
df2.__class__
Out[45]:
pandas.core.frame.DataFrame
In [46]:
df2.head(2)
Out[46]:
userId rating
0 1 4.277778
1 2 3.318182

Now we have our dataframe back, lets sort by the rating from highest to lowest

In [48]:
df2.sort_values(by=['rating'],ascending=False,inplace=True)
In [49]:
df2.head(5)
Out[49]:
userId rating
36217 36218 5.0
153214 153215 5.0
81043 81044 5.0
264393 264394 5.0
245026 245027 5.0

The first column in the above dataframe is the row ids. Our data frame is sorted by rating and now we know which user has on average given hightest rating.