How to Sort Pandas DataFrame with Examples

For this exercise I will be using Movie database which I have downloaded from Kaggle.

import pandas as pd

For this exercise we will be using ratings.csv file which comes with movie database.

ls moviesdb/ml-latest/ratings.csv

moviesdb/ml-latest/ratings.csv

df = pd.read_csv('moviesdb/ml-latest/ratings.csv')

df.head(2)

Lets check the size of our dataframe

df.count()

userId       26024289
movieId      26024289
rating       26024289
timestamp    26024289
dtype: int64

The dataframe has roughly 26 million entries. Each User has multiple entries. Lets check how many unique users are there.

len(df.userId.unique())

270896

So there are 270k Unique users are present in our db. Lets checkout how many unique movies are there.

len(df.movieId.unique())

45115

Ok, roughly around 45k movies are present for which we have the ratings present.

Lets also check the range of our rating column. For that we can use describe() method on the entire dataframe. Describe() method gives us the nice statistics.

df.describe()

As we see above, rating has minimum value of 0.5 and max value of 5.0.

Now we know our dataframe much better, lets do sort operations now.

How To Do Sort By Values

Lets say we want to sort the movies by ratings from high to low that is in descending order.

df.sort_values(by=['rating'],ascending=False).head(2)

As we see above, the first two rows are of rating 5 now. Remember above command has not saved the data in the descending order. If we want to save the data, use the option inplace=True

df.sort_values(by=['rating'],ascending=False,inplace=True)

df.head(2)

For ascending just use the option, ascending=True, as shown below...

df.sort_values(by=['rating'],ascending=True).head(2)

How To Sort Multiple Columns In Pandas

We can also sort by multiple columns in Pandas, lets say we want to sort highest to lowest rating and then by highest to lowest movieId.

df.sort_values(by=['rating','movieId'],ascending=False).head(2)

As we see above, rating and movieId are both showing as highest to lowest. Lets add another column userId, it doesnt make sense to sort movieId and userId in this example but for the sake of exercise we will anyway do it.

df.sort_values(by=['rating','movieId','userId'],ascending=False).head(2)

We saw that nothing much changed vs our previous command. Lets do a little change now, lets sort userId before movieId and see what happens

df.sort_values(by=['rating','userId','movieId'],ascending=False).head(2)

We see a different result, what just happened. We sorted first rating then userId. Since 270896 has give more than one movie ratings, the code above sorts the movieId of user 270896 further from highest to lowest.

Lets do a meaniful example, Lets find the average rating by a user and sort it by highest to lowest. For this query, we will have to groupby first user and apply average operation on the column rating and then will have to sort it by rating. Lets try this.

import numpy as np

df1 = df.groupby('userId')['rating'].apply(np.average)

Lets check our dataframe df1. Remember we have not sorted the dataframe yet.

df1.head()

userId
1    4.277778
2    3.318182
3    3.100000
4    3.500000
5    4.269231
Name: rating, dtype: float64

Lets check the datatype of df1

df1.__class__

pandas.core.series.Series

It has changed to Pandas Series. Lets change it back to dataframe by using reset_index()

df2 = df1.reset_index()

df2.__class__

pandas.core.frame.DataFrame

df2.head(2)

Now we have our dataframe back, lets sort by the rating from highest to lowest

df2.sort_values(by=['rating'],ascending=False,inplace=True)

df2.head(5)

The first column in the above dataframe is the row ids. Our data frame is sorted by rating and now we know which user has on average given hightest rating.

How To Sort Multi-level Pandas Data Frame

Let us first create two levels data frame. We will use df.set_index() to do that.

dfl = df.set_index(['movieId','rating'])
dfl.sort_index(inplace=TruE)

dfl.head(2)

As we see above, Dataframe dfl contains levels "movieId" and "rating".

Let us sort the levels by Ascending movieId and Descending rating. This will sort the movieId starting from lowest to highest and for each movieId, the rating will be sorted from highest to lowest.

dfl.sort_index(level=[0,1],ascending=[True,False]).head(2)

	userId	movieId	rating	timestamp
0	1	110	1.0	1425941529
1	1	147	4.5	1425942435

	userId	movieId	rating	timestamp
count	2.602429e+07	2.602429e+07	2.602429e+07	2.602429e+07
mean	1.350371e+05	1.584911e+04	3.528090e+00	1.171258e+09
std	7.817620e+04	3.108526e+04	1.065443e+00	2.052889e+08
min	1.000000e+00	1.000000e+00	5.000000e-01	7.896520e+08
25%	6.716400e+04	1.073000e+03	3.000000e+00	9.907545e+08
50%	1.351630e+05	2.583000e+03	3.500000e+00	1.151716e+09
75%	2.026930e+05	6.503000e+03	4.000000e+00	1.357578e+09
max	2.708960e+05	1.762750e+05	5.000000e+00	1.501830e+09

	userId	movieId	rating	timestamp
18471899	191698	1617	5.0	879251202
8646727	89135	491	5.0	860830376

	userId	movieId	rating	timestamp
18471899	191698	1617	5.0	879251202
8646727	89135	491	5.0	860830376

	userId	movieId	rating	timestamp
16757505	174227	3968	0.5	1112330388
12103042	125498	86911	0.5	1307014797

	userId	movieId	rating	timestamp
83813	837	176271	5.0	1501765780
24906715	258955	176219	5.0	1501539818

	userId	movieId	rating	timestamp
26024285	270896	60069	5.0	1257032032
26024284	270896	58559	5.0	1257031564

	userId	rating
36217	36218	5.0
153214	153215	5.0
81043	81044	5.0
264393	264394	5.0
245026	245027	5.0

How to Sort Pandas DataFrame with Examples

How To Do Sort By Values

How To Sort Multiple Columns In Pandas

How To Sort Multi-level Pandas Data Frame

Related Notebooks