How to Sort Pandas DataFrame with Examples
For this exercise I will be using Movie database which I have downloaded from Kaggle.
import pandas as pd
For this exercise we will be using ratings.csv file which comes with movie database.
df = pd.read_csv('ratings.csv')
Lets check the size of our dataframe
userId 26024289 movieId 26024289 rating 26024289 timestamp 26024289 dtype: int64
The dataframe has roughly 26 million entries. Each User has multiple entries. Lets check how many unique users are there.
So there are 270k Unique users are present in our db. Lets checkout how many unique movies are there.
Ok, roughly around 45k movies are present for which we have the ratings present.
Lets also check the range of our rating column. For that we can use describe() method on the entire dataframe. Describe() method gives us the nice statistics.
As we see above, rating has minimum value of 0.5 and max value of 5.0.
Now we know our dataframe much better, lets do sort operations now.
How To Do Sort By Values
Lets say we want to sort the movies by ratings from high to low that is in descending order.
As we see above, the first two rows are of rating 5 now. Remember above command has not saved the data in the descending order. If we want to save the data, use the option inplace=True
For ascending just use the option, ascending=True, as shown below...
We can also sort by multiple columns in Pandas, lets say we want to sort highest to lowest rating and then by highest to lowest movieId.
As we see above, rating and movieId are both showing as highest to lowest. Lets add another column userId, it doesnt make sense to sort movieId and userId in this example but for the sake of exercise we will anyway do it.
We saw that nothing much changed vs our previous command. Lets do a little change now, lets sort userId before movieId and see what happens
We see a different result, what just happened. We sorted first rating then userId. Since 270896 has give more than one movie ratings, the code above sorts the movieId of user 270896 further from highest to lowest.
Lets do a meaniful example, Lets find the average rating by a user and sort it by highest to lowest. For this query, we will have to groupby first user and apply average operation on the column rating and then will have to sort it by rating. Lets try this.
import numpy as np
df1 = df.groupby('userId')['rating'].apply(np.average)
Lets check our dataframe df1. Remember we have not sorted the dataframe yet.
userId 1 4.277778 2 3.318182 3 3.100000 4 3.500000 5 4.269231 Name: rating, dtype: float64
Lets check the datatype of df1
It has changed to Pandas Series. Lets change it back to dataframe by using reset_index()
df2 = df1.reset_index()
Now we have our dataframe back, lets sort by the rating from highest to lowest
The first column in the above dataframe is the row ids. Our data frame is sorted by rating and now we know which user has on average given hightest rating.
- How To Write DataFrame To CSV In R
- Pandas How To Sort Columns And Rows
- How To Drop One Or More Columns In Pandas Dataframe
- How To Use Pandas Correlation Matrix
- Merge and Join DataFrames with Pandas in Python
- Python Pandas String To Integer And Integer To String DataFrame
- How To Analyze Yahoo Finance Data With R
- How to Create DataFrame in R Using Examples