Pandas How To Sort Columns And Rows
In this post, I will go over sort operation in Pandas. Lot of times for doing data analysis, we have to sort columns and rows frequently. In Pandas it is very easy to sort columns and rows. For this exercise, I will be using covid19 database from following link...
pomber.github.io/covid19/timeseries.json
Before we delve in to our example, Let us first import the necessary package pandas.
import pandas as pd
To read json data from web, we will use requests package.
data = requests.get('https://pomber.github.io/covid19/timeseries.json')
First we need to convert this data from json to Pandas dataframe.
jsondata = data.json()
columns=['country','date','confirmed','deaths','recovered']
data = []
for country in jsondata:
for x in jsondata[country]:
data.append([country, x['date'],x['confirmed'],x['deaths'],x['recovered']])
df = pd.DataFrame(data,columns=columns)
Let us check the first two rows in our dataframe.
df.head(2)
Let us check the number of rows we have in our dataframe by using len(df)
len(df)
For every country, we have the data of corona virus cases by date.
Pandas Sort Dataframe By Column Using df.sort_values
We can sort values in dataframe by using function df.sort_values. Important parameters to pass to sort_values are "by", "ascending", "inplace" and "axis"
- "by" - takes the column name - by which column, we want to sort the dataframe
- "ascending" - Takes True or False value
- "inplace" - inplace=True will overwrite the dataframe. By default inplace is False.
- "axis" can take either 1 or 0. One means sort row. 0 Means sort Column.
Another parameter which sort_values takes is "kind". With "kind", we can specify the sorting algorithm. Available options are {'quicksort', 'mergesort', 'heapsort'}, default is 'quicksort'.
If we run df.sort_values(), it will return following error...
TypeError: sort_values() missing 1 required positional argument: 'by'
Therefore we need to give it a column name. Let us sort the column by date in descending order that is from latest to oldest.
df.sort_values(by='date',ascending=False).head(1)
Since i am running it on April 3rd. The latest covid 19 cases available is for April 3rd.
We can also sort the dataframe by multiple columns. For example, we can sort by 'date' first and then by number of confirmed cases. Therefore we should get is, on April 3rd which country has the highest number of covid 19 cases.
df_clatest = df.sort_values(by=['date','confirmed'],ascending=False)
Let us check the first two rows of dataframe.
df_clatest.head(2)
What we got is that on April 3rd, US has the most number of cases follow by Italy.
Pandas Sort Dataframe By Row Using df.sort_values
We can also sort the order of columns by rows of dataframe. You generally don't need to do that but I am just covering it for completeness.
df.sort_values(ascending=False,axis=1).head(1)
I got following error.
TypeError: sort_values() missing 1 required positional argument: 'by'
We need to mention 'by' parameter for row too. We can't access the rows by name in our dataframe. To fix that we need to make one of our columns as index. Let us make 'country' as index.
dfc = df.set_index('country')
dfc.head(1)
Well you can now in theory can sort the row by 'country' name. Let us see what happens if sort by country name 'Afghanistan'.
dfc.sort_values('Afghanistan',axis=1)
I got following error.
ValueError: The index label 'Afghanistan' is not unique.
Ok we need uniqe label to sort it. If we combine 'country' and 'date', that will be unique label. Let us do that.
dfd = df.reset_index().set_index(['country','date'])
dfd.head(1)
Ok Now have the unique index label defined. If we sort our dataframe by now combining both 'country' and 'date'. We can sort the columns by row values. Let us pick up country US which we noticed has highest number of covid 19 cases. Let us try to sort the columns by row values for combination 'US' and '2020-4-3' as shown below.
dfd.sort_values(('US','2020-4-3'),axis=1,ascending=True).loc['US','2020-4-3']
OK, it returns the values in ascending order starting from column deaths to all the way column confirmed. Let see what we get if we reverse the order.
dfd.sort_values(('US','2020-4-3'),axis=1,ascending=False).loc['US','2020-4-3']
There you go, we got the order in reverse order.
One thing to note about the syntax when we order by multiple labels. sort_values take tuple as input not list when you do sort by multiple labels.
Wrap Up!
I hope above examples would give you clear understanding about how to use sort_values in Pandas.Related Notebooks
- How to Sort Pandas DataFrame with Examples
- Select Pandas Dataframe Rows And Columns Using iloc loc and ix
- Python Sort And Sorted
- How To Drop One Or More Columns In Pandas Dataframe
- 3 Ways to Rename Columns in Pandas DataFrame
- How to do SQL Select and Where Using Python Pandas
- How To Append Rows With Concat to a Pandas DataFrame
- How To Iterate Over Rows In A Dataframe In Pandas
- Pandas group by multiple custom aggregate function on multiple columns