Pandas How To Sort Columns And Rows

In this post, I will go over sort operation in Pandas. Lot of times for doing data analysis, we have to sort columns and rows frequently. In Pandas it is very easy to sort columns and rows. For this exercise, I will be using covid19 database from following link...

pomber.github.io/covid19/timeseries.json

Before we delve in to our example, Let us first import the necessary package pandas.

import pandas as pd

To read json data from web, we will use requests package.

data = requests.get('https://pomber.github.io/covid19/timeseries.json')

First we need to convert this data from json to Pandas dataframe.

jsondata = data.json()

columns=['country','date','confirmed','deaths','recovered']
data = []
for country in jsondata:
    for x in jsondata[country]:
        data.append([country, x['date'],x['confirmed'],x['deaths'],x['recovered']])
df = pd.DataFrame(data,columns=columns)

Let us check the first two rows in our dataframe.

df.head(2)

Let us check the number of rows we have in our dataframe by using len(df)

len(df)

13213

For every country, we have the data of corona virus cases by date.

Pandas Sort Dataframe By Column Using df.sort_values

We can sort values in dataframe by using function df.sort_values. Important parameters to pass to sort_values are "by", "ascending", "inplace" and "axis"

"by" - takes the column name - by which column, we want to sort the dataframe
"ascending" - Takes True or False value
"inplace" - inplace=True will overwrite the dataframe. By default inplace is False.
"axis" can take either 1 or 0. One means sort row. 0 Means sort Column.

Another parameter which sort_values takes is "kind". With "kind", we can specify the sorting algorithm. Available options are {'quicksort', 'mergesort', 'heapsort'}, default is 'quicksort'.

If we run df.sort_values(), it will return following error...

TypeError: sort_values() missing 1 required positional argument: 'by'

Therefore we need to give it a column name. Let us sort the column by date in descending order that is from latest to oldest.

df.sort_values(by='date',ascending=False).head(1)

Since i am running it on April 3rd. The latest covid 19 cases available is for April 3rd.

We can also sort the dataframe by multiple columns. For example, we can sort by 'date' first and then by number of confirmed cases. Therefore we should get is, on April 3rd which country has the highest number of covid 19 cases.

df_clatest = df.sort_values(by=['date','confirmed'],ascending=False)

Let us check the first two rows of dataframe.

df_clatest.head(2)

What we got is that on April 3rd, US has the most number of cases follow by Italy.

Pandas Sort Dataframe By Row Using df.sort_values

We can also sort the order of columns by rows of dataframe. You generally don't need to do that but I am just covering it for completeness.

df.sort_values(ascending=False,axis=1).head(1)

I got following error.

TypeError: sort_values() missing 1 required positional argument: 'by'

We need to mention 'by' parameter for row too. We can't access the rows by name in our dataframe. To fix that we need to make one of our columns as index. Let us make 'country' as index.

dfc = df.set_index('country')

dfc.head(1)

Well you can now in theory can sort the row by 'country' name. Let us see what happens if sort by country name 'Afghanistan'.

dfc.sort_values('Afghanistan',axis=1)

I got following error.

ValueError: The index label 'Afghanistan' is not unique.

Ok we need uniqe label to sort it. If we combine 'country' and 'date', that will be unique label. Let us do that.

dfd = df.reset_index().set_index(['country','date'])

dfd.head(1)

Ok Now have the unique index label defined. If we sort our dataframe by now combining both 'country' and 'date'. We can sort the columns by row values. Let us pick up country US which we noticed has highest number of covid 19 cases. Let us try to sort the columns by row values for combination 'US' and '2020-4-3' as shown below.

dfd.sort_values(('US','2020-4-3'),axis=1,ascending=True).loc['US','2020-4-3']

deaths         7087
recovered      9707
index         11460
confirmed    275586
Name: (US, 2020-4-3), dtype: int64

OK, it returns the values in ascending order starting from column deaths to all the way column confirmed. Let see what we get if we reverse the order.

dfd.sort_values(('US','2020-4-3'),axis=1,ascending=False).loc['US','2020-4-3']

confirmed    275586
index         11460
recovered      9707
deaths         7087
Name: (US, 2020-4-3), dtype: int64

There you go, we got the order in reverse order.

One thing to note about the syntax when we order by multiple labels. sort_values take tuple as input not list when you do sort by multiple labels.

Wrap Up!

I hope above examples would give you clear understanding about how to use sort_values in Pandas.

	country	date	confirmed	deaths	recovered
11460	US	2020-4-3	275586	7087	9707
5693	Italy	2020-4-3	119827	14681	19758

Pandas How To Sort Columns And Rows

Pandas Sort Dataframe By Column Using df.sort_values

Pandas Sort Dataframe By Row Using df.sort_values

Wrap Up!

Related Notebooks