Pandas How To Sort Columns And Rows

In this post, I will go over sort operation in Pandas. Lot of times for doing data analysis, we have to sort columns and rows frequently. In Pandas it is very easy to sort columns and rows. For this exercise, I will be using covid19 database from following link...

pomber.github.io/covid19/timeseries.json

Before we delve in to our example, Let us first import the necessary package pandas.

In [1]:
import pandas as pd

To read json data from web, we will use requests package.

In [2]:
data = requests.get('https://pomber.github.io/covid19/timeseries.json')

First we need to convert this data from json to Pandas dataframe.

In [3]:
jsondata = data.json()
In [4]:
columns=['country','date','confirmed','deaths','recovered']
data = []
for country in jsondata:
    for x in jsondata[country]:
        data.append([country, x['date'],x['confirmed'],x['deaths'],x['recovered']])
df = pd.DataFrame(data,columns=columns)

Let us check the first two rows in our dataframe.

In [5]:
df.head(2)
Out[5]:
country date confirmed deaths recovered
0 Afghanistan 2020-1-22 0 0 0
1 Afghanistan 2020-1-23 0 0 0

Let us check the number of rows we have in our dataframe by using len(df)

In [6]:
len(df)
Out[6]:
13213

For every country, we have the data of corona virus cases by date.

Pandas Sort Dataframe By Column Using df.sort_values

We can sort values in dataframe by using function df.sort_values. Important parameters to pass to sort_values are "by", "ascending", "inplace" and "axis"

  • "by" - takes the column name - by which column, we want to sort the dataframe
  • "ascending" - Takes True or False value
  • "inplace" - inplace=True will overwrite the dataframe. By default inplace is False.
  • "axis" can take either 1 or 0. One means sort row. 0 Means sort Column.

Another parameter which sort_values takes is "kind". With "kind", we can specify the sorting algorithm. Available options are {'quicksort', 'mergesort', 'heapsort'}, default is 'quicksort'.

If we run df.sort_values(), it will return following error...

TypeError: sort_values() missing 1 required positional argument: 'by'

Therefore we need to give it a column name. Let us sort the column by date in descending order that is from latest to oldest.

In [7]:
df.sort_values(by='date',ascending=False).head(1)
Out[7]:
country date confirmed deaths recovered
13212 Malawi 2020-4-3 3 0 0

Since i am running it on April 3rd. The latest covid 19 cases available is for April 3rd.

We can also sort the dataframe by multiple columns. For example, we can sort by 'date' first and then by number of confirmed cases. Therefore we should get is, on April 3rd which country has the highest number of covid 19 cases.

In [8]:
df_clatest = df.sort_values(by=['date','confirmed'],ascending=False)

Let us check the first two rows of dataframe.

In [9]:
df_clatest.head(2)
Out[9]:
country date confirmed deaths recovered
11460 US 2020-4-3 275586 7087 9707
5693 Italy 2020-4-3 119827 14681 19758

What we got is that on April 3rd, US has the most number of cases follow by Italy.

Pandas Sort Dataframe By Row Using df.sort_values

We can also sort the order of columns by rows of dataframe. You generally don't need to do that but I am just covering it for completeness.

In [13]:
df.sort_values(ascending=False,axis=1).head(1)

I got following error.

TypeError: sort_values() missing 1 required positional argument: 'by'

We need to mention 'by' parameter for row too. We can't access the rows by name in our dataframe. To fix that we need to make one of our columns as index. Let us make 'country' as index.

In [19]:
dfc = df.set_index('country')
In [20]:
dfc.head(1)
Out[20]:
date confirmed deaths recovered
country
Afghanistan 2020-1-22 0 0 0

Well you can now in theory can sort the row by 'country' name. Let us see what happens if sort by country name 'Afghanistan'.

In [63]:
dfc.sort_values('Afghanistan',axis=1)

I got following error.

ValueError: The index label 'Afghanistan' is not unique.

Ok we need uniqe label to sort it. If we combine 'country' and 'date', that will be unique label. Let us do that.

In [64]:
dfd = df.reset_index().set_index(['country','date'])
In [65]:
dfd.head(1)
Out[65]:
index confirmed deaths recovered
country date
Afghanistan 2020-1-22 0 0 0 0

Ok Now have the unique index label defined. If we sort our dataframe by now combining both 'country' and 'date'. We can sort the columns by row values. Let us pick up country US which we noticed has highest number of covid 19 cases. Let us try to sort the columns by row values for combination 'US' and '2020-4-3' as shown below.

In [68]:
dfd.sort_values(('US','2020-4-3'),axis=1,ascending=True).loc['US','2020-4-3']
Out[68]:
deaths         7087
recovered      9707
index         11460
confirmed    275586
Name: (US, 2020-4-3), dtype: int64

OK, it returns the values in ascending order starting from column deaths to all the way column confirmed. Let see what we get if we reverse the order.

In [69]:
dfd.sort_values(('US','2020-4-3'),axis=1,ascending=False).loc['US','2020-4-3']
Out[69]:
confirmed    275586
index         11460
recovered      9707
deaths         7087
Name: (US, 2020-4-3), dtype: int64

There you go, we got the order in reverse order.

One thing to note about the syntax when we order by multiple labels. sort_values take tuple as input not list when you do sort by multiple labels.

Wrap Up!

I hope above examples would give you clear understanding about how to use sort_values in Pandas.