Pivot Tables In Python Pandas

For this post, I will be using covid19 database from following link...

pomber.github.io/covid19/timeseries.json

Let us first import the necessary packages "requests and pandas".

In [1]:
import requests
import pandas as pd
import numpy as np
In [2]:
data = requests.get('https://pomber.github.io/covid19/timeseries.json')

We need to convert this data to the pandas dataframe so that we can build the pivot table.

In [3]:
jsondata = data.json()
In [4]:
columns=['country','date','confirmed','deaths','recovered']
data = []
for country in jsondata:
    for x in jsondata[country]:
        data.append([country, x['date'],x['confirmed'],x['deaths'],x['recovered']])
df = pd.DataFrame(data,columns=columns)
In [5]:
df.head(2)
Out[5]:
country date confirmed deaths recovered
0 Afghanistan 2020-1-22 0 0 0
1 Afghanistan 2020-1-23 0 0 0

Let us check the number of rows we have in our dataframe by using len(df)

In [6]:
len(df)
Out[6]:
20009

For every country, we have the data of corona virus cases by date.

How to create a pivot table in Python Pandas

Let us create a pivot table with respect to country. Remember we need to pass in the group key that is index for pivot table. Otherwise you would see following error...

ValueError: No group keys passed!

We can sum the numerical data of each country. To do that we can pass np.sum function to the pd.pivot_table().

In [7]:
pivoted = pd.pivot_table(df,index='country',aggfunc=np.sum)

Let us check the pivot table dataframe now.

In [8]:
pivoted.head(2)
Out[8]:
confirmed deaths recovered
country
Afghanistan 49043 1509 5683
Albania 23527 1036 11994

Let us now create the pivot table with multiple indexes country and date.

In [9]:
pivoted_m = pd.pivot_table(df,index=['country','date'],aggfunc=np.sum)
In [10]:
pivoted_m.head(2)
Out[10]:
confirmed deaths recovered
country date
Afghanistan 2020-1-22 0 0 0
2020-1-23 0 0 0

Note, here passing the aggfunc=np.sum would not have any affect, since we are creating separate index for date. Therefore we can take out the aggregate function np.sum and result would still be the same.

In [11]:
pivoted_m = pd.pivot_table(df,index=['country','date'])
In [12]:
pivoted_m.head(2)
Out[12]:
confirmed deaths recovered
country date
Afghanistan 2020-1-22 0 0 0
2020-1-23 0 0 0

Pandas Pivot table count

We can count the number of times , We have data for any country. We just need to use the aggregate function 'count' as shown below.

In [13]:
pd.pivot_table(df,index=['country'],aggfunc='count').head(2)
Out[13]:
confirmed date deaths recovered
country
Afghanistan 107 107 107 107
Albania 107 107 107 107

Let us delete an entry and then run count aggregate function again to see if we get the result different. Let us delete the entry for Afghanistan for data 2020-1-22 which is row 0 in our original data frame. Let us drop the row 0 in our dataframe using df.drop method.

In [14]:
df.drop([0],inplace=True)
In [15]:
df.head(1)
Out[15]:
country date confirmed deaths recovered
1 Afghanistan 2020-1-23 0 0 0

Ok now our dataframe starts from row 1. Let us run the count aggregate function again.

In [16]:
pd.pivot_table(df,index=['country'],aggfunc='count').head(2)
Out[16]:
confirmed date deaths recovered
country
Afghanistan 106 106 106 106
Albania 107 107 107 107

There you go,we got a different count number for the country Aghanistan covid 19 cases.

How to access rows by multiple Indexes in Python Pivot Table

We can select the rows by the index name. Let us say we want to see covid19 cases of USA on date march 30. We can use Pandas loc function to select the row as shown below.

In [17]:
pivoted_m.loc['US','2020-3-30']
Out[17]:
confirmed    161831
deaths         4192
recovered      5644
Name: (US, 2020-3-30), dtype: int64

Our pivoted table has two levels. We can access the index level values by dataframe.index.levels function.

Let us access the level 0 which is country index and see how many countries we have in our dataframe.

In [18]:
len(pivoted_m.index.levels[0])
Out[18]:
187

we have 187 countries in our dataframe.

Lets us check how many dates we have in our dataframe.

In [19]:
len(pivoted_m.index.levels[1])
Out[19]:
107

So we have more than three months of corona virus cases worth of data.

Plot Python pivot table using Matplotlib

Let us see now, how we can plot the pivot tables data using matplotlib.

In [20]:
from matplotlib import pyplot as plt
In [21]:
ax = pivoted_m.loc['US']['confirmed'].plot.bar()
for i, t in enumerate(ax.get_xticklabels()):
    if (i % 5) != 0:
        t.set_visible(False)

There were not enough cases in US before March 30, let us just filter the data to after march 30.

In [24]:
ax = pivoted_m.loc[('US', '2020-3-30'):('US', '2020-5-7')]['confirmed'].plot.bar()
for i, t in enumerate(ax.get_xticklabels()):
    if (i % 5) != 0:
        t.set_visible(False)

Wrap Up!

Pivot tables are very powerful data structure in Python Pandas. There is much more this that can be done with Python pivot tables.