Summarising, Aggregating, and Grouping data in Python Pandas

In this post, I will talk about summarizing techniques that can be used to compile and understand the data. I will use Python library Pandas to summarize, group and aggregate the data in different ways.

I will be using college.csv data which has details about university admissions.

Lets start with importing pandas library and read_csv to read the csv file

import pandas as pd

df = pd.read_csv('College.csv')

Lets understand the data first. We can pass in an argument "a number" to head function. I will print out the first two rows of the data using pandas head function.

df.head(2)

Looking at the data, we have both textual, categorical and numerical data. Lets first fix the university column name which is "Unnamed:0"

df.rename(columns={'Unnamed: 0':'univ_name'},inplace=True)

df.head(1)

OK, now we have fixed the column name, lets start with tutorial now. I will talk about how to summarize first.

How to summarize the data using Python Pandas

Easiest way is to use describe function of Python Pandas. Lets first use it and see and what happens.

df.describe()

As we see above, All the numberical columns have been statistically summarized by the describe() function. For every column, Pandas has given us a nice summary count, mean, standard deviation (std), min, max, 25 percentile, 50 percentile and 75 percentile. This is good. We can also call a plot method on the describe() method to see the plots of different columns. Lets try that and see what happens.

For that we need to first execute %matplotlib inline so that we can see the plot.

%matplotlib inline
df.describe().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f74f9ab4cd0>

The labels are not aligned properly. Lets fix that quickly using the legend. I wouldn't go in detail about matplotlib and its usage that in itself required a multipart series.

df.describe().plot().legend(loc='center left', bbox_to_anchor=(1, 0.5))

<matplotlib.legend.Legend at 0x7f74f9a8c710>

Lets continue with our summarize discussion.

We can apply max, min, sum, average, count functions directly on the dataframe for each column. Lets try these methods on Apps column.

df['Apps'].sum()

2332273

df['Apps'].count()

777

df['Apps'].max()

48094

df['Apps'].min()

81

df['Apps'].mean()

3001.6383526383524

We can also apply all these methods in one command using Pandas apply method. Lets try to calculate all the above metrics using the apply method in one command.

df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':count,'mean':mean})

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-26-44af3e04c98d> in <module>
----> 1 df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':count,'mean':mean})

NameError: name 'count' is not defined

Ok, we got the error that count is not defined. count is not vectorized method, therefore we can't use with apply method. However we can use len method of Python.

df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':len,'mean':mean})

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-743e2f3746e4> in <module>
----> 1 df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':len,'mean':mean})

NameError: name 'mean' is not defined

Ok, len has worked but not we got the error that mean is not defined. For that we will have to use method from the numpy library. Numpy is great library for matix calculations.

import numpy as np

df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':len,'mean':np.mean})

sum      2.332273e+06
min      8.100000e+01
max      4.809400e+04
count    7.770000e+02
mean     3.001638e+03
Name: Apps, dtype: float64

How to aggregate data using Python Pandas aggregate() method

Please checkout below example to see the syntax of Pandas aggregate() method.

df['Apps'].aggregate({'sum':sum,'min':min,'max':max,'count':len,'mean':np.mean})

sum      2.332273e+06
min      8.100000e+01
max      4.809400e+04
count    7.770000e+02
mean     3.001638e+03
Name: Apps, dtype: float64

Lets try aggregate on all the columns

df.aggregate({sum,min,max,len,np.mean})

Note one difference is that we can't rename the metrics. Although we can rename the names separately. One other thing to notice here is that Aggregate method skipped automatically the textual columns univ_name and Private and only calculated metrics for numerical columns. Although you would see metrics on all the columns if you run following command.

df.aggregate(['sum','min'])

The output shown above is not meaningful since 'max' of column univ_name and 'Private' dont make any sense. If we use above method, we will have to explictly mentioned for which columns we want to calculate metrics.

df.aggregate({'Apps':['sum','min'],'Accept':'min'})

As we shown above, this way we get more control, we have applied sum and min on Apps method but only applied min on Accept column. If we want to apply same functions to selected columns then do following...

df[['Apps','Accept']].aggregate(['sum','min'])

Aggregate is very powerful command. We can do much more than what I described above. Lets look at one more scenario. Lets say we want to calculate for the univerisities which are private and non private what is maximum value for each column.

To do that, lets just take out the column 'univ_name', because max of univ_name doesnt make any sense. To group by 'Private' column, we would use Pandas groupby method. groupby will group our entire data set by the unique private entries. In our data set we have only two unique values of 'Private' field 'Yes' and 'No'.

df.loc[:, df.columns != 'univ_name'].groupby('Private').aggregate(max)

As we see above, we got maximum value for each column. We can also apply multiple methods to see other metrics too.

df.loc[:, df.columns != 'univ_name'].groupby('Private').aggregate(['max','mean','min'])

In the above output, we wee max,mean and min for each column for both Private vs non Private Universities.

Wrap Up!

In the above examples, I have just scractched the surface. There is a lot more we can do by combining aggregate and groupby methods of Pandas.

	Unnamed: 0	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
0	Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60
1	Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
count	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.00000
mean	3001.638353	2018.804376	779.972973	27.558559	55.796654	3699.907336	855.298584	10440.669241	4357.526384	549.380952	1340.642214	72.660232	79.702703	14.089704	22.743887	9660.171171	65.46332
std	3870.201484	2451.113971	929.176190	17.640364	19.804778	4850.420531	1522.431887	4023.016484	1096.696416	165.105360	677.071454	16.328155	14.722359	3.958349	12.391801	5221.768440	17.17771
min	81.000000	72.000000	35.000000	1.000000	9.000000	139.000000	1.000000	2340.000000	1780.000000	96.000000	250.000000	8.000000	24.000000	2.500000	0.000000	3186.000000	10.00000
25%	776.000000	604.000000	242.000000	15.000000	41.000000	992.000000	95.000000	7320.000000	3597.000000	470.000000	850.000000	62.000000	71.000000	11.500000	13.000000	6751.000000	53.00000
50%	1558.000000	1110.000000	434.000000	23.000000	54.000000	1707.000000	353.000000	9990.000000	4200.000000	500.000000	1200.000000	75.000000	82.000000	13.600000	21.000000	8377.000000	65.00000
75%	3624.000000	2424.000000	902.000000	35.000000	69.000000	4005.000000	967.000000	12925.000000	5050.000000	600.000000	1700.000000	85.000000	92.000000	16.500000	31.000000	10830.000000	78.00000
max	48094.000000	26330.000000	6392.000000	96.000000	100.000000	31643.000000	21836.000000	21700.000000	8124.000000	2340.000000	6800.000000	103.000000	100.000000	39.800000	64.000000	56233.000000	118.00000

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
mean	3.001638e+03	2.018804e+03	779.972973	27.558559	55.796654	3.699907e+03	855.298584	1.044067e+04	4.357526e+03	549.380952	1.340642e+03	72.660232	79.702703	14.089704	22.743887	9.660171e+03	65.46332
sum	2.332273e+06	1.568611e+06	606039.000000	21413.000000	43354.000000	2.874828e+06	664567.000000	8.112400e+06	3.385798e+06	426869.000000	1.041679e+06	56457.000000	61929.000000	10947.700000	17672.000000	7.505953e+06	50865.00000
len	7.770000e+02	7.770000e+02	777.000000	777.000000	777.000000	7.770000e+02	777.000000	7.770000e+02	7.770000e+02	777.000000	7.770000e+02	777.000000	777.000000	777.000000	777.000000	7.770000e+02	777.00000
min	8.100000e+01	7.200000e+01	35.000000	1.000000	9.000000	1.390000e+02	1.000000	2.340000e+03	1.780000e+03	96.000000	2.500000e+02	8.000000	24.000000	2.500000	0.000000	3.186000e+03	10.00000
max	4.809400e+04	2.633000e+04	6392.000000	96.000000	100.000000	3.164300e+04	21836.000000	2.170000e+04	8.124000e+03	2340.000000	6.800000e+03	103.000000	100.000000	39.800000	64.000000	5.623300e+04	118.00000

	univ_name	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
sum	Abilene Christian UniversityAdelphi University...	YesYesYesYesYesYesYesYesYesYesYesYesYesYesYesY...	2332273	1568611	606039	21413	43354	2874828	664567	8112400	3385798	426869	1041679	56457	61929	10947.7	17672	7505953	50865
min	Abilene Christian University	No	81	72	35	1	9	139	1	2340	1780	96	250	8	24	2.5	0	3186	10

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
Private
No	48094	26330	6392	95	100	31643	21836	15732	6540	1125	4288	103	100	28.8	48	16527	100
Yes	20192	13007	4615	96	100	27378	10221	21700	8124	2340	6800	100	100	39.8	64	56233	118

Summarising, Aggregating, and Grouping data in Python Pandas

How to summarize the data using Python Pandas

How to aggregate data using Python Pandas aggregate() method

Wrap Up!

Related Notebooks