Summarising, Aggregating, and Grouping data in Python Pandas
In this post, I will talk about summarizing techniques that can be used to compile and understand the data. I will use Python library Pandas to summarize, group and aggregate the data in different ways.
I will be using college.csv data which has details about university admissions.
Lets start with importing pandas library and read_csv to read the csv file
import pandas as pd
df = pd.read_csv('College.csv')
Lets understand the data first. We can pass in an argument "a number" to head function. I will print out the first two rows of the data using pandas head function.
df.head(2)
Looking at the data, we have both textual, categorical and numerical data. Lets first fix the university column name which is "Unnamed:0"
df.rename(columns={'Unnamed: 0':'univ_name'},inplace=True)
df.head(1)
OK, now we have fixed the column name, lets start with tutorial now. I will talk about how to summarize first.
How to summarize the data using Python Pandas
Easiest way is to use describe function of Python Pandas. Lets first use it and see and what happens.
df.describe()
As we see above, All the numberical columns have been statistically summarized by the describe() function. For every column, Pandas has given us a nice summary count, mean, standard deviation (std), min, max, 25 percentile, 50 percentile and 75 percentile. This is good. We can also call a plot method on the describe() method to see the plots of different columns. Lets try that and see what happens.
For that we need to first execute %matplotlib inline so that we can see the plot.
%matplotlib inline
df.describe().plot()
The labels are not aligned properly. Lets fix that quickly using the legend. I wouldn't go in detail about matplotlib and its usage that in itself required a multipart series.
df.describe().plot().legend(loc='center left', bbox_to_anchor=(1, 0.5))
Lets continue with our summarize discussion.
We can apply max, min, sum, average, count functions directly on the dataframe for each column. Lets try these methods on Apps column.
df['Apps'].sum()
df['Apps'].count()
df['Apps'].max()
df['Apps'].min()
df['Apps'].mean()
We can also apply all these methods in one command using Pandas apply method. Lets try to calculate all the above metrics using the apply method in one command.
df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':count,'mean':mean})
Ok, we got the error that count is not defined. count is not vectorized method, therefore we can't use with apply method. However we can use len method of Python.
df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':len,'mean':mean})
Ok, len has worked but not we got the error that mean is not defined. For that we will have to use method from the numpy library. Numpy is great library for matix calculations.
import numpy as np
df['Apps'].apply({'sum':sum,'min':min,'max':max,'count':len,'mean':np.mean})
How to aggregate data using Python Pandas aggregate() method
Please checkout below example to see the syntax of Pandas aggregate() method.
df['Apps'].aggregate({'sum':sum,'min':min,'max':max,'count':len,'mean':np.mean})
Lets try aggregate on all the columns
df.aggregate({sum,min,max,len,np.mean})
Note one difference is that we can't rename the metrics. Although we can rename the names separately. One other thing to notice here is that Aggregate method skipped automatically the textual columns univ_name and Private and only calculated metrics for numerical columns. Although you would see metrics on all the columns if you run following command.
df.aggregate(['sum','min'])
The output shown above is not meaningful since 'max' of column univ_name and 'Private' dont make any sense. If we use above method, we will have to explictly mentioned for which columns we want to calculate metrics.
df.aggregate({'Apps':['sum','min'],'Accept':'min'})
As we shown above, this way we get more control, we have applied sum and min on Apps method but only applied min on Accept column. If we want to apply same functions to selected columns then do following...
df[['Apps','Accept']].aggregate(['sum','min'])
Aggregate is very powerful command. We can do much more than what I described above. Lets look at one more scenario. Lets say we want to calculate for the univerisities which are private and non private what is maximum value for each column.
To do that, lets just take out the column 'univ_name', because max of univ_name doesnt make any sense. To group by 'Private' column, we would use Pandas groupby method. groupby will group our entire data set by the unique private entries. In our data set we have only two unique values of 'Private' field 'Yes' and 'No'.
df.loc[:, df.columns != 'univ_name'].groupby('Private').aggregate(max)
As we see above, we got maximum value for each column. We can also apply multiple methods to see other metrics too.
df.loc[:, df.columns != 'univ_name'].groupby('Private').aggregate(['max','mean','min'])
In the above output, we wee max,mean and min for each column for both Private vs non Private Universities.
Wrap Up!
In the above examples, I have just scractched the surface. There is a lot more we can do by combining aggregate and groupby methods of Pandas.
Related Notebooks
- Pandas Datareader To Download Stocks Data From Google And Yahoo Finance
- How to Analyze the CSV data in Pandas
- Merge and Join DataFrames with Pandas in Python
- Tidy Data In R
- How To Read JSON Data Using Python Pandas
- Strftime and Strptime In Python
- How To Analyze Wikipedia Data Tables Using Python Pandas
- String And Literal In Python 3
- Data Cleaning With Python Pdpipe