How to Analyze the CSV data in Pandas

For this exercise, I am using College.csv data. The brief explantion of data is given below.

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('College.csv')
In [3]:
df.head()
Out[3]:
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

Description Of Data Private : Public/private indicator

Apps : Number of

applications received

Accept : Number of applicants accepted

Enroll : Number of new students enrolled

Top10perc : New students from top 10% of high school class

Top25perc : New students from top 25% of high school class

F.Undergrad : Number of full-time undergraduates

P.Undergrad : Number of part-time undergraduates

Outstate : Out-of-state tuition

Room.Board : Room and board costs

Books : Estimated book costs

Personal : Estimated personal spending

PhD : Percent of faculty with Ph.D.’s

Terminal : Percent of faculty with terminal degree

S.F.Ratio : Student/faculty ratio

perc.alumni : Percent of alumni who donate

Expend : Instructional expenditure per student

Grad.Rate : Graduation rate

Lets look at the summary of data by using describe() method of pandas

In [5]:
df.describe()
Out[5]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.00000
mean 3001.638353 2018.804376 779.972973 27.558559 55.796654 3699.907336 855.298584 10440.669241 4357.526384 549.380952 1340.642214 72.660232 79.702703 14.089704 22.743887 9660.171171 65.46332
std 3870.201484 2451.113971 929.176190 17.640364 19.804778 4850.420531 1522.431887 4023.016484 1096.696416 165.105360 677.071454 16.328155 14.722359 3.958349 12.391801 5221.768440 17.17771
min 81.000000 72.000000 35.000000 1.000000 9.000000 139.000000 1.000000 2340.000000 1780.000000 96.000000 250.000000 8.000000 24.000000 2.500000 0.000000 3186.000000 10.00000
25% 776.000000 604.000000 242.000000 15.000000 41.000000 992.000000 95.000000 7320.000000 3597.000000 470.000000 850.000000 62.000000 71.000000 11.500000 13.000000 6751.000000 53.00000
50% 1558.000000 1110.000000 434.000000 23.000000 54.000000 1707.000000 353.000000 9990.000000 4200.000000 500.000000 1200.000000 75.000000 82.000000 13.600000 21.000000 8377.000000 65.00000
75% 3624.000000 2424.000000 902.000000 35.000000 69.000000 4005.000000 967.000000 12925.000000 5050.000000 600.000000 1700.000000 85.000000 92.000000 16.500000 31.000000 10830.000000 78.00000
max 48094.000000 26330.000000 6392.000000 96.000000 100.000000 31643.000000 21836.000000 21700.000000 8124.000000 2340.000000 6800.000000 103.000000 100.000000 39.800000 64.000000 56233.000000 118.00000

Lets fixed the University name column which is showing up as Unnamed.

In [19]:
df.rename(columns = {'Unnamed: 0':'University'},inplace=True)

Lets check if the colunm has been fixed

In [20]:
df.head(1)
Out[20]:
University Private Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60

We can plot few columns to understand more about the data

Lets look at the plot between column Phd and column Grad.Rate

Lets fix the column names which have dot in it and replace them with underscore _

In [7]:
df.rename(columns=lambda x: x.replace(".","_"),inplace=True)

Lets checkout the column names now

In [8]:
df.columns
Out[8]:
Index(['Unnamed: 0', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
       'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
       'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni',
       'Expend', 'Grad_Rate'],
      dtype='object')

Ok we see dot now replaced with underscore now. We can do the plotting now. We will use library seaborn to plot.

In [9]:
import seaborn as sns
In [13]:
sns.scatterplot('PhD','Grad_Rate',data=df)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f067ce6cb90>

Above is a simple plot which shows Grad_Rate on Y axis and PhD on x axis. In the command sns.scatterplot('PhD','Grad_Rate',data=df) , we supplied the column names and supplied dataframe df to the data option

Lets do another query to see how many of these colleges are private. This is equilavent to SQL select statement which is 'select count(colleges) from df where private="yes"'. Let us see how can we do this in pandas very easily

In [16]:
len(df[df.Private=="Yes"])
Out[16]:
565

Lets do another query. How many universities have more than 50% of students which were among the top 10% in the high school.

To run this query, we will have to look at variable Top10perc. Let us create a new column and call it Elite.

In [24]:
df['elite'] = df.Top10perc > 50

Lets print the first 5 rows to see what we got. We should see elite column with True and False values.

In [26]:
df.head(5)
Out[26]:
University Private Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate elite
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60 False
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56 False
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54 False
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59 True
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15 False

Yes thats what we got.

Lets check out how many elite universities we got. We can again use the describe() function. But since elite is not a numerical method, therefore we can't use directly the describe() method. elite is a category variable. Therefore we will have to use groupby() method first and then apply count() method. lets see how it works.

In [35]:
df.groupby('elite')['University'].count()
Out[35]:
elite
False    699
True      78
Name: University, dtype: int64

How to Use Searborn Plots to Analyze the CSV data

Lets see now how can we use plot to analyze the data. As we saw above seaborn is a great utility to plot data.

Lets do historgram plot for the query df.groupby('elite')['University'].count()

In [49]:
import matplotlib.pyplot as plt
sns.countplot(df['elite'],hue=df['elite'])
plt.show()

As we see above, historgram is showing us True and False count for the column elite

Lets do a scattorplot matrix using seaborn

In [52]:
sns.pairplot(df)

I got following error

TypeError: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.

The above error is because we have wrong data type that is the new category variable "elite" we created. Lets exclude that variable and plot it again.

But how would we just exclude one column in Pandas. Lets try following...

In [54]:
df.loc[:, df.columns != 'elite'].head(1)
Out[54]:
University Private Apps Accept Enroll Top10perc Top25perc F_Undergrad P_Undergrad Outstate Room_Board Books Personal PhD Terminal S_F_Ratio perc_alumni Expend Grad_Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60

Ok Lets check we can pass this dataframe to seaborn.

In [56]:
sns.pairplot(df.loc[:, df.columns != 'elite'])

The above command worked, not showing the plot because of the size of the plot, lets just select 2 columns and then plot it.

In [73]:
sns.pairplot(df.loc[:,['Apps','Accept']])
Out[73]:
<seaborn.axisgrid.PairGrid at 0x7f065f53b390>