How to Analyze the CSV data in Pandas

For this exercise, I am using College.csv data. The brief explantion of data is given below.

import pandas as pd

df = pd.read_csv('College.csv')

df.head()

Description Of Data Private : Public/private indicator

Apps : Number of

applications received

Accept : Number of applicants accepted

Enroll : Number of new students enrolled

Top10perc : New students from top 10% of high school class

Top25perc : New students from top 25% of high school class

F.Undergrad : Number of full-time undergraduates

P.Undergrad : Number of part-time undergraduates

Outstate : Out-of-state tuition

Room.Board : Room and board costs

Books : Estimated book costs

Personal : Estimated personal spending

PhD : Percent of faculty with Ph.D.’s

Terminal : Percent of faculty with terminal degree

S.F.Ratio : Student/faculty ratio

perc.alumni : Percent of alumni who donate

Expend : Instructional expenditure per student

Grad.Rate : Graduation rate

Lets look at the summary of data by using describe() method of pandas

df.describe()

Lets fixed the University name column which is showing up as Unnamed.

df.rename(columns = {'Unnamed: 0':'University'},inplace=True)

Lets check if the colunm has been fixed

df.head(1)

We can plot few columns to understand more about the data

Lets look at the plot between column Phd and column Grad.Rate

Lets fix the column names which have dot in it and replace them with underscore _

df.rename(columns=lambda x: x.replace(".","_"),inplace=True)

Lets checkout the column names now

df.columns

Index(['Unnamed: 0', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
       'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
       'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni',
       'Expend', 'Grad_Rate'],
      dtype='object')

Ok we see dot now replaced with underscore now. We can do the plotting now. We will use library seaborn to plot.

import seaborn as sns

sns.scatterplot('PhD','Grad_Rate',data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x7f067ce6cb90>

Above is a simple plot which shows Grad_Rate on Y axis and PhD on x axis. In the command sns.scatterplot('PhD','Grad_Rate',data=df) , we supplied the column names and supplied dataframe df to the data option

Lets do another query to see how many of these colleges are private. This is equilavent to SQL select statement which is 'select count(colleges) from df where private="yes"'. Let us see how can we do this in pandas very easily

len(df[df.Private=="Yes"])

565

Lets do another query. How many universities have more than 50% of students which were among the top 10% in the high school.

To run this query, we will have to look at variable Top10perc. Let us create a new column and call it Elite.

df['elite'] = df.Top10perc > 50

Lets print the first 5 rows to see what we got. We should see elite column with True and False values.

df.head(5)

Yes thats what we got.

Lets check out how many elite universities we got. We can again use the describe() function. But since elite is not a numerical method, therefore we can't use directly the describe() method. elite is a category variable. Therefore we will have to use groupby() method first and then apply count() method. lets see how it works.

df.groupby('elite')['University'].count()

elite
False    699
True      78
Name: University, dtype: int64

How to Use Searborn Plots to Analyze the CSV data

Lets see now how can we use plot to analyze the data. As we saw above seaborn is a great utility to plot data.

Lets do historgram plot for the query df.groupby('elite')['University'].count()

import matplotlib.pyplot as plt
sns.countplot(df['elite'],hue=df['elite'])
plt.show()

As we see above, historgram is showing us True and False count for the column elite

Lets do a scattorplot matrix using seaborn

sns.pairplot(df)

I got following error

TypeError: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.

The above error is because we have wrong data type that is the new category variable "elite" we created. Lets exclude that variable and plot it again.

But how would we just exclude one column in Pandas. Lets try following...

df.loc[:, df.columns != 'elite'].head(1)

Ok Lets check we can pass this dataframe to seaborn.

sns.pairplot(df.loc[:, df.columns != 'elite'])

The above command worked, not showing the plot because of the size of the plot, lets just select 2 columns and then plot it.

sns.pairplot(df.loc[:,['Apps','Accept']])

<seaborn.axisgrid.PairGrid at 0x7f065f53b390>

	Unnamed: 0	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
0	Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60
1	Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56
2	Adrian College	Yes	1428	1097	336	22	50	1036	99	11250	3750	400	1165	53	66	12.9	30	8735	54
3	Agnes Scott College	Yes	417	349	137	60	89	510	63	12960	5450	450	875	92	97	7.7	37	19016	59
4	Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560	4120	800	1500	76	72	11.9	2	10922	15

	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate
count	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.000000	777.00000
mean	3001.638353	2018.804376	779.972973	27.558559	55.796654	3699.907336	855.298584	10440.669241	4357.526384	549.380952	1340.642214	72.660232	79.702703	14.089704	22.743887	9660.171171	65.46332
std	3870.201484	2451.113971	929.176190	17.640364	19.804778	4850.420531	1522.431887	4023.016484	1096.696416	165.105360	677.071454	16.328155	14.722359	3.958349	12.391801	5221.768440	17.17771
min	81.000000	72.000000	35.000000	1.000000	9.000000	139.000000	1.000000	2340.000000	1780.000000	96.000000	250.000000	8.000000	24.000000	2.500000	0.000000	3186.000000	10.00000
25%	776.000000	604.000000	242.000000	15.000000	41.000000	992.000000	95.000000	7320.000000	3597.000000	470.000000	850.000000	62.000000	71.000000	11.500000	13.000000	6751.000000	53.00000
50%	1558.000000	1110.000000	434.000000	23.000000	54.000000	1707.000000	353.000000	9990.000000	4200.000000	500.000000	1200.000000	75.000000	82.000000	13.600000	21.000000	8377.000000	65.00000
75%	3624.000000	2424.000000	902.000000	35.000000	69.000000	4005.000000	967.000000	12925.000000	5050.000000	600.000000	1700.000000	85.000000	92.000000	16.500000	31.000000	10830.000000	78.00000
max	48094.000000	26330.000000	6392.000000	96.000000	100.000000	31643.000000	21836.000000	21700.000000	8124.000000	2340.000000	6800.000000	103.000000	100.000000	39.800000	64.000000	56233.000000	118.00000

	University	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F_Undergrad	P_Undergrad	Outstate	Room_Board	Books	Personal	PhD	Terminal	S_F_Ratio	perc_alumni	Expend	Grad_Rate	elite
0	Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60	False
1	Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56	False
2	Adrian College	Yes	1428	1097	336	22	50	1036	99	11250	3750	400	1165	53	66	12.9	30	8735	54	False
3	Agnes Scott College	Yes	417	349	137	60	89	510	63	12960	5450	450	875	92	97	7.7	37	19016	59	True
4	Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560	4120	800	1500	76	72	11.9	2	10922	15	False

How to Analyze the CSV data in Pandas

Description Of Data Private : Public/private indicator

How to Use Searborn Plots to Analyze the CSV data

Related Notebooks