Most Frequently Asked Questions Python Pandas Part1

Date: 2020-01-01

For this exercise, I am using the College.csv data. You can download the data from here: github.com/jstjohn/IntroToStatisticalLearningR-/blob/master/data/College.csv. I will also create dummy dataframes to explain some of the concepts.

In [2]:
import pandas as pd

Read the csv file into a dataframe with pd.read_csv.

In [3]:
df = pd.read_csv('College.csv')
In [4]:
df.head(1)
Out[4]:
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60

How to rename column in Python Pandas

Let's check whether a column name is missing in our csv file. We can print the header line using a Unix command.

In [6]:
!head -1 College.csv
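The first column of the file has no header, which is why pandas labels it Unnamed: 0 in the output of df.head(1). A minimal sketch of renaming such a column with DataFrame.rename follows. The new name College and the small stand-in dataframe are assumptions made for illustration only.

```python
import pandas as pd

# Small stand-in for the College.csv dataframe (illustrative subset of columns).
sample = pd.DataFrame({
    "Unnamed: 0": ["Abilene Christian University"],
    "Private": ["Yes"],
    "Apps": [1660],
})

# rename returns a new dataframe and leaves `sample` untouched.
# "College" is a hypothetical name chosen for this sketch.
renamed = sample.rename(columns={"Unnamed: 0": "College"})

print(list(renamed.columns))
```

Columns that are not listed in the mapping keep their names, so only the unnamed column changes.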

























How to copy dataframe in Python Pandas


















Why would I need to make an explicit copy of a dataframe?



















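Slicing a dataframe in Python Pandas, as in df[0:2], does not necessarily make a separate copy of the dataframe; it can return a view that refers to the original dataframe. In that case, any change made through the slice also changes the original dataframe. This behaviour depends on the pandas version: with Copy-on-Write, which is the default from pandas 3.0, the slice behaves like a copy and the original is left untouched. Let's do an example.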













In [39]:



    


df = pd.DataFrame({'name':['John','Evan']})

















In [40]:



    


dfn = df[0:2]

















In [41]:



    


print(dfn)
























   name
0  John
1  Evan



















In [42]:



    


dfn.iloc[0,0] = 'Adam'

















In [44]:



    


df



















Out[44]:
   name
0  Adam
1  Evan
    

  




























As we see above, our original dataframe has changed. Therefore the correct way is to make a copy first.













In [45]:



    


df = pd.DataFrame({'name':['John','Evan']})
dfn = df[0:2].copy()

















In [46]:



    


dfn



















Out[46]:
   name
0  John
1  Evan
    

  






















In [47]:



    


dfn.iloc[0,0] = 'Adam'

















In [48]:



    


df



















Out[48]:
   name
0  John
1  Evan
    

  






















In [49]:



    


dfn



















Out[49]:
   name
0  Adam
1  Evan
    

  




























As we see above, our original dataframe df has not changed, because dfn was created with the copy() method.
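A common place where this matters is a function that modifies the dataframe it receives. The sketch below works on a copy inside the function so that the caller's dataframe stays intact. The function name with_upper_names is hypothetical and only serves as an illustration.

```python
import pandas as pd


def with_upper_names(frame):
    """Return a new dataframe with upper-cased names (hypothetical helper)."""
    result = frame.copy()  # deep copy by default, independent of `frame`
    result["name"] = result["name"].str.upper()
    return result


original = pd.DataFrame({"name": ["John", "Evan"]})
shouted = with_upper_names(original)

print(original["name"].tolist())
print(shouted["name"].tolist())
```

Because the function copies first, the result does not depend on whether a particular indexing operation returns a view or a copy.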



















How to create empty dataframe in Python Pandas












In [89]:



    


dfe = pd.DataFrame([])























How to add columns to an empty dataframe?













In [95]:



    


dfe = dfe.assign(col1=None,col2=None)

















In [96]:



    


dfe.head()



















Out[96]:
col1  col2
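An empty dataframe with named columns can also be created in a single step by passing the columns argument to pd.DataFrame. This is a minimal sketch; the variable name empty is an assumption used only here.

```python
import pandas as pd

# Create an empty dataframe that already has its column labels.
empty = pd.DataFrame(columns=["col1", "col2"])

print(empty.empty)          # True, since there are no rows
print(list(empty.columns))
```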
    

  
  
  




























How to append values to an empty dataframe?


















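Appending a row to a dataframe is easy. Older pandas versions provided the DataFrame.append method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0. Use pd.concat with a one-row dataframe instead.

In [105]:
dfe = pd.concat([dfe, pd.DataFrame([{'col1':1,'col2':2}])], ignore_index=True)
dfe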



















Out[105]:
   col1  col2
0     1     2
    

  




























Remember that although the above command works, it is not memory efficient: every append copies the data into a newly allocated dataframe. Don't append to a dataframe inside a loop. The best way is to build the data in a Python list and then use pd.DataFrame to create the dataframe at once, as shown below.













In [108]:



    


data = []
data.append([3,4])
data.append([5,6])

















In [109]:



    


data



















Out[109]:







[[3, 4], [5, 6]]

























Now create the dataframe using the above data.













In [110]:



    


dfe = pd.DataFrame(data,columns=['col1','col2'])

















In [111]:



    


dfe.head()



















Out[111]:
   col1  col2
0     3     4
1     5     6
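In practice the rows often come out of a loop. The sketch below collects one dictionary per row in a Python list and builds the dataframe once at the end. The loop body and the names rows and frame are made up for illustration.

```python
import pandas as pd

rows = []
for i in range(3):
    # Each dictionary becomes one row; the keys become column names.
    rows.append({"col1": i, "col2": i * i})

# Build the dataframe once, after the loop has finished.
frame = pd.DataFrame(rows)

print(frame)
```

Appending to a list is cheap, so the cost of allocating the dataframe is paid only once.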
    

  




























How to convert Pandas dataframe to Numpy array


















Let's use our previous dataframe dfe for this.













In [112]:



    


import numpy as np

















In [114]:



    


dfe.to_numpy()



















Out[114]:







array([[3, 4],
       [5, 6]])

























We can also pass the dataframe to np.array.













In [115]:



    


np.array(dfe)



















Out[115]:







array([[3, 4],
       [5, 6]])
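A NumPy array holds a single dtype, so when the columns have different types the conversion picks a common one. The sketch below shows this with a hypothetical dataframe mixed, and also uses the dtype argument of to_numpy to request floats explicitly.

```python
import pandas as pd

# One integer column and one float column (illustrative data).
mixed = pd.DataFrame({"ints": [3, 5], "floats": [4.5, 6.5]})

arr = mixed.to_numpy()
print(arr.dtype)  # float64, the common type of int64 and float64

# Ask for a specific dtype instead of relying on the inferred one.
as_float = pd.DataFrame({"col1": [3, 5], "col2": [4, 6]}).to_numpy(dtype="float64")
print(as_float)
```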

























How to Concat Pandas Dataframe


















pd.concat is used to concatenate dataframes along either rows or columns.













In [117]:



    


df1 = pd.DataFrame({'A':[1,2],'B':[3,4]})
df2 = pd.DataFrame({'C':[1,2],'D':[3,4]})























Let's concatenate df1 and df2 so that the rows are appended.













In [124]:



    


pd.concat([df1,df2],sort=False)



















Out[124]:
     A    B    C    D
0  1.0  3.0  NaN  NaN
1  2.0  4.0  NaN  NaN
0  NaN  NaN  1.0  3.0
1  NaN  NaN  2.0  4.0
    

  




























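We see that the result contains all four columns, because the column names in df1 and df2 don't match. Cells with no value in the source dataframe are filled with NaN, and the integers become floats because NaN is a float.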
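When the dataframes share only some columns, the join argument of pd.concat controls which columns are kept. The sketch below uses two hypothetical dataframes, left and right, that overlap on column B.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Default join="outer": keep every column, fill the gaps with NaN.
outer = pd.concat([left, right], sort=False)

# join="inner": keep only the columns present in all dataframes.
inner = pd.concat([left, right], join="inner")

print(list(outer.columns))
print(list(inner.columns))
```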



















How about concatenating the dataframes side by side, so that the columns are joined?













In [125]:



    


pd.concat([df1,df2],sort=False,axis=1)



















Out[125]:
   A  B  C  D
0  1  3  1  3
1  2  4  2  4
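With axis=1 the rows are matched by index label, not by position. The example above lines up cleanly because df1 and df2 have the same index. The sketch below uses two hypothetical dataframes, a and b, whose indexes only partly overlap.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"C": [10, 20]}, index=[1, 2])

# Rows are aligned on the index; labels missing on one side give NaN.
side = pd.concat([a, b], axis=1)

print(side)
```

Only the row with label 1 has values in both columns here.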
    

  




























How about concatenating dataframes with the same headers? Let's create a third dataframe with the same headers as df1.













In [126]:



    


df3 = pd.DataFrame({'A':[56,57],'B':[100,101]})























Let's concatenate df1 and df3 so that the rows are appended.













In [127]:



    


pd.concat([df1,df3])



















Out[127]:
    A    B
0   1    3
1   2    4
0  56  100
1  57  101
    

  




























As we see above, the row indexes of the original dataframes are preserved when concatenating. We can discard them and number the rows incrementally using the option ignore_index=True.













In [128]:



    


pd.concat([df1,df3],ignore_index=True)



















Out[128]:
    A    B
0   1    3
1   2    4
2  56  100
3  57  101
    

  




























With pd.concat, we can also create an outer level of hierarchy in the index by passing keys.













In [132]:



    


dfc = pd.concat([df1,df3],keys=['s1','s2'])

















In [133]:



    


dfc.head()



















Out[133]:
       A    B
s1 0   1    3
   1   2    4
s2 0  56  100
   1  57  101
    

  




























Now we can access the data using the new index keys s1 and s2.
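A short sketch of that access with .loc is shown below. It rebuilds df1, df3 and dfc as defined earlier so that it runs on its own; the names first and value are introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df3 = pd.DataFrame({"A": [56, 57], "B": [100, 101]})
dfc = pd.concat([df1, df3], keys=["s1", "s2"])

# Select every row that came from the first dataframe.
first = dfc.loc["s1"]

# Select a single cell: outer key "s2", inner row label 1, column "B".
value = dfc.loc[("s2", 1), "B"]

print(first)
print(value)
```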



















