Merge and Join DataFrames with Pandas in Python
For this exercise, I will use the movies database that I downloaded from the Kaggle. kaggle.com/rounakbanik/the-movies-dataset
Lets see what the different csv files we have in the zip file that I downloaded from the Kaggle.
ls *.csv
Please checkout the following link to learn more about the data in different files
kaggle.com/rounakbanik/the-movies-dataset
Lets start with our analysis. For this exercise, I will use two files movies_metadata.csv and ratings.csv files
Lets import the Python Pandas library first
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Lets read first the movies_metadata.csv file.
dfm = pd.read_csv('movies_metadata.csv',warn_bad_lines=False)
You might run in to errors if your encoding is different. Check out how to read csv file in python pandas
dfm.head(2)
We have got multiple columns, one of them is id column. Each movies has a unique id. To check randomly, lets just pull out a row by movie id and see if we get only row or not.
dfm[dfm.id==862]
Ok we got 0 rows, the reason for that 'id' column is not numeric, lets try with string format.
dfm[dfm.id=="862"]
There you go we got our row back but we need to fix the id type, id should be numeric
dfm['id'] = pd.to_numeric(dfm['id'],errors='coerce',downcast='integer')
To know more about pd.to_numeric and other options, Refer to this article https://www.nbshare.io/notebook/644259634/How-to-do-SQL-Select-and-Where-Using-Python-Pandas/
Lets check the data type now of id column to see if it is numeric.
dfm.id.dtype
Surprisingly, it has changed to float64 instead of int64. Not sure why, I will udpate post later if I could figure out the reason for that but for our purpose the data is numeric now.
Lets look at our second file now which is ratings.csv
dfr = pd.read_csv('ratings.csv')
dfr.head(1)
dfr.movieId.dtype
Lets see if we can pull up a record for movieId 862 from ratings dataframe.
dfr[dfr.movieId==862].head(2)
Ok we see ratings by different users for each movieId 862.
How to do left merge in pandas
Lets try with left merge(same as left join in mysql) in Python Pandas by using the column id from dfm and movieId from dfr dataframe.
df_mleft = dfm.merge(dfr,left_on='id',right_on='movieId',how='left')
df_mleft.head(2)
Left join looks for dfm column 'id' and for each 'id' looks for corresponding 'movieId' in dfr.
How to do right, inner and outer joins in Python Pandas
Similarly we can do the right merge (sql right join) which means it looks for all the movieIds in dataframe dfr and for each movieId look for a corresponding id in dfm dataframe and join the record.
df_mright = dfm.merge(dfr,left_on='id',right_on='movieId',how='right')
For inner join (merge), do following...
df_minner = dfm.merge(dfr,left_on='id',right_on='movieId',how='inner')
For outer join, do following...
df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer')
For outer merge, it is hard to find out after the merge, the source of each row, whether it is left dataframe or right frame. But there is an option to track that. Use the option indicator=True, with this option, we would see an extra column _merge which would tell us the source of each record or row. Lets try this on our outer merge command again and print few rows to see what do we get.
df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer',indicator=True)
Since the datasize is huge, I got the following error
MemoryError: Unable to allocate array with shape (9, 26063255) and data type object
Thats tell me that I needed a bigger machine. But never the less,If you run the above command with option "indicator=True", you would notice a separate column telling you the source of that row.
Wrap Up!
Merge command in Python Pandas is equivalent to what we can achieve in SQL using different joins. You can definately expand on this notebook by experimenting further. Please download and try it out yourself.
Related Notebooks
- Join or Merge Lists In Python
- PySpark Replace Values In DataFrames
- Summarising Aggregating and Grouping data in Python Pandas
- Introduction To R DataFrames
- Learn And Code Confusion Matrix With Python
- With Open Statement in Python
- Polynomial Interpolation Using Python Pandas Numpy And Sklearn
- Strftime and Strptime In Python
- Covid 19 Curve Fit Using Python Pandas And Numpy