Merge and Join DataFrames with Pandas in Python

For this exercise, I will use the movies database that I downloaded from the Kaggle. kaggle.com/rounakbanik/the-movies-dataset

Lets see what the different csv files we have in the zip file that I downloaded from the Kaggle.

ls *.csv

College.csv   links.csv            programcreek.csv
credits.csv   links_small.csv      ratings.csv
keywords.csv  movies_metadata.csv  ratings_small.csv

Please checkout the following link to learn more about the data in different files

kaggle.com/rounakbanik/the-movies-dataset

Lets start with our analysis. For this exercise, I will use two files movies_metadata.csv and ratings.csv files

Lets import the Python Pandas library first

import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Lets read first the movies_metadata.csv file.

dfm = pd.read_csv('movies_metadata.csv',warn_bad_lines=False)

You might run in to errors if your encoding is different. Check out how to read csv file in python pandas

dfm.head(2)

We have got multiple columns, one of them is id column. Each movies has a unique id. To check randomly, lets just pull out a row by movie id and see if we get only row or not.

dfm[dfm.id==862]

Ok we got 0 rows, the reason for that 'id' column is not numeric, lets try with string format.

dfm[dfm.id=="862"]

There you go we got our row back but we need to fix the id type, id should be numeric

dfm['id'] = pd.to_numeric(dfm['id'],errors='coerce',downcast='integer')

To know more about pd.to_numeric and other options, Refer to this article https://www.nbshare.io/notebook/644259634/How-to-do-SQL-Select-and-Where-Using-Python-Pandas/

Lets check the data type now of id column to see if it is numeric.

dfm.id.dtype

dtype('float64')

Surprisingly, it has changed to float64 instead of int64. Not sure why, I will udpate post later if I could figure out the reason for that but for our purpose the data is numeric now.

Lets look at our second file now which is ratings.csv

dfr = pd.read_csv('ratings.csv')

dfr.head(1)

dfr.movieId.dtype

dtype('int64')

Lets see if we can pull up a record for movieId 862 from ratings dataframe.

dfr[dfr.movieId==862].head(2)

Ok we see ratings by different users for each movieId 862.

How to do left merge in pandas

Lets try with left merge(same as left join in mysql) in Python Pandas by using the column id from dfm and movieId from dfr dataframe.

df_mleft = dfm.merge(dfr,left_on='id',right_on='movieId',how='left')

df_mleft.head(2)

Left join looks for dfm column 'id' and for each 'id' looks for corresponding 'movieId' in dfr.

How to do right, inner and outer joins in Python Pandas

Similarly we can do the right merge (sql right join) which means it looks for all the movieIds in dataframe dfr and for each movieId look for a corresponding id in dfm dataframe and join the record.

df_mright = dfm.merge(dfr,left_on='id',right_on='movieId',how='right')

For inner join (merge), do following...

df_minner = dfm.merge(dfr,left_on='id',right_on='movieId',how='inner')

For outer join, do following...

df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer')

For outer merge, it is hard to find out after the merge, the source of each row, whether it is left dataframe or right frame. But there is an option to track that. Use the option indicator=True, with this option, we would see an extra column _merge which would tell us the source of each record or row. Lets try this on our outer merge command again and print few rows to see what do we get.

df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer',indicator=True)

Since the datasize is huge, I got the following error

MemoryError: Unable to allocate array with shape (9, 26063255) and data type object

Thats tell me that I needed a bigger machine. But never the less,If you run the above command with option "indicator=True", you would notice a separate column telling you the source of that row.

Wrap Up!

Merge command in Python Pandas is equivalent to what we can achieve in SQL using different joins. You can definately expand on this notebook by experimenting further. Please download and try it out yourself.

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1	False	NaN	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	...	1995-12-15	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0

	userId	movieId	rating	timestamp
184624	1923	862	3.0	858335006
200490	2103	862	5.0	946044912

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	...	status	tagline	title	video	vote_average	vote_count	userId	movieId	rating	timestamp
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862.0	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	Released	NaN	Toy Story	False	7.7	5415.0	1923.0	862.0	3.0	858335006.0
1	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862.0	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	...	Released	NaN	Toy Story	False	7.7	5415.0	2103.0	862.0	5.0	946044912.0

Merge and Join DataFrames with Pandas in Python

How to do left merge in pandas

How to do right, inner and outer joins in Python Pandas

Wrap Up!

Related Notebooks