Merge and Join DataFrames with Pandas in Python

For this exercise, I will use the movies database that I downloaded from the Kaggle. kaggle.com/rounakbanik/the-movies-dataset

Lets see what the different csv files we have in the zip file that I downloaded from the Kaggle.

In [2]:
ls *.csv
College.csv   links.csv            programcreek.csv
credits.csv   links_small.csv      ratings.csv
keywords.csv  movies_metadata.csv  ratings_small.csv

Please checkout the following link to learn more about the data in different files

kaggle.com/rounakbanik/the-movies-dataset

Lets start with our analysis. For this exercise, I will use two files movies_metadata.csv and ratings.csv files

Lets import the Python Pandas library first

In [3]:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Lets read first the movies_metadata.csv file.

In [5]:
dfm = pd.read_csv('movies_metadata.csv',warn_bad_lines=False)

You might run in to errors if your encoding is different. Check out how to read csv file in python pandas

In [6]:
dfm.head(2)
Out[6]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0

2 rows × 24 columns

We have got multiple columns, one of them is id column. Each movies has a unique id. To check randomly, lets just pull out a row by movie id and see if we get only row or not.

In [7]:
dfm[dfm.id==862]
Out[7]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count

0 rows × 24 columns

Ok we got 0 rows, the reason for that 'id' column is not numeric, lets try with string format.

In [8]:
dfm[dfm.id=="862"]
Out[8]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0

1 rows × 24 columns

There you go we got our row back but we need to fix the id type, id should be numeric

In [9]:
dfm['id'] = pd.to_numeric(dfm['id'],errors='coerce',downcast='integer')

To know more about pd.to_numeric and other options, Refer to this article https://www.nbshare.io/notebook/644259634/How-to-do-SQL-Select-and-Where-Using-Python-Pandas/

Lets check the data type now of id column to see if it is numeric.

In [10]:
dfm.id.dtype
Out[10]:
dtype('float64')

Surprisingly, it has changed to float64 instead of int64. Not sure why, I will udpate post later if I could figure out the reason for that but for our purpose the data is numeric now.

Lets look at our second file now which is ratings.csv

In [11]:
dfr = pd.read_csv('ratings.csv')
In [12]:
dfr.head(1)
Out[12]:
userId movieId rating timestamp
0 1 110 1.0 1425941529
In [13]:
dfr.movieId.dtype
Out[13]:
dtype('int64')

Lets see if we can pull up a record for movieId 862 from ratings dataframe.

In [15]:
dfr[dfr.movieId==862].head(2)
Out[15]:
userId movieId rating timestamp
184624 1923 862 3.0 858335006
200490 2103 862 5.0 946044912

Ok we see ratings by different users for each movieId 862.

How to do left merge in pandas

Lets try with left merge(same as left join in mysql) in Python Pandas by using the column id from dfm and movieId from dfr dataframe.

In [16]:
df_mleft = dfm.merge(dfr,left_on='id',right_on='movieId',how='left')
In [18]:
df_mleft.head(2)
Out[18]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... status tagline title video vote_average vote_count userId movieId rating timestamp
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862.0 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... Released NaN Toy Story False 7.7 5415.0 1923.0 862.0 3.0 858335006.0
1 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862.0 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... Released NaN Toy Story False 7.7 5415.0 2103.0 862.0 5.0 946044912.0

2 rows × 28 columns

Left join looks for dfm column 'id' and for each 'id' looks for corresponding 'movieId' in dfr.

How to do right, inner and outer joins in Python Pandas

Similarly we can do the right merge (sql right join) which means it looks for all the movieIds in dataframe dfr and for each movieId look for a corresponding id in dfm dataframe and join the record.

In [19]:
df_mright = dfm.merge(dfr,left_on='id',right_on='movieId',how='right')

For inner join (merge), do following...

In [20]:
df_minner = dfm.merge(dfr,left_on='id',right_on='movieId',how='inner')

For outer join, do following...

In [21]:
df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer')

For outer merge, it is hard to find out after the merge, the source of each row, whether it is left dataframe or right frame. But there is an option to track that. Use the option indicator=True, with this option, we would see an extra column _merge which would tell us the source of each record or row. Lets try this on our outer merge command again and print few rows to see what do we get.

In [23]:
df_mouter = dfm.merge(dfr,left_on='id',right_on='movieId',how='outer',indicator=True)

Since the datasize is huge, I got the following error

MemoryError: Unable to allocate array with shape (9, 26063255) and data type object

Thats tell me that I needed a bigger machine. But never the less,If you run the above command with option "indicator=True", you would notice a separate column telling you the source of that row.

Wrap Up!

Merge command in Python Pandas is equivalent to what we can achieve in SQL using different joins. You can definately expand on this notebook by experimenting further. Please download and try it out yourself.