How to do SQL Select and Where Using Python Pandas
Lets start with movie database that I downloaded from Kaggle
import pandas as pd
df = pd.read_csv("movies_metadata.csv")
df.columns
How to do SELECT, WHERE in pandas dataframe
lets do simple select first
Select first 2 rows
df.head(2)
Now lets do the SQL equivalent where clause now. Select all the movies where budget is greater than 30000000
df[df['budget'] > 30000000]
We got following error...
TypeError: '>' not supported between instances of 'str' and 'int'
The error is because we have the wrong data type, lets check the data type of budget
df.budget.dtype
its an object, so lets convert it to the number
df = df.astype({"budget": int})
We got following error
ValueError: invalid literal for int() with base 10: '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'
pd.to_numeric(df['budget'],errors='coerce').head(2)
We added coerce to insert nan for the values where pd.to_numeric fails to convert, lets check what we get if run isnull()
df[pd.to_numeric(df['budget'],errors='coerce').isnull()]['budget'].head(2)
Yes indeed we see that budget column contains values which are not even numbers. lets just go ahead with pd.to_numeric command and lets overwrite the budget column
df['budget'] = pd.to_numeric(df['budget'],errors='coerce')
Lets check what happened to our budget value at index number 19730
df.iloc[19730]['budget']
Yes indeed the value has changed to nan.
Now lets our run SQL equilvalent where command again on the budget
df[df['budget'] > 30000000].head(2)
There you go we got the where clause operation working now.
It takes a while to get used to Pandas commands. If you want to still use SQL commands in Pandas , there is a library to do that as well which is pandasql
How to run SQL commands "select" and "where" using pandasql
Lets import the library pandasql first
import pandasql
I got following error...
ModuleNotFoundError: No module named 'pandasql'
Lets install the library first and import again
!pip install pandasql
import pandasql
Lets try SQL select statement now using pandasql now
pandasql.sqldf("SELECT * FROM df LIMIT 1;", globals())
Ok, now lets try the SQL statement including "where" clause
pandasql.sqldf("SELECT * FROM df where budget > 30000000 LIMIT 1 ;", globals())
There u go we got the results using SQL exact statement in Python Pandas.
Related Notebooks
- Select Pandas Dataframe Rows And Columns Using iloc loc and ix
- Python Numpy Where
- How To Read JSON Data Using Python Pandas
- How To Analyze Wikipedia Data Tables Using Python Pandas
- How To Calculate Stocks Support And Resistance Using Clustering
- Polynomial Interpolation Using Python Pandas Numpy And Sklearn
- Pandas How To Sort Columns And Rows
- Covid 19 Curve Fit Using Python Pandas And Numpy
- Why do we use Optional ListNode in Python