Data Cleaning with Python Pdpipe
What is Data Cleaning?
Data cleaning is the process of preparing a dataset for analysis by transforming and manipulating unwanted information. The goal of data cleaning is not simply to remove unwanted data, but to improve the accuracy of the dataset by removing it.
What is a Pipeline?
Pipelines are sequences of data processing steps. You often need to manipulate or transform raw data into useful information that your model can consume. In machine learning systems, pipelines play a useful role in transforming and manipulating large amounts of data.
What is pdpipe?
pdpipe is a pre-processing pipeline framework for Python's pandas data frames. The pdpipe API makes it easy to compose or break down complex pandas processing pipelines in a few lines of code.
Advantages of Using the pdpipe Framework
According to the creators of the pdpipe framework, its main advantage is that it adheres to scikit-learn's Transformer API, supporting machine learning tasks. Apart from that, a few other advantages are:
- The pdpipe framework is compatible with Python 3.0 and above
- You don't need to configure pdpipe
- All pdpipe functions are documented with working example code
- It creates pipelines that easily process various data types
- You can customize the pipelines
In today's article, we will look at how to install pdpipe and use it to clean a selected dataset. Later, we will also explain the basics of how you can use the cleaned data for visualization purposes.
!pip install pdpipe
In some cases, you might have to install scikit-learn and/or nltk in order to run certain pipeline stages. If the interpreter reports that they are missing, you can download and install them from the relevant websites.
How to Prepare the Dataset?
For this demonstration, we will use the cars dataset, which you can download from the Kaggle website. Once you have downloaded it, you can load all the data into a data frame.
import pandas as pd
df = pd.read_csv('cars.csv')
Let's take a quick look at the data inside the dataset.
df.tail()
According to the output, you can see that there are 260 rows of data with 8 columns. Now let's look at the column information.
list(df.columns.values)
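Notice that several column names in this dataset carry leading spaces (for example ' time-to-60'), which is why the exact strings matter later. As a hedged, self-contained sketch (the values below are hypothetical, only the header style mimics cars.csv), you can inspect the raw names and, if you prefer, strip the whitespace once up front:

```python
import pandas as pd

# Miniature stand-in frame; the leading spaces in the headers
# mimic how cars.csv column names arrive.
df = pd.DataFrame({'mpg': [18.0], ' cylinders': [8], ' time-to-60': [12]})

print(list(df.columns.values))  # raw names, leading spaces included

# Optional normalisation: strip the whitespace so later pipeline
# stages can refer to clean labels instead of ' time-to-60'.
df.columns = df.columns.str.strip()
print(list(df.columns.values))
```

In this article we keep the original, space-prefixed names so the code matches the downloaded file as-is.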
Make sure you note the exact column names in the dataset, as they are case sensitive when used with pdpipe.
How to Import pdpipe?
Importing pdpipe is as simple as importing any other framework into a Python program.
import pdpipe as pdp
Now that we know how to import pdpipe, let's focus on how we can use it to manipulate our dataset.
How to Remove a Column?
You can clean your dataset using pdpipe by removing unwanted columns. There are two ways this can be done. Let's remove the ' time-to-60' column from our dataset using both methods.
Method 1
You can directly drop a column from the data frame without creating a new data frame for the output.
dropCol1 = pdp.ColDrop(" time-to-60").apply(df)
dropCol1.tail()
Method 2
You can create a new data frame to store the outcome after dropping the column. Here the pipeline stage itself is assigned to a variable and can then be called like a function on the data frame, which makes pdpipe somewhat unique among pipeline tools.
dropCol2 = pdp.ColDrop(" time-to-60")
df2 = dropCol2(df)
df2.tail()
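For intuition, the effect of a ColDrop stage can be sketched in plain pandas. This is a hedged, self-contained illustration with a hypothetical miniature frame, not pdpipe's actual implementation:

```python
import pandas as pd

# Small stand-in frame (hypothetical values, same header style as cars.csv).
df = pd.DataFrame({'mpg': [18.0, 24.0],
                   ' time-to-60': [12, 10],
                   ' hp': [130, 95]})

# Dropping the ' time-to-60' column is conceptually equivalent to:
df2 = df.drop(columns=[' time-to-60'])
print(df2.columns.tolist())
```

The advantage of the pdpipe version is that the drop becomes a reusable, composable stage rather than a one-off statement.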
What is OneHotEncode?
When it comes to machine learning, classification and regression play a major role. However, in our dataset we cannot apply any classification models directly, as there are no columns with binary classification information. So, if you want to prepare your dataset for classification or regression, pdpipe comes in handy for encoding the data into binary categories. In this example, let's classify the year as before or after the 1980s. For this purpose, we will also get some help from a simple if-else function.
def size(n):
    if n < 1980:
        return 'before 1980s'
    else:
        return 'after 1980s'
Now we can apply this function to create a new classification column named Year_Classification.
df['Year_Classification'] = df[' year'].apply(size)
df.tail(2)
As per the output, you can see a new column that stores only two values: 'before 1980s' and 'after 1980s'. But this is still not the best form for a classification or regression model. For this purpose, we will use the OneHotEncode method, which expresses the output as ones and zeros.
pipeline = pdp.ColDrop(' time-to-60')
pipeline+= pdp.OneHotEncode('Year_Classification')
df3 = pipeline(df)
df3.tail(2)
According to the output, you can see that the OneHotEncode method has encoded the before/after-1980s categories as 1s and 0s!
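To see what one-hot encoding produces, here is a hedged pandas-only sketch using the built-in `pd.get_dummies` on a hypothetical column with the two categories created above (this mirrors the idea of pdpipe's OneHotEncode stage, not its exact output columns):

```python
import pandas as pd

# Hypothetical column holding the two categories from the size() function.
df = pd.DataFrame({'Year_Classification':
                   ['before 1980s', 'after 1980s', 'after 1980s']})

# One indicator column per category; each row has a single 1 (True).
encoded = pd.get_dummies(df['Year_Classification'])
print(encoded)
```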
How to Remove Rows?
Now let's focus on how to remove the rows for cars that have 4 or fewer cylinders. First, we will define a simple function.
def numberOfCylinders(x):
    if x <= 4:
        return 'No'
    else:
        return 'Yes'
This function returns 'No' when the number of cylinders is 4 or fewer, and 'Yes' otherwise. We will store the result in a separate column named CylindersLessThan_4.
pipeline+=pdp.ApplyByCols(' cylinders', numberOfCylinders, 'CylindersLessThan_4', drop=False)
df4 = pipeline(df)
df4.tail(2)
According to the output, you can see a new column that says Yes or No based on the number of cylinders. Now let's drop the rows where this column is 'No'.
pipeline+=pdp.ValDrop(['No'],'CylindersLessThan_4')
df5 = pipeline(df)
df5[df5['CylindersLessThan_4']=='No']
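Conceptually, the ValDrop stage behaves like a boolean filter in plain pandas. A minimal, self-contained sketch with hypothetical values:

```python
import pandas as pd

# Stand-in frame with the helper column from the article.
df = pd.DataFrame({' cylinders': [4, 6, 8],
                   'CylindersLessThan_4': ['No', 'Yes', 'Yes']})

# Dropping rows whose value is 'No' is equivalent to keeping the rest:
filtered = df[df['CylindersLessThan_4'] != 'No']
print(len(filtered))
```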
Yes, we have successfully removed the unwanted rows. Keeping the CylindersLessThan_4 column is now pointless, so we had better remove it as well.
pipeline+= pdp.ColDrop('CylindersLessThan_4')
df6 = pipeline(df)
df6.tail(2)
You can also use the RowDrop method to drop unwanted rows in just one line. Let's remove all the rows where horsepower is 100 or less, using a lambda function as the condition.
pipeline+= pdp.RowDrop({' hp': lambda x: x <= 100})
df7 = pipeline(df)
df7.tail(2)
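The RowDrop condition above can again be pictured as a plain pandas boolean mask. A hedged sketch with hypothetical horsepower values:

```python
import pandas as pd

# Hypothetical horsepower column, space-prefixed as in cars.csv.
df = pd.DataFrame({' hp': [95, 130, 100, 220]})

# RowDrop({' hp': lambda x: x <= 100}) drops rows where the lambda is
# True, i.e. it keeps only rows with horsepower strictly above 100:
kept = df[df[' hp'] > 100]
print(kept[' hp'].tolist())  # [130, 220]
```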
According to the output, all the rows with horsepower of 100 or less have been removed. You can apply these methods based on your dataset's requirements. Finally, let's see how we can apply scikit-learn's scaling estimators with pdpipe. For demonstration, let's use the MinMaxScaler. You can use any of the scalers available in scikit-learn (MaxAbsScaler, StandardScaler, RobustScaler, etc.).
pipeline_scale = pdp.Scale('MinMaxScaler', exclude_columns=['mpg', ' year', ' brand', ' cubicinches'])
df8 = pipeline_scale(df7)
df8.tail(2)
We can also omit columns which we do not need to scale. In our example, we chose not to scale the columns 'mpg',' year',' brand' & ' cubicinches'.
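For reference, MinMaxScaler maps each value x in a column to (x − min) / (max − min), squeezing the column into [0, 1]. A self-contained sketch of that formula on a hypothetical horsepower column:

```python
import pandas as pd

# Hypothetical horsepower values; min = 100, max = 200.
hp = pd.Series([100.0, 150.0, 200.0])

# The min-max formula applied by hand:
scaled = (hp - hp.min()) / (hp.max() - hp.min())
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

The pdp.Scale stage applies the same transformation column by column, skipping whatever you list in exclude_columns.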
Conclusion
The pandas library is widely used to handle big data sets. As data scientists and engineers, it is important to know how to manipulate data in order to produce a sound analysis. Data cleaning is much easier with pdpipe, and you can explore more methods in the official documentation. Happy coding!
Related Notebooks
- Data Analysis With Pyspark Dataframe
- Calculate Stock Options Max Pain Using Data From Yahoo Finance With Python
- Understanding Standard Deviation With Python
- With Open Statement in Python
- How To Analyze Yahoo Finance Data With R
- How To Install Python With Conda
- Regularization Techniques in Linear Regression With Python
- Learn And Code Confusion Matrix With Python
- Merge and Join DataFrames with Pandas in Python