Data Cleaning with Python Pdpipe

What is Data Cleaning?

Data cleaning is the process of preparing a dataset for analysis by transforming, correcting, and removing unwanted information. The goal of data cleaning is not simply to delete data; rather, it is to improve the accuracy of the dataset by removing or fixing information that would distort the analysis.

What is a Pipeline?

Pipelines are sequences of data processing steps. You often need to manipulate or transform raw data into useful information that your model can consume. In machine learning systems, pipelines play a central role in transforming and manipulating large volumes of data.
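As a plain-Python sketch (an illustrative toy, not pdpipe itself), a pipeline is just an ordered list of steps, each feeding its output to the next:

```python
# Illustrative sketch: a pipeline as an ordered list of processing steps.
# Each step is a plain function that takes data and returns transformed data.
def strip_whitespace(values):
    return [v.strip() for v in values]

def to_lowercase(values):
    return [v.lower() for v in values]

def run_pipeline(data, steps):
    # Apply each step in order, feeding its output to the next step.
    for step in steps:
        data = step(data)
    return data

raw = ["  US. ", " Japan.", "EUROPE "]
print(run_pipeline(raw, [strip_whitespace, to_lowercase]))
# ['us.', 'japan.', 'europe']
```

Frameworks like pdpipe give you this same compose-and-run pattern, but with ready-made, configurable stages that operate on pandas DataFrames.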

What is pdpipe?

pdpipe is a pre-processing pipeline framework for Python's pandas DataFrames. The pdpipe API makes it easy to compose, or break down, complex pandas processing pipelines in a few lines of code.

Advantages of Using the pdpipe framework

According to the creators of the pdpipe framework, its main advantage is that it adheres to scikit-learn's Transformer API, supporting machine learning tasks. Apart from that, a few other advantages are:

- The pdpipe framework is compatible with Python 3.0 and above
- You don't need to configure pdpipe
- All pdpipe functions are documented with working example code
- It creates pipelines that easily process various data types
- You can customize the pipelines

In today's article, we will look at how to install pdpipe and use it to clean a selected dataset. Later, we will also explain the basics of how you can use the data for visualization purposes as well.

In [6]:
!pip install pdpipe

In some cases, you might also have to install scikit-learn and/or nltk in order to run certain pipeline stages. If the interpreter asks for them, you can download and install them from their respective websites (or with pip).

How to Prepare the Dataset?

For this demonstration, we will use the cars dataset, which you can download from the Kaggle website. Once downloaded, you can load all the data into a data frame.

In [8]:
import pandas as pd
df = pd.read_csv('cars.csv')

Let's take a glimpse at the data inside the dataset.

In [9]:
df.tail()
Out[9]:
mpg cylinders cubicinches hp weightlbs time-to-60 year brand
256 17.0 8 305 130 3840 15 1980 US.
257 36.1 4 91 60 1800 16 1979 Japan.
258 22.0 6 232 112 2835 15 1983 US.
259 18.0 6 232 100 3288 16 1972 US.
260 22.0 6 250 105 3353 15 1977 US.

The output shows that there are 260 rows of data with 8 columns. Now let's look at the column information.

In [10]:
list(df.columns.values)
Out[10]:
['mpg',
 ' cylinders',
 ' cubicinches',
 ' hp',
 ' weightlbs',
 ' time-to-60',
 ' year',
 ' brand']
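
Note the leading spaces in most of these names. One optional, non-pdpipe way to avoid them (not used in the rest of this article, which keeps the original spaced names) is to strip the column names once, up front. A toy frame standing in for the cars data:

```python
import pandas as pd

# Hypothetical frame mimicking the cars dataset's spaced column names.
df = pd.DataFrame({'mpg': [17.0], ' cylinders': [8], ' hp': [130]})

# Optional cleanup: strip the stray whitespace from every column name.
df.columns = df.columns.str.strip()
print(list(df.columns))
# ['mpg', 'cylinders', 'hp']
```

If you do this, remember to drop the leading spaces in every later stage (e.g. 'time-to-60' instead of ' time-to-60').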

Make sure you know the exact column names in the dataset, as they are case-sensitive (and here include leading spaces) when used with pdpipe.

How to Import pdpipe?

Importing pdpipe is as simple as importing any other framework into a Python program.

In [12]:
import pdpipe as pdp

Now that we know how to import pdpipe, let's focus on how to use it to manipulate our dataset.

How to Remove a column?

You can clean your dataset with pdpipe by removing unwanted columns. There are two ways to do this. Let's remove the ' time-to-60' column from our dataset using both methods.

Method 1

You can construct the stage and apply it to the data frame in a single line, without first storing the stage in a separate variable.

In [14]:
dropCol1 = pdp.ColDrop(" time-to-60").apply(df)
dropCol1.tail()
Out[14]:
mpg cylinders cubicinches hp weightlbs year brand
256 17.0 8 305 130 3840 1980 US.
257 36.1 4 91 60 1800 1979 Japan.
258 22.0 6 232 112 2835 1983 US.
259 18.0 6 232 100 3288 1972 US.
260 22.0 6 250 105 3353 1977 US.

Method 2

You can create a new data frame to store the outcome after dropping the column. The stage object assigned to a variable is itself callable, which makes pdpipe somewhat unique among pipeline libraries: once configured, it can be used like a function.

In [15]:
dropCol2 = pdp.ColDrop(" time-to-60")
df2 = dropCol2(df)
df2.tail()
Out[15]:
mpg cylinders cubicinches hp weightlbs year brand
256 17.0 8 305 130 3840 1980 US.
257 36.1 4 91 60 1800 1979 Japan.
258 22.0 6 232 112 2835 1983 US.
259 18.0 6 232 100 3288 1972 US.
260 22.0 6 250 105 3353 1977 US.
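
That callable behaviour can be sketched with a minimal stage class (illustrative only, not pdpipe's actual implementation):

```python
import pandas as pd

# Minimal sketch of why a callable stage object is convenient:
# configure the stage once, then call it like a function on any frame.
class ColDropStage:
    def __init__(self, column):
        self.column = column

    def __call__(self, df):
        # Return a new frame without the configured column.
        return df.drop(columns=[self.column])

df = pd.DataFrame({'mpg': [17.0], 'time-to-60': [15]})
drop_stage = ColDropStage('time-to-60')  # configure the stage once
print(list(drop_stage(df).columns))      # ...then call it like a function
# ['mpg']
```

The same configured stage can be reused on any number of data frames, which is exactly what makes pipeline stages composable.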

What is OneHotEncode?

When it comes to machine learning, classification and regression play a major role. However, we cannot apply a classification or regression model to our dataset as it stands, because no column carries binary class information. In such a situation, if you want to prepare your dataset for classification or regression, pdpipe comes in handy for encoding the data as binary classes. In this example, let's classify the year as before or after the 1980s. For this purpose, we will also use a simple if-else function.

In [16]:
def size(n):
    if n < 1980:
        return 'before 1980s'
    else:
        return 'after 1980s'

Now we can apply this function to the year column to create a new classification column, naming it Year_Classification.

In [19]:
df['Year_Classification'] = df[' year'].apply(size) 
df.tail(2)
Out[19]:
mpg cylinders cubicinches hp weightlbs time-to-60 year brand Year_Classification
259 18.0 6 232 100 3288 16 1972 US. before 1980s
260 22.0 6 250 105 3353 15 1977 US. before 1980s

As per the output, you can see a new column holding only two values: 'before 1980s' and 'after 1980s'. But this is still not ideal for use with a classification or regression model. For that, we will use the OneHotEncode method, which encodes the output as ones and zeros.

In [20]:
pipeline = pdp.ColDrop(' time-to-60')
pipeline+= pdp.OneHotEncode('Year_Classification')
df3 = pipeline(df)
df3.tail(2)
Out[20]:
mpg cylinders cubicinches hp weightlbs year brand Year_Classification_before 1980s
259 18.0 6 232 100 3288 1972 US. 1
260 22.0 6 250 105 3353 1977 US. 1

According to the output, you can see that the OneHotEncode method has encoded 'before 1980s' and 'after 1980s' as 1s and 0s!
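This is the same idea as pandas' own pd.get_dummies. As a self-contained illustration on a toy frame (not the cars data), with drop_first=True keeping a single 0/1 column for the two categories, matching the single column seen in the output above:

```python
import pandas as pd

# Toy frame with the same two-category column as in the article.
df = pd.DataFrame({'Year_Classification':
                   ['before 1980s', 'after 1980s', 'before 1980s']})

# pandas' built-in one-hot encoding; drop_first=True drops the first
# category, leaving a single indicator column for 'before 1980s'.
encoded = pd.get_dummies(df, columns=['Year_Classification'], drop_first=True)
print(list(encoded.columns))
# ['Year_Classification_before 1980s']
```

For a column with k categories, drop_first=True yields k - 1 indicator columns, which avoids redundant (perfectly collinear) features in regression models.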

How to remove Rows?

Now let's focus on how to remove the rows where the cars have 4 or fewer cylinders. First, we will define a simple function.

In [21]:
def numberOfCylinders(x):
    if x <= 4:
        return 'No'
    else:
        return 'Yes'

This function checks whether the number of cylinders is 4 or fewer and returns 'No' if so, 'Yes' otherwise. We will store the result in a separate column named CylindersLessThan_4.

In [22]:
pipeline+=pdp.ApplyByCols(' cylinders', numberOfCylinders, 'CylindersLessThan_4', drop=False)
df4 = pipeline(df)
df4.tail(2)
Out[22]:
mpg cylinders CylindersLessThan_4 cubicinches hp weightlbs year brand Year_Classification_before 1980s
259 18.0 6 Yes 232 100 3288 1972 US. 1
260 22.0 6 Yes 250 105 3353 1977 US. 1

According to the output, you can see a new column that says Yes or No based on the number of cylinders. Now let's drop the rows that have 4 or fewer cylinders.

In [23]:
pipeline+=pdp.ValDrop(['No'],'CylindersLessThan_4')
In [27]:
df5 = pipeline(df)
df5[df5['CylindersLessThan_4']=='No']
Out[27]:
mpg cylinders CylindersLessThan_4 cubicinches hp weightlbs year brand Year_Classification_before 1980s

Yes, we have successfully removed the unwanted rows. Also, it's now pointless to keep the CylindersLessThan_4 column, so let's remove it as well.

In [28]:
pipeline+= pdp.ColDrop('CylindersLessThan_4')
df6 = pipeline(df)
df6.tail(2)
Out[28]:
mpg cylinders cubicinches hp weightlbs year brand Year_Classification_before 1980s
259 18.0 6 232 100 3288 1972 US. 1
260 22.0 6 250 105 3353 1977 US. 1

You can also use the RowDrop method to drop unwanted rows in just one line, together with a lambda function. Let's remove all the rows with horsepower of 100 or less.

In [30]:
pipeline+= pdp.RowDrop({' hp': lambda x: x <= 100})
df7 = pipeline(df)
df7.tail(2)
Out[30]:
mpg cylinders cubicinches hp weightlbs year brand Year_Classification_before 1980s
258 22.0 6 232 112 2835 1983 US. 0
260 22.0 6 250 105 3353 1977 US. 1

According to the output, all the rows with horsepower of 100 or less have been removed. You can apply these methods based on your dataset's requirements. Finally, let's see how to apply scikit-learn's scaling estimators with pdpipe. For demonstration, let's use the MinMaxScaler function; you can use any scaler available in scikit-learn (MaxAbsScaler, StandardScaler, RobustScaler, etc.).

In [36]:
pipeline_scale = pdp.Scale('MinMaxScaler', exclude_columns=['mpg','year','brand','cubicinches'])
In [37]:
df8 = pipeline_scale(df7)
df8.tail(2)
Out[37]:
mpg cylinders cubicinches hp weightlbs year brand Year_Classification_before 1980s
258 0.528634 0.333333 232 0.070866 2835 1.0 US. 0.0
260 0.528634 0.333333 250 0.015748 3353 0.5 US. 1.0

We can also omit columns which we do not need to scale. In our example, we chose not to scale the columns 'mpg',' year',' brand' & ' cubicinches'.
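
For intuition, MinMaxScaler's transformation is simply (x - min) / (max - min) applied per column, mapping each column to the range [0, 1]. A plain-pandas sketch on a toy frame (illustrative values, not the cars data):

```python
import pandas as pd

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min).
# Toy frame standing in for the numeric columns being scaled.
df = pd.DataFrame({'hp': [60, 100, 130], 'weightlbs': [1800, 2835, 3840]})

scaled = (df - df.min()) / (df.max() - df.min())
print(scaled['hp'].tolist())
# [0.0, 0.5714285714285714, 1.0]
```

The minimum of each column becomes 0, the maximum becomes 1, and everything else falls proportionally in between; this is why the scaled output above shows values like 0.0, 0.5 and 1.0.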

Conclusion

The pandas library is widely used to handle big datasets. As data scientists and engineers, it is important to know how to manipulate data in order to produce a sound analysis. Data cleaning is much easier with pdpipe, and you can explore more methods in the official documentation. Happy coding!