PySpark Replace Values In DataFrames Using regexp_replace(), translate() and Overlay() Functions

regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, regexp_replace, translate, overlay, when, expr)
In [25]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [3]:
columns = ["Full_Name","Salary", "Last_Name_Pattern", "Last_Name_Replacement"]
data = [('Sam A Smith', '1,000.01', 'Sm', 'Griffi'),
        ('Alex Wesley Jones', '120,000.89', 'Jo', 'Ba'),
        ('Steve Paul Jobs', '5,000.90', 'Jo', 'Bo')]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Jones|120,000.89|               Jo|                   Ba|
|  Steve Paul Jobs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark regex_replace

regex_replace: we will use the regex_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value

1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'

In [7]:
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
modified_dfFromRDD2 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', 'Jo', 'Ba'))
In [8]:
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteria
modified_dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use when function to replace column values conditionally

In [9]:
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
# based on the conditions
modified_dfFromRDD3 = dfFromRDD2.withColumn("Full_Name", when(col('Full_Name').endswith('th'),
                                                              regexp_replace('Full_Name', 'Smith', 'Griffith'))\
                                                         .otherwise(regexp_replace('Full_Name', 'Jo', 'Ba')))
In [10]:
# visualizing the modified dataframe we see how all the column values are updated based on the conditions provided
modified_dfFromRDD3.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|   Sam A Griffith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

3) We can also use a regex to replace characters. As an example we are making the decimal digits in the salary column to '00'.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.withColumn("Salary", regexp_replace('Salary', '\\.\d\d$', '.00 \\$'))
In [12]:
# visualizing the modified dataframe, we see how the Salary column is updated
modified_dfFromRDD4.show(truncate=False)
+-----------------+------------+-----------------+---------------------+
|Full_Name        |Salary      |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+------------+-----------------+---------------------+
|Sam A Smith      |1,000.00 $  |Sm               |Griffi               |
|Alex Wesley Jones|120,000.00 $|Jo               |Ba                   |
|Steve Paul Jobs  |5,000.00 $  |Jo               |Bo                   |
+-----------------+------------+-----------------+---------------------+

4) Now we will use another regex example to replace varialbe number of characters where the pattern matches regex. Here we replace all lower case characters in the Full_Name column with '--'

In [13]:
# Replace only the lowercase characters in the Full_Name with --
modified_dfFromRDD5 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', '[a-z]+', '--'))
In [14]:
# visualizing the modified data frame. We see that all the lowercase characters are replaced.
# The uppercase characters are same as they were before
modified_dfFromRDD5.show()
+-----------+----------+-----------------+---------------------+
|  Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------+----------+-----------------+---------------------+
|  S-- A S--|  1,000.01|               Sm|               Griffi|
|A-- W-- J--|120,000.89|               Jo|                   Ba|
|S-- P-- J--|  5,000.90|               Jo|                   Bo|
+-----------+----------+-----------------+---------------------+

5) We can also use regex_replace with expr to replace a column's value with a match pattern from a second column with the values from third column i.e 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1, that match the pattern in column 2 with characters from column 3.

In [15]:
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values
# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'
modified_dfFromRDD6 = modified_dfFromRDD2.withColumn("Full_Name",
                                                     expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
In [16]:
# visualizing the modified dataframe. 
# The Full_Name column has been updated with some characters from Last_Name_Replacement
modified_dfFromRDD6.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|  Sam A Griffiith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark translate()

translate(): This function is used to do character by character replacement of column values

In [17]:
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:
# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9
alphabets = 'abcdefjhij'
digits    = '0123456789'
modified_dfFromRDD7 = dfFromRDD2.withColumn("Full_Name", translate('Full_Name', alphabets, digits))
In [18]:
# visualizing the modified dataframe we see the replacements has been done character by character
modified_dfFromRDD7.show(truncate=False)
+-----------------+----------+-----------------+---------------------+
|Full_Name        |Salary    |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|S0m A Sm8t7      |1,000.01  |Sm               |Griffi               |
|Al4x W4sl4y Jon4s|120,000.89|Jo               |Ba                   |
|St4v4 P0ul Jo1s  |5,000.90  |Jo               |Bo                   |
+-----------------+----------+-----------------+---------------------+

PySpark overlay()

overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)

In [19]:
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement column
modified_dfFromRDD8 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 1, 2).alias("FullName_Overlayed"))
In [20]:
# Visualizing the modified dataframe
modified_dfFromRDD8.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|   Griffim A Smith|
|Alex Wesley Jones| Baex Wesley Jones|
|  Steve Paul Jobs|   Boeve Paul Jobs|
+-----------------+------------------+

In [21]:
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the 
# length of the replacement string
modified_dfFromRDD9 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 5).alias("FullName_Overlayed"))
In [22]:
# Visualizing the modified dataframe
modified_dfFromRDD9.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|       Sam Griffih|
|Alex Wesley Jones| AlexBaesley Jones|
|  Steve Paul Jobs|   StevBoPaul Jobs|
+-----------------+------------------+

In [23]:
spark.stop()