regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.
First we load the important libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, regexp_replace, translate, overlay, when, expr)
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()
columns = ["Full_Name","Salary", "Last_Name_Pattern", "Last_Name_Replacement"]
data = [('Sam A Smith', '1,000.01', 'Sm', 'Griffi'),
('Alex Wesley Jones', '120,000.89', 'Jo', 'Ba'),
('Steve Paul Jobs', '5,000.90', 'Jo', 'Bo')]
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
# visualizing current data before manipulation
dfFromRDD2.show()
regex_replace: we will use the regex_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value
1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
modified_dfFromRDD2 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', 'Jo', 'Ba'))
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteria
modified_dfFromRDD2.show()
2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use when function to replace column values conditionally
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
# based on the conditions
modified_dfFromRDD3 = dfFromRDD2.withColumn("Full_Name", when(col('Full_Name').endswith('th'),
regexp_replace('Full_Name', 'Smith', 'Griffith'))\
.otherwise(regexp_replace('Full_Name', 'Jo', 'Ba')))
# visualizing the modified dataframe we see how all the column values are updated based on the conditions provided
modified_dfFromRDD3.show()
3) We can also use a regex to replace characters. As an example we are making the decimal digits in the salary column to '00'.
modified_dfFromRDD4 = dfFromRDD2.withColumn("Salary", regexp_replace('Salary', '\\.\d\d$', '.00 \\$'))
# visualizing the modified dataframe, we see how the Salary column is updated
modified_dfFromRDD4.show(truncate=False)
4) Now we will use another regex example to replace varialbe number of characters where the pattern matches regex. Here we replace all lower case characters in the Full_Name column with '--'
# Replace only the lowercase characters in the Full_Name with --
modified_dfFromRDD5 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', '[a-z]+', '--'))
# visualizing the modified data frame. We see that all the lowercase characters are replaced.
# The uppercase characters are same as they were before
modified_dfFromRDD5.show()
5) We can also use regex_replace with expr to replace a column's value with a match pattern from a second column with the values from third column i.e 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1, that match the pattern in column 2 with characters from column 3.
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values
# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'
modified_dfFromRDD6 = modified_dfFromRDD2.withColumn("Full_Name",
expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
# visualizing the modified dataframe.
# The Full_Name column has been updated with some characters from Last_Name_Replacement
modified_dfFromRDD6.show()
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:
# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9
alphabets = 'abcdefjhij'
digits = '0123456789'
modified_dfFromRDD7 = dfFromRDD2.withColumn("Full_Name", translate('Full_Name', alphabets, digits))
# visualizing the modified dataframe we see the replacements has been done character by character
modified_dfFromRDD7.show(truncate=False)
overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement column
modified_dfFromRDD8 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 1, 2).alias("FullName_Overlayed"))
# Visualizing the modified dataframe
modified_dfFromRDD8.show()
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the
# length of the replacement string
modified_dfFromRDD9 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 5).alias("FullName_Overlayed"))
# Visualizing the modified dataframe
modified_dfFromRDD9.show()
spark.stop()