PySpark Substr and Substring

substring(col_name, pos, len) - Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring)
In [24]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Let us load our initial records.

In [3]:
columns = ["Full_Name","Salary"]
data = [("John A Smith", 1000), ("Alex Wesley Jones", 120000), ("Jane Tom James", 5000)]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|     John A Smith|  1000|
|Alex Wesley Jones|120000|
|   Jane Tom James|  5000|
+-----------------+------+

PySpark substring

1) Here we are taking a substring for the first name from the Full_Name Column. The Full_Name contains first name, middle name and last name. We are adding a new column for the substring called First_Name

In [7]:
# here we add a new column called 'First_Name' and use substring() to get partial string from 'Full_Name' column
modified_dfFromRDD2 = dfFromRDD2.withColumn("First_Name", substring('Full_Name', 1, 4))
In [8]:
# visualizing the modified dataframe 
modified_dfFromRDD2.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

2) We can also get a substring with select and alias to achieve the same result as above

In [9]:
modified_dfFromRDD3 = dfFromRDD2.select("Full_Name", 'Salary', substring('Full_Name', 1, 4).alias('First_Name'))
In [10]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

3) We can also use substring with selectExpr to get a substring of 'Full_Name' column. selectExpr takes SQL expression(s) in a string to execute. This way we can run SQL-like expressions without creating views.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.selectExpr("Full_Name", 'Salary', 'substring(Full_Name, 1, 4) as First_Name')
In [12]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD4.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

4) Here we are going to use substr function of the Column data type to obtain the substring from the 'Full_Name' column and create a new column called 'First_Name'

In [13]:
modified_dfFromRDD5 = dfFromRDD2.withColumn("First_Name", col('Full_Name').substr(1, 4))
In [14]:
# visualizing the modified dataframe yields the same output as seen for all previous examples.
modified_dfFromRDD5.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

5) Let us consider now a example of substring when the indices are beyond the length of column. In that case, the substring() function only returns characters that fall in the bounds i.e (start, start+len). This can be seen in the example below

In [15]:
# In this example we are going to get the four characters of Full_Name column starting from position 14.
#  As can be seen in the example, 4 or fewer characters are returned depending on the string length
modified_dfFromRDD6 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', 14, 4))
In [16]:
modified_dfFromRDD6.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|         |
|Alex Wesley Jones|120000|     ones|
|   Jane Tom James|  5000|        s|
+-----------------+------+---------+

The above method produces wrong last name. We can fix it by following approach.

6) Another example of substring when we want to get the characters relative to end of the string. In this example, we are going to extract the last name from the Full_Name column.

In [17]:
# In this example we are going to get the five characters of Full_Name column relative to the end of the string.
#  As can be seen in the example, last 5 charcters are returned
modified_dfFromRDD7 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', -5, 5))
In [18]:
modified_dfFromRDD7.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|    Smith|
|Alex Wesley Jones|120000|    Jones|
|   Jane Tom James|  5000|    James|
+-----------------+------+---------+

Note above approach works only if the last name in each row is of constant characters length. What if the last name is of different characters length, the solution is not that simple.
I will need the index at which the last name starts and also the length of 'Full_Name'. If you are curious, I have provided the solution below without the explanation.

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring, lit, substring_index, length)

Let us create an example with last names having variable character length.

In [20]:
columns = ["Full_Name","Salary"]
data = [("John A Smith", 1000), ("Alex Wesley leeper", 120000), ("Jane Tom kinderman", 5000)]
rdd = spark.sparkContext.parallelize(data)
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.show()
+------------------+------+
|         Full_Name|Salary|
+------------------+------+
|      John A Smith|  1000|
|Alex Wesley leeper|120000|
|Jane Tom kinderman|  5000|
+------------------+------+

Pyspark substr

In [21]:
dfFromRDD2.withColumn('Last_Name', col("Full_Name").substr((length('Full_Name') - length(substring_index('Full_Name', " ", -1))),length('Full_Name'))).show()                      
+------------------+------+----------+
|         Full_Name|Salary| Last_Name|
+------------------+------+----------+
|      John A Smith|  1000|     Smith|
|Alex Wesley leeper|120000|    leeper|
|Jane Tom kinderman|  5000| kinderman|
+------------------+------+----------+

In [22]:
spark.stop()