How To Read CSV File Using Python PySpark

Spark is an open-source engine from Apache that is used for large-scale data analysis. In this tutorial I will cover how to read CSV data in Spark.

For these commands to work, you should have the following installed.

  1. Spark - Check out how to install Spark
  2. PySpark - Check out how to install PySpark in Python 3

In [1]:
from pyspark.sql import SparkSession

Let's initialize our SparkSession now.

In [2]:
spark = SparkSession \
    .builder \
    .appName("how to read csv file") \
    .getOrCreate()

Let's first check the Spark version using spark.version.

In [3]:
spark.version
Out[3]:
'3.0.0-preview2'

For this exercise I will be using a CSV file that contains Android reviews.

In [4]:
!ls data/sample_data.csv
data/sample_data.csv

Let's read the CSV file now using spark.read.csv.

In [6]:
df = spark.read.csv('data/sample_data.csv')

Let's check the type of the object we got back.

In [7]:
type(df)
Out[7]:
pyspark.sql.dataframe.DataFrame

We can peek into our data using the df.show() method.

In [8]:
df.show(5)
+----+------+--------------------+
| _c0|   _c1|                 _c2|
+----+------+--------------------+
|null|rating|              review|
|   0|     4|anyone know how t...|
|   1|     2|"Developers of th...|
|   2|     4|This app works gr...|
|   3|     1|Shouldn't of paid...|
+----+------+--------------------+
only showing top 5 rows

As we see above, the columns are named _c0, _c1 and _c2, and the real header (rating, review) shows up as the first data row, which is not correct. Let's fix that using the header=True option.

In [10]:
df = spark.read.csv('data/sample_data.csv', header=True)
In [11]:
df.show(2)
+---+------+--------------------+
|_c0|rating|              review|
+---+------+--------------------+
|  0|     4|anyone know how t...|
|  1|     2|"Developers of th...|
+---+------+--------------------+
only showing top 2 rows

OK, the headers are fixed now. But the first column in the Spark DataFrame is still named _c0, because its header cell in the file is empty. It can be renamed using withColumnRenamed.

In [14]:
df = df.withColumnRenamed('_c0', 'sno')
In [15]:
df.show(2)
+---+------+--------------------+
|sno|rating|              review|
+---+------+--------------------+
|  0|     4|anyone know how t...|
|  1|     2|"Developers of th...|
+---+------+--------------------+
only showing top 2 rows
