How To Read CSV File Using Python PySpark

Apache Spark is an open source engine for large-scale data analysis. In this tutorial I will cover how to read CSV data in Spark using PySpark.

For these commands to work, you should have the following installed.

Spark - Check out how to install Spark
PySpark - Check out how to install PySpark in Python 3

from pyspark.sql import SparkSession

Let's initialize our SparkSession now.

spark = SparkSession \
    .builder \
    .appName("how to read csv file") \
    .getOrCreate()

Let's first check the Spark version using spark.version.

spark.version

'3.0.0-preview2'

For this exercise I will be using a CSV file containing Android reviews.

!ls data/sample_data.csv

data/sample_data.csv

Let's read the CSV file now using spark.read.csv.

df = spark.read.csv('data/sample_data.csv')

Let's check the type of the object we got back.

type(df)

pyspark.sql.dataframe.DataFrame

We can peek into our data using the df.show() method.

df.show(5)

+----+------+--------------------+
| _c0|   _c1|                 _c2|
+----+------+--------------------+
|null|rating|              review|
|   0|     4|anyone know how t...|
|   1|     2|"Developers of th...|
|   2|     4|This app works gr...|
|   3|     1|Shouldn't of paid...|
+----+------+--------------------+
only showing top 5 rows

As we see above, the column names are _c0, _c1 and _c2, which is not correct: the real header row was read as the first row of data. Let's fix that using the header=True option.

df = spark.read.csv('data/sample_data.csv', header=True)

df.show(2)

+---+------+--------------------+
|_c0|rating|              review|
+---+------+--------------------+
|  0|     4|anyone know how t...|
|  1|     2|"Developers of th...|
+---+------+--------------------+
only showing top 2 rows

OK, the headers are fixed now, but the first column in the Spark DataFrame is still named _c0. It can be renamed using withColumnRenamed.

df = df.withColumnRenamed('_c0', 'sno')

df.show(2)

+---+------+--------------------+
|sno|rating|              review|
+---+------+--------------------+
|  0|     4|anyone know how t...|
|  1|     2|"Developers of th...|
+---+------+--------------------+
only showing top 2 rows
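
As an aside, the _c0 name seen earlier appears because the first field of the header line in the file is empty (it showed up as null when the file was read without header=True), so Spark falls back to a positional name for that column. A quick way to check this without Spark is to read the header line with Python's built-in csv module. The snippet below is a small sketch under that assumption. It uses a made-up stand-in file rather than the real data/sample_data.csv, and read_header is a hypothetical helper name.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for data/sample_data.csv: the header starts with an empty field.
sample = (
    ",rating,review\n"
    "0,4,anyone know how to fix this\n"
    '1,2,"Developers of this app, please help"\n'
)


def read_header(path):
    """Return the first row of a CSV file as a list of field names."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_data.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(sample)
    header = read_header(path)

# Positions of header fields that have no name.
unnamed = [i for i, name in enumerate(header) if name == ""]

print(header)   # ['', 'rating', 'review']
print(unnamed)  # [0]
```

Any position listed in unnamed is a column that will need a name of its own, for example through withColumnRenamed as shown above.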
