How To Read CSV File Using Python PySpark
Apache Spark is an open-source engine for large-scale data processing and analysis. In this tutorial, I will cover how to read CSV data in Spark.
For these commands to work, you should have the following installed.
- Spark - Check out how to install Spark
- PySpark - Check out how to install PySpark in Python 3
In [1]:
from pyspark.sql import SparkSession
Let's initialize our SparkSession now.
In [2]:
spark = SparkSession \
    .builder \
    .appName("how to read csv file") \
    .getOrCreate()
Let's first check the Spark version using spark.version.
In [3]:
spark.version
Out[3]:
For this exercise, I will be using a CSV file of Android reviews.
In [4]:
!ls data/sample_data.csv
Let's read the CSV file now using spark.read.csv.
In [6]:
df = spark.read.csv('data/sample_data.csv')
Let's check the type of the object we got back.
In [7]:
type(df)
Out[7]:
We can peek into our data using the df.show() method.
In [8]:
df.show(5)
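As we see above, the columns are named _c0, _c1 and _c2, which is not correct: without a header option, Spark assigns default names and treats the file's header row as data. Let's fix that using the header=True option.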
In [10]:
df = spark.read.csv('data/sample_data.csv',header=True)
In [11]:
df.show(2)
OK, the headers are fixed now. But the first column in the Spark DataFrame is still named _c0, most likely because its header cell in the file is empty. This column can also be renamed using withColumnRenamed.
In [14]:
df = df.withColumnRenamed('_c0','sno')
In [15]:
df.show(2)