How To Read CSV File Using Python PySpark
Apache Spark is an open-source engine for large-scale data processing and analysis. In this tutorial I will cover how to read CSV data in Spark using PySpark.
For these commands to work, you should have the following installed.
- Spark - Check out how to install Spark
- PySpark - Check out how to install PySpark in Python 3
from pyspark.sql import SparkSession
Let's initialize our SparkSession now.
spark = SparkSession \
    .builder \
    .appName("how to read csv file") \
    .getOrCreate()
Let's first check the Spark version using spark.version.
spark.version
For this exercise I will be using a CSV file that contains Android reviews.
!ls data/sample_data.csv
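The `!ls` line above is a notebook shell escape, so it only works inside Jupyter or IPython. A minimal plain-Python sketch of the same check follows; the helper name `csv_exists` is hypothetical, and the path is the one used in this tutorial.

```python
from pathlib import Path


def csv_exists(path):
    """Return True if `path` points to an existing regular file."""
    return Path(path).is_file()


# Hypothetical usage with the tutorial's path:
print(csv_exists("data/sample_data.csv"))
```

Checking the path up front tends to give a clearer failure than the error Spark raises when the file is missing.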
Let's read the CSV file now using spark.read.csv.
df = spark.read.csv('data/sample_data.csv')
Let's check the type of the object we got back.
type(df)
We can peek into our data using the df.show() method.
df.show(5)
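Besides `df.show()`, the column names and types give a quick view of what Spark parsed. The sketch below assumes `df` is a Spark DataFrame such as the one read in this tutorial; the helper name `summarize_columns` is hypothetical. It relies only on the DataFrame's `dtypes` attribute, which is a list of (name, type) pairs.

```python
def summarize_columns(df):
    """Return 'name: type' strings for each column of a Spark DataFrame.

    Uses only `df.dtypes`, a list of (column_name, type_string) tuples.
    """
    return [f"{name}: {dtype}" for name, dtype in df.dtypes]


# Hypothetical usage with the DataFrame from this tutorial:
# for line in summarize_columns(df):
#     print(line)
# df.printSchema() prints similar information as a tree.
```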
As we see above, the column names are _c0, _c1 and _c2, which is not correct: by default Spark treats the first row as data and assigns generic names. Let's fix that using the header=True option.
df = spark.read.csv('data/sample_data.csv', header=True)
df.show(2)
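With only `header=True`, every column is still read as a string. As a hedged sketch, the reader's `inferSchema` option asks Spark to guess column types, at the cost of an extra pass over the file. The wrapper `read_csv_with_header` is a hypothetical helper, and `spark` is assumed to be an existing SparkSession like the one created earlier.

```python
def read_csv_with_header(spark, path, infer_schema=True):
    """Read a CSV whose first row holds the column names.

    `spark` is an existing SparkSession. When `infer_schema` is True,
    Spark scans the data to guess column types instead of reading
    every column as a string.
    """
    return spark.read.csv(path, header=True, inferSchema=infer_schema)


# Hypothetical usage:
# df = read_csv_with_header(spark, "data/sample_data.csv")
# df.printSchema()
```

For large files, passing an explicit `schema` to `spark.read.csv` avoids the extra scan that inference needs.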
OK, the headers are fixed now, but the first column in the Spark DataFrame is still named _c0, presumably because its header cell in the file is empty. That column can be renamed using withColumnRenamed.
df = df.withColumnRenamed('_c0','sno')
df.show(2)
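Note that `withColumnRenamed` is a no-op when the source column does not exist, so a typo in the old name goes unnoticed. A minimal sketch of a stricter rename follows; `rename_strict` is a hypothetical helper, not part of PySpark.

```python
def rename_strict(df, old, new):
    """Rename column `old` to `new`, raising if `old` is absent."""
    if old not in df.columns:
        raise ValueError(f"column {old!r} not found; available: {df.columns}")
    return df.withColumnRenamed(old, new)


# Hypothetical usage:
# df = rename_strict(df, "_c0", "sno")
```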