In this notebook, we will go through PySpark distinct(). For this exercise, I will be using the following data from Kaggle...
https://www.kaggle.com/code/kirichenko17roman/recommender-systems/data
If you don't have PySpark installed, see our guide on how to install PySpark on Linux first.
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Purchase") \
    .config('spark.ui.showConsoleProgress', False) \
    .getOrCreate()
Let us look at the data first.
df = spark.read.csv(
    "/home/notebooks/kz.csv",
    header=True, sep=",")
# show the first 3 rows of our DataFrame
df.show(3)
df.columns
This is transaction data.
Let us check how many rows are in our data.
df.count()
To count the distinct rows, we can use the distinct() method on the PySpark DataFrame.
df.distinct().count()
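As an aside, distinct() compares entire rows. If we only care about duplicates in a subset of columns, PySpark also provides dropDuplicates(), which accepts a list of column names; a minimal sketch:

# keep one row per brand; for counting, this is equivalent to select('brand').distinct()
df.dropDuplicates(['brand']).count()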
from pyspark.sql.functions import countDistinct, col
countDistinct can be passed to the PySpark agg() function. In the snippet below, we count the number of unique brands.
df.agg(countDistinct('brand').alias('cnt')).collect()[0].cnt
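On very large data, an exact distinct count can be expensive. If an approximation is acceptable, pyspark.sql.functions also offers approx_count_distinct(); a minimal sketch:

from pyspark.sql.functions import approx_count_distinct
# HyperLogLog-based estimate; rsd sets the maximum allowed relative error
df.agg(approx_count_distinct('brand', rsd=0.05).alias('approx_cnt')).collect()[0].approx_cnt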
We can apply the above command to multiple columns, as shown below.
items = df.agg(*(countDistinct(col(c)).alias(c) for c in ['category_code','brand'])).collect()[0]
print('category_code\tbrand\n')
print('%s\t\t%s\n' % (items.category_code, items.brand))
We can also use groupby, agg and countDistinct together. Let us say we want to calculate the average price of each brand and also find out how many categories there are for each brand.
from pyspark.sql import functions as F

df.groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
Looks like there are a lot of rows in the data with no price. Let us re-run the above command after dropping the null rows.
df.dropna().groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
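If we want cleaner column names and, say, the brands with the most categories first, we can alias the aggregates and sort the result; a minimal sketch:

# alias the aggregate columns, then order by the distinct-category count
df.dropna().groupby('brand') \
    .agg(F.avg('price').alias('avg_price'),
         F.countDistinct('category_code').alias('n_categories')) \
    .orderBy(F.desc('n_categories')) \
    .show(5)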
We can also perform distinct on specific columns by using the select() method.
df.select('brand').distinct().count()
df.select('category_code').distinct().count()
We can repeat the above command on multiple columns.
df.select('category_code','brand').distinct().count()
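For completeness, the same multi-column distinct count can also be expressed in actual Spark SQL through a temporary view; a minimal sketch (the view name purchases is just an illustration):

# register the DataFrame as a temp view and count distinct column pairs in SQL
df.createOrReplaceTempView('purchases')
spark.sql(
    "SELECT COUNT(DISTINCT category_code, brand) AS cnt FROM purchases"
).show()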