Install PySpark With MongoDB On Linux

We will go through the following topics in this tutorial.

  • Install Java
  • Install Spark
  • Install MongoDB
  • Install PySpark
  • Install Mongo PySpark Connector
  • Connect PySpark to MongoDB
  • Conclusion

Install Java

Check if you have Java installed by running the following command in your shell...

java --version

If you don't have Java installed, run the following commands on Ubuntu. If you are on CentOS, replace apt with yum.

sudo apt update
sudo apt install default-jre -y
sudo apt install default-jdk -y

Now try the java command again and you should see the version of Java you just installed.

java --version
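
Some Spark launch scripts look for JAVA_HOME. If it isn't set in your environment, you can derive it from the java binary; this is a sketch assuming a standard Ubuntu install, so adjust the path for your distro.

# assumes java resolves to the default-jdk installed above
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))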

Install Spark

You need to have curl installed for the following command.

sudo apt install curl -y

Now run the following curl command to download Spark...

curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

Run the following commands to install Spark...

sudo tar -xvf spark-3.2.0-bin-hadoop3.2.tgz
sudo mkdir /opt/spark
sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark
sudo chmod -R 777 /opt/spark

Now open ~/.bashrc or ~/.zshrc, depending on which shell you use, and add the following export commands.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save the changes, source the ~/.bashrc file (or your ~/.zshrc) and start the Spark master.

source ~/.bashrc
start-master.sh

You should see output like the following.

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ns102471.out

Open that log file and go to the end of it; you should see a message like the following...

22/04/04 04:22:32 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://<your_ip_address>:8080

Apache Spark has started successfully and is listening on port 8080. Make sure port 8080 is open on your machine, then open the above http address in your browser.
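
If your server runs a firewall, you may need to allow the port first. For example, with ufw (assuming ufw is the firewall on your machine):

sudo ufw allow 8080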

You can stop the Spark master with the following command.

stop-master.sh

You should see the following output...

stopping org.apache.spark.deploy.master.Master
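
Note that start-master.sh only brings up the master process. If you also want a worker on the same machine so that jobs can actually execute, you can attach one to the master; this is a sketch assuming the master is running on localhost with the default port 7077.

start-worker.sh spark://localhost:7077

You can stop it later with stop-worker.sh.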

Install MongoDB

Let us first install the necessary dependencies...

sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common

Run the following commands to import the MongoDB signing key and add the repository...

wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
sudo add-apt-repository 'deb [arch=amd64] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/4.4 multiverse'

Note - the above commands enable the repository for MongoDB version 4.4. If you want to install a different version, replace the version number above.

Let us update the package index and install MongoDB now...

sudo apt update
sudo apt install mongodb-org

Run the following command to start MongoDB now and enable it to start automatically every time the system boots up...

sudo systemctl enable --now mongod
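
You can verify that the service came up with the following command.

sudo systemctl status mongod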

Run the following command to see if mongo is working fine. The exact version in your output may differ from the one shown here...

mongo --version
MongoDB shell version v5.0.6
Build Info: {
    "version": "5.0.6",
    "gitVersion": "212a8dbb47f07427dae194a9c75baec1d81d9259",
    "openSSLVersion": "OpenSSL 1.1.1 11 Sep 2018",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "ubuntu1804",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}
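
Before connecting PySpark, it helps to have something to query. Here is a minimal sketch that seeds a test collection from the mongo shell; testdb, testcollection and the document fields are hypothetical names, so pick your own.

mongo
> use testdb
> db.testcollection.insertOne({ "name": "alice", "age": 30 })
> exit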

Install PySpark

Make sure you have a recent version of Python 3 installed.

python --version
Python 3.9.7

Run the following command to install PySpark...

pip install pyspark
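
You can confirm the installation by printing the PySpark version from Python.

python -c "import pyspark; print(pyspark.__version__)"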

Install Mongo PySpark Connector

Finally, we are ready to install the MongoDB Spark Connector.

Go to the following link and find the appropriate version of the mongo-spark-connector JAR file to download.

https://spark-packages.org/package/mongodb/mongo-spark

Run the following commands to download the JAR into Spark's jars directory...

cd /opt/spark/jars
wget https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/3.0.1/mongo-spark-connector_2.12-3.0.1.jar
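
Alternatively, instead of copying the JAR by hand, you can have Spark fetch the connector from Maven at launch time with the --packages flag.

pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1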

We are all set now to connect to MongoDB using PySpark.

Connect PySpark to MongoDB

Replace <user_name>, <password>, <db_name> and <collection> with your own values in the commands below.

from pyspark.sql import SparkSession

# Build the session; the MongoDB connector picks up the input URI from the session config
spark = SparkSession.builder \
    .master("local") \
    .appName("myfirstapp") \
    .config("spark.mongodb.input.uri", "mongodb://<user_name>:<password>@localhost:27017/<db_name>.<collection>") \
    .getOrCreate()

# Load the collection into a PySpark DataFrame via the connector
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

If you didn't receive any Java errors, then you are good to go.
At the end of the above commands you will have df, a PySpark DataFrame backed by your MongoDB collection.
We can print the first document using the following command...

df.first()
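
You can also inspect the inferred schema and preview a few documents.

df.printSchema()
df.show(5)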

Conclusion

In this tutorial, we've shown you how to install PySpark and use it with MongoDB.