Spark on Hadoop integration with Jupyter

For several years, Jupyter notebook has established itself as the notebook solution in the Python universe. Historically, Jupyter is the tool of choice for data scientists who mainly develop in Python. Over the years, Jupyter has evolved and now has a wide range of features thanks to its plugins. Also, one of the main advantages of Jupyter is its ease of deployment.

More and more Spark developers favor Python over Scala to develop their various jobs for speed of development.

In this article, we will see together how to connect a Jupyter server to a Spark cluster running on Hadoop Yarn secured with Kerberos.

How to install Jupyter?

We cover two methods to connect Jupyter to a Spark cluster:

  1. Set up a script to launch a Jupyter instance that will have a Python Spark interpreter.
  2. Connect Jupyter notebook to a Spark cluster via the Sparkmagic extension.

Method 1: Create a startup script

Prerequisites:

  • Have access to a Spark cluster machine, usually a master node or a edge node;
  • Having an environment (Conda, Mamba, virtualenv, ..) with the ‘jupyter’ package. Example with Conda: conda create -n pysparktest python=3.7 jupyter.

Create a script in /home directory and insert the following code by modifying the paths so that they correspond to your environment:

#! /bin/bash


export PYSPARK_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/python


export PYSPARK_DRIVER_PYTHON=/home/adaltas/.conda/envs/pysparktest/bin/ipython3


export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=10.10.20.11--port=8888"

pyspark \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=false \
  --driver-cores 2 --driver-memory 11136m 
Read more