There is little question that big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the marketing hype, these technologies are having a significant influence on many aspects of our modern lives. Given their popularity and potential benefits, commercial enterprises, academic institutions, and the public sector are rushing to develop hardware and software solutions that lower the barriers to entry and increase the productivity of data scientists, data engineers, and ML practitioners.
(courtesy Google Trends and Plotly)
Many open-source software projects are also lowering the barriers to entry into these technologies. An excellent example of one such open-source project working on this challenge is Project Jupyter. Similar to Apache Zeppelin and Netflix's newly open-sourced Polynote, Jupyter Notebooks enable data-driven, interactive, and collaborative data analytics.
This post will demonstrate the creation of a containerized data analytics environment using Jupyter Docker Stacks. The particular environment will be suited for learning and developing applications for Apache Spark using the Python, Scala, and R programming languages. We will focus on Python and Spark, using PySpark.
Featured Technologies
The following technologies are featured prominently in this post.
Jupyter Notebooks
According to Project Jupyter, the Jupyter Notebook, formerly known as the IPython Notebook, is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleansing and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. The word, Jupyter, is a loose acronym for Julia, Python, and R, but today, Jupyter supports many programming languages.
Interest in Jupyter Notebooks has grown dramatically over the last 3–5 years, fueled in part by the major Cloud providers, AWS, Google Cloud, and Azure. Amazon Sagemaker, Amazon EMR (Elastic MapReduce), Google Cloud Dataproc, Google Colab (Collaboratory), and Microsoft Azure Notebooks all have direct integrations with Jupyter notebooks for big data analytics and machine learning.
(courtesy Google Trends and Plotly)
Jupyter Docker Stacks
To enable quick and easy access to Jupyter Notebooks, Project Jupyter has created Jupyter Docker Stacks. The stacks are ready-to-run Docker images containing Jupyter applications, along with accompanying technologies. Currently, the Jupyter Docker Stacks focus on a variety of specializations, including the r-notebook, scipy-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, and the subject of this post, the all-spark-notebook. The stacks include a wide variety of well-known packages to extend their functionality, such as scikit-learn, pandas, Matplotlib, Bokeh, NumPy, and Facets.
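Although this post deploys the stack to Docker Swarm, any of these images can also be run on its own for a quick look. The following one-off command is a minimal, hedged example; the port mapping shown is simply the Jupyter default, and the image tag may change over time.

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook:latest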
Apache Spark
According to Apache, Spark is a unified analytics engine for large-scale data processing. Starting as a research project at the UC Berkeley AMPLab in 2009, Spark was open-sourced in early 2010 and moved to the Apache Software Foundation in 2013. Reviewing the postings on any major career site will confirm that Spark is widely used by well-known modern enterprises, such as Netflix, Adobe, Capital One, Lockheed Martin, JetBlue Airways, Visa, and Databricks. At the time of this post, LinkedIn, alone, had approximately 3,500 listings for jobs that reference the use of Apache Spark, just in the United States.
With speeds up to 100 times faster than Hadoop, Apache Spark achieves high performance for static, batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine. Spark’s polyglot programming model allows users to write applications quickly in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). You can run Spark using its standalone cluster mode, Apache Hadoop YARN, Mesos, or Kubernetes.
PySpark
The Spark Python API, PySpark, exposes the Spark programming model to Python. PySpark is built on top of Spark’s Java API and uses Py4J. According to Apache, Py4J, a bridge between Python and Java, enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine (JVM). Data is processed in Python and cached and shuffled in the JVM.
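As a minimal sketch of this model (not part of the project's source code), the example below creates a SparkSession from Python and performs a small DataFrame operation; the Python calls merely describe the work, which Spark executes inside the JVM via the Py4J gateway.

#!/usr/bin/python3
# minimal PySpark sketch, for illustration only

from pyspark.sql import SparkSession

# the SparkSession object is a thin Python wrapper; Py4J launches and
# communicates with the JVM-based Spark driver behind the scenes
spark = SparkSession.builder \
    .appName('pyspark-hello-world') \
    .getOrCreate()

# the DataFrame is defined in Python but stored and processed in the JVM
df = spark.createDataFrame(
    [('Bread', 3), ('Coffee', 5), ('Muffin', 1)],
    ['item', 'quantity']
)
df.orderBy('quantity', ascending=False).show()

spark.stop()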
Docker
According to Docker, their technology gives developers and IT the freedom to build, manage, and secure business-critical applications without the fear of technology or infrastructure lock-in. For this post, I am using the current stable version of Docker Desktop Community for macOS, as of March 2020.
Docker Swarm
Current versions of Docker include both a Kubernetes and Swarm orchestrator for deploying and managing containers. We will choose Swarm for this demonstration. According to Docker, the cluster management and orchestration features embedded in the Docker Engine are built using swarmkit. Swarmkit is a separate project that implements Docker’s orchestration layer and is used directly within Docker.
PostgreSQL
PostgreSQL is a powerful, open-source, object-relational database system. According to their website, PostgreSQL comes with many features aimed to help developers build applications, administrators to protect data integrity and build fault-tolerant environments, and help manage data no matter how big or small the dataset.
Demonstration
In this demonstration, we will explore the capabilities of the Spark Jupyter Docker Stack to provide an effective data analytics development environment. We will explore a few everyday uses, including executing Python scripts, submitting PySpark jobs, working with Jupyter notebooks, and reading and writing data to and from different file formats and a database. We will be using the latest jupyter/all-spark-notebook Docker image. This image includes Python, R, and Scala support for Apache Spark, using Apache Toree.
Architecture
As shown below, we will deploy a Docker stack to a single-node Docker swarm. The stack consists of a Jupyter All-Spark-Notebook container, a PostgreSQL (Alpine Linux version 12) container, and an Adminer container. The Docker stack will have two local directories bind-mounted into the containers. Files from our GitHub project will be shared with the Jupyter application container through a bind-mounted directory. Our PostgreSQL data will also be persisted through a bind-mounted directory. This allows us to persist data external to the ephemeral containers. If the containers are restarted or recreated, the data is preserved locally.
Source Code
All source code for this post can be found on GitHub. Use the following command to clone the project. Note this post uses the v2 branch.
git clone \
  --branch v2 --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/pyspark-setup-demo.git
Source code samples are displayed as GitHub Gists, which may not display correctly on some mobile and social media browsers.
Deploy Docker Stack
To start, create the $HOME/data/postgres directory to store PostgreSQL data files.
mkdir -p ~/data/postgres
This directory will be bind-mounted into the PostgreSQL container on line 41 of the stack.yml file, $HOME/data/postgres:/var/lib/postgresql/data. The HOME environment variable assumes you are working on Linux or macOS and is equivalent to HOMEPATH on Windows.
The Jupyter container's working directory is set on line 15 of the stack.yml file, working_dir: /home/$USER/work. The local bind-mounted working directory is $PWD/work. This path is bind-mounted to the working directory in the Jupyter container, on line 29 of the Docker stack file, $PWD/work:/home/$USER/work. The PWD environment variable assumes you are working on Linux or macOS (CD on Windows).
By default, the user within the Jupyter container is jovyan. We will override that user with our own local host's user account, as shown on line 21 of the Docker stack file, NB_USER: $USER. We will use the host's USER environment variable value (equivalent to USERNAME on Windows). There are additional options for configuring the Jupyter container. Several of those options are used on lines 17–22 of the Docker stack file (gist).
# docker stack deploy -c stack.yml jupyter
# optional pgadmin container

version: "3.7"
networks:
  demo-net:
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888/tcp"
      - "4040:4040/tcp"
    networks:
      - demo-net
    working_dir: /home/$USER/work
    environment:
      CHOWN_HOME: "yes"
      GRANT_SUDO: "yes"
      NB_UID: 1000
      NB_GID: 100
      NB_USER: $USER
      NB_GROUP: staff
    user: root
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - $PWD/work:/home/$USER/work
  postgres:
    image: postgres:12-alpine
    environment:
      POSTGRES_USERNAME: postgres
      POSTGRES_PASSWORD: postgres1234
      POSTGRES_DB: bakery
    ports:
      - "5432:5432/tcp"
    networks:
      - demo-net
    volumes:
      - $HOME/data/postgres:/var/lib/postgresql/data
    deploy:
      restart_policy:
        condition: on-failure
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080/tcp"
    networks:
      - demo-net
    deploy:
      restart_policy:
        condition: on-failure
#  pgadmin:
#    image: dpage/pgadmin4:latest
#    environment:
#      PGADMIN_DEFAULT_EMAIL: user@domain.com
#      PGADMIN_DEFAULT_PASSWORD: 5up3rS3cr3t!
#    ports:
#      - "8180:80/tcp"
#    networks:
#      - demo-net
#    deploy:
#      restart_policy:
#        condition: on-failure
The jupyter/all-spark-notebook Docker image is large, approximately 5 GB. Depending on your Internet connection, if this is the first time you have pulled this image, the stack may take several minutes to enter a running state. Although not required, I usually pull new Docker images in advance.
docker pull jupyter/all-spark-notebook:latest
docker pull postgres:12-alpine
docker pull adminer:latest
Assuming you have a recent version of Docker installed on your local development machine and running in swarm mode, standing up the stack is as easy as running the following docker command from the root directory of the project.
docker stack deploy -c stack.yml jupyter
The Docker stack consists of a new overlay network, jupyter_demo-net, and three containers. To confirm the stack deployed successfully, run the following docker command.
docker stack ps jupyter --no-trunc
To access the Jupyter Notebook application, you need to obtain the Jupyter URL and access token. The Jupyter URL and the access token are output to the Jupyter container log, which can be accessed with the following command.
docker logs $(docker ps | grep jupyter_spark | awk '{print $NF}')
You should observe log output similar to the following. Retrieve the complete URL, for example, http://127.0.0.1:8888/?token=f78cbe..., to access the Jupyter web-based user interface.
From the Jupyter dashboard landing page, you should see all the files in the project's work/ directory. Note the types of files you can create from the dashboard, including Python 3, R, and Scala (using Apache Toree or spylon-kernel) notebooks, and text files. You can also open a Jupyter terminal or create a new folder from the drop-down menu. At the time of this post (March 2020), the latest jupyter/all-spark-notebook Docker image runs Spark 2.4.5, Scala 2.11.12, Python 3.7.6, and OpenJDK 64-Bit Server VM, Java 1.8.0 Update 242.
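If you want to confirm the versions running in your own container, which may differ from those listed above as the image is updated, a few quick checks can be run from a Jupyter terminal.

python3 --version
java -version
$SPARK_HOME/bin/spark-submit --version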
Bootstrap Environment
Included in the project is a bootstrap script, bootstrap_jupyter.sh. The script will install the required Python packages using pip, the Python package installer, and a requirements.txt file. The bootstrap script also installs the latest PostgreSQL driver JAR, configures Apache Log4j to reduce log verbosity when submitting Spark jobs, and installs htop. Although these tasks could also be done from a Jupyter terminal or from within a Jupyter notebook, using a bootstrap script ensures you will have a consistent work environment every time you spin up the Jupyter Docker stack. Add or remove items from the bootstrap script as necessary.
#!/bin/bash

set -ex

# update/upgrade and install htop
sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install -y htop

# install required python packages
python3 -m pip install --user --upgrade pip
python3 -m pip install -r requirements.txt --upgrade

# download latest postgres driver jar
POSTGRES_JAR="postgresql-42.2.17.jar"
if [ -f "$POSTGRES_JAR" ]; then
    echo "$POSTGRES_JAR exists"
else
    wget -nv "https://jdbc.postgresql.org/download/${POSTGRES_JAR}"
fi

# spark-submit logging level from INFO to WARN
sudo cp log4j.properties /usr/local/spark/conf/log4j.properties
That’s it, our new Jupyter environment is ready to start exploring.
Running Python Scripts
One of the simplest tasks we could perform in our new Jupyter environment is running Python scripts. Instead of worrying about installing and maintaining the correct versions of Python and multiple Python packages on your own development machine, we can run Python scripts from within the Jupyter container. At the time of this post update, the latest jupyter/all-spark-notebook Docker image runs Python 3.7.3 and Conda 4.7.12. Let's start with a simple Python script, 01_simple_script.py.
#!/usr/bin/python3

import random

technologies = [
    'PySpark', 'Python', 'Spark', 'Scala', 'Java', 'Project Jupyter', 'R'
]
print("Technologies: %s\n" % technologies)

technologies.sort()
print("Sorted: %s\n" % technologies)

print("I'm interested in learning about %s." % random.choice(technologies))
From a Jupyter terminal window, use the following command to run the script.
python3 01_simple_script.py
You should observe the following output.
Kaggle Datasets
To explore more features of Jupyter and PySpark, we will use a publicly available dataset from Kaggle. Kaggle is an excellent source of open datasets for big data and ML projects. Their tagline is 'Kaggle is the place to do data science projects'. For this demonstration, we will use the Transactions from a bakery dataset from Kaggle. The dataset is available as a single CSV-format file. A copy is also included in the project.
The ‘Transactions from a bakery’ dataset contains 21,294 rows with 4 columns of data. Although certainly not big data, the dataset is large enough to test out the Spark Jupyter Docker Stack functionality. The data consists of 9,531 customer transactions for 21,294 bakery items between 2016-10-30 and 2017-04-09 (gist).
Date | Time | Transaction | Item
---|---|---|---
2016-10-30 | 09:58:11 | 1 | Bread
2016-10-30 | 10:05:34 | 2 | Scandinavian
2016-10-30 | 10:05:34 | 2 | Scandinavian
2016-10-30 | 10:07:57 | 3 | Hot chocolate
2016-10-30 | 10:07:57 | 3 | Jam
2016-10-30 | 10:07:57 | 3 | Cookies
2016-10-30 | 10:08:41 | 4 | Muffin
2016-10-30 | 10:13:03 | 5 | Coffee
2016-10-30 | 10:13:03 | 5 | Pastry
2016-10-30 | 10:13:03 | 5 | Bread
Submitting Spark Jobs
We are not limited to Jupyter notebooks to interact with Spark. We can also submit scripts directly to Spark from the Jupyter terminal. This is typically how Spark is used in Production for performing analysis on large datasets, often on a regular schedule, using tools such as Apache Airflow. With Spark, you load data from one or more data sources. After performing operations and transformations on the data, the data is persisted to a datastore, such as a file or a database, or conveyed to another system for further processing.
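As a rough, hypothetical sketch of that scheduled pattern (not part of this project), an Apache Airflow DAG could wrap the same kind of spark-submit call shown later in this post; the DAG ID, schedule, and script path below are illustrative only.

# hypothetical Airflow 2.x DAG sketch; dag_id, schedule, and paths are illustrative
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='bakery_etl_daily',
    start_date=datetime(2020, 3, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    # submit the PySpark ETL job once per day
    submit_job = BashOperator(
        task_id='submit_pyspark_job',
        bash_command='$SPARK_HOME/bin/spark-submit /path/to/etl_job.py',
    )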
The project includes a simple Python PySpark ETL script, 02_pyspark_job.py. The ETL script loads the original Kaggle Bakery dataset from the CSV file into memory, into a Spark DataFrame. The script then performs a simple Spark SQL query, calculating the total quantity of each type of bakery item sold, sorted in descending order. Finally, the script writes the results of the query to a new CSV file, output/items-sold.csv.
#!/usr/bin/python3

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('spark-demo') \
    .getOrCreate()

df_bakery = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .option('delimiter', ',') \
    .option('inferSchema', 'true') \
    .load('BreadBasket_DMS.csv')

df_sorted = df_bakery.cube('item').count() \
    .filter('item NOT LIKE \'NONE\'') \
    .filter('item NOT LIKE \'Adjustment\'') \
    .orderBy(['count', 'item'], ascending=[False, True])

df_sorted.show(10, False)

df_sorted.coalesce(1) \
    .write.format('csv') \
    .option('header', 'true') \
    .save('output/items-sold.csv', mode='overwrite')
Run the script directly from a Jupyter terminal using the following command.
python3 02_pyspark_job.py
An example of the output of the Spark job is shown below.
Typically, you would submit the Spark job using the spark-submit command. Use a Jupyter terminal to run the following command.
$SPARK_HOME/bin/spark-submit 02_pyspark_job.py
Below, we see the output from the spark-submit command. Printing the results in the output is merely for the purposes of the demo. Typically, Spark jobs are submitted non-interactively, and the results are persisted directly to a datastore or conveyed to another system.
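When a job depends on packages or drivers that are not already on the classpath, for example, the PostgreSQL JDBC driver downloaded by the bootstrap script, spark-submit accepts additional options. The following is a hedged example; the master URL and the script name, my_etl_job.py, are illustrative and not part of this project.

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --driver-class-path postgresql-42.2.17.jar \
  --jars postgresql-42.2.17.jar \
  my_etl_job.py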
Using the following commands, we can view the resulting CSV file, created by the Spark job.
ls -alh output/items-sold.csv/
head -5 output/items-sold.csv/*.csv
An example of the files created by the Spark job is shown below. We should have discovered that coffee is the most commonly sold bakery item with 5,471 sales, followed by bread with 3,325 sales.
Interacting with Databases
To demonstrate the flexibility of Jupyter to work with databases, PostgreSQL is part of the Docker stack. We can read and write data from the Jupyter container to the PostgreSQL instance, running in a separate container. To begin, we will run a Python script that creates our database schema and loads some test data into a new database table. To do so, we will use psycopg2, the PostgreSQL database adapter package for Python, which we previously installed into our Jupyter container using the bootstrap script. The Python script below, 03_load_sql.py, will execute a set of SQL statements contained in a SQL file, bakery.sql, against the PostgreSQL container instance.
#!/usr/bin/python3

import psycopg2

# connect to database
connect_str = 'host=postgres port=5432 dbname=bakery user=postgres password=postgres1234'
conn = psycopg2.connect(connect_str)
conn.autocommit = True
cursor = conn.cursor()

# execute sql script
sql_file = open('bakery.sql', 'r')
sqlFile = sql_file.read()
sql_file.close()
sqlCommands = sqlFile.split(';')

for command in sqlCommands:
    print(command)
    if command.strip() != '':
        cursor.execute(command)

# import data from csv file
with open('BreadBasket_DMS.csv', 'r') as f:
    next(f)  # Skip the header row.
    cursor.copy_from(
        f,
        'transactions',
        sep=',',
        columns=('date', 'time', 'transaction', 'item')
    )
conn.commit()

# confirm by selecting record
command = 'SELECT COUNT(*) FROM public.transactions;'
cursor.execute(command)
recs = cursor.fetchall()
print('Row count: %d' % recs[0])
The SQL file, bakery.sql, is shown below.
DROP TABLE IF EXISTS "transactions";
DROP SEQUENCE IF EXISTS transactions_id_seq;
CREATE SEQUENCE transactions_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1;

CREATE TABLE "public"."transactions"
(
    "id"          integer DEFAULT nextval('transactions_id_seq') NOT NULL,
    "date"        character varying(10) NOT NULL,
    "time"        character varying(8) NOT NULL,
    "transaction" integer NOT NULL,
    "item"        character varying(50) NOT NULL
) WITH (oids = false);
To execute the script, run the following command.
python3 03_load_sql.py
This should result in the following output, if successful.
Adminer
To confirm the SQL script’s success, use Adminer. Adminer (formerly phpMinAdmin) is a full-featured database management tool written in PHP. Adminer natively recognizes PostgreSQL, MySQL, SQLite, and MongoDB, among other database engines. The current version is 4.7.6 (March 2020).
Adminer should be available on localhost port 8080. The password credentials, shown below, are located in the stack.yml file. The server name, postgres, is the name of the PostgreSQL Docker container. This is the domain name the Jupyter container will use to communicate with the PostgreSQL container.
Connecting to the new bakery database with Adminer, we should see the transactions table.
The table should contain 21,293 rows, each with 5 columns of data.
pgAdmin
Another excellent choice for interacting with PostgreSQL, in particular, is pgAdmin 4. It is my favorite tool for the administration of PostgreSQL. Although limited to PostgreSQL, the user interface and administrative capabilities of pgAdmin are superior to Adminer's, in my opinion. For brevity, I chose not to include pgAdmin in this post. The Docker stack also contains a pgAdmin container, which has been commented out. To use pgAdmin, just uncomment the container and re-run the Docker stack deploy command. pgAdmin should then be available on localhost port 8180. The pgAdmin login credentials are in the Docker stack file.
Developing Jupyter Notebooks
The real power of the Jupyter Docker Stacks containers is Jupyter Notebooks. According to the Jupyter Project, the notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process, including developing, documenting, and executing code, as well as communicating the results. Notebook documents contain the inputs and outputs of an interactive session as well as additional text that accompanies the code but is not meant for execution.
To explore the capabilities of Jupyter notebooks, the project includes two simple Jupyter notebooks. The first notebook, 04_notebook.ipynb, demonstrates typical PySpark functions, such as loading data from a CSV file and from the PostgreSQL database, performing basic data analysis with Spark SQL, including the use of PySpark user-defined functions (UDF), graphing the data using BokehJS, and finally, saving data back to the database, as well as to the fast and efficient Apache Parquet file format. Below we see several notebook cells demonstrating these features.
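A condensed sketch of two of those operations, reading the transactions table over JDBC and writing aggregated results to Parquet, follows. It assumes an existing SparkSession named spark, the PostgreSQL driver JAR on the classpath, and the connection values used elsewhere in this post; it is not an excerpt of the notebook itself.

# hedged sketch of a JDBC read and Parquet write; not an excerpt of 04_notebook.ipynb
properties = {
    'user': 'postgres',
    'password': 'postgres1234',
    'driver': 'org.postgresql.Driver'
}

# read the transactions table from the PostgreSQL container (hostname 'postgres')
df_bakery = spark.read.jdbc(
    url='jdbc:postgresql://postgres:5432/bakery',
    table='transactions',
    properties=properties
)

# basic Spark SQL analysis: total quantity sold per bakery item
df_bakery.createOrReplaceTempView('transactions')
df_totals = spark.sql("""
    SELECT item, COUNT(*) AS quantity
    FROM transactions
    WHERE item NOT IN ('NONE', 'Adjustment')
    GROUP BY item
    ORDER BY quantity DESC
""")

# persist the results to the fast, efficient Apache Parquet format
df_totals.write.parquet('output/items-sold.parquet', mode='overwrite')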
IDE Integration
Recall, the working directory, containing the GitHub source code for the project, is bind-mounted to the Jupyter container. Therefore, you can also edit any of the files, including notebooks, in your favorite IDE, such as JetBrains PyCharm and Microsoft Visual Studio Code. PyCharm has built-in language support for Jupyter Notebooks, as shown below.
As does Visual Studio Code using the Python extension.
Using Additional Packages
As mentioned in the Introduction, the Jupyter Docker Stacks come ready-to-run, with a wide variety of Python packages to extend their functionality. To demonstrate the use of these packages, the project contains a second Jupyter notebook document, 05_notebook.ipynb. This notebook uses SciPy, the well-known Python package for mathematics, science, and engineering, NumPy, the well-known Python package for scientific computing, and the Plotly Python Graphing Library. While NumPy and SciPy are included in the Jupyter Docker image, the bootstrap script uses pip to install the required Plotly packages. Similar to Bokeh, shown in the previous notebook, we can use these libraries to create richly interactive data visualizations.
Plotly
To use Plotly from within the notebook, you will first need to sign up for a free account and obtain a username and API key. To ensure we do not accidentally save sensitive Plotly credentials in the notebook, we are using python-dotenv. This Python package reads key/value pairs from a .env file, making them available as environment variables to our notebook. Modify and run the following two commands from a Jupyter terminal to create the .env file and set your Plotly username and API key credentials. Note that the .env file is listed in the .gitignore file and will not be committed back to git, which could otherwise compromise the credentials.
echo "PLOTLY_USERNAME=your-username" >> .env echo "PLOTLY_API_KEY=your-api-key" >> .env
The notebook expects to find the two environment variables, which it uses to authenticate with Plotly.
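The sketch below shows roughly how a notebook cell might read those values with python-dotenv and register them with Plotly's Chart Studio client; the exact cell in 05_notebook.ipynb may differ.

# hedged sketch; the actual notebook cell may differ
import os

import chart_studio
from dotenv import load_dotenv

# read PLOTLY_USERNAME and PLOTLY_API_KEY from the .env file into environment variables
load_dotenv()

chart_studio.tools.set_credentials_file(
    username=os.environ.get('PLOTLY_USERNAME'),
    api_key=os.environ.get('PLOTLY_API_KEY')
)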
Shown below, we use Plotly to construct a bar chart of daily bakery items sold. The chart uses SciPy and NumPy to construct a linear fit (regression) and plots a line of best fit for the bakery data, overlaying the vertical bars. The chart also uses SciPy's Savitzky-Golay filter to plot a second line, illustrating a smoothing of our bakery data.
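The fitting itself relies on standard NumPy and SciPy calls. Below is a simplified, hedged sketch of the two overlays, using a small, made-up series of daily totals; the notebook's variable names and data differ.

# hedged sketch of the linear fit and Savitzky-Golay smoothing; data is made up
import numpy as np
from scipy import stats
from scipy.signal import savgol_filter

# hypothetical daily bakery totals (x = day index, y = items sold per day)
days = np.arange(1, 11)
totals = np.array([110, 125, 160, 150, 180, 210, 190, 230, 240, 260])

# line of best fit (simple linear regression)
slope, intercept, r_value, p_value, std_err = stats.linregress(days, totals)
best_fit = intercept + slope * days

# Savitzky-Golay filter smooths the series (window length must be odd)
smoothed = savgol_filter(totals, window_length=5, polyorder=2)

print('slope: %.2f, r-squared: %.3f' % (slope, r_value ** 2))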
Plotly also provides Chart Studio, an online chart maker. Plotly describes Chart Studio as the world's most modern enterprise data visualization solution. We can enhance, stylize, and share our data visualizations using the free version of Chart Studio Cloud.
Jupyter Notebook Viewer
Notebooks can also be viewed using nbviewer, an open-source project under Project Jupyter. Thanks to Rackspace hosting, the nbviewer instance is a free service.
Using nbviewer, below, we see the output of a cell within the 04_notebook.ipynb notebook. View this notebook, here, using nbviewer.
Monitoring Spark Jobs
The Jupyter Docker container exposes Spark’s monitoring and instrumentation web user interface. We can review each completed Spark Job in detail.
We can review details of each stage of the Spark job, including a visualization of the DAG (Directed Acyclic Graph), which Spark constructs as part of the job execution plan, using the DAG Scheduler.
We can also review the task composition and timing of each event occurring as part of the stages of the Spark job.
We can also use the Spark interface to review and confirm the runtime environment configuration, including versions of Java, Scala, and Spark, as well as packages available on the Java classpath.
Local Spark Performance
Running Spark on a single node within the Jupyter Docker container on your local development system is not a substitute for a true Spark cluster: a production-grade, multi-node Spark cluster running on bare metal or robust virtualized hardware and managed with Apache Hadoop YARN, Apache Mesos, or Kubernetes. In my opinion, you should only adjust your Docker resource limits to support an acceptable level of Spark performance for running small exploratory workloads. You will not realistically replace the need to process big data and execute jobs requiring complex calculations on a production-grade, multi-node Spark cluster.
We can use the following docker stats command to examine the container’s CPU and memory metrics.
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
Below, we see the stats from the Docker stack’s three containers showing little or no activity.
Compare those stats with the ones shown below, recorded while a notebook was reading and writing data, and executing Spark SQL queries. The CPU and memory output show spikes, but both appear to be within acceptable ranges.
Linux Process Monitors
Another option to examine container performance metrics is top, which is pre-installed in our Jupyter container. For example, execute the following top command from a Jupyter terminal, sorting processes by CPU usage.
top -o %CPU
We should observe the individual performance of each process running in the Jupyter container.
A step up from top is htop, an interactive process viewer for Unix. It was installed in the container by the bootstrap script. For example, we can execute the htop command from a Jupyter terminal, sorting processes by CPU % usage.
htop --sort-key PERCENT_CPU
With htop, observe the individual CPU activity. Here, the four CPUs at the top left of the htop window are the CPUs assigned to Docker. We get insight into the way Spark is using multiple CPUs, as well as other performance metrics, such as memory and swap.
Assuming your development machine is robust, it is easy to allocate and deallocate additional compute resources to Docker if required. Be careful not to allocate excessive resources to Docker, starving your host machine of available compute resources for other applications.
Notebook Extensions
There are many ways to extend the Jupyter Docker Stacks. A popular option is jupyter-contrib-nbextensions. According to their website, the jupyter-contrib-nbextensions package contains a collection of community-contributed unofficial extensions that add functionality to the Jupyter notebook. These extensions are mostly written in JavaScript and will be loaded locally in your browser. Installed notebook extensions can be enabled, either by using built-in Jupyter commands, or more conveniently by using the jupyter_nbextensions_configurator server extension.
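For example, once the alternate image described below is running, individual extensions can be listed and toggled from a Jupyter terminal using the built-in commands; the extension identifiers shown here are illustrative and can be confirmed with the list command.

jupyter nbextension list
jupyter nbextension enable toc2/main
jupyter nbextension enable codefolding/main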
The project contains an alternate Docker stack file, stack-nbext.yml. The stack uses an alternative Docker image, garystafford/all-spark-notebook-nbext:latest. The Dockerfile that builds it, shown below, uses the jupyter/all-spark-notebook:latest image as a base image. The alternate image adds the jupyter-contrib-nbextensions and jupyter_nbextensions_configurator packages. Since Jupyter would need to be restarted after nbextensions is deployed, it cannot be done from within a running jupyter/all-spark-notebook container.
FROM jupyter/all-spark-notebook:latest

USER root

RUN pip install jupyter_contrib_nbextensions \
    && jupyter contrib nbextension install --system \
    && pip install jupyter_nbextensions_configurator \
    && jupyter nbextensions_configurator enable --system \
    && pip install yapf # for code pretty

USER $NB_UID
Using this alternate stack, we see the sizable list of available extensions in our Jupyter container, below. Some of my favorite extensions include 'spellchecker', 'Codefolding', 'Code pretty', 'ExecutionTime', 'Table of Contents', and 'Toggle all line numbers'.
Below, we see five new extension icons that have been added to the menu bar of 04_notebook.ipynb. You can also observe the extensions have been applied to the notebook, including the table of contents, code-folding, execution time, and line numbering. The spellchecking and pretty code extensions were both also applied.
Conclusion
In this brief post, we have seen how easy it is to get started learning and performing data analytics using Jupyter notebooks, Python, Spark, and PySpark, all thanks to the Jupyter Docker Stacks. We could use this same stack to learn and perform machine learning using Scala and R. Extending the stack’s capabilities is as simple as swapping out this Jupyter image for another, with a different set of tools, as well as adding additional containers to the stack, such as MySQL, MongoDB, RabbitMQ, Apache Kafka, and Apache Cassandra.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients.
#1 by Richard Steele on December 27, 2019 - 11:03 am
First, Gary, excellent write up (as usual). I’m following along, replicating at work, and a couple of things to note:
1. First, I couldn’t get this to work on Windows, even running Cygwin. Issues with permission of the mounted file system, etc.
2. By default, I’m running an older version of Docker (1.13+), so I had to change the compose version to match. It doesn’t look like you’re using the newer features. Well, at least things started for me.
That’s it, for now. It’s running so I can play!
#2 by crotunno92 on August 15, 2021 - 2:07 pm
I am reaching out to see if you have any advice for a problem that I am having with persisting the postgresql server and the docker stack. Every time I log out of the computer the docker stack seems to go idle, and when I return to the notebook all the nbextension configurations are gone and so are the additional pip installs I entered in Jupiter terminal. Also, the pgadmin doesn’t seem to persist the bakery database as mentioned in your article. Is there a command you recommend or a concept you suggest I study? Perhaps this is something you’re familiar with and will offer feedback.