Getting Started with Apache Zeppelin on Amazon EMR, using AWS Glue, RDS, and S3: Part 1
Posted by Gary A. Stafford in AWS, Bash Scripting, Big Data, Build Automation, Cloud, DevOps, Python, Software Development on November 22, 2019
Introduction
There is little question that big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last 3–5 years. Behind the hype cycles and marketing buzz, these technologies are having a significant influence on many aspects of our modern lives. Due to their popularity, commercial enterprises, academic institutions, and the public sector have all rushed to develop hardware and software solutions that decrease the barrier to entry and increase the velocity of Data Scientists, Data Engineers, and ML practitioners.
Data Science: 5-Year Search Trend (courtesy Google Trends)
Machine Learning: 5-Year Search Trend (courtesy Google Trends)
Technologies
All three major cloud providers, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, have rapidly maturing big data analytics, data science, and AI and ML services. AWS, for example, introduced Amazon Elastic MapReduce (EMR) in 2009, primarily as an Apache Hadoop-based big data processing service. Since then, according to Amazon, EMR has evolved into a service that uses Apache Spark, Apache Hadoop, and several other leading open-source frameworks to quickly and cost-effectively process and analyze vast amounts of data. More recently, in late 2017, Amazon released SageMaker, a service that provides the ability to build, train, and deploy machine learning models quickly and securely.
Simultaneously, organizations are building solutions that integrate and enhance these cloud-based big data analytics, data science, AI, and ML services. One such example is Apache Zeppelin. Similar to the immensely popular Project Jupyter and Netflix's newly open-sourced Polynote, Apache Zeppelin is a web-based, polyglot, computational notebook. Zeppelin enables data-driven, interactive data analytics and document collaboration using a number of interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), Spark SQL, JDBC, Markdown, and Shell. Zeppelin is one of the core applications supported natively by Amazon EMR.
In the following two-part post, we will explore the use of Apache Zeppelin on EMR for data science and data analytics using a series of Zeppelin notebooks. The notebooks feature the use of AWS Glue, the fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. The notebooks also feature the use of Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Simple Storage Service (S3). Amazon S3 will serve as a Data Lake to store our unstructured data. Given Zeppelin's current choice of more than twenty different interpreters, we will use Python3 and Apache Spark, specifically Spark SQL and PySpark, for all notebooks.
We will build an economical single-node EMR cluster for data exploration, as well as a larger multi-node EMR cluster for analyzing large data sets. Amazon S3 will be used to store input and output data, while intermediate results are stored in the Hadoop Distributed File System (HDFS) on the EMR cluster. Amazon provides a good overview of EMR architecture. Below is a high-level architectural diagram of the infrastructure we will construct during Part 1 for this demonstration.
Notebook Features
Below is an overview of each Zeppelin notebook with a link to view it using Zepl’s free Notebook Explorer. Zepl was founded by the same engineers that developed Apache Zeppelin, including Moonsoo Lee, Zepl CTO and creator for Apache Zeppelin. Zepl’s enterprise collaboration platform, built on Apache Zeppelin, enables both Data Science and AI/ML teams to collaborate around data.
Notebook 1
The first notebook uses a small 21k row kaggle dataset, Transactions from a Bakery. The notebook demonstrates Zeppelin's integration capabilities with the Helium plugin system for adding new chart types, the use of Amazon S3 for data storage and retrieval, the use of Apache Parquet, a compressed and efficient columnar data storage format, and Zeppelin's storage integration with GitHub for notebook version control.
Notebook 2
The second notebook demonstrates the use of a single-node and multi-node Amazon EMR cluster for the exploration and analysis of public datasets ranging from approximately 100k rows up to 27 million rows, using Zeppelin. We will use the latest GroupLens MovieLens rating datasets to examine the performance characteristics of Zeppelin, using Spark, on single- versus multi-node EMR clusters for analyzing big data using a variety of Amazon EC2 Instance Types.
Notebook 3
The third notebook demonstrates Amazon EMR and Zeppelin’s integration capabilities with AWS Glue Data Catalog as an Apache Hive-compatible metastore for Spark SQL. We will create an Amazon S3-based Data Lake using the AWS Glue Data Catalog and a set of AWS Glue Crawlers.
Notebook 4
The fourth notebook demonstrates Zeppelin's ability to integrate with an external data source. In this case, we will interact with data in an Amazon RDS PostgreSQL relational database using three methods: the Psycopg 2 PostgreSQL adapter for Python, Spark's native JDBC capability, and Zeppelin's JDBC Interpreter. A minimal sketch of the first two methods appears below.
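To preview what Notebook 4 covers, here is a minimal, hypothetical sketch of the first two methods as they might appear in a Zeppelin PySpark paragraph. The endpoint, database name, and credentials are placeholders; the sketch also assumes the PostgreSQL driver JAR is on the Spark classpath, as configured later in this post.

import psycopg2

# method 1: query RDS PostgreSQL directly with the psycopg2 adapter
# (hostname, database, and credentials below are placeholders)
conn = psycopg2.connect(
    host="your-instance.abcd1234.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="ratings",
    user="masteruser",
    password="your-db-password",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM pg_catalog.pg_tables;")
    print(cur.fetchone())
conn.close()

# method 2: read the same database into a Spark DataFrame via JDBC
# (the 'spark' session is provided by Zeppelin's Spark interpreter)
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your-instance.abcd1234.us-east-1.rds.amazonaws.com:5432/ratings") \
    .option("dbtable", "pg_catalog.pg_tables") \
    .option("user", "masteruser") \
    .option("password", "your-db-password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.show(5)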
Demonstration
In Part 1 of the post, as a DataOps Engineer, we will create and configure the AWS resources required to demonstrate the use of Apache Zeppelin on EMR, using an AWS Glue Data Catalog, Amazon RDS PostgreSQL database, and an S3-based data lake. In Part 2 of this post, as a Data Scientist, we will explore Apache Zeppelin’s features and integration capabilities with a variety of AWS services using the Zeppelin notebooks.
Source Code
The demonstration’s source code is contained in two public GitHub repositories. The first repository, zeppelin-emr-demo, includes the four Zeppelin notebooks, organized according to the conventions of Zeppelin’s pluggable notebook storage mechanisms.
.
├── 2ERVVKTCG
│   └── note.json
├── 2ERYY923A
│   └── note.json
├── 2ESH8DGFS
│   └── note.json
├── 2EUZKQXX7
│   └── note.json
├── LICENSE
└── README.md
Zeppelin GitHub Storage
During the demonstration, changes made to your copy of the Zeppelin notebooks running on EMR will be automatically pushed back to GitHub when a commit occurs. To accomplish this, instead of just cloning a local copy of my zeppelin-emr-demo project repository, you will want your own copy, within your personal GitHub account. You could fork my zeppelin-emr-demo GitHub repository or push a clone of it to a new repository in your own GitHub account.
To make a copy of the project in your own GitHub account, first, create a new empty repository on GitHub, for example, ‘my-zeppelin-emr-demo-copy’. Then, execute the following commands from your terminal, to clone the original project repository to your local environment, and finally, push it to your GitHub account.
# change me
GITHUB_ACCOUNT="your-account-name" # i.e. garystafford
GITHUB_REPO="your-new-project-name" # i.e. my-zeppelin-emr-demo-copy

# shallow clone into new directory
git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/zeppelin-emr-demo.git \
    ${GITHUB_REPO}

# re-initialize repository
cd ${GITHUB_REPO}
rm -rf .git
git init

# re-commit code
git add -A
git commit -m "Initial commit of my copy of zeppelin-emr-demo"

# push to your repo
git remote add origin \
    https://github.com/$GITHUB_ACCOUNT/$GITHUB_REPO.git
git push -u origin master
GitHub Personal Access Token
To automatically push changes to your GitHub repository when a commit occurs, Zeppelin will need a GitHub personal access token; the repo scope is sufficient for pushing commits. Be sure to keep the token secret. Make sure you do not accidentally check your token value into your source code on GitHub. To minimize the risk, change or delete the token after completing the demo.
The second repository, zeppelin-emr-demo-setup, contains the necessary bootstrap files, CloudFormation templates, and PostgreSQL DDL (Data Definition Language) SQL script.
.
├── LICENSE
├── README.md
├── bootstrap
│   ├── bootstrap.sh
│   ├── emr-config.json
│   └── helium.json
├── cloudformation
│   ├── crawler.yml
│   ├── emr_single_node.yml
│   ├── emr_cluster.yml
│   └── rds_postgres.yml
└── sql
    └── ratings.sql
Use the following git command to clone the GitHub repository to your local environment.
git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/zeppelin-emr-demo-setup.git
Requirements
To follow along with the demonstration, you will need an AWS Account, an existing Amazon S3 bucket to store EMR configuration and data, and an EC2 key pair. You will also need a current version of the AWS CLI installed in your work environment. Due to the particular EMR features we will be using, I recommend using the us-east-1 AWS Region to create the demonstration's resources.
# create secure emr config and data bucket
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"

aws s3api create-bucket \
    --bucket ${ZEPPELIN_DEMO_BUCKET}

aws s3api put-public-access-block \
    --bucket ${ZEPPELIN_DEMO_BUCKET} \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
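Note that the create-bucket call above assumes the us-east-1 Region; in any other Region, S3 requires an explicit LocationConstraint. For example, for us-west-2:

# only needed outside us-east-1 (example shown for us-west-2)
aws s3api create-bucket \
    --bucket ${ZEPPELIN_DEMO_BUCKET} \
    --create-bucket-configuration LocationConstraint=us-west-2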
Copy Configuration Files to S3
To start, we need to copy three configuration files, bootstrap.sh, helium.json, and ratings.sql, from the zeppelin-emr-demo-setup project directory to our S3 bucket. Change the ZEPPELIN_DEMO_BUCKET variable value, then run the following s3 cp commands, using the AWS CLI. The three files will be copied to a bootstrap directory within your S3 bucket.
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"

aws s3 cp bootstrap/bootstrap.sh s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/
aws s3 cp bootstrap/helium.json s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/
aws s3 cp sql/ratings.sql s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/
To confirm the three files arrived in your bucket, you can list the contents of the bootstrap directory; a quick check with the AWS CLI is shown below.
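# confirm the three configuration files were copied
aws s3 ls s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/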
Create AWS Resources
We will start by creating most of the required AWS resources for the demonstration using three CloudFormation templates. We will create a single-node Amazon EMR cluster, an Amazon RDS PostgreSQL database, an AWS Glue Data Catalog database, two AWS Glue Crawlers, and a Glue IAM Role. We will wait to create the multi-node EMR cluster due to the compute costs of running large EC2 instances in the cluster. Make sure you understand the cost of these resources before proceeding, and destroy them immediately upon completion of the demonstration to minimize your expenses.
Single-Node EMR Cluster
We will start by creating the single-node Amazon EMR cluster, consisting of just one master node with no core or task nodes (a cluster of one). All operations will take place on the master node.
Default EMR Resources
The following EMR instructions assume you have already created at least one EMR cluster in the past, in your current AWS Region, using the EMR web interface with the 'Create Cluster – Quick Options' option. Creating a cluster this way creates several additional AWS resources, such as the EMR_EC2_DefaultRole EC2 instance profile, the default EMR_DefaultRole EMR IAM Role, and the default EMR S3 log bucket.
If you haven't created any EMR clusters using the EMR 'Create Cluster – Quick Options' in the past, don't worry; you can also create the required resources with a few quick AWS CLI commands. Change the following LOG_BUCKET variable value, then run the aws emr and aws s3api commands, using the AWS CLI. The LOG_BUCKET variable value follows the convention of aws-logs-awsaccount-region. For example, aws-logs-012345678901-us-east-1.
# create emr roles
aws emr create-default-roles

# create secure log bucket
# change me
LOG_BUCKET="aws-logs-your_aws_account_id-your_region"

aws s3api create-bucket --bucket ${LOG_BUCKET}

aws s3api put-public-access-block --bucket ${LOG_BUCKET} \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
The new EMR IAM Roles can be viewed in the IAM Roles web interface.
Often, I see tutorials that reference these default EMR resources from the AWS CLI or CloudFormation, without any understanding or explanation of how they are created.
EMR Bootstrap Script
As part of creating our EMR cluster, the CloudFormation template, emr_single_node.yml, will call the bootstrap script we copied to S3 earlier, bootstrap.sh. The bootstrap script pre-installs the required Python and Linux software packages and the PostgreSQL driver JAR. The bootstrap script also clones your copy of the zeppelin-emr-demo GitHub repository.
#!/bin/bash
set -ex

if [[ $# -ne 2 ]] ; then
    echo "Script requires two arguments"
    exit 1
fi

GITHUB_ACCOUNT=$1
GITHUB_REPO=$2

# install extra python packages
sudo python3 -m pip install psycopg2-binary boto3

# install extra linux packages
yes | sudo yum install git htop

# clone github repo
cd /tmp
git clone "https://github.com/${GITHUB_ACCOUNT}/${GITHUB_REPO}.git"

# install extra jars
POSTGRES_JAR="postgresql-42.2.8.jar"
wget -nv "https://jdbc.postgresql.org/download/${POSTGRES_JAR}"
sudo chown -R hadoop:hadoop ${POSTGRES_JAR}
mkdir -p /home/hadoop/extrajars/
cp ${POSTGRES_JAR} /home/hadoop/extrajars/
EMR Application Configuration
The EMR CloudFormation template will also modify the EMR cluster’s Spark and Zeppelin application configurations. Amongst other configuration properties, the template sets the default Python version to Python3, instructs Zeppelin to use the cloned GitHub notebook directory path, and adds the PostgreSQL Driver JAR to the JVM ClassPath. Below we can see the configuration properties applied to an existing EMR cluster.

EMR Application Versions
As of the date of this post, EMR is at version 5.28.0. Below, as shown in the EMR web interface, are the current (21) applications and frameworks available for installation on EMR.
For this demo, we will install Apache Spark v2.4.4, Ganglia v3.7.2, and Zeppelin 0.8.2.
Apache Zeppelin: Web Interface
Apache Spark: DAG Visualization
Ganglia: Cluster CPU Monitoring
Create the EMR CloudFormation Stack
Change the following (7) variable values, then run the aws cloudformation create-stack command, using the AWS CLI.
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
EC2_KEY_NAME="your-key-name"
LOG_BUCKET="aws-logs-your_aws_account_id-your_region"
GITHUB_ACCOUNT="your-account-name" # i.e. garystafford
GITHUB_REPO="your-new-project-name" # i.e. my-zeppelin-emr-demo
GITHUB_TOKEN="your-token-value"
MASTER_INSTANCE_TYPE="m5.xlarge" # optional

aws cloudformation create-stack \
    --stack-name zeppelin-emr-dev-stack \
    --template-body file://cloudformation/emr_single_node.yml \
    --parameters ParameterKey=ZeppelinDemoBucket,ParameterValue=${ZEPPELIN_DEMO_BUCKET} \
                 ParameterKey=Ec2KeyName,ParameterValue=${EC2_KEY_NAME} \
                 ParameterKey=LogBucket,ParameterValue=${LOG_BUCKET} \
                 ParameterKey=MasterInstanceType,ParameterValue=${MASTER_INSTANCE_TYPE} \
                 ParameterKey=GitHubAccount,ParameterValue=${GITHUB_ACCOUNT} \
                 ParameterKey=GitHubRepository,ParameterValue=${GITHUB_REPO} \
                 ParameterKey=GitHubToken,ParameterValue=${GITHUB_TOKEN}
You can use the Amazon EMR web interface to confirm the results of the CloudFormation stack. The cluster should be in the ‘Waiting’ state.
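If you prefer the command line, the stack and cluster status can also be spot-checked with the AWS CLI; below is one possible check.

# check the CloudFormation stack status
aws cloudformation describe-stacks \
    --stack-name zeppelin-emr-dev-stack \
    --query 'Stacks[0].StackStatus'

# list active EMR clusters and their states
aws emr list-clusters --active \
    --query 'Clusters[].[Name,Status.State]'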
PostgreSQL on Amazon RDS
Next, create a simple, single-AZ, single-master, non-replicated Amazon RDS PostgreSQL database, using the included CloudFormation template, rds_postgres.yml. We will use this database in Notebook 4. For the demo, I have selected the current-generation, general purpose db.m4.large DB instance class to run PostgreSQL. You can easily change it to another RDS-supported instance class to suit your own needs.
Change the following (3) variable values, then run the aws cloudformation create-stack command, using the AWS CLI.
# change me
DB_MASTER_USER="your-db-username" # i.e. masteruser
DB_MASTER_PASSWORD="your-db-password" # i.e. 5up3r53cr3tPa55w0rd
MASTER_INSTANCE_TYPE="db.m4.large" # optional

aws cloudformation create-stack \
    --stack-name zeppelin-rds-stack \
    --template-body file://cloudformation/rds_postgres.yml \
    --parameters ParameterKey=DBUser,ParameterValue=${DB_MASTER_USER} \
                 ParameterKey=DBPassword,ParameterValue=${DB_MASTER_PASSWORD} \
                 ParameterKey=DBInstanceClass,ParameterValue=${MASTER_INSTANCE_TYPE}
You can use the Amazon RDS web interface to confirm the results of the CloudFormation stack.
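Alternately, once the stack completes, you can retrieve the new database instance's status and endpoint address with the AWS CLI. Note the endpoint address; you will need it later, when configuring the JDBC interpreter.

# show the RDS instance identifier, status, and endpoint
aws rds describe-db-instances \
    --query 'DBInstances[].[DBInstanceIdentifier,DBInstanceStatus,Endpoint.Address]'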
AWS Glue
Next, create the AWS Glue Data Catalog database, the Apache Hive-compatible metastore for Spark SQL, two AWS Glue Crawlers, and a Glue IAM Role (ZeppelinDemoCrawlerRole), using the included CloudFormation template, crawler.yml. The AWS Glue Data Catalog database will be used in Notebook 3.
Change the following variable value, then run the aws cloudformation create-stack command, using the AWS CLI.
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"

aws cloudformation create-stack \
    --stack-name zeppelin-crawlers-stack \
    --template-body file://cloudformation/crawler.yml \
    --parameters ParameterKey=ZeppelinDemoBucket,ParameterValue=${ZEPPELIN_DEMO_BUCKET} \
    --capabilities CAPABILITY_NAMED_IAM
You can use the AWS Glue web interface to confirm the results of the CloudFormation stack. Note the Data Catalog database and the two Glue Crawlers. We will not run the two crawlers until Part 2 of the post, so no tables will exist in the Data Catalog database, yet.
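You can also confirm the new Glue resources from the command line; the calls below list the Data Catalog databases and the crawlers.

# list glue data catalog databases
aws glue get-databases --query 'DatabaseList[].Name'

# list glue crawlers
aws glue get-crawlers --query 'Crawlers[].Name'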
At this point in the demonstration, you should have successfully created a single-node Amazon EMR cluster, an Amazon RDS PostgreSQL database, and several AWS Glue resources, all using CloudFormation templates.
Post-EMR Creation Configuration
RDS Security
For the new EMR cluster to communicate with the RDS PostgreSQL database, we need to ensure that port 5432 is open from the RDS database's VPC security group, which is the default VPC security group, to the security groups of the EMR nodes. Obtain the Group IDs of the ElasticMapReduce-master and ElasticMapReduce-slave Security Groups from the EMR web interface.
Access the Security Group for the RDS database using the RDS web interface. Change the inbound rule for port 5432 to include both Security Group IDs. Alternately, you can authorize the inbound rules from the command line, as sketched below.
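A minimal sketch using the AWS CLI; the sg- values below are placeholders for your actual RDS security group and the two EMR group IDs.

# placeholders: replace with your actual security group ids
RDS_SG="sg-0aaaaaaaaaaaaaaaa"        # default vpc security group used by rds
EMR_MASTER_SG="sg-0bbbbbbbbbbbbbbbb" # ElasticMapReduce-master
EMR_SLAVE_SG="sg-0cccccccccccccccc"  # ElasticMapReduce-slave

# allow postgresql (5432) inbound from each emr security group
for SG in ${EMR_MASTER_SG} ${EMR_SLAVE_SG}; do
    aws ec2 authorize-security-group-ingress \
        --group-id ${RDS_SG} \
        --protocol tcp --port 5432 \
        --source-group ${SG}
done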
SSH to EMR Master Node
In addition to the bootstrap script and configurations we already applied to the EMR cluster, we need to make several post-creation configuration changes to the EMR cluster for our demonstration. These changes will require SSH'ing to the EMR cluster. Using the master node's public DNS address and the SSH command provided in the EMR web console, SSH into the master node.
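The command takes the following general form; the key pair file name and public DNS address shown are placeholders for your own values.

# placeholders: use your own key pair and the master node's public dns
ssh -i ~/.ssh/your-key-name.pem \
    hadoop@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com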
If you cannot access the node using SSH, check that port 22 is open on the associated EMR master node EC2 Security Group (ElasticMapReduce-master) to your IP address or address range.
Git Permissions
We need to change permissions on the git repository we installed during the EMR bootstrapping phase. Typically, with an EC2 instance, you perform operations as the ec2-user user. With Amazon EMR, you often perform actions as the hadoop user. With Zeppelin on EMR, the notebooks perform operations, including interacting with the git repository, as the zeppelin user. As a result of the bootstrap.sh script, the contents of the git repository directory, /tmp/zeppelin-emr-demo/, are owned by the hadoop user and group by default.

We will change their owner to the zeppelin user and group. We could not perform this step as part of the bootstrap script since the zeppelin user and group did not exist at the time the script was executed.
cd /tmp/zeppelin-emr-demo/
sudo chown -R zeppelin:zeppelin .
You can verify the new ownership with a quick directory listing; the owner and group columns should now read zeppelin.
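# confirm the zeppelin user and group now own the repository contents
ls -la /tmp/zeppelin-emr-demo/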
Pre-Install Visualization Packages
Next, we will pre-install several Apache Zeppelin Visualization packages. According to the Zeppelin website, an Apache Zeppelin Visualization is a pluggable package that can be loaded/unloaded on runtime through the Helium framework in Zeppelin. We can use them just like any other built-in visualization in the notebook. A Visualization is a javascript npm package. For example, here is a link to the ultimate-pie-chart on the public npm registry.
We can pre-load plugins by replacing the /usr/lib/zeppelin/conf/helium.json file with the version of helium.json we copied to S3 earlier, and restarting Zeppelin. If you use many Visualization packages, or use any DataOps automation to create EMR clusters, this approach is much more efficient and repeatable than manually loading plugins using the Zeppelin UI each time you create a new EMR cluster. Below is the helium.json file, which pre-loads (8) Visualization packages.
{ "enabled": { "ultimate-pie-chart": "ultimate-pie-chart@0.0.2", "ultimate-column-chart": "ultimate-column-chart@0.0.2", "ultimate-scatter-chart": "ultimate-scatter-chart@0.0.2", "ultimate-range-chart": "ultimate-range-chart@0.0.2", "ultimate-area-chart": "ultimate-area-chart@0.0.1", "ultimate-line-chart": "ultimate-line-chart@0.0.1", "zeppelin-bubblechart": "zeppelin-bubblechart@0.0.4", "zeppelin-highcharts-scatterplot": "zeppelin-highcharts-scatterplot@0.0.2" }, "packageConfig": {}, "bundleDisplayOrder": [ "ultimate-pie-chart", "ultimate-column-chart", "ultimate-scatter-chart", "ultimate-range-chart", "ultimate-area-chart", "ultimate-line-chart", "zeppelin-bubblechart", "zeppelin-highcharts-scatterplot" ] }
Run the following commands to load the plugins and adjust the permissions on the file.
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"

sudo aws s3 cp s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/helium.json \
    /usr/lib/zeppelin/conf/helium.json

sudo chown zeppelin:zeppelin /usr/lib/zeppelin/conf/helium.json
Create New JDBC Interpreter
Lastly, we need to create a new Zeppelin JDBC Interpreter to connect to our RDS database. By default, Zeppelin has several interpreters installed. You can review a list of available interpreters using the following command.
sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh --list
The new JDBC interpreter will allow us to connect to our RDS PostgreSQL database, using Java Database Connectivity (JDBC). First, ensure all available interpreters are installed, including the current Zeppelin JDBC driver (org.apache.zeppelin:zeppelin-jdbc:0.8.0), which is installed to /usr/lib/zeppelin/interpreter/jdbc.
Creating a new interpreter is a two-part process. In this stage, we install the required interpreter files on the master node using the following command. Then later, in the Zeppelin web interface, we will configure the new PostgreSQL JDBC interpreter. Note we must provide a unique name for the interpreter (I chose ‘postgres’ in this case), which we will refer to in part two of the interpreter creation process.
sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh --all

sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh \
    --name "postgres" \
    --artifact org.apache.zeppelin:zeppelin-jdbc:0.8.0
To complete the post-EMR creation configuration on the master node, we must restart Zeppelin for our changes to take effect.
sudo stop zeppelin && sudo start zeppelin
In my experience, it could take 2–3 minutes for the Zeppelin UI to become fully responsive after a restart.
Zeppelin Web Interface Access
With all the EMR application configuration complete, we will access the Zeppelin web interface running on the master node. Use the Zeppelin connection information provided in the EMR web interface to set up SSH tunneling to the Zeppelin web interface, running on the master node. Using this method, we can also access the Spark History Server, Ganglia, and Hadoop Resource Manager web interfaces; all links are provided from EMR.
To set up a web connection to the applications installed on the EMR cluster, I am using FoxyProxy as a proxy management tool with Google Chrome.
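The first half of that connection is an SSH tunnel using dynamic port forwarding, the approach described in the EMR documentation; a typical form of the command is sketched below (the key pair, local port, and DNS address are placeholders).

# open an ssh tunnel with dynamic port forwarding on local port 8157
# (foxyproxy is then configured to proxy browser traffic through it)
ssh -i ~/.ssh/your-key-name.pem -N -D 8157 \
    hadoop@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com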
If everything is working so far, you should see the Zeppelin web interface with all four Zeppelin notebooks available from the cloned GitHub repository. You will be logged in as the anonymous user. Zeppelin offers authentication for accessing notebooks on the EMR cluster. For brevity, we will not cover setting up authentication in Zeppelin, using Shiro Authentication.
To confirm the path to the local, cloned copy of the GitHub notebook repository is correct, check the Notebook Repos interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen. The value should match the ZEPPELIN_NOTEBOOK_DIR configuration property value in the emr_single_node.yml CloudFormation template we executed earlier.
To confirm the Helium Visualizations were pre-installed correctly, using the helium.json file, open the Helium interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen.
Note the enabled visualizations. Additional plugins are easy to enable through the web interface.
New PostgreSQL JDBC Interpreter
Recall that, earlier, we installed the required interpreter files on the master node using the install-interpreter.sh command. We will now complete the process of configuring the new PostgreSQL JDBC interpreter. Open the Interpreter interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen.
The title of the new interpreter must match the name we used to install the interpreter files, 'postgres'. The interpreter group will be 'jdbc'. There are, minimally, three properties we need to configure for your specific RDS database instance: default.url, default.user, and default.password. These should match the values you used to create your RDS instance, earlier. Make sure to include the database name in the default.url. An example is shown below.
default.url: jdbc:postgresql://zeppelin-demo.abcd1234efg56.us-east-1.rds.amazonaws.com:5432/ratings
default.user: masteruser
default.password: 5up3r53cr3tPa55w0rd
We also need to provide a path to the PostgreSQL driver JAR dependency. This path is the location where we placed the JAR using the bootstrap.sh script, earlier, /home/hadoop/extrajars/postgresql-42.2.8.jar. Save the new interpreter and make sure it starts successfully (shows a green icon).
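Once the interpreter is saved and running, you can exercise it from any notebook paragraph prefixed with %postgres; a simple sanity check that requires no tables might look like the following.

%postgres
SELECT version();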
Switch Interpreters to Python 3
The last thing we need to do is change the Spark and Python interpreters to use Python 3 instead of the default Python 2. On the same screen you used to create a new interpreter, modify the Spark and Python interpreters. First, for the Python interpreter, change the zeppelin.python property to python3.
Next, for the Spark interpreter, change the zeppelin.pyspark.python property to python3.
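After restarting the interpreters, a quick way to confirm the change is to print the interpreter's Python version from a notebook paragraph; a minimal check is sketched below.

%python
import sys

# should report a 3.x version after the interpreter change
print(sys.version)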
Part 2
In Part 1 of this post, we created and configured the AWS resources required to demonstrate the use of Apache Zeppelin on EMR, using an AWS Glue Data Catalog, Amazon RDS PostgreSQL database, and an S3 data lake. In Part 2 of this post, we will explore some of Apache Zeppelin’s features and integration capabilities with a variety of AWS services using the notebooks.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.