Posts Tagged Dataproc

Developing on the Google Cloud Platform

Looking for information about developing on the Google Cloud Platform? Enjoy some of my most recent articles on the subject.

  1. Getting Started with Red Hat Ansible for Google Cloud Platform (January 2019)
  2. Automating Multi-Environment Kubernetes Virtual Clusters with Google Cloud DNS, Auth0, and Istio 1.0 (January 2019)
  3. Istio End-User Authentication for Kubernetes using JSON Web Tokens (JWT) and Auth0 (January 2019)
  4. Securing Kubernetes with Istio End User Authentication using JSON Web Tokens (JWT) (January 2019)
  5. Securing Your Istio Ingress Gateway with HTTPS (January 2019)
  6. Building a Microservices Platform with Confluent Cloud, MongoDB Atlas, Istio, and Google Kubernetes Engine (December 2018)
  7. Using the Google Cloud Dataproc WorkflowTemplates API to Automate Spark and Hadoop Workloads on GCP (December 2018)
  8. Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service (December 2018)
  9. Integrating Search Capabilities with Actions for Google Assistant, using GKE and Elasticsearch: Part 1 (September 2018)
  10. Integrating Search Capabilities with Actions for Google Assistant, using GKE and Elasticsearch: Part 2 (September 2018)
  11. Building Serverless Actions for Google Assistant with Google Cloud Functions, Cloud Datastore, and Cloud Storage (August 2018)
  12. Managing Applications Across Multiple Kubernetes Environments with Istio: Part 1 (April 2018)
  13. Managing Applications Across Multiple Kubernetes Environments with Istio: Part 2 (April 2018)
  14. Deploying and Configuring Istio on Google Kubernetes Engine (GKE) (December 2017)

 

 All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , ,

Leave a comment

Using the Google Cloud Dataproc WorkflowTemplates API to Automate Spark and Hadoop Workloads on GCP

In the previous post, Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service, we explored Google Cloud Dataproc using the Google Cloud Console as well as the Google Cloud SDK and Cloud Dataproc API. We created clusters, then uploaded and ran Spark and PySpark jobs, then deleted clusters, each as discrete tasks. Although each task could be done via the Dataproc API and therefore automatable, they were independent tasks, without awareness of the previous task’s state.

Screen Shot 2018-12-15 at 11.39.26 PM.png

In this brief follow-up post, we will examine the Cloud Dataproc WorkflowTemplates API to more efficiently and effectively automate Spark and Hadoop workloads. According to Google, the Cloud Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing Dataproc workflows. A Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs. A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster. Shown below, we see one of the Workflows that will be demonstrated in this post, displayed in Spark History Server Web UI.

screen-shot-2018-12-16-at-11.07.29-am.png

Here we see a four-stage DAG of one of the three jobs in the workflow, displayed in Spark History Server Web UI.

screen-shot-2018-12-16-at-11.18.45-am

Workflows are ideal for automating large batches of dynamic Spark and Hadoop jobs, and for long-running and unattended job execution, such as overnight.

Demonstration

Using the Python and Java projects from the previous post, we will first create workflow templates using the just the WorkflowTemplates API. We will create the template, set a managed cluster, add jobs to the template, and instantiate the workflow. Next, we will further optimize and simplify our workflow by using a YAML-based workflow template file. The YAML-based template file eliminates the need to make API calls to set the template’s cluster and add the jobs to the template. Finally, to further enhance the workflow and promote re-use of the template, we will incorporate parameterization. Parameters will allow us to pass parameters (key/value) pairs from the command line to workflow template, and on to the Python script as input arguments.

It is not necessary to use the Google Cloud Console for this post. All steps will be done using Google Cloud SDK shell commands. This means all steps may be automated using CI/CD DevOps tools, like Jenkins and Spinnaker on GKE.

Source Code

All open-sourced code for this post can be found on GitHub within three repositories: dataproc-java-demodataproc-python-demo, and dataproc-workflow-templates. Source code samples are displayed as GitHub Gists, which may not display correctly on all mobile and social media browsers.

WorkflowTemplates API

Always start by ensuring you have the latest Google Cloud SDK updates and are working within the correct Google Cloud project.

gcloud components update

export PROJECT_ID=your-project-id 
gcloud config set project $PROJECT

Set the following variables based on your Google environment. The variables will be reused throughout the post for multiple commands.

export REGION=your-region
export ZONE=your-zone
export BUCKET_NAME=your-bucket

The post assumes you still have the Cloud Storage bucket we created in the previous post. In the bucket, you will need the two Kaggle IBRD CSV files, available on Kaggle, the compiled Java JAR file from the dataproc-java-demo project, and a new Python script, international_loans_dataproc.py, from the dataproc-python-demo project.

screen-shot-2018-12-16-at-12.03.51-pm

Use gsutil with the copy (cp) command to upload the four files to your Storage bucket.

gsutil cp data/ibrd-statement-of-loans-*.csv $BUCKET_NAME
gsutil cp build/libs/dataprocJavaDemo-1.0-SNAPSHOT.jar $BUCKET_NAME
gsutil cp international_loans_dataproc.py $BUCKET_NAME

Following Google’s suggested process, we create a workflow template using the workflow-templates create command.

export TEMPLATE_ID=template-demo-1
  
gcloud dataproc workflow-templates create \
  $TEMPLATE_ID --region $REGION

Adding a Cluster

Next, we need to set a cluster for the workflow to use, in order to run the jobs. Cloud Dataproc will create and use a Managed Cluster for your workflow or use an existing cluster. If the workflow uses a managed cluster, it creates the cluster, runs the jobs, and then deletes the cluster when the jobs are finished. This means, for many use cases, there is no need to maintain long-lived clusters, they become just an ephemeral part of the workflow.

We set a managed cluster for our Workflow using the workflow-templates set-managed-cluster command. We will re-use the same cluster specifications we used in the previous post, the Standard, 1 master node and 2 worker nodes, cluster type.

gcloud dataproc workflow-templates set-managed-cluster \
  $TEMPLATE_ID \
  --region $REGION \
  --zone $ZONE \
  --cluster-name three-node-cluster \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 500 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 500 \
  --num-workers 2 \
  --image-version 1.3-deb9

Alternatively, if we already had an existing cluster, we would use the workflow-templates set-cluster-selector command, to associate that cluster with the workflow template.

gcloud dataproc workflow-templates set-cluster-selector \
  $TEMPLATE_ID \
  --region $REGION \
  --cluster-labels goog-dataproc-cluster-uuid=$CLUSTER_UUID

To get the existing cluster’s UUID label value, you could use a command similar to the following.

CLUSTER_UUID=$(gcloud dataproc clusters describe $CLUSTER_2 \
  --region $REGION \
  | grep 'goog-dataproc-cluster-uuid:' \
  | sed 's/.* //')

echo $CLUSTER_UUID

1c27efd2-f296-466e-b14e-c4263d0d7e19

Adding Jobs

Next, we add the jobs we want to run to the template. Each job is considered a step in the template, each step requires a unique step id. We will add three jobs to the template, two Java-based Spark jobs from the previous post, and a new Python-based PySpark job.

First, we add the two Java-based Spark jobs, using the workflow-templates add-job spark command. This command’s flags are nearly identical to the dataproc jobs submit spark command, used in the previous post.

export STEP_ID=ibrd-small-spark
  
gcloud dataproc workflow-templates add-job spark \
  --region $REGION \
  --step-id $STEP_ID \
  --workflow-template $TEMPLATE_ID \
  --class org.example.dataproc.InternationalLoansAppDataprocSmall \
  --jars $BUCKET_NAME/dataprocJavaDemo-1.0-SNAPSHOT.jar

export STEP_ID=ibrd-large-spark
  
gcloud dataproc workflow-templates add-job spark \
  --region $REGION \
  --step-id $STEP_ID \
  --workflow-template $TEMPLATE_ID \
  --class org.example.dataproc.InternationalLoansAppDataprocLarge \
  --jars $BUCKET_NAME/dataprocJavaDemo-1.0-SNAPSHOT.jar

Next, we add the Python-based PySpark job, international_loans_dataproc.py, as the second job in the template. This Python script requires three input arguments, on lines 15–17, which are the bucket where the data is located and the and results are placed, the name of the data file, and the directory in the bucket where the results will be placed (gist).


#!/usr/bin/python
# Author: Gary A. Stafford
# License: MIT
# Arguments Example:
# gs://dataproc-demo-bucket
# ibrd-statement-of-loans-historical-data.csv
# ibrd-summary-large-python
from pyspark.sql import SparkSession
import sys
def main(argv):
storage_bucket = argv[0]
data_file = argv[1]
results_directory = argv[2]
print "Number of arguments: {0} arguments.".format(len(sys.argv))
print "Argument List: {0}".format(str(sys.argv))
spark = SparkSession \
.builder \
.master("yarn") \
.appName('dataproc-python-demo') \
.getOrCreate()
# Defaults to INFO
sc = spark.sparkContext
sc.setLogLevel("WARN")
# Loads CSV file from Google Storage Bucket
df_loans = spark \
.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(storage_bucket + "/" + data_file)
# Creates temporary view using DataFrame
df_loans.withColumnRenamed("Country", "country") \
.withColumnRenamed("Country Code", "country_code") \
.withColumnRenamed("Disbursed Amount", "disbursed") \
.withColumnRenamed("Borrower's Obligation", "obligation") \
.withColumnRenamed("Interest Rate", "interest_rate") \
.createOrReplaceTempView("loans")
# Performs basic analysis of dataset
df_disbursement = spark.sql("""
SELECT country, country_code,
format_number(total_disbursement, 0) AS total_disbursement,
format_number(ABS(total_obligation), 0) AS total_obligation,
format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM (
SELECT country, country_code,
SUM(disbursed) AS total_disbursement,
SUM(obligation) AS total_obligation,
AVG(interest_rate) AS avg_interest_rate
FROM loans
GROUP BY country, country_code
ORDER BY total_disbursement DESC
LIMIT 25)
""").cache()
print "Results:"
df_disbursement.show(25, True)
# Saves results to single CSV file in Google Storage Bucket
df_disbursement.write \
.mode("overwrite") \
.format("parquet") \
.save(storage_bucket + "/" + results_directory)
spark.stop()
if __name__ == "__main__":
main(sys.argv[1:])

We pass the arguments to the Python script as part of the PySpark job, using the workflow-templates add-job pyspark command.

export STEP_ID=ibrd-large-pyspark
  
gcloud dataproc workflow-templates add-job pyspark \
  $BUCKET_NAME/international_loans_dataproc.py \
  --step-id $STEP_ID \
  --workflow-template $TEMPLATE_ID \
  --region $REGION \
  -- $BUCKET_NAME \
     ibrd-statement-of-loans-historical-data.csv \
     ibrd-summary-large-python

That’s it, we have created our first Cloud Dataproc Workflow Template using the Dataproc WorkflowTemplate API. To view our template we can use the following two commands. First, use the workflow-templates list command to display a list of available templates. The list command output displays the version of the workflow template and how many jobs are in the template.

gcloud dataproc workflow-templates list --region $REGION
  
ID               JOBS  UPDATE_TIME               VERSION
template-demo-1  3     2018-12-15T16:32:06.508Z  5

Then, we use the workflow-templates describe command to show the details of a specific template.

gcloud dataproc workflow-templates describe \
  $TEMPLATE_ID --region $REGION

Using the workflow-templates describe command, we should see output similar to the following (gist).


createTime: '2018-12-15T16:31:21.779Z'
id: template-demo-1
jobs:
sparkJob:
jarFileUris:
gs://dataproc-demo-bucket/dataprocJavaDemo-1.0-SNAPSHOT.jar
mainClass: org.example.dataproc.InternationalLoansAppDataprocSmall
stepId: ibrd-small-spark
sparkJob:
jarFileUris:
gs://dataproc-demo-bucket/dataprocJavaDemo-1.0-SNAPSHOT.jar
mainClass: org.example.dataproc.InternationalLoansAppDataprocLarge
stepId: ibrd-large-spark
pysparkJob:
args:
gs://dataproc-demo-bucket
ibrd-statement-of-loans-historical-data.csv
ibrd-summary-large-python
mainPythonFileUri: gs://dataproc-demo-bucket/international_loans_dataproc.py
stepId: ibrd-large-pyspark
name: projects/dataproc-demo-224523/regions/us-east1/workflowTemplates/template-demo-1
placement:
managedCluster:
clusterName: three-node-cluster
config:
gceClusterConfig:
zoneUri: us-east1-b
masterConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
softwareConfig:
imageVersion: 1.3-deb9
workerConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
numInstances: 2
updateTime: '2018-12-15T16:32:06.508Z'
version: 5

In the template description, notice the template’s id, the managed cluster in the placement section, and the three jobs, all which we added using the above series of workflow-templates commands. Also, notice the creation and update timestamps and version number, which were automatically generated by Dataproc. Lastly, notice the name, which refers to the GCP project and region where this copy of the template is located. Had we used an existing cluster with our workflow, as opposed to a managed cluster, the placement section would have looked as follows.

placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-uuid: your_clusters_uuid_label_value

To instantiate the workflow, we use the workflow-templates instantiate command. This command will create the managed cluster, run all the steps (jobs), then delete the cluster. I have added the time command to see how fast the workflow will take to complete.

time gcloud dataproc workflow-templates instantiate \
  $TEMPLATE_ID --region $REGION #--async

We can observe the progress from the Google Cloud Dataproc Console, or from the command line by omitting the --async flag. Below we see the three jobs completed successfully on the managed cluster.

Waiting on operation [projects/dataproc-demo-224523/regions/us-east1/operations/e720bb96-9c87-330e-b1cd-efa4612b3c57].
WorkflowTemplate [template-demo-1] RUNNING
Creating cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/e1fe53de-92f2-4f8c-8b3a-fda5e13829b6].
Created cluster: three-node-cluster-ugdo4ygpl52bo.
Job ID ibrd-small-spark-ugdo4ygpl52bo RUNNING
Job ID ibrd-large-spark-ugdo4ygpl52bo RUNNING
Job ID ibrd-large-pyspark-ugdo4ygpl52bo RUNNING
Job ID ibrd-small-spark-ugdo4ygpl52bo COMPLETED
Job ID ibrd-large-spark-ugdo4ygpl52bo COMPLETED
Job ID ibrd-large-pyspark-ugdo4ygpl52bo COMPLETED
Deleting cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/f2a40c33-3cdf-47f5-92d6-345463fbd404].
WorkflowTemplate [template-demo-1] DONE
Deleted cluster: three-node-cluster-ugdo4ygpl52bo.

1.02s user 0.35s system 0% cpu 5:03.55 total

In the output, you see the creation of the cluster, the three jobs running and completing successfully, and finally the cluster deletion. The entire workflow took approximately 5 minutes to complete. Below is the view of the workflow’s results from the Dataproc Clusters Console Jobs tab.

screen_shot_2018-12-15_at_11.42.44_am

Below we see the output from the PySpark job, run as part of the workflow template, shown in the Dataproc Clusters Console Output tab. Notice the three input arguments we passed to the Python script from the workflow template, listed in the output.

screen_shot_2018-12-15_at_11.43.56_am

We see the arguments passed to the job, from the Jobs Configuration tab.

screen_shot_2018-12-15_at_1.11.11_pm.png

Examining the Google Cloud Dataproc Jobs Console, we will observe that the WorkflowTemplate API automatically adds a unique alphanumeric extension to both the name of the managed clusters we create, as well as to the name of each job that is run. The extension on the cluster name matches the extension on the jobs ran on that cluster.

screen_shot_2018-12-15_at_1.05.41_pm

YAML-based Workflow Template

Although, the above WorkflowTemplates API-based workflow was certainly more convenient than using the individual Cloud Dataproc API commands. At a minimum, we don’t have to remember to delete our cluster when the jobs are complete, as I often do. To further optimize the workflow, we will introduce YAML-based Workflow Template. According to Google, you can define a workflow template in a YAML file, then instantiate the template to run the workflow. You can also import and export a workflow template YAML file to create and update a Cloud Dataproc workflow template resource.

We can export our first workflow template to create our YAML-based template file.

gcloud dataproc workflow-templates export template-demo-1 \
  --destination template-demo-2.yaml \
  --region $REGION

Below is our first YAML-based template, template-demo-2.yaml. You will need to replace the values in the template with your own values, based on your environment (gist).


jobs:
sparkJob:
jarFileUris:
gs://dataproc-demo-bucket/dataprocJavaDemo-1.0-SNAPSHOT.jar
mainClass: org.example.dataproc.InternationalLoansAppDataprocSmall
stepId: ibrd-small-spark
sparkJob:
jarFileUris:
gs://dataproc-demo-bucket/dataprocJavaDemo-1.0-SNAPSHOT.jar
mainClass: org.example.dataproc.InternationalLoansAppDataprocLarge
stepId: ibrd-large-spark
pysparkJob:
args:
gs://dataproc-demo-bucket
ibrd-statement-of-loans-historical-data.csv
ibrd-summary-large-python
mainPythonFileUri: gs://dataproc-demo-bucket/international_loans_dataproc.py
stepId: ibrd-large-pyspark
placement:
managedCluster:
clusterName: three-node-cluster
config:
gceClusterConfig:
zoneUri: us-east1-b
masterConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
softwareConfig:
imageVersion: 1.3-deb9
workerConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
numInstances: 2

Note the template looks almost similar to the template we just created previously using the WorkflowTemplates API. The YAML-based template requires the placement and jobs fields. All the available fields are detailed, here.

To run the template we use the workflow-templates instantiate-from-file command. Again, I will use the time command to measure performance.

time gcloud dataproc workflow-templates instantiate-from-file \
  --file template-demo-2.yaml \
  --region $REGION

Running the workflow-templates instantiate-from-file command will run a workflow, nearly identical to the workflow we ran in the previous example, with a similar timing. Below we see the three jobs completed successfully on the managed cluster, in approximately the same time as the previous workflow.

Waiting on operation [projects/dataproc-demo-224523/regions/us-east1/operations/7ba3c28e-ebfa-32e7-9dd6-d938a1cfe23b].
WorkflowTemplate RUNNING
Creating cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/8d05199f-ed36-4787-8a28-ae784c5bc8ae].
Created cluster: three-node-cluster-5k3bdmmvnna2y.
Job ID ibrd-small-spark-5k3bdmmvnna2y RUNNING
Job ID ibrd-large-spark-5k3bdmmvnna2y RUNNING
Job ID ibrd-large-pyspark-5k3bdmmvnna2y RUNNING
Job ID ibrd-small-spark-5k3bdmmvnna2y COMPLETED
Job ID ibrd-large-spark-5k3bdmmvnna2y COMPLETED
Job ID ibrd-large-pyspark-5k3bdmmvnna2y COMPLETED
Deleting cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/a436ae82-f171-4b0a-9b36-5e16406c75d5].
WorkflowTemplate DONE
Deleted cluster: three-node-cluster-5k3bdmmvnna2y.

1.16s user 0.44s system 0% cpu 4:48.84 total

Parameterization of Templates

To further optimize the workflow template process for re-use, we have the option of passing parameters to our template. Imagine you now receive new loan snapshot data files every night. Imagine you need to run the same data analysis on the financial transactions of thousands of your customers, nightly. Parameterizing templates makes it more flexible and reusable. By removing hard-codes values, such as Storage bucket paths and data file names, a single template may be re-used for multiple variations of the same job. Parameterization allows you to automate hundreds or thousands of Spark and Hadoop jobs in a workflow or workflows, each with different parameters, programmatically.

To demonstrate the parameterization of a workflow template, we create another YAML-based template with just the Python/PySpark job, template-demo-3.yaml. If you recall from our first example, the Python script, international_loans_dataproc.py, requires three input arguments: the bucket where the data is located and the and results are placed, the name of the data file, and the directory in the bucket, where the results will be placed.

We will replace four of the values in the template with parameters. We will inject those parameter’s values when we instantiate the workflow. Below is the new parameterized template. The template now has a parameters section from lines 26–46. They define parameters that will be used to replace the four values on lines 3–7 (gist).


jobs:
pysparkJob:
args:
storage_bucket_parameter
data_file_parameter
results_directory_parameter
mainPythonFileUri: main_python_file_parameter
stepId: ibrd-pyspark
placement:
managedCluster:
clusterName: three-node-cluster
config:
gceClusterConfig:
zoneUri: us-east1-b
masterConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
softwareConfig:
imageVersion: 1.3-deb9
workerConfig:
diskConfig:
bootDiskSizeGb: 500
machineTypeUri: n1-standard-4
numInstances: 2
parameters:
description: Python script to run
fields:
jobs['ibrd-pyspark'].pysparkJob.mainPythonFileUri
name: MAIN_PYTHON_FILE
description: Storage bucket location of data file and results
fields:
jobs['ibrd-pyspark'].pysparkJob.args[0]
name: STORAGE_BUCKET
validation:
regex:
regexes:
gs://.*
description: IBRD data file
fields:
jobs['ibrd-pyspark'].pysparkJob.args[1]
name: IBRD_DATA_FILE
description: Result directory
fields:
jobs['ibrd-pyspark'].pysparkJob.args[2]
name: RESULTS_DIRECTORY

Note the PySpark job’s three arguments and the location of the Python script have been parameterized. Parameters may include validation. As an example of validation, the template uses regex to validate the format of the Storage bucket path. The regex follows Google’s RE2 regular expression library syntax. If you need help with regex, the Regex Tester – Golang website is a convenient way to test your parameter’s regex validations.

First, we import the new parameterized YAML-based workflow template, using the workflow-templates import command. Then, we instantiate the template using the workflow-templates instantiate command. The workflow-templates instantiate command will run the single PySpark job, analyzing the smaller IBRD data file, and placing the resulting Parquet-format file in a directory within the Storage bucket. We pass the Python script location, bucket link, smaller IBRD data file name, and output directory, as parameters to the template, and therefore indirectly, three of these, as input arguments to the Python script.

export TEMPLATE_ID=template-demo-3

gcloud dataproc workflow-templates import $TEMPLATE_ID \
   --region $REGION --source template-demo-3.yaml
  
gcloud dataproc workflow-templates instantiate \
  $TEMPLATE_ID --region $REGION --async \
  --parameters MAIN_PYTHON_FILE="$BUCKET_NAME/international_loans_dataproc.py",STORAGE_BUCKET=$BUCKET_NAME,IBRD_DATA_FILE="ibrd-statement-of-loans-latest-available-snapshot.csv",RESULTS_DIRECTORY="ibrd-summary-small-python"

Next, we will analyze the larger historic data file, using the same parameterized YAML-based workflow template, but changing two of the four parameters we are passing to the template with the workflow-templates instantiate command. This will run a single PySpark job on the larger IBRD data file and place the resulting Parquet-format file in a different directory within the Storage bucket.

time gcloud dataproc workflow-templates instantiate \
  $TEMPLATE_ID --region $REGION \
  --parameters MAIN_PYTHON_FILE="$BUCKET_NAME/international_loans_dataproc.py",STORAGE_BUCKET=$BUCKET_NAME,IBRD_DATA_FILE="ibrd-statement-of-loans-historical-data.csv",RESULTS_DIRECTORY="ibrd-summary-large-python"

This is the power of parameterization—one workflow template and one job script, but two different datasets and two different results.

Below we see the single PySpark job ran on the managed cluster.

Waiting on operation [projects/dataproc-demo-224523/regions/us-east1/operations/b3c5063f-e3cf-3833-b613-83db12b82f32].
WorkflowTemplate [template-demo-3] RUNNING
Creating cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/896b7922-da8e-49a9-bd80-b1ac3fda5105].
Created cluster: three-node-cluster-j6q2al2mkkqck.
Job ID ibrd-pyspark-j6q2al2mkkqck RUNNING
Job ID ibrd-pyspark-j6q2al2mkkqck COMPLETED
Deleting cluster: Operation ID [projects/dataproc-demo-224523/regions/us-east1/operations/fe4a263e-7c6d-466e-a6e2-52292cbbdc9b].
WorkflowTemplate [template-demo-3] DONE
Deleted cluster: three-node-cluster-j6q2al2mkkqck.

0.98s user 0.40s system 0% cpu 4:19.42 total

Using the workflow-templates list command again, should display a list of two workflow templates.

gcloud dataproc workflow-templates list --region $REGION
  
ID               JOBS  UPDATE_TIME               VERSION
template-demo-3  1     2018-12-15T17:04:39.064Z  2
template-demo-1  3     2018-12-15T16:32:06.508Z  5

Looking within the Google Cloud Storage bucket, we should now see four different folders, the results of the workflows.

screen-shot-2018-12-16-at-11.58.32-am.png

Job Results and Testing

To check on the status of a job, we use the dataproc jobs wait command. This returns the standard output (stdout) and standard error (stderr) for that specific job.

export SET_ID=ibrd-large-dataset-pyspark-cxzzhr2ro3i54
  
gcloud dataproc jobs wait $SET_ID \
  --project $PROJECT_ID \
  --region $REGION

The dataproc jobs wait command is frequently used for automated testing of jobs, often within a CI/CD pipeline. Assume we have expected part of the job output that indicates success, such as a string, boolean, or numeric value. We could any number of test frameworks or other methods to confirm the existence of that expected value or values. Below is a simple example of using the grep command to check for the existence of the expected line ‘ state: FINISHED’ in the standard output of the dataproc jobs wait command.

command=$(gcloud dataproc jobs wait $SET_ID \
--project $PROJECT_ID \
--region $REGION) &>/dev/null

if grep -Fqx "  state: FINISHED" <<< $command &>/dev/null; then
  echo "Job Success!"
else
  echo "Job Failure?"
fi

# single line alternative
if grep -Fqx "  state: FINISHED" <<< $command &>/dev/null;then echo "Job Success!";else echo "Job Failure?";fi

Job Success!

Individual Operations

To view individual workflow operations, use the operations list and operations describe commands. The operations list command will list all operations.

Notice the three distinct series of operations within each workflow, shown with the operations list command: WORKFLOW, CREATE, and DELETE. In the example below, I’ve separated the operations by workflow, for better clarity.

gcloud dataproc operations list --region $REGION

NAME                                  TIMESTAMP                 TYPE      STATE  ERROR  WARNINGS
fe4a263e-7c6d-466e-a6e2-52292cbbdc9b  2018-12-15T17:11:45.178Z  DELETE    DONE
896b7922-da8e-49a9-bd80-b1ac3fda5105  2018-12-15T17:08:38.322Z  CREATE    DONE
b3c5063f-e3cf-3833-b613-83db12b82f32  2018-12-15T17:08:37.497Z  WORKFLOW  DONE
---
be0e5293-275f-46ad-b1f4-696ba44c222e  2018-12-15T17:07:26.305Z  DELETE    DONE
6784078c-cbe3-4c1e-a56e-217149f555a4  2018-12-15T17:04:40.613Z  CREATE    DONE
fcd8039e-a260-3ab3-ad31-01abc1a524b4  2018-12-15T17:04:40.007Z  WORKFLOW  DONE
---
b4b23ca6-9442-4ffb-8aaf-460bac144dd8  2018-12-15T17:02:16.744Z  DELETE    DONE
89ef9c7c-f3c9-4d01-9091-61ed9e1f085d  2018-12-15T17:01:45.514Z  CREATE    DONE
243fa7c1-502d-3d7a-aaee-b372fe317570  2018-12-15T17:01:44.895Z  WORKFLOW  DONE

We use the results of the operations list command to execute the operations describe command to describe a specific operation.

gcloud dataproc operations describe \
  projects/$PROJECT_ID/regions/$REGION/operations/896b7922-da8e-49a9-bd80-b1ac3fda5105

Each type of operation contains different details. Note the fine-grain of detail we get from Dataproc using the operations describe command for a CREATE operation (gist).


projects/$PROJECT_ID/regions/$REGION/operations/896b7922-da8e-49a9-bd80-b1ac3fda5105
done: true
metadata:
'@type': type.googleapis.com/google.cloud.dataproc.v1beta2.ClusterOperationMetadata
clusterName: three-node-cluster-j6q2al2mkkqck
clusterUuid: 10656c6e-ef49-4264-805b-463e1e819626
description: Create cluster with 2 workers
operationType: CREATE
status:
innerState: DONE
state: DONE
stateStartTime: '2018-12-15T17:10:12.722Z'
statusHistory:
state: PENDING
stateStartTime: '2018-12-15T17:08:38.322Z'
state: RUNNING
stateStartTime: '2018-12-15T17:08:38.380Z'
name: projects/dataproc-demo-224523/regions/us-east1/operations/896b7922-da8e-49a9-bd80-b1ac3fda5105
response:
'@type': type.googleapis.com/google.cloud.dataproc.v1beta2.Cluster
clusterName: three-node-cluster-j6q2al2mkkqck
clusterUuid: 10656c6e-ef49-4264-805b-463e1e819626
config:
configBucket: dataproc-5214e13c-d3ea-400b-9c70-11ee08fac5ab-us-east1
gceClusterConfig:
networkUri: https://www.googleapis.com/compute/v1/projects/dataproc-demo-224523/global/networks/default
serviceAccountScopes:
https://www.googleapis.com/auth/bigquery
https://www.googleapis.com/auth/bigtable.admin.table
https://www.googleapis.com/auth/bigtable.data
https://www.googleapis.com/auth/cloud.useraccounts.readonly
https://www.googleapis.com/auth/devstorage.full_control
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write
zoneUri: https://www.googleapis.com/compute/v1/projects/dataproc-demo-224523/zones/us-east1-b
masterConfig:
diskConfig:
bootDiskSizeGb: 500
bootDiskType: pd-standard
imageUri: https://www.googleapis.com/compute/v1/projects/cloud-dataproc/global/images/dataproc-1-3-deb9-20181206-000000-rc01
machineTypeUri: https://www.googleapis.com/compute/v1/projects/dataproc-demo-224523/zones/us-east1-b/machineTypes/n1-standard-4
minCpuPlatform: AUTOMATIC
numInstances: 1
softwareConfig:
imageVersion: 1.3.19-deb9
properties:
capacity-scheduler:yarn.scheduler.capacity.root.default.ordering-policy: fair
core:fs.gs.block.size: '134217728'
core:fs.gs.metadata.cache.enable: 'false'
distcp:mapreduce.map.java.opts: -Xmx768m
distcp:mapreduce.map.memory.mb: '1024'
distcp:mapreduce.reduce.java.opts: -Xmx768m
distcp:mapreduce.reduce.memory.mb: '1024'
hdfs:dfs.datanode.address: 0.0.0.0:9866
hdfs:dfs.datanode.http.address: 0.0.0.0:9864
hdfs:dfs.datanode.https.address: 0.0.0.0:9865
hdfs:dfs.datanode.ipc.address: 0.0.0.0:9867
hdfs:dfs.namenode.handler.count: '20'
hdfs:dfs.namenode.http-address: 0.0.0.0:9870
hdfs:dfs.namenode.https-address: 0.0.0.0:9871
hdfs:dfs.namenode.lifeline.rpc-address: three-node-cluster-j6q2al2mkkqck-m:8050
hdfs:dfs.namenode.secondary.http-address: 0.0.0.0:9868
hdfs:dfs.namenode.secondary.https-address: 0.0.0.0:9869
hdfs:dfs.namenode.service.handler.count: '10'
hdfs:dfs.namenode.servicerpc-address: three-node-cluster-j6q2al2mkkqck-m:8051
mapred-env:HADOOP_JOB_HISTORYSERVER_HEAPSIZE: '3840'
mapred:mapreduce.job.maps: '21'
mapred:mapreduce.job.reduce.slowstart.completedmaps: '0.95'
mapred:mapreduce.job.reduces: '7'
mapred:mapreduce.map.cpu.vcores: '1'
mapred:mapreduce.map.java.opts: -Xmx2457m
mapred:mapreduce.map.memory.mb: '3072'
mapred:mapreduce.reduce.cpu.vcores: '1'
mapred:mapreduce.reduce.java.opts: -Xmx2457m
mapred:mapreduce.reduce.memory.mb: '3072'
mapred:mapreduce.task.io.sort.mb: '256'
mapred:yarn.app.mapreduce.am.command-opts: -Xmx2457m
mapred:yarn.app.mapreduce.am.resource.cpu-vcores: '1'
mapred:yarn.app.mapreduce.am.resource.mb: '3072'
presto-jvm:MaxHeapSize: 12288m
presto:query.max-memory-per-node: 7372MB
presto:query.max-total-memory-per-node: 7372MB
spark-env:SPARK_DAEMON_MEMORY: 3840m
spark:spark.driver.maxResultSize: 1920m
spark:spark.driver.memory: 3840m
spark:spark.executor.cores: '2'
spark:spark.executor.instances: '2'
spark:spark.executor.memory: 5586m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
spark:spark.scheduler.mode: FAIR
spark:spark.sql.cbo.enabled: 'true'
spark:spark.yarn.am.memory: 640m
yarn-env:YARN_TIMELINESERVER_HEAPSIZE: '3840'
yarn:yarn.nodemanager.resource.memory-mb: '12288'
yarn:yarn.scheduler.maximum-allocation-mb: '12288'
yarn:yarn.scheduler.minimum-allocation-mb: '1024'
workerConfig:
diskConfig:
bootDiskSizeGb: 500
bootDiskType: pd-standard
imageUri: https://www.googleapis.com/compute/v1/projects/cloud-dataproc/global/images/dataproc-1-3-deb9-20181206-000000-rc01
machineTypeUri: https://www.googleapis.com/compute/v1/projects/dataproc-demo-224523/zones/us-east1-b/machineTypes/n1-standard-4
minCpuPlatform: AUTOMATIC
numInstances: 2
labels:
goog-dataproc-cluster-name: three-node-cluster-j6q2al2mkkqck
goog-dataproc-cluster-uuid: 10656c6e-ef49-4264-805b-463e1e819626
goog-dataproc-location: us-east1
goog-dataproc-workflow-instance-id: b3c5063f-e3cf-3833-b613-83db12b82f32
goog-dataproc-workflow-template-id: template-demo-3
projectId: dataproc-demo-224523

Conclusion

In this brief, follow-up post to the previous post, Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service, we have seen how easy the WorkflowTemplates API and YAML-based workflow templates make automating our analytics jobs. This post only scraped the surface of the complete functionality of the WorkflowTemplates API and parameterization of templates.

In a future post, we leverage the automation capabilities of the Google Cloud Platform, the WorkflowTemplates API, YAML-based workflow templates, and parameterization, to develop a fully-automated DevOps for Big Data workflow, capable of running hundreds of Spark and Hadoop jobs.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , ,

2 Comments

Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service

There is little question, big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the hype curves and marketing buzz, these technologies are having a significant influence on all aspects of our modern lives.

However, installing, configuring, and managing the technologies that support big data analytics, data science, ML, and AI, at scale and in Production, often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. The mere ability to manage terabytes and petabytes of transient data is beyond the capability of many enterprises, let alone performing analysis of that data.

To ease the burden of implementing these technologies, the three major cloud providers, AWS, Azure, and Google Cloud, all have multiple Big Data Analytics-, AI-, and ML-as-a-Service offerings. In this post, we will explore one such cloud-based service offering in the field of big data analytics, Google Cloud Dataproc. We will focus on Cloud Dataproc’s ability to quickly and efficiently run Spark jobs written in Java and Python, two widely adopted enterprise programming languages.

Featured Technologies

The following technologies are featured prominently in this post.

dataproc

Google Cloud Dataproc

dataproc_logoAccording to Google, Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform. Dataproc is a complete platform for data processing, analytics, and machine learning. Dataproc offers per-second billing, so you only pay for exactly the resources you consume. Dataproc offers frequently updated and native versions of Apache Spark, Hadoop, Pig, and Hive, as well as other related applications. Dataproc has built-in integrations with other Google Cloud Platform (GCP) services, such as Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring. Dataproc’s clusters are configurable and resizable from a three to hundreds of nodes, and each cluster action takes less than 90 seconds on average.

Similar Platform as a Service (PaaS) offerings to Dataproc, include Amazon Elastic MapReduce (EMR), Microsoft Azure HDInsight, and Qubole Data Service. Qubole is offered on AWS, Azure, and Oracle Cloud Infrastructure (Oracle OCI).

According to Google, Cloud Dataproc and Cloud Dataflow, both part of GCP’s Data Analytics/Big Data Product offerings, can both be used for data processing, and there’s overlap in their batch and streaming capabilities. Cloud Dataflow is a fully-managed service for transforming and enriching data in stream and batch modes. Dataflow uses the Apache Beam SDK to provide developers with Java and Python APIs, similar to Spark.

Apache Spark

spark_logoAccording to Apache, Spark is a unified analytics engine for large-scale data processing, used by well-known, modern enterprises, such as Netflix, Yahoo, and eBay. With in-memory speeds up to 100x faster than Hadoop, Apache Spark achieves high performance for static, batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine.

According to a post by DataFlair, ‘the DAG in Apache Spark is a set of Vertices and Edges, where vertices represent the RDDs and the edges represent the Operation to be applied on RDD (Resilient Distributed Dataset). In Spark DAG, every edge directs from earlier to later in the sequence. On the calling of Action, the created DAG submits to DAG Scheduler which further splits the graph into the stages of the task.’ Below, we see a three-stage DAG visualization, displayed using the Spark History Server Web UI, from a job demonstrated in this post.

Screen Shot 2018-12-15 at 11.20.57 PM

Spark’s polyglot programming model allows users to write applications in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). Spark may be run using its standalone cluster mode or on Apache Hadoop YARNMesos, and Kubernetes.

PySpark

pyspark_logoThe Spark Python API, PySpark, exposes the Spark programming model to Python. PySpark is built on top of Spark’s Java API. Data is processed in Python and cached and shuffled in the JVM. According to Apache, Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.

Apache Hadoop

hadoop_logo1According to Apache, the Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. This is a rather modest description of such a significant and transformative project. When we talk about Hadoop, often it is in the context of the project’s well-known modules, which includes:

  • Hadoop Common: The common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
  • Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management, also known as ‘Hadoop NextGen’
  • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
  • Hadoop Ozone: An object store for Hadoop

Based on the Powered by Apache Hadoop list, there are many well-known enterprises and academic institutions using Apache Hadoop, including Adobe, eBay, Facebook, Hulu, LinkedIn, and The New York Times.

Spark vs. Hadoop

There are many articles and posts that delve into the Spark versus Hadoop debate, this post is not one of them. Although both are mature technologies, Spark, the new kid on the block, reached version 1.0.0 in May 2014, whereas Hadoop reached version 1.0.0, earlier, in December 2011. According to Google Trends, interest in both technologies has remained relatively high over the last three years. However, interest in Spark, based on the volume of searches, has been steadily outpacing Hadoop for well over a year now. The in-memory speed of Spark over HDFS-based Hadoop and ease of Spark SQL for working with structured data are likely big differentiators for many users coming from a traditional relational database background and users with large or streaming datasets, requiring near real-time processing.

spark-to-hadoop

In this post, all examples are built to run on Spark. This is not meant to suggest Spark is necessarily superior or that Spark runs better on Dataproc than Hadoop. In fact, Dataproc’s implementation of Spark relies on Hadoop’s core HDFS and YARN technologies to run.

Demonstration

To show the capabilities of Cloud Dataproc, we will create both a single-node Dataproc cluster and three-node cluster, upload Java- and Python-based analytics jobs and data to Google Cloud Storage, and execute the jobs on the Spark cluster. Finally, we will enable monitoring and notifications for the Dataproc clusters and the jobs running on the clusters with Stackdriver. The post will demonstrate the use of the Google Cloud Console, as well as Google’s Cloud SDK’s command line tools, for all tasks.

In this post, we will be uploading and running individual jobs on the Dataproc Spark cluster, as opposed to using the Cloud Dataproc Workflow Templates. According to Google, Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs. Workflow Templates are useful for automating your Datapoc workflows, however, automation is not the primary topic of this post.

Source Code

All open-sourced code for this post can be found on GitHub in two repositories, one for Java with Spark and one for Python with PySpark. Source code samples are displayed as GitHub Gists, which may not display correctly on all mobile and social media browsers.

Cost

Of course, there is a cost associated with provisioning cloud services. However, if you manage the Google Cloud Dataproc resources prudently, the costs are negligible. Regarding pricing, according to Google, Cloud Dataproc pricing is based on the size of Cloud Dataproc clusters and the duration of time that they run. The size of a cluster is based on the aggregate number of virtual CPUs (vCPUs) across the entire cluster, including the master and worker nodes. The duration of a cluster is the length of time, measured in minutes, between cluster creation and cluster deletion.

Over the course of writing the code for this post, as well as writing the post itself, the entire cost of all the related resources was a minuscule US$7.50. The cost includes creating, running, and deleting more than a dozen Dataproc clusters and uploading and executing approximately 75-100 Spark and PySpark jobs. Given the quick creation time of a cluster, 2 minutes on average or less in this demonstration, there is no reason to leave a cluster running longer than it takes to complete your workloads.

Kaggle Datasets

To explore the features of Dataproc, we will use a publicly-available dataset from Kaggle. Kaggle is a popular open-source resource for datasets used for big-data and ML applications. Their tagline is ‘Kaggle is the place to do data science projects’.

For this demonstration, I chose the IBRD Statement Of Loans Data dataset, from World Bank Financial Open Data, and available on Kaggle. The International Bank for Reconstruction and Development (IBRD) loans are public and publicly guaranteed debt extended by the World Bank Group. IBRD loans are made to, or guaranteed by, countries that are members of IBRD. This dataset contains historical snapshots of the Statement of Loans including the latest available snapshots.

screen_shot_2018-12-04_at_7.02.53_pm

There are two data files available. The ‘Statement of Loans’ latest available snapshots data file contains 8,713 rows of loan data (~3 MB), ideal for development and testing. The ‘Statement of Loans’ historic data file contains approximately 750,000 rows of data (~265 MB). Although not exactly ‘big data’, the historic dataset is large enough to sufficiently explore Dataproc. Both IBRD files have an identical schema with 33 columns of data (gist).



End of Period Loan Number Region Country Code Country Borrower Guarantor Country Code Guarantor Loan Type Loan Status Interest Rate Currency of Commitment Project ID Project Name Original Principal Amount Cancelled Amount Undisbursed Amount Disbursed Amount Repaid to IBRD Due to IBRD Exchange Adjustment Borrower's Obligation Sold 3rd Party Repaid 3rd Party Due 3rd Party Loans Held First Repayment Date Last Repayment Date Agreement Signing Date Board Approval Date Effective Date (Most Recent) Closed Date (Most Recent) Last Disbursement Date
2018-10-31T00:00:00 IBRD00010 EUROPE AND CENTRAL ASIA FR France CREDIT NATIONAL FR France NON POOL Fully Repaid 4.2500 P037383 RECONSTRUCTION 250000000.00 0.00 0.00 250000000.00 38000.00 0.00 0.00 0.00 249962000.00 249962000.00 0.00 0.00 1952-11-01T00:00:00 1977-05-01T00:00:00 1947-05-09T00:00:00 1947-05-09T00:00:00 1947-06-09T00:00:00 1947-12-31T00:00:00
2018-10-31T00:00:00 IBRD00020 EUROPE AND CENTRAL ASIA NL Netherlands NON POOL Fully Repaid 4.2500 P037452 RECONSTRUCTION 191044211.75 0.00 0.00 191044211.75 103372211.75 0.00 0.00 0.00 87672000.00 87672000.00 0.00 0.00 1952-04-01T00:00:00 1972-10-01T00:00:00 1947-08-07T00:00:00 1947-08-07T00:00:00 1947-09-11T00:00:00 1948-03-31T00:00:00
2018-10-31T00:00:00 IBRD00021 EUROPE AND CENTRAL ASIA NL Netherlands NON POOL Fully Repaid 4.2500 P037452 RECONSTRUCTION 3955788.25 0.00 0.00 3955788.25 0.00 0.00 0.00 0.00 3955788.25 3955788.25 0.00 0.00 1953-04-01T00:00:00 1954-04-01T00:00:00 1948-05-25T00:00:00 1947-08-07T00:00:00 1948-06-01T00:00:00 1948-06-30T00:00:00
2018-10-31T00:00:00 IBRD00030 EUROPE AND CENTRAL ASIA DK Denmark NON POOL Fully Repaid 4.2500 P037362 RECONSTRUCTION 40000000.00 0.00 0.00 40000000.00 17771000.00 0.00 0.00 0.00 22229000.00 22229000.00 0.00 0.00 1953-02-01T00:00:00 1972-08-01T00:00:00 1947-08-22T00:00:00 1947-08-22T00:00:00 1947-10-17T00:00:00 1949-03-31T00:00:00
2018-10-31T00:00:00 IBRD00040 EUROPE AND CENTRAL ASIA LU Luxembourg NON POOL Fully Repaid 4.2500 P037451 RECONSTRUCTION 12000000.00 238016.98 0.00 11761983.02 1619983.02 0.00 0.00 0.00 10142000.00 10142000.00 0.00 0.00 1949-07-15T00:00:00 1972-07-15T00:00:00 1947-08-28T00:00:00 1947-08-28T00:00:00 1947-10-24T00:00:00 1949-03-31T00:00:00
2018-10-31T00:00:00 IBRD00050 LATIN AMERICA AND CARIBBEAN CL Chile Ministry of Finance CL Chile NON POOL Fully Repaid 4.5000 P006578 POWER 13500000.00 0.00 0.00 13500000.00 12167000.00 0.00 0.00 0.00 1333000.00 1333000.00 0.00 0.00 1953-07-01T00:00:00 1968-07-01T00:00:00 1948-03-25T00:00:00 1948-03-25T00:00:00 1949-04-07T00:00:00 1954-12-31T00:00:00
2018-10-31T00:00:00 IBRD00060 LATIN AMERICA AND CARIBBEAN CL Chile Ministry of Finance CL Chile NON POOL Fully Repaid 3.7500 P006577 FOMENTO AGRIC CREDIT 2500000.00 0.00 0.00 2500000.00 755000.00 0.00 0.00 0.00 1745000.00 1745000.00 0.00 0.00 1950-07-01T00:00:00 1955-01-01T00:00:00 1948-03-25T00:00:00 1948-03-25T00:00:00 1949-04-07T00:00:00 1950-01-01T00:00:00
2018-10-31T00:00:00 IBRD00070 EUROPE AND CENTRAL ASIA NL Netherlands NON POOL Fully Repaid 3.5625 P037453 SHIPPING I 2000000.00 0.00 0.00 2000000.00 0.00 0.00 0.00 0.00 2000000.00 2000000.00 0.00 0.00 1949-01-15T00:00:00 1958-07-15T00:00:00 1948-07-15T00:00:00 1948-05-21T00:00:00 1948-08-03T00:00:00 1948-08-03T00:00:00
2018-10-31T00:00:00 IBRD00071 EUROPE AND CENTRAL ASIA NL Netherlands NON POOL Fully Repaid 3.5625 P037453 SHIPPING I 2000000.00 0.00 0.00 2000000.00 0.00 0.00 0.00 0.00 2000000.00 2000000.00 0.00 0.00 1949-01-15T00:00:00 1958-07-15T00:00:00 1948-07-15T00:00:00 1948-05-21T00:00:00 1948-08-03T00:00:00 1948-08-03T00:00:00

In this demonstration, both the Java and Python jobs will perform the same simple analysis of the larger historic dataset. For the analysis, we will ascertain the top 25 historic IBRD borrower, we will determine their total loan disbursements, current loan obligations, and the average interest rates they were charged for all loans. This simple analysis will be performed using Spark’s SQL capabilities. The results of the analysis, a Spark DataFrame containing 25 rows, will be saved as a CSV-format data file.

SELECT country, country_code,
       Format_number(total_disbursement, 0) AS total_disbursement,
       Format_number(total_obligation, 0) AS total_obligation,
       Format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM   (SELECT country,
               country_code,
               Sum(disbursed) AS total_disbursement,
               Sum(obligation) AS total_obligation,
               Avg(interest_rate) AS avg_interest_rate
        FROM   loans
        GROUP  BY country, country_code
        ORDER  BY total_disbursement DESC
        LIMIT  25)

Google Cloud Storage

First, we need a location to store our Spark jobs, data files, and results, which will be accessible to Dataproc. Although there are a number of choices, the simplest and most convenient location for Dataproc is a Google Cloud Storage bucket. According to Google, Cloud Storage offers the highest level of availability and performance within a single region and is ideal for compute, analytics, and ML workloads in a particular region. Cloud Storage buckets are nearly identical to Amazon Simple Storage Service (Amazon S3), their object storage service.

Using the Google Cloud Console, Google’s Web Admin UI, create a new, uniquely named Cloud Storage bucket. Our Dataproc clusters will eventually be created in a single regional location. We need to ensure our new bucket is created in the same regional location as the clusters; I chose us-east1.

screen_shot_2018-12-04_at_7.04.45_pm

We will need the new bucket’s link, to use within the Java and Python code as well from the command line with gsutil. The gsutil tool is a Python application that lets you access Cloud Storage from the command line. The bucket’s link may be found on the Storage Browser Console’s Overview tab. A bucket’s link is always in the format, gs://bucket-name.

screen_shot_2018-12-04_at_7.06.06_pm

Alternatively, we may also create the Cloud Storage bucket using gsutil with the make buckets (mb) command, as follows:

# Always best practice since features are updated frequently
gcloud components update
  
export PROJECT=your_project_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
# Make sure you are creating resources in the correct project
gcloud config set project $PROJECT
  
gsutil mb -p $PROJECT -c regional -l $REGION $BUCKET_NAME

Cloud Dataproc Cluster

Next, we will create two different Cloud Dataproc clusters for demonstration purposes. If you have not used Cloud Dataproc previously in your GCP Project, you will first need to enable the API for Cloud Dataproc.

screen_shot_2018-12-04_at_7.15.05_pm

Single Node Cluster

We will start with a single node cluster with no worker nodes, suitable for development and testing Spark and Hadoop jobs, using small datasets. Create a single-node Dataproc cluster using the Single Node Cluster mode option. Create the cluster in the same region as the new Cloud Storage bucket. This will allow the Dataproc cluster access to the bucket without additional security or IAM configuration. I used the n1-standard-1 machine type, with 1 vCPU and 3.75 GB of memory. Observe the resources assigned to Hadoop YARN for Spark job scheduling and cluster resource management.

screen_shot_2018-12-04_at_7.19.37_pm

The new cluster, consisting of a single node and no worker nodes, should be ready for use in a few minutes or less.

screen_shot_2018-12-04_at_7.38.23_pm

Note the Image version, 1.3.16-deb9. According to Google, Dataproc uses image versions to bundle operating system, big data components, and Google Cloud Platform connectors into one package that is deployed on a cluster.  This image, released in November 2018, is the latest available version at the time of this post. The image contains:

  • Apache Spark 2.3.1
  • Apache Hadoop 2.9.0
  • Apache Pig 0.17.0
  • Apache Hive 2.3.2
  • Apache Tez 0.9.0
  • Cloud Storage connector 1.9.9-hadoop2
  • Scala 2.11.8
  • Python 2.7

To avoid lots of troubleshooting, make sure your code is compatible with the image’s versions. It is important to note the image does not contain a version of Python 3. You will need to ensure your Python code is built to run with Python 2.7. Alternatively, use Dataproc’s --initialization-actions flag along with bootstrap and setup shell scripts to install Python 3 on the cluster using pip or conda. Tips for installing Python 3 on Datapoc be found on Stack Overflow and elsewhere on the Internet.

As as an alternative to the Google Cloud Console, we are able to create the cluster using a REST command. Google provides the Google Cloud Console’s equivalent REST request, as shown in the example below.

screen_shot_2018-12-04_at_7.20.07_pm

Additionally, we have the option of using the gcloud command line tool. This tool provides the primary command-line interface to Google Cloud Platform and is part of Google’s Cloud SDK, which also includes the aforementioned gsutil. Here again, Google provides the Google Cloud Console’s equivalent gcloud command. This is a great way to learn to use the command line.

screen_shot_2018-12-04_at_7.20.21_pm

Using the dataproc clusters create command, we are able to create the same cluster as shown above from the command line, as follows:

export PROJECT=your_project_name
export CLUSTER_1=your_single_node_cluster_name 
export REGION=us-east1
export ZONE=us-east1-b
export MACHINE_TYPE_SMALL=n1-standard-1
  
gcloud dataproc clusters create $CLUSTER_1 \
  --region $REGION \
  --zone $ZONE \
  --single-node \
  --master-machine-type $MACHINE_TYPE_SMALL \
  --master-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT

There are a few useful commands to inspect your running Dataproc clusters. The dataproc clusters describe command, in particular, provides detailed information about all aspects of the cluster’s configuration and current state.

gcloud dataproc clusters list --region $REGION

gcloud dataproc clusters describe $CLUSTER_2 \
  --region $REGION --format json

Standard Cluster

In addition to the single node cluster, we will create a second three-node Dataproc cluster. We will compare the speed of a single-node cluster to that of a true cluster with multiple worker nodes. Create a new Dataproc cluster using the Standard Cluster mode option. Again, make sure to create the cluster in the same region as the new Storage bucket.

screen_shot_2018-12-04_at_10.15.05_pm

The second cluster contains a single master node and two worker nodes. All three nodes use the n1-standard-4 machine type, with 4 vCPU and 15 GB of memory. Although still considered a minimally-sized cluster, this cluster represents a significant increase in compute power over the first single-node cluster, which had a total of 2 vCPU, 3.75 GB of memory, and no worker nodes on which to distribute processing. Between the two workers in the second cluster, we have 8 vCPU and 30 GB of memory for computation.

screen_shot_2018-12-04_at_10.18.54_pm

Again, we have the option of using the gcloud command line tool to create the cluster:

export PROJECT=your_project_name
export CLUSTER_2=your_three_node_cluster_name 
export REGION=us-east1
export ZONE=us-east1-b
export NUM_WORKERS=2
export MACHINE_TYPE_LARGE=n1-standard-4
  
gcloud dataproc clusters create $CLUSTER_2 \
  --region $REGION \
  --zone $ZONE \
  --master-machine-type $MACHINE_TYPE_LARGE \
  --master-boot-disk-size 500 \
  --num-workers $NUM_WORKERS \
  --worker-machine-type $MACHINE_TYPE_LARGE \
  --worker-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT

Cluster Creation Speed: Cloud Dataproc versus Amazon EMS?

In a series of rather unscientific tests, I found the three-node Dataproc cluster took less than two minutes on average to be created. Compare that time to a similar three-node cluster built with Amazon’s EMR service using their general purpose m4.4xlarge Amazon EC2 instance type. In a similar series of tests, I found the EMR cluster took seven minutes on average to be created. The EMR cluster took 3.5 times longer to create than the comparable Dataproc cluster. Again, although not a totally accurate comparison, since both services offer different features, it gives you a sense of the speed of Dataproc as compared to Amazon EMR.

Staging Buckets

According to Google, when you create a cluster, Cloud Dataproc creates a Cloud Storage staging bucket in your project or reuses an existing Cloud Dataproc-created bucket from a previous cluster creation request. Staging buckets are used to stage miscellaneous configuration and control files that are needed by your cluster. Below, we see the staging buckets created for the two Dataproc clusters.

screen_shot_2018-12-04_at_10.26.49_pm

Project Files

Before uploading the jobs and running them on the Cloud Dataproc clusters, we need to understand what is included in the two GitHub projects. If you recall from the Kaggle section of the post, both projects are basically the same but, written in different languages, Java and Python. The jobs they contain all perform the same basic analysis on the dataset.

Java Project

The dataproc-java-demo Java-based GitHub project contains three classes, each which are jobs to run by Spark. The InternationalLoansApp Java class is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file (gist).


package org.example.dataproc;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
public class InternationalLoansApp {
public static void main(String[] args) {
InternationalLoansApp app = new InternationalLoansApp();
app.start();
}
private void start() {
SparkSession spark = SparkSession.builder()
.appName("dataproc-java-demo")
.master("local[*]")
.getOrCreate();
spark.sparkContext().setLogLevel("INFO"); // INFO by default
// Loads CSV file from local directory
Dataset<Row> dfLoans = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", true)
.load("data/ibrd-statement-of-loans-latest-available-snapshot.csv");
// Basic stats
System.out.printf("Rows of data:%d%n", dfLoans.count());
System.out.println("Inferred Schema:");
dfLoans.printSchema();
// Creates temporary view using DataFrame
dfLoans.withColumnRenamed("Country", "country")
.withColumnRenamed("Country Code", "country_code")
.withColumnRenamed("Disbursed Amount", "disbursed")
.withColumnRenamed("Borrower's Obligation", "obligation")
.withColumnRenamed("Interest Rate", "interest_rate")
.createOrReplaceTempView("loans");
// Performs basic analysis of dataset
Dataset<Row> dfDisbursement = spark.sql(
"SELECT country, country_code, "
+ "format_number(total_disbursement, 0) AS total_disbursement, "
+ "format_number(ABS(total_obligation), 0) AS total_obligation, "
+ "format_number(avg_interest_rate, 2) AS avg_interest_rate "
+ "FROM ( "
+ "SELECT country, country_code, "
+ "SUM(disbursed) AS total_disbursement, "
+ "SUM(obligation) AS total_obligation, "
+ "AVG(interest_rate) AS avg_interest_rate "
+ "FROM loans "
+ "GROUP BY country, country_code "
+ "ORDER BY total_disbursement DESC "
+ "LIMIT 25)"
);
dfDisbursement.show(25, 100);
// Calculates and displays the grand total disbursed amount
Dataset<Row> dfGrandTotalDisbursement = spark.sql(
"SELECT format_number(SUM(disbursed),0) AS grand_total_disbursement FROM loans"
);
dfGrandTotalDisbursement.show();
// Calculates and displays the grand total remaining obligation amount
Dataset<Row> dfGrandTotalObligation = spark.sql(
"SELECT format_number(SUM(obligation),0) AS grand_total_obligation FROM loans"
);
dfGrandTotalObligation.show();
// Saves results to a locally CSV file
dfDisbursement.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.format("csv")
.option("header", "true")
.save("data/ibrd-summary-small-java");
System.out.println("Results successfully written to CSV file");
}
}

On line 20, the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine. There are several options for setting the Master URL, detailed here.

On line 30, the path to the data file, and on line 84, the output path for the data file, is a local relative file path.

On lines 38–42, we do a bit of clean up on the column names, for only those columns we are interested in for the analysis. Be warned, the column names of the IBRD data are less than ideal for SQL-based analysis, containing mixed-cased characters, word spaces, and brackets.

On line 79, we call Spark DataFrame’s repartition method, dfDisbursement.repartition(1). The repartition method allows us to recombine the results of our analysis and output a single CSV file to the bucket. Ordinarily, Spark splits the data into partitions and executes computations on the partitions in parallel. Each partition’s data is written to separate CSV files when a DataFrame is written back to the bucket.

Using coalesce(1) or repartition(1) to recombine the resulting 25-Row DataFrame on a single node is okay for the sake of this demonstration, but is not practical for recombining partitions from larger DataFrames. There are more efficient and less costly ways to manage the results of computations, depending on the intended use of the resulting data.

screen_shot_2018-12-05_at_4.04.24_pm

The InternationalLoansAppDataprocSmall class is intended to be run on the Dataproc clusters, analyzing the same smaller CSV data file. The InternationalLoansAppDataprocLarge class is also intended to be run on the Dataproc clusters, however, it analyzes the larger 750K rows of data in the IRBD historic CSV file (gist).


package org.example.dataproc;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
public class InternationalLoansAppDataprocLarge {
public static void main(String[] args) {
InternationalLoansAppDataprocLarge app = new InternationalLoansAppDataprocLarge();
app.start();
}
private void start() {
SparkSession spark = SparkSession.builder()
.appName("dataproc-java-demo")
.master("yarn")
.getOrCreate();
spark.sparkContext().setLogLevel("WARN"); // INFO by default
// Loads CSV file from Google Storage Bucket
Dataset<Row> dfLoans = spark.read()
.format("csv")
.option("header", "true")
.option("inferSchema", true)
.load("gs://dataproc-demo-bucket/ibrd-statement-of-loans-historical-data.csv");
// Creates temporary view using DataFrame
dfLoans.withColumnRenamed("Country", "country")
.withColumnRenamed("Country Code", "country_code")
.withColumnRenamed("Disbursed Amount", "disbursed")
.withColumnRenamed("Borrower's Obligation", "obligation")
.withColumnRenamed("Interest Rate", "interest_rate")
.createOrReplaceTempView("loans");
// Performs basic analysis of dataset
Dataset<Row> dfDisbursement = spark.sql(
"SELECT country, country_code, "
+ "format_number(total_disbursement, 0) AS total_disbursement, "
+ "format_number(ABS(total_obligation), 0) AS total_obligation, "
+ "format_number(avg_interest_rate, 2) AS avg_interest_rate "
+ "FROM ( "
+ "SELECT country, country_code, "
+ "SUM(disbursed) AS total_disbursement, "
+ "SUM(obligation) AS total_obligation, "
+ "AVG(interest_rate) AS avg_interest_rate "
+ "FROM loans "
+ "GROUP BY country, country_code "
+ "ORDER BY total_disbursement DESC "
+ "LIMIT 25)"
);
// Saves results to single CSV file in Google Storage Bucket
dfDisbursement.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.format("csv")
.option("header", "true")
.save("gs://dataproc-demo-bucket/ibrd-summary-large-java");
System.out.println("Results successfully written to CSV file");
}
}

On line 20, note the Spark Session’s Master URL, .master(yarn), directs Spark to connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode when submitting the job. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. Recall, the Dataproc cluster runs Spark on YARN.

Also, note on line 30, the path to the data file, and on line 63, the output path for the data file, is to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv"). Cloud Dataproc clusters automatically install the Cloud Storage connector. According to Google, there are a number of benefits to choosing Cloud Storage over traditional HDFS including data persistence, reliability, and performance.

These are the only two differences between the local version of the Spark job and the version of the Spark job intended for Dataproc. To build the project’s JAR file, which you will later upload to the Cloud Storage bucket, compile the Java project using the gradle build command from the root of the project. For convenience, the JAR file is also included in the GitHub repository.

screen_shot_2018-12-07_at_12.57.55_pm

Python Project

The dataproc-python-demo Python-based GitHub project contains two Python scripts to be run using PySpark for this post. The international_loans_local.py Python script is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file. It does a few different analysis with the smaller dataset. (gist).


#!/usr/bin/python
# Author: Gary A. Stafford
# License: MIT
from pyspark.sql import SparkSession
def main():
spark = SparkSession \
.builder \
.master("local[*]") \
.appName('dataproc-python-demo') \
.getOrCreate()
# Defaults to INFO
sc = spark.sparkContext
sc.setLogLevel("INFO")
# Loads CSV file from local directory
df_loans = spark \
.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("data/ibrd-statement-of-loans-latest-available-snapshot.csv")
# Prints basic stats
print "Rows of data:" + str(df_loans.count())
print "Inferred Schema:"
df_loans.printSchema()
# Creates temporary view using DataFrame
df_loans.withColumnRenamed("Country", "country") \
.withColumnRenamed("Country Code", "country_code") \
.withColumnRenamed("Disbursed Amount", "disbursed") \
.withColumnRenamed("Borrower's Obligation", "obligation") \
.withColumnRenamed("Interest Rate", "interest_rate") \
.createOrReplaceTempView("loans")
# Performs basic analysis of dataset
df_disbursement = spark.sql("""
SELECT country, country_code,
format_number(total_disbursement, 0) AS total_disbursement,
format_number(ABS(total_obligation), 0) AS total_obligation,
format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM (
SELECT country, country_code,
SUM(disbursed) AS total_disbursement,
SUM(obligation) AS total_obligation,
AVG(interest_rate) AS avg_interest_rate
FROM loans
GROUP BY country, country_code
ORDER BY total_disbursement DESC
LIMIT 25)
""").cache()
df_disbursement.show(25, True)
# Saves results to a locally CSV file
df_disbursement.repartition(1) \
.write \
.mode("overwrite") \
.format("csv") \
.option("header", "true") \
.save("data/ibrd-summary-small-python")
print "Results successfully written to CSV file"
spark.stop()
if __name__ == "__main__":
main()

Identical to the corresponding Java class, note on line 12, the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine.

Also identical to the corresponding Java class, note on line 26, the path to the data file, and on line 66, the output path for the resulting data file, is a local relative file path.

screen_shot_2018-12-05_at_4.02.50_pm

The international_loans_dataproc-large.py Python script is intended to be run on the Dataproc clusters, analyzing the larger 750K rows of data in the IRBD historic CSV file (gist).


#!/usr/bin/python
# Author: Gary A. Stafford
# License: MIT
from pyspark.sql import SparkSession
def main():
spark = SparkSession \
.builder \
.master("yarn") \
.appName('dataproc-python-demo') \
.getOrCreate()
# Defaults to INFO
sc = spark.sparkContext
sc.setLogLevel("WARN")
# Loads CSV file from Google Storage Bucket
df_loans = spark \
.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("gs://dataproc-demo-bucket/ibrd-statement-of-loans-historical-data.csv")
# Creates temporary view using DataFrame
df_loans.withColumnRenamed("Country", "country") \
.withColumnRenamed("Country Code", "country_code") \
.withColumnRenamed("Disbursed Amount", "disbursed") \
.withColumnRenamed("Borrower's Obligation", "obligation") \
.withColumnRenamed("Interest Rate", "interest_rate") \
.createOrReplaceTempView("loans")
# Performs basic analysis of dataset
df_disbursement = spark.sql("""
SELECT country, country_code,
format_number(total_disbursement, 0) AS total_disbursement,
format_number(ABS(total_obligation), 0) AS total_obligation,
format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM (
SELECT country, country_code,
SUM(disbursed) AS total_disbursement,
SUM(obligation) AS total_obligation,
AVG(interest_rate) AS avg_interest_rate
FROM loans
GROUP BY country, country_code
ORDER BY total_disbursement DESC
LIMIT 25)
""").cache()
# Saves results to single CSV file in Google Storage Bucket
df_disbursement.repartition(1) \
.write \
.mode("overwrite") \
.format("csv") \
.option("header", "true") \
.save("gs://dataproc-demo-bucket/ibrd-summary-large-python")
spark.stop()
if __name__ == "__main__":
main()

On line 12, note the Spark Session’s Master URL, .master(yarn), directs Spark to connect to a YARN cluster.

Again, note on line 26, the path to the data file, and on line 59, the output path for the data file, is to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv").

These are the only two differences between the local version of the PySpark job and the version of the PySpark job intended for Dataproc. With Python, there is no pre-compilation necessary. We will upload the second script, directly.

Uploading Job Resources to Cloud Storage

In total, we need to upload four items to the new Cloud Storage bucket we created previously. The items include the two Kaggle IBRD CSV files, the compiled Java JAR file from the dataproc-java-demo project, and the Python script from the dataproc-python-demo project. Using the Google Cloud Console, upload the four files to the new Google Storage bucket, as shown below. Make sure you unzip the two Kaggle IRBD CSV data files before uploading.

screen_shot_2018-12-05_at_12.52.51_pm

Like before, we also have the option of using gsutil with the copy (cp) command to upload the four files. The cp command accepts wildcards, as shown below.

export PROJECT=your_project_name
export BUCKET_NAME=gs://your_bucket_name
  
gsutil cp data/ibrd-statement-of-loans-*.csv $BUCKET_NAME
gsutil cp build/libs/dataprocJavaDemo-1.0-SNAPSHOT.jar $BUCKET_NAME
gsutil cp international_loans_dataproc_large.py $BUCKET_NAME

If our Java or Python jobs were larger, or more complex and required multiple files to run, we could also choose to upload ZIP or other common compression formatted archives using the --archives flag.

Running Jobs on Dataproc

The easiest way to run a job on the Dataproc cluster is by submitting a job through the Dataproc Jobs UI, part of the Google Cloud Console.

screen_shot_2018-12-05_at_11.29.34_pm

Dataproc has the capability of running multiple types of jobs, including:

  • Hadoop
  • Spark
  • SparkR
  • PySpark
  • Hive
  • SparkSql
  • Pig

We will be running both Spark and PySpark jobs as part of this demonstration.

Spark Jobs

To run a Spark job using the JAR file, select Job type Spark. The Region will match your Dataproc cluster and bucket locations, us-east-1 in my case. You should have a choice of both clusters in your chosen region. Run both jobs at least twice, once on both clusters, for a total of four jobs.

screen_shot_2018-12-05_at_12.57.55_pm

Lastly, you will need to input the main class and the path to the JAR file. The JAR location will be:

gs://your_bucket_name/dataprocJavaDemo-1.0-SNAPSHOT.jar

The main class for the smaller dataset will be:

org.example.dataproc.InternationalLoansAppDataprocSmall

The main class for the larger dataset will be:

org.example.dataproc.InternationalLoansAppDataprocLarge

During or after job execution, you may view details in the Output tab of the Dataproc Jobs console.

screen_shot_2018-12-04_at_7.53.27_pm

Like every other step in this demonstration, we can also use the gcloud command line tool, instead of the web console, to submit our Spark jobs to each cluster. Here, I am submitting the larger dataset Spark job to the three-node cluster.

export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
gcloud dataproc jobs submit spark \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --class org.example.dataproc.InternationalLoansAppDataprocLarge \
  --jars $BUCKET_NAME/dataprocJavaDemo-1.0-SNAPSHOT.jar \
  --async

PySpark Jobs

To run a Spark job using the Python script, select Job type PySpark. The Region will match your Dataproc cluster and bucket locations, us-east-1 in my case. You should have a choice of both clusters. Run the job at least twice, once on both clusters.

screen_shot_2018-12-05_at_12.53.36_pm

Lastly, you will need to input the main Python file path. There is only one Dataproc Python script, which analyzes the larger dataset. The script location will be:

gs://your_bucket_name/international_loans_dataproc_large.py

Like every other step in this demonstration, we can also use the gcloud command line tool instead of the web console to submit our PySpark jobs to each cluster. Below, I am submitting the PySpark job to the three-node cluster.

export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
gcloud dataproc jobs submit pyspark \
  $BUCKET_NAME/international_loans_dataproc_large.py \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --async

Including the optional --async flag with any of the dataproc jobs submit command, the job will be sent to the Dataproc cluster and immediately release the terminal back to the user. If you do not to use the --async flag, the terminal will be unavailable until the job is finished.

However, without the flag, we will get the standard output (stdout) and standard error (stderr) from Dataproc. The output includes some useful information, including different stages of the job execution lifecycle and execution times.

screen_shot_2018-12-05_at_10.38.52_pm

File Output

During development and testing, outputting results to the console is useful. However, in Production, the output from jobs is most often written to Apache Parquet, Apache Avro, CSV, JSON, or XML format files, persisted Apache Hive, SQL, or NoSQL database, or streamed to another system for post-processing, using technologies such as Apache Kafka.

Once both the Java and Python jobs have run successfully on the Dataproc cluster, you should observe the results have been saved back to the Storage bucket. Each script saves its results to a single CSV file in separate directories, as shown below.

screen_shot_2018-12-05_at_4.09.31_pm.png

The final dataset, written to the CSV file, contains the results of the analysis results (gist).



country country_code total_disbursement total_obligation avg_interest_rate
Brazil BR 4,302,455,404,056 1,253,228,385,979 4.08
Mexico MX 4,219,081,270,927 1,297,489,060,082 4.94
Indonesia ID 3,270,346,860,046 1,162,592,633,450 4.67
China CN 3,065,658,803,841 1,178,177,730,111 3.01
India IN 3,052,082,309,937 1,101,910,589,590 3.81
Turkey TR 2,797,634,959,120 1,111,562,740,520 4.85
Argentina AR 2,241,512,056,786 524,815,800,115 3.38
Colombia CO 1,701,021,819,054 758,168,606,621 4.48
Korea, Republic of KR 1,349,701,565,303 9,609,765,857 6.81
Philippines PH 1,166,976,603,303 365,840,981,818 5.38
Poland PL 1,157,181,357,135 671,373,801,971 2.89
Morocco MA 1,045,267,705,436 365,073,667,924 4.39
Russian Federation RU 915,318,843,306 98,207,276,721 1.70
Romania RO 902,736,599,033 368,321,253,522 4.36
Egypt, Arab Republic of EG 736,945,143,568 431,086,774,867 4.43
Thailand TH 714,203,701,665 70,485,749,749 5.93
Peru PE 655,818,700,812 191,464,347,544 3.83
Ukraine UA 644,031,278,339 394,273,593,116 1.47
Pakistan PK 628,853,154,827 121,673,028,048 3.82
Tunisia TN 625,648,381,742 202,230,595,005 4.56
Nigeria NG 484,529,279,526 2,351,912,541 5.86
Kazakhstan KZ 453,938,975,114 292,590,991,287 2.81
Algeria DZ 390,644,588,386 251,720,881 5.30
Chile CL 337,041,916,083 11,479,003,904 4.86
Serbia YF 331,975,671,975 173,516,517,964 5.30

Cleaning Up

When you are finished, make sure to delete your running clusters. This may be done through the Google Cloud Console. Deletion of the three-node cluster took, on average, slightly more than one minute.

screen_shot_2018-12-04_at_11.11.40_pm

As usual, we can also use the gcloud command line tool instead of the web console to delete the Dataproc clusters.

export CLUSTER_1=your_single_node_cluster_name
export CLUSTER_2=your_three_node_cluster_name 
export REGION=us-east1
  
yes | gcloud dataproc clusters delete $CLUSTER_1 --region $REGION
yes | gcloud dataproc clusters delete $CLUSTER_2 --region $REGION

Results

Some observations, based on approximately 75 successful jobs. First, both the Python job and the Java jobs ran in nearly the same amount of time on the single-node cluster and then on the three-node cluster. This is beneficial since, although, a lot of big data analysis is performed with Python, Java is still the lingua franca of many large enterprises.

screen_shot_2018-12-05_at_1.49.01_pm

Consecutive Execution

Below are the average times for running the larger dataset on both clusters, in Java, and in Python. The jobs were all run consecutively as opposed to concurrently. The best time was 59 seconds on the three-node cluster compared to the best time of 150 seconds on the single-node cluster, a difference of 256%. Given the differences in the two clusters, this large variation is expected. The average difference between the two clusters for running the large dataset was 254%.

chart2

Concurrent Execution

It is important to understand the impact of concurrently running multiple jobs on the same Dataproc cluster. To demonstrate this, both the Java and Python jobs were also run concurrently. In one such test, ten copies of the Python job were run concurrently on the three-node cluster.

concurrent-jobs

Observe that the execution times of the concurrent jobs increase in near-linear time. The first job completes in roughly the same time as the consecutively executed jobs, shown above, but each proceeding job’s execution time increases linearly.

chart1

According to Apache, when running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application. Each application is given a maximum amount of resources it can use and holds onto them for its whole duration. Note no tuning was done to the Dataproc clusters to optimize for concurrent execution.

Really Big Data?

Although there is no exact definition of ‘big data’, 750K rows of data at 265 MB is probably not generally considered big data. Likewise, the three-node cluster used in this demonstration is still pretty diminutive. Lastly, the SQL query was less than complex. To really test the abilities of Dataproc would require a multi-gigabyte or multi-terabyte-sized dataset, divided amongst multiple files, computed on a much beefier cluster with more workers nodes and more computer resources.

Monitoring and Instrumentation

In addition to viewing the results of running and completed jobs, there are a number of additional monitoring resources, including the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, and Google Stackdriver. I will only briefly introduce these resources, and not examine any of these interfaces in detail. You’re welcome to investigate the resources for your own clusters. Apache lists other Spark monitoring and instrumentation resources in their documentation.

To access the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, you must create an SSH tunnel and run Chrome through a proxy. Google Dataproc provides both commands and a link to documentation in the Dataproc Cluster tab, to connect.

screen_shot_2018-12-15_at_9.09.00_pm

Hadoop Yarn Resource Manager Web UI

Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the Hadoop Yarn Resource Manager Web UI is accessed on port 8088. The UI allows you to view all aspects of the YARN cluster and the distributed applications running on the YARN system.

screen_shot_2018-12-15_at_9.15.27_pm

HDFS NameNode Web UI

Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the HDFS NameNode Web UI may is accessed on port 9870. According to the Hadoop Wiki, the NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks were across the cluster the file data is kept. It does not store the data of these files itself.

screen_shot_2018-12-15_at_9.41.19_pm

Spark History Server Web UI

We can view the details of all completed jobs using the Spark History Server Web UI. Once you are connected to the cluster, via the SSH tunnel and proxy, the Spark History Server Web UI is accessed on port 18080. Of all the methods of reviewing aspects of a completed Spark job, the History Server provides the most detailed.

screen_shot_2018-12-15_at_9.15.09_pm

Using the History Server UI, we can drill into fine-grained details of each job, including the event timeline.

screen_shot_2018-12-15_at_9.22.32_pm

Also, using the History Server UI, we can see a visualization of the Spark job’s DAG (Directed Acyclic Graph). DataBricks published an excellent post on learning how to interpret the information visualized in the Spark UI.

screen_shot_2018-12-15_at_9.19.15_pm

Not only can view the DAG and drill into each Stage of the DAG, from the UI.

screen_shot_2018-12-15_at_10.22.33_pm

Stackdriver

We can also enable Google Stackdriver for monitoring and management of services, containers, applications, and infrastructure. Stackdriver offers an impressive array of services, including debugging, error reporting, monitoring, alerting, tracing, logging, and dashboards, to mention only a few Stackdriver features.

screen_shot_2018-12-05_at_3.18.31_pm

There are dozens of metrics available, which collectively, reflect the health of the Dataproc clusters. Below we see the states of one such metric, the YARN virtual cores (vcores). A YARN vcore, introduced in Hadoop 2.4, is a usage share of a host CPU.  The number of YARN virtual cores is equivalent to the number of worker nodes (2) times the number of vCPUs per node (4), for a total of eight YARN virtual cores. Below, we see that at one point in time, 5 of the 8 vcores have been allocated, with 2 more available.

screen_shot_2018-12-05_at_3.29.47_pm

Next, we see the states of the YARN memory size. YARN memory size is calculated as the number of worker nodes (2) times the amount of memory on each node (15 GB) times the fraction given to YARN (0.8), for a total of 24 GB (2 x 15 GB x 0.8). Below, we see that at one point in time, 20 GB of RAM is allocated with 4 GB available. At that instant in time, the workload does not appear to be exhausting the cluster’s memory.

screen_shot_2018-12-05_at_3.30.39_pm

Notifications

Since no one actually watches dashboards all day, waiting for something to fail, how do know when we have an issue with Dataproc? Stackdrive offers integrations with most popular notification channels, including email, SMS, Slack, PagerDuty, HipChat, and Webhooks. With Stackdriver, we define a condition which describes when a service is considered unhealthy. When triggered, Stackdriver sends a notification to one or more channels.

notifications

Below is a preview of two alert notifications in Slack. I enabled Slack as a notification channel and created an alert which is triggered each time a Dataproc job fails. Whenever a job fails, such as the two examples below, I receive a Slack notification through the Slack Channel defined in Stackdriver.

slack.png

Slack notifications contain a link, which routes you back to Stackdriver, to an incident which was opened on your behalf, due to the job failure.

incident

For convenience, the incident also includes a pre-filtered link directly to the log entries at the time of the policy violation. Stackdriver logging offers advanced filtering capabilities to quickly find log entries, as shown below.screen_shot_2018-12-09_at_12.52.51_pm

With Stackdriver, you get monitoring, logging, alerting, notification, and incident management as a service, with minimal cost and upfront configuration. Think about how much time and effort it takes the average enterprise to achieve this level of infrastructure observability on their own, most never do.

Conclusion

In this post, we have seen the ease-of-use, extensive feature-set, out-of-the-box integration ability with other cloud services, low cost, and speed of Google Cloud Dataproc, to run big data analytics workloads. Couple this with the ability of Stackdriver to provide monitoring, logging, alerting, notification, and incident management for Dataproc with minimal up-front configuration. In my opinion, based on these features, Google Cloud Dataproc leads other cloud competitors for fully-managed Spark and Hadoop Cluster management.

In future posts, we will examine the use of Cloud Dataproc Workflow Templates for process automation, the integration capabilities of Dataproc with services such as BigQuery, Bigtable, Cloud Dataflow, and Google Cloud Pub/Sub, and finally, DevOps for Big Data with Dataproc and tools like Spinnaker and Jenkins on GKE.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients, nor Apache or Google.

, , , , , , , , , , , ,

2 Comments