Archive for category Cloud

Getting Started with IoT Analytics on AWS

 

Introduction

AWS defines AWS IoT as a set of managed services that enable ‘internet-connected devices to connect to the AWS Cloud and lets applications in the cloud interact with internet-connected devices.’ AWS IoT services span three categories: Device Software, Connectivity and Control, and Analytics. In this post, we will focus on AWS IoT Analytics, one of four services, which are part of the AWS IoT Analytics category. According to AWS, AWS IoT Analytics is a fully-managed IoT analytics service, designed specifically for IoT, which collects, pre-processes, enriches, stores, and analyzes IoT device data at scale.

Certainly, AWS IoT Analytics is not the only way to analyze the Internet of Things (IoT) or Industrial Internet of Things (IIoT) data on AWS. It is common to see Data Analyst teams using a more general AWS data analytics stack, composed of Amazon S3, Amazon Kinesis, AWS Glue, and Amazon Athena or Amazon Redshift and Redshift Spectrum, for analyzing IoT data. So then why choose AWS IoT Analytics over a more traditional AWS data analytics stack? According to AWS, IoT Analytics was purpose-built to manage the complexities of IoT and IIoT data on a petabyte-scale. According to AWS, IoT data frequently has significant gaps, corrupted messages, and false readings that must be cleaned up before analysis can occur. Additionally, IoT data must often be enriched and transformed to be meaningful. IoT Analytics can filter, transform, and enrich IoT data before storing it in a time-series data store for analysis.

In the following post, we will explore the use of AWS IoT Analytics to analyze environmental sensor data, in near real-time, from a series of IoT devices. To follow along with the post’s demonstration, there is an option to use sample data to simulate the IoT devices (see the ‘Simulating IoT Device Messages’ section of this post).

IoT Devices

In this post, we will explore IoT Analytics using IoT data generated from a series of custom-built environmental sensor arrays. Each breadboard-based sensor array is connected to a Raspberry Pi single-board computer (SBC), the popular, low cost, credit-card sized Linux computer. The IoT devices were purposely placed in physical locations that vary in temperature, humidity, and other environmental conditions.

rasppi

Each device includes the following sensors:

  1. MQ135 Air Quality Sensor Hazardous Gas Detection Sensor: CO, LPG, Smoke (link)
    (requires an MCP3008 – 8-Channel 10-Bit ADC w/ SPI Interface (link))
  2. DHT22/AM2302 Digital Temperature and Humidity Sensor (link)
  3. Onyehn IR Pyroelectric Infrared PIR Motion Sensor (link)
  4. Anmbest Light Intensity Detection Photosensitive Sensor (link)

rasppi_detail

AWS IoT Device SDK

Each Raspberry Pi device runs a custom Python script, sensor_collector_v2.py. The script uses the AWS IoT Device SDK for Python v2 to communicate with AWS. The script collects a total of seven different readings from the four sensors at a regular interval. Sensor readings include temperature, humidity, carbon monoxide (CO), liquid petroleum gas (LPG), smoke, light, and motion.

The script securely publishes the sensor readings, along with a device ID and timestamp, as a single message, to AWS using the ISO standard Message Queuing Telemetry Transport (MQTT) network protocol. Below is an example of an MQTT message payload, published by the collector script.

As shown below, using tcpdump on the IoT device, the MQTT message payloads generated by the script average approximately 275 bytes. The complete MQTT messages average around 300 bytes.

screen_shot_2020-07-15_at_1_56_21_pm

AWS IoT Core

Each Raspberry Pi is registered with AWS IoT Core. IoT Core allows users to quickly and securely connect devices to AWS. According to AWS, IoT Core can reliably scale to billions of devices and trillions of messages. Registered devices are referred to as things in AWS IoT Core. A thing is a representation of a specific device or logical entity. Information about a thing is stored in the registry as JSON data.

IoT Core provides a Device Gateway, which manages all active device connections. The Gateway currently supports MQTT, WebSockets, and HTTP 1.1 protocols. Behind the Message Gateway is a high-throughput pub/sub Message Broker, which securely transmits messages to and from all IoT devices and applications with low latency. Below, we see a typical AWS IoT Core architecture.

AWS_IoT_Diagram_01_Ingest_blog

At a message frequency of five seconds, the three Raspberry Pi devices publish a total of roughly 50,000 IoT messages per day to AWS IoT Core.

monitoring

AWS IoT Security

AWS IoT Core provides mutual authentication and encryption, ensuring all data is exchanged between AWS and the devices are secure by default. In the demo, all data is sent securely using Transport Layer Security (TLS) 1.2 with X.509 digital certificates on port 443. Authorization of the device to access any resource on AWS is controlled by individual AWS IoT Core Policies, similar to AWS IAM Policies. Below, we see an example of an X.509 certificate, assigned to a registered device.

thing_cert

AWS IoT Core Rules

Once an MQTT message is received from an IoT device (a thing), we use AWS IoT Rules to send message data to an AWS IoT Analytics Channel. Rules give your devices the ability to interact with AWS services. Rules are written in standard Structured Query Language (SQL). Rules are analyzed, and Actions are performed based on the MQTT topic stream. Below, we see an example rule that forwards our messages to IoT Analytics, in addition to AWS IoT Events and Amazon Kinesis Data Firehose.

rule

Simulating IoT Device Messages

Building and configuring multiple Raspberry Pi-based sensor arrays, and registering the devices with AWS IoT Core would require a lot of work just for this post. Therefore, I have provided everything you need to simulate the three IoT devices, on GitHub. Use the following command to git clone a local copy of the project.

AWS CloudFormation

Use the CloudFormation template, iot-analytics.yaml, to create an IoT Analytics stack containing (17) resources, including the following.

  • (3) AWS IoT Things
  • (1) AWS IoT Core Topic Rule
  • (1) AWS IoT Analytics Channel, Pipeline, Data store, and Data set
  • (1) AWS Lambda and Lambda Permission
  • (1) Amazon S3 Bucket
  • (1) Amazon SageMaker Notebook Instance
  • (5) AWS IAM Roles

Please be aware of the costs involved with the AWS resources used in the CloudFormation template before continuing. To build the AWS CloudFormation stack, run the following AWS CLI command.

Below, we see a successful deployment of the IoT Analytics Demo CloudFormation Stack.

cfn_stack

Publishing Sample Messages

Once the CloudFormation stack is created successfully, use an included Python script, send_sample_messages.py, to send sample IoT data to an AWS IoT Topic, from your local machine. The script will use your AWS identity and credentials, instead of an actual IoT device registered with IoT Core. The IoT data will be intercepted by an IoT Topic Rule and redirected, using a Topic Rule Action, to the IoT Analytics Channel.

First, we will ensure the IoT stack is running correctly on AWS by sending a few test messages. Go to the AWS IoT Core Test tab. Subscribe to the iot-device-data topic.

screen_shot_2020-07-13_at_2_06_32_pm

Then, run the following command using the smaller data file, raw_data_small.json.

If successful, you should see the five messages appear in the Test tab, shown above. Example output from the script is shown below.

screen_shot_2020-07-15_at_10.30.58_pm

Then, run the second command using the larger data file, raw_data_large.json, containing 9,995 messages (a few hours worth of data). The command will take approximately 12 minutes to complete.

Once the second command completes successfully, your IoT Analytics Channel should contain 10,000 unique messages. There is an optional extra-large data file containing approximately 50,000 IoT messages (24 hours of IoT messages).

AWS IoT Analytics

AWS IoT Analytics is composed of five primary components: Channels, Pipelines, Data stores, Data sets, and Notebooks. These components enable you to collect, prepare, store, analyze, and visualize your IoT data.

iot_analytics

Below, we see a typical AWS IoT Analytics architecture. IoT messages are pulled from AWS IoT Core, thought a Rule Action. Amazon QuickSight provides business intelligence, visualization. Amazon QuickSight ML Insights adds anomaly detection and forecasting.

AWS_IoT_Diagram_02_IoT_Analytics_blog

IoT Analytics Channel

An AWS IoT Analytics Channel pulls messages or data into IoT Analytics from other AWS sources, such as Amazon S3, Amazon Kinesis, or Amazon IoT Core. Channels store data for IoT Analytics Pipelines. Both Channels and Data store support storing data in your own Amazon S3 bucket or in an IoT Analytics service-managed S3 bucket. In the demonstration, we are using a service managed S3 bucket.

When creating a Channel, you also decide how long to retain the data. For the demonstration, we have set the data retention period for 14 days. Channels are generally not used for long term storage of data. Typically, you would only retain data in the Channel for the time period you need to analyze. For long term storage of IoT message data, I recommend using an AWS IoT Core Rule to send a copy of the raw IoT data to Amazon S3, using a service such as Amazon Kinesis Data Firehose.

screen_shot_2020-07-13_at_3_03_09_pm

IoT Analytics Pipeline

An AWS IoT Analytics Pipeline consumes messages from one or more Channels. Pipelines transform, filter, and enrich the messages before storing them in IoT Analytics Data stores. A Pipeline is composed of an array of activities. Logically, you must specify both a Channel (source) and a Datastore (destination) activity. Optionally, you may choose as many as 23 additional activities in the pipelineActivities array.

In our demonstration’s Pipeline, iot_analytics_pipeline, we have specified five additional activities, including DeviceRegistryEnrich, Filter, Math, Lambda, and SelectAttributes. There are two additional Activity types we did not choose, RemoveAttributes and AddAttributes.

screen_shot_2020-07-14_at_3_11_01_pm

The demonstration’s Pipeline created by CloudFormation starts with messages from the demonstration’s Channel, iot_analytics_channel, similar to the following.

The demonstration’s Pipeline transforms the messages through a series of Pipeline Activities and then stores the resulting message in the demonstration’s Data store, iot_analytics_data_store. The resulting messages appear similar to the following.

In our demonstration, transformations to the messages include dropping the device_id attribute and converting the temp attribute value to Fahrenheit. In addition, the Lambda Activity rounds down the temp, humidity, co, lpg, and smoke attribute values to between 2–4 decimal places of precision.

screen_shot_2020-07-14_at_4_38_33_pm

The demonstration’s Pipeline also enriches the message with the metadata attribute, containing metadata from the IoT device’s AWS IoT Core Registry. The metadata includes additional information about the device that generated the message, including custom attributes we input, such as location (longitude and latitude) and the device’s installation date.

screen_shot_2020-07-14_at_3_15_27_pm

A significant feature of Pipelines is the ability to reprocess messages. If you make a change to the Pipeline, which often happens during the data preparation stage, you can reprocess any or all messages in the associated Channel, and overwrite the messages in the Data set.

screen_shot_2020-07-14_at_4_55_24_pm

IoT Analytics Data store

An AWS IoT Analytics Data store stores prepared data from an AWS IoT Analytics Pipeline, in a fully-managed database. Both Channels and Data store support storing data in your own Amazon S3 bucket or in an IoT Analytics managed S3 bucket. In the demonstration, we are using a service-managed S3 bucket to store messages in our Data store.

screen_shot_2020-07-13_at_3_03_22_pm

IoT Analytics Data set

An AWS IoT Analytics Data set automatically provides regular, up-to-date insights for data analysts by querying a Data store using standard SQL. Regular updates are provided through the use of a cron expression. For the demonstration, we are using a 15-minute interval.

Below, we see the sample messages in the Result preview pane of the Data set. These are the five test messages we sent to check the stack. Note the SQL query used to obtain the messages, which queries the Data store. The Data store, as you will recall, contains the transformed messages from the Pipeline.

screen_shot_2020-07-14_at_4_46_03_pm

IoT Analytics Data sets also support sending content results, which are materialized views of your IoT Analytics data, to an Amazon S3 bucket.

screen_shot_2020-07-14_at_3_32_18_pm

The CloudFormation stack contains an encrypted Amazon S3 Bucket. This bucket receives a copy of the messages from the IoT Analytics Data set whenever the scheduled update is run by the cron expression.

screen_shot_2020-07-14_at_3_38_40_pm

IoT Analytics Notebook

An AWS IoT Analytics Notebook allows users to perform statistical analysis and machine learning on IoT Analytics Data sets using Jupyter Notebooks. The IoT Analytics Notebook service includes a set of notebook templates that contain AWS-authored machine learning models and visualizations. Notebooks Instances can be linked to a GitHub or other source code repository. Notebooks created with IoT Analytics Notebook can also be accessed directly through Amazon SageMaker. For the demonstration, the Notebooks Instance is associated with the project’s GitHub repository.

screen_shot_2020-07-14_at_10_18_12_pm

The repository contains a sample Jupyter Notebook, IoT_Analytics_Demo_Notebook.ipynb, based on the conda_python3 kernel. This preinstalled environment includes the default Anaconda installation and Python 3. The Notebook uses pandas, matplotlib, and plotly to manipulate and visualize the sample IoT messages we published earlier and stored in the Data set.

screen_shot_2020-07-14_at_10_00_39_pm

screen_shot_2020-07-14_at_10_44_54_pm

screen_shot_2020-07-14_at_9_34_27_pm

screen_shot_2020-07-14_at_9_53_27_pm

Notebooks can be modified, and the changes pushed back to GitHub. You could easily fork a copy of my GitHub repository and modify the CloudFormation template, to include your own GitHub repository URL.

screen_shot_2020-07-14_at_10_14_51_pm

Amazon QuickSight

Amazon QuickSight provides business intelligence (BI) and visualization. Amazon QuickSight ML Insights adds anomaly detection and forecasting. We can use Amazon QuickSight to visualize the IoT message data, stored in the IoT Analytics Data set.

Amazon QuickSight has both a Standard and an Enterprise Edition. AWS provides a detailed product comparison of each edition. For the post, I am demonstrating the Enterprise Edition, which includes additional features, such as ML Insights, hourly refreshes of SPICE (super-fast, parallel, in-memory, calculation engine), and theme customization. Please be aware of the costs of Amazon QuickSight if you choose to follow along with this part of the demo. Amazon QuickSight is enabled or configured with the demonstration’s CloudFormation template.

QuickSight Data Sets

Amazon QuickSight has a wide variety of data source options for creating Amazon QuickSight Data sets, including the ones shown below. Do not confuse Amazon QuickSight Data sets with IoT Analytics Data sets. These are two different, yet similar, constructs.

screen_shot_2020-07-15_at_8.50.26_am

For the demonstration, we will create an Amazon QuickSight Data set that will use our IoT Analytics Data set as a data source.

screen_shot_2020-07-15_at_9_06_50_am

Amazon QuickSight gives you the ability to modify QuickSight Data sets. For the demonstration, I have added two additional fields, converting the boolean light and motion values of true and false to binary values of 0 or 1. I have also deselected two fields that I do not need for QuickSight Analysis.

screen_shot_2020-07-15_at_8.53.02_am

QuickSight provides a wide variety of functions, enabling us to perform dynamic calculations on the field values. Below, we see a new calculated field, light_dec, containing the original light field’s Boolean values converted to binary values. I am using a if...else formula to change the field’s value depending on the value in another field.

screen_shot_2020-07-15_at_8.53.16_am

QuickSight Analysis

Using the QuickSight Data set, built from the IoT Analytics Data set as a data source, we create a QuickSight Analysis. The QuickSight Analysis user interface is shown below. An Analysis is primarily a collection of Visuals (Visual types). QuickSight provides a number of Visual types. Each visual is associated with a Data set. Data for the QuickSight Analysis or for each individual visual can be filtered. For the demo, I have created a QuickSight Analysis, including several typical QuickSight Visuals.

screen_shot_2020-07-10_at_12_21_32_pm

QuickSight Dashboards

To share a QuickSight Analysis, we can create a QuickSight Dashboard. Below, we see a few views of the QuickSight Analysis, shown above, as a Dashboard. A viewer of the Dashboard cannot edit the visuals, though they can apply filtering and interactively drill-down into data in the Visuals.

screen_shot_2020-07-15_at_11_38_17_am

screen_shot_2020-07-15_at_11_40_17_am

screen_shot_2020-07-15_at_11_43_23_am

Geospatial Data

Amazon QuickSight understands geospatial data. If you recall, in the IoT Analytics Pipeline, we enriched the messages in the metadata from the device registry. The metadata attributes contained the device’s longitude and latitude. Quicksight will recognize those fields as geographic fields. In our QuickSight Analysis, we can visualize the geospatial data, using the geospatial chart (map) Visual type.

screen_shot_2020-07-10_at_12_22_12_pm

QuickSight Mobile App

Amazon QuickSight offers free iOS and Android versions of the Amazon QuickSight Mobile App. The mobile application makes it easy for registered QuickSight end-users to securely connect to QuickSight Dashboards, using their mobile devices. Below, we see two views of the same Dashboard, shown in the iOS version of the Amazon QuickSight Mobile App.

mobile_quicksight

Amazon QuickSight ML Insights

According to Amazon, ML Insights leverages AWS’s machine learning (ML) and natural language capabilities to gain deeper insights from data. QuickSight’s ML-powered Anomaly Detection continuously analyze data to discover anomalies and variations inside of the aggregates, giving you the insights to act when business changes occur. QuickSight’s ML-powered Forecasting can be used to accurately predict your business metrics, and perform interactive what-if analysis with point-and-click simplicity. QuickSight’s built-in algorithms make it easy for anyone to use ML that learns from your data patterns to provide you with accurate predictions based on historical trends.

Below, we see the ML Insights tab in the demonstration’s QuickSight Analysis. Individually detected anomalies can be added to the QuickSight Analysis, similar to Visuals, and configured to tune the detection parameters.

screen_shot_2020-07-15_at_12_04_21_pm

Below, we see an example of humidity anomalies across all devices, based on their Anomaly Score and are higher or lower with a minimum delta of five percent.

screen_shot_2020-07-15_at_12_24_12_pm

Cleaning Up

You are charged hourly for the SageMaker Notebook Instance. Do not forget to delete your CloudFormation stack when you are done with the demonstration. Note the Amazon S3 bucket will not be deleted; you must do this manually.

Conclusion

In this post, we demonstrated how to use AWS IoT Analytics to analyze and visualize streaming messages from multiple IoT devices, in near real-time. Combined with other AWS IoT analytics services, such as AWS IoT SiteWise, AWS IoT Events, and AWS IoT Things Graph, you can create a robust, full-featured IoT Analytics platform, capable of handling millions of industrial, commercial, and residential IoT devices, generating petabytes of data.

This blog represents my own viewpoints and not of my employer, Amazon Web Services.

, , , , , , ,

Leave a comment

Architecting a Successful SaaS: Understanding Cloud-based SaaS Models

Originally published on the AWS APN Blog.

Introduction

You’re a startup with an idea for a revolutionary new software product. You quickly build a beta version and deploy it to the cloud. After a successful social-marketing campaign and concerted sales effort, dozens of customers subscribe to your SaaS-based product. You’re ecstatic…until you realize you never architected your product for this level of success. You were so busy coding, raising capital, marketing, and selling, you never planned how you would scale your Sass product. How you would ensure your customer’s security, as well as your own. How you would meet the product reliability, compliance, and performance you promised. And, how you would monitor and meter your customer’s usage, no matter how fast you or they grew.

I’ve often heard budding entrepreneurs jest, if only success was their biggest problem. Certainly, success won’t be their biggest problem. For many, the problems come afterward, when they disappoint their customers by failing to deliver the quality product they promised. Or worse, damaging their customer’s reputation (and their own) by losing or exposing sensitive data. As the old saying goes, ‘you never get a second chance to make a first impression.’ Customer trust is hard-earned and easily lost. Properly architecting a scalable and secure SaaS-based product is just as important as feature development and sales. No one wants to fail on Day 1—you worked too hard to get there.

Architecting a Successful SaaS

In this series of posts, Architecting a Successful SaaS, we will explore how to properly plan and architect a SaaS product offering, designed for hosting on the cloud. We will start by answering basic questions, like, what is SaaS, what are the alternatives to SaaS for software distribution, and what are the most common SaaS product models. We will then examine different high-level SaaS architectures, review tenant isolation strategies, and explore how SaaS vendors securely interact with their customer’s cloud accounts. Finally, we will discuss how SaaS providers can meet established best practices, like those from AWS SaaS Factory and the AWS Well-Architected Framework.

For this post, I have chosen many examples of cloud services from AWS and vendors from AWS Marketplace. However, the principals discussed may be applied to other leading cloud providers, SaaS products, and cloud-based software marketplaces. All information in this post is publicly available.

What is SaaS?

According to AWS Marketplace, ‘SaaS [Software as a Service] is a delivery model for software applications whereby the vendor hosts and operates the application over the Internet. Customers pay for using the software without owning the underlying infrastructure.’ Another definition from AWS, ‘SaaS is a licensing and delivery model whereby software is centrally managed and hosted by a provider and available to customers on a subscription basis.’

A SaaS product, like other forms of software, is produced by what is commonly referred to as an Independent Software Vendor (ISV). According to Wikipedia, an Independent Software Vendor ‘is an organization specializing in making and selling software, as opposed to hardware, designed for mass or niche markets. This is in contrast to in-house software, which is developed by the organization that will use it, or custom software, which is designed or adapted for a single, specific third party. Although ISV-provided software is consumed by end-users, it remains the property of the vendor.’

Although estimates vary greatly, according to The Software as a Service (SaaS) Global Market Report 2020, the global SaaS market was valued at about $134.44B in 2018 and is expected to grow to $220.21B at a compound annual growth rate (CAGR) of 13.1% through 2022. Statista predicts SaaS revenues will grow even faster, forecasting revenues of $266B by 2022, with continued strong positive growth to $346B by 2027.

Cloud-based Usage Models

Let’s start by reviewing the three most common ways that individuals, businesses, academic institutions, the public sector, and government consume services from cloud providers such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud, and IBM Cloud (now includes Red Hat).

Indirect Consumer

Indirect consumers are customers who consume cloud-based SaaS products. Indirect users are often unlikely to know which cloud provider host’s the SaaS products to which they subscribe. Many SaaS products can import and export data, as well as integrate with other SaaS products. Many successful companies run their entire business in the cloud using a combination of SaaS products from multiple vendors.

SaaS-28

Examples

  • An advertising firm that uses Google G Suite for day-to-day communications and collaboration between its employees and clients.
  • A large automotive parts manufacturer that runs its business using the Workday cloud-based Enterprise Resource Management (ERP) suite.
  • A software security company that uses Zendesk for customer support. They also use the Slack integration for Zendesk to view, create, and take action on support tickets, using Slack channels.
  • A recruiting firm that uses Zoom Meetings & Chat to interview remote candidates. They also use the Zoom integration with Lever recruiting software, to schedule interviews.

Direct Consumer

Direct consumers are customers who use cloud-based Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) services to build and run their software; the DIY (do it yourself) model. The software deployed in the customer’s account may be created by the customer or purchased from a third-party software vendor and deployed within the customer’s cloud account. Direct users may purchase IaaS and PaaS services from multiple cloud providers.

SaaS-18

Examples

Hybrid Consumers

Hybrid consumers are customers who use a combination of IaaS, PaaS, and SaaS services. Customers often connect multiple IaaS, PaaS, and SaaS services as part of larger enterprise software application platforms.

SaaS-27

Examples

  • A payroll company that hosts its proprietary payroll software product, using IaaS products like Amazon EC2 and Elastic Load Balancing. In addition, they use an integrated SaaS-based fraud detection product, like Cequence Security CQ botDefense, to ensure the safety and security of payroll customers.
  • An online gaming company that operates its applications using the fully-managed container-based PaaS service, Amazon ECS. To promote their gaming products, they use a SaaS-based marketing product, like Mailchimp Marketing CRM.

Cloud-based Software

Most cloud-based software is sold in one of two ways, Customer-deployed or SaaS. Below, we see a breakdown by the method of product delivery on AWS Marketplace. All items in the chart, except SaaS, represent Customer-deployed products. Serverless applications are available elsewhere on AWS and are not represented in the AWS Marketplace statistics.

DeliveryTypes
AWS Marketplace: All Products – Delivery Methods (February 2020)

Customer-deployed

An ISV who sells customer-deployed software products to consumers of cloud-based IaaS and PaaS services. Products are installed by the customer, Systems Integrator (SI), or the ISV into the customer’s cloud account. Customer-deployed products are reminiscent of traditional ‘boxed’ software.

Customers typically pay a reoccurring hourly, monthly, or annual subscription fee for the software product, commonly referred to as pay-as-you-go (PAYG). The subscription fee paid to the vendor is in addition to the fees charged to the customer by the cloud service provider for the underlying compute resources on which the customer-deployed product runs in the customer’s cloud account.

Some customer-deployed products may also require a software license. Software licenses are often purchased separately through other channels. Applying a license you already own to a newly purchased product is commonly referred to as bring your own license (BYOL). BYOL is common in larger enterprise customers, who may have entered into an Enterprise License Agreement (ELA) with the ISV.

PlanTypesCD
AWS Marketplace: Customer-deployed Product Subscription Types (February 2020)

Customer-deployed cloud-based software products can take a variety of forms. The most common deliverables include some combination of virtual machines (VMs) such as Amazon Machine Images (AMIs), Docker images, Amazon SageMaker models, or Infrastructure as Code such as AWS CloudFormationHashiCorp Terraform, or Helm Charts. Customers usually pull these deliverables from a vendor’s AWS account or other public or private source code or binary repositories. Below, we see the breakdown of customer-deployed products, by the method of delivery, on AWS Marketplace.

DeliveryTypesCD2
AWS Marketplace: Customer-deployed Product Delivery Methods (February 2020)

Although historically, AMIs have been the predominant method of customer-deployed software delivery, newer technologies, such as Docker images, serverless, SageMaker models, and AWS Data Exchange datasets will continue to grow in this segment. The AWS Serverless Application Repository (SAR), currently contains over 500 serverless applications, not reflected in this chart. AWS appears to be moving toward making it easier to sell serverless software applications in AWS Marketplace, according to one recent post.

Customer-deployed cloud-based software products may require a connection between the installed product and the ISV for product support, license verification, product upgrades, or security notifications.

SaaS-17

Examples

SaaS

An ISV who sells SaaS software products to customers. The SaaS product is deployed, managed, and sold by the ISV and hosted by a cloud provider, such as AWS. A SaaS product may or may not interact with a customer’s cloud account. SaaS products are similar to customer-deployed products with respect to their subscription-based fee structure. Subscriptions may be based on a unit of measure, often a period of time. Subscriptions may also be based on the number of users, requests, hosts, or the volume of data.

AWS Marketplace: SaaS Products - Delivery Methods (February 2020)
AWS Marketplace: SaaS Products – Pricing Plans (February 2020)

A significant difference between SaaS products and customer-deployed products is the lack of direct customer costs from the underlying cloud provider. The underlying costs are bundled into the subscription fee for the SaaS product.

Similar to Customer-deployed products, SaaS products target both consumers and businesses. SaaS products span a wide variety of consumer, business, industry-specific, and technical categories. AWS Marketplace offers products from vendors covering eight major categories and over 70 sub-categories.

SaaSCats
AWS Marketplace: SaaS Product Categories (February 2020)

SaaS Product Variants

I regularly work with a wide variety of cloud-based software vendors. In my experience, most cloud-based SaaS products fit into one of four categories, based on the primary way a customer interacts with the SaaS product:

  • Stand-alone: A SaaS product that has no interaction with the customer’s cloud account;
  • Data Access: A SaaS product that connects to the customer’s cloud account to only obtain data;
  • Augmentation: A SaaS product that connects to the customer’s cloud account, interacting with and augmenting the customer’s software;
  • Discrete Service: A variation of augmentation, a SaaS product that provides a discrete service or function as opposed to a more complete software product;

Stand-alone

A stand-alone SaaS product has no interaction with a customer’s cloud account. Customers of stand-alone SaaS products interact with the product through an interface provided by the SaaS vendor. Many stand-alone SaaS products can import and export customer data, as well as integrate with other cloud-based SaaS products. Stand-alone SaaS products may target consumers, known as Business-to-Consumer (B2C SaaS). They may also target businesses, known as Business-to-Business (B2B SaaS).

SaaS-29

Examples

Data Access

A SaaS product that connects to a customer’s data sources in their cloud account or on-prem. These SaaS products often fall into the categories of Big Data and Data Analytics, Machine Learning and Artificial Intelligence, and IoT (Internet of Things). Products in these categories work with large quantities of data. Given the sheer quantity of data or real-time nature of the data, importing or manually inputting data directly into the SaaS product, through the SaaS vendor’s user interface is unrealistic. Often, these SaaS products will cache some portion of the customer’s data to reduce customer’s data transfer costs.

Similar to the previous stand-alone SaaS products, customers of these SaaS products interact with the product thought a user interface provided by the SaaS vendor.

SaaS-14

Examples

  • Zepl provides an enterprise data science analytics platform, which enables data exploration, analysis, and collaboration. Zepl sells its Zepl Science and Analytics Platform SaaS product on AWS Marketplace. The Zepl product provides integration to many types of customer data sources including Snowflake, Amazon S3, Amazon Redshift, Amazon Athena, Google BigQuery, Apache Cassandra (Amazon MCS), and other SQL databases.
  • Sisense provides an enterprise-grade, cloud-native business intelligence and analytics platform, powered by AI. Sisense offers its Sisense Business Intelligence SaaS product on AWS Marketplace. This product lets customers prepare and analyze disparate big datasets using Sisense’s Data Connectors. The wide array of connectors provide connectivity to dozens of different cloud-based and on-prem data sources.
  • Databricks provides a unified data analytics platform, designed for massive-scale data engineering and collaborative data science. Databricks offers its Databricks Unified Analytics Platform SaaS product on AWS Marketplace. Databricks allows customers to interact with data across many different data sources, data storage types, and data types, including batch and streaming.
  • DataRobot provides an enterprise AI platform, which enables global enterprises to collaboratively harness the power of AI. DataRobot sells its DataRobot Automated Machine Learning for AWS SaaS product on AWS Marketplace. Using connectors, like Skyvia’s OData connector, customers can connect their data sources to the DataRobot product.

Augmentation

A SaaS product that interacts with, or augments a customer’s application, which is managed by the customer in their own cloud account. These SaaS products often maintain secure, loosely-coupled, unidirectional or bidirectional connections between the vendor’s SaaS product and the customer’s account. Vendors on AWS often use services like Amazon EventBridgeAWS PrivateLink, VPC Peering, Amazon S3, Amazon Kinesis, Amazon SQS, and Amazon SNS to interact with customer’s accounts and exchange data. Often, these SaaS products fall within the categories of Security, Logging and Monitoring, and DevOps.

Customers of these types of SaaS products generally interact with their own software, as well as the SaaS product thought an interface provided by the SaaS vendor.

SaaS-24

Examples

  • CloudCheckr provides solutions that enable clients to optimize costs, security, and compliance on leading cloud providers. CloudCheckr sells its Cloud Management Platform SaaS product on AWS Marketplace. CloudCheckr uses an AWS IAM cross-account role and Amazon S3 to exchange data between the customer’s account and their SaaS product.
  • Splunk provides the leading software platform for real-time Operational Intelligence. Splunk sells its Splunk Cloud SaaS product on AWS Marketplace. Splunk Cloud enables rapid application troubleshooting, ensures security and compliance, and provides monitoring of business-critical services in real-time. According to their documentation, Splunk uses a combination of AWS S3, Amazon SQS, and Amazon SNS services to transfer AWS CloudTrail logs from the customer’s accounts to Splunk Cloud.

Discrete Service

Discrete SaaS products are a variation of SaaS augmentation products. Discrete SaaS products provide specific, distinct functionality to a customer’s software application. These products may be an API, data source, or machine learning model, which is often accessed completely through a vendor’s API. The products have a limited or no visual user interface. These SaaS products are sometimes referred to as a ‘Service as a Service’. Discrete SaaS products often fall into the categories of Artificial Intelligence and Machine Learning, Financial Services, Reference Data, and Authentication and Authorization.

SaaS-30

Examples

AWS Data Exchange

There is a new category of products on AWS Marketplace. Released in November 2019, AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. According to AWS, Data Exchange vendors can publish new data, as well as automatically publish revisions to existing data and notify subscribers. Once subscribed to a data product, customers can use the AWS Data Exchange API to load data into Amazon S3 and then analyze it with a wide variety of AWS analytics and machine learning services.

SaaS-25

Data Exchange seems to best fit the description of a customer-deployed product. However, given the nature of the vendor-subscriber relationship, where data may be regularly exchanged—revised and published by the vendor and pulled by the subscriber—I would consider Data Exchange a cloud-based hybrid product.

AWS Data Exchange products are available on AWS Marketplace. The list of qualified data providers is growing and includes Reuters, Foursquare, TransUnion, Pitney Bowes, IMDb, Epsilon, ADP, Dun & Bradstreet, and others. As illustrated below, data sets are available in the categories of financial services, public sector, healthcare, media, telecommunications, and more.

DataTypes
AWS Marketplace: Data Exchange Product Categories (February 2020)

Examples

Conclusion

In this first post, we’ve become familiar with the common ways in which customers consume cloud-based IaaS, PaaS, and SaaS products and services. We also explored the different ways in which ISVs sell their software products to customers. In future posts, we will examine different high-level SaaS architectures, review tenant isolation strategies, and explore how SaaS vendors securely interact with their customer’s cloud accounts. Finally, we will discuss how SaaS providers can meet best-practices, like those from AWS SaaS Factory and the AWS Well-Architected Framework.

References

Here are some great references to learn more about building and managing SaaS products on AWS.

This blog represents my own view points and not of my employer, Amazon Web Services.

, , , , , ,

Leave a comment

Streaming Data Analytics with Amazon Kinesis Data Firehose, Redshift, and QuickSight

Introduction

Databases are ideal for storing and organizing data that requires a high volume of transaction-oriented query processing while maintaining data integrity. In contrast, data warehouses are designed for performing data analytics on vast amounts of data from one or more disparate sources. In our fast-paced, hyper-connected world, those sources often take the form of continuous streams of web application logs, e-commerce transactions, social media feeds, online gaming activities, financial trading transactions, and IoT sensor readings. Streaming data must be analyzed in near real-time, while often first requiring cleansing, transformation, and enrichment.

In the following post, we will demonstrate the use of Amazon Kinesis Data Firehose, Amazon Redshift, and Amazon QuickSight to analyze streaming data. We will simulate time-series data, streaming from a set of IoT sensors to Kinesis Data Firehose. Kinesis Data Firehose will write the IoT data to an Amazon S3 Data Lake, where it will then be copied to Redshift in near real-time. In Amazon Redshift, we will enhance the streaming sensor data with data contained in the Redshift data warehouse, which has been gathered and denormalized into a star schema.

Streaming-Kinesis-Redshift

In Redshift, we can analyze the data, asking questions like, what is the min, max, mean, and median temperature over a given time period at each sensor location. Finally, we will use Amazon Quicksight to visualize the Redshift data using rich interactive charts and graphs, including displaying geospatial sensor data.

screen_shot_2020-03-04_at_9.27.33_pm

Featured Technologies

The following AWS services are discussed in this post.

Amazon Kinesis Data Firehose

According to Amazon, Amazon Kinesis Data Firehose can capture, transform, and load streaming data into data lakes, data stores, and analytics tools. Direct Kinesis Data Firehose integrations include Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. Kinesis Data Firehose enables near real-time analytics with existing business intelligence (BI) tools and dashboards.

Amazon Redshift

According to Amazon, Amazon Redshift is the most popular and fastest cloud data warehouse. With Redshift, users can query petabytes of structured and semi-structured data across your data warehouse and data lake using standard SQL. Redshift allows users to query and export data to and from data lakes. Redshift can federate queries of live data from Redshift, as well as across one or more relational databases.

Amazon Redshift Spectrum

According to Amazon, Amazon Redshift Spectrum can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum tables are created by defining the structure for data files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue or an Apache Hive metastore. While Redshift Spectrum is an alternative to copying the data into Redshift for analysis, we will not be using Redshift Spectrum in this post.

Amazon QuickSight

According to Amazon, Amazon QuickSight is a fully managed business intelligence service that makes it easy to deliver insights to everyone in an organization. QuickSight lets users easily create and publish rich, interactive dashboards that include Amazon QuickSight ML Insights. Dashboards can then be accessed from any device and embedded into applications, portals, and websites.

What is a Data Warehouse?

According to Amazon, a data warehouse is a central repository of information that can be analyzed to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through business intelligence tools, SQL clients, and other analytics applications.

Demonstration

Source Code

All the source code for this post can be found on GitHub. Use the following command to git clone a local copy of the project.

CloudFormation

Use the two AWS CloudFormation templates, included in the project, to build two CloudFormation stacks. Please review the two templates and understand the costs of the resources before continuing.

The first CloudFormation template, redshift.yml, provisions a new Amazon VPC with associated network and security resources, a single-node Redshift cluster, and two S3 buckets.

The second CloudFormation template, kinesis-firehose.yml, provisions an Amazon Kinesis Data Firehose delivery stream, associated IAM Policy and Role, and an Amazon CloudWatch log group and two log streams.

Change the REDSHIFT_PASSWORD value to ensure your security. Optionally, change the REDSHIFT_USERNAME value. Make sure that the first stack completes successfully, before creating the second stack.

Review AWS Resources

To confirm all the AWS resources were created correctly, use the AWS Management Console.

Kinesis Data Firehose

In the Amazon Kinesis Dashboard, you should see the new Amazon Kinesis Data Firehose delivery stream, redshift-delivery-stream.

screen_shot_2020-03-02_at_3.58.12_pm

The Details tab of the new Amazon Kinesis Firehose delivery stream should look similar to the following. Note the IAM Role, FirehoseDeliveryRole, which was created and associated with the delivery stream by CloudFormation.

screen_shot_2020-03-02_at_6.30.25_pm

We are not performing any transformations of the incoming messages. Note the new S3 bucket that was created and associated with the stream by CloudFormation. The bucket name was randomly generated. This bucket is where the incoming messages will be written.

screen_shot_2020-03-02_at_6.31.13_pm

Note the buffer conditions of 1 MB and 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written in JSON format, using GZIP compression, to S3. These are the minimal buffer conditions, and as close to real-time streaming to Redshift as we can get.

screen_shot_2020-03-02_at_6.31.28_pm

Note the COPY command, which is used to copy the messages from S3 to the message table in Amazon Redshift. Kinesis uses the IAM Role, ClusterPermissionsRole, created by CloudFormation, for credentials. We are using a Manifest to copy the data to Redshift from S3. According to Amazon, a Manifest ensures that the COPY command loads all of the required files, and only the required files, for a data load. The Manifests are automatically generated and managed by the Kinesis Firehose delivery stream.

screen_shot_2020-03-02_at_6.31.43_pm

Redshift Cluster

In the Amazon Redshift Console, you should see a new single-node Redshift cluster consisting of one Redshift dc2.large Dense Compute node type.

screen_shot_2020-03-02_at_7.09.35_pm

Note the new VPC, Subnet, and VPC Security Group created by CloudFormation. Also, observe that the Redshift cluster is publically accessible from outside the new VPC.

screen_shot_2020-03-02_at_7.09.41_pm

Redshift Ingress Rules

The single-node Redshift cluster is assigned to an AWS Availability Zone in the US East (N. Virginia) us-east-1 AWS Region. The cluster is associated with a VPC Security Group. The Security Group contains three inbound rules, all for Redshift port 5439. The IP addresses associated with the three inbound rules provide access to the following: 1) a /27 CIDR block for Amazon QuickSight in us-east-1, a /27 CIDR block for Amazon Kinesis Firehose in us-east-1, and to you, a /32 CIDR block with your current IP address. If your IP address changes or you do not use the us-east-1 Region, you will need to change one or all of these IP addresses. The list of Kinesis Firehose IP addresses is here. The list of QuickSight IP addresses is here.

screen_shot_2020-03-02_at_7.09.59_pm

If you cannot connect to Redshift from your local SQL client, most often, your IP address has changed and is incorrect in the Security Group’s inbound rule.

Redshift SQL Client

You can choose to use the Redshift Query Editor to interact with Redshift or use a third-party SQL client for greater flexibility. To access the Redshift Query Editor, use the user credentials specified in the redshift.yml CloudFormation template.

screen_shot_2020-03-02_at_4.01.49_pm

There is a lot of useful functionality in the Redshift Console and within the Redshift Query Editor. However, a notable limitation of the Redshift Query Editor, in my opinion, is the inability to execute multiple SQL statements at the same time. Whereas, most SQL clients allow multiple SQL queries to be executed at the same time.

screen_shot_2020-03-04_at_8.48.39_am

I prefer to use JetBrains PyCharm IDE. PyCharm has out-of-the-box integration with RedShift. Using PyCharm, I can edit the project’s Python, SQL, AWS CLI shell, and CloudFormation code, all from within PyCharm.

screen_shot_2020-03-02_at_7.12.11_pm

If you use any of the common SQL clients, you will need to set-up a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) connection to Redshift. The ODBC and JDBC connection strings can be found in the Redshift cluster’s Properties tab or in the Outputs tab from the CloudFormation stack, redshift-stack.

screen_shot_2020-03-04_at_9.07.14_am

You will also need the RedShift database username and password you included in the aws cloudformation create-stack AWS CLI command you executed previously. Below, we see PyCharm’s Project Data Sources window containing a new data source for the Redshift dev database.

screen_shot_2020-03-02_at_6.33.01_pm

Database Schema and Tables

When CloudFormation created the Redshift cluster, it also created a new database, dev. Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL commands to create a new database schema, sensor, and six tables in the sensor schema.

Star Schema

The tables represent denormalized data, taken from one or more relational database sources. The tables form a star schema.  The star schema is widely used to develop data warehouses. The star schema consists of one or more fact tables referencing any number of dimension tables. The location, manufacturer, sensor, and history tables are dimension tables. The sensors table is a fact table.

In the diagram below, the foreign key relationships are virtual, not physical. The diagram was created using PyCharm’s schema visualization tool. Note the schema’s star shape. The message table is where the streaming IoT data will eventually be written. The message table is related to the sensors fact table through the common guid field.

schema-light-2

Sample Data to S3

Next, copy the sample data, included in the project, to the S3 data bucket created with CloudFormation. Each CSV-formatted data file corresponds to one of the tables we previously created. Since the bucket name is semi-random, we can use the AWS CLI and jq to get the bucket name, then use it to perform the copy commands.

The output from the AWS CLI should look similar to the following.

screen_shot_2020-03-02_at_7.18.22_pm

Sample Data to RedShift

Whereas a relational database, such as Amazon RDS is designed for online transaction processing (OLTP), Amazon Redshift is designed for online analytic processing (OLAP) and business intelligence applications. To write data to Redshift we typically use the COPY command versus frequent, individual INSERT statements, as with OLTP, which would be prohibitively slow. According to Amazon, the Redshift COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts.

In the following series of SQL statements, replace the placeholder, your_bucket_name, in five places with your S3 data bucket name. The bucket name will start with the prefix, redshift-stack-databucket. The bucket name can be found in the Outputs tab of the CloudFormation stack, redshift-stack. Next, replace the placeholder, cluster_permissions_role_arn, with the ARN (Amazon Resource Name) of the ClusterPermissionsRole. The ARN is formatted as follows, arn:aws:iam::your-account-id:role/ClusterPermissionsRole. The ARN can be found in the Outputs tab of the CloudFormation stack, redshift-stack.

Using the Redshift Query Editor or your SQL client of choice, execute the SQL statements to copy the sample data from S3 to each of the corresponding tables in the Redshift dev database. The TRUNCATE commands guarantee there is no previous sample data present in the tables.

Database Views

Next, create four Redshift database Views. These views may be used to analyze the data in Redshift, and later, in Amazon QuickSight.

  1. sensor_msg_detail: Returns aggregated sensor details, using the sensors fact table and all five dimension tables in a SQL Join.
  2. sensor_msg_count: Returns the number of messages received by Redshift, for each sensor.
  3. sensor_avg_temp: Returns the average temperature from each sensor, based on all the messages received from each sensor.
  4. sensor_avg_temp_current: View is identical for the previous view but limited to the last 30 minutes.

Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL statements.

At this point, you should have a total of six tables and four views in the sensor schema of the dev database in Redshift.

Test the System

With all the necessary AWS resources and Redshift database objects created and sample data in the Redshift database, we can test the system. The included Python script, kinesis_put_test_msg.py, will generate a single test message and send it to Kinesis Data Firehose. If everything is working, the message should be delivered from Kinesis Data Firehose to S3, then copied to Redshift, and appear in the message table.

Install the required Python packages and then execute the Python script.

Run the following SQL query to confirm the record is in the message table of the dev database. It will take at least one minute for the message to appear in Redshift.

Once the message is confirmed to be present in the message table, delete the record by truncating the table.

Streaming Data

Assuming the test message worked, we can proceed with simulating the streaming IoT sensor data. The included Python script, kinesis_put_streaming_data.py, creates six concurrent threads, representing six temperature sensors. The simulated data uses an algorithm that follows an oscillating sine wave or sinusoid, representing rising and falling temperatures. In the script, I have configured each thread to start with an arbitrary offset to add some randomness to the simulated data.

screen_shot_2020-03-04_at_9.27.33_pm

The variables within the script can be adjusted to shorten or lengthen the time it takes to stream the simulated data. By default, each of the six threads creates 400 messages per sensor, in one-minute increments. Including the offset start of each proceeding thread, the total runtime of the script is about 7.5 hours to generate 2,400 simulated IoT sensor temperature readings and push to Kinesis Data Firehose. Make sure you can guarantee you will maintain a connection to the Internet for the entire runtime of the script. I normally run the script in the background, from a small EC2 instance.

To use the Python script, execute either of the two following commands. Using the first command will run the script in the foreground. Using the second command will run the script in the background.

If you used the foreground command, you should see messages being generated by each thread and sent to Kinesis Data Firehose. Each message contains the GUID of the sensor, a timestamp, and a temperature reading.

screen_shot_2020-03-03_at_7.01.37_am

The messages are sent to Kinesis Data Firehose, which in turn writes the messages to S3. The messages are written in JSON format using GZIP compression. Below, we see an example of the GZIP compressed JSON files in S3. The JSON files are partitioned by year, month, day, and hour.

screen_shot_2020-03-04_at_11.04.36_pm

Confirm Data Streaming to Redshift

From the Amazon Kinesis Firehose Console Metrics tab, you should see incoming messages flowing to S3 and on to Redshift.

screen_shot_2020-03-03_at_6.17.14_pm

Executing the following SQL query should show an increasing number of messages.

How Near Real-time?

Earlier, we saw how the Amazon Kinesis Data Firehose delivery stream was configured to buffer data at the rate of 1 MB or 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written to S3. Each record in the message table has two timestamps. The first timestamp, ts, is when the temperature reading was recorded. The second timestamp, created, is when the message was written to Redshift, using the COPY command. We can calculate the delta in seconds between the two timestamps using the following SQL query in Redshift.

Using the results of the Redshift query, we can visualize the results in Amazon QuickSight. In my own tests, we see that for 2,400 messages, over approximately 7.5 hours, the minimum delay was 1 second, and a maximum delay was 64 seconds. Hence, near real-time, in this case, is about one minute or less, with an average latency of roughly 30 seconds.

latency

Analyzing the Data with Redshift

I suggest waiting at least thirty minutes for a significant number of messages copied into Redshift. With the data streaming into Redshift, execute each of the database views we created earlier. You should see the streaming message data, joined to the existing static data in RedShift. As data continues to stream into Redshift, the views will display different results based on the current message table contents.

Here, we see the first ten results of the sensor_msg_detail view.

Next, we see the results of the sensor_avg_temp view.

Amazon QuickSight

In a recent post, Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight: Part 2, I detailed getting started with Amazon QuickSight. In this post, I will assume you are familiar with QuickSight.

Amazon recently added a full set of aws quicksight APIs for interacting with QuickSight. Though, for this part of the demonstration, we will be working directly in the Amazon QuickSight Console, as opposed to the AWS CLI, AWS CDK, or CloudFormation.

Redshift Data Sets

To visualize the data from Amazon Redshift, we start by creating Data Sets in QuickSight. QuickSight supports a large number of data sources for creating data sets. We will use the Redshift data source. If you recall, we added an inbound rule for QuickSight, allowing us to connect to our Redshift cluster in us-east-1.

screen_shot_2020-03-02_at_10.21.35_pm

We will select the sensor schema, which is where the tables and views for this demonstration are located.

screen_shot_2020-03-02_at_10.21.49_pm

We can choose any of the tables or views in the Redshift dev database that we want to use for visualization.

screen_shot_2020-03-02_at_10.22.11_pm

Below, we see examples of two new data sets, shown in the QuickSight Data Prep Console. Note how QuickSight automatically recognizes field types, including dates, latitude, and longitude.

screen_shot_2020-03-02_at_10.23.12_pm

screen_shot_2020-03-02_at_10.59.32_pm

Visualizations

Using the data sets, QuickSight allows us to create a wide number of rich visualizations. Below, we see the simulated time-series data from the six temperature sensors.

screen_shot_2020-03-04_at_9.26.26_pm

Next, we see an example of QuickSight’s ability to show geospatial data. The Map shows the location of each sensor and the average temperature recorded by that sensor.

screen_shot_2020-03-04_at_9.26.51_pm

Cleaning Up

To remove the resources created for this post, use the following series of AWS CLI commands.

Conclusion

In this brief post, we have learned how streaming data can be analyzed in near real-time, in Amazon Redshift, using Amazon Kinesis Data Firehose. Further, we explored how the results of those analyses can be visualized in Amazon QuickSight. For customers that depend on a data warehouse for data analytics, but who also have streaming data sources, the use of Amazon Kinesis Data Firehose or Amazon Redshift Spectrum is an excellent choice.

This blog represents my own viewpoints and not of my employer, Amazon Web Services.

, , , , , , ,

Leave a comment

Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight: Part 2

Introduction

In part one, we learned how to ingest, transform, and enrich raw, semi-structured data, in multiple formats, using Amazon S3, AWS Glue, Amazon Athena, and AWS Lambda. We built an S3-based data lake and learned how AWS leverages open-source technologies, including Presto, Apache Hive, and Apache Parquet. In part two of this post, we will use the transformed and enriched data sources, stored in the data lake, to create compelling visualizations using Amazon QuickSight.

athena-glue-architecture-v2High-level AWS architecture diagram of the demonstration.

Background

If you recall the demonstration from part one of the post, we had adopted the persona of a large, US-based electric energy provider. The energy provider had developed and sold its next-generation Smart Electrical Monitoring Hub (Smart Hub) to residential customers. Customers can analyze their electrical usage with a fine level of granularity, per device and over time. The goal of the Smart Hub is to enable the customers, using data, to reduce their electrical costs. The provider benefits from a reduction in load on the existing electrical grid and a better distribution of daily electrical load as customers shift usage to off-peak times to save money.

Data Visualization and BI

The data analysis process in the demonstration was divided into four logical stages: 1) Raw Data Ingestion, 2) Data Transformation, 3) Data Enrichment, and 4) Data Visualization and Business Intelligence (BI).

athena-glue-0.pngFull data analysis workflow diagram (click to enlarge…)

In the final, Data Visualization and Business Intelligence (BI) stage, the enriched data is presented and analyzed. There are many enterprise-grade services available for data visualization and business intelligence, which integrate with Amazon Athena. Amazon services include Amazon QuickSight, Amazon EMR, and Amazon SageMaker. Third-party solutions from AWS Partners, many available on the AWS Marketplace, include Tableau, Looker, Sisense, and Domo.

In this demonstration, we will focus on Amazon QuickSight. Amazon QuickSight is a fully managed business intelligence (BI) service. QuickSight lets you create and publish interactive dashboards that include ML Insights. Dashboards can be accessed from any device, and embedded into your applications, portals, and websites. QuickSight serverlessly scales automatically from tens of users to tens of thousands without any infrastructure management.

Athena-Glue-4

Using QuickSight

QuickSight APIs

Amazon recently added a full set of aws quicksight APIs for interacting with QuickSight. For example, to preview the three QuickSight data sets created for this part of the demo, with the AWS CLI, we would use the list-data-sets comand.

To examine details of a single data set, with the AWS CLI, we would use the describe-data-set command.

QuickSight Console

However, for this final part of the demonstration, we will be working from the Amazon QuickSight Console, as opposed to the AWS CLI, AWS CDK, or CloudFormation templates.

Signing Up for QuickSight

To use Amazon QuickSight, you must sign up for QuickSight.

screen_shot_2020-01-02_at_9_07_00_pm

There are two Editions of Amazon QuickSight, Standard and Enterprise. For this demonstration, the Standard Edition will suffice.

screen_shot_2020-01-02_at_9_07_40_pm

QuickSight Data Sets

Amazon QuickSight uses Data Sets as the basis for all data visualizations. According to AWS, QuickSight data sets can be created from a wide variety of data sources, including Amazon RDS, Amazon Aurora, Amazon Redshift, Amazon Athena, and Amazon S3. You can also upload Excel spreadsheets or flat files (CSV, TSV, CLF, ELF, and JSON), connect to on-premises databases like SQL Server, MySQL, and PostgreSQL and import data from SaaS applications like Salesforce. Below, we see a list of the latest data sources available in the QuickSight New Data Set Console.

screen_shot_2020-01-12_at_8_18_05_pm_v2

Demonstration Data Sets

For the demonstration, I have created three QuickSight data sets, all based on Amazon Athena. You have two options when using Amazon Athena as a data source. The first option is to select a table from an AWS Glue Data Catalog database, such as the database we created in part one of the post, ‘smart_hub_data_catalog.’ The second option is to create a custom SQL query, based on one or more tables in an AWS Glue Data Catalog database.

screen_shot_2020-01-13_at_9_05_49_pm.png

Of the three data sets created for part two of this demonstration, two data sets use tables directly from the Data Catalog, including ‘etl_output_parquet’ and ‘electricity_rates_parquet.’ The third data set uses a custom SQL query, based on the single Data Catalog table, ‘smart_hub_locations_parquet.’ All three tables used to create the data sets represent the enriched, highly efficient Parquet-format data sources in the S3-based Data Lake.

screen_shot_2020-01-13_at_9_16_54_pm.png

Data Set Features

There are a large number of features available when creating and configuring data sets. We cannot possibly cover all of them in this post. Let’s look at three features: geospatial field types, calculated fields, and custom SQL.

Geospatial Data Types

QuickSight can intelligently detect common types of geographic fields in a data source and assign QuickSight geographic data type, including Country, County, City, Postcode, and State. QuickSight can also detect geospatial data, including Latitude and Longitude. We will take advantage of this QuickSight feature for our three data set’s data sources, including the State, Postcode, Latitude, and Longitude field types.

screen_shot_2020-01-13_at_9_53_19_pm.png

Calculated Fields

A commonly-used QuickSight data set feature is the ‘Calculated field.’ For the ‘etl_output_parquet’ data set, I have created a new field (column), cost_dollar.

screen_shot_2020-01-13_at_4_35_20_pm

The cost field is the electrical cost of the device, over a five minute time interval, in cents (¢). The calculated cost_dollar field is the quotient of the cost field divided by 100. This value represents the electrical cost of the device, over a five minute time interval, in dollars ($). This is a straightforward example. However, a calculated field can be very complex, built from multiple arithmetic, comparison, and conditional functions, string calculations, and data set fields.

screen_shot_2020-01-13_at_9_35_14_pm

Data set calculated fields can also be created and edited from the QuickSight Analysis Console (discussed later).

screen_shot_2020-01-13_at_4_45_47_pm.png

Custom SQL

The third QuickSight data set is based on an Amazon Athena custom SQL query.

Although you can write queries in the QuickSight Data Prep Console, I prefer to write custom Athena queries using the Athena Query Editor. Using the Editor, you can write, run, debug, and optimize queries to ensure they function correctly, first.

screen_shot_2020-01-13_at_12_50_14_pm

The Athena query can then be pasted into the Custom SQL window. Clicking ‘Finish’ in the window is the equivalent of ‘Run query’ in the Athena Query Editor Console. The query runs and returns data.

screen_shot_2020-01-12_at_8_12_44_pm

Similar to the Athena Query Editor, queries executed in the QuickSight Data Prep Console will show up in the Athena History tab, with a /* QuickSight */ comment prefix.

screen_shot_2020-01-14_at_9_10_16_pm

SPICE

You will notice the three QuickSight data sets are labeled, ‘SPICE.’ According to AWS, the acronym, SPICE, stands for ‘Super-fast, Parallel, In-memory, Calculation Engine.’ QuickSight’s in-memory calculation engine, SPICE, achieves blazing fast performance at scale. SPICE automatically replicates data for high availability allowing thousands of users to simultaneously perform fast, interactive analysis while shielding your underlying data infrastructure, saving you time and resources. With the Standard Edition of QuickSight, as the first Author, you get 1 GB of SPICE in-memory data for free.

QuickSight Analysis

The QuickSight Analysis Console is where Analyses are created. A specific QuickSight Analysis will contain a collection of data sets and data visualizations (visuals). Each visual is associated with a single data set.

Types of QuickSight Analysis visuals include: horizontal and vertical, single and stacked bar charts, line graphs, combination charts, area line charts, scatter plots, heat maps, pie and donut charts, tree maps, pivot tables, gauges, key performance indicators (KPI), geospatial diagrams, and word clouds. Individual visual titles, legends, axis, and other visual aspects can be easily modified. Visuals can contain drill-downs.

A data set’s fields can be modified from within the Analysis Console. Field types and formats, such as date, numeric, currency fields, can be customized for display. The Analysis can include a Title and subtitle. There are some customizable themes available to change the overall look of the Analysis.

screen_shot_2020-01-15_at_12_45_37_am.png

Analysis Filters

Data displayed in the visuals can be further shaped using a combination of Filters, Conditional formatting, and Parameters. Below, we see an example of a typical filter based on a range of dates and times. The data set contains two full days’ worth of data. Here, we are filtering the data to a 14-hour peak electrical usage period, between 8 AM and 10 PM on the same day, 12/21/2019.

screen_shot_2020-01-13_at_10_46_57_pm.png

Drill-Down, Drill-Up, Focus, and Exclude

According to AWS, all visual types except pivot tables offer the ability to create a hierarchy of fields for a visual element. The hierarchy lets you drill down or up to see data at different levels of the hierarchy. Focus allows you to concentrate on a single element within a hierarchy of fields. Exclude allows you to remove an element from a hierarchy of fields. Below, we see an example of all four of these features, available to apply to the ‘Central Air Conditioner’. Since the AC unit is the largest consumer of electricity on average per day, applying these filters to understand its impact on the overall electrical usage may be useful to an analysis. We can also drill down to minutes from hours or up to days from hours.

screen_shot_2020-01-15_at_12_56_54_am.png

Example QuickSight Analysis

A QuickSight Analysis is shared by the Analysis Author as a QuickSight Dashboard. Below, we see an example of a QuickSight Dashboard, built and shared for this demonstration. The ‘Residential Electrical Usage Analysis’ is built from the three data sets created earlier. From those data sets, we have constructed several visuals, including a geospatial diagram, donut chart, heat map, KPI, combination chart, stacked vertical bar chart, and line graph. Each visual’s title, layout, and field display has all customized. The data displayed in the visuals have been filtered differently, including by date and time, by customer id (loc_id), and by state. Conditional formatting is used to enhance the visual appearance of visuals, such as the ‘Total Electrical Cost’ KPI.

screen_shot_2020-01-13_at_7_57_47_pm_v4.png

Conclusion

In part one, we learned how to ingest, transform, and enrich raw, semi-structured data, in multiple formats, using Amazon S3, AWS Glue, Amazon Athena, and AWS Lambda. We built an S3-based data lake and learned how AWS leverages open-source technologies, including Presto, Apache Hive, and Apache Parquet. In part two of this post, we used the transformed and enriched datasets, stored in the data lake, to create compelling visualizations using Amazon QuickSight.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , ,

2 Comments

Executing Amazon Athena Queries from JetBrains PyCharm

 

Amazon Athena

According to Amazon, Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Amazon Athena supports and works with a variety of popular data file formats, including CSV, JSON, Apache ORC, Apache Avro, and Apache Parquet.

The underlying technology behind Amazon Athena is Presto, the popular, open-source distributed SQL query engine for big data, created by Facebook. According to AWS, the Athena query engine is based on Presto 0.172. Athena is ideal for quick, ad-hoc querying, but it can also handle complex analysis, including large joins, window functions, and arrays. In addition to Presto, Athena also uses Apache Hive to define tables.

screen_shot_2020-01-05_at_10_32_25_am

Athena Query Editor

In the previous post, Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight, we used the Athena Query Editor to construct and test SQL queries against semi-structured data in an S3-based Data Lake. The Athena Query Editor has many of the basic features Data Engineers and Analysts expect, including SQL syntax highlighting, code auto-completion, and query formatting. Queries can be run directly from the Editor, saved for future reference, and query results downloaded. The Editor can convert SELECT queries to CREATE TABLE AS (CTAS) and CREATE VIEW AS statements. Access to AWS Glue data sources is also available from within the Editor.

Full-Featured IDE

Although the Athena Query Editor is fairly functional, many Engineers perform a majority of their software development work in a fuller-featured IDE. The choice of IDE may depend on one’s predominant programming language. According to the PYPL Index, the ten most popular, current IDEs are:

  1. Microsoft Visual Studio
  2. Android Studio
  3. Eclipse
  4. Visual Studio Code
  5. Apache NetBeans
  6. JetBrains PyCharm
  7. JetBrains IntelliJ
  8. Apple Xcode
  9. Sublime Text
  10. Atom

Within the domains of data science, big data analytics, and data analysis, languages such as SQL, Python, Java, Scala, and R are common. Although I work in a variety of IDEs, my go-to choices are JetBrains PyCharm for Python (including for PySpark and Jupyter Notebook development) and JetBrains IntelliJ for Java and Scala (including Apache Spark development). Both these IDEs also support many common SQL-based technologies, out-of-the-box, and are easily extendable to add new technologies.

jetbrains.png

Athena Integration with PyCharm

Utilizing the extensibility of the JetBrains suite of professional development IDEs, it is simple to add Amazon Athena to the list of available database drivers and make JDBC (Java Database Connectivity) connections to Athena instances on AWS.

Downloading the Athena JDBC Driver

To start, download the Athena JDBC Driver from Amazon. There are two versions, based on your choice of Java JDKs. Considering Java 8 was released six years ago (March 2014), most users will likely want the AthenaJDBC42-2.0.9.jar is compatible with JDBC 4.2 and JDK 8.0 or later.

screen_shot_2020-01-06_at_9_28_14_pm

Installation Guide

AWS also supplies a JDBC Driver Installation and Configuration Guide. The guide, as well as the Athena JDBC and ODBC Drivers, are produced by Simba Technologies (acquired by Magnitude Software). Instructions for creating an Athena Driver starts on page 23.

screen_shot_2020-01-06_at_9_28_27_pm

Creating a New Athena Driver

From PyCharm’s Database Tool Window, select the Drivers dialog box, select the downloaded Athena JDBC Driver JAR. Select com.simba.athena.jdbc.Driver in the Class dropdown. Name the Driver, ‘Amazon Athena.’

screen_shot_2020-01-06_at_10_06_58_pm

You can configure the Athena Driver further, using the Options and Advanced tabs.

screen_shot_2020-01-11_at_8.25.22_pm

Creating a New Athena Data Source

From PyCharm’s Database Tool Window, select the Data Source dialog box to create a new connection to your Athena instance. Choose ‘Amazon Athena’ from the list of available Database Drivers.

screen_shot_2020-01-08_at_3_47_48_pm

You will need four items to create an Athena Data Source:

  1. Your IAM User Access Key ID
  2. Your IAM User Secret Access Key
  3. The AWS Region of your Athena instance (e.g., us-east-1)
  4. An existing S3 bucket location to store query results

The Athena connection URL is a combination of the AWS Region and the S3 bucket, items 3 and 4, above. The format of the Athena connection URL is as follows.

jdbc:awsathena://AwsRegion=your-region;S3OutputLocation=s3://your-bucket-name/query-results-path

Give the new Athena Data Source a logical Name, input the User (Access Key ID), Password (Secret Access Key), and the Athena URL. To test the Athena Data Source, use the ‘Test Connection’ button.

screen_shot_2020-01-06_at_10_10_03_pm

You can create multiple Athena Data Sources using the Athena Driver. For example, you may have separate Development, Test, and Production instances of Athena, each in a different AWS Account.

Data Access

Once a successful connection has been made, switching to the Schemas tab, you should see a list of available AWS Glue Data Catalog databases. Below, we see the AWS Glue Catalog, which we created in the prior post. This Glue Data Catalog database contains ten metadata tables, each corresponding to a semi-structured, file-based data source in an S3-based data lake.

In the example below, I have chosen to limit the new Athena Data Source to a single Data Catalog database, to which the Data Source’s IAM User has access. Applying the core AWS security principle of granting least privilege, IAM Users should only have the permissions required to perform a specific set of approved tasks. This principle applies to the Glue Data Catalog databases, metadata tables, and the underlying S3 data sources.

screen_shot_2020-01-06_at_10_11_03_pm.png

Querying Athena from PyCharm

From within the PyCharm’s Database Tool Window, you should now see a list of the metadata tables defined in your AWS Glue Data Catalog database(s), as well as the individual columns within each table.

screen_shot_2020-01-06_at_10_12_18_pm

Similar to the Athena Query Editor, you can write SQL queries against the database tables in PyCharm. Like the Athena Query Editor, PyCharm has standard features SQL syntax highlighting, code auto-completion, and query formatting. Right-click on the Athena Data Source and choose New, then Console, to start.

screen_shot_2020-01-08_at_3_46_01_pm

Be mindful when writing queries and searching the Internet for SQL references, the Athena query engine is based on Presto 0.172. The current version of Presto, 0.234, is more than 50 releases ahead of the current Athena version. Both Athena and Presto functionality continue to change and diverge. There are also additional considerations and limitations for SQL queries in Athena to be aware of.

Whereas the Athena Query Editor is limited to only one query per query tab, in PyCharm, we can write and run multiple SQL queries in the same console window and have multiple console sessions opened to Athena at the same time.

screen_shot_2020-01-06_at_10_41_05_pm

By default, PyCharm’s query results are limited to the first ten rows of data. The number of rows displayed, as well as many other preferences, can be changed in the PyCharm’s Database Preferences dialog box.

screen_shot_2020-01-06_at_10_15_34_pm

Saving Queries and Exporting Results

In PyCharm, Athena queries can be saved as part of your PyCharm projects, as .sql files. Whereas the Athena Query Editor is limited to CSV, in PyCharm, query results can be exported in a variety of standard data file formats.

screen_shot_2020-01-08_at_3_43_39_pm

Athena Query History

All Athena queries ran from PyCharm are recorded in the History tab of the Athena Console. Although PyCharm shows query run times, the Athena History tab also displays the amount of data scanned. Knowing the query run time and volume of data scanned is useful when performance tuning queries.

screen_shot_2020-01-07_at_11_12_46_pm

Other IDEs

The technique shown for JetBrains PyCharm can also be applied to other JetBrains products, including GoLand, DataGrip, PhpStorm, and IntelliJ (shown below).

screen-shot-2020-01-08-at-5_35_57-pm.png

This blog represents my own view points and not of my employer, Amazon Web Services.

, , , , , ,

Leave a comment

Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight: Part 1

Introduction

According to Wikipedia, data analysis is “a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusion, and supporting decision-making.” In this two-part post, we will explore how to get started with data analysis on AWS, using the serverless capabilities of Amazon Athena, AWS Glue, Amazon QuickSight, Amazon S3, and AWS Lambda. We will learn how to use these complementary services to transform, enrich, analyze, and visualize semi-structured data.

Data Analysis—discovering useful information, informing conclusion, and supporting decision-making. –Wikipedia

In part one, we will begin with raw, semi-structured data in multiple formats. We will discover how to ingest, transform, and enrich that data using Amazon S3, AWS Glue, Amazon Athena, and AWS Lambda. We will build an S3-based data lake, and learn how AWS leverages open-source technologies, such as Presto, Apache Hive, and Apache Parquet. In part two, we will learn how to further analyze and visualize the data using Amazon QuickSight. Here’s a quick preview of what we will build in part one of the post.

Demonstration

In this demonstration, we will adopt the persona of a large, US-based electric energy provider. The energy provider has developed its next-generation Smart Electrical Monitoring Hub (Smart Hub). They have sold the Smart Hub to a large number of residential customers throughout the United States. The hypothetical Smart Hub wirelessly collects detailed electrical usage data from individual, smart electrical receptacles and electrical circuit meters, spread throughout the residence. Electrical usage data is encrypted and securely transmitted from the customer’s Smart Hub to the electric provider, who is running their business on AWS.

Customers are able to analyze their electrical usage with fine granularity, per device, and over time. The goal of the Smart Hub is to enable the customers, using data, to reduce their electrical costs. The provider benefits from a reduction in load on the existing electrical grid and a better distribution of daily electrical load as customers shift usage to off-peak times to save money.

screen_shot_2020-01-13_at_7_57_47_pm_v4.pngPreview of post’s data in Amazon QuickSight.

The original concept for the Smart Hub was developed as part of a multi-day training and hackathon, I recently attended with an AWSome group of AWS Solutions Architects in San Francisco. As a team, we developed the concept of the Smart Hub integrated with a real-time, serverless, streaming data architecture, leveraging AWS IoT Core, Amazon Kinesis, AWS Lambda, and Amazon DynamoDB.

SA_Team_PhotoFrom left: Bruno Giorgini, Mahalingam (‘Mahali’) Sivaprakasam, Gary Stafford, Amit Kumar Agrawal, and Manish Agarwal.

This post will focus on data analysis, as opposed to the real-time streaming aspect of data capture or how the data is persisted on AWS.

athena-glue-architecture-v2High-level AWS architecture diagram of the demonstration.

Featured Technologies

The following AWS services and open-source technologies are featured prominently in this post.

Athena-Glue-v2.png

Amazon S3-based Data Lake

Screen Shot 2020-01-02 at 5.09.05 PMAn Amazon S3-based Data Lake uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability, from gigabytes to petabytes of content. Amazon S3 provides ‘11 nines’ (99.999999999%) durability. It has scalable performance, ease-of-use features, and native encryption and access control capabilities.

AWS Glue

Screen Shot 2020-01-02 at 5.11.37 PMAWS Glue is a fully managed extract, transform, and load (ETL) service to prepare and load data for analytics. AWS Glue discovers your data and stores the associated metadata (e.g., table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

AWS Glue Data Catalog

Screen Shot 2020-01-02 at 5.13.01 PM.pngThe AWS Glue Data Catalog is an Apache Hive Metastore compatible, central repository to store structural and operational metadata for data assets. For a given data set, store table definition, physical location, add business-relevant attributes, as well as track how the data has changed over time.

AWS Glue Crawler

Screen Shot 2020-01-02 at 5.14.57 PMAn AWS Glue Crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to an existing table, and new versions of table definitions. You can even customize Glue Crawlers to classify your own file types.

AWS Glue ETL Job

Screen Shot 2020-01-02 at 5.11.37 PMAn AWS Glue ETL Job is the business logic that performs extract, transform, and load (ETL) work in AWS Glue. When you start a job, AWS Glue runs a script that extracts data from sources, transforms the data, and loads it into targets. AWS Glue generates a PySpark or Scala script, which runs on Apache Spark.

Amazon Athena

Screen Shot 2020-01-02 at 5.17.49 PMAmazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena supports and works with a variety of standard data formats, including CSV, JSON, Apache ORC, Apache Avro, and Apache Parquet. Athena is integrated, out-of-the-box, with AWS Glue Data Catalog. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

The underlying technology behind Amazon Athena is Presto, the open-source distributed SQL query engine for big data, created by Facebook. According to the AWS, the Athena query engine is based on Presto 0.172 (released April 9, 2017). In addition to Presto, Athena uses Apache Hive to define tables.

Amazon QuickSight

Screen Shot 2020-01-02 at 5.18.40 PMAmazon QuickSight is a fully managed business intelligence (BI) service. QuickSight lets you create and publish interactive dashboards that can then be accessed from any device, and embedded into your applications, portals, and websites.

AWS Lambda

Screen Shot 2020-01-02 at 5.25.57 PMAWS Lambda automatically runs code without requiring the provisioning or management servers. AWS Lambda automatically scales applications by running code in response to triggers. Lambda code runs in parallel. With AWS Lambda, you are charged for every 100ms your code executes and the number of times your code is triggered. You pay only for the compute time you consume.

Smart Hub Data

Everything in this post revolves around data. For the post’s demonstration, we will start with four categories of raw, synthetic data. Those data categories include Smart Hub electrical usage data, Smart Hub sensor mapping data, Smart Hub residential locations data, and electrical rate data. To demonstrate the capabilities of AWS Glue to handle multiple data formats, the four categories of raw data consist of three distinct file formats: XML, JSON, and CSV. I have attempted to incorporate as many ‘real-world’ complexities into the data without losing focus on the main subject of the post. The sample datasets are intentionally small to keep your AWS costs to a minimum for the demonstration.

To further reduce costs, we will use a variety of data partitioning schemes. According to AWS, by partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. We have very little data for the demonstration, in which case partitioning may negatively impact query performance. However, in a ‘real-world’ scenario, there would be millions of potential residential customers generating terabytes of data. In that case, data partitioning would be essential for both cost and performance.

Smart Hub Electrical Usage Data

The Smart Hub’s time-series electrical usage data is collected from the customer’s Smart Hub. In the demonstration’s sample electrical usage data, each row represents a completely arbitrary five-minute time interval. There are a total of ten electrical sensors whose electrical usage in kilowatt-hours (kW) is recorded and transmitted. Each Smart Hub records and transmits electrical usage for 10 device sensors, 288 times per day (24 hr / 5 min intervals), for a total of 2,880 data points per day, per Smart Hub. There are two days worth of usage data for the demonstration, for a total of 5,760 data points. The data is stored in JSON Lines format. The usage data will be partitioned in the Amazon S3-based data lake by date (e.g., ‘dt=2019-12-21’).

Note the electrical usage data contains nested data. The electrical usage for each of the ten sensors is contained in a JSON array, within each time series entry. The array contains ten numeric values of type, double.

Real data is often complex and deeply nested. Later in the post, we will see that AWS Glue can map many common data types, including nested data objects, as illustrated below.

screen_shot_2020-01-05_at_7_46_19_am

Smart Hub Sensor Mappings

The Smart Hub sensor mappings data maps a sensor column in the usage data (e.g., ‘s_01’ to the corresponding actual device (e.g., ‘Central Air Conditioner’). The data contains the device location, wattage, and the last time the record was modified. The data is also stored in JSON Lines format. The sensor mappings data will be partitioned in the Amazon S3-based data lake by the state of the residence (e.g., ‘state=or’ for Oregon).

Smart Hub Locations

The Smart Hub locations data contains the geospatial coordinates, home address, and timezone for each residential Smart Hub. The data is stored in CSV format. The data for the four cities included in this demonstration originated from OpenAddresses, ‘the free and open global address collection.’ There are approximately 4k location records. The location data will be partitioned in the Amazon S3-based data lake by the state of the residence where the Smart Hub is installed (e.g., ‘state=or’ for Oregon).

Electrical Rates

Lastly, the electrical rate data contains the cost of electricity. In this demonstration, the assumption is that the rate varies by state, by month, and by the hour of the day. The data is stored in XML, a data export format still common to older, legacy systems. The electrical rate data will not be partitioned in the Amazon S3-based data lake.

Data Analysis Process

Due to the number of steps involved in the data analysis process in the demonstration, I have divided the process into four logical stages: 1) Raw Data Ingestion, 2) Data Transformation, 3) Data Enrichment, and 4) Data Visualization and Business Intelligence (BI).

athena-glue-0.pngFull data analysis workflow diagram (click to enlarge…)

Raw Data Ingestion

In the Raw Data Ingestion stage, semi-structured CSV-, XML-, and JSON-format data files are copied to a secure Amazon Simple Storage Service (S3) bucket. Within the bucket, data files are organized into folders based on their physical data structure (schema). Due to the potentially unlimited number of data files, files are further organized (partitioned) into subfolders. Organizational strategies for data files are based on date, time, geographic location, customer id, or other common data characteristics.

This collection of semi-structured data files, S3 buckets, and partitions form what is referred to as a Data Lake. According to AWS, a data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

A series of AWS Glue Crawlers process the raw CSV-, XML-, and JSON-format files, extracting metadata, and creating table definitions in the AWS Glue Data Catalog. According to AWS, an AWS Glue Data Catalog contains metadata tables, where each table specifies a single data store.

Athena-Glue-1

Data Transformation

In the Data Transformation stage, the raw data in the previous stage is transformed. Data transformation may include both modifying the data and changing the data format. Data modifications include data cleansing, re-casting data types, changing date formats, field-level computations, and field concatenation.

The data is then converted from CSV-, XML-, and JSON-format to Apache Parquet format and written back to the Amazon S3-based data lake. Apache Parquet is a compressed, efficient columnar storage format. Amazon Athena, like many Cloud-based services, charges you by the amount of data scanned per query. Hence, using data partitioning, bucketing, compression, and columnar storage formats, like Parquet, will reduce query cost.

Lastly, the transformed Parquet-format data is cataloged to new tables, alongside the raw CSV, XML, and JSON data, in the Glue Data Catalog.

Athena-Glue-2

Data Enrichment

According to ScienceDirect, data enrichment or augmentation is the process of enhancing existing information by supplementing missing or incomplete data. Typically, data enrichment is achieved by using external data sources, but that is not always the case.

Data Enrichment—the process of enhancing existing information by supplementing missing or incomplete data. –ScienceDirect

In the Data Enrichment stage, the Parquet-format Smart Hub usage data is augmented with related data from the three other data sources: sensor mappings, locations, and electrical rates. The customer’s Smart Hub usage data is enriched with the customer’s device types, the customer’s timezone, and customer’s electricity cost per monitored period based on the customer’s geographic location and time of day.

Athena-Glue-3a

Once the data is enriched, it is converted to Parquet and optimized for query performance, stored in the data lake, and cataloged. At this point, the original CSV-, XML-, and JSON-format raw data files, the transformed Parquet-format data files, and the Parquet-format enriched data files are all stored in the Amazon S3-based data lake and cataloged in the Glue Data Catalog.

Athena-Glue-3b

Data Visualization

In the final Data Visualization and Business Intelligence (BI) stage, the enriched data is presented and analyzed. There are many enterprise-grade services available for visualization and Business Intelligence, which integrate with Athena. Amazon services include Amazon QuickSight, Amazon EMR, and Amazon SageMaker. Third-party solutions from AWS Partners, available on the AWS Marketplace, include Tableau, Looker, Sisense, and Domo. In this demonstration, we will focus on Amazon QuickSight.

Athena-Glue-4

Getting Started

Requirements

To follow along with the demonstration, you will need an AWS Account and a current version of the AWS CLI. To get the most from the demonstration, you should also have Python 3 and jq installed in your work environment.

Source Code

All source code for this post can be found on GitHub. Use the following command to clone a copy of the project.

Source code samples in this post are displayed as GitHub Gists, which will not display correctly on some mobile and social media browsers.

TL;DR?

Just want the jump in without reading the instructions? All the AWS CLI commands, found within the post, are consolidated in the GitHub project’s README file.

CloudFormation Stack

To start, create the ‘smart-hub-athena-glue-stack’ CloudFormation stack using the smart-hub-athena-glue.yml template. The template will create (3) Amazon S3 buckets, (1) AWS Glue Data Catalog Database, (5) Data Catalog Database Tables, (6) AWS Glue Crawlers, (1) AWS Glue ETL Job, and (1) IAM Service Role for AWS Glue.

Make sure to change the DATA_BUCKET, SCRIPT_BUCKET, and LOG_BUCKET variables, first, to your own unique S3 bucket names. I always suggest using the standard AWS 3-part convention of 1) descriptive name, 2) AWS Account ID or Account Alias, and 3) AWS Region, to name your bucket (e.g. ‘smart-hub-data-123456789012-us-east-1’).

Raw Data Files

Next, copy the raw CSV-, XML-, and JSON-format data files from the local project to the DATA_BUCKET S3 bucket (steps 1a-1b in workflow diagram). These files represent the beginnings of the S3-based data lake. Each category of data uses a different strategy for organizing and separating the files. Note the use of the Apache Hive-style partitions (e.g., /smart_hub_data_json/dt=2019-12-21). As discussed earlier, the assumption is that the actual, large volume of data in the data lake would necessitate using partitioning to improve query performance.

Confirm the contents of the DATA_BUCKET S3 bucket with the following command.

There should be a total of (14) raw data files in the DATA_BUCKET S3 bucket.

Lambda Functions

Next, package the (5) Python3.8-based AWS Lambda functions for deployment.

Copy the five Lambda packages to the SCRIPT_BUCKET S3 bucket. The ZIP archive Lambda packages are accessed by the second CloudFormation stack, smart-hub-serverless. This CloudFormation stack, which creates the Lambda functions, will fail to deploy if the packages are not found in the SCRIPT_BUCKET S3 bucket.

I have chosen to place the packages in a different S3 bucket then the raw data files. In a real production environment, these two types of files would be separated, minimally, into separate buckets for security. Remember, only data should go into the data lake.

Create the second ‘smart-hub-lambda-stack’ CloudFormation stack using the smart-hub-lambda.yml CloudFormation template. The template will create (5) AWS Lambda functions and (1) Lambda execution IAM Service Role.

At this point, we have deployed all of the AWS resources required for the demonstration using CloudFormation. We have also copied all of the raw CSV-, XML-, and JSON-format data files in the Amazon S3-based data lake.

AWS Glue Crawlers

If you recall, we created five tables in the Glue Data Catalog database as part of the CloudFormation stack. One table for each of the four raw data types and one table to hold temporary ELT data later in the demonstration. To confirm the five tables were created in the Glue Data Catalog database, use the Glue Data Catalog Console, or run the following AWS CLI / jq command.

The five data catalog tables should be as follows.

We also created six Glue Crawlers as part of the CloudFormation template. Four of these Crawlers are responsible for cataloging the raw CSV-, XML-, and JSON-format data from S3 into the corresponding, existing Glue Data Catalog database tables. The Crawlers will detect any new partitions and add those to the tables as well. Each Crawler corresponds to one of the four raw data types. Crawlers can be scheduled to run periodically, cataloging new data and updating data partitions. Crawlers will also create a Data Catalog database tables. We use Crawlers to create new tables, later in the post.

Run the four Glue Crawlers using the AWS CLI (step 1c in workflow diagram).

You can check the Glue Crawler Console to ensure the four Crawlers finished successfully.

screen_shot_2020-01-03_at_3_05_29_pm

Alternately, use another AWS CLI / jq command.

When complete, all Crawlers should all be in a state of ‘Still Estimating = false’ and ‘TimeLeftSeconds = 0’. In my experience, the Crawlers can take up one minute to start, after the estimation stage, and one minute to stop when complete.

Successfully running the four Crawlers completes the Raw Data Ingestion stage of the demonstration.

Converting to Parquet with CTAS

With the Raw Data Ingestion stage completed, we will now transform the raw Smart Hub usage data, sensor mapping data, and locations data into Parquet-format using three AWS Lambda functions. Each Lambda subsequently calls Athena, which executes a CREATE TABLE AS SELECT SQL statement (aka CTAS) . Each Lambda executes a similar command, varying only by data source, data destination, and partitioning scheme. Below, is an example of the command used for the Smart Hub electrical usage data, taken from the Python-based Lambda, athena-json-to-parquet-data/index.py.

This compact, yet powerful CTAS statement converts a copy of the raw JSON- and CSV-format data files into Parquet-format, and partitions and stores the resulting files back into the S3-based data lake. Additionally, the CTAS SQL statement catalogs the Parquet-format data files into the Glue Data Catalog database, into new tables. Unfortunately, this method will not work for the XML-format raw data files, which we will tackle next.

The five deployed Lambda functions should be visible from the Lambda Console’s Functions tab.

screen_shot_2020-01-04_at_5_57_31_pm

Invoke the three Lambda functions using the AWS CLI. (part of step 2a in workflow diagram).

Here is an example of the same CTAS command, shown above for the Smart Hub electrical usage data, as it is was executed successfully by Athena.

We can view any Athena SQL query from the Athena Console’s History tab. Clicking on a query (in pink) will copy it to the Query Editor tab and execute it. Below, we see the three SQL statements executed by the Lamba functions.

screen_shot_2020-01-04_at_7_08_32_pm

AWS Glue ETL Job for XML

If you recall, the electrical rate data is in XML format. The Lambda functions we just executed, converted the CSV and JSON data to Parquet using Athena. Currently, unlike CSV, JSON, ORC, Parquet, and Avro, Athena does not support the older XML data format. For the XML data files, we will use an AWS Glue ETL Job to convert the XML data to Parquet. The Glue ETL Job is written in Python and uses Apache Spark, along with several AWS Glue PySpark extensions. For this job, I used an existing script created in the Glue ETL Jobs Console as a base, then modified the script to meet my needs.

The three Python command-line arguments the script expects (lines 10–12, above) are defined in the CloudFormation template, smart-hub-athena-glue.yml. Below, we see them on lines 10–12 of the CloudFormation snippet. They are injected automatically when the job is run and can be overridden from the command line when starting the job.

First, copy the Glue ETL Job Python script to the SCRIPT_BUCKET S3 bucket.

Next, start the Glue ETL Job (part of step 2a in workflow diagram). Although the conversion is a relatively simple set of tasks, the creation of the Apache Spark environment, to execute the tasks, will take several minutes. Whereas the Glue Crawlers took about 2 minutes on average, the Glue ETL Job could take 10–15 minutes in my experience. The actual execution time only takes about 1–2 minutes of the 10–15 minutes to complete. In my opinion, waiting up to 15 minutes is too long to be viable for ad-hoc jobs against smaller datasets; Glue ETL Jobs are definitely targeted for big data.

To check on the status of the job, use the Glue ETL Jobs Console, or use the AWS CLI.

When complete, you should see results similar to the following. Note the ‘JobRunState’ is ‘SUCCEEDED.’ This particular job ran for a total of 14.92 minutes, while the actual execution time was 2.25 minutes.

The job’s progress and the results are also visible in the AWS Glue Console’s ETL Jobs tab.

screen_shot_2020-01-04_at_7_42_51_pm

Detailed Apache Spark logs are also available in CloudWatch Management Console, which is accessible directly from the Logs link in the AWS Glue Console’s ETL Jobs tab.

screen_shot_2020-01-04_at_7_44_08_pm

The last step in the Data Transformation stage is to convert catalog the Parquet-format electrical rates data, created with the previous Glue ETL Job, using yet another Glue Crawler (part of step 2b in workflow diagram). Start the following Glue Crawler to catalog the Parquet-format electrical rates data.

This concludes the Data Transformation stage. The raw and transformed data is in the data lake, and the following nine tables should exist in the Glue Data Catalog.

If we examine the tables, we should observe the data partitions we used to organize the data files in the Amazon S3-based data lake are contained in the table metadata. Below, we see the four partitions, based on state, of the Parquet-format locations data.

screen_shot_2020-01-05_at_7_45_46_am

Data Enrichment

To begin the Data Enrichment stage, we will invoke the AWS Lambda, athena-complex-etl-query/index.py. This Lambda accepts input parameters (lines 28–30, below), passed in the Lambda handler’s event parameter. The arguments include the Smart Hub ID, the start date for the data requested, and the end date for the data requested. The scenario for the demonstration is that a customer with the location id value, using the electrical provider’s application, has requested data for a particular range of days (start date and end date), to visualize and analyze.

The Lambda executes a series of Athena INSERT INTO SQL statements, one statement for each of the possible Smart Hub connected electrical sensors, s_01 through s_10, for which there are values in the Smart Hub electrical usage data. Amazon just released the Amazon Athena INSERT INTO a table using the results of a SELECT query capability in September 2019, an essential addition to Athena. New Athena features are listed in the release notes.

Here, the SELECT query is actually a series of chained subqueries, using Presto SQL’s WITH clause capability. The queries join the Parquet-format Smart Hub electrical usage data sources in the S3-based data lake, with the other three Parquet-format, S3-based data sources: sensor mappings, locations, and electrical rates. The Parquet-format data is written as individual files to S3 and inserted into the existing ‘etl_tmp_output_parquet’ Glue Data Catalog database table. Compared to traditional relational database-based queries, the capabilities of Glue and Athena to enable complex SQL queries across multiple semi-structured data files, stored in S3, is truly amazing!

The capabilities of Glue and Athena to enable complex SQL queries across multiple semi-structured data files, stored in S3, is truly amazing!

Below, we see the SQL statement starting on line 43.

Below, is an example of one of the final queries, for the s_10 sensor, as executed by Athena. All the input parameter values, Python variables, and environment variables have been resolved into the query.

Along with enriching the data, the query performs additional data transformation using the other data sources. For example, the Unix timestamp is converted to a localized timestamp containing the date and time, according to the customer’s location (line 7, above). Transforming dates and times is a frequent, often painful, data analysis task. Another example of data enrichment is the augmentation of the data with a new, computed column. The column’s values are calculated using the values of two other columns (line 33, above).

Invoke the Lambda with the following three parameters in the payload (step 3a in workflow diagram).

The ten INSERT INTO SQL statement’s result statuses (one per device sensor) are visible from the Athena Console’s History tab.

screen_shot_2020-01-05_at_9_17_23_pm

Each Athena query execution saves that query’s results to the S3-based data lake as individual, uncompressed Parquet-format data files. The data is partitioned in the Amazon S3-based data lake by the Smart Meter location ID (e.g., ‘loc_id=b6a8d42425fde548’).

Below is a snippet of the enriched data for a customer’s clothes washer (sensor ‘s_04’). Note the timestamp is now an actual date and time in the local timezone of the customer (e.g., ‘2019-12-21 20:10:00.000’). The sensor ID (‘s_04’) is replaced with the actual device name (‘Clothes Washer’). The location of the device (‘Basement’) and the type of electrical usage period (e.g. ‘peak’ or ‘partial-peak’) has been added. Finally, the cost column has been computed.

To transform the enriched CSV-format data to Parquet-format, we need to catalog the CSV-format results using another Crawler, first (step 3d in workflow diagram).

Optimizing Enriched Data

The previous step created enriched Parquet-format data. However, this data is not as optimized for query efficiency as it should be. Using the Athena INSERT INTO WITH SQL statement, allowed the data to be partitioned. However, the method does not allow the Parquet data to be easily combined into larger files and compressed. To perform both these optimizations, we will use one last Lambda, athena-parquet-to-parquet-elt-data/index.py. The Lambda will create a new location in the Amazon S3-based data lake, containing all the enriched data, in a single file and compressed using Snappy compression.

The resulting Parquet file is visible in the S3 Management Console.

screen_shot_2020-01-04_at_6_07_23_pm

The final step in the Data Enrichment stage is to catalog the optimized Parquet-format enriched ETL data. To catalog the data, run the following Glue Crawler (step 3i in workflow diagram

Final Data Lake and Data Catalog

We should now have the following ten top-level folders of partitioned data in the S3-based data lake. The ‘tmp’ folder may be ignored.

Similarly, we should now have the following ten corresponding tables in the Glue Data Catalog. Use the AWS Glue Console to confirm the tables exist.

screen_shot_2020-01-04_at_8_30_50_pm

Alternately, use the following AWS CLI / jq command to list the table names.

‘Unknown’ Bug

You may have noticed the four tables created with the AWS Lambda functions, using the CTAS SQL statement, erroneously have the ‘Classification’ of ‘Unknown’ as opposed to ‘parquet’. I am not sure why, I believe it is a possible bug with the CTAS feature. It seems to have no adverse impact on the table’s functionality. However, to fix the issue, run the following set of commands. This aws glue update-table hack will switch the table’s ‘Classification’ to ‘parquet’.

The results of the fix may be seen from the AWS Glue Console. All ten tables are now classified correctly.

screen_shot_2020-01-05_at_11_43_50_pm

Explore the Data

Before starting to visualize and analyze the data with Amazon QuickSight, try executing a few Athena queries against the tables in the Glue Data Catalog database, using the Athena Query Editor. Working in the Editor is the best way to understand the data, learn Athena, and debug SQL statements and queries. The Athena Query Editor has convenient developer features like SQL auto-complete and query formatting capabilities.

Be mindful when writing queries and searching the Internet for SQL references, the Athena query engine is based on Presto 0.172. The current version of Presto, 0.229, is more than 50 releases ahead of the current Athena version. Both Athena and Presto functionality has changed and diverged. There are additional considerations and limitations for SQL queries in Athena to be aware of.

screen_shot_2020-01-05_at_10_32_25_am

Here are a few simple, ad-hoc queries to run in the Athena Query Editor.

Cleaning Up

You may choose to save the AWS resources created in part one of this demonstration, to be used in part two. Since you are not actively running queries against the data, ongoing AWS costs will be minimal. If you eventually choose to clean up the AWS resources created in part one of this demonstration, execute the following AWS CLI commands. To avoid failures, make sure each command completes before running the subsequent command. You will need to confirm the CloudFormation stacks are deleted using the AWS CloudFormation Console or the AWS CLI. These commands will not remove Amazon QuickSight data sets, analyses, and dashboards created in part two. However, deleting the AWS Glue Data Catalog and the underlying data sources will impact the ability to visualize the data in QuickSight.

Part Two

In part one, starting with raw, semi-structured data in multiple formats, we learned how to ingest, transform, and enrich that data using Amazon S3, AWS Glue, Amazon Athena, and AWS Lambda. We built an S3-based data lake and learned how AWS leverages open-source technologies, including Presto, Apache Hive, and Apache Parquet. In part two of this post, we will use the transformed and enriched datasets, stored in the data lake, to create compelling visualizations using Amazon QuickSight.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , ,

2 Comments

Getting Started with Apache Zeppelin on Amazon EMR, using AWS Glue, RDS, and S3: Part 2

Introduction

In Part 1 of this two-part post, we created and configured the AWS resources required to demonstrate the use of Apache Zeppelin on Amazon Elastic MapReduce (EMR). Further, we configured Zeppelin integrations with AWS Glue Data CatalogAmazon Relational Database Service (RDS) for PostgreSQL, and Amazon Simple Cloud Storage Service (S3) Data Lake. We also covered how to obtain the project’s source code from the two GitHub repositories, zeppelin-emr-demo and zeppelin-emr-config. Below is a high-level architectural diagram of the infrastructure we constructed in Part 1 for this demonstration.

EMR-Zeppelin

Part 2

In Part 2 of this post, we will explore Apache Zeppelin’s features and integration capabilities with a variety of AWS services using a series of four Zeppelin notebooks. Below is an overview of each Zeppelin notebook with a link to view it using Zepl’s free Notebook Explorer. Zepl was founded by the same engineers that developed Apache Zeppelin. Zepl’s enterprise collaboration platform, built on Apache Zeppelin, enables both Data Science and AI/ML teams to collaborate around data.

Notebook 1

The first notebook uses a small 21k row kaggle dataset, Transactions from a Bakery. The notebook demonstrates Zeppelin’s integration capabilities with the Helium plugin system for adding new chart types, the use of Amazon S3 for data storage and retrieval, and the use of Apache Parquet, a compressed and efficient columnar data storage format, and Zeppelin’s storage integration with GitHub for notebook version control.

Interpreters

When you open a notebook for the first time, you are given the choice of interpreters to bind and unbind to the notebook. The last interpreter in the list shown below, postgres, is the new PostgreSQL JDBC Zeppelin interpreter we created in Part 1 of this post. We will use this interpreter in Notebook 3.

screen-shot-2019-11-24-at-8_03_50-pm

Application Versions

The first two paragraphs of the notebook are used to confirm the version of Spark, Scala, OpenJDK, and Python we are using. Recall we updated the Spark and Python interpreters to use Python 3.

screen_shot_2019-11-26_at_6_58_33_pm

Helium Visualizations

If you recall from Part 1 of the post, we pre-installed several additional Helium Visualizations, including the Ultimate Pie Chart. Below, we see the use of the Spark SQL (%sql) interpreter to query a Spark DataFrame, return results, and visualize the data using the Ultimate Pie Chart.  In addition to the pie chart, we see the other pre-installed Helium visualizations proceeding the five default visualizations, in the menu bar. With Zeppelin, all we have to do is write Spark SQL queries against the Spark DataFrame created earlier in the notebook, and Zeppelin will handle the visualization. You have some basic controls over charts using the ‘settings’ option.

screen-shot-2019-11-24-at-8_06_56-pm

Building the Data Lake

Notebook 1 demonstrates how to read and write data to S3. We read and write the Bakery dataset to both CSV-format and Apache Parquet-format, using Spark (PySpark). We also write the results of Spark SQL queries, like the one above, in Parquet, to S3.

screen-shot-2019-11-24-at-8_20_18-pm

With Parquet, data may be split into multiple files, as shown in the S3 bucket directory below. Parquet is much faster to read into a Spark DataFrame than CSV. Spark provides support for both reading and writing Parquet files. We will write all of our data to Parquet in S3, making future re-use of the data much more efficient than downloading data from the Internet, like GroupLens or kaggle, or consuming CSV from S3.

screen-shot-2019-11-24-at-8_22_46-pm

Preview S3 Data

In addition to using the Zeppelin notebook, we can preview data right in the S3 bucket web interface using the Amazon S3 Select feature. This query in place feature is helpful to quickly understand the structure and content of new data files with which you want to interact within Zeppelin.

screen-shot-2019-11-24-at-8_23_33-pm

screen-shot-2019-11-24-at-8_23_40-pm

screen_shot_2019-11-28_at_7_41_49_pm.png

Saving Changes to GitHub

In Part 1, we configured Zeppelin to read and write the notebooks from your own copy of the GitHub notebook repository. Using the ‘version control’ menu item, changes made to the notebooks can be committed directly to GitHub.

screen-shot-2019-11-24-at-8_16_36-pm

screen-shot-2019-11-24-at-8_38_19-pm

In GitHub, note the committer is the zeppelin user.

screen_shot_2019-11-26_at_7_48_42_pm.png

Notebook 2

The second notebook demonstrates the use of a single-node and multi-node Amazon EMR cluster for the exploration and analysis of public datasets ranging from approximately 100k rows up to 27MM rows, using Zeppelin. We will use the latest GroupLens MovieLens rating datasets to examine the performance characteristics of Zeppelin, using Spark, on single- verses multi-node EMR clusters for analyzing big data using a variety of Amazon EC2 Instance Types.

screen-shot-2019-11-24-at-8_41_31-pm

Multi-Node EMR Cluster

If you recall from Part 1, we waited to create this cluster due to the compute costs of running the cluster’s large EC2 instances. You should understand the cost of these resources before proceeding, and that you ensure they are destroyed immediately upon completion of the demonstration to minimize your expenses.

Normalized Instance Hours
Understanding the costs of EMR requires understanding the concept of normalized instance hours. Cluster displayed in the EMR AWS Console contains two columns, ‘Elapsed time’ and ‘Normalized instance hours’. The ‘Elapsed time’ column reflects the actual wall-clock time the cluster was used. The ‘Normalized instance hours’ column indicates the approximate number of compute hours the cluster has used, rounded up to the nearest hour.

screen_shot_2019-11-28_at_6_09_38_pm

Normalized instance hours calculations are based on a normalization factor. The normalization factor ranges from 1 for a small instance, up to 64 for an 8xlarge. Based on the type and quantity of instances in our multi-node cluster, we would use approximately 56 compute hours (aka normalized instance hours) for every one hour of wall-clock time our EMR cluster is running. Note the multi-node cluster used in our demo, highlighted in yellow above. The cluster ran for two hours, which equated to 112 normalized instance hours.

Screen Shot 2019-12-16 at 9.48.59 PM.png

Create the Multi-Node Cluster

Create the multi-node EMR cluster using CloudFormation. Change the following nine variable values, then run the emr cloudformation create-stack API command, using the AWS CLI.

# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
EC2_KEY_NAME="your-key-name"
LOG_BUCKET="aws-logs-your_aws_account_id-your_region"
GITHUB_ACCOUNT="your-account-name"
GITHUB_REPO="your-new-project-name"
GITHUB_TOKEN="your-token-value"
MASTER_INSTANCE_TYPE="m5.xlarge" # optional
CORE_INSTANCE_TYPE="m5.2xlarge" # optional
CORE_INSTANCE_COUNT=3 # optional

aws cloudformation create-stack \
    --stack-name zeppelin-emr-prod-stack \
    --template-body file://cloudformation/emr_cluster.yml \
    --parameters ParameterKey=ZeppelinDemoBucket,ParameterValue=${ZEPPELIN_DEMO_BUCKET} \
                 ParameterKey=Ec2KeyName,ParameterValue=${EC2_KEY_NAME} \
                 ParameterKey=LogBucket,ParameterValue=${LOG_BUCKET} \
                 ParameterKey=MasterInstanceType,ParameterValue=${MASTER_INSTANCE_TYPE} \
                 ParameterKey=CoreInstanceType,ParameterValue=${CORE_INSTANCE_TYPE} \
                 ParameterKey=CoreInstanceCount,ParameterValue=${CORE_INSTANCE_COUNT} \
                 ParameterKey=GitHubAccount,ParameterValue=${GITHUB_ACCOUNT} \
                 ParameterKey=GitHubRepository,ParameterValue=${GITHUB_REPO} \
                 ParameterKey=GitHubToken,ParameterValue=${GITHUB_TOKEN}

Use the Amazon EMR web interface to confirm the success of the CloudFormation stack. The fully-provisioned cluster should be in the ‘Waiting’ state when ready.

screen_shot_2019-11-26_at_4_58_05_pm

Configuring the EMR Cluster

Refer to Part 1 for the configuration steps necessary to prepare the EMR cluster and Zeppelin before continuing. Repeat all the steps used for the single-node cluster.

Monitoring with Ganglia

In Part 1, we installed Ganglia as part of creating the EMR cluster. Ganglia, according to its website, is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. Ganglia can be used to evaluate the performance of the single-node and multi-node EMR clusters. With Ganglia, we can easily view cluster and individual instance CPU, memory, and network I/O performance.

screen-shot-2019-11-24-at-8_46_46-pm
Ganglia Example: Cluster CPU

screen-shot-2019-11-24-at-8_48_44-pm
Ganglia Example: Cluster Memory

screen_shot_2019-11-26_at_5_18_51_pm
Ganglia Example: Cluster Network I/O

YARN Resource Manager

The YARN Resource Manager Web UI is also available on our EMR cluster. Using the Resource Manager, we can view the compute resource load on the cluster, as well as the individual EMR Core nodes. Below, we see that the multi-node cluster has 24 vCPUs and 72 GiB of memory available, split evenly across the three Core cluster nodes.

You might recall, the m5.2xlarge EC2 instance type, used for the three Core nodes, each contains 8 vCPUs and 32 GiB of memory. However, by default, although all 8 vCPUs are available for computation per node, only 24 GiB of the node’s 32 GiB of memory are available for computation. EMR ensures a portion of the memory on each node is reserved for other system processes. The maximum available memory is controlled by the YARN memory configuration option, yarn.scheduler.maximum-allocation-mb.

screen_shot_2019-11-26_at_5_15_00_pm

The YARN Resource Manager preview above shows the load on the Code nodes as Notebook 2 is executing the Spark SQL queries on the large MovieLens with 27MM ratings. Note that only 4 of the 24 vCPUs (16.6%) are in use, but that 70.25 of the 72 GiB (97.6%) of available memory is being used. According to Spark, because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster: CPU, network bandwidth, or memory. Most often, if the data fits in memory, the bottleneck is network bandwidth. In this case, memory appears to be the most constrained resource. Using memory-optimized instances, such as r4 or r5 instance types, might be more effective for the core nodes than the m5 instance types.

MovieLens Datasets

By changing one variable in the notebook, we can work with the latest, smaller GroupLens MovieLens dataset containing approximately 100k rows (ml-latest-small) or the larger dataset, containing approximately 27M rows (ml-latest). For this demo, try both datasets on both the single-node and multi-node clusters. Compare the Spark SQL paragraph execution times for each of the four variations, including single-node with the small dataset, single-node with the large dataset, multi-node with the small dataset, and multi-node with the large dataset. Observe how fast the SQL queries are executed on the single-node versus multi-node cluster. Try switching to a different Core node instance type, such as r5.2xlarge. Try creating a cluster with additional Core nodes. How is the compute time effected?

screen_shot_2019-11-26_at_5_02_34_pm

Terminate the multi-node EMR cluster to save yourself the expense before continuing to Notebook 3.

aws cloudformation delete-stack \
    --stack-name=zeppelin-emr-prod-stack

Notebook 3

The third notebook demonstrates Amazon EMR and Zeppelin’s integration capabilities with AWS Glue Data Catalog as an Apache Hive-compatible metastore for Spark SQL. We will create an Amazon S3-based Data Lake using the AWS Glue Data Catalog and a set of AWS Glue Crawlers.

screen_shot_2019-11-27_at_11_44_44_pm

Glue Crawlers

Before continuing with Notebook 3, run the two Glue Crawlers using the AWS CLI.

aws glue start-crawler --name bakery-transactions-crawler
aws glue start-crawler --name movie-ratings-crawler

The two Crawlers will create a total of seven tables in the Glue Data Catalog database.

screen_shot_2019-11-27_at_8_50_09_pm

If we examine the Glue Data Catalog database, we should now observe several tables, one for each dataset found in the S3 bucket. The location of each dataset is shown in the ‘Location’ column of the tables view.

screen-shot-2019-11-24-at-9_14_19-pm

From the Zeppelin notebook, we can even use Spark SQL to query the AWS Glue Data Catalog, itself, for its databases and the tables within them.

screen-shot-2019-11-24-at-9_12_52-pm

According to Amazon, the Glue Data Catalog tables and databases are containers for the metadata definitions that define a schema for underlying source data. Using Zeppelin’s SQL interpreter, we can query the Data Catalog database and return the underlying source data. The SQL query example, below, demonstrates how we can perform a join across two tables in the data catalog database, representing two different data sources, and return results.

screen-shot-2019-11-24-at-9_09_26-pm

Notebook 4

The fourth notebook demonstrates Zeppelin’s ability to integrate with an external data source. In this case, we will interact with data in an Amazon RDS PostgreSQL relational database using three methods, including the Psycopg 2 PostgreSQL adapter for Python, Spark’s native JDBC capability, and Zeppelin’s JDBC Interpreter.

screen-shot-2019-11-27-at-11_26_34-am.png

First, we create a new schema and four related tables for the RDS PostgreSQL movie ratings database, using the Psycopg 2 PostgreSQL adapter for Python and the SQL file we copied to S3 in Part 1.

screen_shot_2019-11-27_at_11_09_42_am

The RDS database’s schema, shown below, approximates the schema of the four CSV files from the GroupLens MovieLens rating dataset we used in Notebook 2.

screen_shot_2019-11-22_at_12_55_25_pm

Since the schema of the PostgreSQL database matches the MovieLens dataset files, we can import the data from the CVS files, downloaded from GroupLens, directly into the RDS database, again using the Psycopg PostgreSQL adapter for Python.

screen_shot_2019-11-27_at_11_09_52_am

According to the Spark documentation, Spark SQL also includes a data source that can read data from other databases using JDBC. Using Spark’s JDBC capability and the PostgreSQL JDBC Driver we installed in Part 1, we can perform Spark SQL queries against the RDS database using PySpark (%spark.pyspark). Below, we see a paragraph example of reading the RDS database’s movies table, using Spark.

screen_shot_2019-11-27_at_11_10_01_am

As a third method of querying the RDS database, we can use the custom Zeppelin PostgreSQL JDBC interpreter (%postgres) we created in Part 1. Although the default driver of the JDBC interpreter is set as PostgreSQL, and the associated JAR is included with Zeppelin, we overrode that older JAR, with the latest PostgreSQL JDBC Driver JAR.

Using the %postgres interpreter, we query the RDS database’s public schema, and return the four database tables we created earlier in the notebook.

screen_shot_2019-11-27_at_11_10_26_am

Again, below, using the %postgres interpreter in the notebook’s paragraph, we query the RDS database and return data, which we then visualize using Zeppelin’s bar chart. Finally, note the use of Zeppelin Dynamic Forms in this example. Dynamic Forms allows Zeppelin to dynamically creates input forms, whose input values are then available to use programmatically. Here, we use two form input values to control the data returned from our query and the resulting visualization.

screen_shot_2019-11-27_at_11_10_54_am

Conclusion

In this two-part post, we learned how effectively Apache Zeppelin integrates with Amazon EMR. We also learned how to extend Zeppelin’s capabilities, using  AWS Glue, Amazon RDS, and Amazon S3 as a Data Lake. Beyond what was covered in this post, there are dozens of more Zeppelin and EMR features, as well as dozens of more AWS services that integrate with Zeppelin and EMR, for you to discover.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , ,

1 Comment

Getting Started with Apache Zeppelin on Amazon EMR, using AWS Glue, RDS, and S3: Part 1

Introduction

There is little question big data analyticsdata scienceartificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last 3–5 years. Behind the hype cycles and marketing buzz, these technologies are having a significant influence on many aspects of our modern lives. Due to their popularity, commercial enterprises, academic institutions, and the public sector have all rushed to develop hardware and software solutions to decrease the barrier to entry and increase the velocity of ML and Data Scientists and Engineers.

screen_shot_2019-11-17_at_7_24_10_am1Data Science: 5-Year Search Trend (courtesy Google Trends)

screen_shot_2019-11-17_at_7_24_10_am2Machine Learning: 5-Year Search Trend (courtesy Google Trends)

Technologies

All three major cloud providers, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, have rapidly maturing big data analytics, data science, and AI and ML services. AWS, for example, introduced Amazon Elastic MapReduce (EMR) in 2009, primarily as an Apache Hadoop-based big data processing service. Since then, according to Amazon, EMR has evolved into a service that uses Apache SparkApache Hadoop, and several other leading open-source frameworks to quickly and cost-effectively process and analyze vast amounts of data. More recently, in late 2017, Amazon released SageMaker, a service that provides the ability to build, train, and deploy machine learning models quickly and securely.

Simultaneously, organizations are building solutions that integrate and enhance these Cloud-based big data analytics, data science, AI, and ML services. One such example is Apache Zeppelin. Similar to the immensely popular Project Jupyter and the newly open-sourced Netflix’s Polynote, Apache Zeppelin is a web-based, polyglot, computational notebook. Zeppelin enables data-driven, interactive data analytics and document collaboration using a number of interpreters such as Scala (with Apache Spark), Python (with Apache Spark), Spark SQL, JDBC, Markdown, Shell and so on. Zeppelin is one of the core applications supported natively by Amazon EMR.

screen_shot_2019-11-15_at_11_59_16_pm

In the following two-part post, we will explore the use of Apache Zeppelin on EMR for data science and data analytics using a series of Zeppelin notebooks. The notebooks feature the use of AWS Glue, the fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. The notebooks also feature the use of Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Simple Cloud Storage Service (S3). Amazon S3 will serve as a Data Lake to store our unstructured data. Given the current choice of Zeppelin’s more than twenty different interpreters, we will use Python3 and Apache Spark, specifically Spark SQL and PySpark, for all notebooks.

zeppelin_header

We will build an economical single-node EMR cluster for data exploration, as well as a larger multi-node EMR cluster for analyzing large data sets. Amazon S3 will be used to store input and output data, while intermediate results are stored in the Hadoop Distributed File System (HDFS) on the EMR cluster. Amazon provides a good overview of EMR architecture. Below is a high-level architectural diagram of the infrastructure we will construct during Part 1 for this demonstration.

EMR-Zeppelin.png

Notebook Features

Below is an overview of each Zeppelin notebook with a link to view it using Zepl’s free Notebook Explorer. Zepl was founded by the same engineers that developed Apache Zeppelin, including Moonsoo Lee, Zepl CTO and creator for Apache Zeppelin. Zepl’s enterprise collaboration platform, built on Apache Zeppelin, enables both Data Science and AI/ML teams to collaborate around data.

Notebook 1

The first notebook uses a small 21k row kaggle dataset, Transactions from a Bakery. The notebook demonstrates Zeppelin’s integration capabilities with the Helium plugin system for adding new chart types, the use of Amazon S3 for data storage and retrieval, and the use of Apache Parquet, a compressed and efficient columnar data storage format, and Zeppelin’s storage integration with GitHub for notebook version control.

screen_shot_2019-11-27_at_11_13_53_am

Notebook 2

The second notebook demonstrates the use of a single-node and multi-node Amazon EMR cluster for the exploration and analysis of public datasets ranging from approximately 100k rows up to 27MM rows, using Zeppelin. We will use the latest GroupLens MovieLens rating datasets to examine the performance characteristics of Zeppelin, using Spark, on single- verses multi-node EMR clusters for analyzing big data using a variety of Amazon EC2 Instance Types.

screen_shot_2019-11-27_at_11_14_01_am

Notebook 3

The third notebook demonstrates Amazon EMR and Zeppelin’s integration capabilities with AWS Glue Data Catalog as an Apache Hive-compatible metastore for Spark SQL. We will create an Amazon S3-based Data Lake using the AWS Glue Data Catalog and a set of AWS Glue Crawlers.

screen_shot_2019-11-27_at_11_44_44_pm.png

Notebook 4

The fourth notebook demonstrates Zeppelin’s ability to integrate with an external data source. In this case, we will interact with data in an Amazon RDS PostgreSQL relational database using three methods, including the Psycopg 2 PostgreSQL adapter for Python, Spark’s native JDBC capability, and Zeppelin’s JDBC Interpreter.

screen-shot-2019-11-27-at-11_26_34-am

Demonstration

In Part 1 of the post, as a DataOps Engineer, we will create and configure the AWS resources required to demonstrate the use of Apache Zeppelin on EMR, using an AWS Glue Data Catalog, Amazon RDS PostgreSQL database, and an S3-based data lake. In Part 2 of this post, as a Data Scientist, we will explore Apache Zeppelin’s features and integration capabilities with a variety of AWS services using the Zeppelin notebooks.

Source Code

The demonstration’s source code is contained in two public GitHub repositories. The first repository, zeppelin-emr-demo, includes the four Zeppelin notebooks, organized according to the conventions of Zeppelin’s pluggable notebook storage mechanisms.

.
├── 2ERVVKTCG
│   └── note.json
├── 2ERYY923A
│   └── note.json
├── 2ESH8DGFS
│   └── note.json
├── 2EUZKQXX7
│   └── note.json
├── LICENSE
└── README.md

Zeppelin GitHub Storage

During the demonstration, changes made to your copy of the Zeppelin notebooks running on EMR will be automatically pushed back to GitHub when a commit occurs. To accomplish this, instead of just cloning a local copy of my zeppelin-emr-demo project repository, you will want your own copy, within your personal GitHub account. You could folk my zeppelin-emr-demo GitHub repository or copy a clone into your own GitHub repository.

To make a copy of the project in your own GitHub account, first, create a new empty repository on GitHub, for example, ‘my-zeppelin-emr-demo-copy’. Then, execute the following commands from your terminal, to clone the original project repository to your local environment, and finally, push it to your GitHub account.

# change me
GITHUB_ACCOUNT="your-account-name" # i.e. garystafford
GITHUB_REPO="your-new-project-name" # i.e. my-zeppelin-emr-demo-copy

# shallow clone into new directory
git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/zeppelin-emr-demo.git \
    ${GITHUB_REPO}

# re-initialize repository
cd ${GITHUB_REPO}
rm -rf .git
git init

# re-commit code
git add -A
git commit -m "Initial commit of my copy of zeppelin-emr-demo"

# push to your repo
git remote add origin \
    https://github.com/$GITHUB_ACCOUNT/$GITHUB_REPO.git
git push -u origin master

GitHub Personal Access Token

To automatically push changes to your GitHub repository when a commit occurs, Zeppelin will need a GitHub personal access token. Create a personal access token with the scope shown below. Be sure to keep the token secret. Make sure you do not accidentally check your token value into your source code on GitHub. To minimize the risk, change or delete the token after completing the demo.

screen_shot_2019-11-18_at_8_44_27_pm

The second repository, zeppelin-emr-config, contains the necessary bootstrap files, CloudFormation templates, and PostgreSQL DDL (Data Definition Language) SQL script.

.
├── LICENSE
├── README.md
├── bootstrap
│   ├── bootstrap.sh
│   ├── emr-config.json
│   ├── helium.json
├── cloudformation
│   ├── crawler.yml
│   ├── emr_single_node.yml
│   ├── emr_cluster.yml
│   └── rds_postgres.yml
└── sql
    └── ratings.sql

Use the following AWS CLI command to clone the GitHub repository to your local environment.

git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/zeppelin-emr-demo-setup.git

Requirements

To follow along with the demonstration, you will need an AWS Account, an existing Amazon S3 bucket to store EMR configuration and data, and an EC2 key pair. You will also need a current version of the AWS CLI installed in your work environment. Due to the particular EMR features, we will be using, I recommend using the us-east-1 AWS Region to create the demonstration’s resources.

# create secure emr config and data bucket
# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"

aws s3api create-bucket \
    --bucket ${ZEPPELIN_DEMO_BUCKET}
aws s3api put-public-access-block \
    --bucket ${ZEPPELIN_DEMO_BUCKET} \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

Copy Configuration Files to S3

To start, we need to copy three configuration files, bootstrap.sh, helium.json, and ratings.sql, from the zeppelin-emr-demo-setup project directory to our S3 bucket. Change the ZEPPELIN_DEMO_BUCKET variable value, then run the following s3 cp API command, using the AWS CLI. The three files will be copied to a bootstrap directory within your S3 bucket.

# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
 
aws s3 cp bootstrap/bootstrap.sh s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/
aws s3 cp bootstrap/helium.json s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/
aws s3 cp sql/ratings.sql s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/

Below, sample output from copying local files to S3.

screen-shot-2019-11-21-at-2_03_49-pm

Create AWS Resources

We will start by creating most of the required AWS resources for the demonstration using three CloudFormation templates. We will create a single-node Amazon EMR cluster, an Amazon RDS PostgresSQL database, an AWS Glue Data Catalog database, two AWS Glue Crawlers, and a Glue IAM Role. We will wait to create the multi-node EMR cluster due to the compute costs of running large EC2 instances in the cluster. You should understand the cost of these resources before proceeding, and that you ensure they are destroyed immediately upon completion of the demonstration to minimize your expenses.

Single-Node EMR Cluster

We will start by creating the single-node Amazon EMR cluster, consisting of just one master node with no core or task nodes (a cluster of one). All operations will take place on the master node.

Default EMR Resources

The following EMR instructions assume you have already created at least one EMR cluster in the past, in your current AWS Region, using the EMR web interface with the ‘Create Cluster – Quick Options’ option. Creating a cluster this way creates several additional AWS resources, such as the EMR_EC2_DefaultRole EC2 instance profile, the default EMR_DefaultRole EMR IAM Role, and the default EMR S3 log bucket.

screen_shot_2019-11-17_at_5_46_56_pm.png

If you haven’t created any EMR clusters using the EMR ‘Create Cluster – Quick Options’ in the past, don’t worry, you can also create the required resources with a few quick AWS CLI commands. Change the following LOG_BUCKET variable value, then run the aws emr and aws s3api API commands, using the AWS CLI. The LOG_BUCKET variable value follows the convention of aws-logs-awsaccount-region. For example, aws-logs-012345678901-us-east-1.

# create emr roles
aws emr create-default-roles

# create log secure bucket
# change me
LOG_BUCKET="aws-logs-your_aws_account_id-your_region"

aws s3api create-bucket --bucket ${LOG_BUCKET}
aws s3api put-public-access-block --bucket ${LOG_BUCKET} \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

The new EMR IAM Roles can be viewed in the IAM Roles web interface.

screen-shot-2019-11-21-at-2_05_35-pm

Often, I see tutorials that reference these default EMR resources from the AWS CLI or CloudFormation, without any understanding or explanation of how they are created.

EMR Bootstrap Script

As part of creating our EMR cluster, the CloudFormation template, emr_single_node.yml, will call the bootstrap script we copied to S3, earlier, bootstrap.sh. The bootstrap script pre-installs required Python and Linux software packages, and the PostgreSQL driver JAR. The bootstrap script also clones your copy of the zeppelin-emr-demo GitHub repository.

#!/bin/bash
set -ex

if [[ $# -ne 2 ]] ; then
    echo "Script requires two arguments"
    exit 1
fi

GITHUB_ACCOUNT=$1
GITHUB_REPO=$2

# install extra python packages
sudo python3 -m pip install psycopg2-binary boto3

# install extra linux packages
yes | sudo yum install git htop

# clone github repo
cd /tmp
git clone "https://github.com/${GITHUB_ACCOUNT}/${GITHUB_REPO}.git"

# install extra jars
POSTGRES_JAR="postgresql-42.2.8.jar"
wget -nv "https://jdbc.postgresql.org/download/${POSTGRES_JAR}"
sudo chown -R hadoop:hadoop ${POSTGRES_JAR}
mkdir -p /home/hadoop/extrajars/
cp ${POSTGRES_JAR} /home/hadoop/extrajars/

EMR Application Configuration

The EMR CloudFormation template will also modify the EMR cluster’s Spark and Zeppelin application configurations. Amongst other configuration properties, the template sets the default Python version to Python3, instructs Zeppelin to use the cloned GitHub notebook directory path, and adds the PostgreSQL Driver JAR to the JVM ClassPath. Below we can see the configuration properties applied to an existing EMR cluster.

screen-shot-2019-11-21-at-2_24_43-pm
EMR Application Versions

As of the date of this post, EMR is at version 5.28.0. Below, as shown in the EMR web interface, are the current (21) applications and frameworks available for installation on EMR.

emr-28.png

For this demo, we will install Apache Spark v2.4.4, Ganglia v3.7.2, and Zeppelin 0.8.2.

screen_shot_2019-11-17_at_8_32_17_pmApache Zeppelin: Web Interface

screen_shot_2019-11-13_at_5_40_12_pmApache Spark: DAG Visualization

screen_shot_2019-11-13_at_8_31_13_pmGanglia: Cluster CPU Monitoring

Create the EMR CloudFormation Stack

Change the following (7) variable values, then run the emr cloudformation create-stack API command, using the AWS CLI.

# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
EC2_KEY_NAME="your-key-name"
LOG_BUCKET="aws-logs-your_aws_account_id-your_region"
GITHUB_ACCOUNT="your-account-name" # i.e. garystafford
GITHUB_REPO="your-new-project-name" # i.e. my-zeppelin-emr-demo
GITHUB_TOKEN="your-token-value"
MASTER_INSTANCE_TYPE="m5.xlarge" # optional
 
aws cloudformation create-stack \
    --stack-name zeppelin-emr-dev-stack \
    --template-body file://cloudformation/emr_single_node.yml \
    --parameters ParameterKey=ZeppelinDemoBucket,ParameterValue=${ZEPPELIN_DEMO_BUCKET} \
                 ParameterKey=Ec2KeyName,ParameterValue=${EC2_KEY_NAME} \
                 ParameterKey=LogBucket,ParameterValue=${LOG_BUCKET} \
                 ParameterKey=MasterInstanceType,ParameterValue=${MASTER_INSTANCE_TYPE} \
                 ParameterKey=GitHubAccount,ParameterValue=${GITHUB_ACCOUNT} \
                 ParameterKey=GitHubRepository,ParameterValue=${GITHUB_REPO} \
                 ParameterKey=GitHubToken,ParameterValue=${GITHUB_TOKEN}

You can use the Amazon EMR web interface to confirm the results of the CloudFormation stack. The cluster should be in the ‘Waiting’ state.

screen_shot_2019-11-15_at_7_42_09_pm

PostgreSQL on Amazon RDS

Next, create a simple, single-AZ, single-master, non-replicated Amazon RDS PostgreSQL database, using the included CloudFormation template, rds_postgres.yml. We will use this database in Notebook 4. For the demo, I have selected the current-generation general purpose db.m4.large EC2 instance type to run PostgreSQL. You can easily change the instance type to another RDS-supported instance type to suit your own needs.

Change the following (3) variable values, then run the cloudformation create-stack API command, using the AWS CLI.

# change me
DB_MASTER_USER="your-db-username" # i.e. masteruser
DB_MASTER_PASSWORD="your-db-password" # i.e. 5up3r53cr3tPa55w0rd
MASTER_INSTANCE_TYPE="db.m4.large" # optional
 
aws cloudformation create-stack \
    --stack-name zeppelin-rds-stack \
    --template-body file://cloudformation/rds_postgres.yml \
    --parameters ParameterKey=DBUser,ParameterValue=${DB_MASTER_USER} \
                 ParameterKey=DBPassword,ParameterValue=${DB_MASTER_PASSWORD} \
                 ParameterKey=DBInstanceClass,ParameterValue=${MASTER_INSTANCE_TYPE}

You can use the Amazon RDS web interface to confirm the results of the CloudFormation stack.

screen_shot_2019-11-17_at_8_06_44_pm.png

AWS Glue

Next, create the AWS Glue Data Catalog database, the Apache Hive-compatible metastore for Spark SQL, two AWS Glue Crawlers, and a Glue IAM Role (ZeppelinDemoCrawlerRole), using the included CloudFormation template, crawler.yml. The AWS Glue Data Catalog database will be used in Notebook 3.

Change the following variable value, then run the cloudformation create-stack API command, using the AWS CLI.

# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
 
aws cloudformation create-stack \
    --stack-name zeppelin-crawlers-stack \
    --template-body file://cloudformation/crawler.yml \
    --parameters ParameterKey=ZeppelinDemoBucket,ParameterValue=${ZEPPELIN_DEMO_BUCKET} \
    --capabilities CAPABILITY_NAMED_IAM

You can use the AWS Glue web interface to confirm the results of the CloudFormation stack. Note the Data Catalog database and the two Glue Crawlers. We will not run the two crawlers until Part 2 of the post, so no tables will exist in the Data Catalog database, yet.

screen_shot_2019-11-27_at_8_50_09_pmscreen_shot_2019-11-27_at_8_56_47_pm

At this point in the demonstration, you should have successfully created a single-node Amazon EMR cluster, an Amazon RDS PostgresSQL database, and several AWS Glue resources, all using CloudFormation templates.

screen_shot_2019-11-21_at_4_19_01_pm

Post-EMR Creation Configuration

RDS Security

For the new EMR cluster to communicate with the RDS PostgreSQL database, we need to ensure that port 5432 is open from the RDS database’s VPC security group, which is the default VPC security group, to the security groups of the EMR nodes. Obtain the Group ID of the ElasticMapReduce-master and ElasticMapReduce-slave Security Groups from the EMR web interface.

screen_shot_2019-11-20_at_11_51_10_am

Access the Security Group for the RDS database using the RDS web interface. Change the inbound rule for port 5432 to include both Security Group IDs.

screen_shot_2019-11-20_at_11_52_23_am

SSH to EMR Master Node

In addition to the bootstrap script and configurations, we already applied to the EMR cluster, we need to make several post-EMR creation configuration changes to the EMR cluster for our demonstration. These changes will require SSH’ing to the EMR cluster. Using the master node’s public DNS address and SSH command provided in the EMR web console, SSH into the master node.

screen_shot_2019-11-15_at_7_42_09_pm_v3

If you cannot access the node using SSH, check that port 22 is open on the associated EMR master node IAM Security Group (ElasticMapReduce-master) to your IP address or address range.

screen_shot_2019-11-15_at_7_42_09_pm_v2

screen_shot_2019-11-21_at_4_51_01_pm.png

Git Permissions

We need to change permissions on the git repository we installed during the EMR bootstrapping phase. Typically, with an EC2 instance, you perform operations as the ec2-user user. With Amazon EMR, you often perform actions as the hadoop user. With Zeppelin on EMR, the notebooks perform operations, including interacting with the git repository as the zeppelin user. As a result of the bootstrap.sh script, the contents of the git repository directory, /tmp/zeppelin-emr-demo/, are owned by the hadoop user and group by default.

screen_shot_2019-11-17_at_8_01_24_pm

We will change their owner to the zeppelin user and group. We could not perform this step as part of the bootstrap script since the the zeppelin user and group did not exist at the time the script was executed.

cd /tmp/zeppelin-emr-demo/
sudo chown -R zeppelin:zeppelin .

The results should look similar to the following output.

screen_shot_2019-11-17_at_8_02_16_pm

Pre-Install Visualization Packages

Next, we will pre-install several Apache Zeppelin Visualization packages. According to the Zeppelin website, an Apache Zeppelin Visualization is a pluggable package that can be loaded/unloaded on runtime through the Helium framework in Zeppelin. We can use them just like any other built-in visualization in the notebook. A Visualization is a javascript npm package. For example, here is a link to the ultimate-pie-chart on the public npm registry.

We can pre-load plugins by replacing the /usr/lib/zeppelin/conf/helium.json file with the version of helium.json we copied to S3, earlier, and restarting Zeppelin. If you have a lot of Visualizations or package types or use any DataOps automation to create EMR clusters, this approach is much more efficient and repeatable than manually loading plugins using the Zeppelin UI, each time you create a new EMR cluster. Below, the helium.json file, which pre-loads (8) Visualization packages.

{
    "enabled": {
        "ultimate-pie-chart": "ultimate-pie-chart@0.0.2",
        "ultimate-column-chart": "ultimate-column-chart@0.0.2",
        "ultimate-scatter-chart": "ultimate-scatter-chart@0.0.2",
        "ultimate-range-chart": "ultimate-range-chart@0.0.2",
        "ultimate-area-chart": "ultimate-area-chart@0.0.1",
        "ultimate-line-chart": "ultimate-line-chart@0.0.1",
        "zeppelin-bubblechart": "zeppelin-bubblechart@0.0.4",
        "zeppelin-highcharts-scatterplot": "zeppelin-highcharts-scatterplot@0.0.2"
    },
    "packageConfig": {},
    "bundleDisplayOrder": [
        "ultimate-pie-chart",
        "ultimate-column-chart",
        "ultimate-scatter-chart",
        "ultimate-range-chart",
        "ultimate-area-chart",
        "ultimate-line-chart",
        "zeppelin-bubblechart",
        "zeppelin-highcharts-scatterplot"
    ]
}

Run the following commands to load the plugins and adjust the permissions on the file.

# change me
ZEPPELIN_DEMO_BUCKET="your-bucket-name"
 
sudo aws s3 cp s3://${ZEPPELIN_DEMO_BUCKET}/bootstrap/helium.json \
    /usr/lib/zeppelin/conf/helium.json
sudo chown zeppelin:zeppelin /usr/lib/zeppelin/conf/helium.json

Create New JDBC Interpreter

Lastly, we need to create a new Zeppelin JDBC Interpreter to connect to our RDS database. By default, Zeppelin has several interpreters installed. You can review a list of available interpreters using the following command.

sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh --list

screen_shot_2019-11-18_at_6_29_40_am

The new JDBC interpreter will allow us to connect to our RDS PostgreSQL database, using Java Database Connectivity (JDBC). First, ensure all available interpreters are installed, including the current Zeppelin JDBC driver (org.apache.zeppelin:zeppelin-jdbc:0.8.0) to /usr/lib/zeppelin/interpreter/jdbc.

Creating a new interpreter is a two-part process. In this stage, we install the required interpreter files on the master node using the following command. Then later, in the Zeppelin web interface, we will configure the new PostgreSQL JDBC interpreter. Note we must provide a unique name for the interpreter (I chose ‘postgres’ in this case), which we will refer to in part two of the interpreter creation process.

sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh --all
 
sudo sh /usr/lib/zeppelin/bin/install-interpreter.sh \
    --name "postgres" \
    --artifact org.apache.zeppelin:zeppelin-jdbc:0.8.0

To complete the post-EMR creation configuration on the master node, we must restart Zeppelin for our changes to take effect.

sudo stop zeppelin && sudo start zeppelin

In my experience, it could take 2–3 minutes for the Zeppelin UI to become fully responsive after a restart.

screen_shot_2019-11-18_at_10_01_54_pm

Zeppelin Web Interface Access

With all the EMR application configuration complete, we will access the Zeppelin web interface running on the master node. Use the Zeppelin connection information provided in the EMR web interface to setup SSH tunneling to the Zeppelin web interface, running on the master node. Using this method, we can also access the Spark History Server, Ganglia, and Hadoop Resource Manager web interfaces; all links are provided from EMR.

screen_shot_2019-11-15_at_7_42_09_pm

To set up a web connection to the applications installed on the EMR cluster, I am using FoxyProxy as a proxy management tool with Google Chrome.

screen_shot_2019-11-17_at_8_22_09_pm.png

If everything is working so far, you should see the Zeppelin web interface with all four Zeppelin notebooks available from the cloned GitHub repository. You will be logged in as the anonymous user. Zeppelin offers authentication for accessing notebooks on the EMR cluster. For brevity, we will not cover setting up authentication in Zeppelin, using Shiro Authentication.

screen_shot_2019-11-17_at_8_32_17_pm.png

To confirm the path to the local, cloned copy of the GitHub notebook repository, is correct, check the Notebook Repos interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen. The value should match the ZEPPELIN_NOTEBOOK_DIR configuration property value in the emr_single_node.yml CloudFormation template we executed earlier.

screen_shot_2019-11-18_at_10_04_23_pm

To confirm the Helium Visualizations were pre-installed correctly, using the helium.json file, open the Helium interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen.

screen_shot_2019-11-15_at_7_45_28_pmNote the enabled visualizations. And, it is easy to enable additional plugins through the web interface.screen_shot_2019-11-15_at_7_45_33_pm

New PostgreSQL JDBC Interpreter

If you recall, earlier, we install the required interpreter files on the master node using the following command using the bootstrap script. We will now complete the process of configuring the new PostgreSQL JDBC interpreter. Open the Interpreter interface, accessible under the Settings dropdown (anonymous user) in the upper right of the screen.

The title of the new interpreter must match the name we used to install the interpreter files, ‘postgres’. The interpreter group will be ‘jdbc’. There are, minimally, three properties we need to configure for your specific RDS database instance, including default.url, default.user, and default.password. These should match the values you used to create your RDS instance, earlier. Make sure to includes the database name in the default.url. An example is shown below.

default.url: jdbc:postgresql://zeppelin-demo.abcd1234efg56.us-east-1.rds.amazonaws.com:5432/ratings
default.user: masteruser
default.password: 5up3r53cr3tPa55w0rd

We also need to provide a path to the PostgreSQL driver JAR dependency. This path is the location where we placed the JAR using the bootstrap.sh script, earlier, /home/hadoop/extrajars/postgresql-42.2.8.jar. Save the new interpreter and make sure it starts successfully (shows a green icon).

screen_shot_2019-11-17_at_9_03_02_pm.png

screen_shot_2019-11-18_at_6_42_13_am

Switch Interpreters to Python 3

The last thing we need to do is change the Spark and Python interpreters to use Python 3 instead of the default Python 2. On the same screen you used to create a new interpreter, modify the Spark and Python interpreters. First, for the Python interpreter, change the zeppelin.python property to python3.

screen_shot_2019-11-18_at_3_30_39_pm

Next, for the Spark interpreter, change the zeppelin.pyspark.python property to python3.

screen_shot_2019-11-18_at_3_32_34_pm

Part 2

In Part 1 of this post, we created and configured the AWS resources required to demonstrate the use of Apache Zeppelin on EMR, using an AWS Glue Data Catalog, Amazon RDS PostgreSQL database, and an S3 data lake. In Part 2 of this post, we will explore some of Apache Zeppelin’s features and integration capabilities with a variety of AWS services using the notebooks.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , ,

Leave a comment

Amazon ECR Cross-Account Access for Containerized Applications on ECS

Recently, I was asked a question regarding sharing Docker images from one AWS Account’s Amazon Elastic Container Registry (ECR) with another AWS Account who was deploying to Amazon Elastic Container Service (ECS) with AWS Fargate. The answer was relatively straightforward, use ECR Repository Policies to allow cross-account access to pull images. However, the devil is always in the implementation details. Constructing ECR Repository Policies can depend on your particular architecture, choice deployment tools, and method of account access. In this brief post, we will explore a common architectural scenario that requires configuring ECR Repository Policies to support sharing images across AWS Accounts.

Sharing Images

There are two scenarios I frequently encounter, which require sharing ECR-based Docker images across multiple AWS Accounts. In the first scenario, a vendor wants to securely share a Docker Image with their customer. Many popular container security and observability solutions function in this manner.

Below, we see an example where an application platform consists of three containers. Two of the container’s images originated from the customer’s own ECR repositories (right side). The third container’s image originated from their vendor’s ECR registry (left side).

ecs-example-1.png

In the second scenario, an enterprise operates multiple AWS Accounts to create logical security boundaries between environments and responsibilities. The first AWS Account contains the enterprise’s deployable binary assets, including ECR image repositories. The enterprise has additional accounts, one for each application environment, such as Dev, Test, Staging, and Production. The ECR images in the repository account need to be accessed from each of the environment accounts, often across multiple AWS Regions.

Below, we see an example where the deployed application platform consists of three containers, of which all images originated from the ECR repositories (left side). The images are pulled into the Production account for deployment to ECS (right side).

ecs-example-2

Demonstration

In this post, we will explore the first scenario, a vendor wants to securely share a Docker Image with their customer. We will demonstrate how to share images across AWS Accounts for use with Docker Swarm and ECS with Fargate, using ECR Repository Policies. To accomplish this scenario, we will use an existing application I have created, a RESTful, HTTP-based NLP (Natural Language Processing) API, consisting of three Golang microservices. The edge service, nlp-client, communicates with the rake-app service and the prose-app service.

ecs-example-3

The scenario in the demonstration is that the customer has developed the nlp-client and prose-app container-based services, as part of their NLP application. Instead of developing their own implementation of the RAKE (Rapid Automatic Keyword Extraction) algorithm, they have licensed a version from a vendor, the rake-app service, in the form of a Docker Image.

The NPL API exposes several endpoints, accessible through the nlp-client service. The endpoints perform common NLP operations on text, such as extracting keywords, tokens, and entities. All the endpoints are visible by hitting the /routes endpoint.

[
  {
    "method": "POST",
    "path": "/tokens",
    "name": "main.getTokens"
  },
  {
    "method": "POST",
    "path": "/entities",
    "name": "main.getEntities"
  },
  {
    "method": "GET",
    "path": "/health",
    "name": "main.getHealth"
  },
  {
    "method": "GET",
    "path": "/routes",
    "name": "main.getRoutes"
  },
  {
    "method": "POST",
    "path": "/keywords",
    "name": "main.getKeywords"
  }
]

Requirements

To follow along with the demonstration, you will need two AWS Accounts, one representing the vendor and one representing one of their customers. It’s relatively simple to create additional AWS Accounts, all you need is a unique email address (easy with Gmail) and a credit card. Using AWS Organizations can make the task of creating and managing multiple accounts even easier.

I have purposefully used different AWS Regions within each account to demonstrate how you can share ECR images across both AWS Accounts and Regions. You will need a recent version of the AWS CLI and Docker. Lastly, you will need sufficient access to each AWS Account to create resources.

Source Code

The demonstration’s source code is contained in four public GitHub repositories. The first repository contains all the CloudFormation templates and the Docker Compose Stack file, as shown below.

.
├── LICENSE
├── README.md
├── cfn-templates
│   ├── developer-user-group.yml
│   ├── ecr-repo-not-shared.yml
│   ├── ecr-repo-shared.yml
│   ├── public-subnet-public-loadbalancer.yml
│   └── public-vpc.yml
└── docker
    └── stack.yml

Each of the other three GitHub repositories contains a single Go-based microservice, which together comprises the NLP application. Each respository also contains a Dockerfile.

.
├── Dockerfile
├── LICENSE
├── README.md
├── buildspec.yml
└── main.go

The commands required to clone the four repositories are as follows.

git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/ecr-cross-account-demo.git 

git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/nlp-client.git

git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/prose-app.git

git clone --branch master \
    --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/rake-app.git

Process Overview

We will use AWS CloudFormation to create the necessary resources within both AWS Accounts. In the customer account, we will also use CloudFormation to create an ECS Cluster and an Amazon ECS Task Definition. The Task Definition defines how ECS will deploy our application, consisting of three Docker containers, using AWS Fargate. In addition to ECS, we will create an Amazon Virtual Private Cloud (VPC) to house the ECS cluster and a public-facing, Elastic Load Balancing (ELB) Network Load Balancer (NLB) to load-balance our ECS-based application.

Throughout the post, I will use AWS Cloud9, the cloud-based integrated development environment (IDE), to execute all CloudFormation templates using the AWS CLI. I will also use Cloud9 to build and push the Docker images to the ECR repositories. Personally, I find Cloud9 easier to switch between multiple AWS Accounts and AWS Identity and Access Management (IAM) Users, using separate instances of Cloud9, verses using my local workstation. Conveniently, Cloud9 comes preinstalled with many of the tools you will need for this demonstration.

Creating ECR Repositories

In the first AWS Account, representing the vendor, we will execute two CloudFormation templates. The first template, developer-user-group.yml, creates the Development IAM Group and User. The Developer-01 IAM User will be given explicit access to the vendor’s rake-app ECR repository. I suggest you change the DevUserPassword parameter’s value to something more secure.

# change me
IAM_USER_PSWD=T0pS3cr3Tpa55w0rD 

aws cloudformation create-stack \
    --stack-name developer-user-group \
    --template-body file://developer-user-group.yml \
    --parameters \
        ParameterKey=DevUserPassword,ParameterValue=${IAM_USER_PSWD} \
    --capabilities CAPABILITY_NAMED_IAM

Below, we see an example of the resulting CloudFormation Stack showing the new Development IAM User and Group.

screen_shot_2019-10-27_at_7_05_25_pm

Next, we will execute the second CloudFormation template, ecr-repo-shared.yml, which creates the vendor’s rake-app ECR image repository. The rake-app repository will house a copy of the vendor’s rake-app Docker Image. But first, let’s look at the CloudFormation template used to create the repository, specifically the RepositoryPolicyText section. Here we define two repository policies:

  • The AllowPushPull policy explicitly allows the Developer-01 IAM User to push and pull versions of the image to the ECR repository.  We import the exported Amazon Resource Name (ARN) of the Developer-01 IAM User from the previous CloudFormation Stack Outputs. We have also allowed the AWS CodeBuild service access to the ECR repository. This is known as a Service-Linked Role. We will not use CodeBuild in this brief post.
  • The AllowPull policy allows anyone in the customer’s AWS Account (root) to pull any version of the image. They cannot push, only pull. Of course, cross-account access can be restricted to a finer-grained set of the specific customer’s IAM Entities and source IP addresses.

Note the "ecr:GetAuthorizationToken" policy Action. Later, when the customer needs to pull this vendor’s image, this Action will allow the customer’s User to log into the vendor’s ECR repository and receive an Authorization Token. The customer retrieves a token that is valid for a specified container registry for 12 hours.

RepositoryPolicyText:
  Version: '2012-10-17'
  Statement:
    - Sid: AllowPushPull
      Effect: Allow
      Principal:
        Service: codebuild.amazonaws.com
        AWS:
          Fn::ImportValue:
            !Join [':', [!Ref 'StackName', 'DevUserArn']]
      Action:
        - 'ecr:BatchCheckLayerAvailability'
        - 'ecr:BatchGetImage'
        - 'ecr:CompleteLayerUpload'
        - 'ecr:DescribeImages'
        - 'ecr:DescribeRepositories'
        - 'ecr:GetDownloadUrlForLayer'
        - 'ecr:GetRepositoryPolicy'
        - 'ecr:InitiateLayerUpload'
        - 'ecr:ListImages'
        - 'ecr:PutImage'
        - 'ecr:UploadLayerPart'
    - Sid: AllowPull
      Effect: Allow
      Principal:
        AWS: !Join [':', ['arn:aws:iam:', !Ref 'CustomerAccount', 'root']]
      Action:
        - 'ecr:GetAuthorizationToken'
        - 'ecr:BatchCheckLayerAvailability'
        - 'ecr:GetDownloadUrlForLayer'
        - 'ecr:BatchGetImage'
        - 'ecr:DescribeRepositories' # optional permission
        - 'ecr:DescribeImages' # optional permission

Before executing the following command to deploy the CloudFormation Stack, ecr-repo-shared.yml, replace the CustomerAccount value, shown below, with your pseudo customer’s AWS Account ID.

# change me
CUSTOMER_ACCOUNT=999888777666

# don't change me
REPO_NAME=rake-app
 
aws cloudformation create-stack \
    --stack-name ecr-repo-${REPO_NAME} \
    --template-body file://ecr-repo-shared.yml \
    --parameters \
        ParameterKey=CustomerAccount,ParameterValue=${CUSTOMER_ACCOUNT} \
        ParameterKey=RepoName,ParameterValue=${REPO_NAME} \
    --capabilities CAPABILITY_NAMED_IAM

Below, we see an example of the resulting CloudFormation Stack showing the new ECR repository.

screen_shot_2019-10-27_at_7_10_12_pm

Below, we see the ECR repository policies applied correctly in the Permissions tab of the rake-app repository. The first policy covers both the Developer-01 IAM User, referred to as an IAM Entity, as well as AWS CodeBuild, referred to as a Service Principal.

screen_shot_2019-10-27_at_8_59_52_pm

The second policy covers the customer’s AWS Account ID.

screen_shot_2019-10-27_at_9_00_09_pm

Repeat this process in the customer AWS Account. First, the CloudFormation template, developer-user-group.yml, containing Development IAM Group and Developer-01 User.

# change me
IAM_USER_PSWD=T0pS3cr3Tpa55w0rD 

aws cloudformation create-stack \
    --stack-name development-user-group \
    --template-body file://development-user-group.yml \
    --parameters \
        ParameterKey=DevUserPassword,ParameterValue=${IAM_USER_PSWD} \
    --capabilities CAPABILITY_NAMED_IAM

Next, we will execute the second CloudFormation template, ecr-repo-not-shared.yml, twice, once for each of the customer’s two ECR repositories, nlp-client and prose-app. First, let’s look at the template, specifically the RepositoryPolicyText section. In this CloudFormation template, we only define a single policy. Identical to the vendor’s policy, the AllowPushPull policy explicitly allows the previously-created Developer-01 IAM User to push and pull versions of the image to the ECR repository. There is no cross-account access required to the customer’s two ECR repositories.

RepositoryPolicyText:
  Version: '2012-10-17'
  Statement:
    - Sid: AllowPushPull
      Effect: Allow
      Principal:
        Service: codebuild.amazonaws.com
        AWS:
          Fn::ImportValue:
            !Join [':', [!Ref 'StackName', 'DevUserArn']]
      Action:
        - 'ecr:BatchCheckLayerAvailability'
        - 'ecr:BatchGetImage'
        - 'ecr:CompleteLayerUpload'
        - 'ecr:DescribeImages'
        - 'ecr:DescribeRepositories'
        - 'ecr:GetDownloadUrlForLayer'
        - 'ecr:GetRepositoryPolicy'
        - 'ecr:InitiateLayerUpload'
        - 'ecr:ListImages'
        - 'ecr:PutImage'
        - 'ecr:UploadLayerPart'

Execute the following commands to create the two CloudFormation Stacks. The Stacks use the same template with a different Stack name and RepoName parameter values.

# nlp-client
REPO_NAME=nlp-client
aws cloudformation create-stack \
    --stack-name ecr-repo-${REPO_NAME} \
    --template-body file://ecr-repo-not-shared.yml \
    --parameters \
        ParameterKey=RepoName,ParameterValue=${REPO_NAME} \
    --capabilities CAPABILITY_NAMED_IAM
    
# prose-app
REPO_NAME=nlp-client 
aws cloudformation create-stack \
    --stack-name ecr-repo-${REPO_NAME} \
    --template-body file://ecr-repo-not-shared.yml \
    --parameters \
        ParameterKey=RepoName,ParameterValue=${REPO_NAME} \
    --capabilities CAPABILITY_NAMED_IAM

Below, we see an example of the resulting two ECR repositories.

screen_shot_2019-10-27_at_9_35_49_pm

At this point, we have our three ECR repositories across the two AWS Accounts, with the proper ECR Repository Policies applied to each.

Building and Pushing Images to ECR

Next, we will build and push the three NLP application images to their corresponding ECR repositories. To confirm the ECR policies are working correctly, log in as the Developer-01 IAM User to perform the following actions.

Logged in as the vendor’s Developer-01 IAM User, build and push the Docker image to the rake-app repository. The Dockerfile and Go source code is located in each GitHub repository. With Go and Docker multi-stage builds, we will make super small Docker images, based on Scratch, with just the compiled Go executable binary. At less than 10–20 MBs in size, pushing and pulling these Docker images, even across accounts, is very fast. Make sure you substitute the variable values below with your pseudo vendor’s AWS Account and Region. I am using the acroymn, ISV (Independent Software Vendor) for the vendor.

# change me
ISV_ACCOUNT=111222333444
ISV_ECR_REGION=us-east-2

$(aws ecr get-login --no-include-email --region ${ISV_ECR_REGION})
docker build -t ${ISV_ACCOUNT}.dkr.ecr.${ISV_ECR_REGION}.amazonaws.com/rake-app:1.0.0 . --no-cache
docker push ${ISV_ACCOUNT}.dkr.ecr.${ISV_ECR_REGION}.amazonaws.com/rake-app:1.0.0

Below, we see the output from the vendor’s Developer-01 IAM User logging into the rake-app repository.

screen_shot_2019-10-27_at_8_55_06_pm

Then, we see the results of the vendor’s Development IAM User building and pushing the Docker Image to the rake-app repository.

screen_shot_2019-10-27_at_8_56_23_pm

Next, logged in as the customer’s Developer-01 IAM User, build and push the Docker images to the ECR nlp-client and prose-app repositories. Again, make sure you substitute the variable values below with your pseudo customer’s AWS Account and preferred Region.

# change me
CUSTOMER_ACCOUNT=999888777666
CUSTOMER_ECR_REGION=us-west-2

$(aws ecr get-login --no-include-email --region ${CUSTOMER_ECR_REGION})
docker build -t ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/nlp-client:1.0.0 . --no-cache
docker push ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/nlp-client:1.0.0

docker build -t ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/prose-app:1.0.0 . --no-cache
docker push ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/prose-app:1.0.0

At this point, each of the three ECR repositories has a Docker Image pushed to them.

Deploying Locally to Docker Swarm

As a simple demonstration of cross-account ECS access, we will start with Docker Swarm. Logged in as the customer’s Developer-01 IAM User and using the Docker Swarm Stack file included in the project, we can create and run a local copy of our NLP application in our customer’s account. First, we need to log into the vendor’s ECR repository in order to pull the image from the vendor’s ECR registry.

# change me
ISV_ACCOUNT=111222333444
ISV_ECR_REGION=us-east-2

aws ecr get-login \
    --registry-ids ${ISV_ACCOUNT} \
    --region ${ISV_ECR_REGION} \
    --no-include-email

The aws ecr get-login command simplifies the login process by returning a (very lengthy) docker login command in response (shown abridged below). According to AWS, the authorizationToken returned for each registry specified is a base64 encoded string that can be decoded and used in a docker login command to authenticate to an ECR registry.

docker login -u AWS -p eyJwYXlsb2FkI...joidENXMWg1WW0 \
    https://111222333444.dkr.ecr.us-east-2.amazonaws.com

Copy, paste and execute the entire docker login command back into your terminal. Below, we see an example of the expected terminal output from logging into the vendor’s ECR repository.

screen_shot_2019-10-28_at_7_05_03_am

Once successfully logged in to the vendor’s ECR repository, we will pull the image. Using the docker describe-repositories  and docker describe-images, we can list cross-account repositories and images your IAM User has access to if you are unsure.

aws ecr describe-repositories \
    --registry-id ${ISV_ACCOUNT} \
    --region ${ISV_ECR_REGION} \
    --repository-name rake-app

aws ecr describe-images \
    --registry-id ${ISV_ACCOUNT} \
    --region ${ISV_ECR_REGION} \
    --repository-name rake-app

docker pull ${ISV_ACCOUNT}.dkr.ecr.${ISV_ECR_REGION}.amazonaws.com/rake-app:1.0.0

Running the following command, you should see each of our three application Docker Images.

docker image ls --filter=reference='*amazonaws.com/*'

Below, we see an example of the expected terminal output from pulling the image and listing the images.

screen_shot_2019-10-28_at_7_11_32_am

Build the Docker Stack Locally

Next, build the Docker Swarm Stack. The Docker Compose file, stack.yml, is shown below. Note the location of the Images.

version: '3.7'

services:
  nlp-client:
    image: ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/nlp-client:1.0.0
    networks:
      - nlp-demo
    ports:
      - 8080:8080
    environment:
      - NLP_CLIENT_PORT
      - RAKE_ENDPOINT
      - PROSE_ENDPOINT
      - AUTH_KEY
  rake-app:
    image: ${ISV_ACCOUNT}.dkr.ecr.${ISV_ECR_REGION}.amazonaws.com/rake-app:1.0.0
    networks:
      - nlp-demo
    environment:
      - RAKE_PORT
      - AUTH_KEY
  prose-app:
    image: ${CUSTOMER_ACCOUNT}.dkr.ecr.${CUSTOMER_ECR_REGION}.amazonaws.com/prose-app:1.0.0
    networks:
      - nlp-demo
    environment:
      - PROSE_PORT
      - AUTH_KEY

networks:
  nlp-demo:

volumes:
  data: {}

Execute the following commands to deploy the Docker Stack to Docker Swarm. Again, make sure you substitute the variable values below with your pseudo vendor and customer’s AWS Accounts and Regions. Additionally, API uses an API Key to protect all exposed endpoints, except the /health endpoint, across all three services. You should change the default CloudFormation template’s API Key parameter to something more secure.

# change me
export ISV_ACCOUNT=111222333444
export ISV_ECR_REGION=us-east-2
export CUSTOMER_ACCOUNT=999888777666
export CUSTOMER_ECR_REGION=us-west-2
export AUTH_KEY=SuP3r5eCRetAutHK3y
 
# don't change me
export NLP_CLIENT_PORT=8080
export RAKE_PORT=8080
export PROSE_PORT=8080
export RAKE_ENDPOINT=http://rake-app:${RAKE_PORT}
export PROSE_ENDPOINT=http://prose-app:${PROSE_PORT}
 
docker swarm init 
docker stack deploy --compose-file stack.yml nlp

We can check the success of the deployment with the following commands.

docker stack ps nlp --no-trunc
docker container ls

Below, we see an example of the expected terminal output.

screen_shot_2019-10-23_at_11_56_38_pm.png

With the Docker Stack, we can hit the nlp-client service directly on localhost:8080. Unlike Fargate, which requires unique static ports for each container in the task, with Docker, we can choose run all the containers on the same port without conflict since only the nlp-client service is exposing port :8080. Additionally, there is no load balancer in front of the Stack, unlike ECS, since we only have a single node in our Swarm, and thus a single container instance of each microservice.

ecs-example-4

To test that the images were pulled successfully and the Docker Stack is running, we can execute a curl command against any of the API endpoints, such as /keywords. Below, I am using jq to pretty-print the JSON response payload.

#change me
AUTH_KEY=SuP3r5eCRetAutHK3y

curl -s -X POST \
    http://localhost:${NLP_CLIENT_PORT}/keywords \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer ${AUTH_KEY}" \
    -d '{"text": "The Internet is the global system of interconnected computer networks that use the Internet protocol suite to link devices worldwide."}' | jq

The resulting JSON payload should look similar to the following output. These results indicate that the nlp-client service was reached successfully and that it was then subsequently able to communicate with the rake-app service, whose container image originated from the vendor’s ECR repository.

[
    {
        "candidate": "interconnected computer networks",
        "score": 9
    },
    {
        "candidate": "link devices worldwide",
        "score": 9
    },
    {
        "candidate": "internet protocol suite",
        "score": 8
    },
    {
        "candidate": "global system",
        "score": 4
    },
    {
        "candidate": "internet",
        "score": 2
    }
]

Creating Amazon ECS Environment

Although using Docker Swarm locally is a great way to understand how cross-account ECR access works, it is not a typical use case for deploying containerized applications on the AWS Platform. More often, you would use Amazon ECS, Amazon Elastic Kubernetes Service (EKS), or enterprise versions of third-party orchestrators, such as Docker Enterprise, RedHat OpenShift, or Rancher.

Using CloudFormation and some very convenient CloudFormation templates supplied by Amazon as a starting point, we will create a complete ECS environment for our application. First, we will create a VPC to house the ECS cluster and a public-facing NLB to front our ECS-based application, using the public-vpc.yml template.

aws cloudformation create-stack \
    --stack-name public-vpc \
    --template-body file://public-vpc.yml \
    --capabilities CAPABILITY_NAMED_IAM

Next, we will create the ECS cluster and an Amazon ECS Task Definition, using the public-subnet-public-loadbalancer.yml template. Again, the Task Definition defines how ECS will deploy our application using AWS Fargate. Amazon Fargate allows you to run containers without having to manage servers or clusters. No EC2 instances to manage! Woot! Below, in the CloudFormation template, we see the ContainerDefinitions section of the TaskDefinition resource, container three container definitions. Note the three images and their ECR locations.

ContainerDefinitions:
  - Name: nlp-client
    Cpu: 256
    Memory: 1024
    Image: !Join ['.', [!Ref AWS::AccountId, 'dkr.ecr', !Ref AWS::Region, 'amazonaws.com/nlp-client:1.0.0']] 
    PortMappings:
      - ContainerPort: !Ref ContainerPortClient
    Essential: true
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-region: !Ref AWS::Region
        awslogs-group: !Ref CloudWatchLogsGroup
        awslogs-stream-prefix: ecs
    Environment:
      - Name: NLP_CLIENT_PORT
        Value: !Ref ContainerPortClient
      - Name: RAKE_ENDPOINT
        Value: !Join [':', ['http://localhost', !Ref ContainerPortRake]] 
      - Name: PROSE_ENDPOINT
        Value: !Join [':', ['http://localhost', !Ref ContainerPortProse]] 
      - Name: AUTH_KEY
        Value: !Ref AuthKey
  - Name: rake-app
    Cpu: 256
    Memory: 1024
    Image: !Join ['.', [!Ref VendorAccountId, 'dkr.ecr', !Ref VendorEcrRegion, 'amazonaws.com/rake-app:1.0.0']] 
    Essential: true
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-region: !Ref AWS::Region
        awslogs-group: !Ref CloudWatchLogsGroup
        awslogs-stream-prefix: ecs
    Environment:
      - Name: RAKE_PORT
        Value: !Ref ContainerPortRake
      - Name: AUTH_KEY
        Value: !Ref AuthKey
  - Name: prose-app
    Cpu: 256
    Memory: 1024
    Image: !Join ['.', [!Ref AWS::AccountId, 'dkr.ecr', !Ref AWS::Region, 'amazonaws.com/prose-app:1.0.0']] 
    Essential: true
    LogConfiguration:
      LogDriver: awslogs
      Options:
        awslogs-region: !Ref AWS::Region
        awslogs-group: !Ref CloudWatchLogsGroup
        awslogs-stream-prefix: ecs
    Environment:
      - Name: PROSE_PORT
        Value: !Ref ContainerPortProse
      - Name: AUTH_KEY
        Value: !Ref AuthKey

Execute the following command to create the ECS cluster and an Amazon ECS Task Definition using the CloudFormation template.

# change me
ISV_ACCOUNT=111222333444
ISV_ECR_REGION=us-east-2
AUTH_KEY=SuP3r5eCRetAutHK3y

aws cloudformation create-stack \
    --stack-name public-subnet-public-loadbalancer \
    --template-body file://public-subnet-public-loadbalancer.yml \
    --parameters \
        ParameterKey=VendorAccountId,ParameterValue=${ISV_ACCOUNT} \
        ParameterKey=VendorEcrRegion,ParameterValue=${ISV_ECR_REGION} \
        ParameterKey=AuthKey,ParameterValue=${AUTH_KEY} \
    --capabilities CAPABILITY_NAMED_IAM

Below, we see an example of the expected output from the CloudFormation management console.

screen_shot_2019-10-23_at_6_19_25_pm

The CloudFormation template does not enable CloudWatch Container Insights by default. Insights collects, aggregates, and summarizes metrics and logs from your containerized applications. To enable Insights, execute the following command.

aws ecs put-account-setting \
  --name "containerInsights" --value "enabled"

Confirming the Cross-account Policy

If everything went right in the previous steps, we should now have an ECS cluster, running our containerized application, including the container built from the vendor’s Docker image. Below, we see an example of the ECS cluster, displayed in the management console.

screen_shot_2019-10-23_at_6_19_36_pm

Within the ECR cluster, we should observe a single running ECS Service. According to AWS, Amazon ECS allows you to run and maintain a specified number of instances of a task definition simultaneously in an Amazon ECS cluster. This is called a service.screen_shot_2019-10-23_at_6_19_42_pm

We are running two instances of each container on ECS, thus two copies of the task within the single service. Each task runs it’s containers in a different Availability Zone for high-availability.

screen_shot_2019-10-26_at_11_10_06_pm

Drilling into the service, we should note the new NLB, associated with the new VPC, two public Subnets, and the corresponding Security Group.

screen_shot_2019-10-23_at_6_19_49_pm

Switching to the Task Definitions tab, we should see the details of our task. Note the three containers that compose the application. Note two are located in the customer’s ECR repositories, and one is located in the vendor’s ECR repository.

screen_shot_2019-10-23_at_6_21_03_pm

Drilling in a little farther, we will see the details of each container definition, including environment variables, passed from ECR to the container, and on to the actual Go-binary, running in the container.

screen_shot_2019-10-23_at_6_21_24_pm

Reaching our Application on ECS

Whereas with our earlier Docker Swarm example, the curl command was issued against /localhost, we now have the public-facing Network Load Balancer (NLB) in front of our ECS-based application. We will need to use the DNS name of your NLB as the host, to hit our application on ECS. The DNS address (A Record) can be obtained from the Load Balancer management console, as shown below, or from the Output tab of the public-vpc CloudFormation Stack

screen_shot_2019-10-25_at_1_08_00_pm.png

Another difference between the earlier Docker Swarm example and ECS is the port. Although the edge service, nlp-client, runs on port :8080, the NLB acts as a reverse proxy, passing requests from port :80 on the NLB to port :8080 of the nlp-client container instances (actually, the shared ENI of the running task). For the sake of brevity and simplicity, I did not set up a custom DNS name for the NLB, nor HTTPS, as you normally would in Production.

ecs-example-3

To test our deployed ECS, we can use a tool like curl or Postman to test the API’s endpoints. Below, we see a POST against the /tokens endpoint, using Postman. If you are using Postman, don’t forget to you will need to add the API Key to the Auth tab. The Auth type will be ‘Bearer Token’.

screen_shot_2019-10-23_at_10_11_12_pm

Cleaning Up

To clean up the demonstration’s AWS resources and Docker Stack, run the following scripts in the appropriate AWS Accounts. Importantly, similar to S3, you must delete all the Docker images in the ECR repositories first, before deleting the repository, or else you will receive a CloudFormation error. This includes untagged images.

# customer account only
aws ecr batch-delete-image \
    --repository-name nlp-client \
    --image-ids imageTag=1.0.0

aws ecr batch-delete-image \
    --repository-name prose-app \
    --image-ids imageTag=1.0.0

aws cloudformation delete-stack \
    --stack-name ecr-repo-nlp-client

aws cloudformation delete-stack \
    --stack-name ecr-repo-prose-app

aws cloudformation delete-stack \
    --stack-name public-subnet-public-loadbalancer

aws cloudformation delete-stack \
    --stack-name public-vpc
docker stack rm nlp

# vendor account only
aws ecr batch-delete-image \
    --repository-name rake-app \
    --image-ids imageTag=1.0.0

aws cloudformation delete-stack \
    --stack-name ecr-repo-rake-app

# both accounts
aws cloudformation delete-stack \
    --stack-name developer-user-group

Conclusion

In the preceding post, we saw one-way multiple AWS Accounts can share ECR-based Docker Images. There are variations and restrictions to the configuration of the ECR Repository Policies, depending on the deployment tools you are using, such as AWS CodeBuild, AWS CodeDeploy, or AWS Elastic Beanstalk. AWS does an excellent job of providing some examples in their documentation, including Amazon ECR Repository Policy Examples and Amazon Elastic Container Registry Identity-Based Policy Examples.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , ,

Leave a comment