Posts Tagged Amazon

Streaming Analytics with Data Warehouses, using Amazon Kinesis Data Firehose, Amazon Redshift, and Amazon QuickSight

Introduction

Databases are ideal for storing and organizing data that requires a high volume of transaction-oriented query processing while maintaining data integrity. In contrast, data warehouses are designed for performing data analytics on vast amounts of data from one or more disparate sources. In our fast-paced, hyper-connected world, those sources often take the form of continuous streams of web application logs, e-commerce transactions, social media feeds, online gaming activities, financial trading transactions, and IoT sensor readings. Streaming data must be analyzed in near real-time, while often first requiring cleansing, transformation, and enrichment.

In the following post, we will demonstrate the use of Amazon Kinesis Data Firehose, Amazon Redshift, and Amazon QuickSight to analyze streaming data. We will simulate time-series data, streaming from a set of IoT sensors to Kinesis Data Firehose. Kinesis Data Firehose will write the IoT data to an Amazon S3 Data Lake, where it will then be copied to Redshift in near real-time. In Amazon Redshift, we will enhance the streaming sensor data with data contained in the Redshift data warehouse, which has been gathered and denormalized into a star schema.

Streaming-Kinesis-Redshift

In Redshift, we can analyze the data, asking questions like, what is the min, max, mean, and median temperature over a given time period at each sensor location. Finally, we will use Amazon Quicksight to visualize the Redshift data using rich interactive charts and graphs, including displaying geospatial sensor data.

screen_shot_2020-03-04_at_9.27.33_pm

Featured Technologies

The following AWS services are discussed in this post.

Amazon Kinesis Data Firehose

According to Amazon, Amazon Kinesis Data Firehose can capture, transform, and load streaming data into data lakes, data stores, and analytics tools. Direct Kinesis Data Firehose integrations include Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. Kinesis Data Firehose enables near real-time analytics with existing business intelligence (BI) tools and dashboards.

Amazon Redshift

According to Amazon, Amazon Redshift is the most popular and fastest cloud data warehouse. With Redshift, users can query petabytes of structured and semi-structured data across your data warehouse and data lake using standard SQL. Redshift allows users to query and export data to and from data lakes. Redshift can federate queries of live data from Redshift, as well as across one or more relational databases.

Amazon Redshift Spectrum

According to Amazon, Amazon Redshift Spectrum can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum tables are created by defining the structure for data files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue or an Apache Hive metastore. While Redshift Spectrum is an alternative to copying the data into Redshift for analysis, we will not be using Redshift Spectrum in this post.

Amazon QuickSight

According to Amazon, Amazon QuickSight is a fully managed business intelligence service that makes it easy to deliver insights to everyone in an organization. QuickSight lets users easily create and publish rich, interactive dashboards that include Amazon QuickSight ML Insights. Dashboards can then be accessed from any device and embedded into applications, portals, and websites.

What is a Data Warehouse?

According to Amazon, a data warehouse is a central repository of information that can be analyzed to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through business intelligence tools, SQL clients, and other analytics applications.

Demonstration

Source Code

All the source code for this post can be found on GitHub. Use the following command to git clone a local copy of the project.

git clone \
–branch master –single-branch –depth 1 –no-tags \
https://github.com/garystafford/kinesis-redshift-streaming-demo.git

CloudFormation

Use the two AWS CloudFormation templates, included in the project, to build two CloudFormation stacks. Please review the two templates and understand the costs of the resources before continuing.

The first CloudFormation template, redshift.yml, provisions a new Amazon VPC with associated network and security resources, a single-node Redshift cluster, and two S3 buckets.

The second CloudFormation template, kinesis-firehose.yml, provisions an Amazon Kinesis Data Firehose delivery stream, associated IAM Policy and Role, and an Amazon CloudWatch log group and two log streams.

Change the REDSHIFT_PASSWORD value to ensure your security. Optionally, change the REDSHIFT_USERNAME value. Make sure that the first stack completes successfully, before creating the second stack.

export AWS_DEFAULT_REGION=us-east-1
REDSHIFT_USERNAME=awsuser
REDSHIFT_PASSWORD=5up3r53cr3tPa55w0rd
# Create resources
aws cloudformation create-stack \
–stack-name redshift-stack \
–template-body file://cloudformation/redshift.yml \
–parameters ParameterKey=MasterUsername,ParameterValue=${REDSHIFT_USERNAME} \
ParameterKey=MasterUserPassword,ParameterValue=${REDSHIFT_PASSWORD} \
ParameterKey=InboundTraffic,ParameterValue=$(curl ifconfig.me -s)/32 \
–capabilities CAPABILITY_NAMED_IAM
# Wait for first stack to complete
aws cloudformation create-stack \
–stack-name kinesis-firehose-stack \
–template-body file://cloudformation/kinesis-firehose.yml \
–parameters ParameterKey=MasterUserPassword,ParameterValue=${REDSHIFT_PASSWORD} \
–capabilities CAPABILITY_NAMED_IAM

Review AWS Resources

To confirm all the AWS resources were created correctly, use the AWS Management Console.

Kinesis Data Firehose

In the Amazon Kinesis Dashboard, you should see the new Amazon Kinesis Data Firehose delivery stream, redshift-delivery-stream.

screen_shot_2020-03-02_at_3.58.12_pm

The Details tab of the new Amazon Kinesis Firehose delivery stream should look similar to the following. Note the IAM Role, FirehoseDeliveryRole, which was created and associated with the delivery stream by CloudFormation.

screen_shot_2020-03-02_at_6.30.25_pm

We are not performing any transformations of the incoming messages. Note the new S3 bucket that was created and associated with the stream by CloudFormation. The bucket name was randomly generated. This bucket is where the incoming messages will be written.

screen_shot_2020-03-02_at_6.31.13_pm

Note the buffer conditions of 1 MB and 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written in JSON format, using GZIP compression, to S3. These are the minimal buffer conditions, and as close to real-time streaming to Redshift as we can get.

screen_shot_2020-03-02_at_6.31.28_pm

Note the COPY command, which is used to copy the messages from S3 to the message table in Amazon Redshift. Kinesis uses the IAM Role, ClusterPermissionsRole, created by CloudFormation, for credentials. We are using a Manifest to copy the data to Redshift from S3. According to Amazon, a Manifest ensures that the COPY command loads all of the required files, and only the required files, for a data load. The Manifests are automatically generated and managed by the Kinesis Firehose delivery stream.

screen_shot_2020-03-02_at_6.31.43_pm

Redshift Cluster

In the Amazon Redshift Console, you should see a new single-node Redshift cluster consisting of one Redshift dc2.large Dense Compute node type.

screen_shot_2020-03-02_at_7.09.35_pm

Note the new VPC, Subnet, and VPC Security Group created by CloudFormation. Also, observe that the Redshift cluster is publicly accessible from outside the new VPC.

screen_shot_2020-03-02_at_7.09.41_pm

Redshift Ingress Rules

The single-node Redshift cluster is assigned to an AWS Availability Zone in the US East (N. Virginia) us-east-1 AWS Region. The cluster is associated with a VPC Security Group. The Security Group contains three inbound rules, all for Redshift port 5439. The IP addresses associated with the three inbound rules provide access to the following: 1) a /27 CIDR block for Amazon QuickSight in us-east-1, a /27 CIDR block for Amazon Kinesis Firehose in us-east-1, and to you, a /32 CIDR block with your current IP address. If your IP address changes or you do not use the us-east-1 Region, you will need to change one or all of these IP addresses. The list of Kinesis Firehose IP addresses is here. The list of QuickSight IP addresses is here.

screen_shot_2020-03-02_at_7.09.59_pm

If you cannot connect to Redshift from your local SQL client, most often, your IP address has changed and is incorrect in the Security Group’s inbound rule.

Redshift SQL Client

You can choose to use the Redshift Query Editor to interact with Redshift or use a third-party SQL client for greater flexibility. To access the Redshift Query Editor, use the user credentials specified in the redshift.yml CloudFormation template.

screen_shot_2020-03-02_at_4.01.49_pm

There is a lot of useful functionality in the Redshift Console and within the Redshift Query Editor. However, a notable limitation of the Redshift Query Editor, in my opinion, is the inability to execute multiple SQL statements at the same time. Whereas, most SQL clients allow multiple SQL queries to be executed at the same time.

screen_shot_2020-03-04_at_8.48.39_am

I prefer to use JetBrains PyCharm IDE. PyCharm has out-of-the-box integration with Redshift. Using PyCharm, I can edit the project’s Python, SQL, AWS CLI shell, and CloudFormation code, all from within PyCharm.

screen_shot_2020-03-02_at_7.12.11_pm

If you use any of the common SQL clients, you will need to set-up a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) connection to Redshift. The ODBC and JDBC connection strings can be found in the Redshift cluster’s Properties tab or in the Outputs tab from the CloudFormation stack, redshift-stack.

screen_shot_2020-03-04_at_9.07.14_am

You will also need the Redshift database username and password you included in the aws cloudformation create-stack AWS CLI command you executed previously. Below, we see PyCharm’s Project Data Sources window containing a new data source for the Redshift dev database.

screen_shot_2020-03-02_at_6.33.01_pm

Database Schema and Tables

When CloudFormation created the Redshift cluster, it also created a new database, dev. Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL commands to create a new database schema, sensor, and six tables in the sensor schema.

Create new schema in Redshift DB
DROP SCHEMA IF EXISTS sensor CASCADE;
CREATE SCHEMA sensor;
SET search_path = sensor;
Create (6) tables in Redshift DB
CREATE TABLE message streaming data table
(
id BIGINT IDENTITY (1, 1), message id
guid VARCHAR(36) NOT NULL, device guid
ts BIGINT NOT NULL DISTKEY SORTKEY, epoch in seconds
temp NUMERIC(5, 2) NOT NULL, temperature reading
created TIMESTAMP DEFAULT ('now'::text)::timestamp with time zone row created at
);
CREATE TABLE location dimension table
(
id INTEGER NOT NULL DISTKEY SORTKEY, location id
long NUMERIC(10, 7) NOT NULL, longitude
lat NUMERIC(10, 7) NOT NULL, latitude
description VARCHAR(256) location description
);
CREATE TABLE history dimension table
(
id INTEGER NOT NULL DISTKEY SORTKEY, history id
serviced BIGINT NOT NULL, service date
action VARCHAR(20) NOT NULL, INSTALLED, CALIBRATED, FIRMWARE UPGRADED, DECOMMISSIONED, OTHER
technician_id INTEGER NOT NULL, technician id
notes VARCHAR(256) notes
);
CREATE TABLE sensor dimension table
(
id INTEGER NOT NULL DISTKEY SORTKEY, sensor id
guid VARCHAR(36) NOT NULL, device guid
mac VARCHAR(18) NOT NULL, mac address
sku VARCHAR(18) NOT NULL, product sku
upc VARCHAR(12) NOT NULL, product upc
active BOOLEAN DEFAULT TRUE, active status
notes VARCHAR(256) notes
);
CREATE TABLE manufacturer dimension table
(
id INTEGER NOT NULL DISTKEY SORTKEY, manufacturer id
name VARCHAR(100) NOT NULL, company name
website VARCHAR(100) NOT NULL, company website
notes VARCHAR(256) notes
);
CREATE TABLE sensors fact table
(
id BIGINT IDENTITY (1, 1) DISTKEY SORTKEY, fact id
sensor_id INTEGER NOT NULL, sensor id
manufacturer_id INTEGER NOT NULL, manufacturer id
location_id INTEGER NOT NULL, location id
history_id BIGINT NOT NULL, history id
message_guid VARCHAR(36) NOT NULL sensor guid
);

Star Schema

The tables represent denormalized data, taken from one or more relational database sources. The tables form a star schema.  The star schema is widely used to develop data warehouses. The star schema consists of one or more fact tables referencing any number of dimension tables. The location, manufacturer, sensor, and history tables are dimension tables. The sensors table is a fact table.

In the diagram below, the foreign key relationships are virtual, not physical. The diagram was created using PyCharm’s schema visualization tool. Note the schema’s star shape. The message table is where the streaming IoT data will eventually be written. The message table is related to the sensors fact table through the common guid field.

schema-light-2

Sample Data to S3

Next, copy the sample data, included in the project, to the S3 data bucket created with CloudFormation. Each CSV-formatted data file corresponds to one of the tables we previously created. Since the bucket name is semi-random, we can use the AWS CLI and jq to get the bucket name, then use it to perform the copy commands.

# Get data bucket name
DATA_BUCKET=$(aws cloudformation describe-stacks \
–stack-name redshift-stack \
| jq -r '.Stacks[].Outputs[] | select(.OutputKey == "DataBucket") | .OutputValue')
echo $DATA_BUCKET
# Copy data
aws s3 cp data/history.csv s3://${DATA_BUCKET}/history/history.csv
aws s3 cp data/location.csv s3://${DATA_BUCKET}/location/location.csv
aws s3 cp data/manufacturer.csv s3://${DATA_BUCKET}/manufacturer/manufacturer.csv
aws s3 cp data/sensor.csv s3://${DATA_BUCKET}/sensor/sensor.csv
aws s3 cp data/sensors.csv s3://${DATA_BUCKET}/sensors/sensors.csv

The output from the AWS CLI should look similar to the following.

screen_shot_2020-03-02_at_7.18.22_pm

Sample Data to Redshift

Whereas a relational database, such as Amazon RDS is designed for online transaction processing (OLTP), Amazon Redshift is designed for online analytic processing (OLAP) and business intelligence applications. To write data to Redshift we typically use the COPY command versus frequent, individual INSERT statements, as with OLTP, which would be prohibitively slow. According to Amazon, the Redshift COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts.

In the following series of SQL statements, replace the placeholder, your_bucket_name, in five places with your S3 data bucket name. The bucket name will start with the prefix, redshift-stack-databucket. The bucket name can be found in the Outputs tab of the CloudFormation stack, redshift-stack. Next, replace the placeholder, cluster_permissions_role_arn, with the ARN (Amazon Resource Name) of the ClusterPermissionsRole. The ARN is formatted as follows, arn:aws:iam::your-account-id:role/ClusterPermissionsRole. The ARN can be found in the Outputs tab of the CloudFormation stack, redshift-stack.

Using the Redshift Query Editor or your SQL client of choice, execute the SQL statements to copy the sample data from S3 to each of the corresponding tables in the Redshift dev database. The TRUNCATE commands guarantee there is no previous sample data present in the tables.

** MUST FIRST CHANGE your_bucket_name and cluster_permissions_role_arn **
sensor schema
SET search_path = sensor;
Copy sample data to tables from S3
TRUNCATE TABLE history;
COPY history (id, serviced, action, technician_id, notes)
FROM 's3://your_bucket_name/history/'
CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
CSV IGNOREHEADER 1;
TRUNCATE TABLE location;
COPY location (id, long, lat, description)
FROM 's3://your_bucket_name/location/'
CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
CSV IGNOREHEADER 1;
TRUNCATE TABLE sensor;
COPY sensor (id, guid, mac, sku, upc, active, notes)
FROM 's3://your_bucket_name/sensor/'
CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
CSV IGNOREHEADER 1;
TRUNCATE TABLE manufacturer;
COPY manufacturer (id, name, website, notes)
FROM 's3://your_bucket_name/manufacturer/'
CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
CSV IGNOREHEADER 1;
TRUNCATE TABLE sensors;
COPY sensors (sensor_id, manufacturer_id, location_id, history_id, message_guid)
FROM 's3://your_bucket_name/sensors/'
CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
CSV IGNOREHEADER 1;
SELECT COUNT(*) FROM history; 30
SELECT COUNT(*) FROM location; 6
SELECT COUNT(*) FROM sensor; 6
SELECT COUNT(*) FROM manufacturer; 1
SELECT COUNT(*) FROM sensors; 30

Database Views

Next, create four Redshift database Views. These views may be used to analyze the data in Redshift, and later, in Amazon QuickSight.

  1. sensor_msg_detail: Returns aggregated sensor details, using the sensors fact table and all five dimension tables in a SQL Join.
  2. sensor_msg_count: Returns the number of messages received by Redshift, for each sensor.
  3. sensor_avg_temp: Returns the average temperature from each sensor, based on all the messages received from each sensor.
  4. sensor_avg_temp_current: View is identical for the previous view but limited to the last 30 minutes.

Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL statements.

sensor schema
SET search_path = sensor;
View 1: Sensor details
DROP VIEW IF EXISTS sensor_msg_detail;
CREATE OR REPLACE VIEW sensor_msg_detail AS
SELECT ('1970-01-01'::date + e.ts * interval '1 second') AS recorded,
e.temp,
s.guid,
s.sku,
s.mac,
l.lat,
l.long,
l.description AS location,
('1970-01-01'::date + h.serviced * interval '1 second') AS installed,
e.created AS redshift
FROM sensors f
INNER JOIN sensor s ON (f.sensor_id = s.id)
INNER JOIN history h ON (f.history_id = h.id)
INNER JOIN location l ON (f.location_id = l.id)
INNER JOIN manufacturer m ON (f.manufacturer_id = m.id)
INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
AND h.action = 'INSTALLED'
ORDER BY f.id;
View 2: Message count per sensor
DROP VIEW IF EXISTS sensor_msg_count;
CREATE OR REPLACE VIEW sensor_msg_count AS
SELECT count(e.temp) AS msg_count,
s.guid,
l.lat,
l.long,
l.description AS location
FROM sensors f
INNER JOIN sensor s ON (f.sensor_id = s.id)
INNER JOIN history h ON (f.history_id = h.id)
INNER JOIN location l ON (f.location_id = l.id)
INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY msg_count, s.guid;
View 3: Average temperature per sensor (all data)
DROP VIEW IF EXISTS sensor_avg_temp;
CREATE OR REPLACE VIEW sensor_avg_temp AS
SELECT avg(e.temp) AS avg_temp,
count(s.guid) AS msg_count,
s.guid,
l.lat,
l.long,
l.description AS location
FROM sensors f
INNER JOIN sensor s ON (f.sensor_id = s.id)
INNER JOIN history h ON (f.history_id = h.id)
INNER JOIN location l ON (f.location_id = l.id)
INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY avg_temp, s.guid;
View 4: Average temperature per sensor (last 30 minutes)
DROP VIEW IF EXISTS sensor_avg_temp_current;
CREATE OR REPLACE VIEW sensor_avg_temp_current AS
SELECT avg(e.temp) AS avg_temp,
count(s.guid) AS msg_count,
s.guid,
l.lat,
l.long,
l.description AS location
FROM sensors f
INNER JOIN sensor s ON (f.sensor_id = s.id)
INNER JOIN history h ON (f.history_id = h.id)
INNER JOIN location l ON (f.location_id = l.id)
INNER JOIN (SELECT ('1970-01-01'::date + ts * interval '1 second') AS recorded_time,
guid,
temp
FROM message
WHERE DATEDIFF(minute, recorded_time, GETDATE()) <= 30) e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY avg_temp, s.guid;

At this point, you should have a total of six tables and four views in the sensor schema of the dev database in Redshift.

Test the System

With all the necessary AWS resources and Redshift database objects created and sample data in the Redshift database, we can test the system. The included Python script, kinesis_put_test_msg.py, will generate a single test message and send it to Kinesis Data Firehose. If everything is working, the message should be delivered from Kinesis Data Firehose to S3, then copied to Redshift, and appear in the message table.

Install the required Python packages and then execute the Python script.

# Install required Python packages
python3 -m pip install –user -r scripts/requirements.txt
# Set default AWS Region for script
export AWS_DEFAULT_REGION=us-east-1
# Execute script in foreground
python3 ./scripts/kinesis_put_test_msg.py

Run the following SQL query to confirm the record is in the message table of the dev database. It will take at least one minute for the message to appear in Redshift.

SELECT COUNT(*) FROM message;

Once the message is confirmed to be present in the message table, delete the record by truncating the table.

TRUNCATE TABLE message;

Streaming Data

Assuming the test message worked, we can proceed with simulating the streaming IoT sensor data. The included Python script, kinesis_put_streaming_data.py, creates six concurrent threads, representing six temperature sensors.

#!/usr/bin/env python3
# Simulated multiple streaming time-series iot sensor data
# Author: Gary A. Stafford
# Date: Revised October 2020
import json
import random
from datetime import datetime
import boto3
import time as tm
import numpy as np
import threading
STREAM_NAME = 'redshift-delivery-stream'
client = boto3.client('firehose')
class MyThread(threading.Thread):
def __init__(self, thread_id, sensor_guid, temp_max):
threading.Thread.__init__(self)
self.thread_id = thread_id
self.sensor_id = sensor_guid
self.temp_max = temp_max
def run(self):
print("Starting Thread: " + str(self.thread_id))
self.create_data()
print("Exiting Thread: " + str(self.thread_id))
def create_data(self):
start = 0
stop = 20
step = 0.1 # step size (e.g 0 to 20, step .1 = 200 steps in cycle)
repeat = 2 # how many times to repeat cycle
freq = 60 # frequency of temperature reading in seconds
max_range = int(stop * (1 / step))
time = np.arange(start, stop, step)
amplitude = np.sin(time)
for x in range(0, repeat):
for y in range(0, max_range):
temperature = round((((amplitude[y] + 1.0) * self.temp_max) + random.uniform(5, 5)) + 60, 2)
payload = {
'guid': self.sensor_id,
'ts': int(datetime.now().strftime('%s')),
'temp': temperature
}
print(json.dumps(payload))
self.send_to_kinesis(payload)
tm.sleep(freq)
@staticmethod
def send_to_kinesis(payload):
_ = client.put_record(
DeliveryStreamName=STREAM_NAME,
Record={
'Data': json.dumps(payload)
}
)
def main():
sensor_guids = [
"03e39872-e105-4be4-83c0-9ade818465dc",
"fa565921-fddd-4bfb-a7fd-d617f816df4b",
"d120422d-5789-435d-9dc6-73d8489b04c2",
"93238559-4d55-4b2a-bdcb-6aa3be0f3908",
"dbc05806-6872-4f0a-aca2-f794cc39bd9b",
"f9ade639-f936-4954-aa5a-1f2ed86c9bcf"
]
timeout = 300 # arbitrarily offset the start of threads (60 / 5 = 12)
# Create new threads
thread1 = MyThread(1, sensor_guids[0], 25)
thread2 = MyThread(2, sensor_guids[1], 10)
thread3 = MyThread(3, sensor_guids[2], 7)
thread4 = MyThread(4, sensor_guids[3], 30)
thread5 = MyThread(5, sensor_guids[4], 5)
thread6 = MyThread(6, sensor_guids[5], 12)
# Start new threads
thread1.start()
tm.sleep(timeout * 1)
thread2.start()
tm.sleep(timeout * 2)
thread3.start()
tm.sleep(timeout * 1)
thread4.start()
tm.sleep(timeout * 3)
thread5.start()
tm.sleep(timeout * 2)
thread6.start()
# Wait for threads to terminate
thread1.join()
thread2.join()
thread3.join()
thread4.join()
thread5.join()
thread6.join()
print("Exiting Main Thread")
if __name__ == '__main__':
main()

The simulated data uses an algorithm that follows an oscillating sine wave or sinusoid, representing rising and falling temperatures. In the script, I have configured each thread to start with an arbitrary offset to add some randomness to the simulated data.

screen_shot_2020-03-04_at_9.27.33_pm

The variables within the script can be adjusted to shorten or lengthen the time it takes to stream the simulated data. By default, each of the six threads creates 400 messages per sensor, in one-minute increments. Including the offset start of each proceeding thread, the total runtime of the script is about 7.5 hours to generate 2,400 simulated IoT sensor temperature readings and push to Kinesis Data Firehose. Make sure you can guarantee you will maintain a connection to the Internet for the entire runtime of the script. I normally run the script in the background, from a small EC2 instance.

To use the Python script, execute either of the two following commands. Using the first command will run the script in the foreground. Using the second command will run the script in the background.

# Install required Python packages
python3 -m pip install –user -r scripts/requirements.txt
# Set default AWS Region for script
export AWS_DEFAULT_REGION=us-east-1
# Option #1: Execute script in foreground
python3 ./scripts/kinesis_put_streaming_data.py
# Option #2: execute script in background
nohup python3 -u ./scripts/kinesis_put_streaming_data.py > output.log 2>&1 </dev/null &
# Check that the process is running
ps -aux | grep 'python3 -u ./scripts/kinesis_put_streaming_data.py'
# Wait 1-2 minutes, then check output to confirm script is working
cat output.log

Viewing the output.log file, you should see messages being generated on each thread and sent to Kinesis Data Firehose. Each message contains the GUID of the sensor, a timestamp, and a temperature reading.

Screen Shot 2020-10-14 at 10.00.37 AM

The messages are sent to Kinesis Data Firehose, which in turn writes the messages to S3. The messages are written in JSON format using GZIP compression. Below, we see an example of the GZIP compressed JSON files in S3. The JSON files are partitioned by year, month, day, and hour.

screen_shot_2020-03-04_at_11.04.36_pm

Confirm Data Streaming to Redshift

From the Amazon Kinesis Firehose Console Metrics tab, you should see incoming messages flowing to S3 and on to Redshift.

screen_shot_2020-03-03_at_6.17.14_pm

Executing the following SQL query should show an increasing number of messages.

SELECT COUNT(*) FROM message;

How Near Real-time?

Earlier, we saw how the Amazon Kinesis Data Firehose delivery stream was configured to buffer data at the rate of 1 MB or 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written to S3. Each record in the message table has two timestamps. The first timestamp, ts, is when the temperature reading was recorded. The second timestamp, created, is when the message was written to Redshift, using the COPY command. We can calculate the delta in seconds between the two timestamps using the following SQL query in Redshift.

SELECT ('1970-01-01'::date + ts * interval '1 second') AS recorded_time,
created AS redshift_time,
DATEDIFF(seconds, recorded_time, redshift_time) AS diff_seconds
FROM message
ORDER BY diff_seconds;

Using the results of the Redshift query, we can visualize the results in Amazon QuickSight. In my own tests, we see that for 2,400 messages, over approximately 7.5 hours, the minimum delay was 1 second, and a maximum delay was 64 seconds. Hence, near real-time, in this case, is about one minute or less, with an average latency of roughly 30 seconds.

latency

Analyzing the Data with Redshift

I suggest waiting at least thirty minutes for a significant number of messages copied into Redshift. With the data streaming into Redshift, execute each of the database views we created earlier. You should see the streaming message data, joined to the existing static data in Redshift. As data continues to stream into Redshift, the views will display different results based on the current message table contents.

Here, we see the first ten results of the sensor_msg_detail view.


recorded temp guid sku mac lat long location installed redshift
2020-03-04 03:31:59.000000 105.56 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:33:01.580147
2020-03-04 03:29:59.000000 95.93 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:31:01.388887
2020-03-04 03:26:58.000000 91.93 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:28:01.099796
2020-03-04 03:25:58.000000 88.70 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:26:00.196113
2020-03-04 03:22:58.000000 87.65 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:23:01.558514
2020-03-04 03:20:58.000000 77.35 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:21:00.691347
2020-03-04 03:16:57.000000 71.84 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:17:59.307510
2020-03-04 03:15:57.000000 72.35 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:15:59.813656
2020-03-04 03:14:57.000000 67.95 03e39872-e105-4be4-83c0-9ade818465dc PR49-24A 8e:fa:46:09:14:b2 37.7068476 -122.4191599 Research Lab #2203 2018-01-31 12:00:00.000000 2020-03-04 03:15:59.813656

Next, we see the results of the sensor_avg_temp view.


avg_temp guid lat long location
65.25 dbc05806-6872-4f0a-aca2-f794cc39bd9b 37.7066541 -122.4181399 Wafer Inspection Lab #0210A
67.23 d120422d-5789-435d-9dc6-73d8489b04c2 37.7072686 -122.4187016 Zone 4 Wafer Processing Area B3
70.23 fa565921-fddd-4bfb-a7fd-d617f816df4b 37.7071763 -122.4190397 Research Lab #2209
72.22 f9ade639-f936-4954-aa5a-1f2ed86c9bcf 37.7067618 -122.4186191 Wafer Inspection Lab #0211C
85.48 03e39872-e105-4be4-83c0-9ade818465dc 37.7068476 -122.4191599 Research Lab #2203
90.69 93238559-4d55-4b2a-bdcb-6aa3be0f3908 37.7070334 -122.4184393 Zone 2 Semiconductor Assembly Area A2

Amazon QuickSight

In a recent post, Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight: Part 2, I detailed getting started with Amazon QuickSight. In this post, I will assume you are familiar with QuickSight.

Amazon recently added a full set of aws quicksight APIs for interacting with QuickSight. Though, for this part of the demonstration, we will be working directly in the Amazon QuickSight Console, as opposed to the AWS CLI, AWS CDK, or CloudFormation.

Redshift Data Sets

To visualize the data from Amazon Redshift, we start by creating Data Sets in QuickSight. QuickSight supports a large number of data sources for creating data sets. We will use the Redshift data source. If you recall, we added an inbound rule for QuickSight, allowing us to connect to our Redshift cluster in us-east-1.

screen_shot_2020-03-02_at_10.21.35_pm

We will select the sensor schema, which is where the tables and views for this demonstration are located.

screen_shot_2020-03-02_at_10.21.49_pm

We can choose any of the tables or views in the Redshift dev database that we want to use for visualization.

screen_shot_2020-03-02_at_10.22.11_pm

Below, we see examples of two new data sets, shown in the QuickSight Data Prep Console. Note how QuickSight automatically recognizes field types, including dates, latitude, and longitude.

screen_shot_2020-03-02_at_10.23.12_pm

screen_shot_2020-03-02_at_10.59.32_pm

Visualizations

Using the data sets, QuickSight allows us to create a wide number of rich visualizations. Below, we see the simulated time-series data from the six temperature sensors.

screen_shot_2020-03-04_at_9.26.26_pm

Next, we see an example of QuickSight’s ability to show geospatial data. The Map shows the location of each sensor and the average temperature recorded by that sensor.

screen_shot_2020-03-04_at_9.26.51_pm

Cleaning Up

To remove the resources created for this post, use the following series of AWS CLI commands.

# Get data bucket name
DATA_BUCKET=$(aws cloudformation describe-stacks \
–stack-name redshift-stack \
| jq -r '.Stacks[].Outputs[] | select(.OutputKey == "DataBucket") | .OutputValue')
echo ${DATA_BUCKET}
# Get log bucket name
LOG_BUCKET=$(aws cloudformation describe-stacks \
–stack-name redshift-stack \
| jq -r '.Stacks[].Outputs[] | select(.OutputKey == "LogBucket") | .OutputValue')
echo ${LOG_BUCKET}
# Delete demonstration resources
python3 ./scripts/delete_buckets.py
aws cloudformation delete-stack –stack-name kinesis-firehose-stack
# Wait for first stack to be deleted
aws cloudformation delete-stack –stack-name redshift-stack

Conclusion

In this brief post, we have learned how streaming data can be analyzed in near real-time, in Amazon Redshift, using Amazon Kinesis Data Firehose. Further, we explored how the results of those analyses can be visualized in Amazon QuickSight. For customers that depend on a data warehouse for data analytics, but who also have streaming data sources, the use of Amazon Kinesis Data Firehose or Amazon Redshift Spectrum is an excellent choice.

This blog represents my own viewpoints and not of my employer, Amazon Web Services.

, , , , , , ,

Leave a comment

Getting Started with PostgreSQL using Amazon RDS, CloudFormation, pgAdmin, and Python

Introduction

In the following post, we will explore how to get started with Amazon Relational Database Service (RDS) for PostgreSQL. CloudFormation will be used to build a PostgreSQL master database instance and a single read replica in a new VPC. AWS Systems Manager Parameter Store will be used to store our CloudFormation configuration values. Amazon RDS Event Notifications will send text messages to our mobile device to let us know when the RDS instances are ready for use. Once running, we will examine a variety of methods to interact with our database instances, including pgAdmin, Adminer, and Python.

Technologies

The primary technologies used in this post include the following.

PostgreSQL

Image result for postgres logoAccording to its website, PostgreSQL, commonly known as Postgres, is the world’s most advanced Open Source relational database. Originating at UC Berkeley in 1986, PostgreSQL has more than 30 years of active core development. PostgreSQL has earned a strong reputation for its proven architecture, reliability, data integrity, robust feature set, extensibility. PostgreSQL runs on all major operating systems and has been ACID-compliant since 2001.

Amazon RDS for PostgreSQL

Image result for amazon rds logoAccording to Amazon, Amazon Relational Database Service (RDS) provides six familiar database engines to choose from, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. RDS is available on several database instance types - optimized for memory, performance, or I/O.

Amazon RDS for PostgreSQL makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud. Amazon RDS supports the latest PostgreSQL version 11, which includes several enhancements to performance, robustness, transaction management, query parallelism, and more.

AWS CloudFormation

Deployment__Management_copy_AWS_CloudFormation-512

According to Amazon, CloudFormation provides a common language to describe and provision all the infrastructure resources within AWS-based cloud environments. CloudFormation allows you to use a JSON- or YAML-based template to model and provision all the resources needed for your applications across all AWS regions and accounts, in an automated and secure manner.

Demonstration

Architecture

Below, we see an architectural representation of what will be built in the demonstration. This is not a typical three-tier AWS architecture, wherein the RDS instances would be placed in private subnets (data tier) and accessible only by the application tier, running on AWS. The architecture for the demonstration is designed for interacting with RDS through external database clients such as pgAdmin, and applications like our local Python scripts, detailed later in the post.

RDS AWS Arch Diagram

Source Code

All source code for this post is available on GitHub in a single public repository, postgres-rds-demo.

.
├── LICENSE.md
├── README.md
├── cfn-templates
│   ├── event.template
│   ├── rds.template
├── parameter_store_values.sh
├── python-scripts
│   ├── create_pagila_data.py
│   ├── database.ini
│   ├── db_config.py
│   ├── query_postgres.py
│   ├── requirements.txt
│   └── unit_tests.py
├── sql-scripts
│   ├── pagila-insert-data.sql
│   └── pagila-schema.sql
└── stack.yml

To clone the GitHub repository, execute the following command.

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/aws-rds-postgres.git

Prerequisites

For this demonstration, I will assume you already have an AWS account. Further, that you have the latest copy of the AWS CLI and Python 3 installed on your development machine. Optionally, for pgAdmin and Adminer, you will also need to have Docker installed.

Steps

In this demonstration, we will perform the following steps.

  • Put CloudFormation configuration values in Parameter Store;
  • Execute CloudFormation templates to create AWS resources;
  • Execute SQL scripts using Python to populate the new database with sample data;
  • Configure pgAdmin and Python connections to RDS PostgreSQL instances;

AWS Systems Manager Parameter Store

With AWS, it is typical to use services like AWS Systems Manager Parameter Store and AWS Secrets Manager to store overt, sensitive, and secret configuration values. These values are utilized by your code, or from AWS services like CloudFormation. Parameter Store allows us to follow the proper twelve-factor, cloud-native practice of separating configuration from code.

To demonstrate the use of Parameter Store, we will place a few of our CloudFormation configuration items into Parameter Store. The demonstration’s GitHub repository includes a shell script, parameter_store_values.sh, which will put the necessary parameters into Parameter Store.

Below, we see several of the demo’s configuration values, which have been put into Parameter Store.

screen_shot_2019-08-08_at_7_00_17_pm

SecureString

Whereas our other parameters are stored in Parameter Store as String datatypes, the database’s master user password is stored as a SecureString data-type. Parameter Store uses an AWS Key Management Service (KMS) customer master key (CMK) to encrypt the SecureString parameter value.

screen_shot_2019-08-08_at_7_01_15_pm

SMS Text Alert Option

Before running the Parameter Store script, you will need to change the /rds_demo/alert_phone parameter value in the script (shown below) to your mobile device number, including country code, such as ‘+12038675309’. Amazon SNS will use it to send SMS messages, using Amazon RDS Event Notification. If you don’t want to use this messaging feature, simply ignore this parameter and do not execute the event.template CloudFormation template in the proceeding step.

aws ssm put-parameter \
  --name /rds_demo/alert_phone \
  --type String \
  --value "your_phone_number_here" \
  --description "RDS alert SMS phone number" \
  --overwrite

Run the following command to execute the shell script, parameter_store_values.sh, which will put the necessary parameters into Parameter Store.

sh ./parameter_store_values.sh

CloudFormation Templates

The GitHub repository includes two CloudFormation templates, cfn-templates/event.template and cfn-templates/rds.template. This event template contains two resources, which are an AWS SNS Topic and an AWS RDS Event Subscription. The RDS template also includes several resources, including a VPC, Internet Gateway, VPC security group, two public subnets, one RDS master database instance, and an AWS RDS Read Replica database instance.

The resources are split into two CloudFormation templates so we can create the notification resources, first, independently of creating or deleting the RDS instances. This will ensure we get all our SMS alerts about both the creation and deletion of the databases.

Template Parameters

The two CloudFormation templates contain a total of approximately fifteen parameters. For most, you can use the default values I have set or chose to override them. Four of the parameters will be fulfilled from Parameter Store. Of these, the master database password is treated slightly differently because it is secure (encrypted in Parameter Store). Below is a snippet of the template showing both types of parameters. The last two are fulfilled from Parameter Store.

DBInstanceClass:
  Type: String
  Default: "db.t3.small"
DBStorageType:
  Type: String
  Default: "gp2"
DBUser:
  Type: String
  Default: "{{resolve:ssm:/rds_demo/master_username:1}}"
DBPassword:
  Type: String
  Default: "{{resolve:ssm-secure:/rds_demo/master_password:1}}"
  NoEcho: True

Choosing the default CloudFormation parameter values will result in two minimally-configured RDS instances running the PostgreSQL 11.4 database engine on a db.t3.small instance with 10 GiB of General Purpose (SSD) storage. The db.t3 DB instance is part of the latest generation burstable performance instance class. The master instance is not configured for Multi-AZ high availability. However, the master and read replica each run in a different Availability Zone (AZ) within the same AWS Region.

Parameter Versioning

When placing parameters into Parameter Store, subsequent updates to a parameter result in the version number of that parameter being incremented. Note in the examples above, the version of the parameter is required by CloudFormation, here, ‘1’. If you chose to update a value in Parameter Store, thus incrementing the parameter’s version, you will also need to update the corresponding version number in the CloudFormation template’s parameter.

{
    "Parameter": {
        "Name": "/rds_demo/rds_username",
        "Type": "String",
        "Value": "masteruser",
        "Version": 1,
        "LastModifiedDate": 1564962073.774,
        "ARN": "arn:aws:ssm:us-east-1:1234567890:parameter/rds_demo/rds_username"
    }
}

Validating Templates

Although I have tested both templates, I suggest validating the templates yourself, as you usually would for any CloudFormation template you are creating. You can use the AWS CLI CloudFormation validate-template CLI command to validate the template. Alternately, or I suggest additionally, you can use CloudFormation Lintercfn-lint command.

aws cloudformation validate-template \
  --template-body file://cfn-templates/rds.template

cfn-lint -t cfn-templates/cfn-templates/rds.template

Create the Stacks

To execute the first CloudFormation template and create a CloudFormation Stack containing the two event notification resources, run the following create-stack CLI command.

aws cloudformation create-stack \
  --template-body file://cfn-templates/event.template \
  --stack-name RDSEventDemoStack

The first stack only takes less than one minute to create. Using the AWS CloudFormation Console, make sure the first stack completes successfully before creating the second stack with the command, below.

aws cloudformation create-stack \
  --template-body file://cfn-templates/rds.template \
  --stack-name RDSDemoStack

screen_shot_2019-08-04_at_10_35_20_pm

Wait for my Text

In my tests, the CloudFormation RDS stack takes an average of 25–30 minutes to create and 15–20 minutes to delete, which can seem like an eternity. You could use the AWS CloudFormation console (shown below) or continue to use the CLI to follow the progress of the RDS stack creation.

screen_shot_2019-08-08_at_7_39_45_pm.png

However, if you recall, the CloudFormation event template creates an AWS RDS Event Subscription. This resource will notify us when the databases are ready by sending text messages to our mobile device.

screen_shot_2019-08-04_at_11_06_31_pm

In the CloudFormation events template, the RDS Event Subscription is configured to generate Amazon Simple Notification Service (SNS) notifications for several specific event types, including RDS instance creation and deletion.

  MyEventSubscription:
    Properties:
      Enabled: true
      EventCategories:
        - availability
        - configuration change
        - creation
        - deletion
        - failover
        - failure
        - recovery
      SnsTopicArn:
        Ref: MyDBSNSTopic
      SourceType: db-instance
    Type: AWS::RDS::EventSubscription

Amazon SNS will send SMS messages to the mobile number you placed into Parameter Store. Below, we see messages generated during the creation of the two instances, displayed on an Apple iPhone.

img-2839

Amazon RDS Dashboard

Once the RDS CloudFormation stack has successfully been built, the easiest way to view the results is using the Amazon RDS Dashboard, as shown below. Here we see both the master and read replica instances have been created and are available for our use.

screen_shot_2019-08-08_at_7_05_24_pm

The RDS dashboard offers CloudWatch monitoring of each RDS instance.

screen_shot_2019-08-08_at_7_06_17_pm

The RDS dashboard also provides detailed configuration information about each RDS instance.

screen_shot_2019-08-08_at_7_06_26_pm

The RDS dashboard’s Connection & security tab is where we can obtain connection information about our RDS instances, including the RDS instance’s endpoints. Endpoints information will be required in the next part of the demonstration.

screen_shot_2019-08-08_at_7_06_01_pm

Sample Data

Now that we have our PostgreSQL database instance and read replica successfully provisioned and configured on AWS, with an empty database, we need some test data. There are several sources of sample PostgreSQL databases available on the internet to explore. We will use the Pagila sample movie rental database by pgFoundry. Although the database is several years old, it provides a relatively complex relational schema (table relationships shown below) and plenty of sample data to query, about 100 database objects and 46K rows of data.

pagila_tablespng

In the GitHub repository, I have included the two Pagila database SQL scripts required to install the sample database’s data structures (DDL), sql-scripts/pagila-schema.sql, and the data itself (DML), sql-scripts/pagila-insert-data.sql.

To execute the Pagila SQL scripts and install the sample data, I have included a Python script. If you do not want to use Python, you can skip to the Adminer section of this post. Adminer also has the capability to import SQL scripts.

Before running any of the included Python scripts, you will need to install the required Python packages and configure the database.ini file.

Python Packages

To install the required Python packages using the supplied python-scripts/requirements.txt file, run the below commands.

cd python-scripts
pip3 install --upgrade -r requirements.txt

We are using two packages, psycopg2 and configparser, for the scripts. Psycopg is a PostgreSQL database adapter for Python. According to their website, Psycopg is the most popular PostgreSQL database adapter for the Python programming language. The configparser module allows us to read configuration from files similar to Microsoft Windows INI files. The unittest package is required for a set of unit tests includes the project, but not discussed as part of the demo.

screen_shot_2019-08-13_at_11_06_10_pm

Database Configuration

The python-scripts/database.ini file, read by configparser, provides the required connection information to our RDS master and read replica instance’s databases. Use the input parameters and output values from the CloudFormation RDS template, or the Amazon RDS Dashboard to obtain the required connection information, as shown in the example, below. Your host values will be unique for your master and read replica. The host values are the instance’s endpoint, listed in the RDS Dashboard’s Configuration tab.

[docker]
host=localhost
port=5432
database=pagila
user=masteruser
password=5up3r53cr3tPa55w0rd

[master]
host=demo-instance.dkfvbjrazxmd.us-east-1.rds.amazonaws.com
port=5432
database=pagila
user=masteruser
password=5up3r53cr3tPa55w0rd

[replica]
host=demo-replica.dkfvbjrazxmd.us-east-1.rds.amazonaws.com
port=5432
database=pagila
user=masteruser
password=5up3r53cr3tPa55w0rd

With the INI file configured, run the following command, which executes a supplied Python script, python-scripts/create_pagila_data.py, to create the data structure and insert sample data into the master RDS instance’s Pagila database. The database will be automatically replicated to the RDS read replica instance. From my local laptop, I found the Python script takes approximately 40 seconds to create all 100 database objects and insert 46K rows of movie rental data. That is compared to about 13 seconds locally, using a Docker-based instance of PostgreSQL.

python3 ./create_pagila_data.py --instance master

The Python script’s primary function, create_pagila_db(), reads and executes the two external SQL scripts.

def create_pagila_db():
    """
    Creates Pagila database by running DDL and DML scripts
    """

    try:
        global conn
        with conn:
            with conn.cursor() as curs:
                curs.execute(open("../sql-scripts/pagila-schema.sql", "r").read())
                curs.execute(open("../sql-scripts/pagila-insert-data.sql", "r").read())
                conn.commit()
                print('Pagila SQL scripts executed')
    except (psycopg2.OperationalError, psycopg2.DatabaseError, FileNotFoundError) as err:
        print(create_pagila_db.__name__, err)
        close_conn()
        exit(1)

If the Python script executes correctly, you should see output indicating there are now 28 tables in our master RDS instance’s database.

screen_shot_2019-08-08_at_7_13_11_pm

pgAdmin

pgAdmin is a favorite tool for interacting with and managing PostgreSQL databases. According to its website, pgAdmin is the most popular and feature-rich Open Source administration and development platform for PostgreSQL.

The project includes an optional Docker Swarm stack.yml file. The stack will create a set of three Docker containers, including a local copy of PostgreSQL 11.4, Adminer, and pgAdmin 4. Having a local copy of PostgreSQL, using the official Docker image, is helpful for development and trouble-shooting RDS issues.

screen_shot_2019-08-10_at_1_43_24_pm.png

Use the following commands to deploy the Swarm stack.

# create stack
docker swarm init
docker stack deploy -c stack.yml postgres

# get status of new containers
docker stack ps postgres --no-trunc
docker container ls

If you do not want to spin up the whole Docker Swarm stack, you could use the docker run command to create just a single pgAdmin Docker container. The pgAdmin 4 Docker image being used is the image recommended by pgAdmin.

docker pull dpage/pgadmin4

docker run -p 81:80 \
  -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
  -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
  -d dpage/pgadmin4

docker container ls | grep pgadmin4

Database Server Configuration

Once pgAdmin is up and running, we can configure the master and read replica database servers (RDS instances) using the connection string information from your database.ini file or from the Amazon RDS Dashboard. Below, I am configuring the master RDS instance (server).

screen_shot_2019-08-08_at_7_25_27_pm

With that task complete, below, we see the master RDS instance and the read replica, as well as my local Docker instance configured in pgAdmin (left side of screengrab). Note how the Pagila database has been replicated automatically, from the RDS master to the read replica instance.

screen_shot_2019-08-08_at_7_29_00_pm

Building SQL Queries

Switching to the Query tab, we can run regular SQL queries against any of the database instances. Below, I have run a simple SELECT query against the master RDS instance’s Pagila database, returning the complete list of movie titles, along with their genre and release date.

screen_shot_2019-08-08_at_7_27_58_pm

The pgAdmin Query tool even includes an Explain tab to view a graphical representation of the same query, very useful for optimization. Here we see the same query, showing an analysis of the execution order. A popup window displays information about the selected object.

screen_shot_2019-08-08_at_7_28_35_pm

Query the Read Replica

To demonstrate the use of the read replica, below I’ve run the same query against the RDS read replica’s copy of the Pagila database. Any schema and data changes against the master instance are replicated to the read replica(s).

screen_shot_2019-08-08_at_7_30_14_pm

Adminer

Adminer is another good general-purpose database management tool, similar to pgAdmin, but with a few different capabilities. According to its website, with Adminer, you get a tidy user interface, rich support for multiple databases, performance, and security, all from a single PHP file. Adminer is my preferred tool for database admin tasks. Amazingly, Adminer works with MySQL, MariaDB, PostgreSQL, SQLite, MS SQL, Oracle, SimpleDB, Elasticsearch, and MongoDB.

Below, we see the Pagila database’s tables and views displayed in Adminer, along with some useful statistical information about each database object.

screen_shot_2019-08-09_at_7_04_07_am

Similar to pgAdmin, we can also run queries, along with other common development and management tasks, from within the Adminer interface.

screen_shot_2019-08-09_at_7_05_16_am

Import Pagila with Adminer

Another great feature of Adminer is the ability to easily import and export data. As an alternative to Python, you could import the Pagila data using Adminer’s SQL file import function. Below, you see an example of importing the Pagila database objects into the Pagila database, using the file upload function.

screen_shot_2019-08-09_at_7_27_53_am.png

IDE

For writing my AWS infrastructure as code files and Python scripts, I prefer JetBrains PyCharm Professional Edition (v19.2). PyCharm, like all the JetBrains IDEs, has the ability to connect to and manage PostgreSQL database. You can write and run SQL queries, including the Pagila SQL import scripts. Microsoft Visual Studio Code is another excellent, free choice, available on multiple platforms.

screen_shot_2019-08-11_at_9_40_57_pm

Python and RDS

Although our IDE, pgAdmin, and Adminer are useful to build and test our queries, ultimately, we still need to connect to the Amazon RDS PostgreSQL instances and perform data manipulation from our application code. The GitHub repository includes a sample python script, python-scripts/query_postgres.py. This script uses the same Python packages and connection functions as our Pagila data creation script we ran earlier. This time we will perform the same SELECT query using Python as we did previously with pgAdmin and Adminer.

cd python-scripts
python3 ./query_postgres.py --instance master

With a successful database connection established, the scripts primary function, get_movies(return_count), performs the SELECT query. The function accepts an integer representing the desired number of movies to return from the SELECT query. A separate function within the script handles closing the database connection when the query is finished.

def get_movies(return_count=100):
    """
    Queries for all films, by genre and year
    """

    try:
        global conn
        with conn:
            with conn.cursor() as curs:
                curs.execute("""
                    SELECT title AS title, name AS genre, release_year AS released
                    FROM film f
                             JOIN film_category fc
                                  ON f.film_id = fc.film_id
                             JOIN category c
                                  ON fc.category_id = c.category_id
                    ORDER BY title
                    LIMIT %s;
                """, (return_count,))

                movies = []
                row = curs.fetchone()
                while row is not None:
                    movies.append(row)
                    row = curs.fetchone()

                return movies
    except (psycopg2.OperationalError, psycopg2.DatabaseError) as err:
        print(get_movies.__name__, err)
    finally:
        close_conn()


def main():
    set_connection('docker')
    for movie in get_movies(10):
        print('Movie: {0}, Genre: {1}, Released: {2}'
              .format(movie[0], movie[1], movie[2]))

Below, we see an example of the Python script’s formatted output, limited to only the first ten movies.

screen_shot_2019-08-13_at_10_51_47_pm.png

Using the Read Replica

For better application performance, it may be optimal to redirect some or all of the database reads to the read replica, while leaving writes, updates, and deletes to hit the master instance. The script can be easily modified to execute the same query against the read replica rather than the master RDS instance by merely passing the desired section, ‘replica’ versus ‘master’, in the call to the set_connection(section) function. The section parameter refers to one of the two sections in the database.ini file. The configparser module will handle retrieving the correct connection information.

set_connection('replica')

Cleaning Up

When you are finished with the demonstration, the easiest way to clean up all the AWS resources and stop getting billed is to delete the two CloudFormation stacks using the AWS CLI, in the following order.

aws cloudformation delete-stack \
  --stack-name RDSDemoStack

# wait until the above resources are completely deleted
aws cloudformation delete-stack \
  --stack-name RDSEventDemoStack

You should receive the following SMS notifications as the first CloudFormation stack is being deleted.

img-2841

You can delete the running Docker stack using the following command. Note, you will lose all your pgAdmin server connection information, along with your local Pagila database.

docker stack rm postgres

Conclusion

In this brief post, we just scraped the surface of the many benefits and capabilities of Amazon RDS for PostgreSQL. The best way to learn PostgreSQL and the benefits of Amazon RDS is by setting up your own RDS instance, insert some sample data, and start writing queries in your favorite database client or programming language.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , ,

Leave a comment

Building Asynchronous, Serverless Alexa Skills with AWS Lambda, DynamoDB, S3, and Node.js

Introduction

In the following post, we will use the new version 2 of the Alexa Skills Kit, AWS Lambda, Amazon DynamoDB, Amazon S3, and the latest LTS version Node.js, to create an Alexa Custom Skill. According to Amazon, a custom skill allows you to define the requests the skill can handle (intents) and the words users say to invoke those requests (utterances).

If you want to compare the development of an Alexa Custom Skill with those of Google and Azure, in addition to this post, please read my previous two posts in this series, Building and Integrating LUIS-enabled Chatbots with Slack, using Azure Bot Service, Bot Builder SDK, and Cosmos DB and Building Serverless Actions for Google Assistant with Google Cloud Functions, Cloud Datastore, and Cloud Storage. All three of the article’s demonstrations are written in Node.js, all three leverage their cloud platform’s machine learning-based Natural Language Understanding services, and all three take advantage of NoSQL database and storage services available on their respective cloud platforms.

AWS Technologies

The final high-level architecture of our Alexa Custom Skill will look as follows.

Alexa Skill Final Architecture v2.png

Here is a brief overview of the key AWS technologies we will incorporate into our Skill’s architecture.

Alexa Skills Kit

According to Amazon, the Alexa Skills Kit (ASK) is a collection of self-service APIs, tools, documentation, and code samples that makes it possible to add skills to Alexa. The Alexa Skills Kit supports building different types of skills. Currently, Alexa skill types include Custom, Smart Home, Video, Flash Briefing, and List Skills. Each skill type makes use of a different Alexa Skill API.

AWS Serverless Platform

To create a custom skill for Alexa, you currently have the choice of using an AWS Lambda function or a web service. The AWS Lambda is part of an ecosystem of Cloud services and Developer tools, Amazon refers to as the AWS Serverless Platform. The platform’s services are designed to support the development and hosting of highly-performant, enterprise-grade serverless applications.

In this post, we will leverage three of the AWS Serverless Platform’s services, including Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and AWS Lambda.

Node.js

AWS Lamba supports multiple programming languages, including Node.js (JavaScript), Python, Java (Java 8 compatible), and C# (.NET Core) and Go. All are excellent choices for writing modern serverless functions. For this post, we will use Node.js. According to Node.js Foundation, Node.js is an asynchronous event-driven JavaScript runtime built on Chrome’s V8 JavaScript engine.

In April 2018, AWS Lamba announced support for the Node.js 8.10 runtime, which is the current Long Term Support (LTS) version of Node.js. Node 8, also known as Project Carbon, was the first LTS version of Node to support async/await with Promises. Async/await is the new way of handling asynchronous operations in Node.js. We will make use of async/await and Promises with the custom skill.

Demonstration

To demonstrate Alexa Custom Skills we will build an informational skill that responds to the user with interesting facts about Azure¹, Microsoft’s Cloud computing platform (Alexa talking about Azure, ironic, I know). This is not an official Microsoft skill; it is only used for this demonstration and has not been published.

Source Code

All open-source code for this post can be found on GitHub. Code samples in this post are displayed as GitHub Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

Important, this post and the associated source code were updated from v1.0 to v2.0 on 13 August 2018. You should clone the GitHub project again, to correspond with this revised post, if you originally cloned the project before 14 August 2018. Code changes were significant.

Objectives

This objective of the fact-based skill will be to demonstrate the following.

  • Build, deploy, and test an Alexa Custom Skill using AWS Lambda and Node.js;
  • Use DynamoDB to store and retrieve Alexa voice responses;
  • Maintain a count of user’s questions in DynamoDB using atomic counters;
  • Use Amazon S3 to store and retrieve images, used in Display Cards;
  • Log Alexa Skill activities using Amazon CloudWatch Logs;

Steps to Build

Building the Azure fact skill will involve the following steps.

  • Design the Alexa skill’s voice interaction model;
  • Design the skill’s Display Cards for Alexa-enabled products, to enhance the voice experience;
  • Create the skill’s DynamoDB table and import the responses the skill will return;
  • Create an S3 bucket and upload the images used for the Display Cards;
  • Write the Alexa Skill, which involves mapping the user’s spoken input to the intents your cloud-based service can handle;
  • Write the Lambda function, which involves responding to the user’s utterances, by building and returning appropriate voice and display card responses, from DynamoDB and S3;
  • Extend the default ASK-generated AWS IAM Role, to allow the Lambda to update DynamoDB;
  • Deploy the skill;
  • Test the skill;

Let’s explore each step in detail.

Voice Interaction Model

First, we must design the fact skill’s voice interaction model. We need to consider the way we want the user to interact with the skill. What is the user’s conversational journey? How do they invoke your skill? How will the user provide intent?

This skill will require two intent slot values, the fact the user is interested in (i.e. ‘global infrastructure’) and the user’s first name (i.e. ‘Susan’). We will train the skill to allow Alexa to query the user for each slot value, but also allow the user to provide either or both values in the initial intent invocation. We will also allow the user to request a random fact.

Shown below in the Alexa Skills Kit Development Console Test tab are three examples of interactions the skill is trained to understand and handle:

  1. The first example on the left invokes the skill with no intent (‘Alexa, load Azure Tech Facts). The user is led through a series of three questions to obtain the full intent.
  2. The center example is similar, however, the initial invocation contains a partial intent (‘Alexa, ask Azure Tech Facts for a fact about certifications’). Alexa must still ask for the user’s name.
  3. Lastly, the example on the right is a so-called ‘one-shot’ invocation (‘Alexa, ask Azure Tech Facts about Azure’s platforms for Gary’). The user’s invocation of the skill contains a complete intent, allowing Alexa to respond immediately with a fact about Azure platforms.

alexa-skill-post-020

In all cases, our skill has the ability to continue to provide the user with additional facts if they chose, or they may cancel at any time.

We also need to design how Alexa will respond. What is the persona will Alexa assume through her words, phrases, and use of Speech Synthesis Markup Language (SSML).

User Interaction Previews

Here are a few examples of interactions with the final Alexa skill using an iPhone 8 and the Alexa App. They are intended to show the rich conversational capabilities of custom skills more so the than the display, which is pretty poor on the Alexa App as compared to the Echo Show or even Echo Spot.

Example 1: Indirect Invocation

The first example shows a basic interaction with our Alexa skill. It demonstrates an indirect invocation, a user utterance without initial intent. It also illustrates several variations of user utterances (YouTube).

Example 2: Direct Invocation

The second example of an interaction our skill demonstrates a direct invocation, in which the initial user utterance contains intent. It also demonstrates the user following up with additional requests (YouTube).

Example 3: Direct Invocation, Help, Problem

Lastly, another direct invocation demonstrates the use of the Help Intent. You also see an example of when Alexa does not understand the user’s utterance.  The user is able to repeat their request, more clearly (YouTube).

Visual Interaction Model

Many Alexa-enabled devices are capable of both vocal and visual responses. Designing for a multimodal user experience is important. The instructional skill will provide vocal responses, as well as Display Cards optimized for the Amazon Echo Show. The skill contains a basic design for the Display Card shown during the initial invocation, where there is no intent uttered by the user.

alexa-skill-post-021

The fact skill also contains a Display Card, designed to present the final Alexa response to the user’s intent. The content of the vocal and visual response is returned from DynamoDB via the Lambda function. The random Azure icons, available from Microsoft, are hosted in an S3 bucket. Each fact response is unique, as well as the icon associated with the fact.

alexa-skill-post-022

The Display Cards will also work on other Alexa-enabled screen-based products. Shown below is the same card on an iPhone 8 using the Amazon Alexa app. This is the same app shown in the videos, above.

alexa-skill-post-027

DynamoDB

Next, we create the DynamoDB table used to store the facts the Alexa skill will respond with when invoked by the user. DynamoDB is Amazon’s non-relational database that delivers reliable performance at any scale. DynamoDB consists of three basic components: tables, items, and attributes.

There are numerous ways to create a DynamoDB table. For simplicity, I created the AzureFacts DynamoDB table using the AWS CLI (gist). You could also choose CloudFormation, or create the table using any of nine or more programming languages with an AWS SDK.

aws dynamodb create-table \
--table-name AzureFacts \
--attribute-definitions \
AttributeName=Fact,AttributeType=S \
--key-schema AttributeName=Fact,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=3,WriteCapacityUnits=3

The AzureFacts table’s schema has four key/value pair attributes per item: Fact, Response, Image, and Hits. The Fact attribute, a string, contains the name of the fact the user is seeking. The Fact attribute also serves as the table’s unique partition key. The Response attribute, a string, contains the conversational response Alexa will return. The Image attribute, a string, contains the name of the image in the S3 bucket displayed by Alexa. Lastly, the Hits attribute, a number, stores the number of user requests for a particular fact.

Importing Table Items

After the DynamoDB table is created, the pre-defined facts are imported into the empty table using AWS CLI (gist). The JSON-formatted data file, AzureFacts.json, is included with the source code on GitHub.

aws dynamodb batch-write-item \
--request-items file://data/AzureFacts.json

The resulting table should appear as follows in the AWS Management Console.

alexa-skill-post-004

Note the imported items shown below. The Hits counts reflect the number of times each fact has been requested.

alexa-skill-post-005

Shown below is a detailed view of a single item that was imported into the DynamoDB table.

alexa-skill-post-006

Amazon S3 Image Bucket

Next, we create the Amazon S3 bucket, which will house the images, actually Azure icons as PNGs, returned by Alexa with each fact. Again, I used the AWS CLI for simplicity (gist).

aws s3api create-bucket \
--bucket <my_bucket_name> \
--region us-east-1

The images can be uploaded manually to the bucket through a web browser, or programmatically, using the AWS CLI or SDKs. You will need to ensure the images are made public so they can be displayed by Alexa.

alexa-skill-post-007

Alexa Skill

Next, we create the actual Alexa custom skill. I have used version 2 of the Alexa Skills Kit (ASK) Software Development Kit (SDK) for Node.js and the new ASK Command Line Interface (ASK CLI) to create the skill. The ASK SDK v2 for Node.js was recently released in April 2018. If you have previously written Alexa skills using version 1 of the Node.js SDK, the creation of a new project and the format of the Lambda Node.js code is somewhat different. I strongly suggest reviewing the example skills provided by Amazon on GitHub.

With version 1, I would have likely used the Alexa Skills Kit Development Console to develop and deploy the skill, and separate IDE, like JetBrains WebStorm, to write the Lambda. The JSON-format skill would live in the Alexa Skills Kit Development Console, and my Lambda in source control. I would have used AWS Serverless Application Model (AWS SAM) or Claudia.js to handle the deployment of Lambda functions.

With version 2 of ASK, you can easily create and manage the Alexa skill’s JSON-formatted code, as well as the Lambda, all from the command-line and a single IDE or text editor. All components that comprise the skill can be kept together in source control. I now only use the Alexa Skills Kit Development Console to preview my deployed skill and for testing. I am not going to go into detail about creating a new project using the ASK CLI, I suggest reviewing Amazon’s instructional guides.

Below, I have initiated a new AWS profile for the Alexa skill using the ask init command.

alexa-skill-post-008

There are three main parts to the new skill project created by the ASK CLI: the skill’s manifest (skill.json), model(s) (en-US.json), and API endpoint, the Lambda (index.js). The skill’s manifest, skill.json, contains information (metadata) about the skill. This is the same information you find in the Distribution tab of the Alexa Skills Kit Development Console. The manifest includes publishing information, example phrases to invoke the skill, the skill’s category, distribution locales, privacy information, and the location of the skill’s API endpoint, the Lambda. An end-user would most commonly see this information in Amazon Alexa app when adding skills to their Alexa-enabled devices.

alexa-skill-post-026

Next, the skill’s model, en-US.json, is located the models sub-directory. This file defines the skill’s custom interaction model, it contains the skill’s interaction model written in JSON, which includes the invocation name, intents, standard and custom slots, sample utterances, slot values, and synonyms of those values. This is the same information you would find in the Build tab of the Alexa Skills Kit Development Console. Amazon has an excellent guide to creating your custom skill’s interaction model.

Intents and Intent Slots

The skill’s custom interaction model contains the AzureFactsIntent intent, along with the boilerplate Cancel, Help and Stop intents. The AzureFactsIntent intent contains two intent slots, myName and myQuestion. The myName intent slot is a standard AMAZON.US_FIRST_NAME slot type. According to Amazon, this slot type understands thousands of popular first names commonly used by speakers in the United States. Shown below, I have included a short list of sample utterances in the intent model, which helps improve voice recognition for Alexa (gist).

{
"name": "AzureFactsIntent",
"slots": [{
"name": "myName",
"type": "AMAZON.US_FIRST_NAME",
"samples": [
"{myName}",
"my name is {myName}",
"my name's {myName}",
"name is {myName}",
"the name is {myName}",
"name's {myName}",
"they call me {myName}"
]
}]
}
view raw myNameIntent.json hosted with ❤ by GitHub

Custom Slot Types and Entities

The myQuestion intent slot is a custom slot type. According to Amazon, a custom slot type defines a list of representative values for the slot. The myQuestion slot contains all the available facts the custom instructional skill understands and can retrieve from DynamoDB. Like myName, the user can provide the fact intent in various ways (gist).

{
"name": "myQuestion",
"type": "list_of_facts",
"samples": [
"{myQuestion}",
"give me a fact about {myQuestion}",
"give me a {myQuestion} fact",
"how many {myQuestion} does Azure have",
"I'd like to hear about {myQuestion}",
"I'd like to hear more about {myQuestion}",
"tell me about {myQuestion}",
"tell me about Azure's {myQuestion}",
"tell me about Azure {myQuestion}",
"tell me a {myQuestion} fact",
"tell me another {myQuestion} fact",
"when was Azure {myQuestion}"
]
}

This slot also contains synonyms for each fact. Collectively, the slot value, it’s synonyms, and the optional ID are collectively referred to as an Entity. According to Amazon, entity resolution improves the way Alexa matches possible slot values in a user’s utterance with the slots defined in the skill’s interaction model.

An example of an entity in the myQuestion custom slot type is ‘competition’. A user can ask Alexa to tell them about Azure’s competition. The slot value ‘competition’ returns a fact about Azure’s leading competitors, as reported on the G2 Crowd website’s Microsoft Azure Alternatives & Competitors page. However, the user might also substitute the words ‘competitor’ or ‘competitors’ for ‘competition’. Using synonyms, if the user utters any of these three words in their intent, they will receive the same response from Alexa (gist).

"types": [{
"name": "list_of_facts",
"values": [{
"name": {
"value": "competition",
"synonyms": [
"competitors",
"competitor"
]
}
},
{
"name": {
"value": "certifications",
"synonyms": [
"certification",
"certification exam",
"certification exams"
]
}
}
]
}]
view raw list-of-facts.json hosted with ❤ by GitHub

Lambda

Initializing a skill with the ASK CLI also creates the default API endpoint, a Lambda (index.js). The serverless Lambda function is written in Node.js 8.10. As mentioned in the Introduction, AWS recently announced support for the Node.js 8.10 runtime, in April. This is the first LTS version of Node to support async/await with Promises. Node’s async/await is the new way of handling asynchronous operations in Node.js.

The layout of the custom skill’s Lambda’s code closely follows the custom Alexa Fact Skill example. I suggest closely reviewing this example. The Lambda has four main sections: constants, setup code, intent handlers, and helper functions.

In addition to the boilerplate Help, Stop, Error, and Session intent handlers, there are the LaunchRequestHandler and the AzureFactsIntent handlers. According to Amazon, a LaunchRequestHandler fires when the Lambda receives a LaunchRequest from Alexa, in which the user invokes the skill with the invocation name, but does not provide any command mapping to an intent.

The AzureFactsIntent aligns with the custom intent we defined in the skill’s model (en-US.json), of the same name. This handler handles an IntentRequest from Alexa. This handler and the buildFactResponse function the handler calls are what translate a request for a fact from the user into a request to DynamoDB for a response.

The AzureFactsIntent handler checks the IntentRequest for both the myName and myQuestion slot values. If the values are unfulfilled, the AzureFactsIntent handler delegates responsibility back to Alexa, using a Dialog delegate directive (addDelegateDirective). Alexa then requests the slot values from the user in a conversational interaction. Alexa then calls the AzureFactsIntent handler again (gist).

const request = handlerInput.requestEnvelope.request;
let currentIntent = request.intent;
if (myNameValue === undefined) {
myNameValue = slotValue(request.intent.slots.myName);
}
if (!myNameValue) {
return handlerInput.responseBuilder
.addDelegateDirective(currentIntent)
.getResponse();
}
let myQuestionValue = slotValue(request.intent.slots.myQuestion);
if (!myQuestionValue) {
return handlerInput.responseBuilder
.addDelegateDirective(currentIntent)
.getResponse();
}

Once both slot values are received by the AzureFactsIntent handler, it calls the buildFactResponse function, passing in the myName and myQuestion slot values. In turn, the buildFactResponse function calls AWS.DynamoDB.DocumentClient.update. The DynamoDB update returns a callback. In turn, the buildFactResponse function returns a Promise, a standard built-in object type, part of the JavaScript ES2015 spec (gist).

function buildFactResponse(myName, myQuestion) {
return new Promise((resolve, reject) => {
if (myQuestion !== undefined) {
let params = {};
params.TableName = "AzureFacts";
params.Key = {"Fact": myQuestion};
params.UpdateExpression = "set Hits = Hits + :val";
params.ExpressionAttributeValues = {":val": 1};
params.ReturnValues = "ALL_NEW";
docClient.update(params, function (err, data) {
if (err) {
console.log("GetItem threw an error:", JSON.stringify(err, null, 2));
reject(err);
} else {
console.log("GetItem succeeded:", JSON.stringify(data, null, 2));
resolve(data);
}
});
}
});
}
view raw buildFactResponse.js hosted with ❤ by GitHub

What is unique about the DynamoDB update call in this case, is it actually performs two functions. First, it implements an Atomic Counter. According to AWS, an atomic counter is a numeric DynamoDB attribute that is incremented, unconditionally, without interfering with other write requests. The update increments the numeric Hits attribute of the requested fact by exactly one. Secondly, the update returns the DynamoDB item. We can increment the count and get the response in a single call.

The buildFactResponse function’s Promise returns the DynamoDB item, a JSON object, from the callback. An example of a JSON response payload is shown below. (gist).

"Attributes": {
"Hits": 4,
"Fact": "global",
"Image": "image-02.png",
"Response": "according to Microsoft, with 54 Azure regions, Azure has more global regions than any other cloud provider. Azure is currently available in 140 countries."
}

The AzureFactsIntent handler uses the async/await methods to perform the call to the buildFactResponse function. Note line 7 of the AzureFactsIntent handler below, where the async method is applied directly to the handler. Note line 33 where the await method is used with the call to the buildFactResponse function (gist).

const AzureFactsIntent = {
canHandle(handlerInput) {
const request = handlerInput.requestEnvelope.request;
return request.type === "IntentRequest"
&& request.intent.name === "AzureFactsIntent";
},
async handle(handlerInput) {
const request = handlerInput.requestEnvelope.request;
let currentIntent = request.intent;
if (myNameValue === undefined) {
myNameValue = slotValue(request.intent.slots.myName);
}
if (!myNameValue) {
return handlerInput.responseBuilder
.addDelegateDirective(currentIntent)
.getResponse();
}
let myQuestionValue = slotValue(request.intent.slots.myQuestion);
if (!myQuestionValue) {
return handlerInput.responseBuilder
.addDelegateDirective(currentIntent)
.getResponse();
}
if (myQuestionValue.toString().trim() === 'random') {
myQuestionValue = selectRandomFact();
}
let fact = await buildFactResponse(myNameValue, myQuestionValue);
myNameValue = Object.is(myNameValue, undefined) ? undefined : capitalizeFirstLetter(myNameValue);
let factToSpeak = `${myNameValue}, ${fact.Attributes.Response}`;
cardContent = factToSpeak;
// optional: logged to CloudWatch Logs
console.log(`myName: ${myNameValue}`);
console.log(`myQuestion: ${myQuestionValue}`);
console.log(`factToSpeak: ${factToSpeak}`);
return handlerInput
.responseBuilder
.speak(factToSpeak)
.reprompt("You can request another fact")
.withStandardCard(CARD_TITLE, cardContent,
IMAGES.smallImageUrl, `${BUCKET_URL}\/${fact.Attributes.Image}`)
.getResponse();
}
};
view raw AzureFactsIntent.js hosted with ❤ by GitHub

The AzureFactsIntent handler awaits the Promise from the buildFactResponse function. In an async function, you can await for any Promise or catch its rejection cause. If the update callback and the ensuing Promise were both returned successfully, the AzureFactsIntent handler returns both a vocal and visual response to Alexa.

AWS IAM Role

By default, an AWS IAM Role was created by ASK when the project was initialized, the ask-lambda-alexa-skill-azure-facts role. This role is automatically associated with the AWS Managed Policy, AWSLambdaBasicExecutionRole. This managed policy simply allows the skill’s Lambda function to create Amazon CloudWatch Events (gist).

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}

For the skill’s Lambda to read and write to DynamoDB, we must extend the default role’s permissions, by adding an additional policy. I have created a new AzureFacts_Alexa_Skill IAM Policy, which allows the associated role to get and update items from the AzureFacts DynamoDB table, and that is it. The role only has access to two of forty possible DynamoDB actions, and only for the AzureFacts table, and nothing else. Following the principle of Least Privilege is a cornerstone of AWS Security (gist).

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:UpdateItem"
],
"Resource": "arn:aws:dynamodb:us-east-1:931066906971:table/AzureFacts"
}
]
}

Below, we see the new IAM Policy in the AWS Management Console.

alexa-skill-post-011

Below, we see the policy being applied to the skill’s IAM Role, along with the original AWS managed policy.

alexa-skill-post-012

Deploying the Skill

Version 2 of the ASK CLI makes deploying the Alexa custom skill very easy. Using the ASK CLI’s deploy command, we can validate and deploy the skill (manifest),  model, and Lambda, all at once, as shown below. This makes DevOps automation of skill deployments with tools like Jenkins or AWS CodeDeploy straight-forward.

alexa-skill-post-009

You can verify the skill has been deployed, from the Alexa Skills Kit Development Console. You should observe the skill’s model (intents, slots, entities, and endpoints) in the Build tab. You should observe the skill’s publishing details in the Distribution tab. Note deploying the skill does not submit the skill to Amazon’s for review and publishing, you must still submit the skill separately.

alexa-skill-post-013

From the AWS Lambda Management Console, you should observe the skill’s Lambda was deployed. You should observe only the skill can trigger the Lambda. Lastly, you should observe that the correct IAM Role was applied to the Lambda, giving the Lambda access to Amazon CloudWatch Logs and Amazon DynamoDB.

alexa-skill-post-010

Testing the Skill

The ASK CLI comes with the simulate command. According to Amazon, the simulate command simulates an invocation of the skill with text-based input. Again, the ASK CLI makes DevOps test automation with tools like Jenkins or AWS CodeDeploy pretty easy (gist).

ask simulate \
--text "Load Azure Tech Facts" \
--locale "en-US" \
--skill-id "<your_skill_id>" \
--profile "default"
view raw ask-simulate.sh hosted with ❤ by GitHub

Below, are the results of simulating the invocation. The simulate command returns the expected verbal response, including any SSML, and the visual responses (the Display Card). You could easily write an automation script to run a battery of these tests on every code commit, and prior to deployment.

alexa-skill-post-024

I also like to manually test my skills from the Alexa Skills Kit Development Console Test tab. You may invoke the skill using your voice or by typing the skill invocation.

alexa-skill-post-014

The Alexa Skills Kit Development Console Test tab both shows and speaks Alexa’s response. The console also displays the request and response body (JSON input/output), as well as the Display Card for an Echo Show and Echo Spot.

alexa-skill-post-015

Lastly, the Alexa Skills Kit Development Console Test tab displays the Device Log. The log captures Alexa Directives and Events. I have found the Device Log to be very helpful in troubleshooting problems with deployed skills.

alexa-skill-post-025.png

CloudWatch Logs

By default the custom skill outputs events to CloudWatch Logs. I have added the DynamoDB callback payload, as well as the slot values of myName and myQuestion to the logs, for each successful Alexa response. CloudWatch logs, like the Device Logs above, are very helpful in troubleshooting problems with deployed skills.

alexa-skill-post-016

Conclusion

In this brief post, we have seen how to use the new ASK SDK/CLI version 2, services from the AWS Serverless Platform, and the LTS version of Node.js, to create an Alexa Custom Skill. Using the AWS Serverless Platform, we could easily extend the example to take advantage of additional serverless services, such as the use of Amazon SNS and SQS for notifications and messaging and Amazon Kinesis for analytics.

In a future post, we will extend this example, adding the capability to securely add and update our DynamoDB table’s items. We will use addition AWS services, including Amazon Cognito to authorize access to our API. We will also use AWS API Gateway to integrate with our Lambdas, producing a completely serverless API.

¹Azure is a trademark of Microsoft

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , ,

4 Comments