Lakehouse Data Modeling using dbt, Amazon Redshift, Redshift Spectrum, and AWS Glue
Learn how dbt makes it easy to transform data and materialize models in a modern cloud data lakehouse built on AWS
Introduction
Data lakes have grabbed much of the analytics community’s attention in recent years, thanks to an overabundance of VC-backed analytics startups and marketing dollars. Nonetheless, data warehouses, specifically modern cloud data warehouses, continue to gain market share, led by Snowflake, Amazon Redshift, Google Cloud BigQuery, and Microsoft’s Azure Synapse Analytics.
Several factors have fostered the renewed interest and appeal of data warehouses, including the data lakehouse architecture. According to Databricks, “a lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse directly on top of low-cost cloud storage in open formats.” Similarly, Snowflake describes a lakehouse as “a data solution concept that combines elements of the data warehouse with those of the data lake. Data lakehouses implement data warehouses’ data structures and management features for data lakes, which are typically more cost-effective for data storage.”
dbt
In the following post, we will explore the use of dbt (data build tool), developed by dbt Labs, to transform data in an AWS-based data lakehouse, built with Amazon Redshift, Redshift Spectrum, AWS Glue, and Amazon S3. According to dbt Labs, “dbt enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.” Further, “dbt does the T in ELT (Extract, Load, Transform) processes — it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.”

Amazon Redshift
According to AWS, “Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.” AWS claims Amazon Redshift is the most widely used cloud data warehouse.
Amazon Redshift Spectrum
According to AWS, “Redshift Spectrum allows you to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.” Redshift Spectrum tables define the data structure for the files in Amazon S3. The external tables exist in an external data catalog, which can be AWS Glue, the data catalog that comes with Amazon Athena, or an Apache Hive metastore.
dbt can interact with Amazon Redshift Spectrum to create external tables, refresh external table partitions, and access raw data in an Amazon S3-based data lake from the data warehouse. We will use dbt along with the dbt package, dbt_external_tables, to create the external tables in an AWS Glue data catalog.
Prerequisites
Prerequisites to follow along with this post’s demonstration include:
- Amazon S3 bucket to store raw data;
- Amazon Redshift or Amazon Redshift Serverless cluster;
- AWS IAM Role with permissions to Amazon Redshift, Amazon S3, and AWS Glue;
- dbt Cloud account;
- dbt CLI (dbt Core) and dbt Amazon Redshift adapter installed locally;
- Microsoft Visual Studio Code (VS Code) with dbt extensions installed.
The post’s demonstration uses dbt Cloud, VS Code, and the dbt CLI interchangeably with the project’s GitHub repository as a source. Follow along with the demonstration using any or all of these three dbt options.

Cost Warning!
Be careful when creating a new, provisioned Amazon Redshift cluster for this demonstration. The suggested default Production cluster with two ra3.4xlarge on-demand compute nodes and AQUA (Redshift’s Advanced Query Accelerator) enabled is estimated at $4,694/month ($3.26/node/hour). For this demonstration, choose the minimum size provisioned Redshift cluster configuration of one dc2.large on-demand compute node, estimated to cost $180/month ($0.25/node/hour). Be sure to delete the cluster when the demonstration is complete.

Amazon Redshift Serverless Option
AWS recently announced the general availability (GA) of Amazon Redshift Serverless on July 12, 2022. Amazon Redshift Serverless allows data analysts, developers, and data scientists to run and scale analytics without having to provision and manage data warehouse clusters. dbt is fully compatible with Amazon Redshift Serverless and is an alternative to provisioned Redshift for this demonstration. According to AWS, Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in Amazon S3.
Source Code
All the source code demonstrated in this post is open source and available on GitHub.
Sample Data
This demonstration uses the TICKIT sample database provided by AWS and designed for use with Amazon Redshift. This sample database application tracks sales activity for the fictional online TICKIT website, where users buy and sell tickets for sporting events, shows, and concerts. The database consists of seven tables in a star schema: five dimension tables and two fact tables. A clean copy of the raw TICKIT data, formatted as pipe-delimited text files, is included in this GitHub project. Copy the raw data files from the project to your Amazon S3 bucket, for example, with the aws s3 cp or aws s3 sync CLI commands.
Prepare Amazon Redshift for dbt
Create New Database
Create a new Redshift database named demo to use for the demonstration.
Create Database Schemas
Within the new Redshift database, demo, create the external schema, tickit_external, and the corresponding external AWS Glue Data Catalog, tickit_dbt, using the CREATE EXTERNAL SCHEMA Redshift SQL command. Make sure to update the command to reflect your IAM Role’s ARN. Next, create the schema that will hold our dbt models, tickit_dbt. Lastly, as a security best practice, drop the default public schema.
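Although the project’s GitHub repository contains the exact SQL, a minimal sketch of these statements might look like the following (the IAM Role ARN is a placeholder):

-- Illustrative sketch; replace the IAM Role ARN with your own
CREATE EXTERNAL SCHEMA tickit_external
FROM DATA CATALOG
DATABASE 'tickit_dbt'
IAM_ROLE 'arn:aws:iam::123456789012:role/ClusterPermissionsRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Schema that will hold the dbt models
CREATE SCHEMA tickit_dbt;

-- Security best practice: drop the default public schema
DROP SCHEMA public;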
From the AWS Glue console, we should observe a new tickit_dbt AWS Glue Data Catalog. The description shown below was manually added after the catalog was created.

Create dbt Database User and Group
As a security best practice, create a separate database dbt user and dbt group. We are assigning a completely arbitrary connection limit of ten. Then, apply the grants that allow the dbt group access to the new database and schemas. Lastly, change the owners of the two schemas to the dbt user.
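A hedged sketch of these statements, assuming the demo database and the two schemas created above (the password is only a placeholder, and the project’s SQL includes additional grants):

-- Illustrative sketch; choose your own secure password
CREATE USER dbt WITH PASSWORD 'Ch4ngeMe2024' CONNECTION LIMIT 10;
CREATE GROUP dbt WITH USER dbt;

-- Allow the dbt group to use both schemas and create objects in the dbt schema
GRANT USAGE ON SCHEMA tickit_external TO GROUP dbt;
GRANT USAGE, CREATE ON SCHEMA tickit_dbt TO GROUP dbt;

-- Transfer ownership of the two schemas to the dbt user
ALTER SCHEMA tickit_external OWNER TO dbt;
ALTER SCHEMA tickit_dbt OWNER TO dbt;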
Alternatively, we could use an IAM Role with a SAML 2.0-compliant IdP.
Initialize and Configure dbt for Redshift
Next, configure your dbt Cloud account and dbt locally with your Amazon Redshift connection information using the dbt init command. On a Mac, this configuration is stored in the /Users/<your_username>/.dbt/profiles.yml file. You will need your Redshift cluster host URL, port, database, username, and password. With your local installation of dbt, use the dbt debug command to confirm the new configuration.

Project Structure
The GitHub project structure follows many of the best practices outlined in dbt Labs’ Best Practice Guide. Data models in the models directory are organized into the recommended staging, intermediate, and marts subdirectories (aka layers).

From a data lineage perspective, in this project, the staging layer’s data models depend on the external tables (AWS Glue/Amazon Redshift Spectrum). The intermediate layer’s data models depend on the staging models. The marts layer’s data models depend on staging and intermediate models.

Install dbt Packages
The GitHub project’s packages.yml contains a few commonly recommended packages. The only one required for this post is the dbt-labs/dbt_external_tables package. Make sure your project is referring to the latest version of the package.
Use the dbt deps command to install the packages locally.

External Tables
The _tickit__sources.yml file in the models/staging/tickit/external_tables/ subdirectory defines the schema and S3 location for each of the seven external TICKIT database tables: category, date, event, listing, sale, user, and venue. You will need to update this file to reflect the name of your Amazon S3 bucket, in seven places.
Execute the command, dbt run-operation stage_external_sources, to create the seven external tables in the AWS Glue Data Catalog. This command is part of the dbt_external_tables package we installed earlier. It iterates through all source nodes, creates the tables if missing, and refreshes metadata.

If we failed to run the previous SQL statements to set schema ownership to the dbt user, the following error will likely occur.

Once the command completes, we should observe seven new tables in the AWS Glue Data Catalog.

Examining one of the AWS Glue data catalog tables, we can observe how the configuration in the _tickit__sources.yml file was used to define the table’s properties and schema. Note the Location field indicates where the underlying data is located in our Amazon S3 bucket.

Staging Layer
In their best practices guide, dbt describes the staging layer in the following manner: “you can think of the staging layer as condensing and refining this material into the individual atoms we’ll later build more intricate and useful structures with.” The staging data models are the base tables and views we will use to build more complex aggregations and analytics queries in Redshift. The schema.yml file, also in the models/staging/tickit/ subdirectory, defines seven late-binding views, modeled by dbt, to be created in Amazon Redshift.
The staging models’ SQL statements also follow many of dbt’s best practices. Below, we see an example of the stg_tickit__sales model (stg_tickit__sales.sql). This model performs a SELECT from the external sale table in the tickit_external schema, renaming columns and performing basic calculations.
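The repository is the source of truth for the model’s exact SQL; the following is only a hedged sketch of what such a staging model might look like, with column names assumed from the standard TICKIT sale table and the source name assumed from the _tickit__sources.yml file:

-- models/staging/tickit/stg_tickit__sales.sql (illustrative sketch)
with source as (
    select * from {{ source('tickit_external', 'sale') }}
),

renamed as (
    select
        salesid                                  as sale_id,
        listid                                   as list_id,
        sellerid                                 as seller_id,
        buyerid                                  as buyer_id,
        eventid                                  as event_id,
        dateid                                   as date_id,
        qtysold                                  as qty_sold,
        pricepaid                                as price_paid,
        commission,
        round((commission / pricepaid) * 100, 2) as commission_pct,
        (pricepaid - commission)                 as earnings,
        saletime                                 as sale_time
    from source
    where salesid is not null
)

select * from renamed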
The dbt run command, according to dbt, “executes compiled SQL model files against the current target database. dbt connects to the target database and runs the relevant SQL, required to materialize all data models using the specified materialization strategies.” Instead of using the dbt run command to create all the project’s tables and views at once, for now, we are limiting the command to just the models in the ./models/staging/tickit/ directory using the --select optional argument. Execute the dbt run --select staging command to materialize the seven corresponding staging views in Amazon Redshift.

Once the command completes, we should observe seven new views in the Amazon Redshift demo database’s tickit_dbt schema, each with the stg_ prefix.

Selecting from any of the views should return data.

Late Binding Views
This demonstration uses late binding views for staging and intermediate layer models. According to dbt, “using late-binding views in a production deployment of dbt can vastly improve the availability of data in the warehouse, especially for models that are materialized as late-binding views and are queried by end-users, since they won’t be dropped when upstream models are updated. Additionally, late binding views can be used with external tables via Redshift Spectrum.”
Alternatively, we could define the seven staging models as tables instead of late binding views. Once created as tables, the dependent intermediate and marts views will not require a late-binding reference, as in this project.
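As a hedged illustration, the compiled DDL for a late-binding staging view would resemble the following; the dbt-redshift adapter adds the WITH NO SCHEMA BINDING clause when a model is configured with bind: false (column names assumed, as above):

-- Compiled output (sketch) for a staging model configured with bind: false
CREATE VIEW tickit_dbt.stg_tickit__sales AS
SELECT salesid   AS sale_id,
       pricepaid AS price_paid,
       commission,
       saletime  AS sale_time
FROM tickit_external.sale
WITH NO SCHEMA BINDING;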
Intermediate Layer
In their best practices guide, dbt describes the intermediate layer as “purpose-built transformation steps.” Further, “the best guiding principle is to think about verbs (e.g. pivoted, aggregated_to_user, joined, fanned_out_by_quantity, funnel_created, etc.) in the intermediate layer.”
The project’s intermediate layer consists of two models related to users. The sample TICKIT database lumps all users into a single table. However, for analytics purposes, different user personas might interest marketing teams, such as buyers, sellers, sellers who also buy, and non-buyers (users who have never purchased tickets). The two models in the project’s intermediate layer filter for buyers and for sellers, resulting in two separate views of user personas.
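For example, a hedged sketch of an intermediate model that filters for the buyer persona might look like the following; the model and column names are assumptions, not the repository’s exact code:

-- models/intermediate/tickit/int_buyers_extracted_from_users.sql (hypothetical sketch)
with users as (
    select * from {{ ref('stg_tickit__users') }}
),

sales as (
    select * from {{ ref('stg_tickit__sales') }}
)

select users.*
from users
where users.user_id in (select distinct buyer_id from sales)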
To materialize the intermediate layer’s two data models into views, execute the command, dbt run --select intermediate.

Once the command completes, we should observe a total of nine views in the Amazon Redshift demo database’s tickit_dbt schema — seven staging and two intermediate, the latter identified with the int_ prefix.

Marts Layer
In their best practices guide, dbt describes the marts layer as “business defined entities.” Further, “this is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose. We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain.”
The project’s marts layer consists of four data models across marketing and sales. The models are materialized as two dimension tables and two fact tables. Although it is common practice to describe and label these as traditional star schema dimension (dim_) or fact (fct_) tables, the fact tables in this demonstration are actually flat, denormalized, wide tables. Wide tables generally have better analytics performance in a modern data warehouse, according to Fivetran and others.
The marts layer’s models take various dependencies through joins on staging and intermediate models. The fct_sales data model, for example, depends on multiple staging and intermediate models.
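A hypothetical sketch of such a marts model is shown below; the model names, joins, and columns are assumptions for illustration only and differ from the repository’s actual fct_sales model:

-- models/marts/sales/fct_sales.sql (hypothetical sketch)
with sales as (
    select * from {{ ref('stg_tickit__sales') }}
),

events as (
    select * from {{ ref('stg_tickit__events') }}
),

buyers as (
    select * from {{ ref('int_buyers_extracted_from_users') }}
)

select
    sales.sale_id,
    sales.sale_time,
    sales.qty_sold,
    sales.price_paid,
    sales.commission,
    events.event_name,
    buyers.city  as buyer_city,
    buyers.state as buyer_state
from sales
    join events on sales.event_id = events.event_id
    join buyers on sales.buyer_id = buyers.user_id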

To materialize the marts layer’s four data models into tables, execute the command, dbt run --select marts.

Once the command completes, we should observe four tables and nine views in the Redshift demo database’s tickit_dbt schema. Note how the dbt model for fct_sales, with its Jinja templating and multiple CTEs, has been compiled into the resulting table in Redshift. This is the real magic of dbt!

At this point, all of the project’s models have been compiled and created in the Redshift demo database by dbt.
Analyses
The demonstration’s project also contains example analyses. dbt allows us to version control more analytically oriented SQL files within our dbt project using dbt’s analyses functionality. These analyses do not fit the fairly opinionated dbt model definition. We can compile an analysis’s SQL using the dbt compile command, then copy and paste the resulting SQL statements from the target/compiled/ subdirectory into our data warehouse’s query tool of choice.
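As a hypothetical example of an analysis (not one of the repository’s actual files; column names are assumed), something like the following could be compiled with dbt compile and run against the materialized fct_sales table:

-- analyses/gross_sales_by_day.sql (hypothetical example)
select
    date_trunc('day', sale_time) as sale_day,
    count(*)                     as sales,
    sum(price_paid)              as gross_sales
from {{ ref('fct_sales') }}
group by 1
order by 1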


Project Documentation
The dbt docs generate command automatically generates the project’s documentation website from the SQL and YAML files. Documentation can be generated and displayed from your dbt Cloud account or hosted locally.

Testing
According to dbt, “Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds, and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.” The project contains over 50 tests, split between the _tickit__sources.yml file and individual tests in the test/ directory. Typical dbt tests check for non-null and unique values, values within an expected numeric range, and values from a known list of strings. Any SELECT statement written in SQL can be tested.
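For instance, a hedged sketch of a singular test along the lines of the project’s assert_all_sale_amounts_are_positive test (column names assumed); a singular test fails if its query returns any rows:

-- tests/assert_all_sale_amounts_are_positive.sql (illustrative sketch)
select sale_id,
       price_paid
from {{ ref('stg_tickit__sales') }}
where price_paid <= 0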
Execute the project’s tests using the dbt test command. We can execute individual tests using the --select optional argument, for example, dbt test --select assert_all_sale_amounts_are_positive. We can also use the --threads optional argument with most dbt commands, including dbt test, to increase parallelism and reduce execution time. The example below uses 10 threads, the arbitrary maximum configured for the Amazon Redshift dbt user.


Jobs
According to dbt, jobs are a set of dbt commands that you want to run on a schedule, for example, dbt run and dbt test. Jobs can load packages, run tests, materialize models, check source freshness (dbt source freshness), and regenerate documentation. Below, we have created a daily job to test, refresh, and document our project as the data is updated in the data lake.

Notifications
According to dbt, setting up notifications in dbt Cloud allows you to receive alerts via email or a chosen Slack channel when a job run succeeds, fails, or is canceled.

The Slack notifications include run status, timings, and a link to open the job in dbt Cloud. Below, we see a notification regarding our project’s daily job run.

Exposures
Exposures are a recent addition to dbt. Exposures make it possible to define and describe a downstream use of our dbt project, such as in a dashboard, application, or data science pipeline. Below we see an example of an exposure describing a sales dashboard created in Amazon QuickSight.
The exposure YAML file shown above describes the Amazon QuickSight dashboard shown below.

Exposures work with dbt’s auto-documentation feature. dbt populates a dedicated page in the auto-generated documentation site with context relevant to data consumers.


Conclusion
In this post, we covered some of the basic functionality of dbt. We learned how dbt enables analysts to work more like software engineers. We also learned how dbt makes it easy to codify data models in SQL, to version control and manage data models as code with git, and to collaborate on data models with other data team members.
Topics not explored in this post but critical to most large-scale dbt-managed production environments include advanced Jinja templating and macros, model freshness, orchestration, job scheduling, Continuous Integration and GitOps, notifications, environment variables, and incremental models. We will explore these additional dbt capabilities in future posts.
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners. All diagrams and illustrations are the property of the author unless otherwise noted.
Streaming Analytics with Data Warehouses, using Amazon Kinesis Data Firehose, Amazon Redshift, and Amazon QuickSight
Posted by Gary A. Stafford in AWS, Cloud, Python, Software Development, SQL on March 5, 2020
Introduction
Databases are ideal for storing and organizing data that requires a high volume of transaction-oriented query processing while maintaining data integrity. In contrast, data warehouses are designed for performing data analytics on vast amounts of data from one or more disparate sources. In our fast-paced, hyper-connected world, those sources often take the form of continuous streams of web application logs, e-commerce transactions, social media feeds, online gaming activities, financial trading transactions, and IoT sensor readings. Streaming data must be analyzed in near real-time, while often first requiring cleansing, transformation, and enrichment.
In the following post, we will demonstrate the use of Amazon Kinesis Data Firehose, Amazon Redshift, and Amazon QuickSight to analyze streaming data. We will simulate time-series data, streaming from a set of IoT sensors to Kinesis Data Firehose. Kinesis Data Firehose will write the IoT data to an Amazon S3 Data Lake, where it will then be copied to Redshift in near real-time. In Amazon Redshift, we will enhance the streaming sensor data with data contained in the Redshift data warehouse, which has been gathered and denormalized into a star schema.
In Redshift, we can analyze the data, asking questions such as: What are the minimum, maximum, mean, and median temperatures over a given time period at each sensor location? Finally, we will use Amazon QuickSight to visualize the Redshift data using rich interactive charts and graphs, including displaying geospatial sensor data.
Featured Technologies
The following AWS services are discussed in this post.
Amazon Kinesis Data Firehose
According to Amazon, Amazon Kinesis Data Firehose can capture, transform, and load streaming data into data lakes, data stores, and analytics tools. Direct Kinesis Data Firehose integrations include Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. Kinesis Data Firehose enables near real-time analytics with existing business intelligence (BI) tools and dashboards.
Amazon Redshift
According to Amazon, Amazon Redshift is the most popular and fastest cloud data warehouse. With Redshift, users can query petabytes of structured and semi-structured data across your data warehouse and data lake using standard SQL. Redshift allows users to query and export data to and from data lakes. Redshift can federate queries of live data from Redshift, as well as across one or more relational databases.
Amazon Redshift Spectrum
According to Amazon, Amazon Redshift Spectrum can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum tables are created by defining the structure for data files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue or an Apache Hive metastore. While Redshift Spectrum is an alternative to copying the data into Redshift for analysis, we will not be using Redshift Spectrum in this post.
Amazon QuickSight
According to Amazon, Amazon QuickSight is a fully managed business intelligence service that makes it easy to deliver insights to everyone in an organization. QuickSight lets users easily create and publish rich, interactive dashboards that include Amazon QuickSight ML Insights. Dashboards can then be accessed from any device and embedded into applications, portals, and websites.
What is a Data Warehouse?
According to Amazon, a data warehouse is a central repository of information that can be analyzed to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through business intelligence tools, SQL clients, and other analytics applications.
Demonstration
Source Code
All the source code for this post can be found on GitHub. Use the following command to git clone a local copy of the project.
git clone \
    --branch master --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/kinesis-redshift-streaming-demo.git
CloudFormation
Use the two AWS CloudFormation templates, included in the project, to build two CloudFormation stacks. Please review the two templates and understand the costs of the resources before continuing.
The first CloudFormation template, redshift.yml, provisions a new Amazon VPC with associated network and security resources, a single-node Redshift cluster, and two S3 buckets.
The second CloudFormation template, kinesis-firehose.yml, provisions an Amazon Kinesis Data Firehose delivery stream, associated IAM Policy and Role, and an Amazon CloudWatch log group and two log streams.
Change the REDSHIFT_PASSWORD value to ensure your security. Optionally, change the REDSHIFT_USERNAME value. Make sure that the first stack completes successfully before creating the second stack.
export AWS_DEFAULT_REGION=us-east-1
REDSHIFT_USERNAME=awsuser
REDSHIFT_PASSWORD=5up3r53cr3tPa55w0rd

# Create resources
aws cloudformation create-stack \
    --stack-name redshift-stack \
    --template-body file://cloudformation/redshift.yml \
    --parameters ParameterKey=MasterUsername,ParameterValue=${REDSHIFT_USERNAME} \
        ParameterKey=MasterUserPassword,ParameterValue=${REDSHIFT_PASSWORD} \
        ParameterKey=InboundTraffic,ParameterValue=$(curl ifconfig.me -s)/32 \
    --capabilities CAPABILITY_NAMED_IAM

# Wait for first stack to complete
aws cloudformation create-stack \
    --stack-name kinesis-firehose-stack \
    --template-body file://cloudformation/kinesis-firehose.yml \
    --parameters ParameterKey=MasterUserPassword,ParameterValue=${REDSHIFT_PASSWORD} \
    --capabilities CAPABILITY_NAMED_IAM
Review AWS Resources
To confirm all the AWS resources were created correctly, use the AWS Management Console.
Kinesis Data Firehose
In the Amazon Kinesis Dashboard, you should see the new Amazon Kinesis Data Firehose delivery stream, redshift-delivery-stream.
The Details tab of the new Amazon Kinesis Firehose delivery stream should look similar to the following. Note the IAM Role, FirehoseDeliveryRole, which was created and associated with the delivery stream by CloudFormation.
We are not performing any transformations of the incoming messages. Note the new S3 bucket that was created and associated with the stream by CloudFormation. The bucket name was randomly generated. This bucket is where the incoming messages will be written.
Note the buffer conditions of 1 MB and 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written in JSON format, using GZIP compression, to S3. These are the minimal buffer conditions, and as close to real-time streaming to Redshift as we can get.
Note the COPY command, which is used to copy the messages from S3 to the message table in Amazon Redshift. Kinesis uses the IAM Role, ClusterPermissionsRole, created by CloudFormation, for credentials. We are using a Manifest to copy the data to Redshift from S3. According to Amazon, a Manifest ensures that the COPY command loads all of the required files, and only the required files, for a data load. The Manifests are automatically generated and managed by the Kinesis Firehose delivery stream.
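The COPY command issued by the delivery stream is roughly equivalent to the following sketch; the bucket name, manifest key, and role ARN shown here are placeholders, and the exact options are configured in the kinesis-firehose.yml template:

-- Illustrative only: Kinesis Data Firehose generates and runs the COPY for us
COPY message
FROM 's3://your-intermediate-bucket/manifests/your-manifest-file'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/ClusterPermissionsRole'
MANIFEST JSON 'auto' GZIP;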
Redshift Cluster
In the Amazon Redshift Console, you should see a new single-node Redshift cluster consisting of one Redshift dc2.large Dense Compute node type.
Note the new VPC, Subnet, and VPC Security Group created by CloudFormation. Also, observe that the Redshift cluster is publicly accessible from outside the new VPC.
Redshift Ingress Rules
The single-node Redshift cluster is assigned to an AWS Availability Zone in the US East (N. Virginia) us-east-1 AWS Region. The cluster is associated with a VPC Security Group. The Security Group contains three inbound rules, all for Redshift port 5439. The IP addresses associated with the three inbound rules provide access to the following: 1) a /27 CIDR block for Amazon QuickSight in us-east-1, 2) a /27 CIDR block for Amazon Kinesis Firehose in us-east-1, and 3) a /32 CIDR block with your current IP address. If your IP address changes or you do not use the us-east-1 Region, you will need to change one or all of these IP addresses. The list of Kinesis Firehose IP addresses is here. The list of QuickSight IP addresses is here.
If you cannot connect to Redshift from your local SQL client, most often, your IP address has changed and is incorrect in the Security Group’s inbound rule.
Redshift SQL Client
You can choose to use the Redshift Query Editor to interact with Redshift or use a third-party SQL client for greater flexibility. To access the Redshift Query Editor, use the user credentials specified in the redshift.yml CloudFormation template.
There is a lot of useful functionality in the Redshift Console and within the Redshift Query Editor. However, a notable limitation of the Redshift Query Editor, in my opinion, is the inability to execute multiple SQL statements at the same time, whereas most SQL clients allow multiple SQL queries to be executed at once.
I prefer to use JetBrains PyCharm IDE. PyCharm has out-of-the-box integration with Redshift. Using PyCharm, I can edit the project’s Python, SQL, AWS CLI shell, and CloudFormation code, all from within PyCharm.
If you use any of the common SQL clients, you will need to set up a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) connection to Redshift. The ODBC and JDBC connection strings can be found in the Redshift cluster’s Properties tab or in the Outputs tab of the CloudFormation stack, redshift-stack. You will also need the Redshift database username and password you included in the aws cloudformation create-stack AWS CLI command you executed previously. Below, we see PyCharm’s Project Data Sources window containing a new data source for the Redshift dev database.
Database Schema and Tables
When CloudFormation created the Redshift cluster, it also created a new database, dev. Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL commands to create a new database schema, sensor, and six tables in the sensor schema.
-- Create new schema in Redshift DB
DROP SCHEMA IF EXISTS sensor CASCADE;
CREATE SCHEMA sensor;
SET search_path = sensor;

-- Create (6) tables in Redshift DB
CREATE TABLE message -- streaming data table
(
    id      BIGINT IDENTITY (1, 1),                 -- message id
    guid    VARCHAR(36)   NOT NULL,                 -- device guid
    ts      BIGINT        NOT NULL DISTKEY SORTKEY, -- epoch in seconds
    temp    NUMERIC(5, 2) NOT NULL,                 -- temperature reading
    created TIMESTAMP DEFAULT ('now'::text)::timestamp with time zone -- row created at
);

CREATE TABLE location -- dimension table
(
    id          INTEGER        NOT NULL DISTKEY SORTKEY, -- location id
    long        NUMERIC(10, 7) NOT NULL,                 -- longitude
    lat         NUMERIC(10, 7) NOT NULL,                 -- latitude
    description VARCHAR(256)                             -- location description
);

CREATE TABLE history -- dimension table
(
    id            INTEGER     NOT NULL DISTKEY SORTKEY, -- history id
    serviced      BIGINT      NOT NULL,                 -- service date
    action        VARCHAR(20) NOT NULL,                 -- INSTALLED, CALIBRATED, FIRMWARE UPGRADED, DECOMMISSIONED, OTHER
    technician_id INTEGER     NOT NULL,                 -- technician id
    notes         VARCHAR(256)                          -- notes
);

CREATE TABLE sensor -- dimension table
(
    id     INTEGER     NOT NULL DISTKEY SORTKEY, -- sensor id
    guid   VARCHAR(36) NOT NULL,                 -- device guid
    mac    VARCHAR(18) NOT NULL,                 -- mac address
    sku    VARCHAR(18) NOT NULL,                 -- product sku
    upc    VARCHAR(12) NOT NULL,                 -- product upc
    active BOOLEAN DEFAULT TRUE,                 -- active status
    notes  VARCHAR(256)                          -- notes
);

CREATE TABLE manufacturer -- dimension table
(
    id      INTEGER      NOT NULL DISTKEY SORTKEY, -- manufacturer id
    name    VARCHAR(100) NOT NULL,                 -- company name
    website VARCHAR(100) NOT NULL,                 -- company website
    notes   VARCHAR(256)                           -- notes
);

CREATE TABLE sensors -- fact table
(
    id              BIGINT IDENTITY (1, 1) DISTKEY SORTKEY, -- fact id
    sensor_id       INTEGER     NOT NULL,                   -- sensor id
    manufacturer_id INTEGER     NOT NULL,                   -- manufacturer id
    location_id     INTEGER     NOT NULL,                   -- location id
    history_id      BIGINT      NOT NULL,                   -- history id
    message_guid    VARCHAR(36) NOT NULL                    -- sensor guid
);
Star Schema
The tables represent denormalized data taken from one or more relational database sources. Together, they form a star schema, a design widely used to develop data warehouses. A star schema consists of one or more fact tables referencing any number of dimension tables. The location, manufacturer, sensor, and history tables are dimension tables. The sensors table is a fact table.
In the diagram below, the foreign key relationships are virtual, not physical. The diagram was created using PyCharm’s schema visualization tool. Note the schema’s star shape. The message table is where the streaming IoT data will eventually be written. The message table is related to the sensors fact table through the common guid field.
Sample Data to S3
Next, copy the sample data, included in the project, to the S3 data bucket created with CloudFormation. Each CSV-formatted data file corresponds to one of the tables we previously created. Since the bucket name is semi-random, we can use the AWS CLI and jq to get the bucket name, then use it to perform the copy commands.
# Get data bucket name
DATA_BUCKET=$(aws cloudformation describe-stacks \
    --stack-name redshift-stack \
    | jq -r '.Stacks[].Outputs[] | select(.OutputKey == "DataBucket") | .OutputValue')
echo $DATA_BUCKET

# Copy data
aws s3 cp data/history.csv s3://${DATA_BUCKET}/history/history.csv
aws s3 cp data/location.csv s3://${DATA_BUCKET}/location/location.csv
aws s3 cp data/manufacturer.csv s3://${DATA_BUCKET}/manufacturer/manufacturer.csv
aws s3 cp data/sensor.csv s3://${DATA_BUCKET}/sensor/sensor.csv
aws s3 cp data/sensors.csv s3://${DATA_BUCKET}/sensors/sensors.csv
The output from the AWS CLI should look similar to the following.
Sample Data to Redshift
Whereas a relational database, such as Amazon RDS, is designed for online transaction processing (OLTP), Amazon Redshift is designed for online analytic processing (OLAP) and business intelligence applications. To write data to Redshift, we typically use the COPY command rather than the frequent, individual INSERT statements common with OLTP, which would be prohibitively slow. According to Amazon, the Redshift COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts.
In the following series of SQL statements, replace the placeholder, your_bucket_name, in five places with your S3 data bucket name. The bucket name will start with the prefix, redshift-stack-databucket. The bucket name can be found in the Outputs tab of the CloudFormation stack, redshift-stack. Next, replace the placeholder, cluster_permissions_role_arn, with the ARN (Amazon Resource Name) of the ClusterPermissionsRole. The ARN is formatted as follows: arn:aws:iam::your-account-id:role/ClusterPermissionsRole. The ARN can also be found in the Outputs tab of the CloudFormation stack, redshift-stack.
Using the Redshift Query Editor or your SQL client of choice, execute the SQL statements to copy the sample data from S3 to each of the corresponding tables in the Redshift dev database. The TRUNCATE commands guarantee there is no previous sample data present in the tables.
-- ** MUST FIRST CHANGE your_bucket_name and cluster_permissions_role_arn **
-- sensor schema
SET search_path = sensor;

-- Copy sample data to tables from S3
TRUNCATE TABLE history;
COPY history (id, serviced, action, technician_id, notes)
    FROM 's3://your_bucket_name/history/'
    CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
    CSV IGNOREHEADER 1;

TRUNCATE TABLE location;
COPY location (id, long, lat, description)
    FROM 's3://your_bucket_name/location/'
    CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
    CSV IGNOREHEADER 1;

TRUNCATE TABLE sensor;
COPY sensor (id, guid, mac, sku, upc, active, notes)
    FROM 's3://your_bucket_name/sensor/'
    CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
    CSV IGNOREHEADER 1;

TRUNCATE TABLE manufacturer;
COPY manufacturer (id, name, website, notes)
    FROM 's3://your_bucket_name/manufacturer/'
    CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
    CSV IGNOREHEADER 1;

TRUNCATE TABLE sensors;
COPY sensors (sensor_id, manufacturer_id, location_id, history_id, message_guid)
    FROM 's3://your_bucket_name/sensors/'
    CREDENTIALS 'aws_iam_role=cluster_permissions_role_arn'
    CSV IGNOREHEADER 1;

SELECT COUNT(*) FROM history;      -- 30
SELECT COUNT(*) FROM location;     -- 6
SELECT COUNT(*) FROM sensor;       -- 6
SELECT COUNT(*) FROM manufacturer; -- 1
SELECT COUNT(*) FROM sensors;      -- 30
Database Views
Next, create four Redshift database Views. These views may be used to analyze the data in Redshift, and later, in Amazon QuickSight.
- sensor_msg_detail: Returns aggregated sensor details, joining the sensors fact table with the other five tables.
- sensor_msg_count: Returns the number of messages received by Redshift for each sensor.
- sensor_avg_temp: Returns the average temperature from each sensor, based on all the messages received from each sensor.
- sensor_avg_temp_current: Identical to the previous view, but limited to the last 30 minutes.
Using the Redshift Query Editor or your SQL client of choice, execute the following series of SQL statements.
-- sensor schema
SET search_path = sensor;

-- View 1: Sensor details
DROP VIEW IF EXISTS sensor_msg_detail;
CREATE OR REPLACE VIEW sensor_msg_detail AS
SELECT ('1970-01-01'::date + e.ts * interval '1 second') AS recorded,
       e.temp,
       s.guid,
       s.sku,
       s.mac,
       l.lat,
       l.long,
       l.description AS location,
       ('1970-01-01'::date + h.serviced * interval '1 second') AS installed,
       e.created AS redshift
FROM sensors f
         INNER JOIN sensor s ON (f.sensor_id = s.id)
         INNER JOIN history h ON (f.history_id = h.id)
         INNER JOIN location l ON (f.location_id = l.id)
         INNER JOIN manufacturer m ON (f.manufacturer_id = m.id)
         INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
  AND h.action = 'INSTALLED'
ORDER BY f.id;

-- View 2: Message count per sensor
DROP VIEW IF EXISTS sensor_msg_count;
CREATE OR REPLACE VIEW sensor_msg_count AS
SELECT count(e.temp) AS msg_count,
       s.guid,
       l.lat,
       l.long,
       l.description AS location
FROM sensors f
         INNER JOIN sensor s ON (f.sensor_id = s.id)
         INNER JOIN history h ON (f.history_id = h.id)
         INNER JOIN location l ON (f.location_id = l.id)
         INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
  AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY msg_count, s.guid;

-- View 3: Average temperature per sensor (all data)
DROP VIEW IF EXISTS sensor_avg_temp;
CREATE OR REPLACE VIEW sensor_avg_temp AS
SELECT avg(e.temp) AS avg_temp,
       count(s.guid) AS msg_count,
       s.guid,
       l.lat,
       l.long,
       l.description AS location
FROM sensors f
         INNER JOIN sensor s ON (f.sensor_id = s.id)
         INNER JOIN history h ON (f.history_id = h.id)
         INNER JOIN location l ON (f.location_id = l.id)
         INNER JOIN message e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
  AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY avg_temp, s.guid;

-- View 4: Average temperature per sensor (last 30 minutes)
DROP VIEW IF EXISTS sensor_avg_temp_current;
CREATE OR REPLACE VIEW sensor_avg_temp_current AS
SELECT avg(e.temp) AS avg_temp,
       count(s.guid) AS msg_count,
       s.guid,
       l.lat,
       l.long,
       l.description AS location
FROM sensors f
         INNER JOIN sensor s ON (f.sensor_id = s.id)
         INNER JOIN history h ON (f.history_id = h.id)
         INNER JOIN location l ON (f.location_id = l.id)
         INNER JOIN (SELECT ('1970-01-01'::date + ts * interval '1 second') AS recorded_time,
                            guid,
                            temp
                     FROM message
                     WHERE DATEDIFF(minute, recorded_time, GETDATE()) <= 30) e ON (f.message_guid = e.guid)
WHERE s.active IS TRUE
  AND h.action = 'INSTALLED'
GROUP BY s.guid, l.description, l.lat, l.long
ORDER BY avg_temp, s.guid;
At this point, you should have a total of six tables and four views in the sensor schema of the dev database in Redshift.
Test the System
With all the necessary AWS resources and Redshift database objects created and sample data in the Redshift database, we can test the system. The included Python script, kinesis_put_test_msg.py, will generate a single test message and send it to Kinesis Data Firehose. If everything is working, the message should be delivered from Kinesis Data Firehose to S3, then copied to Redshift, and appear in the message table.
Install the required Python packages and then execute the Python script.
# Install required Python packages
python3 -m pip install --user -r scripts/requirements.txt

# Set default AWS Region for script
export AWS_DEFAULT_REGION=us-east-1

# Execute script in foreground
python3 ./scripts/kinesis_put_test_msg.py
Run the following SQL query to confirm the record is in the message table of the dev database. It will take at least one minute for the message to appear in Redshift.
SELECT COUNT(*) FROM message;
Once the message is confirmed to be present in the message table, delete the record by truncating the table.
TRUNCATE TABLE message;
Streaming Data
Assuming the test message worked, we can proceed with simulating the streaming IoT sensor data. The included Python script, kinesis_put_streaming_data.py, creates six concurrent threads, representing six temperature sensors.
#!/usr/bin/env python3

# Simulated multiple streaming time-series iot sensor data
# Author: Gary A. Stafford
# Date: Revised October 2020

import json
import random
import threading
import time as tm
from datetime import datetime

import boto3
import numpy as np

STREAM_NAME = 'redshift-delivery-stream'
client = boto3.client('firehose')


class MyThread(threading.Thread):
    def __init__(self, thread_id, sensor_guid, temp_max):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.sensor_id = sensor_guid
        self.temp_max = temp_max

    def run(self):
        print("Starting Thread: " + str(self.thread_id))
        self.create_data()
        print("Exiting Thread: " + str(self.thread_id))

    def create_data(self):
        start = 0
        stop = 20
        step = 0.1  # step size (e.g 0 to 20, step .1 = 200 steps in cycle)
        repeat = 2  # how many times to repeat cycle
        freq = 60  # frequency of temperature reading in seconds
        max_range = int(stop * (1 / step))
        time = np.arange(start, stop, step)
        amplitude = np.sin(time)
        for x in range(0, repeat):
            for y in range(0, max_range):
                temperature = round((((amplitude[y] + 1.0) * self.temp_max) + random.uniform(-5, 5)) + 60, 2)
                payload = {
                    'guid': self.sensor_id,
                    'ts': int(datetime.now().strftime('%s')),
                    'temp': temperature
                }
                print(json.dumps(payload))
                self.send_to_kinesis(payload)
                tm.sleep(freq)

    @staticmethod
    def send_to_kinesis(payload):
        _ = client.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={
                'Data': json.dumps(payload)
            }
        )


def main():
    sensor_guids = [
        "03e39872-e105-4be4-83c0-9ade818465dc",
        "fa565921-fddd-4bfb-a7fd-d617f816df4b",
        "d120422d-5789-435d-9dc6-73d8489b04c2",
        "93238559-4d55-4b2a-bdcb-6aa3be0f3908",
        "dbc05806-6872-4f0a-aca2-f794cc39bd9b",
        "f9ade639-f936-4954-aa5a-1f2ed86c9bcf"
    ]
    timeout = 300  # arbitrarily offset the start of threads

    # Create new threads
    thread1 = MyThread(1, sensor_guids[0], 25)
    thread2 = MyThread(2, sensor_guids[1], 10)
    thread3 = MyThread(3, sensor_guids[2], 7)
    thread4 = MyThread(4, sensor_guids[3], 30)
    thread5 = MyThread(5, sensor_guids[4], 5)
    thread6 = MyThread(6, sensor_guids[5], 12)

    # Start new threads
    thread1.start()
    tm.sleep(timeout * 1)
    thread2.start()
    tm.sleep(timeout * 2)
    thread3.start()
    tm.sleep(timeout * 1)
    thread4.start()
    tm.sleep(timeout * 3)
    thread5.start()
    tm.sleep(timeout * 2)
    thread6.start()

    # Wait for threads to terminate
    thread1.join()
    thread2.join()
    thread3.join()
    thread4.join()
    thread5.join()
    thread6.join()
    print("Exiting Main Thread")


if __name__ == '__main__':
    main()
The simulated data uses an algorithm that follows an oscillating sine wave or sinusoid, representing rising and falling temperatures. In the script, I have configured each thread to start with an arbitrary offset to add some randomness to the simulated data.
The variables within the script can be adjusted to shorten or lengthen the time it takes to stream the simulated data. By default, each of the six threads creates 400 messages per sensor, in one-minute increments. Including the offset start of each successive thread, the total runtime of the script is about 7.5 hours to generate 2,400 simulated IoT sensor temperature readings and push them to Kinesis Data Firehose. Make sure you can maintain a connection to the Internet for the entire runtime of the script. I normally run the script in the background, from a small EC2 instance.
To use the Python script, execute either of the two following commands. Using the first command will run the script in the foreground. Using the second command will run the script in the background.
# Install required Python packages
python3 -m pip install --user -r scripts/requirements.txt

# Set default AWS Region for script
export AWS_DEFAULT_REGION=us-east-1

# Option #1: Execute script in foreground
python3 ./scripts/kinesis_put_streaming_data.py

# Option #2: Execute script in background
nohup python3 -u ./scripts/kinesis_put_streaming_data.py > output.log 2>&1 </dev/null &

# Check that the process is running
ps -aux | grep 'python3 -u ./scripts/kinesis_put_streaming_data.py'

# Wait 1-2 minutes, then check output to confirm script is working
cat output.log
Viewing the output.log file, you should see messages being generated on each thread and sent to Kinesis Data Firehose. Each message contains the GUID of the sensor, a timestamp, and a temperature reading.
The messages are sent to Kinesis Data Firehose, which in turn writes the messages to S3. The messages are written in JSON format using GZIP compression. Below, we see an example of the GZIP compressed JSON files in S3. The JSON files are partitioned by year, month, day, and hour.
Confirm Data Streaming to Redshift
From the Amazon Kinesis Firehose Console Metrics tab, you should see incoming messages flowing to S3 and on to Redshift.
Executing the following SQL query should show an increasing number of messages.
SELECT COUNT(*) FROM message;
How Near Real-time?
Earlier, we saw how the Amazon Kinesis Data Firehose delivery stream was configured to buffer data at the rate of 1 MB or 60 seconds. Whenever the buffer of incoming messages is greater than 1 MB or the time exceeds 60 seconds, the messages are written to S3. Each record in the message table has two timestamps. The first timestamp, ts, is when the temperature reading was recorded. The second timestamp, created, is when the message was written to Redshift, using the COPY command. We can calculate the delta in seconds between the two timestamps using the following SQL query in Redshift.
SELECT ('1970-01-01'::date + ts * interval '1 second') AS recorded_time,
       created AS redshift_time,
       DATEDIFF(seconds, recorded_time, redshift_time) AS diff_seconds
FROM message
ORDER BY diff_seconds;
Using the results of the Redshift query, we can visualize them in Amazon QuickSight. In my own tests, for 2,400 messages over approximately 7.5 hours, the minimum delay was 1 second, and the maximum delay was 64 seconds. Hence, near real-time, in this case, is about one minute or less, with an average latency of roughly 30 seconds.
Analyzing the Data with Redshift
I suggest waiting at least thirty minutes for a significant number of messages to be copied into Redshift. With the data streaming into Redshift, execute each of the database views we created earlier. You should see the streaming message data, joined to the existing static data in Redshift. As data continues to stream into Redshift, the views will display different results based on the current message table contents.
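As a hedged example of the kind of ad hoc question posed in the introduction, a query along these lines returns the minimum, maximum, mean, and median temperature per sensor location for the last hour, using Redshift’s MEDIAN aggregate (this query is illustrative and not part of the project):

SET search_path = sensor;

SELECT l.description  AS location,
       MIN(e.temp)    AS min_temp,
       MAX(e.temp)    AS max_temp,
       AVG(e.temp)    AS mean_temp,
       MEDIAN(e.temp) AS median_temp
FROM message e
         INNER JOIN sensors f ON (f.message_guid = e.guid)
         INNER JOIN location l ON (f.location_id = l.id)
WHERE ('1970-01-01'::date + e.ts * interval '1 second') >= DATEADD(hour, -1, GETDATE())
GROUP BY l.description
ORDER BY median_temp DESC;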
Here, we see the first ten results of the sensor_msg_detail view.
recorded | temp | guid | sku | mac | lat | long | location | installed | redshift
---|---|---|---|---|---|---|---|---|---
2020-03-04 03:31:59.000000 | 105.56 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:33:01.580147
2020-03-04 03:29:59.000000 | 95.93 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:31:01.388887
2020-03-04 03:26:58.000000 | 91.93 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:28:01.099796
2020-03-04 03:25:58.000000 | 88.70 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:26:00.196113
2020-03-04 03:22:58.000000 | 87.65 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:23:01.558514
2020-03-04 03:20:58.000000 | 77.35 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:21:00.691347
2020-03-04 03:16:57.000000 | 71.84 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:17:59.307510
2020-03-04 03:15:57.000000 | 72.35 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:15:59.813656
2020-03-04 03:14:57.000000 | 67.95 | 03e39872-e105-4be4-83c0-9ade818465dc | PR49-24A | 8e:fa:46:09:14:b2 | 37.7068476 | -122.4191599 | Research Lab #2203 | 2018-01-31 12:00:00.000000 | 2020-03-04 03:15:59.813656
Next, we see the results of the sensor_avg_temp view.
avg_temp | guid | lat | long | location
---|---|---|---|---
65.25 | dbc05806-6872-4f0a-aca2-f794cc39bd9b | 37.7066541 | -122.4181399 | Wafer Inspection Lab #0210A
67.23 | d120422d-5789-435d-9dc6-73d8489b04c2 | 37.7072686 | -122.4187016 | Zone 4 Wafer Processing Area B3
70.23 | fa565921-fddd-4bfb-a7fd-d617f816df4b | 37.7071763 | -122.4190397 | Research Lab #2209
72.22 | f9ade639-f936-4954-aa5a-1f2ed86c9bcf | 37.7067618 | -122.4186191 | Wafer Inspection Lab #0211C
85.48 | 03e39872-e105-4be4-83c0-9ade818465dc | 37.7068476 | -122.4191599 | Research Lab #2203
90.69 | 93238559-4d55-4b2a-bdcb-6aa3be0f3908 | 37.7070334 | -122.4184393 | Zone 2 Semiconductor Assembly Area A2
Amazon QuickSight
In a recent post, Getting Started with Data Analysis on AWS using AWS Glue, Amazon Athena, and QuickSight: Part 2, I detailed getting started with Amazon QuickSight. In this post, I will assume you are familiar with QuickSight.
Amazon recently added a full set of aws quicksight APIs for interacting with QuickSight. However, for this part of the demonstration, we will work directly in the Amazon QuickSight Console, as opposed to using the AWS CLI, AWS CDK, or CloudFormation.
Redshift Data Sets
To visualize the data from Amazon Redshift, we start by creating Data Sets in QuickSight. QuickSight supports a large number of data sources for creating data sets. We will use the Redshift data source. If you recall, we added an inbound rule for QuickSight, allowing us to connect to our Redshift cluster in us-east-1.
We will select the sensor schema, which is where the tables and views for this demonstration are located.
We can choose any of the tables or views in the Redshift dev database that we want to use for visualization.
Below, we see examples of two new data sets, shown in the QuickSight Data Prep Console. Note how QuickSight automatically recognizes field types, including dates, latitude, and longitude.
Visualizations
Using the data sets, QuickSight allows us to create a wide number of rich visualizations. Below, we see the simulated time-series data from the six temperature sensors.
Next, we see an example of QuickSight’s ability to show geospatial data. The Map shows the location of each sensor and the average temperature recorded by that sensor.
Cleaning Up
To remove the resources created for this post, use the following series of AWS CLI commands.
# Get data bucket name
DATA_BUCKET=$(aws cloudformation describe-stacks \
    --stack-name redshift-stack \
    | jq -r '.Stacks[].Outputs[] | select(.OutputKey == "DataBucket") | .OutputValue')
echo ${DATA_BUCKET}

# Get log bucket name
LOG_BUCKET=$(aws cloudformation describe-stacks \
    --stack-name redshift-stack \
    | jq -r '.Stacks[].Outputs[] | select(.OutputKey == "LogBucket") | .OutputValue')
echo ${LOG_BUCKET}

# Delete demonstration resources
python3 ./scripts/delete_buckets.py
aws cloudformation delete-stack --stack-name kinesis-firehose-stack

# Wait for first stack to be deleted
aws cloudformation delete-stack --stack-name redshift-stack
Conclusion
In this brief post, we have learned how streaming data can be analyzed in near real-time, in Amazon Redshift, using Amazon Kinesis Data Firehose. Further, we explored how the results of those analyses can be visualized in Amazon QuickSight. For customers that depend on a data warehouse for data analytics, but who also have streaming data sources, the use of Amazon Kinesis Data Firehose or Amazon Redshift Spectrum is an excellent choice.
This blog represents my own viewpoints and not those of my employer, Amazon Web Services.