Archive for category Cloud
Earning the AWS Certified Machine Learning — Specialty (MLS-C01) Certification
Posted by Gary A. Stafford in AWS, Cloud, Machine Learning on November 14, 2022
Introduction
Recently, I earned the AWS Certified Machine Learning — Specialty (MLS-C01) Certification, my ninth AWS certification. Since a few colleagues asked me about my preparation, I thought I would share it with the community, without divulging any details of the exam, of course.

Prerequisite Experience
Several AWS certifications can be earned with minimal to no hands-on AWS experience, but excellent short-term memorization skills. Although you will have technically earned the certification, you will certainly not be competent to practice the particular discipline. Certification does not equal qualification.
In my opinion, the AWS Certified Machine Learning — Specialty certification exam is not one of those where simple memorization of study materials, alone, will guarantee a passing score. If you lack practical experience in data science, machine learning, basic statistics, or data analytics on AWS, you will be challenged to pass this exam, no matter how much you cram.
Consider Data Analytics Certification First
To prepare for the Machine Learning — Specialty exam, I would strongly suggest first earning the AWS Certified Data Analytics — Specialty certification. According to the Machine Learning — Specialty exam’s content outline, “Domain 1: Data Engineering”, accounts for 20% of the exam’s score. Understanding the AWS Analytics services and how they integrate to form the most efficient data pipelines to feed your Machine Learning model training is a requirement for this portion of the exam’s questions. Preparing for the Data Analytics — Specialty certification will provide this adjacent domain knowledge:
- Amazon Athena
- Amazon EMR (pka Amazon Elastic MapReduce)
- Amazon IAM
- Amazon Kinesis Data Analytics, Data Firehose, Data Streams, Video Streams
- Amazon Redshift
- Amazon S3
- Amazon VPC
- AWS Data Pipeline
- AWS Glue Crawlers, Jobs, Data Catalog
- AWS Lambda
- AWS Step Functions

Study Materials
In my case, certification success was a result of practical experience, coursework, completing and reviewing the results of several practice exams, and taking lots of notes. The following is a list of the study materials I found most impactful:
Documentation
I reviewed the Amazon SageMaker and other AWS fully-managed AI/ML service documentation for my preparation.
Carefully review the Choose an Algorithm section of the Amazon SageMaker Developer Guide. According to the exam’s content outline, “Domain 3: Modeling” accounts for 36% of the exam’s score. Understand 1) recommended use cases for each of SageMaker’s built-in algorithms, 2) the algorithm’s required hyperparameters, and 3) the prescribed model evaluation metrics and tuning techniques. Built-in SageMaker algorithms most commonly covered in most training materials include:
- Tabular
- XGBoost (eXtreme Gradient Boosting)
- Linear Learner
- K-Nearest Neighbors (KNN)
- Factorization Machines
- Object2Vec
- Vision
- Image Classification
- Object Detection
- Semantic Segmentation
- Clustering
- K-Means
- Time-Series Forecast
- DeepAR
- Text Classification & Embedding
- BlazingText
- Text Transformation
- Sequence-to-Sequence (Seq2Seq)
- Text Topic Modeling
- Neural Topic Modeling (NTM)
- Latent Dirichlet Allocation (LDA)
- Dimensionality Reduction
- Principal Component Analysis (PCA)
- Anomaly Detection
- Random Cut Forest (RCF)
- IP Insights
AWS also uses Read the Docs. SageMaker’s Algorithm section is especially helpful with respect to preparing for the Machine Learning — Specialty exam: image processing, text processing, time-series processing, supervised learning, unsupervised learning, and feature engineering algorithms.
Along with algorithms, review SageMaker’s Deploy Models for Inference documentation. According to the exam’s content outline, “Domain 4: Machine Learning Implementation and Operations” accounts for 20% of the exam’s score. Understand SageMaker’s options for model serving, model versioning, deployment strategies, and endpoint monitoring.
Review the AWS fully managed AI/ML services Developer Guide documentation for the following services:
- Amazon Augmented AI
- Amazon CodeGuru
- Amazon Comprehend
- Amazon Forecast
- Amazon Fraud Detector
- Amazon Kendra
- Amazon Lex
- Amazon Personalize
- Amazon Polly
- Amazon Rekognition
- Amazon Textract
- Amazon Transcribe
- Amazon Translate
Understand the use cases for each of these services and most critically, how these managed services can be combined to create more complex AI/ML solutions. For example, building a near-real-time speech-to-speech translator with Amazon Transcribe, Amazon Translate, and Amazon Polly.
Online Courses
For my preparation, I completed three Udemy courses. Most of these online courses regularly go on sale and be purchased for $25 or less:
- AWS Certified Machine Learning Specialty 2022 — Hands On!, by Frank Kane and Stephane Maarek. Both Frank and Stephane are well-known across the industry and respected trainers. I recommend reviewing the algorithm, model evaluation, and high-level ML services sections more than once (Sections 5 and 6).
- AWS Certified Machine Learning Specialty (MLS-C01), by Chandra Lingam. Don’t get caught up in the nitty-gritty details of the Python code; focus on the higher-level machine learning principles. This course also contains a full-length practice exam.
- AWS Certified Machine Learning Specialty: 3 PRACTICE EXAMS, by Abhishek Singh. Nothing beats taking full-length practice exams and learning from your mistakes.
- Whizlabs’ AWS Certified Machine Learning Specialty Practice Tests. I completed a few of Whizlabs’ smaller practice exams, but, with limited time, I chose to complete Udemy’s full-length practice tests. Some of Whizlabs’ questions seemed off-topic to the exam outline and other training materials I reviewed.

Books
For my preparation, I read or re-read three books, two from Packt and one from O’Reilly:
- AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide, by Somanath Nanda and Weslley Moura (Packt Publishing). I recommend this one if you only have time to read a single book.
- Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, Peter Gedeck (O’Reilly Media). According to the University of San Diego, “Statistics (or statistical analysis) is core to every machine learning algorithm.” This book covers many of the core statistical concepts behind Machine Learning, covered on the exam:
- BLEU
- Classification metrics: Precision-Recall Curve, ROC Curve, AUC
- Confusion Matrix: TP, FP, TN, FN, Accuracy, Precision, Recall (Sensitivity), Specificity, F1
- Correlated variables, Multicollinearity
- Distributions: Normal (Gaussian or “bell curve”), Bernoulli, Binomial, Poisson
- Elbow Method
- Ensemble Learning: Bagging, Boosting
- Euclidean Distance
- K-Fold Cross-Validation
- L1/L2 Regularization (lasso, alpha, ridge, lambda)
- Overfitting, Underfitting, High Bias, High Variance, Bias-Variance Tradeoff
- Plots: Histograms, Boxplots, Scatterplots
- Regression metrics: MAE, MSE, RMSE, R-squared, Adjusted R-squared
- Residuals
- SMOTE
- Standard Deviation, Three-Sigma/Empirical/68–95–99.7 Rule
- Z-score
- Python Machine Learning — Third Edition, by Sebastian Raschka and Vahid Mirjalili (Packt Publishing). Note that this book dives much deeper into the low-level statistical underpinnings of machine learning than is required for the exam, based on the exam outline. Again, don’t get caught up in the nitty-gritty details of Python; focus on the higher-level machine learning principles.
Scheduling the Exam
One last tip regarding when to take your exam. I have taken 15 AWS exams between nine AWS certifications and several recertifications. Although the Certified Machine Learning — Specialty exam is difficult, I found changing the time I sat the exam, greatly reduced my stress level. In the past, I took time off on a workday to complete exams, either in person or at home using online proctoring. I was preparing for the exam while frequently being interrupted by work-related items. For this exam, I chose to use online proctoring and took my exam at 6:00 AM on a Sunday morning. Up early, fresh, and full of energy, with no work- or family-related interruptions, no lawnmowers, dogs barking, or garbage trucks rumbling by, and no Internet bandwidth issues. I was done by 9:00 AM and eating breakfast with the family.
This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
Lakehouse Data Modeling using dbt, Amazon Redshift, Redshift Spectrum, and AWS Glue
Learn how dbt makes it easy to transform data and materialize models in a modern cloud data lakehouse built on AWS
Introduction
Data lakes have grabbed much of the analytics community’s attention in recent years, thanks to an overabundance of VC-backed analytics startups and marketing dollars. Nonetheless, data warehouses, specifically modern cloud data warehouses, continue to gain market share, led by Snowflake, Amazon Redshift, Google Cloud BigQuery, and Microsoft’s Azure Synapse Analytics.
Several factors have fostered the renewed interest and appeal of data warehouses, including the data lakehouse architecture. According to Databricks, “a lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse directly on top of low-cost cloud storage in open formats.” Similarly, Snowflake describes a lakehouse as “a data solution concept that combines elements of the data warehouse with those of the data lake. Data lakehouses implement data warehouses’ data structures and management features for data lakes, which are typically more cost-effective for data storage.”
dbt
In the following post, we will explore the use of dbt (data build tool), developed by dbt Labs, to transform data in an AWS-based data lakehouse, built with Amazon Redshift, Redshift Spectrum, AWS Glue, and Amazon S3. According to dbt Labs, “dbt enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.” Further, “dbt does the T in ELT (Extract, Load, Transform) processes — it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.”

Amazon Redshift
According to AWS, “Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.” AWS claims Amazon Redshift is the most widely used cloud data warehouse.
Amazon Redshift Spectrum
According to AWS, “Redshift Spectrum allows you to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.” Redshift Spectrum tables define the data structure for the files in Amazon S3. The external tables exist in an external data catalog, which can be AWS Glue, the data catalog that comes with Amazon Athena, or an Apache Hive metastore.
dbt can interact with Amazon Redshift Spectrum to create external tables, refresh external table partitions, and access raw data in an Amazon S3-based data lake from the data warehouse. We will use dbt along with the dbt package, dbt_external_tables
, to create the external tables in an AWS Glue data catalog.
Prerequisites
Prerequisites to follow along with this post’s demonstration include:
- Amazon S3 bucket to store raw data;
- Amazon Redshift or Amazon Redshift Serverless cluster;
- AWS IAM Role with permissions to Amazon Redshift, Amazon S3, and AWS Glue;
- dbt Cloud account;
- dbt CLI (dbt Core) and dbt Amazon Redshift adapter installed locally;
- Microsoft Visual Studio Code (VS Code) with dbt extensions installed;
The post’s demonstration uses dbt Cloud, VS Code, and the dbt CLI interchangeably with the project’s GitHub repository as a source. Follow along with the demonstration using any or all of these three dbt options.

Cost Warning!
Be careful when creating a new, provisioned Amazon Redshift cluster for this demonstration. The suggested default Production cluster with two ra3.4xlarge
on-demand compute nodes and AQUA (Redshift’s Advanced Query Accelerator) enabled is estimated at $4,694/month ($3.26/node/hour). For this demonstration, choose the minimum size provisioned Redshift cluster configuration of one dc2.large
on-demand compute node, estimated to cost $180/month ($0.25/node/hour). Be sure to delete the cluster when the demonstration is complete.

Amazon Redshift Serverless Option
AWS recently announced the general availability (GA) of Amazon Redshift Serverless on July 12, 2022. Amazon Redshift Serverless allows data analysts, developers, and data scientists to run and scale analytics without having to provision and manage data warehouse clusters. dbt is fully compatible with Amazon Redshift Serverless and is an alternative to provisioned Redshift for this demonstration. According to AWS, Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in Amazon S3.
Source Code
All the source code demonstrated in this post is open source and available on GitHub.
Sample Data
This demonstration uses the TICKIT sample database provided by AWS and designed for use with Amazon Redshift. This sample database application tracks sales activity for the fictional online TICKIT website, where users buy and sell tickets for sporting events, shows, and concerts. The database consists of seven tables in a star schema: five dimension tables and two fact tables. A clean copy of the raw TICKIT data, formatted as pipe-delimited text files, is included in this GitHub project. Use the following shell commands to copy the raw data to Amazon S3:
Prepare Amazon Redshift for dbt
Create New Database
Create a new Redshift database to use for the demonstration, demo
.
Create Database Schemas
Within the new Redshift database,demo
, create the external schema, tickit_external
, and the corresponding external AWS Glue Data Catalog, tickit_dbt
, using the CREATE EXTERNAL SCHEMA
Redshift SQL command. Make sure to update the command to reflect your IAM Role’s ARN. Next, create the schema that will hold our dbt models, tickit_dbt
. Lastly, as a security best practice, drop the default public
schema.
From the AWS Glue console, we should observe a new tickit_dbt
AWS Glue Data Catalog. The description shown below was manually added after the catalog was created.

Create dbt Database User and Group
As a security best practice, create a separate database dbt
user and dbt
group. We are assigning a completely arbitrary connection limit of ten. Then, apply the grants to allow the dbt
group access to the new database and schemas. Lastly, change the two schema’s owners to the dbt
.
Alternately, we could use an IAM Role with a SAML 2.0-compliant IdP.
Initialize and Configure dbt for Redshift
Next, configure your dbt Cloud account and dbt locally with your Amazon Redshift connection information using the dbt init
command. On a Mac, this configuration is stored in the /Users/<your_usernama>/.dbt/profiles.yml
file. You will need your Redshift cluster host URL, port, database, username, and password. With your local install of dbt, we can use the dbt debug
command to confirm the new configuration.

Project Structure
The GitHub project structure follows many of the best practices outlined in dbt Labs’ Best Practice Guide. Data models in the models
directory is organized into the recommended staging
, intermediate
, and marts
subdirectories (aka layers).

From a data lineage perspective, in this project, the staging layer’s data models depend on the external tables (AWS Glue/Amazon Redshift Spectrum). The intermediate layer’s data models depend on the staging models. The marts layer’s data models depend on staging and intermediate models.

Install dbt Packages
The GitHub project’s packages.yml
contains a few commonly recommended packages. The only one required for this post is the dbt-labs/dbt_external_tables
package. Make sure your project is referring to the latest version of the package.
Use the dbt deps
command to install the packages locally.

External Tables
The _tickit__sources.yml
file in the models/staging/tickit/external_tables/
model’s subdirectory defines the schema and S3 location for each of the seven external TICKIT database tables: category, date, event, listing, sale, user, and venue. You will need to update this file to reflect the name of your Amazon S3 bucket, in seven places.
Execute the command, dbt run-operation stage_external_sources
, to create the seven external tables in the AWS Glue Data Catalog. This command is part of the dbt_external_tables
package we installed earlier. It iterates through all source nodes, creates the tables if missing, and refreshes metadata.

If we failed to run the previous SQL statements to set schema ownership to the dbt
user, the following error will likely occur.

Once the command completes, we should observe seven new tables in the AWS Glue Data Catalog.

Examining one of the AWS Glue data catalog tables, we can observe how the configuration in the _tickit__sources.yml
file was used to define the table’s properties and schema. Note the Location
field indicates where the underlying data is located in our Amazon S3 bucket.

Staging Layer
In their best practices guide, dbt describes the staging layer in the following manner: “you can think of the staging layer as condensing and refining this material into the individual atoms we’ll later build more intricate and useful structures with.” The staging data models are the base tables and views we will use to build more complex aggregations and analytics queries in Redshift. The schema.yml
file, also in the models/staging/tickit/
model’s subdirectory, defines seven late-binding views, modeled by dbt, to be created in Amazon Redshift.
The staging model’s SQL statements also follow many of dbt’s best practices. Below, we see an example of the stg_tickit__sales
model (stg_tickit__sales.sql
). This model performs a SELECT
from the external sale
table in the external_table
schema. The model performs column renaming and basic calculations.
The the dbt run
command, according to dbt, “executes compiled SQL model files against the current target
database. dbt connects to the target database and runs the relevant SQL, required to materialize all data models using the specified materialization strategies.” Instead of using the dbt run
command to create all the project’s tables and views at once, for now, we are limiting the command to just the models in the ./models/staging/tickit/
directory using the --select
optional argument. Execute the dbt run --select staging
command to materialize the seven corresponding staging tables in Amazon Redshift.

Once the command completes, we should observe seven new views in Amazon Redshift demo
database’s tickit_dbt
schema with the stg_
prefix.

Selecting from any of the views should return data.

Late Binding Views
This demonstration uses late binding views for staging and intermediate layer models. According to dbt, “using late-binding views in a production deployment of dbt can vastly improve the availability of data in the warehouse, especially for models that are materialized as late-binding views and are queried by end-users, since they won’t be dropped when upstream models are updated. Additionally, late binding views can be used with external tables via Redshift Spectrum.”
Alternatively, we could define the seven staging models as tables instead of late binding views. Once created as tables, the dependent intermediate and marts views will not require a late-binding reference, as in this project.
Intermediate Layer
In their best practices guide, dbt describes the intermediate layer as “purpose-built transformation steps.” Further, “the best guiding principle is to think about verbs (e.g. pivoted
, aggregated_to_user
, joined
, fanned_out_by_quanity
, funnel_created
, etc.) in the intermediate layer.”
The project’s intermediate layer consists of two models related to users. The sample TICKIT database lumps all users into a single table. However, for analytics purposes, different user personas might interest marketing teams, such as buyers, sellers, sellers who also buy, and non-buyers (users who have never purchased tickets). The two models in the project’s intermediate layer filter for buyers and for sellers, resulting in two separate views of user personas.
To materialize the intermediate layer’s two data models into views, execute the command, dbt run --select intermediate
.

Once the command completes, we should observe a total of nine views in Amazon Redshift demo
database’s tickit_dbt
schema — seven staging and two intermediate, identified with the int_
prefix.

Marts Layer
In their best practices guide, dbt describes the marts layer as “business defined entities.” Further, “this is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose. We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain.”
The project’s marts layer consists of four data models across marketing and sales. The models are materialized as two dimension tables and two fact tables. Although it is common practice to describe and label these as traditional star schema dimension (dim_
) or fact (fct_
) tables, in reality, the fact tables in this demonstration are actually flat, de-normalized, wide tables. Wide tables generally have better analytics performance in a modern data warehouse, according to Fivetran and others.
The marts layer’s models take various dependencies through joins on staging and intermediate models. The data model above, fct_sales
, has dependencies on multiple staging and intermediate models.

To materialize the marts layer’s four data models into tables, execute the command, dbt run --select marts
.

Once the command completes, we should observe four tables and nine views in the Redshift demo
database’s tickit_dbt
schema. Note how the dbt model for fct_sales
(shown above), with its Jinja templating and multiple CTEs have been compiled into the resulting table in Redshift, this is the real magic of dbt!

At this point, all of the project’s models have been compiled and created in the Redshift demo
database by dbt.
Analyses
The demonstration’s project also contains example analyses. dbt allows us to version control more analytical-oriented SQL files within our dbt project using the analyses
functionality of dbt. These analyses do not fit the fairly opinionated dbt model definition. We can compile the analyses SQL file using the dbt compile
command, then copy and paste the resulting SQL statements from the target/compiled/
subdirectory into our data warehouse’s query tool of choice.


Project Documentation
Using the dbt docs generate
command will automatically generate the project’s documentation website from the SQL and YAML files. Documentations can be generated and displayed from your dbt Cloud account or hosted locally.

Testing
According to dbt, “Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds, and snapshots). When you run dbt test
, dbt will tell you if each test in your project passes or fails.” The project contains over 50 tests, split between the _tickit__sources.yml
file and individual tests in the test/
directory. Typical dbt tests check for non-null and unique values, values within an expected numeric range, and values from a known list of strings. Any SELECT
statement written in SQL can be tested.
Execute the project’s tests using the dbt test
command. We can execute individual tests using the --select
optional argument, for example, dbt test --select assert_all_sale_amounts_are_positive
. We can also use the --threads
optional argument with most dbt commands, including dbt test
, increasing parallelism and reducing execution time. The example below uses 10 threads, the arbitrary maximum configured for the Amazon Redshift dbt
user.


Jobs
According to dbt, Jobs are a set of dbt commands that you want to run on a schedule. For example, dbt run
and dbt test
. Jobs can load packages, run tests, materialize models, check source freshness (dbt source freshness
), and regenerate documentation. Below, we have created a daily job to test, refresh, and document our project as the data is updated in the data lake.

Notifications
According to dbt, Setting up notifications in dbt Cloud will allow you to receive alerts via Email or a chosen Slack channel when a job run succeeds, fails, or is canceled.

The Slack notifications include run status, timings, and a link to open the job in dbt Cloud. Below, we see a notification regarding our project’s daily job run.

Exposures
Exposures are a recent addition to dbt. Exposures make it possible to define and describe a downstream use of our dbt project, such as in a dashboard, application, or data science pipeline. Below we see an example of an exposure describing a sales dashboard created in Amazon QuickSight.
The exposure YAML file shown above describes the Amazon QuickSight dashboard shown below.

Exposures work with dbt’s auto-documentation feature. dbt populates a dedicated page in the auto-generated documentation site with context relevant to data consumers.


Conclusion
In this post, we covered some of the basic functionality of dbt. We learned how dbt enables analysts to work more like software engineers. We also learned how dbt makes it easy to codify data models in SQL, to version control and manage data models as code with git, and collaborate on data models with other data team members.
Topics not explored in this post but critical to most large-scale dbt-managed production environments include advanced Jinja templating and macros, model freshness, orchestration, job scheduling, Continuous Integration and GitOps, notifications, environment variables, and incremental models. We will explore these additional dbt capabilities in future posts.
This blog represents my viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners. All diagrams and illustrations are the property of the author unless otherwise noted.
Serverless Analytics on AWS: Getting Started with Amazon EMR Serverless and Amazon MSK Serverless
Utilizing the recently released Amazon EMR Serverless and Amazon MSK Serverless for batch and streaming analytics with Apache Spark and Apache Kafka
Introduction
Amazon EMR Serverless
AWS recently announced the general availability (GA) of Amazon EMR Serverless on June 1, 2022. EMR Serverless is a new serverless deployment option in Amazon EMR, in addition to EMR on EC2, EMR on EKS, and EMR on AWS Outposts. EMR Serverless provides a serverless runtime environment that simplifies the operation of analytics applications that use the latest open source frameworks, such as Apache Spark and Apache Hive. According to AWS, with EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks.
Amazon MSK Serverless
Similarly, on April 28, 2022, AWS announced the general availability of Amazon MSK Serverless. According to AWS, Amazon MSK Serverless is a cluster type for Amazon MSK that makes it easy to run Apache Kafka without managing and scaling cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources, so you can use Apache Kafka on demand and only pay for the data you stream and retain.
Serverless Analytics
In the following post, we will learn how to use these two new, powerful, cost-effective, and easy-to-operate serverless technologies to perform batch and streaming analytics. The PySpark examples used in this post are similar to those featured in two earlier posts, which featured non-serverless alternatives Amazon EMR on EC2 and Amazon MSK: Getting Started with Spark Structured Streaming and Kafka on AWS using Amazon MSK and Amazon EMR and Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AWS using Amazon MSK and EMR.
Source Code
All the source code demonstrated in this post is open-source and available on GitHub.
git clone --depth 1 -b main \
https://github.com/garystafford/emr-msk-serverless-demo.git
Architecture
The post’s high-level architecture consists of an Amazon EMR Serverless Application, Amazon MSK Serverless Cluster, and Amazon EC2 Kafka client instance. To support these three resources, we will need two Amazon Virtual Private Clouds (VPCs), a minimum of three subnets, an AWS Internet Gateway (IGW) or equivalent, an Amazon S3 Bucket, multiple AWS Identity and Access Management (IAM) Roles and Policies, Security Groups, and Route Tables, and a VPC Gateway Endpoint for S3. All resources are constrained to a single AWS account and a single AWS Region, us-east-1
.

Prerequisites
As a prerequisite for this post, you will need to create the following resources:
- (1) Amazon EMR Serverless Application;
- (1) Amazon MSK Serverless Cluster;
- (1) Amazon S3 Bucket;
- (1) VPC Endpoint for S3;
- (3) Apache Kafka topics;
- PySpark applications, related JAR dependencies, and sample data files uploaded to Amazon S3 Bucket;
Let’s walk through each of these prerequisites.
Amazon EMR Serverless Application
Before continuing, I suggest familiarizing yourself with the AWS documentation for Amazon EMR Serverless, especially, What is Amazon EMR Serverless? Create a new EMR Serverless Application by following the AWS documentation, Getting started with Amazon EMR Serverless. The creation of the EMR Serverless Application includes the following resources:
- Amazon S3 bucket for storage of Spark resources;
- Amazon VPC with at least two private subnets and associated Security Group(s);
- EMR Serverless runtime AWS IAM Role and associated IAM Policy;
- Amazon EMR Serverless Application;
For this post, use the latest version of EMR available in the EMR Studio Serverless Application console, the newly released version 6.7.0, to create a Spark application.

Keep the default pre-initialized capacity, application limits, and application behavior settings.

Since we are connecting to MSK Serverless from EMR Serverless, we need to configure VPC access. Select the new VPC and at least two private subnets in different Availability Zones (AZs).

According to the documentation, the subnets selected for EMR Serverless must be private subnets. The associated route tables for the subnets should not contain direct routes to the Internet.


Amazon MSK Serverless Cluster
Similarly, before continuing, I suggest familiarizing yourself with the AWS documentation for Amazon MSK Serverless, especially MSK Serverless. Create a new MSK Serverless Cluster by following the AWS documentation, Getting started using MSK Serverless clusters. The creation of the MSK Serverless Cluster includes the following resources:
- AWS IAM Role and associated IAM Policy for the Amazon EC2 Kafka client instance;
- VPC with at least one public subnet and associated Security Group(s);
- Amazon EC2 instance used as Apache Kafka client, provisioned in the public subnet of the above VPC;
- Amazon MSK Serverless Cluster;

Associate the new MSK Serverless Cluster with the EMR Serverless Application’s VPC and two private subnets. Also, associate the cluster with the EC2-based Kafka client instance’s VPC and its public subnet.


According to the AWS documentation, Amazon MSK does not support all AZs. For example, I tried to use a subnet in us-east-1e
threw an error. If this happens, choose an alternative AZ.


VPC Endpoint for S3
To access the Spark resource in Amazon S3 from EMR Serverless running in the two private subnets, we need a VPC Endpoint for S3. Specifically, a Gateway Endpoint, which sends traffic to Amazon S3 or DynamoDB using private IP addresses. A gateway endpoint for Amazon S3 enables you to use private IP addresses to access Amazon S3 without exposure to the public Internet. EMR Serverless does not require public IP addresses, and you don’t need an internet gateway (IGW), a NAT device, or a virtual private gateway in your VPC to connect to S3.

Create the VPC Endpoint for S3 (Gateway Endpoint) and add the route table for the two EMR Serverless private subnets. You can add additional routes to that route table, such as VPC peering connections to data sources such as Amazon Redshift or Amazon RDS. However, do not add routes that provide direct Internet access.

Kafka Topics and Sample Messages
Once the MSK Serverless Cluster and EC2-based Kafka client instance are provisioned and running, create the three required Kafka topics using the EC2-based Kafka client instance. I recommend using AWS Systems Manager Session Manager to connect to the client instance as the ec2-user
user. Session Manager provides secure and auditable node management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys. Alternatively, you can SSH into the client instance.

Before creating the topics, use a utility like telnet
to confirm connectivity between the Kafka client and the MSK Serverless Cluster. Verifying connectivity will save you a lot of frustration with potential security and networking issues.
With MSK Serverless Cluster connectivity confirmed, create the three Kafka topics: topicA
, topicB
, and topicC
. I am using the default partitioning and replication settings from the AWS Getting Started Tutorial.
To create some quick sample data, we will copy and paste 250 messages from a file included in the GitHub project, sample_data/sales_messages.txt
, into topicA
. The messages are simple mock sales transactions.
Use the kafka-console-producer
Shell script to publish the messages to the Kafka topic. Use the kafka-console-consumer
Shell script to validate the messages made it to the topic by consuming a few messages.
The output should look similar to the following example.

Spark Resources in Amazon S3
To submit and run the five Spark Jobs included in the project, you will need to copy the following resources to your Amazon S3 bucket: (5) Apache Spark jobs, (5) related JAR dependencies, and (2) sample data files.
PySpark Applications
To start, copy the five PySpark applications to a scripts/
subdirectory within your Amazon S3 bucket.

Sample Data
Next, copy the two sample data files to a sample_data/
subdirectory within your Amazon S3 bucket. The large file contains 2,000 messages, while the small file contains 600 messages. These two files can be used interchangeably with the post’s final streaming example.

PySpark Dependencies
Lastly, the PySpark applications have a handful of JAR dependencies that must be available when the job runs, which are not on the EMR Serverless classpath by default. If you are unsure which JARs are already on the EMR Serverless classpath, you can check the Spark UI’s Environment tab’s Classpath Entries section. Accessing the Spark UI is demonstrated in the first PySpark application example, below.

It is critical to choose the correct version of each JAR dependency based on the version of libraries used with the EMR and MSK. Using the wrong version or inconsistent versions, especially Scala, can result in job failures. Specifically, we are targeting Spark 3.2.1 and Scala 2.12 (EMR v6.7.0: Amazon’s Spark 3.2.1, Scala 2.12.15, Amazon Corretto 8 version of OpenJDK), and Apache Kafka 2.8.1 (MSK Serverless: Kafka 2.8.1).
Download the seven JAR files locally, then copy them to a jars/
subdirectory within your Amazon S3 bucket.

PySpark Applications Examples
With the EMR Serverless Application, MSK Serverless Cluster, Kafka topics, and sample data created, and the Spark resources uploaded to Amazon S3, we are ready to explore four different Spark examples.
Example 1: Kafka Batch Aggregation to the Console
The first PySpark application, 01_example_console.py
, reads the same 250 sample sales messages from topicA
you published earlier, aggregates the messages, and writes the total sales and quantity of orders by country to the console (stdout).
There are no hard-coded values in any of the PySpark application examples. All required environment-specific variables, such as your MSK Serverless bootstrap server (host and port) and Amazon S3 bucket name, will be passed to the running Spark jobs as arguments from the spark-submit
command.
To submit your first PySpark job to the EMR Serverless Application, use the emr-serverless
API from the AWS CLI. You will need (4) values: 1) your EMR Serverless Application’s application-id
, 2) the ARN of your EMR Serverless Application’s execution IAM Role, 3) your MSK Serverless bootstrap server (host and port), and 4) the name of your Amazon S3 bucket containing the Spark resources.
Switching to the EMR Serverless Application console, you should see the new Spark job you just submitted in one of several job states.

You can click on the Spark job to get more details. Note the Script arguments and Spark properties passed in from the spark-submit
command.

From the Spark job details tab, access the Spark UI, aka Spark Web UI, from a button in the upper right corner of the screen. If you have experience with Spark, you are most likely familiar with the Spark Web UI to monitor and tune Spark jobs.

From the initial screen, the Spark History Server tab, click on the App ID. You can access an enormous amount of Spark-related information about your job and EMR environment from the Spark Web UI.



The Executors tab will give you access to the Spark job’s output. The output we are most interested in is the driver
executor’s stderr
and stdout
(first row of the second table, shown below).

The stderr
contains output related to the running Spark job. Below we see an example of Kafka consumer configuration values output to stderr
. Several of these values were passed in from the Spark job, including items such as kafka.bootstrap.servers
, security.protocol
, sasl.mechanism
, and sasl.jaas.config
.

The stdout
from the driver
executor contains the console output as directed from the Spark job. Below we see the successfully aggregated results of the first Spark job, output to stdout
.
Example 2: Kafka Batch Aggregation to CSV in S3
Although the console is useful for development and debugging, it is typically not used in Production. Instead, Spark typically sends results to S3 as CSV, JSON, Parquet, or Arvo formatted files, to Kafka, to a database, or to an API endpoint. The second PySpark application, 02_example_csv_s3.py
, reads the same 250 sample sales messages from topicA
you published earlier, aggregates the messages, and writes the total sales and quantity of orders by country to a CSV file in Amazon S3.
To submit your second PySpark job to the EMR Serverless Application, use the emr-serverless
API from the AWS CLI. Similar to the first example, you will need (4) values: 1) your EMR Serverless Application’s application-id
, 2) the ARN of your EMR Serverless Application’s execution IAM Role, 3) your MSK Serverless bootstrap server (host and port), and 4) the name of your Amazon S3 bucket containing the Spark resources.
If successful, the Spark job should create a single CSV file in the designated Amazon S3 key (directory path) and an empty _SUCCESS
indicator file. The presence of an empty _SUCCESS
file signifies that the save()
operation completed normally.

Below we see the expected pipe-delimited output from the second Spark job.
Example 3: Kafka Batch Aggregation to Kafka
The third PySpark application, 03_example_kafka.py
, reads the same 250 sample sales messages from topicA
you published earlier, aggregates the messages, and writes the total sales and quantity of orders by country to a second Kafka topic, topicB
. This job now has both read and write options.
To submit your next PySpark job to the EMR Serverless Application, use the emr-serverless
API from the AWS CLI. Similar to the first two examples, you will need (4) values: 1) your EMR Serverless Application’s application-id
, 2) the ARN of your EMR Serverless Application’s execution IAM Role, 3) your MSK Serverless bootstrap server (host and port), and 4) the name of your Amazon S3 bucket containing the Spark resources.
Once the job completes, you can confirm the results by returning to your EC2-based Kafka client. Use the same kafka-console-consumer
command you used previously to show messages from topicB
.
If the Spark job and the Kafka client command worked successfully, you should see aggregated messages similar to the example output below. Note we are not using keys with the Kafka messages, only values for these simple examples.

Example 4: Spark Structured Streaming
For our final example, we will switch from batch to streaming — from read
to readstream
and from write
to writestream
. Before continuing, I suggest reading the Structured Streaming Programming Guide.
In this example, we will demonstrate how to continuously measure a common business metric — real-time sales volumes. Imagine you are sell products globally and want to understand the relationship between the time of day and buying patterns in different geographic regions in real-time. For any given window of time — this 15-minute period, this hour, this day, or this week— you want to know the current sales volumes by country. You are not reviewing previous sales periods or examing running sales totals, but real-time sales during a sliding time window.
We will use two PySpark jobs running concurrently to simulate this metric. The first application, 04_stream_sales_to_kafka.py
, simulates streaming data by continuously writing messages to topicC
— 2,000 messages with a 0.5-second delay between messages. In my tests, the job ran for ~28–29 minutes.
Simultaneously, the PySpark application, 05_streaming_kafka.py
, continuously consumes the sales transaction messages from the same topic, topicC
. Then, Spark aggregates messages over a sliding event-time window and writes the results to the console.
To submit the two PySpark jobs to the EMR Serverless Application, use the emr-serverless
API from the AWS CLI. Again, you will need (4) values: 1) your EMR Serverless Application’s application-id
, 2) the ARN of your EMR Serverless Application’s execution IAM Role, 3) your MSK Serverless bootstrap server (host and port), and 4) the name of your Amazon S3 bucket containing the Spark resources.
Switching to the EMR Serverless Application console, you should see both Spark jobs you just submitted in one of several job states.

Using the Spark UI again, we can review the output from the second job, 05_streaming_kafka.py
.

With Spark Structured Streaming jobs, we have an extra tab in the Spark UI, Structured Streaming. This tab displays all running jobs with their latest [micro]batch number, the aggregate rate of data arriving, and the aggregate rate at which Spark is processing data. Unfortunately, with MSK Serverless, AWS doesn’t appear to allow access to the detailed streaming query statistics via the Run ID, which greatly reduces its value. You receive a 502 error when clicking on the Run ID hyperlink.

The output we are most interested in, again, is contained in the driver
executor’s stderr
and stdout
(first row of the second table, shown below).

Below we see sample output from stderr
. The output shows the results of a micro-batch. According to the Apache Spark documentation, internally, by default, Structured Streaming queries are processed using a micro-batch processing engine. The engine processes data streams as a series of small batch jobs, achieving end-to-end latencies as low as 100ms and exactly-once fault-tolerance guarantees.
The corresponding output to the micro-batch output above is shown below. We see the initial micro-batch results, starting with the first micro-batch before any messages are streamed to topicC
.
If you are familiar with Spark Structured Streaming, you are likely aware that these Spark jobs run continuously. In other words, the streaming jobs will not stop; they continually await more streaming data.

The first job, 04_stream_sales_to_kafka.py
, will run for ~28–29 minutes and stop with a status of Sucess
. However, the second job, 05_streaming_kafka.py
, the Spark Structured Streaming job, must be manually canceled.

Cleaning Up
You can delete your resources from the AWS Management Console or AWS CLI. However, to delete your Amazon S3 bucket, all objects (including all object versions and delete markers) in the bucket must be deleted before the bucket itself can be deleted.
Conclusion
In this post, we discovered how easy it is to adopt a serverless approach to Analytics on AWS. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks. With MSK Serverless, you can use Apache Kafka on demand and pay for the data you stream and retain. In addition, MSK Serverless automatically provisions and scales compute and storage resources. Given suitable analytics use cases, EMR Serverless with MSK Serverless will likely save you time, effort, and expense.
This blog represents my viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners. All diagrams and illustrations are the property of the author unless otherwise noted.
Utilizing In-memory Data Caching to Enhance the Performance of Data Lake-based Applications
Posted by Gary A. Stafford in Analytics, AWS, Big Data, Cloud, Java Development, Software Development, SQL on July 12, 2022
Significantly improve the performance and reduce the cost of data lake-based analytics applications using Amazon ElastiCache for Redis
Introduction
The recent post, Developing Spring Boot Applications for Querying Data Lakes on AWS using Amazon Athena, demonstrated how to develop a Cloud-native analytics application using Spring Boot. The application queried data in an Amazon S3-based data lake via an AWS Glue Data Catalog utilizing the Amazon Athena API.
Securely exposing data in a data lake using RESTful APIs can fulfill many data-consumer needs. However, access to that data can be significantly slower than access from a database or data warehouse. For example, in the previous post, we imported the OpenAPI v3 specification from the Spring Boot service into Postman. The API specification contained approximately 17 endpoints.

From my local development laptop, the Postman API test run times for all service endpoints took an average of 32.4 seconds. The Spring Boot service was running three Kubernetes pod replicas on Amazon EKS in the AWS US East (N. Virginia) Region.

Compare the data lake query result times to equivalent queries made against a minimally-sized Amazon RDS for PostgreSQL database instance containing the same data. The average run times for all PostgreSQL queries averaged 10.8 seconds from a similar Spring Boot service. Although not a precise benchmark, we can clearly see that access to the data in the Amazon S3-based data lake is substantially slower, approximately 3x slower, than that of the PostgreSQL database. Tuning the database would easily create an even greater disparity.

Caching for Data Lakes
According to AWS, the speed and throughput of a database can be the most impactful factor for overall application performance. Consequently, in-memory data caching can be one of the most effective strategies to improve overall application performance and reduce database costs. This same caching strategy can be applied to analytics applications built atop data lakes, as this post will demonstrate.

In-memory Caching
According to Hazelcast, memory caching (aka in-memory caching), often simply referred to as caching, is a technique in which computer applications temporarily store data in a computer’s main memory (e.g., RAM) to enable fast retrievals of that data. The RAM used for the temporary storage is known as the cache. As an application tries to read data, typically from a data storage system like a database, it checks to see if the desired record already exists in the cache. If it does, the application will read the data from the cache, thus eliminating the slower access to the database. If the desired record is not in the cache, then the application reads the record from the source. When it retrieves that data, it writes it to the cache so that when the application needs that same data in the future, it can quickly retrieve it from the cache.
Further, according to Hazelcast, as an application tries to read data, typically from a data storage system like a database, it checks to see if the desired record already exists in the cache. If it does, the application will read the data from the cache, thus eliminating the slower access to the database. If the desired record is not in the cache, then the application reads the record from the source. When it retrieves that data, it writes it to the cache so that when the application needs that same data in the future, it can quickly retrieve it from the cache.
Redis In-memory Data Store
According to their website, Redis is the open-source, in-memory data store used by millions of developers as a database, cache, streaming engine, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. In addition, Redis has built-in replication, Lua scripting, LRU eviction, transactions, and different levels of on-disk persistence and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
Amazon ElastiCache for Redis
According to AWS, Amazon ElastiCache for Redis, the fully-managed version of Redis, according to AWS, is a blazing fast in-memory data store that provides sub-millisecond latency to power internet-scale real-time applications. Redis applications can work seamlessly with ElastiCache for Redis without any code changes. ElastiCache for Redis combines the speed, simplicity, and versatility of open-source Redis with manageability, security, and scalability from AWS. As a result, Redis is an excellent choice for implementing a highly available in-memory cache to decrease data access latency, increase throughput, and ease the load on relational and NoSQL databases.
ElastiCache Performance Results
In the following post, we will add in-memory caching to the Spring Boot service introduced in the previous post. In preliminary tests with Amazon ElastiCache for Redis, the Spring Boot service delivered a 34x improvement in average response times. For example, test runs with the best-case scenario of a Redis cache hit ratio of 100% averaged 0.95 seconds compared to 32.4 seconds without Redis.

Source Code
All the source code and Docker and Kubernetes resources are open-source and available on GitHub.
git clone --depth 1 -b redis \
https://github.com/garystafford/athena-spring-app.git
In addition, a Docker image for the Redis-base Spring Boot service is available on Docker Hub. For this post, use the latest tag with the .redis
suffix.

Code Changes
The following code changes are required to the Spring Boot service to implement Spring Boot Cache with Redis.
Gradle Build
The gradle.build
file now implements two additional dependencies, Spring Boot’s spring-boot-starter-cache
and spring-boot-starter-data-redis
(lines 45–46).
Application Properties
The application properties file, application.yml
, has been modified for both the dev
and prod
Spring Profiles. The dev
Spring Profile expects Redis to be running on localhost
. Correspondingly, the project’s docker-compose.yml
file now includes a Redis container for local development. The time-to-live (TTL) for all Redis caches is arbitrarily set to one minute for dev
and five minutes for prod
. To increase application performance and reduce the cost of querying the data lake using Athena, increase Redis’s TTL. Note that increasing the TTL will reduce data freshness.
Athena Application Class
The AthenaApplication
class declaration is now decorated with Spring Framework’s EnableCaching
annotation (line 22). Additionally, two new Beans have been added (lines 58–68). Spring Redis provides an implementation for the Spring cache abstraction through the org.springframework.data.redis.cache
package. The RedisCacheManager
cache manager creates caches by default upon the first write. The RedisCacheConfiguration
cache configuration helps to customize RedisCache
behavior such as caching null values, cache key prefixes, and binary serialization.
POJO Data Model Classes
Spring Boot Redis caching uses Java serialization and deserialization. Therefore, all the POJO data model classes must implement Serializable
(line 14).
Service Classes
Each public method in the Service classes is now decorated with Spring Framework’s Cachable
annotation (lines 42 and 66). For example, the findById(int id)
method in the CategoryServiceImp
class is annotated with @Cacheable(value = "categories", key = "#id")
. The method’s key
parameter uses Spring Expression Language (SpEL) expression for computing the key dynamically. Default is null, meaning all method parameters are considered as a key unless a custom keyGenerator
has been configured. If no value is found in the Redis cache for the computed key, the target method will be invoked, and the returned value will be stored in the associated cache.
Controller Classes
There are no changes required to the Controller classes.
Amazon ElastiCache for Redis
Multiple options are available for creating an Amazon ElastiCache for Redis cluster, including cluster mode, multi-AZ option, auto-failover option, node type, number of replicas, number of shards, replicas per shard, Availability Zone placements, and encryption at rest and encryption in transit options. The results in this post are based on a minimally-configured Redis version 6.2.6 cluster, with one shard, two cache.r6g.large
nodes, and cluster mode, multi-AZ option, and auto-failover all disabled. In addition, encryption at rest and encryption in transit were also disabled. This cluster configuration is sufficient for development and testing, but not Production.


Testing the Cache
To test Amazon ElastiCache for Redis, we will use Postman again with the imported OpenAPI v3 specification. With all data evicted from existing Redis caches, the first time the Postman tests run, they cause the service’s target methods to be invoked and the returned data stored in the associated caches.

To confirm this caching behavior, use the Redis CLI’s --scan
option. To access the redis-cli
, I deployed a single Redis pod to Amazon EKS. The first time the --scan
command runs, we should get back an empty list of keys. After the first Postman test run, the same --scan
command should return a list of cached keys.

Use the Redis CLI’s MONITOR
option to further confirm data is being cached, as indicated by the set
command.

After the initial caching of data, use the Redis CLI’s MONITOR
option, again, to confirm the cache is being hit instead of calling the target methods, which would then invoke the Athena API. Rerunning the Postman tests, we should see get
commands as opposed to set
commands.

Lastly, to confirm the Spring Boot service is effectively using Redis to cache data, we can also check Amazon Athena’s Recent queries tab in the AWS Management Console. After repeated sequential test runs within the TTL window, we should only see one Athena query per endpoint.

Conclusion
In this brief follow-up to the recent post, Developing Spring Boot Applications for Querying Data Lakes on AWS using Amazon Athena, we saw how to substantially increase data lake application performance using Amazon ElastiCache for Redis. Although this caching technique is often associated with databases, it can also be effectively applied to data lake-based applications, as demonstrated in the post.
This blog represents my viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners. All diagrams and illustrations are property of the author unless otherwise noted.
Developing Spring Boot Applications for Querying Data Lakes on AWS using Amazon Athena
Posted by Gary A. Stafford in Analytics, AWS, Big Data, Cloud, Java Development, Software Development on June 26, 2022
Learn how to develop Cloud-native, RESTful Java services that query data in an AWS-based data lake using Amazon Athena’s API
Introduction
AWS provides a collection of fully-managed services that makes building and managing secure data lakes faster and easier, including AWS Lake Formation, AWS Glue, and Amazon S3. Additional analytics services such as Amazon EMR, AWS Glue Studio, and Amazon Redshift allow Data Scientists and Analysts to run high-performance queries on large volumes of semi-structured and structured data quickly and economically.
What is not always as obvious is how teams develop internal and external customer-facing analytics applications built on top of data lakes. For example, imagine sellers on an eCommerce platform, the scenario used in this post, want to make better marketing decisions regarding their products by analyzing sales trends and buyer preferences. Further, suppose the data required for the analysis must be aggregated from multiple systems and data sources; the ideal use case for a data lake.

salesbyseller
endpointIn this post, we will explore an example Java Spring Boot RESTful Web Service that allows end-users to query data stored in a data lake on AWS. The RESTful Web Service will access data stored as Apache Parquet in Amazon S3 through an AWS Glue Data Catalog using Amazon Athena. The service will use Spring Boot and the AWS SDK for Java to expose a secure, RESTful Application Programming Interface (API).

Amazon Athena is a serverless, interactive query service based on Presto, used to query data and analyze big data in Amazon S3 using standard SQL. Using Athena functionality exposed by the AWS SDK for Java and Athena API, the Spring Boot service will demonstrate how to access tables, views, prepared statements, and saved queries (aka named queries).

TL;DR
Do you want to explore the source code for this post’s Spring Boot service or deploy it to Kubernetes before reading the full article? All the source code, Docker, and Kubernetes resources are open-source and available on GitHub.
git clone --depth 1 -b main \ https://github.com/garystafford/athena-spring-app.git
A Docker image for the Spring Boot service is also available on Docker Hub.

Data Lake Data Source
There are endless data sources to build a demonstration data lake on AWS. This post uses the TICKIT sample database provided by AWS and designed for Amazon Redshift, AWS’s cloud data warehousing service. The database consists of seven tables. Two previous posts and associated videos, Building a Data Lake on AWS with Apache Airflow and Building a Data Lake on AWS, detail the setup of the data lake used in this post using AWS Glue and optionally Apache Airflow with Amazon MWAA.
Those two posts use the data lake pattern of segmenting data as bronze (aka raw), silver (aka refined), and gold (aka aggregated), popularized by Databricks. The data lake simulates a typical scenario where data originates from multiple sources, including an e-commerce platform, a CRM system, and a SaaS provider must be aggregated and analyzed.

Spring Projects with IntelliJ IDE
Although not a requirement, I used JetBrains IntelliJ IDEA 2022 (Ultimate Edition) to develop and test the post’s Spring Boot service. Bootstrapping Spring projects with IntelliJ is easy. Developers can quickly create a Spring project using the Spring Initializr plugin bundled with the IntelliJ.

The Spring Initializr plugin’s new project creation wizard is based on start.spring.io. The plugin allows you to quickly select the Spring dependencies you want to incorporate into your project.

Visual Studio Code
There are also several Spring extensions for the popular Visual Studio Code IDE, including Microsoft’s Spring Initializr Java Support extension.

Gradle
This post uses Gradle instead of Maven to develop, test, build, package, and deploy the Spring service. Based on the packages selected in the new project setup shown above, the Spring Initializr plugin’s new project creation wizard creates a build.gradle
file. Additional packages, such as Lombak, Micrometer, and Rest Assured, were added separately.
Amazon Corretto
The Spring boot service is developed for and compiled with the most recent version of Amazon Corretto 17. According to AWS, Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit (OpenJDK). Corretto comes with long-term support that includes performance enhancements and security fixes. Corretto is certified as compatible with the Java SE standard and is used internally at Amazon for many production services.
Source Code
Each API endpoint in the Spring Boot RESTful Web Service has a corresponding POJO data model class, service interface and service implementation class, and controller class. In addition, there are also common classes such as configuration, a client factory, and Athena-specific request/response methods. Lastly, there are additional class dependencies for views and prepared statements.

refined_tickit_public_category
tableThe project’s source code is arranged in a logical hierarchy by package and class type.
Amazon Athena Access
There are three standard methods for accessing Amazon Athena with the AWS SDK for Java: 1) the AthenaClient
service client, 2) the AthenaAsyncClient
service client for accessing Athena asynchronously, and 3) using the JDBC driver with the AWS SDK. The AthenaClient
and AthenaAsyncClient
service clients are both parts of the software.amazon.awssdk.services.athena
package. For simplicity, this post’s Spring Boot service uses the AthenaClient
service client instead of Java’s asynchronously programming model. AWS supplies basic code samples as part of their documentation as a starting point for writing Athena applications using the SDK. The code samples also use the AthenaClient
service client.
POJO-based Data Model Class
For each API endpoint in the Spring Boot RESTful Web Service, there is a corresponding Plain Old Java Object (POJO). According to Wikipedia, a POGO is an ordinary Java object, not bound by any particular restriction. The POJO class is similar to a JPA Entity, representing persistent data stored in a relational database. In this case, the POJO uses Lombok’s @Data
annotation. According to the documentation, this annotation generates getters for all fields, a useful toString
method, and hashCode
and equals
implementations that check all non-transient fields. It also generates setters for all non-final fields and a constructor.
Each POJO corresponds directly to a ‘silver’ table in the AWS Glue Data Catalog. For example, the Event
POJO corresponds to the refined_tickit_public_event
table in the tickit_demo
Data Catalog database. The POJO defines the Spring Boot service’s data model for data read from the corresponding AWS Glue Data Catalog table.

The Glue Data Catalog table is the interface between the Athena query and the underlying data stored in Amazon S3 object storage. The Athena query targets the table, which returns the underlying data from S3.

Service Class
Retrieving data from the data lake via AWS Glue, using Athena, is handled by a service class. For each API endpoint in the Spring Boot RESTful Web Service, there is a corresponding Service Interface and implementation class. The service implementation class uses Spring Framework’s @Service
annotation. According to the documentation, it indicates that an annotated class is a “Service,” initially defined by Domain-Driven Design (Evans, 2003) as “an operation offered as an interface that stands alone in the model, with no encapsulated state.” Most importantly for the Spring Boot service, this annotation serves as a specialization of @Component
, allowing for implementation classes to be autodetected through classpath scanning.
Using Spring’s common constructor-based Dependency Injection (DI) method (aka constructor injection), the service auto-wires an instance of the AthenaClientFactory
interface. Note that we are auto-wiring the service interface, not the service implementation, allowing us to wire in a different implementation if desired, such as for testing.
The service calls the AthenaClientFactory
class’s createClient()
method, which returns a connection to Amazon Athena using one of several available authentication methods. The authentication scheme will depend on where the service is deployed and how you want to securely connect to AWS. Some options include environment variables, local AWS profile, EC2 instance profile, or token from the web identity provider.
The service class transforms the payload returned by an instance of GetQueryResultsResponse
into an ordered collection (also known as a sequence), List<E>
, where E
represents a POJO. For example, with the data lake’srefined_tickit_public_event
table, the service returns a List<Event>
. This pattern repeats itself for tables, views, prepared statements, and named queries. Column data types can be transformed and formatted on the fly, new columns added, and existing columns skipped.
For each endpoint defined in the Controller class, for example, get()
, findAll()
, and FindById()
, there is a corresponding method in the Service class. Below, we see an example of the findAll()
method in the SalesByCategoryServiceImp
service class. This method corresponds to the identically named method in the SalesByCategoryController
controller class. Each of these service methods follows a similar pattern of constructing a dynamic Athena SQL query based on input parameters, which is passed to Athena through the AthenaClient
service client using an instance of GetQueryResultsRequest
.
Controller Class
Lastly, there is a corresponding Controller class for each API endpoint in the Spring Boot RESTful Web Service. The controller class uses Spring Framework’s @RestController
annotation. According to the documentation, this annotation is a convenience annotation that is itself annotated with @Controller
and @ResponseBody
. Types that carry this annotation are treated as controllers where @RequestMapping
methods assume @ResponseBody
semantics by default.
The controller class takes a dependency on the corresponding service class application component using constructor-based Dependency Injection (DI). Like the service example above, we are auto-wiring the service interface, not the service implementation.
The controller is responsible for serializing the ordered collection of POJOs into JSON and returning that JSON payload in the body of the HTTP response to the initial HTTP request.
Querying Views
In addition to querying AWS Glue Data Catalog tables (aka Athena tables), we also query views. According to the documentation, a view in Amazon Athena is a logical table, not a physical table. Therefore, the query that defines a view runs each time the view is referenced in a query.
For convenience, each time the Spring Boot service starts, the main AthenaApplication
class calls the View.java
class’s CreateView()
method to check for the existence of the view, view_tickit_sales_by_day_and_category
. If the view does not exist, it is created and becomes accessible to all application end-users. The view is queried through the service’s /salesbycategory
endpoint.

This confirm-or-create pattern is repeated for the prepared statement in the main AthenaApplication
class (detailed in the next section).