Earning the AWS Certified Machine Learning — Specialty (MLS-C01) Certification
Posted by Gary A. Stafford in AWS, Cloud, Machine Learning on November 14, 2022
Introduction
Recently, I earned the AWS Certified Machine Learning — Specialty (MLS-C01) Certification, my ninth AWS certification. Since a few colleagues asked me about my preparation, I thought I would share it with the community, without divulging any details of the exam, of course.

Prerequisite Experience
Several AWS certifications can be earned with minimal to no hands-on AWS experience, relying instead on excellent short-term memorization skills. Although you will have technically earned the certification, you will certainly not be competent to practice the particular discipline. Certification does not equal qualification.
In my opinion, the AWS Certified Machine Learning — Specialty certification exam is not one where simple memorization of study materials alone will guarantee a passing score. If you lack practical experience in data science, machine learning, basic statistics, or data analytics on AWS, you will be challenged to pass this exam, no matter how much you cram.
Consider Data Analytics Certification First
To prepare for the Machine Learning — Specialty exam, I would strongly suggest first earning the AWS Certified Data Analytics — Specialty certification. According to the Machine Learning — Specialty exam’s content outline, “Domain 1: Data Engineering” accounts for 20% of the exam’s score. Understanding the AWS analytics services, and how they integrate to form efficient data pipelines that feed your machine learning model training, is required for this portion of the exam’s questions. Preparing for the Data Analytics — Specialty certification will provide this adjacent domain knowledge:
- Amazon Athena
- Amazon EMR (formerly Amazon Elastic MapReduce)
- AWS IAM (Identity and Access Management)
- Amazon Kinesis Data Analytics, Data Firehose, Data Streams, Video Streams
- Amazon Redshift
- Amazon S3
- Amazon VPC
- AWS Data Pipeline
- AWS Glue Crawlers, Jobs, Data Catalog
- AWS Lambda
- AWS Step Functions

Study Materials
In my case, certification success was a result of practical experience, coursework, completing and reviewing the results of several practice exams, and taking lots of notes. The following is a list of the study materials I found most impactful:
Documentation
For my preparation, I reviewed the documentation for Amazon SageMaker and AWS’s other fully managed AI/ML services.
Carefully review the Choose an Algorithm section of the Amazon SageMaker Developer Guide. According to the exam’s content outline, “Domain 3: Modeling” accounts for 36% of the exam’s score. Understand 1) the recommended use cases for each of SageMaker’s built-in algorithms, 2) each algorithm’s required hyperparameters, and 3) the prescribed model evaluation metrics and tuning techniques. The built-in SageMaker algorithms covered most often in training materials include the following (a short training sketch follows this list):
- Tabular
  - XGBoost (eXtreme Gradient Boosting)
  - Linear Learner
  - K-Nearest Neighbors (KNN)
  - Factorization Machines
  - Object2Vec
- Vision
  - Image Classification
  - Object Detection
  - Semantic Segmentation
- Clustering
  - K-Means
- Time-Series Forecast
  - DeepAR
- Text Classification & Embedding
  - BlazingText
- Text Transformation
  - Sequence-to-Sequence (Seq2Seq)
- Text Topic Modeling
  - Neural Topic Modeling (NTM)
  - Latent Dirichlet Allocation (LDA)
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
- Anomaly Detection
  - Random Cut Forest (RCF)
  - IP Insights
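As promised above, here is a minimal training sketch using the SageMaker Python SDK and the built-in XGBoost algorithm. It is illustrative only: the S3 bucket, IAM role, data paths, and hyperparameter values are placeholders, not prescribed exam answers.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role
bucket = "my-ml-bucket"  # placeholder S3 bucket

# Resolve the regional registry path of the built-in XGBoost container
xgboost_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/models/",
    sagemaker_session=session,
)

# Illustrative hyperparameters for a binary classification problem
estimator.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
)

# The built-in XGBoost algorithm expects CSV (or libsvm) data with the label in the first column
estimator.fit({
    "train": TrainingInput(f"s3://{bucket}/train/", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/validation/", content_type="text/csv"),
})
```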
AWS also publishes documentation on Read the Docs. The SageMaker documentation there has an Algorithms section that is especially helpful when preparing for the Machine Learning — Specialty exam, covering image processing, text processing, time-series processing, supervised learning, unsupervised learning, and feature engineering algorithms.
Along with algorithms, review SageMaker’s Deploy Models for Inference documentation. According to the exam’s content outline, “Domain 4: Machine Learning Implementation and Operations” accounts for 20% of the exam’s score. Understand SageMaker’s options for model serving, model versioning, deployment strategies, and endpoint monitoring.
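To make the hosting options more concrete, below is a minimal, hedged sketch of deploying a previously trained model to a real-time SageMaker endpoint with the SageMaker Python SDK. The model artifact path, IAM role, endpoint name, and instance type are placeholders, and a production deployment would also add data capture, Model Monitor, autoscaling, and an appropriate deployment strategy.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role

# Wrap a previously trained model artifact (placeholder S3 path) in the built-in XGBoost serving container
model = Model(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    model_data="s3://my-ml-bucket/models/xgboost-demo/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Create a real-time HTTPS endpoint backed by a single instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="xgboost-demo-endpoint",  # placeholder endpoint name
    serializer=CSVSerializer(),
)

# Invoke the endpoint with a single CSV record of feature values
print(predictor.predict("0.5,1.2,3.4,0.0"))

# Delete the endpoint when finished to stop incurring charges
predictor.delete_endpoint()
```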
Review the AWS fully managed AI/ML services Developer Guide documentation for the following services:
- Amazon Augmented AI
- Amazon CodeGuru
- Amazon Comprehend
- Amazon Forecast
- Amazon Fraud Detector
- Amazon Kendra
- Amazon Lex
- Amazon Personalize
- Amazon Polly
- Amazon Rekognition
- Amazon Textract
- Amazon Transcribe
- Amazon Translate
Understand the use cases for each of these services and, most critically, how these managed services can be combined to create more complex AI/ML solutions. For example, you might build a near-real-time speech-to-speech translator with Amazon Transcribe, Amazon Translate, and Amazon Polly.
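As a rough, simplified illustration of that speech-to-speech pattern (not a production design), the boto3 sketch below chains the three services. It uses Amazon Transcribe’s batch API rather than its streaming API, so it is not truly near-real-time, and the bucket, audio key, job name, language pair, and voice are placeholder choices.

```python
import json
import time
import urllib.request

import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")
polly = boto3.client("polly")

# 1. Transcribe English speech already stored in S3 (placeholder bucket/key)
transcribe.start_transcription_job(
    TranscriptionJobName="speech-to-speech-demo",
    Media={"MediaFileUri": "s3://my-ml-bucket/audio/input.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the batch transcription job finishes
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="speech-to-speech-demo")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

if status == "FAILED":
    raise RuntimeError("Transcription job failed")

# Download the transcript JSON and pull out the transcribed text
transcript_uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
with urllib.request.urlopen(transcript_uri) as response:
    transcript_text = json.load(response)["results"]["transcripts"][0]["transcript"]

# 2. Translate the English transcript to Spanish
translated = translate.translate_text(
    Text=transcript_text,
    SourceLanguageCode="en",
    TargetLanguageCode="es",
)["TranslatedText"]

# 3. Synthesize the Spanish text back to speech
speech = polly.synthesize_speech(Text=translated, OutputFormat="mp3", VoiceId="Lupe")
with open("output.mp3", "wb") as audio_file:
    audio_file.write(speech["AudioStream"].read())
```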
Online Courses
For my preparation, I completed three Udemy courses and a handful of Whizlabs practice tests. Most of these online courses regularly go on sale and can be purchased for $25 or less:
- AWS Certified Machine Learning Specialty 2022 — Hands On!, by Frank Kane and Stephane Maarek. Both Frank and Stephane are well-known and respected trainers across the industry. I recommend reviewing the algorithm, model evaluation, and high-level ML services sections (Sections 5 and 6) more than once.
- AWS Certified Machine Learning Specialty (MLS-C01), by Chandra Lingam. Don’t get caught up in the nitty-gritty details of the Python code; focus on the higher-level machine learning principles. This course also contains a full-length practice exam.
- AWS Certified Machine Learning Specialty: 3 PRACTICE EXAMS, by Abhishek Singh. Nothing beats taking full-length practice exams and learning from your mistakes.
- Whizlabs’ AWS Certified Machine Learning Specialty Practice Tests. I completed a few of Whizlabs’ smaller practice exams, but, with limited time, I chose to focus on Udemy’s full-length practice tests. Some of Whizlabs’ questions seemed off-topic compared to the exam outline and the other training materials I reviewed.

Books
For my preparation, I read or re-read three books, two from Packt and one from O’Reilly:
- AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide, by Somanath Nanda and Weslley Moura (Packt Publishing). I recommend this one if you only have time to read a single book.
- Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, and Peter Gedeck (O’Reilly Media). According to the University of San Diego, “Statistics (or statistical analysis) is core to every machine learning algorithm.” This book covers many of the core statistical concepts behind machine learning that appear on the exam (a short worked example follows this book list):
- BLEU
- Classification metrics: Precision-Recall Curve, ROC Curve, AUC
- Confusion Matrix: TP, FP, TN, FN, Accuracy, Precision, Recall (Sensitivity), Specificity, F1
- Correlated variables, Multicollinearity
- Distributions: Normal (Gaussian or “bell curve”), Bernoulli, Binomial, Poisson
- Elbow Method
- Ensemble Learning: Bagging, Boosting
- Euclidean Distance
- K-Fold Cross-Validation
- L1/L2 Regularization (Lasso and Ridge; alpha/lambda penalty terms)
- Overfitting, Underfitting, High Bias, High Variance, Bias-Variance Tradeoff
- Plots: Histograms, Boxplots, Scatterplots
- Regression metrics: MAE, MSE, RMSE, R-squared, Adjusted R-squared
- Residuals
- SMOTE
- Standard Deviation, Three-Sigma/Empirical/68–95–99.7 Rule
- Z-score
- Python Machine Learning — Third Edition, by Sebastian Raschka and Vahid Mirjalili (Packt Publishing). Note that this book dives much deeper into the low-level statistical underpinnings of machine learning than is required for the exam, based on the exam outline. Again, don’t get caught up in the nitty-gritty details of Python; focus on the higher-level machine learning principles.
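As referenced in the Practical Statistics entry above, here is the short worked example of the confusion-matrix metrics; the counts are invented purely for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, tn, fn = 80, 10, 95, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)            # overall fraction correct
precision = tp / (tp + fp)                            # of predicted positives, how many were right
recall = tp / (tp + fn)                               # of actual positives, how many were found (sensitivity)
specificity = tn / (tn + fp)                          # of actual negatives, how many were found
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:    {accuracy:.3f}")    # 0.875
print(f"Precision:   {precision:.3f}")   # 0.889
print(f"Recall:      {recall:.3f}")      # 0.842
print(f"Specificity: {specificity:.3f}") # 0.905
print(f"F1 score:    {f1:.3f}")          # 0.865
```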
Scheduling the Exam
One last tip regarding when to take your exam. I have taken 15 AWS exams across nine AWS certifications and several recertifications. Although the Certified Machine Learning — Specialty exam is difficult, I found that changing the time I sat the exam greatly reduced my stress level. In the past, I took time off on a workday to complete exams, either in person or at home using online proctoring, and found myself preparing for the exam while frequently being interrupted by work-related items. For this exam, I chose online proctoring and took my exam at 6:00 AM on a Sunday morning. I was up early, fresh, and full of energy, with no work- or family-related interruptions, no lawnmowers, barking dogs, or garbage trucks rumbling by, and no Internet bandwidth issues. I was done by 9:00 AM and eating breakfast with the family.
This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
Exploring Popular Open-source Stream Processing Technologies: Part 2 of 2
Posted by Gary A. Stafford in Analytics, Big Data, Java Development, Python, Software Development, SQL on September 26, 2022
A brief demonstration of Apache Spark Structured Streaming, Apache Kafka Streams, Apache Flink, and Apache Pinot with Apache Superset
Introduction
According to TechTarget, “Stream processing is a data management technique that involves ingesting a continuous data stream to quickly analyze, filter, transform or enhance the data in real-time. Once processed, the data is passed off to an application, data store, or another stream processing engine.” Confluent, a market leader in fully managed Apache Kafka, defines stream processing as “a software paradigm that ingests, processes, and manages continuous streams of data while they’re still in motion.”
This two-part post series and forthcoming video explore four popular open-source software (OSS) stream processing projects: Apache Spark Structured Streaming, Apache Kafka Streams, Apache Flink, and Apache Pinot.

This post uses the open-source projects, making it easier to follow along with the demonstration and keeping costs to a minimum. However, you could easily swap the open-source projects for your preferred SaaS, CSP, or COSS service offerings.
Part Two
We will continue our exploration in part two of this two-part post, covering Apache Flink and Apache Pinot. In addition, we will incorporate Apache Superset into the demonstration to visualize the real-time results of our stream processing pipelines as a dashboard.
Demonstration #3: Apache Flink
In the third demonstration of four, we will examine Apache Flink. For this part of the post, we will also use the third of the three GitHub repository projects, `flink-kafka-demo`. The project contains a Flink application written in Java, which performs stream processing, incremental aggregation, and multi-stream joins.

New Streaming Stack
To get started, we need to replace the first streaming Docker Swarm stack, deployed in part one, with the second streaming Docker Swarm stack. The second stack contains Apache Kafka, Apache ZooKeeper, Apache Flink, Apache Pinot, Apache Superset, UI for Apache Kafka, and Project Jupyter (JupyterLab).
https://programmaticponderings.wordpress.com/media/601efca17604c3a467a4200e93d7d3ff
The stack will take a few minutes to deploy fully. When complete, there should be ten containers running in the stack.

Flink Application
The Flink application has two entry classes. The first class, `RunningTotals`, performs the same aggregation function as the previous KStreams demo.
The second class, `JoinStreams`, joins the stream of data from the `demo.purchases` topic and the `demo.products` topic, processing and combining them, in real-time, into an enriched transaction and publishing the results to a new topic, `demo.purchases.enriched`.
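The project’s implementation is written in Java using Flink’s DataStream API. Purely as an illustration of the same enrichment-join pattern, and not the project’s actual code, here is a minimal PyFlink (Flink SQL) sketch; the column names, JSON schemas, consumer group, and broker address (kafka:29092) are hypothetical, and it assumes the Flink SQL Kafka connector JAR is available to the Table API environment.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over the demo.purchases topic (hypothetical schema)
t_env.execute_sql("""
    CREATE TABLE purchases (
        transaction_time TIMESTAMP(3),
        product_id STRING,
        quantity INT,
        total_purchase DECIMAL(10, 2)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'demo.purchases',
        'properties.bootstrap.servers' = 'kafka:29092',
        'properties.group.id' = 'flink-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Source table over the demo.products topic (hypothetical schema)
t_env.execute_sql("""
    CREATE TABLE products (
        product_id STRING,
        product_name STRING,
        category STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'demo.products',
        'properties.bootstrap.servers' = 'kafka:29092',
        'properties.group.id' = 'flink-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink table for the enriched transactions
t_env.execute_sql("""
    CREATE TABLE purchases_enriched (
        transaction_time TIMESTAMP(3),
        product_id STRING,
        product_name STRING,
        category STRING,
        quantity INT,
        total_purchase DECIMAL(10, 2)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'demo.purchases.enriched',
        'properties.bootstrap.servers' = 'kafka:29092',
        'format' = 'json'
    )
""")

# Continuously join purchases with product metadata and publish the enriched results
t_env.execute_sql("""
    INSERT INTO purchases_enriched
    SELECT p.transaction_time, p.product_id, pr.product_name, pr.category,
           p.quantity, p.total_purchase
    FROM purchases AS p
    JOIN products AS pr ON p.product_id = pr.product_id
""").wait()
```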
The resulting enriched purchases messages look similar to the following:
Running the Flink Job
To run the Flink application, we must first compile it into an uber JAR.
We can copy the JAR into the Flink container or upload it through the Apache Flink Dashboard, a browser-based UI. For this demonstration, we will upload it through the Apache Flink Dashboard, accessible on port 8081.
The project’s `build.gradle` file has preset the Main class (Flink’s Entry class) to `org.example.JoinStreams`. Optionally, to run the Running Totals demo, we could change the `build.gradle` file and recompile, or simply change Flink’s Entry class to `org.example.RunningTotals`.

Before running the Flink job, restart the sales generator in the background (`nohup python3 ./producer.py &`) to generate a new stream of data. Then start the Flink job.

To confirm the Flink application is running, we can check the contents of the new `demo.purchases.enriched` topic using the Kafka CLI.
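If you prefer Python to the Kafka CLI, a short kafka-python consumer like the sketch below is another way to peek at the topic; the bootstrap server address is an assumption that depends on how the stack exposes Kafka, and it assumes the messages are JSON-encoded.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume the enriched topic from the beginning (broker address is an assumption)
consumer = KafkaConsumer(
    "demo.purchases.enriched",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda value: json.loads(value.decode("utf-8")),
    consumer_timeout_ms=10000,  # stop iterating after 10 s with no new messages
)

for message in consumer:
    print(message.value)
```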