Programmatic Ponderings

film_id	title	release_year	film_language	rating	categories	actor_array	rental_duration	length_minutes	replacement_cost	rental_rate
389	Gunfighter Mussolini	2006	English	PG-13	{Sports}	{"Audrey Olivier","Judy Dean","Scarlett Damon","Russell Close"}	3	127	9.99	2.99
581	Minority Kiss	2006	English	G	{Music}	{"Vivien Basinger"}	4	59	16.99	0.99
598	Mosquito Armageddon	2006	English	G	{Sports}	{"Goldie Brody","Kirk Jovovich","Nick Stallone","Reese West"}	6	57	22.99	0.99
943	Villain Desperate	2006	English	PG-13	{Documentary}	{"Dustin Tautou","Cary Mcconaughey"}	4	76	27.99	4.99
490	Jumanji Blade	2006	English	G	{New}	{"Jennifer Davis","Bob Fawcett","Nick Stallone","Gary Phoenix","Mena Temple","Jim Mostel"}	4	121	13.99	2.99
243	Doors President	2006	English	NC-17	{Animation}	{"Karl Berry","Lucille Tracy","Natalie Hopkins","Christian Akroyd","Sylvester Dern","Gene Hopkins","Ed Mansfield","Kim Allen","Reese West"}	3	49	22.99	4.99
40	Army Flintstones	2006	English	R	{Documentary}	{"Ed Chase","Cary Mcconaughey","Mae Hoffman","Gene Willis","Penelope Cronyn","Matthew Carrey","Russell Close"}	4	148	22.99	0.99
317	Fireball Philadelphia	2006	English	PG	{Comedy}	{"Val Bolger","Jude Cruise","Adam Grant","James Pitt","Frances Tomei"}	4	148	25.99	0.99
17	Alone Trip	2006	English	R	{Music}	{"Ed Chase","Karl Berry","Uma Wood","Woody Jolie","Spencer Depp","Chris Depp","Laurence Bullock","Renee Ball"}	3	82	14.99	0.99
195	Crowds Telemark	2006	English	R	{Sci-Fi}	{"Matthew Johansson","Anne Cronyn","Jeff Silverstone","Matthew Carrey"}	3	112	16.99	4.99

id	date	patient	code	description	reasoncode	reasondescription
714fd61a-f9fd-43ff-87b9-3cc45a3f1e53	2014-01-09	33f33990-ae8b-4be8-938f-e47ad473abfe	185345009	Encounter for symptom	444814009	Viral sinusitis (disorder)
23e07532-8b96-4d05-b14e-d4c5a5288ed2	2014-08-18	33f33990-ae8b-4be8-938f-e47ad473abfe	185349003	Outpatient Encounter
45044100-aaba-4209-8ad1-15383c76842d	2015-07-12	33f33990-ae8b-4be8-938f-e47ad473abfe	185345009	Encounter for symptom	36971009	Sinusitis (disorder)
ffdddbfb-35e8-4a74-a801-89e97feed2f3	2014-08-12	36d131ee-dd5b-4acb-acbe-19961c32c099	185345009	Encounter for symptom	444814009	Viral sinusitis (disorder)
352d1693-591a-4615-9b1b-f145648f49cc	2016-05-25	36d131ee-dd5b-4acb-acbe-19961c32c099	185349003	Outpatient Encounter
4620bd2f-8010-46a9-82ab-8f25eb621c37	2016-10-07	36d131ee-dd5b-4acb-acbe-19961c32c099	185345009	Encounter for symptom	195662009	Acute viral pharyngitis (disorder)
815494d8-2570-4918-a8de-fd4000d8100f	2010-08-02	660bec03-9e58-47f2-98b9-2f1c564f3838	698314001	Consultation for treatment
67ec5c2d-f41e-4538-adbe-8c06c71ddc35	2010-11-22	660bec03-9e58-47f2-98b9-2f1c564f3838	170258001	Outpatient Encounter
dbe481ce-b961-4f43-ac0a-07fa8cfa8bdd	2012-11-21	660bec03-9e58-47f2-98b9-2f1c564f3838	50849002	Emergency room admission
b5f1ab7e-5e67-4070-bcf0-52451eb20551	2013-12-04	660bec03-9e58-47f2-98b9-2f1c564f3838	185345009	Encounter for symptom	10509002	Acute bronchitis (disorder)

date	patient	encounter	code	description	value	units
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8302-2	Body Height	175.76	cm
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	29463-7	Body Weight	56.51	kg
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	39156-5	Body Mass Index	18.29	kg/m2
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8480-6	Systolic Blood Pressure	119.0	mmHg
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8462-4	Diastolic Blood Pressure	77.0	mmHg
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	8302-2	Body Height	177.25	cm
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	29463-7	Body Weight	59.87	kg
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	39156-5	Body Mass Index	19.05	kg/m2
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	8480-6	Systolic Blood Pressure	113.0	mmHg
2012-03-26	36d131ee-dd5b-4acb-acbe-19961c32c099	296a1fd4-56de-451c-a5fe-b50f9a18472d	8302-2	Body Height	174.17	cm

start	stop	patient	encounter	code	description
2012-09-05	2012-10-16	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	05a6ef43-d690-455e-ab2f-1ea19d902274	44465007	Sprain of ankle
2014-09-08	2014-09-28	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	1cdcbe46-caaf-4b3f-b58c-9ca9ccb13013	283371005	Laceration of forearm
2014-11-28	2014-12-13	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	b222e257-98da-4a1b-a46c-45d5ad01bbdc	195662009	Acute viral pharyngitis (disorder)
1980-01-09		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	40055000	Chronic sinusitis (disorder)
1989-06-25		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	201834006	Localized primary osteoarthritis of the hand
1996-01-07		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	196416002	Impacted molars
2016-02-07		01858c8d-f81c-4a95-ab4f-bd79fb62b284	748cda45-c267-46b2-b00d-3b405a44094e	15777000	Prediabetes
2016-04-27	2016-05-20	01858c8d-f81c-4a95-ab4f-bd79fb62b284	a64734f1-5b21-4a59-b2e8-ebfdb9058f8b	444814009	Viral sinusitis (disorder)
2014-02-06	2014-02-19	d32e9ad2-4ea1-4bb9-925d-c00fe85851ae	c64d3637-8922-4531-bba5-f3051ece6354	43878008	Streptococcal sore throat (disorder)
1982-05-18		08858d24-52f2-41dd-9fe9-cbf1f77b28b2	3fff3d52-a769-475f-b01b-12622f4fee17	368581000119106	Neuropathy due to type 2 diabetes mellitus (disorder)

	# streaming data generator, Apache Spark and Apache Pinot examples
	git clone --depth 1 -b main \
	https://github.com/garystafford/streaming-sales-generator.git

	# Apache Flink examples
	git clone --depth 1 -b main \
	https://github.com/garystafford/flink-kafka-demo.git

	# Kafka Streams examples
	git clone --depth 1 -b main \
	https://github.com/garystafford/kstreams-kafka-demo.git

	# cd into project
	cd streaming-sales-generator/

	# initialize swarm stack – 1x only
	docker swarm init

	# optional: delete previous streaming-stack
	docker stack rm streaming-stack

	# deploy first streaming-stack
	docker stack deploy streaming-stack --compose-file docker/spark-kstreams-stack.yml

	# observe the deployment's progress
	docker stack services streaming-stack

	[SALES]
	# minimum sales frequency in seconds (debug with 1, typical min. 120)
	min_sale_freq = 2

	# maximum sales frequency in seconds (debug with 3, typical max. 300)
	max_sale_freq = 5

	# number of transactions to generate
	number_of_sales = 1000
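
As a rough illustration of how these settings might be consumed, the sketch below parses the same `[SALES]` section with Python's standard `configparser` and draws a random delay between sales. The helper names `load_sales_config` and `next_sale_delay`, and the inlined config string, are assumptions made for this example; the generator in the repository may read its configuration differently.

```python
import configparser
import random

# Same keys as the [SALES] section above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
[SALES]
min_sale_freq = 2
max_sale_freq = 5
number_of_sales = 1000
"""


def load_sales_config(text: str) -> dict:
    """Parse the [SALES] section into a dict of integers."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["SALES"]
    return {
        "min_sale_freq": section.getint("min_sale_freq"),
        "max_sale_freq": section.getint("max_sale_freq"),
        "number_of_sales": section.getint("number_of_sales"),
    }


def next_sale_delay(config: dict, rng: random.Random) -> int:
    """Hypothetical helper: pick a delay in seconds within the configured range (inclusive)."""
    return rng.randint(config["min_sale_freq"], config["max_sale_freq"])


if __name__ == "__main__":
    config = load_sales_config(CONFIG_TEXT)
    rng = random.Random(42)  # seeded only to make the demo repeatable
    print(config)
    print([next_sale_delay(config, rng) for _ in range(5)])
```

If delays were drawn uniformly like this, 1,000 sales at 2 to 5 seconds apart would take roughly an hour to generate, which is worth keeping in mind when choosing debug values.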

	# install required python packages (1x)
	python3 -m pip install kafka-python

	cd sales_generator/

	# run in foreground
	python3 ./producer.py

	# better option, run as background process
	nohup python3 ./producer.py &

	# confirm process is running
	ps -u

	# establish an interactive session with the kafka container
	KAFKA_CONTAINER=$(docker container ls --filter name=streaming-stack_kafka.1 --format "{{.ID}}")
	docker exec -it ${KAFKA_CONTAINER} bash

	# set environment variables used by jobs
	export BOOTSTRAP_SERVERS="localhost:9092"
	export TOPIC_PRODUCTS="demo.products"
	export TOPIC_PURCHASES="demo.purchases"
	export TOPIC_INVENTORIES="demo.inventories"

	# list topics
	kafka-topics.sh --list --bootstrap-server $BOOTSTRAP_SERVERS

	# read topics from beginning
	kafka-console-consumer.sh \
	--topic $TOPIC_PRODUCTS --from-beginning \
	--bootstrap-server $BOOTSTRAP_SERVERS

	kafka-console-consumer.sh \
	--topic $TOPIC_PURCHASES --from-beginning \
	--bootstrap-server $BOOTSTRAP_SERVERS

	kafka-console-consumer.sh \
	--topic $TOPIC_INVENTORIES --from-beginning \
	--bootstrap-server $BOOTSTRAP_SERVERS

	# establish an interactive session with the spark container
	SPARK_CONTAINER=$(docker container ls --filter name=streaming-stack_spark.1 --format "{{.ID}}")
	docker exec -it -u 0 ${SPARK_CONTAINER} bash

	# update package lists and install wget and vim
	apt-get update && apt-get install wget vim -y

	# install required job dependencies
	wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar
	wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.3.1/kafka-clients-3.3.1.jar
	wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.3.1/spark-sql-kafka-0-10_2.12-3.3.1.jar
	wget https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.3.1/spark-token-provider-kafka-0-10_2.12-3.3.1.jar
	mv *.jar /opt/bitnami/spark/jars/

	exit

	ds_sales = (
	    # parse the JSON message value into columns
	    df_sales.selectExpr("CAST(value AS STRING)")
	    .select(F.from_json("value", schema=schema).alias("data"))
	    .select("data.*")
	    # window and window_agg are window specs defined elsewhere in the job
	    .withColumn("row", F.row_number().over(window))
	    .withColumn("quantity", F.sum(F.col("quantity")).over(window_agg))
	    .withColumn("sales", F.sum(F.col("total_purchase")).over(window_agg))
	    # keep only the first row of each window partition
	    .filter(F.col("row") == 1)
	    .drop("row")
	    .select(
	        "product_id",
	        F.format_number("sales", 2).alias("sales"),
	        F.format_number("quantity", 0).alias("quantity"),
	    )
	    .coalesce(1)
	    # format_number returns strings, so strip commas and sort numerically
	    .orderBy(F.regexp_replace("sales", ",", "").cast("float"), ascending=False)
	    .write.format("console")
	    .option("numRows", 25)
	    .option("truncate", False)
	    .save()
	)
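
To make the intent of the batch query easier to follow, here is a plain-Python sketch of a comparable aggregation: total sales and quantity per product, sorted by sales in descending order. It assumes the `window` and `window_agg` specs above (not shown in the snippet) partition by `product_id`. The sample records are invented for illustration; only the field names `product_id`, `quantity`, and `total_purchase` come from the query.

```python
from collections import defaultdict

# Invented sample purchases; only the field names mirror the Spark query.
purchases = [
    {"product_id": "A", "quantity": 2, "total_purchase": 9.98},
    {"product_id": "B", "quantity": 1, "total_purchase": 5.49},
    {"product_id": "A", "quantity": 1, "total_purchase": 4.99},
    {"product_id": "C", "quantity": 3, "total_purchase": 17.97},
]


def top_products(records):
    """Sum quantity and total_purchase per product_id, highest sales first."""
    totals = defaultdict(lambda: {"sales": 0.0, "quantity": 0})
    for rec in records:
        totals[rec["product_id"]]["sales"] += rec["total_purchase"]
        totals[rec["product_id"]]["quantity"] += rec["quantity"]
    rows = [
        (pid, round(agg["sales"], 2), agg["quantity"])
        for pid, agg in totals.items()
    ]
    # Sort on the numeric value rather than on a formatted string.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for product_id, sales, quantity in top_products(purchases):
        print(f"{product_id}  {sales:,.2f}  {quantity:,}")
```

Sorting before formatting avoids the comma-stripping step the Spark job needs, since there the `sales` column has already been turned into a formatted string by the time it is ordered.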

	ds_sales = (
	    # parse the JSON message value into columns
	    df_sales.selectExpr("CAST(value AS STRING)")
	    .select(F.from_json("value", schema=schema).alias("data"))
	    .select("data.*")
	    # tolerate events that arrive up to 10 minutes late
	    .withWatermark("transaction_time", "10 minutes")
	    # 10-minute windows that slide every 5 minutes
	    .groupBy("product_id", F.window("transaction_time", "10 minutes", "5 minutes"))
	    .agg(F.sum("total_purchase"), F.sum("quantity"))
	    .orderBy(F.col("window").desc(), F.col("sum(total_purchase)").desc())
	    .select(
	        "product_id",
	        F.format_number("sum(total_purchase)", 2).alias("sales"),
	        F.format_number("sum(quantity)", 0).alias("drinks"),
	        "window.start",
	        "window.end",
	    )
	    .coalesce(1)
	    # write the complete result table to the console once a minute
	    .writeStream.queryName("streaming_to_console")
	    .trigger(processingTime="1 minute")
	    .outputMode("complete")
	    .format("console")
	    .option("numRows", 10)
	    .option("truncate", False)
	    .start()
	)

	ds_sales.awaitTermination()
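
Because the window is 10 minutes long but slides every 5 minutes, each event is counted in two overlapping windows. The standalone sketch below shows that assignment in plain Python. It assumes windows are aligned to the Unix epoch, which is Spark's default when no `startTime` offset is given; the function name `sliding_windows` is made up for this example and is not part of any Spark API.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)  # window duration used in the query
SLIDE = timedelta(minutes=5)    # slide interval used in the query
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(event_time, window=WINDOW, slide=SLIDE):
    """Return the (start, end) pair of every epoch-aligned window containing event_time.

    Windows are treated as half-open intervals: start <= event_time < end.
    """
    # Start of the latest window that begins at or before the event.
    offset = (event_time - EPOCH) % slide
    start = event_time - offset
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


if __name__ == "__main__":
    event = datetime(2023, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
    for start, end in sliding_windows(event):
        print(start.time(), "to", end.time())
```

A purchase at 12:07:30 therefore contributes to both the 12:00 to 12:10 window and the 12:05 to 12:15 window, which is why the same sale can appear in two rows of the console output.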

	# run stream processing job
	spark-submit spark_streaming_kafka.py


Exploring Popular Open-source Stream Processing Technologies: Part 1 of 2

Introduction

Batch vs. Stream Processing

Bounded vs. Unbounded Data

Stream Processing Technologies

Apache Spark Structured Streaming

Apache Kafka Streams

Apache Flink

Apache Pinot

Streaming Data Source

Source Code

Docker

Demonstration #1: Apache Spark

Deploying the Streaming Stack

Sales Generator

Prepare Spark

Running the Spark Jobs

Batch Processing with Spark

Stream Processing with Spark

Demonstration #2: Apache Kafka Streams

KStreams Application

Running the Application

Part Two

Lakehouse Data Modeling using dbt, Amazon Redshift, Redshift Spectrum, and AWS Glue

Introduction

dbt

Amazon Redshift

Amazon Redshift Spectrum

Prerequisites

Cost Warning!

Amazon Redshift Serverless Option

Source Code

Sample Data

Prepare Amazon Redshift for dbt

Create New Database

Create Database Schemas

Create dbt Database User and Group

Initialize and Configure dbt for Redshift

Project Structure

Install dbt Packages

External Tables

Staging Layer

Late Binding Views

Intermediate Layer

Marts Layer

Analyses

Project Documentation

Testing

Jobs

Notifications

Exposures

Conclusion

Serverless Analytics on AWS: Getting Started with Amazon EMR Serverless and Amazon MSK Serverless

Introduction

Amazon EMR Serverless

Amazon MSK Serverless

Serverless Analytics

Source Code

Architecture

Prerequisites

Amazon EMR Serverless Application

Amazon MSK Serverless Cluster

VPC Endpoint for S3

Kafka Topics and Sample Messages

Spark Resources in Amazon S3

PySpark Applications Examples

Example 1: Kafka Batch Aggregation to the Console

Example 2: Kafka Batch Aggregation to CSV in S3

Example 3: Kafka Batch Aggregation to Kafka

Example 4: Spark Structured Streaming

Cleaning Up

Conclusion

Utilizing In-memory Data Caching to Enhance the Performance of Data Lake-based Applications

Introduction

Caching for Data Lakes

In-memory Caching

Redis In-memory Data Store

Amazon ElastiCache for Redis

ElastiCache Performance Results