Programmatic Ponderings

id	date	patient	code	description	reasoncode	reasondescription
714fd61a-f9fd-43ff-87b9-3cc45a3f1e53	2014-01-09	33f33990-ae8b-4be8-938f-e47ad473abfe	185345009	Encounter for symptom	444814009	Viral sinusitis (disorder)
23e07532-8b96-4d05-b14e-d4c5a5288ed2	2014-08-18	33f33990-ae8b-4be8-938f-e47ad473abfe	185349003	Outpatient Encounter
45044100-aaba-4209-8ad1-15383c76842d	2015-07-12	33f33990-ae8b-4be8-938f-e47ad473abfe	185345009	Encounter for symptom	36971009	Sinusitis (disorder)
ffdddbfb-35e8-4a74-a801-89e97feed2f3	2014-08-12	36d131ee-dd5b-4acb-acbe-19961c32c099	185345009	Encounter for symptom	444814009	Viral sinusitis (disorder)
352d1693-591a-4615-9b1b-f145648f49cc	2016-05-25	36d131ee-dd5b-4acb-acbe-19961c32c099	185349003	Outpatient Encounter
4620bd2f-8010-46a9-82ab-8f25eb621c37	2016-10-07	36d131ee-dd5b-4acb-acbe-19961c32c099	185345009	Encounter for symptom	195662009	Acute viral pharyngitis (disorder)
815494d8-2570-4918-a8de-fd4000d8100f	2010-08-02	660bec03-9e58-47f2-98b9-2f1c564f3838	698314001	Consultation for treatment
67ec5c2d-f41e-4538-adbe-8c06c71ddc35	2010-11-22	660bec03-9e58-47f2-98b9-2f1c564f3838	170258001	Outpatient Encounter
dbe481ce-b961-4f43-ac0a-07fa8cfa8bdd	2012-11-21	660bec03-9e58-47f2-98b9-2f1c564f3838	50849002	Emergency room admission
b5f1ab7e-5e67-4070-bcf0-52451eb20551	2013-12-04	660bec03-9e58-47f2-98b9-2f1c564f3838	185345009	Encounter for symptom	10509002	Acute bronchitis (disorder)

date	patient	encounter	code	description	value	units
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8302-2	Body Height	175.76	cm
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	29463-7	Body Weight	56.51	kg
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	39156-5	Body Mass Index	18.29	kg/m2
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8480-6	Systolic Blood Pressure	119.0	mmHg
2011-07-02	33f33990-ae8b-4be8-938f-e47ad473abfe	673daa98-67e9-4e80-be46-a0b547533653	8462-4	Diastolic Blood Pressure	77.0	mmHg
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	8302-2	Body Height	177.25	cm
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	29463-7	Body Weight	59.87	kg
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	39156-5	Body Mass Index	19.05	kg/m2
2012-06-17	33f33990-ae8b-4be8-938f-e47ad473abfe	be0aa510-645e-421b-ad21-8a1ab442ca48	8480-6	Systolic Blood Pressure	113.0	mmHg
2012-03-26	36d131ee-dd5b-4acb-acbe-19961c32c099	296a1fd4-56de-451c-a5fe-b50f9a18472d	8302-2	Body Height	174.17	cm

start	stop	patient	encounter	code	description
2012-09-05	2012-10-16	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	05a6ef43-d690-455e-ab2f-1ea19d902274	44465007	Sprain of ankle
2014-09-08	2014-09-28	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	1cdcbe46-caaf-4b3f-b58c-9ca9ccb13013	283371005	Laceration of forearm
2014-11-28	2014-12-13	bc33b032-8e41-4d16-bc7e-00b674b6b9f8	b222e257-98da-4a1b-a46c-45d5ad01bbdc	195662009	Acute viral pharyngitis (disorder)
1980-01-09		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	40055000	Chronic sinusitis (disorder)
1989-06-25		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	201834006	Localized primary osteoarthritis of the hand
1996-01-07		01858c8d-f81c-4a95-ab4f-bd79fb62b284	ffbd4177-280a-4a08-a1af-9770a06b5146	196416002	Impacted molars
2016-02-07		01858c8d-f81c-4a95-ab4f-bd79fb62b284	748cda45-c267-46b2-b00d-3b405a44094e	15777000	Prediabetes
2016-04-27	2016-05-20	01858c8d-f81c-4a95-ab4f-bd79fb62b284	a64734f1-5b21-4a59-b2e8-ebfdb9058f8b	444814009	Viral sinusitis (disorder)
2014-02-06	2014-02-19	d32e9ad2-4ea1-4bb9-925d-c00fe85851ae	c64d3637-8922-4531-bba5-f3051ece6354	43878008	Streptococcal sore throat (disorder)
1982-05-18		08858d24-52f2-41dd-9fe9-cbf1f77b28b2	3fff3d52-a769-475f-b01b-12622f4fee17	368581000119106	Neuropathy due to type 2 diabetes mellitus (disorder)

payment_id	customer_id	amount	payment_date	city	district	country
16940	130	5.99	2021-05-08 21:21:56.996577 +00:00	guas Lindas de Gois	Gois	Brazil
16406	459	5.99	2021-05-08 21:22:59.996577 +00:00	Qomsheh	Esfahan	Iran
16315	408	6.99	2021-05-08 21:32:05.996577 +00:00	Jaffna	Northern	Sri Lanka
16185	333	7.99	2021-05-08 21:33:07.996577 +00:00	Baku	Baki	Azerbaijan
17097	222	9.99	2021-05-08 21:33:47.996577 +00:00	Jaroslavl	Jaroslavl	Russian Federation
16579	549	3.99	2021-05-08 21:36:33.996577 +00:00	Santiago de Compostela	Galicia	Spain
16050	269	4.99	2021-05-08 21:40:19.996577 +00:00	Salinas	California	United States
17126	239	7.99	2021-05-08 22:00:12.996577 +00:00	Ciomas	West Java	Indonesia
16933	126	7.99	2021-05-08 22:29:06.996577 +00:00	Po	So Paulo	Brazil
16297	399	8.99	2021-05-08 22:30:47.996577 +00:00	Okara	Punjab	Pakistan

country	sales	orders
India	138.80	20
China	133.80	20
Mexico	106.86	14
Japan	100.86	14
Brazil	96.87	13
Russian Federation	94.87	13
United States	92.86	14
Nigeria	58.93	7
Philippines	58.92	8
South Africa	46.94	6
Argentina	42.93	7
Germany	39.96	4
Indonesia	38.95	5
Italy	35.95	5
Iran	33.95	5
South Korea	33.94	6
Poland	30.97	3
Pakistan	25.97	3
Taiwan	25.96	4
Mozambique	23.97	3
Vietnam	23.96	4
Ukraine	23.96	4
Venezuela	22.97	3
France	20.98	2
Peru	19.98	2

	country	region
	Afghanistan	Asia & Pacific
	Aland Islands	Europe
	Albania	Europe
	Algeria	Arab States
	American Samoa	Asia & Pacific
	Andorra	Europe
	Angola	Africa
	Anguilla	Latin America
	Antarctica	Asia & Pacific

	# Purpose: DataHub example recipe for PostgreSQL datasource
	# Author: Gary A. Stafford
	# Date: March 2022

	# see https://datahubproject.io/docs/metadata-ingestion/source_docs/postgres
	source:
	  type: postgres
	  config:
	    # Coordinates
	    host_port: ${DB_HOST_PORT}
	    database: tickit

	    # Credentials
	    username: ${DB_USERNAME}
	    password: ${DB_PASSWORD}

	    # Options
	    profiling:
	      enabled: true

	    # Environment
	    env: DEV

	# see https://datahubproject.io/docs/metadata-ingestion/transformers/#adding-a-set-of-tags
	transformers:
	  - type: "simple_add_dataset_tags"
	    config:
	      tag_urns:
	        - "urn:li:tag:AWS"
	        - "urn:li:tag:${ACCOUNT_ID}"
	        - "urn:li:tag:us-east-1"
	  - type: "pattern_add_dataset_terms"
	    config:
	      term_pattern:
	        rules:
	          ".*users.*": ["urn:li:glossaryTerm:Classification.Sensitive"]
	  - type: "simple_add_dataset_ownership"
	    config:
	      owner_urns:
	        - "urn:li:corpuser:Database Administrators"
	      ownership_type: "DATAOWNER"

	# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
	sink:
	  type: "datahub-rest"
	  config:
	    server: ${DATAHUB_REST_ENDPOINT}

	# see https://datahubproject.io/docs/metadata-ingestion/source_docs/reporting_telemetry/
	pipeline_name: "postgres-pipeline-tickit"

	reporting:
	  - type: "datahub"
	    config:
	      datahub_api:
	        server: ${DATAHUB_REST_ENDPOINT}
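
The recipe above keeps the host, credentials, account ID, and endpoint out of the file by using `${...}` placeholders. As a rough illustration of how such placeholders resolve to concrete values, here is a minimal Python sketch built on the standard library's `string.Template`. It is not DataHub's own loader; `expand_placeholders` and the sample values are hypothetical.

```python
from string import Template


def expand_placeholders(recipe_text: str, env: dict) -> str:
    """Replace ${NAME} placeholders with values taken from env.

    Raises KeyError if a placeholder has no value, so a missing
    setting fails loudly instead of producing a half-filled recipe.
    """
    return Template(recipe_text).substitute(env)


# Hypothetical values, for illustration only.
sample_env = {"DB_HOST_PORT": "localhost:5432", "DB_USERNAME": "datahub"}
fragment = "host_port: ${DB_HOST_PORT}\nusername: ${DB_USERNAME}\n"

print(expand_placeholders(fragment, sample_env))
```

Failing on a missing variable is a deliberate choice here: a recipe with an unresolved `${DB_PASSWORD}` is better rejected before any connection is attempted.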

	# Purpose: Simple programmatic DataHub pipeline example
	# Author: Gary A. Stafford
	# Date: March 2022
	# Reference: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/programatic_pipeline.py

	from datahub.ingestion.run.pipeline import Pipeline

	# The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
	pipeline = Pipeline.create(
	    {
	        "run_id": "postgres-run",
	        "source": {
	            "type": "postgres",
	            "config": {
	                "host_port": "demo-instance.abcd1234.us-east-1.rds.amazonaws.com:5432",
	                "database": "tickit",
	                "username": "datahub",
	                "password": "My5up3r53cr3tPa55w0rd",
	                "env": "DEV",
	                "profiling": {
	                    "enabled": "true"
	                }
	            }
	        },
	        "transformers": [
	            {
	                "type": "simple_add_dataset_tags",
	                "config": {
	                    "tag_urns": [
	                        "urn:li:tag:AWS",
	                        "urn:li:tag:111222333444",
	                        "urn:li:tag:us-east-1"
	                    ]
	                }
	            },
	            {
	                "type": "pattern_add_dataset_terms",
	                "config": {
	                    "term_pattern": {
	                        "rules": {
	                            ".*users.*": [
	                                "urn:li:glossaryTerm:Classification.Sensitive"
	                            ]
	                        }
	                    }
	                }
	            },
	            {
	                "type": "simple_add_dataset_ownership",
	                "config": {
	                    "owner_urns": [
	                        "urn:li:corpuser:Database Administrators"
	                    ],
	                    "ownership_type": "DATAOWNER"
	                }
	            }
	        ],
	        "sink": {
	            "type": "datahub-rest",
	            "config": {
	                "server": "http://192.168.111.222:33333"
	            }
	        }
	    }
	)

	# Run the pipeline and report the results.
	pipeline.run()
	pipeline.pretty_print_summary()

	# Purpose: Programmatic DataHub pipeline example
	# Author: Gary A. Stafford
	# Date: March 2022

	import json
	import logging

	import boto3
	from botocore.exceptions import ClientError
	from datahub.ingestion.run.pipeline import Pipeline

	logging.basicConfig(
	    format="[%(asctime)s] %(levelname)s - %(message)s", level=logging.INFO
	)


	def main():
	    sts_client = boto3.client("sts")

	    params = get_parameters()
	    params["owner"] = "Database Administrators"
	    params["environment"] = "DEV"
	    params["database"] = "tickit"
	    params["region"] = sts_client.meta.region_name
	    params["account"] = sts_client.get_caller_identity()["Account"]
	    logging.info(f"Params: {json.dumps(params, indent=4, sort_keys=True)}")

	    ingestion_pipeline = create_pipeline(params)
	    run_pipeline(ingestion_pipeline)


	def create_pipeline(params) -> Pipeline:
	    """Constructs a Pipeline for a PostgreSQL Source and a DataHub Sink

	    :return: instance of datahub.ingestion.run.pipeline
	    """

	    pipeline = Pipeline.create(
	        {
	            "run_id": "postgres-run",
	            "source": {
	                "type": "postgres",
	                "config": {
	                    "host_port": params.get("/datahub_demo/postgres_host_port_tickit"),
	                    "database": params.get("database"),
	                    "username": params.get("/datahub_demo/postgres_username_tickit"),
	                    "password": params.get("/datahub_demo/postgres_password_tickit"),
	                    "profiling": {
	                        "enabled": "true"
	                    },
	                    "env": params.get("environment"),
	                }
	            },
	            "transformers": [
	                {
	                    "type": "simple_add_dataset_tags",
	                    "config": {
	                        "tag_urns": [
	                            f"urn:li:tag:{params.get('account')}",
	                            f"urn:li:tag:{params.get('region')}"
	                        ]
	                    }
	                },
	                {
	                    "type": "pattern_add_dataset_terms",
	                    "config": {
	                        "term_pattern": {
	                            "rules": {
	                                ".*users.*": [
	                                    "urn:li:glossaryTerm:Classification.Sensitive"
	                                ]
	                            }
	                        }
	                    }
	                },
	                {
	                    "type": "simple_add_dataset_ownership",
	                    "config": {
	                        "owner_urns": [
	                            f"urn:li:corpuser:{params.get('owner')}"
	                        ],
	                        "ownership_type": "DATAOWNER"
	                    }
	                }
	            ],
	            "sink": {
	                "type": "datahub-rest",
	                "config": {
	                    "server": params.get("/datahub_demo/datahub_rest_endpoint_public")
	                }
	            }
	        }
	    )

	    return pipeline


	def run_pipeline(pipeline):
	    """Runs the ingestion pipeline and prints summary of the results

	    :param pipeline: instance of datahub.ingestion.run.pipeline
	    :return:
	    """

	    pipeline.run()
	    pipeline.pretty_print_summary()


	def get_parameters() -> dict:
	    """
	    Load parameter values from AWS Systems Manager (SSM) Parameter Store

	    :return: dict of parameter k/v's
	    """

	    ssm_client = boto3.client("ssm")
	    params: dict = {}

	    try:
	        # make a single SSM API call for all parameters
	        response = ssm_client.get_parameters_by_path(
	            Path="/datahub_demo"
	        )

	        # create a dictionary of parameter k/v's
	        for param in response.get("Parameters"):
	            params[param["Name"]] = param["Value"]

	        logging.debug(f"Params: {params}")
	    except ClientError as e:
	        logging.error(e)
	        exit(1)

	    return params


	if __name__ == '__main__':
	    main()
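
One caveat about the single `get_parameters_by_path` call in `get_parameters()`: the API returns at most ten parameters per response, so further values arrive on later pages once the `/datahub_demo` path grows. The sketch below shows one way to fold paged responses into a single dictionary. `collect_parameters` and the fake pages are hypothetical names used for illustration. With boto3, the pages could come from `ssm_client.get_paginator("get_parameters_by_path").paginate(Path="/datahub_demo", WithDecryption=True)`, where `WithDecryption` only matters if the values are stored as `SecureString`.

```python
from typing import Dict, Iterable


def collect_parameters(pages: Iterable[dict]) -> Dict[str, str]:
    """Merge the "Parameters" list of every response page into one dict."""
    params: Dict[str, str] = {}
    for page in pages:
        for param in page.get("Parameters", []):
            params[param["Name"]] = param["Value"]
    return params


# Two fake response pages standing in for real SSM output (values invented).
fake_pages = [
    {"Parameters": [{"Name": "/datahub_demo/postgres_username_tickit", "Value": "datahub"}]},
    {"Parameters": [{"Name": "/datahub_demo/postgres_host_port_tickit", "Value": "localhost:5432"}]},
]

print(collect_parameters(fake_pages))
```

Taking an iterable of pages rather than a client keeps the merging logic testable without AWS credentials.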

	# see sample: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml
	version: 1
	source: DataHub
	owners:
	  users:
	    - datahub
	url: "https://github.com/datahub-project/datahub/"
	nodes:
	  - name: Classification
	    description: A set of terms related to Data Classification
	    terms:
	      - name: Sensitive
	        description: Sensitive Data
	        custom_properties:
	          is_confidential: false
	      - name: Confidential
	        description: Confidential Data
	        custom_properties:
	          is_confidential: true
	      - name: HighlyConfidential
	        description: Highly Confidential Data
	        custom_properties:
	          is_confidential: true
	  - name: PersonalInformation
	    description: All terms related to personal information
	    owners:
	      users:
	        - datahub
	    terms:
	      - name: ID
	        description: An individual's unique identifier
	        inherits:
	          - Classification.Sensitive
	      - name: Name
	        description: An individual's Name
	        inherits:
	          - Classification.Sensitive
	      - name: SSN
	        description: An individual's SSN
	        inherits:
	          - Classification.Confidential
	      - name: DriverLicense
	        description: An individual's Driver License ID
	        inherits:
	          - Classification.Confidential
	      - name: Email
	        description: An individual's email address
	        inherits:
	          - Classification.Confidential
	      - name: Address
	        description: A physical address
	      - name: Gender
	        description: The gender identity of the individual
	        inherits:
	          - Classification.Sensitive

	-- Purpose: Process data for sinusitis study using Amazon Athena
	-- Author: Gary A. Stafford (January 2022)

	CREATE TABLE "sinusitis_athena" WITH (
	    format = 'Parquet',
	    write_compression = 'SNAPPY',
	    external_location = 's3://databrew-demo-111222333444-us-east-1/sinusitis_athena/',
	    bucketed_by = ARRAY['patient'],
	    bucket_count = 1
	) AS
	SELECT DISTINCT
	    patient,
	    code,
	    description,
	    date_diff(
	        'day',
	        date(substr(birthdate, 1, 10)),
	        date(substr(start, 1, 10))
	    ) as condition_age,
	    marital,
	    race,
	    ethnicity,
	    gender
	FROM conditions AS c,
	    patients AS p
	WHERE c.patient = p.id
	    AND gender <> ''
	    AND ethnicity <> ''
	    AND race <> ''
	    AND marital <> ''
	    AND description LIKE '%sinusitis%'
	ORDER BY patient, code;
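
The `condition_age` column in the query above is the number of days between the patient's birth date and the condition's start date, with each value truncated to its first ten characters (`YYYY-MM-DD`) before the difference is taken. The same arithmetic can be sketched in Python; `condition_age_days` is a hypothetical helper and the sample dates are made up.

```python
from datetime import date


def condition_age_days(birthdate: str, start: str) -> int:
    """Days from birthdate to condition start, using only the date part."""
    born = date.fromisoformat(birthdate[:10])   # mirrors substr(birthdate, 1, 10)
    began = date.fromisoformat(start[:10])      # mirrors substr(start, 1, 10)
    return (began - born).days                  # mirrors date_diff('day', born, began)


# Invented sample values: born 1 January 2000, condition starts 31 January 2000.
print(condition_age_days("2000-01-01T00:00:00Z", "2000-01-31"))
```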

	# Purpose: Process encounters dataset using either Amazon EMR or AWS Glue with PySpark
	# Author: Gary A. Stafford (January 2022)

	from pyspark.sql import SparkSession

	table_name = "encounter_emr_spark"

	spark = SparkSession \
	    .builder \
	    .appName(table_name) \
	    .config("hive.metastore.client.factory.class",
	            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
	    .config("hive.exec.dynamic.partition",
	            "true") \
	    .config("hive.exec.dynamic.partition.mode",
	            "nonstrict") \
	    .config("hive.exec.max.dynamic.partitions",
	            "10000") \
	    .config("hive.exec.max.dynamic.partitions.pernode",
	            "10000") \
	    .enableHiveSupport() \
	    .getOrCreate()

	spark.sql("USE synthea_patient_big_data;")

	sql_query_data = """
	    SELECT DISTINCT
	        id,
	        patient,
	        code,
	        description,
	        reasoncode,
	        reasondescription,
	        year(date) as year,
	        month(date) as month,
	        day(date) as day
	    FROM encounters
	    WHERE description='Encounter for symptom';
	"""

	package main.spark.demo

	// Purpose: Process observations dataset using Spark on Amazon EMR with Scala
	// Author: Gary A. Stafford
	// Date: 2022-03-06

	import org.apache.spark.SparkContext
	import org.apache.spark.sql.{DataFrame, SparkSession}

	object Observations {
	  def main(args: Array[String]): Unit = {

	    val (spark: SparkSession, sc: SparkContext) = createSession

	    performELT(spark, sc)
	  }

	  private def createSession = {
	    val spark: SparkSession = SparkSession.builder
	      .appName("Observations ELT App")
	      .config("hive.metastore.client.factory.class",
	        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
	      .config("hive.exec.dynamic.partition",
	        "true")
	      .config("hive.exec.dynamic.partition.mode",
	        "nonstrict")
	      .config("hive.exec.max.dynamic.partitions",
	        "10000")
	      .config("hive.exec.max.dynamic.partitions.pernode",
	        "10000")
	      .enableHiveSupport()
	      .getOrCreate()

	    val sc: SparkContext = spark.sparkContext
	    sc.setLogLevel("INFO")

	    (spark, sc)
	  }

	  private def performELT(spark: SparkSession, sc: SparkContext) = {

	Applications:
	  - Name: 'Hadoop'
	  - Name: 'Spark'
	  - Name: 'JupyterEnterpriseGateway'
	  - Name: 'Livy'
	BootstrapActions:
	  - Name: bootstrap-script
	    ScriptBootstrapAction:
	      Path: !Join [ '', [ 's3://', !Ref ProjectBucket, '/spark/bootstrap_actions.sh' ] ]

	#!/bin/bash

	# Purpose: EMR bootstrap script
	# Author: Gary A. Stafford
	# Date: 2021-09-10

	# arg passed in by CloudFormation
	if [ $# -eq 0 ]
	then
	    echo "No arguments supplied"
	fi

	SPARK_BUCKET=$1

	# update yum packages, install jq
	sudo yum update -y
	sudo yum install -y jq

	# JKS truststore for connecting to Amazon MSK
	sudo cp /usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64/jre/lib/security/cacerts \
	/tmp/kafka.client.truststore.jks

	# set region for boto3
	aws configure set region \
	    "$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)"

	# install python packages for pyspark scripts
	sudo python3 -m pip install boto3 botocore ec2-metadata

	# install required jars for spark
	sudo aws s3 cp \
	    "s3://${SPARK_BUCKET}/jars/" /usr/lib/spark/jars/ \
	    --recursive --exclude "*" --include "*.jar"

Archive for category Build Automation

Introduction

Choosing a Generative AI Coding Tool

Ten Ways to Leverage Generative AI

1. Application Development

Generating Unit Tests

2. Infrastructure as Code (IaC)

HashiCorp Terraform

3. AWS Lambda

4. IAM Policies

5. Structured Query Language (SQL)

6. Big Data

7. Configuration and Properties Files

8. Apache Airflow DAGs

9. Containerization

Kubernetes

10. Utility Scripts

Conclusion

Introduction

Data Catalog Competitors

Open Source Software

Commercial

Cloud Service Providers

Data Catalog Features

Datasources

DataHub on AWS

External Storage Layer Dependencies

DataHub CLI and Plug-ins

Secure Metadata Ingestion

Schedule and Monitor Pipelines

Markup or Code?

Using Python

Sinking to DataHub

Column-level Metadata

Business Glossary

SQL-based Profiler

Data Lineage

DataHub Analytics

Conclusion

Introduction

Analytics Use Case

Sample Dataset

AWS Glue Data Catalog

Test Cases

Test Case 1: Encounters for Symptom

Test Case 2: Observations

Test Case 3: Sinusitis Study

AWS ELT Options

AWS Glue Studio

AWS Glue DataBrew

Amazon Athena

CTAS and Partitions

Amazon EMR with Apache Spark

Results

Amazon Athena

AWS Glue Studio (AWS Glue PySpark Extensions)

AWS Glue Studio (Apache PySpark script)

Amazon EMR with Apache Spark

Amazon Glue DataBrew

Observations

Conclusion

Introduction

Technologies

Apache Airflow

Amazon Managed Workflows for Apache Airflow

GitHub Actions

Terminology

DataOps

DevOps

Fail Fast

Source Code

Architecture

Workflows

No DevOps

GitHub Actions

Types of Tests

Python Dependencies

Flake8

Black

Pytest