Unlocking the Potential of Generative AI for Synthetic Data Generation

Explore the capabilities and applications of generative AI to create realistic synthetic data for software development, analytics, and machine learning

Licensed image: Yurchanka Siarhei/Shutterstock

Introduction

Generative AI refers to a class of artificial intelligence algorithms capable of generating new data similar to a given dataset. These algorithms learn the underlying patterns and relationships in the data and use this knowledge to create new data consistent with the original dataset. Generative AI is a rapidly evolving field that has the potential to revolutionize the way we generate and use data.

Generative AI can generate synthetic data based on patterns and relationships learned from actual data. This ability to generate synthetic data has numerous applications, from creating realistic virtual environments for training and simulation to generating new data for machine learning models. In this article, we will explore the capabilities of generative AI and its potential to generate synthetic data, both directly and indirectly, for software development, data analytics, and machine learning.

Common Forms of Synthetic Data

According to AltexSoft, in their article Synthetic Data for Machine Learning: its Nature, Types, and Ways of Generation, common forms of synthetic data include:

  1. Tabular data: This type of synthetic data is often used to generate datasets that resemble real-world data in terms of structure and statistical properties.
  2. Time series data: This type of synthetic data generates datasets that resemble real-world time series data. It is commonly used when real-world time series data is unavailable or too expensive.
  3. Image and video data: This synthetic data is used to generate realistic images and videos for training machine learning models or simulations.
  4. Text data: This synthetic data generates realistic text for natural language processing tasks or for generating training data for machine learning models.
  5. Sound data: This synthetic data generates realistic sound for training machine learning models or simulations.

Synthetic Tabular Data Types

Synthetic tabular data refers to artificially generated datasets that resemble real-world tabular data in terms of structure and statistical properties. Tabular data is organized into rows and columns, like tables or spreadsheets. Some specific types of synthetic tabular data include:

  1. Financial data: Synthetic datasets that resemble real-world financial data such as bank transactions, stock prices, or credit card information.
  2. Customer data: Synthetic datasets that resemble real-world customer data, such as purchase history, demographic information, or customer behavior.
  3. Medical data: Synthetic datasets that resemble real-world medical data, such as patient records, medical test results, or treatment history.
  4. Sensor data: Synthetic datasets that resemble real-world sensor data such as temperature readings, humidity levels, or air quality measurements.
  5. Sales data: Synthetic datasets that resemble real-world sales data, such as sales transactions, product information, or customer behavior.
  6. Inventory data: Synthetic datasets that resemble real-world inventory data, such as stock levels, product information, or supplier information.
  7. Marketing data: Synthetic datasets that resemble real-world marketing data, such as campaign performance, customer behavior, or market trends.
  8. Human resources data: Synthetic datasets that resemble real-world human resources data, such as employee records, performance evaluations, or salary information.

Challenges with Creating Synthetic Data

According to sources including Towards Data Science, enov8, and J.P. Morgan, there are several challenges in creating synthetic data, including:

  1. Technical difficulty: Properly modeling complex real-world behaviors as synthetic data is challenging, given available technologies.
  2. Biased behavior: The flexible nature of synthetic data makes it prone to potentially biased results.
  3. Privacy concerns: Care must be taken to ensure synthetic data does not reveal sensitive information.
  4. Quality of the data model: If the quality of the data model is not high, wrong conclusions can be reached.
  5. Time and effort: Synthetic data generation requires time and effort.

Difficult Patterns and Behaviors to Model

Many patterns and behaviors can be challenging to model in synthetic data, for example:

  1. Rare events: If certain events rarely occur in the real world, generating synthetic data that accurately reflects their distribution can be difficult.
  2. Complex relationships: Synthetic data generators may struggle to capture complex relationships between variables, such as non-linear interactions, feedback loops, or conditional dependencies.
  3. Contextual variability: Contextual factors, such as time, location, or individual differences, can have a significant impact on the distribution of data. Modeling this variability accurately can be challenging.
  4. Outliers and anomalies: Synthetic data generators may be unable to generate outliers or anomalies that are realistic and representative of the real world.
  5. Dynamic data: If the data is dynamic and changes over time, it can be challenging to generate synthetic data that captures these changes accurately.
  6. Unobserved variables: Sometimes, variables may be necessary for understanding the data distribution but are not directly observable. These variables can be challenging to model in synthetic data.
  7. Data bias: If the real-world data is biased in some way, such as over-representation of certain groups or under-representation of others, it can be challenging to generate synthetic data that is unbiased and representative of the population.
  8. Time-dependent patterns: If the data exhibits time-dependent patterns, such as seasonality or trends, it can be challenging to generate synthetic data that accurately reflects them.
  9. Spatial patterns: If the data has a spatial component, such as location data or images, it can be challenging to generate synthetic data that captures the spatial patterns realistically.
  10. Data sparsity: If the data is sparse or incomplete, it can be difficult to generate synthetic data that accurately reflects the distribution of the entire dataset.
  11. Human behavior: If the data involves human behavior, such as in social science or behavioral economics, it can be challenging to model the complex and nuanced behaviors of individuals and groups.
  12. Sensitive or confidential information: In some cases, the data may contain sensitive or confidential information that cannot be shared, making it challenging to generate synthetic data that preserves privacy while accurately reflecting the underlying distribution.

Overall, many patterns can be challenging to model accurately in synthetic data. Doing so often requires careful consideration of the specific data characteristics and the limitations of the synthetic data generation techniques.
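To make the rare-events problem concrete, here is a minimal sketch of the arithmetic involved. It assumes each record independently contains the rare event with a fixed probability; the function name and the example rates are illustrative, not taken from any particular dataset.

```python
def prob_no_occurrences(event_rate, n_records):
    """Probability that a sample of n_records contains zero rare events,
    assuming each record independently has the event with probability event_rate."""
    return (1.0 - event_rate) ** n_records


# Illustrative values only: an event that occurs in 0.1% of records
small_sample = prob_no_occurrences(0.001, 100)
large_sample = prob_no_occurrences(0.001, 10_000)
print(f"100 records: {small_sample:.3f}, 10,000 records: {large_sample:.6f}")
```

Under these assumptions, a sample of 100 records will usually contain no examples of the event at all, which leaves a generator with nothing to learn its distribution from.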

Easily Modeled Patterns and Behaviors

There are many simple and well-understood patterns that can be easily modeled in synthetic data, for example:

  1. Randomness: If the data is purely random, it can be easily generated using a random number generator or other simple techniques.
  2. Gaussian distribution: If the data follows a Gaussian or normal distribution, it can be generated using a Gaussian random number generator.
  3. Uniform distribution: If the data follows a uniform distribution, it can be easily generated using a uniform random number generator.
  4. Linear relationships: If the data follows a linear relationship between variables, it can be modeled using simple linear regression techniques.
  5. Categorical variables: If the data consists of categorical variables, such as gender or occupation, it can be generated using a categorical distribution.
  6. Text data: If the data consists of text, it can be generated using natural language processing techniques, such as language models or text generation algorithms.
  7. Time series: If the data consists of time series data, such as stock prices or weather data, it can be generated using time series models, such as Autoregressive integrated moving average (ARIMA) or Long short-term memory (LSTM).
  8. Seasonality: If the data exhibits seasonal patterns, such as higher sales data during holiday periods, it can be generated using seasonal time series models.
  9. Proportions and percentages: If the data consists of proportions or percentages, such as product sales distribution across different regions, it can be generated using beta or Dirichlet distributions.
  10. Multivariate normal distribution: If the data follows a multivariate normal distribution, it can be generated using a multivariate Gaussian random number generator.
  11. Networks: If the data consists of network or graph data, such as social networks or transportation networks, it can be generated using network models, such as Erdos-Renyi or Barabasi-Albert models.
  12. Binary data: If the data consists of binary data, such as whether a customer churned, it can be generated using a Bernoulli distribution.
  13. Geospatial data: If the data involves geospatial data, such as the location of points of interest, it can be easily generated using geospatial models, such as point processes or spatial point patterns.
  14. Customer behaviors: If the data involves customer behaviors, such as browsing or purchase histories, it can be generated using customer journey models, such as Markov models.

In general, simple and well-understood patterns can be easily modeled using synthetic data techniques. In contrast, more complex and nuanced patterns may require more sophisticated modeling techniques and a deeper understanding of the underlying data characteristics.
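As a rough illustration of several of the simple patterns listed above, the following sketch builds records using only Python's standard `random` module. The field names, means, weights, and probabilities are all made-up assumptions chosen for the example.

```python
import random


def make_record(rng):
    """Build one synthetic record from a few simple distributions."""
    age = rng.gauss(40, 12)  # Gaussian: mean 40, standard deviation 12
    discount = rng.uniform(0.0, 0.25)  # uniform between 0% and 25%
    plan = rng.choices(  # categorical with fixed weights
        ["basic", "plus", "premium"], weights=[0.6, 0.3, 0.1]
    )[0]
    churned = rng.random() < 0.2  # Bernoulli with p = 0.2
    # linear relationship plus Gaussian noise: spend = 20 + 1.5 * age + noise
    spend = 20 + 1.5 * age + rng.gauss(0, 5)
    return {
        "age": age,
        "discount": discount,
        "plan": plan,
        "churned": churned,
        "spend": spend,
    }


rng = random.Random(42)  # seeded so the output is reproducible
records = [make_record(rng) for _ in range(1000)]
```

Seeding the generator is a deliberate choice here: reproducible synthetic data makes test failures easier to investigate.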

Creating Synthetic Data with Generative AI

We can use many popular generative AI-powered tools to create synthetic data for testing applications, constructing analytics pipelines, and building machine learning models. Tools include OpenAI ChatGPT, Microsoft’s all-new Bing Chat, ChatSonic, Tabnine, GitHub Copilot, and Amazon CodeWhisperer. For more information on these tools, check out my recent blog post:

Accelerating Development with Generative AI-Powered Coding Tools
Explore six popular generative AI-powered tools, including ChatGPT, Copilot, CodeWhisperer, Tabnine, Bing, and… (garystafford.medium.com)

Let’s start with a simple example of generating synthetic sales data. Suppose we have created a new sales forecasting application for coffee shops that we need to test using synthetic data. We might start by prompting a generative AI tool like OpenAI’s ChatGPT for some data:

Create a CSV file with 25 random sales records for a coffee shop.
Each record should include the following fields:
- id (incrementing integer starting at 1)
- date (random date between 1/1/2022 and 12/31/2022)
- time (random time between 6:00am and 9:00pm in 1-minute increments)
- product_id (incrementing integer starting at 1)
- product
- calories
- price in USD
- type (drink or food)
- quantity (random integer between 1 and 3)
- amount (price * quantity)
- payment type (cash, credit, debit, or gift card)

The content and structure of a prompt can vary, and this can strongly influence ChatGPT’s response. Based on the above prompt, the results were accurate but, in this format, not very useful for testing our application. ChatGPT returns the records as text in the chat window; it cannot create an actual CSV file. Furthermore, ChatGPT’s response length is limited: only about twenty of the 25 requested records were returned. When asked, ChatGPT stated that it can generally generate responses of up to 2,048 tokens, which it described as the maximum output length allowed by the GPT-3 model.

OpenAI ChatGPT UI
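If we did want to use records returned as chat text, one option is to paste them into a small script that checks and saves them. The sketch below is only an illustration: the `response_text` rows are hypothetical stand-ins for a pasted response, and `parse_sales` and `save_sales` are names invented for this example.

```python
import csv
import io

# Hypothetical stand-in for CSV text copied from a chat response
response_text = """id,date,time,product_id,product,calories,price,type,quantity,amount,payment_type
1,03/14/2022,07:45,1,Latte,120,3.50,drink,2,7.00,credit
2,08/02/2022,12:10,2,Bagel,289,3.49,food,1,3.49,cash
"""


def parse_sales(text):
    """Parse pasted CSV text and check that amount equals price * quantity."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        expected = round(float(row["price"]) * int(row["quantity"]), 2)
        if abs(float(row["amount"]) - expected) > 0.005:
            raise ValueError(f"Record {row['id']}: amount is not price * quantity")
    return rows


def save_sales(rows, path):
    """Write the validated records to a CSV file with a header row."""
    with open(path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


rows = parse_sales(response_text)
```

Calling `save_sales(rows, "sales.csv")` would then write the checked records to disk. This still depends on copying text by hand, so it does not remove the response-length limit.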

Instead of outputting the actual synthetic data, we could ask ChatGPT to write a program that can, in turn, generate the synthetic data. This option is certainly more scalable. Let’s prompt ChatGPT to write a Python program to generate synthetic sales data with the same characteristics as before:

Create a Python3 program to generate 100 sales of common items 
sold in a coffee shop. The data should be written to a CSV file
and include a header row. Each record should include the following fields:
- id (incrementing integer starting at 1)
- date (random date between 1/1/2022 and 12/31/2022)
- time (random time between 6:00am and 9:00pm in 1-minute increments)
- product_id (incrementing integer starting at 1)
- product
- calories
- price in USD
- type (drink or food)
- quantity (random integer between 1 and 3)
- amount (price * quantity)
- payment type (cash, credit, debit, or gift card)

Using a single concise prompt, ChatGPT generated a complete Python program, including code comments, to generate synthetic sales data. Unfortunately, given ChatGPT’s response size limitation, the coffee shop menu was limited to just six items. Reprompting for more items would result in the truncation of the output and, thus, the program, making it unrunnable. Instead, we could use an additional prompt to generate a longer Python list of menu items and combine the two pieces of code in our IDE. Regardless, we will still need to copy and paste the code into our IDE to review, debug, test, and run.

OpenAI ChatGPT UI

ChatGPT’s Python program, copied and pasted into VS Code, ran without modifications, and wrote 100 synthetic sales records to a CSV file!

Running ChatGPT’s Python program in VS Code

Using IDE-based Generative AI Tools

Although generating synthetic data or snippets of code directly in chat-based generative AI tools is helpful for limited use cases, writing code in an IDE gives us several advantages:

  1. Code does not need to be copied and pasted from external sources into an IDE
  2. Line-by-line, method, and block code completion overcomes the single-response size limits of chat-based tools like OpenAI ChatGPT
  3. Code can be reused and adapted to evolving use cases over time
  4. Access to a Python interpreter and debugger, or the equivalent for other languages
  5. Automatic code formatting, linting, and code style enforcement
  6. Unit, integration, and functional testing
  7. Static code analysis (SCA)
  8. Vulnerability scanning
  9. IntelliSense for code completion
  10. Source code management (SCM) / version control

Let’s use the same techniques we used with ChatGPT, but from within an IDE to generate three types of synthetic data. We will choose Microsoft’s VS Code with GitHub Copilot and Python as our programming language.

VS Code IDE with GitHub Copilot Nightly extension installed

Source Code

All the code examples shown in this post can be found on GitHub.

Example #1: Coffee Shop Sales Data

First, we will start by outlining the program’s objective using code comments on the top of our Python file. This detailed context helps us to clearly express our goal and enables Copilot to generate an accurate response.

# Write a program that creates synthetic sales data for a coffee shop.
# The program should accept a command line argument that specifies the number of records to generate.
# The program should write the sales data to a file called 'coffee_shop_sales_data.csv'.
# The program should contain the following functions:
# - main() function that calls the other functions
# - function that returns one random product from a list of dictionaries
# - function that returns a dictionary containing one sales record
# - function that writes the sales records to a file

Following the import statements, which were also generated with the assistance of Copilot, we will write the first function to return a random product from a list of 25 products. Again, we will use code comments as a prompt to generate the code. Copilot was able to generate 100% of the function’s code from the comments.

# Write a function to create list of dictionaries.
# The list of dictionaries should contain 15 drink items and 10 food items sold in a coffee shop.
# Include the product id, product name, calories, price, and type (Food or Drink).
# Capitalize the first letter of each product name.
# Return a random item from the list of dictionaries.

Below is an example of Copilot’s ability to generate complete lines of code. Ultimately, it generated 100% of the function including choosing the items sold in a coffee shop, with a reasonable price and caloric count. Copilot is not limited to just understanding code.

Copilot can generate entire lines of code

Next, we will write a function to return a random sales record.

# Write a function to return a random sales record.
# The record should be a dictionary with the following fields:
# - id (an incrementing integer starting at 1)
# - date (a random date between 1/1/2022 and 12/31/2022)
# - time (a random time between 6:00am and 9:00pm in 1 minute increments)
# - product_id, product, calories, price, and type (from the get_product function)
# - quantity (a random integer between 1 and 3)
# - amount (price * quantity)
# - payment type (Cash, Credit, Debit, Gift Card, Apple Pay, Google Pay, or Venmo)

Again, Copilot generated 100% of the function’s code as a single block based on the code comments.

Copilot can generate entire blocks of code or methods

Lastly, we will create a function to write the sales data into a CSV file using Copilot’s help.

# Write a function to write the sales records to a CSV file called 'coffee_shop_sales_data.csv'.
# Use an input parameter to specify the number of records to write.
# The CSV file must have a header row and be comma delimited.
# All string values must be enclosed in double quotes.

Below, we again see an example of Copilot’s ability to generate an entire Python function. I needed to correct a few problems with the generated code. First, string values were not quoted, so I added the quoting options to the function (quotechar='"', quoting=csv.QUOTE_NONNUMERIC). Also, the function was missing a key line of code, sale = get_sales_record(), which would have caused the code to fail. Remember, just because the code was generated does not mean it is correct.

Copilot can generate entire blocks of code or methods
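To see what the quoting fix does in isolation, here is a small sketch that writes two rows to an in-memory buffer rather than a file. The rows are illustrative; `csv.QUOTE_NONNUMERIC` tells the writer to quote every non-numeric field.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(
    buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
)
writer.writerow(["product", "price", "quantity"])  # strings are quoted
writer.writerow(["Latte", 3.5, 2])  # numbers are left unquoted
output = buffer.getvalue()
print(output)
```

The header and the product name come out wrapped in double quotes, while the price and quantity do not.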

Here is the complete program that creates synthetic sales data for a coffee shop with Copilot assistance:

# Purpose: Generate coffee shop sales data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-12
# Usage: python3 coffee_shop_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)

# Write a program that creates synthetic sales data for a coffee shop.
# The program should accept a command line argument that specifies the number of records to generate.
# The program should write the sales data to a file called 'coffee_shop_sales_data.csv'.
# The program should contain the following functions:
# - main() function that calls the other functions
# - function that returns one random product from a list of dictionaries
# - function that returns a dictionary containing one sales record
# - function that writes the sales records to a file

import argparse
import csv
import hashlib
import os
import random
from datetime import datetime, timedelta


def main():
    # create a parser object
    parser = argparse.ArgumentParser(
        description="Generate coffee shop sales data")

    # add a command line argument to specify the number of records to generate
    # (nargs="?" makes the positional argument optional, so the default applies)
    parser.add_argument("num_recs",
                        type=int,
                        nargs="?",
                        help="The number of records to generate",
                        default=100)

    num_recs = parser.parse_args().num_recs

    write_data(num_recs)


# Write a function to create list of dictionaries.
# The list of dictionaries should contain 15 drink items and 10 food items sold in a coffee shop.
# Include the product id, product name, calories, price, and type (Food or Drink).
# Capitalize the first letter of each product name.
# Return a random item from the list of dictionaries.
def get_product():
    products = [
        {"id": 1, "product": "Latte", "calories": 120, "price": 3.50, "type": "Drink"},
        {"id": 2, "product": "Cappuccino", "calories": 100, "price": 3.00, "type": "Drink"},
        {"id": 3, "product": "Americano", "calories": 5, "price": 2.50, "type": "Drink"},
        {"id": 4, "product": "Espresso", "calories": 10, "price": 2.00, "type": "Drink"},
        {"id": 5, "product": "Mocha", "calories": 250, "price": 4.00, "type": "Drink"},
        {"id": 6, "product": "Iced Coffee", "calories": 80, "price": 2.50, "type": "Drink"},
        {"id": 7, "product": "Hot Chocolate", "calories": 300, "price": 3.50, "type": "Drink"},
        {"id": 8, "product": "Tea", "calories": 0, "price": 2.00, "type": "Drink"},
        {"id": 9, "product": "Frappe", "calories": 450, "price": 5.00, "type": "Drink"},
        {"id": 10, "product": "Smoothie", "calories": 200, "price": 4.00, "type": "Drink"},
        {"id": 11, "product": "Iced Tea", "calories": 0, "price": 2.50, "type": "Drink"},
        {"id": 12, "product": "Lemonade", "calories": 120, "price": 3.00, "type": "Drink"},
        {"id": 13, "product": "Hot Tea", "calories": 0, "price": 2.00, "type": "Drink"},
        {"id": 14, "product": "Chai Tea", "calories": 200, "price": 3.50, "type": "Drink"},
        {"id": 15, "product": "Iced Chai", "calories": 250, "price": 4.00, "type": "Drink"},
        {"id": 16, "product": "Croissant", "calories": 231, "price": 2.99, "type": "Food"},
        {"id": 17, "product": "Bagel", "calories": 289, "price": 3.49, "type": "Food"},
        {"id": 18, "product": "Muffin", "calories": 426, "price": 3.99, "type": "Food"},
        {"id": 19, "product": "Sandwich", "calories": 512, "price": 6.99, "type": "Food"},
        {"id": 20, "product": "Wrap", "calories": 388, "price": 5.99, "type": "Food"},
        {"id": 21, "product": "Salad", "calories": 231, "price": 7.99, "type": "Food"},
        {"id": 22, "product": "Quiche", "calories": 456, "price": 4.99, "type": "Food"},
        {"id": 23, "product": "Scone", "calories": 335, "price": 2.49, "type": "Food"},
        {"id": 24, "product": "Pastry", "calories": 397, "price": 3.99, "type": "Food"},
        {"id": 25, "product": "Cake", "calories": 512, "price": 5.99, "type": "Food"},
    ]

    # return one random item from list of dictionaries
    return random.choice(products)


# Write a function to return a random sales record.
# The record should be a dictionary with the following fields:
# - transaction_id (a hash of the date, time, and product_id)
# - date (a random date between 1/1/2022 and 12/31/2022)
# - time (a random time between 6:00am and 9:00pm in 1 minute increments)
# - product_id, product, calories, price, and type (from the get_product function)
# - quantity (a random integer between 1 and 3)
# - amount (price * quantity)
# - payment type (Cash, Credit, Debit, Gift Card, Apple Pay, Google Pay, or Venmo)
def get_sales_record():
    # get a random product
    product = get_product()

    # get a random date between 1/1/2022 and 12/31/2022
    start_date = datetime(2022, 1, 1)
    end_date = datetime(2022, 12, 31)
    random_date = start_date + timedelta(
        # Get a random number of seconds between 0 and the number of seconds between start_date and end_date
        seconds=random.randint(0, int(
            (end_date - start_date).total_seconds())), )

    # get a random time between 6:00am and 9:00pm
    start_time = datetime.strptime("6:00am", "%I:%M%p")
    end_time = datetime.strptime("9:00pm", "%I:%M%p")
    random_time = start_time + timedelta(
        # Get a random number of seconds between 0 and the number of seconds between start_time and end_time
        seconds=random.randint(0, int(
            (end_time - start_time).total_seconds())), )

    # get a random quantity between 1 and 3
    random_quantity = random.randint(1, 3)

    # get a random payment type:
    # Cash, Credit, Debit, Gift card, Apple Pay, Google Pay, Venmo
    random_payment_type = random.choice(
        ["Cash", "Credit", "Debit", "Gift card", "Apple Pay", "Google Pay", "Venmo"])

    sales_record = {
        "date": random_date.strftime("%m/%d/%Y"),
        "time": random_time.strftime("%H:%M:%S"),
        "product_id": product["id"],
        "product": product["product"],
        "calories": product["calories"],
        "price": product["price"],
        "type": product["type"],
        "quantity": random_quantity,
        # round to cents to avoid floating-point noise in the product
        "amount": round(product["price"] * random_quantity, 2),
        "payment_type": random_payment_type,
    }

    return sales_record


# Write a function to write the sales records to a CSV file called 'coffee_shop_sales_data.csv'.
# Use an input parameter to specify the number of records to write.
# Call the get_sales_record function once for each record to write.
# The CSV file must have a header row and be comma delimited.
# All string values must be enclosed in double quotes.
def write_data(rec_count):
    # make sure the output directory exists before opening the file
    os.makedirs("output", exist_ok=True)

    # open the file for writing
    with open("output/coffee_shop_sales_data.csv", "w", newline="") as csv_file:
        # create a csv writer object
        csv_writer = csv.writer(csv_file,
                                delimiter=",",
                                quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC)

        # write the header row
        # transaction_id,date,time,product_id,product,calories,price,type,quantity,amount,payment_type
        csv_writer.writerow([
            "transaction_id",
            "date",
            "time",
            "product_id",
            "product",
            "calories",
            "price",
            "type",
            "quantity",
            "amount",
            "payment_type",
        ])

        # write the sales records
        for _ in range(rec_count):
            sale = get_sales_record()
            # build the transaction id from a hash of the date, time, and product id
            transaction_id = hashlib.md5(
                f'{sale["date"]} {sale["time"]} {sale["product_id"]}'.encode()).hexdigest()
            csv_writer.writerow([
                transaction_id,
                sale["date"],
                sale["time"],
                sale["product_id"],
                sale["product"],
                sale["calories"],
                sale["price"],
                sale["type"],
                sale["quantity"],
                sale["amount"],
                sale["payment_type"],
            ])


if __name__ == "__main__":
    main()

Copilot generated an astounding 80–85% of the program’s final code. The initial program took 10–15 minutes to write using code comments. I then added a few new features, including the ability to pass in the record count on the command line and the hash-based transaction id, which took another 5 minutes. Finally, I used GitHub Code Brushes to optimize the code and generate the Python docstrings, and Black Formatter and Flake8 extensions to format and lint, all of which took less than 5 minutes. With testing and debugging, the total time was about 25–30 minutes.

The most significant difference with Copilot was that I never had to leave the IDE to look up code references or find existing sales datasets or even a coffee shop menu to duplicate. The code, as well as the list of products, price, calories, and product type, were all generated by Copilot.

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of reflecting daily, weekly, and seasonal variations in product choice and sales volumes. This might include simulating increased sales during the busy morning rush hour or a preference for iced drinks in the summer months versus hot drinks during the winter months.
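As a starting point for that kind of realism, here is a minimal sketch of weighted choices for the hour of the sale and for iced versus hot drinks. The weights, the three-month definition of summer, and the function names are all assumptions made up for illustration; they are not part of the program above.

```python
import random

# Hypothetical hourly weights for opening hours 6:00am to 8:59pm:
# the 7:00am-9:59am morning rush is three times as likely as any other hour
HOUR_WEIGHTS = {hour: (3.0 if 7 <= hour <= 9 else 1.0) for hour in range(6, 21)}


def pick_hour(rng):
    """Return an hour of the day, favoring the morning rush."""
    hours = list(HOUR_WEIGHTS)
    return rng.choices(hours, weights=[HOUR_WEIGHTS[h] for h in hours])[0]


def pick_drink_style(month, rng):
    """Return 'iced' or 'hot', favoring iced drinks in June through August."""
    iced_weight = 0.7 if month in (6, 7, 8) else 0.2
    return rng.choices(["iced", "hot"], weights=[iced_weight, 1 - iced_weight])[0]


rng = random.Random(7)  # seeded for reproducible output
print(pick_hour(rng), pick_drink_style(7, rng))
```

The same `random.choices` pattern could be extended to weekdays versus weekends or to individual products.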

Here is an example of the synthetic sales data output by the example application:

transaction_id | date | time | product_id | product | calories | price | type | quantity | amount | payment_type
47a157f84e727fe3335db1519ee736a6 | 06/27/2022 | 19:59:42 | 22 | Quiche | 456 | 4.99 | Food | 2 | 9.98 | Debit
1bf01013e699ca0f804650ea50826c82 | 11/20/2022 | 06:21:14 | 22 | Quiche | 456 | 4.99 | Food | 3 | 14.97 | Cash
84f41c15749090d1e79bf9a48a58d6c3 | 08/18/2022 | 11:50:22 | 14 | Chai Tea | 200 | 3.5 | Drink | 2 | 7.0 | Apple Pay
ef1845b8438bf3b5b99d2f4891a48f03 | 11/13/2022 | 17:20:51 | 12 | Lemonade | 120 | 3.0 | Drink | 2 | 6.0 | Debit
9863de11be3099d6361392584e30e624 | 06/03/2022 | 18:27:03 | 18 | Muffin | 426 | 3.99 | Food | 2 | 7.98 | Gift card
f50ed8878250bc06f66b97f5cd2f6df7 | 02/21/2022 | 17:02:18 | 7 | Hot Chocolate | 300 | 3.5 | Drink | 2 | 7.0 | Credit
1903169473f41a0275ee702f2c6b1dd6 | 05/24/2022 | 14:58:25 | 10 | Smoothie | 200 | 4.0 | Drink | 3 | 12.0 | Venmo
164a9519fd3db952e721e9f55dc1be74 | 01/07/2022 | 14:19:35 | 14 | Chai Tea | 200 | 3.5 | Drink | 2 | 7.0 | Debit
dc85a202143de48ad4646190cdc0bf5c | 01/28/2022 | 08:52:38 | 20 | Wrap | 388 | 5.99 | Food | 1 | 5.99 | Venmo
7d182be793ab4b3290d14968e9c9a3e3 | 02/20/2022 | 10:16:10 | 16 | Croissant | 231 | 2.99 | Food | 2 | 5.98 | Google Pay
f524811f456481f5a4a2f0c09dcafa28 | 10/03/2022 | 16:19:35 | 10 | Smoothie | 200 | 4.0 | Drink | 3 | 12.0 | Debit
f2f63eacd33cb9f13849b08638e6fc3d | 07/06/2022 | 16:14:20 | 10 | Smoothie | 200 | 4.0 | Drink | 3 | 12.0 | Cash
7435e8cc8771a8b538529ff6f873cc05 | 04/12/2022 | 16:35:51 | 14 | Chai Tea | 200 | 3.5 | Drink | 3 | 10.5 | Venmo
0e1154abb6a79b9ece570d51e3116846 | 11/28/2022 | 14:44:05 | 21 | Salad | 231 | 7.99 | Food | 2 | 15.98 | Debit
d5926d58011a2a25798bf2aa72552cf0 | 04/19/2022 | 16:06:09 | 22 | Quiche | 456 | 4.99 | Food | 1 | 4.99 | Venmo

Example #2: Residential Address Data

We could use these same techniques to generate a list of residential addresses. To start, we can prompt Copilot for the values in a list of common street names and street types in the United States:

# Write a function that creates a list of common street names
# in the United States, in alphabetical order.
# List should be in alphabetical order. Each name should be unique.
# Return a random street name.
def get_street_name():
    street_names = [
        "Ash", "Bend", "Bluff", "Branch", "Bridge", "Broadway", "Brook", "Burg",
        "Bury", "Canyon", "Cape", "Cedar", "Cove", "Creek", "Crest", "Crossing",
        "Dale", "Dam", "Divide", "Downs", "Elm", "Estates", "Falls", "Fifth",
        "First", "Fork", "Fourth", "Glen", "Green", "Grove", "Harbor", "Heights",
        "Hickory", "Hill", "Hollow", "Island", "Isle", "Knoll", "Lake", "Landing",
        ...
    ]

    return random.choice(street_names)


# Write a function that creates a list of common street types
# in the United States, in alphabetical order.
# List should be in alphabetical order. Each name should be unique.
# Return a random street type.
def get_street_type():
    street_types = [
        "Alley", "Avenue", "Bend", "Bluff", "Boulevard", "Branch", "Bridge", "Brook",
        "Burg", "Circle", "Commons", "Court", "Drive", "Highway", "Lane", "Parkway",
        "Place", "Road", "Square", "Street", "Terrace", "Trail", "Way"
    ]

    return random.choice(street_types)

Next, we can create a function that returns a property type based on a categorical distribution of common residential property types with the prompt:

# Write a function to return a random property type.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 63% Single-family, 26% Multi-family, 4% Condo,
# 3% Townhouse, 2% Mobile home, 1% Farm, 1% Other.

Again, Copilot generated 100% of the function’s code as a single block based on the code comments.

Copilot generating a categorical distribution of common residential property types
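For readers who cannot see the screenshot, here is a minimal sketch of one way such a function could be written, using cumulative thresholds over the percentages in the prompt. It is not Copilot's actual output; the function name `get_property_type` and the `PROPERTY_TYPES` constant are assumptions for this example.

```python
# Shares taken from the prompt above; they sum to 100%
PROPERTY_TYPES = [
    ("Single-family", 0.63),
    ("Multi-family", 0.26),
    ("Condo", 0.04),
    ("Townhouse", 0.03),
    ("Mobile home", 0.02),
    ("Farm", 0.01),
    ("Other", 0.01),
]


def get_property_type(value):
    """Map a random value between 0 and 1 to a property type."""
    cumulative = 0.0
    for name, share in PROPERTY_TYPES:
        cumulative += share
        if value < cumulative:
            return name
    # fall back to the last category if rounding leaves value >= the final total
    return PROPERTY_TYPES[-1][0]
```

A caller would pass in `random.random()`, which keeps the function itself deterministic and easy to test.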

Additionally, we could have Copilot help us generate a list of the 50 largest cities in the United States with state, zip code, and population, with the prompt:

# Write a function to return the 50 largest cities in the United States.
# List should be sorted in descending order by population.
# Include the city, state abbreviation, zip code, and population.
# Return a list of dictionaries.

Once again, Copilot generated 100% of the function’s code using a combination of single lines and code blocks based on the code comments.

Copilot generating a block of US cities with state, zip code, and populations

When randomly choosing a city, we can weight the choice by each city’s share of the total population, a categorical distribution, to control how cities are distributed in the final synthetic dataset. For example, there will be more addresses in larger cities like New York City or Los Angeles than in smaller cities like Buffalo or Virginia Beach. We can prepare the city data with the prompt:

# Write a function that calculates the total population of the list of cities.
# Add a 'pcnt_of_total_population' and 'pcnt_running_total' columns to list.
# Returns a sorted list of cities by population.
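To illustrate the idea behind this prompt, here is a minimal sketch that adds the two percentage columns and then selects a city from a random value between 0 and 1. It uses placeholder cities with made-up populations rather than real data, and it returns a new list instead of updating a global variable, which is an assumption about structure rather than the author's implementation. The `pick_city` helper is hypothetical.

```python
def prepare_cities(cities):
    """Sort cities by population and add share and running-total columns."""
    total = sum(city["population"] for city in cities)
    ordered = sorted(cities, key=lambda city: city["population"], reverse=True)
    running = 0.0
    prepared = []
    for city in ordered:
        share = city["population"] / total
        running += share
        prepared.append({
            **city,
            "pcnt_of_total_population": share,
            "pcnt_running_total": running,
        })
    return prepared


def pick_city(prepared, value):
    """Return the first city whose running total exceeds value (0 to 1)."""
    for city in prepared:
        if value < city["pcnt_running_total"]:
            return city
    return prepared[-1]


# Placeholder cities with made-up populations, for illustration only
sample_cities = [
    {"city": "City B", "population": 300_000},
    {"city": "City A", "population": 600_000},
    {"city": "City C", "population": 100_000},
]
prepared = prepare_cities(sample_cities)
```

With these placeholder numbers, `pick_city(prepared, random.random())` would return City A about 60% of the time.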

Here is the complete program that creates synthetic US-based address data with Copilot assistance:

# Purpose: Generate US residential address data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-13
# Usage: python3 residential_address_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)
# Write an application that create a random list of united states addresses.
# The application should accept a command line argument that specifies the number of records to generate.
# Include address, city, state, zip code, country, and property type.
# Write the data to a csv file named 'address_data.csv'.
# The application should contain the following functions:
# – main() function that calls the other functions
# – function that returns a list of common street names in the United States
# – function that returns a list of common street types in the United States
# – function that returns a list of common city, state, zip code, and population in the United States
# – function that returns a property type
import csv
import random
import argparse
cities_final = []
def main():
    parser = argparse.ArgumentParser(description="Generate US residential address data")
    # nargs="?" makes the positional argument optional, so the default applies
    parser.add_argument(
        "rec_count",
        type=int,
        nargs="?",
        default=100,
        help="The number of records to generate",
    )
    # add population calculations to the city data
    cities = get_cities()
    prepare_cities(cities)
    rec_count = parser.parse_args().rec_count
    write_data(rec_count)
# Write a function that creates a list of common street names
# in the United States, in alphabetical order.
# Each one should be unique.
# Return a random street name.
def get_street_name():
    street_names = [
        "Ash", "Bend", "Bluff", "Branch", "Bridge", "Broadway", "Brook", "Burg",
        "Bury", "Canyon", "Cape", "Cedar", "Cove", "Creek", "Crest", "Crossing",
        "Dale", "Dam", "Divide", "Downs", "Elm", "Estates", "Falls", "Fifth",
        "First", "Fork", "Fourth", "Glen", "Green", "Grove", "Harbor", "Heights",
        "Hickory", "Hill", "Hollow", "Island", "Isle", "Knoll", "Lake", "Landing",
        "Lawn", "Main", "Manor", "Maple", "Meadow", "Meadows", "Mill", "Mills",
        "Mission", "Mount", "Mountain", "Oak", "Oaks", "Orchard", "Park", "Parkway",
        "Pass", "Path", "Pike", "Pine", "Place", "Plain", "Plains", "Port", "Prairie",
        "Ridge", "River", "Road", "Rock", "Rocks", "Second", "Seventh", "Shoals",
        "Shore", "Shores", "Sixth", "Skyway", "Spring", "Springs", "Spur", "Station",
        "Summit", "Sunset", "Terrace", "Third", "Trace", "Track", "Trail", "Tunnel",
        "Turnpike", "Vale", "Valley", "View", "Village", "Ville", "Vista", "Walk",
        "Way", "Well", "Wells", "Wood", "Woods", "Worth"
    ]
    return random.choice(street_names)
# Write a function that creates a list of common street types
# in the United States, in alphabetical order.
# Each one should be unique.
# Return a random street type.
def get_street_type():
    street_types = [
        "Alley", "Avenue", "Bend", "Bluff", "Boulevard", "Branch", "Bridge", "Brook",
        "Burg", "Circle", "Commons", "Court", "Drive", "Highway", "Lane", "Parkway",
        "Place", "Road", "Square", "Street", "Terrace", "Trail", "Way"
    ]
    return random.choice(street_types)
# Write a function that calculates the total population of the list of cities.
# Add a 'pcnt_of_total_population' and 'pcnt_running_total' columns to list.
# Returns a sorted list of cities by population.
def prepare_cities(cities):
    total_population = 0 # 51,035,885
    for city in cities:
        total_population += city["population"]
    for city in cities:
        city["pcnt_of_total_population"] = city["population"] / total_population
    global cities_final
    cities_final = sorted(cities, key=lambda d: d["population"], reverse=True)
    running_total = 1
    for city in cities_final:
        running_total -= city["pcnt_of_total_population"]
        city["pcnt_running_total"] = running_total
# Write a function to return the 50 largest cities in the United States.
# Include the city, state abbreviation, zip code, and population.
# List should be sorted in descending order by population.
# Return a list of dictionaries.
def get_cities():
    cities = [
        {"city": "Albuquerque", "state": "NM", "zip": "87102", "population": 559277},
        {"city": "Anaheim", "state": "CA", "zip": "92801", "population": 345012},
        {"city": "Anchorage", "state": "AK", "zip": "99501", "population": 291826},
        {"city": "Arlington", "state": "TX", "zip": "76010", "population": 398121},
        {"city": "Atlanta", "state": "GA", "zip": "30303", "population": 486290},
        {"city": "Aurora", "state": "CO", "zip": "80010", "population": 325078},
        {"city": "Austin", "state": "TX", "zip": "78701", "population": 931830},
        {"city": "Bakersfield", "state": "CA", "zip": "93301", "population": 372576},
        {"city": "Baltimore", "state": "MD", "zip": "21202", "population": 602495},
        {"city": "Boston", "state": "MA", "zip": "02108", "population": 667137},
        {"city": "Buffalo", "state": "NY", "zip": "14202", "population": 258959},
        {"city": "Charlotte", "state": "NC", "zip": "28202", "population": 872498},
        {"city": "Chicago", "state": "IL", "zip": "60602", "population": 2695598},
        {"city": "Cincinnati", "state": "OH", "zip": "45202", "population": 296943},
        {"city": "Cleveland", "state": "OH", "zip": "44113", "population": 390113},
        {"city": "Colorado Springs", "state": "CO", "zip": "80903", "population": 456568},
        {"city": "Columbus", "state": "OH", "zip": "43215", "population": 822553},
        {"city": "Dallas", "state": "TX", "zip": "75201", "population": 1345047},
        {"city": "Denver", "state": "CO", "zip": "80202", "population": 682545},
        {"city": "Detroit", "state": "MI", "zip": "48226", "population": 672662},
        {"city": "El Paso", "state": "TX", "zip": "79901", "population": 674433},
        {"city": "Fort Worth", "state": "TX", "zip": "76102", "population": 792727},
        {"city": "Fresno", "state": "CA", "zip": "93721", "population": 509924},
        {"city": "Houston", "state": "TX", "zip": "77002", "population": 2296224},
        {"city": "Indianapolis", "state": "IN", "zip": "46204", "population": 843393},
        {"city": "Jacksonville", "state": "FL", "zip": "32202", "population": 842583},
        {"city": "Kansas City", "state": "MO", "zip": "64102", "population": 467007},
        {"city": "Las Vegas", "state": "NV", "zip": "89101", "population": 603488},
        {"city": "Long Beach", "state": "CA", "zip": "90802", "population": 462257},
        {"city": "Los Angeles", "state": "CA", "zip": "90001", "population": 3971883},
        {"city": "Louisville", "state": "KY", "zip": "40202", "population": 609893},
        {"city": "Memphis", "state": "TN", "zip": "38103", "population": 653450},
        {"city": "Mesa", "state": "AZ", "zip": "85201", "population": 508958},
        {"city": "Miami", "state": "FL", "zip": "33128", "population": 463347},
        {"city": "Milwaukee", "state": "WI", "zip": "53202", "population": 594833},
        {"city": "Minneapolis", "state": "MN", "zip": "55402", "population": 410939},
        {"city": "Nashville", "state": "TN", "zip": "37203", "population": 654610},
        {"city": "New York", "state": "NY", "zip": "10007", "population": 8405837},
        {"city": "Newark", "state": "NJ", "zip": "07102", "population": 281944},
        {"city": "Oakland", "state": "CA", "zip": "94607", "population": 406253},
        {"city": "Oklahoma City", "state": "OK", "zip": "73102", "population": 631346},
        {"city": "Omaha", "state": "NE", "zip": "68102", "population": 434353},
        {"city": "Philadelphia", "state": "PA", "zip": "19107", "population": 1526006},
        {"city": "Phoenix", "state": "AZ", "zip": "85003", "population": 1445632},
        {"city": "Pittsburgh", "state": "PA", "zip": "15222", "population": 305841},
        {"city": "Portland", "state": "OR", "zip": "97201", "population": 609456},
        {"city": "Raleigh", "state": "NC", "zip": "27601", "population": 403892},
        {"city": "Sacramento", "state": "CA", "zip": "95814", "population": 479686},
        {"city": "San Antonio", "state": "TX", "zip": "78205", "population": 1327407},
        {"city": "San Diego", "state": "CA", "zip": "92101", "population": 1307402},
        {"city": "San Francisco", "state": "CA", "zip": "94102", "population": 805235},
        {"city": "San Jose", "state": "CA", "zip": "95113", "population": 998537},
        {"city": "Seattle", "state": "WA", "zip": "98101", "population": 608660},
        {"city": "St. Louis", "state": "MO", "zip": "63102", "population": 319294},
        {"city": "Tampa", "state": "FL", "zip": "33602", "population": 335709},
        {"city": "Tucson", "state": "AZ", "zip": "85701", "population": 520116},
        {"city": "Virginia Beach", "state": "VA", "zip": "23451", "population": 448479},
        {"city": "Washington", "state": "DC", "zip": "20001", "population": 601723},
    ]
    return cities
# write a function to return a random city
# accept a random value between 0 and 1 as an input parameter
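def get_city(rnd_value):
    for city in cities_final:
        if rnd_value >= city["pcnt_running_total"]:
            return city
    # floating-point rounding can leave the last threshold slightly above 0,
    # so fall back to the last (smallest) city instead of returning None
    return cities_final[-1]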
# Write a function to return a random property type.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 63% Single-family, 26% Multi-family, 4% Condo,
# 3% Townhouse, 2% Mobile home, 1% Farm, 1% Other.
def get_property_type(rnd_value):
    if rnd_value < 0.63:
        return "Single-family"
    elif rnd_value < 0.89:
        return "Multi-family"
    elif rnd_value < 0.93:
        return "Condo"
    elif rnd_value < 0.96:
        return "Townhouse"
    elif rnd_value < 0.98:
        return "Mobile home"
    elif rnd_value < 0.99:
        return "Farm"
    else:
        return "Other"
# Create a function to write the address records to a csv file called 'address_data.csv'.
# Use an input parameter to specify the number of records to write.
# The csv file must have a header row and be comma delimited.
# All string values must be enclosed in double quotes.
def write_data(rec_count):
    address_id = 0
    with open("output/address_data.csv", "w", newline="") as csv_file:
        csv_writer = csv.writer(
            csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
        )
        csv_writer.writerow(
            [
                "id",
                "address",
                "city",
                "state",
                "zip",
                "country",
                "property_type",
                "assessed_value",
            ]
        )
        for i in range(rec_count):
            address_id += 1
            street_name = get_street_name()
            street_type = get_street_type()
            street_address = f"{random.randint(1, 9999)} {street_name} {street_type}"
            city_state_zip = get_city(random.random())
            country = "United States"
            property_type = get_property_type(random.random())
            assessed_value = random.randint(52500, 1950000)
            csv_writer.writerow(
                [
                    address_id,
                    street_address,
                    city_state_zip["city"],
                    city_state_zip["state"],
                    city_state_zip["zip"],
                    country,
                    property_type,
                    assessed_value,
                ]
            )
if __name__ == "__main__":
    main()

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of accurately reflecting assessed property values based on the type of residence and the zip code.
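One way this could look is sketched below. It is only an illustration of the idea: the value ranges per property type and the zip code multipliers are hypothetical placeholders rather than real market data, and the function name get_assessed_value is an assumption, not part of the program above.

```python
import random

# Hypothetical (low, high) assessed-value ranges per property type.
# These figures are placeholders for illustration, not real market data.
VALUE_RANGES = {
    "Single-family": (150_000, 900_000),
    "Multi-family": (300_000, 1_950_000),
    "Condo": (100_000, 600_000),
    "Townhouse": (120_000, 700_000),
    "Mobile home": (52_500, 150_000),
    "Farm": (200_000, 1_500_000),
    "Other": (52_500, 1_950_000),
}

# Hypothetical multipliers keyed by zip code; unknown zip codes fall back to 1.0
ZIP_MULTIPLIERS = {"10007": 1.5, "94102": 1.6, "14202": 0.7}


def get_assessed_value(property_type, zip_code, rng=random):
    """Return an assessed value drawn from the property type's range,
    scaled by the zip code's multiplier."""
    low, high = VALUE_RANGES.get(property_type, VALUE_RANGES["Other"])
    base_value = rng.randint(low, high)
    return int(base_value * ZIP_MULTIPLIERS.get(zip_code, 1.0))


print(get_assessed_value("Condo", "10007"))
```

In the program above, a call like this could stand in for the single random.randint(52500, 1950000) call in write_data.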

Here is an example of the synthetic US-based residential address data output by the example application:

id | address | city | state | zip | country | property_type | assessed_value
1 | 1008 Walk Burg | Houston | TX | 77002 | United States | Multi-family | 1122321
2 | 7088 Second Square | Oklahoma City | OK | 73102 | United States | Single-family | 261940
3 | 1425 Ridge Terrace | Indianapolis | IN | 46204 | United States | Single-family | 1030391
4 | 982 Way Lane | New York | NY | 10007 | United States | Multi-family | 95499
5 | 9404 Port Court | Columbus | OH | 43215 | United States | Single-family | 922404
6 | 7135 Crossing Trail | Virginia Beach | VA | 23451 | United States | Single-family | 272910
7 | 9481 Harbor Brook | New York | NY | 10007 | United States | Multi-family | 232795
8 | 8585 Manor Branch | Raleigh | NC | 27601 | United States | Single-family | 701217
9 | 7703 Bluff Boulevard | Las Vegas | NV | 89101 | United States | Single-family | 530581
10 | 4186 Worth Circle | New York | NY | 10007 | United States | Townhouse | 626577
11 | 8739 Prairie Trail | Portland | OR | 97201 | United States | Single-family | 316167
12 | 4 Shores Road | Virginia Beach | VA | 23451 | United States | Single-family | 925711
13 | 3148 Rocks Road | Mesa | AZ | 85201 | United States | Single-family | 182479
14 | 1545 Wells Drive | Chicago | IL | 60602 | United States | Multi-family | 1000777
15 | 1699 Ash Brook | Seattle | WA | 98101 | United States | Single-family | 605392

Example #3: Demographic Data

We could use the same techniques again to generate synthetic demographic data. With the assistance of Copilot, we can write functions that randomly return typical feminine or masculine first names (forenames) and common last names (surnames) found in the United States, for this use case.

# Write a function that generates a list of common feminine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_feminine():
    first_name_feminine = [
        "Alice", "Amanda", "Amy", "Angela", "Ann", "Anna", "Barbara", "Betty",
        "Brenda", "Carol", "Carolyn", "Catherine", "Christine", "Cynthia", "Deborah", "Debra",
        "Diane", "Donna", "Doris", "Dorothy", "Elizabeth", "Frances", "Gloria", "Heather",
        "Helen", "Janet", "Jennifer", "Jessica", "Joyce", "Julie", "Karen", "Kathleen", "Kimberly",
    ]

    return random.choice(first_name_feminine)

With the assistance of Copilot, we can also write functions that return demographic information, such as age, gender, race, marital status, religion, and political affiliation. Similar to the previous sales data example, we can influence the final synthetic dataset based on categorical distributions of different demographic categories, for instance, with the prompt:

# Write a function that returns a person's marital status.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 50% Married, 33% Single, 17% Unknown.
def get_martial_status(rnd_value):
    if rnd_value < 0.50:
        return "Married"
    elif rnd_value < 0.83:
        return "Single"
    else:
        return "Unknown"

By altering the categorical distributions, we can quickly alter the resulting synthetic dataset to reflect differing demographic characteristics: an older or younger population, the predominance of a single race, religious affiliation, or marital status, or the ratio of males to females.
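To make such changes a matter of editing data rather than code, the chain of if/elif thresholds can be replaced by a lookup over cumulative probabilities. The following is a minimal sketch under that assumption; pick_category and the younger_population distribution are hypothetical names and values introduced for illustration, not part of the Copilot-generated program.

```python
import random
from itertools import accumulate


def pick_category(rnd_value, distribution):
    """Map a value in [0, 1) to a category using cumulative probabilities."""
    categories = list(distribution)
    thresholds = list(accumulate(distribution.values()))
    for category, threshold in zip(categories, thresholds):
        if rnd_value < threshold:
            return category
    # Fall back to the last category if rounding leaves the total just under 1
    return categories[-1]


# Distribution used in the article's marital status function
marital_status = {"Married": 0.50, "Single": 0.33, "Unknown": 0.17}
# Hypothetical alternative, e.g. for a younger population
younger_population = {"Married": 0.30, "Single": 0.60, "Unknown": 0.10}

rnd_value = random.random()
print(pick_category(rnd_value, marital_status))
print(pick_category(rnd_value, younger_population))
```

Passing the same random value to both distributions shows how one draw can land in different categories once the percentages change.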

Next, we can use a Gaussian distribution (aka normal distribution) to return the year of birth in a bell-shaped curve, given a mean year and a standard deviation, using Python’s random.normalvariate function.

# Write a function that generates a normal distribution of date of births.
# with a mean year of 1975 and a standard deviation of 10.
# Return random date of birth as a string in the format YYYY-MM-DD
def get_dob():
    day_of_year = random.randint(1, 365)
    year_of_birth = int(random.normalvariate(1975, 10))
    dob = date(int(year_of_birth), 1, 1) + timedelta(day_of_year - 1)
    dob = dob.strftime("%Y-%m-%d")
    return dob
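A normal distribution has no hard limits, so a rare draw can land several standard deviations from the mean year. The variant below is a sketch of one way to bound the result and to account for leap years; the function name get_dob_bounded and the min_year and max_year defaults are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta


def get_dob_bounded(mean_year=1975, std_dev=10, min_year=1925, max_year=2005, rng=random):
    """Return a date of birth (YYYY-MM-DD) with a normally distributed year."""
    year = round(rng.normalvariate(mean_year, std_dev))
    year = max(min_year, min(max_year, year))  # clamp rare outliers
    first_day = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - first_day).days  # 365 or 366
    dob = first_day + timedelta(days=rng.randrange(days_in_year))
    return dob.strftime("%Y-%m-%d")


print(get_dob_bounded())
```

Clamping piles the rare outliers onto the boundary years; redrawing until the year falls inside the range would be an alternative if that matters for the dataset.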

Here is the complete program that creates synthetic demographic data with Copilot’s assistance:

# Purpose: Generate demographic data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-14
# Usage: python3 demographic_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)
# Write an application that creates a file containing demographic data.
# The application should accept a command line argument that specifies the number of records to generate.
# The application should write the demographic data to a file called 'demographic_data.csv'.
# The application should contain the following functions:
# – main() function that calls the other functions
# – function that returns a random first name
# – function that returns a random last name
# – function that returns a random date of birth
# – function that returns a random gender
# – function that returns a random religious affiliation
# – function that returns a random race
import random
import argparse
import csv
from datetime import date, timedelta
def main():
    parser = argparse.ArgumentParser(description="Generate demographic data")
    # nargs="?" makes the positional argument optional, so the default applies
    parser.add_argument(
        "rec_count",
        type=int,
        nargs="?",
        default=100,
        help="The number of records to generate",
    )
    rec_count = parser.parse_args().rec_count
    write_data(rec_count)
# Write a function that generates a list of common feminine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_feminine():
    first_name_feminine = [
        "Alice", "Amanda", "Amy", "Angela", "Ann", "Anna", "Barbara", "Betty",
        "Brenda", "Carol", "Carolyn", "Catherine", "Christine", "Cynthia", "Deborah", "Debra",
        "Diane", "Donna", "Doris", "Dorothy", "Elizabeth", "Frances", "Gloria", "Heather",
        "Helen", "Janet", "Jennifer", "Jessica", "Joyce", "Julie", "Karen", "Kathleen", "Kimberly",
        "Laura", "Linda", "Lisa", "Margaret", "Maria", "Marie", "Martha", "Mary", "Melissa",
        "Michelle", "Nancy", "Pamela", "Patricia", "Rebecca", "Ruth", "Sandra", "Sarah", "Sharon",
        "Shirley", "Stephanie", "Susan", "Teresa", "Virginia",
    ]
    return random.choice(first_name_feminine)
# Write a function that generates a list of common masculine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_masculine():
    # NOTE: this generated list actually contains common surnames,
    # so surnames are returned as masculine first names.
    first_names_masculine = [
        "Adams", "Alexander", "Allen", "Anderson", "Bailey", "Baker", "Barnes", "Bell",
        "Bennett", "Brooks", "Brown", "Bryant", "Butler", "Campbell", "Carter", "Clark",
        "Coleman", "Collins", "Cook", "Cooper", "Cox", "Davis", "Edwards", "Evans",
        "Flores", "Foster", "Garcia", "Gonzales", "Gonzalez", "Gray", "Green", "Griffin",
        "Hall", "Harris", "Henderson", "Hernandez", "Hill", "Howard", "Hughes", "Jackson",
        "James", "Jenkins", "Johnson", "Jones", "Kelly", "King", "Lee", "Lewis", "Long",
        "Lopez", "Martin", "Martinez", "Miller", "Mitchell", "Moore", "Morgan", "Morris",
        "Murphy", "Nelson", "Parker", "Patterson", "Perez", "Perry", "Peterson", "Phillips",
        "Powell", "Price", "Ramirez", "Reed", "Richardson", "Rivera", "Roberts", "Robinson",
        "Rodriguez", "Rogers", "Ross", "Russell", "Sanchez", "Sanders", "Scott", "Simmons",
        "Smith", "Stewart", "Taylor", "Thomas", "Thompson", "Torres", "Turner", "Walker",
        "Ward", "Washington", "Watson", "White", "Williams", "Wilson", "Wood", "Wright", "Young"
    ]
    return random.choice(first_names_masculine)
# Write a function that returns a person's gender.
# Return a random gender.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 53% Male, 40% Female, 6% Other, 1% Transgender
def get_gender(rnd_value):
    if rnd_value < 0.53:
        return "Male"
    elif rnd_value < 0.93:
        return "Female"
    elif rnd_value < 0.99:
        return "Other"
    else:
        return "Transgender"
# Write a function that returns a feminine or masculine first name.
# Return random first name.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 53% chance of being masculine, 40% chance of being feminine;
# the remaining 7% is split between masculine (4%) and feminine (3%) names
def get_first_name(rnd_value):
    if rnd_value < 0.53:
        return get_first_name_masculine()
    elif rnd_value < 0.93:
        return get_first_name_feminine()
    elif rnd_value < 0.97:
        return get_first_name_masculine()
    else:
        return get_first_name_feminine()
# Write a function that generates a list of common last names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random last name.
def get_last_name():
    last_names = [
        "Adams", "Alexander", "Allen", "Anderson", "Bailey", "Baker", "Barnes", "Bell",
        "Bennett", "Brooks", "Brown", "Bryant", "Butler", "Campbell", "Carter", "Clark",
        "Coleman", "Collins", "Cook", "Cooper", "Cox", "Davis", "Edwards", "Evans", "Flores",
        "Foster", "Garcia", "Gonzales", "Gonzalez", "Gray", "Green", "Griffin", "Hall",
        "Harris", "Henderson", "Hernandez", "Hill", "Howard", "Hughes", "Jackson", "James",
        "Jenkins", "Johnson", "Jones", "Kelly", "King", "Lee", "Lewis", "Long", "Lopez",
        "Martin", "Martinez", "Miller", "Mitchell", "Moore", "Morgan", "Morris", "Murphy",
        "Nelson", "Parker", "Patterson", "Perez", "Perry", "Peterson", "Phillips", "Powell",
        "Price", "Ramirez", "Reed", "Richardson", "Rivera", "Roberts", "Robinson", "Rodriguez",
        "Rogers", "Ross", "Russell", "Sanchez", "Sanders", "Scott", "Simmons", "Smith", "Stewart",
        "Taylor", "Thomas", "Thompson", "Torres", "Turner", "Walker", "Ward", "Washington", "Watson",
        "White", "Williams", "Wilson", "Wood", "Wright", "Young",
    ]
    return random.choice(last_names)
# Write a function that returns a marital status.
# Return random marital status.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 50% Married, 33% Single, 17% Unknown
def get_martial_status(rnd_value):
    if rnd_value < 0.50:
        return "Married"
    elif rnd_value < 0.83:
        return "Single"
    else:
        return "Unknown"
# Write a function that returns a person's race.
# Return random race.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 58% White, 19% Hispanic, 12% Black, 6% Asian, 4% Multiracial
def get_race(rnd_value):
    if rnd_value < 0.58:
        return "White"
    elif rnd_value < 0.77:
        return "Hispanic"
    elif rnd_value < 0.89:
        return "Black"
    elif rnd_value < 0.95:
        return "Asian"
    else:
        return "Multiracial"
# Write a function that returns a person's religious affiliation.
# Return random religion.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 70% Christian, 20% Agnostic, 3% Atheist, 2% Jewish,
# 2% Other, 1% Muslim, 1% Hindu, 1% Buddhist
def get_regilion(rnd_value):
    if rnd_value < 0.7:
        return "Christian"
    elif rnd_value < 0.9:
        return "Agnostic"
    elif rnd_value < 0.93:
        return "Atheist"
    elif rnd_value < 0.95:
        return "Jewish"
    elif rnd_value < 0.97:
        return "Other"
    elif rnd_value < 0.98:
        return "Muslim"
    elif rnd_value < 0.99:
        return "Hindu"
    else:
        return "Buddhist"
# Write a function that generates a normal distribution of ages.
# Using normalvariate() from the random module
# with a mean of 40 and a standard deviation of 10.
# Return random age as an integer.
def get_age():
    return int(random.normalvariate(40, 10))
# Write a function that generates a normal distribution of date of births.
# with a mean year of 1975 and a standard deviation of 10.
# Return random date of birth as a string in the format YYYY-MM-DD
def get_dob():
    day_of_year = random.randint(1, 365)
    year_of_birth = int(random.normalvariate(1975, 10))
    dob = date(int(year_of_birth), 1, 1) + timedelta(day_of_year - 1)
    dob = dob.strftime("%Y-%m-%d")
    return dob
# Create a function to write the demographic records to a csv file called 'demographic_data.csv'.
# Use an input parameter to specify the number of records to write.
# The csv file must have a header row and be comma delimited.
# String values must be enclosed in double quotes.
def write_data(rec_count):
    id = 0
    with open("output/demographic_data.csv", "w", newline="") as csv_file:
        csv_writer = csv.writer(
            csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
        )
        csv_writer.writerow(
            [
                "id",
                "first_name",
                "last_name",
                "dob",
                "gender",
                "marital_status",
                "race",
                "religion"
            ]
        )
        for i in range(rec_count):
            # reuse one random value so the first name and gender stay consistent
            rnd_gender = random.random()
            id += 1
            first_name = get_first_name(rnd_gender)
            last_name = get_last_name()
            dob = get_dob()
            gender = get_gender(rnd_gender)
            martial_status = get_martial_status(random.random())
            race = get_race(random.random())
            religion = get_regilion(random.random())
            csv_writer.writerow(
                [
                    id,
                    first_name,
                    last_name,
                    dob,
                    gender,
                    martial_status,
                    race,
                    religion
                ]
            )
if __name__ == "__main__":
    main()

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of more accurately representing the nuanced associations and correlations between age, gender, race, marital status, religion, and political affiliation.
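As a small illustration of such a correlation, the probability of one attribute can be made to depend on another. The sketch below conditions marital status on age. The age brackets and probabilities are hypothetical placeholders, not census figures, the function name is an assumption, and the "Unknown" category from the program above is left out to keep the example short.

```python
import random

# Hypothetical probability of being married for each age bracket,
# expressed as (exclusive upper age bound, probability). Placeholder values.
MARRIED_BY_AGE = [(25, 0.10), (35, 0.45), (50, 0.60), (200, 0.55)]


def get_marital_status_for_age(age, rnd_value):
    """Return 'Married' or 'Single' using the probability for the age bracket."""
    for upper_age, p_married in MARRIED_BY_AGE:
        if age < upper_age:
            return "Married" if rnd_value < p_married else "Single"
    return "Single"


print(get_marital_status_for_age(42, random.random()))
```

The same pattern, a lookup of the distribution followed by a single random draw, extends to other pairs of attributes.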

Here is an example of the synthetic demographic data output by the example application:

user_id | first_name | last_name | dob | gender | marital_status | race | religion
1 | Thomas | Powell | 1967-06-10 | Male | Married | Black | Christian
2 | Ward | Williams | 1973-07-22 | Male | Single | Asian | Christian
3 | Martha | Watson | 1975-02-28 | Female | Single | Hispanic | Agnostic
4 | Brenda | Bailey | 1979-07-07 | Female | Married | Black | Christian
5 | Parker | Johnson | 1955-07-14 | Male | Married | White | Christian
6 | Rebecca | Wilson | 1972-05-27 | Female | Married | White | Christian
7 | Doris | Allen | 1956-07-09 | Female | Married | Multiracial | Christian
8 | Rebecca | Sanchez | 1965-09-16 | Female | Single | White | Christian
9 | Mary | Johnson | 1971-04-04 | Female | Single | White | Christian
10 | Anderson | Roberts | 1983-11-02 | Male | Single | Hispanic | Christian
11 | Robinson | Peterson | 1974-10-25 | Male | Single | White | Jewish
12 | Lopez | Ross | 1985-04-30 | Male | Married | White | Christian
13 | Flores | Reed | 1977-09-01 | Male | Married | Black | Agnostic
14 | Martin | Phillips | 1960-03-24 | Male | Single | White | Christian
15 | Carter | Torres | 1983-07-31 | Male | Married | White | Christian

Generative AI Tools for Unit Testing

In addition to writing code and documentation, a common use of generative AI code assistants like Copilot is writing unit tests. For example, we can create unit tests for each function in our coffee shop sales data code generator, using the same method of prompting with code comments.

Copilot can also assist with writing unit tests
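To give a sense of what such tests might look like, here is a minimal sketch using Python’s built-in unittest module. It targets the get_property_type function from the address example rather than the coffee shop code, and it was written by hand for illustration; it is not Copilot output, and the test class and method names are assumptions.

```python
import unittest


def get_property_type(rnd_value):
    # Copied from the address data example so the sketch is self-contained
    if rnd_value < 0.63:
        return "Single-family"
    elif rnd_value < 0.89:
        return "Multi-family"
    elif rnd_value < 0.93:
        return "Condo"
    elif rnd_value < 0.96:
        return "Townhouse"
    elif rnd_value < 0.98:
        return "Mobile home"
    elif rnd_value < 0.99:
        return "Farm"
    else:
        return "Other"


class TestGetPropertyType(unittest.TestCase):
    def test_lower_bound_is_single_family(self):
        self.assertEqual(get_property_type(0.0), "Single-family")

    def test_category_boundaries(self):
        # Each threshold value belongs to the next category (strict "<")
        cases = [
            (0.62, "Single-family"),
            (0.63, "Multi-family"),
            (0.89, "Condo"),
            (0.93, "Townhouse"),
            (0.96, "Mobile home"),
            (0.98, "Farm"),
            (0.99, "Other"),
        ]
        for rnd_value, expected in cases:
            with self.subTest(rnd_value=rnd_value):
                self.assertEqual(get_property_type(rnd_value), expected)
```

Saved in a file such as test_property_type.py (a hypothetical name), the tests can be run with python3 -m unittest.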

Conclusion

In this post, we learned how Generative AI could assist us in creating synthetic data for software development, analytics, and machine learning. The examples herein generated data using simple techniques. Using advanced modeling techniques, we could generate increasingly complex, realistic synthetic data.

To learn about other ways Generative AI can be used to assist in writing code, please read my previous article, Ten Ways to Leverage Generative AI for Development on AWS.

🔔 To keep up with future content, follow Gary Stafford on LinkedIn.


This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.


Managing AWS Infrastructure as Code using Ansible, CloudFormation, and CodeBuild

Introduction

When it comes to provisioning and configuring resources on the AWS cloud platform, there is a wide variety of services, tools, and workflows you could choose from. You could decide to exclusively use the cloud-based services provided by AWS, such as CodeBuild, CodePipeline, CodeStar, and OpsWorks. Alternatively, you could choose open-source software (OSS) for provisioning and configuring AWS resources, such as community editions of Jenkins, HashiCorp Terraform, Pulumi, Chef, and Puppet. You might also choose to use licensed products, such as Octopus Deploy, TeamCity, CloudBees Core, Travis CI Enterprise, and XebiaLabs XL Release. You might even decide to write your own custom tools or scripts in Python, Go, JavaScript, Bash, or other common languages.

In reality, in most enterprises I have worked with, teams integrate a combination of AWS services, open-source software, custom scripts, and occasionally licensed products to construct complete, end-to-end, infrastructure as code-based workflows for provisioning and configuring AWS resources. Choices are most often based on team experience, vendor relationships, and an enterprise’s specific business use cases.

In the following post, we will explore one such set of easily integrated tools for provisioning and configuring AWS resources. The tool stack is composed of Red Hat Ansible, AWS CloudFormation, and AWS CodeBuild, along with several complementary AWS technologies. Using these tools, we will provision a relatively simple AWS environment, then deploy, configure, and test a highly available set of Apache HTTP Servers. The demonstration is similar to the one featured in a previous post, Getting Started with Red Hat Ansible for Google Cloud Platform.


Why Ansible?

With its simplicity, ease of use, and broad compatibility with most major cloud, database, network, storage, and identity providers, among other categories, Ansible has been a popular choice of engineering teams for configuration management since 2012. Given the wide variety of polyglot technologies used within modern enterprises and the growing predominance of multi-cloud and hybrid cloud architectures, Ansible provides a common platform for enabling mature DevOps and infrastructure as code practices. Ansible is easily integrated with higher-level orchestration systems, such as AWS CodeBuild, Jenkins, or Red Hat AWX and Tower.

Technologies

The primary technologies used in this post include the following.

Red Hat Ansible

Ansible, purchased by Red Hat in October 2015, seamlessly provides workflow orchestration with configuration management, provisioning, and application deployment in a single platform. Unlike similar tools, Ansible’s workflow automation is agentless, relying on Secure Shell (SSH) and Windows Remote Management (WinRM). If you are interested in learning more about the advantages of Ansible, Red Hat has published a whitepaper on The Benefits of Agentless Architecture.

According to G2 Crowd, Ansible is a clear leader in the Configuration Management Software category, ranked right behind GitLab. Competitors in the category include GitLab, AWS Config, Puppet, Chef, Codenvy, HashiCorp Terraform, Octopus Deploy, and JetBrains TeamCity.

AWS CloudFormation


According to AWS, CloudFormation provides a common language to describe and provision all the infrastructure resources within AWS-based cloud environments. CloudFormation allows you to use a JSON- or YAML-based template to model and provision, in an automated and secure manner, all the resources needed for your applications across all AWS regions and accounts.

Codifying your infrastructure, often referred to as ‘Infrastructure as Code,’ allows you to treat your infrastructure as just code. You can author it with any IDE, check it into a version control system, and review the files with team members before deploying it.
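To make the idea concrete, the sketch below builds a very small JSON template in Python and prints it. It is an illustrative example only and is not taken from the ansible-aws-cfn repository; the logical resource name TemplateBucket and the parameter name BucketName are assumptions.

```python
import json

# A minimal, hypothetical CloudFormation template describing one S3 bucket
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Hypothetical example: a single S3 bucket",
    "Parameters": {
        "BucketName": {
            "Type": "String",
            "Description": "Name of the bucket to create",
        }
    },
    "Resources": {
        "TemplateBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": {"Ref": "BucketName"}},
        }
    },
}

# Serialize the template so it can be saved to a file and version controlled
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is plain text, it can be committed, diffed, and reviewed like any other source file, and checked with a tool such as the CloudFormation Linter before deployment.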

AWS CodeBuild

According to AWS, CodeBuild is a fully managed continuous integration service that compiles your source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue.

CodeBuild integrates seamlessly with other AWS Developer tools, including CodeStar, CodeCommit, CodeDeploy, and CodePipeline.

According to G2 Crowd, the main competitors to AWS CodeBuild, in the Build Automation Software category, include Jenkins, CircleCI, CloudBees Core and CodeShip, Travis CI, JetBrains TeamCity, and Atlassian Bamboo.

Other Technologies

In addition to the major technologies noted above, we will also be leveraging the following services and tools to a lesser extent, in the demonstration:

  • AWS CodeCommit
  • AWS CodePipeline
  • AWS Systems Manager Parameter Store
  • Amazon Simple Storage Service (S3)
  • AWS Identity and Access Management (IAM)
  • AWS Command Line Interface (CLI)
  • CloudFormation Linter
  • Apache HTTP Server

Demonstration

Source Code

All source code for this post is contained in two GitHub repositories. The CloudFormation templates and associated files are in the ansible-aws-cfn GitHub repository. The Ansible Roles and related files are in the ansible-aws-roles GitHub repository. Both repositories may be cloned using the following commands.

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-aws-cfn.git

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-aws-roles.git

Development Process

The general process we will follow for provisioning and configuring resources in this demonstration is as follows:

  • Create an S3 bucket to store the validated CloudFormation templates
  • Create an Amazon EC2 Key Pair for Ansible
  • Create two AWS CodeCommit Repositories to store the project’s source code
  • Put parameters in Parameter Store
  • Write and test the CloudFormation templates
  • Configure Ansible and AWS Dynamic Inventory script
  • Write and test the Ansible Roles and Playbooks
  • Write the CodeBuild build specification files
  • Create an IAM Role for CodeBuild and CodePipeline
  • Create and test CodeBuild Projects and CodePipeline Pipelines
  • Provision, deploy, and configure the complete web platform to AWS
  • Test the final web platform

Prerequisites

For this demonstration, I will assume you already have an AWS account, the AWS CLI, Python, and Ansible installed locally, an S3 bucket to store the final CloudFormation templates and an Amazon EC2 Key Pair for Ansible to use for SSH.

Continuous Integration and Delivery Overview

In this demonstration, we will be building multiple CI/CD pipelines for provisioning and configuring our resources to AWS, using several AWS services. These services include CodeCommit, CodeBuild, CodePipeline, Systems Manager Parameter Store, and Amazon Simple Storage Service (S3). The diagram below shows the complete CI/CD workflow we will build using these AWS services, along with Ansible.

aws_devops

AWS CodeCommit

According to Amazon, AWS CodeCommit is a fully-managed source control service that makes it easy to host secure and highly scalable private Git repositories. CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure.

Start by creating two AWS CodeCommit repositories to hold the two GitHub projects you cloned earlier. Commit both projects to your own AWS CodeCommit repositories.

screen_shot_2019-07-26_at_9_02_54_pm

Configuration Management

We have several options for storing the configuration values necessary to provision and configure the resources on AWS. We could set configuration values as environment variables directly in CodeBuild. We could set configuration values from within our Ansible Roles. We could use AWS Systems Manager Parameter Store to store configuration values. For this demonstration, we will use a combination of all three options.

AWS Systems Manager Parameter Store

According to Amazon, AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as passwords, database strings, and license codes as parameter values, as either plain text or encrypted.

The demonstration uses two CloudFormation templates. The two templates have several parameters. A majority of those parameter values will be stored in Parameter Store, retrieved by CodeBuild, and injected into the CloudFormation template during provisioning.

screen_shot_2019-07-26_at_9_38_33_pm

The Ansible GitHub project includes a shell script, parameter_store_values.sh, to put the necessary parameters into Parameter Store. The script requires the AWS Command Line Interface (CLI) to be installed locally. You will need to change the KEY_PATH value in the script (snippet shown below) to match the location of your private key, part of the Amazon EC2 Key Pair you created earlier for use by Ansible.

KEY_PATH="/path/to/private/key"

# put encrypted parameter to Parameter Store
aws ssm put-parameter \
  --name $PARAMETER_PATH/ansible_private_key \
  --type SecureString \
  --value "file://${KEY_PATH}" \
  --description "Ansible private key for EC2 instances" \
  --overwrite

SecureString

Whereas all other parameters are stored in Parameter Store as String datatypes, the private key is stored as a SecureString datatype. Parameter Store uses an AWS Key Management Service (KMS) customer master key (CMK) to encrypt the SecureString parameter value. The IAM Role used by CodeBuild (discussed later) will have the correct permissions to use the KMS key to retrieve and decrypt the private key SecureString parameter value.

screen_shot_2019-07-26_at_9_41_42_pm
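
For readers who prefer Python to the shell script, the String versus SecureString distinction can be sketched as below. This is only an illustration, not part of the project: the helper names are hypothetical, and the dictionary keys follow the SSM PutParameter request fields (Name, Value, Type, Description, Overwrite). The /ansible_demo path matches the buildspec shown later in this post.

```python
from pathlib import Path


def build_put_parameter_request(parameter_path, name, value, secure=False, description=""):
    """Build keyword arguments for an SSM PutParameter call (hypothetical helper)."""
    request = {
        "Name": f"{parameter_path}/{name}",
        "Value": value,
        # SecureString values are encrypted with a KMS key; String values are not.
        "Type": "SecureString" if secure else "String",
        # Mirrors the --overwrite flag used in the shell script.
        "Overwrite": True,
    }
    if description:
        request["Description"] = description
    return request


def private_key_request(parameter_path, key_path):
    """Read the private key file and wrap its contents in a SecureString request."""
    key_material = Path(key_path).read_text()
    return build_put_parameter_request(
        parameter_path,
        "ansible_private_key",
        key_material,
        secure=True,
        description="Ansible private key for EC2 instances",
    )
```

With boto3 installed and credentials configured, the resulting dictionary could be passed to the SSM client, for example boto3.client("ssm").put_parameter(**request).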

CloudFormation

The demonstration uses two CloudFormation templates. The first template, network-stack.template, contains the AWS network stack resources. The template includes one VPC, one Internet Gateway, two NAT Gateways, four Subnets, two Elastic IP Addresses, and associated Route Tables and Security Groups. The second template, compute-stack.template, contains the webserver compute stack resources. The template includes an Auto Scaling Group, Launch Configuration, Application Load Balancer (ALB), ALB Listener, ALB Target Group, and an Instance Security Group. Both templates originated from the AWS CloudFormation template sample library, and were modified for this demonstration.

The two templates are located in the cfn_templates directory of the CloudFormation project, as shown below in the tree view.

.
├── LICENSE.md
├── README.md
├── buildspec_files
│   ├── build.sh
│   └── buildspec.yml
├── cfn_templates
│   ├── compute-stack.template
│   └── network-stack.template
├── codebuild_projects
│   ├── build.sh
│   └── cfn-validate-s3.json
├── codepipeline_pipelines
│   ├── build.sh
│   └── cfn-validate-s3.json
└── requirements.txt

The templates require no modifications for the demonstration. All parameters are in Parameter Store or set by the Ansible Roles, and consumed by the Ansible Playbooks via CodeBuild.

Ansible

We will use Red Hat Ansible to provision the network and compute resources by interacting directly with CloudFormation, deploy and configure Apache HTTP Server, and finally, perform final integration tests of the system. In my opinion, the closest equivalent to Ansible on the AWS platform is AWS OpsWorks. OpsWorks lets you use Chef and Puppet (direct competitors to Ansible) to automate how servers are configured, deployed, and managed across Amazon EC2 instances or on-premises compute environments.

Ansible Config

To use Ansible with AWS and CloudFormation, you will first want to customize your project’s ansible.cfg file to enable the aws_ec2 inventory plugin. Below is part of my configuration file as a reference.

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 300

host_key_checking = False
roles_path = roles
inventory = inventories/hosts
remote_user = ec2-user
private_key_file = ~/.ssh/ansible

[inventory]
enable_plugins = host_list, script, yaml, ini, auto, aws_ec2

Ansible Roles

According to Ansible, Roles are ways of automatically loading certain variable files, tasks, and handlers based on a known file structure. Grouping content by roles also allows easy sharing of roles with other users. For the demonstration, I have written four roles, located in the roles directory, as shown below in the project tree view. The default, common role is not used in this demonstration.

.
├── LICENSE.md
├── README.md
├── ansible.cfg
├── buildspec_files
│   ├── buildspec_compute.yml
│   ├── buildspec_integration_tests.yml
│   ├── buildspec_network.yml
│   └── buildspec_web_config.yml
├── codebuild_projects
│   ├── ansible-test.json
│   ├── ansible-web-config.json
│   ├── build.sh
│   ├── cfn-compute.json
│   ├── cfn-network.json
│   └── notes.md
├── filter_plugins
├── group_vars
├── host_vars
├── inventories
│   ├── aws_ec2.yml
│   ├── ec2.ini
│   ├── ec2.py
│   └── hosts
├── library
├── module_utils
├── notes.md
├── parameter_store_values.sh
├── playbooks
│   ├── 10_cfn_network.yml
│   ├── 20_cfn_compute.yml
│   ├── 30_web_config.yml
│   └── 40_integration_tests.yml
├── production
├── requirements.txt
├── roles
│   ├── cfn_compute
│   ├── cfn_network
│   ├── common
│   ├── httpd
│   └── integration_tests
├── site.yml
└── staging

The four roles are the cfn_network role, for provisioning the network; the cfn_compute role, for configuring the compute resources; the httpd role, for deploying and configuring the Apache servers; and the integration_tests role, for performing final integration tests of the platform. The individual roles help separate the project’s major parts (network, compute, and middleware) into logical code files. Each role was initially built using Ansible Galaxy (ansible-galaxy init). They follow Galaxy’s standard file structure, as shown in the tree view of the cfn_network role below.

.
├── README.md
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   ├── create.yml
│   ├── delete.yml
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml

Testing Ansible Roles

In addition to checking each role during development and on each code commit with Ansible Lint, each role contains a set of unit tests, in the tests directory, to confirm the success or failure of the role’s tasks. Below we see a basic set of tests for the cfn_compute role. First, we gather Facts about the deployed EC2 instances. Facts are information Ansible can automatically derive from your remote systems. We check the facts for expected properties of the running EC2 instances, including timezone, Operating System, major OS version, and the UserID. Note the use of the failed_when conditional. This Ansible playbook error handling conditional is used to confirm the success or failure of tasks.

---
- name: Test cfn_compute Ansible role
  gather_facts: True
  hosts: tag_Group_webservers

  pre_tasks:
  - name: List all ansible facts
    debug:
      msg: "{{ ansible_facts }}"

  tasks:
  - name: Check if EC2 instance's timezone is set to 'UTC'
    debug:
      msg: Timezone is UTC
    failed_when: ansible_facts['date_time']['tz'] != 'UTC'

  - name: Check if EC2 instance's OS is 'Amazon'
    debug:
      msg: OS is Amazon
    failed_when: ansible_facts['distribution_file_variety'] != 'Amazon'

  - name: Check if EC2 instance's OS major version is '2018'
    debug:
      msg: OS major version is 2018
    failed_when: ansible_facts['distribution_major_version'] != '2018'

  - name: Check if EC2 instance's UserID is 'ec2-user'
    debug:
      msg: UserID is ec2-user
    failed_when: ansible_facts['user_id'] != 'ec2-user'

If we were to run the tests on their own, against the two correctly provisioned and configured EC2 web servers, we would see results similar to the following.

screen_shot_2019-07-26_at_6_55_04_pm
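
The logic behind these failed_when checks can be sketched outside Ansible as a comparison of gathered facts against expected values. The Python below is an illustration only: the function name is hypothetical, and the facts dictionary is assumed to have the same shape as the ansible_facts keys used in the playbook above.

```python
# Expected values, taken from the failed_when conditions in the playbook above.
# Each key is the path into the facts dictionary.
EXPECTED_FACTS = {
    ("date_time", "tz"): "UTC",
    ("distribution_file_variety",): "Amazon",
    ("distribution_major_version",): "2018",
    ("user_id",): "ec2-user",
}


def failed_checks(facts):
    """Return (fact path, expected, actual) for every fact that does not match."""
    failures = []
    for path, expected in EXPECTED_FACTS.items():
        actual = facts
        for key in path:
            # Walk nested dictionaries; a missing key yields None.
            actual = actual.get(key) if isinstance(actual, dict) else None
        if actual != expected:
            failures.append((".".join(path), expected, actual))
    return failures
```

An empty list means every check passed, which corresponds to a play in which no task is marked as failed.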

In the cfn_network role unit tests, below, note the use of the Ansible cloudformation_facts module. This module allows us to obtain facts about the successfully completed AWS CloudFormation stack. We can then use these facts to drive additional provisioning and configuration, or testing. In the task below, we get the network CloudFormation stack’s Outputs. These are the exact same values we would see in the stack’s Output tab, from the AWS CloudFormation management console.

---
- name: Test cfn_network Ansible role
  gather_facts: False
  hosts: localhost

  pre_tasks:
    - name: Get facts about the newly created cfn network stack
      cloudformation_facts:
        stack_name: "ansible-cfn-demo-network"
      register: cfn_network_stack_facts

    - name: List 'stack_outputs' from cached facts
      debug:
        msg: "{{ cloudformation['ansible-cfn-demo-network'].stack_outputs }}"

  tasks:
  - name: Check if the AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}
    debug:
      msg: "AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}"
    failed_when: cloudformation['ansible-cfn-demo-network'].stack_outputs['VpcRegion'] != lookup('env','AWS_REGION')
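
The same check can be sketched in Python against the response shape returned by the CloudFormation DescribeStacks API (a Stacks list whose entries carry an Outputs list of OutputKey and OutputValue pairs). The function names here are hypothetical and the code is a sketch, not the role’s implementation.

```python
def stack_outputs(describe_stacks_response):
    """Flatten the first stack's Outputs list into a simple key/value dictionary."""
    stack = describe_stacks_response["Stacks"][0]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack.get("Outputs", [])
    }


def region_matches(outputs, expected_region):
    """Mirror the failed_when test: the VpcRegion output must equal the expected region."""
    return outputs.get("VpcRegion") == expected_region
```

With boto3, the response could come from boto3.client("cloudformation").describe_stacks(StackName="ansible-cfn-demo-network"), and the expected region from the AWS_REGION environment variable.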

Similar to the CloudFormation templates, the Ansible roles require no modifications. Most of the project’s parameters are decoupled from the code and stored in Parameter Store or CodeBuild buildspec files (discussed next). The few parameters found in the roles, in the defaults/main.yml files, are neither account- nor environment-specific.

Ansible Playbooks

The roles will be called by our Ansible Playbooks. There is a create and a delete set of tasks for the cfn_network and cfn_compute roles. Either create or delete tasks are accessible through the role, using the main.yml file and referencing the create or delete Ansible Tags.

---
- import_tasks: create.yml
  tags:
    - create

- import_tasks: delete.yml
  tags:
    - delete

Below, we see the create tasks for the cfn_network role, create.yml, referenced above by main.yml. The use of the cloudformation module in the first task allows us to create or delete AWS CloudFormation stacks and demonstrates the real power of Ansible: the ability to execute complex AWS resource provisioning by extending its core functionality via a module. By switching the Cloud module, we could just as easily provision resources on Google Cloud, Azure, AliCloud, OpenStack, or VMware, to name but a few.

---
- name: create a stack, pass in the template via an S3 URL
  cloudformation:
    stack_name: "{{ stack_name }}"
    state: present
    region: "{{ lookup('env','AWS_REGION') }}"
    disable_rollback: false
    template_url: "{{ lookup('env','TEMPLATE_URL') }}"
    template_parameters:
      VpcCIDR: "{{ lookup('env','VPC_CIDR') }}"
      PublicSubnet1CIDR: "{{ lookup('env','PUBLIC_SUBNET_1_CIDR') }}"
      PublicSubnet2CIDR: "{{ lookup('env','PUBLIC_SUBNET_2_CIDR') }}"
      PrivateSubnet1CIDR: "{{ lookup('env','PRIVATE_SUBNET_1_CIDR') }}"
      PrivateSubnet2CIDR: "{{ lookup('env','PRIVATE_SUBNET_2_CIDR') }}"
      TagEnv: "{{ lookup('env','TAG_ENVIRONMENT') }}"
    tags:
      Stack: "{{ stack_name }}"

The CloudFormation parameters in the above task are mainly derived from environment variables, whose values were retrieved from the Parameter Store by CodeBuild and set in the environment. We obtain these external values using Ansible’s Lookup Plugins. The stack_name variable’s value is derived from the role’s defaults/main.yml file. The task variables use the Python Jinja2 templating system style of encoding.

variables
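
To make the environment variable lookups concrete, here is a rough Python equivalent of what the task’s template_parameters block resolves to. It is a sketch under the assumption that an unset variable resolves to an empty string, which is the default behavior of Ansible’s env lookup; the function names are hypothetical.

```python
import os

# CloudFormation parameter name -> environment variable set by CodeBuild,
# as listed in the create.yml task above.
PARAMETER_ENV_VARS = {
    "VpcCIDR": "VPC_CIDR",
    "PublicSubnet1CIDR": "PUBLIC_SUBNET_1_CIDR",
    "PublicSubnet2CIDR": "PUBLIC_SUBNET_2_CIDR",
    "PrivateSubnet1CIDR": "PRIVATE_SUBNET_1_CIDR",
    "PrivateSubnet2CIDR": "PRIVATE_SUBNET_2_CIDR",
    "TagEnv": "TAG_ENVIRONMENT",
}


def template_parameters(environ=None):
    """Resolve each CloudFormation parameter from the environment."""
    environ = os.environ if environ is None else environ
    # An unset variable becomes an empty string rather than raising an error.
    return {param: environ.get(var, "") for param, var in PARAMETER_ENV_VARS.items()}


def missing_parameters(params):
    """List parameters that resolved to an empty value, for a quick pre-flight check."""
    return sorted(name for name, value in params.items() if not value)
```

Checking missing_parameters(template_parameters()) before running a playbook locally is one way to catch a forgotten export early, since CodeBuild normally sets these variables for you.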

The associated Ansible Playbooks, which call the tasks, are located in the playbooks directory, as shown previously in the tree view. The playbooks define a few required parameters, such as where the list of hosts is derived from, and call the appropriate roles. For our simple demonstration, only a single role is called per playbook. Typically, in a larger project, you would call multiple roles from a single playbook. Below, we see the Network playbook, playbooks/10_cfn_network.yml, which calls the cfn_network role.

---
- name: Provision VPC and Subnets
  hosts: localhost
  connection: local
  gather_facts: False

  roles:
    - role: cfn_network

Dynamic Inventory

Another principal feature of Ansible is demonstrated in the Web Server Configuration playbook, playbooks/30_web_config.yml, shown below. Note that the hosts on which we want to deploy and configure Apache HTTP Server are selected based on an AWS tag value, indicated by the reference to tag_Group_webservers. This indirectly refers to an AWS tag, named Group, with the value of webservers, which was applied to our EC2 hosts by CloudFormation. The ability to generate a Dynamic Inventory, using a dynamic external inventory system, is a key feature of Ansible.

---
- name: Configure Apache Web Servers
  hosts: tag_Group_webservers
  gather_facts: False
  become: yes
  become_method: sudo

  roles:
    - role: httpd

To generate a dynamic inventory of EC2 hosts, we are using the Ansible AWS EC2 Dynamic Inventory script, inventories/ec2.py and inventories/ec2.ini files. The script dynamically queries AWS for all the EC2 hosts containing specific AWS tags, belonging to a particular Security Group, Region, Availability Zone, and so forth.

I have customized the AWS EC2 Dynamic Inventory script’s configuration in the inventories/aws_ec2.yml file. Amongst other configuration items, the file defines keyed_groups. This instructs the script to inventory EC2 hosts according to their unique AWS tags and tag values.

plugin: aws_ec2
remote_user: ec2-user
private_key_file: ~/.ssh/ansible
regions:
  - us-east-1
keyed_groups:
  - key: tags.Name
    prefix: tag_Name_
    separator: ''
  - key: tags.Group
    prefix: tag_Group_
    separator: ''
hostnames:
  - dns-name
  - ip-address
  - private-dns-name
  - private-ip-address
compose:
  ansible_host: ip_address

Once you have built the CloudFormation compute stack, later in the demonstration, you would use the following command to build the dynamic EC2 inventory of hosts.

ansible-inventory -i inventories/aws_ec2.yml --graph

You would then see an inventory of all your EC2 hosts, resembling the following.

@all:
  |--@aws_ec2:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Group_webservers:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Name_Apache_Web_Server:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@ungrouped:

Note the two EC2 web server instances, listed under tag_Group_webservers. They represent the target inventory onto which we will install Apache HTTP Server. We could also target the same hosts by their Name tag, using the tag_Name_Apache_Web_Server group.
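
The way keyed_groups turns tags into group names can be approximated in a few lines of Python. This is a simplified sketch, not the plugin’s code: it assumes the group name is the prefix, the separator, and the tag value with any character other than a letter, digit, or underscore replaced by an underscore. The Name tag value "Apache Web Server" is also an assumption, inferred from the group name in the inventory output above.

```python
import re

# (tag key, prefix) pairs, mirroring keyed_groups in inventories/aws_ec2.yml.
KEYED_GROUPS = [("Name", "tag_Name_"), ("Group", "tag_Group_")]


def group_name(prefix, value, separator=""):
    """Approximate how a tag value becomes an inventory group name."""
    # Characters that are not valid in a group name become underscores.
    sanitized = re.sub(r"[^A-Za-z0-9_]", "_", value)
    return f"{prefix}{separator}{sanitized}"


def groups_for_host(tags):
    """Return the keyed groups a host with the given tags would belong to."""
    return [
        group_name(prefix, tags[key])
        for key, prefix in KEYED_GROUPS
        if key in tags
    ]
```

For a host tagged with Group set to webservers, this yields tag_Group_webservers, the group referenced by the Web Server Configuration playbook.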

AWS CodeBuild

Recalling our diagram, you will note the use of CodeBuild is a vital part of each of our five DevOps workflows. CodeBuild is used to 1) validate the CloudFormation templates, 2) provision the network resources, 3) provision the compute resources, 4) install and configure the web servers, and 5) run integration tests.

aws_devops

By splitting these processes into separate workflows, we can redeploy the web servers without impacting the compute resources or redeploy the compute resources without affecting the network resources. Often, different teams within a large enterprise are responsible for each of these resource categories: architecture, security (IAM), network, compute, web servers, and code deployments. Separating concerns makes a shared ownership model easier to manage.

Build Specifications

CodeBuild projects rely on a build specification or buildspec file for their configuration, as shown below. CodeBuild’s buildspec file is analogous to Jenkins’ Jenkinsfile. Each of our five workflows will use CodeBuild. Each CodeBuild project references a separate buildspec file, included in the two GitHub projects, which by now you have pushed to your two CodeCommit repositories.

screen_shot_2019-07-26_at_6_10_59_pm

Below we see an example of the buildspec file for the CodeBuild project that deploys our AWS network resources, buildspec_files/buildspec_network.yml.

version: 0.2

env:
  variables:
    TEMPLATE_URL: "https://s3.amazonaws.com/garystafford_cloud_formation/cf_demo/network-stack.template"
    AWS_REGION: "us-east-1"
    TAG_ENVIRONMENT: "ansible-cfn-demo"
  parameter-store:
    VPC_CIDR: "/ansible_demo/vpc_cidr"
    PUBLIC_SUBNET_1_CIDR: "/ansible_demo/public_subnet_1_cidr"
    PUBLIC_SUBNET_2_CIDR: "/ansible_demo/public_subnet_2_cidr"
    PRIVATE_SUBNET_1_CIDR: "/ansible_demo/private_subnet_1_cidr"
    PRIVATE_SUBNET_2_CIDR: "/ansible_demo/private_subnet_2_cidr"

phases:
  install:
    runtime-versions:
      python: 3.7
    commands:
      - pip install -r requirements.txt -q
  build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml playbooks/10_cfn_network.yml --tags create  -v
  post_build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml roles/cfn_network/tests/test.yml

There are several distinct sections to the buildspec file. First, in the env section, we define our variables. They are a combination of three static variable values and five variable values retrieved from the Parameter Store. Any of these may be overridden at build-time, using the AWS CLI, SDK, or from the CodeBuild management console. You will need to update some of the variables to match your particular environment, such as the TEMPLATE_URL to match your S3 bucket path.
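
Conceptually, the build environment ends up as one flat set of variables. The Python sketch below illustrates that merge; it is not how CodeBuild is implemented. The function name is hypothetical, the buildspec is represented as an already-parsed dictionary, and Parameter Store is stood in for by a plain dictionary.

```python
def resolve_build_env(env_section, parameter_store, overrides=None):
    """Merge static variables, Parameter Store lookups, and build-time overrides."""
    # Static values are copied as-is from the 'variables' mapping.
    resolved = dict(env_section.get("variables", {}))
    # Each 'parameter-store' entry maps a variable name to a parameter name,
    # whose stored value becomes the variable's value.
    for var_name, parameter_name in env_section.get("parameter-store", {}).items():
        resolved[var_name] = parameter_store[parameter_name]
    # Build-time overrides are applied last, so they win.
    resolved.update(overrides or {})
    return resolved
```

For example, passing {"AWS_REGION": "us-west-2"} as overrides replaces the static us-east-1 value while leaving the Parameter Store values untouched.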

Next, the phases of our build. Again, if you are familiar with Jenkins, think of these as Stages with multiple Steps. The first phase, install, builds a Docker container, in which the build process is executed. Here we are using Python 3.7. We also run a pip command to install the required Python packages from our requirements.txt file. Next, we perform our build phase by executing an Ansible command.

ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/10_cfn_network.yml --tags create -v

The command calls our playbook, playbooks/10_cfn_network.yml. The command references the create tag. This causes the playbook to run the cfn_network role’s create tasks (roles/cfn_network/tasks/create.yml), as defined in the main.yml file (roles/cfn_network/tasks/main.yml). Lastly, in our post_build phase, we execute our role’s unit tests (roles/cfn_network/tests/test.yml), using a second Ansible command.

CodeBuild Projects

Next, we need to create CodeBuild projects. You can do this using the AWS CLI or from the CodeBuild management console (shown below). I have included individual templates and a creation script in each project, in the codebuild_projects directory, which you could use to build the projects, using the AWS CLI. You would have to modify the JSON templates, replacing all references to my specific, unique AWS resources, with your own. For the demonstration, I suggest creating the five projects manually in the CodeBuild management console, using the supplied CodeBuild project templates as a guide.

screen_shot_2019-07-26_at_6_10_12_pm

CodeBuild IAM Role

To execute our CodeBuild projects, we need an IAM Role or Roles for CodeBuild, with permission to access such resources as CodeCommit, S3, and CloudWatch. For this demonstration, I chose to create a single IAM Role for all workflows. I then allowed CodeBuild to assign the required policies to the Role as needed, which is a feature of CodeBuild.

screen_shot_2019-07-26_at_6_52_23_pm

CodePipeline Pipeline

In addition to CodeBuild, we are using CodePipeline for our first of five workflows. CodePipeline validates the CloudFormation templates and pushes them to our S3 bucket. The pipeline calls the corresponding CodeBuild project to validate each template, then deploys the valid CloudFormation templates to S3.

codepipeline

In true CI/CD fashion, the pipeline is automatically executed every time source code from the CloudFormation project is committed to the CodeCommit repository.

screen_shot_2019-07-26_at_6_12_51_pm

CodePipeline calls CodeBuild, which performs a build, based on its buildspec file. This particular CodeBuild buildspec file also demonstrates another ability of CodeBuild, executing an external script. When we have a complex build phase, we may choose to call an external script, such as a Bash or Python script, versus embedding the commands in the buildspec.

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip install -r requirements.txt -q
      - cfn-lint -v
  build:
    commands:
      - sh buildspec_files/build.sh

artifacts:
  files:
    - '**/*'
  base-directory: 'cfn_templates'
  discard-paths: yes

Below, we see the script that is called. Here we are using both the CloudFormation Linter, cfn-lint, and the cloudformation validate-template command to validate our templates for comparison. The two tools give slightly different, yet relevant, linting results.

#!/usr/bin/env bash

set -e

for filename in cfn_templates/*.*; do
    # quote the variable so paths containing spaces are passed intact
    cfn-lint -t "${filename}"
    aws cloudformation validate-template \
      --template-body "file://${filename}"
done
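
Since the text mentions that the external script could just as well be written in Python, here is a minimal sketch of an equivalent. The function names are hypothetical; the two commands are the same ones the Bash script runs. Passing subprocess.run as the runner, with check=True, stops at the first failing command, which is the same effect set -e has in the Bash version.

```python
from pathlib import Path


def validation_commands(template_dir):
    """Build the two validation commands for every template file in a directory."""
    commands = []
    for template in sorted(Path(template_dir).glob("*.*")):
        commands.append(["cfn-lint", "-t", str(template)])
        commands.append([
            "aws", "cloudformation", "validate-template",
            "--template-body", f"file://{template}",
        ])
    return commands


def run_all(commands, runner):
    """Run each command in order; runner is expected to raise on failure."""
    for command in commands:
        runner(command, check=True)
```

A build script could then call run_all(validation_commands("cfn_templates"), subprocess.run) after importing subprocess.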

Similar to the CodeBuild project templates, I have included a CodePipeline template, in the codepipeline_pipelines directory, which you could modify and create using the AWS CLI. Alternatively, I suggest using the CodePipeline management console to create the pipeline for the demo, using the supplied CodePipeline template as a guide.

screen_shot_2019-07-26_at_6_11_51_pm

Below, the stage view of the final CodePipeline pipeline.

screen_shot_2019-07-26_at_6_12_26_pm

Build the Platform

With all the resources, code, and DevOps workflows in place, we should be ready to build our platform on AWS. The CodePipeline project comes first, to validate the CloudFormation templates and place them into your S3 bucket. Since you are probably not committing new code to the CloudFormation project’s CodeCommit repository, which would trigger the pipeline, you can start the pipeline using the AWS CLI (shown below) or via the management console.

# list names of pipelines
aws codepipeline list-pipelines

# execute the validation pipeline
aws codepipeline start-pipeline-execution --name cfn-validate-s3

screen_shot_2019-07-26_at_6_08_03_pm

The pipeline should complete within a few seconds.

screen_shot_2019-07-26_at_10_12_53_pm.png

Next, execute each of the four CodeBuild projects in the following order.

# list the names of the projects
aws codebuild list-projects

# execute the builds in order
aws codebuild start-build --project-name cfn-network
aws codebuild start-build --project-name cfn-compute

# ensure EC2 instance checks are complete before starting
# the ansible-web-config build!
aws codebuild start-build --project-name ansible-web-config
aws codebuild start-build --project-name ansible-test

As the code comment above states, be careful not to start the ansible-web-config build until you have confirmed the EC2 instance Status Checks have completed and have passed, as shown below. The previous, cfn-compute build will complete when CloudFormation finishes building the new compute stack. However, the fact CloudFormation finished does not indicate that the EC2 instances are fully up and running. Failure to wait will result in a failed build of the ansible-web-config CodeBuild project, which installs and configures the Apache HTTP Servers.

screen_shot_2019-07-26_at_6_27_52_pm
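
If you would rather script this wait than watch the console, one option is a small polling loop such as the sketch below. The function name and defaults are hypothetical, and the status source is injected as a callable so the loop itself has no AWS dependency. In practice, the AWS CLI command aws ec2 wait instance-status-ok and the boto3 EC2 waiter named instance_status_ok provide similar behavior out of the box.

```python
import time


def wait_for_status_ok(get_statuses, attempts=40, delay_seconds=15, sleep=time.sleep):
    """Poll until every reported instance status is 'ok', or give up."""
    for attempt in range(attempts):
        statuses = get_statuses()
        # An empty list means no instances were reported yet, so keep waiting.
        if statuses and all(status == "ok" for status in statuses):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Here get_statuses is assumed to return a list of status strings such as "initializing" or "ok", for example gathered from the EC2 DescribeInstanceStatus API. Only when the function returns True would you start the ansible-web-config build.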

Below, we see the cfn-network CodeBuild project first building a Python-based Docker container, within which to perform the build. Each build is executed in a fresh, separate Docker container, something that can trip you up if you are expecting a previous cache of Ansible Facts or previously defined environment variables to persist across multiple builds.

screen_shot_2019-07-26_at_6_15_12_pm

Below, we see the two completed CloudFormation Stacks, a result of our CodeBuild projects and Ansible.

screen_shot_2019-07-26_at_6_44_43_pm

The fifth and final CodeBuild build tests our platform by attempting to hit the Apache HTTP Server’s default home page, using the Application Load Balancer’s public DNS name.

screen_shot_2019-07-26_at_6_32_09_pm
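
A test of this kind boils down to an HTTP request and two assertions. The Python below is a sketch of the idea using only the standard library; it is not the integration_tests role, and the function name and the expected text are assumptions chosen for illustration.

```python
import urllib.request


def home_page_ok(url, expected_text, fetch=urllib.request.urlopen, timeout=10):
    """Return True if the page responds with HTTP 200 and contains the expected text."""
    # urlopen raises an exception for 4xx/5xx responses, which a caller
    # would treat as a failed test.
    with fetch(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status == 200 and expected_text in body
```

The url argument would be the Application Load Balancer’s public DNS name, prefixed with http://.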

Below, we see an example of what happens when a build fails. In this case, one of the final integration tests failed to return the expected results from the ALB endpoint.

screen_shot_2019-07-26_at_6_40_37_pm

Below, with the bug fixed, we rerun the build, which re-executes the tests successfully.

screen_shot_2019-07-26_at_6_38_21_pm

We can manually confirm the platform is working by hitting the same public DNS name of the ALB in our browser that our tests used. The ALB should load-balance our request to the default home page of one of the two running web servers. Normally, at this point, you would deploy your application to Apache, using a software continuous deployment tool, such as Jenkins, CodeDeploy, Travis CI, TeamCity, or Bamboo.

screen_shot_2019-07-26_at_6_39_26_pm

Cleaning Up

To clean up the running AWS resources from the demonstration, first delete the CloudFormation compute stack, then delete the network stack. To do so, execute the following commands, one at a time. The commands call the same playbooks we called to create the stacks, except this time, we use the delete tag, as opposed to the create tag.

# first delete cfn compute stack
ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/20_cfn_compute.yml -t delete -v

# then delete cfn network stack
ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/10_cfn_network.yml -t delete -v

You should observe the following output, indicating both CloudFormation stacks have been deleted.

screen_shot_2019-07-26_at_7_12_38_pm

Confirm the stacks were deleted from the CloudFormation management console or from the AWS CLI.

 

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.


Managing Applications Across Multiple Kubernetes Environments with Istio: Part 2

In this two-part post, we are exploring the creation of a GKE cluster, replete with the latest version of Istio, often referred to as IoK (Istio on Kubernetes). We will then deploy, perform integration testing, and promote an application across multiple environments within the cluster.

Part Two

In Part One of this post, we created a Kubernetes cluster on the Google Cloud Platform, installed Istio, provisioned a PostgreSQL database, and configured DNS for routing. Under the assumption that v1 of the Election microservice had already been released to Production, we deployed v1 to each of the three namespaces.

In Part Two of this post, we will learn how to utilize the advanced API testing capabilities of Postman and Newman to ensure v2 is ready for UAT and release to Production. We will deploy and perform integration testing of a new v2 of the Election microservice, locally on Kubernetes Minikube. Once confident v2 is functioning as intended, we will promote and test v2 across the dev, test, and uat namespaces.

Source Code

As a reminder, all source code for this post can be found on GitHub. The project’s README file contains a list of the Election microservice’s endpoints. To get started quickly, use one of the two following options (gist).

# clone the official v3.0.0 release for this post
git clone --depth 1 --branch v3.0.0 \
  https://github.com/garystafford/spring-postgresql-demo.git \
  && cd spring-postgresql-demo \
  && git checkout -b v3.0.0

# clone the latest version of code (newer than article)
git clone --depth 1 --branch master \
  https://github.com/garystafford/spring-postgresql-demo.git \
  && cd spring-postgresql-demo

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

This project includes a kubernetes sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the example shown in the post.

Testing Locally with Minikube

Deploying to GKE, no matter how automated, takes time and resources, whether those resources are team members or just compute and system resources. Before deploying v2 of the Election service to the non-prod GKE cluster, we should ensure that it has been thoroughly tested locally. Local testing should include the following test criteria:

  1. Source code builds successfully
  2. All unit-tests pass
  3. A new Docker Image can be created from the build artifact
  4. The Service can be deployed to Kubernetes (Minikube)
  5. The deployed instance can connect to the database and execute the Liquibase changesets
  6. The deployed instance passes a minimal set of integration tests

Minikube gives us the ability to quickly iterate and test an application, as well as the Kubernetes and Istio resources required for its operation, before promoting to GKE. These resources include Kubernetes Namespaces, Secrets, Deployments, Services, Route Rules, and Istio Ingresses. Since Minikube is just that, a miniature version of our GKE cluster, we should be able to have a nearly one-to-one parity between the Kubernetes resources we apply locally and those applied to GKE. This post assumes you have the latest version of Minikube installed, and are familiar with its operation.

This project includes a minikube sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the Minikube deployment example shown in this post. The three included scripts are designed to be easily adapted to a CI/CD DevOps workflow. You may need to modify the scripts to match your environment’s configuration. Note this Minikube-deployed version of the Election service relies on the external Amazon RDS database instance.

Local Database Version

To eliminate the AWS costs, I have included a second, alternate version of the Minikube Kubernetes resource files, minikube_db_local. This version deploys a single containerized PostgreSQL database instance to Minikube, as opposed to relying on the external Amazon RDS instance. Be aware, the database does not have persistent storage or an Istio sidecar proxy.

istio_100.png

Minikube Cluster

If you do not have a running Minikube cluster, create one with the minikube start command.

istio_081

Minikube allows you to use normal kubectl CLI commands to interact with the Minikube cluster. Using the kubectl get nodes command, we should see a single Minikube node running the latest Kubernetes v1.10.0.

istio_082

Istio on Minikube

Next, install Istio following Istio’s online installation instructions. A basic Istio installation on Minikube, without the additional add-ons, should only require a single Istio install script.

istio_083

If successful, you should observe a new istio-system namespace, containing the four main Istio components: istio-ca, istio-ingress, istio-mixer, and istio-pilot.

istio_084

Deploy v2 to Minikube

Next, create a Minikube Development environment, consisting of a dev Namespace, Istio Ingress, and Secret, using the part1-create-environment.sh script. Then, deploy v2 of the Election service to the dev Namespace, along with an associated Route Rule, using the part2-deploy-v2.sh script. One v2 instance should be sufficient to satisfy the testing requirements.

istio_085

Access to v2 of the Election service on Minikube is a bit different than with GKE. When routing external HTTP requests, there is no load balancer, no external public IP address, and no public DNS or subdomains. To access the single instance of v2 running on Minikube, we use the local IP address of the Minikube cluster, obtained with the minikube ip command. The access port required is the Node Port (nodePort) of the istio-ingress Service. The command is shown below (gist) and included in the part3-smoke-test.sh script.

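# Build the gateway address from the Minikube IP and the istio-ingress Node Port
export GATEWAY_URL="$(minikube ip):"\
"$(kubectl get svc istio-ingress -n istio-system -o 'jsonpath={.spec.ports[0].nodePort}')"
echo "$GATEWAY_URL"
# Confirm v2 of the Election service is running and addressable
curl "$GATEWAY_URL/v2/actuator/health" && echo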

The second part of our HTTP request routing is the same as with GKE, relying on an Istio Route Rule. The /v2/ sub-collection resource in the HTTP request URL is rewritten and routed to the v2 election Pod by the Route Rule. To confirm v2 of the Election service is running and addressable, curl the /v2/actuator/health endpoint. Spring Boot Actuator’s /health endpoint is frequently used at the end of a CI/CD server’s deployment pipeline to confirm success. The Spring Boot application can take a few minutes to fully start up and be responsive to requests, depending on the speed of your local machine.

istio_093.png
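
Because the Spring Boot application can take a few minutes to start, a single curl may fail simply because the service is not ready yet. A minimal sketch of a polling loop follows. The wait_for_health function name, the attempt count, and the delay are assumptions, and are not part of the project's part3-smoke-test.sh script.

```shell
#!/bin/sh
# Poll a health endpoint until it returns a successful HTTP status.
# Usage: wait_for_health URL [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url="$1"
  max_attempts="${2:-30}"
  delay="${3:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx); -s hides progress output
    if curl -fs -o /dev/null "$url"; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not healthy after ${max_attempts} attempt(s)" >&2
  return 1
}

# Example usage, with GATEWAY_URL exported as in the command above:
#   wait_for_health "http://${GATEWAY_URL}/v2/actuator/health" 30 10
```

A non-zero return code lets a CI/CD pipeline stage fail cleanly when the service never becomes responsive.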

Using the Kubernetes Dashboard, we should see our deployment of the single Election service Pod is running successfully in Minikube’s dev namespace.

istio_087

Once deployed, we run a battery of integration tests to confirm that the new v2 functionality is working as intended before deploying to GKE. In the next section of this post, we will explore the process of creating and managing Postman Collections and Postman Environments, and how to automate those Collections of tests with Newman and Jenkins.

istio_088

Integration Testing

The typical reason an application is deployed to lower environments, prior to Production, is to perform application testing. Although definitions vary across organizations, testing commonly includes some or all of the following types: Integration Testing, Functional Testing, System Testing, Stress or Load Testing, Performance Testing, Security Testing, Usability Testing, Acceptance Testing, Regression Testing, Alpha and Beta Testing, and End-to-End Testing. Test teams may also refer to other testing forms, such as Whitebox (Glassbox), Blackbox Testing, Smoke, Validation, or Sanity Testing, and Happy Path Testing.

The site, softwaretestinghelp.com, defines integration testing as the ‘testing of all integrated modules to verify the combined functionality after integration. Modules are typically code modules, individual applications, client and server applications on a network, etc. This type of testing is especially relevant to client/server and distributed systems.’

In this post, we are concerned that our integrated modules are functioning cohesively, primarily the Election service, Amazon RDS database, DNS, Istio Ingress, Route Rules, and the Istio sidecar Proxy. Unlike Unit Testing and Static Code Analysis (SCA), which are done pre-deployment, integration testing requires an application to be deployed and running in an environment.

Postman

I have chosen Postman, along with Newman, to execute a Collection of integration tests before promoting to the next environment. The integration tests confirm the deployed application’s name and version. The integration tests then perform a series of HTTP GET, POST, PUT, PATCH, and DELETE actions against the service’s resources. The integration tests verify a successful HTTP response code is returned, based on the type of request made.

istio_055

Postman tests are written in JavaScript, similar to other popular, modern testing frameworks. Postman offers advanced features such as test-chaining. Tests can be chained together through the use of environment variables, which store response values and pass them on to other tests. Values shared between tests are also stored in the Postman Environments. Below, we store the ID of the new candidate, the result of an HTTP POST to the /candidates endpoint. We then use the stored candidate ID in subsequent HTTP GET, PUT, and PATCH test requests to the same /candidates endpoint.

istio_056

Environment-specific variables, such as the resource host, port, and environment sub-collection resource, are abstracted and stored as key/value pairs within Postman Environments, and called through variables in the request URL and within the tests. Thus, the same Postman Collection of tests may be run against multiple environments using different Postman Environments.

istio_057

Postman Runner allows us to run multiple iterations of our Collection. We also have the option to build in delays between tests. Lastly, Postman Runner can load external JSON and CSV formatted test data, which is beyond the scope of this post.

istio_058

Postman contains a simple Run Summary UI for viewing test results.

istio_060

Test Automation

To support running tests from the command line, Postman provides Newman. According to Postman, Newman is a command-line collection runner for Postman. Newman offers the same functionality as Postman’s Collection Runner, all part of the newman CLI. Newman is a Node.js module, installed globally as an npm package with npm install newman --global.

Typically, Development and Testing teams compose Postman Collections and define Postman Environments, locally. Teams run their tests locally in Postman, during their development cycle. Then, those same Postman Collections are executed from the command line, or more commonly as part of a CI/CD pipeline, such as with Jenkins.

Below, the same Collection of integration tests that we ran in the Postman Runner UI is run from the command line, using Newman.

istio_061
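
For reference, a Newman invocation that runs a Collection against a given Environment and also writes a JUnit-style XML report might look like the sketch below. The run_collection wrapper and the file names in the usage comment are hypothetical, and the project's own newman scripts may differ.

```shell
#!/bin/sh
# Run a Postman Collection with Newman, printing results to the console
# and exporting a JUnit XML report.
# Usage: run_collection COLLECTION_FILE ENVIRONMENT_FILE REPORT_FILE
run_collection() {
  newman run "$1" \
    --environment "$2" \
    --reporters cli,junit \
    --reporter-junit-export "$3"
}

# Example usage (file names are placeholders):
#   run_collection election-v2.postman_collection.json \
#     dev.postman_environment.json newman/integration-report.xml
```

Keeping the Collection and the Environment as separate arguments mirrors how the same tests are pointed at different environments.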

Jenkins

Without a doubt, Jenkins is the leading open-source CI/CD automation server. The building, testing, publishing, and deployment of microservices to Kubernetes is relatively easy with Jenkins. Generally, you would build, unit-test, push a new Docker image, and then deploy your application to Kubernetes using a series of CI/CD pipelines. Below, we see examples of these pipelines using Jenkins Blue Ocean, starting with a continuous integration pipeline, which includes unit-testing and Static Code Analysis (SCA) with SonarQube.

istio_108

This is followed by a pipeline that builds the Docker Image, using the build artifact from the above pipeline, and pushes the Image to Docker Hub.

istio_109

The third pipeline demonstrates building the three Kubernetes environments and deploying v1 of the Election service to the dev namespace. This pipeline is just for demonstration purposes; typically, you would separate these functions.

istio_110

Spinnaker

An alternative to Jenkins for the deployment of microservices is Spinnaker, created by Netflix. According to Netflix, ‘Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence.’ Spinnaker is designed to integrate easily with Jenkins, splitting responsibilities between continuous integration and delivery on one side and deployment on the other. Below are two sample Spinnaker deployment pipelines, similar to the Jenkins pipelines, for deploying v1 and v2 of the Election service to the non-prod GKE cluster.

spin_07

Below, Spinnaker has deployed v2 of the Election service to dev using a Highlander deployment strategy. Subsequently, Spinnaker has deployed v2 to test using a Red/Black deployment strategy, leaving the previously released v1 Server Group in place, in case a rollback is required.

spin_08

Once Spinnaker has completed the deployment tasks, the Postman Collections of smoke and integration tests are executed by Newman, as part of another Jenkins CI/CD pipeline.

istio_101B.png

In this pipeline, a set of basic smoke tests is run first to ensure the new deployment is running properly, and then the integration tests are executed.

istio_102

In this simple example, we have a three-stage pipeline created from a Jenkinsfile (gist).

#!groovy
def ACCOUNT = "garystafford"
def PROJECT_NAME = "spring-postgresql-demo"
def ENVIRONMENT = "dev" // assumes 'api.dev.voter-demo.com' reachable
pipeline {
  agent any
  stages {
    stage('Checkout SCM') {
      steps {
        git changelog: true, poll: false,
          branch: 'master',
          url: "https://github.com/${ACCOUNT}/${PROJECT_NAME}"
      }
    }
    stage('Smoke Test') {
      steps {
        dir('postman') {
          nodejs('nodejs') {
            sh "sh ./newman-smoke-tests-${ENVIRONMENT}.sh"
          }
          junit '**/newman/*.xml'
        }
      }
    }
    stage('Integration Tests') {
      steps {
        dir('postman') {
          nodejs('nodejs') {
            sh "sh ./newman-integration-tests-${ENVIRONMENT}.sh"
          }
          junit '**/newman/*.xml'
        }
      }
    }
  }
}

Test Results

Newman offers several options for displaying test results. For easy integration with Jenkins, Newman results can be delivered in a format that can be displayed as JUnit test reports. The JUnit test report format, XML, is a popular method of standardizing test results from different testing tools. Below is a truncated example of a test report file (gist).

<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="spring-postgresql-demo-v2" time="13.339000000000002">
  <testsuite name="/candidates/{{candidateId}}" id="31cee570-95a1-4768-9ac3-3d714fc7e139" tests="1" time="0.669">
    <testcase name="Status code is 200" time="0.669"/>
  </testsuite>
  <testsuite name="/candidates/{{candidateId}}" id="a5a62fe9-6271-4c89-a076-c95bba458ef8" tests="1" time="0.575">
    <testcase name="Status code is 200" time="0.575"/>
  </testsuite>
  <testsuite name="/candidates/{{candidateId}}" id="2fc4c902-b931-4b35-b28a-7e264f40ee9c" tests="1" time="0.568">
    <testcase name="Status code is 204" time="0.568"/>
  </testsuite>
  <testsuite name="/candidates/summary" id="94fe972e-32f4-4f58-a5d5-999cacdf7460" tests="1" time="0.337">
    <testcase name="Status code is 200" time="0.337"/>
  </testsuite>
  <testsuite name="/candidates/summary/{election}" id="f8f817c8-4785-49f1-8d09-8055b84c4fc0" tests="1" time="0.351">
    <testcase name="Status code is 200" time="0.351"/>
  </testsuite>
  <testsuite name="/candidates/search/findByLastName?lastName=Paul" id="504f8741-e9d2-4f05-b1ad-c14136030f34" tests="1" time="0.256">
    <testcase name="Status code is 200" time="0.256"/>
  </testsuite>
</testsuites>

Translating Newman test results to JUnit reports allows the percentage of successfully executed test cases, a universal testing metric, to be tracked over multiple deployments. Below, we see the JUnit Test Reports Test Result Trend graph for a series of test runs.

istio_103
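
The pass percentage can also be computed directly from a report file, for example to gate a promotion step outside of Jenkins. The sketch below is one way to do that. The junit_pass_rate function is hypothetical, and it assumes each failing testcase element contains at most one failure or error child element.

```shell
#!/bin/sh
# Print the percentage of passing test cases in a JUnit XML report.
# Assumes each failing <testcase> contains at most one <failure> or <error> child.
# Usage: junit_pass_rate REPORT_FILE
junit_pass_rate() {
  report="$1"
  # grep -o prints one line per match, so wc -l counts occurrences, not lines
  total=$(grep -o '<testcase' "$report" | wc -l | tr -d ' ')
  failed=$(grep -o -E '<(failure|error)' "$report" | wc -l | tr -d ' ')
  if [ "$total" -eq 0 ]; then
    echo "no test cases found" >&2
    return 1
  fi
  # Integer arithmetic: the result is truncated, not rounded
  echo $(( (total - failed) * 100 / total ))
}

# Example usage (the report path is a placeholder):
#   junit_pass_rate newman/integration-report.xml
```

For a report like the truncated example above, in which all six listed test cases pass, the function would print 100.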

Deploying to Development

Development environments typically have a rapid turnover of application versions. Many teams use their Development environment as a continuous integration environment, where every commit that successfully builds and passes all unit tests is deployed. The purpose of the CI deployments is to ensure build artifacts will successfully deploy through the CI/CD pipeline, start properly, and pass a basic set of smoke tests.

Other teams use the Development environments as an extension of their local Minikube environment. The Development environment will possess some or all of the required external integration points, which the Developer’s local Minikube environment may not. The goal of the Development environment is to help Developers ensure their application is functioning correctly and is ready for the Test teams to evaluate, prior to promotion to the Test environment.

Some external integration points, such as external payment gateways, customer relationship management (CRM) systems, content management systems (CMS), or data analytics engines, are often stubbed-out in lower environments. Generally, third-party providers only offer a limited number of parallel non-Production integration environments. While an application may pass through several non-prod environments, testing against all external integration points will only occur in one or two of those environments.

With v2 of the Election service ready for testing on GKE, we deploy it to the GKE cluster’s dev namespace using the part4a-deploy-v2-dev.sh script. We will also delete the previous v1 version of the Election service. Similar to the v1 deployment script, the v2 scripts run the istioctl kube-inject command, which manually injects the Istio sidecar proxy alongside the Election service, into each election v2 Pod. The deployment script also deploys an alternate Istio Route Rule, which routes requests for api.dev.voter-demo.com/v2/* resources to v2 of the Election service.

istio_054.png

Once deployed, we run our Postman Collection of integration tests with Newman or as part of a CI/CD pipeline. In the Development environment, we may choose to run a limited set of tests for the sake of expediency, or because not all external integration points are accessible.

Promotion to Test

With local Minikube and Development environment testing complete, we promote and deploy v2 of the Election service to the Test environment, using the part4b-deploy-v2-test.sh script. In Test, we will not delete v1 of the Election service.

istio_062

Often, an organization will maintain a running copy of all versions of an application currently deployed to Production, in a lower environment. Let’s look at two scenarios where this is common. First, v1 of the Election service has an issue in Production, which needs to be confirmed and may require a hot-fix by the Development team. Validation of the v1 Production bug is often done in a lower environment. The second scenario for having both versions running in an environment is when v1 and v2 both need to co-exist in Production. Organizations frequently support multiple API versions. Cutting over an entire API user-base to a new API version is often completed over a series of releases, and requires careful coordination with API consumers.

Testing All Versions

An essential role of integration testing should be to confirm that both versions of the Election service are functioning correctly, while simultaneously running in the same namespace. For example, we want to verify traffic is routed correctly, based on the HTTP request URL, to the correct version. Another common test scenario is database schema changes. Suppose we make what we believe are backward-compatible database changes to v2 of the Election service. We should be able to prove, through testing, that both the old and new versions function correctly against the latest version of the database schema.

There are different automation strategies that could be employed to test multiple versions of an application without creating separate Collections and Environments. A simple solution would be to templatize the Environments file, and then programmatically set the Postman Environment’s version variable from a pipeline parameter (abridged environment file shown below).

istio_095.png
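
One way to sketch the templating approach is a simple token substitution performed by the pipeline before Newman runs. The render_environment function and the __VERSION__ token are hypothetical conventions chosen for illustration, not Postman features. The sketch also assumes the version value contains no / or & characters, which sed would treat specially.

```shell
#!/bin/sh
# Render a Postman Environment file from a template by substituting a version token.
# The __VERSION__ token is a hypothetical placeholder convention, not a Postman feature.
# Usage: render_environment TEMPLATE_FILE VERSION > OUTPUT_FILE
render_environment() {
  sed "s/__VERSION__/$2/g" "$1"
}

# Example usage in a pipeline, where VERSION is a build parameter
# and the file names are placeholders:
#   render_environment environment.template.json "$VERSION" > environment.json
```

The rendered file can then be passed to Newman as the Environment, so a single Collection covers both v1 and v2.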

Once initial automated integration testing is complete, Test teams will typically execute additional forms of application testing if necessary, before signing off for UAT and Performance Testing to begin.

User-Acceptance Testing

With testing in the Test environments completed, we continue on to UAT. The term UAT suggests that a set of actual end-users (API consumers) of the Election service will perform their own testing. Frequently, UAT is only done for a short, fixed period of time, often with a specialized team of Testers. Issues experienced during UAT can be expensive and, if sign-off is delayed, can impact the ability to release an application to Production on time.

After deploying v2 of the Election service to UAT, and before opening it up to the UAT team, we would naturally want to repeat the same integration testing process we conducted in the previous Test environment. We must ensure that v2 is functioning as expected before our end-users begin their testing. This is where leveraging a tool like Jenkins makes automated integration testing more manageable and repeatable. One strategy would be to duplicate our existing Development and Test pipelines, and re-target the new pipeline to call v2 of the Election service in UAT.

istio_104.png

Again, in a JUnit report format, we can examine individual results through the Jenkins Console.

istio_105.png

We can also examine individual results from each test run using a specific build’s Console Output.

istio_106.png

Testing and Instrumentation

To fully evaluate the integration test results, you must look beyond just the percentage of test cases executed successfully. It makes little sense to release a new version of an application if it passes all functional tests, but significantly increases client response times, unnecessarily increases memory consumption or wastes other compute resources, or is grossly inefficient in the number of calls it makes to the database or third-party dependencies. Oftentimes, integration testing uncovers potential performance bottlenecks that are then incorporated into performance test plans.

Critical intelligence about the performance of the application can only be obtained through the use of logging and metrics collection and instrumentation. Istio provides this telemetry out-of-the-box with Zipkin, Jaeger, Service Graph, Fluentd, Prometheus, and Grafana. In the included Grafana Istio Dashboard below, we see the performance of v1 of the Election service, under test, in the Test environment. We can compare request and response payload size and timing, as well as request and response times to external integration points, such as our Amazon RDS database. We are able to observe the impact of individual test requests on the application and all its integration points.

istio_067

As part of integration testing, we should monitor the Amazon RDS CloudWatch metrics. CloudWatch allows us to evaluate critical database performance metrics, such as the number of concurrent database connections, CPU utilization, read and write IOPS, Memory consumption, and disk storage requirements.

istio_043
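
The same CloudWatch metrics can be pulled from the command line during a test run with the AWS CLI. The following is a sketch only: the rds_metric wrapper, the 60-second period, and the instance identifier in the usage comment are assumptions.

```shell
#!/bin/sh
# Fetch the average of one Amazon RDS metric from CloudWatch over a time window.
# Usage: rds_metric DB_INSTANCE_ID METRIC_NAME START_TIME END_TIME
#   times are ISO 8601, for example 2018-04-01T12:00:00Z
rds_metric() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$2" \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$3" \
    --end-time "$4" \
    --period 60 \
    --statistics Average
}

# Example usage (the instance identifier is a placeholder):
#   rds_metric my-db-instance DatabaseConnections 2018-04-01T12:00:00Z 2018-04-01T12:30:00Z
#   rds_metric my-db-instance CPUUtilization      2018-04-01T12:00:00Z 2018-04-01T12:30:00Z
```

Capturing the window that brackets a test run makes it easier to compare database load between v1 and v2.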

A discussion of metrics starts moving us toward load and performance testing against Production service-level agreements (SLAs). Using a similar approach to integration testing, with load and performance testing, we should be able to accurately estimate the sizing requirements of our new application for Production. Load and Performance Testing helps answer questions like what type and size of compute resources are required for our GKE Production cluster and for our Amazon RDS database, or how many compute nodes and instances (Pods) are necessary to support the expected user-load.

All opinions expressed in this post are my own, and not necessarily the views of my current or past employers, or their clients.


Using soapUI to Test RESTful Web Services

Introduction

There are many excellent tools available to test RESTful web services. Applications like Fiddler, cURL, Firefox with Firebug, and Google Chrome’s Advanced REST Client and REST Console are commonly used by developers and test engineers. Another powerful tool, used by many enterprise software development organizations, is soapUI by SmartBear Software.

SmartBear offers several versions, from the free open-source edition (shown here) to the full-featured soapUI Pro. Although soapUI Pro has many useful advanced features, the free edition is fine to start with. Here is a product comparison of the various editions available from SmartBear.

SmartBear’s soapUI is available for Windows, Mac OS, and Linux (shown here). According to the SmartBear web site, the application supports a wide range of technologies, including SOAP, WSDL, REST, HTTP, HTTPS, AMF, JDBC, JMS, WS-I Integration, WS-Security, WS-Addressing, and WS-Reliable Messaging. It is easy to download and install.

SmartBear soapUI 4.5.1 for Linux

Testing RESTful Web Services with soapUI

In my last post, we used JDBC to map JPA entity classes to tables and views within a MySQL database. We then built RESTful web services, EJB classes, which communicated with MySQL through the entities. The RESTful web services, part of a Java Web Application, were deployed to GlassFish.

Architectural Diagram of Previous Post

Using that post’s RESTful web services, here is a quick example of how easy it is to use the free, open-source edition of soapUI to test those services. Start by locating the address of the RESTful service’s WADL. The WADL address is displayed in the upper left corner of the NetBeans’ browser-based test page, shown in the earlier post.

Displaying the RESTful Web Services WADL

Next, create a new soapUI project. Give the project the WADL address and a project name.

Creating a New soapUI Project

Using the WADL, soapUI will create a sample HTTP Request for each service resource’s methods (left-side of screen). After populating the sample request with any required input parameters, you make an HTTP Request to the service’s method.

In this first example, I call the Actor resource’s ‘findAll’ method using an HTTP GET method. The call to ‘http://localhost:8080/MySQLDemoService/webresources/com.mysql.entities.actor’ results in an HTTP Response with the list of Actor objects, mapped (serialized) to JSON (right-side of screen).

Results of Querying the Actor’s findAll Method

In this second example, I call the Film resource’s ‘Id’ method, to locate a single Film object, using the HTTP GET method. The call to ‘http://localhost:8080/MySQLDemoService/webresources/com.mysql.entities.film/719’ results in an HTTP Response with a single Film object, identified by Id 719. This time the object is marshalled to XML, instead of JSON.

Querying the Film’s Id Method with Input Parameter
