Unlocking the Potential of Generative AI for Synthetic Data Generation

Posted by Gary A. Stafford in AI/ML, Cloud, Enterprise Software Development, Machine Learning, Python, Software Development, Technology Consulting on April 18, 2023

Explore the capabilities and applications of generative AI to create realistic synthetic data for software development, analytics, and machine learning

Licensed image: Yurchanka Siarhei/Shutterstock

Introduction

Generative AI refers to a class of artificial intelligence algorithms capable of generating new data similar to a given dataset. These algorithms learn the underlying patterns and relationships in the data and use this knowledge to create new data consistent with the original dataset. Generative AI is a rapidly evolving field that has the potential to revolutionize the way we generate and use data.

Generative AI can generate synthetic data based on patterns and relationships learned from actual data. This ability to generate synthetic data has numerous applications, from creating realistic virtual environments for training and simulation to generating new data for machine learning models. In this article, we will explore the capabilities of generative AI and its potential to generate synthetic data, both directly and indirectly, for software development, data analytics, and machine learning.

Common Forms of Synthetic Data

According to AltexSoft, in their article Synthetic Data for Machine Learning: its Nature, Types, and Ways of Generation, common forms of synthetic data include:

Tabular data: This type of synthetic data is often used to generate datasets that resemble real-world data in terms of structure and statistical properties.
Time series data: This type of synthetic data generates datasets that resemble real-world time series data. It is commonly used when real-world time series data is unavailable or too expensive.
Image and video data: This synthetic data is used to generate realistic images and videos for training machine learning models or simulations.
Text data: This synthetic data generates realistic text for natural language processing tasks or for generating training data for machine learning models.
Sound data: This synthetic data generates realistic sound for training machine learning models or simulations.

Synthetic Tabular Data Types

Synthetic tabular data refers to artificially generated datasets that resemble real-world tabular data in terms of structure and statistical properties. Tabular data is organized into rows and columns, like tables or spreadsheets. Some specific types of synthetic tabular data include:

Financial data: Synthetic datasets that resemble real-world financial data such as bank transactions, stock prices, or credit card information.
Customer data: Synthetic datasets that resemble real-world customer data, such as purchase history, demographic information, or customer behavior.
Medical data: Synthetic datasets that resemble real-world medical data, such as patient records, medical test results, or treatment history.
Sensor data: Synthetic datasets that resemble real-world sensor data such as temperature readings, humidity levels, or air quality measurements.
Sales data: Synthetic datasets that resemble real-world sales data, such as sales transactions, product information, or customer behavior.
Inventory data: Synthetic datasets that resemble real-world inventory data, such as stock levels, product information, or supplier information.
Marketing data: Synthetic datasets that resemble real-world marketing data, such as campaign performance, customer behavior, or market trends.
Human resources data: Synthetic datasets that resemble real-world human resources data, such as employee records, performance evaluations, or salary information.

Challenges with Creating Synthetic Data

According to sources including Towards Data Science, enov8, and J.P. Morgan, there are several challenges in creating synthetic data, including:

Technical difficulty: Properly modeling complex real-world behaviors as synthetic data is challenging, given available technologies.
Biased behavior: The flexible nature of synthetic data makes it prone to potentially biased results.
Privacy concerns: Care must be taken to ensure synthetic data does not reveal sensitive information.
Quality of the data model: If the quality of the data model is not high, wrong conclusions can be reached.
Time and effort: Synthetic data generation requires time and effort.

Difficult Patterns and Behaviors to Model

Many patterns and behaviors can be challenging to model in synthetic data, for example:

Rare events: If certain events rarely occur in the real world, generating synthetic data that accurately reflects their distribution can be difficult.
Complex relationships: Synthetic data generators may struggle to capture complex relationships between variables, such as non-linear interactions, feedback loops, or conditional dependencies.
Contextual variability: Contextual factors, such as time, location, or individual differences, can have a significant impact on the distribution of data. Modeling this variability accurately can be challenging.
Outliers and anomalies: Synthetic data generators may be unable to generate outliers or anomalies that are realistic and representative of the real world.
Dynamic data: If the data is dynamic and changes over time, it can be challenging to generate synthetic data that captures these changes accurately.
Unobserved variables: Sometimes, variables may be necessary for understanding the data distribution but are not directly observable. These variables can be challenging to model in synthetic data.
Data bias: If the real-world data is biased in some way, such as over-representation of certain groups or under-representation of others, it can be challenging to generate synthetic data that is unbiased and representative of the population.
Time-dependent patterns: If the data exhibits time-dependent patterns, such as seasonality or trends, it can be challenging to generate synthetic data that accurately reflects them.
Spatial patterns: If the data has a spatial component, such as location data or images, it can be challenging to generate synthetic data that captures the spatial patterns realistically.
Data sparsity: If the data is sparse or incomplete, it can be difficult to generate synthetic data that accurately reflects the distribution of the entire dataset.
Human behavior: If the data involves human behavior, such as in social science or behavioral economics, it can be challenging to model the complex and nuanced behaviors of individuals and groups.
Sensitive or confidential information: In some cases, the data may contain sensitive or confidential information that cannot be shared, making it challenging to generate synthetic data that preserves privacy while accurately reflecting the underlying distribution.

Overall, many patterns can be challenging to model accurately in synthetic data. Doing so often requires careful consideration of the specific data characteristics and the limitations of the synthetic data generation techniques.

Easily Modeled Patterns and Behaviors

There are many simple and well-understood patterns that can be easily modeled in synthetic data, for example:

Randomness: If the data is purely random, it can be easily generated using a random number generator or other simple techniques.
Gaussian distribution: If the data follows a Gaussian or normal distribution, it can be generated using a Gaussian random number generator.
Uniform distribution: If the data follows a uniform distribution, it can be easily generated using a uniform random number generator.
Linear relationships: If the data follows a linear relationship between variables, it can be modeled using simple linear regression techniques.
Categorical variables: If the data consists of categorical variables, such as gender or occupation, it can be generated using a categorical distribution.
Text data: If the data consists of text, it can be generated using natural language processing techniques, such as language models or text generation algorithms.
Time series: If the data consists of time series data, such as stock prices or weather data, it can be generated using time series models, such as autoregressive integrated moving average (ARIMA) or long short-term memory (LSTM) models.
Seasonality: If the data exhibits seasonal patterns, such as higher sales data during holiday periods, it can be generated using seasonal time series models.
Proportions and percentages: If the data consists of proportions or percentages, such as product sales distribution across different regions, it can be generated using beta or Dirichlet distributions.
Multivariate normal distribution: If the data follows a multivariate normal distribution, it can be generated using a multivariate Gaussian random number generator.
Networks: If the data consists of network or graph data, such as social networks or transportation networks, it can be generated using network models, such as Erdos-Renyi or Barabasi-Albert models.
Binary data: If the data consists of binary data, such as whether a customer churned, it can be generated using a Bernoulli distribution.
Geospatial data: If the data involves geospatial data, such as the location of points of interest, it can be generated using geospatial models, such as point processes or spatial point patterns.
Customer behaviors: If the data involves customer behaviors, such as browsing or purchase histories, it can be generated using customer journey models, such as Markov models.

In general, simple and well-understood patterns can be easily modeled using synthetic data techniques. In contrast, more complex and nuanced patterns may require more sophisticated modeling techniques and a deeper understanding of the underlying data characteristics.
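Several of the patterns listed above can be sketched using only Python’s standard library. The following is a minimal illustration, not a production generator; the means, probabilities, weights, and page names are all hypothetical values chosen for the example.

```python
import random

rng = random.Random(42)  # seeded so the sketch is repeatable

# Gaussian: e.g., order totals centered on 10 with a standard deviation of 2
gaussian_values = [rng.gauss(10, 2) for _ in range(1000)]

# Uniform: e.g., a discount between 0% and 15%
uniform_values = [rng.uniform(0.0, 0.15) for _ in range(1000)]

# Bernoulli (binary): e.g., a customer churned with probability 0.2
churned = [rng.random() < 0.2 for _ in range(1000)]

# Categorical: weighted choice among labels
payment_types = rng.choices(
    ["Cash", "Credit", "Debit"], weights=[0.2, 0.5, 0.3], k=1000
)

# First-order Markov chain: the next page depends only on the current page
transitions = {
    "home": (["menu", "exit"], [0.7, 0.3]),
    "menu": (["cart", "home", "exit"], [0.5, 0.2, 0.3]),
    "cart": (["checkout", "menu", "exit"], [0.6, 0.2, 0.2]),
}


def simulate_journey(start="home", max_steps=20):
    """Walk the chain until a terminal state or max_steps is reached."""
    page, journey = start, [start]
    for _ in range(max_steps):
        if page not in transitions:  # 'checkout' and 'exit' are terminal
            break
        states, weights = transitions[page]
        page = rng.choices(states, weights=weights, k=1)[0]
        journey.append(page)
    return journey
```

Calling simulate_journey() returns a list of page names that starts at "home" and ends at a terminal page or after max_steps transitions.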

Creating Synthetic Data with Generative AI

We can use many popular generative AI-powered tools to create synthetic data for testing applications, constructing analytics pipelines, and building machine learning models. Tools include OpenAI ChatGPT, Microsoft’s all-new Bing Chat, ChatSonic, Tabnine, GitHub Copilot, and Amazon CodeWhisperer. For more information on these tools, check out my recent blog post:

Accelerating Development with Generative AI-Powered Coding Tools

Let’s start with a simple example of generating synthetic sales data. Suppose we have created a new sales forecasting application for coffee shops that we need to test using synthetic data. We might start by prompting a generative AI tool like OpenAI’s ChatGPT for some data:

Create a CSV file with 25 random sales records for a coffee shop.
Each record should include the following fields:
  - id (incrementing integer starting at 1)
  - date (random date between 1/1/2022 and 12/31/2022)
  - time (random time between 6:00am and 9:00pm in 1-minute increments)
  - product_id (incrementing integer starting at 1)
  - product
  - calories
  - price in USD
  - type (drink or food)
  - quantity (random integer between 1 and 3)
  - amount (price * quantity)
  - payment type (cash, credit, debit, or gift card)

The content and structure of a prompt can vary, and this can strongly influence ChatGPT’s response. Based on the above prompt, the results were accurate but, in this format, not very useful for testing our application. ChatGPT cannot create a physical CSV file. Furthermore, ChatGPT’s response length is limited; only about twenty of the 25 requested records were returned. According to ChatGPT itself, its responses are generally limited to about 2,048 tokens, which it describes as the maximum output length allowed by the GPT-3 model.

Instead of outputting the actual synthetic data, we could ask ChatGPT to write a program that can, in turn, generate the synthetic data. This option is certainly more scalable. Let’s prompt ChatGPT to write a Python program to generate synthetic sales data with the same characteristics as before:

Create a Python3 program to generate 100 sales of common items 
sold in a coffee shop. The data should be written to a CSV file
and include a header row. Each record should include the following fields:
  - id (incrementing integer starting at 1)
  - date (random date between 1/1/2022 and 12/31/2022)
  - time (random time between 6:00am and 9:00pm in 1-minute increments)
  - product_id (incrementing integer starting at 1)
  - product
  - calories
  - price in USD
  - type (drink or food)
  - quantity (random integer between 1 and 3)
  - amount (price * quantity)
  - payment type (cash, credit, debit, or gift card)

Using a single concise prompt, ChatGPT generated a complete Python program, including code comments, to generate synthetic sales data. Unfortunately, given ChatGPT’s response size limitation, the coffee shop menu was limited to just six items. Reprompting for more items would result in the truncation of the output and, thus, the program, making it unrunnable. Instead, we could use an additional prompt to generate a longer Python list of menu items and combine the two pieces of code in our IDE. Regardless, we will still need to copy and paste the code into our IDE to review, debug, test, and run.

ChatGPT’s Python program, copied and pasted into VS Code, ran without modifications, and wrote 100 synthetic sales records to a CSV file!

Running ChatGPT’s Python program in VS Code

Using IDE-based Generative AI Tools

Although generating synthetic data or code snippets directly in chat-based generative AI tools is helpful for limited use cases, writing code in an IDE gives us several advantages:

Code does not need to be copied and pasted from external sources into an IDE
Consecutive lines of code, method, and block code completion overcome the single response size limits of chat-based tools like OpenAI ChatGPT
Code can be reused and adapted to evolving use cases over time
Python interpreter and debugger or equivalent for other languages
Automatic code formatting, linting, and code style enforcement
Unit, integration, and functional testing
Static code analysis (SCA)
Vulnerability scanning
IntelliSense for code completion
Source code management (SCM) / version control

Let’s use the same techniques we used with ChatGPT, but from within an IDE to generate three types of synthetic data. We will choose Microsoft’s VS Code with GitHub Copilot and Python as our programming language.

VS Code IDE with GitHub Copilot Nightly extension installed

Source Code

All the code examples shown in this post can be found on GitHub.

Example #1: Coffee Shop Sales Data

First, we will start by outlining the program’s objective using code comments at the top of our Python file. This detailed context helps us clearly express our goal and enables Copilot to generate an accurate response.

# Write a program that creates synthetic sales data for a coffee shop.
# The program should accept a command line argument that specifies the number of records to generate.
# The program should write the sales data to a file called 'coffee_shop_sales_data.csv'.
# The program should contain the following functions:
#   - main() function that calls the other functions
#   - function that returns one random product from a list of dictionaries
#   - function that returns a dictionary containing one sales record
#   - function that writes the sales records to a file

Following the import statements, which were also generated with Copilot’s assistance, we will write the first function to return a random product from a list of 25 products. Again, we will use code comments as a prompt to generate the code. Copilot was able to generate 100% of the function’s code from the comments.

# Write a function to create list of dictionaries.
# The list of dictionaries should contain 15 drink items and 10 food items sold in a coffee shop.
# Include the product id, product name, calories, price, and type (Food or Drink).
# Capitalize the first letter of each product name.
# Return a random item from the list of dictionaries.

Below is an example of Copilot’s ability to generate complete lines of code. Ultimately, it generated 100% of the function including choosing the items sold in a coffee shop, with a reasonable price and caloric count. Copilot is not limited to just understanding code.

Copilot can generate entire lines of code

Next, we will write a function to return a random sales record.

# Write a function to return a random sales record.
# The record should be a dictionary with the following fields:
#   - id (an incrementing integer starting at 1)
#   - date (a random date between 1/1/2022 and 12/31/2022)
#   - time (a random time between 6:00am and 9:00pm in 1 minute increments)
#   - product_id, product, calories, price, and type (from the get_product function)
#   - quantity (a random integer between 1 and 3)
#   - amount (price * quantity)
#   - payment type (Cash, Credit, Debit, Gift Card, Apple Pay, Google Pay, or Venmo)

Again, Copilot generated 100% of the function’s code as a single block based on the code comments.

Copilot can generate entire blocks of code or methods
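The prompt above asks for a time in 1-minute increments. One way to satisfy that requirement is to draw a random number of whole minutes rather than seconds. The function below is a minimal sketch under that assumption; the name random_time_by_minute is hypothetical, and it is not the code Copilot generated.

```python
import random
from datetime import datetime, timedelta


def random_time_by_minute(rng=random):
    """Return a time between 6:00am and 9:00pm on a whole-minute boundary."""
    opening = datetime.strptime("6:00am", "%I:%M%p")
    closing = datetime.strptime("9:00pm", "%I:%M%p")
    # number of whole minutes between opening and closing (15 hours = 900)
    total_minutes = int((closing - opening).total_seconds() // 60)
    offset = rng.randint(0, total_minutes)  # inclusive on both ends
    return (opening + timedelta(minutes=offset)).strftime("%H:%M:%S")
```

Because the offset is a whole number of minutes, the seconds component of every returned time is 00.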

Lastly, we will create a function to write the sales data into a CSV file using Copilot’s help.

# Write a function to write the sales records to a CSV file called 'coffee_shop_sales.csv'.
# Use an input parameter to specify the number of records to write.
# The CSV file must have a header row and be comma delimited.
# All string values must be enclosed in double quotes.

Again below, we see an example of Copilot’s ability to generate an entire Python function. I needed to correct a few problems with the generated code. First, there was a lack of quotes for string values, which I added to the function (quotechar='"', quoting=csv.QUOTE_NONNUMERIC). Also, the function was missing a key line of code, sale = get_sales_record(), which would have caused the code to fail. Remember, just because the code was generated does not mean it is correct.
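To see what the quoting fix does, here is a small, self-contained sketch that writes two rows to an in-memory buffer instead of a file. With csv.QUOTE_NONNUMERIC, the writer quotes every non-numeric field and leaves numbers unquoted. The row values are examples only.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(
    buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
)

# strings are wrapped in double quotes; ints and floats are written as-is
writer.writerow(["product", "price", "quantity"])
writer.writerow(["Latte", 3.5, 2])

csv_text = buffer.getvalue()
print(csv_text)
# "product","price","quantity"
# "Latte",3.5,2
```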

Here is the complete program that creates synthetic sales data for a coffee shop with Copilot assistance:

# Purpose: Generate coffee shop sales data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-12
# Usage: python3 coffee_shop_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)

# Write a program that creates synthetic sales data for a coffee shop.
# The program should accept a command line argument that specifies the number of records to generate.
# The program should write the sales data to a file called 'coffee_shop_sales_data.csv'.
# The program should contain the following functions:
#   - main() function that calls the other functions
#   - function that returns one random product from a list of dictionaries
#   - function that returns a dictionary containing one sales record
#   - function that writes the sales records to a file

import argparse
import csv
import hashlib
import random
from datetime import datetime, timedelta


def main():
    # create a parser object
    parser = argparse.ArgumentParser(
        description="Generate coffee shop sales data")

    # add a command line argument to specify the number of records to generate;
    # nargs="?" makes the positional argument optional so the default applies
    parser.add_argument("num_recs",
                        type=int,
                        nargs="?",
                        help="The number of records to generate",
                        default=100)

    num_recs = parser.parse_args().num_recs

    write_data(num_recs)


# Write a function to create list of dictionaries.
# The list of dictionaries should contain 15 drink items and 10 food items sold in a coffee shop.
# Include the product id, product name, calories, price, and type (Food or Drink).
# Capitalize the first letter of each product name.
# Return a random item from the list of dictionaries.
def get_product():
    products = [
        {"id": 1, "product": "Latte", "calories": 120, "price": 3.50, "type": "Drink"},
        {"id": 2, "product": "Cappuccino", "calories": 100, "price": 3.00, "type": "Drink"},
        {"id": 3, "product": "Americano", "calories": 5, "price": 2.50, "type": "Drink"},
        {"id": 4, "product": "Espresso", "calories": 10, "price": 2.00, "type": "Drink"},
        {"id": 5, "product": "Mocha", "calories": 250, "price": 4.00, "type": "Drink"},
        {"id": 6, "product": "Iced Coffee", "calories": 80, "price": 2.50, "type": "Drink"},
        {"id": 7, "product": "Hot Chocolate", "calories": 300, "price": 3.50, "type": "Drink"},
        {"id": 8, "product": "Tea", "calories": 0, "price": 2.00, "type": "Drink"},
        {"id": 9, "product": "Frappe", "calories": 450, "price": 5.00, "type": "Drink"},
        {"id": 10, "product": "Smoothie", "calories": 200, "price": 4.00, "type": "Drink"},
        {"id": 11, "product": "Iced Tea", "calories": 0, "price": 2.50, "type": "Drink"},
        {"id": 12, "product": "Lemonade", "calories": 120, "price": 3.00, "type": "Drink"},
        {"id": 13, "product": "Hot Tea", "calories": 0, "price": 2.00, "type": "Drink"},
        {"id": 14, "product": "Chai Tea", "calories": 200, "price": 3.50, "type": "Drink"},
        {"id": 15, "product": "Iced Chai", "calories": 250, "price": 4.00, "type": "Drink"},
        {"id": 16, "product": "Croissant", "calories": 231, "price": 2.99, "type": "Food"},
        {"id": 17, "product": "Bagel", "calories": 289, "price": 3.49, "type": "Food"},
        {"id": 18, "product": "Muffin", "calories": 426, "price": 3.99, "type": "Food"},
        {"id": 19, "product": "Sandwich", "calories": 512, "price": 6.99, "type": "Food"},
        {"id": 20, "product": "Wrap", "calories": 388, "price": 5.99, "type": "Food"},
        {"id": 21, "product": "Salad", "calories": 231, "price": 7.99, "type": "Food"},
        {"id": 22, "product": "Quiche", "calories": 456, "price": 4.99, "type": "Food"},
        {"id": 23, "product": "Scone", "calories": 335, "price": 2.49, "type": "Food"},
        {"id": 24, "product": "Pastry", "calories": 397, "price": 3.99, "type": "Food"},
        {"id": 25, "product": "Cake", "calories": 512, "price": 5.99, "type": "Food"},
    ]

    # return one random item from list of dictionaries
    return random.choice(products)


# Write a function to return a random sales record.
# The record should be a dictionary with the following fields:
#   - transaction_id (a hash of the date, time, and product_id)
#   - date (a random date between 1/1/2022 and 12/31/2022)
#   - time (a random time between 6:00am and 9:00pm in 1 minute increments)
#   - product_id, product, calories, price, and type (from the get_product function)
#   - quantity (a random integer between 1 and 3)
#   - amount (price * quantity)
#   - payment type (Cash, Credit, Debit, Gift Card, Apple Pay, Google Pay, or Venmo)
def get_sales_record():
    # get a random product
    product = get_product()

    # get a random date between 1/1/2022 and 12/31/2022
    start_date = datetime(2022, 1, 1)
    end_date = datetime(2022, 12, 31)
    random_date = start_date + timedelta(
        # get a random number of seconds between 0 and the number of seconds
        # between start_date and end_date
        seconds=random.randint(
            0, int((end_date - start_date).total_seconds())))

    # get a random time between 6:00am and 9:00pm
    start_time = datetime.strptime("6:00am", "%I:%M%p")
    end_time = datetime.strptime("9:00pm", "%I:%M%p")
    random_time = start_time + timedelta(
        # get a random number of seconds between 0 and the number of seconds
        # between start_time and end_time
        seconds=random.randint(
            0, int((end_time - start_time).total_seconds())))

    # get a random quantity between 1 and 3
    random_quantity = random.randint(1, 3)

    # get a random payment type:
    # Cash, Credit, Debit, Gift card, Apple Pay, Google Pay, Venmo
    random_payment_type = random.choice(
        ["Cash", "Credit", "Debit", "Gift card", "Apple Pay", "Google Pay", "Venmo"])

    sales_record = {
        "date": random_date.strftime("%m/%d/%Y"),
        "time": random_time.strftime("%H:%M:%S"),
        "product_id": product["id"],
        "product": product["product"],
        "calories": product["calories"],
        "price": product["price"],
        "type": product["type"],
        "quantity": random_quantity,
        "amount": product["price"] * random_quantity,
        "payment_type": random_payment_type,
    }

    return sales_record


# Write a function to write the sales records to a CSV file called 'coffee_shop_sales_data.csv'.
# Use an input parameter to specify the number of records to write.
# Call the get_sales_record function once for each record to write.
# The CSV file must have a header row and be comma delimited.
# All string values must be enclosed in double quotes.
def write_data(rec_count):
    # open the file for writing (the 'output' directory must already exist)
    with open("output/coffee_shop_sales_data.csv", "w", newline="") as csv_file:
        # create a csv writer object
        csv_writer = csv.writer(csv_file,
                                delimiter=",",
                                quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC)

        # write the header row
        # transaction_id,date,time,product_id,product,calories,price,type,quantity,amount,payment_type
        csv_writer.writerow([
            "transaction_id",
            "date",
            "time",
            "product_id",
            "product",
            "calories",
            "price",
            "type",
            "quantity",
            "amount",
            "payment_type",
        ])

        # write the sales records
        for i in range(rec_count):
            sale = get_sales_record()
            transaction_id = hashlib.md5((f'{sale["date"]} {sale["time"]} {sale["product_id"]}').encode()).hexdigest()
            csv_writer.writerow([
                transaction_id,
                sale["date"],
                sale["time"],
                sale["product_id"],
                sale["product"],
                sale["calories"],
                sale["price"],
                sale["type"],
                sale["quantity"],
                sale["amount"],
                sale["payment_type"],
            ])


if __name__ == "__main__":
    main()


Copilot generated an astounding 80–85% of the program’s final code. The initial program took 10–15 minutes to write using code comments. I then added a few new features, including the ability to pass in the record count on the command line and the hash-based transaction id, which took another 5 minutes. Finally, I used GitHub Code Brushes to optimize the code and generate the Python docstrings, and Black Formatter and Flake8 extensions to format and lint, all of which took less than 5 minutes. With testing and debugging, the total time was about 25–30 minutes.
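The hash-based transaction id is worth a closer look. Because it is an MD5 digest of the date, time, and product id, it is deterministic: two sales of the same product at the same second produce the same id. The sketch below mirrors that approach in a standalone helper (the function name transaction_id is hypothetical) and shows uuid.uuid4() as one possible alternative when every record must have a unique id.

```python
import hashlib
import uuid


def transaction_id(date, time, product_id):
    """Hash the date, time, and product_id into a 32-character hex string."""
    key = f"{date} {time} {product_id}"
    return hashlib.md5(key.encode()).hexdigest()


first = transaction_id("06/27/2022", "19:59:42", 22)
second = transaction_id("06/27/2022", "19:59:42", 22)
different = transaction_id("06/27/2022", "19:59:43", 22)

print(first == second)     # identical inputs give identical ids
print(first == different)  # a one-second difference gives a different id

# An alternative, if ids must be unique regardless of the record's content
random_id = uuid.uuid4().hex
```

For test data, a repeated id may be acceptable or even useful for exercising duplicate handling; it is a design choice rather than a defect.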

The most significant difference with Copilot was that I never had to leave the IDE to look up code references or find existing sales datasets or even a coffee shop menu to duplicate. The code, as well as the list of products, price, calories, and product type, were all generated by Copilot.

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of reflecting daily, weekly, and seasonal variations in product choice and sales volumes. This might include simulating increased sales during the busy morning rush hour or a preference for iced drinks in the summer months versus hot drinks during the winter months.
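As a rough illustration of that idea, the sketch below weights the hour of a sale toward the morning rush and shifts the preference between iced and hot drinks by month. All weights are hypothetical values chosen for the example, and the function names are assumptions, not part of the program above.

```python
import random

# Hypothetical relative weights for each opening hour (6am through 8pm);
# the morning rush hours carry the most weight
HOUR_WEIGHTS = {
    6: 3, 7: 6, 8: 8, 9: 6, 10: 4, 11: 4, 12: 5, 13: 4,
    14: 3, 15: 3, 16: 3, 17: 3, 18: 2, 19: 2, 20: 1,
}


def pick_hour(rng):
    """Choose an hour of the day in proportion to its weight."""
    hours = list(HOUR_WEIGHTS)
    weights = [HOUR_WEIGHTS[hour] for hour in hours]
    return rng.choices(hours, weights=weights, k=1)[0]


def drink_weights(month):
    """Return (iced_weight, hot_weight), favoring iced drinks June to August."""
    return (0.7, 0.3) if month in (6, 7, 8) else (0.3, 0.7)


def pick_drink_style(month, rng):
    """Choose 'Iced' or 'Hot' using the seasonal weights for the month."""
    iced, hot = drink_weights(month)
    return rng.choices(["Iced", "Hot"], weights=[iced, hot], k=1)[0]
```

A sales record generator could call pick_hour() before choosing the minute, and pick_drink_style() before choosing a drink from the matching subset of the menu.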

Here is an example of the synthetic sales data output by the example application:

transaction_id	date	time	product_id	product	calories	price	type	quantity	amount	payment_type
47a157f84e727fe3335db1519ee736a6	06/27/2022	19:59:42	22	Quiche	456	4.99	Food	2	9.98	Debit
1bf01013e699ca0f804650ea50826c82	11/20/2022	06:21:14	22	Quiche	456	4.99	Food	3	14.97	Cash
84f41c15749090d1e79bf9a48a58d6c3	08/18/2022	11:50:22	14	Chai Tea	200	3.5	Drink	2	7.0	Apple Pay
ef1845b8438bf3b5b99d2f4891a48f03	11/13/2022	17:20:51	12	Lemonade	120	3.0	Drink	2	6.0	Debit
9863de11be3099d6361392584e30e624	06/03/2022	18:27:03	18	Muffin	426	3.99	Food	2	7.98	Gift card
f50ed8878250bc06f66b97f5cd2f6df7	02/21/2022	17:02:18	7	Hot Chocolate	300	3.5	Drink	2	7.0	Credit
1903169473f41a0275ee702f2c6b1dd6	05/24/2022	14:58:25	10	Smoothie	200	4.0	Drink	3	12.0	Venmo
164a9519fd3db952e721e9f55dc1be74	01/07/2022	14:19:35	14	Chai Tea	200	3.5	Drink	2	7.0	Debit
dc85a202143de48ad4646190cdc0bf5c	01/28/2022	08:52:38	20	Wrap	388	5.99	Food	1	5.99	Venmo
7d182be793ab4b3290d14968e9c9a3e3	02/20/2022	10:16:10	16	Croissant	231	2.99	Food	2	5.98	Google Pay
f524811f456481f5a4a2f0c09dcafa28	10/03/2022	16:19:35	10	Smoothie	200	4.0	Drink	3	12.0	Debit
f2f63eacd33cb9f13849b08638e6fc3d	07/06/2022	16:14:20	10	Smoothie	200	4.0	Drink	3	12.0	Cash
7435e8cc8771a8b538529ff6f873cc05	04/12/2022	16:35:51	14	Chai Tea	200	3.5	Drink	3	10.5	Venmo
0e1154abb6a79b9ece570d51e3116846	11/28/2022	14:44:05	21	Salad	231	7.99	Food	2	15.98	Debit
d5926d58011a2a25798bf2aa72552cf0	04/19/2022	16:06:09	22	Quiche	456	4.99	Food	1	4.99	Venmo


Example #2: Residential Address Data

We could use these same techniques to generate a list of residential addresses. To start, we can prompt Copilot for the values in a list of common street names and street types in the United States:

# Write a function that creates a list of common street names
# in the United States, in alphabetical order.
# List should be in alphabetical order. Each name should be unique.
# Return a random street name.
def get_street_name():
    street_names = [
        "Ash", "Bend", "Bluff", "Branch", "Bridge", "Broadway", "Brook", "Burg",
        "Bury", "Canyon", "Cape", "Cedar", "Cove", "Creek", "Crest", "Crossing",
        "Dale", "Dam", "Divide", "Downs", "Elm", "Estates", "Falls", "Fifth",
        "First", "Fork", "Fourth", "Glen", "Green", "Grove", "Harbor", "Heights",
        "Hickory", "Hill", "Hollow", "Island", "Isle", "Knoll", "Lake", "Landing",
        ...
    ]

    return random.choice(street_names)


# Write a function that creates a list of common street types 
# in the United States, in alphabetical order.
# List should be in alphabetical order. Each name should be unique.
# Return a random street type.
def get_street_type():
    street_types = [
        "Alley", "Avenue", "Bend", "Bluff", "Boulevard", "Branch", "Bridge", "Brook", 
        "Burg", "Circle", "Commons", "Court", "Drive", "Highway", "Lane", "Parkway", 
        "Place", "Road", "Square", "Street", "Terrace", "Trail", "Way"
    ]

    return random.choice(street_types)

Next, we can create a function that returns a property type based on a categorical distribution of common residential property types with the prompt:

# Write a function to return a random property type.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 63% Single-family, 26% Multi-family, 4% Condo,
# 3% Townhouse, 2% Mobile home, 1% Farm, 1% Other.

Again, Copilot generated 100% of the function’s code as a single block based on the code comments.

Copilot generating a categorical distribution of common residential property types
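For readers who cannot see the screenshot, a function matching the prompt could look like the following. This is a minimal sketch using cumulative thresholds, not necessarily the code Copilot produced; the function name get_property_type is an assumption, while the percentages come from the prompt.

```python
def get_property_type(value):
    """Map a value between 0 and 1 to a property type using cumulative shares."""
    distribution = [
        ("Single-family", 0.63),
        ("Multi-family", 0.26),
        ("Condo", 0.04),
        ("Townhouse", 0.03),
        ("Mobile home", 0.02),
        ("Farm", 0.01),
        ("Other", 0.01),
    ]
    cumulative = 0.0
    for property_type, share in distribution:
        cumulative += share
        if value < cumulative:
            return property_type
    return "Other"  # guards against floating-point rounding at the top end
```

Calling get_property_type(random.random()) many times should return "Single-family" for roughly 63% of the calls.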

Additionally, we could have Copilot help us generate a list of the 50 largest cities in the United States with state, zip code, and population, with the prompt:

# Write a function that returns the 50 largest cities in the United States.
# List should be sorted in descending order by population.
# Include the city, state abbreviation, zip code, and population.
# Return a list of dictionaries.

Once again, Copilot generated 100% of the function’s code using a combination of single lines and code blocks based on the code comments.

Copilot generating a block of US cities with state, zip code, and populations

When randomly choosing a city, we can use a categorical distribution based on the cities’ populations to control the distribution of cities in the final synthetic dataset. For example, there will be more addresses in larger cities like New York City or Los Angeles than in smaller cities like Buffalo or Virginia Beach. We can prepare the city data with the prompt:

# Write a function that calculates the total population of the list of cities.
# Add a 'pcnt_of_total_population' and 'pcnt_running_total' columns to list.
# Returns a sorted list of cities by population.
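The idea behind this prompt can be sketched as follows: each city’s share of the total population becomes its probability, and the running total turns a random value between 0 and 1 into a city. This is a minimal sketch under stated assumptions, not the code Copilot generated; the helper pick_city is hypothetical, and the sample list uses placeholder names and populations rather than real data.

```python
def prepare_cities(cities):
    """Add percent-of-total and running-total columns, largest city first."""
    total = sum(city["population"] for city in cities)
    ordered = sorted(cities, key=lambda city: city["population"], reverse=True)
    running = 0.0
    for city in ordered:  # note: the input dictionaries are updated in place
        city["pcnt_of_total_population"] = city["population"] / total
        running += city["pcnt_of_total_population"]
        city["pcnt_running_total"] = running
    return ordered


def pick_city(prepared, value):
    """Return the first city whose running total exceeds value (0 <= value < 1)."""
    for city in prepared:
        if value < city["pcnt_running_total"]:
            return city
    return prepared[-1]  # guards against floating-point rounding near 1.0


# Placeholder data for illustration only; not real cities or populations
sample = [
    {"city": "City B", "population": 300},
    {"city": "City A", "population": 600},
    {"city": "City C", "population": 100},
]
prepared = prepare_cities(sample)
```

With these placeholder values, a random value below 0.6 selects City A, a value from 0.6 up to 0.9 selects City B, and anything higher selects City C.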

Here is the complete program that creates synthetic US-based address data with Copilot assistance:

# Purpose: Generate US residential address data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-13
# Usage: python3 residential_address_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)

# Write an application that creates a random list of United States addresses.
# The application should accept a command line argument that specifies the number of records to generate.
# Include address, city, state, zip code, country, and property type.
# Write the data to a csv file named 'address_data.csv'.
# The application should contain the following functions:
#   - main() function that calls the other functions
#   - function that returns a list of common street names in the United States
#   - function that returns a list of common street types in the United States
#   - function that returns a list of common city, state, zip code, and population in the United States
#   - function that returns a property type


import csv
import random
import argparse

cities_final = []


def main():
    parser = argparse.ArgumentParser(description="Generate US residential address data")
    # nargs="?" makes the positional argument optional so the default applies
    parser.add_argument(
        "rec_count",
        type=int,
        nargs="?",
        help="The number of records to generate",
        default=100,
    )

    # add population calculations to the city data
    cities = get_cities()
    prepare_cities(cities)

    rec_count = parser.parse_args().rec_count
    write_data(rec_count)


	# Write a function that creates a list of common street names
	# in the United States, in alphabetical order.
	# Each one should be unique.
	# Return a random street name.
	def get_street_name():
	street_names = [
	"Ash", "Bend", "Bluff", "Branch", "Bridge", "Broadway", "Brook", "Burg",
	"Bury", "Canyon", "Cape", "Cedar", "Cove", "Creek", "Crest", "Crossing",
	"Dale", "Dam", "Divide", "Downs", "Elm", "Estates", "Falls", "Fifth",
	"First", "Fork", "Fourth", "Glen", "Green", "Grove", "Harbor", "Heights",
	"Hickory", "Hill", "Hollow", "Island", "Isle", "Knoll", "Lake", "Landing",
	"Lawn", "Main", "Manor", "Maple", "Meadow", "Meadows", "Mill", "Mills",
	"Mission", "Mount", "Mountain", "Oak", "Oaks", "Orchard", "Park", "Parkway",
	"Pass", "Path", "Pike", "Pine", "Place", "Plain", "Plains", "Port", "Prairie",
	"Ridge", "River", "Road", "Rock", "Rocks", "Second", "Seventh", "Shoals",
	"Shore", "Shores", "Sixth", "Skyway", "Spring", "Springs", "Spur", "Station",
	"Summit", "Sunset", "Terrace", "Third", "Trace", "Track", "Trail", "Tunnel",
	"Turnpike", "Vale", "Valley", "View", "Village", "Ville", "Vista", "Walk",
	"Way", "Well", "Wells", "Wood", "Woods", "Worth"
	]

	return random.choice(street_names)


	# Write a function that creates a list of common street types
	# in the United States, in alphabetical order.
	# Each one should be unique.
	# Return a random street type.
	def get_street_type():
	street_types = [
	"Alley", "Avenue", "Bend", "Bluff", "Boulevard", "Branch", "Bridge", "Brook",
	"Burg", "Circle", "Commons", "Court", "Drive", "Highway", "Lane", "Parkway",
	"Place", "Road", "Square", "Street", "Terrace", "Trail", "Way"
	]

	return random.choice(street_types)


	# Write a function that calculates the total population of the list of cities.
	# Add a 'pcnt_of_total_population' and 'pcnt_running_total' columns to list.
	# Returns a sorted list of cities by population.
	def prepare_cities(cities):
	total_population = 0 # 51,035,885
	for city in cities:
	total_population += city["population"]

	for city in cities:
	city["pcnt_of_total_population"] = city["population"] / total_population

	global cities_final
	cities_final = sorted(cities, key=lambda d: d["population"], reverse=True)

	running_total = 1
	for city in cities_final:
	running_total -= city["pcnt_of_total_population"]
	city["pcnt_running_total"] = running_total


	# Write a function to returns the 50 largest cities in the United States.
	# Include the city, state abbreviation, zip code, and population.
	# List should be sorted in descending order by population.
	# Return a list of dictionaries.
	def get_cities():
	cities = [
	{"city": "Albuquerque", "state": "NM", "zip": "87102", "population": 559277},
	{"city": "Anaheim", "state": "CA", "zip": "92801", "population": 345012},
	{"city": "Anchorage", "state": "AK", "zip": "99501", "population": 291826},
	{"city": "Arlington", "state": "TX", "zip": "76010", "population": 398121},
	{"city": "Atlanta", "state": "GA", "zip": "30303", "population": 486290},
	{"city": "Aurora", "state": "CO", "zip": "80010", "population": 325078},
	{"city": "Austin", "state": "TX", "zip": "78701", "population": 931830},
	{"city": "Bakersfield", "state": "CA", "zip": "93301", "population": 372576},
	{"city": "Baltimore", "state": "MD", "zip": "21202", "population": 602495},
	{"city": "Boston", "state": "MA", "zip": "02108", "population": 667137},
	{"city": "Buffalo", "state": "NY", "zip": "14202", "population": 258959},
	{"city": "Charlotte", "state": "NC", "zip": "28202", "population": 872498},
	{"city": "Chicago", "state": "IL", "zip": "60602", "population": 2695598},
	{"city": "Cincinnati", "state": "OH", "zip": "45202", "population": 296943},
	{"city": "Cleveland", "state": "OH", "zip": "44113", "population": 390113},
	{"city": "Colorado Springs", "state": "CO", "zip": "80903", "population": 456568},
	{"city": "Columbus", "state": "OH", "zip": "43215", "population": 822553},
	{"city": "Dallas", "state": "TX", "zip": "75201", "population": 1345047},
	{"city": "Denver", "state": "CO", "zip": "80202", "population": 682545},
	{"city": "Detroit", "state": "MI", "zip": "48226", "population": 672662},
	{"city": "El Paso", "state": "TX", "zip": "79901", "population": 674433},
	{"city": "Fort Worth", "state": "TX", "zip": "76102", "population": 792727},
	{"city": "Fresno", "state": "CA", "zip": "93721", "population": 509924},
	{"city": "Houston", "state": "TX", "zip": "77002", "population": 2296224},
	{"city": "Indianapolis", "state": "IN", "zip": "46204", "population": 843393},
	{"city": "Jacksonville", "state": "FL", "zip": "32202", "population": 842583},
	{"city": "Kansas City", "state": "MO", "zip": "64102", "population": 467007},
	{"city": "Las Vegas", "state": "NV", "zip": "89101", "population": 603488},
	{"city": "Long Beach", "state": "CA", "zip": "90802", "population": 462257},
	{"city": "Los Angeles", "state": "CA", "zip": "90001", "population": 3971883},
	{"city": "Louisville", "state": "KY", "zip": "40202", "population": 609893},
	{"city": "Memphis", "state": "TN", "zip": "38103", "population": 653450},
	{"city": "Mesa", "state": "AZ", "zip": "85201", "population": 508958},
	{"city": "Miami", "state": "FL", "zip": "33128", "population": 463347},
	{"city": "Milwaukee", "state": "WI", "zip": "53202", "population": 594833},
	{"city": "Minneapolis", "state": "MN", "zip": "55402", "population": 410939},
	{"city": "Nashville", "state": "TN", "zip": "37203", "population": 654610},
	{"city": "New York", "state": "NY", "zip": "10007", "population": 8405837},
	{"city": "Newark", "state": "NJ", "zip": "07102", "population": 281944},
	{"city": "Oakland", "state": "CA", "zip": "94607", "population": 406253},
	{"city": "Oklahoma City", "state": "OK", "zip": "73102", "population": 631346},
	{"city": "Omaha", "state": "NE", "zip": "68102", "population": 434353},
	{"city": "Philadelphia", "state": "PA", "zip": "19107", "population": 1526006},
	{"city": "Phoenix", "state": "AZ", "zip": "85003", "population": 1445632},
	{"city": "Pittsburgh", "state": "PA", "zip": "15222", "population": 305841},
	{"city": "Portland", "state": "OR", "zip": "97201", "population": 609456},
	{"city": "Raleigh", "state": "NC", "zip": "27601", "population": 403892},
	{"city": "Sacramento", "state": "CA", "zip": "95814", "population": 479686},
	{"city": "San Antonio", "state": "TX", "zip": "78205", "population": 1327407},
	{"city": "San Diego", "state": "CA", "zip": "92101", "population": 1307402},
	{"city": "San Francisco", "state": "CA", "zip": "94102", "population": 805235},
	{"city": "San Jose", "state": "CA", "zip": "95113", "population": 998537},
	{"city": "Seattle", "state": "WA", "zip": "98101", "population": 608660},
	{"city": "St. Louis", "state": "MO", "zip": "63102", "population": 319294},
	{"city": "Tampa", "state": "FL", "zip": "33602", "population": 335709},
	{"city": "Tucson", "state": "AZ", "zip": "85701", "population": 520116},
	{"city": "Virginia Beach", "state": "VA", "zip": "23451", "population": 448479},
	{"city": "Washington", "state": "DC", "zip": "20001", "population": 601723},
	]

	return cities


	# write a function to return a random city
	# accept a random value between 0 and 1 as an input parameter
	def get_city(rnd_value):
	for city in cities_final:
	if rnd_value >= city["pcnt_running_total"]:
	return city


	# Write a function to return a random property type.
	# Accept a random value between 0 and 1 as an input parameter.
	# The function must return one of the following values based on the %:
	# 63% Single-family, 26% Multi-family, 4% Condo,
	# 3% Townhouse, 2% Mobile home, 1% Farm, 1% Other.
	def get_property_type(rnd_value):
	if rnd_value < 0.63:
	return "Single-family"
	elif rnd_value < 0.89:
	return "Multi-family"
	elif rnd_value < 0.93:
	return "Condo"
	elif rnd_value < 0.96:
	return "Townhouse"
	elif rnd_value < 0.98:
	return "Mobile home"
	elif rnd_value < 0.99:
	return "Farm"
	else:
	return "Other"


	# Create a function to write the address records to a csv file called 'address_data.csv'.
	# Use an input parameter to specify the number of records to write.
	# The csv file must have a header row and be comma delimited.
	# All string values must be enclosed in double quotes.
	def write_data(rec_count):
	address_id = 0
	with open("output/address_data.csv", "w", newline="") as csv_file:
	csv_writer = csv.writer(
	csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
	)

	csv_writer.writerow(
	[
	"id",
	"address",
	"city",
	"state",
	"zip",
	"country",
	"property_type",
	"assessed_value",
	]
	)

	for i in range(rec_count):
	address_id += 1
	street_name = get_street_name()
	street_type = get_street_type()
	street_address = f"{random.randint(1, 9999)} {street_name} {street_type}"
	city_state_zip = get_city(random.random())
	country = "United States"
	property_type = get_property_type(random.random())
	assessed_value = random.randint(52500, 1950000)

	csv_writer.writerow(
	[
	address_id,
	street_address,
	city_state_zip["city"],
	city_state_zip["state"],
	city_state_zip["zip"],
	country,
	property_type,
	assessed_value,
	]
	)


	if __name__ == "__main__":
	main()

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of accurately reflecting assessed property values based on the type of residence and the zip code.
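
One way to start down that path is to draw the assessed value from a range that depends on the property type and then scale it by zip code. The sketch below only illustrates the idea: every range and multiplier is a made-up placeholder rather than real market data, and `VALUE_RANGES`, `ZIP_MULTIPLIERS`, and `get_assessed_value` are hypothetical names.

```python
import random

# Hypothetical (low, high) assessed-value ranges in US dollars, by property type.
# These figures are placeholders for illustration, not real market data.
VALUE_RANGES = {
    "Single-family": (150_000, 900_000),
    "Multi-family": (300_000, 1_950_000),
    "Condo": (100_000, 700_000),
    "Townhouse": (120_000, 650_000),
    "Mobile home": (52_500, 150_000),
    "Farm": (200_000, 1_500_000),
    "Other": (52_500, 1_950_000),
}

# Hypothetical multipliers keyed by zip code; unlisted zip codes use 1.0
ZIP_MULTIPLIERS = {"10007": 1.8, "94102": 1.7, "14202": 0.7}


def get_assessed_value(property_type, zip_code, rng=random):
    """Return an assessed value that depends on property type and zip code."""
    low, high = VALUE_RANGES[property_type]
    multiplier = ZIP_MULTIPLIERS.get(zip_code, 1.0)
    return int(rng.randint(low, high) * multiplier)


if __name__ == "__main__":
    print(get_assessed_value("Condo", "10007"))
    print(get_assessed_value("Mobile home", "73102"))
```

In the example application, a function like this could replace the single `random.randint(52500, 1950000)` call inside `write_data`.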

Here is an example of the synthetic US-based residential address data output by the example application:

id	address	city	state	zip	country	property_type	assessed_value
1	1008 Walk Burg	Houston	TX	77002	United States	Multi-family	1122321
2	7088 Second Square	Oklahoma City	OK	73102	United States	Single-family	261940
3	1425 Ridge Terrace	Indianapolis	IN	46204	United States	Single-family	1030391
4	982 Way Lane	New York	NY	10007	United States	Multi-family	95499
5	9404 Port Court	Columbus	OH	43215	United States	Single-family	922404
6	7135 Crossing Trail	Virginia Beach	VA	23451	United States	Single-family	272910
7	9481 Harbor Brook	New York	NY	10007	United States	Multi-family	232795
8	8585 Manor Branch	Raleigh	NC	27601	United States	Single-family	701217
9	7703 Bluff Boulevard	Las Vegas	NV	89101	United States	Single-family	530581
10	4186 Worth Circle	New York	NY	10007	United States	Townhouse	626577
11	8739 Prairie Trail	Portland	OR	97201	United States	Single-family	316167
12	4 Shores Road	Virginia Beach	VA	23451	United States	Single-family	925711
13	3148 Rocks Road	Mesa	AZ	85201	United States	Single-family	182479
14	1545 Wells Drive	Chicago	IL	60602	United States	Multi-family	1000777
15	1699 Ash Brook	Seattle	WA	98101	United States	Single-family	605392

Example #3: Demographic Data

We could use the same techniques again to generate synthetic demographic data. With Copilot’s assistance, we can write functions that randomly return typical feminine or masculine first names (forenames) and common last names (surnames) found in the United States.

# Write a function that generates a list of common feminine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_feminine():
    first_name_feminine = [
        "Alice", "Amanda", "Amy", "Angela", "Ann", "Anna", "Barbara", "Betty",
        "Brenda", "Carol", "Carolyn", "Catherine", "Christine", "Cynthia", "Deborah", "Debra",
        "Diane", "Donna", "Doris", "Dorothy", "Elizabeth", "Frances", "Gloria", "Heather",
        "Helen", "Janet", "Jennifer", "Jessica", "Joyce", "Julie", "Karen", "Kathleen", "Kimberly",
    ]

    return random.choice(first_name_feminine)

With the assistance of Copilot, we can also write functions that return demographic information, such as age, gender, race, marital status, religion, and political affiliation. Similar to the previous sales data example, we can influence the final synthetic dataset based on categorical distributions of different demographic categories, for instance, with the prompt:

# Write a function that returns a person's marital status.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 50% Married, 33% Single, 17% Unknown.
def get_marital_status(rnd_value):
    if rnd_value < 0.50:
        return "Married"
    elif rnd_value < 0.83:
        return "Single"
    else:
        return "Unknown"

By altering the categorical distributions, we can quickly alter the resulting synthetic dataset to reflect differing demographic characteristics: an older or younger population, the predominance of a single race, religious affiliation, or marital status, or the ratio of males to females.
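
To make swapping distributions easier, the thresholds could be moved out of the function body and passed in as data. The following is a minimal sketch under that assumption; `pick_category` is a hypothetical helper, `MARITAL_STATUS` reuses the percentages from the marital status prompt above, and `MARITAL_STATUS_YOUNGER` is an invented alternative.

```python
import random


def pick_category(distribution, rnd_value):
    """Return the category whose cumulative share first exceeds rnd_value.

    distribution is a list of (label, share) pairs whose shares sum to 1.
    """
    running_total = 0.0
    for label, share in distribution:
        running_total += share
        if rnd_value < running_total:
            return label
    # Fall back to the last label if rounding leaves a tiny gap below 1.0
    return distribution[-1][0]


# Distribution from the marital status prompt above
MARITAL_STATUS = [("Married", 0.50), ("Single", 0.33), ("Unknown", 0.17)]

# A hypothetical alternative describing a younger population
MARITAL_STATUS_YOUNGER = [("Married", 0.25), ("Single", 0.65), ("Unknown", 0.10)]


if __name__ == "__main__":
    rnd_value = random.random()
    print(pick_category(MARITAL_STATUS, rnd_value))
    print(pick_category(MARITAL_STATUS_YOUNGER, rnd_value))
```

Changing the character of the dataset then means editing a list of percentages rather than rewriting chains of `if` and `elif` statements.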

Next, we can use a Gaussian distribution (aka normal distribution) to return the year of birth in a bell-shaped curve, given a mean year and a standard deviation, using Python’s random.normalvariate function.

import random
from datetime import date, timedelta


# Write a function that generates a normal distribution of dates of birth
# with a mean year of 1975 and a standard deviation of 10.
# Return random date of birth as a string in the format YYYY-MM-DD.
def get_dob():
    day_of_year = random.randint(1, 365)
    year_of_birth = int(random.normalvariate(1975, 10))
    dob = date(year_of_birth, 1, 1) + timedelta(days=day_of_year - 1)
    return dob.strftime("%Y-%m-%d")
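
A quick way to confirm that the generated years follow the intended bell curve is to draw a large sample and inspect its summary statistics. This is a small sanity-check sketch, not part of the original application; `sample_birth_years` is a hypothetical helper that repeats the year calculation described above.

```python
import random
import statistics


def sample_birth_years(count, mean_year=1975, std_dev=10, rng=random):
    """Draw birth years from a normal distribution, truncated to integers."""
    return [int(rng.normalvariate(mean_year, std_dev)) for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the check is repeatable
    years = sample_birth_years(10_000, rng=rng)
    print(f"mean: {statistics.mean(years):.1f}")
    print(f"standard deviation: {statistics.stdev(years):.1f}")
    within_one_sd = sum(1965 <= y <= 1985 for y in years) / len(years)
    print(f"share within one standard deviation: {within_one_sd:.0%}")
```

Because `int()` truncates toward zero, the sample mean should sit slightly below 1975, and roughly two thirds of the years should fall within one standard deviation of the mean.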

Here is the complete program that creates synthetic demographic data with Copilot’s assistance:

# Purpose: Generate demographic data
# Author: Gary A. Stafford and GitHub Copilot
# Date: 2023-04-14
# Usage: python3 demographic_data_gen.py 100
# Command-line argument(s): rec_count (number of records to generate as an integer)

# Write an application that creates a file containing demographic data.
# The application should accept a command line argument that specifies the number of records to generate.
# The application should write the demographic data to a file called 'demographic_data.csv'.
# The application should contain the following functions:
# - main() function that calls the other functions
# - function that returns a random first name
# - function that returns a random last name
# - function that returns a random date of birth
# - function that returns a random gender
# - function that returns a random religious affiliation
# - function that returns a random race


import argparse
import csv
import os
import random
from datetime import date, timedelta


def main():
    parser = argparse.ArgumentParser(description="Generate demographic data")
    # nargs="?" makes the positional argument optional so the default applies
    parser.add_argument(
        "rec_count",
        type=int,
        nargs="?",
        default=100,
        help="The number of records to generate",
    )

    rec_count = parser.parse_args().rec_count
    write_data(rec_count)


# Write a function that generates a list of common feminine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_feminine():
    first_name_feminine = [
        "Alice", "Amanda", "Amy", "Angela", "Ann", "Anna", "Barbara", "Betty",
        "Brenda", "Carol", "Carolyn", "Catherine", "Christine", "Cynthia", "Deborah", "Debra",
        "Diane", "Donna", "Doris", "Dorothy", "Elizabeth", "Frances", "Gloria", "Heather",
        "Helen", "Janet", "Jennifer", "Jessica", "Joyce", "Julie", "Karen", "Kathleen", "Kimberly",
        "Laura", "Linda", "Lisa", "Margaret", "Maria", "Marie", "Martha", "Mary", "Melissa",
        "Michelle", "Nancy", "Pamela", "Patricia", "Rebecca", "Ruth", "Sandra", "Sarah", "Sharon",
        "Shirley", "Stephanie", "Susan", "Teresa", "Virginia",
    ]

    return random.choice(first_name_feminine)


# Write a function that generates a list of common masculine first names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random first name.
def get_first_name_masculine():
    first_names_masculine = [
        "Adams", "Alexander", "Allen", "Anderson", "Bailey", "Baker", "Barnes", "Bell",
        "Bennett", "Brooks", "Brown", "Bryant", "Butler", "Campbell", "Carter", "Clark",
        "Coleman", "Collins", "Cook", "Cooper", "Cox", "Davis", "Edwards", "Evans",
        "Flores", "Foster", "Garcia", "Gonzales", "Gonzalez", "Gray", "Green", "Griffin",
        "Hall", "Harris", "Henderson", "Hernandez", "Hill", "Howard", "Hughes", "Jackson",
        "James", "Jenkins", "Johnson", "Jones", "Kelly", "King", "Lee", "Lewis", "Long",
        "Lopez", "Martin", "Martinez", "Miller", "Mitchell", "Moore", "Morgan", "Morris",
        "Murphy", "Nelson", "Parker", "Patterson", "Perez", "Perry", "Peterson", "Phillips",
        "Powell", "Price", "Ramirez", "Reed", "Richardson", "Rivera", "Roberts", "Robinson",
        "Rodriguez", "Rogers", "Ross", "Russell", "Sanchez", "Sanders", "Scott", "Simmons",
        "Smith", "Stewart", "Taylor", "Thomas", "Thompson", "Torres", "Turner", "Walker",
        "Ward", "Washington", "Watson", "White", "Williams", "Wilson", "Wood", "Wright", "Young"
    ]

    return random.choice(first_names_masculine)


# Write a function that returns a person's gender.
# Return a random gender.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 53% Male, 40% Female, 6% Other, 1% Transgender
def get_gender(rnd_value):
    if rnd_value < 0.53:
        return "Male"
    elif rnd_value < 0.93:
        return "Female"
    elif rnd_value < 0.99:
        return "Other"
    else:
        return "Transgender"


# Write a function that returns a feminine or masculine first name.
# Return random first name.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 53% chance of being masculine, 40% chance of being feminine,
# with the remaining 7% split 4% masculine and 3% feminine
def get_first_name(rnd_value):
    if rnd_value < 0.53:
        return get_first_name_masculine()
    elif rnd_value < 0.93:
        return get_first_name_feminine()
    elif rnd_value < 0.97:
        return get_first_name_masculine()
    else:
        return get_first_name_feminine()


# Write a function that generates a list of common last names in the United States.
# List should be in alphabetical order.
# Each name should be unique.
# Return random last name.
def get_last_name():
    last_names = [
        "Adams", "Alexander", "Allen", "Anderson", "Bailey", "Baker", "Barnes", "Bell",
        "Bennett", "Brooks", "Brown", "Bryant", "Butler", "Campbell", "Carter", "Clark",
        "Coleman", "Collins", "Cook", "Cooper", "Cox", "Davis", "Edwards", "Evans", "Flores",
        "Foster", "Garcia", "Gonzales", "Gonzalez", "Gray", "Green", "Griffin", "Hall",
        "Harris", "Henderson", "Hernandez", "Hill", "Howard", "Hughes", "Jackson", "James",
        "Jenkins", "Johnson", "Jones", "Kelly", "King", "Lee", "Lewis", "Long", "Lopez",
        "Martin", "Martinez", "Miller", "Mitchell", "Moore", "Morgan", "Morris", "Murphy",
        "Nelson", "Parker", "Patterson", "Perez", "Perry", "Peterson", "Phillips", "Powell",
        "Price", "Ramirez", "Reed", "Richardson", "Rivera", "Roberts", "Robinson", "Rodriguez",
        "Rogers", "Ross", "Russell", "Sanchez", "Sanders", "Scott", "Simmons", "Smith", "Stewart",
        "Taylor", "Thomas", "Thompson", "Torres", "Turner", "Walker", "Ward", "Washington", "Watson",
        "White", "Williams", "Wilson", "Wood", "Wright", "Young",
    ]

    return random.choice(last_names)


# Write a function that returns a marital status.
# Return random marital status.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 50% Married, 33% Single, 17% Unknown
def get_marital_status(rnd_value):
    if rnd_value < 0.50:
        return "Married"
    elif rnd_value < 0.83:
        return "Single"
    else:
        return "Unknown"


# Write a function that returns a person's race.
# Return random race.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 58% White, 19% Hispanic, 12% Black, 6% Asian, 4% Multiracial
def get_race(rnd_value):
    if rnd_value < 0.58:
        return "White"
    elif rnd_value < 0.77:
        return "Hispanic"
    elif rnd_value < 0.89:
        return "Black"
    elif rnd_value < 0.95:
        return "Asian"
    else:
        return "Multiracial"


# Write a function that returns a person's religious affiliation.
# Return random religion.
# Accept a random value between 0 and 1 as an input parameter.
# The function must return one of the following values based on the %:
# 70% Christian, 20% Agnostic, 3% Atheist, 2% Jewish,
# 2% Other, 1% Muslim, 1% Hindu, 1% Buddhist
def get_religion(rnd_value):
    if rnd_value < 0.7:
        return "Christian"
    elif rnd_value < 0.9:
        return "Agnostic"
    elif rnd_value < 0.93:
        return "Atheist"
    elif rnd_value < 0.95:
        return "Jewish"
    elif rnd_value < 0.97:
        return "Other"
    elif rnd_value < 0.98:
        return "Muslim"
    elif rnd_value < 0.99:
        return "Hindu"
    else:
        return "Buddhist"


# Write a function that generates a normal distribution of ages.
# Using normalvariate() from the random module
# with a mean of 40 and a standard deviation of 10.
# Return random age as an integer.
def get_age():
    return int(random.normalvariate(40, 10))


# Write a function that generates a normal distribution of dates of birth
# with a mean year of 1975 and a standard deviation of 10.
# Return random date of birth as a string in the format YYYY-MM-DD
def get_dob():
    day_of_year = random.randint(1, 365)
    year_of_birth = int(random.normalvariate(1975, 10))
    dob = date(year_of_birth, 1, 1) + timedelta(days=day_of_year - 1)
    return dob.strftime("%Y-%m-%d")


# Create a function to write the demographic records to a csv file called 'demographic_data.csv'.
# Use an input parameter to specify the number of records to write.
# The csv file must have a header row and be comma delimited.
# String values must be enclosed in double quotes.
def write_data(rec_count):
    # create the output directory if it does not already exist
    os.makedirs("output", exist_ok=True)

    with open("output/demographic_data.csv", "w", newline="") as csv_file:
        csv_writer = csv.writer(
            csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
        )

        csv_writer.writerow(
            [
                "id",
                "first_name",
                "last_name",
                "dob",
                "gender",
                "marital_status",
                "race",
                "religion",
            ]
        )

        for record_id in range(1, rec_count + 1):
            # reuse one random value so the first name and gender stay aligned
            rnd_gender = random.random()

            first_name = get_first_name(rnd_gender)
            last_name = get_last_name()
            dob = get_dob()
            gender = get_gender(rnd_gender)
            marital_status = get_marital_status(random.random())
            race = get_race(random.random())
            religion = get_religion(random.random())

            csv_writer.writerow(
                [
                    record_id,
                    first_name,
                    last_name,
                    dob,
                    gender,
                    marital_status,
                    race,
                    religion,
                ]
            )


if __name__ == "__main__":
    main()

To make this example more realistic, you could use Copilot’s assistance to write algorithms capable of more accurately representing the nuanced associations and correlations between age, gender, race, marital status, religion, and political affiliation.
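
As a first step in that direction, one attribute can be made conditional on another, for example by choosing marital status from a distribution that varies with age. The sketch below assumes invented percentages purely for illustration, not census figures; `MARRIED_SHARE_BY_AGE` and `get_marital_status_for_age` are hypothetical names.

```python
import random

# Hypothetical share of "Married" by age band; placeholder numbers, not census data
MARRIED_SHARE_BY_AGE = [
    (25, 0.10),   # younger than 25
    (35, 0.45),   # 25 to 34
    (65, 0.60),   # 35 to 64
    (200, 0.50),  # 65 and older
]


def get_marital_status_for_age(age, rnd_value):
    """Return a marital status whose odds depend on the person's age."""
    for upper_age, married_share in MARRIED_SHARE_BY_AGE:
        if age < upper_age:
            return "Married" if rnd_value < married_share else "Single"
    # Ages outside every band are treated as unknown
    return "Unknown"


if __name__ == "__main__":
    # Same age distribution as the example application, floored at 18 for this demo
    age = max(18, int(random.normalvariate(40, 10)))
    print(age, get_marital_status_for_age(age, random.random()))
```

The same pattern extends to other pairs of attributes: look up a distribution keyed by one attribute, then sample the second attribute from it.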

Here is an example of the synthetic demographic data output by the example application:

user_id	first_name	last_name	dob	gender	marital_status	race	religion
1	Thomas	Powell	1967-06-10	Male	Married	Black	Christian
2	Ward	Williams	1973-07-22	Male	Single	Asian	Christian
3	Martha	Watson	1975-02-28	Female	Single	Hispanic	Agnostic
4	Brenda	Bailey	1979-07-07	Female	Married	Black	Christian
5	Parker	Johnson	1955-07-14	Male	Married	White	Christian
6	Rebecca	Wilson	1972-05-27	Female	Married	White	Christian
7	Doris	Allen	1956-07-09	Female	Married	Multiracial	Christian
8	Rebecca	Sanchez	1965-09-16	Female	Single	White	Christian
9	Mary	Johnson	1971-04-04	Female	Single	White	Christian
10	Anderson	Roberts	1983-11-02	Male	Single	Hispanic	Christian
11	Robinson	Peterson	1974-10-25	Male	Single	White	Jewish
12	Lopez	Ross	1985-04-30	Male	Married	White	Christian
13	Flores	Reed	1977-09-01	Male	Married	Black	Agnostic
14	Martin	Phillips	1960-03-24	Male	Single	White	Christian
15	Carter	Torres	1983-07-31	Male	Married	White	Christian

Generative AI Tools for Unit Testing

In addition to writing code and documentation, a common use of generative AI code assistants like Copilot is writing unit tests. For example, we can create unit tests for each function in our coffee shop sales data code generator, using the same method of prompting with code comments.

Copilot can also assist with writing unit tests
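
To give a sense of what such tests might look like, here is a minimal sketch using Python’s built-in `unittest` module. It targets the property type function from the address example rather than the coffee shop code, and it was written by hand for illustration, so it is not Copilot’s actual output; the class and test names are hypothetical.

```python
import unittest


def get_property_type(rnd_value):
    """Categorical distribution copied from the address example above."""
    if rnd_value < 0.63:
        return "Single-family"
    elif rnd_value < 0.89:
        return "Multi-family"
    elif rnd_value < 0.93:
        return "Condo"
    elif rnd_value < 0.96:
        return "Townhouse"
    elif rnd_value < 0.98:
        return "Mobile home"
    elif rnd_value < 0.99:
        return "Farm"
    else:
        return "Other"


class TestGetPropertyType(unittest.TestCase):
    def test_threshold_boundaries(self):
        # Each pair is (random value, expected property type)
        cases = [
            (0.0, "Single-family"),
            (0.62, "Single-family"),
            (0.63, "Multi-family"),
            (0.89, "Condo"),
            (0.93, "Townhouse"),
            (0.96, "Mobile home"),
            (0.98, "Farm"),
            (0.99, "Other"),
        ]
        for rnd_value, expected in cases:
            with self.subTest(rnd_value=rnd_value):
                self.assertEqual(get_property_type(rnd_value), expected)

    def test_always_returns_a_string(self):
        for step in range(100):
            self.assertIsInstance(get_property_type(step / 100), str)
```

Saved to a file such as `test_property_type.py` (a hypothetical filename), the tests can be run with `python3 -m unittest test_property_type`.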

Conclusion

In this post, we learned how Generative AI could assist us in creating synthetic data for software development, analytics, and machine learning. The examples herein generated data using simple techniques. Using advanced modeling techniques, we could generate increasingly complex, realistic synthetic data.

To learn about other ways Generative AI can be used to assist in writing code, please read my previous article, Ten Ways to Leverage Generative AI for Development on AWS.

🔔 To keep up with future content, follow Gary Stafford on LinkedIn.

This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
