Lakehouse Data Modeling using dbt, Amazon Redshift, Redshift Spectrum, and AWS Glue

Learn how dbt makes it easy to transform data and materialize models in a modern cloud data lakehouse built on AWS

Introduction

Data lakes have grabbed much of the analytics community’s attention in recent years, thanks to an overabundance of VC-backed analytics startups and marketing dollars. Nonetheless, data warehouses, specifically modern cloud data warehouses, continue to gain market share, led by Snowflake, Amazon Redshift, Google Cloud BigQuery, and Microsoft’s Azure Synapse Analytics.

Several factors have fostered the renewed interest and appeal of data warehouses, including the data lakehouse architecture. According to Databricks, “a lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse directly on top of low-cost cloud storage in open formats.” Similarly, Snowflake describes a lakehouse as “a data solution concept that combines elements of the data warehouse with those of the data lake. Data lakehouses implement data warehouses’ data structures and management features for data lakes, which are typically more cost-effective for data storage.”

dbt

In the following post, we will explore the use of dbt (data build tool), developed by dbt Labs, to transform data in an AWS-based data lakehouse, built with Amazon Redshift, Redshift Spectrum, AWS Glue, and Amazon S3. According to dbt Labs, “dbt enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.” Further, “dbt does the T in ELT (Extract, Load, Transform) processes — it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.”

This post’s project, displayed in dbt Cloud

Amazon Redshift

According to AWS, “Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.” AWS claims Amazon Redshift is the most widely used cloud data warehouse.

Amazon Redshift Spectrum

According to AWS, “Redshift Spectrum allows you to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.” Redshift Spectrum tables define the data structure for the files in Amazon S3. The external tables exist in an external data catalog, which can be AWS Glue, the data catalog that comes with Amazon Athena, or an Apache Hive metastore.

dbt can interact with Amazon Redshift Spectrum to create external tables, refresh external table partitions, and access raw data in an Amazon S3-based data lake from the data warehouse. We will use dbt along with the dbt package, dbt_external_tables, to create the external tables in an AWS Glue data catalog.

Prerequisites

Prerequisites to follow along with this post’s demonstration include:

  • Amazon S3 bucket to store raw data;
  • Amazon Redshift or Amazon Redshift Serverless cluster;
  • AWS IAM Role with permissions to Amazon Redshift, Amazon S3, and AWS Glue;
  • dbt Cloud account;
  • dbt CLI (dbt Core) and dbt Amazon Redshift adapter installed locally;
  • Microsoft Visual Studio Code (VS Code) with dbt extensions installed.

The post’s demonstration uses dbt Cloud, VS Code, and the dbt CLI interchangeably with the project’s GitHub repository as a source. Follow along with the demonstration using any or all of these three dbt options.

Example of this project in VS Code with dbt extensions installed

Cost Warning!

Be careful when creating a new, provisioned Amazon Redshift cluster for this demonstration. The suggested default Production cluster with two ra3.4xlarge on-demand compute nodes and AQUA (Redshift’s Advanced Query Accelerator) enabled is estimated at $4,694/month ($3.26/node/hour). For this demonstration, choose the minimum size provisioned Redshift cluster configuration of one dc2.large on-demand compute node, estimated to cost $180/month ($0.25/node/hour). Be sure to delete the cluster when the demonstration is complete.

Creating a new Amazon Redshift cluster

Amazon Redshift Serverless Option

AWS recently announced the general availability (GA) of Amazon Redshift Serverless on July 12, 2022. Amazon Redshift Serverless allows data analysts, developers, and data scientists to run and scale analytics without having to provision and manage data warehouse clusters. dbt is fully compatible with Amazon Redshift Serverless and is an alternative to provisioned Redshift for this demonstration. According to AWS, Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access data in open file formats in Amazon S3.

Source Code

All the source code demonstrated in this post is open source and available on GitHub.

Sample Data

This demonstration uses the TICKIT sample database provided by AWS and designed for use with Amazon Redshift. This sample database application tracks sales activity for the fictional online TICKIT website, where users buy and sell tickets for sporting events, shows, and concerts. The database consists of seven tables in a star schema: five dimension tables and two fact tables. A clean copy of the raw TICKIT data, formatted as pipe-delimited text files, is included in this GitHub project. Use the following shell commands to copy the raw data to Amazon S3:

# Purpose: Unzip raw data files and copy to s3
# Author: Gary A. Stafford
# Date: 2022-12-29
# sh ./raw_date_to_s3.sh

# ** REPLACE ME! **
s3_bucket="<your_s3_bucket_name>"

pushd raw_tickit_data/
unzip tickit_data.zip
popd

declare -a TableArray=("category" "date" "event" "listing" "sale" "user" "venue")

for table in "${TableArray[@]}"; do
    aws s3 cp "./raw_tickit_data/${table}.txt" "s3://${s3_bucket}/raw_tickit_data/${table}/"
done

Prepare Amazon Redshift for dbt

Create New Database

Create a new Redshift database to use for the demonstration, demo.

-- create new database
create database demo with owner admin;

Create Database Schemas

Within the new Redshift database, demo, create the external schema, tickit_external, and the corresponding external AWS Glue Data Catalog, tickit_dbt, using the CREATE EXTERNAL SCHEMA Redshift SQL command. Make sure to update the command to reflect your IAM Role’s ARN. Next, create the schema that will hold our dbt models, tickit_dbt. Lastly, as a security best practice, drop the default public schema.

-- switch redshift database connection to new demo database before continuing!

-- create external tables schema and new glue data catalog
create external schema tickit_external
from data catalog
database 'tickit_dbt'
iam_role 'arn:aws:iam::<your_aws_account_id>:role/ClusterPermissionsRole'
create external database if not exists;

create schema tickit_dbt;

drop schema public;
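
To confirm the external schema is registered and mapped to the new Glue database, we can query Redshift’s SVV_EXTERNAL_SCHEMAS system view. A quick sanity check might look like the following; the exact output will vary by cluster.

-- confirm the external schema is mapped to the glue data catalog
select schemaname, databasename, esoptions
from svv_external_schemas
where schemaname = 'tickit_external';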

From the AWS Glue console, we should observe a new tickit_dbt AWS Glue Data Catalog. The description shown below was manually added after the catalog was created.

Newly created AWS Glue Data Catalog

Create dbt Database User and Group

As a security best practice, create a separate dbt database user and dbt group. We are assigning a completely arbitrary connection limit of ten. Then, apply the grants to allow the dbt group access to the new database and schemas. Lastly, change the ownership of the two schemas to the dbt user.

-- create dbt user and group
create user dbt with password 'CHANGE_ME!'
nocreatedb nocreateuser syslog access restricted
connection limit 10;

-- create dbt group
create group dbt with user dbt;

-- grants on tickit_external schema
grant usage on schema tickit_external to group dbt;
grant create on schema tickit_external to group dbt;
grant all on all tables in schema tickit_external to group dbt;

-- grants on tickit_dbt schema
grant usage on schema tickit_dbt to group dbt;
grant create on schema tickit_dbt to group dbt;
grant all on all tables in schema tickit_dbt to group dbt;

-- reassign schema ownership to dbt
alter schema tickit_dbt owner to dbt;
alter schema tickit_external owner to dbt;

Alternatively, we could use an IAM Role with a SAML 2.0-compliant IdP.

Initialize and Configure dbt for Redshift

Next, configure your dbt Cloud account and dbt locally with your Amazon Redshift connection information using the dbt init command. On a Mac, this configuration is stored in the /Users/<your_username>/.dbt/profiles.yml file. You will need your Redshift cluster host URL, port, database, username, and password. With your local install of dbt, we can use the dbt debug command to confirm the new configuration.

Confirming configuration using dbt debug command
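
For reference, a minimal profiles.yml entry for this project might resemble the sketch below. The profile name, placeholder host, and use of an environment variable for the password are assumptions; adjust them to match your own cluster and dbt_project.yml.

# ~/.dbt/profiles.yml (sketch; values shown are placeholders)
tickit_demo:
  target: dev
  outputs:
    dev:
      type: redshift
      host: <your_redshift_cluster_endpoint>
      port: 5439
      dbname: demo
      schema: tickit_dbt
      user: dbt
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      threads: 10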

Project Structure

The GitHub project structure follows many of the best practices outlined in dbt Labs’ Best Practice Guide. Data models in the models directory are organized into the recommended staging, intermediate, and marts subdirectories (aka layers).

Project structure for data models

From a data lineage perspective, in this project, the staging layer’s data models depend on the external tables (AWS Glue/Amazon Redshift Spectrum). The intermediate layer’s data models depend on the staging models. The marts layer’s data models depend on staging and intermediate models.

dbt Cloud’s Lineage Graph showing an example of data model relationships
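
Materializations can also be configured per layer in dbt_project.yml rather than in each model file. The snippet below is only a sketch of that approach, assuming a project name of dbt_redshift_demo; this project instead sets materialized and bind in each model’s config() block, as shown later in the post.

# dbt_project.yml (sketch; the project name is an assumption)
models:
  dbt_redshift_demo:
    staging:
      +materialized: view
      +bind: false
    intermediate:
      +materialized: view
      +bind: false
    marts:
      +materialized: table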

Install dbt Packages

The GitHub project’s packages.yml contains a few commonly recommended packages. The only one required for this post is the dbt-labs/dbt_external_tables package. Make sure your project is referring to the latest version of the package.

# updated 2022-12-29 (dbt=1.3.1 locally and dbt=1.3.1 in dbt cloud)
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.2
  - package: dbt-labs/codegen
    version: 0.9.0
  - package: dbt-labs/dbt_utils
    version: 1.0.0
  - package: dbt-labs/redshift
    version: 0.8.0

Use the dbt deps command to install the packages locally.

Installing dbt packages

External Tables

The _tickit__sources.yml file, in the models/staging/tickit/external_tables/ subdirectory, defines the schema and S3 location for each of the seven external TICKIT database tables: category, date, event, listing, sale, user, and venue. You will need to update this file to reflect the name of your Amazon S3 bucket, in seven places.

version: 2

sources:
  - name: tickit_external
    description: Sales activity for the fictional TICKIT web site, where users buy and sell tickets online for sporting events, shows, and concerts.
    database: demo
    schema: tickit_external
    loader: s3
    tables:
      - name: category
        description: dimension table – TICKIT categories
        external:
          location: "s3://<your_s3_bucket_name>/raw_tickit_data/category/"
          row_format: >
            serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
            with serdeproperties (
              'separatorChar'='|'
            )
          table_properties: "('skip.header.line.count'='1')"
        columns:
          - name: catid
            data_type: int
            description: primary key
            tests:
              - unique
              - not_null
          - name: catgroup
            data_type: varchar(20)
          - name: catname
            data_type: varchar(20)
          - name: catdesc
            data_type: varchar(50)

Execute the command, dbt run-operation stage_external_sources, to create the seven external tables in the AWS Glue Data Catalog. This command is part of the dbt_external_tables package we installed earlier. It iterates through all source nodes, creates the tables if missing, and refreshes metadata.

Creating external tables in AWS Glue Data Catalog

If we failed to run the previous SQL statements to set schema ownership to the dbt user, the following error will likely occur.

Typical error resulting from incorrect ownership of the external_table schema

Once the command completes, we should observe seven new tables in the AWS Glue Data Catalog.

Seven new tables in the AWS Glue Data Catalog
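
We can also verify the external tables from the Redshift side by querying the SVV_EXTERNAL_TABLES system view, which should return one row per TICKIT table along with its Amazon S3 location.

-- list the external tables registered in the glue data catalog
select tablename, location
from svv_external_tables
where schemaname = 'tickit_external'
order by tablename;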

Examining one of the AWS Glue data catalog tables, we can observe how the configuration in the _tickit__sources.yml file was used to define the table’s properties and schema. Note the Location field indicates where the underlying data is located in our Amazon S3 bucket.

The ‘category’ table in the AWS Glue Data Catalog
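
Alternatively, the same metadata can be retrieved with the AWS CLI. For example, the following command should return the S3 location of the category table; the output shape may vary slightly by CLI version.

aws glue get-table \
  --database-name tickit_dbt \
  --name category \
  --query 'Table.StorageDescriptor.Location'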

Staging Layer

In their best practices guide, dbt describes the staging layer in the following manner: “you can think of the staging layer as condensing and refining this material into the individual atoms we’ll later build more intricate and useful structures with.” The staging data models are the base tables and views we will use to build more complex aggregations and analytics queries in Redshift. The schema.yml file, also in the models/staging/tickit/ subdirectory, defines seven late-binding views, modeled by dbt, to be created in Amazon Redshift.

version: 2

models:
  - name: stg_tickit__categories
    description: Late binding view of external dimension table – TICKIT category
  - name: stg_tickit__dates
    description: Late binding view of external dimension table – TICKIT date
  - name: stg_tickit__events
    description: Late binding view of external dimension table – TICKIT event
  - name: stg_tickit__listings
    description: Late binding view of external fact table – TICKIT listing
  - name: stg_tickit__sales
    description: Late binding view of external fact table – TICKIT sale
  - name: stg_tickit__users
    description: Late binding view of external dimension table – TICKIT user
  - name: stg_tickit__venues
    description: Late binding view of external dimension table – TICKIT venue

The staging models’ SQL statements also follow many of dbt’s best practices. Below, we see an example of the stg_tickit__sales model (stg_tickit__sales.sql). This model performs a SELECT from the external sale table in the tickit_external schema. The model performs column renaming and basic calculations.

{{ config(materialized='view', bind=False) }}

with source as (
    select * from {{ source('tickit_external', 'sale') }}
),

renamed as (
    select
        saleid as sale_id,
        listid as list_id,
        sellerid as seller_id,
        buyerid as buyer_id,
        eventid as event_id,
        dateid as date_id,
        qtysold as qty_sold,
        round(pricepaid / qtysold, 2) as ticket_price,
        pricepaid as price_paid,
        round((commission / pricepaid) * 100, 2) as commission_prcnt,
        commission,
        (pricepaid - commission) as earnings,
        saletime as sale_time
    from
        source
    where
        sale_id IS NOT NULL
    order by
        sale_id
)

select * from renamed

The dbt run command, according to dbt, “executes compiled SQL model files against the current target database. dbt connects to the target database and runs the relevant SQL required to materialize all data models using the specified materialization strategies.” Instead of using the dbt run command to create all the project’s tables and views at once, for now, we are limiting the command to just the models in the ./models/staging/tickit/ directory using the --select optional argument. Execute the dbt run --select staging command to materialize the seven corresponding staging views in Amazon Redshift.

The Staging layer’s data models successfully materialized in Redshift

Once the command completes, we should observe seven new views in the Amazon Redshift demo database’s tickit_dbt schema, each with the stg_ prefix.

Amazon Redshift Query Editor v2 Console

Selecting from any of the views should return data.

Amazon Redshift Query Editor v2 Console
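
For example, a simple query against the stg_tickit__sales view should return the renamed columns and the calculated ticket_price, commission_prcnt, and earnings values.

-- spot-check one of the staging views
select sale_id, qty_sold, ticket_price, price_paid, commission_prcnt, earnings, sale_time
from tickit_dbt.stg_tickit__sales
limit 10;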

Late Binding Views

This demonstration uses late binding views for staging and intermediate layer models. According to dbt, “using late-binding views in a production deployment of dbt can vastly improve the availability of data in the warehouse, especially for models that are materialized as late-binding views and are queried by end-users, since they won’t be dropped when upstream models are updated. Additionally, late binding views can be used with external tables via Redshift Spectrum.”

Alternatively, we could define the seven staging models as tables instead of late binding views. Once created as tables, the dependent intermediate and marts views will not require a late-binding reference, as in this project.
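
For context, when a model is configured with materialized='view' and bind=False, the dbt Redshift adapter adds Redshift’s WITH NO SCHEMA BINDING clause to the generated DDL. Below is a simplified sketch of the kind of view definition that results (column list abbreviated for brevity):

-- simplified sketch of the DDL generated for a late binding view
create view tickit_dbt.stg_tickit__sales as
select
    saleid as sale_id,
    pricepaid as price_paid,
    commission,
    saletime as sale_time
from tickit_external.sale
with no schema binding;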

Intermediate Layer

In their best practices guide, dbt describes the intermediate layer as “purpose-built transformation steps.” Further, “the best guiding principle is to think about verbs (e.g. pivoted, aggregated_to_user, joined, fanned_out_by_quantity, funnel_created, etc.) in the intermediate layer.”

The project’s intermediate layer consists of two models related to users. The sample TICKIT database lumps all users into a single table. However, for analytics purposes, marketing teams might be interested in different user personas, such as buyers, sellers, sellers who also buy, and non-buyers (users who have never purchased tickets). The two models in the project’s intermediate layer filter for buyers and for sellers, resulting in two separate views of user personas.

{{ config(materialized='view', bind=False) }}

with sales as (
    select * from {{ ref('stg_tickit__sales') }}
),

users as (
    select * from {{ ref('stg_tickit__users') }}
),

first_purchase as (
    select min(date(sale_time)) as first_purchase_date, buyer_id
    from sales
    group by buyer_id
),

final as (
    select distinct
        u.user_id,
        u.username,
        cast((u.last_name || ', ' || u.first_name) as varchar(100)) as full_name,
        f.first_purchase_date,
        u.city,
        u.state,
        u.email,
        u.phone,
        u.like_broadway,
        u.like_classical,
        u.like_concerts,
        u.like_jazz,
        u.like_musicals,
        u.like_opera,
        u.like_rock,
        u.like_sports,
        u.like_theatre,
        u.like_vegas
    from
        sales as s
        join users as u on u.user_id = s.buyer_id
        join first_purchase as f on f.buyer_id = s.buyer_id
    order by
        user_id
)

select * from final

To materialize the intermediate layer’s two data models into views, execute the command, dbt run --select intermediate.

The Intermediate layer’s data models successfully materialized in Redshift

Once the command completes, we should observe a total of nine views in the Amazon Redshift demo database’s tickit_dbt schema: seven staging views and two intermediate views, the latter identified with the int_ prefix.

Amazon Redshift Query Editor v2 Console

Marts Layer

In their best practices guide, dbt describes the marts layer as “business defined entities.” Further, “this is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose. We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain.”

The project’s marts layer consists of four data models across marketing and sales. The models are materialized as two dimension tables and two fact tables. Although it is common practice to describe and label these as traditional star schema dimension (dim_) or fact (fct_) tables, in reality, the fact tables in this demonstration are actually flat, de-normalized, wide tables. Wide tables generally have better analytics performance in a modern data warehouse, according to Fivetran and others.

{{ config(materialized='table', sort='sale_id', dist='sale_id') }}

with categories as (
    select * from {{ ref('stg_tickit__categories') }}
),

dates as (
    select * from {{ ref('stg_tickit__dates') }}
),

events as (
    select * from {{ ref('stg_tickit__events') }}
),

listings as (
    select * from {{ ref('stg_tickit__listings') }}
),

sales as (
    select * from {{ ref('stg_tickit__sales') }}
),

sellers as (
    select * from {{ ref('int_sellers_extracted_from_users') }}
),

buyers as (
    select * from {{ ref('int_buyers_extracted_from_users') }}
),

event_categories as (
    select
        e.event_id,
        e.event_name,
        c.cat_group,
        c.cat_name
    from events as e
    join categories as c on c.cat_id = e.cat_id
),

final as (
    select
        s.sale_id,
        s.sale_time,
        d.qtr,
        ec.cat_group,
        ec.cat_name,
        ec.event_name,
        b.username as buyer_username,
        b.full_name as buyer_name,
        b.state as buyer_state,
        b.first_purchase_date as buyer_first_purchase_date,
        se.username as seller_username,
        se.full_name as seller_name,
        se.state as seller_state,
        se.first_sale_date as seller_first_sale_date,
        s.ticket_price,
        s.qty_sold,
        s.price_paid,
        s.commission_prcnt,
        s.commission,
        s.earnings
    from
        sales as s
        join listings as l on l.list_id = s.list_id
        join buyers as b on b.user_id = s.buyer_id
        join sellers as se on se.user_id = s.seller_id
        join event_categories as ec on ec.event_id = s.event_id
        join dates as d on d.date_id = s.date_id
    order by
        sale_id
)

select * from final

The marts layer’s models take various dependencies through joins on staging and intermediate models. The data model above, fct_sales, has dependencies on multiple staging and intermediate models.

Lineage graph (dependency graph) of a data model, displayed in dbt Cloud

To materialize the marts layer’s four data models into tables, execute the command, dbt run --select marts.

The Marts layer’s data models successfully materialized in Redshift as tables

Once the command completes, we should observe four tables and nine views in the Redshift demo database’s tickit_dbt schema. Note how the dbt model for fct_sales (shown above), with its Jinja templating and multiple CTEs, has been compiled into the resulting table in Redshift. This is the real magic of dbt!

Amazon Redshift Query Editor v2 Console showing fct_sales table

At this point, all of the project’s models have been compiled and created in the Redshift demo database by dbt.

Analyses

The demonstration’s project also contains example analyses. dbt’s analyses functionality allows us to version control more analysis-oriented SQL files within our dbt project. These analyses do not fit the fairly opinionated dbt model definition. We can compile the analyses SQL files using the dbt compile command, then copy and paste the resulting SQL statements from the target/compiled/ subdirectory into our data warehouse’s query tool of choice.

Analysis shown in dbt Cloud interface
Compiled analysis SQL executed in Amazon Redshift
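
As an illustration, a hypothetical analysis file such as analyses/total_sales_by_category.sql (not part of the project) could aggregate the fct_sales model by event category before being compiled with dbt compile.

-- hypothetical analysis: total sales and earnings by event category
select
    cat_group,
    cat_name,
    sum(qty_sold) as total_tickets_sold,
    sum(price_paid) as total_sales,
    sum(earnings) as total_earnings
from {{ ref('fct_sales') }}
group by cat_group, cat_name
order by total_sales desc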

Project Documentation

The dbt docs generate command automatically generates the project’s documentation website from the SQL and YAML files. Documentation can be generated and displayed from your dbt Cloud account or hosted locally.

Project’s documentation website

Testing

According to dbt, “Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds, and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.” The project contains over 50 tests, split between the _tickit__sources.yml file and individual tests in the test/ directory. Typical dbt tests check for non-null and unique values, values within an expected numeric range, and values from a known list of strings. Any SELECT statement written in SQL can be tested.

- name: user
  description: dimension table – TICKIT users
  columns:
    - name: userid
      data_type: int
      description: primary key
      tests:
        - unique
        - not_null
    - name: username
      data_type: char(8)
      tests:
        - unique
        - not_null
    - name: firstname
      data_type: varchar(30)
      tests:
        - not_null
Snippet of tests in the _tickit__sources.yml file
-- all prices paid for tickets should be a positive value of $1 or greater
-- there are no credits (negative values)
select price_paid
from {{ ref('stg_tickit__sales') }}
group by price_paid
having not (price_paid >= 1)
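
Beyond unique and not_null, generic tests can also assert that values fall within an expected numeric range or come from a known list of strings. The snippet below is only a sketch of both patterns, using the dbt_utils package installed earlier; the category group values shown are assumptions for illustration.

- name: qtysold
  tests:
    - dbt_utils.accepted_range:
        min_value: 1
- name: catgroup
  tests:
    - accepted_values:
        values: ['Sports', 'Concerts', 'Shows']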

Execute the project’s tests using the dbt test command. We can execute individual tests using the --select optional argument, for example, dbt test --select assert_all_sale_amounts_are_positive. We can also use the --threads optional argument with most dbt commands, including dbt test, increasing parallelism and reducing execution time. The example below uses 10 threads, the arbitrary maximum configured for the Amazon Redshift dbt user.

Running dbt tests
Successful test run

Jobs

According to dbt, “Jobs are a set of dbt commands that you want to run on a schedule,” for example, dbt run and dbt test. Jobs can load packages, run tests, materialize models, check source freshness (dbt source freshness), and regenerate documentation. Below, we have created a daily job to test, refresh, and document our project as the data is updated in the data lake.

dbt Cloud’s Job Run Overview
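
The exact commands are configurable per job, but a daily job for a project like this might execute a sequence along these lines (the ordering shown is an assumption, not the project’s actual job definition):

dbt deps
dbt source freshness
dbt run
dbt test
dbt docs generate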

Notifications

According to dbt, “Setting up notifications in dbt Cloud will allow you to receive alerts via Email or a chosen Slack channel when a job run succeeds, fails, or is canceled.”

dbt Cloud’s Notifications interface for configuring email and Slack

The Slack notifications include run status, timings, and a link to open the job in dbt Cloud. Below, we see a notification regarding our project’s daily job run.

Notification of successful job run in Slack from dbt Cloud

Exposures

Exposures are a recent addition to dbt. Exposures make it possible to define and describe a downstream use of our dbt project, such as in a dashboard, application, or data science pipeline. Below we see an example of an exposure describing a sales dashboard created in Amazon QuickSight.

version: 2

exposures:
  - name: tickit_sales_summary
    type: dashboard
    maturity: medium
    url: https://us-east-1.quicksight.aws.amazon.com
    description: >
      TICKIT sales summary dashboard, authored in Amazon QuickSight
    depends_on:
      - ref('fct_sales')
      - ref('fct_listings')
    owner:
      name: Gary A. Stafford
      email: gary.a.stafford@gmail.com

The exposure YAML file shown above describes the Amazon QuickSight dashboard shown below.

Amazon QuickSight dashboard

Exposures work with dbt’s auto-documentation feature. dbt populates a dedicated page in the auto-generated documentation site with context relevant to data consumers.

Project’s documentation website, showing dashboard exposure
Project’s documentation website, showing dashboard exposure’s lineage graph

Conclusion

In this post, we covered some of the basic functionality of dbt. We learned how dbt enables analysts to work more like software engineers. We also learned how dbt makes it easy to codify data models in SQL, to version control and manage data models as code with git, and collaborate on data models with other data team members.

Topics not explored in this post but critical to most large-scale dbt-managed production environments include advanced Jinja templating and macros, model freshness, orchestration, job scheduling, Continuous Integration and GitOps, notifications, environment variables, and incremental models. We will explore these additional dbt capabilities in future posts.


This blog represents my viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners. All diagrams and illustrations are the property of the author unless otherwise noted.
