Using Ahana Cloud for Presto to perform analytics on AWS using both Apache Hive and AWS Glue as metastores
The following series of five videos is an extended version of the demonstration featured in the October 2021 webinar, Build an Open Data Lake on AWS with Presto. An on-demand copy of the live webinar, featuring Dipti Borkar (Ahana Co-Founder and CPO) and me, is available on Ahana.io.
In the demonstration, we build a data lake on AWS using a combination of Ahana Cloud for Presto, Apache Hive, Apache Superset, Amazon S3, AWS Glue, and Amazon Athena. We then analyze the data in Apache Superset using Ahana Cloud for Presto.
The demonstration is divided into five YouTube videos, which are also available as a playlist.
All source code for this post and the previous posts in this series is open source and located on GitHub. In the webinar and the videos, the Apache Hive and AWS Glue data catalog tables contain a _presto suffix. For clarity, in the source code, I have changed the suffixes to indicate the metastore each table is associated with, for example _glue, since either set of tables can be queried with Presto. Additionally, in the webinar and the videos, the raw data files were uploaded to Amazon S3 in uncompressed CSV format; this is unnecessary. The CTAS SQL statements both expect GZIP-compressed CSV files. To save time and cost, upload the compressed files, as they are, to Amazon S3.
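Before uploading, it can be worth confirming that the local files really are GZIP-compressed, since the SQL statements expect that format. The following is a minimal sketch using only the Python standard library, not part of the demo's source code. The helper names is_gzip and preview are hypothetical, and it assumes the files decode as UTF-8 text and sit in the current working directory.

```python
import gzip
from itertools import islice
from pathlib import Path
from typing import List

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every GZIP stream


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the GZIP magic number."""
    with path.open("rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def preview(path: Path, max_lines: int = 3) -> List[str]:
    """Return the first few text lines, decompressing only as much as needed.

    Assumption: the content is UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace", newline="") as fh:
        return [line.rstrip("\r\n") for line in islice(fh, max_lines)]


if __name__ == "__main__":
    # File names taken from the list of demo files; adjust the paths as needed.
    for name in ("moma_public_artists.txt.gz", "moma_public_artworks.txt.gz"):
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found, skipped")
            continue
        if not is_gzip(path):
            print(f"{name}: NOT gzip-compressed")
            continue
        print(f"{name}: gzip OK")
        for line in preview(path):
            print("   ", line)
```

If a file fails the check, it was probably decompressed at some point; recompressing it before upload keeps it consistent with what the SQL statements expect.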
The following files are used in the demonstration:
README.md: Instructions for demo
ahana_demo_glue_artists.sql: AWS Glue SQL statements
ahana_demo_glue_artworks.sql: AWS Glue SQL statements
ahana_demo_hive.sql: Apache Hive SQL statements
joins.sql: Simple SQL join statement
superset_charts.sql: SQL statements for Superset charts
moma_public_artists.txt.gz: Compressed raw artists data
moma_public_artworks.txt.gz: Compressed raw artworks data
This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.