Building a Data Lake on AWS

Data Lake Demonstration
Building a Simple Data Lake on AWS
Gary A. Stafford

Twitter/LinkedIn
GaryStafford
Blog
garystafford.medium.com

Agenda
What is a Data Lake?
Dataset
Source Code
Architecture
Demonstration

What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.”
- Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS

Dataset
TICKIT database
E-commerce platform
Brings together buyers and sellers of tickets to entertainment events
Designed for demonstrating Amazon Redshift
Small database consists of seven tables: two fact tables and five dimensions
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
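To make the fact/dimension split concrete, here is a minimal sketch of a query that joins one fact table (sales) to one dimension (event). It runs against an in-memory SQLite database rather than Redshift or Athena. The column names (eventid, eventname, pricepaid, qtysold) follow the TICKIT schema as documented by AWS; the rows and event names are made-up placeholders, not data from the demo.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for two TICKIT tables.

    The rows are invented placeholders for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE event (eventid INTEGER PRIMARY KEY, eventname TEXT);
        CREATE TABLE sales (salesid INTEGER PRIMARY KEY, eventid INTEGER,
                            qtysold INTEGER, pricepaid REAL);
        INSERT INTO event VALUES (1, 'Placeholder Show A'),
                                 (2, 'Placeholder Show B');
        INSERT INTO sales VALUES (1, 1, 2, 100.0),
                                 (2, 1, 1, 50.0),
                                 (3, 2, 4, 400.0);
        """
    )
    return conn


def top_events_by_revenue(conn, limit=3):
    """Join the sales fact table to the event dimension and total revenue."""
    query = """
        SELECT e.eventname, SUM(s.pricepaid) AS revenue
        FROM sales AS s
        JOIN event AS e ON e.eventid = s.eventid
        GROUP BY e.eventname
        ORDER BY revenue DESC, e.eventname
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


conn = build_demo_db()
rows = top_events_by_revenue(conn)
```

The same join-and-aggregate shape is what a query engine would run once the tables are cataloged; only the connection differs.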

Dataset
Table    | Simulated Data Source                  | Demo Data Source
Category | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Event    | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Venue    | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Listing  | Ecommerce Platform                     | Amazon RDS for MySQL
Sales    | Ecommerce Platform                     | Amazon RDS for MySQL
Date     | Ecommerce Platform                     | Amazon RDS for MySQL
Users    | Customer Relationship Management (CRM) | Microsoft SQL Server
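The table-to-source mapping above can be captured as plain data, which is handy when deciding how many database connections and ingestion jobs are needed. This is only a sketch: the tuple layout and the function name tables_by_engine are illustrative choices, not taken from the demo repository.

```python
from collections import defaultdict

# (table, simulated data source, demo data source), copied from the slide.
TABLE_SOURCES = [
    ("category", "Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    ("event", "Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    ("venue", "Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    ("listing", "Ecommerce Platform", "Amazon RDS for MySQL"),
    ("sales", "Ecommerce Platform", "Amazon RDS for MySQL"),
    ("date", "Ecommerce Platform", "Amazon RDS for MySQL"),
    ("users", "Customer Relationship Management (CRM)", "Microsoft SQL Server"),
]


def tables_by_engine(rows):
    """Group table names by the database engine that hosts them in the demo."""
    grouped = defaultdict(list)
    for table, _simulated_source, engine in rows:
        grouped[engine].append(table)
    return dict(grouped)


grouped = tables_by_engine(TABLE_SOURCES)
```

Grouping this way shows that the seven tables come from three separate database engines, so each engine needs its own connection.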

Source Code
github.com/garystafford/tickit-data-lake-demo

Architecture: AWS Services Used
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with Kafka Connect or DMS)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon S3
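As a rough illustration of how two of these services are driven programmatically, the sketch below builds the request parameters for a Glue crawler over a JDBC source and for an Athena query. The parameter keys match the boto3 `create_crawler` and `start_query_execution` calls; every value (crawler name, role ARN, connection name, database name, JDBC path, bucket) is a hypothetical placeholder and not taken from the demo. The code only builds dictionaries, so it runs without AWS credentials.

```python
def glue_crawler_request(name, role_arn, catalog_db, connection_name, jdbc_path):
    """Build keyword arguments for the Glue create_crawler call.

    The crawler reads table metadata over a Glue JDBC connection and
    writes it to the given Glue Data Catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_db,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": jdbc_path}
            ]
        },
    }


def athena_query_request(sql, catalog_db, output_location):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": catalog_db},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# All values below are hypothetical placeholders.
crawler = glue_crawler_request(
    name="tickit-mysql-crawler",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueRole",
    catalog_db="tickit_demo",
    connection_name="tickit-mysql",
    jdbc_path="tickit/%",
)
query = athena_query_request(
    sql="SELECT COUNT(*) FROM sales",
    catalog_db="tickit_demo",
    output_location="s3://example-bucket/athena-results/",
)
```

With boto3 installed and credentials configured, these dictionaries could be passed as `boto3.client("glue").create_crawler(**crawler)` and `boto3.client("athena").start_query_execution(**query)`.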

Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): handling changes to systems of record
Transactional Storage Layer: Apache Hudi, Apache Iceberg, Delta Lake
Streaming Data: Spark Structured Streaming, Kinesis, Flink
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data as it flows from sources to consumption

Architecture: Out of Scope (but critically important)
Data Inspection: Scanning incoming data for sensitive info such as PII
DevOps/DataOps: Automating testing, deployment, job execution
Data Warehouse / Lake House Architecture
Data Lake Storage Tiering
