Building Data Lakes with Apache Airflow

Data Lake Demonstration
Building Data Lakes with Apache Airflow
Gary A. Stafford

Twitter/LinkedIn
GaryStafford
Blog
garystafford.medium.com

Agenda
What is a Data Lake?
Dataset
Architecture
Source Code
Demonstration

What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.” -
Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS

Dataset
TICKIT database
E-commerce platform
Bringing together buyers and sellers of tickets to entertainments events
Designed to demonstrate Amazon Redshift Cloud Data Warehouse
Small database consists of seven tables: two fact and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html

Dataset
Table Simulated Datasource Demo Datasource
Category Software as a Service (SaaS) 3rd Party Provider Amazon RDS for PostgreSQL
Event Software as a Service (SaaS) 3rd Party Provider Amazon RDS for PostgreSQL
Venue Software as a Service (SaaS) 3rd Party Provider Amazon RDS for PostgreSQL
Listing COTS E-commerce Platform Amazon RDS for MySQL
Sales COTS E-commerce Platform Amazon RDS for MySQL
Date COTS E-commerce Platform Amazon RDS for MySQL
Users Custom Customer Relationship Management (CRM) Amazon RDS for SQL Server

Architecture: AWS Services Used
Amazon Simple Storage Service (Amazon S3)
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with AWS DMS or Kafka Connect)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Apache Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon Managed Workflows for Apache Airflow (MWAA) (alt. AWS Step Functions)

Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): Handling changes to systems of record
Transactional Storage Layer: Managing changes to the SoR in the data lake
Streaming Data: Data continuously generated by different sources
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data’s lifecycle as it flows from sources to consumption

Architecture: Out of Scope (but critically important)
Data Discovery/Inspection: Scanning data for sensitive or unexpected content (PII)
DataOps: Automating testing, deployment, job execution
Infrastructure as Code (IaC): Infrastructure provisioning automation
Data Warehousing (Lake House architecture)
Data Lake Storage Tiering, Archival, and Backup

github.com/garystafford/tickit-data-lake-demo

Building Data Lakes with Apache Airflow

More Related Content

What's hot

Similar to Building Data Lakes with Apache Airflow

More from Gary Stafford

Recently uploaded

Building Data Lakes with Apache Airflow