Data Lake Demonstration
Building Data Lakes with Apache Airflow
Gary A. Stafford
Twitter/LinkedIn
GaryStafford
Blog
garystafford.medium.com
Agenda
What is a Data Lake?
Dataset
Architecture
Source Code
Demonstration
What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.” - Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS
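The "flat architecture" in the first definition can be illustrated with a minimal sketch. It assumes nothing beyond the Python standard library; the object keys and the `raw/` and `refined/` prefixes are hypothetical and do not come from the demonstration. The point is that an object store holds flat keys, and anything that looks like a folder is only a shared key prefix.

```python
# Hypothetical object keys (assumption, not from the demo). An object store
# has no real directories, only flat keys that happen to share prefixes.
OBJECT_KEYS = [
    "raw/tickit/sales/part-0000.csv",
    "raw/tickit/sales/part-0001.csv",
    "raw/tickit/event/part-0000.csv",
    "refined/tickit/sales/part-0000.parquet",
]


def list_by_prefix(keys, prefix):
    """Return the keys that start with the given prefix, sorted."""
    return sorted(key for key in keys if key.startswith(prefix))


def common_prefixes(keys, prefix, delimiter="/"):
    """Emulate a 'folder' listing: distinct next-level prefixes under prefix."""
    found = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if delimiter in rest:
                # Keep only the first segment after the prefix.
                found.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(found)
```

Calling `common_prefixes(OBJECT_KEYS, "raw/tickit/")` returns the two apparent "folders" for sales and event, even though the store itself only ever held four flat keys.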
Dataset
TICKIT database
E-commerce platform
Brings together buyers and sellers of tickets to entertainment events
Designed to demonstrate Amazon Redshift Cloud Data Warehouse
Small database consisting of seven tables: two fact tables and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
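The fact and dimension split can be sketched with a small in-memory example. This is an illustration only: it uses SQLite from the Python standard library rather than any service in the demonstration, it assumes sales acts as a fact table with event and category as dimensions, it keeps only a handful of columns, and every row is made-up sample data.

```python
import sqlite3

# In-memory database with a reduced, assumed subset of TICKIT-style columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE category (catid INTEGER PRIMARY KEY, catname TEXT);
    CREATE TABLE event (eventid INTEGER PRIMARY KEY, catid INTEGER, eventname TEXT);
    CREATE TABLE sales (salesid INTEGER PRIMARY KEY, eventid INTEGER,
                        qtysold INTEGER, pricepaid REAL);
    -- Made-up sample rows, not real TICKIT data.
    INSERT INTO category VALUES (1, 'Musicals'), (2, 'Pop');
    INSERT INTO event VALUES (10, 1, 'Show A'), (11, 2, 'Concert B');
    INSERT INTO sales VALUES (100, 10, 2, 150.0), (101, 10, 1, 80.0),
                             (102, 11, 4, 400.0);
    """
)


def revenue_by_category(connection):
    """Join the sales fact table to its dimensions and aggregate per category."""
    return connection.execute(
        """
        SELECT c.catname, SUM(s.qtysold) AS tickets, SUM(s.pricepaid) AS revenue
        FROM sales AS s
        JOIN event AS e ON e.eventid = s.eventid
        JOIN category AS c ON c.catid = e.catid
        GROUP BY c.catname
        ORDER BY c.catname
        """
    ).fetchall()
```

The fact table carries the measures (quantity sold, price paid), while the dimension tables supply the descriptive attributes used for grouping.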
Table     Simulated Datasource                             Demo Datasource
Category  Software as a Service (SaaS) 3rd Party Provider  Amazon RDS for PostgreSQL
Event     Software as a Service (SaaS) 3rd Party Provider  Amazon RDS for PostgreSQL
Venue     Software as a Service (SaaS) 3rd Party Provider  Amazon RDS for PostgreSQL
Listing   COTS E-commerce Platform                         Amazon RDS for MySQL
Sales     COTS E-commerce Platform                         Amazon RDS for MySQL
Date      COTS E-commerce Platform                         Amazon RDS for MySQL
Users     Custom Customer Relationship Management (CRM)    Amazon RDS for SQL Server
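The table-to-datasource mapping above can be captured as plain data, which is one way an orchestration script might decide which tables to pull from which database. This is a minimal sketch using only the Python standard library; the lowercase table names and the `tables_by_source` helper are assumptions made for illustration, not names taken from the demonstration's source code.

```python
from collections import defaultdict

# Demo datasource for each table, as listed on the slide.
# Lowercase table names are an assumption for illustration.
TABLE_SOURCES = {
    "category": "Amazon RDS for PostgreSQL",
    "event": "Amazon RDS for PostgreSQL",
    "venue": "Amazon RDS for PostgreSQL",
    "listing": "Amazon RDS for MySQL",
    "sales": "Amazon RDS for MySQL",
    "date": "Amazon RDS for MySQL",
    "users": "Amazon RDS for SQL Server",
}


def tables_by_source(table_sources):
    """Group table names by the datasource that hosts them."""
    grouped = defaultdict(list)
    for table, source in table_sources.items():
        grouped[source].append(table)
    # Sort each group so the result is stable regardless of input order.
    return {source: sorted(tables) for source, tables in grouped.items()}
```

Grouping by source gives three work units, one per database engine, which matches the three simulated systems in the table.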
Architecture
Architecture: AWS Services Used
Amazon Simple Storage Service (Amazon S3)
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with AWS DMS or Kafka Connect)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Apache Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon Managed Workflows for Apache Airflow (MWAA) (alt. AWS Step Functions)
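How the listed services might depend on one another can be sketched as a small dependency graph. This is not Airflow code and does not use the MWAA or Glue APIs: it uses `graphlib` from the Python standard library (Python 3.9+) to resolve the same kind of ordering an Airflow DAG would express. The task names and the ordering of stages are assumptions for illustration, not the demonstration's actual DAGs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (assumption). Each key maps to the set of tasks
# that must finish before it can start.
PIPELINE = {
    "crawl_source_databases": set(),                 # AWS Glue Crawlers
    "update_data_catalog": {"crawl_source_databases"},  # AWS Glue Data Catalog
    "copy_tables_to_s3": {"update_data_catalog"},    # AWS Glue Jobs writing to Amazon S3
    "query_with_athena": {"copy_tables_to_s3"},      # Amazon Athena
}


def execution_order(pipeline):
    """Return the task names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())
```

In an orchestrator such as Airflow, the same relationships would be declared between operators, and the scheduler would derive the run order rather than the author listing it by hand.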
Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): Handling changes to systems of record
Transactional Storage Layer: Managing changes to the SoR in the data lake
Streaming Data: Data continuously generated by different sources
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data’s lifecycle as it flows from sources to consumption
Data Discovery/Inspection: Scanning data for sensitive or unexpected content (PII)
DataOps: Automating testing, deployment, job execution
Infrastructure as Code (IaC): Infrastructure provisioning automation
Data Warehousing (Lake House architecture)
Data Lake Storage Tiering, Archival, and Backup
Source Code
github.com/garystafford/tickit-data-lake-demo
Demonstration
