Building Reliable Delta
Lakes at scale
Steps to running this tutorial
Instructions - https://dbricks.co/saiseu19-delta
1. Create an account + sign in to Databricks Community Edition
https://databricks.com/try
2. Create a cluster with Databricks Runtime 6.1
3. Import the Python notebook and attach it to the cluster
You can also use Scala notebook if you prefer
1. Collect
Everything
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
3. Data Science &
Machine Learning
2. Store it all in
the Data Lake
The Promise of the Data Lake
Garbage In Garbage Stored Garbage Out
🔥
🔥
🔥
🔥🔥
🔥
🔥
Tutorial instructions - https://dbricks.co/saiseu19-delta
What does a typical
data lake project look like?
Tutorial instructions - https://dbricks.co/saiseu19-delta
Evolution of a Cutting-Edge Data Lake
Events
?
AI & Reporting
Streaming
Analytics
Data Lake
Tutorial instructions - https://dbricks.co/saiseu19-delta
Evolution of a Cutting-Edge Data Lake
Events
AI & Reporting
Streaming
Analytics
Data Lake
Tutorial instructions - https://dbricks.co/saiseu19-delta
Challenge #1: Historical Queries?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
λ-arch1
1
1
Tutorial instructions - https://dbricks.co/saiseu19-delta
Challenge #2: Messy Data?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
1
21
1
2
Tutorial instructions - https://dbricks.co/saiseu19-delta
Reprocessing
Challenge #3: Mistakes and Failures?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Partitioned
1
2
3
1
1
3
2
Tutorial instructions - https://dbricks.co/saiseu19-delta
Reprocessing
Challenge #4: Updates?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates
Partitioned
DELETE, UPDATE
& MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
Tutorial instructions - https://dbricks.co/saiseu19-delta
Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data
Tutorial instructions - https://dbricks.co/saiseu19-delta
Data Lake Distractions
No atomicity means failed production jobs
leave data in corrupt state requiring tedious
recovery
✗
No quality enforcement creates inconsistent
and unusable data
No consistency / isolation makes it almost
impossible to mix appends and reads, batch and
streaming
Tutorial instructions - https://dbricks.co/saiseu19-delta
Let’s try it instead with
Tutorial instructions - https://dbricks.co/saiseu19-delta
Reprocessing
Challenges of the Data Lake
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates
Partitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
Tutorial instructions - https://dbricks.co/saiseu19-delta
Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
Bronze Silver Gold
CSV,
JSON, TXT…
Kinesis
Quality
Delta Lake allows you to incrementally improve the
quality of your data until it is ready for consumption.
*Data Quality Levels *
The Architecture
Tutorial instructions - https://dbricks.co/saiseu19-delta
Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
Bronze Silver Gold
CSV,
JSON, TXT…
Kinesis
*Data Quality Levels *
The Architecture
Full ACID Transactions
Focus on your data flow, instead of worrying about failures.
Tutorial instructions - https://dbricks.co/saiseu19-delta
Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
Bronze Silver Gold
CSV,
JSON, TXT…
Kinesis
*Data Quality Levels *
The Architecture
Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing
community including Spark, Presto, Hive and more.
Tutorial instructions - https://dbricks.co/saiseu19-delta
Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON, TXT…
Kinesis
Tutorial instructions - https://dbricks.co/saiseu19-delta
Powered by
Unifies Streaming / Batch. Convert existing jobs with minimal
modifications.
Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON, TXT…
Kinesis
UPDATE
DELETE
MERGE
OVERWRITE
INSERT
Tutorial instructions - https://dbricks.co/saiseu19-delta
Support for DMLs
Use Delete/Update/Merge operations for data
corrections, GDPR, Change Data Capture, etc.
Open source and open formats
Unified Batch and Streaming
sources
ACID Transactions
Schema Enforcement and
Evolution
Delete, Update, Merge
Audit History
Versioning and Time Travel
Scalable metadata management
Support from Spark, Presto, Hive
Tutorial instructions - https://dbricks.co/saiseu19-delta
Used by 1000s of organizations world wide
> 2 exabyte processed last month alone
Tutorial instructions - https://dbricks.co/saiseu19-delta
Let’s begin the tutorial!
Build your own Delta Lake
at https://delta.io

Building Reliable Data Lakes at Scale with Delta Lake

  • 1.
  • 2.
    Steps to runningthis tutorial Instructions - https://dbricks.co/saiseu19-delta 1. Create an account + sign in to Databricks Community Edition https://databricks.com/try 2. Create a cluster with Databricks Runtime 6.1 3. Import the Python notebook and attach it to the cluster You can also use Scala notebook if you prefer
  • 3.
    1. Collect Everything • RecommendationEngines • Risk, Fraud Detection • IoT & Predictive Maintenance • Genomics & DNA Sequencing 3. Data Science & Machine Learning 2. Store it all in the Data Lake The Promise of the Data Lake Garbage In Garbage Stored Garbage Out 🔥 🔥 🔥 🔥🔥 🔥 🔥 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 4.
    What does atypical data lake project look like? Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 5.
    Evolution of aCutting-Edge Data Lake Events ? AI & Reporting Streaming Analytics Data Lake Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 6.
    Evolution of aCutting-Edge Data Lake Events AI & Reporting Streaming Analytics Data Lake Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 7.
    Challenge #1: HistoricalQueries? Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events λ-arch1 1 1 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 8.
    Challenge #2: MessyData? Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation 1 21 1 2 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 9.
    Reprocessing Challenge #3: Mistakesand Failures? Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Partitioned 1 2 3 1 1 3 2 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 10.
    Reprocessing Challenge #4: Updates? DataLake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Updates Partitioned DELETE, UPDATE & MERGE Scheduled to Avoid Modifications 1 2 3 1 1 3 4 4 4 2 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 11.
    Wasting Time &Money Solving Systems Problems Instead of Extracting Value From Data Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 12.
    Data Lake Distractions Noatomicity means failed production jobs leave data in corrupt state requiring tedious recovery ✗ No quality enforcement creates inconsistent and unusable data No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 13.
    Let’s try itinstead with Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 14.
    Reprocessing Challenges of theData Lake Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Updates Partitioned UPDATE & MERGE Scheduled to Avoid Modifications 1 2 3 1 1 3 4 4 4 2 Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 15.
    Data Lake AI &Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Quality Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption. *Data Quality Levels * The Architecture Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 16.
    Data Lake AI &Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis *Data Quality Levels * The Architecture Full ACID Transactions Focus on your data flow, instead of worrying about failures. Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 17.
    Data Lake AI &Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis *Data Quality Levels * The Architecture Open Standards, Open Source Store petabytes of data without worries of lock-in. Growing community including Spark, Presto, Hive and more. Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 18.
    Data Lake AI &Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion The Bronze Silver Gold CSV, JSON, TXT… Kinesis Tutorial instructions - https://dbricks.co/saiseu19-delta Powered by Unifies Streaming / Batch. Convert existing jobs with minimal modifications.
  • 19.
    Data Lake AI &Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion The Bronze Silver Gold CSV, JSON, TXT… Kinesis UPDATE DELETE MERGE OVERWRITE INSERT Tutorial instructions - https://dbricks.co/saiseu19-delta Support for DMLs Use Delete/Update/Merge operations for data corrections, GDPR, Change Data Capture, etc.
  • 20.
    Open source andopen formats Unified Batch and Streaming sources ACID Transactions Schema Enforcement and Evolution Delete, Update, Merge Audit History Versioning and Time Travel Scalable metadata management Support from Spark, Presto, Hive Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 21.
    Used by 1000sof organizations world wide > 2 exabyte processed last month alone Tutorial instructions - https://dbricks.co/saiseu19-delta
  • 22.
  • 23.
    Build your ownDelta Lake at https://delta.io