Building Reliable Data Lakes at Scale with Delta Lake
This document provides a tutorial for building scalable Delta Lakes using Databricks, including steps such as account creation, cluster setup, and data storage strategies. It addresses challenges in data lakes such as historical queries, messy data, and the need for updates and consistency. The document emphasizes the benefits of Delta Lake, such as improved data quality, support for ACID transactions, and integration with various data formats and sources.
Steps to running this tutorial
Instructions - https://dbricks.co/saiseu19-delta
1. Create an account + sign in to Databricks Community Edition
https://databricks.com/try
2. Create a cluster with Databricks Runtime 6.1
3. Import the Python notebook and attach it to the cluster
You can also use the Scala notebook if you prefer
The Promise of the Data Lake
1. Collect everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning:
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Garbage In, Garbage Stored, Garbage Out
What does a typical data lake project look like?
Evolution of a Cutting-Edge Data Lake
[Diagram: events flow into the data lake, which feeds streaming analytics and AI & reporting.]
Challenge #1: Historical Queries?
[Diagram: a lambda architecture (λ-arch) is bolted onto the data lake so that streaming analytics and AI & reporting can also answer queries over historical events.]
Challenge #2: Messy Data?
[Diagram: validation steps are added to both the batch and streaming paths of the lambda architecture to catch bad data before it reaches analytics.]
Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data
Data Lake Distractions
✗ No atomicity means failed production jobs leave data in a corrupt state requiring tedious recovery
✗ No quality enforcement creates inconsistent and unusable data
✗ No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming
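The atomicity problem can be sketched in plain Python (a conceptual toy, not Delta Lake's actual transaction log; all names here are illustrative): writing output in place leaves a partial file behind when the job crashes, while staging to a temporary file and atomically renaming it into place gives all-or-nothing semantics — the guarantee Delta Lake provides at table level via its transaction log.

```python
import os
import tempfile

def naive_write(path, records, fail_after=None):
    """Write records directly to `path`; a crash mid-write leaves a partial file."""
    with open(path, "w") as f:
        for i, rec in enumerate(records):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("job crashed mid-write")
            f.write(rec + "\n")

def atomic_write(path, records):
    """Stage to a temp file, then atomically rename into place (all-or-nothing)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            for rec in records:
                f.write(rec + "\n")
        os.replace(tmp, path)  # atomic: readers see the old file or the new one, never a partial
    except BaseException:
        os.remove(tmp)  # a failed job leaves no trace at `path`
        raise
```

A reader racing `naive_write` can observe the half-written file; with `atomic_write`, the destination path only ever holds a complete snapshot.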
Let’s try it instead with Delta Lake
Challenges of the Data Lake
[Diagram: the lambda architecture now also carries validation, reprocessing of historical data, and partitioned UPDATE & MERGE jobs scheduled to avoid concurrent modifications — systems complexity keeps piling up.]
The Delta Lake Architecture: Data Quality Levels
[Diagram: raw sources (CSV, JSON, TXT, Kinesis, …) are ingested into a Bronze table (raw ingestion), refined into a Silver table (filtered, cleaned, augmented), and aggregated into a Gold table (business-level aggregates) feeding streaming analytics and AI & reporting.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
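In practice each hop is a Spark job reading and writing Delta tables; the hop logic itself can be sketched in plain Python over in-memory records (schema and field names below are illustrative, not from the tutorial notebook):

```python
from collections import defaultdict

# Bronze: raw ingested events, kept as-is — including bad records.
bronze = [
    {"user": "alice", "amount": "30",   "country": "DE"},
    {"user": "bob",   "amount": "oops", "country": "US"},  # malformed
    {"user": "alice", "amount": "12",   "country": "DE"},
    {"user": "carol", "amount": None,   "country": "US"},  # missing value
]

def to_silver(rows):
    """Silver: drop malformed records and normalize types."""
    out = []
    for r in rows:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except (TypeError, ValueError):
            pass  # quarantine/drop bad records at this hop
    return out

def to_gold(rows):
    """Gold: business-level aggregate (total amount per country)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["country"]] += r["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
# gold == {"DE": 42.0}: only the two clean DE records survive to the aggregate
```

The point of the multi-hop layout is that each table is independently queryable: analysts hit Gold, data scientists can go back to Silver or Bronze when they need more detail.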
Full ACID Transactions
Focus on your data flow, instead of worrying about failures.
Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing community including Spark, Presto, Hive and more.
Powered by Apache Spark
Unifies streaming and batch. Convert existing jobs with minimal modifications.
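The "same code for batch and streaming" idea can be illustrated in plain Python (a conceptual sketch, not the Structured Streaming API): one transformation function runs unchanged over a finite batch or an incrementally consumed event stream.

```python
def enrich(records):
    """One transformation shared by the batch and streaming paths:
    tag each event with a derived field."""
    for r in records:
        yield {**r, "is_large": r["amount"] >= 100}

# Batch: apply the transformation to a finite collection all at once.
batch_out = list(enrich([{"amount": 40}, {"amount": 250}]))

# "Streaming": feed the same function an unbounded iterator and
# consume results incrementally as events arrive.
def event_source():
    for amount in (10, 500):
        yield {"amount": amount}

stream_out = []
for event in enrich(event_source()):
    stream_out.append(event)  # a real job would update a sink here
```

In Spark the analogous move is writing the transformation once against a DataFrame and pointing it at either a static source or a streaming one.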
Support for DML: UPDATE, DELETE, MERGE, OVERWRITE, INSERT
Use Delete/Update/Merge operations for data corrections, GDPR, Change Data Capture, etc.
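In Delta Lake, MERGE is expressed in Spark SQL (`MERGE INTO … WHEN MATCHED THEN UPDATE … WHEN NOT MATCHED THEN INSERT …`); its upsert semantics can be sketched over an in-memory table keyed by id (a toy model — table names and fields below are illustrative):

```python
def merge(target, updates, key="id"):
    """Toy MERGE: update rows whose key matches, insert the rest —
    the 'WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT' pattern."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED: update in place
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED: insert
    return list(by_key.values())

customers = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
changes   = [{"id": 2, "email": "b@new.com"}, {"id": 3, "email": "c@x.com"}]
merged = merge(customers, changes)
# → id 2 updated in place, id 3 inserted, id 1 untouched
```

This single-statement upsert is what makes Change Data Capture and GDPR deletions tractable on a data lake: no hand-rolled rewrite of whole partitions.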
• Open source and open formats
• Unified batch and streaming sources
• ACID transactions
• Schema enforcement and evolution
• Delete, Update, Merge
• Audit history
• Versioning and time travel
• Scalable metadata management
• Support from Spark, Presto, Hive
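The versioning and time-travel idea can be sketched as an append-only list of table snapshots (a toy model: Delta Lake's transaction log records deltas per commit rather than full copies, but the read-any-past-version behavior is the same):

```python
class VersionedTable:
    """Toy versioned table: each write produces a new snapshot, so every
    past version stays readable ('time travel') and writes never destroy
    history — a sketch of the idea behind Delta Lake's transaction log."""

    def __init__(self):
        self._versions = [[]]  # version 0: the empty table

    def write(self, rows):
        snapshot = list(self._versions[-1]) + list(rows)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or any historical one ('AS OF version')."""
        idx = len(self._versions) - 1 if version is None else version
        return list(self._versions[idx])

events = VersionedTable()
events.write([{"id": 1}])             # version 1
events.write([{"id": 2}, {"id": 3}])  # version 2
# events.read() sees all 3 rows; events.read(version=1) still sees only the first
```

This is what powers auditability and reproducible experiments: a query pinned to a version keeps returning the same result no matter how the table evolves.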
Used by 1000s of organizations worldwide
> 2 exabytes processed last month alone