Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

The Delta
Architecture
Quentin Ambard
quentin.ambard@databricks.com

Databricks Workspace
Collaborative Notebooks, production jobs & business insights
Managed platform
Cloud Native
Databricks: Unified Data Analytics Platform
ML Runtime
For your Big data and Machine Learning Lifecycle
...

● A typical Data Lake Architecture
● The Delta Architecture
● Inside Delta Lake
● Demo
The Delta Agenda

Enterprises have been spending millions
of dollars getting data into data lakes
Data Lake

The aspiration is to do data science and
ML on all that data using Apache Spark!
Data Lake
Data Science & ML
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing

Data Lake
But the data is not ready for data science & ML
The majority of these projects are failing due to
Complex pipeline and unreliable data!
Data Science & ML
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing

What does a typical
data lake project look like?

Evolution of a Cutting-Edge Data Lake
Events
?
AI & Reporting
Streaming
Analytics
Data Lake

Evolution of a Cutting-Edge Data Lake
Events
AI & Reporting
Streaming
Analytics
Data Lake

Challenge #1: Historical Queries?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
λ-arch1
1
1

Challenge #2: Messy Data?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
1
21
1
2

Reprocessing
Challenge #3: Mistakes and Failures?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Partitioned
1
2
3
1
1
3
2

Challenge #4: Updates?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates GDPR...
Partitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
Reprocessing

Challenge #5: Stability at scale?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates GDPR...
Small filesPartitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
5
5
Reprocessing

Data reliability challenges with data lakes
No atomicity: failed jobs leaves data in
corrupt state requiring tedious recovery✗
No quality enforcement: creates inconsistent and low
quality data
Lack of consistency / isolation: makes it almost impossible
to mix delete, appends and reads, batch and streaming

● Open Format Based on Parquet
● By the creator of Apache Spark
● With Transactions
● Using Spark API’s
A New Standard for Building Data Lakes

Is there a better architecture?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates GDPR...
Small filesPartitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
5
5
Reprocessing

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Quality
Delta Lake allows you to improve the quality of your
data until it is ready for consumption.

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Raw data with minimal parsing
Supports long retention (years)

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Intermediate data with some cleanup applied.
Schema enforcement/evolution, data expectation
Queryable for easy debugging!

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Clean data, ready for consumption.
Read with Spark, Presto, Glue*
*Coming Soon

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
• Full ACID Transactions
• Open Source (Apache License)
• Powered by

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver
CSV,
JSON,
TXT…
Kinesis
Streams move data through the Delta Lake
•Low-latency or manually triggered
•Eliminates management of schedules and jobs
Gold

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver
CSV,
JSON,
TXT…
Kinesis
Delta Lake also supports batch jobs
and standard DML while streams run
UPDATE
DELETE
MERGE
OVERWRITE
• Retention
• Corrections
• GDPR
INSERT
Gold

Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver
CSV,
JSON,
TXT…
Kinesis
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
DELETE DELETE
Gold

dataframe
.write
.format("delta")
.save("/data")
Get Started with Delta using Spark APIs
dataframe
.write
.format("parquet")
.save("/data")
Instead of parquet... … simply say delta
Add Spark Package
pyspark --packages io.delta:delta-core_2.12:0.1.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>0.1.0</version>
</dependency>
Maven

Delta On Disk
my_table/
_delta_log/
00000.json
00001.json
date=2019-01-01/
file-1.parquet
Transaction Log
Table Versions
(Optional) Partition Directories
Data Files

Log Structured Storage
Changes to the table
are stored as
ordered, atomic units
called commits
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
000000.json
000001.json
…

Handling Massive Metadata
Large tables can have millions of files in them! How do we scale
the metadata? Use Spark for scaling!
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Checkpoint
…
0009.json
0010.json
checkpoint-1.parquet
0011.json
…
Transaction log

Transactional
Log
Parquet Files
Delta Lake ensures data reliability
Streaming
● ACID Transactions / full DML
● Data quality
● Unified Batch & Streaming
● Time Travel/Data Snapshots
Key Features
High Quality & Reliable Data
always ready for analytics
Batch
Updates/Deletes

Support concurrent operation
Notebook/User 1:
SELECT * FROM customers WHERE firstname='xxx'
Notebook/User 2:
INSERT INTO customers (firstname, …) VALUES ('marc', …)
Notebook/User 3:
DELETE FROM customers WHERE firstname='quentin'

Support concurrent operation
Isolation level: WriteSerializable
Delta solves conflict optimistically
Concurrent modifications on a table triggers a rollback

Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = source.customerId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED
THEN INSERT (customerId, address) VALUES (updates.customerId,
updates.address)

Ensure Data Quality*
Enforce metadata, schema, and quality declaratively.
Inserts will fail if data doesn’t respect schema or quality
table("warehouse")
.location(…) // Location on DBFS
.schema(my_schema) // Optional strict schema checking
.metastoreName(…) // Registration in Hive Metastore
.description(…) // Human readable description for users
*Coming Soon
.expect("validTimestamp", // Expectations on data quality*
"timestamp > 2012-01-01 AND …",
"fail / alert / quarantine")

Unified batch and streaming
Concurrent stream/batch with exactly-once processing guarantee
Data Lake
AI & Reporting
Streaming
Analytics
Join stream with
table/stream
Bronze Silver
CSV,
JSON,
TXT…
Kinesis
DELETE DELETE
Gold

SELECT count(*) FROM events
TIMESTAMP AS OF timestamp
SELECT count(*) FROM events
VERSION AS OF version
Time Travel
spark.read.format(" delta").option("timestampAsOf",
timestamp_string).load("/events/")
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF
date_sub( current_date(), 1)
Reproduce experiments & reports Rollback accidental bad writes

Workshop Delta & MLFlow
Jeudi 7 Novembre
9h-12h30
https://dbricks.co/workshop-databricks

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

Similar to Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard (20)

More from Paris Data Engineers !

More from Paris Data Engineers ! (11)

Recently uploaded

Recently uploaded (20)

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard