Delta Lake Architecture: Delta Lake + Apache Spark Structured Streaming

The Delta Architecture
Delta Lake + Apache Spark Structured Streaming
Xiao Li (李潇, @gatorsmile)
October 2019, Shanghai

About Me
• Tech Lead and Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• GitHub: gatorsmile

Delta Lake Joins the Linux Foundation!

Timeline
2017/06  Project kickoff
2017/10  Announced
2018/06  Debut at Spark + AI Summit
2019/04  Open-sourced
2019/10
Dominique Brezinski (Apple Inc.), Michael Armbrust (Databricks)

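Agenda
• The data engineer's dilemma and the ops team's chaos
• Delta Lake fundamentals
• The Delta architecture
• Characteristics of the Delta architecture
• Classic Delta architecture case studies & demo
• The Delta Lake community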

"Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch or streaming"
So says the project manager to the data engineer

The data engineer's first architecture sketch
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]

The data engineer's second design
[Diagram: Events → Stream → Table (data gets written continuously) → Stream → AI & Reporting]
Problem: Spark jobs keep getting slower because there are too many small files

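The data engineer's third design
[Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table; Batch; AI & Reporting]
Data is compacted on a schedule every hour
Problem: the extra compaction step adds latency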

The data engineer's third design (continued)
Data is compacted on a schedule every hour
Problem: the business will not accept latency of more than 1 hour

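The data engineer's fourth design
[Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Batch; Stream; Unified View]
Data is compacted on a schedule every hour
Problem: the Lambda architecture greatly increases the operational burden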

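The data engineer's fifth design
[Diagram: the same Lambda architecture with a Validation step added, feeding AI & Reporting]
Data is compacted on a schedule every hour
Problem: validation and other data cleanup have to be done twice, once for batch and once for streaming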

The data engineer's sixth design
[Diagram: the same Lambda architecture with Validation and Reprocessing steps added]
Data is compacted on a schedule every hour
Problem: fixing mistakes found by validation means some partitions have to be reprocessed

The data engineer's seventh design
[Diagram: the same Lambda architecture with Validation, Reprocessing, and Update / Merge steps added]
Data is compacted on a schedule every hour
Problem: Update / Merge on a data lake is extremely complex

After half a year of development, the painstakingly built Lambda architecture
finally goes live, to great rejoicing!

• Extremely slow dataframe loading
• Commands Blocked on Metadata Operations
• Keep getting FileNotFound
• CRITICAL: inconsistent job results
• Different field types cause conflicting schema
• Refresh Table Issues???
• Concatenate small files
• How to control number of parquet files ?
• Eventual Consistency !!!
What a trap! It is enough to make you swear.

Facing the Lambda architecture, the ops engineers are overwhelmed...

Ops is hard enough already; why make it even harder?
Expert's verdict: this design costs money and effort, and it wastes
precious time on working around system limitations instead of
extracting value from the data

The data engineer's first architecture sketch
[Diagram: the first sketch again, marked with a question mark]
What exactly was wrong with the original design?
Why choose the complicated, painful Lambda architecture?

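What exactly is missing?
1) Read and write at the same time while guaranteeing data consistency
2) Read from large tables with high throughput
3) Roll back, delete, and modify data when bad writes occur
4) Reprocess historical data without taking online workloads offline
5) Handle late-arriving data without delaying downstream processing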

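Delta Lake + Structured Streaming = Delta Architecture
• Unified batch and streaming, continuous data processing
• Reprocess historical events on demand, at any time
• Scale compute and storage resources independently and elastically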

Delta Lake Fundamentals

Delta On Disk
my_table/
  _delta_log/            Transaction Log
    00000.json           Table Versions
    00001.json
  date=2019-01-01/       (Optional) Partition Directories
    file-1.parquet       Data Files

Action Types
• Change Metadata – name, schema, partitioning, etc.
• Add File – adds a file (with optional statistics)
• Remove File – removes a file
Table = result of a set of actions
Result: Current Metadata, List of Files, List of Txns, Version
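
The idea that a table is "the result of a set of actions" can be sketched as a simple replay loop. This is a minimal illustration, not Delta Lake's implementation: the dict layout of each action (`type`, `schema`, `path`) and the `replay` function are hypothetical stand-ins for the JSON actions stored under `_delta_log/`.

```python
# Minimal sketch: rebuild a table's state by replaying ordered actions.
# The dict layout below is a simplified, hypothetical stand-in for the
# JSON actions in _delta_log/, not the real Delta Lake log schema.

def replay(commits):
    """Apply commits in order; return (metadata, live files, version)."""
    metadata = None
    files = set()
    version = -1  # stays -1 when there are no commits at all
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["type"]
            if kind == "metadata":
                metadata = action["schema"]      # Change Metadata
            elif kind == "add":
                files.add(action["path"])        # Add File
            elif kind == "remove":
                files.discard(action["path"])    # Remove File
            else:
                raise ValueError(f"unknown action type: {kind}")
    return metadata, files, version


commits = [
    [{"type": "metadata", "schema": "id INT, date STRING"},
     {"type": "add", "path": "1.parquet"},
     {"type": "add", "path": "2.parquet"}],
    [{"type": "remove", "path": "1.parquet"},
     {"type": "remove", "path": "2.parquet"},
     {"type": "add", "path": "3.parquet"}],
]

metadata, files, version = replay(commits)
print(metadata, sorted(files), version)
```

After both commits are replayed, only `3.parquet` is live and the table is at version 1.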

Implementing Atomicity
Changes to the table are stored as ordered, atomic
units called commits
000000.json: Add 1.parquet, Add 2.parquet
000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
...

Optimistic Concurrency Control
1. Record start version
2. Record reads/writes
3. Attempt commit, check for conflicts among transactions
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: User 1 and User 2 each do Read: Schema and Write: Append; commits 000000.json, 000001.json, 000002.json]
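
The five steps above can be sketched in a few lines. This is a single-process illustration under stated assumptions: the `Log` class, `commit_with_retry`, and the `changes` field are hypothetical names, and an in-memory list stands in for the commit files; a real system would need an atomic "create if absent" operation in the storage layer.

```python
# Minimal sketch of optimistic concurrency control over a commit log.
# An in-memory list stands in for _delta_log/; all names are illustrative.

class ConflictError(Exception):
    pass


class Log:
    def __init__(self):
        self.commits = []  # list index == version number

    def latest_version(self):
        return len(self.commits) - 1

    def try_commit(self, version, commit):
        """Write `commit` as `version` only if that slot is still free."""
        if version != len(self.commits):
            return False  # someone else already won this version
        self.commits.append(commit)
        return True


def commit_with_retry(log, start_version, reads, commit, max_attempts=10):
    """Retry until the commit lands or a real conflict is found."""
    seen = start_version
    for _ in range(max_attempts):
        # Step 4: check whether commits that won since `seen` changed
        # anything this transaction read.
        for winner in log.commits[seen + 1:]:
            if reads & winner["changes"]:
                raise ConflictError("something we read has changed")
        seen = log.latest_version()
        # Step 3: attempt to commit as the next version.
        if log.try_commit(seen + 1, commit):
            return seen + 1
        # Step 5: otherwise loop and try again.
    raise ConflictError("too many retries")


log = Log()
log.try_commit(0, {"changes": {"schema"}, "op": "create table"})

start = log.latest_version()  # both users record start version 0
v1 = commit_with_retry(log, start, {"schema"}, {"changes": {"data"}, "op": "append"})
v2 = commit_with_retry(log, start, {"schema"}, {"changes": {"data"}, "op": "append"})
print(v1, v2)
```

Both appends succeed because neither changes the schema that the other one read; a competing commit that did change the schema would raise `ConflictError` instead.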

Millions of commit log files! How do we handle metadata at that scale?
Large-scale metadata handling: use Spark!
[Diagram: commits Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet collapsed into a Checkpoint]
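
A checkpoint can be thought of as the replayed state saved at some version, so that a reader only has to replay the commits after it. The sketch below assumes a simplified commit format of `(op, path)` tuples; `write_checkpoint` and `read_table` are hypothetical helper names, and the snapshot is kept in memory rather than written to a file.

```python
# Minimal sketch: a checkpoint stores the replayed state at some version,
# so readers only replay the commits that came after it.
# Commits are simplified to lists of ("add" | "remove", path) tuples.

def apply(files, commits):
    """Return the live file set after applying commits to `files`."""
    files = set(files)
    for actions in commits:
        for op, path in actions:
            if op == "add":
                files.add(path)
            else:
                files.discard(path)
    return files


def write_checkpoint(commits, version):
    """Collapse commits 0..version into one snapshot of live files."""
    return {"version": version, "files": apply(set(), commits[: version + 1])}


def read_table(commits, checkpoint=None):
    """Start from the checkpoint if present, then replay newer commits."""
    if checkpoint is None:
        return apply(set(), commits)
    return apply(checkpoint["files"], commits[checkpoint["version"] + 1:])


commits = [
    [("add", "1.parquet"), ("add", "2.parquet")],
    [("remove", "1.parquet"), ("remove", "2.parquet"), ("add", "3.parquet")],
    [("add", "4.parquet")],
]
checkpoint = write_checkpoint(commits, version=1)
print(sorted(checkpoint["files"]))
print(sorted(read_table(commits, checkpoint)))
```

Reading through the checkpoint gives the same file list as replaying every commit from the start, while touching only the one commit written after the checkpoint.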

1) Read and write at the same time while guaranteeing data consistency (Full ACID Transaction)
Snapshot isolation between writers and readers. Focus on your data flow,
instead of worrying about failures.
• Streaming write to a Delta table
• Streaming read from a Delta table
• One-time (batch) write to a Delta table
• Read the current version of a Delta table

2) Read from large tables with high throughput (Scalable metadata handling)
Problem 1: millions of partition values?
- Fetch the location path of every partition from the Hive metastore?
Problem 2: hundreds of millions of files?
- And then list the countless large and small files under each partition location?
It would take forever.

Boss, isn't this just a typical big data problem?
• Store the file paths in Parquet
• Read them with Spark's Distributed Vectorized Parquet Reader

3) Roll back bad writes (rollback) and modify data (update/delete/merge)
Data is this dirty... nobody gets it right every time!

Time Travel: only for fixing mistakes?
Query past versions

Time Travel: fixing mistakes, debugging, recreating past reports, reviewing accounts, auditing, complex
temporal queries, and querying versions of tables whose data changes quickly

Delete

Update

Merge

Can update/delete/merge be offered in standard SQL syntax?
Working on it! Spark 3.0 is coming!
To support this on Spark 2.4, Delta would have to add its own SQL parser

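4) Reprocess historical data without taking online workloads offline (replay historical data)
Stream the backfilled historical data through the same pipeline
Thanks to ACID support, you can delete the affected results, change the business logic, and batch-process
the historical data while the stream keeps processing the latest data.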

5) Handle late-arriving data without delaying downstream processing
Stream any late-arriving data added to the table as it gets added
Thanks to ACID support, late-arriving data can also be handled with MERGE/UPSERT
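
The MERGE/UPSERT idea for late-arriving data can be illustrated on a plain in-memory table. This is only a sketch of the matching logic; the `merge` function, the `event_id` key, and the sample rows are hypothetical, and it says nothing about how Delta Lake executes MERGE internally.

```python
# Minimal sketch of MERGE/UPSERT semantics on an in-memory table.
# Rows are dicts; `key` names the column used to match rows.

def merge(target, updates, key):
    """Return a new table: matched rows updated, unmatched rows inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(merged.values())


table = [
    {"event_id": 1, "clicks": 10},
    {"event_id": 2, "clicks": 5},
]
late = [
    {"event_id": 2, "clicks": 7},   # late correction to an existing row
    {"event_id": 3, "clicks": 1},   # late event not seen before
]
table = merge(table, late, key="event_id")
print(table)
```

The late correction overwrites event 2 and the unseen event 3 is appended, so downstream readers do not have to wait for a separate batch repair job.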

1) Read and write at the same time while guaranteeing data consistency (read consistent data)
4) Reprocess historical data without taking online workloads offline (replay historical data)
5) Handle late-arriving data without delaying downstream processing

The data engineer's first architecture sketch, now as the Delta architecture
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]

Delta Architecture
A continuous data flow model to unify batch & streaming
[Diagram: sources (Data Lake; CSV, JSON, TXT…; Kinesis), then tables Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) with Quality increasing, then Streaming Analytics and AI & Reporting]
Operations: INSERT, UPDATE, DELETE, UPSERT, OVERWRITE

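#1. Continuous data flow
Data flows continuously into and out of Delta Lake tables.
● Unified batch and streaming. Same engine. Same APIs. Same user code. No need to run
batch and streaming side by side as in the Lambda architecture.
● Efficient incremental data loading. No need for state management to track whether new files have arrived
[e.g., Structured Streaming’s Trigger.Once]
● Fast, low-latency stream processing. [Trigger.ProcessingTime, Continuous]
● A simple change turns a batch job into a continuous streaming job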

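#2. Materialize intermediate results
Materialize DataFrames as needed, especially when many transformations are required.
Purposes of materialization:
● Fault recovery ● Write once, read many ● Easier troubleshooting
[Diagram: Source Table → transformations T1 to T7 → Dest Tables, with intermediate results held in memory]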

[Diagram: transformations T1 to T7, with an intermediate result persisted to storage]

[Diagram: transformations T1 to T7, with intermediate results persisted at two points]
How many materialized DataFrames?
A trade-off between reliability and end-to-end latency

#3. Trade-off between cost and latency
1. Streaming; continuous data ingestion and processing: needs always-on clusters, but no job
scheduling to manage [Trigger.ProcessingTime, Continuous].
2. Frequent batch; data ingested and processed at minute-level intervals (for example, once every half hour):
needs a warm pool of machines. Shut down when idle, start on demand. Can use
Spark streaming's Trigger.Once mode [incremental processing].
3. Infrequent batch; data ingested and processed in batches every several hours or days: shut
down when idle, start on demand. Can use Spark streaming's Trigger.Once mode
[incremental processing].
[Diagram: Bronze, Silver, Gold data quality levels]

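#4. Optimize the physical data layout
Based on the predicates of common queries, optimize how the data is physically
stored to improve read speed:
● Partition on low-cardinality columns (make sure each partition is larger than 1 GB).
○ partitionBy(date, eventType). // only 100 distinct event types
● Z-Order on high-cardinality columns
○ optimize table ZORDER BY userId. // 100 million distinct user ids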
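
For intuition about Z-ordering, the sketch below computes a Z-order (Morton) key by interleaving the bits of two column values, so that rows close in both columns end up with nearby sort keys. It is an illustration of the general technique only; `z_value` is a hypothetical helper, and nothing here is claimed about how any particular engine implements `ZORDER BY`.

```python
# Minimal sketch of a Z-order (Morton) key: interleave the bits of two
# column values so rows close in both columns get nearby sort keys.

def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


# Sort a small 4 x 4 grid of (x, y) pairs by their Z-order key.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))
print(rows[:4])
```

Sorting by this key groups the 2 x 2 block of smallest values together first, which is the locality property that lets a reader skip files when filtering on either column.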

#5. Reprocess historical data
Long-term retention of raw data +
stream = cheap reprocessing
• Just delete the destination table and restart the stream
• Use cloud elasticity to process the initial backfill quickly

#6. Tune data quality
● Automatically merge schemas on raw data tables: ensures the original
events are always captured without losing any data.
● Enforce schema constraints and data expectations on write: improve data
quality stage by stage until it meets the needs of data analysis
[Diagram: Bronze, Silver, Gold data quality levels]
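
The two schema policies on this slide can be contrasted with a small sketch. Schemas are modeled as plain `{column: type-name}` dicts, and `merge_schema` / `enforce_schema` are hypothetical helpers written for illustration; they are not Delta Lake APIs.

```python
# Minimal sketch contrasting schema merging with schema enforcement.
# Schemas are plain {column: type-name} dicts, a simplified stand-in
# for real Spark schemas.

def merge_schema(table_schema, batch_schema):
    """Raw tables: add new columns instead of rejecting the batch."""
    merged = dict(table_schema)
    for column, type_name in batch_schema.items():
        if column in merged and merged[column] != type_name:
            raise TypeError(f"conflicting types for column {column!r}")
        merged[column] = type_name
    return merged


def enforce_schema(table_schema, batch_schema):
    """Refined tables: reject any batch whose schema differs."""
    if batch_schema != table_schema:
        raise ValueError("schema mismatch: write rejected")


bronze = {"userId": "long", "eventType": "string"}
batch = {"userId": "long", "eventType": "string", "device": "string"}

# The raw table widens its schema, so the new column is not lost.
bronze = merge_schema(bronze, batch)
print(sorted(bronze))

# The refined table keeps its contract and refuses the unexpected column.
silver = {"userId": "long", "eventType": "string"}
try:
    enforce_schema(silver, batch)
except ValueError as err:
    print(err)
```

Merging at the raw stage keeps every incoming field, while enforcement at later stages stops unexpected columns from silently reaching the tables used for analysis.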

#6. Tune data quality
● Support for batch processing and standard DML: INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• GDPR
• UPSERTS

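Key features
• Continuous data ingestion and processing [no more Lambda-style split between batch and streaming]
• Materialize intermediate results to improve reliability and ease troubleshooting
• Trade off cost against latency based on use cases and business needs
• Optimize the physical data layout for common query patterns
• Reprocessing historical data only requires deleting the result table and restarting the stream
• Improve data quality step by step through schema management, data expectations,
and UPDATE/DELETE/MERGE, until the data is ready for analysis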

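Benefits of the Delta architecture
1. Shorter end-to-end pipeline SLAs.
Multiple organizations reduced their data pipeline SLAs from hours to minutes
2. Lower pipeline maintenance cost
No need to introduce a Lambda architecture just to reach minute-level latency
3. Easier handling of data updates and deletes.
Simplifies change data capture, GDPR, sessionization, and deduplication
4. Lower infrastructure cost through separate, elastically scalable compute and storage
Multiple organizations cut their infrastructure cost by more than 10x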

Classic Delta Architecture Case Studies

Improved reliability:
Petabyte-scale jobs
10x lower compute:
640 servers → 64!
Simpler, faster ETL:
84 jobs → 3 jobs
halved data latency

Easier transactional
updates:
No downtime or
consistency issues!
Simple CDC:
Easy with MERGE
Improved performance: Queries run faster
more than one hour → less than six seconds

64
Databricks Cluster
Data consistency and
integrity:
not available before
Increased data quality:
name match accuracy
up 80% → 95%
Faster data loads:
one day → twenty minutes

Instead of parquet …
dataframe
  .write
  .format("parquet")
  .save("/data")

… simply say delta
dataframe
  .write
  .format("delta")
  .save("/data")

Add Spark Package
pyspark --packages io.delta:delta-core_2.12:0.4.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.4.0

Maven
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>0.4.0</version>
</dependency>

Demo
Delta Lake Primer
https://dbricks.co/dlw-01

Delta Lake Roadmap
Release   Features
0.2.0     • Cloud storage support
          • Improved concurrency
0.3.0     • Scala/Java APIs for DML commands
          • Scala/Java APIs for query commit history
          • Scala/Java APIs for vacuuming old files
0.4.0     • Python APIs for DML and utility operations
          • In-place Conversion of Parquet to Delta Lake table
Q4        • Enable Hive support reading Delta tables
          • SQL DML support with Spark 3.0
          • And more

Delta Lake Community
2+ Exabytes of Delta Read/Writes per month
3700+ Orgs using Delta
[Chart: monthly values from March to September; vertical axis from 0 to 20,000]

Join the Delta Lake community
Start building your own Delta Lake today!
https://delta.io/
Slack Channel: delta-users.slack.com
Mailing List: groups.google.com/forum/#!forum/delta-users

Xiao Li (lixiao AT databricks.com)

Unified data analytics platform for accelerating innovation across
data science, data engineering, and business analytics
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners

China
It's a big world, and we are hiring experts globally!
