Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines using Flink and Hudi. We will dive deep into how Flink can leverage the newest features of Hudi, like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
6. Copy-On-Write Table
Example timeline (each commit rewrites the affected base files into new parquet versions):

commit time=0: Insert A, B, C, D, E
  Files written: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
  Snapshot Query: A, B, C, D, E
  Incremental Query: A, B, C, D, E

commit time=1: Update A => A’, D => D’
  Files written: file1_t1.parquet (A’, B), file2_t1.parquet (C, D’)
  Snapshot Query: A’, B, C, D’, E
  Incremental Query: A’, D’

commit time=2: Update A’ => A”, E => E’; Insert F
  Files written: file1_t2.parquet (A”, B), file3_t2.parquet (E’, F)
  Snapshot Query: A”, B, C, D’, E’, F
  Incremental Query: A”, E’, F
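The copy-on-write flow above can be sketched in a few lines of Python (a toy model with assumed class and method names, not Hudi's API): each commit rewrites the affected records in place, so a snapshot query reads the latest state and an incremental query returns only records touched after a given commit time.

```python
# Toy model of Copy-on-Write semantics (illustration only, not Hudi's API):
# every commit rewrites whole file slices, so a snapshot query just reads the
# latest version of each record, and an incremental query returns the records
# changed by commits after a given time.

class CopyOnWriteTable:
    def __init__(self):
        self.records = {}    # key -> (value, commit_time)
        self.commit_time = -1

    def write(self, changes):
        """Apply a batch of upserts as one commit (rewrites affected files)."""
        self.commit_time += 1
        for key, value in changes.items():
            self.records[key] = (value, self.commit_time)

    def snapshot_query(self):
        return sorted(v for v, _ in self.records.values())

    def incremental_query(self, since):
        return sorted(v for v, t in self.records.values() if t > since)

table = CopyOnWriteTable()
table.write({"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"})  # commit 0
table.write({"A": "A'", "D": "D'"})                              # commit 1
table.write({"A": "A''", "E": "E'", "F": "F"})                   # commit 2

print(table.snapshot_query())            # state after commit 2
print(table.incremental_query(since=1))  # only the changes made by commit 2
```

This mirrors the diagram: the snapshot after commit 2 is A”, B, C, D’, E’, F, while the incremental query since commit 1 returns just A”, E’, F.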
7. Merge-On-Read Table
Example timeline (updates go to row-based log files; compaction merges logs into new base parquet files):

commit time=0: Insert A, B, C, D, E
  Files written: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
  Snapshot Query: A, B, C, D, E
  Incremental Query: A, B, C, D, E
  Read Optimized Query: A, B, C, D, E

commit time=1: Update A => A’, D => D’
  Files written: .file1_t1.log (A’), .file2_t1.log (D’)
  Snapshot Query: A’, B, C, D’, E
  Incremental Query: A’, D’
  Read Optimized Query: A, B, C, D, E

commit time=2: Update A’ => A”, E => E’; Insert F
  Files written: .file1_t2.log (A”), .file3_t2.log (E’, F)
  Snapshot Query: A”, B, C, D’, E’, F
  Incremental Query: A”, E’, F
  Read Optimized Query: A, B, C, D, E

commit time=3: Compaction
  Files written: file1_t3.parquet (A”, B), file2_t3.parquet (C, D’), file3_t3.parquet (E’, F)
  Snapshot Query: A”, B, C, D’, E’, F
  Incremental Query: A”, E’, F
  Read Optimized Query: A”, B, C, D’, E’, F
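The merge-on-read flow can be sketched the same way (again a toy model with assumed names, not Hudi's API): updates land in log files, a snapshot query merges base + logs on the fly, a read-optimized query reads only the base files, and compaction folds the logs into new base files.

```python
# Toy model of Merge-on-Read semantics (illustration only, not Hudi's API).

class MergeOnReadTable:
    def __init__(self):
        self.base = {}   # key -> value (columnar base parquet files)
        self.logs = {}   # key -> value (row-based delta logs, latest wins)

    def write(self, changes, first_commit=False):
        # Simplification: the initial insert creates base files; later
        # commits append to log files.
        target = self.base if first_commit else self.logs
        target.update(changes)

    def snapshot_query(self):
        # Merge base files with the pending log records at read time.
        return sorted({**self.base, **self.logs}.values())

    def read_optimized_query(self):
        # Read only the columnar base files; pending log records are ignored.
        return sorted(self.base.values())

    def compact(self):
        # Fold the logs into new base files; logs are now empty.
        self.base.update(self.logs)
        self.logs.clear()

table = MergeOnReadTable()
table.write({"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"}, first_commit=True)
table.write({"A": "A'", "D": "D'"})             # commit 1: goes to logs
table.write({"A": "A''", "E": "E'", "F": "F"})  # commit 2: goes to logs

ro_before = table.read_optimized_query()  # still the commit-0 base data
print(ro_before)
table.compact()                           # commit 3
print(table.read_optimized_query())       # now reflects all updates
```

As in the diagram, the read-optimized view lags at A, B, C, D, E until compaction, after which it matches the snapshot view A”, B, C, D’, E’, F.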
9. Handling Multi-dimensional Data
❏ Simplest clustering algorithm: sort the data by a set of fields f1, f2, …, fn.
❏ Most effective for queries with
❏ f1 as predicate
❏ f1, f2 as predicates
❏ f1, f2, f3 as predicates
❏ …
❏ Effectiveness decreases right to left
❏ e.g., with only f3 as predicate
(Figure: rows sorted lexicographically by fields f1, f2, f3)
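A quick way to see why effectiveness decreases right to left: sort toy rows by (f1, f2, f3), split them into fixed-size "files", and count how many files per-file min/max statistics let a point predicate prune (a sketch with made-up data, not Hudi code).

```python
# Sketch: linear sort helps leading fields most. Sort all rows by (f1, f2, f3),
# split into "files", and count how many files a point predicate must scan
# given per-file min/max column statistics.

from itertools import product

rows = sorted(product(range(4), repeat=3))               # all (f1, f2, f3) in [0, 4)
files = [rows[i:i + 8] for i in range(0, len(rows), 8)]  # 8 files of 8 rows each

def files_scanned(field, value):
    """Files whose [min, max] range on `field` may contain `value`."""
    return sum(
        1 for f in files
        if min(r[field] for r in f) <= value <= max(r[field] for r in f)
    )

print(files_scanned(0, 2))  # predicate on f1: most files pruned
print(files_scanned(1, 2))  # predicate on f2: some pruning
print(files_scanned(2, 2))  # predicate on f3: min/max useless, scan everything
```

With this data, a predicate on f1 scans 2 of 8 files, on f2 it scans 4, and on f3 every file's min/max range covers the value, so all 8 must be scanned.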
10. Space Curves
❏ Basic idea: multi-dimensional ordering/sorting
❏ Map multiple dimensions to a single dimension
❏ About a dozen exist in the literature, spanning a few decades
Z-Order Curves
❏ Interleave the binary representations of the points
❏ The resulting z-value’s order depends on all fields
Hilbert Curves
❏ Better ordering properties in high dimensions
❏ More expensive to build at higher orders
11. Hudi Clustering Goals
Optimize data layout alongside ingestion
❏ Problem 1: faster ingestion -> smaller file sizes
❏ Problem 2: data locality for queries (e.g., by city) ≠ ingestion order (e.g., trips by time)
❏ Auto sizing, data reorg, no compromise on ingestion
12. Hudi Clustering Service
Self-managed table service
❏ Scheduling: identify target data, generate plan in timeline
❏ Running: execute plan with pluggable strategy
❏ Reorg data with linear sorting, Z-order, Hilbert, etc.
❏ “REPLACE” commit in timeline
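As a concrete sketch, inline clustering can be enabled through write configs along these lines (option names follow recent Hudi releases and may vary by version; the sort columns `city,ts` are assumed example fields):

```properties
# Schedule and run clustering inline, every 4 commits
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
# Reorg data by these sort columns; files under the small-file limit
# become clustering candidates, rewritten up to the target size
hoodie.clustering.plan.strategy.sort.columns=city,ts
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```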
14. Metaserver (Coming in 2022)
Interesting fact: Hudi already has a metaserver
- Runs on the Spark driver; serves FileSystem RPCs + queries on the timeline
- Backed by RocksDB (pluggable)
- Updated incrementally on every timeline action
- Very useful in streaming jobs
Data lakes need a new metaserver
- Flat-file metastores are cool? (really?)
- Speed up planning by orders of magnitude
15. Lake Cache (Coming in 2022)
LRU cache, à la a DB buffer pool
Frequent commits => small objects/blocks
- Today: aggressively run table services
- Tomorrow: file group/Hudi file model aware caching
- Mutable data => FileSystem/block-level caches are not that effective
Benefits
- Great performance for CDC tables
- Avoid open/close costs for small objects
16. Engage With Our Community
User Docs: https://hudi.apache.org
Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
Github: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Community Syncs : https://hudi.apache.org/community/syncs