7. Central Low-Latency Lakehouse Platform
[Diagram labels: Apache Kafka; Raw, Cleaned, Derived; Open Formats; CDC Incremental Change Feed; Transactions + Concurrency; Managed Perf Tuning; Auto Catalog Sync; Merge-On-Read Stream Writers; and more. Storage: S3. Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, and many more.]
8. Trailblazer, now Industry Proven
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at PB scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system - at Exabyte scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ Postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
9. The Community
2200+ Slack Members
250+ Contributors
1000+ GH Engagers
20+ Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month (400% YoY)
800B+ Records/Day (from even just 1 customer!)
Rich community of participants
10. Streaming on Cloud Storage
[Diagram: Copy on Write. Writer -> versioned parquet files (v1, v2) -> Reader]
[Diagram: Merge on Read. Writer -> parquet files + change logs (v1) -> Reader; Compaction: v1 -> v2 -> Reader]

                COW                                     MOR
Write Cost      Higher                                  Lower
Data Latency    Slower                                  Faster
Query Speed     Faster                                  Slower before compaction; same after compaction
Overall Cost    Aggressive rewrites with every update   Can amortize compaction with other services
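The COW/MOR trade-off can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not Hudi's implementation: the class names `CowFileGroup` and `MorFileGroup`, the dict-based "files", and the `records_written` counter are all hypothetical, chosen to make write cost and read behavior visible.

```python
from dataclasses import dataclass, field


@dataclass
class CowFileGroup:
    """Toy Copy-on-Write file group: each upsert rewrites the whole base file."""
    base: dict = field(default_factory=dict)
    records_written: int = 0

    def upsert(self, updates: dict) -> None:
        new_base = {**self.base, **updates}     # a full new file version
        self.records_written += len(new_base)   # cost: every record is rewritten
        self.base = new_base

    def snapshot(self) -> dict:
        return dict(self.base)


@dataclass
class MorFileGroup:
    """Toy Merge-on-Read file group: upserts append to a change log."""
    base: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    records_written: int = 0

    def upsert(self, updates: dict) -> None:
        self.log.append(dict(updates))          # cheap append of changed records only
        self.records_written += len(updates)

    def snapshot(self) -> dict:
        merged = dict(self.base)                # merge cost is paid at read time
        for block in self.log:
            merged.update(block)
        return merged

    def read_optimized(self) -> dict:
        return dict(self.base)                  # base file only; may be stale

    def compact(self) -> None:
        self.base = self.snapshot()             # fold the log into a new base version
        self.records_written += len(self.base)
        self.log.clear()


initial = {f"k{i}": 0 for i in range(1000)}
cow = CowFileGroup(base=dict(initial))
mor = MorFileGroup(base=dict(initial))
for version in range(1, 4):
    cow.upsert({"k1": version})
    mor.upsert({"k1": version})
print("COW records written:", cow.records_written)
print("MOR records written:", mor.records_written)
```

In this model both layouts return the same snapshot, but MOR defers the rewrite until `compact()` runs, which is the "amortize compaction" point in the table.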
11. Streaming on Cloud Storage
[Diagram: Merge on Read. Writer -> parquet files + change logs (v1) -> Reader; Compaction: v1 -> v2 -> Reader. Numbered callouts 1-3 refer to the query types below.]
Query Types
1. Snapshot Query - Merge changes and read everything
2. Read-Optimized Query - Read the latest compacted data
3. Incremental Query - Read only data that has changed within an interval
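A hedged sketch of how the three query types might be selected when reading through Hudi's Spark datasource. The option keys below (`hoodie.datasource.query.type`, `hoodie.datasource.read.begin.instanttime`, `hoodie.datasource.read.end.instanttime`) are the names used by the 0.x Spark datasource as far as I know; check them against the configuration reference for your Hudi version. The helper `hudi_read_options` and the instant values are hypothetical.

```python
from typing import Dict, Optional

QUERY_TYPES = ("snapshot", "read_optimized", "incremental")


def hudi_read_options(
    query_type: str,
    begin_instant: Optional[str] = None,
    end_instant: Optional[str] = None,
) -> Dict[str, str]:
    """Build a dict of read options for one of the three query types."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type!r}")
    options = {"hoodie.datasource.query.type": query_type}
    if query_type == "incremental":
        # An incremental query needs the start of the interval; the end is optional.
        if begin_instant is None:
            raise ValueError("incremental queries need a begin instant")
        options["hoodie.datasource.read.begin.instanttime"] = begin_instant
        if end_instant is not None:
            options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


# Hypothetical instant times, for illustration only.
print(hudi_read_options("snapshot"))
print(hudi_read_options("incremental", "20220101000000", "20220102000000"))
```

Assuming a SparkSession with the Hudi bundle on the classpath, the resulting dict would typically be passed as `spark.read.format("hudi").options(**opts).load(base_path)`.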
12. + - Merge On Read Stories
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
100GB/s
Throughput
400+PB
Even just 1 Table
Daily -> Min
Analytics Latency
70%
CPU Savings
(write+read)
300GB/d
Throughput
25+TB
Dataset
Hourly
Analytics Latency
https://www.youtube.com/watch?v=ZamXiT9aqs8
100M+/d
Events
10+TB
Dataset
8h -> 1h
Analytics Latency
https://www.youtube.com/watch?v=Yn8-tPX6Zoo
10min
Analytics Latency
13. Table Services with Streaming Ingestion
● Self managing database runtime
○ Cleaning (committed/uncommitted), archival, clustering, compaction
● Table services know each other
○ Avoid duplicate schedules
○ Skip compacting files being clustered
● Run continuously or scheduled, asynchronously
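The idea that table services know about each other can be sketched as a compaction scheduler that consults pending clustering and compaction plans before picking file groups. This is a toy illustration, not Hudi's scheduler: the function `schedule_compaction`, its parameters, and the file group IDs are hypothetical.

```python
from typing import Dict, List, Set


def schedule_compaction(
    log_file_counts: Dict[str, int],
    pending_clustering: Set[str],
    pending_compaction: Set[str],
    min_log_files: int = 1,
) -> List[str]:
    """Pick file groups to compact, skipping ones other plans already cover."""
    plan = []
    for file_group, log_files in sorted(log_file_counts.items()):
        if file_group in pending_clustering:
            continue  # skip compacting files that are being clustered
        if file_group in pending_compaction:
            continue  # avoid a duplicate schedule for the same file group
        if log_files >= min_log_files:
            plan.append(file_group)
    return plan


counts = {"fg-1": 3, "fg-2": 5, "fg-3": 0, "fg-4": 2}
print(schedule_compaction(counts, pending_clustering={"fg-2"}, pending_compaction={"fg-4"}))
```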
14. Compaction - Optimizing Queries on MOR
● Periodically and asynchronously compact log files to new base files
● Reduces write amplification
● Keeps query performance in check
[Diagram: Before compaction, Latest: parquet files + change logs (v1); Snapshot Query requires Merging. After Compaction, Latest: parquet files only (v1, v2); Snapshot Query.]
15. Clustering - Optimizing Data Layout
○ Faster streaming ingestion -> smaller file sizes
○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion
16. Clustering Service
● Scheduling: identify target data, generate plan in timeline
● Running: execute plan with pluggable strategy
○ Reorg data with linear sorting, Z-order, Hilbert, etc.
○ “REPLACE” commit in timeline
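To make the Z-order layout option concrete, here is a minimal sketch of a two-column Z-order (Morton) key built by interleaving bits. It illustrates the general technique only; it is not Hudi's clustering strategy code, and the function name `z_order_key` and the 16-bit default are assumptions for the example.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one sortable key.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1.
    Both inputs are assumed to be non-negative and smaller than 2**bits.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Sorting by the interleaved key keeps rows that are close in both
# columns near each other, unlike a linear sort on a single column.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
print(sorted(points, key=lambda p: z_order_key(*p)))
```

A linear sort favors the leading sort column, while the interleaved key spreads locality across both columns, which is why it tends to pair well with min/max based file pruning.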
17. Indexes
● Widely employed in database systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve query efficiency
● Hudi’s indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, HBase, etc.
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
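A Bloom filter based index can be sketched as one filter per file, used to narrow down which files might hold an incoming record key. The code below is a simplified illustration of the technique, not Hudi's index: `BloomFilter`, `candidate_files`, the filter sizes, and the file names are hypothetical.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def candidate_files(filters: Dict[str, BloomFilter], key: str) -> List[str]:
    """Return the files whose filter says the key might be present."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


filters = {"file-a": BloomFilter(), "file-b": BloomFilter()}
filters["file-a"].add("trip-001")
filters["file-a"].add("trip-002")
print(candidate_files(filters, "trip-001"))
```

Only the candidate files then need to be opened to confirm whether the key exists, so an upsert can route the record to the right file without scanning the whole table.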
18. Multi-Modal Index - New in Hudi 0.11
● Generalized indexing subsystem in Lakehouse
○ Scale to 10-100x data on the lake
○ Improve reads and queries, not only writes
● Key principles
○ Scalable metadata with MOR metadata table
○ ACID updates with multi-table transaction
○ Fast point lookups
https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi
19. Multi-Modal Index - File Listing
● Improve file listing on cloud storage like S3
○ Direct listing of 100k files across 1000s of partitions hits throttling and I/O bottleneck
○ The files partition in metadata table provides 2-20x speedup of file listing
20. Multi-Modal Index - Data Skipping
● Leverage column stats (min, max, count, etc.) to prune files in a query
○ Reduce unnecessary scans, paired with clustering. Integrated with Flink.
○ 10-30x speedup of needle-in-a-haystack type of queries
Q1a: low specificity, more targeted data/files
Q1b: high specificity, less targeted data/files
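Data skipping with column stats can be sketched as a range-overlap check against per-file min/max values. This is a minimal illustration of the idea, not Hudi's implementation: `ColumnStats`, `prune_files`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnStats:
    """Per-file min/max for one column, as a column stats index might store them."""
    file_name: str
    min_value: int
    max_value: int


def prune_files(stats: List[ColumnStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.file_name for s in stats if s.max_value >= lo and s.min_value <= hi]


# After clustering on the column, each file covers a narrow range.
clustered = [
    ColumnStats("f1", 0, 99),
    ColumnStats("f2", 100, 199),
    ColumnStats("f3", 200, 299),
]
# Without clustering, every file may span the full range.
unclustered = [ColumnStats(name, 0, 299) for name in ("f1", "f2", "f3")]

print(prune_files(clustered, 150, 160))
print(prune_files(unclustered, 150, 160))
```

The same predicate prunes two of three files on the clustered layout and none on the unclustered one, which is why data skipping is paired with clustering.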
28. Metaserver (Coming in 2022)
● Interesting fact: Hudi has a metaserver already
○ Runs on Spark driver; serves FileSystem RPCs + queries on timeline
○ Backed by RocksDB/pluggable
○ Updated incrementally on every timeline action
○ Very useful in streaming jobs
● Data lakes need a new metaserver
○ Flat file metastores are cool? (really?)
○ Speed up planning by orders of magnitude
29. Lake Cache (Coming in 2022)
● LRU Cache à la DB Buffer Pool
● Frequent Commits => Small objects/blocks
○ Today: Aggressively run table services
○ Tomorrow: File Group/Hudi file model aware caching
○ Mutable data => FileSystem/Block level caches are not that effective.
● Benefits
○ Great performance for CDC tables
○ Avoid open/close costs for small objects
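The LRU idea can be sketched as a cache keyed by file group rather than by file system block. Since the slide describes a planned feature, everything here is an assumption for illustration: the class `FileGroupLruCache`, its methods, and the cached values are hypothetical and do not reflect any Hudi API.

```python
from collections import OrderedDict
from typing import Optional


class FileGroupLruCache:
    """Toy LRU cache keyed by file group ID, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()

    def get(self, file_group_id: str) -> Optional[dict]:
        if file_group_id not in self._entries:
            return None
        self._entries.move_to_end(file_group_id)  # mark as most recently used
        return self._entries[file_group_id]

    def put(self, file_group_id: str, merged_records: dict) -> None:
        self._entries[file_group_id] = merged_records
        self._entries.move_to_end(file_group_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)     # evict the least recently used


cache = FileGroupLruCache(capacity=2)
cache.put("fg-a", {"k1": 1})
cache.put("fg-b", {"k2": 2})
cache.get("fg-a")             # touch fg-a so fg-b becomes the eviction candidate
cache.put("fg-c", {"k3": 3})
print(cache.get("fg-b"))
```

Keying on the file group means a new commit to that group can replace one cache entry, instead of invalidating many small block-level entries.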
30. Come Build With The Community!
Docs : https://hudi.apache.org
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1d5zjsfl3-d_TefVaGyvEe16EANrxz6Q
Twitter : https://twitter.com/apachehudi
GitHub : https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)