Apache Flink and Apache Hudi.pdf

How to build a streaming
Lakehouse w/ Flink + Hudi
Ethan Guo + Kyle Weller
Wed, Aug 3 | 1:30 PM

Introductions
Kyle Weller - Head of Product @ Onehouse.ai
https://www.linkedin.com/in/lakehouse/
Ethan Guo - Software Engineer @ Onehouse.ai
https://www.linkedin.com/in/yihua-ethan-guo/

Stream vs Batch fork
PostgreSQL Debezium Apache Kafka
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3
Problems
● Business logic replicated in both pipelines
● Slow batch pipelines always lag
● DevOps overhead to maintain and sync both pipelines
● No updates/deletes on S3

Hudi Lakehouse
S3
Apache Hudi
Topics
● Unify batch and streaming workloads
● Build centralized platform for multiple compute engines
● Unlock concurrency for multiple readers/writers with ACID transactions
● Blazing fast data lake stream ingestion and processing with Hudi Merge-On-Read
● Efficient Upserts/Deletes with indexing and primary keys
● Implement incremental processing for Hudi change streams
PostgreSQL Debezium Apache Kafka

The Hudi Platform
Lake Storage (Cloud Object Stores, HDFS, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling, …)
Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
Table Format (Schema, File listings, Stats, Evolution, …)
Lake Cache (Columnar, transactional, mutable, WIP, …)
Metaserver (Stats, table service coordination, …)
SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)
Transactional Database Layer
Execution/Runtimes

Central Low-Latency Lakehouse Platform
Apache Kafka
Raw, Cleaned, Derived
Open Formats
CDC Incremental Change Feed
Transactions + Concurrency
Managed Perf Tuning
Auto Catalog Sync
Merge-On-Read Stream Writers
S3
Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, + Many More

Trailblazer, now Industry Proven
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at PB scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system - at Exabyte scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar

The Community
2200+ Slack Members
250+ Contributors
1000+ GitHub Engagers
20+ Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M downloads/month (400% YoY)
800B+ Records/Day (from even just 1 customer!)
Rich community of participants

Streaming on Cloud Storage
[Figure: Copy on Write: Writer, Reader, versioned parquet files (v1, v2)]
[Figure: Merge on Read: Writer, Reader, parquet files + change logs (v1), Compaction (v1 to v2)]
             | COW                                  | MOR
Write Cost   | Higher                               | Lower
Data Latency | Slower                               | Faster
Query Speed  | Faster                               | Slower before compaction; same after compaction
Overall Cost | Aggressive rewrites with every update | Can amortize compaction with other services
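
The write-cost row can be made concrete with a toy calculation. The sketch below is not Hudi code: the function names, the "records written" cost unit, and the compaction interval are all assumptions chosen for illustration. It only counts how much data each table type would rewrite for a series of small update batches against one file.

```python
# Toy cost model only: sizes are in records, and nothing here is Hudi API.

def cow_records_written(file_size: int, update_batches: list[int]) -> int:
    """Copy on Write: every non-empty batch rewrites the whole base file."""
    return sum(file_size for batch in update_batches if batch > 0)


def mor_records_written(file_size: int, update_batches: list[int],
                        compact_every: int) -> int:
    """Merge on Read: each batch is appended to a change log; every
    `compact_every` batches, one compaction rewrites the base file."""
    written = 0
    for i, batch in enumerate(update_batches, start=1):
        written += batch              # append the batch to the change log
        if i % compact_every == 0:
            written += file_size      # compaction rewrites the base file once
    return written


batches = [100] * 10                  # ten small update batches
cow = cow_records_written(1_000_000, batches)
mor = mor_records_written(1_000_000, batches, compact_every=5)
print(cow, mor)                       # 10000000 2001000
```

Under these assumptions, MOR writes roughly a fifth of what COW writes, and the compaction cost can be scheduled separately from ingestion.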

Streaming on Cloud Storage
[Figure: Merge on Read: Writer, Reader, parquet files + change logs (v1), Compaction (v1 to v2)]
Query Types
1. Snapshot Query - Merge changes and read everything
2. Read-Optimized Query - Read the latest compacted data
3. Incremental Query - Read only data that has changed between an interval
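
The three query types can be illustrated with a small in-memory model. This is a conceptual sketch, not the Hudi reader: `ToyMorTable`, its fields, and the integer commit times are hypothetical, and the incremental query here only looks at the uncompacted log, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorTable:
    """Hypothetical in-memory stand-in for one Merge-On-Read file group."""
    base: dict = field(default_factory=dict)  # key -> value, last compacted state
    log: list = field(default_factory=list)   # (commit_time, key, value), in commit order

    def upsert(self, commit_time: int, key: str, value: str) -> None:
        self.log.append((commit_time, key, value))

    def snapshot_query(self) -> dict:
        """1. Merge the change log onto the base data and read everything."""
        merged = dict(self.base)
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def read_optimized_query(self) -> dict:
        """2. Read only the latest compacted data; log records are not visible."""
        return dict(self.base)

    def incremental_query(self, begin: int, end: int) -> dict:
        """3. Read only keys changed in the interval (begin, end]."""
        changed = {}
        for commit_time, key, value in self.log:
            if begin < commit_time <= end:
                changed[key] = value
        return changed


table = ToyMorTable(base={"a": "v1", "b": "v1"})
table.upsert(101, "a", "v2")
table.upsert(102, "c", "v1")

print(table.snapshot_query())             # {'a': 'v2', 'b': 'v1', 'c': 'v1'}
print(table.read_optimized_query())       # {'a': 'v1', 'b': 'v1'}
print(table.incremental_query(101, 102))  # {'c': 'v1'}
```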

Merge On Read Stories
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
100GB/s Throughput | 400+PB (Even just 1 Table) | Daily -> Min Analytics Latency | 70% CPU Savings (write+read)
300GB/d Throughput | 25+TB Dataset | Hourly Analytics Latency
https://www.youtube.com/watch?v=ZamXiT9aqs8
100M+/d Events | 10+TB Dataset | 8h -> 1h Analytics Latency
https://www.youtube.com/watch?v=Yn8-tPX6Zoo
10min Analytics Latency

Table Services with Streaming Ingestion
● Self-managing database runtime
○ Cleaning (committed/uncommitted), archival, clustering, compaction
● Table services are aware of each other
○ Avoid duplicate schedules
○ Skip compacting files being clustered
● Run continuously or on a schedule, asynchronously

Compaction - Optimizing Queries on MOR
● Periodically and asynchronously compacts log files into new base files
● Reduces write amplification
● Keeps query performance in check
[Figure: before compaction, the Snapshot Query merges the latest parquet files + change logs (v1); after Compaction (v1 to v2), the Snapshot Query reads the latest parquet files only]
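
The compaction step can be sketched as folding a change log into a new base version. This is a minimal illustration, not Hudi's compactor: the `compact` function, the dictionary-based "base file", and the convention that `None` marks a delete are all assumptions.

```python
def compact(base: dict, log: list) -> tuple[dict, list]:
    """Fold log records (in commit order) into a new base version.

    Returns the new base and an empty log. The input base is left untouched,
    so readers of the old version (v1) are not disturbed while v2 is built.
    """
    new_base = dict(base)
    for _commit_time, key, value in log:
        if value is None:             # assumption: None marks a delete
            new_base.pop(key, None)
        else:
            new_base[key] = value
    return new_base, []


base_v1 = {"a": 1, "b": 2}
change_log = [(101, "a", 10), (102, "b", None), (103, "c", 3)]

base_v2, remaining_log = compact(base_v1, change_log)
print(base_v2)        # {'a': 10, 'c': 3}
print(remaining_log)  # []
```

After this step a snapshot query no longer has to merge anything, which is why query speed on MOR matches COW once compaction has run.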

Clustering - Optimizing Data Layout
○ Faster streaming ingestion -> smaller file sizes
○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion

Clustering Service
● Scheduling: identify target data, generate plan in timeline
● Running: execute plan with pluggable strategy
○ Reorg data with linear sorting, Z-order, Hilbert, etc.
○ “REPLACE” commit in timeline
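
As an illustration of the Z-order strategy named above, the sketch below interleaves the bits of two integer columns (a Morton code) and sorts records by the result, so records close in both columns land near each other. It is a generic textbook construction, not Hudi's clustering implementation; `z_value` and the 4x4 sample grid are made up for the example.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


# Records as (x, y) pairs, e.g. two columns that queries filter on.
records = [(x, y) for x in range(4) for y in range(4)]
clustered = sorted(records, key=lambda r: z_value(r[0], r[1]))

print(clustered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Writing the sorted records out in fixed-size files keeps each file's min/max range narrow on both columns, which is what makes later file pruning effective.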

Indexes
● Widely employed in database systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve query efficiency
● Hudi’s indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, HBase, etc.
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
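
To show how a Bloom filter based index can narrow down where an incoming record key might live, here is a small self-contained sketch. It is not Hudi's index: `ToyBloomFilter`, `candidate_files`, the filter size, and the SHA-256 based hashing are assumptions for illustration only.

```python
import hashlib


class ToyBloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def candidate_files(filters: dict, key: str) -> list:
    """Files whose filter cannot rule the key out; only these need a real lookup."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


# One filter per (hypothetical) data file, built from the keys stored in it.
filters = {"file1": ToyBloomFilter(), "file2": ToyBloomFilter()}
for key in ("k1", "k2"):
    filters["file1"].add(key)
filters["file2"].add("k3")

print(candidate_files(filters, "k1"))  # always includes 'file1'
```

An upsert for key "k1" then only has to check the candidate files to decide whether it is an update or an insert.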

Multi-Modal Index - New in Hudi 0.11
● Generalized indexing subsystem in Lakehouse
○ Scale to 10-100x data on the lake
○ Improve reads and queries, not just writes
● Key principles
○ Scalable metadata with MOR metadata table
○ ACID updates with multi-table transaction
○ Fast point lookups
https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi

Multi-Modal Index - File Listing
● Improve file listing on cloud storage like S3
○ Direct listing of 100k files across 1000s of partitions hits throttling and I/O bottleneck
○ The files partition in metadata table provides 2-20x speedup of file listing

Multi-Modal Index - Data Skipping
● Leverage column stats (min, max, count, etc.) to prune files in a query
○ Reduce unnecessary scans, paired with clustering. Integrated with Flink.
○ 10-30x speedup of needle-in-a-haystack type of queries
Q1a: low specificity, more targeted data/files
Q1b: high specificity, less targeted data/files
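
The pruning idea behind data skipping can be sketched in a few lines. This is a simplified model, not the metadata table: `ColumnStats`, `prune_files`, and the sample file names and ranges are hypothetical, and only min/max statistics on one integer column are considered.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Per-file statistics for one column (hypothetical shape)."""
    file_name: str
    min_value: int
    max_value: int


def prune_files(stats: list[ColumnStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range overlaps the predicate lo <= col <= hi."""
    return [s.file_name for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    ColumnStats("f1.parquet", 0, 99),
    ColumnStats("f2.parquet", 100, 199),
    ColumnStats("f3.parquet", 200, 299),
]

print(prune_files(stats, 150, 160))  # ['f2.parquet']
print(prune_files(stats, 90, 110))   # ['f1.parquet', 'f2.parquet']
```

The narrower each file's range is, the more files a selective query can skip, which is why the slide pairs data skipping with clustering.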

Metaserver (Coming in 2022)
Interesting fact: Hudi has a metaserver already
○ Runs on Spark driver; serves FileSystem RPCs + queries on timeline
○ Backed by RocksDB/pluggable
○ Updated incrementally on every timeline action
○ Very useful in streaming jobs
Data lakes need a new metaserver
○ Flat file metastores are cool? (really?)
○ Speed up planning by orders of magnitude

Lake Cache (Coming in 2022)
LRU Cache à la DB Buffer Pool
Frequent Commits => Small objects/blocks
○ Today: Aggressively run table services
○ Tomorrow: File Group/Hudi file model aware caching
○ Mutable data => FileSystem/Block level caches are not that effective
Benefits
○ Great performance for CDC tables
○ Avoid open/close costs for small objects
Come Build With The Community!
Docs : https://hudi.apache.org
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1d5zjsfl3-d_TefVaGyvEe16EANrxz6Q
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Join Hudi Slack
