Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022

Ethan Guo & Vinoth Chandar
Designing Apache Hudi for
Incremental Processing on
Exabyte-scale Lakehouses

Speakers
• Database Engineer @ Onehouse.ai
• Apache Hudi Committer
• Data, Networking @ Uber
• CEO, Founder @ Onehouse.ai
• PMC Chair @ Apache Hudi
• Data, Infra, Networking, Databases @ Uber
• Kafka, ksqlDB, Streams @ Confluent
• Key Value Stores @ LinkedIn
• CDC, Goldengate @ Oracle
Ethan Guo Vinoth Chandar

Agenda
Rise of the Lakehouse Architecture
Incremental Processing Model
Hudi in Action
Community

Rise Of the Lakehouse
Data Lakes, Transactions, Table Formats

Evolution of Data Infrastructure
On-Prem Data
warehouses
(Traditional
BI/Reporting)
2000s - Hadoop
Data Lakes
(Search/Social)
2014 - Apache Spark
(Data Science)
2016 - Apache Hudi
(Transactional Data Lake)
2017 - Databricks
Delta*
2012 - BigQuery
(Serverless)
2014 - Snowflake
(Decoupling/UX)
2013- Amazon
Redshift
(Cloud)
Warehouse
Lake(house)
*Databricks coined term “Lakehouse”

Lakehouse Architecture
Lakehouses
Cloud Storage
Local
Cache
SQL Exec
Node A Node B Node C
SQL Exec SQL Exec
Query
Engines
Storage
Table Format
Metadata
Txn manager Transaction
layer
Optimizer Optimizer Optimizer
Local
Cache
Local
Cache
Table
Services
Parquet/ORC
Traditional Data Lakes
Cloud Storage
Local
Cache
SQL Exec
Node A Node B Node C
SQL Exec SQL Exec
Query
Engines
Storage
Optimizer Optimizer Optimizer
Local
Cache
Local
Cache
Parquet/ORC/JSON/CSV

How they stack up?
Warehouse Lakehouse
Closed
Built for BI
Fully managed
Expensive as you scale
Open** (conditions apply)
Better ML/DS/AI Support
DIY
Cheaper at scale

S3
AWS Glue
Data Catalog
Metastore
BigQuery
Catalogs
+ Many More
Lakehouse Platform
Apache Kafka
Raw
Tables
Cleaned
Tables
Derived
Tables
Truly Open & Interoperable

The Hudi Platform
Lake Storage
(Cloud Object Stores, HDFS, …)
Open File/Data Formats
(Parquet, HFile, Avro, Orc, …)
Concurrency Control
(OCC, MVCC, Non-blocking, Lock providers,
Scheduling...)
Table Services
(cleaning, compaction, clustering, indexing,
file sizing,...)
Indexes
(Bloom filter, HBase, Bucket index, Hash
based, Lucene..)
Table Format
(Schema, File listings, Stats, Evolution, …)
Lake Cache*
(Columnar, transactional, mutable, WIP,...)
Metaserver*
(Stats, table service coordination,...)
Transactional
Database
Layer
Query Engines
(Spark, Flink, Hive, Presto, Trino, Impala,
Redshift, BigQuery, Snowflake,..)
Platform Services
(Streaming/Batch ingest, various sources,
Catalog sync, Admin CLI, Data Quality,...)
User Interface
Readers
(Snapshot, Time Travel, Incremental, etc)
Writers
(Inserts, Updates, Deletes, Smart Layout
Management, etc)
Programming API

Proven @ Massive Scale
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
100GB/s
Throughput
> 1Exabyte
Even just 1 Table
Daily -> Min
Analytics Latency
70%
CPU Savings
(write+read)
300GB/d
Throughput
25+TB
Datasets
Hourly
Analytics Latency
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-
native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
10,000+
Tables
150+
Source systems
CDC, ETL
Use cases
https://www.uber.com/blog/apache-hudi-graduation/
4000+
Tables
250+PB
Raw + Derived
Daily -> Min
Analytics Latency
800B
Records/Day

Incremental Processing
Looking Beyond Files & Formats

Elephant in the Room : Batch Processing
“In data warehousing, in order to represent a business they
had to actually kind of reinvent event streams in a very slow
way”
https://www.oreilly.com/content/ubers-case-for-
incremental-processing-on-hadoop/;
Vinoth Chandar, 2016

Stream vs Batch Processing Dichotomy
PostgresSQL Debezium Apache Kafka
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3

Stream vs Batch Processing Dichotomy
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3
+ Intelligent, Incremental
+ Fast, Efficient
- Row oriented
- Not scan optimized
+ Scans, Columnar formats
+ Scalable Compute
- In-efficient recompute
- No updates/deletes

Incremental Processing Model
Coined at Uber; 2015
Bring “stream processing” model to
“batch” data
Bridges best of both worlds :
process only new input, with
columnar/scan optimized storage
We needed a state store!

The Missing State Store
Hudi
Table
upsert(records)
at time t
Changes
to table
Changes
from table
incremental_query
(t-1, t)
query at time t
Latest committed records

Case Study : Uber’s Big Data Platform
Truly real-time business;
Poor Ingest performance
Slow, Expensive re-computations
in Batch ETL
Massive data volumes - ~100PB in
2016, maybe an exabyte today
Need minute level latencies

Hudi Feature Highlights
Incremental Reads
Maintains monotonically increasing
commit metadata to provide
incremental queries
Multi-modal Indexes
Bloom, Simple and HBase indexes
to provide faster lookups, updates &
deletes
Streaming Latency
To reduce data latency and write
amplification when ingesting
records in an MOR table with async
compaction
Concurrency Control
Hudi provides OCC between writers,
while providing lock-free, non-
blocking MVCC
Field level upserts
To perform updates, merges and
deletes to the data
Clustering
To reorganize data for improved
query performance & data re-
writing service
Integrations
Works well with Presto, Spark, Flink,
Trino & Hive, Kafka Connect etc.
Timeline Metadata
Time-travel using rewind and
rollback semantics to fix DQ issues
https://bit.ly/hudi-feature-comparison

Hudi Table Types
Compaction
v1
v2
Reader
Writer
versioned parquet files
v1
v2
v1
v2
v1
v2
v1
v2
Reader
Copy on Write
Writer
parquet files + change logs
v1 v1 v1 v1
Reader
Merge on Read
COW MOR
Write Cost Higher Lower
Data Latency Slower Faster
Query Speed Faster Slower before
compaction
Faster after
compaction
Overall Cost Aggressive
rewrites with
every update
Can amortize
compaction with
other services

Hudi Query Types
Compaction
v1
v2
Reader
Writer
parquet files + change logs
v1 v1 v1 v1
Reader
Merge on Read
Query Types
1. Snapshot Query - Merge changes and read everything
2. Read-Optimized Query - Read the latest compacted data
3. Incremental Query - Read only data that has changed within an interval
1
1
2
2
3
3

Optimizing For Large Scale Updates
Challenges
● 10x harder problem than designing formats
● Opens up every database problem in the textbook
● Primary keys, faster metadata changes, consistency between index and data
● Needs fundamentally different concurrency control techniques

● Partial Updates
○ Many databases generate partial updates
○ Supplemental logging is very expensive
● DR Scenarios
○ Databases can be running active-active
○ Need conflict resolution techniques
● Record level merge APIs
○ Support different CDC formats (e.g DMS,
Debezium)
○ Moving towards the newer API (RFC-46)
Merge APIs
Current RecordPayload Interface
New HoodieRecordMerger Interface

● Widely employed in database systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve Query efficiency
● Indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, Hbase, etc.
○ Record level Indexes, Lucene based
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
Indexes

Multi-Modal Index
● Generalized indexing subsystem in Lakehouse
○ Converge metadata + indexes
○ Scale to 10-100x data on the lake
○ Improve read and queries besides writes
○ Asynchronously rebuild new/existing indexes
● Key principles
○ Design for frequent changes
○ MoR metadata table w/ log compaction
○ ACID updates with multi-table transaction
○ Fast point lookups/range scans
https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-
lakehouse-in-apache-hudi

Compaction - Balancing Read/Write Costs
○ TBs of updates against PBs of data
○ Delete/Update patterns often very different
than query patterns
○ GDPR deletes are random
○ Analytics Queries more likely to read
recent data
○ Periodically and asynchronously compact log
files to new base files
○ Reduces write amplification
○ Keep the query performance in check
Latest: parquet files + change logs
v1
Snapshot
Query
Merging
Compaction
v1
v2 Snapshot
Query
Latest: parquet files only

Clustering - Optimizing Data Layout
○ Faster streaming ingestion -> smaller file sizes
○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion

Clustering Service
○ Complete runtime for executing
clustering in tandem with
writers/compaction
○ Scheduling: identify target data,
generate plan in timeline
○ Running: execute plan with pluggable
strategy
○ Reorg data with linear sorting, Z-order,
Hilbert, etc.

Table Services with Continuous Writers
● Self managing database runtime
○ Cleaning (committed/uncommitted), archival,
clustering, compaction
○ Similar to how RocksDB daemons work
● Table services know each other
○ Avoid duplicate schedules
○ Skip compacting files being clustered
● Run continuously or scheduled,
asynchronously

Hudi In Action
Common Incremental Processing Patterns

How to do Streaming Ingest
Hudi
Table
Apache Kafka
Flattening
loc.lon -> loc_lon
loc.lat -> loc_lat
Projection
User-defined
transformation
SELECT
a.loc_lon,
a.loc_lat,
a.name
FROM <SRC> a
HoodieDeltaStreamer tool
Hudi configs (table name, path, type,
record key, ordering field, etc.)
Transformations
Source limit, sync frequency
Kafka source configs
Enable async clustering

CDC Example
Schema registry
Hudi
DeltaStreamer
HoodieDeltaStreamer tool
Postgres Debezium
connector configs
Postgres Debezium
Source and Payload
for CDC
Config for small file
handling
Hive sync configs

Incremental ETL Example
Stream Table
(fact)
(Join)
Dataset 1
(dimension)
Incremental
consumption
(Join)
Dataset 2
(dimension)
Upsert
Joined
Table
Apache Kafka
Hudi Incremental Source
Incr source configs
SQL Transformer for Joins

ETL Load Strategies
Full load
➕ easy to implement e.g. if you need JOINs
➖ expensive, slow
➖ updates to (too) old data are lost
Incremental ETL with Hudi 👍
➕ still easy to implement
➕ efficient
➗ not real-time, but close
Streaming with Flink (and Hudi?)
➖ have to call services or JOIN streams
➕ real-time

Wins Reported by Uber
Accuracy We achieved 100% data accuracy: no updates are lost, even for a
year-old trips.
Efficiency
Process less data on each run. Weekly aggregation with full load:
4-5 hours per run; fact table with incremental: 45 minutes and can
be further improved.
Freshness Potential to bring the freshness SLA of the earnings in Hive from
31 hours down to a couple of hours (work in progress).
This unlocks earnings features closer to real-time
Cheaper In our benchmarking, we’ve found Lakehouse based incremental
ETLs are ~50% cheaper than the old school batch pipelines.

Community
It’s going to take a village

The Community
2400+
Slack Members
320+
Contributors
1200+
GH Engagers
20+
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month
(400% YoY)
800B+
Records/Day
(from even just 1 customer!)
Rich community of participants

Trailblazer, now Industry Proven
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at PB scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system - at Exabyte scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar

Lake House Architecture @ Halodoc: Data Platform 2.0
https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
Incremental, Multi region data lake platform
https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
Unified, batch + streaming data lake on Hudi
https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
Streaming data lake for device data
https://www.youtube.com/watch?v=8Q0kM-emMyo
Near real-time grocery delivery tracking
https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c
Minute level data ingestion to lakehouse
https://www.youtube.com/watch?v=Yn8-tPX6Zoo
Trailblazer, now Industry Proven
Serverless, real-time analytics platform on Hudi
https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/

Metaserver (Coming in Q4 2022)
Interesting fact: Hudi has a metaserver already
○ Runs on Spark driver; Serves FileSystem RPCs +
queries on timeline
○ Backed by rocksDB/pluggable
○ Updated incrementally on every timeline action
○ Very useful in streaming jobs
Data lakes need a new metaserver
○ Flat file metastores are cool? (really?)
○ Speed up planning by orders of magnitude
RFC-36, HUDI-3345: Metaserver for all metadata

Lake Cache (Coming Early 2023)
LRU Cache ala DB Buffer Pool
Frequent Commits => Small objects/blocks
○ Today: Aggressively table services
○ Tomorrow: File Group/Hudi file model aware caching
○ Mutable data => FileSystem/Block level caches are not
that effective.
Benefits
○ Great performance for CDC tables
○ Avoid open/close costs for small objects
Strawman design: Mutable, Transactional caching for Hudi Tables

New CDC Format (Coming in Q4 2022)
Change Data Capture in Hudi table as a source
○ Support record-level CDC logs and queries
○ Debezium-like format : “before” and “after” images
○ Insert: null à inserted row
○ Update: old row à new row
○ Delete: pre-delete row à null
Trade-offs on deducing changelogs
○ Incremental query can already pull changes
○ Compute changelogs on the fly (more read cost)
○ Fully materialized changelogs (more write cost)
RFC-51, HUDI-3478: Support of Change Data Capture (CDC) with Hudi change logs

Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack

Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022

Similar to Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022 (20)

More from HostedbyConfluent

More from HostedbyConfluent (20)

Recently uploaded

Recently uploaded (20)

Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022