"The medallion architecture graduates raw data sitting in operational systems into a set of refined tables in a series of stages, ultimately processing data to serve analytics from gold tables. While there is a deep desire to build this architecture incrementally from streaming data sources like Kafka, it is very challenging with current technologies available on lakehouses; a lot of technologies can’t efficiently update records or efficiently process incremental data without recomputing all the data to serve low-latency tables. Apache Hudi is a transactional data lake platform with full mutability support, including streaming upserts, and provides a powerful incremental processing framework. Apache Hudi powers the largest transactional data lakes in the industry, differentiating on fast upserts and change streams to only process and serve the change records.
To further improve the upsert performance, Hudi now supports a new record-level index that deterministically maps the record key to the file location orders of magnitude faster. As a result, Hudi speeds up computationally expensive MERGE operations even more by avoiding full table scans. On the query side, Hudi now supports database-style change data capture with before, and after images to chain flow of inserts, updates and deletes change records from bronze to silver to gold tables.
In this talk, attendees will walk away with:
- The current challenges of building a medallion architecture at low latency
- How the record index and incremental updates work with Apache Hudi
- How the new Hudi CDC feature unlocks incremental processing on the lake
- How you can efficiently build a medallion architecture by avoiding expensive operations"
A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi
1. A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi
Nadine Farah Ethan Guo
{nadine, ethan}@onehouse.ai
September 27, 2023
3. Session Highlights: Share to Win Hudi Hoodies
Share your highlight from this session to win one of 10 Hudi Hoodies:
- Tag and follow OnehouseHQ on LinkedIn with a post about this session, OR
- Live tweet this session & tag and follow @apachehudi
Hudi Slack Community
Collect your hoodie at the Onehouse booth, 414 expo hall (by the latte/coffee bar area).
16. The Missing State Store

[Diagram] A Hudi table serves as the missing state store: upsert(records) at time t applies changes to the table; incremental_query(t-1, t) pulls the changes from the table; and a query at time t returns the latest committed records. A minimal sketch of the pattern follows.
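To make the pattern concrete, here is a hedged sketch in Spark; the source path, table name, and commit instant are hypothetical placeholders, and only the Hudi option names are taken from the datasource documentation.

// Changes *to* the table: upsert(records) at time t
val changes = spark.read.format("json").load("/incoming/batch") // placeholder source
changes.write.format("hudi").
  option("hoodie.table.name", "state_store"). // placeholder name
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode("append").
  save("/path/to/hudi")

// Changes *from* the table: incremental_query(t-1, t)
val tMinus1 = "20230927000000" // placeholder commit instant
val incremental = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", tMinus1).
  load("/path/to/hudi")

Omitting the end instant reads everything committed after tMinus1, i.e., up to the latest commit.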
17. Proven @ Massive Scale

- 100GB/s throughput; > 1 Exabyte (even just 1 table); analytics latency from daily to minutes; 70% CPU savings (write + read)
  https://www.youtube.com/watch?v=ZamXiT9aqs8
  https://chowdera.com/2022/184/202207030146453436.html
  https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
- 300GB/day throughput; 25+TB datasets; hourly analytics latency
- 10,000+ tables; 150+ source systems; CDC and ETL use cases
  https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
- 4,000+ tables; 250+PB raw + derived; 800B records/day; analytics latency from daily to minutes
  https://www.uber.com/blog/apache-hudi-graduation/
21. Hudi Incr. Processing: Under the hood
● Record-level changes with primary keys -> index lookup, record payload and merging
● Faster metadata changes, consistency between index and data -> metadata management
● Optimize data layout on storage -> small-file handling, table services
● Needs fundamentally different concurrency control techniques -> OCC and MVCC
[Pipeline] Incremental/CDC changes from source -> Pre-process -> Locate records -> Optimize file layout -> Perform upsert -> Write new files -> Update index/metadata -> Commit -> Schedule/run table services -> Incremental/CDC changes from Hudi (a hedged sketch of the table-services step follows)
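As a hedged illustration of the final step, table services can be scheduled and run inline with each commit; df stands for an incoming batch, the table name is a placeholder, and the inline-service option names follow the Hudi configuration docs.

// Hedged sketch: running table services inline after the commit, matching
// the pipeline's "Schedule/run table services" step.
df.write.format("hudi").
  option("hoodie.table.name", "bronze_events"). // placeholder name
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.compact.inline", "true").    // compact MOR log files after commit
  option("hoodie.clustering.inline", "true"). // optionally re-optimize file layout
  mode("append").
  save("/path/to/hudi")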
22. Deep Dive on Record-Level Mutation

Hudi Timeline: t1 -> t2 -> t3 (primary key: uuid; upsert operation; ordering field: ts)

Hudi Table @ t1:
uuid | name  | ts   | balance
1    | Ethan | 1000 | 100
2    | XYZ   | 1000 | 200

Incoming Data 1, upserted @ t2 (insert uuid=3, update uuid=1):
uuid | name   | ts   | balance | is_delete
3    | Nadine | 4000 | 100     | false
1    | Ethan  | 5000 | 60      | false

Hudi Table @ t2:
uuid | name   | ts   | balance
1    | Ethan  | 5000 | 60
2    | XYZ    | 1000 | 200
3    | Nadine | 4000 | 100

Incoming Data 2, upserted @ t3 (delete uuid=2, late update of uuid=1):
uuid | name  | ts   | balance | is_delete
2    | XYZ   | 6000 | null    | true
1    | Ethan | 2000 | 80      | false

● Payload and merge API for customized upserts; built-in support for event-time ordering
● Auto primary key generation for log ingestion (upcoming 0.14.0 release)

A sketch of the t3 upsert follows.
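Here is a hedged sketch of the t3 upsert in Spark, assuming Hudi's soft-delete convention (a boolean _hoodie_is_deleted column standing in for the slide's is_delete flag) and the event-time payload, so the late Ethan update (ts=2000 < 5000) is dropped at merge time; the table name and path are placeholders.

import spark.implicits._

// Incoming Data 2: a delete and a late-arriving update
val incomingData2 = Seq(
  ("2", "XYZ",   6000L, Option.empty[Double], true),  // delete uuid=2
  ("1", "Ethan", 2000L, Some(80.0),           false)  // stale update, ignored by ts ordering
).toDF("uuid", "name", "ts", "balance", "_hoodie_is_deleted")

incomingData2.write.format("hudi").
  option("hoodie.table.name", "accounts"). // placeholder name
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload"). // event-time merging
  mode("append").
  save("/path/to/hudi")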
23. Incremental Processing with CDC Feature

Hudi Timeline: t1 -> t2 -> t3

Hudi Table @ t1:
uuid | name  | ts   | balance
1    | Ethan | 1000 | 100
2    | XYZ   | 1000 | 200

Hudi Table @ t2:
uuid | name   | ts   | balance
1    | Ethan  | 5000 | 60
2    | XYZ    | 1000 | 200
3    | Nadine | 4000 | 100

With "hoodie.table.cdc.enabled=true", Hudi produces Debezium-like change logs with before and after images:

op | ts | before                                                  | after
i  | t2 | null                                                    | {"uuid":"3","name":"Nadine","ts":"4000","balance":"100"}
u  | t2 | {"uuid":"1","name":"Ethan","ts":"1000","balance":"100"} | {"uuid":"1","name":"Ethan","ts":"5000","balance":"60"}
d  | t3 | {"uuid":"2","name":"XYZ","ts":"1000","balance":"200"}   | null

spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", t1).
  option("hoodie.datasource.read.end.instanttime", t3).
  load("/path/to/hudi")

(New in 0.13.0 release)
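Downstream (silver) jobs can branch on the op column of this CDC frame; a hedged sketch, with the instants as placeholder strings and the column names as shown in the change-log table above:

import org.apache.spark.sql.functions.col

val t1 = "20230927000000" // placeholder begin instant
val t3 = "20230927020000" // placeholder end instant
val cdc = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", t1).
  option("hoodie.datasource.read.end.instanttime", t3).
  load("/path/to/hudi")

// Apply inserts/updates from the after image; apply deletes from the before image
val upserts = cdc.filter(col("op").isin("i", "u")).select("after")
val deletes = cdc.filter(col("op") === "d").select("before")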
25. Indexes: Locating Records Efficiently

● Widely employed in DB systems
  ○ Locate information quickly
  ○ Reduce I/O cost
  ○ Improve query efficiency
● Indexing provides fast upserts
  ○ Locate records for incoming writes
  ○ Bloom filter based, Simple, HBase, etc.

https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
26. Existing Indexes in Hudi

● Simple Index
  ○ Simply read keys and locations from the table
  ○ Best for random updates and deletes
● Bloom Index
  ○ Prune data files by bloom filters and key ranges
  ○ Best for late-arriving updates and dedup
● HBase Index
  ○ Look up the key-to-location mapping in an external HBase table
  ○ Best for large-scale datasets (10s of TB to PB)
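The writer selects among these via hoodie.index.type; a minimal hedged sketch (df and the table name are placeholders; the option values follow the Hudi index configuration docs):

// Hedged sketch: choosing an existing index type for upserts.
df.write.format("hudi").
  option("hoodie.table.name", "events"). // placeholder name
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.index.type", "BLOOM"). // or "SIMPLE", "HBASE"
  mode("append").
  save("/path/to/hudi")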
27. Challenges for Large Datasets

● Simple Index
  ○ Reads keys from all files
● Bloom Index
  ○ Reads all bloom filters
  ○ Reads keys after file pruning to avoid false positives
● HBase Index
  ○ Maintains a key-to-location mapping for every record

Reading data and metadata per file is expensive, particularly on cloud storage, which enforces rate limiting on I/O. The HBase index avoids that cost but requires maintaining an HBase cluster, which is operationally difficult. Can a new index address both challenges?
28. Record-Level Index (RLI) Design

● Key-to-location mapping in table-level metadata
  ○ A new partition, "record_index", in the metadata table (MDT)
  ○ Stored in a few file groups instead of all data files
● Efficient key-to-location entry as payload
  ○ Random UUID key and datestr partition: 50-55 bytes per record in the MDT
● Fast index update and lookup
  ○ The MDT, an internal Hudi MOR table, enables uniformly fast updates
  ○ The HFile format enables fast point lookups

A sketch of enabling RLI follows.
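A hedged sketch of turning RLI on for a writer; the option names are as documented for the 0.14.0 release, so treat them as assumptions until verified against your version:

// Hedged sketch: enabling the Record-Level Index on the write path.
df.write.format("hudi").
  option("hoodie.table.name", "customers"). // placeholder name
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.metadata.record.index.enable", "true"). // build RLI in the MDT
  option("hoodie.index.type", "RECORD_INDEX").           // use RLI for lookups
  mode("append").
  save("/path/to/hudi")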
29. Record-Level Index on Storage

[Diagram] The "record_index" partition of the metadata table is laid out as N file groups (File Group 0 ... FG N-1); a record key is routed to a file group ID by its hash. Each file group holds file slices (File Slice t0, FS t1, ...): a base HFile plus log files (Log File 1, Log File 2), which compaction merges into a new HFile. Each HFile data block (between header and footer) stores key-to-location entries such as:

record_key 0 -> partition 1, file 1
record_key 1 -> partition 1, file 1
record_key 2 -> partition 2, file 3
record_key 3 -> partition 1, file 2
record_key 6 -> partition 1, file 5
record_key 7 -> partition 1, file 1
...
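The hash routing is the key property: every lookup touches exactly one file group. An illustrative sketch (not Hudi's actual hashing code):

// Illustrative only: stable hash routing of a record key to one of N file
// groups, as in the diagram's "File Group ID by the hash".
def fileGroupFor(recordKey: String, numFileGroups: Int): Int =
  Math.floorMod(recordKey.hashCode, numFileGroups)

// Every lookup or update for the same key lands in the same file group,
// so a point lookup reads a single HFile rather than scanning all entries.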
30. Performance Benefit from RLI

● Improves index lookup and write latency
  ○ 1TB dataset, 200MB batch, random updates
  ○ 17x speedup on index lookup, 2x on write
● Reduces SQL latency with point lookups
  ○ TPC-DS 10TB dataset, store_sales table
  ○ 2-3x improvement compared to no RLI, e.g.:
    SELECT * FROM table WHERE key = 'val'
    DELETE FROM table WHERE key = 'val'

Record-Level Index will be available in the upcoming Hudi 0.14.0 release.
35. Customer-360: "Clickstream" Schema

Field       | Description
click_id    | Unique identifier for each click
customer_id | Customer table reference
session_id  | User session ID
url         | URL the user clicked on
timestamp   | Timestamp of the click
36. Customer-360: "Purchase" Schema

Field          | Description
purchase_id    | Unique identifier for the purchase
customer_id    | Unique identifier for the customer
product_id     | Unique identifier for the product
quantity       | Number of products purchased
purchase_price | Total price of the products
purchase_date  | Purchase timestamp
payment_method | Customer's payment method
order_status   | Delivered, in-route, etc.
37. Customer-360: "Cart Activity" Schema

Field         | Description
activity_id   | Unique identifier for the activity
customer_id   | Unique identifier for the customer
product_id    | Unique identifier for the product
timestamp     | Activity timestamp
activity-type | Type of activity (items added, removed, etc.)
quantity      | Number of items the customer added/removed
cart-status   | Active or abandoned cart
38. Customer-360: "Customer" Schema

Field       | Description
customer_id | Unique identifier for the customer
first_name  | Customer's first name
last_name   | Customer's last name
email       | Customer's email
signup_date | Account creation date
last_login  | Most recent login date
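As a hedged sketch, one of these schemas could be materialized as a Hudi table via Spark SQL; the table properties follow Hudi's Spark SQL syntax, and the choice of last_login as the preCombine field is an assumption for illustration:

// Hedged sketch: creating the Customer-360 "customers" table as a Hudi table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS customers (
    customer_id STRING,
    first_name  STRING,
    last_name   STRING,
    email       STRING,
    signup_date TIMESTAMP,
    last_login  TIMESTAMP
  ) USING hudi
  TBLPROPERTIES (
    type = 'mor',                      -- merge-on-read for frequent updates
    primaryKey = 'customer_id',
    preCombineField = 'last_login'
  )
""")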
41. Correlate User’s Activity with Purchases
SELECT
c.first_name,
c.last_name,
cs.url AS clicked_url,
cs.timestamp AS click_timestamp,
p.product_id AS purchased_product,
p.purchase_date
FROM customers c
-- Joining clickstream data
LEFT JOIN clickstream cs ON c.customer_id = cs.customer_id
-- Joining purchase data
LEFT JOIN purchases p ON c.customer_id = p.customer_id
WHERE cs.timestamp > '2023-01-01' AND p.purchase_date > '2023-01-01'
ORDER BY c.last_name, cs.timestamp DESC, p.purchase_date DESC;
43. What’s Next in Apache Hudi
● Hudi 0.14.0 release will be out soon
○ Record-Level Index to speed up index lookup and upsert performance
○ Auto-generated keys for use cases without user-provided primary key field
○ New MOR reader in Spark to boost query performance
● Hudi 1.x (RFC-69)
  ○ Re-imagination of Hudi as the transactional database for the lake
  ○ Storage format changes to unlock long timeline retention and non-blocking concurrency control
  ○ Enhancements to indexing and performance; better abstractions and APIs for engine integration
44. Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : Apache Hudi Slack Group
Twitter : https://twitter.com/apachehudi
GitHub: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
45. Thanks!
Questions?
A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi
Join Hudi Slack
46. Challenges with Lakehouse Technologies

Context:
❏ Append-only; no support for upserts & deletes

Problems:
❏ No indexing -> full table scans
❏ Inconsistent view of the data lake
❏ No record modifications