A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Nadine Farah Ethan Guo
{nadine, ethan}@onehouse.ai
September 27, 2023
Speaker Bio

Nadine Farah
❏ Dev Rel @ Onehouse
❏ Contributor @ Apache Hudi
❏ Former @ Rockset, Bose
in/nadinefarah/ | @nfarah86

Ethan Guo
❏ Software Engineer @ Onehouse
❏ PMC @ Apache Hudi
❏ Data, Networking @ Uber
in/yihua-ethan-guo/
Share your highlight from this session to win
one of 10 Hudi Hoodies
- Tag and follow OnehouseHQ on LinkedIn with a post about this session
OR
- Live tweet this session & tag and follow
@apachehudi
Session Highlights: Share to Win Hudi Hoodies
Hudi Slack Community
Collect your hoodie at the Onehouse booth, 414 expo hall (by the latte/coffee bar area)
The Medallion Architecture
Overview
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
Bottlenecks that Cause the Challenges
But … what if you can simplify
the medallion architecture?
Simplify the Medallion Architecture with
Apache Hudi
Apache Hudi Overview
Compute-efficient Architecture with Hudi
Open & Interoperable Lakehouse Platform
[Diagram] Apache Kafka feeds the Lakehouse Platform, which manages Raw Tables, Cleaned Tables, and Derived Tables on S3 and syncs them to catalogs: AWS Glue Data Catalog, Metastore, BigQuery, + many more.
Hudi Table Deep-Dive
The Missing State Store
[Diagram] A Hudi Table acts as the state store: upsert(records) at time t applies changes to the table; incremental_query(t-1, t) reads the changes from the table; a query at time t returns the latest committed records.
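Here is a minimal PySpark sketch of that pattern, assuming a SparkSession with the Hudi Spark bundle on the classpath; the table path, schema, and instant time are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-state-store-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "s3a://bucket/tables/accounts"  # hypothetical table location

# upsert(records) at time t
incoming = spark.createDataFrame(
    [("1", "Ethan", 5000, 60), ("3", "Nadine", 4000, 100)],
    ["uuid", "name", "ts", "balance"])

(incoming.write.format("hudi")
    .option("hoodie.table.name", "accounts")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))

# query at time t: snapshot of the latest committed records
snapshot = spark.read.format("hudi").load(base_path)

# incremental_query(t-1, t): only the records changed after instant t-1
changes = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", "20230927000000")  # hypothetical instant t-1
           .load(base_path))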
Proven @ Massive Scale
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
- 100 GB/s throughput; > 1 exabyte (even in just one table); analytics latency from daily to minutes; 70% CPU savings (write + read)
- 300 GB/day throughput; 25+ TB datasets; hourly analytics latency
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
- 10,000+ tables; 150+ source systems; CDC and ETL use cases
https://www.uber.com/blog/apache-hudi-graduation/
- 4,000+ tables; 250+ PB raw + derived; analytics latency from daily to minutes; 800B records/day
Incremental Data Processing
in the Medallion Architecture
Incremental Processing with Apache Hudi
[Diagram: Hudi Streamer]
Hudi Streamer: E2E Incremental Processing
[Diagram] Bronze: Apache Kafka events are bulk-inserted and PostgreSQL changes captured by Debezium are upserted, via Hudi Streamer, into a Raw Table (fact). Silver: another Hudi Streamer job applies a user-defined transformation (a projection such as SELECT a.loc.lon AS loc_lon, a.loc.lat AS loc_lat, a.name FROM <SRC> a) against a schema registry and upserts into a Clean Table (fact). Gold: the Clean Table is joined with Dataset 1 and Dataset 2 (dimensions), and the result is upserted into a Summary Table.
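As a rough illustration of the silver-layer step (not the Hudi Streamer utility itself, which runs via spark-submit), the same projection can be sketched in PySpark as an incremental read, a SQL transform, and an upsert. It assumes the SparkSession from the earlier sketch; the paths, record key, and ordering field are hypothetical, and ts is added to the projection purely as an ordering column:

raw_path = "s3a://bucket/bronze/raw_events"      # hypothetical bronze table
clean_path = "s3a://bucket/silver/clean_events"  # hypothetical silver table

# Incremental read: only records written to the raw table since the last run
raw = (spark.read.format("hudi")
       .option("hoodie.datasource.query.type", "incremental")
       .option("hoodie.datasource.read.begin.instanttime", "0")  # normally the last processed instant
       .load(raw_path))

# The <SRC> placeholder from the slide becomes a temp view here
raw.createOrReplaceTempView("SRC")
clean = spark.sql("""
    SELECT a.loc.lon AS loc_lon,
           a.loc.lat AS loc_lat,
           a.name,
           a.ts
    FROM SRC a
""")

(clean.write.format("hudi")
    .option("hoodie.table.name", "clean_events")
    .option("hoodie.datasource.write.recordkey.field", "name")   # hypothetical primary key
    .option("hoodie.datasource.write.precombine.field", "ts")    # hypothetical ordering field
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(clean_path))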
Hudi Incr. Processing: Under the hood
● Record-level changes with primary keys -> index lookup, record payload and merging
● Faster metadata changes, consistency between index and data -> metadata management
● Optimize data layout on storage -> small-file handling, table services
● Needs fundamentally different concurrency control techniques -> OCC and MVCC
[Pipeline] Incremental / CDC changes from source -> Pre-process -> Locate records -> Optimize file layout -> Perform upsert -> Write new files -> Update index / metadata -> Commit -> Schedule/run table services -> Incremental / CDC changes from Hudi
Deep Dive on Record-Level Mutation

Hudi Table at t1 (Primary Key: uuid):
uuid | name  | ts   | balance
1    | Ethan | 1000 | 100
2    | XYZ   | 1000 | 200

Incoming Data 1 (insert of Nadine, update of Ethan), upserted at t2:
uuid | name   | ts   | balance | is_delete
3    | Nadine | 4000 | 100     | false
1    | Ethan  | 5000 | 60      | false

Hudi Table at t2:
uuid | name   | ts   | balance
1    | Ethan  | 5000 | 60
2    | XYZ    | 1000 | 200
3    | Nadine | 4000 | 100

Incoming Data 2 (update of Ethan, delete of XYZ), upserted at t3:
uuid | name  | ts   | balance | is_delete
2    | XYZ   | 6000 | null    | true
1    | Ethan | 2000 | 80      | false

Hudi Timeline: t1, t2, t3 (Upsert operation; Ordering Field: ts)

● Payload and merge API for customized upserts; built-in support for event-time ordering
● Auto primary key generation for log ingestion (upcoming 0.14.0 release)
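A hedged PySpark sketch of the same upsert pattern (field names follow the slide; the table path and payload choice are illustrative): records are keyed on uuid, ordered by ts, and deletes are signaled by a boolean column. Hudi's built-in convention for this is a _hoodie_is_deleted column, so the slide's is_delete flag is renamed accordingly:

from pyspark.sql import functions as F

base_path = "s3a://bucket/tables/accounts"  # hypothetical table location

incoming_2 = spark.createDataFrame(
    [("2", "XYZ", 6000, None, True),
     ("1", "Ethan", 2000, 80, False)],
    ["uuid", "name", "ts", "balance", "is_delete"])

(incoming_2
    # Built-in payloads honor a boolean _hoodie_is_deleted column for deletes
    .withColumn("_hoodie_is_deleted", F.col("is_delete")).drop("is_delete")
    .write.format("hudi")
    .option("hoodie.table.name", "accounts")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    # Event-time ordering against what is already stored: on a key collision the
    # record with the larger ts wins, so the stale Ethan update (ts=2000 < 5000) is dropped
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.payload.class",
            "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))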
Incremental Processing with CDC Feature

Hudi Table at t1:
uuid | name  | ts   | balance
1    | Ethan | 1000 | 100
2    | XYZ   | 1000 | 200

Hudi Table at t2:
uuid | name   | ts   | balance
1    | Ethan  | 5000 | 60
2    | XYZ    | 1000 | 200
3    | Nadine | 4000 | 100

Hudi Timeline: t1, t2, t3

Debezium-like change logs with before and after images, enabled with "hoodie.table.cdc.enabled=true":

op | ts | before | after
i  | t2 | null | {"uuid":"3","name":"Nadine","ts":"4000","balance":"100"}
u  | t2 | {"uuid":"1","name":"Ethan","ts":"1000","balance":"100"} | {"uuid":"1","name":"Ethan","ts":"5000","balance":"60"}
d  | t3 | {"uuid":"2","name":"XYZ","ts":"1000","balance":"200"} | null

spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", t1).
  option("hoodie.datasource.read.end.instanttime", t3).
  load("/path/to/hudi")

(New in 0.13.0 release)
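On the write side, CDC logging is enabled as a table property at write time; a minimal PySpark sketch, with the same hypothetical table and fields as above and df standing in for a DataFrame of incoming changes:

# Enable CDC logging so incremental reads can return before/after images
(df.write.format("hudi")
    .option("hoodie.table.name", "accounts")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.table.cdc.enabled", "true")
    # Assumed supplemental logging mode: persist full before/after images
    .option("hoodie.table.cdc.supplemental.logging.mode", "data_before_after")
    .mode("append")
    .save("/path/to/hudi"))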
Speed up UPSERT Operations
with Record-Level Index
Indexes: Locating Records Efficiently
● Widely employed in DB systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve query efficiency
● Indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, HBase, etc.
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
Existing Indexes in Hudi
● Simple Index
○ Simply read keys and location from table
○ Best for random updates and deletes
● Bloom Index
○ Prune data files by bloom filters and key ranges
○ Best for late arriving updates and dedup
● HBase Index
○ Look up key-to-location mapping in an external
HBase table
○ Best for large-scale datasets (10s of TB to PB)
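The index type is selected through a write config; a hedged PySpark sketch of choosing one of these existing indexes (table path and fields are hypothetical):

# hoodie.index.type accepts SIMPLE, GLOBAL_SIMPLE, BLOOM, GLOBAL_BLOOM, HBASE, among others
(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.index.type", "BLOOM")  # prune candidate files by bloom filters and key ranges
    .mode("append")
    .save("s3a://bucket/tables/events"))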
Challenges for Large Datasets
● Simple Index
○ Read keys from all files
● Bloom Index
○ Read all bloom filters
○ Read keys after file pruning to avoid false
positives
● HBase Index
○ Key-to-location mapping for every record
Reading data and metadata per file is expensive, particularly on cloud storage, which enforces rate limiting on I/O.
HBase cluster maintenance is required and operationally difficult.
A new index to address both challenges?
Record-Level Index (RLI) Design
● Key-to-location mapping in table-level metadata
○ A new partition, "record_index", in the metadata table
○ Stored in a few file groups instead of all data files
● Efficient key-to-location entry as payload
○ Random UUID key and datestr partition: 50-55 B per record in MDT
● Fast index update and lookup
○ MDT, an internal Hudi MOR table, enables uniform, fast updates
○ HFile format enables fast point lookup
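A hedged sketch of enabling the record-level index for writes, using the 0.14.0 config names (table path and fields are hypothetical):

(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.record.index.enable", "true")  # build the record_index partition in the MDT
    .option("hoodie.index.type", "RECORD_INDEX")             # use it to locate records during upserts
    .mode("append")
    .save("s3a://bucket/tables/events"))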
Record-Level Index on Storage
[Diagram] The "record_index" partition of the metadata table consists of a fixed set of file groups (File Group 0, FG 1, …, FG N-1); each record key is hashed to a file group ID. A file group holds file slices (File Slice t0, FS t1, …) made of an HFile base file plus log files (Log File 1, Log File 2, …); compaction merges them into a new HFile (header, HFile data blocks, footer). The entries are key-to-location mappings, for example:
record_key 0 -> partition 1, file 1
record_key 1 -> partition 1, file 1
record_key 2 -> partition 2, file 3
record_key 3 -> partition 1, file 2
…
Performance Benefit from RLI
● Improves index lookup and write latency
○ 1TB dataset, 200MB batch, random updates
○ 17x speedup on index lookup, 2x on write
● Reduces SQL latency with point lookups
○ TPC-DS 10TB datasets, store_sales table
○ 2-3x improvement compared to no RLI
○ e.g. SELECT * FROM table WHERE key = 'val' and DELETE FROM table WHERE key = 'val'
Record-Level Index will be available in the upcoming Hudi 0.14.0 release.
Case Walkthrough
Customer-360 Walkthrough
Customer-360 Architecture
Customer-360: Bronze layer
Customer-360: “Clickstream” Schema
Field Description
click_id Unique identifier for each click
customer_id Customer table reference
session_id User session id
url URL the user clicked on
timestamp Timestamp of click
Customer-360: “Purchase” Schema
Field Description
purchase_id Unique identifier for purchase
customer_id Unique identifier for customer
product_id Unique identifier for product
quantity Number of products purchased
purchase_price Total price of the purchase
purchase_date Timestamp of the purchase
payment_method Customer’s payment method
order_status Delivered, in-route, etc.
Customer-360: “Cart Activity” Schema
Field Description
activity_id Unique identifier for activity
customer_id Unique identifier for customer
product_id Unique identifier for product
timestamp Activity timestamp
activity-type Type of activity (items added, removed, etc.)
quantity Number of items the customer added/removed
cart-status Active/abandoned cart
Customer-360: “Customer” Schema
Field Description
customer_id Unique identifier for customer
first_name Customer’s first name
last_name Customer’s last name
email Customer’s email
signup_date Account creation date
last_login Most recent login date
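To ground the bronze layer, here is a hedged PySpark Structured Streaming sketch (broker, topic, paths, and checkpoint location are hypothetical) that ingests the clickstream events from Kafka into a raw Hudi table keyed on click_id and ordered by timestamp:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

clickstream_schema = StructType([
    StructField("click_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("session_id", StringType()),
    StructField("url", StringType()),
    StructField("timestamp", LongType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "clickstream")                 # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), clickstream_schema).alias("e"))
          .select("e.*"))

(events.writeStream.format("hudi")
    .option("hoodie.table.name", "bronze_clickstream")
    .option("hoodie.datasource.write.recordkey.field", "click_id")
    .option("hoodie.datasource.write.precombine.field", "timestamp")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("checkpointLocation", "s3a://bucket/checkpoints/bronze_clickstream")
    .outputMode("append")
    .start("s3a://bucket/bronze/clickstream"))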
Customer-360: Silver layer
Customer-360: Gold layer
Correlate User’s Activity with Purchases
SELECT
c.first_name,
c.last_name,
cs.url AS clicked_url,
cs.timestamp AS click_timestamp,
p.product_id AS purchased_product,
p.purchase_date
FROM customers c
-- Joining clickstream data
LEFT JOIN clickstream cs ON c.customer_id = cs.customer_id
-- Joining purchase data
LEFT JOIN purchases p ON c.customer_id = p.customer_id
WHERE cs.timestamp > '2023-01-01' AND p.purchase_date > '2023-01-01'
ORDER BY c.last_name, cs.timestamp DESC, p.purchase_date DESC;
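The result of this join would typically be upserted into the gold summary table; a hedged sketch, where correlation_query holds the SQL above as a Python string and the composite key choice is hypothetical:

gold_df = spark.sql(correlation_query)  # the query shown above

(gold_df.write.format("hudi")
    .option("hoodie.table.name", "gold_customer_activity")
    # hypothetical composite key: one row per customer/click/purchase combination
    .option("hoodie.datasource.write.recordkey.field", "last_name,click_timestamp,purchased_product")
    .option("hoodie.datasource.write.precombine.field", "click_timestamp")
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.ComplexKeyGenerator")  # required for multi-field record keys
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://bucket/gold/customer_activity"))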
Customer-360: Analytics
What’s Next in Apache Hudi
● Hudi 0.14.0 release will be out soon
○ Record-Level Index to speed up index lookup and upsert performance
○ Auto-generated keys for use cases without user-provided primary key field
○ New MOR reader in Spark to boost query performance
● Hudi 1.x (RFC-69)
○ Re-imagination of Hudi, the transactional database for the lake
○ Storage format changes to unlock long retention of timeline, non-blocking
concurrency control
○ Enhancements to indexing and performance, and better abstractions and APIs for
engine integration
Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : Apache Hudi Slack Group
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Join Hudi Slack
Challenges with Lakehouse Technologies
Context
❏ Append-only; no support for
upserts & deletes
Problems
❏ No indexing -> Full table scans
❏ Inconsistent view of the data lake
❏ No record modifications
Challenges in the Medallion Architecture
Open & Interoperable Lakehouse Platform
Lake Storage (Cloud Object Stores, HDFS, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Transactional Database Layer:
- Table Format (Schema, File listings, Stats, Evolution, …)
- Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
- Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, …)
- Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
- Lake Cache* (Columnar, transactional, mutable, WIP, …)
- Metaserver* (Stats, table service coordination, …)
Programming API:
- Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
- Readers (Snapshot, Time Travel, Incremental, etc.)
User Interface:
- Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)
Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)