How to build a streaming Lakehouse w/ Flink + Hudi
Ethan Guo + Kyle Weller
Wed, Aug 3 | 1:30 PM
Introductions
Kyle Weller - Head of Product @ Onehouse.ai
https://www.linkedin.com/in/lakehouse/
Ethan Guo - Software Engineer @ Onehouse.ai
https://www.linkedin.com/in/yihua-ethan-guo/
What is a Lakehouse?
PostgreSQL Debezium Apache Kafka
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3
Problems
● Replicate business logic
● Slow batch pipes always lag
● Devops to maintain and sync
● No updates/deletes on S3
Stream vs Batch fork
Hudi Lakehouse
S3
Apache Hudi
Topics
● Unify batch and streaming workloads
● Build centralized platform for multiple compute engines
● Unlock concurrency for multiple readers/writers with ACID transactions
● Blazing fast data lake stream ingestion and processing with Hudi Merge-On-Read
● Efficient Upserts/Deletes with indexing and primary keys
● Implement incremental processing for Hudi change streams
PostgreSQL Debezium Apache Kafka
The Hudi Platform
● Lake Storage (Cloud Object Stores, HDFS, ...)
● Open File/Data Formats (Parquet, HFile, Avro, ORC, ...)
● Transactional Database Layer
○ Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling, ...)
○ Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
○ Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, ...)
○ Table Format (Schema, File listings, Stats, Evolution, ...)
○ Lake Cache (Columnar, transactional, mutable, WIP, ...)
○ Metaserver (Stats, table service coordination, ...)
● Execution/Runtimes
○ SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, ...)
○ Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, ...)
Central Low-Latency Lakehouse Platform
● Apache Kafka
● Raw, Cleaned, Derived
● Open Formats
● CDC Incremental Change Feed
● Transactions + Concurrency
● Managed Perf Tuning
● Auto Catalog Sync
● Merge-On-Read Stream Writers
● +++ More
● S3
● Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, + Many More
Trailblazer, now Industry Proven
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at PB scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system - at Exabyte scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
The Community
2200+
Slack Members
250+
Contributors
1000+
GH Engagers
20+
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month
(400% YoY)
800B+
Records/Day
(from even just 1 customer!)
Rich community of participants
Streaming on Cloud Storage
[Diagram] Copy on Write: Writer -> versioned parquet files (v1, v2) -> Reader
[Diagram] Merge on Read: Writer -> parquet files + change logs (v1) -> Compaction -> v2 -> Reader
                COW                                     MOR
Write Cost      Higher                                  Lower
Data Latency    Slower                                  Faster
Query Speed     Faster                                  Slower before compaction, same after compaction
Overall Cost    Aggressive rewrites with every update   Can amortize compaction with other services
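The write-cost row of the comparison can be illustrated with a toy model. This is a minimal sketch, not Hudi's implementation: `FileGroup`, `upsert_cow`, `upsert_mor`, and `snapshot_read` are hypothetical names, a dict stands in for a parquet base file, and a list stands in for the change log.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy file group: one base file plus a list of logged updates."""
    base: dict                                # record key -> value (stands in for a parquet file)
    log: list = field(default_factory=list)   # appended (key, value) updates


def upsert_cow(group: FileGroup, updates: dict) -> int:
    """Copy-on-Write: rewrite the whole base file; return records written."""
    group.base = {**group.base, **updates}
    return len(group.base)


def upsert_mor(group: FileGroup, updates: dict) -> int:
    """Merge-on-Read: append updates to the log; return records written."""
    group.log.extend(updates.items())
    return len(updates)


def snapshot_read(group: FileGroup) -> dict:
    """Merge base and log at read time (later log entries win)."""
    merged = dict(group.base)
    for key, value in group.log:
        merged[key] = value
    return merged


base = {f"trip-{i}": "requested" for i in range(1000)}
cow = FileGroup(dict(base))
mor = FileGroup(dict(base))
update = {"trip-7": "completed"}

cow_written = upsert_cow(cow, update)   # rewrites all 1000 records
mor_written = upsert_mor(mor, update)   # appends 1 record
```

Both tables return the same data on read; the difference is that the MOR reader pays for the merge until compaction folds the log into a new base file.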
Streaming on Cloud Storage
[Diagram] Merge on Read: Writer -> parquet files + change logs (v1) -> Compaction -> v2 -> Reader
Query Types
1. Snapshot Query - Merge changes and read everything
2. Read-Optimized Query - Read the latest compacted data
3. Incremental Query - Read only data that has changed between an interval
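The three query types can be sketched on a toy Merge-on-Read table. The class `ToyMorTable` and its methods are hypothetical, commit times are plain integers rather than Hudi's instant format, and the incremental interval is assumed to be exclusive at the start and inclusive at the end.

```python
from typing import Dict, List, Tuple

# Each log entry: (commit_time, record_key, value).
LogEntry = Tuple[int, str, str]


class ToyMorTable:
    def __init__(self, base: Dict[str, str]) -> None:
        self.base = dict(base)               # latest compacted data
        self.log: List[LogEntry] = []        # changes not yet compacted
        self.history: List[LogEntry] = []    # every change, kept for incremental reads

    def commit(self, commit_time: int, changes: Dict[str, str]) -> None:
        """Record a commit; commits are assumed to arrive in time order."""
        for key, value in changes.items():
            entry = (commit_time, key, value)
            self.log.append(entry)
            self.history.append(entry)

    def compact(self) -> None:
        """Fold the log into the base and clear it."""
        self.base = self.snapshot_query()
        self.log.clear()

    def snapshot_query(self) -> Dict[str, str]:
        """1. Merge changes and read everything."""
        merged = dict(self.base)
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def read_optimized_query(self) -> Dict[str, str]:
        """2. Read only the latest compacted data."""
        return dict(self.base)

    def incremental_query(self, begin: int, end: int) -> Dict[str, str]:
        """3. Read records changed in (begin, end], latest value per key."""
        changed: Dict[str, str] = {}
        for commit_time, key, value in self.history:
            if begin < commit_time <= end:
                changed[key] = value
        return changed


table = ToyMorTable({"a": "v0", "b": "v0"})
table.commit(1, {"a": "v1"})
table.commit(2, {"c": "v2"})

snapshot = table.snapshot_query()
read_optimized = table.read_optimized_query()
incremental = table.incremental_query(1, 2)
```

Before compaction the read-optimized result lags the snapshot result; after `table.compact()` the two agree.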
Merge On Read Stories
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
100GB/s Throughput
400+PB (Even just 1 Table)
Daily -> Min Analytics Latency
70% CPU Savings (write+read)
300GB/d Throughput
25+TB Dataset
Hourly Analytics Latency
https://www.youtube.com/watch?v=ZamXiT9aqs8
100M+/d Events
10+TB Dataset
8h -> 1h Analytics Latency
https://www.youtube.com/watch?v=Yn8-tPX6Zoo
10min Analytics Latency
Table Services with Streaming Ingestion
● Self managing database runtime
○ Cleaning (committed/uncommitted), archival, clustering, compaction
● Table services know each other
○ Avoid duplicate schedules
○ Skip compacting files being clustered
● Run continuously or scheduled, asynchronously
Compaction - Optimizing Queries on MOR
● Periodically and asynchronously compact log files to new base files
● Reduces write amplification
● Keep the query performance in check
[Diagram] Before compaction: Snapshot Query merges the latest parquet files + change logs (v1)
[Diagram] After compaction: Snapshot Query reads the latest parquet files only (v2)
Clustering - Optimizing Data Layout
○ Faster streaming ingestion -> smaller file sizes
○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion
Clustering Service
● Scheduling: identify target data, generate plan in timeline
● Running: execute plan with pluggable strategy
○ Reorg data with linear sorting, Z-order, Hilbert, etc.
○ “REPLACE” commit in timeline
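As an illustration of the Z-order strategy mentioned above, the following minimal sketch sorts records by a Morton code built from two columns and then cuts the sorted run into files. It is not Hudi's clustering code: `z_value`, `cluster`, and the `city_id`/`hour` columns are hypothetical.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x takes the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y takes the odd bit positions
    return code


def cluster(records, file_size):
    """Sort records by Z-value, then split them into files of `file_size` records."""
    ordered = sorted(records, key=lambda r: z_value(r["city_id"], r["hour"]))
    return [ordered[i:i + file_size] for i in range(0, len(ordered), file_size)]


# Records arrive in ingestion order (by hour), not in query order (by city).
records = [{"city_id": c, "hour": h} for h in range(4) for c in range(4)]
files = cluster(records, file_size=4)
```

After clustering, each file covers a small range of both columns, so a filter on either `city_id` or `hour` touches fewer files than it would in ingestion order.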
Indexes
● Widely employed in database systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve query efficiency
● Hudi’s indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, HBase, etc.
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
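To show how a Bloom filter based index can locate records for incoming writes, here is a minimal sketch that keeps one filter per file and confirms candidates against the file's keys. It is a toy, not Hudi's index: `BloomFilter`, `build_index`, and `locate` are hypothetical names, and the filter size and hash count are arbitrary.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key: str):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_index(files):
    """Build one Bloom filter per file from that file's record keys."""
    index = {}
    for name, keys in files.items():
        bloom = BloomFilter()
        for key in keys:
            bloom.add(key)
        index[name] = bloom
    return index


def locate(key, files, index):
    """Return the file holding `key`, or None if the record is new (an insert)."""
    for name, bloom in index.items():
        # The filter can give false positives, so confirm against the file's keys.
        if bloom.might_contain(key) and key in files[name]:
            return name
    return None


files = {"f1": {"k1", "k2"}, "f2": {"k3"}}
index = build_index(files)
```

An upsert for `k3` is routed to `f2` as an update, while an unseen key such as `k9` is treated as an insert. Most files are ruled out by the filter alone, without reading their keys.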
Multi-Modal Index - New in Hudi 0.11
● Generalized indexing subsystem in Lakehouse
○ Scale to 10-100x data on the lake
○ Improve reads and queries, not just writes
● Key principles
○ Scalable metadata with MOR metadata table
○ ACID updates with multi-table transaction
○ Fast point lookups
https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi
Multi-Modal Index - File Listing
● Improve file listing on cloud storage like S3
○ Direct listing of 100k files across 1000s of partitions hits throttling and I/O bottlenecks
○ The files partition in the metadata table provides a 2-20x speedup of file listing
Multi-Modal Index - Data Skipping
● Leverage column stats (min, max, count, etc.) to prune files in a query
○ Reduce unnecessary scans, paired with clustering. Integrated with Flink.
○ 10-30x speedup of needle-in-a-haystack type of queries
Q1a: low specificity, more targeted data/files
Q1b: high specificity, less targeted data/files
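File pruning from column stats can be sketched as a range-overlap check. This is a simplified illustration with hypothetical names (`ColumnStats`, `prune_files`) and a single integer column; it is not how the metadata table stores or serves its stats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnStats:
    file_name: str
    min_value: int
    max_value: int


def prune_files(stats: List[ColumnStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the predicate lo <= col <= hi."""
    return [s.file_name for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    ColumnStats("f1", 0, 99),
    ColumnStats("f2", 100, 199),
    ColumnStats("f3", 200, 299),
]

narrow = prune_files(stats, 150, 160)   # a needle-in-a-haystack style filter
wide = prune_files(stats, 90, 110)      # a filter that straddles two files
```

Pruning works best when each file covers a narrow value range, which is why data skipping is paired with clustering.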
Demo - Flink + Hudi on EMR
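The demo's actual statements are not captured in the slides. As a rough sketch of what defining a Merge-On-Read Hudi table from Flink SQL might look like, the helper below renders a `CREATE TABLE` statement as a string. The helper `hudi_flink_ddl`, the table name, the columns, and the S3 path are all hypothetical; the connector options shown (`connector`, `path`, `table.type`, `read.streaming.enabled`) follow the Hudi Flink documentation, but check them against the Hudi version in use.

```python
def hudi_flink_ddl(table: str, path: str, streaming_read: bool = True) -> str:
    """Render a Flink SQL DDL string for a Merge-On-Read Hudi table."""
    options = {
        "connector": "hudi",
        "path": path,
        "table.type": "MERGE_ON_READ",
    }
    if streaming_read:
        # Read the table as an unbounded stream of changes.
        options["read.streaming.enabled"] = "true"
    rendered = ",\n".join(f"  '{k}' = '{v}'" for k, v in options.items())
    return (
        f"CREATE TABLE {table} (\n"
        "  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,\n"
        "  rider VARCHAR(20),\n"
        "  ts TIMESTAMP(3),\n"
        "  city VARCHAR(20)\n"
        ") PARTITIONED BY (city) WITH (\n"
        f"{rendered}\n"
        ")"
    )


# Hypothetical table name and bucket, for illustration only.
ddl = hudi_flink_ddl("trips", "s3://my-bucket/hudi/trips")
print(ddl)
```

The resulting string could be pasted into the Flink SQL client or passed to a table environment's SQL execution call; the primary key is what lets Hudi apply upserts and deletes.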
Metaserver (Coming in 2022)
Interesting fact: Hudi has a metaserver already
○ Runs on Spark driver; serves FileSystem RPCs + queries on timeline
○ Backed by RocksDB/pluggable
○ Updated incrementally on every timeline action
○ Very useful in streaming jobs
Data lakes need a new metaserver
○ Flat file metastores are cool? (really?)
○ Speed up planning by orders of magnitude
Lake Cache (Coming in 2022)
LRU Cache à la DB Buffer Pool
Frequent Commits => Small objects/blocks
○ Today: Aggressively run table services
○ Tomorrow: File Group/Hudi file model aware caching
○ Mutable data => FileSystem/Block level caches are not that effective
Benefits
○ Great performance for CDC tables
○ Avoid open/close costs for small objects
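The slide describes the Lake Cache only at a high level. As a generic illustration of the LRU policy it names, here is a minimal sketch keyed by a hypothetical file group ID; `LruCache` is not a Hudi class and says nothing about how the planned cache is built.

```python
from collections import OrderedDict


class LruCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return self._entries[key]

    def put(self, key, value) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry


cache = LruCache(capacity=2)
cache.put("file-group-1", "merged records for group 1")
cache.put("file-group-2", "merged records for group 2")
cache.get("file-group-1")                                 # touch group 1
cache.put("file-group-3", "merged records for group 3")   # evicts group 2
```

Caching at file group granularity keeps a group's merged view in memory, so repeated reads avoid reopening many small objects.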
Come Build With The Community!
Docs : https://hudi.apache.org
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1d5zjsfl3-d_TefVaGyvEe16EANrxz6Q
Twitter : https://twitter.com/apachehudi
GitHub: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Join Hudi Slack
Thanks!
Questions?

Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Apache Flink and Apache Hudi.pdf

  • 1. How to build a streaming Lakehouse w/ Flink + Hudi Ethan Guo + Kyle Weller Wed, Aug 3 | 1:30 PM
  • 2. Kyle Weller - Head of Product @ Onehouse.ai https://www.linkedin.com/in/lakehouse/ Ethan Guo - Software Engineer @ Onehouse.ai https://www.linkedin.com/in/yihua-ethan-guo/ Introductions
  • 4. PostgreSQL Debezium Apache Kafka Database Ingestion Real-Time Analytics Apache Flink Old-School Batch ETL Amazon S3 Problems: ● Replicate business logic ● Slow batch pipes always lag ● Devops to maintain and sync ● No updates/deletes on S3 Stream vs Batch fork
  • 5. Hudi Lakehouse S3 Apache Hudi Topics: ● Unify batch and streaming workloads ● Build centralized platform for multiple compute engines ● Unlock concurrency for multiple readers/writers with ACID transactions ● Blazing fast data lake stream ingestion and processing with Hudi Merge-On-Read ● Efficient Upserts/Deletes with indexing and primary keys ● Implement incremental processing for Hudi change streams PostgreSQL Debezium Apache Kafka
  • 6. The Hudi Platform Lake Storage (Cloud Object Stores, HDFS, …) Open File/Data Formats (Parquet, HFile, Avro, Orc, …) Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling...) Table Services (cleaning, compaction, clustering, indexing, file sizing,...) Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene..) Table Format (Schema, File listings, Stats, Evolution, …) Lake Cache (Columnar, transactional, mutable, WIP,...) Metaserver (Stats, table service coordination,...) SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake,..) Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality,...) Transactional Database Layer Execution/Runtimes
  • 7. + Apache Kafka Raw Cleaned Derived Open Formats CDC Incremental Change Feed Transactions + Concurrency Managed Perf Tuning +++ More Auto Catalog Sync Merge-On-Read Stream Writers S3 AWS Glue Data Catalog Metastore BigQuery Catalogs + Many More Central Low-Latency Lakehouse Platform
  • 8. Trailblazer, now Industry Proven Uber rides - 250+ Petabytes from 24h+ to minutes latency https://eng.uber.com/uber-big-data-platform/ Package deliveries - real-time event analytics at PB scale https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/ TikTok/Bytedance recommendation system - at Exabyte scale http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance Trading transactions - Near real-time CDC from 4000+ postgres tables https://s.apache.org/hudi-robinhood-talk 150 source systems, ETL processing for 10,000+ tables https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/ Real-time advertising for 20M+ concurrent viewers https://www.youtube.com/watch?v=mFpqrVxxwKc Store transactions - CDC & Warehousing https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
  • 9. The Community 2200+ Slack Members 250+ Contributors 1000+ GH Engagers 20+ Committers Pre-installed on 5 cloud providers Diverse PMC/Committers 1M DLs/month (400% YoY) 800B+ Records/Day (from even just 1 customer!) Rich community of participants
  • 10. Streaming on Cloud Storage [Diagram: Copy on Write - Writer, versioned parquet files (v1, v2), Reader; Merge on Read - Writer, parquet files + change logs, Compaction, Reader] COW vs. MOR: Write Cost - Higher vs. Lower; Data Latency - Slower vs. Faster; Query Speed - Faster vs. Slower before compaction, same after compaction; Overall Cost - Aggressive rewrites with every update vs. Can amortize compaction with other services
  • 11. Streaming on Cloud Storage [Diagram: Merge on Read - Writer, parquet files + change logs, Compaction, Reader] Query Types 1. Snapshot Query - Merge changes and read everything 2. Read-Optimized Query - Read the latest compacted data 3. Incremental Query - Read only data that has changed within an interval
  • 12. + - Merge On Read Stories https://www.youtube.com/watch?v=ZamXiT9aqs8 https://chowdera.com/2022/184/202207030146453436.html https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/ 100GB/s Throughput 400+PB Even just 1 Table Daily -> Min Analytics Latency 70% CPU Savings (write+read) 300GB/d Throughput 25+TB Dataset Hourly Analytics Latency https://www.youtube.com/watch?v=ZamXiT9aqs8 100M+/d Events 10+TB Dataset 8h -> 1h Analytics Latency https://www.youtube.com/watch?v=Yn8-tPX6Zoo 10min Analytics Latency
  • 13. Table Services with Streaming Ingestion ● Self managing database runtime ○ Cleaning (committed/uncommitted), archival, clustering, compaction ● Table services know each other ○ Avoid duplicate schedules ○ Skip compacting files being clustered ● Run continuously or scheduled, asynchronously
  • 14. Compaction - Optimizing Queries on MOR ● Periodically and asynchronously compact log files to new base files ● Reduces write amplification ● Keep the query performance in check Latest: parquet files + change logs v1 Snapshot Query Merging Compaction v1 v2 Snapshot Query Latest: parquet files only
  • 15. Clustering - Optimizing Data Layout ○ Faster streaming ingestion -> smaller file sizes ○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time) ○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion
  • 16. Clustering Service ● Scheduling: identify target data, generate plan in timeline ● Running: execute plan with pluggable strategy ○ Reorg data with linear sorting, Z-order, Hilbert, etc. ○ “REPLACE” commit in timeline
  • 17. Indexes ● Widely employed in database systems ○ Locate information quickly ○ Reduce I/O cost ○ Improve query efficiency ● Hudi’s indexing provides fast upserts ○ Locate records for incoming writes ○ Bloom filter based, Simple, HBase, etc. https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
  • 18. Multi-Modal Index - New in Hudi 0.11 ● Generalized indexing subsystem in Lakehouse ○ Scale to 10-100x data on the lake ○ Improve read and queries besides writes ● Key principles ○ Scalable metadata with MOR metadata table ○ ACID updates with multi-table transaction ○ Fast pointed lookup https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi
  • 19. Multi-Modal Index - File Listing ● Improve file listing on cloud storage like S3 ○ Direct listing of 100k files across 1000s of partitions hits throttling and I/O bottleneck ○ The files partition in metadata table provides 2-20x speedup of file listing
  • 20. Multi-Modal Index - Data Skipping ● Leverage column stats (min, max, count, etc.) to prune files in a query ○ Reduce unnecessary scans, paired with clustering. Integrated with Flink. ○ 10-30x speedup of needle-in-a-haystack type of queries Q1a: low specificity, more targeted data/files Q1b: high specificity, less targeted data/files
  • 21-27. Demo - Flink + Hudi on EMR
  • 28. Metaserver (Coming in 2022) Interesting fact: Hudi has a metaserver already ○ Runs on Spark driver; Serves FileSystem RPCs + queries on timeline ○ Backed by RocksDB/pluggable ○ Updated incrementally on every timeline action ○ Very useful in streaming jobs Data lakes need a new metaserver ○ Flat file metastores are cool? (really?) ○ Speed up planning by orders of magnitude
  • 29. Lake Cache (Coming in 2022) LRU Cache à la DB Buffer Pool Frequent Commits => Small objects/blocks ○ Today: Aggressive table services ○ Tomorrow: File Group/Hudi file model aware caching ○ Mutable data => FileSystem/Block level caches are not that effective. Benefits ○ Great performance for CDC tables ○ Avoid open/close costs for small objects
  • 30. Come Build With The Community! Docs : https://hudi.apache.org Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1d5zjsfl3-d_TefVaGyvEe16EANrxz6Q Twitter : https://twitter.com/apachehudi Github: https://github.com/apache/hudi/ Give us a star ⭐! Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe) dev@hudi.apache.org (actual mailing list) Join Hudi Slack
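
The three Merge-On-Read query types on slide 11 and the compaction step on slide 14 can be sketched with a small in-memory model. This is a toy illustration only, not Hudi's storage layout or API: the class name `ToyMorFileGroup`, its methods, and the integer commit times are all hypothetical, and a plain dict stands in for the parquet base file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorFileGroup:
    """Toy model of one Merge-On-Read file group (hypothetical, not Hudi code)."""

    # key -> value; stands in for the compacted parquet base file.
    base: dict = field(default_factory=dict)
    # (commit_time, key, value) entries; stands in for the change logs.
    log: list = field(default_factory=list)

    def upsert(self, commit_time, key, value):
        # Writers only append to the log, which keeps write cost low.
        self.log.append((commit_time, key, value))

    def read_optimized(self):
        # Read-optimized query: only the latest compacted base data.
        return dict(self.base)

    def snapshot(self):
        # Snapshot query: merge pending log entries over the base at read time.
        merged = dict(self.base)
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def incremental(self, begin, end):
        # Incremental query: only records changed with begin < commit_time <= end.
        # Simplification: this toy only sees changes that are still in the log.
        changed = {}
        for commit_time, key, value in self.log:
            if begin < commit_time <= end:
                changed[key] = value
        return changed

    def compact(self):
        # Compaction: fold the log into a new base so reads need no merging.
        self.base = self.snapshot()
        self.log = []
```

Before `compact()` is called, `read_optimized()` returns stale but cheap-to-read data while `snapshot()` pays the merge cost for fresh data; after compaction the two agree, which matches the "slower before compaction, same after compaction" row on slide 10.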
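
The data skipping idea on slide 20 (using per-file column stats such as min and max to prune files) can be sketched as a simple range-overlap check. This is a minimal sketch under stated assumptions: `FileStats`, `prune_files`, and the file names are hypothetical, and it does not show how Hudi's metadata table stores or serves column stats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max of one column for one data file (hypothetical stand-in)."""

    path: str
    min_value: int
    max_value: int


def prune_files(stats, lower, upper):
    """Return paths of files whose [min, max] range overlaps [lower, upper].

    Files whose range cannot contain a matching value are skipped
    without being opened.
    """
    return [
        s.path
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]
```

The check prunes more files when each file covers a narrow value range, which is why the slide pairs data skipping with clustering.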