Ethan Guo & Vinoth Chandar
Designing Apache Hudi for
Incremental Processing on
Exabyte-scale Lakehouses
Speakers
Ethan Guo
• Database Engineer @ Onehouse.ai
• Apache Hudi Committer
• Data, Networking @ Uber
Vinoth Chandar
• CEO, Founder @ Onehouse.ai
• PMC Chair @ Apache Hudi
• Data, Infra, Networking, Databases @ Uber
• Kafka, ksqlDB, Streams @ Confluent
• Key Value Stores @ LinkedIn
• CDC, Goldengate @ Oracle
Agenda
Rise of the Lakehouse Architecture
Incremental Processing Model
Hudi in Action
Community
Rise Of the Lakehouse
Data Lakes, Transactions, Table Formats
Evolution of Data Infrastructure
On-Prem Data
warehouses
(Traditional
BI/Reporting)
2000s - Hadoop
Data Lakes
(Search/Social)
2014 - Apache Spark
(Data Science)
2016 - Apache Hudi
(Transactional Data Lake)
2017 - Databricks
Delta*
2012 - BigQuery
(Serverless)
2014 - Snowflake
(Decoupling/UX)
2013- Amazon
Redshift
(Cloud)
Warehouse
Lake(house)
*Databricks coined term “Lakehouse”
Lakehouse Architecture
Traditional data lakes: query engines (SQL execution, optimizer, and a local cache on each node) read Parquet/ORC/JSON/CSV files directly from cloud storage.
Lakehouses: the same query engines sit on top of a transaction layer (table format, metadata, transaction manager, table services) over cloud storage holding Parquet/ORC files.
How do they stack up?
Warehouse: closed; built for BI; fully managed; expensive as you scale.
Lakehouse: open** (conditions apply); better ML/DS/AI support; DIY; cheaper at scale.
Lakehouse Platform
Apache Kafka feeds raw, cleaned, and derived tables on S3, synced to catalogs (AWS Glue Data Catalog, Metastore, BigQuery, and many more): truly open & interoperable.
Hudi Table – Under the Hood
The Hudi Platform
User Interface: Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, ...) and Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, ...)
Programming API: Writers (inserts, updates, deletes, smart layout management, etc.) and Readers (snapshot, time travel, incremental, etc.)
Transactional Database Layer: Table Format (schema, file listings, stats, evolution, ...); Indexes (Bloom filter, HBase, bucket index, hash based, Lucene, ...); Concurrency Control (OCC, MVCC, non-blocking, lock providers, scheduling, ...); Table Services (cleaning, compaction, clustering, indexing, file sizing, ...); Lake Cache* (columnar, transactional, mutable, WIP, ...); Metaserver* (stats, table service coordination, ...)
Open File/Data Formats: Parquet, HFile, Avro, ORC, ...
Lake Storage: cloud object stores, HDFS, ...
Proven @ Massive Scale
https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
100GB/s
Throughput
> 1Exabyte
Even just 1 Table
Daily -> Min
Analytics Latency
70%
CPU Savings
(write+read)
300GB/d
Throughput
25+TB
Datasets
Hourly
Analytics Latency
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
10,000+
Tables
150+
Source systems
CDC, ETL
Use cases
https://www.uber.com/blog/apache-hudi-graduation/
4000+
Tables
250+PB
Raw + Derived
Daily -> Min
Analytics Latency
800B
Records/Day
Incremental Processing
Looking Beyond Files & Formats
Elephant in the Room : Batch Processing
“In data warehousing, in order to represent a business they
had to actually kind of reinvent event streams in a very slow
way”
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/; Vinoth Chandar, 2016
Stream vs Batch Processing Dichotomy
PostgreSQL Debezium Apache Kafka
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3
Stream vs Batch Processing Dichotomy
PostgreSQL Debezium Apache Kafka
Database Ingestion Real-Time Analytics
Apache Flink
Old-School Batch ETL
Amazon S3
+ Intelligent, Incremental
+ Fast, Efficient
- Row oriented
- Not scan optimized
+ Scans, Columnar formats
+ Scalable Compute
- Inefficient recompute
- No updates/deletes
Incremental Processing Model
Coined at Uber; 2015
Brings the “stream processing” model to “batch” data
Bridges the best of both worlds: process only new input, with columnar/scan-optimized storage
We needed a state store!
The Missing State Store
A Hudi table plays that role: upsert(records) at time t applies changes to the table, incremental_query(t-1, t) pulls the changes from the table since the previous commit, and a regular query at time t reads the latest committed records.
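To make this concrete, here is a minimal PySpark sketch of that loop against the Hudi Spark DataSource; the table path, record key, and field names are hypothetical, and only documented Hudi options are used:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi bundle on the classpath
# (e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>).
spark = SparkSession.builder.appName("hudi-incremental-demo").getOrCreate()

base_path = "s3://bucket/hudi/rides"  # hypothetical table location
hudi_opts = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",      # ordering field for merges
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.operation": "upsert",
}

# upsert(records) at time t: apply new and updated records to the table
changes = spark.read.json("s3://bucket/staging/rides_changes/")
changes.write.format("hudi").options(**hudi_opts).mode("append").save(base_path)

# incremental_query(t-1, t): pull only records that changed since the last commit
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20221001000000")  # commit time t-1
    .load(base_path))

# query at time t: snapshot of the latest committed records
snapshot = spark.read.format("hudi").load(base_path)
```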
Case Study : Uber’s Big Data Platform
Truly real-time business;
Poor Ingest performance
Slow, Expensive re-computations
in Batch ETL
Massive data volumes - ~100PB in
2016, maybe an exabyte today
Need minute level latencies
Hudi Feature Highlights
Incremental Reads
Maintains monotonically increasing
commit metadata to provide
incremental queries
Multi-modal Indexes
Bloom, Simple and HBase indexes
to provide faster lookups, updates &
deletes
Streaming Latency
To reduce data latency and write
amplification when ingesting
records in an MOR table with async
compaction
Concurrency Control
Hudi provides OCC between writers, while providing lock-free, non-blocking MVCC
Field level upserts
To perform updates, merges and
deletes to the data
Clustering
Data rewriting service to reorganize data for improved query performance
Integrations
Works well with Presto, Spark, Flink,
Trino & Hive, Kafka Connect etc.
Timeline Metadata
Time-travel using rewind and
rollback semantics to fix DQ issues
https://bit.ly/hudi-feature-comparison
Hudi Table Types
Copy on Write: the writer rewrites versioned parquet files (v1 -> v2) on every commit, and readers always see the latest file versions.
Merge on Read: the writer appends change logs next to parquet base files; compaction later folds the logs into new base file versions, and readers merge base files with logs until then.
COW vs MOR:
Write Cost: higher (COW) vs lower (MOR)
Data Latency: slower (COW) vs faster (MOR)
Query Speed: faster (COW) vs slower before compaction, faster after compaction (MOR)
Overall Cost: aggressive rewrites with every update (COW) vs compaction cost amortized with other table services (MOR)
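Continuing the hypothetical PySpark sketch from earlier, the table type is just one write option; a minimal illustration:

```python
# Choosing the table type at write time (sketch; path and fields are hypothetical,
# and df stands for a DataFrame of incoming records as in the earlier upsert example).
table_type_opts = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    # "COPY_ON_WRITE" rewrites parquet files on every commit;
    # "MERGE_ON_READ" appends change logs and defers the rewrite to compaction.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
df.write.format("hudi").options(**table_type_opts).mode("append").save("s3://bucket/hudi/rides")
```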
Hudi Query Types
(Merge on Read diagram: the writer appends change logs to parquet base files, compaction produces new base file versions, and readers can read merged, compacted-only, or changed data.)
Query Types
1. Snapshot Query - Merge changes and read everything
2. Read-Optimized Query - Read the latest compacted data
3. Incremental Query - Read only data that has changed within an interval
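A hedged PySpark sketch of the three query types against a hypothetical table path:

```python
# The three query types, expressed as Hudi read options (table path is hypothetical).
path = "s3://bucket/hudi/rides"

# 1. Snapshot query: merge base files + change logs and read everything (default).
snapshot_df = spark.read.format("hudi").load(path)

# 2. Read-optimized query: read only the latest compacted base files (MOR tables).
read_optimized_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(path))

# 3. Incremental query: read only data that changed within a commit interval.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20221001000000")
    .option("hoodie.datasource.read.end.instanttime", "20221002000000")
    .load(path))
```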
Optimizing For Large Scale Updates
Challenges
● 10x harder problem than designing formats
● Opens up every database problem in the textbook
● Primary keys, faster metadata changes, consistency between index and data
● Needs fundamentally different concurrency control techniques
● Partial Updates
○ Many databases generate partial updates
○ Supplemental logging is very expensive
● DR Scenarios
○ Databases can be running active-active
○ Need conflict resolution techniques
● Record level merge APIs
○ Support different CDC formats (e.g., DMS, Debezium)
○ Moving towards the newer API (RFC-46)
Merge APIs
Current RecordPayload Interface
New HoodieRecordMerger Interface
Indexes
● Widely employed in database systems
○ Locate information quickly
○ Reduce I/O cost
○ Improve Query efficiency
● Indexing provides fast upserts
○ Locate records for incoming writes
○ Bloom filter based, Simple, HBase, etc.
○ Record level Indexes, Lucene based
https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
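For illustration, the upsert index is selected with a single write config; a sketch with hypothetical table and field names, using values from the documented hoodie.index.type options:

```python
# Picking the upsert index on the write path (sketch; the index decides how
# incoming record keys are located among existing files).
index_opts = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    # One of: BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, HBASE, BUCKET, INMEMORY, ...
    "hoodie.index.type": "BLOOM",
}
df.write.format("hudi").options(**index_opts).mode("append").save("s3://bucket/hudi/rides")
```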
Multi-Modal Index
● Generalized indexing subsystem in Lakehouse
○ Converge metadata + indexes
○ Scale to 10-100x data on the lake
○ Improve read and queries besides writes
○ Asynchronously rebuild new/existing indexes
● Key principles
○ Design for frequent changes
○ MoR metadata table w/ log compaction
○ ACID updates with multi-table transaction
○ Fast point lookups/range scans
https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi
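A sketch of turning on the metadata table and its index partitions, assuming Hudi 0.11+ configs and a hypothetical table:

```python
# Enabling the MoR metadata table plus bloom filter and column stats indexes
# (sketch; these write configs are merged with the usual Hudi write options).
multimodal_opts = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.metadata.enable": "true",                     # MoR metadata table
    "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom filter index partition
    "hoodie.metadata.index.column.stats.enable": "true",  # column stats index partition
}
df.write.format("hudi").options(**multimodal_opts).mode("append").save("s3://bucket/hudi/rides")
```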
Compaction - Balancing Read/Write Costs
○ TBs of updates against PBs of data
○ Delete/update patterns are often very different from query patterns
○ GDPR deletes are random
○ Analytics queries are more likely to read recent data
○ Periodically and asynchronously compact log files into new base files
○ Reduces write amplification
○ Keeps query performance in check
Before compaction, snapshot queries merge the latest parquet files with change logs; after compaction, they read the new parquet files only.
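As an illustration, inline vs. deferred compaction on an MOR table is driven by a couple of configs (a sketch; the commit count is arbitrary):

```python
# Controlling when log files get compacted into new base files on an MOR table
# (sketch; these are passed alongside the usual write options via .options(...)).
compaction_opts = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Inline: compact synchronously as part of the write, every N delta commits...
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # ...or leave inline off and let an async compactor / table service do it,
    # amortizing the rewrite cost away from the ingestion path.
}
```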
Clustering - Optimizing Data Layout
○ Faster streaming ingestion -> smaller file sizes
○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion
Clustering Service
○ Complete runtime for executing clustering in tandem with writers/compaction
○ Scheduling: identify target data, generate plan in timeline
○ Running: execute plan with pluggable strategy
○ Reorg data with linear sorting, Z-order, Hilbert curves, etc.
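A hedged sketch of clustering configs; the column names and size thresholds below are made up for illustration:

```python
# Scheduling clustering to fix small files and re-sort data by query columns
# (sketch; merged with the usual Hudi write options via .options(...)).
clustering_opts = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",  # plan clustering every 4 commits
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),        # files < 300 MB are candidates
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),  # write ~1 GB files
    "hoodie.clustering.plan.strategy.sort.columns": "city,ts",  # reorg for query locality
}
```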
Table Services with Continuous Writers
● Self-managing database runtime
○ Cleaning (committed/uncommitted), archival, clustering, compaction
○ Similar to how RocksDB daemons work
● Table services know about each other
○ Avoid duplicate schedules
○ Skip compacting files being clustered
● Run continuously or scheduled, asynchronously
Hudi In Action
Common Incremental Processing Patterns
How to do Streaming Ingest
Hudi
Table
Apache Kafka
Flattening
loc.lon -> loc_lon
loc.lat -> loc_lat
Projection
User-defined
transformation
SELECT
a.loc_lon,
a.loc_lat,
a.name
FROM <SRC> a
HoodieDeltaStreamer tool
Hudi configs (table name, path, type,
record key, ordering field, etc.)
Transformations
Source limit, sync frequency
Kafka source configs
Enable async clustering
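For reference, a hypothetical HoodieDeltaStreamer invocation for this pipeline might look like the following sketch; the jar name, paths, topic, and broker are placeholders, and a schema provider config is typically also required but omitted for brevity:

```sh
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --target-base-path s3://bucket/hudi/rides \
  --target-table rides \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --continuous \
  --min-sync-interval-seconds 60 \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ride_id \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=rides_events \
  --hoodie-conf bootstrap.servers=broker:9092 \
  --hoodie-conf auto.offset.reset=earliest \
  --hoodie-conf 'hoodie.deltastreamer.transformer.sql=SELECT a.loc_lon, a.loc_lat, a.name FROM <SRC> a' \
  --hoodie-conf hoodie.clustering.async.enabled=true
```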
CDC Example
PostgreSQL Debezium Apache Kafka
Schema registry
Hudi
DeltaStreamer
HoodieDeltaStreamer tool
Postgres Debezium
connector configs
Postgres Debezium
Source and Payload
for CDC
Config for small file
handling
Hive sync configs
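A similar hedged sketch for the Debezium CDC pipeline, using the Postgres Debezium source and payload classes shipped with Hudi; the topic, registry URL, paths, and field names are placeholders:

```sh
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3://bucket/hudi/trades \
  --target-table trades \
  --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
  --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-ordering-field _event_lsn \
  --continuous \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/pg.public.trades-value/versions/latest \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=pg.public.trades \
  --hoodie-conf bootstrap.servers=broker:9092 \
  --hoodie-conf hoodie.datasource.write.recordkey.field=trade_id \
  --hoodie-conf hoodie.parquet.small.file.limit=104857600 \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true
```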
Incremental ETL Example
Apache Kafka feeds a stream (fact) table; its changes are incrementally consumed, joined with dimension Dataset 1 and Dataset 2, and upserted into a joined table.
Hudi Incremental Source
Incr source configs
SQL Transformer for Joins
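A minimal PySpark sketch of this incremental ETL, assuming hypothetical table paths, join keys, and a checkpoint from the previous run:

```python
# Incremental ETL sketch: pull only new fact rows since the last checkpoint,
# join them with dimension tables, and upsert the result into a derived Hudi table.
last_checkpoint = "20221001000000"  # commit time processed by the previous run

new_facts = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint)
    .load("s3://bucket/hudi/fact_stream"))

dim1 = spark.read.format("hudi").load("s3://bucket/hudi/dim_dataset1")
dim2 = spark.read.format("hudi").load("s3://bucket/hudi/dim_dataset2")

joined = (new_facts
    .join(dim1, "dim1_key")
    .join(dim2, "dim2_key"))

(joined.write.format("hudi")
    .option("hoodie.table.name", "joined_table")
    .option("hoodie.datasource.write.recordkey.field", "fact_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/hudi/joined_table"))
```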
ETL Load Strategies
Full load
➕ easy to implement, e.g., if you need JOINs
➖ expensive, slow
➖ updates to (too) old data are lost
Incremental ETL with Hudi 👍
➕ still easy to implement
➕ efficient
➗ not real-time, but close
Streaming with Flink (and Hudi?)
➖ have to call services or JOIN streams
➕ real-time
Wins Reported by Uber
Accuracy: We achieved 100% data accuracy: no updates are lost, even for year-old trips.
Efficiency: Process less data on each run. Weekly aggregation with full load: 4-5 hours per run; fact table with incremental: 45 minutes, and this can be further improved.
Freshness: Potential to bring the freshness SLA of earnings data in Hive from 31 hours down to a couple of hours (work in progress). This unlocks earnings features closer to real-time.
Cheaper: In our benchmarking, we’ve found Lakehouse-based incremental ETLs are ~50% cheaper than old-school batch pipelines.
Community
It’s going to take a village
The Community
2400+
Slack Members
320+
Contributors
1200+
GH Engagers
20+
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
1M DLs/month
(400% YoY)
800B+
Records/Day
(from even just 1 customer!)
Rich community of participants
Trailblazer, now Industry Proven
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at PB scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system - at Exabyte scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
Lake House Architecture @ Halodoc: Data Platform 2.0
https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
Incremental, Multi region data lake platform
https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
Unified, batch + streaming data lake on Hudi
https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
Streaming data lake for device data
https://www.youtube.com/watch?v=8Q0kM-emMyo
Near real-time grocery delivery tracking
https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c
Minute level data ingestion to lakehouse
https://www.youtube.com/watch?v=Yn8-tPX6Zoo
Trailblazer, now Industry Proven
Serverless, real-time analytics platform on Hudi
https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/
Metaserver (Coming in Q4 2022)
Interesting fact: Hudi has a metaserver already
○ Runs on Spark driver; Serves FileSystem RPCs +
queries on timeline
○ Backed by RocksDB (pluggable)
○ Updated incrementally on every timeline action
○ Very useful in streaming jobs
Data lakes need a new metaserver
○ Flat file metastores are cool? (really?)
○ Speed up planning by orders of magnitude
RFC-36, HUDI-3345: Metaserver for all metadata
Lake Cache (Coming Early 2023)
LRU Cache a la DB Buffer Pool
Frequent Commits => Small objects/blocks
○ Today: Aggressive table services
○ Tomorrow: File Group/Hudi file model aware caching
○ Mutable data => FileSystem/block-level caches are not that effective
Benefits
○ Great performance for CDC tables
○ Avoid open/close costs for small objects
Strawman design: Mutable, Transactional caching for Hudi Tables
New CDC Format (Coming in Q4 2022)
Change Data Capture in Hudi table as a source
○ Support record-level CDC logs and queries
○ Debezium-like format: “before” and “after” images
○ Insert: null → inserted row
○ Update: old row → new row
○ Delete: pre-delete row → null
Trade-offs on deducing changelogs
○ Incremental query can already pull changes
○ Compute changelogs on the fly (more read cost)
○ Fully materialized changelogs (more write cost)
RFC-51, HUDI-3478: Support of Change Data Capture (CDC) with Hudi change logs
Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
More Related Content

What's hot

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 

What's hot (20)

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 

Similar to Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022

[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
Prasanna Rajaperumal
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
Roy Kim
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
Amazon Web Services
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
Ashish Mrig
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPC
guest7c2e070
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL Deployments
Joel Oleson
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
Alluxio, Inc.
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon Redshift
Attunity
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
Kellyn Pot'Vin-Gorman
 

Similar to Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022 (20)

[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPC
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL Deployments
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon Redshift
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 

Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022

  • 1. Ethan Guo & Vinoth Chandar Designing Apache Hudi for Incremental Processing on Exabyte-scale Lakehouses
  • 2. Speakers • Database Engineer @ Onehouse.ai • Apache Hudi Committer • Data, Networking @ Uber • CEO, Founder @ Onehouse.ai • PMC Chair @ Apache Hudi • Data, Infra, Networking, Databases @ Uber • Kafka, ksqlDB, Streams @ Confluent • Key Value Stores @ LinkedIn • CDC, Goldengate @ Oracle Ethan Guo Vinoth Chandar
  • 3. Agenda Rise of the Lakehouse Architecture Incremental Processing Model Hudi in Action Community
  • 4. Rise Of the Lakehouse Data Lakes, Transactions, Table Formats
  • 5. Evolution of Data Infrastructure On-Prem Data warehouses (Traditional BI/Reporting) 2000s - Hadoop Data Lakes (Search/Social) 2014 - Apache Spark (Data Science) 2016 - Apache Hudi (Transactional Data Lake) 2017 - Databricks Delta* 2012 - BigQuery (Serverless) 2014 - Snowflake (Decoupling/UX) 2013- Amazon Redshift (Cloud) Warehouse Lake(house) *Databricks coined term “Lakehouse”
  • 6. Lakehouse Architecture Lakehouses Cloud Storage Local Cache SQL Exec Node A Node B Node C SQL Exec SQL Exec Query Engines Storage Table Format Metadata Txn manager Transaction layer Optimizer Optimizer Optimizer Local Cache Local Cache Table Services Parquet/ORC Traditional Data Lakes Cloud Storage Local Cache SQL Exec Node A Node B Node C SQL Exec SQL Exec Query Engines Storage Optimizer Optimizer Optimizer Local Cache Local Cache Parquet/ORC/JSON/CSV
  • 7. How they stack up? Warehouse Lakehouse Closed Built for BI Fully managed Expensive as you scale Open** (conditions apply) Better ML/DS/AI Support DIY Cheaper at scale
  • 8. S3 AWS Glue Data Catalog Metastore BigQuery Catalogs + Many More Lakehouse Platform Apache Kafka Raw Tables Cleaned Tables Derived Tables Truly Open & Interoperable
  • 9. Hudi Table – Under the Hood
  • 10. The Hudi Platform Lake Storage (Cloud Object Stores, HDFS, …) Open File/Data Formats (Parquet, HFile, Avro, Orc, …) Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling...) Table Services (cleaning, compaction, clustering, indexing, file sizing,...) Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene..) Table Format (Schema, File listings, Stats, Evolution, …) Lake Cache* (Columnar, transactional, mutable, WIP,...) Metaserver* (Stats, table service coordination,...) Transactional Database Layer Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake,..) Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality,...) User Interface Readers (Snapshot, Time Travel, Incremental, etc) Writers (Inserts, Updates, Deletes, Smart Layout Management, etc) Programming API
  • 11. Proven @ Massive Scale https://www.youtube.com/watch?v=ZamXiT9aqs8 https://chowdera.com/2022/184/202207030146453436.html https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/ 100GB/s Throughput > 1Exabyte Even just 1 Table Daily -> Min Analytics Latency 70% CPU Savings (write+read) 300GB/d Throughput 25+TB Datasets Hourly Analytics Latency https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud- native-data-pipelines-at-enterprise-scale-using-the-aws-platform/ 10,000+ Tables 150+ Source systems CDC, ETL Use cases https://www.uber.com/blog/apache-hudi-graduation/ 4000+ Tables 250+PB Raw + Derived Daily -> Min Analytics Latency 800B Records/Day
  • 13. Elephant in the Room : Batch Processing “In data warehousing, in order to represent a business they had to actually kind of reinvent event streams in a very slow way” https://www.oreilly.com/content/ubers-case-for- incremental-processing-on-hadoop/; Vinoth Chandar, 2016
  • 14. Stream vs Batch Processing Dichotomy PostgresSQL Debezium Apache Kafka Database Ingestion Real-Time Analytics Apache Flink Old-School Batch ETL Amazon S3
  • 15. Stream vs Batch Processing Dichotomy PostgreSQL Debezium Apache Kafka Database Ingestion Real-Time Analytics Apache Flink Old-School Batch ETL Amazon S3 + Intelligent, Incremental + Fast, Efficient - Row-oriented - Not scan optimized + Scans, Columnar formats + Scalable Compute - Inefficient recompute - No updates/deletes
  • 16. Incremental Processing Model Coined at Uber, 2015. Brings the “stream processing” model to “batch” data. Bridges the best of both worlds: process only new input, with columnar/scan-optimized storage. We needed a state store!
  • 17. The Missing State Store A Hudi table fills the gap: upsert(records) at time t pushes changes into the table, incremental_query(t-1, t) pulls changes out of the table, and a query at time t returns the latest committed records.
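A minimal sketch of this primitive with the Spark datasource. The table path, field names, and commit times below are illustrative, not from the talk:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-state-store").getOrCreate()
import spark.implicits._

// Illustrative batch of incoming records (trip_id is the key, ts the ordering field).
val inputDF = Seq(("t1", "sf", 1L), ("t2", "nyc", 1L)).toDF("trip_id", "city", "ts")

// upsert(records) at time t: write changes into the Hudi table.
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("s3://bucket/lake/trips")

// incremental_query(t-1, t): read only records that changed after a given commit time.
val changes = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20220101000000") // illustrative commit time t-1
  .load("s3://bucket/lake/trips")

// query at time t: a plain snapshot read returns the latest committed records.
val latest = spark.read.format("hudi").load("s3://bucket/lake/trips")
```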
  • 18. Case Study: Uber’s Big Data Platform Truly real-time business; poor ingest performance. Slow, expensive re-computations in batch ETL. Massive data volumes - ~100PB in 2016, maybe an exabyte today. Need minute-level latencies.
  • 19. Hudi Feature Highlights Incremental Reads Maintains monotonically increasing commit metadata to provide incremental queries Multi-modal Indexes Bloom, Simple and HBase indexes to provide faster lookups, updates & deletes Streaming Latency To reduce data latency and write amplification when ingesting records into an MOR table with async compaction Concurrency Control Hudi provides OCC between writers, while providing lock-free, non-blocking MVCC Field-level upserts To perform updates, merges and deletes on the data Clustering To reorganize data for improved query performance & data-rewriting service Integrations Works well with Presto, Spark, Flink, Trino, Hive, Kafka Connect, etc. Timeline Metadata Time-travel using rewind and rollback semantics to fix DQ issues https://bit.ly/hudi-feature-comparison
  • 20. Hudi Table Types Copy on Write (COW): the writer rewrites versioned parquet files; readers always see full base files. Merge on Read (MOR): the writer appends change logs next to parquet base files; compaction periodically merges them. COW vs. MOR - Write Cost: Higher vs. Lower; Data Latency: Slower vs. Faster; Query Speed: Faster vs. Slower before compaction, faster after; Overall Cost: Aggressive rewrites with every update vs. compaction amortized with other table services.
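The table type is a per-table choice made at write time. A hedged sketch with the Spark datasource, reusing the illustrative `inputDF` and field names from the earlier example:

```scala
// Pick COW or MOR for the table; the rest of the write is unchanged.
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE"
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("s3://bucket/lake/trips_mor")
```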
  • 21. Hudi Query Types (on a Merge on Read table: parquet files + change logs, with compaction) 1. Snapshot Query - Merge changes and read everything 2. Read-Optimized Query - Read the latest compacted data 3. Incremental Query - Read only data that has changed within an interval
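A sketch of the three query types from Spark, against the illustrative MOR table written above (path and commit times are made up):

```scala
val basePath = "s3://bucket/lake/trips_mor"

// 1. Snapshot query (the default): merge base files + change logs and read everything.
val snapshot = spark.read.format("hudi").load(basePath)

// 2. Read-optimized query: read only the latest compacted base files.
val readOptimized = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// 3. Incremental query: read only data changed within a commit-time interval.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20220101000000")
  .option("hoodie.datasource.read.end.instanttime", "20220102000000")
  .load(basePath)
```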
  • 22. Optimizing For Large Scale Updates Challenges ● 10x harder problem than designing formats ● Opens up every database problem in the textbook ● Primary keys, faster metadata changes, consistency between index and data ● Needs fundamentally different concurrency control techniques
  • 23. Merge APIs ● Partial Updates ○ Many databases generate partial updates ○ Supplemental logging is very expensive ● DR Scenarios ○ Databases can be running active-active ○ Need conflict resolution techniques ● Record-level merge APIs ○ Support different CDC formats (e.g., DMS, Debezium) ○ Moving from the current RecordPayload interface towards the newer HoodieRecordMerger interface (RFC-46)
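Today the merge behavior is plugged in via a payload class. A hedged sketch of selecting one from Spark; the specific class shown ships with recent Hudi releases but class names and defaults vary by version, and the RFC-46 merger API supersedes this interface. Table path and fields are illustrative:

```scala
import spark.implicits._

// Illustrative orders table: order_id is the key, updated_at the ordering field, ds the partition.
val ordersDF = Seq(("o1", 1695000000L, 42.0, "2022-01-01")).toDF("order_id", "updated_at", "amount", "ds")

ordersDF.write.format("hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  // Pluggable record merge behavior; this one keeps the latest non-default field values,
  // which approximates partial-update semantics. Availability depends on the Hudi version.
  .option("hoodie.datasource.write.payload.class",
          "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
  .mode("append")
  .save("s3://bucket/lake/orders")
```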
  • 24. Indexes ● Widely employed in database systems ○ Locate information quickly ○ Reduce I/O cost ○ Improve query efficiency ● Indexing provides fast upserts ○ Locate records for incoming writes ○ Bloom filter based, Simple, HBase, etc. ○ Record-level indexes, Lucene-based https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
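A small sketch of choosing the index that locates records for incoming writes, reusing the illustrative table from earlier (available index types vary by Hudi version):

```scala
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Index used to map incoming keys to files; BLOOM is the common default.
  .option("hoodie.index.type", "BLOOM") // also SIMPLE, GLOBAL_BLOOM, HBASE, BUCKET, ...
  .mode("append")
  .save("s3://bucket/lake/trips")
```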
  • 25. Multi-Modal Index ● Generalized indexing subsystem in the Lakehouse ○ Converge metadata + indexes ○ Scale to 10-100x data on the lake ○ Improve reads and queries besides writes ○ Asynchronously rebuild new/existing indexes ● Key principles ○ Design for frequent changes ○ MoR metadata table w/ log compaction ○ ACID updates with multi-table transactions ○ Fast point lookups/range scans https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi
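A hedged sketch of enabling the metadata-table-backed indexes from Spark. These config keys come from Hudi 0.11-era releases and may differ across versions; the table path and fields reuse the illustrative example above:

```scala
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.metadata.enable", "true")                     // MoR metadata table (file listings, etc.)
  .option("hoodie.metadata.index.column.stats.enable", "true")  // column stats index for data skipping
  .option("hoodie.metadata.index.bloom.filter.enable", "true")  // bloom filter index for key lookups
  .mode("append")
  .save("s3://bucket/lake/trips")
```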
  • 26. Compaction - Balancing Read/Write Costs ○ TBs of updates against PBs of data ○ Delete/update patterns are often very different from query patterns ○ GDPR deletes are random ○ Analytics queries are more likely to read recent data ○ Periodically and asynchronously compact log files into new base files ○ Reduces write amplification ○ Keeps query performance in check (Diagram: before compaction, a snapshot query merges parquet base files + change logs; after compaction, it reads the latest parquet files only.)
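A hedged sketch of deferring compaction on an MOR table so the writer is not blocked. The async flag is honored by streaming writers (batch pipelines typically schedule a separate compaction job); paths and fields are illustrative:

```scala
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.compact.inline", "false")                      // don't compact synchronously with the write
  .option("hoodie.datasource.compaction.async.enable", "true")   // async compaction for streaming writers
  .option("hoodie.compact.inline.max.delta.commits", "5")        // schedule compaction every N delta commits
  .mode("append")
  .save("s3://bucket/lake/trips_mor")
```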
  • 27. Clustering - Optimizing Data Layout ○ Faster streaming ingestion -> smaller file sizes ○ Data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time) ○ Clustering to the rescue: auto file sizing, reorg data, no compromise on ingestion
  • 28. Clustering Service ○ Complete runtime for executing clustering in tandem with writers/compaction ○ Scheduling: identify target data, generate a plan on the timeline ○ Running: execute the plan with a pluggable strategy ○ Reorganize data with linear sorting, Z-order, Hilbert curves, etc.
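A hedged sketch of turning on clustering so small files get sized up and data gets reordered by a query-friendly column. Config keys are from recent Hudi releases and may vary by version; paths, fields, and thresholds are illustrative:

```scala
inputDF.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.clustering.inline", "true")                     // schedule/execute clustering with the writer
  .option("hoodie.clustering.inline.max.commits", "4")            // cluster every N commits
  .option("hoodie.clustering.plan.strategy.sort.columns", "city") // reorder by query locality (e.g., by city)
  .option("hoodie.clustering.plan.strategy.small.file.limit", "104857600") // files under ~100 MB are candidates
  .mode("append")
  .save("s3://bucket/lake/trips")
```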
  • 29. Table Services with Continuous Writers ● Self-managing database runtime ○ Cleaning (committed/uncommitted), archival, clustering, compaction ○ Similar to how RocksDB daemons work ● Table services are aware of each other ○ Avoid duplicate schedules ○ Skip compacting files being clustered ● Run continuously or on a schedule, asynchronously
  • 30. Hudi In Action Common Incremental Processing Patterns
  • 31. How to do Streaming Ingest The HoodieDeltaStreamer tool reads from Apache Kafka, applies user-defined transformations (e.g., flattening loc.lon -> loc_lon and loc.lat -> loc_lat, or a projection such as SELECT a.loc_lon, a.loc_lat, a.name FROM <SRC> a), and writes to a Hudi table. Key knobs: Hudi configs (table name, path, type, record key, ordering field, etc.), transformations, source limit and sync frequency, Kafka source configs, and enabling async clustering.
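The slide's pipeline uses the HoodieDeltaStreamer utility (run via spark-submit); as an alternative sketch of the same pattern, here is Spark Structured Streaming writing into Hudi, assuming a JSON Kafka topic and wholly illustrative brokers, topic, schema, and paths:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed shape of the Kafka payload; flattening mirrors loc.lon -> loc_lon, loc.lat -> loc_lat.
val schema = new StructType()
  .add("trip_id", StringType)
  .add("name", StringType)
  .add("city", StringType)
  .add("ts", LongType)
  .add("loc", new StructType().add("lon", DoubleType).add("lat", DoubleType))

val source = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // illustrative
  .option("subscribe", "trips")                        // illustrative topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("r"))
  .select(col("r.trip_id").as("trip_id"), col("r.name").as("name"), col("r.city").as("city"),
          col("r.ts").as("ts"), col("r.loc.lon").as("loc_lon"), col("r.loc.lat").as("loc_lat"))

source.writeStream.format("hudi")
  .option("hoodie.table.name", "trips_stream")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.clustering.async.enabled", "true")   // async clustering, as on the slide
  .option("checkpointLocation", "s3://bucket/checkpoints/trips_stream")
  .start("s3://bucket/lake/trips_stream")
```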
  • 32. CDC Example PostgreSQL -> Debezium -> Apache Kafka (with a schema registry) -> Hudi DeltaStreamer. The HoodieDeltaStreamer tool is set up with the Postgres Debezium connector configs, the Postgres Debezium source and payload for CDC, configs for small file handling, and Hive sync configs.
  • 33. Incremental ETL Example Incrementally consume a streaming fact table (ingested from Apache Kafka) via the Hudi incremental source, join it with dimension datasets (Dataset 1, Dataset 2), and upsert the result into a joined table. Configured with incremental source configs and a SQL transformer for the joins.
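A minimal sketch of this incremental ETL pattern in Spark: pull only the fact rows added since the last run, join with dimension tables, and upsert downstream. Table names, column names, paths, and the checkpointed commit time are illustrative:

```scala
// In practice this commit time is checkpointed from the previous run.
val lastCommit = "20220101000000"

// Incremental consumption of the fact table.
val newFacts = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", lastCommit)
  .load("s3://bucket/lake/fact_trips")

// Dimension tables are read as regular snapshots.
val dimCity   = spark.read.format("hudi").load("s3://bucket/lake/dim_city")
val dimDriver = spark.read.format("hudi").load("s3://bucket/lake/dim_driver")

// Join and upsert into the derived (joined) table.
newFacts.join(dimCity, Seq("city_id"))
  .join(dimDriver, Seq("driver_id"))
  .write.format("hudi")
  .option("hoodie.table.name", "joined_trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "city_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("s3://bucket/lake/joined_trips")
```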
  • 34. ETL Load Strategies Full load ➕ easy to implement e.g. if you need JOINs ➖ expensive, slow ➖ updates to (too) old data are lost Incremental ETL with Hudi 👍 ➕ still easy to implement ➕ efficient ➗ not real-time, but close Streaming with Flink (and Hudi?) ➖ have to call services or JOIN streams ➕ real-time
  • 35. Wins Reported by Uber Accuracy We achieved 100% data accuracy: no updates are lost, even for year-old trips. Efficiency Process less data on each run. Weekly aggregation with full load: 4-5 hours per run; fact table with incremental: 45 minutes, and can be further improved. Freshness Potential to bring the freshness SLA of the earnings in Hive from 31 hours down to a couple of hours (work in progress). This unlocks earnings features closer to real-time. Cheaper In our benchmarking, we’ve found Lakehouse-based incremental ETLs are ~50% cheaper than old-school batch pipelines.
  • 36. Community It’s going to take a village
  • 37. The Community 2400+ Slack Members 320+ Contributors 1200+ GH Engagers 20+ Committers Pre-installed on 5 cloud providers Diverse PMC/Committers 1M DLs/month (400% YoY) 800B+ Records/Day (from even just 1 customer!) Rich community of participants
  • 38. Trailblazer, now Industry Proven Uber rides - 250+ Petabytes from 24h+ to minutes latency https://eng.uber.com/uber-big-data-platform/ Package deliveries - real-time event analytics at PB scale https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/ TikTok/Bytedance recommendation system - at Exabyte scale http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance Trading transactions - Near real-time CDC from 4000+ postgres tables https://s.apache.org/hudi-robinhood-talk 150 source systems, ETL processing for 10,000+ tables https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/ Real-time advertising for 20M+ concurrent viewers https://www.youtube.com/watch?v=mFpqrVxxwKc Store transactions - CDC & Warehousing https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
  • 39. Trailblazer, now Industry Proven Lake House Architecture @ Halodoc: Data Platform 2.0 https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/ Incremental, multi-region data lake platform https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/ Unified, batch + streaming data lake on Hudi https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/ Streaming data lake for device data https://www.youtube.com/watch?v=8Q0kM-emMyo Near real-time grocery delivery tracking https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c Minute-level data ingestion to lakehouse https://www.youtube.com/watch?v=Yn8-tPX6Zoo Serverless, real-time analytics platform on Hudi https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/
  • 40. Metaserver (Coming in Q4 2022) Interesting fact: Hudi has a metaserver already ○ Runs on the Spark driver; serves FileSystem RPCs + queries on the timeline ○ Backed by RocksDB/pluggable ○ Updated incrementally on every timeline action ○ Very useful in streaming jobs Data lakes need a new metaserver ○ Flat-file metastores are cool? (really?) ○ Speed up planning by orders of magnitude RFC-36, HUDI-3345: Metaserver for all metadata
  • 41. Lake Cache (Coming Early 2023) LRU cache à la a DB buffer pool. Frequent commits => small objects/blocks ○ Today: mitigated by aggressively running table services ○ Tomorrow: file group/Hudi file model aware caching ○ Mutable data => FileSystem/block-level caches are not that effective. Benefits ○ Great performance for CDC tables ○ Avoid open/close costs for small objects Strawman design: mutable, transactional caching for Hudi tables
  • 42. New CDC Format (Coming in Q4 2022) Change Data Capture with a Hudi table as the source ○ Support record-level CDC logs and queries ○ Debezium-like format: “before” and “after” images ○ Insert: null -> inserted row ○ Update: old row -> new row ○ Delete: pre-delete row -> null Trade-offs on deducing changelogs ○ Incremental query can already pull changes ○ Compute changelogs on the fly (more read cost) ○ Fully materialized changelogs (more write cost) RFC-51, HUDI-3478: Support of Change Data Capture (CDC) with Hudi change logs
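A hedged sketch of what this could look like once RFC-51 lands. The config keys below follow the RFC and later Hudi releases; they did not exist at the time of this talk and may differ, and the table reuses the illustrative orders example from earlier:

```scala
// Writer side: materialize record-level CDC logs alongside the data (assumed RFC-51 config key).
ordersDF.write.format("hudi")
  .option("hoodie.table.name", "orders_cdc")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.table.cdc.enabled", "true")
  .mode("append")
  .save("s3://bucket/lake/orders_cdc")

// Reader side: an incremental query that returns before/after images instead of full rows.
val cdc = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.query.incremental.format", "cdc")
  .option("hoodie.datasource.read.begin.instanttime", "20220101000000")
  .load("s3://bucket/lake/orders_cdc")
```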
  • 43. Come Build With The Community! Docs: https://hudi.apache.org Blogs: https://hudi.apache.org/blog Slack: https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w Twitter: https://twitter.com/apachehudi Github: https://github.com/apache/hudi/ Give us a star ⭐! Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe) Join Hudi Slack