A Thorough Comparison of
Delta Lake, Iceberg and Hudi
Junjie Chen
About Me
▪ Software engineer at Tencent Data Lake Team
▪ Focus on big data area for years
Agenda
Introduction to Delta Lake, Apache Iceberg and Apache Hudi
Key Features Comparison
▪ Transaction
▪ Data mutation
▪ Streaming support
▪ Schema evolution
Maturity
▪ Tooling
▪ Integration
▪ Performance
Conclusion
What features are expected of a data lake?
▪ Data quality
▪ Transactions (ACID)
▪ Independence of engines
▪ Unified batch & streaming
▪ Pluggable storage
▪ Scalable metadata
▪ Data mutation
Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™
and big data workloads.
Apache Iceberg
A table format for huge analytic datasets which delivers high query performance for tables with tens of
petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.
[Diagram: DFS/cloud storage serving Spark batch & streaming, AI & reporting, interactive queries and streaming analytics]
Apache Hudi
Apache Hudi ingests & manages storage of large analytical datasets over DFS
A Quick Comparison (as of 2020-05)
Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi
Transaction (ACID) | Y | Y | Y
MVCC | Y | Y | Y
Time travel | Y | Y | Y
Schema evolution | Y | Y | Y
Data mutation | Y (update/delete/merge into) | N | Y (upsert)
Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming, Flink (WIP) | DeltaStreamer, HiveIncrementalPuller
File format | Parquet | Parquet, ORC, Avro | Parquet
Compaction/cleanup | Manual | API available (Spark Action) | Manual and auto
Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat
Multiple language support | Scala/Java/Python | Java/Python | Java/Python
Storage abstraction | Y | Y | N
API dependency | Spark-bundled | Native/engine-bundled | Spark-bundled
Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer
Transaction
Delta Lake
▪ Model
▪ Transaction Log (DeltaLog)
▪ Optimistic concurrency control
▪ Checkpoint changes into parquet
▪ Atomicity Guarantee
▪ HDFS rename
▪ S3 file write
▪ Azure rename without overwrite
▪ Time Travel
▪ timestamp
▪ version number
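The model above can be illustrated with a minimal sketch. It is not Delta Lake's implementation: the helper names (`commit`, `commit_with_retry`, `files_at`) are hypothetical, the log is reduced to one JSON file per version, and an exclusive file create stands in for the per-store atomicity guarantee. A real writer would also check for logical conflicts before retrying.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to commit `actions` as `version`; return False if another writer won."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists; this is a
        # stand-in for the atomic "create if absent" each store must provide.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


def latest_version(log_dir):
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistic loop: read the latest version, attempt version + 1, retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent writers")


def files_at(log_dir, version):
    """Time travel by version number: replay log entries 0..version."""
    live = set()
    for v in range(version + 1):
        with open(os.path.join(log_dir, f"{v:020d}.json")) as f:
            actions = json.load(f)
        live -= set(actions.get("remove", []))
        live |= set(actions.get("add", []))
    return live


log_dir = tempfile.mkdtemp()
v0 = commit_with_retry(log_dir, {"add": ["part-0.parquet"]})
v1 = commit_with_retry(log_dir, {"remove": ["part-0.parquet"], "add": ["part-1.parquet"]})
lost_race = commit(log_dir, v1, {"add": ["part-2.parquet"]})  # version already taken
```

Reading the table "as of" an older version is then just `files_at(log_dir, v0)`; nothing is rewritten to travel back in time.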
Apache Iceberg
▪ Model
▪ Snapshot
▪ Optimistic concurrency control
▪ Atomicity Guarantee
▪ HDFS Rename
▪ Hive metastore lock
▪ Time Travel
▪ snapshot id
▪ timestamp
[Diagram: snapshots S1, S2, S3, S4 with R and W markers]
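A minimal sketch of the snapshot model, under simplifying assumptions: the `Table` and `Snapshot` classes are hypothetical, a snapshot is just an immutable set of file names, and the optimistic check is an in-memory comparison rather than a rename or metastore lock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: frozenset


class Table:
    def __init__(self):
        self.snapshots = []  # ordered oldest to newest

    def current(self):
        return self.snapshots[-1] if self.snapshots else None

    def commit(self, base, new_files, timestamp_ms):
        """Add a new snapshot only if `base` is still the current one."""
        if base is not self.current():
            return False  # another writer committed first; caller must retry
        next_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(next_id, timestamp_ms, frozenset(new_files)))
        return True

    def as_of_id(self, snapshot_id):
        """Time travel by snapshot id."""
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

    def as_of_time(self, timestamp_ms):
        """Time travel by timestamp: latest snapshot at or before the given time."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return candidates[-1] if candidates else None


table = Table()
table.commit(None, {"a.parquet"}, 1000)
s1 = table.current()
table.commit(s1, {"a.parquet", "b.parquet"}, 2000)
stale = table.commit(s1, {"c.parquet"}, 3000)  # based on an outdated snapshot
```

Readers holding `s1` keep seeing a consistent file set while writers add later snapshots, which is how the snapshot model gives isolation between readers and writers.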
Apache Hudi
▪ Model
▪ Timeline
▪ Optimistic concurrency control
▪ Atomicity Guarantee
▪ HDFS rename
▪ Time Travel
▪ _hoodie_commit_time
Data Mutation
Delta Lake
▪ Copy on Write mode
▪ Step 1: find the files containing rows that match the filter expression
▪ Step 2: load those files as a DataFrame and update column values in the matching rows
▪ Step 3: save the DataFrame to new files
▪ Step 4: log the removed and added files as JSON and commit to the table
▪ Table level APIs
▪ update, delete (condition based)
▪ merge into (upsert a source into target table)
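The four copy-on-write steps can be sketched in plain Python. This is an illustration only: `copy_on_write_update` is a hypothetical helper, and files are modeled as in-memory lists of row dicts instead of Parquet files read through Spark.

```python
def copy_on_write_update(files, predicate, update, next_file_name):
    """files: dict of file name -> list of row dicts. Returns (new_files, log_entry)."""
    # Step 1: find files containing at least one row that matches the filter.
    touched = [name for name, rows in files.items() if any(predicate(r) for r in rows)]
    if not touched:
        return dict(files), {"remove": [], "add": []}
    # Step 2: load those files and update the matching rows (on copies).
    rewritten = [
        update(dict(row)) if predicate(row) else row
        for name in touched
        for row in files[name]
    ]
    # Step 3: save the result to a new file; untouched files are reused as-is.
    new_files = {n: rows for n, rows in files.items() if n not in touched}
    new_files[next_file_name] = rewritten
    # Step 4: record removed and added files; this entry is what gets committed.
    return new_files, {"remove": touched, "add": [next_file_name]}


files = {
    "f1": [{"id": 1, "city": "SZ"}, {"id": 2, "city": "BJ"}],
    "f2": [{"id": 3, "city": "SH"}],
}
new_files, entry = copy_on_write_update(
    files,
    predicate=lambda r: r["id"] == 2,
    update=lambda r: {**r, "city": "GZ"},
    next_file_name="f3",
)
```

Note that the whole of `f1` is rewritten to change one row, which is the write amplification that merge-on-read designs try to avoid.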
Apache Hudi
▪ Copy on Write table
▪ Step 1: read the records from Parquet
▪ Step 2: merge them with the incoming update records
▪ Step 3: write the merged records to files
▪ Step 4: commit to the table (commitActionExecutor)
▪ Merge on Read table
▪ Store delta records in Avro-format log files
▪ Scheduled compaction
▪ Indexing
▪ Mapping Hudi record key (in metadata column) to file group and file id
▪ In-memory, bloom filter and HBase
▪ Table level APIs
▪ upsert
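A rough sketch of how a merge-on-read file group and a key index fit together. The class and function names are hypothetical, the index is a plain in-memory dict (the simplest of the index types listed above), and the Parquet base file and Avro log are modeled as Python containers.

```python
class MergeOnReadFileGroup:
    def __init__(self, base_records):
        self.base = dict(base_records)  # record key -> row; stands in for the base file
        self.log = []                   # appended (key, row) pairs; stands in for the log file

    def append(self, key, row):
        self.log.append((key, row))

    def read(self):
        """Snapshot read: base records overlaid with log records, latest wins."""
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        """Fold the log into a new base and clear the log."""
        self.base = self.read()
        self.log = []


def upsert(index, groups, records, default_group):
    """Route each record via the key -> file group index; new keys go to default_group."""
    for key, row in records:
        group_id = index.setdefault(key, default_group)
        groups[group_id].append(key, row)


groups = {"fg-1": MergeOnReadFileGroup({"k1": {"v": 1}}), "fg-2": MergeOnReadFileGroup({})}
index = {"k1": "fg-1"}
upsert(index, groups, [("k1", {"v": 10}), ("k2", {"v": 2})], default_group="fg-2")
before_compaction = groups["fg-1"].read()
groups["fg-1"].compact()
```

Writes stay cheap because they only append to the log; the merge cost moves to readers until compaction runs.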
Apache Iceberg
▪ Copy on Write Mode
▪ File level overwrite APIs available
▪ Merge on Read mode
▪ Position based delete files and equality based delete files
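The two kinds of delete files can be illustrated with a small reader. This is a simplified sketch: `read_with_deletes` is a hypothetical helper, files are in-memory lists, and the ordering rules that decide which delete files apply to which data files are ignored.

```python
def read_with_deletes(data_files, position_deletes, equality_deletes):
    """
    data_files: dict of file name -> list of row dicts.
    position_deletes: set of (file name, row position) pairs.
    equality_deletes: list of dicts; a row is deleted if it matches every
    column/value pair of any one dict.
    """
    rows = []
    for name, file_rows in data_files.items():
        for pos, row in enumerate(file_rows):
            if (name, pos) in position_deletes:
                continue  # removed by a position-based delete
            if any(all(row.get(c) == v for c, v in d.items()) for d in equality_deletes):
                continue  # removed by an equality-based delete
            rows.append(row)
    return rows


data_files = {
    "a": [{"id": 1}, {"id": 2}, {"id": 3}],
    "b": [{"id": 4}, {"id": 2}],
}
result = read_with_deletes(data_files, {("a", 0)}, [{"id": 2}])
```

A position delete targets one row in one file, while an equality delete removes every matching row without the writer needing to know where those rows live.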
Streaming Support
Delta Lake
▪ Deeply integrated with Spark Structured Streaming
▪ As a streaming source
▪ Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
▪ Does NOT handle non-append changes unless ignoreDeletes or ignoreChanges is set
▪ As a streaming sink
▪ Append mode
▪ Complete mode
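The effect of a per-trigger file limit such as maxFilesPerTrigger can be pictured as slicing the backlog of newly committed files into bounded micro-batches. The `micro_batches` helper below is hypothetical and only illustrates the idea; it does not use Spark.

```python
def micro_batches(new_files, max_files_per_trigger):
    """Yield successive groups of at most max_files_per_trigger files."""
    for start in range(0, len(new_files), max_files_per_trigger):
        yield new_files[start:start + max_files_per_trigger]


batches = list(micro_batches(["a", "b", "c", "d", "e"], 2))
```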
Apache Hudi
▪ DeltaStreamer
▪ Exactly-once ingestion of new events from Kafka
▪ Supports JSON, Avro or custom record types
▪ Manage checkpoints, rollback & recovery
▪ Support for plugging in transformations
▪ Incremental Queries
▪ HiveIncrementalPuller
▪ As Spark data source (beginInstantTime)
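An incremental query can be sketched as a filter on commit time plus a checkpoint. The helpers below are hypothetical and work on in-memory rows; the assumption is that commit times are fixed-width timestamp strings, so string comparison matches time order, and that the begin time is exclusive.

```python
def incremental_query(records, begin_instant_time):
    """Return records committed strictly after begin_instant_time."""
    return [r for r in records if r["_hoodie_commit_time"] > begin_instant_time]


def pull(records, checkpoint):
    """Fetch the next batch and advance the checkpoint to the newest commit seen."""
    batch = incremental_query(records, checkpoint)
    new_checkpoint = max((r["_hoodie_commit_time"] for r in batch), default=checkpoint)
    return batch, new_checkpoint


records = [
    {"key": "k1", "_hoodie_commit_time": "20200501100000"},
    {"key": "k2", "_hoodie_commit_time": "20200501110000"},
    {"key": "k3", "_hoodie_commit_time": "20200501120000"},
]
batch, checkpoint = pull(records, "20200501100000")
empty_batch, same_checkpoint = pull(records, checkpoint)
```

Downstream jobs that store the checkpoint only reprocess what changed since their last run instead of rescanning the table.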
Apache Iceberg
▪ Supports Spark Structured Streaming
▪ As streaming source (WIP)
▪ Rate limit: max-files-per-batch
▪ Offset range
▪ As streaming sink
▪ Append mode
▪ Complete mode
▪ Supports Flink (WIP)
Table Schema Evolution
▪ Delta Lake
▪ Use Spark schema
▪ Allow Schema merge and overwrite
▪ Apache Hudi
▪ Use Spark schema
▪ Supports adding new fields in a stream; deleting columns is not allowed.
▪ Apache Iceberg
▪ Independent ID-based schema abstraction
▪ Full schema evolution and partition evolution
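Why an ID-based schema makes evolution safe can be shown with a toy reader. The sketch is hypothetical: stored rows are keyed by field ID rather than by name, and the current schema maps each ID to its present name.

```python
def read_row(stored_row, current_schema):
    """stored_row: field id -> value, as written. current_schema: field id -> current name."""
    return {name: stored_row.get(field_id) for field_id, name in current_schema.items()}


old_row = {1: 7, 2: "ann"}                         # written when the schema was {1: "id", 2: "name"}
schema_v2 = {1: "id", 2: "full_name", 3: "email"}  # rename field 2, add field 3
schema_v3 = {1: "id", 3: "email", 4: "name"}       # drop field 2, add a new "name" as field 4

renamed = read_row(old_row, schema_v2)
reused_name = read_row(old_row, schema_v3)
```

Because lookups go through IDs, a rename needs no data rewrite, and a new column that reuses an old name does not pick up the dropped column's values.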
Maturity
Integrations
▪ Delta Lake
▪ DSv1
▪ Delta.io connectors enable Apache Hive and Presto
▪ Apache Iceberg
▪ DSv2, InputFormat, Hive StorageHandler (WIP)
▪ Flink sink (WIP)
▪ Apache Hudi
▪ InputFormat, DSv1
▪ DeltaStreamer for data ingestion
Query Performance Optimization
▪ Delta Lake
▪ Vectorization from Spark
▪ Data skipping via statistics from Parquet
▪ Vacuum, optimize
▪ Apache Hudi
▪ Vectorization from Spark
▪ Data skipping via statistics from Parquet
▪ Auto compaction
▪ Apache Iceberg
▪ Predicate push down
▪ Native vectorized reader (WIP)
▪ Statistics from Iceberg manifest files
▪ Hidden partitioning
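Data skipping from per-file statistics comes down to a range-overlap test. The sketch below is hypothetical and format-agnostic: `file_stats` holds a (min, max) pair per column per file, wherever those statistics are actually stored.

```python
def prune_files(file_stats, column, lower, upper):
    """Keep files whose [min, max] range for `column` overlaps [lower, upper]."""
    kept = []
    for name, stats in file_stats.items():
        col_min, col_max = stats[column]
        if col_max >= lower and col_min <= upper:
            kept.append(name)
    return kept


file_stats = {
    "f1": {"ts": (0, 99)},
    "f2": {"ts": (100, 199)},
    "f3": {"ts": (200, 299)},
}
to_scan = prune_files(file_stats, "ts", 150, 210)
```

The better the data is clustered on the filtered column, the narrower each file's range and the more files a query can skip.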
Tooling
▪ Delta Lake
▪ CLI: VACUUM, HISTORY, GENERATE, CONVERT TO
▪ Apache Iceberg
▪ Metadata visible as table
▪ Built-in catalog service, enabling DDL and DML support in Spark 3.0
▪ Apache Hudi
▪ CLI with auxiliary commands (inspection, views, statistics, compaction, etc.)
▪ DeltaStreamer, HiveIncrementalPuller, HoodieDeltaStreamer
Conclusion
▪ Delta Lake has the best integration with the Spark ecosystem and can be used out of the box.
▪ Apache Iceberg has a strong design and abstractions that open up more potential.
▪ Apache Hudi provides the most conveniences for stream processing.
Thank You & Questions
