Recently, a set of modern table formats such as Delta Lake, Apache Hudi, and Apache Iceberg has emerged. Alongside the Hive Metastore, these table formats aim to solve problems that have long plagued traditional data lakes, with features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
2. About Me
▪ Software engineer at Tencent Data Lake Team
▪ Focused on the big data area for years
3. Agenda
Introduction to Delta Lake, Apache Iceberg and Apache Hudi
Key Features Comparison
▪ Transaction
▪ Data mutation
▪ Streaming support
▪ Schema evolution
Maturity
▪ Tooling
▪ Integration
▪ Performance
Conclusion
4. What features are expected of a data lake?
▪ Transaction (ACID)
▪ Data Quality
▪ Independence of Engines
▪ Unified Batch & Streaming
▪ Pluggable Storage
▪ Scalable Metadata
▪ Data Mutation
5. Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™
and big data workloads.
6. Apache Iceberg
A table format for huge analytic datasets that delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.
[Architecture diagram: DFS/cloud storage serving Spark batch & streaming, AI & reporting, interactive queries, and streaming analytics]
8. A Quick Comparison
| Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| Transaction (ACID) | Y | Y | Y |
| MVCC | Y | Y | Y |
| Time travel | Y | Y | Y |
| Schema evolution | Y | Y | Y |
| Data mutation | Y (update/delete/merge into) | N | Y (upsert) |
| Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming; Flink (WIP) | DeltaStreamer, HiveIncrementalPuller |
| File format | Parquet | Parquet, ORC, Avro | Parquet |
| Compaction/cleanup | Manual | API available (Spark action) | Manual and auto |
| Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat |
| Language support | Scala/Java/Python | Java/Python | Java/Python |
| Storage abstraction | Y | Y | N |
| API dependency | Spark-bundled | Native/engine-bundled | Spark-bundled |
| Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer |

(as of 2020-05)
14. Delta Lake
▪ Copy on Write mode
▪ Step 1: find the files to rewrite according to the filter expression
▪ Step 2: load those files as a DataFrame and update column values in the matching rows
▪ Step 3: save the DataFrame to new files
▪ Step 4: log the removed and added files in JSON and commit to the table
▪ Table level APIs (see the sketch below)
▪ update, delete (condition based)
▪ merge into (upsert a source into a target table)
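A minimal sketch of these table-level APIs, assuming a SparkSession `spark` with Delta Lake enabled, a table at the hypothetical path `/data/events`, and a source DataFrame `updatesDF`:

```scala
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/data/events")

// condition-based update: only the files containing matching rows are rewritten
deltaTable.updateExpr(
  "eventType = 'click'",
  Map("processed" -> "true"))

// condition-based delete
deltaTable.delete("date < '2020-01-01'")

// merge into: upsert the rows of `updatesDF` into the target table
deltaTable.as("t")
  .merge(updatesDF.as("s"), "t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```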
15. Apache Hudi
▪ Copy on Write table
▪ Step 1: read records from the Parquet files
▪ Step 2: merge them with the incoming update records
▪ Step 3: write the merged records to new files
▪ Step 4: commit to the table
▪ Merge on Read table
▪ Stores delta records in Avro-format log files
▪ Scheduled compaction
▪ Indexing
▪ Maps the Hudi record key (kept in a metadata column) to a file group and file ID
▪ In-memory, bloom filter and HBase implementations
▪ Table level APIs
▪ upsert
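A minimal sketch of a Hudi upsert through the Spark data source, assuming a DataFrame `df` with an `id` record-key column and a `ts` precombine column, writing to a hypothetical path (option names follow recent Hudi releases):

```scala
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .mode(SaveMode.Append)
  .save("/data/hudi/events")
```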
16. Apache Iceberg
▪ Copy on Write mode
▪ File-level overwrite APIs are available (sketched below)
▪ Merge on Read mode
▪ Position-based delete files and equality-based delete files
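Iceberg's core library exposes the file-level overwrite as an atomic snapshot operation. A sketch against the Java API (callable from Scala), where the warehouse path is hypothetical and `staleFile`/`rewrittenFile` are assumed `DataFile` handles produced by a rewrite job:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.hadoop.HadoopTables

val table = new HadoopTables(new Configuration()).load("/warehouse/db/events")

// copy-on-write at the file level: atomically swap a rewritten data file for a stale one
table.newOverwrite()
  .deleteFile(staleFile)   // assumed DataFile for the file being replaced
  .addFile(rewrittenFile)  // assumed DataFile with the merged contents
  .commit()
```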
18. Delta Lake
▪ Deeply integrated with Spark Structured Streaming
▪ As a streaming source
▪ Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
▪ Does NOT handle non-append commits by default (set ignoreDeletes or ignoreChanges to tolerate them)
▪ As a streaming sink
▪ Append mode
▪ Complete mode
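A minimal sketch of Delta as a streaming source and sink, assuming a SparkSession `spark` and hypothetical paths:

```scala
// source: rate-limited by maxFilesPerTrigger; ignoreChanges re-delivers rewritten
// rows instead of failing the stream on non-append commits
val events = spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "100")
  .option("ignoreChanges", "true")
  .load("/data/events")

// sink: append mode with a checkpoint location for progress tracking
events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/data/checkpoints/events_out")
  .start("/data/events_out")
```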
19. Apache Hudi
▪ DeltaStreamer
▪ Exactly-once ingestion of new events from Kafka
▪ Supports JSON, Avro or custom record types
▪ Manages checkpoints, rollback & recovery
▪ Support for plugging in transformations
▪ Incremental Queries
▪ HiveIncrementalPuller
▪ As Spark data source (beginInstantTime)
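A minimal sketch of an incremental query through the Spark data source, assuming a hypothetical table path and begin instant (option names follow recent Hudi releases):

```scala
// pull only the records committed after the given instant time
val newData = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200501000000")
  .load("/data/hudi/events")
```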
20. Apache Iceberg
▪ Supports Spark Structured Streaming
▪ As streaming source (WIP)
▪ Rate limit: max-files-per-batch
▪ Offset range
▪ As streaming sink
▪ Append mode
▪ Complete mode
▪ Support flink (WIP)
21. Table Schema Evolution
▪ Delta Lake
▪ Use Spark schema
▪ Allows schema merge and overwrite
▪ Apache Hudi
▪ Use Spark schema
▪ Supports adding new fields in a stream; column deletion is not allowed
▪ Apache Iceberg
▪ Independent ID-based schema abstraction
▪ Full schema evolution and partition evolution
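A short sketch contrasting the two approaches, assuming a DataFrame `dfWithNewColumn`, a hypothetical Delta path, and an Iceberg `Table` handle `table`:

```scala
// Delta Lake: merge a new column into the table schema while appending
dfWithNewColumn.write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/data/events")

// Iceberg: ID-based schema evolution through the table API; renames are safe
// because columns are tracked by ID, not by name
import org.apache.iceberg.types.Types
table.updateSchema()
  .addColumn("country", Types.StringType.get())
  .renameColumn("msg", "message")
  .commit()
```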
24. Query Performance Optimization
▪ Delta Lake
▪ Vectorization from Spark
▪ Data skipping via statistics from Parquet
▪ Vacuum, optimize
▪ Apache Hudi
▪ Vectorization from Spark
▪ Data skipping via statistics from Parquet
▪ Auto compaction
▪ Apache Iceberg
▪ Predicate pushdown
▪ Native vectorized reader (WIP)
▪ Statistics from Iceberg manifest files
▪ Hidden partitioning
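Hidden partitioning means the partition scheme is a transform of a source column, so queries simply filter on the column itself. A sketch of the Spark 3 DDL, assuming a configured Iceberg catalog named `demo` (hypothetical):

```scala
// the table is physically partitioned by day(ts); a filter such as
// "WHERE ts >= '2020-05-01'" prunes partitions with no partition column in the query
spark.sql("""
  CREATE TABLE demo.db.logs (id BIGINT, ts TIMESTAMP, msg STRING)
  USING iceberg
  PARTITIONED BY (days(ts))
""")
```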
25. Tooling
▪ Delta Lake
▪ SQL/utility commands: VACUUM, HISTORY, GENERATE, CONVERT TO DELTA
▪ Apache Iceberg
▪ Metadata visible as table
▪ Built-in catalog service enabling DDL and DML support in Spark 3.0
▪ Apache Hudi
▪ CLI and auxiliary commands (inspecting, views, statistics, compaction, etc.)
▪ DeltaStreamer, HiveIncrementalPuller, HoodieDeltaStreamer
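Iceberg's metadata tables can be read like any other table. A sketch assuming a hypothetical table `db.events` registered in the session catalog:

```scala
// snapshots, files, manifests, history, etc. are exposed as suffixed tables
spark.read.format("iceberg").load("db.events.snapshots").show()
spark.read.format("iceberg").load("db.events.files").show()
```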
26. Conclusion
▪ Delta Lake has the best integration with the Spark ecosystem and can be used out of the box.
▪ Apache Iceberg has a great design and abstractions that enable broader possibilities.
▪ Apache Hudi provides the most conveniences for stream processing.