A Thorough Comparison of Delta Lake, Iceberg and Hudi



Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with the Hive Metastore, these table formats try to solve long-standing problems of the traditional data lake with their declared features such as ACID transactions, schema evolution, upsert, time travel and incremental consumption.

  1. A Thorough Comparison of Delta Lake, Iceberg and Hudi ▪ Junjie Chen
  2. About Me
      ▪ Software engineer on the Tencent Data Lake team
      ▪ Has focused on the big data area for years
  3. Agenda
      ▪ Introduction to Delta Lake, Apache Iceberg and Apache Hudi
      ▪ Key features comparison: transaction, data mutation, streaming support, schema evolution
      ▪ Maturity: tooling, integration, performance
      ▪ Conclusion
  4. What features are expected of a data lake?
      ▪ Data quality
      ▪ Transaction (ACID)
      ▪ Independence of engines
      ▪ Unified batch and streaming
      ▪ Pluggable storage
      ▪ Scalable metadata
      ▪ Data mutation
  5. Delta Lake
      ▪ Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
  6. Apache Iceberg
      ▪ A table format for huge analytic datasets that delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.
      ▪ [Diagram labels: DFS/cloud storage; Spark batch and streaming; AI and reporting; interactive queries; streaming analytics]
  7. Apache Hudi
      ▪ Apache Hudi ingests and manages storage of large analytical datasets over DFS.
  8. A Quick Comparison (2020-05)

      | Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
      |---|---|---|---|
      | Transaction (ACID) | Y | Y | Y |
      | MVCC | Y | Y | Y |
      | Time travel | Y | Y | Y |
      | Schema evolution | Y | Y | Y |
      | Data mutation | Y (update/delete/merge into) | N | Y (upsert) |
      | Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming, Flink (WIP) | DeltaStreamer, HiveIncrementalPuller |
      | File format | Parquet | Parquet, ORC, Avro | Parquet |
      | Compaction/cleanup | Manual | API available (Spark Action) | Manual and auto |
      | Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat |
      | Multiple language support | Scala/Java/Python | Java/Python | Java/Python |
      | Storage abstraction | Y | Y | N |
      | API dependency | Spark-bundled | Native/engine-bundled | Spark-bundled |
      | Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer |

  9. Transaction
  10. Delta Lake
      ▪ Model: transaction log (DeltaLog); optimistic concurrency control; changes checkpointed into Parquet
      ▪ Atomicity guarantee: HDFS rename; S3 file write; Azure rename without overwrite
      ▪ Time travel: by timestamp or version number
  11. Apache Iceberg
      ▪ Model: snapshot; optimistic concurrency control
      ▪ Atomicity guarantee: HDFS rename; Hive metastore lock
      ▪ Time travel: by snapshot ID or timestamp
      ▪ [Diagram labels: R, W, S1, S2, S3, S4]
  12. Apache Hudi
      ▪ Model: timeline; optimistic concurrency control
      ▪ Atomicity guarantee: HDFS rename
      ▪ Time travel: Hoodie_commit_time
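
The transaction model common to the three slides above (an ordered log of commits, optimistic concurrency control, and an atomic operation that publishes each commit) can be sketched in a few lines. This is a toy model in plain Python, not the API of Delta Lake, Iceberg or Hudi: the `ToyTableLog` class, its method names and the JSON action layout are all hypothetical, and exclusive file creation stands in for the atomic rename or put-if-absent primitive that a real storage system would provide.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy commit log: one JSON file per table version (hypothetical layout)."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def try_commit(self, read_version, actions):
        """Try to publish read_version + 1; return it, or None on conflict."""
        new_version = read_version + 1
        try:
            # Mode "x" fails if the file already exists. It stands in for an
            # atomic rename-without-overwrite on a real file system.
            with open(self._path(new_version), "x") as f:
                json.dump(actions, f)
        except FileExistsError:
            return None
        return new_version

    def commit(self, actions, max_retries=5):
        """Optimistic loop: re-read the latest version and retry on conflict.

        A real implementation would also check whether the winning commit
        logically conflicts with this one before retrying; the toy skips that.
        """
        for _ in range(max_retries):
            version = self.try_commit(self.latest_version(), actions)
            if version is not None:
                return version
        raise RuntimeError("gave up after repeated commit conflicts")

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                actions = json.load(f)
            live -= set(actions.get("remove", []))
            live |= set(actions.get("add", []))
        return live


log = ToyTableLog(tempfile.mkdtemp())
v0 = log.commit({"add": ["a.parquet"]})
v1 = log.commit({"add": ["b.parquet"], "remove": ["a.parquet"]})
# A writer that still believes version 0 is the latest loses the race.
stale = log.try_commit(0, {"add": ["c.parquet"]})
```

Calling `log.snapshot(0)` replays only the first commit, which is the idea behind time travel by version number; calling `log.snapshot()` replays the whole log.
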
  13. Data Mutation
  14. Delta Lake
      ▪ Copy-on-write mode
          ▪ Step 1: find the files to delete according to the filter expression
          ▪ Step 2: load the files as a DataFrame and update column values in rows
          ▪ Step 3: save the DataFrame to new files
          ▪ Step 4: log the files to delete and to add in JSON, then commit to the table
      ▪ Table-level APIs
          ▪ update, delete (condition based)
          ▪ merge into (upsert a source into a target table)
  15. Apache Hudi
      ▪ Copy-on-write table
          ▪ Step 1: read records out of Parquet
          ▪ Step 2: merge records with the incoming update records
          ▪ Step 3: write the merged records to files
          ▪ Step 4: commit to the table (commitActionExecutor)
      ▪ Merge-on-read table
          ▪ Stores delta records in Avro-format log files
          ▪ Scheduled compaction
      ▪ Indexing
          ▪ Maps the Hudi record key (in a metadata column) to file group and file ID
          ▪ In-memory, Bloom filter and HBase
      ▪ Table-level APIs
          ▪ upsert
  16. Apache Iceberg
      ▪ Copy-on-write mode: file-level overwrite APIs available
      ▪ Merge-on-read mode: position-based delete files and equality-based delete files
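
The two mutation strategies named on the slides above can be contrasted with a small sketch. It is a toy model over in-memory lists, assuming rows are dicts with an `id` key; `copy_on_write_update`, `merge_on_read`, the `rewrite-of-` naming and the `_deleted` marker are hypothetical names chosen for illustration, not calls from any of the three projects.

```python
def copy_on_write_update(files, predicate, update):
    """Rewrite every file that contains a matching row.

    `files` maps file name -> list of row dicts. Returns the new file map
    plus the lists of removed and added file names, which is what a commit
    would record.
    """
    new_files, removed, added = dict(files), [], []
    for name, rows in files.items():
        if not any(predicate(row) for row in rows):
            continue  # files with no matching rows are left untouched
        rewritten = [update(dict(row)) if predicate(row) else dict(row) for row in rows]
        new_name = f"rewrite-of-{name}"
        del new_files[name]
        new_files[new_name] = rewritten
        removed.append(name)
        added.append(new_name)
    return new_files, removed, added


def merge_on_read(base_rows, delta_records, key="id"):
    """Apply logged delta records to base rows at read time; later records win."""
    merged = {row[key]: dict(row) for row in base_rows}
    for record in delta_records:
        if record.get("_deleted"):
            merged.pop(record[key], None)
        else:
            merged[record[key]] = {**merged.get(record[key], {}), **record}
    return [merged[k] for k in sorted(merged)]


files = {
    "f1": [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}],
    "f2": [{"id": 3, "city": "Lima"}],
}

# Copy on write: the whole of f1 is rewritten to change one row.
new_files, removed, added = copy_on_write_update(
    files,
    predicate=lambda row: row["id"] == 2,
    update=lambda row: {**row, "city": "Paris"},
)

# Merge on read: f1 stays as written; the changes live in a delta log.
view = merge_on_read(
    base_rows=files["f1"],
    delta_records=[{"id": 2, "city": "Paris"}, {"id": 1, "_deleted": True}],
)
```

The trade-off the sketch tries to show: copy on write pays the rewrite cost at write time and keeps reads simple, while merge on read keeps writes cheap and defers the merge to readers or to a later compaction.
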
  17. Streaming Support
  18. Delta Lake
      ▪ Deeply integrated with Spark Structured Streaming
      ▪ As a streaming source
          ▪ Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
          ▪ Does NOT handle non-append changes (ignoreDeletes or ignoreChanges)
      ▪ As a streaming sink: append mode, complete mode
  19. Apache Hudi
      ▪ DeltaStreamer
          ▪ Exactly-once ingestion of new events from Kafka
          ▪ Supports JSON, Avro or custom record types
          ▪ Manages checkpoints, rollback and recovery
          ▪ Supports plugging in transformations
      ▪ Incremental queries
          ▪ HiveIncrementalPuller
          ▪ As a Spark data source (beginInstantTime)
  20. Apache Iceberg
      ▪ Supports Spark Structured Streaming
          ▪ As a streaming source (WIP): rate limit (max-files-per-batch), offset range
          ▪ As a streaming sink: append mode, complete mode
      ▪ Supports Flink (WIP)
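
Two ideas recur in the streaming slides above: pulling only the commits made after a known point, and capping how many files one micro-batch may read. The sketch below illustrates both over a plain list. The function and parameter names echo `beginInstantTime` and `maxFilesPerTrigger` from the slides but are hypothetical; the timeline layout, the sortable timestamp strings and the exclusive lower bound are assumptions of this sketch.

```python
def incremental_pull(timeline, begin_instant):
    """Return commits made strictly after `begin_instant`.

    `timeline` is a list of (instant_time, files) pairs in commit order.
    Instant times are assumed to be fixed-width timestamp strings, so
    string comparison matches chronological order.
    """
    return [(instant, files) for instant, files in timeline if instant > begin_instant]


def micro_batches(files, max_files_per_trigger):
    """Split a list of files into batches of at most `max_files_per_trigger`."""
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


timeline = [
    ("20200501093000", ["a.parquet"]),
    ("20200501101500", ["b.parquet", "c.parquet"]),
    ("20200501113000", ["d.parquet"]),
]

# A consumer that has already processed the first commit asks for the rest.
new_commits = incremental_pull(timeline, begin_instant="20200501093000")
new_files = [name for _, files in new_commits for name in files]
batches = list(micro_batches(new_files, max_files_per_trigger=2))
```

A consumer would persist the last instant it processed and pass it as `begin_instant` on the next pull, which is the role a checkpoint plays in the tools listed above.
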
  21. Table Schema Evolution
      ▪ Delta Lake
          ▪ Uses the Spark schema
          ▪ Allows schema merge and overwrite
      ▪ Apache Hudi
          ▪ Uses the Spark schema
          ▪ Supports adding new fields in a stream; column deletion is not allowed
      ▪ Apache Iceberg
          ▪ Independent ID-based schema abstraction
          ▪ Full schema evolution and partition evolution
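
Why an ID-based schema makes evolution safe can be shown with a toy model. The idea is that data files store values keyed by a stable field ID, while the schema maps each ID to its current column name. The `ToySchema` class and its methods below are hypothetical and only illustrate the principle; they are not Iceberg's API.

```python
class ToySchema:
    """Toy ID-based schema: column names can change, field IDs never do."""

    def __init__(self):
        self.fields = {}   # field ID -> current column name
        self.next_id = 1   # IDs are never reused

    def add_column(self, name):
        field_id = self.next_id
        self.next_id += 1
        self.fields[field_id] = name
        return field_id

    def _id_of(self, name):
        for field_id, current in self.fields.items():
            if current == name:
                return field_id
        raise KeyError(name)

    def rename_column(self, old, new):
        self.fields[self._id_of(old)] = new

    def drop_column(self, name):
        del self.fields[self._id_of(name)]

    def project(self, stored_row):
        """Read a stored row (field ID -> value) under the current schema."""
        return {name: stored_row.get(field_id) for field_id, name in self.fields.items()}


schema = ToySchema()
id_field = schema.add_column("id")
city_field = schema.add_column("city")

# A row as it would sit in an old data file, keyed by field ID.
stored = {id_field: 1, city_field: "Oslo"}

# Renaming touches only metadata; the old file still reads correctly.
schema.rename_column("city", "town")
after_rename = schema.project(stored)

# Dropping a column and adding one with the same name yields a new ID,
# so the old values do not resurface under the new column.
schema.drop_column("town")
schema.add_column("town")
after_readd = schema.project(stored)
```

A schema that resolves columns by name or by position cannot tell the re-added `town` from the dropped one, which is the class of problem the ID-based abstraction is meant to avoid.
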
  22. Maturity
  23. Integrations
      ▪ Delta Lake
          ▪ DSv1
          ▪ Connectors enable Apache Hive and Presto
      ▪ Apache Iceberg
          ▪ DSv2, InputFormat, Hive StorageHandler (WIP)
          ▪ Flink sink (WIP)
      ▪ Apache Hudi
          ▪ InputFormat, DSv1
          ▪ DeltaStreamer for data ingestion
  24. Query Performance Optimization
      ▪ Delta Lake
          ▪ Vectorization from Spark
          ▪ Data skipping via statistics from Parquet
          ▪ Vacuum, optimize
      ▪ Apache Hudi
          ▪ Vectorization from Spark
          ▪ Data skipping via statistics from Parquet
          ▪ Auto compaction
      ▪ Apache Iceberg
          ▪ Predicate pushdown
          ▪ Native vectorized reader (WIP)
          ▪ Statistics from Iceberg manifest files
          ▪ Hidden partitioning
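
Data skipping, which appears under all three formats above, relies on per-file minimum and maximum values for a column: a file whose value range cannot overlap the query's range is never opened. Below is a minimal sketch of that pruning step, assuming the statistics are already available as a dict. The `prune_files` function and the statistics layout are hypothetical; where the statistics come from (Parquet footers or manifest files) differs by format.

```python
def prune_files(file_stats, column, low, high):
    """Keep only files whose [min, max] range for `column` overlaps [low, high].

    `file_stats` maps file name -> {column name: (min value, max value)}.
    """
    kept = []
    for name, stats in file_stats.items():
        col_min, col_max = stats[column]
        if col_max >= low and col_min <= high:
            kept.append(name)
    return kept


file_stats = {
    "f1.parquet": {"ts": (100, 199)},
    "f2.parquet": {"ts": (200, 299)},
    "f3.parquet": {"ts": (300, 399)},
}

# Query: ts BETWEEN 250 AND 320. The first file cannot contain a match.
to_scan = prune_files(file_stats, "ts", low=250, high=320)
```

Pruning is only as effective as the data layout: if every file spans the full value range of the column, no file can be skipped.
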
  25. Tooling
      ▪ Delta Lake
          ▪ CLI: VACUUM, HISTORY, GENERATE, CONVERT TO
      ▪ Apache Iceberg
          ▪ Metadata visible as tables
          ▪ Built-in catalog service that enables DDL and DML support in Spark 3.0
      ▪ Apache Hudi
          ▪ CLI with auxiliary commands (inspecting, view, statistics, compaction, etc.)
          ▪ DeltaStreamer, HiveIncrementalPuller, HoodieDeltaStreamer
  26. Conclusion
      ▪ Delta Lake has the best integration with the Spark ecosystem and can be used out of the box.
      ▪ Apache Iceberg has a great design and abstraction that enable more potential.
      ▪ Apache Hudi provides the most conveniences for stream processing.
  27. Thank You & Questions