
Presto Summit 2018 - 09 - Netflix Iceberg


Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)


  1. Iceberg: A modern table format for big data
     Ryan Blue & Parth Brahmbhatt
     July 2018 - Presto Summit
  2. Contents
     ● A Netflix use case and performance results
     ● What is Iceberg?
       ○ How large Hive tables (fail to) work
       ○ How Iceberg works and its benefits
     ● Iceberg at Netflix
       ○ Future of Netflix’s data platform
     ● Iceberg & Raptor comparison
  3. Iceberg Performance
     July 2018 - Presto Summit
  4. Case Study: Atlas
     ● Historical Atlas data:
       ○ Time-series metrics from Netflix runtime systems
       ○ 1 month: 2.7 million files in 2,688 partitions
       ○ Problem: cannot process more than a few days of data
     ● Sample query:
         select distinct tags['type'] as type
         from iceberg.atlas
         where name = 'metric-name'
           and date > 20180222 and date <= 20180228
         order by type;
  5. Case Study: Atlas Performance
     ● Hive table – with Parquet filters:
       ○ 400k+ splits per day, not combined
       ○ EXPLAIN query: 9.6 min (planning wall time)
     ● Iceberg table – partition data filtering:
       ○ 15,218 splits, combined
       ○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)
     ● Iceberg table – partition and min/max filtering:
       ○ 412 splits
       ○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning)
  6. Iceberg’s Design
     July 2018 - Presto Summit
  7. First, what is a table format?
  8. Hive Table Design
     ● Key idea: organize data in a directory tree
         date=20180513/
         |- hour=18/
         |  |- ...
         |- hour=19/
         |  |- part-000.parquet
         |  |- ...
         |  |- part-031.parquet
         |- hour=20/
         |  |- ...
         |- ...
  9. Hive Table Design
     ● Filter by directories as columns
         SELECT ... WHERE date = '20180513' AND hour = 19
         date=20180513/
         |- hour=18/
         |  |- ...
         |- hour=19/
         |  |- part-000.parquet
         |  |- ...
         |  |- part-031.parquet
         |- hour=20/
         |  |- ...
         |- ...
  10. Hive Metastore
      ● Problem: too much directory listing for large tables
      ● Solution: use HMS to track partitions
        ○ Partition key to FS location
            date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
            date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
        ○ Enables predicate push-down in HMS for (some) scans
      ● The file system still tracks the files in each partition...
  11. Design Problems
      ● Table state is stored in two places
        ○ Partitions in the Hive Metastore
        ○ Files in a FS with no transaction support
      ● Requires elaborate locking for correctness
        ○ Nothing respects the locking scheme
      ● Still requires directory listing to plan jobs
        ○ O(n) listing calls, n = # matching partitions
        ○ Eventual consistency breaks correctness
  12. Iceberg’s Design
      ● Key idea: track all files in a table over time
        ○ A snapshot is a complete list of files in a table
        ○ Each write produces and commits a new snapshot
      (diagram: snapshots S1 -> S2 -> S3 -> ...)
  13. Snapshot Design Benefits
      ● Snapshot isolation without locking
        ○ Readers use a current snapshot
        ○ Writers produce new snapshots in isolation, then commit
      ● Any change to the file list is an atomic operation
        ○ Append data across partitions
        ○ Merge or rewrite files
      (diagram: readers R use committed snapshots S1/S2/S3 while a writer W commits the next snapshot)
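      To make the snapshot model concrete, here is a minimal sketch using the
      open-source Iceberg Java API (the table path and file stats are
      illustrative assumptions, and an unpartitioned table is assumed for
      brevity; the 2018 Netflix library used essentially the same core API):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.iceberg.DataFile;
        import org.apache.iceberg.DataFiles;
        import org.apache.iceberg.Table;
        import org.apache.iceberg.hadoop.HadoopTables;

        public class AppendSketch {
          public static void main(String[] args) {
            // Load an existing table (hypothetical HDFS location).
            Table table = new HadoopTables(new Configuration())
                .load("hdfs:/warehouse/db/events");

            // Describe a data file that was already written in place (no rename).
            DataFile file = DataFiles.builder(table.spec())
                .withPath("hdfs:/warehouse/db/events/data/part-000.parquet")
                .withFileSizeInBytes(10L * 1024 * 1024)
                .withRecordCount(100_000L)
                .build();

            // Committing the append produces a new snapshot: S1 -> S2 -> S3 ...
            // Readers keep using the current snapshot; the commit is atomic.
            table.newAppend()
                .appendFile(file)
                .commit();
          }
        }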
  14. In reality, it’s a bit more complicated.
  15. Design Benefits
      ● Reads and writes are isolated and all changes are atomic
      ● No expensive or eventually-consistent FS operations:
        ○ No directory or prefix listing
        ○ No rename: data files written in place
      ● Faster scan planning, distributed across the cluster
        ○ O(1) manifest reads, not O(n) partition list calls
        ○ Upper and lower bounds used to eliminate files
      ● Reliable CBO metrics
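      As a sketch of what metadata-only planning looks like from the Java API
      (the column names are borrowed from the Atlas example; the calls are from
      the open-source library):

        import org.apache.iceberg.FileScanTask;
        import org.apache.iceberg.Table;
        import org.apache.iceberg.expressions.Expressions;

        public class ScanSketch {
          // Plan a scan without any directory listing: Iceberg reads manifests
          // and drops files whose column bounds cannot match the filter.
          static void plan(Table table) {
            Iterable<FileScanTask> tasks = table.newScan()
                .filter(Expressions.and(
                    Expressions.equal("name", "metric-name"),
                    Expressions.greaterThan("date", 20180222)))
                .planFiles();

            for (FileScanTask task : tasks) {
              System.out.println(task.file().path());
            }
          }
        }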
  16. Iceberg at Netflix
      July 2018 - Presto Summit
  17. Works as Users Expect
      ● Hidden partitioning
        ○ Partition filters derived from data filters
        ○ No more accidental full table scans
      ● Full schema evolution
        ○ Supports add, drop, and rename columns
      ● Reliable support for types
        ○ date, time, timestamp, and decimal
        ○ struct, list, map, and mixed nesting
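      A hedged sketch of both features through the Java API (the schema, field
      IDs, and column names are made up for illustration):

        import org.apache.iceberg.PartitionSpec;
        import org.apache.iceberg.Schema;
        import org.apache.iceberg.Table;
        import org.apache.iceberg.types.Types;

        public class EvolutionSketch {
          // Hidden partitioning: partition by a transform of a data column.
          // Filters on event_time become partition filters automatically.
          static final Schema SCHEMA = new Schema(
              Types.NestedField.required(1, "event_time", Types.TimestampType.withZone()),
              Types.NestedField.required(2, "name", Types.StringType.get()));

          static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
              .hour("event_time")
              .build();

          // Full schema evolution: add/drop/rename are metadata-only commits.
          static void evolve(Table table) {
            table.updateSchema()
                .addColumn("region", Types.StringType.get())
                .renameColumn("name", "metric_name")
                .commit();
          }
        }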
  18. Table Layout is Hidden
      ● Queries are not broken by layout changes
      ● Physical layout can evolve without painful migration
        ○ Mistakes can be fixed
        ○ Prototypes can move to production faster
        ○ Tables can change as volume grows over time
      ● Data Platform can transparently fix table layout
  19. Snapshot-based Tables
      ● Any write is atomic – either complete or invisible
        ○ Rewrite files instead of partitions
        ○ Tables never have partially committed data
      ● Simple, built-in change detection
        ○ Cache and materialized view maintenance
        ○ Incremental processing
      ● Data Platform can monitor and fix data files
        ○ Compact small files
        ○ Repartition to a new layout
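      The snapshot history that enables change detection is directly
      queryable; a small sketch (accessor names from the open-source Java API):

        import org.apache.iceberg.Snapshot;
        import org.apache.iceberg.Table;

        public class HistorySketch {
          // Walk the table's snapshot log, e.g. to find data added since the
          // last run of an incremental job or to invalidate caches.
          static void listSnapshots(Table table) {
            for (Snapshot snap : table.snapshots()) {
              System.out.printf("snapshot %d committed at %d%n",
                  snap.snapshotId(), snap.timestampMillis());
            }
          }
        }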
  20. Table Format Library
      ● Common implementation for table operations
        ○ Tune write options per table (Parquet row group size)
        ○ Tune read defaults once (split combination)
      ● Simple data gathering
        ○ Log scan predicates and projection to Kafka
        ○ Analyze table settings from the Iceberg table
      ● Data Platform can automate tuning configuration
        ○ Test file format tuning settings per table
        ○ Update table to affect all writes
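      Per-table write tuning is stored as table properties, so one update
      affects every subsequent writer. A sketch (the property key is the one
      used by open-source Iceberg and is an assumption for this 2018 talk):

        import org.apache.iceberg.Table;

        public class TuningSketch {
          // Set the Parquet row group size for all future writes to this table.
          static void tune(Table table) {
            table.updateProperties()
                .set("write.parquet.row-group-size-bytes",
                    Integer.toString(128 * 1024 * 1024))
                .commit();
          }
        }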
  21. Autotune: Data Librarian
      ● Current: merge service
        ○ Dedicated cluster to convert data to Parquet
        ○ Makes data available to tables after merge completes
      ● Iceberg: data is available after commit, in row-oriented format
      ● Autotune (planned)
        ○ Recommend tuning parameters
        ○ Rewrite files based on priority
        ○ Opportunistic Parquet conversion
  22. Iceberg & Raptor Comparison
      July 2018 - Presto Summit
  23. Raptor Similarities
      ● Fine-grained tracking of data
        ○ Iceberg: file-level; Raptor: shard-level
      ● Use min/max values for efficient job planning
      ● Safe atomic writes
      ● Same metadata, different purpose and design
  24. Raptor Differences
      ● Raptor targets low-latency query execution
        ○ Data stored in flash
        ○ Shard tracking in MySQL
        ○ Built for Presto
      ● Iceberg manages table metadata targeting scale
        ○ Open specification for Spark, Presto, and others
        ○ Distributed metadata workload
        ○ Atomic changes across clusters and engines
      ● Iceberg & Raptor are complementary projects
  25. Getting Started with Iceberg
      July 2018 - Presto Summit
  26. Using Iceberg
      ● github.com/Netflix/iceberg
        ○ Apache Licensed, ALv2
        ○ Spark 2.3.x data source plug-in
        ○ Pig (read-only) support in development
        ○ Planned Python support
      ● Presto Iceberg PR: coming soon!
  27. Current blocker: HMS support
  28. Questions?
      July 2018 - Presto Summit
  29. Additional Slides
      July 2018 - Presto Summit
  30. Iceberg Metadata
      ● Implementation of snapshot-based tracking
        ○ Adds table schema, partition layout, string properties
        ○ Tracks old snapshots for eventual garbage collection
      ● Table metadata is immutable and always moves forward
      ● The current snapshot (pointer) can be rolled back
      (diagram: v1.json tracks S1, S2; v2.json tracks S1, S2, S3; v3.json tracks S2, S3)
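      Rolling the current-snapshot pointer back is itself a metadata-only
      commit; a hedged sketch (manageSnapshots is the current Apache Iceberg
      API; the 2018 library exposed an equivalent rollback call, so treat the
      exact names as an assumption):

        import org.apache.iceberg.Table;

        public class RollbackSketch {
          // Move the current-snapshot pointer back to the previous snapshot.
          // This writes a new metadata version; old versions are never mutated.
          static void rollbackOne(Table table) {
            Long parentId = table.currentSnapshot().parentId();
            if (parentId != null) {
              table.manageSnapshots().rollbackTo(parentId).commit();
            }
          }
        }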
  31. Manifest Files
      ● Snapshots are split across one or more manifest files
        ○ Manifests store partition data for each data file
        ○ Reused to avoid high write volume
      (diagram: snapshots in v1.json, v2.json, and v3.json share manifest files m0.avro, m1.avro, m2.avro)
  32. Manifest File Contents
      ● Basic data file info:
        ○ File location and format
        ○ Iceberg tracking data
      ● Values to filter files for a scan:
        ○ Partition data values
        ○ Per-column lower and upper bounds
      ● Metrics for cost-based optimization:
        ○ File-level: row count, size
        ○ Column-level: value count, null count, size
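      All of the per-file metadata above surfaces on Iceberg's DataFile
      interface during planning; a sketch (accessor names from the open-source
      Java API):

        import org.apache.iceberg.DataFile;
        import org.apache.iceberg.FileScanTask;
        import org.apache.iceberg.Table;

        public class MetadataSketch {
          static void describeFiles(Table table) {
            for (FileScanTask task : table.newScan().planFiles()) {
              DataFile file = task.file();
              System.out.printf("%s format=%s rows=%d bytes=%d%n",
                  file.path(), file.format(), file.recordCount(), file.fileSizeInBytes());
              // Per-column stats used for filtering and cost-based optimization:
              //   file.lowerBounds(), file.upperBounds(),
              //   file.valueCounts(), file.nullValueCounts()
            }
          }
        }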
  33. Commits
      ● To commit, a writer must:
        ○ Note the current metadata version – the base version
        ○ Create new metadata and manifest files
        ○ Atomically swap the base version for the new version
      ● This atomic swap ensures a linear history
      ● Atomic swap is implemented by:
        ○ A custom metastore implementation
        ○ Atomic rename for HDFS or local tables
  34. Commits: Conflict Resolution
      ● Writers optimistically write new versions:
        ○ Assume that no other writer is operating
        ○ On conflict, retry based on the latest metadata
      ● To support retry, operations are structured as:
        ○ Assumptions about the current table state
        ○ Pending changes to the current table state
      ● Changes are safe if the assumptions are all true
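      The retry loop from slides 33 and 34 can be sketched generically; this
      illustrates the protocol, not Iceberg's actual implementation
      (MetadataStore and its compare-and-swap are hypothetical):

        import java.util.function.UnaryOperator;

        public class CommitSketch {
          // Hypothetical store with an atomic compare-and-swap on the version pointer.
          interface MetadataStore {
            String currentVersion();
            boolean swap(String expectedBase, String newVersion);  // atomic CAS
          }

          static void commit(MetadataStore store, UnaryOperator<String> writeNewVersion) {
            while (true) {
              String base = store.currentVersion();            // note the base version
              String candidate = writeNewVersion.apply(base);  // new metadata + manifests
              if (store.swap(base, candidate)) {
                return;  // swap succeeded: history stays linear
              }
              // Conflict: another writer won. Re-check assumptions against the
              // new base version and retry the pending changes.
            }
          }
        }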
  35. Commits: Resolution Example
      ● Use case: safely merge small files
        ○ Merge input: file1.avro, file2.avro
        ○ Merge output: merge1.parquet
      ● Rewrite operation:
        ○ Assumption: file1.avro and file2.avro are still present
        ○ Pending changes: remove file1.avro and file2.avro; add merge1.parquet
      ● Deleting file1.avro or file2.avro will cause a commit failure
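      This merge maps onto the rewrite operation in the Java API; a sketch
      (DataFile construction is elided, and Set.of requires Java 9+):

        import java.util.Set;
        import org.apache.iceberg.DataFile;
        import org.apache.iceberg.Table;

        public class RewriteSketch {
          // Replace two small Avro files with one merged Parquet file in a
          // single commit. If file1.avro or file2.avro was deleted
          // concurrently, the commit fails and can be retried.
          static void merge(Table table, DataFile file1, DataFile file2, DataFile merged) {
            table.newRewrite()
                .rewriteFiles(Set.of(file1, file2),  // assumption: still present
                              Set.of(merged))        // pending change: add merge1.parquet
                .commit();
          }
        }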
  36. Presto Iceberg Connector
      ● Why a new connector?
      ● What it can do:
        ○ Read support is done
        ○ Split planning, predicate pushdown
        ○ All Iceberg data types supported
        ○ DDL and DML in the works
      ● Transparent querying between Hive and Iceberg catalogs
  37. How it works
      ● Hive table with no partition information
      ● Uses the Iceberg API for split planning
      ● The Iceberg API prunes manifests and data files based on stats
      ● Stats pruning yields a 3x performance improvement
      ● A Parquet version upgrade is needed
