SlideShare a Scribd company logo
1 of 34
Download to read offline
Iceberg
A modern table format for big data
Ryan Blue @6d352b5d3028e4b
Owen O’Malley @owen_omalley
September 2018 - Strata NY
● A Netflix use case and performance results
● Hive tables
○ How large Hive tables work
○ Drawbacks of this table design
● Iceberg tables
○ How Iceberg addresses the challenges
○ Benefits of Iceberg’s design
● How to get started
Contents
Iceberg Performance
September 2018 - Strata NY
● Historical Atlas data:
○ Time-series metrics from Netflix runtime systems
○ 1 month: 2.7 million files in 2,688 partitions
○ Problem: cannot process more than a few days of data
● Sample query:
select distinct tags['type'] as type
from iceberg.atlas
where
name = 'metric-name' and
date > 20180222 and date <= 20180228
order by type;
Case Study: Netflix Atlas
● Hive table – with Parquet filters:
○ 400k+ splits, not combined
○ EXPLAIN query: 9.6 min (planning wall time)
● Iceberg table – partition data filtering:
○ 15,218 splits, combined
○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)
● Iceberg table – partition and min/max filtering:
○ 412 splits
○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning)
Atlas Historical Queries
What is a table format?
September 2018 - Strata NY
You meant file format, right?
● How to track what files store the table’s data.
○ Files in the table are in Avro, Parquet, ORC, etc.
● Often overlooked, but determines:
○ What guarantees are possible (like correctness)
○ How hard it is to write fast queries
○ How the table can change over time
○ Job performance
No. Table Format.
● Should be specified: must be documented and portable
● Should support expected database table behavior:
○ Atomic changes that commit all rows or nothing
○ Schema evolution without unintended consequences
○ Efficient access like predicate or projection pushdown
● Bonus features:
○ Hidden layout: no need to know the table structure
○ Layout evolution: change the table structure over time
What is a good table format?
Hive Tables
September 2018 - Strata NY
● Key idea: organize data in a directory tree
○ Partition columns become a directory level with values
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- 000000_0
| |- ...
| |- 000031_0
|- hour=20/
| |- ...
|- ...
Hive Table Design
● Filter by directories as columns
SELECT ... WHERE date = '20180513' AND hour = 19
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- 000000_0
| |- ...
| |- 000031_0
|- hour=20/
| |- ...
|- ...
Hive Table Design
● HMS keeps metadata in SQL database
○ Tracks information about partitions
○ Tracks schema information
○ Tracks table statistics
● Allows filtering by partition values
○ Filters only pushed to DB for string types
● Uses external SQL database
○ Metastore is often the bottleneck for query planning
● Only file system tracks the files in each partition…
○ No per-file statistics
Hive Metastore
● Provides snapshot isolation and atomic updates
● Transaction state is stored in the metastore
● Uses the same partition/directory layout
○ Creates new directory structure inside partitions
date=20180513/
|- hour=19/
| |- base_0000000/
| | |- bucket_00000
| | |- ...
| | |- bucket_00031
| |- delta_0000001_0000100/
| | |- bucket_00000
| | |- ...
Hive ACID layout
● Table state is stored in two places
○ Partitions in the Hive Metastore
○ Files in a file system
● Bucketing is defined by Hive’s (Java) hash implementation.
● Non-ACID layout’s only atomic operation is add partition
● Requires atomic move of objects in file system
● Still requires directory listing to plan jobs
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
Design Problems
● Partition values are stored as strings
○ Requires character escaping
○ null stored as __HIVE_DEFAULT_PARTITION__
● HMS table statistics become stale
○ Statistics have to be regenerated manually
● A lot of undocumented layout variants
● Bucket definition tied to Java and Hive
Less Obvious Problems
● Users must know and use a table’s physical layout
○ ts > X ⇒ full table scan!
○ Did you mean this?
ts > X and (d > day(X) or (d = day(X) and hr >= hour(X))
● Schema evolution rules are dependent on file format
○ CSV – by position; Avro & ORC – by name
● Unreliable: type support varies across formats
○ Which formats support decimal?
○ Does CSV support maps with struct keys?
Other Annoyances
Iceberg Tables
September 2018 - Strata NY
● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
S1 S2 S3 ...
● Snapshot isolation without locking
○ Readers use a current snapshot
○ Writers produce new snapshots in isolation, then commit
● Any change to the file list is an atomic operation
○ Append data across partitions
○ Merge or rewrite files
Snapshot Design Benefits
S1 S2 S3 ...
R W
In reality, it’s a bit more
complicated...
● Implements snapshot-based tracking
○ Adds table schema, partition layout, string properties
○ Tracks old snapshots for eventual garbage collection
● Each metadata file is immutable
● Metadata always moves forward, history is linear
● The current snapshot (pointer) can be rolled back
Iceberg Metadata
v1.json
S1 S2
v2.json
S1 S2 S3
v3.json
S2 S3
● Snapshots are split across one or more manifest files
○ A manifest stores files across many partitions
○ A partition data tuple is stored for each data file
○ Reused to avoid high write volume
Manifest Files
v2.json
S1 S2 S3
m0.avro m1.avro m2.avro
● Basic data file info:
○ File location and format
○ Iceberg tracking data
● Values to filter files for a scan:
○ Partition data values
○ Per-column lower and upper bounds
● Metrics for cost-based optimization:
○ File-level: row count, size
○ Column-level: value count, null count, size
Manifest File Contents
● To commit, a writer must:
○ Note the current metadata version – the base version
○ Create new metadata and manifest files
○ Atomically swap the base version for the new version
● This atomic swap ensures a linear history
● Atomic swap is implemented by:
○ A custom metastore implementation
○ Atomic rename for HDFS or local tables
Commits
● Writers optimistically write new versions:
○ Assume that no other writer is operating
○ On conflict, retry based on the latest metadata
● To support retry, operations are structured as:
○ Assumptions about the current table state
○ Pending changes to the current table state
● Changes are safe if the assumptions are all true
Commits: Conflict Resolution
● Use case: safely merge small files
○ Merge input: file1.avro, file2.avro
○ Merge output: merge1.parquet
● Rewrite operation:
○ Assumption: file1.avro and file2.avro are still present
○ Pending changes:
Remove file1.avro and file2.avro
Add merge1.parquet
● Deleting file1.avro or file2.avro will cause a commit failure
Commits: Resolution Example
Design Benefits
● Reads and writes are isolated and all changes are atomic
● No expensive or eventually-consistent FS operations:
○ No directory or prefix listing
○ No rename: data files written in place
● Faster scan planning
○ O(1) manifest reads, not O(n) partition list calls
○ Without listing, partition granularity can be higher
○ Upper and lower bounds used to eliminate files
● Full schema evolution: add, drop, rename, reorder columns
● Reliable support for types
○ date, time, timestamp, and decimal
○ struct, list, map, and mixed nesting
● Hidden partitioning
○ Partition filters derived from data filters
○ Supports evolving table partitioning
● Mixed file format support, reliable CBO metrics, etc.
Other Improvements
● Spark improvements
○ Standard logical plans and behavior
○ Data source v2 API revisions
● ORC improvements
○ Added additional statistics
○ Adding timestamp with local timezone
● Parquet & Avro improvements
○ Column resolution by ID
○ New materialization API
Contributions to other projects
Getting Started with Iceberg
September 2018 - Strata NY
● github.com/Netflix/iceberg
○ Apache Licensed, ALv2
○ Core Java library available from JitPack
○ Contribute with github issues and pull requests
● Supported engines:
○ Spark 2.3.x - data source v2 plug-in
○ Read-only Pig support
● Mailing list:
○ iceberg-devel@googlegroups.com
Using Iceberg
● Hive Metastore catalog (PR available)
○ Uses table locking to implement atomic commits
● Python library coming soon
● Presto support PR coming soon
Future work
Questions?
blue@apache.org
omalley@apache.org
September 2018 - Strata NY

More Related Content

What's hot

Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

What's hot (20)

Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 

Similar to Iceberg: A modern table format for big data (Strata NY 2018)

Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Stefan Krawczyk
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Stefan Krawczyk
 

Similar to Iceberg: A modern table format for big data (Strata NY 2018) (20)

ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch Fix
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
Towards Data Operations
Towards Data OperationsTowards Data Operations
Towards Data Operations
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
Druid
DruidDruid
Druid
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdf
 

Recently uploaded

+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Iceberg: A modern table format for big data (Strata NY 2018)

  • 1. Iceberg A modern table format for big data Ryan Blue @6d352b5d3028e4b Owen O’Malley @owen_omalley September 2018 - Strata NY
  • 2. ● A Netflix use case and performance results ● Hive tables ○ How large Hive tables work ○ Drawbacks of this table design ● Iceberg tables ○ How Iceberg addresses the challenges ○ Benefits of Iceberg’s design ● How to get started Contents
  • 4. ● Historical Atlas data: ○ Time-series metrics from Netflix runtime systems ○ 1 month: 2.7 million files in 2,688 partitions ○ Problem: cannot process more than a few days of data ● Sample query: select distinct tags['type'] as type from iceberg.atlas where name = 'metric-name' and date > 20180222 and date <= 20180228 order by type; Case Study: Netflix Atlas
  • 5. ● Hive table – with Parquet filters: ○ 400k+ splits, not combined ○ EXPLAIN query: 9.6 min (planning wall time) ● Iceberg table – partition data filtering: ○ 15,218 splits, combined ○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning) ● Iceberg table – partition and min/max filtering: ○ 412 splits ○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning) Atlas Historical Queries
  • 6. What is a table format? September 2018 - Strata NY
  • 7. You meant file format, right?
  • 8. ● How to track what files store the table’s data. ○ Files in the table are in Avro, Parquet, ORC, etc. ● Often overlooked, but determines: ○ What guarantees are possible (like correctness) ○ How hard it is to write fast queries ○ How the table can change over time ○ Job performance No. Table Format.
  • 9. ● Should be specified: must be documented and portable ● Should support expected database table behavior: ○ Atomic changes that commit all rows or nothing ○ Schema evolution without unintended consequences ○ Efficient access like predicate or projection pushdown ● Bonus features: ○ Hidden layout: no need to know the table structure ○ Layout evolution: change the table structure over time What is a good table format?
  • 11. ● Key idea: organize data in a directory tree ○ Partition columns become a directory level with values date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- 000000_0 | |- ... | |- 000031_0 |- hour=20/ | |- ... |- ... Hive Table Design
  • 12. ● Filter by directories as columns SELECT ... WHERE date = '20180513' AND hour = 19 date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- 000000_0 | |- ... | |- 000031_0 |- hour=20/ | |- ... |- ... Hive Table Design
  • 13. ● HMS keeps metadata in SQL database ○ Tracks information about partitions ○ Tracks schema information ○ Tracks table statistics ● Allows filtering by partition values ○ Filters only pushed to DB for string types ● Uses external SQL database ○ Metastore is often the bottleneck for query planning ● Only file system tracks the files in each partition… ○ No per-file statistics Hive Metastore
  • 14. ● Provides snapshot isolation and atomic updates ● Transaction state is stored in the metastore ● Uses the same partition/directory layout ○ Creates new directory structure inside partitions date=20180513/ |- hour=19/ | |- base_0000000/ | | |- bucket_00000 | | |- ... | | |- bucket_00031 | |- delta_0000001_0000100/ | | |- bucket_00000 | | |- ... Hive ACID layout
  • 15. ● Table state is stored in two places ○ Partitions in the Hive Metastore ○ Files in a file system ● Bucketing is defined by Hive’s (Java) hash implementation. ● Non-ACID layout’s only atomic operation is add partition ● Requires atomic move of objects in file system ● Still requires directory listing to plan jobs ○ O(n) listing calls, n = # matching partitions ○ Eventual consistency breaks correctness Design Problems
  • 16. ● Partition values are stored as strings ○ Requires character escaping ○ null stored as __HIVE_DEFAULT_PARTITION__ ● HMS table statistics become stale ○ Statistics have to be regenerated manually ● A lot of undocumented layout variants ● Bucket definition tied to Java and Hive Less Obvious Problems
  • 17. ● Users must know and use a table’s physical layout ○ ts > X ⇒ full table scan! ○ Did you mean this? ts > X and (d > day(X) or (d = day(X) and hr >= hour(X)) ● Schema evolution rules are dependent on file format ○ CSV – by position; Avro & ORC – by name ● Unreliable: type support varies across formats ○ Which formats support decimal? ○ Does CSV support maps with struct keys? Other Annoyances
  • 19. ● Key idea: track all files in a table over time ○ A snapshot is a complete list of files in a table ○ Each write produces and commits a new snapshot Iceberg’s Design S1 S2 S3 ...
  • 20. ● Snapshot isolation without locking ○ Readers use a current snapshot ○ Writers produce new snapshots in isolation, then commit ● Any change to the file list is an atomic operation ○ Append data across partitions ○ Merge or rewrite files Snapshot Design Benefits S1 S2 S3 ... R W
  • 21. In reality, it’s a bit more complicated...
  • 22. ● Implements snapshot-based tracking ○ Adds table schema, partition layout, string properties ○ Tracks old snapshots for eventual garbage collection ● Each metadata file is immutable ● Metadata always moves forward, history is linear ● The current snapshot (pointer) can be rolled back Iceberg Metadata v1.json S1 S2 v2.json S1 S2 S3 v3.json S2 S3
  • 23. ● Snapshots are split across one or more manifest files ○ A manifest stores files across many partitions ○ A partition data tuple is stored for each data file ○ Reused to avoid high write volume Manifest Files v2.json S1 S2 S3 m0.avro m1.avro m2.avro
  • 24. ● Basic data file info: ○ File location and format ○ Iceberg tracking data ● Values to filter files for a scan: ○ Partition data values ○ Per-column lower and upper bounds ● Metrics for cost-based optimization: ○ File-level: row count, size ○ Column-level: value count, null count, size Manifest File Contents
  • 25. ● To commit, a writer must: ○ Note the current metadata version – the base version ○ Create new metadata and manifest files ○ Atomically swap the base version for the new version ● This atomic swap ensures a linear history ● Atomic swap is implemented by: ○ A custom metastore implementation ○ Atomic rename for HDFS or local tables Commits
  • 26. ● Writers optimistically write new versions: ○ Assume that no other writer is operating ○ On conflict, retry based on the latest metadata ● To support retry, operations are structured as: ○ Assumptions about the current table state ○ Pending changes to the current table state ● Changes are safe if the assumptions are all true Commits: Conflict Resolution
  • 27. ● Use case: safely merge small files ○ Merge input: file1.avro, file2.avro ○ Merge output: merge1.parquet ● Rewrite operation: ○ Assumption: file1.avro and file2.avro are still present ○ Pending changes: Remove file1.avro and file2.avro Add merge1.parquet ● Deleting file1.avro or file2.avro will cause a commit failure Commits: Resolution Example
  • 28. Design Benefits ● Reads and writes are isolated and all changes are atomic ● No expensive or eventually-consistent FS operations: ○ No directory or prefix listing ○ No rename: data files written in place ● Faster scan planning ○ O(1) manifest reads, not O(n) partition list calls ○ Without listing, partition granularity can be higher ○ Upper and lower bounds used to eliminate files
  • 29. ● Full schema evolution: add, drop, rename, reorder columns ● Reliable support for types ○ date, time, timestamp, and decimal ○ struct, list, map, and mixed nesting ● Hidden partitioning ○ Partition filters derived from data filters ○ Supports evolving table partitioning ● Mixed file format support, reliable CBO metrics, etc. Other Improvements
  • 30. ● Spark improvements ○ Standard logical plans and behavior ○ Data source v2 API revisions ● ORC improvements ○ Added additional statistics ○ Adding timestamp with local timezone ● Parquet & Avro improvements ○ Column resolution by ID ○ New materialization API Contributions to other projects
  • 31. Getting Started with Iceberg September 2018 - Strata NY
  • 32. ● github.com/Netflix/iceberg ○ Apache Licensed, ALv2 ○ Core Java library available from JitPack ○ Contribute with github issues and pull requests ● Supported engines: ○ Spark 2.3.x - data source v2 plug-in ○ Read-only Pig support ● Mailing list: ○ iceberg-devel@googlegroups.com Using Iceberg
  • 33. ● Hive Metastore catalog (PR available) ○ Uses table locking to implement atomic commits ● Python library coming soon ● Presto support PR coming soon Future work