ICEBERG Dead Ahead!
Iceberg is a new table format developed at Netflix that aims to
replace older table formats like Hive, adding better flexibility as the
schema evolves, atomic operations, speed, and dependability.
To be clear, it is a table format, not a new file format: it still stores
data in ORC, Parquet, and Avro files.
What is Iceberg?
Boston Meetup 9/7/22
Sr. Solutions Architect - StarburstData -
Mid-Atlantic
Brendan Collins
One Step Back, Two Forward..
○ Hive: SQL Layer built on Hadoop for data analysis
.. but it has limitations
■ File relationship to bucketing
■ Transactional/ACID has always been squirrely
■ Metastore separation was costly computationally
■ Partitioning was rigid
■ Schema evolution
● That said, Hive was and has been critical for the
evolution of SQL querying in distributed systems
Let’s propose a Scenario with Hive
○ I currently partition all of my incoming data
by month
■ This particular month brings an unusual amount of
data growth (e.g. new product release,
economic trend, global pandemic…)
● The Hive move is to create a new table and
partition it by week or day. But now I have two
tables partitioned differently
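
By way of contrast with the two-table workaround, a table format can record which partition layout each data file was written under, so one table can hold both monthly and daily files. The sketch below is a toy model of that idea in plain Python, not Iceberg's API; `DataFile`, `may_contain`, `plan`, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataFile:
    path: str
    spec: str       # "month" or "day": the layout in force when the file was written
    partition: str  # "2022-08" under the month spec, "2022-09-03" under the day spec


def may_contain(f: DataFile, day: date) -> bool:
    """Return True if the file's partition could hold rows for `day`."""
    if f.spec == "month":
        return f.partition == day.strftime("%Y-%m")
    return f.partition == day.isoformat()


# Hypothetical file listing: August was written monthly, September daily.
FILES = [
    DataFile("a.parquet", "month", "2022-08"),
    DataFile("b.parquet", "day", "2022-09-03"),
    DataFile("c.parquet", "day", "2022-09-04"),
]


def plan(day: date) -> list:
    """Pick the files a query for `day` must read, using each file's own spec."""
    return [f.path for f in FILES if may_contain(f, day)]
```

A query for a day in August reads the single monthly file, while a query for September 3 reads only that day's file; in both cases the reader addresses one table rather than two.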
Open Table Format
Table format, not file format:
you can still use Parquet, ORC,
and Avro files. This is a table format
on top of those files
Time Travel
Yes, really! .. okay, not really, but
you can use snapshots to
roll back to previous versions
Serializable Isolation
Addresses the lack of consistency
between metadata and file
state that has plagued Hive
Evolving Schemas
Change schemas on the fly, e.g.
adding new columns in flight
Introducing Iceberg
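
The snapshot idea behind "time travel" can be illustrated with a small toy model. This is a sketch under simplifying assumptions (a snapshot is just an immutable set of file names, and the class and method names are invented for illustration); it is not how Iceberg itself is implemented or called.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table whose state is a list of immutable snapshots (sets of file names)."""
    snapshots: list = field(default_factory=lambda: [frozenset()])
    current: int = 0  # index of the snapshot readers see by default

    def commit(self, add=(), remove=()):
        """Create a new snapshot from the current one; older snapshots are untouched."""
        files = (self.snapshots[self.current] | frozenset(add)) - frozenset(remove)
        self.snapshots.append(files)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        """List the files of the current snapshot, or of an earlier one."""
        sid = self.current if snapshot_id is None else snapshot_id
        return sorted(self.snapshots[sid])

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot without rewriting data."""
        self.current = snapshot_id


table = ToyTable()
s1 = table.commit(add=["f1.parquet"])
s2 = table.commit(add=["f2.parquet"])
latest = table.scan()        # files visible in the newest snapshot
as_of_s1 = table.scan(s1)    # "time travel": read the table as of snapshot s1
table.rollback(s1)           # make s1 the current state again
after_rollback = table.scan()
```

Because a commit only appends a new snapshot, reading an old version or rolling back is a pointer change rather than a data rewrite.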
Architecture
Iceberg Example Query
Let’s Take a Quick Look
Updating Tables for GDPR
Hive tables were not initially designed for deletes as a standard process, which
GDPR requires
Recommendation Engines
Many tables were not designed to be user-focused; they were designed to be
operationally focused. How do you pull customer-focused data back without
scanning full tables?
Primary Use Cases
So What?
• Snapshot Isolation for Transactions
• Faster planning and execution
• Expose logical rather than physical layout
• Event listeners
• Efficiently make smaller updates
• All Engines see changes immediately
