For the last decade or so, big data professionals' only real option for querying a data lake was some form of the Hive model. The model itself is simple, but it made it possible to run queries over files in a distributed file system. Hive solved the initial problems facing big data engineers, yet it has significant drawbacks: it is rigid and adapts poorly to changing requirements and SLAs, whether that means updating your schema, changing the fields used to partition the data, and much more.
Iceberg is a new table format developed at Netflix that aims to replace older table formats like Hive, adding better flexibility as the schema evolves, atomic operations, speed, and higher dependability. To be clear, it is not a new file format: it still uses ORC, Parquet, and Avro files, but as a table format layered on top of them.
Major topics include Iceberg, Trino, a review of Hive & legacy table formats, and use case examples.
Meetup: https://www.meetup.com/f7324858-b804-4ed8-ba45-580c262189f1/events/288430613/
Boston Data Engineering: Iceberg Dead Ahead with Starburst
Slide 1
ICEBERG Dead Ahead!
What is Iceberg?
Boston Meetup 9/7/22
Brendan Collins
Sr. Solutions Architect - StarburstData - Mid-Atlantic
Slide 2
One Step Back, Two Forward..
○ Hive: SQL layer built on Hadoop for data analysis
.. but it has limitations
■ Rigid relationship between files and bucketing
■ Transactional/ACID support has always been squirrely
■ Metastore separation was computationally costly
■ Partitioning was rigid
■ Schema evolution was limited
● That said, Hive was and has been critical for the
evolution of SQL querying in distributed systems
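To make the partitioning rigidity concrete, here is a minimal HiveQL sketch (table and column names are illustrative, not from the deck): in Hive, the partition value is an explicit column that every writer and query must handle directly, and it is baked into the table's physical layout.

```sql
-- Hive: the partition column is part of the table's physical layout.
-- Writers must name the partition explicitly, and changing the scheme
-- later means rebuilding the table.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

INSERT INTO logs PARTITION (dt = '2022-09-07')
VALUES ('service started');

-- Queries only prune files when they filter on the partition column:
SELECT msg FROM logs WHERE dt = '2022-09-07';
```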
Slide 3
Let’s Propose a Scenario with Hive
○ I currently partition all of my incoming data
by month
■ This particular month sees a unique spike in
data growth (i.e. a new product release,
economic trend, global pandemic…)
● The Hive move is to create a new table and
partition by week or day. But now I have two
tables partitioned differently
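In Iceberg, the same scenario can be handled in place, because partition specs can evolve on an existing table. A hedged sketch using Iceberg's Spark SQL extensions (the table and column names are hypothetical):

```sql
-- Switch the partition granularity of an existing Iceberg table.
-- Old data keeps its monthly layout; new writes use the daily spec,
-- and query planning handles both transparently.
ALTER TABLE demo.sales DROP PARTITION FIELD months(order_ts);
ALTER TABLE demo.sales ADD PARTITION FIELD days(order_ts);
```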
Slide 4
Introducing Iceberg
Open Table Format
Table format, not file format: you can still use Parquet, ORC, and Avro files. This is a table format on top of those files.
Time Travel
Yes, really! .. okay not really, but you can use snapshots to roll back to previous versions.
Serializable Isolation
Addresses the lack of consistency between metadata and file state that has plagued Hive.
Evolving Schemas
Change schemas on the fly, i.e. adding new columns in flight.
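Two of these features can be sketched in Trino SQL (the catalog, schema, and table names are assumptions, and the time-travel syntax requires a recent Trino with the Iceberg connector):

```sql
-- Evolving schemas: add a column in flight
ALTER TABLE iceberg.demo.sales ADD COLUMN discount DOUBLE;

-- Time travel: read the table as of an earlier point in time
SELECT *
FROM iceberg.demo.sales
FOR TIMESTAMP AS OF TIMESTAMP '2022-09-01 00:00:00 UTC';
```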
Slide 5
Architecture
Slide 6
Iceberg Example Query
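The slide's example query is not preserved in this extraction; as a hedged stand-in, creating and querying an Iceberg table through Trino's Iceberg connector might look like this (catalog, schema, and column names are assumptions):

```sql
-- Create an Iceberg table backed by Parquet files,
-- hidden-partitioned by day of the event timestamp
CREATE TABLE iceberg.demo.events (
    event_id BIGINT,
    user_id  BIGINT,
    event_ts TIMESTAMP(6),
    payload  VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(event_ts)']
);

-- Queries filter on the source column; Iceberg prunes partitions itself
SELECT count(*)
FROM iceberg.demo.events
WHERE event_ts >= TIMESTAMP '2022-09-01 00:00:00';
```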
Slide 8
Primary Use Cases
Updating Tables for GDPR
Hive tables were not initially designed with deletes as a standard process, as is
required by GDPR.
Recommendation Engines
Many tables were not designed to be user focused; they were designed to be
operationally focused. How do you pull customer-focused data back without
scanning full tables?
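For the GDPR case, Iceberg supports row-level deletes through plain SQL rather than manual table rewrites; a hedged sketch in Trino (table and column names are hypothetical):

```sql
-- Delete a single user's rows; Iceberg records the delete atomically
-- in a new snapshot instead of rewriting the whole table by hand
DELETE FROM iceberg.demo.users
WHERE user_id = 12345;
```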
Slide 9
So What?
• Snapshot isolation for transactions
• Faster planning and execution
• Expose logical, not physical, structure
• Event listeners
• Efficiently make smaller updates
• All engines see changes immediately
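The snapshot model behind several of these points is directly inspectable: Trino's Iceberg connector exposes metadata tables alongside each table. A hedged sketch (catalog, schema, and table name are assumptions):

```sql
-- Inspect the snapshot history of an Iceberg table in Trino;
-- every commit (append, delete, overwrite) appears as a snapshot
SELECT snapshot_id, committed_at, operation
FROM iceberg.demo."sales$snapshots"
ORDER BY committed_at DESC;
```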