For the last decade or so, big data professionals' only real option for querying a data lake was some form of the Hive model. The model itself is simple, but it made it possible to run queries over files in a distributed file system. Hive solved the initial problems facing big data engineers, yet it has significant drawbacks: it is rigid and adapts poorly to changing requirements and SLAs, whether that means updating your schema, changing the fields used to partition the data, and much more.
Iceberg is a new table format developed at Netflix that aims to replace older table formats like Hive, adding better flexibility as the schema evolves, atomic operations, speed, and higher dependability. To be clear, it is not a new file format: it still uses ORC, Parquet, and Avro files, but as a table format layered on top of them.
Major topics include Iceberg, Trino, a review of Hive & legacy table formats, and use case examples.
Meetup: https://www.meetup.com/f7324858-b804-4ed8-ba45-580c262189f1/events/288430613/
Boston Data Engineering: Iceberg Dead Ahead with Starburst
Slide 1
ICEBERG Dead Ahead!
What is Iceberg?
Boston Meetup 9/7/22
Brendan Collins
Sr. Solutions Architect - StarburstData - Mid-Atlantic
Slide 2
One Step Back, Two Forward..
○ Hive: SQL layer built on Hadoop for data analysis
.. but it has limitations
■ Rigid relationship between files and bucketing
■ Transactional/ACID support has always been squirrely
■ Metastore separation was computationally costly
■ Partitioning was rigid
■ Schema evolution was limited
● That said, Hive was and has been critical for the
evolution of SQL querying in distributed systems
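To make the partitioning rigidity concrete, here is a minimal HiveQL sketch (table and column names are illustrative, not from the deck): in Hive, the partition value is an explicit column that every writer and query must handle directly, and it is baked into the table's physical layout.

```sql
-- Hive: the partition column is part of the table's physical layout.
-- Writers must name the partition explicitly, and changing the scheme
-- later means rebuilding the table.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

INSERT INTO logs PARTITION (dt = '2022-09-07')
VALUES ('service started');

-- Queries only prune files when they filter on the partition column:
SELECT msg FROM logs WHERE dt = '2022-09-07';
```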
Slide 3
Let’s Propose a Scenario with Hive
○ I currently partition all of my incoming data
by month
■ This particular month sees a unique spike in
data growth (i.e. a new product release,
economic trend, global pandemic…)
● The Hive move is to create a new table and
partition by week or day. But now I have two
tables partitioned differently
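In Iceberg, the same scenario can be handled in place, because partition specs can evolve on an existing table. A hedged sketch using Iceberg's Spark SQL extensions (the table and column names are hypothetical):

```sql
-- Switch the partition granularity of an existing Iceberg table.
-- Old data keeps its monthly layout; new writes use the daily spec,
-- and query planning handles both transparently.
ALTER TABLE demo.sales DROP PARTITION FIELD months(order_ts);
ALTER TABLE demo.sales ADD PARTITION FIELD days(order_ts);
```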
Slide 4
Introducing Iceberg
Open Table Format
Table format, not file format: you can still use Parquet, ORC, and Avro files. This is a table format on top of those files.
Time Travel
Yes, really! .. okay not really, but you can use snapshots to roll back to previous versions.
Serializable Isolation
Addresses the lack of consistency between metadata and file state that has plagued Hive.
Evolving Schemas
Change schemas on the fly, i.e. adding new columns in flight.
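Two of these features can be sketched in Trino SQL (the catalog, schema, and table names are assumptions, and the time-travel syntax requires a recent Trino with the Iceberg connector):

```sql
-- Evolving schemas: add a column in flight
ALTER TABLE iceberg.demo.sales ADD COLUMN discount DOUBLE;

-- Time travel: read the table as of an earlier point in time
SELECT *
FROM iceberg.demo.sales
FOR TIMESTAMP AS OF TIMESTAMP '2022-09-01 00:00:00 UTC';
```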
Slide 5
Architecture
Slide 6
Iceberg Example Query
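The slide's example query is not preserved in this extraction; as a hedged stand-in, creating and querying an Iceberg table through Trino's Iceberg connector might look like this (catalog, schema, and column names are assumptions):

```sql
-- Create an Iceberg table backed by Parquet files,
-- hidden-partitioned by day of the event timestamp
CREATE TABLE iceberg.demo.events (
    event_id BIGINT,
    user_id  BIGINT,
    event_ts TIMESTAMP(6),
    payload  VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(event_ts)']
);

-- Queries filter on the source column; Iceberg prunes partitions itself
SELECT count(*)
FROM iceberg.demo.events
WHERE event_ts >= TIMESTAMP '2022-09-01 00:00:00';
```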
Slide 8
Primary Use Cases
Updating Tables for GDPR
Hive tables were not initially designed with deletes as a standard process, as is
required by GDPR.
Recommendation Engines
Many tables were not designed to be user focused; they were designed to be
operationally focused. How do you pull customer-focused data back without
scanning full tables?
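For the GDPR case, Iceberg supports row-level deletes through plain SQL rather than manual table rewrites; a hedged sketch in Trino (table and column names are hypothetical):

```sql
-- Delete a single user's rows; Iceberg records the delete atomically
-- in a new snapshot instead of rewriting the whole table by hand
DELETE FROM iceberg.demo.users
WHERE user_id = 12345;
```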
Slide 9
So What?
• Snapshot isolation for transactions
• Faster planning and execution
• Expose logical, not physical, structure
• Event listeners
• Efficiently make smaller updates
• All engines see changes immediately
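The snapshot model behind several of these points is directly inspectable: Trino's Iceberg connector exposes metadata tables alongside each table. A hedged sketch (catalog, schema, and table name are assumptions):

```sql
-- Inspect the snapshot history of an Iceberg table in Trino;
-- every commit (append, delete, overwrite) appears as a snapshot
SELECT snapshot_id, committed_at, operation
FROM iceberg.demo."sales$snapshots"
ORDER BY committed_at DESC;
```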