This document discusses the Apache Iceberg table format. It begins with an agenda that covers problems with legacy systems, Hive table format issues, Apache Iceberg's surface features, its internal structure, a coding task, and a conclusion. The Hive table format tracks data at the directory level; Iceberg improves on this with features such as ACID transactions, schema evolution, and time travel queries while remaining engine and storage agnostic. The coding task demonstrates creating an Iceberg table from Harry Potter data and performing inserts, updates, and time travel queries. In summary, Iceberg provides a unified solution for various teams through an optimized table format that tracks both data and metadata.
Harry Potter & Apache Iceberg format
1. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Taras Fedorov | December 2023
Harry Potter & Apache Iceberg
Table Format: The magic of Apache Iceberg
2. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Meet our speaker!
Taras Fedorov
・In IT since 2008
・Software developer, Team Lead, Competence lead
・Scala, Java, JavaScript, Ruby
・Data engineer, Big Data, ML, HighLoad, Backend, Frontend
・In Big Data since 2015, Spark certified
・Working with Apache Iceberg ~3 years
・Grid Dynamics. Project *****
3. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Agenda
・Problem statement
・Table Format. What is it?
・Hive Table Format. What is wrong with it?
・Apache Iceberg. Surface part.
・Apache Iceberg. Underwater part.
・Coding task. Conquering the iceberg.
・Conclusion. Summary & experience.
・Q&A
4. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Brainstorm to choose …
A large company has a Data Lake / Data Warehouse.
A lot of legacy pipelines and teams:
・Dev team 1: Spark / batch processing / Parquet
・Dev team 2: Flink / streaming / Avro
・BI team: Snowflake and Hive
・Data Eng team: data constantly evolves, with new sources and changing structures
・SRE team: data consistency, tracking changes, rollback
・DevOps team: on-premise, but going to migrate to the cloud?
5. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Is it possible to be?
・Engine agnostic
・Env/Storage agnostic
・Multiformat
・Support ACID
・Performance Optimized
Is it possible to have one tool that does all this and is still useful?
6. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Table Format. What is it?
7. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Table Format is:
・An abstraction layer between:
∙ Physical data files.
∙ How they're structured to form a table.
・The way to organize & track data files in a table.
8. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Who are the competitors?
・Spark?
・Hive?
・Parquet?
・HDFS?
・S3?
9. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Why table formats, if file formats already exist?
File format: organises data in a file for general purpose.
・Parquet, Avro, ORC, JSON…
・May keep the schema inside.
・Not system-specific.
・Storage, data interchange, and processing across different systems.
Table format: organises files & directories as a table.
・Based on file formats: Parquet, ORC, …
・Keeps the schema inside.
・Tightly integrated with systems (Hive/Iceberg).
・Works seamlessly with query engines and tools.
・Important in distributed storage: HDFS, S3, …
10. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Competing table formats are:
・Cassandra Memtable
・Bigtable Tables
11. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
[Diagram labels: Batch, Streaming, ACID, Schema evolution, Partition evolution, Read/write features, Index management, Time travel; DW/Analytics, Real Time, Data Lake]
12. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Hive Table Format. What is wrong with it?
13. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Hive Table Format
・Organizes data in a directory structure.
・The folder name = table partition key.
・What about files inside the directory?
・Directories are black boxes of files.
・Any file inside is considered part of the table.
14. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Why do we need partitions (directories)?
✅ Allow reducing the load on the system.
✅ HMS reads only the data related to the requested partitions.
🚫 May produce small files.
SELECT * FROM logs WHERE dt = '2001-01-01'
AND country LIKE 'GB'
15. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Hive Table Format capabilities
✅ Add and delete partitions in a transactional manner.
🚫 Adding/deleting files happens in a filesystem that doesn’t provide transactional capabilities.
🚫 Multiple jobs modifying the same dataset isn’t a safe operation.
🚫 Hive table statistics are usually stale. (How do you collect statistics on a black box?)
🚫 Designed in the pre-cloud era, e.g. it relies on directory listings, which are costly in object storage.
16. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Summary of issues with tracking directories only
❌ No transactions → No lineage
❌ No lineage → No history
❌ No lineage → No time travel
❌ No statistics on the files → No Query Optimization Techniques
❌ No Schema Evolution (on files)
❌ No Partition Evolution
❌ Every read requires a directory listing (bad for object stores)
17. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Apache Iceberg. Surface part.
18. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Iceberg resolves:
✅ Keep all Hive advantages: partitioning, directory optimization
✅ ACID Transactions → isolated reads and writes
✅ Query Optimization Techniques → Scan planning is fast & advanced filtering
✅ Schema Evolution → add, drop, update, or rename (no side effects)
✅ Partition Evolution
✅ Compaction Management
✅ Upserts and Deletes
19. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
What does it all mean?
1. Schema Evolution
2. Partition Evolution
3. Compaction
4. Query Optimization
5. Time Travel
20. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Schema Evolution is:
・Retaining current data (old schema) alongside new data (new schema)
・No data rewrite when the schema changes
・Rules that need to be followed
21. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Schema evolution in Iceberg:
・Operations: Add/Drop/Rename/Update/Reorder
・Iceberg schema updates are metadata changes
・Schema evolution in Iceberg is independent & free of side effects
・Use cases:
∙ Changing data requirements
∙ Ingestion from different sources: data lakes
∙ Data versioning: e.g. historical
∙ Migrating to new technologies
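Why can a rename or drop be a pure metadata change? Iceberg tracks each column by a unique ID rather than by name. Below is a minimal, dependency-free sketch of that idea. It is a toy model, not the Iceberg API: `Field`, `Schema`, `Row`, and `read` are hypothetical names introduced only for illustration.

```scala
// Toy model (not Iceberg API): columns are tracked by a stable numeric ID.
object SchemaEvolutionSketch {
  final case class Field(id: Int, name: String)

  // lastId only grows, so a dropped column's ID is never reused.
  final case class Schema(fields: List[Field] = Nil, lastId: Int = 0) {
    def add(name: String): Schema =
      Schema(fields :+ Field(lastId + 1, name), lastId + 1)
    def rename(from: String, to: String): Schema =
      copy(fields = fields.map(f => if (f.name == from) f.copy(name = to) else f))
    def drop(name: String): Schema =
      copy(fields = fields.filterNot(_.name == name))
  }

  // A stored row keeps values keyed by field ID and is never rewritten.
  type Row = Map[Int, String]

  // Reading projects an old row through the current schema;
  // columns added after the row was written come back as None.
  def read(schema: Schema, row: Row): List[(String, Option[String])] =
    schema.fields.map(f => f.name -> row.get(f.id))

  def main(args: Array[String]): Unit = {
    val v1 = Schema().add("Name").add("Hair_colour")
    val oldRow: Row = Map(1 -> "Harry James Potter", 2 -> "Black")
    // Metadata-only changes: rename one column, add another.
    val v2 = v1.rename("Hair_colour", "Hair").add("Wand")
    read(v2, oldRow).foreach(println)
  }
}
```

The old row is read correctly under the new schema without touching the stored data, which is the "no rewrite" property from the slide.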
22. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Partition evolution is:
・Changing partition granularity.
・Metadata for each partition version is kept separately.
・Write queries that select the data you need.
・Automatically prunes out non-matching files.
・Use cases:
∙ Time-based partitions: peaks, or archive.
∙ Regional or geographic data: more usable.
∙ Versioning/Source/Analytics.
23. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Compaction is:
・Resolves the small files problem.
・Requires calling procedures.
・Done in a transactional manner.
・Different strategies:
∙ Bin pack
∙ Sort (order)
∙ Z-order
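To make the bin pack strategy concrete, here is a small sketch of the general idea: group small files so that each group can be rewritten as one file close to a target size. It uses a plain greedy first-fit heuristic as an assumption for illustration; it is not Iceberg's actual planning code, and `binPack` and the sizes are hypothetical.

```scala
// Toy first-fit bin packing of small files into rewrite groups (not Iceberg's planner).
object BinPackSketch {
  // Put each file into the first group that still has room; otherwise open a new group.
  def binPack(fileSizesMb: List[Int], targetMb: Int): List[List[Int]] =
    fileSizesMb.foldLeft(Vector.empty[List[Int]]) { (groups, size) =>
      val idx = groups.indexWhere(g => g.sum + size <= targetMb)
      if (idx >= 0) groups.updated(idx, groups(idx) :+ size)
      else groups :+ List(size)
    }.toList

  def main(args: Array[String]): Unit = {
    val groups = binPack(List(40, 10, 100, 30, 60, 20), targetMb = 128)
    groups.foreach(g => println(s"rewrite ${g.size} file(s) into 1 file of ${g.sum} MB"))
  }
}
```

Six small files end up in three groups, so a rewrite would leave three files instead of six.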
24. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Query Optimization is:
・Extends Spark (engine) optimization
・Uses statistics in metadata
・Use cases:
∙ Partition/File Pruning
∙ Join Optimization
∙ Data Skipping
∙ Column Pruning
∙ Predicate Pushdown
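File pruning and data skipping can be illustrated with a toy planner: given per-file statistics of the kind kept in metadata, only files that can possibly contain matching rows are scanned. This is a sketch under assumptions; `FileStats`, `plan`, and the file paths are hypothetical and do not reflect Iceberg's real classes.

```scala
// Toy scan planning with per-file min/max statistics (not Iceberg API).
object DataSkippingSketch {
  // Statistics a manifest might record for one data file.
  final case class FileStats(path: String, house: String, minId: Int, maxId: Int)

  // Keep only files whose partition matches and whose [minId, maxId] range can contain `id`.
  def plan(files: List[FileStats], house: String, id: Int): List[String] =
    files
      .filter(f => f.house == house && f.minId <= id && id <= f.maxId)
      .map(_.path)

  def main(args: Array[String]): Unit = {
    val files = List(
      FileStats("House=Gryffindor/a.parquet", "Gryffindor", 1, 50),
      FileStats("House=Gryffindor/b.parquet", "Gryffindor", 51, 90),
      FileStats("House=Slytherin/c.parquet", "Slytherin", 1, 70)
    )
    // Equivalent of: WHERE House = 'Gryffindor' AND Id = 60
    plan(files, "Gryffindor", 60).foreach(println)
  }
}
```

Two of the three files are skipped without being opened, because the decision is made from metadata alone.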
25. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Time Travel is:
・Each new state of the table is a snapshot
・Work with a previous snapshot as a table
・Use different versions of the table in one select
・Roll back to a previous version
・Use cases:
∙ Read data added in the last batch
∙ Anomaly detection. Compare updated data
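The snapshot idea can be sketched without any engine: a table is a log of snapshots, and time travel picks the snapshot that was current at a given moment. The names below (`Snapshot`, `asOf`, `addedBetween`) and the sample values are hypothetical, chosen only to mirror the use cases listed above.

```scala
// Toy snapshot log (not Iceberg API): time travel = pick the right snapshot.
object TimeTravelSketch {
  final case class Snapshot(id: Long, timestampMs: Long, rows: Set[String])

  // The snapshot that was current at `tsMs`: the latest one committed at or before it.
  def asOf(log: List[Snapshot], tsMs: Long): Option[Snapshot] =
    log.filter(_.timestampMs <= tsMs).sortBy(_.timestampMs).lastOption

  // "Read data added in the last batch": rows present in `newer` but not in `older`.
  def addedBetween(older: Snapshot, newer: Snapshot): Set[String] =
    newer.rows -- older.rows

  def main(args: Array[String]): Unit = {
    val t1 = Snapshot(1L, 100L, Set("Harry", "Ron"))
    val t2 = Snapshot(2L, 200L, Set("Harry", "Ron", "Hermione"))
    val log = List(t1, t2)
    println(asOf(log, 150L).map(_.id)) // the table as it was between T1 and T2
    println(addedBetween(t1, t2))      // two versions used in one query
  }
}
```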
26. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Apache Iceberg. Underwater part.
27. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Key concepts
・TableMetadata → Information about schema, partitioning, properties, and current and historical snapshots
・Snapshot → Contains information about the manifests that compose a table at a given point in time
・Manifest → A list of files and associated metadata such as column statistics
・Data Files → Immutable files stored in one of the supported file formats
28. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Files & layers
Catalog layer:
・Catalog → manages a collection of tables, grouped into namespaces
Metadata layer:
・Metadata file → information about a table schema, partition information, & snapshot details
・Manifest list → information about all the manifest files & anchors & additional details
・Manifest file → stores a list of data files with the column-level metrics and stats
Data layer:
・Data files → Immutable files stored in one of the supported file formats
29. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Regular SQL
SELECT * FROM table1
30. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Time travel
SELECT * FROM table1 TIMESTAMP AS OF '2021-01-28 00:00:00'
31. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Who is responsible for whom?
Iceberg Catalog = version-hint.text
├── Metadata File = v3.metadata.json
│   ├── Manifest List 1 = snap_…_b103.avro
│   │   └── Manifest 1 = …_ad19_4e.avro
│   │       └── Data File = 0000-1-aef71.parquet
│   └── Manifest List 2 = snap_…_16c3.avro
│       └── Manifest 2 (Snapshot 2) = …_98fa_77.avro
│           ├── Data File = 0000-1-aef71.parquet
│           └── Data File = 0000-3-0fa3a.parquet
└── Metadata File 1 = v2.metadata.json
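The chain of responsibility in this tree can be sketched as plain data structures: the metadata file lists snapshots, each snapshot points to a manifest list, each manifest list to manifests, and each manifest to data files. The case classes and the `dataFiles` helper are a hypothetical model, not Iceberg classes; the manifest and manifest-list file names are placeholders because the slide elides the real ones, while the data file names are taken from the slide.

```scala
// Toy model of the metadata tree (not Iceberg API).
object MetadataTreeSketch {
  final case class Manifest(path: String, dataFiles: List[String])
  final case class ManifestList(path: String, manifests: List[Manifest])
  final case class Snapshot(id: Int, manifestList: ManifestList)
  final case class TableMetadata(path: String, currentSnapshotId: Int, snapshots: List[Snapshot])

  // Resolve the data files of one snapshot by walking metadata only.
  def dataFiles(meta: TableMetadata, snapshotId: Int): List[String] =
    meta.snapshots.find(_.id == snapshotId).toList
      .flatMap(_.manifestList.manifests)
      .flatMap(_.dataFiles)
      .distinct

  // Placeholder names for the manifest files; data file names as on the slide.
  val m1 = Manifest("manifest-1.avro", List("0000-1-aef71.parquet"))
  val m2 = Manifest("manifest-2.avro", List("0000-1-aef71.parquet", "0000-3-0fa3a.parquet"))
  val v3 = TableMetadata("v3.metadata.json", currentSnapshotId = 2, snapshots = List(
    Snapshot(1, ManifestList("manifest-list-1.avro", List(m1))),
    Snapshot(2, ManifestList("manifest-list-2.avro", List(m2)))
  ))

  def main(args: Array[String]): Unit = {
    println(dataFiles(v3, v3.currentSnapshotId)) // regular SELECT: current snapshot
    println(dataFiles(v3, 1))                    // time travel: older snapshot
  }
}
```

Both snapshots share the same first data file, which shows how old and new versions coexist without copying data.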
32. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Coding task. Conquering the iceberg.
33. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Practice
・Iceberg Demo Application:
∙ Built in Scala
・Aims to provide a practical illustration of the table format and its key features
・Data from Kaggle (Harry Potter)
・Simple database & table creation in a local environment
34. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Dataset from www.kaggle.com
35. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Dataset structure
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Job: string (nullable = true)
|-- House: string (nullable = true)
|-- Wand: string (nullable = true)
|-- Patronus: string (nullable = true)
|-- Species: string (nullable = true)
|-- Blood_status: string (nullable = true)
|-- Hair_colour: string (nullable = true)
|-- Eye_colour: string (nullable = true)
|-- Loyalty: string (nullable = true)
|-- Skills: string (nullable = true)
|-- Birth: string (nullable = true)
|-- Death: string (nullable = true)
Important columns for us:
⋅ Id - it’s the id 😁
⋅ Name - human-readable value
⋅ Gender - pure data field
⋅ House - partition key
⋅ Blood_status - data that does not change
⋅ Hair_colour - data that changes
36. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Dataset App example
ID  Name                Gender  House       Blood_status      Hair_colour
1   Harry James Potter  Male    Gryffindor  Half-blood        Black
2   Ronald Bilius Wea   Male    Gryffindor  Pure-blood        Red
3   Hermione Jean Gra   Female  Gryffindor  Muggle-born       Brown
4   lbus Percival Wu    Male    Gryffindor  Half-blood        Silver| formerly
5   Rubeus Hagrid       Male    Gryffindor  Part-Human (Half  Black
37. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Live coding app
・Install required dependencies.
・Run the code that creates the DB & table (locally).
・Initial insert of data (from Kaggle).
・Upsert data (updating).
・Table Management / Time travel.
38. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
App questions
What is in the Spark config:
・spark.sql.extensions : IcebergSparkSessionExtensions
・spark.sql.catalog.harry_ns.warehouse: src/main/resources/warehouse/catalog/harry_ns/
Why do we use in code:
・GlobalTempView
・MERGE INTO
Who gives us the queries:
・SELECT * FROM harry_ns.input_table.snapshots?
・CALL harry_ns.system.expire_snapshots(...
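For orientation, a Spark session wired for Iceberg might be configured roughly as below. This is a hedged sketch, not the demo app's actual code: it assumes `spark-sql` and a matching `iceberg-spark-runtime` jar are on the classpath, and it assumes a Hadoop-type catalog (suggested by the `version-hint.text` file on the later slides). The object name, app name, and `local[*]` master are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a local Spark session with an Iceberg catalog named harry_ns.
object IcebergSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("harry-potter-iceberg") // illustrative name
      .master("local[*]")              // local run, as in the demo description
      // Iceberg SQL extensions: needed for CALL procedures such as expire_snapshots.
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      // Register a catalog called harry_ns backed by Iceberg.
      .config("spark.sql.catalog.harry_ns", "org.apache.iceberg.spark.SparkCatalog")
      // Assumption: a filesystem (Hadoop) catalog, not a Hive Metastore.
      .config("spark.sql.catalog.harry_ns.type", "hadoop")
      // Warehouse path taken from the slide.
      .config("spark.sql.catalog.harry_ns.warehouse",
        "src/main/resources/warehouse/catalog/harry_ns/")
      .getOrCreate()

    println(spark.conf.get("spark.sql.catalog.harry_ns.warehouse"))
    spark.stop()
  }
}
```

With a catalog registered this way, metadata tables such as `.snapshots` and the `harry_ns.system` procedures are provided by Iceberg rather than by Spark itself.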
39. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
After DDL
src/main/resources/warehouse
└── catalog
    └── harry_ns
        └── input_table
            └── metadata
                ├── v1.metadata.json
                └── version-hint.text
⋅ Created dirs/files in warehouse
⋅ Created version-hint.text (version 1): the entry point
⋅ Created v1.metadata.json
⋅ No snapshot yet (T0)
⋅ No data folder
Warehouse is created
40. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Insert data
Metadata layer:
⋅ version-hint.text (version 2)
⋅ Created v2.metadata.json
⋅ Contains 1st snapshot (T1)
⋅ Created manifest list snap-T1-id.avro
⋅ Created manifest -m0.avro
Data layer:
⋅ Created data folder
⋅ Created 8 partitions / 8 Parquet data files
Full-fledged warehouse is filled
41. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Upsert data
Metadata layer:
⋅ version-hint.text (version 3)
⋅ Created v3.metadata.json
⋅ Contains 2nd snapshot (T2)
⋅ Created 2nd manifest list snap-T2-id.avro
⋅ Created 2nd manifest -m1.avro
Data layer:
⋅ House=Gryffindor: 2 files (1 file added)
Present & past both exist in the warehouse
42. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
What do we expect to see?
・Data replaced with delta.
・Old data is still present in the DB.
・Old data & new data are labeled by metadata files.
・Is it possible to get “old” data?
43. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Time travel
⋅ version-hint.text (version 3)
⋅ No new metadata files
⋅ No new snapshots
⋅ Select T1 & T2 separately and together
Works with versions as tables
44. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Table Management
⋅ version-hint.text (version 4)
⋅ Created v4.metadata.json
⋅ Contains 3rd snapshot only (T3)
⋅ 1 manifest-list file snap-T3-id.avro
⋅ 2 manifests exist: -m0.avro, -m1.avro
⋅ House=Gryffindor: 1 file
No past. 1 version only
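The file counts on the insert, upsert, and table management slides follow from one rule: a data file stays on disk as long as some retained snapshot references it. The sketch below models that rule only. It assumes the upsert works copy-on-write (the changed file is replaced by a new one) and it models just snapshot expiration, not the rewrite that produces T3 in the demo; `Table`, `commit`, `expire`, and the file names are hypothetical.

```scala
// Toy model of snapshot commits and expiration (not Iceberg API).
object SnapshotLifecycleSketch {
  final case class Snapshot(id: Int, files: Set[String])

  final case class Table(snapshots: List[Snapshot] = Nil, filesOnDisk: Set[String] = Set.empty) {
    def current: Set[String] = snapshots.lastOption.map(_.files).getOrElse(Set.empty)

    // New snapshot: `added` files are written; `removed` files only leave the
    // new snapshot and stay on disk for the older snapshots that reference them.
    def commit(added: Set[String], removed: Set[String] = Set.empty): Table = {
      val nextId = snapshots.lastOption.map(_.id).getOrElse(0) + 1
      Table(snapshots :+ Snapshot(nextId, (current -- removed) ++ added), filesOnDisk ++ added)
    }

    // Keep the last `retainLast` snapshots; delete files no kept snapshot references.
    def expire(retainLast: Int): Table = {
      val kept = snapshots.takeRight(retainLast)
      Table(kept, filesOnDisk.intersect(kept.flatMap(_.files).toSet))
    }
  }

  def main(args: Array[String]): Unit = {
    val t0 = Table()                                                    // after DDL: no snapshot
    val t1 = t0.commit(Set("gryffindor-1.parquet", "slytherin-1.parquet")) // insert
    val t2 = t1.commit(Set("gryffindor-2.parquet"), removed = Set("gryffindor-1.parquet")) // upsert
    val t3 = t2.expire(retainLast = 1)                                  // table management
    println(t2.filesOnDisk) // old and new Gryffindor files coexist
    println(t3.filesOnDisk) // only files of the retained snapshot remain
  }
}
```

After the upsert both Gryffindor files exist on disk, which is what makes time travel possible; after expiration the past is gone and only one Gryffindor file is left.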
45. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
What files are left over?
src/main/resources/warehouse/…/metadata
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m0.avro
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m1.avro
├── snap-5466781927818564652-1-d23c0655-5393-4d89-9bc0-b658425e7d52.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
├── v4.metadata.json
└── version-hint.text
46. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
What did we get?
・Data files are not everything you need
・The more you do with metadata, the better
∙ Especially on large datasets
・Operations affect data in the warehouse dir:
∙ insert/update/delete
∙ table management
・Take metadata/files into account and do table management
47. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Conclusion. Summary & experience.
48. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
So is it possible?
✅ Engine agnostic
✅ Env/Storage agnostic
✅ Multiformat
✅ Support ACID
✅ Performance Optimized
49. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
This is done by
・A way of organising files on disk
・Adding metadata
・Managing metadata over the data
50. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Iceberg pros & cons from experience
Pros:
・Performance optimizations
・ACID transactions
・Time-travel queries
・Seamless schema evolution
・Efficient for incremental data updates
・Multiple processing engines
Cons:
・Learning curve
・Complexity:
∙ Must have table management
∙ Universalism (batch/streaming)
∙ Size overheads
∙ Optimistic lock
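The "optimistic lock" item refers to optimistic concurrency: a writer prepares its change against the version it read, then tries to swap the table's metadata pointer atomically, and retries if another writer got there first. Below is a minimal sketch of that pattern using a compare-and-set on a version counter. The `Catalog` class, `commit` function, and retry count are hypothetical and stand in for whatever the real catalog does.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Toy optimistic commit: atomic swap of a metadata version pointer (not Iceberg API).
object OptimisticCommitSketch {
  final class Catalog(initialVersion: Int) {
    private val pointer = new AtomicInteger(initialVersion)
    def current: Int = pointer.get()
    // Succeeds only if nobody committed since `base` was read.
    def swap(base: Int, next: Int): Boolean = pointer.compareAndSet(base, next)
  }

  // Re-read the base version and try again when the swap loses a race.
  @annotation.tailrec
  def commit(catalog: Catalog, retries: Int = 3): Boolean = {
    val base = catalog.current
    if (catalog.swap(base, base + 1)) true
    else if (retries > 0) commit(catalog, retries - 1)
    else false
  }

  def main(args: Array[String]): Unit = {
    val catalog = new Catalog(1)
    println(catalog.swap(1, 2)) // first writer wins
    println(catalog.swap(1, 3)) // second writer holds a stale base and is rejected
    println(commit(catalog))    // a retrying writer re-reads the base and succeeds
    println(catalog.current)
  }
}
```

The cost listed under cons is visible here: under heavy concurrent writes, losing writers must redo work and retry.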
51. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
When to use?
・Data Lake and Cloud Storage: optimizes list operations
・Data Volume: substantial and complex data requirements
・Incremental Data Updates and changes: data is frequently updated with small changes
・ACID Transactions: ensure data integrity and consistency in multi-user environments
・Performance Optimization: improved file layouts, metadata management, and indexing
・Time-Travel and Historical Data Analysis: auditing, trend analysis, and understanding data changes over time
・Multi-Engine Compatibility: Apache Spark, Presto, Hive, streaming
52. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Q&A
53. Grid Dynamics / Harry Potter & Apache Iceberg Table Format
5000 Executive Parkway,
Suite 520 / San Ramon, CA
650-523-5000
info@griddynamics.com
www.griddynamics.com
Grid Dynamics Holdings, Inc.
Thank you for your attention!
Salarpuria Sattva Knowledge Park,
HITEC City,
Hyderabad,
Telangana 500081
/taras.tros /tarasfedorov @taras_fedorov