Harry Potter & Apache Iceberg Table Format: The magic of Apache Iceberg

Grid Dynamics
Taras Fedorov | December 2023

Meet our speaker!
Taras Fedorov
・In IT since 2008
・Software developer, Team Lead, Competence lead
・Scala, Java, JavaScript, Ruby
・Data engineer, Big Data, ML, HighLoad, Backend, Frontend
・In Big Data since 2015, Spark certified
・Working with Apache Iceberg for ~3 years
・Grid Dynamics. Project *****

Agenda
・The problem
・Table Format. What is it?
・Hive Table Format. What is wrong with it?
・Apache Iceberg. Surface part.
・Apache Iceberg. Underwater part.
・Coding task. Conquering the iceberg.
・Conclusion. Summary & experience.
・Q&A

Brainstorm to choose …
A large company has a data lake / data warehouse,
with a lot of legacy pipelines and teams:
・Dev team 1: Spark / batch processing / Parquet
・Dev team 2: Flink / streaming / Avro
・BI team: Snowflake and Hive
・Data Eng team: data constantly evolves, with new sources and changing structures
・SRE team: data consistency, change tracking, rollback
・DevOps team: on-premise, but going to migrate to the cloud?

Is it possible to be:
・Engine agnostic
・Env/Storage agnostic
・Multiformat
・ACID compliant
・Performance optimized
Is it possible to have a single tool that does all this and is still useful?

Table Format. What is it?

Table Format is:
・An abstraction layer between:
∙ Physical data files.
∙ How they're structured to form a table.
・The way to organize & track data files in a table.

Who are the competitors?
・Spark?
・Hive?
・Parquet?
・HDFS?
・S3?

Why table formats, if file formats exist?
File format: organises data in a file for general purpose.
・Parquet, Avro, ORC, JSON…
・May keep schema inside.
・Not system-specific.
・Storage, data interchange, and processing across different systems.
Table format: organises files & directories as a table.
・Based on file formats: Parquet, ORC, …
・Keeps schema inside.
・Tightly integrated with systems (Hive/Iceberg).
・Works seamlessly with query engines and tools.
・Important in distributed storage: HDFS, S3, …

Competing table formats are:
・Cassandra Memtable
・Bigtable Tables

[Figure: feature chart. Labels: Batch, Streaming, ACID, Schema evolution, Partition evolution, Read/write features, Index management, Time travel; DW/Analytics, Real Time, Data Lake]

Hive Table Format.
What is wrong with it?

Hive Table Format
・Organizes data in a directory structure.
・The folder name = table partition key.
・What about files inside the directory?
・Directories are black boxes of files.
・Any file inside is considered part of the table.

Why do we need partitions (directories)?
✅ Reduce the load on the system.
✅ With HMS, only the data in the matching partitions is read.
SELECT * FROM logs WHERE dt = '2001-01-01'
AND country LIKE 'GB'
🚫 May produce small files
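To make the directory-as-partition idea concrete, here is a minimal sketch in plain Python, not Hive itself. The file paths are hypothetical; the point is only that partition values live in the folder names, so a filter on dt and country can discard whole directories without opening any file.

```python
from pathlib import PurePosixPath

# Hypothetical listing of a Hive-style table partitioned by dt and country.
FILES = [
    "logs/dt=2001-01-01/country=GB/part-0000.parquet",
    "logs/dt=2001-01-01/country=US/part-0000.parquet",
    "logs/dt=2001-01-02/country=GB/part-0000.parquet",
]

def partition_values(path):
    """Extract key=value pairs from the directory names of a file path."""
    dirs = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(d.split("=", 1) for d in dirs if "=" in d)

def prune(files, **wanted):
    """Keep only files whose partition directories match every wanted value."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]

selected = prune(FILES, dt="2001-01-01", country="GB")
print(selected)
```

Note that the sketch still needs the full file listing up front, which is exactly the listing cost the following slides complain about.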

Hive Table Format capabilities
✅ Add and delete partitions in a transactional manner.
🚫 Cannot add/delete files transactionally: the filesystem doesn't provide transactional capabilities.
🚫 Multiple jobs modifying the same dataset isn't a safe operation.
🚫 Hive table statistics are usually stale. (How do you gather statistics on a black box?)
🚫 Designed in the pre-cloud era, e.g. it relies on directory listings, which are slow in object storage.

Summary of issues with tracking directories only
❌ No transactions → No lineage
❌ No lineage → No history
❌ No lineage → No time travel
❌ No statistics on the files → No query optimization techniques
❌ No schema evolution (on files)
❌ No partition evolution
❌ Every read needs a directory listing (bad for object stores)

Apache Iceberg. Surface part.

Iceberg resolves:
✅ Keeps all Hive advantages: partitioning, directory optimization
✅ ACID Transactions → isolated reads and writes
✅ Query Optimization Techniques → fast scan planning & advanced filtering
✅ Schema Evolution → add, drop, update, or rename (no side effects)
✅ Partition Evolution
✅ Compaction Management
✅ Upserts and Deletes

What is it all?
1. Schema Evolution
2. Partition Evolution
3. Compaction
4. Query Optimization
5. Time Travel

Schema Evolution is:
・Retaining existing data (old schema) alongside new data (new schema)
・No data rewrite when the schema changes
・Rules must be followed

Schema evolution in Iceberg:
・Operations: Add/Drop/Rename/Update/Reorder
・Iceberg schema updates are metadata changes
・Schema evolution in Iceberg is independent & free of side effects
・Use cases:
∙ Changing data requirements
∙ Ingestion from different sources: data lakes
∙ Data versioning: e.g. historical
∙ Migrating to new technologies
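A toy model of why a rename or an added column can be a metadata-only change. It assumes, as Iceberg does, that columns are tracked by a stable field ID rather than by name; everything else here (the dict layout, the new column names Hair_color and Nickname) is hypothetical and purely illustrative.

```python
# Schema: stable field ID -> current column name.
schema_v1 = {1: "Id", 2: "Name", 3: "Hair_colour"}

# Rows as stored in an "old" data file: keyed by field ID, never by name.
old_file_rows = [{1: "1", 2: "Harry James Potter", 3: "Black"}]

# Evolve the schema: rename field 3 and add field 4.
# Only this mapping (metadata) changes; old_file_rows is not rewritten.
schema_v2 = {**schema_v1, 3: "Hair_color", 4: "Nickname"}

def read(rows, schema):
    """Project stored rows through a schema; fields absent in old files read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]

rows_v2 = read(old_file_rows, schema_v2)
print(rows_v2)
```

Because the stored rows never mention column names, the old file stays valid under both schema versions.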

Partition evolution is:
・Changing partition granularity.
・Metadata for each partition is kept separately.
・Write queries that select the data you need.
・Non-matching files are pruned automatically.
・Use cases:
∙ Time-based partitions: peaks, or archive.
∙ Regional or geographic data: more usable.
∙ Versioning/Source/Analytics.
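A rough sketch of how old and new partition layouts can coexist in one table. It assumes each data file remembers which partition spec it was written with, and that planning applies each file's own spec. The spec IDs, file names, and dict layout are hypothetical, not Iceberg's actual structures.

```python
from datetime import date

# Partition transforms keyed by a hypothetical spec ID.
SPECS = {
    0: lambda d: d.strftime("%Y-%m"),  # spec 0: partition by month
    1: lambda d: d.isoformat(),        # spec 1: partition by day
}

# Each data file records its spec and the partition value it belongs to.
files = [
    {"path": "f1", "spec": 0, "partition": "2023-11"},
    {"path": "f2", "spec": 0, "partition": "2023-12"},
    {"path": "f3", "spec": 1, "partition": "2023-12-05"},
    {"path": "f4", "spec": 1, "partition": "2023-12-06"},
]

def plan(files, day):
    """Select files that may hold rows for `day`, using each file's own spec."""
    return [f["path"] for f in files if SPECS[f["spec"]](day) == f["partition"]]

planned = plan(files, date(2023, 12, 5))
print(planned)
```

The monthly file f2 is still read because it may contain the requested day, while the daily files are pruned more precisely; no old file had to be rewritten when the granularity changed.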

Compaction is:
・Resolves the small files problem.
・Requires calling procedures.
・Done in a transactional manner.
・Different strategies:
∙ Bin pack
∙ Order (sort)
∙ Z-order
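The bin-pack idea can be sketched with a generic greedy first-fit-decreasing grouping. This is a textbook heuristic used here only for illustration; it is not claimed to be the algorithm behind Iceberg's rewrite procedure, and the file sizes and 128 MB target are made up.

```python
def bin_pack(file_sizes_mb, target_mb):
    """Greedy first-fit-decreasing: group files into bins of at most target_mb."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)  # fits into an existing group
                break
        else:
            bins.append([size])  # no group had room: start a new one
    return bins

small_files = [60, 10, 50, 40, 30, 20]  # hypothetical sizes in MB
groups = bin_pack(small_files, target_mb=128)
print(groups)
```

Each resulting group would be rewritten as one larger file, so six small files become two in this example.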

Query Optimization is:
・Extends Spark (engine) optimization
・Uses statistics stored in metadata
・Use cases:
∙ Partition/File Pruning
∙ Join Optimization
∙ Data Skipping
∙ Column Pruning
∙ Predicate Pushdown
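Data skipping can be illustrated with per-file min/max statistics, the kind of column-level metrics a manifest keeps. The sketch below is a simplification: the FileStats class, the paths, and the id ranges are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_id: int  # smallest id value stored in the file
    max_id: int  # largest id value stored in the file

# Hypothetical manifest entries.
manifest = [
    FileStats("data/a.parquet", 1, 100),
    FileStats("data/b.parquet", 101, 200),
    FileStats("data/c.parquet", 201, 300),
]

def files_for_range(entries, lo, hi):
    """Return paths whose [min_id, max_id] range overlaps the query range [lo, hi]."""
    return [e.path for e in entries if e.max_id >= lo and e.min_id <= hi]

# WHERE id BETWEEN 150 AND 210: a.parquet can be skipped without being opened.
to_scan = files_for_range(manifest, 150, 210)
print(to_scan)
```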

Time Travel is:
・Each new state of the table is a snapshot
・Work with a previous snapshot as a table
・Use different versions of the table in one select
・Roll back to a previous version
・Use cases:
∙ Read data added by the last batch
∙ Anomaly detection: compare updated data

Apache Iceberg. Underwater part.

Key concepts
・TableMetadata → Information about schema, partitioning, properties, and current and historical snapshots
・Snapshot → Contains information about the manifests that compose a table at a given point in time
・Manifest → A list of files and associated metadata such as column statistics
・Data Files → Immutable files stored in one of the supported file formats

Files & layers
Catalog layer
・Catalog → manages a collection of tables, grouped into namespaces
Metadata layer
・Metadata file → information about a table schema, partition information, & snapshot details
・Manifest list → information about all the manifest files & anchors & additional details
・Manifest file → stores a list of data files with the column-level metrics and stats
Data layer
・Data files → Immutable files stored in one of the supported file formats

Regular SQL
SELECT * FROM table1

Time travel
SELECT * FROM table1 TIMESTAMP AS OF '2021-01-28 00:00:00'

Who is responsible for whom?
Iceberg Catalog = version-hint.text
├── Metadata File = v3.metadata.json
│   ├── Manifest List 1 = snap_…_b103.avro
│   │   └── Manifest 1 = …_ad19_4e.avro
│   │       └── Data File = 0000-1-aef71.parquet
│   └── Manifest List 2 = snap_…_16c3.avro
│       └── Manifest 2 (Snapshot 2) = …_98fa_77.avro
│           ├── Data File = 0000-1-aef71.parquet
│           └── Data File = 0000-3-0fa3a.parquet
└── Metadata File 1 = v2.metadata.json
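The chain of responsibility above can be walked with a small in-memory model. This is a simplified sketch: the keys "manifest-list-1", "manifest-1", and so on are placeholder names standing in for the elided file names, the timestamps are invented, and real metadata files carry far more fields.

```python
# Manifest -> data files it tracks (data file names taken from the tree above).
manifests = {
    "manifest-1": ["0000-1-aef71.parquet"],
    "manifest-2": ["0000-1-aef71.parquet", "0000-3-0fa3a.parquet"],
}
# Manifest list (one per snapshot) -> manifests.
manifest_lists = {
    "manifest-list-1": ["manifest-1"],
    "manifest-list-2": ["manifest-2"],
}
# Metadata file: knows every snapshot and which one is current.
metadata_file = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1000, "manifest-list": "manifest-list-1"},
        {"snapshot-id": 2, "timestamp-ms": 2000, "manifest-list": "manifest-list-2"},
    ],
}

def data_files(metadata, snapshot_id=None):
    """Resolve a snapshot (current by default) down to its data files."""
    sid = metadata["current-snapshot-id"] if snapshot_id is None else snapshot_id
    snap = next(s for s in metadata["snapshots"] if s["snapshot-id"] == sid)
    files = []
    for manifest in manifest_lists[snap["manifest-list"]]:
        files.extend(manifests[manifest])
    return files

def snapshot_as_of(metadata, timestamp_ms):
    """Time travel: the latest snapshot committed at or before timestamp_ms."""
    older = [s for s in metadata["snapshots"] if s["timestamp-ms"] <= timestamp_ms]
    return max(older, key=lambda s: s["timestamp-ms"])["snapshot-id"]

print(data_files(metadata_file))  # regular read: current snapshot
print(data_files(metadata_file, snapshot_as_of(metadata_file, 1500)))  # time travel
```

A regular read and a time-travel read differ only in which snapshot is picked; no data file is copied or moved.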

Coding task. Conquering the iceberg.

Practice
・Iceberg Demo Application:
∙ Built in Scala
・Aims to provide a practical illustration of the table format and its key features
・Data from Kaggle (Harry Potter)
・Simple database & table creation in a local environment

Dataset from www.kaggle.com

Dataset structure
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Job: string (nullable = true)
|-- House: string (nullable = true)
|-- Wand: string (nullable = true)
|-- Patronus: string (nullable = true)
|-- Species: string (nullable = true)
|-- Blood_status: string (nullable = true)
|-- Hair_colour: string (nullable = true)
|-- Eye_colour: string (nullable = true)
|-- Loyalty: string (nullable = true)
|-- Skills: string (nullable = true)
|-- Birth: string (nullable = true)
|-- Death: string (nullable = true)
Important columns for us:
⋅ Id - the id 😁
⋅ Name - a human-readable value
⋅ Gender - a pure data field
⋅ House - the partition key
⋅ Blood_status - data that does not change
⋅ Hair_colour - data that will change

Dataset App example
ID  Name                Gender  House       Blood_status      Hair_colour
1   Harry James Potter  Male    Gryffindor  Half-blood        Black
2   Ronald Bilius Wea   Male    Gryffindor  Pure-blood        Red
3   Hermione Jean Gra   Female  Gryffindor  Muggle-born       Brown
4   Albus Percival Wu   Male    Gryffindor  Half-blood        Silver| formerly
5   Rubeus Hagrid       Male    Gryffindor  Part-Human (Half  Black

Live coding app
・Install the required dependencies.
・Run the code that creates the DB & table (locally).
・Initial insert of data (from Kaggle).
・Upsert data (updating).
・Table management / time travel.

App questions
What is in the Spark config:
・spark.sql.extensions: IcebergSparkSessionExtensions
・spark.sql.catalog.harry_ns.warehouse: src/main/resources/warehouse/catalog/harry_ns/
Why do we use in the code:
・GlobalTempView
・MERGE INTO
What gives us these queries:
・SELECT * FROM harry_ns.input_table.snapshots
・CALL harry_ns.system.expire_snapshots(...

After DDL: warehouse is created
src/main/resources/warehouse
└── catalog
    └── harry_ns
        └── input_table
            └── metadata
                ├── v1.metadata.json
                └── version-hint.text
⋅ Created dirs/files in the warehouse
⋅ Created version-hint.text (version 1): the entry point
⋅ Created v1.metadata.json
⋅ No snapshot yet (T0)
⋅ No data folder

Insert data: full-fledged warehouse, filled
⋅ version-hint.text (version 2)
⋅ Contains the 1st snapshot (T1)
⋅ Created manifest list snap-T1-id.avro
⋅ Created manifest -m0.avro
⋅ Created data folder
⋅ Created 8 partitions / 8 Parquet data files
[Figure: warehouse directory tree, split into metadata layer and data layer]

Upsert data: today & the past both exist in the warehouse
⋅ Contains the 2nd snapshot (T2)
⋅ Created 2nd manifest list snap-T2-id.avro
⋅ Created 2nd manifest -m1.avro
⋅ House=Gryffindor has 2 files (1 file added)
[Figure: warehouse directory tree, split into metadata layer and data layer]

What do we expect to see?
・Data replaced with the delta.
・Old data still present in the DB.
・Old data & new data labeled by metadata files.
・Is it possible to get the “old” data?

Time travel: works with versions as tables
⋅ No new snapshots or metadata
⋅ Select T1 & T2 separately and together

Table Management: no past, 1 version only
⋅ Contains the 3rd snapshot only (T3)
⋅ 1 manifest list file snap-T3-id.avro
⋅ 2 manifests exist: -m0.avro, -m1.avro
⋅ House=Gryffindor has 1 file

What files are left over?
src/main/resources/warehouse/…/metadata
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m0.avro
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m1.avro
├── snap-5466781927818564652-1-d23c0655-5393-4d89-9bc0-b658425e7d52.avro
└── version-hint.text
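Why some files disappear after table management can be sketched as a reachability check: a data file may be deleted once no retained snapshot references it. This is a toy model only. The snapshot contents and file names are hypothetical (loosely following the T1/T2/T3 steps of the demo), and the real expire_snapshots procedure also handles manifests, manifest lists, and retention settings.

```python
# Snapshot -> data files it references (hypothetical names).
snapshots = {
    "T1": {"gryffindor-1.parquet", "slytherin-1.parquet"},                          # initial insert
    "T2": {"gryffindor-1.parquet", "gryffindor-2.parquet", "slytherin-1.parquet"},  # upsert
    "T3": {"gryffindor-compacted.parquet", "slytherin-1.parquet"},                  # after compaction
}

def expire(snapshots, keep):
    """Drop snapshots not in `keep`; return retained snapshots and orphaned files."""
    retained = {sid: files for sid, files in snapshots.items() if sid in keep}
    reachable = set().union(*retained.values())
    all_files = set().union(*snapshots.values())
    return retained, all_files - reachable

retained, deleted = expire(snapshots, keep={"T3"})
print(sorted(deleted))
```

Files shared with the retained snapshot, such as slytherin-1.parquet here, survive even though the snapshots that first wrote them are gone.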

What did we learn?
・Data files are not everything you need
・The more you do with metadata, the better
∙ Especially on large datasets
・Operations affect data in the warehouse dir:
∙ insert/update/delete
∙ table management
・Take metadata and files into account, and do table management

Conclusion.
Summary & experience.

So is it possible?
✅ Engine agnostic
✅ Env/Storage agnostic
✅ Multiformat
✅ Support ACID
✅ Performance Optimized

This is done by
・The way files are organised on disk
・Adding metadata
・Managing metadata over the data

Iceberg pros & cons from experience
Pros
・Performance optimizations
・ACID transactions
・Time-travel queries
・Seamless schema evolution
・Efficient for incremental data updates
・Multiple processing engines
Cons
・Learning curve
・Complexity:
∙ Must have table management
∙ Universalism (batch/streaming)
∙ Size overheads
∙ Optimistic locking

When to use?
・Data Lake and Cloud Storage: optimized list operations
・Data Volume: substantial and complex data requirements
・Incremental Data Updates and Changes: data is frequently updated with small changes
・ACID Transactions: ensure data integrity and consistency in multi-user environments
・Performance Optimization: improved file layouts, metadata management, and indexing
・Time Travel and Historical Data Analysis: auditing, trend analysis, and understanding data changes over time
・Multi-Engine Compatibility: Apache Spark, Presto, Hive, streaming

Q&A

Thank you for your attention!
/taras.tros /tarasfedorov @taras_fedorov
Grid Dynamics Holdings, Inc.
5000 Executive Parkway, Suite 520 / San Ramon, CA
650-523-5000
info@griddynamics.com
www.griddynamics.com
Salarpuria Sattva Knowledge Park, HITEC City, Hyderabad, Telangana 500081
