Grid Dynamics / Harry Potter & Apache Iceberg Table Format
Taras Fedorov | December 2023
Harry Potter & Apache Iceberg Table Format: The magic of Apache Iceberg

Meet our speaker!
Taras Fedorov
・In IT since 2008
・Software developer, Team Lead, Competence Lead
・Scala, Java, JavaScript, Ruby
・Data engineer, Big Data, ML, HighLoad, Backend, Frontend
・In Big Data since 2015, Spark certified
・Working with Apache Iceberg for ~3 years
・Grid Dynamics. Project *****

Agenda
・The problem
・Table Format. What is it?
・Hive Table Format. What is wrong with it?
・Apache Iceberg. Surface part.
・Apache Iceberg. Underwater part.
・Coding task. Conquering the iceberg.
・Conclusion. Summary & experience.
・Q&A

Brainstorm to choose …
A large company has a Data Lake / Data Warehouse, with a lot of legacy pipelines and teams:
・Dev team 1: Spark / batch processing / Parquet
・Dev team 2: Flink / streaming / Avro
・BI team: Snowflake and Hive
・Data Eng team: data constantly evolves, with new sources and changing structures
・SRE team: data consistency, change tracking, rollback
・DevOps team: on-premise, but going to migrate to the cloud?

Is it possible to be all of this?
・Engine agnostic
・Env/Storage agnostic
・Multiformat
・Support ACID
・Performance optimized
Is it possible to have one tool that covers all of this and is still useful?

Table Format. What is it?

Table Format is:
・An abstraction layer between:
∙ The physical data files.
∙ How they are structured to form a table.
・A table format is the way to organize & track the data files in a table.

Who are the competitors?
・Spark?
・Hive?
・Parquet?
・HDFS?
・S3?

Why table formats, if file formats already exist?
Table format: organises files & directories as a table.
・Based on file formats: Parquet, ORC, …
・Keeps the schema inside.
・Tightly integrated with systems (Hive/Iceberg).
・Works seamlessly with query engines and tools.
・Important in distributed storage: HDFS, S3, …
File format: organises data within a file, for general purposes.
・Parquet, Avro, ORC, JSON…
・May keep the schema inside.
・Not system-specific.
・Used for storage, data interchange, and processing across different systems.

Competing table formats are:
・Cassandra Memtable
・Bigtable Tables

[Figure: comparison of table formats. Recoverable labels: Batch, Streaming, ACID, Schema evolution, Partition evolution, Read/write features, Index management, Time travel, DW/Analytics, Real Time, Data Lake]

Hive Table Format. What is wrong with it?

Hive Table Format
・Data is organized in a directory structure.
・The folder name = table partition key.
・What about the files inside a directory?
・Directories are black boxes of files.
・Any file inside is considered part of the table.

Why do we need partitions (directories)?
✅ They reduce the load on the system.
✅ With HMS (Hive Metastore), only the data in the partitions related to the query is read.
🚫 May produce small files.
SELECT * FROM logs WHERE dt = '2001-01-01'
AND country LIKE 'GB'
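
The directory-as-partition idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the file listing below is made up to match the `logs` query on the slide, and `partition_values` and `prune` are hypothetical helper names, not Hive APIs.

```python
from pathlib import PurePosixPath

# Hypothetical listing of a Hive-style table directory.
FILES = [
    "logs/dt=2001-01-01/country=GB/part-0.parquet",
    "logs/dt=2001-01-01/country=GB/part-1.parquet",
    "logs/dt=2001-01-01/country=US/part-0.parquet",
    "logs/dt=2001-01-02/country=GB/part-0.parquet",
]

def partition_values(path: str) -> dict:
    """Read partition keys and values from the 'key=value' directory names."""
    parts = PurePosixPath(path).parent.parts
    return dict(p.split("=", 1) for p in parts if "=" in p)

def prune(files, **wanted):
    """Keep only files whose directories match every requested partition value."""
    return [f for f in files
            if all(partition_values(f).get(k) == v for k, v in wanted.items())]

# Every file under a matching directory is treated as table data:
# nothing but the path says whether a file belongs to the table.
selected = prune(FILES, dt="2001-01-01", country="GB")
print(selected)
```

Only the directory names are inspected, which is why pruning is cheap, and also why the files themselves stay a black box.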

Hive Table Format capabilities
✅ Partitions can be added and deleted in a transactional manner.
🚫 Files are added/deleted in a filesystem that doesn't provide transactional capabilities.
🚫 Multiple jobs modifying the same dataset is not a safe operation.
🚫 Hive table statistics are usually stale. (How do you collect statistics on a black box?)
🚫 Designed in the pre-cloud era, e.g. it relies on directory listings in the object storage.

Summary of the issues with tracking directories only
❌ No transactions → no lineage
❌ No lineage → no history
❌ No lineage → no time travel
❌ No statistics on the files → no query optimization techniques
❌ No schema evolution (on files)
❌ No partition evolution
❌ A directory listing is needed every time (bad for object stores)

Apache Iceberg. Surface part.

Iceberg resolves:
✅ Keeps all Hive advantages: partitioning, directory optimization
✅ ACID transactions → isolated reads and writes
✅ Query optimization techniques → fast scan planning & advanced filtering
✅ Schema evolution → add, drop, update, or rename (no side effects)
✅ Partition evolution
✅ Compaction management
✅ Upserts and deletes

What is it all?
1. Schema Evolution
2. Partition Evolution
3. Compaction
4. Query Optimization
5. Time Travel

Schema Evolution is:
・Keeping existing data (old schema) alongside new data (new schema)
・No data rewrite when the schema changes
・Certain rules need to be followed

Schema evolution in Iceberg:
・Operations: Add/Drop/Rename/Update/Reorder
・Iceberg schema updates are metadata changes
・Schema evolution in Iceberg is independent & free of side effects
・Use cases:
∙ Changing data requirements
∙ Ingestion from different sources: data lakes
∙ Data versioning: e.g. historical
∙ Migrating to new technologies
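
Why a rename or an added column can be a metadata-only change is easier to see in a small model. The sketch below assumes columns are tracked by a numeric field ID rather than by name; the renamed column `Hair`, the added column `Wand`, and the `read` helper are illustrative assumptions, not the demo's code.

```python
# Schema: field id -> column name. Data files store values keyed by field id.
schema_v1 = {1: "Id", 2: "Name", 3: "Hair_colour"}

# A data file written under schema_v1; it is never rewritten afterwards.
old_file = [{1: "1", 2: "Harry James Potter", 3: "Black"}]

# Metadata-only changes: rename field 3 and add field 4.
schema_v2 = {1: "Id", 2: "Name", 3: "Hair", 4: "Wand"}

def read(rows, schema):
    """Project stored rows through a schema; fields missing in the file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]

print(read(old_file, schema_v1))
print(read(old_file, schema_v2))
```

The same untouched file is readable under both schemas: the rename shows up as a new column name, and the added column is simply empty for old rows.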

Partition evolution is:
・Partition granularity can change.
・Metadata for each partition version is kept separately.
・Write queries that just select the data you need.
・Files that do not match are pruned out automatically.
・Use cases:
∙ Time-based partitions: peaks, or archive.
∙ Regional or geographic data: more usable.
∙ Versioning/Source/Analytics.

Compaction is:
・Resolves the small files problem.
・Requires calling procedures.
・Done in a transactional manner.
・Different strategies:
∙ Bin pack
∙ Order (sort)
∙ Z-order
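
To make the bin-pack strategy concrete, here is a minimal sketch of grouping small files into rewrite groups. It uses a generic first-fit-decreasing heuristic chosen for illustration; it is not claimed to be the algorithm Iceberg runs, and the file sizes and the 128 MB target are made-up values.

```python
def bin_pack(file_sizes_mb, target_mb):
    """Greedy first-fit-decreasing: add each file to the first group it still fits in."""
    bins = []  # each bin is a list of file sizes that would be rewritten as one file
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing group has room, so start a new one
    return bins

small_files = [5, 40, 10, 60, 25, 30, 100]  # sizes in MB, made up
groups = bin_pack(small_files, target_mb=128)
print(groups)  # [[100, 25], [60, 40, 10, 5], [30]]
```

Seven small files become three rewrite groups, so a later scan opens three files instead of seven.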

Query Optimization is:
・Extends Spark (engine) optimization
・Uses the statistics in the metadata
・Use cases:
∙ Partition/file pruning
∙ Join optimization
∙ Data skipping
∙ Column pruning
∙ Predicate pushdown
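
Data skipping from metadata statistics can be sketched as follows. The assumption is that each manifest entry carries per-column lower and upper bounds for its data file; the `ManifestEntry` class, the `year` column and the file names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    path: str
    lower: dict  # column -> minimum value stored in the file
    upper: dict  # column -> maximum value stored in the file

manifest = [
    ManifestEntry("f1.parquet", {"year": 1900}, {"year": 1950}),
    ManifestEntry("f2.parquet", {"year": 1951}, {"year": 1990}),
    ManifestEntry("f3.parquet", {"year": 1991}, {"year": 2023}),
]

def files_for_range(entries, column, lo, hi):
    """Return the files whose [min, max] range for `column` overlaps [lo, hi]."""
    return [e.path for e in entries
            if e.lower[column] <= hi and e.upper[column] >= lo]

# WHERE year BETWEEN 1980 AND 1995: f1 is skipped without being opened.
print(files_for_range(manifest, "year", 1980, 1995))  # ['f2.parquet', 'f3.parquet']
```

The decision is made from metadata alone, which is what makes scan planning fast on large tables.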

Time Travel is:
・Each new state of the table is a snapshot
・Work with a previous snapshot as a table
・Use different versions of the table in one SELECT
・Roll back to a previous version
・Use cases:
∙ Read the data added by the last batch
∙ Anomaly detection: compare updated data
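
The snapshot lookup behind time travel can be modelled in a few lines. This is a sketch under assumptions: the commit times and snapshot ids are made up, and `snapshot_as_of` is a hypothetical helper, not an Iceberg API.

```python
from bisect import bisect_right

# (commit time, snapshot id) pairs in commit order; the values are made up.
history = [(100, "T1"), (200, "T2"), (300, "T3")]

def snapshot_as_of(history, ts):
    """Return the id of the latest snapshot committed at or before ts."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    if i == 0:
        raise ValueError("no snapshot exists at or before this time")
    return history[i - 1][1]

current = history[-1][1]                # the table normally points at the newest snapshot
print(snapshot_as_of(history, 250))     # T2
current = snapshot_as_of(history, 150)  # rollback: repoint the table to T1
print(current)                          # T1
```

Time travel only chooses which snapshot to read, and rollback only moves the pointer, so neither one writes new data files.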

Apache Iceberg. Underwater part.

Key concepts
・TableMetadata → information about the schema, partitioning, properties, and the current and historical snapshots
・Snapshot → contains information about the manifests that compose a table at a given point in time
・Manifest → a list of files and associated metadata, such as column statistics
・Data Files → immutable files stored in one of the supported file formats

Files & layers
Catalog layer:
・Catalog → manages a collection of tables (namespaces)
Metadata layer:
・Metadata file → information about a table's schema, partition information, & snapshot details
・Manifest list → information about all the manifest files & anchors & additional details
・Manifest file → stores a list of data files with the column-level metrics and stats
Data layer:
・Data files → immutable files stored in one of the supported file formats

Regular SQL
SELECT * FROM table1

Time travel
SELECT * FROM table1 TIMESTAMP AS OF '2021-01-28 00:00:00'

Who is responsible for whom?
Iceberg Catalog = version-hint.text
├── Metadata File = v3.metadata.json
│   ├── Manifest List 1 = snap_…_b103.avro
│   │   └── Manifest 1 = …_ad19_4e.avro
│   │       └── Data File = 0000-1-aef71.parquet
│   └── Manifest List 2 = snap_…_16c3.avro
│       └── Manifest 2 (Snapshot 2) = …_98fa_77.avro
│           ├── Data File = 0000-1-aef71.parquet
│           └── Data File = 0000-3-0fa3a.parquet
└── Metadata File 1 = v2.metadata.json
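
The chain of responsibility above can be walked in code. The sketch below is a simplified in-memory stand-in: the keys `list-1`, `manifest-1` and so on are hypothetical placeholders for the elided Avro file names, and plain dictionaries replace the real JSON and Avro files.

```python
# Hypothetical, simplified stand-ins for the files shown on the slide.
version_hint = 3  # the catalog pointer: which metadata version is current
metadata_files = {
    "v3.metadata.json": {"current": "list-2", "snapshots": ["list-1", "list-2"]},
}
manifest_lists = {"list-1": ["manifest-1"], "list-2": ["manifest-2"]}
manifests = {
    "manifest-1": ["0000-1-aef71.parquet"],
    "manifest-2": ["0000-1-aef71.parquet", "0000-3-0fa3a.parquet"],
}

def data_files(snapshot=None):
    """Walk catalog pointer -> metadata file -> manifest list -> manifests -> data files."""
    meta = metadata_files[f"v{version_hint}.metadata.json"]
    manifest_list = snapshot or meta["current"]
    files = []
    for manifest in manifest_lists[manifest_list]:
        files.extend(manifests[manifest])
    return files

print(data_files())          # files of the current snapshot
print(data_files("list-1"))  # files of the older snapshot
```

Both snapshots reference `0000-1-aef71.parquet`: a new snapshot reuses unchanged data files instead of copying them.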

Coding task. Conquering the iceberg.

Practice
・Iceberg demo application:
∙ Built in Scala
・Aims to provide a practical illustration of the table format and its key features
・Data from Kaggle (Harry Potter)
・Simple database & table creation in a local environment

Dataset from www.kaggle.com

Dataset structure
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Job: string (nullable = true)
|-- House: string (nullable = true)
|-- Wand: string (nullable = true)
|-- Patronus: string (nullable = true)
|-- Species: string (nullable = true)
|-- Blood_status: string (nullable = true)
|-- Hair_colour: string (nullable = true)
|-- Eye_colour: string (nullable = true)
|-- Loyalty: string (nullable = true)
|-- Skills: string (nullable = true)
|-- Birth: string (nullable = true)
|-- Death: string (nullable = true)
Important columns for us:
⋅ Id - the identifier 😁
⋅ Name - a human-readable value
⋅ Gender - a plain data field
⋅ House - the partition key
⋅ Blood_status - data that does not change
⋅ Hair_colour - data that will change

Dataset App example
ID  Name                Gender  House       Blood_status      Hair_colour
1   Harry James Potter  Male    Gryffindor  Half-blood        Black
2   Ronald Bilius Wea   Male    Gryffindor  Pure-blood        Red
3   Hermione Jean Gra   Female  Gryffindor  Muggle-born       Brown
4   lbus Percival Wu    Male    Gryffindor  Half-blood        Silver| formerly
5   Rubeus Hagrid       Male    Gryffindor  Part-Human (Half  Black

Live coding app
・Install the required dependencies.
・Run the code that creates the DB & table (locally).
・Initial insert of data (from Kaggle).
・Upsert data (updating).
・Table management / time travel.

App questions
What is in the Spark config:
・spark.sql.extensions: IcebergSparkSessionExtensions
・spark.sql.catalog.harry_ns.warehouse: src/main/resources/warehouse/catalog/harry_ns/
Why do we use in the code:
・GlobalTempView
・MERGE INTO
What gives us queries such as:
・SELECT * FROM harry_ns.input_table.snapshots
・CALL harry_ns.system.expire_snapshots(...

After DDL
Warehouse is created
src/main/resources/warehouse
└── catalog
    └── harry_ns
        └── input_table
            └── metadata
                ├── v1.metadata.json
                └── version-hint.text
⋅ Dirs/files are created in the warehouse
⋅ version-hint.text is created (version 1): the entry point
⋅ v1.metadata.json is created
⋅ No snapshot yet (T0)
⋅ No data folder

Insert data
Full-fledged warehouse, filled
Metadata layer:
⋅ version-hint.text (version 2)
⋅ v2.metadata.json is created
⋅ Contains the 1st snapshot (T1)
⋅ Manifest list snap-T1-id.avro is created
⋅ Manifest -m0.avro is created
Data layer:
⋅ The data folder is created
⋅ 8 partitions / 8 Parquet data files are created

Upsert data
Today & the past both exist in the warehouse
Metadata layer:
⋅ version-hint.text (version 3)
⋅ v3.metadata.json is created
⋅ Contains the 2nd snapshot (T2)
⋅ 2nd manifest list snap-T2-id.avro is created
⋅ 2nd manifest -m1.avro is created
Data layer:
⋅ House=Gryffindor: 2 files (1 file added)
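
The upsert step follows MERGE INTO semantics: rows that match on a key are updated, and unmatched rows are inserted. Below is a minimal pure-Python model of that rule; it is not the demo's Spark code, and the updated hair colour and the row with Id 999 are made-up values.

```python
def merge_into(target, source, key="Id"):
    """Upsert: rows whose key matches are updated, unmatched rows are inserted."""
    merged = {row[key]: row for row in target}
    for row in source:
        # Build a new row instead of mutating the old one, so the old state survives.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

table = [
    {"Id": "1", "Name": "Harry James Potter", "Hair_colour": "Black"},
    {"Id": "5", "Name": "Rubeus Hagrid", "Hair_colour": "Black"},
]
delta = [
    {"Id": "5", "Hair_colour": "Grey"},                          # made-up update
    {"Id": "999", "Name": "New Student", "Hair_colour": "Red"},  # made-up insert
]
result = merge_into(table, delta)
print(result)
```

`table` is left untouched and `result` is a new state, which mirrors how the warehouse keeps the old Gryffindor file next to the newly added one.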

What do we expect to see?
・Data is replaced with the delta.
・Old data is still present in the DB.
・Old data & new data are labeled by metadata files.
・Is it possible to get the “old” data?

Time travel
Works with versions as tables
⋅ version-hint.text (still version 3)
⋅ No new metadata
⋅ No new snapshots
⋅ Select T1 & T2 separately and together

Table Management
No past. 1 version only
⋅ version-hint.text (version 4)
⋅ v4.metadata.json is created
⋅ Contains the 3rd snapshot only (T3)
⋅ 1 manifest-list file snap-T3-id.avro
⋅ 2 manifests exist: -m0.avro, -m1.avro
⋅ House=Gryffindor: 1 file
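
Why the past disappears after table management can be sketched as a reference count over snapshots. This is an illustrative model only: the file names are hypothetical, the `expire` helper is not the `expire_snapshots` procedure itself, and the way snapshot T3 came to reference a single Gryffindor file (for example, through a rewrite) is an assumption.

```python
# snapshot id -> data files it references (hypothetical names)
snapshots = {
    "T1": {"gryffindor-0.parquet", "slytherin-0.parquet"},
    "T2": {"gryffindor-0.parquet", "gryffindor-1.parquet", "slytherin-0.parquet"},
    "T3": {"gryffindor-2.parquet", "slytherin-0.parquet"},
}

def expire(snapshots, keep):
    """Drop all snapshots except `keep`; return the survivors and the files safe to delete."""
    kept = {sid: files for sid, files in snapshots.items() if sid in keep}
    still_used = set().union(*kept.values())
    all_files = set().union(*snapshots.values())
    return kept, all_files - still_used

kept, deletable = expire(snapshots, keep={"T3"})
print(sorted(deletable))  # ['gryffindor-0.parquet', 'gryffindor-1.parquet']
```

A file becomes deletable only when no surviving snapshot references it, so `slytherin-0.parquet` stays while the two older Gryffindor files go.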

What files are left over?
src/main/resources/warehouse/…/metadata
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m0.avro
├── d23c0655-5393-4d89-9bc0-b658425e7d52-m1.avro
├── snap-5466781927818564652-1-d23c0655-5393-4d89-9bc0-b658425e7d52.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
├── v4.metadata.json
└── version-hint.text

What did we get?
・Data files are not everything you need
・The more you do with metadata, the better
∙ Especially on large datasets
・Operations affect the data in the warehouse dir:
∙ insert/update/delete
∙ table management
・Take the metadata and files into account, and do table management

Conclusion. Summary & experience.

So is it possible?
✅ Engine agnostic
✅ Env/Storage agnostic
✅ Multiformat
✅ Support ACID
✅ Performance optimized

This is done by
・A way of organising files on disk
・Adding metadata
・Managing metadata over the data

Iceberg pros & cons from experience
Pros
・Performance optimizations
・ACID transactions
・Time-travel queries
・Seamless schema evolution
・Efficient for incremental data updates
・Multiple processing engines
Cons
・Learning curve
・Complexity:
∙ Table management is a must
∙ Universalism (batch/streaming)
∙ Size overheads
∙ Optimistic locking
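
The optimistic-lock item under Cons can be illustrated with a toy commit protocol. It is a single-threaded sketch under assumptions: the `Catalog` class and `write_with_retry` are hypothetical, and the plain `if` check stands in for whatever atomic swap a real catalog provides.

```python
class Catalog:
    """Holds the pointer to the current table version (like version-hint.text)."""

    def __init__(self):
        self.version = 1

    def commit(self, based_on):
        """Swap the pointer only if nobody else committed since `based_on` was read."""
        if self.version != based_on:
            return False  # conflict: the writer must refresh and retry
        self.version = based_on + 1
        return True

def write_with_retry(catalog, max_attempts=3):
    """Return the number of attempts a writer needed to commit."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.version  # read the current state
        # ... prepare new data and metadata files based on `base` ...
        if catalog.commit(base):
            return attempt
    raise RuntimeError("too many commit conflicts")

catalog = Catalog()
base_a = catalog.version       # writer A reads version 1
base_b = catalog.version       # writer B reads version 1
print(catalog.commit(base_a))  # True: A commits first
print(catalog.commit(base_b))  # False: B's base is stale, so B has to retry
```

Nothing is locked while a writer works. The cost appears at commit time, when the losing writer has to redo its commit against the new version.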

When to use?
・Data lake and cloud storage: optimizes list operations
・Data volume: substantial and complex data requirements
・Incremental data updates and changes: data is frequently updated with small changes
・ACID transactions: ensure data integrity and consistency in multi-user environments
・Performance optimization: improved file layouts, metadata management, and indexing
・Time travel and historical data analysis: auditing, trend analysis, and understanding data changes over time
・Multi-engine compatibility: Apache Spark, Presto, Hive, streaming

Q&A

Thank you for your attention!
Grid Dynamics Holdings, Inc.
5000 Executive Parkway, Suite 520 / San Ramon, CA
650-523-5000
info@griddynamics.com
www.griddynamics.com
Salarpuria Sattva Knowledge Park, HITEC City, Hyderabad, Telangana 500081
/taras.tros /tarasfedorov @taras_fedorov

More Related Content

Similar to Harry Potter & Apache iceberg format

Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
tsliwowicz
 

Similar to Harry Potter & Apache iceberg format (20)

Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine Data
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 

Recently uploaded

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Harry Potter & Apache iceberg format

  • 1. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Taras Fedorov | December 2023 Harry Potter & Apache Iceberg Table Format: The magic of Apache Iceberg
  • 2. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ・In IT since 2008 ・Software developer, Team Lead, Competence lead ・Scala, Java, Java Script, Ruby ・Data engineer, Big Data, ML, HighLoad, Backend, Frontend ・In Big Data since 2015, Spark certified ・Working with Apache Iceberg ~3 years ・Grid Dynamics. Project ***** Taras Fedorov Meet our speaker! 2
  • 3. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Agenda 3 ・Problematic ・Table Format. What is it? ・Hive Table Format. What is wrong with it? ・Apache Iceberg. Surface part. ・Apache Iceberg. Underwater part. ・Coding task. Сonquering the iceberg. ・Conclusion. Summary & experience. ・Q&A 3
  • 4. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Brainstorm to choose … Large company has DataLake / DataWharehouse. A lot of legacy pipelines and teams ・Dev team 1 : use Spark / batch processing / parquete ・Dev team 2: Flink / Streaming / avro ・Bi team: SnowFlake and Hive ・Data Eng team: Data constantly evolved ,new sources and changing structures ・SRE team: Data consistency, track changes, rollback ・DevOps team: On-premise but going to migrate into cloud? 4
  • 5. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Is it possible to be? ・Engine agnostic ・Env/Storage agnostic ・Multiformat ・Support ACID ・Performance Optimized Is it possible to have one tool and useful? 5
  • 6. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Table Format. What is it?
  • 7. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ・Abstraction layer between: ・Physical data files. ・How they're structured to form a table. ・Table Format is the way to organize & track data files in table. Table Format is: 7 7
  • 8. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ・Spark? Who are concurrents? 8 ・HIVE? ・Parquet? ・HDFS? ・S3?
  • 9. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Table Format: organise files & directory as table. ・Based on file formats: Parquet, ORC, … ・Keeps schema inside. ・Tightly integrated with systems (Hive/Iceberg). ・Work seamlessly with query engines and tools. ・Important in distributed storages: HDFS, S3, … Why Table formats? If exists file formats File format: organise data in a file for general purpose. ・Parquet, Avro, ORC, JSON… ・May keep schema inside. ・Not system-specific. ・Storage, data interchange, and processing across different systems. 9 9
  • 10. Grid Dynamics / Harry Potter & Apache Iceberg Table Format 10 Concurrents Table Formats are: Cassandra Memtable Bigtable Tables 10
  • 11. Grid Dynamics / Harry Potter & Apache Iceberg Table Format 11 Batch Streaming ACID Schema evolution Partition evolution Read/write features Index management Time travel DW/ Analytics Real Time Data Lake
  • 12. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Hive Table Format. What is wrong with it?
  • 13. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ・Organized data in a directory structure. ・The folder name = table partition key. ・What about files inside the directory? ・Directories are black boxes of files. ・Any file inside is considered part of the table. Hive Table Format 13
  • 14. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ✅ Allows reduce load on system. ✅ HMS read only related to partition data. Why do we need partitions (directories) 14 SELECT * FROM logs WHERE dt=“2001-01-01” AND country like “GB” 🚫 May produce small files
  • 15. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Hive Table format capabilities 15 ✅ Add and delete partitions in a transactional manner. 🚫 Add/delete files in a filesystem that doesn’t provide transactional capabilities. 🚫 Multiple jobs modifying the same dataset isn’t a safe operation. 🚫 Hive table statistics are usually stale. (How to statistic a black box?) 🚫 Designed in Pre-cloud era. e.g. Directory listings in the object storage.
  • 16. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ❌ No transaction → No lineage ❌ No lineage → No history ❌ No lineage → No time travel ❌ No statistic on the files → No Query Optimization Techniques ❌ No Schema Evolution (on files) ❌ No Partition Evolution ❌ Each time use command listing (bad for object store) Summarise issues with tracking directories only 16
  • 17. Grid Dynamics / Harry Potter & Apache Iceberg Table Format Apache Iceberg. Surface part.
  • 18. Grid Dynamics / Harry Potter & Apache Iceberg Table Format ✅ Keep all Hive advantages: partitioning, directory optimization ✅ ACID Transactions → isolated reads and writes ✅ Query Optimization Techniques → Scan planning is fast & advanced filtering ✅ Schema Evolution → add, drop, update, or rename (no side effect) ✅ Partition Evolution ✅ Compaction Management ✅ Upserts and Deletes Iceberg resolves: 18 18
  • 19. Grid Dynamics / Harry Potter & Apache Iceberg Table Format 1. Schema Evolution 2. Partition Evolution 3. Compaction 4. Query Optimization 5. Time Travel What is it all? 19 19
  • 20. Grid Dynamics / Harry Potter & Apache Iceberg Table Format 20 Schema Evolution is: ・Retain current data (old schema) with new data (new schema) ・No rewrite data if schema changed ・Need to follow rules
  • 21. Schema evolution in Iceberg: ・Operations: Add/Drop/Rename/Update/Reorder ・Iceberg schema updates are metadata changes ・Schema evolution in Iceberg is independent and free of side effects ・Use cases: ∙ Changing data requirements ∙ Ingestion from different sources: data lakes ∙ Data versioning, e.g. historical data ∙ Migrating to new technologies
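One way to picture why schema updates can be metadata-only: Iceberg tracks columns by stable field IDs, so a rename or an added column only changes the ID-to-name mapping held in metadata. The sketch below is a toy model of that idea in plain Python; the row layout, the `read` helper, and the renamed and added column names are illustrative assumptions, not Iceberg's API.

```python
from typing import Dict, List, Optional

# A "data file": rows keyed by a stable field ID, not by column name.
old_file: List[Dict[int, str]] = [{1: "Harry James Potter", 2: "Black"}]

# The schema lives in table metadata: field ID -> current column name.
schema: Dict[int, str] = {1: "Name", 2: "Hair_colour"}


def read(rows: List[Dict[int, str]],
         schema: Dict[int, str]) -> List[Dict[str, Optional[str]]]:
    """Project stored rows through the schema; IDs absent from a file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


before = read(old_file, schema)

# Rename: only the metadata mapping changes, the file is untouched.
schema[2] = "Hair_color"
# Add: a new field ID that files written earlier simply do not contain.
schema[3] = "Wand"

after = read(old_file, schema)
print(before)
print(after)
```

Because the old file is never rewritten, the existing value stays readable under the new name, and the added column reads as missing for rows written before the change.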
  • 22. Partition evolution is: ・Partition granularity can change. ・Metadata for each partition is kept separately. ・You write queries that select the data you need. ・Files that do not match are pruned automatically. ・Use cases: ∙ Time-based partitions: peaks or archives. ∙ Regional or geographic data: more usable. ∙ Versioning/Source/Analytics.
  • 23. Compaction is: ・Resolves the small files problem. ・Requires calling procedures. ・Done in a transactional manner. ・Different strategies: ∙ Bin pack ∙ Order ∙ Z-order
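To give a feel for the bin-pack strategy, here is a minimal sketch that groups small files into rewrite groups of at most a target size using first-fit decreasing. The file sizes, the 128 MB target, and the `bin_pack` helper are assumptions for illustration; Iceberg's own planner may group files differently.

```python
from typing import List


def bin_pack(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """First-fit decreasing: put each file into the first group that still has room."""
    groups: List[List[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group can take this file, so open a new one.
            groups.append([size])
    return groups


# Seven hypothetical small files (sizes in MB).
small_files = [10, 40, 25, 60, 5, 30, 20]
groups = bin_pack(small_files, target_mb=128)

# Each group would be rewritten as a single larger file.
print(len(small_files), "files ->", len(groups), "files")
```

In this toy run the seven inputs collapse into two output files, which is the whole point of compaction: fewer, larger files to open at read time.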
  • 24. Query Optimization is: ・Extends Spark (engine) optimization ・Uses statistics in metadata ・Use cases: ∙ Partition/file pruning ∙ Join optimization ∙ Data skipping ∙ Column pruning ∙ Predicate pushdown
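Data skipping with metadata statistics can be sketched as follows: each manifest entry carries per-column minimum and maximum values, and a file is opened only if the requested value can fall inside that range. The `FileStats` class, the file names, and the ID ranges are hypothetical; real manifests hold richer statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file column statistics, as a manifest entry might record them."""
    path: str
    min_id: int
    max_id: int


def files_to_scan(manifest: List[FileStats], wanted_id: int) -> List[str]:
    """Return only the files whose [min, max] range can contain wanted_id."""
    return [f.path for f in manifest if f.min_id <= wanted_id <= f.max_id]


manifest = [
    FileStats("data/00000.parquet", 1, 50),
    FileStats("data/00001.parquet", 51, 100),
    FileStats("data/00002.parquet", 101, 140),
]

# A query such as WHERE Id = 73 needs to open a single file.
to_scan = files_to_scan(manifest, wanted_id=73)
print(to_scan)
```

The decision is made from metadata alone, before any data file is read, which is why planning stays fast even on large tables.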
  • 25. Time Travel is: ・Each new state of the table is a snapshot ・Work with a previous snapshot as a table ・Use different versions of a table in one SELECT ・Roll back to a previous version ・Use cases: ∙ Read the data added by the last batch ∙ Anomaly detection: compare updated data
  • 26. Apache Iceberg. Underwater part.
  • 27. Key concepts: ・TableMetadata → information about schema, partitioning, properties, and current and historical snapshots ・Snapshot → information about the manifests that compose a table at a given point in time ・Manifest → a list of files and associated metadata such as column statistics ・Data files → immutable files stored in one of the supported file formats
  • 28. Files & layers. Catalog layer: ・Catalog → manages a collection of tables (namespaces). Metadata layer: ・Metadata file → information about the table schema, partitioning, and snapshot details ・Manifest list → information about all the manifest files, anchors, and additional details ・Manifest file → a list of data files with column-level metrics and stats. Data layer: ・Data files → immutable files stored in one of the supported file formats
  • 29. Regular SQL: SELECT * FROM table1
  • 30. Time travel: SELECT * FROM table1 AS OF '2021-01-28 00:00:00' (Spark SQL spells this clause TIMESTAMP AS OF)
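Conceptually, a timestamp-based time travel query resolves to the latest snapshot committed at or before the requested time. The sketch below shows that lookup on a hypothetical snapshot history; the snapshot names, commit times, and the `snapshot_as_of` helper are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
history: List[Tuple[datetime, str]] = [
    (datetime(2021, 1, 26, 8, 0), "S1"),
    (datetime(2021, 1, 27, 8, 0), "S2"),
    (datetime(2021, 1, 28, 8, 0), "S3"),
]


def snapshot_as_of(history: List[Tuple[datetime, str]],
                   ts: datetime) -> Optional[str]:
    """Return the last snapshot committed at or before ts, or None if none exists."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i > 0 else None


# The timestamp from the slide falls between the second and third commits.
chosen = snapshot_as_of(history, datetime(2021, 1, 28, 0, 0, 0))
print(chosen)
```

A timestamp earlier than the first commit has no snapshot to resolve to, which the helper reports as `None`.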
  • 31. Who is responsible for whom?
        Iceberg Catalog = version-hint.text
        ├── Metadata File = v3.metadata.json
        │   ├── Manifest List 1 = snap_…_b103.avro
        │   │   └── Manifest 1 = …_ad19_4e.avro
        │   │       └── Data File = 0000-1-aef71.parquet
        │   └── Manifest List 2 = snap_…._16c3.avro
        │       └── Manifest 2 (Snapshot 2) = .._98fa_77.avro
        │           ├── Data File = 0000-1-aef71.parquet
        │           └── Data File = 0000-3-0fa3a.parquet
        └── Metadata File 1 = v2.metadata.json
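The chain of responsibility on this slide can be walked in a few lines: the catalog points at the current metadata file, the metadata file points at a manifest list per snapshot, each manifest list names manifests, and manifests name data files. The sketch below models that chain with plain dictionaries. The data file names come from the slide, while the keys `snap-1`, `list-1`, `manifest-1` and so on are hypothetical stand-ins for the elided Avro file names.

```python
from typing import Dict, List, Optional

# Catalog layer: a single pointer to the current metadata file.
catalog: Dict[str, str] = {"current_metadata": "v3.metadata.json"}

# Metadata layer: snapshots and the manifest list each one owns.
metadata_files: Dict[str, dict] = {
    "v3.metadata.json": {
        "current_snapshot": "snap-2",
        "snapshots": {"snap-1": "list-1", "snap-2": "list-2"},
    },
}
manifest_lists: Dict[str, List[str]] = {
    "list-1": ["manifest-1"],
    "list-2": ["manifest-2"],
}
manifests: Dict[str, List[str]] = {
    "manifest-1": ["0000-1-aef71.parquet"],
    "manifest-2": ["0000-1-aef71.parquet", "0000-3-0fa3a.parquet"],
}


def data_files(snapshot_id: Optional[str] = None) -> List[str]:
    """Walk catalog -> metadata -> manifest list -> manifests -> data files."""
    meta = metadata_files[catalog["current_metadata"]]
    snap = snapshot_id or meta["current_snapshot"]
    files: List[str] = []
    for manifest in manifest_lists[meta["snapshots"][snap]]:
        files.extend(manifests[manifest])
    return files


print(data_files())          # files of the current snapshot
print(data_files("snap-1"))  # files of the earlier snapshot
```

Note that both snapshots reference `0000-1-aef71.parquet`: a new snapshot reuses unchanged data files instead of copying them.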
  • 32. Coding task. Conquering the iceberg.
  • 33. Practice: ・Iceberg demo application: ∙ Built in Scala ・Aims to provide a practical illustration of the table format and its key features ・Data from Kaggle (Harry Potter) ・Simple database and table creation in a local environment
  • 34. Dataset from www.kaggle.com
  • 35. Dataset structure:
        root
         |-- Id: string (nullable = true)
         |-- Name: string (nullable = true)
         |-- Gender: string (nullable = true)
         |-- Job: string (nullable = true)
         |-- House: string (nullable = true)
         |-- Wand: string (nullable = true)
         |-- Patronus: string (nullable = true)
         |-- Species: string (nullable = true)
         |-- Blood_status: string (nullable = true)
         |-- Hair_colour: string (nullable = true)
         |-- Eye_colour: string (nullable = true)
         |-- Loyalty: string (nullable = true)
         |-- Skills: string (nullable = true)
         |-- Birth: string (nullable = true)
         |-- Death: string (nullable = true)
      Important columns for us: ⋅ Id: the identifier 😁 ⋅ Name: a human-readable value ⋅ Gender: a pure data field ⋅ House: the partition key ⋅ Blood_status: data that does not change ⋅ Hair_colour: data that changes
  • 36. Dataset: app example (long values are truncated)
        ID  Name                Gender  House       Blood_status      Hair_colour
        1   Harry James Potter  Male    Gryffindor  Half-blood        Black
        2   Ronald Bilius Wea   Male    Gryffindor  Pure-blood        Red
        3   Hermione Jean Gra   Female  Gryffindor  Muggle-born       Brown
        4   Albus Percival Wu   Male    Gryffindor  Half-blood        Silver| formerly
        5   Rubeus Hagrid       Male    Gryffindor  Part-Human (Half  Black
  • 37. Live coding app: ・Install the required dependencies. ・Run the code that creates the DB and table (locally). ・Initial insert of data (from Kaggle). ・Upsert data (updating). ・Table management / time travel.
  • 38. App questions. What is in the Spark config: ・spark.sql.extensions: IcebergSparkSessionExtensions ・spark.sql.catalog.harry_ns.warehouse: src/main/resources/warehouse/catalog/harry_ns/ Why do we use in the code: ・GlobalTempView ・MERGE INTO What gives us the queries: ・SELECT * FROM harry_ns.input_table.snapshots ・CALL harry_ns.system.expire_snapshots(...
  • 39. After DDL: the warehouse is created
        src/main/resources/warehouse
        └── catalog
            └── harry_ns
                └── input_table
                    └── metadata
                        ├── v1.metadata.json
                        └── version-hint.text
      ⋅ Created dirs/files in the warehouse ⋅ Created version-hint.text (version 1): the entry point ⋅ Created v1.metadata.json ⋅ No snapshot (T0) ⋅ No data folder
  • 40. Insert data: a full-fledged, filled warehouse. Metadata layer: ⋅ version-hint.text (version 2) ⋅ Created v2.metadata.json ⋅ Contains the 1st snapshot (T1) ⋅ Created manifest list snap-T1-id.avro ⋅ Created manifest -m0.avro Data layer: ⋅ Created the data folder ⋅ Created 8 partitions / 8 Parquet data files
  • 41. Upsert data: today and the past both exist in the warehouse. Metadata layer: ⋅ version-hint.text (version 3) ⋅ Created v3.metadata.json ⋅ Contains the 2nd snapshot (T2) ⋅ Created the 2nd manifest list snap-T2-id.avro ⋅ Created the 2nd manifest -m1.avro Data layer: ⋅ House=Gryffindor has 2 files (1 file added)
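Why does the Gryffindor partition end up with two files after the upsert? A plausible explanation, assuming a copy-on-write style of update, is that data files are immutable: the touched partition is rewritten into a new file, and the new snapshot points at it while the old snapshot keeps pointing at the old file. The sketch below is a toy model of that behaviour; the file names, row values, and the `upsert` helper are hypothetical, and Iceberg also supports other write modes.

```python
from typing import Dict, List

# Files on "disk": path -> rows (Id -> Hair_colour). Files are never modified.
disk: Dict[str, Dict[int, str]] = {
    "House=Gryffindor/file-a.parquet": {1: "Black", 2: "Red"},
}
# Snapshot -> data files visible in that snapshot.
snapshots: Dict[str, List[str]] = {"T1": ["House=Gryffindor/file-a.parquet"]}


def upsert(base: str, new: str, partition: str, changes: Dict[int, str]) -> None:
    """Rewrite the touched partition into a new file and record a new snapshot."""
    # This toy assumes exactly one live file per partition in the base snapshot.
    (old_path,) = [p for p in snapshots[base] if p.startswith(partition + "/")]
    new_path = f"{partition}/file-{new.lower()}.parquet"
    disk[new_path] = {**disk[old_path], **changes}  # old rows plus updates/inserts
    snapshots[new] = [new_path if p == old_path else p for p in snapshots[base]]


# Update row 2 and insert row 6 (made-up values).
upsert("T1", "T2", "House=Gryffindor", {2: "Ginger", 6: "Blond"})

print(sorted(disk))     # two files now exist for the partition
print(snapshots["T1"])  # the past still reads the old file
print(snapshots["T2"])  # the present reads the new file
```

Both versions of the data stay on disk until table management removes the snapshots that reference the old file.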
  • 42. What do we expect to see? ・Data replaced with the delta. ・Old data still present in the DB. ・Old and new data labeled by metadata files. ・Is it possible to get the “old” data?
  • 43. Time travel: works with versions as tables. ⋅ version-hint.text (version 3) ⋅ No new metadata files ⋅ No new snapshots ⋅ Select T1 and T2 separately and together
  • 44. Table management: no past, 1 version only. ⋅ version-hint.text (version 4) ⋅ Created v4.metadata.json ⋅ Contains the 3rd snapshot only (T3) ⋅ 1 manifest list file snap-T3-id.avro ⋅ 2 manifests exist: -m0.avro, -m1.avro ⋅ House=Gryffindor has 1 file
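The effect of expiring snapshots can be sketched as a reachability check: once only one snapshot is kept, any data file that the kept snapshot does not reference is safe to delete. The model below is a simplified illustration with hypothetical file names and an `expire` helper; the real procedure also cleans up manifests and manifest lists and has retention settings not shown here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical history: each snapshot lists the data files it references.
snapshots: Dict[str, List[str]] = {
    "T1": ["gryffindor-a.parquet", "slytherin-a.parquet"],
    "T2": ["gryffindor-b.parquet", "slytherin-a.parquet"],
    "T3": ["gryffindor-c.parquet", "slytherin-a.parquet"],
}
on_disk: Set[str] = {
    "gryffindor-a.parquet",
    "gryffindor-b.parquet",
    "gryffindor-c.parquet",
    "slytherin-a.parquet",
}


def expire(snapshots: Dict[str, List[str]], on_disk: Set[str],
           keep: str) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Keep one snapshot; return it plus the files no kept snapshot references."""
    kept = {keep: snapshots[keep]}
    still_needed = set(kept[keep])
    return kept, on_disk - still_needed


kept, removable = expire(snapshots, on_disk, keep="T3")
print(sorted(removable))
```

After the removable files are deleted, each partition is back to a single file and time travel to T1 or T2 is no longer possible, matching the "no past, 1 version only" outcome.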
  • 45. What files are left over?
        src/main/resources/warehouse/…/metadata
        ├── d23c0655-5393-4d89-9bc0-b658425e7d52-m0.avro
        ├── d23c0655-5393-4d89-9bc0-b658425e7d52-m1.avro
        ├── snap-5466781927818564652-1-d23c0655-5393-4d89-9bc0-b658425e7d52.avro
        ├── v1.metadata.json
        ├── v2.metadata.json
        ├── v3.metadata.json
        ├── v4.metadata.json
        └── version-hint.text
  • 46. What did we get? ・Data files are not everything you need ・The more you do with metadata, the better ∙ Especially on large datasets ・Operations affect data in the warehouse dir: ∙ insert/update/delete ∙ table management ・Take the metadata and files into account and do table management
  • 47. Conclusion. Summary & experience.
  • 48. So is it possible? ✅ Engine agnostic ✅ Env/Storage agnostic ✅ Multiformat ✅ Support ACID ✅ Performance Optimized
  • 49. This is done by: ・The way files are organised on disk ・Adding metadata ・Managing metadata over the data
  • 50. Iceberg pros & cons from experience. Pros: ・Performance optimizations ・ACID transactions ・Time-travel queries ・Seamless schema evolution ・Efficient for incremental data updates ・Multiple processing engines Cons: ・Learning curve ・Complexity: ∙ Table management is a must ∙ Universalism (batch/streaming) ∙ Size overheads ∙ Optimistic locking
  • 51. When to use? ・Data lake and cloud storage: optimized list operations ・Data volume: substantial and complex data requirements ・Incremental data updates and changes: data is frequently updated with small changes ・ACID transactions: ensure data integrity and consistency in multi-user environments ・Performance optimization: improved file layouts, metadata management, and indexing ・Time travel and historical data analysis: auditing, trend analysis, and understanding data changes over time ・Multi-engine compatibility: Apache Spark, Presto, Hive, streaming
  • 52. Q&A
  • 53. Thank you for your attention! Grid Dynamics Holdings, Inc. 5000 Executive Parkway, Suite 520 / San Ramon, CA / 650-523-5000 / info@griddynamics.com / www.griddynamics.com. Salarpuria Sattva Knowledge Park, HITEC City, Hyderabad, Telangana 500081. /taras.tros /tarasfedorov @taras_fedorov