8. Why did we choose Apache Iceberg?
@levyeran
- Open table format, widely adopted, that integrates well
with the AWS ecosystem (Glue catalog, Athena, etc.).
- Table evolution - mainly schema and partitioning layout
(particularly hidden partitioning).
- Integrates well with many processing engines - supports our
long-term strategy of leveraging the right technology for each
need.
9. While there are many cool things in Iceberg,
there are some challenges…
The main challenge is:
Maintenance
11. Table Configuration
- Iceberg v2 table, created with the AWS Glue catalog and
Athena engine version 3 (preferably in a dedicated
WorkGroup).
- Parquet with ZSTD compression - the data format we
adopted across our data lake.
- Snapshot age - 2 days (the default is 5 days).
Athena allows only predefined key-value TBLPROPERTIES.
Glue catalog - metadata tracking
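As a sketch, the configuration above maps to an Athena engine v3 DDL along these lines (database, table, and column names are hypothetical; the 2-day snapshot age is 172800 seconds):

```sql
CREATE TABLE datalake.events (
  event_id string,
  user_id string,
  event_ts timestamp
)
PARTITIONED BY (day(event_ts))          -- hidden partitioning on a timestamp column
LOCATION 's3://my-bucket/warehouse/events/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',             -- creates an Iceberg table in the Glue catalog
  'format' = 'parquet',
  'write_compression' = 'zstd',
  'vacuum_max_snapshot_age_seconds' = '172800'  -- 2 days instead of the 5-day default
);
```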
13. Table Maintenance
We update our Iceberg table frequently (every minute,
5 GB, insert/update, 50 columns, 500M records)…
So we wanted to VACUUM, but we were hitting the Athena query
limits:
14. Table Maintenance
Increasing the limits didn’t help much, because we were hitting another one:
ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round of vacuum, but
there are more files remaining. Please run another VACUUM command to process the
remaining files
You can try to overcome it by running AWS Step Functions in a loop, as in this suggested solution.
Miss several runs and you will face another challenge, as increasing the Athena query limits
won’t help you much this time…
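The same retry loop can also be driven from a plain script. A minimal sketch, assuming boto3 credentials are configured; the database, table, and workgroup names are hypothetical, and the loop simply re-issues VACUUM while Athena reports that more rounds are needed:

```python
import time

# Marker Athena puts in the state-change reason when a VACUUM round
# removed files but more remain.
MORE_RUNS_MARKER = "ICEBERG_VACUUM_MORE_RUNS_NEEDED"


def needs_another_vacuum(reason):
    """True when Athena says this VACUUM round should be followed by another."""
    return MORE_RUNS_MARKER in (reason or "")


def vacuum_until_done(database, table, workgroup, max_rounds=20):
    """Issue VACUUM repeatedly until Athena stops asking for more rounds.

    Returns the number of rounds executed. boto3 is imported lazily so the
    helper above stays usable without AWS dependencies.
    """
    import boto3  # AWS SDK for Python

    athena = boto3.client("athena")
    for round_no in range(1, max_rounds + 1):
        qid = athena.start_query_execution(
            QueryString=f"VACUUM {database}.{table}",
            WorkGroup=workgroup,
        )["QueryExecutionId"]
        # Poll until this VACUUM round reaches a terminal state.
        while True:
            status = athena.get_query_execution(
                QueryExecutionId=qid
            )["QueryExecution"]["Status"]
            if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(5)
        # Stop unless Athena explicitly asked for another round.
        if not needs_another_vacuum(status.get("StateChangeReason", "")):
            return round_no
    return max_rounds
```

The decision of whether to loop again lives in `needs_another_vacuum`, so it can be unit-tested without touching AWS.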
16. Glue Spark ETL Jobs
To solve this for the long run, we decided to use the
Iceberg Spark procedures to perform our maintenance
jobs:
- Glue 3.0 and later supports Iceberg integration out of the
box
- Ad-hoc runs & built-in scheduler
- Integrates with CI/CD pipelines using the AWS SDKs
A nice AWS blog post and the AWS Glue Developer Guide are available
17. Glue Spark Maintenance Jobs
Basically, the most important steps are:
- Register the Iceberg connector for AWS Glue (not required
for Glue 4.0)
- Create an ETL job or a Jupyter notebook
- Provide the necessary configuration to the Spark
job/notebook, such as --datalake-formats and --conf
NOTE: these actions automatically inject the Iceberg Spark SQL
extension
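For reference, the Glue job parameters look roughly like this sketch (the catalog name and warehouse path are hypothetical placeholders; additional --conf pairs are chained inside the value of the first --conf, as the Glue docs describe):

```
--datalake-formats  iceberg
--conf  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
        --conf spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse/
        --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```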
18. Glue Spark Maintenance Jobs
Full example is available here: https://github.com/eran-levy/iceberg-journey-session-examples
19. Glue Spark Maintenance Jobs
Main maintenance procedures:
● expire_snapshots
● rewrite_data_files
● remove_orphan_files
● rewrite_manifests
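These procedures are invoked from the Spark job as CALL statements against the catalog. A minimal sketch, assuming the catalog from the job configuration is named glue_catalog and using hypothetical database/table names and an illustrative cutoff timestamp:

```sql
-- Expire snapshots older than a cutoff, keeping at least one
CALL glue_catalog.system.expire_snapshots(
  table => 'mydb.mytable',
  older_than => TIMESTAMP '2023-01-01 00:00:00',
  retain_last => 1
);

-- Compact small data files into larger ones
CALL glue_catalog.system.rewrite_data_files(table => 'mydb.mytable');

-- Delete files no longer referenced by any snapshot
CALL glue_catalog.system.remove_orphan_files(table => 'mydb.mytable');

-- Rewrite manifests to speed up scan planning
CALL glue_catalog.system.rewrite_manifests('mydb.mytable');
```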
24. Summary
● Apache Iceberg is well adopted in the industry and
specifically in the AWS ecosystem.
● It's not persist & forget -> take Iceberg maintenance into
consideration when choosing your architecture.
● Keep monitoring -> your partitioning strategy, file sizes,
query latencies, etc. might change, as there are many
moving parts that can impact your performance.
25. Next Steps
● Choosing our data lakehouse platform
● Maintenance is an issue as we scale to additional use
cases with larger data volumes - we might need a
managed service to assist us here