8. Why did we choose Apache Iceberg?
@levyeran
- Open table format, widely adopted, that integrates well
with the AWS ecosystem (Glue catalog, Athena, etc.).
- Table evolution - mainly schema and partitioning layout
(particularly hidden partitioning).
- Integrates well with many processing engines - supports our
long-term strategy of leveraging the right technology for each
need.
9. While there are many cool things in Iceberg,
there are some challenges…
The main challenge is:
Maintenance
11. Table Configuration
- Iceberg v2 table, created with the AWS Glue catalog and
Athena engine version 3 (preferably in a dedicated
WorkGroup).
- Parquet with ZSTD compression - the data format we
adopted across our data lake.
- Snapshot age - 2 days (the default is 5 days).
Athena allows only predefined key-value TBLPROPERTIES.
Glue catalog - metadata tracking
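As a sketch, the configuration above maps to an Athena engine v3 DDL along these lines (database, table, and column names are hypothetical; the 2-day snapshot age is 172800 seconds):

```sql
CREATE TABLE datalake.events (
  event_id string,
  user_id string,
  event_ts timestamp
)
PARTITIONED BY (day(event_ts))          -- hidden partitioning on a timestamp column
LOCATION 's3://my-bucket/warehouse/events/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',             -- creates an Iceberg table in the Glue catalog
  'format' = 'parquet',
  'write_compression' = 'zstd',
  'vacuum_max_snapshot_age_seconds' = '172800'  -- 2 days instead of the 5-day default
);
```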
13. Table Maintenance
We update our Iceberg table frequently (every minute,
5 GB, insert/update, 50 columns, 500M records)…
So we wanted to VACUUM, but we were hitting the Athena query
limits:
14. Table Maintenance
Increasing the limits didn’t help much, because we were hitting another one:
ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round of vacuum, but
there are more files remaining. Please run another VACUUM command to process the
remaining files
You can try to overcome it by running AWS Step Functions in a loop, as in this suggested solution.
Miss several runs and you will face another challenge, as increasing the Athena query limits
won’t help you much this time…
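The same retry loop can also be driven from a plain script. A minimal sketch, assuming boto3 credentials are configured; the database, table, and workgroup names are hypothetical, and the loop simply re-issues VACUUM while Athena reports that more rounds are needed:

```python
import time

# Marker Athena puts in the state-change reason when a VACUUM round
# removed files but more remain.
MORE_RUNS_MARKER = "ICEBERG_VACUUM_MORE_RUNS_NEEDED"


def needs_another_vacuum(reason):
    """True when Athena says this VACUUM round should be followed by another."""
    return MORE_RUNS_MARKER in (reason or "")


def vacuum_until_done(database, table, workgroup, max_rounds=20):
    """Issue VACUUM repeatedly until Athena stops asking for more rounds.

    Returns the number of rounds executed. boto3 is imported lazily so the
    helper above stays usable without AWS dependencies.
    """
    import boto3  # AWS SDK for Python

    athena = boto3.client("athena")
    for round_no in range(1, max_rounds + 1):
        qid = athena.start_query_execution(
            QueryString=f"VACUUM {database}.{table}",
            WorkGroup=workgroup,
        )["QueryExecutionId"]
        # Poll until this VACUUM round reaches a terminal state.
        while True:
            status = athena.get_query_execution(
                QueryExecutionId=qid
            )["QueryExecution"]["Status"]
            if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(5)
        # Stop unless Athena explicitly asked for another round.
        if not needs_another_vacuum(status.get("StateChangeReason", "")):
            return round_no
    return max_rounds
```

The decision of whether to loop again lives in `needs_another_vacuum`, so it can be unit-tested without touching AWS.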
16. Glue Spark ETL Jobs
To solve this for the long run, we decided to use the
Iceberg Spark procedures to perform our maintenance
jobs:
- Glue 3.0 and later supports Iceberg integration out of the
box
- Ad-hoc runs & built-in scheduler
- Integrates with CI/CD pipelines using the AWS SDKs
A nice AWS blog post and the AWS Glue Developer Guide are available
17. Glue Spark Maintenance Jobs
Basically, the most important steps are:
- Register the Iceberg connector for AWS Glue (not required
for Glue 4.0)
- Create an ETL job or a Jupyter notebook
- Provide the necessary configuration to the Spark
job/notebook, such as --datalake-formats and --conf
NOTE: these actions automatically inject the Iceberg Spark SQL
extension
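For reference, the Glue job parameters look roughly like this sketch (the catalog name and warehouse path are hypothetical placeholders; additional --conf pairs are chained inside the value of the first --conf, as the Glue docs describe):

```
--datalake-formats  iceberg
--conf  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
        --conf spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse/
        --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```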
18. Glue Spark Maintenance Jobs
Full example is available here: https://github.com/eran-levy/iceberg-journey-session-examples
19. Glue Spark Maintenance Jobs
Main maintenance procedures:
● expire_snapshots
● rewrite_data_files
● remove_orphan_files
● rewrite_manifests
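These procedures are invoked from the Spark job as CALL statements against the catalog. A minimal sketch, assuming the catalog from the job configuration is named glue_catalog and using hypothetical database/table names and an illustrative cutoff timestamp:

```sql
-- Expire snapshots older than a cutoff, keeping at least one
CALL glue_catalog.system.expire_snapshots(
  table => 'mydb.mytable',
  older_than => TIMESTAMP '2023-01-01 00:00:00',
  retain_last => 1
);

-- Compact small data files into larger ones
CALL glue_catalog.system.rewrite_data_files(table => 'mydb.mytable');

-- Delete files no longer referenced by any snapshot
CALL glue_catalog.system.remove_orphan_files(table => 'mydb.mytable');

-- Rewrite manifests to speed up scan planning
CALL glue_catalog.system.rewrite_manifests('mydb.mytable');
```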
24. Summary
● Apache Iceberg is well adopted in the industry and
specifically in the AWS ecosystem.
● It's not persist & forget -> take Iceberg maintenance into
consideration when choosing your architecture.
● Keep monitoring -> your partitioning strategy, file sizes,
query latencies, etc. might change, as there are many
moving parts that can impact your performance.
25. Next Steps
● Choosing our data lakehouse platform
● Maintenance is an issue as we scale to additional use
cases with larger data volumes - we might need a
managed service to assist us here