Zesty journey to adopt
Apache Iceberg
Eran Levy
@levyeran
Eran Levy
Data & Platform Group Lead @Zesty
https://levyeran.medium.com/
Introduction
How can you use Iceberg on AWS with no Spark expertise
on your team while going all-in on serverless?
WIIFM (what's in it for me)
Build Fast.
Stay Cost Efficient.
300% customer growth YoY
$120M raised since 2020
406% employee growth YoY
The Goal - Database Explorer
Medallion Architecture
Why did we choose
Apache Iceberg?
- Open table format that is widely adopted and integrates well
with the AWS ecosystem (Glue catalog, Athena, etc.).
- Table evolution, mainly of schema and partitioning layout
(particularly hidden partitioning).
- Integrates well with many processing engines, which supports our
long-term strategy of using the right technology for each
need.
While there are many cool things in Iceberg, there are some challenges…
The main challenge is: maintenance
Architecture
Table Configuration
- Iceberg v2 table, created with the AWS Glue catalog and
Athena engine version 3 (preferably in a dedicated
workgroup).
- Parquet with ZSTD compression, the data format we
adopted across our data lake.
- Snapshot max age: 2 days (the default is 5 days).
Note: Athena accepts only a predefined set of key-value TBLPROPERTIES.
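The configuration above can be sketched as an Athena CREATE TABLE statement. This is a minimal illustration, not the table the team actually created: the database, table, column names and S3 location are hypothetical, and the property keys are the ones Athena documents for Iceberg tables (the snapshot age is expressed in seconds).

```python
def build_create_table_ddl(database, table, columns, location,
                           max_snapshot_age_days=2):
    """Return an Athena CREATE TABLE statement for an Iceberg table.

    columns is a list of (name, athena_type) pairs. All names passed in
    are illustrative; nothing here comes from the original deck.
    """
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    properties = {
        "table_type": "ICEBERG",
        "format": "parquet",
        "write_compression": "zstd",
        # Athena expects the snapshot age in seconds (2 days = 172800).
        "vacuum_max_snapshot_age_seconds": str(max_snapshot_age_days * 24 * 60 * 60),
    }
    props_sql = ",\n  ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return (
        f"CREATE TABLE {database}.{table} (\n  {column_sql}\n)\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES (\n  {props_sql}\n)"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_create_table_ddl(
        "demo_db", "events",
        [("id", "bigint"), ("updated_at", "timestamp")],
        "s3://example-bucket/events/",
    ))
```

Building the statement from a dictionary keeps the allowed properties in one place, which helps because Athena rejects keys outside its predefined set.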
Glue catalog - Metadata tracking
Table Maintenance
The main maintenance operations for optimizing an Iceberg table in
Athena:
1. VACUUM
2. OPTIMIZE
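For reference, the two Athena statements can be sketched as below. The helper functions and the database and table names are hypothetical. The statement shapes follow the Athena documentation: VACUUM takes only the table name, and OPTIMIZE rewrites data with bin packing and accepts an optional WHERE predicate.

```python
def vacuum_statement(database, table):
    """Athena statement that expires snapshots and removes orphan files."""
    return f"VACUUM {database}.{table}"


def optimize_statement(database, table, where=None):
    """Athena statement that compacts small data files with bin packing.

    where is an optional predicate used to limit the rewrite,
    for example to a single partition.
    """
    statement = f"OPTIMIZE {database}.{table} REWRITE DATA USING BIN_PACK"
    if where:
        statement += f" WHERE {where}"
    return statement


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(vacuum_statement("demo_db", "events"))
    print(optimize_statement("demo_db", "events", "event_date = DATE '2023-09-01'"))
```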
We update our Iceberg table frequently (every minute,
5 GB, inserts/updates, 50 columns, 500M records)…
So we wanted to run VACUUM, but we kept hitting the Athena query
limits.
Increasing the limits didn't help much, because we then hit another error:
ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round of vacuum, but there are more files remaining. Please run another VACUUM command to process the remaining files
You can try to overcome it by running AWS Step Functions in a loop, like this suggested solution.
Miss several runs, though, and you will face another challenge, as increasing the Athena query
limits won't help you much this time…
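The retry-until-done idea can be sketched in plain Python rather than Step Functions, which is a simplification of the approach the slide refers to. The `run_query` callable is a hypothetical stand-in for whatever executes a statement in Athena. It is assumed to return None on success and the error message text on failure.

```python
MORE_RUNS_NEEDED = "ICEBERG_VACUUM_MORE_RUNS_NEEDED"


def vacuum_until_done(run_query, database, table, max_rounds=50):
    """Re-run VACUUM until Athena stops asking for another round.

    run_query is a caller-supplied function (an assumption of this sketch)
    that executes one statement and returns None on success or the error
    message on failure. Returns the number of rounds that were executed.
    """
    statement = f"VACUUM {database}.{table}"
    for round_number in range(1, max_rounds + 1):
        error = run_query(statement)
        if error is None:
            return round_number
        if MORE_RUNS_NEEDED not in error:
            # Any other failure (for example a query timeout) is not retried.
            raise RuntimeError(f"VACUUM failed: {error}")
    raise RuntimeError(f"VACUUM still incomplete after {max_rounds} rounds")
```

The `max_rounds` cap is a design choice of the sketch: when several scheduled runs are missed, the backlog grows, so an unbounded loop could run for a very long time.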
The same happened with OPTIMIZE, where we were hitting the partitions limitation.
Glue Spark ETL Jobs
To solve this for the long run, we decided to use the
Iceberg Spark procedures for our maintenance
jobs:
- Glue 3.0 and later support Iceberg integration out of the
box
- Ad-hoc runs and a built-in scheduler
- Integrated with our CI/CD pipeline using the AWS SDKs
A nice AWS blog post and the AWS Glue Developer Guide are available.
Glue Spark Maintenance Jobs
The most important steps are:
- Register the Iceberg connector for AWS Glue (not required
for Glue 4.0)
- Create an ETL job or a Jupyter notebook
- Provide the necessary configuration to the Spark
job/notebook, such as --datalake-formats and --conf
NOTE: these actions automatically inject the Iceberg Spark SQL
extension
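The job configuration can be sketched as the job arguments below. The warehouse path and catalog name are hypothetical. The Spark settings follow the pattern in the AWS Glue Developer Guide, where several settings are chained inside a single `--conf` value. The extension setting is listed explicitly here, although it may be redundant if Glue injects it for you.

```python
def iceberg_job_arguments(warehouse_path, catalog_name="glue_catalog"):
    """Return Glue job arguments that enable Iceberg with the Glue catalog.

    warehouse_path and catalog_name are illustrative values chosen by the
    caller; they do not come from the original deck.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    spark_settings = [
        "spark.sql.extensions="
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse={warehouse_path}",
        f"{prefix}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    ]
    return {
        "--datalake-formats": "iceberg",
        # Glue accepts one --conf argument, so the remaining settings are
        # chained inside its value, separated by " --conf ".
        "--conf": " --conf ".join(spark_settings),
    }


if __name__ == "__main__":
    for key, value in iceberg_job_arguments("s3://example-bucket/warehouse/").items():
        print(key, value)
```

A dictionary of this shape could be passed as the default arguments of a Glue job when it is created through an AWS SDK, which fits the CI/CD integration mentioned earlier.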
Full example is available here: https://github.com/eran-levy/iceberg-journey-session-examples
Main maintenance procedures:
● expire_snapshots
● rewrite_data_files
● remove_orphan_files
● rewrite_manifests
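The four procedures can be sketched as Spark SQL CALL statements. This is an illustration, not the content of the linked repository: the catalog and table names are hypothetical, and only the arguments that the Iceberg documentation lists for each procedure are used. In a Glue job, each statement would typically be passed to `spark.sql(...)`.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog, table, now, snapshot_age_days=2):
    """Return Iceberg Spark procedure calls, in the order listed above.

    catalog and table (for example 'demo_db.events') are illustrative.
    now is passed in explicitly so the snapshot cutoff is reproducible.
    """
    cutoff = (now - timedelta(days=snapshot_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Drop snapshots older than the cutoff.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Compact small data files.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Delete files no longer referenced by table metadata
        # (the procedure's default cutoff applies).
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        # Rewrite manifests to speed up scan planning.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


if __name__ == "__main__":
    for statement in maintenance_statements(
            "glue_catalog", "demo_db.events", datetime(2023, 9, 20, 12, 0, 0)):
        print(statement)
```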
QuickSight for Apache Iceberg metadata analysis
Not optimized vs. optimized!
Snapshots after optimization
Snapshots after expiration procedure
QuickSight for Apache Iceberg data files analysis
Summary
● Apache Iceberg is well adopted in the industry, and
specifically in the AWS ecosystem.
● It's not persist & forget: take Iceberg maintenance into
consideration when choosing your architecture.
● Keep monitoring: your partitioning strategy, file sizes,
query latencies, etc. might change, as there are many
moving parts that can impact your performance.
Next Steps
● Choosing our data lakehouse platform
● Maintenance becomes an issue as we scale to additional use
cases with larger data volumes, so we might need a
managed service to assist us here

Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
