Batch Processing at Scale
with Flink & Iceberg
Andreas Hailu
Vice President, Goldman Sachs
Goldman Sachs Data Lake
● Platform allowing users to generate batch data pipelines without writing any code
● Data producers register datasets, making metadata available
○ Dataset schema, source and access, batch frequency, etc.
○ Flink batch applications are generated dynamically
● Consumers subscribe to datasets for updates into warehouses
● Producers and consumers are decoupled
● Scale
○ 162K unique datasets
○ 140K batches/day
○ 4.2MM batches/month
[Architecture diagram: Producer source data flows through ETL into the Lake (HDFS), registered via the Registry Service and exposed through a Browseable Catalog; Warehousing exports to Redshift, S3, SAP IQ/ASE, and Snowflake.]
Batch Data Strategy
● Lake operates using copy-on-write enumerated batches
● Extracted data is merged with existing data to create a new batch
● Supports both milestoned and append merges
○ Milestoned merge builds out records such that the records themselves contain the as-of data
■ No time-travel required
■ Done per key, a “linked-list” of time-series records
■ Immutable, retained forever
○ Append merge simply appends incoming data to existing data
● Merged data is stored as Parquet/Avro; snapshots and deltas are generated per batch
○ Data is exported to warehouses on batch completion as either snapshot or incremental loads
● Consumers always read data from the last completed batch
● The last 3 batches of merged data are retained for recovery purposes
Milestoning Example

Batch 1

Staging Data
First Name  Last Name  Profession  Date
Art         Vandelay   Importer    May-31-1990

Merged Data
lake_in_id  lake_out_id  lake_from    lake_thru   First Name  Last Name  Profession  Date
1           999999999    May-31-1990  11/30/9999  Art         Vandelay   Importer    May-31-1990
Milestoning Example

Batch 2

Staging Data
First Name  Last Name  Profession         Date
Art         Vandelay   Importer-Exporter  June-30-1990

Merged Data
lake_in_id  lake_out_id  lake_from     lake_thru     First Name  Last Name  Profession         Date
1           1            May-31-1990   11/30/9999    Art         Vandelay   Importer           May-31-1990
2           999999999    May-31-1990   June-30-1990  Art         Vandelay   Importer           May-31-1990
2           999999999    June-30-1990  11/30/9999    Art         Vandelay   Importer-Exporter  June-30-1990
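The two batches above can be reproduced with a small, stdlib-only sketch of the milestoning scheme. This is inferred from the tables, not the platform's actual code: the function name is invented, and the closure convention (lake_out_id set to the previous batch id, lake_thru left untouched on the closed row) simply mirrors the example.

```python
INF_BATCH = 999_999_999   # lake_out_id sentinel: record is live
INF_DATE = "11/30/9999"   # lake_thru sentinel used on the slides

def milestone_merge(merged, staging, batch_id, key=("First Name", "Last Name")):
    """Apply one milestoned batch: close the live row for each updated key
    and build out the business-time history as a per-key "linked list"."""
    staged = {tuple(r[k] for k in key): r for r in staging}
    out = []
    for row in merged:
        k = tuple(row[f] for f in key)
        if row["lake_out_id"] == INF_BATCH and k in staged:
            new = staged.pop(k)
            out.append(dict(row, lake_out_id=batch_id - 1))  # close the old row
            out.append(dict(row, lake_in_id=batch_id,        # old values, valid
                            lake_thru=new["Date"]))          # until the update date
            out.append({"lake_in_id": batch_id, "lake_out_id": INF_BATCH,
                        "lake_from": new["Date"], "lake_thru": INF_DATE, **new})
        else:
            out.append(row)
    for new in staged.values():                              # brand-new keys
        out.append({"lake_in_id": batch_id, "lake_out_id": INF_BATCH,
                    "lake_from": new["Date"], "lake_thru": INF_DATE, **new})
    return out
```

Running batch 1 with Art Vandelay as an Importer and batch 2 with the Importer-Exporter update yields exactly the one-row and three-row merged tables shown above.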
Job Graph - Extract

DataSource: Extract Source Data (Batch N)
→ Map, FlatMap: Transform into Avro → Enrichment → Validate Data Quality
→ DataSink: Staging Directory (Batch N)
→ Map, FlatMap: Accumulate Bloom Filters, Partitions, … → Empty Sink
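The side path above accumulates Bloom filters over record keys so the merge job can later route records that cannot possibly match staging data straight to a sink. A minimal stdlib sketch of such a filter; the bit size and hash scheme are illustrative, not what the platform actually uses:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over record keys. The extract job would
    accumulate one of these per dataset; sizes here are illustrative."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive k independent bit positions by salting the key per hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

A merged record whose key gets a definite "no" from the filter cannot have a staging counterpart, so it never needs to enter the CoGroup.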
Job Graph - Merge

DataSource, Filter: Read Merged Data (Batch N-1) → Dead Records || Records Not in BloomFilter
→ DataSink: Merge Directory (Batch N)

DataSource, Filter: Read Merged Data (Batch N-1) → Live Records → Records In BloomFilter
DataSource: Read Staging Data (Batch N)
→ keyBy() → CoGroup: Merge Staging Records with Merged Records
→ DataSink: Merge Directory (snapshot & delta) (Batch N)
Merge Details
● Staging data is merged with existing live records
○ Some niche exceptions for certain use cases
● Updates result in the closure of the existing record and the insertion of a new record
○ lake_out_id < 999999999: “dead”
● Live records are typically what consumers query, as they contain the time-series data
○ lake_out_id = 999999999: “live”
● Over time, serialization of records not sent to the CoGroup hinders runtime performance
○ Dead records, and records the bloom filter rules out, must still be written to the new batch’s merge directory
○ More time is spent rewriting records than actually merging in the CoGroup
● Dead and live records are bucketed by file: live files are read, dead files are copied
○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of the data is composed of dead records
● Append merges copy data from the previous batch
● Both optimizations require periodic compaction to tame the overall file count
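The file-bucketing optimization above reduces to a routing decision per file. A sketch under assumed inputs: each file is a (name, is_live, keys) tuple, and `might_contain` stands in for the Bloom filter accumulated at extract time.

```python
def plan_merge_io(files, might_contain):
    """Route each merged-data file from batch N-1: live files whose keys may
    overlap the staging batch must be read and merged; dead files, and live
    files with no possible key overlap, are copied forward byte-for-byte
    without ever being deserialized."""
    to_read, to_copy = [], []
    for name, is_live, keys in files:
        if is_live and any(might_contain(k) for k in keys):
            to_read.append(name)
        else:
            to_copy.append(name)
    return to_read, to_copy
```

The more of a dataset that sits in dead (or filtered-out) files, the less the merge has to serialize, which is where the ≥ 50%-dead-records win comes from.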
Partitioning
● Can substantially improve batch turnover time
○ Data is merged against its own partition, reducing the overall volume of data written per batch
● The dataset must have a field that supports partitioning of the data
○ Date, timestamp, or integer
● Changes how data is stored
○ Different underlying directory structure; consumers must be aware
○ The Registry service stores metadata about the latest batch for a partition
● The merge end result can be different
○ Partition fields can’t be changed once set
● Not all datasets have a field to partition on
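The turnover win can be sketched as merging only the partitions the staging batch actually touches, carrying everything else forward unchanged. The helper names (`part_fn`, `merge_fn`) are illustrative stand-ins for the partition-field extraction and the per-partition merge:

```python
from collections import defaultdict

def merge_partitioned(merged_by_part, staging, part_fn, merge_fn):
    """Merge a staging batch against only the partitions it touches;
    untouched partitions are carried forward as-is, so the volume of
    data rewritten per batch shrinks to the touched partitions."""
    staged = defaultdict(list)
    for rec in staging:
        staged[part_fn(rec)].append(rec)
    out = {}
    for part, existing in merged_by_part.items():
        out[part] = merge_fn(existing, staged.pop(part)) if part in staged else existing
    for part, recs in staged.items():        # partitions seen for the first time
        out[part] = merge_fn([], recs)
    return out
```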
Challenges
● Change set volumes per batch tend to stay consistent over time, but overall data volume increases
● Data producer & consumer SLAs tend to be static
○ Data must be made available 30 minutes after a batch begins
○ Data must be available by 14:30 EST in order to fulfill EOD reporting
● We own the implementation, not the data
○ The same code runs for every dataset
○ No control over fields, types, batch size, partitioning strategy, etc.
● Must support different use cases
○ From a daily batch to 100+ batches/day
○ Milestoned & append batches
○ Snapshot feeds, incremental loads
● Merge optimizations so far only help ingest apps
○ Data is consumed in many ways once ingested
○ User Spark code, internal processes exporting snapshot and incremental loads to warehouses
Iceberg
● Moving primary storage from HDFS → S3 offered a chance to review the batch data strategy
● Iceberg’s metadata layer offers interesting features
○ Manifest files recording statistics
○ Hidden partitioning
■ Reading data looks the same client-side, regardless of whether or how the table is partitioned
■ Tracking of partition metadata no longer required
■ Filtering blocks out with Parquet predicates is good; not reading them at all is better
● Not all datasets use Parquet
■ Consumers benefit in addition to ingest apps
○ V2 table format
■ Performant merge-on-read potential
● Batch retention managed with Snapshots
Iceberg - Partitioning
● Tables maintain metadata files that facilitate query planning
● Planning determines which files are required for a query
○ Unnecessary files are not read; a single lookup rather than multiple IOPs
● Milestoned tables are partitioned by record liveness
○ Live records bucketed together, dead records bucketed together
○ “select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990”
○ The ingest app is no longer responsible for the implementation
● Can further be partitioned by a producer-specified field in the schema
● The table implementation can change while consumption patterns don’t
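Metadata-driven planning for a query like the one above can be simulated with per-file min/max column statistics, which is the kind of information Iceberg manifests record. The stats layout and YYYYMMDD integer dates are illustrative, not Iceberg's actual manifest format:

```python
def prune_files(manifest, live_id, from_ge, thru_lt):
    """Keep only data files whose column ranges could satisfy
    lake_out_id = live_id AND lake_from >= from_ge AND lake_thru < thru_lt.
    `manifest` maps file name -> {column: (min, max)}; dates are
    encoded as YYYYMMDD integers for easy comparison."""
    keep = []
    for name, s in manifest.items():
        if not (s["lake_out_id"][0] <= live_id <= s["lake_out_id"][1]):
            continue  # file holds only dead records: pruned for a live query
        if s["lake_from"][1] < from_ge or s["lake_thru"][0] >= thru_lt:
            continue  # date ranges can never match the predicate
        keep.append(name)
    return keep
```

Files full of dead records are eliminated from a live-record query without a single byte of their data being read, regardless of data format.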
Iceberg - V2 Tables
● V2 tables support a merge-on-read strategy
○ Deltas are applied to the main table in lieu of rewriting files every batch
● The traditional ingest CoGroup step already marked records for insert, update, delete, and unchanged
● Read only the required records for the CoGroup
○ Output becomes a bounded changelog DataStream
○ Unchanged records are no longer emitted
● GenericRecord is transformed to RowData and given a delta-appropriate RowKind association when written to the Iceberg table
○ RowKind.INSERT for new records
○ RowKind.DELETE + RowKind.INSERT for updates
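The bounded changelog the CoGroup produces can be sketched as follows. Flink's real RowKind lives in org.apache.flink.types; a plain Python enum stands in here, and the dict-of-rows inputs are illustrative:

```python
from enum import Enum

class RowKind(Enum):
    INSERT = "+I"
    DELETE = "-D"

def to_changelog(staging_by_key, merged_live_by_key):
    """Turn per-key insert/update/unchanged decisions into a bounded
    changelog: new keys emit INSERT, updates emit DELETE of the old row
    plus INSERT of the new one, unchanged records emit nothing."""
    out = []
    for key, new in staging_by_key.items():
        old = merged_live_by_key.get(key)
        if old is None:
            out.append((RowKind.INSERT, new))
        elif old != new:
            out.append((RowKind.DELETE, old))
            out.append((RowKind.INSERT, new))
        # old == new: unchanged, nothing emitted
    return out
```

Because unchanged records emit nothing, the volume written per batch tracks the change set rather than the full dataset.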
Iceberg - V2 Tables
● The Iceberg Flink connector uses equality deletes
○ Identifies deleted rows by ≥ 1 column values
○ A data row is deleted if its values equal the delete columns
○ Doesn’t require knowing where the rows are
○ Physically removed when files are compacted
○ Positional deletes, by contrast, require knowing where the row to delete is
● Records are enriched with an internal field holding a unique identifier for deletes
○ A random 32-bit alphanumeric ID created during the extract phase
○ Consumers only read data with the schema in the registry
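A sketch of how such an identifier and the equality-delete semantics fit together. The field name `_lake_row_id` and the hex encoding are hypothetical stand-ins; the slides only say the field is internal, random, 32-bit, and outside the registered schema:

```python
import secrets

def new_row_id() -> str:
    """Random 32-bit identifier attached at extract time, hex-encoded.
    (The real field name and encoding are internal; this is illustrative.)"""
    return secrets.token_hex(4)  # 4 bytes = 32 bits of randomness

def apply_equality_deletes(data_rows, delete_rows, id_field="_lake_row_id"):
    """Merge-on-read: a data row is dropped when its id matches an equality
    delete row's id; no positional information about the row is needed."""
    deleted = {d[id_field] for d in delete_rows}
    return [r for r in data_rows if r[id_field] not in deleted]
```

Keying deletes on a single unique field keeps the equality comparison (and the read-side overhead) minimal, which is the point made in the summary.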
Iceberg - V2 Tables Maintenance
● Over time, inserts and deletes can lead to many small data and delete files
○ The small files problem, plus more metadata stored in manifest files
● Periodically compact files during downtime
○ Downtime is determined from ingestion schedule metadata in the Registry
○ Compaction creates a new snapshot, so reads are not impacted
○ Deletes are applied to the data files
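The compaction step can be sketched as folding the equality deletes into the data and rewriting the survivors into fewer, larger files. A real compaction targets file sizes in bytes; a rows-per-file target keeps this illustration simple, and the list-of-lists file model is assumed:

```python
def compact(data_files, delete_files, target_rows_per_file, id_field="id"):
    """Downtime compaction sketch: apply equality deletes to the data rows,
    then rewrite the surviving rows into fewer, larger files. Inputs are
    lists of "files", each a list of row dicts."""
    deleted = {d[id_field] for f in delete_files for d in f}
    live = [r for f in data_files for r in f if r[id_field] not in deleted]
    # Re-bucket survivors into files of up to target_rows_per_file rows each.
    return [live[i:i + target_rows_per_file]
            for i in range(0, len(live), target_rows_per_file)]
```

After the rewrite, the delete files are no longer needed and the manifest shrinks, since the new snapshot references fewer, larger files.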
Iceberg - V2 Tables Performance Testing
● Milestoning
○ Many updates and deletes
○ 10 million records over 8 batches
■ ~1.2GB staging data/batch
○ 10GB Snappy-compressed data in total
○ 51% observed reduction in overall runtime over 8 batches compared to traditional file-based storage
○ Compaction runtime 51% faster than the traditional merge runtime
● Append
○ Data is only appended; no updates/deletes
○ 500K records over 5 batches
○ 1TB Snappy-compressed data in total
○ 63% observed reduction in overall runtime over 5 batches
○ Compaction runtime 24% faster than the average traditional merge runtime
Summary
● Select equality delete fields wisely
○ Using just 1 field minimizes read overhead
● The compaction approach should be considered early
○ Scheduling: built as part of the application
● Partition to facilitate query patterns
Q&A
Thanks!
Learn more at GS.com/Engineering
The term ‘engineer’ in this section does not refer to a licensed engineer or to an individual offering engineering services to the general public under applicable law.
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete, and/or up to date, and it should not be relied on as such. The Materials do not constitute advice, nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition of Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose their contents without the permission of Goldman Sachs. © Copyright 2022 The Goldman Sachs Group, Inc. All rights reserved.

Frank van Harmelen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Batch Processing at Scale with Flink & Iceberg

  • 1. Batch Processing at Scale with Flink & Iceberg
       Andreas Hailu
       Vice President, Goldman Sachs
  • 2. Goldman Sachs Data Lake
       ● Platform allowing users to generate batch data pipelines without writing any code
       ● Data producers register datasets, making metadata available
         ○ Dataset schema, source and access, batch frequency, etc.
         ○ Flink batch applications generated dynamically
       ● Datasets subscribed for updates by consumers in warehouses
       ● Producers and consumers decoupled
       ● Scale
         ○ 162K unique datasets
         ○ 140K batches/day
         ○ 4.2MM batches/month
       [Architecture diagram: producer source data flows through the Registry Service and ETL into the Lake (HDFS, S3) and a browseable catalog, then is warehoused to Redshift, SAP IQ/ASE, and Snowflake]
  • 3. Batch Data Strategy
       ● Lake operates using copy-on-write enumerated batches
       ● Extracted data merged with existing data to create a new batch
       ● Support both milestoned and append merges
         ○ Milestoned merge builds out records such that records themselves contain the as-of data
           ■ No time-travel required
           ■ Done per key, “linked-list” of time-series records
           ■ Immutable, retained forever
         ○ Append merge simply appends incoming data to existing data
       ● Merged data is stored as Parquet/Avro, snapshots and deltas generated per batch
         ○ Data exported to warehouse on batch completion in either snapshot/incremental loads
       ● Consumers always read data from last completed batch
       ● Last 3 batches of merged data are retained for recovery purposes
  • 4. Milestoning Example
       Staging Data
       | First Name | Last Name | Profession | Date        |
       | Art        | Vandelay  | Importer   | May-31-1990 |
       Merged Data (Batch 1)
       | lake_in_id | lake_out_id | lake_from   | lake_thru  | First Name | Last Name | Profession | Date        |
       | 1          | 999999999   | May-31-1990 | 11/30/9999 | Art        | Vandelay  | Importer   | May-31-1990 |
  • 5. Milestoning Example
       Staging Data
       | First Name | Last Name | Profession        | Date         |
       | Art        | Vandelay  | Importer-Exporter | June-30-1990 |
       Merged Data (Batch 2)
       | lake_in_id | lake_out_id | lake_from    | lake_thru    | First Name | Last Name | Profession        | Date         |
       | 1          | 1           | May-31-1990  | 11/30/9999   | Art        | Vandelay  | Importer          | May-31-1990  |
       | 2          | 999999999   | May-31-1990  | June-30-1990 | Art        | Vandelay  | Importer          | May-31-1990  |
       | 2          | 999999999   | June-30-1990 | 11/30/9999   | Art        | Vandelay  | Importer-Exporter | June-30-1990 |
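The two example batches above can be sketched as a small copy-on-write merge. The record shape, the `milestone_merge` name, and the rule that a superseded record's `lake_out_id` becomes the previous batch number are all inferred from the example tables, so treat this as one illustrative reading of the milestoning semantics, not the production logic (which runs as a Flink CoGroup in Java):

```python
LIVE = 999_999_999        # sentinel lake_out_id marking a live record
INFINITY = "11/30/9999"   # sentinel lake_thru marking open-ended validity

def milestone_merge(merged, staging, batch_id, key_fields=("first_name", "last_name")):
    """Merge one batch of staging rows into the milestoned table, copy-on-write."""
    key = lambda r: tuple(r[f] for f in key_fields)
    staged = {key(r): r for r in staging}
    existing = {key(r) for r in merged}
    out = []
    for rec in merged:
        if rec["lake_out_id"] == LIVE and key(rec) in staged:
            new = staged[key(rec)]
            # Close the superseded record: last live in the previous batch.
            out.append({**rec, "lake_out_id": batch_id - 1})
            # Re-insert the old values with validity truncated at the change date...
            out.append({**rec, "lake_in_id": batch_id, "lake_thru": new["date"]})
            # ...and insert the new values, valid from the change date onward.
            out.append({**new, "lake_in_id": batch_id, "lake_out_id": LIVE,
                        "lake_from": new["date"], "lake_thru": INFINITY})
        else:
            out.append(rec)  # dead or unchanged records carry over as-is
    for k, new in staged.items():
        if k not in existing:  # brand-new key: single live record
            out.append({**new, "lake_in_id": batch_id, "lake_out_id": LIVE,
                        "lake_from": new["date"], "lake_thru": INFINITY})
    return out
```

Replaying the slides' two batches through this sketch reproduces the three merged rows of Batch 2: one dead row and a two-entry "linked list" of the record's time series.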
  • 6. Job Graph - Extract
       [Job graph diagram, Batch N: DataSource reads source data → Map/FlatMap transforms into Avro → enrichment → data-quality validation → DataSink writes to the staging directory; a side branch accumulates Bloom filters, partitions, etc. into an empty sink]
  • 7. Job Graph - Merge
       [Job graph diagram: staging data from Batch N is keyBy()'d and CoGrouped with merged data from Batch N-1 — one DataSource/Filter branch reads live records that are in the Bloom filter; another reads dead records or records not in the Bloom filter — and the merged output is written by a DataSink to the Batch N merge directory (snapshot & delta)]
  • 8. Merge Details
       ● Staging data is merged with existing live records
         ○ Some niche exemptions for certain use cases
       ● Updates result in closure of the existing record and insertion of a new record
         ○ lake_out_id < 999999999 - “dead”
       ● Live records are typically what consumers query, as they contain time-series data
         ○ lake_out_id = 999999999 - “live”
       ● Over time, serialization of records not sent to CoGroup hinders runtime fitness
         ○ Dead records & records bloom-filtered out must still be written to the new batch merge directory
         ○ More time spent rewriting records in CoGroup than actually merging
       ● Dead and live records bucketed by file; live records read, dead files copied
         ○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of data is composed of dead records
       ● Append merges copy data from previous batch
       ● Both optimizations require periodic compaction to tame overall file count
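The routing described above can be sketched as follows. The `route_merged_records` function, the flat record shape, and the Bloom filter parameters are illustrative assumptions; in the pipeline this logic lives in Flink DataSource/Filter branches feeding the CoGroup:

```python
import hashlib

LIVE = 999_999_999

class BloomFilter:
    """Tiny stand-in for the Bloom filters accumulated during the extract phase."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # No false negatives; occasional false positives are harmless here,
        # they just send an extra record through the CoGroup.
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def route_merged_records(merged, staging_keys):
    """Only live records that may have a staging update go to the CoGroup;
    dead records and bloom-filtered-out live records are copied forward
    to the new batch directory without being rewritten."""
    bloom = BloomFilter()
    for key in staging_keys:
        bloom.add(key)
    to_cogroup, to_copy = [], []
    for rec in merged:
        if rec["lake_out_id"] == LIVE and bloom.might_contain(rec["key"]):
            to_cogroup.append(rec)
        else:
            to_copy.append(rec)
    return to_cogroup, to_copy
```

The copy path is what makes the ≥ 50%-dead-records pattern cheap: those files are moved, not deserialized and re-serialized through the CoGroup.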
  • 9. Partitioning
       ● Can substantially improve batch turnover time
         ○ Data merged against its own partition, reducing overall volume of data written in batch
       ● Dataset must have a field that supports partitioning for data requirements
         ○ Date, timestamp, or integer
       ● Changes how data is stored
         ○ Different underlying directory structure, consumers must be aware
         ○ Registry service stores metadata about latest batch for a partition
       ● Merge end result can be different
         ○ Partition fields can’t be changed once set
       ● Not all datasets have a field to partition on
  • 10. Challenges
       ● Change set volumes per batch tend to stay consistent over time, but overall data volume increases
       ● Data producer & consumer SLAs tend to be static
         ○ Data must be made available 30 minutes after batch begins
         ○ Data must be available by 14:30 EST in order to fulfill EOD reporting
       ● Own the implementation, not the data
         ○ Same code run for every dataset
         ○ No control over fields, types, batch size, partitioning strategy, etc.
       ● Support different use cases
         ○ Daily batch to 100+ batches/day
         ○ Milestoned & append batches
         ○ Snapshot feeds, incremental loads
       ● Merge optimizations so far only help ingest apps
         ○ Data consumed in many ways once ingested
         ○ User Spark code, internal processes exporting snapshot and incremental loads to warehouses
  • 11. Iceberg
       ● Moving primary storage from HDFS → S3 offered chance for batch data strategy review
       ● Iceberg’s metadata layer offers interesting features
         ○ Manifest files recording statistics
         ○ Hidden partitioning
           ■ Reading data looks the same client-side, regardless of if/how the table is partitioned
           ■ Tracking of partition metadata no longer required
           ■ Filtering blocks out with Parquet predicates is good; not reading them at all is better
             ● Not all datasets use Parquet
           ■ Consumers benefit in addition to ingest apps
         ○ V2 table format
           ■ Performant merge-on-read potential
       ● Batch retention managed with Snapshots
  • 12. Iceberg - Partitioning
       ● Tables maintain metadata files that facilitate query planning
       ● Determines what files are required for a query
         ○ Unnecessary files not read; a single lookup rather than multiple IOPs
       ● Milestoned tables partitioned by record liveness
         ○ Live records bucketed together, dead records bucketed together
         ○ “select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990”
         ○ Ingest app no longer responsible for implementation
       ● Can further be partitioned by producer-specified field in schema
       ● Table implementation can change while consumption patterns don’t
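The pruning effect of liveness partitioning can be sketched with a toy planner. In a real table the transform would be declared in the Iceberg partition spec and the planning done from manifest statistics; the function names and manifest shape below are illustrative assumptions:

```python
LIVE = 999_999_999

def liveness(lake_out_id):
    """Hidden-partition transform: the partition value is derived from a
    column, so readers filter on lake_out_id and never name the partition."""
    return "live" if lake_out_id == LIVE else "dead"

def plan_scan(manifest, wants_live_only):
    """Toy planner: a query filtering lake_out_id = 999999999 only ever
    opens files recorded under the 'live' partition; dead files are
    skipped at planning time rather than filtered row-by-row."""
    if wants_live_only:
        return sorted(f for f, part in manifest.items() if part == "live")
    return sorted(manifest)

# Example: two data files, one per liveness bucket.
manifest = {"live-00001.parquet": liveness(LIVE),
            "dead-00001.parquet": liveness(1)}
```

With this layout, the `select distinct(Profession) ... where lake_out_id = 999999999` query above never reads the dead files at all, which is the "not reading them at all is better" point from the previous slide.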
  • 13. Iceberg - V2 Tables
       ● V2 tables support a merge-on-read strategy
         ○ Deltas applied to main table in lieu of rewriting files every batch
       ● Traditional ingest CoGroup step already marked records for insert, update, delete, and unchanged
       ● Read only required records for CoGroup
         ○ Output becomes a bounded changelog DataStream
         ○ Unchanged records no longer emitted
       ● GenericRecord transformed to RowData and given delta-appropriate RowKind association when written to Iceberg table
         ○ RowKind.INSERT for new records
         ○ RowKind.DELETE + RowKind.INSERT for updates
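The decision-to-changelog mapping above can be sketched as a pure function. The `RowKind` enum here mirrors the two Flink row kinds the slide names (Flink's full enum also has update-before/after kinds), and `to_changelog` plus the string decision labels are illustrative assumptions standing in for the CoGroup's output:

```python
from enum import Enum

class RowKind(Enum):
    """Subset of Flink's org.apache.flink.types.RowKind used by this delta scheme."""
    INSERT = "+I"
    DELETE = "-D"

def to_changelog(decision, old_row=None, new_row=None):
    """Map a CoGroup merge decision to the RowKind-tagged rows written
    to the Iceberg V2 table as a bounded changelog."""
    if decision == "insert":
        return [(RowKind.INSERT, new_row)]
    if decision == "update":
        # An update becomes an equality delete of the old row plus an insert.
        return [(RowKind.DELETE, old_row), (RowKind.INSERT, new_row)]
    if decision == "delete":
        return [(RowKind.DELETE, old_row)]
    return []  # "unchanged" records are no longer emitted at all
```

The empty list for "unchanged" is the key difference from the copy-on-write merge: untouched records generate no writes, leaving the deltas to be reconciled at read or compaction time.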
  • 14. Iceberg - V2 Tables
       ● Iceberg Flink connector uses equality deletes
         ○ Identifies deleted rows by ≥ 1 column values
         ○ A data row is deleted if its values equal the delete columns
         ○ Doesn’t require knowing where the rows are
         ○ Deleted when files compacted
         ○ Positional deletes, by contrast, require knowing where the row to delete is located
       ● Records enriched with internal field with unique identifier for deletes
         ○ Random 32-bit alphanumeric ID created during extract phase
         ○ Consumers only read data with schema in registry
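The enrichment step can be sketched as below. The slide says a "random 32-bit alphanumeric ID"; this sketch renders it as 32 alphanumeric characters, which is an assumption, as are the function and field names:

```python
import secrets
import string

_ALPHABET = string.ascii_letters + string.digits

def delete_key():
    """Generate the internal per-record identifier used as the single
    equality-delete column (32 alphanumeric characters, by assumption)."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(32))

def enrich(record):
    """Extract-phase enrichment: attach the hidden delete key. Because the
    field is not in the registered schema, consumers never see it."""
    return {**record, "_lake_delete_key": delete_key()}
```

Keying equality deletes off one synthetic column, rather than the producer's business key, is what makes the "using just 1 field" advice in the summary workable for every dataset regardless of its schema.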
  • 15. Iceberg - V2 Tables Maintenance
       ● Over time, inserts and deletes can lead to many small data and delete files
         ○ Small-files problem, and more metadata stored in manifest files
       ● Periodically compact files during downtime
         ○ Downtime determined from ingestion schedule metadata in Registry
         ○ Creates a new snapshot, reads not impacted
         ○ Deletes applied to data files
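Finding downtime from the schedule can be sketched as a gap search. The function name, the shape of the Registry's schedule metadata, and the minimum-gap policy are all assumptions for illustration:

```python
from datetime import datetime, timedelta

def next_compaction_window(ingest_starts, min_gap):
    """Scan the Registry's upcoming batch start times (sorted ascending) for
    the first gap between consecutive ingests long enough to run compaction;
    return the start of that gap, or None if the known schedule has no slack."""
    for prev, nxt in zip(ingest_starts, ingest_starts[1:]):
        if nxt - prev >= min_gap:
            return prev
    return None
```

Because compaction just commits a new snapshot, a window that turns out slightly too short degrades to a wasted attempt rather than blocked reads.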
  • 16. Iceberg - V2 Tables Performance Testing
       ● Milestoning
         ○ Many updates and deletes
         ○ 10 million records over 8 batches
           ■ ~1.2GB staging data/batch
         ○ 10GB Snappy compressed data in total
         ○ 51% observed reduction in overall runtime over 8 batches when compared to traditional file-based storage
         ○ Compaction runtime 51% faster than traditional merge runtime
       ● Append
         ○ Data is only appended, no updates/deletes
         ○ 500K records over 5 batches
         ○ 1TB Snappy compressed data in total
         ○ 63% observed reduction in overall runtime over 5 batches
         ○ Compaction runtime 24% faster than average traditional merge runtime
  • 17. Summary
       ● Select equality delete fields wisely
         ○ Using just 1 field minimizes read overhead
       ● Compaction approach needs to be thought of early
         ○ Scheduling - built as part of application
       ● Partition to facilitate query patterns
  • 18. Q&A
       Thanks! Learn more at GS.com/Engineering
       The term ‘engineer’ in this section does not refer to a licensed engineer or to an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete, and/or up to date, and it should not be relied on as such. The Materials do not constitute advice, nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition of Goldman Sachs presenting the materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2022 the Goldman Sachs Group, Inc. All rights reserved.