Tame the Small Files Problem and Optimize
Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
Apache Iceberg is an open table format for huge analytic datasets
• Time travel
• Advanced filtering
• Serializable isolation
Where does Iceberg fit in the ecosystem
• Compute Engine
• Table Format (Metadata)
• Storage (Data): cloud blob storage
Ingest data to an Iceberg data lake in streaming fashion:
Kafka (msg queue) → Flink streaming ingestion → Iceberg data lake
Zoom into the Flink Iceberg sink: writer tasks (writer-1, writer-2, … writer-n) take records and write data files to DFS; a single committer task then commits the file metadata to the Iceberg data lake.
Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
Long tail problem with late-arriving data: the bulk of the traffic lands in the current hour, with a long tail of data spread across older hours (0, 1, 2, … N).
https://en.wikipedia.org/wiki/Long_tail
A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
How many data files are generated every hour?
Assuming the table is partitioned hourly and the event time range is capped at 10 days, records span 24x10 = 240 partitions. Each of the 500 writers keeps 240 files open, and the job commits 240x500 = 120K files every checkpoint. With a 10-minute checkpoint interval, that is 720K files every hour.
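The arithmetic above can be checked with a tiny sketch (the figures of 240 partitions, 500 writers, and 6 checkpoints per hour come from the slide):

```java
// Back-of-the-envelope math for files produced by the sink.
public class FileCountMath {
    static long filesPerCheckpoint(int openPartitionsPerWriter, int writers) {
        return (long) openPartitionsPerWriter * writers;
    }

    static long filesPerHour(long filesPerCheckpoint, int checkpointsPerHour) {
        return filesPerCheckpoint * checkpointsPerHour;
    }

    public static void main(String[] args) {
        int partitions = 24 * 10; // hourly partitions x 10-day event-time range
        long perCheckpoint = filesPerCheckpoint(partitions, 500);
        long perHour = filesPerHour(perCheckpoint, 60 / 10);
        System.out.println(perCheckpoint); // 120000
        System.out.println(perHour);       // 720000
    }
}
```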
Long-tail hours lead to small files
Percentile File Size
P50 55 KB
P75 77 KB
P90 13 MB
P99 18 MB
What are the implications of too many small files
• Poor read performance

• Request throttling

• Memory pressure

• Longer checkpoint duration and pipeline pause

• Stress the metadata system
Why not keyBy shuffle
A keyBy(hour) between operators (operator-1 … operator-n) and writers (writer-1 … writer-n) seems natural, but there are two problems:
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won’t be balanced [1]
We need smarter shuffling.
[1] https://github.com/apache/iceberg/pull/4228
Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
Iceberg supports file pruning leveraging min-max stats
at column level
|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
event_type = 'C'
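Min-max pruning boils down to a range-overlap check. Here is a minimal sketch (class and method names are illustrative, not the Iceberg API):

```java
// Sketch of min-max file pruning on an equality predicate.
public class MinMaxPruning {
    // A file can match event_type = value only if min <= value <= max.
    static boolean mayContain(String min, String max, String value) {
        return min.compareTo(value) <= 0 && max.compareTo(value) >= 0;
    }

    public static void main(String[] args) {
        // file-000 (A-B), file-001 (C-C), file-002 (D-F); query event_type = 'C'
        System.out.println(mayContain("A", "B", "C")); // false -> pruned
        System.out.println(mayContain("C", "C", "C")); // true  -> read
        System.out.println(mayContain("D", "F", "C")); // false -> pruned
    }
}
```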
A wide value range would make pruning ineffective
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
event_type = 'C'
Making event_type a partition column can lead to an explosion of the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
Batch engines solve the clustering problem via shuffle:
1. Compute a data sketch of the value distribution (e.g. event_type A: 2%, B: 7%, C: 22%, …, Z: 0.5%)
2. Shuffle between stages to cluster data, so rows with the same event_type land on the same task
3. Sort data before writing to files, yielding tight min-max value ranges per file (e.g. A-B, C-C, X-Z)
Shuffle for better data clustering
Why not compact small files or sort files via
background batch maintenance jobs
• Remediation is usually more expensive than prevention

• Doesn’t solve the throttling problem in the streaming path
Agenda
• Motivation
• Design
• Evaluation
Introduce a smart shuffling operator in the Flink Iceberg sink: shuffle tasks (shuffle-1, shuffle-2, … shuffle-n) sit in front of the writers (writer-1 … writer-n), followed by the committer.
Step 1: calculate the traffic distribution
The shuffle tasks (shuffle-1 … shuffle-10) learn a weight table over event hours:
Hour  Weight
0     33%
1     14%
2     5%
…     …
240   0.001%
Step 2a: shuffle data based on the traffic distribution
Hot hours are split across multiple writer tasks, while many long-tail hours share one task:
Hour  Weight   Assigned tasks
0     33%      1, 2, 3, 4
1     14%      4, 5
2     5%       6
…     …        …
238            10
239            10
240   0.001%   10
Step 2b: range-shuffle data for a non-partition column
Event-type weights (A: 2%, B: 7%, C: 28%, …, Z: 0.5%) are turned into range assignments:
Event type range  Assigned tasks
A-B               1
C-C               2, 3, 4
…                 …
P-Z               10
Range shuffling improves data clustering: without it, each writer produces unsorted data files mixing many event types; with it, each writer receives a tight value range.
Sorting within a file brings the additional benefit of row-group and page-level skipping. When values in a Parquet file are sorted across row groups (e.g. only one of three row groups contains any Y values), a query such as

SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'

only reads the row groups whose min-max range contains 'Y'.
What if sorting is needed
• Sorting in streaming is possible but expensive

• Use batch sorting jobs
How to calculate the traffic distribution
The FLIP-27 source interface introduced the operator coordinator component.
In the FLIP-27 model, the Source Coordinator runs in the JobManager while Source Readers (reader-1 … reader-k) run in the TaskManagers (TaskManager-1 … TaskManager-n).
Shuffle tasks calculate local stats and send them to the shuffle coordinator
Following the same pattern, the shuffle coordinator runs in the JobManager while the shuffle tasks (shuffle-1 … shuffle-n) run in the TaskManagers. Each shuffle task builds a local count table, e.g.:
Hour  Count
0     33
1     14
2     5
…     …
240   0 (or 1, depending on the task’s local view)
Shuffle coordinator does global aggregation
The coordinator merges the local counts into a global weight table (hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%). Global aggregation addresses the potential problem of different local views.
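The coordinator-side merge and normalization can be sketched as follows (an illustrative sketch, not the actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Merge per-task hour counts, then normalize counts into weights.
public class StatsAggregation {
    static Map<Integer, Long> merge(List<Map<Integer, Long>> localStats) {
        Map<Integer, Long> global = new TreeMap<>();
        for (Map<Integer, Long> local : localStats) {
            local.forEach((hour, count) -> global.merge(hour, count, Long::sum));
        }
        return global;
    }

    static Map<Integer, Double> toWeights(Map<Integer, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Double> weights = new TreeMap<>();
        counts.forEach((hour, count) -> weights.put(hour, (double) count / total));
        return weights;
    }

    public static void main(String[] args) {
        // Two shuffle tasks report their local views of hour 0.
        Map<Integer, Long> merged = merge(List.of(
            Map.of(0, 33L, 1, 14L), Map.of(0, 33L, 1, 14L, 240, 1L)));
        System.out.println(merged.get(0)); // 66
        System.out.println(toWeights(merged).get(240)); // tiny long-tail weight
    }
}
```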
Shuffle coordinator broadcasts the globally aggregated stats to tasks
Every shuffle task receives the same weight table (hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%), so all shuffle tasks make the same decision based on the same stats.
How to shuffle data
Add a custom partitioner after the shuffle operator
dataStream
  .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
  .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
  @Override
  public int partition(K key, int numPartitions);
}
There are two shuffling strategies
• Bin packing

• Range distribution
Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks
Task  Assigned keys
T0    K0, K2, K4, K6, K8
T1    K7
T2    K3
T3    K3
T4    K3
T5    K3
…     …
T9    K1, K5
• Focuses only on balanced weight distribution
• Ignores ordering when assigning keys
• Works well with shuffling by partition columns
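A greedy bin-packing assignment can be sketched as below (illustrative; the real operator also splits a single large key across several tasks, which this sketch omits):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Greedy bin packing: heaviest keys first, each placed on the lightest task.
public class GreedyBinPacking {
    static Map<String, Integer> assign(Map<String, Double> keyWeights, int numTasks) {
        double[] load = new double[numTasks];
        Map<String, Integer> assignment = new LinkedHashMap<>();
        List<Map.Entry<String, Double>> keys = new ArrayList<>(keyWeights.entrySet());
        keys.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));
        for (Map.Entry<String, Double> e : keys) {
            int lightest = 0;
            for (int t = 1; t < numTasks; t++) {
                if (load[t] < load[lightest]) lightest = t;
            }
            assignment.put(e.getKey(), lightest);
            load[lightest] += e.getValue();
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = Map.of("K0", 0.05, "K1", 0.6, "K2", 0.2, "K3", 0.15);
        // The heaviest key gets its own task; the small keys are combined.
        System.out.println(assign(weights, 2)); // {K1=0, K2=1, K3=1, K0=1}
    }
}
```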
Range shuffling splits the sort values into contiguous ranges and assigns them to tasks
• Balances the weight distribution with continuous ranges
• Works well with shuffling by non-partition columns
Value ranges (A, B, C, …, D) are mapped to tasks (T1, T2, T3, T4) so that contiguous values land on the same task.
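Cutting contiguous ranges from cumulative weights can be sketched as follows (an illustrative sketch, assuming boundaries are cut at equal cumulative weight; not the production partitioner):

```java
import java.util.Map;
import java.util.TreeMap;

// Walk the sorted key weights and assign tasks so each task gets
// roughly total/numTasks of the traffic over a contiguous key range.
public class RangeAssignment {
    static Map<String, Integer> assign(TreeMap<String, Double> sortedWeights, int numTasks) {
        double total = sortedWeights.values().stream().mapToDouble(Double::doubleValue).sum();
        double perTask = total / numTasks;
        Map<String, Integer> assignment = new TreeMap<>();
        double cumulative = 0;
        for (Map.Entry<String, Double> e : sortedWeights.entrySet()) {
            // Task index grows with cumulative weight, so contiguous
            // key ranges land on the same task.
            int task = Math.min((int) (cumulative / perTask), numTasks - 1);
            assignment.put(e.getKey(), task);
            cumulative += e.getValue();
        }
        return assignment;
    }

    public static void main(String[] args) {
        TreeMap<String, Double> w = new TreeMap<>(
            Map.of("A", 0.02, "B", 0.07, "C", 0.28, "D", 0.5, "E", 0.13));
        System.out.println(assign(w, 3)); // {A=0, B=0, C=0, D=1, E=2}
    }
}
```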
Optimizing for a balanced distribution in byte rate can lead to file count skew, where a task handles many long-tail hours (https://en.wikipedia.org/wiki/Long_tail). Many long-tail hours can be assigned to a single task, which can become a bottleneck.
There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via a close-file-cost (like open-file-cost)
Tune the close-file-cost to balance between file count skew and byte rate skew: increasing the close-file-cost reduces file count skew but increases byte rate skew.
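One way to model a close-file-cost is to pad each key's byte weight with a fixed cost before bin packing, so a task stacked with many tiny long-tail keys still appears loaded. This additive model is an assumption for illustration; the actual cost model may differ:

```java
// Illustrative close-file-cost model: each assigned key costs at least
// the overhead of closing one more file, regardless of its byte weight.
public class CloseFileCost {
    static double adjustedWeight(double byteWeight, double closeFileCost) {
        return byteWeight + closeFileCost;
    }

    public static void main(String[] args) {
        double cost = 0.01;
        // A hot hour barely changes; a near-zero long-tail hour now
        // occupies real capacity, so fewer of them pile onto one task.
        System.out.println(adjustedWeight(0.33, cost));
        System.out.println(adjustedWeight(0.00001, cost));
    }
}
```

Raising the cost pushes long-tail keys apart (less file count skew) at the expense of byte-rate balance, which matches the tradeoff above.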
Agenda
• Motivation
• Design
• Evaluation
A: Simple Iceberg ingestion job without shuffling: sources (source-1 … source-n) chained directly to writers (writer-1 … writer-n), then the committer
• Job parallelism is 60
• Checkpoint interval is 10 min
B: Iceberg ingestion with smart shuffling: sources chained to shuffle tasks (shuffle-1 … shuffle-n), which shuffle records to the writers (writer-1 … writer-n), then the committer
• Job parallelism is 60
• Checkpoint interval is 10 min
Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew

Shuffling reduced the number of files by 20x
• Without shuffling, one cycle flushed 10K files
• With shuffling, one cycle flushed 500 files, about 2.5x the minimal number of files (job parallelism 60, event time range 192 hours)
Shuffling greatly improved the file size distribution
Percentile  Without shuffling  With shuffling  Improvement
P50         55 KB              913 KB          17x
P75         77 KB              7 MB            90x
P95         13 MB              301 MB          23x
P99         18 MB              306 MB          17x
Shuffling tamed the small files problem
During checkpoint, writer tasks (writer-1 … writer-n) flush and upload data files to DFS before the committer commits.
Reduced checkpoint duration by 8x
• Without shuffling, the checkpoint takes 64s on average
• With shuffling, the checkpoint takes 8s on average
Record handover between chained operators is a simple method call: the Kafka source (source-1 … source-n) is chained to the Iceberg sink writers (writer-1 … writer-n), followed by the committer.
Shuffling involves significant CPU overhead for serialization/deserialization and network I/O: records flow from the Kafka source through the shuffle stage to the Iceberg sink writers.
Shuffling increased CPU usage by 62%. It is all about tradeoffs!
• Without shuffling, average CPU utilization is 35%
• With shuffling, average CPU utilization is 57%
Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger: the throughput graph shows a trough caused by the pause followed by a catch-up spike, both of which are much smaller with shuffling.
Bin packing shuffling won’t be perfect in weight distribution: one shuffle task may process data for partitions a, b, c while another processes data only for partitions y, z.
                          Min writer record rate  Max writer record rate  Skewness (max-min)/min
No shuffling              4.36 K                  4.44 K                  1.8%
Bin packing (greedy algo) 4.02 K                  6.39 K                  59%
Our greedy algo implementation of bin packing
introduces higher skew than we hoped for
Future work
• Implement other algorithms
• Better bin packing with less skew
• Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
References
• Design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
Q&A
What about new hours as time moves forward?
A weight table keyed by absolute hour keeps shifting:
Absolute hour  Weight
2022-08-03-00  0.4
…              …
2022-08-03-12  22
2022-08-03-13  27
2022-08-03-14  38
2022-08-03-15  ??
A weight table based on relative hours would be stable:
Relative hour  Weight
0              38
1              27
2              22
…              …
14             0.4
…              …
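Mapping an event timestamp to a relative hour is a simple computation; a sketch (illustrative, assuming epoch-millisecond timestamps):

```java
// Key the weight table by hours-ago relative to the current processing
// hour, so the table stays stable as wall-clock time advances.
public class RelativeHour {
    static long relativeHour(long eventEpochMillis, long nowEpochMillis) {
        long hour = 60L * 60 * 1000;
        return (nowEpochMillis / hour) - (eventEpochMillis / hour);
    }

    public static void main(String[] args) {
        long now = 1_700_000_000_000L; // some "current" instant
        System.out.println(relativeHour(now, now));                  // 0 -> current hour
        System.out.println(relativeHour(now - 2 * 3_600_000L, now)); // 2 -> two hours ago
    }
}
```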
What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
Coping with cold start problems:
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator

The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 

Recently uploaded

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 

Recently uploaded (20)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 

Tame the small files problem and optimize data layout for streaming ingestion to Iceberg

  • 1. Tame the Small Files Problem and Optimize Data Layout for Streaming Ingestion to Iceberg Steven Wu, Gang Ye, Haizhou Zhao | Apple THIS IS NOT A CONTRIBUTION
  • 2. Apache Iceberg is an open table format for huge analytic data • Time travel • Advanced filtering • Serializable isolation
  • 3. Where does Iceberg fit in the ecosystem Table Format (Metadata) Compute Engine Storage (Data) Cloud Blob Storage
  • 4. Ingest data to Iceberg data lake in streaming fashion Flink Streaming Ingestion Iceberg Data Lake Kafka Msg Queue
  • 5. Zoom into the Flink Iceberg sink Iceberg Data Lake writer-1 writer-2 writer-n … Records DFS Data Files committer File Metadata
  • 6. Case 1: event-time partitioned tables hour=2022-08-03-00/ hour=2022-08-03-01/ …
  • 7. Long tail problem with late arrival data https://en.wikipedia.org/wiki/Long_tail Hour Percentage of data 0 1 2 N
  • 8. A data file can’t contain rows across partitions hour=2022-08-03-00/ |- file-000.parquet |- file-001.parquet |- … hour=2022-08-03-01/ |- … …
  • 9. How many data files are generated every hour? writer-1 writer-2 writer-500 … committer 720K files every hour (with 10 minute checkpoint interval) Records for 24x10 partitions Open 240 files Commit 120K files (240x500) every checkpoint Assuming table is partitioned hourly and event time range is capped at 10 days
  • 10. Long-tail hours lead to small files Percentile File Size P50 55 KB P75 77 KB P90 13 MB P99 18 MB
  • 11. What are the implications of too many small files • Poor read performance • Request throttling • Memory pressure • Longer checkpoint duration and pipeline pause • Stress the metadata system
  • 12. Why not keyBy shuffle writer-1 writer-2 writer-n … committer operator-1 operator-2 operator-n keyBy(hour) Iceberg
  • 13. There are two problems • Traffic is not evenly distributed across event hours • keyBy on a low-cardinality column won't be balanced [1] [1] https://github.com/apache/iceberg/pull/4228
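The first problem on this slide can be reproduced outside Flink. In a keyBy-style shuffle each key (event hour) lands entirely on one task, so with long-tail traffic the task owning the hot hour is a hotspot even when the keys themselves are spread perfectly. A minimal standalone sketch (the weights are illustrative, not measured; this is not Flink's actual key-group hashing):

```java
import java.util.Arrays;

public class KeyBySkewDemo {
    // Each key goes entirely to one task (keyBy semantics). Even with an
    // ideal round-robin spread of *keys*, the *load* is skewed when one
    // key dominates the traffic.
    public static double maxTaskLoad(double[] keyWeights, int numTasks) {
        double[] load = new double[numTasks];
        for (int k = 0; k < keyWeights.length; k++) {
            load[k % numTasks] += keyWeights[k]; // perfectly even key spread
        }
        return Arrays.stream(load).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // 6 hour keys over 3 tasks; hour 0 alone carries 50% of traffic
        double[] w = {0.5, 0.2, 0.1, 0.1, 0.05, 0.05};
        System.out.println(maxTaskLoad(w, 3)); // hottest task ~0.6 vs ideal ~0.33
    }
}
```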
  • 15. Case 2: data clustering for non-partition columns CREATE TABLE db.tbl ( ts timestamp, data string, event_type string) USING iceberg PARTITIONED BY (hours(ts))
  • 16. Queries often filter on event_type SELECT count(1) FROM db.tbl WHERE ts >= '2022-01-01 08:00:00' AND ts < '2022-01-01 09:00:00' AND event_type = 'C'
  • 17. Iceberg supports file pruning leveraging min-max stats at column level |- file-000.parquet (event_type: A-B) |- file-001.parquet (event_type: C-C) |- file-002.parquet (event_type: D-F) … event_type = 'C'
  • 18. Wide value range would make pruning ineffective Wide value range |- file-000.parquet (event_type: A-Z) |- file-001.parquet (event_type: A-Z) |- file-002.parquet (event_type: A-Z) … event_type = 'C'
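The pruning decision behind slides 17 and 18 is a simple range check: a file can be skipped for an equality predicate only when the value falls outside the file's [min, max] column stats. A hypothetical helper, not Iceberg's actual metrics evaluator:

```java
public class MinMaxPruning {
    // Skip a data file for predicate `col = value` iff value lies outside
    // the file's [min, max] stats for that column. Illustrative sketch.
    public static boolean canSkip(String min, String max, String value) {
        return value.compareTo(min) < 0 || value.compareTo(max) > 0;
    }

    public static void main(String[] args) {
        System.out.println(canSkip("A", "B", "C")); // tight range: file skipped
        System.out.println(canSkip("A", "Z", "C")); // wide range: must be read
    }
}
```

A file whose values span A-Z can never be skipped, which is why clustering the data into tight ranges matters.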
  • 19. Making event_type a partition column can lead to explosion of number of partitions • Before: 8.8K partitions (365 days x 24 hours) [1] • After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2] • Can stress metadata system and lead to small files [1] Assuming 12 months retention [2] Assuming 500 event types
  • 20. Batch engines solve the clustering problem via shuffle 2. Shuffle to cluster data Stage Stage … 1. Compute data sketch Event Type Weight A 2% B 7% C 22% … Z 0.5% … A B A C C C Z Y X A B A C C C Z Y X 3. Sort data before writing to files A A B C C C X Y Z A-B min-max C-C X-Z Tight value range
  • 21. Shuffle for better data clustering
  • 22. Why not compact small files or sort files via background batch maintenance jobs • Remediation is usually more expensive than prevention • Doesn’t solve the throttling problem in the streaming path
  • 24. Introduce a smart shuffling operator in Flink Iceberg sink Iceberg writer-1 writer-2 writer-n … committer shuffle-1 shuffle-2 shuffle-n Smart shuffling
  • 25. Step 1: calculate traffic distribution writer-1 writer-2 writer-n … shuffle-1 shuffle-2 shuffle-10 Hour Weight 0 33% 1 14% 2 5% … … 240 0.001%
  • 26. Step 2a: shuffle data based on traffic distribution Hour Assigned tasks 0 1, 2, 3, 4 1 4, 5 2 6 … … 238 10 239 10 240 10 writer-1 writer-2 writer-n … Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% shuffle-1 shuffle-2 shuffle-n
  • 27. Step 2b: range shuffle data for non-partition column Event type Weight A 2% B 7% C 28% … … Z 0.5% Event type Assigned task A-B 1 C-C 2, 3, 4 … … P-Z 10 writer-1 writer-2 writer-n … shuffle-1 shuffle-2 shuffle-10
  • 28. Range shuffling improves data clustering A B A C C C Z Y X Z X A A C Y C C B Unsorted data files writer-1 writer-2 writer-n … shuffle-1 shuffle-2 shuffle-n Tight value range
  • 29. Sorting within a file brings additional benefits of row group and page level skipping Parquet file X X X X X Y Y Z Z Z Z Z Row group 1 Row group 2 Row group 3 SELECT * FROM db.tbl WHERE ts >= … AND ts < … AND event_type = 'Y'
  • 30. What if sorting is needed • Sorting in streaming is possible but expensive • Use batch sorting jobs
  • 31. How to calculate traffic distribution
  • 32. FLIP-27 source interface introduced operator coordinator component JobManager TaskManager-1 TaskManager-n … Source Reader-1 Source Reader-k … Source Coordinator
  • 33. writer-2 writer-n … shuffle-1 shuffle-2 shuffle-n Smart shuffling Hour Count 0 33 1 14 2 5 … … 240 0 Hour Count 0 33 1 14 2 5 … … 240 0 Hour Count 0 33 1 14 2 5 … … 240 1 Shuffle tasks calculate local stats and send them to coordinator writer-1 JobManager shuffle coordinator
  • 34. writer-1 writer-2 writer-n … shuffle-1 shuffle-2 shuffle-n Smart shuffling Hour Count 0 33 1 14 2 5 … … 240 0 Hour Count 0 33 1 14 2 5 … … 240 0 Hour Count 0 33 1 14 2 5 … … 240 1 Shuffle coordinator does global aggregation Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% Global aggregation addresses the potential problem of different local views shuffle coordinator JobManager
  • 35. writer-1 writer-2 writer-n … shuffle-1 shuffle-2 shuffle-n Smart shuffling Shuffle coordinator broadcasts the globally aggregated stats to tasks Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% Shuffle Coordinator Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% Hour Weight 0 33% 1 14% 2 5% … … 240 0.001% JobManager All shuffle tasks make the same decision based on the same stats
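The coordinator's global aggregation in slides 33-35 boils down to summing the per-key counts reported by every shuffle task and normalizing into weights. A sketch of just that step (transport, checkpointing, and the key types are simplified; hour keys are plain ints here):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsAggregation {
    // Merge each task's local {key -> count} map into global {key -> weight}.
    public static Map<Integer, Double> aggregate(List<Map<Integer, Long>> localStats) {
        Map<Integer, Long> totals = new HashMap<>();
        long grandTotal = 0;
        for (Map<Integer, Long> local : localStats) {
            for (Map.Entry<Integer, Long> e : local.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
                grandTotal += e.getValue();
            }
        }
        Map<Integer, Double> weights = new HashMap<>();
        for (Map.Entry<Integer, Long> e : totals.entrySet()) {
            weights.put(e.getKey(), (double) e.getValue() / grandTotal);
        }
        return weights;
    }
}
```

Because every task receives the same broadcast weights, all tasks derive the same assignment table, which is the point of aggregating globally rather than acting on local views.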
  • 37. Add a custom partitioner after the shuffle operator dataStream .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory) .partitionCustom(binPackingPartitioner, keySelector) public class BinPackingPartitioner<K> implements Partitioner<K> { @Override public int partition(K key, int numPartitions) { … } }
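The BinPackingPartitioner on this slide presumably routes each key via an assignment table built from the aggregated stats. A standalone sketch of that lookup (the Flink Partitioner interface is replaced by a plain class so it runs without dependencies; the table contents and fallback hashing are assumptions):

```java
import java.util.List;
import java.util.Map;

public class BinPackingPartitionerSketch {
    private final Map<Integer, List<Integer>> keyToTasks; // built from global stats
    private int roundRobin = 0;

    public BinPackingPartitionerSketch(Map<Integer, List<Integer>> keyToTasks) {
        this.keyToTasks = keyToTasks;
    }

    // Same shape as Partitioner#partition(key, numPartitions): a small key
    // maps to its single assigned task; a large key split across several
    // tasks is round-robined among them; unseen keys fall back to hashing.
    public int partition(int key, int numPartitions) {
        List<Integer> tasks = keyToTasks.get(key);
        if (tasks == null) {
            return Math.floorMod(Integer.hashCode(key), numPartitions);
        }
        return tasks.get(roundRobin++ % tasks.size());
    }
}
```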
  • 38. There are two shuffling strategies • Bin packing • Range distribution
  • 39. Bin packing can combine multiple small keys to a single task or split a single large key to multiple tasks Task Assigned keys T0 K0, K2, K4, K6, K8 T1 K7 T2 K3 T3 K3 T4 K3 T5 K3 … … T9 K1,K5 • Only focus on balanced weight distribution • Ignore ordering when assigning keys • Work well with shuffling by partition columns
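The "combine small keys" half of bin packing can be sketched with the classic greedy heuristic: walk keys from heaviest to lightest and put each on the currently least-loaded task. This is only an illustration of the idea under that assumption; it does not split a single large key across tasks, and the deck's actual greedy algorithm may differ:

```java
import java.util.Arrays;

public class GreedyBinPacking {
    // Returns taskOf[key] for each key, assigning heaviest-first to the
    // least-loaded task. Ignores key ordering, like the slide says.
    public static int[] assign(double[] keyWeights, int numTasks) {
        Integer[] order = new Integer[keyWeights.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(keyWeights[b], keyWeights[a]));

        double[] load = new double[numTasks];
        int[] taskOf = new int[keyWeights.length];
        for (int key : order) {
            int best = 0;
            for (int t = 1; t < numTasks; t++) {
                if (load[t] < load[best]) best = t;
            }
            taskOf[key] = best;
            load[best] += keyWeights[key];
        }
        return taskOf;
    }
}
```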
  • 40. Range shuffling splits sort values into ranges and assigns them to tasks • Balance weight distribution with continuous ranges • Work well with shuffling by non-partition columns Value Assigned task A T1 B C … D T2 T3 T4
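One simple way to cut a sorted value domain into contiguous, roughly equal-weight ranges is by cumulative weight: value i goes to task floor(cumulativeWeight * numTasks). A sketch under that assumption (not necessarily the boundary scheme used in the talk):

```java
public class RangeAssignment {
    // sortedValueWeights[i] is the traffic share of the i-th value in sort
    // order; the returned array maps each value to a task, and assignments
    // are guaranteed contiguous because cumulative weight only grows.
    public static int[] taskForValue(double[] sortedValueWeights, int numTasks) {
        int[] task = new int[sortedValueWeights.length];
        double cumulative = 0;
        for (int i = 0; i < sortedValueWeights.length; i++) {
            task[i] = Math.min((int) (cumulative * numTasks), numTasks - 1);
            cumulative += sortedValueWeights[i];
        }
        return task;
    }
}
```

Contiguity is what keeps each writer's min-max range tight, unlike bin packing, which balances weight but may hand a task values from all over the domain.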
  • 41. Optimizing balanced distribution in byte rate can lead to file count skew where a task handles many long-tail hours https://en.wikipedia.org/wiki/Long_tail Many long-tail hours can be assigned to a single task, which can become a bottleneck
  • 42. There are two solutions • Parallelize file flushing and upload • Limit the file count skew via close-file-cost (like open-file-cost)
  • 43. Tune close-file-cost to balance between file count skew and byte rate skew Skewness Close-file-cost Byte rate skew File count skew
  • 45. A: Simple Iceberg ingestion job without shuffling source-1 source-2 source-n writer-1 writer-2 writer-n committer Chained … • Job parallelism is 60 • Checkpoint interval is 10 min
  • 46. B: Iceberg ingestion with smart shuffling source-1 source-2 source-n writer-1 writer-2 writer-n committer • Job parallelism is 60 • Checkpoint interval is 10 min shuffle-1 shuffle-2 shuffle-n Chained Shuffle
  • 47. Test setup • Sink Iceberg table is partitioned hourly by event time • Benchmark traffic volume is 250 MB/sec • Event time range is 192 hours
  • 48. What are we comparing • Number of files written in one cycle • File size distribution • Checkpoint duration • CPU utilization • Shuffling skew
  • 49. • Job parallelism is 60 • Event time range is 192 hours Shuffle reduced the number of files by 20x Without shuffling one cycle flushed 10K files With shuffling one cycle flushed 500 files ~2.5x the minimal number of files
  • 50. Shuffling greatly improved file size distribution Percentile Without shuffling With shuffling Improvement P50 55 KB 913 KB 17x P75 77 KB 7 MB 90x P95 13 MB 301 MB 23x P99 18 MB 306 MB 17x
  • 51. Shuffling tamed the small files problem
  • 52. During checkpoint, writer tasks flush and upload data files writer-1 writer-2 writer-n … committer DFS Data Files
  • 53. Reduced checkpoint duration by 8x Without shuffling, checkpoint takes 64s on average With shuffling, checkpoint takes 8s on average
  • 54. Record handover between chained operators is a simple method call source-1 source-2 source-n writer-1 writer-2 writer-n committer Chained 1. Kafka Source 2. Iceberg Sink …
  • 55. Shuffling involves significant CPU overhead on serdes and network I/O 2. Shuffle 1. Kafka Source 3. Iceberg Sink source-1 source-2 source-n writer-1 writer-2 writer-n committer shuffle-1 shuffle-2 shuffle-n Shuffle Chained
  • 56. Shuffling increased CPU usage by 62% All about tradeoff! With shuffling avg CPU util is 57% Without shuffling avg CPU util is 35%
  • 57. Without shuffling, checkpoint pause is longer and catch-up spike is bigger With shuffling Without shuffling Catch-up spike Trough caused by pause
  • 58. Bin packing shuffling won't be perfect in weight distribution source-1 source-2 source-n writer-1 writer-2 writer-n committer shuffle-1 shuffle-2 shuffle-n Shuffle Chained processes data for partitions a, b, c processes data for partitions y, z
  • 59. Min of writer record rate Max of writer record rate Skewness (max-min)/min No shuffling 4.36 K 4.44 K 1.8% Bin packing (greedy algo) 4.02 K 6.39 K 59% Our greedy algo implementation of bin packing introduces higher skew than we hoped for
  • 60. Future work • Implement other algorithms • Better bin packing with less skew • Range partitioner • Support sketch statistics for high-cardinality keys • Contribute it to OSS
  • 61. References • Design doc: https://docs.google.com/document/d/13N8cMqPi- ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
  • 62. Q&A
  • 63.
  • 64. Weight table should be relatively stable
  • 65. What about new hour as time moves forward? Absolute hour Weight 2022-08-03-00 0.4 … … 2022-08-03-12 22 2022-08-03-13 27 2022-08-03-14 38 2022-08-03-15 ??
  • 66. Weight table based on relative hour would be stable Relative hour Weight 0 38 1 27 2 22 … … 14 0.4 … …
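Converting the stats from absolute hours to relative hours is a re-keying step: replace each absolute event hour with its distance behind the current hour so the learned weight table stays stable as time advances. A sketch (the epoch-hour encoding and map shapes are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RelativeHourStats {
    // Re-key {absolute epoch hour -> weight} into {hours behind now -> weight}.
    public static Map<Integer, Double> toRelative(Map<Long, Double> byAbsoluteHour,
                                                  long currentEpochHour) {
        Map<Integer, Double> relative = new HashMap<>();
        for (Map.Entry<Long, Double> e : byAbsoluteHour.entrySet()) {
            relative.merge((int) (currentEpochHour - e.getKey()), e.getValue(), Double::sum);
        }
        return relative;
    }
}
```

As the clock moves forward, a new absolute hour simply becomes relative hour 0, so no new entry with unknown weight appears.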
  • 67. What about cold start problem? • First-time run • Restart with empty state • New subtasks from scale-up
  • 68. Cope with cold start problems • No shuffle while learning • Buffer records until the first stats are learned • New subtasks (scale-up) request stats from the coordinator
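The buffer-until-learned policy above can be sketched as a tiny state machine: hold records in a queue until the first global stats arrive, then flush the queue and forward subsequent records directly. A minimal sketch; "String" stands in for the real record type and the method names are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

public class ColdStartBuffer {
    private final Queue<String> buffer = new ArrayDeque<>();
    private final Consumer<String> downstream;
    private boolean statsReady = false;

    public ColdStartBuffer(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    // Before the first stats: buffer. After: forward directly.
    public void onRecord(String record) {
        if (statsReady) downstream.accept(record);
        else buffer.add(record);
    }

    // First global stats received: flush the buffer in arrival order.
    public void onFirstStats() {
        statsReady = true;
        while (!buffer.isEmpty()) downstream.accept(buffer.poll());
    }
}
```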