Tuning Flink Clusters for Stability and Efficiency
2. Flink Forward 2023 ©
Starting with the end in mind…
By the end of this talk, you’ll know how we tuned our Flink clusters to reduce per-job costs by 50-90% (~75% typical), and how we were able to absorb ~40% additional workload at no extra cost for Pinterest.
25%: the cost of a job after we were done with our work.
Hi! I’m Divye Kapoor, the TL for the Stream Processing Platform at Pinterest, and I’m here to present the work our teams at Pinterest have done over the past two years on stability and efficiency.
3. Flink Forward 2023 ©
Credits
Teja T. - SRE & EM
CGroups, Cluster HW, Rollouts &
optimizations at several levels of
the stack.
Thank you
6. Flink Forward 2023 ©
Let’s just say that a fair bit of money was printed for Pinterest…
7. Flink Forward 2023 ©
Our clusters run on YARN (today); everything that follows is in that context.
8. Flink Forward 2023 ©
So what’s challenging about running and tuning a multi-tenant Flink cluster?
● Job sizes: jobs with 2000+ cores vs. jobs with < 10 cores.
● Job tiering: small jobs that can’t fail and other jobs that can.
● Multi-tenant efficiency: resource use that isn’t wasteful.
● Multi-tenant priority: in an incident, keep the right jobs working.
● Noisy neighbors.
● Data skew.
9. Flink Forward 2023 ©
CGroups
● CGroups were our must-have for everything that follows. Teja led the charge.
1. We upgraded YARN and then configured it to support soft CGroup limits. (The limits only kick in if the host is running out of capacity.)
2. We verified that if a host is at capacity, the resources are fairly shared.
3. We started running the cluster hotter (no CPU starvation!).
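As a rough sketch, "soft CGroup limits" in YARN corresponds to settings like the following in yarn-site.xml (the property names are Hadoop's; the setup is illustrative, not Pinterest's exact production config):

```xml
<!-- Illustrative soft-CGroup setup for the YARN NodeManager -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- false = soft limits: containers may burst above their vcore share;
       CPU shares only enforce fair division when the host is contended -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>false</value>
</property>
```

With strict-resource-usage left at false, CGroup CPU shares act as weights rather than caps, which is what lets a contended host share fairly while an uncontended one runs jobs unthrottled.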
10. Flink Forward 2023 ©
CGroups
● Hard limits don’t work well for Flink jobs.
● Most Flink jobs want to burst on CPU during deploys, and this setup allows the catch-up to take place without throttling.
● Hard limits can trigger OOMs, backpressure, and other stability issues. Generally, it’s not clear whether the job will come back after a restart.
Lesson 1: Always configure your YARN or K8s cluster to avoid hard limits / throttles.
11. Flink Forward 2023 ©
Container Placement: Stability & Cost Opt.
● No hot nodes, please!
● Container placement is critical to keeping a stable cluster running.
● We want all applications to be well behaved and to work well with our job schedulers.
● Bad container scheduling = a host running out of capacity at peak.
12. Flink Forward 2023 ©
Container Placement: Option 1
CPU Aware: Schedule on hosts where CPU utilization is < 50th percentile
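One way to read the "CPU aware" rule: only consider hosts whose recent CPU utilization is below the fleet's 50th percentile. A minimal sketch of that filter (the helper name and scheduler integration are our own illustration, not code from the talk):

```python
import statistics

def eligible_hosts(cpu_util_by_host: dict[str, float]) -> list[str]:
    """Return hosts whose CPU utilization is below the fleet median.

    cpu_util_by_host maps host name -> recent CPU utilization in [0, 1].
    """
    median = statistics.median(cpu_util_by_host.values())
    return [h for h, util in cpu_util_by_host.items() if util < median]

# Containers are only placed on the cooler half of the fleet.
fleet = {"host-a": 0.85, "host-b": 0.40, "host-c": 0.62, "host-d": 0.30}
print(eligible_hosts(fleet))  # only the hosts below median utilization
```

The trade-off with this option is that it needs a live utilization feed per host, whereas Option 2 below is a static configuration.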
13. Flink Forward 2023 ©
Container Placement: Option 2
Config: yarn.nodemanager.resource.cpu-vcores = 75% of cores on host
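Option 2 caps what YARN advertises per node instead of tracking live CPU. On a 96-core host, for example, 75% would look like this in yarn-site.xml (the value is illustrative, not Pinterest's exact setting):

```xml
<property>
  <!-- Advertise only 75% of the host's cores (e.g. 72 of 96) to YARN,
       leaving headroom for bursts and system daemons at peak -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>72</value>
</property>
```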
14. Flink Forward 2023 ©
No traffic-peak stability issues were seen after the container placement strategy was implemented.
Stability is a prerequisite for optimization
15. Flink Forward 2023 ©
Job Optimization
● Source of significant wins: task placements & vertical sharding.
● Required a full round of re-optimization of our job configurations.
● Mass migrations & rollouts: we got good at them.
● 70%+ reduction in cross-host network traffic for jobs.
● Jobs became 50-90%+ cheaper to run.
● Serialization & traffic overhead dropped.
● Magic: removing SSGs (slot sharing groups), aligning parallelism across operators, forcing “ColocationConstraints”, and optimizing Flink 1.11 task placements.
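To see why aligning operator parallelism cuts cross-host traffic: when upstream and downstream operators have equal parallelism, Flink can use a forward (one-to-one) connection and chain or colocate the task pairs, so records never leave the process; with mismatched parallelism, a rebalance spreads most records across subtasks. A back-of-the-envelope model (our own illustration, not code from the talk):

```python
def remote_traffic_fraction(upstream: int, downstream: int) -> float:
    """Estimated fraction of records that cross task boundaries.

    Assumes: equal parallelism -> forward connection with colocated
    task pairs, so no remote traffic. Unequal parallelism -> rebalance,
    where a record stays local only when it happens to target the one
    colocated subtask (~1/downstream of the time).
    """
    if upstream == downstream:
        return 0.0  # forward/chained: traffic stays in-process
    return 1.0 - 1.0 / downstream

# A 128-parallel source feeding a 64-parallel map forces a rebalance;
# aligning both at 128 removes the shuffle entirely.
print(remote_traffic_fraction(128, 64))
print(remote_traffic_fraction(128, 128))
```

Under these assumptions, almost every record in the mismatched case is serialized and shipped, which is consistent with the 70%+ cross-host traffic reduction once parallelisms were aligned.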
16. Flink Forward 2023 ©
Job Optimization
Before: CPU utilization showing skewed load. This is wasteful because the lightly loaded
Task Managers are asking for the same resources as the heavily loaded ones.
17. Flink Forward 2023 ©
Hardware optimization: i3 to i4i
~40% reduction in CPU utilization per job.
18. Flink Forward 2023 ©
Our last wins:
Input data optimization: only read the data the job needs from Kafka. Where appropriate, we split the Kafka topics.
Autotuning: we built an in-house autotuner so that we don’t need to keep re-tuning our jobs for CPU utilization.
These will be covered separately in other talks in the future.
19. Flink Forward 2023 ©
Recap:
● Stage 1: CGroups, soft limits, run clusters hotter.
● Stage 2: Container placement strategy, job re-tuning.
● Stage 3: Job optimization, job re-tuning, hardware upgrades.
● Stage 4: Input data optimization, job autotuning.
20. Flink Forward 2023 ©
Our total wins were fairly large.
The end result is a nice cleanup of the costs on the streaming stack. Job costs on Flink used to be a discussion point; after the optimizations, those concerns have melted away.
75%: job cost reduction through improved placement of Tasks on Task Managers.
40%: job cost reduction through hardware upgrades.
20%: cluster cost reduction through CGroups and the ability to run the clusters hotter.
(Percentages don’t sum to 100 because the baselines are different.)
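Because each reduction applies to the cost remaining after the previous one, the headline numbers compound multiplicatively rather than adding. A quick sanity check of how, say, the 75% placement win and the 40% hardware win combine (our own arithmetic, not figures from the talk):

```python
def combined_reduction(*reductions: float) -> float:
    """Combine cost reductions that each apply to the cost remaining
    after the previous one (multiplicative, not additive)."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

# 75% from task placement, then 40% from hardware upgrades:
# remaining cost = 0.25 * 0.60 = 0.15, i.e. an 85% total reduction.
print(round(combined_reduction(0.75, 0.40), 2))  # 0.85
```

This is why quoting each win against its own baseline, as the slide does, is the honest presentation: adding them naively would exceed 100%.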
21. Flink Forward 2023 ©
Chart: actual spend vs. budgeted spend for the company (in $ terms), broken down by CGroups, job optimizations, hardware upgrade, and data optimization.