Tuning Flink Clusters for Stability and Efficiency
2. Flink Forward 2023 ©
Starting with the end in mind…
By the end of this talk, you’ll know how we tuned our Flink clusters to reduce per-job costs by 50-90% (~75% typical), and how we were able to absorb ~40% additional workload at no extra cost for Pinterest.
25%: the cost of a job after we were done with our work.
Hi! I’m Divye Kapoor, the TL for the Stream Processing Platform at Pinterest, and I’m here to present the work our teams at Pinterest have done over the past two years on stability and efficiency.
3. Flink Forward 2023 ©
Credits
Teja T. - SRE & EM
CGroups, Cluster HW, Rollouts &
optimizations at several levels of
the stack.
Thank you
6. Flink Forward 2023 ©
Let’s just say that a fair bit of money was printed for Pinterest…
7. Flink Forward 2023 ©
Our clusters run on YARN (today); everything that follows is in that context.
8. Flink Forward 2023 ©
So what’s challenging about running and tuning a multi-tenant Flink cluster?
● Job sizes: jobs with 2000+ cores vs. jobs with < 10 cores.
● Job tiering: small jobs that can’t fail and other jobs that can.
● Multi-tenant efficiency: resource use that isn’t wasteful.
● Multi-tenant priority: in an incident, keep the right jobs working.
● Noisy neighbors.
● Data skew.
9. Flink Forward 2023 ©
CGroups
● CGroups were our must-have for everything that follows. Teja led the charge.
1. We upgraded YARN and then configured it to support soft CGroup limits. (The limits only kick in if the host is running out of capacity.)
2. We verified that if a host is at capacity, the resources are fairly shared.
3. We started running the cluster hotter (no CPU starvation!).
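As a rough sketch, "soft CGroup limits" in YARN corresponds to settings like the following in yarn-site.xml (the property names are Hadoop's; the setup is illustrative, not Pinterest's exact production config):

```xml
<!-- Illustrative soft-CGroup setup for the YARN NodeManager -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- false = soft limits: containers may burst above their vcore share;
       CPU shares only enforce fair division when the host is contended -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>false</value>
</property>
```

With strict-resource-usage left at false, CGroup CPU shares act as weights rather than caps, which is what lets a contended host share fairly while an uncontended one runs jobs unthrottled.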
10. Flink Forward 2023 ©
CGroups
● Hard limits don’t work well for Flink jobs.
● Most Flink jobs want to burst on CPU during deploys, and this setup allows the catch-up to take place without throttling.
● Hard limits can trigger OOMs, backpressure, and other stability issues. Generally, it’s not clear whether the job will come back after a restart.
Lesson 1: Always configure your YARN or K8s cluster to avoid hard limits / throttles.
11. Flink Forward 2023 ©
Container Placement: Stability & Cost Opt.
● No hot nodes, please!
● Container placement is critical to keeping a stable cluster running.
● We want all applications to be well behaved and to work well with our job schedulers.
● Bad container scheduling = a host running out of capacity at peak.
12. Flink Forward 2023 ©
Container Placement: Option 1
CPU Aware: Schedule on hosts where CPU utilization is < 50th percentile
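One way to read the "CPU aware" rule: only consider hosts whose recent CPU utilization is below the fleet's 50th percentile. A minimal sketch of that filter (the helper name and scheduler integration are our own illustration, not code from the talk):

```python
import statistics

def eligible_hosts(cpu_util_by_host: dict[str, float]) -> list[str]:
    """Return hosts whose CPU utilization is below the fleet median.

    cpu_util_by_host maps host name -> recent CPU utilization in [0, 1].
    """
    median = statistics.median(cpu_util_by_host.values())
    return [h for h, util in cpu_util_by_host.items() if util < median]

# Containers are only placed on the cooler half of the fleet.
fleet = {"host-a": 0.85, "host-b": 0.40, "host-c": 0.62, "host-d": 0.30}
print(eligible_hosts(fleet))  # only the hosts below median utilization
```

The trade-off with this option is that it needs a live utilization feed per host, whereas Option 2 below is a static configuration.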
13. Flink Forward 2023 ©
Container Placement: Option 2
Config: yarn.nodemanager.resource.cpu-vcores = 75% of cores on host
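Option 2 caps what YARN advertises per node instead of tracking live CPU. On a 96-core host, for example, 75% would look like this in yarn-site.xml (the value is illustrative, not Pinterest's exact setting):

```xml
<property>
  <!-- Advertise only 75% of the host's cores (e.g. 72 of 96) to YARN,
       leaving headroom for bursts and system daemons at peak -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>72</value>
</property>
```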
14. Flink Forward 2023 ©
No traffic-peak stability issues were seen after the container placement strategy was implemented.
Stability is a prerequisite for optimization
15. Flink Forward 2023 ©
Job Optimization
● Source of significant wins: task placements & vertical sharding.
● Required a full round of re-optimization of our job configurations.
● Mass migrations & rollouts: we got good at them.
● 70%+ reduction in cross-host network traffic for jobs.
● Jobs became 50-90%+ cheaper to run.
● Serialization & traffic overhead dropped.
● Magic: removing SSGs (slot sharing groups), aligning parallelism across operators, forcing “ColocationConstraints”, and optimizing Flink 1.11 task placements.
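To see why aligning operator parallelism cuts cross-host traffic: when upstream and downstream operators have equal parallelism, Flink can use a forward (one-to-one) connection and chain or colocate the task pairs, so records never leave the process; with mismatched parallelism, a rebalance spreads most records across subtasks. A back-of-the-envelope model (our own illustration, not code from the talk):

```python
def remote_traffic_fraction(upstream: int, downstream: int) -> float:
    """Estimated fraction of records that cross task boundaries.

    Assumes: equal parallelism -> forward connection with colocated
    task pairs, so no remote traffic. Unequal parallelism -> rebalance,
    where a record stays local only when it happens to target the one
    colocated subtask (~1/downstream of the time).
    """
    if upstream == downstream:
        return 0.0  # forward/chained: traffic stays in-process
    return 1.0 - 1.0 / downstream

# A 128-parallel source feeding a 64-parallel map forces a rebalance;
# aligning both at 128 removes the shuffle entirely.
print(remote_traffic_fraction(128, 64))
print(remote_traffic_fraction(128, 128))
```

Under these assumptions, almost every record in the mismatched case is serialized and shipped, which is consistent with the 70%+ cross-host traffic reduction once parallelisms were aligned.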
16. Flink Forward 2023 ©
Job Optimization
Before: CPU utilization showing skewed load. This is wasteful because the lightly loaded
Task Managers are asking for the same resources as the heavily loaded ones.
17. Flink Forward 2023 ©
Hardware optimization: i3 to i4i
~40% reduction in CPU utilization per job.
18. Flink Forward 2023 ©
Our last wins:
Input data optimization: only read the data the job needs from Kafka. Where appropriate, we split the Kafka topics.
Autotuning: we built an in-house autotuner so that we don’t need to keep re-tuning our jobs for CPU utilization.
These will be covered separately in other talks in the future.
19. Flink Forward 2023 ©
Recap:
● Stage 1: CGroups, soft limits, run clusters hotter.
● Stage 2: Container placement strategy, job re-tuning.
● Stage 3: Job optimization, job re-tuning, hardware upgrades.
● Stage 4: Input data optimization, job autotuning.
20. Flink Forward 2023 ©
Our total wins were fairly large.
The end result is a nice cleanup of the costs on the streaming stack. Job costs on Flink used to be a discussion point; after the optimizations, those concerns have melted away.
75%: job cost reduction through improved placement of Tasks on Task Managers.
40%: job cost reduction through hardware upgrades.
20%: cluster cost reduction through CGroups and the ability to run the clusters hotter.
(Percentages don’t sum to 100 because the baselines are different.)
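Because each reduction applies to the cost remaining after the previous one, the headline numbers compound multiplicatively rather than adding. A quick sanity check of how, say, the 75% placement win and the 40% hardware win combine (our own arithmetic, not figures from the talk):

```python
def combined_reduction(*reductions: float) -> float:
    """Combine cost reductions that each apply to the cost remaining
    after the previous one (multiplicative, not additive)."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

# 75% from task placement, then 40% from hardware upgrades:
# remaining cost = 0.25 * 0.60 = 0.15, i.e. an 85% total reduction.
print(round(combined_reduction(0.75, 0.40), 2))  # 0.85
```

This is why quoting each win against its own baseline, as the slide does, is the honest presentation: adding them naively would exceed 100%.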
21. Flink Forward 2023 ©
Chart: actual spend vs. budgeted spend for the company (in $ terms), broken down by CGroups, job optimizations, hardware upgrade, and data optimization.