Where is my bottleneck?
Performance troubleshooting in Apache Flink
Piotr Nowojski
About me
Open source
● Apache Flink contributor/committer since 2017
● Member of the project management committee (PMC)
● Among core architects of the Flink Runtime
Career
● Co-Founder, Engineer @ Immerok
○ immerok.com
● Before that: Runtime team @ DataArtisans/Ververica (acquired by Alibaba)
● Even before that: working on Presto (now Trino) runtime
Agenda
● Understanding Flink Job basics
● Where to start performance analysis?
● What about the checkpointing or recovery process?
● Tips & Tricks
Understanding the basics
Job on Task Managers
Performance troubleshooting
What are we troubleshooting?
● Processing records
○ Throughput is too low?
○ Resource usage is too high?
● Checkpointing
○ Are checkpoints failing?
○ Too long end-to-end exactly-once latency?
○ Reprocessing too many records after failover?
● Recovery
○ Long downtime?
Processing records
WebUI
Where is my bottleneck? TL;DR
HERE
Parallel subtasks can have different load profiles
Varying load
Where is my bottleneck? TL;DR v2
● Rule of thumb
○ Start from the sources
○ Follow any backpressured subtasks downstream to the first ~100% busy subtask(s)
○ That is your bottleneck (one way to check these metrics is sketched below)
● Keep potential data skew and varying load in mind
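The WebUI colouring is driven by per-subtask metrics, so the same check can be scripted. A minimal sketch, assuming the standard JobManager REST endpoint on port 8081 and the busyTimeMsPerSecond / backPressuredTimeMsPerSecond metrics available since roughly Flink 1.13; the job and vertex IDs are passed in as arguments, and the endpoint path follows the REST API docs of that era:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal REST check of the same numbers the WebUI colours subtasks by.
    public class BackpressureCheck {
        public static void main(String[] args) throws Exception {
            String restAddress = "http://localhost:8081"; // JobManager REST endpoint (assumption)
            String jobId = args[0];                       // job ID, e.g. from GET /jobs
            String vertexId = args[1];                    // vertex (operator) ID from the job graph

            // Aggregated per-subtask metrics endpoint.
            String url = restAddress + "/jobs/" + jobId + "/vertices/" + vertexId
                    + "/subtasks/metrics?get=busyTimeMsPerSecond,backPressuredTimeMsPerSecond";

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // min/avg/max across all parallel subtasks
        }
    }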
I found the bottleneck! Now what?
● What to do next might be obvious
● Check machine and JVM process vitals (a quick JVM sketch follows this list)
○ CPU usage
○ GC pauses
● Might require further investigation:
○ Looking into the code
○ Testing out various changes
○ Profiling
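For the JVM side, a quick way to eyeball GC overhead is the standard management beans. This is a generic JVM sketch, not Flink-specific; in practice Flink's own Status.JVM.* metrics or GC logs are usually more convenient:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints the OS load average plus, per garbage collector, the number of
    // collections and the accumulated collection time since JVM start.
    public class JvmVitals {
        public static void main(String[] args) {
            System.out.println("system load average: "
                    + ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + ": collections=" + gc.getCollectionCount()
                        + ", total pause ms=" + gc.getCollectionTime());
            }
        }
    }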
Not enough?
● You know which subtask(s) are causing problems
● Attach a code profiler to the Task Manager running that subtask
● Beware of other threads
○ Filter/Focus profiler results
○ Threads are named after the subtask that they are running
Flame Graphs!
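Flink 1.13+ can also build these flame graphs itself, sampled per operator and shown in the WebUI, which avoids having to filter an external profiler's output by thread name. A minimal local sketch; in a real deployment you would set rest.flamegraph.enabled: true in flink-conf.yaml instead, and keep it off by default since sampling adds overhead:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlameGraphDemo {
        public static void main(String[] args) throws Exception {
            // Enable Flink's built-in per-operator flame graphs (Flink 1.13+).
            Configuration conf = new Configuration();
            conf.setString("rest.flamegraph.enabled", "true");

            // Local environment with a WebUI (needs flink-runtime-web on the classpath).
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

            // A trivial long-running pipeline, just so there is something to sample.
            env.fromSequence(0, Long.MAX_VALUE)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value * 2;
                    }
                })
                .print();

            env.execute("flame-graph-demo");
        }
    }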
Checkpointing
● Are checkpoints failing?
● Too long end-to-end exactly-once latency?
● Reprocessing too many records after failover?
Checkpoint Barriers
Alignment
Checkpoints taking too long?
Long alignment duration/start delay
● Most likely caused by backpressure
○ Scale up
○ Optimise the Job to increase throughput
○ Buffer debloating (reduces the amount of in-flight data; Flink 1.14+)
○ Unaligned checkpoints (a config sketch follows below)
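A sketch of how both knobs are typically switched on, using Flink 1.14/1.15-era option and API names (check your version's docs; the buffer-debloat key normally goes into flink-conf.yaml rather than code):

    import java.time.Duration;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingUnderBackpressure {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Buffer debloating (Flink 1.14+): shrink in-flight network buffers under
            // backpressure so checkpoint barriers have less queued data to travel through.
            conf.setString("taskmanager.network.memory.buffer-debloat.enabled", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.enableCheckpointing(60_000); // checkpoint every 60 s

            // Unaligned checkpoints: barriers overtake in-flight records, at the cost
            // of persisting those in-flight records as part of the checkpoint.
            env.getCheckpointConfig().enableUnalignedCheckpoints();
            // Optionally stay aligned and only fall back to unaligned once alignment is slow.
            env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));

            // ... build and execute the job as usual ...
        }
    }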
Long sync phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● State backends
○ RocksDB flushing to disks
○ Tuning RocksDB advanced options
● Operator/Function-specific code
○ The CheckpointedFunction#snapshotState call
○ For example: a sink flushing/committing records (sketched below)
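To make the last point concrete, here is a hypothetical buffering sink (class and flush method names are made up for the sketch) where the flush happens inside CheckpointedFunction#snapshotState, i.e. squarely in the sync phase:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Whatever happens in snapshotState() runs in the checkpoint's synchronous
    // phase and blocks record processing for this subtask while it runs.
    public class BufferingSink extends RichSinkFunction<String> implements CheckpointedFunction {

        private final List<String> buffer = new ArrayList<>();

        @Override
        public void invoke(String value, Context context) {
            buffer.add(value); // cheap: only buffers in memory
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // A slow flush/commit here shows up directly as a long "sync duration"
            // in the checkpoint statistics for this subtask.
            flushToExternalSystem(buffer);
            buffer.clear();
        }

        @Override
        public void initializeState(FunctionInitializationContext context) {
            // Restore logic omitted; a real sink would also keep the unflushed
            // buffer in operator state so it survives failover.
        }

        private void flushToExternalSystem(List<String> records) {
            // placeholder for a blocking flush / commit against the external system
        }
    }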
Long async phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● Uploading state backend files
○ FileSystem-specific tuning
■ Make sure to fully utilize your FS (e.g. S3 entropy injection)
○ Checkpointed state might be too large
■ Scale up?
■ Reduce state size?
■ Enable incremental checkpoints?
○ Too many small files
■ Increase state.storage.fs.memory-threshold?
● Experimental: enable the changelog state backend (Flink 1.14+; see the config sketch below)
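A config sketch pulling the above together, with option names as they appear in the Flink 1.14/1.15 docs (normally these live in flink-conf.yaml; the values here are only illustrative):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AsyncPhaseTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Upload only new/changed RocksDB files instead of the full state every time.
            conf.setString("state.backend", "rocksdb");
            conf.setString("state.backend.incremental", "true");

            // Inline tiny state files into the checkpoint metadata instead of creating
            // many small files on the checkpoint storage (the default is in the KB range).
            conf.setString("state.storage.fs.memory-threshold", "100kb");

            // Experimental (Flink 1.14+): changelog state backend, which continuously
            // uploads state changes to keep the async phase short and predictable.
            conf.setString("state.backend.changelog.enabled", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            // ... build and execute the job as usual ...
        }
    }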
Recovery
Long recovery
● Analyse Flink (debug) logs
● Use incremental checkpoints and/or native savepoints
● Similar issues as with checkpointing, but in reverse (state is downloaded and loaded instead of uploaded)
● Potential solutions
○ Enabling local recovery might help (see the config sketch below)
○ Reduce state size
○ Scale up
○ Tune RocksDB advanced options
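A sketch of enabling local recovery, using the option name from the Flink 1.14/1.15 docs (normally set in flink-conf.yaml; it only helps when the failed subtasks are rescheduled onto the same TaskManagers, since the local state copy lives on their disks):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecoveryTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Keep a local copy of the latest checkpointed state on each TaskManager so
            // that, after a failover onto the same TaskManagers, state does not have to
            // be re-downloaded from the remote checkpoint storage.
            conf.setString("state.backend.local-recovery", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            // ... build and execute the job as usual ...
        }
    }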
Closing words
● What is the main problem?
○ Processing records?
■ First locate the bottleneck subtask
○ Checkpointing?
■ Look into the checkpoint statistics
○ Recovery?
■ Look into the Flink logs
Thanks
Piotr Nowojski
@PiotrNowojski
piotr@immerok.com
