Where is my bottleneck?
Performance troubleshooting in Apache Flink
Piotr Nowojski
About me
Open source
● Apache Flink contributor/committer since 2017
● Member of the project management committee (PMC)
● Among core architects of the Flink Runtime
Career
● Co-Founder, Engineer @ Immerok
○ immerok.com
● Before that: Runtime team @ DataArtisans/Ververica (acquired by Alibaba)
● Even before that: working on Presto (now Trino) runtime
Agenda
● Understanding Flink Job basics
● Where to start performance analysis?
● What about the checkpointing or recovery process?
● Tips & Tricks
Understanding the basics
Job on Task Managers
Performance troubleshooting
What are we troubleshooting?
● Processing records
○ Throughput is too low?
○ Resource usage is too high?
● Checkpointing
○ Are checkpoints failing?
○ Too long end-to-end exactly-once latency?
○ Reprocessing too many records after failover?
● Recovery
○ Long downtime?
Processing records
WebUI
Where is my bottleneck? TL;DR
HERE
Parallel subtasks can have different load profiles
Varying load
Where is my bottleneck? TL;DR v2
● Rule of thumb
○ Start from the sources
○ Follow any backpressured subtasks downstream to the first ~100% busy subtask(s)
○ That is your bottleneck (one way to check these metrics is sketched below)
● Keep potential data skew and varying load in mind
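The WebUI colouring is driven by per-subtask metrics, so the same check can be scripted. A minimal sketch, assuming the standard JobManager REST endpoint on port 8081 and the busyTimeMsPerSecond / backPressuredTimeMsPerSecond metrics available since roughly Flink 1.13; the job and vertex IDs are passed in as arguments, and the endpoint path follows the REST API docs of that era:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal REST check of the same numbers the WebUI colours subtasks by.
    public class BackpressureCheck {
        public static void main(String[] args) throws Exception {
            String restAddress = "http://localhost:8081"; // JobManager REST endpoint (assumption)
            String jobId = args[0];                       // job ID, e.g. from GET /jobs
            String vertexId = args[1];                    // vertex (operator) ID from the job graph

            // Aggregated per-subtask metrics endpoint.
            String url = restAddress + "/jobs/" + jobId + "/vertices/" + vertexId
                    + "/subtasks/metrics?get=busyTimeMsPerSecond,backPressuredTimeMsPerSecond";

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // min/avg/max across all parallel subtasks
        }
    }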
I found the bottleneck! Now what?
● What to do next might be obvious
● Check machine and JVM process vitals (a quick JVM sketch follows this list)
○ CPU usage
○ GC pauses
● Might require further investigation:
○ Looking into the code
○ Testing out various changes
○ Profiling
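For the JVM side, a quick way to eyeball GC overhead is the standard management beans. This is a generic JVM sketch, not Flink-specific; in practice Flink's own Status.JVM.* metrics or GC logs are usually more convenient:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints the OS load average plus, per garbage collector, the number of
    // collections and the accumulated collection time since JVM start.
    public class JvmVitals {
        public static void main(String[] args) {
            System.out.println("system load average: "
                    + ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + ": collections=" + gc.getCollectionCount()
                        + ", total pause ms=" + gc.getCollectionTime());
            }
        }
    }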
Not enough?
● You know which subtask(s) are causing problems
● Attach a code profiler to the Task Manager running that subtask
● Beware of other threads
○ Filter/Focus profiler results
○ Threads are named after the subtask that they are running
Flame Graphs!
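Flink 1.13+ can also build these flame graphs itself, sampled per operator and shown in the WebUI, which avoids having to filter an external profiler's output by thread name. A minimal local sketch; in a real deployment you would set rest.flamegraph.enabled: true in flink-conf.yaml instead, and keep it off by default since sampling adds overhead:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlameGraphDemo {
        public static void main(String[] args) throws Exception {
            // Enable Flink's built-in per-operator flame graphs (Flink 1.13+).
            Configuration conf = new Configuration();
            conf.setString("rest.flamegraph.enabled", "true");

            // Local environment with a WebUI (needs flink-runtime-web on the classpath).
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

            // A trivial long-running pipeline, just so there is something to sample.
            env.fromSequence(0, Long.MAX_VALUE)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value * 2;
                    }
                })
                .print();

            env.execute("flame-graph-demo");
        }
    }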
Checkpointing
● Are checkpoints failing?
● Too long end-to-end exactly-once latency?
● Reprocessing too many records after failover?
Checkpoint Barriers
Alignment
Checkpoints taking too long?
Long alignment duration/start delay
● Most likely caused by backpressure
○ Scale up
○ Optimise the Job to increase throughput
○ Buffer debloating (reduces the amount of in-flight data; Flink 1.14+)
○ Unaligned checkpoints (a config sketch follows below)
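A sketch of how both knobs are typically switched on, using Flink 1.14/1.15-era option and API names (check your version's docs; the buffer-debloat key normally goes into flink-conf.yaml rather than code):

    import java.time.Duration;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingUnderBackpressure {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Buffer debloating (Flink 1.14+): shrink in-flight network buffers under
            // backpressure so checkpoint barriers have less queued data to travel through.
            conf.setString("taskmanager.network.memory.buffer-debloat.enabled", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.enableCheckpointing(60_000); // checkpoint every 60 s

            // Unaligned checkpoints: barriers overtake in-flight records, at the cost
            // of persisting those in-flight records as part of the checkpoint.
            env.getCheckpointConfig().enableUnalignedCheckpoints();
            // Optionally stay aligned and only fall back to unaligned once alignment is slow.
            env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));

            // ... build and execute the job as usual ...
        }
    }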
Long sync phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● State backends
○ RocksDB flushing to disks
○ Tuning RocksDB advanced options
● Operator/Function-specific code
○ The CheckpointedFunction#snapshotState call
○ For example: a sink flushing/committing records (sketched below)
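To make the last point concrete, here is a hypothetical buffering sink (class and flush method names are made up for the sketch) where the flush happens inside CheckpointedFunction#snapshotState, i.e. squarely in the sync phase:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Whatever happens in snapshotState() runs in the checkpoint's synchronous
    // phase and blocks record processing for this subtask while it runs.
    public class BufferingSink extends RichSinkFunction<String> implements CheckpointedFunction {

        private final List<String> buffer = new ArrayList<>();

        @Override
        public void invoke(String value, Context context) {
            buffer.add(value); // cheap: only buffers in memory
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // A slow flush/commit here shows up directly as a long "sync duration"
            // in the checkpoint statistics for this subtask.
            flushToExternalSystem(buffer);
            buffer.clear();
        }

        @Override
        public void initializeState(FunctionInitializationContext context) {
            // Restore logic omitted; a real sink would also keep the unflushed
            // buffer in operator state so it survives failover.
        }

        private void flushToExternalSystem(List<String> records) {
            // placeholder for a blocking flush / commit against the external system
        }
    }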
Long async phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● Uploading state backend files
○ FileSystem-specific tuning
■ Make sure to fully utilize your FS (e.g. S3 entropy injection)
○ Checkpointed state might be too large
■ Scale up?
■ Reduce state size?
■ Enable incremental checkpoints?
○ Too many small files
■ Increase state.storage.fs.memory-threshold?
● Experimental: enable the changelog state backend (Flink 1.14+; see the config sketch below)
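A config sketch pulling the above together, with option names as they appear in the Flink 1.14/1.15 docs (normally these live in flink-conf.yaml; the values here are only illustrative):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AsyncPhaseTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Upload only new/changed RocksDB files instead of the full state every time.
            conf.setString("state.backend", "rocksdb");
            conf.setString("state.backend.incremental", "true");

            // Inline tiny state files into the checkpoint metadata instead of creating
            // many small files on the checkpoint storage (the default is in the KB range).
            conf.setString("state.storage.fs.memory-threshold", "100kb");

            // Experimental (Flink 1.14+): changelog state backend, which continuously
            // uploads state changes to keep the async phase short and predictable.
            conf.setString("state.backend.changelog.enabled", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            // ... build and execute the job as usual ...
        }
    }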
Recovery
Long recovery
● Analyse Flink (debug) logs
● Use incremental checkpoints and/or native savepoints
● Similar issues as with checkpointing, but in reverse (state is downloaded and loaded instead of uploaded)
● Potential solutions
○ Enabling local recovery might help (see the config sketch below)
○ Reduce state size
○ Scale up
○ Tune RocksDB advanced options
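A sketch of enabling local recovery, using the option name from the Flink 1.14/1.15 docs (normally set in flink-conf.yaml; it only helps when the failed subtasks are rescheduled onto the same TaskManagers, since the local state copy lives on their disks):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecoveryTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Keep a local copy of the latest checkpointed state on each TaskManager so
            // that, after a failover onto the same TaskManagers, state does not have to
            // be re-downloaded from the remote checkpoint storage.
            conf.setString("state.backend.local-recovery", "true");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            // ... build and execute the job as usual ...
        }
    }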
Closing words
● What is the main problem?
○ Processing records?
■ First locate the bottleneck subtask
○ Checkpointing?
■ Look into the checkpoint statistics
○ Recovery?
■ Look into the Flink logs
Thanks
Piotr Nowojski
@PiotrNowojski
piotr@immerok.com
