
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale


This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk explains which aspects currently make a job particularly demanding, shows how to configure and tune a large-scale Flink job, and outlines what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into:
  ▪ analyzing and tuning checkpointing
  ▪ selecting and configuring state backends
  ▪ understanding common bottlenecks
  ▪ understanding and configuring network parameters



  1. Experiences in running Apache Flink® at large scale. Stephan Ewen (@StephanEwen)
  2. Lessons learned from running Flink at large scale, including various things we never expected to become a problem and that evidently still did… Also a preview of various fixes coming in Flink…
  3. What is large scale?
     ▪ Large data volume (events / sec)
     ▪ Large application state (GBs / TBs)
     ▪ High parallelism (1000s of subtasks)
     ▪ Complex dataflow graphs (many operators)
  4. Distributed Coordination
  5. Deploying Tasks. Happens during initial deployment and during recovery. [Diagram: JobManager and TaskManager, connected via Akka / RPC and the Blob Server.] The deployment RPC call contains:
     ▪ Job configuration
     ▪ Task code and objects
     ▪ Recover state handle
     ▪ Correlation IDs
  6. Deploying Tasks, with message sizes. The deployment RPC call contains:
     ▪ Job configuration (KBs)
     ▪ Task code and objects (KBs up to MBs)
     ▪ Recover state handle (KBs)
     ▪ Correlation IDs (few bytes)
  7. RPC volume during deployment (back-of-the-napkin calculation):
     number of tasks (10) × parallelism (1000) × size of task objects (2 MB) = 20 GB of RPC volume
     ≈ 20 seconds on a full 10 Gbit/s network, > 1 min at an average of 3 Gbit/s, > 3 min at an average of 1 Gbit/s
  8. Timeouts and Failure Detection. With ~20 seconds on a full 10 Gbit/s network, > 1 min at an average of 3 Gbit/s, and > 3 min at an average of 1 Gbit/s, the default RPC timeout of 10 seconds leads to failed deployments with RPC timeouts.
     ▪ Solution: Increase the RPC timeout (a configuration sketch follows below)
     ▪ Caveat: Increasing the timeout makes failure detection slower
     ▪ Future: Reduce the RPC load (next slides)
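A minimal sketch of raising that RPC timeout, assuming a programmatic local setup for illustration; on a real cluster the akka.ask.timeout key is normally set in flink-conf.yaml instead:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RpcTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Raise the Akka/RPC ask timeout from the ~10 s default so that large
        // task deployments do not fail with RPC timeouts. In flink-conf.yaml
        // this would be:  akka.ask.timeout: 60 s
        Configuration conf = new Configuration();
        conf.setString("akka.ask.timeout", "60 s");

        // Local environment that picks up the custom configuration (illustration only)
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);

        env.fromElements(1, 2, 3).print();  // stand-in for the real pipeline
        env.execute("rpc-timeout-sketch");
    }
}
```

The caveat from the slide applies regardless of how the value is set: a larger timeout also delays the detection of genuinely failed components.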
  9. Dissecting the RPC messages
     Message part           | Size        | Variance across subtasks and redeploys
     Job configuration      | KBs         | constant
     Task code and objects  | up to MBs   | constant
     Recover state handle   | KBs         | variable
     Correlation IDs        | few bytes   | variable
  10. Upcoming: Deploying Tasks. Out-of-band transfer and caching of the large and constant message parts. [Diagram: JobManager (Blob Server) and TaskManager (Blob Cache).] (1) Deployment RPC call (recover state handle, correlation IDs, BLOB pointers; KBs). (2) Download and cache BLOBs (job config, task objects; MBs).
  11. Checkpoints at scale
  12. Robustly checkpointing… is the most important part of running a large Flink program.
  13. Review: Checkpoints. [Diagram: a checkpoint is triggered and a checkpoint barrier is injected at the sources, flowing through the source / transform operators towards the stateful operations.]
  14. Review: Checkpoints. [Diagram: when the barrier reaches a stateful operation, a state snapshot is triggered and taken.]
  15. Review: Checkpoint Alignment. [Diagram: an operator with two inputs begins aligning on checkpoint barrier n; records that arrive on the input that already delivered the barrier are held back in the input buffer.]
  16. Review: Checkpoint Alignment. [Diagram: once the barrier has arrived on all inputs, the operator emits barrier n downstream, checkpoints, then drains the input buffer and continues processing.]
  17. Understanding Checkpoints
  18. Understanding Checkpoints. Two questions: How long do snapshots take? How well does the alignment behave? (lower is better) alignment delay = end_to_end – sync – async
  19. Understanding Checkpoints. The alignment delay, delay = end_to_end – sync – async, is the most important metric (lower is better). A long delay means the operator was under backpressure; constant backpressure means the application is under-provisioned. Snapshots that take too long mean either too much state per node or a snapshot store that cannot keep up with the load (low bandwidth); this changes with incremental checkpoints.
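As a worked example with made-up numbers: if a checkpoint's end-to-end duration is 60 s, its synchronous part 1 s, and its asynchronous part 9 s, the alignment delay is 60 – 1 – 9 = 50 s, which points to backpressure in the pipeline rather than to slow snapshots.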
  20. Alignments: Limit in-flight data
     ▪ In-flight data is data "between" operators: on the wire or in the network buffers; the amount depends mainly on the network buffer memory
     ▪ Some in-flight data is needed to buffer out network fluctuations / transient backpressure
     ▪ The maximum amount of in-flight data is the maximum amount buffered during alignment
  21. Alignments: Limit in-flight data
     ▪ Flink 1.2: A global buffer pool is distributed across all tasks. Rule of thumb: set it to 4 * num_shuffles * parallelism * num_slots (see the sketch below)
     ▪ Flink 1.3: Limits the maximum in-flight data automatically, using a heuristic based on the number of channels and connections involved in a transfer step
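A minimal sketch of applying that Flink 1.2 rule of thumb; the job-shape numbers are hypothetical, and on a cluster the taskmanager.network.numberOfBuffers key normally lives in flink-conf.yaml rather than in code:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NetworkBufferSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical job shape: 3 shuffles, parallelism 100, 4 slots per TaskManager
        int numShuffles = 3;
        int parallelism = 100;
        int numSlots    = 4;
        int numBuffers  = 4 * numShuffles * parallelism * numSlots; // = 4800

        // Flink 1.2 global network buffer pool size (each buffer is 32 KB by default)
        Configuration conf = new Configuration();
        conf.setInteger("taskmanager.network.numberOfBuffers", numBuffers);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);
        // ... define and execute the job as usual ...
    }
}
```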
  22. Heavy alignments
     ▪ A heavy alignment typically happens at some point: different load on different paths
     ▪ A big window emission concurrent to a checkpoint
     ▪ A stall of one operator on the path
  23. Heavy alignments (same points, continued)
  24. Heavy alignments (continued): for example, a GC stall on one operator on the path
  25. Catching up from heavy alignments
     ▪ Operators that did a heavy alignment need to catch up again
     ▪ Otherwise, the next checkpoint will have a heavy alignment as well
     [Diagram: the records buffered during alignment are consumed first after the checkpoint completes.]
  26. Catching up from heavy alignments
     ▪ Give the computation time to catch up before starting the next checkpoint. Useful: set the min-time-between-checkpoints (see the sketch below)
     ▪ Asynchronous checkpoints help a lot: shorter stalls in the pipeline mean less build-up of in-flight data, and catch-up already happens concurrently to state materialization
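A minimal sketch of setting that minimum pause via Flink's CheckpointConfig; the 60 s interval and 30 s pause are made-up values:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointPauseSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint roughly once a minute
        env.enableCheckpointing(60_000);

        // Guarantee at least 30 s between the end of one checkpoint and the
        // start of the next, so operators can catch up after a heavy alignment
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        env.fromElements(1, 2, 3).print();  // stand-in for the real pipeline
        env.execute("min-pause-between-checkpoints-sketch");
    }
}
```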
  27. Asynchronous Checkpoints. [Diagram: the processing pipeline continues while the state snapshots are durably persisted asynchronously.]
  28. Asynchrony of different state types
     State                   | Flink 1.2              | Flink 1.3 | Flink 1.3+
     Keyed state on RocksDB  | ✔                      | ✔         | ✔
     Keyed state on heap     | ✘ (✔ hidden in 1.2.1)  | ✔         | ✔
     Timers                  | ✘                      | ✔/✘       | ✔
     Operator state          | ✘                      | ✔         | ✔
  29. When to use which state backend? [Decision diagram, a bit simplified: the questions "State ≥ memory?", "Complex objects? (expensive serialization)", and "High data rate?" lead either to the asynchronous heap backend or to RocksDB.] A configuration sketch follows below.
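A minimal sketch of configuring the two backends the diagram chooses between, assuming the Flink 1.2/1.3-era constructors; the checkpoint paths are placeholders:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendChoiceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // State larger than memory -> RocksDB backend
        // (needs the flink-statebackend-rocksdb dependency; the path is a placeholder)
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink/checkpoints"));

        // State fits in memory -> heap-based FsStateBackend; the boolean flag
        // requests asynchronous snapshots (constructor variant of that era):
        // env.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints", true));

        env.fromElements("a", "b", "c").print();  // stand-in for the real pipeline
        env.execute("state-backend-sketch");
    }
}
```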
  30. File Systems, Object Stores, and Checkpointed State
  31. Exceeding FS request capacity
     ▪ Job size: 4 operators
     ▪ Parallelism: 100s to 1000
     ▪ State backend: FsStateBackend
     ▪ State size: a few KBs per operator, 100s to 1000s of files
     ▪ Checkpoint interval: a few seconds
     ▪ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
  32. Exceeding FS request capacity. What happened?
     ▪ Operators prepare state writes and ensure that the parent directory exists
     ▪ Via the S3 file system (from Hadoop), each mkdirs causes 2 HEAD requests
     ▪ Flink 1.2: Lazily initialize the checkpoint preconditions (directories)
     ▪ Flink 1.3: The core state backends reduce the assumption of directories (PUT/GET/DEL); rich file systems support them as fast paths
  33. Reducing FS stress for small state. [Diagram: Fs/RocksDB state backend for most states: the tasks on the TaskManagers write checkpoint data files, and the JobManager's checkpoint coordinator writes the root checkpoint file (metadata).]
  34. Reducing FS stress for small state. [Diagram: Fs/RocksDB state backend for small states: the checkpoint data is sent back with the acknowledgement and stored directly in the metadata file.] Increasing the small-state threshold reduces the number of files (default: 1 KB). A configuration sketch follows below.
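A minimal sketch of raising that small-state threshold, assuming the state.backend.fs.memory-threshold key; on a cluster it is normally set in flink-conf.yaml:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SmallStateThresholdSketch {
    public static void main(String[] args) throws Exception {
        // States smaller than this threshold (here: 64 KB, up from the 1 KB default)
        // are stored inline in the checkpoint metadata file instead of producing
        // one file per state, reducing pressure on the file system / object store.
        Configuration conf = new Configuration();
        conf.setInteger("state.backend.fs.memory-threshold", 64 * 1024);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);
        // ... define and execute the job as usual ...
    }
}
```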
  35. Lagging state cleanup. Symptom: checkpoints are cleaned up too slowly and state accumulates over time. [Diagram: many TaskManagers create files, while a single JobManager deletes them.]
  36. Lagging state cleanup
     ▪ Problem: file systems and object stores offer only synchronous requests to delete state objects, so the time to delete a checkpoint may accumulate to minutes
     ▪ Flink 1.2: Concurrent checkpoint deletes on the JobManager
     ▪ Flink 1.3: For file systems with an actual directory structure, use recursive directory deletes (one request per directory)
  37. Orphaned Checkpoint State. Who owns state objects at what time? [Diagram: (1) the TaskManager writes state, (2) acknowledges and transfers ownership of the state, (3) the JobManager records the state reference.]
  38. Orphaned Checkpoint State. Example checkpoint directory fs:///checkpoints/job-61776516/ containing chk-113/, chk-129/, chk-221/, chk-271/, chk-272/ (latest retained). Upcoming: searching for orphaned state by periodically sweeping the checkpoint directory for leftover directories. It gets more complicated with incremental checkpoints…
  39. Conclusion & General Recommendations
  40. The closer your application is to saturating network, CPU, memory, FS throughput, etc., the sooner an extraordinary situation causes a regression. Enough headroom in the provisioned capacity means fast catch-up after temporary regressions. Be aware that certain operations are spiky (like aligned windows). Always production-test with checkpoints ;-)
  41. Recommendations (part 1): Be aware of the inherent scalability of primitives
     ▪ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc.
     ▪ Broadcasting does not scale: adding more nodes does not help. Don't use it for high-volume joins
     ▪ Putting very large objects into a ValueState may mean a big serialization effort on every access / checkpoint
     ▪ If the state can be mappified, use MapState instead; it performs much better (see the sketch below)
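A minimal sketch of the MapState recommendation, assuming a stream of (userId, eventType) pairs keyed by userId; all names are hypothetical:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/**
 * Per-user counts of event types. With ValueState<Map<String, Long>> the whole
 * map would be (de)serialized on every access; MapState touches only the entry
 * that is actually read or written. Apply on a stream keyed by user id.
 */
public class EventTypeCounter
        extends RichFlatMapFunction<Tuple2<String, String>, Tuple2<String, Long>> {

    private transient MapState<String, Long> countsPerEventType;

    @Override
    public void open(Configuration parameters) {
        countsPerEventType = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("eventTypeCounts", String.class, Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, String> userAndEventType,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        String eventType = userAndEventType.f1;
        Long current = countsPerEventType.get(eventType);
        long updated = (current == null ? 0L : current) + 1L;
        countsPerEventType.put(eventType, updated);  // only this entry is serialized
        out.collect(Tuple2.of(eventType, updated));
    }
}
```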
  42. Recommendations (part 2): If you care about recovery time
     ▪ Having spare TaskManagers helps bridge the time until backup TaskManagers come online
     ▪ Having a spare JobManager can be useful. Future: JobManager failures become non-disruptive
  43. Recommendations (part 3): If you care about CPU efficiency, watch your serializers
     ▪ JSON is a flexible, but awfully inefficient data format
     ▪ Kryo does okay; make sure you register your types (see the sketch below)
     ▪ Flink's directly supported types have good performance: basic types, arrays, tuples, …
     ▪ Nothing ever beats a custom serializer ;-)
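A minimal sketch of the Kryo registration advice; SensorReading is a hypothetical type that would otherwise be written with its full class name on every record:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerRegistrationSketch {

    // Hypothetical event type that ends up handled by Kryo
    public static class SensorReading {
        public String sensorId;
        public long timestamp;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Registering the type lets Kryo write a small integer tag instead of
        // the full class name with every record.
        env.getConfig().registerKryoType(SensorReading.class);

        env.fromElements(1, 2, 3).print();  // stand-in for the real pipeline
        env.execute("serializer-registration-sketch");
    }
}
```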
  44. Thank you! Questions?
