Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

893 views

Published on

http://flink-forward.org/kb_sessions/dynamic-scaling-how-apache-flink-adapts-to-changing-workloads/

Modern stream processing engines not only have to process millions of events per second at sub-second latency but also have to cope with constantly changing workloads. Due to the dynamic nature of stream applications where the number of incoming events can strongly vary with time, systems cannot reliably predetermine the amount of required resources. In order to meet guaranteed SLAs as well as utilizing system resources as efficiently as possible, frameworks like Apache Flink have to adapt their resource consumption dynamically. In this talk, we will take a look under the hood and explain how Flink scales stateful application in and out. Starting with the concept of key groups and partionable state, we will cover ways to detect bottlenecks in streaming jobs and discuss efficient strategies how to scale out operators with minimal down-time.

Published in: Data & Analytics
  • Be the first to comment

Till Rohrmann - Dynamic Scaling - How Apache Flink adapts to changing workloads

  1. 1. Till Rohrmann trohrmann@apache.org @stsffap Dynamic Scaling: How Apache Flink® Adapts to Changing Workloads
  2. 2. Changing Workloads And SLAs 2
  3. 3. Resource Adaption 3 + time Workload Resources time Workload Resources
  4. 4. What Is This Talk About? § Flink’s approach to dynamic scaling § Current state with demo § Outlook on next development steps 4
  5. 5. Dynamic Scaling 5
  6. 6. Basic Idea 6 • Spread work across more workers to decrease workload
  7. 7. Scaling Stateless Jobs 7 Scale Up Scale Down Source Mapper Sink • Scale up: Deploy new tasks • Scale down: Cancel running tasks
  8. 8. Scaling Stateful Jobs 8 ? • Problem: Which state to assign to new task?
  9. 9. State in Apache Flink 9
  10. 10. Keyed vs. Non-keyed State 10 • State bound to a key • E.g. Keyed UDF and window state • State bound to a subtask • E.g. Source state Keyed Non-keyed
  11. 11. Repartitioning Keyed State § Similar to consistent hashing § Split key space into key groups § Assign key groups to tasks 11 Key space Key group #1 Key group #2 Key group #3Key group #4
  12. 12. Repartitioning Keyed State contd. § Rescaling changes key group assignment § Maximum parallelism defined by #key groups 12
  13. 13. Repartitioning Non-keyed state § User defined merge and split functions • Most general approach § Breaking non-keyed state up into finer granularity • State has to contain multiple entries • Automatic repartitioning wrt granularity 13 #1 #2 #3
  14. 14. Repartitioning Non-keyed State contd. § Non-keyed state entries gathered at the job manager § Repartitioning schemes • Repartition & send • Union & broadcast 14
  15. 15. Example: Kafka Source 15 partitionId: 1, offset: 42 partitionId: 3, offset: 10 partitionId: 6, offset: 27 • Store offset for each partition • Individual entries are repartitionable partitionId: 1, offset: 42 partitionId: 3, offset: 10 partitionId: 6, offset: 27
  16. 16. Rescaling: Why is That so Hard? § Handling of state § Repartitioning of keyed & non-keyed state § Unique among open source stream processors, afaik 16
  17. 17. Demo Time 17
  18. 18. Demo Topology 18 Kafka Source Counter KeyBy
  19. 19. Current State and next Steps 19
  20. 20. Current State § Manual rescaling 1. Take savepoint 2. Stop the job 3. Restart job with adjusted parallelism and savepoint 20
  21. 21. Next Steps § Integrate savepoint with stop signal § Rescaling individual operators w/o restart § Dynamic container de-/allocation • “Running Flink Everywhere” by Stephan Ewen, 16:45 at Kesselhaus 21
  22. 22. Auto Scaling Policies 22 • Latency • Throughput • Resource utilization • Kubernetes on GCE, EC2 and Mesos (marathon- autoscale) already support auto-scaling
  23. 23. Conclusion § Scaling of keyed and non-keyed state § Flink supports manual rescaling with restart (WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state) § Future versions might support scaling on the fly and automatic rescaling policies 23

×