Resource Aware Scheduling in Apache Storm

Resource Aware Scheduling in Apache Storm

Published in: Technology


  1. RESOURCE AWARE SCHEDULING IN APACHE STORM Presented by Boyang Jerry Peng
  2. ABOUT ME • Apache Storm Committer and PMC member • Member of Yahoo’s low-latency team ◦ Data processing solutions with low latency • Graduate student @ University of Illinois, Urbana-Champaign ◦ Research emphasis in distributed systems and stream processing • Contact: jerrypeng@yahoo-inc.com
  3. AGENDA • Overview of Apache Storm • Problems and Challenges • Introduction of Resource Aware Scheduler • Results
  4. OVERVIEW • Apache Storm is an open source distributed real-time data stream processing platform ◦ Real-time analytics ◦ Online machine learning ◦ Continuous computation ◦ Distributed RPC ◦ ETL
  5. STORM TOPOLOGY • Processing can be represented as a directed graph • Spouts are sources of information • Bolts are operators that process data
  6. DEFINITIONS OF STORM TERMS • Stream ◦ An unbounded sequence of tuples • Component ◦ A processing operator in a Storm topology that is either a Bolt or a Spout • Executors ◦ Threads spawned in worker processes that execute the logic of components • Worker Process ◦ A process spawned by Storm that may run one or more executors
  7. STORM ARCHITECTURE [Diagram: a master node (Nimbus), a ZooKeeper ensemble for cluster coordination, and Supervisor nodes that launch worker processes]
  8. LOGICAL VS PHYSICAL CONNECTION IN STORM
  9. OVERVIEW OF SCHEDULING IN STORM • Default Scheduling Strategy ◦ Naïve round robin scheduler ◦ Naïve load limiter (Worker Slots) • Multitenant Scheduler ◦ Default Scheduler with multitenant capabilities (supported by security) ◦ Can allocate a set of isolated nodes for a topology (Soft Partitioning) • Resource Aware
  10. RUNNING STORM AT YAHOO - CHALLENGES • Increasingly heterogeneous clusters ◦ Isolation Scheduler: handing out dedicated machines • Low overall cluster resource utilization ◦ Users not utilizing their isolated allocation very well • Unbalanced resource usage ◦ Some machines unused, others overused • Per-topology scheduling strategy ◦ Different topologies have different scheduling needs (e.g. constraint-based scheduling)
  11. RUNNING STORM AT YAHOO – SCALE [Chart: Total Nodes Running Storm at Yahoo; x-axis: Year (2012 to 2016); y-axis: Nodes; series: Total Nodes, Largest Cluster Size; data labels: 600, 2300, 3500 and 120, 300, 680]
  12. RESOURCE AWARE SCHEDULING IN STORM • Scheduling in Storm that takes into account the resource availability on machines and the resource requirements of workloads when scheduling a topology ◦ Fine-grained resource control ◦ Resource Aware Scheduler (RAS) implements this function - Includes many multi-tenant features • Built on top of: ◦ Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar, and Roy Campbell. "R-storm: Resource-aware scheduling in storm." In Proceedings of the 16th Annual Middleware Conference, pp. 149-161. ACM, 2015
  13. RAS API • Fine-grained resource control ◦ Allows users to specify resource requirements for each component (Spout or Bolt) in a Storm topology • API to set component memory requirement: public T setMemoryLoad(Number onHeap, Number offHeap) • API to set component CPU requirement: public T setCPULoad(Number amount) • Example of usage: SpoutDeclarer s1 = builder.setSpout("word", new TestWordSpout(), 10); s1.setMemoryLoad(1024.0, 512.0); builder.setBolt("exclaim1", new ExclamationBolt(), 3).shuffleGrouping("word").setCPULoad(100.0);
  14. CLUSTER CONFIGURATIONS conf/storm.yaml: ... supervisor.memory.capacity.mb: 20480.0 supervisor.cpu.capacity: 400.0 ...
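The two capacities above bound how many executors a supervisor node can host. A minimal Java sketch of that arithmetic follows, using the memory and CPU requests from the RAS API slide. The class `NodeCapacity` is hypothetical and not part of Storm, and the sketch assumes every executor on the node has the same request.

```java
/** Hypothetical helper, not part of Storm: how many identical executors fit on one supervisor. */
final class NodeCapacity {
    final double memoryMb;   // value of supervisor.memory.capacity.mb
    final double cpuPoints;  // value of supervisor.cpu.capacity

    NodeCapacity(double memoryMb, double cpuPoints) {
        this.memoryMb = memoryMb;
        this.cpuPoints = cpuPoints;
    }

    /** Executors that fit when each needs (onHeap + offHeap) MB of memory and the given CPU points. */
    int maxExecutors(double onHeapMb, double offHeapMb, double cpu) {
        double memoryPerExecutor = onHeapMb + offHeapMb;
        if (memoryPerExecutor <= 0 || cpu <= 0) {
            throw new IllegalArgumentException("resource requests must be positive");
        }
        int byMemory = (int) Math.floor(memoryMb / memoryPerExecutor);
        int byCpu = (int) Math.floor(cpuPoints / cpu);
        // The scarcer resource decides how many executors fit.
        return Math.min(byMemory, byCpu);
    }

    public static void main(String[] args) {
        NodeCapacity node = new NodeCapacity(20480.0, 400.0);
        // 1536 MB and 100 CPU points per executor: memory allows 13, CPU allows 4.
        System.out.println(node.maxExecutors(1024.0, 512.0, 100.0)); // prints 4
    }
}
```

With these numbers CPU is the binding resource, which is the kind of imbalance a resource-aware scheduler can see and a slot-based one cannot.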
  15. RAS FEATURES – PLUGGABLE PER TOPOLOGY SCHEDULING STRATEGIES • Allows users to specify which scheduling strategy to use: conf.setTopologyStrategy(DefaultResourceAwareStrategy.class); • Default Strategy - Based on: Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar, and Roy Campbell. "R-storm: Resource-aware scheduling in storm." In Proceedings of the 16th Annual Middleware Conference, pp. 149-161. ACM, 2015. - Enhancements have been made (e.g. limiting max heap size per worker, a better rack selection algorithm) - Aims to pack a topology as tightly as possible on machines to reduce communication latency and increase utilization - Collocates components that communicate with each other (operator chaining) • Constraint Based Scheduling Strategy ◦ CSP problem solver
  16. RAS FEATURES – RESOURCE ISOLATION VIA CGROUPS (LINUX PLATFORMS ONLY*) • Replaces resource isolation via isolated nodes • Resource quotas enforced on a per-worker basis • Each worker should not exceed its allocated resource quota • Guarantees QoS and topology isolation • Documentation: https://storm.apache.org/releases/2.0.0-SNAPSHOT/cgroups_in_storm.html *RHEL 7 or higher. Potential critical bugs in older RHEL versions.
  17. RAS FEATURES – PER USER RESOURCE GUARANTEES • Configurable per-user resource guarantees
  18. RAS FEATURE – TOPOLOGY PRIORITY • Users can set the priority of a topology to indicate its importance: conf.setTopologyPriority(int priority) • Topology priorities range from 0 to 29 and are partitioned into priority levels, each covering a range of priorities: PRODUCTION => 0 – 9 STAGING => 10 – 19 DEV => 20 – 29
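The mapping from a numeric priority to its level can be sketched in a few lines of Java. The enum `PriorityLevel` below is hypothetical and only mirrors the three bands listed on the slide; it is not a Storm class.

```java
/** Hypothetical enum mirroring the priority bands on the slide; not a Storm class. */
enum PriorityLevel {
    PRODUCTION(0, 9),
    STAGING(10, 19),
    DEV(20, 29);

    final int min;
    final int max;

    PriorityLevel(int min, int max) {
        this.min = min;
        this.max = max;
    }

    /** Returns the level whose band contains the given priority; rejects values outside 0-29. */
    static PriorityLevel of(int priority) {
        for (PriorityLevel level : values()) {
            if (priority >= level.min && priority <= level.max) {
                return level;
            }
        }
        throw new IllegalArgumentException("priority must be in 0-29, got " + priority);
    }

    public static void main(String[] args) {
        System.out.println(PriorityLevel.of(12)); // prints STAGING
    }
}
```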
  19. RAS FEATURES – PLUGGABLE TOPOLOGY PRIORITY • Topology Priority Strategy ◦ Which topology should be scheduled first? ◦ Cluster-wide configuration set in storm.yaml ◦ Default Topology Priority Strategy - Takes into account resource guarantees and topology priority - Schedules topologies from the user who is furthest under his or her resource guarantee - Topologies of each user are sorted by priority - More details: https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html
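The ordering rule described for the default strategy can be illustrated with a small Java sketch. It is not Storm's implementation: the class and field names are hypothetical, resources are collapsed into a single abstract unit, "furthest under the guarantee" is taken to mean the largest absolute gap between guarantee and usage, and a lower priority number is assumed to mean a more important topology.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the ordering rule; names and the single resource unit are assumptions. */
final class PriorityOrderSketch {
    static final class Topology {
        final String name;
        final int priority; // assumed: lower number means more important
        Topology(String name, int priority) { this.name = name; this.priority = priority; }
    }

    static final class User {
        final String name;
        final double guaranteed; // guaranteed resources, one abstract unit
        final double used;       // resources currently used
        final List<Topology> topologies;
        User(String name, double guaranteed, double used, List<Topology> topologies) {
            this.name = name; this.guaranteed = guaranteed; this.used = used; this.topologies = topologies;
        }
        double headroom() { return guaranteed - used; }
    }

    /** Users furthest under their guarantee come first; each user's topologies are sorted by priority. */
    static List<String> schedulingOrder(List<User> users) {
        List<User> sortedUsers = new ArrayList<>(users);
        sortedUsers.sort((a, b) -> Double.compare(b.headroom(), a.headroom()));
        List<String> order = new ArrayList<>();
        for (User user : sortedUsers) {
            List<Topology> sorted = new ArrayList<>(user.topologies);
            sorted.sort(Comparator.comparingInt((Topology t) -> t.priority));
            for (Topology t : sorted) {
                order.add(user.name + "/" + t.name);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        User alice = new User("alice", 100, 80,
                Arrays.asList(new Topology("a-dev", 25), new Topology("a-prod", 3)));
        User bob = new User("bob", 100, 20,
                Arrays.asList(new Topology("b-stage", 12), new Topology("b-prod", 1)));
        // bob has more unused guarantee (80 vs 20), so his topologies are considered first.
        System.out.println(schedulingOrder(Arrays.asList(alice, bob)));
    }
}
```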
  20. RAS FEATURES – PLUGGABLE TOPOLOGY EVICTION STRATEGIES • Topology Eviction Strategy ◦ When there are not enough resources, which topology from which user should be evicted? ◦ Cluster-wide configuration set in storm.yaml ◦ Default Eviction Strategy - Based on how much of a user’s guarantee has been satisfied - Priority of the topology ◦ FIFO Eviction Strategy - Used on our staging clusters - Ad hoc use ◦ More details: https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html
  21. SELECTED RESULTS (THROUGHPUT) FROM PAPER [1] – YAHOO TOPOLOGIES 47% improvement! 50% improvement! * Figures from [1]
  22. SELECTED RESULTS (THROUGHPUT) FROM PAPER [1] – YAHOO TOPOLOGIES
  23. PRELIMINARY RESULTS IN YAHOO STORM CLUSTERS
  24. PRELIMINARY RESULTS IN YAHOO STORM CLUSTERS
  25. CONCLUDING REMARKS AND FUTURE WORK • In Summary ◦ Built the Resource Aware Scheduler • Migration Process ◦ In the process of migrating from MultitenantScheduler to RAS ◦ Working through bugs with cgroups, Java, and the Linux kernel • Future Work ◦ Improved scheduling strategies ◦ Real-time resource monitoring ◦ Elasticity
  26. QUESTIONS
  27. REFERENCES • [1] Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar, and Roy Campbell. "R-storm: Resource-aware scheduling in Storm." In Proceedings of the 16th Annual Middleware Conference, pp. 149-161. ACM, 2015. http://web.engr.illinois.edu/~bpeng/files/r-storm.pdf • [2] Official Resource Aware Scheduler Documentation: https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html • [3] Umbrella Jira for Resource Aware Scheduling in Storm: https://issues.apache.org/jira/browse/STORM-893
  28. EXTRA SLIDES
  29. PROBLEM FORMULATION • Targeting 3 types of resources ◦ CPU, Memory, and Network • Limited resource budget for each node • Specific resource needs for each task • Goal: Improve throughput by maximizing utilization and minimizing network latency
  30. PROBLEM FORMULATION • Set of all tasks Ƭ = {τ1, τ2, τ3, …}; each task τi has resource demands ◦ CPU requirement c_τi ◦ Network bandwidth requirement b_τi ◦ Memory requirement m_τi • Set of all nodes N = {θ1, θ2, θ3, …} ◦ Total available CPU budget W1 ◦ Total available bandwidth budget W2 ◦ Total available memory budget W3
  31. PROBLEM FORMULATION • Q_i: Throughput contribution of each node • Assign tasks to a subset of nodes N′ ⊆ N that minimizes the total resource waste [equation shown as an image on the slide]
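The budgets defined on the previous slide imply capacity constraints on any assignment of tasks to nodes. One way to write them is sketched below. The set $A(\theta)$ of tasks placed on node $\theta$ is notation introduced here, not taken from the slides, and the budgets $W_1, W_2, W_3$ are assumed to apply to each node individually, matching the "limited resource budget for each node" statement.

```latex
\begin{aligned}
\sum_{\tau \in A(\theta)} c_{\tau} &\le W_1, \\
\sum_{\tau \in A(\theta)} b_{\tau} &\le W_2, \\
\sum_{\tau \in A(\theta)} m_{\tau} &\le W_3,
\qquad \text{for every node } \theta \in N'.
\end{aligned}
```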
  32. PROBLEM FORMULATION ◦ Quadratic Multiple 3D Knapsack Problem ◦ We call it QM3DKP! ◦ NP-Hard! • Computing optimal or approximate solutions may be hard and time-consuming • Real-time systems need fast scheduling ◦ Scheduling must be recomputed when failures occur
  33. SOFT CONSTRAINTS VS HARD CONSTRAINTS • Soft Constraints ◦ CPU and Network Resources ◦ Graceful performance degradation with oversubscription • Hard Constraints ◦ Memory ◦ Oversubscribe -> Game over
  34. OBSERVATIONS ON NETWORK LATENCY 1. Inter-rack communication is the slowest 2. Inter-node communication is slow 3. Inter-process communication is faster 4. Intra-process communication is the fastest
  35. HEURISTIC ALGORITHM • Greedy approach • Designing a 3D resource space ◦ Each resource maps to an axis ◦ Can be generalized to an nD resource space ◦ Trivial overhead! • Based on: ◦ min (Euclidean distance) ◦ Satisfy hard constraints
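The greedy selection step can be sketched in Java as follows. This is a simplified illustration, not the scheduler's code: the class and field names are hypothetical, the three axes are taken to be CPU, memory, and a network distance from a reference node, and the axes are neither weighted nor normalized. Memory is treated as the hard constraint, so nodes without enough memory are skipped before distances are compared.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of greedy node selection in a 3D resource space. */
final class NodeSelectionSketch {
    static final class Node {
        final String name;
        final double cpu;         // available CPU points
        final double memoryMb;    // available memory in MB
        final double netDistance; // assumed network distance from a reference node (0 = same node)

        Node(String name, double cpu, double memoryMb, double netDistance) {
            this.name = name; this.cpu = cpu; this.memoryMb = memoryMb; this.netDistance = netDistance;
        }
    }

    /** Euclidean distance between a task's demand and a node's availability (unweighted axes). */
    static double distance(double taskCpu, double taskMemMb, Node n) {
        double dCpu = n.cpu - taskCpu;
        double dMem = n.memoryMb - taskMemMb;
        return Math.sqrt(dCpu * dCpu + dMem * dMem + n.netDistance * n.netDistance);
    }

    /** Returns the closest node that satisfies the hard memory constraint, or null if none does. */
    static Node select(double taskCpu, double taskMemMb, List<Node> nodes) {
        Node best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Node n : nodes) {
            if (n.memoryMb < taskMemMb) {
                continue; // hard constraint: memory is never oversubscribed
            }
            double d = distance(taskCpu, taskMemMb, n);
            if (d < bestDistance) {
                best = n;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("A", 100, 512, 0),   // too little memory for the task below
                new Node("B", 200, 2048, 1),
                new Node("C", 150, 1200, 2));
        System.out.println(select(100, 1024, nodes).name); // prints C
    }
}
```

Choosing the minimum distance favors the node whose free resources most closely match the task's demand, which is one way to keep resource waste low while keeping tasks near the reference node.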
  36. HEURISTIC ALGORITHM
  37. HEURISTIC ALGORITHM [Diagram labels: Switch; 1, 2, 3, 4, 5, 6]
  38. HEURISTIC ALGORITHM • Our proposed heuristic algorithm has the following properties: 1) Tasks of components that communicate with each other have the highest priority to be scheduled in close network proximity to each other. 2) No hard resource constraint is violated. 3) Resource waste on nodes is minimized.
