
Medea: Scheduling of Long Running Applications in Shared Production Clusters


EuroSys'18

https://lsds.doc.ic.ac.uk/sites/default/files/medea-eurosys18.pdf


  1. MEDEA: Scheduling of Long Running Applications in Shared Production Clusters
     Panagiotis Garefalakis, Imperial College London (pg1712@imperial.ac.uk)
     Konstantinos Karanasos, Microsoft (kokarana@microsoft.com)
     Peter Pietzuch, Imperial College London (prp@imperial.ac.uk)
     Arun Suresh, Microsoft (arsuresh@microsoft.com)
     Sriram Rao, Microsoft (sriramra@microsoft.com)
  2. Long-Running Applications (LRAs)
     LRAs = applications with long-running containers (running from hours to months)
     Shift from short-running containers (MapReduce, Scope, Tez) towards long-running containers that power richer applications in compute clusters:
     > Interactive data-intensive applications • Spark, Hive LLAP
     > Streaming systems • Flink, Storm, SEEP
     > Latency-sensitive applications • HBase, Memcached
     > ML frameworks • TensorFlow, Spark MLlib
  3. LRAs in Microsoft's analytics clusters
     > Each cluster comprises tens of thousands of machines
     > 10-100% of each cluster's machines are used for LRAs
     > Machines for LRAs are picked statically or randomly
     [Chart: machines used for LRAs (%) across Microsoft's analytics clusters]
     LRA placement is important
  4. LRA scheduling problem
     > Performance: "Place Storm containers in the same rack as HBase"
     > Cluster objectives: "Minimize resource fragmentation"
     > Resilience: "Place HBase containers across upgrade domains"
     [Diagram: cluster with racks r1/r2, upgrade domains A/B, and nodes n1-n8 hosting MR, HBase, and Storm containers, organized into node groups]
     Goal: diligent placement through constraints
  5. Machine unavailability in a Microsoft cluster
     > Cluster organized in node groups
     > Machines become unavailable in groups
     > With random placement, an LRA might lose all its containers at once
     [Chart: unavailable machines (%) over four days, total vs. node group A]
  6. Machine unavailability in a Microsoft cluster (continued)
     [Same slide; the chart additionally shows unavailable machines (%) for upgrade domain A]
  7. Challenges
     > How to relate containers to node groups? → Support high-level constraints
     > How to express different types of constraints related to LRA containers? → Support expressive constraints
     > How to achieve high-quality placement without affecting task-based jobs? → Two-scheduler design
  8. MEDEA: LRA scheduling with expressive placement constraints
     > How to relate containers to node groups? → Support container tags and logical node groups
     > How to express different types of constraints related to LRA containers? → Introduce expressive cardinality constraints
     > How to achieve high-quality placement without affecting task-based jobs? → Follow a two-scheduler design
  9. Container tagging
     > Idea: use container tags to refer to groups of containers
     – Tags describe the application type, application role, resource specification, and global application ID
     – Tags can refer to any current or future LRA container
     [Diagram: an HBase container on node n1 carrying the tags KV, HBase_master, memory_critical, appID_1; a Storm container on node n2]
  10. Hierarchical grouping of nodes
      > Idea: use logical node groups to refer to dynamic node sets
      – E.g. node, rack, upgrade domain
      – Associate each node with all the container tags that live on it
      – Hide the infrastructure details, e.g. "spread across upgrade domains"
      [Diagram: node n1 carries the tags of its HBase container (KV, HBase_master, memory_critical, appID_1), node n2 those of its Storm container (Storm, nimbus, appID_2); the upgrade-domain node group aggregates the tags of its member nodes]
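
For illustration, here is a minimal sketch of the bookkeeping this implies: nodes accumulate the tags of the containers placed on them, and a logical node group (node, rack, upgrade domain, ...) exposes the aggregated tag counts of its member nodes. All class and method names below are hypothetical, not MEDEA's implementation.

```java
import java.util.*;

// Hypothetical sketch: container tags per node and logical node groups.
class TagIndex {
    // occurrences of each container tag on each node, e.g. {"hbase": 2, "appID_1": 2}
    private final Map<String, Map<String, Integer>> tagCountsPerNode = new HashMap<>();
    // logical node groups (rack, upgrade domain, ...) are just named node sets
    private final Map<String, Set<String>> nodesPerGroup = new HashMap<>();

    // Record the tags of a container placed on a node.
    void placeContainer(String node, Collection<String> containerTags) {
        Map<String, Integer> counts =
            tagCountsPerNode.computeIfAbsent(node, n -> new HashMap<>());
        for (String tag : containerTags) {
            counts.merge(tag, 1, Integer::sum);
        }
    }

    void addNodeToGroup(String group, String node) {
        nodesPerGroup.computeIfAbsent(group, g -> new HashSet<>()).add(node);
    }

    // Tag occurrences within a group are the sum over its member nodes,
    // hiding the physical infrastructure from the caller.
    Map<String, Integer> tagCountsInGroup(String group) {
        Map<String, Integer> result = new HashMap<>();
        for (String node : nodesPerGroup.getOrDefault(group, Set.of())) {
            tagCountsPerNode.getOrDefault(node, Map.of())
                .forEach((tag, count) -> result.merge(tag, count, Integer::sum));
        }
        return result;
    }
}
```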
  11. Defining constraints
      > Generic constraints to capture a variety of cases:
      Min cardinality ≤ occurrences(Target Tag) ≤ Max cardinality

      Source Tag | Node Group | Target Tag | Min cardinality | Max cardinality
      storm      | rack       | hbase      | 1               | ∞
      storm      | node       | storm      | 0               | 5

      > Affinity: "Place Storm containers in the same rack as HBase" (first row)
      > Cardinality: "Place up to 5 Storm containers in the same node" (second row)
      A single constraint type is sufficient!
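
A minimal sketch (hypothetical names, not MEDEA's code) of evaluating such a constraint: count the occurrences of the target tag inside the node group of the candidate node, e.g. using the tag index sketched above, and check them against the min/max bounds.

```java
import java.util.Map;

// Generic cardinality constraint from this slide:
// min cardinality <= occurrences(target tag within node group) <= max cardinality.
// Names are hypothetical illustrations, not MEDEA's implementation.
record CardinalityConstraint(String sourceTag,   // tag of the containers being placed
                             String nodeGroup,   // scope: "node", "rack", "upgrade_domain", ...
                             String targetTag,   // tag whose occurrences are bounded
                             int minCardinality,
                             int maxCardinality) {

    // tagCountsInGroup: tag occurrences inside the node group that the
    // candidate node belongs to (e.g. its rack for a rack-scoped constraint).
    boolean satisfiedBy(Map<String, Integer> tagCountsInGroup) {
        int occurrences = tagCountsInGroup.getOrDefault(targetTag, 0);
        return occurrences >= minCardinality && occurrences <= maxCardinality;
    }
}

class ConstraintExamples {
    public static void main(String[] args) {
        // Affinity: "place Storm containers in the same rack as HBase"
        var affinity = new CardinalityConstraint("storm", "rack", "hbase", 1, Integer.MAX_VALUE);
        // Cardinality: "place up to 5 Storm containers on the same node"
        var perNodeCap = new CardinalityConstraint("storm", "node", "storm", 0, 5);

        System.out.println(affinity.satisfiedBy(Map.of("hbase", 3)));    // true
        System.out.println(perNodeCap.satisfiedBy(Map.of("storm", 5)));  // true
        System.out.println(perNodeCap.satisfiedBy(Map.of("storm", 6)));  // false
    }
}
```

Anti-affinity is simply the special case with min = max = 0 for the target tag.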
  12. Two-scheduler design
      > Idea: a traditional scheduler for task-based jobs, an optimization-based scheduler for LRAs
      [Architecture diagram: tasks go to the task-based scheduler for fast allocations; LRAs go through the LRA interface to the LRA scheduler (running the LRA scheduling algorithm) for high-quality placement; a constraint manager tracks constraints, container tags, and node groups; both schedulers share the cluster state]
  13. Two-scheduler design (continued)
      > The LRA scheduler runs an ILP scheduling algorithm with flexible optimization goals
      • Invoked at configurable intervals; considers multiple containers at once
      [Same architecture diagram, now showing the allocation path]
  14. Two-scheduler design (continued)
      > ILP scheduling algorithm with flexible optimization goals
      • Invoked at configurable intervals; considers multiple containers at once
      Satisfy LRA constraints without affecting task-based jobs
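
The slides do not show the ILP itself. Purely as an illustration of the kind of formulation an optimization-based LRA scheduler can solve, a simplified placement ILP with binary assignment variables, node capacities, and soft-constraint violation penalties might look as follows; MEDEA's actual objective and constraint set differ and support multiple optimization goals.

```latex
% Simplified sketch, not MEDEA's exact formulation.
% x_{c,n} = 1 iff container c is placed on node n; r_c is c's resource demand;
% R_n is node n's free capacity; v_k = 1 iff placement constraint k is violated;
% \lambda trades off placed containers against constraint violations.
\begin{align}
  \max_{x,\,v} \quad & \sum_{c \in C} \sum_{n \in N} x_{c,n} \;-\; \lambda \sum_{k \in K} v_k \\
  \text{s.t.}  \quad & \sum_{n \in N} x_{c,n} \le 1          && \forall c \in C \\
               & \sum_{c \in C} r_c \, x_{c,n} \le R_n       && \forall n \in N \\
               & x_{c,n} \in \{0,1\}, \quad v_k \in \{0,1\}
\end{align}
```

Each cardinality constraint k would additionally tie v_k to tag occurrences, i.e. to sums of x_{c,n} over the containers carrying the target tag and over the nodes of the relevant node group.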
  15. Implementation
      > Built as an extension to YARN; the code is part of the current release, 3.1.0
      > Introduced APIs for clients to specify constraints; MEDEA's components were added to the Resource Manager
      > YARN's Capacity Scheduler was used as the task-based scheduler
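
As a concrete illustration, the slides' two example constraints can be expressed with the placement-constraint API that shipped in Apache Hadoop YARN 3.1.0. The calls below reflect that API as I recall it (PlacementConstraints, allocation tags, NODE/RACK scopes); treat the snippet as a sketch and check the Hadoop 3.1 documentation for exact signatures.

```java
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.*;

public class MedeaConstraintExamples {
    public static void main(String[] args) {
        // Affinity: place Storm containers on a rack that already hosts
        // containers tagged "hbase".
        PlacementConstraint stormNearHbase =
            targetIn(RACK, allocationTag("hbase")).build();

        // Cardinality: at most 5 containers tagged "storm" per node (min 0, max 5).
        PlacementConstraint atMostFiveStormPerNode =
            cardinality(NODE, 0, 5, "storm").build();

        // In an application master, these constraints would be attached to the
        // container requests (SchedulingRequests) that carry the "storm" tag.
        System.out.println(stormNearHbase);
        System.out.println(atMostFiveStormPerNode);
    }
}
```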
  16. Evaluation
      > What is the impact of cardinality constraints on LRA performance?
      > What is the benefit of the two-scheduler design in terms of scheduling latency?
      > What is the benefit of MEDEA in a large-scale environment?
  17. Impact of cardinality constraints on LRA performance
      > TensorFlow ML workflow with 1M iterations, using 32 workers and a varying number of workers per node
      [Chart: runtime (min) vs. max cardinality per node (1 to 32) for a low-utilized and a highly utilized cluster; cardinality 1 corresponds to anti-affinity, 32 to affinity]
  18. Impact of cardinality constraints on LRA performance (continued)
      [Same chart, with cluster utilization quantified: 5% vs. 70% utilized cluster]
      Cardinality constraints are important; affinity and anti-affinity alone are not enough
  19. Impact of cardinality constraints on LRA performance (continued)
      [Same chart, annotated with 34% (5% utilized cluster) and 42% (70% utilized cluster) runtime differences between intermediate cardinalities and the extremes: anti-affinity incurs network overhead, affinity leads to over-utilization]
      Cardinality constraints are important; affinity and anti-affinity alone are not enough
  20. Two-scheduler benefit
      [Chart: scheduling latency (ms) vs. fraction of cluster resources used by LRAs (0-100%), comparing a single scheduler that optimizes all requests ("Optimise All") with MEDEA]
  21. Two-scheduler benefit (continued)
      [Same chart]
  22. Two-scheduler benefit (continued)
      [Same chart, annotated with a 9.5x gap between the scheduling latency of "Optimise All" and that of MEDEA]
      Expensive scheduling logic is used where it matters the most
  23. Large-scale deployment
      > Pre-production cluster: 400 machines on 10 racks
      > Workloads:
      – 50 HBase instances (10 workers each)
      – 45 TensorFlow instances (8 workers and 2 parameter servers each)
      – Batch production workload (50% of cluster resources)
      > Constraints:
      – Containers of each LRA instance on the same rack
      – No more than 2 HBase (4 for TensorFlow) containers on the same node
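
For concreteness, the deployment's constraints map onto the same constraint form and YARN API as in the earlier sketch; tag names such as "appID_1" stand in for an instance's global application ID tag and are illustrative only.

```java
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.*;

public class DeploymentConstraints {
    public static void main(String[] args) {
        // Containers of each LRA instance on the same rack: rack-scoped
        // affinity to the instance's own application tag (one per instance).
        PlacementConstraint sameRackPerInstance =
            targetIn(RACK, allocationTag("appID_1")).build();

        // No more than 2 HBase containers on the same node.
        PlacementConstraint hbasePerNode =
            cardinality(NODE, 0, 2, "hbase").build();

        // No more than 4 TensorFlow containers on the same node.
        PlacementConstraint tensorflowPerNode =
            cardinality(NODE, 0, 4, "tensorflow").build();

        System.out.println(sameRackPerInstance);
        System.out.println(hbasePerNode);
        System.out.println(tensorflowPerNode);
    }
}
```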
  24. Impact of MEDEA on LRA performance
      [Chart: TensorFlow runtime (min; 5th percentile, median, 99th percentile) with increasing constraint support: no constraints (YARN), affinity (Kubernetes), + cardinality (Kubernetes enhanced), + multiple containers (= MEDEA); 58% and 54% improvements are annotated on the chart]
      MEDEA improves the performance and predictability of LRAs
  25. Impact of MEDEA on task performance
      [Chart: short-task runtime (sec) under YARN (no constraints), KUBE (affinity), KUBE++ (+ cardinality), and MEDEA (+ multiple containers)]
      MEDEA does not affect task-based job runtimes
  26. Summary
      Powerful constraints are required to unlock the full potential of LRAs!
      MEDEA:
      – Two-scheduler design
      – Supports expressive, high-level constraints
      – Does not impact the latency of task-based jobs
      Thank you! Questions? Panagiotis Garefalakis, pg1712@imperial.ac.uk
      Available in YARN 3.1.0: https://github.com/apache/hadoop
