MEDEA: Scheduling of Long Running Applications in Shared Production Clusters
EuroSys'18
https://lsds.doc.ic.ac.uk/sites/default/files/medea-eurosys18.pdf
1. MEDEA: Scheduling of Long Running Applications in Shared Production Clusters
Panagiotis Garefalakis, Imperial College London, pg1712@imperial.ac.uk
Konstantinos Karanasos, Microsoft, kokarana@microsoft.com
Peter Pietzuch, Imperial College London, prp@imperial.ac.uk
Arun Suresh, Microsoft, arsuresh@microsoft.com
Sriram Rao, Microsoft, sriramra@microsoft.com
2. Long-Running Applications (LRAs)
Richer applications in compute clusters:
> Short-running containers
• MapReduce, Scope, Tez
Shift towards long-running containers:
> Interactive data-intensive applications
• Spark, Hive LLAP
> Streaming systems
• Flink, Storm, SEEP
> Latency-sensitive applications
• HBase, Memcached
> ML frameworks
• TensorFlow, Spark ML-lib
LRAs = applications with long-running containers (running from hours to months)
3. LRAs in Microsoft’s analytics clusters
> Each cluster comprises tens of thousands of machines
> 10-100% of each cluster’s machines are used for LRAs
> Machines for LRAs are picked statically or randomly
[Chart: machines used for LRAs (%) across Microsoft’s analytics clusters]
LRA placement is important
4. LRA scheduling problem
> Performance: “Place Storm containers in the same rack as HBase”
> Cluster objectives: “Minimize resource fragmentation”
> Resilience: “Place HBase containers across upgrade domains”
[Figure: example cluster with racks r1 and r2, upgrade domains A and B, and nodes n1-n8 hosting MR, HBase, and Storm containers grouped into node groups]
Goal: Diligent placement through constraints
5. Machine unavailability in a Microsoft cluster
> Cluster organized in node groups
> Machines become unavailable in groups
> With random placement, an LRA might lose all containers at once
[Chart: unavailable machines (%) over 4 days, for the whole cluster (total), node group A, and upgrade domain A]
7. Challenges
> How to relate containers to node groups?
– Support high-level constraints
> How to express different types of constraints related to LRA containers?
– Support expressive constraints
> How to achieve high-quality placement without affecting task-based jobs?
– Two-scheduler design
8. MEDEA: LRA scheduling with expressive placement constraints
> How to relate containers to node groups?
– Support container tags and logical node groups
> How to express different types of constraints related to LRA containers?
– Introduce expressive cardinality constraints
> How to achieve high-quality placement without affecting task-based jobs?
– Follow a two-scheduler design
9. Container tagging
> Idea: use container tags to refer to groups of containers
– Tags describe:
• application type
• application role
• resource specification
• global application ID
– Can refer to any current or future LRA container
[Figure: nodes n1 (Storm) and n2 (HBase); the HBase container carries the tags KV, HBase master, memory critical, appID_1]
10. Hierarchical grouping of nodes
> Idea: logical node groups to refer to dynamic node sets
– E.g. node, rack, upgrade domain
– Associate nodes with all the container tags that live there
– Hide infrastructure: “spread across upgrade domains”
[Figure: container tags on n1 (Storm, nimbus, appID_2) and n2 (KV, HBase master, memory_critical, appID_1) become node tags, which aggregate into node group tags at the upgrade-domain level]
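This tag-and-group model is what later shipped as allocation tags in YARN 3.1, the release the talk targets. As a minimal sketch, assuming Hadoop 3.1's SchedulingRequest builder API and with illustrative tag values taken from the figure, an application attaches tags to a container request like this:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceSizing;
    import org.apache.hadoop.yarn.api.records.SchedulingRequest;

    public class TaggedRequests {
        // One HBase master container carrying the four kinds of tags from
        // the slide: application type, role, resource profile, and app ID.
        public static SchedulingRequest hbaseMaster() {
            Set<String> tags = new HashSet<>(Arrays.asList(
                "hbase", "hbase-master", "memory-critical", "appID_1"));
            return SchedulingRequest.newBuilder()
                .allocationRequestId(1L)
                .priority(Priority.newInstance(0))
                .allocationTags(tags)
                .resourceSizing(ResourceSizing.newInstance(
                    1, Resource.newInstance(4096, 2))) // 1 container, 4 GB, 2 cores
                .build();
        }
    }

Node and node group tags are derived by the system from the containers placed there, so applications only ever tag their own containers.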
11. Defining constraints
> Generic constraints to capture a variety of cases:

Source Tag | Node Group | Target Tag | Min cardinality | Max cardinality
storm      | rack       | hbase      | 1               | ∞
storm      | node       | storm      | 0               | 5

Min cardinality ≤ occurrences(Target Tag) ≤ Max cardinality

A single constraint type is sufficient!
> Affinity: “Place Storm containers in the same rack as HBase”
> Cardinality: “Place up to 5 Storm containers in the same node”
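Both table rows are instances of this one min/max-cardinality constraint type, and they map directly onto the PlacementConstraints helpers exposed in Hadoop 3.1. A sketch (targetIn is the shorthand for min cardinality 1 with an unbounded max):

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

    import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

    public class ConstraintExamples {
        // Row 1, affinity: the rack hosting the container must also hold
        // at least one "hbase" container (min = 1, max = unbounded).
        static final PlacementConstraint STORM_NEAR_HBASE =
            targetIn(RACK, allocationTag("hbase")).build();

        // Row 2, cardinality: at most 5 "storm" containers per node
        // (min = 0, max = 5).
        static final PlacementConstraint MAX_FIVE_STORM_PER_NODE =
            cardinality(NODE, 0, 5, "storm").build();
    }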
12. Two-scheduler design
> Idea: traditional scheduler for task-based jobs, optimization-based scheduler for LRAs
– Task-based Scheduler: fast task-based allocations
– LRA Scheduler: high-quality placement
[Figure: tasks go to the Task-based Scheduler; LRAs go through the LRA Interface to the LRA Scheduler, which runs the LRA scheduling algorithm; a Constraint Manager tracks constraints, container tags, and node groups over shared cluster state]
13. Two-scheduler design (cont.)
> ILP scheduling algorithm with flexible optimization goals
• Invoked at configurable intervals, considers multiple containers at once
Satisfy LRA constraints without affecting task-based jobs
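The talk does not spell the ILP out; as a hedged sketch of what a placement ILP of this shape typically looks like (notation mine, not the paper's), with binary variables x_{c,n} deciding whether container c lands on node n:

    \begin{align*}
    \min_{x}\ & f(x) && \text{flexible objective, e.g. minimize violations or fragmentation}\\
    \text{s.t.}\ & \textstyle\sum_{n} x_{c,n} = 1 && \text{for every container } c \text{ (placed exactly once)}\\
    & \textstyle\sum_{c} r_c\, x_{c,n} \le R_n && \text{for every node } n \text{ (capacity)}\\
    & m \le \textstyle\sum_{n \in g} \sum_{c \in T} x_{c,n} \le M && \text{for every node group } g \text{ (cardinality over target tag } T)\\
    & x_{c,n} \in \{0, 1\}
    \end{align*}

Because the solver runs at configurable intervals over batches of pending LRA containers, it can trade a little latency for placement quality without slowing the task-based path.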
15. Implementation
> Built as an extension to YARN; the code is part of release 3.1.0
> Introduced APIs for clients to specify constraints; MEDEA’s components were added to the Resource Manager
> YARN’s Capacity Scheduler was used as the task-based scheduler
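To exercise this from a stock Hadoop 3.1 deployment, the Resource Manager's constraint handler must be enabled (the yarn.resourcemanager.placement-constraints.handler property, e.g. set to placement-processor), after which an application master submits tagged, constrained requests. A sketch assuming the 3.1 AMRMClientAsync API, with client setup omitted:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceSizing;
    import org.apache.hadoop.yarn.api.records.SchedulingRequest;
    import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;

    public class SubmitConstrained {
        // Request five "storm" workers with at most five per node
        // (the cardinality row from the constraints table).
        static void requestStormWorkers(AMRMClientAsync<?> amRmClient) {
            amRmClient.addSchedulingRequests(Collections.singletonList(
                SchedulingRequest.newBuilder()
                    .allocationRequestId(2L)
                    .priority(Priority.newInstance(0))
                    .allocationTags(Collections.singleton("storm"))
                    .placementConstraintExpression(
                        cardinality(NODE, 0, 5, "storm").build())
                    .resourceSizing(ResourceSizing.newInstance(
                        5, Resource.newInstance(2048, 1)))
                    .build()));
        }
    }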
16. Evaluation
> What is the impact of cardinality constraints on LRA performance?
> What is the benefit of the two-scheduler design in terms of scheduling latency?
> What is the benefit of MEDEA in a large-scale environment?
17. Impact of cardinality on LRA performance
> TensorFlow ML workflow with 1M iterations using 32 workers, varying the maximum number of workers per node
[Chart: runtime (min) vs. max cardinality per node (1 to 32), in a 5% utilized and a 70% utilized cluster; cardinality 1 corresponds to anti-affinity and 32 to affinity. Anti-affinity incurs network overhead in the 5% utilized cluster, affinity incurs over-utilization in the 70% utilized one; the best intermediate cardinality is annotated as 34% and 42% better than the extremes]
Cardinality constraints are important; affinity and anti-affinity are not enough
20. Two-scheduler benefit
[Chart: scheduling latency (ms) vs. fraction of resources used by LRAs (0-100%), comparing an “Optimise All” single-scheduler baseline against MEDEA; MEDEA is up to 9.5x faster]
Expensive scheduling logic is used where it matters the most
23. Large scale deployment
> Pre-production cluster
– 400 machines on 10 racks
> Workloads
– 50 HBase instances (10 workers each)
– 45 TensorFlow instances (8 workers and 2 PS each)
– Batch production workload (50% of cluster resources)
> Constraints
– Containers of each LRA instance on the same rack
– No more than 2 HBase (4 for TensorFlow) containers on the same node
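Expressed in the constraint language above, each instance's rules compose from the same single constraint type. A sketch for one HBase instance, assuming (hypothetically) that the instance tags its containers with its own application ID so intra-instance rack affinity can target that tag:

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.and;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

    import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

    public class DeploymentConstraints {
        // "Containers of each LRA instance on the same rack" AND
        // "no more than 2 HBase containers on the same node".
        static PlacementConstraint forHBaseInstance(String instanceTag) {
            return and(
                targetIn(RACK, allocationTag(instanceTag)),
                cardinality(NODE, 0, 2, "hbase"))
                .build();
        }
    }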
25. Impact of MEDEA on task performance
[Chart: runtime (sec) of short tasks under four schedulers: YARN (no constraints), KUBE (affinity), KUBE++ (adds cardinality), and MEDEA (adds multiple containers per scheduling decision)]
MEDEA does not affect task-based job runtimes
26. Summary
Powerful constraints are required to unlock the full potential of LRAs!
MEDEA:
– Two-scheduler design
– Supports expressive and high-level constraints
– Does not impact latency of task-based jobs
Code in YARN release 3.1.0: https://github.com/apache/hadoop
Thank you! Questions?
Panagiotis Garefalakis, pg1712@imperial.ac.uk