MEDEA: Scheduling of Long Running Applications in Shared Production Clusters
EuroSys'18
https://lsds.doc.ic.ac.uk/sites/default/files/medea-eurosys18.pdf
1. MEDEA: Scheduling of Long Running Applications in Shared Production Clusters
Panagiotis Garefalakis, Imperial College London, pg1712@imperial.ac.uk
Konstantinos Karanasos, Microsoft, kokarana@microsoft.com
Peter Pietzuch, Imperial College London, prp@imperial.ac.uk
Arun Suresh, Microsoft, arsuresh@microsoft.com
Sriram Rao, Microsoft, sriramra@microsoft.com
2. Long-Running Applications (LRAs)
Richer applications in compute clusters:
> Short-running containers
• MapReduce, Scope, Tez
Shift towards long-running containers:
> Interactive data-intensive applications
• Spark, Hive LLAP
> Streaming systems
• Flink, Storm, SEEP
> Latency-sensitive applications
• HBase, Memcached
> ML frameworks
• TensorFlow, Spark ML-lib
LRAs = applications with long-running containers (running from hours to months)
3. LRAs in Microsoft’s analytics clusters
> Each cluster comprises tens of thousands of machines
> 10-100% of each cluster’s machines are used for LRAs
> Machines for LRAs are picked statically or randomly
[Chart: machines used for LRAs (%) across Microsoft’s analytics clusters]
LRA placement is important
4. LRA scheduling problem
> Performance: “Place Storm containers in the same rack as HBase”
> Cluster objectives: “Minimize resource fragmentation”
> Resilience: “Place HBase containers across upgrade domains”
[Figure: example cluster with racks r1 and r2, upgrade domains A and B, and nodes n1-n8 hosting MR, HBase, and Storm containers grouped into node groups]
Goal: Diligent placement through constraints
5. Machine unavailability in a Microsoft cluster
> Cluster organized in node groups
> Machines become unavailable in groups
> With random placement, an LRA might lose all containers at once
[Chart: unavailable machines (%) over 4 days, for the whole cluster (total), node group A, and upgrade domain A]
7. Challenges
> How to relate containers to node groups?
– Support high-level constraints
> How to express different types of constraints related to LRA containers?
– Support expressive constraints
> How to achieve high-quality placement without affecting task-based jobs?
– Two-scheduler design
8. MEDEA: LRA scheduling with expressive placement constraints
> How to relate containers to node groups?
– Support container tags and logical node groups
> How to express different types of constraints related to LRA containers?
– Introduce expressive cardinality constraints
> How to achieve high-quality placement without affecting task-based jobs?
– Follow a two-scheduler design
9. Container tagging
> Idea: use container tags to refer to groups of containers
– Tags describe:
• application type
• application role
• resource specification
• global application ID
– Can refer to any current or future LRA container
[Figure: nodes n1 (Storm) and n2 (HBase); the HBase container carries the tags KV, HBase master, memory critical, appID_1]
10. Hierarchical grouping of nodes
> Idea: logical node groups to refer to dynamic node sets
– E.g. node, rack, upgrade domain
– Associate nodes with all the container tags that live there
– Hide infrastructure: “spread across upgrade domains”
[Figure: container tags on n1 (Storm, nimbus, appID_2) and n2 (KV, HBase master, memory_critical, appID_1) become node tags, which aggregate into node group tags at the upgrade-domain level]
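This tag-and-group model is what later shipped as allocation tags in YARN 3.1, the release the talk targets. As a minimal sketch, assuming Hadoop 3.1's SchedulingRequest builder API and with illustrative tag values taken from the figure, an application attaches tags to a container request like this:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceSizing;
    import org.apache.hadoop.yarn.api.records.SchedulingRequest;

    public class TaggedRequests {
        // One HBase master container carrying the four kinds of tags from
        // the slide: application type, role, resource profile, and app ID.
        public static SchedulingRequest hbaseMaster() {
            Set<String> tags = new HashSet<>(Arrays.asList(
                "hbase", "hbase-master", "memory-critical", "appID_1"));
            return SchedulingRequest.newBuilder()
                .allocationRequestId(1L)
                .priority(Priority.newInstance(0))
                .allocationTags(tags)
                .resourceSizing(ResourceSizing.newInstance(
                    1, Resource.newInstance(4096, 2))) // 1 container, 4 GB, 2 cores
                .build();
        }
    }

Node and node group tags are derived by the system from the containers placed there, so applications only ever tag their own containers.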
11. Defining constraints
> Generic constraints to capture a variety of cases:

Source Tag | Node Group | Target Tag | Min cardinality | Max cardinality
storm      | rack       | hbase      | 1               | ∞
storm      | node       | storm      | 0               | 5

Min cardinality ≤ occurrences(Target Tag) ≤ Max cardinality

A single constraint type is sufficient!
> Affinity: “Place Storm containers in the same rack as HBase”
> Cardinality: “Place up to 5 Storm containers in the same node”
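Both table rows are instances of this one min/max-cardinality constraint type, and they map directly onto the PlacementConstraints helpers exposed in Hadoop 3.1. A sketch (targetIn is the shorthand for min cardinality 1 with an unbounded max):

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

    import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

    public class ConstraintExamples {
        // Row 1, affinity: the rack hosting the container must also hold
        // at least one "hbase" container (min = 1, max = unbounded).
        static final PlacementConstraint STORM_NEAR_HBASE =
            targetIn(RACK, allocationTag("hbase")).build();

        // Row 2, cardinality: at most 5 "storm" containers per node
        // (min = 0, max = 5).
        static final PlacementConstraint MAX_FIVE_STORM_PER_NODE =
            cardinality(NODE, 0, 5, "storm").build();
    }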
12. Two-scheduler design
> Idea: traditional scheduler for task-based jobs, optimization-based scheduler for LRAs
– Task-based Scheduler: fast task-based allocations
– LRA Scheduler: high-quality placement
[Figure: tasks go to the Task-based Scheduler; LRAs go through the LRA Interface to the LRA Scheduler, which runs the LRA scheduling algorithm; a Constraint Manager tracks constraints, container tags, and node groups over shared cluster state]
13. Two-scheduler design (cont.)
> ILP scheduling algorithm with flexible optimization goals
• Invoked at configurable intervals, considers multiple containers at once
Satisfy LRA constraints without affecting task-based jobs
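The talk does not spell the ILP out; as a hedged sketch of what a placement ILP of this shape typically looks like (notation mine, not the paper's), with binary variables x_{c,n} deciding whether container c lands on node n:

    \begin{align*}
    \min_{x}\ & f(x) && \text{flexible objective, e.g. minimize violations or fragmentation}\\
    \text{s.t.}\ & \textstyle\sum_{n} x_{c,n} = 1 && \text{for every container } c \text{ (placed exactly once)}\\
    & \textstyle\sum_{c} r_c\, x_{c,n} \le R_n && \text{for every node } n \text{ (capacity)}\\
    & m \le \textstyle\sum_{n \in g} \sum_{c \in T} x_{c,n} \le M && \text{for every node group } g \text{ (cardinality over target tag } T)\\
    & x_{c,n} \in \{0, 1\}
    \end{align*}

Because the solver runs at configurable intervals over batches of pending LRA containers, it can trade a little latency for placement quality without slowing the task-based path.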
15. Implementation
> Built as an extension to YARN; the code is part of release 3.1.0
> Introduced APIs for clients to specify constraints; MEDEA’s components were added to the Resource Manager
> YARN’s Capacity Scheduler was used as the task-based scheduler
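To exercise this from a stock Hadoop 3.1 deployment, the Resource Manager's constraint handler must be enabled (the yarn.resourcemanager.placement-constraints.handler property, e.g. set to placement-processor), after which an application master submits tagged, constrained requests. A sketch assuming the 3.1 AMRMClientAsync API, with client setup omitted:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceSizing;
    import org.apache.hadoop.yarn.api.records.SchedulingRequest;
    import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;

    public class SubmitConstrained {
        // Request five "storm" workers with at most five per node
        // (the cardinality row from the constraints table).
        static void requestStormWorkers(AMRMClientAsync<?> amRmClient) {
            amRmClient.addSchedulingRequests(Collections.singletonList(
                SchedulingRequest.newBuilder()
                    .allocationRequestId(2L)
                    .priority(Priority.newInstance(0))
                    .allocationTags(Collections.singleton("storm"))
                    .placementConstraintExpression(
                        cardinality(NODE, 0, 5, "storm").build())
                    .resourceSizing(ResourceSizing.newInstance(
                        5, Resource.newInstance(2048, 1)))
                    .build()));
        }
    }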
16. Evaluation
> What is the impact of cardinality constraints on LRA performance?
> What is the benefit of the two-scheduler design in terms of scheduling latency?
> What is the benefit of MEDEA in a large-scale environment?
17. Impact of cardinality on LRA performance
> TensorFlow ML workflow with 1M iterations using 32 workers, varying the maximum number of workers per node
[Chart: runtime (min) vs. max cardinality per node (1 to 32), in a 5% utilized and a 70% utilized cluster; cardinality 1 corresponds to anti-affinity and 32 to affinity. Anti-affinity incurs network overhead in the 5% utilized cluster, affinity incurs over-utilization in the 70% utilized one; the best intermediate cardinality is annotated as 34% and 42% better than the extremes]
Cardinality constraints are important; affinity and anti-affinity are not enough
20. Two-scheduler benefit
[Chart: scheduling latency (ms) vs. fraction of resources used by LRAs (0-100%), comparing an “Optimise All” single-scheduler baseline against MEDEA; MEDEA is up to 9.5x faster]
Expensive scheduling logic is used where it matters the most
23. Large scale deployment
> Pre-production cluster
– 400 machines on 10 racks
> Workloads
– 50 HBase instances (10 workers each)
– 45 TensorFlow instances (8 workers and 2 PS each)
– Batch production workload (50% of cluster resources)
> Constraints
– Containers of each LRA instance on the same rack
– No more than 2 HBase (4 for TensorFlow) containers on the same node
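Expressed in the constraint language above, each instance's rules compose from the same single constraint type. A sketch for one HBase instance, assuming (hypothetically) that the instance tags its containers with its own application ID so intra-instance rack affinity can target that tag:

    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.RACK;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.and;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetIn;
    import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

    import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

    public class DeploymentConstraints {
        // "Containers of each LRA instance on the same rack" AND
        // "no more than 2 HBase containers on the same node".
        static PlacementConstraint forHBaseInstance(String instanceTag) {
            return and(
                targetIn(RACK, allocationTag(instanceTag)),
                cardinality(NODE, 0, 2, "hbase"))
                .build();
        }
    }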
25. Impact of MEDEA on task performance
[Chart: runtime (sec) of short tasks under four schedulers: YARN (no constraints), KUBE (affinity), KUBE++ (adds cardinality), and MEDEA (adds multiple containers per scheduling decision)]
MEDEA does not affect task-based job runtimes
26. Summary
Powerful constraints are required to unlock the full potential of LRAs!
MEDEA:
– Two-scheduler design
– Supports expressive and high-level constraints
– Does not impact latency of task-based jobs
Code in YARN release 3.1.0: https://github.com/apache/hadoop
Thank you! Questions?
Panagiotis Garefalakis, pg1712@imperial.ac.uk