MEDEA
Scheduling of Long Running Applications
in Shared Production Clusters
Panagiotis Garefalakis
Imperial College London
pg1712@imperial.ac.uk
Konstantinos Karanasos
Microsoft
kokarana@microsoft.com
Peter Pietzuch
Imperial College London
prp@imperial.ac.uk
Arun Suresh
Microsoft
arsuresh@microsoft.com
Sriram Rao
Microsoft
sriramra@microsoft.com
Long-Running Applications (LRAs)
Richer applications
in compute clusters
> Interactive data-intensive applications
• Spark, Hive LLAP
> Streaming systems
• Flink, Storm, SEEP
> Latency-sensitive applications
• HBase, Memcached
> ML frameworks
• TensorFlow, Spark ML-lib
Panagiotis Garefalakis - Imperial College London 2
LRAs = applications with long-running containers
(running from hours to months)
Shift towards long
running containers
> Short-running containers
• MapReduce, Scope, Tez
> Each cluster comprises tens of thousands of machines
> 10-100% of each cluster’s machines used for LRAs
> Machines for LRAs are picked statically or randomly
[Figure: LRAs in Microsoft’s analytics clusters; y-axis: Machines used for LRAs (%), 0-100]
LRA placement is important
> Performance: “Place Storm containers in the same rack
as HBase”
> Cluster objectives: “Minimize resource fragmentation”
> Resilience: “Place HBase containers across upgrade
domains”
LRA scheduling problem
[Figure: nodes n1-n8 organized into node groups (racks r1 and r2, upgrade domains A and B), hosting MR, HBase and Storm containers]
Goal: Diligent placement through constraints
Machine unavailability in a Microsoft cluster
> Cluster organized in node groups
> Machines become unavailable in groups
> With random placement, an LRA might lose all containers at once
[Figure: Unavailable machines (%) over Days (0-4), two panels (0-10 and 1-100 scales); series: total, node group A, upgrade domain A]
> How to relate containers to node groups?
Support high level constraints
> How to express different types of constraints
related to LRA containers?
Support expressive constraints
> How to achieve high quality placement without
affecting task-based jobs?
Two-scheduler design
Challenges
MEDEA
LRA scheduling with expressive placement constraints
Support container tags and logical node groups
Introduce expressive cardinality constraints
Follow a two-scheduler design
> Idea: use container tags to refer to groups of containers
– Describe
• application type
• application role
• resource specification
• global application ID
– Can refer to any current or future LRA container
Container tagging
[Figure: nodes n1 and n2 hosting Storm and HBase containers; container tags shown: KV, HBase master, memory critical, appID_1]
– E.g. node, rack, upgrade domain
– Associate nodes with all the container tags that live there
– Hide infrastructure: “spread across upgrade domains”
Hierarchical grouping of nodes
[Figure: nodes n1 and n2 inside an upgrade domain, hosting Storm and HBase containers; node tags shown: {Storm, nimbus, appID_2} and {KV, HBase master, memory_critical, appID_1}; node group tags shown for the upgrade domain; container tags shown: KV, HBase master, memory critical, appID_1]
> Idea: logical node groups to refer
to dynamic node sets
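A minimal sketch of how node tags and node group tags could be derived from container tags, as the bullets above describe. The function names, the use of per-tag counts, and the example placement of containers on n1 and n2 are assumptions for illustration and are not taken from MEDEA's code.

```python
from collections import Counter

def node_tag_counts(placements):
    """Associate each node with the tags of all containers living on it.

    `placements` is a list of (node, container_tags) pairs, one per container.
    Counts are kept per tag so that cardinalities can be checked later.
    """
    counts = {}
    for node, container_tags in placements:
        counts.setdefault(node, Counter()).update(container_tags)
    return counts

def group_tag_counts(per_node, members):
    """Aggregate the tag counts of all member nodes of one logical node group."""
    total = Counter()
    for node in members:
        total.update(per_node.get(node, Counter()))
    return total

# Hypothetical placement: one Storm container on n1, one HBase container on n2.
placements = [
    ("n1", {"storm", "nimbus", "appID_2"}),
    ("n2", {"KV", "hbase_master", "memory_critical", "appID_1"}),
]
per_node = node_tag_counts(placements)
upgrade_domain = group_tag_counts(per_node, ["n1", "n2"])
rack = group_tag_counts(per_node, ["n1"])
```

Because group tags are recomputed from the current placements, a group such as an upgrade domain can change its membership without any constraint that mentions it being rewritten.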
> Generic constraints to capture a variety of cases
Source Tag | Node Group | Target Tag | Min cardinality | Max cardinality
storm      | rack       | hbase      | 1               | ∞
storm      | node       | storm      | 0               | 5
Defining constraints
Min cardinality ≤ occurrences (Target Tag) ≤ Max cardinality
A single constraint type is sufficient!
> Affinity “Place Storm containers in the same rack as HBase”
> Cardinality “Place up to 5 Storm containers in the same node”
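The single constraint type can be sketched as a small check over tag counts. This is an illustrative sketch, not MEDEA's API: the `Constraint` class and `satisfies` function are hypothetical names, and counting the container being placed towards the occurrences is an assumption chosen to match the "up to 5 Storm containers" reading above.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    source_tag: str   # containers the constraint applies to
    node_group: str   # group type, e.g. "node" or "rack"
    target_tag: str   # tag whose occurrences are counted in the group
    min_card: float
    max_card: float

def satisfies(constraint, new_tags, group_counts):
    """Check min <= occurrences(target tag) <= max for one candidate group.

    `new_tags` are the tags of the container being placed; `group_counts`
    is a Counter of the tags already present in the candidate node group.
    """
    if constraint.source_tag not in new_tags:
        return True  # the constraint does not apply to this container
    occurrences = group_counts[constraint.target_tag]
    if constraint.target_tag in new_tags:
        occurrences += 1  # assumption: the new container counts too
    return constraint.min_card <= occurrences <= constraint.max_card

# The two rows of the table above.
affinity = Constraint("storm", "rack", "hbase", 1, math.inf)
cardinality = Constraint("storm", "node", "storm", 0, 5)

rack_with_hbase = Counter({"hbase": 2, "storm": 1})
rack_without_hbase = Counter({"storm": 3})
node_with_4_storm = Counter({"storm": 4})
node_with_5_storm = Counter({"storm": 5})
```

Under the same assumption, anti-affinity would be the special case with a maximum cardinality of 1 for the container's own tag.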
Two-scheduler design
> Idea: traditional scheduler for task-based jobs, optimization-based scheduler for LRAs
> ILP scheduling algorithm with flexible optimization goals
• Invoked at configurable intervals, considers multiple containers
[Figure: two-scheduler architecture; components: Task-based Scheduler, LRA Interface, LRA Scheduler, LRA scheduling algorithm, Constraint Manager (Constraints, Container Tags, Node Groups), Cluster State; labels: Tasks, LRAs, Allocation, "Fast task-based allocations", "High quality placement"]
Satisfy LRA constraints without affecting
task-based jobs
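To illustrate why considering multiple containers at once matters, the sketch below places a whole batch together. It is not an ILP formulation: it swaps in an exhaustive search that is only practical for tiny inputs, and the function name, the capacity model, and the "fewest nodes used" objective (a stand-in for a fragmentation goal) are all assumptions.

```python
from itertools import product

def place_batch(containers, nodes, capacity, max_per_node):
    """Search all assignments of a batch of containers to nodes.

    containers:   list of (tag, resource_demand) pairs
    capacity:     dict node -> free resources
    max_per_node: dict tag -> max containers with that tag on one node
    Returns a tuple of nodes (one per container) that respects capacity
    and cardinality limits and uses the fewest nodes, or None.
    """
    best, best_cost = None, None
    for assignment in product(nodes, repeat=len(containers)):
        load = {n: 0 for n in nodes}
        per_tag = {}
        feasible = True
        for (tag, demand), node in zip(containers, assignment):
            load[node] += demand
            per_tag[(node, tag)] = per_tag.get((node, tag), 0) + 1
            limit = max_per_node.get(tag, len(containers))
            if load[node] > capacity[node] or per_tag[(node, tag)] > limit:
                feasible = False
                break
        if not feasible:
            continue
        cost = len(set(assignment))  # number of distinct nodes used
        if best_cost is None or cost < best_cost:
            best, best_cost = assignment, cost
    return best

# Hypothetical batch: three Storm containers and one larger HBase container.
batch = [("storm", 1)] * 3 + [("hbase", 2)]
nodes = ["n1", "n2", "n3"]
capacity = {"n1": 4, "n2": 4, "n3": 4}
result = place_batch(batch, nodes, capacity, {"storm": 2})
```

Placing the containers one at a time could strand the last one on a third node; evaluating the batch jointly lets the search trade off the cardinality limit against packing. The search is exponential in the batch size, which is one reason to reserve such logic for LRAs rather than short tasks.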
> Built as an extension to YARN; code is part of release 3.1.0
> Introduced APIs for clients to specify constraints; MEDEA’s components added to the Resource Manager
> YARN’s Capacity Scheduler was used as the task-based scheduler
Implementation
> What is the impact of cardinality constraints on LRA performance?
> What is the benefit of the two-scheduler design in terms of scheduling latency?
> What is the benefit of MEDEA in a large-scale environment?
Evaluation
Impact of cardinality in LRA performance
> TensorFlow ML workflow with 1M iterations using 32 workers with varying workers per node
[Figure: Runtime (min) vs. max cardinality per node (1, 2, 4, 8, 16, 32); series: 5% utilized cluster, 70% utilized cluster; x-axis extremes labeled Anti-affinity and Affinity; annotations: 34%, 42%, Network overhead, Over-utilization]
Cardinality constraints are important
Affinity and anti-affinity are not enough
Two-scheduler benefit
[Figure: Scheduling latency (ms) vs. fraction of resources used by LRAs (0-100%); series: Optimise All, MEDEA; annotation: 9.5x]
Expensive scheduling logic is used where it
matters the most
> Pre-production cluster
– 400 machines on 10 racks
> Workloads
– 50 HBase instances (10 workers each)
– 45 TensorFlow instances (8 workers and 2 PS each)
– Batch production workload (50% of cluster resources)
> Constraints
– Containers of each LRA instance on the same rack
– No more than 2 HBase (4 for TensorFlow) containers on the same node
Large scale deployment
Impact of MEDEA in LRA performance
[Figure: TensorFlow runtime (min), showing 5th percentile, median and 99th percentile, for four configurations: no-constraints (YARN), affinity (Kubernetes), + cardinality (Kubernetes enhanced), + multiple containers (= MEDEA); annotations: 58%, 54%]
MEDEA improves performance and predictability of LRAs
Impact of MEDEA in Task performance
[Figure: Runtime (sec) of short tasks for YARN (no-constraints), KUBE (affinity), KUBE++ (+ cardinality) and MEDEA (+ multiple containers)]
MEDEA does not affect task-based job
runtimes
Powerful constraints are required to
unlock the full potential of LRAs!
MEDEA
– Two-scheduler design
– Support expressive & high level constraints
– Does not impact latency of task-based jobs
Thank you!
Questions?
Panagiotis Garefalakis
pg1712@imperial.ac.uk
YARN r3.1.0
https://github.com/apache/hadoop
Summary

Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Medea: Scheduling of Long Running Applications in Shared Production Clusters

  • 1. MEDEA: Scheduling of Long Running Applications in Shared Production Clusters. Panagiotis Garefalakis (Imperial College London, pg1712@imperial.ac.uk), Konstantinos Karanasos (Microsoft, kokarana@microsoft.com), Peter Pietzuch (Imperial College London, prp@imperial.ac.uk), Arun Suresh (Microsoft, arsuresh@microsoft.com), Sriram Rao (Microsoft, sriramra@microsoft.com)
  • 2. Long-Running Applications (LRAs): richer applications in compute clusters. Short-running containers: MapReduce, Scope, Tez. Shift towards long-running containers: > Interactive data-intensive applications (Spark, Hive LLAP) > Streaming systems (Flink, Storm, SEEP) > Latency-sensitive applications (HBase, Memcached) > ML frameworks (TensorFlow, Spark MLlib). LRAs = applications with long-running containers (running from hours to months). Panagiotis Garefalakis - Imperial College London
  • 3. LRA placement is important. LRAs in Microsoft’s analytics clusters: > Each cluster comprises tens of thousands of machines > 10-100% of each cluster’s machines are used for LRAs > Machines for LRAs are picked statically or randomly. [Figure: machines used for LRAs (%) in Microsoft’s analytics clusters]
  • 4. LRA scheduling problem. > Performance: “Place Storm containers in the same rack as HBase” > Cluster objectives: “Minimize resource fragmentation” > Resilience: “Place HBase containers across upgrade domains”. Goal: diligent placement through constraints. [Figure: node groups; nodes n1 to n8 organized in racks r1 and r2 and upgrade domains A and B, running MR, HBase and Storm containers]
  • 5. Machine unavailability in a Microsoft cluster. > The cluster is organized in node groups > Machines become unavailable in groups > With random placement, an LRA might lose all of its containers at once. [Figure: unavailable machines (%) over days 0 to 4; series: total, node group A]
  • 6. Machine unavailability in a Microsoft cluster. > The cluster is organized in node groups > Machines become unavailable in groups > With random placement, an LRA might lose all of its containers at once. [Figure: unavailable machines (%) over days 0 to 4; series: total, node group A, upgrade domain A]
  • 7. Challenges. > How to relate containers to node groups? Support high-level constraints. > How to express different types of constraints related to LRA containers? Support expressive constraints. > How to achieve high-quality placement without affecting task-based jobs? Two-scheduler design.
  • 8. MEDEA: LRA scheduling with expressive placement constraints. > How to relate containers to node groups? Support container tags and logical node groups. > How to express different types of constraints related to LRA containers? Introduce expressive cardinality constraints. > How to achieve high-quality placement without affecting task-based jobs? Follow a two-scheduler design.
  • 9. Container tagging. > Idea: use container tags to refer to groups of containers – Tags describe: application type, application role, resource specification, global application ID – Tags can refer to any current or future LRA container. [Figure: Storm and HBase containers on nodes n1 and n2; container tags shown: KV, HBase master, memory_critical, appID_1]
  • 10. Hierarchical grouping of nodes. > Idea: logical node groups to refer to dynamic node sets – E.g. node, rack, upgrade domain – Associate nodes with all the container tags that live there – Hide infrastructure: “spread across upgrade domains”. [Figure: node n1 runs a Storm container (node tags: Storm, nimbus, appID_2); node n2 runs an HBase container (node tags: KV, HBase master, memory_critical, appID_1); the upgrade domain carries the node group tags]
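The tagging and grouping idea from the two slides above can be sketched in a few lines of Python. This is only an illustration, not MEDEA's or YARN's data model: the names `NODE_TO_DOMAIN`, `CONTAINERS`, `node_tags` and `group_tags`, the upgrade-domain identifiers, and the exact tag spellings are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical layout: which upgrade domain each node belongs to.
NODE_TO_DOMAIN = {"n1": "ud_A", "n2": "ud_A", "n3": "ud_B"}

# Hypothetical running containers as (node, container tags).
CONTAINERS = [
    ("n1", {"storm", "nimbus", "appID_2"}),
    ("n2", {"KV", "hbase", "master", "memory_critical", "appID_1"}),
]

def node_tags(containers):
    """Associate each node with the tags of all containers living on it."""
    tags = {}
    for node, container_tags in containers:
        tags.setdefault(node, Counter()).update(container_tags)
    return tags

def group_tags(containers, node_to_group):
    """Roll node tags up to a logical node group such as an upgrade domain."""
    tags = {}
    for node, counter in node_tags(containers).items():
        tags.setdefault(node_to_group[node], Counter()).update(counter)
    return tags
```

Keeping counts rather than plain sets means a group-level tag count can later be compared against a cardinality bound without revisiting individual containers.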
  • 11. Defining constraints. > Generic constraints to capture a variety of cases: Min cardinality ≤ occurrences (Target Tag) ≤ Max cardinality. A single constraint type is sufficient!
        Source Tag | Node Group | Target Tag | Min cardinality | Max cardinality
        storm      | rack       | hbase      | 1               | ∞
        storm      | node       | storm      | 0               | 5
    > Affinity (first row): “Place Storm containers in the same rack as HBase” > Cardinality (second row): “Place up to 5 Storm containers in the same node”
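A minimal sketch of how the single constraint type above could be checked against a placement. The `Constraint` class, the `violations` function and the example cluster are hypothetical and are not MEDEA's API. One assumption to note: occurrences are counted after placement and include the source container itself when the source and target tags coincide.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    source_tag: str   # containers the constraint applies to
    node_group: str   # scope: "node", "rack", ...
    target_tag: str   # tag whose occurrences are counted in that scope
    min_card: float
    max_card: float

# The two rows of the table on the slide.
AFFINITY = Constraint("storm", "rack", "hbase", 1, math.inf)
CARDINALITY = Constraint("storm", "node", "storm", 0, 5)

def violations(constraint, placement, tags, groups):
    """Return ids of source-tagged containers whose node group breaks the bounds.

    placement: container id -> node
    tags:      container id -> set of tags
    groups:    node group kind -> {node -> group id}
    """
    group_of = groups[constraint.node_group]
    # Count target-tag occurrences per group (assumption: self is included).
    counts = {}
    for cid, node in placement.items():
        if constraint.target_tag in tags[cid]:
            counts[group_of[node]] = counts.get(group_of[node], 0) + 1
    bad = []
    for cid, node in placement.items():
        if constraint.source_tag in tags[cid]:
            seen = counts.get(group_of[node], 0)
            if not (constraint.min_card <= seen <= constraint.max_card):
                bad.append(cid)
    return bad

# Hypothetical three-node cluster: n1 and n2 share rack r1, n3 is in r2.
GROUPS = {
    "node": {"n1": "n1", "n2": "n2", "n3": "n3"},
    "rack": {"n1": "r1", "n2": "r1", "n3": "r2"},
}
TAGS = {"s1": {"storm"}, "h1": {"hbase"}, "s2": {"storm"}}
PLACEMENT = {"s1": "n1", "h1": "n2", "s2": "n3"}
```

With this placement, `s2` sits in a rack without any HBase container, so it is the only container that breaks the affinity row, while the cardinality row holds for every node.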
  • 12. Two-scheduler design. > Idea: traditional scheduler for task-based jobs, optimization-based scheduler for LRAs. [Figure components: Tasks; LRAs; Task-based Scheduler (fast task-based allocations); LRA Interface; LRA Scheduler with the LRA scheduling algorithm (high quality placement); Constraint Manager (constraints, container tags, node groups); Cluster State]
  • 13. Two-scheduler design. > ILP scheduling algorithm with flexible optimization goals • Invoked at configurable intervals, considers multiple containers. [Figure components: Tasks; LRAs; Task-based Scheduler (fast task-based allocations); LRA Interface; LRA Scheduler with the LRA scheduling algorithm (high quality placement); Allocation; Constraint Manager (constraints, container tags, node groups); Cluster State]
  • 14. Two-scheduler design. > ILP scheduling algorithm with flexible optimization goals • Invoked at configurable intervals, considers multiple containers. Satisfy LRA constraints without affecting task-based jobs. [Figure components: Tasks; LRAs; Task-based Scheduler (fast task-based allocations); LRA Interface; LRA Scheduler with the LRA scheduling algorithm (high quality placement); Allocation; Constraint Manager (constraints, container tags, node groups); Cluster State]
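The split between a fast path for tasks and a batched, optimizing path for LRAs can be illustrated with a toy Python sketch. It does not reproduce MEDEA's ILP: exhaustive search over a tiny cluster stands in for the solver, and the objective (use as few nodes as possible), the slot-based capacity model, and the names `place_task` and `place_lra_batch` are all assumptions made for the example.

```python
from itertools import product

def place_task(free_slots):
    """Fast path: give a task the first node with a free slot."""
    for node, free in free_slots.items():
        if free > 0:
            free_slots[node] -= 1
            return node
    return None  # cluster is full

def place_lra_batch(containers, free_slots, max_per_node):
    """Slow path: place all queued LRA containers together.

    Exhaustive search (a stand-in for an ILP solver) over every assignment,
    subject to node capacity and a per-node cardinality cap, minimising the
    number of nodes used (a hypothetical objective).
    """
    nodes = list(free_slots)
    best = None
    for assignment in product(nodes, repeat=len(containers)):
        load = {n: assignment.count(n) for n in nodes}
        if any(load[n] > min(free_slots[n], max_per_node) for n in nodes):
            continue  # violates capacity or the cardinality cap
        used = sum(1 for n in nodes if load[n] > 0)
        if best is None or used < best[0]:
            best = (used, assignment)
    if best is None:
        return None  # no feasible placement for this batch
    for node in best[1]:
        free_slots[node] -= 1
    return dict(zip(containers, best[1]))
```

A caller would queue LRA requests and invoke `place_lra_batch` at an interval, while every task request goes straight to `place_task`; both update the same `free_slots` dictionary, which plays the role of the shared cluster state. The search is exponential in the batch size, so it is only usable for illustration.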
  • 15. Implementation. > Built as an extension to YARN; the code is part of the current release (3.1.0) > Introduced APIs for clients to specify constraints; MEDEA’s components were added to the Resource Manager > YARN’s Capacity Scheduler was used as the task-based scheduler
  • 16. Evaluation. > What is the impact of cardinality constraints on LRA performance? > What is the benefit of the two-scheduler design in terms of scheduling latency? > What is the benefit of MEDEA in a large-scale environment?
  • 17. Impact of cardinality on LRA performance. > TensorFlow ML workflow with 1M iterations using 32 workers, with a varying number of workers per node. [Figure: runtime (min) vs. max cardinality per node (1, 2, 4, 8, 16, 32); series: low utilized cluster, high utilized cluster; annotations: Anti-affinity, Affinity]
  • 18. Impact of cardinality on LRA performance. > TensorFlow ML workflow with 1M iterations using 32 workers, with a varying number of workers per node. Cardinality constraints are important: affinity and anti-affinity are not enough. [Figure: runtime (min) vs. max cardinality per node (1, 2, 4, 8, 16, 32); series: 5% utilized cluster, 70% utilized cluster; annotations: Anti-affinity, Affinity]
  • 19. Impact of cardinality on LRA performance. Cardinality constraints are important: affinity and anti-affinity are not enough. [Figure: runtime (min) vs. max cardinality per node (1, 2, 4, 8, 16, 32); series: 5% utilized cluster, 70% utilized cluster; annotations: Anti-affinity, Affinity, 34%, 42%, Over-utilization, Network overhead]
  • 20. Two-scheduler benefit. [Figure: scheduling latency (ms) vs. fraction of resources used by LRAs (%); series: Optimise All, MEDEA]
  • 22. Two-scheduler benefit. Expensive scheduling logic is used where it matters the most. [Figure: scheduling latency (ms) vs. fraction of resources used by LRAs (%); series: Optimise All, MEDEA; annotation: 9.5x]
  • 23. Large scale deployment. > Pre-production cluster – 400 machines on 10 racks > Workloads – 50 HBase instances (10 workers each) – 45 TensorFlow instances (8 workers and 2 PS each) – Batch production workload (50% of cluster resources) > Constraints – Containers of each LRA instance on the same rack – No more than 2 HBase (4 for TensorFlow) containers on the same node
  • 24. Impact of MEDEA on LRA performance. MEDEA improves performance and predictability of LRAs. [Figure: TensorFlow runtime (min), shown as 5th, median and 99th, for: no-constraints (YARN); affinity (Kubernetes); + cardinality (Kubernetes enhanced); + multiple containers = MEDEA; annotations: 58%, 54%]
  • 25. Impact of MEDEA on task performance. MEDEA does not affect task-based job runtimes. [Figure: runtime (sec) of short tasks for: YARN (no-constraints); KUBE (affinity); KUBE++ (+ cardinality); MEDEA (+ multiple containers)]
  • 26. Summary. Powerful constraints are required to unlock the full potential of LRAs! MEDEA: – Two-scheduler design – Supports expressive and high-level constraints – Does not impact latency of task-based jobs. Code: YARN r3.1.0, https://github.com/apache/hadoop. Thank you! Questions? Panagiotis Garefalakis, pg1712@imperial.ac.uk