YARN Federation
(YARN-2915)
Subru Krishnan, Kishore Chaliparambil,
Carlo Curino, and Giovanni Fumarola
Microsoft
Who are we?
Large team:
• Cloud and Information Services Lab (CISL)
• Applied research group in large-scale systems and machine learning
• BigData Resource Management team
• Design, build and operate Microsoft’s big data infrastructure
Agenda
• YARN @MS
• Federation Architecture
• Policy space
• Demo
YARN @MS
Familiar Challenges:
• Diverse workloads (batch, interactive, services,…)
• Support for production SLAs
• ROI on cluster investments (utilization)
Special Challenges:
• Leverage existing strong infrastructure (Cosmos/Scope/REEF/Azure)
• Enable all OSS technologies
• Scale of first-party clusters (each can exceed 50k nodes)
• Public Cloud (security, number of tenants, service integration…)
Big Bet: Unified Resource Management through YARN (OSS) + Azure
YARN @MS: Innovate and Contribute
Problems
• Lack of SLAs for production jobs
• High utilization for a broad range of workloads
• YARN scalability
• Private cloud (from disjoint clusters)
• Cross-DC?
Our Solution…
• Rayon: resource reservation framework (YARN-1051)
• Mercury: introduce container types and node-level queueing (YARN-2877)
• Federation: “federate” multiple YARN clusters (YARN-2915)
YARN Federation in Apache
• Umbrella JIRA: YARN-2915
• Includes detailed design proposal and e2e patch
• Federation branch created and API patches posted
• You are welcome to join and contribute 
• Thanks: Wangda, Karthik, Vinod, Jian….
Next
YARN Federation Architecture
by Kishore Chaliparambil
YARN Federation
• Enables applications to scale to hundreds of thousands of nodes
• The YARN Resource Manager (RM) is a single instance
• Scalability of the RM is affected by
  • Cardinality: |nodes|, |apps|, |tasks|
  • Frequency: NM and AM heartbeat intervals, task duration (see the back-of-envelope sketch below)
• YARN is battle-tested on 4-8k nodes
• @Microsoft: >50k node clusters, short-lived tasks
• So how does federation work?
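A back-of-envelope sketch of the heartbeat load a single RM must absorb; the node count is from this deck, while the heartbeat intervals and concurrent-app count are illustrative assumptions:

```java
// Minimal sketch: heartbeat traffic a single RM must handle at this scale.
// Node count matches the deck (>50k); heartbeat intervals and the number of
// concurrent applications are illustrative assumptions, not measured values.
public class RmLoadEstimate {
    public static void main(String[] args) {
        long nodes = 50_000;           // first-party clusters can exceed 50k nodes
        long runningApps = 5_000;      // hypothetical number of concurrent AMs
        double nmHeartbeatSec = 1.0;   // assumed NM -> RM heartbeat interval
        double amHeartbeatSec = 1.0;   // assumed AM -> RM allocate interval

        double heartbeatsPerSec = nodes / nmHeartbeatSec + runningApps / amHeartbeatSec;
        System.out.printf("A single RM would see ~%.0f heartbeats/sec "
            + "before doing any scheduling work.%n", heartbeatsPerSec);
    }
}
```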
Federation Architecture
[Architecture diagram: YARN clients and AMs talk to the Federation Services (a Router Service and a per-node AM RM Proxy Service, backed by Policy and State stores), which front multiple YARN sub-clusters (#1, #2, #3), each with its own RM running tasks on servers in the datacenter; flows shown: Submit App, Start Containers.]
• Router Service
  • Implements the Client-RM protocol
  • Stateless, scalable service; multiple instances behind a load balancer
• AM RM Proxy Service
  • Implements the AM-RM protocol
  • Hosted in the NM; intercepts all AM-RM communications
• State Store
  • Centralized, highly-available repository
  • RDBMS, Zookeeper, HDFS,…
• Sub-clusters are unmodified standalone YARN clusters of about 6K nodes
• Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level
• No code change in any application
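A minimal sketch of what that centralized repository tracks, with hypothetical names (the real state-store interface may differ): sub-cluster membership plus the application-to-home-sub-cluster mapping that the Router writes at submission time and the AM RM Proxy reads.

```java
import java.util.List;

// Hypothetical sketch, not the Hadoop API: the two pieces of state the
// federation services share through the central store.
interface FederationStateSketch {

    // Sub-cluster membership, kept fresh via RM heartbeats.
    void registerSubCluster(String subClusterId, String rmAddress);
    void heartbeat(String subClusterId, String capabilityReport);
    List<String> getActiveSubClusters();

    // Where each application "lives" (its home sub-cluster).
    void setApplicationHomeSubCluster(String applicationId, String subClusterId);
    String getApplicationHomeSubCluster(String applicationId);
}
```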
AM RM Proxy Service Internals
[Diagram: inside the Node Manager, the AM RM Proxy Service runs a per-application pipeline (interceptor chain): Security/Throttling Interceptor, Federation Interceptor, …, Home RM Proxy. The Federation Interceptor reaches the home RM (SC #1) directly and the other sub-cluster RMs (SC #2, SC #3) through Unmanaged AMs.]
• Hosted in NM
• Extensible design (see the interceptor-chain sketch below)
• DDoS prevention
• Unmanaged AMs are used for container negotiation; they are created on demand based on policy
• Code committed to 2.8
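A minimal sketch of the interceptor-chain idea, with illustrative names rather than the actual Hadoop classes:

```java
// Every AM allocate call flows through a per-application pipeline that ends at
// the proxy for the home RM; the federation step can fan out to other
// sub-clusters via on-demand Unmanaged AMs. All names here are hypothetical.
interface AllocateInterceptorSketch {
    void setNext(AllocateInterceptorSketch next);
    String allocate(String allocateRequest); // request/response kept as strings for brevity
}

class ThrottlingInterceptorSketch implements AllocateInterceptorSketch {
    private AllocateInterceptorSketch next;
    public void setNext(AllocateInterceptorSketch next) { this.next = next; }
    public String allocate(String request) {
        // DDoS prevention: reject or delay misbehaving AMs here, then forward.
        return next.allocate(request);
    }
}

class FederationInterceptorSketch implements AllocateInterceptorSketch {
    private AllocateInterceptorSketch next; // chain ends at the home-RM proxy
    public void setNext(AllocateInterceptorSketch next) { this.next = next; }
    public String allocate(String request) {
        // Policy decides which ResourceRequests stay on the home RM and which
        // are sent to other sub-clusters through lazily created Unmanaged AMs;
        // the partial responses are merged into a single answer for the AM.
        return next.allocate(request);
    }
}
```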
Next
Federation Policies
by Carlo Curino
Federation: Policy Engine
[Diagram: the Policy Engine is driven through the Federation Admin APIs.]
Flexible policies
• Manually curated (to start)
• Automatically generated (later)
General enforcement mechanisms:
• Router
• AMRMProxy
• RM Schedulers
Federation Policies
Goal: efficiently operate a federated cluster
• Complex trade-offs: load balancing, scaling, global invariants (fairness), tenant isolation, fault-tolerance,…
Policies
• Input: user, reservation, queue, node labels, ResourceRequest, …
• State information: sub-cluster load, planned maintenance,…
• Output: routing/scheduling decisions (that determine all container allocations); see the interface sketch below
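As a rough sketch of these inputs and outputs, two hypothetical interfaces (not the Hadoop API) for the Router-side and AMRMProxy-side decisions:

```java
import java.util.List;
import java.util.Map;

// The Router picks a home sub-cluster at submission time; the AM RM Proxy
// decides which sub-cluster(s) receive each ResourceRequest while the app runs.
interface RouterPolicySketch {
    // Inputs: submission context plus sub-cluster state (e.g. a load score);
    // output: the home sub-cluster id.
    String chooseHomeSubCluster(String user, String queue,
                                Map<String, Double> subClusterLoad);
}

interface AmRmProxyPolicySketch {
    // Inputs: one resource ask (location + size) plus sub-cluster state;
    // output: the sub-cluster(s) to forward it to (possibly several).
    List<String> splitResourceRequest(String resourceName, int numContainers,
                                      Map<String, Double> subClusterLoad);
}
```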
Tackling hard problems with policies
A hard problem:
How to transparently enable “global queues” via “local enforcement”?
[Diagram: a global queue structure (root R at 100%; queues A, B, C, D at 25% each; A split into A1 at 40% and A2 at 60%) must be enforced locally across sub-clusters SC1–SC4, each marked with a “?”.]
Spectrum of options: Full Partitioning
Policies: Router and AMRMProxy direct each application to a single RM
Pros: perfect scale-out, isolation
Cons: fragmentation/utilization issues, limits maximum job size, uneven impact of SC failures,…
[Diagram: the global queue tree (shown for reference) is partitioned across SC1–SC4, each sub-cluster hosting a different subset of the queues (A with A1/A2, B, C, D), rescaled to 100% of that sub-cluster.]
Spectrum of options: Full Replication
Policies: Router (round-robin/random), and AMRMProxy forwards to RMs based on locality of the ResourceRequest
Pros: simple, symmetric, fair (if all jobs broadcast demand), resilient
Cons: scalability in #jobs, … → heuristic improvements
[Diagram: the full global queue tree (R at 100%; A, B, C, D at 25% each; A1 at 40% and A2 at 60% under A) is replicated identically on SC1–SC4.]
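As a rough illustration of how the AMRMProxy could forward asks under these replication policies, a hypothetical helper (names and inventory mapping are assumptions, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Node/rack-local asks go to the sub-cluster that owns that location, while
// location-agnostic (ANY, "*") asks are replicated to a configured set of
// sub-clusters so any of them can satisfy the demand.
class LocalityAwareForwarderSketch {
    private final Map<String, String> locationToSubCluster; // host/rack -> sub-cluster id
    private final List<String> broadcastTargets;            // sub-clusters that receive ANY asks

    LocalityAwareForwarderSketch(Map<String, String> locationToSubCluster,
                                 List<String> broadcastTargets) {
        this.locationToSubCluster = locationToSubCluster;
        this.broadcastTargets = broadcastTargets;
    }

    // Returns the sub-clusters that should receive this ResourceRequest.
    List<String> forward(String resourceName) {
        String owner = locationToSubCluster.get(resourceName);
        if (owner != null) {
            return List.of(owner);                 // honor locality: single sub-cluster
        }
        return new ArrayList<>(broadcastTargets);  // ANY ask: replicate to the subset
    }
}
```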
Spectrum of options: Dynamic Partial Replication
Policies: Router (round-robin/random on a subset of RMs), and AMRMProxy forwards to RMs based on locality of the ResourceRequest (on a subset of RMs)
Pros: trade-off between the advantages of replication and partitioning
Cons: complexity / rebalancing → could use a dynamic approach
[Diagram: the global queue tree (shown for reference) is only partially replicated; each of SC1–SC4 hosts a different subset of queues A, B, C, D, A1, A2 with rescaled shares.]
Demo
• Show a basic job running across sub-clusters
• Show some UIs and ops commands
• Showcase a user-based, partially replicated routing policy
  • Router: random-weighted among a set of sub-clusters (see the sketch below)…
  • AMRMProxy: broadcast request to a set of sub-clusters…
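A minimal sketch of that random-weighted routing choice; the sub-cluster names and weights below are illustrative and would normally come from the policy configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Picks a home sub-cluster with probability proportional to its weight.
public class WeightedRandomRouterSketch {
    private final Map<String, Double> weights;
    private final Random rnd = new Random();

    public WeightedRandomRouterSketch(Map<String, Double> subClusterWeights) {
        this.weights = new LinkedHashMap<>(subClusterWeights);
    }

    public String chooseHomeSubCluster() {
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        double draw = rnd.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            draw -= e.getValue();
            if (draw <= 0) {
                return last;
            }
        }
        return last; // floating-point edge case: fall back to the last entry
    }

    public static void main(String[] args) {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("SC-1", 0.5);  // half of this user's apps land here (illustrative)
        w.put("SC-2", 0.3);
        w.put("SC-3", 0.2);
        System.out.println(new WeightedRandomRouterSketch(w).chooseHomeSubCluster());
    }
}
```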
Next
YARN Federation Demo
by Giovanni Fumarola