2. Who are we?
Large team:
• Cloud and Information Services Lab (CISL)
  • Applied research group in large-scale systems and machine learning
• BigData Resource Management team
  • Design, build and operate Microsoft’s big data infrastructure
4. YARN @MS
Familiar Challenges:
• Diverse workloads (batch, interactive, services,…)
• Support for production SLAs
• ROI on cluster investments (utilization)
Special Challenges:
• Leverage existing strong infrastructure (Cosmos/Scope/REEF/Azure)
• Enable all OSS technologies
• Scale of first-party clusters (each can exceed 50k nodes)
• Public Cloud (security, number of tenants, service integration…)
Big Bet: Unified Resource Management through YARN (OSS) + Azure
5. YARN @MS: Innovate and Contribute
Problems
• Lack of SLAs for production jobs
• High utilization for a broad range of workloads
• YARN scalability
• Private cloud (from disjoint clusters)
• Cross-DC?
Our Solution…
• Rayon: resource reservation framework (YARN-1051)
• Mercury: introduce container types and node-level queueing (YARN-2877)
• Federation: “federate” multiple YARN clusters (YARN-2915)
6. YARN Federation in Apache
• Umbrella JIRA: YARN-2915
• Includes detailed design proposal and e2e patch
• Federation branch created and API patches posted
• You are welcome to join and contribute
• Thanks: Wangda, Karthik, Vinod, Jian….
8. YARN Federation
• Enables applications to scale to 100s of thousands of nodes
• The YARN Resource Manager (RM) is a single instance
  • Scalability of the RM is affected by (rough back-of-envelope below)
    • Cardinality: |nodes|, |apps|, |tasks|
    • Frequency: NM and AM heartbeat intervals, task duration
• YARN is battle-tested on 4-8k nodes
• @Microsoft: >50k-node clusters, short-lived tasks
• So how does federation work?
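Before looking at the architecture, a rough back-of-envelope on why a single RM struggles at this scale; the heartbeat intervals and running-app count below are illustrative assumptions, not measured values.

// Back-of-envelope: RM load grows with |nodes|, |apps| and heartbeat frequency.
// The intervals and app count below are illustrative assumptions.
public class RmLoadSketch {
    public static void main(String[] args) {
        int nodes = 50_000;               // cluster size from the slide
        double nmHeartbeatSec = 1.0;      // assumed NM heartbeat interval
        int runningApps = 5_000;          // assumed concurrently running AMs
        double amHeartbeatSec = 1.0;      // assumed AM allocate interval

        double nmEventsPerSec = nodes / nmHeartbeatSec;
        double amEventsPerSec = runningApps / amHeartbeatSec;
        System.out.printf("RM must absorb ~%.0f NM + ~%.0f AM heartbeats/sec%n",
                nmEventsPerSec, amEventsPerSec);
        // ~55k scheduler/heartbeat events per second on one JVM, before accounting
        // for short-lived tasks churning containers -- hence the motivation to
        // federate multiple smaller sub-clusters.
    }
}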
9. Federation Architecture
[Diagram: a YARN Client submits an application ("Submit App") through the Federation Services (Router Service plus Policy/State Store), which front several YARN sub-clusters (#1, #2, #3); each sub-cluster runs its own RM, an AM RM Proxy Service runs on every node, and AMs start containers ("Start Containers") on servers across the datacenter.]
• Router Service
  • Implements the Client-RM protocol
  • Stateless, scalable service; multiple instances behind a load balancer
• AM RM Proxy Service
  • Implements the AM-RM protocol; hosted in the NM (one per node)
  • Intercepts all AM-RM communications
• Policy/State Store
  • Centralized, highly-available repository (RDBMS, ZooKeeper, HDFS, …)
• Sub-clusters are unmodified, standalone YARN clusters of about 6k nodes
• Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level
• No code change in any application
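As a rough sketch of the Router's role (all class and method names here are hypothetical, not the actual Hadoop YARN classes): it implements the client-facing submission call, picks a home sub-cluster according to policy, records the choice in the state store, and forwards the request to that sub-cluster's RM.

import java.util.Map;

// Hypothetical, simplified sketch of the Router's submit path; names and
// signatures are illustrative, not the Hadoop YARN API.
interface RoutingPolicy {
    String chooseSubCluster(String user, String queue);   // e.g. weighted-random
}

interface RmClient { void submitApplication(String appId, String user, String queue); }

interface FederationStateStore { void recordHomeSubCluster(String appId, String subClusterId); }

class Router {
    private final RoutingPolicy policy;
    private final Map<String, RmClient> subClusterRms;     // sub-cluster id -> RM proxy
    private final FederationStateStore stateStore;         // RDBMS / ZooKeeper / HDFS backed

    Router(RoutingPolicy policy, Map<String, RmClient> rms, FederationStateStore store) {
        this.policy = policy;
        this.subClusterRms = rms;
        this.stateStore = store;
    }

    // Client-RM protocol: the client sees one logical cluster.
    String submitApplication(String appId, String user, String queue) {
        String home = policy.chooseSubCluster(user, queue);  // pick the "home" sub-cluster
        stateStore.recordHomeSubCluster(appId, home);         // so later calls can be routed
        subClusterRms.get(home).submitApplication(appId, user, queue);
        return home;
    }
}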
10. AM RM Proxy Service Internals
[Diagram: inside the Node Manager, the AM RM Proxy Service sits between the Application Master and the sub-cluster RMs (SC #1 RM, SC #2 RM, SC #3 RM); each application gets its own pipeline (interceptor chain) of Federation Interceptor, Security/Throttling Interceptor, …, ending in a Home RM Proxy, with Unmanaged AMs for SC #2 and SC #3 driven by policy.]
• Hosted in the NM
• Extensible design: per-application pipeline of interceptors
• DDoS prevention
• Unmanaged AMs are used for container negotiation in the other sub-clusters; they are created on demand, based on policy
• Code committed to 2.8
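A minimal sketch of the interceptor-chain idea, with hypothetical names; the real AMRMProxy pipeline differs in detail. Each interceptor sees the AM's allocate call and either handles it or passes it down the chain, and the federation step is where requests get split between the home RM and on-demand Unmanaged AMs.

// Hypothetical sketch of the per-application interceptor chain; names are
// illustrative, not the actual Hadoop AMRMProxy classes.
interface AmRmInterceptor {
    AllocateResponse allocate(AllocateRequest req);
}

/** Throttles/validates requests before they reach any RM (DDoS prevention). */
class ThrottlingInterceptor implements AmRmInterceptor {
    private final AmRmInterceptor next;
    ThrottlingInterceptor(AmRmInterceptor next) { this.next = next; }
    public AllocateResponse allocate(AllocateRequest req) {
        // drop or delay the call here if the AM exceeds its request budget
        return next.allocate(req);
    }
}

/** Splits requests between the home RM and on-demand Unmanaged AMs in other sub-clusters. */
class FederationInterceptor implements AmRmInterceptor {
    private final AmRmInterceptor homeRmProxy;
    FederationInterceptor(AmRmInterceptor homeRmProxy) { this.homeRmProxy = homeRmProxy; }
    public AllocateResponse allocate(AllocateRequest req) {
        // Conceptually: forward part of the ask to the home RM (down the chain),
        // hand the rest to Unmanaged AMs created on demand per policy, then
        // merge the responses before returning to the AM.
        return homeRmProxy.allocate(req);
    }
}

class AllocateRequest {}
class AllocateResponse {}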
14. Tackling hard problems with policies
A hard problem: how to transparently enable “global queues” via “local enforcement”?
[Figure: a global queue tree — root R (100%) split 25%/25%/25%/25% into queues A, B, C, D, with A further split 40%/60% into A1 and A2 — sits above four sub-clusters SC1–SC4, each marked “?”: what local queue structure should each sub-cluster enforce so that the global shares hold?]
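To make “global queues via local enforcement” concrete, a small worked example based on the tree in the figure: each leaf queue's federation-wide guarantee is the product of the shares along its path, and whatever the sub-clusters enforce locally must add up to exactly those numbers.

// Worked arithmetic for the global tree on this slide: a leaf queue's
// federation-wide guarantee is the product of percentages along its path.
public class GlobalQueueShares {
    public static void main(String[] args) {
        double a = 0.25, b = 0.25, c = 0.25, d = 0.25;  // children of root R
        double a1 = 0.40, a2 = 0.60;                     // children of A

        // Absolute (federation-wide) shares of the leaves.
        System.out.printf("A1 = %.0f%%, A2 = %.0f%%, B = %.0f%%, C = %.0f%%, D = %.0f%%%n",
                a * a1 * 100, a * a2 * 100, b * 100, c * 100, d * 100);
        // -> A1 = 10%, A2 = 15%, B = 25%, C = 25%, D = 25%
        // "Local enforcement" means the per-sub-cluster queue configurations,
        // weighted by sub-cluster size, must add up to these global shares.
    }
}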
15. Spectrum of options: Full Partitioning
[Figure: under full partitioning, the global queue tree is split across the sub-clusters — SC1 hosts only queue A (100%, with A1/A2 at 40%/60%), and the other sub-clusters each host one of B, C, D at 100%.]
Policies: Router and AMRMProxy direct each application to a single RM
Pros: perfect scale-out, isolation
Cons: fragmentation/utilization issues, maximum job size capped by one sub-cluster, uneven impact of sub-cluster failures, …
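A minimal sketch of a full-partitioning routing policy (the class name and static queue-to-sub-cluster mapping are illustrative assumptions): every application in a queue is pinned to exactly one sub-cluster, so neither the Router nor the AMRMProxy ever fans out beyond it.

import java.util.Map;

// Hypothetical full-partitioning policy: each top-level queue lives entirely
// in one sub-cluster, so Router and AMRMProxy always target a single RM.
class FullPartitioningPolicy {
    // Illustrative static mapping, mirroring the figure (A->SC1, B->SC2, ...).
    private final Map<String, String> queueToSubCluster =
            Map.of("A", "SC1", "B", "SC2", "C", "SC3", "D", "SC4");

    String routeApplication(String queue) {
        String sc = queueToSubCluster.get(topLevel(queue));
        if (sc == null) throw new IllegalArgumentException("unknown queue: " + queue);
        return sc;                          // all containers come from this one sub-cluster
    }

    private static String topLevel(String queue) {
        int dot = queue.indexOf('.');       // e.g. "A.A1" -> "A"
        return dot < 0 ? queue : queue.substring(0, dot);
    }

    public static void main(String[] args) {
        FullPartitioningPolicy policy = new FullPartitioningPolicy();
        System.out.println("A.A1 -> " + policy.routeApplication("A.A1"));  // SC1
    }
}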
16. Spectrum of options: Full Replication
[Figure: under full replication, every sub-cluster SC1–SC4 runs an identical copy of the global queue tree (R at 100%; A, B, C, D at 25% each; A1/A2 at 40%/60%).]
Policies: Router picks a home RM round-robin/at random; AMRMProxy forwards each ResourceRequest to the RMs based on its locality
Pros: simple, symmetric, fair (if all jobs broadcast their demand), resilient
Cons: scalability in the number of jobs, … (heuristic improvements possible)
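A sketch of the full-replication behavior on the AMRMProxy side, under assumed names and a simplified allocation model: the proxy fans each resource ask out to the sub-cluster RMs, preferring the one that owns the requested node, and merges whatever containers come back before returning them to the AM.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of full replication: the AMRMProxy broadcasts each ask
// to the sub-cluster RMs (favoring the locality owner) and merges the results.
class BroadcastAllocator {
    private final Map<String, RmStub> rms;               // sub-cluster id -> RM stub
    private final Map<String, String> nodeToSubCluster;   // node -> owning sub-cluster

    BroadcastAllocator(Map<String, RmStub> rms, Map<String, String> nodeToSubCluster) {
        this.rms = rms;
        this.nodeToSubCluster = nodeToSubCluster;
    }

    List<Container> allocate(String requestedNode, int numContainers) {
        List<Container> granted = new ArrayList<>();
        // Ask the sub-cluster that owns the requested node first (locality),
        // then fall back to the remaining sub-clusters for whatever is still missing.
        String preferred = nodeToSubCluster.getOrDefault(requestedNode, "");
        if (rms.containsKey(preferred)) {
            granted.addAll(rms.get(preferred).allocate(requestedNode, numContainers));
        }
        for (Map.Entry<String, RmStub> e : rms.entrySet()) {
            int missing = numContainers - granted.size();
            if (missing <= 0) break;
            if (e.getKey().equals(preferred)) continue;
            granted.addAll(e.getValue().allocate(requestedNode, missing));
        }
        return granted;
    }
}

interface RmStub { List<Container> allocate(String node, int num); }
class Container {}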
17. Spectrum of options: Dynamic Partial Replication
[Figure: under dynamic partial replication, each sub-cluster SC1–SC4 hosts a different subset of the global queues, with local percentages rescaled (e.g. a sub-cluster holding only two queues gives each 50%), so every queue is replicated on some, but not all, sub-clusters.]
Policies: Router picks round-robin/at random from a subset of the RMs; AMRMProxy forwards to RMs based on the locality of the ResourceRequest (again restricted to a subset of the RMs)
Pros: a trade-off between the advantages of replication and partitioning
Cons: complexity; rebalancing could use a dynamic approach
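A sketch of the kind of policy that drives partial replication (names and weights are assumptions): the Router makes a weighted random choice over the subset of sub-clusters allowed for a given user or queue, which is also the subset the AMRMProxy later forwards requests to.

import java.util.Map;
import java.util.Random;

// Hypothetical weighted-random choice over a subset of sub-clusters, as a
// Router policy for dynamic partial replication; weights are illustrative.
class WeightedSubsetPolicy {
    private final Map<String, Double> weights;   // only the sub-clusters this queue/user may use
    private final Random random = new Random();

    WeightedSubsetPolicy(Map<String, Double> weights) { this.weights = weights; }

    String chooseHomeSubCluster() {
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        double pick = random.nextDouble() * total;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            pick -= e.getValue();
            if (pick <= 0) return e.getKey();
        }
        // Fallback for floating-point edge cases: return the last entry seen.
        return weights.keySet().stream().reduce((a, b) -> b).orElseThrow();
    }

    public static void main(String[] args) {
        // Example: a queue replicated on SC1 and SC2 only, with SC1 favored.
        WeightedSubsetPolicy p = new WeightedSubsetPolicy(Map.of("SC1", 0.7, "SC2", 0.3));
        System.out.println("home sub-cluster: " + p.chooseHomeSubCluster());
    }
}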
18. Demo
Show basic job running across sub-clusters
Show some UIs and ops commands
Showcase a user-based, partially-replicated routing policy
• Router: random-weighted among a set of sub-clusters…
• AMRMProxy: broadcast request to set of sub-clusters…