Scale-Out Resource Management at Microsoft Using Apache YARN
Raghu Ramakrishnan
CTO for Data
Microsoft
Technical Fellow
Head, Big Data Engineering
Data to Intelligent Action:
• Store any data (relations, …)
• Do any analysis (SQL queries, Hive, …)
• At any speed (batch: Hive, …)
• At any scale … elastic!
• Anywhere
Internal workloads: Windows, SMSG, Live, Ads, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
• Interactive and real-time analytics require low-latency (RAM/flash) copies of data
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
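The tiered store in the talk is Microsoft-internal, but the analogous OSS mechanism is HDFS storage policies (archival storage, Hadoop 2.6+). Below is a minimal sketch, not from the talk, of pinning fresh data to a low-latency tier and demoting older partitions; the paths are made up for illustration.

```java
// Sketch (assumption: HDFS with storage policies, Hadoop 2.6+): data moves
// across tiers without changing the namespace or governance boundary.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TieringSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Expected an HDFS file system");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Keep fresh data on the low-latency tier... (illustrative paths)
    dfs.setStoragePolicy(new Path("/data/events/current"), "HOT");
    // ...and demote older partitions to archival storage as they age.
    dfs.setStoragePolicy(new Path("/data/events/2014"), "COLD");
    // The `hdfs mover` tool then migrates block replicas to match the policy.
  }
}
```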
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
Shared Data and Compute
Diagram: a Relational Query Engine and Machine Learning engines run side by side on a shared Compute Fabric (Resource Management) over Tiered Storage.
• Multiple analytic engines sharing the same resource pool
• Compute and store/cache on the same machines
What’s Behind a U-SQL Query
Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines share a common pool of containers
YARN: the resource manager for Hadoop 2.x
Related systems: Corona, Mesos, Omega
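For readers less familiar with YARN's container model, here is a minimal sketch of the allocation protocol the slide refers to: an ApplicationMaster registers with the ResourceManager and asks for containers, and the RM's scheduler arbitrates among all competing jobs. The sizes, counts, and empty host/tracking-URL arguments are illustrative.

```java
// Minimal sketch of an ApplicationMaster asking the RM for containers;
// error handling and the container-launch path are omitted.
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amrm = AMRMClient.createAMRMClient();
    amrm.init(new YarnConfiguration());
    amrm.start();
    amrm.registerApplicationMaster("", 0, "");  // host / port / tracking URL

    // Ask for 10 containers of <2 GB, 1 core>, anywhere in the cluster.
    Resource capability = Resource.newInstance(2048, 1);
    for (int i = 0; i < 10; i++) {
      amrm.addContainerRequest(
          new ContainerRequest(capability, null, null, Priority.newInstance(0)));
    }

    // The RM's scheduler arbitrates among competing jobs and hands back
    // containers over successive allocate() heartbeats.
    List<Container> allocated = amrm.allocate(0.0f).getAllocatedContainers();
    System.out.println("Allocated so far: " + allocated.size());
  }
}
```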
YARN Gaps
• Resource-allocation SLOs
• Scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
Microsoft Contributions to OSS Apache YARN
• Amoeba / Rayon
  • Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  • Status: prototypes, JIRAs and papers
• Federation
  • Status: prototype and JIRA
• Framework-level Pooling
  • Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  • Status: spec
Killing Tasks vs. Preemption
Chart: job % complete vs. time (s), comparing kill-based reclamation with preemption; preemption shows a 33% improvement.
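One piece of this preemption work that is visible in stock YARN configuration is the CapacityScheduler's preemption monitor (YARN-569, among the JIRAs on the next slide). Below is a hedged sketch of the knobs that turn it on, shown via the Configuration API rather than yarn-site.xml; the wait-before-kill value is an illustrative assumption, not a recommendation from the talk.

```java
// Sketch: enabling the scheduler monitor that preempts containers instead of
// relying on outright kills (CapacityScheduler preemption policy, YARN-569).
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PreemptionConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Turn on the scheduler monitor that drives preemption decisions.
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    conf.set("yarn.resourcemanager.scheduler.monitor.policies",
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
            + "ProportionalCapacityPreemptionPolicy");
    // Grace period before a container is forcibly reclaimed (ms); illustrative.
    conf.setLong(
        "yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill",
        15000L);
    System.out.println(
        conf.get("yarn.resourcemanager.scheduler.monitor.policies"));
  }
}
```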
Contributing to Apache: Engaging with OSS
Diagram: a Client submits Job1 to the RM (Scheduler); an App Master and its Tasks run across NodeManagers. The work is tracked in JIRAs MR-5192, MR-5194, MR-5197, MR-5189, MR-5176, MR-5196, and YARN-569.
• Talk with active developers
• Show early/partial work
• Small patches
• OK to leave things unfinished
Sharing a Cluster Between Production & Best-effort Jobs
Production Jobs (P)
Money-making, large (recurrent) jobs with SLAs
e.g., Shows up at 3pm, deadline at 6pm, ~ 90 min runtime
Best-Effort Jobs
Interactive exploratory jobs submitted by data scientists w/o SLAs
However, latency is still important (user waiting)
Reservation-Based Scheduling in Hadoop
(Curino, Krishnan, Difallah, Douglas, Ramakrishnan, Rao; Rayon paper, SoCC 2014)
New idea: support SLOs for production jobs by using
- Job-provided resource requirements in RDL
- System-enforced admission control
Resource Definition Language (RDL)
e.g., atom (<2GB,1core>, 1, 10,
1min, 10bundle/min)
(simplified for OSS release)
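The RDL atom above maps closely onto the reservation API that shipped with this work in Hadoop 2.6. Below is a minimal sketch against the 2.6/2.7 API; the arrival/deadline times and the "dedicated" queue name are illustrative assumptions, and later releases refine how the ReservationId is obtained.

```java
// Sketch: submitting a reservation with the API that shipped in Hadoop 2.6.
// The atom <2GB,1core> x 10 containers, concurrency 1, for 1 minute becomes
// a ReservationRequest; arrival/deadline below are illustrative only.
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ReservationSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // 10 containers of <2 GB, 1 core>, at least 1 at a time, for 60 seconds.
    ReservationRequest atom = ReservationRequest.newInstance(
        Resource.newInstance(2048, 1), 10, 1, 60_000L);
    ReservationRequests requests = ReservationRequests.newInstance(
        Collections.singletonList(atom), ReservationRequestInterpreter.R_ALL);

    long now = System.currentTimeMillis();
    ReservationDefinition definition = ReservationDefinition.newInstance(
        now + 60_000L,       // arrival: resources needed a minute from now
        now + 4 * 60_000L,   // deadline: must be satisfied within four minutes
        requests, "nightly-production-job");

    // The RM validates the request against the sharing policy, places it in
    // the Plan, and returns a ReservationId the job later submits against.
    // ("dedicated" is a hypothetical reservable queue name.)
    System.out.println(yarn.submitReservation(
        ReservationSubmissionRequest.newInstance(definition, "dedicated"))
        .getReservationId());
  }
}
```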
Steps:
1. App formulates reservation request in RDL
2. Request is “placed” in the Plan
3. Allocation is validated against sharing policy
4. System commits to deliver resources on time
5. Plan is dynamically enacted
6. Jobs get (reserved) resources
7. System adapts to live conditions
Reservation-Based Scheduling Architecture: Teach the RM About Time
Diagram: the ResourceManager is extended with a Planning Agent, a Plan (governed by a sharing policy, with an adapter), and a Plan Follower that drives the Scheduler. Reservation requests arrive as RDL; production and best-effort jobs then receive containers from the NodeManagers via adaptive scheduling. The numbers in the diagram correspond to steps 1-7 above.
Results
• Meets all production job SLAs
• Lowers best-effort job latency
• Increases cluster utilization and throughput
Committed to Hadoop trunk and 2.6 release
Now part of Cloudera CDH and Hortonworks HDP
Comparing Rayon With CapacityScheduler
Rayon OSS
Initial umbrella JIRA: YARN-1051 (14 sub-tasks)
Rayon V2 umbrella JIRA: YARN-2572 (25 sub-tasks; tooling, REST APIs, UI, documentation, perf improvements)
High-availability umbrella JIRA: YARN-2573 (7 sub-tasks)
Heterogeneity/node-labels umbrella JIRA: YARN-4193 (8 sub-tasks)
Algorithm enhancements: YARN-3656 (1 sub-task)
Folks involved: Carlo Curino, Subru Krishnan, Ishai Menache, Sean Po, Jonathan Yaniv, Arun Suresh, Anubhav Dhoot, Alexey Tumanov
Included in Apache Hadoop 2.6
Various enhancements in the upcoming Apache Hadoop 2.8
Why Federation?
Problem:
• YARN scalability is bounded by the centralized ResourceManager
• Proportional to #nodes, #apps, #containers, heartbeat frequency
• Maintenance and Operations on single massive cluster are painful
Solution:
• Scale by federating multiple YARN clusters
• Appears as a single massive cluster to an app
• Node Manager(s) heart-beat to one RM only
• Most apps talk with one RM only; a few apps might span sub-clusters
(achieved by transparently proxying AM-RM communication)
• A single app may span sub-clusters if it exceeds a sub-cluster's capacity, or for load balancing
• Easier provisioning / maintenance
• Leverage cross-company stabilization effort of smaller YARN clusters
• Use ~6k-node YARN clusters as-is as building blocks (a configuration sketch follows below)
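Federation was a prototype at the time of this deck; when it later landed upstream (Hadoop 2.9+), it is switched on through yarn-site.xml settings along the lines below. This is a sketch based on the upstream federation documentation, not the Microsoft-internal deployment; property names and class names should be verified against your Hadoop version. Shown via the Configuration API for brevity.

```java
// Sketch (assumption: upstream YARN Federation, Hadoop 2.9+): enable
// federation, point at a shared state store, and proxy AM-RM traffic on NMs.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Cluster-wide: turn federation on and share membership/state.
    conf.setBoolean("yarn.federation.enabled", true);
    conf.set("yarn.federation.state-store.class",
        "org.apache.hadoop.yarn.server.federation.store.impl."
            + "ZookeeperFederationStateStore");

    // On each NodeManager: intercept AM-RM communication so a single app can
    // transparently span sub-clusters.
    conf.setBoolean("yarn.nodemanager.amrmproxy.enabled", true);
    conf.set("yarn.nodemanager.amrmproxy.interceptor-class.pipeline",
        "org.apache.hadoop.yarn.server.nodemanager.amrmproxy."
            + "FederationInterceptor");

    // Clients submit to the Router, which appears to them as a single RM.
    System.out.println("federation enabled: "
        + conf.getBoolean("yarn.federation.enabled", false));
  }
}
```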
Federation Architecture
Diagram: a Router, a Global Policy Generator, a State Store, and a Policy Store front N YARN sub-clusters, each with its own RM and NodeManagers; clients submit through the Router, optionally guided by HDFS placement hints. The numbered flow:
1. Heartbeat (membership) from sub-clusters to the State Store
2. Request capacity allocation
3. Write (initial) capacity allocation
4. Read membership and load from the State Store
5. Write capacity allocation (updates) & policies to the Policy Store
6. Client submits the job to the Router
7. Read policies (capacity allocation and job routing)
8. Write app -> sub-cluster mapping to the State Store
9. Router submits the job to the chosen sub-cluster RM
Federation JIRAs (umbrella: YARN-2915)
Work Item | Associated JIRA | Author
Federation StateStore APIs | YARN-3662 | Subru Krishnan
Federation PolicyStore APIs | YARN-3664 | Subru Krishnan
Federation "Capacity Allocation" across sub-clusters | YARN-3658 | Carlo Curino
Federation Router | YARN-3658 | Giovanni Fumarola
Federation intercepting and propagating AM-RM communications | YARN-3666 | Kishore Chaliparambil
Federation maintenance mechanisms (command propagation) | YARN-3657 | Carlo Curino
Federation sub-cluster membership mechanisms | YARN-3665 | Subru Krishnan
Federation State and Policy Store (SQL implementation) | YARN-3663 | Giovanni Fumarola
Federation Global Policy Generator (load balancing) | YARN-3660 | Subru Krishnan
• Mercury
• Yaq
Cluster Utilization in YARN
Workload: 5 sec | 10 sec | 50 sec | Mixed-5-50 | Cosmos-gm
Utilization: 60.59% | 78.35% | 92.38% | 78.54% | 83.38%
Mercury: Distributed Scheduling in YARN
Two types of schedulers:
• Central scheduler (YARN)
• Distributed schedulers (new)
Diagram: a Mercury Runtime on each node, coordinated by the Mercury Resource Management Framework; applications request containers with a resource type of either guaranteed or opportunistic.
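Mercury's guaranteed/opportunistic split later surfaced upstream as container execution types (the YARN-2877 line of work referenced a few slides down). Below is a minimal sketch of tagging a request as OPPORTUNISTIC, using the API shape it has in Hadoop 2.9+; sizes and priority are illustrative.

```java
// Sketch (assumption: Hadoop 2.9+ execution types): an OPPORTUNISTIC request
// can be queued at a NodeManager instead of waiting for the central scheduler.
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class OpportunisticRequestSketch {
  public static void main(String[] args) {
    // <1 GB, 1 core>, 4 containers, any node ("*").
    ResourceRequest request = ResourceRequest.newInstance(
        Priority.newInstance(10), ResourceRequest.ANY,
        Resource.newInstance(1024, 1), 4);

    // Opportunistic: run if spare capacity exists; may be queued or preempted.
    request.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));

    System.out.println(request);
  }
}
```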
Mercury: Task Throughput As Task Duration Increases
Results (chart): “only-G” is stock YARN, “only-Q” is Mercury.
Yaq: Efficient Management of NM Queues
• Introduce queuing of tasks at NMs
• Explore queue management techniques
• Techniques applied to:
Evaluating Yaq on a Production Workload
Mercury and Yaq OSS
• Umbrella JIRA for Mercury: YARN-2877
• Sub-task status: 5 RESOLVED, 2 PATCH AVAILABLE
Microsoft Contributions to OSS Apache YARN
• Amoeba / Rayon
  • Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  • Status: prototypes, JIRAs and papers
• Federation
  • Status: prototype and JIRA
• Framework-level Pooling
  • Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  • Status: spec
Papers
(Won Best Paper at SoCC’13)
• Reservations and planning
• Queue management techniques
• Scalability
Comparison with Mesos/Omega/Borg
REEF: http://www.reef-project.org and http://reef.incubator.apache.org
Blog: http://aka.ms/adltechblog/

Editor's Notes

  • #7 You're familiar with SQL Server, and many of you know Hadoop and Azure HDInsight. This is a little bigger.
  • #8, #9 Analytic storage for the cloud. Users want to think about the content of their data and what it can tell them about their business, and control who can access it. They don't want to think about remote vs. local storage, RAM vs. flash, or security.
  • #35 Example?