Scale-Out Resource Management at Microsoft Using Apache YARN
Raghu Ramakrishnan
CTO for Data
Microsoft
Technical Fellow
Head, Big Data Engineering
Data to Intelligent Action:
• Store any data (relations, …)
• Do any analysis (SQL queries, Hive, …)
• At any speed (batch: Hive, …)
• At any scale … elastic!
• Anywhere
Internal workloads: Windows, SMSG, Live, Ads, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
• Interactive and real-time analytics require low-latency (RAM/flash) copies of data
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
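The tiered store in the talk is Microsoft-internal, but the analogous OSS mechanism is HDFS storage policies (archival storage, Hadoop 2.6+). Below is a minimal sketch, not from the talk, of pinning fresh data to a low-latency tier and demoting older partitions; the paths are made up for illustration.

```java
// Sketch (assumption: HDFS with storage policies, Hadoop 2.6+): data moves
// across tiers without changing the namespace or governance boundary.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TieringSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Expected an HDFS file system");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Keep fresh data on the low-latency tier... (illustrative paths)
    dfs.setStoragePolicy(new Path("/data/events/current"), "HOT");
    // ...and demote older partitions to archival storage as they age.
    dfs.setStoragePolicy(new Path("/data/events/2014"), "COLD");
    // The `hdfs mover` tool then migrates block replicas to match the policy.
  }
}
```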
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
Shared Data and Compute
Diagram: a Relational Query Engine and Machine Learning engines run side by side on a shared Compute Fabric (Resource Management) over Tiered Storage.
• Multiple analytic engines sharing the same resource pool
• Compute and store/cache on the same machines
What’s Behind a U-SQL Query
Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines share a common pool of containers
YARN: the resource manager for Hadoop 2.x
Related systems: Corona, Mesos, Omega
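For readers less familiar with YARN's container model, here is a minimal sketch of the allocation protocol the slide refers to: an ApplicationMaster registers with the ResourceManager and asks for containers, and the RM's scheduler arbitrates among all competing jobs. The sizes, counts, and empty host/tracking-URL arguments are illustrative.

```java
// Minimal sketch of an ApplicationMaster asking the RM for containers;
// error handling and the container-launch path are omitted.
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amrm = AMRMClient.createAMRMClient();
    amrm.init(new YarnConfiguration());
    amrm.start();
    amrm.registerApplicationMaster("", 0, "");  // host / port / tracking URL

    // Ask for 10 containers of <2 GB, 1 core>, anywhere in the cluster.
    Resource capability = Resource.newInstance(2048, 1);
    for (int i = 0; i < 10; i++) {
      amrm.addContainerRequest(
          new ContainerRequest(capability, null, null, Priority.newInstance(0)));
    }

    // The RM's scheduler arbitrates among competing jobs and hands back
    // containers over successive allocate() heartbeats.
    List<Container> allocated = amrm.allocate(0.0f).getAllocatedContainers();
    System.out.println("Allocated so far: " + allocated.size());
  }
}
```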
YARN Gaps
• Resource-allocation SLOs
• Scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
Microsoft Contributions to OSS Apache YARN
• Amoeba / Rayon
  • Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  • Status: prototypes, JIRAs and papers
• Federation
  • Status: prototype and JIRA
• Framework-level Pooling
  • Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  • Status: spec
Killing Tasks vs. Preemption
Chart: job % complete vs. time (s), comparing kill-based reclamation with preemption; preemption shows a 33% improvement.
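One piece of this preemption work that is visible in stock YARN configuration is the CapacityScheduler's preemption monitor (YARN-569, among the JIRAs on the next slide). Below is a hedged sketch of the knobs that turn it on, shown via the Configuration API rather than yarn-site.xml; the wait-before-kill value is an illustrative assumption, not a recommendation from the talk.

```java
// Sketch: enabling the scheduler monitor that preempts containers instead of
// relying on outright kills (CapacityScheduler preemption policy, YARN-569).
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PreemptionConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Turn on the scheduler monitor that drives preemption decisions.
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    conf.set("yarn.resourcemanager.scheduler.monitor.policies",
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
            + "ProportionalCapacityPreemptionPolicy");
    // Grace period before a container is forcibly reclaimed (ms); illustrative.
    conf.setLong(
        "yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill",
        15000L);
    System.out.println(
        conf.get("yarn.resourcemanager.scheduler.monitor.policies"));
  }
}
```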
Contributing to Apache: Engaging with OSS
Diagram: a Client submits Job1 to the RM (Scheduler); an App Master and its Tasks run across NodeManagers. The work is tracked in JIRAs MR-5192, MR-5194, MR-5197, MR-5189, MR-5176, MR-5196, and YARN-569.
• Talk with active developers
• Show early/partial work
• Small patches
• OK to leave things unfinished
Sharing a Cluster Between Production & Best-effort Jobs
Production Jobs (P)
Money-making, large (recurrent) jobs with SLAs
e.g., Shows up at 3pm, deadline at 6pm, ~ 90 min runtime
Best-Effort Jobs
Interactive exploratory jobs submitted by data scientists w/o SLAs
However, latency is still important (user waiting)
Reservation-Based Scheduling in Hadoop
(Curino, Krishnan, Difallah, Douglas, Ramakrishnan, Rao; Rayon paper, SoCC 2014)
New idea: support SLOs for production jobs by using
- Job-provided resource requirements in RDL
- System-enforced admission control
Resource Definition Language (RDL)
e.g., atom (<2GB,1core>, 1, 10,
1min, 10bundle/min)
(simplified for OSS release)
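The RDL atom above maps closely onto the reservation API that shipped with this work in Hadoop 2.6. Below is a minimal sketch against the 2.6/2.7 API; the arrival/deadline times and the "dedicated" queue name are illustrative assumptions, and later releases refine how the ReservationId is obtained.

```java
// Sketch: submitting a reservation with the API that shipped in Hadoop 2.6.
// The atom <2GB,1core> x 10 containers, concurrency 1, for 1 minute becomes
// a ReservationRequest; arrival/deadline below are illustrative only.
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ReservationSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // 10 containers of <2 GB, 1 core>, at least 1 at a time, for 60 seconds.
    ReservationRequest atom = ReservationRequest.newInstance(
        Resource.newInstance(2048, 1), 10, 1, 60_000L);
    ReservationRequests requests = ReservationRequests.newInstance(
        Collections.singletonList(atom), ReservationRequestInterpreter.R_ALL);

    long now = System.currentTimeMillis();
    ReservationDefinition definition = ReservationDefinition.newInstance(
        now + 60_000L,       // arrival: resources needed a minute from now
        now + 4 * 60_000L,   // deadline: must be satisfied within four minutes
        requests, "nightly-production-job");

    // The RM validates the request against the sharing policy, places it in
    // the Plan, and returns a ReservationId the job later submits against.
    // ("dedicated" is a hypothetical reservable queue name.)
    System.out.println(yarn.submitReservation(
        ReservationSubmissionRequest.newInstance(definition, "dedicated"))
        .getReservationId());
  }
}
```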
Steps:
1. App formulates reservation request in RDL
2. Request is “placed” in the Plan
3. Allocation is validated against sharing policy
4. System commits to deliver resources on time
5. Plan is dynamically enacted
6. Jobs get (reserved) resources
7. System adapts to live conditions
Reservation-Based Scheduling Architecture: Teach the RM About Time
Diagram: the ResourceManager is extended with a Planning Agent, a Plan (governed by a sharing policy, with an adapter), and a Plan Follower that drives the Scheduler. Reservation requests arrive as RDL; production and best-effort jobs then receive containers from the NodeManagers via adaptive scheduling. The numbers in the diagram correspond to steps 1-7 above.
Results
• Meets all production job SLAs
• Lowers best-effort job latency
• Increases cluster utilization and throughput
Committed to Hadoop trunk and 2.6 release
Now part of Cloudera CDH and Hortonworks HDP
Comparing Rayon With CapacityScheduler
Rayon OSS
Initial umbrella JIRA: YARN-1051 (14 sub-tasks)
Rayon V2 umbrella JIRA: YARN-2572 (25 sub-tasks; tooling, REST APIs, UI, documentation, perf improvements)
High-availability umbrella JIRA: YARN-2573 (7 sub-tasks)
Heterogeneity/node-labels umbrella JIRA: YARN-4193 (8 sub-tasks)
Algorithm enhancements: YARN-3656 (1 sub-task)
Folks involved: Carlo Curino, Subru Krishnan, Ishai Menache, Sean Po, Jonathan Yaniv, Arun Suresh, Anubhav Dhoot, Alexey Tumanov
Included in Apache Hadoop 2.6
Various enhancements in the upcoming Apache Hadoop 2.8
Why Federation?
Problem:
• YARN scalability is bounded by the centralized ResourceManager
• Proportional to #nodes, #apps, #containers, heartbeat frequency
• Maintenance and Operations on single massive cluster are painful
Solution:
• Scale by federating multiple YARN clusters
• Appears as a single massive cluster to an app
• Node Manager(s) heart-beat to one RM only
• Most apps talk with one RM only; a few apps might span sub-clusters
(achieved by transparently proxying AM-RM communication)
• A single app may span sub-clusters if it exceeds a sub-cluster's capacity, or for load balancing
• Easier provisioning / maintenance
• Leverage cross-company stabilization effort of smaller YARN clusters
• Use ~6k-node YARN clusters as-is as building blocks (a configuration sketch follows below)
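Federation was a prototype at the time of this deck; when it later landed upstream (Hadoop 2.9+), it is switched on through yarn-site.xml settings along the lines below. This is a sketch based on the upstream federation documentation, not the Microsoft-internal deployment; property names and class names should be verified against your Hadoop version. Shown via the Configuration API for brevity.

```java
// Sketch (assumption: upstream YARN Federation, Hadoop 2.9+): enable
// federation, point at a shared state store, and proxy AM-RM traffic on NMs.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Cluster-wide: turn federation on and share membership/state.
    conf.setBoolean("yarn.federation.enabled", true);
    conf.set("yarn.federation.state-store.class",
        "org.apache.hadoop.yarn.server.federation.store.impl."
            + "ZookeeperFederationStateStore");

    // On each NodeManager: intercept AM-RM communication so a single app can
    // transparently span sub-clusters.
    conf.setBoolean("yarn.nodemanager.amrmproxy.enabled", true);
    conf.set("yarn.nodemanager.amrmproxy.interceptor-class.pipeline",
        "org.apache.hadoop.yarn.server.nodemanager.amrmproxy."
            + "FederationInterceptor");

    // Clients submit to the Router, which appears to them as a single RM.
    System.out.println("federation enabled: "
        + conf.getBoolean("yarn.federation.enabled", false));
  }
}
```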
Federation Architecture
Diagram: a Router, a Global Policy Generator, a State Store, and a Policy Store front N YARN sub-clusters, each with its own RM and NodeManagers; clients submit through the Router, optionally guided by HDFS placement hints. The numbered flow:
1. Heartbeat (membership) from sub-clusters to the State Store
2. Request capacity allocation
3. Write (initial) capacity allocation
4. Read membership and load from the State Store
5. Write capacity allocation (updates) & policies to the Policy Store
6. Client submits the job to the Router
7. Read policies (capacity allocation and job routing)
8. Write app -> sub-cluster mapping to the State Store
9. Router submits the job to the chosen sub-cluster RM
Federation JIRAs (umbrella: YARN-2915)
Work Item | Associated JIRA | Author
Federation StateStore APIs | YARN-3662 | Subru Krishnan
Federation PolicyStore APIs | YARN-3664 | Subru Krishnan
Federation "Capacity Allocation" across sub-clusters | YARN-3658 | Carlo Curino
Federation Router | YARN-3658 | Giovanni Fumarola
Federation intercepting and propagating AM-RM communications | YARN-3666 | Kishore Chaliparambil
Federation maintenance mechanisms (command propagation) | YARN-3657 | Carlo Curino
Federation sub-cluster membership mechanisms | YARN-3665 | Subru Krishnan
Federation State and Policy Store (SQL implementation) | YARN-3663 | Giovanni Fumarola
Federation Global Policy Generator (load balancing) | YARN-3660 | Subru Krishnan
• Mercury
• Yaq
Cluster Utilization in YARN
Workload: 5 sec | 10 sec | 50 sec | Mixed-5-50 | Cosmos-gm
Utilization: 60.59% | 78.35% | 92.38% | 78.54% | 83.38%
Mercury: Distributed Scheduling in YARN
Two types of schedulers:
• Central scheduler (YARN)
• Distributed schedulers (new)
Diagram: a Mercury Runtime on each node, coordinated by the Mercury Resource Management Framework; applications request containers with a resource type of either guaranteed or opportunistic.
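Mercury's guaranteed/opportunistic split later surfaced upstream as container execution types (the YARN-2877 line of work referenced a few slides down). Below is a minimal sketch of tagging a request as OPPORTUNISTIC, using the API shape it has in Hadoop 2.9+; sizes and priority are illustrative.

```java
// Sketch (assumption: Hadoop 2.9+ execution types): an OPPORTUNISTIC request
// can be queued at a NodeManager instead of waiting for the central scheduler.
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class OpportunisticRequestSketch {
  public static void main(String[] args) {
    // <1 GB, 1 core>, 4 containers, any node ("*").
    ResourceRequest request = ResourceRequest.newInstance(
        Priority.newInstance(10), ResourceRequest.ANY,
        Resource.newInstance(1024, 1), 4);

    // Opportunistic: run if spare capacity exists; may be queued or preempted.
    request.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));

    System.out.println(request);
  }
}
```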
Mercury: Task Throughput As Task Duration Increases
Results (chart): “only-G” is stock YARN, “only-Q” is Mercury.
Yaq: Efficient Management of NM Queues
• Introduce queuing of tasks at NMs
• Explore queue management techniques
• Techniques applied to:
Evaluating Yaq on a Production Workload
Mercury and Yaq OSS
• Umbrella JIRA for Mercury: YARN-2877
• Sub-task status: 5 RESOLVED, 2 PATCH AVAILABLE
Microsoft Contributions to OSS Apache YARN
• Amoeba / Rayon
  • Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  • Status: prototypes, JIRAs and papers
• Federation
  • Status: prototype and JIRA
• Framework-level Pooling
  • Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  • Status: spec
Papers
(Won Best Paper at SoCC’13)
• Reservations and planning
• Queue management techniques
• Scalability
Comparison with Mesos/Omega/Borg
REEF: http://www.reef-project.org and http://reef.incubator.apache.org
Blog: http://aka.ms/adltechblog/

Editor's Notes

  • #7 You're familiar with SQL Server, and many of you know Hadoop and Azure HDInsight. This is a little bigger.
  • #8, #9 Analytic storage for the cloud. Users want to think about the content of their data and what it can tell them about their business, and control who can access it. They don't want to think about remote vs. local storage, RAM vs. flash, or security.
  • #35 Example?