LESSONS LEARNED FROM
SCALING YARN TO 40K
MACHINES IN A MULTI
TENANCY ENVIRONMENT
Roni Burd : Principal Software Eng Mgr
Hitesh Sharma: Principal Software Eng
Sarvesh Sakalanaga: Senior Software Eng Mgr
Cosmos: World’s Biggest YARN Cluster!
Cosmos in numbers
• Single DC >40,000 machines
• Multiple DCs (do the math)
• >500,000 jobs / day
• >2,000,000 containers / hour
• Up to 90% CPU utilization
• 99.9% reliability/availability
• Several Exabytes in storage
• Hundreds of petabytes processed per day
Journey
Migrate Cosmos to YARN
Run OSS workloads
Multiplier effect from the community
Cosmos environment
• Running different YARN applications with different characteristics
• SCOPE (AKA U-SQL) is the most common application today
• Spark, custom REEF code and custom YARN AMs
• As we migrate existing workloads to YARN, we have to remain compatible with legacy apps
• Cosmos supports running SLA and non-SLA jobs in the same datacenter
• Customers are allocated a Virtual Cluster (VC) with a max number of containers
• Machines allocated to each VC are shared across all tenants
• Virtual clusters have guaranteed capacity and can use idle capacity from other tenants
• Minimize COGS and reduce latency
• Maximize data locality and prevent data movement/copying
• Don't leave any resource unused – CPU, memory, IOs and bandwidth
Scope request characteristics
[Diagram: the Scope AM submits container requests to NodeManagers NM1…NM3904 spread across racks; a graph of parallel containers over time shows waves capped at the max containers allowed]
• Scope submits containers in ‘waves’
• Waves are capped by max allocation
• Waves can be very big (2000 or more)
• Jobs achieve SLA by avoiding contention
• Allocated capacity < Total Cluster Capacity
• This leads to natural underutilization
Achieving 90% CPU utilization: Mercury (YARN-2877)
Opportunistic containers
• Scope allocates OPPORTUNISTIC containers
• OPP containers are queued in the NM based on the NM's capacity
• Scheduled after exhausting the max GUARANTEED allotment
• Once GUARANTEED containers finish, SCOPE can either:
• Promote running OPPORTUNISTIC containers, or
• Schedule a new GUARANTEED wave instead
• OPPORTUNISTIC containers get paused (YARN-5972)
• This improves latencies, reduces COGS and preserves SLAs
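As a rough sketch of what this looks like from an AM's point of view, the Java below requests a wave of OPPORTUNISTIC containers and later promotes one to GUARANTEED using the Hadoop 2.9+ AMRMClient API. This is illustrative, not the actual Scope AM code; constructor shapes vary across Hadoop versions, and the sizes, priority and wave size are made-up values.

    import org.apache.hadoop.yarn.api.records.*;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Illustrative only: request a wave of OPPORTUNISTIC containers, then
    // promote a running one to GUARANTEED once capacity frees up.
    public class OpportunisticWaveSketch {
      static void submitWave(AMRMClient<ContainerRequest> amrmClient, int waveSize) {
        Resource capability = Resource.newInstance(4096 /* MB */, 2 /* vcores */);
        Priority priority = Priority.newInstance(1);
        ExecutionTypeRequest opportunistic =
            ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true);
        for (int i = 0; i < waveSize; i++) {
          // nodes/racks null: let the scheduler place; the allocationRequestId
          // ties each allocation back to this request.
          amrmClient.addContainerRequest(new ContainerRequest(
              capability, null /* nodes */, null /* racks */, priority,
              i /* allocationRequestId */, true /* relaxLocality */,
              null /* nodeLabels */, opportunistic));
        }
      }

      static void promote(AMRMClient<ContainerRequest> amrmClient, Container running) {
        // Ask the RM to upgrade a running OPPORTUNISTIC container to GUARANTEED.
        amrmClient.requestContainerUpdate(running,
            UpdateContainerRequest.newInstance(running.getVersion(),
                running.getId(), ContainerUpdateType.PROMOTE_EXECUTION_TYPE,
                null /* capability unchanged */, ExecutionType.GUARANTEED));
      }
    }

Pairing enforced OPPORTUNISTIC requests with explicit promotion is what lets the AM choose between promoting running work and scheduling a fresh GUARANTEED wave.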
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
[Diagram: the Scope AM calls Allocate(G) on the YARN RM, which responds only when a GUARANTEED container is allocated, and Allocate(OPP) on the NM scheduler, which responds immediately; the NM keeps opportunistic and paused queues alongside running GUARANTEED containers started via StartContainer(G), and the SCOPE Aux Service feeds load information back. "Promote" is based on a 2.7 patch; working with OSS for 2.9]
Latency-sensitive workloads
• Durations of Scope containers vary widely, but most are short – delays are expensive!
• Any delays in Allocate() or StartContainer() can lead to job slowdowns and failures
• A 20 sec delay can mean millions of US dollars
• Locality is extremely important for latency and COGS
• Scope has visibility into all 40,000 machines through an Aux Service
• The Scope AM is smart and makes several optimizations (e.g. future JOINs, RACK aggregations)
• RACK locality is "almost" as good as NODE locality – OFF_SWITCH is bad
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
How did we scale YARN to deal with our QPS, latency and locality requirements?
How did we scale YARN to 40K machines?
Hard lessons (that we have time to cover today)
YARN RM Scalability
Hitesh Sharma
Principal Software Engineer
Scope – main workload
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
Challenges
• A few Scope jobs would overwhelm the YARN RM in a few minutes
• Long delays in allocating containers would result in job failures
• Scale testing in big clusters is a very expensive process
• Lack of metrics and telemetry to understand what's happening in the YARN RM
Setup
• Test cluster with 3000 nodes
• Running Hadoop 2.7.1
• CapacityScheduler with a single queue
[Chart: Allocated vs Pending containers – red line: pending containers; blue line: allocated containers. Note: the cluster had resources to satisfy all the requests being made]
Scheduling in capacity-scheduler
[Diagram: AMs send Allocate calls to the YARN RM while NMs heartbeat (HB) in; the scheduler loop runs on threads driven by node heartbeats]
• On each HB, the YARN RM takes a lock and looks at all the outstanding container requests in the queue
• If the node can satisfy a pending request then the YARN RM allocates; otherwise it just looks at the next one
• The loop looks at _every_ pending request (potentially thousands) – the algorithm also uses this scan to downgrade locality
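To make the cost concrete, here is a deliberately simplified model of that loop. It is illustrative Java, not the actual CapacityScheduler source, and the types are hypothetical stand-ins:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of the heartbeat-driven loop described above.
    class NaiveSchedulerSketch {
      static class PendingRequest { int memoryMb; int missedOpportunities; }
      static class Node { int freeMemoryMb; }

      final List<PendingRequest> allPendingRequests = new ArrayList<>();

      // Called once per node heartbeat, under a global scheduler lock.
      synchronized void onNodeHeartbeat(Node node) {
        for (PendingRequest req : allPendingRequests) { // scans EVERY pending request
          if (node.freeMemoryMb >= req.memoryMb) {
            node.freeMemoryMb -= req.memoryMb;          // allocate on this node
          } else {
            req.missedOpportunities++;                  // heartbeat counting later
          }                                             // relaxes locality to rack/ANY
        }
      }
    }

With thousands of pending requests, thousands of nodes heartbeating every second, and everything serialized behind one lock, this scan is where the time goes.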
Bottlenecks
• The scheduler loop is very expensive as it iterates over all the outstanding requests
• Counting missed opportunities to relax locality hurts scale and is hard to tune
• Creating immutable "Resource" objects during heartbeat processing is very expensive
• Log4j is synchronous by default
• Only one allocation per node heartbeat
Improvements
1. Scheduler key pruning
• Each node heartbeat looks only at the outstanding allocation requests for that node
2. Time-based decay for locality
• Use elapsed time to decide whether to downgrade to rack or anywhere in the cluster
• Reduces the work done in each node heartbeat
3. Switched to the async log4j appender
• Reduces expensive lock and IO contention
4. Metrics to track allocation latencies and QPS
• TTD (time to detect) and TTM (time to mitigate) are critical to achieve 99.9% availability/reliability
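A minimal sketch of how the first two improvements fit together, assuming hypothetical types and made-up decay thresholds (the real change lives inside the CapacityScheduler; this only shows the shape of the idea):

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Pending requests are indexed by scheduler key so each heartbeat only
    // inspects requests relevant to that node, and locality relaxes by
    // elapsed time instead of counting missed heartbeats.
    class PrunedSchedulerSketch {
      static final long NODE_TO_RACK_MS = 1_000; // hypothetical decay thresholds
      static final long RACK_TO_ANY_MS = 3_000;

      static class PendingRequest {
        final long createdMs = System.currentTimeMillis();
        boolean allocated; // set once served; stale index entries are skipped
      }

      // Scheduler key (node name, rack name, or "*") -> requests wanting it.
      final Map<String, Queue<PendingRequest>> index = new HashMap<>();

      synchronized void add(PendingRequest req, String node, String rack) {
        // One request is indexed under all three keys it could be served from.
        for (String key : new String[] {node, rack, "*"}) {
          index.computeIfAbsent(key, k -> new ArrayDeque<>()).add(req);
        }
      }

      // Called per node heartbeat: O(requests for this node), not O(all pending).
      synchronized PendingRequest onNodeHeartbeat(String node, String rack) {
        long now = System.currentTimeMillis();
        PendingRequest req = poll(node, now, 0);                  // node-local first
        if (req == null) req = poll(rack, now, NODE_TO_RACK_MS);  // decayed to rack
        if (req == null) req = poll("*", now, RACK_TO_ANY_MS);    // decayed to ANY
        return req; // null: nothing schedulable here on this heartbeat
      }

      private PendingRequest poll(String key, long now, long minAgeMs) {
        Queue<PendingRequest> q = index.get(key);
        while (q != null && !q.isEmpty()) {
          PendingRequest head = q.peek();
          if (head.allocated) { q.poll(); continue; }       // already served elsewhere
          if (now - head.createdMs < minAgeMs) return null; // not decayed yet
          head.allocated = true;
          return q.poll();
        }
        return null;
      }
    }

The key properties are that a heartbeat now touches only the queues for its own node, its rack, and ANY, and that locality relaxes on a wall-clock schedule independent of how often any particular node happens to heartbeat.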
Test results
• Test cluster with 4000 nodes
• RM-NM and RM-AM heartbeats set to 1 sec
• Using the log4j async logger

Stage: Before improvements + relaxLocality ON
• Allocation latency @ 95th %tile < 10s
• Promotion latency @ 95th %tile < 10s
• Node locality: <10%
• ANY locality: >80%
• Sustained load < 500 QPS

Stage: Scheduler key pruning
• Allocation latency @ 95th %tile < 4s
• Promotion latency @ 95th %tile < 4s
• Node locality: 99.51%
• Rack locality: 0.23%
• Sustained load < 2000 QPS

Stage: Time-based decay for locality
• Allocation latency @ 95th %tile < 3s
• Promotion latency @ 95th %tile < 3s
• Node locality: 99.84%
• Rack locality: 0.11%
• Sustained load < 3000 QPS
Test results
[Charts: Allocated vs Pending containers, before and after the improvements]
Example dashboard
[Screenshot: dashboard panels showing pending/allocated, paused, and queued containers; percentile latency and QPS for Allocate and Promote; locality metrics; and running containers by type]
Scheduler Load Simulator (aka SLS)
• Single-machine setup to stress test the YARN scheduler (part of OSS since 2.6)
• Allows us to try out clusters with any number of nodes and different configurations (NM-RM heartbeat frequency, scheduler configurations, etc.)
• Updated SLS to generate load similar to our workloads:
• Container request waves
• Container duration
• Multiple apps
• Allocation ID
• Relax locality
• Opportunistic allocations
• Container promotions
We can now try out different settings and fixes in a matter of minutes!
[Diagram: a simulator thread mimicking the Scope AM drives a real YARN RM, with mock NM threads heartbeating (HB) into it]
Scaling YARN to 40K+ nodes and beyond
Sarvesh Sakalanaga
Senior Engineering Manager
Federation – High level architecture
[Diagram: a YARN Client submits an app through the Router Service; Federation Services (Router, Policy, State) front YARN Sub-Clusters #1–#3, each with its own RM running tasks on the servers in the datacenter; an AM RM Proxy Service (per NodeManager) hosts the Federation Interceptor, a UAM pool and a smart policy, and starts containers across sub-clusters]
• Router Service: implements the Client-RM protocol; a stateless, scalable service run as multiple instances behind a load balancer
• State store: a centralized, highly-available repository
• AM RM Proxy Service: implements the AM-RM protocol; hosted in the NM; intercepts all AM-RM communications
• Sub-clusters are unmodified standalone YARN clusters with about 4K nodes
• Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level
• No code change in any application
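For reference, federation is switched on through ordinary yarn-site.xml properties; the sketch below sets them in code for brevity. The property names follow the Hadoop YARN Federation documentation, but treat them as something to verify against your Hadoop version:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Sketch of the configuration that wires up federation.
    public class FederationConfigSketch {
      public static YarnConfiguration subClusterRmConf() {
        YarnConfiguration conf = new YarnConfiguration();
        conf.setBoolean("yarn.federation.enabled", true);
        // Each sub-cluster RM identifies itself to the federation state store.
        conf.set("yarn.resourcemanager.cluster-id", "subcluster-1");
        return conf;
      }

      public static YarnConfiguration nodeManagerConf() {
        YarnConfiguration conf = new YarnConfiguration();
        // The AMRMProxy runs inside every NM and intercepts AM<->RM traffic.
        conf.setBoolean("yarn.nodemanager.amrmproxy.enabled", true);
        conf.set("yarn.nodemanager.amrmproxy.interceptor-class.pipeline",
            "org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor");
        return conf;
      }
    }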
Scope – main workload
In aggregate, SCOPE generates 4,000 QPS (avg) / 10,000 QPS (max)
Container allocation needs to be <5 sec @ 95th %tile
Node locality on every request across the DC is important for latency
Production challenges
1. Load shaping
2. Cluster maintenance
3. Log management
1: Load shaping
Limitations of BroadcastAMRMProxyPolicy
• RM scalability: it increases QPS in all the sub-clusters
• Cluster utilization: duplicate allocations for each sub-cluster
Solution: LocalityMulticastAMRMProxyPolicy
1. Create UAMs on demand
2. Route allocations to the sub-clusters that own the requested racks
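The intuition behind the locality-aware policy fits in a few lines. The sketch below is hypothetical Java that does not use the real FederationAMRMProxyPolicy interface; it only illustrates the routing idea of sending each request to the sub-cluster that owns it:

    import java.util.HashMap;
    import java.util.Map;

    // Instead of broadcasting every request to all sub-clusters, route each
    // request to the sub-cluster that owns the requested rack, falling back
    // to the job's home sub-cluster for ANY requests.
    class RackAwareRoutingSketch {
      final Map<String, String> rackToSubCluster = new HashMap<>(); // rack -> sub-cluster id
      final String homeSubCluster;

      RackAwareRoutingSketch(String homeSubCluster) {
        this.homeSubCluster = homeSubCluster;
      }

      /** Returns the sub-cluster that should serve a request for this resource name. */
      String route(String resourceName) {
        if ("*".equals(resourceName)) {
          return homeSubCluster;          // ANY: keep load on the home sub-cluster
        }
        String owner = rackToSubCluster.get(resourceName);
        return owner != null ? owner : homeSubCluster; // unknown rack: fall back
      }
    }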
2: Cluster maintenance
Each DC has constant machine movement
• 13 sub-clusters split our 40K+ machine DC
• ~800 machines/day need some form of maintenance
• Clusters keep growing and changing (e.g. decommissioning RACKs)
Solution: use a Sub-Cluster Manager for balancing
1. Node-to-sub-cluster resolver service
2. Dynamic balancing of sub-cluster capacity
3. NM maintenance mode: container draining
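A hypothetical sketch of the resolver idea, with made-up types: a small service owns the node-to-sub-cluster mapping so it can be changed at runtime as machines are rebalanced or drained:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Maps every node to its current sub-cluster and maintenance state.
    class SubClusterResolverSketch {
      enum State { HEALTHY, MAINTENANCE }

      static final class Assignment {
        final String subCluster; final State state;
        Assignment(String subCluster, State state) {
          this.subCluster = subCluster; this.state = state;
        }
      }

      private final Map<String, Assignment> byNode = new ConcurrentHashMap<>();

      /** Rebalancing: move a node to another sub-cluster. */
      void assign(String node, String subCluster) {
        byNode.put(node, new Assignment(subCluster, State.HEALTHY));
      }

      /** Maintenance mode: stop placing new work; running containers drain first. */
      void drain(String node) {
        byNode.computeIfPresent(node,
            (n, a) -> new Assignment(a.subCluster, State.MAINTENANCE));
      }

      String subClusterOf(String node) {
        Assignment a = byNode.get(node);
        return a == null ? null : a.subCluster;
      }
    }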
2: Cluster maintenance scenario: adding machines
[Diagram: the Sub-cluster Manager moves racks of machines among Yarn Sub-Clusters #1–#3; racks are marked HEALTHY or MAINTENANCE, each hosting opportunistic (O) and guaranteed (G) containers]
3: Log handling
Log volume per DC: 2 PB/day; 2.5 GB/hour per machine
1. Custom log aggregator that works with our Scope AM to keep logs only for critical-path containers and failed containers
2. Custom tools to aggregate container logs outside of YARN on an as-needed basis
3. Custom log search tool (Helios – internal code name) that indexes YARN logs based on keywords
4. Scope NM AUX service to keep track of key information on container stats
Next Steps
Scalability
• Multi-cast policy tuning for better placements, container reuse, lightweight "Resource" objects, multiple node allocations in the same heartbeat
Utilization
• Container resize, improved opportunistic container utilization with system priorities, Federation Global Policy Generator enhancements, and relaxed locality for opportunistic containers
Operability
• ATS v2, more metrics and logs, and better log management
And many more…!
Fully committed to YARN and Open Source
Committed
• Federation: YARN-2915
• AllocationID: YARN-4879
• Support distributed scheduling (AKA Mercury): YARN-2877
• AMRMProxy: YARN-2884
• UnmanagedAM pool manager: YARN-5531
• Federation Interceptor: YARN-6511
• Locality – Multicast Policy: YARN-5325
In progress/Open
• Federation phase 2: YARN-5597
• GPG: YARN-3660
• Router: YARN-3659
• Pausing of Opportunistic containers: YARN-5972
• Container Promotion: YARN-5085
• Scheduling of Opp through YARN RM: YARN-5220
Conclusion
Largest YARN deployment in the world!
...and we are growing 5X next year
…with more OSS workloads
Our Journey has just started!
Editor's Notes
1. 1) When we started to run YARN in a cluster with over 3000 nodes, a few Scope jobs would easily overwhelm the YARN RM. For instance, just 10 terasorts could cause the YARN RM to fall over. 2) We would see long delays in allocation latencies and in most cases never got a container back from the YARN RM. 3) Another challenge was that telemetry and logging were somewhat missing, and it was incredibly hard to diagnose issues in production. 4) It is also quite expensive to test in big clusters, as the cycle of deployment and testing can be a few hours in the best case. 5) This graph shows the number of allocated and pending containers in the YARN RM. We use these metrics as a proxy to know whether the YARN RM can keep up with the rate at which AMs are requesting containers. 6) The Y axis is the # of containers and the X axis shows time. The blue line shows the # of allocated containers while the red line shows the # of pending containers. If the number of outstanding containers becomes too high, the YARN RM won't be able to do any allocations.
2. At a very high level, here is a quick recap of how the capacity scheduler does scheduling. The application masters send container requests to the YARN RM, and these get added to a "pending queue" inside the RM. The YARN RM also receives heartbeats from all the nodes in the cluster. Upon receiving a node heartbeat, the YARN RM takes a lock and looks at the pending queue to see what can be satisfied by that node. If the node can satisfy a request then an allocation is done; otherwise the YARN RM looks through the other pending requests in the queue. One major issue here is that upon each node heartbeat the YARN RM looks at all the outstanding requests, which may not even be satisfiable by that node. Further, the algorithm used to decay locality from node to rack and ANY is based on heartbeat counting. The summary is that we spend a lot of time in each heartbeat processing, and with a high allocation request rate and 1 sec node heartbeats, it becomes hard for the YARN RM to do allocations fast enough.
3. Testing on real clusters is quite expensive as it involves deploying the bits, generating the load, and then analyzing GBs of logs. This is a long cycle, and to improve on it we started to leverage SLS. To recap, SLS is the Scheduler Load Simulator and has been a tool in Hadoop since 2.6. It allows you to start the YARN RM with a given config and simulates multiple NMs heartbeating into it. It can then simulate submitting different apps with containers starting and exiting. We have extended SLS to generate load that is similar to the workload we run. This includes sending waves of container requests, containers living for different durations, multiple apps, allocations with allocation IDs, relaxed locality, and OPPORTUNISTIC allocations. A big shout out to everyone who worked on and contributed to SLS. This tool has been a life saver! Today SLS is one of the key tools we use for testing any changes being made to the YARN RM.
4. Add callout: http://hadoopsummit.org/san-jose/agenda/
5. Talk about CISL. Router: add multiple instances. Talk about policies.
6. Talk about why we did not use YARN log aggregation. Talk about our custom AM talking to PN to handle logs. Talk about sub-cluster to node mapping (CWS app id). Timeline server.
7. Remove links.