Apache YARN 3.x in Alibaba
Weiwei Yang, Chunde Ren
1
Agenda
1 Apache YARN Ecosystem in Alibaba
2 Omni Scheduler
3 Resource Extension & Isolation
4 Future Plans
2
Apache YARN Ecosystem in Alibaba
Part I
3
Apache YARN Ecosystem in Alibaba
[Diagram: BI, Recommendation, Security, Search and Ads workloads (streaming + batch) run on Apache YARN over Apache HDFS.]
Challenges
5
[Diagram: key challenges are Service SLA and Utilization.]
Omni Scheduler
Part II
6
Motivation
7
Existing Capacity/Fair Scheduler:
● Not ready to support resource oversubscription
● Unable to make globally good scheduling decisions
● Limited support for online services
● No dynamic (re)scheduling ability
OmniScheduler Architecture
8
Key building blocks: an enhanced resource protocol, placement constraints, and Guaranteed/Opportunistic (G/O) containers.
[Architecture diagram: the Resource Manager hosts the OmniScheduler (Main Scheduler plus ReScheduler) and a Placement Constraints Manager; each Node Manager runs a Resource Tracker, Node Status Updater, QoS Container Monitor and Container Scheduler, and hosts Guaranteed (G) and Opportunistic (O) containers; AMs and services send requests to the OmniScheduler.]
OmniScheduler - A Closer Look
9
[Diagram: the Main-Scheduler keeps scheduler state (node, app, queue, requests) plus an ephemeral scheduler state; NM heartbeats update the state; a Sorting Nodes Manager, sorting policies and a Candidates Selector turn each request into node candidates; multiple allocate threads propose allocations (ALLOCATE) and a commit thread tries to commit them (TRY / COMMIT); the Re-Scheduler, Placement Constraints Manager and Allocation Reviewer work alongside, interacting with Node Managers and Application Masters; scheduler metrics are exported.]
Highlights
10
[Diagram: the same Main-Scheduler / Re-Scheduler flow as on the previous slide, annotated with the five highlights below.]
1. Integrated Opportunistic container scheduling into the Capacity Scheduler
2. Filter candidates by placement constraints
3. Offer candidates ordered by actual utilization
4. Reschedule to optimize allocation placements
5. Optimized allocate logic to improve performance and reduce proposal conflicts
What is resource oversubscription
11
[Chart: utilization (0-100%) over time; the gap between the Allocated line and the Actual Utilized line is WASTED capacity.]
12
The Problem
Allocated: 8.5 million cores
Actual Utilized Peak: 3.3 million cores
Actual Utilized Trough: 1.2 million cores
86.2% Allocated
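In other words, even at peak only about 3.3 / 8.5 ≈ 39% of the allocated cores are actually used, and at the trough only about 14%, which is exactly the wasted gap shown on the previous slide.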
Resource Oversubscription
We integrated Opportunistic container scheduling into the Capacity Scheduler; it leverages a dynamic over-allocation threshold to keep oversubscription within a reasonable range (a minimal sketch of this check follows the objectives below). We also added a QoS module on the NM to manage the lifecycle of Opportunistic containers.
Objective
1. Better Fairness
2. Scheduling with predicted resource utilization
3. 2 thresholds (G+O/O) and 2 factors (min, predict)
4. QoS (isolation, elasticity, ...)
5. Future: optimized preemption decisions that take preemption cost into account
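The exact thresholding logic is internal to OmniScheduler; the following is a minimal, hypothetical Java sketch of how a node-level check combining the two thresholds (G+O and O) and the two utilization factors (min, predict) could look. The class name, the ratios and the way the two factors are combined are assumptions for illustration, not the real implementation.

// Hypothetical sketch: decide whether a node may accept one more Opportunistic (O) container.
public final class OverAllocationPolicy {
  private final double goAllocationLimit;   // G+O threshold, e.g. allow allocation up to 1.5x capacity
  private final double oUtilizationLimit;   // O threshold on real utilization, e.g. 0.8

  public OverAllocationPolicy(double goAllocationLimit, double oUtilizationLimit) {
    this.goAllocationLimit = goAllocationLimit;
    this.oUtilizationLimit = oUtilizationLimit;
  }

  /**
   * @param allocatedVcores  vcores already allocated (G + O) on the node
   * @param capacityVcores   node vcore capacity
   * @param predictedUsage   predicted utilization in [0,1] for the next interval
   * @param minRecentUsage   minimum observed utilization in [0,1] over a recent window
   */
  public boolean canOverAllocate(long allocatedVcores, long capacityVcores,
                                 double predictedUsage, double minRecentUsage) {
    boolean underAllocationCap = allocatedVcores < capacityVcores * goAllocationLimit;
    // Combine the two factors conservatively: use the larger estimate of real usage.
    double usageEstimate = Math.max(predictedUsage, minRecentUsage);
    boolean underUtilizationCap = usageEstimate < oUtilizationLimit;
    return underAllocationCap && underUtilizationCap;
  }
}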
[Diagram: the scheduler allocates Guaranteed (G) and Opportunistic (O) containers; on the NM, the QoS Container Monitor keeps an Opportunistic containers queue and manages the O containers' lifecycle (steps 1-4).]
13
Placement Constraints
14
(1) Anti-affinity: don't place A with B on the same node
(2) Affinity: do place A with B on the same node
(3) Node affinity: do place A on a node that has …
Placement Constraints
[Diagram: a Scheduling Request for 2 containers with allocation tag tf-ps carries the composite constraint AND(in, node, tf-ps; notin, node, hostModel=F53); the Capacity Scheduler allocates C1 and C2 on NM1/NM2 (hostModel=F41) and avoids NM3 (hostModel=F53), combining an Allocation Tag constraint with a Node Attribute constraint.]
15
Advanced: Allocation tag Namespace, composite constraints, operators.
Related issues: YARN-6592, YARN-7812, YARN-3409.
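For reference, a constraint like the one above can be expressed with the PlacementConstraints builder API shipped with Apache Hadoop 3.1+ (node-attribute targets came with the node attributes work in 3.2). This is a minimal sketch under those assumptions, reusing the tf-ps tag and the hostModel attribute from the slide; how the constraint is attached to a SchedulingRequest is only indicated in a comment.

import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.*;

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

public class TfPsConstraintExample {
  public static void main(String[] args) {
    // Composite AND constraint mirroring the slide:
    //   in, node, tf-ps            -> place on nodes that already hold the "tf-ps" allocation tag
    //   notin, node, hostModel=F53 -> avoid nodes whose hostModel attribute equals F53
    // (the nodeAttribute target assumes Hadoop 3.2's node-attributes support)
    PlacementConstraint constraint =
        and(
            targetIn(NODE, allocationTag("tf-ps")),
            targetNotIn(NODE, nodeAttribute("hostModel", "F53")))
        .build();

    // The constraint is then attached to a SchedulingRequest (numAllocations = 2,
    // allocation tag "tf-ps") and sent to the RM through the AM's allocate call.
    System.out.println(constraint);
  }
}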
Node Scorer
Highlights
➔ Sorting Policy can be specified at queue level or job level.
➔ The sorting interval of each policy is configurable; if set to zero, it runs live sorting.
16
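The node scorer and sorting policies are internal to OmniScheduler, so the following is a purely hypothetical Java sketch of what a pluggable, queue- or job-level scoring policy could look like; NodeScorer, NodeSnapshot and UtilizationScorer are illustrative names, not the real API.

// Hypothetical sketch, not the OmniScheduler API: a pluggable node scoring policy.
public interface NodeScorer {
  /** Score in [0.0, 1.0]; in this sketch a higher score means a better placement candidate. */
  double score(NodeSnapshot node);
}

/** Minimal read-only view of a node, as a scorer would see it. */
final class NodeSnapshot {
  final String nodeId;
  final double cpuUtilization;   // actual utilization in [0,1], not allocated share

  NodeSnapshot(String nodeId, double cpuUtilization) {
    this.nodeId = nodeId;
    this.cpuUtilization = cpuUtilization;
  }
}

/** Example policy bound at queue or job level: prefer nodes with lower actual utilization. */
final class UtilizationScorer implements NodeScorer {
  @Override
  public double score(NodeSnapshot node) {
    return 1.0 - node.cpuUtilization;
  }
}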
Rescheduler
We are not only concerned about allocations made at the “scheduling” phase!
Objective
✓ Dynamically optimize container distribution across the cluster
Our use cases:
✓ Eliminate Hotspot
✓ Eliminate Fragmentation (future)
17
Eliminate Hotspots
18
[Diagram: within OmniScheduler, the MainScheduler allocates incoming requests while the ReScheduler preempts containers to relieve hot nodes.]
Rescheduler - Eliminate Hotspot
19
[Diagram: the Rescheduler's Candidates Selector reads the scheduler state to find hotspot nodes, marks victim containers from App1/App2 (MARK_CONTAINER_FOR_PREEMPTION) and notifies the Capacity Scheduler / MainScheduler in the Resource Manager, which re-places the containers on other Node Managers.]
Highlights
1. High/low water marks
2. Lazy marker: weakens the impact of momentary utilization spikes
3. Observe mode
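As an illustration only (not the actual Rescheduler code), the high/low water marks and the lazy marker could be combined as below; the thresholds, counter and report cadence are assumptions. In observe mode, the outcome would merely be logged instead of triggering MARK_CONTAINER_FOR_PREEMPTION.

// Hypothetical sketch: hotspot detection with high/low water marks and a lazy marker.
public final class HotspotDetector {
  private final double highWaterMark;   // e.g. 0.85: node becomes a hotspot candidate above this
  private final double lowWaterMark;    // e.g. 0.60: node leaves the hotspot set below this
  private final int lazyThreshold;      // consecutive hot reports required before marking

  private int consecutiveHotReports = 0;
  private boolean hotspot = false;

  public HotspotDetector(double highWaterMark, double lowWaterMark, int lazyThreshold) {
    this.highWaterMark = highWaterMark;
    this.lowWaterMark = lowWaterMark;
    this.lazyThreshold = lazyThreshold;
  }

  /** Feed one node utilization report in [0,1]; returns true while the node is treated as a hotspot. */
  public boolean onUtilizationReport(double utilization) {
    if (utilization >= highWaterMark) {
      consecutiveHotReports++;
      if (consecutiveHotReports >= lazyThreshold) {
        hotspot = true;                  // lazy marker: only mark after sustained pressure
      }
    } else {
      consecutiveHotReports = 0;
      if (utilization <= lowWaterMark) {
        hotspot = false;                 // hysteresis: only unmark below the low water mark
      }
    }
    return hotspot;
  }
}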
Result
Enable Node Scorer
20
Reduce Allocation Proposal Conflicts
An allocate thread is fast (often tens of milliseconds) while sorting is slow (depending on the number of nodes), so many allocate threads may see the same sorting result. If they all allocate in the same order, that creates a lot of proposal conflicts. Therefore, we changed the sorting strategy to “Partition Score Sort”.
[Diagram: the Node Scorer buckets nodes by score, [0.0,0.1], (0.1,0.2], …, (0.9,1.0]; each allocate thread (App1, App2, …) picks a random node from the best bucket and falls back to the next bucket only when its request is unsatisfied, so concurrent threads rarely propose the same node.]
21
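A minimal Java sketch of the idea follows; the 0.1-wide buckets and the convention that a higher score is better are assumptions, and a real scheduler would also fall back to the next bucket when a node in the current one cannot satisfy the request.

// Illustrative sketch of "Partition Score Sort": group nodes into score buckets and let each
// allocate thread pick randomly inside the best non-empty bucket, reducing proposal conflicts
// between threads that see the same (possibly stale) sorting result.
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

public final class PartitionScoreSort {
  private static final int BUCKETS = 10;                 // score range [0,1] split into 0.1-wide buckets
  private final List<List<String>> buckets = new ArrayList<>();

  public PartitionScoreSort() {
    for (int i = 0; i < BUCKETS; i++) {
      buckets.add(new ArrayList<>());
    }
  }

  /** Place a node into the bucket matching its score in [0,1]. */
  public void addNode(String nodeId, double score) {
    int idx = Math.min((int) (score * BUCKETS), BUCKETS - 1);
    buckets.get(idx).add(nodeId);
  }

  /** Pick a random node from the highest-score non-empty bucket, falling back to lower buckets. */
  public Optional<String> pickCandidate() {
    for (int i = BUCKETS - 1; i >= 0; i--) {
      List<String> bucket = buckets.get(i);
      if (!bucket.isEmpty()) {
        return Optional.of(bucket.get(ThreadLocalRandom.current().nextInt(bucket.size())));
      }
    }
    return Optional.empty();
  }
}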
Performance Improvements
● Throughput improvement
○ Allocate a batch of requests in each allocate iteration
● Optimize RESERVE container behavior
○ Lazy Reservation: Do not reserve container until all candidates cannot satisfy the request
○ Accelerate reservation allocation: attempt to allocate reserved container when heartbeat arrives
22
Cluster size: 10K nodes
Node capability: 128 GB memory, 128 vcores
Workload: 23.5K jobs, 1000 tasks per job
Task: 1 GB - 8 GB memory, 5 s - 10 s execution time
Improvements for Online Service
Part III
23
YARN - Resource Management System
A Resource Management System, but not managing all resources?!
24
[Diagram: resource kinds a cluster has to manage: CPU, GPU, FPGA, Memory, Disk, Network, IP, Port.]
The current resource types are not able to support all of these!
Multi-dimensional Resources
Resource categories: COUNTABLE (existing), plus two new ones, SET and RESOURCE SET.

COUNTABLE, e.g. memory:
{ "name" : "memory", "units" : "mb", "value" : "1024" }

SET (new), e.g. IP addresses:
{ "name" : "IP", "values" : ["10.100.0.1", "0.100.0.2", "100.100.0.3"] }

RESOURCE SET (new), e.g. disks:
disks : [ {attributes : {"type":"sata", "index":"1"}, size : 100, iops : 100, ratio : 100},
          {attributes : {"type":"ssd", "index":"2"}, size : 100, iops : 100, ratio : 100},
          {attributes : {"type":"ssd", "index":"9999"}, size : 40, iops : 40, ratio : 40} ]
25
Disk Resource - Workflow
26
[Diagram: disk request, allocation and container launch flow between the AM, the Resource Manager (Capacity Scheduler), the Resource Tracker and the Node Manager, steps 1-6, illustrated by the boxes below.]

Ask:
  1 disk with size 10g
  Attribute contains: type=SATA

NM resource:
  Memory: 10g, Vcore: 100
  disk1: { attributes : { index: 1, type: SATA, localDir: “/mnt/disk1”}, size: 100g, iops: 1000}
  disk2: …

Response:
  disk1: { attributes : { index: 1, type: SATA}, size: 100g, iops: 1000}

Container Launch:
  disk1: { attributes : { index: 1, type: SATA}, size: 100g, iops: 1000}
  Workdir: /mnt/disk1/application_xx/...
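The enhanced resource protocol carrying such disk asks is internal, so the types below are hypothetical stand-ins that only mirror the "1 disk with size 10g, type=SATA" request shown above.

// Hypothetical sketch: modelling the disk ask from the slide as a simple value object.
import java.util.Map;

public final class DiskAsk {
  final int count;                        // number of disks requested
  final long sizeGb;                      // minimum size per disk, in GB
  final Map<String, String> attributes;   // attribute constraints, e.g. type=SATA

  DiskAsk(int count, long sizeGb, Map<String, String> attributes) {
    this.count = count;
    this.sizeGb = sizeGb;
    this.attributes = attributes;
  }

  public static void main(String[] args) {
    // "Ask: 1 disk with size 10g, attribute contains type=SATA"
    DiskAsk ask = new DiskAsk(1, 10, Map.of("type", "SATA"));
    System.out.println(ask.count + " disk(s), >= " + ask.sizeGb + "g, attributes " + ask.attributes);
  }
}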
Resource Isolation
27
[Diagram: CPU isolation modes: Quota, Cpuset, Share, Quota + Priority.]
Resource Isolation - Cpuset
28
Related issue: YARN-8320
CPU context switches impact the latency of a latency-sensitive (LS) task.
Our solution: support cpu_share_mode via cgroups cpuset.
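YARN-8320 tracks the related upstream work; the snippet below only illustrates the underlying cgroups v1 cpuset mechanism, not the actual patch, and the cgroup path, container id and core list are assumptions.

// Hypothetical sketch: pin a latency-sensitive (LS) container to dedicated cores with cpuset,
// so batch/opportunistic containers on the remaining cores do not cause context switches on them.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class CpusetIsolation {
  private static final Path CPUSET_ROOT = Paths.get("/sys/fs/cgroup/cpuset/hadoop-yarn");

  /** Create a cpuset cgroup for the container and restrict it to the given cores. */
  public static void pinContainer(String containerId, String cpuList) throws IOException {
    Path group = CPUSET_ROOT.resolve(containerId);
    Files.createDirectories(group);
    // cpuset requires both cpus and mems to be set before tasks can be attached to the group.
    Files.write(group.resolve("cpuset.cpus"), cpuList.getBytes(StandardCharsets.UTF_8));  // e.g. "0-3"
    Files.write(group.resolve("cpuset.mems"), "0".getBytes(StandardCharsets.UTF_8));      // NUMA node 0
  }

  public static void main(String[] args) throws IOException {
    pinContainer("container_e01_0001_01_000002", "0-3");
  }
}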
Resource Oversubscription - A Step Forward
29
“Reserve the allocation for me, but share the idle resources with O containers.”
[Diagram: a Service AM reserves a guaranteed allocation on a Node Manager while its unused resources are lent to Opportunistic containers from a Batch AM; the OmniScheduler coordinates the hand-off between the AMs and the Node Managers (steps 1-6).]
Future Work
Part IV
30
Future Work
● Rescheduler
○ Leverage ML to minimize the cost of movements
● Comprehensive Preemption
● Performance
○ High throughput & low latency
● Online service features
○ Volume
○ Pod
31
THANKS / Thank you for listening
--------- Q&A Section --------
32
