Demand-Based Coordinated Scheduling for SMP VMs

Demand-Based Coordinated
Scheduling for SMP VMs
Hwanju Kim1, Sangwook Kim2, Jinkyu Jeong1, Joonwon Lee2,
and Seungryoul Maeng1
Korea Advanced Institute of Science and Technology (KAIST)1
Sungkyunkwan University2
The 18th International Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS)
Houston, Texas, March 16-20 2013
1

Software Trends in Multi-core Era
• Making the best use of HW parallelism
• Increasing “thread-level parallelism”
HW
SW
“Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications”,
Proceedings of IEEE, 2008
Processor
OS
App App App
Apps increasingly being multithreaded
RMS apps are “emerging killer apps”
Processors increasingly adding more cores
2/28

• Synchronization (communication)
• The greatest obstacle to the performance of
multithreaded workloads
HW
SW
Processor
OS
App App App
Barrier
Barrier
Thread
Lock wait
SpinlockSpin
wait
CPU
3/28

• Virtualization
• Ubiquitous for consolidating multiple workloads
• “Even OSes are workloads to be handled by VMM”
HW
SW
Processor
OS
App App App
OS OS
VMM
SMP
VM
SMP
VM
SMP
VM
“Synchronization-conscious coordination”
is essential for VMM to improve efficiency
Virtual CPU (vCPU) as a software entity
dictated by VMM scheduler
4/28

Coordinated Scheduling
vCPU
VMM scheduler VMM
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Time
shared
Uncoordinated scheduling
 A vCPU treated as an independent entity
Independent
entity
vCPU
VMM scheduler VMM
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Coordinated scheduling
 Sibling vCPUs treated as a group
(who belongs to the same VM)
Coordinated
group
vCPU
vCPU
vCPU
Time
shared
Lock
holder
Lock
waiter
Lock
waiter
Running
Waiting
Waiting
Uncoordinated scheduling makes
inter-vCPU synchronization ineffective
Time
shared
5/28

Prior Efforts for Coordination
Coscheduling [Ousterhout82]
: Synchronizing execution
Time
pCPU
pCPU
pCPU
pCPU
vCPU execution
Illusion of dedicated multi-core,
but CPU fragmentation
Relaxed coscheduling [VMware10]
: Balancing execution time
Time
pCPU
pCPU
pCPU
pCPU
Stop execution for siblings to catch up
Good CPU utilization & coordination,
but not based on synchronization demands
Time
pCPU
pCPU
pCPU
pCPU
Balance scheduling [Sukwong11]
: Balancing pCPU allocation
Good CPU utilization & coordination,
but not based on synchronization demands
Selective coscheduling [Weng09,11]…
: Coscheduling selected vCPUs
Time
pCPU
pCPU
pCPU
pCPU
Better coordination through explicit information,
but relying on user or OS support
Selected vCPUs
Need for VMM scheduling based on
synchronization (coordination) demands
6/28

Overview
• Demand-based coordinated scheduling
• Identifying synchronization demands
• With non-intrusive design
• Not compromising inter-VM fairness
Time
pCPU
pCPU
pCPU
pCPU
Demand of coscheduling for synchronization
Demand of delayed preemption for synchronization
Preemption
attempt
7/28

Coordination Space
• Time and space domains
• Independent scheduling decision for each domain
Space
Where to schedule?
Time
When to schedule?
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Coordinated
groupPreemptive scheduling policy
 Coscheduling
 Delayed preemption
pCPU assignment policy
8/28

Outline
• Motivation
• Coordination in time domain
• Kernel-level coordination demands
• User-level coordination demands
• Coordination in space domain
• Load-conscious balance scheduling
• Evaluation
vCPU
pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Space
Time
9/28

Synchronization to be Coordinated
• Synchronization based on “busy-waiting”
• Unnecessary CPU consumption by busy-waiting
for a descheduled vCPU
• Significant performance degradation
• Semantic gap
• “OSes make liberal use of busy-waiting (e.g., spinlock)
since they believe their vCPUs are dedicated”
 Serious problem in kernel
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
• When and where to demand synchronization?
• How to identify coordination demands?
10/28

Kernel-Level Coordination Demands
• Does kernel really need coordination?
• Experimental analysis
• Multithreaded applications in the PARSEC suite
• Measuring “kernel time” when uncoordinated
Solorun (no consolidation) Corun (w/ 1 VM running streamcluster)A 8-vCPU VM
on 8 pCPUs
0%
20%
40%
60%
80%
100%
blackscholes
bodytrack
canneal
dedup
facesim
ferret
fluidanimate
freqmine
raytrace
streamcluster
swaptions
vips
x264
CPUtime(%)
Kernel time User time
0%
20%
40%
60%
80%
100%
blackscholes
bodytrack
canneal
dedup
facesim
ferret
fluidanimate
freqmine
raytrace
streamcluster
swaptions
vips
x264
CPUtime(%)
Kernel time ratio is largely amplified by x1.3-x30
 “Newly introduced kernel-level contention”
11/28

Kernel-Level Coordination Demands
• Where is the kernel time amplified?
0%
20%
40%
60%
80%
100%
blackscholes
bodytrack
canneal
dedup
facesim
ferret
fluidanimate
freqmine
raytrace
streamcluster
swaptions
vips
x264
CPUtime(%)
0%
20%
40%
60%
80%
100%
CPUusageforkerneltime(%)
TLB shootdown Lock spinning Others
Kernel time breakdown by functions
Dominant sources
1) TLB shootdown
2) Lock spinning
How to identify?
12/28

How to Identify TLB Shootdown?
• TLB shootdown
• Notification of TLB invalidation to a remote CPU
CPU
Thread
CPU
Thread
Virtual address
space
TLB TLB
V->P1
V->P1
V->P1
V->P2 or V->Null
Modify
or
Unmap
Inter-processor interrupt (IPI)
Busy-waiting until all corresponding
TLB entries are invalidated
“Busy-waiting for TLB synchronization” is efficient in native systems,
but not in virtualized systems if target vCPUs are not scheduled.
(Even worse if TLBs are synchronized in a broadcast manner)
13/28

How to Identify TLB Shootdown?
• TLB shootdown IPI
• Virtualized by VMM
• Used in x86-based Windows and Linux
0%
20%
40%
60%
80%
100%
bodytrack
canneal
dedup
facesim
ferret
fluidani…
streamcl…
swaptions
vips
x264
CPUusageforkerneltime(%)
TLB shootdown Lock spinning Others
0
500
1000
1500
2000
bodytrack
canneal
dedup
facesim
ferret
fluidanim…
streamclu…
swaptions
vips
x264
#ofIPIs/vCPU/sec
“A TLB shootdown IPI is a signal for coordination demand!”
 Co-schedule IPI-recipient vCPUs with its sender vCPU
TLB shootdown IPI traffic
14/28

How to Identify Lock Spinning?
• Why excessive lock spinning?
• “Lock-holder preemption (LHP)”
• Short critical section can be unpredictably prolonged by
vCPU preemption
• Which spinlock is problematic?
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
0%
20%
40%
60%
80%
100%
Lockwaittime(%)
Other locks
Runqueue lock
Pagetable lock
Semaphore wait-queue lock
Futex wait-queue lock
Spinlock
wait time
breakdown
82%
93%
15/28

• Futex
• Linux kernel support for user-level synchronization
(e.g., mutex, barrier, conditional variables, etc)
mutex_lock(mutex)
/* critical section */
mutex_unlock(mutex)
futex_wake(mutex) {
spin_lock(queue->lock)
thread=dequeue(queue)
wake_up(thread)
spin_unlock(queue->lock)
}
mutex_lock(mutex)
futex_wait(mutex) {
enqueue(queue, me)
schedule() /* blocked */
vCPU1 vCPU2
/* wake-up */
/* critical section */
mutex_unlock(mutex)
futex_wake(mutex) {
Reschedule IPI
User-level
contention
Kernel-level
contention
If vCPU1 is preempted before releasing its spinlock,
vCPU2 starts busy-waiting on the preempted spinlock
 LHP!
Kernel
space
16/28
Preempted

• Why preemption-prone?
pCPU
vCPU1
vCPU0
VMExit
IPI emulation
Wait-queue lock
VMExit
APIC reg access
VMEntry
VMExit
APIC reg access
VMEntry
Wait-queue unlock
VMEntry
Wait-queue lock
spinning
 Prolonged by VMM intervention
 Multiple VMM interventions
for one IPI transmission
 Repeated by iterative wake-up
No more short critical section!
 Likelihood of preemption
 Preemption by woken-up sibling
 Serious issue
Remote thread wake-up
17/28

• Generalization: “Wait-queue locks”
• Not limited to futex wake-up
• Many wake-up functions in the Linux kernel
• General wake-up
• __wake_up*()
• Semaphore or mutex unlock
• rwsem_wake(), __mutex_unlock_common_slowpath(), …
• “Multithreaded workloads usually communicate
and synchronize on wait-queues”
“A Reschedule IPI is a signal for coordination demand!”
Delay preemption of an IPI-sender vCPU
until a likely-held spinlock is released
18/28

Outline
• Motivation
• Evaluation
vCPU
pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Space
Time
19/28

vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]
• Spreading sibling vCPUs on different pCPUs
• Increase in likelihood of coscheduling
• No coordination in time domain
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
pCPU pCPU pCPU pCPU
Uncoordinated scheduling Balance scheduling
vCPU stacking
Likelihood of
coscheduling
<
No vCPU stacking
20/28

vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]
• Limitation
• Based on “global CPU loads are well balanced”
• In practice, VMs with fair CPU shares can have
vCPU vCPU
vCPU
vCPU vCPU
x4 shares
SMP VM
UP VM
vCPU vCPU
vCPU
vCPU vCPUSMP VM
SMP VM
Inactive vCPUs
Single-threaded workload
Multithreaded workload
Different # of vCPUs Different TLP
0
200
400
600
800
5 15 25 35 45 55 65 75 85 95
CPUusage(%)
Time (sec)
canneal
0
200
400
600
800
1 4 7 10 13 16 19 22
CPUusage(%)
Time (sec)
dedup
TLP can be changed
in a multithreaded app
TLP: Thread-level parallelism
pCPU pCPU
vCPUvCPU
pCPU pCPU
vCPU vCPU
vCPU vCPU
High scheduling latency
Balance scheduling
on imbalanced loads
21/28

Proposed Scheme
• Adaptive scheme based on pCPU loads
• When assigning a vCPU, check pCPU loads
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
If load is balanced
 Balance scheduling
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU vCPU
vCPU
If load is imbalanced
 Favoring underloaded pCPUs
CPU load > Avg. CPU load
 overloaded
Handled by coordination
in time domain
22/28

Outline
• Motivation
• Evaluation
23/28

Evaluation
• Implementation
• Based on Linux KVM and CFS
• Evaluation
• Effective time slice
• For coscheduling & delayed preemption
• 500us decided by sensitive analysis
• Performance improvement
• Alternative
• OS re-engineering
24/28

Evaluation
• SMP VM with UP VMs
• One 8-vCPU VM + four 1-vCPU VMs (x264)
0.00
0.50
1.00
1.50
2.00
Normalizedexecutiontime
Workloads of 8-vCPU VM
Baseline Balance LC-Balance LC-Balance+Resched-DP LC-Balance+Resched-DP+TLB-Co
Futex-intensive
 5-53% improvement
TLB-intensive
 20-90% improvement
Performance of 8-vCPU VM
LC-Balance: Load-conscious balance scheduling
Resched-DP: Delayed preemption for reschedule IPI
TLB-Co: Coscheduling for TLB shootdown IPI
Non-synchronization-intensive
25/28
High scheduling latencyBalance
scheduling

Alternative: OS Re-engineering
• Virtualization-friendly re-engineering
• Decoupling reschedule IPI transmission from
thread wake-up
wake_up (queue) {
thread=dequeue(queue)
wake_up(thread)
}
Reschedule IPI
Delayed reschedule IPI transmission
• Modified wake_up func
• Using per-cpu bitmap
• Applied to futex_wakeup
& futex_requeue
One 8-vCPU VM + four 1-vCPU VMs (x264)
Delayed reschedule IPI is virtualization-friendly to resolve LHP problems
26/28
0.00
0.20
0.40
0.60
0.80
1.00
1.20
facesim streamcluster
Baseline
Baseline w/ DelayedResched
LC_Balance
LC_Balance w/ DelayedResched
LC_Balance w/ Resched-DP

Conclusions & Future Work
• Demand-based coordinated scheduling
• IPI as an effective signal for coordination
• pCPU assignment conscious of dynamic CPU loads
• Limitation
• Cannot cover ALL types of synchronization demands
• Kernel spinlock contention w/o VMM intervention
• Future work
• Cooperation with HW (e.g., PLE) & paravirt
Barrier or lock
27/28
Address
space

Thank You!
• Questions and comments
• Contacts
• hjukim@calab.kaist.ac.kr
• http://calab.kaist.ac.kr/~hjukim
28/28

User-Level Coordination Demands
• Coscheduling-friendly workloads
• SPMD, bulk-synchronous, etc.
• Busy-waiting synchronization
• “Spin-then-block”
Barrier
Barrier
Thread1 Thread2 Thread3 Thread4
Wake
up
Wake
up
Wake
up
Wake
up
Additional
barrier
Thread1 Thread2 Thread3 Thread4 Thread1 Thread2 Thread3 Thread4
Wake
up
Coscheduling
(balanced execution)
Uncoordinated
(largely skewed execution)
Uncoordinated
(skewed execution)
More blocking operations
when uncoordinated
Spin Block
30/28

User-Level Coordination Demands
• Coscheduling
• Avoiding more expensive blocking in a VM
• VMExits for CPU yielding and wake-up
• Halt (HLT) and Reschedule IPI
• When to coschedule?
• User-level synchronization involves reschedule IPIs
Providing a knob to selectively enable this coscheduling for coscheduling-friendly VMs
Reschedule IPI traffic of streamcluster
Barriers Barriers Barriers Barriers Barriers Barriers
“A Reschedule IPI is a signal for coordination demand!”
Co-schedule IPI-recipient vCPUs with a sender vCPU
31/28

Urgent vCPU First (UVF) Scheduling
• Urgent vCPU
• 1. Preemptively scheduled if fairness is kept
• 2. Protected from preemption once scheduled
• During “Urgent time slice (utslice)”
pCPU
vCPU vCPU vCPU
Urgent queue Runqueue
vCPU
pCPU
vCPU vCPU vCPUvCPU
FIFO order Proportional shares order
vCPU : urgent vCPU
vCPU vCPU
Wait queue
If inter-VM fairness is kept
Coscheduled
Protected from
preemption
32/28

Proposed Scheme
• Adaptive scheme based on pCPU loads
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
vCPU
Balanced loads
 Balance scheduling
vCPU
pCPU pCPU pCPU pCPU
vCPU
vCPU
vCPU vCPU
vCPU
Imbalanced loads
 Favoring underloaded pCPUs
vCPU
pCPU0 pCPU1 pCPU2 pCPU3
vCPU
vCPU vCPU
Wait queue
• Example
vCPUvCPU vCPU
Candidate pCPU set
(Scheduler assigns a lowest-loaded pCPU in this set)
= {pCPU0, pCPU1, pCPU2, pCPU3}
pCPU3 is overloaded
(i.e., CPU load > Avg. CPU load)
Handled by coordination in time domain
(UVF scheduling)
33/28

Evaluation
• Urgent time slice (utslice)
• 1. Utslice for reducing LHP
• 2. Utslice for quickly serving multiple urgent vCPUs
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
0 100 300 500 700 1000
#offutexqueueLHP
Utslice (usec)
bodytrack
facesim
streamcluster
Workloads:
A futex-intensive workload in one VM
+ dedup in another VM as a preempting VM
>300us utslice
2x-3.8x LHP reduction
Remaining LHPs occur during local wake-up or
before reschedule IPI transmission
 Unlikely lead to lock contention
34/28

Evaluation
• Urgent time slice (utslice)
• 1. utslice for reducing LHP
• 2. utslice for quickly serving multiple urgent vCPUs
30
35
40
45
50
55
60
0
2
4
6
8
10
12
14
16
100 500 1000 3000 5000
Averageexecutiontime(sec)
CPUcycles(%)
Utslice (usec)
Spinlock cycles (%)
TLB cycles (%)
Execution time (sec)
Workloads:
3 VMs, each of which runs vips
(vips - TLB-IPI-intensive application)
As utslice increases,
TLB shootdown cycles increase
500usec is an appropriate utslice for both
LHP reduction and multiple urgent vCPUs
~11% degradation
35/28

Evaluation
• Urgent allowance
• Improving overall efficiency with fairness
0
0.5
1
1.5
2
2.5
3
3.5
0
5
10
15
20
25
30
No UVF 0 6 12 18 24
Slowdown
CPUcycles(%)
Urgent allowance (msec)
Spinlock cycles
TLB cycles
Slowdown (vips)
Slowdown (facesim x 2)
Workloads:
vips (TLB-IPI-intensive) VM + two facesim VMs
Efficient TLB synchronization
No performance drop
36/28

Evaluation
• Impact of kernel-level coordination
• One 8-vCPU VM + four 1-vCPU VMs (x264)
0.00
0.50
1.00
1.50
Co-running workloads with 1-vCPU VM (x264)
Baseline Balance LC-Balance LC-Balance+Resched-DP LC-Balance+Resched-DP+TLB-Co
Performance of 1-vCPU VM
LC-Balance: Load-conscious balance scheduling
Resched-DP: Delayed preemption for reschedule IPI
TLB-Co: Coscheduling for TLB shootdown IPI
Unfair
contention
Balance
scheduling
Balance scheduling  Up to 26% degradation
37/28

Evaluation: Two SMP VMs
w/ dedup
w/ freqmine
a: baseline
b: balance
c: LC-balance
d: LC-balance+Resched-DP
e: LC-balance+Resched-DP+TLB-Co

corun
solorun
Time
Time
38/28

Evaluation
• Effectiveness on HW-assisted feature
• CPU feature to reduce the amount of busy-waiting
• VMExit in response to excessive busy-waiting
• Intel Pause-Loop-Exiting (PLE), AMD Pause Filter
• Inevitable cost of some busy-waiting and VMExit
LHP
PAUSE
PAUSE
PAUSE
…
Threshold
VMExit
Yielding
0
0.2
0.4
0.6
0.8
1
0
2
4
6
8
10
Baseline LC_Balance LC_Balance
w/ UVF
CPUcycles(%)
TLB cycles (%) Spinlock cycles (%)
0
0.2
0.4
0.6
0.8
1
0
2
4
6
8
10
Baseline LC_Balance LC_Balance
w/ UVF
CPUcycles(%)
TLB cycles (%) Spinlock cycles (%)
streamcluster (futex-intensive) ferret (TLB-IPI-intensive)
Apps Streamcluster facesim ferret vips
Reduction in Pause-
loop VMExits (%) 44.5 97.7 74.0 37.9
39/28

Evaluation
• Coscheduling-friendly user-level workload
• Streamcluster
• Spin-then-block barrier intensive workload
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
UVF w/o Resched-Co UVF w/ Resched-Co
#ofbarriersynchronization
Departure (block)
Departure (spin)
Arrival (block)
Arrival (spin)
More performance improvement
as the time of spin-waiting increases
Blocking: 38%
Reschedule IPIs (3 VMExits): 21%
Additional (departure) barriers: 29%
Normalized execution time (corunning w/ bodytrack)
Additional barriers
Barrier breakdown
Resched-Co: Coscheduling for rescheudle IPI
0.00
0.20
0.40
0.60
0.80
1.00
0.1ms spin wait
(default)
10x spin wait 20x spin wait
UVF w/o Resched-Co UVF w/ Resched-Co
40/28

Demand-Based Coordinated Scheduling for SMP VMs

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Demand-Based Coordinated Scheduling for SMP VMs

Similar to Demand-Based Coordinated Scheduling for SMP VMs (20)

Recently uploaded

Recently uploaded (20)

Demand-Based Coordinated Scheduling for SMP VMs