While datacenters are increasingly adopting VMs to provide elastic cloud services, they still rely on traditional TCP for congestion control. In this talk, I will first show that VM scheduling delays can heavily contaminate the RTTs sensed by VM senders, preventing TCP from correctly learning the physical network condition. Focusing on the incast problem, which is commonly seen in large-scale distributed data processing such as MapReduce and web search, I find that the solutions developed for *physical* clusters fall short in a Xen *virtual* cluster. Second, I will provide a concrete understanding of the problem, and reveal that the situation when the sending VM is preempted differs from the situation when the receiving VM is preempted. Third, I will introduce my recent attempts at paravirtualizing TCP to overcome the negative effects of VM scheduling delays.
XPDS13: On Paravirtualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong
1. On ParaVirtualizing TCP:
Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science
The University of Hong Kong
Xen Project Developer Summit 2013
Edinburgh, UK, October 24-25, 2013
2. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
4. Physical datacenter vs. Virtualized datacenter
[Figure: two rack-scale topologies, each with a core switch, ToR switches, and servers in a rack. Left: a set of physical machines. Right: a set of virtual machines, several VMs per server.]
Network delays:
– Physical datacenter: propagation delays of the physical network/switches
– Virtualized datacenter: additional delays due to virtualization overhead
5. Virtualization brings “delays”
[Figure: several VMs multiplexed by the hypervisor onto two pCPUs; a VM waiting in the scheduling queue experiences a delay before it runs.]
1. I/O virtualization overhead (PV or HVM)
– Guest VMs are unable to directly access the hardware.
– Additional data movement between dom0 and domUs.
– HVM: Passthrough I/O can avoid it
2. VM scheduling delays
– Multiple VMs share one physical core
6. Virtualization brings “delays” (cont’d)
[Figure: measured RTTs. PM→PM: avg 0.147ms. 1VM→1VM: avg 0.374ms. 1VM→2VMs: peaks at 30ms. 1VM→3VMs: peaks at 60ms.]
Delays of I/O virtualization (PV guests): < 1ms
VM scheduling delays: tens of milliseconds
– Queuing delays ≪ VM scheduling delays
– The dominant factor in network RTT
8. Incast network congestion
• A special form of network congestion, typically seen in
distributed processing applications (scatter-gather).
– Barrier-synchronized request workloads
– The limited buffer space of the switch output port can be easily
overfilled by simultaneous transmissions.
• Application-level throughput (goodput) can be orders of
magnitude lower than the link capacity. [SIGCOMM’09]
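To make the traffic pattern concrete, here is a minimal sketch (my illustration, not from the talk) of a barrier-synchronized scatter-gather client; the server addresses, port, and block size are made up. Each round requests one block from every server at once and completes only when the slowest response has fully arrived, so all servers transmit simultaneously into the same switch output port:

```c
/* Hypothetical sketch of a barrier-synchronized (scatter-gather) reader:
 * request a block from N servers at once, then wait for ALL responses
 * before starting the next round -- the traffic pattern behind incast. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSERVERS 8
#define BLOCK    (256 * 1024)       /* per-server block size (assumed) */

int main(void) {
    int fd[NSERVERS];
    for (int i = 0; i < NSERVERS; i++) {
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port   = htons(9000) };
        a.sin_addr.s_addr = htonl(0x0a000001 + i);  /* 10.0.0.1+i (assumed) */
        fd[i] = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd[i], (struct sockaddr *)&a, sizeof a) < 0)
            return perror("connect"), 1;
    }
    for (int round = 0; round < 1000; round++) {
        /* Scatter: every server is asked for its block at the same time. */
        for (int i = 0; i < NSERVERS; i++)
            write(fd[i], "GET", 3);
        /* Gather (barrier): the round ends only when the slowest
         * server's block has fully arrived. */
        for (int i = 0; i < NSERVERS; i++) {
            char buf[4096];
            for (size_t got = 0; got < BLOCK; ) {
                ssize_t n = read(fd[i], buf, sizeof buf);
                if (n <= 0) return 1;
                got += (size_t)n;
            }
        }
    }
    return 0;
}
```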
9. Solutions for physical clusters
Prior works: none of them can fully eliminate the
throughput collapse.
– Increase switch buffer size
– Limited transmit
– Reduce duplicate ACK threshold
– Disable slow-start
– Randomize timeout value
– Reno, NewReno, SACK
The dominant factor: once a packet loss happens, whether
the sender can learn of it as soon as possible.
– In case of “tail loss”, the sender can only count on the
retransmit timer’s firing.
Two representative papers:
– Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]
– Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09]
10. Solutions for physical clusters (cont’d)
Significantly reducing RTOmin has been shown to be a safe
and effective approach. [SIGCOMM’09]
Even with ECN support in the hardware switch, a small RTOmin
still shows apparent advantages. [DCTCP, SIGCOMM’10]
RTOmin in a virtual cluster? Not well studied.
11. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
12. Pseudo-congestion
NO network congestion, yet still RTT spikes.
Setup: 3 VMs per core, 30ms scheduling slices.
[Figure: measured RTTs (red points) and calculated RTO values (blue
points), one panel per setting: RTOmin = 200ms, 100ms, 10ms, 1ms.]
TCP’s low-pass filter computes the Retransmit TimeOut (RTO):
RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin
A small RTOmin → frequent spurious RTOs
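To see why the spikes translate into spurious timeouts, here is a small standalone demo of the formula above (my illustration; the steady-state RTT and the 60ms delay are taken from the earlier measurements). With sub-millisecond RTTs the armed RTO sits at RTOmin, so a single ~60ms scheduling delay fires the retransmit timer long before the genuine (merely delayed) ACK arrives whenever RTOmin is small:

```c
/* Minimal sketch of TCP's RTO estimator (RFC 6298 style, as on the slide):
 * RTTVAR = 3/4*RTTVAR + 1/4*|SRTT - MRTT|, SRTT = 7/8*SRTT + 1/8*MRTT,
 * RTO = max(SRTT + 4*RTTVAR, RTOmin). All values in milliseconds. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double srtt = 0.4, rttvar = 0.2;         /* steady state: ~0.4ms RTTs */
    double rtomin[] = { 200, 100, 10, 1 };   /* the four panels above     */
    for (int p = 0; p < 4; p++) {
        double s = srtt, v = rttvar;
        double rto = fmax(s + 4 * v, rtomin[p]);
        /* A descheduled VM adds ~60ms to one measured RTT. If the timer
         * was armed with RTO < 60ms, it fires before that ACK returns:
         * a spurious RTO. */
        printf("RTOmin=%6.0fms  armed RTO=%6.1fms  -> 60ms delay %s\n",
               rtomin[p], rto, rto < 60 ? "FIRES (spurious)" : "tolerated");
        v = 0.75 * v + 0.25 * fabs(s - 60.0);  /* absorb the spike ...   */
        s = 0.875 * s + 0.125 * 60.0;          /* ... into the filter    */
        printf("               next RTO=%6.1fms (inflated by the spike)\n",
               fmax(s + 4 * v, rtomin[p]));
    }
    return 0;
}
```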
13. Pseudo-congestion (cont’d)
A small RTOmin: serious spurious RTOs when RTTs vary widely.
A big RTOmin: throughput collapse under heavy network congestion.
“adjusting RTOmin is a tradeoff between timely response and
premature timeouts, and there is NO optimal balance between
the two.”
-- Allman and Paxson [SIGCOMM’99]
Virtualized datacenters: a new instantiation of this old dilemma.
14. Sender-side vs. Receiver-side
To transmit 4000 1MB data blocks:

Freq.      3VMs→1VM (sender delayed)   1VM→3VMs (receiver delayed)
1× RTOs           1086                          677
2× RTOs              0                          673
3× RTOs              0                          196
4× RTOs              0                           30

– Sender delayed: an RTO only happens once at a time.
– Receiver delayed: successive RTOs are normal.
15. A micro-view with tcpdump
snd.una: the first sent but unacknowledged byte.
snd.nxt: the next byte that will be sent.
[Figure, left: time (ms) vs. sequence number, traced at the sender VM,
when the sender VM is preempted. While the sender VM is stopped, an ACK
arrives, and the RTO fires just after the sender VM wakes up. The ACK’s
arrival time is not delayed; the VM simply receives it too late. From
TCP’s perspective, this RTO should not be triggered.]
[Figure, right: time (ms) vs. ACK number, traced at the receiver VM,
when the receiver VM is preempted. The RTO fires twice before the
receiver VM wakes up. Both the generation and the return of the ACKs
are delayed, so RTOs must happen on the sender’s side.]
16. The sender-side problem: OS reasons
[Figure: the driver domain buffers an ACK while the sender VM waits in
the scheduling queue behind VM2 and VM3. When the sender VM runs again,
two events are pending: (1) the timer IRQ, which fires the RTO, and
(2) the network IRQ, which would deliver the ACK and clear the
retransmit timer. The timer wins: a spurious RTO.]
After the VM wakes up, both TIMER and NET interrupts are pending.
The RTO fires just before the ACK enters the VM.
The reasons lie in common OS design:
– The timer interrupt is executed before other interrupts.
– Network processing happens a little later (in the bottom half).
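The ordering the slide points at is visible in Linux’s softirq vector: pending softirqs are serviced in ascending enum order, the TCP retransmit timer runs from the timer softirq, and the ACK is processed in NET_RX. A condensed excerpt (the exact set of entries varies by kernel version):

```c
/* Condensed from Linux include/linux/interrupt.h (entries vary by
 * version). Pending softirqs are serviced in ascending enum order, so
 * TIMER_SOFTIRQ (which runs TCP's retransmit timer) is handled before
 * NET_RX_SOFTIRQ (which would deliver the ACK and stop that timer). */
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,    /* serviced first: the RTO fires here          */
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,   /* serviced later: the buffered ACK arrives    */
    /* ... */
};
```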
17. To detect spurious RTOs
Two well-known detection algorithms: F-RTO and Eifel
– Eifel performs much worse than F-RTO in some situations, e.g.
with bursty packet loss [CCR’03]
– F-RTO is implemented in Linux
[Figure: F-RTO’s detection rate for 3VMs→1VM and for 1VM→3VMs; in
both cases the detection rate is low.]
F-RTO interacts badly with delayed ACKs (ACK coalescing)
– Reducing the delayed-ACK timeout value does NOT help.
Disabling delayed ACKs seems to be helpful
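For context, a simplified outline of basic F-RTO’s decision (my paraphrase of RFC 5682, not code from the talk) shows why it needs the first two ACKs after a timeout to arrive promptly and unambiguously; delayed-ACK coalescing breaks exactly that, so the check falls through to “genuine loss”:

```c
/* Simplified outline of basic F-RTO (RFC 5682), my paraphrase. With
 * delayed ACKs, responses are coalesced and the decisive second ACK
 * may come late or not at all, so detection fails. */
enum verdict { UNDECIDED, SPURIOUS_RTO, GENUINE_LOSS };

/* ack_seq: cumulative ACK number; rexmit_end: end of the segment
 * retransmitted when the RTO fired; nth_ack: 1st or 2nd ACK since. */
enum verdict frto_on_ack(unsigned ack_seq, unsigned rexmit_end,
                         int nth_ack, int is_dupack)
{
    if (is_dupack)
        return GENUINE_LOSS;   /* duplicate ACKs indicate real loss      */
    if (nth_ack == 1)
        return UNDECIDED;      /* step 2: send two NEW segments, watch   */
    if (nth_ack == 2 && ack_seq > rexmit_end)
        return SPURIOUS_RTO;   /* ACKs data that was never retransmitted */
    return GENUINE_LOSS;
}
```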
18. Delayed ACK vs. CPU overhead
[Figure: CPU utilization of the sender VM and the receiver VM, with
and without delayed ACKs.]
Disabling delayed ACKs → significant CPU overhead
19. Delayed ACK vs. CPU overhead

Sender VM     delack-200ms   delack-1ms   w/o delack
Total ACKs       229,650       244,757     2,832,260

Receiver VM   delack-200ms   delack-1ms   w/o delack
Total ACKs       252,278       262,274     2,832,179

Disabling delayed ACKs: 11~13× more ACKs are sent
Disabling delayed ACKs → significant CPU overhead
20. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
21. PVTCP – A ParaVirtualized TCP
Observation
– Spurious RTOs only happen when the sender/receiver
VM has just experienced a scheduling delay.
Main Idea
– If we can detect such moments, and let the guest OS be
aware of them, there is a chance to handle the problem.
“the more information about current network conditions
available to a transport protocol, the more efficiently it
can use the network to transfer its data.”
-- Allman and Paxson [SIGCOMM’99]
22. Detect the VM’s wakeup moment
[Figure: 3 VMs per core, 30ms slices. The hypervisor delivers virtual
timer IRQs to the running guest every 1ms (guest OS with HZ=1000), and
each IRQ advances jiffies by 1. While the VM is NOT running, no timer
IRQs are delivered; on wakeup a one-shot timer catches the clock up in
one jump: jiffies += 60.]
An acute increase of the system clock (jiffies)
→ the VM has just woken up.
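A minimal sketch of the detection idea (my illustration, not the PVTCP patch itself; pvtcp_wakeup_event() is a hypothetical hook): in the guest’s timer path, check how far jiffies jumped since the last tick; a jump far beyond one tick means the VM was descheduled and has just come back.

```c
/* Sketch of wakeup detection in the guest's timer path. Assumptions:
 * HZ=1000 (1 jiffy = 1ms) and a hypothetical pvtcp_wakeup_event() hook
 * that tells the TCP layer to treat the next measurements specially. */
#include <linux/jiffies.h>

extern void pvtcp_wakeup_event(unsigned long lost_ticks); /* hypothetical */

#define WAKEUP_JUMP_THRESHOLD 5  /* >5 lost ticks => we were descheduled */

static unsigned long last_seen_jiffies;

static void pvtcp_check_wakeup(void)
{
    unsigned long now = jiffies;
    unsigned long delta = now - last_seen_jiffies;

    /* Normally the timer IRQ runs every tick, so delta == 1. An acute
     * jump means the hypervisor withheld the pCPU and caught the clock
     * up in one shot: the VM has just woken up. */
    if (last_seen_jiffies && delta > WAKEUP_JUMP_THRESHOLD)
        pvtcp_wakeup_event(delta);

    last_seen_jiffies = now;
}
```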
23. PVTCP – the sender VM is preempted
Spurious RTOs can be avoided. No need to detect them at all!
[Figure: the same timeline as slide 16. The ACK sits buffered in the
driver domain during the VM scheduling latency; on wakeup, (1) the
timer IRQ fires the RTO just before (2) the network IRQ delivers the
ACK that would have cleared the retransmit timer.]
24. PVTCP – the sender VM is preempted
Spurious RTOs can be avoided. No need to detect them at all!
Solution: after the VM wakes up, extend the TCP retransmit
timer’s expiry time by 1ms.
[Figure: with PVTCP, the net IRQ now effectively comes first: the
buffered ACK enters the VM and resets the timer before the postponed
expiry time is reached, so no spurious RTO fires.]
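The mechanism can be pictured like this (a sketch under my assumptions, not the actual patch; handling is simplified to a single timer). On the wakeup event, a retransmit timer that is already due, or due within the next millisecond, is pushed back 1ms so the pending network IRQ gets to deliver the buffered ACKs first:

```c
/* Sketch: postpone an about-to-fire retransmit timer on VM wakeup so
 * the pending NET IRQ can deliver buffered ACKs first. Simplified;
 * real code would visit every TCP socket's retransmit timer. */
#include <linux/jiffies.h>
#include <linux/timer.h>

static void pvtcp_defer_retransmit(struct timer_list *rto_timer)
{
    unsigned long grace = msecs_to_jiffies(1);

    /* If the timer would fire before the buffered ACKs can be
     * processed, push its expiry 1ms past "now". mod_timer() re-arms
     * a pending timer with the new expiry. */
    if (time_before_eq(rto_timer->expires, jiffies + grace))
        mod_timer(rto_timer, jiffies + grace);
}
```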
25. PVTCP – the sender VM is preempted
[Figure: the same PVTCP timeline as slide 24.]
Even without a spurious RTO, the measurement is contaminated:
Measured RTT (MRTT) = TrueRTT + VMSchedDelay
TCP’s low-pass filter to estimate RTT/RTO:
SRTT_i = 7/8 × SRTT_{i-1} + 1/8 × MRTT_i
RTTVAR_i = 3/4 × RTTVAR_{i-1} + 1/4 × |SRTT_{i-1} − MRTT_i|
RTO_{i+1} = SRTT_i + 4 × RTTVAR_i
Solution: substitute MRTT_i ← SRTT_{i-1} for samples taken across a
wakeup, so the scheduling delay never enters the filter.
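In code, the substitution is one branch in the estimator update (a sketch with illustrative names; just_woke_up would be driven by the jiffies-jump detector from slide 22):

```c
/* Sketch of the sender-side filter fix: when the sample was taken
 * across a wakeup, feed the filter the previous smoothed RTT instead
 * of the contaminated measurement. Kernel fixed-point details omitted;
 * values are in milliseconds for clarity. */
struct rtt_est { double srtt, rttvar, rto, rto_min; };

void rtt_update(struct rtt_est *e, double mrtt, int just_woke_up)
{
    if (just_woke_up)
        mrtt = e->srtt;  /* MRTT_i <- SRTT_{i-1}: drop the sched delay */

    e->rttvar = 0.75 * e->rttvar
              + 0.25 * ((e->srtt > mrtt) ? e->srtt - mrtt : mrtt - e->srtt);
    e->srtt   = 0.875 * e->srtt + 0.125 * mrtt;
    e->rto    = e->srtt + 4 * e->rttvar;
    if (e->rto < e->rto_min)
        e->rto = e->rto_min;
}
```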
26. PVTCP – the receiver VM is preempted
Spurious RTOs cannot be avoided, so we have to let
the sender detect them.
Detection algorithms require the deterministic return of
future ACKs from the receiver:
– Delayed ACKs enabled → retransmission ambiguity
– Delayed ACKs disabled → significant CPU overhead
Solution: temporarily disable delayed ACKs when the
receiver VM has just woken up.
– Eifel: checks the timestamp of the first ACK
– F-RTO: checks the ACK numbers of the first two ACKs
– Just-in-time: do not delay the ACKs for the first three segments
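A sketch of that just-in-time policy (illustrative names throughout, not the PVTCP source): the wakeup event arms a small budget of segments whose ACKs must not be delayed, and the receiver’s delayed-ACK decision consults that budget.

```c
/* Sketch of the receiver-side fix: on wakeup, force immediate ACKs for
 * a few segments so the sender's detector (F-RTO/Eifel) gets its
 * deterministic ACKs, then fall back to normal delayed ACKs. */
#define PVTCP_QUICKACK_SEGS 3   /* "just-in-time": first three segments */

static int pvtcp_quickack_budget;

/* Called from the wakeup detector (see slide 22). */
void pvtcp_on_wakeup(void)
{
    pvtcp_quickack_budget = PVTCP_QUICKACK_SEGS;
}

/* Called wherever the receiver decides between delaying and sending an
 * ACK; returns nonzero if this segment must be ACKed immediately. */
int pvtcp_force_immediate_ack(void)
{
    if (pvtcp_quickack_budget > 0) {
        pvtcp_quickack_budget--;
        return 1;   /* immediate ACK: keep detection unambiguous */
    }
    return 0;       /* normal delayed-ACK behavior resumes       */
}
```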
27. PVTCP evaluation: throughput
TCP’s dilemma: pseudo-congestion & real congestion.
Experimental setup: 20 sender VMs → 1 receiver VM.
[Figure: goodput vs. RTOmin for TCP-200ms, TCP-1ms and PVTCP-1ms.]
PVTCP avoids throughput collapse over the whole RTOmin range.
28. PVTCP evaluation: CPU overhead
[Figure: CPU utilization of the sender VM and the receiver VM.]
With delayed ACKs enabled:
PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms)
29. PVTCP evaluation: CPU overhead

Sender VM     TCP-200ms   TCP-1ms   PVTCP-1ms
Total ACKs      192,587    244,757    192,863 (+0%)

Receiver VM   TCP-200ms   TCP-1ms   PVTCP-1ms
Total ACKs      194,384    262,274    208,688 (+7.4%)

– Sender side: spurious RTOs are avoided.
– Receiver side: delayed ACKs are temporarily disabled to help
the sender detect spurious RTOs, at +7.4% ACK traffic.
31. The buffer of the netback
[Figure, left: the scheduling delays to the sender VM. Data packets
wait in the driver domain for ACKs while the sender VM sits in the
VM scheduling queue; the RTO fires inside the sender VM.]
[Figure, right: the scheduling delays to the receiver VM. Incoming
data packets pile up in the driver domain’s buffer until the receiver
VM runs and can ACK them. The buffer size matters!]
The vif’s buffer temporarily stores incoming packets
while the VM is preempted.
– ifconfig vifX.Y txqueuelen [value]
The default value is too small → intensive packet loss
– #define XENVIF_QUEUE_LENGTH 32
This parameter should be set much bigger (> 10,000 perhaps..)
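For instance, a deployment could raise the limit at build time in netback; the value below is an arbitrary illustration, not a recommendation from the talk:

```c
/* Illustrative only: enlarging the per-vif queue in Xen's netback
 * (name as on the slide; default in that tree was 32). 12800 is an
 * arbitrary example value; the talk only suggests "> 10,000 perhaps".
 * The same limit can also be raised at runtime, without rebuilding:
 *   ifconfig vifX.Y txqueuelen 12800                                 */
#define XENVIF_QUEUE_LENGTH 12800
```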
32. Summary
Problem: VM scheduling delays cause spurious RTOs.
– Sender-side problem: rooted in common OS design.
– Receiver-side problem: a networking problem.
Proposed Solution: a ParaVirtualized TCP (PVTCP)
– Provides a method to detect a VM’s wakeup moment.
Sender-side: spurious RTOs can be avoided.
– Slightly extend the retransmit timer’s expiry time after
the sender VM wakes up.
Receiver-side: spurious RTOs can be detected.
– Temporarily disable delayed ACKs after the receiver VM
wakes up (just-in-time ACKing).
Future Work: your inputs ..
33. Thanks for listening
Comments & Questions
Email: lwcheng@cs.hku.hk
URL: http://www.cs.hku.hk/~lwcheng