XPDS13: On ParaVirtualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong

While datacenters are increasingly adopting VMs to provide elastic cloud services, they still rely on traditional TCP for congestion control. In this talk, I will first show that VM scheduling delays can heavily contaminate the RTTs sensed by VM senders, preventing TCP from correctly learning the physical network condition. Focusing on the incast problem, which is commonly seen in large-scale distributed data processing such as MapReduce and web search, I find that the solutions developed for *physical* clusters fall short in a Xen *virtual* cluster. Second, I will provide a concrete understanding of the problem, and show that the situations when the sending VM is preempted and when the receiving VM is preempted are different. Third, I will introduce my recent attempts at paravirtualizing TCP to overcome the negative effects caused by VM scheduling delays.

Transcript of "XPDS13: On ParaVirtualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong"

  1. On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines
      Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
      Department of Computer Science, The University of Hong Kong
      Xen Project Developer Summit 2013, Edinburgh, UK, October 24-25, 2013
  2. Outline
      • Motivation
        – Physical datacenter vs. Virtualized datacenter
        – Incast congestion
      • Understand the Problem
        – Pseudo-congestion
        – Sender-side vs. Receiver-side
      • PVTCP – A ParaVirtualized TCP
        – Design, Implementation, Evaluation
      • Questions & Comments
  3. Outline
      • Motivation
        – Physical datacenter vs. Virtualized datacenter
        – Incast congestion
      • Understand the Problem
        – Pseudo-congestion
        – Sender-side vs. Receiver-side
      • PVTCP – A ParaVirtualized TCP
        – Design, Implementation, Evaluation
      • Questions & Comments
  4. Physical datacenter vs. Virtualized datacenter
      [Figure: two rack diagrams – core switch, ToR switches, servers in a rack; on the virtualized side each server hosts several VMs]
      • Physical datacenter: a set of physical machines
        – Network delays: propagation delays of the physical network/switches
      • Virtualized datacenter: a set of virtual machines
        – Network delays: additional delays due to virtualization overhead
  5. Virtualization brings "delays"
      [Figure: several VMs multiplexed onto two pCPUs by the hypervisor]
      • 1. I/O virtualization overhead (PV or HVM)
        – Guest VMs are unable to directly access the hardware.
        – Additional data movement between dom0 and domUs.
        – HVM: passthrough I/O can avoid it.
      • 2. VM scheduling delays
        – Multiple VMs share one physical core.
  6. Virtualization brings "delays"
      [Figure: measured RTTs – PM → PM avg 0.147ms; 1VM → 1VM avg 0.374ms; 1VM → 2VMs peak 30ms; 1VM → 3VMs peak 60ms]
      • Delays of I/O virtualization (PV guests): < 1ms
      • VM scheduling delays: tens of milliseconds
        – Queuing delays
      • VM scheduling delays → the dominant factor in network RTT
  7. Network delays in public clouds
      [Figures: RTT measurements in public clouds, as reported in INFOCOM'10 and HPDC'10]
  8. Incast network congestion
      • A special form of network congestion, typically seen in distributed processing applications (scatter-gather).
        – Barrier-synchronized request workloads
        – The limited buffer space of the switch output port can easily be overfilled by simultaneous transmissions.
      • Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM'09]
  9. Solutions for physical clusters
      • Prior works: none of them can fully eliminate the throughput collapse.
        – Increase switch buffer size
        – Limited transmit
        – Reduce duplicate ACK threshold
        – Disable slow-start
        – Randomize timeout value
        – Reno, NewReno, SACK
      • The dominant factor: once packet loss happens, whether the sender can learn about it as soon as possible.
        – In case of "tail loss", the sender can only count on the retransmit timer's firing.
      Two representative papers:
      • Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST'08]
      • Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN'09]
  10. Solutions for physical clusters (cont'd)
      • Significantly reducing RTOmin has been shown to be a safe and effective approach. [SIGCOMM'09]
      • Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages. [DCTCP, SIGCOMM'10]
      RTOmin in a virtual cluster? Not well studied.
  11. Outline
      • Motivation
        – Physical datacenter vs. Virtualized datacenter
        – Incast congestion
      • Understand the Problem
        – Pseudo-congestion
        – Sender-side vs. Receiver-side
      • PVTCP – A ParaVirtualized TCP
        – Design, Implementation, Evaluation
      • Questions & Comments
  12. Pseudo-congestion
      NO network congestion, still RTT spikes.
      [Figure: 3 VMs per pCPU, each running for 30ms; red points are measured RTTs, blue points are RTO values calculated with RTOmin = 200ms, 100ms, 10ms and 1ms]
      • TCP's low-pass filter for the Retransmit TimeOut (RTO): RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin
      • A small RTOmin → frequent spurious RTOs
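For reference, a minimal, self-contained sketch (not the kernel's actual code) of the estimator named on this slide: an exponentially weighted RTT filter whose RTO is floored at RTOmin. Variable names and the sample values are illustrative only; the 30ms spike mimics a VM scheduling delay.

    /* Minimal sketch of "RTO = SRTT + 4*RTTVAR" with an RTOmin floor. */
    #include <math.h>
    #include <stdio.h>

    struct rtt_state {
        double srtt;    /* smoothed RTT (ms) */
        double rttvar;  /* RTT variance (ms) */
    };

    static double update_rto(struct rtt_state *s, double mrtt_ms, double rto_min_ms)
    {
        if (s->srtt == 0) {                 /* first sample */
            s->srtt = mrtt_ms;
            s->rttvar = mrtt_ms / 2;
        } else {
            s->rttvar = 0.75 * s->rttvar + 0.25 * fabs(s->srtt - mrtt_ms);
            s->srtt   = 0.875 * s->srtt   + 0.125 * mrtt_ms;
        }
        double rto = s->srtt + 4 * s->rttvar;
        return rto < rto_min_ms ? rto_min_ms : rto;   /* lower bound: RTOmin */
    }

    int main(void)
    {
        struct rtt_state s = {0};
        /* steady sub-millisecond RTTs, then a 30ms spike caused by VM scheduling */
        double samples[] = {0.4, 0.4, 0.4, 30.0, 0.4};
        for (int i = 0; i < 5; i++)
            printf("MRTT=%5.1fms  RTO=%6.2fms\n", samples[i],
                   update_rto(&s, samples[i], 1.0 /* RTOmin = 1ms */));
        return 0;
    }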
  13. Pseudo-congestion (cont'd)
      • A small RTOmin: serious spurious RTOs with widely varying RTTs.
      • A big RTOmin: throughput collapse under heavy network congestion.
      • "Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two." -- Allman and Paxson [SIGCOMM'99]
      • Virtualized datacenters → a new instantiation of this tradeoff
  14. Sender-side vs. Receiver-side
      Workload: transmit 4000 1MB data blocks. Frequency of RTO episodes:

                      1x RTOs   2x RTOs   3x RTOs   4x RTOs
      3VMs → 1VM         1086         0         0         0
      1VM → 3VMs          677       673       196        30

      • Scheduling delays at the sender VM (3VMs → 1VM): an RTO only happens once at a time.
      • Scheduling delays at the receiver VM (1VM → 3VMs): successive RTOs are normal.
  15. A micro-view with tcpdump
      snd.una: the first sent but unacknowledged byte.  snd.nxt: the next byte that will be sent.
      [Figure (left): time vs. sequence number from the sender VM – the sender VM has been stopped; an ACK arrives before the sender VM wakes up; an RTO happens just after the sender VM wakes up]
      [Figure (right): time vs. ACK number from the receiver VM – the receiver VM has been stopped; the RTO happens twice before the receiver VM wakes up]
      • When the sender VM is preempted → the ACKs' arrival is not delayed, but they are received too late. From TCP's perspective, the RTO should not be triggered.
      • When the receiver VM is preempted → the generation and return of the ACKs are delayed. RTOs must happen on the sender's side.
  16. The sender-side problem: OS reasons
      [Figure: while the sender VM waits in the scheduling queue, ACKs are buffered in the driver domain; after the VM wakes up, (1) the timer IRQ fires the RTO, then (2) the network IRQ delivers the ACK – a spurious RTO]
      • After the VM wakes up, both TIMER and NET interrupts are pending.
      • The RTO happens just before the ACK enters the VM.
      • The reasons lie in common OS design:
        – The timer interrupt is executed before other interrupts.
        – Network processing happens a little later (in the bottom half).
  17. To detect spurious RTOs
      • Two well-known detection algorithms: F-RTO and Eifel
        – Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR'03]
        – F-RTO is implemented in Linux
      [Figures: low detection rate for both 3VMs → 1VM and 1VM → 3VMs]
      • F-RTO interacts badly with delayed ACK (ACK coalescing)
        – Reducing the delayed ACK timeout value does NOT help.
        – Disabling delayed ACK seems to be helpful.
  18. Delayed ACK vs. CPU overhead
      [Figures: CPU utilization of the sender VM and the receiver VM, with and without delayed ACK]
      Disabling delayed ACK → significant CPU overhead
  19. Delayed ACK vs. CPU overhead

                       delack-200ms   delack-1ms   w/o delack
      Total ACKs            229,650      244,757    2,832,260
      Total ACKs            252,278      262,274    2,832,179

      [Figures: CPU utilization of the sender VM and the receiver VM]
      Disabling delayed ACK: 11~13× more ACKs are sent.
      Disabling delayed ACK → significant CPU overhead
  20. Outline
      • Motivation
        – Physical datacenter vs. Virtualized datacenter
        – Incast congestion
      • Understand the Problem
        – Pseudo-congestion
        – Sender-side vs. Receiver-side
      • PVTCP – A ParaVirtualized TCP
        – Design, Implementation, Evaluation
      • Questions & Comments
  21. PVTCP – A ParaVirtualized TCP
      • Observation
        – Spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.
      • Main idea
        – If we can detect such a moment and make the guest OS aware of it, there is a chance to handle the problem.
      "The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data." -- Allman and Paxson [SIGCOMM'99]
  22. Detect the VM's wakeup moment
      [Figure: 3 VMs per core, each running for 30ms; the guest OS (HZ=1000) receives virtual timer IRQs every 1ms]
      • While the VM is running, jiffies advances by 1 on every virtual timer IRQ (every 1ms).
      • While the VM is NOT running, no timer IRQs are delivered; when the VM resumes, the clock jumps forward (e.g. jiffies += 60 after 60ms of preemption, via the hypervisor's one-shot timer).
      • Acute increase of the system clock (jiffies) → the VM has just woken up.
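The detection idea on this slide can be illustrated with a small standalone sketch. This is an assumption-laden simulation, not the authors' implementation: the tick values and the threshold WAKEUP_JUMP_TICKS are made up for illustration.

    /* Standalone sketch: detect a VM's wakeup moment from an acute jump in the
     * guest clock (jiffies, HZ=1000).  The tick stream mimics a few ms of
     * running time followed by ~60ms of preemption. */
    #include <stdio.h>

    #define WAKEUP_JUMP_TICKS 5   /* assumed: > 5 ticks (5ms) between updates => wakeup */

    int main(void)
    {
        /* Simulated jiffies values seen by the guest's timer handler:
         * 1ms increments while running, then a +60 jump when the VM resumes. */
        unsigned long ticks[] = {1, 2, 3, 4, 5, 65, 66, 67, 68};
        unsigned long last = 0;

        for (int i = 0; i < (int)(sizeof(ticks) / sizeof(ticks[0])); i++) {
            unsigned long delta = ticks[i] - last;
            if (last != 0 && delta > WAKEUP_JUMP_TICKS)
                printf("jiffies jumped by %lu -> VM just woke up (preempted ~%lums)\n",
                       delta, delta);
            last = ticks[i];
        }
        return 0;
    }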
  23. PVTCP – the sender VM is preempted
      • Spurious RTOs can be avoided. No need to detect them at all!
      [Figure: with plain TCP, after the sender VM resumes the timer IRQ fires first (RTO happens) and only then does the network IRQ deliver the buffered ACK – a spurious RTO]
  24. PVTCP – the sender VM is preempted
      • Spurious RTOs can be avoided. No need to detect them at all!
      • Solution: after the VM wakes up, extend the TCP retransmit timer's expiry time by 1ms.
      [Figure: with PVTCP, the extended expiry lets the network IRQ run first – the buffered ACK enters and resets the timer, so no spurious RTO fires]
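A minimal sketch of the sender-side fix described on this slide, under the assumption that a wakeup has just been detected (see the jiffies-jump sketch above); struct rtx_timer and on_vm_wakeup are hypothetical names, not PVTCP's real code.

    /* When the VM has just woken up, push a retransmit timer that is about to
     * fire 1ms into the future, so the pending ACK can be processed first. */
    #include <stdio.h>

    struct rtx_timer {
        unsigned long expires_ms;   /* absolute expiry time in ms */
        int pending;                /* 1 if the timer is armed */
    };

    /* Called right after a wakeup is detected. */
    static void on_vm_wakeup(struct rtx_timer *t, unsigned long now_ms)
    {
        if (t->pending && t->expires_ms <= now_ms + 1) {
            /* The timer would fire before the buffered ACKs are delivered;
             * delay its expiry by 1ms so the network IRQ wins the race. */
            t->expires_ms = now_ms + 1;
            printf("retransmit timer deferred to t=%lums\n", t->expires_ms);
        }
    }

    int main(void)
    {
        /* The timer was armed before preemption and is already overdue at wakeup. */
        struct rtx_timer t = { .expires_ms = 40, .pending = 1 };
        unsigned long now = 100;    /* the VM resumes 60ms late */
        on_vm_wakeup(&t, now);
        return 0;
    }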
  25. PVTCP – the sender VM is preempted
      [Figure: the same PVTCP timeline as the previous slide, annotated with where the RTT sample is taken]
      • Measured RTT: MRTT = TrueRTT + VMSchedDelay
      • TCP's low-pass filter to estimate RTT/RTO:
        – Smoothed RTT:       SRTT_i    = 7/8 * SRTT_{i-1} + 1/8 * MRTT_i
        – RTT variance:       RTTVAR_i  = 3/4 * RTTVAR_{i-1} + 1/4 * |SRTT_{i-1} - MRTT_i|
        – Expected RTO value: RTO_{i+1} = SRTT_i + 4 * RTTVAR_i
      • Solution: for the contaminated sample, MRTT_i ← SRTT_{i-1}
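Building on the earlier estimator sketch, this hedged example shows the substitution MRTT_i ← SRTT_{i-1} for a sample that spans a detected wakeup; the flag sample_spans_wakeup is a hypothetical name, not PVTCP's actual interface.

    /* Estimator sketch with the PVTCP substitution: if the RTT sample was taken
     * across a detected wakeup, replace it with the previous smoothed RTT so
     * VMSchedDelay does not inflate SRTT/RTTVAR. */
    #include <math.h>
    #include <stdio.h>

    struct rtt_state { double srtt, rttvar; };

    static double pvtcp_update_rto(struct rtt_state *s, double mrtt_ms,
                                   int sample_spans_wakeup, double rto_min_ms)
    {
        if (sample_spans_wakeup && s->srtt > 0)
            mrtt_ms = s->srtt;              /* MRTT_i <- SRTT_{i-1} */

        if (s->srtt == 0) {                 /* first sample */
            s->srtt = mrtt_ms;
            s->rttvar = mrtt_ms / 2;
        } else {
            s->rttvar = 0.75 * s->rttvar + 0.25 * fabs(s->srtt - mrtt_ms);
            s->srtt   = 0.875 * s->srtt   + 0.125 * mrtt_ms;
        }
        double rto = s->srtt + 4 * s->rttvar;
        return rto < rto_min_ms ? rto_min_ms : rto;
    }

    int main(void)
    {
        struct rtt_state s = {0};
        printf("RTO=%.2fms\n", pvtcp_update_rto(&s, 0.4, 0, 1.0));  /* clean sample */
        printf("RTO=%.2fms\n", pvtcp_update_rto(&s, 60.4, 1, 1.0)); /* sample spans a 60ms wakeup */
        return 0;
    }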
  26. PVTCP – the receiver VM is preempted
      Spurious RTOs cannot be avoided, so we have to let the sender detect them.
      • Detection algorithms require deterministic return of future ACKs from the receiver
        – Enable delayed ACK → retransmission ambiguity
        – Disable delayed ACK → significant CPU overhead
      • Solution: temporarily disable delayed ACK when the receiver VM has just woken up (a sketch follows below).
        – Eifel: checks the timestamp of the first ACK
        – F-RTO: checks the ACK numbers of the first two ACKs
        – Just-in-time: do not delay the ACKs for the first three segments
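A minimal sketch of the "just-in-time" receiver-side idea, assuming wakeup detection as above; QUICKACK_SEGMENTS, quickack_budget and the helper names are hypothetical, and real delayed-ACK handling of course lives inside the kernel's TCP stack.

    /* After a wakeup is detected, ACK the next few segments immediately instead
     * of coalescing them, so the sender's detection logic (F-RTO/Eifel) gets
     * deterministic ACKs. */
    #include <stdio.h>

    #define QUICKACK_SEGMENTS 3     /* slide: do not delay ACKs for the first 3 segments */

    static int quickack_budget;     /* how many more segments to ACK immediately */

    static void on_receiver_wakeup(void)
    {
        quickack_budget = QUICKACK_SEGMENTS;
    }

    /* Returns 1 if this incoming segment should be ACKed immediately. */
    static int should_ack_now(void)
    {
        if (quickack_budget > 0) {
            quickack_budget--;
            return 1;               /* temporarily bypass delayed ACK */
        }
        return 0;                   /* fall back to normal delayed-ACK behaviour */
    }

    int main(void)
    {
        on_receiver_wakeup();
        for (int seg = 1; seg <= 5; seg++)
            printf("segment %d: %s\n", seg,
                   should_ack_now() ? "ACK immediately" : "delayed ACK");
        return 0;
    }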
  27. PVTCP evaluation: throughput
      Experimental setup: 20 sender VMs → 1 receiver VM.
      [Figure: goodput of TCP-200ms, TCP-1ms and PVTCP-1ms (RTOmin as labeled) – TCP's dilemma between pseudo-congestion and real congestion]
      PVTCP avoids throughput collapse over the whole range.
  28. PVTCP evaluation: CPU overhead
      [Figures: CPU utilization of the sender VM and the receiver VM]
      With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms)
  29. PVTCP evaluation: CPU overhead
      [Figures: CPU utilization of the sender VM and the receiver VM]

      Spurious RTOs are avoided (sender side):
        RTOmin        TCP-200ms   TCP-1ms   PVTCP-1ms
        Total ACKs      192,587   244,757     192,863   (+0% vs. TCP-200ms)

      Temporarily disable delayed ACK to help the sender detect spurious RTOs (receiver side):
        RTOmin        TCP-200ms   TCP-1ms   PVTCP-1ms
        Total ACKs      194,384   262,274     208,688   (+7.4% vs. TCP-200ms)
  30. One concern
  31. The buffer of the netback
      [Figure: with either the sender VM or the receiver VM preempted, packets wait in the driver domain's buffer until the VM is scheduled again – the buffer size matters!]
      • The vif's buffer temporarily stores incoming packets while the VM is preempted.
        – ifconfig vifX.Y txqueuelen [value]
      • The default value is too small → intensive packet loss.
        – #define XENVIF_QUEUE_LENGTH 32
      • This parameter should be set bigger (> 10,000, perhaps).
  32. Summary
      Problem: VM scheduling delays cause spurious RTOs.
      • Sender-side problem – there are OS reasons.
      • Receiver-side problem – a networking problem.
      Proposed solution: a ParaVirtualized TCP (PVTCP)
      • Provides a method to detect a VM's wakeup moment.
      • Sender-side problem
        – Spurious RTOs can be avoided.
        – Slightly extend the retransmit timer's expiry time after the sender VM wakes up.
      • Receiver-side problem
        – Spurious RTOs can be detected.
        – Temporarily disable delayed ACK after the receiver VM wakes up ("just-in-time").
      Future work: your inputs ..
  33. Thanks for listening
      Comments & Questions
      Email: lwcheng@cs.hku.hk
      URL: http://www.cs.hku.hk/~lwcheng