XPDS13: On Paravirtualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong

While datacenters are increasingly adopting VMs to provide elastic cloud services, they still rely on traditional TCP for congestion control. In this talk, I will first show that VM scheduling delays can heavily contaminate the RTTs sensed by VM senders, preventing TCP from correctly learning the physical network condition. Focusing on the incast problem, which is commonly seen in large-scale distributed data processing such as MapReduce and web search, I find that the solutions developed for *physical* clusters fall short in a Xen *virtual* cluster. Second, I will provide a concrete understanding of the problem, and show that the situation when the sending VM is preempted differs from the situation when the receiving VM is preempted. Third, I will introduce my recent attempts at paravirtualizing TCP to overcome the negative effects caused by VM scheduling delays.


Transcript

  • 1. On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines. Luwei Cheng, Cho-Li Wang, Francis C.M. Lau, Department of Computer Science, The University of Hong Kong. Xen Project Developer Summit 2013, Edinburgh, UK, October 24-25, 2013
  • 2. Outline • Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion • Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side • PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation • Questions & Comments
  • 3. Outline • Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion • Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side • PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation • Questions & Comments
  • 4. Physical datacenter vs. Virtualized datacenter. [diagram: core switch, ToR switches, servers in a rack; in the virtualized case each server in a rack hosts several VMs]
    – Physical datacenter: a set of physical machines. Network delays: the propagation delays of the physical network/switches.
    – Virtualized datacenter: a set of virtual machines. Network delays: additional delays due to virtualization overhead.
  • 5. Virtualization brings “delays”. [diagram: several VMs share the pCPUs through the hypervisor]
    – 1. I/O virtualization overhead (PV or HVM): guest VMs are unable to directly access the hardware, so there is additional data movement between dom0 and domUs. HVM: passthrough I/O can avoid it.
    – 2. VM scheduling delays: multiple VMs share one physical core.
  • 6. Virtualization brings “delays” (cont’d). [RTT measurements]
    – Delays due to I/O virtualization (PV guests): < 1ms (avg 0.147ms for PM → PM, avg 0.374ms for 1VM → 1VM).
    – VM scheduling delays: tens of ms, caused by queuing delays (peaks around 30ms for 1VM → 2VMs and 60ms for 1VM → 3VMs).
    – VM scheduling delays are the dominant factor in network RTT.
  • 7. Network delays in public clouds [INFOCOM’10] [HPDC’10]
  • 8. Incast network congestion • A special form of network congestion, typically seen in distributed processing applications (scatter-gather). – Barrier-synchronized request workloads – The limited buffer space of the switch output port can be easily overfilled by simultaneous transmissions. • Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM’09]
  • 9. Solutions for physical clusters
    – Prior works, none of which can fully eliminate the throughput collapse: increase switch buffer size; limited transmit; reduce the duplicate ACK threshold; disable slow-start; randomize the timeout value; Reno, NewReno, SACK.
    – The dominant factor: once packet loss happens, whether the sender can learn about it as soon as possible. In the case of “tail loss”, the sender can only count on the retransmit timer’s firing.
    – Two representative papers: Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]; Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09].
  • 10. Solutions for physical clusters (cont’d)
    – Significantly reducing RTOmin has been shown to be a safe and effective approach [SIGCOMM’09].
    – Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages [DCTCP, SIGCOMM’10].
    – RTOmin in a virtual cluster? Not well studied.
  • 11. Outline • Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion • Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side • PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation • Questions & Comments
  • 12. Pseudo-congestion: NO network congestion, still RTT spikes (~30ms, 3 VMs per core). [plots: red points are measured RTTs, blue points are calculated RTO values, for RTOmin = 200ms, 100ms, 10ms and 1ms]
    – TCP’s low-pass filter for the retransmit timeout: RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin.
    – A small RTOmin → frequent spurious RTOs.
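  As a concrete illustration of the filter named above, here is a minimal, self-contained C sketch of the standard estimator (RFC 6298 weights, result clamped at RTOmin). The sample RTT values and the 1ms RTOmin are illustrative only, not measurements from the talk; they merely show how a single 30ms scheduling spike first leaves RTO pinned near RTOmin and then inflates it.

      #include <stdio.h>

      /* Minimal sketch of TCP's RTO estimation, as sketched on the slide:
       *   SRTT   <- 7/8 * SRTT + 1/8 * MRTT
       *   RTTVAR <- 3/4 * RTTVAR + 1/4 * |SRTT - MRTT|
       *   RTO     = max(SRTT + 4 * RTTVAR, RTOmin)
       * All values are in milliseconds. */
      struct rto_state { double srtt, rttvar; };

      static double rto_update(struct rto_state *s, double mrtt, double rto_min)
      {
          if (s->srtt == 0.0) {                 /* first measurement */
              s->srtt = mrtt;
              s->rttvar = mrtt / 2.0;
          } else {
              double err = s->srtt - mrtt;
              if (err < 0)
                  err = -err;
              s->rttvar = 0.75 * s->rttvar + 0.25 * err;
              s->srtt   = 0.875 * s->srtt + 0.125 * mrtt;
          }
          double rto = s->srtt + 4.0 * s->rttvar;
          return rto < rto_min ? rto_min : rto;
      }

      int main(void)
      {
          /* Sub-millisecond RTTs followed by one 30ms sample caused by a
           * scheduling delay: with RTOmin = 1ms the timer is armed at ~1ms
           * before the spike (so a 30ms preemption fires it spuriously),
           * and the contaminated sample then inflates the next RTO. */
          double samples[] = { 0.4, 0.35, 0.5, 30.0, 0.4 };
          struct rto_state s = { 0.0, 0.0 };
          for (int i = 0; i < 5; i++)
              printf("MRTT=%5.2f ms  ->  RTO=%6.2f ms\n", samples[i],
                     rto_update(&s, samples[i], 1.0));
          return 0;
      }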
  • 13. Pseudo-congestion (cont’d)
    – A small RTOmin: serious spurious RTOs under largely varied RTTs. A big RTOmin: throughput collapse under heavy network congestion.
    – “Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two.” -- Allman and Paxson [SIGCOMM’99]
    – Virtualized datacenters → a new instantiation of this tradeoff.
  • 14. Sender-side vs. Receiver-side: the scheduling delays to the sender VM vs. to the receiver VM, while transmitting 4000 1MB data blocks (frequency of 1×, 2×, 3× and 4× consecutive RTOs):
    – 3VMs → 1VM (sender VM preempted): 1086 / 0 / 0 / 0 – an RTO only happens once at a time.
    – 1VM → 3VMs (receiver VM preempted): 677 / 673 / 196 / 30 – successive RTOs are normal.
  • 15. A micro-view with tcpdump. snd.una: the first sent but unacknowledged byte; snd.nxt: the next byte that will be sent. [two plots: time vs. sequence number from the sender VM, and time vs. ACK number from the receiver VM]
    – When the sender VM is preempted: an ACK arrives while the sender VM is stopped, and the RTO fires just after the sender VM wakes up. The ACK’s arrival time is not delayed, but it is received too late; from TCP’s perspective, the RTO should not be triggered.
    – When the receiver VM is preempted: the RTO fires twice before the receiver VM wakes up. The generation and the return of the ACKs are delayed, so RTOs must happen on the sender’s side.
  • 16. The sender-side problem: OS reasons. [diagram: the ACK waits in the driver domain while the sender VM sits in the scheduling queue; when the VM resumes, the timer IRQ (RTO) is handled before the network IRQ that delivers the ACK → spurious RTO]
    – After the VM wakes up, both the TIMER and NET interrupts are pending.
    – The RTO fires just before the ACK enters the VM.
    – The reasons come from common OS design: the timer interrupt is executed before other interrupts, and network processing happens a little later (in the bottom half).
  • 17. To detect spurious RTOs
    – Two well-known detection algorithms: F-RTO and Eifel. Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR’03]; F-RTO is implemented in Linux.
    – Both show a low detection rate here, for 3VMs → 1VM as well as 1VM → 3VMs.
    – F-RTO interacts badly with delayed ACK (ACK coalescing): reducing the delayed ACK timeout value does NOT help; disabling delayed ACK seems to be helpful.
  • 18. Delayed ACK vs. CPU overhead. [CPU utilization plots for the sender VM and the receiver VM, with and without delayed ACK] Disabling delayed ACK → significant CPU overhead.
  • 19. Delayed ACK vs. CPU overhead (cont’d). Total ACKs in the two experiments:
    – Total ACKs: 229,650 (delack-200ms), 244,757 (delack-1ms), 2,832,260 (w/o delack).
    – Total ACKs: 252,278 (delack-200ms), 262,274 (delack-1ms), 2,832,179 (w/o delack).
    – Disabling delayed ACK: 11~13× more ACKs are sent → significant CPU overhead.
  • 20. Outline • Motivation – Physical datacenter vs. Virtualized datacenter – Incast congestion • Understand the Problem – Pseudo-congestion – Sender-side vs. Receiver-side • PVTCP – A ParaVirtualized TCP – Design, Implementation, Evaluation • Questions & Comments
  • 21. PVTCP – A ParaVirtualized TCP
    – Observation: spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.
    – Main idea: if we can detect that moment, and let the guest OS be aware of it, there is a chance to handle the problem.
    – “The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data.” -- Allman and Paxson [SIGCOMM’99]
  • 22. Detect the VM’s wakeup moment. [diagram: 3 VMs per core; the guest OS (HZ=1000) increments jiffies on every virtual timer IRQ (every 1ms, delivered by the hypervisor’s one-shot timer); while the VM is not running, no ticks are delivered, so on wakeup jiffies jumps forward at once, e.g. jiffies += 60]
    – An acute increase of the system clock (jiffies) → the VM just woke up. A detection sketch follows below.
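  As a rough illustration of this heuristic (not the actual PVTCP patch), a guest tick handler can compare the new jiffies value with the last one it observed and treat a gap much larger than one tick as a wakeup. The names on_virtual_timer_tick, pvtcp_just_woke_up and JUMP_THRESHOLD_TICKS are invented for this sketch.

      /* Hedged sketch of the wakeup-detection heuristic on the slide: with
       * HZ=1000 each tick is 1ms, so a jump of many ticks at once means the
       * VM was descheduled.  These are illustrative names, not symbols from
       * the real PVTCP code. */
      #define JUMP_THRESHOLD_TICKS 5           /* > 5ms gap => treat as wakeup */

      static unsigned long last_seen_jiffies;  /* jiffies at the previous tick */
      static unsigned long wakeup_stamp;       /* jiffies at the last wakeup   */

      void on_virtual_timer_tick(unsigned long jiffies_now)
      {
          unsigned long gap = jiffies_now - last_seen_jiffies;

          if (last_seen_jiffies != 0 && gap > JUMP_THRESHOLD_TICKS)
              wakeup_stamp = jiffies_now;      /* acute jump: VM just woke up */

          last_seen_jiffies = jiffies_now;
      }

      /* Lets the TCP code ask: did the VM wake up within the last few ticks? */
      int pvtcp_just_woke_up(unsigned long jiffies_now, unsigned long window)
      {
          return wakeup_stamp != 0 && jiffies_now - wakeup_stamp <= window;
      }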
  • 23. PVTCP – the sender VM is preempted: spurious RTOs can be avoided, no need to detect them at all. [diagram: the ACK waits in the driver domain’s buffer during the VM scheduling latency; when the VM resumes, the retransmit timer has already passed its expiry time and fires before the ACK is delivered]
  • 24. PVTCP – the sender VM is preempted (cont’d)
    – Solution: after the VM wakes up, extend the TCP retransmit timer’s expiry time by 1ms. [diagram: with the small extension the network IRQ runs first, the ACK enters the VM, and the retransmit timer is reset instead of firing spuriously] A sketch of this step follows below.
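  A hedged sketch of how that 1ms extension could be applied in the guest, reusing the illustrative pvtcp_just_woke_up() helper from the earlier sketch. The function name and the rto_timer_expires parameter are stand-ins for the guest kernel’s real retransmit-timer bookkeeping, not the talk’s actual code.

      /* If the VM has just resumed and the retransmit timer is (about to be)
       * expired, push its expiry 1ms into the future so the pending network
       * IRQ can deliver the waiting ACK first and legitimately reset the
       * timer.  If the ACK really was lost, the timer still fires ~1ms later. */
      int pvtcp_just_woke_up(unsigned long jiffies_now, unsigned long window);

      void pvtcp_maybe_defer_rto(unsigned long jiffies_now,
                                 unsigned long *rto_timer_expires)
      {
          const unsigned long one_ms = 1;      /* HZ = 1000 -> 1 jiffy per ms */

          if (pvtcp_just_woke_up(jiffies_now, 2) &&
              *rto_timer_expires <= jiffies_now + one_ms)
              *rto_timer_expires = jiffies_now + one_ms;
      }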
  • 25. PVTCP – the sender VM is preempted (cont’d). [diagram: as on the previous slide, the network IRQ runs first and the ACK resets the retransmit timer]
    – Measured RTT: MRTT = TrueRTT + VMSchedDelay.
    – TCP’s low-pass filter to estimate RTT/RTO:
      SRTT_i = 7/8 * SRTT_{i-1} + 1/8 * MRTT_i
      RTTVAR_i = 3/4 * RTTVAR_{i-1} + 1/4 * |SRTT_{i-1} - MRTT_i|
      RTO_{i+1} = SRTT_i + 4 * RTTVAR_i
    – Solution: MRTT_i ← SRTT_{i-1}, i.e. feed the previous smoothed RTT into the filter instead of the scheduling-delay-contaminated sample. A sketch follows below.
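  A hedged interpretation of that substitution in code, reusing the illustrative rto_state/rto_update() and pvtcp_just_woke_up() helpers from the sketches above; this is not the talk’s actual implementation.

      /* When the VM has just resumed, the measured sample includes the VM
       * scheduling delay (MRTT = TrueRTT + VMSchedDelay), so replace it with
       * the previous smoothed RTT before running the normal estimator. */
      double pvtcp_rtt_sample(struct rto_state *s, double mrtt_ms,
                              unsigned long jiffies_now, double rto_min_ms)
      {
          if (pvtcp_just_woke_up(jiffies_now, 2) && s->srtt > 0.0)
              mrtt_ms = s->srtt;               /* MRTT_i <- SRTT_{i-1} */

          return rto_update(s, mrtt_ms, rto_min_ms);
      }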
  • 26. PVTCP – the receiver VM is preempted: spurious RTOs cannot be avoided, so we have to let the sender detect them.
    – Detection algorithms require the deterministic return of future ACKs from the receiver: enabling delayed ACK → retransmission ambiguity; disabling delayed ACK → significant CPU overhead.
    – Solution: temporarily disable delayed ACK when the receiver VM has just woken up.
    – Eifel: check the timestamp of the first ACK. F-RTO: check the ACK numbers of the first two ACKs. Just-in-time: do not delay the ACKs for the first three segments. A receiver-side sketch follows below.
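  A hedged sketch of the “just-in-time” receiver behaviour, again with invented names (quickack_budget, pvtcp_on_wakeup, pvtcp_rx_segment): right after the receiver VM wakes up, the first few incoming segments are ACKed immediately so the sender’s F-RTO/Eifel logic gets the prompt ACKs it needs, after which the normal delayed-ACK policy resumes.

      #define PVTCP_QUICKACK_SEGMENTS 3   /* ACK the first 3 segments at once */

      static int quickack_budget;

      /* Called from the wakeup detector (see the earlier sketch). */
      void pvtcp_on_wakeup(void)
      {
          quickack_budget = PVTCP_QUICKACK_SEGMENTS;
      }

      /* Returns 1 if this incoming segment must be ACKed immediately,
       * 0 if the normal delayed-ACK (coalescing) policy may apply. */
      int pvtcp_rx_segment(void)
      {
          if (quickack_budget > 0) {
              quickack_budget--;
              return 1;
          }
          return 0;
      }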
  • 27. PVTCP evaluation: throughput. TCP’s dilemma: pseudo-congestion vs. real congestion. [goodput curves for TCP (RTOmin=200ms), TCP (RTOmin=1ms) and PVTCP (RTOmin=1ms); experimental setup: 20 sender VMs → 1 receiver VM] PVTCP avoids throughput collapse over the whole range.
  • 28. PVTCP evaluation: CPU overhead. [CPU utilization plots for the sender VM and the receiver VM] With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms).
  • 29. PVTCP evaluation: CPU overhead (cont’d)
    – Sender VM preempted (spurious RTOs are avoided) – Total ACKs: 192,587 (TCP-200ms), 244,757 (TCP-1ms), 192,863 (PVTCP-1ms, +0% vs. TCP-200ms).
    – Receiver VM preempted (delayed ACK temporarily disabled to help the sender detect spurious RTOs) – Total ACKs: 194,384 (TCP-200ms), 262,274 (TCP-1ms), 208,688 (PVTCP-1ms, +7.4% vs. TCP-200ms).
  • 30. One concern
  • 31. The buffer of the netback. [diagram: data packets wait in the driver domain’s per-vif buffer while the sender or receiver VM sits in the scheduling queue; if the buffer overflows, packets are dropped and RTOs follow, so the buffer size matters]
    – The vif’s buffer temporarily stores incoming packets when the VM has been preempted: ifconfig vifX.Y txqueuelen [value].
    – The default value is too small → intensive packet loss: #define XENVIF_QUEUE_LENGTH 32.
    – This parameter should be set bigger (> 10,000 perhaps). A small example of setting it at runtime follows below.
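  For completeness, a small C program that does the runtime equivalent of ifconfig vifX.Y txqueuelen [value] from dom0 via the Linux SIOCSIFTXQLEN ioctl; the interface name "vif1.0" and the value 10000 are placeholders, and the compile-time alternative is enlarging XENVIF_QUEUE_LENGTH in netback and rebuilding.

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>
      #include <linux/sockios.h>

      int main(void)
      {
          struct ifreq ifr;
          int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket for ioctl */
          if (fd < 0) { perror("socket"); return 1; }

          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "vif1.0", IFNAMSIZ - 1);  /* placeholder vif */
          ifr.ifr_qlen = 10000;                           /* new txqueuelen  */

          if (ioctl(fd, SIOCSIFTXQLEN, &ifr) < 0) {
              perror("SIOCSIFTXQLEN");
              close(fd);
              return 1;
          }
          close(fd);
          return 0;
      }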
  • 32. Summary. Problem: VM scheduling delays cause spurious RTOs.
    – Sender-side problem: rooted in OS design. Receiver-side problem: a networking problem.
    – Proposed solution: a ParaVirtualized TCP (PVTCP), which provides a method to detect a VM’s wakeup moment.
    – Sender side: spurious RTOs can be avoided, by slightly extending the retransmit timer’s expiry time after the sender VM wakes up.
    – Receiver side: spurious RTOs can be detected, by temporarily disabling delayed ACK (just-in-time ACKs) after the receiver VM wakes up.
    – Future work: your inputs.
  • 33. Thanks for listening. Comments & Questions. Email: lwcheng@cs.hku.hk URL: http://www.cs.hku.hk/~lwcheng