While datacenters are increasingly adopting VMs to provide elastic cloud services, they still rely on traditional TCP for congestion control. In this talk, I will first show that VM scheduling delays can heavily contaminate the RTTs sensed by VM senders, preventing TCP from correctly learning the physical network condition. Focusing on the incast problem, which is commonly seen in large-scale distributed data processing such as MapReduce and web search, I find that the solutions developed for *physical* clusters fall short in a Xen *virtual* cluster. Second, I will provide a concrete understanding of the problem, and reveal that the situation when the sending VM is preempted differs from the situation when the receiving VM is preempted. Third, I will introduce my recent attempts at paravirtualizing TCP to overcome the negative effects of VM scheduling delays.
XPDS13: On Paravirtualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong
1. On ParaVirtualizing TCP:
Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science
The University of Hong Kong
Xen Project Developer Summit 2013
Edinburgh, UK, October 24-25, 2013
2. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
4. Physical datacenter vs. Virtualized datacenter
[Figure: two rack-scale topologies, each with a core switch, ToR switches, and servers in a rack. Left: a set of physical machines. Right: a set of virtual machines, several VMs per server.]
Network delays:
– Physical datacenter: propagation delays of the physical network/switches
– Virtualized datacenter: additional delays due to virtualization overhead
5. Virtualization brings “delays”
[Figure: several VMs multiplexed by the hypervisor onto two pCPUs; a VM waiting in the scheduling queue experiences a delay before it runs.]
1. I/O virtualization overhead (PV or HVM)
– Guest VMs are unable to directly access the hardware.
– Additional data movement between dom0 and domUs.
– HVM: Passthrough I/O can avoid it
2. VM scheduling delays
– Multiple VMs share one physical core
6. Virtualization brings “delays” (cont’d)
[Figure: measured RTTs. PM→PM: avg 0.147ms. 1VM→1VM: avg 0.374ms. 1VM→2VMs: peaks at 30ms. 1VM→3VMs: peaks at 60ms.]
Delays of I/O virtualization (PV guests): < 1ms
VM scheduling delays: tens of milliseconds
– Queuing delays ≪ VM scheduling delays
– The dominant factor in network RTT
8. Incast network congestion
• A special form of network congestion, typically seen in
distributed processing applications (scatter-gather).
– Barrier-synchronized request workloads
– The limited buffer space of the switch output port can be easily
overfilled by simultaneous transmissions.
• Application-level throughput (goodput) can be orders of
magnitude lower than the link capacity. [SIGCOMM’09]
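To make the traffic pattern concrete, here is a minimal sketch (my illustration, not from the talk) of a barrier-synchronized scatter-gather client; the server addresses, port, and block size are made up. Each round requests one block from every server at once and completes only when the slowest response has fully arrived, so all servers transmit simultaneously into the same switch output port:

```c
/* Hypothetical sketch of a barrier-synchronized (scatter-gather) reader:
 * request a block from N servers at once, then wait for ALL responses
 * before starting the next round -- the traffic pattern behind incast. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSERVERS 8
#define BLOCK    (256 * 1024)       /* per-server block size (assumed) */

int main(void) {
    int fd[NSERVERS];
    for (int i = 0; i < NSERVERS; i++) {
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port   = htons(9000) };
        a.sin_addr.s_addr = htonl(0x0a000001 + i);  /* 10.0.0.1+i (assumed) */
        fd[i] = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd[i], (struct sockaddr *)&a, sizeof a) < 0)
            return perror("connect"), 1;
    }
    for (int round = 0; round < 1000; round++) {
        /* Scatter: every server is asked for its block at the same time. */
        for (int i = 0; i < NSERVERS; i++)
            write(fd[i], "GET", 3);
        /* Gather (barrier): the round ends only when the slowest
         * server's block has fully arrived. */
        for (int i = 0; i < NSERVERS; i++) {
            char buf[4096];
            for (size_t got = 0; got < BLOCK; ) {
                ssize_t n = read(fd[i], buf, sizeof buf);
                if (n <= 0) return 1;
                got += (size_t)n;
            }
        }
    }
    return 0;
}
```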
9. Solutions for physical clusters
Prior works: none of them can fully eliminate the
throughput collapse.
– Increase switch buffer size
– Limited transmit
– Reduce duplicate ACK threshold
– Disable slow-start
– Randomize timeout value
– Reno, NewReno, SACK
The dominant factor: once a packet loss happens, whether
the sender can learn of it as soon as possible.
– In case of “tail loss”, the sender can only count on the
retransmit timer’s firing.
Two representative papers:
– Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]
– Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09]
10. Solutions for physical clusters (cont’d)
Significantly reducing RTOmin has been shown to be a safe
and effective approach. [SIGCOMM’09]
Even with ECN support in the hardware switch, a small RTOmin
still shows apparent advantages. [DCTCP, SIGCOMM’10]
RTOmin in a virtual cluster? Not well studied.
11. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
12. Pseudo-congestion
NO network congestion, yet still RTT spikes.
Setup: 3 VMs per core, 30ms scheduling slices.
[Figure: measured RTTs (red points) and calculated RTO values (blue
points), one panel per setting: RTOmin = 200ms, 100ms, 10ms, 1ms.]
TCP’s low-pass filter computes the Retransmit TimeOut (RTO):
RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin
A small RTOmin → frequent spurious RTOs
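To see why the spikes translate into spurious timeouts, here is a small standalone demo of the formula above (my illustration; the steady-state RTT and the 60ms delay are taken from the earlier measurements). With sub-millisecond RTTs the armed RTO sits at RTOmin, so a single ~60ms scheduling delay fires the retransmit timer long before the genuine (merely delayed) ACK arrives whenever RTOmin is small:

```c
/* Minimal sketch of TCP's RTO estimator (RFC 6298 style, as on the slide):
 * RTTVAR = 3/4*RTTVAR + 1/4*|SRTT - MRTT|, SRTT = 7/8*SRTT + 1/8*MRTT,
 * RTO = max(SRTT + 4*RTTVAR, RTOmin). All values in milliseconds. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double srtt = 0.4, rttvar = 0.2;         /* steady state: ~0.4ms RTTs */
    double rtomin[] = { 200, 100, 10, 1 };   /* the four panels above     */
    for (int p = 0; p < 4; p++) {
        double s = srtt, v = rttvar;
        double rto = fmax(s + 4 * v, rtomin[p]);
        /* A descheduled VM adds ~60ms to one measured RTT. If the timer
         * was armed with RTO < 60ms, it fires before that ACK returns:
         * a spurious RTO. */
        printf("RTOmin=%6.0fms  armed RTO=%6.1fms  -> 60ms delay %s\n",
               rtomin[p], rto, rto < 60 ? "FIRES (spurious)" : "tolerated");
        v = 0.75 * v + 0.25 * fabs(s - 60.0);  /* absorb the spike ...   */
        s = 0.875 * s + 0.125 * 60.0;          /* ... into the filter    */
        printf("               next RTO=%6.1fms (inflated by the spike)\n",
               fmax(s + 4 * v, rtomin[p]));
    }
    return 0;
}
```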
13. Pseudo-congestion (cont’d)
A small RTOmin: serious spurious RTOs when RTTs vary widely.
A big RTOmin: throughput collapse under heavy network congestion.
“adjusting RTOmin is a tradeoff between timely response and
premature timeouts, and there is NO optimal balance between
the two.”
-- Allman and Paxson [SIGCOMM’99]
Virtualized datacenters: a new instantiation of this old dilemma.
14. Sender-side vs. Receiver-side
To transmit 4000 1MB data blocks:

Freq.      3VMs→1VM (sender delayed)   1VM→3VMs (receiver delayed)
1× RTOs           1086                          677
2× RTOs              0                          673
3× RTOs              0                          196
4× RTOs              0                           30

– Sender delayed: an RTO only happens once at a time.
– Receiver delayed: successive RTOs are normal.
15. A micro-view with tcpdump
snd.una: the first sent but unacknowledged byte.
snd.nxt: the next byte that will be sent.
[Figure, left: time (ms) vs. sequence number, traced at the sender VM,
when the sender VM is preempted. While the sender VM is stopped, an ACK
arrives, and the RTO fires just after the sender VM wakes up. The ACK’s
arrival time is not delayed; the VM simply receives it too late. From
TCP’s perspective, this RTO should not be triggered.]
[Figure, right: time (ms) vs. ACK number, traced at the receiver VM,
when the receiver VM is preempted. The RTO fires twice before the
receiver VM wakes up. Both the generation and the return of the ACKs
are delayed, so RTOs must happen on the sender’s side.]
16. The sender-side problem: OS reasons
[Figure: the driver domain buffers an ACK while the sender VM waits in
the scheduling queue behind VM2 and VM3. When the sender VM runs again,
two events are pending: (1) the timer IRQ, which fires the RTO, and
(2) the network IRQ, which would deliver the ACK and clear the
retransmit timer. The timer wins: a spurious RTO.]
After the VM wakes up, both TIMER and NET interrupts are pending.
The RTO fires just before the ACK enters the VM.
The reasons lie in common OS design:
– The timer interrupt is executed before other interrupts.
– Network processing happens a little later (in the bottom half).
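The ordering the slide points at is visible in Linux’s softirq vector: pending softirqs are serviced in ascending enum order, the TCP retransmit timer runs from the timer softirq, and the ACK is processed in NET_RX. A condensed excerpt (the exact set of entries varies by kernel version):

```c
/* Condensed from Linux include/linux/interrupt.h (entries vary by
 * version). Pending softirqs are serviced in ascending enum order, so
 * TIMER_SOFTIRQ (which runs TCP's retransmit timer) is handled before
 * NET_RX_SOFTIRQ (which would deliver the ACK and stop that timer). */
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,    /* serviced first: the RTO fires here          */
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,   /* serviced later: the buffered ACK arrives    */
    /* ... */
};
```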
17. To detect spurious RTOs
Two well-known detection algorithms: F-RTO and Eifel
– Eifel performs much worse than F-RTO in some situations, e.g.
with bursty packet loss [CCR’03]
– F-RTO is implemented in Linux
[Figure: F-RTO’s detection rate for 3VMs→1VM and for 1VM→3VMs; in
both cases the detection rate is low.]
F-RTO interacts badly with delayed ACKs (ACK coalescing)
– Reducing the delayed-ACK timeout value does NOT help.
Disabling delayed ACKs seems to be helpful
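For context, a simplified outline of basic F-RTO’s decision (my paraphrase of RFC 5682, not code from the talk) shows why it needs the first two ACKs after a timeout to arrive promptly and unambiguously; delayed-ACK coalescing breaks exactly that, so the check falls through to “genuine loss”:

```c
/* Simplified outline of basic F-RTO (RFC 5682), my paraphrase. With
 * delayed ACKs, responses are coalesced and the decisive second ACK
 * may come late or not at all, so detection fails. */
enum verdict { UNDECIDED, SPURIOUS_RTO, GENUINE_LOSS };

/* ack_seq: cumulative ACK number; rexmit_end: end of the segment
 * retransmitted when the RTO fired; nth_ack: 1st or 2nd ACK since. */
enum verdict frto_on_ack(unsigned ack_seq, unsigned rexmit_end,
                         int nth_ack, int is_dupack)
{
    if (is_dupack)
        return GENUINE_LOSS;   /* duplicate ACKs indicate real loss      */
    if (nth_ack == 1)
        return UNDECIDED;      /* step 2: send two NEW segments, watch   */
    if (nth_ack == 2 && ack_seq > rexmit_end)
        return SPURIOUS_RTO;   /* ACKs data that was never retransmitted */
    return GENUINE_LOSS;
}
```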
18. Delayed ACK vs. CPU overhead
[Figure: CPU utilization of the sender VM and the receiver VM, with
and without delayed ACKs.]
Disabling delayed ACKs → significant CPU overhead
19. Delayed ACK vs. CPU overhead

Sender VM     delack-200ms   delack-1ms   w/o delack
Total ACKs       229,650       244,757     2,832,260

Receiver VM   delack-200ms   delack-1ms   w/o delack
Total ACKs       252,278       262,274     2,832,179

Disabling delayed ACKs: 11~13× more ACKs are sent
Disabling delayed ACKs → significant CPU overhead
20. Outline
Motivation
– Physical datacenter vs. Virtualized datacenter
– Incast congestion
Understand the Problem
– Pseudo-congestion
– Sender-side vs. Receiver-side
PVTCP – A ParaVirtualized TCP
– Design, Implementation, Evaluation
Questions & Comments
21. PVTCP – A ParaVirtualized TCP
Observation
– Spurious RTOs only happen when the sender/receiver
VM has just experienced a scheduling delay.
Main Idea
– If we can detect such moments, and let the guest OS be
aware of them, there is a chance to handle the problem.
“the more information about current network conditions
available to a transport protocol, the more efficiently it
can use the network to transfer its data.”
-- Allman and Paxson [SIGCOMM’99]
22. Detect the VM’s wakeup moment
[Figure: 3 VMs per core, 30ms slices. The hypervisor delivers virtual
timer IRQs to the running guest every 1ms (guest OS with HZ=1000), and
each IRQ advances jiffies by 1. While the VM is NOT running, no timer
IRQs are delivered; on wakeup a one-shot timer catches the clock up in
one jump: jiffies += 60.]
An acute increase of the system clock (jiffies)
→ the VM has just woken up.
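A minimal sketch of the detection idea (my illustration, not the PVTCP patch itself; pvtcp_wakeup_event() is a hypothetical hook): in the guest’s timer path, check how far jiffies jumped since the last tick; a jump far beyond one tick means the VM was descheduled and has just come back.

```c
/* Sketch of wakeup detection in the guest's timer path. Assumptions:
 * HZ=1000 (1 jiffy = 1ms) and a hypothetical pvtcp_wakeup_event() hook
 * that tells the TCP layer to treat the next measurements specially. */
#include <linux/jiffies.h>

extern void pvtcp_wakeup_event(unsigned long lost_ticks); /* hypothetical */

#define WAKEUP_JUMP_THRESHOLD 5  /* >5 lost ticks => we were descheduled */

static unsigned long last_seen_jiffies;

static void pvtcp_check_wakeup(void)
{
    unsigned long now = jiffies;
    unsigned long delta = now - last_seen_jiffies;

    /* Normally the timer IRQ runs every tick, so delta == 1. An acute
     * jump means the hypervisor withheld the pCPU and caught the clock
     * up in one shot: the VM has just woken up. */
    if (last_seen_jiffies && delta > WAKEUP_JUMP_THRESHOLD)
        pvtcp_wakeup_event(delta);

    last_seen_jiffies = now;
}
```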
23. PVTCP – the sender VM is preempted
Spurious RTOs can be avoided. No need to detect them at all!
[Figure: the same timeline as slide 16. The ACK sits buffered in the
driver domain during the VM scheduling latency; on wakeup, (1) the
timer IRQ fires the RTO just before (2) the network IRQ delivers the
ACK that would have cleared the retransmit timer.]
24. PVTCP – the sender VM is preempted
Spurious RTOs can be avoided. No need to detect them at all!
Solution: after the VM wakes up, extend the TCP retransmit
timer’s expiry time by 1ms.
[Figure: with PVTCP, the net IRQ now effectively comes first: the
buffered ACK enters the VM and resets the timer before the postponed
expiry time is reached, so no spurious RTO fires.]
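The mechanism can be pictured like this (a sketch under my assumptions, not the actual patch; handling is simplified to a single timer). On the wakeup event, a retransmit timer that is already due, or due within the next millisecond, is pushed back 1ms so the pending network IRQ gets to deliver the buffered ACKs first:

```c
/* Sketch: postpone an about-to-fire retransmit timer on VM wakeup so
 * the pending NET IRQ can deliver buffered ACKs first. Simplified;
 * real code would visit every TCP socket's retransmit timer. */
#include <linux/jiffies.h>
#include <linux/timer.h>

static void pvtcp_defer_retransmit(struct timer_list *rto_timer)
{
    unsigned long grace = msecs_to_jiffies(1);

    /* If the timer would fire before the buffered ACKs can be
     * processed, push its expiry 1ms past "now". mod_timer() re-arms
     * a pending timer with the new expiry. */
    if (time_before_eq(rto_timer->expires, jiffies + grace))
        mod_timer(rto_timer, jiffies + grace);
}
```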
25. PVTCP – the sender VM is preempted
[Figure: the same PVTCP timeline as slide 24.]
Even without a spurious RTO, the measurement is contaminated:
Measured RTT (MRTT) = TrueRTT + VMSchedDelay
TCP’s low-pass filter to estimate RTT/RTO:
SRTT_i = 7/8 × SRTT_{i-1} + 1/8 × MRTT_i
RTTVAR_i = 3/4 × RTTVAR_{i-1} + 1/4 × |SRTT_{i-1} − MRTT_i|
RTO_{i+1} = SRTT_i + 4 × RTTVAR_i
Solution: substitute MRTT_i ← SRTT_{i-1} for samples taken across a
wakeup, so the scheduling delay never enters the filter.
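In code, the substitution is one branch in the estimator update (a sketch with illustrative names; just_woke_up would be driven by the jiffies-jump detector from slide 22):

```c
/* Sketch of the sender-side filter fix: when the sample was taken
 * across a wakeup, feed the filter the previous smoothed RTT instead
 * of the contaminated measurement. Kernel fixed-point details omitted;
 * values are in milliseconds for clarity. */
struct rtt_est { double srtt, rttvar, rto, rto_min; };

void rtt_update(struct rtt_est *e, double mrtt, int just_woke_up)
{
    if (just_woke_up)
        mrtt = e->srtt;  /* MRTT_i <- SRTT_{i-1}: drop the sched delay */

    e->rttvar = 0.75 * e->rttvar
              + 0.25 * ((e->srtt > mrtt) ? e->srtt - mrtt : mrtt - e->srtt);
    e->srtt   = 0.875 * e->srtt + 0.125 * mrtt;
    e->rto    = e->srtt + 4 * e->rttvar;
    if (e->rto < e->rto_min)
        e->rto = e->rto_min;
}
```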
26. PVTCP – the receiver VM is preempted
Spurious RTOs cannot be avoided, so we have to let
the sender detect them.
Detection algorithms require the deterministic return of
future ACKs from the receiver:
– Delayed ACKs enabled → retransmission ambiguity
– Delayed ACKs disabled → significant CPU overhead
Solution: temporarily disable delayed ACKs when the
receiver VM has just woken up.
– Eifel: checks the timestamp of the first ACK
– F-RTO: checks the ACK numbers of the first two ACKs
– Just-in-time: do not delay the ACKs for the first three segments
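A sketch of that just-in-time policy (illustrative names throughout, not the PVTCP source): the wakeup event arms a small budget of segments whose ACKs must not be delayed, and the receiver’s delayed-ACK decision consults that budget.

```c
/* Sketch of the receiver-side fix: on wakeup, force immediate ACKs for
 * a few segments so the sender's detector (F-RTO/Eifel) gets its
 * deterministic ACKs, then fall back to normal delayed ACKs. */
#define PVTCP_QUICKACK_SEGS 3   /* "just-in-time": first three segments */

static int pvtcp_quickack_budget;

/* Called from the wakeup detector (see slide 22). */
void pvtcp_on_wakeup(void)
{
    pvtcp_quickack_budget = PVTCP_QUICKACK_SEGS;
}

/* Called wherever the receiver decides between delaying and sending an
 * ACK; returns nonzero if this segment must be ACKed immediately. */
int pvtcp_force_immediate_ack(void)
{
    if (pvtcp_quickack_budget > 0) {
        pvtcp_quickack_budget--;
        return 1;   /* immediate ACK: keep detection unambiguous */
    }
    return 0;       /* normal delayed-ACK behavior resumes       */
}
```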
27. PVTCP evaluation: throughput
TCP’s dilemma: pseudo-congestion & real congestion.
Experimental setup: 20 sender VMs → 1 receiver VM.
[Figure: goodput vs. RTOmin for TCP-200ms, TCP-1ms and PVTCP-1ms.]
PVTCP avoids throughput collapse over the whole RTOmin range.
28. PVTCP evaluation: CPU overhead
[Figure: CPU utilization of the sender VM and the receiver VM.]
With delayed ACKs enabled:
PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms)
29. PVTCP evaluation: CPU overhead

Sender VM     TCP-200ms   TCP-1ms   PVTCP-1ms
Total ACKs      192,587    244,757    192,863 (+0%)

Receiver VM   TCP-200ms   TCP-1ms   PVTCP-1ms
Total ACKs      194,384    262,274    208,688 (+7.4%)

– Sender side: spurious RTOs are avoided.
– Receiver side: delayed ACKs are temporarily disabled to help
the sender detect spurious RTOs, at +7.4% ACK traffic.
31. The buffer of the netback
[Figure, left: the scheduling delays to the sender VM. Data packets
wait in the driver domain for ACKs while the sender VM sits in the
VM scheduling queue; the RTO fires inside the sender VM.]
[Figure, right: the scheduling delays to the receiver VM. Incoming
data packets pile up in the driver domain’s buffer until the receiver
VM runs and can ACK them. The buffer size matters!]
The vif’s buffer temporarily stores incoming packets
while the VM is preempted.
– ifconfig vifX.Y txqueuelen [value]
The default value is too small → intensive packet loss
– #define XENVIF_QUEUE_LENGTH 32
This parameter should be set much bigger (> 10,000 perhaps..)
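For instance, a deployment could raise the limit at build time in netback; the value below is an arbitrary illustration, not a recommendation from the talk:

```c
/* Illustrative only: enlarging the per-vif queue in Xen's netback
 * (name as on the slide; default in that tree was 32). 12800 is an
 * arbitrary example value; the talk only suggests "> 10,000 perhaps".
 * The same limit can also be raised at runtime, without rebuilding:
 *   ifconfig vifX.Y txqueuelen 12800                                 */
#define XENVIF_QUEUE_LENGTH 12800
```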
32. Summary
Problem: VM scheduling delays cause spurious RTOs.
– Sender-side problem: rooted in common OS design.
– Receiver-side problem: a networking problem.
Proposed Solution: a ParaVirtualized TCP (PVTCP)
– Provides a method to detect a VM’s wakeup moment.
Sender-side: spurious RTOs can be avoided.
– Slightly extend the retransmit timer’s expiry time after
the sender VM wakes up.
Receiver-side: spurious RTOs can be detected.
– Temporarily disable delayed ACKs after the receiver VM
wakes up (just-in-time ACKing).
Future Work: your inputs ..
33. Thanks for listening
Comments & Questions
Email: lwcheng@cs.hku.hk
URL: http://www.cs.hku.hk/~lwcheng