Extreme Performance Series: Network Speed Ahead (Session VSVC5596)
VMworld 2013
Lenin Singaravelu, VMware
Haoqiang Zheng, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
2. 2
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
3. 3
vSphere Networking Architecture: A Simplified View
[Diagram: PNIC driver at the bottom; above it the virtualization layer (vSwitch, vDS, VXLAN, NetIOC, teaming, DVFilter, ...) and the device emulation layer; on top, VMs with vNICs and vmknics (VMkernel TCP/IP) carrying NFS/iSCSI, vMotion, management and FT traffic; pass-through paths (SR-IOV VF, DirectPath I/O, DirectPath I/O with vMotion) bypass the virtual networking stack]
4. 4
Transmit Processing for a VM
One transmit thread per VM, executing all parts of the stack
• Transmit thread can also execute receive path for destination VM
Wakeup of transmit thread: Two mechanisms
• Immediate, forcible wakeup by VM (low delay, high CPU overhead)
• Opportunistic wakeup by other threads or when VM halts (potentially higher
delay, low CPU overhead)
[Diagram: per-VM transmit path; the device emulation layer converts vNIC ring entries to packets, the network virtualization layer performs switching, encapsulation and teaming, and packets are handed to the PNIC driver or converted back to ring entries for a destination VM on the same host; the transmit thread is woken either by a VMkernel call (VMKcall) or opportunistically]
5. 5
Receive Processing For a VM
One thread per device
NetQueue enabled devices: one thread per NetQueue
• Each NetQueue processes traffic for one or more MAC addresses (vNICs)
• NetQueue-to-vNIC mapping determined by unicast throughput and first-come, first-served (FCFS) order
vNICs can share queues
• Causes: low throughput, too many vNICs, or a queue type mismatch (LRO queue vs. non-LRO vNIC)
[Diagram: PNIC queues feeding the network virtualization layer and device emulation; one vNIC is served by a dedicated NetQueue while the remaining vNICs share a queue]
6. 6
Advanced Performance Monitoring using net-stats
net-stats: single-host network performance monitoring tool
since vSphere 5.1
Runs on the ESXi console; use net-stats -h for help and net-stats -A to
monitor all ports (example at the end of this slide)
• Measure packet rates and drops at various layers (vNIC backend, vSwitch,
PNIC) in a single place
• Identify VMkernel threads for transmit and receive processing
• Break down CPU cost for networking into interrupt, receive, vCPU and
transmit thread
• PNIC Stats: NetQueue allocation information, interrupt rate
• vNIC Stats: Coalescing and RSS information
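A minimal sketch of a net-stats invocation on the ESXi shell: only -h and -A come from this slide, while the -i (sampling interval) and -t (stat group selection) flags are assumptions that should be confirmed against the help output of your build.
  # Show supported options (exact flags vary by ESXi build)
  net-stats -h
  # Monitor all ports, sampling every 10 seconds; -t selects additional stat
  # groups such as per-world CPU usage and queue details (assumed flag letters)
  net-stats -A -i 10 -t WicQv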
7. 7
vSphere 5.1 Networking Performance Summary
TCP Performance
• 10GbE Line Rate to/from 1vCPU VM to external host with TSO & LRO
• Up to 26.3 Gbps between 2x 1-vCPU Linux VMs on same host
• Able to scale or maintain throughput even at 8X PCPU overcommit (64 VMs
on an 8-core, HT-enabled machine)
UDP Performance
• 0.7+ Million PPS (MPPS) with a 1vCPU VM, rising to 2.5+ MPPS with more
VMs (over single 10GbE)
• Low loss rate at very high throughput might require tuning vNIC ring and
socket buffer sizes (see the sketch at the end of this slide)
Latency
• 35+ us for ping -i 0.001 in the 1-VM, 1-vCPU case over 10GbE
• Can increase to hundreds of microseconds under contention
Note: Microbenchmark performance is highly dependent on CPU
clock speed and size of last-level cache (LLC)
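A hedged sketch of the guest-side tuning mentioned above, assuming a Linux guest with a VMXNET3 vNIC named eth0; the ring and buffer sizes are illustrative examples, not recommendations.
  # Enlarge the vNIC receive ring (check supported sizes with "ethtool -g eth0")
  ethtool -G eth0 rx 4096
  # Allow larger UDP socket receive buffers so traffic bursts are absorbed
  sysctl -w net.core.rmem_max=2097152
  sysctl -w net.core.rmem_default=2097152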
8. 8
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
9. 9
What's New in vSphere 5.5: Performance
80 Gbps on a single host
Support for 40Gbps NICs
vmknic IPv6 Optimizations
Reduced CPU Cycles/Byte
• vSphere Native Drivers
• VXLAN Offloads
• Dynamic queue balancing
Experimental Packet Capture Framework
Latency-Sensitivity Feature
10. 10
80 Gbps[1] on a Single Host
Setup: 4x Intel E5-4650 (2.7 GHz; 32 cores, 64 threads), 8 PNICs over 8 vSwitches, 16 Linux VMs running Apache, IXIA XT80-V2 traffic generator
HTTP GET of a 1 MB file: 75+ Gbps, 4.1 million PPS
HTTP POST of a 1 MB file: 75+ Gbps, 7.3 million PPS[2]
[1] Why stop at 80 Gbps? vSphere allows a maximum of 8x 10GbE PNICs.
[2] Software LRO is less aggressive than TSO in aggregating packets.
11. 11
40 Gbps NIC Support
Inbox support for Mellanox
ConnectX-3 40Gbps over Ethernet
Max Throughput to 1 vCPU VM
• 14.1 Gbps Receive
• 36.2 Gbps Transmit
Max Throughput to Single VM
• 23.6 Gbps Receive with RSS enabled
in vNIC and PNIC.
Max Throughput to Single Host
• 37.3 Gbps Receive
Setup: RHEL 6.3 + VMXNET3, 2x Intel Xeon E5-2667 @ 2.90 GHz, Mellanox MT27500, netperf TCP_STREAM workload
12. 12
Vmknic IPv6 Enhancements
TCP Checksum offload for Transmit and Receive
Software Large Receive Offload (LRO) for TCP over IPv6
Zero-copy receives between vSwitch and TCP/IP stack
[Result: test dirtying 48 GB of RAM on an Intel Xeon E5-2667 (2.9 GHz) host with 4x 10 GbE links: 34.5 Gbps over IPv4 vs. 32.5 Gbps over IPv6]
13. 13
Reduced CPU Cycles/Byte
Change NetQueue allocation model for some PNICs from
throughput-based to CPU-usage based
• Fewer NetQueues used for low traffic workload
TSO, Checksum offload for VXLAN for some PNICs
Native vSphere drivers for Emulex PNIC
• Eliminate vmklinux layer from device drivers
• 10% - 35% lower CPU Cycles/byte in VMkernel
14. 14
Packet Capture Framework
New Experimental Packet Capture Framework in vSphere 5.5
• Designed for capture at moderate packet rates
• Capture packets at one or more layers of vSphere network stack
• --trace option timestamps packets as they pass through key points of the stack
Useful in identifying sources of packet drops and latency
• e.g., capture at both UplinkRcv and Vmxnet3Rx to check for packet drops by a firewall
• e.g., with --trace enabled, the difference between the timestamps at Vmxnet3Tx and UplinkSnd
shows whether NetIOC delayed a packet (a usage sketch follows the diagram below)
[Diagram: capture points along the stack; transmit path: Vmxnet3Tx, EtherswitchDispatch, EtherswitchOutput, UplinkSnd; receive path: UplinkRcv, EtherswitchDispatch, EtherswitchOutput, Vmxnet3Rx]
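The capture framework is exposed through the pktcap-uw utility on the ESXi shell. The invocations below are a sketch only: the exact option names, capture-point identifiers, and the port ID are assumptions, so confirm the syntax with pktcap-uw -h on your build.
  # Capture frames as they arrive on an uplink and write them to a pcap file
  pktcap-uw --uplink vmnic0 -o /tmp/uplinkrcv.pcap
  # Capture at the vmxnet3 receive backend of one switch port, with per-point
  # timestamps (--trace); the switch port ID shown is a placeholder
  pktcap-uw --switchport 50331657 --capture Vmxnet3Rx --trace -o /tmp/vm-rx.pcap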
15. 15
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
16. 16
Improve Receive Throughput to a Single VM
Single thread for receives can become bottleneck at high packet
rates (> 1 Million PPS or > 15Gbps)
Use VMXNET3 virtual device, Enable RSS inside Guest
Enable RSS in Physical NICs (only available on some PNICs)
Add ethernetX.pnicFeatures = “4” to VM’s configuration parameters (see the sketch at the end of this slide)
Side effects: Increased CPU Cycles/Byte
[Diagram: with a single receive thread, throughput to one VM tops out at 14.1 Gbps on a 40G PNIC; with RSS enabled in the vNIC and PNIC, receive processing is spread across several VMkernel threads and vCPUs and reaches 23.6 Gbps]
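A sketch of the two configuration steps above: the .vmx parameter comes from this slide, while the guest-side check assumes a Linux guest whose vmxnet3 driver exposes one interrupt vector per receive queue.
  # VM configuration parameter (Edit Settings > Options > Advanced > Configuration Parameters)
  ethernet0.pnicFeatures = "4"
  # Inside the Linux guest: confirm the vNIC now shows multiple RX queues/vectors
  grep eth0 /proc/interrupts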
17. 17
Improve Transmit Throughput with Multiple vNICs
Some applications use multiple vNICs for very high throughput
Common transmit thread for all vNICs can become bottleneck
Add ethernetX.ctxPerDev = “1” to VM’s configuration parameters (see the sketch at the end of this slide)
Side effects: Increased CPU Cycles/Byte
[Diagram: with one transmit thread shared by both vNICs the VM sustains 0.9 MPPS UDP transmit; with a transmit thread per vNIC (ctxPerDev) it reaches 1.41 MPPS]
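A minimal sketch of the setting named above, assuming a VM with two vNICs; add one line per vNIC to the VM's configuration parameters (edit the .vmx only while the VM is powered off).
  # One VMkernel transmit thread per vNIC instead of one per VM
  ethernet0.ctxPerDev = "1"
  ethernet1.ctxPerDev = "1"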
18. 18
Achieve Higher Consolidation Ratios
Switch vNIC coalescing to “static”:
ethernetX.coalescingScheme = “static” (see the sketch at the end of this slide)
• Reduce interrupts to VM and vmkcalls from VM for networking traffic
• Fewer interruptions => more efficient processing => more requests processed
at lower cost
Disable vNIC RSS in Guest for multi-vCPU VMs
• At low throughput and low vCPU utilization, RSS only adds overhead
Side-effects: Potentially higher latency for some requests and
some workloads
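A sketch of the vNIC coalescing change: ethernetX.coalescingScheme is taken from this slide, while ethernetX.coalescingParams (the packets-per-interrupt batch size) is an assumption that should be verified and tuned for the workload.
  # Static coalescing: deliver one interrupt per fixed-size batch of packets
  ethernet0.coalescingScheme = "static"
  ethernet0.coalescingParams = "64"   # assumed parameter name; example batch size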
20. 20
The Realm of Virtualization Grows
[Diagram: workloads already virtualized (web services, app services, e-mail, desktops, databases, Tier-1 apps) versus workloads still marked off-limits: soft real-time apps, HPC, high-frequency trading]
Highly Latency-Sensitive Applications
• Low latency/jitter requirement (10 us – 100 us)
• Normally considered to be “non-virtualizable”
21. 21
The Latency Sensitivity Feature in vSphere 5.5
Latency-Sensitivity Feature
• Minimize virtualization overhead
• Achieve near bare-metal performance
[Diagram: a VM with the latency-sensitivity feature running close to the physical hardware, compared with a VM on the full hypervisor stack]
22. 22
Ping Latency Test
[Chart: ping round-trip latency]
                    Native to Native   Default VM to Native   Latency-Sensitive VM to Native
Median Latency      18 us              35 us                  20 us
99.99% Latency      32 us              557 us                 46 us
Jitter Metric       roughly 10X better for the latency-sensitive VM than for the default VM
23. 23
Agenda
Network Performance in vSphere 5.5
Virtualizing Extremely Latency Sensitive Applications
• Sources of Latency and Jitter
• Latency Sensitivity Feature
• Performance
• Best Practices
Available Resources
Extreme Performance Series Sessions
26. 26
Ping Latency Test
[Chart: ping round-trip latency]
                    Native to Native   Default VM to Native
Median Latency      18 us              35 us
99.99% Latency      32 us              557 us
27. 27
Sources of the Latency/Jitter
CPU Contention
CPU Scheduling Overhead
Networking Stack Overhead
[Diagram: the vNIC, device emulation, network virtualization and PNIC layers that each packet traverses]
28. 28
System View from CPU Scheduler’s Perspective
[Diagram: PCPUs running VM vCPU worlds (vcpu-0, vcpu-1), VM helper worlds (MKS, I/O), user-level processes (hostd, ssh) and VMkernel system threads (memory manager, I/O)]
We have more than just vCPUs from VMs
CPU contention can occasionally occur on an under-committed
system
Some system threads run at higher priority
29. 29
Causes of Scheduling Related Execution Delay
Timeline of a context from wakeup to start running to finish running, divided into phases A-E:
A: ready time, waiting for other contexts to finish
B: scheduling overhead and world-switch overhead
C: actual execution time
D: overlap time (caused by interrupts, etc.)
E: HT, power management, and cache-related efficiency loss
31. 31
Reduce CPU Contention Using Exclusive Affinity
[Diagram: the vCPU is given exclusive affinity to a PCPU and runs through phases A-E without competing contexts; I/O and interrupt contexts are placed on the remaining PCPUs]
32. 32
Reduce CPU Contention Using Exclusive Affinity (II)
What about other contexts?
• Share cores without exclusive affinity
• May be contended and may cause jitter for the latency-sensitive VM
[Diagram: the remaining PCPUs are shared by the other VMs' vCPUs, the MKS and I/O helper worlds, hostd, ssh, the memory manager and VMkernel I/O threads]
34. 34
Side Effects of Exclusive Affinity
There’s no such thing as a free lunch
CPU cycles may be wasted
• The CPU will NOT be used by other contexts when the vCPU is idle
Exclusive affinity is only applied when:
• The VM’s latency-sensitivity is HIGH
• The VM has enough CPU allocation
35. 35
Latency/Jitter from Networking Stack
The virtual networking stack adds latency and jitter:
• Executes more code per packet (vNIC device emulation, network virtualization layer, PNIC driver)
• Adds context switches, which bring scheduling delays and variance from interrupt coalescing
• Large Receive Offload modifies TCP ACK behavior
Mitigations (a configuration sketch follows this slide):
• Disable vNIC coalescing
• Disable LRO for vNICs
• Use a pass-through device for networking traffic
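A hedged sketch of the first two mitigations for a VM that does not use the latency-sensitivity feature (which applies similar tunings automatically); the advanced-option and parameter names are assumptions to be checked against your build.
  # Per-VM: turn off vNIC interrupt coalescing (.vmx configuration parameter)
  ethernet0.coalescingScheme = "disabled"
  # Host-wide: disable software/hardware LRO for VMXNET3 vNICs
  esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0
  esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0
  # Or inside a Linux guest, per interface:
  ethtool -K eth0 lro off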
36. 36
Pass-Through Devices
SR-IOV or DirectPath I/O allow VM direct access to device
Bypass Virtual Networking Stack, reducing CPU cost and Latency
Pass-through NICs negate many benefits of virtualization
• Only some versions of DirectPath I/O allow sharing of devices and vMotion
• SR-IOV allows sharing of devices, but does not support vMotion (a host-side configuration sketch follows)
• No support for NetIOC, FT, Memory Overcommit, HA, …
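A sketch of enabling SR-IOV virtual functions on the host, assuming an Intel ixgbe-based PNIC; the module name and max_vfs parameter are driver-specific assumptions, and a host reboot is required before the VFs can be assigned to VMs.
  # Enable 8 virtual functions on each of the two ports of an ixgbe NIC
  esxcli system module parameters set -m ixgbe -p "max_vfs=8,8"
  # Verify the parameter took effect
  esxcli system module parameters list -m ixgbe | grep max_vfs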
37. 37
Agenda
Network Performance in vSphere 5.5
Virtualizing Extremely Latency Sensitive Applications
• Sources of Latency and Jitter
• Latency Sensitivity Feature
• Performance
• Best Practices
Available Resources
Extreme Performance Series Sessions
38. 38
Performance of Latency Sensitivity Feature
Single 2-vCPU VM to Native, RHEL 6.2, RTT from ‘ping -i 0.001’ (command sketch at the end of this slide)
Intel Xeon E5-2640 @ 2.50 GHz, Intel 82599EB PNIC
Median reduced by 15 us over Default, 6 us over SR-IOV
• 99.99th percentile lower than 50 us
Performance gap to native is between 2 us and 10 us
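A sketch of the measurement workload, assuming a Linux source host and a placeholder target address; per-packet RTTs are extracted from the output so the median and 99.99th-percentile values reported here can be computed offline.
  # 1000 requests per second, 100,000 samples (sub-0.2s intervals require root)
  ping -i 0.001 -c 100000 10.0.0.2 > rtt.log
  # Pull out the per-packet round-trip times, sorted, for percentile analysis
  grep -o 'time=[0-9.]*' rtt.log | cut -d= -f2 | sort -n > rtt_sorted.txt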
39. 39
Performance with Multiple VMs
4x 2-vCPU VMs on a 12-core host, same ping workload
4-VM performance very similar to that of 1-VM
Median reduced by 20 us over Default, 6 us over SR-IOV
99.99th percentile ~75 us: 400+ us better than Default, 50 us better than SR-IOV
40. 40
Extreme Performance with SolarFlare PNIC
1VM with Solarflare SFN6122F-R7 PNIC, Native with same PNIC
• Netperf TCP_RR workload
• OpenOnload® enabled for netperf and netserver
Median RTT of 6 us.
99.99th percentile <= 25us.
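A sketch of the request-response benchmark, assuming netperf/netserver are installed and the target address is a placeholder; OpenOnload acceleration is enabled separately on the Solarflare hosts.
  # On the target host: netserver
  # On the source: 60-second TCP request-response test (1-byte request/response by default)
  netperf -H 10.0.0.2 -t TCP_RR -l 60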
41. 41
Latency Sensitivity Feature Caveats
Designed for applications with latency requirements on the order of a few
tens of microseconds
Exclusive affinity reduces flexibility for CPU Scheduler
• Even if vCPU is idle, other threads cannot use PCPU
Last-Level CPU Cache sharing not addressed
• Performance can be impacted by a competing VM with big memory footprint
Storage and other resources (load balancer, firewalls, …) sharing
not addressed
Pass-through NICs not compatible with many
virtualization features
Automatic VMXNET3 tunings applied by the feature are not suited for high packet rates
• Leaving LRO and vNIC coalescing ON might be better for such workloads
42. 42
Latency Sensitivity Feature Best Practices
Hardware Power Management
• Disable C States
• Set BIOS policy to “High Performance”
Use SR-IOV if possible
Full CPU and memory reservations (see the .vmx sketch at the end of this slide)
Under-commit resources for best performance
Ensure that the demand-to-entitlement ratio of the VM stays below 100%
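A sketch of the per-VM settings behind these best practices, assuming a 2-vCPU VM on 2.5 GHz cores with 4 GB of RAM; the latency-sensitivity level can also be set in the vSphere Web Client, and the reservation values must match the VM's full CPU and memory allocation.
  sched.cpu.latencySensitivity = "high"
  sched.cpu.min = "5000"     # full CPU reservation in MHz (2 vCPUs x 2500 MHz)
  sched.mem.min = "4096"     # full memory reservation in MB
  sched.mem.pin = "TRUE"     # assumed equivalent of "reserve all guest memory"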
43. 43
Summary
Good out-of-box network performance with vSphere 5.5
• Supports 40 Gbps NICs, Capable of saturating 80 Gbps on single host
Supports extreme throughput to single VM
• Tunables to increase parallelism to nearly double receive/transmit throughput
Latency-sensitivity feature enables extremely latency sensitive
workloads
• Median latency <20us, 99.99% latency <50us possible with vSphere
• Only a few microseconds higher than native
Only extreme workloads require tuning
• Be aware of tradeoffs before tuning
44. 44
Latency Sensitive Feature Docs
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf
Best Practices for Performance Tuning of Latency-Sensitive
Workloads in vSphere VMs
http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf
Network IO Latency on vSphere 5
http://www.vmware.com/files/pdf/techpaper/network-io-latency-perf-vsphere5.pdf
45. 45
Network Performance Docs
VXLAN Performance on vSphere 5.1
http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-VXLAN-Perf.pdf
Performance and Use Cases of VMware DirectPath I/O for
Networking
http://blogs.vmware.com/performance/2010/12/performance-and-use-cases-of-vmware-directpath-io-for-networking.html
VMware Network I/O Control: Architecture, Performance and Best
Practices
http://www.vmware.com/files/pdf/techpaper/VMW_Netioc_BestPractices.pdf
Multicast Performance on vSphere 5.0
http://blogs.vmware.com/performance/2011/08/multicast-performance-on-vsphere-50.html
46. 46
Performance Community Resources
Performance Technology Pages
• http://www.vmware.com/technical-resources/performance/resources.html
Technical Marketing Blog
• http://blogs.vmware.com/vsphere/performance/
Performance Engineering Blog VROOM!
• http://blogs.vmware.com/performance
Performance Community Forum
• http://communities.vmware.com/community/vmtn/general/performance
Virtualizing Business Critical Applications
• http://www.vmware.com/solutions/business-critical-apps/
48. 48
Extreme Performance Series Sessions
Extreme Performance Series:
vCenter of the Universe – Session #VSVC5234
Monster Virtual Machines – Session #VSVC4811
Network Speed Ahead – Session #VSVC5596
Storage in a Flash – Session #VSVC5603
Silent Killer: How Latency Destroys Performance... And What to Do About It – Session #VSVC5187
Big Data:
Virtualized SAP HANA Performance, Scalability and Practices – Session #VAPP5591
Hands-on Labs:
HOL-SDC-1304 – Optimize vSphere Performance (includes vFRC)
49. 49
Other VMware Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
Group Discussions:
VSVC1001-GD
Performance with Mark Achtemichuk
53. 53
Networking Performance Goals
Tune vSphere for best out-of-box performance for wide range
of applications
• Extreme applications might require modified settings
Line-Rate throughput on latest devices
Near-Native throughput, latency, CPU Cycles/Byte
54. 54
Design Choices for Higher Performance
Asynchronous transmit and receive paths for most network
stack consumers
Ability to use multiple VMkernel I/O threads per VM or PNIC
• Sacrifice some efficiency to drive higher throughput
Interrupt coalescing at PNIC, vNIC to reduce CPU Cycles/Byte
• Coalescing introduces variance in packet processing cost and latency
Packet aggregation in software and hardware (TSO, LRO)
• Aggregation may hurt latency-sensitive workloads
Co-locate I/O threads and vCPUs on same Last-Level Cache to
improve efficiency