Extreme Performance Series: Network Speed Ahead (Session VSVC5596)
VMworld 2013
Lenin Singaravelu, VMware
Haoqiang Zheng, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
2. 2
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
3. 3
vSphere Networking Architecture: A Simplified View
[Diagram: PNIC driver at the bottom; above it the virtualization layer (vSwitch, vDS, VXLAN, NetIOC, teaming, DVFilter, ...) and the device emulation layer; on top, VMs with vNICs and vmknics (VMkernel TCP/IP) carrying NFS/iSCSI, vMotion, management and FT traffic; pass-through paths (SR-IOV VF, DirectPath I/O, DirectPath I/O with vMotion) bypass the virtual networking stack]
4. 4
Transmit Processing for a VM
One transmit thread per VM, executing all parts of the stack
• Transmit thread can also execute receive path for destination VM
Wakeup of transmit thread: Two mechanisms
• Immediate, forcible wakeup by VM (low delay, high CPU overhead)
• Opportunistic wakeup by other threads or when VM halts (potentially higher
delay, low CPU overhead)
[Diagram: per-VM transmit path; the device emulation layer converts vNIC ring entries to packets, the network virtualization layer performs switching, encapsulation and teaming, and packets are handed to the PNIC driver or converted back to ring entries for a destination VM on the same host; the transmit thread is woken either by a VMkernel call (VMKcall) or opportunistically]
5. 5
Receive Processing For a VM
One thread per device
NetQueue enabled devices: one thread per NetQueue
• Each NetQueue processes traffic for one or more MAC addresses (vNICs)
• NetQueue-to-vNIC mapping determined by unicast throughput and first-come, first-served (FCFS) order
vNICs can share queues
• Causes: low throughput, too many vNICs, or a queue type mismatch (LRO queue vs. non-LRO vNIC)
[Diagram: PNIC queues feeding the network virtualization layer and device emulation; one vNIC is served by a dedicated NetQueue while the remaining vNICs share a queue]
6. 6
Advanced Performance Monitoring using net-stats
net-stats: single-host network performance monitoring tool
since vSphere 5.1
Runs on the ESXi console; use net-stats -h for help and net-stats -A to
monitor all ports (example at the end of this slide)
• Measure packet rates and drops at various layers (vNIC backend, vSwitch,
PNIC) in a single place
• Identify VMkernel threads for transmit and receive processing
• Break down CPU cost for networking into interrupt, receive, vCPU and
transmit thread
• PNIC Stats: NetQueue allocation information, interrupt rate
• vNIC Stats: Coalescing and RSS information
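A minimal sketch of a net-stats invocation on the ESXi shell: only -h and -A come from this slide, while the -i (sampling interval) and -t (stat group selection) flags are assumptions that should be confirmed against the help output of your build.
  # Show supported options (exact flags vary by ESXi build)
  net-stats -h
  # Monitor all ports, sampling every 10 seconds; -t selects additional stat
  # groups such as per-world CPU usage and queue details (assumed flag letters)
  net-stats -A -i 10 -t WicQv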
7. 7
vSphere 5.1 Networking Performance Summary
TCP Performance
• 10GbE Line Rate to/from 1vCPU VM to external host with TSO & LRO
• Up to 26.3 Gbps between 2x 1-vCPU Linux VMs on same host
• Able to scale or maintain throughput even at 8X PCPU overcommit (64 VMs
on an 8-core, HT-enabled machine)
UDP Performance
• 0.7+ Million PPS (MPPS) with a 1vCPU VM, rising to 2.5+ MPPS with more
VMs (over single 10GbE)
• Low loss rate at very high throughput might require tuning vNIC ring and
socket buffer sizes (see the sketch at the end of this slide)
Latency
• 35+ us for ping -i 0.001 in the 1-VM, 1-vCPU case over 10GbE
• Can increase to hundreds of microseconds under contention
Note: Microbenchmark performance is highly dependent on CPU
clock speed and size of last-level cache (LLC)
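A hedged sketch of the guest-side tuning mentioned above, assuming a Linux guest with a VMXNET3 vNIC named eth0; the ring and buffer sizes are illustrative examples, not recommendations.
  # Enlarge the vNIC receive ring (check supported sizes with "ethtool -g eth0")
  ethtool -G eth0 rx 4096
  # Allow larger UDP socket receive buffers so traffic bursts are absorbed
  sysctl -w net.core.rmem_max=2097152
  sysctl -w net.core.rmem_default=2097152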
8. 8
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
9. 9
What's New in vSphere 5.5: Performance
80 Gbps on a single host
Support for 40Gbps NICs
vmknic IPv6 Optimizations
Reduced CPU Cycles/Byte
• vSphere Native Drivers
• VXLAN Offloads
• Dynamic queue balancing
Experimental Packet Capture Framework
Latency-Sensitivity Feature
10. 10
80 Gbps[1] on a Single Host
Setup: 4x Intel E5-4650 (2.7 GHz; 32 cores, 64 threads), 8 PNICs over 8 vSwitches, 16 Linux VMs running Apache, IXIA XT80-V2 traffic generator
HTTP GET of a 1 MB file: 75+ Gbps, 4.1 million PPS
HTTP POST of a 1 MB file: 75+ Gbps, 7.3 million PPS[2]
[1] Why stop at 80 Gbps? vSphere allows a maximum of 8x 10GbE PNICs.
[2] Software LRO is less aggressive than TSO in aggregating packets.
11. 11
40 Gbps NIC Support
Inbox support for Mellanox
ConnectX-3 40Gbps over Ethernet
Max Throughput to 1 vCPU VM
• 14.1 Gbps Receive
• 36.2 Gbps Transmit
Max Throughput to Single VM
• 23.6 Gbps Receive with RSS enabled
in vNIC and PNIC.
Max Throughput to Single Host
• 37.3 Gbps Receive
Setup: RHEL 6.3 + VMXNET3, 2x Intel Xeon E5-2667 @ 2.90 GHz, Mellanox MT27500, netperf TCP_STREAM workload
12. 12
Vmknic IPv6 Enhancements
TCP Checksum offload for Transmit and Receive
Software Large Receive Offload (LRO) for TCP over IPv6
Zero-copy receives between vSwitch and TCP/IP stack
[Result: test dirtying 48 GB of RAM on an Intel Xeon E5-2667 (2.9 GHz) host with 4x 10 GbE links: 34.5 Gbps over IPv4 vs. 32.5 Gbps over IPv6]
13. 13
Reduced CPU Cycles/Byte
Change NetQueue allocation model for some PNICs from
throughput-based to CPU-usage based
• Fewer NetQueues used for low traffic workload
TSO, Checksum offload for VXLAN for some PNICs
Native vSphere drivers for Emulex PNIC
• Eliminate vmklinux layer from device drivers
• 10% - 35% lower CPU Cycles/byte in VMkernel
14. 14
Packet Capture Framework
New Experimental Packet Capture Framework in vSphere 5.5
• Designed for capture at moderate packet rates
• Capture packets at one or more layers of vSphere network stack
• --trace option timestamps packets as they pass through key points of the stack
Useful in identifying sources of packet drops and latency
• e.g., capture at both UplinkRcv and Vmxnet3Rx to check for packet drops by a firewall
• e.g., with --trace enabled, the difference between the timestamps at Vmxnet3Tx and UplinkSnd
shows whether NetIOC delayed a packet (a usage sketch follows the diagram below)
[Diagram: capture points along the stack; transmit path: Vmxnet3Tx, EtherswitchDispatch, EtherswitchOutput, UplinkSnd; receive path: UplinkRcv, EtherswitchDispatch, EtherswitchOutput, Vmxnet3Rx]
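The capture framework is exposed through the pktcap-uw utility on the ESXi shell. The invocations below are a sketch only: the exact option names, capture-point identifiers, and the port ID are assumptions, so confirm the syntax with pktcap-uw -h on your build.
  # Capture frames as they arrive on an uplink and write them to a pcap file
  pktcap-uw --uplink vmnic0 -o /tmp/uplinkrcv.pcap
  # Capture at the vmxnet3 receive backend of one switch port, with per-point
  # timestamps (--trace); the switch port ID shown is a placeholder
  pktcap-uw --switchport 50331657 --capture Vmxnet3Rx --trace -o /tmp/vm-rx.pcap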
15. 15
Agenda
Networking Performance in vSphere 5.5
• Network Processing Deep Dive
• Performance Improvements in vSphere 5.5
• Tuning for Extreme Workloads
Virtualizing Extremely Latency Sensitive Applications
Available Resources
Extreme Performance Series Sessions
16. 16
Improve Receive Throughput to a Single VM
Single thread for receives can become bottleneck at high packet
rates (> 1 Million PPS or > 15Gbps)
Use VMXNET3 virtual device, Enable RSS inside Guest
Enable RSS in Physical NICs (only available on some PNICs)
Add ethernetX.pnicFeatures = “4” to VM’s configuration parameters (see the sketch at the end of this slide)
Side effects: Increased CPU Cycles/Byte
[Diagram: with a single receive thread, throughput to one VM tops out at 14.1 Gbps on a 40G PNIC; with RSS enabled in the vNIC and PNIC, receive processing is spread across several VMkernel threads and vCPUs and reaches 23.6 Gbps]
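A sketch of the two configuration steps above: the .vmx parameter comes from this slide, while the guest-side check assumes a Linux guest whose vmxnet3 driver exposes one interrupt vector per receive queue.
  # VM configuration parameter (Edit Settings > Options > Advanced > Configuration Parameters)
  ethernet0.pnicFeatures = "4"
  # Inside the Linux guest: confirm the vNIC now shows multiple RX queues/vectors
  grep eth0 /proc/interrupts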
17. 17
Improve Transmit Throughput with Multiple vNICs
Some applications use multiple vNICs for very high throughput
Common transmit thread for all vNICs can become bottleneck
Add ethernetX.ctxPerDev = “1” to VM’s configuration parameters (see the sketch at the end of this slide)
Side effects: Increased CPU Cycles/Byte
[Diagram: with one transmit thread shared by both vNICs the VM sustains 0.9 MPPS UDP transmit; with a transmit thread per vNIC (ctxPerDev) it reaches 1.41 MPPS]
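A minimal sketch of the setting named above, assuming a VM with two vNICs; add one line per vNIC to the VM's configuration parameters (edit the .vmx only while the VM is powered off).
  # One VMkernel transmit thread per vNIC instead of one per VM
  ethernet0.ctxPerDev = "1"
  ethernet1.ctxPerDev = "1"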
18. 18
Achieve Higher Consolidation Ratios
Switch vNIC coalescing to “static”:
ethernetX.coalescingScheme = “static” (see the sketch at the end of this slide)
• Reduce interrupts to VM and vmkcalls from VM for networking traffic
• Fewer interruptions => more efficient processing => more requests processed
at lower cost
Disable vNIC RSS in Guest for multi-vCPU VMs
• At low throughput and low vCPU utilization, RSS only adds overhead
Side-effects: Potentially higher latency for some requests and
some workloads
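A sketch of the vNIC coalescing change: ethernetX.coalescingScheme is taken from this slide, while ethernetX.coalescingParams (the packets-per-interrupt batch size) is an assumption that should be verified and tuned for the workload.
  # Static coalescing: deliver one interrupt per fixed-size batch of packets
  ethernet0.coalescingScheme = "static"
  ethernet0.coalescingParams = "64"   # assumed parameter name; example batch size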
20. 20
The Realm of Virtualization Grows
[Diagram: workloads already virtualized (web services, app services, e-mail, desktops, databases, Tier-1 apps) versus workloads still marked off-limits: soft real-time apps, HPC, high-frequency trading]
Highly Latency-Sensitive Applications
• Low latency/jitter requirement (10 us – 100 us)
• Normally considered to be “non-virtualizable”
21. 21
The Latency Sensitivity Feature in vSphere 5.5
Latency-Sensitivity Feature
• Minimize virtualization overhead
• Achieve near bare-metal performance
[Diagram: a VM with the latency-sensitivity feature running close to the physical hardware, compared with a VM on the full hypervisor stack]
22. 22
Ping Latency Test
[Chart: ping round-trip latency]
                    Native to Native   Default VM to Native   Latency-Sensitive VM to Native
Median Latency      18 us              35 us                  20 us
99.99% Latency      32 us              557 us                 46 us
Jitter Metric       roughly 10X better for the latency-sensitive VM than for the default VM
23. 23
Agenda
Network Performance in vSphere 5.5
Virtualizing Extremely Latency Sensitive Applications
• Sources of Latency and Jitter
• Latency Sensitivity Feature
• Performance
• Best Practices
Available Resources
Extreme Performance Series Sessions
26. 26
Ping Latency Test
[Chart: ping round-trip latency]
                    Native to Native   Default VM to Native
Median Latency      18 us              35 us
99.99% Latency      32 us              557 us
27. 27
Sources of the Latency/Jitter
CPU Contention
CPU Scheduling Overhead
Networking Stack Overhead
[Diagram: the vNIC, device emulation, network virtualization and PNIC layers that each packet traverses]
28. 28
System View from CPU Scheduler’s Perspective
[Diagram: PCPUs running VM vCPU worlds (vcpu-0, vcpu-1), VM helper worlds (MKS, I/O), user-level processes (hostd, ssh) and VMkernel system threads (memory manager, I/O)]
We have more than just vCPUs from VMs
CPU contention can occasionally occur on an under-committed
system
Some system threads run at higher priority
29. 29
Causes of Scheduling Related Execution Delay
Timeline of a context from wakeup to start running to finish running, divided into phases A-E:
A: ready time, waiting for other contexts to finish
B: scheduling overhead and world-switch overhead
C: actual execution time
D: overlap time (caused by interrupts, etc.)
E: HT, power management, and cache-related efficiency loss
31. 31
Reduce CPU Contention Using Exclusive Affinity
[Diagram: the vCPU is given exclusive affinity to a PCPU and runs through phases A-E without competing contexts; I/O and interrupt contexts are placed on the remaining PCPUs]
32. 32
Reduce CPU Contention Using Exclusive Affinity (II)
What about other contexts?
• Share cores without exclusive affinity
• May be contended and may cause jitter for the latency-sensitive VM
[Diagram: the remaining PCPUs are shared by the other VMs' vCPUs, the MKS and I/O helper worlds, hostd, ssh, the memory manager and VMkernel I/O threads]
34. 34
Side Effects of Exclusive Affinity
There’s no such thing as a free lunch
CPU cycles may be wasted
• The CPU will NOT be used by other contexts when the vCPU is idle
Exclusive affinity is only applied when:
• The VM’s latency-sensitivity is HIGH
• The VM has enough CPU allocation
35. 35
Latency/Jitter from Networking Stack
The virtual networking stack adds latency and jitter:
• Executes more code per packet (vNIC device emulation, network virtualization layer, PNIC driver)
• Adds context switches, which bring scheduling delays and variance from interrupt coalescing
• Large Receive Offload modifies TCP ACK behavior
Mitigations (a configuration sketch follows this slide):
• Disable vNIC coalescing
• Disable LRO for vNICs
• Use a pass-through device for networking traffic
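A hedged sketch of the first two mitigations for a VM that does not use the latency-sensitivity feature (which applies similar tunings automatically); the advanced-option and parameter names are assumptions to be checked against your build.
  # Per-VM: turn off vNIC interrupt coalescing (.vmx configuration parameter)
  ethernet0.coalescingScheme = "disabled"
  # Host-wide: disable software/hardware LRO for VMXNET3 vNICs
  esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0
  esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0
  # Or inside a Linux guest, per interface:
  ethtool -K eth0 lro off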
36. 36
Pass-Through Devices
SR-IOV or DirectPath I/O allow VM direct access to device
Bypass Virtual Networking Stack, reducing CPU cost and Latency
Pass-through NICs negate many benefits of virtualization
• Only some versions of DirectPath I/O allow sharing of devices and vMotion
• SR-IOV allows sharing of devices, but does not support vMotion (a host-side configuration sketch follows)
• No support for NetIOC, FT, Memory Overcommit, HA, …
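A sketch of enabling SR-IOV virtual functions on the host, assuming an Intel ixgbe-based PNIC; the module name and max_vfs parameter are driver-specific assumptions, and a host reboot is required before the VFs can be assigned to VMs.
  # Enable 8 virtual functions on each of the two ports of an ixgbe NIC
  esxcli system module parameters set -m ixgbe -p "max_vfs=8,8"
  # Verify the parameter took effect
  esxcli system module parameters list -m ixgbe | grep max_vfs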
37. 37
Agenda
Network Performance in vSphere 5.5
Virtualizing Extremely Latency Sensitive Applications
• Sources of Latency and Jitter
• Latency Sensitivity Feature
• Performance
• Best Practices
Available Resources
Extreme Performance Series Sessions
38. 38
Performance of Latency Sensitivity Feature
Single 2-vCPU VM to Native, RHEL 6.2, RTT from ‘ping -i 0.001’ (command sketch at the end of this slide)
Intel Xeon E5-2640 @ 2.50 GHz, Intel 82599EB PNIC
Median reduced by 15 us over Default, 6 us over SR-IOV
• 99.99th percentile lower than 50 us
Performance gap to native is between 2 us and 10 us
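A sketch of the measurement workload, assuming a Linux source host and a placeholder target address; per-packet RTTs are extracted from the output so the median and 99.99th-percentile values reported here can be computed offline.
  # 1000 requests per second, 100,000 samples (sub-0.2s intervals require root)
  ping -i 0.001 -c 100000 10.0.0.2 > rtt.log
  # Pull out the per-packet round-trip times, sorted, for percentile analysis
  grep -o 'time=[0-9.]*' rtt.log | cut -d= -f2 | sort -n > rtt_sorted.txt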
39. 39
Performance with Multiple VMs
4x 2-vCPU VMs on a 12-core host, same ping workload
4-VM performance very similar to that of 1-VM
Median reduced by 20 us over Default, 6 us over SR-IOV
99.99th percentile ~75 us: 400+ us better than Default, 50 us better than SR-IOV
40. 40
Extreme Performance with SolarFlare PNIC
1VM with Solarflare SFN6122F-R7 PNIC, Native with same PNIC
• Netperf TCP_RR workload
• OpenOnload® enabled for netperf and netserver
Median RTT of 6 us.
99.99th percentile <= 25us.
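A sketch of the request-response benchmark, assuming netperf/netserver are installed and the target address is a placeholder; OpenOnload acceleration is enabled separately on the Solarflare hosts.
  # On the target host: netserver
  # On the source: 60-second TCP request-response test (1-byte request/response by default)
  netperf -H 10.0.0.2 -t TCP_RR -l 60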
41. 41
Latency Sensitivity Feature Caveats
Designed for applications with latency requirements on the order of a few
tens of microseconds
Exclusive affinity reduces flexibility for CPU Scheduler
• Even if vCPU is idle, other threads cannot use PCPU
Last-Level CPU Cache sharing not addressed
• Performance can be impacted by a competing VM with big memory footprint
Storage and other resources (load balancer, firewalls, …) sharing
not addressed
Pass-through NICs not compatible with many
virtualization features
Automatic VMXNET3 tunings applied by the feature are not suited for high packet rates
• Leaving LRO and vNIC coalescing ON might be better for such workloads
42. 42
Latency Sensitivity Feature Best Practices
Hardware Power Management
• Disable C States
• Set BIOS policy to “High Performance”
Use SR-IOV if possible
Full CPU and memory reservations (see the .vmx sketch at the end of this slide)
Under-commit resources for best performance
Ensure that the demand-to-entitlement ratio of the VM stays below 100%
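A sketch of the per-VM settings behind these best practices, assuming a 2-vCPU VM on 2.5 GHz cores with 4 GB of RAM; the latency-sensitivity level can also be set in the vSphere Web Client, and the reservation values must match the VM's full CPU and memory allocation.
  sched.cpu.latencySensitivity = "high"
  sched.cpu.min = "5000"     # full CPU reservation in MHz (2 vCPUs x 2500 MHz)
  sched.mem.min = "4096"     # full memory reservation in MB
  sched.mem.pin = "TRUE"     # assumed equivalent of "reserve all guest memory"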
43. 43
Summary
Good out-of-box network performance with vSphere 5.5
• Supports 40 Gbps NICs, Capable of saturating 80 Gbps on single host
Supports extreme throughput to single VM
• Tunables to increase parallelism to nearly double receive/transmit throughput
Latency-sensitivity feature enables extremely latency sensitive
workloads
• Median latency <20us, 99.99% latency <50us possible with vSphere
• Only a few microseconds higher than native
Only extreme workloads require tuning
• Be aware of tradeoffs before tuning
44. 44
Latency Sensitive Feature Docs
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf
Best Practices for Performance Tuning of Latency-Sensitive
Workloads in vSphere VMs
http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf
Network IO Latency on vSphere 5
http://www.vmware.com/files/pdf/techpaper/network-io-latency-perf-vsphere5.pdf
45. 45
Network Performance Docs
VXLAN Performance on vSphere 5.1
http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-VXLAN-Perf.pdf
Performance and Use Cases of VMware DirectPath I/O for
Networking
http://blogs.vmware.com/performance/2010/12/performance-and-use-cases-of-vmware-directpath-io-for-networking.html
VMware Network I/O Control: Architecture, Performance and Best
Practices
http://www.vmware.com/files/pdf/techpaper/VMW_Netioc_BestPractices.pdf
Multicast Performance on vSphere 5.0
http://blogs.vmware.com/performance/2011/08/multicast-performance-on-vsphere-50.html
46. 46
Performance Community Resources
Performance Technology Pages
• http://www.vmware.com/technical-resources/performance/resources.html
Technical Marketing Blog
• http://blogs.vmware.com/vsphere/performance/
Performance Engineering Blog VROOM!
• http://blogs.vmware.com/performance
Performance Community Forum
• http://communities.vmware.com/community/vmtn/general/performance
Virtualizing Business Critical Applications
• http://www.vmware.com/solutions/business-critical-apps/
48. 48
Extreme Performance Series Sessions
Extreme Performance Series:
vCenter of the Universe – Session #VSVC5234
Monster Virtual Machines – Session #VSVC4811
Network Speed Ahead – Session #VSVC5596
Storage in a Flash – Session #VSVC5603
Silent Killer: How Latency Destroys Performance... And What to Do About It – Session #VSVC5187
Big Data:
Virtualized SAP HANA Performance, Scalability and Practices – Session #VAPP5591
Hands-on Labs:
HOL-SDC-1304 – Optimize vSphere Performance (includes vFRC)
49. 49
Other VMware Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
Group Discussions:
VSVC1001-GD
Performance with Mark Achtemichuk
53. 53
Networking Performance Goals
Tune vSphere for best out-of-box performance for wide range
of applications
• Extreme applications might require modified settings
Line-Rate throughput on latest devices
Near-Native throughput, latency, CPU Cycles/Byte
54. 54
Design Choices for Higher Performance
Asynchronous transmit and receive paths for most network
stack consumers
Ability to use multiple VMkernel I/O threads per VM or PNIC
• Sacrifice some efficiency to drive higher throughput
Interrupt coalescing at PNIC, vNIC to reduce CPU Cycles/Byte
• Coalescing introduces variance in packet processing cost and latency
Packet aggregation in software and hardware (TSO, LRO)
• Aggregation may hurt latency-sensitive workloads
Co-locate I/O threads and vCPUs on same Last-Level Cache to
improve efficiency