VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to Do About It


VMworld 2013

Bhavesh Davda, VMware
Josh Simons, VMware

Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare

  1. 1. Silent Killer: How Latency Destroys Performance...And What to Do About It Bhavesh Davda, VMware Josh Simons, VMware VSVC5187 #VSVC5187
  2. 2. Agenda  Introduction • Definitions • Effects • Sources  Mitigation • BIOS settings • CPU scheduling and over-commitment • Memory over-commitment and MMU virtualization • NUMA and vNUMA • Guest OS • Storage • Networking
  3. 3. What is Latency?  Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. (Wikipedia)  Examples in computing environments: • Signal propagation within a microprocessor • Memory access from cache, from local memory, from non-local memory • PCI I/O data transfers • Data access within rotating media • Operating system scheduling • Network communication, local and wide area • Application logic  Typically reported as average latency
  4. 4. Latency Numbers Every Programmer (and IT Person) Should Know: https://gist.github.com/hellerbarde/2843375
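
  For reference, the ladder of approximate figures from the gist above can be printed with a few lines of Python; these are the commonly quoted rough numbers from that list, not measurements from this session:

      # Approximate "latency numbers every programmer should know" (rough,
      # commonly quoted figures from the gist referenced on the slide above).
      LATENCIES_NS = [
          ("L1 cache reference",                        0.5),
          ("Branch mispredict",                         5),
          ("L2 cache reference",                        7),
          ("Main memory reference",                     100),
          ("Send 1 KB over a 1 Gbps network",           10_000),
          ("Read 4 KB randomly from SSD",               150_000),
          ("Round trip within the same data center",    500_000),
          ("Disk seek",                                 10_000_000),
          ("Packet round trip CA -> Netherlands -> CA", 150_000_000),
      ]

      for name, ns in LATENCIES_NS:
          print(f"{name:<45} {ns:>15,.1f} ns  ({ns / 1e6:>10.3f} ms)")
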
  5. 5. A Latency Number Every Human Should Know
  6. 6. What is Jitter?  Jitter is variation in latency that causes non-deterministic performance in seemingly deterministic workloads  Examples in computing environments: • Unpredictable response times in financial trading applications • Stalling, stuttering audio and video in telecommunication applications • Reduced performance of distributed parallel computing applications • Measurable variations in run times for long-running jobs  “Insanity: doing the same thing over and over again and expecting different results.” (Albert Einstein)
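
  As a minimal sketch of how jitter can be quantified (the latency samples below are made up for illustration), the variation around the mean of repeated measurements of the "same" operation is what makes a deterministic-looking workload behave non-deterministically:

      import statistics

      # Hypothetical per-request latencies (microseconds) for identical requests.
      samples_us = [105, 98, 102, 310, 101, 99, 104, 97, 250, 100]

      mean = statistics.mean(samples_us)
      jitter_stdev = statistics.pstdev(samples_us)   # spread around the mean
      spread = max(samples_us) - min(samples_us)     # worst-case swing

      print(f"mean latency  : {mean:.1f} us")
      print(f"jitter (stdev): {jitter_stdev:.1f} us")
      print(f"min-max spread: {spread} us")
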
  7. 7. Agenda  Introduction • Definitions • Effects • Sources  Mitigation • BIOS settings • CPU scheduling and over-commitment • Memory over-commitment and MMU virtualization • NUMA and vNUMA • Guest OS • Storage • Networking
  8. 8. Effects of Latency and Jitter on VoIP Audio Quality  Audio samples: Original, 5% drop, 20% drop (http://www.voiptroubleshooter.com/sound_files/)  [Figures: de-jitter buffering of numbered voice packets, playout latency, and packet drops; ITU-T G.114 latency recommendation vs. Mean Opinion Score (MOS) bands from 2.6-3.1 up to 4.3-5.0, higher is better]
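
  The de-jitter buffering trade-off shown above can be illustrated with a toy simulation (all delay values below are made up, not taken from the slide): a larger playout buffer absorbs more network jitter and so drops fewer frames, but every frame is played out later, pushing the mouth-to-ear delay toward the ITU-T G.114 limits:

      import random

      FRAME_MS = 20      # one voice frame sent every 20 ms
      N_FRAMES = 500

      def simulate(playout_delay_ms, jitter_ms, base_delay_ms=40, seed=1):
          """Return (drop %, mouth-to-ear delay) for a fixed de-jitter buffer."""
          rng = random.Random(seed)
          drops = 0
          for i in range(N_FRAMES):
              send_time = i * FRAME_MS
              arrival = send_time + base_delay_ms + rng.uniform(0, jitter_ms)
              deadline = send_time + base_delay_ms + playout_delay_ms  # scheduled playout
              if arrival > deadline:
                  drops += 1        # frame arrived too late and is discarded
          return 100.0 * drops / N_FRAMES, base_delay_ms + playout_delay_ms

      for playout in (5, 20, 60):
          drop_pct, delay = simulate(playout, jitter_ms=50)
          print(f"playout buffer {playout:3d} ms -> {drop_pct:5.1f}% drops, {delay} ms mouth-to-ear delay")
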
  9. 9. The Case of the Missing Supercomputer Performance  The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, Petrini, F., Kerbyson, D., Pakin, S., Proceedings of the 2003 ACM/IEEE Conference on Supercomputing  Peer-to-peer parallel (MPI) application performance degrades as scale increases – up to 2X worse than predicted by the model  No obvious explanations, initially  Noise – extraneous daemons, kernel timers, etc. – indicted as the problem  Jittered arrival times at application synchronization points resulted in significant overall slowdowns  [Charts: measured vs. modeled performance, lower is better]
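
  The paper's scaling effect can be mimicked with a toy bulk-synchronous model (the rank counts, step time, and noise parameters below are illustrative, not taken from the paper): every step finishes only when its slowest rank reaches the barrier, so a rare noise event on any one node stalls all of them, and such a stall becomes near-certain at every step as the rank count grows:

      import random

      def bsp_runtime(n_ranks, n_steps, work_ms=1.0, noise_prob=0.001, noise_ms=5.0, seed=0):
          """Total runtime when each step waits for the slowest of n_ranks processes."""
          rng = random.Random(seed)
          total = 0.0
          for _ in range(n_steps):
              slowest = work_ms
              for _ in range(n_ranks):
                  t = work_ms
                  if rng.random() < noise_prob:   # rare OS-noise event (daemon, timer, ...)
                      t += noise_ms
                  slowest = max(slowest, t)
              total += slowest                    # barrier: wait for the last rank
          return total

      ideal = 1000 * 1.0                          # 1000 steps of 1 ms with no noise
      for ranks in (16, 256, 4096):
          print(f"{ranks:5d} ranks: {bsp_runtime(ranks, 1000) / ideal:.2f}x the noise-free runtime")
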
  10. 10. Latency Affects Throughput, Packet Rate, and IOPs, Too
      Assume a 100 bit/sec channel bandwidth (1 bit every 0.01 sec)
      XMIT Time (sec) = Latency + Packet Size * 0.01
      Throughput (bits/sec) = Packet Size / XMIT Time

      Latency 0 sec:
      Packet Size (bits)   Throughput (bits/sec)   Packet Rate (packets/sec)
      1                    100                     100
      10                   100                     10
      100                  100                     1
  11. 11. Latency Affects Throughput, Packet Rate, and IOPs, Too
      Assume a 100 bit/sec channel bandwidth (1 bit every 0.01 sec)
      XMIT Time (sec) = Latency + Packet Size * 0.01
      Throughput (bits/sec) = Packet Size / XMIT Time

      Packet Size (bits)   Throughput (bits/sec)            Packet Rate (packets/sec)
                           Latency 0 s    Latency 0.01 s    Latency 0 s    Latency 0.01 s
      1                    100            50                100            50
      10                   100            91                10             9
      100                  100            99                1              1
  12. 12. Latency Affects Throughput, Packet Rate, and IOPs, Too
      Assume a 100 bit/sec channel bandwidth (1 bit every 0.01 sec)
      XMIT Time (sec) = Latency + Packet Size * 0.01
      Throughput (bits/sec) = Packet Size / XMIT Time

      Packet Size (bits)   Throughput (bits/sec)               Packet Rate (packets/sec)
                           Latency 0 s   0.01 s   0.04 s       Latency 0 s   0.01 s   0.04 s
      1                    100           50       20           100           50       20
      10                   100           91       71           10            9        7
      100                  100           99       96           1             1        1
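
  The table above follows directly from the two formulas stated on the slide; a short Python loop reproduces it (values printed rounded to whole numbers, as on the slide):

      BIT_TIME = 0.01    # 100 bit/sec channel: one bit every 0.01 sec

      print(f"{'size (bits)':>12} {'latency (s)':>12} {'throughput (bits/s)':>20} {'packets/s':>10}")
      for packet_bits in (1, 10, 100):
          for latency_s in (0.0, 0.01, 0.04):
              xmit_time = latency_s + packet_bits * BIT_TIME   # XMIT Time (sec)
              throughput = packet_bits / xmit_time             # bits/sec actually delivered
              rate = 1 / xmit_time                             # packets (or IOs) per second
              print(f"{packet_bits:>12} {latency_s:>12} {throughput:>20.0f} {rate:>10.0f}")
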
  13. 13. Agenda  Introduction • Definitions • Effects • Sources  Mitigation • BIOS settings • CPU scheduling and over-commitment • Memory over-commitment and MMU virtualization • NUMA and vNUMA • Guest OS • Storage • Networking
  14. 14. Network Latency in Bare Metal Environments  Message copy from application to OS (kernel)  OS (network stack) + NIC driver queues packet for NIC  NIC DMAs packet and transmits on the wire  [Diagram: server with CPUs, RAM, interconnect, NIC, and disk, connected to a network switch]
  15. 15. Network Latency in Virtual Environments  Message copy from application to GOS (kernel)  GOS (network stack) + vNIC driver queues packet for vNIC  VM exit to VMM/Hypervisor  vNIC implementation emulates DMA from VM, sends to vSwitch  vSwitch queues packet for pNIC  pNIC DMAs packet and transmits on the wire  [Diagram: VMs, virtual switch, NIC, management agents, and background tasks on an ESXi hypervisor server, connected to a network switch]
  16. 16. Network Storage: Small I/O Case Study  Rendering applications • 1.4X – 3X slowdown seen initially  Customer NFS stress test • 10K files • 1K random reads/file • 1-32K bytes • 7X slowdown  Single change • Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival • See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled  Final application performance • 1 – 5% slower than native  [Diagram: application in the Guest OS on ESXi, accessing an NFS server]
  17. 17. Data Center Networks – the Trend to Fabrics  [Diagram: traditional north/south traffic between servers and the WAN/Internet vs. east/west traffic across a data center fabric]
  18. 18. Agenda  Introduction • Definitions • Effects • Sources  Mitigation • BIOS settings • CPU scheduling and over-commitment • Memory over-commitment and MMU virtualization • NUMA and vNUMA • Guest OS • Storage • Networking
  19. 19. General Guidelines about Tuning for Latency  vSphere ESXi is designed for high performance and fairness • Maximizes overall performance of all VMs without unfairly penalizing any VM • Defaults are carefully tuned for high throughput  Tunable settings should be thoroughly vetted in a test environment before deployment  Tuning should be applied individually to study the effects on performance • Maintain good change control practices  Certain tunables for lowest latency can negatively affect throughput and efficiency, so consider tradeoffs • Consider isolating latency-sensitive VMs on dedicated hosts • DRS host groups can be used to manage groups of hosts supporting latency-sensitive VMs
  20. 20. Optimizing for Latency-sensitive Workloads (1 of 3)  Power Management • Set at both BIOS and hypervisor levels (Max performance / Static High) • Hyperthreading may cause jitter due to pipeline sharing • Intel Turbo Boost may cause runtime jitter  CPU and memory over-commitment • Transparent page sharing may cause jitter due to non-deterministic share-breaking on writes (to disable: sched.mem.pshare.enable = FALSE) • Memory compression (to disable: Mem.MemZipEnable = 0) • Better to avoid over-subscription of resources  Memory virtualization • Hardware memory virtualization can sometimes be slower than software approaches (for shadow page tables, i.e., the software approach: monitor.virtual_mmu = software)
  21. 21. Memory Virtualization
      HPL (GFLOP/s)          Native     Virtual, EPT on    Virtual, EPT off
      4K guest pages         37.04      36.04 (97.3%)      36.22 (97.8%)
      2MB guest pages        37.74      38.24 (100.1%)     38.42 (100.2%)

      RandomAccess (GUP/s)   Native     Virtual, EPT on    Virtual, EPT off
      4K guest pages         0.01842    0.0156 (84.8%)     0.0181 (98.3%)
      2MB guest pages        0.03956    0.0380 (96.2%)     0.0390 (98.6%)

      EPT = Intel Extended Page Tables = hardware page table virtualization (AMD equivalent: RVI)
  22. 22. NUMA and vNUMA  [Diagram: application spanning NUMA sockets, each with local memory, running natively and on the hypervisor]  Making virtual NUMA nodes visible within the Guest OS allows ESXi to respect GOS process placement and memory allocation decisions, which can lead to significant performance increases
  23. 23. Optimizing for Latency-sensitive Workloads (2 of 3)  NUMA • ESXi optimally allocates CPU and memory • NUMA node affinity can be set manually (numa.nodeAffinity = X) • Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8 and can be forced otherwise (numa.vcpu.min = N, where N < #vCPUs) • NUMA scheduler does not include HT by default. Can be overridden to prevent VM split across NUMA nodes (numa.vcpu.preferHT = “1”)
  24. 24. vNUMA Performance Study: SpecOMP (Lower is Better)  Performance Evaluation of HPC Benchmarks on VMware’s ESX Server, Ali Q., Kiriansky, V., Simons J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011
  25. 25. Optimizing for Latency-sensitive Workloads (2 of 3)  NUMA • ESXi optimally allocates CPU and memory • NUMA node affinity can be set manually (numa.nodeAffinity = X) • Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8 and can be forced otherwise (numa.vcpu.min = N, where N < #vCPUs) • NUMA scheduler does not include HT by default. Can be overridden to prevent VM split across NUMA nodes (numa.vcpu.preferHT = “1”)  VM scheduling optimizations • e.g., suppress descheduling (monitor_control.halt_desched = FALSE)  Guest OS choice • Later distributions are usually better (tickless kernel, etc.) • RHEL 6+, SLES 11+, etc. (2.6.32+ kernel) • Windows Server 2008+
  26. 26. Optimizing for Latency-sensitive Workloads (3 of 3)  Storage • Storage stack already tuned for small block transfers • iSCSI and NAS (host and guest) affected by network tuning parameters • Local Flash memory’s much lower latency exposes overheads in the software stack that we are working to address  Networking • Interrupt coalescing should be disabled on both the vNIC (ethernetX.coalescingScheme = “disabled”) and the pNIC (driver parameter set via esxcli module parameters) • Jumbo frames may interfere with low-latency traffic • Disable Large Receive Offload (LRO) for TCP (including NAS) • Polling for I/O completion rather than using interrupts (DPDK, RDMA poll mode) • Passthrough / direct assignment for lowest I/O latencies
  27. 27. Kernel Bypass Model  [Diagram: native and virtualized I/O stacks (application, sockets, TCP/IP, driver in the kernel/vmkernel, hardware) with RDMA bypassing the kernel directly from user space]
  28. 28. InfiniBand Bandwidth with Passthrough / Direct Assignment  [Chart: bandwidth (MB/s) vs. message size, 2 bytes to 8M, for Send: Native, Send: ESXi, RDMA Read: Native, RDMA Read: ESXi]  RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5, April 2011 http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
  29. 29. Latency with Passthrough / Direct Assignment (Send/Rcv, Polling)
      [Chart: half round-trip latency (µs) vs. message size, 2 bytes to 8M, for Native and ESXi ExpA]
      MsgSize (bytes)   Native (µs)   ESXi ExpA (µs)
      2                 1.35          1.75
      4                 1.35          1.75
      8                 1.38          1.78
      16                1.37          2.05
      32                1.38          2.35
      64                1.39          2.9
      128               1.5           4.13
      256               2.3           2.31
  30. 30. New Features Planned for Upcoming vSphere ESXi Releases  New virtual machine property: “Latency sensitivity” • High => lowest latency  Exclusively assign physical CPUs to virtual CPUs of “Latency Sensitivity = High” VMs • Physical CPUs not used for scheduling other VMs or ESXi tasks  Idle in Virtual Machine monitor (VMM) when Guest OS is idle • Lowers latency to wake up the idle Guest OS, compared to idling in ESXi vmkernel  Disable vNIC interrupt coalescing  For DirectPath I/O, optimize interrupt delivery path for lowest latency  Make ESXi vmkernel more preemptible • Reduces jitter due to long-running kernel code
  31. 31. Summary  Virtualization does add some latency over bare metal  vSphere is generally tuned for throughput and fairness • Tunables exist at the host, VM, and guest level to improve latency • This will become more automatic in subsequent releases  ESXi is a good hypervisor for virtualizing an increasingly broad array of applications, including latency-sensitive applications such as Telco, Financial, and some HPC workloads  When observing application performance degradation in the future, we hope you will think about the “silent killer” and try some of the techniques we’ve described here
  32. 32. Resources  Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs http://www.vmware.com/resources/techresources/10220  Network I/O Latency in vSphere 5 http://www.vmware.com/resources/techresources/10256  Deploying Extremely Latency-Sensitive Applications in vSphere 5.5 http://www.vmware.com/files/pdf/techpaper/deploying-latency-sensitive-apps-vSphere5.pdf  RDMA Performance in Virtual Machines Using QDR InfiniBand on VMware vSphere 5 http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
  33. 33. Other VMworld Activities Related to This Session  HOL: HOL-SDC-1304 vSphere Performance Optimization  Session: VSVC5596 Extreme Performance Series: Network Speed Ahead
  34. 34. THANK YOU
  35. 35. Silent Killer: How Latency Destroys Performance...And What to Do About It Bhavesh Davda, VMware Josh Simons, VMware VSVC5187 #VSVC5187
