Advanced performance troubleshooting using esxtop


Published in: Education, Technology

  1. Advanced performance troubleshooting using esxtop/resxtop. Krishna Raj Raja, Staff Engineer, Performance Group. © 2010 VMware Inc. All rights reserved
  2. Disclaimer: This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. “THESE FEATURES ARE REPRESENTATIVE OF FEATURE AREAS UNDER DEVELOPMENT. FEATURE COMMITMENTS ARE SUBJECT TO CHANGE, AND MUST NOT BE INCLUDED IN CONTRACTS, PURCHASE ORDERS, OR SALES AGREEMENTS OF ANY KIND. TECHNICAL FEASIBILITY AND MARKET DEMAND WILL AFFECT FINAL DELIVERY.”
  3. esxtop resources
     • esxtop manual
     • Community documents: ESX 4.0, ESX 4.1
     • esxtop for advanced users: VMworld 2008, VMworld 2009
  4. Ten things that you need to know about esxtop
  5. esxtop counters: 1. esxtop does not create performance metrics
     • esxtop derives performance metrics from raw counters exported in the VMkernel System Info nodes (VSI nodes)
     • esxtop can show new counters on an older ESX system if the raw counters are present in the VMkernel
  6. esxtop counters: 2. Counter values
     • Many raw counters have static values that do not change with time; esxtop displays them as is
     • Many counters increment monotonically; for these, esxtop reports the delta over the given refresh interval (for instance CMDS/sec, packets transmitted/sec)
     • %USED and %RUN: CPU occupancy delta between successive snapshots
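The delta reporting described on this slide can be sketched as follows. This is a minimal illustration, not esxtop's actual code: the `Snapshot` type, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of a raw counter (hypothetical structure)."""
    timestamp_s: float    # when the snapshot was taken, in seconds
    commands_total: int   # monotonically increasing raw counter


def rate_per_second(prev: Snapshot, curr: Snapshot) -> float:
    """Turn two raw snapshots into a per-second rate for the interval."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return (curr.commands_total - prev.commands_total) / elapsed


# Two snapshots taken 5 seconds apart (made-up values).
prev = Snapshot(timestamp_s=0.0, commands_total=10_000)
curr = Snapshot(timestamp_s=5.0, commands_total=12_500)
print(rate_per_second(prev, curr))  # 500.0
```

A static counter, by contrast, would simply be displayed from the latest snapshot with no subtraction.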
  7. Refresh interval: 3. Graphs will look different depending on the refresh interval
     • Many counter values depend on the refresh interval
     • A larger refresh interval smooths spikes and troughs
     [Figures: 2-second refresh interval; 10-second refresh interval]
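To see why a larger refresh interval flattens spikes, the sketch below averages a made-up series of 2-second samples into 10-second windows. The sample values and the helper name `window_averages` are assumptions for illustration only.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical utilization readings, one per 2-second refresh.
two_second = [10, 10, 90, 10, 10, 10, 10, 10, 10, 10]

# Five 2-second samples make up one 10-second refresh.
ten_second = window_averages(two_second, 5)

print(max(two_second))  # 90
print(ten_second)       # [26.0, 10.0]
```

The 90% spike that is obvious at a 2-second interval shows up only as a 26% average at a 10-second interval.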
  8. esxtop counters: 4. Counter normalization
     • By default counters are shown for the group
     • In group view, counter values are cumulative
     • In expanded view, counters are normalized per entity
     [Screenshot annotations: cumulative stats; pressing the ‘e’ key expands a group; vcpu world consumes CPU]
  9. esxtop counters: 5. %USED can exceed 100
     • Turbo boost can increase the processor clock speed
     • Asynchronous work can be happening on a different core on behalf of the VM
     [Screenshot: VM on an NFS datastore running an I/O-intensive workload]
  10. esxtop batch mode: 6. Batch mode (-b)
      • Produces a Windows perfmon-compatible CSV file
      • CSV file compatibility requires a fixed number of columns on every row, so statistics for VM/world instances that appear after batch mode starts are not collected
      • Only counters that are specified in the configuration file are collected; the (-a) option collects all counters
      • Counters are named slightly differently
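A batch-mode capture can also be post-processed with a few lines of script. The sketch below is a minimal example that assumes the usual perfmon CSV layout (a timestamp in the first column, then one column per counter instance). The host name, VM name, counter names, and values in `SAMPLE` are made up, and real esxtop output may name its columns differently.

```python
import csv
import io

# Hypothetical two-sample excerpt in perfmon CSV layout.
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\host1\Group Cpu(1234:vm-a)\% Used","\\host1\Group Cpu(1234:vm-a)\% Ready"
"11/10/2010 13:17:27","42.10","3.20"
"11/10/2010 13:17:29","57.90","4.80"
'''


def column_means(csv_text):
    """Return the mean of every counter column, keyed by its header name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    # The format relies on every row having the same number of columns.
    if {len(row) for row in rows} != {len(header)}:
        raise ValueError("rows do not all match the header width")
    means = {}
    for idx, name in enumerate(header[1:], start=1):  # skip the timestamp
        values = [float(row[idx]) for row in rows]
        means[name] = sum(values) / len(values)
    return means


for name, mean in column_means(SAMPLE).items():
    print(f"{name}: {mean:.2f}")
```

The width check mirrors the fixed-column constraint from the slide: a world that appeared mid-capture would need a new column, which the format cannot accommodate.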
  11. esxtop batch mode: importing data into perfmon
  12. esxtop batch mode: viewing data in perfmon
  13. esxtop batch mode: trimming data [Screenshots: trimming data; saving data after trim]
  14. esxplot
  15. I/O latencies: 7. I/O latencies
      • I/O latencies are measured per SCSI command, so they are not affected by the refresh interval
      • Reported latencies are average values for all the SCSI commands issued within the refresh interval window
      • Reported average latencies can differ between screens (adapter, LUN, VM), since each screen accounts for a different group of I/Os
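The sketch below illustrates why the same commands can yield different averages on different screens. The command list, adapter and LUN names, and latency values are hypothetical; the only point is that each screen averages over a different grouping.

```python
from collections import defaultdict

# (adapter, lun, latency_ms) for each SCSI command completed in one
# refresh window; all values are made up for illustration.
commands = [
    ("vmhba1", "lun0", 2.0),
    ("vmhba1", "lun0", 4.0),
    ("vmhba1", "lun1", 30.0),
]


def average_latency(commands, key_index):
    """Average per-command latency, grouped by adapter (0) or LUN (1)."""
    groups = defaultdict(list)
    for cmd in commands:
        groups[cmd[key_index]].append(cmd[2])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


print(average_latency(commands, 0))  # {'vmhba1': 12.0}
print(average_latency(commands, 1))  # {'lun0': 3.0, 'lun1': 30.0}
```

The adapter view reports 12 ms while the LUN view shows 3 ms and 30 ms, even though all three figures describe the same three commands.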
  16. resxtop: remote esxtop
      8. You can use resxtop to connect to different ESX hosts
         • A newer version of resxtop will connect to older ESX hosts
      9. You don’t need root access to view esxtop counters
         • resxtop can authenticate using vCenter credentials
  17. esxtop CPU usage: 10. esxtop can consume a non-trivial amount of CPU
      • This happens when you have a very large inventory (VMs, LUNs, virtual disks, virtual NICs, etc.)
      • You can limit the amount of data collected by limiting the fields (columns) and entities (rows); you can also reduce CPU consumption by locking entities with the (-l) option
      [Screenshots: CPU consumption on a host with 512 VMs; CPU usage when using resxtop; CPU consumption with esxtop -l]
  18. Performance Troubleshooting Using esxtop
  19. esxtop screens
      • c: cpu (default)
      • m: memory
      • n: network
      • d: disk adapter
      • u: disk device (added in ESX 3.5)
      • v: disk VM (added in ESX 3.5)
      • i: interrupts (new in ESX 4.0)
      • p: power management (new in ESX 4.1)
      [Diagram: VMs running on the VMkernel; CPU Scheduler (c, i, p), Memory Scheduler (m), Virtual Switch (n), vSCSI (d, u, v)]
  20. Troubleshooting CPU Problems
  21. CPU-constrained SMP VM [Screenshot annotations: high CPU utilization; both the virtual CPUs are CPU constrained]
  22. CPU contention [Screenshot annotations: 4 CPUs, all at 100%; 3 SMP VMs; VMs don’t get to run all the time; %ready accumulates]
  23. CPU limit [Screenshot annotations: Max Limited; CPU limit; AMAX = -1: unlimited]
  24. Misconfigured SMP VM [Screenshot annotations: vCPU 1 not used by the VM; incorrect (UP) kernel/HAL inside the guest, or the application inside the guest is single threaded]
  25. Power management: CPU frequency scaling
      • C states: C0 - busy, C1 - halted, C2 - deep halt
      • P states: P0 - highest clock frequency, P11 - lowest clock frequency
  26. VM power usage: experimental feature, not enabled by default. VMkernel advanced setting: Power.ChargeVMs
  27. CPU clock frequency scaling [Screenshot annotation: VM is running all the time but uses only 75% of the clock frequency]
      • %USED: CPU usage with reference to the base clock frequency
      • %UTIL: CPU utilization with reference to the current clock frequency
      • %RUN: CPU scheduled time
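The relationship between scheduled time and %USED on this slide can be approximated with a one-line scaling. This is a deliberately simplified sketch: it ignores system time, overlap, and hyperthreading adjustments that the real accounting includes, and the clock frequencies are made-up values.

```python
def pct_used(pct_run, current_mhz, base_mhz):
    """Simplified model: scale scheduled time by current/base clock ratio."""
    return pct_run * current_mhz / base_mhz


# Hypothetical CPU with a 2660 MHz base clock.
# Scaled down to 75% of base: the VM runs all the time but %USED is 75.
print(pct_used(100.0, 1995.0, 2660.0))  # 75.0

# Turbo boost above the base clock pushes %USED past 100.
print(pct_used(100.0, 2926.0, 2660.0))  # 110.0
```

Under the same model, a %USED above 100 is not an error; it reflects a core running faster than the base frequency that %USED is measured against.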
  28. Hyperthreading [Screenshot annotations: two VMs running on different cores; two VMs sharing the same core]. The %LAT_C counter shows the time descheduled due to core sharing
  29. Timer interrupt rate: Linux guests
  30. Timer interrupt rate: Windows guests, multimedia timer
  31. New metrics in CPU screen
      • %LAT_C: % of time the VM was not scheduled due to a CPU resource issue
      • %LAT_M: % of time the VM was not scheduled due to a memory resource issue
      • %DMD: moving average of CPU utilization over the last minute
      • EMIN: minimum CPU resources in MHz that the VM is guaranteed to get when there is CPU contention
  32. Troubleshooting Memory Problems
  33. esxtop memory screen (m). Possible states: high, soft, hard and low
      • PMEM: total physical memory
      • VMKMEM: memory managed by the VMkernel
      • COSMEM: memory used by the Service Console
  34. Not able to power on a new VM [Screenshot annotations: memory reservation; 820 MB reservation; 4G memory reservation requested; overhead memory needs to be reserved]
  35. Granted memory: Granted memory = memory touched by the guest
      • Windows and FreeBSD guests touch (zero) all their memory during boot
      • Linux guests touch memory when they first use it
  36. Ballooning versus swapping [Screenshot annotations: VM with balloon driver swaps less; swapped in the past but not actively swapping now; swap target is more for the VM without the balloon driver; memory hog VMs; MCTL: N - balloon driver not active, tools probably not installed]
  37. Memory compression stats
      • COWH: Copy-on-write page hints; amount of memory in MB that is potentially shareable
      • CACHESZ: compression cache size
      • CACHEUSD: compression cache currently used
      • ZIP/s, UNZIP/s: memory compression/decompression rate
  38. Wide NUMA - CPU [Screenshot annotations: 2 NUMA nodes with ~6G each; NUMA home node not assigned; 4G, can fit into a single node; 6-vCPU VM cannot fit into a NUMA node of 4 CPUs]
  39. NUMA affinity not set [Screenshot annotations: NUMA machine with 2 nodes; all the memory in the remote node; CPU affinity set to the wrong NUMA node]
      • NHN: NUMA home node
      • NLMEM: memory in the local node
      • NRMEM: memory in the remote node
  40. Wide NUMA - Memory [Screenshot annotations: 2 NUMA nodes with ~6G each; NUMA home node not assigned; VM cannot fit into a single NUMA node]
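The two wide-NUMA cases (slides 38 and 40) come down to a simple fit check: a VM can be given a home node only if both its vCPUs and its memory fit within one node. The function below is a rough sketch of that check under this assumption; the ESX NUMA scheduler's real placement logic is more involved, and the function and parameter names are hypothetical.

```python
def fits_single_numa_node(vcpus, vm_mem_gb, cores_per_node, mem_per_node_gb):
    """Rough check: do the VM's vCPUs and memory both fit in one node?"""
    return vcpus <= cores_per_node and vm_mem_gb <= mem_per_node_gb


# Slide 38: 6 vCPUs, 4G of memory, nodes of 4 CPUs with ~6G each.
# Memory fits, but the vCPU count does not.
print(fits_single_numa_node(6, 4, cores_per_node=4, mem_per_node_gb=6))  # False

# A 4-vCPU VM with the same memory would fit.
print(fits_single_numa_node(4, 4, cores_per_node=4, mem_per_node_gb=6))  # True

# Slide 40 style: memory larger than one node (8G is a made-up size).
print(fits_single_numa_node(2, 8, cores_per_node=4, mem_per_node_gb=6))  # False
```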
  41. Troubleshooting Network Problems
  42. vSwitch active uplink. TEAM-PNIC: the uplink that the virtual switch port is currently using
  43. Dropped packets at vSwitch. Packet drops usually happen when the traffic has no flow control (UDP/multicast/broadcast packets)
  44. Multicast/broadcast stats
      • PKTTXMUL/s: multicast packets transmitted per second
      • PKTRXMUL/s: multicast packets received per second
      • PKTTXBRD/s: broadcast packets transmitted per second
      • PKTRXBRD/s: broadcast packets received per second
  45. NFS stats. DAVG and KAVG are not available for network-backed storage; GAVG gives the end-to-end latency
  46. Troubleshooting Disk Problems
  47. Disk I/O latency [Screenshot annotations: host bus adapters (HBAs), which include SCSI, iSCSI, RAID, and FC-HBA adapters; latency stats from the device, the kernel and the guest]
      • DAVG/cmd: average latency (ms) from the device (LUN)
      • KAVG/cmd: average latency (ms) in the VMkernel
      • GAVG/cmd: average latency (ms) in the guest
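These three counters are related: the latency seen by the guest is approximately the device latency plus the time spent in the VMkernel (GAVG ≈ DAVG + KAVG). The sketch below uses that relation to show how much of the guest-observed latency is attributable to the kernel. The function name and sample values are illustrative assumptions.

```python
def latency_breakdown(davg_ms, kavg_ms):
    """Return (approximate GAVG in ms, share of it spent in the VMkernel)."""
    gavg_ms = davg_ms + kavg_ms
    kernel_share = kavg_ms / gavg_ms if gavg_ms else 0.0
    return gavg_ms, kernel_share


# Made-up readings: 8 ms at the device, 2 ms in the kernel.
gavg, share = latency_breakdown(davg_ms=8.0, kavg_ms=2.0)
print(gavg, share)  # 10.0 0.2
```

A large device share points at the storage array or fabric, while a large kernel share points at queuing inside the host.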
  48. Problem with the disk subsystem [Screenshot annotations: good throughput, low device latency; bad throughput, device latency is high - cache disabled]
  49. Insufficient queue depth [Screenshot annotations: non-zero KAVG; queuing at the HBA]
  50. FC bottleneck [Screenshots: ‘v’ - VM view; ‘u’ - device view; ‘d’ - adapter view]
  51. vStorage API for Array Integration (VAAI) stats
      • CLONE_RD, CLONE_WR: number of clone read/write requests
      • CLONE_F: number of failed clone operations
      • MBC_RD/s, MBC_WR/s: clone read/write MBs/sec
      • ATS: number of ATS commands
      • ATSF: number of failed ATS commands
      • ZERO: number of zero requests
      • ZEROF: number of failed zero requests
      • MBZERO/s: megabytes zeroed per second
  52. vStorage API for Array Integration (VAAI): virtual disk creation example
  53. SCSI reservation conflicts
  54. Other diagnostic tools
  55. Other diagnostic tools (1 of 2)
      • sched-stats and schedtrace
        - vm-support -s/-S flag captures sched-stats
        - vm-support -c flag captures the scheduler trace; takes a lot of disk space
      • memstats
        - Provides detailed memory usage stats with the resource pool hierarchy
      • ft-stats
        - FT virtual machine stats
        - Collected with the vm-support -s/-S flag
  56. Other diagnostic tools (2 of 2)
      • swatchStats
        - Stopwatch stats for VMFS, SCSI events
      • vscsiStats
        - Virtual machine SCSI disk I/O stats
        - Provides histogram information for latency, I/O size, inter-arrival time and outstanding I/Os
  57. vscsiStats
      # vscsiStats -l
      [Screenshot annotations: world group leader id; virtual machine name; virtual SCSI disk handle ids - unique across virtual machines]
  58. vscsiStats: latency histogram
      # vscsiStats -p latency -w 118739 -i 8205
      [Screenshot annotations: latency in microseconds; I/O distribution count]
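A latency histogram of this kind counts how many I/Os fall at or below each latency bound. The sketch below shows the bucketing idea only: the bucket bounds and sample latencies are illustrative assumptions, and the buckets that vscsiStats actually prints may differ.

```python
import bisect

# Illustrative upper bounds in microseconds (not taken from vscsiStats).
BOUNDS_US = [100, 500, 1000, 5000, 15000, 30000, 100000]


def latency_histogram(latencies_us):
    """Count I/Os per bucket; the last bucket holds values above all bounds."""
    counts = [0] * (len(BOUNDS_US) + 1)
    for latency in latencies_us:
        # bisect_left gives the first bound that is >= latency.
        counts[bisect.bisect_left(BOUNDS_US, latency)] += 1
    return counts


# Made-up per-I/O latencies in microseconds.
print(latency_histogram([80, 100, 450, 1200, 250000]))
# [2, 1, 0, 1, 0, 0, 0, 1]
```

Unlike an average, the distribution keeps the slow outlier (250000 µs here) visible in its own bucket.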
  59. vscsiStats: iolength histogram
      # vscsiStats -p iolength -w 118739 -i 8205
      [Screenshot annotations: I/O block size; distribution count]