Advancedperformancetroubleshootingusingesxtop 101110131727-phpapp02
Published in: Education, Technology

Transcript

  • 1. Advanced performance troubleshooting using esxtop/resxtop. Krishna Raj Raja, Staff Engineer, Performance Group. © 2010 VMware Inc. All rights reserved
  • 2. Disclaimer: This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. “THESE FEATURES ARE REPRESENTATIVE OF FEATURE AREAS UNDER DEVELOPMENT. FEATURE COMMITMENTS ARE SUBJECT TO CHANGE, AND MUST NOT BE INCLUDED IN CONTRACTS, PURCHASE ORDERS, OR SALES AGREEMENTS OF ANY KIND. TECHNICAL FEASIBILITY AND MARKET DEMAND WILL AFFECT FINAL.”
  • 3. esxtop resources
      • esxtop manual: http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf
      • VMware Community documents: http://communities.vmware.com/docs/DOC-9279 (ESX 4.0), http://communities.vmware.com/docs/DOC-11812 (ESX 4.1)
      • esxtop for advanced users: VMworld 2008 (http://vmworld.com/docs/DOC-2356), VMworld 2009 (http://vmworld.com/docs/DOC-3838)
  • 4. Ten things that you need to know about esxtop
  • 5. esxtop counters: (1) esxtop does not create performance metrics
      • esxtop derives performance metrics from raw counters exported in the VMkernel System Info nodes (VSI nodes)
      • esxtop can show new counters on an older ESX system if the raw counters are present in the VMkernel
  • 6. esxtop counters: (2) Counter values
      • Many raw counters have static values that do not change with time; esxtop displays them as they are
      • Many counters increment monotonically; esxtop reports the delta for these over the given refresh interval, for instance CMDS/sec and packets transmitted/sec
      • %USED and %RUN: CPU occupancy delta between successive snapshots
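The delta reporting described on this slide can be illustrated with a small sketch. It is not esxtop's actual implementation; the function name and the snapshot values are hypothetical, and it only shows how a per-second rate falls out of two readings of a monotonically increasing counter.

```python
def per_second_rate(prev_count, curr_count, interval_s):
    """Rate of a monotonically increasing counter between two snapshots."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (curr_count - prev_count) / interval_s


# Hypothetical raw snapshots of a command counter, taken 5 seconds apart.
snapshots = [1000, 1600, 2100]
rates = [per_second_rate(a, b, 5) for a, b in zip(snapshots, snapshots[1:])]
print(rates)
```

Static counters need no such treatment; they would simply be displayed as read.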
  • 7. Refresh interval: (3) Graphs will look different depending on the refresh interval
      • Many counter values depend on the refresh interval
      • A larger refresh interval smooths out spikes and troughs
      Figures: 2 second refresh interval; 10 second refresh interval
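A rough way to see why a longer refresh interval flattens spikes: averaging five consecutive 2-second samples approximates what a 10-second interval would report. The sample values below are made up, and the helper is only an illustration of the averaging effect, not of how esxtop samples.

```python
def average_over_windows(samples, window):
    """Average consecutive groups of `window` samples (a trailing partial group is dropped)."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical %USED values sampled every 2 seconds, with one short spike.
two_second = [10, 10, 90, 10, 10, 10, 10, 10, 10, 10]
ten_second = average_over_windows(two_second, 5)

print(max(two_second))  # peak seen with the short interval
print(max(ten_second))  # the same spike, diluted by the longer interval
```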
  • 8. esxtop counters: (4) Counter normalization
      • By default, counters are shown for the group
      • In group view, counter values are cumulative
      • In expanded view, counters are normalized per entity
      Figure callouts: cumulative stats; pressing the ‘e’ key expands a group; vcpu world consumes CPU
  • 9. esxtop counters: (5) %USED can exceed 100
      • Turbo boost can increase the processor clock speed
      • Asynchronous work can be happening on a different core on behalf of the VM
      Figure: VM on an NFS datastore running an I/O intensive workload
  • 10. esxtop batch mode: (6) Batch mode (-b)
      • Produces a Windows perfmon compatible CSV file
      • CSV compatibility requires a fixed number of columns on every row, so statistics for VMs/world instances that appear after batch mode starts are not collected
      • Only counters that are specified in the configuration file are collected; the -a option collects all counters
      • Counters are named slightly differently
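Because batch mode output is plain CSV with one fixed set of columns, it can also be post-processed with a short script instead of perfmon. The sketch below assumes a perfmon-style header; the sample text, host name, VM names and exact column wording are illustrative assumptions, not captured esxtop output.

```python
import csv
import io

# Illustrative sample only: real batch files have many more columns,
# and the header wording here is an assumption.
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\host1\Group Cpu(1:vm-a)\% Used","\\host1\Group Cpu(2:vm-b)\% Used"
"11/10/2010 13:17:27","12.5","80.1"
"11/10/2010 13:17:29","14.0","79.3"
'''


def column_series(csv_text, name_fragment):
    """Return {column_name: [float values]} for columns whose header contains name_fragment."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name_fragment in name]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


used = column_series(SAMPLE, "% Used")
for name, values in used.items():
    print(name, max(values))
```

Selecting columns by a header substring sidesteps the slightly different counter names that batch mode uses.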
  • 11. esxtop batch mode – importing data into perfmon
  • 12. esxtop batch mode – viewing data in perfmon
  • 13. esxtop batch mode – trimming data. Figures: trimming data; saving data after trim
  • 14. esxplot: http://labs.vmware.com/flings/esxplot
  • 15. I/O latencies: (7) IO latencies
      • IO latencies are measured per SCSI command, so they are not affected by the refresh interval
      • Reported latencies are average values for all the SCSI commands issued within the refresh interval window
      • Reported average latencies can differ between screens (adapter, LUN, VM), since each screen accounts for a different group of I/Os
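The last point is easy to check with a toy data set: the same commands, averaged over different groupings, give different numbers. The command records below are hypothetical and the helper is only a sketch of the averaging, not of esxtop internals.

```python
def average_latency(commands):
    """Mean latency in ms over an iterable of (vm, lun, latency_ms) records."""
    latencies = [latency for _, _, latency in commands]
    return sum(latencies) / len(latencies) if latencies else 0.0


# Hypothetical SCSI commands completed within one refresh interval.
commands = [
    ("vm-a", "lun-1", 2.0),
    ("vm-a", "lun-1", 4.0),
    ("vm-b", "lun-1", 30.0),
    ("vm-b", "lun-2", 10.0),
]

per_lun_1 = average_latency(c for c in commands if c[1] == "lun-1")
per_vm_a = average_latency(c for c in commands if c[0] == "vm-a")
print(per_lun_1, per_vm_a)
```

Here vm-a sees a low average even though the LUN it uses reports a much higher one, because the LUN view also includes vm-b's slow command.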
  • 16. resxtop – remote esxtop
      (8) You can use resxtop to connect to different ESX hosts
      • A newer version of resxtop will connect to older ESX hosts
      (9) You don’t need root access to view esxtop counters
      • resxtop can authenticate using vCenter credentials
  • 17. esxtop CPU usage: (10) esxtop can consume a non-trivial amount of CPU
      • This happens when you have a very large inventory (VMs, LUNs, virtual disks, virtual NICs etc)
      • You can limit the amount of data collected by limiting the fields (columns) and entities (rows); you can also reduce CPU consumption by locking entities with the -l option
      Figures: CPU consumption on a host with 512 VMs; CPU usage when using resxtop; CPU consumption with esxtop -l
  • 18. Performance Troubleshooting Using esxtop
  • 19. esxtop screens
      • c: cpu (default)
      • m: memory
      • n: network
      • d: disk adapter
      • u: disk device (added in ESX 3.5)
      • v: disk VM (added in ESX 3.5)
      • i: interrupts (new in ESX 4.0)
      • p: power management (new in ESX 4.1)
      Diagram: the VMs run on top of the VMkernel; the CPU scheduler is covered by screens c, i, p; the memory scheduler by m; the virtual switch by n; vSCSI by d, u, v
  • 20. Troubleshooting CPU Problems
  • 21. CPU constrained SMP VM. Figure callouts: high CPU utilization; both the virtual CPUs are CPU constrained
  • 22. CPU contention. Figure callouts: 4 CPUs, all at 100%; 3 SMP VMs; VMs don’t get to run all the time; %ready accumulates
  • 23. CPU limit. Figure callouts: Max Limited; CPU Limit; AMAX = -1: unlimited
  • 24. Mis-configured SMP VM. Figure callouts: vCPU 1 not used by the VM; incorrect (UP) kernel/HAL inside the guest, or the application inside the guest is single threaded
  • 25. Power management – CPU frequency scaling
      • C states: C0 – busy, C1 – halted, C2 – deep halt
      • P states: P0 – highest clock frequency, P11 – lowest clock frequency
  • 26. VM power usage: experimental feature, not enabled by default. VMkernel advanced setting: Power.ChargeVMs
  • 27. CPU clock frequency scaling
      • %USED: CPU usage with reference to the base clock frequency
      • %UTIL: CPU utilization with reference to the current clock frequency
      • %RUN: CPU scheduled time
      Figure callout: the VM is running all the time but uses only 75% of the clock frequency
  • 28. Hyperthreading. Figure callouts: two VMs running on different cores; two VMs sharing the same core; the %LAT_C counter shows the time descheduled due to core sharing
  • 29. Timer interrupt rate: Linux guests
  • 30. Timer interrupt rate: Windows guests – multimedia timer
  • 31. New metrics in the CPU screen
      • %LAT_C: % of time the VM was not scheduled due to a CPU resource issue
      • %LAT_M: % of time the VM was not scheduled due to a memory resource issue
      • %DMD: moving average of CPU utilization over the last one minute
      • EMIN: minimum CPU resources in MHz that the VM is guaranteed to get when there is CPU contention
  • 32. Troubleshooting Memory Problems
  • 33. esxtop memory screen (m)
      • Possible states: high, soft, hard and low
      • PMEM: total physical memory
      • VMKMEM: memory managed by the VMkernel
      • COSMEM: memory used by the Service Console
  • 34. Not able to power on a new VM. Figure callouts: memory reservation 820 MB; 4G memory reservation requested; overhead memory needs to be reserved
  • 35. Granted memory
      • Granted memory = memory touched by the guest
      • Windows and FreeBSD guests touch (zero) all their memory during boot
      • Linux guests touch memory when they first use it
  • 36. Ballooning versus swapping. Figure callouts: VM with balloon driver swaps less; swapped in the past but not actively swapping now; swap target is more for the VM without the balloon driver; memory hog VMs; MCTL: N – balloon driver not active, tools probably not installed
  • 37. Memory compression stats
      • COWH: copy-on-write page hints – amount of memory in MB that is potentially shareable
      • CACHESZ: compression cache size
      • CACHEUSD: compression cache currently used
      • ZIP/s, UNZIP/s: memory compression/decompression rate
  • 38. Wide NUMA – CPU. Figure callouts: 2 NUMA nodes with ~6G each; NUMA home node not assigned; 4G, can fit into a single node; 6-vcpu VM – cannot fit into a NUMA node size of 4 CPUs
  • 39. NUMA affinity not set
      • NHN: NUMA home node
      • NLMEM: memory in the local node
      • NRMEM: memory in the remote node
      Figure callouts: NUMA machine with 2 nodes; all the memory in the remote node; CPU affinity set to the wrong NUMA node
  • 40. Wide NUMA – memory. Figure callouts: 2 NUMA nodes with ~6G each; NUMA home node not assigned; the VM cannot fit into a single NUMA node
  • 41. Troubleshooting Network Problems
  • 42. vSwitch active uplink. TEAM-PNIC: the uplink that the virtual switch port is currently using
  • 43. Dropped packets at the vSwitch. Packet drops usually happen when the traffic has no flow control (UDP/multicast/broadcast packets)
  • 44. Multicast/broadcast stats
      • PKTTXMUL/s: multicast packets transmitted per second
      • PKTRXMUL/s: multicast packets received per second
      • PKTTXBRD/s: broadcast packets transmitted per second
      • PKTRXBRD/s: broadcast packets received per second
  • 45. NFS stats
      • DAVG and KAVG are not available for network backed storage
      • GAVG gives the end-to-end latency
  • 46. Troubleshooting Disk Problems
  • 47. Disk I/O latency
      • DAVG/cmd: average latency (ms) from the device (LUN)
      • KAVG/cmd: average latency (ms) in the VMkernel
      • GAVG/cmd: average latency (ms) in the guest
      Figure callouts: host bus adapters (HBAs) include SCSI, iSCSI, RAID, and FC-HBA adapters; latency stats from the device, the kernel and the guest
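As a rough aid for reading these three counters together: guest latency is approximately device latency plus kernel latency (GAVG ≈ DAVG + KAVG), so the kernel's share of the total indicates where the time goes. The slide does not spell this relationship out; the sketch below assumes it, and the function name and input values are hypothetical.

```python
def latency_breakdown(davg_ms, kavg_ms):
    """Return (approximate GAVG, fraction of it spent in the VMkernel).

    Assumes GAVG is roughly DAVG + KAVG, which is an approximation.
    """
    gavg_ms = davg_ms + kavg_ms
    kernel_share = kavg_ms / gavg_ms if gavg_ms else 0.0
    return gavg_ms, kernel_share


# Hypothetical readings: 8 ms at the device, 2 ms in the kernel.
gavg, share = latency_breakdown(8.0, 2.0)
print(gavg, share)
```

A large kernel share points at queuing inside the host rather than at the array, which is the case the queue depth slide illustrates.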
  • 48. Problem with the disk subsystem. Figure callouts: good throughput, low device latency; bad throughput, device latency is high – cache disabled
  • 49. Insufficient queue depth. Figure callouts: non-zero KAVG; queuing at the HBA
  • 50. FC bottleneck. Figures: ‘v’ – VM view; ‘u’ – device view; ‘d’ – adapter view
  • 51. vStorage API for Array Integration (VAAI) stats
      • CLONE_RD, CLONE_WR: number of clone read/write requests
      • CLONE_F: number of failed clone operations
      • MBC_RD/s, MBC_WR/s: clone read/write MBs/sec
      • ATS: number of ATS commands
      • ATSF: number of failed ATS commands
      • ZERO: number of zero requests
      • ZEROF: number of failed zero requests
      • MBZERO/s: megabytes zeroed per second
  • 52. VAAI – virtual disk creation example
  • 53. SCSI reservation conflicts
  • 54. Other diagnostic tools
  • 55. Other diagnostic tools (1 of 2)
      • sched-stats and schedtrace: the vm-support -s/-S flag captures sched-stats; the vm-support -c flag captures the scheduler trace, which takes a lot of disk space
      • memstats: provides detailed memory usage stats with the resource pool hierarchy
      • ft-stats: FT virtual machine stats, collected with the vm-support -s/-S flag
  • 56. Other diagnostic tools (2 of 2)
      • swatchStats: stopwatch stats for VMFS and SCSI events
      • vscsiStats: virtual machine SCSI disk I/O stats; provides histogram information for latency, IO size, inter-arrival time and outstanding I/Os
  • 57. vscsiStats. Command: # vscsiStats -l. Output callouts: world group leader id; virtual machine name; virtual SCSI disk handle ids – unique across virtual machines
  • 58. vscsiStats – latency histogram. Command: # vscsiStats -p latency -w 118739 -i 8205. Output callouts: latency in microseconds; I/O distribution count
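To make the histogram output easier to interpret, here is a minimal sketch of how latencies end up as per-bucket counts. The bucket limits and latency samples are hypothetical and are not taken from vscsiStats itself; the sketch only shows the bucketing idea.

```python
from bisect import bisect_left


def histogram(values, upper_bounds):
    """Count values into buckets (value <= upper bound) plus one overflow bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


bounds_us = [100, 500, 1000, 5000]             # hypothetical bucket limits in microseconds
latencies_us = [80, 100, 450, 700, 700, 6000]  # hypothetical per-command latencies
counts = histogram(latencies_us, bounds_us)

for bound, count in zip(bounds_us + ["more"], counts):
    print(bound, count)
```

A histogram keeps the shape of the distribution, so a few very slow commands remain visible instead of disappearing into an average.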
  • 59. vscsiStats – iolength histogram. Command: # vscsiStats -p iolength -w 118739 -i 8205. Output callouts: I/O block size; distribution count