
Advanced performance troubleshooting using esxtop

  1. Advanced performance troubleshooting using esxtop/resxtop. Krishna Raj Raja, Staff Engineer, Performance Group. © 2010 VMware Inc. All rights reserved.
  2. Disclaimer: This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. “THESE FEATURES ARE REPRESENTATIVE OF FEATURE AREAS UNDER DEVELOPMENT. FEATURE COMMITMENTS ARE SUBJECT TO CHANGE, AND MUST NOT BE INCLUDED IN CONTRACTS, PURCHASE ORDERS, OR SALES AGREEMENTS OF ANY KIND. TECHNICAL FEASIBILITY AND MARKET DEMAND WILL AFFECT FINAL.”
  3. esxtop resources
     • esxtop manual: http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf
     • VMware Community documents:
       http://communities.vmware.com/docs/DOC-9279 (ESX 4.0)
       http://communities.vmware.com/docs/DOC-11812 (ESX 4.1)
     • esxtop for advanced users:
       VMworld 2008: http://vmworld.com/docs/DOC-2356
       VMworld 2009: http://vmworld.com/docs/DOC-3838
  4. Ten things that you need to know about esxtop
  5. esxtop counters
     1. esxtop does not create performance metrics
        • esxtop derives performance metrics from raw counters exported in the VMkernel System Info nodes (VSI nodes).
        • esxtop can show new counters on an older ESX system if the raw counters are present in the VMkernel.
  6. esxtop counters
     2. Counter values
        • Many raw counters have static values that do not change with time; esxtop displays them as they are.
        • Many counters increment monotonically; esxtop reports the delta for these over the given refresh interval, for instance CMDS/sec and packets transmitted/sec.
        • %USED and %RUN: CPU occupancy delta between successive snapshots.
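
The delta-based reporting described on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not esxtop's implementation: the function names, the snapshot values, and the assumption that busy time is tracked in microseconds are all hypothetical.

```python
def per_second_rate(prev_count, curr_count, interval_seconds):
    """Turn two snapshots of a monotonically increasing counter into a rate."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (curr_count - prev_count) / interval_seconds


def percent_occupancy(prev_busy_us, curr_busy_us, interval_seconds):
    """Express the growth of a busy-time counter (assumed to be in
    microseconds) as a percentage of the elapsed interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    busy_us = curr_busy_us - prev_busy_us
    return 100.0 * busy_us / (interval_seconds * 1_000_000)


# Hypothetical snapshots taken 5 seconds apart.
print(per_second_rate(12_000, 14_500, 5))    # 500.0 commands per second
print(percent_occupancy(0, 2_500_000, 5))    # 50.0 percent
```

A static counter, by contrast, would simply be displayed as read, with no subtraction.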
  7. Refresh interval
     3. Graphs will look different depending on the refresh interval
        • Many counter values depend on the refresh interval.
        • A larger refresh interval smooths out spikes and troughs.
     Figures: 2-second refresh interval; 10-second refresh interval.
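
To see why a longer refresh interval flattens a graph, the following sketch averages a made-up 2-second series into 10-second samples. The data and the `resample` helper are hypothetical and only mimic the effect of interval averaging.

```python
def resample(samples, factor):
    """Average consecutive groups of `factor` samples, dropping any
    incomplete group at the end."""
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]


# Hypothetical counter values sampled every 2 seconds, with one spike.
two_second = [10, 10, 90, 10, 10, 10, 10, 10, 10, 10]

# Five 2-second samples make up one 10-second sample.
ten_second = resample(two_second, 5)

print(max(two_second))   # 90
print(ten_second)        # [26.0, 10.0]
```

The spike of 90 is still present in the 10-second view, but it shows up as a modest bump of 26.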
  8. esxtop counters
     4. Counter normalization
        • By default, counters are shown for the group.
        • In group view, counter values are cumulative.
        • In expanded view, counters are normalized per entity.
     Callouts: cumulative stats; vcpu world consumes CPU; pressing the ‘e’ key expands a group.
  9. esxtop counters
     5. %USED can exceed 100
        • Turbo boost can increase the processor clock speed.
        • Asynchronous work can be happening on a different core on behalf of the VM.
     Example shown: a VM on an NFS datastore running an I/O-intensive workload.
  10. esxtop batch mode
     6. Batch mode (-b)
        • Produces a Windows perfmon-compatible CSV file.
        • CSV file compatibility requires a fixed number of columns on every row; for this reason, statistics of VM/world instances that appear after batch mode starts are not collected.
        • Only counters that are specified in the configuration file are collected; the -a option collects all counters.
        • Counters are named slightly differently.
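
Besides perfmon, a batch-mode CSV can be post-processed with a short script. The sketch below uses Python's standard `csv` module. The sample rows, host name, and VM name are invented, and the column names only approximate the perfmon-style headers, so check the header of a real capture before relying on any naming pattern.

```python
import csv
import io

# Invented sample resembling a perfmon-style CSV: one timestamp column,
# then one column per counter instance.
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\host1\Group Cpu(1:idle)\% Used","\\host1\Group Cpu(55:vm-a)\% Used"
"10/12/2010 10:00:00","310.5","42.0"
"10/12/2010 10:00:05","305.1","58.0"
'''


def column_series(csv_text, needle):
    """Return {column name: [values]} for every column whose header
    contains the substring `needle`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if needle in name]
    series = {header[i]: [] for i in wanted}
    for row in reader:
        if not row:
            continue  # skip blank lines
        for i in wanted:
            series[header[i]].append(float(row[i]))
    return series


series = column_series(SAMPLE, "vm-a")
for name, values in series.items():
    print(name, sum(values) / len(values))
```

Because every row has the same columns, selecting by header substring is enough to pull out one VM's counters.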
  11. esxtop batch mode: importing data into perfmon
  12. esxtop batch mode: viewing data in perfmon
  13. esxtop batch mode: trimming data. Figures: trimming data; saving data after trim.
  14. esxplot: http://labs.vmware.com/flings/esxplot
  15. I/O Latencies
     7. I/O latencies
        • I/O latencies are measured per SCSI command, so they are not affected by the refresh interval.
        • Reported latencies are average values for all the SCSI commands issued within the refresh interval window.
        • Reported average latencies can differ between screens (adapter, LUN, VM), since each screen accounts for a different group of I/Os.
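
A small worked example shows why the averages differ between screens. The command counts and latencies below are made up; the point is only that an adapter-level average is weighted by the number of commands each LUN contributed.

```python
def average_latency(io_groups):
    """Command-weighted average latency.

    io_groups is a list of (command_count, average_latency_ms) pairs.
    """
    total_cmds = sum(count for count, _ in io_groups)
    if total_cmds == 0:
        return 0.0
    return sum(count * latency for count, latency in io_groups) / total_cmds


# Hypothetical LUNs behind the same adapter during one refresh interval.
lun_a = (900, 2.0)    # 900 commands averaging 2 ms
lun_b = (100, 20.0)   # 100 commands averaging 20 ms

print(average_latency([lun_a]))          # 2.0 on the LUN screen
print(average_latency([lun_b]))          # 20.0 on the LUN screen
print(average_latency([lun_a, lun_b]))   # 3.8 on the adapter screen
```

The adapter-level figure of 3.8 ms hides the slow LUN, which is why it helps to compare the adapter, LUN, and VM screens.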
  16. resxtop: remote esxtop
     8. You can use resxtop to connect to different ESX hosts
        • A newer version of resxtop will connect to older ESX hosts.
     9. You don’t need root access to view esxtop counters
        • resxtop can authenticate using vCenter credentials.
  17. esxtop CPU usage
     10. esxtop can consume a non-trivial amount of CPU
        • This happens when you have a very large inventory (VMs, LUNs, virtual disks, virtual NICs, etc.).
        • You can limit the amount of data collected by limiting the fields (columns) and entities (rows). You can also reduce CPU consumption by locking entities with the -l option.
     Figures: CPU consumption on a host with 512 VMs; CPU consumption with esxtop -l; CPU usage when using resxtop.
  18. Performance Troubleshooting Using esxtop
  19. esxtop screens
     • c: cpu (default)
     • m: memory
     • n: network
     • d: disk adapter
     • u: disk device (added in ESX 3.5)
     • v: disk VM (added in ESX 3.5)
     • i: interrupts (new in ESX 4.0)
     • p: power management (new in ESX 4.1)
     Diagram: VMs run on top of the VMkernel; the screens map to VMkernel components: CPU scheduler (c, i, p), memory scheduler (m), virtual switch (n), vSCSI (d, u, v).
  20. Troubleshooting CPU Problems
  21. CPU constrained. Callouts: SMP VM; high CPU utilization; both the virtual CPUs are CPU constrained.
  22. CPU contention. Callouts: 4 CPUs, all at 100%; 3 SMP VMs; the VMs don’t get to run all the time; %ready accumulates.
  23. CPU limit. Callouts: Max Limited; CPU limit; AMAX = -1: unlimited.
  24. Mis-configured SMP VM. vCPU 1 is not used by the VM: either the guest runs an incorrect (UP) kernel/HAL, or the application inside the guest is single threaded.
  25. Power management: CPU frequency scaling
     • C states: C0 is busy, C1 is halted, C2 is deep halt.
     • P states: P0 is the highest clock frequency, P11 is the lowest clock frequency.
  26. VM power usage. Experimental feature, not enabled by default. VMkernel advanced setting: Power.ChargeVMs
  27. CPU clock frequency scaling
     • %USED: CPU usage with reference to the base clock frequency
     • %UTIL: CPU utilization with reference to the current clock frequency
     • %RUN: CPU scheduled time
     Example shown: the VM is running all the time but uses only 75% of the clock frequency.
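
The example on this slide can be approximated with a deliberately simplified model in which %USED is %RUN scaled by the ratio of the current to the base clock frequency. This is an assumption made for illustration: it ignores hyperthreading, system time, and turbo boost, and the clock frequencies below are invented.

```python
def pct_used(pct_run, current_mhz, base_mhz):
    """Simplified model (an assumption, not esxtop's exact accounting):
    scale scheduled time by how fast the clock runs relative to base."""
    if base_mhz <= 0:
        raise ValueError("base frequency must be positive")
    return pct_run * current_mhz / base_mhz


# Hypothetical host: base clock 2660 MHz, currently scaled down to 1995 MHz.
# A VM that is scheduled 100% of the time then shows roughly 75 %USED.
print(pct_used(100.0, 1995.0, 2660.0))   # 75.0
```

Under the same model, a current frequency above base (turbo boost) would push the result above 100, consistent with the earlier point that %USED can exceed 100.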
  28. Hyperthreading. Callouts: two VMs running on different cores; two VMs sharing the same core; the %LAT_C counter shows the time descheduled due to core sharing.
  29. Timer interrupt rate: Linux guests
  30. Timer interrupt rate: Windows guests, multimedia timer
  31. New metrics in CPU screen
     • %LAT_C: % of time the VM was not scheduled due to a CPU resource issue
     • %LAT_M: % of time the VM was not scheduled due to a memory resource issue
     • %DMD: moving average of CPU utilization over the last one minute
     • EMIN: minimum CPU resources in MHz that the VM is guaranteed to get when there is CPU contention
  32. Troubleshooting Memory Problems
  33. esxtop memory screen (m)
     • Possible states: high, soft, hard and low
     • PMEM: total physical memory
     • VMKMEM: memory managed by the VMkernel
     • COSMEM: memory used by the Service Console
  34. Not able to power on a new VM: memory reservation. Callouts: 820 MB reservation requested; overhead memory needs to be reserved; 4G memory reservation.
  35. Granted memory
     • Granted memory = memory touched by the guest
     • Windows and FreeBSD guests touch (zero) all their memory during boot.
     • Linux guests touch memory when they first use it.
  36. Ballooning versus swapping. Callouts: MCTL: N means the balloon driver is not active, tools probably not installed; memory hog VMs; swapped in the past but not actively swapping now; the swap target is higher for the VM without the balloon driver; the VM with the balloon driver swaps less.
  37. Memory compression stats
     • COWH: copy-on-write page hints, the amount of memory in MB that is potentially shareable
     • CACHESZ: compression cache size
     • CACHEUSD: compression cache currently used
     • ZIP/s, UNZIP/s: memory compression/decompression rate
  38. Wide NUMA: CPU. Callouts: 2 NUMA nodes with ~6G each; NUMA home node not assigned; a 6-vCPU VM cannot fit into a NUMA node of 4 CPUs; its 4G of memory can fit into a single node.
  39. NUMA affinity not set. Callouts: NUMA machine with 2 nodes; CPU affinity set to the wrong NUMA node; all the memory is in the remote node.
     • NHN: NUMA home node
     • NLMEM: memory in local node
     • NRMEM: memory in remote node
  40. Wide NUMA: memory. Callouts: 2 NUMA nodes with ~6G each; NUMA home node not assigned; the VM cannot fit into a single NUMA node.
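
The two wide-NUMA cases come down to a simple fit check: a VM can be given a single home node only if both its vCPU count and its memory fit within one node. The function below is a rough sketch of that reasoning with hypothetical names; the real scheduler takes more into account (for example overhead memory and current load).

```python
def fits_in_one_numa_node(vcpus, vm_mem_gb, cpus_per_node, mem_per_node_gb):
    """Return True if the VM's vCPUs and memory both fit in one node."""
    return vcpus <= cpus_per_node and vm_mem_gb <= mem_per_node_gb


# Host from the slides: 2 nodes, 4 CPUs and roughly 6 GB per node.
print(fits_in_one_numa_node(6, 4, 4, 6))   # False: too many vCPUs (wide NUMA, CPU)
print(fits_in_one_numa_node(2, 8, 4, 6))   # False: too much memory (wide NUMA, memory)
print(fits_in_one_numa_node(4, 4, 4, 6))   # True: a home node can be assigned
```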
  41. Troubleshooting Network Problems
  42. vSwitch active uplink. TEAM-PNIC: the uplink that the virtual switch port is currently using
  43. Dropped packets at vSwitch. Packet drops usually happen when the traffic has no flow control (UDP/multicast/broadcast packets).
  44. Multicast/broadcast stats
     • PKTTXMUL/s: multicast packets transmitted per second
     • PKTRXMUL/s: multicast packets received per second
     • PKTTXBRD/s: broadcast packets transmitted per second
     • PKTRXBRD/s: broadcast packets received per second
  45. NFS stats. DAVG and KAVG are not available for network-backed storage; GAVG gives the end-to-end latency.
  46. Troubleshooting Disk Problems
  47. Disk I/O latency
     • Host bus adapters (HBAs): includes SCSI, iSCSI, RAID, and FC-HBA adapters
     • Latency stats from the device, the kernel and the guest:
       DAVG/cmd: average latency (ms) from the device (LUN)
       KAVG/cmd: average latency (ms) in the VMkernel
       GAVG/cmd: average latency (ms) in the guest
  48. Problem with the disk subsystem. Callouts: bad throughput when device latency is high (cache disabled); good throughput when device latency is low.
  49. Insufficient queue depth. Callouts: non-zero KAVG; queuing at the HBA.
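
The three latency counters are commonly related as GAVG ≈ DAVG + KAVG. That relationship is not stated on the slides and is used here as an assumption. Under it, the kernel share of guest-observed latency can be estimated as follows; the function names and sample values are hypothetical.

```python
def kernel_latency_ms(gavg_ms, davg_ms):
    """Estimate KAVG from GAVG and DAVG, assuming GAVG ~= DAVG + KAVG.
    Clamped at zero to absorb rounding in the reported averages."""
    return max(gavg_ms - davg_ms, 0.0)


def queuing_suspected(kavg_ms):
    """Per the slide, non-zero kernel latency points at queuing
    above the device, for example an insufficient queue depth."""
    return kavg_ms > 0.0


# Hypothetical readings: the guest sees 12.5 ms, the device reports 4.5 ms.
kavg = kernel_latency_ms(12.5, 4.5)
print(kavg)                       # 8.0
print(queuing_suspected(kavg))    # True
```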
  50. FC bottleneck. Views shown: ‘v’ VM view; ‘u’ device view; ‘d’ adapter view.
  51. vStorage API for Array Integration (VAAI) stats
     • CLONE_RD, CLONE_WR: number of clone read/write requests
     • CLONE_F: number of failed clone operations
     • MBC_RD/s, MBC_WR/s: clone read/write MBs/sec
     • ATS: number of ATS commands
     • ATSF: number of failed ATS commands
     • ZERO: number of zero requests
     • ZEROF: number of failed zero requests
     • MBZERO/s: megabytes zeroed per second
  52. vStorage API for Array Integration (VAAI): virtual disk creation example
  53. SCSI reservation conflicts
  54. Other diagnostic tools
  55. Other diagnostic tools (1 of 2)
     sched-stats and schedtrace
        • vm-support -s/-S flag captures sched-stats
        • vm-support -c flag captures the scheduler trace; it takes a lot of disk space
     memstats
        • Provides detailed memory usage stats with the resource pool hierarchy
     ft-stats
        • FT virtual machine stats
        • Collected with the vm-support -s/-S flag
  56. Other diagnostic tools (2 of 2)
     swatchStats
        • Stopwatch stats for VMFS, SCSI events
     vscsiStats
        • Virtual machine SCSI disk I/O stats
        • Provides histogram information for latency, I/O size, inter-arrival time and outstanding I/Os
  57. vscsiStats. Command shown: # vscsiStats -l. Callouts on its output: virtual machine name; world group leader id; virtual SCSI disk handle ids, unique across virtual machines.
  58. vscsiStats: latency histogram. Command shown: # vscsiStats -p latency -w 118739 -i 8205. Callouts: latency in microseconds; I/O distribution count.
  59. vscsiStats: iolength histogram. Command shown: # vscsiStats -p iolength -w 118739 -i 8205. Callouts: I/O block size; distribution count.
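
For readers unfamiliar with this kind of output, the sketch below shows how a bucketed histogram like the ones on the last two slides can be built from raw samples. The bucket limits and latency values are invented for illustration and are not taken from vscsiStats.

```python
from bisect import bisect_left


def histogram(values, upper_bounds):
    """Count values per bucket. Bucket i holds values <= upper_bounds[i]
    (and above the previous bound); the last bucket holds everything
    larger than the final bound. upper_bounds must be sorted."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


# Hypothetical bucket limits and per-command latencies, in microseconds.
bounds_us = [100, 500, 1000, 5000, 15000]
latencies_us = [80, 120, 450, 450, 900, 4000, 20000]

for label, count in zip(bounds_us + ["> 15000"], histogram(latencies_us, bounds_us)):
    print(label, count)
```

Unlike the averages shown in esxtop, a histogram keeps outliers visible: the single 20000-microsecond command lands in its own bucket.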
