1. Tracking Down Storage Performance Issues: A Customer’s Perspective
Keith Aasen, NetApp
Scott Elliott, Christie Digital
INF-STO1430
#vmworldinf
3. Agenda
1. Introduction and background
2. Storage problems and their effect on virtual infrastructure
3. Root cause analyses and resolution
4. Results and next steps
4. Who is Christie? About Christie
• A global visual technologies company
• Visual Solutions include:
• Media Walls
• Digital Cinema Projectors
• 3D Virtual Reality
• Simulation Projection Systems
11. The problem arises
• Disk latencies: chart thresholds at 20 ms and 40 ms, with zones labeled Good, Bad and Ugly
[Figure: timeline of sustained latency and latency spikes across three phases]
• Implementation
• Increase in business demand; deployed SCOM plug-in
• Continued growth; high I/O introduced
• Earlier measurements: sustained 30 ms, spikes 100 ms; no application impact
• Later measurements: sustained 40+ ms, spikes 6 seconds; significant application impact
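The Good/Bad/Ugly bands on this slide can be written as a small classifier. This is a minimal sketch: the 20 ms and 40 ms thresholds come from the slide, while the function name and the handling of values that sit exactly on a threshold are assumptions.

```python
# Thresholds taken from the slide; treating a value equal to a threshold
# as belonging to the worse band is an assumption.
GOOD_MAX_MS = 20.0
BAD_MAX_MS = 40.0


def classify_latency(latency_ms: float) -> str:
    """Return 'good', 'bad' or 'ugly' for a disk latency in milliseconds."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < GOOD_MAX_MS:
        return "good"
    if latency_ms < BAD_MAX_MS:
        return "bad"
    return "ugly"
```

Under these assumptions, the 30 ms sustained latency reported on the slide falls in the "bad" band and a 6-second spike (6000 ms) falls in "ugly".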
13. List of issues
1. Most datastores had a consistent disk latency of 40 ms or higher, with spikes lasting multiple seconds
2. ESXi hosts lost connectivity at seemingly random times
• Most losses happened between midnight and 5:00 a.m.
3. Applications complained of disk time-outs
• Where applicable, they would automatically fail over to the DR site
14. The hunt begins
• Where to start?
• Oceans of data across multiple systems
• Need to correlate information and filter out distractions
• Specialized knowledge to interpret the data
15. Timing is everything
• Coincidentally, a PoC of NetApp OnCommand Balance was under way
• Additional diagnostic analysis and correlated data
• Supplemented SCOM and PerfStats
• Large number of misaligned VMs
• Most severe latencies happened between midnight and 5:00 a.m.
[Graphic: OnCommand Balance: Performance, Capacity, Analytics; "Intelligence Instead of Data"]
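One simple way to confirm a time-of-day pattern like the midnight to 5:00 a.m. window is to bucket latency samples by hour. The sketch below is an illustration only, not how OnCommand Balance or SCOM work internally; the function names, the sample format and the 40 ms default threshold are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_latency_by_hour(samples):
    """Group (timestamp, latency_ms) samples by hour of day and average them."""
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[timestamp.hour].append(latency_ms)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def worst_hours(samples, threshold_ms=40.0):
    """Return the hours of day whose mean latency is at or above the threshold."""
    averages = mean_latency_by_hour(samples)
    return [hour for hour, avg in averages.items() if avg >= threshold_ms]
```

Feeding this exported datastore latency samples would show whether the worst hours cluster in the overnight window.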
16. Misaligned VMs on a LUN
[Diagram: a VMDK whose NTFS blocks follow the MBR or starting offset, shown against the VMFS blocks and the WAFL blocks]
The VMDK is aligned to the VMFS file system.
The VMFS file system is aligned to the WAFL file system so that the VMFS blocks align to the WAFL blocks.
This offset causes the NTFS blocks to be misaligned with the WAFL blocks.
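Whether a guest partition is aligned comes down to simple arithmetic on its starting offset. The sketch below assumes 4 KB WAFL blocks and 512-byte sectors; the function names are hypothetical, and the 63-sector starting offset is a typical legacy MBR layout used for illustration, not a figure from the slides.

```python
WAFL_BLOCK_BYTES = 4096  # WAFL block size
SECTOR_BYTES = 512       # assumed legacy sector size


def partition_offset_bytes(start_sector: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Convert a partition's starting sector into a byte offset."""
    return start_sector * sector_bytes


def is_aligned(offset_bytes: int, block_bytes: int = WAFL_BLOCK_BYTES) -> bool:
    """A partition is aligned when its offset is a whole number of storage blocks."""
    return offset_bytes % block_bytes == 0
```

A partition starting at sector 63 begins at byte 32,256, which is not a multiple of 4,096, so every guest block that follows straddles a storage block boundary. A partition starting at sector 2048 begins at byte 1,048,576 and is aligned.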
18. Properly aligned VM IO
In a properly aligned VM configuration, each guest OS block (NTFS/EXT3) is mapped to one block on the storage array.
[Diagram: VMDK with NTFS blocks lined up one-to-one with the WAFL blocks, with the VMFS blocks in between]
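The one-to-one mapping can be checked by counting how many storage blocks a single guest I/O touches. This is a minimal sketch assuming 4 KB guest and storage blocks; the function name is hypothetical.

```python
def storage_blocks_touched(offset_bytes: int, length_bytes: int, block_bytes: int = 4096) -> int:
    """Count the storage blocks covered by an I/O of length_bytes at offset_bytes."""
    if length_bytes <= 0:
        raise ValueError("length must be positive")
    first_block = offset_bytes // block_bytes
    last_block = (offset_bytes + length_bytes - 1) // block_bytes
    return last_block - first_block + 1
```

With these assumptions, a 4 KB guest block at an aligned offset touches exactly one storage block, while the same block at a misaligned offset touches two.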
19. Properly aligned VM IO
[Diagram: VMDK with NTFS blocks lined up one-to-one with the WAFL blocks, with the VMFS blocks in between]
When a write occurs from the guest OS, the write is cached and then acknowledged back to the guest.
20. Properly aligned VM IO
[Diagram build: a guest write is issued to an NTFS block]
21. Properly aligned VM IO
[Diagram build: the written NTFS block is cached in NVRAM]
22. Properly aligned VM IO
[Diagram build: an ACK is returned to the guest]
23. Properly aligned VM IO
[Diagram build: the cached NTFS block is written to disk later and the NVRAM copy is invalidated]
Because of NetApp WAFL and NVRAM technology, NetApp controllers can write to disk very quickly; therefore, NVRAM rarely fills up.
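The write path in these slides (cache, acknowledge, write to disk later, invalidate) can be modeled with a toy write-back cache. This is a deliberately simplified sketch and not NetApp's implementation; the class and method names are hypothetical.

```python
class ToyWriteCache:
    """Toy model: writes are acknowledged once cached, then flushed later."""

    def __init__(self):
        self.nvram = {}  # block id -> data cached but not yet on disk
        self.disk = {}   # block id -> data persisted to disk

    def write(self, block_id, data):
        """Cache the write and acknowledge immediately, before any disk I/O."""
        self.nvram[block_id] = data
        return "ACK"

    def flush(self):
        """Write cached blocks to disk, then invalidate the cached copies."""
        flushed = len(self.nvram)
        self.disk.update(self.nvram)
        self.nvram.clear()
        return flushed
```

The guest sees the acknowledgment as soon as `write` returns, so its latency does not depend on the later `flush`, provided the cache drains faster than it fills.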
31. Net effect
• This process makes consistency points take longer
• Increases CPU load on the controller
• No effect on VM performance if the controller can “keep up”
• If load increases, a dramatic spike in latency can occur
• Ultimately determines how many VMs can be hosted on a storage system
32. How to correct misalignment
• Adjust the MBR or boot sector with MBRalign or VMware Converter
• Permanent solution
• Requires downtime for the VM
• Create an “Optimized Datastore”
• No downtime required for the VM
• Only a limited number of vendors offer this
• Must be sure not to mix misaligned VMs and aligned VMs
35. Misaligned VMs on optimized LUN
[Diagram: a VMDK whose NTFS blocks follow the MBR or starting offset, shown against the VMFS blocks and the WAFL blocks]
The VMDK is aligned to the VMFS file system.
The VMFS file system is “improperly” aligned to the storage file system so that the NTFS blocks align to the storage blocks.
This offset causes the NTFS blocks to be aligned with the storage blocks.
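The idea of deliberately shifting the datastore can be illustrated with modular arithmetic. The sketch below only shows the arithmetic, since the slides do not describe how a vendor implements the shift; the function names, the 4 KB block size and the example offsets are assumptions.

```python
def compensating_shift(guest_offset_bytes: int, block_bytes: int = 4096) -> int:
    """Bytes to shift the datastore so a guest at this offset lands on a block boundary."""
    return (-guest_offset_bytes) % block_bytes


def lands_on_boundary(guest_offset_bytes: int, shift_bytes: int, block_bytes: int = 4096) -> bool:
    """Check whether a guest partition is aligned once the datastore shift is applied."""
    return (guest_offset_bytes + shift_bytes) % block_bytes == 0
```

For a guest partition starting at byte 32,256, the compensating shift is 512 bytes. The same shift pushes an already aligned guest (for example, one starting at byte 1,048,576) off the block boundary, which is why aligned and misaligned VMs should not share such a datastore.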
36. Getting closer
• Remaining latency spike late at night with no corresponding IO.
• Time coincided with the aggregate-level snapshot.
• Aggregate snapshots are on by default on every system. Usually there is no noticeable activity.
• A snapshot triggers a disk cleanup process if significant space is released.
• The cleanup process was colliding with the SQL DB copy, causing the latency spike. (It has since had its priority adjusted.)
37. The cumulative effect of client software
• Still had lingering, and seemingly random, spikes
• Used Veeam’s Management Pack for VMware
• Agentless vSphere monitoring and management
• System Center Operations Manager plug-in
• Used report “Virtual Machines: Disk Performance History”
40. What did we learn?
• An underused storage subsystem can mask environment
misconfigurations.
• Storage performance issues are rarely due to a single cause.
• In this case, there were three causes:
1. VM alignment
2. Storage resource contention from a background process, and
3. Suboptimal antimalware configuration.
41. Other lessons learned
1. Invest in monitoring tools to detect problems.
2. Fix misconfigurations before they become a problem.
3. Engage your vendor to assist with the troubleshooting process.