SlideShare a Scribd company logo
1 of 71
SDDC Performance and Capacity Management
About your speakers
Iwan ‘e1’ Rahabok
virtual-red-dot.info
e1@vmware.com
@e1_ang
Linkedin.com/in/e1ang
9119-9226
Sunny Dua
vxpresss.blogspot.com
duas@vmware.com
@sunny_dua
Linkedin.com/in/duasunny
One day in the life of VMware Admin…
• A VM Owner complains to IaaS Team that her VM is slow.
• Her application architect has verified that:
– The VM CPU and RAM utilization is good.
– The disk latency is good.
– There is no network drop packets.
– No change in the application settings
– No recent patch to Windows
What do you do?
• A: Check ESXi utilization. If it’s low, tell her to doubt no more.
• B: Buy her a nice lunch + flower. Ask her to forget about it 
• C: Call your VMware TAM & MCS. That’s why you pay them right? 
• D: Roll up your sleeve. You are born for this!
What’s wrong with these statements?
• Cluster CPU
– CPU Ratio is high at 1:5 times on cluster “XYZ”
– Rest all other cluster overcommit ratio looks good around 1:3
– Keep the over commitment ratio to 1:4.
– CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry.
– Rest other cluster CPU utilization is around 25%. This is good!
• Cluster RAM
– We recommend 1:2 overcommit ratio between physical RAM and virtual RAM.
– Memory Usage on most of the cluster is high around 60%
– Cluster “ABCD” is running peak at around 75%. CPU utilization should be less than 70%
– If we see that Active Mem% is also high than we should add more RAM to cluster
– % Active should not exceed 50-60% and Memory should be running at high state on each host
Monitoring
• There are 2 levels to monitor in VMware:
– The VM.
• VM is the most important as that’s all customers care.
• They do not care about your infrastructure. It is a Service. IaaS.
– The Infra.
• Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore
• ESXi + hardware
• Storage & Fabric
• Network
• There are 4 areas to monitor
• The 4 areas above impact one another
2 distinct layer
SDDC
VM VM VM VM
VM VM VM VM
VM VM VM VM
VM VM VM VM
Performance:
We check if it is being served well by the platform.
Other VM is irrelevant from VM Owner point of view.
1
Capacity. We check if VM is right-sized.
If too small, increase its configuration.
If too big, right size it for better performance
2
Performance:
We check if IaaS is serving everyone well.
Make sure there is no contention for resource among all the VMs
1
Capacity:
Check utilization. Too low, we spent too much on hardware.
Too high, we need to buy more hardware.
2
Configuration: Check for Compliance and Config Drift
Availability: Get alert for hardware fault or software stop working3
Consumer Layer
Provider Layer
Performance
How do you know your IaaS is performing fast?
ESXi utilization a 10% means your ESXi is fast?
ESXi utilization a 90% means your ESXi is fast?
Storage is doing 10K IOPS?
Network is processing 8 Gbps?
What counter do you use as a proof to your customers (VM Owner)?
Utilization?
Performance is measured by how well your IaaS serves the VMs.
Fast is relative to your customer. Use SLA as your defense line.
Capacity
Performance vs Capacity
Performance Capacity
Focus is on the VM.
It does not apply to IaaS
Focus is on the IaaS.
VM Capacity Management is just right sizing
Primary counter: Contention or Latency.
Utilization is largely irrelevant.
Primary counter: Contention or Latency
Secondary counter: Utilization
Does not take into account Availability SLA Takes into account Availability SLA
Tier 1 is in fact Availabity-driven.
© 2015 VMware Inc. All rights reserved.
The Consumer Layer
The “dining area”
CONFIDENTIAL11
How a VM gets its resource
Provisioned
Limit
Reservation
Entitlement
0 vCPU or 0 GB
Contention
Usage
Demand
This is the counter
we need to measure
4 vCPU or 16 GB
Dashboards
• Detail monitoring of a single VM
– When customer complains that his VM is slow. Can help desk value right away?
• Large VMs Monitoring
– Because they are actually hurting your IaaS business
– This impacts both Performance and Capacity
– VM Right Sizing
• Excessive Usage
– Excessive Usage by 1-2 VM can impact the overall IaaS performance.
– VMs with excessive usage hurts the business, if we do not charge for Network and Disk IOPS
Single VM Monitoring
• A VM Owner complains that his VM is slow.
– It was okay the day before
– How does Help Desk quickly determine where the issue is?
• How well does Infra serve the VM?
– VM CPU Contention
– VM RAM contention
– VM Disk latency. For each virtual disk, not average.
• Is VM undersized?
– VM CPU Utilisation
– VM RAM Consumed (not Usage)
– VM RAM Usage
– VM Disk IOPS
Dashboard 1
Single VM
Monitoring
How oversized are the Large VMs?
• They cause performance issue
– They impact others, and also themselves!
– ESXi vmkernel scheduler has to find available cores for all the vCPU, even though they are idle.
– Other VMs maybe migrated from core to core. The counter at esxtop tracks this migration.
• Tends to have slower performance
– ESXi may not have all the available vCPU for them.
• Reduces consolidation ratio
– You can pack more vCPU with smaller VM than with big VM.
– Unless you have progressive pricing, you make more money with smaller VM as you sell more vCPU.
Dashboard of Large VMs
• Overall Picture
– A line chart showing Max CPU Demand among all the Large VMs
• If this is low, they are way oversubscribed. Remember, it only takes 1 VM to make this number high.
• This number should be 80% most of the time, indicating right sizing.
– A line chart showing Average CPU Demand
• If this chart is below <25% all the time for entire month, then the large VMs are over sized.
• Heat Map of Large VMs
– Size by vCPU config. So it’s easy to see who the biggest among these large VMs.
– Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation
• To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
• Top-N CPU Demand
– Allows us to zoom into specific time to see the past
• Line chart of a selected VM (automatically plotted)
As expected, the Max of All VMs is low. We can go
back in time and see over 3 months. As expected, they are mostly Black. This means
they are over provisioned.
This shows the Top 15 VM. You can change the
period to any time. This is auto shown. We are showing CPU and RAM.
You expect 70% range, not 20% like this example.
VM Right Sizing
• Focus on large VM
– Every downsize is a battle. No one likes to give up that they are given.
– It also requires downtime
– Downsizing from 4 vCPU to 2 does not buy much nowadays with >10 core Xeon.
• Focus on CPU, not RAM
– RAM in general is plentiful as RAM is cheap nowadays
– RAM is hard to measure, even with agents, as it’s application dependant
MS Windows: Memory Management
• Windows makes great and full use of RAM
– It’s using it as cache.
– Adding more physical RAM will result in more usage.
• Virtual Memory is an integral part of Windows Memory Management
– It is not a swap file.
– A growing pagefile is an early warning.
– Track that it’s below 2.0 with this formula:
≤ 2.0
Commit Limit
Physical RAM
MS Windows: Memory Management
In Use
Available
Cached
vRealize Ops 6.1 EP Agent: Memory Used
vRealize Ops 6.1 EP Agent: Memory Available
MS Windows: Memory Sizing
Conservative
Cost Effective
Windows: Memory Management
• Which VMs need to be upsized?
– Get all those VMs whose commit limit ratio > 2.0
– List can be sorted by the one with the highest commit limit ratio
≤ 2.0
Commit Limit
Physical RAM
Windows RAM: Right Sizing
• Use Commit Limit Ratio super metric to upsize VM
• Conservative or Cost Effective?
– Cost Effective: Used.
– Conservative: Used + Cache
• Example
– Used + Cache: 90%
– Value exceeding 90% means
the VM needs more RAM
Server Workload VDI Workload
1 apps Many apps
Long live apps Many apps launched and closed.
Varies Many files opened and closed
No Internet browsing Internet browsing (movie!)
Workload predictable Workload spiky and unpredictable
Varies (UI-less) Flash, Java, JavaScript (UI heavy)
Windows RAM: What counters to use?
Cost Effective: Used
Conservative: Used + Standby Cache Normal Priority + Standby Cache Reserve
Windows RAM: Hypervisor vs In-Guest
30
A B DC
CONFIDENTIAL 31
CONFIDENTIAL 32
VM Right Sizing
• Do not reduce RAM without changing application
– If the VM has no RAM shortage, reducing RAM will not speed up anything.
– Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO
– Reducing RAM beyond the ISV recommendation can result in unsupported configuration.
– Reducing RAM requires manual reduction for apps that manage its own RAM (e.g. Java, SQL, Oracle).
– It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder.
– If there is a performance issue after you reduce both CPU and RAM….
• If a VM is not using the full RAM, ask the Appl Team if they can
– To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
Why VM Owner should right size
• It takes longer to boot.
– If a VM does not have reservation, vSphere will create a swap file the size of the configured RAM.
• It takes longer to vMotion.
• Risk of NUMA effect
– The RAM or CPU maybe spread over a single socket. Due to NUMA architecture, the performance will
not be as good.
• It will experience higher co-stop and ready time.
• It takes longer to snapshot, especially if memory snapshot is included.
• The processes inside the Guest OS may experience ping-pong.
• Lack of performance visibility at the individual vCPU or virtual core level.
34
Dashboard 2
VM Right Sizing
Any Excessive Utilization in our DC?
• A VM consumes 5 resources:
1. vCPU
2. vRAM (GB)
3. Disk Space
4. Disk IOPS
5. Network (Mbps)
• The first 3 you can bound and control
• The last 2 you can, but normally you don’t do it. You should.
• Need a dashboard to track excessive usage
– Disk IOPS
– Network throughput
Dashboard for Excessive Utilisation
• Excessive Storage consumption
– Line Chart:
• Max VM Disk IOPS among all VMs
• Average VM Disk IOPS
– Heat Map
• Size by IOPS. Color by Latency
• If you see a single big box, that means you have a VM dominating your storage IOPS.
• Excessive Network consumption
– Similar concept as above
This tracks the IOPS from VM. From here we can tell is a distinct peak. It looks like it’s coming from
1 VM, as the average is far lower. This is a cluster of 500 VM, so even if 1 VM hits 13,200 IOPS, the
average did not even pass 15 IOPS.
Let’s zoom into the peak.
Excessive Storage Dashboard
The peak was 13,212 IOPS on 24 May, around 3:16 am. Let’s find out
which VM.
Excessive Storage Dashboard
• We can list the Top VMs generating the IOPS on any given period.
Bingo, it was VM 63ee that did that 13212 IOPS.
Catcha! 
The dashboards are great.
But it does not tell you how the IOPS distribution
among all the VMs. It also does not tell if the VMs
are experiencing high latency.
You need a Heat Map for this.
At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they getting low
latency or not.
Dashboard 3
Excessive DC
Utilization
© 2015 VMware Inc. All rights reserved.
And that’s it!
You “passed” those dashboards, you’re done with the “dining area”!
© 2015 VMware Inc. All rights reserved.
The Provider Layer
The “kitchen”
CONFIDENTIAL44
Performance Management
• Overall Performance Monitoring
– Is any of our customers experiencing bad performance?
– CPU, RAM, Disk, Network
• If yes, who are affected?
– Different VM may get different impact.
– VM 007 may get hit on CPU, while VM 747 may get hit on Storage.
Performance SLA Monitoring
• How do we prove that….not a single VM… in any service tier…. fails the SLA threshold we
agree for that tier… in the past 1 month?
• Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level.
• If you oversubscribe, there is a risk of Contention.
– For Tier 1, do not overcommit.
– For Tier 2 and 3, do overcommit.
Using Max and Average to determine how VMs are served
If the Max is:
• below what you think your customers can tolerate, then you are good.
• Near the threshold, then your capacity is full. Do not add more VM.
• Above the threshold, move a few VMs out, preferably the large ones.
This dashboard is good as summary. You stop here if there is no issue.
But say there is an issue. Can you drill down?
This is an example of
how you can drill down
to individual Cluster,
and see any metrics
you want.
Notice the Max
Contention and
Average Contention
both spike. The
average is much lower
than Max, indicating
room for additional VM.
The VM contention hit
4.44%. It’s still okay.
Drill down: Cluster CPU Performance
There is issue
here. The Max
value hit 31%.
The Average is
still good at
0.31%, so this
means 90% of
VMs are being
served well.
Storage Latency Monitoring: Details
• The data you see at vCenter and vRealize Operations are average
– The storage latency data that vCenter provides is a 20-second average. With vRealize Operations,
and other management tools that keeps the data much longer, it is a 5-minute average.
– 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average of 10,000 x 20 =
200,000 read or writes.
• The data you see at vmkernel log is not.
– It is per individual IO. More info here.
– It is acceptable to have higher latency, but ensure it is not too high. Set your threshold at 250 ms for a
start.
• Additional info
– Data at vmkernel excludes bottleneck at upper layer. For example, no disk queue in vDisk. As a result,
we can conclude that the storage latency is not at VM level.
Drill down: Cluster Storage Performance
56
We look at storage across the past 1+ month.
We are seeing latency spike. There is an outlier at 6000 ms on a magnetic disk.
This data is going back to 18 April. Let’s zoom into May 22 as there are recent spike there.
Drill down: Cluster Storage Performance
Zooming into May 17 – 23.
We also exclude all the Magnetic Disk.
Device ID naa.55* is SSD, while naa.5000* is magnetic.
We are seeing latency at the SSD. There is 1 outlier.
Drill down: Cluster Storage Performance
We can also group the data by ESXi Host.
We can also present the data in bar chart
We can zoomed into much more granular time line, below 1 second!
Which VMs are affected?
• The previous slides give us info at Cluster level.
– If there is no VM affected, it’s good. No need to analyse further.
– If there are VMs affected, we want to know which ones.
• We can address the above by listing the Top 30 VM
– CPU Contention
– RAM Contention
– Disk Latency
– Network drop packet (ensure it is 0)
– Network latency (this needs NetFlow)
These are the top 40 VMs which
experienced the worst CPU
Contention.
These are the top 40 VMs which
experienced the worst RAM
Contention.
These are the top 40 VMs which
experienced the worst Disk
Latency.
© 2015 VMware Inc. All rights reserved.
And that’s it!
If Performance is ok, it’s time to review Capacity
61
Capacity Management based on Business Policy
http://virtual-red-dot.info/capacity-management-based-on-business-policy/
Performance Policy
63
Group Discussion: What should your Performance Policy be?
Availability Policy
64
Group Discussion: What should your Availability Policy be?
IOPS
(per VM)
Latency
(VM level)
Automated DR
(SRM)
RPO RTO
Tier 1 1000 <10 ms Yes 5 minutes 1 hour
Tier 2 500 <20 ms Yes <2 hours <2 hours
Tier 3 100 <30 ms Yes <8 hours <4 hours
Capacity Management: Tier 1
5 line charts showing these in the past 3 months
• Number of vCPU left in the cluster.
• Number of vRAM left in the cluster.
• Number of VM left in the cluster.
• Maximum & Average storage latency experience by any VM in the cluster
• “Usable” space left in the datastore cluster.
65
If the number is approaching low number (your threshold) for it’s time to
increase supply (e.g. IOPS, Cluster)
Capacity Management: Tier 2 or 3
5 line charts showing data in the past 3 months
• The Maximum CPU Contention experience by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The Maximum RAM Contention experience by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The total number of VM left in the cluster.
• The Maximum & Average storage latency experience by any VM in the cluster
• The disk capacity left in the datastore cluster.
66
Tier 2 or 3
67
In this example, if we use
10% as threashold, the
cluster is full
In this example, if we use
30ms as threashold, the
cluster is full
Capacity Management: Tier 2 or 3
• RAM has different pattern to CPU as it’s a form storage
– SLA principel remains the same. If you exceed the SLA, that cluster is full.
68
SLA vs Internal Threshold
69
SLA Tier 1 Tier 2 Tier 3
CPU Contention 1% 2% 13%
RAM Contention 0% 10% 20%
Disk Latency 10 ms 20 ms 30 ms
SLA only applies to VM.
VM owner does not care about underlying platform.
The above is my personal opinion. You need to get your Customer agreed
Internal (your own) Tier 1 Tier 2 Tier 3
CPU Contention 1% 3% 10%
RAM Contention 0% 5% 10%
Disk Latency 10 ms 15 ms 20 ms
Key Takeaways
Agree on a Performance SLA.
Contention, not Utilization.
Capacity is defined by Performance.
CONFIDENTIAL 70
More Details
• The book provides details that we could not cover
in half a day.
• The book is not a product book.
– It focuses on concept, which you can apply using
any product. It does not have to be vRealize Ops
71
Thank You!

More Related Content

What's hot

VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingDan Brinkmann
 
Operating System 2
Operating System 2Operating System 2
Operating System 2tech2click
 
Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization with KVM (Kernel-based Virtual Machine)Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization with KVM (Kernel-based Virtual Machine)Novell
 
Transaction Processing its properties & States
Transaction Processing its properties & StatesTransaction Processing its properties & States
Transaction Processing its properties & StatesMeghaj Mallick
 
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...Jim Czuprynski
 
Data and database administration(database)
Data and database administration(database)Data and database administration(database)
Data and database administration(database)welcometofacebook
 
Ch2: Computer System Structure (OS)
Ch2: Computer System Structure (OS)Ch2: Computer System Structure (OS)
Ch2: Computer System Structure (OS)Ahmar Hashmi
 
Récupération d’un Active Directory: comment repartir en confiance après une c...
Récupération d’un Active Directory: comment repartir en confiance après une c...Récupération d’un Active Directory: comment repartir en confiance après une c...
Récupération d’un Active Directory: comment repartir en confiance après une c...Identity Days
 
System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022Stefano Stabellini
 
Introduction: Databases and Database Users
Introduction: Databases and Database UsersIntroduction: Databases and Database Users
Introduction: Databases and Database Userssontumax
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
DNS
DNSDNS
DNSFTC
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamalKamal Maiti
 
Practical examples of using extended events
Practical examples of using extended eventsPractical examples of using extended events
Practical examples of using extended eventsDean Richards
 
Database Keys & Relationship
Database Keys & RelationshipDatabase Keys & Relationship
Database Keys & RelationshipBellal Hossain
 
Windows OS Architecture in Summery
Windows OS Architecture in SummeryWindows OS Architecture in Summery
Windows OS Architecture in SummeryAsanka Dilruk
 

What's hot (20)

VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance Troubleshooting
 
Operating System 2
Operating System 2Operating System 2
Operating System 2
 
Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization with KVM (Kernel-based Virtual Machine)Virtualization with KVM (Kernel-based Virtual Machine)
Virtualization with KVM (Kernel-based Virtual Machine)
 
Transaction Processing its properties & States
Transaction Processing its properties & StatesTransaction Processing its properties & States
Transaction Processing its properties & States
 
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...
Fast and Furious: Handling Edge Computing Data With Oracle 19c Fast Ingest an...
 
Data and database administration(database)
Data and database administration(database)Data and database administration(database)
Data and database administration(database)
 
Ch2: Computer System Structure (OS)
Ch2: Computer System Structure (OS)Ch2: Computer System Structure (OS)
Ch2: Computer System Structure (OS)
 
Récupération d’un Active Directory: comment repartir en confiance après une c...
Récupération d’un Active Directory: comment repartir en confiance après une c...Récupération d’un Active Directory: comment repartir en confiance après une c...
Récupération d’un Active Directory: comment repartir en confiance après une c...
 
System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022
 
Introduction: Databases and Database Users
Introduction: Databases and Database UsersIntroduction: Databases and Database Users
Introduction: Databases and Database Users
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Lvm advanced topics
Lvm advanced topicsLvm advanced topics
Lvm advanced topics
 
DNS
DNSDNS
DNS
 
File System Modules
File System ModulesFile System Modules
File System Modules
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamal
 
Practical examples of using extended events
Practical examples of using extended eventsPractical examples of using extended events
Practical examples of using extended events
 
Chapter 1: Introduction to Unix / Linux Kernel
Chapter 1: Introduction to Unix / Linux KernelChapter 1: Introduction to Unix / Linux Kernel
Chapter 1: Introduction to Unix / Linux Kernel
 
Windows Architecture
Windows ArchitectureWindows Architecture
Windows Architecture
 
Database Keys & Relationship
Database Keys & RelationshipDatabase Keys & Relationship
Database Keys & Relationship
 
Windows OS Architecture in Summery
Windows OS Architecture in SummeryWindows OS Architecture in Summery
Windows OS Architecture in Summery
 

Viewers also liked

Advanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtopAdvanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtopAlan Renouf
 
Virtualising Tier 1 Apps
Virtualising Tier 1 AppsVirtualising Tier 1 Apps
Virtualising Tier 1 AppsIwan Rahabok
 
vSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformancevSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformanceProfessionalVMware
 
VMware Log Insight
VMware Log Insight VMware Log Insight
VMware Log Insight Iwan Rahabok
 
CPU Scheduling for Virtual Desktop Infrastructure
CPU Scheduling for Virtual Desktop InfrastructureCPU Scheduling for Virtual Desktop Infrastructure
CPU Scheduling for Virtual Desktop InfrastructureHwanju Kim
 
VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialRichard McDougall
 

Viewers also liked (6)

Advanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtopAdvanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtop
 
Virtualising Tier 1 Apps
Virtualising Tier 1 AppsVirtualising Tier 1 Apps
Virtualising Tier 1 Apps
 
vSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformancevSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting Performance
 
VMware Log Insight
VMware Log Insight VMware Log Insight
VMware Log Insight
 
CPU Scheduling for Virtual Desktop Infrastructure
CPU Scheduling for Virtual Desktop InfrastructureCPU Scheduling for Virtual Desktop Infrastructure
CPU Scheduling for Virtual Desktop Infrastructure
 
VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A Tutorial
 

Similar to Master VMware Performance and Capacity Management

VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & MemoryVMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & MemoryVMworld
 
Session 7362 Handout 427 0
Session 7362 Handout 427 0Session 7362 Handout 427 0
Session 7362 Handout 427 0jln1028
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowAndrew Miller
 
Cinder enhancements-for-replication-using-stateless-snapshots
Cinder enhancements-for-replication-using-stateless-snapshotsCinder enhancements-for-replication-using-stateless-snapshots
Cinder enhancements-for-replication-using-stateless-snapshotsCaitlin Bestler
 
vSphere APIs for performance monitoring
vSphere APIs for performance monitoringvSphere APIs for performance monitoring
vSphere APIs for performance monitoringAlan Renouf
 
Managing Performance in a Virtual Environment
Managing Performance in a Virtual EnvironmentManaging Performance in a Virtual Environment
Managing Performance in a Virtual EnvironmentSolarWinds
 
VMworld 2013: DRS: New Features, Best Practices and Future Directions
VMworld 2013: DRS: New Features, Best Practices and Future Directions VMworld 2013: DRS: New Features, Best Practices and Future Directions
VMworld 2013: DRS: New Features, Best Practices and Future Directions VMworld
 
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization Manager
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization ManagerVMworld 2011 Review: Preparing for vSphere 5 with Virtualization Manager
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization ManagerSolarWinds
 
The have no fear guide to virtualizing databases
The have no fear guide to virtualizing databasesThe have no fear guide to virtualizing databases
The have no fear guide to virtualizing databasesSolarWinds
 
Varrow madness 2013 virtualizing sql presentation
Varrow madness 2013 virtualizing sql presentationVarrow madness 2013 virtualizing sql presentation
Varrow madness 2013 virtualizing sql presentationpittmantony
 
Dynamic Memory Management Hyperv 2008 R2 S
Dynamic Memory Management Hyperv 2008 R2 SDynamic Memory Management Hyperv 2008 R2 S
Dynamic Memory Management Hyperv 2008 R2 SEduardo Castro
 
Dynamic Memory Management HyperV R2 SP1
Dynamic Memory Management HyperV R2 SP1Dynamic Memory Management HyperV R2 SP1
Dynamic Memory Management HyperV R2 SP1Eduardo Castro
 
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Andrew Miller
 
INF7827 DRS Best Practices
INF7827 DRS Best PracticesINF7827 DRS Best Practices
INF7827 DRS Best PracticesBrian Graf
 
Introduction to eNlight Cloud Computing Platform
Introduction to eNlight Cloud Computing PlatformIntroduction to eNlight Cloud Computing Platform
Introduction to eNlight Cloud Computing PlatformMilind Koyande
 
eNlight- Intelligent Cloud Computing Platform
eNlight- Intelligent Cloud Computing PlatformeNlight- Intelligent Cloud Computing Platform
eNlight- Intelligent Cloud Computing PlatformManisha Daulatani
 

Similar to Master VMware Performance and Capacity Management (20)

VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & MemoryVMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
 
Session 7362 Handout 427 0
Session 7362 Handout 427 0Session 7362 Handout 427 0
Session 7362 Handout 427 0
 
ESX performance problems 10 steps
ESX performance problems 10 stepsESX performance problems 10 steps
ESX performance problems 10 steps
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - Varrow
 
Cinder enhancements-for-replication-using-stateless-snapshots
Cinder enhancements-for-replication-using-stateless-snapshotsCinder enhancements-for-replication-using-stateless-snapshots
Cinder enhancements-for-replication-using-stateless-snapshots
 
vSphere APIs for performance monitoring
vSphere APIs for performance monitoringvSphere APIs for performance monitoring
vSphere APIs for performance monitoring
 
Managing Performance in a Virtual Environment
Managing Performance in a Virtual EnvironmentManaging Performance in a Virtual Environment
Managing Performance in a Virtual Environment
 
VDI Design Guide
VDI Design GuideVDI Design Guide
VDI Design Guide
 
VMworld 2013: DRS: New Features, Best Practices and Future Directions
VMworld 2013: DRS: New Features, Best Practices and Future Directions VMworld 2013: DRS: New Features, Best Practices and Future Directions
VMworld 2013: DRS: New Features, Best Practices and Future Directions
 
5 Things to Ask Your Virtualization Administrator
5 Things to Ask Your Virtualization Administrator5 Things to Ask Your Virtualization Administrator
5 Things to Ask Your Virtualization Administrator
 
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization Manager
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization ManagerVMworld 2011 Review: Preparing for vSphere 5 with Virtualization Manager
VMworld 2011 Review: Preparing for vSphere 5 with Virtualization Manager
 
Designing virtual infrastructure
Designing virtual infrastructureDesigning virtual infrastructure
Designing virtual infrastructure
 
The have no fear guide to virtualizing databases
The have no fear guide to virtualizing databasesThe have no fear guide to virtualizing databases
The have no fear guide to virtualizing databases
 
Varrow madness 2013 virtualizing sql presentation
Varrow madness 2013 virtualizing sql presentationVarrow madness 2013 virtualizing sql presentation
Varrow madness 2013 virtualizing sql presentation
 
Dynamic Memory Management Hyperv 2008 R2 S
Dynamic Memory Management Hyperv 2008 R2 SDynamic Memory Management Hyperv 2008 R2 S
Dynamic Memory Management Hyperv 2008 R2 S
 
Dynamic Memory Management HyperV R2 SP1
Dynamic Memory Management HyperV R2 SP1Dynamic Memory Management HyperV R2 SP1
Dynamic Memory Management HyperV R2 SP1
 
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
 
INF7827 DRS Best Practices
INF7827 DRS Best PracticesINF7827 DRS Best Practices
INF7827 DRS Best Practices
 
Introduction to eNlight Cloud Computing Platform
Introduction to eNlight Cloud Computing PlatformIntroduction to eNlight Cloud Computing Platform
Introduction to eNlight Cloud Computing Platform
 
eNlight- Intelligent Cloud Computing Platform
eNlight- Intelligent Cloud Computing PlatformeNlight- Intelligent Cloud Computing Platform
eNlight- Intelligent Cloud Computing Platform
 

Recently uploaded

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Master VMware Performance and Capacity Management

  • 1. SDDC Performance and Capacity Management
  • 2. About your speakers Iwan ‘e1’ Rahabok virtual-red-dot.info e1@vmware.com @e1_ang Linkedin.com/in/e1ang 9119-9226 Sunny Dua vxpresss.blogspot.com duas@vmware.com @sunny_dua Linkedin.com/in/duasunny
  • 3. One day in the life of VMware Admin… • A VM Owner complains to IaaS Team that her VM is slow. • Her application architect has verified that: – The VM CPU and RAM utilization is good. – The disk latency is good. – There is no network drop packets. – No change in the application settings – No recent patch to Windows What do you do? • A: Check ESXi utilization. If it’s low, tell her to doubt no more. • B: Buy her a nice lunch + flower. Ask her to forget about it  • C: Call your VMware TAM & MCS. That’s why you pay them right?  • D: Roll up your sleeve. You are born for this!
  • 4. What’s wrong with these statements? • Cluster CPU – CPU Ratio is high at 1:5 times on cluster “XYZ” – Rest all other cluster overcommit ratio looks good around 1:3 – Keep the over commitment ratio to 1:4. – CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry. – Rest other cluster CPU utilization is around 25%. This is good! • Cluster RAM – We recommend 1:2 overcommit ratio between physical RAM and virtual RAM. – Memory Usage on most of the cluster is high around 60% – Cluster “ABCD” is running peak at around 75%. CPU utilization should be less than 70% – If we see that Active Mem% is also high than we should add more RAM to cluster – % Active should not exceed 50-60% and Memory should be running at high state on each host
  • 5. Monitoring • There are 2 levels to monitor in VMware: – The VM. • VM is the most important as that’s all customers care. • They do not care about your infrastructure. It is a Service. IaaS. – The Infra. • Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore • ESXi + hardware • Storage & Fabric • Network • There are 4 areas to monitor • The 4 areas above impact one another
  • 6. 2 distinct layer SDDC VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM Performance: We check if it is being served well by the platform. Other VM is irrelevant from VM Owner point of view. 1 Capacity. We check if VM is right-sized. If too small, increase its configuration. If too big, right size it for better performance 2 Performance: We check if IaaS is serving everyone well. Make sure there is no contention for resource among all the VMs 1 Capacity: Check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware. 2 Configuration: Check for Compliance and Config Drift Availability: Get alert for hardware fault or software stop working3 Consumer Layer Provider Layer
  • 7. Performance How do you know your IaaS is performing fast? ESXi utilization a 10% means your ESXi is fast? ESXi utilization a 90% means your ESXi is fast? Storage is doing 10K IOPS? Network is processing 8 Gbps? What counter do you use as a proof to your customers (VM Owner)? Utilization? Performance is measured by how well your IaaS serves the VMs. Fast is relative to your customer. Use SLA as your defense line.
  • 9. Performance vs Capacity Performance Capacity Focus is on the VM. It does not apply to IaaS Focus is on the IaaS. VM Capacity Management is just right sizing Primary counter: Contention or Latency. Utilization is largely irrelevant. Primary counter: Contention or Latency Secondary counter: Utilization Does not take into account Availability SLA Takes into account Availability SLA Tier 1 is in fact Availabity-driven.
  • 10. © 2015 VMware Inc. All rights reserved. The Consumer Layer The “dining area” CONFIDENTIAL11
  • 11. How a VM gets its resource Provisioned Limit Reservation Entitlement 0 vCPU or 0 GB Contention Usage Demand This is the counter we need to measure 4 vCPU or 16 GB
  • 12. Dashboards • Detail monitoring of a single VM – When customer complains that his VM is slow. Can help desk value right away? • Large VMs Monitoring – Because they are actually hurting your IaaS business – This impacts both Performance and Capacity – VM Right Sizing • Excessive Usage – Excessive Usage by 1-2 VM can impact the overall IaaS performance. – VMs with excessive usage hurts the business, if we do not charge for Network and Disk IOPS
  • 13. Single VM Monitoring • A VM Owner complains that his VM is slow. – It was okay the day before – How does Help Desk quickly determine where the issue is? • How well does Infra serve the VM? – VM CPU Contention – VM RAM contention – VM Disk latency. For each virtual disk, not average. • Is VM undersized? – VM CPU Utilisation – VM RAM Consumed (not Usage) – VM RAM Usage – VM Disk IOPS
  • 14.
  • 15.
  • 16.
  • 17.
  • 19. How oversized are the Large VMs? • They cause performance issue – They impact others, and also themselves! – ESXi vmkernel scheduler has to find available cores for all the vCPU, even though they are idle. – Other VMs maybe migrated from core to core. The counter at esxtop tracks this migration. • Tends to have slower performance – ESXi may not have all the available vCPU for them. • Reduces consolidation ratio – You can pack more vCPU with smaller VM than with big VM. – Unless you have progressive pricing, you make more money with smaller VM as you sell more vCPU.
  • 20. Dashboard of Large VMs • Overall Picture – A line chart showing Max CPU Demand among all the Large VMs • If this is low, they are way oversubscribed. Remember, it only takes 1 VM to make this number high. • This number should be 80% most of the time, indicating right sizing. – A line chart showing Average CPU Demand • If this chart is below <25% all the time for entire month, then the large VMs are over sized. • Heat Map of Large VMs – Size by vCPU config. So it’s easy to see who the biggest among these large VMs. – Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation • To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green. • Top-N CPU Demand – Allows us to zoom into specific time to see the past • Line chart of a selected VM (automatically plotted)
  • 21. As expected, the Max of All VMs is low. We can go back in time and see over 3 months. As expected, they are mostly Black. This means they are over provisioned. This shows the Top 15 VM. You can change the period to any time. This is auto shown. We are showing CPU and RAM. You expect 70% range, not 20% like this example.
  • 22. VM Right Sizing • Focus on large VM – Every downsize is a battle. No one likes to give up that they are given. – It also requires downtime – Downsizing from 4 vCPU to 2 does not buy much nowadays with >10 core Xeon. • Focus on CPU, not RAM – RAM in general is plentiful as RAM is cheap nowadays – RAM is hard to measure, even with agents, as it’s application dependant
  • 23. MS Windows: Memory Management • Windows makes great and full use of RAM – It’s using it as cache. – Adding more physical RAM will result in more usage. • Virtual Memory is an integral part of Windows Memory Management – It is not a swap file. – A growing pagefile is an early warning. – Track that it’s below 2.0 with this formula: ≤ 2.0 Commit Limit Physical RAM
  • 24. MS Windows: Memory Management In Use Available Cached vRealize Ops 6.1 EP Agent: Memory Used vRealize Ops 6.1 EP Agent: Memory Available
  • 25. MS Windows: Memory Sizing Conservative Cost Effective
  • 26. Windows: Memory Management • Which VMs need to be upsized? – Get all those VMs whose commit limit ratio > 2.0 – List can be sorted by the one with the highest commit limit ratio ≤ 2.0 Commit Limit Physical RAM
  • 27. Windows RAM: Right Sizing • Use Commit Limit Ratio super metric to upsize VM • Conservative or Cost Effective? – Cost Effective: Used. – Conservative: Used + Cache • Example – Used + Cache: 90% – Value exceeding 90% means the VM needs more RAM Server Workload VDI Workload 1 apps Many apps Long live apps Many apps launched and closed. Varies Many files opened and closed No Internet browsing Internet browsing (movie!) Workload predictable Workload spiky and unpredictable Varies (UI-less) Flash, Java, JavaScript (UI heavy)
  • 28. Windows RAM: What counters to use? Cost Effective: Used Conservative: Used + Standby Cache Normal Priority + Standby Cache Reserve
  • 29. Windows RAM: Hypervisor vs In-Guest 30 A B DC
  • 32. VM Right Sizing • Do not reduce RAM without changing application – If the VM has no RAM shortage, reducing RAM will not speed up anything. – Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO – Reducing RAM beyond the ISV recommendation can result in unsupported configuration. – Reducing RAM requires manual reduction for apps that manage its own RAM (e.g. Java, SQL, Oracle). – It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder. – If there is a performance issue after you reduce both CPU and RAM…. • If a VM is not using the full RAM, ask the Appl Team if they can – To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
  • 33. Why VM Owner should right size • It takes longer to boot. – If a VM does not have reservation, vSphere will create a swap file the size of the configured RAM. • It takes longer to vMotion. • Risk of NUMA effect – The RAM or CPU maybe spread over a single socket. Due to NUMA architecture, the performance will not be as good. • It will experience higher co-stop and ready time. • It takes longer to snapshot, especially if memory snapshot is included. • The processes inside the Guest OS may experience ping-pong. • Lack of performance visibility at the individual vCPU or virtual core level. 34
  • 35. Any Excessive Utilization in our DC? • A VM consumes 5 resources: 1. vCPU 2. vRAM (GB) 3. Disk Space 4. Disk IOPS 5. Network (Mbps) • The first 3 you can bound and control • The last 2 you can, but normally you don’t do it. You should. • Need a dashboard to track excessive usage – Disk IOPS – Network throughput
  • 36. Dashboard for Excessive Utilisation • Excessive Storage consumption – Line Chart: • Max VM Disk IOPS among all VMs • Average VM Disk IOPS – Heat Map • Size by IOPS. Color by Latency • If you see a single big box, that means you have a VM dominating your storage IOPS. • Excessive Network consumption – Similar concept as above
  • 37. This tracks the IOPS from VM. From here we can tell is a distinct peak. It looks like it’s coming from 1 VM, as the average is far lower. This is a cluster of 500 VM, so even if 1 VM hits 13,200 IOPS, the average did not even pass 15 IOPS. Let’s zoom into the peak.
  • 38. Excessive Storage Dashboard The peak was 13,212 IOPS on 24 May, around 3:16 am. Let’s find out which VM.
  • 39. Excessive Storage Dashboard • We can list the Top VMs generating the IOPS on any given period. Bingo, it was VM 63ee that did that 13212 IOPS. Catcha!  The dashboards are great. But it does not tell you how the IOPS distribution among all the VMs. It also does not tell if the VMs are experiencing high latency. You need a Heat Map for this.
  • 40. At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they getting low latency or not.
  • 42. © 2015 VMware Inc. All rights reserved. And that’s it! You “passed” those dashboards, you’re done with the “dining area”!
  • 43. © 2015 VMware Inc. All rights reserved. The Provider Layer The “kitchen” CONFIDENTIAL44
  • 44. Performance Management • Overall Performance Monitoring – Is any of our customers experiencing bad performance? – CPU, RAM, Disk, Network • If yes, who are affected? – Different VM may get different impact. – VM 007 may get hit on CPU, while VM 747 may get hit on Storage.
  • 45. Performance SLA Monitoring • How do we prove that….not a single VM… in any service tier…. fails the SLA threshold we agree for that tier… in the past 1 month? • Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level. • If you oversubscribe, there is a risk of Contention. – For Tier 1, do not overcommit. – For Tier 2 and 3, do overcommit.
  • 46. Using Max and Average to determine how VMs are served If the Max is: • below what you think your customers can tolerate, then you are good. • Near the threshold, then your capacity is full. Do not add more VM. • Above the threshold, move a few VMs out, preferably the large ones.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51. This dashboard is good as summary. You stop here if there is no issue. But say there is an issue. Can you drill down?
  • 52. This is an example of how you can drill down to individual Cluster, and see any metrics you want. Notice the Max Contention and Average Contention both spike. The average is much lower than Max, indicating room for additional VM. The VM contention hit 4.44%. It’s still okay.
  • 53. Drill down: Cluster CPU Performance There is issue here. The Max value hit 31%. The Average is still good at 0.31%, so this means 90% of VMs are being served well.
  • 54. Storage Latency Monitoring: Details • The data you see at vCenter and vRealize Operations are average – The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keeps the data much longer, it is a 5-minute average. – 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average of 10,000 x 20 = 200,000 read or writes. • The data you see at vmkernel log is not. – It is per individual IO. More info here. – It is acceptable to have higher latency, but ensure it is not too high. Set your threshold at 250 ms for a start. • Additional info – Data at vmkernel excludes bottleneck at upper layer. For example, no disk queue in vDisk. As a result, we can conclude that the storage latency is not at VM level.
  • 55. Drill down: Cluster Storage Performance 56 We look at storage across the past 1+ month. We are seeing latency spike. There is an outlier at 6000 ms on a magnetic disk. This data is going back to 18 April. Let’s zoom into May 22 as there are recent spike there.
  • 56. Drill down: Cluster Storage Performance Zooming into May 17 – 23. We also exclude all the Magnetic Disk. Device ID naa.55* is SSD, while naa.5000* is magnetic. We are seeing latency at the SSD. There is 1 outlier.
  • 57. Drill down: Cluster Storage Performance We can also group the data by ESXi Host. We can also present the data in bar chart We can zoomed into much more granular time line, below 1 second!
  • 58. Which VMs are affected? • The previous slides give us info at Cluster level. – If there is no VM affected, it’s good. No need to analyse further. – If there are VMs affected, we want to know which ones. • We can address the above by listing the Top 30 VM – CPU Contention – RAM Contention – Disk Latency – Network drop packet (ensure it is 0) – Network latency (this needs NetFlow)
  • 59. These are the top 40 VMs which experienced the worst CPU Contention. These are the top 40 VMs which experienced the worst RAM Contention. These are the top 40 VMs which experienced the worst Disk Latency.
  • 60. © 2015 VMware Inc. All rights reserved. And that’s it! If Performance is ok, it’s time to review Capacity 61
  • 61. Capacity Management based on Business Policy http://virtual-red-dot.info/capacity-management-based-on-business-policy/
  • 62. Performance Policy 63 Group Discussion: What should your Performance Policy be?
  • 63. Availability Policy 64 Group Discussion: What should your Availability Policy be? IOPS (per VM) Latency (VM level) Automated DR (SRM) RPO RTO Tier 1 1000 <10 ms Yes 5 minutes 1 hour Tier 2 500 <20 ms Yes <2 hours <2 hours Tier 3 100 <30 ms Yes <8 hours <4 hours
  • 64. Capacity Management: Tier 1 5 line charts showing these in the past 3 months • Number of vCPU left in the cluster. • Number of vRAM left in the cluster. • Number of VM left in the cluster. • Maximum & Average storage latency experience by any VM in the cluster • “Usable” space left in the datastore cluster. 65 If the number is approaching low number (your threshold) for it’s time to increase supply (e.g. IOPS, Cluster)
  • 65. Capacity Management: Tier 2 or 3 5 line charts showing data in the past 3 months • The Maximum CPU Contention experience by any VM in the cluster. – This number has to be lower than the SLA we promise. • The Maximum RAM Contention experience by any VM in the cluster. – This number has to be lower than the SLA we promise. • The total number of VM left in the cluster. • The Maximum & Average storage latency experience by any VM in the cluster • The disk capacity left in the datastore cluster. 66
  • 66. Tier 2 or 3 67 In this example, if we use 10% as threashold, the cluster is full In this example, if we use 30ms as threashold, the cluster is full
  • 67. Capacity Management: Tier 2 or 3 • RAM has different pattern to CPU as it’s a form storage – SLA principel remains the same. If you exceed the SLA, that cluster is full. 68
  • 68. SLA vs Internal Threshold 69 SLA Tier 1 Tier 2 Tier 3 CPU Contention 1% 2% 13% RAM Contention 0% 10% 20% Disk Latency 10 ms 20 ms 30 ms SLA only applies to VM. VM owner does not care about underlying platform. The above is my personal opinion. You need to get your Customer agreed Internal (your own) Tier 1 Tier 2 Tier 3 CPU Contention 1% 3% 10% RAM Contention 0% 5% 10% Disk Latency 10 ms 15 ms 20 ms
  • 69. Key Takeaways Agree on a Performance SLA. Contention, not Utilization. Capacity is defined by Performance. CONFIDENTIAL 70
  • 70. More Details • The book provides details that we could not cover in half a day. • The book is not a product book. – It focuses on concept, which you can apply using any product. It does not have to be vRealize Ops 71

Editor's Notes

  1. More materials and steps by steps guides are available at our blogs.
  2. It’s a common story. This presentation was born many years ago with story like this. He think it’s your IaaS problem. How do we prove where the performance issue is? The point of this slide is you need a formal SLA. It is your defence mechanism.
  3. These are statements we often hear. Unfortunately, they are not correct, because they focus on the wrong component of IaaS.
  4. Network: Physical Firewall, Router, Load Balancer etc. Network needs to be monitored differently as it’s a form of interconnect. We monitor at DC wide level There are 4 areas to monitor, as we have covered in previous slide. Performance Capacity Availability Configuration (for compliance and security)
  5. There are 2 distinct layer in the virtual world. This differs to the physical world, where there are only 1 layer. Application team or VM Owner only cares about their VM. Infra team have to care about both, especially from proving that Infrastructure is not a bottleneck. By knowing what matters on each layer, we can better manage the virtual environment. VM layer: Usage or utilization: Individual core utilization is important as we must see distribution and if any particular core is max out. A typical ESXi has 16 cores so it’s possible that the overall utilization is not high but certain cores are maxed out. Storage Latency at the SDDC layer includes both kernel queue and device queue. These are KAVG and DAVG respectively. VM CPU counters: Used (ms) = Usage (%) = Usage in MHz (MHz) The counter Usage and utilization may differ due to power management and hyper-threading. [e1: which has has which??] System: amount of time spent on system processes on each vCPU in the VM. VMM knows if it’s ring 0 or not, privilege instruction or not. CPU Latency. I found this to be better than Ready. This is the Sum, not Average, or each vCPU latency. I know this as it is Sum for the Ready counter. The Ready (milisecond) for each vCPU adds up to the Ready (ms) in the vCPU. Co-Stop. I realise that Ready excludes Co-Stop. The Co-Stop figure at the CPU level is a Sum of the Co-Stop at each vCores. So no need to do a super metric. How much CPU does a single VM use? %USED = %RUN + %SYS - %OVRLP %RUN is work done by VM itself. %SYS is work done by system (vmkernel), but on VM’s behalf, like doing IO. So %USED can be >100% if the VM has heavy IO as IO is executed on a different pCore. vmkernel does not have visibility inside the Guest OS. Example, we cannot see CPU Run Queue in Windows. This requires agent inside. RAM. If Ballooning is 0, then there is no memory pressure at all. Ballooning is the first sign. It happens before compression and swapping. Active is for us to see how much RAM is actually active. "state" : the free memory state. Possible values are high, soft, hard and low. The memory "state" is "high", if the free memory is greater than or equal to 6% of "total" - "cos". If is "soft" at 4%, "hard" at 2%, and "low" at 1%. So, high implies that the machine memory is not under any pressure and low implies that the machine memory is under pressure. While the host's memory state is not used to determine whether memory should be reclaimed from VMs (that decision is made at the resource pool level), it can affect what mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning. Disk Abort indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some windows OS’s. Also, resets issued by a guest OS on its virtual SCSI adapter will be translated to aborts of all the commands outstanding on that virtual SCSI adapter. Good source: http://communities.vmware.com/community/vmtn/server/performance?view=documents http://searchsystemschannel.techtarget.com/feature/Monitoring-vSphere-performance-with-vCenter-Server-performance-graphs http://cartershanklin.com/blog3/2010/12/19/esx-performance-counter-secrets-revealed/x`
  6. Compute is the Cluster due to DRS and HA. When a VM had a performance issue say 3 days ago, you do not know which ESXi it was running. So no point troubleshooting at that level. VSAN components means the SSD and Magnetic
  7. We have covered Performance. What about capacity? [click]
  8. You have 100 TB space left in your storage. Lots of space for new VMs. But latency is bad. VMs are getting 100 ms latency. Will adding VM make the situation worse? Yes! Can you add more VM? No. Every VM consumes IOPS, not just space. If you cannot add more VM, that storage is full. Time to add capacity (IOPS)
  9. From the viewpoint of the customers (VM Owner, App Team), there is no such thing as IaaS Performance. There is only IaaS Capacity. The capacity is capable of handling certain amount of workload demanded by the Consumers.
  10. What your customers care. They do not care about your infra 
  11. Unlike a physical server, a VM has dynamic resources given to it. It is not static. Contention, Demand and Entitlement are concepts that do not exist in the physical world. The area represents the resource. Say this VM is given 16 GB of vRAM. So the bottom line represents 0 GB, and the top line is 16 GB. The VM is configured with 16 GB, and we call this Provisioned. Unlike physical server, we can configure Limit and Reservation. We should minimise the use of both as it can make operation more complex. Entitlement means what the VM is entitled to. The hypervisors entitles the VM to a certain amount of RAM. A VM can only use what it is entitled to. It cannot use more than it is allowed. Demand is what the VM wants to use. In normal situation, where there is no contention among all the VMs, entitlement, usage and demand will be very close to each other. If there is a contention, then a VM can’t use what it ideally wants to use. ESXi also will not entitle it as there are competing VMs. So usage will be lower than Demand. Demand can go higher than Limit naturally. If this VM is limited to 2 GB, and it wants to use 14 GB, then Demand will exceed Limit. Contention is a _special_ counter that tracks all these competition for resources. It’s a counter that only exists in the virtual world. It happens when what the VM demands is more than it gets to use. So Contention = Demand – Usage. Usable does not apply to VM. Entitlement only applies to VM. Demand: Usage if there is no constraint. Difference between the amount of the resource that the object requires and the amount of the resource that the object gets. This metric measures the effect of conflict for a resource between consumers. Contention measures latency or the amount of time it takes to gain access to a resource. This measurement accounts for dropped packets for networking. Limit: Maximum amount that an object can obtain from a resource. The limit sets the upper bound for CPU, memory, or disk I/O resources that we allocate and configure in vCenter Server. Entitlement Amount of a resource that a VM can use based on the relative priority of that consumer set by the virtualization configuration. This metric is a function of provisioned, limit, reservation, shares, and demand. Shares involve proportional weighting that indicates the importance of a VM. The entitlement amount is less than or equal to the limit amount. The entitlement metric applies only to VMs. ---------------- Notes: thanks to Michael Beckmann for the correction on Contention.
  12. Can we display vSphere tags in vR Ops columns? If not, we can use the built-in Description field in vR Ops to map between customers and VM. We saw a performance degradation on a cluster of 500 VM when just 1-2 VM did an IOmeter. 10% of the population was affected when only 1-2 VM.
  13. We can also display extra info VM Disk throughput VM network throughput
  14. This is the list of VM. Help Desk can search. For each, we show the key properties, such as No of vCPU, RAM size, CPU Contention, RAM Contention, etc. The columns can be customised.
  15. This is the list of VM. Help Desk can search. For each, we show the key properties, such as No of vCPU, RAM size, CPU Contention, RAM Contention, etc. The columns can be customised.
  16. This line charts show the CPU Contention. Ideally, the line is <1%. This line charts show the CPU Utilization and RAM Utilization. If they are high, the VM is undersized and it could contribute to slowness. This line charts show the Disk Latency. We plot 6 lines: vDisk 1 Read Latency vDisk 1 Write Latency vDisk 2 Read Latency vDisk 2 Write Latency The VM overall Read Latency The VM overall Write Latency. Notice the number is very high here. .
  17. If you are contacting us regarding this dashboard, please cite “Single VM Monitoring” so we know which one you’re referring to
  18. The counter CPU Co-Stop tracks this. Example. If you have 2 socket, 20 cores, you can have either 1x 20-vCPU VM with high utilisation 40x 1 vCPU VM with high utilization
  19. Heat Map of Large VMs Size by vCPU config. So it’s easy to see who the biggest among these large VMs. Color by CPU Workload. Both high and low are bad. You want to see around 50% CPU utilisation, not 20% nor 80%. To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green. If you see mostly black, what does it mean? Top-N CPU Demand Allows us to zoom into specific time to see the past A user can click a VM, and the details are plotted right away. Line chart of a selected VM (automatically plotted) CPU Workload. Expect this to spike to 90% a few times. RAM Workload. Expect this to spike to 70% a few times.
  20. Just like ESXi. This is why Memory Consumed is higher in ESXi with more RAM. Windows 2008 and 7 x64 default to Physical RAM Size Windows 2012 may choose smaller. Value depends on Windows version? I only have 1 Win2012, so needs to validate properly. Do we enable SuperFetch or not? If we do, it will use more RAM. The guidelines for SSD is to turn it off.  Do we enable PageFile.sys or not?  If we have enough RAM, why do we need it? Especially on the Server workload, since we know exactly the application that run on that VM.​ That impacts the % Committed Memory metrics Client Many applications. Sizing becomes difficult as you don’t know what apps Apps are opened and closed. Binaries are loaded and cleared from RAM. Usage is spiky. Web Sites can chew up RAM. Watch how Chrome uses RAM. Be careful of Intranet apps, especially home grown. Video. Users watch video. Server Normally 1 application. Sizing can be apps specific. The server does not use browser and serve Internet while you’re not watching  VDI: Windows 7 x64 Ensure PageFile.sys is set as Windows managed. Conservative: Total RAM – Free RAM Add RAM when Free RAM <5% Cost Optimized: Total RAM – (Free RAM + Standby RAM) Add RAM when (Free RAM + Standby RAM) < 5%
  21. This applies to the following Windows: 7 x64, 2008, 2012. Complication. In Use does not include Modified Windows has 3 caches, with 2 being shown by Resource Monitor: Modified and Standby Available includes Standby. Windows takes advantage of physical RAM. It maximizes the use. So measuring the Free RAM alone is misleading. Check the standby RAM also. So track Available RAM metric. Keep around 60-80%. Too low is not good in server workload? Check if pagefile is more than physical RAM. Total Commit = 2x physical RAM. If windows is under RAM pressure, it will increase this. They do not change often. Disabling this means losing visibility. ---------------------------------------------------- Not sure why Committed doesn't tally with Use in terms of pattern. One explanation is Standby. Compare committed vs Free. In vR Ops 6.1, the metric Free = Available. It should not be. Available = Free + Standby. PR has been filed, see Socialcast
  22. Do we include Standby or not? I see the value to be quite high, both on client workload and server workload.
  23. VM Name, VM RAM Config, VM Commit Limit ratio. Heatmap can show by color, so at a glance you can see the extend of the problem. Suitable for NOC
  24. I think the guide is something you can decide, based on your environment. You can create super metric to get the average & maximum of your environment. Know the usage Client vs Server
  25. The 3rd cache counter is normally 0. You may need to manually turn on the policy, if the metrics are not enabled by default
  26. Active RAM can be too low, or too high. I have not tested Linux. A: normal AD running. vCenter is reporting low utilisation, around 15-20% B: I installed the EP Agent. C: Patching. Mostly comparing, downloading. D: Patching. Mostly installing If possible, do not use hypervisor data. But this is better than nothing obviously, since in-guest data requires agent and access to Guest OS. So you can always start with just the VMs with large RAM. Focus on downsize. For upside, the VM owner can help. Memory used does not include Standby.
  27. Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO, making the storage situation worse. Reducing RAM requires manual reduction for apps that manage its own RAM (e.g. Java, SQL, Oracle). VMware vCenter, for example, has 3 JVM and each has its RAM documented in vSphere. It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder. Suggest you ask for 1 thing and drive that. If there is a performance issue after you reduce both CPU and RAM…. You have to bring both back up, even though it was caused by just one of them. To monitor paging from outside the Guest, put the pagefile into its own vmdk file. We can then use vRealize to analyse for pattern for anomaly If a VM is not using the full RAM, ask the Appl Team if they can since the RAM is already given to them. More RAM will result in reduced paging. This in turn will reduce IO load. To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
  28. Educate VM Owner on the benefits of right sizing Carrot a lot more effective than stick, especially for those with money. Everyone always want more performance. Address from this angle. It will experience higher co-stop and ready time. Even if not all vCPU is used by the application, the Guest OS will still demand all the vCPU be provided by the hypervisor. The processes inside the Guest OS may experience ping-pong. The Guest OS may not be aware of the NUMA nature of the physical motherboard, and think it has a uniform structure. It may move processes within its own CPUs, as it assumes it has no performance impact. If the vCPUs are spread into different NUMA node, example a 20 vCPU on a box with 2-socket and 20 cores, it can experience the ping-pong effect. Lack of performance visibility at the individual vCPU or virtual core level. Majority of the counter is at the VM level, which is an aggregate of all of its vCPU. It does not matter whether you use virtual socket or virtual core.
  29. The last 2 you can, but normally you don’t do it. You should. Application Team does not normally know how much IOPS or Network they need. Do you allow any VM to generate 100K IOPS? Do you allow any VM to saturate 1Gb link? If you are public cloud: - Net: Why not upsell? First 200 Mbps is free. Above that is chargeable. - Storage: Why not upsell? First 500 IOPS is free. Above that is chargeable. Or, create a premium category, where there is no limit with 1 flat fee.
  30. Max VM Disk IOPS among all VMs If this is consistently high, you have a VM doing excessive IOPS. It is VM, not VSAN doing rebalancing. We are tracking the source of IOPS = VM. What number do you cater for? My take is 1000 IOPS can do damage, as it’s 5 minutes average. That is actually 1000 x 5 x 60 = 300K IO performed in 5 minutes! It’s not normal to do 300K IOPS. Average VM Disk IOPS Expect this to be <100 IOPS. Remember, it’s 5 minutes sustained. Excessive Network consumption Line Chart Max Network Usage among all VMs If this number is >1 Gb most of the time, it’s high. Max is 2 Gb, as it’s full duplex. Average Expect this to be <100 Mbps. Remember, it’s 5 minutes sustained Heat Map Size by Network Usage. Color by Drop Packets Expect to see green. VMs should not be dropping packets.
  31. At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they getting good latency or not. You can change the threshold anytime. If you want a brand new storage from Finance, set the max to 1 ms  Heatmap can also be grouped (e.g. by cluster, host, folder).
  32. Performance is customer-facing. You cannot serve them well. Performance is measured in Contention. Performance is about Quality. Capacity is internal-facing. You need to buy more hardware. Capacity is measured in Utilization. Capacity is about Quantity.
  33. The answer comes in the form of a few lines chart. CPU, RAM, Disk Latency will have 1 chart each. Total 3 line charts. Each Service Tier will have its own set of the above 3 charts. This is because they have different SLA threshold. Network drop packet should be 0 at all times.
  34. These 2 line charts show the Max and Average Contention. In this example, the Max is low. The Max hit 0.05% only. Plenty of RAM. In this example, the Max is high. The Max hit 20% only. Move the largest VM (vCPU) out. Average is low, a sign of Large VM in existence. I’m using CPU and RAM as example here. Same concept applies to Disk (Latency) and Network
  35. This is the list of Clusters. For each, we show the key properties, such as No of Running VM, No of vCPU, Contention, etc. The columns can be customised.
  36. These 2 line charts show the Max and Average CPU Contention. From here, we can easily tell that this Cluster is full. It has high CPU contention.
  37. These 2 line charts show the Max and Average RAM Contention. As RAM is a form of storage, the number here will be lower than CPU Contention
  38. These 2 line charts show the Max and Average Disk Latency. It is very high. Ideally, add SSD here as it’s too high. [pause] But say there is an issue. Can you drill down? We can drill down to individual cluster, and do further analysis. Click on the icon in the “Select a cluster”. It will take you to the detail screen. [click to next slide]
  39. An average of 200,000 numbers will hide the peak. If you have 1000 reads or writes that experienced bad latency, but the remaining 199,000 operations returned fast, that poor 1000 operations will be hidden. Log Insight 2.5 does not yet show Datastore name. It shows the LUN or physical Disks. Need to manually correlate for now.
  40. In this example, we should check the latency at the Magnetic Disk on 22 - 23 May. However, due to time limit in this presentation, we will move on.
  41. I’m using 1.5x for Tier 2 because Intel Hyper-Threading gives 1.5x performance boost. I’m using 2x for Tier 3 because there are 2 threads per core. There is certainly performance penalty, but this is Tier 3 (lowest, cheapest tier). We need to differentiate between your highest tier and lowest tier, else costs go up. Latency is measured in 5 minute average. How do we ensure Tier 3 storage does not impact Tier 1 since they are on the same array (spindles, CPU, etc). Some storage array has shares internally. All numbers are measured in 5 minute average. If a spike only lasts for 1 minute, then it calms down for the remaining 4 minutes, it won’t show up.
  42. Should we also add IO Share. Need to see if it’s practical to do it for each of the 3 storage options.
  43. We can also include Average Storage Latency to see how far we are from peaking. It’s also good to see if there is storage unbalanced, which resulting is a spike. We can have a Top-N widget to see which VM hit that super high latency when the rest of farm is doing very well. Network should be 0 drop packet Take the lowest number of vCPU, vRAM and VM in cluster. This is because you’re buying per cluster for Tier 1
  44. Notice we can actually use the same line chart we use for Performance Monitoring. This is because Capacity Management is dependant on Performance Management. Take the lowest number of CPU Contention, RAM Contention and VM in cluster. This is because you’re buying per cluster (or host) in these tiers. The 3 numbers should balance in the long run, to optimize your cost. If not, either adjust your policy, your VM standard, or your ESXi specification. I’ll create the actual sample charts and post to my blog for example.
  45. How do we ensure Tier 3 storage does not impact Tier 1 since they are on the same array (spindles, CPU, etc). Some storage array has shares internally. All numbers are measured in 5 minute average. If a spike only lasts for 1 minute, then it calms down for the remaining 4 minutes, it won’t show up.