12 Sep 2016 update: See this http://virtual-red-dot.info/operationalize-sddc-program-2/ for details.
-------------
Based on the book http://virtual-red-dot.info/performance-and-capacity-management/
Master performance and capacity management of VMware SDDC
2. About your speakers
Iwan ‘e1’ Rahabok
virtual-red-dot.info
e1@vmware.com
@e1_ang
Linkedin.com/in/e1ang
9119-9226
Sunny Dua
vxpresss.blogspot.com
duas@vmware.com
@sunny_dua
Linkedin.com/in/duasunny
3. One day in the life of VMware Admin…
• A VM Owner complains to IaaS Team that her VM is slow.
• Her application architect has verified that:
– The VM CPU and RAM utilization is good.
– The disk latency is good.
– There are no dropped network packets.
– No change in the application settings
– No recent patch to Windows
What do you do?
• A: Check ESXi utilization. If it’s low, tell her to doubt no more.
• B: Buy her a nice lunch + flower. Ask her to forget about it
• C: Call your VMware TAM & MCS. That’s why you pay them, right?
• D: Roll up your sleeves. You are born for this!
4. What’s wrong with these statements?
• Cluster CPU
– CPU overcommit ratio is high at 1:5 on cluster “XYZ”
– All the other clusters’ overcommit ratios look good at around 1:3
– Keep the overcommit ratio to 1:4.
– CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry.
– All the other clusters’ CPU utilization is around 25%. This is good!
• Cluster RAM
– We recommend a 1:2 overcommit ratio between physical RAM and virtual RAM.
– Memory usage on most of the clusters is high, around 60%
– Cluster “ABCD” is peaking at around 75%. CPU utilization should be less than 70%
– If we see that Active Mem% is also high, then we should add more RAM to the cluster
– % Active should not exceed 50-60%, and memory should be running at the high state on each host
5. Monitoring
• There are 2 levels to monitor in VMware:
– The VM.
• VM is the most important, as that’s all customers care about.
• They do not care about your infrastructure. It is a Service. IaaS.
– The Infra.
• Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore
• ESXi + hardware
• Storage & Fabric
• Network
• There are 4 areas to monitor
• The 4 areas above impact one another
6. 2 distinct layers
• Consumer Layer (the VMs):
– Performance: we check if the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view.
– Capacity: we check if the VM is right-sized. If too small, increase its configuration. If too big, right-size it for better performance.
• Provider Layer (the SDDC):
– Performance: we check if the IaaS is serving everyone well. Make sure there is no contention for resources among all the VMs.
– Capacity: check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware.
– Configuration: check for compliance and config drift.
– Availability: get alerts for hardware faults or for software that stops working.
7. Performance
How do you know your IaaS is performing fast?
Does ESXi utilization at 10% mean your ESXi is fast?
Does ESXi utilization at 90% mean your ESXi is fast?
Storage is doing 10K IOPS?
Network is processing 8 Gbps?
What counter do you use as proof to your customers (VM Owners)?
Utilization?
Performance is measured by how well your IaaS serves the VMs.
Fast is relative to your customer. Use the SLA as your defense line.
9. Performance vs Capacity
Performance:
• Focus is on the VM. It does not apply to the IaaS.
• Primary counter: Contention or Latency. Utilization is largely irrelevant.
• Does not take into account the Availability SLA.
Capacity:
• Focus is on the IaaS. VM capacity management is just right-sizing.
• Primary counter: Contention or Latency. Secondary counter: Utilization.
• Takes into account the Availability SLA. Tier 1 is in fact availability-driven.
11. How a VM gets its resource
The diagram stacks these counters between 0 (0 vCPU or 0 GB) and the configured size (e.g. 4 vCPU or 16 GB): Provisioned, Limit, Reservation, Entitlement, Demand, Usage, Contention.
Contention is the counter we need to measure.
12. Dashboards
• Detail monitoring of a single VM
– When a customer complains that his VM is slow, can Help Desk add value right away?
• Large VMs Monitoring
– Because they are actually hurting your IaaS business
– This impacts both Performance and Capacity
– VM Right Sizing
• Excessive Usage
– Excessive Usage by 1-2 VM can impact the overall IaaS performance.
– VMs with excessive usage hurt the business, if we do not charge for Network and Disk IOPS
13. Single VM Monitoring
• A VM Owner complains that his VM is slow.
– It was okay the day before
– How does Help Desk quickly determine where the issue is?
• How well does Infra serve the VM?
– VM CPU Contention
– VM RAM contention
– VM Disk latency. For each virtual disk, not average.
• Is VM undersized?
– VM CPU Utilisation
– VM RAM Consumed (not Usage)
– VM RAM Usage
– VM Disk IOPS
19. How oversized are the Large VMs?
• They cause performance issues
– They impact others, and also themselves!
– The ESXi vmkernel scheduler has to find available cores for all the vCPUs, even though they are idle.
– Other VMs may be migrated from core to core. A counter in esxtop tracks this migration.
• They tend to have slower performance
– ESXi may not be able to schedule all their vCPUs at once.
• They reduce the consolidation ratio
– You can pack more vCPUs with smaller VMs than with big VMs.
– Unless you have progressive pricing, you make more money with smaller VMs as you sell more vCPUs.
20. Dashboard of Large VMs
• Overall Picture
– A line chart showing the Max CPU Demand among all the Large VMs
• If this is low, they are way oversized. Remember, it only takes 1 VM to make this number high.
• This number should be around 80% most of the time, indicating right sizing.
– A line chart showing the Average CPU Demand
• If this chart is below 25% all the time for the entire month, then the large VMs are oversized.
• Heat Map of Large VMs
– Size by vCPU config. So it’s easy to see who is the biggest among these large VMs.
– Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation
• To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
• Top-N CPU Demand
– Allows us to zoom into specific time to see the past
• Line chart of a selected VM (automatically plotted)
21. As expected, the Max of All VMs is low. We can go back in time and see over 3 months. As expected, they are mostly Black. This means they are overprovisioned.
This shows the Top 15 VMs. You can change the period to any time. This is shown automatically. We are showing CPU and RAM.
You expect the 70% range, not 20% like in this example.
22. VM Right Sizing
• Focus on large VM
– Every downsize is a battle. No one likes to give up what they are given.
– It also requires downtime.
– Downsizing from 4 vCPU to 2 does not buy much nowadays with >10-core Xeons.
• Focus on CPU, not RAM
– RAM is generally plentiful, as RAM is cheap nowadays.
– RAM is hard to measure, even with agents, as it’s application-dependent.
23. MS Windows: Memory Management
• Windows makes great and full use of RAM
– It’s using it as cache.
– Adding more physical RAM will result in more usage.
• Virtual Memory is an integral part of Windows Memory Management
– It is not a swap file.
– A growing pagefile is an early warning.
– Track that it stays below 2.0 with this formula:
Commit Limit / Physical RAM ≤ 2.0
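As an illustrative sketch (plain Python, not a vRealize Ops super metric; the function names are mine), the check looks like this:

```python
def commit_limit_ratio(commit_limit_gb: float, physical_ram_gb: float) -> float:
    """Ratio of the Windows Commit Limit to the VM's configured RAM.

    Commit Limit = physical RAM + total pagefile size, so a ratio
    above 2.0 means the pagefile has grown beyond the RAM size --
    an early warning that the VM is under memory pressure.
    """
    return commit_limit_gb / physical_ram_gb

def needs_more_ram(commit_limit_gb: float, physical_ram_gb: float,
                   threshold: float = 2.0) -> bool:
    """Flag a VM whose commit limit ratio exceeds the 2.0 guideline."""
    return commit_limit_ratio(commit_limit_gb, physical_ram_gb) > threshold
```

In practice you would express this as a super metric over the agent-collected counters; the sketch only shows the arithmetic.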
24. MS Windows: Memory Management
In Use
Available
Cached
vRealize Ops 6.1 EP Agent: Memory Used
vRealize Ops 6.1 EP Agent: Memory Available
26. Windows: Memory Management
• Which VMs need to be upsized?
– Get all those VMs whose commit limit ratio > 2.0
– List can be sorted by the one with the highest commit limit ratio
Commit Limit / Physical RAM ≤ 2.0
27. Windows RAM: Right Sizing
• Use Commit Limit Ratio super metric to upsize VM
• Conservative or Cost Effective?
– Cost Effective: Used.
– Conservative: Used + Cache
• Example
– Used + Cache: 90%
– Value exceeding 90% means
the VM needs more RAM
Server Workload vs VDI Workload:
• 1 app vs many apps
• Long-lived apps vs many apps launched and closed
• Varies vs many files opened and closed
• No Internet browsing vs Internet browsing (movies!)
• Predictable workload vs spiky and unpredictable workload
• Varies (UI-less) vs Flash, Java, JavaScript (UI-heavy)
28. Windows RAM: What counters to use?
Cost Effective: Used
Conservative: Used + Standby Cache Normal Priority + Standby Cache Reserve
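An illustrative sketch of the two views (Python; the counter names are paraphrased from the slide, and the 90% upsize trigger comes from the example on slide 27):

```python
def needs_upsize(counters: dict, configured_gb: float,
                 conservative: bool = True, threshold: float = 0.90) -> bool:
    """Decide whether a Windows VM needs more RAM.

    Cost-effective view: only 'In Use' memory counts.
    Conservative view: also add the two standby cache lists named on
    the slide (Normal Priority and Reserve).
    Upsize when the chosen demand exceeds ~90% of configured RAM.
    """
    demand_gb = counters["in_use"]
    if conservative:
        demand_gb += (counters["standby_cache_normal_priority"]
                      + counters["standby_cache_reserve"])
    return demand_gb / configured_gb > threshold
```

The conservative view will recommend upsizing sooner, since Windows keeps standby cache populated whenever it can.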
32. VM Right Sizing
• Do not reduce RAM without changing application
– If the VM has no RAM shortage, reducing RAM will not speed up anything.
– Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO
– Reducing RAM beyond the ISV recommendation can result in unsupported configuration.
– Reducing RAM requires manual reduction for apps that manage their own RAM (e.g. Java, SQL, Oracle).
– It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder.
– If there is a performance issue after you reduce both CPU and RAM….
• If a VM is not using the full RAM, ask the App Team if they can use it, since the RAM is already given to them.
– To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
33. Why VM Owner should right size
• It takes longer to boot.
– If a VM does not have reservation, vSphere will create a swap file the size of the configured RAM.
• It takes longer to vMotion.
• Risk of NUMA effect
– The RAM or CPU may be spread across more than one socket. Due to the NUMA architecture, the performance will not be as good.
• It will experience higher co-stop and ready time.
• It takes longer to snapshot, especially if memory snapshot is included.
• The processes inside the Guest OS may experience ping-pong.
• Lack of performance visibility at the individual vCPU or virtual core level.
35. Any Excessive Utilization in our DC?
• A VM consumes 5 resources:
1. vCPU
2. vRAM (GB)
3. Disk Space
4. Disk IOPS
5. Network (Mbps)
• The first 3 you can bound and control
• The last 2 you can, but normally you don’t do it. You should.
• Need a dashboard to track excessive usage
– Disk IOPS
– Network throughput
36. Dashboard for Excessive Utilisation
• Excessive Storage consumption
– Line Chart:
• Max VM Disk IOPS among all VMs
• Average VM Disk IOPS
– Heat Map
• Size by IOPS. Color by Latency
• If you see a single big box, that means you have a VM dominating your storage IOPS.
• Excessive Network consumption
– Similar concept as above
37. This tracks the IOPS from the VMs. From here we can tell there is a distinct peak. It looks like it’s coming from 1 VM, as the average is far lower. This is a cluster of 500 VMs, so even though 1 VM hit 13,200 IOPS, the average did not even pass 15 IOPS.
Let’s zoom into the peak.
39. Excessive Storage Dashboard
• We can list the Top VMs generating the IOPS on any given period.
Bingo, it was VM 63ee that did those 13,212 IOPS.
Gotcha!
The dashboards are great. But they do not tell you how the IOPS are distributed among all the VMs. They also do not tell you whether the VMs are experiencing high latency.
You need a Heat Map for this.
40. At a glance, we can tell the IOPS distribution among the VMs. We can also tell whether they are getting low latency or not.
44. Performance Management
• Overall Performance Monitoring
– Is any of our customers experiencing bad performance?
– CPU, RAM, Disk, Network
• If yes, who are affected?
– Different VM may get different impact.
– VM 007 may get hit on CPU, while VM 747 may get hit on Storage.
45. Performance SLA Monitoring
• How do we prove that… not a single VM… in any service tier… failed the SLA threshold we agreed for that tier… in the past 1 month?
• Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level.
• If you oversubscribe, there is a risk of Contention.
– For Tier 1, do not overcommit.
– For Tier 2 and 3, do overcommit.
46. Using Max and Average to determine how VMs are served
If the Max is:
• below what you think your customers can tolerate, then you are good.
• near the threshold, then your capacity is full. Do not add more VMs.
• above the threshold, move a few VMs out, preferably the large ones.
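A sketch of that decision rule (plain Python for illustration; the width of the "near the threshold" band is an assumed parameter, not from the slides):

```python
def cluster_verdict(max_contention_pct: float, threshold_pct: float,
                    band_pct: float = 1.0) -> str:
    """Classify a cluster from the Max contention line chart:
    well below threshold -> room to grow; near it -> full;
    above it -> move VMs out."""
    if max_contention_pct > threshold_pct:
        return "overloaded: move a few VMs out, preferably the large ones"
    if max_contention_pct > threshold_pct - band_pct:
        return "full: do not add more VMs"
    return "healthy: room for more VMs"
```

The same rule applies per tier, since each tier has its own threshold.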
51. This dashboard is good as summary. You stop here if there is no issue.
But say there is an issue. Can you drill down?
52. This is an example of how you can drill down to an individual Cluster and see any metrics you want.
Notice the Max Contention and Average Contention both spike. The Average is much lower than the Max, indicating room for additional VMs. The VM contention hit 4.44%. It’s still okay.
53. Drill down: Cluster CPU Performance
There is an issue here. The Max value hit 31%. The Average is still good at 0.31%, so this means 90% of VMs are being served well.
54. Storage Latency Monitoring: Details
• The data you see in vCenter and vRealize Operations are averages
– The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keep the data much longer, it is a 5-minute average.
– 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average over 10,000 x 20 = 200,000 reads or writes.
• The data you see in the vmkernel log is not.
– It is per individual IO. More info here.
– It is acceptable to have higher latency there, but ensure it is not too high. Set your threshold at 250 ms for a start.
• Additional info
– Data at the vmkernel level excludes bottlenecks at the upper layers. For example, it excludes the disk queue in the vDisk. As a result, we can conclude that the storage latency is not at the VM level.
55. Drill down: Cluster Storage Performance
56
We look at storage across the past 1+ months.
We are seeing latency spikes. There is an outlier at 6,000 ms on a magnetic disk.
This data goes back to 18 April. Let’s zoom into May 22 as there are recent spikes there.
56. Drill down: Cluster Storage Performance
Zooming into May 17 – 23.
We also exclude all the Magnetic Disk.
Device ID naa.55* is SSD, while naa.5000* is magnetic.
We are seeing latency at the SSD. There is 1 outlier.
57. Drill down: Cluster Storage Performance
We can also group the data by ESXi host.
We can also present the data in a bar chart.
We can zoom into a much more granular timeline, below 1 second!
58. Which VMs are affected?
• The previous slides give us info at Cluster level.
– If there is no VM affected, it’s good. No need to analyse further.
– If there are VMs affected, we want to know which ones.
• We can address the above by listing the Top 30 VM
– CPU Contention
– RAM Contention
– Disk Latency
– Network drop packet (ensure it is 0)
– Network latency (this needs NetFlow)
59. These are the top 40 VMs which
experienced the worst CPU
Contention.
These are the top 40 VMs which
experienced the worst RAM
Contention.
These are the top 40 VMs which
experienced the worst Disk
Latency.
63. Availability Policy
Group Discussion: What should your Availability Policy be?
• Tier 1: 1000 IOPS (per VM); Latency (VM level): <10 ms; Automated DR (SRM): Yes; RPO: 5 minutes; RTO: 1 hour
• Tier 2: 500 IOPS; <20 ms; Yes; RPO: <2 hours; RTO: <2 hours
• Tier 3: 100 IOPS; <30 ms; Yes; RPO: <8 hours; RTO: <4 hours
64. Capacity Management: Tier 1
5 line charts showing these in the past 3 months
• Number of vCPU left in the cluster.
• Number of vRAM left in the cluster.
• Number of VMs left in the cluster.
• Maximum & Average storage latency experienced by any VM in the cluster.
• “Usable” space left in the datastore cluster.
If any of these numbers approaches a low value (your threshold), it’s time to increase supply (e.g. IOPS, Cluster).
65. Capacity Management: Tier 2 or 3
5 line charts showing data in the past 3 months
• The Maximum CPU Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The Maximum RAM Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The total number of VMs left in the cluster.
• The Maximum & Average storage latency experienced by any VM in the cluster.
• The disk capacity left in the datastore cluster.
66. Tier 2 or 3
In this example, if we use 10% as the threshold, the cluster is full.
In this example, if we use 30 ms as the threshold, the cluster is full.
67. Capacity Management: Tier 2 or 3
• RAM has a different pattern to CPU, as it’s a form of storage
– The SLA principle remains the same. If you exceed the SLA, that cluster is full.
68. SLA vs Internal Threshold
SLA:
• CPU Contention: Tier 1: 1%, Tier 2: 2%, Tier 3: 13%
• RAM Contention: Tier 1: 0%, Tier 2: 10%, Tier 3: 20%
• Disk Latency: Tier 1: 10 ms, Tier 2: 20 ms, Tier 3: 30 ms
The SLA only applies to VMs.
The VM owner does not care about the underlying platform.
The above is my personal opinion. You need to get your customers to agree.
Internal (your own) threshold:
• CPU Contention: Tier 1: 1%, Tier 2: 3%, Tier 3: 10%
• RAM Contention: Tier 1: 0%, Tier 2: 5%, Tier 3: 10%
• Disk Latency: Tier 1: 10 ms, Tier 2: 15 ms, Tier 3: 20 ms
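To make the tables concrete, here is an illustrative check in Python (the threshold values are the example SLA figures above; a real implementation would be a vRealize Ops super metric or alert, not a script):

```python
# Example SLA thresholds from the tables above (the author's personal opinion).
SLA = {
    "Tier 1": {"cpu_contention": 1.0, "ram_contention": 0.0, "disk_latency_ms": 10},
    "Tier 2": {"cpu_contention": 2.0, "ram_contention": 10.0, "disk_latency_ms": 20},
    "Tier 3": {"cpu_contention": 13.0, "ram_contention": 20.0, "disk_latency_ms": 30},
}

def sla_breaches(tier: str, observed: dict) -> list:
    """Return the metrics where a VM's observed worst value breaks its tier's SLA."""
    return [metric for metric, limit in SLA[tier].items()
            if observed.get(metric, 0) > limit]
```

Run this over the worst (Max) value per VM over the reporting month, since the promise is that not a single VM fails the threshold.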
69. Key Takeaways
Agree on a Performance SLA.
Contention, not Utilization.
Capacity is defined by Performance.
70. More Details
• The book provides details that we could not cover
in half a day.
• The book is not a product book.
– It focuses on concept, which you can apply using
any product. It does not have to be vRealize Ops
More materials and steps by steps guides are available at our blogs.
It’s a common story. This presentation was born many years ago with stories like this.
He thinks it’s your IaaS problem.
How do we prove where the performance issue is?
The point of this slide is you need a formal SLA. It is your defence mechanism.
These are statements we often hear. Unfortunately, they are not correct, because they focus on the wrong component of IaaS.
Network:
Physical Firewall, Router, Load Balancer etc.
Network needs to be monitored differently, as it’s a form of interconnect. We monitor it at the DC-wide level.
There are 4 areas to monitor, as we have covered in previous slide.
Performance
Capacity
Availability
Configuration (for compliance and security)
There are 2 distinct layers in the virtual world. This differs from the physical world, where there is only 1 layer. The application team or VM Owner only cares about their VM. The Infra team has to care about both, especially for proving that the infrastructure is not the bottleneck.
By knowing what matters on each layer, we can better manage the virtual environment.
VM layer:
Usage or utilization: individual core utilization is important, as we must see the distribution and whether any particular core is maxed out. A typical ESXi has 16 cores, so it’s possible that the overall utilization is not high but certain cores are maxed out.
Storage Latency at the SDDC layer includes both kernel queue and device queue. These are KAVG and DAVG respectively.
VM CPU counters:
Used (ms) = Usage (%) = Usage in MHz (MHz)
The counters Usage and Utilization may differ due to power management and hyper-threading. [e1: which has which??]
System: amount of time spent on system processes on each vCPU in the VM. The VMM knows if it’s ring 0 or not, a privileged instruction or not.
CPU Latency. I found this to be better than Ready. This is the Sum, not the Average, of each vCPU’s latency. I know this as it is a Sum for the Ready counter: the Ready (millisecond) for each vCPU adds up to the Ready (ms) of the VM.
Co-Stop. I realise that Ready excludes Co-Stop. The Co-Stop figure at the VM level is a Sum of the Co-Stop at each vCPU, so there is no need for a super metric.
How much CPU does a single VM use?
%USED = %RUN + %SYS - %OVRLP
%RUN is work done by VM itself.
%SYS is work done by system (vmkernel), but on VM’s behalf, like doing IO.
So %USED can be >100% if the VM has heavy IO as IO is executed on a different pCore.
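The esxtop formula above can be expressed directly (illustrative Python, just to show the arithmetic):

```python
def pct_used(pct_run: float, pct_sys: float, pct_ovrlp: float) -> float:
    """esxtop: %USED = %RUN + %SYS - %OVRLP.

    %RUN is work done by the VM itself; %SYS is vmkernel work done on
    the VM's behalf (e.g. I/O), which may execute on a different
    physical core. That is why %USED can exceed 100% for an I/O-heavy VM.
    """
    return pct_run + pct_sys - pct_ovrlp
```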
vmkernel does not have visibility inside the Guest OS. Example, we cannot see CPU Run Queue in Windows. This requires agent inside.
RAM.
If Ballooning is 0, then there is no memory pressure at all. Ballooning is the first sign. It happens before compression and swapping.
Active is for us to see how much RAM is actually active.
"state": the free memory state. Possible values are high, soft, hard and low. The memory state is "high" if free memory is greater than or equal to 6% of ("total" - "cos"); it is "soft" at 4%, "hard" at 2%, and "low" at 1%. So high implies that the machine memory is not under any pressure, and low implies that it is under pressure.
While the host's memory state is not used to determine whether memory should be reclaimed from VMs (that decision is made at the resource pool level), it can affect what mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning.
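The state thresholds above, as a small illustrative function (Python; `cos` is the console OS memory from the formula, and the percentages are the ones quoted above):

```python
def host_memory_state(free_mem: float, total: float, cos: float = 0.0) -> str:
    """Classify the vmkernel free-memory state:
    free >= 6% of (total - cos) -> high, >= 4% -> soft,
    >= 2% -> hard, otherwise -> low."""
    pct = free_mem / (total - cos) * 100
    if pct >= 6:
        return "high"
    if pct >= 4:
        return "soft"
    if pct >= 2:
        return "hard"
    return "low"
```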
Disk
Aborts indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some Windows OSes. Also, resets issued by a guest OS on its virtual SCSI adapter will be translated to aborts of all the commands outstanding on that virtual SCSI adapter.
Good source:
http://communities.vmware.com/community/vmtn/server/performance?view=documents
http://searchsystemschannel.techtarget.com/feature/Monitoring-vSphere-performance-with-vCenter-Server-performance-graphs
http://cartershanklin.com/blog3/2010/12/19/esx-performance-counter-secrets-revealed/
Compute is managed at the Cluster level due to DRS and HA. When a VM had a performance issue, say 3 days ago, you do not know which ESXi it was running on. So there is no point troubleshooting at that level.
VSAN components means the SSDs and magnetic disks.
We have covered Performance. What about capacity? [click]
You have 100 TB space left in your storage. Lots of space for new VMs.
But latency is bad. VMs are getting 100 ms latency.
Will adding VM make the situation worse?
Yes!
Can you add more VM?
No. Every VM consumes IOPS, not just space.
If you cannot add more VM, that storage is full. Time to add capacity (IOPS)
From the viewpoint of the customers (VM Owner, App Team), there is no such thing as IaaS Performance. There is only IaaS Capacity. The capacity is capable of handling a certain amount of workload demanded by the Consumers.
This is what your customers care about. They do not care about your infra.
Unlike a physical server, a VM has dynamic resources given to it. It is not static. Contention, Demand and Entitlement are concepts that do not exist in the physical world.
The area represents the resource. Say this VM is given 16 GB of vRAM. So the bottom line represents 0 GB, and the top line is 16 GB. The VM is configured with 16 GB, and we call this Provisioned.
Unlike physical server, we can configure Limit and Reservation. We should minimise the use of both as it can make operation more complex.
Entitlement means what the VM is entitled to. The hypervisor entitles the VM to a certain amount of RAM. A VM can only use what it is entitled to. It cannot use more than it is allowed. Demand is what the VM wants to use. In a normal situation, where there is no contention among the VMs, entitlement, usage and demand will be very close to each other.
If there is a contention, then a VM can’t use what it ideally wants to use. ESXi also will not entitle it as there are competing VMs. So usage will be lower than Demand. Demand can go higher than Limit naturally. If this VM is limited to 2 GB, and it wants to use 14 GB, then Demand will exceed Limit.
Contention is a _special_ counter that tracks all these competition for resources. It’s a counter that only exists in the virtual world. It happens when what the VM demands is more than it gets to use.
So Contention = Demand – Usage.
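The relationships above can be sketched in a few lines (illustrative Python, not a vSphere formula; the function names are mine):

```python
def usage(demand: float, entitlement: float) -> float:
    """A VM can only use what it is entitled to."""
    return min(demand, entitlement)

def contention(demand: float, entitlement: float) -> float:
    """Contention = Demand - Usage: the slice of what the VM wants
    that it did not get to use. Zero when there is no competition."""
    return demand - usage(demand, entitlement)
```

For example, a VM that demands 14 GB but is entitled to only 10 GB uses 10 GB and shows 4 GB of contention.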
Usable does not apply to VM. Entitlement only applies to VM.
Demand: Usage if there is no constraint.
Difference between the amount of the resource that the object requires and the amount of the resource that the object gets.
This metric measures the effect of conflict for a resource between consumers. Contention measures latency or the amount of time it takes to gain access to a resource. This measurement accounts for dropped packets for networking.
Limit: Maximum amount that an object can obtain from a resource.
The limit sets the upper bound for CPU, memory, or disk I/O resources that we allocate and configure in vCenter Server.
Entitlement
Amount of a resource that a VM can use based on the relative priority of that consumer set by the virtualization configuration.
This metric is a function of provisioned, limit, reservation, shares, and demand. Shares involve proportional weighting that indicates the importance of a VM.
The entitlement amount is less than or equal to the limit amount.
The entitlement metric applies only to VMs.
----------------
Notes: thanks to Michael Beckmann for the correction on Contention.
Can we display vSphere tags in vR Ops columns? If not, we can use the built-in Description field in vR Ops to map between customers and VM.
We saw a performance degradation on a cluster of 500 VMs when just 1-2 VMs ran IOmeter. 10% of the population was affected by only 1-2 VMs.
We can also display extra info
VM Disk throughput
VM network throughput
This is the list of VM.
Help Desk can search. For each, we show the key properties, such as No of vCPU, RAM size, CPU Contention, RAM Contention, etc.
The columns can be customised.
These line charts show the CPU Contention. Ideally, the line is <1%.
These line charts show the CPU Utilization and RAM Utilization.
If they are high, the VM is undersized and it could contribute to slowness.
This line chart shows the Disk Latency.
We plot 6 lines:
vDisk 1 Read Latency
vDisk 1 Write Latency
vDisk 2 Read Latency
vDisk 2 Write Latency
The VM overall Read Latency
The VM overall Write Latency.
Notice the number is very high here.
If you are contacting us regarding this dashboard, please cite “Single VM Monitoring” so we know which one you’re referring to
The counter CPU Co-Stop tracks this.
Example. If you have 2 socket, 20 cores, you can have either
1x 20-vCPU VM with high utilisation
40x 1 vCPU VM with high utilization
Heat Map of Large VMs
Size by vCPU config. So it’s easy to see who is the biggest among these large VMs.
Color by CPU Workload. Both high and low are bad. You want to see around 50% CPU utilisation, not 20% nor 80%.
To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
If you see mostly black, what does it mean?
Top-N CPU Demand
Allows us to zoom into specific time to see the past
A user can click a VM, and the details are plotted right away.
Line chart of a selected VM (automatically plotted)
CPU Workload. Expect this to spike to 90% a few times.
RAM Workload. Expect this to spike to 70% a few times.
Just like ESXi. This is why Memory Consumed is higher on an ESXi with more RAM.
Windows 2008 and 7 x64 default to the physical RAM size.
Windows 2012 may choose smaller. Does the value depend on the Windows version? I only have 1 Win2012, so this needs to be validated properly.
Do we enable SuperFetch or not?
If we do, it will use more RAM. The guidelines for SSD is to turn it off.
Do we enable PageFile.sys or not?
If we have enough RAM, why do we need it? Especially on a server workload, since we know exactly which applications run on that VM.
That impacts the % Committed Memory metric.
Client
Many applications. Sizing becomes difficult as you don’t know what apps
Apps are opened and closed. Binaries are loaded and cleared from RAM. Usage is spiky.
Web Sites can chew up RAM. Watch how Chrome uses RAM.
Be careful of Intranet apps, especially home grown.
Video. Users watch video.
Server
Normally 1 application. Sizing can be apps specific.
The server does not use a browser or surf the Internet while you’re not watching.
VDI: Windows 7 x64
Ensure PageFile.sys is set as Windows managed.
Conservative: Total RAM – Free RAM
Add RAM when Free RAM <5%
Cost Optimized: Total RAM – (Free RAM + Standby RAM)
Add RAM when (Free RAM + Standby RAM) < 5%
This applies to the following Windows: 7 x64, 2008, 2012.
Complication.
In Use does not include Modified
Windows has 3 caches, with 2 being shown by Resource Monitor: Modified and Standby
Available includes Standby.
Windows takes advantage of physical RAM; it maximizes its use. So measuring Free RAM alone is misleading. Check the standby RAM also, i.e. track the Available RAM metric. Keep it around 60-80%. Too low is not good in a server workload?
Check if the pagefile is more than physical RAM. Total Commit = 2x physical RAM. If Windows is under RAM pressure, it will increase this. It does not change often. Disabling it means losing visibility.
Not sure why Committed doesn't tally with In Use in terms of pattern. One explanation is Standby. Compare Committed vs Free.
In vR Ops 6.1, the metric Free = Available. It should not be: Available = Free + Standby. A PR has been filed, see Socialcast.
Do we include Standby or not? I see the value to be quite high, on both client and server workloads.
VM Name, VM RAM Config, VM Commit Limit ratio.
A heatmap can show this by color, so at a glance you can see the extent of the problem. Suitable for a NOC.
I think the guide is something you can decide, based on your environment. You can create super metric to get the average & maximum of your environment.
Know the usage
Client vs Server
The 3rd cache counter is normally 0.
You may need to manually turn on the policy, if the metrics are not enabled by default
Active RAM can be too low, or too high.
I have not tested Linux.
A: normal AD running. vCenter is reporting low utilisation, around 15-20%
B: I installed the EP Agent.
C: Patching. Mostly comparing, downloading.
D: Patching. Mostly installing
If possible, do not use hypervisor data. But this is better than nothing, obviously, since in-guest data requires an agent and access to the Guest OS. So you can always start with just the VMs with large RAM. Focus on downsizing. For upsizing, the VM owner can help.
Memory used does not include Standby.
Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO, making the storage situation worse.
Reducing RAM requires manual reduction for apps that manage its own RAM (e.g. Java, SQL, Oracle).
VMware vCenter, for example, has 3 JVM and each has its RAM documented in vSphere.
It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder.
Suggest you ask for 1 thing and drive that.
If there is a performance issue after you reduce both CPU and RAM….
You have to bring both back up, even though it was caused by just one of them.
To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
We can then use vRealize to analyse for pattern for anomaly
If a VM is not using the full RAM, ask the App Team if they can use it, since the RAM is already given to them.
More RAM will result in reduced paging. This in turn will reduce IO load.
To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
Educate VM Owner on the benefits of right sizing
Carrot a lot more effective than stick, especially for those with money.
Everyone always want more performance. Address from this angle.
It will experience higher co-stop and ready time.
Even if not all vCPU is used by the application, the Guest OS will still demand all the vCPU be provided by the hypervisor.
The processes inside the Guest OS may experience ping-pong.
The Guest OS may not be aware of the NUMA nature of the physical motherboard, and think it has a uniform structure. It may move processes among its own CPUs, as it assumes there is no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20-vCPU VM on a box with 2 sockets and 20 cores, it can experience the ping-pong effect.
Lack of performance visibility at the individual vCPU or virtual core level.
Majority of the counter is at the VM level, which is an aggregate of all of its vCPU. It does not matter whether you use virtual socket or virtual core.
The last 2 you can, but normally you don’t do it. You should.
Application Team does not normally know how much IOPS or Network they need.
Do you allow any VM to generate 100K IOPS?
Do you allow any VM to saturate 1Gb link?
If you are a public cloud provider:
- Net: Why not upsell? First 200 Mbps is free. Above that is chargeable.
- Storage: Why not upsell? First 500 IOPS is free. Above that is chargeable. Or, create a premium category, where there is no limit with 1 flat fee.
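A minimal chargeback sketch of the upsell idea above, using the free-tier thresholds from the text (200 Mbps, 500 IOPS); the per-unit rates and function names are made-up placeholders:

```python
# Sketch only: usage within the free tier costs nothing; usage above it
# is chargeable. Rates are illustrative assumptions, not from the text.
def network_charge(mbps, free=200, rate_per_mbps=0.10):
    return max(0, mbps - free) * rate_per_mbps

def storage_charge(iops, free=500, rate_per_iops=0.02):
    return max(0, iops - free) * rate_per_iops

print(network_charge(350))  # 150 Mbps above the free tier is billed
print(storage_charge(400))  # within the free tier: nothing to bill
```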
Max VM Disk IOPS among all VMs
If this is consistently high, you have a VM doing excessive IOPS. It is a VM, not VSAN rebalancing; we are tracking the source of the IOPS, which is the VM.
What number do you cater for? My take is 1000 IOPS can do damage, as it's a 5-minute average. That is actually 1000 x 5 x 60 = 300K IOs performed in 5 minutes! It's not normal to do 300K IOs in 5 minutes.
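The arithmetic behind that 300K figure, as a one-line sketch:

```python
# A 5-minute average of 1000 IOPS means 1000 IOs per second sustained
# for 300 seconds: that is the total IO count behind the rollup.
def total_ios(avg_iops, interval_seconds=300):
    return avg_iops * interval_seconds

print(total_ios(1000))  # 300000 IOs in one 5-minute window
```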
Average VM Disk IOPS
Expect this to be <100 IOPS. Remember, it’s 5 minutes sustained.
Excessive Network consumption
Line Chart
Max Network Usage among all VMs
If this number is >1 Gbps most of the time, it's high. The max is 2 Gbps, as the link is full duplex.
Average
Expect this to be <100 Mbps. Remember, it's a 5-minute sustained average.
Heat Map
Size by Network Usage. Color by Dropped Packets.
Expect to see green. VMs should not be dropping packets.
At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they are getting good latency or not.
You can change the threshold anytime. If you want a brand new storage array from Finance, set the max threshold to 1 ms.
Heatmap can also be grouped (e.g. by cluster, host, folder).
Performance is customer-facing: if it is poor, you cannot serve your customers well. Performance is measured in Contention. Performance is about Quality.
Capacity is internal-facing: if you run out, you need to buy more hardware. Capacity is measured in Utilization. Capacity is about Quantity.
The answer comes in the form of a few line charts.
CPU, RAM and Disk Latency will have 1 chart each: 3 line charts in total.
Each Service Tier will have its own set of the above 3 charts, because each tier has a different SLA threshold.
Network dropped packets should be 0 at all times.
These 2 line charts show the Max and Average Contention. In this example, the Max is low: it hit only 0.05%. Plenty of RAM.
In this example, the Max is high: it hit 20%. Move the largest VM (by vCPU) out. The Average is low, a sign that a Large VM exists.
I’m using CPU and RAM as examples here. The same concept applies to Disk (Latency) and Network.
This is the list of Clusters.
For each, we show the key properties, such as the number of running VMs, number of vCPUs, Contention, etc.
The columns can be customised.
These 2 line charts show the Max and Average CPU Contention. From here, we can easily tell that this Cluster is full. It has high CPU contention.
These 2 line charts show the Max and Average RAM Contention.
As RAM is a form of storage, the numbers here will be lower than CPU Contention.
These 2 line charts show the Max and Average Disk Latency. It is very high; ideally, add SSD here.
[pause]
But say there is an issue. Can you drill down?
We can drill down to an individual cluster and do further analysis. Click on the icon in “Select a cluster”; it will take you to the detail screen.
[click to next slide]
An average of 200,000 numbers will hide the peak. If 1,000 reads or writes experienced bad latency but the remaining 199,000 operations returned fast, those poor 1,000 operations will be hidden.
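A minimal sketch of how averaging hides those 1,000 slow operations (the 1 ms and 100 ms latency values are my illustrative assumptions, not from the text):

```python
# 199,000 fast operations at 1 ms each, plus 1,000 slow operations at
# 100 ms each. The mean looks healthy even though 1,000 operations
# suffered 100x worse latency.
fast = [1.0] * 199_000   # fast reads/writes, 1 ms each
slow = [100.0] * 1_000   # operations with bad latency
samples = fast + slow

avg = sum(samples) / len(samples)
print(f"average = {avg:.3f} ms, max = {max(samples)} ms")
# The average is ~1.5 ms; the 100 ms outliers are invisible in it.
```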
Log Insight 2.5 does not yet show the Datastore name. It shows the LUN or physical disks.
We need to manually correlate for now.
In this example, we should check the latency of the Magnetic Disk on 22-23 May. However, due to the time limit of this presentation, we will move on.
I’m using 1.5x for Tier 2 because Intel Hyper-Threading gives roughly a 1.5x performance boost.
I’m using 2x for Tier 3 because there are 2 threads per core. There is certainly a performance penalty, but this is Tier 3 (the lowest, cheapest tier). We need to differentiate between your highest tier and your lowest tier, else costs go up.
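A quick sketch of per-tier CPU capacity using the multipliers stated above (the tier names and ratios are from the slides; the host size is my assumption):

```python
# Tier 1 = 1 vCPU per physical core, Tier 2 = 1.5x (Hyper-Threading
# boost), Tier 3 = 2x (both threads per core sold). Host size below is
# an illustrative assumption.
TIER_RATIO = {"Tier 1": 1.0, "Tier 2": 1.5, "Tier 3": 2.0}

def vcpu_capacity(physical_cores, tier):
    """vCPUs you can sell on this host at the tier's overcommit ratio."""
    return int(physical_cores * TIER_RATIO[tier])

cores = 2 * 10  # e.g. a 2-socket, 10-cores-per-socket host
for tier in TIER_RATIO:
    print(tier, vcpu_capacity(cores, tier))
```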
Latency is measured in 5 minute average.
How do we ensure Tier 3 storage does not impact Tier 1, since they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally.
All numbers are measured in 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it won't show up.
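A sketch of how a 1-minute spike disappears into the 5-minute average (the 20 ms spike and 1 ms baseline are illustrative assumptions):

```python
# One minute at 20 ms latency, four minutes at 1 ms, averaged over the
# 5-minute collection interval.
spike_minutes, calm_minutes = 1, 4
spike_ms, calm_ms = 20.0, 1.0

avg = (spike_minutes * spike_ms + calm_minutes * calm_ms) / 5
print(f"5-minute average = {avg} ms")
# (20 + 4) / 5 = 4.8 ms: the 20 ms spike barely registers in the rollup.
```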
Should we also add IO Share? Need to see if it's practical to do it for each of the 3 storage options.
We can also include Average Storage Latency to see how far we are from peaking. It's also good to see if there is a storage imbalance, which results in a spike. We can have a Top-N widget to see which VM hit that super-high latency when the rest of the farm is doing very well.
Network should have 0 dropped packets.
Take the lowest of the 3 numbers (vCPU, vRAM, and VM count) in the cluster. This is because you're buying per cluster for Tier 1.
Notice we can actually use the same line charts we use for Performance Monitoring. This is because Capacity Management is dependent on Performance Management.
Take the lowest of the 3 numbers (CPU Contention, RAM Contention, and VM count) in the cluster, because you're buying per cluster (or host) in these tiers. The 3 numbers should balance in the long run to optimize your cost. If not, adjust either your policy, your VM standard, or your ESXi specification.
I'll create the actual sample charts and post them to my blog as examples.