12 Sep 2016 update: See this http://virtual-red-dot.info/operationalize-sddc-program-2/ for details.
-------------
Based on the book http://virtual-red-dot.info/performance-and-capacity-management/
Master performance and capacity management of VMware SDDC
2. About your speakers
Iwan ‘e1’ Rahabok
virtual-red-dot.info
e1@vmware.com
@e1_ang
Linkedin.com/in/e1ang
9119-9226
Sunny Dua
vxpresss.blogspot.com
duas@vmware.com
@sunny_dua
Linkedin.com/in/duasunny
3. One day in the life of VMware Admin…
• A VM Owner complains to IaaS Team that her VM is slow.
• Her application architect has verified that:
– The VM CPU and RAM utilization is good.
– The disk latency is good.
– There are no dropped network packets.
– No change in the application settings
– No recent patch to Windows
What do you do?
• A: Check ESXi utilization. If it’s low, tell her to doubt no more.
• B: Buy her a nice lunch + flower. Ask her to forget about it
• C: Call your VMware TAM & MCS. That’s why you pay them, right?
• D: Roll up your sleeves. You are born for this!
4. What’s wrong with these statements?
• Cluster CPU
– CPU overcommit ratio is high at 1:5 on cluster “XYZ”
– All the other clusters’ overcommit ratios look good at around 1:3
– Keep the overcommit ratio to 1:4.
– CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry.
– All the other clusters’ CPU utilization is around 25%. This is good!
• Cluster RAM
– We recommend a 1:2 overcommit ratio between physical RAM and virtual RAM.
– Memory usage on most of the clusters is high, around 60%
– Cluster “ABCD” is peaking at around 75%. CPU utilization should be less than 70%
– If we see that Active Mem% is also high, then we should add more RAM to the cluster
– % Active should not exceed 50-60%, and memory should be running at the high state on each host
5. Monitoring
• There are 2 levels to monitor in VMware:
– The VM.
• VM is the most important, as that’s all customers care about.
• They do not care about your infrastructure. It is a Service. IaaS.
– The Infra.
• Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore
• ESXi + hardware
• Storage & Fabric
• Network
• There are 4 areas to monitor
• The 4 areas above impact one another
6. 2 distinct layers
• Consumer Layer (the VMs):
– Performance: we check if the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view.
– Capacity: we check if the VM is right-sized. If too small, increase its configuration. If too big, right-size it for better performance.
• Provider Layer (the SDDC):
– Performance: we check if the IaaS is serving everyone well. Make sure there is no contention for resources among all the VMs.
– Capacity: check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware.
– Configuration: check for compliance and config drift.
– Availability: get alerts for hardware faults or for software that stops working.
7. Performance
How do you know your IaaS is performing fast?
Does ESXi utilization at 10% mean your ESXi is fast?
Does ESXi utilization at 90% mean your ESXi is fast?
Storage is doing 10K IOPS?
Network is processing 8 Gbps?
What counter do you use as proof to your customers (VM Owners)?
Utilization?
Performance is measured by how well your IaaS serves the VMs.
Fast is relative to your customer. Use the SLA as your defense line.
9. Performance vs Capacity
Performance:
• Focus is on the VM. It does not apply to the IaaS.
• Primary counter: Contention or Latency. Utilization is largely irrelevant.
• Does not take into account the Availability SLA.
Capacity:
• Focus is on the IaaS. VM capacity management is just right-sizing.
• Primary counter: Contention or Latency. Secondary counter: Utilization.
• Takes into account the Availability SLA. Tier 1 is in fact availability-driven.
11. How a VM gets its resource
The diagram stacks these counters between 0 (0 vCPU or 0 GB) and the configured size (e.g. 4 vCPU or 16 GB): Provisioned, Limit, Reservation, Entitlement, Demand, Usage, Contention.
Contention is the counter we need to measure.
12. Dashboards
• Detail monitoring of a single VM
– When a customer complains that his VM is slow, can Help Desk add value right away?
• Large VMs Monitoring
– Because they are actually hurting your IaaS business
– This impacts both Performance and Capacity
– VM Right Sizing
• Excessive Usage
– Excessive Usage by 1-2 VM can impact the overall IaaS performance.
– VMs with excessive usage hurt the business, if we do not charge for Network and Disk IOPS
13. Single VM Monitoring
• A VM Owner complains that his VM is slow.
– It was okay the day before
– How does Help Desk quickly determine where the issue is?
• How well does Infra serve the VM?
– VM CPU Contention
– VM RAM contention
– VM Disk latency. For each virtual disk, not average.
• Is VM undersized?
– VM CPU Utilisation
– VM RAM Consumed (not Usage)
– VM RAM Usage
– VM Disk IOPS
19. How oversized are the Large VMs?
• They cause performance issues
– They impact others, and also themselves!
– The ESXi vmkernel scheduler has to find available cores for all the vCPUs, even though they are idle.
– Other VMs may be migrated from core to core. A counter in esxtop tracks this migration.
• They tend to have slower performance
– ESXi may not be able to schedule all their vCPUs at once.
• They reduce the consolidation ratio
– You can pack more vCPUs with smaller VMs than with big VMs.
– Unless you have progressive pricing, you make more money with smaller VMs as you sell more vCPUs.
20. Dashboard of Large VMs
• Overall Picture
– A line chart showing the Max CPU Demand among all the Large VMs
• If this is low, they are way oversized. Remember, it only takes 1 VM to make this number high.
• This number should be around 80% most of the time, indicating right sizing.
– A line chart showing the Average CPU Demand
• If this chart is below 25% all the time for the entire month, then the large VMs are oversized.
• Heat Map of Large VMs
– Size by vCPU config. So it’s easy to see who is the biggest among these large VMs.
– Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation
• To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
• Top-N CPU Demand
– Allows us to zoom into specific time to see the past
• Line chart of a selected VM (automatically plotted)
21. As expected, the Max of All VMs is low. We can go back in time and see over 3 months. As expected, they are mostly Black. This means they are overprovisioned.
This shows the Top 15 VMs. You can change the period to any time. This is shown automatically. We are showing CPU and RAM.
You expect the 70% range, not 20% like in this example.
22. VM Right Sizing
• Focus on large VM
– Every downsize is a battle. No one likes to give up what they are given.
– It also requires downtime.
– Downsizing from 4 vCPU to 2 does not buy much nowadays with >10-core Xeons.
• Focus on CPU, not RAM
– RAM is generally plentiful, as RAM is cheap nowadays.
– RAM is hard to measure, even with agents, as it’s application-dependent.
23. MS Windows: Memory Management
• Windows makes great and full use of RAM
– It’s using it as cache.
– Adding more physical RAM will result in more usage.
• Virtual Memory is an integral part of Windows Memory Management
– It is not a swap file.
– A growing pagefile is an early warning.
– Track that it stays below 2.0 with this formula:
Commit Limit / Physical RAM ≤ 2.0
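As an illustrative sketch (plain Python, not a vRealize Ops super metric; the function names are mine), the check looks like this:

```python
def commit_limit_ratio(commit_limit_gb: float, physical_ram_gb: float) -> float:
    """Ratio of the Windows Commit Limit to the VM's configured RAM.

    Commit Limit = physical RAM + total pagefile size, so a ratio
    above 2.0 means the pagefile has grown beyond the RAM size --
    an early warning that the VM is under memory pressure.
    """
    return commit_limit_gb / physical_ram_gb

def needs_more_ram(commit_limit_gb: float, physical_ram_gb: float,
                   threshold: float = 2.0) -> bool:
    """Flag a VM whose commit limit ratio exceeds the 2.0 guideline."""
    return commit_limit_ratio(commit_limit_gb, physical_ram_gb) > threshold
```

In practice you would express this as a super metric over the agent-collected counters; the sketch only shows the arithmetic.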
24. MS Windows: Memory Management
In Use
Available
Cached
vRealize Ops 6.1 EP Agent: Memory Used
vRealize Ops 6.1 EP Agent: Memory Available
26. Windows: Memory Management
• Which VMs need to be upsized?
– Get all those VMs whose commit limit ratio > 2.0
– List can be sorted by the one with the highest commit limit ratio
Commit Limit / Physical RAM ≤ 2.0
27. Windows RAM: Right Sizing
• Use Commit Limit Ratio super metric to upsize VM
• Conservative or Cost Effective?
– Cost Effective: Used.
– Conservative: Used + Cache
• Example
– Used + Cache: 90%
– Value exceeding 90% means
the VM needs more RAM
Server Workload vs VDI Workload:
• 1 app vs many apps
• Long-lived apps vs many apps launched and closed
• Varies vs many files opened and closed
• No Internet browsing vs Internet browsing (movies!)
• Predictable workload vs spiky and unpredictable workload
• Varies (UI-less) vs Flash, Java, JavaScript (UI-heavy)
28. Windows RAM: What counters to use?
Cost Effective: Used
Conservative: Used + Standby Cache Normal Priority + Standby Cache Reserve
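An illustrative sketch of the two views (Python; the counter names are paraphrased from the slide, and the 90% upsize trigger comes from the example on slide 27):

```python
def needs_upsize(counters: dict, configured_gb: float,
                 conservative: bool = True, threshold: float = 0.90) -> bool:
    """Decide whether a Windows VM needs more RAM.

    Cost-effective view: only 'In Use' memory counts.
    Conservative view: also add the two standby cache lists named on
    the slide (Normal Priority and Reserve).
    Upsize when the chosen demand exceeds ~90% of configured RAM.
    """
    demand_gb = counters["in_use"]
    if conservative:
        demand_gb += (counters["standby_cache_normal_priority"]
                      + counters["standby_cache_reserve"])
    return demand_gb / configured_gb > threshold
```

The conservative view will recommend upsizing sooner, since Windows keeps standby cache populated whenever it can.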
32. VM Right Sizing
• Do not reduce RAM without changing application
– If the VM has no RAM shortage, reducing RAM will not speed up anything.
– Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO
– Reducing RAM beyond the ISV recommendation can result in unsupported configuration.
– Reducing RAM requires manual reduction for apps that manage their own RAM (e.g. Java, SQL, Oracle).
– It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder.
– If there is a performance issue after you reduce both CPU and RAM….
• If a VM is not using the full RAM, ask the App Team if they can use it, since the RAM is already given to them.
– To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
33. Why VM Owner should right size
• It takes longer to boot.
– If a VM does not have reservation, vSphere will create a swap file the size of the configured RAM.
• It takes longer to vMotion.
• Risk of NUMA effect
– The RAM or CPU may be spread across more than one socket. Due to the NUMA architecture, the performance will not be as good.
• It will experience higher co-stop and ready time.
• It takes longer to snapshot, especially if memory snapshot is included.
• The processes inside the Guest OS may experience ping-pong.
• Lack of performance visibility at the individual vCPU or virtual core level.
35. Any Excessive Utilization in our DC?
• A VM consumes 5 resources:
1. vCPU
2. vRAM (GB)
3. Disk Space
4. Disk IOPS
5. Network (Mbps)
• The first 3 you can bound and control
• The last 2 you can, but normally you don’t do it. You should.
• Need a dashboard to track excessive usage
– Disk IOPS
– Network throughput
36. Dashboard for Excessive Utilisation
• Excessive Storage consumption
– Line Chart:
• Max VM Disk IOPS among all VMs
• Average VM Disk IOPS
– Heat Map
• Size by IOPS. Color by Latency
• If you see a single big box, that means you have a VM dominating your storage IOPS.
• Excessive Network consumption
– Similar concept as above
37. This tracks the IOPS from the VMs. From here we can tell there is a distinct peak. It looks like it’s coming from 1 VM, as the average is far lower. This is a cluster of 500 VMs, so even though 1 VM hit 13,200 IOPS, the average did not even pass 15 IOPS.
Let’s zoom into the peak.
39. Excessive Storage Dashboard
• We can list the Top VMs generating the IOPS on any given period.
Bingo, it was VM 63ee that did those 13,212 IOPS.
Gotcha!
The dashboards are great. But they do not tell you how the IOPS are distributed among all the VMs. They also do not tell you whether the VMs are experiencing high latency.
You need a Heat Map for this.
40. At a glance, we can tell the IOPS distribution among the VMs. We can also tell whether they are getting low latency or not.
44. Performance Management
• Overall Performance Monitoring
– Is any of our customers experiencing bad performance?
– CPU, RAM, Disk, Network
• If yes, who are affected?
– Different VM may get different impact.
– VM 007 may get hit on CPU, while VM 747 may get hit on Storage.
45. Performance SLA Monitoring
• How do we prove that… not a single VM… in any service tier… failed the SLA threshold we agreed for that tier… in the past 1 month?
• Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level.
• If you oversubscribe, there is a risk of Contention.
– For Tier 1, do not overcommit.
– For Tier 2 and 3, do overcommit.
46. Using Max and Average to determine how VMs are served
If the Max is:
• below what you think your customers can tolerate, then you are good.
• near the threshold, then your capacity is full. Do not add more VMs.
• above the threshold, move a few VMs out, preferably the large ones.
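A sketch of that decision rule (plain Python for illustration; the width of the "near the threshold" band is an assumed parameter, not from the slides):

```python
def cluster_verdict(max_contention_pct: float, threshold_pct: float,
                    band_pct: float = 1.0) -> str:
    """Classify a cluster from the Max contention line chart:
    well below threshold -> room to grow; near it -> full;
    above it -> move VMs out."""
    if max_contention_pct > threshold_pct:
        return "overloaded: move a few VMs out, preferably the large ones"
    if max_contention_pct > threshold_pct - band_pct:
        return "full: do not add more VMs"
    return "healthy: room for more VMs"
```

The same rule applies per tier, since each tier has its own threshold.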
51. This dashboard is good as summary. You stop here if there is no issue.
But say there is an issue. Can you drill down?
52. This is an example of how you can drill down to an individual Cluster and see any metrics you want.
Notice the Max Contention and Average Contention both spike. The Average is much lower than the Max, indicating room for additional VMs. The VM contention hit 4.44%. It’s still okay.
53. Drill down: Cluster CPU Performance
There is an issue here. The Max value hit 31%. The Average is still good at 0.31%, so this means 90% of VMs are being served well.
54. Storage Latency Monitoring: Details
• The data you see in vCenter and vRealize Operations are averages
– The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keep the data much longer, it is a 5-minute average.
– 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average over 10,000 x 20 = 200,000 reads or writes.
• The data you see in the vmkernel log is not.
– It is per individual IO. More info here.
– It is acceptable to have higher latency there, but ensure it is not too high. Set your threshold at 250 ms for a start.
• Additional info
– Data at the vmkernel level excludes bottlenecks at the upper layers. For example, it excludes the disk queue in the vDisk. As a result, we can conclude that the storage latency is not at the VM level.
55. Drill down: Cluster Storage Performance
56
We look at storage across the past 1+ months.
We are seeing latency spikes. There is an outlier at 6,000 ms on a magnetic disk.
This data goes back to 18 April. Let’s zoom into May 22 as there are recent spikes there.
56. Drill down: Cluster Storage Performance
Zooming into May 17 – 23.
We also exclude all the Magnetic Disk.
Device ID naa.55* is SSD, while naa.5000* is magnetic.
We are seeing latency at the SSD. There is 1 outlier.
57. Drill down: Cluster Storage Performance
We can also group the data by ESXi host.
We can also present the data in a bar chart.
We can zoom into a much more granular timeline, below 1 second!
58. Which VMs are affected?
• The previous slides give us info at Cluster level.
– If there is no VM affected, it’s good. No need to analyse further.
– If there are VMs affected, we want to know which ones.
• We can address the above by listing the Top 30 VM
– CPU Contention
– RAM Contention
– Disk Latency
– Network drop packet (ensure it is 0)
– Network latency (this needs NetFlow)
59. These are the top 40 VMs which
experienced the worst CPU
Contention.
These are the top 40 VMs which
experienced the worst RAM
Contention.
These are the top 40 VMs which
experienced the worst Disk
Latency.
63. Availability Policy
Group Discussion: What should your Availability Policy be?
• Tier 1: 1000 IOPS (per VM); Latency (VM level): <10 ms; Automated DR (SRM): Yes; RPO: 5 minutes; RTO: 1 hour
• Tier 2: 500 IOPS; <20 ms; Yes; RPO: <2 hours; RTO: <2 hours
• Tier 3: 100 IOPS; <30 ms; Yes; RPO: <8 hours; RTO: <4 hours
64. Capacity Management: Tier 1
5 line charts showing these in the past 3 months
• Number of vCPU left in the cluster.
• Number of vRAM left in the cluster.
• Number of VMs left in the cluster.
• Maximum & Average storage latency experienced by any VM in the cluster.
• “Usable” space left in the datastore cluster.
If any of these numbers approaches a low value (your threshold), it’s time to increase supply (e.g. IOPS, Cluster).
65. Capacity Management: Tier 2 or 3
5 line charts showing data in the past 3 months
• The Maximum CPU Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The Maximum RAM Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The total number of VMs left in the cluster.
• The Maximum & Average storage latency experienced by any VM in the cluster.
• The disk capacity left in the datastore cluster.
66. Tier 2 or 3
In this example, if we use 10% as the threshold, the cluster is full.
In this example, if we use 30 ms as the threshold, the cluster is full.
67. Capacity Management: Tier 2 or 3
• RAM has a different pattern to CPU, as it’s a form of storage
– The SLA principle remains the same. If you exceed the SLA, that cluster is full.
68. SLA vs Internal Threshold
SLA:
• CPU Contention: Tier 1: 1%, Tier 2: 2%, Tier 3: 13%
• RAM Contention: Tier 1: 0%, Tier 2: 10%, Tier 3: 20%
• Disk Latency: Tier 1: 10 ms, Tier 2: 20 ms, Tier 3: 30 ms
The SLA only applies to VMs.
The VM owner does not care about the underlying platform.
The above is my personal opinion. You need to get your customers to agree.
Internal (your own) threshold:
• CPU Contention: Tier 1: 1%, Tier 2: 3%, Tier 3: 10%
• RAM Contention: Tier 1: 0%, Tier 2: 5%, Tier 3: 10%
• Disk Latency: Tier 1: 10 ms, Tier 2: 15 ms, Tier 3: 20 ms
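To make the tables concrete, here is an illustrative check in Python (the threshold values are the example SLA figures above; a real implementation would be a vRealize Ops super metric or alert, not a script):

```python
# Example SLA thresholds from the tables above (the author's personal opinion).
SLA = {
    "Tier 1": {"cpu_contention": 1.0, "ram_contention": 0.0, "disk_latency_ms": 10},
    "Tier 2": {"cpu_contention": 2.0, "ram_contention": 10.0, "disk_latency_ms": 20},
    "Tier 3": {"cpu_contention": 13.0, "ram_contention": 20.0, "disk_latency_ms": 30},
}

def sla_breaches(tier: str, observed: dict) -> list:
    """Return the metrics where a VM's observed worst value breaks its tier's SLA."""
    return [metric for metric, limit in SLA[tier].items()
            if observed.get(metric, 0) > limit]
```

Run this over the worst (Max) value per VM over the reporting month, since the promise is that not a single VM fails the threshold.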
69. Key Takeaways
Agree on a Performance SLA.
Contention, not Utilization.
Capacity is defined by Performance.
70. More Details
• The book provides details that we could not cover
in half a day.
• The book is not a product book.
– It focuses on concept, which you can apply using
any product. It does not have to be vRealize Ops
More materials and steps by steps guides are available at our blogs.
It’s a common story. This presentation was born many years ago with stories like this.
He thinks it’s your IaaS problem.
How do we prove where the performance issue is?
The point of this slide is you need a formal SLA. It is your defence mechanism.
These are statements we often hear. Unfortunately, they are not correct, because they focus on the wrong component of IaaS.
Network:
Physical Firewall, Router, Load Balancer etc.
Network needs to be monitored differently, as it’s a form of interconnect. We monitor it at the DC-wide level.
There are 4 areas to monitor, as we have covered in previous slide.
Performance
Capacity
Availability
Configuration (for compliance and security)
There are 2 distinct layers in the virtual world. This differs from the physical world, where there is only 1 layer. The application team or VM Owner only cares about their VM. The Infra team has to care about both, especially for proving that the infrastructure is not the bottleneck.
By knowing what matters on each layer, we can better manage the virtual environment.
VM layer:
Usage or utilization: individual core utilization is important, as we must see the distribution and whether any particular core is maxed out. A typical ESXi has 16 cores, so it’s possible that the overall utilization is not high but certain cores are maxed out.
Storage Latency at the SDDC layer includes both kernel queue and device queue. These are KAVG and DAVG respectively.
VM CPU counters:
Used (ms) = Usage (%) = Usage in MHz (MHz)
The counters Usage and Utilization may differ due to power management and hyper-threading. [e1: which has which??]
System: amount of time spent on system processes on each vCPU in the VM. The VMM knows if it’s ring 0 or not, a privileged instruction or not.
CPU Latency. I found this to be better than Ready. This is the Sum, not the Average, of each vCPU’s latency. I know this as it is a Sum for the Ready counter: the Ready (millisecond) for each vCPU adds up to the Ready (ms) of the VM.
Co-Stop. I realise that Ready excludes Co-Stop. The Co-Stop figure at the VM level is a Sum of the Co-Stop at each vCPU, so there is no need for a super metric.
How much CPU does a single VM use?
%USED = %RUN + %SYS - %OVRLP
%RUN is work done by VM itself.
%SYS is work done by system (vmkernel), but on VM’s behalf, like doing IO.
So %USED can be >100% if the VM has heavy IO as IO is executed on a different pCore.
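The esxtop formula above can be expressed directly (illustrative Python, just to show the arithmetic):

```python
def pct_used(pct_run: float, pct_sys: float, pct_ovrlp: float) -> float:
    """esxtop: %USED = %RUN + %SYS - %OVRLP.

    %RUN is work done by the VM itself; %SYS is vmkernel work done on
    the VM's behalf (e.g. I/O), which may execute on a different
    physical core. That is why %USED can exceed 100% for an I/O-heavy VM.
    """
    return pct_run + pct_sys - pct_ovrlp
```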
vmkernel does not have visibility inside the Guest OS. Example, we cannot see CPU Run Queue in Windows. This requires agent inside.
RAM.
If Ballooning is 0, then there is no memory pressure at all. Ballooning is the first sign. It happens before compression and swapping.
Active is for us to see how much RAM is actually active.
"state": the free memory state. Possible values are high, soft, hard and low. The memory state is "high" if free memory is greater than or equal to 6% of ("total" - "cos"); it is "soft" at 4%, "hard" at 2%, and "low" at 1%. So high implies that the machine memory is not under any pressure, and low implies that it is under pressure.
While the host's memory state is not used to determine whether memory should be reclaimed from VMs (that decision is made at the resource pool level), it can affect what mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning.
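The state thresholds above, as a small illustrative function (Python; `cos` is the console OS memory from the formula, and the percentages are the ones quoted above):

```python
def host_memory_state(free_mem: float, total: float, cos: float = 0.0) -> str:
    """Classify the vmkernel free-memory state:
    free >= 6% of (total - cos) -> high, >= 4% -> soft,
    >= 2% -> hard, otherwise -> low."""
    pct = free_mem / (total - cos) * 100
    if pct >= 6:
        return "high"
    if pct >= 4:
        return "soft"
    if pct >= 2:
        return "hard"
    return "low"
```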
Disk
Aborts indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some Windows OSes. Also, resets issued by a guest OS on its virtual SCSI adapter will be translated to aborts of all the commands outstanding on that virtual SCSI adapter.
Good source:
http://communities.vmware.com/community/vmtn/server/performance?view=documents
http://searchsystemschannel.techtarget.com/feature/Monitoring-vSphere-performance-with-vCenter-Server-performance-graphs
http://cartershanklin.com/blog3/2010/12/19/esx-performance-counter-secrets-revealed/
Compute is managed at the Cluster level due to DRS and HA. When a VM had a performance issue, say 3 days ago, you do not know which ESXi it was running on. So there is no point troubleshooting at that level.
VSAN components means the SSDs and magnetic disks.
We have covered Performance. What about capacity? [click]
You have 100 TB space left in your storage. Lots of space for new VMs.
But latency is bad. VMs are getting 100 ms latency.
Will adding VM make the situation worse?
Yes!
Can you add more VM?
No. Every VM consumes IOPS, not just space.
If you cannot add more VM, that storage is full. Time to add capacity (IOPS)
From the viewpoint of the customers (VM Owner, App Team), there is no such thing as IaaS Performance. There is only IaaS Capacity. The capacity is capable of handling a certain amount of workload demanded by the Consumers.
This is what your customers care about. They do not care about your infra.
Unlike a physical server, a VM has dynamic resources given to it. It is not static. Contention, Demand and Entitlement are concepts that do not exist in the physical world.
The area represents the resource. Say this VM is given 16 GB of vRAM. So the bottom line represents 0 GB, and the top line is 16 GB. The VM is configured with 16 GB, and we call this Provisioned.
Unlike physical server, we can configure Limit and Reservation. We should minimise the use of both as it can make operation more complex.
Entitlement means what the VM is entitled to. The hypervisor entitles the VM to a certain amount of RAM. A VM can only use what it is entitled to. It cannot use more than it is allowed. Demand is what the VM wants to use. In a normal situation, where there is no contention among the VMs, entitlement, usage and demand will be very close to each other.
If there is a contention, then a VM can’t use what it ideally wants to use. ESXi also will not entitle it as there are competing VMs. So usage will be lower than Demand. Demand can go higher than Limit naturally. If this VM is limited to 2 GB, and it wants to use 14 GB, then Demand will exceed Limit.
Contention is a _special_ counter that tracks all these competition for resources. It’s a counter that only exists in the virtual world. It happens when what the VM demands is more than it gets to use.
So Contention = Demand – Usage.
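The relationships above can be sketched in a few lines (illustrative Python, not a vSphere formula; the function names are mine):

```python
def usage(demand: float, entitlement: float) -> float:
    """A VM can only use what it is entitled to."""
    return min(demand, entitlement)

def contention(demand: float, entitlement: float) -> float:
    """Contention = Demand - Usage: the slice of what the VM wants
    that it did not get to use. Zero when there is no competition."""
    return demand - usage(demand, entitlement)
```

For example, a VM that demands 14 GB but is entitled to only 10 GB uses 10 GB and shows 4 GB of contention.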
Usable does not apply to VM. Entitlement only applies to VM.
Demand: Usage if there is no constraint.
Difference between the amount of the resource that the object requires and the amount of the resource that the object gets.
This metric measures the effect of conflict for a resource between consumers. Contention measures latency or the amount of time it takes to gain access to a resource. This measurement accounts for dropped packets for networking.
Limit: Maximum amount that an object can obtain from a resource.
The limit sets the upper bound for CPU, memory, or disk I/O resources that we allocate and configure in vCenter Server.
Entitlement
Amount of a resource that a VM can use based on the relative priority of that consumer set by the virtualization configuration.
This metric is a function of provisioned, limit, reservation, shares, and demand. Shares involve proportional weighting that indicates the importance of a VM.
The entitlement amount is less than or equal to the limit amount.
The entitlement metric applies only to VMs.
----------------
Notes: thanks to Michael Beckmann for the correction on Contention.
Can we display vSphere tags in vR Ops columns? If not, we can use the built-in Description field in vR Ops to map between customers and VM.
We saw a performance degradation on a cluster of 500 VMs when just 1-2 VMs ran IOmeter. 10% of the population was affected by only 1-2 VMs.
We can also display extra info
VM Disk throughput
VM network throughput
This is the list of VM.
Help Desk can search. For each, we show the key properties, such as No of vCPU, RAM size, CPU Contention, RAM Contention, etc.
The columns can be customised.
These line charts show the CPU Contention. Ideally, the line is <1%.
These line charts show the CPU Utilization and RAM Utilization.
If they are high, the VM is undersized and it could contribute to slowness.
This line chart shows the Disk Latency.
We plot 6 lines:
vDisk 1 Read Latency
vDisk 1 Write Latency
vDisk 2 Read Latency
vDisk 2 Write Latency
The VM overall Read Latency
The VM overall Write Latency.
Notice the number is very high here.
If you are contacting us regarding this dashboard, please cite “Single VM Monitoring” so we know which one you’re referring to
The counter CPU Co-Stop tracks this.
Example. If you have 2 socket, 20 cores, you can have either
1x 20-vCPU VM with high utilisation
40x 1 vCPU VM with high utilization
Heat Map of Large VMs
Size by vCPU config. So it’s easy to see who is the biggest among these large VMs.
Color by CPU Workload. Both high and low are bad. You want to see around 50% CPU utilisation, not 20% nor 80%.
To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
If you see mostly black, what does it mean?
Top-N CPU Demand
Allows us to zoom into specific time to see the past
A user can click a VM, and the details are plotted right away.
Line chart of a selected VM (automatically plotted)
CPU Workload. Expect this to spike to 90% a few times.
RAM Workload. Expect this to spike to 70% a few times.
Just like ESXi. This is why Memory Consumed is higher on an ESXi with more RAM.
Windows 2008 and 7 x64 default to the physical RAM size.
Windows 2012 may choose smaller. Does the value depend on the Windows version? I only have 1 Win2012, so this needs to be validated properly.
Do we enable SuperFetch or not?
If we do, it will use more RAM. The guidelines for SSD is to turn it off.
Do we enable PageFile.sys or not?
If we have enough RAM, why do we need it? Especially on a server workload, since we know exactly which applications run on that VM.
That impacts the % Committed Memory metric.
Client
Many applications. Sizing becomes difficult as you don’t know what apps
Apps are opened and closed. Binaries are loaded and cleared from RAM. Usage is spiky.
Web Sites can chew up RAM. Watch how Chrome uses RAM.
Be careful of Intranet apps, especially home grown.
Video. Users watch video.
Server
Normally 1 application. Sizing can be apps specific.
The server does not use a browser or surf the Internet while you’re not watching.
VDI: Windows 7 x64
Ensure PageFile.sys is set as Windows managed.
Conservative: Total RAM – Free RAM
Add RAM when Free RAM <5%
Cost Optimized: Total RAM – (Free RAM + Standby RAM)
Add RAM when (Free RAM + Standby RAM) < 5%
This applies to the following Windows: 7 x64, 2008, 2012.
Complication.
In Use does not include Modified
Windows has 3 caches, with 2 being shown by Resource Monitor: Modified and Standby
Available includes Standby.
Windows takes advantage of physical RAM; it maximizes its use. So measuring Free RAM alone is misleading. Check the standby RAM also, i.e. track the Available RAM metric. Keep it around 60-80%. Too low is not good in a server workload?
Check if the pagefile is more than physical RAM. Total Commit = 2x physical RAM. If Windows is under RAM pressure, it will increase this. It does not change often. Disabling it means losing visibility.
Not sure why Committed doesn't tally with In Use in terms of pattern. One explanation is Standby. Compare Committed vs Free.
In vR Ops 6.1, the metric Free = Available. It should not be: Available = Free + Standby. A PR has been filed, see Socialcast.
Do we include Standby or not? I see the value to be quite high, on both client and server workloads.
VM Name, VM RAM Config, VM Commit Limit ratio.
A heatmap can show this by color, so at a glance you can see the extent of the problem. Suitable for a NOC.
I think the guide is something you can decide, based on your environment. You can create super metric to get the average & maximum of your environment.
Know the usage
Client vs Server
The 3rd cache counter is normally 0.
You may need to manually turn on the policy, if the metrics are not enabled by default
Active RAM can be too low, or too high.
I have not tested Linux.
A: normal AD running. vCenter is reporting low utilisation, around 15-20%
B: I installed the EP Agent.
C: Patching. Mostly comparing, downloading.
D: Patching. Mostly installing
If possible, do not use hypervisor data. But this is better than nothing, obviously, since in-guest data requires an agent and access to the Guest OS. So you can always start with just the VMs with large RAM. Focus on downsizing. For upsizing, the VM owner can help.
Memory used does not include Standby.
Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO, making the storage situation worse.
Reducing RAM requires manual reduction for apps that manage its own RAM (e.g. Java, SQL, Oracle).
VMware vCenter, for example, has 3 JVM and each has its RAM documented in vSphere.
It's hard enough to ask apps team to reduce CPU, so asking for both will be even harder.
Suggest you ask for 1 thing and drive that.
If there is a performance issue after you reduce both CPU and RAM….
You have to bring both back up, even though it was caused by just one of them.
To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
We can then use vRealize to analyse for pattern for anomaly
If a VM is not using the full RAM, ask the App Team if they can use it, since the RAM is already given to them.
More RAM will result in reduced paging. This in turn will reduce IO load.
To monitor paging from outside the Guest, put the pagefile into its own vmdk file.
Educate VM Owner on the benefits of right sizing
Carrot a lot more effective than stick, especially for those with money.
Everyone always want more performance. Address from this angle.
It will experience higher co-stop and ready time.
Even if not all vCPU is used by the application, the Guest OS will still demand all the vCPU be provided by the hypervisor.
The processes inside the Guest OS may experience ping-pong.
The Guest OS may not be aware of the NUMA nature of the physical motherboard, and think it has a uniform structure. It may move processes among its own CPUs, as it assumes there is no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20-vCPU VM on a box with 2 sockets and 20 cores, it can experience the ping-pong effect.
Lack of performance visibility at the individual vCPU or virtual core level.
Majority of the counter is at the VM level, which is an aggregate of all of its vCPU. It does not matter whether you use virtual socket or virtual core.
The last 2 you can, but normally you don’t do it. You should.
Application Team does not normally know how much IOPS or Network they need.
Do you allow any VM to generate 100K IOPS?
Do you allow any VM to saturate 1Gb link?
If you are a public cloud provider:
- Net: Why not upsell? First 200 Mbps is free. Above that is chargeable.
- Storage: Why not upsell? First 500 IOPS is free. Above that is chargeable. Or, create a premium category, where there is no limit with 1 flat fee.
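A minimal chargeback sketch of the upsell idea above, using the free-tier thresholds from the text (200 Mbps, 500 IOPS); the per-unit rates and function names are made-up placeholders:

```python
# Sketch only: usage within the free tier costs nothing; usage above it
# is chargeable. Rates are illustrative assumptions, not from the text.
def network_charge(mbps, free=200, rate_per_mbps=0.10):
    return max(0, mbps - free) * rate_per_mbps

def storage_charge(iops, free=500, rate_per_iops=0.02):
    return max(0, iops - free) * rate_per_iops

print(network_charge(350))  # 150 Mbps above the free tier is billed
print(storage_charge(400))  # within the free tier: nothing to bill
```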
Max VM Disk IOPS among all VMs
If this is consistently high, you have a VM doing excessive IOPS. It is a VM, not VSAN rebalancing; we are tracking the source of the IOPS, which is the VM.
What number do you cater for? My take is 1000 IOPS can do damage, as it's a 5-minute average. That is actually 1000 x 5 x 60 = 300K IOs performed in 5 minutes! It's not normal to do 300K IOs in 5 minutes.
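The arithmetic behind that 300K figure, as a one-line sketch:

```python
# A 5-minute average of 1000 IOPS means 1000 IOs per second sustained
# for 300 seconds: that is the total IO count behind the rollup.
def total_ios(avg_iops, interval_seconds=300):
    return avg_iops * interval_seconds

print(total_ios(1000))  # 300000 IOs in one 5-minute window
```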
Average VM Disk IOPS
Expect this to be <100 IOPS. Remember, it’s 5 minutes sustained.
Excessive Network consumption
Line Chart
Max Network Usage among all VMs
If this number is >1 Gbps most of the time, it's high. The max is 2 Gbps, as the link is full duplex.
Average
Expect this to be <100 Mbps. Remember, it's a 5-minute sustained average.
Heat Map
Size by Network Usage. Color by Dropped Packets.
Expect to see green. VMs should not be dropping packets.
At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they are getting good latency or not.
You can change the threshold anytime. If you want a brand new storage array from Finance, set the max threshold to 1 ms.
Heatmap can also be grouped (e.g. by cluster, host, folder).
Performance is customer-facing: if it is poor, you cannot serve your customers well. Performance is measured in Contention. Performance is about Quality.
Capacity is internal-facing: if you run out, you need to buy more hardware. Capacity is measured in Utilization. Capacity is about Quantity.
The answer comes in the form of a few line charts.
CPU, RAM and Disk Latency will have 1 chart each: 3 line charts in total.
Each Service Tier will have its own set of the above 3 charts, because each tier has a different SLA threshold.
Network dropped packets should be 0 at all times.
These 2 line charts show the Max and Average Contention. In this example, the Max is low: it hit only 0.05%. Plenty of RAM.
In this example, the Max is high: it hit 20%. Move the largest VM (by vCPU) out. The Average is low, a sign that a Large VM exists.
I’m using CPU and RAM as examples here. The same concept applies to Disk (Latency) and Network.
This is the list of Clusters.
For each, we show the key properties, such as the number of running VMs, number of vCPUs, Contention, etc.
The columns can be customised.
These 2 line charts show the Max and Average CPU Contention. From here, we can easily tell that this Cluster is full. It has high CPU contention.
These 2 line charts show the Max and Average RAM Contention.
As RAM is a form of storage, the numbers here will be lower than CPU Contention.
These 2 line charts show the Max and Average Disk Latency. It is very high; ideally, add SSD here.
[pause]
But say there is an issue. Can you drill down?
We can drill down to an individual cluster and do further analysis. Click on the icon in “Select a cluster”; it will take you to the detail screen.
[click to next slide]
An average of 200,000 numbers will hide the peak. If 1,000 reads or writes experienced bad latency but the remaining 199,000 operations returned fast, those poor 1,000 operations will be hidden.
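A minimal sketch of how averaging hides those 1,000 slow operations (the 1 ms and 100 ms latency values are my illustrative assumptions, not from the text):

```python
# 199,000 fast operations at 1 ms each, plus 1,000 slow operations at
# 100 ms each. The mean looks healthy even though 1,000 operations
# suffered 100x worse latency.
fast = [1.0] * 199_000   # fast reads/writes, 1 ms each
slow = [100.0] * 1_000   # operations with bad latency
samples = fast + slow

avg = sum(samples) / len(samples)
print(f"average = {avg:.3f} ms, max = {max(samples)} ms")
# The average is ~1.5 ms; the 100 ms outliers are invisible in it.
```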
Log Insight 2.5 does not yet show the Datastore name. It shows the LUN or physical disks.
We need to manually correlate for now.
In this example, we should check the latency of the Magnetic Disk on 22-23 May. However, due to the time limit of this presentation, we will move on.
I’m using 1.5x for Tier 2 because Intel Hyper-Threading gives roughly a 1.5x performance boost.
I’m using 2x for Tier 3 because there are 2 threads per core. There is certainly a performance penalty, but this is Tier 3 (the lowest, cheapest tier). We need to differentiate between your highest tier and your lowest tier, else costs go up.
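A quick sketch of per-tier CPU capacity using the multipliers stated above (the tier names and ratios are from the slides; the host size is my assumption):

```python
# Tier 1 = 1 vCPU per physical core, Tier 2 = 1.5x (Hyper-Threading
# boost), Tier 3 = 2x (both threads per core sold). Host size below is
# an illustrative assumption.
TIER_RATIO = {"Tier 1": 1.0, "Tier 2": 1.5, "Tier 3": 2.0}

def vcpu_capacity(physical_cores, tier):
    """vCPUs you can sell on this host at the tier's overcommit ratio."""
    return int(physical_cores * TIER_RATIO[tier])

cores = 2 * 10  # e.g. a 2-socket, 10-cores-per-socket host
for tier in TIER_RATIO:
    print(tier, vcpu_capacity(cores, tier))
```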
Latency is measured in 5 minute average.
How do we ensure Tier 3 storage does not impact Tier 1, since they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally.
All numbers are measured in 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it won't show up.
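A sketch of how a 1-minute spike disappears into the 5-minute average (the 20 ms spike and 1 ms baseline are illustrative assumptions):

```python
# One minute at 20 ms latency, four minutes at 1 ms, averaged over the
# 5-minute collection interval.
spike_minutes, calm_minutes = 1, 4
spike_ms, calm_ms = 20.0, 1.0

avg = (spike_minutes * spike_ms + calm_minutes * calm_ms) / 5
print(f"5-minute average = {avg} ms")
# (20 + 4) / 5 = 4.8 ms: the 20 ms spike barely registers in the rollup.
```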
Should we also add IO Share? Need to see if it's practical to do it for each of the 3 storage options.
We can also include Average Storage Latency to see how far we are from peaking. It's also good to see if there is a storage imbalance, which results in a spike. We can have a Top-N widget to see which VM hit that super-high latency when the rest of the farm is doing very well.
Network should have 0 dropped packets.
Take the lowest of the 3 numbers (vCPU, vRAM, and VM count) in the cluster. This is because you're buying per cluster for Tier 1.
Notice we can actually use the same line charts we use for Performance Monitoring. This is because Capacity Management is dependent on Performance Management.
Take the lowest of the 3 numbers (CPU Contention, RAM Contention, and VM count) in the cluster, because you're buying per cluster (or host) in these tiers. The 3 numbers should balance in the long run to optimize your cost. If not, adjust either your policy, your VM standard, or your ESXi specification.
I'll create the actual sample charts and post them to my blog as examples.