AVOID RESOURCE CONTENTION
WITH ECO4CLOUD TECHNOLOGY
A PRIMARY TELCO USE CASE
Ph. +39 0984 494276 Piazza Vermicelli
87036 Rende (CS), Italy
www.eco4cloud.com
info@eco4cloud.com
Copyright © 2016 Eco4Cloud. All rights reserved. This product is protected by Italian and international copyright and intellectual property laws.
AVOID RESOURCE CONTENTION WITH E4C TECHNOLOGY
© 2016 Eco4Cloud and/or its affiliates. All rights reserved. This document is Eco4Cloud Public. Page 2
Overcommitment and Contention
1. Introduction
VMware® ESX™ is a hypervisor designed to efficiently manage hardware resources
including CPU, memory, storage and network among multiple concurrent virtual machines
[1]. ESX uses high-level resource management policies to compute a target memory
allocation for each virtual machine (VM), based on the current system load and parameter
settings for the virtual machine (shares, reservation, and limit [2]).
The computed target allocation is used to guide the dynamic adjustment of the memory
allocation for each virtual machine; when host memory is overcommitted, the target
allocations are achieved by invoking several lower-level mechanisms to reclaim memory
from virtual machines.
VMware ESX enables high memory and CPU consolidation ratios because it allows VMs to
run with total configured resources that exceed the amount available on the physical
machine. This is called overcommitment.
Overcommitment raises the consolidation ratio, increases operational efficiency, and lowers
the total cost of operating virtual machines. Left uncontrolled, however, it leads to
Resource Contention, a situation where several VMs compete for the same resources and
wait for the VMware scheduler to assign them.
This is the main cause of performance issues in virtualized environments and, as such, it is
the first key performance indicator to monitor in a virtual farm.
Contention is measured via CPU Ready Time and Memory Ballooning.
2. CPU Ready Time
CPU Ready Time is the period of time a VM waits in a ready-to-run state (meaning it has
work to do) before being scheduled by the hypervisor on one or more physical CPUs.
Therefore, CPU Ready Time is a metric showing how much time a virtual CPU spends ready
to be scheduled on a given physical host. In general terms, it is normal for VMs to have small
values of CPU Ready Time, even if the hypervisor is neither oversubscribed nor under heavy
activity; it is just the nature of shared scheduling in virtualization. For SMP VMs with multiple
vCPUs, the amount of ready time will generally be higher than for VMs with fewer vCPUs,
since the hypervisor needs more resources to schedule/co-schedule the VM when necessary
and each vCPU accumulates ready time separately; under normal operating conditions, this
value should remain under 5%. If ready time values are higher, virtual machines experience
poor performance.
Even in the best-designed environments there will be some CPU contention, and that is okay.
A %ready value below 5% is considered optimal. Once %ready climbs to between 5% and
10%, you need to pay attention when adding more virtual machines and/or CPU cores to the
virtual machines; we can call this the warning area. Once %ready climbs above 10%, you
reach the danger area, and poor performance will affect those virtual machines. A host can
show only 50% overall CPU utilization and still suffer strong CPU contention, which affects
the overall performance of your virtual machines.
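The three areas described above can be written down as a small classifier. This is a minimal sketch, not part of any VMware or Eco4Cloud tool: the names `ReadyZone`, `per_vcpu_ready` and `classify_ready` are hypothetical, and the helper assumes that a %ready value reported for a whole VM is the sum over its vCPUs, so it is divided by the vCPU count before being compared with the thresholds.

```python
from enum import Enum


class ReadyZone(Enum):
    OPTIMAL = "optimal"   # below 5%
    WARNING = "warning"   # between 5% and 10%
    DANGER = "danger"     # above 10%


def per_vcpu_ready(vm_ready_pct: float, num_vcpus: int) -> float:
    """Average %ready per vCPU, assuming vm_ready_pct is summed over all vCPUs."""
    if num_vcpus < 1:
        raise ValueError("a VM has at least one vCPU")
    return vm_ready_pct / num_vcpus


def classify_ready(ready_pct: float) -> ReadyZone:
    """Map a per-vCPU %ready value to the optimal, warning or danger area."""
    if ready_pct < 0:
        raise ValueError("%ready cannot be negative")
    if ready_pct < 5.0:
        return ReadyZone.OPTIMAL
    if ready_pct <= 10.0:
        return ReadyZone.WARNING
    return ReadyZone.DANGER


if __name__ == "__main__":
    # Hypothetical 4-vCPU VM reporting 24% ready summed over its vCPUs.
    pct = per_vcpu_ready(24.0, 4)
    print(pct, classify_ready(pct).value)
```

Treating exactly 5% and exactly 10% as part of the warning area is a choice made for this sketch; the text only fixes the ranges below 5%, between 5% and 10%, and above 10%.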
To summarize, CPU contention is one of the hidden issues you might find in your
environment unless you know where to look. The best tools for finding CPU contention in
your environment are ESXTOP from inside the service console of the host, RESXTOP from
the vMA appliance, or third-party tools such as Eco4Cloud. The best defense against CPU
contention is knowledge and comprehension of how the scheduler interacts with
multi-processor virtual machines; if you are using multi-processor systems, take that
potential issue into account.
While high values of CPU Ready Time can occur in a number of scenarios, two are the most
common. The first is host oversubscription, where too many vCPUs have been allocated
relative to the number of pCPUs. While ESX 5 supports a maximum of 25 vCPUs per
physical CPU, this is definitely a case where just because you can do it does not mean it is
good practice. As always, your mileage may vary based on your specific VM workloads, but
typically you begin to experience problems when a host is oversubscribed in the range of
2-2.5x for server workloads.
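The oversubscription check described above comes down to one division. The following is a minimal sketch with hypothetical function names and a made-up VM mix; the 2.0 and 2.5 defaults are the rule-of-thumb range quoted in the text, not a limit enforced by ESX.

```python
from typing import Iterable


def vcpu_to_pcpu_ratio(vcpus_per_vm: Iterable[int], physical_cpus: int) -> float:
    """Total vCPUs allocated on a host divided by its physical CPUs."""
    if physical_cpus < 1:
        raise ValueError("a host has at least one physical CPU")
    return sum(vcpus_per_vm) / physical_cpus


def oversubscription_level(ratio: float, low: float = 2.0, high: float = 2.5) -> str:
    """Compare a ratio with the 2-2.5x range where problems typically begin."""
    if ratio < low:
        return "below range"
    if ratio <= high:
        return "in range"
    return "above range"


if __name__ == "__main__":
    # Hypothetical host: 16 physical CPUs, six 4-vCPU VMs and six 2-vCPU VMs.
    vms = [4] * 6 + [2] * 6
    ratio = vcpu_to_pcpu_ratio(vms, 16)
    print(ratio, oversubscription_level(ratio))
```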
The second most common scenario where CPU Ready Time rises is when a larger SMP VM,
for example one with 4-8 vCPUs, runs on a host that has many smaller VMs with 1-2
vCPUs for application servers. Depending on the number of physical processors and on the
total number of vCPUs allocated on the host, a larger resource allocation for the VM results
in longer waiting time, because the hypervisor has to preempt the necessary physical CPUs
to schedule/co-schedule the workload. When this issue occurs, the software vendor typically
raises the vCPU requirements in response to the VM’s performance problems.
Unfortunately, if CPU Ready Time is the root cause, increasing the number of vCPUs does
not improve performance; on the contrary, it makes things worse.
3. Memory Ballooning
One of the main benefits introduced by virtualization is virtual machine isolation, which is
very useful for security and risk management. A drawback of virtual machine isolation is that
the guest operating system is not aware it is running inside a virtual machine and is not
aware of the state of other virtual machines on the same physical host. When the hypervisor
runs multiple virtual machines and the total amount of free host memory gets low, none of
the virtual machines will release guest physical memory, since the guest operating system
cannot detect the host’s memory shortage.
VMware ballooning is a memory reclamation technique used when an ESXi host is running
low on memory. This allows the physical host system to retrieve unused memory from
certain guest virtual machines (VMs) and share it with others [3].
Ballooning makes the guest operating system aware of the low memory status of the host. In
ESX, a balloon driver is loaded into the guest operating system as a pseudo-device driver. It
has no external interface to the guest operating system and communicates with the
hypervisor through a private channel. The balloon driver polls the hypervisor to obtain a
target balloon size. If the hypervisor needs to reclaim virtual machine memory, it sets a
proper target balloon size for the balloon driver, making it “inflate” by allocating guest
physical pages within the virtual machine.
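The polling mechanism just described can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class `BalloonDriver` and its methods are hypothetical, memory is counted in whole pages, and the private channel to the hypervisor is reduced to a plain method argument. It is not VMware's driver.

```python
class BalloonDriver:
    """Toy model of a guest balloon driver (illustrative only)."""

    def __init__(self, guest_pages: int) -> None:
        if guest_pages < 0:
            raise ValueError("guest_pages must be non-negative")
        self.guest_pages = guest_pages  # guest physical pages configured for the VM
        self.balloon_pages = 0          # pages currently pinned by the balloon

    def poll(self, target_pages: int) -> int:
        """Move the balloon to the target size received from the hypervisor.

        Returns the change in balloon size: a positive value means the balloon
        inflated (pages handed to the hypervisor), a negative value means it
        deflated (pages returned to the guest).
        """
        # The balloon can never be larger than the guest's own memory.
        target = max(0, min(target_pages, self.guest_pages))
        delta = target - self.balloon_pages
        self.balloon_pages = target
        return delta

    @property
    def usable_pages(self) -> int:
        """Pages the guest operating system can still use for its workload."""
        return self.guest_pages - self.balloon_pages


if __name__ == "__main__":
    driver = BalloonDriver(guest_pages=1000)
    print(driver.poll(300), driver.usable_pages)  # host under pressure: inflate
    print(driver.poll(100), driver.usable_pages)  # pressure eases: deflate
```

The model shows why ballooning is visible to the guest: every page held by the balloon is a page the guest can no longer use for its own processes.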
Ballooned memory is a symptom of RAM contention. If host free memory drops
towards the 4% threshold, the hypervisor starts to reclaim memory using ballooning.
VM memory ballooning can degrade performance.
Ballooning is a CPU-intensive process, and can eventually lead to memory swapping when
a balloon driver inflates to the point where the VM no longer has enough memory to run its
processes. This will slow down the VMs, depending upon the amount of memory to recoup
and/or the quality of the storage IOPS delivered to it.
4. Why these counters are important
CPU Ready Time and Ballooned Memory are symptoms of contention on CPU and RAM,
respectively. In IT literature, these metrics are widely recognized as the two most
significant indicators that virtual machines are experiencing poor performance.
The generally accepted industry best practice, based on VMware’s guidelines, is that CPU
Ready Time values up to 5% (per vCPU) fall within acceptable parameters.
Memory Ballooning is the first technique the hypervisor uses to reclaim memory. Absence or
very low levels of ballooning are a sign of good health for a virtual farm.
Eco4Cloud Workload Consolidation intelligence computes the ideal placement of VMs
among physical hosts, in order to decrease both CPU Ready Time and Memory Ballooning,
enabling higher performance and VM density.
5. Test Workflow
A field test was performed to compare the performance of VMware® Distributed
Resource Scheduler and the Eco4Cloud Workload Consolidation platform.
VMware® Distributed Resource Scheduler (DRS) aggregates computing capacity across a
set of servers into logical resource pools and intelligently allocates available resources
among the VMs, based on pre-defined rules.
VMware Distributed Power Management (DPM), within VMware DRS, automates power
management and minimizes power consumption across a given collection of servers in a
VMware DRS cluster.
The test was performed on a cluster in the production farm of a leading Italian telco;
the cluster contained 6 physical hosts running VMware vSphere version 5.0.
The hosts were HP ProLiant DL580 G5 servers, each equipped with 64 GB RAM and 4 CPU
sockets. Three hosts mounted 4x Intel® Xeon® CPU E7320 @ 2.13GHz while the other three
mounted 4x Intel® Xeon® CPU X7350 @ 2.93GHz. Each CPU had 4 physical cores, so the
total number of physical cores for each host was 16. The hosts ran about 94 virtual machines,
with between 1 and 8 virtual CPU cores assigned (most of them with 2 or 4 virtual cores) and
between 1 and 16 GB of assigned RAM (most of them with 2 or 4 GB RAM). The guest
operating systems were: 80% Microsoft Windows (various editions, 32 and 64 bit), 14% Red
Hat Enterprise Linux (5 and 6, 32 and 64 bit), and 6% Oracle Solaris 10 64 bit.
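For reference, the cluster capacity implied by this hardware description can be worked out as follows. This is a minimal sketch: the `Host` class is a hypothetical helper, and because both CPU models have the same core count, a single host shape is enough for counting cores and memory (clock speed is ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    sockets: int
    cores_per_socket: int
    ram_gb: int

    @property
    def cores(self) -> int:
        """Physical cores on this host."""
        return self.sockets * self.cores_per_socket


# Six hosts, each with 4 sockets of 4 cores and 64 GB RAM, as described above.
cluster = [Host(sockets=4, cores_per_socket=4, ram_gb=64) for _ in range(6)]

total_cores = sum(host.cores for host in cluster)
total_ram_gb = sum(host.ram_gb for host in cluster)

if __name__ == "__main__":
    print(f"{total_cores} physical cores, {total_ram_gb} GB RAM in the cluster")
```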
Average host CPU usage during the test was about 28%.
In order to collect valuable data, a set of tests using VMware DRS and E4C Workload
Consolidation was performed.
The overall test was set to run for 6 days, divided into two phases of equal length.
During the first phase (3 days), workload placement was managed by VMware® DRS in fully
automated mode and Eco4Cloud Workload Consolidation was disabled.
A second phase of a further 3 days followed: Eco4Cloud Workload
Consolidation was enabled and VMware® DRS was put in partially automated mode.
The two phases were comparable, because the production workload on the given cluster did
not change significantly.
6. Results
By the end of the test, it was clear that Eco4Cloud Workload Consolidation had increased
overall cluster performance: CPU Ready Time dropped by 23%, and Ballooned Memory was
completely removed, thanks to the intelligent workload placement strategy of Eco4Cloud
Workload Consolidation.
On the memory side, the result is clear: problem solved.
On the CPU side, the result positively affects performance; 23% is just an aggregate value.
Let’s see how CPU Ready Time decreases in the most important cases, where it exceeds
the warning and alert thresholds of 5% and 10%, respectively.
As you can see from the following exhibit, CPU Ready Time warnings decrease by 90.26%,
while CPU Ready Time alerts decrease by 42.86%.
In our evaluation scenario, this means:
- 514 fewer warnings/alerts each day, per cluster
- 3598 fewer warnings/alerts each week, per cluster
Given how much time it takes to manage a performance warning or an alert, evaluating how
much time you can save with an intelligent workload placement solution is simple math.
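That math can be sketched in a few lines. The daily figure of 514 comes from the evaluation scenario above; the handling time per warning or alert is not given in this document, so the 5 minutes used in the example is purely a hypothetical placeholder to be replaced with your own figure.

```python
DAYS_PER_WEEK = 7


def weekly_reduction(daily_reduction: int) -> int:
    """Warnings/alerts avoided per week, given the number avoided per day."""
    return daily_reduction * DAYS_PER_WEEK


def hours_saved(events_avoided: int, minutes_per_event: float) -> float:
    """Operator hours saved if each avoided event would have taken this long."""
    return events_avoided * minutes_per_event / 60.0


if __name__ == "__main__":
    weekly = weekly_reduction(514)  # 514 per day is the figure from the field test
    # ASSUMPTION: 5 minutes per warning/alert is a made-up handling time.
    print(weekly, round(hours_saved(weekly, minutes_per_event=5), 1))
```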
References
[1] Carl A. Waldspurger. “Memory Resource Management in VMware ESX Server”.
Proceedings of the Fifth Symposium on Operating Systems Design and Implementation
(OSDI), Boston, December 2002.
[2] VMware. vSphere Resource Management Guide.
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_upgrade_guide.pdf
[3] VMware. Understanding Memory Resource Management in VMware® ESX™ Server.
http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
For more information
E4C Workload Consolidation: http://www.eco4cloud.com/workload-consolidation
Eco4Cloud Workload Consolidation Product Overview
Eco4Cloud Workload Consolidation FAQ