Webinar: VMware vSphere Performance Management Challenges and Best Practices


With the majority of businesses using internal Cloud Services, whether Software as a Service (SaaS), Platform as a Service (PaaS) or Infrastructure as a Service (IaaS) in a VMware vSphere environment, this presentation gives an insight into how to manage the gathering Storm Clouds. After an introduction to VMware's Virtual Infrastructure 4 (vSphere) environment and Cloud Computing, we discuss how Capacity Management provides the means to spot potential Storm Clouds far in advance and, more specifically, how you can avoid them.
Delving deeper, we look at IaaS and how to identify potential capacity-on-demand issues. Discussion focuses on topics such as:
• identifying whether virtual machines are under- or over-provisioned
• the advantages/disadvantages of application sizing
• how to minimize SLA impact
• whether to scale the infrastructure out, up or in and, ultimately, how to get it right.

Typically, organizations have adopted a "silo mentality" whereby they ring-fence IT systems and don’t share resources through lack of trust and confidence. We look at the advantages virtualization brings in terms of flexibility, scalability and cost reduction (monetary and environmental), and at how we can protect our 'loved ones' through resource pools, shares, reservations and limits.

With all this in mind, join us to find out what information and processes we recommend you need to have and implement to avoid an Internal Storm and ensure that Brighter Outlook!

Published in: Technology
  • Can my application be virtualized? Applications can be categorized into three basic groups. Green: applications can be virtualized ‘out of the box’, with great performance and no tuning required. Yellow: applications have good performance but require some tuning. Red: applications exceed the capabilities of the virtual platform and should not be virtualized. The vast majority of applications are Green, which means that no performance tuning is required. Yellow applications are a smaller but still significant group, and there are very few Red applications (that is, applications that do not virtualize). Pay attention to CPU, memory, network bandwidth and storage bandwidth usage.
  • Selecting the Correct Operating System: When creating a virtual machine, you have to select the operating system type that you intend to install in that virtual machine. The operating system type determines the optimal monitor mode and the optimal devices to use, such as the SCSI controller and the network adapter. It also specifies the correct version of VMware Tools to install. Therefore, it is important that the correct operating system type is selected at virtual machine creation. It is recommended to use a 64-bit operating system only if you are running 64-bit applications, because 64-bit virtual machines require more memory overhead than 32-bit virtual machines. Consider contacting the application vendor for representative benchmarks of their 32-bit applications compared to their 64-bit applications to determine whether it is worth using the 64-bit version.
  • Guest Operating System Timekeeping: Virtual machines share their underlying hardware with the VMkernel. Other applications and other virtual machines might also be running on the same host. Thus, at the moment a virtual machine should generate a virtual timer interrupt, it might not actually be running. In fact, the virtual machine might not get a chance to run again until it has accumulated a backlog of many timer interrupts. In addition, even a running virtual machine can sometimes be late in delivering virtual timer interrupts. The virtual machine checks for pending virtual timer interrupts only at certain points, such as when the underlying hardware receives a physical timer interrupt. Because the operating system keeps time by counting interrupts (“ticks”), time as measured by the operating system falls behind real time whenever there is a timer interrupt backlog. A virtual machine handles this problem by keeping track of the current timer interrupt backlog and, if the backlog is too large, delivering ticks at a faster rate to catch up. Catching up is made more difficult by the fact that a new timer interrupt should not be generated until the operating system has fully handled the previous one. Otherwise, the operating system might fail to see the next interrupt as a separate event and miss counting it. This phenomenon is called a lost tick. You can mitigate these issues by using guest operating systems that use fewer ticks, such as Windows, which uses timer interrupt rates of 66Hz – 100Hz. When you install VMware Tools, enable its clock synchronization, which has the advantage that it is aware of the virtual machine’s built-in catch-up and interacts properly with it. If the guest operating system’s clock is behind real time by more than the known backlog that is in the process of being caught up, VMware Tools resets the clock and informs the virtual machine to stop catching up, resetting the backlog to zero. For Linux guests, VMware recommends using NTP instead of VMware Tools for time synchronization. Please check knowledge base article 1006427.
  • SMP Guidelines: In SMP guests, the operating system can migrate processes from one vCPU to another. This migration can incur a small CPU overhead. If the migration is very frequent, it might be helpful to pin guest threads or processes to specific vCPUs. (This is another reason not to configure virtual machines with more vCPUs than they need.) The VMkernel assigns interrupt request (IRQ) numbers to all PCI devices when the system initializes. When the number of available IRQs is limited due to hardware constraints, two or more PCI devices might be assigned the same IRQ number. In earlier versions of ESX, performance issues might occur when IRQ-shared devices were owned by both the Service Console and the VMkernel.
  • NUMA Server Considerations: If virtual machines are running on a NUMA server, ensure that each virtual machine fits in a NUMA node for best performance. The scheduler attempts to keep both the virtual machine and its memory located on the same node, thus maintaining good NUMA locality. However, there are two cases in which a virtual machine will have poor NUMA locality despite the efforts of the scheduler. (1) The virtual machine has more vCPUs than there are PCPUs in a node. When a virtual machine has more vCPUs than there are physical processor cores in a home node, the CPU scheduler does not attempt to use NUMA optimizations for that virtual machine. This means that the virtual machine’s memory is not migrated to be local to the processors on which its vCPUs are running, leading to poor NUMA locality. (2) The virtual machine’s memory size is greater than the memory per NUMA node. Each physical processor in a NUMA server is usually configured with an equal amount of local memory. For example, a four-socket NUMA server with 64GB of memory typically has 16GB locally on each processor. If a virtual machine’s memory size is greater than the amount of memory local to each processor, the CPU scheduler again does not attempt to use NUMA optimizations for that virtual machine, with the same result. vSphere 4.1 supports wide virtual machines: a virtual machine that exceeds the capacity of a NUMA node is split into multiple NUMA clients.
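The two conditions in the note above can be checked with a few lines of code. The following is a minimal sketch, not a VMware API: the `NumaNode` type and the `fits_in_numa_node` function are hypothetical names, and the figure of four cores per node is an assumed value chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    cores: int         # physical cores in one NUMA node (assumed value below)
    memory_gb: float   # memory local to that node


def fits_in_numa_node(vcpus: int, vm_memory_gb: float, node: NumaNode) -> bool:
    """Return True when the VM has no more vCPUs than the node has cores
    and no more memory than the node holds locally."""
    return vcpus <= node.cores and vm_memory_gb <= node.memory_gb


# Example from the note: four sockets and 64GB in total, so 16GB per node.
node = NumaNode(cores=4, memory_gb=64 / 4)

print(fits_in_numa_node(2, 8, node))    # small VM: fits in one node
print(fits_in_numa_node(8, 8, node))    # more vCPUs than cores in the node
print(fits_in_numa_node(2, 24, node))   # more memory than the node holds
```

A virtual machine for which this check fails is a candidate for right-sizing, or on vSphere 4.1 it would be handled as a wide virtual machine.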
  • Worlds: The role of the CPU scheduler is to assign execution contexts to processors in a way that meets system objectives such as throughput and utilization. On a conventional operating system, the execution context corresponds to a process or thread. On ESX/ESXi hosts, it corresponds to a world. A virtual machine is a group of worlds, with some being virtual CPUs (vCPUs) and other threads doing additional work. For example, a virtual machine includes a world that controls the mouse, keyboard and screen (MKS). The virtual machine also has a world for its virtual machine monitor (VMM). So a virtual machine with 4 vCPUs would consist of 6 worlds. There are also non-virtual machine worlds such as VMkernel worlds, which are used to perform various system tasks, e.g. idle, driver and vMotion worlds.
  • CPU Scheduling: One of the main tasks of the CPU scheduler is to choose which world is to be scheduled to a processor. If the chosen processor is already occupied, the scheduler must decide whether to pre-empt the currently running world on behalf of the chosen one. The scheduler checks physical CPU utilization every 20 milliseconds and migrates vCPUs as necessary. A world migrates from a busy processor to an idle processor. This migration can be initiated either by a physical CPU that becomes idle or by a world that becomes ready to be scheduled. When CPU resources are over-committed, ESX/ESXi implements a proportional-share-based algorithm. The host time-slices the physical CPUs across all virtual machines so that each one runs as if it had its specified number of virtual processors. It associates each world with a share of CPU resource. This is called entitlement and is calculated from user-provided resource specifications such as shares, reservations and limits. When making scheduling decisions, the ratio of the consumed CPU resource to the entitlement is used as the priority of the world.
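The priority rule at the end of the note above can be illustrated with a toy model. This is a simplified sketch, not the ESX/ESXi scheduler: the way `entitlement` is derived here (a share-proportional slice of capacity, clamped between reservation and limit) is an assumption made for illustration, and the `World` fields and function names are hypothetical. It also assumes that the world with the lowest consumed-to-entitlement ratio is the one favoured.

```python
from dataclasses import dataclass


@dataclass
class World:
    name: str
    shares: int
    reservation_mhz: float
    limit_mhz: float
    consumed_mhz: float


def entitlement(world: World, worlds: list, capacity_mhz: float) -> float:
    """Assumed model: proportional slice by shares, clamped to [reservation, limit]."""
    total_shares = sum(w.shares for w in worlds)
    proportional = capacity_mhz * world.shares / total_shares
    return min(max(proportional, world.reservation_mhz), world.limit_mhz)


def next_world(worlds: list, capacity_mhz: float) -> World:
    """Pick the world that has consumed the least relative to its entitlement."""
    return min(
        worlds,
        key=lambda w: w.consumed_mhz / entitlement(w, worlds, capacity_mhz),
    )


worlds = [
    World("a", shares=2000, reservation_mhz=0, limit_mhz=float("inf"), consumed_mhz=1000),
    World("b", shares=1000, reservation_mhz=0, limit_mhz=float("inf"), consumed_mhz=800),
]
# With 3000 MHz of capacity, "a" is entitled to 2000 MHz and "b" to 1000 MHz,
# so "a" (ratio 0.5) is further behind its entitlement than "b" (ratio 0.8).
print(next_world(worlds, capacity_mhz=3000).name)
```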
  • CPU Scheduling – SMP VMs: An SMP virtual machine presents the guest operating system and applications with the illusion that they are running on a dedicated physical multiprocessor. ESX/ESXi implements this illusion by supporting co-scheduling of the vCPUs within an SMP virtual machine. Co-scheduling refers to a technique used for scheduling related processes to run on different processors at the same time. ESX/ESXi uses a form of co-scheduling that is optimized for running SMP virtual machines efficiently. At any particular time, each vCPU might be scheduled, descheduled, pre-empted or blocked while waiting for some event. Without co-scheduling, the vCPUs associated with an SMP virtual machine would be scheduled independently, thus breaking the guest’s assumptions regarding uniform progress. The word “skew” is used to refer to the difference in execution rates between two or more vCPUs associated with an SMP virtual machine. The ESX/ESXi scheduler maintains a fine-grained cumulative skew value for each vCPU within an SMP virtual machine. A vCPU is considered to make progress if it consumes CPU at the guest level or if it halts. The time spent in the hypervisor is excluded from the progress. This means that the hypervisor execution might not always be co-scheduled. This is acceptable because not all operations in the hypervisor benefit from being co-scheduled. When it is beneficial, the hypervisor makes explicit co-scheduling requests to achieve good performance. The progress of each vCPU in an SMP virtual machine is tracked individually. The “skew” is measured as the difference in progress between the slowest vCPU and each of the other vCPUs. A vCPU is considered to be skewed if its cumulative skew value exceeds a configurable threshold (typically a few milliseconds). Further relaxed co-scheduling was introduced to mitigate the effects of “skew” by reducing the need to co-stop and co-start vCPUs when the threshold is breached. Only the vCPUs that have made more progress are stopped, and they are restarted when the other siblings have caught up. For more information, check out the following link on the CPU Scheduler: http://www.vmware.com/resources/techresources/10059.
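The skew measurement described in this note can be sketched as follows. This is an illustrative model rather than the scheduler's actual implementation: the function names are hypothetical, the progress values are invented sample data, and the 3 ms threshold is an assumed stand-in for the "few milliseconds" mentioned above.

```python
def skew_per_vcpu(progress_ms: list) -> list:
    """Skew of each vCPU: its progress minus the progress of the slowest vCPU."""
    slowest = min(progress_ms)
    return [p - slowest for p in progress_ms]


def vcpus_to_costop(progress_ms: list, threshold_ms: float) -> list:
    """Relaxed co-scheduling: only vCPUs that are ahead by more than the
    threshold are stopped; the remaining siblings keep running to catch up."""
    return [
        index
        for index, skew in enumerate(skew_per_vcpu(progress_ms))
        if skew > threshold_ms
    ]


# Guest-level progress of four vCPUs in one SMP virtual machine (sample data).
progress = [10.0, 10.5, 14.0, 9.0]

print(skew_per_vcpu(progress))
print(vcpus_to_costop(progress, threshold_ms=3.0))
```

In this sample only the vCPU at index 2 is far enough ahead of the slowest sibling to be stopped, which is the point of the relaxed scheme: the other three vCPUs are never descheduled.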
  • Processor Topology / Cache Aware: The CPU scheduler can interpret processor topology, including the relationship between sockets, cores and logical processors. The scheduler uses topology information to optimize the placement of vCPUs onto different sockets to maximize overall cache utilization and to improve cache affinity by minimizing vCPU migrations. A dual-core processor can provide almost double the performance of a single-core processor by allowing two vCPUs to execute at the same time. Cores within the same processor are typically configured with a shared last-level cache used by all cores, potentially reducing the need to access slower main memory. Last-level cache is a memory cache that has a dedicated channel to the CPU socket (bypassing the main memory bus), enabling it to run at the full speed of the CPU.
  • CPU Ready Time: To achieve best performance in a consolidated environment, you must consider ready time. Ready time is the time a virtual machine must wait in the queue in a ready-to-run state before it can be scheduled on a CPU. Co-scheduling SMP virtual machines can increase CPU ready time when the number of physical CPUs required to co-start the VM is not currently available. CPU ready time latency can therefore impact the performance of the guest operating system and its applications within a virtual machine. When creating virtual machines, define smaller numbers of vCPUs to avoid CPU overcommitment and increased CPU ready time.
  • What affects CPU Performance? You might think that idling virtual machines cost little in terms of performance. However, timer interrupts still need to be delivered to these virtual machines. Lower timer interrupt rates can help a guest operating system’s performance. Using CPU affinity has a positive effect for the virtual machine being pinned to a physical CPU. However, for the system as a whole, CPU affinity constrains the scheduler and can cause an imbalanced load. VMware strongly recommends against using CPU affinity. Try to use as few vCPUs in your virtual machines as possible to reduce timer interrupts and any co-scheduling overhead that might be incurred. CPU capacity is finite. As a result, performance problems might occur when there are insufficient CPU resources to satisfy demand. The CPU scheduler will prioritize CPU usage based on the configured share allocations.
  • Causes of Host CPU Saturation: Host CPU saturation is caused by populating an ESX/ESXi host with too many virtual machines running compute-intensive applications that demand more CPU resources than the host has available. The main scenarios in which this problem can occur are: the host has a small number of VMs with high CPU demand; the host has a large number of VMs with moderate CPU demand; the host has a mix of VMs with high and low CPU demand.
  • Resolving Host CPU Saturation: The most straightforward solution to the problem of host CPU saturation is to reduce the demand for CPU resources by migrating virtual machines to other ESX/ESXi hosts with spare CPU capacity. To do this, monitor CPU usage, making sure to include peaks as well as average usage, and then identify which virtual machines would fit on other hosts without causing CPU saturation. This allows the CPU load to be manually rebalanced without incurring any downtime. If additional hosts are not available, either power down non-critical VMs or use resource controls such as shares, limits and/or reservations. Alternatively, you can increase the CPU resources available by adding the host to a DRS cluster. The option is then available to automatically load balance virtual machines across all members of the cluster without the need to manually compute which virtual machine should go on which host. You can also increase the amount of workload that can be performed on a saturated host by increasing the efficiency with which applications and virtual machines use those resources. Reference any application tuning guides that document best practices and procedures for the application. These guides often include operating system level tuning advice and best practices. There are two application level and operating system level tunings that are particularly effective in a virtualized environment: use of large memory pages, and reducing the timer interrupt rate for the VM operating system. Also, a virtualized environment provides more direct control over the assignment of CPU and memory resources to applications than a non-virtualized environment. There are some key adjustments that can be made in virtual machine configurations that might improve overall CPU efficiency: allocate more memory, reduce vCPUs, and use resource controls to prioritize available resources to critical VMs.
  • ESX Host – PCPU0 High Utilization: The issue of high utilization on physical core 0 (PCPU0), which affects ESX hosts only, is where the CPU utilization on PCPU0 is disproportionately high compared to the utilization on other CPU cores. In ESX, the Service Console is restricted to running only on PCPU0. In many vSphere environments, management agents from various vendors are run inside the Service Console. When an agent demands a large amount of CPU resources, or many less demanding agents are running, the utilization on PCPU0 can rise out of proportion to the overall CPU utilization. High utilization on PCPU0 might impact the performance of the virtual machines on the ESX host. The CPU scheduler will attempt to run virtual machines on other processors. However, high CPU utilization by the Service Console decreases the resources available to virtual machines. When running SMP virtual machines on NUMA systems, performance might be impacted when they are assigned to the home node that includes PCPU0.
  • Memory Reclamation Challenges: Virtual machine memory deallocation works just as it does in any operating system: the guest operating system frees a piece of guest physical memory by adding its memory page numbers to the guest free list. But the data of the “freed” memory might not be modified at all. As a result, when a particular piece of guest physical memory is freed, the mapped host physical memory does not usually change its state and only the guest free list is changed. It is difficult for the hypervisor to know when to free host physical memory when guest physical memory is deallocated or freed, because the guest operating system free list is not accessible to the hypervisor. The hypervisor is completely unaware of which pages are free or allocated in the guest operating system. As a result, the hypervisor cannot reclaim host physical memory when the guest operating system frees guest physical memory.
  • VM Memory Reclamation Techniques: The hypervisor must rely on memory reclamation techniques to reclaim the host physical memory freed up by the guest operating system. The memory reclamation techniques are Transparent Page Sharing (default), Ballooning and Host-level (or Hypervisor) swapping. When multiple virtual machines are running, some of them might have identical sets of memory content. This presents opportunities for sharing memory across virtual machines (as well as sharing within a single VM). With transparent page sharing, the hypervisor can reclaim the redundant copies and keep only one copy, which is then shared across multiple virtual machines in host physical memory. Due to virtual machine isolation, the guest operating system is not aware that it is running inside a virtual machine and is not aware of the states of the other virtual machines running on the same ESX/ESXi host. When the total amount of free host physical memory becomes low, none of the virtual machines will free guest physical memory, because the guest operating system cannot detect the host physical memory shortage. Ballooning makes the guest operating system aware of the low host physical memory status so that it can free up some of its memory. If the virtual machine has plenty of idle and free guest physical memory, inflating the balloon will not induce guest paging and will not affect guest performance. However, if the guest is already under memory pressure, the guest operating system decides which guest physical pages are to be paged out. In cases where transparent page sharing and ballooning are not sufficient to reclaim memory, ESX/ESXi employs host-level swapping. This is supported by creating a swap file (vSwp) when the virtual machine is started. Then, if necessary, the hypervisor can directly swap out guest physical memory to the swap file, which frees host physical memory for other virtual machines. Host-level swapping may, however, severely penalize guest performance. This occurs because the hypervisor has no knowledge of which guest physical pages should be swapped out, and the swapping may conflict with the native memory management of the guest operating system. For example, the guest operating system will never page out its kernel pages, because they are critical to ensure guest kernel performance. The hypervisor, however, cannot identify those guest kernel pages, so it might swap out those pages.
  • Memory Management Reporting: This example report shows the three reclamation techniques for a single cluster ESX host called VIXEN. Notice how the balloon driver (purple) is used to reclaim some memory, whilst on average about 3GB of memory is being shared. In this example, there is no requirement for swapping on this ESX host.
  • Why does the Hypervisor Reclaim Memory? The hypervisor reclaims memory to support ESX/ESXi memory overcommitment. Memory overcommitment provides two important benefits. Higher memory utilization: with memory overcommitment, ESX/ESXi ensures that the host physical memory is consumed by active guest memory as much as possible. Typically, some virtual machines will be lightly loaded and their memory will be idle for much of the time. Memory overcommitment allows the hypervisor to use memory reclamation techniques to take the inactive/unused host physical memory away from the idle virtual machines and give it to other virtual machines to use. Higher consolidation ratio: with memory overcommitment, each virtual machine has a smaller footprint in host physical memory, making it possible to fit more virtual machines on the host whilst still achieving good performance for all virtual machines. In the example above, you can enable a host with 4GB of host physical memory to run three virtual machines with 2GB of guest physical memory each. This assumes that no memory reservations have been set.
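The arithmetic behind the consolidation example in this note is shown below as a small sketch. The `overcommit_ratio` helper is a hypothetical name introduced for illustration; a ratio above 1 simply means that the configured guest memory exceeds host physical memory.

```python
def overcommit_ratio(vm_memory_gb: list, host_memory_gb: float) -> float:
    """Total configured guest physical memory divided by host physical memory."""
    return sum(vm_memory_gb) / host_memory_gb


# Three virtual machines with 2GB each on a host with 4GB of physical memory.
ratio = overcommit_ratio([2, 2, 2], host_memory_gb=4)
print(ratio)        # 6GB configured on 4GB of host memory
print(ratio > 1)    # the host is memory-overcommitted
```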
  • When to Reclaim Host Memory: ESX/ESXi maintains four host free memory states: high, soft, hard and low. These states are reflected by four thresholds: 6%, 4%, 2% and 1% of host physical memory. When to use ballooning or host-level swapping to reclaim host physical memory is largely determined by the current host free memory state. (Transparent Page Sharing is enabled by default.) As an example, if you have a host with 1GB of physical memory and the amount of free host physical memory drops to 60MB, the VMkernel does nothing to reclaim memory. However, if that value drops to 40MB, the VMkernel starts ballooning virtual machines. If the value of free memory drops to 20MB, the VMkernel starts swapping as well as ballooning. If it drops to 10MB, the VMkernel continues to swap until enough memory is reclaimed for it to use for other purposes. In the high state, the aggregate virtual machine guest memory usage is smaller than the host physical memory size. Whether or not host physical memory is overcommitted, the hypervisor will not reclaim memory through ballooning or host-level swapping. This is true only when the virtual machine memory limit is not set.
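The worked example in this note can be turned into a small classifier. This is a sketch under stated assumptions, not VMkernel logic: it treats each threshold as the point at which the next state begins (matching the 60MB/40MB/20MB/10MB example), it treats 1GB as 1000MB as the example does, and the function and dictionary names are hypothetical. How values between two thresholds are handled is an assumption.

```python
# Reclamation behaviour per state, as described in the note above.
ACTIONS = {
    "high": "no ballooning or swapping",
    "soft": "ballooning",
    "hard": "swapping and ballooning",
    "low": "swapping",
}


def host_memory_state(free_mb: int, total_mb: int) -> str:
    """Classify host free memory against the 4%, 2% and 1% thresholds.

    Integer arithmetic is used so that a value exactly on a threshold
    is classified without floating-point rounding surprises.
    """
    if free_mb * 100 <= 1 * total_mb:
        return "low"
    if free_mb * 100 <= 2 * total_mb:
        return "hard"
    if free_mb * 100 <= 4 * total_mb:
        return "soft"
    return "high"


for free in (60, 40, 20, 10):
    state = host_memory_state(free, total_mb=1000)
    print(free, "MB free:", state, "->", ACTIONS[state])
```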
  • Monitoring VM and Host Memory Usage: Active guest memory is defined as the amount of guest physical memory that is currently being used by the guest operating system and its applications. Consumed host memory is the amount of host physical memory that is allocated to the virtual machine. This value includes virtualization overhead. When reported at the host level, the consumed memory value also includes the memory used by the Service Console (ESX only) and the VMkernel. When monitoring memory usage, you may ask why consumed host memory is greater than active guest memory. The reason is that, for physical hosts that are not overcommitted on memory, consumed host memory represents the highest amount of memory usage by a virtual machine. It is possible that in the past this virtual machine was actively using a very large amount of memory. Because the host is not overcommitted, there is no reason for the hypervisor to invoke any reclamation techniques. So you can find that the active guest memory usage is low, whilst the host physical memory assigned to it is high. This is a perfectly normal situation. If consumed host memory is less than or equal to active guest memory, it might be because the active guest memory does not completely reside in host physical memory. This might occur if a guest’s active memory has been reclaimed by the balloon driver or if the virtual machine has been swapped out by the hypervisor. In both cases, this is probably due to high memory overcommitment.
  • Memory Troubleshooting: 1) When ESX/ESXi is actively swapping the memory of a virtual machine in and out of disk, the performance of that virtual machine will degrade. The overhead of swapping a virtual machine’s memory in and out from disk can also degrade the performance of other virtual machines. Monitor the memory swap in rate and memory swap out rate counters for the host. If either of these measurements is greater than zero, then the host is actively swapping virtual machine memory. In addition, identify the virtual machines affected by monitoring the memory swap in and swap out rate counters at the virtual machine level. 2) If overall demand for host physical memory is high, the ballooning mechanism might reclaim memory that is needed by an application or guest operating system. In addition, some applications are highly sensitive to having any memory reclaimed by the balloon driver. Monitor the Balloon counter at both the host and virtual machine levels. If you are seeing ballooning at the virtual machine level, check for high paging activity within the guest operating system. Perfmon in Windows and vmstat / sar in Linux provide the values for page activity. 3) In some cases, virtual machine memory can remain swapped to disk even though the ESX/ESXi host is not actively swapping. This can occur when high memory activity caused some virtual machine memory to be swapped to disk and the virtual machine has not yet attempted to access this swapped memory. As soon as the memory is accessed, the host swaps it back in from disk. There is one common situation that can lead to virtual machine memory being left swapped out even though there was no observed performance problem. When the virtual machine’s operating system first boots, there will be a period of time before the balloon driver begins running. In that time, the virtual machine might access a large portion of its allocated memory. If many virtual machines are powered on at the same time, the spike in memory demand, together with the lack of balloon drivers, might force the ESX/ESXi host to resort to host-level swapping.
  • vSwp file usage and placement guidelines: A swap (vSwp) file is created for each virtual machine hosted on ESX/ESXi when the virtual machine is powered on, for use when memory is overcommitted. These files are, by default, located with the virtual machine files in a VMFS datastore. Placement of a virtual machine’s swap file can affect the performance of vMotion. If the swap file is on shared storage, then vMotion performance is good because the swap file does not need to be copied. If the swap file is on the host’s local storage, then vMotion performance is slightly degraded (usually negligibly) because the swap file has to be copied to the destination host.
  • TCP Segmentation Off-Load: A TCP message must be broken down into Ethernet frames. The size of the frame is the maximum transmission unit (MTU). The default MTU is 1500 bytes, defined by the Ethernet specification. The process of breaking messages into frames is called segmentation. Historically, the operating system used the CPU to perform segmentation. Modern NICs try to optimize this TCP segmentation by using a larger segment size as well as off-loading work from the CPU to the NIC hardware. ESX/ESXi utilizes this concept to provide a virtual NIC with TSO support, without requiring specialized network hardware. TSO improves networking I/O performance by reducing the CPU overhead involved with sending large amounts of TCP traffic. Instead of processing many smaller MTU-sized frames during transmit, the system can send fewer, larger virtual MTU-sized frames. TSO improves performance for TCP data coming from a virtual machine and for network traffic sent out of the server, such as vMotion traffic. TSO is supported in both the guest operating system and in the VMkernel TCP/IP stack. It is enabled by default in the VMkernel. To take advantage of TSO, you must select Enhanced vmxnet, vmxnet2 (or later) or e1000 as the network device for the guest.
  • Jumbo frames: As previously mentioned, the default Ethernet MTU (packet size) is 1500 bytes. Recent advances have enabled an increase in the Ethernet packet size to 9000 bytes, called jumbo frames. Jumbo frames decrease the number of packets that need to be processed for the same amount of data, compared to standard-sized packets. That decrease results in less work for network transactions, which frees up resources for other activities. The network must support jumbo frames end to end. In other words, the physical NICs at both ends and all the intermediate hops, routers and switches must support jumbo frames. Jumbo frames must be enabled at the virtual switch level, at the virtual machine, and at the VMkernel interface. Before enabling jumbo frames, check with your hardware vendor to ensure that your network adapter supports jumbo frames.
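To put a number on the reduction in packets, the sketch below counts the frames needed to carry a payload at each MTU. It assumes plain IPv4 and TCP headers of 20 bytes each with no options, so 40 bytes of every frame's MTU are not payload; the `frames_needed` helper and the 1 MiB transfer size are illustrative choices, not figures from the presentation.

```python
import math

IP_TCP_HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Number of frames required to carry the payload at the given MTU."""
    payload_per_frame = mtu - IP_TCP_HEADER_BYTES
    return math.ceil(payload_bytes / payload_per_frame)


one_mib = 1024 * 1024
standard = frames_needed(one_mib, mtu=1500)
jumbo = frames_needed(one_mib, mtu=9000)

print("standard frames:", standard)
print("jumbo frames:", jumbo)
print("reduction factor: %.1f" % (standard / jumbo))
```

Under these assumptions a jumbo-frame network handles roughly one sixth as many frames for the same transfer, which is where the saving in per-packet processing comes from.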
  • NetQueue: VMware supports NetQueue, a performance technology that significantly improves performance in 10Gb Ethernet virtualized environments. NetQueue takes advantage of the multiple queue capability that newer physical network adapters have. Multiple queues allow I/O processing to be spread across multiple CPUs in a multiprocessor system. So, while one packet is queued up on one CPU, another packet can be queued up on another CPU at the same time. NetQueue load balances across queues. It monitors the load of the virtual machines as they are receiving packets and can assign queues to critical virtual machines. All other virtual machines use the default queue. NetQueue requires MSI-X support from the server platform, so support is limited to specific systems. It is also disabled by default but can be enabled manually through the command line.
  • Monitoring Network Statistics: Network packets get stored (buffered) in queues at multiple points along their route from the source to the destination. Switches, NICs, device drivers and network stacks might all contain queues where packet data or headers are buffered before being passed to the next step. These queues are finite in size. When these queues fill up, no more packets can be received at that point on the route. This causes additional arriving packets to be dropped. When a packet is dropped, TCP/IP’s recovery mechanisms work to maintain in-order delivery of packets to applications. However, these mechanisms operate at a cost to both networking performance and CPU overhead, a penalty that becomes more severe as the physical network speed increases. Monitor the droppedRx and droppedTx metrics for any values greater than 0, as they may indicate a network throughput issue.
  • LUN Queue Depth: SCSI device drivers have a configurable parameter called the LUN queue depth that determines how many commands can be active at one time to a given LUN. QLogic Fibre Channel HBAs support up to 255 outstanding commands per LUN, and Emulex HBAs support up to 128. However, the default value for both drivers is set to 32. If an ESX/ESXi host generates more commands to a LUN than the LUN queue depth, the excess commands are queued in the VMkernel, and this increases latency. To reduce latency, ensure that the sum of active commands from all virtual machines does not consistently exceed the LUN queue depth. Either increase the queue depth (the maximum recommended queue depth is 64) or move the virtual disks of some virtual machines to a different VMFS volume. Review the Disk.SchedNumReqOutstanding parameter, which defines the total number of outstanding commands permitted from all virtual machines to that LUN.
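The queue depth check described in this note amounts to a simple sum and comparison. In the sketch below, the `excess_commands` helper and the per-VM command counts are hypothetical sample values; the defaults of 32 and 64 are the figures quoted in the note.

```python
def excess_commands(active_per_vm: list, lun_queue_depth: int = 32) -> int:
    """Commands beyond the LUN queue depth, which would be queued in the VMkernel."""
    return max(0, sum(active_per_vm) - lun_queue_depth)


# Active commands from three virtual machines sharing one LUN (sample data).
active = [12, 10, 16]

print(excess_commands(active))                      # default queue depth of 32
print(excess_commands(active, lun_queue_depth=64))  # maximum recommended depth
```

A result that stays above zero across many samples is the "consistently exceeds" condition, at which point raising the queue depth or moving virtual disks to another VMFS volume is worth considering.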
  • Monitoring Disk Metrics: To identify disk-related performance problems, start by determining the available bandwidth on your hosts and comparing it with your expectations. Are you getting the expected IOPS? Check disk latency and compare it in the same way. Disk bandwidth and latency help determine whether storage is overloaded or slow. Within your vSphere environment, the most significant metrics to monitor for disk performance are: Disk throughput – Disk read rate, Disk write rate and Disk usage. Latency (Device and Kernel) – For physical device command latency, values greater than 15 milliseconds indicate that the storage array might be slow or overworked. For kernel command latency, values should be 0-1 milliseconds for best performance; if the value is greater than 4 milliseconds, the virtual machines on the host are trying to send more throughput to the storage system than the configuration supports. Number of aborted commands – When storage is severely overloaded, commands are aborted because the storage subsystem is taking far too long to respond to them. Aborted commands are a sign that the storage hardware is overloaded and unable to handle requests in line with the host’s expectations. Number of active commands – This metric represents the number of I/O operations that are currently active and can serve as a quick view of storage activity. If the value is at or close to zero, the storage subsystem is not being used. Number of commands queued – This metric represents the number of I/O operations that require processing but have not yet been addressed. Commands are queued and awaiting management by the kernel when the driver’s active command buffer is full. Occasionally a queue will form and result in a small, nonzero value for QUED. However, any significant double-digit average of queued commands means that the storage hardware is unable to keep up with the host’s needs.
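The thresholds in these notes can be collected into one small check. The sketch below is illustrative only: the function and parameter names are hypothetical, and treating an average of 10 or more queued commands as "significant double digit" is an assumption about where that line sits.

```python
def disk_findings(device_latency_ms, kernel_latency_ms, aborted, queued_avg):
    """Return a list of findings for one LUN, based on the thresholds in the notes."""
    findings = []
    if device_latency_ms > 15:
        findings.append("device latency > 15 ms: array might be slow or overworked")
    if kernel_latency_ms > 4:
        findings.append("kernel latency > 4 ms: host is sending more than the configuration supports")
    if aborted > 0:
        findings.append("aborted commands > 0: storage is overloaded")
    if queued_avg >= 10:  # assumed cut-off for a double-digit average
        findings.append("queued commands in double digits: storage cannot keep up")
    return findings


# Hypothetical samples for two LUNs.
print(disk_findings(device_latency_ms=8, kernel_latency_ms=0.5, aborted=0, queued_avg=1))
print(disk_findings(device_latency_ms=22, kernel_latency_ms=6, aborted=3, queued_avg=14))
```

An empty list means none of the thresholds were crossed for that sample; it does not prove the storage is healthy.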
  • Storage Response Time – Factors: To determine whether high disk latency values represent an actual problem, you must understand the storage workload. Three main workload factors affect the response time of a storage subsystem: I/O arrival rate – A storage device has a maximum rate at which it can handle specific mixes of I/O requests. Requests that arrive faster than this rate may be queued in buffers, and this queuing adds to the overall response time. I/O size – The transmission rate of storage interconnects is fixed, so large I/O operations naturally take longer to complete. A response time that is slow for small transfers might be expected for larger operations. I/O locality – Successive I/O requests to data that is stored sequentially on disk can be completed more quickly than those that are spread randomly. In addition, read requests to sequential data are more likely to be completed out of high-speed caches in the disks or arrays.
  • x86 Virtualization Challenges: x86 operating systems are designed to run on bare-metal hardware, so they assume full control of the computer hardware. The x86 architecture offers four levels of privilege to operating systems and applications to manage access to the computer hardware: ring 0, ring 1, ring 2 and ring 3. Because it needs direct access to the memory and hardware, the operating system must execute its privileged instructions in ring 0. The challenge of virtualizing the x86 architecture was to place a virtualization layer under the operating system (which expects to be in the most privileged ring, ring 0) to create and manage the virtual machines. To further complicate matters, some sensitive instructions cannot be virtualized because they have different semantics when not executed in ring 0. The initial difficulty of trapping and translating these sensitive and privileged instructions at runtime was the original reason why virtualizing the x86 architecture looked impossible to achieve. VMware resolved the challenge in 1998 by developing a software virtualization technique called binary translation.
  • Binary Translation: The use of Binary Translation means that the Virtual Machine Monitor (VMM) can run in ring 0 for isolation and performance, whilst moving the guest operating system to ring 1. By using this technique together with direct execution, any x86 operating system can be virtualized. The VMM dynamically translates all guest operating system instructions and caches the results for future use. The translator in the VMM does not perform a mapping from one architecture to another; instead it translates the full unrestricted x86 instruction set to a subset which is safe to execute. In particular, the binary translator replaces privileged instructions with sequences of instructions that perform the privileged operations in the virtual machine rather than on the physical machine. This translation enforces encapsulation of the virtual machine while preserving the x86 semantics. In addition, user-level code is executed directly on the processor for high-performance virtualization. Each VMM provides its virtual machine with all the services of a physical system, such as a virtual BIOS, virtual devices and virtualized memory management.
  • Hardware Virtualization: In addition to software virtualization, there is also support for hardware virtualization. Intel provides the Intel Virtualization Technology (Intel VT-x) feature and AMD the AMD Virtualization (AMD-V) feature. Both are similar in aim but different in detail, with both designs aiming to simplify virtualization techniques. The designs remove the VMM's need to use Binary Translation whilst still allowing it to fully control the execution of a virtual machine. This is achieved by restricting which kinds of privileged instructions the virtual machine can execute without intervention from the VMM. With the first generation of Intel VT-x and AMD-V, a CPU execution mode feature was introduced that allows the VMM to run in a root mode below ring 0. Privileged and sensitive calls are set to automatically trap to the VMM. The guest state is stored in virtual machine control structures (Intel) or blocks (AMD).
  • Memory Management Concepts: Beyond CPU virtualization, the next critical component is memory virtualization. This involves sharing the physical system memory and dynamically allocating it to virtual machines. Virtual machine memory virtualization is very similar to the virtual memory support provided in modern operating systems. Applications see a contiguous address space that is not necessarily tied to the underlying physical memory. The operating system keeps a map of virtual memory addresses to physical memory addresses in a page table. To optimize virtual memory performance, all modern x86 CPUs include a Memory Management Unit (MMU) and a Translation Look-aside Buffer (TLB). The MMU translates virtual addresses to physical addresses. The TLB is a cache which the MMU uses to speed up these translations. If the requested address is in the TLB, the physical address is quickly located and accessed; this is known as a TLB hit. If the requested address is not in the TLB (a TLB miss), the page table has to be consulted. The page table walker receives the virtual address and traverses the page table tree to produce the corresponding physical address. When the page table walk is completed, the virtual/physical address mapping is inserted into the TLB to speed up future accesses to that address.
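The TLB hit and miss behaviour described above can be illustrated with a toy model. This is a sketch under simplifying assumptions that are not in the notes: the page table is a flat dictionary rather than a multi-level tree, the TLB is unbounded, pages are 4 KB, and the class name `Mmu` is hypothetical.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class Mmu:
    """Toy MMU: a TLB in front of a flat page table (virtual page -> physical frame)."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.tlb = {}
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1  # TLB hit: mapping already cached
        else:
            self.misses += 1  # TLB miss: consult the page table, then fill the TLB
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset


mmu = Mmu({0: 7, 1: 3})
first = mmu.translate(4100)   # first touch of virtual page 1: miss
second = mmu.translate(4200)  # same page again: hit
print(first, second, mmu.hits, mmu.misses)
```

The second access to the same page avoids the page table lookup, which is the saving the TLB exists to provide.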
  • MMU Virtualization: In order to run multiple virtual machines on a single system, another level of memory virtualization is required: host physical memory (also known as machine memory). The guest operating system continues to control the mapping of virtual addresses to guest physical addresses, but it does not have direct access to host physical memory. Therefore, the VMM is responsible for mapping guest physical memory (PA) to host machine memory (MA). To accomplish this, the MMU must be virtualized. There are two techniques for virtualizing the MMU: (1) software, using Shadow Page Tables, and (2) hardware, using either Intel’s Extended Page Tables (EPT) or AMD’s Rapid Virtualization Indexing (RVI).
  • Software MMU Virtualization - Shadow Page Tables: To virtualize the MMU in software, the VMM creates a shadow page table for each primary page table that the virtual machine is using. The VMM populates the shadow page table with the composition of two mappings: VA > PA – Virtual memory addresses to guest physical addresses. This mapping is specified by the guest operating system and is obtained from the primary page table. PA > MA – Guest physical memory addresses to host physical (machine) memory addresses. This mapping is defined by the VMM and VMkernel. By building shadow page tables that capture this composite mapping, the VMM points the hardware MMU directly at the shadow page tables, allowing the memory accesses of the virtual machine to run at native speed. It also prevents the virtual machine from accessing host physical memory that is not associated with it.
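The composition of the two mappings can be sketched with plain dictionaries. This is an illustration only, not how the VMM stores its tables: the function name `build_shadow` and the page numbers are hypothetical.

```python
def build_shadow(guest_page_table, vmm_map):
    """Compose VA -> PA (guest) with PA -> MA (VMM) into a shadow VA -> MA table.

    Guest physical pages with no machine page assigned are left out, so the
    shadow table never points at host memory the VM does not own.
    """
    return {
        va: vmm_map[pa]
        for va, pa in guest_page_table.items()
        if pa in vmm_map
    }


guest_page_table = {0x10: 0x2, 0x11: 0x5, 0x12: 0x8}  # VA page -> guest PA page
vmm_map = {0x2: 0x9A, 0x5: 0x41}                      # guest PA page -> MA page

shadow = build_shadow(guest_page_table, vmm_map)
print(shadow)
```

Because the hardware MMU is pointed at the composed table, a translation takes a single lookup instead of two.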
  • Hardware MMU Virtualization: With software MMU, the VMM maps guest physical pages to host physical pages in the shadow page tables, which are exposed to the hardware. The VMM also synchronizes shadow page tables with guest page tables (the mapping of VA to PA). With hardware MMU, the guest operating system does the VA to PA mapping. The VMM maintains the mapping of guest physical addresses (PA) to host physical addresses (MA) in an additional level of page tables called nested page tables. The guest page tables and nested page tables are both exposed to hardware. When a virtual address is accessed, the hardware walks the guest page tables, as in the case of native execution. However, for every guest physical page accessed during the guest page table walk, the hardware also walks the nested page tables to determine the corresponding host physical page. This translation eliminates the need for the VMM to synchronize shadow page tables with guest page tables. However, the extra operation also increases the cost of a page walk, thereby affecting the performance of applications that stress the TLB. This cost can be reduced by the use of large pages, which reduce the stress on the TLB for applications with good spatial locality. When hardware MMU is used, the ESX VMM and VMkernel aggressively try to use large pages for their own memory.
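The extra cost of a nested page walk can be estimated with a simple count. The sketch below rests on assumptions the notes do not state: a TLB miss with no page walk caches, and four levels in both the guest and nested page tables (typical for 64-bit x86 with 4 KB pages). The function name is hypothetical.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Worst-case memory references for one nested page walk.

    Each guest level needs a nested walk to locate the guest table, plus one
    read of the guest entry itself. The final guest physical address then
    needs one more nested walk.
    """
    return guest_levels * (nested_levels + 1) + nested_levels


native = 4                                  # assumed 4-level native walk
small_pages = nested_walk_refs(4, 4)        # 4 KB pages in guest and host
large_pages = nested_walk_refs(3, 3)        # large pages drop one level from each walk

print(native, small_pages, large_pages)
```

Under these assumptions a miss costs several times more than a native walk, which is why workloads that stress the TLB are the ones most affected and why large pages help.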
  • Memory Virtualization Overhead: With software MMU virtualization, shadow page tables are used to accelerate memory access and thereby improve memory performance. However, shadow page tables consume additional memory and also incur CPU overhead in certain situations: When new processes are created, the virtual machine updates a primary page table. The VMM must trap the update and propagate the change into the corresponding shadow page table(s). This slows down memory mapping operations and the creation of new processes in virtual machines. When the virtual machine switches context from one process to another, the VMM must intervene to switch the physical MMU to the shadow page table root of the new process. When running a large number of processes, shadow page tables need to be maintained. When allocating pages, the shadow page table entry mapping this memory must be created on demand, slowing down the first access to memory. (The native equivalent is a TLB miss.) For most workloads, hardware MMU virtualization provides an overall performance win over shadow page tables. There are some exceptions: workloads that suffer frequent TLB misses or that perform few context switches or page table updates.

    1. 1. vSphere Performance Management Challenges and Best Practices Jamie Baker Principal Consultant jamie.baker@metron-athene.com
    2. 2. Agenda • Can my Application be virtualized? • Virtual Machine Performance Management and Best Practices • CPU Performance Management and Best Practices • Memory Performance Management and Best Practices • Networking Performance Management and Best Practices • Storage Performance Management and Best Practices • x86 Virtualization Challenges ( not covered in presentation but included with slides) www.metron-athene.com
    3. 3. Can my application be virtualized? Resource / Application / Category • CPU: CPU-intensive – Green (with latest HW); More than 8 CPUs – Red • Memory: Memory-intensive – Green (with latest HW); Greater than 255GB RAM – Red • Network bandwidth: 1-27Gb/s – Yellow; Greater than 27Gb/s – Red • Storage bandwidth: 10-250K IOPS – Yellow; Greater than 250K IOPS – Red 1/3/2012 3
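The categories on this slide can be expressed as a simple classifier. This is a hedged sketch: the function name and sample workloads are hypothetical, and treating anything below the Yellow ranges as Green is an assumption, since the slide only lists the Yellow and Red boundaries.

```python
def categorize(vcpus, memory_gb, network_gbps, storage_iops):
    """Classify a workload as Green, Yellow or Red using the slide's limits."""
    if vcpus > 8 or memory_gb > 255 or network_gbps > 27 or storage_iops > 250_000:
        return "Red"     # exceeds the capabilities of the virtual platform
    if network_gbps >= 1 or storage_iops >= 10_000:
        return "Yellow"  # virtualizes well but needs some tuning
    return "Green"       # virtualizes out of the box


# Hypothetical workloads.
print(categorize(vcpus=2, memory_gb=8, network_gbps=0.2, storage_iops=1_500))
print(categorize(vcpus=4, memory_gb=64, network_gbps=5, storage_iops=40_000))
print(categorize(vcpus=16, memory_gb=512, network_gbps=1, storage_iops=5_000))
```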
    4. 4. Virtual Machines Performance Management • Selecting the correct operating system • Virtual machine timekeeping • Why installing VMware Tools brings benefits • SMP guidelines • NUMA server considerations • Overview of Best Practices 1/3/2012 4
    5. 5. Selecting the Correct Operating System • When creating a VM, select the correct OS to: • Determine the optimal monitor mode to use • Determine the default optimal devices, e.g. SCSI controller • Ensure the correct version of VMware Tools is installed • Use a 64-bit OS only when necessary • If running 64-bit applications • 64-bit VMs require more memory overhead • Compare 32-bit / 64-bit application benchmarks to determine whether 64-bit is worthwhile
    6. 6. Guest Operating System Timekeeping • To keep time, most operating systems count periodic timer interrupts, or “ticks” • Tick frequency is 64-1000Hz or more • Counting “ticks” can be a problem in a VM • Ticks are not always delivered on time • VM might be descheduled • If a tick is lost, time falls behind • Ticks are backlogged • When backlogged, the system delivers ticks faster to catch up • Mitigate these issues by: • Using guest operating systems that require fewer ticks • Most Windows – 66 to 100Hz, Linux 2.4 – 100Hz • Linux 2.6 – 1000Hz, Recent Linux – 250Hz • Using clock synchronisation software (VMware Tools) • For Linux use NTP instead of VMware Tools • Check article kb1006427
    7. 7. SMP Guidelines • Avoid using SMP unless specifically required • Processes can migrate across vCPUs • This increases CPU overhead • If migration is frequent – use CPU Affinity • VMkernel in vSphere now owns all PCI devices • No performance concern with IRQ sharing in vSphere • Earlier versions of ESX had performance issues when IRQ shared devices were owned by the Service Console and VMkernel 1/3/2012 7
    8. 8. NUMA Server Considerations • If using NUMA, ensure that a VM fits in a NUMA node • If the total amount of guest vCPUs exceeds the cores, then the VM is not treated as a NUMA client and not managed by the NUMA balancer • VM Memory size is greater than memory per NUMA node • Wide VMs are supported in ESX 4.1 • If more vCPUs than cores • Splits the VM into multiple NUMA clients • Improves memory locality 1/3/2012 8
    9. 9. Virtual Machine Performance Best Practices • Select the right guest operating system type during virtual machine creation • Use 64-bit operating systems only when necessary • Don’t deploy single-threaded applications in SMP virtual machines • Configure proper time synchronization • Install VMware Tools in the guest operating system and keep it up to date
    10. 10. CPU Performance Management • What are Worlds? • CPU Scheduling • How it works • Processor Topology • SMP and CPU ready time • What affects CPU Performance? • Causes and resolution of Host Saturation • ESX PCPU0 – High Utilization • Overview of Best Practices 1/3/2012 10
    11. 11. Worlds • A world is an execution context scheduled on a processor (comparable to a process) • A virtual machine is a group of worlds • World for each vCPU • World for the virtual machine’s Mouse, Keyboard and Screen (MKS) • World for VMM • CPU Scheduler chooses which world to schedule on a processor • VMkernel worlds are known as non-virtual machine worlds
    12. 12. CPU Scheduling • Schedules vCPUs on physical CPUs • Scheduler checks physical CPU utilization every 20 milliseconds and migrates worlds as necessary • Entitlement is implemented when CPU resources are overcommitted • Calculated from user resource specifications, such as Shares, Reservations and Limits • Ratio of consumed CPU / Entitlement is used to set priority of the world • High Priority = consumed CPU < Entitlement
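The priority rule on this slide can be sketched as a sort on the ratio of consumed CPU to entitlement. This is a simplified illustration and not the actual ESX scheduler: the function name, world names and MHz figures are all hypothetical.

```python
def by_priority(worlds):
    """Order worlds from highest to lowest priority.

    worlds maps a world name to (consumed_mhz, entitlement_mhz). A lower
    consumed/entitlement ratio means the world has had less of its share,
    so it is given higher priority.
    """
    return sorted(worlds, key=lambda name: worlds[name][0] / worlds[name][1])


worlds = {
    "vm1-vcpu0": (900, 1000),
    "vm2-vcpu0": (300, 1000),
    "vm3-vcpu0": (1200, 1000),
}

order = by_priority(worlds)
print(order)
```

In this sample, `vm3-vcpu0` has consumed more than its entitlement, so it sorts last.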
    13. 13. CPU Scheduling – SMP VMs • ESX/ESXi uses a form of co-scheduling to run SMP VMs • Co-scheduling is a technique that schedules related processes to run on different processors at the same time • At any time, each vCPU might be scheduled, descheduled, pre-empted or blocked while waiting for some event. • The CPU Scheduler takes “skew” into account when scheduling vCPUs • Skew is the difference in execution rates between two or more vCPUs • A vCPU’s “skew” increases when it is not making progress whilst one of its siblings is. • A vCPU is considered to be skewed if its cumulative skew exceeds a configurable threshold (typically a few milliseconds) • Further relaxed co-scheduling was introduced to mitigate vCPU skew • For more information on the vSphere 4 CPU Scheduler visit http://www.vmware.com/resources/techresources/10059
    14. 14. Processor Topology / Cache Aware • The CPU Scheduler uses processor topology information to optimize the placement of vCPUs onto different sockets. • The CPU Scheduler spreads load across all sockets to maximize the aggregate amount of cache available. • Cores within a single socket typically use shared last-level cache • Use of a shared last-level cache can improve vCPU performance if running memory intensive workloads. • Reduces the need to access slower main memory • Last-level cache has a dedicated channel to a CPU socket enabling it to run at the full speed of the CPU 1/3/2012 14
    15. 15. CPU Ready Time • The amount of time the vCPU waits for physical CPU time to become available • Co-scheduling SMP VMs can increase CPU ready time • This latency can impact on the performance of the guest operating system and its applications in a VM • Avoid over committing vCPUs to host physical CPUs 1/3/2012 15
    16. 16. What affects CPU Performance? • Idling VMs • Consider the overhead of delivering guest timer interrupts • Try to use as few vCPUs as possible to reduce timer interrupts and reduce any co-scheduling overhead • CPU Affinity • This can constrain the scheduler and cause an imbalanced load. • VMware strongly recommends against using CPU affinity • SMP VMs • Some co-scheduling overhead is incurred. • Insufficient CPU resources to satisfy demand • When there is contention the scheduler forces vCPUs of lower-priority VMs to queue behind higher-priority VMs 1/3/2012 16
    17. 17. Causes of Host CPU Saturation • VMs running on the host are demanding more CPU resource than the host has available. • Main scenarios for this problem: • The host has a small number of VMs with high CPU demand • The host has a large number of VMs with moderate CPU demand • The host has a mix of VMs with high and low CPU demand 1/3/2012 17
    18. 18. Resolving Host CPU Saturation • Reduce the VMs on the host • Monitor CPU Usage (both Peak and Average) of each VM, then identify and migrate VMs to other hosts with spare CPU capacity • Load can be manually rebalanced without downtime • If additional hosts are not available, either power down non-critical VMs or use resource controls such as Shares. • Increase the CPU resources by adding the host to a DRS cluster • Automatic vMotion migrations to load balance across Hosts • Increase the efficiency of a VM’s CPU Usage • Reference application tuning guides, papers, forums. • Guest operating system and application tuning should include: • Use of large memory pages • Reducing the timer interrupt rate for the VM operating system • Optimize VMs by: • Choosing the right number of vCPUs and memory allocation • Using resource controls to direct available resources to critical VMs
    19. 19. ESX Host – pCPU0 High Utilization • Monitor the usage on pCPU0 • Utilization on pCPU0 is high if its average usage is above 75% and more than 20% higher than overall host usage • Within ESX the service console is restricted to running on pCPU0 • Management agents installed in the Service Console can request large amounts of CPU • High utilization on pCPU0 can impact the performance of hosted VMs • SMP VMs running on NUMA systems might be impacted when assigned to the home node that includes pCPU0
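The pCPU0 rule on this slide can be written as a small predicate. This is a sketch with one loudly stated assumption: the slide's "20%" is read here as 20 percentage points above overall host usage, which is one possible interpretation. The function and parameter names are hypothetical.

```python
def pcpu0_is_high(pcpu0_avg_pct, host_avg_pct, floor_pct=75.0, margin_points=20.0):
    """Return True when pCPU0 looks overloaded relative to the rest of the host.

    Assumes the margin is measured in percentage points, not as a relative
    increase over host usage.
    """
    above_floor = pcpu0_avg_pct > floor_pct
    above_host = (pcpu0_avg_pct - host_avg_pct) > margin_points
    return above_floor and above_host


# Hypothetical samples.
print(pcpu0_is_high(85, 50))  # busy pCPU0 on a moderately loaded host
print(pcpu0_is_high(85, 70))  # whole host is busy, not just pCPU0
print(pcpu0_is_high(70, 30))  # pCPU0 stands out but is below the 75% floor
```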
    20. 20. CPU Performance Best Practices • Avoid using SMP unless specifically required by the application running in the VM. • Prioritize VM CPU Usage with Shares • Use vMotion and DRS to load balance VMs and reduce contention • Increase the efficiency of VM usage by: • Referencing application tuning guides • Tuning the guest operating system • Optimizing the virtual hardware 1/3/2012 20
    21. 21. Memory Performance Management • Reclamation – how and why? • Monitoring – what and why? • Troubleshooting • vSwp files placement guidelines • Overview of Best Practices 1/3/2012 21
    22. 22. Memory Reclamation Challenges • VM physical memory is not “freed” • Memory is moved to the “free” list • The hypervisor is not aware when the VM releases memory • It has no access to the VM’s “free” list • The VM can accrue lots of host physical memory • Therefore, the hypervisor cannot reclaim released VM memory
    23. 23. VM Memory Reclamation Techniques • The hypervisor relies on these techniques to “free” the host physical memory • Transparent page sharing (default) • redundant copies reclaimed • Ballooning • Forces guest OS to “free” up guest physical memory when the physical host memory is low • Balloon driver installed with VMware Tools • Host-level (hypervisor) swapping • Used when TPS and Ballooning are not enough • Swaps out guest physical memory to the swap file • Might severely penalize guest performance 1/3/2012 23
    24. 24. Memory Management Reporting [Chart: Production Cluster, Memory Shared, Ballooned and Swapped, host VIXEN (ESX). Series in MB: Average swap space in use; Average amount of memory used by memory control; Average memory shared across VMs.]
    25. 25. Why does the Hypervisor Reclaim Memory? • Hypervisor reclaims memory to support memory overcommitment • ESX host memory is overcommitted when the total amount of VM physical memory exceeds the total amount of host physical memory
    26. 26. When to Reclaim Host Memory • ESX/ESXi maintains four host free memory states and associated thresholds: • High (6%), Soft (4%), Hard (2%), Low (1%) • If the host free memory drops towards the stated thresholds, the following reclamation techniques are used: • High: None • Soft: Ballooning • Hard: Swapping and Ballooning • Low: Swapping
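The four states can be sketched as a lookup from free memory to reclamation technique. The thresholds and techniques come from the slide, but the exact boundary behaviour is an assumption: here a state applies once free memory is at or below its threshold, and anything above the Soft threshold counts as High. The names are hypothetical.

```python
# (threshold %, state, reclamation technique), checked from most to least severe.
STATES = [
    (1.0, "Low", "swapping"),
    (2.0, "Hard", "swapping and ballooning"),
    (4.0, "Soft", "ballooning"),
]


def memory_state(free_pct):
    """Map host free memory (%) to an assumed state and reclamation technique."""
    for threshold, state, technique in STATES:
        if free_pct <= threshold:
            return state, technique
    return "High", "none"


for free in (8.0, 3.5, 1.5, 0.5):
    print(free, memory_state(free))
```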
    27. 27. Monitoring VM and Host Memory Usage • Active • amount of physical host memory currently used by the guest • displayed as “Guest Memory Usage” in vCenter at Guest level • Consumed • amount of physical ESX memory allocated (granted) to the guest, accounting for savings from memory sharing with other guests. • includes memory used by Service Console & VMKernel • displayed as “Memory Usage” in vCenter at Host level • displayed as “Host Memory Usage” in vCenter at Guest level • If consumed host memory > active memory • Host physical memory not overcommitted • Active guest usage low but high host physical memory assigned • Perfectly normal • If consumed host memory <= active memory • Active guest memory might not completely reside in host physical memory • This might point to potential performance degradation 1/3/2012 27
    28. 28. Memory Troubleshooting 1. Active host-level swapping • Cause: excessive memory overcommitment • Resolution: • reduce memory overcommitment (add physical memory / reduce VMs) • enable balloon driver in all VMs • reduce memory reservations and use shares 2. Guest operating system paging • Monitor the host’s ballooning activity • If host ballooning > 0 look at the VM ballooning activity • If VM ballooning > 0 check for high paging activity within the guest OS 3. When swapping occurs before ballooning • Many VMs are powered on at the same time • VMs might access a large portion of their allocated memory • At the same time, the balloon drivers have not started yet • This causes the host to swap VMs
    29. 29. vSwp file usage and placement guidelines • Used when memory is overcommitted • vSwp file is created for every VM • Default placement is with VM files • Can affect vMotion performance if vSwp file is not located on Shared Storage 1/3/2012 29
    30. 30. Memory Performance Best Practices • Allocate enough memory to hold the working set of applications you will run in the virtual machine, thus minimizing swapping • Never disable the balloon driver • Keep transparent page sharing enabled • Avoid over committing memory to the point that it results in heavy memory reclamation 1/3/2012 30
    31. 31. Networking Performance Management • Reducing CPU Load • TCP Segmentation Off Load and jumbo frames • NetQueue • What is it and how will it benefit me? • Monitoring • How to identify network performance problems • Overview of Best Practices 1/3/2012 31
    32. 32. TCP Segmentation Off-Load • Segmentation is the process of breaking messages into frames • Size of a frame is the Maximum Transmission Unit (MTU) • Default MTU is 1500 bytes • Historically, the operating system used the CPU to perform segmentation. • Now modern NICs optimize TCP segmentation • Using larger segments and offloading from CPU to NIC hardware • Improves networking performance by reducing the CPU overhead involved in sending large amounts of TCP traffic • TSO is supported in VMkernel and Guest OS • Enabled by default in VMkernel • You must select Enhanced vmxnet, vmxnet2 (or later) or e1000 as the network device for the VM 1/3/2012 32
    33. 33. Jumbo Frames • Data is transmitted in MTU-sized frames • The receive side reassembles the data • A jumbo frame: • Is an Ethernet frame with a bigger MTU, typically 9000 bytes • Reduces the number of frames transmitted • Reduces the CPU utilization on the transmit and receive sides • VMs must be configured with vmxnet2 or vmxnet3 adapters • The network must support jumbo frames end to end
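The reduction in frame count can be estimated with simple arithmetic. This sketch assumes 40 bytes of TCP and IPv4 headers (no options) inside each MTU, which is not stated on the slide, and the function name is hypothetical.

```python
import math

HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(payload_bytes, mtu):
    """Number of frames needed to carry payload_bytes of TCP data at a given MTU."""
    payload_per_frame = mtu - HEADER_BYTES
    return math.ceil(payload_bytes / payload_per_frame)


one_mib = 1024 * 1024
standard = frames_needed(one_mib, 1500)  # default MTU
jumbo = frames_needed(one_mib, 9000)     # typical jumbo frame MTU

print(standard, jumbo)
```

Fewer frames mean fewer per-frame processing steps on both the transmit and receive sides, which is where the CPU saving comes from.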
    34. 34. NetQueue • NetQueue is a performance technology that significantly improves performance in a 10Gb Ethernet environment • How? • It allows network processing to scale over multiple CPUs • Multiple transmit and receive queues are used to allow I/O processing across multiple CPUs. • NetQueue monitors receive load and balances across queues • Can assign queues to critical VMs • NetQueue requires MSI-X support from the server platform, so NetQueue support is limited to specific systems. • For further information - http://www.vmware.com/support/vi3/doc/whatsnew_esx35_vc25.html 1/3/2012 34
    35. 35. Monitoring Network Statistics • Network packets are queued in buffers if: • The destination is not ready to receive (Rx) • The network is too busy to send (Tx) • Buffers are finite in size. • Virtual NIC devices buffer packets when they cannot be handled immediately • If the Virtual NIC queue fills, packets are buffered by the virtual switch port • Packets are dropped if the virtual switch port buffer also fills • Monitor the droppedRx and droppedTx values • If droppedRx or droppedTx values are > 0, there may be a network throughput issue • For droppedRx, check CPU utilization and driver configuration • For droppedTx, check virtual switch usage and consider moving VMs to another virtual switch
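The checks on this slide can be sketched as a small helper that turns the two counters into suggested next steps. The function name and sample values are hypothetical; the advice strings paraphrase the slide.

```python
def diagnose_drops(dropped_rx, dropped_tx):
    """Return suggested checks for non-zero droppedRx / droppedTx counters."""
    advice = []
    if dropped_rx > 0:
        advice.append("droppedRx > 0: check CPU utilization and driver configuration")
    if dropped_tx > 0:
        advice.append("droppedTx > 0: check virtual switch usage; consider moving VMs to another virtual switch")
    return advice


# Hypothetical counter samples.
print(diagnose_drops(dropped_rx=0, dropped_tx=0))
print(diagnose_drops(dropped_rx=12, dropped_tx=0))
print(diagnose_drops(dropped_rx=3, dropped_tx=7))
```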
    36. 36. Networking Best Practices • Use the vmxnet3 network adapter where possible • Use vmxnet, vmxnet2 if vmxnet3 is not supported by the guest operating system • Use a physical network adapter that supports high-performance features • Use TCP Off-load and Jumbo Frames where possible to reduce the CPU load. • Monitor droppedRx and droppedTx metrics for network throughput information 1/3/2012 36
    37. 37. Storage Performance Management • LUN Queue Depth • Monitoring • Storage Response Time – key factors • Overview of Storage Best Practices 1/3/2012 37
    38. 38. LUN Queue Depth • LUN queue depth determines how many commands to a given LUN can be active at one time. • The default driver queue depth is 32: • QLogic FC HBAs support up to 255 • Emulex HBAs support up to 128 • Maximum recommended queue depth is 64. • If ESX generates more commands to a LUN than the specified queue depth: • Excess commands are queued • Disk I/O latency increases • Review the Disk.SchedNumReqOutstanding parameter
    39. 39. Monitoring Disk Metrics • Monitor the following key metrics to determine available bandwidth and identify disk-related performance issues: • Disk throughput • Disk Read Rate, Disk Write Rate and Disk Usage • Latency (Device and Kernel) • Physical device command latency (values > 15ms indicate a slow array) • Kernel command latency (values should be 0-1ms; > 4ms means more data is being sent to the storage system than it supports) • Number of aborted disk commands • If this value is > 0 for any LUN, then storage is overloaded on that LUN • Number of active disk commands • If this value is at or close to zero, the storage subsystem is not being used • Number of commands queued • Significant double-digit average values indicate that the storage hardware is unable to process the host’s I/O requests fast enough
    40. 40. Storage Response Time - Factors Three main factors: • I/O arrival rate • A storage device has a maximum rate at which it can handle specific mixes of I/O requests. Requests may be queued in buffers if they exceed this rate. • This queuing can add to the overall response time • I/O size • The transmission rate of storage interconnects is fixed, so large I/O operations naturally take longer to complete. • I/O locality • I/O requests to data that is stored sequentially can be completed faster than requests to data that is spread randomly. • Read requests to sequential data are more likely to be completed from high-speed caches
    41. 41. Storage Performance Best Practices • Applications that write a lot of data to storage should not share Ethernet links to a storage device • Eliminate all possible swapping to reduce the burden on the storage subsystem • If required, set the LUN Queue size to 64 (default 32) 1/3/2012 41
    42. 42. vSphere Performance Management Challenges and Best Practices Jamie Baker Principal Consultant jamie.baker@metron-athene.com
    43. 43. x86 Virtualization Challenges • Privilege levels • Software Virtualization – Binary Translation • Hardware Virtualization – Intel VT-x and AMD-V • Memory Management Concepts • MMU Virtualization • Software MMU • Hardware MMU • Memory Virtualization Overhead 1/3/2012 43
    44. 44. x86 Virtualization Challenges • x86 operating systems are designed to run on the bare-metal hardware. • Four levels of privilege are available to operating systems and applications – (Ring 0,1, 2 & 3) • Operating system needs direct access to the memory and hardware and must execute instructions in Ring 0. • Virtualizing x86 architecture requires a virtualization layer under the operating system • Initial difficulties in trapping and translating sensitive and privileged instructions made virtualizing x86 architecture look impossible! • VMware resolved the issue by developing Binary Translation 1/3/2012 44
    45. 45. Software Virtualization - Binary Translation • Original approach to virtualizing the (32-bit) x86 instruction set • Binary Translation allows the VMM to run in Ring 0 • Guest operating system moved to Ring 1 • Applications still run in Ring 3 1/3/2012 45
    46. 46. Hardware Virtualization • In addition to software virtualization: • Intel VT-x • AMD-V • Both are similar in aim but different in detail • Aim is to simplify virtualization techniques • VMM no longer needs Binary Translation whilst still fully controlling the VM. • Restricts which privileged instructions the VM can execute without intervention from the VMM. • CPU execution mode feature: • Allows the VMM to run in a root mode below Ring 0 • Automatically traps privileged and sensitive calls to the hypervisor • Stores the guest operating system state in VM control structures (Intel) or blocks (AMD)
    47. 47. Memory Management Concepts • Memory virtualization is the next critical component • Processes see virtual memory • Guest operating systems use page tables to map virtual memory addresses to physical memory addresses • The MMU translates virtual addresses to physical addresses, and the TLB cache helps the MMU speed up these translations • The page table is consulted on a TLB miss • The TLB is updated with the virtual-to-physical address mapping when the page table walk completes
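The TLB and page table interaction described on this slide can be sketched as a toy model. This is an illustration only, not how any real MMU or ESX is implemented: the `SimpleMMU` class, the single-level page table and the 4 KiB page size are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class SimpleMMU:
    """Toy MMU: a single-level page table fronted by a TLB cache (hypothetical)."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cached translations
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            # TLB hit: no page table access needed
            self.hits += 1
            frame = self.tlb[vpn]
        else:
            # TLB miss: consult the page table, then update the TLB
            self.misses += 1
            frame = self.page_table[vpn]
            self.tlb[vpn] = frame
        return frame * PAGE_SIZE + offset


mmu = SimpleMMU({0: 7, 1: 3})
first = mmu.translate(0x1004)   # miss: page table consulted, TLB filled
second = mmu.translate(0x1008)  # hit: same page, served from the TLB
print(first, second, mmu.hits, mmu.misses)
```

The second access to the same page is served from the TLB, which is why workloads with good locality see few page table walks.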
    48. 48. MMU Virtualization • Hosting multiple virtual machines on a single host requires: • Another level of memory virtualization: host physical memory • The VMM to map “guest” physical addresses (PA) to host physical (machine) addresses (MA) • To support the guest operating system, the MMU must be virtualized by using: • Software technique: shadow page tables • Hardware technique: Intel EPT and AMD RVI
    49. 49. Software MMU Virtualization - Shadow Page Tables • Are created for each primary page table • Combine two mappings: • Virtual Addresses (VA) -> Physical Addresses (PA) • Physical Addresses (PA) -> Machine Addresses (MA) • Accelerate memory access • The VMM points the hardware MMU directly at the shadow page tables • Memory access runs at native speed • Ensure a VM cannot access host physical memory that is not associated with it
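A minimal sketch of how the two mappings on this slide can be composed into one direct VA -> MA table. The function name `build_shadow_table` and the dictionary-based tables are hypothetical and stand in for real page table structures; this is not VMware's implementation.

```python
def build_shadow_table(guest_page_table, vmm_pmap):
    """Compose VA->PA (guest) with PA->MA (VMM) into a direct VA->MA table.

    Both arguments are plain dicts keyed by page number (an assumption
    made for this sketch).
    """
    shadow = {}
    for va_page, pa_page in guest_page_table.items():
        # Only guest-physical pages the VMM has backed for this VM are mapped,
        # so the VM cannot reach machine memory that is not associated with it.
        if pa_page in vmm_pmap:
            shadow[va_page] = vmm_pmap[pa_page]
    return shadow


guest_page_table = {0: 10, 1: 11, 2: 99}  # VA page -> PA page (guest view)
vmm_pmap = {10: 500, 11: 742}             # PA page -> MA page (VMM view)
shadow = build_shadow_table(guest_page_table, vmm_pmap)
print(shadow)
```

Because the hardware MMU is pointed at the composed table, a translation costs the same as on a native system. The price is that the VMM must rebuild or patch the shadow entries whenever the guest changes its own page tables.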
    50. 50. Hardware MMU Virtualization • AMD RVI and Intel EPT permit two levels of address mapping: • Guest page tables • Nested page tables • When a virtual address is accessed, the hardware walks both the guest page tables and the nested page tables • Eliminates the need for the VMM to synchronize shadow page tables with guest page tables • Can affect the performance of applications that stress the TLB • Increases the cost of a page walk • Can be mitigated by the use of Large Pages
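To illustrate why a nested page walk costs more, the following sketch counts worst-case memory references per TLB miss. It assumes multi-level page tables with no paging-structure caches; the function names are hypothetical and the level counts (4 for small pages, 3 when large pages remove the last level) are assumptions for the example.

```python
def native_walk_refs(levels):
    """Memory references for a native page walk: one per page table level."""
    return levels


def nested_walk_refs(guest_levels, nested_levels):
    """Worst-case references for a two-dimensional (guest + nested) walk.

    Each guest page table pointer is a guest-physical address, so it needs a
    full nested walk plus the read of the guest entry itself. The final
    guest-physical address then needs one more nested walk.
    """
    per_guest_level = nested_levels + 1
    return guest_levels * per_guest_level + nested_levels


small_pages = nested_walk_refs(4, 4)  # 4-level guest and nested tables
large_pages = nested_walk_refs(3, 3)  # one level fewer in both dimensions
print(native_walk_refs(4), small_pages, large_pages)
```

Under these assumptions a miss costs 24 references instead of 4 with small pages, and 15 when large pages are used in both the guest and the host. Large pages also let each TLB entry cover more memory, so misses happen less often in the first place.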
    51. 51. Memory Virtualization Overhead • Software MMU virtualization incurs CPU overhead: • When new processes are created (new address spaces are created) • When context switching occurs (address spaces are switched) • When running large numbers of processes (shadow page tables need updating) • When allocating or deallocating pages • Hardware MMU virtualization incurs CPU overhead: • When there is a TLB miss