Virtualization Primer for Java Developers



Virtualization Technical Deep Dive
Key Concepts for Java Developers


  1. Virtualization Technical Deep Dive: Key Concepts for Developers
     Richard McDougall, VMware
     Chicago, October 19-22, 2010
  2. We'll be covering:
     •  Virtualization capabilities
     •  Workstation virtualization
     •  How virtual machines work, and what the overhead is
     •  How server virtualization/consolidation works
     •  Java and consolidation on server virtualization
     SpringOne 2GX 2009. All rights reserved. Do not distribute without permission.
  3. What is Virtualization?
  4. Three Properties of Virtualization
     Partitioning:
     •  Run multiple operating systems on one physical machine
     •  Fully utilize server resources
     •  Support high availability by clustering virtual machines
     Isolation:
     •  Isolate faults and security at the virtual-machine level
     •  Dynamically control CPU, memory, disk and network resources per virtual machine
     •  Guarantee service levels
     Encapsulation:
     •  Encapsulate the entire state of the virtual machine in hardware-independent files
     •  Save the virtual machine state as a snapshot in time
     •  Re-use or transfer whole virtual machines with a simple file copy
  5. Virtualization for Desktops/Laptops
     •  Desktop products: VMware Fusion and Workstation
     •  Features for developers:
        –  Run multiple OS versions concurrently
        –  Test server applications on your desktop/laptop
        –  Leverage the record/replay capability for debugging
  6. Virtualization for Servers. Problem: Underutilized Servers
     •  Consolidation targets are often <30% utilized
     •  Windows average utilization: 5-8%
     •  Linux/Unix average: 10-35%
  7. Initial Virtualization Benefits: Consolidation
     •  Servers: 1,000 before VMware; 80 after
     •  Storage: direct attach before; tiered SAN and NAS after
     •  Network: 3,000 cables/ports and 200 racks before; 400 cables/ports and 10 racks after
     •  Facilities: 400 power whips before; 20 power whips after
  8. Next Benefit: Simpler Management
     •  VMotion technology moves running virtual machines from one host to another while maintaining continuous service availability
     •  Enables resource pools
     •  Enables high availability
  9. Pooling of Resources
     •  Resource pools replace hosts as the primary compute abstraction
  10. Automated Pool of Resources
      •  vCenter rebalances the cluster: an imbalanced cluster (heavy load on some hosts, lighter load on others) becomes a balanced cluster
  11. DRS Scalability: Transactions per Minute (higher is better)
      •  An already-balanced cluster sees fewer gains
      •  Higher gains (>40%) with more imbalance
  13. "Hosted" vs vSphere Virtualization Architecture
      •  Hosted (VMware Fusion, Workstation): guests run on top of a host operating system (Linux, Windows, Mac OS X), which runs on the physical hardware
      •  vSphere (server virtualization): guests run directly on the hypervisor, which runs on the physical hardware
  14. "Hosted" Virtualization Architecture
      •  The virtual CPU abstraction is created by the "monitor"
      •  Each VM is an OS process, e.g.:
            rmc$ ps -fp 4295
            UID   PID  PPID  C  STIME     TTY  TIME      CMD
            0     4295 1     0  18:15.66  ??   21:05.14  /Library/Application Support/VMware Fusion/vmware-vmx /Users/rmc/Documents/Virtual Machines/Windows XP Pro.vmwarevm/Windows XP Pro.vmx
      •  The monitor supports BT (binary translation), HW (hardware assist), and PV (paravirtualization)
      •  Memory is allocated by the host OS and virtualized by the monitor
      •  Network and I/O devices are emulated and proxied through native device drivers
      •  VM configuration lives in the .vmx file, e.g.:
            rmc$ more Windows XP Pro.vmx
            virtualHW.version = "7"
            memsize = "776"
            ide0:0.fileName = "Windows XP Professional.vmdk"
            ethernet0.connectionType = "nat"
  15. Inside the Monitor: Classical Instruction Virtualization (Trap-and-Emulate)
      •  Nonvirtualized ("native") system:
         –  OS runs in privileged mode (Ring 0) and "owns" the hardware
         –  Application code (Ring 3) has less privilege
      •  Virtualized:
         –  VMM is most privileged, in Ring 0 (for isolation)
         –  Classical "ring compression" or "de-privileging": run the guest OS kernel in Ring 1; privileged instructions trap and are emulated by the VMM
         –  But this does not work for x86 (some privileged instructions lack traps)
  16. Binary Translation of Guest Code
      •  Translate guest kernel code
      •  Replace privileged instructions with safe "equivalent" instruction sequences
      •  No need for traps
      •  BT is an extremely powerful technology:
         –  Permits any unmodified x86 OS to run in a VM
         –  Can virtualize any instruction set
  17. Combining BT and Direct Execution
      •  Direct execution for user-mode guest code
      •  Binary translation for kernel-mode guest code
      •  Faults, syscalls and interrupts transfer control to the VMM; IRET/sysret returns to direct execution
  18. BT Mechanics
      •  Each translator invocation consumes one input basic block (guest code) and produces one output basic block
      •  Output is stored in a translation cache:
         –  Future reuse
         –  Amortizes translation costs
         –  Guest-transparent: no patching "in place"
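The translation-cache idea above is just memoization keyed by the address of the input basic block. A minimal sketch in Java (not VMware's code; the class and method names are invented for illustration, and strings stand in for translated machine code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of a BT translation cache: each guest basic block is
// translated at most once; later invocations reuse the cached output block.
public class TranslationCache {
    private final Map<Long, String> cache = new HashMap<>(); // guest addr -> translated block
    private final Function<Long, String> translator;         // the (expensive) translator
    private int translations = 0;                            // actual translator invocations

    public TranslationCache(Function<Long, String> translator) {
        this.translator = translator;
    }

    // Look up the translated block for a guest basic block, translating on a miss.
    public String lookup(long guestAddress) {
        return cache.computeIfAbsent(guestAddress, addr -> {
            translations++;
            return translator.apply(addr);
        });
    }

    public int translationCount() { return translations; }

    public static void main(String[] args) {
        TranslationCache tc =
            new TranslationCache(addr -> "translated@" + Long.toHexString(addr));
        tc.lookup(0x1000L);
        tc.lookup(0x2000L);
        tc.lookup(0x1000L); // cache hit: translator is not invoked again
        System.out.println(tc.translationCount()); // prints 2
    }
}
```

The guest's own code is never patched; only the cache contents change, which is what makes the scheme guest-transparent.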
  19. Intel VT / AMD-V: 1st Generation HW Support
      •  Key feature: root vs. guest CPU mode
         –  VMM executes in root mode
         –  Guest (OS, apps) executes in guest mode
      •  VMM and guest run as "co-routines":
         –  VM enter, guest runs; a while later, VM exit, VMM runs; ...
  20. Qualitative Comparison of BT and VT-x/AMD-V
      •  VT-x/AMD-V loses on:
         –  Exits (costlier than "callouts")
         –  No adaptation (cannot eliminate exits)
         –  Page table updates
         –  Memory-mapped I/O
         –  IN/OUT instructions
      •  VT-x/AMD-V wins on:
         –  System calls
         –  Almost all code runs "directly"
         –  No traps for privileged instructions
      •  BT loses on:
         –  System calls
         –  Translator overheads
         –  Path lengthening
         –  Indirect control flow
      •  BT wins on:
         –  Page table updates (adaptation)
         –  Memory-mapped I/O (adaptation)
         –  IN/OUT instructions
  21. Can I Virtualize CPU-Intensive Applications?
      •  Most CPU-intensive applications have very low overhead on VMware ESX 3.x compared to native
      •  SPECcpu results covered by the O. Agesen and K. Adams paper
      •  WebSphere results published jointly by IBM/VMware
      •  SPECjbb results from recent internal measurements
  22. Virtualizing Virtual Memory
      •  To run multiple VMs on a single system, another level of memory virtualization must be done
      •  The guest OS still controls the virtual-to-physical mapping: VA -> PA
      •  The guest OS has no direct access to machine memory (to enforce isolation)
      •  The VMM maps guest physical memory to actual machine memory: PA -> MA
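The two-level mapping above composes: every guest access conceptually goes VA -> PA (guest page table) and then PA -> MA (VMM page table). A minimal sketch, with hash maps of page numbers standing in for real page tables (names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of two-level memory virtualization: the guest owns the
// VA->PA mapping, the VMM owns the PA->MA mapping, and a translation is the
// composition of the two.
public class TwoLevelTranslation {
    public static final Map<Long, Long> guestPageTable = new HashMap<>(); // VA -> PA
    public static final Map<Long, Long> vmmPageTable = new HashMap<>();   // PA -> MA

    public static long translate(long va) {
        Long pa = guestPageTable.get(va);  // first level: controlled by the guest OS
        if (pa == null) throw new IllegalStateException("guest page fault");
        Long ma = vmmPageTable.get(pa);    // second level: controlled by the VMM
        if (ma == null) throw new IllegalStateException("hidden fault: VMM would allocate on demand");
        return ma;
    }

    public static void main(String[] args) {
        guestPageTable.put(0x10L, 0x7L); // guest maps virtual page 0x10 to "physical" page 0x7
        vmmPageTable.put(0x7L, 0x42L);   // VMM backs guest-physical page 0x7 with machine page 0x42
        System.out.println(Long.toHexString(translate(0x10L))); // prints 42
    }
}
```

The shadow page tables of the next slide are essentially a cache of this composition: a direct VA -> MA map that skips the two lookups on the hot path.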
  23. Virtualizing Virtual Memory: Shadow Page Tables
      •  The VMM builds "shadow page tables" to accelerate the mappings
         –  Shadows map VA -> MA directly
         –  Avoids doing two levels of translation on every access
         –  The TLB caches the VA -> MA mapping
         –  Leverages the hardware walker for TLB fills (walking the shadows)
         –  When the guest changes VA -> PA, the VMM updates the shadow page tables
  24. 2nd Generation Hardware Assist: Nested/Extended Page Tables
      •  The guest page-table pointer provides the VA -> PA mapping; the VMM's nested page-table pointer provides the PA -> MA mapping
      •  On a TLB miss, the hardware walks both levels and fills the TLB with a VA -> MA translation
  25. Hardware-Assisted Memory Virtualization
      •  Chart: efficiency improvement (scale 0-60%) for Apache, compile, SQL Server, and Citrix XenApp workloads
  26. "Hosted" vs vSphere Virtualization Architecture (recap)
      •  Hosted (VMware Fusion, Workstation): guests run on top of a host operating system (Linux, Windows, Mac OS X), which runs on the physical hardware
      •  vSphere (server virtualization): guests run directly on the hypervisor, which runs on the physical hardware
  27. vSphere Virtualization Architecture
      •  The virtual CPU abstraction is created by the "monitor"; each VM is an OS process
      •  The monitor supports BT (binary translation), HW (hardware assist), and PV (paravirtualization)
      •  The VMkernel provides the scheduler, memory allocator, virtual switch, file system, and native NIC/I/O drivers
      •  Memory is allocated by the OS and virtualized by the monitor
      •  Network and I/O devices are emulated and proxied through native device drivers
  28. Performance: Ability to Satisfy Performance Demands (from general-population apps up to mission-critical apps)
      •  ESX 2.x (2003): overhead 30-60%; 2 VCPUs; 3.6 GB VM RAM; 64 GB phys RAM; 16-core PCPUs; <10,000 IOPS; 380 Mb/s network; monitor type: binary translation
      •  VI 3.0 (2005): overhead 20-40%; 2 VCPUs; 16 GB VM RAM; 64 GB phys RAM; 16-core PCPUs; 10,000 IOPS; 800 Mb/s network; 64-bit OS support; Gen-1 HW virtualization; monitor type: VT/SVM
      •  VI 3.5 (2007): overhead 10-30%; 4 VCPUs; 64 GB VM RAM; 256 GB phys RAM; 64-core PCPUs; 100,000 IOPS; 9 Gb/s network; 64-bit OS support; Gen-2 HW virtualization; monitor type: NPT
      •  vSphere 4.0 (2009): overhead 2-15%; 8 VCPUs; 255 GB VM RAM; 1 TB phys RAM; 64-core PCPUs; 350,000 IOPS; 28 Gb/s network; 320 VMs per host; 512 vCPUs per host; monitor type: EPT
  29. High-Throughput Web Workloads (SPECweb)
      •  Overall response time is lower when CPU utilization is less than 100%, due to multi-core offload
  30. >95% of All Databases Fit in a Virtual Machine
  31. CPUs and Scheduling
      •  The VMkernel scheduler schedules virtual CPUs on physical CPUs
      •  Virtual-time-based proportional-share CPU scheduler
      •  Flexible and accurate rate-based controls over CPU time allocations
      •  NUMA/processor/cache topology aware
      •  Provides graceful degradation in over-commitment situations
      •  High scalability with low scheduling latencies
      •  Fine-grained built-in accounting for workload observability
      •  Support for vSMP virtual machines
  32. VM Scheduling: How Will Multiple VMs Operate?
      •  VM states: running (%used), waiting (%twait), ready to run (%ready)
      •  When does a VM go to the "ready to run" state?
         –  The guest wants to run or needs to be woken up (to deliver an interrupt)
         –  All available CPU is running other VMs
  33. Resource Controls: Performance SLA
      •  Reservation:
         –  Minimum service-level guarantee (in MHz), even when the system is overcommitted
         –  Needs to pass admission control
      •  Shares (apply between reservation and limit):
         –  CPU entitlement is directly proportional to a VM's shares and depends on the total number of shares issued
         –  Abstract number; only the ratio matters
      •  Limit:
         –  Absolute upper bound on CPU entitlement (in MHz), even when the system is not overcommitted
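The interaction of the three controls can be sketched numerically. This is a simplified model for intuition only, not VMware's scheduler: each VM receives its reservation, spare capacity is divided in proportion to shares, and the result is clamped to the limit (with no redistribution of clamped surplus, which a real scheduler would do):

```java
// Illustrative model of reservation / shares / limit on a single host.
public class CpuEntitlement {
    public static long[] entitle(long capacityMhz, long[] reservation,
                                 long[] limit, long[] shares) {
        int n = shares.length;
        long[] e = reservation.clone();        // reservation is guaranteed first
        long spare = capacityMhz;
        for (long r : reservation) spare -= r; // admission control keeps spare >= 0
        long totalShares = 0;
        for (long s : shares) totalShares += s;
        for (int i = 0; i < n; i++) {
            long extra = spare * shares[i] / totalShares; // only the ratio matters
            e[i] = Math.min(limit[i], reservation[i] + extra); // limit caps the result
        }
        return e;
    }

    public static void main(String[] args) {
        // Two VMs on a 3000 MHz host: equal 500 MHz reservations,
        // VM0 holds twice the shares of VM1, limits are not binding.
        long[] e = entitle(3000,
                           new long[]{500, 500},
                           new long[]{3000, 3000},
                           new long[]{2000, 1000});
        System.out.println(e[0] + " " + e[1]); // spare 2000 MHz split 2:1 -> 1833 1166
    }
}
```

Doubling every VM's shares changes nothing, which is why the slide calls shares an abstract number.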
  34. vSphere Memory Management
      •  Thin provisioned (undercommitted): two 1 GB VMs each using only 200 MB; 2 GB of configured VMs on a 1 GB host is OK
      •  Overcommitted: two 1 GB VMs each actively using 1 GB on a 1 GB host leads to paging and swapping to disk
  35. Virtual Memory Levels
      •  Application: guest "virtual" memory
      •  Operating system: guest "physical" memory
      •  Hypervisor: "machine" memory
  36. Application Memory Management
      •  Starts with no memory
      •  Allocates memory through syscalls to the operating system
      •  Often frees memory voluntarily through syscalls
      •  Explicit memory allocation interface with the operating system
  37. Operating System Memory Management
      •  Assumes it owns all physical memory
      •  No memory allocation interface with the hardware: it does not explicitly allocate or free physical memory
      •  Defines the semantics of "allocated" and "free" memory:
         –  Maintains "free" and "allocated" lists of physical memory
         –  Memory is "free" or "allocated" depending on which list it resides on
  38. Hypervisor Memory Management
      •  Very similar to operating system memory management:
         –  Assumes it owns all machine memory
         –  No memory allocation interface with the hardware
         –  Maintains lists of "free" and "allocated" memory
  39. VM Memory Allocation
      •  A VM starts with no physical memory allocated to it
      •  Physical memory is allocated on demand:
         –  The guest OS does not explicitly allocate
         –  The hypervisor allocates on the first VM access to memory (read or write)
  40. VM Memory Reclamation
      •  Guest physical memory is not "freed" in the typical sense:
         –  The guest OS moves memory to its "free" list
         –  Data in "freed" memory may not have been modified
      •  The hypervisor isn't aware when the guest frees memory:
         –  The freed memory's state is unchanged
         –  No access to the guest's "free" list
         –  Unsure when to reclaim "freed" guest memory
  41. VM Memory Reclamation, Cont'd
      •  Inside the VM, the guest OS allocates and frees... and allocates and frees... and allocates and frees...
      •  As seen by the hypervisor, the VM allocates... and allocates... and allocates...
      •  The hypervisor needs some way of reclaiming memory!
  42. Ballooning
      •  Inflate balloon (+ pressure): the guest OS may free buffers or page out to virtual disk
      •  Deflate balloon (- pressure): the guest OS may grow buffers or page in from virtual disk
      •  The guest OS manages its own memory; ballooning is implicit cooperation between hypervisor and guest
  43. Java Memory Management (HotSpot)
  44. Java Heap Usage
      •  Chart: VM memory usage vs. the JVM heap size (-Xmx=); garbage collection cycles usage within the heap, while VM-level usage stays near the full heap size
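The quantities on this slide can be observed from inside the JVM with the standard `Runtime` API; a minimal sketch (run with an explicit `-Xmx` to see the ceiling it sets):

```java
// Observe JVM heap usage: maxMemory() reflects the -Xmx ceiling, while
// totalMemory()/freeMemory() show the committed heap and the free space
// within it. Garbage collection moves "used" down without shrinking the
// VM-level footprint much, which is the slide's point.
public class HeapStats {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();      // upper bound, set by -Xmx
        long total = rt.totalMemory();  // currently committed heap
        long free = rt.freeMemory();    // free space within the committed heap
        long used = total - free;       // what the application actually uses
        System.out.printf("max=%dMB committed=%dMB used=%dMB%n",
                max >> 20, total >> 20, used >> 20);
    }
}
```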
  45. VMware ESX and Java Memory Management Combined
  46. Java Heap Usage Without Reservations
      •  Chart: VM config size vs. VM usage vs. JVM heap size (-Xmx=); the VM's configured size exceeds what the JVM heap needs
  47. Java Heap Usage With a VM Reservation
      •  Chart: a memory reservation (from 0 MB up toward the JVM heap limit, out of the VM's total MB) guarantees machine memory for the heap rather than leaving it subject to reclamation
  48. Performance Measurement in a Virtual World
      •  Traditionally, the OS was the authority
      •  The operating system performs various roles:
         –  Application runtime libraries
         –  Resource management (CPU, memory, etc.)
         –  Hardware and driver management
      •  Performance and scalability of the OS were paramount
      •  Performance observability tools are a feature of the OS
  49. Performance Measurement in a Virtual World
      •  The OS becomes the "application library", and the hypervisor becomes the authority
  50. Important Notes About Measuring Performance
      •  Resources measured from within the guest OS may not be accurate:
         –  The OS is sharing physical resources with others
         –  CPU utilization is often under-reported (some CPU time is stolen by other guest OSes)
      •  Time measurements:
         –  Coarse-grained time measurements are correct (if VMware Tools are installed/enabled)
         –  Fine-grained measurements are subject to jitter (don't try to measure sub-millisecond response times without special tools)
         –  CPU steals add to the latency of non-CPU measured events (e.g. I/O response times)
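One practical way to follow the jitter advice above from Java: take many samples of a fine-grained measurement and report a robust statistic such as the median, rather than trusting any single `System.nanoTime()` delta. This is a sketch of the general technique, not something prescribed by the deck:

```java
import java.util.Arrays;

// Inside a VM, individual nanoTime() deltas can be inflated by scheduling
// jitter and CPU steal; the median over many samples is far less sensitive
// to those outliers than a single measurement or the mean.
public class LatencySample {
    public static long medianNanos(Runnable op, int samples) {
        long[] t = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            op.run();
            t[i] = System.nanoTime() - start; // one fine-grained sample
        }
        Arrays.sort(t);
        return t[samples / 2]; // median of the sampled latencies
    }

    public static void main(String[] args) {
        long med = medianNanos(() -> {
            double x = 0;
            for (int i = 0; i < 10_000; i++) x += Math.sqrt(i);
            if (x < 0) System.out.println(x); // keep the work from being optimized away
        }, 101);
        System.out.println("median latency: " + med + " ns");
    }
}
```

Even so, per the slide, sub-millisecond absolute numbers from inside a guest should be treated with suspicion; relative comparisons are safer.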
  51. Tools for Performance Analysis
      •  Guest tools: vmstat, mpstat, management tools
      •  VirtualCenter client (VI client):
         –  Per-host and per-cluster stats
         –  Graphical interface
         –  Historical and real-time data
      •  esxtop: per-host statistics; command-line tool found in the console OS
      •  Java SDK: allows you to collect only the statistics you want
  52. Potential Impacts on Performance
      •  Virtual machine contributors to latency:
         –  CPU overhead can contribute to latency (but it's small!)
         –  Scheduling latency (VM runnable, but waiting...)
         –  Waiting for a global memory paging operation
         –  Disk reads/writes taking longer
      •  Virtual machine impacts on throughput:
         –  Throughput ceiling if not enough resources are allocated
         –  Throughput ceiling if not enough virtual CPU/memory is allocated
  53. vSphere Instrumentation Points
      •  Guest level: vCPU, guest, service console, virtual disk
      •  Monitor level: vNIC (virtual NIC), VMHBA (virtual SCSI)
      •  VMkernel level: scheduler, memory allocator, virtual switch, file system, NIC and I/O drivers
      •  Hardware level: pCPU, physical disk (HBA), pNIC
  54. VI Client Chart
      •  Chart type: real-time vs. historical
      •  Selectable: object, counter type, rollup, stats type
  55. CPU Capacity (screenshot from VI Client): Some Caveats on Ready Time
      •  Used time ~ ready time: may signal contention; however, the host might not be overcommitted, due to workload variability
      •  In this example there are periods of activity and idle periods, so the CPU isn't overcommitted all the time (ready time < used time during the idle periods)
  56. esxtop
      •  What is esxtop?
         –  Performance troubleshooting tool for an ESX host
         –  Displays performance statistics in row-and-column format
  57. Performance Summary
      •  Use vSphere rather than Workstation/Fusion for any performance testing: better performance from the scheduler, I/O, large pages, etc.
      •  vSphere will provide near-native performance:
         –  Ensure resources are available (under-commit or use controls)
         –  If I/O intensive, ensure shared storage is configured with enough capacity
         –  Ensure VMware Tools are installed
      •  Use the correct performance instrumentation: vSphere or esxtop
  58. Q&A
      SpringOne 2GX 2010. All rights reserved. Do not distribute without permission.