Hardware Virtualization

Transcript

  • 1. Hardware Virtualization (ETISS lecture) Dr. Leendert van Doorn Senior Fellow October, 2007
  • 2. 2 ETISS 2007 Hardware Virtualization
  • 3. Overview: Introduction, Virtualization 101, CPU Virtualization, I/O Virtualization, Advanced Topics, Summary
  • 4. Introduction
  • 5. Virtualization • Multiple consumers share a resource while maintaining the illusion that each consumer owns the full resource – Memory, processor(s), storage, peripherals, entire machines • Goes all the way back to Popek and Goldberg [1974] • Virtual Machine Monitor (VMM) or hypervisor is the software layer that provides one or more Virtual Machine (VM) abstractions • Some server virtualization examples …
  • 6. Example: Datacenter Consolidation • Reduce total cost of ownership (TCO) – Increased systems utilization (current servers have less than 10% average utilization, less than 50% peak utilization) – Reduce hardware (25% of the TCO) – Space, electricity, cooling (50% of the operating cost of a data center) • Management simplification – Dynamic provisioning – Workload management/isolation – Virtual machine migration – Reconfiguration • Better security • Legacy compatibility • Virtualization protects IT investment • Virtualization is a truly scalable multi-core workload
  • 7. Example: Utility Computing. Google for computing cycles: Amazon is offering a VM that is the equivalent of a 1.7 GHz x86 processor, 1.75 GB of RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth for $0.10 per hour. This includes backup and security.
  • 8. Virtualization is not a Panacea • Increasing utilization through consolidation decreases reliability – Need better hardware reliability, error reporting, and fault tolerance – Need better software fault isolation [Diagram labels: Independent systems, Dependent systems, VMM]
  • 9. Server Workloads Are Changing! • Utility computing is a disruptive business model – Very attractive for small and medium businesses – Managed security, backups and hardware upgrades – Heavily depends on virtualization • Open issues – Improve platform reliability (RAS) – Improve software reliability (fault isolation) – Add per-VM QoS guarantees and billing capabilities – How to scale the number of VMs significantly? World-switch times, direct device access, number of cached VMCBs, overcommitting resources, …
  • 10. Virtualization 101
  • 11. Virtual Machine Monitor Approaches [Diagram: three software stacks. Type 2 VMM: apps and guest OSes on a VMM that runs on top of a host OS. Hybrid VMM: guest OSes on a VMM that runs alongside a host OS. Type 1 VMM: guest OSes on a VMM that runs directly on the hardware.] Examples shown: JVM, CLR, MS Virtual Server, VMware Workstation, VMware ESX, Xen, MS Viridian
  • 12. x86 Virtualization Problem • VMM needs to intercept the privileged instructions that are executed in the guest OS • Traditional x86 architecture is not virtualizable – POPF: pops a value from the stack into EFLAGS – In privileged mode: the interrupt flag (IF) is updated – In non-privileged mode: IF is left unchanged and no exception is raised, so the VMM never sees the attempt • x86 has 16 other sensitive but unprivileged instructions
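The POPF behavior described on this slide can be illustrated with a toy model. This is a minimal sketch, not an emulator: the function name `popf` and the reduced flag model are assumptions for illustration, and only the protected-mode rule that IF changes when CPL <= IOPL is modeled.

```python
IF_BIT = 1 << 9  # interrupt-enable flag (IF) in EFLAGS


def popf(eflags: int, popped: int, cpl: int, iopl: int) -> int:
    """Toy model of how protected-mode POPF treats the IF bit.

    Simplification: only IF is protected here; every other bit is taken
    from the popped value.
    """
    if cpl <= iopl:
        # Privileged enough: IF is taken from the popped value.
        return popped
    # Not privileged enough: IF keeps its old value and nothing traps,
    # so a trap-and-emulate VMM is never notified.
    return (popped & ~IF_BIT) | (eflags & IF_BIT)


# A deprivileged guest kernel (ring 3, IOPL 0) tries to clear IF.
after = popf(eflags=IF_BIT, popped=0, cpl=3, iopl=0)
print("IF still set:", bool(after & IF_BIT))
```

The silent no-op is the problem: the guest believes interrupts are disabled, and the VMM has no trap to hook in order to emulate that.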
  • 13. x86 Virtualization Approaches • Full virtualization – Binary rewriting ▪ Inspect each basic block, rewrite privileged instructions ▪ VMware, Virtual PC, qemu – Hardware assist (AMD SVM, Intel VT-x) ▪ Conceptually, introduce a new CPU mode ▪ Xen, KVM, MS Viridian, (VMware) • Paravirtualization – Modify guest OS to cooperate with the VMM – Xen, L4, Denali • Hybrid combinations – MS Viridian’s enlightenments – VMware’s Virtual Machine Interface (VMI)
  • 14. CPU Virtualization Techniques Comparison
    Technique                      Performance  Legacy guest support  VMM complexity
    Binary rewriting               medium       yes                   high
    Paravirtualization             high         no                    medium
    Hardware assist (current gen)  low          yes                   medium-low
    Hardware assist (next gen)     medium       yes                   medium-low
    Future hardware assist         high         yes                   low
    Legend: low, medium, high
  • 15. Typical Virtualization Software Stack: Microsoft Viridian • Viridian runs Windows and Linux guests • Uses AMD SVM, Intel VT-x and paravirtualization (enlightenments) [Diagram labels: virtualization stack (WMI, VM Service, VM Worker processes), guest applications, Windows kernels with VSPs and VSCs, vmbus, enlightenments, Hypervisor, Hardware]
  • 16. CPU Virtualization
  • 17. Virtualizing The x86 Platform [Diagram labels: CPUs, Memory, PCI Bridge/IOMMU, PCI bus, Disk controller, Network Controller, Video controller; virtualization mechanisms: SVM, Virtual CPU, Nested Paging, IOMMU, Virtual PCI, NPIV, Graphics Virtualization, Done by SW]
  • 18. Processor Virtualization Features • Both AMD and Intel defined processor extensions for their CPU architectures • AMD: Secure Virtual Machine (Pacifica, SVM, AMD-V), Rev F, Rev G, Barcelona, … • Intel: Vanderpool Technology (VT-x, VT-x2) • From 10,000 ft. both look very similar – Container model (similar to mainframe SIE, start interpretive execution)
  • 19. SVM In A Nutshell • Virtualization based on the VMRUN instruction (similar to SIE) • VMRUN executed by host causes the guest to run • Guest runs until it exits back to the host • Host resumes at the instruction following VMRUN • World-switch: host → guest → host • World switches are not cheap [Diagram labels: VMRUN, Guest executes, VMCB]
  • 20. Intel Vanderpool Technology (VT-x) • VT-x adds new instructions such as VMXON, VMXOFF, VMLAUNCH, VMRESUME, VMCALL, … • VM entry is caused by a VMLAUNCH or a VMRESUME • Each guest has a VMCS (virtual-machine control structure) for its state [Diagram: the instruction stream alternates between the VMM and Guest OS 1 / Guest OS 2 via VM entry and VM exit, bracketed by VMXON and VMXOFF]
  • 21. Intercepts and Exits • A guest runs until – it performs an action that causes an exit – it executes a VMCALL/VMMCALL • Exit conditions are specified per guest – Exceptions (e.g., page faults) and interrupts – Instruction intercepts (CLTS, HLT, IN, OUT, INVLPG, MONITOR, MOV CR/DR, MWAIT, PAUSE, RDTSC …) • AMD-V has paged real-mode support • Intel VT-x has shadow registers
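The run-until-exit model on the last three slides can be sketched as a host loop. This is a toy under stated assumptions: `ToyGuest`, `vmrun`, and `run_guest` are hypothetical names, the "guest" is just a list of mnemonics, and none of this reflects the real SVM or VT-x programming interface.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGuest:
    program: list      # instruction mnemonics the guest will execute
    intercepts: set    # per-guest exit conditions
    pc: int = 0        # saved guest state (stands in for the VMCB/VMCS)
    exits: list = field(default_factory=list)


def vmrun(guest: ToyGuest):
    """Run the guest until it hits an intercepted instruction or finishes."""
    while guest.pc < len(guest.program):
        insn = guest.program[guest.pc]
        guest.pc += 1
        if insn in guest.intercepts:
            return insn    # exit: control returns to the host
    return None            # guest program finished


def run_guest(guest: ToyGuest) -> int:
    """Host loop: enter the guest, record each exit, re-enter."""
    entries = 0
    while True:
        entries += 1       # host -> guest world switch
        reason = vmrun(guest)
        if reason is None:
            return entries
        # A real VMM would emulate the intercepted instruction here.
        guest.exits.append(reason)


guest = ToyGuest(["ADD", "CPUID", "ADD", "HLT", "ADD"], {"CPUID", "HLT"})
print(run_guest(guest), guest.exits)
```

Shrinking the intercept set reduces the number of entries, which is the lever the later slides on world-switch cost are concerned with.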
  • 22. Example: Full Virtualization Support for Xen • Most device emulation is implemented in ioemu (PCI, VGA, IDE, NE2100, …) • High performance drivers, such as ioapic, lapic, vpit are implemented in Xen • Developed by Intel, AMD and IBM [Diagram labels: HVM domain, Applications, ioemu, exit, Domain 0, RHEL3_U5, Xen, Hardware]
  • 23. Example: Xen Implementation Statistics • Lines of Code (C, assembly, headers): ▪ Xen: Intel VT-x specific code: 3718 (3.7%) ▪ Xen: AMD SVM specific code: 5721 (5.6%) ▪ Xen: Common HVM code: 5794 (5.7%) ▪ Tools: Common HVM code: 86313 (85%) • Xen 3.0.2 contains both Intel VT-x and AMD SVM support
  • 24. The Cost Of VM Entry And Exit • VM entries and exits are very heavyweight and expensive operations • Intel VT-x specification has 11 pages of conditions that need to be checked just on a single VM entry!
  • 25. Sample #VMEXIT Distribution • Performance benchmark – kernbench -M – Host: linux-2.6.20.2 + kvm-16, x86_64 – Guest: FC6, x86_64, 1.5GB – Guest is not paging • Exit counts (share of all exits): READ_CR0 634749 (0%); READ_CR3 1935734 (0%); READ_CR4 75 (0%); WRITE_CR0 958506 (0%); WRITE_CR3 3255402 (0%); WRITE_CR4 146 (0%); WRITE_DR0 1 (0%); WRITE_DR1 1 (0%); WRITE_DR2 1 (0%); WRITE_DR3 1 (0%); WRITE_DR7 1 (0%); EXCEPTION_PF 1201225361 (90%); INTR 2151104 (0%); NMI 7105 (0%); CPUID 48111299 (3%); HLT 9370980 (0%); IOIO 61350890 (4%); MSR 24 (0%)
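The percentage column on this slide can be recomputed from the raw counts. A minimal sketch: the dictionary only restates the slide's numbers, and the assumption (not stated on the slide) is that shares are truncated to whole percent rather than rounded.

```python
# Exit counts as listed on the slide.
counts = {
    "READ_CR0": 634749, "READ_CR3": 1935734, "READ_CR4": 75,
    "WRITE_CR0": 958506, "WRITE_CR3": 3255402, "WRITE_CR4": 146,
    "WRITE_DR0": 1, "WRITE_DR1": 1, "WRITE_DR2": 1, "WRITE_DR3": 1,
    "WRITE_DR7": 1,
    "EXCEPTION_PF": 1201225361, "INTR": 2151104, "NMI": 7105,
    "CPUID": 48111299, "HLT": 9370980, "IOIO": 61350890, "MSR": 24,
}

total = sum(counts.values())

# Integer division truncates each share to whole percent (assumed convention).
shares = {name: 100 * n // total for name, n in counts.items()}

# Print the exit reasons from most to least frequent.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13} {n:>12} {shares[name]:>3}%")
```

Page-fault exits dominate the run, which is consistent with the later slides singling out shadow page table maintenance as the cost to remove.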
  • 26. Virtualization Challenge • The key problem is how to scale the number of VMs – Reduce overall world-switch times – Eliminate world switches – Overcommit (memory) resources • Reduce world-switch times – Better caching of VMCB state – Tag TLB by ASID • Eliminate world switches – Nested paging (Barcelona) – Direct device assignment (IOMMU) • Additional features – APIC, clock, exit delays, precise exits, performance counters, etc. [Chart: VM World-switch Times, cycles (in %) by processor: F/G, GH-B, Goal]
  • 27. Traditional Virtual Memory Map • Virtual to physical translation (page table) is maintained by the OS • The CPU walks the page tables automatically • Page faults when page is not present or on an access violation • CPU uses a Translation-Lookaside-Buffer (TLB) to cache lookups [Diagram: virtual address space (0 to 1GB) mapped onto physical address space (0 to 4GB)]
  • 28. Virtualized Memory Map [Diagram: guest virtual address space (0 to 1GB) mapped onto guest physical address space (0 to 4GB), which is mapped onto host physical address space (0 to 4GB)]
  • 29. Shadow Page Tables • VMM maintains a shadow copy of the guest page table to translate from guest virtual to host physical • Hardware only sees the shadow copy [Diagram: GUEST maps guest virtual (0 to 1GB) to guest physical (0 to 4GB) address space; VMM maps guest virtual (0 to 1GB) to host physical (0 to 4GB) address space, with cr3 pointing at the VMM's table]
  • 30. Shadow Page Table Issues • Managing the Shadow Page Table is expensive ▪ All page faults are handled by the VMM, it has to walk the guest page tables, and instantiate a shadow entry ▪ VMM needs to propagate access and modify bits – A&M bits are used by the demand paging algorithms – The hardware modifies the shadow page table entry – VMM needs to emulate A&M behavior for the guest – May take up to 3 actual page faults per one guest page fault • Obviously this should be done in hardware …
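The fault path described on this slide (walk the guest page table, then instantiate a shadow entry) can be sketched with dictionaries standing in for page tables. All names here (`handle_page_fault`, `guest_pt`, `p2m`, `shadow`) are hypothetical, page numbers replace real multi-level tables, and access/modify bit propagation is left out.

```python
def handle_page_fault(shadow: dict, guest_pt: dict, p2m: dict, gva_page: int) -> str:
    """Toy VMM page-fault handler for shadow paging.

    shadow:   guest virtual page  -> host physical page (what hardware walks)
    guest_pt: guest virtual page  -> guest physical page (owned by the guest)
    p2m:      guest physical page -> host physical page (owned by the VMM)
    """
    gpa_page = guest_pt.get(gva_page)   # walk the guest's own page table
    if gpa_page is None:
        return "inject-to-guest"        # genuine guest fault: the guest OS handles it
    # Hidden fault: the guest mapping exists but the shadow entry does not yet.
    # Assumes every guest physical page is backed in p2m (no overcommit).
    shadow[gva_page] = p2m[gpa_page]
    return "shadow-filled"


guest_pt = {0x10: 0x2}
p2m = {0x2: 0x7F}
shadow = {}
print(handle_page_fault(shadow, guest_pt, p2m, 0x10), shadow)
print(handle_page_fault(shadow, guest_pt, p2m, 0x11))
```

Every call to this handler corresponds to a world switch, which is why the slide argues for doing the composition in hardware.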
  • 31. Recursive (Page Table) Walker Hardware • Nested paging eliminates shadow page table maintenance by performing a recursive walk in hardware – Available in Barcelona – Reduces number of #VMEXITs by 40-70% [Diagram: guest virtual address space (0 to 1GB) mapped onto guest physical address space (0 to 4GB), which is mapped onto host physical address space (0 to 4GB)]
  • 32. Nested Paging Page Entry Accesses [Diagram: a guest virtual address is split into PML4 offset (bits 47-39), PDP offset (38-30), PD offset (29-21), PT offset (20-12) and physical page offset (11-0). Guest page table walk from gCR3: gPML4E, gPDPE, gPDE, gPTE, gData; these memory accesses are in guest physical space (4KB pages addressed by guest physical address). Each guest physical address is translated by a nested page table walk from nCR3: nPML4E, nPDPE, nPDE, nPTE; these memory accesses are in system physical space (4KB pages addressed by system physical address). PDC hits skip one memory access. The nested walk is repeated for each guest physical address.] Memory access count: nested walk 1-4, then gPML4E 5; nested walk 6-9, then gPDPE 10; nested walk 11-14, then gPDE 15; nested walk 16-19, then gPTE 20; nested walk 21-24, then gData 25
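The access count on this slide follows from a simple rule: every guest physical address touched during the guest walk needs its own nested walk first. A minimal sketch, with `nested_walk_accesses` as a hypothetical helper name and no caching (TLB or PDC hits) modeled:

```python
def nested_walk_accesses(guest_levels: int = 4, nested_levels: int = 4) -> int:
    """Count memory accesses for one uncached translation under nested paging."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += nested_levels  # nested walk to locate the guest table entry
        accesses += 1              # read the guest table entry itself
    accesses += nested_levels      # nested walk for the final guest physical address
    accesses += 1                  # the data access itself (gData)
    return accesses


# Closed form: (guest_levels + 1) * (nested_levels + 1)
print(nested_walk_accesses(4, 4))
```

With 4-level guest and nested tables this counts 24 page-table entry reads plus the data access, matching the numbering on the slide.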
  • 33. Nested Page Table Performance • Sahara, AMD 2.1 GHz (RevG0) • Kernbench • Host OS: SLES 10 (64-bit) Xen • Guest OS: SLES 10 (32-bit) [Bar chart: elapsed time in seconds (lower is better) for Native, NPT 32b on 64b, Shadow 1, Shadow 2, Paravirtualized 64b/64b; data labels shown: 370.9, 364.7, 341.1, 274.8, 269.7]
  • 34. Nested Page Table Performance
  • 35. I/O Virtualization
  • 36. I/O Virtualization • Assign devices directly to a guest VM • Eliminate IPCs to service OS • IOMMU isolates busmaster DMA capable devices [Diagram labels: CPUs, Virtual CPU, Memory, PCI Bridge/IOMMU, PCI bus, Disk controller, Network Controller, Video controller, Physical Address, Virtual I/O Address]
  • 37. I/O Hosting Partition • With an I/O hosting domain/partition, all real drivers are extracted from guest domains. • Can have multiple Device Domains to support different devices. • Reasonable performance possible through batching, (page flipping)… (“Unmodified device driver reuse via virtual machines” OSDI04…) • But performance is just not good enough to get rid of all native devices. [Diagram labels: Logical Partitions, Kernel <-> Hypervisor Interface, Hypervisor, Hardware <-> Hypervisor Interface, Hardware Platform]
  • 38. Direct Device Assignment • With IOMMU can directly give partitions control over Bus/Dev/Func. • Same HW results in improved reliability. • With right HW can support fully virtualized OS (e.g., Windows). • Clean support for legacy OSes and for highest performance devices (majority probably still virtualized) • Migration becomes impossible using the current PCI standards • Device driver in each OS [Diagram labels: Logical Partitions, Kernel <-> Hypervisor Interface, Hypervisor, Hardware <-> Hypervisor Interface, Hardware Platform]
  • 39. Self Virtualizing Devices • Self virtualizing devices allow direct access by partition, e.g., InfiniBand. • No overhead (throughput, latency, serialization) to context switch to device domain. • Is exception for high performance devices… no migration. • Device driver in each OS [Diagram labels: Logical Partitions, Kernel <-> Hypervisor Interface, Hypervisor, Hardware <-> Hypervisor Interface, Hardware Platform]
  • 40. IOMMU Fundamental Features • Address translation and memory protection ▪ Traditional – Simplify I/O devices by eliminating scatter/gather logic – Isolation is key to security protections – Restrict I/O devices to access only allowed memory, preventing “wild” writes and “sneak peeks” ▪ New – Direct assignment of I/O device to VM guest increases I/O efficiency – I/O devices can use same address space as VM guest, reducing hypervisor intervention • Interrupt remapping – Efficiently route and block interrupts – Support new PCI-SIG I/O Virtualization (IOV) specifications
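The translation and protection role described on this slide can be sketched as a per-device lookup table. This is a toy model, not any vendor's table format: `ToyIOMMU`, its methods, and the device names are hypothetical, and a single-level dictionary replaces real multi-level I/O page tables.

```python
PAGE_SIZE = 4096


class ToyIOMMU:
    """Per-device I/O page tables: translate DMA addresses and block the rest."""

    def __init__(self):
        self.tables = {}  # device id -> {io_page: (phys_page, writable)}

    def map(self, dev, io_page, phys_page, writable=False):
        self.tables.setdefault(dev, {})[io_page] = (phys_page, writable)

    def translate(self, dev, io_addr, write=False):
        page, offset = divmod(io_addr, PAGE_SIZE)
        entry = self.tables.get(dev, {}).get(page)
        if entry is None:
            # No mapping for this device: a "wild" access is blocked.
            raise PermissionError(f"{dev}: no mapping for I/O page {page:#x}")
        phys_page, writable = entry
        if write and not writable:
            raise PermissionError(f"{dev}: I/O page {page:#x} is read-only")
        return phys_page * PAGE_SIZE + offset


iommu = ToyIOMMU()
iommu.map("nic", io_page=0, phys_page=0x1234, writable=True)
print(hex(iommu.translate("nic", 0x10, write=True)))
```

Because each device has its own table, a device assigned to one guest cannot reach another guest's memory even though both issue DMA at the same I/O addresses.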
  • 41. Uses of Translation Services • For traditional devices – Extend address reach beyond 32-bits – Translate 0..4G → full address space ▪ Obviates need for bounce buffers – Exclusion range (pass-through for video) • For new devices – Support ATS – Offload address translation layer to hardware • For all devices – VM guest isolation – Device driver isolation – Device isolation
  • 42. Translation Data Services • Follows existing PDE and PTE formats ▪ Allocates Reserved bits to compactly represent level skipping in large address spaces ▪ Allows sharing of CPU and IOMMU page tables • IOMMU walks the tables, caching results ▪ Flush commands to manage incorporated TLB ▪ Supports ATS translation requests
  • 43. Where is the IOMMU? [Diagram labels: Applications, System Software, MMU, RAM, IOMMU, Peripherals, DMA, control]
  • 44. Device Protection: Traditional (No Virtualization) [Diagram labels: Process 1, Process 2, Process 3, Operating System (kernel), IO buffers, MMU, RAM, IOMMU, Peripherals, control]
  • 45. Device Protection: New (I/O Device Assignment in Virtualization) [Diagram labels: Process, OS, VM 1, VM Guest 1, VM Guest 2, VM Guest 3, Hypervisor, Parent VM 0, MMU, RAM, IOMMU, Peripherals, control]
  • 46. I/O Virtualization Topology [Diagram labels: CPUs, DRAM, HT links, Tunnel, PCIe bridges, IOMMU, ATC, IO Hub, PCI Express™ devices and switches, optional remote ATC, PCI, LPC, etc.] Legend: ATC = Address Translation Cache (a.k.a. IOTLB); HT = HyperTransport™ link; PCIe = PCI Express™ link
  • 47. Advanced Topics
  • 48. Secure Initialization • Both Intel and AMD are working on hardware security constructs to enhance their virtualization offerings • Initially driven by Microsoft’s NGSCB design • These enhancements include processor modifications to support – Isolation (VMM) – Trusted computing (TCG) – Trusted keyboard/graphics I/O • AMD introduced a new instruction SKINIT that essentially reboots the CPU into a known state – Start a 64KB secure loader – Interrupts disabled and other processors idled – Inhibit DMA to the secure loader memory area – Measurement of the secure loader is stored in the TCG trusted platform module (using special LPC bus cycles) – Bootstrap/continue with OS startup • Shipping in AMD RevF Opteron since spring 2006 (see OpenTC demo)
  • 49. Overcommitting Memory Resources • Scaling the number of VMs per core requires memory overcommitment – Per core: 32 VMs x 2 GB versus 32 VMs x 100 MB (working set) – Use paging or memory compaction – VMware collapses memory pages with the same content into one and uses copy-on-write to disaggregate if necessary – Depending on workloads, this results in 7-33% memory compaction (Memory Resource Management in VMware ESX Server, OSDI’02) • This does not work for the first generation IOMMU designs – You cannot restart PCI operations – Even if you make PCI operations restartable or use pinning, you still have to deal with devices that do not do end-to-end flow control signaling – How to deal with VM migration? • Hardware support for memory compaction?
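The page-collapsing idea on this slide (share identical pages, split them again on write) can be sketched as follows. This is an illustration of the general technique, not VMware's implementation: `ToySharedMemory` and its methods are hypothetical, and the toy trusts the content hash, whereas a real system would also compare the full page contents before sharing.

```python
import hashlib


class ToySharedMemory:
    """Content-based page sharing with copy-on-write semantics."""

    def __init__(self):
        self.frames = {}    # content hash -> [page bytes, reference count]
        self.mappings = {}  # (vm, page number) -> content hash

    def _release(self, vm, page_no):
        key = self.mappings.pop((vm, page_no), None)
        if key is not None:
            frame = self.frames[key]
            frame[1] -= 1
            if frame[1] == 0:
                del self.frames[key]  # last user gone: free the frame

    def write(self, vm, page_no, data: bytes):
        # Copy-on-write: the writer leaves the shared frame and is mapped
        # to a frame holding the new content (shared again if it matches).
        self._release(vm, page_no)
        key = hashlib.sha256(data).hexdigest()
        frame = self.frames.setdefault(key, [data, 0])
        frame[1] += 1
        self.mappings[(vm, page_no)] = key

    def read(self, vm, page_no) -> bytes:
        return self.frames[self.mappings[(vm, page_no)]][0]

    def frames_in_use(self) -> int:
        return len(self.frames)


mem = ToySharedMemory()
zero_page = bytes(4096)
for vm in range(32):
    mem.write(vm, 0, zero_page)   # 32 VMs, identical page content
print(mem.frames_in_use())
mem.write(5, 0, b"\x01" * 4096)   # one VM diverges
print(mem.frames_in_use())
```

The sketch also hints at why DMA is a problem here: a device writing directly into a shared frame would bypass the copy-on-write step in `write`.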
  • 50. Virtual Machine Migration • Move a running VM to another machine – For example: Maintenance and load rebalancing • Easy when moving between same CPU models • Issues with migrating between different CPU models? – CPUID masquerading – New CPU opcodes no longer cause #UD – Emulating new opcodes on old CPUs – Emulating old opcodes on new CPUs – Differences in FP significance • Do you provide a bit vector to enable/disable features? • Do you support N generations (Power6)? • How much of a problem is this actually? – Software really should obey CPUID, but doesn’t always – Vendors want 100% case coverage; is this really needed? – Opcode set enable is filled with problems
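The "bit vector to enable/disable features" question on this slide can be made concrete with a small sketch of CPUID masquerading. The helper names and host names are hypothetical, and the three feature bits are used only as illustrative labels; the idea is that guests are shown the intersection of the features of every host they might migrate to.

```python
# Illustrative feature bits (labels only; treat the positions as assumptions).
SSE3 = 1 << 0
SSSE3 = 1 << 9
SSE4_1 = 1 << 19


def migration_safe_features(hosts: dict) -> int:
    """Bitwise AND of all hosts' feature vectors: what every host supports."""
    mask = ~0
    for bits in hosts.values():
        mask &= bits
    return mask


def guest_cpuid(host_features: int, mask: int) -> int:
    """Feature bits a guest is shown on a given host after masquerading."""
    return host_features & mask


pool = {
    "old-host": SSE3,
    "new-host": SSE3 | SSSE3 | SSE4_1,
}
mask = migration_safe_features(pool)
print(bin(guest_cpuid(pool["new-host"], mask)))
```

As the slide notes, this only helps software that obeys CPUID: the masked opcodes still execute on the newer host instead of raising #UD.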
  • 51. Nested Virtualization • Enable VMMs to run as guests – Akin to z/VM 2nd level guests – Allows different hypervisors to co-exist – Use binary translation for the 1st level guest? – Make VMM aware of nesting, 1..N-1 aware, N can be unaware • Open issues – Is it transparent to the VMM? – Performance impact & complexity? – z/VM is mainly used by devtest – Could we partition cores instead? [Diagram: VMs running on guest VMMs, which run on a VMM on the hardware]
  • 52. Hypervisor Software Landscape • VMware is the undisputed leader in the x86 virtualization space – Its binary translation technology is currently superior – Only uses VT-x on x86-64 because unlike AMD, Intel does not provide long mode segment limits – Very mature product • Xen is an open source hypervisor shipped as part of Red Hat and SUSE Linux, Virtual Iron – Uses paravirtualization for modified Linux – SVM/VT-x for unmodified guest OS support • KVM is being shipped as part of Red Hat – Uses SVM/VT-x – Linux module • Microsoft Viridian – Uses SVM/VT-x for CPU virtualization and paravirtualized device drivers – Still in development, to be released 180 days after Longhorn server
  • 53. Summary
  • 54. Things to Think About 1. Workloads are changing because of virtualization. We do not have good insight into how (especially true for servers) • No good workloads • What happens when you run at 100% utilization all the time? • What to cache? • What are the right bandwidths? 2. Further adoption of virtualization requires improved platform reliability (RAS) • Platform consolidation reduces overall reliability • How to scale the number of VMs per core? • What makes sense? (16 x 8 = 128 VMs x 2 GB = ¼ TB per socket) • Reduce the cost of or eliminate world switches • Overcommit memory resources
  • 55. Trademark Attribution AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2006 Advanced Micro Devices, Inc. All rights reserved.