A Comparison of Software and Hardware Techniques for x86 Virtualization



  1. 1. A Comparison of Software and Hardware Techniques for x86 Virtualization Paper by Keith Adams & Ole Agesen (VMware) Presentation by Jason Agron
  2. 2. Presentation Overview <ul><li>What is virtualization? </li></ul><ul><li>Traditional virtualization techniques. </li></ul><ul><li>Overview of Software VMM. </li></ul><ul><li>Overview of Hardware VMM. </li></ul><ul><li>Evaluation of VMMs. </li></ul><ul><li>Conclusions </li></ul><ul><li>Questions </li></ul>
  3. 3. “Virtualization” <ul><li>Defined by Popek & Goldberg in 1974. </li></ul><ul><li>Establishes 3 essential characteristics of a VMM: </li></ul><ul><ul><li>Fidelity </li></ul></ul><ul><ul><ul><li>Running on VMM == Running directly on HW. </li></ul></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><ul><li>Performance on VMM == Performance on HW. </li></ul></ul></ul><ul><ul><li>Safety </li></ul></ul><ul><ul><ul><li>VMM manages all hardware resources (correctly?). </li></ul></ul></ul>
  4. 4. Is This Definition Correct? <ul><li>Yes, but its scope should be taken into account. </li></ul><ul><li>It assumes the traditional “trap-and-emulate” style of full virtualization. </li></ul><ul><ul><li>This was extremely popular circa 1974. </li></ul></ul><ul><ul><li>Completely “transparent”. </li></ul></ul><ul><li>It does not account for… </li></ul><ul><ul><li>Paravirtualization. </li></ul></ul><ul><ul><ul><li>Not transparent. </li></ul></ul></ul><ul><ul><ul><li>Guest software is modified. </li></ul></ul></ul>
  5. 5. Full Virtualization <ul><li>Full == Transparent </li></ul><ul><li>Must be able to “detect” when VMM must intervene. </li></ul><ul><li>Definitions: </li></ul><ul><ul><li>Sensitive Instruction: </li></ul></ul><ul><ul><ul><li>Accesses and/or modifies privileged state. </li></ul></ul></ul><ul><ul><li>Privileged Instruction: </li></ul></ul><ul><ul><ul><li>Traps when run in an unprivileged mode. </li></ul></ul></ul>
  6. 6. Traditional Techniques <ul><li>De-privileging </li></ul><ul><ul><li>Run guest programs in a reduced privilege level so that privileged instructions trap. </li></ul></ul><ul><ul><li>VMM intercepts the trap and emulates the functionality of the original call. </li></ul></ul><ul><ul><li>Very similar to the way programs transfer control to the OS kernel during a system call. </li></ul></ul>
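The de-privileging flow above can be sketched in miniature. This is a toy Python model, not VMware's implementation: the "instructions" are placeholder strings, and an exception stands in for the hardware trap that transfers control to the VMM.

```python
# Toy sketch of de-privileging via trap-and-emulate. Guest instructions
# run at reduced privilege; privileged ones "trap" (raise) into the VMM,
# which emulates their effect on the virtual CPU state.

class Trap(Exception):
    """Raised when a deprivileged guest runs a privileged instruction."""

PRIVILEGED = {"cli", "sti"}  # placeholder privileged instructions

def run_deprivileged(instr, vcpu):
    if instr in PRIVILEGED:
        raise Trap(instr)                   # hardware would fault here
    vcpu["work"] = vcpu.get("work", 0) + 1  # stand-in for ordinary work

def vmm_emulate(instr, vcpu):
    # The VMM applies the effect to the *virtual* CPU, not the real one.
    if instr == "cli":
        vcpu["IF"] = 0
    elif instr == "sti":
        vcpu["IF"] = 1

def run_guest(program, vcpu=None):
    vcpu = {} if vcpu is None else vcpu
    for instr in program:
        try:
            run_deprivileged(instr, vcpu)
        except Trap as trap:
            vmm_emulate(trap.args[0], vcpu)
    return vcpu
```

The try/except pair mirrors the system-call analogy on the slide: the trap is the only doorway from deprivileged guest code into the VMM.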
  7. 7. Traditional Techniques <ul><li>Primary & Shadow Structures </li></ul><ul><ul><li>Each virtual system’s privileged state differs from that of the underlying HW. </li></ul></ul><ul><ul><li>Therefore, the VMM must provide the “correct” environment to meet the guests’ expectations. </li></ul></ul><ul><li>Guest-level primary structures reflect the state that a guest sees. </li></ul><ul><li>VMM-level shadow structures are copies of primary structures. </li></ul><ul><ul><li>Kept coherent via “memory traces”. </li></ul></ul>
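A hypothetical sketch of the primary/shadow relationship, with made-up frame numbers: the primary table is what the guest sees, the shadow is the VMM's copy that real hardware would consume, and the trace handler models the protection fault the VMM takes when the guest writes its (write-protected) primary table.

```python
# Toy model of primary/shadow page tables kept coherent via a write
# trace. Mapping values are fabricated for illustration.

class ShadowMMU:
    def __init__(self):
        self.primary = {}  # guest's view: vpn -> guest-physical pfn
        self.shadow = {}   # VMM's copy:   vpn -> machine pfn
        self.g2m = {}      # guest-physical pfn -> machine pfn

    def trace_handler(self, vpn, gpfn):
        # Invoked where real hardware would fault on a store to the
        # write-protected primary page (the "memory trace").
        self.primary[vpn] = gpfn             # let the guest's write land
        if gpfn not in self.g2m:
            self.g2m[gpfn] = 0x1000 + len(self.g2m)  # fake allocation
        self.shadow[vpn] = self.g2m[gpfn]    # propagate: stay coherent
```

Every guest page-table update funnels through the handler, so the shadow never drifts from the primary.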
  8. 8. Traditional Techniques <ul><li>Memory traces </li></ul><ul><ul><li>Traps occur when on-chip privileged state is accessed/modified. </li></ul></ul><ul><ul><li>What about off-chip privileged state? </li></ul></ul><ul><ul><ul><li>i.e. page tables. </li></ul></ul></ul><ul><ul><ul><ul><li>They can be accessed by LOADs/STOREs. </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Either by CPU or DMA-capable devices. </li></ul></ul></ul></ul></ul><ul><li>HW page protection schemes are employed to “detect” when this happens. </li></ul>
  9. 9. Refinements to Classical Virtualization <ul><li>Traps are expensive! </li></ul><ul><li>Improve the Guest/VMM interface: </li></ul><ul><ul><li>AKA Paravirtualization. </li></ul></ul><ul><ul><li>Allows for higher-level information to be passed to the VMM. </li></ul></ul><ul><ul><li>Can provide features beyond the baseline of “classic” virtualization. </li></ul></ul><ul><li>Improve the VMM/HW interface: </li></ul><ul><ul><li>IBM’s System/370 - Interpretive Execution Mode. </li></ul></ul><ul><ul><li>Guests allowed safe and direct access to certain pieces of privileged information w/o trapping. </li></ul></ul>
  10. 10. Software VMM <ul><li>x86 - not “classically” virtualizable. </li></ul><ul><ul><li>Visibility of privileged state. </li></ul></ul><ul><ul><ul><li>i.e. Guest can observe its privilege level via the un-protected %cs register. </li></ul></ul></ul><ul><ul><li>Not all sensitive instructions trap. </li></ul></ul><ul><ul><ul><li>i.e. Privileged execution of popf (pop flags) instruction modifies on-chip privileged state. </li></ul></ul></ul><ul><ul><ul><li>Unprivileged execution must trap so that the VMM can emulate its effects. </li></ul></ul></ul><ul><ul><ul><li>Unfortunately, no trap occurs; the privileged parts of the update are silently dropped, as if popf were a no-op. </li></ul></ul></ul>
  11. 11. Software VMM <ul><li>How can x86’s faults be overcome? </li></ul><ul><li>What if guests execute on an interpreter? </li></ul><ul><li>The interpreter can… </li></ul><ul><ul><li>Prevent leakage of privileged state. </li></ul></ul><ul><ul><li>Ensure that all sensitive instructions are correctly detected. </li></ul></ul><ul><li>Therefore it can provide… </li></ul><ul><ul><li>Fidelity </li></ul></ul><ul><ul><li>Safety </li></ul></ul><ul><ul><li>Performance?? </li></ul></ul>
  12. 12. Interpreter-Based Software VMM <ul><li>Authors’ Statement: </li></ul><ul><ul><li>An interpreter-based VMM will not provide adequate performance. </li></ul></ul><ul><ul><ul><li>A single native x86 instruction will take N instructions to interpret. </li></ul></ul></ul><ul><li>Question: </li></ul><ul><ul><li>Is this necessarily true? </li></ul></ul><ul><li>Authors’ Solution: </li></ul><ul><ul><li>Binary Translation. </li></ul></ul>
  13. 13. Properties of This BT <ul><li>Dynamic and On-Demand </li></ul><ul><ul><li>Run-time translation interleaved with code execution. </li></ul></ul><ul><ul><li>Code is translated only when about to execute. </li></ul></ul><ul><ul><li>Laziness avoids problem of distinguishing code & data. </li></ul></ul><ul><li>System-level </li></ul><ul><ul><li>All translation rules are set by the x86 ISA. </li></ul></ul><ul><li>Subsetting </li></ul><ul><ul><li>Input is x86 ISA binary </li></ul></ul><ul><ul><li>Output is a “safe” subset of the ISA. </li></ul></ul><ul><ul><ul><li>Mostly user-mode instructions. </li></ul></ul></ul><ul><li>Adaptive </li></ul><ul><ul><li>Can optimize generated code over time </li></ul></ul>
  14. 14. BT Process <ul><ul><li>Input a TU (Translation Unit) </li></ul></ul><ul><ul><ul><li>Stopping at either: </li></ul></ul></ul><ul><ul><ul><ul><li>12 instructions. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Terminating instruction (usually control flow). </li></ul></ul></ul></ul><ul><ul><li>Translate the TU into a CCF (Compiled Code Fragment). </li></ul></ul><ul><ul><li>Place generated CCF into the TC (Translation Cache). </li></ul></ul>
  15. 15. BT Process <ul><ul><li>CCFs must be chained together to form a “complete” program. </li></ul></ul><ul><ul><li>Each CCF ends in a continuation that acts as a link. </li></ul></ul><ul><ul><li>Continuations are evaluated at run-time… </li></ul></ul><ul><ul><ul><li>Can be translated into jumps </li></ul></ul></ul><ul><ul><ul><li>Can be “removed” (code merely falls through to next CCF). </li></ul></ul></ul><ul><ul><li>If a continuation is never “hit”… </li></ul></ul><ul><ul><ul><li>Then it is never transformed. </li></ul></ul></ul><ul><ul><ul><li>Thus, the BT acts like a just-in-time compiler. </li></ul></ul></ul><ul><ul><li>Software VMM can switch between BT-mode and direct execution. </li></ul></ul><ul><ul><ul><li>Performance optimization. </li></ul></ul></ul>
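The TU/CCF/TC pipeline above can be modeled in a few lines. This is a toy sketch, not VMware's translator: the "ISA" is just strings, and "translation" merely tags each instruction, but the lazy, on-demand structure and the 12-instruction TU limit match the slides.

```python
# Toy binary-translation pipeline: translation units (TUs) of at most
# 12 instructions become compiled code fragments (CCFs) cached in a
# translation cache (TC), translated only when first reached.

TU_LIMIT = 12

def translate_tu(code, pc):
    """Translate one TU starting at pc; stop after TU_LIMIT
    instructions or a terminating control-flow instruction."""
    ccf = []
    while pc < len(code) and len(ccf) < TU_LIMIT:
        instr = code[pc]
        ccf.append(instr + "'")   # the "translated" instruction
        pc += 1
        if instr.startswith("jmp"):
            break                 # terminating instruction ends the TU
    return ccf, pc                # pc is the continuation target

def execute(code):
    tc = {}                       # translation cache: entry pc -> CCF
    pc, trace = 0, []
    while pc < len(code):
        if pc not in tc:          # lazy: translate on first execution
            tc[pc] = translate_tu(code, pc)
        ccf, pc = tc[pc]          # run the CCF, follow the continuation
        trace.extend(ccf)
    return trace, tc
```

Running 15 straight-line instructions through `execute` produces exactly two cached CCFs (entry points 0 and 12), illustrating how continuations chain fragments only along paths that actually execute.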
  16. 16. Adaptive BT <ul><li>Traps are expensive. </li></ul><ul><li>BT can avoid some traps. </li></ul><ul><ul><li>i.e. rdtsc instruction </li></ul></ul><ul><ul><li>TC emulation << Call-out & emulate << Trap-and-emulate. </li></ul></ul><ul><li>Sensitive non-privileged instructions are harder to avoid. </li></ul><ul><ul><li>i.e. LOADs/STOREs to privileged data. </li></ul></ul><ul><ul><li>Use adaptive BT to re-work code. </li></ul></ul>
  17. 17. Adaptive BT <ul><li>Detect instructions that trap frequently </li></ul><ul><li>Adapt the translation of these instructions. </li></ul><ul><ul><li>Re-translate to avoid trapping. </li></ul></ul><ul><ul><ul><li>Jump directly to translation. </li></ul></ul></ul><ul><ul><ul><li>Call out to interpreter. </li></ul></ul></ul><ul><li>Adaptive BT tries to eliminate more and more traps over time. </li></ul>
  18. 18. Hardware VMM <ul><li>Experimental VMM based on new x86 virtualization extensions. </li></ul><ul><ul><li>AMD’s SVM & Intel’s VT. </li></ul></ul><ul><li>New HW features: </li></ul><ul><ul><li>Virtual Machine Control Blocks (VMCBs). </li></ul></ul><ul><ul><li>Guest mode privilege level. </li></ul></ul><ul><ul><li>Ability to transfer control to/from guest mode. </li></ul></ul><ul><ul><ul><li>vmrun - host to guest. </li></ul></ul></ul><ul><ul><ul><li>exit - guest to host. </li></ul></ul></ul>
  19. 19. Hardware VMM <ul><li>VMM executes vmrun to start a guest. </li></ul><ul><ul><li>Guest state is loaded into HW from the in-memory VMCB. </li></ul></ul><ul><ul><li>Guest mode is resumed and the guest continues execution. </li></ul></ul><ul><li>Guests execute until they perform an operation that the VMCB’s control bits mark for interception. </li></ul><ul><ul><li>An exit operation occurs. </li></ul></ul><ul><ul><li>Guest state is saved back to the VMCB. </li></ul></ul><ul><ul><li>VMM state is loaded into HW - switches to host mode. </li></ul></ul><ul><ul><li>VMM begins executing. </li></ul></ul>
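The vmrun/exit round trip above can be modeled abstractly. This is a sketch of the control flow only, with a made-up `guest_step` callback standing in for hardware guest-mode execution; real VMCB layout and exit codes are defined by the processor vendors.

```python
# Sketch of the vmrun/exit round trip: guest register state lives in an
# in-memory VMCB; vmrun loads it, the guest runs until an intercepted
# event occurs, and the exit writes state back for the VMM to inspect.

def vmrun(vmcb, guest_step, max_steps=1000):
    cpu = dict(vmcb["guest_state"])     # load guest state from VMCB
    for _ in range(max_steps):
        exit_reason = guest_step(cpu)   # "guest mode" execution
        if exit_reason is not None:     # intercepted operation -> exit
            vmcb["guest_state"] = cpu   # save guest state back
            vmcb["exit_reason"] = exit_reason
            return exit_reason          # control returns to the VMM
    return "max_steps"

def demo_guest(cpu):
    # Hypothetical guest: runs three "instructions", then does I/O.
    cpu["rip"] += 1
    return "io" if cpu["rip"] >= 3 else None
```

Note that the VMM sees nothing between vmrun and the exit; all of its knowledge of the guest comes from the saved VMCB, which is why decoding exits is costly (slide 21).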
  20. 20. x86 Architecture Extensions
  21. 21. Qualitative Comparison <ul><li>Software wins in… </li></ul><ul><ul><li>Trap elimination via adaptive BT. </li></ul></ul><ul><ul><ul><li>HW replaces traps w/ exits. </li></ul></ul></ul><ul><ul><li>Emulation speed. </li></ul></ul><ul><ul><ul><li>Translations and call-outs essentially jump to pre-decoded emulation routines. </li></ul></ul></ul><ul><ul><ul><li>HW VMM must fetch VMCB and decode trapping instructions before emulating. </li></ul></ul></ul>
  22. 22. Qualitative Comparison <ul><li>Hardware wins in… </li></ul><ul><ul><li>Code density. </li></ul></ul><ul><ul><ul><li>No translation = No replicated code segments </li></ul></ul></ul><ul><ul><li>Precise exceptions. </li></ul></ul><ul><ul><ul><li>BT approach must perform extra work to recover guest state for faults and interrupts. </li></ul></ul></ul><ul><ul><ul><li>HW approach can just examine the VMCB. </li></ul></ul></ul><ul><ul><li>System calls. </li></ul></ul><ul><ul><ul><li>[Can] run w/o VMM intervention. </li></ul></ul></ul>
  23. 23. Qualitative Comparison (Summary) <ul><li>Hardware VMMs… </li></ul><ul><ul><li>Native performance for things that avoid exits. </li></ul></ul><ul><ul><li>However exits are still costly (currently). </li></ul></ul><ul><ul><ul><li>Strongly targeted towards “trap-and-emulate” style. </li></ul></ul></ul><ul><li>Software VMMs… </li></ul><ul><ul><li>Carefully engineered to be efficient. </li></ul></ul><ul><ul><li>Flexible (b/c it isn’t HW). </li></ul></ul>
  24. 24. Experiments <ul><li>3.8 GHz Intel Pentium 4. </li></ul><ul><ul><li>HT disabled (b/c most virtualization products can’t handle this). </li></ul></ul><ul><li>The contenders… </li></ul><ul><ul><li>Mature commercial Software VMM. </li></ul></ul><ul><ul><li>Recently developed Hardware VMM. </li></ul></ul><ul><li>Fair battle? </li></ul>
  25. 25. SPECint & SPECjbb <ul><li>Primarily user-level computations. </li></ul><ul><ul><li>Unaffected by VMMs </li></ul></ul><ul><ul><li>Therefore, performance should be near native. </li></ul></ul><ul><li>Experimental results confirm this. </li></ul><ul><li>4% average slowdown for Software VMM. </li></ul><ul><li>5% average slowdown for Hardware VMM. </li></ul><ul><li>The cause is “host background activity”. </li></ul><ul><ul><li>Windows jiffy rate << Linux jiffy rate </li></ul></ul><ul><ul><li>Windows test closer to native than Linux test. </li></ul></ul>
  26. 26. Apache ab Benchmark <ul><li>Tests I/O efficiency </li></ul><ul><li>SW VMM (and HW VMM?) use host as I/O controller. </li></ul><ul><ul><li>Therefore ~2x overhead of normal I/O </li></ul></ul><ul><li>Experimental results confirm this… </li></ul><ul><ul><li>~ 2x slowdown. </li></ul></ul><ul><ul><li>Both HW and SW VMMs “suck”. </li></ul></ul><ul><ul><li>Windows and Linux tests differ widely </li></ul></ul><ul><ul><ul><li>Windows - single process (less paging). </li></ul></ul></ul><ul><ul><ul><ul><li>HW VMM is better. </li></ul></ul></ul></ul><ul><ul><ul><li>Linux - multiple processes (more paging). </li></ul></ul></ul><ul><ul><ul><ul><li>SW VMM is better. </li></ul></ul></ul></ul><ul><ul><ul><li>Why (hint: VMCB)? </li></ul></ul></ul>
  27. 27. PassMark Benchmarks <ul><li>A synthetic suite of microbenchmarks. </li></ul><ul><ul><li>used to pinpoint various aspects of workstation performance. </li></ul></ul><ul><li>Large RAM test - exhausts memory </li></ul><ul><ul><li>Intended to test paging capability </li></ul></ul><ul><ul><li>SW VMM wins. </li></ul></ul><ul><li>2D Graphics test - hits system calls </li></ul><ul><ul><li>HW VMM wins. </li></ul></ul>
  28. 28. Compile Jobs Test <ul><li>“Less” synthetic test. </li></ul><ul><ul><li>Compilation time of Linux Kernel, Apache, etc. </li></ul></ul><ul><li>SW VMM beats the HW VMM again. </li></ul><ul><ul><li>Big compilation job w/ lots of files = Lots of page faults. </li></ul></ul><ul><ul><li>SW VMM is better at this than HW VMM. </li></ul></ul><ul><li>Compared to native speed… </li></ul><ul><ul><li>SW VMM is ~60% as fast. </li></ul></ul><ul><ul><li>HW VMM is ~55% as fast. </li></ul></ul>
  29. 29. ForkWait Test <ul><li>Test to stress process creation/destruction. </li></ul><ul><ul><li>System calls, context switching, page table modifications, page faults, etc. </li></ul></ul><ul><li>Native = 6.0 seconds. </li></ul><ul><li>SW VMM = 36.9 seconds. </li></ul><ul><li>HW VMM = 106.4 seconds. </li></ul>
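The forkwait idea fits in a few lines (the paper's benchmark is a C program; this is a minimal POSIX-only Python equivalent, not the original): fork a child that exits immediately and wait for it, n times, hammering exactly the system calls, process creation/teardown, and page-table work listed above.

```python
# Minimal forkwait-style loop (POSIX only: uses os.fork).
import os
import time

def forkwait(n):
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)        # child: exit immediately
        os.waitpid(pid, 0)     # parent: reap the child
    return time.perf_counter() - start
```

Every iteration forces page-table creation and destruction, which is why the shadow-paging HW VMM (106.4 s) fares so much worse here than the adaptive SW VMM (36.9 s).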
  30. 30. Nanobenchmarks <ul><li>Tests used to exercise single “virtualization sensitive” operations. </li></ul><ul><li>All tests are conducted using a specially developed guest OS -- FrobOS. </li></ul>
  31. 31. Nanobenchmarks <ul><li>Syscall (Native == HW << SW) </li></ul><ul><ul><li>HW VMM doesn’t intervene. </li></ul></ul><ul><ul><li>SW VMM traps. </li></ul></ul><ul><li>In (SW << Native << HW) </li></ul><ul><ul><li>Native goes off-chip. </li></ul></ul><ul><ul><li>SW VMM interacts with virtual CPU model. </li></ul></ul><ul><ul><li>HW VMM intervenes </li></ul></ul><ul><li>Ptemod (Native << SW << HW) </li></ul><ul><ul><li>Both take a hit (both use shadowing) </li></ul></ul><ul><ul><li>SW VMM can adapt, but still less than ideal. </li></ul></ul><ul><ul><li>HW VMM can’t, so it must always do exit/vmrun. </li></ul></ul>
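A nanobenchmark isolates one operation in a tight timing loop. As a sketch of the style (this only measures the host it runs on, not a VMM, and the harness is invented for illustration):

```python
# Nanobenchmark-style harness: average cost of one operation, in
# nanoseconds, over many iterations of a tight loop.
import os
import time

def ns_per_call(fn, iters=100_000):
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e9

# e.g. ns_per_call(os.getpid) estimates the cost of a cheap syscall;
# run under each VMM, the same loop exposes the syscall gap above.
```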
  32. 32. Analysis of Results <ul><li>SW and HW VMMs are “even” except… </li></ul><ul><ul><li>When BT adaptation helps. </li></ul></ul><ul><ul><ul><li>i.e. page table faults vs. exit/vmrun round-trips. </li></ul></ul></ul><ul><li>They claim that “we have found few workloads that benefit from current HW extensions”. </li></ul><ul><li>BUT… </li></ul><ul><ul><li>HW extensions are getting faster all the time. </li></ul></ul><ul><ul><ul><li>But the “stateless” HW VMM approach still has a memory bottleneck with VMCB access! </li></ul></ul></ul><ul><ul><li>Trouble w/ HW VMM is MMU virtualization. </li></ul></ul><ul><ul><ul><li>HW assisted MMU could relieve VMM of a lot of work! </li></ul></ul></ul><ul><ul><ul><li>Being proposed by both AMD and Intel. </li></ul></ul></ul>
  33. 33. Future/Related Works <ul><li>CISC/RISC? </li></ul><ul><ul><li>Should the HW be more complex to support virtualization? </li></ul></ul><ul><ul><li>Should a complex SW VMM be used? </li></ul></ul><ul><li>Open source? </li></ul><ul><ul><li>Open source OS code allows for paravirtualization. </li></ul></ul><ul><ul><li>What should the OS/VMM interface be? </li></ul></ul><ul><ul><ul><li>It should be investigated, standardized, documented, and most importantly SUPPORTED! </li></ul></ul></ul><ul><ul><li>What should the OS/HW interface be? </li></ul></ul><ul><ul><ul><li>This should be looked at as well! </li></ul></ul></ul>
  34. 34. Conclusions <ul><li>Hardware extensions now allow x86 to execute guests directly (trap-and-emulate style). </li></ul><ul><li>Comparison of SW and HW VMMs… </li></ul><ul><ul><li>Both are able to execute computation-bound workloads at near native speed. </li></ul></ul><ul><ul><li>When I/O and process management is involved. </li></ul></ul><ul><ul><ul><li>SW prevails. </li></ul></ul></ul><ul><ul><li>When there are a lot of system calls. </li></ul></ul><ul><ul><ul><li>HW prevails. </li></ul></ul></ul>
  35. 35. Conclusions <ul><li>SW VMM techniques are very mature. </li></ul><ul><ul><li>Also, very flexible. </li></ul></ul><ul><li>New x86 extensions are relatively immature and present a fixed (inflexible) interface. </li></ul><ul><li>Future work on HW extensions promises to improve performance. </li></ul><ul><li>Hybrid SW/HW VMMs promise to provide benefits of both worlds. </li></ul><ul><li>There is no “clear” winner at this time. </li></ul>
  36. 36. Questions???? <ul><li>References: </li></ul><ul><ul><li>K. Adams and O. Agesen (2006). A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII). ACM Press, New York, NY, 2-13. </li></ul></ul>