Virtualization Technology Overview


                               Liu, Jinsong
                         (jinsong.liu@intel.com)




SSG System Software Division
Agenda
• Introduction
    History
    Usage model
• Virtualization overview
    CPU virtualization
    Memory virtualization
    I/O virtualization
• Xen/KVM architecture
    Xen
    KVM
• Some Intel work for OpenStack
    OAT



                                  2   2012/11/28
Virtualization history

• 1960s            IBM - CP/CMS on S/360, VM/370, …
• 1970s-1980s      Silence
• 1998             VMware - from the SimOS project, Stanford
• 2003             Xen - Xen project, Cambridge
• After that:      KVM / Hyper-V / Parallels …




What is Virtualization

        VM0              VM1                    VMn

          Apps             Apps                   Apps
          Guest OS         Guest OS      ...      Guest OS

               Virtual Machine Monitor (VMM)

        Platform HW
            Memory         Processors         I/O Devices


• VMM is a layer of abstraction
   supports multiple guest OSes
   de-privileges each OS to run as a guest OS
• VMM is a layer of redirection
   redirects the physical platform into many virtual-platform illusions
   provides a virtual platform to each guest OS
Server Virtualization Usage Model
[Figure: four usage models, each drawn as app/OS stacks over a VMM over hardware]

• Server Consolidation - Benefit: cost savings
    Consolidate services
    Power saving
• R&D / Production - Benefit: business agility and productivity
• Disaster Recovery - Benefit: loss savings
    RAS
    Live migration
    Loss relief
• Dynamic Load Balancing - Benefit: productivity
    e.g., rebalance VMs between a host at 90% CPU usage and one at 30%
Agenda

• Introduction
• Virtualization overview
    CPU virtualization
    Memory virtualization
    I/O virtualization
• Xen/KVM architecture
• Some intel work for Openstack




X86 virtualization challenges
• Ring deprivileging
    Goal: isolate the guest OS from
       • Controlling physical resources directly
       • Modifying VMM code and data

    Ring deprivileging layout
       • VMM runs at fully privileged ring 0
       • Guest kernel runs deprivileged:
            •   x86-32: ring 1
            •   x86-64: ring 3
       • Guest apps run at ring 3

    Ring deprivileging problems
       • Unnecessary faulting
            •   some privileged instructions
            •   some exceptions
       • Guest kernel protection (x86-64: kernel and apps share ring 3)

• Virtualization holes
    19 instructions
       • SIDT/SGDT/SLDT …
       • PUSHF/POPF …
    Some user-space holes are hard to fix by a software approach
       • Hard to trap, or
       • Performance overhead



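The trap-and-emulate idea behind ring deprivileging can be sketched as a dispatcher: the deprivileged guest faults on privileged instructions, and the VMM emulates them against per-VM virtual state. A minimal Python sketch (all class and instruction names here are illustrative, not a real VMM interface):

```python
# Minimal trap-and-emulate sketch: a deprivileged guest cannot touch
# real hardware state, so privileged instructions fault into the VMM,
# which emulates them against per-VM virtual state.

class VirtualCPU:
    def __init__(self):
        self.vidt = 0xFEE0_0000   # virtual IDT base, per guest
        self.vif = True           # virtual interrupt-enable flag

class VMM:
    def __init__(self):
        self.vcpu = VirtualCPU()

    def handle_trap(self, insn, operand=None):
        """Called when the deprivileged guest faults on a privileged insn."""
        if insn == "LIDT":            # guest loads its IDT -> update virtual copy
            self.vcpu.vidt = operand
            return None
        if insn == "CLI":             # mask *virtual* interrupts only
            self.vcpu.vif = False
            return None
        if insn == "SIDT":            # a virtualization hole on bare x86:
            return self.vcpu.vidt     # SIDT does not trap from ring 3, so the
                                      # guest would read the *real* IDT base
        raise NotImplementedError(insn)

vmm = VMM()
vmm.handle_trap("LIDT", 0x1000)
assert vmm.handle_trap("SIDT") == 0x1000
vmm.handle_trap("CLI")
assert vmm.vcpu.vif is False
```

The SIDT branch is exactly what the "virtualization holes" bullet refers to: on real hardware this emulation path is never reached without VT-x, because the instruction does not fault.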
X86 virtualization challenges

[Figure: VM0 … VMn stacked side by side over the VMM]

Ring 3:  Guest Apps (all VMs)
Ring 1:  Guest Kernel / Guest OS (all VMs)
Ring 0:  Virtual Machine Monitor (VMM)
Typical X86 virtualization approaches
• Para-virtualization (PV)
     Para-virtualization approach, e.g., Xen
     Modified guest OS is aware of and cooperates with the VMM
     Standardization milestone: Linux 3.0
         • VMI vs. PVOPS
         • Bare metal vs. virtual platform

• Binary Translation (BT)
     Full virtualization approach, e.g., VMware
     Unmodified guest OS
     Translates binary 'on the fly'
         • translation blocks with caching
               •   usually used for kernel code, ~80% native performance
               •   user-space apps run natively as much as possible, ~100% native performance
               •   overall ~95% native performance
         • Complicated
         • Involves excessive complexity, e.g., self-modifying code

• Hardware-assisted Virtualization (VT)
       Full virtualization approach assisted by hardware, e.g., KVM
       Unmodified guest OS
       Intel VT-x, AMD-V
       Benefits:
         • Closes virtualization holes in hardware
         • Simplifies VMM software
         • Optimizes for performance
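The binary-translation approach can be illustrated with a toy translator: it scans a guest code block, replaces sensitive instructions with explicit calls into the VMM, and caches the result per block so each block is translated once. The instruction set and cache shape here are illustrative, not VMware's actual engine:

```python
# Toy binary translator: sensitive instructions are rewritten to VMM
# calls; innocuous instructions pass through and run natively.
# Translated blocks are cached so each guest block is translated once.

SENSITIVE = {"POPF", "SIDT", "SGDT", "CLI", "STI"}

translation_cache = {}

def translate_block(block_addr, instructions):
    if block_addr in translation_cache:          # cache hit: reuse translation
        return translation_cache[block_addr]
    out = []
    for insn in instructions:
        if insn in SENSITIVE:
            out.append(f"CALL vmm_emulate_{insn.lower()}")
        else:
            out.append(insn)                     # runs natively, full speed
    translation_cache[block_addr] = out
    return out

tb = translate_block(0x400000, ["MOV", "POPF", "ADD", "SIDT"])
assert tb == ["MOV", "CALL vmm_emulate_popf", "ADD", "CALL vmm_emulate_sidt"]
assert translate_block(0x400000, []) is tb       # second call hits the cache
```

The cache is what makes the ~80% kernel figure plausible: translation cost is paid once per block, and untranslated user-space code never enters the translator at all.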
Memory virtualization challenges
• Guest OS has 2 assumptions
    expects to own physical memory starting from 0
      • BIOS/legacy OSes are designed to boot from the low 1M of addresses

    expects to own basically contiguous physical memory
      •   OS kernel requires a minimum of contiguous low memory
      •   DMA requires a certain level of contiguous memory
      •   Efficient memory management, e.g., less buddy-allocator overhead
      •   Efficient TLB, e.g., super-page TLB entries

• MMU virtualization
    How to keep the physical TLB valid
    Different approaches involve different complexity and overhead
Memory virtualization challenges

[Figure: each VM sees contiguous guest pseudo-physical memory frames (1-5), which the hypervisor maps onto scattered machine physical memory frames]
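The pseudo-physical/machine split in the figure amounts to a per-VM p2m table: the guest sees contiguous frames starting at 0, while the hypervisor backs them with whatever machine frames happen to be free. A sketch (frame numbers are made up):

```python
# Per-VM p2m table: guest-physical frame numbers (gpfn) are contiguous
# from 0, satisfying the guest OS's assumptions; the machine frame
# numbers (mfn) backing them need not be contiguous at all.

free_mfns = [3, 1, 5, 2, 4]   # scattered free machine frames

def build_p2m(num_frames):
    """Allocate machine frames for a guest that believes it owns
    contiguous physical memory [0, num_frames)."""
    return {gpfn: free_mfns.pop(0) for gpfn in range(num_frames)}

p2m_vm1 = build_p2m(3)
assert sorted(p2m_vm1) == [0, 1, 2]        # contiguous view for the guest
assert p2m_vm1 == {0: 3, 1: 1, 2: 5}       # scattered machine reality
```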
Memory virtualization approaches
• Direct page table
     Guest/VMM in the same linear space
     Guest/VMM share the same page table
• Shadow page table
     Guest page table unmodified
         •   gva -> gpa
     VMM shadow page table
         •   gva -> hpa
     Complexity and memory overhead
• Extended page table
     Guest page table unmodified
         •   gva -> gpa
         •   full control of CR3, page faults
     VMM extended page table
        • gpa -> hpa
        • hardware based
        • good scalability for SMP
        • low memory overhead
        • greatly reduces page-fault VM exits
• Flexible choices
     Para-virtualization
         • Direct page table
         • Shadow page table
     Full virtualization
         • Shadow page table
         • Extended page table

[Figure: GVA -> GPA via the guest page table, then GPA -> HPA via the extended page table; the direct and shadow page tables map GVA -> HPA in a single step]
Shadow page table

[Figure: two parallel page-table trees - the guest's (vCR3 -> page directory -> PDE -> page table -> PTE) and the hypervisor's shadow (pCR3 -> page directory -> PDE -> page table -> PTE)]

• Guest page table remains unmodified and visible to the guest
    Translates from gva -> gpa
• Hypervisor creates a new page table for physical translation
    Uses hpa in PDEs/PTEs
    Translates from gva -> hpa
    Invisible to the guest
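A shadow page table can be sketched as a VMM-maintained gva -> hpa mapping obtained by composing the guest's gva -> gpa table with the VMM's gpa -> hpa table. The guest's table is write-protected, so each guest PTE update traps and the shadow is re-synchronized (the flat dicts below stand in for multi-level page tables):

```python
# Shadow page table sketch: the hardware walks only the shadow
# (gva -> hpa); the guest's own table (gva -> gpa) is write-protected,
# so every guest PTE update traps into the VMM, which propagates the
# change into the shadow.

guest_pt = {}               # gva -> gpa, owned by the guest
p2m      = {7: 70, 8: 80}   # gpa -> hpa, owned by the VMM
shadow   = {}               # gva -> hpa, what the MMU actually uses

def guest_set_pte(gva, gpa):
    """Guest writes a PTE; the write traps and the VMM updates the shadow."""
    guest_pt[gva] = gpa
    shadow[gva] = p2m[gpa]   # compose guest mapping with p2m

guest_set_pte(0x1000, 7)
guest_set_pte(0x2000, 8)
assert shadow == {0x1000: 70, 0x2000: 80}
```

The trap on every guest PTE write is the "complexity and memory overhead" the slide mentions: the VMM must shadow one table per guest address space and keep each one coherent.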
Extended page table

[Figure: a guest linear address is translated by the guest page tables (rooted at the guest CR3) into a guest physical address, which the extended page tables (rooted at the EPT base pointer) translate into a host physical address]

• Extended page table
     Guest has full control over its page tables and events
            • CR3, INVLPG, page faults
     VMM controls the extended page tables
        • The complicated shadow page table is eliminated
        • Improved scalability for SMP guests
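With EPT, every guest-physical address produced by the guest's own page walk is translated a second time through the VMM's table; the hardware composes both stages on every access. A single-level sketch of the two-stage walk (real walks are multi-level, and the addresses are made up):

```python
# Two-stage translation under EPT: the guest page table maps
# gva -> gpa, the extended page table maps gpa -> hpa, and the
# hardware composes them on every access -- no shadow table needed,
# and guest CR3 writes / page faults need not exit to the VMM.

guest_pt = {0x1000: 0x5000}    # gva page -> gpa page (guest-controlled)
ept      = {0x5000: 0x9000}    # gpa page -> hpa page (VMM-controlled)

def translate(gva, page_size=0x1000):
    offset = gva % page_size
    gpa = guest_pt[gva - offset] + offset                  # stage 1: guest walk
    hpa = ept[gpa - gpa % page_size] + gpa % page_size     # stage 2: EPT walk
    return hpa

assert translate(0x1234) == 0x9234
```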
I/O virtualization requirements

[Figure: device and CPU interact via register access, interrupts, and DMA through shared memory]

• I/O device from the OS point of view
      Resource configuration and probing
      I/O requests: port I/O, MMIO
      I/O data: DMA
      Interrupts

• I/O virtualization requires
    presenting the guest OS driver a complete device interface
        • Presenting an existing interface
             •   Software emulation
             •   Direct assignment
        • Presenting a brand-new interface
             •   Para-virtualization
I/O virtualization approaches
• Emulated I/O
     Software emulates a real hardware device
     VMs run the same driver as for the real hardware device
     Good legacy software compatibility
     Emulation overheads limit performance

• Paravirtualized I/O
     Uses abstract interfaces and stacks for I/O services
     FE driver: guest runs virtualization-aware drivers
     BE driver: driver based on a simplified I/O interface and stack
     Better performance than emulated I/O

• Direct I/O
    Directly assign a device to the guest
         • Guest accesses the I/O device directly
         • High performance and low CPU utilization
    DMA issue
         • Guest sets up DMA with guest physical addresses
         • DMA hardware only accepts host physical addresses
    Solution: DMA remapping (a.k.a. IOMMU)
         • An I/O page table is introduced
         • The DMA engine translates according to the I/O page table
    Some limitations under live migration
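The DMA remapping step for direct I/O can be sketched as an IOMMU lookup: the assigned device issues DMA with guest-physical addresses, and the I/O page table, programmed by the VMM per device, rewrites them to host-physical addresses before memory is touched. A toy version (single-level, made-up addresses):

```python
# IOMMU sketch: the assigned device DMAs with guest-physical addresses;
# the I/O page table translates them to host-physical. A gpa the VMM
# never mapped faults instead of corrupting another VM's memory.

io_page_table = {0x1000: 0xA000}   # gpa page -> hpa page, per device

def dma_remap(gpa, page_size=0x1000):
    page = gpa - gpa % page_size
    if page not in io_page_table:
        raise PermissionError(f"IOMMU fault: unmapped gpa {gpa:#x}")
    return io_page_table[page] + gpa % page_size

assert dma_remap(0x1042) == 0xA042
try:
    dma_remap(0x9000)              # device strays outside its mapping
    faulted = False
except PermissionError:
    faulted = True
assert faulted
```

The fault path is the isolation property that makes device assignment safe: without it, a buggy or malicious guest could program its device to DMA anywhere in host memory.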
Virtual platform models

[Figure: three models compared side by side]
• Hypervisor Model - the hypervisor contains P, M, DR, and DM; guests run directly on it
• Host-based Model - the host OS supplies P and M (via an LKM) plus DR; the DM runs as a ULM alongside the guests
• Hybrid Model - a thin U-hypervisor contains P, M, and N; a privileged service VM holds DR and the DM (ULM) on behalf of the guests

     P       Processor Mgt code          DR    Device Driver              N       NoDMA
     M       Memory Mgt code             DM    Device Model
Agenda

• Introduction
• Virtualization
• Xen/KVM architecture
• Some Intel work for OpenStack




Xen Architecture

[Figure: Xen system structure]
• Domain 0 (XenLinux64): control panel (xm/xend), device models, backend virtual drivers, and native device drivers (1/3P)
• DomainU (XenLinux64): front-end virtual drivers (3P)
• HVM domains (32-bit and 64-bit): unmodified OSes with FE drivers, a guest BIOS, and a virtual platform; they enter the hypervisor via VM exits (3D/0D)
• Xen hypervisor (0P): control interface, scheduler, event channels, hypercalls; processor and memory management; I/O emulation (PIT, APIC, PIC, IOAPIC)
• Domains reach the hypervisor via callbacks/hypercalls and reach each other via inter-domain event channels
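The front-end/back-end driver split in the Xen diagram comes down to a shared ring plus an event channel: the FE driver in the guest posts requests on the ring, kicks the event channel, and the BE driver in Domain 0 services the requests against the real hardware. A toy version (the ring and operation names are illustrative, not Xen's actual ring ABI):

```python
# Toy split-driver model: FE (guest) and BE (Domain 0) communicate
# through a shared request/response ring; the event channel is the
# explicit "kick" telling the other side to look at the ring.

from collections import deque

ring = deque()     # shared ring: FE pushes requests, BE consumes them
responses = {}     # completed requests, keyed by request id

def fe_submit(req_id, op):
    ring.append((req_id, op))        # FE places a request on the ring
    be_kicked()                      # event-channel notification

def be_kicked():
    while ring:
        req_id, op = ring.popleft()
        responses[req_id] = f"done:{op}"   # BE drives the real device here

fe_submit(1, "read_block_42")
assert responses[1] == "done:read_block_42"
```

Batching is the point of this design: the FE can queue many requests and kick once, amortizing the domain-crossing cost that makes emulated I/O slow.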
KVM Architecture

[Figure: KVM system structure]
• Guests (Windows, Linux) run in VT-x non-root mode, with one VMCS per vCPU
• Qemu-kvm runs in user space and provides device emulation
• The KVM module in the Linux kernel (root mode) virtualizes the CPU, memory, timers, and interrupt controllers (vCPU, vMEM, vTimer, vPIC, vAPIC, vIOAPIC)
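KVM's split between root and non-root mode is easiest to see as a loop: the kernel module enters the guest, the guest runs in non-root mode until a VM exit, and the exit reason is dispatched either in the kernel or up to qemu-kvm. A schematic version (the exit reasons are real VT-x concepts, but the handlers and the event list are illustrative):

```python
# Schematic KVM run loop: enter non-root mode, run until a VM exit,
# dispatch on the exit reason, re-enter. I/O exits go up to userspace
# (qemu-kvm); lighter exits are handled inside the kernel module.

def run_vcpu(exit_events):
    log = []
    for reason in exit_events:          # each event = one VM exit
        if reason == "IO":
            log.append("qemu-kvm: emulate device access")
        elif reason == "EPT_VIOLATION":
            log.append("kvm: populate EPT mapping")
        elif reason == "HLT":
            log.append("kvm: yield vCPU")
            break
        # ...re-enter non-root mode and keep running the guest
    return log

log = run_vcpu(["EPT_VIOLATION", "IO", "HLT"])
assert log == ["kvm: populate EPT mapping",
               "qemu-kvm: emulate device access",
               "kvm: yield vCPU"]
```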
Agenda

• Introduction
• Virtualization
• Xen/KVM architecture
• Some Intel work for OpenStack




Trusted Pools - Implementation

User specifies:
   Mem > 2G
   Disk > 50G
   GPGPU=Intel
   trusted_host=trusted

[Figure: trusted-pool scheduling flow]
• The user creates a VM through the EC2 API or the OS API
• The OpenStack scheduler's TrustedFilter queries the attestation service (Query API) for each candidate host's trusted/untrusted status
• The OAT-based attestation server attests hosts via the Host Agent API: a host agent on each hypervisor (tboot-enabled HW with TXT) reports measurements, the appraiser checks them against a whitelist DB, and a privacy CA protects host identities

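The TrustedFilter step above can be sketched as an ordinary scheduler filter that keeps only the hosts the attestation service reports as trusted. The attestation client below is a stand-in for the OAT Query API, and all names are illustrative rather than Nova's actual classes:

```python
# Sketch of a trusted-pool scheduler filter: before placing a VM that
# requested trusted_host=trusted, ask the attestation service about
# each candidate host and drop the untrusted ones.

ATTESTATION_DB = {"host1": "trusted", "host2": "untrusted", "host3": "trusted"}

def attest(host):
    """Stand-in for the OAT attestation server's Query API."""
    return ATTESTATION_DB.get(host, "untrusted")   # unknown hosts fail closed

def trusted_filter(hosts, flavor_extra_specs):
    if flavor_extra_specs.get("trusted_host") != "trusted":
        return hosts                      # no trust constraint requested
    return [h for h in hosts if attest(h) == "trusted"]

hosts = ["host1", "host2", "host3"]
assert trusted_filter(hosts, {"trusted_host": "trusted"}) == ["host1", "host3"]
assert trusted_filter(hosts, {}) == hosts
```

Failing closed on unknown hosts matches the intent of a trusted pool: a host that has never been attested should never receive a trust-constrained VM.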