Hardware accelerated Virtualization in the ARM Cortex™ Processors

Speaker notes
  • Key features introduced on the Cortex-A15 that have not been seen before: Virtualization – one could do it before through SOFTWARE, but the Cortex-A15 brings HARDWARE support, making it much more efficient. Large physical addressing – previously ARM cores were limited to 4GB of memory, effectively 2-3GB once peripherals and the like are mapped into the address space. Increasing the number of address bits from 32 to 40 enables this increase in usable memory; note that the virtual memory available to a single program is still limited to 32-bit/4GB. AMBA 4 system coherency – allows different clusters of processors or other cached masters to share data efficiently. ARM has supported up to 4xMP previously; AMBA 4 allows MP to scale well beyond that, with current plans including support for 8- and 16-core systems. AMBA 4 can theoretically support more cores than that, but a Cache Coherent Interconnect with more than four ports would need to be created to realize such 16+ core systems (ARM currently has plans for two- and four-port CCIs).
  • To do: emphasize what the VMM does. The OS can now manage its own page tables and disable interrupts – details on this later. The VMM can be written to trap all hardware accesses by the OS, meaning that OS source need not be changed (one might still want to change it for performance reasons). For instance, the system can be configured so that all CP15 register accesses are trapped to Hyp mode, where they are handled by the VMM. Hyp mode can alias identification registers, e.g. the CPU ID. The interaction between Virtualization and TrustZone is not a subject for now, but the two can co-exist; Virtualization is only supported in the Non-secure world.
  • The LPAE extension allows each OS to live in a 32-bit virtual address space and the VMM to map multiple systems within a 40-bit physical address space. The Guest OS believes that the IPA is the real PA; the second-stage translation is invisible to it. Without the virtualization extensions, VMSAv7 page tables have been extended to allow generation of 40-bit physical addresses at 16MB super-section granularity only, and support for super-sections and 40-bit PA generation is optional in VMSAv7. The LPAE extension (always used for the Stage 2 translation in a virtualized system) allows 4KB granularity over the entire 40-bit address range. Support for LPAE without Virtualization is permitted (but not the other way round).
  • ANIMATED. Rather simplified, but it still shows delivery of a virtual interrupt. What is missing is the incredibly complex process of the hypervisor restoring the Guest OS context, deciding which CPU to deliver the virtual interrupt to, or any one of the 101 details that the hypervisor must deal with.
  • We achieve a big.LITTLE system that can use the right core for the right task, built on coherency and software compatibility: software migration across processors comes at minimal cost, and the migration is transparent to applications and the OS. There is a great deal of innovation and differentiation possible here through tuning of OS power management, use of hypervisors and more. [Internal note: the idea is to rebut asynchronous DVFS without specifically alluding to it.] Even with near-perfect voltage scaling, the “Big” processor cannot match the power and energy efficiency of the simple Kingfisher pipeline for lower-intensity workloads such as MP3 playback, basic OS UI, communication activity and scrolling. Kingfisher establishes a new bar for power-efficient performance on low- and mid-intensity workloads. Lead-in to the next slide: for the more interactive, high-intensity workloads at the upper end of the spectrum, the Cortex-A15 provides instantaneous responsiveness.
  • GIC-400 transports interrupts across clusters under hypervisor control
  • Phase I: a single cluster is in operation at any given time. This is a good fit for Linux: the OS estimates system load and invokes performance governors at preset watermarks, and the governor back end can synchronously call the switcher software. When performance on one cluster falls below or exceeds required levels, execution switches seamlessly to the appropriate cluster. The core state switch occurs in under 20 microseconds; the OS remains oblivious to the switch, and OS load estimation re-calculates system capability. The Virtualizer handles out-of-bound accesses, so the OS can be built for either Cortex-A15 or Kingfisher; the switcher handles core switching and is OS agnostic, so it is easily extended to support other systems. The Virtualizer is a natural progression of the switching work: a single OS with simultaneous execution across both clusters. The switcher layer becomes redundant, so the design is simpler; the Virtualizer continues to handle out-of-bound micro-architectural accesses, and the dual-cluster system appears logically as a single cluster of processors with varying capability. Phase II: native support requires minimal extension to the Linux kernel – a machine description for the “SMP” platform, inter-cluster signaling modifications, and mapping CPUID references to CPUs within the target cluster. This is still about mechanics, not policy: the OS scheduler still won't make intelligent scheduling decisions. There is scope for ‘manual’ improvement – forcefully bind threads statically to a suitable processor domain using cgroups and cpusets.
  • Transcript

    • 1. Hardware accelerated Virtualization in the ARM Cortex™ Processors – John Goodacre, Director, Program Management, ARM Processor Division, ARM Ltd. Cambridge UK, 2nd November 2010
    • 2. New Capabilities in the Cortex-A15
      • Full compatibility with the Cortex-A9
        • Supporting the ARMv7 Architecture
      • Addition of Virtualization Extension (VE)
        • Run multiple OS binary instances simultaneously
        • Isolates multiple work environments and data
      • Supporting Large Physical Addressing Extensions (LPAE)
        • Ability to use up to 1TB of physical memory
      • With AMBA 4 System Coherency (AMBA-ACE)
        • Other cached devices can be coherent with processor
        • Many core multiprocessor scalability
        • Basis of concurrent big.LITTLE Processing
    • 3. Large Physical Addressing
      • Cortex-A15 introduces 40-bit physical addressing
        • Virtual memory (apps and OS) still has a 32-bit address space
      • Offering up to 1 TB of physical address space
        • Traditional 32-bit ARM devices are limited to 4GB
      • What does this mean for ARM based systems?
        • Reduced address-map congestion
        • More applications at the same time
        • Multiple resident virtualized operating systems
        • Common global physical address in many-core
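A minimal sketch of how a 40-bit physical address is produced under LPAE, assuming the long-descriptor (64-bit) page-table format; the field masks are illustrative, and a real walk has three levels and many attribute checks that are omitted here:

```c
#include <stdint.h>

#define DESC_VALID        (1ull << 0)
#define OUTPUT_ADDR_MASK  0x000000FFFFFFF000ull  /* PA bits [39:12] */

/* Combine the 40-bit page frame from a (hypothetical) level-3 LPAE
 * descriptor with the low 12 bits of the 32-bit virtual address. */
static uint64_t lpae_page_pa(uint64_t level3_desc, uint32_t va)
{
    if (!(level3_desc & DESC_VALID))
        return ~0ull;            /* would raise a fault in hardware */
    return (level3_desc & OUTPUT_ADDR_MASK) | (va & 0xFFFu);
}
```

This is how a 32-bit virtual address can land anywhere in a 1TB physical space: the descriptor, not the VA, carries the extra address bits.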
    • 4. Virtualization Extensions: The Basics
      • New Non-secure level of privilege to hold Hypervisor
        • Hyp mode
      • New mechanisms avoid the need for Hypervisor intervention for:
        • Guest OS Interrupt masking bits
        • Guest OS page table management
        • Guest OS Device Drivers due to Hypervisor memory relocation
        • Guest OS communication with the interrupt controller (GIC)
      • New traps into Hyp mode for:
        • ID register accesses and idling (WFI/WFE)
        • Miscellaneous “difficult” System Control Register cases
      • New mechanisms to improve:
        • Guest OS Load/Store emulation by the Hypervisor
        • Emulation of trapped instructions through syndromes
    • 5. How does ARM do Virtualization?
      • Extensions to the v7-A Architecture, available on the Cortex™-A15 and Cortex-A7 CPUs
        • Second stage of address translation (separate page tables)
        • Functionality for virtualizing interrupts inside the Interrupt Controller
        • Functionality for virtualizing all CPU features, including CP15
        • Option of a MMU within the system to help virtualize IO
      • Hypervisor runs in new “Hyp” exception mode / privilege
        • HVC (Hypervisor Call) instruction to enter Hyp mode
        • Uses a previously unused entry (0x14 offset) in the vector table for hypervisor traps
        • Hyp mode exception link register, SPSR, stack pointer
        • Hypervisor Control Register (HCR) marks virtualized resources
        • Hypervisor Syndrome Register (HSR) for Hyp mode entry reason
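To make the slide concrete, here is a minimal sketch (GCC inline assembly; register encodings as described for the ARMv7-A Virtualization Extensions, treat details as illustrative) of the two halves of a Hyp-mode entry: a guest issuing HVC, and the hypervisor reading the HSR to find out why it was entered:

```c
#include <stdint.h>

/* Guest side: enter Hyp mode via a hypercall. The #imm16 field is
 * visible to the hypervisor for dispatching. */
static inline void hypercall(void)
{
    __asm__ volatile("hvc #0" ::: "memory");
}

/* Hyp side: read the Hyp Syndrome Register (HSR). EC (bits [31:26])
 * encodes the class of trap; ISS (bits [24:0]) carries
 * class-specific syndrome information. */
static inline uint32_t read_hsr(void)
{
    uint32_t hsr;
    __asm__ volatile("mrc p15, 4, %0, c5, c2, 0" : "=r"(hsr));
    return hsr;
}

#define HSR_EC(hsr)  ((hsr) >> 26)
#define HSR_ISS(hsr) ((hsr) & 0x1FFFFFFu)
```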
    • 6. Virtualization: Third Privilege
      • Guest OS same kernel/user privilege structure
      • HYP mode higher privilege than OS kernel level
      • VMM controls wide range of OS accesses
      • Hardware maintains TZ security (4th privilege)
      [Diagram: three privilege levels in the Non-secure state – (1) User mode (non-privileged) apps, (2) Supervisor mode (privileged) Guest Operating Systems 1 and 2, (3) Hyp mode (more privileged) hosting the Virtual Machine Monitor / Hypervisor – with the TrustZone Secure Monitor (highest privilege) taking exceptions to and returning from the Secure state's Secure Operating System and Secure Apps.]
    • 7. Memory – the Classic Resource
      • Before virtualisation – the OS owns the memory
        • Allocates areas of memory to the different applications
        • Virtual Memory commonly used in “rich” operating systems
      [Diagram: the virtual address map of each application is translated onto the physical address map using the translation table owned by the OS.]
    • 8. Virtual Memory in Two Stages
      [Diagram: Stage 1 translation, owned by each Guest OS, maps the virtual address (VA) map of each App onto the “Intermediate Physical” address (IPA) map of that Guest OS; Stage 2 translation, owned by the VMM, maps IPA onto the Physical Address (PA) map.]
      • Hardware has 2-stage memory translation
      • Tables from Guest OS translate VA to IPA
      • Second set of tables from VMM translate IPA to PA
      • Allows aborts to be routed to appropriate software layer
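A conceptual sketch of the two-stage walk; stage1_lookup and stage2_lookup are hypothetical stand-ins for the real multi-level table walks that hardware performs:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t va_t;   /* 32-bit guest virtual address  */
typedef uint64_t ipa_t;  /* intermediate physical address */
typedef uint64_t pa_t;   /* up to 40-bit physical address */

/* Stage 1: Guest OS tables (identity map, for illustration). */
static bool stage1_lookup(va_t va, ipa_t *ipa)
{
    *ipa = (ipa_t)va;
    return true;
}

/* Stage 2: VMM tables (guest RAM placed high in PA space). */
static bool stage2_lookup(ipa_t ipa, pa_t *pa)
{
    *pa = ipa + 0x80000000ull;
    return true;
}

/* A stage-1 miss faults to the Guest OS; a stage-2 miss faults to
 * the hypervisor - which is how aborts reach the right layer. */
bool translate(va_t va, pa_t *pa)
{
    ipa_t ipa;
    if (!stage1_lookup(va, &ipa))
        return false;   /* abort delivered to Guest OS   */
    if (!stage2_lookup(ipa, pa))
        return false;   /* abort delivered to hypervisor */
    return true;
}
```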
    • 9. Classic Issue: Interrupts
      • An Interrupt might need to be routed to one of:
        • Current or different GuestOS
        • Hypervisor
        • OS/RTOS running in the secure TrustZone environment
      • Basic model of the ARM virtualisation extensions:
        • Physical interrupts are taken initially in the Hypervisor
        • If the Interrupt should go to a GuestOS:
          • Hypervisor maps a “virtual” interrupt for that GuestOS
      [Diagram: in a system without virtualisation, a physical interrupt is delivered straight to the Operating System; in a system with virtualisation, the physical interrupt is taken by the VMM, which delivers a virtual interrupt to the appropriate Guest OS.]
    • 10. Interrupt Virtualization
      • Virtualisation Extensions provides :
        • Registers to hold the Virtual Interrupt
        • CPSR.{I,A,F} bits in the GuestOS apply only to that OS
          • Physical Interrupts are not masked by the CPSR.{I,A,F} bits
          • GuestOS changes to I,A,F no longer need to be trapped
        • Mechanism to route all physical interrupts to Monitor Mode
          • Already utilized in TrustZone technology based devices
        • Virtual Interrupts are routed to the Non-secure IRQ/FIQ/Abort
        • Guest OS manipulates a virtualized interrupt controller
      • Actually available in the Cortex-A9 to aid paravirtualization support for interrupts
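A sketch of how the hypervisor sets this routing up via the Hyp Configuration Register (HCR); bit positions follow the ARMv7-A Virtualization Extensions description but should be treated as illustrative:

```c
#include <stdint.h>

#define HCR_VM  (1u << 0)   /* enable stage-2 translation       */
#define HCR_FMO (1u << 3)   /* route physical FIQs to Hyp mode  */
#define HCR_IMO (1u << 4)   /* route physical IRQs to Hyp mode  */
#define HCR_AMO (1u << 5)   /* route async aborts to Hyp mode   */
#define HCR_VI  (1u << 7)   /* assert a virtual IRQ to guest    */

static inline void write_hcr(uint32_t v)
{
    __asm__ volatile("mcr p15, 4, %0, c1, c1, 0" :: "r"(v));
}

/* With IMO/FMO/AMO set, the guest's CPSR.{I,A,F} bits mask only its
 * own virtual interrupts; physical interrupts still reach the
 * hypervisor, so guest changes to I,A,F need not be trapped. */
void route_physical_interrupts_to_hyp(void)
{
    write_hcr(HCR_VM | HCR_IMO | HCR_FMO | HCR_AMO);
}
```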
    • 11. Virtual Interrupt Controller
      • New “Virtual” GIC Interface has been Architected
        • ISR of GuestOS interacts with the virtual controller
        • Pending and Active interrupt lists for each GuestOS
        • Interacts with the physical GIC in hardware
        • Creates Virtual Interrupts only when priority indicates it is necessary
      • GuestOS ISRs therefore do not need calls for:
        • Determining interrupt to take [Read of the Interrupt Acknowledge]
        • Marking the end of an interrupt [Sending EOI]
        • Changing CPU Interrupt Priority Mask [Current Priority]
    • 12. Virtual GIC
      • GIC now has separate sets of internal registers:
        • Physical registers and virtual registers
        • Non-virtualized system and hypervisor access the physical registers
        • Virtual machines access the virtual registers
        • Guest OS functionality does not change when accessing the vGIC
      • Virtual registers are remapped by hypervisor so that the Guest OS thinks it is accessing the physical registers
        • GIC registers and functionality are identical
      • Hypervisor can set IRQs as virtual in the HCR
        • Interrupts are configured to generate a Hypervisor trap
        • Hypervisor can deliver an interrupt to a CPU running a virtual process using “register lists” of interrupts
    • 13. Virtual interrupt example
      • External IRQ (configured as virtual by the hypervisor) arrives at the GIC
      • GIC Distributor signals a Physical IRQ to the CPU
      • CPU takes HYP trap, and Hypervisor reads the interrupt status from the Physical CPU Interface
      • Hypervisor makes an entry in register list in the GIC
      • GIC Distributor signals a Virtual IRQ to the CPU
      • CPU takes an IRQ exception, and Guest OS running on the virtual machine reads the interrupt status from the Virtual CPU Interface
      [Diagram: an external interrupt source reaches the GIC Distributor, which signals a Physical IRQ to the CPU; the Hypervisor reads the Physical CPU Interface and lists the interrupt, the Distributor then signals a Virtual IRQ, and the Guest OS reads the Virtual CPU Interface.]
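Step 4 of the sequence above is a write to a GIC virtual-interface List Register. A minimal sketch, with the field layout following the GICv2 virtualization extensions (as in GIC-400) and GICH_BASE a hypothetical platform address:

```c
#include <stdint.h>

#define GICH_BASE  0x2C004000u  /* hypothetical; platform specific */
#define GICH_LR0   (*(volatile uint32_t *)(GICH_BASE + 0x100))

#define LR_STATE_PENDING (1u << 28)  /* interrupt state = pending  */
#define LR_HW            (1u << 31)  /* linked to a physical IRQ   */

/* Inject physical interrupt 'pirq' into the guest as virtual
 * interrupt 'virq'. Because LR_HW links the two, the guest's EOI
 * deactivates the physical interrupt without another Hyp trap. */
void inject_virq(uint32_t virq, uint32_t pirq)
{
    GICH_LR0 = LR_HW | LR_STATE_PENDING
             | ((pirq & 0x3FFu) << 10)   /* physical interrupt ID */
             | (virq & 0x3FFu);          /* virtual interrupt ID  */
}
```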
    • 14. Resource Ownership
      • Software-only approaches
        • Access to resources by GuestOS intercepted by the VMM
          • VMM interprets the GuestOS’s intent
          • Provides its own mechanism to meet that intent
        • Mechanism of interception varies
          • Paravirtualisation adds a hypercall to the source code
          • Binary translation adds a hypercall to the binary
          • Hardware exceptions provide trapping of operations
        • Hypercalls can be more efficient
          • More of the intent to be expressed in a single VMM entry
      • Hardware assisted approaches:
        • Provide further indirection to resources
        • Accelerating trapped operations by syndrome information
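A sketch of why a hypercall can carry more intent per VMM entry than a trapped operation: one HVC passes a whole request in registers, whereas trap-and-emulate enters the VMM once per faulting instruction. The call-numbering convention here is hypothetical:

```c
/* GCC inline assembly for ARM; r0 carries the hypercall number in
 * and the result out, r1-r2 carry arguments. */
static inline long hypercall2(unsigned long nr, unsigned long a0,
                              unsigned long a1)
{
    register unsigned long r0 __asm__("r0") = nr;
    register unsigned long r1 __asm__("r1") = a0;
    register unsigned long r2 __asm__("r2") = a1;
    __asm__ volatile("hvc #0"
                     : "+r"(r0)
                     : "r"(r1), "r"(r2)
                     : "memory");
    return (long)r0;
}
```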
    • 15. Helping with Virtual Devices
      • ARM I/O handling uses memory mapped devices
        • Reads and Writes to the Device registers have specific side-effects
      • Creating “Virtual Devices” requires emulation:
        • Typically reads/writes to devices have to trap to the VMM
        • VMM interprets the operation and performs emulation
        • Perfect virtualization means all possible device loads/stores are emulated
        • Fetching and interpreting an emulated load/store is performance intensive
        • “Syndrome” information on aborts is available for some loads/stores
        • Syndrome unpacks key information about the instruction
          • Source/Destination register, size of data transfer, size of the instruction, sign extension, etc.
          • If the syndrome is not available, fetching of the instruction for emulation is still required
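A sketch of unpacking that syndrome for a trapped data abort, with field positions as described for the ARMv7-A HSR (illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

struct ls_syndrome {
    bool     valid;      /* ISV: syndrome fields are populated */
    bool     is_write;   /* WnR: write (1) or read (0)         */
    bool     sign_ext;   /* SSE: sign-extend the loaded value  */
    unsigned size_log2;  /* SAS: 0=byte, 1=half, 2=word        */
    unsigned rt;         /* SRT: source/destination register   */
};

static struct ls_syndrome decode_data_abort(uint32_t hsr)
{
    struct ls_syndrome s;
    s.valid     = (hsr >> 24) & 1u;
    s.size_log2 = (hsr >> 22) & 3u;
    s.sign_ext  = (hsr >> 21) & 1u;
    s.rt        = (hsr >> 16) & 0xFu;
    s.is_write  = (hsr >> 6)  & 1u;
    /* If !s.valid, the VMM must fall back to fetching and decoding
     * the faulting guest instruction itself. */
    return s;
}
```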
    • 16. Devices and Memory
      • Providing address translation for devices is important
        • Allows unmodified device drivers in the GuestOS
        • If the device can access memory, the GuestOS will program it using IPAs
      • ARM virtualisation adds option for a “System MMU”
        • Enables second stage memory translations in the system
      • A System MMU could also provide stage 1 translations
        • Allows devices to be programmed into guest’s VA space
      • System MMU natural fit for the processor ACP port
        • ARM defining a common programming model
        • Intent is for the system MMU to be hardware managed using Distributed Virtual Messages found in AMBA 4 ACE
    • 17. Potential of System MMU
    • 18. Partitioning in a Secure ARM System
      • ARM TrustZone technology defines two worlds
        • Everything must live in Normal World or Secure World
      • TrustZone-Enhanced processor exports World Information
        • Via NS bit (Not Secure) on system bus since AMBA 3 AXI
      • TrustZone-Aware devices can partition across both worlds
        • Only AMBA AXI compatible devices can be TrustZone-aware
      • AMBA 3 AXI Interconnect decodes TZ like an address line
      • AMBA AHB and APB do not contain TrustZone information
        • AHB and APB devices live in only one World
        • Groups of peripherals can be managed from bus interface
        • Inclusion of TrustZone Peripheral Controller gives more control
    • 19. Base Secure System
      • Minimal TrustZone System required for payment solutions
        • Protects On-Chip Secure RAM area via TrustZone Memory Adaptor
        • Keyboard and screen secured dynamically to protect PIN entry
        • Master Key and Random Number Generators for daughter-keys are permanently secured. Non-volatile counters are required for state management & anti-rollback (fuse based, not non-volatile memory, due to process geometry limitations)
    • 20. Extended Secure System
      • Extended TrustZone system enables complex content management
        • Builds on Base Secure System
        • + TrustZone ASC to protect media in RAM and off-chip decode
        • + On-chip Crypto, Media Accelerators & DMA Controller for media handling
    • 21. Propagating System Security (NS: Not Secure – treated like an address line)
    • 22. Multi-Cluster Virtualization
      • Works just like with a single cluster MPCore system
        • Guest OS (like threads) can migrate from CPU to CPU across clusters
      • External (virtual) GIC used to handle interrupts
        • Functions the same as internal GIC, but accessed by multiple CPUs
      • AMBA Coherency Extensions (ACE)
        • Manages coherency across clusters
      • System MMU allows other bus masters to map from IPA to PA
      • Hypervisor needs to be aware of different clusters and CPUs
        • But again: it is just like a single cluster system
        • Hardware requirements are the same as non-virtualized multi-cluster
          • Just make sure there’s enough memory
    • 23. big.LITTLE Multi-Processing
      • “Big” processor is paired with a “little” processor
        • Processors share exact same ISA and feature set
        • High performance tasks run on the Big processor
        • Lightweight/non-time-critical tasks run on the little processor
        • Best of both worlds solution for high performance and low power
      • big.LITTLE Use Models
        • big.LITTLE Switch (Swapping) – one CPU cluster active at a time
        • big.LITTLE MP – both CPUs can be active, loads dynamically balanced
      [Diagram: a Cortex-A15 cluster (SCU, Tags, L2) and a Kingfisher cluster (SCU, Tags, L2) joined by the Cache Coherent Interconnect, with auxiliary interfaces.]
    • 24. System Characteristics of big.LITTLE
      • SMS & Voice do not need a 1GHz processor
      • The browser needs full performance for complex rendering but very little if the page is just being read
      • You do not know what combination of apps the user will use, but the Smartphone must continue to be responsive
    • 25. big.LITTLE Processing
      • Right sized Core for the Right Task
        • Cortex-A7 enabled by default with sufficient performance for common usage scenarios
        • Cortex-A15 performance for best user experience
      • Software alignment allows execution to migrate between cores
        • Transparent to user, application and OS
      • Optimizes system workload based on application requirements
        • Extending existing power management
        • Additional benefit through OS Power Management Policy tuning
      • End-product delivers longer battery life and richer user experience at the same time
    • 26. Putting together a big.LITTLE System
      • Cortex-A15 and Cortex-A7 clusters are cache coherent
      • CCI-400 maintains cache-coherency between clusters
      • GIC-400 provides transparent virtualized Interrupt control
      • Supports all big.LITTLE usage models
      [Diagram: Cortex-A15 and Cortex-A7 clusters connected over 128-bit AMBA 4 to the CoreLink CCI-400 Cache Coherent Interconnect, alongside the GIC-400 Virtual GIC and system memory.]
      • Full task migration (active to active) in less than 20K cycles
      • Cache coherency managed in hardware
      • 128-bit system transactions
      • Virtualization manager can map the OS either across or between clusters
      [Diagram: a Cortex-A15 cluster (A15 cores with SCU + L2 cache) beside a Cortex-A7 cluster (A7 cores with SCU + L2 cache).]
    • 27. Forms of big.LITTLE
      • big.LITTLE cluster switching
        • OS sees big cores or little cores at any one time
        • Eases deployment of software on big.LITTLE platforms
          • Minimizes OS changes by extending DVFS framework
        • Symmetric big/little clusters currently preferred
      • Concurrent big.LITTLE
        • OS sees all CPUs all the time; better energy and performance matching of compute to workload
        • Asymmetric clusters supported
        • Expected to be the preferred use model eventually
        • Requires OS improvements and changes for efficient operation
          • Co-ordination of MP scheduling, power policies, environmental conditions
      [Diagram: “OS Switched” – the OS runs on either the A15 cluster or the Kingfisher (KF) cluster, each with its own L2$, one at a time; “OS Spans Clusters” – a single OS runs across both clusters simultaneously.]
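As a sketch of the ‘manual’ improvement the speaker notes mention (statically binding threads to a suitable processor domain), a thread on a concurrent big.LITTLE Linux system could pin itself to the big cluster; the assumption that CPUs 0-1 are the Cortex-A15s is platform specific:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_big_cluster(void)
{
    cpu_set_t big;
    CPU_ZERO(&big);
    CPU_SET(0, &big);   /* hypothetical: CPU 0 = Cortex-A15 */
    CPU_SET(1, &big);   /* hypothetical: CPU 1 = Cortex-A15 */
    /* pid 0 = the calling thread */
    if (sched_setaffinity(0, sizeof(big), &big) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```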
    • 28. Virtualization and TrustZone
      • TrustZone offers a specialised type of Virtualisation
        • Only 2 ‘Worlds’ – not extendable (except through paravirtualization)
          • Although VMM can also span both worlds
        • Fourth privilege level is provided by CPU’s secure monitor mode
        • Non-symmetrical - The two ‘Worlds’ are not equal
          • Secure world can access both worlds (33-bit addressing)
      [Diagram: the Normal World beside the Secure World, which contains the Secure Apps, Secure RTOS and Secure Monitor.]
      • TrustZone coexists alongside a VMM
      [Diagram: hardware (memory, ARM CPU, I/O devices) hosting a Virtual Machine Monitor (VMM) or Hypervisor, above which run Guest Operating System 1 with its apps (e.g. an EPG) and Guest Operating System 2 with its apps (e.g. Flash).]
    • 29. Spanning Hypervisor Framework
      [Diagram: the ARM Secure Execution Environment – TrustZone Monitor at the TZ Monitor exception level, a Secured Services Domain with an RTOS/uKernel, para-virtualization with platform service APIs and a secured hw timer (tick), and SoC Management Domains (ST/TEI) – beside the ARM Non-Secure Execution Environment, where an Open Platform Hypervisor with full virtualization (KVM) sits at the Hyp exception level, an Open OS (Linux) with open drivers at the kernel/supervisor exception level, and applications at the user exception level. Virtualized TZ calls, resource and driver APIs, a NoC, an sMMU and accelerators (e.g. DSP/GPU) with a virtualized heterogeneous API driver (OpenMP RT) provide coherence with platform resources.]
      Project website: http://www.virtical.eu – this project in ARM is in part funded by ICT-Virtical, a European project supported under the Seventh Framework Programme (FP7) for research and technological development
    • 30. Summary
      • Hardware-based Virtualization Extensions were introduced in the soon-available Cortex-A15
        • Also found in the recently announced Cortex-A7
      • Enables full binary compatible virtualization of the CPU
        • Supporting both existing and new market scenarios
      • Extended into the system and IO through the system MMU
      • Independent of the TrustZone environment
        • Providing both a secure environment and a virtualizable non-secure environment
      • Software models available now, first silicon 2H2012
