Hardware accelerated Virtualization in the  ARM Cortex ™  Processors John Goodacre Director, Program Management ARM Processor Division ARM Ltd.  Cambridge UK 2nd November 2010
New Capabilities in the Cortex-A15 Full compatibility with the Cortex-A9 Supporting the ARMv7 Architecture Addition of Virtualization Extension (VE) Run multiple OS binary instances simultaneously Isolates multiple work environments and data Supporting Large Physical Addressing Extensions (LPAE) Ability to use up to 1TB of physical memory With AMBA 4 System Coherency  (AMBA-ACE) Other cached devices can be coherent with processor Many core multiprocessor scalability Basis of concurrent big.LITTLE Processing
Large Physical Addressing Cortex-A15 introduces 40-bit physical addressing Virtual memory (apps and OS) still has 32bit address space Offering up to 1 TB of physical address space Traditional 32bit ARM devices limited to 4GB What does this mean for ARM based systems? Reduced address-map congestion More applications at the same time Multiple resident virtualized operating systems Common global physical address in many-core
Virtualization Extensions: The Basics New Non-secure level of privilege to hold Hypervisor Hyp mode New mechanisms avoid the need Hypervisor intervention for: Guest OS Interrupt masking bits Guest OS page table management Guest OS Device Drivers due to Hypervisor memory relocation Guest OS communication with the interrupt controller (GIC) New traps into Hyp mode for: ID register accesses and idling (WFI/WFE) Miscellaneous “difficult” System Control Register cases New mechanisms to improve: Guest OS Load/Store emulation by the Hypervisor Emulation of trapped instructions through syndromes
How does ARM do Virtualization Extensions to the v7-A Architecture, available on the  Cortex™-A15 and Cortex-A7 CPUs Second stage of address translation (separate page tables) Functionality for virtualizing interrupts inside the Interrupt Controller Functionality for virtualizing all CPU features, including CP15 Option of a MMU within the system to help virtualize IO Hypervisor runs in new “Hyp” exception mode / privilege HVC (Hypervisor Call) instruction to enter Hyp mode Uses previously unused entry (0X14 offset) in vector table for hypervisor traps Hyp mode exception link register, SPSR, stack pointer Hypervisor Control Register (HCR) marks virtualized resources Hypervisor Syndrome Register (HSR) for Hyp mode entry reason
Virtualization: Third Privilege Guest OS same kernel/user privilege structure HYP mode higher privilege than OS kernel level VMM controls wide range of OS accesses Hardware maintains TZ security (4 th  privilege) TrustZone Secure Monitor  (Highest Privilege) Secure Apps Secure Operating System Non-secure State Secure State Exceptions Exception Returns User Mode ( Non-privileged ) Supervisor Mode  (Privileged) Hyp Mode  (More Privileged) Guest Operating System1 App2 App1 Guest Operating System2 App2 App1 Virtual Machine Monitor / Hypervisor 1 2 3
Memory – the Classic Resource Before virtualisation – the OS owns the memory Allocates areas of memory to the different applications Virtual Memory commonly used in “rich” operating systems Virtual address map of  each application Physical Address Map Translations from translation table (owned by the OS)
Virtual Memory in Two Stages Stage 1 translation owned  by each Guest OS Virtual address (VA) map of  each App on each Guest OS “ Intermediate Physical” address map of each Guest OS  (IPA) Physical Address (PA) Map  Stage 2 translation owned by the VMM Hardware has 2-stage memory translation Tables from Guest OS translate VA to IPA Second set of tables from VMM translate IPA to PA Allows aborts to be routed to appropriate software layer
Classic Issue: Interrupts An Interrupt might need to be routed to one of: Current or different GuestOS Hypervisor  OS/RTOS running in the secure TrustZone environment Basic model of the ARM virtualisation extensions: Physical interrupts are taken initially in the Hypervisor  If the Interrupt should go to a GuestOS : Hypervisor maps a “virtual” interrupt for that GuestOS Operating System App2 App1 Guest OS 1 App2 App1 Guest OS 2 App2 App1 VMM System without virtualisation System with virtualisation Physical Interrupt Physical Interrupt Virtual Interrupt
Interrupt Virtualization Virtualisation Extensions provides : Registers to hold the Virtual Interrupt  CPSR.{I,A,F} bits in the GuestOS only appling to that OS Physical Interrupts are not masked by the CPSR.{I.A.F} bits GuestOS changes to I,A,F no longer need to be trapped Mechanism to route all physical interrupts to Monitor Mode Already utilized in TrustZone technology based devices Virtual Interrupts are routed to the Non-secure IRQ/FIQ/Abort Guest OS manipulates a virtualized interrupt controller Actually available in the Cortex-A9 to aid paravirtualization support for interrupts
Virtual Interrupt Controller  New “Virtual” GIC Interface has been Architected ISR of GuestOS interacts with the virtual controller Pending and Active interrupt lists for each GuestOS Interacts with the physical GIC in hardware Creates Virtual Interrupts only when priority indicates it is necessary GuestOS ISRs therefore  do not need calls for: Determining interrupt to  take  [Read of the  Interrupt Acknowledge] Marking the end of an  interrupt [Sending EOI] Changing CPU Interrupt  Priority Mask [Current Priority]
Virtual GIC GIC now has separate sets of internal registers: Physical registers and virtual registers Non-virtualized system and hypervisor access the physical registers Virtual machines access the virtual registers Guest OS functionality does not change when accessing the vGIC Virtual registers are remapped by hypervisor so that the Guest OS thinks it is accessing the physical registers GIC registers and functionality are identical Hypervisor can set IRQs as virtual in the HCR Interrupts are configured to generate a Hypervisor trap Hypervisor can deliver an interrupt to a CPU running a virtual process using “register lists” of interrupts
Virtual interrupt example External IRQ (configured as virtual by the hypervisor) arrives at the GIC GIC Distributor signals a Physical IRQ to the CPU CPU takes HYP trap, and Hypervisor reads the interrupt status from the Physical CPU Interface Hypervisor makes an entry in register list in the GIC GIC Distributor signals a Virtual IRQ to the CPU CPU takes an IRQ exception, and Guest OS running on the virtual machine reads the interrupt status from the Virtual CPU Interface Distributor Physical CPU Interface Virtual CPU Interface Virtual IRQ Physical IRQ CPU External Interrupt source Hypervisor Guest OS
Resource Ownership Software-only approaches Access to resources by GuestOS intercepted by the VMM VMM interprets the GuestOS’s intent Provides its own mechanism to meet that intent  Mechanism of interception varies Paravirtualisation adds a hypercall to the source code Binary translation adds a hypercall to the binary Exceptions in Hardware provide an trapping of operations Hypercalls can be more efficient More of the intent to be expressed in a single VMM entry Hardware assisted approaches: Provide further indirection to resources Accelerating trapped operations by syndrome information
Helping with Virtual Devices ARM I/O handling uses memory mapped devices Reads and Writes to the Device registers have specific side-effects Creating “Virtual Devices” requires emulation: Typically reads/writes to devices have to trap to the VMM VMM interprets the operation and performs emulation Perfect virtualization means all possible devices loads/stores emulated Fetching and interpreting emulated load/store is performance intensive “ Syndrome” information on aborts available for some loads/stores Syndrome unpacks key information about the instruction  Source/Destination register, Size of data transfer, Size of the instruction, SignExtension etc If syndrome not available, then fetching of the instruction for emulation still required
Devices and Memory Providing address translation for devices is important Allows unmodified device drivers in the GuestOS If the device can access memory, GuestOS will program it in IPA ARM virtualisation adds option for a “System MMU” Enables second stage memory translations in the system A System MMU could also provide stage 1 translations Allows devices to be programmed into guest’s VA space  System MMU natural fit for the processor ACP port  ARM defining a common programming model Intent is for the system MMU to be hardware managed using Distributed Virtual Messages found in AMBA 4 ACE
Potential of System MMU
Partitioning in a Secure ARM System ARM TrustZone technology define two worlds Everything must live in Normal World or Secure World TrustZone-Enhanced processor exports World Information Via NS bit (Not Secure) on system bus since AMBA 3 AXI TrustZone-Aware devices can partition across both worlds Only AMBA AXI compatible devices can be TrustZone-aware AMBA 3 AXI Interconnect decodes TZ like an address line AMBA AHB and APB do not contain TrustZone information AHB and APB devices live in only one World Groups of peripherals can be managed from bus interface Inclusion of TrustZone Peripheral Controller gives more control
Base Secure System Minimal TrustZone System required for payment solutions Protects On-Chip Secure Ram area via TrustZone Memory Adaptor Keyboard and screen secured dynamically to protect PIN entry Master Key and Random Number Generators for daughter-keys permanently secures. Non-volatile counters required for state management & anti-rollback (fuse based not non-volatile memory due to process geometry limitations)
Extended Secure System Extended TrustZone system enables complex content management Builds on Base Secure System + TrustZone ASC to protect media in RAM and off-chip decode + On-chip Crypto, Media Accelerators & DMA Controller for media handling
Propagating System Security NS :  N ot  S ecure - treated like an address line
Multi-Cluster Virtualization Works just like with a single cluster MPCore system Guest OS (like threads) can migrate from CPU to CPU across clusters External (virtual) GIC used to handle interrupts Functions the same as internal GIC, but accessed by multiple CPUs AMBA Coherency Extensions (ACE) Manages coherency across clusters System MMU allows other bus masters to map from IPA to PA Hypervisor needs to be aware of different clusters and CPUs But again: it is just like a single cluster system Hardware requirements are the same as non-virtualized multi-cluster Just make sure there’s enough memory
big.LITTLE Multi-Processing “ Big”  processor is paired with a “little” processor Processors share exact same ISA and feature set High performance tasks run on the Big processor Lightweight/non-time-critical tasks run on the little processor Best of both worlds solution for high performance and low power big.LITTLE Use Models big.LITTLE Switch (Swapping) – one CPU cluster active at a time big.LITTLE MP – both CPUs can be active, loads dynamically balanced Cortex-A15 Tags SCU L2 Cache Coherent Interconnect Auxiliary Interfaces Tags SCU L2 Kingfisher
System Characteristics of big.LITTLE PERFORMANCE SMS & Voice do not need a 1GHz processor Browser needs full performance for complex rendering but very little if the page is just being read You do not know what combination of apps the user will use, but the Smartphone must continue to be responsive
big.LITTLE Processing Right sized Core for the Right Task Cortex-A7 enabled by default  with sufficient  performance for common usage scenarios Cortex-A15 performance for best user experience Software alignment allows execution to migrate between cores Transparent to user, application and OS Optimizes system workload based on application requirements Extending existing power management Additional benefit though OS Power Management Policy tuning End-product delivers longer battery life and richer user experience at the same time
Putting together a big.LITTLE System Cortex-A15 and Cortex-A7 clusters are cache coherent CCI-400 maintains cache-coherency between clusters GIC-400 provides transparent virtualized Interrupt control Supports all big.LITTLE usage models 128-bit AMBA 4 CoreLink CCI-400 Cache Coherent Interconnect  128-bit AMBA 4 A7 A15 System Memory GIC-400 Virtual GIC Full task migration (active   to active) in less than    20K cycles Cache coherency managed in hardware 128-bit system transactions Virtualization manager can    map OS either across or   between clusters Cortex-A15  A15 SCU + L2 Cache Cortex-A7 A7 SCU + L2 Cache
Forms of big.LITTLE big.LITTLE cluster switching OS sees big cores or little cores at any one time Eases deployment of software on big.LITTLE platforms Minimizes OS changes by extending DVFS framework Symmetric big/little clusters currently preferred Concurrent big.LITTLE  OS sees all CPUs all the time; better energy and  performance matching of compute to workload Asymmetric clusters supported Expected to be the preferred use model eventually Requires OS improvements and changes for efficient operation Co-ordination of MP scheduling, power policies, environmental conditions OS Switched L2$ KF KF L2$ A15 A15 L2$ A15 A15 L2$ KF KF OS Spans Clusters
Virtualization and TrustZone TrustZone offers a specialised type of Virtualisation Only 2 ‘Worlds’ – not extendable (except through paravirtualization) Although VMM can also span both worlds Fourth privilege level is provided by CPU’s secure monitor mode Non-symmetrical - The two ‘Worlds’ are not equal Secure world can access both worlds (33bit addressing) Secure Apps Secure RTOS Secure Monitor Normal  World Secure  World TrustZone coexists alongside a VMM  HARDWARE (Memory, ARM CPU, I/O Devices) Guest Operating System1 App2 App1  (EPG) Guest Operating System2 App2 App1  (Flash ) Virtual Machine Monitor (VMM) or Hypervisor
Spanning Hypervisor Framework SoC Management Domain  (ST/TEI) TrustZone Monitor RTOS/uKernel Para-virtualization with platform service API’s Secured hw timer (tick) SoC Management Domains Secured Services Domain Open Platform Hypervisor Full Virtualization (KVM) TZ Monitor Exception Level HYP(ervisor) Exception Level Open OS & Drivers Open OS (Linux) Kernel/Supervisor Exception Level A P P A P P A P P A P P User Exception Level (Application Layer) ARM Non-Secure Execution Environment ARM Secure Execution Environment Resource API Virtualized TZ Call Virtualized TZ Call Resource Driver API NoC sMMU Accelerators (eg DSP / GPU) Virtualized Heterogeneous API Driver (OpenMP RT) Coherence Platform  Resources Project website http://www.virtical.eu  This project in ARM is in part funded by ICT-Virtical, a European project supported under the Seventh Framework Programme (7FP) for research and technological development
Summary Hardware based Virtualization Extensions were introduced in the soon-available Cortex-A15 Also found the in the recently announce Cortex-A7 Enables full binary compatible virtualization of the CPU Supporting both existing and new market scenarios Extended into the system and IO through the system MMU Independent of the TrustZone environment Providing both a secure environment, and a virtualizable non-secured environment Software models available now, first silicon 2H2012

Hardware accelerated Virtualization in the ARM Cortex™ Processors

  • 1.
    Hardware accelerated Virtualizationin the ARM Cortex ™ Processors John Goodacre Director, Program Management ARM Processor Division ARM Ltd. Cambridge UK 2nd November 2010
  • 2.
    New Capabilities inthe Cortex-A15 Full compatibility with the Cortex-A9 Supporting the ARMv7 Architecture Addition of Virtualization Extension (VE) Run multiple OS binary instances simultaneously Isolates multiple work environments and data Supporting Large Physical Addressing Extensions (LPAE) Ability to use up to 1TB of physical memory With AMBA 4 System Coherency (AMBA-ACE) Other cached devices can be coherent with processor Many core multiprocessor scalability Basis of concurrent big.LITTLE Processing
  • 3.
    Large Physical AddressingCortex-A15 introduces 40-bit physical addressing Virtual memory (apps and OS) still has 32bit address space Offering up to 1 TB of physical address space Traditional 32bit ARM devices limited to 4GB What does this mean for ARM based systems? Reduced address-map congestion More applications at the same time Multiple resident virtualized operating systems Common global physical address in many-core
  • 4.
    Virtualization Extensions: TheBasics New Non-secure level of privilege to hold Hypervisor Hyp mode New mechanisms avoid the need Hypervisor intervention for: Guest OS Interrupt masking bits Guest OS page table management Guest OS Device Drivers due to Hypervisor memory relocation Guest OS communication with the interrupt controller (GIC) New traps into Hyp mode for: ID register accesses and idling (WFI/WFE) Miscellaneous “difficult” System Control Register cases New mechanisms to improve: Guest OS Load/Store emulation by the Hypervisor Emulation of trapped instructions through syndromes
  • 5.
    How does ARMdo Virtualization Extensions to the v7-A Architecture, available on the Cortex™-A15 and Cortex-A7 CPUs Second stage of address translation (separate page tables) Functionality for virtualizing interrupts inside the Interrupt Controller Functionality for virtualizing all CPU features, including CP15 Option of a MMU within the system to help virtualize IO Hypervisor runs in new “Hyp” exception mode / privilege HVC (Hypervisor Call) instruction to enter Hyp mode Uses previously unused entry (0X14 offset) in vector table for hypervisor traps Hyp mode exception link register, SPSR, stack pointer Hypervisor Control Register (HCR) marks virtualized resources Hypervisor Syndrome Register (HSR) for Hyp mode entry reason
  • 6.
    Virtualization: Third PrivilegeGuest OS same kernel/user privilege structure HYP mode higher privilege than OS kernel level VMM controls wide range of OS accesses Hardware maintains TZ security (4 th privilege) TrustZone Secure Monitor (Highest Privilege) Secure Apps Secure Operating System Non-secure State Secure State Exceptions Exception Returns User Mode ( Non-privileged ) Supervisor Mode (Privileged) Hyp Mode (More Privileged) Guest Operating System1 App2 App1 Guest Operating System2 App2 App1 Virtual Machine Monitor / Hypervisor 1 2 3
  • 7.
    Memory – theClassic Resource Before virtualisation – the OS owns the memory Allocates areas of memory to the different applications Virtual Memory commonly used in “rich” operating systems Virtual address map of each application Physical Address Map Translations from translation table (owned by the OS)
  • 8.
    Virtual Memory inTwo Stages Stage 1 translation owned by each Guest OS Virtual address (VA) map of each App on each Guest OS “ Intermediate Physical” address map of each Guest OS (IPA) Physical Address (PA) Map Stage 2 translation owned by the VMM Hardware has 2-stage memory translation Tables from Guest OS translate VA to IPA Second set of tables from VMM translate IPA to PA Allows aborts to be routed to appropriate software layer
  • 9.
    Classic Issue: InterruptsAn Interrupt might need to be routed to one of: Current or different GuestOS Hypervisor OS/RTOS running in the secure TrustZone environment Basic model of the ARM virtualisation extensions: Physical interrupts are taken initially in the Hypervisor If the Interrupt should go to a GuestOS : Hypervisor maps a “virtual” interrupt for that GuestOS Operating System App2 App1 Guest OS 1 App2 App1 Guest OS 2 App2 App1 VMM System without virtualisation System with virtualisation Physical Interrupt Physical Interrupt Virtual Interrupt
  • 10.
    Interrupt Virtualization VirtualisationExtensions provides : Registers to hold the Virtual Interrupt CPSR.{I,A,F} bits in the GuestOS only appling to that OS Physical Interrupts are not masked by the CPSR.{I.A.F} bits GuestOS changes to I,A,F no longer need to be trapped Mechanism to route all physical interrupts to Monitor Mode Already utilized in TrustZone technology based devices Virtual Interrupts are routed to the Non-secure IRQ/FIQ/Abort Guest OS manipulates a virtualized interrupt controller Actually available in the Cortex-A9 to aid paravirtualization support for interrupts
  • 11.
    Virtual Interrupt Controller New “Virtual” GIC Interface has been Architected ISR of GuestOS interacts with the virtual controller Pending and Active interrupt lists for each GuestOS Interacts with the physical GIC in hardware Creates Virtual Interrupts only when priority indicates it is necessary GuestOS ISRs therefore do not need calls for: Determining interrupt to take [Read of the Interrupt Acknowledge] Marking the end of an interrupt [Sending EOI] Changing CPU Interrupt Priority Mask [Current Priority]
  • 12.
    Virtual GIC GICnow has separate sets of internal registers: Physical registers and virtual registers Non-virtualized system and hypervisor access the physical registers Virtual machines access the virtual registers Guest OS functionality does not change when accessing the vGIC Virtual registers are remapped by hypervisor so that the Guest OS thinks it is accessing the physical registers GIC registers and functionality are identical Hypervisor can set IRQs as virtual in the HCR Interrupts are configured to generate a Hypervisor trap Hypervisor can deliver an interrupt to a CPU running a virtual process using “register lists” of interrupts
  • 13.
    Virtual interrupt exampleExternal IRQ (configured as virtual by the hypervisor) arrives at the GIC GIC Distributor signals a Physical IRQ to the CPU CPU takes HYP trap, and Hypervisor reads the interrupt status from the Physical CPU Interface Hypervisor makes an entry in register list in the GIC GIC Distributor signals a Virtual IRQ to the CPU CPU takes an IRQ exception, and Guest OS running on the virtual machine reads the interrupt status from the Virtual CPU Interface Distributor Physical CPU Interface Virtual CPU Interface Virtual IRQ Physical IRQ CPU External Interrupt source Hypervisor Guest OS
  • 14.
    Resource Ownership Software-onlyapproaches Access to resources by GuestOS intercepted by the VMM VMM interprets the GuestOS’s intent Provides its own mechanism to meet that intent Mechanism of interception varies Paravirtualisation adds a hypercall to the source code Binary translation adds a hypercall to the binary Exceptions in Hardware provide an trapping of operations Hypercalls can be more efficient More of the intent to be expressed in a single VMM entry Hardware assisted approaches: Provide further indirection to resources Accelerating trapped operations by syndrome information
  • 15.
    Helping with VirtualDevices ARM I/O handling uses memory mapped devices Reads and Writes to the Device registers have specific side-effects Creating “Virtual Devices” requires emulation: Typically reads/writes to devices have to trap to the VMM VMM interprets the operation and performs emulation Perfect virtualization means all possible devices loads/stores emulated Fetching and interpreting emulated load/store is performance intensive “ Syndrome” information on aborts available for some loads/stores Syndrome unpacks key information about the instruction Source/Destination register, Size of data transfer, Size of the instruction, SignExtension etc If syndrome not available, then fetching of the instruction for emulation still required
  • 16.
    Devices and MemoryProviding address translation for devices is important Allows unmodified device drivers in the GuestOS If the device can access memory, GuestOS will program it in IPA ARM virtualisation adds option for a “System MMU” Enables second stage memory translations in the system A System MMU could also provide stage 1 translations Allows devices to be programmed into guest’s VA space System MMU natural fit for the processor ACP port ARM defining a common programming model Intent is for the system MMU to be hardware managed using Distributed Virtual Messages found in AMBA 4 ACE
  • 17.
  • 18.
    Partitioning in aSecure ARM System ARM TrustZone technology define two worlds Everything must live in Normal World or Secure World TrustZone-Enhanced processor exports World Information Via NS bit (Not Secure) on system bus since AMBA 3 AXI TrustZone-Aware devices can partition across both worlds Only AMBA AXI compatible devices can be TrustZone-aware AMBA 3 AXI Interconnect decodes TZ like an address line AMBA AHB and APB do not contain TrustZone information AHB and APB devices live in only one World Groups of peripherals can be managed from bus interface Inclusion of TrustZone Peripheral Controller gives more control
  • 19.
    Base Secure SystemMinimal TrustZone System required for payment solutions Protects On-Chip Secure Ram area via TrustZone Memory Adaptor Keyboard and screen secured dynamically to protect PIN entry Master Key and Random Number Generators for daughter-keys permanently secures. Non-volatile counters required for state management & anti-rollback (fuse based not non-volatile memory due to process geometry limitations)
  • 20.
    Extended Secure SystemExtended TrustZone system enables complex content management Builds on Base Secure System + TrustZone ASC to protect media in RAM and off-chip decode + On-chip Crypto, Media Accelerators & DMA Controller for media handling
  • 21.
    Propagating System SecurityNS : N ot S ecure - treated like an address line
  • 22.
    Multi-Cluster Virtualization Worksjust like with a single cluster MPCore system Guest OS (like threads) can migrate from CPU to CPU across clusters External (virtual) GIC used to handle interrupts Functions the same as internal GIC, but accessed by multiple CPUs AMBA Coherency Extensions (ACE) Manages coherency across clusters System MMU allows other bus masters to map from IPA to PA Hypervisor needs to be aware of different clusters and CPUs But again: it is just like a single cluster system Hardware requirements are the same as non-virtualized multi-cluster Just make sure there’s enough memory
  • 23.
    big.LITTLE Multi-Processing “Big” processor is paired with a “little” processor Processors share exact same ISA and feature set High performance tasks run on the Big processor Lightweight/non-time-critical tasks run on the little processor Best of both worlds solution for high performance and low power big.LITTLE Use Models big.LITTLE Switch (Swapping) – one CPU cluster active at a time big.LITTLE MP – both CPUs can be active, loads dynamically balanced Cortex-A15 Tags SCU L2 Cache Coherent Interconnect Auxiliary Interfaces Tags SCU L2 Kingfisher
  • 24.
    System Characteristics ofbig.LITTLE PERFORMANCE SMS & Voice do not need a 1GHz processor Browser needs full performance for complex rendering but very little if the page is just being read You do not know what combination of apps the user will use, but the Smartphone must continue to be responsive
  • 25.
    big.LITTLE Processing Rightsized Core for the Right Task Cortex-A7 enabled by default with sufficient performance for common usage scenarios Cortex-A15 performance for best user experience Software alignment allows execution to migrate between cores Transparent to user, application and OS Optimizes system workload based on application requirements Extending existing power management Additional benefit though OS Power Management Policy tuning End-product delivers longer battery life and richer user experience at the same time
  • 26.
    Putting together abig.LITTLE System Cortex-A15 and Cortex-A7 clusters are cache coherent CCI-400 maintains cache-coherency between clusters GIC-400 provides transparent virtualized Interrupt control Supports all big.LITTLE usage models 128-bit AMBA 4 CoreLink CCI-400 Cache Coherent Interconnect 128-bit AMBA 4 A7 A15 System Memory GIC-400 Virtual GIC Full task migration (active to active) in less than 20K cycles Cache coherency managed in hardware 128-bit system transactions Virtualization manager can map OS either across or between clusters Cortex-A15 A15 SCU + L2 Cache Cortex-A7 A7 SCU + L2 Cache
  • 27.
    Forms of big.LITTLEbig.LITTLE cluster switching OS sees big cores or little cores at any one time Eases deployment of software on big.LITTLE platforms Minimizes OS changes by extending DVFS framework Symmetric big/little clusters currently preferred Concurrent big.LITTLE OS sees all CPUs all the time; better energy and performance matching of compute to workload Asymmetric clusters supported Expected to be the preferred use model eventually Requires OS improvements and changes for efficient operation Co-ordination of MP scheduling, power policies, environmental conditions OS Switched L2$ KF KF L2$ A15 A15 L2$ A15 A15 L2$ KF KF OS Spans Clusters
  • 28.
    Virtualization and TrustZoneTrustZone offers a specialised type of Virtualisation Only 2 ‘Worlds’ – not extendable (except through paravirtualization) Although VMM can also span both worlds Fourth privilege level is provided by CPU’s secure monitor mode Non-symmetrical - The two ‘Worlds’ are not equal Secure world can access both worlds (33bit addressing) Secure Apps Secure RTOS Secure Monitor Normal World Secure World TrustZone coexists alongside a VMM HARDWARE (Memory, ARM CPU, I/O Devices) Guest Operating System1 App2 App1 (EPG) Guest Operating System2 App2 App1 (Flash ) Virtual Machine Monitor (VMM) or Hypervisor
  • 29.
    Spanning Hypervisor FrameworkSoC Management Domain (ST/TEI) TrustZone Monitor RTOS/uKernel Para-virtualization with platform service API’s Secured hw timer (tick) SoC Management Domains Secured Services Domain Open Platform Hypervisor Full Virtualization (KVM) TZ Monitor Exception Level HYP(ervisor) Exception Level Open OS & Drivers Open OS (Linux) Kernel/Supervisor Exception Level A P P A P P A P P A P P User Exception Level (Application Layer) ARM Non-Secure Execution Environment ARM Secure Execution Environment Resource API Virtualized TZ Call Virtualized TZ Call Resource Driver API NoC sMMU Accelerators (eg DSP / GPU) Virtualized Heterogeneous API Driver (OpenMP RT) Coherence Platform Resources Project website http://www.virtical.eu This project in ARM is in part funded by ICT-Virtical, a European project supported under the Seventh Framework Programme (7FP) for research and technological development
  • 30.
    Summary Hardware basedVirtualization Extensions were introduced in the soon-available Cortex-A15 Also found the in the recently announce Cortex-A7 Enables full binary compatible virtualization of the CPU Supporting both existing and new market scenarios Extended into the system and IO through the system MMU Independent of the TrustZone environment Providing both a secure environment, and a virtualizable non-secured environment Software models available now, first silicon 2H2012

Editor's Notes

  • #3 Key features that are introduced on the Cortex-A15 that have not been seen before: Virtualization – one could do it before through SOFTWARE, but Cortex-A15 brings HARDWARE support making it much more efficient. Large physical addressing – before ARM cores were limited to 4GB of memory. For the most part this is effectively 2-3GB due to address mapping of peripherals etc. Increasing the number of address bits from 32 to 40 enables this increase in usable memory. Note that virtual memory available to a single program is still limited to 32-bit/4GB. AMBA 4 System coherency – allows different clusters of processors or other cached masters to share data efficiently. ARM has supported up to 4xMP previously; AMBA 4 allows MP scale well beyond that. Currently plans include support to 8 and 16 core systems. AMBA 4 can theoretically support more cores than that, but a Cache Coherent Interconnect with more than four ports will need to be created to realize such 16+ core systems. (ARM currently has plans for two and four port CCIs)
  • #7 To do: emphasize what the VMM does. The OS can now manage its own page tables and disable interrupts – details on this later. The VMM can be written to trap all hardware accesses by OS meaning that OS source need not be changed (might want to change it for performance reasons though). For instance, can configure system so that all CP15 register accesses are trapped to Hypervisor mode where they can be handled by VMM. Hyp mode can alias identification registers e.g. CPU ID etc. Interaction between Virtualization and TrustZone is not a subject for now but the two can co-exist and Virtualization is only supported in Non-Secure world.
  • #9 LPAE extensions allows each OS to live in 32-bit virtual address space and VMM to map multiple systems within 40-bit physical address space. The Guest OS believes that IPA is the real PA. The second-stage translation is invisible to it. Without virtualization extensions, VMSAv7 page tables have been extended to allow generation of 40-bit physical addresses at 16MB super-section granularity only. Support of super-sections and 40-bit PA generation is optional in VMSAv7. The LPAE extension (always used for the Stage 2 translation in a virtualized system) allows 4KB granularity over entire 40-bit address range. Support of LPAE without Virtualization is permitted (but not the other way round).
  • #14 ANIMATED Pretty simplified, but still shows delivery of a virtual interrupt. What is missing is the incredibly complex process of the hypervisor restoring the Guest OS context, or deciding which CPU to deliver the virtual interrupt to, or any one of the 101 details that the hypervisor must deal with.
  • #25 We achieve Big-Little system can use the Right-Core for the right task. Built on the coherency and software compatiblity: - Soft ware migration across processors is at a minimal cost This migration is transparent to applications and OS There is a great deal of innovation and differentiation possible here through tuning of OS Power Management, use of Hypervisors and more. [Internal Note: The idea is to rebutt asynchronous DVFS without specifically alluding to it] Even with near-perfect voltage scaling, the “Big” processor cannot match the power and energy efficiency of the simple Kingfisher pipeline for the lower-intensity workloads such as MP3 playback, basic OS UI and communication activity and scrolling Kingfisher establishes a new bar on power-efficient performance for the low and mid-intensity workloads Lead-into next slide: For the more interactive high-intensity workloads Cortex-A15 provides the instantaneous responsiveness. However, for the higher-intensity workloads At the upper end of the spectrum
  • #27 GIC-400 transports interrupts across clusters under hypervisor control
  • #28 Phase I: Single cluster in operation at any given time Good fit for Linux OS estimates system load and invokes performance governors at preset watermarks Governor back end can synchronously call switcher software When performance on one cluster falls below or exceeds required levels, execution switches seamlessly to appropriate cluster Core state switch occurs in sub 20 microseconds OS remains oblivious to state switch OS load estimation re-calculates system capability Virtualizer handles out of bound accesses OS can be built for Cortex-A15 or Kingfisher Switcher handles core switching OS agnostic so easily extended to support other systems Virtualizer natural progression of the switching work Single OS and simultaneous execution across both clusters Switcher layer is redundant – design is consequently simpler Virtualizer continues to handle out-of-bound micro-architectural accesses Dual cluster system appears logically as a single cluster of processors with varying capability Phase II: Native support requires minimal extension to the Linux kernel Machine description for “SMP” platform Inter cluster signaling modifications Map CPUID reference to CPU within target cluster Still about mechanics not policy OS scheduler still won’t make intelligent scheduling decisions Scope for ‘manual’ improvement Forcefully bind threads statically to suitable processor domain Use cgroups and cpusets