Virtualization Support in ARMv8+

Aananth C N
c.n.aananth@gmail.com
Version 1.3, 24 Oct 2020

Agenda
• To provide a simplified view of virtualization for automotive ECUs.
• To understand and compare the different solutions available.
• To share this knowledge, so that this drop in the ocean joins other drops and
eventually quenches the thirst of some good souls in the universe!
Note: All contents, pictures, etc. are based either on what is already published on the web or on my own experience,
learning and creations. My intent is not to violate any copyright or NDA. Please let me know if you find any violation.

What is Virtualization?
• Operating System (OS) abstracts
hardware from its applications.
Virtualization abstracts hardware
from one or more OS.
• In the automotive world, virtualization
is about abstracting applications,
operating systems, the vehicle
network, displays, audio systems,
etc. away from the hardware.

Virtualization Types
• Fundamental types
• Type 1: Full virtualization, where the hypervisor takes control of the
hardware and hosts the guest OSes, and the guests are completely
unaware that they run in a virtualized environment.
• Type 2: Para-virtualization, where one of the operating systems (called the
Host OS) takes charge of the hardware and the guest OS is modified to
connect with either the Host OS or the hardware devices.
• Derived types
• Hardware-assisted virtualization: the virtualization solution uses
the support provided by the hardware to realize the virtualization goals.
• Example: Linux/KVM falls under this category. We will see this in detail later.
• Hybrid types: virtualization is realized by combining several of the
other types.
• For example, the core virtualization functions are realized using a Type 1
hypervisor, while peripheral / device virtualization is done using Type 2 or other
types. Graphics and display virtualization, for instance, uses a server in the Host OS and clients
running in the Guest OSes. We will see this in detail later.
• ... and many more
[Figure: Type 1 stack (Hardware, Hypervisor, OS 1 and OS 2) and Type 2 stack (Hardware, Host OS, Apps and Hypervisor, Modified Guest OS)]

Stop all old stories! How to realize Virtualization?
• System virtualization involves the following functions
• Virtualization of CPU cores or Processing Elements
• Virtualization of memory and memory management
• Virtualization of interrupts
• Virtualization of timers
• I/O or peripheral virtualization
• To get a better understanding of the virtualization functions listed
above, we need some example hardware, such as the Raspberry Pi 3
(ARM Cortex A53).
• The Raspberry Pi is chosen because it is the most open and common hardware available.

Overview of ARM Cortex A53
- CPU cores
- Exception Levels of ARMv8
- Memory management
- Memory Mapped I/O
- Interrupts
- Timers, Clocks, Resets

ARM Cortex A53 – CPU core hardware blocks
• 4 CPU Cores with
• Timer block
• Interrupt block
• Core includes
• NEON Coprocessor
• FPU
• Crypto extensions
• L1 Cache [, L2 Cache]
• Debug & trace
• Trace block
• Debug block
• ACP - Accelerator Coherency
Port (AXI slave interface)
• Master memory interface
• Power management interface
• Test interface
The Cortex-A53 processor is a mid-range, low-power processor that implements the ARMv8-A
architecture. The Cortex-A53 processor has one to four cores, each with an L1 memory system
and a single shared L2 cache.
[Figure 1-1 (Cortex-A53 TRM): example Cortex-A53 processor configuration with four cores (cores 1 to 3 optional) and either an ACE or a CHI master interface. Each core has timer, interrupt, debug and trace interfaces. Other interfaces shown: optional ACP (AXI slave), power management, test (DFT, MBIST), APB debug, clocks, resets and configuration.]
Ref: https://developer.arm.com/documentation/ddi0500/e/introduction/about-the-cortex-a53-processor

ARM Cortex A53 – CPU Functional Blocks
• APB – Advanced Peripheral
Bus (slower than AXI)
• CTM – CoreSight Cross Trigger
Matrix (Debug & Trace)
• CTI – CoreSight Cross Trigger
Interface (Debug & Trace)
• GIC – Generic Interrupt
Controller
• SCU – Snoop Control Unit, which
maintains cache coherency
• ACE – AXI Coherency Extensions, an
extension to the AXI protocol
• CHI – Coherent Hub Interface, a
scalable protocol supporting
multi-node interconnect
• ACP – Accelerator Coherency Port, an
AMBA 4 AXI slave interface
[Figure 2-1 (Cortex-A53 TRM): Cortex-A53 processor block diagram. Each of cores 0 to 3 has an L1 ICache, L1 DCache, FPU and NEON extension, Crypto extension, and debug and trace logic. Each per-core governor block has an Arch timer, GIC CPU interface, clock and reset, CTI, retention control, and debug over power down. The governor also holds the APB decoder, APB ROM, APB multiplexer and CTM. The Level 2 memory system holds the L2 cache, SCU, ACP slave and the ACE/AMBA 5 CHI master bus interface.]
Ref: https://developer.arm.com/documentation/ddi0500/d/functional-description/about-the-cortex-a53-processor-functions?lang=en

CPU Virtualization – Virtual CPU Cores (vCPU)
• A Raspberry Pi 3 has 4 physical cores or CPUs (refer to the previous slide).
• A vCPU is basically a time slot of a physical CPU.
• Note: ARM uses the term “vPE” (virtual Processing Element).
• There can be a 1-to-1 or many-to-1 relation between vCPUs and a real CPU
core.
• For the sake of understanding, imagine a single-core ARMv8 processor.
If we schedule 2 vCPUs on it (as shown in the picture
below), then this system has a 2-vCPU-to-1-real-CPU relationship.
[Figure: a single-core ARMv8 processor running a hypervisor that hosts two vCPUs, one for VM 1 and one for VM 2]
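To make the "vCPU is a time slot" idea concrete, here is a minimal Python sketch of a round-robin scheduler sharing one physical core between two vCPUs. It is only an illustration and not from the original slides: the function name `run_round_robin` and the fixed, equal time slots are assumptions, and a real hypervisor scheduler is far more elaborate.

```python
from collections import deque


def run_round_robin(vcpus, time_slots):
    """Return which vCPU owns the single physical core in each time slot."""
    ready = deque(vcpus)
    timeline = []
    for _ in range(time_slots):
        current = ready.popleft()   # hypervisor picks the next runnable vCPU
        timeline.append(current)    # that vCPU runs on the real core for one slot
        ready.append(current)       # slot expired: go to the back of the queue
    return timeline


# 2 vCPUs on 1 real core: the core alternates between VM 1 and VM 2.
print(run_round_robin(["vCPU0 (VM 1)", "vCPU1 (VM 2)"], 4))
```

Each VM sees what looks like its own CPU, although it only owns the core for every other slot.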

ARMv8 – Switching to Hypervisor context
• ARM provides the WFI (wait for interrupt)
instruction to put the CPU in a low-power state.
• But when the HCR_EL2.TWI bit is set and
either an application or the OS executes
WFI, the CPU switches
to the hypervisor context.
• ARM also supports the ‘HVC #imm’ instruction
(imm in the range 0-65535), which the
OS can execute to switch to the
hypervisor context.
• Note: ‘HVC #imm’ is undefined in
application context (EL0).
https://developer.arm.com/architectures/learn-the-architecture/armv8-a-virtualization/trapping-and-emulation-of-instructions
[Figure (Arm, Armv8-A virtualization guide): an OS inside a VM executes WFI as part of its idle loop; the hypervisor traps the instruction and schedules a different vCPU]
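Once the CPU has switched to the hypervisor, the hypervisor has to find out why. The following Python sketch shows how that decision could look. It is not from the original slides: the exception class values (0x01 for a trapped WFI/WFE, 0x16 for HVC from AArch64) and the ESR_EL2 field positions are taken from the Arm architecture, while the function names and the returned strings are made up for illustration.

```python
EC_WFX_TRAP = 0x01   # trapped WFI or WFE instruction
EC_HVC_A64 = 0x16    # HVC executed in AArch64 state


def decode_esr_el2(esr):
    """Split an ESR_EL2 value into (kind, hvc_immediate)."""
    ec = (esr >> 26) & 0x3F     # exception class, bits [31:26]
    iss = esr & 0x1FFFFFF       # instruction specific syndrome, bits [24:0]
    if ec == EC_WFX_TRAP:
        # ISS bit 0 distinguishes WFE (1) from WFI (0)
        return ("WFE" if iss & 1 else "WFI"), None
    if ec == EC_HVC_A64:
        return "HVC", iss & 0xFFFF   # the 16-bit immediate of 'HVC #imm'
    return "OTHER", None


def handle_trap(esr):
    """Hypothetical EL2 handler: decide what to do for each trap reason."""
    kind, imm = decode_esr_el2(esr)
    if kind == "WFI":
        return "guest is idle: schedule another vCPU"
    if kind == "HVC":
        return f"service hypercall #{imm}"
    return "not handled in this sketch"


print(handle_trap(EC_WFX_TRAP << 26))
print(handle_trap((EC_HVC_A64 << 26) | 7))
```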

ARM Cortex A53 (Raspberry Pi3) Memory Map
• The picture on the right shows the physical and virtual
memory addresses of the RPi3 B.
• Physical memory map
• RAM (1 GB, from “bcm2837-rpi-3-b.dts”)
memory {
reg = <0 0x40000000>;
};
• Memory-mapped I/O
• See the map of I/O peripherals on the right.
• Virtual memory map (32-bit split)
• User space: 0x0000 0000 to 0xBFFF FFFF
• Kernel space: 0xC000 0000 to 0xFFFF FFFF
[Figure: RPi3 B physical address map (DDR2 from 0x0000 0000 to 0x4000 0000, I/O peripherals up to 0x4003 FFFF) translated by the ARM MMU into the virtual address map with a 32-bit split (user space from 0x0000 0000, kernel space from 0xC000 0000 to 0xFFFF FFFF)]

ARM Cortex A53 – MMU
• ARM64 instructions such as LDR/STR and registers such as PC/LR all use
virtual addresses.
• This means every address or pointer in an application program is a virtual
address.
• MMU
• Sits between the CPU and the DDR controller (see next slide).
• Translates virtual addresses to physical addresses in hardware. Once configured, translations that hit in the TLB carry no extra penalty (see TLB slides).
• Address translation granules (page sizes) of 4KB (AArch32 & AArch64) and 64KB (AArch64 only).
• 16-bit ASID (AArch32 uses 8-bit), used in the TLB (see TLB slides).
• Maximum supported physical address size = 40 bits, i.e. 2^40 bytes = 1024 GB (1 TB).
• Provides fine-grained control through virtual-to-physical address mappings and memory attributes
held in page tables, which are loaded into the TLB (translation lookaside buffer).
• Generates an exception on any access violation.

ARM Cortex A53 – MMU mapping with Example
• Let us take an example application
that needs 16k of RAM (.text + .bss
+ .data and others).
• As soon as the application is
spawned, assume that the OS
allocates 16k at virtual address:
0x80000000.
• As the page size is 4k, the OS allots
4 contiguous pages as shown in the
table below:
[Figure: four contiguous 4k virtual pages (1 to 4) starting at 0x8000_0000, mapped to scattered pages in physical memory]
Page  VA: start  VA: end   PA: start  PA: end
1     80000000   80000FFF  00017000   00017FFF
2     80001000   80001FFF  00029000   00029FFF
3     80002000   80002FFF  00003000   00003FFF
4     80003000   80003FFF  0004F000   0004FFFF
This table illustrates the MMU lookup table. Note that the physical addresses in the last 2
columns are not contiguous.
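The lookup in the table above can be sketched in a few lines of Python. This is a simplified, single-level model for illustration only (the real ARMv8 tables are multi-level); the names `PAGE_TABLE` and `translate` are made up, and the page numbers are simply the table rows with the low 12 bits dropped.

```python
PAGE_SIZE = 4096  # 4k pages, so the low 12 bits are the offset within a page

# Virtual page number -> physical page number, taken from the table above.
PAGE_TABLE = {
    0x80000: 0x00017,
    0x80001: 0x00029,
    0x80002: 0x00003,
    0x80003: 0x0004F,
}


def translate(va):
    """Translate a virtual address to a physical address, as the MMU would."""
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in PAGE_TABLE:
        # The real MMU raises an exception (translation fault) here.
        raise LookupError(f"translation fault at VA {va:#010x}")
    return PAGE_TABLE[vpn] * PAGE_SIZE + offset


print(hex(translate(0x80001234)))  # page 2, offset 0x234
```

Two neighbouring virtual addresses that straddle a page boundary (for example 0x80000FFF and 0x80001000) land in unrelated physical pages, which is exactly what the non-contiguous PA columns show.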

ARM Cortex A53 – Translation Lookaside Buffer (TLB)
• Based on the example in the previous slide, assume that for every 5 to 10
instructions the CPU has to read a look-up table that is stored in external
DDR memory. Would the CPU be utilized efficiently? No.
• What is a TLB?
• The TLB is a ‘memory cache’ that holds recent Virtual Address (VA) to Physical Address
(PA) translations. This saves CPU time for translations that are used often.
• On a TLB miss, the address translation is fetched from the look-up table
and the TLB is updated.
• The TLB is organized as 4 major blocks:
• Micro TLB
• First-level translation cache with 10 entries each for the data side and the instruction side
• Main TLB
• Second level of the TLB structure, which handles the misses from the micro TLBs.
• Supports all VMSAv8 (Virtual Memory System Architecture) block sizes except 1 GB.
• IPA cache RAM
• The intermediate physical address (IPA) cache RAM holds mappings between intermediate
physical addresses and physical addresses.
• Only Non-secure EL1 and EL0 “stage 2” translations use this cache.
• Walk cache RAM
• Holds the partial result of a stage 1 (OS-controlled) translation.
• If a stage 1 translation results in a section or larger mapping, nothing is placed in the walk
cache RAM.
[Figure: inside the SoC, the CPU registers feed the MMU, whose TLB holds va-to-pa entries; the MMU connects to the DDR controller and the DDR memory]
Note: The IPA is part of “stage 2” address translation, i.e., the hypervisor-controlled address translation, which is discussed later.
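The benefit of a TLB can be shown with a small Python model that counts hits and misses. This is a sketch under assumptions, not a description of the Cortex-A53 hardware: it models a single 10-entry TLB with least-recently-used replacement (the replacement policy is my assumption), and the class name `Tlb` and the page table contents are illustrative.

```python
from collections import OrderedDict

PAGE_SIZE = 4096
# Stand-in for the look-up table in DDR (virtual page -> physical page).
PAGE_TABLE = {0x80000: 0x00017, 0x80001: 0x00029, 0x80002: 0x00003, 0x80003: 0x0004F}


class Tlb:
    """Tiny translation cache in front of the (slow) page table in DDR."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page -> physical page
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)          # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = PAGE_TABLE[vpn]    # table walk: the expensive path
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = Tlb()
for va in (0x80000000, 0x80000004, 0x80001000, 0x80000008):
    tlb.translate(va)
print(tlb.hits, tlb.misses)
```

Only the first access to each page pays for the table walk; later accesses to the same page are served from the TLB.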

ARM Cortex A53 – TLB matching and Cache handling
TLB match process
• Each TLB entry contains a VA, block size, PA and a set of memory
properties (type, access permissions, ...)
• Each entry is associated with a particular ASID and contains a field to store the
VMID
• A TLB entry matches when the following conditions are met:
• Its VA bits [47:N] match those of the requested address, where N is log2 of the block size of the entry
(12 for a 4k page)
• Its memory space matches the memory space state of the request. The
memory space can be one of four values:
• Secure EL3 (AArch64)
• Non-secure EL2
• Secure EL0, EL1 (and EL3 - AArch32)
• Non-secure EL0 or EL1
• Its ASID matches the current ASID held in CONTEXTIDR, TTBR0
or TTBR1, or the entry is marked as global
• Its VMID matches the current VMID held in the VTTBR register
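The four match conditions above can be written down as a single predicate. The Python sketch below is illustrative only: the `TlbEntry` fields and the memory-space strings are names I made up, and the VMID is compared unconditionally to keep the model as simple as the slide.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    va: int
    block_bits: int      # N = log2(block size); 12 for a 4k page
    memory_space: str    # e.g. "NS-EL0/EL1", "NS-EL2", "S-EL0/EL1", "S-EL3"
    asid: int
    vmid: int
    is_global: bool = False


def va_tag(va, block_bits):
    """Return VA bits [47:N], the part of the address a TLB entry is tagged with."""
    return (va & ((1 << 48) - 1)) >> block_bits


def matches(entry, va, memory_space, asid, vmid):
    same_va = va_tag(entry.va, entry.block_bits) == va_tag(va, entry.block_bits)
    same_space = entry.memory_space == memory_space
    same_asid = entry.is_global or entry.asid == asid   # global entries ignore the ASID
    same_vmid = entry.vmid == vmid
    return same_va and same_space and same_asid and same_vmid


entry = TlbEntry(va=0x80001000, block_bits=12, memory_space="NS-EL0/EL1", asid=5, vmid=1)
print(matches(entry, 0x80001ABC, "NS-EL0/EL1", asid=5, vmid=1))  # same page, same ASID/VMID
print(matches(entry, 0x80001ABC, "NS-EL0/EL1", asid=5, vmid=2))  # another VM: no match
```

The VMID check is what lets translations of different VMs live in the same TLB without being mixed up.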
Data cache coherency
• Uses MOESI protocol to maintain data coherency between
multiple cores.
• M - Modified - The line is in only this cache and is dirty (Unique Dirty)
• O - Owned - The line is possibly in more than one cache and is dirty
(Shared Dirty)
• E - Exclusive - The line is in only this cache and is clean. (Unique
Clean)
• S - Shared - The line is possibly in more than one cache and is clean
(Shared Clean)
• I - Invalid - The line is not in this cache
[Figure: inside the SoC, the CPU registers issue virtual addresses to the MMU + TLB, which issues physical addresses to the L1 cache, L2 cache and DDR memory]
Key takeaway: Memory is already virtualized for a single OS. Virtualizing it for more than one OS is done by adding stage-2 address translation.

ARMv8 stage 2 translation, MMIO & SMMU

ARMv8 Stage2 Address Translation
• Allows the hypervisor to control which memory-mapped system resources a VM can access and
how they appear within the VM.
• It can be used to ensure that a VM sees only the regions allocated to it, and not the resources that are allocated to other VMs or the hypervisor.
• In short, the OS-controlled translation is called stage 1 translation and the hypervisor-controlled
translation is called stage 2 translation.
• The address space that the OS thinks is physical memory is referred to as the Intermediate Physical Address (IPA) space.
https://developer.arm.com/architectures/learn-the-architecture/armv8-a-virtualization/stage-2-translation
[Figure: the address space of the OS or VM is translated by stage 1 into the hypervisor-controlled IPA space, which stage 2 translates into the RPi3 physical address space]
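The two stages can be chained in a short Python sketch. All addresses and table contents here are hypothetical and only for illustration; `STAGE1` plays the role of the table owned by the guest OS and `STAGE2` the table owned by the hypervisor.

```python
PAGE = 4096

# Guest OS table (stage 1): VA page -> IPA page. Values are made up.
STAGE1 = {0x80000: 0x10000, 0x80001: 0x10001}
# Hypervisor table (stage 2): IPA page -> PA page. Page 0x10001 is deliberately
# left unmapped, i.e. it is not allocated to this VM.
STAGE2 = {0x10000: 0x2A000}


def walk(table, addr, stage):
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise LookupError(f"stage {stage} fault at {addr:#x}")
    return table[page] * PAGE + offset


def va_to_pa(va):
    ipa = walk(STAGE1, va, 1)     # what the guest OS believes is physical
    return walk(STAGE2, ipa, 2)   # the real physical address


print(hex(va_to_pa(0x80000123)))
```

An access to 0x80001000 passes stage 1 but raises a stage 2 fault, which is how the hypervisor keeps a VM inside its allocated regions.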

ARMv8 - Virtual Peripheral Emulation using MMU
• There are 2 ways to assign a
peripheral to a VM
• Pass-through or “Assigned”
• Shared or “Virtual”
• Assigned peripheral – the physical device
is fully assigned to one VM.
• Virtual peripheral – the device is shared
between 2 or more VMs; a stage-2 fault
is generated to trap each access, which is then
emulated in the hypervisor.
• Why a stage-2 fault? Because a stage-1 fault
reports the virtual address used by the OS, which is
meaningless to the hypervisor, so it cannot
decide which peripheral it needs to emulate.
• For a stage-2 fault, the hypervisor can instead read the
HPFAR_EL2 register to get the faulting IPA, which
maps to a specific peripheral.
[Figure (Arm, Armv8-A virtualization guide): the peripheral regions of a VM map either to directly assigned physical peripherals or to virtual peripherals that are completely emulated in software by the hypervisor]

ARMv8 – Trapping and Emulation of Virtual Peripherals
• Let us take a “shared” UART (serial
comm.) as an example.
• An app running on a vCPU (VM) tries to
read data from the UART.
• Since it is not a pass-through device,
the read causes a stage-2 fault and
the context switches to the hypervisor.
• The hypervisor reads the HPFAR_EL2
register to find out which peripheral the VM
was trying to access.
• It then emulates the read operation
and returns the result to the VM.
• Note that this single read results in
2 context switches.
[Figure (Arm, Armv8-A virtualization guide): trapping then emulating the access. 1. Software in the VM attempts to access the virtual peripheral, here the receive FIFO of a virtual UART. 2. The access is blocked at stage 2 translation, leading to an abort routed to EL2. For single general-purpose register loads or stores, the ESR_ELx syndrome information includes the size of the access and the source or destination register, which lets the hypervisor determine the type of access being made.]
https://developer.arm.com/architectures/learn-the-architecture/armv8-a-virtualization/trapping-and-emulation-of-instructions
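The trap-and-emulate flow for the shared UART can be sketched in Python as follows. This is a model under assumptions, not hypervisor code: the UART IPA `UART_BASE_IPA`, the register offset and the class `VirtualUart` are hypothetical, and the bit positions used (HPFAR_EL2 bits [39:4] holding IPA[47:12], FAR_EL2 for the page offset, and the ISV, SRT and WnR fields of ESR_EL2) come from the Arm architecture rather than from the slides.

```python
UART_BASE_IPA = 0x0900_0000   # hypothetical IPA where the guest expects its UART
UART_RX_FIFO = 0x00           # hypothetical offset of the receive FIFO register


class VirtualUart:
    """Software model of the shared UART, owned by the hypervisor."""

    def __init__(self, pending):
        self.rx = list(pending)

    def read(self, offset):
        if offset == UART_RX_FIFO and self.rx:
            return self.rx.pop(0)
        return 0


def fault_ipa(hpfar_el2, far_el2):
    # HPFAR_EL2 bits [39:4] give IPA[47:12]; FAR_EL2 supplies the offset in the page.
    return (((hpfar_el2 >> 4) & ((1 << 36) - 1)) << 12) | (far_el2 & 0xFFF)


def handle_stage2_fault(hpfar_el2, far_el2, esr_el2, guest_regs, uart):
    """Emulate a guest load from the virtual UART. Returns True if handled."""
    if not esr_el2 & (1 << 24):          # ISV: syndrome must describe the access
        return False
    is_write = bool(esr_el2 & (1 << 6))  # WnR: 1 = write, 0 = read
    srt = (esr_el2 >> 16) & 0x1F         # SRT: register used by the load/store
    ipa = fault_ipa(hpfar_el2, far_el2)
    if UART_BASE_IPA <= ipa < UART_BASE_IPA + 0x1000 and not is_write:
        guest_regs[srt] = uart.read(ipa - UART_BASE_IPA)   # result goes back to the VM
        return True
    return False


regs = [0] * 31
uart = VirtualUart([0x41])
esr = (0x24 << 26) | (1 << 24) | (3 << 16)     # data abort from the guest, read into X3
handled = handle_stage2_fault((UART_BASE_IPA >> 12) << 4, 0xFFFF000012340000, esr, regs, uart)
print(handled, hex(regs[3]))
```

After the handler returns, the hypervisor would resume the vCPU at the instruction following the load, which is the second context switch mentioned above.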

ARMv8 – System Memory Management Units (SMMU)
• In the “shared” UART example discussed in the previous
slide, what happens if the UART needs to use DMA?
• There are 2 problems (see Fig-A):
• Isolation: the DMA controller is not subject to the stage 2 tables, so it
could be used to reach memory outside the VM’s sandbox.
• Address space: the UART driver in the unmodified guest believes that the addresses
it hands to the DMA controller are PAs, but they are IPAs, while the DMA controller operates on PAs. To fix this IPA <-->
PA mismatch in software, the hypervisor would have to trap every interaction between
the driver and the DMA controller, which defeats the purpose of using DMA.
• To overcome these problems, ARM introduced the
SMMU (also called IOMMU, see Fig-B).
• The SMMU extends stage 2 translation to other masters such as the DMA controller, so the guest driver can keep
programming the DMA controller with IPAs.
• If 2 VMs perform DMA operations, the addresses of each VM are translated
through the mappings of that VM, so isolation is maintained.
• During the copy operations, the SMMU translates the IPAs to
PAs before the accesses reach DDR (via the
Interconnect box shown in Fig-B).
• The hypervisor is responsible for programming the SMMU so that
the DMA controller sees the same view of memory as the VM.
Fig – A: DMA access without SMMU
Fig – B: DMA access with SMMU
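The role of the SMMU in Fig-B can be modelled in Python as a per-device translation table that the hypervisor programs. This is a simplified sketch: the class `Smmu`, the stream identifiers and all addresses are assumptions for illustration, and a real SMMU is configured through in-memory structures rather than a method call.

```python
PAGE = 4096


class Smmu:
    """Translates the IPAs issued by DMA masters into PAs, per stream (device)."""

    def __init__(self):
        self.stage2 = {}   # stream id -> {IPA page: PA page}

    def program(self, stream_id, mappings):
        # Done by the hypervisor, mirroring the stage 2 tables of the owning VM.
        self.stage2[stream_id] = dict(mappings)

    def translate(self, stream_id, ipa):
        page, offset = divmod(ipa, PAGE)
        table = self.stage2.get(stream_id, {})
        if page not in table:
            raise PermissionError(f"DMA access to {ipa:#x} blocked for stream {stream_id}")
        return table[page] * PAGE + offset


smmu = Smmu()
smmu.program("dma-vm1", {0x10000: 0x2A000})   # VM 1's buffer
smmu.program("dma-vm2", {0x10000: 0x31000})   # VM 2 uses the same IPA, different PA

print(hex(smmu.translate("dma-vm1", 0x10000010)))
print(hex(smmu.translate("dma-vm2", 0x10000010)))
```

Both guests program their DMA with the same IPA, yet the transfers land in different physical pages, and any IPA outside the programmed mappings is rejected.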

ARMv8 Exception Levels & Secure States

ARMv8 Exception Levels
• The ARMv8 model defines 4 exception
levels
• EL0 – least privileged (applications)
• EL1 – higher privilege (OS)
• EL2 – hypervisor
• EL3 – highest privilege, Secure
monitor
• On processor reset (power-on
reset), the system enters EL3.
• On taking an exception, the exception
level either increases or remains
the same. It never decreases.
• On return from an exception, the
exception level decreases or
remains the same.
• Every exception level has its own
stack pointer. The boot loader or
the initialization part of the operating
system has to set up these
registers for all exception levels.
[Figure 3-1: ARMv8 security model when EL3 is using AArch64. Non-secure state: App1 and App2 at EL0 (AArch32 or AArch64†) under Guest OS1 and Guest OS2 at EL1 (AArch32 or AArch64‡), with the Hypervisor at EL2 (AArch32 or AArch64). Secure state: Secure App1 and Secure App2 at EL0 (AArch32 or AArch64†) under the Secure OS at EL1 (AArch32 or AArch64). The Secure monitor runs at EL3 (AArch64).
† AArch64 permitted only if EL1 is using AArch64
‡ AArch64 permitted only if EL2 is using AArch64]
Ref: https://developer.arm.com/documentation/100095/0003/programmers-model/armv8-a-architecture-concepts/armv8-security-model
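The two rules about exception level changes can be captured in a tiny Python check. The function names are made up for illustration, and the extra rule that exceptions are never taken to EL0 comes from the Arm architecture, not from the slide.

```python
def can_take_exception(current_el, target_el):
    """Taking an exception keeps or raises the exception level (never to EL0)."""
    return 1 <= target_el <= 3 and target_el >= current_el


def can_return(current_el, target_el):
    """Returning from an exception keeps or lowers the exception level."""
    return 0 <= target_el <= current_el


print(can_take_exception(1, 2))   # guest OS (EL1) trapping to the hypervisor (EL2)
print(can_take_exception(2, 1))   # not allowed: an exception never lowers the level
print(can_return(2, 1))           # hypervisor returning to the guest OS
```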

ARMv8 Security States & Virtualization Support
• Secure state
• Can access both the Secure and the Non-secure memory
spaces.
• When executing at EL3, it can access
all system control resources.
• Non-secure state
• Can only access the Non-secure memory
space.
• Cannot access the Secure
system control resources.
• Virtualization support
• Software running at EL2 has access to
several controls for virtualization
• Stage 2 translation
• EL1/0 instruction and register access
trapping
• Virtual exception generation
https://developer.arm.com/architectures/learn-the-architecture/armv8-a-virtualization/virtualization-in-aarch64
[Figure (Arm, Armv8-A virtualization guide): the Exception Levels in Non-secure and Secure states, with Secure EL2 shown in gray]
ARM Cortex A53 – Interrupt Controller GICv4
• Interrupt sources
• Message-based interrupts are generated by a memory write to an assigned address.
• Wire-based interrupts are generated by peripherals such as UART or I2C via I/O
pins.
• SPI – Shared Peripheral Interrupt, which can be either message-based or wire-
based. Can be routed to any PE configured to handle interrupts.
• PPI – Private Peripheral Interrupt, which targets a single specific PE (Processing Element).
• LPI – Locality-specific Peripheral Interrupt, which uses the ITS (Interrupt
Translation Service) to route an interrupt to a specific Redistributor and PE.
• SGI – Software Generated Interrupt, generated by PEs.
• Distributor
• Performs interrupt prioritization and distribution of SPIs and SGIs to the
Redistributors and CPU interfaces.
• Redistributor (red box)
• Holds the control, prioritization and pending information for all physical LPIs using
data structures that are held in memory. Provides a programming interface for:
• Enabling / disabling SGIs and PPIs, and setting their priority levels
• Setting each PPI to be level-sensitive or edge-triggered
• ...
• CPU interface (blue box)
• Provides a register interface to the PE. Provides a programming interface for:
• Control and configuration, to enable interrupt handling in accordance with the Security state and
legacy support requirements of the implementation
• Acknowledging an interrupt
• Performing a priority drop and deactivating an interrupt
• ...
[Figure 3-2 (GIC architecture specification, section 3.1): GIC logical partitioning with an ITS. Components shown: Distributor, Interrupt Translation Service (ITS), and one Redistributor and CPU interface per PE (PE x.y.0.0 to x.y.n.1, grouped in clusters C0 to Cn). Interrupt types shown: SPIs, PPIs, LPIs and SGIs, arriving as wire-based or message-based interrupts.
a. The inclusion of an ITS is optional, and there might be more than one ITS in a GIC.
b. SGIs are generated by a PE and routed through the Distributor.
The mechanisms for communication between the ITS and the Redistributors, and between the CPU interfaces and the Redistributors, are IMPLEMENTATION DEFINED.]
Ref: https://static.docs.arm.com/ihi0069/c/IHI0069C_gic_architecture_specification.pdf, section 3.1

ARM Cortex A53 – Interrupt Lifecycle & Interrupt numbers
[Figure 4-1 (GIC architecture specification): physical interrupt lifecycle. Generate (a device generates an interrupt), Distribute, Deliver (the CPU interface delivers the interrupt to the PE), Activate (the PE acknowledges the interrupt), then Priority drop and Deactivation (the PE ends the interrupt). The Deactivation step does not apply to LPIs.]
INTID            Interrupt type             Details
0 - 15           SGI                        These interrupts are local to a CPU interface.
16 - 31          PPI                        These interrupts are local to a CPU interface.
32 - 1019        SPI                        Shared peripheral interrupts that the Distributor can route to either a
                                            specific PE, or to any one of the PEs in the system that is a participating node.
1020 - 1023      Special interrupt number   1020 - GIC returns this at EL3 for an interrupt expected to be handled at Secure EL1.
                                            1021 - GIC returns this at EL3 for an interrupt expected to be handled at Non-secure EL1 or EL2.
                                            1022 - legacy operation only.
                                            1023 - GIC returns this on an interrupt acknowledge when there is no pending interrupt of
                                            sufficient priority, or when the pending interrupt is not appropriate for the current
                                            Security state or interrupt group.
1024 - 8191      -                          Reserved.
8192 and above   LPI                        Peripheral hardware interrupts that are routed to a specific PE (directly).
(upper limit is                             
implementation
defined)
Note: INTIDs 0-1023 are compatible with earlier versions of the GIC architecture.
Ref: https://static.docs.arm.com/ihi0069/c/IHI0069C_gic_architecture_specification.pdf, section 4.1
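The INTID ranges in the table translate directly into a small classification helper. This Python sketch only restates the table; the function name `classify_intid` is made up for illustration.

```python
def classify_intid(intid):
    """Map a GIC interrupt ID to its interrupt type, following the table above."""
    if intid < 0:
        raise ValueError("INTID cannot be negative")
    if intid <= 15:
        return "SGI"
    if intid <= 31:
        return "PPI"
    if intid <= 1019:
        return "SPI"
    if intid <= 1023:
        return "Special"
    if intid <= 8191:
        return "Reserved"
    return "LPI"


for intid in (3, 27, 33, 1023, 8192):
    print(intid, classify_intid(intid))
```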

ARMv8 Interrupt Grouping
• GICv3 onwards supports Interrupt Grouping as a mechanism to align interrupt handling with ARMv8
exception & security model.
• In a system with two Security states (secure, non-secure), an interrupt is configured as one of the
following:
• A Group 0 physical interrupt:
• ARM expects these interrupts to be handled at EL3.
• A Secure Group 1 physical interrupt:
• ARM expects these interrupts to be handled at Secure EL1.
• A Non-secure Group 1 physical interrupt:
• ARM expects these interrupts to be handled at Non-secure EL2 in systems using virtualization, or at Non-secure EL1 in systems not using
virtualization
• In a system with one Security state an interrupt is configured to be either:
• Group 0.
• Group 1.
• At the System level, GICD_CTLR.DS indicates if the GIC is configured with one or two Security states.
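For a system with two Security states, the grouping rules above can be summarized in a short Python function. It is only a restatement of the bullets; the function name and the string results are illustrative.

```python
def expected_handler(group, secure, virtualization=False):
    """Return where ARM expects an interrupt to be handled (two Security states)."""
    if group == 0:
        return "EL3"
    if group == 1 and secure:
        return "Secure EL1"
    if group == 1:
        return "Non-secure EL2" if virtualization else "Non-secure EL1"
    raise ValueError("group must be 0 or 1")


print(expected_handler(0, secure=True))
print(expected_handler(1, secure=True))
print(expected_handler(1, secure=False, virtualization=True))
```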

ARM Cortex A53 – Virtual Interrupt Handling
• Say a serial input device asserts its
interrupt signal to the GIC.
• During initialization, software
executing at EL3 or EL2 configures the
PE to route interrupts to EL2
(hypervisor).
• The GIC generates a physical interrupt
exception, either IRQ or FIQ, which
then gets routed to EL2
(hypervisor).
• The hypervisor then configures the
GIC to forward the physical
interrupt as a virtual interrupt (vIRQ
or vFIQ) to the right vCPU/VM.
• The hypervisor then returns
control to the vCPU/VM.
• The vCPU/VM uses the virtual CPU
interface to read and respond to
the interrupt.
Note: From GICv4 onwards, LPIs can be injected directly into a VM, which reduces context switches to the hypervisor.
[Figure (Arm, Armv8-A virtualization guide): 1. The physical peripheral asserts its interrupt signal into the GIC. 2. The GIC generates a physical interrupt exception, either IRQ or FIQ, which gets routed to EL2 by the configuration of HCR_EL2.IMO/FMO. The hypervisor identifies the peripheral, determines that it has been assigned to a VM, and checks which vCPU the interrupt should be forwarded to.]
ARMv8 Generic Timer, Clock Tree & Resets

ARM Cortex A53 – Generic Timer
Functional Description
• Each core has the following set of 64-bit timers:
• EL1 Non-secure physical timer
• EL1 Secure physical timer
• EL2 physical timer
• Virtual timer
• The system counter value (which resides in the SoC) is
distributed to the Cortex-A53 processor via
CNTVALUEB[63:0].
• The system counter typically operates at a lower frequency
than CLKIN (the main processor clock).
• Each timer provides an active-LOW interrupt output to the
SoC.
• External interrupt output pins (n = number of cores - 1)
• nCNTPNSIRQ[n:0] - EL1 Non-secure physical timer event
• nCNTPSIRQ[n:0] - EL1 Secure physical timer event
• nCNTHPIRQ[n:0] - EL2 physical timer event
• nCNTVIRQ[n:0] - Virtual timer event
https://developer.arm.com/documentation/ddi0500/e/generic-timer/generic-timer-functional-description
[Figure 10-1 (Cortex-A53 TRM): architectural counter interface. The CNTCLKEN input is registered inside the Cortex-A53 processor and used as a clock enable (clock gate) for the architectural counter registers that capture CNTVALUEB[63:0]. Table 10-1 of the TRM lists the Generic Timer external interrupt output signals.]
• The timer schedules events and triggers
interrupts based on an incrementing
counter value.
• It provides
• Generation of timer events as
interrupt outputs
• Generation of event streams

ARM Cortex A53 – System Counter for Timer
• The system counter (in the SoC) generates the count value and
distributes it to all cores (PEs).
• The system counter measures real time and is not
affected by DVFS.
• Provides an interface to programmers via the following frames
• CNTControlBase, accessible only to Secure software (typically at EL3), contains the following
registers
• CNTCR – Control Register, contains enable, frequency selection, scaling
selection, etc.
• CNTSR – Status Register, reports whether the timer is running or not.
• CNTCV – Reports the current count value.
• ...
• CNTReadBase
• This is a copy of CNTControlBase that includes the CNTCV register only.
[Figure (Arm, Generic Timer guide): example system that contains additional external timers, which are programmed through memory-mapped registers]
https://developer.arm.com/architectures/learn-the-architecture/generic-timer/what-is-the-generic-timer

ARMv8 – Timer Virtualization
• Similar to the “shared” UART driver example (slide 20), we can also trap timer
interrupts in the hypervisor. But this would add considerable CPU overhead, as
the timer is at the core of any OS.
• The good news here is that ARMv8 allows a vCPU to access the following timers for
its scheduling needs:
• EL1 Non-secure physical timer – Read Only
• Virtual Timer – Read Write
• To deliver a timer interrupt, the GICv4 must be configured to route the
interrupt to a specific vCPU.
• Note: as discussed earlier (slide 29), using LPIs should reduce the interrupt context
switches. This is something I need to do more research on and confirm.
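To make the virtual timer more concrete, here is a simplified software model (not hardware-accurate code). It relies on the architectural relation that the virtual count equals the physical count minus an offset held in CNTVOFF_EL2, which the hypervisor controls. The function names and the way the enable and mask bits are passed in are assumptions for illustration.

```python
# Simplified model of the ARMv8 virtual timer as seen by a guest.
MASK64 = (1 << 64) - 1


def virtual_count(physical_count, cntvoff):
    """Virtual count = physical count - CNTVOFF_EL2, modulo 2**64."""
    return (physical_count - cntvoff) & MASK64


def virtual_timer_fires(physical_count, cntvoff, cval, enabled=True, masked=False):
    """Interrupt is asserted when the timer is enabled, not masked,
    and the virtual count has reached the compare value (CVAL)."""
    return enabled and not masked and virtual_count(physical_count, cntvoff) >= cval
```

In this model, a hypervisor that paused a guest for 1000 ticks could add 1000 to the offset so that the guest does not observe a jump in virtual time.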

ARM Cortex A53 – Clocks and Resets
Clock Tree
• The Cortex-A53 processor has a single
clock input, CLKIN
• RPi3 uses a 1.2 GHz clock
• All cores in Cortex-A53 & SCU are
clocked with a distributed version of
CLKIN.
• Clock Tree
• PCLK - APB interface / bus
• ACLK - ACE (extension to AXI) bus, ACP
slave interface
• SCLK - SCU interface only if CHI protocol
is used.
• ATCLK - ATB interface, which can operate
at an integer ratio equal to or slower than the main clock
• CNTCLK - 64-bit counter
Reset Inputs
• The Cortex-A53 processor has the
following active-LOW reset input
signals
• nCPUPORESET[N:0] - primary, cold reset
signals that initialize all resettable registers
• nCORERESET[N:0] - same as above,
except debug registers and ETM registers
• nPRESETDBG - single, cluster-wide signal
that resets the integrated CoreSight
components that connect to the external
PCLK domain, such as debug logic
• nL2RESET - resets all resettable registers
in L2 memory system and the logic in the
SCU
• nMBISTRESET - an external MBIST
controller can use this signal to reset the
entire SoC.
The clock tree and resets are typically handled by the Host OS. VMs generally don’t modify these.

Linux KVM/ARM implementation
An overview

Linux KVM/ARM
• KVM stands for Kernel-based Virtual
Machine, which can run “unmodified”
guests.
• As discussed in slide 4, this is a derived
type, where the hypervisor is part of an OS
that uses hardware-assisted features. It
doesn’t fall under Type 1 or Type 2.
• As shown in the picture on the right, the KVM
implementation is split into 2 parts
• Highvisor – runs in EL1
• Lowvisor – runs in EL2, to trap hypervisor
calls and exceptions.
in slow and convoluted code paths. As a simple example, a page
fault handler needs to obtain the virtual address causing the page
fault. In Hyp mode this address is stored in a different register
than in kernel mode.
Second, running the entire kernel in Hyp mode would adversely
affect native performance. For example, Hyp mode has
its own separate address space. Whereas kernel mode uses two
page table base registers to provide the familiar 3GB/1GB split
between user address space and kernel address space, Hyp mode
uses a single page table register and therefore cannot have direct
access to the user space portion of the address space. Frequently
used functions to access user memory would require the kernel
to explicitly map user space data into kernel address space and
subsequently perform necessary teardown and TLB maintenance
operations, resulting in poor native performance on ARM.
These problems with running a Linux hypervisor using ARM
Hyp mode do not occur for x86 hardware virtualization. x86 root
mode is orthogonal to its CPU privilege modes. The entire Linux
kernel can run in root mode as a hypervisor because the same set
of CPU modes available in non-root mode are available in root
mode. Nevertheless, given the widespread use of ARM and the
advantages of Linux on ARM, finding an efficient virtualization
solution for ARM that can leverage Linux and take advantage
[Figure 2: KVM/ARM System Architecture. Labels: PL 0 (User) with Host User (QEMU) and VM User; PL 1 (Kernel) with Host Kernel (KVM Highvisor) and VM Kernel; PL 2 (Hyp) with the Lowvisor; Trap arrows into the Lowvisor]
processing required and defers the bulk of the work to be done
to the highvisor after a world switch to the highvisor is complete.
The highvisor runs in kernel mode as part of the host Linux
kernel. It can therefore directly leverage existing Linux functionality
such as the scheduler, and can make use of standard kernel
software data structures and mechanisms to implement its functionality,
such as locking mechanisms and memory allocation
functions. This makes higher-level functionality easier to implement
in the highvisor. For example, while the lowvisor provides
https://www.cs.columbia.edu/~nieh/pubs/asplos2014_kvmarm.pdf

KVM/ARM Linux
• The source tree (picture) on the right shows all the files
that realize the Highvisor and Lowvisor described in the
previous slide.
• Most of the file names should be familiar by now.
• It should also give a feel for how small a hypervisor
implementation can be (~16k lines of code).
<Linux Kernel Src>
├── COPYING
├── CREDITS
├── Documentation
~~~
└── virt
    ├── Makefile
    ├── built-in.a
    ├── kvm
    │   ├── Kconfig
    │   ├── arm
    │   │   ├── aarch32.c
    │   │   ├── arch_timer.c
    │   │   ├── arm.c
    │   │   ├── hyp
    │   │   ├── mmio.c
    │   │   ├── mmu.c
    │   │   ├── perf.c
    │   │   ├── pmu.c
    │   │   ├── psci.c
    │   │   ├── trace.h
    │   │   └── vgic
    │   ├── async_pf.c
    │   ├── async_pf.h
    │   ├── coalesced_mmio.c
    │   ├── coalesced_mmio.h
    │   ├── eventfd.c
    │   ├── irqchip.c
    │   ├── kvm_main.c
    │   ├── vfio.c
    │   └── vfio.h
    ├── lib
    │   ├── Kconfig
    │   ├── Makefile
    │   ├── built-in.a
    │   ├── irqbypass.c
    │   ├── modules.builtin
    │   └── modules.order
    ├── modules.builtin
    └── modules.order

KVM and its suitability for Automotive
• The highvisor of KVM is basically a character device driver.
• From a software component perspective, KVM is a kernel module loaded after
the Linux kernel is initialized.
• Creating, modifying and starting a VM all happen through ioctl() calls from userspace.
• Auto-start of a VM is also possible under KVM.
• This means the first virtual machine can start only after the Linux kernel is initialized.
• So a product with an architecture similar to the one in slide 3 will have difficulty
in meeting the safety and start-up time needs of Automotive.
• But if we use a safety-certified Linux as the hypervisor (i.e., a kernel configured with minimum
features, ~2 MB size) on a system with high-speed eMMC or UFS (more than 400 Mbps),
then there is a possibility of meeting the timing and safety requirements of Automotive.
• Note: a 2 MB image @ 400 Mbps will get loaded in 40 ms.
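The 40 ms figure in the note can be checked with a short calculation. The sketch below assumes decimal units (1 MB = 1,000,000 bytes, 1 Mbps = 1,000,000 bits per second) and counts only raw transfer time; the function name is made up for this example.

```python
def load_time_ms(image_megabytes, link_megabits_per_s):
    """Raw transfer time in milliseconds for an image over a storage link."""
    bits = image_megabytes * 1_000_000 * 8           # image size in bits
    return bits * 1000 / (link_megabits_per_s * 1_000_000)


print(load_time_ms(2, 400))  # 2 MB over 400 Mbps
```

This estimate ignores bootloader work, decompression and kernel initialization, so the real start-up time would be longer than the transfer time alone.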
My view: It is not wise to go down this kind of path for Automotive. Strategically, we need a light-weight Type 1 hypervisor for Automotive.
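The ioctl()-based control path mentioned in this slide can be probed from userspace. The sketch below is only an illustration: it opens /dev/kvm and asks for the API version with KVM_GET_API_VERSION, returning None when KVM is not available. The request numbers follow the _IO(KVMIO, nr) definitions in <linux/kvm.h> as encoded on x86 and ARM; the helper names are assumptions, and KVM_CREATE_VM is defined only to show the next call a VM launcher would make.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte used by KVM in <linux/kvm.h>


def _io(io_type, nr):
    """Equivalent of the C _IO() macro on x86 and ARM (no direction, no size)."""
    return (io_type << 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)
KVM_CREATE_VM = _io(KVMIO, 0x01)  # not issued here, shown for reference


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if KVM cannot be opened or queried."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("KVM API version:", kvm_api_version())
```

A real launcher such as QEMU continues from here with further ioctl() calls to create the VM, its memory regions and its vCPUs.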

Minos Hypervisor (Type 1)
• Though there are many Type 1 hypervisors, the Minos project sounded
interesting as it supports virtualization on Raspberry Pi.
• Please evaluate it if you get time
• https://github.com/minosproject/minos
• I will provide more details and my views as time permits.

Graphics, Display & Audio
How do current suppliers virtualize these peripherals, which are critical for
Automotive?

Graphics, Display & Audio Virtualization
• Most suppliers & tier 1s will go for a para-
virtualization solution (shown below) for
these devices as these are a bit complex.
• These solutions add some context-switch
overhead, but keep latency low thanks to shared
memory
• In future we might see solutions similar
to what ARM has done for DMA
(picture below) for these peripherals
also.
• Alternatively, SoCs can provide more than one
instance of a peripheral block, so that each VM can
use one of them as pass-through.
In this system, a hypervisor is using stage 2 to provide isolation between VMs. The ability
to see memory is limited by the stage 2 tables that the hypervisor controls.
Allowing a driver in the VM to directly interact with the DMA controller creates two prob
Isolation: The DMA controller is not subject to the stage 2 tables, and could be used to br
VM’s sandbox.
Address space: With two stages of translation, what the kernel believes to be PAs are IPA
controller still sees PAs, therefore the kernel and DMA controller have different views o
overcome this problem, the hypervisor could trap every interaction between the VM and
controller, providing the necessary translation. When memory is fragmented, this proces
An alternative to trapping and emulating driver accesses is to extend the stage 2 regime
other masters, like our DMA controller. When this happens, those masters also need an M
referred to as a System Memory Management Unit (SMMU, sometimes also called IOMM
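[Figure: para-virtualized graphics / sound stack. Labels: SoC with GPU/Display/Sound; Hypervisor; Host OS with Driver(s), Backend Server and Graphics / Sound Application; Guest OS with Frontend Client Driver and Graphics / Sound Application; Shared Memory between Host OS and Guest OS]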
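The ARM excerpt on this slide describes why a DMA controller needs an SMMU once stage 2 translation is in use. The toy model below sketches that idea under simplifying assumptions: a single-level table, 4 KB pages, and made-up page numbers and function names. It is not how a real SMMU is programmed.

```python
PAGE = 4096

# Hypothetical stage 2 table for one VM: IPA page number -> PA page number.
STAGE2_VM1 = {0x10: 0x80, 0x11: 0x93}


def stage2_translate(table, ipa):
    """Translate an intermediate physical address (IPA) to a physical address (PA)."""
    page, offset = divmod(ipa, PAGE)
    if page not in table:
        raise PermissionError(f"stage 2 fault at IPA {ipa:#x}")
    return table[page] * PAGE + offset


def dma_target(address, smmu_table=None):
    """Physical address a DMA master hits for a buffer address programmed by a guest."""
    if smmu_table is None:
        return address  # no SMMU: the guest's IPA is used directly as a PA
    return stage2_translate(smmu_table, address)
```

Without the SMMU, the guest-programmed address 0x10020 is used as a physical address and lands outside the VM's memory; with the SMMU applying the same stage 2 table as the CPU, it resolves to 0x80020, and unmapped addresses fault instead of escaping the sandbox.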

Acronyms – 1 / 3
• ACE – an extension to AXI protocol
• ACLK – ACE (extension to AXI bus) Clock
• ACP – Accelerator Coherency Port (an AXI slave interface)
• APB – Advanced Peripheral Bus for slower speed interface by ARM
• App - Software Application (e.g. Calculator App in Android)
• ASID – Address Space Identifier
• ATB – Interface for Trace by ARM
• ATCLK – ATB Clock
• AXI – Advanced eXtensible Interface by ARM
• BCM – Broadcom (the maker of the SoC used in Raspberry Pi)
• CHI – a scalable protocol supporting multi-node interconnect
• CLKIN – Clock In (main processor clock, 1.2 GHz)
• CNTCLK – Timer / Counter Clock
• CNTCLKEN – Counter Clock Enable
• CNTCR – Timer Control Register
• CNTCV – Timer Count Value
• CNTHPIRQ – EL2 (hypervisor) physical timer event pin
• CNTPNSIRQ – EL1 Non-secure physical timer event pin
• CNTPSIRQ - EL1 Secure physical timer event pin
• CNTSR – Timer Status Register
• CNTVALUE – 64bit counter value
• CNTVIRQ - Virtual timer event pin
• CONTEXTIDR – Context ID Register (identifies current process ID and ASID)
• CORERESET – Core reset (debug and ETM registers are preserved)
• CPU – Central Processing Unit
• CPUPORESET – CPU Power On Reset
• CTI – CoreSight Trigger Interface (Debug & Trace)
• CTLR.DS – Control Register -> Disable Security bit
• CTM – CoreSight Trigger Matrix (Debug & Trace)
• DDR – Double Data Rate
• DFT – Design For Test
• DMA – Direct Memory Access (the unit that offloads data copies from the CPU)

Acronyms – 2 / 3
• DTS – Device Tree Source
• DVFS – Dynamic Voltage and Frequency Scaling
• ECC – Error Correction Code
• ECU – Electronic Control Unit (in cars)
• ELx – Exception Level ‘x’ [x: 0 to 3 for ARMv8]
• ERET – Exception Return
• ESR – Exception Syndrome Register
• FIQ – Fast Interrupt Request (takes higher priority than IRQ)
• FPU – Floating Point Unit
• GIC – Generic Interrupt Controller
• GICD - GIC Distributor
• GPU – Graphics Processing Unit
• HCR – Hypervisor Configuration Register
• HPFAR – Hyp IPA Fault Address Register. Holds the faulting IPA for some
aborts on stage 2 translation.
• HVC – Hypervisor Call
• I/O – Inputs and / or Outputs
• INTID – Interrupt Identifier
• IOMMU - I/O MMU, same as SMMU.
• IPA - Intermediate Physical Address
• IRQ – Interrupt Request (from I/O to CPU)
• ITS – Interrupt Translation Service that injects Interrupt directly to VMs.
• IVI – In-Vehicle Infotainment unit.
• KVM – Kernel-based Virtual Machine
• L1 – Level 1
• L2RESET – L2 Memory system reset
• LDR – Load Register (loads a register from memory)
• log2 – Binary Logarithm
• LPI - Locality-specific Peripheral Interrupt, an interrupt that uses the ITS
• LR – Link Register
• MBIST – Memory Built In Self Test
• MBISTRESET – reset signal that an external MBIST controller can use to reset the SoC
• MMU – Memory Management Unit
• NDA – Non Disclosure Agreement

Acronyms – 3 / 3
• OS – Operating System
• PA – Physical Address
• PC – Program Counter
• PCLK – Peripheral Clock (APB interface clock)
• PE – Processing Element
• PMU – Performance Monitoring Unit
• PoC – Proof of Concept
• PPI - Private Peripheral Interrupt, targets single specific PE
• PRESETDBG - single, cluster-wide signal that resets the integrated CoreSight debug components
• QNX – QNX Operating System
• RAM – Random Access Memory
• ROM – Read Only Memory
• SCLK – SCU Clock
• SCU – Snoop Control Unit that maintains cache coherency
• SGI – Software Generated Interrupts, generated by PEs.
• SMMU – System Memory Management Unit (the MMU for peripherals)
• SoC – System on Chip
• SPI – Shared Peripheral Interrupt
• SRAM – Static Random Access Memory
• STR - Store Register (stores a register to memory)
• TLB – Translation Lookaside Buffer
• TTBRx – Translation Table Base Register x [x: 0 or 1]
• UART – Universal Asynchronous Receiver/Transmitter. Serial Comms.
• VA – Virtual Address
• vCPU – Virtual CPU
• VM – Virtual Machine
• VMID – Virtual Machine Identifier
• VMSA – Virtual Memory System Architecture
• VTTBR – Virtualization Translation Table Base Register
• WFI – Wait For Interrupt

Thank you!
Please send your feedback and comments to c.n.aananth@gmail.com
