BitVisor on Aarch64
2022/12/07 @ BitVisor Summit 11
Ake Koomsin
Agenda
◼ Current requirements
◼ How VMM works on Aarch64
◼ BitVisor Aarch64 initialization
◼ Interrupt handling
◼ MMIO handling
◼ Multiple core support
◼ Current limitation
◼ Ongoing tasks
◼ QEMU bugs we found
◼ Demo
Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
Current requirements
◼ Armv8.1 or later
– The Virtualization Host Extensions (VHE) are needed for the process
implementation
◼ Generic Interrupt Controller v3 (GICv3)
– Guest interrupt injection
◼ EL3 and Power State Coordination Interface (PSCI)
– Firmware running in EL3
– For secondary core start-up
◼ UEFI environment and ACPI
– BitVisor currently relies on them
How VMM works on Aarch64
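[Figure: Exception levels. EL0: processes P0 to P3; EL1: guest OSes OS0 and OS1; EL2: hypervisor; EL3: firmware. SVC, HVC, and SMC are the calls into EL1, EL2, and EL3, respectively.]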
How VMM works on Aarch64
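[Figure: Standard vs. with VHE. Standard: hypervisor at EL2, OS0 at EL1, processes P0 and P1 at EL0. With VHE: host OS/hypervisor at EL2, OS0 at EL1, processes P0, P1, and P2 at EL0.]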
How VMM works on Aarch64
◼ Main system registers related to virtualization
– HCR_EL2
• Enable/Disable hypervisor
• Hypervisor behavior
• Register trapping
– VTTBR_EL2
• Stage-2 translation page table
– VTCR_EL2
• Stage-2 translation control
– VMPIDR_EL2
• Multiprocessor Affinity Register (MPIDR_EL1) value read by the guest
– VPIDR_EL2
• Main ID Register (MIDR_EL1) value read by the guest
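As a quick reference for HCR_EL2, the sketch below defines the bits mentioned on these slides and composes one plausible value for running an AArch64 guest at EL1 under a VHE host. The bit positions come from the Arm Architecture Reference Manual; the macro names mirror the HCR_* identifiers used on these slides, and `guest_hcr_value()` is a hypothetical helper, not BitVisor's actual HCR_FLAGS.

```c
#include <stdint.h>

/* HCR_EL2 bit positions (Arm ARM); names follow the slides' HCR_* style */
#define HCR_VM	(UINT64_C(1) << 0)	/* Enable stage-2 translation */
#define HCR_FMO	(UINT64_C(1) << 3)	/* Route physical FIQ to EL2 */
#define HCR_IMO	(UINT64_C(1) << 4)	/* Route physical IRQ to EL2 */
#define HCR_AMO	(UINT64_C(1) << 5)	/* Route physical SError to EL2 */
#define HCR_VF	(UINT64_C(1) << 6)	/* Virtual FIQ pending */
#define HCR_VI	(UINT64_C(1) << 7)	/* Virtual IRQ pending */
#define HCR_VSE	(UINT64_C(1) << 8)	/* Virtual SError pending */
#define HCR_TSC	(UINT64_C(1) << 19)	/* Trap SMC from EL1 to EL2 */
#define HCR_TGE	(UINT64_C(1) << 27)	/* Host EL0 mode; kept clear for the guest */
#define HCR_RW	(UINT64_C(1) << 31)	/* EL1 executes in AArch64 */
#define HCR_E2H	(UINT64_C(1) << 34)	/* Enable VHE */

/* Hypothetical HCR_EL2 value while an AArch64 guest runs at EL1 */
static inline uint64_t
guest_hcr_value (void)
{
	return HCR_VM | HCR_FMO | HCR_IMO | HCR_AMO | HCR_TSC | HCR_RW |
		HCR_E2H;
}
```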
How VMM works on Aarch64
◼ Aarch64 page table basics
– Typically, an OS uses TTBR0_EL1 for a process’s page
table and TTBR1_EL1 for the kernel page table
• Addresses in the upper range (0xF… prefix) are translated via TTBR1_EL1
– Without VHE, EL2 has only TTBR0_EL2
– With the VHE feature (HCR_EL2.E2H = 1), EL2 behaves like
EL1
• TTBR1_EL2 becomes available
• Translation-related system registers change their layout
– Ex. TCR_EL2 bit definitions become the same as TCR_EL1
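To make the TTBR0/TTBR1 split concrete, here is a simplified sketch of how the MMU picks a translation table base for a VA in a two-range regime (EL1&0, or EL2&0 with VHE). It assumes top-byte-ignore is off and T0SZ/T1SZ between 16 and 39; `select_ttbr()` is a hypothetical helper. With T1SZ = 16, an address such as 0xFFFF000000000000 falls into the TTBR1 range.

```c
#include <stdint.h>

/*
 * Simplified VA range selection (TBI disabled, 16 <= t0sz, t1sz <= 39).
 * Returns 0 for TTBR0, 1 for TTBR1, -1 for a translation fault.
 */
int
select_ttbr (uint64_t va, unsigned int t0sz, unsigned int t1sz)
{
	uint64_t top0 = va >> (64 - t0sz);	/* VA[63:64-T0SZ] */
	uint64_t top1 = va >> (64 - t1sz);	/* VA[63:64-T1SZ] */
	uint64_t ones1 = (UINT64_C(1) << t1sz) - 1;

	if (top0 == 0)
		return 0;	/* Lower range: all top bits zero */
	if (top1 == ones1)
		return 1;	/* Upper range: all top bits one */
	return -1;		/* Hole between the two ranges */
}
```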
How VMM works on Aarch64
◼ The guest OS returns to EL2 from time to time through
exceptions
– Interrupts
• If the hypervisor chooses to route interrupts to EL2
– Traps
• Register accesses
• Stage-2 (intermediate physical address) translation faults
How VMM works on Aarch64
◼ When an exception occurs
– The entry point is one of the entries in the vector table
pointed to by VBAR_EL2
• Selected by the exception source (current or lower EL, SP in use, AArch64/AArch32) and the exception type
– The first step is to save the current context
• General-purpose registers x0-x30
• Floating-point registers if necessary
• Other system registers if necessary
– In BitVisor's case (to switch between its processes and the guest)
» HCR_EL2
» ELR_EL2, SPSR_EL2, FAR_EL2, ESR_EL2
» SP_EL0, TPIDR_EL0
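The vector table has 16 entries of 0x80 bytes, grouped by exception source in blocks of 0x200. The sketch below computes the address the CPU branches to; the enum and function names are illustrative only. A guest (lower EL, AArch64) IRQ therefore lands at VBAR_EL2 + 0x480.

```c
#include <stdint.h>

/* Exception source groups, 0x200 bytes apart */
enum vec_source {
	VEC_CUR_EL_SP0 = 0,	/* Current EL, using SP_EL0 */
	VEC_CUR_EL_SPX = 1,	/* Current EL, using SP_ELx */
	VEC_LOWER_A64 = 2,	/* Lower EL in AArch64 (e.g. the guest) */
	VEC_LOWER_A32 = 3,	/* Lower EL in AArch32 */
};

/* Exception types within a group, 0x80 bytes apart */
enum vec_type {
	VEC_SYNC = 0,
	VEC_IRQ = 1,
	VEC_FIQ = 2,
	VEC_SERROR = 3,
};

/* Address of the vector entry taken for a given source and type */
uint64_t
vector_entry (uint64_t vbar, enum vec_source src, enum vec_type type)
{
	return vbar + (uint64_t)src * 0x200 + (uint64_t)type * 0x80;
}
```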
How VMM works on Aarch64
◼ Handling an exception
– Interrupt (asynchronous)
• Interrupt controller handler
– Scheduling
– Forwarding to the guest
– Handing over to the appropriate device driver
– Trap (synchronous)
• Read ESR_EL2 for the exception syndrome
• Handle it accordingly
◼ After handling the exception, return to the guest
– Restore the saved context
– Execute eret to return to EL0 or EL1, as selected by
SPSR_EL2
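For the synchronous case, the first dispatch step is usually the Exception Class in ESR_EL2[31:26]. A minimal sketch, assuming only a handful of classes are of interest; the EC values come from the Arm ARM, while the enum and function names are hypothetical.

```c
#include <stdint.h>

#define ESR_EC(esr)	(((esr) >> 26) & 0x3F)	/* Exception Class */

/* A few Exception Class values (Arm ARM) */
#define EC_HVC64	0x16	/* HVC executed in AArch64 */
#define EC_SMC64	0x17	/* SMC in AArch64, trapped via HCR_EL2.TSC */
#define EC_SYSREG	0x18	/* Trapped MSR/MRS/system instruction */
#define EC_IABT_LOW	0x20	/* Instruction abort from a lower EL */
#define EC_DABT_LOW	0x24	/* Data abort from a lower EL (stage-2 fault) */

enum sync_kind { SYNC_HVC, SYNC_SMC, SYNC_SYSREG, SYNC_IABT, SYNC_DABT,
		 SYNC_OTHER };

/* Pick a handler category from the syndrome */
enum sync_kind
classify_sync_exception (uint32_t esr)
{
	switch (ESR_EC (esr)) {
	case EC_HVC64:
		return SYNC_HVC;
	case EC_SMC64:
		return SYNC_SMC;
	case EC_SYSREG:
		return SYNC_SYSREG;
	case EC_IABT_LOW:
		return SYNC_IABT;
	case EC_DABT_LOW:
		return SYNC_DABT;
	default:
		return SYNC_OTHER;
	}
}
```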
Early Aarch64 boot
◼ Relocation
– To run code at any address, we need a table that tells
us where and what to adjust to get the final
addresses
• Usually for global variables
– The linker script has a special section for this table,
named .rela.dyn
…
.rela.dyn : AT (phys + (_rela_start - head)) {
_rela_start = .;
*(.rela)
*(.rela.*)
_rela_end = .;
}
…
Early Aarch64 boot
◼ Relocation
– The section contains an array of the following structure
– BitVisor currently handles only the R_AARCH64_RELATIVE
relocation type
struct rela_entry {
u64 r_offset; /* Location to apply relocation */
u64 r_info; /* Determine operation to perform */
u64 r_addend;
};
Early Aarch64 boot
◼ Relocation
– R_AARCH64_RELATIVE is resolved as Delta(S) + A (addend),
according to the AArch64 ELF ABI document (AAELF64)
• S is the static address of a symbol
• Delta(S) is the difference between the static link
address of S and the execution address of S
– In other words:
• If head_linktime_addr is 0, diff is simply head_runtime_addr
– BitVisor's head_linktime_addr is currently 0
diff = head_runtime_addr - head_linktime_addr;
*(u64 *)(diff + r_offset) = diff + r_addend;
Early Aarch64 boot
◼ Relocation
int SECTION_ENTRY_TEXT
apply_reloc (phys_t base, struct rela_entry *start, struct rela_entry *end)
{
struct rela_entry *entries = start;
u64 *target;
unsigned int i, n_entries = end - start;
for (i = 0; i < n_entries; i++) {
switch (entries[i].r_info) {
case R_AARCH64_NONE:
break; /* Do nothing */
case R_AARCH64_RELATIVE:
/*
* Static head address is 0. That means Delta(S) is
* the runtime address.
*/
target = (u64 *)(base + entries[i].r_offset);
*target = base + entries[i].r_addend;
break;
default:
/* Currently handles only R_AARCH64_RELATIVE */
return -1;
}
}
return 0;
}
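Because apply_reloc() runs in the early entry environment, it is awkward to try out directly. The following is a host-testable variant under the same assumption (link-time head address 0, so Delta(S) equals the runtime base); the struct, macro, and function names are illustrative. Unlike the slide's version, it extracts the relocation type from the low 32 bits of r_info (as ELF64_R_TYPE does) and treats r_addend as signed like Elf64_Rela. Both behave the same for R_AARCH64_RELATIVE, whose symbol index is 0.

```c
#include <stdint.h>

#define R_AARCH64_NONE		0
#define R_AARCH64_RELATIVE	1027
#define RELA_TYPE(info)		((info) & 0xFFFFFFFFULL)	/* Low 32 bits */

struct rela_entry {
	uint64_t r_offset;	/* Offset from head of the word to patch */
	uint64_t r_info;	/* Symbol index (high 32) and type (low 32) */
	int64_t r_addend;	/* Link-time value; signed in Elf64_Rela */
};

/* Patch every RELATIVE entry with base + addend; fail on anything else */
int
apply_reloc_sketch (uint64_t base, const struct rela_entry *start,
		    const struct rela_entry *end)
{
	const struct rela_entry *e;

	for (e = start; e < end; e++) {
		switch (RELA_TYPE (e->r_info)) {
		case R_AARCH64_NONE:
			break;
		case R_AARCH64_RELATIVE:
			*(uint64_t *)(uintptr_t)(base + e->r_offset) =
				base + (uint64_t)e->r_addend;
			break;
		default:
			return -1;	/* Unsupported relocation type */
		}
	}
	return 0;
}
```

Here the "image" is any buffer whose address plays the role of the runtime head; an entry with r_offset 8 and r_addend 24 makes the second 64-bit word point at the fourth.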
Early Aarch64 boot
◼ Cross-compiling UEFI loader
– MinGW currently has no toolchain for Aarch64
– We switched to clang for cross-compiling instead
– Most of the UEFI loader code remains the same
◼ UEFI loader and bitvisor.elf relation
– The UEFI loader looks for bitvisor.elf
– It then loads the first 64KB of bitvisor.elf for
bootstrapping
• Early initialization + loading the rest of BitVisor into memory
• The .entry section of BitVisor must be within the first 64KB
– Once bootstrapping is done, we jump to the newly
allocated BitVisor and start the remaining initialization
Early Aarch64 boot
◼ Entering BitVisor code
– First, save the context at entry
entry:
…
adrp x9, _uefi_entry_ctx
add x9, x9, :lo12:_uefi_entry_ctx
stp x19, x20, [x9], #16
…
stp x29, x30, [x9], #16
…
mov x10, sp
…
mrs x10, TTBR0_EL2
str x10, [x9], #8
mrs x10, VBAR_EL2
str x10, [x9], #8
…
Early Aarch64 boot
◼ Entering BitVisor code
– Apply relocation: correct the addresses listed in the .rela.*
sections
entry:
…
adrp x0, head
add x0, x0, :lo12:head
adrp x1, _rela_start
add x1, x1, :lo12:_rela_start
adrp x2, _rela_end
add x2, x2, :lo12:_rela_end
bl apply_reloc64k
cmp x0, 0
bne .L1 /* Return if apply_reloc64k() fails */
…
Early Aarch64 boot
◼ Entering BitVisor code
– Then, enter uefi_entry()
• Save some UEFI routine addresses
• Load the entire BitVisor into a newly allocated location
• Set up virtual addressing
– Enable HCR_E2H so that TTBR1_EL2 becomes effective
– Set up the TTBR1_EL2 table for the hypervisor memory mapping
– Set up MAIR_EL2, TCR_EL2, and SCTLR_EL2
– 0xFFFF000000000000 is our current virtual base address
• Return the virtual base address to the assembly code so that it
can jump to the new location using virtual addresses
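Setting up MAIR_EL2 amounts to packing eight 8-bit memory attribute encodings; page table entries then refer to them by index. The encodings below are from the Arm ARM, but the index assignment is a hypothetical example and not necessarily the layout BitVisor uses.

```c
#include <stdint.h>

/* MAIR attribute encodings (Arm ARM) */
#define MAIR_ATTR_NORMAL_WB	0xFFu	/* Normal, Inner/Outer WB, R/W-allocate */
#define MAIR_ATTR_DEVICE_nGnRnE	0x00u	/* Strongly ordered device memory */
#define MAIR_ATTR_DEVICE_nGnRE	0x04u	/* Device memory, early write ack */
#define MAIR_ATTR_NORMAL_NC	0x44u	/* Normal, Inner/Outer Non-cacheable */

/* Place an 8-bit attribute at index idx (0-7) */
#define MAIR_ATTR(attr, idx)	((uint64_t)(attr) << ((idx) * 8))

/* Hypothetical MAIR_EL2 layout: index 0..3 as listed above */
uint64_t
build_mair (void)
{
	return MAIR_ATTR (MAIR_ATTR_NORMAL_WB, 0) |
		MAIR_ATTR (MAIR_ATTR_DEVICE_nGnRnE, 1) |
		MAIR_ATTR (MAIR_ATTR_DEVICE_nGnRE, 2) |
		MAIR_ATTR (MAIR_ATTR_NORMAL_NC, 3);
}
```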
Early Aarch64 boot
◼ Entering BitVisor code
– Jump to asm_bitvisor_entry()
– Apply relocation again with the new virtual base address, plus
additional setup for C code entry
/*
* x0 now contains new virtual memory base address.
* Next, calculate the position of asm_bitvisor_entry()
* relative to x0.
*/
adrp x21, head /* Old head */
add x21, x21, :lo12:head
adrp x11, asm_bitvisor_entry
add x11, x11, :lo12:asm_bitvisor_entry
sub x11, x11, x21
add x11, x11, x0
br x11 /* Jump to newly located asm_bitvisor_entry */
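The arithmetic in this sequence keeps the offset of asm_bitvisor_entry from head and only swaps the base (x11 - x21 + x0). A C equivalent, with hypothetical names and example addresses, is below. As with any adrp/add pair, both instructions must name the same symbol, since adrp supplies the page and :lo12: supplies the offset within it.

```c
#include <stdint.h>

/*
 * sym_runtime: current address of the target symbol (x11)
 * old_head:    current address of head (x21)
 * new_base:    virtual base returned by uefi_entry() (x0)
 */
uint64_t
relocated_address (uint64_t sym_runtime, uint64_t old_head, uint64_t new_base)
{
	return sym_runtime - old_head + new_base;	/* Same offset, new base */
}
```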
Early Aarch64 boot
◼ Before calling vmm_main()
– Initialize exception handling
void
bitvisor_entry (void)
{
uefi_booted = true;
/* Save this for secondary core start */
mair_host = mrs (MAIR_EL2);
tcr_host = mrs (TCR_EL2);
sctlr_host = mrs (SCTLR_EL2);
serial_init ();
disable_interrupt ();
init_default_exception_handler ();
init_exception ();
vmm_main ();
}
BitVisor Aarch64 initialization
◼ The initialization flow is roughly the same as in the current
BitVisor
– Mainly done through call_initfunc()
– Some Aarch64-specific initialization must be handled
• MMU/memory mapping/MMIO handling, GIC initialization, etc.
◼ The original code needs some adjustment
– Move platform-specific code into separate files and
create interfaces for platform-independent code to call
• Ex. in the process implementation
– The x86 assembly in call_msgfunc0() is replaced by
process_exec()
– The actual implementation of process_exec() is in either
x86/process.c or aarch64/process.c
BitVisor Aarch64 initialization
◼ Entering guest
void
vm_start (void)
{
u64 orig_tcr, val;
…
/* Setting up EL1 environment */
msr (SP_EL1, _uefi_entry_ctx.sp);
msr (ESR_EL12, _uefi_entry_ctx.esr_el2);
msr (FAR_EL12, _uefi_entry_ctx.far_el2);
msr (MAIR_EL12, _uefi_entry_ctx.mair_el2);
…
msr (TCR_EL12, val);
msr (TPIDR_EL1, _uefi_entry_ctx.tpidr_el2);
msr (TTBR0_EL12, _uefi_entry_ctx.ttbr0_el2);
msr (VBAR_EL12, _uefi_entry_ctx.vbar_el2);
msr (CPACR_EL12, CPACR_ZEN (3) | CPACR_FPEN (3) | CPACR_SMEN (3));
val = (_uefi_entry_ctx.spsr_el2 & ~0xF) | 0x5; /* M[3:0] = EL1h */
msr (SPSR_EL2, val);
msr (ELR_EL2, _uefi_entry_ctx.x30);
msr (CPTR_EL2, CPTR_FLAGS);
msr (HCR_EL2, HCR_FLAGS);
start_guest (&_uefi_entry_ctx);
}
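The SPSR_EL2 tweak in vm_start() keeps the flags and exception masks saved from UEFI and only rewrites M[3:0], so that eret lands at EL1 using SP_EL1 (EL1h). A sketch of the same transformation with the architectural mode encodings; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define SPSR_M_MASK	0xFULL		/* M[3:0]: target EL and SP selection */
#define SPSR_M_EL1h	0x5ULL		/* EL1 using SP_EL1 */
#define SPSR_M_EL2h	0x9ULL		/* EL2 using SP_EL2 */
#define SPSR_DAIF	(0xFULL << 6)	/* D, A, I, F mask bits [9:6] */

/* Keep everything UEFI had in SPSR_EL2 except the mode, forced to EL1h */
uint64_t
spsr_for_guest_el1h (uint64_t uefi_spsr_el2)
{
	return (uefi_spsr_el2 & ~SPSR_M_MASK) | SPSR_M_EL1h;
}
```

For example, an EL2h SPSR with all of DAIF masked (0x3C9) becomes 0x3C5.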
BitVisor Aarch64 initialization
◼ Entering guest
start_guest:
ldp x19, x20, [x0], #16
ldp x21, x22, [x0], #16
ldp x23, x24, [x0], #16
ldp x25, x26, [x0], #16
ldp x27, x28, [x0], #16
ldp x29, x30, [x0], #16
/* Clear all caller-saved registers */
eor x15, x15, x15
eor x14, x14, x14
eor x13, x13, x13
…
mov x0, #1 /* Return 1 (success) to the guest on entry */
dsb ish
isb
eret
/* Prevent speculative execution */
dsb nsh
isb
Interrupt handling
◼ Physical interrupts and virtual interrupts
– A physical interrupt comes from an actual device
• The guest can receive physical interrupts directly if the hypervisor chooses
not to handle interrupts
– A virtual interrupt is one that the hypervisor injects into the
guest
• It is taken at EL1 and cannot be trapped to EL2/EL3
– Interrupt types
• FIQ/vFIQ, high-priority interrupt
• IRQ/vIRQ, low-priority interrupt
• SError/vSError, erroneous memory accesses (Ex. bus error)
– No non-maskable interrupts until Armv8.8-A/Armv9.3-A
• QEMU does not support them yet
• No need to worry about them for now
Interrupt handling
◼ Injecting interrupts
– Via system registers
• We can
– Set HCR_VF in HCR_EL2 to make a vFIQ pending
– Set HCR_VI in HCR_EL2 to make a vIRQ pending
– Set HCR_VSE in HCR_EL2 to make a vSError pending
• The interrupt controller then has to be emulated
– Via the GIC (our focus)
Interrupt handling
◼ Overview
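[Figure: Interrupt flow with IMO=1 and FMO=1. A physical interrupt from the GIC goes to the hypervisor, which saves the context, identifies the interrupt, forwards it (injects a virtual interrupt), and returns to the guest, which then receives the virtual interrupt.]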
Interrupt handling
◼ BitVisor GIC initialization
– Set HCR_FMO, HCR_IMO, and HCR_AMO in HCR_EL2
– Set ICH_HCR_EN in ICH_HCR_EL2
– Configure ICH_VMCR_EL2 to initialize the vGIC state
– Change how we acknowledge an interrupt
• Make writing EOI only drop the priority (ICC_CTLR_EL1.EOImode = 1)
• The guest deactivates the interrupt in its own interrupt handler
◼ Interrupt handling
– Read ICC_IAR0/1_EL1 to get the INTID and acknowledge the
interrupt
– Run scheduling and other tasks
– Write the INTID to ICC_EOIR0/1_EL1 to drop the priority
– Inject the interrupt into the guest
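The INTID returned by ICC_IAR0/1_EL1 tells the handler what kind of interrupt it got. INTIDs 1020 to 1023 are special (1023 means no pending interrupt), so there is nothing to EOI or inject for them. A sketch of that classification using the GICv3 INTID ranges; the names are illustrative, and the 1024 to 8191 band (reserved, or the GICv3.1 extended PPI/SPI ranges) is lumped together for simplicity.

```c
#include <stdint.h>

enum intid_class {
	INTID_SGI,	/* 0-15: software-generated */
	INTID_PPI,	/* 16-31: private peripheral (e.g. timers) */
	INTID_SPI,	/* 32-1019: shared peripheral */
	INTID_SPECIAL,	/* 1020-1023: special, 1023 = spurious */
	INTID_OTHER,	/* 1024-8191: reserved or extended ranges */
	INTID_LPI,	/* 8192 and above: locality-specific peripheral */
};

enum intid_class
classify_intid (uint32_t intid)
{
	if (intid < 16)
		return INTID_SGI;
	if (intid < 32)
		return INTID_PPI;
	if (intid < 1020)
		return INTID_SPI;
	if (intid < 1024)
		return INTID_SPECIAL;
	if (intid < 8192)
		return INTID_OTHER;
	return INTID_LPI;
}

/* Only real (non-special) INTIDs are written back to ICC_EOIR0/1_EL1 */
int
needs_eoi (uint32_t intid)
{
	return classify_intid (intid) != INTID_SPECIAL;
}
```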
Interrupt handling
◼ Injecting interrupts with GIC
– Each core has a set of List Registers (LRs) for injecting virtual
interrupts
• ICH_LR0_EL2 to (at most) ICH_LR15_EL2
– The number implemented is platform specific
– To inject a virtual interrupt, simply write to an empty
ICH_LR<n>_EL2 register
– The guest takes the virtual interrupt once we
return to it
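Each List Register packs the virtual INTID, the physical INTID (used when HW = 1), priority, group, and state into one 64-bit value, and ICH_VTR_EL2.ListRegs reports how many LRs exist, minus one. The field positions below come from the GICv3 architecture specification; the macro names follow those used in try_inject_vint(), but BitVisor's own definitions may differ, and `num_list_regs()` is a hypothetical helper.

```c
#include <stdint.h>

#define ICH_LR_VINTID(x)	((uint64_t)(x) & 0xFFFFFFFFULL)		/* [31:0] */
#define ICH_LR_PINTID(x)	(((uint64_t)(x) & 0x1FFFULL) << 32)	/* [44:32] */
#define ICH_LR_PRIORITY(x)	(((uint64_t)(x) & 0xFFULL) << 48)	/* [55:48] */
#define ICH_LR_GROUP(x)		(((uint64_t)(x) & 0x1ULL) << 60)	/* [60] */
#define ICH_LR_HW		(1ULL << 61)	/* Linked to a physical INTID */
#define ICH_LR_STATE(x)		(((uint64_t)(x) & 0x3ULL) << 62)	/* [63:62] */
#define LR_STATE_PENDING	1ULL

/* Number of implemented List Registers: ICH_VTR_EL2.ListRegs + 1 */
unsigned int
num_list_regs (uint64_t ich_vtr)
{
	return (unsigned int)(ich_vtr & 0x1F) + 1;
}
```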
Interrupt handling
◼ Injecting interrupts with GIC
static void
try_inject_vint (u64 intid, u64 rpr, uint group)
{
…
/* Currently vintid = pintid */
g = !!group;
val = ICH_LR_VINTID (intid) | ICH_LR_PINTID (intid) |
ICH_LR_PRIORITY (rpr) | ICH_LR_GROUP (g) | ICH_LR_HW |
ICH_LR_STATE (LR_STATE_PENDING);
enqueue_lr (currentcpu, val);
elrsr = mrs (ICH_ELRSR_EL2);
for (i = 0; elrsr != 0 && i < currentcpu->max_int_slot; i++) {
empty = !!(elrsr & 0x1);
if (empty) {
if (dequeue_lr (currentcpu, &lr_val))
set_lr (i, lr_val);
else
break;
}
elrsr >>= 1;
}
}
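The ELRSR scan in try_inject_vint() relies on ICH_ELRSR_EL2 having bit n set when ICH_LR<n>_EL2 holds no valid interrupt. Isolated as a pure function, a hypothetical helper that finds the first free LR could look like this:

```c
#include <stdint.h>

/*
 * Index of the lowest empty List Register according to an ICH_ELRSR_EL2
 * value (bit n set = ICH_LR<n>_EL2 is empty), or -1 if none of the first
 * n_lrs registers is free.
 */
int
first_empty_lr (uint64_t elrsr, unsigned int n_lrs)
{
	unsigned int i;

	for (i = 0; i < n_lrs && i < 16; i++)
		if (elrsr & (1ULL << i))
			return (int)i;
	return -1;
}
```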
Interrupt handling
◼ Injecting interrupts with GIC
static void
set_lr (uint lr_idx, u64 val)
{
switch (lr_idx) {
case 0:
msr (ICH_LR0_EL2, val);
break;
case 1:
msr (ICH_LR1_EL2, val);
break;
case 2:
msr (ICH_LR2_EL2, val);
break;
case 3:
msr (ICH_LR3_EL2, val);
break;
…
default:
panic ("lr_idx out of bound");
break;
}
}
MMIO handling
◼ Stage-1 and Stage-2 memory translation
– Stage-1 translates a virtual address (VA) to a physical
address (PA) or an intermediate physical address (IPA)
• For EL1, the IPA is the PA if stage-2 translation is not enabled
– Stage-2 translates the IPA to an actual PA
• Need to set up
– VTTBR_EL2 for stage-2 page tables
– VTCR_EL2 for stage-2 translation control
• In our case, IPA and PA are identity mapped
◼ In general, MMIO handling is done through stage-2
translation faults
– Not limited to MMIO addresses; any PA can be monitored
MMIO handling
◼ Implementation concept
– During initialization, we create an identity mapping for stage-2
address translation
• Not many page tables are needed, since we can use 1GB
block mappings
– mmio_register() specifies the PA and size we want to monitor
• We unmap that range from stage-2 translation
• From the MMU implementation's point of view, we break the
large block mapping down into smaller blocks, leaving a hole at that address
– An exception is taken to EL2 once the guest accesses
a monitored address
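To illustrate the block-splitting idea, the sketch below builds stage-2 block descriptors (4KB granule) and replaces one 1GB identity block with a level-2 table of 512 2MB blocks, leaving the 2MB block that contains the monitored address invalid so guest accesses fault to EL2. Descriptor bit positions are from the Arm ARM; using Normal write-back memory for every block, and all names, are assumptions. A real implementation would split further down to 4KB pages to keep the hole small.

```c
#include <stdint.h>

#define S2_VALID_BLOCK		0x1ULL			/* bits[1:0] = 0b01 */
#define S2_MEMATTR(x)		((uint64_t)(x) << 2)	/* MemAttr, bits [5:2] */
#define S2_AP_RW		(0x3ULL << 6)		/* S2AP = read/write */
#define S2_SH_INNER		(0x3ULL << 8)		/* Inner Shareable */
#define S2_AF			(1ULL << 10)		/* Access flag */
#define S2_MEMATTR_NORMAL_WB	0xF			/* Normal, Inner/Outer WB */
#define SZ_2MB			(1ULL << 21)

/* Identity block descriptor for a (suitably aligned) PA */
uint64_t
s2_block_desc (uint64_t pa)
{
	return pa | S2_AF | S2_SH_INNER | S2_AP_RW |
		S2_MEMATTR (S2_MEMATTR_NORMAL_WB) | S2_VALID_BLOCK;
}

/* Fill a level-2 table covering gb_base, leaving hole_pa's 2MB unmapped */
void
split_1gb_with_hole (uint64_t l2_table[512], uint64_t gb_base, uint64_t hole_pa)
{
	unsigned int i;

	for (i = 0; i < 512; i++) {
		uint64_t pa = gb_base + (uint64_t)i * SZ_2MB;

		l2_table[i] = (hole_pa >= pa && hole_pa < pa + SZ_2MB) ?
			0 : s2_block_desc (pa);	/* 0 = invalid, faults */
	}
}
```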
MMIO handling
◼ Implementation concept
– We need to emulate those accesses
• Get the instruction address from the ELR_EL2 register
• Get the fault address from the FAR_EL2 register
• Decode the instruction to get the source/destination registers
• Gather all the necessary information and pass it to a handler
– Once the access is handled
• Skip the instruction by adding 4 to ELR_EL2
– Every A64 instruction is 4 bytes
• Update guest registers in the saved context if necessary
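When ESR_EL2.ISS.ISV is 1, the syndrome already describes the access (size, register, direction), so fetching and decoding the instruction can be skipped; when ISV is 0, decoding is still required as described on the slide. Note also that for a stage-2 abort FAR_EL2 holds the faulting VA, while HPFAR_EL2 reports the faulting IPA page. A sketch under those assumptions; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct mmio_access {
	uint64_t ipa;		/* Faulting intermediate physical address */
	unsigned int size;	/* Access size in bytes */
	unsigned int reg;	/* Transfer register (31 = XZR/WZR) */
	int is_write;
	int sign_extend;
};

/* Decode a stage-2 data abort from its syndrome; -1 if ISV == 0 */
int
decode_mmio_abort (uint32_t esr, uint64_t far, uint64_t hpfar,
		   struct mmio_access *a)
{
	uint32_t iss = esr & 0x1FFFFFF;		/* ISS, bits [24:0] */

	if (!(iss & (1u << 24)))		/* ISV: syndrome valid? */
		return -1;
	/* HPFAR_EL2.FIPA (bits [43:4]) holds IPA[51:12]; FAR gives the offset */
	a->ipa = ((hpfar & 0xFFFFFFFFFF0ULL) << 8) | (far & 0xFFF);
	a->size = 1u << ((iss >> 22) & 0x3);	/* SAS */
	a->sign_extend = (iss >> 21) & 1;	/* SSE */
	a->reg = (iss >> 16) & 0x1F;		/* SRT */
	a->is_write = (iss >> 6) & 1;		/* WnR */
	return 0;
}

/* A64 instructions are 4 bytes: resume at the next one */
uint64_t
next_pc (uint64_t elr)
{
	return elr + 4;
}
```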
Multiple core support
◼ On platforms that support PSCI, multiple core support
is straightforward
◼ When the guest wants to start a secondary core
– It issues an SMC instruction
– The call follows the SMC Calling Convention (SMCCC)
• smc #0
• x0: Function ID, x1 onward: parameters
◼ BitVisor simply needs to intercept SMC instructions
– Set the HCR_TSC bit in the HCR_EL2 register
– Check for the CPU_ON Function ID
– Save the entry_address and context_id information
• entry_address is a physical address
• context_id appears in x0 on secondary core entry
Multiple core support
◼ BitVisor then issues the SMC on behalf of the guest
– Copy the guest’s CPU_ON command
– Replace entry_address and context_id with our values
◼ Secondary core entry
– Set up the MMU and stack
– Jump to the designated virtual address to continue per-core
initialization
– Finally, we start the guest at its entry_address with its
context_id in x0
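The CPU_ON interception can be sketched as follows. The PSCI function IDs (0x84000003 for SMC32, 0xC4000003 for SMC64) and the argument layout (x1 target MPIDR, x2 entry point, x3 context ID) come from the PSCI specification; the structs, the function, and the `hv_entry`/`hv_ctx` values are hypothetical. One detail worth remembering: an SMC trapped by HCR_EL2.TSC is reported with ELR_EL2 pointing at the SMC itself, so the hypervisor must advance it by 4 before returning to the guest.

```c
#include <stdint.h>

#define PSCI_CPU_ON_32	0x84000003u	/* SMC32 function ID */
#define PSCI_CPU_ON_64	0xC4000003u	/* SMC64 function ID */

/* Guest registers at the trapped SMC: x0 = function ID, x1..x3 = args */
struct smc_regs {
	uint64_t x[4];
};

/* What the guest asked for, kept for starting it later */
struct cpu_on_req {
	uint64_t target_mpidr;	/* x1 */
	uint64_t entry_address;	/* x2: guest's physical entry point */
	uint64_t context_id;	/* x3: delivered in x0 on the new core */
};

/* Save a CPU_ON request and point the new core at the hypervisor instead */
int
intercept_cpu_on (struct smc_regs *r, struct cpu_on_req *saved,
		  uint64_t hv_entry, uint64_t hv_ctx)
{
	uint32_t fid = (uint32_t)r->x[0];

	if (fid != PSCI_CPU_ON_32 && fid != PSCI_CPU_ON_64)
		return 0;		/* Not CPU_ON: leave untouched */
	saved->target_mpidr = r->x[1];
	saved->entry_address = r->x[2];
	saved->context_id = r->x[3];
	r->x[2] = hv_entry;		/* Secondary core starts in the hypervisor */
	r->x[3] = hv_ctx;		/* e.g. per-core data for that entry */
	return 1;
}
```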
Current limitation
◼ No Aarch32 support for now
– For simplicity
◼ No suspend/resume for now
– To be implemented later
– Requires PSCI SMC handling
◼ No EL0 debug shell through hypercall
– The hvc instruction is not available at EL0 (it is UNDEFINED there)
– Need to find an alternative
• Virtual serial?
Current limitation
◼ No 52-bit address support for now
– BitVisor itself does not need 52-bit addresses
– To allow the guest OS to use them, we need either
• 64KB page size
– Quite a waste of memory for our use cases
• Armv8.7 FEAT_LPA2 for 4KB and 16KB page sizes
– 4KB page size needs a 5-level page table
– We have not seen real hardware that supports this yet
– Not the current priority
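The 5-level remark follows from simple arithmetic: a table page holds 2^(page_shift - 3) eight-byte descriptors, so each level resolves page_shift - 3 bits of the address beyond the page offset. A sketch that ignores stage-2 table concatenation; `translation_levels()` is a hypothetical helper.

```c
/* Levels needed to translate va_bits with a 2^page_shift granule */
unsigned int
translation_levels (unsigned int va_bits, unsigned int page_shift)
{
	unsigned int per_level = page_shift - 3;	/* Bits per table level */
	unsigned int bits = va_bits - page_shift;	/* Bits above page offset */

	return (bits + per_level - 1) / per_level;	/* Ceiling division */
}
```

With 4KB pages, 48 bits needs 4 levels and 52 bits needs 5, while 64KB pages cover 52 bits in 3 levels.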
Ongoing tasks
◼ Integrating the Aarch64 implementation with the
mainline
– Finalizing interfaces for the platform-specific implementation
– Cross-compiling implementation
QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– A logic error could delay an MSI indefinitely
◼ target/arm: honor HCR_E2H and HCR_TGE in
arm_excp_unmasked()
– Found when trying to run a process at EL0 with
interrupts masked
• This is valid according to the architecture manual
• It was impossible before this patch
◼ target/arm: Honor HCR_E2H and HCR_TGE in
ats_write64()
– The AT instruction implementation did not honor HCR_E2H and
HCR_TGE
– Found while investigating a strange memory error panic
QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– https://github.com/qemu/qemu/commit/dd0ef128669c29734a197ca9195e7ab64e20ba2c
◼ target/arm: honor HCR_E2H and HCR_TGE in
arm_excp_unmasked()
– https://github.com/qemu/qemu/commit/c939a7c7b93ee44a4963fabe81454e1f956ecd4b
◼ target/arm: Honor HCR_E2H and HCR_TGE in
ats_write64()
– https://github.com/qemu/qemu/commit/638d5dbd78ea81c943959e2f2c65c109e5278a78
DEMO
THANK YOU
42
Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.

BitVisor Summit 11「2. BitVisor on Aarch64」

  • 1.
    BitVisor on Aarch64 2022/12/07@ BitVisor Summit 11 Ake Koomsin
  • 2.
    Agenda ◼ Current requirements ◼How VMM works on Aarch64 ◼ BitVisor Aarch64 initialization ◼ Interrupt handling ◼ MMIO handling ◼ Multiple core support ◼ Current limitation ◼ Ongoing tasks ◼ QEMU bugs we found ◼ Demo 1 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 3.
    Current requirements ◼ Armv8.1or later – Need Virtualization Host Extension (VHE) for process implementation ◼ Generic Interrupt Controller v3 (GICv3) – Guest interrupt injection ◼ EL3 and Power State Coordination Interface (PSCI) – Firmware running in EL3 – For secondary core start-up ◼ UEFI environment and ACPI – BitVisor currently relies on them 2 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 4.
    How VMM workson Aarch64 3 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. Firmware Hypervisor OS0 OS1 P0 P1 P2 P3 EL0 EL1 EL2 EL3 SMC HVC SVC
  • 5.
    How VMM workson Aarch64 4 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. Hypervisor OS0 P0 P1 Host OS/Hypervisor OS0 P0 P1 P2 Standard With VHE EL0 EL1 EL2
  • 6.
    How VMM workson Aarch64 ◼ Main system registers related to virtualization – HCR_EL2 • Enable/Disable hypervisor • Hypervisor behavior • Register trapping – VTTBR_EL2 • Stage-2 translation page table – VTCR_EL2 • Stage-2 translation control – VMPIDR_EL2 • Multiple Processor ID MPIDR_EL1 value read by the guest – VPIDR_EL2 • Processor ID PIDR_EL1 value read by the guest 5 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 7.
    How VMM workson Aarch64 ◼ Page table on Aarch64 basic – Typically, an OS sets up TTBR0_EL1 for a process’s page table and TTBR1_EL1 for kernel page table • Addresses with 0xF… prefix are mapped in TTBR1_EL1 – Normally, we can access only TTBR0_EL2 only on EL2 – With VHE feature, we can make EL2 behavior as same as EL1 • Can access to TTBR1_EL2 • System registers related to translation change their structures – Ex. TCR_EL2 bit definition becomes like TCR_EL1 6 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 8.
    How VMM workson Aarch64 ◼ Guest OS returns to EL2 from time to time through exceptions – Interrupt • If the hypervisor chooses to route interrupts to EL2 – Trapping • Register accesses • Intermediate Physical Address translation fault 7 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 9.
    How VMM workson Aarch64 ◼ When an exception occurs – The entry point is one of locations on the vector table pointed by VBAR_EL2 • Depending on the current running EL/exception type/mode – The first thing to do is saving the current context • General registers x0-x30 • Floating registers if necessary • Other system registers if necessary – In BitVisor case (To switch between our processes and the guest) » HCR_EL2 » ELR_EL2, SPSR_EL2, FAR_EL2, ESR_EL2 » SP_EL0, TPIDR_EL0 8 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 10.
    How VMM workson Aarch64 ◼ Handling an exception – Interrupt (Asynchronous) • Interrupt controller handler – Scheduling – Forwarding to the guest – Hand over to the appropriate device driver – Trapping (Synchronous) • Read ESR_EL2 for exception syndrome • Handle them accordingly ◼ After handling the exception, return to the guest – Restore the entry context – eret instruction to return to either EL0 or EL1 depending on SPSR_EL2 9 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 11.
    Early Aarch64 boot ◼Relocation – To be able to run code at any address, we need a table structure that tell us where and what to adjust to get final addresses • Usually for global variables – In the linker file, we have a special section for this table named rela.dyn 10 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. … .rela.dyn : AT (phys + (_rela_start - head)) { _rela_start = .; *(.rela) *(.rela.*) _rela_end = .; } …
  • 12.
    Early Aarch64 boot ◼Relocation – It contains an array of the following structure – For BitVisor, we only deal with R_AARCH64_RELATIVE operation currently 11 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. struct rela_entry { u64 r_offset; /* Location to apply relocation */ u64 r_info; /* Determine operation to perform */ u64 r_addend; };
  • 13.
    Early Aarch64 boot ◼Relocation – Resolving R_AARCH64_RELATIVE type with Delta(S) + Addend operation according to Aa64elf document • S is the static address of a symbol • Delta(S) means find the difference between the static link address of S and the execution address of S – In other words • If head_linktime_addr is 0, diff is head_runtime_addr – BitVisor head_linktime_addr is currently 0 12 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. diff = head_runtime_addr – head_linktime_addr; *(u64 *)(diff + r_offset) = diff + r_addend;
  • 14.
    Early Aarch64 boot ◼Relocation 13 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. int SECTION_ENTRY_TEXT apply_reloc (phys_t base, struct rela_entry *start, struct rela_entry *end) { struct rela_entry *entries = start; u64 *target; unsigned int i, n_entries = end - start; for (i = 0; i < n_entries; i++) { switch (entries[i].r_info) { case R_AARCH64_NONE: break; /* Do nothing */ case R_AARCH64_RELATIVE: /* * Static head address is 0. That means Delta(S) is * the runtime address. */ target = (u64 *)(base + entries[i].r_offset); *target = base + entries[i].r_addend; break; default: /* Current deal with only R_AARCH64_RELATIVE */ return -1; } } return 0; }
  • 15.
    Early Aarch64 boot ◼Cross-compiling UEFI loader – Mingw currently has no toolchain for Aarch64 – Switch to clang for cross-compiling instead – Most of code for UEFI loader remains the same ◼ UEFI loader and bitvisor.elf relation – UEFI loader looks for bitvisor.elf – It then loads the first 64KB portion bitvisor.elf for bootstrapping • Early initialization + loading the rest of BitVisor into a memory • .entry section of BitVisor must be within the first 64KB – Once bootstrapping is done, we can jump to the newly allocated BitVisor, and start the remaining initialization 14 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 16.
    Early Aarch64 boot ◼Entering BitVisor code – Firstly, save context at entry 16 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. entry: … adrp x9, _uefi_entry_ctx add x9, x9, :lo12:_uefi_entry_ctx stp x19, x20, [x9], #16 … stp x29, x30, [x9], #16 … mov x10, sp … mrs x10, TTBR0_EL2 str x10, [x9], #8 mrs x10, VBAR_EL2 str x10, [x9], #8 …
  • 17.
    Early Aarch64 boot ◼Entering BitVisor code – Apply relocation, need to correct addresses listed in rela.* section 17 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. entry: … adrp x0, head add x0, x0, :lo12:head adrp x1, _rela_start add x1, x1, :lo12:_rela_start adrp x2, _rela_end add x2, x2, :lo12:_rela_end bl apply_reloc64k cmp x0, 0 bne .L1 /* Return if apply_reloc64() fails */ …
  • 18.
    Early Aarch64 boot ◼Entering BitVisor code – Then, enter uefi_entry() • Save some UEFI routine addresses • Load entire BitVisor to a new allocated location • Setup virtual address – Enable HCR_E2H so that TTBR1_EL2 becomes effective – Setup TTBR1_EL2 table for hypervisor memory mapping – Setup MAIR_EL2, TCR_EL2, and SCTLR_EL2 – 0xFFFF000000000000 is our current virtual base address • Return virtual address base to the assembly code so that we can jump to the new location with virtual address 18 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 19.
    Early Aarch64 boot ◼Entering BitVisor code – Jump to asm_bitvisor_entry() – Apply relocation again with the new virtual address base + Additional setup for C code entry 19 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. /* * x0 now contains new virtual memory base address. * Next, calculate the position of asm_bitvisor_entry() * relative x0. */ adrp x21, head /* Old head */ add x21, x21, :lo12:head adrp x11, bitvisor_entry add x11, x11, :lo12:asm_bitvisor_entry sub x11, x11, x21 add x11, x11, x0 br x11 /* Jump to newly located asm_bitvisor_entry */
  • 20.
    Early Aarch64 boot ◼Before calling vmm_main() – Initialize exception handling 20 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. void bitvisor_entry (void) { uefi_booted = true; /* Save this for secondary core start */ mair_host = mrs (MAIR_EL2); tcr_host = mrs (TCR_EL2); sctlr_host = mrs (SCTLR_EL2); serial_init (); disable_interrupt (); init_default_exception_handler (); init_exception (); vmm_main (); }
  • 21.
    BitVisor Aarch64 initialization ◼The initialization flow is roughly as same as current BitVisor – Mainly done through call_initfunc() – There are some Aarch64 specific initialization to take care • MMU/memory mapping/MMIO handling, GIC initialization, etc ◼ Need some adjustment of the original code – Separate platform specific code into separate files and create interfaces for platform independent code to call them • Ex. in the process implementation – x86 assembly in call_msgfunc0() is replaced by process_exec() – The actual implementation of process_exec() is in either x86/process.c or aarch64/process.c 21 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 22.
    BitVisor Aarch64 initialization ◼Entering guest 22 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. void vm_start (void) { u64 orig_tcr, val; … /* Setting up EL1 environment */ msr (SP_EL1, _uefi_entry_ctx.sp); msr (ESR_EL12, _uefi_entry_ctx.esr_el2); msr (FAR_EL12, _uefi_entry_ctx.far_el2); msr (MAIR_EL12, _uefi_entry_ctx.mair_el2); … msr (TCR_EL12, val); msr (TPIDR_EL1, _uefi_entry_ctx.tpidr_el2); msr (TTBR0_EL12, _uefi_entry_ctx.ttbr0_el2); msr (VBAR_EL12, _uefi_entry_ctx.vbar_el2); msr (CPACR_EL12, CPACR_ZEN (3) | CPACR_FPEN (3) | CPACR_SMEN (3)); val = (_uefi_entry_ctx.spsr_el2 & ~0xF) | 0x5; /* E1h */ msr (SPSR_EL2, val); msr (ELR_EL2, _uefi_entry_ctx.x30); msr (CPTR_EL2, CPTR_FLAGS); msr (HCR_EL2, HCR_FLAGS); start_guest (&_uefi_entry_ctx); }
  • 23.
    BitVisor Aarch64 initialization ◼Entering guest 23 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. start_guest: ldp x19, x20, [x0], #16 ldp x21, x22, [x0], #16 ldp x23, x24, [x0], #16 ldp x25, x26, [x0], #16 ldp x27, x28, [x0], #16 ldp x29, x30, [x0], #16 /* Clear all caller-saved register */ eor x15, x15, x15 eor x14, x14, x14 eor x13, x13, x13 … mov x0, #1 /* Return 1 as success upon entry guest */ dsb ish isb eret /* Prevent speculative execution */ dsb nsh isb
  • 24.
    Interrupt handling ◼ Physicalinterrupt and virtual interrupt – The physical one is from an actual device • Guest can receive physical interrupts if the hypervisor chooses not to handle interrupts – The virtual one is the one that the hypervisor injects to the guest • Cannot be trapped to EL2/3 – Interrupt type • FIQ/vFIQ, high priority interrupt • IRQ/vIRQ, low priority interrupt • Serror/vSError, erroneous memory accesses (Ex. Bus error) – No non-maskable interrupt until Armv8.8-A/Armv9.3-A • QEMU still does not support this • No need to worry about this for now 24 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 25.
    Interrupt handling ◼ Injectinginterrupts – Via system registers • We can write – Set HCR_VF in HCR_EL2 to make vFIQ pending – Set HCR_VI in HCR_EL2 to make vIRQ pending – Set HCR_VSE in HCR_EL2 to make vSError pending • Then, need to emulate an interrupt controller – Via GIC (our focus) 25 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 26.
    Interrupt handling ◼ Overview 26 Copyright©2022 IGEL Co., Ltd. All Rights Reserved. GIC Hypervisor - Save context - Identify interrupt - Forward interrupt - Return to the guest Guest Virtual interrupt Inject virtual interrupt IMO=1 FMO=1
  • 27.
    Interrupt handling ◼ BitVisorGIC initialization – Set HCR_FMO, HCR_IMO, and HCR_AMO in HCR_EL2 – Set ICH_HCR_EN in ICH_HCR_EL2 – Configure ICH_VMCR_EL2 to initialize vGIC states – Need to change we acknowledge an interrupt • Make writing EOI be only dropping priority • The guest ends the interrupt on its interrupt handling ◼ Interrupt Handling – Read ICC_IAR0/1_EL1 to get intid and acknowledge the interrupt – Scheduling and do tasks – Write ICC_EOIR0/1_EL1 with intid to drop priority – Inject the interrupt to the guest 27 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 28.
    Interrupt handling ◼ Injectinginterrupts with GIC – Each core has a set of List Register (LR) for injecting virtual interrupts • ICH_LR0 – (max) ICH_LR15 – The max number is platform specific – To inject a virtual interrupt, simply write to one of empty ICH_LR register – The virtual interrupt gets trapped by the guest once we return to the guest 28 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
  • 29.
    Interrupt handling ◼ Injectinginterrupts with GIC 29 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. static void try_inject_vint (u64 intid, u64 rpr, uint group) { … /* Currently vintid = pintid */ g = !!group; val = ICH_LR_VINTID (intid) | ICH_LR_PINTID (intid) | ICH_LR_PRIORITY (rpr) | ICH_LR_GROUP (g) | ICH_LR_HW | ICH_LR_STATE (LR_STATE_PENDING); enqueue_lr (currentcpu, val); elrsr = mrs (ICH_ELRSR_EL2); for (i = 0; elrsr != 0 && i < currentcpu->max_int_slot; i++) { empty = !!(elrsr & 0x1); if (empty) { if (dequeue_lr (currentcpu, &lr_val)) set_lr (i, lr_val); else break; } elrsr >>= 1; } }
  • 30.
    Interrupt handling ◼ Injectinginterrupts with GIC 30 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved. static void set_lr (uint lr_idx, u64 val) { switch (lr_idx) { case 0: msr (ICH_LR0_EL2, val); break; case 1: msr (ICH_LR1_EL2, val); break; case 2: msr (ICH_LR2_EL2, val); break; case 3: msr (ICH_LR3_EL2, val); break; … default: panic ("lr_idx out of bound"); break; } }
  • 31.
    MMIO handling ◼ Stage-1and Stage-2 memory translation – Stage-1 is for translating a virtual address (VA) to a physical address (PA) or an intermediate physical address (IPA) • For EL1, IPA is PA if stage-2 translation is not enabled – Stage-2 is for translating the IPA to an actual PA • Need to set up – VTTBR_EL2 for stage-2 page tables – VTCR_EL2 for stage-2 translation control • In our case, IPA and PA are identity mapped ◼ In general, MMIO handling is be done through stage- 2 translation fault – Not limited to MMIO address but any PA 31 Copyright© 2022 IGEL Co., Ltd. All Rights Reserved.
MMIO handling
◼ Implementation concept
– During initialization, we create an identity mapping for stage-2 address translation
• Only a few page tables are needed because we can use 1GB block mappings
– mmio_register() provides the PA and size we want to monitor
• We unmap that range from the stage-2 translation
• From the MMU implementation point of view, we break the large block mapping down into smaller blocks, leaving a hole at the monitored address range
– A stage-2 fault is taken to EL2 once the guest accesses a monitored address
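The block-splitting step can be modeled on plain arrays. The sketch below replaces a 1GB block with 512 2MB blocks, replaces one 2MB block with 512 4KB pages, and then invalidates the pages covering the monitored range. Several simplifications apply. The attribute bits are reduced to the Access Flag, and real stage-2 descriptors also need S2AP and MemAttr. Linking each new table into its parent with a table descriptor is not shown. Break-before-make and TLB invalidation are also omitted. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_BLOCK 0x1ULL		/* bits[1:0] = 01: block (level 1 or 2) */
#define DESC_PAGE  0x3ULL		/* bits[1:0] = 11: page (level 3) */
#define DESC_AF    (1ULL << 10)	/* Access Flag */
#define S2_ATTRS   DESC_AF		/* simplified: no S2AP/MemAttr bits */
#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL

/* Identity-map the 1GB region at base_1g with 512 2MB blocks */
static void
split_1g (uint64_t l2[512], uint64_t base_1g)
{
	for (int i = 0; i < 512; i++)
		l2[i] = (base_1g + i * SZ_2M) | S2_ATTRS | DESC_BLOCK;
}

/* Identity-map the 2MB region at base_2m with 512 4KB pages */
static void
split_2m (uint64_t l3[512], uint64_t base_2m)
{
	for (int i = 0; i < 512; i++)
		l3[i] = (base_2m + i * SZ_4K) | S2_ATTRS | DESC_PAGE;
}

/* Invalidate pages covering [pa, pa + size) so guest accesses fault.
 * Assumes the range stays within the 2MB region at base_2m. */
static void
punch_hole (uint64_t l3[512], uint64_t base_2m, uint64_t pa, uint64_t size)
{
	assert (size > 0 && pa >= base_2m && pa + size <= base_2m + SZ_2M);
	uint64_t first = (pa - base_2m) / SZ_4K;
	uint64_t last = (pa + size - 1 - base_2m) / SZ_4K;

	for (uint64_t i = first; i <= last; i++)
		l3[i] = 0;	/* bit 0 clear: invalid descriptor */
}
```

Only the 2MB region containing the monitored range needs a level-3 table. The rest of memory keeps large mappings, which is why so few tables are needed overall.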
MMIO handling
◼ Implementation concept
– We need to emulate those accesses
• Get the instruction address from the ELR_EL2 register
• Get the fault address from the FAR_EL2 register (FAR_EL2 holds the guest VA; HPFAR_EL2 reports the faulting IPA page)
• Decode the instruction to get the source/destination registers
• Combine all the necessary information and pass it to a handler
– Once we finish handling the access
• Skip the instruction by adding 4 to ELR_EL2
– An A64 instruction is 4 bytes long
• Update guest registers in the saved context if necessary
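The decode-and-complete steps can be sketched for one common instruction form: LDR/STR (immediate, unsigned offset) on general-purpose registers. A real emulator must handle many more encodings, including pre/post-index, register offset, sign-extending loads, and pairs. The struct names, decode_ldst_uimm(), and finish_access() are hypothetical, and the saved-context layout is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded access */
struct mmio_access {
	bool is_write;
	unsigned size;	/* bytes */
	unsigned rt;	/* data register (31 = zero register) */
	unsigned rn;	/* base register */
};

/* Hypothetical saved guest context */
struct guest_ctx {
	uint64_t x[31];
	uint64_t elr;
};

/* Encoding: size[31:30] 111001[29:24] opc[23:22] imm12[21:10] Rn[9:5] Rt[4:0]
 * Only opc 00 (STR) and 01 (LDR, zero-extending) are handled. */
static bool
decode_ldst_uimm (uint32_t insn, struct mmio_access *a)
{
	unsigned opc;

	if (((insn >> 24) & 0x3F) != 0x39)
		return false;	/* not this instruction class */
	opc = (insn >> 22) & 0x3;
	if (opc > 1)
		return false;	/* sign-extending load or prefetch: not handled */
	a->is_write = (opc == 0);
	a->size = 1u << (insn >> 30);
	a->rn = (insn >> 5) & 0x1F;
	a->rt = insn & 0x1F;
	return true;
}

/* After the handler runs: write back load data and skip the instruction */
static void
finish_access (struct guest_ctx *g, const struct mmio_access *a,
	       uint64_t load_val)
{
	assert (a->size == 1 || a->size == 2 || a->size == 4 || a->size == 8);
	if (!a->is_write) {
		if (a->size < 8)
			load_val &= (1ULL << (a->size * 8)) - 1; /* zero-extend */
		if (a->rt != 31)	/* Rt == 31 discards the result */
			g->x[a->rt] = load_val;
	}
	g->elr += 4;	/* A64 instructions are 4 bytes */
}
```

For example, 0xB9000001 (str w1, [x0]) decodes as a 4-byte write from x1, and 0xF9400462 (ldr x2, [x3, #8]) as an 8-byte read into x2 with base x3.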
Multiple core support
◼ On platforms that support PSCI, multiple core support is straightforward
◼ When the guest wants to start a secondary core
– It issues an SMC instruction
– The call follows the SMC Calling Convention (SMCCC)
• smc #0
• x0: Function ID, x1 onward: parameters
◼ BitVisor simply needs to intercept SMC instructions
– Set the HCR_TSC bit in the HCR_EL2 register
– Check for the CPU_ON Function ID
– Save the entry_address and context_id information
• entry_address is a physical address
• context_id appears in x0 on secondary core entry
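The interception check can be sketched as follows. The CPU_ON function IDs (0x84000003 for SMC32, 0xC4000003 for SMC64) and the argument registers (x1 = target MPIDR, x2 = entry point, x3 = context ID) come from the PSCI specification, and TSC is bit 19 of HCR_EL2. The struct cpu_on_req record and intercept_cpu_on() are hypothetical, and the assumption is that the trap handler passes in the guest's saved x0 to x3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HCR_TSC        (1ULL << 19)	/* HCR_EL2.TSC: trap guest SMC to EL2 */
#define PSCI_CPU_ON_32 0x84000003u	/* CPU_ON, SMC32 convention */
#define PSCI_CPU_ON_64 0xC4000003u	/* CPU_ON, SMC64 convention */

/* Hypothetical record of what the guest asked for */
struct cpu_on_req {
	uint64_t target_mpidr;		/* x1: MPIDR of the core to start */
	uint64_t entry_address;		/* x2: physical entry point */
	uint64_t context_id;		/* x3: delivered in x0 on the new core */
	bool valid;
};

/* Returns true and saves the request if the trapped SMC is CPU_ON */
static bool
intercept_cpu_on (const uint64_t x[4], struct cpu_on_req *req)
{
	uint32_t fid;
	uint64_t mask;

	assert (x != NULL && req != NULL);
	fid = (uint32_t)x[0];		/* the Function ID is passed in W0 */
	if (fid != PSCI_CPU_ON_32 && fid != PSCI_CPU_ON_64)
		return false;		/* not CPU_ON */
	/* SMC32 calls use only the low 32 bits of each argument */
	mask = (fid == PSCI_CPU_ON_32) ? 0xFFFFFFFFULL : ~0ULL;
	req->target_mpidr = x[1] & mask;
	req->entry_address = x[2] & mask;
	req->context_id = x[3] & mask;
	req->valid = true;
	return true;
}
```

Calls that are not CPU_ON return false here. What a real handler does with them, such as forwarding them to EL3 unchanged, is a design choice the slides do not specify.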
Multiple core support
◼ BitVisor then issues an SMC on behalf of the guest
– Copy the guest's CPU_ON command
– Replace entry_address and context_id with our values
◼ Secondary core entry
– Set up the MMU and stack
– Jump to the designated virtual address to continue per-core initialization
– Finally, we start the guest at its entry_address with its context_id in x0
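The argument rewrite before forwarding CPU_ON can be sketched as follows. The sketch only builds the register values; issuing the actual SMC requires EL2 and is left out. struct smc_args, make_forwarded_cpu_on(), vmm_entry_pa, and vmm_context are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for x0..x3 of the SMC BitVisor will issue */
struct smc_args {
	uint64_t x[4];
};

/* Keep the guest's Function ID and target CPU, but make the new core
 * start in BitVisor instead of directly in the guest. */
static struct smc_args
make_forwarded_cpu_on (const uint64_t guest_x[4], uint64_t vmm_entry_pa,
		       uint64_t vmm_context)
{
	struct smc_args a;

	assert (vmm_entry_pa % 4 == 0);	/* entry must be word-aligned */
	memcpy (a.x, guest_x, sizeof a.x);	/* copy the guest's CPU_ON */
	a.x[2] = vmm_entry_pa;		/* replace entry_address */
	a.x[3] = vmm_context;		/* replace context_id */
	return a;
}
```

The guest's original entry_address and context_id must be saved separately. After per-core setup, BitVisor would use them to enter the guest with context_id in x0.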
Current limitation
◼ No Aarch32 for now
– For simplicity
◼ No Suspend/Resume for now
– Going to implement later
– Needs PSCI SMC handling
◼ No EL0 debug shell through hypercall
– The hvc instruction is not available at EL0
– Need to find an alternative
• Virtual serial?
Current limitation
◼ No 52-bit address support for now
– Requires either a 64KB page size or Armv8.7
– BitVisor itself does not need 52-bit addresses
– To allow a guest OS to use them, we need either
• 64KB page size
– Quite a waste of memory for our use cases
• Armv8.7 FEAT_LPA2 for 4KB and 16KB page sizes
– A 4KB page size needs a 5-level page table
– We have not seen real hardware that supports this yet
– Not the current priority
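The 5-level figure follows from simple arithmetic. A table occupies one granule and holds 8-byte descriptors, so each level resolves (page_shift - 3) bits. The level count is therefore the ceiling of (address bits - page_shift) / (page_shift - 3). The sketch below assumes a plain walk without stage-2 initial-level table concatenation, which can sometimes save a level. levels_needed() is a hypothetical helper.

```c
#include <assert.h>

/* Translation levels needed for va_bits with a 2^page_shift granule */
static unsigned
levels_needed (unsigned va_bits, unsigned page_shift)
{
	unsigned bits_per_level = page_shift - 3;	/* 8-byte descriptors */

	assert (va_bits > page_shift);
	/* ceil ((va_bits - page_shift) / bits_per_level) */
	return (va_bits - page_shift + bits_per_level - 1) / bits_per_level;
}
```

With 4KB pages, 48 bits need 4 levels but 52 bits need 5. With 16KB pages, 52 bits fit in 4 levels, and with 64KB pages in 3.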
Ongoing tasks
◼ Integrating the Aarch64 implementation into the mainline
– Finalizing interfaces for platform-specific implementations
– Supporting cross-compilation
QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– A logic error could delay an MSI indefinitely
◼ target/arm: honor HCR_E2H and HCR_TGE in arm_excp_unmasked()
– Found this problem when trying to run a process in EL0 with interrupts masked
• This is valid according to the architecture manual
• It was impossible before this patch
◼ target/arm: Honor HCR_E2H and HCR_TGE in ats_write64()
– The AT instruction implementation did not honor HCR_E2H and HCR_TGE
– Found this while investigating an unexpected memory error panic
QEMU patches
◼ e1000e: Fix possible interrupt loss when using MSI
– https://github.com/qemu/qemu/commit/dd0ef128669c29734a197ca9195e7ab64e20ba2c
◼ target/arm: honor HCR_E2H and HCR_TGE in arm_excp_unmasked()
– https://github.com/qemu/qemu/commit/c939a7c7b93ee44a4963fabe81454e1f956ecd4b
◼ target/arm: Honor HCR_E2H and HCR_TGE in ats_write64()
– https://github.com/qemu/qemu/commit/638d5dbd78ea81c943959e2f2c65c109e5278a78
DEMO
THANK YOU