This deck covers how the Linux kernel initializes its page tables.
Note: when you view the slide deck in a web browser, the screenshots may be blurred. You can download the deck and view it offline (the screenshots are clear).
Decompressed vmlinux: linux kernel initialization from page table configuration perspective
1. Decompressed vmlinux: Linux Kernel Initialization
from Page Table Configuration Perspective
Adrian Huang | June, 2021
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr
* Legacy BIOS
2. Agenda
• Recap – CPU booting flow and page table before entering decompressed vmlinux
• 64-bit Virtual Address
• Decompressed vmlinux: Important functions
• Entry point: startup_64()
• x86_64_start_kernel() -> start_kernel() -> setup_arch()
• Apart from focusing on page table configuration, the following are covered as well:
• Fixed-mapped addresses
• Early ioremap: based on fixed-mapped addresses
• Physical memory models
• Especially for sparse memory
• vsyscall - virtual system call (Built on top of fixed-mapped addresses)
• percpu variable
• PTI (Page Table Isolation)
• kernel thread fork & context switch: struct pt_regs and struct inactive_task_frame in the kernel stack
• How to boot secondary CPUs? Where is the entry address?
3. Recap – CPU booting flow before entering decompressed vmlinux
bzImage layout:
• setup.bin (arch/x86/boot/setup.bin)
• Compressed vmlinux (protected-mode kernel)
  • ELF: arch/x86/boot/compressed/vmlinux
  • Binary: arch/x86/boot/vmlinux.bin
• CRC
Note: the CPU is already in Long Mode when it jumps to the decompressed vmlinux.
5. 64-bit Virtual Address
User space: 0x0000_0000_0000_0000 – 0x0000_7FFF_FFFF_FFFF (128TB)
Empty space (non-canonical hole): 0x0000_8000_0000_0000 – 0xFFFF_7FFF_FFFF_FFFF
Kernel space: starts at 0xFFFF_8000_0000_0000
• Guard hole (8TB): 0xFFFF_8000_0000_0000 – 0xFFFF_87FF_FFFF_FFFF
• LDT remap for PTI (0.5TB)
• page_offset_base = 0xFFFF_8880_0000_0000: page frame direct mapping of physical memory (64TB) – ZONE_DMA (0–16MB), ZONE_DMA32, ZONE_NORMAL
• Unused hole (0.5TB)
• vmalloc_base = 0xFFFF_C900_0000_0000: vmalloc/ioremap space (32TB)
• Unused hole (1TB)
• vmemmap_base = 0xFFFF_EA00_0000_0000: virtual memory map (1TB) – an array of page frame descriptors (*page), one per page frame
• __START_KERNEL_map = 0xFFFF_FFFF_8000_0000: kernel text mapping from physical address 0 (1GB or 512MB)
  • __START_KERNEL = 0xFFFF_FFFF_8100_0000: kernel code [.text, .data…]
• MODULES_VADDR: modules (1GB or 1.5GB)
• FIXADDR_START – FIXADDR_TOP (= 0xFFFF_FFFF_FF7F_F000): fix-mapped address space (expanded to 4MB: commit 05ab1d8a4b36)
• Unused hole (2MB): 0xFFFF_FFFF_FFE0_0000 – 0xFFFF_FFFF_FFFF_FFFF
* Default configuration shown; page_offset_base, vmalloc_base and vmemmap_base can be dynamically configured by KASLR (Kernel Address Space Layout Randomization – arch/x86/mm/kaslr.c)
Reference: Documentation/x86/x86_64/mm.rst
6. Decompressed vmlinux – entry point: startup_64
1. The entry point is still at physical address 0x1000000 (16MB) – execution does not start from kernel virtual addresses
2. Kernel virtual addresses are executed only after the corresponding page tables are fully set up
9. Decompressed vmlinux – entry point: startup_64
Change to the kernel virtual address by issuing ‘jmp’ instruction
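The switch itself is a single indirect jump: RIP is loaded with a kernel virtual address once the page tables map it. A rough sketch of the pattern used in arch/x86/kernel/head_64.S (kernel 5.11; annotations omitted):

```asm
	/* head_64.S: ensure we are executing from kernel virtual addresses */
	movq	$1f, %rax	/* $1f is a kernel virtual address */
	jmp	*%rax		/* the indirect jump loads RIP with it */
1:
	/* from here on, RIP holds kernel virtual addresses */
```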
10. Decompressed vmlinux – entry point: startup_64
1. Use original per_cpu copy of ‘init_per_cpu__gdt_page’ temporarily
2. Switch CPU’s own per_cpu ‘gdt_page’ when calling switch_to_new_gdt()
11. Decompressed vmlinux – entry point: startup_64
1. Use original per_cpu copy of ‘init_per_cpu__gdt_page’ temporarily
2. Switch CPU’s own per_cpu ‘gdt_page’ when calling switch_to_new_gdt()
When to switch to CPU’s own gdt_page (percpu)?
16. Decompressed vmlinux – early_idt_handler_common
Kernel stack layout (struct pt_regs), from the top of the stack downward:
• ss, sp, flags, cs, ip – return frame for iretq
• orig_ax – syscall number, error code of a CPU exception, or IRQ number of a HW interrupt
• di, si, dx, cx, ax, r8–r11 – caller-saved registers
• bx, bp, r12–r15 – callee-saved registers (check the x86_64 ABI)
22. setup_arch() – Part 1
memblock: boot-time memory management
• Handles memory allocation during the boot stage
• Set up in setup_arch()
• Torn down in mem_init(): releases free pages to the buddy allocator
[memblock] Reserve page 0
• Security: mitigates the L1TF (L1 Terminal Fault) vulnerability
32. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
33. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
34. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
split_mem_range: split a memory range into sub-ranges that are properly aligned and sized for 4K, 2M or 1G pages.
35. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
45. vsyscall (Virtual System Call) – Issue Statement
• For some system calls (gettimeofday, time, getcpu), the context switch overhead (user <-> kernel) is greater than the execution time of the function itself.
• Quote from the Linux Programmer's Manual – vdso(7):
  • "Making system calls can be slow. In x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell the kernel you wish to make a system call. However, this instruction is expensive: it goes through the full interrupt-handling paths in the processor's microcode as well as in the kernel. Newer processors have faster (but backward incompatible) instructions to initiate system calls."
• Built on top of the fixed-mapped address
46. vsyscall – Implementation (Emulate)
[PTE] Bit 63: Execute Disable (XD)
• If IA32_EFER.NXE = 1 and XD = 1, instruction fetches are not allowed from the page referenced by this PTE; such a fetch generates a #PF exception.
55. vmlinux – start_kernel() – Part 2
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy in setup_per_cpu_areas()
58. percpu variable access option #1: __per_cpu_offset
APIs (include/linux/percpu-defs.h):
* per_cpu_ptr(ptr, cpu): via __per_cpu_offset
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source address ‘__per_cpu_load’ in setup_per_cpu_areas()
__per_cpu_offset[0]
__per_cpu_offset[1]
__per_cpu_offset[2]
__per_cpu_offset[3]
59. percpu variable access option #1: __per_cpu_offset
*(.data..percpu..shared_aligned)
*(.data..percpu)
*(.data..percpu..read_mostly)
*(.data..percpu..page_aligned)
*(.data..percpu..first)
.data..percpu
__per_cpu_load
(kernel virtual address)
__per_cpu_end
__per_cpu_start = 0
[Example]
gdt_page = 0xb000
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source
address ‘__per_cpu_load’
in setup_per_cpu_areas()
__per_cpu_offset[0]
__per_cpu_offset[1]
__per_cpu_offset[2]
__per_cpu_offset[3]
60. percpu variable access option #2: gs register (MSR: IA32_GS_BASE)
APIs (include/linux/percpu-defs.h):
* this_cpu_read(pcp)
* this_cpu_write(pcp, val)
* this_cpu_add(pcp, val)
* this_cpu_ptr(ptr) & raw_cpu_ptr(ptr)
1. Preferred: use the gs register (per-CPU base in MSR IA32_GS_BASE)
2. If option #1 is not supported, fall back to the this_cpu_off percpu variable (read-mostly)
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source address ‘__per_cpu_load’ in setup_per_cpu_areas()
CPU #0: IA32_GS_BASE
CPU #1: IA32_GS_BASE
CPU #2: IA32_GS_BASE
CPU #3: IA32_GS_BASE
61. gs register (MSR: IA32_GS_BASE) vs __per_cpu_offset
[gs register]
DEFINE_PER_CPU(int, x);
int z;
z = this_cpu_read(x);   /* converts to a single instruction: mov %gs:x,%edx */
this_cpu_inc(x);        /* converts to a single instruction: inc %gs:x */
Atomic: no need to disable preemption or interrupts.

[__per_cpu_offset] this_cpu_inc() implementation via __per_cpu_offset:
int *y;
int cpu;
cpu = get_cpu();
y = per_cpu_ptr(&x, cpu);
(*y)++;
put_cpu();
Non-atomic: need to disable preemption.
63. vmlinux – start_kernel() – Part 2 – trap_init()
CPU Entry Area (percpu)
• Page Table Isolation (PTI)
o Mitigate Meltdown
o Isolate user space and kernel space memory
o When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full "kernel" copy.
▪ The entry/exit functions and the IDT (Interrupt Descriptor Table) must also be mapped in the userspace page table.
[Figure: PTI concept] Without PTI, a single page table maps both user space and kernel space and is used in both user mode and kernel mode. With PTI, kernel mode uses a page table mapping kernel space and user space, while user mode uses a page table mapping only user space plus a minimal kernel portion.
[Figure: PTI high-level implementation] In user mode, the user page table is active: it maps user space plus the percpu TSS and the entry code. On a syscall, the entry code switches to the kernel page table; the kernel then runs in kernel mode on the full kernel page table, and switches back on return to user mode.
74. vmlinux – start_kernel() – Part 4
CommitLimit: the total amount of memory currently available to be allocated on the system (see /proc/meminfo).
Committed_AS: the amount of memory already requested (committed) by processes.
Overcommit: Committed_AS > CommitLimit
85. Context Switch – init_task is rescheduled
[Prev task] When the previous task is re-scheduled, execution returns to the instruction following the switch_to() call.
Backtrace when init_task (pid = 0) is rescheduled because the kernel_init thread (pid = 1) is scheduled out.
86. Kernel Thread Context Switch
[Figure] init_task (pid = 0), the init process (pid = 1) and kthreadd (pid = 2) each have a task_struct with mm = NULL. The task picked by the scheduler has active_mm pointing to init_mm; the others have active_mm = NULL. init_mm holds the mmap list of VMAs and a pgd pointing to swapper_pg_dir (= init_top_pgt).
90. Kernel Thread Context Switch
[Figure] The scheduler switches from pid = 1 to pid = 2. All three tasks have mm = NULL; the running task’s active_mm points to init_mm, whose pgd is swapper_pg_dir (= init_top_pgt).
1. A kernel thread does not have its own ‘mm’.
2. The next task’s active_mm inherits the previous task’s active_mm (both use the same page table).
91. Context Switch: Kernel Thread <-> User Space Task
[Figure] init_task (pid = 0): mm = NULL, active_mm -> init_mm, cpu = 2. The `sleep` program (pid = 40): mm and active_mm both point to its own mm_struct (mmap list of VMAs, pgd). Two gdb breakpoints are configured (breakpoint #1 and breakpoint #2); the gdb breakpoint configuration is shown on the slide.
92. Context Switch: Kernel Thread <-> User Space Task
[Figure] The `sleep` userspace task is selected to run: its mm and active_mm point to its own mm_struct, and init_task’s active_mm is reset to NULL.
93. Context Switch: Kernel Thread <-> User Space Task
[Figure] The scheduler switches from pid = 0 to pid = 40: the `sleep` userspace task runs on cpu = 2 with its own mm/active_mm and page table (pgd).
94. Context Switch: Kernel Thread <-> User Space Task
[Figure] The `sleep` userspace task (pid = 40, cpu = 2) is scheduled out; its mm and active_mm still point to its own mm_struct.
95. Context Switch: Kernel Thread <-> User Space Task
[Figure] The scheduler switches from pid = 40 (`sleep`, scheduled out) to pid = 20 (ksoftirqd/2, a kernel thread: mm = NULL, cpu = 2).
[Kernel Thread] ksoftirqd/2 inherits the active_mm of the previous task. There is no need to flush the TLB because cr3 is not changed.