Decompressed vmlinux: Linux Kernel Initialization
from Page Table Configuration Perspective
Adrian Huang | June, 2021
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr
* Legacy BIOS
Agenda
• Recap – CPU booting flow and page table before entering decompressed vmlinux
• 64-bit Virtual Address
• Decompressed vmlinux: Important functions
• Entry point: startup_64()
• x86_64_start_kernel() -> start_kernel() -> setup_arch()
• Apart from focusing on page table configuration, the following are covered as well:
• Fixed-mapped addresses
• Early ioremap: based on fixed-mapped addresses
• Physical memory models
• Especially for sparse memory
• vsyscall - virtual system call (Built on top of fixed-mapped addresses)
• percpu variable
• PTI (Page Table Isolation)
• kernel thread fork & context switch: struct pt_regs and struct inactive_task_frame in kernel stack
• How to boot secondary CPUs? Where is the entry address?
Recap – CPU booting flow before entering decompressed vmlinux
setup.bin
(arch/x86/boot/setup.bin)
Compressed vmlinux
(Protected-mode kernel)
Note
ELF: arch/x86/boot/compressed/vmlinux
Binary: arch/x86/boot/vmlinux.bin
CRC
bzImage
Long Mode:
Recap - Compressed vmlinux: Page table before entering decompressed vmlinux
[Figure: 4-level paging with 2MB pages. The linear address is split into sign extension (bits 63-48), Page Map Level-4 offset (bits 47-39, 9 bits), Page Directory Pointer offset (bits 38-30, 9 bits), Page Directory offset (bits 29-21, 9 bits) and physical page offset (bits 20-0). CR3 points to the Page Map Level-4 Table. PML4E #0 points to the Page Directory Pointer Table, whose PDPTE #0 to PDPTE #3 point to Page Directory Tables holding PDE #0-#511, PDE #512-#1023, PDE #1024-#1535 and PDE #1536-#2047. Each PDE maps a 2MB physical page.]
[Paging] Identity mapping for 0-4GB memory space
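The index split used by this identity mapping can be sketched in a few lines of Python. This is an illustrative user-space model, not kernel code: the function names are made up for the example, and it assumes 2MB pages with 9 index bits per table level, as in the figure.

```python
PAGE_SHIFT_2M = 21   # 2MB pages: bits 20-0 are the offset inside the page
INDEX_BITS = 9       # each table level is indexed by 9 bits


def split_linear_address(addr):
    """Return (pml4_index, pdpt_index, pd_index, offset) for a 2MB-page mapping."""
    mask = (1 << INDEX_BITS) - 1
    offset = addr & ((1 << PAGE_SHIFT_2M) - 1)
    pd_index = (addr >> 21) & mask
    pdpt_index = (addr >> 30) & mask
    pml4_index = (addr >> 39) & mask
    return pml4_index, pdpt_index, pd_index, offset


def identity_map_0_to_4gb():
    """Model the 0-4GB identity mapping: one PDE per 2MB physical page."""
    mapping = {}
    for phys in range(0, 4 << 30, 2 << 20):
        pml4, pdpt, pd, _ = split_linear_address(phys)
        mapping[(pml4, pdpt, pd)] = phys   # virtual address == physical address
    return mapping


if __name__ == "__main__":
    # The decompressed kernel's entry point at 16MB
    print(split_linear_address(0x1000000))
```

The model yields 2048 PDEs, all under PML4E #0 and spread over PDPTE #0 to #3, which matches the figure's global PDE numbering (#0 to #2047).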
64-bit Virtual Address
User Space (128TB): 0 to 0x0000_7FFF_FFFF_FFFF
Empty Space: between 0x0000_7FFF_FFFF_FFFF and 0xFFFF_8000_0000_0000
Kernel Space (128TB): 0xFFFF_8000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF, from low to high addresses:
• Guard hole (8TB)
• LDT remap for PTI (0.5TB)
• Page frame direct mapping (64TB), starting at page_offset_base: maps physical memory 0-64TB (ZONE_DMA below 16MB, then ZONE_DMA32, then ZONE_NORMAL)
• Unused hole (0.5TB)
• vmalloc/ioremap (32TB), starting at vmalloc_base
• Unused hole (1TB)
• Virtual memory map (1TB), starting at vmemmap_base: stores the page frame descriptors (*page)
• …
• Kernel text mapping from physical address 0 (1GB or 512MB): __START_KERNEL_map = 0xFFFF_FFFF_8000_0000; kernel code [.text, .data…] starts at __START_KERNEL = 0xFFFF_FFFF_8100_0000
• Modules (1GB or 1.5GB), starting at MODULES_VADDR
• Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36): FIXADDR_START to FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000
• Unused hole (2MB): 0xFFFF_FFFF_FFE0_0000 to 0xFFFF_FFFF_FFFF_FFFF
Default Configuration
• page_offset_base = 0xFFFF_8880_0000_0000
• vmalloc_base = 0xFFFF_C900_0000_0000
• vmemmap_base = 0xFFFF_EA00_0000_0000
* Can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c")
Reference: Documentation/x86/x86_64/mm.rst
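The three base addresses above turn address translation into plain additions. A minimal Python sketch, assuming the default (nokaslr) bases from this slide; the helper names mirror kernel concepts but are defined here only for illustration, and the 64-byte struct page size is an assumption about a typical x86_64 build.

```python
PAGE_OFFSET_BASE = 0xFFFF_8880_0000_0000   # start of the direct mapping
VMEMMAP_BASE = 0xFFFF_EA00_0000_0000       # start of the struct page array
START_KERNEL_MAP = 0xFFFF_FFFF_8000_0000   # kernel text mapping of physical 0
STRUCT_PAGE_SIZE = 64                      # assumption: typical sizeof(struct page)


def phys_to_virt(phys):
    """Direct-mapping virtual address of a physical address."""
    return PAGE_OFFSET_BASE + phys


def virt_to_phys(virt):
    """Inverse of phys_to_virt for direct-mapping addresses."""
    return virt - PAGE_OFFSET_BASE


def pfn_to_page(pfn):
    """Address of the page frame descriptor of a page frame number."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


def kernel_text_virt(phys):
    """Kernel text mapping address of a physical address inside the image."""
    return START_KERNEL_MAP + phys


if __name__ == "__main__":
    # Physical 16MB (the kernel load address) seen through both mappings
    print(hex(phys_to_virt(0x1000000)), hex(kernel_text_virt(0x1000000)))
```

Physical address 0x1000000 therefore has two kernel virtual aliases: one in the direct mapping and one at __START_KERNEL in the kernel text mapping.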
Decompressed vmlinux – entry point: startup_64
1. The entry point is still at physical address 0x1000000 (16MB), not at a kernel virtual address
2. Code runs at kernel virtual addresses only after the corresponding page tables are set up
Change to the kernel virtual address by issuing ‘jmp’ instruction
Decompressed vmlinux – entry point: startup_64
1. Use the original per_cpu copy ‘init_per_cpu__gdt_page’ temporarily
2. Switch to the CPU’s own per_cpu ‘gdt_page’ when calling switch_to_new_gdt()
When to switch to the CPU’s own gdt_page (percpu)?
Decompressed vmlinux – x86_64_start_kernel()
Page Table Configuration in startup_64
Page Table Configuration in x86_64_start_kernel
init_top_pgt
Decompressed vmlinux – early_idt_handler_common
pt_regs (from the lowest address):
• r15, r14, r13, r12, bp, bx: callee-saved registers (check the x86_64 ABI)
• r11, r10, r9, r8, ax, cx, dx, si, di
• orig_ax: syscall number, error code of a CPU exception, or IRQ number of a HW interrupt
• ip, cs, flags, sp, ss: return frame for iretq
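The pt_regs layout can be checked with a small ctypes model. This is a sketch that assumes the x86_64 field order of struct pt_regs (21 slots of 64 bits each); the class and helper names are illustrative only.

```python
import ctypes


class PtRegs(ctypes.Structure):
    """User-space model of x86_64 struct pt_regs, lowest address first."""
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "r15", "r14", "r13", "r12", "bp", "bx",                    # callee-saved
        "r11", "r10", "r9", "r8", "ax", "cx", "dx", "si", "di",    # caller-saved
        "orig_ax",                           # syscall number / error code / IRQ number
        "ip", "cs", "flags", "sp", "ss",     # return frame for iretq
    )]


def field_offsets():
    """Map each register name to its byte offset inside the structure."""
    return {name: getattr(PtRegs, name).offset for name, _ in PtRegs._fields_}


if __name__ == "__main__":
    for name, offset in field_offsets().items():
        print(f"{name:8s} +{offset}")
```

In this model the iretq return frame occupies the last five slots, so a handler that has built a full pt_regs only needs to skip past orig_ax before executing iretq.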
early_make_pgtable Memory Map
early_make_pgtable
vmlinux – early_make_pgtable
vmlinux – x86_64_start_kernel()
vmlinux – start_kernel()
setup_arch() – Part 1
memblock: boot time memory management
Memblock
• Memory allocation during the boot stage
• Set up in setup_arch()
• Tear down in mem_init(): Release free pages to buddy allocator
[memblock] Reserve page 0
• Security: Mitigate L1TF (L1 Terminal Fault) vulnerability
Fixed-mapped Addresses: Compile-time virtual memory allocation
Enumeration: fixed_addresses (index 0 maps to FIXADDR_TOP; higher indices map to lower virtual addresses)
• vsyscall #0 … vsyscall #511: vsyscalls (2MB space), from FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 down to VSYSCALL_ADDR = 0xFFFF_FFFF_FF60_0000
• FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, …, __end_of_permanent_fixed_addresses: permanent fixed addresses, down to FIXADDR_START = 0xFFFF_FFFF_FF57_C000
• FIX_BTMAP_END = 1024 … FIX_BTMAP_BEGIN = 1535: 512 temporary boot-time mappings used by early_ioremap(), from 0xFFFF_FFFF_FF3F_F000 down to 0xFFFF_FFFF_FF20_0000
• __end_of_fixed_addresses = 1536
Breakdown
• Modules: start at MODULES_VADDR
• Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36): FIXADDR_START to FIXADDR_TOP
o 4MB: fixed-mapped address space
o 2MB: Borrow from ‘Modules’ space
• Unused hole (2MB): 0xFFFF_FFFF_FFE0_0000 to 0xFFFF_FFFF_FFFF_FFFF
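The addresses in this enumeration follow from a single subtraction. The Python sketch below models the kernel's fix_to_virt()/virt_to_fix() arithmetic with the FIXADDR_TOP value from this slide and 4KB pages; it is a model for checking the numbers, not the kernel macros themselves.

```python
PAGE_SHIFT = 12
PAGE_MASK = ~((1 << PAGE_SHIFT) - 1)
FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000


def fix_to_virt(idx):
    """Virtual address of a fixed_addresses index (grows downwards)."""
    return FIXADDR_TOP - (idx << PAGE_SHIFT)


def virt_to_fix(vaddr):
    """fixed_addresses index that covers a virtual address."""
    return (FIXADDR_TOP - (vaddr & PAGE_MASK)) >> PAGE_SHIFT


if __name__ == "__main__":
    for idx in (0, 511, 1024, 1535):
        print(idx, hex(fix_to_virt(idx)))
```

Index 511 lands on VSYSCALL_ADDR, and indices 1024 and 1535 land on the two ends of the boot-time mapping range listed above.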
Fixed-mapped Addresses: Use Case
Early ioremap: based on fixed-mapped address
Early ioremap
• Mapping/unmapping of an I/O physical address to a virtual address before the ioremap mechanism is ready
• early_ioremap() & early_iounmap()
early_ioremap_setup()
• Boot-time mapping range: FIX_BTMAP_BEGIN = 1535 … FIX_BTMAP_END = 1024
• PDE #505: 0xFFFF_FFFF_FF20_0000, PDE #506: 0xFFFF_FFFF_FF40_0000, PDE #507: 0xFFFF_FFFF_FF60_0000
• slot_virt[0] = 0xFFFF_FFFF_FF20_0000
• slot_virt[7] = 0xFFFF_FFFF_FF3C_0000
• Fixed-address indices marked in the figure: #1528 … #1031
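The slot_virt[] values can be reproduced with the fixmap arithmetic. A sketch under two assumptions that are not stated on the slide: the boot-time range is divided into 8 slots (FIX_BTMAPS_SLOTS) of 64 pages each (NR_FIX_BTMAPS), which is how 512 boot-time mappings would be split.

```python
PAGE_SHIFT = 12
FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000
FIX_BTMAP_BEGIN = 1535
FIX_BTMAPS_SLOTS = 8   # assumption: number of early_ioremap slots
NR_FIX_BTMAPS = 64     # assumption: pages per slot (256KB with 4KB pages)


def fix_to_virt(idx):
    """Virtual address of a fixed_addresses index."""
    return FIXADDR_TOP - (idx << PAGE_SHIFT)


def early_ioremap_slots():
    """Base virtual address of every early_ioremap slot."""
    return [fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS * i)
            for i in range(FIX_BTMAPS_SLOTS)]


if __name__ == "__main__":
    for i, base in enumerate(early_ioremap_slots()):
        print(f"slot_virt[{i}] = {base:#x}")
```

With these assumptions the first and last slots match the slot_virt[0] and slot_virt[7] values shown on the slide.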
setup_arch() – Part 1
[Linux x86 Boot Protocol]
setup_data: 64-bit physical pointer to a linked list of struct setup_data
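Walking the setup_data list amounts to following physical `next` pointers until a zero pointer. The sketch below simulates physical memory with a bytearray and assumes the boot-protocol node header of a 64-bit next pointer, a 32-bit type and a 32-bit payload length; the helper names and the type values in the usage line are arbitrary illustration.

```python
import struct

# Node header: __u64 next; __u32 type; __u32 len; followed by len payload bytes
HEADER = struct.Struct("<QII")


def put_setup_data(mem, addr, next_addr, type_, payload):
    """Write one node into the simulated physical memory."""
    HEADER.pack_into(mem, addr, next_addr, type_, len(payload))
    start = addr + HEADER.size
    mem[start:start + len(payload)] = payload


def walk_setup_data(mem, head):
    """Follow the physical 'next' pointers until a zero pointer ends the list."""
    entries = []
    addr = head
    while addr:
        next_addr, type_, length = HEADER.unpack_from(mem, addr)
        start = addr + HEADER.size
        entries.append((type_, bytes(mem[start:start + length])))
        addr = next_addr
    return entries


if __name__ == "__main__":
    mem = bytearray(256)
    put_setup_data(mem, 0x40, 0x80, 1, b"first")
    put_setup_data(mem, 0x80, 0, 3, b"second")
    print(walk_setup_data(mem, 0x40))
```

In the kernel each node has to be mapped before it can be read, because the list is linked by physical addresses; the early ioremap mechanism described above serves that purpose.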
setup_arch() – Part 2
setup_arch() – Part 2 - cleanup_highmap
setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
split_mem_range(): split a memory range into sub-ranges that can be mapped with 4K, 2M or 1G pages.
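The idea behind the split can be sketched with a greedy loop: at each address, use the largest page size that is both aligned and still fits. This is a simplified model rather than the kernel's split_mem_range(), which also takes the page sizes supported by the CPU into account; the function below only shares its name for readability.

```python
SIZES = [("1G", 1 << 30), ("2M", 1 << 21), ("4K", 1 << 12)]


def split_mem_range(start, end):
    """Split [start, end) into (start, end, page_size) sub-ranges."""
    assert start % 4096 == 0 and end % 4096 == 0
    ranges = []
    addr = start
    while addr < end:
        # Pick the largest page size that is aligned here and fits before 'end'.
        # The 4K entry always matches because both bounds are 4K aligned.
        for name, size in SIZES:
            if addr % size == 0 and addr + size <= end:
                break
        if ranges and ranges[-1][2] == name and ranges[-1][1] == addr:
            # Extend the previous sub-range when the page size is unchanged
            ranges[-1] = (ranges[-1][0], addr + size, name)
        else:
            ranges.append((addr, addr + size, name))
        addr += size
    return ranges


if __name__ == "__main__":
    for lo, hi, name in split_mem_range(0x1FF000, 0x40201000):
        print(f"{lo:#x}-{hi:#x}: {name}")
```

An unaligned range thus decomposes into a 4K head, a large-page body and a 4K tail, so most of the direct mapping consumes few page-table entries.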
kernel_physical_mapping_init(): Page Table Configuration for Direct Mapping
setup_arch() – Part 3
idt_setup_early_pf: Initialize the IDT with the early page fault handler.
setup_arch() – Part 3 - x86_init.paging.pagetable_init()
x86_init.paging.pagetable_init -> native_pagetable_init -> paging_init
• sparse_init
• zone_sizes_init: configure the number of PFNs for each zone
o free_area_init
paging_init()
• Initialize sparse memory and zone sizes
Zone Allocator: each zone has its own buddy system and per-CPU page frame cache
• ZONE_DMA (Physical address: 0-16MB)
• ZONE_DMA32 (Physical address: 16MB-4GB)
• ZONE_NORMAL (Physical address > 4GB)
• ZONE_MOVABLE
• ZONE_DEVICE
Physical Memory Models
• Flat Memory Model (CONFIG_FLATMEM)
o UMA (Uniform Memory Access)
• Discontinuous Memory Model (CONFIG_DISCONTIGMEM)
o NUMA (Non-Uniform Memory Access)
• Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP)
o NUMA
o Default configuration
• Sparse Memory
o NUMA
Sparse Memory Virtual Memmap
(CONFIG_SPARSEMEM_VMEMMAP=y)
sparse_init() – Page Table Configuration for ‘struct page’
sparse_init()
sparse_init(): section number of a physical address (128MB sections, shift of 27)
• ALIGN_DOWN(0xbffd_efff, 128MB) >> 27 = 0xb800_0000 >> 27 = 23
• ALIGN_DOWN(0x1_0000_0000, 128MB) >> 27 = 0x1_0000_0000 >> 27 = 32
• ALIGN_DOWN(0x2_403f_ffff, 128MB) >> 27 = 0x2_4000_0000 >> 27 = 72
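The three calculations above can be replayed in Python. A minimal sketch assuming 128MB sections (a section shift of 27) and 4KB pages; the function names are illustrative.

```python
PAGE_SHIFT = 12
SECTION_SIZE_BITS = 27                              # 128MB sections
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT


def align_down(value, alignment):
    """Round value down to a power-of-two alignment."""
    return value & ~(alignment - 1)


def section_nr(phys):
    """Section number of a physical address."""
    return align_down(phys, 1 << SECTION_SIZE_BITS) >> SECTION_SIZE_BITS


def pfn_to_section_nr(pfn):
    """Same result, starting from a page frame number."""
    return pfn >> PFN_SECTION_SHIFT


if __name__ == "__main__":
    for addr in (0xBFFD_EFFF, 0x1_0000_0000, 0x2_403F_FFFF):
        print(hex(addr), section_nr(addr))
```

The explicit ALIGN_DOWN is redundant once the value is shifted right by 27, but it is kept here to mirror the slide's arithmetic.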
setup_arch() – Part 3 – map_vsyscall
vsyscall (Virtual System Call) – Issue Statement
• The context switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than the execution time of those functions.
• Quote from Linux Programmer's Manual - VDSO(7)
o Making system calls can be slow. In x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell the kernel you wish to make a system call. However, this instruction is expensive: it goes through the full interrupt-handling paths in the processor's microcode as well as in the kernel. Newer processors have faster (but backward incompatible) instructions to initiate system calls.
• Built on top of the fixed-mapped address
vsyscall – Implementation (Emulate)
[PTE] Bit 63: Execute Disable (XD)
• If IA32_EFER.NXE = 1 and XD = 1, instruction fetches are not allowed from this PTE. This will generate a #PF exception.
vsyscall - Experiment
vsyscall – Experiment – gdb + backtrace
[Screenshots: Terminal #1 and Terminal #2]
error_code = 21 (0x15)
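The error code observed in the experiment can be decoded bit by bit. The sketch below uses the x86 page-fault error code bit assignments (the names follow the kernel's X86_PF_* flags); the decoder function itself is illustrative.

```python
# x86 page-fault error code bits
X86_PF_BITS = [
    (0, "PROT"),    # 1: protection violation, 0: page not present
    (1, "WRITE"),   # 1: write access, 0: read access
    (2, "USER"),    # 1: access from user mode
    (3, "RSVD"),    # 1: reserved bit set in a paging-structure entry
    (4, "INSTR"),   # 1: instruction fetch
    (5, "PK"),      # 1: protection-key violation
]


def decode_pf_error_code(code):
    """Return the names of the bits set in a page-fault error code."""
    return [name for bit, name in X86_PF_BITS if code & (1 << bit)]


if __name__ == "__main__":
    print(decode_pf_error_code(0x15))
```

For 0x15 the set bits are PROT, USER and INSTR: a user-mode instruction fetch from a page that is present but not executable. That is the fault the emulated vsyscall page is expected to raise given the XD bit described earlier.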
Replacement of vsyscall: vDSO (virtual Dynamic Shared Object)
• vsyscall limitation
o Security concern: fixed virtual address (0xFFFF_FFFF_FF60_0000)
• vDSO
o Leverages ASLR (Address Space Layout Randomization)
▪ Can be enabled/disabled via /proc/sys/kernel/randomize_va_space
▪ [Enable] echo 1 > /proc/sys/kernel/randomize_va_space
▪ [Disable] echo 0 > /proc/sys/kernel/randomize_va_space
o User space address
o Security enhancement
setup_arch() – Part 3
[Recap] Page Table Configuration after finishing setup_arch()
vmlinux – start_kernel() – Part 2
[Figure: setup_per_cpu_areas() copies (memcpy) the original .data..percpu section in physical memory into a separate .data..percpu area for each of core 0, core 1, core 2 and core 3.]
percpu section (.data..percpu), from __per_cpu_start = 0 to __per_cpu_end, loaded at __per_cpu_load (kernel virtual address):
• *(.data..percpu..first)
• *(.data..percpu..page_aligned)
• *(.data..percpu..read_mostly)
• *(.data..percpu)
• *(.data..percpu..shared_aligned)
percpu variable access option #1: __per_cpu_offset
APIs (include/linux/percpu-defs.h):
* per_cpu_ptr(ptr, cpu): via __per_cpu_offset
[Figure: setup_per_cpu_areas() copies the original .data..percpu section (memcpy with source address ‘__per_cpu_load’) into the areas of core 0 to core 3 in physical memory; __per_cpu_offset[0] to __per_cpu_offset[3] locate the areas of core 0 to core 3.]
[Example] gdt_page = 0xb000
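The __per_cpu_offset mechanism can be modelled with a bytearray standing in for physical memory. This is a toy sketch: the function names echo the kernel's, but the area base, the unit size and the template contents are made-up values for illustration.

```python
import struct


def setup_per_cpu_areas(template, nr_cpus, unit_size, base):
    """Copy the template section once per CPU; return (memory, offsets)."""
    assert len(template) <= unit_size
    memory = bytearray(base + nr_cpus * unit_size)
    offsets = []
    for cpu in range(nr_cpus):
        area = base + cpu * unit_size
        memory[area:area + len(template)] = template   # the memcpy step
        # __per_cpu_start = 0, so the offset equals the area's start address
        offsets.append(area)
    return memory, offsets


def per_cpu_ptr(ptr, cpu, offsets):
    """Address of CPU 'cpu's copy of the variable at section offset 'ptr'."""
    return ptr + offsets[cpu]


if __name__ == "__main__":
    template = bytearray(16)
    struct.pack_into("<i", template, 8, 7)   # a per-CPU int at offset 8, initial value 7
    memory, offsets = setup_per_cpu_areas(bytes(template), 4, 0x100, 0x1000)
    print([hex(per_cpu_ptr(8, cpu, offsets)) for cpu in range(4)])
```

With the slide's example, the copy of gdt_page that belongs to a given CPU would sit at 0xb000 + __per_cpu_offset[cpu].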
percpu variable access option #2: gs register (MSR: IA32_GS_BASE)
APIs (include/linux/percpu-defs.h):
* this_cpu_read(pcp)
* this_cpu_write(pcp, val)
* this_cpu_add(pcp, val)
* this_cpu_ptr(ptr) & raw_cpu_ptr(ptr)
1. Use gs register
2. If option #1 is not supported, use this_cpu_off per-cpu variable (read mostly)
[Figure: setup_per_cpu_areas() copies the original .data..percpu section (memcpy with source address ‘__per_cpu_load’) into the areas of core 0 to core 3 in physical memory; IA32_GS_BASE of CPU #0 to CPU #3 points to that CPU's own area.]
gs register (MSR: IA32_GS_BASE) vs __per_cpu_offset
gs register
    DEFINE_PER_CPU(int, x);
    int z;
    z = this_cpu_read(x);
• this_cpu_read(x) converts to a single instruction: mov %gs:x,%edx
• this_cpu_inc(x) converts to a single instruction: inc %gs:x
• Atomic: No need to disable preemption and interrupts
__per_cpu_offset
    int *y;
    int cpu;
    cpu = get_cpu();
    y = per_cpu_ptr(&x, cpu);
    (*y)++;
    put_cpu();
• Non-atomic: Need to disable preemption
[Figure: this_cpu_read() and this_cpu_inc() via the gs register, compared with a this_cpu_inc() implementation via __per_cpu_offset]
vmlinux – start_kernel() – Part 2
vmlinux – start_kernel() – Part 2 – trap_init()
CPU Entry Area (percpu)
• Page Table Isolation (PTI)
o Mitigate Meltdown
o Isolate user space and kernel space memory
o When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full "kernel" copy.
▪ Entry/exit functions and the IDT (Interrupt Descriptor Table) must remain mapped in the userspace page table
PTI: Concept
[Figure: without PTI, a single page table maps kernel space and user space in both user mode and kernel mode. With PTI, the kernel-mode page table maps kernel space and user space, while the user-mode page table maps user space only.]
PTI: High-level implementation
[Figure: in user mode, the user page table maps user space plus the percpu TSS and the entry code in kernel space. On a syscall the CPU enters kernel mode while still on the user page table, then switches to the kernel page table, which maps user space and the whole kernel space.]
vmlinux – start_kernel() – Part 2 – setup_cpu_entry_area()
vmlinux – start_kernel() – Part 2 – trap_init()
vmlinux – start_kernel() – Part 2 – mm_init()
mm_init
• Set up different parts of Linux kernel memory managers
vmlinux – start_kernel() – Part 2 - preallocate_vmalloc_pages()
vmlinux – start_kernel() – Part 2
pti_init()
vmlinux – start_kernel() – Part 2
vmlinux – start_kernel() – Part 3
vmlinux – start_kernel() – Part 4
CommitLimit: Total amount of memory currently available to be allocated on the system.
Committed_AS: The amount of memory requested by processes.
Over Commit: Committed_AS > CommitLimit
Idle Process (swapper) = init_task (pid = 0)
Kernel Stack (THREAD_SIZE = 16KB), from the lowest address:
• task.stack: STACK_END_MAGIC = 0x57AC6E9D
• kernel stack usage space
• struct fork_frame
o struct inactive_task_frame, pointed to by task.thread_struct.sp
o struct pt_regs (save CPU registers for userspace application), ending at task.stack + THREAD_SIZE
Context Switch – Kernel Stack
pt_regs (from the lowest address):
• r15, r14, r13, r12, bp, bx
• r11, r10, r9, r8, ax, cx, dx, si, di
• orig_ax: syscall number, error code of a CPU exception, or IRQ number of a HW interrupt
• ip, cs, flags, sp, ss: return frame for iretq
thread_struct
• tls_array
• es, ds
• fsindex, gsindex
• fsbase, gsbase
• sp
• …
inactive_task_frame (callee-saved registers, from the lowest address; configured by copy_thread() for a kernel thread):
• r15, r14, r13
• r12 (kernel thread argument)
• bx (kernel thread function)
• bp
• ret_addr = ret_from_fork
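The position of the initial stack pointer follows from the two structure sizes. A ctypes sketch, assuming the x86_64 field order of struct inactive_task_frame, a 21-slot struct pt_regs, and no padding at the top of the kernel stack; the class names and the example stack address are illustrative.

```python
import ctypes

THREAD_SIZE = 16 * 1024


class InactiveTaskFrame(ctypes.Structure):
    """Model of struct inactive_task_frame, lowest address first."""
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "r15", "r14", "r13", "r12", "bx", "bp", "ret_addr")]


class ForkFrame(ctypes.Structure):
    """Model of struct fork_frame: inactive_task_frame followed by pt_regs."""
    _fields_ = [
        ("frame", InactiveTaskFrame),
        ("regs", ctypes.c_uint64 * 21),   # struct pt_regs: 21 64-bit slots
    ]


def initial_thread_sp(task_stack):
    """Stack pointer stored in thread_struct.sp for a freshly forked task.

    Assumption: fork_frame ends exactly at task.stack + THREAD_SIZE.
    """
    return task_stack + THREAD_SIZE - ctypes.sizeof(ForkFrame)


if __name__ == "__main__":
    stack = 0xFFFF_C900_0000_0000   # hypothetical task.stack value
    print(hex(initial_thread_sp(stack)))
```

When the new task is first switched to, popping the frame from this address restores the callee-saved registers and the final return lands on ret_addr, i.e. ret_from_fork.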
Context Switch – Kernel Thread
[Figure sequence, steps 1-4: the kernel stack of a newly forked kernel thread, from task.stack (STACK_END_MAGIC = 0x57AC6E9D) through the kernel stack usage space up to task.stack + THREAD_SIZE. It holds the inactive_task_frame configured by copy_thread() (callee-saved registers r15, r14, r13; r12 = kernel thread argument; bx = kernel thread function; bp; ret_addr = ret_from_fork) below struct pt_regs (save/restore CPU registers for userspace tasks). The figures track rsp as the saved registers are restored, and the final jump transfers rip to ret_addr = ret_from_fork.]
[Prev task] Return to the next instruction after the switch_to() call (`return prev_p`) when the previous task is re-scheduled.
Context Switch – When to run ‘context switch’?
• Explicitly call ‘schedule()’
• Call ‘cond_resched()’ to yield CPU resource
Context Switch – init_task is rescheduled
[Prev task] Return to the next instruction of calling switch_to() when the previous task is re-scheduled.
Backtrace when init_task (pid = 0) is rescheduled because kernel_init thread (pid = 1) is scheduled out
Kernel Thread Context Switch
[Figure sequence: three task_struct instances, init_task (pid = 0), init process (pid = 1) and kthreadd (pid = 2), all with mm = NULL. init_mm is an mm_struct (mmap: list of VMAs; pgd) whose pgd points to swapper_pg_dir = init_top_pgt. Initially only init_task has active_mm set, pointing to init_mm. When the scheduler switches pid = 0 -> pid = 1, the init process takes over active_mm and init_task's active_mm becomes NULL. When it switches pid = 1 -> pid = 2, kthreadd takes over active_mm and the init process's active_mm becomes NULL.]
1. A kernel thread does not have its own ‘mm’.
2. The active_mm of the next task is inherited from the previous task (both use the same page table).
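The two rules can be captured in a small state model. This is a simplified Python sketch of the mm handling during a context switch, not the scheduler's real code: the classes, the pgd values and the cr3 write counter are invented for illustration.

```python
class MM:
    """Stand-in for mm_struct: only the page table root matters here."""
    def __init__(self, name, pgd):
        self.name = name
        self.pgd = pgd


class Task:
    """Stand-in for task_struct; a kernel thread has mm = None."""
    def __init__(self, pid, mm=None, active_mm=None):
        self.pid = pid
        self.mm = mm
        self.active_mm = active_mm


class CPU:
    def __init__(self, cr3):
        self.cr3 = cr3
        self.cr3_writes = 0

    def load_cr3(self, pgd):
        self.cr3 = pgd
        self.cr3_writes += 1


def context_switch(cpu, prev, next_task):
    if next_task.mm is None:
        # Kernel thread: borrow the previous task's address space, cr3 untouched
        next_task.active_mm = prev.active_mm
    elif cpu.cr3 != next_task.mm.pgd:
        # User task: switch to its own page table
        cpu.load_cr3(next_task.mm.pgd)
    if prev.mm is None:
        # A kernel thread gives up the borrowed mm when it is switched out
        prev.active_mm = None


if __name__ == "__main__":
    init_mm = MM("init_mm", pgd=0x1000)
    idle = Task(0, active_mm=init_mm)
    kernel_init = Task(1)
    cpu = CPU(cr3=init_mm.pgd)
    context_switch(cpu, idle, kernel_init)
    print(kernel_init.active_mm.name, cpu.cr3_writes)
```

Because a kernel thread only touches kernel mappings, which are present in every page table, borrowing the previous task's active_mm avoids a cr3 write and the TLB flush that comes with it.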
Context Switch: Kernel Thread <-> User Space Task
gdb breakpoint configuration: two breakpoints (breakpoint #1 and breakpoint #2)
[Figure sequence on cpu = 2:
1. init_task (pid = 0, mm = NULL, active_mm set) is running; the sleep program (pid = 40) has its own mm_struct (mmap: list of VMAs; pgd), referenced by both its mm and its active_mm.
2. The `sleep` userspace task is selected to run (pid = 0 -> pid = 40): init_task's active_mm becomes NULL.
3. The `sleep` userspace task is scheduled out (pid = 40 -> pid = 20): ksoftirqd/2 (pid = 20, mm = NULL) runs next, and its active_mm refers to the mm_struct of the sleep program.]
[Kernel Thread] Inherit active_mm of the previous task. (No need to flush TLB because cr3 is not changed)
vmlinux – start_kernel() – Part 4
init process = kernel_init() (pid = 1)
[pid = 1 – init process] When are mm & active_mm allocated?
clone_pgd_range()
[pid = 1] Before running run_init_process()
[pid = 1] After finishing run_init_process(): kernel thread -> user process
clone_pgd_range(): mm.pgd verification
[pid = 1] mm_struct
smp_init() - boot secondary CPUs
cpuhp/cpu_id kernel thread
• Execute callbacks (teardown, startup and so on) when the CPU hotplug state changes.
smp_init() - boot secondary CPUs – Boot Flow
[Secondary CPUs] CR3 Register Configuration
• startup_32: setup cr3 @trampoline_pgd
• secondary_startup_64: setup cr3 @init_top_pgt
startup_32() - boot secondary CPUs – Page Table Configuration
secondary_startup_64() - boot secondary CPUs – Page Table
Secondary CPUs – When to configure active_mm for idle_threads?
pstree after finishing start_kernel()
Reference
• The Linux/x86 Boot Protocol, Documentation/x86/boot.rst
• Intel® 64 and IA-32 Architectures Software Developer’s Manual
• https://wdv4758h.github.io/notes/blog/linux-kernel-boot.html
• Linux insides, https://0xax.gitbooks.io/linux-insides/content/
• Debugging kernel and modules via gdb, https://www.kernel.org/doc/Documentation/dev-tools/gdb-kernel-debugging.rst

The e820 trap of Linux kernel hibernationThe e820 trap of Linux kernel hibernation
The e820 trap of Linux kernel hibernation
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
 
It322 intro 2
It322 intro 2It322 intro 2
It322 intro 2
 
Linux Kernel Tour
Linux Kernel TourLinux Kernel Tour
Linux Kernel Tour
 
Memory
MemoryMemory
Memory
 

Recently uploaded

Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 

Recently uploaded (20)

Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 

Decompressed vmlinux: linux kernel initialization from page table configuration perspective

  • 1. Decompressed vmlinux: Linux Kernel Initialization from Page Table Configuration Perspective Adrian Huang | June, 2021 * Based on kernel 5.11 (x86_64) – QEMU * SMP (4 CPUs) and 8GB memory * Kernel parameter: nokaslr * Legacy BIOS
  • 2. Agenda • Recap – CPU booting flow and page table before entering decompressed vmlinux • 64-bit Virtual Address • Decompressed vmlinux: Important functions • Entry point: startup_64() • x86_64_start_kernel() -> start_kernel() -> setup_arch() • Apart from focusing on page table configuration, the following are covered as well: • Fixed-mapped addresses • Early ioremap: based on fixed-mapped addresses • Physical memory models • Especially for sparse memory • vsyscall - virtual system call (Built on top of fixed-mapped addresses) • percpu variable • PTI (Page Table Isolation) • kernel thread fork & context switch: struct pt_regs and struct inactive_task_frame in kernel stack • How to boot secondary CPUs? Where is the entry address?
  • 3. Recap – CPU booting flow before entering decompressed vmlinux setup.bin (arch/x86/boot/setup.bin) Compressed vmlinux (Protected-mode kernel) Note ELF: arch/x86/boot/compressed/vmlinux Binary: arch/x86/boot/vmlinux.bin CRC bzImage Long Mode:
  • 4. Recap - Compressed vmlinux: Page table before entering decompressed vmlinux Sign-extend Page Map Level-4 Offset Page Directory Pointer Offset Page Directory Offset Physical Page Offset 0 30 21 39 20 38 29 47 48 63 PML4E #0 PDPTE #3 Data Page Map Level-4 Table Page Directory Pointer Table Page Directory Table 40 9 9 9 Linear Address CR3 PDPTE #2 PDPTE #1 PDPTE #0 PDE #1535 PDE #1024 . . PDE #2047 PDE #1536 . . PDE #511 PDE #0 . . PDE #1023 PDE #512 . . 2-MByte Physical Page 40 40 31 21 [Paging] Identity mapping for 0-4GB memory space
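The linear-address split shown on this slide (9 + 9 + 9 index bits and a 21-bit offset for 2 MB pages) can be sketched in C. This is only an illustration: the struct and function names are hypothetical and do not come from the kernel source.

```c
#include <stdint.h>

/* Index fields of a 48-bit linear address that is mapped with 2 MB pages. */
struct va_fields {
    unsigned pml4;   /* bits 47..39: Page Map Level-4 offset */
    unsigned pdpt;   /* bits 38..30: Page Directory Pointer offset */
    unsigned pd;     /* bits 29..21: Page Directory offset */
    uint32_t offset; /* bits 20..0:  offset inside the 2 MB page */
};

/* Hypothetical helper: split a linear address into its table indexes. */
static inline struct va_fields split_va_2m(uint64_t va)
{
    struct va_fields f;

    f.pml4   = (unsigned)((va >> 39) & 0x1ff);
    f.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    f.pd     = (unsigned)((va >> 21) & 0x1ff);
    f.offset = (uint32_t)(va & 0x1fffff);
    return f;
}
```

For the 0-4GB identity mapping, every address yields PML4E #0 and one of PDPTE #0 to #3, and each PDPTE covers 512 PDEs of 2 MB each.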
  • 5. 64-bit Virtual Address Kernel Space 0x0000_7FFF_FFFF_FFFF 0xFFFF_8000_0000_0000 128TB Page frame direct mapping (64TB) ZONE_DMA ZONE_DMA32 ZONE_NORMAL page_offset_base 0 16MB 64-bit Virtual Address Kernel Virtual Address Physical Memory 0 0xFFFF_FFFF_FFFF_FFFF Guard hole (8TB) LDT remap for PTI (0.5TB) Unused hole (0.5TB) vmalloc/ioremap (32TB) vmalloc_base Unused hole (1TB) Virtual memory map – 1TB (store page frame descriptor) … vmemmap_base 64TB *page … *page … *page … Page Frame Descriptor vmemmap_base page_offset_base = 0xFFFF_8880_0000_0000 vmalloc_base = 0xFFFF_C900_0000_0000 vmemmap_base = 0xFFFF_EA00_0000_0000 * Can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c") Default Configuration Kernel text mapping from physical address 0 Kernel code [.text, .data…] Modules __START_KERNEL_map = 0xFFFF_FFFF_8000_0000 __START_KERNEL = 0xFFFF_FFFF_8100_0000 MODULES_VADDR 0xFFFF_8000_0000_0000 Empty Space User Space 128TB 1GB or 512MB 1GB or 1.5GB Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36) FIXADDR_START Unused hole (2MB) 0xFFFF_FFFF_FFE0_0000 0xFFFF_FFFF_FFFF_FFFF FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 Reference: Documentation/x86/x86_64/mm.rst
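The two linear mappings in this layout (direct mapping and kernel text mapping) reduce to adding a constant base to a physical address. A minimal sketch follows, assuming the default (nokaslr) bases listed on the slide and a kernel text physical base of 0; the helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_OFFSET_BASE 0xFFFF888000000000ULL /* default page_offset_base */
#define START_KERNEL_MAP 0xFFFFFFFF80000000ULL /* __START_KERNEL_map */

/* Direct mapping: physical address -> kernel virtual address. */
static inline uint64_t direct_map_va(uint64_t pa)
{
    return pa + PAGE_OFFSET_BASE;
}

/* Kernel text mapping, assuming the mapping starts at physical address 0. */
static inline uint64_t kernel_text_va(uint64_t pa)
{
    return pa + START_KERNEL_MAP;
}

/* A 48-bit canonical address has bits 63..47 all zero or all one. */
static inline int is_canonical_48(uint64_t va)
{
    uint64_t top = va >> 47; /* the 17 upper bits */

    return top == 0 || top == 0x1ffff;
}
```

With these bases, the 16MB load address 0x1000000 maps to __START_KERNEL (0xFFFF_FFFF_8100_0000) through the kernel text mapping.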
  • 6. Decompressed vmlinux – entry point: startup_64 1. The entry point is still at physical address 0x1000000 (16MB), not at a kernel virtual address 2. Execution switches to kernel virtual addresses only after the corresponding page tables are set up
  • 7. Decompressed vmlinux – entry point: startup_64
  • 8. Decompressed vmlinux – entry point: startup_64
  • 9. Decompressed vmlinux – entry point: startup_64 Change to the kernel virtual address by issuing ‘jmp’ instruction 1 2 3
  • 10. Decompressed vmlinux – entry point: startup_64 1. Temporarily use the original per-CPU copy ‘init_per_cpu__gdt_page’ 2. Switch to the CPU’s own per-CPU ‘gdt_page’ when switch_to_new_gdt() is called
  • 11. Decompressed vmlinux – entry point: startup_64 1. Temporarily use the original per-CPU copy ‘init_per_cpu__gdt_page’ 2. Switch to the CPU’s own per-CPU ‘gdt_page’ when switch_to_new_gdt() is called When does the switch to the CPU’s own gdt_page (percpu) happen?
  • 12. Decompressed vmlinux – entry point: startup_64
  • 13. Decompressed vmlinux – x86_64_start_kernel() Page Table Configuration in startup_64 Page Table Configuration in x86_64_start_kernel init_top_pgt
  • 14. Decompressed vmlinux – x86_64_start_kernel()
  • 15. Decompressed vmlinux – x86_64_start_kernel()
  • 16. Decompressed vmlinux – early_idt_handler_common Return frame for iretq pt_regs r15-r12 bx r11-r8 bp ax dx si cx orig_ax ip di cs sp ss flags orig_ax: syscall#, error code for CPU exception or IRQ number of HW interrupt Callee-saved registers: Check x86_64 ABI
  • 22. setup_arch() – Part 1 memblock: boot time memory management Memblock • Memory allocation during boot time stage • Set up in setup_arch() • Tear down in mem_init(): Release free pages to buddy allocator [memblock] Reserve page 0 • Security: Mitigate L1TF (L1 Terminal Fault) vulnerability
  • 23. Fixed-mapped Addresses: Compile-time virtual memory allocation vsyscall #0 … vsyscall #511 FIX_DBGP_BASE FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 VSYSCALL_ADDR = 0xFFFF_FFFF_FF60_0000 FIX_EARLYCON_MEM_BASE … __end_of_permanent_fixed_addresses FIX_BTMAP_END = 1024 … FIX_BTMAP_BEGIN = 1535 __end_of_fixed_addresses = 1536 vsyscalls (2MB space) Permanent fixed addresses 512 temporary boot-time mappings: used by early_ioremap() FIXADDR_START = 0xFFFF_FFFF_FF57_C000 Enumeration: fixed_addresses 0xFFFF_FFFF_FF3F_F000 0xFFFF_FFFF_FF20_0000 Modules MODULES_VADDR Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36) FIXADDR_START Unused hole (2MB) 0xFFFF_FFFF_FFE0_0000 0xFFFF_FFFF_FFFF_FFFF FIXADDR_TOP 4MB: fixed-mapped address space 2MB: Borrow from ‘Modules’ space breakdown
  • 24. Fixed-mapped Addresses: Compile-time virtual memory allocation vsyscall #0 … vsyscall #511 FIX_DBGP_BASE FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 VSYSCALL_ADDR = 0xFFFF_FFFF_FF60_0000 FIX_EARLYCON_MEM_BASE … __end_of_permanent_fixed_addresses FIX_BTMAP_END = 1024 … FIX_BTMAP_BEGIN = 1535 __end_of_fixed_addresses = 1536 vsyscalls (2MB space) Permanent fixed addresses 512 temporary boot-time mappings: used by early_ioremap() FIXADDR_START = 0xFFFF_FFFF_FF57_C000 Enumeration: fixed_addresses 0xFFFF_FFFF_FF3F_F000 0xFFFF_FFFF_FF20_0000 4MB: fixed-mapped address space 2MB: Borrow from ‘Modules’ space
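The fixed-mapped addresses above grow downward from FIXADDR_TOP, one 4 KB page per enumeration index. The sketch below mirrors that arithmetic using the FIXADDR_TOP value from the slide; it is a user-space illustration rather than the kernel's own header.

```c
#include <stdint.h>

#define PAGE_SHIFT  12
#define FIXADDR_TOP 0xFFFFFFFFFF7FF000ULL /* value shown on the slide */

/* Index in the fixed_addresses enumeration -> virtual address. */
static inline uint64_t fix_to_virt(unsigned idx)
{
    return FIXADDR_TOP - ((uint64_t)idx << PAGE_SHIFT);
}

/* Virtual address inside the fixmap area -> enumeration index. */
static inline unsigned virt_to_fix(uint64_t va)
{
    return (unsigned)((FIXADDR_TOP - (va & ~0xFFFULL)) >> PAGE_SHIFT);
}
```

Plugging in the indexes from the slide reproduces its addresses: FIX_BTMAP_END = 1024 gives 0xFFFF_FFFF_FF3F_F000 and FIX_BTMAP_BEGIN = 1535 gives 0xFFFF_FFFF_FF20_0000.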
  • 25. Fixed-mapped Addresses: Compile-time virtual memory allocation Fixed-mapped Addresses: Use Case
  • 26. Early ioremap: based on fixed-mapped address PDE #507: 0xFFFF_FFFF_FF60_0000 PDE #506: 0xFFFF_FFFF_FF40_0000 PDE #505: 0xFFFF_FFFF_FF20_0000 #1528 … FIX_BTMAP_BEGIN = 1535 … FIX_BTMAP_END = 1024 … # 1031 slot_virt[0] slot_virt[7] slot_virt[0] = 0xFFFF_FFFF_FF20_0000 slot_virt[7] = 0xFFFF_FFFF_FF3C_0000 early_ioremap_setup() Early ioremap • Mapping/unmapping of I/O physical address to virtual address before ioremap mechanism is ready • early_ioremap() & early_iounmap() Fixed-mapped Addresses
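The slot_virt[] values on this slide follow from the fixmap arithmetic: the 512 boot-time pages are divided into 8 slots of 64 pages each. The sketch below assumes those two constants (NR_FIX_BTMAPS = 64, FIX_BTMAPS_SLOTS = 8) and uses hypothetical helper names.

```c
#include <stdint.h>

#define PAGE_SHIFT       12
#define FIXADDR_TOP      0xFFFFFFFFFF7FF000ULL
#define NR_FIX_BTMAPS    64   /* pages per early_ioremap slot (assumed) */
#define FIX_BTMAPS_SLOTS 8    /* number of slots (assumed) */
#define FIX_BTMAP_BEGIN  1535 /* index shown on the slide */

static inline uint64_t fix_to_virt(unsigned idx)
{
    return FIXADDR_TOP - ((uint64_t)idx << PAGE_SHIFT);
}

/* Base virtual address of one early_ioremap slot. */
static inline uint64_t early_slot_virt(unsigned slot)
{
    return fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS * slot);
}

/* Page-directory index (bits 29..21) of a virtual address. */
static inline unsigned pd_index(uint64_t va)
{
    return (unsigned)((va >> 21) & 0x1ff);
}
```

Slot 0 starts at 0xFFFF_FFFF_FF20_0000 (PDE #505) and slot 7 at 0xFFFF_FFFF_FF3C_0000, so consecutive slots are 256 KB apart.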
  • 28. setup_arch() – Part 1 [Linux x86 Boot Protocol] setup_data: 64-bit physical pointer to linked list of struct setup_data
  • 30. setup_arch() – Part 2 - cleanup_highmap
  • 32. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
  • 33. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
  • 34. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping Split memory range into sub-ranges that fulfill 4K, 2M or 1G page. split_mem_range
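The idea of splitting a range into 4K, 2M and 1G sub-ranges can be illustrated with a greedy pass: at each address, pick the largest page size that is aligned there and still fits before the end. This is a simplified sketch, not the kernel's split_mem_range(); it assumes 4 KB-aligned inputs and that 2M and 1G pages are both available, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL
#define SZ_1G 0x40000000ULL

struct sub_range {
    uint64_t start, end; /* [start, end) */
    uint64_t page_size;  /* SZ_4K, SZ_2M or SZ_1G */
};

/* Largest page size that is aligned at addr and does not cross end. */
static inline uint64_t best_page_size(uint64_t addr, uint64_t end)
{
    if (addr % SZ_1G == 0 && end - addr >= SZ_1G)
        return SZ_1G;
    if (addr % SZ_2M == 0 && end - addr >= SZ_2M)
        return SZ_2M;
    return SZ_4K;
}

/*
 * Split [start, end) into sub-ranges of uniform page size.
 * start and end must be 4 KB aligned. Returns the number of
 * sub-ranges written to out (at most max).
 */
static inline size_t split_range_sketch(uint64_t start, uint64_t end,
                                        struct sub_range *out, size_t max)
{
    size_t n = 0;
    uint64_t addr = start;

    while (addr < end) {
        uint64_t ps = best_page_size(addr, end);

        if (n > 0 && out[n - 1].page_size == ps) {
            out[n - 1].end = addr + ps; /* extend the previous sub-range */
        } else {
            if (n == max)
                break;
            out[n].start = addr;
            out[n].end = addr + ps;
            out[n].page_size = ps;
            n++;
        }
        addr += ps;
    }
    return n;
}
```

A range that starts unaligned therefore gets a 4K head, then 2M pages up to the next 1G boundary, 1G pages in the middle, and 2M and 4K tails.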
  • 35. setup_arch() – Part 2: init_mem_mapping() -- Page Table Configuration for Direct Mapping
  • 36. kernel_physical_mapping_init(): Page Table Configuration for Direct Mapping
  • 37. setup_arch() – Part 3 Initialize the idt table with early pagefault handler. idt_setup_early_pf
  • 38. setup_arch() – Part 3 - x86_init.paging.pagetable_init() x86_init.paging.pagetable_init native_pagetable_init paging_init sparse_init zone_sizes_init cfg number of pfn for each zone free_area_init Zone Allocator Buddy system Per-CPU page frame cache Buddy system Per-CPU page frame cache Buddy system Per-CPU page frame cache ZONE_DMA (Physical address: 0-16MB) ZONE_DMA32 (Physical address: 16MB-4GB) ZONE_NORMAL (Physical address > 4GB) Buddy system Per-CPU page frame cache Buddy system Per-CPU page frame cache ZONE_MOVABLE ZONE_DEVICE ZONE_DMA ZONE_DMA32 ZONE_NORMAL 0 16MB Physical Memory 64TB 4GB paging_init() • Initialize sparse memory and zone sizes
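The zone boundaries listed on this slide can be expressed as a small classifier. The sketch below covers only the three address-based zones (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL) and ignores ZONE_MOVABLE and ZONE_DEVICE; the function name is hypothetical.

```c
#include <stdint.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };

#define MAX_DMA_ADDR   (16ULL << 20) /* 16MB boundary */
#define MAX_DMA32_ADDR (4ULL << 30)  /* 4GB boundary */

/* Classify a physical address by the boundaries shown on the slide. */
static inline enum zone_type zone_of_phys(uint64_t pa)
{
    if (pa < MAX_DMA_ADDR)
        return ZONE_DMA;
    if (pa < MAX_DMA32_ADDR)
        return ZONE_DMA32;
    return ZONE_NORMAL;
}
```

On the 8GB test machine, memory above the 4GB boundary falls into ZONE_NORMAL.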
  • 39. Physical Memory Models • Flat Memory Model (CONFIG_FLATMEM) • UMA (Uniform Memory Access) • Discontinuous Memory Model (CONFIG_DISCONTIGMEM) • NUMA (Non-Uniform Memory Access) • Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP) • NUMA • Default configuration • Sparse Memory • NUMA
  • 40. Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP=y)
  • 41. sparse_init() – Page Table Configuration for ‘struct page’
  • 43. sparse_init() ALIGN_DOWN(0xbffd_efff, 128MB) >> 27 = 0xb800_0000 >> 27 = 23 ALIGN_DOWN(0x1_0000_0000, 128MB) >> 27 = 0x1_0000_0000 >> 27 = 32 ALIGN_DOWN(0x2_403f_ffff, 128MB) >> 27 = 0x2_4000_0000 >> 27 = 72
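The section numbers on this slide come from aligning a physical address down to 128MB and shifting right by 27 bits. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

#define SECTION_SIZE_BITS 27 /* 128MB sections, as on the slide */

/* Round x down to a multiple of a (a must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Sparse-memory section number that contains physical address pa. */
static inline uint64_t section_nr(uint64_t pa)
{
    return align_down(pa, 1ULL << SECTION_SIZE_BITS) >> SECTION_SIZE_BITS;
}
```

The three addresses from the slide map to sections 23, 32 and 72.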
  • 44. setup_arch() – Part 3 – map_vsyscall
  • 45. vsyscall (Virtual System Call) – Issue Statement • The context switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than the execution time of those functions. • Quote from Linux Programmer's Manual - VDSO(7) • Making system calls can be slow. In x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell the kernel you wish to make a system call. However, this instruction is expensive: it goes through the full interrupt-handling paths in the processor's microcode as well as in the kernel. Newer processors have faster (but backward incompatible) instructions to initiate system calls. • Built on top of the fixed-mapped address
  • 46. vsyscall – Implementation (Emulate) [PTE] Bit 63: Execute Disable (XD) • If IA32_EFER.NXE = 1 and XD = 1, instruction fetches are not allowed from this PTE. This will generate a #PF exception.
  • 48. vsyscall – Experiment – gdb + backtrace Terminal #1 Terminal #2
  • 49. vsyscall – Experiment – gdb + backtrace Terminal #1 Terminal #2 error_code = 21 (0x15)
  • 50. vsyscall – Experiment – gdb + backtrace Terminal #1 Terminal #2
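The error_code = 21 (0x15) seen in the experiment can be decoded with the x86 page-fault error-code bits. The sketch below uses the architectural bit positions; the macro and function names are chosen here for illustration.

```c
/* x86 page-fault error-code bits. */
#define PF_PROT  (1u << 0) /* 1: protection violation on a present page */
#define PF_WRITE (1u << 1) /* 1: write access */
#define PF_USER  (1u << 2) /* 1: fault raised in user mode */
#define PF_RSVD  (1u << 3) /* 1: reserved bit set in a paging entry */
#define PF_INSTR (1u << 4) /* 1: instruction fetch */

/* True for a user-mode instruction fetch from a present, non-executable page. */
static inline int is_user_ifetch_prot_fault(unsigned code)
{
    unsigned want = PF_PROT | PF_USER | PF_INSTR;

    return (code & want) == want && !(code & PF_WRITE);
}
```

0x15 sets exactly the PROT, USER and INSTR bits, which is consistent with a user-space call into the vsyscall page when that page is present but marked execute-disable.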
  • 51. Replacement of vsyscall: vDSO (virtual Dynamic Shared Object) • vsyscall limitation • Security concern: fixed virtual address (0xFFFF_FFFF_FF60_0000) • vDSO • Exploit ASLR (Address Space Layout Randomization) • Can be enabled/disabled via /proc/sys/kernel/randomize_va_space • [Enable] echo 1 > /proc/sys/kernel/randomize_va_space • [Disable] echo 0 > /proc/sys/kernel/randomize_va_space • User space address • Security enhancement
  • 53. [Recap] Page Table Configuration after finishing setup_arch()
  • 54. [Recap] Page Table Configuration after finishing setup_arch() 1 2 3 1 1 2 3
  • 55. vmlinux – start_kernel() – Part 2 Original .data..percpu .data..percpu for core 2 .data..percpu for core 3 .data..percpu for core 0 .data..percpu for core 1 Physical Memory memcpy in setup_per_cpu_areas()
  • 58. percpu variable access option #1: __per_cpu_offset APIs (include/linux/percpu-defs.h): * per_cpu_ptr(ptr, cpu): via __per_cpu_offset Original .data..percpu .data..percpu for core 2 .data..percpu for core 3 .data..percpu for core 0 .data..percpu for core 1 Physical Memory memcpy with source address ‘__per_cpu_load’ in setup_per_cpu_areas() __per_cpu_offset[0] __per_cpu_offset[1] __per_cpu_offset[2] __per_cpu_offset[3]
  • 59. percpu variable access option #1: __per_cpu_offset *(.data..percpu..shared_aligned) *(.data..percpu) *(.data..percpu..read_mostly) *(.data..percpu..page_aligned) *(.data..percpu..first) .data..percpu __per_cpu_load (kernel virtual address) __per_cpu_end __per_cpu_start = 0 [Example] gdt_page = 0xb000 Original .data..percpu .data..percpu for core 2 .data..percpu for core 3 .data..percpu for core 0 .data..percpu for core 1 Physical Memory memcpy with source address ‘__per_cpu_load’ in setup_per_cpu_areas() __per_cpu_offset[0] __per_cpu_offset[1] __per_cpu_offset[2] __per_cpu_offset[3]
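The offset-based access path can be modelled in user space: one template image is copied once per CPU, and a variable is located by adding that CPU's offset to the variable's offset inside the image. This is a toy model under stated assumptions (4 CPUs, a 64-byte image, one hypothetical int variable at offset 16); none of the names are kernel symbols.

```c
#include <stddef.h>
#include <string.h>

#define NR_CPUS     4
#define UNIT_SIZE   64 /* size of the simulated .data..percpu image */
#define COUNTER_OFF 16 /* hypothetical per-CPU int at offset 16 in the image */

static unsigned char percpu_template[UNIT_SIZE];       /* image at __per_cpu_load */
static unsigned char percpu_areas[NR_CPUS][UNIT_SIZE]; /* one copy per CPU */
static size_t per_cpu_offset_sketch[NR_CPUS];          /* models __per_cpu_offset[] */

/* Copy the template once per CPU and record where each copy starts. */
static inline void setup_per_cpu_areas_sketch(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        memcpy(percpu_areas[cpu], percpu_template, UNIT_SIZE);
        per_cpu_offset_sketch[cpu] = (size_t)cpu * UNIT_SIZE;
    }
}

/* Address of a variable (given as an offset in the image) on one CPU. */
static inline void *per_cpu_ptr_sketch(size_t var_off, int cpu)
{
    return (unsigned char *)percpu_areas + per_cpu_offset_sketch[cpu] + var_off;
}

static inline int per_cpu_read_int(size_t var_off, int cpu)
{
    int v;

    memcpy(&v, per_cpu_ptr_sketch(var_off, cpu), sizeof(v));
    return v;
}

static inline void per_cpu_write_int(size_t var_off, int cpu, int v)
{
    memcpy(per_cpu_ptr_sketch(var_off, cpu), &v, sizeof(v));
}
```

Because the per-CPU symbols start at offset 0 (as with gdt_page = 0xb000 on the slide), a symbol's value acts as an offset, and adding the per-CPU offset yields the address of that CPU's private copy.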
  • 60. percpu variable access option #2: gs register (MSR: IA32_GS_BASE) APIs (include/linux/percpu-defs.h): * this_cpu_read(pcp) * this_cpu_write(pcp, val) * this_cpu_add(pcp, val) * this_cpu_ptr(ptr) & raw_cpu_ptr(ptr) 1. Use gs register 2. If option #1 is not supported, use this_cpu_off per-cpu variable (read mostly) Original .data..percpu .data..percpu for core 2 .data..percpu for core 3 .data..percpu for core 0 .data..percpu for core 1 Physical Memory memcpy with source address ‘__per_cpu_load’ in setup_per_cpu_areas() CPU #0: IA32_GS_BASE CPU #1: IA32_GS_BASE CPU #2: IA32_GS_BASE CPU #3: IA32_GS_BASE
  • 61. gs register (MSR: IA32_GS_BASE) vs __per_cpu_offset DEFINE_PER_CPU(int, x); int z; z = this_cpu_read(x); Convert to a single instruction: mov %gs:x,%edx Atomic: No need to disable preemption and interrupt this_cpu_inc(x) Convert to a single instruction: inc %gs:x int *y; int cpu; cpu = get_cpu(); y = per_cpu_ptr(&x, cpu); (*y)++; put_cpu(); Non-atomic: Need to disable preemption gs register __per_cpu_offset this_cpu_read() this_cpu_inc() this_cpu_inc() implementation via __per_cpu_offset
  • 63. vmlinux – start_kernel() – Part 2 – trap_init() CPU Entry Area (percpu) • Page Table Isolation (PTI) o Mitigate Meltdown o Isolate user space and kernel space memory o When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full "kernel" copy. ▪ Entry/exit functions and IDT (Interrupt Descriptor Table) are needed for userspace page table Kernel Space User Space User mode & Kernel Mode PTI Kernel Space User Space Kernel mode Kernel Space User Space User mode User Space percpu TSS entry Kernel Space syscall [User mode] User Page Table User Space percpu TSS entry Kernel Space Switch to kernel page table [Kernel Mode] User Page Table User Space percpu TSS entry Kernel Space [Kernel Mode] Kernel Page Table … PTI: Concept PTI: High-level implementation
  • 64. vmlinux – start_kernel() – Part 2 – setup_cpu_entry_area()
  • 65. vmlinux – start_kernel() – Part 2 – trap_init()
  • 66. vmlinux – start_kernel() – Part 2 – mm_init() mm_init • Set up different parts of Linux kernel memory managers
  • 67. vmlinux – start_kernel() – Part 2 - preallocate_vmalloc_pages()
  • 74. vmlinux – start_kernel() – Part 4 CommitLimit: Total amount of memory currently available to be allocated on the system. Committed_AS: The amount of memory requested by processes. Over Commit: Committed_AS > CommitLimit
  • 75. vmlinux – start_kernel() – Part 4 Idle Process (swapper) = init_task (pid = 0)
  • 76. STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save CPU registers for userspace application) task.stack THREAD_SIZE = 16KB kernel stack usage space task.stack + THREAD_SIZE struct inactive_task_frame task.thread_struct.sp struct fork_frame Kernel Stack Context Switch – Kernel Stack
  • 77. Context Switch – Kernel Stack Return frame for iretq pt_regs r15-r12 bx r11-r8 bp ax dx si cx orig_ax ip di cs sp ss flags orig_ax: syscall#, error code for CPU exception or IRQ number of HW interrupt thread_struct tls_array es, ds fsindex, gsindex fsbase, gsbase sp … inactive_task_frame r15-r13 bx (kernel thread function) bp ret_addr = ret_from_fork r12 ( kernel thread argument) Configured by copy_thread() – kernel thread callee-saved registers STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save CPU registers for userspace application) task.stack THREAD_SIZE = 16KB kernel stack usage space task.stack + THREAD_SIZE struct inactive_task_frame task.thread_struct.sp struct fork_frame Kernel Stack
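The kernel stack layout above can be written down as C structures to see where the initial stack pointer of a new task lands. The sketch below assumes the x86_64 field order of kernel 5.11 and no padding at the top of the stack; the `_sketch` names are illustrative copies, not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define THREAD_SIZE (16 * 1024) /* 16KB kernel stack, as on the slide */

/* Register frame saved at the top of the kernel stack. */
struct pt_regs_sketch {
    uint64_t r15, r14, r13, r12, bp, bx;     /* callee-saved registers */
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    uint64_t orig_ax;                        /* syscall#, error code or IRQ number */
    uint64_t ip, cs, flags, sp, ss;          /* return frame for iretq */
};

/* Frame consumed by the context-switch code of an inactive task. */
struct inactive_task_frame_sketch {
    uint64_t r15, r14, r13;
    uint64_t r12;      /* kernel thread argument */
    uint64_t bx;       /* kernel thread function */
    uint64_t bp;
    uint64_t ret_addr; /* ret_from_fork for a new task */
};

struct fork_frame_sketch {
    struct inactive_task_frame_sketch frame;
    struct pt_regs_sketch regs;
};

/* Initial saved stack pointer (task.thread_struct.sp) of a freshly forked task. */
static inline uint64_t initial_thread_sp(uint64_t stack_base)
{
    return stack_base + THREAD_SIZE - sizeof(struct fork_frame_sketch);
}
```

Under these assumptions pt_regs occupies the top 168 bytes of the stack and the inactive_task_frame the 56 bytes below it, so the saved stack pointer starts 224 bytes below task.stack + THREAD_SIZE.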
  • 78. Context Switch – Kernel Thread inactive_task_frame r15-r13 bx (kernel thread function) bp ret_addr = ret_from_fork r12 (kernel thread argument) Configured by copy_thread() – kernel thread callee-saved registers STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save CPU registers for userspace application) task.stack kernel stack usage space Kernel Stack bx (kernel thread function) r13 r14 r15 r12 (kernel thread argument) ret_addr = ret_from_fork bp task.stack + THREAD_SIZE rsp rip
  • 79. STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save CPU registers for userspace application) task.stack kernel stack usage space Kernel Stack bx (kernel thread function) r13 r14 r15 r12 (kernel thread argument) ret_addr = ret_from_fork bp task.stack + THREAD_SIZE rsp rip inactive_task_frame r15-r13 bx (kernel thread function) bp ret_addr = ret_from_fork r12 (kernel thread argument) Configured by copy_thread() – kernel thread callee-saved registers Context Switch – Kernel Thread
  • 80. STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save CPU registers for userspace application) task.stack kernel stack usage space Kernel Stack bx (kernel thread function) r13 r14 r15 r12 (kernel thread argument) ret_addr = ret_from_fork bp task.stack + THREAD_SIZE rsp rip inactive_task_frame r15-r13 bx (kernel thread function) bp ret_addr = ret_from_fork r12 (kernel thread argument) Configured by copy_thread() – kernel thread callee-saved registers Context Switch – Kernel Thread
  • 81. Context Switch – Kernel Thread jump
  • 82. [Prev task] Return to the next instruction of calling switch_to() when the previous task is re-scheduled. 4 task.stack Kernel Stack STACK_END_MAGIC = 0x57AC6E9D struct pt_regs (save/restore CPU registers for userspace tasks) kernel stack usage space bx (kernel thread function) r13 r14 r15 r12 (kernel thread argument) ret_addr = ret_from_fork bp task.stack + THREAD_SIZE rsp 2 3 rsp `return prev_p` 1 Context Switch – Kernel Thread jump 4
  • 83. Context Switch – When to run ‘context switch’? Explicitly call ‘schedule()’ Call ‘cond_resched()’ to yield CPU resource
  • 85. Context Switch – init_task is rescheduled [Prev task] Return to the next instruction of calling switch_to() when the previous task is re-scheduled. 4 Backtrace when init_task (pid = 0) is rescheduled because kernel_init thread (pid = 1) is scheduled out jump 4
  • 86. Kernel Thread Context Switch mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm scheduler init_task (pid = 0) init_mm swapper_pg_dir = init_top_pgt init process (pid = 1) kthreadd (pid = 2)
  • 87. Kernel Thread Context Switch mm_struct mmap (list of VMAs) pgd pgd_t pgd init_task (pid = 0) init_mm swapper_pg_dir = init_top_pgt task_struct mm = NULL active_mm init process (pid = 1) kthreadd (pid = 2) task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm = NULL scheduler pid = 0 pid = 1
  • 88. Kernel Thread Context Switch – Start Here (Aug 2, 2021) mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct mm = NULL active_mm task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm = NULL scheduler init_task (pid = 0) init_mm swapper_pg_dir = init_top_pgt init process (pid = 1) kthreadd (pid = 2)
  • 89. Kernel Thread Context Switch mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm task_struct mm = NULL active_mm = NULL scheduler init_task (pid = 0) init_mm swapper_pg_dir = init_top_pgt init process (pid = 1) kthreadd (pid = 2) pid = 1 pid = 2
  • 90. Kernel Thread Context Switch mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct mm = NULL active_mm = NULL task_struct mm = NULL active_mm task_struct mm = NULL active_mm = NULL scheduler init_task (pid = 0) init_mm swapper_pg_dir = init_top_pgt init process (pid = 1) kthreadd (pid = 2) pid = 1 pid = 2 1. Each kernel thread does not have its own ‘mm’. 2. The active_mm of the next task inherits the one of the previous task (use the same page table).
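The two rules on this slide can be captured in a small model of the mm hand-over during a context switch. It is a simplified sketch: reference counting, lazy-TLB handling and PCID are left out, and the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct mm_sketch {
    uint64_t pgd; /* physical address of the top-level page table */
};

struct task_sketch {
    struct mm_sketch *mm;        /* NULL for a kernel thread */
    struct mm_sketch *active_mm; /* address space currently in use */
};

/*
 * Model of the mm part of a context switch.
 * Returns 1 if CR3 was reloaded, 0 otherwise.
 */
static inline int switch_mm_sketch(struct task_sketch *prev,
                                   struct task_sketch *next, uint64_t *cr3)
{
    int reloaded = 0;

    if (next->mm == NULL) {
        /* Kernel thread: borrow the previous task's address space. */
        next->active_mm = prev->active_mm;
    } else {
        /* User task: load its own page table if it is not already active. */
        if (*cr3 != next->mm->pgd) {
            *cr3 = next->mm->pgd;
            reloaded = 1;
        }
        next->active_mm = next->mm;
    }

    /* A kernel thread gives up the borrowed mm when it is switched out. */
    if (prev->mm == NULL)
        prev->active_mm = NULL;

    return reloaded;
}
```

Switching between kernel threads, or from a user task to a kernel thread, leaves CR3 untouched in this model; only a switch to a user task with a different pgd reloads it.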
  • 91. Context Switch: Kernel Thread <-> User Space Task mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct scheduler init_task (pid = 0) sleep program (pid = 40) task_struct mm = NULL active_mm cpu = 2 mm_struct mmap (list of VMAs) pgd pgd_t pgd mm active_mm cpu = 2 Two breakpoints breakpoint #1 breakpoint #2 gdb breakpoint configuration
  • 92. Context Switch: Kernel Thread <-> User Space Task mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct scheduler init_task (pid = 0) sleep program (pid = 40) task_struct mm = NULL active_mm = NULL cpu = 2 mm_struct mmap (list of VMAs) pgd pgd_t pgd mm active_mm cpu = 2 `sleep` userspace task is selected to run
  • 93. Context Switch: Kernel Thread <-> User Space Task mm_struct mmap (list of VMAs) pgd pgd_t pgd task_struct scheduler init_task (pid = 0) sleep program (pid = 40) task_struct mm = NULL active_mm = NULL cpu = 2 mm_struct mmap (list of VMAs) pgd pgd_t pgd mm active_mm cpu = 2 pid = 0 pid = 40 `sleep` userspace task is selected to run
  • 94. Context Switch: Kernel Thread <-> User Space Task task_struct scheduler sleep program (pid = 40) mm_struct mmap (list of VMAs) pgd pgd_t pgd mm active_mm cpu = 2 `sleep` userspace task is scheduled out
  • 95. Context Switch: Kernel Thread <-> User Space Task task_struct scheduler sleep program (pid = 40) mm_struct mmap (list of VMAs) pgd pgd_t pgd mm active_mm cpu = 2 task_struct ksoftirqd/2 (pid = 20) mm = NULL active_mm cpu = 2 pid = 40 pid = 20 [Kernel thread] Inherits the active_mm of the previous task (no TLB flush is needed because cr3 is not changed). `sleep` userspace task is scheduled out
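The sequence on slides 91 to 95 (pid 0, then `sleep`, then `ksoftirqd/2`) can be replayed with a small model that counts CR3 writes. This is a sketch under stated assumptions: the `Cpu`, `Mm` and `Task` classes and the page-table addresses are hypothetical, and the real kernel path has more cases (PCID handling, for example) than the single comparison used here.

```python
class Mm:
    def __init__(self, pgd):
        self.pgd = pgd  # made-up physical address of the top-level table


class Task:
    def __init__(self, pid, mm=None):
        self.pid = pid
        self.mm = mm          # None for kernel threads
        self.active_mm = mm


class Cpu:
    def __init__(self, boot_mm):
        self.cr3 = boot_mm.pgd
        self.cr3_writes = 0   # each write would imply a TLB flush in this model

    def load_cr3(self, pgd):
        self.cr3 = pgd
        self.cr3_writes += 1

    def context_switch(self, prev, nxt):
        if nxt.mm is None:
            # Kernel thread: reuse prev's address space, CR3 stays untouched.
            nxt.active_mm = prev.active_mm
        else:
            nxt.active_mm = nxt.mm
            if nxt.mm is not prev.active_mm:
                self.load_cr3(nxt.mm.pgd)
        if prev.mm is None:
            prev.active_mm = None


init_mm = Mm(pgd=0x1000)    # hypothetical addresses
sleep_mm = Mm(pgd=0x2000)

idle = Task(0)
idle.active_mm = init_mm
sleep = Task(40, mm=sleep_mm)
ksoftirqd = Task(20)

cpu2 = Cpu(boot_mm=init_mm)
cpu2.context_switch(idle, sleep)       # kernel thread -> user task: CR3 reloaded
cpu2.context_switch(sleep, ksoftirqd)  # user task -> kernel thread: CR3 kept
print(hex(cpu2.cr3), cpu2.cr3_writes)
```

In this model only the switch into `sleep` writes CR3. When `ksoftirqd/2` is scheduled in, it keeps running on the page table of `sleep`, which is the reason the slide gives for skipping the TLB flush.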
  • 96. vmlinux – start_kernel() – Part 4 init process = kernel_init() (pid = 1)
  • 97. [pid = 1 – init process] When are mm & active_mm allocated?
  • 98. [pid = 1 – init process] When are mm & active_mm allocated?
  • 99. [pid = 1 – init process] When are mm & active_mm allocated? clone_pgd_range()
  • 100. [pid = 1 – init process] When are mm & active_mm allocated? [pid = 1] Before running run_init_process() [pid = 1] After finishing run_init_process(): kernel thread -> user process clone_pgd_range(): mm.pgd verification [pid = 1] mm_struct
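What `clone_pgd_range()` achieves for a new `mm.pgd` can be illustrated with a list-based model. This is a sketch, not the kernel implementation. It assumes 4-level paging (512 PGD entries, `PGDIR_SHIFT` of 39) and the non-KASLR direct-map base `0xffff888000000000`. The entry values and the helper `new_process_pgd()` are made up for the example.

```python
PTRS_PER_PGD = 512
PGDIR_SHIFT = 39
PAGE_OFFSET = 0xFFFF_8880_0000_0000  # assumed: nokaslr, 4-level paging


def pgd_index(addr):
    """Index of the PGD entry that covers a virtual address."""
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)


# First PGD slot that belongs to the kernel, and how many slots follow it.
KERNEL_PGD_BOUNDARY = pgd_index(PAGE_OFFSET)
KERNEL_PGD_PTRS = PTRS_PER_PGD - KERNEL_PGD_BOUNDARY


def clone_pgd_range(dst, dst_off, src, src_off, count):
    """Copy `count` PGD entries from src to dst (toy version)."""
    dst[dst_off:dst_off + count] = src[src_off:src_off + count]


def new_process_pgd(swapper_pg_dir):
    """Hypothetical helper: fresh PGD whose kernel half mirrors swapper_pg_dir."""
    pgd = [0] * PTRS_PER_PGD
    clone_pgd_range(pgd, KERNEL_PGD_BOUNDARY,
                    swapper_pg_dir, KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS)
    return pgd


# Made-up kernel page table: one low entry, one direct-map entry,
# and the last entry (kernel text mapping).
swapper_pg_dir = [0] * PTRS_PER_PGD
swapper_pg_dir[0] = 0xAAA067
swapper_pg_dir[KERNEL_PGD_BOUNDARY] = 0xBBB067
swapper_pg_dir[511] = 0xCCC067

init_pgd = new_process_pgd(swapper_pg_dir)
print(KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS)
```

Only the kernel half is copied, so the user half of the new table starts empty while every process sees the same kernel mappings. Comparing the upper entries of `mm.pgd` with `init_top_pgt`, as the slide does in gdb, checks exactly this property.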
  • 101. smp_init() - boot secondary CPUs
  • 102. smp_init() - boot secondary CPUs
  • 103. smp_init() - boot secondary CPUs cpuhp/cpu_id kernel thread • Executes callbacks (teardown, startup and so on) when the CPU hotplug state changes.
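The callback idea behind the cpuhp thread can be sketched as a tiny state machine: startup callbacks run in ascending state order when a CPU comes up, and teardown callbacks run in descending order when it goes down. Everything here is an assumption for illustration, including the `HotplugStateMachine` class, its method names and the two state names. The kernel's own state list and registration API are far richer.

```python
class HotplugStateMachine:
    """Toy model of ordered per-CPU startup/teardown callbacks."""

    def __init__(self):
        self._steps = {}      # state number -> (name, startup, teardown)
        self._cpu_state = {}  # cpu id -> highest state reached (-1 = offline)

    def setup_state(self, state, name, startup=None, teardown=None):
        self._steps[state] = (name, startup, teardown)

    def cpu_up(self, cpu):
        reached = self._cpu_state.get(cpu, -1)
        for state in sorted(self._steps):
            if state > reached:
                _, startup, _ = self._steps[state]
                if startup:
                    startup(cpu)
                reached = state
        self._cpu_state[cpu] = reached

    def cpu_down(self, cpu):
        reached = self._cpu_state.get(cpu, -1)
        for state in sorted(self._steps, reverse=True):
            if state <= reached:
                _, _, teardown = self._steps[state]
                if teardown:
                    teardown(cpu)
        self._cpu_state[cpu] = -1


log = []


def make_cb(kind, name):
    # Build a callback that records what ran and on which CPU.
    return lambda cpu: log.append((kind, name, cpu))


hp = HotplugStateMachine()
# Hypothetical states, registered out of order on purpose.
hp.setup_state(20, "workqueue", make_cb("up", "workqueue"), make_cb("down", "workqueue"))
hp.setup_state(10, "timers", make_cb("up", "timers"), make_cb("down", "timers"))

hp.cpu_up(1)
hp.cpu_down(1)
print(log)
```

Ordering the callbacks by state number lets each subsystem assume that everything registered below it is already up during startup, and still up during its own teardown.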
  • 104. smp_init() - boot secondary CPUs
  • 105. smp_init() - boot secondary CPUs – Boot Flow startup_32: setup cr3 @trampoline_pgd secondary_startup_64: setup cr3 @init_top_pgt [Secondary CPUs] CR3 Register Configuration
  • 106. startup_32() - boot secondary CPUs – Page Table Configuration startup_32: setup cr3 @trampoline_pgd secondary_startup_64: setup cr3 @init_top_pgt [Secondary CPUs] CR3 Register Configuration
  • 107. startup_32() - boot secondary CPUs – Page Table Configuration startup_32: setup cr3 @trampoline_pgd secondary_startup_64: setup cr3 @init_top_pgt [Secondary CPUs] CR3 Register Configuration
  • 108. secondary_startup_64() - boot secondary CPUs – Page Table startup_32: setup cr3 @trampoline_pgd secondary_startup_64: setup cr3 @init_top_pgt [Secondary CPUs] CR3 Register Configuration
  • 109. Secondary CPUs – When to configure active_mm for idle_threads?
  • 110. pstree after finishing start_kernel()
  • 111. Reference • The Linux/x86 Boot Protocol, Documentation/x86/boot.rst • Intel® 64 and IA-32 Architectures Software Developer’s Manual • https://wdv4758h.github.io/notes/blog/linux-kernel-boot.html • Linux insides, https://0xax.gitbooks.io/linux-insides/content/ • Debugging kernel and modules via gdb, https://www.kernel.org/doc/Documentation/dev-tools/gdb-kernel-debugging.rst