2. Agenda
• Initialization Phase of the Linux Kernel
• Turning on the paging feature
• Calling *init functions
• And miscellaneous things related to initialization
4. vmlinux
• Main kernel binary
• Runs with the final CPU state
• Protected Mode in x86_32 (i386)
• Long Mode in x86_64
• And so on…
• Runs in the virtual memory space
• Above PAGE_OFFSET (default: 0xc0000000) (32-bit)
• Above __START_KERNEL_map (default: 0xff…f80000000) (64-bit)
• i.e., all the absolute addresses in the binary are virtual ones
• Entry points
Architecture  Name        Location                          Name (secondary)
x86_32        startup_32  arch/x86/kernel/head_32.S         startup_32_smp
x86_64        startup_64  arch/x86/kernel/head_64.S         secondary_startup_64
ARM           stext       arch/arm/kernel/head[_nommu].S    secondary_startup
ARM64         stext       arch/arm64/kernel/head.S          secondary_holding_pen / secondary_entry
PPC           _stext      arch/powerpc/kernel/head_32.S*    (__secondary_start)
6. Why different mapping in 64-bit?
• The kernel code, data, and BSS reside in the last 2GB of the address space
=> Addressable by 32-bit!
• -mcmodel option in GCC
• Specifies the assumptions about the size of the code/data sections
-mcmodel option (x86)

model    text                data
small    within 2GB          within 2GB
kernel   within -2GB         within -2GB
medium   within 2GB          can be > 2GB
large    anywhere in 64-bit  anywhere in 64-bit
7. Column: -mcmodel in gcc

Accessing a global variable:

int g_data = 4;
int main(void)
{
    g_data += 7;
    ...
}

small / kernel / medium (RIP-relative addressing*):

8b 05 c6 0b 20 00       mov    0x200bc6(%rip),%eax   # 601040 <g_data>
...
bf 01 00 00 00          mov    $0x1,%edi
8d 50 07                lea    0x7(%rax),%edx

large (64-bit absolute addressing):

48 b8 40 10 60 00 00    movabs $0x601040,%rax
00 00 00
bf 01 00 00 00          mov    $0x1,%edi
8b 30                   mov    (%rax),%esi
...
8d 56 07                lea    0x7(%rsi),%edx

Data larger than 2GB:

#define SZ (1 << 30)
int buf[SZ] = {1};
int main(void)
{
    buf[0] += 3;
}

small / kernel (the link fails):

$ gcc -O3 -o ba -mcmodel=small bigarray.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o: In function `deregister_tm_clones':
crtstuff.c:(.text+0x1): relocation truncated to fit: R_X86_64_32 against symbol `__TMC_END__' defined in .data section in ba

medium / large (absolute addressing works):

48 b8 60 10 a0 00 00    movabs $0xa01060,%rax
00 00 00
8b 08                   mov    (%rax),%ecx
8d 51 03                lea    0x3(%rcx),%edx

*The offset of RIP-relative addressing is 32-bit.
8. Column: -mcmodel in gcc (2)
• Code?

void nop(void)
{
    asm volatile(".fill (2 << 30), 1, 0x90");
}

small / medium:

$ gcc -O3 -o ba -mcmodel=small supernop.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

kernel / large:

$ gcc -O3 -o ba -mcmodel=large supernop.c
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

(The link fails for every model: the relocation is in crt1.o, which itself is not compiled with -mcmodel=large.)
9. Initialization Overview

Booting Code: arch/*/boot/
  (Preparing CPU states, gathering HW information, decompressing vmlinux, etc.)
    ↓ (everything from here on is vmlinux)
Low-level Initialization: arch/*/kernel/head*.S, head*.c
  (Switching to the virtual memory world, getting prepared for C programs)
    ↓
Initialization: init/main.c (start_kernel)
  (Initializing all the kernel features, including architecture-dependent parts;
   calls into arch/*/kernel, arch/*/mm, ...)
    ↓
init/main.c (rest_init)
  Creating the "init" process, and letting it do the rest of the initialization
  (setting up multiprocessing, scheduling); rest_init then splits into two paths:

  Boot CPU: kernel/sched/idle.c (cpu_idle_loop)
    "Swapper" (PID=0) now sleeps

  New thread: init/main.c (kernel_init)
    Performing the final initialization and "exec"ing the "init" user program
      ↓
    "init" (PID=1)
11. Enabling paging
• The early part is executed with paging off
• i.e., in the physical address space
• vmlinux is assumed to be executed with paging on
• The addresses in the binary are not physical addresses
• The first big job in vmlinux is enabling paging
• Creating a (transitional) page table
• Setting the CPU to use the page table, and enabling paging
• Jumping to the entry point in C (compiled for the virtual address space)
12. Identity Map
• At first, the goal page table cannot be used
• Since changing the PC and enabling paging are (at least, in x86) separate instructions
(Diagram: the PC holds a physical address; the instant paging is enabled, that address is treated as virtual and is no longer mapped: page fault!)
13. Identity Map
• Therefore, an identity map is created in addition to the (goal) map
(Diagram: (1) create an initial page table containing both the identity map and the goal map; (2) enable paging, and jump to a virtual address; (3) zap the low (identity) mapping)
14. Addresses in the transitional phase
• x86_64
• The decompression routine enables paging and creates an identity page table (only for the first 4GB)
• Paging is required for CPUs to switch to 64-bit mode
• Located in 6 pages (pgtable) in the decompression routine
• Symbols in vmlinux are accessed with RIP-relative addressing
• No trick is necessary for using the symbols
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
...
leaq early_level4_pgt(%rip), %rbx
...
movq $(early_level4_pgt - __START_KERNEL_map), %rax
addq phys_base(%rip), %rax
movq %rax, %cr3
movq $1f, %rax
jmp *%rax
1:
(arch/x86/kernel/head_64.S)
15. Addresses in the transitional phase
• i386
• Symbols in vmlinux are accessed with absolute
addresses
• Before paging is enabled, PAGE_OFFSET is always subtracted
from addresses
movl $pa(__bss_start),%edi
movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
...
movl $pa(initial_page_table), %eax
movl %eax,%cr3 /* set the page table pointer.. */
movl $CR0_STATE,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
...
lgdt early_gdt_descr
lidt idt_descr
#define pa(X) ((X) - __PAGE_OFFSET)
(arch/x86/kernel/head_32.S)
17. Initialization (start_kernel)
• A lot of *_init functions!
• Furthermore, some init functions call other init functions
• At least 80 functions are called from this function
• These slides will pick up some topics from the initialization functions
19. Special directives
• What are these?
• “I’m curious!”.
asmlinkage __visible void __init start_kernel(void) {
…
}
20. asmlinkage
• asmlinkage
• Ensures the symbol is not mangled
• (in x86_32) Ensures all the parameters are passed on the stack
#ifdef CONFIG_X86_32
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
arch/x86/include/asm/linkage.h
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif
#ifndef asmlinkage
#define asmlinkage CPP_ASMLINKAGE
#endif
include/linux/linkage.h
21. __visible
• (Effective in gcc >=4.6)
#if GCC_VERSION >= 40600
/*
* Tell the optimizer that something else uses this function or
 * variable.
*/
#define __visible __attribute__((externally_visible))
#endif
include/linux/compiler-gcc4.h
commit 9a858dc7cebce01a7bb616bebb85087fa2b40871
author Andi Kleen <ak@linux.intel.com> Mon Sep 17 21:09:15 2012
committer Linus Torvalds <torvalds@linux-foundation.org> Mon Sep 17 22:00:38 2012
compiler.h: add __visible
gcc 4.6+ has support for a externally_visible attribute that prevents the
optimizer from optimizing unused symbols away. Add a __visible macro to
use it with that compiler version or later.
This is used (at least) by the "Link Time Optimization" patchset.
22. __init (1)
• To mark code (text) and data as necessary only during initialization
#define __init __section(.init.text) __cold notrace
#define __initdata __section(.init.data)
#define __initconst __constsection(.init.rodata)
#define __exitdata __section(.exit.data)
#define __exit_call __used __section(.exitcall.exit)
(include/linux/init.h)
#ifndef __cold
#define __cold __attribute__((__cold__))
#endif
(include/linux/compiler-gcc4.h)
#ifndef __section
# define __section(S) __attribute__ ((__section__(#S)))
#endif
...
#define notrace __attribute__((no_instrument_function))
(include/linux/compiler.h)
24. __init (3)
• And, they are discarded (freed) after initialization
• free_initmem is called from kernel_init
void free_initmem(void)
{
free_init_pages("unused kernel",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
}
arch/x86/mm/init.c
void free_initmem(void)
{
...
poison_init_mem(__init_begin, __init_end - __init_begin);
if (!machine_is_integrator() && !machine_is_cintegrator())
free_initmem_default(-1);
}
arch/arm/mm/init.c
25. head32.c, head64.c
• Before calling start_kernel, i386_start_kernel or
x86_64_start_kernel is called in x86
• Located in arch/x86/kernel/head{32,64}.c
• No underscore between head and 32!
• x86 (32-bit)
• Reserve BIOS memory (in conventional memory)
• x86 (64-bit)
• Erase the identity map
• Clear BSS, copy boot information from the low memory
• And reserve BIOS memory
26. Reserve? But how?
• This is a very early time; no complicated memory management is working yet
• memblock (logical memory blocks) is working!
• memblock simply manages memory blocks
• In some architectures, the information is handed over to another mechanism, and memblock is discarded after initialization
#define BIOS_LOWMEM_KILOBYTES 0x413
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
...
memblock_reserve(lowmem, 0x100000 - lowmem);
arch/x86/kernel/head.c
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
#define __init_memblock __meminit
#define __initdata_memblock __meminitdata
#else
...
#endif
include/linux/memblock.h
(CONFIG_ARCH_DISCARD_MEMBLOCK is set in S+Core, IA64, S390, SH, MIPS and x86. Without memory hotplug, __meminit is __init.)
27. memblock
• Data structure (include/linux/memblock.h)
• Initially the arrays are allocated statically

memblock (global variable, struct memblock)
├─ memory   (memblock_type) → array of memblock_region
└─ reserved (memblock_type) → array of memblock_region
     each memblock_region: base, size, flags[, nid]

static struct memblock_region
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

*INIT_MEMBLOCK_REGIONS = 128
28. Reserving in memblock
• Reserving adds the region to the region array of the "reserved" type
• The function for adding an available region is memblock_add
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;
...
return memblock_add_region(_rgn, base, size, nid, flags);
}
int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}
29. When is the available memory added?
• x86
• memblock_x86_fill
• Called by setup_arch (8/80)
• ARM
• arm_memblock_init
• Also called by setup_arch (8/80)
void __init memblock_x86_fill(void)
{
...
memblock_allow_resize();
for (i = 0; i < e820.nr_map; i++) {
... memblock_add(ei->addr, ei->size);
}
memblock_trim_memory(PAGE_SIZE);
...
}
BTW, what’s this?
30. Resizing, or reallocation
• memblock uses slab for resizing, if available
• The number of e820 entries may be more than 128
• However, slab only becomes available at kmem_cache_init, called by mm_init (25/80), so it is not available at this time
• In that case, memblock tries to allocate by itself, by finding an area that is in "memory" && !"reserved"
static int __init_memblock memblock_double_array(struct memblock_type *type,
phys_addr_t new_area_start,
phys_addr_t new_area_size)
{
…
addr = memblock_find_in_range(new_area_start + new_area_size,
memblock.current_limit,
new_alloc_size, PAGE_SIZE);
31. memblock: Debug options
• “memblock=debug”
static int __init early_memblock(char *p)
{
if (p && strstr(p, "debug"))
memblock_debug = 1;
return 0;
}
early_param("memblock", early_memblock);
static int __init_memblock memblock_reserve_region(...)
{
    ...
    memblock_dbg("memblock_reserve: [%#016llx-%#016llx] "
                 "flags %#02lx %pF\n",
                 (unsigned long long)base,
                 (unsigned long long)base + size - 1,
                 flags, (void *)_RET_IP_);
33. start_kernel
• What’s the first initialization function called?
smp_setup_processor_id() ((at least 2.6.18) ~ 3.2)
lockdep_init() (3.3 ~)
commit 73839c5b2eacc15cb0aa79c69b285fc659fa8851
Author: Ming Lei <tom.leiming@gmail.com>
Date: Thu Nov 17 13:34:31 2011 +0800
init/main.c: Execute lockdep_init() as early as possible
This patch fixes a lockdep warning on ARM platforms:
[ 0.000000] WARNING: lockdep init error! Arch code didn't call lockdep_init() early enough?
[ 0.000000] Call stack leading to lockdep invocation was:
[ 0.000000] [<c00164bc>] save_stack_trace_tsk+0x0/0x90
[ 0.000000] [<ffffffff>] 0xffffffff
The warning is caused by printk inside smp_setup_processor_id().
34. init (1/80) : lockdep_init
• Initializes lockdep (lock validator)
• “Runtime locking correctness validator”
• Detects
• Lock inversion
• Circular lock dependencies
• When enabled, lockdep is invoked whenever any spinlock or mutex is acquired
• Thus, the initialization of lockdep must come first
• The initialization is simple (just initializing the list_heads of the hash tables)
void lockdep_init(void)
{...
for (i = 0; i < CLASSHASH_SIZE; i++)
INIT_LIST_HEAD(classhash_table + i);
for (i = 0; i < CHAINHASH_SIZE; i++)
INIT_LIST_HEAD(chainhash_table + i);
...}
kernel/locking/lockdep.c
Config: CONFIG_LOCKDEP
selected by PROVE_LOCKING
or DEBUG_LOCK_ALLOC
or LOCK_STAT
35. init (2/80) : smp_setup_processor_id
• Only effective in some architectures
• ARM, s390, SPARC
u32 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = MPIDR_INVALID };
void __init smp_setup_processor_id(void)
{
int i;
u32 mpidr = is_smp() ? read_cpuid_mpidr() & MPIDR_HWID_BITMASK : 0;
u32 cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
cpu_logical_map(0) = cpu;
for (i = 1; i < nr_cpu_ids; ++i)
cpu_logical_map(i) = i == cpu ? 0 : i;
set_my_cpu_offset(0);
pr_info("Booting Linux on physical CPU 0x%x\n", mpidr);
}
arch/arm/kernel/setup.c
(Annotations: mpidr is the hardware CPU (core) ID. The loop exchanges the logical ID for the boot CPU with the logical ID for CPU 0, so logical CPU 0 is always the boot CPU.)
36. init (3/80) : debug_objects_early_init
• Initializes debugobjects
• Lifetime debugging facility for objects
• Seems to be used by timer, hrtimer, workqueue, percpu_counter and RCU
• Again, this function just initializes locks and list heads
Config:
CONFIG_DEBUG_OBJECTS
void __init debug_objects_early_init(void)
{
int i;
for (i = 0; i < ODEBUG_HASH_SIZE; i++)
raw_spin_lock_init(&obj_hash[i].lock);
for (i = 0; i < ODEBUG_POOL_SIZE; i++)
hlist_add_head(&obj_static_pool[i].node, &obj_pool);
}
lib/debugobjects.c
37. init (4/80): boot_init_stack_canary
• Sets up the stack protector
• include/asm/stackprotector.h
• Decides the canary value based on a random value and the TSC
static __always_inline void boot_init_stack_canary(void)
{
u64 canary;
u64 tsc;
#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
current->stack_canary = canary;
#ifdef CONFIG_X86_64
this_cpu_write(irq_stack_union.stack_canary, canary);
#else
this_cpu_write(stack_canary.canary, canary);
#endif
}
38. init (5/80): cgroup_init_early
• Initializes cgroups
• For subsystems that have early_init set, initializes the subsystem
• cpu, cpuacct, cpuset
• The rest of the subsystems are initialized in cgroup_init (71/80)
• Initializes the structures and the names for the subsystems
39. init (6/80): boot_cpu_init
• Initializes various cpumasks for the boot CPU
• online : available to the scheduler
• active : available to migration
• present : the CPU is populated
• possible : the CPU is populatable
• set_cpu_online also adds the cpu to active
• set_cpu_present does not add the cpu to possible
static void __init boot_cpu_init(void)
{
int cpu = smp_processor_id();
/* Mark the boot cpu "present", "online" etc for SMP and UP
case */
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
}
init/main.c
(Without HOTPLUG_CPU, online and active are the same mask, and present and possible are the same mask)
40. cpumask
• A bit map

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
include/linux/cpumask.h

#define DECLARE_BITMAP(name,bits) \
	unsigned long name[BITS_TO_LONGS(bits)]
include/linux/types.h

#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
include/linux/bitops.h

(Diagram: "bits" is an array of longs (4 or 8 bytes each) holding NR_CPUS bits)
41. Set bit! (x86)
• The register bit-offset operand for bts is
• -2^31 ~ 2^31-1, or -2^63 ~ 2^63-1
#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
...
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
: CONST_MASK_ADDR(nr, addr)
: "iq" ((u8)CONST_MASK(nr))
: "memory");
} else {
asm volatile(LOCK_PREFIX "bts %1,%0"
: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
}
}
arch/x86/include/asm/bitops.h
43. smp_processor_id
• Returns the core ID (the kernel's logical CPU number)
• In ARM (and in the old days in x86)
• Located in "current" (struct thread_info)
• Which sits at the bottom of the current stack
• In x86
• Located in the per-cpu area
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
arch/x86/include/asm/smp.h
#define raw_smp_processor_id() (current_thread_info()->cpu)
arch/arm/include/asm/smp.h
static inline struct thread_info *current_thread_info(void)
{
register unsigned long sp asm ("sp");
return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
arch/arm/include/asm/thread_info.h
44. Next
• Topics and the rest of the initialization
• Setup parameters (early_param() etc.)
• Initcalls
• Multiprocessor support
• Per-cpu data
• SMP boot (secondary boot)
• SMP alternatives
• And other alternatives
• And others?
• Modules?