Linux Kernel Booting Process (2) - For NLKB


Describes the bootstrapping part of Linux and the related architectural mechanisms and technologies.
This is part two of the slides, and the succeeding slides may contain errata for this part.



  1. 1. Booting Process (2) Taku Shimosawa — For the new book on the Linux kernel 1
  2. 2. Materials • http://www.slideshare.net/shimosawa/ 2
  3. 3. Agenda • Virtual Memory • From the architectural view • Unfortunately, this presentation again does not enter the main part of the kernel! • Appendices • Source code-level overview of the bootstrapping process • Linker Scripts • Inline Assemblers • Spaces, tabs, blank lines, and comments are (implicitly) omitted from the quoted source code. • The omitted effective lines are denoted by … or […] 3
  4. 4. Scope of the last presentation : x86 • Real Mode (16-bit) • Boot sector, setup_header, and 16-bit entry point • C-Language main function • Retrieving memory information • Transition to the protected mode • Protected Mode (32-bit) • 32-bit(/64-bit) entry point, preparing for decompression, calling decompression code • (EFI-Stub) efi_main (entry point from UEFI) • EFI call functions • Protected Mode/Long Mode • The beginning of the main kernel 4  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S * The …_32.S files are used in the 32-bit kernel, and the …_64.S files in the 64-bit kernel.
  5. 5. Scope of the last presentation : ARM • Compressed • Entry point • Decompressing function • Actual decompressing algorithm is in lib/decompress_*.c • Building a FDT from ATAGS for compatibility (CONFIG_ARM_ATAG_DTB_C OMPAT) • Decompressed • The beginning of the main kernel 5  arch  arm  boot  compressed  head.S  decompress.c  atags_to_fdt.c  kernel  head.S
  6. 6. Follow-ups for the last presentation • x86 assembly language • What about instructions with ≥3 operands? • (e.g.) imul • Multiplies EBX by 19 (0x13) and stores the result in EAX • Therefore, 6 AT&T Intel Operand Order Source, Destination Destination, Source AT&T Intel Example imul $0x13, %ebx, %eax IMUL EAX, EBX, 13h AT&T Intel Operand Order [[Op4,] Op3,] Op2, Op1 Destination, Source Op1, Op2 [, Op3 [, Op4]]
  7. 7. Follow-ups for the last presentation • Multiple Relocations? • The conclusion is “at most once” (in x86 arch) • ELF relocation may follow the decompression, so the kernel may be relocated twice in this sense. • See the relocation part in this presentation. 7
  8. 8. x86 Architecture : Segmentation • 6 Segment Registers (16-bit registers) • Code Segment Register: CS • Data Segment Register: DS, ES, FS, GS • Stack Segment Register: SS • Real mode : 20-bit address space • Linear address = Physical address • The size of each segment is 64K (16-bit) • The segment register provides the upper 16 bits of the segment’s 20-bit base address (base = segment × 16) • Protected mode : 32-bit/36-bit physical address space • Logical –(Segmentation)-> Linear –(Paging)-> Physical • The base and limit are stored in the descriptor table • The segment registers point to entries in the table • Long mode : 48-bit linear address space • For CS, DS, ES, and SS, the base is always 0, and the limit is ignored. • For FS and GS, the base can be set by the descriptor or through MSRs (for > 32-bit addresses) 8
  9. 9. So what? (p.32) 9 vmlinux boot/compressed/vmlinux.bin (1a) Strip symbols vmlinux.bin.xz (2a) Concatenate and compress (gzip, bzip2, lzma, lzo, lz4) piggy.o (3) mkpiggy (piggy-back) Make an object that contains the compressed image piggy.o*.o boot/compressed/vmlinux (4) Link with the other objects in boot/compressed (Decompressing codes) (5) Transform it into a simple binary boot/vmlinux.bin boot/vmlinux.binboot/setup.bin (6) Concatenate with real-mode setup code, headers, and CRC32 CRC boot/bzImage (1b) Make relocation information (2b) Append the original size info (except for gzip) vmlinux.bin.xz vmlinux.relocs Size Errata?
  10. 10. 4. Virtual Memory Segmentation and Paging 10
  11. 11. Virtual Memory • The address visible to a task is “virtualized,” i.e. translated by hardware to a certain physical address when it is actually accessed. • The hardware mechanism that translates the address is called the MMU (memory management unit). • Aim / Benefit • Using a larger memory area than the machine is actually equipped with. • Memory swapping, sparse memory areas • Isolating tasks’ memory areas so that different applications cannot touch (read or write) each other’s memory • Not only between user tasks but also between the kernel and tasks • Abstracting the memory resources • Providing a contiguous memory area even if no physically contiguous memory area is available. • User programs can run at certain addresses regardless of the physical addresses where they are actually placed. 11
  12. 12. Two ways to virtual memory • Paging • Dividing the memory area into chunks (“pages”) of a certain small size, and defining a map from each chunk to its physical location • A different task may have a different map of the memory • Some overhead (in both speed and memory) to translate addresses and hold the map • Segmentation • The address is considered to be an offset inside a certain segment of memory • Less overhead (just adding an offset), but page-granularity swapping is impossible 12
  13. 13. Illustrated 13 1 Segment 1 2 3 5 4 3 1 2 4 VA PA 1 4 2 1 3 3 5 2 2 ~ 4 Seg Star t End 1 2 4 1 Virtual Memory Physical Memory Page Table Segment Desc. Paging Segmentation
  14. 14. Architecture and VM Capability • x86 • Capable of paging • 16-bit and 32-bit modes have segmentation • 64-bit mode has a very limited segmentation feature • Because almost no one was using the segmentation feature effectively! • (See “flat model” described in a later slide) • ARM • Some CPU series have an MMU and are capable of paging • “A” series • Some CPU series only have an MPU (memory protection unit) • “R” series • No MMU • “M” series (MPU is optional) 14
  15. 15. Focusing on paging… • How does it work? 15 Memory instruction with a virtual address CPU (MMU) looks for the virtual address in the TLB (Translation Lookaside Buffer) Does it exist? Use the physical address in the TLB entry TLB Miss! Call the handler, and ask it to fill in a TLB entry corresponding to the virtual address Traverse the page table to find the physical address for the virtual address Present? Use the physical address (May) remember it in the TLB Page fault! Call the handler. Kernel’s Role Hardware TLB / Software TLB Yes No Yes No
  16. 16. How far should hardware go? • TLB (Translation Lookaside Buffer) • Cache of “virtual-to-physical” mappings. • Limited number of entries. • Hardware-controlled TLB • When a TLB miss occurs, the CPU traverses the page tables • The format of the page table is defined by the architecture. • x86 and ARM • Software-controlled TLB • When a TLB miss occurs, the software (typically, the OS kernel) traverses the page tables, and tells the CPU the result (the translated physical address) by filling in an entry in the TLB. • Any type of page table may be used (a hash-based PT, for example) • But Linux uses almost the same format for this type of architecture • PowerPC 16
  17. 17. Multilevel Page Table (tree-like) • Typical structure of a page table • The first-level page table consists of entries that point to the next-level page tables. The index is taken from the most significant bits of the virtual address. • Of course, the next page table’s address is physical. • The entries in the leaf page table denote the physical addresses. 17 Next level page table Third level page table Phys address Phys address … First-level page table Second-level page table Third-level page table
  18. 18. x86-64 example 18 Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 (2) PML4 Table 0 511 Page Directory Pointer Table 0 511 16 0 256 Page Directory Table 511 0x1234567000 0 49 Page Table 511 0x12345670a5 CR3 64 bits
  19. 19. x86-64 • Currently, only 48 bits of a linear address are effective. • A 64-bit address is the sign-extension of the 48-bit address. • Supports up to 52 bits for physical addresses • %cr3 register : the physical address of the current PML4 table • mov ~~, %cr3 switches the page table (flushing the TLB) • Four levels • One entry in the PML4 table corresponds to 512 GB of virtual memory, an entry in the PDP table to 1 GB, and so on. • Each entry is 8 bytes • Each table has 512 entries • Thus, each table is 4 KB = 1 page. 19
  20. 20. Large Pages • One page occupies one entry in the TLB • If one process uses 1 GB of memory, it uses 256K pages. • i.e. If the TLB does not have 256K entries (and usually it doesn’t), TLB misses are inevitable • x86_64 supports three page sizes • 4 KB (normal) • 2 MB • 1 GB (!) • The disadvantage is that a larger page requires contiguous physical memory of the same size as the page. 20 An entry in a higher-level page table directly contains a physical address.
  21. 21. x86-64 example (2MB page) 21 Resolving 0x00000004200310a5 = 00000000 00000000 00000000 00000100 00100000 00000011 00010000 10100101 PML4 Table 0 511 Page Directory Pointer Table 0 511 16 0x1234400000 0 256 Page Directory Table 511 0x12344310a5 CR3 64 bits
  22. 22. Linux kernel usage • Large Page • The kernel mapping • The kernel creates straight-mapping of physical memory in the kernel virtual address area • This area is created in booting, and never changes after that • 1GB, 2MB pages are used • Hugetlbfs • Explicit use from user applications • Transparent Huge Pages • Implicit (transparent) use of large pages for user applications 22
  23. 23. ARM • ARM • Two memory architectures • VMSA (Virtual Memory System Architecture) : MMU • PMSA (Protected Memory System Architecture) : MPU • VMSA • Two page table formats • Short-descriptor table • Up to two-level lookup • 32-bit PA (*With supersections, 40-bit can be output) • Long-descriptor table • Up to three-level lookup • 40-bit PA • Fixed sizes of page tables 23
  24. 24. Names in Linux • Linux uses several arch-independent type names for page table entries • pgd_t, pud_t, pmd_t, pte_t • Each type is one for an entry in a table of the corresponding level 24 Architecture (& Config) Lv pgd_t pud_t pmd_t pte_t x86_64 4 PML4E PDPTE PDE PTE i386 (PAE) 3 PDPTE - PDE PTE i386 2 PDE - - PTE ARM (LPAE) 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc. ARM 2 1st-lv. Desc. - - 2nd-lv. Desc. ARM64 (64KB page) 2 1st-lv. Desc. - - 2nd-lv. Desc. ARM64 3 1st-lv. Desc. - 2nd-lv. Desc. 3rd-lv. Desc. (*)AArch64 supports four-level page tables, thus 48-bit VA.
  25. 25. Notes • PAE (i386) • Physical Address Extension • For those who want to enjoy >4GB of memory in 32-bit mode. • The virtual address remains 32-bit, but can map to any physical address (< 64-bit) • The size of each entry is extended to 64 bits • CONFIG_X86_PAE • LPAE • Large Physical Address Extension • Almost the same as PAE in i386 • “The current implementation limits the output address range to 40 bits” • Each entry is extended to 64 bits (long-descriptor translation table format) • CONFIG_ARM_LPAE 25
  26. 26. ARM example (Short-descriptor) 26 Resolving 0x200310a5 = 00100000 00000011 00010000 10100101 (2) 1st Level Table 0 4095 0x12345000 0 255 49 0x123450a5 TTBR0 2nd Level Table 32 bits 512
  27. 27. Quick Chart 27 1st Level 2nd Level 3rd Level 4th Level Intel 64-bit [47:39] [38:30] [29:21] [20:12] 4 KB (64 bit x 512) 512 GB/Entry 1 GB / Entry 2 MB / Entry 4 KB / Entry PAE [31:30] [29:21] [20:12] 256 B (64 bit x 4) 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry 32-bit [31:22] [21:12] 4 KB (32 bit x 1024) 4 MB / Entry 4 KB / Entry ARM LPAE [31:30] [29:21] [20:12] 256 B (64 bit x 4) 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry 32-bit [31:20] [19:12] 16 KB (32 bit x 4096) 1 KB (32 bit x 256) 1 MB / Entry 4 KB / Entry ARM 64 4KB granule [38:30] [29:21] [20:12] 4 KB (64 bit x 512) 1 GB / Entry 2 MB / Entry 4 KB / Entry VA Range used as index Table size (entry size x n) Size represented by each entry
  28. 28. Page size supported (by HW) • x86_64 • 1 GB, 2 MB, 4 KB • i386 (PAE) • 2 MB, 4 KB • i386 • 4 MB, 4 KB • ARM • 16 MB(*), 1 MB, 64 KB, 4 KB • ARM (LPAE) • 1 GB, 2 MB, 4 KB • ARM64 • 1 GB, 2 MB, 4 KB (for 4KB translation granule) • 32 MB, 16 KB (for 16KB translation granule) • 512 MB, 64 KB (for 64KB translation granule) 28 (*) Depends on implementation
  29. 29. Page Attributes • Pages can have attributes • Used for memory protection • Used for demand paging • Used for COW (copy-on-write) • Attributes • Read / Write • User / Privileged • But where are they stored? • In the page table entry corresponding to a page • However, a page table entry is basically a physical pointer, i.e. a 32-bit entry would be fully occupied by a 32-bit physical pointer… 29
  30. 30. Page Attributes • The lower bits in page table entries • The start address of a page/page table is aligned! • The lower bits are always zero. 30 Ignored Physical Address [31:12] 3252 XD 63 Physical Address [51:32] G Igno red PAT D PCD PWT US RW PA 31 9 0 Physical Address [31:12] C B 1 XN APTEX AP2 S nG x86_64 ARM (short descriptor)
  31. 31. Page Attributes Comparison 31 x86_64 ARM (short) Enabled? Present (P) Desc type (Bits 1 & 0) RO or RW? Read/Write (RW) AP [2:1] or AP [2:0] Privileged only or any? User/Supervisor (US) Write-through? PWT TEX[2:0], B, C Cachable? PCD Accessed? Accessed (A) AP[0] (*configurable) Dirty? Dirty (D) N/A Memory Type PAT TEX[0], B, C (*configurable) Global Global (G) Not Global (nG) Executable? Execute-Disable (XD) Execute-Never (XN) Sharable? (PAT) Sharable (S)
  32. 32. PowerPC Example [PowerPC 440] • TLB is filled by software • Search (tlbsx instruction), R/W (tlbre, tlbwe instructions) 32 32220 Effective Page Number [0:21] TS V SIZE TPAR TID 40 Real Page Number [0:21] 0 PA R1 ERPN PA R2 0 Reserved U3-U0 W I M G E X W R X W R U S • Attributes • V : Valid • SIZE : Page Size (4^n KB, where n in {0,1,2,3,4,5,7,9,10}) • U : User-defined storage attribute • W: Write-through • I: Caching Inhibited • M: Memory coherency required • G: Guarded • E: Endian • UX, UW, UR: User executable, writable, readable • SX, SW, SR: Supervisor executable, writable, readable • TPAR, PAR1, PAR2: Parity
  33. 33. Before the kernel starts… • x86 (32-bit) • Paging is disabled • kernel/head_32.S creates a page table and turns on paging • x86 (64-bit) • compressed/head_64.S creates an identity-mapped (virtual = physical) page table for the first 4 GB • Long mode requires paging to be enabled. • kernel/head_64.S creates a better page table • ARM • kernel/head.S creates a page table and turns on paging 33
  34. 34. Virtual memory mapping 34 x86_64 Virtual i386 Virtual Physical LOWMEM PAGE_OFFSET (0xC0000000) Up to ~896 MB PAGE_OFFSET (0xFFFF8800 00000000) __START_KERNEL_map (0xFFFFFFFF 80000000)
  35. 35. A. Booting in x86 By looking into the source codes 35
  36. 36. A-1. Real Mode Plenty of assembler code, LD script, and inline assembly language 36
  37. 37. Real mode kernel (from p.45) • header.S • Boot sector code, which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into the “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in the “zeropage” • Finally transitions to protected mode, and jumps into the “protected-mode kernel” 37
  38. 38. Boot sector (Useless) 38  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .global bootsect_start bootsect_start: #ifdef CONFIG_EFI_STUB # "MZ", MS-DOS header .byte 0x4d .byte 0x5a #endif # Normalize the start address ljmp $BOOTSEG, $start2 start2: movw %cs, %ax movw %ax, %ds movw %ax, %es movw %ax, %ss xorw %sp, %sp sti cld movw $bugger_off_msg, %si jmp msg_loop Normalize CS to BOOTSEG (0x7c0). movw %ds, %cs is not allowed. stack starts at 0x17c00 Enable interrupts cf. cli Reset directions for string instructions (Clear DF Flag) cf. std Show the message "Direct floppy boot is not supported. "
  39. 39. Wait, how is the header code placed at the beginning of the kernel? • The linker concatenates multiple object files • The positions in the resulting binary are not guaranteed unless the linker is given an order • The linker script (.ld/.lds/.lds.S) dictates the positions to the linker! • Since it is quite likely that you will use the C preprocessor in a linker script, files with the extension “.lds.S” are first processed by the preprocessor, then passed to the linker. • Passing a linker script with “-T” overrides the default linker script • The default linker script can be displayed with “ld --verbose” 39
  40. 40. LD script (1) 40  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .; Specifies the output format (identical to --oformat option) OUTPUT_FORMAT(default, big, little) Specifies the output architecture Specifies the entry point symbol (identical to -e option)
  41. 41. LD script (2) 41 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S Specifies how the sections are output . means the current position Substituting to . means setting the current position Put the .bstext section at the current position, i.e. at the address 0. Put the .bsdata section after the .bstext section.
  42. 42. bstext section…? 42 .code16 .section ".bstext", "ax" .global bootsect_start bootsect_start: #ifdef CONFIG_EFI_STUB # "MZ", MS-DOS header .byte 0x4d .byte 0x5a #endif # Normalize the start address ljmp $BOOTSEG, $start2 start2: movw %cs, %ax movw %ax, %ds movw %ax, %es movw %ax, %ss xorw %sp, %sp sti cld movw $bugger_off_msg, %si jmp msg_loop  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Here it is! [Notes] .code16 = Specify the binary for the following code as 16-bit binary. .section name[, flags] = Starts the section. <flags> (excerpted) • “a” : allocatable (loaded to memory when executed) • “w” : writable • “x” : executable .globl/.global symbol = Makes the symbol global (Can be seen from other objects)
  43. 43. LD script (3) 43 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S Specifies how the sections are output Set the current position to 495 Places the header section at the address 495 Declares a symbol “__end_init” that refers to the current position (the end of .initdata section)
  44. 44. LD script (4) 44 /* * setup.ld * * Linker script for the i386 setup code */ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(_start) SECTIONS { . = 0; .bstext : { *(.bstext) } .bsdata : { *(.bsdata) } . = 495; .header : { *(.header) } .entrytext : { *(.entrytext) } .inittext : { *(.inittext) } .initdata : { *(.initdata) } __end_init = .;  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S .bstext .bsdata 0 495 .header .entrytext .inittext .initdataxxxx __end_init
  45. 45. LD script (5) • To be precise, • Outputs a section named “.bstext” • The output section contains all of the input sections named “.bstext” • The input and output need not be 1-to-1 • The output section “.text” contains all of the input sections “.text”, and then all of the sections whose names start with “.text.” • Creates the new symbols “_text” and “_etext”, which denote the beginning and end of the output section “.text”, respectively. 45 .bstext : { *(.bstext) } .text : { _text = .; /* Text */ *(.text) *(.text.*) _etext = . ; }
  46. 46. LD script (6) 46 . = ALIGN(16); .data : { *(.data*) } .signature : { setup_sig = .; LONG(0x5a5aaa55) } ... /DISCARD/ : { *(.note*) } /* * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility: */ . = ASSERT(_end <= 0x8000, "Setup too big!"); . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!"); /* Necessary for the very-old-loader check to work... */ . = ASSERT(__end_init <= 5*512, "init sections too big!"); }  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S [Usage] # Check signature at end of setup cmpl $0x5a5aaa55, setup_sig jne setup_bad Align to the 16 byte boundary Discard the sections .note* Put this long value at the current position Assertions!
  47. 47. Column: align and balign • LD’s ALIGN(x) returns the x-byte-aligned address • x must be a power of two • = (current + x – 1) & ~(x – 1) • GNU Assembler has two pseudo-ops for alignment • .align x, fill, max • .balign x, fill, max • Both align to the byte boundary specified by x, but what x means differs by target… • The skipped bytes are filled with fill (zero or nop) • The maximum number of bytes to be skipped can be specified with max. 47 .align (x = 4) .balign (x = 4) i386 (elf), sparc, etc. Align to 4 bytes Align to 4 bytes ppc, i386 (a.out), arm Align to 16 bytes (2^4) Align to 4 bytes
  48. 48. COFF Stuffs 48  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S #ifdef CONFIG_EFI_STUB .org 0x3c # Offset to the PE header. .long pe_header #endif /* CONFIG_EFI_STUB */ .section ".bsdata", "a" bugger_off_msg: .ascii "Direct floppy boot is not supported. " .ascii "Use a boot loader program instead.\r\n" ... .byte 0 #ifdef CONFIG_EFI_STUB pe_header: .ascii "PE" .word 0 coff_header: #ifdef CONFIG_X86_32 .word 0x14c # i386 #else .word 0x8664 # x86-64 #endif [Notes] .org location, fill = Set the current position to location in the current section (filling the skipped bytes with fill) .ascii string = Put the string (w/o zero termination) at the current position (cf. .asciz) .byte val, .word val, .long val, .quad val = Put the 1/2/4/8-byte value(s)
  49. 49. Real mode kernel (p.45) • header.S • Boot sector code, which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into the “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in the “zeropage” • Finally transitions to protected mode, and jumps into the “protected-mode kernel” 49
  50. 50. Entry point (2nd sector) 50  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .section ".header", "a" .globl sentinel sentinel: .byte 0xff, 0xff /* Used to detect broken loaders */ .globl hdr hdr: setup_sects: .byte 0 /* Filled in by build.c */ root_flags: .word ROOT_RDONLY syssize: .long 0 /* Filled in by build.c */ ram_size: .word 0 /* Obsolete */ vid_mode: .word SVGA_MODE root_dev: .word 0 /* Filled in by build.c */ boot_flag: .word 0xAA55 # offset 512, entry point .globl _start _start: .byte 0xeb # short (2-byte) jump .byte start_of_setup-1f 1: .ascii "HdrS" # header signature .word 0x020d # header version number (>= 0x0105) .bstext .bsdata 0 495 .header To prevent the compiler from accidentally producing a 3-byte jump
  51. 51. Setup_header • “.header” section starts at 495 • 2-byte sentinel is located at the beginning. • Struct setup_header begins at 497 (=0x1f1) 51 51 47: struct setup_header { 48: __u8 setup_sects; 49: __u16 root_flags; 50: __u32 syssize; 51: __u16 ram_size; 52: __u16 vid_mode; 53: __u16 root_dev; 54: __u16 boot_flag; 55: __u16 jump; 56: __u32 header; 57: __u16 version; 58: __u32 realmode_swtch; ... (arch/x86/include/uapi/asm/bootparam.h) Setup code Boot Sector 0x0000 0x0200 0x1f1
  52. 52. Column: Local Symbols in GAS • Local symbols • Symbols that can be used temporarily • The format is N: (where N is a positive integer) • To refer to a local symbol, use Nf or Nb. • Nf refers to the next local label N. • Nb refers to the most recently declared local label N. • According to the GNU assembler manual, these symbols are internally transformed into the following format: • LN^BO • ^B is Ctrl-B (0x02), O is a serial number • For the 44th occurrence of the label 3, “L3^B44” is used. • Dollar local symbols (I haven’t seen these) 52 .byte start_of_setup-1f 1: 1: jmp 1b
  53. 53. Get prepared to C (stack) 53  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .section ".entrytext", "ax" start_of_setup: # Force %es = %ds movw %ds, %ax movw %ax, %es cld movw %ss, %dx cmpw %ax, %dx # %ds == %ss? movw %sp, %dx je 2f # -> assume %sp is reasonably set # Invalid %ss, make up a new stack movw $_end, %dx testb $CAN_USE_HEAP, loadflags jz 1f movw heap_end_ptr, %dx 1: addw $STACK_SIZE, %dx jnc 2f xorw %dx, %dx # Prevent wraparound 2: # Now %dx should point to the end of our stack space andw $~3, %dx # dword align (might as well...) jnz 3f movw $0xfffc, %dx # Make sure we're not zero 3: movw %ax, %ss movzwl %dx, %esp # Clear upper half of %esp If %ds == %ss, %sp is assumed to be properly set by the loader If not, sets up a new stack. The address is _end + STACK_SIZE (512 byte) or heap_end_ptr + STACK_SIZE (if CAN_USE_HEAP is set)
  54. 54. In other words, • Sets the stack segment to the same as %DS • Allocates 512 bytes for the stack 54 unsigned short stack; if (%ds != %ss) { if (hdr.loadflags & CAN_USE_HEAP) { stack = hdr.heap_end_ptr + STACK_SIZE; } else { stack = _end + STACK_SIZE; } if (carried over) { /* stack >= 0x10000 */ stack = 0; } } /* Align to 4-byte */ stack &= ~3; if (stack == 0) stack = 0xfffc; /* – 4 */ %ss = %ds; %esp = stack;
  55. 55. Get prepared to C (CS fix and BSS clear) 55 sti # Now we should have a working stack # We will have entered with %cs = %ds+0x20, normalize %cs so # it is on par with the other segments. pushw %ds pushw $6f lretw 6: # Check signature at end of setup cmpl $0x5a5aaa55, setup_sig jne setup_bad # Zero the bss movw $__bss_start, %di movw $_end+3, %cx xorl %eax, %eax subw %di, %cx shrw $2, %cx rep; stosl # Jump to C code (should not return) calll main  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S $6f is the address of the label 6:, i.e. its offset from the boot sector. Signature check Fills the BSS with zero. “rep; stosl” (a string instruction) fills the memory from %es:%di for %cx DWORDs with %eax.
  56. 56. [Column] Calling conventions • 16 bit (name unknown) • Arguments: %ax, %dx, %cx • Return value: %ax • 32 bit (cdecl) • Arguments: pushed on the stack (in the reverse order of the arguments) • Caller-saved: %eax, %ecx, and %edx • Callee-saved: the others • Return value: %eax (for int) • 64 bit (amd64) • Arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9 • Caller-saved: all but the callee-saved registers • Callee-saved: %rbp, %rbx, %r12 to %r15 • Return value: %rax 56 f(2, 5, 9, 11); 11 9 5 2 (return address) stack
  57. 57. Real mode kernel (p.45) • header.S • Boot sector code, which is no longer used • Contains setup_header • Prepares the stack and BSS to run C programs • Jumps into the C program (main.c) • main.c • Copies setup_header into the “zeropage” • Sets up the early console • Initializes the heap • Checks the CPU (64-bit capable for a 64-bit kernel?) • Collects HW information by querying the BIOS, and stores the results in the “zeropage” • Finally transitions to protected mode, and jumps into the “protected-mode kernel” 57
  58. 58. main 58 void main(void) { /* First, copy the boot header into the "zeropage" */ copy_boot_params(); /* Initialize the early-boot console */ console_init(); ... /* End of heap check */ init_heap(); /* Make sure we have all the proper CPU support */ if (validate_cpu()) { ... } set_bios_mode(); detect_memory(); keyboard_init(); query_mca(); query_ist(); ... /* Set the video mode */ set_video(); /* Do the last things and invoke protected mode */ go_to_protected_mode(); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  59. 59. Copy to zeropage • Very simple • The omitted part is for compatibility with old command-line parameter protocol (located in the certain address) 59  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S struct boot_params boot_params __attribute__((aligned(16))); ... static void copy_boot_params(void) { ... BUILD_BUG_ON(sizeof boot_params != 4096); memcpy(&boot_params.hdr, &hdr, sizeof hdr); ... }
  60. 60. Set up the serial console • Parses the command line parameters in a very ad-hoc way to find the serial configuration • Finds “earlyprintk” and checks whether it is in either of the following formats • “serial,0x3f8,115200” • “serial,ttyS0,115200” • “ttyS0,115200” • Finds “console” and looks for “uart8250,io,…” or “uart,io,…” • If any serial config is found, sets it up using I/O ports 60 void console_init(void) { parse_earlyprintk(); if (!early_serial_base) parse_console_uart8250(); }  arch  x86  boot  header.S  main.c  early_serial_console.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  61. 61. Puts and putchar • By BIOS call and serial I/O ports 61 void __attribute__((section(".inittext"))) putchar(int ch) { if (ch == '\n') putchar('\r'); /* \n -> \r\n */ bios_putchar(ch); if (early_serial_base != 0) serial_putchar(ch); } void __attribute__((section(".inittext"))) puts(const char *str) { while (*str) putchar(*str++); }  arch  x86  boot  header.S  main.c  tty.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S [Notes] GCC extension __attribute__ section(section) : locates the function/variable in the specified section.
  62. 62. Serial and BIOS putchar 62 static void __attribute__((section(".inittext"))) serial_putchar(int ch) { unsigned timeout = 0xffff; while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout) cpu_relax(); outb(ch, early_serial_base + TXR); } static void __attribute__((section(".inittext"))) bios_putchar(int ch) { struct biosregs ireg; initregs(&ireg); ireg.bx = 0x0007; ireg.cx = 0x0001; ireg.ah = 0x0e; ireg.al = ch; intcall(0x10, &ireg, NULL); }  arch  x86  boot  header.S  main.c  tty.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Put a char on a serial line by using I/O ports (IN and OUT instructions) Put a char on VGA by BIOS Call (INT 0x10, AH = 0x0e)
  63. 63. BIOS Call • A BIOS call is invoked by using an INT instruction • Requires assembly-language support • Parameters and return values are passed in a certain set of registers • The INT instruction only takes an immediate for the interrupt number. • C prototype: • struct biosregs has all the general registers, the data segment registers, and the flags register 63 void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg); void initregs(struct biosregs *reg) { memset(reg, 0, sizeof *reg); reg->eflags |= X86_EFLAGS_CF; reg->ds = ds(); reg->es = ds(); reg->fs = fs(); reg->gs = gs(); }
  64. 64. BIOS Call Impl. (1) 64  arch  x86  boot  header.S  main.c  bioscall.S  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S .code16 .section ".inittext","ax" .globl intcall ... intcall: cmpb %al, 3f je 1f movb %al, 3f jmp 1f /* Synchronize pipeline */ 1: ... /* Actual INT */ .byte 0xcd /* INT opcode */ 3: .byte 0 ... void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg); ax dx cx Checks the current operand of the INT instruction, and rewrites (self-modifies) the interrupt number if it differs.
  65. 65. BIOS Call Impl. (2) 65 1: /* Save state */ pushfl pushw %fs pushw %gs pushal /* Copy input state to stack frame */ subw $44, %sp movw %dx, %si movw %sp, %di movw $11, %cx rep; movsd /* Pop full state from the stack */ popal popw %gs popw %fs popw %es popw %ds popfl /* Actual INT */ .byte 0xcd /* INT opcode */ 3: .byte 0 EFLAGS FS GS EAX ECX EDI … stack EFLAGS FS GS DS ES EAX … EDI Copy of struct biosregs *ireg (44 bytes) Registers Registers
  66. 66. BIOS Call Impl. (3) 66 /* Push full state to the stack */ pushfl pushw %ds pushw %es pushw %fs pushw %gs pushal ... (Restore %ds, %sp, etc.) ... /* Copy output state from stack frame */ movw 68(%esp), %di /* Original %cx == 3rd argument */ andw %di, %di jz 4f movw %sp, %si movw $11, %cx rep; movsd /* Restore state and return */ popal popw %gs popw %fs popfl retl EFLAGS FS GS EAX ECX EDI … stack EFLAGS FS GS DS ES EAX … EDI Registers *oregs Registers
  67. 67. Inline assembly • A quick way to use assembly language inside C source code • For example, when you want to disable interrupts, put the following into your C code. • GCC’s extended inline assembly language enables far more features (and is far more complicated) • => Described twenty or so slides later! 67 asm ("cli"); static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); }
  68. 68. Initialize the heap 68 char *HEAP = _end; char *heap_end = _end; /* Default end of heap = no heap */ ... static void init_heap(void) { char *stack_end; if (boot_params.hdr.loadflags & CAN_USE_HEAP) { asm("leal %P1(%%esp),%0" : "=r" (stack_end) : "i" (-STACK_SIZE)); heap_end = (char *) ((size_t)boot_params.hdr.heap_end_ptr + 0x200); if (heap_end > stack_end) heap_end = stack_end; } else { /* Boot protocol 2.00 only, no heap available */ puts("WARNING: Ancient bootloader, some functionality " "may be limited!\n"); } }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S Stores %esp - STACK_SIZE into stack_end heap_end stack_end
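The clamping in init_heap can be sketched in portable C. compute_heap_end is a hypothetical helper name; the STACK_SIZE of 512 bytes and the 0x200 slack follow the quoted code:

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 512  /* boot-time stack size, per the -STACK_SIZE in the asm */

/* Sketch of the heap_end computation: the heap may extend up to
 * heap_end_ptr + 0x200, but must never grow into the boot-time stack,
 * which occupies the STACK_SIZE bytes below %esp. */
static uintptr_t compute_heap_end(uintptr_t esp, uintptr_t heap_end_ptr)
{
    uintptr_t stack_end = esp - STACK_SIZE;       /* leal -STACK_SIZE(%esp) */
    uintptr_t heap_end  = heap_end_ptr + 0x200;
    if (heap_end > stack_end)
        heap_end = stack_end;                     /* clamp below the stack */
    return heap_end;
}
```

The second case below shows the clamp firing when the requested heap end would collide with the stack.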
  69. 69. When is the heap used? • Heap allocation function is very simple • And the calls for GET_HEAP exist only in the video code files. 69 static inline char *__get_heap(size_t s, size_t a, size_t n) { char *tmp; HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1)); tmp = HEAP; HEAP += s*n; return tmp; } #define GET_HEAP(type, n) ((type *)__get_heap(sizeof(type),__alignof__(type),(n))) saved.data = GET_HEAP(u16, saved.x*saved.y); (boot/video.c)
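The align-then-advance behavior of __get_heap can be reproduced as standalone C. heap_buf, get_heap, and get_heap_selftest are illustrative names, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standalone re-creation of the boot heap's bump allocator: round the
 * HEAP pointer up to the requested alignment a, hand out that address,
 * then advance HEAP by s*n bytes. Nothing is ever freed. */
static unsigned char heap_buf[4096];
static unsigned char *HEAP = heap_buf;

static void *get_heap(size_t s, size_t a, size_t n)
{
    unsigned char *tmp;
    HEAP = (unsigned char *)(((uintptr_t)HEAP + (a - 1)) & ~(uintptr_t)(a - 1));
    tmp = HEAP;
    HEAP += s * n;
    return tmp;
}

static int get_heap_selftest(void)
{
    unsigned char *p = get_heap(1, 1, 3);  /* 3 bytes, unaligned */
    unsigned char *q = get_heap(4, 4, 2);  /* 2 u32-sized, 4-aligned */
    return (((uintptr_t)q & 3) == 0) && (q - p >= 3);
}
```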
  70. 70. Retrieving memory info. • As described in the last presentation, detect_memory tries 3 methods 70 int detect_memory(void) { ... if (detect_memory_e820() > 0) err = 0; if (!detect_memory_e801()) err = 0; if (!detect_memory_88()) err = 0; return err; }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  71. 71. Memory Information [from p.48] • AX = 0xe820, INT 0x15 [detect_memory_e820()] • INPUT • AX = 0xe820 • CX = size of the buffer • EDX = “SMAP” (0x534d4150 / Signature) • EBX = Continuation value • ES:DI = address for the buffer • OUTPUT • CF = 0 if successful, 1 otherwise • CX = Returned Byte • EBX = Continuation value • Each call returns information for one range • To get information for the next range, give the continuation value returned in the previous call • The range information is returned by the following structure • Stored in boot_params.e820_map (struct e820entry[128]) 71 52 struct e820entry { 53 __u64 addr; /* start of memory segment */ 54 __u64 size; /* size of memory segment */ 55 __u32 type; /* type of memory segment */ 56 } __attribute__((packed)); (arch/x86/include/uapi/asm/e820.h)
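The packed layout can be checked on the host; this is the same structure written with fixed-width stdint types:

```c
#include <assert.h>
#include <stdint.h>

/* The BIOS fills one of these per INT 0x15/AX=0xE820 call. The packed
 * attribute keeps it at exactly 20 bytes (8 + 8 + 4), matching the
 * buffer format the firmware writes. */
struct e820entry {
    uint64_t addr;  /* start of memory segment */
    uint64_t size;  /* size of memory segment */
    uint32_t type;  /* type of memory segment (1 = usable RAM, 2 = reserved, ...) */
} __attribute__((packed));
```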
  72. 72. E820 72 static int detect_memory_e820(void) { int count = 0; struct biosregs ireg, oreg; struct e820entry *desc = boot_params.e820_map; static struct e820entry buf; /* static so it is zeroed */ initregs(&ireg); ireg.ax = 0xe820; ireg.cx = sizeof buf; ireg.edx = SMAP; ireg.di = (size_t)&buf; do { intcall(0x15, &ireg, &oreg); ireg.ebx = oreg.ebx; /* for next iteration... */ if (oreg.eflags & X86_EFLAGS_CF) break; ... *desc++ = buf; count++; } while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map)); return boot_params.e820_entries = count; }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  73. 73. Video • Smells like chaos 73
  74. 74. Go To Protected Mode 74 void go_to_protected_mode(void) { /* Hook before leaving real mode, also disables interrupts */ realmode_switch_hook(); /* Enable the A20 gate */ if (enable_a20()) { puts("A20 gate not responding, unable to boot...\n"); die(); } /* Reset coprocessor (IGNNE#) */ reset_coprocessor(); /* Mask all interrupts in the PIC */ mask_all_interrupts(); /* Actual transition to protected mode... */ setup_idt(); setup_gdt(); protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  75. 75. Go To PM: Details (1) • Call the hook if set • Otherwise, disable interrupts and NMI. 75 static void realmode_switch_hook(void) { if (boot_params.hdr.realmode_swtch) { asm volatile("lcallw *%0" : : "m" (boot_params.hdr.realmode_swtch) : "eax", "ebx", "ecx", "edx"); } else { asm volatile("cli"); outb(0x80, 0x70); /* Disable NMI */ io_delay(); } } If a hook is set in realmode_swtch, call the hook Write 0x80 to port 0x70 (the CMOS controller!!) (For historical reasons, the “NMI disable” bit is located in the CMOS controller)  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  76. 76. Go To PM: Details (2) • Enable the A20 line (the 20th bit of the address bus) • In the initial state, this bit is masked (always 0) • For compatibility with programs that expect address wraparound at 1MB • Some programs expect the address 0xFFFFF + 1 = 0x00000 • To use the full 32 bits of the address bus, this mask must be disabled. • There are many ways to do it • But which way works depends on the firmware • The famous one would be via the keyboard controller port! • Linux tries several ways, several times 76
  77. 77. Go To PM: Details (3) 77 int enable_a20(void) {... while (loops--) { if (a20_test_short()) return 0; /* Next, try the BIOS (INT 0x15, AX=0x2401) */ enable_a20_bios(); if (a20_test_short()) return 0; /* Try enabling A20 through the keyboard controller */ kbc_err = empty_8042(); if (a20_test_short()) return 0; /* BIOS worked, but with delayed reaction */ if (!kbc_err) { enable_a20_kbc(); if (a20_test_long()) return 0; } /* Finally, try enabling the "fast A20 gate" */ enable_a20_fast(); if (a20_test_long()) return 0; } ... }  arch  x86  boot  header.S  main.c  memory.c  pm.c  a20.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S *Tries 100000 times at most
  78. 78. Go To PM: Details (4) 78 /* * Reset IGNNE# if asserted in the FPU. */ static void reset_coprocessor(void) { outb(0, 0xf0); io_delay(); outb(0, 0xf1); io_delay(); } /* * Disable all interrupts at the legacy PIC. */ static void mask_all_interrupts(void) { outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */ io_delay(); outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */ io_delay(); } The most legacy interrupt controller  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S
  79. 79. Go To PM: Details (5) • IDT (Interrupt Descriptor Table) • Describes the exception/interrupt handlers (and task gate, etc.) • At this time, no IDT is installed. • null_idt contains information for the address and size for the IDT, both of which are zero. • LIDT instruction takes an argument that is a pointer to the information. 79 static void setup_idt(void) { static const struct gdt_ptr null_idt = {0, 0}; asm volatile("lidtl %0" : : "m" (null_idt)); }  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S len (16-bit) address (32-bit) IDT
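The LIDT/LGDT operand shown in the figure (a 16-bit length followed by a 32-bit linear address) maps to a 6-byte packed struct, as in the kernel's struct gdt_ptr; its layout can be verified on the host:

```c
#include <assert.h>
#include <stdint.h>

/* The 6-byte memory operand of LGDT/LIDT: a 16-bit table limit
 * followed immediately by a 32-bit linear base address. packed removes
 * the 2 bytes of padding the compiler would otherwise insert after len. */
struct gdt_ptr {
    uint16_t len;
    uint32_t ptr;
} __attribute__((packed));
```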
  80. 80. Go To PM: Details (6) • GDT (Global Descriptor Table) • Describes the segment information 80  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S static void setup_gdt(void) { static const u64 boot_gdt[] __attribute__((aligned(16))) = { /* CS: code, read/execute, 4 GB, base 0 */ [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), /* DS: data, read/write, 4 GB, base 0 */ [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), /* TSS: 32-bit tss, 104 bytes, base 4096 */ /* We only have a TSS here to keep Intel VT happy; we don't actually use it for anything. */ [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103), }; static struct gdt_ptr gdt; gdt.len = sizeof(boot_gdt)-1; gdt.ptr = (u32)&boot_gdt + (ds() << 4); asm volatile("lgdtl %0" : : "m" (gdt)); } len (16-bit) address (32-bit) boot_gdt (GDT)
  81. 81. x86 Architecture: GDT • GDT • Each entry is 8 bytes • Base, limit, and attributes • DPL: Descriptor privilege level (0-3: 0 is the most privileged) • When a processor executes code in a code segment, the current privilege level (CPL) is the same as the DPL of that code segment. It can access data segments with DPL >= CPL. 81 G D L * Limit 19:16 P DPL S Type Base 23:16 Base 31:24 Base Address 15:00 Limit 15:00 [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), 9 = Code, Execute Only 3 = Data, R/W 08121315162024 0 4
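How GDT_ENTRY packs the descriptor fields from the figure can be illustrated with a host-side sketch. gdt_entry below is modeled on the kernel's GDT_ENTRY macro: base is split across bits 16-39 and 56-63, the limit across bits 0-15 and 48-51, and the attribute bits go to bits 40-47 and 52-55:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of GDT_ENTRY(flags, base, limit): scatter the 32-bit base,
 * the 20-bit limit, and the 12 attribute bits (Type, S, DPL, P plus
 * AVL, L, D/B, G) into the 8-byte descriptor layout shown above. */
static uint64_t gdt_entry(uint64_t flags, uint64_t base, uint64_t limit)
{
    return ((base  & 0xff000000ULL) << (56 - 24)) |  /* base 31:24  */
           ((flags & 0x0000f0ffULL) << 40)        |  /* attributes  */
           ((limit & 0x000f0000ULL) << (48 - 16)) |  /* limit 19:16 */
           ((base  & 0x00ffffffULL) << 16)        |  /* base 23:0   */
            (limit & 0x0000ffffULL);                 /* limit 15:0  */
}
```

The two assertions below reproduce the boot CS/DS entries quoted on this slide: 4GB flat segments with base 0 and a 0xfffff page-granular limit.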
  82. 82. x86 Architecture: Flat Model and Segment Register • Although the segment feature is available in 32-bit mode, the common usage is called the “Flat Model.” • Uses a single segment from zero to 2^32 - 1 • To be precise, different segments are required for code/data and privileged/user mode. • Linux uses four segments: KERNEL_CS, KERNEL_DS, USER_CS, USER_DS • During boot time, BOOT_CS and BOOT_DS are used (as defined in the previous slide) • Segment Register (Selector) • If CS is to select BOOT_CS, CS = (index of BOOT_CS) << 3; • GDT_ENTRY_BOOT_CS = (Index of BOOT_CS) = 2, then CS = 16 • The constants BOOT_CS = 16, BOOT_DS = 24. • Note the difference between “ENTRY” and the actual value. 82 (Selector layout: Index in bits 15:3, TI in bit 2, RPL in bits 1:0)
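The selector arithmetic is easy to check on the host; make_selector is a hypothetical helper following the Index/TI/RPL layout described above:

```c
#include <assert.h>
#include <stdint.h>

/* A segment selector is just (descriptor index << 3) | TI | RPL:
 * TI (bit 2) selects GDT (0) or LDT (1), and RPL occupies bits 1:0. */
static uint16_t make_selector(uint16_t index, uint16_t ti, uint16_t rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1) << 2) | (rpl & 3));
}
```

With GDT_ENTRY_BOOT_CS = 2 and GDT_ENTRY_BOOT_DS = 3, this reproduces the constants BOOT_CS = 16 and BOOT_DS = 24 mentioned on the slide.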
  83. 83. Go To PM: Details (7) • Call the assembler part (no return) 83 protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); GLOBAL(protected_mode_jump) movl %edx, %esi # Pointer to boot_params table xorl %ebx, %ebx movw %cs, %bx shll $4, %ebx addl %ebx, 2f jmp 1f # Short jump to serialize on 386/486 1: movw $__BOOT_DS, %cx movw $__BOOT_TSS, %di movl %cr0, %edx orb $X86_CR0_PE, %dl # Protected mode movl %edx, %cr0 # Transition to 32-bit mode .byte 0x66, 0xea # ljmpl opcode 2: .long in_pm32 # offset .word __BOOT_CS # segment ENDPROC(protected_mode_jump)  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S dx ax *(uint32_t *)2f += cs() << 4; (phys addr of in_pm32) [Notes] In real mode, physical address = (Segment Register << 4) + Offset To enter protected mode, set PE bit in %cr0 register. And 32-bit far jump operation (not expressible in real mode asm)
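The real-mode address formula from the note, physical = (segment << 4) + offset, is what the code applies when it adds (%cs << 4) to the stored offset of in_pm32. real_mode_phys is an illustrative helper; the second assertion is the classic wraparound case behind the A20 story, where 0xffff:0xffff addresses 0x10ffef, just above 1MB:

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode (and 16-bit segmented) address translation:
 * linear/physical address = (segment register << 4) + offset. */
static uint32_t real_mode_phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```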
  84. 84. Go To PM: Detail (8) 84 .code32 .section ".text32","ax" GLOBAL(in_pm32) # Set up data segments for flat 32-bit mode movl %ecx, %ds movl %ecx, %es movl %ecx, %fs movl %ecx, %gs movl %ecx, %ss ... addl %ebx, %esp ... ltr %di ... xorl %ecx, %ecx xorl %edx, %edx xorl %ebx, %ebx xorl %ebp, %ebp xorl %edi, %edi ... lldt %cx ... jmpl *%eax ENDPROC(in_pm32)  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4)); * (Omitted) Task register and LDT (Local Descriptor Table)
  85. 85. Extended inline assembler (1) • GCC’s extended inline assembly language • Input/output operands can be specified for the assembler • Assembler template • The actual assembly language with templates that will be substituted by the output/input operands • Output operands • List of C variables modified by the assembler template • Input operands • List of C expressions read by the instructions in the assembler template. • Clobber • List of registers/values to be changed by the assembler template (other than the output operands) 85 asm [volatile] (assembler template : [ output operands [ : input operands [ : clobber ]]]) static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); } Assembler template Input operands
  86. 86. Extended inline assembler (2) • Assembler template • Basically, the same as the standalone assembly language • %n (n is zero or a positive integer) refers to the (n+1)-th operand in the input and output operands. • If the character “%” is to be used (to specify a certain register, “%ebx,” for example), “%%” must be used. • Other than the number, a name can be used to specify an operand. (%[symbolicname] refers to the operand with the name [symbolicname]) • To use multiple instructions, use “;” or “\n” 86 static inline void outb(u8 v, u16 port) { asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); }
  87. 87. Extended inline assembler (3) • Input operands • Comma-separated list of C expressions prefixed with constraints • A constraint specifies how the expression is passed to the assembler template. • When multiple constraints are specified, the compiler selects the most efficient one. 87 asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); constraint C expression Constraint ‘m’ Memory operand ‘r’ General register ‘i’ Immediate integer ‘0’ – ‘9’ = The same place as the operand Constraint (x86-specific) ‘a’’b’’c’’d’ A,B,C,D register ‘S’’D’ SI, DI register ‘N’ Unsigned 8-bit integer (for in/out instructions) ‘A’ EDX:EAX (32bit), RDX/RAX (64 bit) [SymbolicName] “Constraints” (C Expression),…
  88. 88. Extended inline assembler (4) • In this example, • The value of v (u8) is stored in %al register • The value of port (u16) is stored in %dx register or used as 8-bit immediate. • This function is declared as “inline,” so if this function is called with a constant value as port which is less than 256, the “N” constraint may be used. • Then, the instruction(s) in the assembler templates are executed. • The resulting assembly language will be 88 asm volatile("outb %0,%1" : : "a" (v), "dN" (port)); outb: movl 8(%esp), %edx movl 4(%esp), %eax outb %al,%dx ret
  89. 89. Extended inline assembler (5) • Output operands • Comma-separated list of C variables prefixed with constraints • Constraints should be prefixed with “=” or “+” • “+” means that the variable is used as both an input and an output operand. • The “&” modifier allocates a different register from the input operands (for multiple instructions, this constraint may be necessary) • After the instruction(s) in the assembler template are executed, the value of the A register (%al) is stored to the variable v. 89 [SymbolicName] “=Constraints” (C Variable),… static inline u8 inb(u16 port) { u8 v; asm volatile("inb %1,%0" : "=a" (v) : "dN" (port)); return v; }
  90. 90. Extended inline assembler (6) • Clobber • The list of registers/values modified by the instructions • The output registers need not be specified here. • The most common clobber is “memory” • This means that the memory contents may be changed as side effects; thus all the variables should be written back to memory before the asm, and should be read again from memory after the asm. • “cc” : Condition (flags) register 90 void *memcpy(void *dest, const void *src, size_t n) { int d0, d1, d2; asm volatile( "rep ; movsl\n\t" "movl %4,%%ecx\n\t" "rep ; movsb\n\t" : "=&c" (d0), "=&D" (d1), "=&S" (d2) : "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src) : "memory"); return dest; }
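What the asm above does can be restated in portable C, which also shows why n >> 2 and n & 3 are passed as separate operands. memcpy_words and its self-test are illustrative, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Portable restatement of the inline-asm memcpy: copy n >> 2 32-bit
 * words ("rep ; movsl"), then the n & 3 leftover bytes ("rep ; movsb"). */
static void *memcpy_words(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t words = n >> 2, tail = n & 3, i;

    for (i = 0; i < words; i++) {           /* rep ; movsl */
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d += 4; s += 4;
    }
    for (i = 0; i < tail; i++)              /* rep ; movsb */
        *d++ = *s++;
    return dest;
}

static int memcpy_words_selftest(void)
{
    char dst[8] = {0};
    memcpy_words(dst, "abcdefg", 7);        /* 1 word + 3 tail bytes */
    return memcmp(dst, "abcdefg", 7) == 0;
}
```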
  91. 91. (?) 91 void *memcpy(void *dest, const void *src, size_t n) { int d0, d1, d2; asm volatile( "rep ; movsl\n\t" "movl %4,%%ecx\n\t" "rep ; movsb\n\t" : "=&c" (d0), "=&D" (d1), "=&S" (d2) : "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src) : "memory"); return dest; } asm volatile( "rep ; movsl\n\t" "movl %2,%%ecx\n\t" "rep ; movsb\n\t" : "+&D" (dest) : "c" (n >> 2), "g" (n & 3), "S" (src) : "memory");
  92. 92. Extended inline assembler (7) • Examples (which appeared in the previous slides) • Example 1 • Stores %esp - STACK_SIZE into stack_end • P in “%P1” is a modifier (but it is not found in the documentation) • With “%P1” • With “%1” • With “%c1” (“constant expression with no punctuation”) 92 asm("leal %P1(%%esp),%0" : "=r" (stack_end) : "i" (-STACK_SIZE)); leal -512(%esp),%eax leal $-512(%esp),%eax leal -512(%esp),%eax
  93. 93. Extended inline assembler (8) • Example 2 • Far-calls the address (the value of boot_params.hdr.realmode_swtch) • The registers eax, ebx, ecx, and edx will be changed in this call. • Example 3 93 asm volatile("lcallw *%0" : : "m" (boot_params.hdr.realmode_swtch) : "eax", "ebx", "ecx", "edx"); static const struct gdt_ptr null_idt = {0, 0}; asm volatile("lidtl %0" : : "m" (null_idt)); setup_idt: lidtl null_idt.1378 ret
  94. 94. Extended inline assembler (9) 94 #define switch_to(prev, next, last) do { unsigned long ebx, ecx, edx, esi, edi; asm volatile("pushfl\n\t" /* save flags */ "pushl %%ebp\n\t" /* save EBP */ "movl %%esp,%[prev_sp]\n\t" /* save ESP */ "movl %[next_sp],%%esp\n\t" /* restore ESP */ "movl $1f,%[prev_ip]\n\t" /* save EIP */ "pushl %[next_ip]\n\t" /* restore EIP */ __switch_canary "jmp __switch_to\n" /* regparm call */ "1:\t" "popl %%ebp\n\t" /* restore EBP */ "popfl\n" /* restore flags */ /* output parameters */ : [prev_sp] "=m" (prev->thread.sp), [prev_ip] "=m" (prev->thread.ip), "=a" (last), /* clobbered output registers: */ "=b" (ebx), "=c" (ecx), "=d" (edx), "=S" (esi), "=D" (edi) __switch_canary_oparam /* input parameters: */ : [next_sp] "m" (next->thread.sp), [next_ip] "m" (next->thread.ip), /* regparm parameters for __switch_to(): */ [prev] "a" (prev), [next] "d" (next) __switch_canary_iparam : /* reloaded segment registers */ "memory"); } while (0) arch/x86/include/asm/switch_to.h
  95. 95. Extended inline assembler (10) • The key point • The context is the stack • The switched task resumes at “1:”. (just after “jmp __switch_to”) • The “__switch_to” function is called with a “jmp” instruction, not a “call” instruction. • Anyway • The template does not use the %n (number) style, but the %[name] style. (too many parameters) 95 asm volatile(... "movl %%esp,%[prev_sp]\n\t" /* save ESP */ ... /* output parameters */ : [prev_sp] "=m" (prev->thread.sp),
  96. 96. Exercise: RDTSC • RDTSC instruction • Input : None • Output : EDX (Higher 32-bit), EAX (Lower 32-bit) 96 unsigned long rdtsc(void) { unsigned int high, low; asm volatile("rdtsc" : "=d" (high), "=a" (low)); return ((unsigned long)high << 32) | low; }
  97. 97. Answer: rdtscll 97 #define rdtscll(val) ((val) = __native_read_tsc()) static __always_inline unsigned long long __native_read_tsc(void) { DECLARE_ARGS(val, low, high); asm volatile("rdtsc" : EAX_EDX_RET(val, low, high)); return EAX_EDX_VAL(val, low, high); } #ifdef CONFIG_X86_64 #define DECLARE_ARGS(val, low, high) unsigned low, high #define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32)) #define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high) #define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high) #else #define DECLARE_ARGS(val, low, high) unsigned long long val #define EAX_EDX_VAL(val, low, high) (val) #define EAX_EDX_ARGS(val, low, high) "A" (val) #define EAX_EDX_RET(val, low, high) "=A" (val) #endif
  98. 98. A-2. Protected Mode Again, full of the assembly code! 98
  99. 99. Protected-Mode Kernel (p.54) • arch/x86/boot/compressed/head_{32,64}.S • Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…) and start the kernel • Relocates the decompressing code (if relocatable and loaded at a different address) • Enables paging and enters the long-mode (in head_64.S) • Clears the BSS, and prepares the heap and stack • Decompresses the kernel • Relocates if required • RANDOMIZED_BASE or RELOCATABLE (in 32-bit) 99
  100. 100. LD script? 100  arch  x86  boot  setup.ld  compressed  vmlinux.lds.S  kernel  vmlinux.lds.S ... #ifdef CONFIG_X86_64 OUTPUT_ARCH(i386:x86-64) ENTRY(startup_64) #else OUTPUT_ARCH(i386) ENTRY(startup_32) #endif SECTIONS { /* Be careful parts of head_64.S * assume startup_32 is at address 0. */ . = 0; .head.text : { _head = . ; HEAD_TEXT _ehead = . ; } .rodata..compressed : { *(.rodata..compressed) } ... .head.text .rodata..compres sed 0 (_head) (_ehead) .text .rodata .got .data .bss .pgtable (64 only) (_etext, _rodata) (_text) (_erodata, _got) (_egot, _data) (_edata, _bss) (_ebss, _pgtable) (_epgtable, _end)
  101. 101. mkpiggy • Section “.rodata..compressed” consists of the compressed kernel (vmlinux) 101 printf(".section \".rodata..compressed\",\"a\",@progbits\n"); printf(".globl z_input_len\n"); printf("z_input_len = %lu\n", ilen); printf(".globl z_output_len\n"); printf("z_output_len = %lu\n", (unsigned long)olen); printf(".globl z_extract_offset\n"); printf("z_extract_offset = 0x%lx\n", offs); /* z_extract_offset_negative allows simplification of head_32.S */ printf(".globl z_extract_offset_negative\n"); printf("z_extract_offset_negative = -0x%lx\n", offs); printf(".globl input_data, input_data_end\n"); printf("input_data:\n"); printf(".incbin \"%s\"\n", argv[1]); printf("input_data_end:\n"); (arch/x86/boot/compressed/mkpiggy.c)
  102. 102. Entry point (32-bit) 102 .text __HEAD ENTRY(startup_32) #ifdef CONFIG_EFI_STUB jmp preferred_addr ... preferred_addr: #endif cld testb $(1<<6), BP_loadflags(%esi) jnz 1f cli movl $__BOOT_DS, %eax movl %eax, %ds movl %eax, %es movl %eax, %fs movl %eax, %gs movl %eax, %ss 1: .section ".head.text","ax"  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S If KEEP_SEGMENT is set in loadflags in boot_params, do not reload the segments.
  103. 103. Protected-Mode Protocol (p.53) • Starts at the top of the protected-mode kernel • Usually loaded at 0x100000 (1MB) • Can be at any position if compiled as relocatable • Must be at the position specified at compile time if not compiled as relocatable • Used in the “linux” module in GRUB2 • [Protocol] At the entry point, • The loaded GDT must have __BOOT_CS (0x10 / execute and read) and __BOOT_DS (0x18 / read and write) • %cs must be __BOOT_CS • %ds, %es, and %ss must be __BOOT_DS • Interrupts must be disabled • %esi must be the address of struct boot_params • %ebp, %edi, and %ebx must be zero. 103
  104. 104. Protected-Mode Kernel (p.54) • arch/x86/boot/compressed/head_{32,64}.S • Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…) and start the kernel • Relocates the decompressing code (if relocatable and loaded at a different address) • Enables paging and enters the long-mode (in head_64.S) • Decompresses the kernel • Relocates if required • RANDOMIZED_BASE or RELOCATABLE (in 32-bit) 104
  105. 105. Where are we? 105 leal (BP_scratch+4)(%esi), %esp call 1f 1: popl %ebp subl $1b, %ebp  arch  x86  boot  header.S  main.c  memory.c  pm.c  pmjump.S  compressed  head_32.S  head_64.S  eboot.c  efi_stub_32.S  efi_stub_64.S  kernel  head_32.S  head_64.S • The call instruction pushes the return address onto the stack • The return address is the next instruction after the call instruction, i.e. the label 1: • The pop that immediately follows pops the return address off the stack, i.e. the absolute physical address of 1: • Subtracting the compile-time address of the label (1b) from that address leaves in %ebp the offset between the actual load address and the compile-time address (0-based, as seen in the lds).
  106. 106. Memory View 106 PM kernel RM kernel Higher Address %ebp vmlinux (decompressed) Goal: head BP compressed %esi z_extract_offset (mkpiggy.c) offs = (olen > ilen) ? olen - ilen : 0; offs += olen >> 12; /* Add 8 bytes for each 32K block */ offs += 64*1024 + 128; /* Add 64K + 128 bytes slack */ offs = (offs+4095) & ~4095; /* Round to a 4K boundary */ ... printf("z_extract_offset = 0x%lx\n", offs); Relocated Kernel LOAD_PHYSICAL_ADDR (arch/x86/include/asm/boot.h) #define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START + (CONFIG_PHYSICAL_ALIGN - 1)) & ~(CONFIG_PHYSICAL_ALIGN - 1)) *Default: 0x1000000 compressed
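The offs computation quoted above runs fine on the host; extract_offset is a hypothetical wrapper around the mkpiggy arithmetic, and the sample sizes (3MB compressed, 8MB decompressed) are made up for illustration:

```c
#include <assert.h>

/* mkpiggy's slack computation: enough extra space so that
 * decompressing in place (input and output overlapping) never
 * overwrites compressed input that has not been read yet. */
static unsigned long extract_offset(unsigned long ilen, unsigned long olen)
{
    unsigned long offs = (olen > ilen) ? olen - ilen : 0;
    offs += olen >> 12;              /* 8 bytes for each 32K block of output */
    offs += 64 * 1024 + 128;         /* 64K + 128 bytes of slack */
    offs = (offs + 4095) & ~4095UL;  /* round up to a 4K boundary */
    return offs;
}
```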
  107. 107. Determine where to decompress • If CONFIG_RELOCATABLE • The current position (BP_kernel_alignment- aligned) • Default: 2MB-align • If it is less than LOAD_PHYSICAL_ADDR, LOAD_PHYSICAL_ADDR is used • If not CONFIG_RELOCATABLE • LOAD_PHYSICAL_ADDR is used • Now %ebx is the target address 107 #ifdef CONFIG_RELOCATABLE movl %ebp, %ebx movl BP_kernel_alignment(%esi), %eax decl %eax addl %eax, %ebx notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx jge 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1:  arch  x86  boot  compressed  head_32.S  head_64.S  kernel  head_32.S  head_64.S
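The CONFIG_RELOCATABLE path above can be mirrored in C. target_addr is a sketch: LOAD_PHYSICAL_ADDR uses the 0x1000000 default mentioned earlier, and align stands in for BP_kernel_alignment (typically 2MB):

```c
#include <assert.h>

#define LOAD_PHYSICAL_ADDR 0x1000000UL  /* default, per the slides */

/* Round the current load address up to the kernel alignment, but
 * never place the kernel below LOAD_PHYSICAL_ADDR; this mirrors the
 * decl/addl/notl/andl sequence plus the cmpl/jge fallback. */
static unsigned long target_addr(unsigned long ebp, unsigned long align)
{
    unsigned long ebx = (ebp + align - 1) & ~(align - 1);
    if (ebx < LOAD_PHYSICAL_ADDR)
        ebx = LOAD_PHYSICAL_ADDR;
    return ebx;
}
```

The first case below rounds up to 0x400000, which is still below the 16MB floor, so LOAD_PHYSICAL_ADDR wins; the second is already above the floor and only gets aligned.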
  108. 108. Copy the decompression code • Copy the area from the head of PM kernel (startup_32) to just before the head of bss. • The code copies the kernel backwards in case of overlapping 108 addl $z_extract_offset, %ebx leal boot_stack_end(%ebx), %esp pushl $0 popfl pushl %esi leal (_bss-4)(%ebp), %esi leal (_bss-4)(%ebx), %edi movl $(_bss - startup_32), %ecx shrl $2, %ecx std rep movsl cld popl %esi PM kernel %ebp Relocated vmlinux (decompressed) %ebx z_extract_offset
  109. 109. Jump to the relocated address • Jump to the copied decompression code • The decompression code is at the end of the PM kernel • Just after the compressed kernel image • Clears the BSS 109 leal relocated(%ebx), %eax jmp *%eax ENDPROC(startup_32) .text relocated: xorl %eax, %eax leal _bss(%ebx), %edi leal _ebss(%ebx), %ecx subl %edi, %ecx shrl $2, %ecx rep stosl %ebx Relocated kernel vmlinux (decompressed) relocated
  110. 110. Why this z_extract_offset? • The PM kernel contains the compressed kernel image • The relocating (copying) code is located at the head of the PM kernel • The decompression code is located at the tail of the PM kernel • The decompression code after relocation is safe because z_extract_offset + the compressed image size is larger than the decompressed image size 110 head compressed decomp decompressed z_extract_offset work area head compressed decomp Relocate z_extract_offset
  111. 111. Fix up the absolute addresses • The decompression code is built with -fPIC (position independent code), and so fixing up the absolute addresses is achieved by modifying the addresses in GOT (Global Offset Table). 111 /* * Adjust our own GOT */ leal _got(%ebx), %edx leal _egot(%ebx), %ecx 1: cmpl %ecx, %edx jae 2f addl %ebx, (%edx) addl $4, %edx jmp 1b 2: %ebx Relocated kernel
  112. 112. Protected-Mode Kernel (p.54) • arch/x86/boot/compressed/head_{32,64}.S • Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…) and start the kernel • Relocates the decompressing code (if relocatable and loaded at a different address) • Enables paging and enters the long-mode (in head_64.S) • Decompresses the kernel • Relocates if required • RANDOMIZED_BASE or RELOCATABLE (in 32-bit) 112
  113. 113. Call the decompression routine • Call the decompress_kernel function in C • asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap, unsigned char *input_data, unsigned long input_len, unsigned char *output, unsigned long output_len) 113 pushl $z_output_len /* decompressed length */ leal z_extract_offset_negative(%ebx), %ebp pushl %ebp /* output address */ pushl $z_input_len /* input_len */ leal input_data(%ebx), %eax pushl %eax /* input_data */ leal boot_heap(%ebx), %eax pushl %eax /* heap area */ pushl %esi /* real mode pointer */ call decompress_kernel /* returns kernel location in %eax */ BP Relocated vmlinux (decompressed) %ebx z_extract_offset %esi
  114. 114. Decompressing 114 asmlinkage __visible void *decompress_kernel(...) { ... output = choose_kernel_location(input_data, input_len, output, output_len); ... #ifndef CONFIG_RELOCATABLE if ((unsigned long)output != LOAD_PHYSICAL_ADDR) error("Wrong destination address"); #endif debug_putstr("\nDecompressing Linux... "); decompress(input_data, input_len, NULL, NULL, output, NULL, error); parse_elf(output); handle_relocations(output, output_len); debug_putstr("done.\nBooting the kernel.\n"); return output; }  arch  x86  boot  compressed  head_32.S  head_64.S  misc.c  kernel  head_32.S  head_64.S
  115. 115. Choosing the destination • The choose_kernel_location function • If KASLR is enabled, it computes some random output address (aslr.c) • Otherwise, it just returns the output parameter 115
  116. 116. Decompressing the kernel • The decompress function does everything • The implementation is located at lib/decompress_*.c 116 #ifdef CONFIG_KERNEL_GZIP #include "../../../../lib/decompress_inflate.c" #endif #ifdef CONFIG_KERNEL_BZIP2 #include "../../../../lib/decompress_bunzip2.c" #endif #ifdef CONFIG_KERNEL_XZ #include "../../../../lib/decompress_unxz.c" #endif ...  arch  x86  boot  compressed  head_32.S  head_64.S  misc.c  kernel  head_32.S  head_64.S
  117. 117. Load the ELF • parse_elf • Parses the ELF header and places the contents according to the program header (p_paddr) • If relocatable, p_paddr is offset by the actual load address. 117 for (i = 0; i < ehdr.e_phnum; i++) { ... switch (phdr->p_type) { case PT_LOAD: #ifdef CONFIG_RELOCATABLE dest = output; dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR); #else dest = (void *)(phdr->p_paddr); #endif memcpy(dest, output + phdr->p_offset, phdr->p_filesz); break; ... } } typedef struct elf32_phdr{ Elf32_Word p_type; Elf32_Off p_offset; Elf32_Addr p_vaddr; Elf32_Addr p_paddr; Elf32_Word p_filesz; Elf32_Word p_memsz; Elf32_Word p_flags; Elf32_Word p_align; } Elf32_Phdr;
  118. 118. Protected-Mode Kernel (p.54) • arch/x86/boot/compressed/head_{32,64}.S • Goal: Decompresses the kernel (vmlinux.gz/.bz2/.xz…) and start the kernel • Relocates the decompressing code (if relocatable and loaded at a different address) • Enables paging and enters the long-mode (in head_64.S) • Decompresses the kernel • Relocates if required • RANDOMIZED_BASE or RELOCATABLE (in 32-bit) 118
  119. 119. Relocate the kernel image • Relocation information (generated by the "relocs" tool) is appended just after the ELF image • The relocation information is a list of the locations of absolute addresses in the kernel code • These locations are all expressed as kernel virtual addresses

[Diagram: the vmlinux ELF image, followed by the 32-bit relocation addresses and the 64-bit relocation addresses, each list terminated by 0]

$ objdump -adr vmlinux
...
c1086910 <vfs_llseek>:
c1086910:  55              push   %ebp
...
c1086919:  bb 60 63 08 c1  mov    $0xc1086360,%ebx
           c108691a: R_386_32  no_llseek
  120. 120. Calculate deltas • __START_KERNEL_map • In 32-bit, PAGE_OFFSET (default: 0xC0000000) • In 64-bit, 0xffffffff80000000 120

static void handle_relocations(void *output, unsigned long output_len)
{
        ...
        unsigned long min_addr = (unsigned long)output;
        ...
        /* Difference between the compile-time physical address
           and the actual physical address */
        delta = min_addr - LOAD_PHYSICAL_ADDR;
        ...
        /* The offset of the kernel virtual address
           to the physical address */
        map = delta - __START_KERNEL_map;
        ...
  121. 121. Apply the relocation 121

for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
        int extended = *reloc;
        extended += map;
        ptr = (unsigned long)extended;
        if (ptr < min_addr || ptr > max_addr)
                error("32-bit relocation outside of kernel!\n");
        *(uint32_t *)ptr += delta;
}
#ifdef CONFIG_X86_64
for (reloc--; *reloc; reloc--) {
        long extended = *reloc;
        extended += map;
        ptr = (unsigned long)extended;
        if (ptr < min_addr || ptr > max_addr)
                error("64-bit relocation outside of kernel!\n");
        *(uint64_t *)ptr += delta;
}
#endif
  122. 122. OK, go to the entry point • The entry point is always at the head of the kernel • decompress_kernel returns "output" • The assembly code jumps to the entry point 122

asmlinkage __visible void *decompress_kernel(...)
{
        ...
        output = choose_kernel_location(input_data, input_len,
                                        output, output_len);
        ...
        return output;
}

/*
 * Jump to the decompressed kernel.
 */
        xorl    %ebx, %ebx
        jmp     *%eax
  123. 123. Next • Go on to startup_32/startup_64 123
