
Linux Kernel Tour

How the Linux Kernel is Working, from compilation, start-up, Basic Initializations to high level Initialization to subsystems.


  1. Linux Kernel Tour. By: Samrat Das AED600
  2. Tour Map: Starting From
  3. To: a fully functional, working OS
  4. Topics to be covered: Introduction • Kernel Source Organization • Compilation Process • Booting Process • Loading of the Kernel • Initialization Process • Working of the Kernel • Kernel Subsystems • Introduction to common kernel APIs • Kernel symbol usage • Introduction to the mailing list & how to contribute to the kernel tree (creating and submitting a patch)
  5. Introduction
  6. Introduction: Kernel Map
  7. Let's see how the Linux kernel source is organized. Next: Kernel Source Organization. You can get the source from kernel.org or via git.
  8. Kernel Source Organization
  9. Kernel Source Organization
  10. Kernel Source Organization
  11. Kernel Source Organization
  12. Kernel Source Organization
  13. Browsing the source code: cscope, a tool to browse source code; http://lxr.free-electrons.com, an online source browser.
  14. Compilation Process. After configuration, when the user types 'make zImage' or 'make bzImage', the resulting bootable kernel image is stored as arch/i386/boot/zImage or arch/i386/boot/bzImage. Here is how the image is built:
  15. Compilation Process
      I. C and assembly source files are compiled into ELF relocatable object format (.o), and some of them are grouped logically into archives (.a) using ar(1).
      II. Using ld(1), the above .o and .a files are linked into vmlinux, a statically linked, non-stripped ELF 32-bit LSB 80386 executable.
      III. System.map is produced by 'nm vmlinux', with irrelevant or uninteresting symbols grepped out.
      IV. Enter directory arch/i386/boot.
      V. Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__, depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s respectively.
      VI. bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or bootsect.s assembled and raw-converted into bootsect for zImage).
      VII. Setup code setup.S (which includes video.S) is preprocessed into bsetup.s for bzImage or setup.s for zImage. As with the bootsector code, the difference is marked by -D__BIG_KERNEL__ being present for bzImage. The result is then converted into 'raw binary' form called bsetup.
  16. Compilation Process (cont.)
      VIII. Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to $tmppiggy (a temporary filename) in raw binary format, removing the .note and .comment ELF sections.
      IX. gzip -9 < $tmppiggy > $tmppiggy.gz
      X. Link $tmppiggy.gz into an ELF relocatable (ld -r) piggy.o.
      XI. Compile the compression routines head.S and misc.c (still in arch/i386/boot/compressed) into ELF objects head.o and misc.o.
      XII. Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for zImage; don't mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage the compression loader is loaded high.
      XIII. Convert bvmlinux to 'raw binary' bvmlinux.out, removing the .note and .comment ELF sections.
      XIV. Go back to the arch/i386/boot directory and, using the program tools/build, cat together bbootsect, bsetup and compressed/bvmlinux.out into bzImage (delete the extra 'b' above for zImage). This writes important variables like setup_sects and root_dev at the end of the bootsector.
  17. Result after compilation: bzImage. What's inside? objdump -D bzImage
  18. Let us see how this kernel works. Let's start with the boot process.
  19. Booting Process
      I. The BIOS selects the boot device.
      II. The BIOS loads the bootsector from the boot device.
      III. The bootsector loads setup, the decompression routines and the compressed kernel image.
      IV. The kernel is uncompressed in protected mode.
      V. Low-level initialization is performed by asm code.
      VI. High-level C initialization follows.
  20. Mapping of the kernel and other peripherals
  21. Initializations – asm
      I. Initialize segment values.
      II. Initialize page tables.
      III. Enable paging by setting the PG bit in %cr0.
      IV. Zero-clean BSS (on SMP, only the first CPU does this).
      V. Copy the first 2k of bootup parameters (the kernel command line).
      VI. Check the CPU type using EFLAGS and, if possible, cpuid, which is able to detect 386 and higher.
      VII. The first CPU calls start_kernel(); all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.
  22. Initializations – high level
      I. Take a global kernel lock (needed so that only one CPU goes through initialization).
      II. Perform arch-specific setup (memory layout analysis, copying the boot command line again, etc.).
      III. Print the Linux kernel "banner" containing the version.
      IV. Initialize traps.
      V. Initialize IRQs.
  23. Initializations – high level
      VI. Initialize the data required for the scheduler.
      VII. Initialize timekeeping data.
      VIII. Initialize the softirq subsystem.
      IX. Parse boot command-line options.
      X. Initialize the console.
      XI. If module support was compiled into the kernel, initialize the dynamic module loading facility.
      XII. If a "profile=" command line option was supplied, initialize the profiling buffers.
      XIII. kmem_cache_init(): initialize most of the slab allocator.
      XIV. Enable interrupts.
  24. Initializations – high level
      XV. Calculate the BogoMips value for this CPU.
      XVI. Call mem_init(), which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line.
      XVII. kmem_cache_sizes_init(): finish slab allocator initialization.
      XVIII. Initialize data structures used by procfs.
      XIX. fork_init(): create uid_cache, initialize max_threads based on the amount of memory available and configure RLIMIT_NPROC for init_task to be max_threads/2.
      XX. Create the various slab caches needed for VFS, VM, buffer cache, etc.
  25. Initializations – high level
      XXI. If System V IPC support is compiled in, initialize the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of the shmfs filesystem.
      XXII. If quota support is compiled into the kernel, create and initialize a special slab cache for it.
      XXIII. Perform the arch-specific "check for bugs" and, whenever possible, activate workarounds for processor/bus/etc. bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked for if the kernel is compiled for less than a 686 and worked around accordingly.
  26. Initializations – high level. Finally the kernel is ready to move_to_user_mode():
      XXIV. Set a flag indicating that a schedule should be invoked at the "next opportunity" and create a kernel thread init(), which execs execute_command if supplied via the "init=" boot parameter, or else tries to exec /sbin/init, /etc/init, /bin/init and /bin/sh in that order; if all of these fail, panic with a "suggestion" to use the "init=" parameter.
      XXV. Go into the idle loop; this is an idle thread with pid=0.
  27. Working of the Kernel. After exec()ing the init program from one of the standard places, the kernel has no direct control over the program flow. Its role, from now on, is to provide processes with system calls and to service asynchronous events. Multitasking has been set up, and it is now init that manages multiuser access by fork()ing system daemons and login processes.
  28. Working of the Kernel. Whenever a program needs to use a system resource, it issues a system call.
  29. System Call Implementation
      • The mechanism used to signal the kernel is a software interrupt.
      • The program incurs an exception; the system then switches to kernel mode and executes the exception handler, which is the system call handler.
      • The defined software interrupt on x86 is the int $0x80 instruction.
      • It triggers a switch to kernel mode and the execution of exception vector 128, which is the system call handler.
      • The system call handler is the aptly named function system_call(). It is architecture dependent and typically implemented in assembly in entry.S.
      • Later x86 processors added a feature known as sysenter, which provides a faster, more specialized way of trapping into the kernel to execute a system call than the int instruction. (A minimal user-space example follows.)
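      For illustration, a minimal sketch of trapping into the kernel with int $0x80 from user space. It assumes a 32-bit x86 build, where __NR_getpid is 20; other architectures use different numbers and instructions.

          #include <stdio.h>

          int main(void)
          {
                  long pid;

                  /* Invoke sys_getpid through the legacy int $0x80 gate.
                   * eax carries the syscall number in (20 = __NR_getpid on
                   * 32-bit x86) and the return value out. 32-bit only. */
                  __asm__ volatile ("int $0x80" : "=a" (pid) : "a" (20L));

                  printf("getpid() via int $0x80 -> %ld\n", pid);
                  return 0;
          }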
  30. System Call Implementation: Denoting the Correct System Call
      • On x86, the syscall number is fed to the kernel via the eax register.
      • Before causing the trap into the kernel, user space places the number of the desired system call in eax.
      • The system call handler then reads the value from eax.
      • The system_call() function checks the validity of the given system call number by comparing it to NR_syscalls.
      • If it is larger than or equal to NR_syscalls, the function returns -ENOSYS. Otherwise, the specified system call is invoked: call *sys_call_table(,%eax,4). Because each element in the system call table is 32 bits (four bytes), the kernel multiplies the given system call number by four to arrive at its location in the table. (A C rendering of this dispatch follows.)
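      In C terms, the dispatch the handler performs amounts to roughly the sketch below. The names mirror the kernel's, but this is illustrative pseudocode for the 32-bit table lookup, not the actual entry.S assembly.

          /* Illustrative sketch of the entry.S dispatch, not real kernel code. */
          typedef long (*syscall_fn_t)(long, long, long, long, long);

          extern syscall_fn_t sys_call_table[];  /* handlers indexed by number */
          extern const long NR_syscalls;

          long dispatch(long nr, long a1, long a2, long a3, long a4, long a5)
          {
                  if (nr >= NR_syscalls)
                          return -38;            /* -ENOSYS: no such syscall */
                  /* Each slot is one 4-byte pointer, hence the *4 scaling in
                   * the assembly form: call *sys_call_table(,%eax,4) */
                  return sys_call_table[nr](a1, a2, a3, a4, a5);
          }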
  31. System Call Implementation: Parameter Passing. In addition to the system call number, most syscalls require that one or more parameters be passed to them. The easiest way to do this is via the same means by which the syscall number is passed:
      • The parameters are stored in registers. On x86, the registers ebx, ecx, edx, esi and edi contain, in order, the first five arguments.
      • In the unlikely case of six or more arguments, a single register is used to hold a pointer to user space where all the parameters are stored.
      • The return value is also sent to user space in a register. On x86, it is written into the eax register. (See the sketch below.)
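      A minimal sketch of passing arguments in registers, again assuming a 32-bit x86 build where __NR_write is 4: sys_write(fd, buf, count) takes its three arguments in ebx, ecx and edx, and the result comes back in eax.

          #include <string.h>

          int main(void)
          {
                  const char msg[] = "hello via int $0x80\n";
                  long ret;

                  /* eax = syscall number (4 = __NR_write on 32-bit x86),
                   * ebx = fd (1 = stdout), ecx = buffer, edx = length.
                   * The return value (bytes written) comes back in eax. */
                  __asm__ volatile ("int $0x80"
                                    : "=a" (ret)
                                    : "a" (4L), "b" (1L), "c" (msg), "d" (strlen(msg)));
                  return ret < 0;
          }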
  32. We have seen how system calls are implemented. But what are system calls, really? They are the calls into the subsystems of the kernel. Now let us look at the kernel's subsystems.
  33. Subsystems of the Kernel: • Human Interface • System Interface • Process Management • Memory Management • Storage Handling • Networking
  34. Human Interface. This subsystem of the kernel handles the input and output of the system. It controls the functionality of: keyboard, console screen, mouse, etc.
  35. System Interface. Device drivers are the main part of the system interface; they are responsible for interfacing the system with peripherals and other hardware components. Types of drivers: character drivers, block drivers, USB drivers, network drivers. (A skeleton character driver follows.)
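      To make the driver types concrete, here is a minimal character-driver sketch (not from the slides): it registers a read-only device. The "hello" name and dynamic major number are illustrative; a real driver would also create a device node (via udev or mknod).

          #include <linux/module.h>
          #include <linux/fs.h>

          static ssize_t hello_read(struct file *f, char __user *buf,
                                    size_t len, loff_t *off)
          {
                  static const char msg[] = "hello from the kernel\n";

                  /* Copy from a kernel buffer to user space, honouring *off. */
                  return simple_read_from_buffer(buf, len, off, msg, sizeof(msg) - 1);
          }

          static const struct file_operations hello_fops = {
                  .owner = THIS_MODULE,
                  .read  = hello_read,
          };

          static int major;

          static int __init hello_init(void)
          {
                  major = register_chrdev(0, "hello", &hello_fops); /* 0 = pick a free major */
                  return major < 0 ? major : 0;
          }

          static void __exit hello_exit(void)
          {
                  unregister_chrdev(major, "hello");
          }

          module_init(hello_init);
          module_exit(hello_exit);
          MODULE_LICENSE("GPL");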
  36. Process Management. From the kernel's point of view, a process is an entry in the process table; nothing more. The process table is thus one of the most important data structures in the system, together with the memory-management tables and the buffer cache. The individual item in the process table is the task_struct structure, defined in include/linux/sched.h. The process table is both an array and a doubly linked list, as well as a tree. The physical implementation is a static array of pointers whose length is NR_TASKS, a constant defined in include/linux/tasks.h, and each structure resides in a reserved memory page. The list structure is achieved through the pointers next_task and prev_task. (A sketch of walking the list follows.)
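      A minimal sketch of walking that list, using the 2.x-era for_each_task macro the slides describe (recent kernels spell it for_each_process and require tasklist locking):

          #include <linux/kernel.h>
          #include <linux/sched.h>

          static void dump_tasks(void)
          {
                  struct task_struct *p;

                  /* Follows the next_task threading through every entry. */
                  for_each_task(p)
                          printk("pid %d: %s\n", p->pid, p->comm);
          }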
  37. Process Management (cont.). After booting is over, the kernel is always working on behalf of one of the processes, and the global variable current, a pointer to a task_struct item, is used to record the running one. current is only changed by the scheduler, in kernel/sched.c. When all processes must be looked at, however, the macro for_each_task is used; it is considerably faster than a sequential scan of the array when the system is lightly loaded. A process is always running in either "user mode" or "kernel mode". The main body of a user program is executed in user mode and system calls are executed in kernel mode. System calls exist within the kernel as C functions, their 'official' name being prefixed by 'sys_'. A system call named, for example, burnout invokes the kernel function sys_burnout(). (A sketch follows.)
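      Sticking with the slides' hypothetical burnout example, a sketch of how such a sys_ function might use current; the function and its behaviour are invented for illustration.

          #include <linux/kernel.h>
          #include <linux/sched.h>

          /* Hypothetical syscall body: it runs in process context, so
           * `current` points at the calling task's task_struct. */
          long sys_burnout(void)
          {
                  printk(KERN_INFO "burnout called by %s (pid %d)\n",
                         current->comm, current->pid);
                  return 0;
          }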
  38. Process Management: Creating processes. A Unix system creates a process through the fork() system call, and process termination is performed either by exit() or by receiving a signal. The Linux implementations reside in kernel/fork.c and kernel/exit.c. fork's main task is filling in the data structure for the new process. Relevant steps, apart from filling in fields, are:
      • getting a free page to hold the task_struct
      • finding an empty process slot (find_empty_process())
      • getting another free page for the kernel_stack_page
      • copying the father's LDT to the child
      • duplicating the father's mmap information
      sys_fork() also manages file descriptors and inodes. (A user-space example follows.)
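      From user space all of that machinery is invoked with a single call; a minimal sketch:

          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  pid_t pid = fork();        /* one process in, two processes out */

                  if (pid < 0)
                          perror("fork");
                  else if (pid == 0)
                          printf("child:  pid %d\n", getpid());
                  else
                          printf("parent: pid %d forked child %d\n", getpid(), pid);
                  return 0;
          }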
  39. Process Management: Destroying processes. Exiting from a process is trickier, because the parent process must be notified about any child that exits. Moreover, a process can exit by being kill()ed by another process (these are Unix features). The file exit.c is therefore the home of sys_kill() and the various flavours of sys_wait(), in addition to sys_exit(). (An example of reaping a child follows.)
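      That parent notification is what the wait() family consumes; a minimal sketch in which the parent reaps its child's exit status, so no zombie is left behind:

          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/wait.h>
          #include <unistd.h>

          int main(void)
          {
                  pid_t pid = fork();

                  if (pid == 0)
                          exit(42);                  /* child terminates via exit() */

                  int status;
                  waitpid(pid, &status, 0);          /* parent is notified and reaps the child */
                  if (WIFEXITED(status))
                          printf("child %d exited with %d\n", pid, WEXITSTATUS(status));
                  return 0;
          }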
  40. Process Management: Executing programs.
      • After fork()ing, two copies of the same program are running. One of them usually exec()s another program.
      • The exec() system call must locate the binary image of the executable file, then load and run it.
      • The Linux implementation of exec() supports different binary formats. This is accomplished through the linux_binfmt structure.
      • Loading of shared libraries is implemented in the same source file as exec() itself, but let's stick to exec().
      • Unix systems provide the programmer with six flavours of the exec() function. All but one of them can be implemented as library functions, and the Linux kernel implements sys_execve() alone. It performs quite a simple task: loading the head of the executable and trying to execute it. If the first two bytes are "#!", the first line is parsed and an interpreter is invoked; otherwise the registered binary formats are tried in sequence. (See the example below.)
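      The one kernel-implemented flavour is reached directly with execve(); a minimal sketch (the /bin/ls path is just an example):

          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  char *argv[] = { "/bin/ls", "-l", NULL };
                  char *envp[] = { NULL };

                  execve("/bin/ls", argv, envp);  /* replaces this image; returns only on failure */
                  perror("execve");
                  return 1;
          }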
  41. Process Management: State. As a process executes, it changes state according to its circumstances. Linux processes have the following states:
      • Running: the process is either running or ready to run.
      • Waiting: the process is waiting for an event or for a resource. Linux differentiates between two types of waiting process: interruptible and uninterruptible.
      • Stopped: the process has been stopped, usually by receiving a signal. A process that is being debugged can be in a stopped state.
      • Zombie: a halted process which, for some reason, still has a task_struct data structure in the task vector. It is what it sounds like: a dead process.
      The scheduler needs this information in order to decide fairly which process in the system most deserves to run.
  42. Process Management: Process Handling - Schedulers. History of schedulers:
      • O(n) scheduler - before 2.6
      • O(1) scheduler - Ingo Molnar - 2.6 up to 2.6.23
      • Rotating Staircase Deadline Scheduler - Con Kolivas (out of tree)
      • Completely Fair Scheduler - Ingo Molnar - since 2.6.23
      • Brain Fuck Scheduler - Con Kolivas (out-of-tree alternative)
      (Diagram: Processes -> System Calls -> Scheduler)
  43. Memory Management. Linux uses segmentation plus paging, which simplifies notation. Linux uses only 4 segments:
      • 2 segments (code and data/stack) for kernel space, from 3 GB to 4 GB
      • 2 segments (code and data/stack) for user space, from 0 GB to 3 GB
      (A sketch of this split follows.)
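      A minimal sketch of the classic 3 GB / 1 GB split on 32-bit x86; the 0xC0000000 boundary is the default PAGE_OFFSET, but it is configurable per kernel build.

          #include <stdbool.h>
          #include <stdint.h>

          #define PAGE_OFFSET 0xC0000000UL   /* 3 GB: where kernel space begins */

          /* Addresses at or above PAGE_OFFSET fall in the kernel's
           * code/data segments; everything below belongs to user space. */
          static bool is_kernel_address(uintptr_t addr)
          {
                  return addr >= PAGE_OFFSET;
          }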
  44. Memory Management
  45. Memory Management
  46. Storage Handling. The Virtual Filesystem (sometimes called the Virtual File Switch, or more commonly simply the VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces provided to user-space programs. The VFS is the glue that enables system calls such as open(), read() and write() to work regardless of the filesystem or underlying physical medium. (See the example below.)
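      A minimal sketch of that uniformity: the same calls work whether the file lives on ext4, NFS, a FAT USB stick or procfs, because the VFS routes each call to the right filesystem implementation (/proc/version is just a convenient example path).

          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  char buf[128];
                  int fd = open("/proc/version", O_RDONLY);  /* any fs works the same way */

                  if (fd < 0)
                          return 1;

                  ssize_t n = read(fd, buf, sizeof(buf) - 1);
                  if (n > 0) {
                          buf[n] = '\0';
                          printf("%s", buf);
                  }
                  close(fd);
                  return 0;
          }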
  47. Networking. This layer is responsible for handling network packets. The required protocol stacks are implemented here. It is also responsible for decrypting and encrypting network packets.
  48. How to Program. How to use the features of the kernel, or change existing behaviour in the kernel.
  49. Common Kernel APIs. The kernel APIs are documented at https://www.kernel.org/doc/htmldocs/kernel-api/ :
      • Data Types
      • Basic C Library Functions
      • Basic Kernel Library Functions
      • Memory Management in Linux
      • Kernel IPC Facilities
      • FIFO Buffer
      • relay Interface Support
      • Module Support
      • Hardware Interfaces
      • Firmware Interfaces
      • etc.
      (A small example follows.)
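      As a taste of the memory-management APIs in that list, a minimal sketch using kmalloc()/kfree(); the buffer and its size are arbitrary.

          #include <linux/errno.h>
          #include <linux/slab.h>

          static void *buf;

          static int grab_buffer(void)
          {
                  buf = kmalloc(256, GFP_KERNEL);  /* GFP_KERNEL: may sleep */
                  if (!buf)
                          return -ENOMEM;          /* allocations can fail */
                  return 0;
          }

          static void drop_buffer(void)
          {
                  kfree(buf);                      /* kfree(NULL) is a no-op */
          }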
  50. Kernel Symbol Usage. When modules are loaded, they are dynamically linked into the kernel. As in user space, dynamically linked binaries can call only into external functions that are explicitly exported for use. In the kernel, this is handled via the special directives EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(). Functions that are exported are available for use by modules; functions that are not exported cannot be invoked from modules. The set of exported kernel symbols is known as the exported kernel interfaces, or even the kernel API. Exporting a symbol is easy: after the function is defined, it is usually followed by an EXPORT_SYMBOL(). For example:

          int get_pirate_beard_color(void)
          {
                  return pirate->beard->color;
          }
          EXPORT_SYMBOL(get_pirate_beard_color);
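      A sketch of the consuming side, assuming the symbol above has been exported by the core kernel or another loaded module:

          #include <linux/kernel.h>
          #include <linux/module.h>

          extern int get_pirate_beard_color(void);  /* resolved at module load time */

          static int __init beard_init(void)
          {
                  printk(KERN_INFO "beard color: %d\n", get_pirate_beard_color());
                  return 0;
          }

          static void __exit beard_exit(void)
          {
          }

          module_init(beard_init);
          module_exit(beard_exit);
          MODULE_LICENSE("GPL");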
  51. Introduction to the mailing list & how to contribute. The usual patch workflow uses:
      • git diff - review your changes
      • git commit - record them locally
      • git show - inspect a commit
      • git format-patch - turn commits into emailable patches
      • git send-email - mail the patches to the list
  52. References
      The Linux Documentation Project – TLDP: http://www.tldp.org/LDP/lki/lki.html
      Kernelnewbies: http://kernelnewbies.org/Documentation/Subsystems
      Free Electrons: http://free-electrons.com and http://lxr.free-electrons.com
      Kernel Map: http://www.makelinux.net/kernel_map/
  53. Thank you. Samrat Das, samrat48@hotmail.com
