Your SlideShare is downloading. ×
D book proqramlashdirma_23522
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

D book proqramlashdirma_23522


Published on

Published in: Technology, Business

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iiiProfessionalLinux®Kernel ArchitectureWolfgang MauererWiley Publishing, Inc.
  • 2. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page ii
  • 3. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iProfessional Linux®Kernel ArchitectureIntroduction .................................................................. xxviiChapter 1: Introduction and Overview .......................................... 1Chapter 2: Process Management and Scheduling ............................. 35Chapter 3: Memory Management ............................................ 133Chapter 4: Virtual Process Memory .......................................... 289Chapter 5: Locking and Interprocess Communication....................... 347Chapter 6: Device Drivers .................................................... 391Chapter 7: Modules .......................................................... 473Chapter 8: The Virtual Filesystem............................................ 519Chapter 9: The Extended Filesystem Family ................................. 583Chapter 10: Filesystems without Persistent Storage ....................... 643Chapter 11: Extended Attributes and Access Control Lists ................. 707Chapter 12: Networks........................................................ 733Chapter 13: System Calls .................................................... 819Chapter 14: Kernel Activities ................................................ 847Chapter 15: Time management .............................................. 893Chapter 16: Page and Buffer Cache.......................................... 949Chapter 17: Data Synchronization ........................................... 989Chapter 18: Page Reclaim and Swapping................................... 1023Chapter 19: Auditing ........................................................ 1097Appendix A: Architecture Specifics ......................................... 1117Appendix B: Working with the Source Code ................................ 1141Appendix C: Notes on C ..................................................... 1175Appendix D: System Startup ................................................ 1223Appendix E: The ELF Binary Format ......................................... 1241Appendix F: The Kernel Development Process.............................. 1267Bibliography ................................................................. 1289Index ........................................................................ 1293
  • 4. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page ii
  • 5. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iiiProfessionalLinux®Kernel ArchitectureWolfgang MauererWiley Publishing, Inc.
  • 6. Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page ivProfessional Linux®Kernel ArchitecturePublished byWiley Publishing, Inc.10475 Crosspoint BoulevardIndianapolis, IN 46256www.wiley.comCopyright © 2008 by Wolfgang MauererPublished by Wiley Publishing, Inc., Indianapolis, IndianaPublished simultaneously in CanadaISBN: 978-0-470-34343-2Manufactured in the United States of America10 9 8 7 6 5 4 3 2 1Library of Congress Cataloging-in-Publication Data:Mauerer, Wolfgang, 1978-Professional Linux kernel architecture / Wolfgang Mauerer.p. cm.Includes index.ISBN 978-0-470-34343-2 (pbk.)1. Linux. 2. Computer architecture. 3. Application software. I. Title.QA76.9.A73M38 2008005.4’32--dc222008028067No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by anymeans, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, orauthorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 RosewoodDrive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should beaddressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317)572-3447, fax (317) 572-4355, or online at of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warrantieswith respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties,including without limitation warranties of fitness for a particular purpose. No warranty may be created or extendedby sales or promotional materials. The advice and strategies contained herein may not be suitable for everysituation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting,or other professional services. If professional assistance is required, the services of a competent professional personshould be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that anorganization or Website is referred to in this work as a citation and/or a potential source of further informationdoes not mean that the author or the publisher endorses the information the organization or Website may provideor recommendations it may make. Further, readers should be aware that Internet Websites listed in this work mayhave changed or disappeared between when this work was written and when it is read.For general information on our other products and services please contact our Customer Care Department within theUnited States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Wrox Programmer to Programmer, and related trade dressare trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States andother countries, and may not be used without written permission. All other trademarks are the property of theirrespective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not beavailable in electronic books.
  • 7. Mauerer fauth.tex V2 - 08/22/2008 4:52am Page vAbout the AuthorWolfgang Mauerer is a quantum physicist whose professional interests are centered around quantumcryptography, quantum electrodynamics, and compilers for — you guessed it — quantum architectures.With the confirmed capacity of being the worst experimentalist in the known universe, he sticks to thetheoretical side of his profession, which is especially reassuring considering his constant fear of acci-dentally destroying the universe. Outside his research work, he is fascinated by operating systems, andfor more than a decade — starting with an article series about the kernel in 1997 — he has found greatpleasure in documenting and explaining Linux kernel internals. He is also the author of a book abouttypesetting with LaTeX and has written numerous articles that have been translated into seven languagesin total.When he’s not submerged in vast Hilbert spaces or large quantities of source code, he tries to take theopposite direction, namely, upward — be this with model planes, a paraglider, or on foot with an ice axein his hands: Mountains especially have the power to outrival even the Linux kernel. Consequently, heconsiders planning and accomplishing a first-ascent expedition to the vast arctic glaciers of east Green-land to be the really unique achievement in his life.Being interested in everything that is fundamental, he is also the author of the first compiler forPlankalk¨ul, the world’s earliest high-level language devised in 1942–1946 by Konrad Zuse, the father ofthe computer. As an avid reader, he is proud that despite the two-digit number of computers present inhis living room, the volume required for books still occupies a larger share.
  • 8. Mauerer fauth.tex V2 - 08/22/2008 4:52am Page vi
  • 9. Mauerer fcredit.tex V2 - 08/22/2008 4:53am Page viiCreditsExecutive EditorCarol LongSenior Development EditorTom DinseProduction EditorDebra BanningerCopy EditorsCate CaffreyKathryn DugganEditorial ManagerMary Beth WakefieldProduction ManagerTim TateVice President and Executive GroupPublisherRichard SwadleyVice President and ExecutivePublisherJoseph B. WikertProject Coordinator, CoverLynsey StanfordProofreaderPublication Services, Inc.IndexerJack Lewis
  • 10. Mauerer fcredit.tex V2 - 08/22/2008 4:53am Page viii
  • 11. Mauerer fack.tex V4 - 09/04/2008 3:36pm Page ixAcknowledgmentsFirst and foremost, I have to thank the thousands of programmers who have created the Linux kernelover the years — most of them commercially based, but some also just for their own private or academicjoy. Without them, there would be no kernel, and I would have had nothing to write about. Please acceptmy apologies that I cannot list all several hundred names here, but in true UNIX style, you can easilygenerate the list by:for file in $ALL_FILES_COVERED_IN_THIS_BOOK; dogit log --pretty="format:%an" $file; done |sort -u -k 2,2It goes without saying that I admire your work very much — you are all the true heroes in this story!What you are reading right now is the result of an evolution over more than seven years: After two yearsof writing, the first edition was published in German by Carl Hanser Verlag in 2003. It then describedkernel 2.6.0. The text was used as a basis for the low-level design documentation for the EAL4+ securityevaluation of Red Hat Enterprise Linux 5, requiring to update it to kernel 2.6.18 (if the EAL acronymdoes not mean anything to you, then Wikipedia is once more your friend). Hewlett-Packard sponsoredthe translation into English and has, thankfully, granted the rights to publish the result. Updates to kernel2.6.24 were then performed specifically for this book.Several people were involved in this evolution, and my appreciation goes to all of them: Leslie Mackay-Poulton, with support from David Jacobs, did a tremendous job at translating a huge pile of text intoEnglish. I’m also indebted to Sal La Pietra of atsec information security for pulling the strings to get thetranslation project rolling, and especially to Stephan M¨uller for close cooperation during the evaluation.My cordial thanks also go to all other HP and Red Hat people involved in this evaluation, and also toClaudio Kopper and Hans L¨ohr for our very enjoyable cooperation during this project. Many thanks alsogo to the people at Wiley — both visible and invisible to me — who helped to shape the book into itscurrent form.The German edition was well received by readers and reviewers, but nevertheless comments aboutinaccuracies and suggestions for improvements were provided. I’m glad for all of them, and would alsolike to mention the instructors who answered the publisher’s survey for the original edition. Some of theirsuggestions were very valuable for improving the current publication. The same goes for the referees forthis edition, especially to Dr. Xiaodong Zhang for providing numerous suggestions for Appendix F.4.Furthermore, I express my gratitude to Dr. Christine Silberhorn for granting me the opportunity tosuspend my regular research work at the Max Planck Research Group for four weeks to work on thisproject. I hope you enjoyed the peace during this time when nobody was trying to install Linux on yourMacBook!As with every book, I owe my deepest gratitude to my family for supporting me in every aspect oflife — I more than appreciate this indispensable aid. Finally, I have to thank Hariet Fabritius for infinite
  • 12. Mauerer fack.tex V4 - 09/04/2008 3:36pm Page xAcknowledgmentspatience with an author whose work cycle not only perfectly matched the most alarming forms of sleepdyssomnias, but who was always right on the brink of confusing his native tongue with ‘‘C,’’ and whomshe consequently had to rescue from numerous situations where he seemingly had lost his mind (seebelow. . .). Now that I have more free time again, I’m not only looking forward to our well-deservedholiday, but can finally embark upon the project of giving your laptop all joys of a proper operatingsystem! (Writing these acknowledgments, I all of a sudden realize why people always hasten to lockaway their laptops when they see me approaching. . . .)x
  • 13. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xiContentsIntroduction xxviiChapter 1: Introduction and Overview 1Tasks of the Kernel 2Implementation Strategies 3Elements of the Kernel 3Processes, Task Switching, and Scheduling 4Unix Processes 4Address Spaces and Privilege Levels 7Page Tables 11Allocation of Physical Memory 13Timing 16System Calls 17Device Drivers, Block and Character Devices 17Networks 18Filesystems 18Modules and Hotplugging 18Caching 20List Handling 20Object Management and Reference Counting 22Data Types 25. . . and Beyond the Infinite 27Why the Kernel Is Special 28Some Notes on Presentation 29Summary 33Chapter 2: Process Management and Scheduling 35Process Priorities 36Process Life Cycle 38Preemptive Multitasking 40Process Representation 41Process Types 47Namespaces 47
  • 14. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xiiContentsProcess Identification Numbers 54Task Relationships 62Process Management System Calls 63Process Duplication 63Kernel Threads 77Starting New Programs 79Exiting Processes 83Implementation of the Scheduler 83Overview 84Data Structures 86Dealing with Priorities 93Core Scheduler 99The Completely Fair Scheduling Class 106Data Structures 106CFS Operations 107Queue Manipulation 112Selecting the Next Task 113Handling the Periodic Tick 114Wake-up Preemption 115Handling New Tasks 116The Real-Time Scheduling Class 117Properties 118Data Structures 118Scheduler Operations 119Scheduler Enhancements 121SMP Scheduling 121Scheduling Domains and Control Groups 126Kernel Preemption and Low Latency Efforts 127Summary 132Chapter 3: Memory Management 133Overview 133Organization in the (N)UMA Model 136Overview 136Data Structures 138Page Tables 153Data Structures 154Creating and Manipulating Entries 161Initialization of Memory Management 161Data Structure Setup 162Architecture-Specific Setup 169Memory Management during the Boot Process 191xii
  • 15. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xiiiContentsManagement of Physical Memory 199Structure of the Buddy System 199Avoiding Fragmentation 201Initializing the Zone and Node Data Structures 209Allocator API 215Reserving Pages 222Freeing Pages 240Allocation of Discontiguous Pages in the Kernel 244Kernel Mappings 251The Slab Allocator 256Alternative Allocators 258Memory Management in the Kernel 259The Principle of Slab Allocation 261Implementation 265General Caches 283Processor Cache and TLB Control 285Summary 287Chapter 4: Virtual Process Memory 289Introduction 289Virtual Process Address Space 290Layout of the Process Address Space 290Creating the Layout 294Principle of Memory Mappings 297Data Structures 298Trees and Lists 299Representation of Regions 300The Priority Search Tree 302Operations on Regions 306Associating Virtual Addresses with a Region 306Merging Regions 308Inserting Regions 309Creating Regions 310Address Spaces 312Memory Mappings 314Creating Mappings 314Removing Mappings 317Nonlinear Mappings 319Reverse Mapping 322Data Structures 323Creating a Reverse Mapping 324Using Reverse Mapping 325xiii
  • 16. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xivContentsManaging the Heap 327Handling of Page Faults 330Correction of Userspace Page Faults 336Demand Allocation/Paging 337Anonymous Pages 339Copy on Write 340Getting Nonlinear Mappings 341Kernel Page Faults 341Copying Data between Kernel and Userspace 344Summary 345Chapter 5: Locking and Interprocess Communication 347Control Mechanisms 348Race Conditions 348Critical Sections 349Kernel Locking Mechanisms 351Atomic Operations on Integers 352Spinlocks 354Semaphores 355The Read-Copy-Update Mechanism 357Memory and Optimization Barriers 359Reader/Writer Locks 361The Big Kernel Lock 361Mutexes 362Approximate Per-CPU Counters 364Lock Contention and Fine-Grained Locking 365System V Interprocess Communication 366System V Mechanisms 366Semaphores 367Message Queues 376Shared Memory 380Other IPC Mechanisms 381Signals 381Pipes and Sockets 389Summary 390Chapter 6: Device Drivers 391I/O Architecture 391Expansion Hardware 392Access to Devices 397Device Files 397Character, Block, and Other Devices 397xiv
  • 17. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xvContentsDevice Addressing Using Ioctls 400Representation of Major and Minor Numbers 401Registration 403Association with the Filesystem 406Device File Elements in Inodes 406Standard File Operations 407Standard Operations for Character Devices 407Standard Operations for Block Devices 408Character Device Operations 409Representing Character Devices 409Opening Device Files 409Reading and Writing 412Block Device Operations 412Representation of Block Devices 413Data Structures 415Adding Disks and Partitions to the System 423Opening Block Device Files 425Request Structure 427BIOs 430Submitting Requests 432I/O Scheduling 438Implementation of Ioctls 441Resource Reservation 442Resource Management 442I/O Memory 445I/O Ports 446Bus Systems 448The Generic Driver Model 449The PCI Bus 454USB 463Summary 471Chapter 7: Modules 473Overview 473Using Modules 474Adding and Removing 474Dependencies 477Querying Module Information 478Automatic Loading 480Inserting and Deleting Modules 483Module Representation 483Dependencies and References 488xv
  • 18. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xviContentsBinary Structure of Modules 491Inserting Modules 496Removing Modules 505Automation and Hotplugging 506Automatic Loading with kmod 507Hotplugging 508Version Control 511Checksum Methods 512Version Control Functions 515Summary 517Chapter 8: The Virtual Filesystem 519Filesystem Types 520The Common File Model 521Inodes 522Links 522Programming Interface 523Files as a Universal Interface 524Structure of the VFS 525Structural Overview 525Inodes 527Process-Specific Information 532File Operations 537Directory Entry Cache 542Working with VFS Objects 547Filesystem Operations 548File Operations 565Standard Functions 572Generic Read Routine 573The fault Mechanism 576Permission-Checking 578Summary 581Chapter 9: The Extended Filesystem Family 583Introduction 583Second Extended Filesystem 584Physical Structure 585Data Structures 592xvi
  • 19. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xviiContentsCreating a Filesystem 608Filesystem Actions 610Third Extended Filesystem 637Concepts 638Data Structures 639Summary 642Chapter 10: Filesystems without Persistent Storage 643The proc Filesystem 644Contents of /proc 644Data Structures 652Initialization 655Mounting the Filesystem 657Managing /proc Entries 660Reading and Writing Information 664Task-Related Information 666System Control Mechanism 671Simple Filesystems 680Sequential Files 680Writing Filesystems with Libfs 684The Debug Filesystem 687Pseudo Filesystems 689Sysfs 689Overview 690Data Structures 690Mounting the Filesystem 695File and Directory Operations 697Populating Sysfs 704Summary 706Chapter 11: Extended Attributes and Access Control Lists 707Extended Attributes 707Interface to the Virtual Filesystem 708Implementation in Ext3 714Implementation in Ext2 721Access Control Lists 722Generic Implementation 722Implementation in Ext3 726Implementation in Ext2 732Summary 732xvii
  • 20. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xviiiContentsChapter 12: Networks 733Linked Computers 734ISO/OSI and TCP/IP Reference Model 734Communication via Sockets 738Creating a Socket 738Using Sockets 740Datagram Sockets 744The Layer Model of Network Implementation 745Networking Namespaces 747Socket Buffers 749Data Management Using Socket Buffers 750Management Data of Socket Buffers 753Network Access Layer 754Representation of Network Devices 755Receiving Packets 760Sending Packets 768Network Layer 768IPv4 769Receiving Packets 771Local Delivery to the Transport Layer 772Packet Forwarding 774Sending Packets 775Netfilter 778IPv6 783Transport Layer 785UDP 785TCP 787Application Layer 799Socket Data Structures 799Sockets and Files 803The socketcall System Call 804Creating Sockets 805Receiving Data 807Sending Data 808Networking from within the Kernel 808Communication Functions 808The Netlink Mechanism 810Summary 817xviii
  • 21. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xixContentsChapter 13: System Calls 819Basics of System Programming 820Tracing System Calls 820Supported Standards 823Restarting System Calls 824Available System Calls 826Implementation of System Calls 830Structure of System Calls 830Access to Userspace 837System Call Tracing 838Summary 846Chapter 14: Kernel Activities 847Interrupts 848Interrupt Types 848Hardware IRQs 849Processing Interrupts 850Data Structures 853Interrupt Flow Handling 860Initializing and Reserving IRQs 864Servicing IRQs 866Software Interrupts 875Starting SoftIRQ Processing 877The SoftIRQ Daemon 878Tasklets 879Generating Tasklets 880Registering Tasklets 880Executing Tasklets 881Wait Queues and Completions 882Wait Queues 882Completions 887Work Queues 889Summary 891Chapter 15: Time Management 893Overview 893Types of Timers 893Configuration Options 896xix
  • 22. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxContentsImplementation of Low-Resolution Timers 897Timer Activation and Process Accounting 897Working with Jiffies 900Data Structures 900Dynamic Timers 902Generic Time Subsystem 907Overview 908Configuration Options 909Time Representation 910Objects for Time Management 911High-Resolution Timers 920Data Structures 921Setting Timers 925Implementation 926Periodic Tick Emulation 931Switching to High-Resolution Timers 932Dynamic Ticks 933Data Structures 934Dynamic Ticks for Low-Resolution Systems 935Dynamic Ticks for High-Resolution Systems 938Stopping and Starting Periodic Ticks 939Broadcast Mode 943Implementing Timer-Related System Calls 944Time Bases 944The alarm and setitimer System Calls 945Getting the Current Time 947Managing Process Times 947Summary 948Chapter 16: Page and Buffer Cache 949Structure of the Page Cache 950Managing and Finding Cached Pages 951Writing Back Modified Data 952Structure of the Buffer Cache 954Address Spaces 955Data Structures 956Page Trees 958Operations on Address Spaces 961Implementation of the Page Cache 966Allocating Pages 966Finding Pages 967xx
  • 23. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxiContentsWaiting on Pages 968Operations with Whole Pages 969Page Cache Readahead 970Implementation of the Buffer Cache 974Data Structures 975Operations 976Interaction of Page and Buffer Cache 977Independent Buffers 982Summary 988Chapter 17: Data Synchronization 989Overview 989The pdflush Mechanism 991Starting a New Thread 993Thread Initialization 994Performing Actual Work 995Periodic Flushing 996Associated Data Structures 996Page Status 996Writeback Control 998Adjustable Parameters 1000Central Control 1000Superblock Synchronization 1002Inode Synchronization 1003Walking the Superblocks 1003Examining Superblock Inodes 1003Writing Back Single Inodes 1006Congestion 1009Data Structures 1009Thresholds 1010Setting and Clearing the Congested State 1011Waiting on Congested Queues 1012Forced Writeback 1013Laptop Mode 1015System Calls for Synchronization Control 1016Full Synchronization 1016Synchronization of Inodes 1018Synchronization of Individual Files 1019Synchronization of Memory Mappings 1021Summary 1022xxi
  • 24. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxiiContentsChapter 18: Page Reclaim and Swapping 1023Overview 1023Swappable Pages 1024Page Thrashing 1025Page-Swapping Algorithms 1026Page Reclaim and Swapping in the Linux Kernel 1027Organization of the Swap Area 1028Checking Memory Utilization 1029Selecting Pages to Be Swapped Out 1029Handling Page Faults 1029Shrinking Kernel Caches 1030Managing Swap Areas 1030Data Structures 1030Creating a Swap Area 1035Activating a Swap Area 1036The Swap Cache 1039Identifying Swapped-Out Pages 1041Structure of the Cache 1044Adding New Pages 1045Searching for a Page 1050Writing Data Back 1051Page Reclaim 1052Overview 1053Data Structures 1055Determining Page Activity 1057Shrinking Zones 1062Isolating LRU Pages and Lumpy Reclaim 1065Shrinking the List of Active Pages 1068Reclaiming Inactive Pages 1072The Swap Token 1079Handling Swap-Page Faults 1082Swapping Pages in 1083Reading the Data 1084Swap Readahead 1085Initiating Memory Reclaim 1086Periodic Reclaim with kswapd 1087Swap-out in the Event of Acute Memory Shortage 1090Shrinking Other Caches 1092Data Structures 1092Registering and Removing Shrinkers 1093Shrinking Caches 1093Summary 1095xxii
  • 25. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxiiiContentsChapter 19: Auditing 1097Overview 1097Audit Rules 1099Implementation 1100Data Structures 1100Initialization 1106Processing Requests 1107Logging Events 1108System Call Auditing 1110Summary 1116Appendix A: Architecture Specifics 1117Overview 1117Data Types 1118Alignment 1119Memory Pages 1119System Calls 1120String Processing 1120Thread Representation 1122IA-32 1122IA-64 1124ARM 1126Sparc64 1128Alpha 1129Mips 1131PowerPC 1132AMD64 1134Bit Operations and Endianness 1135Manipulation of Bit Chains 1135Conversion between Byte Orders 1136Page Tables 1137Miscellaneous 1137Checksum Calculation 1137Context Switch 1137Finding the Current Process 1138Summary 1139Appendix B: Working with the Source Code 1141Organization of the Kernel Sources 1141Configuration with Kconfig 1144A Sample Configuration File 1144xxiii
  • 26. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxivContentsLanguage Elements of Kconfig 1147Processing Configuration Information 1152Compiling the Kernel with Kbuild 1154Using the Kbuild System 1154Structure of the Makefiles 1156Useful Tools 1160LXR 1161Patch and Diff 1163Git 1165Debugging and Analyzing the Kernel 1169GDB and DDD 1170Local Kernel 1171KGDB 1172User-Mode Linux 1173Summary 1174Appendix C: Notes on C 1175How the GNU C Compiler Works 1175From Source Code to Machine Program 1176Assembly and Linking 1180Procedure Calls 1180Optimization 1185Inline Functions 1192Attributes 1192Inline Assembler 1194__builtin Functions 1198Pointer Arithmetic 1200Standard Data Structures and Techniques of the Kernel 1200Reference Counters 1200Pointer Type Conversions 1201Alignment Issues 1202Bit Arithmetic 1203Pre-Processor Tricks 1206Miscellaneous 1207Doubly Linked Lists 1209Hash Lists 1214Red-Black Trees 1214Radix Trees 1216Summary 1221xxiv
  • 27. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxvContentsAppendix D: System Startup 1223Architecture-Specific Setup on IA-32 Systems 1224High-Level Initialization 1225Subsystem Initialization 1225Summary 1239Appendix E: The ELF Binary Format 1241Layout and Structure 1241ELF Header 1243Program Header Table 1244Sections 1246Symbol Table 1248String Tables 1249Data Structures in the Kernel 1250Data Types 1250Headers 1251String Tables 1257Symbol Tables 1257Relocation Entries 1259Dynamic Linking 1263Summary 1265Appendix F: The Kernel Development Process 1267Introduction 1267Kernel Trees and the Structure of Development 1268The Command Chain 1269The Development Cycle 1269Online Resources 1272The Structure of Patches 1273Technical Issues 1273Submission and Review 1277Linux and Academia 1281Some Examples 1282Adopting Research 1284Summary 1287References 1289Index 1293xxv
  • 28. Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xxvi
  • 29. Mauerer flast.tex V2 - 09/05/2008 12:08pm Page xxviiIntroductionUnix is simple and coherent, but it takes a genius(or at any rate a programmer) to understandand appreciate the simplicity.— Dennis RitchieNote from the authors: Yes, we have lost our minds.Be forewarned: You will lose yours too.— Benny Goodheart & James CoxUnix is distinguished by a simple, coherent, and elegant design — truly remarkable features that haveenabled the system to influence the world for more than a quarter of a century. And especially thanksto the growing presence of Linux, the idea is still picking up momentum, with no end of the growthin sight.Unix and Linux carry a certain fascination, and the two quotations above hopefully capture the spirit ofthis attraction. Consider Dennis Ritchie’s quote: Is the coinventor of Unix at Bell Labs completely rightin saying that only a genius can appreciate the simplicity of Unix? Luckily not, because he puts himselfinto perspective immediately by adding that programmers also qualify to value the essence of Unix.Understanding the meagerly documented, demanding, and complex sources of Unix as well as of Linuxis not always an easy task. But once one has started to experience the rich insights that can be gained fromthe kernel sources, it is hard to escape the fascination of Linux. It seems fair to warn you that it’s easyto get addicted to the joy of the operating system kernel once starting to dive into it. This was alreadynoted by Benny Goodheart and James Cox, whose preface to their book The Magic Garden Explained(second quotation above) explained the internals of Unix System V. And Linux is definitely also capableof helping you to lose your mind!This book acts as a guide and companion that takes you through the kernel sources and sharpens yourawareness of the beauty, elegance, and — last but not least — esthetics of their concepts. There are, how-ever, some prerequisites to foster an understanding of the kernel. C should not just be a letter; neithershould it be a foreign language. Operating systems are supposed to be more than just a ‘‘Start” button, anda small amount of algorithmics can also do no harm. Finally, it is preferable if computer architecture is notjust about how to build the most fancy case. From an academic point of view, this comes closest to thelectures ‘‘Systems Programming,” ‘‘Algorithmics,” and ‘‘Fundamentals of Operating Systems.” The pre-vious edition of this book has been used to teach the fundamentals of Linux to advanced undergraduatestudents in several universities, and I hope that the current edition will serve the same purpose.Discussing all aforementioned topics in detail is outside the scope of this book, and when you considerthe mass of paper you are holding in your hands right now (or maybe you are not holding it, for thisvery reason), you’ll surely agree that this would not be a good idea. When a topic not directly related to
  • 30. Mauerer flast.tex V2 - 09/05/2008 12:08pm Page xxviiiIntroductionthe kernel, but required to understand what the kernel does, is encountered in this book, I will brieflyintroduce you to it. To gain a more thorough understanding, however, consult the books on computingfundamentals that I recommend. Naturally, there is a large selection of texts, but some books that I foundparticularly insightful and illuminating include C Programming Language, by Brian W. Kernighan andDenis M. Ritchie [KR88]; Modern Operating Systems, by Andrew S. Tanenbaum [Tan07] on the basics ofoperating systems in general; Operating Systems: Design and Implementation, by Andrew S. Tanenbaum andAlbert S. Woodhull [TW06] on Unix (Minix) in particular; Advanced Programming in the Unix Environment,by W. Richard Stevens and Stephen A. Rago [SR05] on userspace programming; and the two volumesComputer Architecture and Computer Organization and Design, on the foundations of computer architectureby John L. Hennessy and David A. Patterson [HP06, PH07]. All have established themselves as classicsin the literature.Additionally, Appendix C contains some information about extensions of the GNU C compiler that areused by the kernel, but do not necessarily find widespread use in general programming.When the first edition of this book was written, a schedule for kernel releases was more or less nonexis-tent. This has changed drastically during the development of kernel 2.6, and as I discuss in Appendix F,kernel developers have become pretty good at issuing new releases at periodic, predictable intervals. Ihave focused on kernel 2.6.24, but have also included some references to 2.6.25 and 2.6.26, which werereleased after this book was written but before all technical publishing steps had been completed. Since anumber of comprehensive changes to the whole kernel have been merged into 2.6.24, picking this releaseas the target seems a good choice. While a detail here or there will have changed in more recent kernelversions as compared to the code discussed in this book, the big picture will remain the same for quitesome time.In the discussion of the various components and subsystems of the kernel, I have tried to avoid over-loading the text with unimportant details. Likewise, I have tried not to lose track of the connection withsource code. It is a very fortunate situation that, thanks to Linux, we are able to inspect the source of areal, working, production operating system, and it would be sad to neglect this essential aspect of thekernel. To keep the book’s volume below the space of a whole bookshelf, I have selected only the mostcrucial parts of the sources. Appendix F introduces some techniques that ease reading of and workingwith the real source, an indispensable step toward understanding the structure and implementation ofthe Linux kernel.One particularly interesting observation about Linux (and Unix in general) is that it is well suited toevoke emotions. Flame wars on the Internet and heated technical debates about operating systems may beone thing, but for which other operating system does there exist a handbook (The Unix-Haters Handbook,edited by Simson Garfinkel et al. [GWS94]) on how best to hate it? When I wrote the preface to the firstedition, I noted that it is not a bad sign for the future that a certain international software companyresponds to Linux with a mixture of abstruse accusations and polemics. Five years later, the situationhas improved, and the aforementioned vendor has more or less officially accepted the fact that Linux hasbecome a serious competitor in the operating system world. And things are certainly going to improveeven more during the next five years. . . .Naturally (and not astonishingly), I admit that I am definitely fascinated by Linux (and, sometimes, amalso sure that I have lost my mind because of this), and if this book helps to carry this excitement to thereader, the long hours (and especially nights) spent writing it were worth every minute!Suggestions for improvements and constrictive critique can be passed to, or Naturally, I’m also happy if you tell me that you liked the book!xxviii
  • 31. Mauerer flast.tex V2 - 09/05/2008 12:08pm Page xxixIntroductionWhat This Book CoversThis book discusses the concepts, structure, and implementation of the Linux kernel. In particular, theindividual chapters cover the following topics:❑ Chapter 1 provides an overview of the Linux kernel and describes the big picture that is investi-gated more closely in the following chapters.❑ Chapter 2 talks about the basics of multitasking, scheduling, and process management, andinvestigates how these fundamental techniques and abstractions are implemented.❑ Chapter 3 discusses how physical memory is managed. Both the interaction with hardware andthe in-kernel distribution of RAM via the buddy system and the slab allocator are covered.❑ Chapter 4 proceeds to describe how userland processes experience virtual memory, and thecomprehensive data structures and actions required from the kernel to implement this view.❑ Chapter 5 introduces the mechanisms required to ensure proper operation of the kernel onmultiprocessor systems. Additionally, it covers the related question of how processes can com-municate with each other.❑ Chapter 6 walks you through the means for writing device drivers that are required to add sup-port for new hardware to the kernel.❑ Chapter 7 explains how modules allow for dynamically adding new functionality to the kernel.❑ Chapter 8 discusses the virtual filesystem, a generic layer of the kernel that allows for supportinga wide range of different filesystems, both physical and virtual.❑ Chapter 9 describes the extended filesystem family, that is, the Ext2 and Ext3 filesystems that arethe standard workhorses of many Linux installations.❑ Chapter 10 goes on to discuss procfs and sysfs, two filesystems that are not designed to storeinformation, but to present meta-information about the kernel to userland. Additionally, a num-ber of means to ease writing filesystems are presented.❑ Chapter 11 shows how extended attributes and access control lists that can help to improve sys-tem security are implemented.❑ Chapter 12 discusses the networking implementation of the kernel, with a specific focus on IPv4,TCP, UDP, and netfilter.❑ Chapter 13 introduces how systems calls that are the standard way to request a kernel actionfrom userland are implemented.❑ Chapter 14 analyzes how kernel activities are triggered with interrupts, and presents means ofdeferring work to a later point in time.❑ Chapter 15 shows how the kernel handles all time-related requirements, both with low and highresolution.❑ Chapter 16 talks about speeding up kernel operations with the help of the page and buffercaches.❑ Chapter 17 discusses how cached data in memory are synchronized with their sources on persis-tent storage devices.❑ Chapter 18 introduces how page reclaim and swapping work.xxix
  • 32. Mauerer flast.tex V2 - 09/05/2008 12:08pm Page xxxIntroduction❑ Chapter 19 gives an introduction to the audit implementation, which allows for observing indetail what the kernel is doing.❑ Appendix A discusses peculiarities of various architectures supported by the kernel.❑ Appendix B walks through various tools and means of working efficiently with the kernelsources.❑ Appendix C provides some technical notes about the programming language C, and alsodiscusses how the GNU C compiler is structured.❑ Appendix D describes how the kernel is booted.❑ Appendix E gives an introduction to the ELF binary format.❑ Appendix F discusses numerous social aspects of kernel development and the Linux
  • 33. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 1Introduction and OverviewOperating systems are not only regarded as a fascinating part of information technology, but arealso the subject of controversial discussion among a wide public.1 Linux has played a major rolein this development. Whereas just 10 years ago a strict distinction was made between relativelysimple academic systems available in source code and commercial variants with varying perfor-mance capabilities whose sources were a well-guarded secret, nowadays anybody can downloadthe sources of Linux (or of any other free systems) from the Internet in order to study them.Linux is now installed on millions of systems and is used by home users and professionals alikefor a wide range of tasks. From miniature embedded systems in wristwatches to massively parallelmainframes, there are countless ways of exploiting Linux productively. And this makes the sourcesso interesting. A sound, well-established concept (Unix) melded with powerful innovations and astrong penchant for dealing with problems that do not arise in academic teaching systems — this iswhat makes Linux so fascinating.This book describes the central functions of the kernel, explains its underlying structures, and exam-ines its implementation. Because complex subjects are discussed, I assume that the reader alreadyhas some experience in operating systems and systems programming in C (it goes without sayingthat I assume some familiarity with using Linux systems). I touch briefly on several general conceptsrelevant to common operating system problems, but my prime focus is on the implementation of theLinux kernel. Readers unfamiliar with a particular topic will find explanations on relevant basics inone of the many general texts on operating systems; for example, in Tanenbaum’s outstanding1It is not the intention of this book to participate in ideological discussions such as whether Linux can be regarded as afull operating system, although it is, in fact, just a kernel that cannot function productively without relying on other com-ponents. When I speak of Linux as an operating system without explicitly mentioning the acronyms of similar projects(primarily the GNU project, which despite strong initial resistance regarding the kernel reacts extremely sensitively whenLinux is used instead of GNU/Linux), this should not be taken to mean that I do not appreciate the importance of thework done by this project. Our reasons are simple and pragmatic. Where do we draw the line when citing those involvedwithout generating such lengthy constructs as GNU/IBM/RedHat/HP/KDE/Linux? If this footnote makes little sense, refer, where you will find a summary of the positions of the GNU project.After all ideological questions have been settled, I promise to refrain from using half-page footnotes in the rest of this book.
  • 34. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 2Chapter 1: Introduction and Overviewintroductions ([TW06] and [Tan07]). A solid foundation of C programming is required. Because thekernel makes use of many advanced techniques of C and, above all, of many special features of the GNUC compiler, Appendix C discusses the finer points of C with which even good programmers may notbe familiar. A basic knowledge of computer structures will be useful as Linux necessarily interacts verydirectly with system hardware — particularly with the CPU. There are also a large number of introduc-tory works dealing with this subject; some are listed in the reference section. When I deal with CPUsin greater depth (in most cases I take the IA-32 or AMD64 architecture as an example because Linux isused predominantly on these system architectures), I explain the relevant hardware details. When I dis-cuss mechanisms that are not ubiquitous in daily live, I will explain the general concept behind them,but expect that readers will also consult the quoted manual pages for more advice on how a particularfeature is used from userspace.The present chapter is designed to provide an overview of the various areas of the kernel and to illustratetheir fundamental relationships before moving on to lengthier descriptions of the subsystems in thefollowing chapters.Since the kernel evolves quickly, one question that naturally comes to mind is which version is cov-ered in this book. I have chosen kernel 2.6.24, which was released at the end of January 2008. Thedynamic nature of kernel development implies that a new kernel version will be available by the timeyou read this, and naturally, some details will have changed — this is unavoidable. If it were not thecase, Linux would be a dead and boring system, and chances are that you would not want to readthe book. While some of the details will have changed, concepts will not have varied essentially. This isparticularly true because 2.6.24 has seen some very fundamental changes as compared to earlier versions.Developers do not rip out such things overnight, naturally.1.1 Tasks of the KernelOn a purely technical level, the kernel is an intermediary layer between the hardware and the software.Its purpose is to pass application requests to the hardware and to act as a low-level driver to addressthe devices and components of the system. Nevertheless, there are other interesting ways of viewing thekernel.❑ The kernel can be regarded as an enhanced machine that, in the view of the application, abstractsthe computer on a high level. For example, when the kernel addresses a hard disk, it must decidewhich path to use to copy data from disk to memory, where the data reside, which commandsmust be sent to the disk via which path, and so on. Applications, on the other hand, need onlyissue the command that data are to be transferred. How this is done is irrelevant to the appli-cation — the details are abstracted by the kernel. Application programs have no contact withthe hardware itself,2 only with the kernel, which, for them, represents the lowest level in thehierarchy they know — and is therefore an enhanced machine.❑ Viewing the kernel as a resource manager is justified when several programs are run concurrentlyon a system. In this case, the kernel is an instance that shares available resources — CPU time,disk space, network connections, and so on — between the various system processes while at thesame time ensuring system integrity.2The CPU is an exception since it is obviously unavoidable that programs access it. Nevertheless, the full range of possible instruc-tions is not available for applications.2
  • 35. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 3Chapter 1: Introduction and Overview❑ Another view of the kernel is as a library providing a range of system-oriented commands. As isgenerally known, system calls are used to send requests to the computer; with the help of the Cstandard library, these appear to the application programs as normal functions that are invokedin the same way as any other function.1.2 Implementation StrategiesCurrently, there are two main paradigms on which the implementation of operating systems is based:1. Microkernels — In these, only the most elementary functions are implemented directlyin a central kernel — the microkernel. All other functions are delegated to autonomousprocesses that communicate with the central kernel via clearly defined communicationinterfaces — for example, various filesystems, memory management, and so on. (Ofcourse, the most elementary level of memory management that controls communicationwith the system itself is in the microkernel. However, handling on the system call level isimplemented in external servers.) Theoretically, this is a very elegant approach becausethe individual parts are clearly segregated from each other, and this forces programmersto use ‘‘clean‘‘ programming techniques. Other benefits of this approach are dynamicextensibility and the ability to swap important components at run time. However, owingto the additional CPU time needed to support complex communication between thecomponents, microkernels have not really established themselves in practice although theyhave been the subject of active and varied research for some time now.2. Monolithic Kernels — They are the alternative, traditional concept. Here, the entire codeof the kernel — including all its subsystems such as memory management, filesystems, ordevice drivers — is packed into a single file. Each function has access to all other parts ofthe kernel; this can result in elaborately nested source code if programming is not done withgreat care.Because, at the moment, the performance of monolithic kernels is still greater than that of microkernels,Linux was and still is implemented according to this paradigm. However, one major innovation has beenintroduced. Modules with kernel code that can be inserted or removed while the system is up-and-runningsupport the dynamic addition of a whole range of functions to the kernel, thus compensating for some ofthe disadvantages of monolithic kernels. This is assisted by elaborate means of communication betweenthe kernel and userland that allows for implementing hotplugging and dynamic loading of modules.1.3 Elements of the KernelThis section provides a brief overview of the various elements of the kernel and outlines the areas we willexamine in more detail in the following chapters. Despite its monolithic approach, Linux is surprisinglywell structured. Nevertheless, it is inevitable that its individual elements interact with each other; theyshare data structures, and (for performance reasons) cooperate with each other via more functions thanwould be necessary in a strictly segregated system. In the following chapters, I am obliged to makefrequent reference to the other elements of the kernel and therefore to other chapters, although I havetried to keep the number of forward references to a minimum. For this reason, I introduce the individualelements briefly here so that you can form an impression of their role and their place in the overall3
  • 36. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 4Chapter 1: Introduction and Overviewconcept. Figure 1-1 provides a rough initial overview about the layers that comprise a complete Linuxsystem, and also about some important subsystems of the kernel as such. Notice, however, that theindividual subsystems will interact in a variety of additional ways in practice that are not shown in thefigure.ApplicationsUserspaceC LibraryKernel spaceHardwareDevicedriversCore kernelSystem CallsNetworking Device DriversFilesystemsVFSMemory mgmtArchitecture specific codeProcess mgmtFigure 1-1: High-level overview of the structure of the Linux kernel and thelayers in a complete Linux system.1.3.1 Processes, Task Switching, and SchedulingApplications, servers, and other programs running under Unix are traditionally referred to as processes.Each process is assigned address space in the virtual memory of the CPU. The address spaces of the indi-vidual processes are totally independent so that the processes are unaware of each other — as far as eachprocess is concerned, it has the impression of being the only process in the system. If processes want tocommunicate to exchange data, for example, then special kernel mechanisms must be used.Because Linux is a multitasking system, it supports what appears to be concurrent execution of severalprocesses. Since only as many processes as there are CPUs in the system can really run at the sametime, the kernel switches (unnoticed by users) between the processes at short intervals to give them theimpression of simultaneous processing. Here, there are two problem areas:1. The kernel, with the help of the CPU, is responsible for the technical details of task switch-ing. Each individual process must be given the illusion that the CPU is always available. Thisis achieved by saving all state-dependent elements of the process before CPU resources arewithdrawn and the process is placed in an idle state. When the process is reactivated, theexact saved state is restored. Switching between processes is known as task switching.2. The kernel must also decide how CPU time is shared between the existing processes. Impor-tant processes are given a larger share of CPU time, less important processes a smaller share.The decision as to which process runs for how long is known as scheduling.1.3.2 UNIX ProcessesLinux employs a hierarchical scheme in which each process depends on a parent process. The kernelstarts the init program as the first process that is responsible for further system initialization actionsand display of the login prompt or (in more widespread use today) display of a graphical login interface.init is therefore the root from which all processes originate, more or less directly, as shown graphically4
  • 37. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 5Chapter 1: Introduction and Overviewby the pstree program. init is the top of a tree structure whose branches spread further and furtherdown.wolfgang@meitner> pstreeinit-+-acpid|-bonobo-activati|-cron|-cupsd|-2*[dbus-daemon]|-dbus-launch|-dcopserver|-dhcpcd|-esd|-eth1|-events/0|-gam_server|-gconfd-2|-gdm---gdm-+-X| ‘-startkde-+-kwrapper| ‘-ssh-agent|-gnome-vfs-daemo|-gpg-agent|-hald-addon-acpi|-kaccess|-kded|-kdeinit-+-amarokapp---2*[amarokapp]| |-evolution-alarm| |-kinternet| |-kio_file| |-klauncher| |-konqueror| |-konsole---bash-+-pstree| | ‘-xemacs| |-kwin| |-nautilus| ‘-netapplet|-kdesktop|-kgpg|-khelper|-kicker|-klogd|-kmix|-knotify|-kpowersave|-kscd|-ksmserver|-ksoftirqd/0|-kswapd0|-kthread-+-aio/0| |-ata/0| |-kacpid| |-kblockd/0| |-kgameportd| |-khubd5
  • 38. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 6Chapter 1: Introduction and Overview| |-kseriod| |-2*[pdflush]| ‘-reiserfs/0...How this tree structure spreads is closely connected with how new processes are generated. For thispurpose, Unix uses two mechanisms called fork and exec.1. fork — Generates an exact copy of the current process that differs from the parent processonly in its PID (process identification). After the system call has been executed, there are twoprocesses in the system, both performing the same actions. The memory contents of the ini-tial process are duplicated — at least in the view of the program. Linux uses a well-knowntechnique known as copy on write that allows it to make the operation much more efficientby deferring the copy operations until either parent or child writes to a page — read-onlyaccessed can be satisfied from the same page for both.A possible scenario for using fork is, for example, when a user opens a second browser win-dow. If the corresponding option is selected, the browser executes a fork to duplicate itscode and then starts the appropriate actions to build a new window in the child process.2. exec — Loads a new program into an existing content and then executes it. The memorypages reserved by the old program are flushed, and their contents are replaced with newdata. The new program then starts executing.ThreadsProcesses are not the only form of program execution supported by the kernel. In addition to heavy-weightprocesses — another name for classical Unix processes — there are also threads, sometimes referred to aslight-weight processes. They have also been around for some time, and essentially, a process may consist ofseveral threads that all share the same data and resources but take different paths through the programcode. The thread concept is fully integrated into many modern languages — Java, for instance. In simpleterms, a process can be seen as an executing program, whereas a thread is a program function or routinerunning in parallel to the main program. This is useful, for example, when Web browsers need to loadseveral images in parallel. Usually, the browser would have to execute several fork and exec calls togenerate parallel instances; these would then be responsible for loading the images and making datareceived available to the main program using some kind of communication mechanisms. Threads makethis situation easier to handle. The browser defines a routine to load images, and the routine is startedas a thread with multiple strands (each with different arguments). Because the threads and the mainprogram share the same address space, data received automatically reside in the main program. There istherefore no need for any communication effort whatsoever, except to prevent the threads from steppingonto their feet mutually by accessing identical memory locations, for instance. Figure 1-2 illustrates thedifference between a program with and without threads.W/O Threads With ThreadsAddress SpaceControl FlowFigure 1-2: Processes with and without threads.6
  • 39. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 7Chapter 1: Introduction and OverviewLinux provides the clone method to generate threads. This works in a similar way to fork but enables aprecise check to be made of which resources are shared with the parent process and which are generatedindependently for the thread. This fine-grained distribution of resources extends the classical threadconcept and allows for a more or less continuous transition between thread and processes.NamespacesDuring the development of kernel 2.6, support for namespaces was integrated into numerous subsystems.This allows different processes to have different views of the system. Traditionally, Linux (and Unix ingeneral) use numerous global quantities, for instance, process identifiers: Every process in the system isequipped with a unique identifier (ID), and this ID can be employed by users (or other processes) to referto the process — by sending it a signal, for instance. With namespaces, formerly global resources aregrouped differently: Every namespace can contain a specific set of PIDs, or can provide different viewsof the filesystem, where mounts in one namespace do not propagate into different namespaces.Namespaces are useful; for example, they are beneficial for hosting providers: Instead of setting upone physical machine per customer, they can instead use containers implemented with namespaces tocreate multiple views of the system where each seems to be a complete Linux installation from withinthe container and does not interact with other containers: They are separated and segregated from eachother. Every instance looks like a single machine running Linux, but in fact, many such instances canoperate simultaneously on a physical machine. This helps use resources more effectively. In contrast tofull virtualization solutions like KVM, only a single kernel needs to run on the machine and is responsibleto manage all containers.Not all parts of the kernel are yet fully aware of namespaces, and I will discuss to what extent support isavailable when we analyze the various subsystems.1.3.3 Address Spaces and Privilege LevelsBefore we start to discuss virtual address spaces, there are some notational conventions to fix. Through-out this book I use the abbreviations KiB, MiB, and GiB as units of size. The conventional units KB, MB,and GB are not really suitable in information technology because they represent decimal powers 103 ,106, and 109 although the binary system is the basis ubiquitous in computing. Accordingly KiB standsfor 210, MiB for 220, and GiB for 230bytes.Because memory areas are addressed by means of pointers, the word length of the CPU determines themaximum size of the address space that can be managed. On 32-bit systems such as IA-32, PPC, andm68k, these are 232 = 4 GiB, whereas on more modern 64-bit processors such as Alpha, Sparc64, IA-64,and AMD64, 264 bytes can be managed.The maximal size of the address space is not related to how much physical RAM is actually available,and therefore it is known as the virtual address space. One more reason for this terminology is that everyprocess in the system has the impression that it would solely live in this address space, and other pro-cesses are not present from their point of view. Applications do not need to care about other applicationsand can work as if they would run as the only process on the computer.Linux divides virtual address space into two parts known as kernel space and userspace as illustrated inFigure 1-3.7
  • 40. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 8Chapter 1: Introduction and Overview0TASK_SIZE232respectively 264UserspaceKernel-spaceFigure 1-3: Division of virtualaddress space.Every user process in the system has its own virtual address range that extends from 0 to TASK_SIZE.The area above (from TASK_SIZE to 232or 264) is reserved exclusively for the kernel — and may not beaccessed by user processes. TASK_SIZE is an architecture-specific constant that divides the address spacein a given ratio — in IA-32 systems, for instance, the address space is divided at 3 GiB so that the virtualaddress space for each process is 3 GiB; 1 GiB is available to the kernel because the total size of the virtualaddress space is 4 GiB. Although actual figures differ according to architecture, the general concepts donot. I therefore use these sample values in our further discussions.This division does not depend on how much RAM is available. As a result of address space virtualization,each user process thinks it has 3 GiB of memory. The userspaces of the individual system processes aretotally separate from each other. The kernel space at the top end of the virtual address space is alwaysthe same, regardless of the process currently executing.Notice that the picture can be more complicated on 64-bit machines because these tend to use less than64 bits to actually manage their huge principal virtual address space. Instead of 64 bits, they employa smaller number, for instance, 42 or 47 bits. Because of this, the effectively addressable portion of theaddress space is smaller than the principal size. However, it is still larger than the amount of RAM thatwill ever be present in the machine, and is therefore completely sufficient. As an advantage, the CPU cansave some effort because less bits are required to manage the effective address space than are requiredto address the complete virtual address space. The virtual address space will contain holes that are notaddressable in principle in such cases, so the simple situation depicted in Figure 1-3 is not fully valid. Wewill come back to this topic in more detail in Chapter 4.Privilege LevelsThe kernel divides the virtual address space into two parts so that it is able to protect the individualsystem processes from each other. All modern CPUs offer several privilege levels in which processes canreside. There are various prohibitions in each level including, for example, execution of certain assemblylanguage instructions or access to specific parts of virtual address space. The IA-32 architecture uses asystem of four privilege levels that can be visualized as rings. The inner rings are able to access morefunctions, the outer rings less, as shown in Figure 1-4.Whereas the Intel variant distinguishes four different levels, Linux uses only two different modes —kernel mode and user mode. The key difference between the two is that access to the memory area aboveTASK_SIZE — that is, kernel space — is forbidden in user mode. User processes are not able to manipulateor read the data in kernel space. Neither can they execute code stored there. This is the sole domain8
  • 41. Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 9Chapter 1: Introduction and Overviewof the kernel. Th