Why kernelspace sucks?

OpenFest team
Why kernel space sucks
            (Or: Deconstructing two myths)
(Or: An abridged history of kernel-userspace interface blunders...)


                                                 Michael Kerrisk
  OpenFest 2011
                                                      © 2011
   5 Nov 2011
                                                 mtk@man7.org
  Sofia, Bulgaria
                                                 http://man7.org/
Who am I?
●    Professionally: programmer (primarily); also teacher
      and writer
●    Working with UNIX and Linux for nearly 25 years
●    Linux man-pages maintainer since 2004
    ●   117 releases to date
    ●   written or cowritten ~250 of ~950 man pages
    ●   lots of API testing, many bug reports
●    Author of a book on the kernel-userspace API
●    IOW: I've spent a lot of time looking at the interface


                                                          2
    man7.org
Intro: Why Userspace Sucks
●   Paper/talk by Dave Jones of Red Hat
    ●   First presented at Ottawa LS 2006
●   A lead-in to deconstructing a couple of myths
●   Why Userspace Sucks → WUSS



    ●   http://www.kernel.org/doc/ols/2006/ols2006v1-pages-441-450.pdf
    ●   http://www.codemonkey.org.uk/projects/talks/ols2k6.tar.gz
    ●   http://lwn.net/Articles/192214/


                                                                    3
    man7.org
Motivation for WUSS
●   We (kernel developers) have created a
    magnificently performant kernel
●   But, can we make it better?
    ●   Why does it take so long to boot, start applications,
        and shut down?
    ●   Why does my idle laptop consume so much battery
        power?




                                                          4
    man7.org
Findings from WUSS
●   DJ starts instrumenting the kernel, and finds...
    ●   Boot up: 80k stat(), 27k open(), 1.4k exec()
    ●   Shutdown: 23k stat(), 9k open()
●   Userspace programmers wreck performance
    doing crazy things!
    ●   open() and reparse same file multiple times!
    ●   read config files for many devices not even present!
    ●   stat() (or even open()) 100s of files they never need
    ●   timers triggering regular unnecessary wake-ups


                                                         5
    man7.org
Conclusions from WUSS
●   Room for a lot of improvement in userspace!
●   Userspace programmers should be aware of
    and using trace and analysis tools
    ●   (perf, LTTng, ftrace, systemtap, strace, valgrind,
        powertop, etc.)




                                                             6
    man7.org
Kernelspace             Userspace


“We (kernel developers) are much smarter than
    those crazy userspace programmers”
Kernelspace     Userspace


Something's wrong with this
        picture...
Let's question a couple of myths...
●   Myth 1: Kernel programmers (can) always get
    things right (in the end)
●   Myth 2: Code is always the best way to
    contribute to Free Software




                                             9
    man7.org
Myth 1
    Kernel programmers
(can) always get things right
         (in the end)

 Except, there's (at least) one place
   where they don't: the interface
The kernel-userspace interface
●   Application programming interface (API)
    presented by kernel to userspace programs
    ●   System calls (← I'll focus here)
    ●   Pseudo-file systems (/proc, /sys, etc.)
    ●   ioctl() interfaces (device drivers)
    ●   Netlink sockets
    ●   Obscure pieces (AUXV, VDSO, ...)




                                                  11
    man7.org
API designs must be done
     right first time
Why must APIs be right first time?
●   Code fixes != API fixes




                                    13
    man7.org
Why is fixing APIs so hard?
●   Usually, “fixing” an interface means breaking
    the interface for some applications




       “We care about user-space interfaces to an insane
      degree. We go to extreme lengths to maintain even
          badly designed or unintentional interfaces.
      Breaking user programs simply isn't acceptable.”
                      [LKML, Dec 2005]

                                                       14
    man7.org
We have to live with our mistakes!

    So: any API mistake by kernel
hackers creates pain that thousands
of userspace programmers must live
          with for decades
So, what does it mean
 to get an API right?
Doing (kernel-userspace) APIs right
●   Properly designed and implemented API should:
    ●   be bug free!
    ●   have a well thought out design
        –   simple as possible (but no simpler)
        –   easy to use / difficult to misuse
    ●   be consistent with related/similar APIs
    ●   integrate well with existing APIs
        –   e.g., interactions with fork(), exec(), threads, signals, FDs?
    ●   be as general as possible
    ●   be extensible, where needed; accommodate future growth trends
    ●   adhere to relevant standards (as far as possible)
    ●   be as good as, or better than, earlier APIs with similar functionality
    ●   be maintainable over time (a multilayered question)


                                                                             17
    man7.org
So how do kernel
developers score?
  (BSDers, please laugh quietly)
Bugs
Bugs
●   utimensat(2) [2.6.22]
    ●   Set file timestamps
    ●   Multiple bugs!
         –     http://linux-man-pages.blogspot.com/2008/06/whats-wrong-with-kernel-userland_30.html

    ●   Fixed in 2.6.26
●   signalfd() [2.6.22]
    ●   Receive signals via a file descriptor
    ●   Didn't correctly obtain data sent with sigqueue(2)
    ●   Fixed in 2.6.25


                                                                                                      20
    man7.org
Bugs
●   Examples of other interfaces with significant,
    easy to find bugs at release:
    ●   inotify [2.6.13]
    ●   splice() [2.6.17] (http://marc.info/?l=linux-mm&m=114238448331607&w=2)
    ●   timerfd [2.6.22] (http://marc.info/?l=linux-kernel&m=118517213626087&w=2)




                                                                                    21
    man7.org
Bugs—what's going on?
●   There's a quality control issue; way too many
    bugs in released interfaces
●   Pre-release testing insufficient and haphazard:
    ●   Too few testers (maybe just kernel developer)
    ●   No unit tests
    ●   Insufficient test coverage
    ●   No clear specification against which to test
●   Even if bug is fixed, users may still need to care
    ●   special casing for kernel versions


                                                        22
    man7.org
Thinking about
    design
Code it now, think about it later
●   Vanishing arguments:
    ●   readdir(2) ignores count argument
    ●   getcpu(2) [2.6.19] ignores tcache argument
    ●   epoll_create() [2.6] ignores size arg. (must be > 0)
        since 2.6.8
●   Probably, argument wasn't needed to start with
    ●   Or: recognized as a bad idea and made a no-op




                                                         24
    man7.org
Code it now, think about it later
●   futimesat() [2.6.16]
     ●   Extends utimes()
     ●   Proposed for POSIX.1-2008
     ●   Implemented on Linux
     ●   POSIX.1 members realize API is insufficient
         → standardized different API
     ●   utimensat() added in Linux 2.6.22




                                                       25
    man7.org
Code it now, think about it later
●   Dec 2003: Linux 2.6 added epoll_wait()
    ●   File descriptor monitoring
         –     (improves on select())
    ●   Nov 2006, 2.6.19 added epoll_pwait() to allow
        manipulation of signal mask during call
    ●   But, already in 2001, POSIX specified pselect() to
        fix analogous, well-known problem in select()




                                                        26
    man7.org
Consistency
Interface inconsistencies
●   mlock(start, length):
    ●   Round start down to page size
    ●   Round length up to next page boundary
    ●   mlock(4000, 6000) affects bytes 0..12287
●   remap_file_pages(start, length, ...) [2.6]:
    ●   Round start down to page boundary
    ●   Round length down to page boundary(!)
    ●   remap_file_pages(4000, 6000, ...) ? → 0..4095
●   User expectation: similar APIs should behave
    similarly

                                                        28
    man7.org
Confusing the users
●   Various system calls allow one process to
    change attributes of another process
    ●   e.g., setpriority(), ioprio_set(), migrate_pages(),
        prlimit()
●   Unprivileged calls require credential matches:
    ●   Some combination of caller's UIDs/GIDs matches
        some combination of target's UIDs/GIDs




                                                              29
    man7.org
Confusing the users
●   But, much inconsistency; e.g.:
     ●   setpriority(): euid == t-uid || euid == t-euid
     ●   ioprio_set(): uid == t-uid || euid == t-uid
     ●   migrate_pages(): uid == t-uid || uid == t-suid || euid == t-uid ||
         euid == t-suid
     ●   prlimit(): (uid == t-uid && uid == t-euid && uid == t-suid) &&
         (gid == t-gid && gid == t-guid && gid == t-sgid) !!
●   Inconsistency may confuse users into writing
    bugs
     ●   Potentially, security related bugs!
●   http://linux-man-pages.blogspot.com/2010/11/system-call-credential-checking-tale-of.html



                                                                                           30
    man7.org
Generality
Is the interface sufficiently general?
●   2.6.22 added timerfd(ufd, flags, utimerspec)
    ●   Create timer that notifies via a file descriptor
●   But API didn't allow user to:
    ●   Retrieve previous value when setting new timer value
    ●   Do a “get” to retrieve time until next expiration
         –     http://marc.info/?l=linux-kernel&m=118517213626087&w=2
         –     http://lwn.net/Articles/245533/
●   Older APIs ([gs]etitimer(), POSIX timers) did
    provide this functionality!


                                                                        32
    man7.org
Is the interface sufficiently general?
●   Solution:
    ●   timerfd() disabled in kernel 2.6.23
    ●   2.6.25 did it right:
         –     timerfd_create(), timerfd_settime(), timerfd_gettime()
         –     (API analogous to POSIX timers)
●   Was an ABI breakage, but
    ●   Only in a single kernel version
    ●   Original API was never exposed via glibc




                                                                    33
    man7.org
Are we learning
from the past?
Are we learning from past mistakes?
●   Dnotify [2.4]
    ●   Directory change notification API
    ●   Many problems
●   So, we added inotify [2.6.13]
    ●   But, inotify still doesn't get it all right
●   Now [2.6.37] we have yet another API, fanotify
    ●   Designed for virus scanners
    ●   Adds some functionality
    ●   Doesn't provide all functionality of inotify
●   Couldn't we have had a new API that did everything?

                                                       35
    man7.org
Extensibility
Is the interface extensible?
●   Too often, an early API didn't allow for
    extensions
●   Common solution is a new API, with a flags arg:
    ●   umount() → umount2() [2.2]
    ●   epoll_create() [2.6] → epoll_create2() [2.6.27]
    ●   futimesat() [2.6.16] → epoll_create2() [2.6.22]
    ●   signalfd() [2.6.22] → signalfd4() [2.6.27]
●   When adding a new API, consider adding an
    (unused) flags argument to allow extensions

                                                          37
    man7.org
Futureproofing
●   Suppose a syscall has a flags bit-mask arg.
●   Implementation should always have check like:
     if (flags & ~(FL_X | FL_Y))
         return -EINVAL;
         // Only allow caller to specify flags
         // bits that have a defined meaning

●   Without this check, interface is “loose”




                                                  38
    man7.org
Futureproofing
●   Suppose user makes a call of form:
         syscallxyz(-1);    // flags has all bits set

●   If implementer later adds FL_Z, an ABI
    breakage occurs for user's code
●   Conversely: user has no way of checking if a
    kernel implements FL_Z
●   Many system calls lack this kind of check!
    ●   Linux 3.2 examples: sigaction(sa.sa_flags), recv(),
        send(), clock_nanosleep(), msgrcv(), msgget(),
        semget(), shmget(), shmat(), semop(sops.sem_flg)
                                                       39
    man7.org
Futureproofing
●   Should checks be added after the fact?
    ●   e.g., umount2() [2.2] added check in 2.6.34;
        timerfd_settime() [2.6.25] added check in 2.6.29
●   But adding check can also create ABI breakage
    ●   Apps get errors where previously they did not
●   Loose APIs allow the user to define interface
    ●   Worst case: can't add new flags values to interface




                                                        40
    man7.org
Futureproofing failures
●   16 bits is enough for UIDs/GIDs...
    ●   2.4: 32-bit UIDs/GIDs
●   32 bits is enough for file offsets
    ●   Okay, it was 1991, but Moore's law...
    ●   2.4: 64-bit file offsets
●   So we have
    ●   oldstat(), stat(), stat64()
    ●   chown(), chown32()
    ●   open(), open64()
    ●   and so on
                                                41
    man7.org
Maintainability
When good ideas go astray
●   Traditional UNIX gives root all privileges
    ●   All or nothing is risky!
●   Linux divides root privileges into separate pieces:
    ●   CAP_AUDIT_CONTROL, CAP_AUDIT_WRITE, CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, 
        CAP_FOWNER, CAP_FSETID, CAP_IPC_LOCK, CAP_IPC_OWNER, CAP_KILL, CAP_LEASE, 
        CAP_LINUX_IMMUTABLE, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_NET_ADMIN, 
        CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETFCAP, CAP_SETGID, CAP_SETPCAP, 
        CAP_SETUID, CAP_SYSLOG, CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_CHROOT, CAP_SYS_MODULE, 
        CAP_SYS_NICE, CAP_SYS_PACCT, CAP_SYS_PTRACE, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, CAP_SYS_TIME, 
        CAP_SYS_TTY_CONFIG, CAP_WAKE_ALARM


●   Great! But which capability do I use for my new
    feature?
●   Hmmm, CAP_SYS_ADMIN looks good
●   CAP_SYS_ADMIN, the new root, 231 uses in 3.2
                                                                                         43
    man7.org
Standards and
  portability
Needlessly breaking portability
●   sched_setscheduler()
    ●   POSIX says: successful call must return previous
        policy
    ●   Linux: successful call returns 0
    ●   No good reason for this inconsistency
    ●   Developers must special case code for Linux




                                                      45
    man7.org
Actually,
it wasn't just us...
We're just traditionalists...
●   These kinds of problems predate Linux:
    ●   API of System V IPC is awful!
    ●   Semantics of fcntl() locks when FD is closed render
        locks useless for libraries
    ●   select() modifies FD sets in place, forcing
        reinitialization inside loops
         –     poll() gets it right: uses separate input and output args
    ●   and so on...




                                                                      47
    man7.org
Overall grade?
Why do these API problems
    keep happening?
●   Excessive focus on code as primary
    contribution of value for a software project
●   Poor feedback loop between developers and
    users
Myth 2
Code is always the best way
to contribute to Free Software
“Show me the code!”
“Show me the code!”

 But anyone can write code,
  and if the design is good
     but the code is bad,
the code can usually be fixed
“Show me the code!”

             Sometimes,
other sentences are more appropriate,
and encourage contributions that are
        as (or more) valuable
“Show me the users'
  requirements!”
“Show me the users' requirements”
●   Does the API serve needs of multiple users, or
    is it just one developer scratching an itch?
    ●   Beware of limited perspectives!
●   Is the API designed for generality?
●   Is the API extensible for possible future
    requirements?




                                                55
    man7.org
“Show me the design
specification / documentation!”
“Show me the design spec. / documentation!”
●   How do we know if implementation deviates
    from intention?
●   What shall we code our tests against?
●   Writing documentation turns out often to be a
    natural sanity check for design
●   A decent man page suffices
    ●   Many of the bugs mentioned earlier were found
        while writing man pages...
    ●   It's all a question of when it's written...


                                                        57
    man7.org
“Show me the design review!”
“Show me the design review!”
●   Did other people actually review your design?
●   Is the API:
     ●   as simple as possible?
     ●   easy to use / difficult to misuse?
     ●   consistent with related/similar APIs?
     ●   well integrated with existing APIs?
     ●   as general as possible
     ●   extensible?
     ●   following standards, where relevant?
     ●   at least as good as earlier APIs with similar functionality?
     ●   maintainable?
                                                              59
    man7.org
“Show me the tests!”
“Show me the tests!”
●   Did you (the developer) write some tests?
●   More importantly: did someone else write some
    tests?
●   Do the tests cover all reasonable cases?
●   Do you test for unreasonable cases?
     ●   Do unexpected inputs generate suitable error returns?
●   While writing tests, did you find the interface easy
    to use / difficult to misuse? (Did you consequently
    make some design changes?)
●   What bugs did you discover during testing?
                                                       61
    man7.org
Finally...
●   If you're a potential contributor, don't fall into the
    trap of believing that code is the only (or best)
    vehicle for contribution
●   As a maintainer, are you letting your project
    down by failing to encourage these other types
    of contribution?
Thanks!
                    (slides up soon at http://man7.org/conf/)


Michael Kerrisk                                                                Linux man-pages project
mtk@man7.org                                                                   mtk.manpages@gmail.com
http://man7.org/                                                               http://man7.org/linux/man-pages/

                          Mamaku (Black Tree Fern) image (c) Rob Suisted




                                                                                                     New Zealand Blue Penguin (Kororā)
                                                                                                     Image (c) livescience.com
                          naturespic.com




 No Starch Press, 2010
  http://man7.org/tlpi/
1 of 63

Recommended

Porting Android by
Porting AndroidPorting Android
Porting AndroidOpersys inc.
4K views42 slides
Android for Embedded Linux Developers by
Android for Embedded Linux DevelopersAndroid for Embedded Linux Developers
Android for Embedded Linux DevelopersOpersys inc.
5.9K views60 slides
Android Variants, Hacks, Tricks and Resources presented at AnDevConII by
Android Variants, Hacks, Tricks and Resources presented at AnDevConIIAndroid Variants, Hacks, Tricks and Resources presented at AnDevConII
Android Variants, Hacks, Tricks and Resources presented at AnDevConIIOpersys inc.
2.9K views48 slides
Understanding the Android System Server by
Understanding the Android System ServerUnderstanding the Android System Server
Understanding the Android System ServerOpersys inc.
53.5K views25 slides
Porting Android by
Porting AndroidPorting Android
Porting AndroidOpersys inc.
5 views49 slides
Android Variants, Hacks, Tricks and Resources by
Android Variants, Hacks, Tricks and ResourcesAndroid Variants, Hacks, Tricks and Resources
Android Variants, Hacks, Tricks and ResourcesOpersys inc.
3.6K views47 slides

More Related Content

What's hot

Yocto and IoT - a retrospective by
Yocto and IoT - a retrospectiveYocto and IoT - a retrospective
Yocto and IoT - a retrospectiveOpen-RnD
1.7K views85 slides
Build your own embedded linux distributions by yocto project by
Build your own embedded linux distributions by yocto projectBuild your own embedded linux distributions by yocto project
Build your own embedded linux distributions by yocto projectYen-Chin Lee
31.2K views76 slides
Automotive Grade Linux and systemd by
Automotive Grade Linux and systemdAutomotive Grade Linux and systemd
Automotive Grade Linux and systemdAlison Chaiken
1.6K views21 slides
Android Internals at Linaro Connect Asia 2013 by
Android Internals at Linaro Connect Asia 2013Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013Opersys inc.
4.4K views53 slides
One Year of Porting - Post-mortem of two Linux/SteamOS launches by
One Year of Porting - Post-mortem of two Linux/SteamOS launchesOne Year of Porting - Post-mortem of two Linux/SteamOS launches
One Year of Porting - Post-mortem of two Linux/SteamOS launchesLeszek Godlewski
31.8K views44 slides
Linux as a gaming platform, ideology aside by
Linux as a gaming platform, ideology asideLinux as a gaming platform, ideology aside
Linux as a gaming platform, ideology asideLeszek Godlewski
24.4K views49 slides

What's hot(20)

Yocto and IoT - a retrospective by Open-RnD
Yocto and IoT - a retrospectiveYocto and IoT - a retrospective
Yocto and IoT - a retrospective
Open-RnD1.7K views
Build your own embedded linux distributions by yocto project by Yen-Chin Lee
Build your own embedded linux distributions by yocto projectBuild your own embedded linux distributions by yocto project
Build your own embedded linux distributions by yocto project
Yen-Chin Lee31.2K views
Automotive Grade Linux and systemd by Alison Chaiken
Automotive Grade Linux and systemdAutomotive Grade Linux and systemd
Automotive Grade Linux and systemd
Alison Chaiken1.6K views
Android Internals at Linaro Connect Asia 2013 by Opersys inc.
Android Internals at Linaro Connect Asia 2013Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013
Opersys inc.4.4K views
One Year of Porting - Post-mortem of two Linux/SteamOS launches by Leszek Godlewski
One Year of Porting - Post-mortem of two Linux/SteamOS launchesOne Year of Porting - Post-mortem of two Linux/SteamOS launches
One Year of Porting - Post-mortem of two Linux/SteamOS launches
Leszek Godlewski31.8K views
Linux as a gaming platform, ideology aside by Leszek Godlewski
Linux as a gaming platform, ideology asideLinux as a gaming platform, ideology aside
Linux as a gaming platform, ideology aside
Leszek Godlewski24.4K views
Android Internals by Opersys inc.
Android InternalsAndroid Internals
Android Internals
Opersys inc.14.1K views
Introduction to Docker (as presented at December 2013 Global Hackathon) by Jérôme Petazzoni
Introduction to Docker (as presented at December 2013 Global Hackathon)Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)
Jérôme Petazzoni3.3K views
Working with the AOSP - Linaro Connect Asia 2013 by Opersys inc.
Working with the AOSP - Linaro Connect Asia 2013Working with the AOSP - Linaro Connect Asia 2013
Working with the AOSP - Linaro Connect Asia 2013
Opersys inc.5.1K views
Don't Give Credit: Hacking Arcade Machines by Michael Scovetta
Don't Give Credit: Hacking Arcade MachinesDon't Give Credit: Hacking Arcade Machines
Don't Give Credit: Hacking Arcade Machines
Michael Scovetta7.9K views
Eclipse IDE Yocto Plugin by cudma
Eclipse IDE Yocto PluginEclipse IDE Yocto Plugin
Eclipse IDE Yocto Plugin
cudma4.4K views
Memory Management in Android by Opersys inc.
Memory Management in AndroidMemory Management in Android
Memory Management in Android
Opersys inc.1.3K views
Using and Customizing the Android Framework / part 4 of Embedded Android Work... by Opersys inc.
Using and Customizing the Android Framework / part 4 of Embedded Android Work...Using and Customizing the Android Framework / part 4 of Embedded Android Work...
Using and Customizing the Android Framework / part 4 of Embedded Android Work...
Opersys inc.10K views
yocto_scale_handout-with-notes by Steve Arnold
yocto_scale_handout-with-notesyocto_scale_handout-with-notes
yocto_scale_handout-with-notes
Steve Arnold450 views
A million ways to provision embedded linux devices by Mender.io
A million ways to provision embedded linux devicesA million ways to provision embedded linux devices
A million ways to provision embedded linux devices
Mender.io829 views
[Defcon] Hardware backdooring is practical by Moabi.com
[Defcon] Hardware backdooring is practical[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical
Moabi.com130.2K views
Hardware backdooring is practical : slides by Moabi.com
Hardware backdooring is practical : slidesHardware backdooring is practical : slides
Hardware backdooring is practical : slides
Moabi.com1.3K views
Leveraging Android's Linux Heritage at AnDevCon3 by Opersys inc.
Leveraging Android's Linux Heritage at AnDevCon3Leveraging Android's Linux Heritage at AnDevCon3
Leveraging Android's Linux Heritage at AnDevCon3
Opersys inc.1.2K views

Viewers also liked

2004 Summer Newsletter by
2004 Summer Newsletter2004 Summer Newsletter
2004 Summer NewsletterDirect Relief
356 views6 slides
Unbound:Endless War by
Unbound:Endless WarUnbound:Endless War
Unbound:Endless WarAnthony Thomas
509 views19 slides
Tsunami response one year later by
Tsunami response one year laterTsunami response one year later
Tsunami response one year laterDirect Relief
524 views10 slides
Alberti Center Colloquium Series - Dr. Jamie Ostrov by
Alberti Center Colloquium Series - Dr. Jamie OstrovAlberti Center Colloquium Series - Dr. Jamie Ostrov
Alberti Center Colloquium Series - Dr. Jamie OstrovUB Alberti Center for Bullying Abuse Prevention
672 views40 slides
Lengua anuncio by
Lengua anuncioLengua anuncio
Lengua anunciofranky226
170 views3 slides
Tsahim 4 by
Tsahim 4Tsahim 4
Tsahim 4mongoo_8301
329 views9 slides

Viewers also liked(20)

Tsunami response one year later by Direct Relief
Tsunami response one year laterTsunami response one year later
Tsunami response one year later
Direct Relief 524 views
Lengua anuncio by franky226
Lengua anuncioLengua anuncio
Lengua anuncio
franky226170 views
Gatti grandi avventure by Zoroastro01
Gatti grandi avventureGatti grandi avventure
Gatti grandi avventure
Zoroastro01313 views
Members of family33 by Digna Rita
Members of family33Members of family33
Members of family33
Digna Rita1.9K views
Java Tech & Tools | OSGi Best Practices | Emily Jiang by JAX London
Java Tech & Tools | OSGi Best Practices | Emily JiangJava Tech & Tools | OSGi Best Practices | Emily Jiang
Java Tech & Tools | OSGi Best Practices | Emily Jiang
JAX London1.4K views
Mla citation guide & citation data forms by ProfWillAdams
Mla citation guide & citation data formsMla citation guide & citation data forms
Mla citation guide & citation data forms
ProfWillAdams1.7K views

Similar to Why kernelspace sucks?

Road to sbt 1.0: Paved with server (2015 Amsterdam) by
Road to sbt 1.0: Paved with server (2015 Amsterdam)Road to sbt 1.0: Paved with server (2015 Amsterdam)
Road to sbt 1.0: Paved with server (2015 Amsterdam)Eugene Yokota
2.8K views54 slides
Docker and-containers-for-development-and-deployment-scale12x by
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xrkr10
112 views82 slides
Road to sbt 1.0 paved with server by
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with serverEugene Yokota
4.9K views53 slides
Gash Has No Privileges by
Gash Has No PrivilegesGash Has No Privileges
Gash Has No PrivilegesDavid Evans
5.1K views33 slides
Docker Introduction + what is new in 0.9 by
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Jérôme Petazzoni
3.7K views91 slides
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ by
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQJérôme Petazzoni
9.2K views91 slides

Similar to Why kernelspace sucks?(20)

Road to sbt 1.0: Paved with server (2015 Amsterdam) by Eugene Yokota
Road to sbt 1.0: Paved with server (2015 Amsterdam)Road to sbt 1.0: Paved with server (2015 Amsterdam)
Road to sbt 1.0: Paved with server (2015 Amsterdam)
Eugene Yokota2.8K views
Docker and-containers-for-development-and-deployment-scale12x by rkr10
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
rkr10112 views
Road to sbt 1.0 paved with server by Eugene Yokota
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with server
Eugene Yokota4.9K views
Gash Has No Privileges by David Evans
Gash Has No PrivilegesGash Has No Privileges
Gash Has No Privileges
David Evans5.1K views
Docker Introduction + what is new in 0.9 by Jérôme Petazzoni
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9
Jérôme Petazzoni3.7K views
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ by Jérôme Petazzoni
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Jérôme Petazzoni9.2K views
Hunting and Exploiting Bugs in Kernel Drivers - DefCamp 2012 by DefCamp
Hunting and Exploiting Bugs in Kernel Drivers - DefCamp 2012Hunting and Exploiting Bugs in Kernel Drivers - DefCamp 2012
Hunting and Exploiting Bugs in Kernel Drivers - DefCamp 2012
DefCamp1.6K views
A Gentle Introduction To Docker And All Things Containers by Jérôme Petazzoni
A Gentle Introduction To Docker And All Things ContainersA Gentle Introduction To Docker And All Things Containers
A Gentle Introduction To Docker And All Things Containers
Jérôme Petazzoni60.6K views
Introduction to Docker and all things containers, Docker Meetup at RelateIQ by dotCloud
Introduction to Docker and all things containers, Docker Meetup at RelateIQIntroduction to Docker and all things containers, Docker Meetup at RelateIQ
Introduction to Docker and all things containers, Docker Meetup at RelateIQ
dotCloud2.2K views
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale by Jérôme Petazzoni
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Jérôme Petazzoni4.4K views
Linux 开源操作系统发展新趋势 by Anthony Wong
Linux 开源操作系统发展新趋势Linux 开源操作系统发展新趋势
Linux 开源操作系统发展新趋势
Anthony Wong173 views
A Gentle Introduction to Docker and Containers by Docker, Inc.
A Gentle Introduction to Docker and ContainersA Gentle Introduction to Docker and Containers
A Gentle Introduction to Docker and Containers
Docker, Inc.1.6K views
OSDC 2016 | rkt and Kubernetes: What’s new with Container Runtimes and Orches... by NETWAYS
OSDC 2016 | rkt and Kubernetes: What’s new with Container Runtimes and Orches...OSDC 2016 | rkt and Kubernetes: What’s new with Container Runtimes and Orches...
OSDC 2016 | rkt and Kubernetes: What’s new with Container Runtimes and Orches...
NETWAYS19 views
OSDC 2016 - rkt and Kubernentes what's new with Container Runtimes and Orches... by NETWAYS
OSDC 2016 - rkt and Kubernentes what's new with Container Runtimes and Orches...OSDC 2016 - rkt and Kubernentes what's new with Container Runtimes and Orches...
OSDC 2016 - rkt and Kubernentes what's new with Container Runtimes and Orches...
NETWAYS199 views
Lightweight Virtualization with Linux Containers and Docker | YaC 2013 by dotCloud
Lightweight Virtualization with Linux Containers and Docker | YaC 2013Lightweight Virtualization with Linux Containers and Docker | YaC 2013
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
dotCloud14.2K views
Lightweight Virtualization with Linux Containers and Docker I YaC 2013 by Docker, Inc.
Lightweight Virtualization with Linux Containers and Docker I YaC 2013Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Docker, Inc.1.1K views
Introduction to Docker at SF Peninsula Software Development Meetup @Guidewire by dotCloud
Introduction to Docker at SF Peninsula Software Development Meetup @GuidewireIntroduction to Docker at SF Peninsula Software Development Meetup @Guidewire
Introduction to Docker at SF Peninsula Software Development Meetup @Guidewire
dotCloud3.1K views

More from OpenFest team

Embedding FreeBSD: for large and small beds by
Embedding FreeBSD: for large and small bedsEmbedding FreeBSD: for large and small beds
Embedding FreeBSD: for large and small bedsOpenFest team
1.4K views25 slides
Why you can charge for open source software by
Why you can charge for open source softwareWhy you can charge for open source software
Why you can charge for open source softwareOpenFest team
535 views7 slides
Microinvest Warehouse Open by
Microinvest Warehouse OpenMicroinvest Warehouse Open
Microinvest Warehouse OpenOpenFest team
585 views7 slides
Backbone.js by
Backbone.jsBackbone.js
Backbone.jsOpenFest team
4.6K views166 slides
Как да правим по-добър бизнес с услуги около софтуера с отворен код by
Как да правим по-добър бизнес с услуги около софтуера с отворен кодКак да правим по-добър бизнес с услуги около софтуера с отворен код
Как да правим по-добър бизнес с услуги около софтуера с отворен кодOpenFest team
632 views22 slides
Pf sense 2.0 by
Pf sense 2.0Pf sense 2.0
Pf sense 2.0OpenFest team
2.8K views15 slides

More from OpenFest team(20)

Embedding FreeBSD: for large and small beds by OpenFest team
Embedding FreeBSD: for large and small bedsEmbedding FreeBSD: for large and small beds
Embedding FreeBSD: for large and small beds
OpenFest team1.4K views
Why you can charge for open source software by OpenFest team
Why you can charge for open source softwareWhy you can charge for open source software
Why you can charge for open source software
OpenFest team535 views
Microinvest Warehouse Open by OpenFest team
Microinvest Warehouse OpenMicroinvest Warehouse Open
Microinvest Warehouse Open
OpenFest team585 views
Как да правим по-добър бизнес с услуги около софтуера с отворен код by OpenFest team
Как да правим по-добър бизнес с услуги около софтуера с отворен кодКак да правим по-добър бизнес с услуги около софтуера с отворен код
Как да правим по-добър бизнес с услуги около софтуера с отворен код
OpenFest team632 views
Електронни пари: Пътят до BitCoin и поглед напред by OpenFest team
Електронни пари: Пътят до BitCoin и поглед напредЕлектронни пари: Пътят до BitCoin и поглед напред
Електронни пари: Пътят до BitCoin и поглед напред
OpenFest team1.7K views
Виртуализирано видеонаблюдение под FreeBSD by OpenFest team
Виртуализирано видеонаблюдение под FreeBSDВиртуализирано видеонаблюдение под FreeBSD
Виртуализирано видеонаблюдение под FreeBSD
OpenFest team1K views
RFID технологии и проблеми със сигурността им by OpenFest team
RFID технологии и проблеми със сигурността имRFID технологии и проблеми със сигурността им
RFID технологии и проблеми със сигурността им
OpenFest team987 views
Redis the better NoSQL by OpenFest team
Redis the better NoSQLRedis the better NoSQL
Redis the better NoSQL
OpenFest team1.2K views
Distributed WPA PSK security audit by OpenFest team
Distributed WPA PSK security auditDistributed WPA PSK security audit
Distributed WPA PSK security audit
OpenFest team862 views
Направи си сам суперкомпютър by OpenFest team
Направи си сам суперкомпютърНаправи си сам суперкомпютър
Направи си сам суперкомпютър
OpenFest team880 views
Свободни курсове за обучение by OpenFest team
Свободни курсове за обучениеСвободни курсове за обучение
Свободни курсове за обучение
OpenFest team508 views
Using Open Source technologies to create Enterprise Level Cloud System by OpenFest team
Using Open Source technologies to create Enterprise Level Cloud SystemUsing Open Source technologies to create Enterprise Level Cloud System
Using Open Source technologies to create Enterprise Level Cloud System
OpenFest team756 views
Behaviour-Driven Development, Ruby Style by OpenFest team
Behaviour-Driven Development, Ruby StyleBehaviour-Driven Development, Ruby Style
Behaviour-Driven Development, Ruby Style
OpenFest team692 views

Recently uploaded

Zero to Automated in Under a Year by
Zero to Automated in Under a YearZero to Automated in Under a Year
Zero to Automated in Under a YearNetwork Automation Forum
15 views23 slides
2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe by
2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe
2024: A Travel Odyssey The Role of Generative AI in the Tourism UniverseSimone Puorto
12 views61 slides
HTTP headers that make your website go faster - devs.gent November 2023 by
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023Thijs Feryn
22 views151 slides
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdfDr. Jimmy Schwarzkopf
20 views29 slides
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc
11 views29 slides
"Node.js Development in 2024: trends and tools", Nikita Galkin by
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin Fwdays
11 views38 slides

Recently uploaded(20)

2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe by Simone Puorto
2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe
2024: A Travel Odyssey The Role of Generative AI in the Tourism Universe
Simone Puorto12 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn22 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc11 views
"Node.js Development in 2024: trends and tools", Nikita Galkin by Fwdays
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin
Fwdays11 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab21 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
The Forbidden VPN Secrets.pdf by Mariam Shaba
The Forbidden VPN Secrets.pdfThe Forbidden VPN Secrets.pdf
The Forbidden VPN Secrets.pdf
Mariam Shaba20 views
Future of AR - Facebook Presentation by ssuserb54b561
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
ssuserb54b56115 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker40 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10300 views

Why kernelspace sucks?

  • 1. Why kernel space sucks (Or: Deconstructing two myths) (Or: An abridged history of kernel-userspace interface blunders...) Michael Kerrisk OpenFest 2011 © 2011 5 Nov 2011 mtk@man7.org Sofia, Bulgaria http://man7.org/
  • 2. Who am I? ● Professionally: programmer (primarily); also teacher and writer ● Working with UNIX and Linux for nearly 25 years ● Linux man-pages maintainer since 2004 ● 117 releases to date ● written or cowritten ~250 of ~950 man pages ● lots of API testing, many bug reports ● Author of a book on the kernel-userspace API ● IOW: I've spent a lot of time looking at the interface 2 man7.org
  • 3. Intro: Why Userspace Sucks ● Paper/talk by Dave Jones of Red Hat ● First presented at Ottawa LS 2006 ● A lead-in to deconstructing a couple of myths ● Why Userspace Sucks → WUSS ● http://www.kernel.org/doc/ols/2006/ols2006v1-pages-441-450.pdf ● http://www.codemonkey.org.uk/projects/talks/ols2k6.tar.gz ● http://lwn.net/Articles/192214/ 3 man7.org
  • 4. Motivation for WUSS ● We (kernel developers) have created a magnificently performant kernel ● But, can we make it better? ● Why does it take so long to boot, start applications, and shut down? ● Why does my idle laptop consume so much battery power? 4 man7.org
  • 5. Findings from WUSS ● DJ starts instrumenting the kernel, and finds... ● Boot up: 80k stat(), 27k open(), 1.4k exec() ● Shutdown: 23k stat(), 9k open() ● Userspace programmers wreck performance doing crazy things! ● open() and reparse same file multiple times! ● read config files for many devices not even present! ● stat() (or even open()) 100s of files they never need ● timers triggering regular unnecessary wake-ups 5 man7.org
  • 6. Conclusions from WUSS ● Room for a lot of improvement in userspace! ● Userspace programmers should be aware of and using trace and analysis tools ● (perf, LTTng, ftrace, systemtap, strace, valgrind, powertop, etc.) 6 man7.org
  • 7. Kernelspace Userspace “We (kernel developers) are much smarter than those crazy userspace programmers”
  • 8. Kernelspace Userspace Something's wrong with this picture...
  • 9. Let's question a couple of myths... ● Myth 1: Kernel programmers (can) always get things right (in the end) ● Myth 2: Code is always the best way to contribute to Free Software 9 man7.org
  • 10. Myth 1 Kernel programmers (can) always get things right (in the end) Except, there's (at least) one place where they don't: the interface
  • 11. The kernel-userspace interface ● Application programming interface (API) presented by kernel to userspace programs ● System calls (← I'll focus here) ● Pseudo-file systems (/proc, /sys, etc.) ● ioctl() interfaces (device drivers) ● Netlink sockets ● Obscure pieces (AUXV, VDSO, ...) 11 man7.org
  • 12. API designs must be done right first time
  • 13. Why must APIs be right first time? ● Code fixes != API fixes 13 man7.org
  • 14. Why is fixing APIs so hard? ● Usually, “fixing” an interface means breaking the interface for some applications “We care about user-space interfaces to an insane degree. We go to extreme lengths to maintain even badly designed or unintentional interfaces. Breaking user programs simply isn't acceptable.” [LKML, Dec 2005] 14 man7.org
  • 15. We have to live with our mistakes! So: any API mistake by kernel hackers creates pain that thousands of userspace programmers must live with for decades
  • 16. So, what does it mean to get an API right?
  • 17. Doing (kernel-userspace) APIs right ● Properly designed and implemented API should: ● be bug free! ● have a well thought out design – simple as possible (but no simpler) – easy to use / difficult to misuse ● be consistent with related/similar APIs ● integrate well with existing APIs – e.g., interactions with fork(), exec(), threads, signals, FDs? ● be as general as possible ● be extensible, where needed; accommodate future growth trends ● adhere to relevant standards (as far as possible) ● be as good as, or better than, earlier APIs with similar functionality ● be maintainable over time (a multilayered question) 17 man7.org
  • 18. So how do kernel developers score? (BSDers, please laugh quietly)
  • 19. Bugs
  • 20. Bugs ● utimensat(2) [2.6.22] ● Set file timestamps ● Multiple bugs! – http://linux-man-pages.blogspot.com/2008/06/whats-wrong-with-kernel-userland_30.html ● Fixed in 2.6.26 ● signalfd() [2.6.22] ● Receive signals via a file descriptor ● Didn't correctly obtain data sent with sigqueue(2) ● Fixed in 2.6.25 20 man7.org
  • 21. Bugs ● Examples of other interfaces with significant, easy to find bugs at release: ● inotify [2.6.13] ● splice() [2.6.17] (http://marc.info/?l=linux-mm&m=114238448331607&w=2) ● timerfd [2.6.22] (http://marc.info/?l=linux-kernel&m=118517213626087&w=2) 21 man7.org
  • 22. Bugs—what's going on? ● There's a quality control issue; way too many bugs in released interfaces ● Pre-release testing insufficient and haphazard: ● Too few testers (maybe just kernel developer) ● No unit tests ● Insufficient test coverage ● No clear specification against which to test ● Even if bug is fixed, users may still need to care ● special casing for kernel versions 22 man7.org
  • 23. Thinking about design
  • 24. Code it now, think about it later ● Vanishing arguments: ● readdir(2) ignores count argument ● getcpu(2) [2.6.19] ignores tcache argument ● epoll_create() [2.6] ignores size arg. (must be > 0) since 2.6.8 ● Probably, argument wasn't needed to start with ● Or: recognized as a bad idea and made a no-op 24 man7.org
  • 25. Code it now, think about it later ● futimesat() [2.6.16] ● Extends utimes() ● Proposed for POSIX.1-2008 ● Implemented on Linux ● POSIX.1 members realize API is insufficient → standardized different API ● utimensat() added in Linux 2.6.22 25 man7.org
  • 26. Code it now, think about it later ● Dec 2003: Linux 2.6 added epoll_wait() ● File descriptor monitoring – (improves on select()) ● Nov 2006, 2.6.19 added epoll_pwait() to allow manipulation of signal mask during call ● But, already in 2001, POSIX specified pselect() to fix analogous, well-known problem in select() 26 man7.org
  • 28. Interface inconsistencies ● mlock(start, length): ● Round start down to page size ● Round length up to next page boundary ● mlock(4000, 6000) affects bytes 0..12287 ● remap_file_pages(start, length, ...) [2.6]: ● Round start down to page boundary ● Round length down to page boundary(!) ● remap_file_pages(4000, 6000, ...) ? → 0..4095 ● User expectation: similar APIs should behave similarly 28 man7.org
  • 29. Confusing the users ● Various system calls allow one process to change attributes of another process ● e.g., setpriority(), ioprio_set(), migrate_pages(), prlimit() ● Unprivileged calls require credential matches: ● Some combination of caller's UIDs/GIDs matches some combination of target's UIDs/GIDs 29 man7.org
  • 30. Confusing the users ● But, much inconsistency; e.g.: ● setpriority(): euid == t-uid || euid == t-euid ● ioprio_set(): uid == t-uid || euid == t-uid ● migrate_pages(): uid == t-uid || uid == t-suid || euid == t-uid || euid == t-suid ● prlimit(): (uid == t-uid && uid == t-euid && uid == t-suid) && (gid == t-gid && gid == t-guid && gid == t-sgid) !! ● Inconsistency may confuse users into writing bugs ● Potentially, security related bugs! ● http://linux-man-pages.blogspot.com/2010/11/system-call-credential-checking-tale-of.html 30 man7.org
  • 32. Is the interface sufficiently general? ● 2.6.22 added timerfd(ufd, flags, utimerspec) ● Create timer that notifies via a file descriptor ● But API didn't allow user to: ● Retrieve previous value when setting new timer value ● Do a “get” to retrieve time until next expiration – http://marc.info/?l=linux-kernel&m=118517213626087&w=2 – http://lwn.net/Articles/245533/ ● Older APIs ([gs]etitimer(), POSIX timers) did provide this functionality! 32 man7.org
  • 33. Is the interface sufficiently general? ● Solution: ● timerfd() disabled in kernel 2.6.23 ● 2.6.25 did it right: – timerfd_create(), timerfd_settime(), timerfd_gettime() – (API analogous to POSIX timers) ● Was an ABI breakage, but ● Only in a single kernel version ● Original API was never exposed via glibc 33 man7.org
  • 34. Are we learning from the past?
  • 35. Are we learning from past mistakes? ● Dnotify [2.4] ● Directory change notification API ● Many problems ● So, we added inotify [2.6.13] ● But, inotify still doesn't get it all right ● Now [2.6.37] we have yet another API, fanotify ● Designed for virus scanners ● Adds some functionality ● Doesn't provide all functionality of inotify ● Couldn't we have had a new API that did everything? 35 man7.org
  • 37. Is the interface extensible? ● Too often, an early API didn't allow for extensions ● Common solution is a new API, with a flags arg: ● umount() → umount2() [2.2] ● epoll_create() [2.6] → epoll_create2() [2.6.27] ● futimesat() [2.6.16] → epoll_create2() [2.6.22] ● signalfd() [2.6.22] → signalfd4() [2.6.27] ● When adding a new API, consider adding an (unused) flags argument to allow extensions 37 man7.org
  • 38. Futureproofing ● Suppose a syscall has a flags bit-mask arg. ● Implementation should always have check like: if (flags & ~(FL_X | FL_Y)) return -EINVAL; // Only allow caller to specify flags // bits that have a defined meaning ● Without this check, interface is “loose” 38 man7.org
  • 39. Futureproofing ● Suppose user makes a call of form: syscallxyz(-1); // flags has all bits set ● If implementer later adds FL_Z, an ABI breakage occurs for user's code ● Conversely: user has no way of checking if a kernel implements FL_Z ● Many system calls lack this kind of check! ● Linux 3.2 examples: sigaction(sa.sa_flags), recv(), send(), clock_nanosleep(), msgrcv(), msgget(), semget(), shmget(), shmat(), semop(sops.sem_flg) 39 man7.org
  • 40. Futureproofing ● Should checks be added after the fact? ● e.g., umount2() [2.2] added check in 2.6.34; timerfd_settime() [2.6.25] added check in 2.6.29 ● But adding check can also create ABI breakage ● Apps get errors where previously they did not ● Loose APIs allow the user to define interface ● Worst case: can't add new flags values to interface 40 man7.org
  • 41. Futureproofing failures ● 16 bits is enough for UIDs/GIDs... ● 2.4: 32-bit UIDs/GIDs ● 32 bits is enough for file offsets ● Okay, it was 1991, but Moore's law... ● 2.4: 64-bit file offsets ● So we have ● oldstat(), stat(), stat64() ● chown(), chown32() ● open(), open64() ● and so on 41 man7.org
  • 43. When good ideas go astray ● Traditional UNIX gives root all privileges ● All or nothing is risky! ● Linux divides root privileges into separate pieces: ● CAP_AUDIT_CONTROL, CAP_AUDIT_WRITE, CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH,  CAP_FOWNER, CAP_FSETID, CAP_IPC_LOCK, CAP_IPC_OWNER, CAP_KILL, CAP_LEASE,  CAP_LINUX_IMMUTABLE, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_NET_ADMIN,  CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETFCAP, CAP_SETGID, CAP_SETPCAP,  CAP_SETUID, CAP_SYSLOG, CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_CHROOT, CAP_SYS_MODULE,  CAP_SYS_NICE, CAP_SYS_PACCT, CAP_SYS_PTRACE, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, CAP_SYS_TIME,  CAP_SYS_TTY_CONFIG, CAP_WAKE_ALARM ● Great! But which capability do I use for my new feature? ● Hmmm, CAP_SYS_ADMIN looks good ● CAP_SYS_ADMIN, the new root, 231 uses in 3.2 43 man7.org
  • 44. Standards and portability
  • 45. Needlessly breaking portability ● sched_setscheduler() ● POSIX says: successful call must return previous policy ● Linux: successful call returns 0 ● No good reason for this inconsistency ● Developers must special case code for Linux 45 man7.org
  • 47. We're just traditionalists... ● These kinds of problems predate Linux: ● API of System V IPC is awful! ● Semantics of fcntl() locks when FD is closed render locks useless for libraries ● select() modifies FD sets in place, forcing reinitialization inside loops – poll() gets it right: uses separate input and output args ● and so on... 47 man7.org
  • 49. Why do these API problems keep happening? ● Excessive focus on code as primary contribution of value for a software project ● Poor feedback loop between developers and users
  • 50. Myth 2 Code is always the best way to contribute to Free Software
  • 51. “Show me the code!”
  • 52. “Show me the code!” But anyone can write code, and if the design is good but the code is bad, the code can usually be fixed
  • 53. “Show me the code!” Sometimes, other sentences are more appropriate, and encourage contributions that are as (or more) valuable
  • 54. “Show me the users' requirements!”
  • 55. “Show me the users' requirements” ● Does the API serve needs of multiple users, or is it just one developer scratching an itch? ● Beware of limited perspectives! ● Is the API designed for generality? ● Is the API extensible for possible future requirements? 55 man7.org
  • 56. “Show me the design specification / documentation!”
  • 57. “Show me the design spec. / documentation!” ● How do we know if implementation deviates from intention? ● What shall we code our tests against? ● Writing documentation turns out often to be a natural sanity check for design ● A decent man page suffices ● Many of the bugs mentioned earlier were found while writing man pages... ● It's all a question of when it's written... 57 man7.org
  • 58. “Show me the design review!”
  • 59. “Show me the design review!” ● Did other people actually review your design? ● Is the API: ● as simple as possible? ● easy to use / difficult to misuse? ● consistent with related/similar APIs? ● well integrated with existing APIs? ● as general as possible ● extensible? ● following standards, where relevant? ● at least as good as earlier APIs with similar functionality? ● maintainable? 59 man7.org
  • 60. “Show me the tests!”
  • 61. “Show me the tests!” ● Did you (the developer) write some tests? ● More importantly: did someone else write some tests? ● Do the tests cover all reasonable cases? ● Do you test for unreasonable cases? ● Do unexpected inputs generate suitable error returns? ● While writing tests, did you find the interface easy to use / difficult to misuse? (Did you consequently make some design changes?) ● What bugs did you discover during testing? 61 man7.org
  • 62. Finally... ● If you're a potential contributor, don't fall into the trap of believing that code is the only (or best) vehicle for contribution ● As a maintainer, are you letting your project down by failing to encourage these other types of contribution?
  • 63. Thanks! (slides up soon at http://man7.org/conf/) Michael Kerrisk Linux man-pages project mtk@man7.org mtk.manpages@gmail.com http://man7.org/ http://man7.org/linux/man-pages/ Mamaku (Black Tree Fern) image (c) Rob Suisted New Zealand Blue Penguin (Kororā) Image (c) livescience.com naturespic.com No Starch Press, 2010 http://man7.org/tlpi/