Time to rethink /proc
Kir Kolyshkin / Andrey Vagin
@kolyshkin / @vagin_andrey
Texas Linux Fest, 9 July 2016
Austin, TX
2
Agenda
● Intro
● History of /proc
● Limitations of current interface
● Proposed solutions
● Performance results
3
$ whoami
● Linux user since 1995
● Developing containers since 2002
– author of vzctl and vzpkg
● Leading OpenVZ: 2005 to 2015
● Twitter: @kolyshkin
4
● Founded in 1997
● Spun off from Parallels
● HQ in Seattle, WA
● R&D in Moscow, RU
2016
5
Products:
● Containers and hypervisors
● Distributed cluster storage
6
OpenVZ
7
CRIU: Checkpoint / Restore In Userspace
8
Ideas behind CRIU
● We can't merge kernel c/r upstream, so...
let’s redo the whole thing in userspace
● Use existing interfaces where available
– /proc, ptrace, netlink, parasite code injection
● Amend the kernel where necessary
– only ~180 kernel patches
– kernel v3.11+ is sufficient
(if CONFIG_CHECKPOINT_RESTORE is set)
9
History of /proc part I
● Initial solution: /dev/kmem
– May 1975, UNIX 6th edition (V6)
– http://man.cat-v.org/unix-6th/4/mem
● First “old style” /proc
– 1984, UNIX 8th edition (V8), by Tom Killian
– A process is a file! Images of running processes
– An alternative to ptrace(2)
– http://man.cat-v.org/unix_8th/4/proc
10
History of /proc part II
● Most well-known old-style /proc
– 1988...1991: UNIX SVR4 (port from V8 with
enhancements by Roger Faulkner and Ron Gomes)
– read(), write(), and 37 ioctl()s
● First modern style /proc
– mid-1990s, Plan 9
– Each process is a directory with multiple
informational and control files
– One can use ls and cat to work with it
11
Plan 9 /proc interface
12
Modern Linux interface: /proc/PID/*
$ ls /proc/self/
attr             cwd      loginuid    numa_maps      schedstat  task
autogroup        environ  map_files   oom_adj        sessionid  timers
auxv             exe      maps        oom_score      setgroups  uid_map
cgroup           fd       mem         oom_score_adj  smaps      wchan
clear_refs       fdinfo   mountinfo   pagemap        stack
cmdline          gid_map  mounts      personality    stat
comm             io       mountstats  projid_map     statm
coredump_filter  latency  net         root           status
cpuset           limits   ns          sched          syscall
13
Limitations of /proc/PID interface
● Requires at least three syscalls per process per file
– open(), read(), close()
● Variety of formats, mostly text based
● Not enough information (/proc/PID/fd/*)
● Some formats are non-extendable
– /proc/PID/maps where the last column is optional
● Sometimes slow due to extra attributes
– /proc/PID/smaps vs /proc/PID/maps
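The per-file cost listed above can be made concrete with a small sketch. The helper below is hypothetical (its name and the `proc_root` parameter are assumptions for illustration); it reads one file per process using raw `os.open`/`os.read`/`os.close` and counts the calls it issues. Directory listing calls are not counted.

```python
import os

def read_proc_files(proc_root="/proc", name="status"):
    """Read <proc_root>/<PID>/<name> for every PID.

    Returns (contents_by_pid, number_of_open_read_close_calls).
    """
    contents = {}
    calls = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():            # only numeric entries are PIDs
            continue
        path = os.path.join(proc_root, entry, name)
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue                       # task exited or file not accessible
        calls += 1                         # open()
        chunks = []
        try:
            while True:
                chunk = os.read(fd, 4096)
                calls += 1                 # read(), including the final EOF read
                if not chunk:
                    break
                chunks.append(chunk)
        except OSError:
            pass                           # task exited mid-read; keep what we got
        finally:
            os.close(fd)
            calls += 1                     # close()
        contents[int(entry)] = b"".join(chunks)
    return contents, calls
```

Even for a tiny file this loop issues four calls per process (open, one data read, one end-of-file read, close), so a `ps`-like tool that needs several files per task multiplies that by the number of files and the number of tasks.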
14
/proc/PID/smaps
7f1cb0afc000-7f1cb0afd000 rw-p 00021000 08:03 656516 /usr/lib64/ld-2.21.so
Size: 4 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 4 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me dw ac sd
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s
$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
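The first line of each smaps entry has the same shape as a /proc/PID/maps line, and its last column (the path) is optional. A minimal parser sketch, with hypothetical names `Mapping` and `parse_maps_line`, shows why that makes the format hard to extend: the reader can only tell whether the path is present by counting fields, so any column appended later would be mistaken for a path.

```python
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: Optional[str]   # absent for anonymous mappings

def parse_maps_line(line: str) -> Mapping:
    # Split into at most 6 fields; the path, if present, may contain spaces.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    start, end = (int(x, 16) for x in addr.split("-"))
    path = fields[5].rstrip("\n") if len(fields) > 5 else None
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)

m = parse_maps_line(
    "7f1cb0afc000-7f1cb0afd000 rw-p 00021000 08:03 656516 /usr/lib64/ld-2.21.so"
)
print((m.end - m.start) // 1024, "kB", m.path)
```

For the entry shown on this slide the computed size is 4 kB, matching the `Size:` field that smaps reports.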
15
Similar problem: info about sockets
● /proc
– /proc/net/netlink
– /proc/net/unix
– /proc/net/tcp
– /proc/net/packet
● Problems: not enough info, complex format, all-or-nothing
● Solution (2012): use netlink, generalize tcp_diag as sock_diag
– the extendable binary format
– allows specifying a group of attributes and a set of sockets
16
Solution 1: task_diag based on netlink socket
1. Netlink message format: binary and extendable
2. Ways to specify a set of processes
3. Optimal grouping of attributes
17
Netlink message format
nlmsg_len
nlmsg_type nlmsg_flags
nlmsg_seq
nlmsg_pid
nlattr_len nlattr_type
payload
nlattr_len nlattr_type
payload
● Simple and elegant
● Binary and easily extendable
● Easy to add a new group
● Easy to add a new attribute
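A minimal sketch of the layout, assuming the standard Linux netlink framing: a 16-byte `struct nlmsghdr` (`nlmsg_len`, `nlmsg_type`, `nlmsg_flags`, `nlmsg_seq`, `nlmsg_pid`) followed by attributes, each with a 4-byte header (`nla_len`, `nla_type`) and a payload padded to a 4-byte boundary. The message type and attribute type numbers used below are made up for illustration and are not task_diag's real values.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")   # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLATTR_HDR = struct.Struct("=HH")     # nla_len, nla_type
ALIGNTO = 4

def align(n):
    """Round n up to the netlink alignment (4 bytes)."""
    return (n + ALIGNTO - 1) & ~(ALIGNTO - 1)

def pack_attr(attr_type, payload):
    # nla_len covers header + payload, but not the trailing padding.
    length = NLATTR_HDR.size + len(payload)
    return NLATTR_HDR.pack(length, attr_type) + payload + b"\0" * (align(length) - length)

def pack_message(msg_type, flags, seq, pid, attrs):
    body = b"".join(pack_attr(t, p) for t, p in attrs)
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, flags, seq, pid) + body

def unpack_message(data):
    """Return (msg_type, [(attr_type, payload), ...])."""
    length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(data, 0)
    attrs = []
    offset = NLMSG_HDR.size
    while offset + NLATTR_HDR.size <= length:
        nla_len, nla_type = NLATTR_HDR.unpack_from(data, offset)
        if nla_len < NLATTR_HDR.size:
            break                          # malformed attribute, stop parsing
        attrs.append((nla_type, data[offset + NLATTR_HDR.size:offset + nla_len]))
        offset += align(nla_len)
    return msg_type, attrs

# Hypothetical type numbers: 1 = comm, 99 = an attribute added by a newer kernel.
msg = pack_message(0x20, 0, 1, 0, [(1, b"bash\0"), (99, b"future")])
KNOWN = {1: "comm"}
for attr_type, payload in unpack_message(msg)[1]:
    if attr_type in KNOWN:                 # unknown attributes are simply skipped
        print(KNOWN[attr_type], payload.rstrip(b"\0").decode())
```

Because every attribute carries its own length and type, an older reader can step over attribute 99 without understanding it, which is what makes the format extendable.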
18
Specify sets of processes
● TASK_DIAG_DUMP_ALL
– Dump all processes
● TASK_DIAG_DUMP_ALL_THREAD
– Dump all threads
● TASK_DIAG_DUMP_CHILDREN
– Dump children of a specific task
● TASK_DIAG_DUMP_THREAD
– Dump threads of a specific task
● TASK_DIAG_DUMP_ONE
– Dump one task
19
Groups of attributes
● TASK_DIAG_BASE
– PID, PGID, SID, TID, comm
● TASK_DIAG_CRED
– UID, GID, groups, capabilities
● TASK_DIAG_STAT
– per-task and per-process statistics (same as taskstats, not available in /proc)
● TASK_DIAG_VMA
– mapped memory regions and their access permissions (same as maps)
● TASK_DIAG_VMA_STAT
– memory consumption for each mapping (same as smaps)
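How a dump strategy and a set of attribute groups might combine into one request can be sketched as follows. This is an illustration only: the numeric values, the request layout (`"=QQI"`), and the `build_request` helper are assumptions, not the real task_diag ABI, which is defined in the headers of the repository linked at the end of the deck.

```python
import enum
import struct

class Show(enum.IntFlag):
    """Attribute groups as a bit mask (values are hypothetical)."""
    BASE = 1 << 0       # TASK_DIAG_BASE: PID, PGID, SID, TID, comm
    CRED = 1 << 1       # TASK_DIAG_CRED: UID, GID, groups, capabilities
    STAT = 1 << 2       # TASK_DIAG_STAT: taskstats-like statistics
    VMA = 1 << 3        # TASK_DIAG_VMA: same information as maps
    VMA_STAT = 1 << 4   # TASK_DIAG_VMA_STAT: same information as smaps

class Dump(enum.IntEnum):
    """Which tasks to report (values are hypothetical)."""
    ALL = 0
    ALL_THREAD = 1
    CHILDREN = 2
    THREAD = 3
    ONE = 4

# Hypothetical request layout: attribute mask, dump strategy, target PID.
REQUEST = struct.Struct("=QQI")

def build_request(show, dump, pid=0):
    if dump in (Dump.CHILDREN, Dump.THREAD, Dump.ONE) and pid <= 0:
        raise ValueError("this dump strategy needs a target PID")
    return REQUEST.pack(int(show), int(dump), pid)

# A ps-like tool asks for cheap groups only and never pays for VMA_STAT.
ps_request = build_request(Show.BASE | Show.CRED, Dump.ALL)
```

The design point is that expensive groups such as TASK_DIAG_VMA_STAT sit behind their own bit, so a caller that does not set it does not trigger the slow smaps-style accounting.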
20
This is what makes it real fast
1. Netlink message format: binary and extendable
2. Ways to specify a set of processes
3. Optimal grouping of attributes
21
Problems with netlink
● Designed for networking
● Not obvious where to get pid and user namespaces
● Impossible to restrict netlink sockets
– Credentials are saved when a socket is created
– Process can drop privileges, but netlink doesn't care
– The same socket can be used to get process attributes and to set IP addresses
22
Change netlink socket to a transactional file
● /proc/task_diag as a transactional file
– write request → read response
● Otherwise same as netlink socket
● LKML discussion has not reached a conclusion yet
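The write-request, read-response pattern can be sketched in a few lines. The `transact` helper is hypothetical, and it assumes the whole response fits in a single read, which may not hold for the real interface. On a kernel carrying the proposed patches the descriptor would come from something like `os.open("/proc/task_diag", os.O_RDWR)`; that file does not exist on mainline kernels.

```python
import os

def transact(fd, request, bufsize=16384):
    """Write one request to fd, then read one response from the same fd."""
    written = os.write(fd, request)
    if written != len(request):
        raise OSError("short write while sending the request")
    return os.read(fd, bufsize)
```

Compared with the open/read/close cycle per process per file, a single descriptor is opened once and each query costs two calls, regardless of how many tasks the response covers.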
23
Performance: ps
Traditional ps (using /proc/PID/* files):
$ time ./ps/pscommand ax | wc -l
50089
real 0m1.596s
user 0m0.475s
sys 0m1.126s
New ps (using task_diag):
$ time ./ps/pscommand ax | wc -l
50089
real 0m0.148s
user 0m0.069s
sys 0m0.086s
24
Performance: using perf tool
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of specjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
// David Ahern
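The ratios behind the figures quoted above can be worked out directly. The numbers are taken from the quote; the labels and the `speedup` helper are just for illustration.

```python
# (label, seconds reading /proc, seconds using task_diag), from the quoted results
measurements = [
    ("fork test, 50,000 tasks", 11.3, 2.2),
    ("7,440 tasks", 0.77, 0.096),
    ("80,000+ tasks", 32.1, 3.9),
]

def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds

for label, proc_s, diag_s in measurements:
    print(f"{label}: {speedup(proc_s, diag_s):.1f}x")
# prints:
# fork test, 50,000 tasks: 5.1x
# 7,440 tasks: 8.0x
# 80,000+ tasks: 8.2x
```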
25
Source code!
https://github.com/avagin/linux-task-diag/
Branch: devel
Examples: tools/testing/selftests/task_diag/
26
Thank you!
http://virtuozzo.com/
http://openvz.org/
http://criu.org/
@kolyshkin
kolyshkin AT gmail DOT com

Editor's Notes

  • #4 Slackware on floppies. Kernel 1.0.9, recompiled 1.1.50 from source. And it’s my second time here at TXLF, long way from Seattle.
  • #6 Virtuozzo as a product is essentially a supercharged version of OpenVZ, with containers and VMs working side by side and uniformly managed by the same set of tools. Storage idea is to take the individual servers’ hard drives to
  • #7 OpenVZ, my baby. First steps, first words, first kernel panics. Do we have any users in the audience? Full (system) containers for Linux. Developed since 1999, open source since 2005. Live migration since 2007. ~2000 Linux kernel patches enabling LXC, Docker, CoreOS… Biggest contributor to containers. Now reborn as Virtuozzo 7.
  • #8 4 years old! v.2.3 (June 2016). Aims to replace OpenVZ kernel c/r. Saves and restores sets of running processes. Integrated into LXC, Docker*. Not just for live migration! Save an HPC job or game, update kernel or hardware, balance load, speed up boot, reverse debug, inject faults.
  • #9 We failed to merge in-kernel c/r because that kernel code is very invasive, touching every kernel subsystem, and no kernel maintainer wanted that in their code.
  • #10 As I’m getting older, I find myself more and more interested in history.
  • #13 More than 40 files and 10 directories for each process. Our tests showed that reading that amount of files takes lots of time. Oh, and here is a picture of a classic locomotive, a rail transport vehicle. Why this picture? Because it’s slooow. Are there any engineers here? I mean, real ones, not software engineers. What would be the max speed of this beast?
  • #14 Variety of formats – no one wants to spend their life writing parsers for all these formats. Text-based: consider ps showing process time. Kernel has it in binary, shows to /proc as a string, ps reads it and converts to binary, to use say for sorting, and finally converts it to string when printing. An example of non-extendable format is /proc/*/maps – last field is file name, and it is ... optional!
  • #17 There are three defining properties of this solution. Let’s look at them in more detail.
  • #18 The structure is pretty generic, this is what makes this format extendable.
  • #20 One important thing here is optimal grouping. If any attribute greatly affects response speed, it should be split out into its own group.
  • #21 These three properties are what make the API real FAST. For those of you living in the US, here’s a picture of a European high-speed rail train, 186 miles per hour.
  • #22 Another bad example of using netlink: taskstats
  • #25 Final remark: open source is really awesome! Why? There are many people from many different places working on many different problems. The work that I just described is one example of such work.