Time to rethink /proc

The current Linux kernel /proc/PID interface is a great, time-proven, and reliable way to get information about the processes running on a system. Right? Well, yes and no. We found out (and you may have noticed it, too) that this interface is what makes ps and top slow when thousands of processes are running. Besides speed, the current /proc/PID interface has a number of other problems.

The talk describes all of these in great detail, then goes on to the alternative we are proposing for inclusion in the kernel: a new interface called task_diag. The new interface is slick, fast (a 5-10x speed improvement), and extendable.

  1. 1. Time to rethink /proc Kir Kolyshkin / Andrey Vagin @kolyshkin / @vagin_andrey Texas Linux Fest, 9 July 2016 Austin, TX
  2. 2. 2 Agenda ● Intro ● History of /proc ● Limitations of current interface ● Proposed solutions ● Performance results
  3. 3. 3 $ whoami ● Linux user since 1995 ● Developing containers since 2002 – author of vzctl and vzpkg ● Leading OpenVZ: 2005 to 2015 ● Twitter: @kolyshkin
  4. 4. 4 ● Founded in 1997 ● Spun off from Parallels ● HQ in Seattle, WA ● R&D in Moscow, RU 2016
  5. 5. 5 Products: ● Containers and hypervisors ● Distributed cluster storage
  6. 6. 6 OpenVZ
  7. 7. 7 CRIU: Checkpoint / Restore In Userspace
  8. 8. 8 Ideas behind CRIU ● We can't merge kernel c/r upstream, so... let’s redo the whole thing in userspace ● Use existing interfaces where available – /proc, ptrace, netlink, parasite code injection ● Amend the kernel where necessary – only ~180 kernel patches – kernel v3.11+ is sufficient (if CONFIG_CHECKPOINT_RESTORE is set)
  9. 9. 9 History of /proc part I ● Initial solution: /dev/kmem – May 1975, UNIX 6th edition (V6) – http://man.cat-v.org/unix-6th/4/mem ● First “old style” /proc – 1984, UNIX 8th edition (V8), by Tom Killian – A process is a file! Images of running processes – An alternative to ptrace(2) – http://man.cat-v.org/unix_8th/4/proc
  10. 10. 10 History of /proc part II ● Most well-known old-style /proc – 1988...1991: UNIX SVR4 (port from V8 with enhancements by Roger Faulkner and Ron Gomes) – read(), write(), and 37 ioctl()s ● First modern style /proc – mid-1990s, Plan 9 – Each process is a directory with multiple informational and control files – One can use ls and cat to work with it
  11. 11. 11 Plan 9 /proc interface
  12. 12. 12 Modern Linux interface: /proc/PID/* $ ls /proc/self/ attr             cwd      loginuid    numa_maps      schedstat  task autogroup        environ  map_files   oom_adj        sessionid  timers auxv             exe      maps        oom_score      setgroups  uid_map cgroup           fd       mem         oom_score_adj  smaps      wchan clear_refs       fdinfo   mountinfo   pagemap        stack cmdline          gid_map  mounts      personality    stat comm             io       mountstats  projid_map     statm coredump_filter  latency  net         root           status cpuset           limits   ns          sched          syscall
  13. 13. 13 Limitations of /proc/PID interface ● Requires at least three syscalls per process per file – open(), read(), close() ● Variety of formats, mostly text based ● Not enough information (/proc/PID/fd/*) ● Some formats are non-extendable – /proc/PID/maps where the last column is optional ● Sometimes slow due to extra attributes – /proc/PID/smaps vs /proc/PID/maps
  14. 14. 14 /proc/PID/smaps 7f1cb0afc000-7f1cb0afd000 rw-p 00021000 08:03 656516 /usr/lib64/ld-2.21.so Size: 4 kB Rss: 4 kB Pss: 4 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 4 kB Referenced: 4 kB Anonymous: 4 kB AnonHugePages: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB VmFlags: rd wr mr mw me dw ac sd $ time cat /proc/*/maps > /dev/null real 0m0.061s user 0m0.002s sys 0m0.059s $ time cat /proc/*/smaps > /dev/null real 0m0.253s user 0m0.004s sys 0m0.247s
  15. 15. 15 Similar problem: info about sockets ● /proc – /proc/net/netlink – /proc/net/unix – /proc/net/tcp – /proc/net/packet ● Problems: not enough info, complex format, all-or-nothing ● Solution (2012): use netlink, generalize tcp_diag as sock_diag – an extendable binary format – allows specifying a group of attributes and a set of sockets
  16. 16. 16 Solution 1: task_diag based on netlink socket 1. Netlink message format: binary and extendable 2. Ways to specify a set of processes 3. Optimal grouping of attributes
  17. 17. 17 Netlink message format ● Simple and elegant ● Binary and easily extendable ● Easy to add a new group ● Easy to add a new attribute ● Layout: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid (message header), followed by repeated attributes: nlattr_len, nlattr_type, payload
  18. 18. 18 Specify sets of processes ● TASK_DIAG_DUMP_ALL – Dump all processes ● TASK_DIAG_DUMP_ALL_THREAD – Dump all threads ● TASK_DIAG_DUMP_CHILDREN – Dump children of a specific task ● TASK_DIAG_DUMP_THREAD – Dump threads of a specific task ● TASK_DIAG_DUMP_ONE – Dump one task
  19. 19. 19 Groups of attributes ● TASK_DIAG_BASE – PID, PGID, SID, TID, comm ● TASK_DIAG_CRED – UID, GID, groups, capabilities ● TASK_DIAG_STAT – per-task and per-process statistics (same as taskstats, not available in /proc) ● TASK_DIAG_VMA – mapped memory regions and their access permissions (same as maps) ● TASK_DIAG_VMA_STAT – memory consumption for each mapping (same as smaps)
  20. 20. 20 This is what makes it real fast 1. Netlink message format: binary and extendable 2. Ways to specify a set of processes 3. Optimal grouping of attributes
  21. 21. 21 Problems with netlink ● Designed for networking ● Not obvious where to get pid and user namespaces ● Impossible to restrict netlink sockets – Credentials are saved when a socket is created – Process can drop privileges, but netlink doesn't care – The same socket can be used to get process attributes and to set IP addresses
  22. 22. 22 Change netlink socket to a transactional file ● /proc/task_diag as a transactional file – write request → read response ● Otherwise same as netlink socket ● The LKML discussion has not reached a conclusion yet
  23. 23. 23 Performance: ps Traditional ps (using /proc/PID/* files): $ time ./ps/pscommand ax | wc -l 50089 real 0m1.596s user 0m0.475s sys 0m1.126s New ps (using task_diag): $ time ./ps/pscommand ax | wc -l 50089 real 0m0.148s user 0m0.069s sys 0m0.086s
  24. 24. 24 Performance: using perf tool "Using the fork test command: 10,000 processes; 10k proc with 5 threads = 50,000 tasks. Reading /proc: 11.3 sec; task_diag: 2.2 sec. @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096. 128 instances of specjbb, 80,000+ tasks: reading /proc: 32.1 sec; task_diag: 3.9 sec. So overall much snappier startup times." // David Ahern
  25. 25. 25 Source code! https://github.com/avagin/linux-task-diag/ Branch: devel Examples: tools/testing/selftests/task_diag/
  26. 26. 26 Thank you! http://virtuozzo.com/ http://openvz.org/ http://criu.org/ @kolyshkin kolyshkin AT gmail DOT com
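
Slide 13 says that /proc/PID costs at least three syscalls per process per file. A minimal sketch of that access pattern follows. The function name `read_proc_files`, its parameters, and the single fixed-size read are assumptions made for illustration; this is not how ps or top are actually written. The directory listing itself costs further syscalls that are not counted here.

```python
import os


def read_proc_files(proc_root="/proc", names=("stat", "status"), bufsize=65536):
    """Read the named files for every numeric (PID) directory under proc_root.

    Returns (data, syscalls). `syscalls` counts only the open(), read() and
    close() calls issued here, to illustrate the per-file cost.
    """
    data = {}
    syscalls = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        for name in names:
            path = os.path.join(proc_root, entry, name)
            syscalls += 1  # open()
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process may have exited in the meantime
            try:
                syscalls += 1  # read(); one call, assuming the file fits in bufsize
                data[(int(entry), name)] = os.read(fd, bufsize)
            except OSError:
                pass  # the read can fail if the process is gone
            finally:
                syscalls += 1  # close()
                os.close(fd)
    return data, syscalls
```

With N processes and K files per process, this loop issues about 3 x N x K syscalls, which is the cost that a single batched request is meant to avoid.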
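
Slides 13 and 14 point out that /proc formats are mostly text and that the last column of a maps line is optional. As a rough illustration of what a consumer has to do, here is a sketch of a parser for the smaps text shown on slide 14. The function name `parse_smaps` and the shape of the returned dictionaries are assumptions, not part of any existing tool.

```python
def parse_smaps(text):
    """Parse /proc/PID/smaps text into a list of mapping dictionaries."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].endswith(":"):
            # Attribute line, e.g. "Rss: 4 kB" or "VmFlags: rd wr mr".
            if not mappings:
                continue  # attribute without a preceding mapping header
            key = fields[0][:-1]
            if len(fields) == 3 and fields[2] == "kB":
                mappings[-1]["attrs"][key] = int(fields[1])
            else:
                mappings[-1]["attrs"][key] = " ".join(fields[1:])
        else:
            # Header line: range, perms, offset, dev, inode, optional path.
            header = line.split(None, 5)
            start, end = (int(x, 16) for x in header[0].split("-"))
            mappings.append({
                "start": start,
                "end": end,
                "perms": header[1],
                "offset": int(header[2], 16),
                "dev": header[3],
                "inode": int(header[4]),
                "path": header[5].strip() if len(header) > 5 else None,
                "attrs": {},
            })
    return mappings


SAMPLE = """\
7f1cb0afc000-7f1cb0afd000 rw-p 00021000 08:03 656516 /usr/lib64/ld-2.21.so
Size: 4 kB
Rss: 4 kB
Private_Dirty: 4 kB
VmFlags: rd wr mr mw me dw ac sd
"""

if __name__ == "__main__":
    for m in parse_smaps(SAMPLE):
        print(m["path"], m["attrs"]["Rss"], "kB")
```

The parser has to guess line types from their shape and special-case the missing path, which is the kind of fragility that a binary attribute format avoids.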
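
The netlink layout on slide 17 (a fixed message header followed by length/type/payload attributes) can be sketched in a few lines. The header fields follow the kernel's `struct nlmsghdr`, and each attribute is padded to a 4-byte boundary. The attribute type numbers and the message type used in the example are hypothetical and are not task_diag's real values. Many netlink families also place a family-specific header between the message header and the attributes, which this sketch omits.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")  # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLA_HDR = struct.Struct("=HH")       # attribute length (header + payload), attribute type
ALIGNTO = 4


def align(n):
    """Round n up to the next multiple of 4."""
    return (n + ALIGNTO - 1) & ~(ALIGNTO - 1)


def pack_attr(attr_type, payload):
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def pack_message(msg_type, flags, seq, pid, attrs):
    body = b"".join(pack_attr(t, p) for t, p in attrs)
    header = NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, flags, seq, pid)
    return header + body


def unpack_message(buf):
    length, msg_type, flags, seq, pid = NLMSG_HDR.unpack_from(buf, 0)
    attrs = []
    offset = NLMSG_HDR.size
    while offset + NLA_HDR.size <= length:
        nla_len, nla_type = NLA_HDR.unpack_from(buf, offset)
        if nla_len < NLA_HDR.size or offset + nla_len > length:
            break  # malformed attribute; stop parsing
        attrs.append((nla_type, buf[offset + NLA_HDR.size:offset + nla_len]))
        offset += align(nla_len)
    return (msg_type, flags, seq, pid), attrs


# Hypothetical attribute numbers, for illustration only.
ATTR_PID, ATTR_COMM = 1, 2

if __name__ == "__main__":
    msg = pack_message(0x20, 0, 1, 0, [
        (ATTR_PID, struct.pack("=I", 1234)),
        (ATTR_COMM, b"bash\0"),
    ])
    print(unpack_message(msg))
```

Because every attribute carries its own length and type, a reader can skip types it does not know, which is what makes the format easy to extend.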
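
The speedups implied by slides 23 and 24 can be checked with simple division. The timings below are copied from those slides; the labels and the helper name `speedups` are only for illustration.

```python
# (time reading /proc, time using task_diag), in seconds, from slides 23-24
measurements = {
    "ps, 50089 lines (real time)": (1.596, 0.148),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}


def speedups(data):
    """Return the old/new time ratio for each measurement."""
    return {label: old / new for label, (old, new) in data.items()}


if __name__ == "__main__":
    for label, ratio in speedups(measurements).items():
        print(f"{label}: {ratio:.1f}x")
```

The ratios range from about 5x to about 11x, which is in line with the 5-10x improvement quoted in the abstract.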
