Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Linux containers are different from Solaris Zones or BSD Jails: they use discrete kernel features like cgroups, namespaces, SELinux, and more. We will describe those mechanisms in depth, as well as demo how to put them together to produce a container. We will also highlight how different container runtimes compare to each other.

This talk was delivered at DockerCon Europe 2015 in Barcelona.

Published in: Technology


  1. 1. Cgroups, namespaces and beyond: What are containers made from? Jérôme🐳🚢📦Petazzoni Tinkerer Extraordinaire
  2. 2. Who am I? Jérôme Petazzoni (@jpetazzo) French software engineer living in California I built and scaled the dotCloud PaaS (almost 5 years ago, time flies!) I talk (too much) about containers (first time was maybe in February 2013, at SCALE in L.A.) Proud member of the House of Bash (when annoyed, we replace things with tiny shell scripts)
  3. 3. Outline What is a container? High level and low level overviews The building blocks Namespaces, cgroups, copy-on-write storage Container runtimes Docker, LXC, rkt, runC, systemd-nspawn OpenVZ, Jails, Zones Hand-crafted, house-blend, artisan containers Demos!
  4. 4. What is a container? 📦
  5. 5. High level approach: lightweight VM I can get a shell on it through SSH or otherwise It "feels" like a VM own process space own network interface can run stuff as root can install packages can run services can mess up routing, iptables …
  6. 6. Low level approach: chroot on steroids It's not quite like a VM: uses the host kernel can't boot a different OS can't have its own modules doesn't need init as PID 1 doesn't need syslogd, cron... It's just normal processes on the host machine contrast with VMs which are opaque
  7. 7. How are they implemented? Let's look in the kernel source! Go to LXR Look for "LXC" → zero result Look for "container" → 1000+ results Almost all of them are about data structures or other unrelated concepts like “ACPI containers” There are some references to “our” containers but only in the documentation
  8. 8. How are they implemented? Let's look in the kernel source! Go to LXR Look for "LXC" → zero result Look for "container" → 1000+ results Almost all of them are about data structures or other unrelated concepts like “ACPI containers” There are some references to “our” containers but only in the documentation ??? Are containers even in the kernel ???
  9. 9. How are they implemented? Let's look in the kernel source! Go to LXR Look for "LXC" → zero result Look for "container" → 1000+ results Almost all of them are about data structures or other unrelated concepts like “ACPI containers” There are some references to “our” containers but only in the documentation ??? Are containers even in the kernel ??? ??? Do containers even actually EXIST ???
  10. 10. The building blocks: control groups
  11. 11. Control groups Resource metering and limiting memory CPU block I/O network* Device node (/dev/*) access control Crowd control *With cooperation from iptables/tc; more on that later
  12. 12. Generalities Each subsystem has a hierarchy (tree) separate hierarchies for CPU, memory, block I/O… Hierarchies are independent the trees for e.g. memory and CPU can be different Each process is in a node in each hierarchy think of each hierarchy as a different dimension or axis Each hierarchy starts with 1 node (the root) Initially, all processes start at the root node* Each node = group of processes sharing the same resources *It's a bit more subtle; more on that later
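The hierarchies described on this slide can be inspected from any shell, no root needed. A minimal sketch, assuming a Linux host with the usual cgroup v1 mount layout of the era (on a v2 unified-hierarchy kernel, /proc/self/cgroup shows a single 0::<path> line instead):

```shell
# One directory per subsystem hierarchy (cpu, memory, blkio, ...):
ls /sys/fs/cgroup/

# Every process can read its own node in each hierarchy.
# Each line reads: hierarchy-id:subsystem:path
cat /proc/self/cgroup
```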
  13. 13. Example
  14. 14. Memory cgroup: accounting Keeps track of pages used by each group: file (read/write/mmap from block devices) anonymous (stack, heap, anonymous mmap) active (recently accessed) inactive (candidate for eviction) Each page is “charged” to a group Pages can be shared across multiple groups e.g. multiple processes reading from the same files when pages are shared, only one group “pays” for a page
  15. 15. Memory cgroup: limits Each group can have its own limits limits are optional two kinds of limits: soft and hard limits Soft limits are not enforced they influence reclaim under memory pressure Hard limits will trigger a per-group OOM killer Limits can be set for different kinds of memory physical memory kernel memory total memory
  16. 16. Memory cgroup: avoiding OOM killer Do you like when the kernel terminates random processes because something unrelated ate all the memory in the system? Setup oom-notifier Then, when the hard limit is exceeded: freeze all processes in the group notify user space (instead of going rampage) we can kill processes, raise limits, migrate containers ... when we're in the clear again, unfreeze the group
  17. 17. Memory cgroup: tricky details Each time the kernel gives a page to a process, or takes it away, it updates the counters This adds some overhead This cannot be enabled/disabled per process it has to be done at boot time When multiple groups use the same page, only the first one is “charged” but if it stops using it, the charge is moved to another group
  18. 18. HugeTLB cgroup Controls the amount of huge pages usable by a process safe to ignore if you don’t know what huge pages are
  19. 19. Cpu cgroup Keeps track of user/system CPU time Keeps track of usage per CPU Allows setting weights Can't set CPU limits OK, let's say you give N%: then the CPU throttles to a lower clock speed, now what? Same if you give a time slot. Instructions? Their exec speed varies wildly ¯\_(ツ)_/¯
  20. 20. Cpuset cgroup Pin groups to specific CPU(s) Reserve CPUs for specific apps Avoid processes bouncing between CPUs Also relevant for NUMA systems Provides extra dials and knobs per zone memory pressure, process migration costs…
  21. 21. Blkio cgroup Keeps track of I/Os for each group per block device read vs write sync vs async Set throttle (limits) for each group per block device read vs write ops vs bytes Set relative weights for each group Note: most writes go through the page cache so classic writes will appear to be unthrottled at first
  22. 22. Net_cls and net_prio cgroup Automatically set traffic class or priority, for traffic generated by processes in the group Only works for egress traffic Net_cls will assign traffic to a class class then has to be matched with tc/iptables, otherwise traffic just flows normally Net_prio will assign traffic to a priority priorities are used by queuing disciplines
  23. 23. Devices cgroup Controls what the group can do on device nodes Permissions include read/write/mknod Typical use: allow /dev/{tty,zero,random,null} ... deny everything else A few interesting nodes: /dev/net/tun (network interface manipulation) /dev/fuse (filesystems in user space) /dev/kvm (VMs in containers, yay inception!) /dev/dri (GPU)
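The "allow a few nodes, deny everything else" policy on this slide can be sketched with the v1 devices controller. Hedged example: needs root, and the group name demo is made up; entries are "<type> <major>:<minor> <perms>":

```shell
#!/bin/sh
# Whitelist only a few device nodes for a group (cgroup v1 devices controller).
CG=/sys/fs/cgroup/devices/demo
if mkdir -p "$CG" 2>/dev/null && [ -w "$CG/devices.deny" ]; then
    echo 'a *:* rwm' > "$CG/devices.deny"    # start from "deny everything"
    echo 'c 1:3 rwm' > "$CG/devices.allow"   # /dev/null  (char device 1:3)
    echo 'c 1:5 rwm' > "$CG/devices.allow"   # /dev/zero  (char device 1:5)
    echo 'c 1:8 rwm' > "$CG/devices.allow"   # /dev/random (char device 1:8)
    cat "$CG/devices.list"                   # the effective whitelist
else
    echo "skipped: needs root and a v1 devices controller"
fi
```

The perms letters are r (read), w (write), and m (mknod), matching the slide.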
  24. 24. Freezer cgroup Allows freezing/thawing a group of processes Similar in functionality to SIGSTOP/SIGCONT Cannot be detected by the processes that’s how it differs from SIGSTOP/SIGCONT! Doesn’t impede ptrace/debugging Specific use cases: cluster batch scheduling, process migration
  25. 25. Subtleties PID 1 is placed at the root of each hierarchy New processes start in their parent’s groups Groups are materialized by pseudo-fs typically mounted in /sys/fs/cgroup Groups are created by mkdir in the pseudo-fs mkdir /sys/fs/cgroup/memory/somegroup/subcgroup To move a process: echo $PID > /sys/fs/cgroup/.../tasks The cgroup wars: systemd vs cgmanager vs ...
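The mkdir/echo mechanics on this slide can be strung together into a tiny end-to-end sketch. Hedged example: it assumes root and a v1 memory controller mounted at /sys/fs/cgroup/memory; the group name demo is made up:

```shell
#!/bin/sh
# Create a group, set a hard memory limit, move the current shell into it.
CG=/sys/fs/cgroup/memory/demo
if mkdir -p "$CG" 2>/dev/null && [ -w "$CG/tasks" ]; then
    echo $((128 * 1024 * 1024)) > "$CG/memory.limit_in_bytes"  # 128 MB hard limit
    echo $$ > "$CG/tasks"          # cgroup membership is inherited by children
    grep memory /proc/self/cgroup  # verify: we are now in .../demo
else
    echo "skipped: needs root and a cgroup v1 memory controller"
fi
```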
  26. 26. The building blocks: namespaces
  27. 27. Namespaces Provide processes with their own system view Cgroups = limits how much you can use; namespaces = limits what you can see you can’t use/affect what you can’t see Multiple namespaces: pid net mnt uts ipc user Each process is in one namespace of each type
  28. 28. Pid namespace Processes within a PID namespace only see processes in the same PID namespace Each PID namespace has its own numbering starting at 1 If PID 1 goes away, whole namespace is killed Those namespaces can be nested A process ends up having multiple PIDs one per namespace in which it's nested
  29. 29. Net namespace: in theory Processes within a given network namespace get their own private network stack, including: network interfaces (including lo) routing tables iptables rules sockets (ss, netstat) You can move a network interface across netns ip link set dev eth0 netns PID
  30. 30. Net namespace: in practice Typical use-case: use veth pairs (two virtual interfaces acting as a cross-over cable) eth0 in container network namespace paired with vethXXX in host network namespace all the vethXXX are bridged together (Docker calls the bridge docker0) But also: the magic of --net container shared localhost (and more!)
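The veth plumbing described above can be sketched with iproute2. Hedged example: needs root, and the namespace name demo, the interface names, and the 10.99.0.x addresses are all made up:

```shell
#!/bin/sh
# Wire a new network namespace to the host with a veth pair.
if [ "$(id -u)" = 0 ] && ip netns add demo 2>/dev/null; then
    ip link add veth-host type veth peer name veth-ctn   # the cross-over cable
    ip link set veth-ctn netns demo                      # push one end inside
    ip netns exec demo ip addr add 10.99.0.2/24 dev veth-ctn
    ip netns exec demo ip link set veth-ctn up
    ip netns exec demo ip link set lo up                 # each netns has its own lo
    ip addr add 10.99.0.1/24 dev veth-host
    ip link set veth-host up
    ip netns exec demo ip addr show veth-ctn             # the "container" view
    ip netns del demo                                    # also removes the veth pair
else
    echo "skipped: needs root and iproute2"
fi
```

In Docker's case the host-side ends are not addressed directly like this; they are all attached to the docker0 bridge.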
  31. 31. Mnt namespace Processes can have their own root fs conceptually close to chroot Processes can also have “private” mounts /tmp (scoped per user, per service...) masking of /proc, /sys NFS auto-mounts (why not?) Mounts can be totally private, or shared No easy way to pass along a mount from one namespace to another ☹
  32. 32. Uts namespace gethostname / sethostname 'nuff said!
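That per-namespace sethostname can be seen with unshare from util-linux. A sketch that runs unprivileged where user namespaces are enabled (the name ns-demo is made up):

```shell
#!/bin/sh
# Set a hostname inside a new UTS namespace; the host's stays untouched.
# --map-root-user gives us CAP_SYS_ADMIN inside the user namespace,
# which is what sethostname() checks there.
if unshare --user --uts true 2>/dev/null; then
    unshare --user --map-root-user --uts sh -c 'hostname ns-demo; hostname'
    hostname    # still the host's own name
else
    echo "skipped: unprivileged user namespaces unavailable"
fi
```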
  33. 33. Ipc namespace
  34. 34. Ipc namespace Does anybody know about IPC?
  35. 35. Ipc namespace Does anybody know about IPC? Does anybody care about IPC?
  36. 36. Ipc namespace Does anybody know about IPC? Does anybody care about IPC? Allows a process (or group of processes) to have own: IPC semaphores IPC message queues IPC shared memory ... without risk of conflict with other instances
  37. 37. User namespace Allows mapping UID/GID; e.g.: UID 0→1999 in container C1 is mapped to UID 10000→11999 on host; UID 0→1999 in container C2 is mapped to UID 12000→13999 on host; etc. UID in containers becomes irrelevant just use UID 0 in the container it gets squashed to a non-privileged user outside Security improvement But: the devil is in the details Just landed in Docker experimental branch!
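The mapping itself lives in /proc/<pid>/uid_map, and unshare can demonstrate it without root where unprivileged user namespaces are enabled:

```shell
#!/bin/sh
# Map the current (unprivileged) user to UID 0 inside a user namespace.
# uid_map columns: <uid inside> <uid outside> <range length>
if unshare --user true 2>/dev/null; then
    unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
else
    echo "skipped: unprivileged user namespaces unavailable"
fi
```

Inside, id -u reports 0, while uid_map shows which outside UID is really behind it: a 1-wide mapping here, versus the 2000-wide ranges in the slide's example.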
  38. 38. Namespace manipulation Namespaces are created with the clone() syscall i.e. with extra flags when creating a new process Namespaces are materialized by pseudo-files /proc/<pid>/ns When the last process of a namespace exits, it is destroyed but can be preserved by bind-mounting the pseudo-file It’s possible to “enter” a namespace with setns() exposed by the nsenter wrapper in util-linux
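Those pseudo-files can be listed directly, and comparing their targets tells you whether two processes share a namespace:

```shell
# One symlink per namespace type for every process:
ls -l /proc/self/ns/

# The link target encodes the namespace's identity (its inode number);
# two processes are in the same UTS namespace iff these match:
readlink /proc/self/ns/uts
readlink /proc/1/ns/uts 2>/dev/null || echo "(need permissions to read PID 1)"
```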
  39. 39. The building blocks: copy-on-write
  40. 40. Copy-on-write storage Create a new container instantly instead of copying its whole filesystem Storage keeps track of what has changed Many options available AUFS, overlay (file level) device mapper thinp (block level) BTRFS, ZFS (FS level) Considerably reduces footprint and “boot” times See also: Deep dive into Docker storage drivers
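A hedged sketch of the file-level flavor using overlay (in mainline since Linux 3.18): it needs root, and all the paths are made-up temporaries:

```shell
#!/bin/sh
# Copy-on-write with overlayfs: writes land in "upper", "lower" stays pristine.
T=$(mktemp -d)
mkdir "$T/lower" "$T/upper" "$T/work" "$T/merged"
echo original > "$T/lower/file"
if mount -t overlay overlay \
        -o "lowerdir=$T/lower,upperdir=$T/upper,workdir=$T/work" \
        "$T/merged" 2>/dev/null; then
    echo changed > "$T/merged/file"   # triggers a copy-up into upper/
    cat "$T/lower/file"               # "original": the image is untouched
    cat "$T/upper/file"               # "changed": only the delta is stored
    umount "$T/merged"
else
    echo "skipped: needs root and overlayfs"
fi
rm -rf "$T"
```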
  41. 41. Other details
  42. 42. Orthogonality All those things can be used independently If you only need resource isolation, use a few selected cgroups If you want to simulate a network of routers, use network namespaces Put the debugger in a container's namespaces, but not its cgroups so that the debugger doesn’t steal resources Setup a network interface in a container, then move it to another etc.
  43. 43. Some missing bits Capabilities break down "root / non-root" into fine-grained rights allow to keep root, but without the dangerous bits however: CAP_SYS_ADMIN remains a big catchall SELinux / AppArmor ... containers that actually contain deserve a whole talk on their own check https://github.com/jfrazelle/bane (AppArmor profile generator)
  44. 44. Container runtimes
  45. 45. Container runtimes using cgroups+namespaces
  46. 46. LXC Set of userland tools A container is a directory in /var/lib/lxc Small config file + root filesystem Early versions had no support for CoW Early versions had no support for moving images Requires a significant amount of elbow grease easy for sysadmins/ops, hard for devs
  47. 47. systemd-nspawn From its manpage: “For debugging, testing and building” “Similar to chroot, but more powerful” “Implements the Container Interface” Seems to position itself as plumbing Recently added support for ɹǝʞɔop images #define INDEX_HOST "index.do" /* the URL we get the data from */ "cker.io” #define HEADER_TOKEN "X-Do" /* the HTTP header for the auth token */ "cker-Token:” #define HEADER_REGISTRY "X-Do" /*the HTTP header for the registry */ "cker-Endpoints:”
  48. 48. Docker Engine Daemon controlled by REST(ish) API First versions shelled out to LXC now uses its own libcontainer runtime Manages containers, images, builds, and more Some people think it does too many things
  49. 49. rkt, runC Back to the basics! Focus on container execution no API, no image management, no build, etc. They implement different specifications: rkt implements appc (App Container) runC implements OCP (Open Container Project) runC leverages Docker's libcontainer
  50. 50. Which one is best? They all use the same kernel features Performance will be exactly the same Look at: features design ecosystem
  51. 51. Container runtimes using other mechanisms
  52. 52. OpenVZ Also Linux Older, but battle-tested e.g. Travis CI gives you root in OpenVZ Tons of neat features too ploop (efficient block device for containers) checkpoint/restore, live migration venet (~more efficient veth) Still developed and maintained
  53. 53. Jails / Zones FreeBSD / Solaris Coarser granularity of features Strong emphasis on security Great for hosting providers Not so much for developers where's the equivalent of docker run -ti ubuntu? Illumos branded zones can run Linux binaries with Joyent Triton, you can run Linux containers on Illumos This is different from the Oracle Solaris port the latter will run Solaris binaries, not Linux binaries
  54. 54. Building our own containers
  55. 55. Not for production use! (For educational purposes only)
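The live demo itself isn't captured in this transcript, but the building blocks above do compose into a hand-made container. A sketch for educational purposes only: it assumes a root filesystem was unpacked into $ROOTFS beforehand (/tmp/rootfs is a made-up default, e.g. from a distro tarball) and that unprivileged user namespaces are available:

```shell
#!/bin/sh
# A hand-rolled "container": user + UTS + IPC + PID + mount namespaces
# around a chroot. Not for production use!
ROOTFS=${ROOTFS:-/tmp/rootfs}
if [ -x "$ROOTFS/bin/sh" ] && unshare --user true 2>/dev/null; then
    # --map-root-user: UID 0 inside, plain user outside
    # --pid --fork:    fresh PID numbering, the child becomes PID 1
    # --mount-proc:    private mount namespace + a /proc matching the new PID ns
    unshare --user --map-root-user --uts --ipc --pid --fork \
            --mount-proc="$ROOTFS/proc" \
            chroot "$ROOTFS" /bin/sh -c 'hostname hand-made; echo "inner PID: $$"'
else
    echo "skipped: no rootfs at $ROOTFS, or no unprivileged user namespaces"
fi
```

Add a cgroup (as on the Subtleties slide) and some copy-on-write storage underneath $ROOTFS, and you have most of what the runtimes in the previous section provide.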
  56. 56. Thank you! Jérôme🐳🚢📦Petazzoni @jpetazzo jerome@docker.com
