N Problems
of Linux Containers
(with solutions!)
Kir Kolyshkin
<kir@openvz.org>
6 June 2015 ContainerDays Boston
Over the years our engineers faced some complex technical problems. This talk is an overview of some of those problems, as well as solutions we came up with.

Published in: Software

  1. N Problems of Linux Containers (with solutions!) Kir Kolyshkin <kir@openvz.org>, 6 June 2015, ContainerDays Boston
  2. openvz.org || criu.org || odin.com Problem: Effective virtualization ● Virtualization is partitioning ● Historical way: $M mainframes ● Modern way: virtual machines ● Problem: performance overhead ● Partial solution: hardware support (Intel VT, AMD-V)
  3. Solution: isolation ● Run many userspace instances on top of one single (Linux) kernel ● All processes see each other – files, process information, network, shared memory, users, etc. ● Make them unsee it!
  4. One historical way to unsee: chroot()
  5. Namespaces ● Implemented in the Linux kernel – PID (process tree) – net (net devices, addresses, routing, etc.) – IPC (shared memory, semaphores, message queues) – UTS (hostname, kernel version) – mnt (filesystem mounts) – user (UIDs/GIDs) ● clone() with CLONE_NEW* flags
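The namespace list on slide 5 maps directly onto the CLONE_NEW* flags passed to clone(). The following is a minimal Python sketch, not code from the talk: the flag values are the ones defined in `<linux/sched.h>`, while the helper names `clone_flags` and `current_namespaces` are made up for illustration. It only builds the flag mask and inspects `/proc/<pid>/ns`; it does not create any namespace.

```python
import os

# Flag values as defined in <linux/sched.h>; the keys follow the slide's naming.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS   (filesystem mounts)
    "uts": 0x04000000,   # CLONE_NEWUTS  (hostname)
    "ipc": 0x08000000,   # CLONE_NEWIPC  (shared memory, semaphores, queues)
    "user": 0x10000000,  # CLONE_NEWUSER (UIDs/GIDs)
    "pid": 0x20000000,   # CLONE_NEWPID  (process tree)
    "net": 0x40000000,   # CLONE_NEWNET  (devices, addresses, routing)
}


def clone_flags(namespaces):
    """OR together the CLONE_NEW* flags for the requested namespaces."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


def current_namespaces(pid="self"):
    """Return {namespace: identifier} read from /proc/<pid>/ns.

    Returns an empty dict where that directory does not exist
    (for example on a non-Linux host).
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return result


if __name__ == "__main__":
    print(hex(clone_flags(["pid", "net", "mnt"])))
    for name, ident in current_namespaces().items():
        print(name, ident)
```

Two processes whose `/proc/<pid>/ns/pid` links show the same identifier share a PID namespace. On Python 3.12 or later, the resulting mask could be handed to `os.unshare()` to actually leave the current namespaces, which normally requires root or a user namespace.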
  6. Problem: Shared resources ● All containers share the same set of resources (CPU, RAM, disk, various in-kernel things ...) ● Need fair distribution of “goods” so everyone gets their share ● Need DoS prevention ● Need prioritization and SLAs
  7. Solution: OpenVZ resource controls ● OpenVZ: – user beancounters (controls 20 parameters) – hierarchical CPU scheduler – disk quota per container – I/O priority and I/O bandwidth limit per container ● Dynamic control, can “resize” at runtime
  8. Solution 2: VSwap ● Only two primary parameters: RAM and swap – others still exist, but are optional ● Swap is virtual, no actual I/O is performed ● Slow down to emulate real swap ● Only when an actual global RAM shortage occurs does virtual swap go into the real swap ● Currently only available in the OpenVZ kernel
  9. Solution: cgroups + controllers ● Cgroups is a mechanism to control resources per hierarchical groups of processes ● Cgroups is nothing without controllers: – blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio ● Cgroups are orthogonal to namespaces ● Still working on it: just added kmem controller
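To make "resources per hierarchical groups of processes" from slide 9 concrete, here is a toy Python model, not the kernel interface. It assumes a single memory-style controller and the rule that a group can never use more than its tightest ancestor allows. The class `Group` and its methods are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

GIB = 1024 ** 3


@dataclass
class Group:
    """One node in a toy cgroup-like hierarchy with a memory limit."""
    name: str
    memory_limit: Optional[int] = None   # bytes; None means "no limit set here"
    parent: Optional["Group"] = None
    pids: Set[int] = field(default_factory=set)

    def child(self, name, memory_limit=None):
        """Create a sub-group nested under this one."""
        return Group(name, memory_limit, parent=self)

    def path(self):
        """Slash-separated path from the root, e.g. /containers/101."""
        parts = []
        group = self
        while group.parent is not None:
            parts.append(group.name)
            group = group.parent
        return "/" + "/".join(reversed(parts))

    def effective_memory_limit(self):
        """Tightest limit among this group and all of its ancestors."""
        limits = []
        group = self
        while group is not None:
            if group.memory_limit is not None:
                limits.append(group.memory_limit)
            group = group.parent
        return min(limits) if limits else None


if __name__ == "__main__":
    root = Group("")
    containers = root.child("containers", memory_limit=4 * GIB)
    ct101 = containers.child("101", memory_limit=8 * GIB)
    ct101.pids.add(1234)
    print(ct101.path(), ct101.effective_memory_limit() // GIB, "GiB")
```

In this model a container configured with 8 GiB under a 4 GiB parent is effectively capped at 4 GiB, which is the point of hierarchical accounting.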
  10. Solution 3: vcmmd ● 4th generation of OpenVZ resource management ● A user-space daemon using kernel controls ● Monitors usage, tweaks limits ● Adds a “time” dimension ● More flexible limits, e.g. burstable
  11. Problem: fast live migration ● We can already live migrate a running OpenVZ container from one server to another without shutting it down ● We want to do it fast even for huge containers – huge disk: use shared storage – huge RAM: ???
  12. Live migration process (assuming shared storage) ● 1 Freeze the container ● 2 Dump its complete state to a dump file ● 3 Copy the dump file to the destination server ● 4 Undump back to RAM, recreate everything ● 5 Unfreeze ● Problem: huge dump file, takes a long time (seconds) to dump, copy, undump
  13. Solution 1: network swap ● 1 Dump the minimal memory, lock the rest ● 2 Restore the minimal memory, mark the rest as swapped out ● 3 Set up network swap from the source ● 4 Unfreeze. Missing RAM will be “swapped in” ● 5 Migrate the rest of RAM and kill it on source
  14. Solution 1: network swap ● 1 Dump the minimal memory, lock the rest ● 2 Copy, undump what we have, mark the rest as swapped out ● 3 Set up network swap served from the source ● 4 Unfreeze. Missing RAM will be “swapped in” ● 5 Migrate the rest of RAM and kill it on source ● PROBLEM: no way to rollback
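The network swap steps on slides 13 and 14 can be illustrated with a small simulation. This is a sketch under loud assumptions: pages are dictionary entries, the "network" is a direct reference to the source's page table, and the class name `NetworkSwappedMemory` is invented here. It is not how OpenVZ implements it.

```python
class NetworkSwappedMemory:
    """Destination-side memory where missing pages are pulled from the source."""

    def __init__(self, minimal_pages, source_pages):
        self.local = dict(minimal_pages)   # step 2: restored minimal memory
        self.source = source_pages         # step 3: stands in for network swap
        self.faults = 0                    # how many pages were "swapped in"

    def read(self, page):
        """Step 4: after unfreeze, a missing page is fetched on first access."""
        if page not in self.local:
            self.faults += 1
            # The page moves: once fetched, the source no longer holds it.
            self.local[page] = self.source.pop(page)
        return self.local[page]

    def migrate_rest(self):
        """Step 5: push the remaining pages and empty the source."""
        while self.source:
            page, data = self.source.popitem()
            self.local[page] = data


if __name__ == "__main__":
    source = {n: f"data{n}" for n in range(8)}
    # Step 1: dump only a minimal set (pages 0 and 1), leave the rest on source.
    minimal = {n: source.pop(n) for n in (0, 1)}
    memory = NetworkSwappedMemory(minimal, source)
    print(memory.read(0), memory.read(5), memory.faults)
    memory.migrate_rest()
    print(len(memory.local), len(source))
```

The sketch also hints at the rollback problem: as soon as the destination starts running, the up-to-date state is split between two machines, so neither side alone can resume the container if the transfer fails.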
  15. Solution 2: Iterative RAM migration ● 1 Ask kernel to track modified pages ● 2 Copy all memory to the destination system memory ● 3 Ask kernel for the list of modified pages ● 4 Copy those pages ● 5 GOTO 3 until satisfied ● 6 Freeze and do migration as usual, but with a much smaller set of pages
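The loop on slide 15 can be sketched as a simulation. Everything here is an assumption made for illustration: memory is a dictionary of pages, the kernel's modified-page tracking is replaced by a callback that returns the set of pages it touched, and "until satisfied" is read as "the dirty set is at most `threshold` pages, or `max_rounds` is reached". The names `iterative_migrate` and `shrinking_workload` are hypothetical.

```python
def iterative_migrate(source, run_workload, threshold=2, max_rounds=10):
    """Copy `source` pages to a new dict while the workload keeps writing.

    run_workload(memory, round_no) mutates `memory` and returns the set of
    pages it modified, standing in for the kernel's dirty-page tracking.
    Returns (dest, rounds, pages_copied_while_frozen).
    """
    dest = {}
    dirty = set(source)            # steps 1-2: the first pass copies everything
    rounds = 0
    while True:
        for page in dirty:         # step 4: copy the pages known to be stale
            dest[page] = source[page]
        rounds += 1
        # Step 3: the container kept running during the copy; ask what changed.
        dirty = run_workload(source, rounds)
        if len(dirty) <= threshold or rounds >= max_rounds:
            break                  # step 5: "satisfied", stop iterating
    # Step 6: freeze (no more writes) and copy only the small remainder.
    frozen_pages = len(dirty)
    for page in dirty:
        dest[page] = source[page]
    return dest, rounds, frozen_pages


def shrinking_workload(memory, round_no):
    """Toy workload whose write set halves every round (16 >> round_no pages)."""
    touched = set(range(max(1, 16 >> round_no)))
    for page in touched:
        memory[page] = f"v{round_no}"
    return touched


if __name__ == "__main__":
    source = {n: "v0" for n in range(64)}
    dest, rounds, frozen_pages = iterative_migrate(source, shrinking_workload)
    print(rounds, frozen_pages, dest == source)
```

The freeze time depends on the size of the final dirty set, not on total RAM, which is why this approach helps for huge containers as long as the workload's write set shrinks or stays small.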
  16. Problem: upstreaming ● OpenVZ was developed separately ● Same for many past IBM Linux projects (ELVM, CKRM, ...) ● Develop, then merge it upstream (i.e. to vanilla Linux kernel) ● Problem?
  17. Problem: upstreaming ● OpenVZ was developed separately ● Same for many past IBM Linux projects (ELVM, CKRM, ...) ● Develop, then merge it upstream (i.e. to vanilla Linux kernel) ● Problem: grizzly bears (upstream developers) do not accept massive patchsets appearing out of nowhere
  18. Solution 1: rewrite from scratch ● User Beancounters -> CGroups + controllers ● PID namespace: 2 rewrites until accepted ● Network namespace: rewritten ● It works! ● 1500+ patches ended up in vanilla ● OpenVZ made it to the top 10 contributors
  19. Solution 2: circumvent the system! ● We tried hard to merge checkpoint/restore ● Other people tried hard too, no luck ● Can't make it to the kernel? Let's riot! (That is, implement it in userspace) ● With minimal kernel intervention when required ● Kernel exports most of the information already, so let's just add the missing bits and pieces
  20. CRIU ● Checkpoint / Restore [mostly] In Userspace ● About 3 years old, tools at version 1.6 ● Users: Google, Samsung, Huawei, ... ● LXC & Docker: integrated! ● Already in upstream 3.x kernels (CONFIG_CHECKPOINT_RESTORE) ● Live migration: P.Haul http://criu.org/P.Haul
  21. CRIU Linux kernel patches, per kernel version ● Total: 176 (+11 this year) ● [Bar chart: one bar per kernel version from 3.3 through 4.1, plus “pending”; vertical axis from 0 to 60]
  22. Problem: common file system ● Container is just a directory on the host we chroot() into ● File system journal (metadata updates) is a bottleneck ● Lots of small-file I/O on CT backup/migration (sometimes rsync hangs or OOMs!) ● No sub-tree disk quota support in upstream ● No sub-tree snapshots ● Live migration: rsync -- changed inodes ● File system type and properties are fixed, same for all CTs
  23. Solution 1: LVM ● Works only on top of a block device ● Hard to manage (e.g. how to migrate a huge volume?) ● No thin provisioning
  24. Solution 2: loop device (filesystem within a file) ● VFS operations lead to double page-caching – (already fixed in recent kernels) ● No thin provisioning ● Limited feature set
  25. Solution 3: ZFS + zvol ● PRO: features – zvol, thin provisioning, dedup, zfs send/receive ● CONTRA: – Licensing is problematic – Linux port issues (people report cache OOM) – Was not available in 2008
  26. Solution 4: ploop ● Basic idea: same as block loop, just better ● Modular design: – various image formats (qcow2: TODO, in progress) – various I/O backends (ext4, vfs O_DIRECT, nfs) ● Feature rich: – online resize (grow and shrink, ballooning) – instant live snapshots – write tracker to facilitate faster live migration
  27. Any problems (questions)? ● kir@openvz.org ● Twitter: @kolyshkin @_openvz_ @__criu__
