Over the years our engineers faced some complex technical problems. This talk is an overview of some of those problems and of the solutions we came up with.
3. openvz.org || criu.org || odin.com
Solution: isolation
● Run many userspace instances
on top of one single (Linux) kernel
● All processes see each other
– files, process information, network,
shared memory, users, etc.
● Make them unsee it!
Namespaces
● Implemented in the Linux kernel
– PID (process tree)
– net (net devices, addresses, routing etc)
– IPC (shared memory, semaphores, msg queues)
– UTS (hostname, kernel version)
– mnt (filesystem mounts)
– user (UIDs/GIDs)
● clone() with CLONE_NEW* flags
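As a rough illustration of the CLONE_NEW* flags listed above, the sketch below maps each namespace to its flag value from the Linux header <linux/sched.h> and ORs them together, the way a caller of clone() would. The helper name namespace_flags is hypothetical, and nothing here actually creates a namespace.

```python
# Flag values as defined in the Linux UAPI header <linux/sched.h>.
CLONE_NEW_FLAGS = {
    "mnt":  0x00020000,  # CLONE_NEWNS
    "uts":  0x04000000,  # CLONE_NEWUTS
    "ipc":  0x08000000,  # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid":  0x20000000,  # CLONE_NEWPID
    "net":  0x40000000,  # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_NEW_FLAGS[name]
    return flags


# All six namespaces from the list: what a fully isolated container would ask for.
full_isolation = namespace_flags(CLONE_NEW_FLAGS)
```

The resulting mask would be passed as the flags argument of clone() (or unshare()), which requires appropriate privileges.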
Problem: Shared resources
● All containers share the same set of resources
(CPU, RAM, disk, various in-kernel things ...)
● Need fair distribution of “goods” so everyone
gets their share
● Need DoS prevention
● Need prioritization and SLAs
Solution: OpenVZ resource controls
● OpenVZ:
– user beancounters
● controls 20 parameters
– hierarchical CPU scheduler
– per-container disk quota
– I/O priority and I/O bandwidth limit per-container
● Dynamic control, can “resize” at runtime
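To make the beancounter idea more concrete, here is a toy model of a single accounted parameter. The field names follow the columns of /proc/user_beancounters (held, maxheld, limit, failcnt); the barrier column is left out for brevity, and the class itself is an illustrative assumption, not the kernel implementation.

```python
from dataclasses import dataclass


@dataclass
class Beancounter:
    """Toy model of one resource parameter (not the kernel code)."""
    limit: int
    held: int = 0
    maxheld: int = 0
    failcnt: int = 0

    def charge(self, amount):
        """Account `amount` units; refuse and count a failure if over limit."""
        if self.held + amount > self.limit:
            self.failcnt += 1
            return False
        self.held += amount
        self.maxheld = max(self.maxheld, self.held)
        return True

    def uncharge(self, amount):
        """Release previously charged units."""
        self.held = max(0, self.held - amount)


numproc = Beancounter(limit=2)
results = [numproc.charge(1) for _ in range(3)]  # the third charge is refused
numproc.limit = 4                                # "resize" at runtime
resized = numproc.charge(1)                      # now succeeds
```

Changing the limit on a live object mirrors the "dynamic control" point: no restart of the container is needed for the new value to take effect.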
Solution 2: VSwap
● Only two primary parameters: RAM and swap
– others still exist, but are optional
● Swap is virtual, no actual I/O is performed
● The container is slowed down to emulate real swap
● Only when actual global RAM shortage occurs,
virtual swap goes into the real swap
● Currently only available in OpenVZ kernel
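The accounting behind this can be sketched with a small model. Everything in it (class name, page counts, the per-page delay) is an assumption made for illustration; it only mirrors the behaviour described above: pages over the RAM limit are charged to virtual swap and penalized with a delay, and allocations over RAM + swap are refused.

```python
class VSwapContainer:
    """Toy accounting model; page counts and the delay are made up."""

    def __init__(self, ram_pages, swap_pages, delay_per_page=0.001):
        self.ram_pages = ram_pages
        self.swap_pages = swap_pages
        self.delay_per_page = delay_per_page
        self.used = 0        # total pages charged to the container
        self.penalty = 0.0   # accumulated artificial slowdown, in seconds

    @property
    def in_virtual_swap(self):
        """Pages accounted as swapped out (they still live in host RAM)."""
        return max(0, self.used - self.ram_pages)

    def allocate(self, pages):
        if self.used + pages > self.ram_pages + self.swap_pages:
            return False     # over RAM + swap: allocation refused
        before = self.in_virtual_swap
        self.used += pages
        # Pages beyond the RAM limit are charged a delay that imitates
        # the cost of real swap I/O, although no I/O takes place.
        self.penalty += (self.in_virtual_swap - before) * self.delay_per_page
        return True


ct = VSwapContainer(ram_pages=100, swap_pages=50)
ct.allocate(90)   # fits in RAM, no penalty
ct.allocate(30)   # 20 pages land in virtual swap
```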
Solution: cgroups + controllers
● Cgroups is a mechanism to control resources
per hierarchical groups of processes
● Cgroups is nothing without controllers:
– blkio, cpu, cpuacct, cpuset, devices, freezer,
memory, net_cls, net_prio
● Cgroups are orthogonal to namespaces
● Still working on it: just added kmem controller
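The "hierarchical groups" part can be illustrated with a toy model in which a charge must fit under the limit of the group and of every ancestor. This is a simplified sketch loosely modelled on how a memory-like controller behaves; the class and method names are hypothetical and do not correspond to any kernel interface.

```python
class Cgroup:
    """Toy hierarchical group with an optional limit (arbitrary units)."""

    def __init__(self, name, parent=None, limit=None):
        self.name = name
        self.parent = parent
        self.limit = limit
        self.usage = 0

    def ancestors(self):
        """Yield this group, then its parent, up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent

    def try_charge(self, amount):
        """Charge this group and all of its ancestors, or nothing at all."""
        chain = list(self.ancestors())
        for group in chain:
            if group.limit is not None and group.usage + amount > group.limit:
                return False
        for group in chain:
            group.usage += amount
        return True


root = Cgroup("root")
containers = Cgroup("containers", parent=root, limit=1000)
ct1 = Cgroup("ct1", parent=containers, limit=800)
ct2 = Cgroup("ct2", parent=containers, limit=800)
```

With this setup, ct1 and ct2 each have a limit of 800, but together they can never exceed the 1000 allowed for their parent group.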
Solution 3: vcmmd
● 4th generation of OpenVZ resource management
● A user-space daemon using kernel controls
● Monitors usage, tweaks limits
● Adds a “time” dimension
● More flexible limits, e.g. burstable
Problem: fast live migration
● We can already live migrate
a running OpenVZ container
from one server to another
without shutting it down
● We want to do it fast even for huge containers
– huge disk: use shared storage
– huge RAM: ???
Live migration process
(assuming shared storage)
● 1 Freeze the container
● 2 Dump its complete state to a dump file
● 3 Copy the dump file to destination server
● 4 Undump back to RAM, recreate everything
● 5 Unfreeze
● Problem: huge dump file -- takes a long time*
to dump, copy, undump
* seconds
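The five steps above can be sketched as a tiny simulation. The Container class and the use of pickle as the "dump file" are stand-ins chosen for illustration; a real migration dumps kernel-level state, not a Python dict. The point of the sketch is that the container stays frozen for the whole dump, copy and undump sequence, so the freeze time grows with the dump size.

```python
import pickle


class Container:
    """Stand-in for a container: a name, a state dict and a frozen flag."""

    def __init__(self, name, state=None, frozen=False):
        self.name = name
        self.state = dict(state or {})
        self.frozen = frozen


def migrate(source):
    """Stop-and-copy migration; returns the new container and the dump size."""
    source.frozen = True                               # 1. freeze
    dump = pickle.dumps((source.name, source.state))   # 2. dump complete state
    transferred = bytes(dump)                          # 3. "copy" to destination
    name, state = pickle.loads(transferred)            # 4. undump, recreate
    dest = Container(name, state, frozen=True)
    dest.frozen = False                                # 5. unfreeze
    return dest, len(transferred)


src = Container("ct101", {"counter": 42})
dst, dump_size = migrate(src)
```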
Solution 1: network swap
● 1 Dump the minimal memory, lock the rest
● 2 Copy, undump what we have,
mark the rest as swapped out
● 3 Set up network swap served from the source
● 4 Unfreeze. Missing RAM will be “swapped in”
● 5 Migrate the rest of RAM and kill it on source
● PROBLEM: no way to rollback
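A minimal sketch of the "swap in over the network" idea, under the assumption that memory can be modelled as a dict of pages and the network link as direct access to the source's dict. The class name LazyMemory and its methods are hypothetical.

```python
class LazyMemory:
    """Destination-side memory whose missing pages are fetched on access."""

    def __init__(self, initial_pages, source_pages):
        self.pages = dict(initial_pages)  # minimal set restored up front
        self.source = source_pages        # stands in for the network link
        self.faults = 0

    def read(self, page_no):
        if page_no not in self.pages:
            # "Swap in" the page from the source server on first touch.
            self.pages[page_no] = self.source.pop(page_no)
            self.faults += 1
        return self.pages[page_no]

    def drain(self):
        """Background transfer of whatever the source still holds."""
        while self.source:
            page_no, data = self.source.popitem()
            self.pages[page_no] = data


source = {n: f"page-{n}" for n in range(8)}
minimal = {n: source.pop(n) for n in (0, 1)}  # step 1: dump the minimal memory
mem = LazyMemory(minimal, source)             # steps 2-4: restore and unfreeze
```

Once the destination has started running, the up-to-date state is split between the two hosts, which is why there is no way to roll back: neither side holds a complete, current copy until the transfer finishes.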
Solution 2: Iterative RAM migration
● 1 Ask kernel to track modified pages
● 2 Copy all memory to the destination system's memory
● 3 Ask kernel for list of modified pages
● 4 Copy those pages
● 5 GOTO 3 until satisfied
● 6 Freeze and do migration as usual, but
with much smaller set of pages
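The loop above can be sketched as follows. Memory is modelled as a dict of pages, and the kernel's modified-page tracking is replaced by a hypothetical workload callback that reports which pages it touched; max_rounds and threshold are made-up knobs standing in for "until satisfied".

```python
def iterative_migrate(memory, workload, max_rounds=10, threshold=2):
    """Copy `memory` (dict page -> data) while `workload` keeps dirtying it.

    `workload(memory)` is a hypothetical callback that mutates some pages
    and returns the set of page numbers it modified during one round.
    """
    dest = {}
    to_copy = set(memory)          # step 2: the first pass copies everything
    rounds = 0
    while True:
        for page in to_copy:
            dest[page] = memory[page]
        rounds += 1
        dirty = workload(memory)   # step 3: pages modified in the meantime
        if len(dirty) <= threshold or rounds >= max_rounds:
            break                  # step 5: "satisfied"
        to_copy = dirty            # step 4: copy only those pages
    # Step 6: freeze (the workload no longer runs) and copy the remainder.
    for page in dirty:
        dest[page] = memory[page]
    return dest, rounds, len(dirty)


def make_workload(sizes):
    """Workload that dirties the first n pages, n taken from `sizes` in turn."""
    remaining = iter(sizes)
    version = [0]

    def workload(memory):
        n = next(remaining, 0)
        version[0] += 1
        dirty = set(range(n))
        for page in dirty:
            memory[page] = f"v{version[0]}"
        return dirty

    return workload


memory = {n: "v0" for n in range(16)}
dest, rounds, frozen_pages = iterative_migrate(memory, make_workload([8, 4, 1]))
```

In this run only one page has to be copied while the container is frozen, instead of all sixteen.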
Problem: upstreaming
● OpenVZ was developed separately
● Same for many past IBM Linux projects
(ELVM, CKRM, ...)
● Develop, then merge it upstream
(i.e. to vanilla Linux kernel)
● Problem:
grizzly bears... er, upstream developers
do not accept massive patchsets
appearing out of nowhere
Solution 1: rewrite from scratch
● User Beancounters -> CGroups + controllers
● PID namespace: 2 rewrites until accepted
● Network namespace – rewritten
● It works!
● 1500+ patches ended up in vanilla
● OpenVZ made it into the top 10 contributors
Solution 2: circumvent the system!
● We tried hard to merge checkpoint/restore
● Other people tried hard too, no luck
● Can't make it to the kernel? Let's riot... er,
implement it in userspace
● With minimal kernel intervention when required
● Kernel exports most of information already, so
let's just add missing bits and pieces
CRIU
● Checkpoint / Restore [mostly] In Userspace
● About 3 years old, tools at version 1.6
● Users: Google, Samsung, Huawei, ...
● LXC & Docker – integrated!
● Kernel support already in upstream 3.x kernels:
CONFIG_CHECKPOINT_RESTORE
● Live migration: P.Haul http://criu.org/P.Haul
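For orientation, this is roughly what invoking the criu tool looks like. The sketch only builds the command lines and does not run them. The helper names are hypothetical; the options used (--tree, --images-dir, --shell-job, --leave-running) are criu command-line options, with --shell-job being needed only when the target was started from an interactive shell.

```python
def criu_dump_cmd(pid, images_dir, leave_running=False):
    """Build an argv list for `criu dump` of the process tree rooted at pid."""
    cmd = ["criu", "dump", "--tree", str(pid),
           "--images-dir", images_dir, "--shell-job"]
    if leave_running:
        # Keep the processes running after the dump instead of killing them.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir):
    """Build an argv list for `criu restore` from a directory of images."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


dump_cmd = criu_dump_cmd(1234, "/tmp/ckpt")
restore_cmd = criu_restore_cmd("/tmp/ckpt")
```

The lists could be passed to subprocess.run() on a host with criu installed and sufficient privileges; the PID and directory shown are placeholders.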
Problem: common file system
● Container is just a directory on the host we chroot() into
● File system journal (metadata updates) is a bottleneck
● Lots of small-file I/O on CT backup/migration
(sometimes rsync hangs or OOMs!)
● No sub-tree disk quota support in upstream
● No sub-tree snapshots
● Live migration with rsync changes inode numbers
● File system type and properties are fixed, same for all CTs
Solution 1: LVM
● Works only on top of a block device
● Hard to manage
(e.g. how to migrate a huge volume?)
● No thin provisioning
Solution 2: loop device
(filesystem within a file)
● VFS operations lead to double page-caching
– (already fixed in the recent kernels)
● No thin provisioning
● Limited feature set
Solution 3: ZFS + zvol
● PRO: features
– zvol, thin provisioning, dedup, zfs send/receive
● CONTRA:
– Licensing is problematic
– Linux port issues (people report cache OOM)
– Was not available in 2008
Solution 4: ploop
● Basic idea: same as block loop, just better
● Modular design:
– various image formats (qcow2 in progress)
– various I/O backends (ext4, vfs O_DIRECT, nfs)
● Feature rich:
– online resize (grow and shrink, ballooning)
– instant live snapshots
– write tracker to facilitate faster live migration
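Two of these features, instant snapshots and the write tracker, can be illustrated with a toy block image built as a stack of deltas. This is a simplified model written for this overview, not the ploop on-disk format or API; all names are hypothetical.

```python
class LayeredImage:
    """Toy block image: a stack of deltas, newest last."""

    def __init__(self):
        self.deltas = [{}]    # each delta maps block number -> data
        self.tracked = None   # set of written blocks while tracking is on

    def write(self, block, data):
        self.deltas[-1][block] = data   # writes always go to the top delta
        if self.tracked is not None:
            self.tracked.add(block)

    def read(self, block):
        for delta in reversed(self.deltas):  # the newest delta wins
            if block in delta:
                return delta[block]
        return None                          # never-written block

    def snapshot(self):
        """Freeze the current contents by starting a new, empty top delta."""
        self.deltas.append({})
        return len(self.deltas) - 2          # index of the frozen delta

    def start_tracking(self):
        """Begin recording which blocks get written from now on."""
        self.tracked = set()


img = LayeredImage()
img.write(0, "a")
img.write(1, "b")
snap = img.snapshot()   # instant: no data is copied
img.start_tracking()
img.write(1, "B")       # lands in the new top delta and is tracked
```

The tracked set plays the same role for disk blocks as modified-page tracking does for RAM: a migration can copy the image while the container runs, then re-copy only the blocks written in the meantime.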