1. A brief history of (mostly) Linux Containers
/ a nested talk /
Kir Kolyshkin <kir@openvz.org>
ContainerDays Boston, 5th of June 2015
2. Last Century
● 1999: Initial idea about Virtuozzo
– “virtual environments” – groups of processes
– a file system to share code / save RAM
– resource management / isolation
● 2000: 5 engineers, public testing, 5000 VEs
● User Beancounters: per-group limits
● Al Viro: [mount] namespace
3. 2001-2005: stone age
● 2001: Virtuozzo for … Windows (zOMG!!11one)
● 2001: Linux-Vserver (Jacques Gélinas, Herbert Pötzl)
● 2002: First Virtuozzo release (2.0!)
● 200?: Meiosys Metacluster, acq. by IBM in 2005
● 2004: First VZWin release
● 2004: CKRM, rsrc mgmt frmwrk frm IBM [FAIL]
● 2005: OpenVZ as open source Virtuozzo
5. 2006-2010: up the stream!
● Lots of new namespaces:
– network
– PID
– IPC
– User (only completed in 2013, Linux 3.9)
● 2006: live migration in OpenVZ
● 2007: cgroups framework from Google [PASS]
● 2008: LXC tool (a la vzctl)
6. 2010-2015: contemporaneity
● 2010: OpenVZ Vswap, 3rd gen resource mgmt
● 2010: ploop (CT in a file with bells and whistles)
● 2011: CRIU aka chkpnt/rstr in usrspc
● 2013: plenty of container projects:
– Docker, lmctfy, CoreOS
● 2014: CRIU for Docker & LXC
● 2015: OpenVZ re-born, new devel model, new kernel & tools
7. Future!
● Virtuozzo 7
● 4th gen of resource management: vcmmd
– More dynamic, with bursts, guarantees etc
● Proper port to POWER, ARM
● CRIU: p.haul, integration (http://criu.org/Integration)
● MetaPC? Mosaic?
Editor's Notes
I like that this is a nested talk; it's like a novel within a novel, or a story within a story. I don't like that it's only 15 minutes; I've got so much to tell you!
Disclaimer: I work for Odin (ex-Parallels, ex-SWsoft), so my POV is skewed.
In 1999 our chief scientist, Alexander Tormasov, a professor from MIPT (~ the Russian MIT), proposed a new direction to senior management: lightweight partitioning. He was inspired by IBM mainframe partitioning. The idea was to have multiple “virtual environments”, isolated groups of processes, each acting as a standalone Linux machine (except for the kernel, which is shared). The second idea was a file system that shares code (binaries and libraries) between environments, saving RAM and making density even higher. The third cornerstone was resource isolation.
In February 2000 they got an office at MIPT with 3 engineers, a sysadmin, and a manager/engineer; later, two more people joined to work on the web management tools. Initial public testing during a hot summer, with 5000 VEs, revealed a problem with resource isolation. A mathematician from MSU (~ the Russian Stanford) was hired, and he wrote User Beancounters (with Alan Cox; the luid idea came from HP-UX). WARNING: PhD in economics!
Also in 2000, Al Viro wrote the first namespace for the Linux kernel: the mount namespace. It's like chroot() but with bells and whistles. The kernel API is the clone() call with the CLONE_NEWNS flag.
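The mount-namespace API mentioned above can be tried from userspace. What follows is a minimal Python sketch, not code from the talk, and it makes a few assumptions: it swaps in unshare(2), which accepts the same CLONE_NEWNS flag as clone(2) but is easier to call through ctypes; it assumes Linux with a C library that exports unshare(); and the helper name try_new_mount_namespace is hypothetical. Without CAP_SYS_ADMIN (for example, as an ordinary user), the call is expected to fail with EPERM.

```python
import ctypes
import errno
import os

# Flag value from <linux/sched.h>; clone(2) and unshare(2) both accept it.
CLONE_NEWNS = 0x00020000

# Assumption: Linux, where CDLL(None) exposes the C library already
# linked into the Python process, including unshare().
_libc = ctypes.CDLL(None, use_errno=True)


def try_new_mount_namespace() -> int:
    """Hypothetical helper: fork a child that enters a new mount namespace.

    Returns 0 if the child's unshare(CLONE_NEWNS) call succeeded, otherwise
    the errno value it failed with (typically EPERM without CAP_SYS_ADMIN).
    """
    pid = os.fork()
    if pid == 0:
        # Child: stop sharing the parent's mount table. The new namespace
        # starts as a copy of the old one; later mount()/umount() calls made
        # here are not seen by the parent unless shared-subtree propagation
        # carries them across (unshare(1) remounts / as private to prevent
        # that; this sketch skips that step).
        rc = _libc.unshare(CLONE_NEWNS)
        os._exit(0 if rc == 0 else ctypes.get_errno())
    # Parent: its own mount namespace is untouched; collect the child's result.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    rc = try_new_mount_namespace()
    if rc == 0:
        print("child entered a new mount namespace")
    else:
        print("unshare(CLONE_NEWNS) failed:", errno.errorcode.get(rc, rc))
```

Forking first keeps the experiment contained: only the short-lived child changes namespaces, so the calling process keeps the host's mount table. Run as root on a typical system, the child should succeed; run as an ordinary user, or inside a container whose seccomp policy blocks unshare(), it should report EPERM.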
VZWin: a really crazy idea. With no access to the source code, it took a lot of reverse engineering, and it was implemented by live kernel patching. Someone at MS called it “the most advanced software ever written for Windows”.
Linux-Vserver is another pioneering project; unfortunately, they don't want to contribute anything to the upstream kernel.
Meiosys Metacluster was another implementation of Linux containers, specifically targeted at live migration. I am not sure about the years, but it was somewhere between 2000 and 2005, and then the company was acquired by IBM.
CKRM demonstrates the phenomenon that all the vowels can be removed from a sentence without any harm to its meaning. It also demonstrates that the way IBM worked with Linux was broken (more on that later).
OpenVZ: well, this is what I have been working on for the last 10 years of my life. I won't talk much about it today, I promise! :)
This period was characterized by lots of container-related patches contributed to the Linux kernel: the upstreaming age. Our company is a few hundred people, and our kernel team is only about 10 people, give or take, so I am very proud that this upstreaming effort put us among the top 10 companies contributing to the Linux kernel. Well, at the bottom of that top 10, that is. The other companies on that list are way bigger.
Now, upstreaming is probably as hard for developers as it is for salmon during their run: they die exhausted, they get eaten by grizzly bears, and so on. On the right you can see a salmon, err, a developer, and on the left is a bear, err, a Linux kernel subsystem maintainer.
As a result of the OpenVZ upstreaming efforts, a few more namespaces appeared in the Linux kernel, the most notable being netns and pidns. Netns was written from scratch by OpenVZ kernel developers, drawing on their experience with the OpenVZ kernel. For pidns there were two implementations, one from IBM and one from us; ours won because it had zero overhead at the first level of nesting.
The user namespace was all IBM's work; it was initially merged in 2.6.23 (2007) but was only completed (became usable) in Linux 3.9 (2013).
We failed to upstream our User Beancounters, but Google contributed the cgroups framework (an adaptation of the cpusets feature from BULL/Silicon Graphics).
As these features became available in the kernel, userspace tools emerged. LXC is one such tool, from IBM.
Yes, I used a dictionary to come up with this title...
This slide looks like an attempt to fit about two thirds of my talk tomorrow into a single slide. It won't fit, so I will just give a very brief overview.
VSwap is the third generation of our approach to per-container resource management, after 10 years of experience. The first generation worked fine but was too complicated to configure, the second one didn't work, and this one works and is easy to configure!
Ploop is a container-in-a-file technology, à la QCOW or the Linux kernel loop device. It comes with a few extra features for CTs, too.
CRIU is our best open source project to date. It is our approach to upstreaming container checkpoint/restore and live migration: we have in-kernel checkpoint/restore (CPT/RST), but we failed to get it merged, so CRIU does the job in userspace instead.
Virtuozzo 7 is a reboot of OpenVZ. Ten years ago we made the mistake of not having our development process open enough; this time we are trying to fix that. This April we opened the git repo for our next kernel, and just this Monday we opened our toolchain. We also moved all of our discussions to the public mailing list, and we follow the git fork-branch-pull-request development model for our tools.
The other thing is next-generation resource management (vcmmd). It is more dynamic, with a userspace daemon that would allow bursts, guarantees, and in general more elastic limits.
We will probably be working on proper ARM and POWER ports (the improper ones I did years ago, just to demonstrate that container technology is arch-agnostic). The only arch-dependent feature is CPT/RST, as developing it requires deep knowledge of the architecture. CRIU is already ported to ARM.
Finally, MetaPC is something we're thinking about: a way to combine many servers into a single big virtual one. This is anti-partitioning, and it would work with the help of CRIU.