Andrey Vagin <avagin@openvz.org>
1 June 2013, Moscow
Linux Containers
Fedora Virtualization Day
Different types of Virtualization
● Virtual Machines
– Emulation (qemu)
– Paravirtualization (XEN)
– Hardware Virtualization (KVM, ESX)
● OS Level Virtualization
– Containers (Linux Containers, Solaris Zones, BSD Jails)
Virtual Machine (VM)
[Diagram: one set of hardware runs a hypervisor; on top of it, four guests each have their own virtual HW, kernel, and apps]
Containers (CT)
[Diagram: one set of hardware runs a single host kernel; on top of it, four containers each have their own namespaces and apps]
– chroot() on steroids
Comparison: VMs vs. CTs
● VM: one real HW, many virtual HW, many OSes
● CT: one real HW, one kernel, many userspace instances
● VM: full control over the guest OS
● CT: native performance, [almost] no overhead
● CT: high density
● VM: KSM (Kernel SamePage Merging)
● CT: use resources on demand
● CT: dynamic resource allocation
● CT: naturally share pages
● VM: depends on hardware (VT-x, VT-d, EPT, etc.)
● CT: not all functionality is virtualized
● Flexibility
Evolution of the Operating System
● Multitask: many processes
● Multiuser: many users
● Multicontainer: many containers
Containers (CT)
Cgroups
– control resources
● cpu, cpuacct, cpuset
● blkio
● memory
● net_cls
Namespaces
– isolate environments
● MNT
● PID
● NET
● IPC
● User
● UTS
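To make the namespace list above concrete, here is a minimal sketch of how a process's namespaces can be inspected from userspace. It assumes a Linux kernel that exposes `/proc/<pid>/ns/` symlinks with targets of the form `pid:[4026531836]`; the helper names `parse_ns_link` and `namespaces_of` are made up for this illustration.

```python
import os
import re

# Assumed link-target format, e.g. "net:[4026531956]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (kind, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry of /proc/<pid>/ns to the inode of its namespace."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name:20s} {inode}")
```

Two processes that report the same inode for, say, `net` share that namespace; a containerized process would normally show different inodes from the host's init process.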
How to run a CT
All allowed by default
● unshare, nsenter
● Systemd Lightweight Containers
● LXC
● Libvirt LXC
All restricted by default
● OpenVZ (vzctl-core) (FC19)
vzctl - perform various operations on a container
# yum install -y vzctl-core
# vzctl create 101 --ostemplate fedora-15
# vzctl start 101
# vzctl exec 101 ps axf
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 init
11830 ? Ss 0:00 syslogd -m 0
11897 ? Ss 0:00 /usr/sbin/sshd
11943 ? Ss 0:00 xinetd -stayalive -pidfile ...
12218 ? Ss 0:00 sendmail: accepting connections
12265 ? Ss 0:00 sendmail: Queue runner@01:00:00
13362 ? Ss 0:00 /usr/sbin/httpd
13363 ? S 0:00  \_ /usr/sbin/httpd
..............................................
6416 ? Rs 0:00 ps axf
# vzctl stop 101
# vzctl destroy 101
Features available only in the OpenVZ kernel
● Ploop (snapshots, backups, different formats)
● Second-level quota
● More functional memory accounting
● PFCache (memory deduplication, fewer I/O operations)
● Better isolation compared with FC19 (which lacks userns)
Questions?
http://openvz.org
Andrey Vagin <avagin@openvz.org>
CRIU - Checkpoint/Restore in User-space
What is C/R and how can it be used?
C/R is the ability to save the state of processes and to restore it later.
Usage scenarios:
– Failure recovery
– Live migration
– Reboot-less upgrade
– Speeding up slow-booting services
– HPC issues
History
● Berkeley Lab Checkpoint/Restart (BLCR) (2003)
– Load a kernel module and link with a library
● DMTCP: Distributed MultiThreaded CheckPointing (2004-2006)
– Preload a library
● OpenVZ (2005)
– OpenVZ kernel
● Linux Checkpoint/Restart by Oren Laadan (2008)
– A non-mainline kernel
● CRIU (2011)
[Timeline: BLCR 2003, OpenVZ 2005, DMTCP 2007, Linux C/R 2008, CRIU 2011]
How does this work?
[Diagram: kernel objects and the process tree (name-spaces, files, sockets, pipes), crtools, image files]
Kernel interfaces
[Diagram: kernel interfaces used for dump and restore: syscalls, netlink, /proc/, ptrace]
Dump
● Parasite code
– Receive file descriptors
– Dump memory content
– prctl(), sigaction, pending signals, timers, etc.
● Ptrace
– Freeze processes
– Inject the parasite code
● Netlink
– Get information about sockets, netns
● Procfs
– /proc/PID/maps, /proc/PID/map_files/, /proc/PID/status, /proc/PID/mountinfo
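As an illustration of the procfs part of a dump, the sketch below parses lines in the `/proc/PID/maps` format into structured records. It is not how crtools itself is written; the `Mapping` type and `parse_maps_line` helper are names invented for this example, and only the standard field layout of `maps` (address range, permissions, offset, device, inode, optional path) is assumed.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int    # first address of the mapping
    end: int      # first address past the mapping
    perms: str    # e.g. "r-xp"
    offset: int   # offset into the backing file
    dev: str      # "major:minor" of the backing device
    inode: int    # 0 for anonymous mappings
    path: str     # empty for anonymous mappings


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps."""
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    # The path is optional and may itself contain spaces.
    path = fields[5].rstrip("\n") if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


def read_maps(pid: str = "self") -> list[Mapping]:
    """Read and parse all mappings of a process."""
    with open(f"/proc/{pid}/maps") as maps:
        return [parse_maps_line(line) for line in maps]


if __name__ == "__main__":
    for mapping in read_maps():
        size_kib = (mapping.end - mapping.start) // 1024
        print(f"{mapping.start:016x} {mapping.perms} {size_kib:8d} KiB {mapping.path}")
```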
Restore
● Collect shared objects
● Restore name-spaces
● Create a process tree
– Restore SID, PGID
– Restore objects that should be inherited
● Files, sockets, pipes, ...
● Restore per-task properties
● Restore memory
● Call sigreturn
● Awesome
Interesting details
● How to restore shared objects?
– Send file descriptors via unix sockets
– Map files from /proc/self/map_files/ to restore anonymous shared mappings
● How to restore memory mappings at the correct addresses?
– Map a new code block and a stack
– Unmap crtools' mappings
– Remap the task's mappings to the correct addresses
● How to resume a process?
– Create a signal frame
– Call sigreturn()
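The "send file descriptors via unix sockets" step can be sketched in a few lines. This is only a demonstration of the mechanism (SCM_RIGHTS passing), not crtools code: it uses Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only) inside a single process, and the function name `pass_fd_roundtrip` is made up for the example.

```python
import os
import socket


def pass_fd_roundtrip(payload: bytes) -> bytes:
    """Send the write end of a pipe over a unix socket, write through the
    received descriptor, and read the data back from the pipe."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # The descriptor travels as ancillary data next to a tiny message.
        socket.send_fds(left, [b"fd"], [write_end])
        _msg, fds, _flags, _addr = socket.recv_fds(right, 16, 1)
        received = fds[0]  # a new descriptor for the same open pipe
        try:
            os.write(received, payload)
        finally:
            os.close(received)
        return os.read(read_end, len(payload))
    finally:
        os.close(read_end)
        os.close(write_end)
        left.close()
        right.close()


if __name__ == "__main__":
    print(pass_fd_roundtrip(b"shared object"))
```

In a real restore the sender and receiver would be different processes in the restored tree; the receiver ends up with a descriptor that refers to the very same open file, which is what makes the object shared again.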
Kernel impact
● ~140 patches merged
● ~10 patches in flight
● ~11 new features appeared
● ~2 new features to come
New features in the kernel
● Parasite code injection (by Tejun Heo)
– Read task state that a task can currently retrieve only about itself
● The kcmp() system call
– Helps check which kernel objects are shared between processes
● Proc map_files directory
– Find out exactly which file is mapped
– Mapping sharing info
● A bunch of prctl extensions
– Set various private fields on task/mm objects (C/R-only feature)
● Last-pid sysctl
– Restore a task with the desired PID value
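A rough sketch of how the last-pid sysctl can be used follows. The slide does not name the file; the path `/proc/sys/kernel/ns_last_pid` is the location used by mainline kernels, and the helper names are hypothetical. Writing the file needs sufficient privileges, the desired PID must be free, and the lock only protects against other cooperating writers, so treat this as an outline rather than a robust implementation.

```python
import fcntl
import os

# Location of the sysctl on mainline kernels (not stated on the slide).
NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def last_pid_value_for(desired_pid: int) -> str:
    """The kernel hands out last_pid + 1 next, so write desired_pid - 1."""
    if desired_pid < 2:
        raise ValueError("desired_pid must be at least 2")
    return str(desired_pid - 1)


def fork_with_pid(desired_pid: int) -> int:
    """Try to fork a child that gets desired_pid; returns the actual PID.

    Requires privileges to write the sysctl. The child exits immediately
    here; a real restorer would continue rebuilding the task instead.
    """
    with open(NS_LAST_PID, "r+") as ctl:
        fcntl.flock(ctl, fcntl.LOCK_EX)  # keep other cooperating writers out
        ctl.write(last_pid_value_for(desired_pid))
        ctl.flush()
        pid = os.fork()
        if pid == 0:
            os._exit(0)
    os.waitpid(pid, 0)
    return pid
```

A caller would compare the returned PID with the requested one and treat a mismatch as a failed restore of that task.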
New features in the kernel (continued)
● TCP repair mode
– Reads the internal state of a TCP connection and reconstructs it from scratch on a freshly created socket
● Socket information dumping via netlink (sock_diag)
– Extensible engine for retrieving socket state
● Virtual net device indexes
– Allow restoring network devices in a namespace
● Socket peeking offset
– Allows peeking at socket queues (reading without removing data from the queue)
● Task memory tracking
– Incremental snapshots, online migration
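Reading a socket queue without removing data can be shown with the long-standing `MSG_PEEK` flag. This sketch covers only plain peeking from the start of the queue; the peeking-offset feature named above, which lets successive peeks advance through the queue, is not exercised here. The function name `peek_then_read` is invented for the example.

```python
import socket


def peek_then_read(payload: bytes) -> tuple[bytes, bytes]:
    """Queue payload on a unix socket, peek at it, then really read it."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        # MSG_PEEK copies data out but leaves it in the receive queue.
        peeked = receiver.recv(len(payload), socket.MSG_PEEK)
        # A normal recv() then still sees the same bytes.
        consumed = receiver.recv(len(payload))
        return peeked, consumed
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(peek_then_read(b"in-flight data"))
```

For checkpointing, this matters because the dumped process must find its queued data intact if it is resumed rather than killed after the dump.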
What is already supported?
– X86_64 architecture
– Process tree linkage
– Multi-threaded apps
– All kinds of memory mappings
– Terminals, groups, sessions
– Open files (shared and unlinked)
– Established TCP connections
– Unix sockets, Packet sockets
– Name-spaces (net, mount, ipc)
– Non-posix files (epoll, inotify)
– Pipes, Fifo-s, IPC, ...
– ARM architecture
– Pending signals
– TCP time-stamps
– Iterative snapshots
– VDSO
– LXC and OpenVZ containers
In flight
– Posix timers
– Convert OpenVZ images
How is CRIU tested?
● ZDTM – a set of unit-tests
● Real-life applications
– Apache, Nginx
– MySQL, MongoDB, Oracle
– Make && gcc
– Tar & gzip
– Screen
– Java
– LXC
– VNC server + GUI applications
Future plans (Feb, 2013)
● Support all kinds of kernel objects
● Merge all in-flight patches into the mainline kernel
● Integrate CRIU with OpenVZ and LXC utilities
● Iterative migration
– Migrate memory content before freezing applications
● Integration in distributions
– CRIU was accepted into Fedora 19
How to use
● ./crtools dump -t pid [<options>]
– checkpoint a process/tree identified by pid
● ./crtools restore -t pid [<options>]
– restore a process/tree identified by pid
● ./crtools show (-D dir)|(-f file) [<options>]
– show the contents of dump file(s)
● ./crtools check
– check whether the kernel support is up to date
● ./crtools exec -t pid <syscall-string>
– execute a system call in another task
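For scripting a checkpoint and a later restore, the command lines above can be assembled programmatically. The sketch below only builds argument lists from the subcommands and the `-t pid` form shown on the slide. Passing `-D dir` to `dump` and `restore` is an assumption here (the slide shows `-D` only for `show`), and `crtools_argv` is a hypothetical helper name.

```python
import subprocess


def crtools_argv(action: str, pid: int, images_dir: str,
                 binary: str = "./crtools") -> list[str]:
    """Build a crtools command line for 'dump' or 'restore'.

    Assumption: -D selects the image directory for these actions too.
    """
    if action not in ("dump", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    return [binary, action, "-t", str(pid), "-D", images_dir]


def run_crtools(action: str, pid: int, images_dir: str) -> None:
    """Run crtools and raise CalledProcessError if it fails."""
    subprocess.run(crtools_argv(action, pid, images_dir), check=True)


if __name__ == "__main__":
    # Printed rather than executed: running needs crtools and privileges.
    print(" ".join(crtools_argv("dump", 1234, "/tmp/ct-images")))
    print(" ".join(crtools_argv("restore", 1234, "/tmp/ct-images")))
```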
Checkpoint/restore of a VNC server.
Questions?
http://criu.org

2. Vagin. Linux containers. June 01, 2013
