Virtualsquare: all the virtuality you always wanted but never dared to ask for
1. VIRTUALSQUARE
all the virtuality you wanted
but you were afraid to ask
Rome, March 5th 2011
Renzo Davoli
Università di Bologna
(Master in Scienze e Tecnologie del Software Libero)
(Associazione per il Software Libero)
Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
2. Virtual...
● Time
● Execution Environment
● user-id
● Device
● Machine
● Networking
● Memory
● File System
5. User of A
● A well known service A is reached through the interface I(A); a new service A' offers the interface I(A').
● If I(A) = I(A'), A' can be used instead of A.
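The substitution rule on this slide can be sketched in C with a table of function pointers standing in for I(A). Everything named here (`store_ops`, `real_store`, `virtual_store`, `user_of_a`) is a hypothetical illustration, not part of any Virtualsquare project.

```c
#include <stddef.h>

#define SLOTS 8

/* I(A): the only thing the user of A is written against (hypothetical). */
struct store_ops {
    int (*put)(void *self, size_t slot, char value);
    int (*get)(void *self, size_t slot, char *value);
};

/* A: the well known service, a plain array of cells. */
struct real_store { char cells[SLOTS]; };

static int real_put(void *self, size_t slot, char value)
{
    struct real_store *s = self;
    if (slot >= SLOTS) return -1;
    s->cells[slot] = value;
    return 0;
}

static int real_get(void *self, size_t slot, char *value)
{
    struct real_store *s = self;
    if (slot >= SLOTS) return -1;
    *value = s->cells[slot];
    return 0;
}

const struct store_ops real_ops = { real_put, real_get };

/* A': a new service with the same interface. Internally it keeps the
 * cells in reverse order, which the user cannot observe through I(A). */
struct virtual_store { char cells[SLOTS]; };

static int virt_put(void *self, size_t slot, char value)
{
    struct virtual_store *s = self;
    if (slot >= SLOTS) return -1;
    s->cells[SLOTS - 1 - slot] = value;
    return 0;
}

static int virt_get(void *self, size_t slot, char *value)
{
    struct virtual_store *s = self;
    if (slot >= SLOTS) return -1;
    *value = s->cells[SLOTS - 1 - slot];
    return 0;
}

const struct store_ops virt_ops = { virt_put, virt_get };

/* The user of A: writes a value and reads it back through the interface. */
int user_of_a(const struct store_ops *ops, void *service)
{
    char c = 0;
    if (ops->put(service, 2, 'x') != 0) return -1;
    if (ops->get(service, 2, &c) != 0) return -1;
    return c;
}
```

Calling `user_of_a(&real_ops, &a)` and `user_of_a(&virt_ops, &a_prime)` gives the same result even though the two services store the data differently.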
6. What is “real”
● The service A can be hardware
– Machine
– Memory
– Network
● The service A can be software
– File system
– Execution Environment
– Identity
7. Why virtual?
● Flexibility
– prototyping
– modify features at run-time (“real” may be hard to
modify: e.g. hardware or kernel code).
– Satisfy several requirements while sharing common
structures
● Safety
– least privilege
– sandboxing
● Optimization
– Server/service consolidation
– No need to maintain several “real” items
8. Simulation-Emulation-Virtuality
● Simulation:
– Provide just a model of the phenomena to
study
– Simulation never provides virtualization
● Emulation:
– It means: behave in the same way.
– Can provide virtualization, as long as the result is usable in practice
● Virtuality can be provided without emulation
– e.g. LXC
14. Virtual Machines
● KVM: system-vm, same-instruction set, may provide
paravirtualization (virtio/vhost-net), user-mode monitor,
requires processor extensions and kernel module
(Linux specific/optimized).
● VirtualBox (OSE): system-vm, same-instruction set, may provide
paravirtualization, user-mode monitor, requires processor
extensions and kernel module (it runs on several
Operating Systems).
● Xen: system-vm, same-instruction set, provides
paravirtualization, native mode, multiplexing.
15. operating system level VM
● LXC: Linux Containers/Namespaces: native mode,
multiplexing, user-mode superuser-access (it provides
partial virtualization/sharing between containers).
● OpenVZ: native mode, multiplexing, superuser-access.
16. process VM
● User Mode Linux: a Linux kernel is the VM monitor
for processes. Process VM. User-mode, user-access.
(Almost) the same interface: the system call set,
although the kernel version may differ.
● View-OS (umview/kmview): partial-modular
virtualization. Process VM. User-mode, user-access.
(Almost) the same interface (users may define new
system calls).
17. process VM
● Qemu: process emulation (dynamic translation),
user-mode, user-access.
● Application VM (JVM, Mono, ...)
18. Network Virtualization
● Virtual Private Networks
– For secure remote access
● Overlay Networks
– e.g. Akamai, p2p
● Networks for Virtual Machines
● Kernel bridge based virtual networks
● Virtual Distributed Ethernet: data-link
layer Ethernet consistent, user-mode,
user access.
20. Modular Virtualization
● View-OS: modular partial virtual machine
– Umview: based on ptrace (user-mode, user access)
– Kmview: based on utrace/kernel module (user-mode)
● Several modules available:
– File system (mount)
– File system (patchworking)
– Device
– Uname/time...
– Networking
● Chroot/fakeroot/fuse/vpn/binfmt... features have been
implemented on View-OS.
22. Virtualsquare
● Virtualsquare is:
– A community....
– A container of projects....
● Virtualsquare is not:
– A company
– A brand/product line
● Virtualsquare started at the University of Bologna but now
it is an international community
– A lot of former students now work abroad
– Common ideas with other groups (joint projects)
23. the VirtualSquare view
● Communication/compatibility
– different Virtualities must be interconnected, must
communicate
● Integration
– different Virtualities can be seen as special cases
of a broader idea of Virtuality
● Extension
– if a need cannot yet be captured by a kind of
virtuality, let us create a new one (maybe
combining existing virtualities).
24. Communication compatibility
● VDE
– General purpose user-mode networking support
– Ethernet data-link consistent
– Distributed
– Intuitive (it has the same structure of real Ethernet:
switches, cables)
(Diagram labels: KVM, bochs, VirtualBOX, tuntap, User-mode linux, libvirt, View-OS, LWIPv6)
25. Integration
● View-OS=each process can have its “view” of the
environment.
● User-mode/user-permission partial virtual machine approach
● The features of several existing tools have been re-
implemented as composable modules.
(Diagram labels around View-OS: VPN, fakeroot(ng), (s)chroot, binutils, Virtual networking, FUSE, UnionFS, Virtual devices, LXC user-mode)
27. Extension
● A new forge for several new concepts and ideas/tools:
– Msocket: support for multi-stack applications
– LWIPv6: user-space LWIPv4/LWIPv6 hybrid
networking (multi)stack as a library
– Purelibc: process self virtualization
– Relativistic virtualization of time: emulation of
fast machines on slow ones.
– Virtual spaces per login shells.
– Public Distributed Ethernets
– Run-time on-the-fly virtualization
29. VirtualSquare: the book
Renzo Davoli, Michael Goldweber
(editors)
Virtual Square: Users,
Programmers & Developers Guide
● Available at lulu books or
downloadable from
wiki.virtualsquare.org
● Warning: this book is dynamically
changing as the project evolves
30. ...a closer look
on virtualsquare projects...
31. VDE components
(Diagram: in a real network, two switches are joined by a cross cable. In VDE, two VDE switches are joined by a cable made of a VDE plug, a VdeWire (e.g. ssh) and a VDE plug. VMs (e.g. QEMU, U-ML) and the TunTap Linux module attach to the VDE switches.)
32. VDE: Related Work
● VPN: (OpenVPN) point2point, for real machines
● Overlay Networks: specific for application (peer
to peer, Akamai).
● VM networking: (tools provided with VM, e.g.
uml-switch) specific for VM
VDE:
● multipoint, general mesh
● no need for root (administration) access
● heterogeneous VM and non VM connected
33. VDEv2: advertisement
● VDEv2:
– modular design
– compatible with user-mode linux, qemu, tuntap,
(bochs, plex86), umview/lwipv6
– through the vdetaplib, potentially compatible with any
application using tap
– VLAN (802.1Q)
– FST (fast spanning tree)
– run-time manageable via unixterm (telnet or
web with vdetelweb)
– includes slirpvde and wirefilter
– status debug
– plugin support: snmp/iplog/pdump
34. VDEv2
● VDE-Switch
– number of ports configurable on command
line
– port0 is reserved for management clients, n-1
ports are available for connections.
– management UNIX socket for management
clients
● self-describing SMTP-like protocol
– modules: datasock (VM conn), tuntap,
consmgmt (management)
35. VDE cables
● VDE-plug
– is a VM that converts the Ethernet packets of
a VDE port into a stream connection (stdin-
stdout)
● VDE-wire
– can be any application able to give a
stdin/stdout stream connection
36. Dual Pipe
● dpipe is a new (general purpose)
command we have added.
● Pipes are well known abstractions. The
following command prints the listing of the
current directory:
ls | lpr
● Dpipe creates a bi-directional connection
between the processes:
dpipe cmd1 = cmd2
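A minimal sketch of the idea behind `dpipe cmd1 = cmd2`, assuming two ordinary pipes, one per direction. The function `dual_pipe` and its helper are hypothetical names written for this illustration; this is not the dpipe source code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: wire stdin/stdout to the given pipe ends, then exec. */
static void run_end(int in_fd, int out_fd, const int fds[4], char *const argv[])
{
    dup2(in_fd, STDIN_FILENO);
    dup2(out_fd, STDOUT_FILENO);
    for (int i = 0; i < 4; i++)
        if (fds[i] > STDERR_FILENO)   /* keep the freshly wired 0 and 1 */
            close(fds[i]);
    execvp(argv[0], argv);
    _exit(127);                       /* exec failed */
}

/* Run cmd1 and cmd2 so that the stdout of each feeds the stdin of the
 * other. Returns 0 if both exit with status 0, 1 if either fails,
 * and -1 if the pipes or processes could not be created. */
int dual_pipe(char *const cmd1[], char *const cmd2[])
{
    int fds[4];   /* fds[0..1]: cmd1 -> cmd2, fds[2..3]: cmd2 -> cmd1 */
    int st1 = -1, st2 = -1;

    if (pipe(&fds[0]) < 0)
        return -1;
    if (pipe(&fds[2]) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t p1 = fork();
    if (p1 == 0)
        run_end(fds[2], fds[1], fds, cmd1);
    pid_t p2 = (p1 > 0) ? fork() : -1;
    if (p2 == 0)
        run_end(fds[0], fds[3], fds, cmd2);

    /* The parent keeps no pipe end open, so each child sees EOF
     * as soon as its peer exits. */
    for (int i = 0; i < 4; i++)
        close(fds[i]);
    if (p1 > 0) waitpid(p1, &st1, 0);
    if (p2 > 0) waitpid(p2, &st2, 0);
    if (p1 < 0 || p2 < 0)
        return -1;
    return (WIFEXITED(st1) && WEXITSTATUS(st1) == 0 &&
            WIFEXITED(st2) && WEXITSTATUS(st2) == 0) ? 0 : 1;
}
```

Unlike a shell pipeline, neither side is "first": both commands start with their stdin connected to the other's stdout.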
37. VDE cables, plugs, wires/dpipe
● dpipe is used to create VDE-cables:
dpipe vde_plug = ssh vde.students.cs.unibo.it vde_plug
● this command connects by a dpipe the local
vde_plug with a vde_plug running on a remote host
(the wire is ssh)
● other applications can be used as wire (e.g. netcat)
● In the example vde_plug refers to the default
switch. It is possible to run several switches on the
same host; an extra option is needed in this case.
38. wirefilter
● wirefilter can be put on a cable (e.g. for network
testing)
dpipe vde_plug /tmp/s1 = wirefilter -m /tmp/m = vde_plug /tmp/s2
wirefilter -v /tmp/s1:/tmp/s2
● packet loss, delays, dup, speed, noise figures,
mtu, fifoness properties of the line can be
changed with command line options or real
time via a management socket.
● It is possible to define several “states”. State
transitions are driven by a Markov chain.
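How a Markov chain can drive the line "states" is sketched below. This is only an illustration of the technique: the transition-matrix layout, the per-state loss table and the function names are assumptions, not wirefilter's internals or its option syntax.

```c
#include <stddef.h>

/* Pick the next state of a Markov chain.
 * trans is an n x n row-major matrix: trans[cur * n + j] is the
 * probability of moving from state cur to state j.
 * u is a uniform random number in [0, 1) supplied by the caller. */
size_t markov_next(const double *trans, size_t n, size_t cur, double u)
{
    double acc = 0.0;
    for (size_t j = 0; j < n; j++) {
        acc += trans[cur * n + j];
        if (u < acc)
            return j;
    }
    return n - 1;   /* guard against a row summing to slightly under 1 */
}

/* Hypothetical per-state line property: packet loss probability. */
int packet_is_lost(const double *loss, size_t state, double u)
{
    return u < loss[state];
}
```

With two states, for example a "good" state with no loss and a "bad" state with 50% loss, the chain produces bursty losses instead of independent ones.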
39. SlirpVDE
● Note: slirp supports IPv4; slirpv6 supports both IPv4 and IPv6.
(Diagram: two VDE switches joined by a cable made of a VDE plug, a VdeWire (e.g. ssh) and a VDE plug. A VM (e.g. QEMU) and a VM (e.g. U-ML) are attached, with addresses 10.0.2.15 and 10.0.2.16; SlirpVDE is attached at 10.0.2.2. Caption, truncated in the source: "Firefox http connection from slirpVDE running on the hosting O.S. to ...")
40. vde_cryptcab
● Coded by Daniele Lacamera (danielinux)
● A vde_cryptcab is a distributed cable
manager for VDE switches.
● Server side
vde_cryptcab -s /tmp/vde2.ctl -p 2100
● Client side
vde_cryptcab -s /tmp/vde2.ctl -c foo@remote.machine.org:2100
● Uses a Blowfish channel (random key exchanged
by scp).
41. Marionnet (based on VDE)
● A project by Jean-Vincent Loddo and Luca
Saiu (et al) Université Paris 13.
42. TINC
● tinc is a Virtual Private Network (VPN)
daemon that uses tunnelling and
encryption to create a secure private
network between hosts on the Internet.
● Encryption, authentication and compression
● Automatic full mesh routing
● Easily expand your VPN
● Ability to bridge ethernet segments
● Runs on many operating systems and supports IPv6
● A project by Ivo Timmermans and Guus Sliepen
43. LWIPv6
● It is an LWIPv4/v6 (multi) stack implemented as a library.
● Forked from the LWIP project (Adam Dunkels <adam@sics.se>)
● Can be connected to any number of VDE, TUN, TAP
interfaces.
● It is a hybrid stack (not a dual-stack). One single IPv6
“engine” is also able to manage IPv4 packets in
compatibility mode
(130.136.1.110 is managed as 0::ffff:130.136.1.110).
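The compatibility-mode mapping above is the standard IPv4-mapped IPv6 form. A short sketch using only the POSIX socket headers follows; `ipv4_to_mapped` is a hypothetical helper written for this illustration, not an LWIPv6 function.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Convert dotted-quad IPv4 text into the IPv4-mapped IPv6 address
 * ::ffff:a.b.c.d. Returns 0 on success, -1 if the text is not a
 * valid IPv4 address. */
int ipv4_to_mapped(const char *dotted, struct in6_addr *out)
{
    struct in_addr v4;

    if (inet_pton(AF_INET, dotted, &v4) != 1)
        return -1;
    memset(out, 0, sizeof(*out));          /* first 80 bits are zero  */
    out->s6_addr[10] = 0xff;               /* then 16 bits set to one */
    out->s6_addr[11] = 0xff;
    memcpy(&out->s6_addr[12], &v4, sizeof(v4));  /* last 32 bits: IPv4 */
    return 0;
}
```

A single IPv6 engine can then tell the two kinds of traffic apart with the standard `IN6_IS_ADDR_V4MAPPED` macro.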
44. LWIPv6
● PF_INET, PF_INET6
● PF_PACKET for raw packet management
– support for user-level network analysis tools (e.g.
sniffers, ethereal)
– support for user-level dhcp clients.
● PF_NETLINK for configuration
● Packet filtering
● NEW: dhcp client/server, rarpd, slirp, routing, nat
on request
46. LWIPv6 socket API (just add lwip_ prefix)
int lwip_msocket(struct stack *stack, int domain, int type, int protocol);
int lwip_socket(int domain, int type, int protocol);
int lwip_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
int lwip_bind(int s, struct sockaddr *name, socklen_t namelen);
int lwip_shutdown(int s, int how);
int lwip_getpeername (int s, struct sockaddr *name, socklen_t *namelen);
int lwip_getsockname (int s, struct sockaddr *name, socklen_t *namelen);
int lwip_getsockopt (int s, int level, int optname, void *optval, socklen_t *optlen);
int lwip_setsockopt (int s, int level, int optname, const void *optval, socklen_t optlen);
int lwip_close(int s);
int lwip_connect(int s, struct sockaddr *name, socklen_t namelen);
int lwip_listen(int s, int backlog);
int lwip_recv(int s, void *mem, int len, unsigned int flags);
int lwip_read(int s, void *mem, int len);
int lwip_recvfrom(int s, void *mem, int len, unsigned int flags,
struct sockaddr *from, socklen_t *fromlen);
int lwip_send(int s, void *dataptr, int size, unsigned int flags);
int lwip_sendto(int s, void *dataptr, int size, unsigned int flags,
struct sockaddr *to, socklen_t tolen);
int lwip_write(int s, void *dataptr, int size);
int lwip_select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset,
struct timeval *timeout);
int lwip_ioctl(int s, long cmd, void *argp);
47. LWIPv6: New features
● Packet forwarding
● Filtering
● NAT
● DHCP server/RADV server onboard
● SLIRP (v4 and v6)
struct netif *lwip_add_slirpif(struct stack
*stack, void *arg, int flags);
48. slirpvde6
● Extension of slirpvde based on LWIPv6
– slirp ipv4/ipv6
– Stateless translator
– Dhcp/radv server
– DNS forwarder
– Port and X forwarding (in and out)
slirpvde6 -d -H10.0.2.1/24 -H2001::1/64 -s /tmp/vde.ctl -dhcp -r
49. Berkeley sockets API: problem #1
● The Berkeley Sockets API has been
designed for one protocol stack (per
protocol family).
– Multiple stacks => different networking
features (per user, per application...)
● Unix uses the file system as a naming
space for everything (devices, kernel
variables, ...) except for networking.
– Access control to networking
50. Solution #1: msockets
#include <msocket.h>
int msocket(char *path, int domain, int type, int protocol);
● Path is the pathname of the stack
● domain/type/protocol are the same defined in socket(2).
● A stack is a special file (new type of special file, see stat(2)):
#define S_IFSTACK 0160000
● Each process has a default stack for each protocol family (domain).
– If path==NULL, msocket uses the default stack.
● It is backwards compatible.
#define socket(d,t,p) msocket(NULL,(d),(t),(p))
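The backward-compatibility rule can be read in the other direction too: on a system with one stack and no msocket support, a NULL path is the only case that can be served. The shim below sketches that reading. `msocket_fallback` and the choice of `ENOTSUP` are assumptions made for this illustration; they are not part of msocket.h.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for msocket() on a single-stack system.
 * path == NULL selects the default stack, which is the only stack
 * an unmodified kernel offers; any named stack is refused. */
int msocket_fallback(const char *path, int domain, int type, int protocol)
{
    if (path == NULL)
        return socket(domain, type, protocol);
    errno = ENOTSUP;   /* assumption: no multi-stack support here */
    return -1;
}
```

Code written against the four-argument call therefore keeps working where only socket(2) exists, as long as it sticks to the default stack.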
51. Msockets: set the default stack
int msocket(char *path, int domain, int type, int protocol);
● if type==SOCK_DEFAULT msocket sets the default stack. e.g.
msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
defines /dev/net/ipstack2 as the default stack for IPv4
● if type==SOCK_DEFAULT && domain==PF_UNSPEC msocket sets the default
stack for all the protocol families.
52. Mstack: backward compatibility
● Mstack uses msocket: it defines the default stack so that existing
applications can use different stacks.
$ ip addr
..... ip addr on default net
$ mstack /dev/net/ipstack2 ip addr
.... ip addr of “ipstack2”
$ mstack /dev/net/newstack firefox
.... firefox works on newstack
$ mstack /dev/net/otherstack bash
$ ...this new bash works on otherstack
53. Msockets: implementation
● Msockets API is currently supported by lwipv6
and by view-os.
● It is a natural extension, backwards compatible
with Berkeley sockets.
● Many applications would benefit from this
extension (e.g. networkless user accounts).
● We are studying kernel support for msockets.
54. Berkeley sockets API: problem #2
● Berkeley Sockets API does provide support for IPC
(AF_UNIX)
● Berkeley Sockets API does not provide support for
multicast IPC
● Berkeley Sockets is mainly for point-to-point, client-server
communication
– IP multicast, Ethernet broadcast provided by “magic”
addresses.
● Many applications need multicast IPC (dbus, vde_switch,
midi-patchbay, mpeg-ts demultiplexing...)
55. IPN: Inter Process Networking
● IPN is for IPC (like AF_UNIX)
● IPN provides fast, kernel implemented,
multicast communication among
processes.
(Diagram, top: sender, dispatcher and receivers in an AF_UNIX based multicasting service (dbus, vde_switch, tee, ...). Bottom: sender, dispatcher and receivers on AF_IPN, with a policy submodule.)
56. IPN: implementation
● A new address family AF_IPN
● Policies can be provided as submodules.
– IPN_BROADCAST (default): each message is delivered
to all the members but the sender
– IPN_VDESWITCH a virtual ethernet switch
– IPN_MPEGTS mpeg transport stream demultiplexing
● Two services (sockopt selectable):
– LOSSLESS: bounded buffer approach, late receivers
delay senders
– LOSSY: late receivers lose data.
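The difference between the two services can be modelled with one bounded queue per receiver. The sketch below only models the policy decision in user space; the names and the queue capacity are hypothetical, and the kernel implementation of IPN is not shown in these slides.

```c
#include <stddef.h>

#define QCAP 4   /* arbitrary capacity chosen for the example */

enum delivery { DELIVERED, DROPPED, SENDER_MUST_WAIT };

/* Bounded per-receiver queue (ring buffer). */
struct rx_queue {
    int slots[QCAP];
    size_t head, count;
};

/* Offer a message to a receiver queue. When the queue is full:
 *   lossless != 0 -> the sender has to wait (late receivers delay senders)
 *   lossless == 0 -> the message is dropped (late receivers lose data) */
enum delivery rx_enqueue(struct rx_queue *q, int msg, int lossless)
{
    if (q->count == QCAP)
        return lossless ? SENDER_MUST_WAIT : DROPPED;
    q->slots[(q->head + q->count) % QCAP] = msg;
    q->count++;
    return DELIVERED;
}

/* Receiver side: take the oldest message; returns -1 if none is queued. */
int rx_dequeue(struct rx_queue *q, int *msg)
{
    if (q->count == 0)
        return -1;
    *msg = q->slots[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 0;
}
```

In the lossless case a real sender would block until a receiver dequeues; here that condition is reported to the caller instead.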
57. IPN: direct support for multicast
● BIND=define and get administration access
to the socket
– “x” permission required
● CONNECT=join the flow of data
– “r” and “w” mean permission to receive or send
struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST); /* or a different policy*/
err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
err=connect(s,NULL,0);
58. Why IPN instead of...
● AF_UNIX?
– Point-to-point, hub process needed, slow!
● IP_MULTICAST ttl=0?
– No access control, slow!
● AF_NETLINK?
– No access control, designed for
interface/filtering configuration.
59. IPN is fast (time for 1M msgs, 64B per msg)
60. IPN is fast (time for 1M msgs, 1024B per msg)
61. IPN is fast (time for 1M msgs, 16 receivers)
62. IPN communication models
● Code examples here:
http://wiki.virtualsquare.org/index.php/IPN_examples
● Peer-to-peer
– All the member processes are senders and
receivers (e.g. vde)
● Publish-subscribe
– A process broadcasts messages and client
processes can join the IPN socket
63. IPN extra features
● Out-Of-Band messages from core IPN and policy
submodules
– e.g. number of readers notification to stop
subscriberless services
● Networking interfaces TAP+GRAB
– TAP: a new virtual interface is defined and connected
to an IPN socket (in kernel-land)
– GRAB: an existing networking interface gets
connected to an IPN socket (in kernel-land)
● Char-device interface
– Define a character device connected to an IPN socket
64. VDETELWEB
● It is the Web/Telnet Server for VDE switch configuration.
● It uses the LWIPv6 library
● It has two connections to the controlled VDE switch:
– management socket to give commands
– port0: the ethernet port used by the TCP-IP stack.
● It reads the set of commands, descriptions, arguments
from the switch itself.
● Telnet has history/command editing and support for
asynch debug output (NEW)
67. View-OS
... a process with a view
Each process should be permitted to have its own
view of the execution environment
68. View Components
● filesystem namespace, including the
related ownership and permission
information,
● networking configuration,
● system name,
● current time,
● devices, etc.
● ...
69. Global View Assumption
● In general processes running on the same
computer share the same view.
– A given pathname refers to the same file
for all processes.
– All processes use one shared TCP-IP stack
for networking hence all processes share
the same set of IP addresses and routing
policies.
– All processes share the same notion as to
which users/processes have special
privileges.
70. View-OS vs. VMs and Containers
                        View-OS   VM              Container
Memory Impact           LOW       HIGH            LOW
Running State           User      User or Kernel  Kernel
Administered by         user      user            root
Partial Virtualization  Yes       No              Yes (sharing)
71. Partial Virtualization
● Virtualize just what you need:
– Virtual and real file systems, devices,
networks, etc. co-exist in the process'
view
● Support for nested virtualization:
– e.g. virtual file system defined on virtual
devices.
72. How to start a View-OS monitor:
user@host:~$ umview bash
This kernel supports: PTRACE_MULTI PTRACE_SYSVM ppoll
ViewOS will use: PTRACE_MULTI PTRACE_SYSVM ppoll
pure_libc library found: syscall tracing allowed
rd235 2.6.29utrace GNU/Linux/ViewOS 10585 0
user@host[10585:0]:~$
● Umview runs on vanilla Linux kernels, Kmview
requires a kernel module loaded (and utrace).
● Instead of bash one may run his/her favorite
executable (e.g. xterm, script....)
73. View-OS modules
● View-OS monitor loads only the
virtualities requested by the user:
– Umfuse: file system virtualization
– Umnet: networking virtualization
– Umdev: device virtualization
– Umbinfmt: executable interpreter
virtualization
– Viewfs: file system patchworking
– Ummisc: time, system id...
74. #1: Virtual Installation of Software
$ um_add_service viewfs
$ mkdir /tmp/newroot
$ viewsu
# mount -t viewfs -o mincow,except=/tmp,vstat /tmp/newroot /
# apt-get install mynewsoftware
● Create an empty dir
● Mount it in “minimal copy on write” mode:
– File modifications go to the real file system when allowed.
– Modifications are stored in the mounted dir otherwise.
– A single consistent view.
– Vstat: virtualize stat (support for virtual chown,
chmod/setuid, special files)
75. #2: Virtual Networking
$ um_add_service umnet
$ mount -t umnetlwipv6 none /dev/net/default
$ ip link set vd0 up
$ ip addr add 10.1.2.3/24 dev vd0
$ ip addr
1: lo0: <LOOPBACK,UP> mtu 0
link/loopback
inet6 ::1/128 scope host
inet 127.0.0.1/8 scope host
2: vd0: <BROADCAST,UP> mtu 1500
link/ether 02:02:5a:44:e2:06 brd ff:ff:ff:ff:ff:ff
inet6 fe80::2:5aff:fe44:e206/64 scope link
inet 10.1.2.3/24 scope global
● A network stack can be “mounted.”
● /dev/net/default is the default stack, but View-OS
supports multiple stacks.
77. #3: Mount a Filesystem
$ um_add_service umfuse
$ mount -t umfuseext2 -o ro ext2filesystemimage /mnt
$ mount -t umfusestrangefilesystem strangeimage /mnt2
● Source compatible with Fuse.
● Mount file systems unsupported by the kernel.
● Safe mount, limited to this View.
78. #4: Filesystem Image partition and mount
● Step 1: Load the umdev (virtual device)
module and mount an empty file as a disk
image.
$ um_add_service umdev
$ viewsu
# dd of=/tmp/diskimage bs=1024 count=0 seek=1024000
# mount -t umdevmbr /tmp/diskimage /dev/hda
79. #4: Filesystem Image partition and mount
● Step 2: partition the file system image:
# fdisk /dev/hda
Device contains a valid partition table
Building a new DOS disklabel with disk identifier 0xd403417d.
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-127, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-127, default 127): 127
Command (m for help): p
Disk /dev/hda: 1048 MB, 1048576000 bytes
255 heads, 63 sectors/track, 127 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xd403417d
Device Boot Start End Blocks Id System
/dev/hda1 1 127 1020096 83 Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to reread partition table.
Syncing disks.
80. #4: Filesystem Image partition and mount
● Step 3: Create the filesystem
# mkfs.ext2 /dev/hda1
mke2fs 1.41.8 (20-Jul-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
63872 inodes, 255024 blocks
12751 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=264241152
8 block groups
32768 blocks per group, 32768 fragments per group
7984 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Writing inode tables: done
Writing superblocks and filesystem accounting information: do
This filesystem will be automatically checked every 38 mounts
180 days, whichever comes first. Use tune2fs -c or -i to over
81. #4: Filesystem Image partition and mount
● Step 4: mount the new partition
# um_add_service umfuse
# mount -t umfuseext2 -o rw+ /dev/hda1 /mnt
# ls -l /mnt
total 16
drwx------ 2 root root 16384 2009-09-16 11:57 lost+found
● Example of nested virtualization.
● Compatible with standard sys-admin
commands.
82. #5: User Mode chroot
● Step 1: create the jail filesystem
$ mkdir /tmp/root /tmp/root/bin /tmp/root/lib
$ cp /bin/busybox /tmp/root/bin
$ cp /lib/libm-2.9.so /lib/libc-2.9.so /tmp/root/lib
$ cd /tmp/root/lib
$ ln -s libm-2.9.so libm.so.6
$ ln -s libc-2.9.so libc.so.6
$ cd /
(Diagram: /tmp/root contains bin, holding busybox, and lib, holding libc.so.6, libm.so.6, libc-2.9.so and libm-2.9.so.)
83. #5: User Mode chroot
● Step 2: change the file system root:
– Core mode: by the virtual chroot system call
$ exec /usr/sbin/chroot /tmp/root /bin/busybox sh
BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ $
– By Viewfs:
$ um_add_service viewfs
$ exec busybox sh
BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ $ mount -t viewfs -o move,permanent /tmp/root /
/ $
85. #6: Create a Ramdisk and use it:
$ um_add_service umdev
$ um_add_service umfuse
$ um_add_service umproc
$ mount -t umdevramdisk -o size=100M none /dev/hdx
$ /sbin/mkfs.vfat /dev/hdx
mkfs.vfat 3.0.3 (18 May 2009)
$ mount -t umfusefat -o rw+ /dev/hdx /mnt
$ mount
rootfs on / type rootfs (rw)
/dev/root on / type ext3 (rw,errors=remount-ro,data=ordered)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=755)
... ...
none on /proc/mounts type proc (ro)
none on /dev/hdx type umdevramdisk (size=100M)
/dev/hdx on /mnt type umfusefat (rw+)
$
● Another example of nested virtualization
● Umproc virtualizes /proc/mounts
86. #7: Virtualize running processes
● Shell #1 (pid 12345, an ordinary shell)
sh1 $ mkdir /tmp/mnt
sh1 $ ls /tmp/mnt
sh1 $
● Shell #2 (running under ViewOS)
sh2 $ um_add_service umfuse
sh2 $ mount -t ext2 /tmp/linux.img /tmp
sh2 $ ls /tmp/mnt
bin boot dev etc lib lost+found mnt proc sbin tmp usr
sh2 $ um_attach 12345
● Shell #1 has been “attached” to ViewOS
sh1 $ ls /tmp/mnt
bin boot dev etc lib lost+found mnt proc sbin tmp usr
87. #8: Process proper time
● Start two xclocks, one from a standard shell, the other
from a shell running ViewOS.
sh1 $ xclock -update 1 &
sh2 $ xclock -update 1 &
sh2 $ um_add_service ummisc
sh2 $ mount -t ummisctime none /tmp/mnt
● Now change the frequency of the virtual time for
ViewOS:
sh2 $ echo 2 > /tmp/mnt/frequency
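One way to picture what writing to the frequency file does is a clock that scales real elapsed time. The model below is an assumption made for illustration (a linear virtual clock with hypothetical names); the slides do not show how ummisctime computes virtual time.

```c
/* Hypothetical model of a per-view virtual clock. */
struct vclock {
    double real_base;   /* real time when the frequency was last set   */
    double virt_base;   /* virtual time at that same moment            */
    double frequency;   /* virtual seconds elapsed per real second     */
};

/* Virtual time corresponding to the real time real_now. */
double vclock_now(const struct vclock *c, double real_now)
{
    return c->virt_base + c->frequency * (real_now - c->real_base);
}

/* Change the frequency without making the virtual clock jump:
 * the current virtual time becomes the new base. */
void vclock_set_frequency(struct vclock *c, double real_now, double frequency)
{
    c->virt_base = vclock_now(c, real_now);
    c->real_base = real_now;
    c->frequency = frequency;
}
```

With frequency 2, five real seconds after the change the virtual clock has advanced ten seconds, which is the effect one would expect to see on the second xclock.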
88. Behind the Scenes
(Diagram: *mview architecture. Modules, each with its submodules, sit on top of *mview. *mview comprises the global hash table, the dispatcher, PCB and fd table management, the capture layer and nested capture (PURELIBC). Process system calls are captured through ptrace or the kmview kernel module (utrace) in the Linux kernel.)
89. Modules & submodules
● Modules provide support for classes of virtualizations, e.g.:
– Umfuse: file systems
– Umnet: networking
– Umdev: devices
● Submodules are for specific cases, e.g.:
– Umfuseext2, Umfusefat
– Umnetlwipv6, umnetnull
– Umdevmbr, umdevramdisk
90. Modules & submodules
module   description                       submodule       description
umproc   /proc/mounts virtualization
umfuse   User-mode fuse                    umfuseext2      ext2 implementation
                                           umfuseiso9660   iso9660
                                           umfusefat       fat/vfat
                                           umfusentfs3g    ntfs
                                           umfusearchive   tar/cdimages (libarchive)
                                           umfuseramfile   single file virtualization
                                           umfusessh       (sshfs) remote file system via ssh
                                           umfuseencfs     (encfs) encrypted file system
umnet    network multi stack support       umnetnull       null stack
                                           umnetlwipv6     IPv4/v6 hybrid stack
                                           umnetlink       move/merge stacks
                                           umnetcurrent    current stack
umdev    device virtualization             umdevmbr        DOS master boot record
                                           umdevnull       null device
                                           umdevramdisk    ramdisk
                                           umdevvd         VDI, VMDK, VHD disks
                                           umdevtab        virtual tuntap
ummisc   system call based virtualization  ummisctime      time virtualization
                                           ummiscuname     uname id virtualization
viewfs   file system patchworking
91. Capture user process system calls
● umview: based on ptrace
– Vanilla Linux kernel
– (patches proposed for performance)
● kmview needs a specific kernel module based on utrace.
– Security enhancement
– More complete virtualization support (nested View-OS,
strace/gdb, SIGSTOP).
92. Global Hash Table
● Keeps track of active virtualizations:
– Pathname objects
– File System Types
– Protocol families
– Device Major/Minor ranges
– System call numbers
93. Dispatcher
● The Dispatcher uses the global hash table to route each
system call to the right module or to the kernel.
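The routing decision for pathname objects can be sketched as a longest-prefix match that falls back to the kernel. This is a simplification under stated assumptions: it uses a linear scan rather than a hash table, the `route` structure and `dispatch_path` are hypothetical, and the module names in the usage note are just labels borrowed from the slides.

```c
#include <stddef.h>
#include <string.h>

/* One active virtualization: a mount point and the module serving it. */
struct route {
    const char *prefix;
    const char *module;
};

/* Return the module responsible for path, or "kernel" when no active
 * virtualization covers it. The longest matching mount point wins,
 * so nested virtualizations take precedence over outer ones. */
const char *dispatch_path(const struct route *tab, size_t n, const char *path)
{
    const char *best = "kernel";
    size_t best_len = 0;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(tab[i].prefix);
        if (len == 0 || strncmp(path, tab[i].prefix, len) != 0)
            continue;
        /* Match whole path components only: "/mnt" must not claim "/mntx". */
        if (path[len] != '\0' && path[len] != '/' &&
            tab[i].prefix[len - 1] != '/')
            continue;
        if (len > best_len) {
            best = tab[i].module;
            best_len = len;
        }
    }
    return best;
}
```

With routes for "/mnt", "/mnt/inner" and "/dev/hda", a call on "/mnt/inner/x" goes to the module mounted on "/mnt/inner", while "/etc/passwd" is left to the kernel.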
94. Nested Capture
● View-OS captures (and can virtualize) the system calls
generated by modules and submodules
● Purelibc is a C library providing process self virtualization
95. Desiderata: 1: Linux Kernel
New Ptrace tags for virtualization support:
– PTRACE_VM: support for partial virtualization. It is
possible to skip the current system call and/or
the second upcall after the system call. (User-Mode
Linux can use this instead of PTRACE_SYSEMU; VM has
a simpler implementation than SYSEMU.)
– PTRACE_MULTI: process a sequence of ptrace
requests + PEEK/POKE of large chunks as a
single call. (ptrace exchanges one memory
word per call and /proc/{pid}/mem is not
writable!)
96. Desiderata 2: Open Group/POSIX
#include <msocket.h>
int msocket(char *path, int domain, int type, int protocol);
● path is the pathname of the stack.
● domain/type/protocol have the same meaning as in socket(2).
● A stack is a special file (new type of special file, see stat(2)):
#define S_IFSTACK 0160000
● Each process has a default stack for each protocol family (domain).
– If path==NULL, msocket uses the default stack.
● It is backwards compatible:
#define socket(d,t,p) msocket(NULL,(d),(t),(p))
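A short sketch of how a program could recognize the proposed file type from a st_mode value. S_IFSTACK uses the value given above; the S_ISSTACK macro, the file_kind enum and classify_mode are hypothetical names added for this example and are not part of POSIX.

```c
#define _DEFAULT_SOURCE     /* for S_IFMT / S_IFSOCK under strict -std= modes */
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

#ifndef S_IFSTACK
#define S_IFSTACK 0160000   /* value proposed on the slide, not in POSIX */
#endif
#ifndef S_ISSTACK
#define S_ISSTACK(m) (((m) & S_IFMT) == S_IFSTACK)   /* hypothetical helper */
#endif

/* The proposed type must fit in the file type mask and not clash with sockets. */
static_assert((S_IFSTACK & S_IFMT) == S_IFSTACK, "S_IFSTACK must fit in S_IFMT");
static_assert(S_IFSTACK != S_IFSOCK, "S_IFSTACK must differ from S_IFSOCK");

enum file_kind { KIND_OTHER, KIND_STACK, KIND_SOCKET };

/* Classify a st_mode value as returned by stat(2). */
enum file_kind classify_mode(mode_t mode)
{
    if (S_ISSTACK(mode))
        return KIND_STACK;
    if (S_ISSOCK(mode))
        return KIND_SOCKET;
    return KIND_OTHER;
}
```

On Linux the value 0160000 is not used by any of the existing S_IF* file types, which is what makes it usable as a new special file type.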
97. Desiderata 2: Open Group/POSIX
int msocket(char *path, int domain, int type, int protocol);
● If type==SOCK_DEFAULT, msocket sets the default stack, e.g.
msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
defines /dev/net/ipstack2 as the default stack for IPv4.
● If type==SOCK_DEFAULT && domain==PF_UNSPEC, msocket sets the default stack for all the protocol families.
● mstack uses msocket: it defines the default stack so that existing applications can use different stacks.
$ ip addr
..... ip addr on default net
$ mstack /dev/net/newstack firefox
.... firefox works on newstack
$ mstack /dev/net/otherstack bash
$ ...this new bash works on otherstack
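The default-stack rules can be modelled in a few lines of user-space C. This is only a sketch of the selection logic described above, not an implementation of msocket: resolve_stack, SKETCH_SOCK_DEFAULT and SKETCH_NFAMILIES are hypothetical names, and the value chosen for SKETCH_SOCK_DEFAULT is a placeholder because the slides do not give the real SOCK_DEFAULT value.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

#define SKETCH_SOCK_DEFAULT (-1)   /* placeholder value, an assumption */
#define SKETCH_NFAMILIES 64        /* enough for the families used here */

/* Per-process default stack for each protocol family; NULL = system stack. */
static const char *default_stack[SKETCH_NFAMILIES];

/* Returns the stack that msocket(path, domain, type, ...) would work on.
 * With type == SKETCH_SOCK_DEFAULT it records path as the new default,
 * for one family or (domain == PF_UNSPEC) for all of them. */
const char *resolve_stack(const char *path, int domain, int type)
{
    assert(domain >= 0 && domain < SKETCH_NFAMILIES);

    if (type == SKETCH_SOCK_DEFAULT) {
        if (domain == PF_UNSPEC) {
            for (int d = 0; d < SKETCH_NFAMILIES; d++)
                default_stack[d] = path;
        } else {
            default_stack[domain] = path;
        }
        return path;
    }
    /* An explicit path wins; NULL falls back to the family's default. */
    return path != NULL ? path : default_stack[domain];
}
```

In this model socket(d,t,p) is simply resolve_stack(NULL, d, t) followed by the usual socket creation on the chosen stack, which is why existing applications keep working unchanged once a tool such as mstack has set the default.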
98. Desiderata 3: C library ({e}glibc)
● C libraries are impure: they are pure C libraries and the interface to system calls at the same time.
● Self virtualization of system calls is not possible for processes using {e}glibc, because library calls are internally linked to the system calls (e.g. printf calls write).
● Purelibc is a (ld-preloaded) layer on {e}glibc which converts the C library into a pure library.
● Support for self virtualization should be a feature of mainstream {e}glibc.
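A sketch of what "pure" means here, assuming a library whose functions reach the kernel only through one replaceable function pointer. This is not the purelibc API: pure_set_hook, pure_write, pure_puts and capture_stdout_hook are hypothetical names used to show the idea of a process virtualizing its own system calls.

```c
#define _DEFAULT_SOURCE     /* for syscall(2) under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

typedef long (*syscall_hook_fn)(long nr, long a1, long a2, long a3);

/* Default behaviour: forward the request to the kernel. */
static long real_syscall(long nr, long a1, long a2, long a3)
{
    return syscall(nr, a1, a2, a3);
}

static syscall_hook_fn current_hook = real_syscall;

/* The process installs its own handler: this is the self virtualization. */
void pure_set_hook(syscall_hook_fn hook)
{
    assert(hook != NULL);
    current_hook = hook;
}

/* write() as a pure library function: it never traps to the kernel itself. */
ssize_t pure_write(int fd, const void *buf, size_t count)
{
    return (ssize_t)current_hook(SYS_write, fd, (long)buf, (long)count);
}

/* A higher-level library call built on top of pure_write. */
int pure_puts(const char *s)
{
    return pure_write(STDOUT_FILENO, s, strlen(s)) < 0 ? -1 : 0;
}

/* Example hook: capture writes to stdout in memory, pass the rest through. */
char captured[256];
size_t captured_len;

long capture_stdout_hook(long nr, long a1, long a2, long a3)
{
    if (nr == SYS_write && a1 == STDOUT_FILENO) {
        size_t n = (size_t)a3;
        if (n > sizeof captured - captured_len)
            n = sizeof captured - captured_len;
        memcpy(captured + captured_len, (const void *)a2, n);
        captured_len += n;
        return (long)n;
    }
    return real_syscall(nr, a1, a2, a3);
}
```

After pure_set_hook(capture_stdout_hook), a call to pure_puts("hello") leaves the five bytes in captured and nothing reaches the terminal. With a stock C library this is not possible from inside the process, because the library call goes straight to the system call.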
99. Desiderata 4: utrace
UTRACE_STOP
● Utrace supports several tracers (engines) on the same process.
● Utrace sends the notification to all the tracers and then waits for utrace_control(..., UTRACE_RESUME) from each tracer which returned UTRACE_STOP.
● This specification is ill suited to nested virtualization support: a notification function inspects the state (e.g. system call parameters) and may change it. The next tracer must read the state as changed by the previous tracer.
● Kmview uses a semaphore in its system call notification function to stop a process, because this UTRACE_STOP specification is useless here.
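The ordering problem can be shown with a deterministic toy model. It does not use utrace: sc_state, tracer_fn and the two run_* functions are hypothetical, and run_unsynchronized models just one possible outcome of the race, the one where every tracer works on the state as it was when the call was captured.

```c
#include <assert.h>
#include <stddef.h>

struct sc_state { long nr; long arg; };          /* captured system call */
typedef void (*tracer_fn)(struct sc_state *st);  /* a tracer's syscall management */

/* Two nested virtualizations: the first remaps 1 to 2, the second 2 to 3. */
void remap_1_to_2(struct sc_state *st) { if (st->arg == 1) st->arg = 2; }
void remap_2_to_3(struct sc_state *st) { if (st->arg == 2) st->arg = 3; }

/* Desired behaviour: each tracer finishes before the next one is notified,
 * so it reads the state as changed by the previous tracer. */
void run_serialized(struct sc_state *st, const tracer_fn *tracers, size_t n)
{
    assert(st != NULL && tracers != NULL);
    for (size_t i = 0; i < n; i++)
        tracers[i](st);
}

/* Model of "notify everybody, then wait for everybody": every tracer
 * inspects the original state and writes back only what it changed. */
void run_unsynchronized(struct sc_state *st, const tracer_fn *tracers, size_t n)
{
    assert(st != NULL && tracers != NULL);
    const struct sc_state seen = *st;
    for (size_t i = 0; i < n; i++) {
        struct sc_state copy = seen;
        tracers[i](&copy);
        if (copy.arg != seen.arg)
            st->arg = copy.arg;
    }
}
```

Starting from arg == 1 with the chain {remap_1_to_2, remap_2_to_3}, run_serialized ends with arg == 3, while run_unsynchronized ends with arg == 2 because the second tracer never sees the first tracer's change. Blocking inside the notification function, as kmview does with its semaphore, gives the serialized behaviour.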
100. UTRACE_STOP implementation
[Sequence diagram; actors: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview]
– Syscall request
– Notify engine#2: notify user space, return UTRACE_STOP
– Notify engine#1: notify user space, return UTRACE_STOP
– Mgmt of syscall (in both kmview instances) while UTRACE waits for all engines: RACE CONDITION!
– run syscall
101. Desiderata: new UTRACE_STOP
[Sequence diagram; actors: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview]
– Syscall request
– Notify engine#2: notify user space, return UTRACE_STOP, wait during mgmt of syscall
– Notify engine#1: notify user space, return UTRACE_STOP, wait during mgmt of syscall
– run syscall
102. Kmview workaround
UTRACE_SYSCALL_{RUN,ABORT} instead of UTRACE_STOP
[Sequence diagram; actors: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview]
– Syscall request
– Notify engine#2: notify user space, down(sem), mgmt of syscall, up(sem)
– Notify engine#1: notify user space, down(sem), mgmt of syscall, up(sem)
– run syscall
103. Multiple meanings of safety...
● Availability, confinement of bug effects:
– ViewOS runs outside the kernel: errors in modules may lead to a crash of the View (not a kernel panic!)
● Self protection (from mistaken commands):
– The Global View Assumption often forces the use of root access (or powerful capabilities); this is dangerous.
● Sandbox non-circumvention:
– At first sight, kernel-based sandboxes (e.g. seccomp) seem safer.
● Kernel-based sandboxes are not flexible
● On/off security: a bug may compromise the whole system
● Good support for VM can preserve safety
● The more code, the worse the security. Is the kernel “too fat”?
– Maintenance problems, side effects, etc.
– ViewOS can move services outside the kernel.
104. The missing ring...
● View-OS modules are similar to microkernel servers.
● View-OS captures some of the benefits of microkernels (separation of mechanism and policy, flexibility, reliability).
● View-OS allows microkernel services to be implemented (at user level) on monolithic kernels.