VIRTUALSQUARE: all the virtuality you wanted but you were afraid to ask. Rome, March 5th 2011. Renzo Dav...
IPN extra features ● Out-Of-Band messages from core IPN and policy submodules – e.g. number of readers notificati...
VDETELWEB ● It is the Web/Telnet Server for VDE switch configuration. ● It uses the LWIPv6 library ● It has two connect...
VDETELWEB: telnet
VDETELWEB: Web Interface
View-OS ... a process with a view. Each process should be permitted to have its own view of the execution envi...
View Components ● filesystem namespace, including the related ownership and permission information, ● networking c...
Global View Assumption ● In general processes running on the same computer share the same view. – A given path...
View-OS vs. VMs and Containers. Columns: View-OS, VM, Container. Memory Impact: LOW...
Partial Virtualization ● Virtualize just what you need: – Virtual and real file systems, devices, net...
How to start a View-OS monitor: user@host:~$ umview bash This kernel supports: PTRACE_MULTI PTRACE_SYSVM ppoll ...
View-OS modules ● View-OS monitor loads only the virtualities requested by the user: – Umfuse: file system vir...
#1: Virtual Installation of Software $ um_add_service viewfs $ mkdir /tmp/newroot $ viewsu # mount -t viewfs -...
#2: Virtual Networking $ um_add_service umnet $ mount -t umnetlwipv6 none /dev/net/default $ ip link set vd0 up ...
#2: Virtual Networking $ um_add_service umnet $ mount -t umnetlwipv6 -o tn0=tunx none /dev/lwip0 $ mount -t umnetlwipv6 -o tp...
#3: Mount a Filesystem $ um_add_service umfuse $ mount -t umfuseext2 -o ro ext2filesystemimage /mnt $ mou...
#4: Filesystem Image partition and mount ● Step 1: Load the umdev (virtual device) module and mount an empty file as a...
#4: Filesystem Image partition and mount ● Step 2: # fdisk /dev/hda Device contains a valid partition tabl...
#4: Filesystem Image partition and mount ● Step 3: # mkfs.ext2 /dev/hda1 mke2fs 1.41.8 (20-Jul-2009) Create...
#4: Filesystem Image partition and mount ● Step 4: mount the new partition # um_add_service umfuse # mount -t u...
#5: User Mode chroot ● Step 1: create the jail filesystem $ mkdir /tmp/root /tmp/root/bin /tmp/root/lib $ cp ...
#5: User Mode chroot ● Step 2: change the file system root: – Core mode: by the virtual chroot system call ...
#5: User Mode chroot ● Step 3: the process is in the jail: / $ ls -lR / /: drwxr-xr-x 2 1000 1000 4096 Sep 17 13:37 b...
#6: Create a Ramdisk and use it: $ um_add_service umdev $ um_add_service umfuse $ um_add_service umproc $ mount -t umdevramdis...
#7: Virtualize running processes ● Shell #1 (pid 12345, an ordinary shell) sh1 $ mkdir /tmp/mnt sh1 $ ls /t...
#8: Process proper time ● Start two xclocks one from a standard shell, the other from a shell running ViewOS. sh 1 ...
Behind the Scenes (diagram) submodule submodule ...
Modules & submodules ● M...
Modules & submodules. Table columns: module, description, submodule, description. umproc: /proc/mounts v...
Capture user process system calls ...
Global Hash Table ● Keep...
Dispatcher ● The Dispatc...
Nested Capture ● View-OS...
Desiderata 1: Linux Kernel. New Ptrace tags for virtualization support: – PTRACE_VM: support for partial virtualizatio...
Desiderata 2: Open Group/POSIX #include <msocket.h> int msocket(char *path, int domain, int type, int protocol); ● Path is ...
Desiderata 2: Open Group/POSIX int msocket(char *path, int domain, int type, int protocol); ● if type==SOCK_DEFAULT msocke...
Desiderata 3: C library ({e}glibc) ● C libraries are impure, they are pure C libraries and interface to system calls ...
Desiderata 4: utrace UTRACE_STOP ● Utrace supports more tracers (engines) on the same process....
UTRACE_STOP implementation. Diagram columns: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview. Syscal...
Desiderata: new UTRACE_STOP. Diagram columns: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview. Sysca...
Kmview workaround: PTRACE_SYSCALL_{RUN,ABORT} instead of PTRACE_STOP. Diagram columns: PROCESS, UTRACE, Kmview kernel module...
Multiple meaning of safety... ● Availability, bug effects confinement: – ViewOS runs outside the kernel, errors ...
The missing ring... ● View-OS modules are similar to microkernel servers. ● View-OS captures some of the benefit of ...
VirtualSquare. Questions? ● VDE ● LWIPv6 ● PureLibC ● IPN ● View-OS – Umview...
Virtualsquare: all the virtuality you always wanted and never dared to ask for

Renzo Davoli's presentation,
given at Codemotion, March 5th 2011, Rome


  1. VIRTUALSQUARE: all the virtuality you wanted but you were afraid to ask. Rome, March 5th 2011. Renzo Davoli, Università di Bologna (Master in Scienze e Tecnologie del Software Libero) (Associazione per il Software Libero). Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
  2. Virtual...: Time, Execution Environment, user-id, Device, Machine, Networking, Memory, File System
  3. LXC, schroot, libvirt, chroot, GXemul, Libguestfs, Bochs, PearPC, Qemu, tinc, VDE, fuse-ext2, Open-VZ, Marionnet, User-mode Linux, UnionFS, JVM, Umview, PureLibc, VirtualBOX, FUSE, FairVPN, SPICE, LWIPv6, fakeroot, VirtualBricks, View-OS, fakeroot-ng, KVM
  4. What does Virtualization mean?
  5. User of A (diagram). A well-known service A exposes the interface I(A); a new service A' exposes the interface I(A'). If I(A') = I(A), A' can be used instead of A.
  6. What is “real” (the diagram of slide 5 is repeated alongside)
     ● The service A can be hardware
       – Machine
       – Memory
       – Network
     ● The service A can be software
       – File system
       – Execution Environment
       – Identity
  7. Why virtual?
     ● Flexibility
       – prototyping
       – modify features at run-time (“real” may be hard to modify: e.g. hardware or kernel code).
       – Satisfy several requirements while sharing common structures
     ● Safety
       – least privilege
       – sandboxing
     ● Optimization
       – Server/service consolidation
       – No need to maintain several “real” items
  8. Simulation-Emulation-Virtuality
     ● Simulation:
       – Provide just a model of the phenomena to study
       – Simulation never provides virtualization
     ● Emulation:
       – It means: Behave in the same way.
       – Can provide virtualization if usable (e.g. it is usable)
     ● Virtuality can be provided without emulation
       – e.g. LXC
  9. Virtual Machines (Smith-Nair)
  10. Virtual Machines
     ● Intrusiveness:
       – User-mode user-access
       – User-mode superuser-access
       – Kernel patch/module
       – Native
     ● Paravirtualization:
       – Change the user interface to optimize virtualization
  11. Virtual Machines (diagram; labels: User, Virtualization, Service)
  12. Virtual Machine (multiplexing) (diagram; labels: User, Virtually multiplexed service, Service)
  13. Virtual Machines:
     ● Qemu-system: system-vm, processor emulation (by direct code translation), user-space/user-permission
     ● GXEmul/PearPC/Mac-on-Linux: system-vm, processor emulation, user-space/user-permission
  14. Virtual Machines
     ● KVM: system-vm, same-instruction set, may provide paravirtualization (virtio/vhost-net), user-mode monitor, requires processor extensions and kernel module (Linux specific/optimized).
     ● VirtualBOX(OSE): system-vm, same-instruction set, may provide paravirtualization, user-mode monitor, requires processor extensions and kernel module (it runs on several Operating Systems).
     ● XEN: system-vm, same-instruction set, provides paravirtualization, native mode, multiplexing.
  15. Operating system level VM
     ● LXC: Linux Containers/Namespaces: native mode, multiplexing, user-mode superuser-access (it provides partial virtualization/sharing between containers).
     ● OpenVZ: native mode, multiplexing, superuser-access.
  16. Process VM
     ● User Mode Linux: a Linux kernel is the VM monitor for processes. Process VM. User-mode, user-access. (Almost) the same interface (same system call set; the kernel version may differ).
     ● View-OS (umview/kmview): partial-modular virtualization. Process VM. User-mode, user-access. (Almost) the same interface (users may define new system calls).
  17. Process VM
     ● Qemu: process emulation (direct translation), user-mode, user-access.
     ● Application VM (JVM, Mono, ….)
  18. Network Virtualization
     ● Virtual Private Networks
       – For secure remote access
     ● Overlay Networks
       – e.g. Akamai, p2p
     ● Networks for Virtual Machines
     ● Kernel bridge based virtual networks
     ● Virtual Distributed Ethernet: data-link layer Ethernet consistent, user-mode, user access.
  19. File system virtualization
     ● Chroot/schroot
     ● Fakeroot/fakeroot-ng
     ● FUSE
       – Fuse-ext2
       – Fuse-ssh
       – Fuse-*
  20. Modular Virtualization
     ● View-OS: modular partial virtual machine
       – Umview: based on ptrace (user-mode, user access)
       – Kmview: based on utrace/kernel module (user-mode)
     ● Several modules available:
       – File system (mount)
       – File system (patchworking)
       – Device
       – Uname/time...
       – Networking
     ● Chroot/fakeroot/fuse/vpn/binfmt... features have been implemented on View-OS.
  21. The VirtualSquare view
  22. Virtualsquare
     ● Virtualsquare is:
       – A community....
       – A container of projects....
     ● Virtualsquare is not:
       – A company
       – A brand/product line
     ● Virtualsquare started at the University of Bologna but now it is an international community
       – A lot of former students now work abroad
       – Common ideas with other groups (joint projects)
  23. The VirtualSquare view
     ● Communication/compatibility
       – different Virtualities must be interconnected, must communicate
     ● Integration
       – different Virtualities can be seen as special cases of a broader idea of Virtuality
     ● Extension
       – if a need cannot yet be captured by a kind of virtuality, let us create a new one (maybe combining existing virtualities).
  24. Communication compatibility (diagram; labels: KVM, bochs, VirtualBOX, tuntap, User-mode linux, libvirt, View-OS, LWIPv6)
     ● VDE
       – General purpose user-mode networking support
       – Ethernet data-link consistent
       – Distributed
       – Intuitive (it has the same structure as real Ethernet: switches, cables)
  25. Integration (diagram; View-OS surrounded by: VPN, fakeroot(ng), (s)chroot, binutils, Virtual networking, FUSE, UnionFS, Virtual devices, LXC user-mode)
     ● View-OS = each process can have its “view” of the environment.
     ● User-mode/user-permission partial virtual machine approach
     ● The features of several existing tools have been re-implemented as composable modules.
  26. View-OS: GLOBAL VIEW ASSUMPTION
  27. Extension
     ● A new forge for several new concepts and ideas/tools:
       – Msocket: support for multi-stack applications
       – LWIPv6: user-space LWIPv4/LWIPv6 hybrid networking (multi)stack as a library
       – Purelibc: process self virtualization
       – Relativistic virtualization of time: emulation of fast machines on slow ones.
       – Virtual spaces per login shells.
       – Public Distributed Ethernets
       – Run-time on-the-fly virtualization
  28. (slide without text)
  29. VirtualSquare: the book. Renzo Davoli, Michael Goldweber (editors), Virtual Square: Users, Programmers & Developers Guide
     ● Available at lulu books or downloadable from wiki.virtualsquare.org
     ● Warning: this book is dynamically changing as the project evolves
  30. ...a closer look at virtualsquare projects...
  31. VDE components (diagram; labels: SWITCH, CROSS CABLE, SWITCH; VM (e.g. QEMU), VDE SWITCH, VDE plug, VdeWire (e.g. ssh), VDE plug, VDE SWITCH, VM (e.g. U-ML), TunTap Linux Module)
  32. VDE: Related Work
     ● VPN: (OpenVPN) point2point, for real machines
     ● Overlay Networks: specific for application (peer to peer, Akamai).
     ● VM networking: (tools provided with VM, e.g. uml-switch) specific for VM
     VDE:
     ● multipoint, general mesh
     ● no need for root (administration) access
     ● heterogeneous VM and non VM connected
  33. VDEv2: advertisement
     ● VDEv2:
       – modular design
       – compatible with user-mode linux, qemu, tuntap, (bochs, plex86), umview/lwipv6
       – through the vdetaplib potentially compatible with any application using tap
       – VLAN (802.1Q)
       – FST (fast spanning tree)
       – run time manageable via unixterm (telnet or web with vdetelweb)
       – includes slirpvde and wirefilter
       – status debug
       – plugin support: snmp/iplog/pdump
  34. VDEv2
     ● VDE-Switch
       – number of ports configurable on command line
       – port0 is reserved for management clients, n-1 ports are available for connections.
       – management UNIX socket for management clients
         ● self-describing SMTP-like protocol
       – modules: datasock (VM conn), tuntap, consmgmt (management)
  35. VDE cables
     ● VDE-plug
       – is a VM that converts the Ethernet packets of a VDE port into a stream connection (stdin-stdout)
     ● VDE-wire
       – can be any application able to give a stdin/stdout stream connection
  36. Dual Pipe
     ● dpipe is a new (general purpose) command we have added.
     ● Pipes are a well-known abstraction. The following command prints the listing of the current directory:
         ls | lpr
     ● Dpipe creates a bi-directional connection between the processes:
         dpipe cmd1 = cmd2
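The bi-directional connection described on the Dual Pipe slide can be sketched with two ordinary pipes: one carries the output of cmd1 to the input of cmd2, the other carries the output of cmd2 back to cmd1. The function name `dual_pipe` and its return convention are assumptions made for this illustration; this is not the dpipe source code, and it assumes descriptors 0-2 are already open in the caller.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that runs argv with stdin/stdout bound to in_fd/out_fd. */
static pid_t spawn(char *const argv[], int in_fd, int out_fd, const int fds[4])
{
    pid_t pid = fork();
    if (pid == 0) {
        dup2(in_fd, STDIN_FILENO);
        dup2(out_fd, STDOUT_FILENO);
        for (int i = 0; i < 4; i++)          /* drop the original pipe ends */
            if (fds[i] > STDERR_FILENO)
                close(fds[i]);
        execvp(argv[0], argv);
        _exit(127);                          /* exec failed */
    }
    return pid;
}

/* Hypothetical helper: returns 0 when both commands exit with status 0,
 * 1 when at least one of them fails, -1 on setup errors. */
int dual_pipe(char *const cmd1[], char *const cmd2[])
{
    int a[2], b[2];                          /* a: cmd1 -> cmd2, b: cmd2 -> cmd1 */
    assert(cmd1 && cmd1[0] && cmd2 && cmd2[0]);
    if (pipe(a) < 0)
        return -1;
    if (pipe(b) < 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    const int fds[4] = { a[0], a[1], b[0], b[1] };
    pid_t p1 = spawn(cmd1, b[0], a[1], fds); /* cmd1 reads b, writes a */
    pid_t p2 = spawn(cmd2, a[0], b[1], fds); /* cmd2 reads a, writes b */
    for (int i = 0; i < 4; i++)              /* the parent keeps no pipe end open */
        close(fds[i]);

    int s1 = -1, s2 = -1;
    if (p1 > 0) waitpid(p1, &s1, 0);
    if (p2 > 0) waitpid(p2, &s2, 0);
    if (p1 < 0 || p2 < 0)
        return -1;
    return (WIFEXITED(s1) && WEXITSTATUS(s1) == 0 &&
            WIFEXITED(s2) && WEXITSTATUS(s2) == 0) ? 0 : 1;
}
```

Closing every unused pipe end matters: a command only sees end-of-file on its input once no process holds the write end of that pipe any more.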
  37. VDE cables, plugs, wires/dpipe
     ● dpipe is used to create VDE-cables:
         dpipe vde_plug = ssh vde.students.cs.unibo.it vde_plug
     ● this command connects by a dpipe the local vde_plug with a vde_plug running on a remote host (the wire is ssh)
     ● other applications can be used as wire (e.g. netcat)
     ● In the example vde_plug refers to the default switch. It is possible to run several switches on the same host; an extra option is needed in this case.
  38. wirefilter
     ● wirefilter can be put on a cable (e.g. for network testing)
         dpipe vde_plug /tmp/s1 = wirefilter -m /tmp/m = vde_plug /tmp/s2
         wirefilter -v /tmp/s1:/tmp/s2
     ● packet loss, delays, dup, speed, noise figures, mtu, fifoness properties of the line can be changed with command line options or in real time via a management socket.
     ● It is possible to define several “states”. The state transition is driven by a Markov chain
  39. SlirpVDE (diagram; labels: VDE SWITCH, VDE SWITCH, 10.0.2.15, VDE plug, VdeWire (e.g. ssh), VDE plug, 10.0.2.16, VM (e.g. QEMU), VM (e.g. U-ML), 10.0.2.2, SlirpVDE, Firefox). Note: slirp supports IPv4; slirpv6 supports both IPv4 and IPv6. Caption: http connection from slirpVDE running on the hosting O.S. to
  40. vde_cryptcab
     ● Coded by Daniele Lacamera (danielinux)
     ● A vde_cryptcab is a distributed cable manager for VDE switches.
     ● Server side:
         vde_cryptcab -s /tmp/vde2.ctl -p 2100
     ● Client side:
         vde_cryptcab -s /tmp/vde2.ctl -c foo@remote.machine.org:2100
     ● It uses a blowfish channel (random key exchanged by scp).
  41. Marionnet (based on VDE)
     ● A project by Jean-Vincent Loddo and Luca Saiu (et al), Université Paris 13.
  42. TINC
     ● tinc is a Virtual Private Network (VPN) daemon that uses tunnelling and encryption to create a secure private network between hosts on the Internet.
     ● Encryption, authentication and compression
     ● Automatic full mesh routing
     ● Easily expand your VPN
     ● Ability to bridge ethernet segments
     ● Runs on many operating systems and supports IPv6
     ● A project by Ivo Timmermans and Guus Sliepen
  43. LWIPv6
     ● It is a LWIPv4/v6 (multi) stack implemented as a library.
     ● Fork project from LWIP project (Adam Dunkels <adam@sics.se>)
     ● Can be connected to any number of VDE, TUN, TAP interfaces.
     ● It is a hybrid stack (not a dual-stack). One single IPv6 “engine” is also able to manage IPv4 packets in compatibility mode (130.136.1.110 is managed as 0::ffff:130.136.1.110).
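The compatibility mode mentioned for the hybrid stack relies on IPv4-mapped IPv6 addresses: 80 zero bits, 16 one bits, then the 32-bit IPv4 address. The sketch below builds such an address with the standard socket headers only; `map_ipv4` and `is_ipv4_traffic` are names invented for this illustration and are not part of the LWIPv6 API.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Build the IPv4-mapped IPv6 address ::ffff:a.b.c.d from a dotted-quad string.
 * Returns 0 on success, -1 if the input is not a valid IPv4 address. */
int map_ipv4(const char *dotted, struct in6_addr *out)
{
    struct in_addr v4;
    assert(dotted != NULL && out != NULL);
    if (inet_pton(AF_INET, dotted, &v4) != 1)
        return -1;
    memset(out, 0, sizeof *out);          /* 80 zero bits ...                  */
    out->s6_addr[10] = 0xff;              /* ... 16 one bits ...               */
    out->s6_addr[11] = 0xff;
    memcpy(&out->s6_addr[12], &v4, 4);    /* ... then the IPv4 address itself  */
    return 0;
}

/* Non-zero when addr is IPv4-mapped, i.e. traffic a hybrid stack
 * would handle as IPv4 inside its single IPv6 engine. */
int is_ipv4_traffic(const struct in6_addr *addr)
{
    return IN6_IS_ADDR_V4MAPPED(addr);
}
```

For example, `map_ipv4("130.136.1.110", &a)` fills `a` with the address the slide writes as 0::ffff:130.136.1.110, and `is_ipv4_traffic(&a)` is then non-zero.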
  44. LWIPv6
     ● PF_INET, PF_INET6
     ● PF_PACKET for raw packet management
       – support for user-level network analysis tools (e.g. sniffers, ethereal)
       – support for user-level dhcp clients.
     ● PF_NETLINK for configuration
     ● Packet filtering
     ● NEW: dhcp client/server, rarpd, slirp, routing, nat on request
  45. LWIPv6 interface definition API
     struct stack *lwip_stack_new(void);
     void lwip_stack_free(struct stack *stack);
     struct stack *lwip_stack_get(void);
     void lwip_stack_set(struct stack *stack);
     struct netif *lwip_vdeif_add(struct stack *stack, void *arg);
     struct netif *lwip_tapif_add(struct stack *stack, void *arg);
     struct netif *lwip_tunif_add(struct stack *stack, void *arg);
     int lwip_add_addr(struct netif *netif, struct ip_addr *ipaddr, struct ip_addr *netmask);
     int lwip_del_addr(struct netif *netif, struct ip_addr *ipaddr, struct ip_addr *netmask);
     int lwip_add_route(struct stack *stack, struct ip_addr *addr, struct ip_addr *netmask, struct ip_addr *nexthop, struct netif *netif, int flags);
     int lwip_del_route(struct stack *stack, struct ip_addr *addr, struct ip_addr *netmask, struct ip_addr *nexthop, struct netif *netif, int flags);
     int lwip_ifup(struct netif *netif);
     int lwip_ifdown(struct netif *netif);
  46. LWIPv6 socket API (just add lwip_ prefix)
     int lwip_msocket(struct stack *stack, int domain, int type, int protocol);
     int lwip_socket(int domain, int type, int protocol);
     int lwip_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
     int lwip_bind(int s, struct sockaddr *name, socklen_t namelen);
     int lwip_shutdown(int s, int how);
     int lwip_getpeername(int s, struct sockaddr *name, socklen_t *namelen);
     int lwip_getsockname(int s, struct sockaddr *name, socklen_t *namelen);
     int lwip_getsockopt(int s, int level, int optname, void *optval, socklen_t *optlen);
     int lwip_setsockopt(int s, int level, int optname, const void *optval, socklen_t optlen);
     int lwip_close(int s);
     int lwip_connect(int s, struct sockaddr *name, socklen_t namelen);
     int lwip_listen(int s, int backlog);
     int lwip_recv(int s, void *mem, int len, unsigned int flags);
     int lwip_read(int s, void *mem, int len);
     int lwip_recvfrom(int s, void *mem, int len, unsigned int flags, struct sockaddr *from, socklen_t *fromlen);
     int lwip_send(int s, void *dataptr, int size, unsigned int flags);
     int lwip_sendto(int s, void *dataptr, int size, unsigned int flags, struct sockaddr *to, socklen_t tolen);
     int lwip_write(int s, void *dataptr, int size);
     int lwip_select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, struct timeval *timeout);
     int lwip_ioctl(int s, long cmd, void *argp);
  47. LWIPv6: New features
     ● Packet forwarding
     ● Filtering
     ● NAT
     ● DHCP server/RADV server onboard
     ● SLIRP (v4 and v6)
     struct netif *lwip_add_slirpif(struct stack *stack, void *arg, int flags);
  48. slirpvde6
     ● Extension of slirpvde based on LWIPv6
       – slirp ipv4/ipv6
       – Stateless translator
       – Dhcp/radv server
       – DNS forwarder
       – Port and X forwarding (in and out)
     slirpvde6 -d -H10.0.2.1/24 -H2001::1/64 -s /tmp/vde.ctl -dhcp -r
  49. Berkeley sockets API: problem #1
     ● The Berkeley Sockets API has been designed for one protocol stack (per protocol family).
       – Multiple stacks => different networking features (per user, per application...)
     ● Unix uses the file system as a naming space for everything (devices, kernel variables, ...) except for networking.
       – Access control to networking
  50. Solution #1: msockets
     #include <msocket.h>
     int msocket(char *path, int domain, int type, int protocol);
     ● path is the pathname of the stack
     ● domain/type/protocol are the same as defined in socket(2).
     ● A stack is a special file (new type of special file, see stat(2)):
         #define S_IFSTACK 0160000
     ● Each process has a default stack for each protocol family (domain).
       – If path==NULL, msocket uses the default stack.
     ● It is backwards compatible:
         #define socket(d,t,p) msocket(NULL,(d),(t),(p))
  51. Msockets: set the default stack
     int msocket(char *path, int domain, int type, int protocol);
     ● if type==SOCK_DEFAULT, msocket sets the default stack, e.g.
         msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
       defines /dev/net/ipstack2 as the default stack for IPv4
     ● if type==SOCK_DEFAULT && domain==PF_UNSPEC, msocket sets the default stack for all the protocol families.
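The default-stack rules of the two msocket slides can be modelled in a few lines of plain C. This is only a model of the selection logic: it creates no sockets, the enum values stand in for PF_* and SOCK_* constants (the real value of SOCK_DEFAULT is not given in the slides), and `model_msocket` returns the pathname of the stack that would be used instead of a file descriptor. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { FAM_UNSPEC = 0, FAM_INET, FAM_INET6, FAM_COUNT };  /* stand-ins for PF_* */
enum { TYPE_STREAM = 1, TYPE_DEFAULT = 99 };              /* TYPE_DEFAULT models SOCK_DEFAULT */

/* Per-process state: one default stack per protocol family.
 * NULL means "the stack the system provides". */
struct process_model {
    const char *default_stack[FAM_COUNT];
};

/* Returns the pathname of the stack selected for this call (NULL = system stack). */
const char *model_msocket(struct process_model *p, const char *path,
                          int domain, int type)
{
    assert(p != NULL && domain >= 0 && domain < FAM_COUNT);
    if (type == TYPE_DEFAULT) {
        if (domain == FAM_UNSPEC) {
            /* PF_UNSPEC: the new default applies to every protocol family */
            for (int d = 0; d < FAM_COUNT; d++)
                p->default_stack[d] = path;
        } else {
            p->default_stack[domain] = path;
        }
        return path;
    }
    /* An explicit path wins; otherwise fall back to the family's default. */
    return path != NULL ? path : p->default_stack[domain];
}

/* Backward compatibility, as on the slide: socket() is msocket() with a NULL path. */
#define model_socket(p, d, t) model_msocket((p), NULL, (d), (t))
```

This mirrors what the mstack command on the following slide relies on: it sets the default stack once, and an unmodified application that only calls socket() then ends up on that stack.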
  52. 52. Mstack: backward compatibility● Mstack uses msocket: it defines the default stack so that existing applications can use different stacks.$ ip addr..... ip addr on default net$ mstack /dev/net/ipstack2 ip addr.... ip addr of “ipstack2”$ mstack /dev/net/newstack firefox.... firefox works on newstack$ mstack /dev/net/otherstack bash$ ...this new bash works on otherstack Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
  53. 53. Msockets: implementation● Msockets API is currently supported by lwipv6 and by view-os.● It is a natural extension, backwards compatible for the Berkeley sockets.● Many application would benefit from this extension (e.g. networkless user accounts).● We are studying kernel support for msockets. Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
  54. 54. Berkeley sockets API: problem #2● Berkeley Sockets API does provide support for IPC (AF_UNIX)● Berkeley Sockets API does not provide support for multicast IPC● Berkeley Sockets is mainly for point-to-point, client-server communication IP multicast, Ethernet broadcast provided by “magic” addresses.● Many applications need multicast IPC (dbus, vde_switch, midi-patchbay, mpeg-ts demultiplexing...) Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
  55. IPN: Inter Process Networking
● IPN is for IPC (like AF_UNIX).
● IPN provides fast, kernel-implemented multicast communication among processes.
[Figure: two sender/dispatcher/receivers diagrams, one for an AF_UNIX based multicasting service (dbus, vde_switch, tee, ...) and one for AF_IPN with a policy submodule.]
  56. IPN: implementation
● A new address family: AF_IPN.
● Policies can be provided as submodules:
  – IPN_BROADCAST (default): each message is delivered to all the members but the sender
  – IPN_VDESWITCH: a virtual Ethernet switch
  – IPN_MPEGTS: MPEG transport stream demultiplexing
● Two services (sockopt selectable):
  – LOSSLESS: bounded buffer approach, late receivers delay senders
  – LOSSY: late receivers lose data.
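The difference between the LOSSLESS and LOSSY services can be sketched with a bounded per-receiver queue. This is a toy user-space model, not the IPN kernel code: the type and function names are hypothetical, the queue size is arbitrary, and dropping the newest message in lossy mode is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 4

enum delivery_mode { MODE_LOSSLESS, MODE_LOSSY };

/* Bounded buffer holding the messages pending for one receiver. */
struct rx_queue {
    int slots[QUEUE_SLOTS];
    size_t head;            /* index of the oldest pending message */
    size_t count;           /* number of pending messages */
    unsigned long dropped;  /* messages lost in lossy mode */
};

/* Try to queue msg for a receiver.
 * Returns true when the sender may go on; false means the queue is
 * full and a LOSSLESS sender has to wait for the late receiver. */
static bool rx_deliver(struct rx_queue *q, enum delivery_mode mode, int msg)
{
    if (q->count == QUEUE_SLOTS) {
        if (mode == MODE_LOSSLESS)
            return false;   /* late receiver delays the sender */
        q->dropped++;       /* late receiver loses data */
        return true;
    }
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = msg;
    q->count++;
    return true;
}

/* Take the oldest pending message; returns false when none is pending. */
static bool rx_receive(struct rx_queue *q, int *msg)
{
    if (q->count == 0)
        return false;
    *msg = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}
```

In the lossless case the false return value is where a real implementation would block the sender until the receiver catches up.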
  57. IPN: direct support for multicast
● BIND = define and get administration access to the socket
  – “x” permission required
● CONNECT = join the flow of data
  – “r” and “w” mean permission to receive or send
struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST); /* or a different policy */
int err=bind(s,(struct sockaddr *)&sun,sizeof(sun)); /* define the socket */
err=connect(s,NULL,0); /* join the flow of data */
  58. Why IPN instead of...
● AF_UNIX?
  – Point-to-point, hub process needed, slow!
● IP_MULTICAST ttl=0?
  – No access control, slow!
● AF_NETLINK?
  – No access control, designed for interface/filtering configuration.
  59. IPN is fast (time for 1M msgs, 64B per msg)
  60. IPN is fast (time for 1M msgs, 1024B per msg)
  61. IPN is fast (time for 1M msgs, 16 receivers)
  62. IPN communication models
● Code examples here: http://wiki.virtualsquare.org/index.php/IPN_examples
● Peer-to-peer
  – All the member processes are senders and receivers (e.g. vde)
● Publish-subscribe
  – A process broadcasts messages and client processes can join the IPN socket
  63. IPN extra features
● Out-of-band messages from core IPN and policy submodules
  – e.g. number-of-readers notifications, to stop subscriberless services
● Networking interfaces: TAP + GRAB
  – TAP: a new virtual interface is defined and connected to an IPN socket (in kernel-land)
  – GRAB: an existing networking interface gets connected to an IPN socket (in kernel-land)
● Char-device interface
  – Define a character device connected to an IPN socket
  64. VDETELWEB
● It is the Web/Telnet server for VDE switch configuration.
● It uses the LWIPv6 library.
● It has two connections to the controlled VDE switch:
  – management socket, to give commands
  – port0: the Ethernet port used by the TCP/IP stack.
● It reads the set of commands, descriptions and arguments from the switch itself.
● Telnet has history/command editing and support for asynchronous debug output (NEW).
  65. VDETELWEB: telnet
  66. VDETELWEB: Web Interface
  67. View-OS ... a process with a view
Each process should be permitted to have its own view of the execution environment.
  68. View Components
● filesystem namespace, including the related ownership and permission information,
● networking configuration,
● system name,
● current time,
● devices, etc.
● ...
  69. Global View Assumption
● In general, processes running on the same computer share the same view.
  – A given pathname refers to the same file for all processes.
  – All processes use one shared TCP/IP stack for networking, hence all processes share the same set of IP addresses and routing policies.
  – All processes share the same notion as to which users/processes have special privileges.
  70. View-OS vs. VMs and Containers
                         View-OS   VM               Container
Memory Impact            LOW       HIGH             LOW
Running State            User      User or Kernel   Kernel
Administered by:         user      user             root
Partial Virtualization   Yes       No               Yes (sharing)
  71. Partial Virtualization
● Virtualize just what you need:
  – Virtual and real file systems, devices, networks, etc. co-exist in the process view
● Support for nested virtualization:
  – e.g. virtual file system defined on virtual devices.
  72. How to start a View-OS monitor:
user@host:~$ umview bash
This kernel supports: PTRACE_MULTI PTRACE_SYSVM ppoll
View-OS will use: PTRACE_MULTI PTRACE_SYSVM ppoll
pure_libc library found: syscall tracing allowed
rd235 2.6.29-utrace GNU/Linux/View-OS 10585 0
user@host[10585:0]:~$
● Umview runs on vanilla Linux kernels; Kmview requires a kernel module to be loaded (and utrace).
● Instead of bash, one may run one's favorite executable (e.g. xterm, script, ...).
  73. View-OS modules
● The View-OS monitor loads only the virtualities requested by the user:
  – Umfuse: file system virtualization
  – Umnet: networking virtualization
  – Umdev: device virtualization
  – Umbinfmt: executable interpreter virtualization
  – Viewfs: file system patchworking
  – Ummisc: time, system id...
  74. #1: Virtual Installation of Software
$ um_add_service viewfs
$ mkdir /tmp/newroot
$ viewsu
# mount -t viewfs -o mincow,except=/tmp,vstat /tmp/newroot /
# apt-get install mynewsoftware
● Create an empty dir.
● Mount it in “minimal copy on write” mode:
  – File mods are on the real file system when allowed.
  – Mods are stored in the mounted dir otherwise.
  – A single consistent view.
  – Vstat: virtualize stat (support for virtual chown, chmod/setuid, special files)
  75. #2: Virtual Networking
$ um_add_service umnet
$ mount -t umnetlwipv6 none /dev/net/default
$ ip link set vd0 up
$ ip addr add 10.1.2.3/24 dev vd0
$ ip addr
1: lo0: <LOOPBACK,UP> mtu 0
    link/loopback
    inet6 ::1/128 scope host
    inet 127.0.0.1/8 scope host
2: vd0: <BROADCAST,UP> mtu 1500
    link/ether 02:02:5a:44:e2:06 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::2:5aff:fe44:e206/64 scope link
    inet 10.1.2.3/24 scope global
● A network stack can be “mounted”.
● /dev/net/default is the default stack, but View-OS supports multiple stacks.
  76. #2: Virtual Networking
$ um_add_service umnet
$ mount -t umnetlwipv6 -o tn0=tunx none /dev/lwip0
$ mount -t umnetlwipv6 -o tp0=tapx,vd0=/tmp/switch none /dev/lwip1
$ mstack /dev/lwip0 ip addr
1: lo0: <LOOPBACK,UP> mtu 0
    link/loopback
    inet6 ::1/128 scope host
    inet 127.0.0.1/8 scope host
2: tn0: <> mtu 0
    link/generic
$ mstack /dev/lwip1 ip addr
1: lo0: <LOOPBACK,UP> mtu 0
    link/loopback
    inet6 ::1/128 scope host
    inet 127.0.0.1/8 scope host
2: vd0: <BROADCAST> mtu 1500
    link/ether 02:02:47:98:ad:06 brd ff:ff:ff:ff:ff:ff
3: tp0: <BROADCAST> mtu 1500
    link/ether 02:02:03:04:05:06 brd ff:ff:ff:ff:ff:ff
$
  77. #3: Mount a Filesystem
$ um_add_service umfuse
$ mount -t umfuseext2 -o ro ext2filesystemimage /mnt
$ mount -t umfusestrangefilesystem strangeimage /mnt2
● Source compatible with FUSE.
● Mount file systems unsupported by the kernel.
● Safe mount, limited to this View.
  78. #4: Filesystem Image partition and mount
● Step 1: Load the umdev (virtual device) module and mount an empty file as a disk image.
$ um_add_service umdev
$ viewsu
# dd of=/tmp/diskimage bs=1024 count=0 seek=1024000
# mount -t umdevmbr /tmp/diskimage /dev/hda
  79. #4: Filesystem Image partition and mount
● Step 2: partition the file system image:
# fdisk /dev/hda
Device contains a valid partition table
Building a new DOS disklabel with disk identifier 0xd403417d.
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-127, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-127, default 127): 127
Command (m for help): p
Disk /dev/hda: 1048 MB, 1048576000 bytes
255 heads, 63 sectors/track, 127 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xd403417d
   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1         127     1020096   83  Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
  80. #4: Filesystem Image partition and mount
● Step 3: Create the filesystem
# mkfs.ext2 /dev/hda1
mke2fs 1.41.8 (20-Jul-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
63872 inodes, 255024 blocks
12751 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=264241152
8 block groups
32768 blocks per group, 32768 fragments per group
7984 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376
Writing inode tables: done
Writing superblocks and filesystem accounting information: do
This filesystem will be automatically checked every 38 mounts
180 days, whichever comes first. Use tune2fs -c or -i to over
  81. #4: Filesystem Image partition and mount
● Step 4: mount the new partition
# um_add_service umfuse
# mount -t umfuseext2 -o rw+ /dev/hda1 /mnt
# ls -l /mnt
total 16
drwx------ 2 root root 16384 2009-09-16 11:57 lost+found
● Example of nested virtualization.
● Compatible with standard sys-admin commands.
  82. #5: User Mode chroot
● Step 1: create the jail filesystem
$ mkdir /tmp/root /tmp/root/bin /tmp/root/lib
$ cp /bin/busybox /tmp/root/bin
$ cp /lib/libm-2.9.so /lib/libc-2.9.so /tmp/root/lib
$ cd /tmp/root/lib
$ ln -s libm-2.9.so libm.so.6
$ ln -s libc-2.9.so libc.so.6
$ cd /
[Figure: directory tree of /tmp/root: bin (busybox) and lib (libc-2.9.so, libc.so.6, libm-2.9.so, libm.so.6).]
  83. #5: User Mode chroot
● Step 2: change the file system root:
  – Core mode: by the virtual chroot system call
$ exec /usr/sbin/chroot /tmp/root /bin/busybox sh
BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ $
  – By Viewfs:
$ um_add_service viewfs
$ exec busybox sh
BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ $ mount -t viewfs -o move,permanent /tmp/root /
/ $
  84. #5: User Mode chroot
● Step 3: the process is in the jail:
/ $ ls -lR /
/:
drwxr-xr-x 2 1000   1000    4096 Sep 17 13:37 bin
drwxr-xr-x 2 1000   1000    4096 Sep 17 13:37 lib
/bin:
-rwxr-xr-x   1 1000 1000  401216 Sep 17 13:37 busybox
/lib:
-rwxr-xr-x   1 1000 1000 1302732 Sep 17 13:37 libc-2.9.so
lrwxrwxrwx   1 1000 1000      11 Sep 17 13:37 libc.so.6 -> libc-2.9.so
-rw-r--r--   1 1000 1000  149328 Sep 17 13:37 libm-2.9.so
lrwxrwxrwx   1 1000 1000      11 Sep 17 13:37 libm.so.6 -> libm-2.9.so
/ $
  85. #6: Create a Ramdisk and use it:
$ um_add_service umdev
$ um_add_service umfuse
$ um_add_service umproc
$ mount -t umdevramdisk -o size=100M none /dev/hdx
$ /sbin/mkfs.vfat /dev/hdx
mkfs.vfat 3.0.3 (18 May 2009)
$ mount -t umfusefat -o rw+ /dev/hdx /mnt
$ mount
rootfs on / type rootfs (rw)
/dev/root on / type ext3 (rw,errors=remount-ro,data=ordered)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=755)
... ...
none on /proc/mounts type proc (ro)
none on /dev/hdx type umdevramdisk (size=100M)
/dev/hdx on /mnt type umfusefat (rw+)
$
● Another example of nested virtualization.
● Umproc virtualizes /proc/mounts.
  86. #7: Virtualize running processes
● Shell #1 (pid 12345, an ordinary shell)
sh1 $ mkdir /tmp/mnt
sh1 $ ls /tmp/mnt
sh1 $
● Shell #2 (running under ViewOS)
sh2 $ um_add_service umfuse
sh2 $ mount -t ext2 /tmp/linux.img /tmp/mnt
sh2 $ ls /tmp/mnt
bin  boot dev etc lib lost+found mnt    proc sbin tmp usr
sh2 $ um_attach 12345
● Shell #1 has been “attached” to ViewOS
sh1 $ ls /tmp/mnt
bin  boot dev etc lib lost+found mnt    proc sbin tmp usr
  87. #8: Process proper time
● Start two xclocks, one from a standard shell, the other from a shell running ViewOS.
sh1 $ xclock -update 1 &
sh2 $ xclock -update 1 &
sh2 $ um_add_service ummisc
sh2 $ mount -t ummisctime none /tmp/mnt
● Now change the frequency of the virtual time for ViewOS:
sh2 $ echo 2 > /tmp/mnt/frequency
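One plausible way to read the frequency setting is as a scale factor between real and virtual elapsed time. The sketch below works through that arithmetic; it is an assumption about the model, not code from ummisctime, and the struct and function names are hypothetical.

```c
/* A virtual clock that advances `frequency` virtual seconds
 * for every real second. */
struct virtual_clock {
    double real_origin;     /* real time when the frequency was last set */
    double virtual_origin;  /* virtual time at that same instant */
    double frequency;       /* virtual seconds per real second */
};

/* Virtual time corresponding to the real time real_now. */
static double virtual_now(const struct virtual_clock *c, double real_now)
{
    return c->virtual_origin + (real_now - c->real_origin) * c->frequency;
}

/* Change the frequency without making the virtual time jump:
 * the current virtual time becomes the new origin. */
static void set_frequency(struct virtual_clock *c, double real_now,
                          double frequency)
{
    c->virtual_origin = virtual_now(c, real_now);
    c->real_origin = real_now;
    c->frequency = frequency;
}
```

Under this model, writing 2 to the frequency file would make the xclock started from sh2 advance two seconds for each second shown by the clock started from sh1.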
  88. Behind the Scenes
[Figure: View-OS architecture. Box labels: submodule (x5), module (x3), *mview, global hash table, dispatcher, PCB and fd table mgmt, PURELIBC, process, capture layer, nested capture, ptrace or kmview kernel module (utrace), Linux kernel.]
  89. Modules & submodules
● Modules provide support for classes of virtualizations, e.g.:
  – Umfuse: file systems
  – Umnet: networking
  – Umdev: devices
● Submodules are for specific cases, e.g.:
  – Umfuseext2, Umfusefat
  – Umnetlwipv6, umnetnull
  – Umdevmbr, umdevramdisk
[Figure: View-OS architecture diagram, as on slide 88.]
  90. Modules & submodules
module   description                        submodule        description
umproc   /proc/mounts virtualization
umfuse   User-mode fuse                     umfuseext2       ext2 implementation
                                            umfuseiso9660    iso9660
                                            umfusefat        fat/vfat
                                            umfusentfs3g     ntfs
                                            umfusearchive    tar/cd images (libarchive)
                                            umfuseramfile    single file virtualization
                                            umfusessh        (sshfs) remote file system via ssh
                                            umfuseencfs      (encfs) encrypted file system
umnet    network multi stack support        umnetnull        null stack
                                            umnetlwipv6      IPv4/v6 hybrid stack
                                            umnetlink        move/merge stacks
                                            umnetcurrent     current stack
umdev    device virtualization              umdevmbr         DOS master boot record
                                            umdevnull        null device
                                            umdevramdisk     ramdisk
                                            umdevvd          VDI, VMDK, VHD disks
                                            umdevtab         virtual tuntap
ummisc   system call based virtualization   ummisctime       time virtualization
                                            ummiscuname      uname id virtualization
viewfs   file system patchworking
  91. Capture user process system calls
● umview: based on ptrace
  – Vanilla Linux kernel
  – (patches proposed for performance)
● kmview needs a specific kernel module based on utrace.
  – Security enhancement
  – More complete virtualization support (nested View-OS, strace/gdb, SIGSTOP).
[Figure: View-OS architecture diagram, as on slide 88.]
  92. Global Hash Table
● Keeps track of active virtualizations:
  – Pathname objects
  – File System Types
  – Protocol families
  – Device Major/Minor ranges
  – System call numbers
[Figure: View-OS architecture diagram, as on slide 88.]
  93. Dispatcher
● The Dispatcher uses the global hash table to route each system call to the right module or to the kernel.
[Figure: View-OS architecture diagram, as on slide 88.]
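The routing decision for pathname-based system calls can be sketched as a longest-prefix lookup. This example swaps the global hash table for a plain linear scan to stay short, and the table contents, struct and function names are hypothetical; it only illustrates the "module or kernel" choice, not the View-OS data structures.

```c
#include <stddef.h>
#include <string.h>

/* One active virtualization: a pathname prefix and its owning module. */
struct route {
    const char *prefix;
    const char *module;
};

/* Hypothetical table contents, echoing mounts used in the examples. */
static const struct route routes[] = {
    { "/mnt",             "umfuse" },
    { "/dev/hdx",         "umdev"  },
    { "/dev/net/default", "umnet"  },
};

/* True when path is the prefix itself or lies below it. */
static int path_has_prefix(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);

    if (strncmp(path, prefix, n) != 0)
        return 0;
    return path[n] == '\0' || path[n] == '/';
}

/* Return the module that should handle path, or "kernel" when no
 * virtualization matches. The longest matching prefix wins. */
static const char *dispatch_path(const char *path)
{
    const char *best = "kernel";
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        size_t n = strlen(routes[i].prefix);

        if (n > best_len && path_has_prefix(path, routes[i].prefix)) {
            best = routes[i].module;
            best_len = n;
        }
    }
    return best;
}
```

The same pattern would apply to the other key types the table tracks (file system types, protocol families, device numbers, system call numbers), each with its own matching rule.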
  94. Nested Capture
● View-OS captures (and can virtualize) the system calls generated by modules and submodules.
● Purelibc is a C library providing process self virtualization.
[Figure: View-OS architecture diagram, as on slide 88.]
  95. Desiderata 1: Linux Kernel
New ptrace tags for virtualization support:
  – PTRACE_VM: support for partial virtualization. It is possible to skip the current system call and/or the second upcall after the system call. (User-Mode Linux can use this instead of PTRACE_SYSEMU; VM has a simpler implementation than SYSEMU.)
  – PTRACE_MULTI: process a sequence of ptrace requests, plus PEEK/POKE of large chunks, as a single call. (ptrace exchanges one memory word per call and /proc/{pid}/mem is not writable!)
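The cost that PTRACE_MULTI is meant to avoid is easy to quantify: with one memory word exchanged per request, copying a buffer takes one ptrace call per word. The helper below is a hypothetical illustration of that arithmetic only; it does not call ptrace.

```c
#include <stddef.h>

/* Number of word-sized PEEK (or POKE) requests needed to transfer
 * len bytes when each request moves one machine word (a long). */
static size_t word_requests_needed(size_t len)
{
    const size_t word = sizeof(long);

    return (len + word - 1) / word;  /* round up to whole words */
}
```

For example, on a machine where long is 8 bytes, a 4096-byte buffer takes 512 separate requests, each one a system call, where a chunked transfer would need a single call.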
  96. Desiderata 2: Open Group/POSIX
#include <msocket.h>
int msocket(char *path, int domain, int type, int protocol);
● path is the pathname of the stack.
● domain/type/protocol are the same as defined in socket(2).
● A stack is a special file (a new type of special file, see stat(2)):
  #define S_IFSTACK 0160000
● Each process has a default stack for each protocol family (domain).
  – If path==NULL, msocket uses the default stack.
● It is backwards compatible:
  #define socket(d,t,p) msocket(NULL,(d),(t),(p))
  97. Desiderata 2: Open Group/POSIX
int msocket(char *path, int domain, int type, int protocol);
● If type==SOCK_DEFAULT, msocket sets the default stack, e.g.
  msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
  defines /dev/net/ipstack2 as the default stack for IPv4.
● If type==SOCK_DEFAULT && domain==PF_UNSPEC, msocket sets the default stack for all the protocol families.
● Mstack uses msocket: it defines the default stack so that existing applications can use different stacks.
$ ip addr
..... ip addr on default net
$ mstack /dev/net/newstack firefox
.... firefox works on newstack
$ mstack /dev/net/otherstack bash
$ ...this new bash works on otherstack
  98. Desiderata 3: C library ({e}glibc)
● C libraries are impure: they are C libraries and interfaces to system calls at the same time.
● It is not possible to do self virtualization of system calls for processes using {e}glibc: library calls are internally linked to the system calls (e.g. printf calls write).
● Purelibc is a (ld-preloaded) layer on {e}glibc which converts the C library into a pure library.
● The support for self virtualization should be a feature of mainstream {e}glibc.
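The idea of a "pure" library can be sketched as a library routine that never issues a system call directly but goes through a replaceable handler. This is a toy illustration of the principle, not Purelibc's API: every name below (write_fn, set_write_handler, pure_puts, capture_write) is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Signature shared by the real system call and its replacements. */
typedef ssize_t (*write_fn)(int fd, const void *buf, size_t len);

/* Default handler: forward to the real write(2). */
static ssize_t real_write(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);
}

static write_fn current_write = real_write;

/* Install a handler (NULL restores the real system call).
 * Returns the handler that was active before. */
static write_fn set_write_handler(write_fn fn)
{
    write_fn previous = current_write;

    current_write = fn ? fn : real_write;
    return previous;
}

/* "Pure" library routine: all output goes through current_write,
 * so the process itself can virtualize it. */
static ssize_t pure_puts(int fd, const char *s)
{
    return current_write(fd, s, strlen(s));
}

/* Example replacement: keep the data in memory instead of writing it. */
static char captured[128];
static size_t captured_len;

static ssize_t capture_write(int fd, const void *buf, size_t len)
{
    size_t room = sizeof captured - captured_len;

    (void)fd;               /* the descriptor is ignored by this handler */
    if (len > room)
        len = room;         /* truncate when the capture buffer is full */
    memcpy(captured + captured_len, buf, len);
    captured_len += len;
    return (ssize_t)len;
}
```

After set_write_handler(capture_write), a call such as pure_puts(1, "hello") ends up in the captured buffer rather than on standard output, which is the kind of interception a library hard-wired to the system call cannot offer.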
  99. Desiderata 4: utrace UTRACE_STOP
● Utrace supports multiple tracers (engines) on the same process.
● Utrace sends the notification to all the tracers and then waits for utrace_control(..., UTRACE_RESUME) from each tracer which returned UTRACE_STOP.
● This specification is ill suited for nested virtualization support: a notification function inspects the state (e.g. system call parameters) and may change it. The next tracer must read the state as changed by the previous tracer.
● Kmview uses a semaphore in its system call notification function to stop a process, because this UTRACE_STOP specification is useless.
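Why the notification order matters can be shown with a deterministic caricature of two engines that both rewrite a system call argument. This is not utrace code: the state struct, the engine functions and both notify_* helpers are hypothetical, and the second helper models the problematic case by handing every engine the same original state.

```c
#include <stddef.h>

/* Simplified system call state: a single argument. */
struct syscall_state {
    long arg0;
};

typedef void (*engine_fn)(struct syscall_state *st);

/* Desired behaviour: each engine completes before the next one is
 * notified, so every engine sees the changes made by the previous one. */
static void notify_in_sequence(struct syscall_state *st,
                               const engine_fn *engines, size_t n)
{
    for (size_t i = 0; i < n; i++)
        engines[i](st);
}

/* Modelled problem: all engines are notified with the original state
 * and their results are written back in turn, so earlier changes are
 * overwritten by later ones. */
static void notify_all_then_wait(struct syscall_state *st,
                                 const engine_fn *engines, size_t n)
{
    const struct syscall_state original = *st;

    for (size_t i = 0; i < n; i++) {
        struct syscall_state copy = original;

        engines[i](&copy);
        *st = copy;
    }
}

/* Two example engines that modify the argument. */
static void engine_add_one(struct syscall_state *st) { st->arg0 += 1; }
static void engine_double(struct syscall_state *st)  { st->arg0 *= 2; }
```

Starting from arg0 = 3, the sequential version yields (3 + 1) * 2 = 8, while the second version yields 6 because the doubling engine never saw the increment.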
  100. UTRACE_STOP implementation
[Sequence diagram. Participants: process, utrace, kmview kernel module, outer kmview, inner kmview.
 Steps: syscall request; notify engine #2, notify user space, return UTRACE_STOP; notify engine #1, notify user space, return UTRACE_STOP; mgmt of syscall (RACE CONDITION!); wait (all engines); mgmt of syscall; run syscall.]
  101. Desiderata: new UTRACE_STOP
[Sequence diagram. Participants: process, utrace, kmview kernel module, outer kmview, inner kmview.
 Steps: syscall request; notify engine #2, notify user space, return UTRACE_STOP, wait, mgmt of syscall; notify engine #1, notify user space, return UTRACE_STOP, wait, mgmt of syscall; run syscall.]
  102. Kmview workaround
PTRACE_SYSCALL_{RUN,ABORT} instead of PTRACE_STOP
[Sequence diagram. Participants: process, utrace, kmview kernel module, outer kmview, inner kmview.
 Steps: syscall request; notify engine #2, notify user space, down(sem), mgmt of syscall, up(sem); notify engine #1, notify user space, down(sem), mgmt of syscall, up(sem); run syscall.]
  103. Multiple meanings of safety...
● Availability, confinement of bug effects:
  – ViewOS runs outside the kernel: errors in modules may lead to a crash of the View (not a kernel panic!)
● Self protection (from mistaken commands):
  – The Global View Assumption often forces the use of root access (or powerful capabilities); this is dangerous.
● Sandbox non-circumvention:
  – At first sight it seems that kernel-based sandboxes are safer (e.g. seccomp).
    ● Kernel-based sandboxes are not flexible
    ● On/Off security: a bug may compromise the whole system
    ● Good support for VMs can preserve safety
● The more code, the worse the security. Is the kernel “too fat”?
  – Maintenance problems, side effects, etc.
  – ViewOS can move services outside the kernel.
  104. The missing ring...
● View-OS modules are similar to microkernel servers.
● View-OS captures some of the benefits of microkernels (separation of mechanism and policy, flexibility, reliability).
● View-OS allows microkernel services to be implemented (at user level) on monolithic kernels.
  105. VirtualSquare
● VDE
● LWIPv6
● PureLibC
● IPN
● View-OS
  – Umview/Kmview
Questions?
