Virtualsquare: all the virtuality you always wanted but never dared to ask for

Presentation by Renzo Davoli, given at Codemotion, Rome, 5 March 2011

Transcript

  • 1. VIRTUALSQUARE: all the virtuality you wanted but you were afraid to ask. Rome, March 5th 2011. Renzo Davoli, Università di Bologna (Master in Scienze e Tecnologie del Software Libero) (Associazione per il Software Libero). Renzo Davoli – renzo@cs.unibo.it - Università di Bologna
  • 2. Virtual...: Time, Execution Environment, user-id, Device, Machine, Networking, Memory, File System
  • 3. LXC, schroot, libvirt, chroot, GXemul, Libguestfs, Bochs, PearPC, Qemu, tinc, VDE, fuse-ext2, Open-VZ, Marionnet, User-mode Linux, UnionFS, JVM, Umview, PureLibc, VirtualBOX, FUSE, FairVPN, SPICE, LWIPv6, fakeroot, VirtualBricks, View-OS, fakeroot-ng, KVM
  • 4. What does Virtualization mean?
  • 5. Diagram: a user of A reaches the well-known service A through interface I(A); a new service A' exposes interface I(A'). If I(A') = I(A), A' can be used instead of A.
  • 6. What is “real”?
    ● The service A can be hardware
      – Machine
      – Memory
      – Network
    ● The service A can be software
      – File system
      – Execution Environment
      – Identity
  • 7. Why virtual?
    ● Flexibility
      – prototyping
      – modify features at run-time (“real” may be hard to modify: e.g. hardware or kernel code).
      – Satisfy several requirements while sharing common structures
    ● Safety
      – least privilege
      – sandboxing
    ● Optimization
      – Server/service consolidation
      – No need to maintain several “real” items
  • 8. Simulation-Emulation-Virtuality
    ● Simulation:
      – Provides just a model of the phenomena to study
      – Simulation never provides virtualization
    ● Emulation:
      – It means: behave in the same way.
      – Can provide virtualization if usable (e.g. it is usable)
    ● Virtuality can be provided without emulation
      – e.g. LXC
  • 9. Virtual Machines (Smith-Nair)
  • 10. Virtual Machines
    ● Intrusiveness:
      – User-mode user-access
      – User-mode superuser-access
      – Kernel patch/module
      – Native
    ● Paravirtualization:
      – Change the user interface to optimize virtualization
  • 11. Virtual Machines (diagram labels: User, Virtualization, Service)
  • 12. Virtual Machine (multiplexing) (diagram: several users on top of a virtually multiplexed service, which sits on a single service)
  • 13. Virtual Machines:
    ● Qemu-system: system-vm, processor emulation (by direct code translation), user-space/user-permission
    ● GXEmul/PearPC/Mac-on-Linux: system-vm, processor emulation, user-space/user-permission
  • 14. Virtual Machines
    ● KVM: system-vm, same instruction set, may provide paravirtualization (virtio/vhost-net), user-mode monitor, requires processor extensions and kernel module (Linux specific/optimized).
    ● VirtualBOX (OSE): system-vm, same instruction set, may provide paravirtualization, user-mode monitor, requires processor extensions and kernel module (it runs on several operating systems).
    ● XEN: system-vm, same instruction set, provides paravirtualization, native mode, multiplexing.
  • 15. Operating system level VM
    ● LXC: Linux Containers/Namespaces: native mode, multiplexing, user-mode superuser-access (it provides partial virtualization/sharing between containers).
    ● OpenVZ: native mode, multiplexing, superuser-access.
  • 16. Process VM
    ● User Mode Linux: a Linux kernel is the VM monitor for processes. Process VM. User-mode, user-access. (Almost) the same interface (same system call set, though the kernel version may differ).
    ● View-OS (umview/kmview): partial-modular virtualization. Process VM. User-mode, user-access. (Almost) the same interface (users may define new system calls).
  • 17. Process VM
    ● Qemu: process emulation (direct translation), user-mode, user-access.
    ● Application VM (JVM, Mono, ….)
  • 18. Network Virtualization
    ● Virtual Private Networks
      – For secure remote access
    ● Overlay Networks
      – e.g. Akamai, p2p
    ● Networks for Virtual Machines
    ● Kernel bridge based virtual networks
    ● Virtual Distributed Ethernet: data-link layer Ethernet consistent, user-mode, user access.
  • 19. File system virtualization
    ● Chroot/schroot
    ● Fakeroot/fakeroot-ng
    ● FUSE
      – Fuse-ext2
      – Fuse-ssh
      – Fuse-*
  • 20. Modular Virtualization
    ● View-OS: modular partial virtual machine
      – Umview: based on ptrace (user-mode, user access)
      – Kmview: based on utrace/kernel module (user-mode)
    ● Several modules available:
      – File system (mount)
      – File system (patchworking)
      – Device
      – Uname/time...
      – Networking
    ● Chroot/fakeroot/fuse/vpn/binfmt... features have been implemented on View-OS.
  • 21. The VirtualSquare view
  • 22. Virtualsquare
    ● Virtualsquare is:
      – A community....
      – A container of projects....
    ● Virtualsquare is not:
      – A company
      – A brand/product line
    ● Virtualsquare started at the University of Bologna but is now an international community
      – A lot of former students now work abroad
      – Common ideas with other groups (joint projects)
  • 23. The VirtualSquare view
    ● Communication/compatibility
      – different Virtualities must be interconnected, must communicate
    ● Integration
      – different Virtualities can be seen as special cases of a broader idea of Virtuality
    ● Extension
      – if a need cannot yet be captured by a kind of virtuality, let us create a new one (maybe combining existing virtualities).
  • 24. Communication/compatibility
    ● VDE
      – General purpose user-mode networking support
      – Ethernet data-link consistent
      – Distributed
      – Intuitive (it has the same structure as real Ethernet: switches, cables)
    ● Diagram labels (tools connected through VDE): KVM, bochs, VirtualBOX, tuntap, User-mode linux, libvirt, View-OS, LWIPv6
  • 25. Integration
    ● View-OS = each process can have its “view” of the environment.
    ● User-mode/user-permission partial virtual machine approach
    ● The features of several existing tools have been re-implemented as composable modules.
    ● Diagram labels (around View-OS): VPN, fakeroot(ng), (s)chroot, binutils, Virtual networking, FUSE, UnionFS, Virtual devices, LXC user-mode
  • 26. View-OS: GLOBAL VIEW ASSUMPTION
  • 27. Extension
    ● A new forge for several new concepts and ideas/tools:
      – Msocket: support for multi-stack applications
      – LWIPv6: user-space LWIPv4/LWIPv6 hybrid networking (multi)stack as a library
      – Purelibc: process self virtualization
      – Relativistic virtualization of time: emulation of fast machines on slow ones.
      – Virtual spaces per login shells.
      – Public Distributed Ethernets
      – Run-time on-the-fly virtualization
  • 29. VirtualSquare: the book
    Renzo Davoli, Michael Goldweber (editors), Virtual Square: Users, Programmers & Developers Guide
    ● Available at lulu books or downloadable from wiki.virtualsquare.org
    ● Warning: this book is dynamically changing as the project evolves
  • 30. ...a closer look at virtualsquare projects...
  • 31. VDE components (diagram labels: VDE switch, switch cross cable, VM (e.g. QEMU), VM (e.g. U-ML), VDE plug, VdeWire (e.g. ssh), TunTap Linux module)
  • 32. VDE: Related Work
    ● VPN: (OpenVPN) point-to-point, for real machines
    ● Overlay Networks: application specific (peer to peer, Akamai).
    ● VM networking: (tools provided with VM, e.g. uml-switch) specific for VM
    VDE:
    ● multipoint, general mesh
    ● no need for root (administration) access
    ● heterogeneous VM and non VM connected
  • 33. VDEv2: advertisement
    ● VDEv2:
      – modular design
      – compatible with user-mode linux, qemu, tuntap, (bochs, plex86), umview/lwipv6
      – through the vdetaplib, potentially compatible with any application using tap
      – VLAN (802.1Q)
      – FST (fast spanning tree)
      – manageable at run time via unixterm (telnet or web with vdetelweb)
      – includes slirpvde and wirefilter
      – status debug
      – plugin support: snmp/iplog/pdump
  • 34. VDEv2
    ● VDE-Switch
      – number of ports configurable on command line
      – port0 is reserved for management clients, n-1 ports are available for connections.
      – management UNIX socket for management clients
        ● self-describing SMTP-like protocol
      – modules: datasock (VM conn), tuntap, consmgmt (management)
  • 35. VDE cables
    ● VDE-plug
      – is a VM that converts the Ethernet packets of a VDE port into a stream connection (stdin-stdout)
    ● VDE-wire
      – can be any application able to give a stdin/stdout stream connection
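Turning discrete Ethernet packets into a byte stream needs some framing so that the other end can find the packet boundaries again. The sketch below assumes a two-byte, high-byte-first length prefix in front of every packet. That framing, and the function names frame_packet and unframe_packet, are assumptions made for illustration and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed framing: [len_hi][len_lo][len bytes of Ethernet packet]. */

/* Write one framed packet into out; returns the number of bytes written. */
size_t frame_packet(const uint8_t *pkt, uint16_t len, uint8_t *out)
{
    assert(pkt != NULL && out != NULL);
    out[0] = (uint8_t)(len >> 8);
    out[1] = (uint8_t)(len & 0xff);
    memcpy(out + 2, pkt, len);
    return (size_t)len + 2;
}

/* Try to extract one packet from the first `avail` bytes of the stream.
 * Returns the number of bytes consumed, or 0 if the packet is incomplete. */
size_t unframe_packet(const uint8_t *stream, size_t avail,
                      uint8_t *pkt, uint16_t *len)
{
    assert(stream != NULL && pkt != NULL && len != NULL);
    if (avail < 2)
        return 0;                         /* length prefix not here yet */
    uint16_t n = (uint16_t)((stream[0] << 8) | stream[1]);
    if (avail < (size_t)n + 2)
        return 0;                         /* payload still in flight */
    memcpy(pkt, stream + 2, n);
    *len = n;
    return (size_t)n + 2;
}
```

Because the receiver only ever sees a stream, unframe_packet has to cope with partial reads, which is why it reports 0 until a whole packet is available.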
  • 36. Dual Pipe
    ● dpipe is a new (general purpose) command we have added.
    ● Pipes are a well-known abstraction. The following command prints the listing of the current directory (the output of ls feeds lpr):
      ls | lpr
    ● dpipe creates a bidirectional connection between the processes (the output of each command feeds the input of the other):
      dpipe cmd1 = cmd2
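The idea of a dual pipe can be sketched with two ordinary pipes: one carries cmd1's standard output to cmd2's standard input, the other carries cmd2's output back to cmd1. This is a minimal illustration written for this transcript, not the dpipe source; the names dpipe_sketch and spawn_with_stdio are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child with stdin = in_fd and stdout = out_fd, then exec argv. */
static pid_t spawn_with_stdio(char *const argv[], int in_fd, int out_fd,
                              const int fds[4])
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                 /* parent, or fork() failure (-1) */
    dup2(in_fd, STDIN_FILENO);
    dup2(out_fd, STDOUT_FILENO);
    for (int i = 0; i < 4; i++)
        if (fds[i] > STDERR_FILENO)
            close(fds[i]);          /* originals are no longer needed */
    execvp(argv[0], argv);
    _exit(127);                     /* exec failed */
}

/* Run cmd1 and cmd2 cross-connected; returns 0 if both exit with status 0,
 * 1 if either fails, -1 if the pipes cannot be created. */
int dpipe_sketch(char *const cmd1[], char *const cmd2[])
{
    assert(cmd1 && cmd1[0] && cmd2 && cmd2[0]);
    int one_to_two[2], two_to_one[2];
    if (pipe(one_to_two) < 0)
        return -1;
    if (pipe(two_to_one) < 0) {
        close(one_to_two[0]);
        close(one_to_two[1]);
        return -1;
    }
    const int fds[4] = { one_to_two[0], one_to_two[1],
                         two_to_one[0], two_to_one[1] };
    pid_t p1 = spawn_with_stdio(cmd1, two_to_one[0], one_to_two[1], fds);
    pid_t p2 = spawn_with_stdio(cmd2, one_to_two[0], two_to_one[1], fds);
    for (int i = 0; i < 4; i++)
        close(fds[i]);              /* so each child sees EOF when its peer exits */
    int s1 = -1, s2 = -1;
    if (p1 > 0) waitpid(p1, &s1, 0);
    if (p2 > 0) waitpid(p2, &s2, 0);
    int ok1 = p1 > 0 && WIFEXITED(s1) && WEXITSTATUS(s1) == 0;
    int ok2 = p2 > 0 && WIFEXITED(s2) && WEXITSTATUS(s2) == 0;
    return (ok1 && ok2) ? 0 : 1;
}
```

The parent closes all four descriptors after forking; otherwise neither child would ever see end-of-file when the other one terminates.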
  • 37. VDE cables, plugs, wires/dpipe
    ● dpipe is used to create VDE-cables:
      dpipe vde_plug = ssh vde.students.cs.unibo.it vde_plug
    ● This command connects, by a dpipe, the local vde_plug with a vde_plug running on a remote host (the wire is ssh).
    ● Other applications can be used as the wire (e.g. netcat).
    ● In the example vde_plug refers to the default switch. It is possible to run several switches on the same host; an extra option is needed in this case.
  • 38. wirefilter
    ● wirefilter can be put on a cable (e.g. for network testing):
      dpipe vde_plug /tmp/s1 = wirefilter -m /tmp/m = vde_plug /tmp/s2
      wirefilter -v /tmp/s1:/tmp/s2
    ● Packet loss, delays, dup, speed, noise figures, mtu, fifoness properties of the line can be changed with command line options or in real time via a management socket.
    ● It is possible to define several “states”. The state transition is driven by a Markov chain.
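A Markov-chain-driven line model can be pictured as a small table of states, each with its own line properties, plus a transition matrix. The sketch below is a generic two-state illustration and not wirefilter's code or configuration syntax; the state meanings, probabilities, and function names are all made up for the example.

```c
#include <assert.h>

#define NSTATES 2   /* hypothetical: 0 = "good" line, 1 = "bad" line */

/* Packet loss probability while in each state (illustrative values). */
static const double loss_prob[NSTATES] = { 0.01, 0.30 };

/* trans[i][j] = probability of moving from state i to state j. */
static const double trans[NSTATES][NSTATES] = {
    { 0.95, 0.05 },
    { 0.20, 0.80 },
};

/* Choose the next state from a uniform random draw u in [0, 1). */
int next_state(int state, double u)
{
    assert(state >= 0 && state < NSTATES);
    double acc = 0.0;
    for (int j = 0; j < NSTATES; j++) {
        acc += trans[state][j];
        if (u < acc)
            return j;
    }
    return NSTATES - 1;     /* guard against rounding in the row sum */
}

/* Decide from a second draw whether a packet is lost in this state. */
int packet_lost(int state, double u)
{
    assert(state >= 0 && state < NSTATES);
    return u < loss_prob[state];
}
```

Passing the random draw in as a parameter keeps the functions deterministic, which makes the model easy to check; a caller would supply values from its own random number generator.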
  • 39. SlirpVDE (diagram labels: two VDE switches joined by VDE plug, VdeWire (e.g. ssh), VDE plug; VM (e.g. QEMU) at 10.0.2.15; VM (e.g. U-ML) at 10.0.2.16; SlirpVDE at 10.0.2.2)
    ● Note: slirp supports IPv4; slirpv6 supports both IPv4 and IPv6
    ● Firefox: http connection from slirpVDE running on the hosting O.S. to ...
  • 40. vde_cryptcab
    ● Coded by Daniele Lacamera (danielinux)
    ● vde_cryptcab is a distributed cable manager for VDE switches.
    ● Server side:
      vde_cryptcab -s /tmp/vde2.ctl -p 2100
    ● Client side:
      vde_cryptcab -s /tmp/vde2.ctl -c foo@remote.machine.org:2100
    ● Uses a blowfish channel (random key exchanged by scp).
  • 41. Marionnet (based on VDE)
    ● A project by Jean-Vincent Loddo and Luca Saiu (et al.), Université Paris 13.
  • 42. TINC
    ● tinc is a Virtual Private Network (VPN) daemon that uses tunnelling and encryption to create a secure private network between hosts on the Internet.
    ● Encryption, authentication and compression
    ● Automatic full mesh routing
    ● Easily expand your VPN
    ● Ability to bridge ethernet segments
    ● Runs on many operating systems and supports IPv6
    ● A project by Ivo Timmermans and Guus Sliepen
  • 43. LWIPv6
    ● It is a LWIPv4/v6 (multi) stack implemented as a library.
    ● Forked from the LWIP project (Adam Dunkels <adam@sics.se>)
    ● Can be connected to any number of VDE, TUN, TAP interfaces.
    ● It is a hybrid stack (not a dual-stack). One single IPv6 “engine” is also able to manage IPv4 packets in compatibility mode (130.136.1.110 is managed as 0::ffff:130.136.1.110).
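The compatibility-mode address on this slide is the standard IPv4-mapped IPv6 form: ten zero bytes, two 0xff bytes, then the four IPv4 bytes. The sketch below builds it with the ordinary POSIX calls; it says nothing about how LWIPv6 stores addresses internally, and the helper name ipv4_to_mapped is hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Convert dotted-quad text into the IPv4-mapped IPv6 address ::ffff:a.b.c.d.
 * Returns 1 on success, 0 if v4_text is not a valid IPv4 address. */
int ipv4_to_mapped(const char *v4_text, struct in6_addr *out)
{
    assert(v4_text != NULL && out != NULL);
    struct in_addr v4;
    if (inet_pton(AF_INET, v4_text, &v4) != 1)
        return 0;
    memset(out, 0, sizeof *out);          /* bytes 0..9 stay zero */
    out->s6_addr[10] = 0xff;
    out->s6_addr[11] = 0xff;
    memcpy(&out->s6_addr[12], &v4, 4);    /* already in network byte order */
    return 1;
}
```

With this representation a single IPv6 code path can tell IPv4 traffic apart by testing the prefix, for example with the standard IN6_IS_ADDR_V4MAPPED macro.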
  • 44. LWIPv6
    ● PF_INET, PF_INET6
    ● PF_PACKET for raw packet management
      – support for user-level network analysis tools (e.g. sniffers, ethereal)
      – support for user-level dhcp clients.
    ● PF_NETLINK for configuration
    ● Packet filtering
    ● NEW: dhcp client/server, rarpd, slirp, routing, nat on request
  • 45. LWIPv6 interface definition API
      struct stack *lwip_stack_new(void);
      void lwip_stack_free(struct stack *stack);
      struct stack *lwip_stack_get(void);
      void lwip_stack_set(struct stack *stack);
      struct netif *lwip_vdeif_add(struct stack *stack, void *arg);
      struct netif *lwip_tapif_add(struct stack *stack, void *arg);
      struct netif *lwip_tunif_add(struct stack *stack, void *arg);
      int lwip_add_addr(struct netif *netif, struct ip_addr *ipaddr, struct ip_addr *netmask);
      int lwip_del_addr(struct netif *netif, struct ip_addr *ipaddr, struct ip_addr *netmask);
      int lwip_add_route(struct stack *stack, struct ip_addr *addr, struct ip_addr *netmask, struct ip_addr *nexthop, struct netif *netif, int flags);
      int lwip_del_route(struct stack *stack, struct ip_addr *addr, struct ip_addr *netmask, struct ip_addr *nexthop, struct netif *netif, int flags);
      int lwip_ifup(struct netif *netif);
      int lwip_ifdown(struct netif *netif);
  • 46. LWIPv6 socket API (just add lwip_ prefix)
      int lwip_msocket(struct stack *stack, int domain, int type, int protocol);
      int lwip_socket(int domain, int type, int protocol);
      int lwip_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
      int lwip_bind(int s, struct sockaddr *name, socklen_t namelen);
      int lwip_shutdown(int s, int how);
      int lwip_getpeername(int s, struct sockaddr *name, socklen_t *namelen);
      int lwip_getsockname(int s, struct sockaddr *name, socklen_t *namelen);
      int lwip_getsockopt(int s, int level, int optname, void *optval, socklen_t *optlen);
      int lwip_setsockopt(int s, int level, int optname, const void *optval, socklen_t optlen);
      int lwip_close(int s);
      int lwip_connect(int s, struct sockaddr *name, socklen_t namelen);
      int lwip_listen(int s, int backlog);
      int lwip_recv(int s, void *mem, int len, unsigned int flags);
      int lwip_read(int s, void *mem, int len);
      int lwip_recvfrom(int s, void *mem, int len, unsigned int flags, struct sockaddr *from, socklen_t *fromlen);
      int lwip_send(int s, void *dataptr, int size, unsigned int flags);
      int lwip_sendto(int s, void *dataptr, int size, unsigned int flags, struct sockaddr *to, socklen_t tolen);
      int lwip_write(int s, void *dataptr, int size);
      int lwip_select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, struct timeval *timeout);
      int lwip_ioctl(int s, long cmd, void *argp);
  • 47. LWIPv6: New features
    ● Packet forwarding
    ● Filtering
    ● NAT
    ● DHCP server/RADV server onboard
    ● SLIRP (v4 and v6)
      struct netif *lwip_add_slirpif(struct stack *stack, void *arg, int flags);
  • 48. slirpvde6
    ● Extension of slirpvde based on LWIPv6
      – slirp ipv4/ipv6
      – Stateless translator
      – Dhcp/radv server
      – DNS forwarder
      – Port and X forwarding (in and out)
      slirpvde6 -d -H10.0.2.1/24 -H2001::1/64 -s /tmp/vde.ctl -dhcp -r
  • 49. Berkeley sockets API: problem #1
    ● The Berkeley Sockets API has been designed for one protocol stack (per protocol family).
      – Multiple stacks => different networking features (per user, per application...)
    ● Unix uses the file system as a naming space for everything (devices, kernel variables, ...) except for networking.
      – Access control to networking
  • 50. Solution #1: msockets
      #include <msocket.h>
      int msocket(char *path, int domain, int type, int protocol);
    ● path is the pathname of the stack
    ● domain/type/protocol are the same as defined in socket(2).
    ● A stack is a special file (new type of special file, see stat(2)):
      #define S_IFSTACK 0160000
    ● Each process has a default stack for each protocol family (domain).
      – If path==NULL, msocket uses the default stack.
    ● It is backwards compatible:
      #define socket(d,t,p) msocket(NULL,(d),(t),(p))
  • 51. Msockets: set the default stack
      int msocket(char *path, int domain, int type, int protocol);
    ● If type==SOCK_DEFAULT, msocket sets the default stack, e.g.
      msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
      defines /dev/net/ipstack2 as the default stack for IPv4.
    ● If type==SOCK_DEFAULT && domain==PF_UNSPEC, msocket sets the default stack for all the protocol families.
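The default-stack rules can be mimicked in user space with a small per-family table. The mock below only illustrates the bookkeeping described on the slides; it is not the msocket implementation. SOCK_DEFAULT_SKETCH is a placeholder constant (the real SOCK_DEFAULT value is not given here), and because the mock has no alternative stacks to talk to, any request that resolves to a named stack fails with ENOTSUP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

#define SOCK_DEFAULT_SKETCH 0x7fff  /* placeholder, not the real SOCK_DEFAULT */
#define MAX_FAMILY 64               /* table size chosen for the illustration */

/* Default stack path per protocol family; NULL means the kernel's own stack. */
static const char *default_stack[MAX_FAMILY];

const char *default_stack_of(int domain)
{
    return (domain > 0 && domain < MAX_FAMILY) ? default_stack[domain] : NULL;
}

int msocket_sketch(const char *path, int domain, int type, int protocol)
{
    if (type == SOCK_DEFAULT_SKETCH) {
        if (domain == PF_UNSPEC) {          /* set the default for every family */
            for (int d = 1; d < MAX_FAMILY; d++)
                default_stack[d] = path;
            return 0;
        }
        if (domain < 0 || domain >= MAX_FAMILY) {
            errno = EAFNOSUPPORT;
            return -1;
        }
        default_stack[domain] = path;
        return 0;
    }
    if (path == NULL)
        path = default_stack_of(domain);    /* fall back to the default stack */
    if (path == NULL)
        return socket(domain, type, protocol);
    errno = ENOTSUP;                        /* the mock has no real extra stacks */
    return -1;
}
```

With no default set and path==NULL the call degenerates to plain socket(2), which is the backward-compatibility property the slides rely on.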
  • 52. Mstack: backward compatibility
    ● Mstack uses msocket: it defines the default stack so that existing applications can use different stacks.
      $ ip addr
      ..... ip addr on default net
      $ mstack /dev/net/ipstack2 ip addr
      .... ip addr of “ipstack2”
      $ mstack /dev/net/newstack firefox
      .... firefox works on newstack
      $ mstack /dev/net/otherstack bash
      $ ...this new bash works on otherstack
  • 53. Msockets: implementation
    ● The msockets API is currently supported by lwipv6 and by view-os.
    ● It is a natural, backwards compatible extension of the Berkeley sockets.
    ● Many applications would benefit from this extension (e.g. networkless user accounts).
    ● We are studying kernel support for msockets.
  • 54. Berkeley sockets API: problem #2
    ● Berkeley Sockets API does provide support for IPC (AF_UNIX)
    ● Berkeley Sockets API does not provide support for multicast IPC
    ● Berkeley Sockets is mainly for point-to-point, client-server communication
      – IP multicast, Ethernet broadcast provided by “magic” addresses.
    ● Many applications need multicast IPC (dbus, vde_switch, midi-patchbay, mpeg-ts demultiplexing...)
  • 55. IPN: Inter Process Networking
    ● IPN is for IPC (like AF_UNIX)
    ● IPN provides fast, kernel implemented, multicast communication among processes.
    ● Diagram: a sender, a dispatcher and several receivers, shown twice: once for an AF_UNIX based multicasting service (dbus, vde_switch, tee, ....) and once for AF_IPN, where the dispatching is done by a policy submodule.
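The AF_UNIX style of multicasting service mentioned here needs a dispatcher process that copies every message to each receiver. A minimal sketch of that forwarding step is shown below, using only standard socket calls; it illustrates the user-space hub pattern that IPN is meant to replace, not IPN itself. The function name hub_broadcast and the array-of-descriptors layout are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* hub_fds[i] is the dispatcher's end of the connection to member i.
 * Forward a message that arrived from member `sender` to all other members.
 * Returns the number of members the message was delivered to. */
int hub_broadcast(const int hub_fds[], int nmembers, int sender,
                  const void *msg, size_t len)
{
    assert(hub_fds != NULL && nmembers >= 0);
    int delivered = 0;
    for (int i = 0; i < nmembers; i++) {
        if (i == sender)
            continue;                       /* never echo back to the sender */
        /* Non-blocking send: a receiver whose buffer is full misses the message. */
        if (send(hub_fds[i], msg, len, MSG_DONTWAIT) == (ssize_t)len)
            delivered++;
    }
    return delivered;
}
```

Every message crosses the kernel boundary once on the way into the dispatcher and once more per receiver on the way out, which is the overhead a kernel-side dispatcher avoids.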
  • 56. IPN: implementation
    ● A new address family AF_IPN
    ● Policies can be provided as submodules.
      – IPN_BROADCAST (default): each message is delivered to all the members but the sender
      – IPN_VDESWITCH: a virtual ethernet switch
      – IPN_MPEGTS: mpeg transport stream demultiplexing
    ● Two services (sockopt selectable):
      – LOSSLESS: bounded buffer approach, late receivers delay senders
      – LOSSY: late receivers lose data.
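The difference between the two services shows up only when a receiver's buffer is full: a lossless service makes the sender wait, a lossy one discards the message. The bounded queue below illustrates that decision in plain C; it is a sketch of the idea, not IPN kernel code, and every identifier in it (including the capacity QCAP) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum delivery { LOSSY_SKETCH, LOSSLESS_SKETCH };
enum result { QUEUED, DROPPED, WOULD_BLOCK };

#define QCAP 4   /* arbitrary per-receiver capacity for the illustration */

struct rx_queue {
    int msgs[QCAP];
    size_t head, count;
};

/* Offer a message to a receiver's queue under the given service. */
enum result rx_enqueue(struct rx_queue *q, int msg, enum delivery mode)
{
    assert(q != NULL);
    if (q->count == QCAP)                   /* the receiver is late */
        return mode == LOSSY_SKETCH ? DROPPED : WOULD_BLOCK;
    q->msgs[(q->head + q->count) % QCAP] = msg;
    q->count++;
    return QUEUED;
}

/* Take the oldest message; returns 1 on success, 0 if the queue is empty. */
int rx_dequeue(struct rx_queue *q, int *msg)
{
    assert(q != NULL && msg != NULL);
    if (q->count == 0)
        return 0;
    *msg = q->msgs[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}
```

WOULD_BLOCK stands for the point where a lossless sender would be suspended until the slow receiver drains its queue.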
  • 57. IPN: direct support for multicast
    ● BIND = define and get administration access to the socket
      – “x” permission required
    ● CONNECT = join the flow of data
      – “r” and “w” mean permission to receive or send
      struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
      int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST); /* or a different policy*/
      err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
      err=connect(s,NULL,0);
  • 58. Why IPN instead of...
    ● AF_UNIX?
      – Point-to-point, hub process needed, slow!
    ● IP_MULTICAST ttl=0?
      – No access control, slow!
    ● AF_NETLINK?
      – No access control, designed for interface/filtering configuration.
  • 59. IPN is fast (time for 1M msgs, 64B per msg)
  • 60. IPN is fast (time for 1M msgs, 1024B per msg)
  • 61. IPN is fast (time for 1M msgs, 16 receivers)
  • 62. IPN communication models
    ● Code examples here: http://wiki.virtualsquare.org/index.php/IPN_examples
    ● Peer-to-peer
      – All the member processes are senders and receivers (e.g. vde)
    ● Publish_subscribe
      – A process broadcasts messages and client processes can join the IPN socket
  • 63. IPN extra features
    ● Out-Of-Band messages from core IPN and policy submodules
      – e.g. number of readers notification to stop subscriberless services
    ● Networking interfaces TAP+GRAB
      – TAP: a new virtual interface is defined and connected to an IPN socket (in kernel-land)
      – GRAB: an existing networking interface gets connected to an IPN socket (in kernel-land)
    ● Char-device interface
      – Define a character device connected to an IPN socket
  • 64. VDETELWEB
    ● It is the Web/Telnet server for VDE switch configuration.
    ● It uses the LWIPv6 library
    ● It has two connections to the controlled VDE switch:
      – management socket to give commands
      – port0: the ethernet port used by the TCP-IP stack.
    ● It reads the set of commands, descriptions, arguments from the switch itself.
    ● Telnet has history/command editing and support for asynch debug output (NEW)
  • 65. VDETELWEB: telnet
  • 66. VDETELWEB: Web Interface
  • 67. View-OS ... a process with a view
    Each process should be permitted to have its own view of the execution environment
  • 68. View Components
    ● filesystem namespace, including the related ownership and permission information,
    ● networking configuration,
    ● system name,
    ● current time,
    ● devices, etc.
    ● ...
  • 69. Global View Assumption
    ● In general, processes running on the same computer share the same view.
      – A given pathname refers to the same file for all processes.
      – All processes use one shared TCP-IP stack for networking, hence all processes share the same set of IP addresses and routing policies.
      – All processes share the same notion as to which users/processes have special privileges.
  • 70. View-OS vs. VMs and Containers
                                View-OS   VM               Container
      Memory Impact             LOW       HIGH             LOW
      Running State             User      User or Kernel   Kernel
      Administered by:          user      user             root
      Partial Virtualization    Yes       No               Yes (sharing)
  • 71. Partial Virtualization
    ● Virtualize just what you need:
      – Virtual and real file systems, devices, networks, etc. co-exist in the process view
    ● Support for nested virtualization:
      – e.g. virtual file system defined on virtual devices.
  • 72. How to start a View-OS monitor:
      user@host:~$ umview bash
      This kernel supports: PTRACE_MULTI PTRACE_SYSVM ppoll
      View-OS will use: PTRACE_MULTI PTRACE_SYSVM ppoll
      pure_libc library found: syscall tracing allowed
      rd235 2.6.29-utrace GNU/Linux/View-OS 10585 0
      user@host[10585:0]:~$
    ● Umview runs on vanilla Linux kernels; Kmview requires a kernel module loaded (and utrace).
    ● Instead of bash one may run his/her favorite executable (e.g. xterm, script....)
  • 73. View-OS modules
    ● View-OS monitor loads only the virtualities requested by the user:
      – Umfuse: file system virtualization
      – Umnet: networking virtualization
      – Umdev: device virtualization
      – Umbinfmt: executable interpreter virtualization
      – Viewfs: file system patchworking
      – Ummisc: time, system id...
  • 74. #1: Virtual Installation of Software
      $ um_add_service viewfs
      $ mkdir /tmp/newroot
      $ viewsu
      # mount -t viewfs -o mincow,except=/tmp,vstat /tmp/newroot /
      # apt-get install mynewsoftware
    ● Create an empty dir
    ● Mount it in “minimal copy on write” mode:
      – File mods are on the real file system when allowed.
      – Mods stored in the mounted dir otherwise.
      – A single consistent view.
      – Vstat: virtualize stat (support for virtual chown, chmod/setuid, special files)
  • 75. #2: Virtual Networking
      $ um_add_service umnet
      $ mount -t umnetlwipv6 none /dev/net/default
      $ ip link set vd0 up
      $ ip addr add 10.1.2.3/24 dev vd0
      $ ip addr
      1: lo0: <LOOPBACK,UP> mtu 0
          link/loopback
          inet6 ::1/128 scope host
          inet 127.0.0.1/8 scope host
      2: vd0: <BROADCAST,UP> mtu 1500
          link/ether 02:02:5a:44:e2:06 brd ff:ff:ff:ff:ff:ff
          inet6 fe80::2:5aff:fe44:e206/64 scope link
          inet 10.1.2.3/24 scope global
    ● A network stack can be “mounted.”
    ● /dev/net/default is the default stack, but View-OS supports multiple stacks.
  • 76. #2: Virtual Networking
      $ um_add_service umnet
      $ mount -t umnetlwipv6 -o tn0=tunx none /dev/lwip0
      $ mount -t umnetlwipv6 -o tp0=tapx,vd0=/tmp/switch none /dev/lwip1
      $ mstack /dev/lwip0 ip addr
      1: lo0: <LOOPBACK,UP> mtu 0
          link/loopback
          inet6 ::1/128 scope host
          inet 127.0.0.1/8 scope host
      2: tn0: <> mtu 0
          link/generic
      $ mstack /dev/lwip1 ip addr
      1: lo0: <LOOPBACK,UP> mtu 0
          link/loopback
          inet6 ::1/128 scope host
          inet 127.0.0.1/8 scope host
      2: vd0: <BROADCAST> mtu 1500
          link/ether 02:02:47:98:ad:06 brd ff:ff:ff:ff:ff:ff
      3: tp0: <BROADCAST> mtu 1500
          link/ether 02:02:03:04:05:06 brd ff:ff:ff:ff:ff:ff
      $
  • 77. #3: Mount a Filesystem
      $ um_add_service umfuse
      $ mount -t umfuseext2 -o ro ext2filesystemimage /mnt
      $ mount -t umfusestrangefilesystem strangeimage /mnt2
    ● Source compatible with Fuse.
    ● Mount file systems unsupported by the kernel.
    ● Safe mount, limited to this View.
  • 78. #4: Filesystem Image partition and mount
    ● Step 1: Load the umdev (virtual device) module and mount an empty file as a disk image.
      $ um_add_service umdev
      $ viewsu
      # dd of=/tmp/diskimage bs=1024 count=0 seek=1024000
      # mount -t umdevmbr /tmp/diskimage /dev/hda
  • 79. #4: Filesystem Image partition and mount
    ● Step 2: partition the file system image:
      # fdisk /dev/hda
      Device contains a valid partition table
      Building a new DOS disklabel with disk identifier 0xd403417d.
      Command (m for help): n
      Command action
         e   extended
         p   primary partition (1-4)
      p
      Partition number (1-4): 1
      First cylinder (1-127, default 1): 1
      Last cylinder, +cylinders or +size{K,M,G} (1-127, default 127): 127
      Command (m for help): p
      Disk /dev/hda: 1048 MB, 1048576000 bytes
      255 heads, 63 sectors/track, 127 cylinders
      Units = cylinders of 16065 * 512 = 8225280 bytes
      Disk identifier: 0xd403417d
      Device Boot      Start         End      Blocks   Id  System
      /dev/hda1            1         127     1020096   83  Linux
      Command (m for help): w
      The partition table has been altered!
      Calling ioctl() to re-read partition table.
      Syncing disks.
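The numbers printed by fdisk follow from the size of the image created in step 1 (1024000 blocks of 1024 bytes) and the reported geometry. The short sketch below redoes that arithmetic. The function names are hypothetical, and the assumption that the first partition starts after the first track (63 sectors) is the classic DOS convention, adopted here because it reproduces the Blocks column.

```c
#include <assert.h>
#include <stdint.h>

enum { HEADS = 255, SECTORS_PER_TRACK = 63, SECTOR_BYTES = 512 };

/* 255 heads * 63 sectors * 512 bytes = 8225280 bytes per cylinder. */
uint64_t bytes_per_cylinder(void)
{
    return (uint64_t)HEADS * SECTORS_PER_TRACK * SECTOR_BYTES;
}

/* Number of whole cylinders that fit in an image of the given size. */
uint64_t whole_cylinders(uint64_t image_bytes)
{
    return image_bytes / bytes_per_cylinder();
}

/* Size, in 1 KiB blocks, of one partition covering all the cylinders
 * but starting after the first track (assumed DOS-style alignment). */
uint64_t partition_kib(uint64_t cylinders)
{
    assert(cylinders > 0);
    uint64_t sectors = cylinders * HEADS * SECTORS_PER_TRACK - SECTORS_PER_TRACK;
    return sectors * SECTOR_BYTES / 1024;
}
```

For the 1048576000-byte image this gives 127 whole cylinders and a partition of 1020096 blocks, matching the fdisk listing.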
  • 80. #4: Filesystem Image partition and mount
    ● Step 3: Create the filesystem
      # mkfs.ext2 /dev/hda1
      mke2fs 1.41.8 (20-Jul-2009)
      Filesystem label=
      OS type: Linux
      Block size=4096 (log=2)
      Fragment size=4096 (log=2)
      63872 inodes, 255024 blocks
      12751 blocks (5.00%) reserved for the super user
      First data block=0
      Maximum filesystem blocks=264241152
      8 block groups
      32768 blocks per group, 32768 fragments per group
      7984 inodes per group
      Superblock backups stored on blocks:
              32768, 98304, 163840, 229376
      Writing inode tables: done
      Writing superblocks and filesystem accounting information: do
      This filesystem will be automatically checked every 38 mounts
      180 days, whichever comes first. Use tune2fs -c or -i to over
  • 81. #4: Filesystem Image partition and mount
    ● Step 4: mount the new partition
      # um_add_service umfuse
      # mount -t umfuseext2 -o rw+ /dev/hda1 /mnt
      # ls -l /mnt
      total 16
      drwx------ 2 root root 16384 2009-09-16 11:57 lost+found
    ● Example of nested virtualization.
    ● Compatible with standard sys-admin commands.
  • 82. #5: User Mode chroot
    ● Step 1: create the jail filesystem
      $ mkdir /tmp/root /tmp/root/bin /tmp/root/lib
      $ cp /bin/busybox /tmp/root/bin
      $ cp /lib/libm-2.9.so /lib/libc-2.9.so /tmp/root/lib
      $ cd /tmp/root/lib
      $ ln -s libm-2.9.so libm.so.6
      $ ln -s libc-2.9.so libc.so.6
      $ cd /
    ● Resulting tree:
      /tmp/root
        bin
          busybox
        lib
          libc.so.6
          libm.so.6
          libc-2.9.so
          libm-2.9.so
  • 83. #5: User Mode chroot
    ● Step 2: change the file system root:
      – Core mode: by the virtual chroot system call
        $ exec /usr/sbin/chroot /tmp/root /bin/busybox sh
        BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
        Enter 'help' for a list of built-in commands.
        / $
      – By Viewfs:
        $ um_add_service viewfs
        $ exec busybox sh
        BusyBox v1.13.3 (Debian 1:1.13.3-1) built-in shell (ash)
        Enter 'help' for a list of built-in commands.
        / $ mount -t viewfs -o move,permanent /tmp/root /
        / $
  • 84. #5: User Mode chroot
    ● Step 3: the process is in the jail:
      / $ ls -lR /
      /:
      drwxr-xr-x 2 1000 1000    4096 Sep 17 13:37 bin
      drwxr-xr-x 2 1000 1000    4096 Sep 17 13:37 lib
      /bin:
      -rwxr-xr-x 1 1000 1000  401216 Sep 17 13:37 busybox
      /lib:
      -rwxr-xr-x 1 1000 1000 1302732 Sep 17 13:37 libc-2.9.so
      lrwxrwxrwx 1 1000 1000      11 Sep 17 13:37 libc.so.6 -> libc-2.9.so
      -rw-r--r-- 1 1000 1000  149328 Sep 17 13:37 libm-2.9.so
      lrwxrwxrwx 1 1000 1000      11 Sep 17 13:37 libm.so.6 -> libm-2.9.so
      / $
  • 85. #6: Create a Ramdisk and use it:
      $ um_add_service umdev
      $ um_add_service umfuse
      $ um_add_service umproc
      $ mount -t umdevramdisk -o size=100M none /dev/hdx
      $ /sbin/mkfs.vfat /dev/hdx
      mkfs.vfat 3.0.3 (18 May 2009)
      $ mount -t umfusefat -o rw+ /dev/hdx /mnt
      $ mount
      rootfs on / type rootfs (rw)
      /dev/root on / type ext3 (rw,errors=remount-ro,data=ordered)
      tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=755)
      ... ...
      none on /proc/mounts type proc (ro)
      none on /dev/hdx type umdevramdisk (size=100M)
      /dev/hdx on /mnt type umfusefat (rw+)
      $
    ● Another example of nested virtualization
    ● Umproc virtualizes /proc/mounts
  • 86. #7: Virtualize running processes
    ● Shell #1 (pid 12345, an ordinary shell)
      sh1 $ mkdir /tmp/mnt
      sh1 $ ls /tmp/mnt
      sh1 $
    ● Shell #2 (running under ViewOS)
      sh2 $ um_add_service umfuse
      sh2 $ mount -t ext2 /tmp/linux.img /tmp
      sh2 $ ls /tmp/mnt
      bin  boot dev etc lib lost+found mnt    proc sbin tmp usr
      sh2 $ um_attach 12345
    ● Shell #1 has been “attached” to ViewOS
      sh1 $ ls /tmp/mnt
      bin  boot dev etc lib lost+found mnt    proc sbin tmp usr
  • 87. #8: Process proper time
    ● Start two xclocks, one from a standard shell, the other from a shell running ViewOS.
          sh1 $ xclock -update 1 &
          sh2 $ xclock -update 1 &
          sh2 $ um_add_service ummisc
          sh2 $ mount -t ummisctime none /tmp/mnt
    ● Now change the frequency of the virtual time for ViewOS:
          sh2 $ echo 2 > /tmp/mnt/frequency
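The effect of writing 2 to the frequency file can be pictured with a small model. This is only a sketch under an assumption: that virtual time stays continuous at the moment of the change and then advances at `frequency` times the real rate. The class `VirtualClock` and its methods are hypothetical names for illustration, not part of ViewOS.

```python
class VirtualClock:
    """Toy model of a per-view clock with an adjustable rate (hypothetical)."""

    def __init__(self, real_now: float, frequency: float = 1.0) -> None:
        self._real_base = real_now   # real time of the last rate change
        self._virt_base = real_now   # virtual time at that same moment
        self._frequency = frequency

    def now(self, real_now: float) -> float:
        """Virtual time corresponding to the given real time."""
        return self._virt_base + (real_now - self._real_base) * self._frequency

    def set_frequency(self, real_now: float, frequency: float) -> None:
        """Change the rate; re-anchor so virtual time does not jump."""
        self._virt_base = self.now(real_now)
        self._real_base = real_now
        self._frequency = frequency


clock = VirtualClock(real_now=100.0)
clock.set_frequency(real_now=110.0, frequency=2.0)
# 10 real seconds at rate 1, then 5 real seconds at rate 2.
print(clock.now(real_now=115.0))
```

In this model the xclock started from the ViewOS shell would run twice as fast as the one started from the standard shell after the write.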
  • 88. Behind the Scenes [Figure: View-OS architecture diagram. Labels: process; capture layer (ptrace or kmview kernel module, utrace); Linux kernel; *mview (global hash table, dispatcher, PCB and fd table mgmt); nested capture (PURELIBC); modules; submodules.]
  • 89. Modules & submodules
    ● Modules provide support for classes of virtualizations, e.g.:
      – Umfuse: file systems
      – Umnet: networking
      – Umdev: devices
    ● Submodules are for specific cases, e.g.:
      – Umfuseext2, Umfusefat
      – Umnetlwipv6, umnetnull
      – Umdevmbr, umdevramdisk
  • 90. Modules & submodules
    module   description                         submodule        description
    umproc   /proc/mounts virtualization
    umfuse   User-mode fuse                      umfuseext2       ext2 implementation
                                                 umfuseiso9660    iso9660
                                                 umfusefat        fat/vfat
                                                 umfusentfs3g     ntfs
                                                 umfusearchive    tar/cdimages (libarchive)
                                                 umfuseramfile    single file virtualization
                                                 umfusessh        (sshfs) remote file system via ssh
                                                 umfuseencfs      (encfs) encrypted file system
    umnet    network multi stack support         umnetnull        null stack
                                                 umnetlwipv6      IPv4/v6 hybrid stack
                                                 umnetlink        move/merge stacks
                                                 umnetcurrent     current stack
    umdev    device virtualization               umdevmbr         DOS master boot record
                                                 umdevnull        null device
                                                 umdevramdisk     ramdisk
                                                 umdevvd          VDI, VMDK, VHD disks
                                                 umdevtab         virtual tuntap
    ummisc   system call based virtualization    ummisctime       time virtualization
                                                 ummiscuname      uname id virtualization
    viewfs   file system patchworking
  • 91. Capture user process system calls
    ● umview: based on ptrace
      – Vanilla Linux kernel
      – (patches proposed for performance)
    ● kmview needs a specific kernel module based on utrace.
      – Security enhancement
      – More complete virtualization support (nested View-OS, strace/gdb, SIGSTOP).
  • 92. Global Hash Table
    ● Keeps track of active virtualizations:
      – Pathname objects
      – File System Types
      – Protocol families
      – Device Major/Minor ranges
      – System call numbers
  • 93. Dispatcher
    ● The Dispatcher uses the global hash table to route each system call to the right module or to the kernel.
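The routing idea can be sketched as a table lookup. The following is a minimal model, not the View-OS implementation: it handles only pathname keys (the slides also list file system types, protocol families, device ranges and system call numbers), and it assumes a path is matched against itself and then each parent directory. `Dispatcher`, `register` and the handler functions are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[[str, tuple], str]


def kernel(name: str, args: tuple) -> str:
    """Stand-in for passing the call through to the real kernel."""
    return f"kernel:{name}"


def umfuse(name: str, args: tuple) -> str:
    """Stand-in for a module that virtualizes the call."""
    return f"umfuse:{name}"


class Dispatcher:
    def __init__(self) -> None:
        # Global table: (kind of object, key) -> module handling it.
        self.table: Dict[Tuple[str, str], Handler] = {}

    def register(self, kind: str, key: str, handler: Handler) -> None:
        self.table[(kind, key)] = handler

    def _lookup_path(self, path: str) -> Optional[Handler]:
        # Try the path itself, then each parent directory up to "/".
        probe = path.rstrip("/") or "/"
        while True:
            handler = self.table.get(("path", probe))
            if handler is not None:
                return handler
            if probe == "/":
                return None
            probe = probe.rsplit("/", 1)[0] or "/"

    def dispatch(self, name: str, args: tuple) -> str:
        handler = None
        if args and isinstance(args[0], str) and args[0].startswith("/"):
            handler = self._lookup_path(args[0])
        # No active virtualization matches: the call goes to the kernel.
        return (handler or kernel)(name, args)


d = Dispatcher()
d.register("path", "/mnt", umfuse)
print(d.dispatch("open", ("/mnt/a/b",)))
print(d.dispatch("open", ("/etc/passwd",)))
```

A hash keyed on the object being accessed keeps the common case cheap: calls that touch nothing virtualized fall straight through to the kernel.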
  • 94. Nested Capture
    ● View-OS captures (and can virtualize) the system calls generated by modules and submodules
    ● Purelibc is a C library providing process self virtualization
  • 95. Desiderata 1: Linux Kernel
    New ptrace tags for virtualization support:
      – PTRACE_VM: support for partial virtualization. It is possible to skip the current system call and/or the second upcall after the system call. (User-Mode Linux can use this instead of PTRACE_SYSEMU. VM has a simpler implementation than SYSEMU.)
      – PTRACE_MULTI: process a sequence of ptrace requests + PEEK/POKE of large chunks as a single call. (ptrace exchanges one memory word per call and /proc/{pid}/mem is not writable!)
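The cost that motivates PTRACE_MULTI is simple arithmetic: with one memory word per ptrace call, the number of calls grows linearly with the buffer size. A small sketch, assuming an 8-byte word (a 64-bit machine); the function name is illustrative only.

```python
def word_transfers(nbytes: int, word_size: int = 8) -> int:
    """Number of one-word ptrace transfers needed to move nbytes."""
    return -(-nbytes // word_size)  # ceiling division


for size in (16, 256, 4096):
    print(size, word_transfers(size))
```

Copying a 4096-byte buffer therefore takes 512 separate calls on a 64-bit machine (1024 with 4-byte words), which is the overhead a single batched request is meant to remove.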
  • 96. Desiderata 2: Open Group/POSIX
          #include <msocket.h>
          int msocket(char *path, int domain, int type, int protocol);
    ● path is the pathname of the stack
    ● domain/type/protocol are the same as defined in socket(2).
    ● A stack is a special file (new type of special file, see stat(2)):
          #define S_IFSTACK 0160000
    ● Each process has a default stack for each protocol family (domain).
      – If path==NULL, msocket uses the default stack.
    ● It is backwards compatible:
          #define socket(d,t,p) msocket(NULL,(d),(t),(p))
  • 97. Desiderata 2: Open Group/POSIX
          int msocket(char *path, int domain, int type, int protocol);
    ● If type==SOCK_DEFAULT, msocket sets the default stack, e.g.
          msocket("/dev/net/ipstack2",PF_INET,SOCK_DEFAULT,0);
      defines /dev/net/ipstack2 as the default stack for IPv4.
    ● If type==SOCK_DEFAULT && domain==PF_UNSPEC, msocket sets the default stack for all the protocol families.
    ● Mstack uses msocket: it defines the default stack so that existing applications can use different stacks.
          $ ip addr
          ..... ip addr on default net
          $ mstack /dev/net/newstack firefox
          .... firefox works on newstack
          $ mstack /dev/net/otherstack bash
          $ ...this new bash works on otherstack
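The default-stack rules of msocket can be modelled in a few lines. This is a sketch of the semantics listed on the two slides above, not of any real implementation: the numeric value of `SOCK_DEFAULT`, the initial stack path `/dev/net/default`, the `Process` class and the returned tuple are all assumptions made for illustration.

```python
PF_UNSPEC, PF_INET, PF_INET6 = 0, 2, 10   # values used on Linux
SOCK_STREAM = 1
SOCK_DEFAULT = 0                          # placeholder value, an assumption


class Process:
    """Per-process default stacks, one per protocol family (toy model)."""

    def __init__(self, system_stack: str = "/dev/net/default") -> None:
        self._fallback = system_stack     # default for every family
        self._defaults = {}               # per-family overrides

    def msocket(self, path, domain, type_, protocol=0):
        if type_ == SOCK_DEFAULT:
            if domain == PF_UNSPEC:
                # New default for all the protocol families.
                self._defaults.clear()
                self._fallback = path
            else:
                self._defaults[domain] = path
            return None
        # path == None means "use the default stack for this family".
        stack = path if path is not None else self._defaults.get(
            domain, self._fallback)
        return (stack, domain, type_, protocol)

    def socket(self, domain, type_, protocol=0):
        # Backwards compatibility: socket() is msocket() with no path.
        return self.msocket(None, domain, type_, protocol)


p = Process()
p.msocket("/dev/net/ipstack2", PF_INET, SOCK_DEFAULT, 0)
print(p.socket(PF_INET, SOCK_STREAM))
print(p.socket(PF_INET6, SOCK_STREAM))
```

An unmodified application that only calls `socket` ends up on whatever default was set before it started, which is the mechanism the mstack command relies on.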
  • 98. Desiderata 3: C library ({e}glibc)
    ● C libraries are impure: they are pure C libraries and interfaces to system calls at the same time.
    ● It is not possible to do self virtualization of system calls for processes using {e}glibc, because library calls are internally linked to the system calls (e.g. printf calls write).
    ● Purelibc is a (ld preloaded) layer on {e}glibc which converts the C library into a pure library.
    ● The support for self virtualization should be a feature of mainstream {e}glibc.
  • 99. Desiderata 4: utrace UTRACE_STOP
    ● Utrace supports multiple tracers (engines) on the same process.
    ● Utrace sends the notification to all the tracers and then waits for utrace_control(..., UTRACE_RESUME) from each tracer which returned UTRACE_STOP.
    ● This specification is ill-suited for nested virtualization support: a notification function inspects the state (e.g. system call parameters) and may change it. The next tracer must read the state as changed by the previous tracer.
    ● Kmview uses a semaphore in its system call notification function to stop a process, because this UTRACE_STOP specification is useless.
  • 100. UTRACE_STOP implementation [Figure: sequence diagram. Lanes: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview. Steps: syscall request; notify engine #2, notify user space, return UTRACE_STOP; notify engine #1, notify user space, return UTRACE_STOP; mgmt of syscall in both views while UTRACE waits for all engines (RACE CONDITION!); run syscall.]
  • 101. Desiderata: new UTRACE_STOP [Figure: sequence diagram. Lanes: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview. Steps: syscall request; notify engine #2, notify user space, return UTRACE_STOP, wait, mgmt of syscall; notify engine #1, notify user space, return UTRACE_STOP, wait, mgmt of syscall; run syscall.]
  • 102. Kmview workaround: PTRACE_SYSCALL_{RUN,ABORT} instead of PTRACE_STOP [Figure: sequence diagram. Lanes: PROCESS, UTRACE, Kmview kernel module, Outer kmview, Inner kmview. Steps: syscall request; notify engine #2, notify user space, down(sem), mgmt of syscall, up(sem); notify engine #1, notify user space, down(sem), mgmt of syscall, up(sem); run syscall.]
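Why the order of notifications matters for nested views can be shown with a deterministic toy model. It compares two policies: each tracer finishing before the next one is notified, and every tracer being notified with the original state before any has finished. The tracers, the paths and the "last writer wins" outcome are assumptions chosen for illustration; a real race could end differently.

```python
def redirect(src, dst):
    """Build a tracer that rewrites paths under src to live under dst."""
    def tracer(state):
        path = state["path"]
        if path == src or path.startswith(src + "/"):
            path = dst + path[len(src):]
        return {**state, "path": path}
    return tracer


def notify_sequential(state, tracers):
    # Each tracer sees the state as changed by the previous one.
    for tracer in tracers:
        state = tracer(state)
    return state


def notify_all_then_wait(state, tracers):
    # Every tracer is notified with the original state.
    # One possible outcome of the race: the last writer wins.
    results = [tracer(dict(state)) for tracer in tracers]
    return results[-1]


tracers = [redirect("/tmp", "/mnt"), redirect("/mnt", "/image")]
call = {"name": "open", "path": "/tmp/f"}
print(notify_sequential(call, tracers)["path"])
print(notify_all_then_wait(call, tracers)["path"])
```

In the sequential case the second tracer acts on the rewritten path; in the other case it decides on stale parameters and the first rewrite is lost, which is why the semaphore-based workaround serializes the two views.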
  • 103. Multiple meanings of safety...
    ● Availability, bug effects confinement:
      – ViewOS runs outside the kernel; errors in modules may lead to a crash of the View (not a kernel panic!)
    ● Self protection (from mistaken commands):
      – The Global View Assumption often forces the use of root access (or powerful capabilities); this is dangerous.
    ● Sandbox non-circumvention:
      – At first sight it seems that kernel-based sandboxes are safer (e.g. seccomp).
        ● Kernel-based sandboxes are not flexible
        ● On/Off security: a bug may compromise the whole system
        ● Good support for VM can preserve safety
    ● The more code, the worse the security. Is the kernel "too fat"?
      – Maintenance problems, side effects, etc.
      – ViewOS can move services outside the kernel.
  • 104. The missing ring...
    ● View-OS modules are similar to microkernel servers.
    ● View-OS captures some of the benefits of microkernels (separation of mechanism and policy, flexibility, reliability).
    ● View-OS allows microkernel services to be implemented (at user level) on monolithic kernels.
  • 105. VirtualSquare
    ● VDE
    ● LWIPv6
    ● PureLibC
    ● IPN
    ● View-OS
      – Umview/Kmview
    Questions?