
Unikernels: Rise of the Library Hypervisor

Explaining the emergence of the library hypervisor, and two examples of its use: Docker for Mac and the MirageOS 3 unikernel

Published in: Software
  • Youtube video of talk at Docker Distributed Systems Summit 2016: https://www.youtube.com/watch?v=dn4ARS4lDlQ&list=PLkA60AVN3hh8oPas3cq2VA9xB7WazcIgs&index=8


  1. 1. Unikernels: the Rise of the Library Hypervisor Anil Madhavapeddy, @avsm Mindy Preston, @yomimono Martin Lucina +the MirageOS and Docker for Mac/Win teams Docker Inc, @docker with contributions from IBM Docker Distributed Systems Summit 7th October 2016, Berlin, Germany
  2. 2. Conventional hypervisors • Run full guest operating systems with complex emulation needs. • Scaffolding for device emulation, instruction emulation, etc. • Hard to compose into existing infrastructure without wrapping a full hypervisor layer. Xen Hypervisor qemu xenstored xenconsoled Hardware Dom0 DomU
  3. 3. Conventional hypervisors CVE-2016-3710: VGA emulation missing bounds checks causes exploit. CVE-2016-5403: unbounded virtio memory usage causes DoS. CVE-2016-3672: unrestricted qemu logging causes DoS. CVE-2015-8554: qemu-dm buffer overrun in MSI-X causes exploit. CVE-2015-7504: heap overflow in pcnet emulator causes exploit. • Run full guest operating systems with complex emulation needs. • Scaffolding for device emulation, instruction emulation, etc. • Hard to compose into existing infrastructure without wrapping a full hypervisor layer.
  4. 4. How can distributed systems use hardware protection more flexibly and composably?
  5. 5. Recap: Unikernels • "library operating systems" break kernels into libraries. • Link libraries with a boot layer, scheduler and application. • Portable microservices that boot directly on hypervisors or Unix. Xen Hardware App Linux Hardware DockerApp Configuration Business Logic HTTP JSON SSL TCP/IP Xen Devices Unix libev Unix musl libc Application Libraries Libraries
  6. 6. Recap: Unikernels • Many benefits are lost when deploying on existing clouds. • Tiny binaries (200k) still require scaffolding of a full OS to boot. • Difficult to manage hypervisor from inside a container as full host privilege is needed. • "library operating systems" break kernels into libraries. • Link libraries with a boot layer, scheduler and application. • Portable microservices that boot directly on hypervisors or Unix.
  7. 7. Library Hypervisors • Extend the "kit" model and break down hypervisor functionality into libraries. • Expose core functionality (CPU and memory) as library, and other pieces (device emulation) are optional. • Benefit: huge reduction in TCB, and better fit to container-native infrastructure with privilege dropping. • Drawback: no existing support in operating systems.
  8. 8. Library Hypervisors • Extend the "kit" model and break down hypervisor functionality into libraries. • Expose core functionality (CPU and memory) as library, and other pieces (device emulation) are optional. • Benefit: huge reduction in TCB, and better fit to container-native infrastructure with privilege dropping. • Drawback: no existing support in operating systems. But let's take a closer look!
  9. 9. What has changed? OSX Hypervisor framework FreeBSD bHyve xHyve HyperKit bhyve.org xhyve.org github.com/docker/hyperkit
  10. 10. What has changed? OSX Hypervisor framework Linux /dev/kvm FreeBSD bHyve xHyve HyperKit kvmtool novm ukvm
  11. 11. What has changed? OSX Hypervisor framework Linux /dev/kvm FreeBSD bHyve xHyve HyperKit kvmtool novm Docker for Mac MirageOS3 ukvm
  12. 12. • Easy drag and drop installation, and autoupdates to get latest Docker. • Secure, sandboxed virtualisation architecture without elevated privileges. • Native networking support, with VPN and network sharing compatibility. • File sharing between container and host: uid mapping, inotify events, etc. Docker for Mac Aiming for a native OSX experience that works with existing developer workflows.
  13. 13. • Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve. • Sandbox friendly: processes largely run as non-root, with privileges of the local user. Virtualisation
  14. 14. • Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve. • Sandbox friendly: processes largely run as non-root, with privileges of the local user. Virtualisation OSX Kernel Hypervisor.framework Hardware virt: VMX, nested paging
  15. 15. • Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve. • Sandbox friendly: processes largely run as non-root, with privileges of the local user. Virtualisation OSX Kernel Userspace Hypervisor.framework User Process Thread/vCPU Traps on I/O pages Manages ACPI, PCI devices Hardware virt: VMX, nested paging
  16. 16. • Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve. • Sandbox friendly: processes largely run as non-root, with privileges of the local user. Virtualisation OSX Kernel Userspace Hypervisor.framework User Process Hardware virt: VMX, nested paging Process Linux Kernel VirtIO IPC VirtIO Block VirtIO Net Alpine Linux Userspace Latest Docker preconfigured QCow2 VPNKit Logs redirected to OSX host
  17. 17. • Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve. • Embeds Linux: includes an embedded lightweight Alpine Linux distribution optimised for fast boot and stateless operation for containers. Virtualisation $ docker info Containers: 358 Running: 13 Paused: 0 Stopped: 345 Images: 485 Server Version: 1.11.1 Storage Driver: aufs Root Dir: /var/lib/docker/aufs Backing Filesystem: extfs Dirperm1 Supported: true Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge null host Kernel Version: 4.4.9-moby Operating System: Alpine Linux v3.3 OSType: linux Architecture: x86_64 CPUs: 2 Total Memory: 3.858 GiB
  18. 18. HyperKit library structure • In HyperKit, most functionality is linked as a library. • If app doesn't need a protocol, it is not linked and not part of the trusted computing base.
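
The library structure described on this slide can be illustrated with a toy model. The sketch below is not HyperKit code: the `Monitor` class, the `link` and `tcb` methods, and the device names are all hypothetical. It only shows how a monitor assembled from optional parts ends up with a trusted computing base limited to what the application asked for.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical device "libraries". The names are illustrative and are not
# taken from HyperKit's source tree.
def virtio_net(request: bytes) -> bytes:
    return b"net:" + request


def virtio_block(request: bytes) -> bytes:
    return b"blk:" + request


@dataclass
class Monitor:
    """Toy monitor: the core is always present, devices are opt-in."""

    devices: Dict[str, Callable[[bytes], bytes]] = field(default_factory=dict)

    def link(self, name: str, handler: Callable[[bytes], bytes]) -> "Monitor":
        # "Linking" a device library makes it part of this monitor.
        self.devices[name] = handler
        return self

    def tcb(self) -> List[str]:
        # Only the core plus the linked devices count as trusted code.
        return ["core"] + sorted(self.devices)

    def handle(self, name: str, request: bytes) -> bytes:
        if name not in self.devices:
            raise LookupError(f"device {name!r} was never linked")
        return self.devices[name](request)


# An application that only needs networking never links the block device.
net_only = Monitor().link("virtio-net", virtio_net)
```

In this model a request for an unlinked device fails outright, which mirrors the point of the slide: code that is not linked cannot be attacked.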
  19. 19. • Want to hide the gory details of virtualisation from the user. The Linux VM should be "invisible". • Not solving this leads to many user complaints: • VPN software and corporate installations do not like bridged virtual machines or custom routing.
 Result: container traffic cannot connect to Internet. • Services cannot be exposed on localhost or the external interface and are instead on the Linux VM IP address.
 Result: breaks common web OAuth workflows. Networking
  20. 20. Networking OSX Kernel Userspace Hypervisor.framework HyperKit Hardware virt: VMX, nested paging VirtIO IPC VirtIO Block VirtIO Net
  21. 21. Networking OSX Kernel Userspace Hypervisor.framework HyperKit Hardware virt: VMX, nested paging VirtIO IPC VirtIO Block VirtIO Net Ethernet In Containers! Containers! Containers!
  22. 22. Networking OSX Kernel Userspace Hypervisor.framework HyperKit Hardware virt: VMX, nested paging VirtIO IPC VirtIO Block VirtIO Net Ethernet In Bridge Ethernet Kernel Module Containers! Containers! Containers!
  23. 23. • Want to hide the gory details of virtualisation from the user. The Linux VM should be "invisible". • Not solving this leads to many user complaints: • VPN software and corporate installations do not like bridged virtual machines or custom routing.
 Result: container traffic cannot connect to Internet. • Services cannot be exposed on localhost or the external interface and are instead on the Linux VM IP address.
 Result: breaks common web OAuth workflows. Networking
  24. 24. • Challenge: Services publishing ports should be exposed on localhost without needing VM info. • Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface. • Benefits: • docker run -P on the Mac now works without requiring any knowledge of the VM innards. • External OAuth workflows operate with web apps. Networking
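
The port-publishing idea above can be sketched as a small host-side relay. This is an illustration only, not VPNKit's implementation: the function names `forward_port` and `_pipe` are hypothetical, and the backend is assumed to be reachable over plain TCP, whereas the real system talks to the VM over its own transport.

```python
import socket
import threading


def _pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def forward_port(listen_addr, target_addr) -> socket.socket:
    """Bind listen_addr on the host and relay each connection to target_addr.

    Returns the listening socket. The accept loop runs in a daemon thread.
    A production relay would also close sockets and bound the thread count.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(listen_addr)
    listener.listen()

    def accept_loop() -> None:
        while True:
            try:
                client, _ = listener.accept()
            except OSError:
                return  # listener was closed
            upstream = socket.create_connection(target_addr)
            # One thread per direction keeps the sketch short.
            threading.Thread(target=_pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=_pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

Because the listener is an ordinary socket bound on the host, a client connecting to localhost needs no knowledge of where the backend actually runs.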
  25. 25. Networking OSX Kernel Userspace Hypervisor.framework HyperKit Hardware virt: VMX, nested paging VirtIO IPC VirtIO Block VirtIO Net Ethernet In Bridge Ethernet Kernel Module Containers! Containers! Containers!
  26. 26. Networking OSX Kernel Userspace Hypervisor.framework HyperKit Hardware virt: VMX, nested paging VirtIO IPC VirtIO Block VirtIO Net Ethernet In VPNKit MirageOS TCP/IP DNS Socketer Kernel Sockets Containers! Containers! Containers! github.com/docker/vpnkit
  27. 27. • Challenge: Deal with custom VPN software on the host that makes it difficult to bridge. • Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets. • Benefits: • All network traffic is generated from normal socket calls (e.g. gethostbyaddr) on the Mac, so interacts well with firewalls, VPNs, and any local security policies. Networking
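
To make "reconstructs container traffic into separate TCP/IP flows" concrete, here is a deliberately simplified sketch. It is not how VPNKit is written (VPNKit uses the MirageOS TCP/IP libraries): the `Segment` type and `reconstruct_flows` function are hypothetical, segments are assumed to be already parsed, and retransmission, overlap and sequence wrap-around are ignored.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (source ip, source port, destination ip, destination port)
FlowKey = Tuple[str, int, str, int]


@dataclass(frozen=True)
class Segment:
    """A pre-parsed TCP segment; real frames need Ethernet/IP/TCP parsing."""

    src: str
    sport: int
    dst: str
    dport: int
    seq: int
    payload: bytes


def reconstruct_flows(segments: List[Segment]) -> Dict[FlowKey, bytes]:
    """Group segments by connection 4-tuple and join payloads in seq order."""
    by_flow: Dict[FlowKey, List[Segment]] = defaultdict(list)
    for seg in segments:
        by_flow[(seg.src, seg.sport, seg.dst, seg.dport)].append(seg)
    return {
        key: b"".join(s.payload for s in sorted(segs, key=lambda s: s.seq))
        for key, segs in by_flow.items()
    }
```

Each reconstructed byte stream could then be written to an ordinary host socket connected to the flow's destination, which is why the host's firewall and VPN software see nothing unusual.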
  28. 28. • Native OSX application, uses HyperKit to virtualise for domain-specific purpose ("docker run") • Links MirageOS unikernel libraries for networking and storage translation between OS boundaries. • The library approach let us glue together these components really easily. • Docker for Mac is quite a complex distributed system internally, but (hopefully) hidden from user. Docker for Mac + unikernels
  29. 29. MirageOS 3 + Solo5 • Unikernels have been gathering pace; next challenge is to make them easily deployable. • Build handled via Docker, but docker run shouldn't need privileges (e.g. to start a VM). • MirageOS 3 has a new library hypervisor for Linux, developed by IBM, Docker and Cambridge University contributors. mirage.io
  30. 30. MirageOS 3 + Solo5 • Source: https://github.com/Solo5/solo5 • Runs as a Unix process and opens /dev/kvm for hardware isolation. • ukvm is a small, modular monitor that links only what is needed. Can be 10k in size! • Can run privilege separated: one process opens /dev/kvm, drops privileges, and executes the unikernel. • Boot times are the same as process fork times, since all the device setup is handled in-process.
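
The privilege-separated launch described above follows a common Unix pattern: open the device, drop privileges, then exec. The sketch below shows that pattern in Python and is not ukvm's code. The function names and the `KVM_FD` environment variable are hypothetical conventions chosen for illustration, and `drop_privileges` only succeeds when the process starts as root.

```python
import os
from typing import List


def open_kvm(path: str = "/dev/kvm") -> int:
    """Open the device while still privileged and let the fd survive exec."""
    fd = os.open(path, os.O_RDWR)
    os.set_inheritable(fd, True)
    return fd


def drop_privileges(uid: int, gid: int) -> None:
    """Permanently switch to an unprivileged user.

    Order matters: supplementary groups and gid must be changed before uid,
    because after setuid the process no longer has the right to change them.
    """
    os.setgroups([])
    os.setgid(gid)
    os.setuid(uid)


def launch(monitor_argv: List[str], uid: int, gid: int,
           kvm_path: str = "/dev/kvm") -> None:
    """Open the device, drop privileges, then replace this process."""
    fd = open_kvm(kvm_path)
    drop_privileges(uid, gid)
    # Hypothetical convention: tell the monitor which fd to use via KVM_FD.
    env = {**os.environ, "KVM_FD": str(fd)}
    os.execvpe(monitor_argv[0], monitor_argv, env)
```

After `launch` the running program holds an already-open descriptor for the device but none of the privileges that were needed to obtain it.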
  31. 31. MirageOS 3 + Solo5 Source: Dan Williams and Ricardo Koller, IBM Research, HotCloud 16
  32. 32. MirageOS 3 + Solo5 • Due for stable release in the next month. • Intended to be "unikernel template" for other projects to share hypervisor code. • Liberally licensed under BSD/Apache2/ISC to encourage adoption and embedding. • BoF and tutorials tomorrow to demonstrate it. Developers are all here and hacking!
  33. 33. Demo!
  34. 34. How can distributed systems use hardware protection more flexibly and composably?
  35. 35. Questions? Download free at docker.com Twitter: @avsm https://github.com/docker/hyperkit https://github.com/docker/vpnkit https://github.com/docker/datakit https://github.com/mirage/ We will be hacking tomorrow!
  36. 36. Backup Slides
  37. 37. • Challenge: Share arbitrary OSX directory tree into Linux container without requiring extensive modification of either side. • Solution: Use a FUSE forwarding layer and translate Linux filesystem calls to OSX equivalents. OSX Host Linux Host Container VOLUME com.docker.osxfs Track extra metadata Translate to OSX filesystem calls FUSE Filesystem Sharing
  38. 38. • Challenge: Need filesystem activation so events on the Mac wake up container servers and vice-versa. • Solution: osxfs uses FSEvents API and injects inotify activation events into container. OSX Host Linux Host Container VOLUME com.docker.osxfs FSEvents watches open files Events from Linux cause OSX apps to wake up FUSE Filesystem Sharing
  39. 39. • Challenge: Need filesystem activation so events on the Mac wake up container servers and vice-versa. • Solution: osxfs uses FSEvents API and injects inotify activation events into container. OSX Host Linux Host Container VOLUME com.docker.osxfs FSEvents watches open files Events from Linux cause OSX apps to wake up FUSE Filesystem Sharing
  40. 40. • Challenge: Deal with custom VPN software on the host that makes it difficult to bridge. • Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets. OSX Host Linux Host Container RUN <...> com.docker.hyperkit-net Reconstruct traffic TCP flows Translate to OSX socket calls Ethernet bridge DHCPv4 NTP Networking
  41. 41. OSX Host Linux Host Privileged Port Service Container EXPOSE Port Service VSock Binder RUN <...> VSock Listener Userland Proxy • Challenge: Services publishing ports should be exposed on localhost without needing VM info. • Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface. Networking
  42. 42. $ docker run resin/armv7hf-debian uname -a Linux 7ed2fca7a3f0 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 armv7l GNU/Linux $ docker run justincormack/ppc64le-debian uname -a Linux edd13885f316 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 ppc64le GNU/Linux Multi-CPU architectures
