
XPDS16: A Paravirtualized Interface for Socket Syscalls - Dimitri Stiliadis, Aporeto


Docker and other container runtimes are gathering momentum and becoming the new industry standard for server applications. Linux namespaces, commonly used to run Docker apps, come with a large surface of attack which is difficult to reduce. Intel’s Clear Containers use KVM to run containers as VMs to provide additional isolation. It is possible to provide VM-like isolation for containers without sacrificing performance.

This talk focuses on the benefits of using Xen to provide an execution environment for Docker apps. The presentation starts by listing the requirements of this environment. It explains why monitoring container syscalls is important and what its security benefits are. The talk introduces a new paravirtualized protocol to virtualize IP sockets and provides the design and implementation details. The presentation clarifies the impact of the new protocol from a security perspective. The discussion concludes by comparing performance figures with the traditional PV network frontend and backend drivers in Linux, explaining the reasons for any performance gaps.



  1. A Para-virtualized Interface for Socket Calls • Dimitri Stiliadis, Founder/CEO – Aporeto, @dstiliadis • Stefano Stabellini, Linux Kernel Lead – Aporeto, @stabellinist
  2. Overview • Why are we working on this (and what it is not)? • Use cases • Proposed protocol • Performance • Demo
  3. Security Threats in a Container Environment
     • Namespace configuration (capabilities, seccomp, SELinux, AppArmor): secure defaults, but several ways to mess it up
     • Container images and sources (validation, vulnerability analysis): addressed by several tools (image scanning, signatures)
     • Access control to the management daemon: delegated to management systems
     • Networking & communications: several options available
     • The kernel itself: ?
  4. Security Recommendations (from NCC White Paper)
     From “Understanding and Hardening Linux Containers” by NCC Group:
     • Run unprivileged containers (user namespaces, root capability dropping)
     • Apply a Mandatory Access Control system, such as SELinux
     • Build a custom kernel binary with as few modules as possible
     • Apply sysctl hardening
     • Apply disk and storage limits
     • Control device access and limit resource usage with cgroups
     • Drop any capabilities which are not required for the application within the container
     • Use custom mount options to increase defense in depth
     • Apply grsecurity and PaX patches to Linux
     • Reduce the Linux attack surface with seccomp-bpf
     • Isolate containers based on trust and exposure
     • Logging, auditing and monitoring are important for container deployments
     • Use hardware virtualization along application trust zones
  5. It’s Complicated • Picture from Don Norman’s talk “Living with Complexity”
  6. Security Risks - ”The Kernel Itself” (diagram: an attack from apps/Docker exploits the Linux kernel and disables in-kernel security mechanisms such as seccomp/BPF and netfilter)
  7. And the Kernel is not Free of Bugs • http://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33
  8. The Alternative: Containers in VMs
     • Hypervisor with a kernel and virtual devices per container OS: hardware virtualization gives isolation and device abstraction, but a different OS between hypervisor and guests and significant I/O overheads
     • Shared kernel with containers: simplicity and the same kernel for all containers, but limited hardware isolation
  9. The Virtualization Overhead: Example Network (diagram: traffic traverses two kernels, each with its own TCP/IP stack, namespace bridge and virtual devices) • And of course, managing security in multiple kernels
  10. What We Really Want • Container performance • Virtual machine isolation • ?
  11. What If We Thought of Virtualization a Little Differently?
     • Hardware virtualization assumptions: host and guest OS are different; run any guest on any host; VM moves, etc.
     • OS virtualization assumptions (Docker model): all guests share the same type of kernel; all guests are of the same type; we don’t care about moves
  12. System Call Virtualization
     • Introduce a proxy kernel, same as the root kernel: allows memory page re-use, and a single kernel to manage
     • Subset of syscalls delivered to the machine kernel: socket, file, time
     • Majority of system calls restricted within the syscall proxy
     (diagram: containers in ring 3 issue syscalls to a per-container syscall kernel proxy; unprotected calls stay there, while the subset above is proxied/translated to the root kernel via hypercalls)
  13. Example Implementation in Xen (diagram: the app/container in a Linux DomU issues syscalls; the syscall frontend forwards PV Calls over the Xen PV interface to the syscall backend in Dom0, while all other syscalls are handled inside the DomU)
  14. Why Xen? • Efficient para-virtualization interface • Allows deployments on bare metal and in the cloud • Xen on GCP • (More complex, though, to do Xen-on-Xen in AWS with para-virtualized I/O)
  15. Example: Network Access
     • Translate socket calls to hypercalls
     • Container opens a “paravirtualized socket” inside the host OS
     • Uses the host’s native IP stack
     • Security and forwarding policies applied at the host
     (diagram: connect() from the container is proxied to the host, which performs the connect from its own IP stack and NIC, 10.1.1.5)
  16. Example: Network Access with Namespaces
     • Container namespace created at the host as before
     • Container process is launched inside a protected VM
     • Through system call virtualization, system calls are applied in the namespace context
     • Container gets the IP address of the network namespace
     • Transparent to Docker and other container systems
     (diagram: connect() is proxied into the host-side namespace, 192.168.2.1, attached to a bridge)
  17. PV Calls for Networking
     • Ports opened in a VM are opened on the host
     • Enables cross-domain network namespaces and SELinux labels
     • Zero-conf networking in VMs: no need for a bridge in Dom0; works with wireless networks, VPNs, and any other special configurations in Dom0
  18. First Implementation
     • Design document: http://marc.info/?l=xen-devel&m=147033114613017
     • Code: first, simple implementation on Xen
       • 1 command ring
       • Per socket: data ring and event ring
       • Variable ring data sizes, configurable per socket
     • Supported functions: socket, connect, release, bind, listen, accept, poll
     • git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-5
  19. PV Calls Benchmarks • Xen 4.7.0-rc3, Linux v4.6-rc2 • Dom0: 4 vCPUs, pinned, 28 GB RAM • DomU: 4 vCPUs, pinned, 4 GB RAM
  20. (diagram: iperf -c 127.0.0.1 in the app/container in a Linux DomU, iperf -s in Dom0, connected over PV Calls through the Xen POSIX PV interface)
  21. PV Calls (benchmark chart)
  22. (diagram: iperf -s in the app/container in a Linux DomU, iperf -c 127.0.0.1 in Dom0, connected over PV Calls through the Xen POSIX PV interface)
  23. PV Calls (benchmark chart)
  24. ?! PV Calls (benchmark chart)
  25. How is that possible?
  26. How is that possible?
  27. PV Calls (benchmark chart)
  28. And you use something like this today (Docker for Mac and VPNKit)
     • The “simplistic” version of the syscall proxy: a socket proxy between the containers and the macOS kernel
     • “VPNKit operates by reconstructing Ethernet traffic from the VM and translating it into the relevant socket API calls on OSX or Windows. This allows the host application to generate traffic without requiring low-level Ethernet bridging support.”
  29. First Implementation
     • Design document: http://marc.info/?l=xen-devel&m=147033114613017
     • Code: first, simple implementation on Xen
       • 1 command ring
       • Per socket: data ring and event ring
       • Variable ring data sizes, configurable per socket
     • Supported functions: socket, connect, release, bind, listen, accept, poll
     • git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-5
  30. Extensions
     • Mechanism is generic and can be extended to other system calls
     • Co-processing of system calls is also possible: the guest can process system call parameters and translate them at the hypercall
       • Resolve memory references (pointers)
     • Resolves the TOCTOU (Time of Check/Time of Use) risk of system call interposition
     • Using N/N+1 kernel versions can reduce the attack surface further
  31. Demo 1: Performance Comparison (side-by-side diagram: a container running directly on the host kernel vs. a container behind the syscall proxy with syscall virtualization) • No noticeable performance difference
  32. Demo 2: Kernel Exploit • Plain Docker: the vulnerable container crashes the machine and all other containers • With syscall virtualization: the vulnerable container crashes itself only, and the attack is contained
  33. Thank You! We are hiring! stefano@aporeto.com, dimitri@aporeto.com
