NUSE (Network Stack in Userspace) at #osio

IIJ Innovation Institute
Nov. 25, 2014
  1. Network Stack in Userspace (NUSE). Hajime Tazaki, Ryo Nakamura (University of Tokyo). New Directions in Operating Systems, London, 2014.
  2. Motivation: the implementation of the Internet is not finished yet. We want faster evolution of OSes (network stacks) and OS personalization.
  3. I have a new Layer-3/4 protocol! Yay! I have a new, great Layer-3/4 protocol! It will change the WORLD! Replace the network stack? No: that would destroy my life (experimental? not tested?). Yes: I wanna be your slave. Slow evolution of the network stack? A VM on a personal device?
  4. Virtual Machine? Poll: "When you download and run software, how often do you use a virtual machine (to reduce security risks)?" Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010.
  5. Slow evolution of the network stack. Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, Num. 2, April 2014. [Figure 1: TCP options deployment over time (SACK, Timestamp, Windowscale; inbound and outbound flows), 2007-2012.] Excerpt: deployment happens infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely. For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux
  6. Meanwhile, in the filesystem world... there is Filesystem in Userspace (FUSE). Userspace code can host a new filesystem (sshfs, GmailFS, etc.). Performance is bad, but that doesn't matter; flexibility and functionality do matter. http://fuse.sourceforge.net/
  7. Alternatives. Containers (LXC, OpenVZ, vimage): share the kernel with the host operating system (no flexibility). Library OS, from scratch: mtcp, Mirage, lwIP. Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?). Glue layer: LKL (Linux 2.6), rumpkernel (NetBSD).
  8. Network Stack in Userspace
  9. What's NUSE? Network Stack in Userspace: a library operating system; a library version of the network stack (of a monolithic kernel); Linux (latest), FreeBSD (planned); (UNIX) process-based virtualization. [Diagram: an example application links libnuse (bypassed TCP/IP, ARP/ndisc) alongside glibc in userspace, reaching the NIC via raw socket, netmap, DPDK, etc., instead of the kernel stack.]
  10. Why NUSE? Minimized porting effort: Linux (net-next) changes frequently. A fully functional network stack for netmap, DPDK (any kernel-bypass technology).
  11. How it works. [Architecture diagram: application on top of a POSIX glue layer; kernel layer with TCP, UDP, DCCP, SCTP, ICMP, ARP, IPv6, IPv4, Qdisc, Netfilter, Bridging, Netlink, IPsec, Tunneling; NUSE core with bottom halves/RCU/timer/interrupt and a petit-scheduler; struct net_device backends (RAW, DPDK, netmap, ...) down to the NIC.] Components: 1. (monolithic) kernel source; 2. scheduler; 3. POSIX glue (redirects system calls); 4. network I/O (raw socket, DPDK, netmap, etc.).
  12. 1) kernel build: a patch to the kernel tree adds a new (hardware-independent) arch (arch/sim); robust to (frequent) mainstream changes.
  13. 2) scheduler: offers alternate context primitives for interrupts, timers, threads, and bottom halves (tasklet, workqueue, waiter, etc.); wrapped with POSIX threads (easily debuggable); ucontext fibers for lower overhead (not yet).
  14. 3) POSIX glue code: hijack function calls (socket => nuse_socket, read => nuse_read); apps are not aware of it: LD_PRELOAD=libnuse.so ..
  15. 4) network I/O: connects NUSE to the NIC. Options: raw socket (default), DPDK (if available), netmap (if available), TAP.
  16. Usage. Download: git clone git://github.com/libos-nuse/net-next-nuse. Compile: make library ARCH=sim (NETMAP=yes) (DPDK=yes). Execute: sudo NUSECONF=nuse.conf ./nuse (application)
  17. Configuration (nuse.conf):
      # Interface definition.
      interface eth0
      address 192.168.0.10
      netmask 255.255.255.0
      macaddr 00:01:01:01:01:01
      viftype TAP

      interface p1p1
      address 172.16.0.1
      netmask 255.255.255.0
      macaddr 00:01:01:01:01:02

      # route entry definition.
      route
      network 0.0.0.0
      netmask 0.0.0.0
      gateway 192.168.0.1
  18. (Possible) use cases. New protocol deployment: Chrome + Linux mptcp (on NUSE). Process-level virtual instances: % NUSE-linux-ovs | NUSE-freebsd-NAT | NUSE-router | NUSE-nginx (VM chaining via the UNIX command line).
  19. Limitations (ongoing work): no fork(2)/exec(2) support; no multi-process support; no sysctl/proc; (inefficient) thread scheduling.
  20. Experiments. 1. Can we benefit from OS personalization? Present a custom (NUSE) kernel with an application (OS personalization). 2. How much overhead does NUSE add? Simple performance measurements.
  21. Tested applications. Working: ping, iperf, nginx (partially), sleep. Need patches: nc, wget, dig, host.
  22. Setup: performance measurement. Topology: Tx -> (10G) -> NUSE -> (10G) -> Rx, with ping/flowgen as traffic sources and vnstat (packet count) for measurement.
      NUSE node: CPU Xeon E5-2650v2 @ 2.60GHz (16 cores), 32GB memory, Intel X520 NIC.
      Tx/Rx nodes: CPU Xeon L3426 @ 1.87GHz (8 cores), 4GB memory, Intel X520 NIC.
  23. Host Tx (NUSE -> receiver).
      ping RTT (ms):   avg    max    min
        dpdk           2.610  8.000  0.156
        netmap         0.370  0.494  0.252
        raw            0.396  0.501  0.290
        tap            0.397  0.538  0.303
      [Charts: UDP throughput (1024-byte packets, Mbps, 0-500 scale) and ping RTT (ms, 0-8 scale) per backend.]
  24. L3 routing (sender -> NUSE -> receiver).
      ping RTT (ms):   avg     max     min
        dpdk           11.998  27.700  0.252
        netmap         0.664   0.741   0.556
        raw            0.663   0.761   0.575
        tap            0.694   0.749   0.602
      [Charts: UDP throughput (1024-byte packets, Mbps, 0-500 scale) and ping RTT (ms, 0-30 scale) per backend.]
  25. Discussion. Performance is not so bad, and we don't care much about performance anyway. The network stack is fully functional, but the supplemental tools are not yet sufficient.
  26. Network simulator integration (ns-3): network stack + ns-3 network simulator = Direct Code Execution (DCE). Established by Mathieu Lacage (2006); part of the ns-3 project. Features: reproducible (deterministic clock), controllable (via the simulator's facilities). http://www.nsnam.org/overview/projects/direct-code-execution/
  28. NUSE vs DCE.
                       NUSE                   DCE
      kernel library   ARCH=sim               ARCH=sim (shared)
      scheduler        (host) pthread         simulator's scheduler (deterministic)
      POSIX            hijack                 hijack
      network I/O      raw/netmap/DPDK/tap    ns3::NetDevice
      execution        LD_PRELOAD             dlmopen(3), single process / multiple instances
  29. DCE architecture. [Diagram, three layers on top of ns-3 (network simulation core): 1) core layer (memory virtualization: heap, stack); 2) kernel layer (TCP, UDP, DCCP, SCTP, ICMP, ARP, IPv6, IPv4, Qdisc, Netfilter, Bridging, Netlink, IPsec, Tunneling, bottom halves/RCU/timer/interrupt, struct net_device); 3) POSIX layer hosting applications (ip, iptables, quagga), alongside native ns-3 applications and the ns-3 TCP/IP stack.]
  30. Bug reproducibility. [Scenario: a mobile node hands off between AP1 and AP2 over Wi-Fi, with a home agent, while running ping6 to a correspondent node.]
      (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
      Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
      <continue>
      (gdb) bt 4
      #0 mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
      #1 0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:199
      #2 0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:232
      #3 0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
  31. Debugging: memory error detection among distributed nodes in a single process using Valgrind.
      ==5864== Memcheck, a memory error detector
      ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
      ==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
      ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
      ==5864==
      ==5864== Conditional jump or move depends on uninitialised value(s)
      ==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
      ==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
      ==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
      ==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
      ==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
      ==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
      ==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
      ==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
      ==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
      ==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
      ==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
      ==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manager.
      ==5864== Uninitialised value was created by a stack allocation
      ==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
  32. Fine-grained parameter coverage: code coverage measurement with DCE, with fine-grained network, node, and protocol parameters.
  33. Continuous integration: http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/
  34. Summary. NUSE (Network Stack in Userspace): OS personalization (fast evolution, easy deployment). DCE (Direct Code Execution): flexible network experiments/tests with a deterministic clock.
  35. github: https://github.com/libos-nuse/net-next-nuse, DCE: http://bit.ly/ns-3-dce, twitter: @thehajime
  36. Backups
  37. Potentials of userspace networking: high-performance networking; useful debugging facilities; operating system personalization.
  38. 1) kernel build: build the kernel source tree with the patch: make menuconfig ARCH=sim; make library ARCH=sim ➔ libnuse-linux-3.17-rc1.so
  39. Example: how a timer works. add_timer() => TIMER_SOFTIRQ => timer_list => run_timer_softirq() => timer handler, all running in a timer thread (timer_create(2)).
  40. 3) POSIX glue code (https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/sim/nuse-glue.c):

      extern int sim_sock_socket (int, int, int, struct socket **);

      int socket (int family, int type, int proto)
      {
        sim_update_jiffies ();
        struct socket *kernel_socket = sim_malloc (sizeof (struct socket));
        memset (kernel_socket, 0, sizeof (struct socket));
        int ret = sim_sock_socket (family, type, proto, &kernel_socket);
        g_fd_table[curfd++] = kernel_socket;
        sim_softirq_wakeup ();
        return curfd - 1;
      }
  41. Tx callgraph: sendmsg() (socket API) => sim_sock_sendmsg() (NUSE) => sock_sendmsg() => ip_send_skb() => ip_finish_output2() => dst_neigh_output() (ex-kernel) => neigh_resolve_output() => arp_solicit() => dev_queue_xmit() => sim_dev_xmit() (NUSE) => nuse_vif_raw_write()
  42. Rx callgraph. vNIC rx: start_thread() (pthread) => nuse_netdev_rx_trampoline() => nuse_vif_raw_read() (NUSE) => sim_dev_rx() => netif_rx() (ex-kernel). softirq rx: start_thread() (pthread) => do_softirq() (NUSE) => net_rx_action() => process_backlog() (ex-kernel) => __netif_receive_skb_core() => ip_rcv()