Network Stack in
Userspace (NUSE)
Hajime Tazaki
Ryo Nakamura
(University of Tokyo)
New Directions in Operating Systems
London, 2014
Motivation
Implementation of the Internet
is not finished yet
Faster evolution of OSes (network stack)
OS personalization
I have a new Layer-3/4 protocol! Yay!
I have a new, great Layer-3/4 protocol! It will change the WORLD!
Replace the network stack?
No: destroy my life?! (experimental? not tested?)
Yes: I wanna be your slave.
Slow evolution of the network stack?
VM on a personal device?
Virtual Machine?
Poll: “When you download and run software, how often do you use a virtual machine (to reduce security risks)?”
Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010
Slow evolution of network stack
Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, Num. 2, April 2014
[Figure 1: TCP options deployment over time. Ratio of flows (0.00 to 1.00) by date (2007 to 2012) for the options SACK, Timestamp, and Windowscale, in the inbound and outbound directions.]
Excerpt from the paper: “[...]pen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely. For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux [...]”
Meanwhile, in the filesystem world...
There is Filesystem in Userspace (FUSE)
Userspace code can host a new filesystem (sshfs, GmailFS, etc.)
Performance is bad, but that doesn’t matter
Flexibility and functionality do matter
http://fuse.sourceforge.net/
Alternatives
Container (LXC, OpenVZ, vimage)
shares the kernel with the host operating system (no flexibility)
Library OS
From scratch: mTCP, Mirage, lwIP
Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
Glue layer: LKL (Linux-2.6), rump kernel (NetBSD)
What’s NUSE?
Network stack in Userspace
A library operating system
Library version of the network stack (of a monolithic kernel)
Linux (latest), FreeBSD (planned)
(UNIX) process-based virtualization
[Diagram: NUSE example. Userspace: glibc and libnuse (TCP/IP, ARP/ndisc). Kernel: bypassed. Path to the NIC: raw socket, netmap, DPDK, etc.]
Why NUSE ?
minimized porting effort
Linux (net-next) changes frequently
fully functional network stack for
netmap
DPDK
(any kernel-bypass technology)
How it works
[Diagram: layered architecture, top to bottom]
Application
POSIX glue
Kernel layer: TCP, UDP, DCCP, SCTP; ICMP, ARP; IPv6, IPv4; Qdisc; Netfilter, Bridging; Netlink; IPSec, Tunneling
NUSE core: bottom halves/rcu/timer/interrupt; struct net_device; petit-scheduler
RAW, DPDK, netmap, ...
NIC
1. (monolithic) kernel source
2. scheduler
3. POSIX glue: redirect system calls
4. network I/O: raw socket, DPDK, netmap, etc.
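The way components 3 and 4 fit together can be modelled in a few lines. This is a toy sketch under stated assumptions, not NUSE's code: `Backend`, `LoopbackBackend`, `UserspaceStack`, and `make_glue` are hypothetical names, the per-descriptor dispatch is an assumed design, and the "stack" only prepends a protocol tag instead of running the Linux TCP/IP code.

```python
from typing import Callable, Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical network I/O backend (stands in for raw socket, DPDK, netmap)."""

    def transmit(self, frame: bytes) -> None: ...


class LoopbackBackend:
    """Toy backend that records frames instead of touching a NIC."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def transmit(self, frame: bytes) -> None:
        self.sent.append(frame)


class UserspaceStack:
    """Stand-in for the library network stack (the kernel code lives here in NUSE)."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.next_fd = 3
        self.sockets: Dict[int, str] = {}

    def socket(self, proto: str) -> int:
        fd = self.next_fd
        self.next_fd += 1
        self.sockets[fd] = proto
        return fd

    def send(self, fd: int, payload: bytes) -> int:
        # A real stack would build TCP/IP headers; here a tag marks the protocol.
        header = self.sockets[fd].encode() + b"|"
        self.backend.transmit(header + payload)
        return len(payload)


def make_glue(stack: UserspaceStack, host_send: Callable[[int, bytes], int]):
    """Return a send() that redirects descriptors owned by the library stack
    and falls through to the host implementation for everything else."""

    def send(fd: int, payload: bytes) -> int:
        if fd in stack.sockets:
            return stack.send(fd, payload)
        return host_send(fd, payload)

    return send
```

In this model, swapping `LoopbackBackend` for another object with a `transmit` method changes the network I/O path without touching the stack or the glue, which is the property the layering is meant to provide.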
1) kernel build
patch to the kernel tree with a new (hardware-independent) arch (arch/sim)
robust to (frequent) mainline changes
(possible) use cases
New protocol deployment
Chrome + Linux mptcp (on NUSE)
Process-level virtual instance
% NUSE-linux-ovs | NUSE-freebsd-NAT | NUSE-router | NUSE-nginx
VM chaining via UNIX command line
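The chaining idea can be tried with ordinary processes and pipes. The sketch below is only an analogy: each stage is a tiny Python child process that tags text lines, standing in for a NUSE instance, and the stage names and the line-based "packet" format are assumptions made for illustration. How real NUSE instances would exchange packets is not specified here.

```python
import subprocess
import sys

# Hypothetical stage: reads "packets" (text lines) on stdin, prefixes its own
# name, and writes them to stdout.
STAGE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(sys.argv[1] + '>' + line)\n"
)


def run_chain(names, packets):
    """Connect one process per name with pipes, like `a | b | c` in a shell.

    Suitable for small inputs only: everything is written before reading.
    """
    procs = []
    for name in names:
        stdin = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(
            [sys.executable, "-c", STAGE, name],
            stdin=stdin,
            stdout=subprocess.PIPE,
            text=True,
        )
        if procs:
            # The parent drops its copy of the pipe so EOF propagates downstream.
            procs[-1].stdout.close()
        procs.append(proc)

    procs[0].stdin.write("".join(p + "\n" for p in packets))
    procs[0].stdin.close()
    output = procs[-1].stdout.read()
    for proc in procs:
        proc.wait()
    return output.splitlines()


if __name__ == "__main__":
    for line in run_chain(["ovs", "nat", "router"], ["pkt1", "pkt2"]):
        print(line)
```

Each stage is an independent UNIX process, so a stage can be replaced or restarted without affecting the others, which is the appeal of process-level instances over full virtual machines.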
Limitations (ongoing work)
no fork(2)/exec(2) support
no multi-process support
no sysctl/proc
(inefficient) thread scheduling
Experiments
1. Can we benefit from OS personalization?
bundle a custom (NUSE) kernel with an application (OS personalization)
2. How much overhead does NUSE add?
simple performance measurements
Bug reproducibility
[Diagram: Home Agent; Wi-Fi access points AP1 and AP2; mobile node performing a handoff between the access points; ping6; correspondent node.]
(gdb) b mip6_mh_filter if dce_debug_nodeid()==0
Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
<continue>
(gdb) bt 4
#0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
#1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:199
#2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:232
#3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
Debugging
==5864== Memcheck, a memory error detector
==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
==5864==
==5864== Conditional jump or move depends on uninitialised value(s)
==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318)
==5864== by 0x7D2313F: process_backlog (dev.c:3368)
==5864== by 0x7D23455: net_rx_action (dev.c:3526)
==5864== by 0x7CF2477: do_softirq (softirq.c:65)
==5864== by 0x7CF2544: softirq_task_function (softirq.c:21)
==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manager.
==5864== Uninitialised value was created by a stack allocation
==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
==5864==
Memory error detection among distributed nodes in a single process using Valgrind