Network Stack in Userspace (NUSE)
Hajime Tazaki 
High-Speed PC Router Study Group (高速PCルーター研究会), 2014/9/29
Today's talk
• Userspace version of the (Linux) network stack
  • not intended as a high-speed stack in itself
  • but useful together with high-speed network I/O
I have a new Layer-3/4 protocol! Yay!
• I have a new, great Layer-3/4 protocol! It will change the WORLD!
• Do you want to replace your network stack?
  • No: your code will destroy my life?! (experimental? not tested?)
  • Yes: I wanna be your slave.
  • VM cloud: OK, not many users/services to interfere with
  • multi-user server, PC, phone: a nightmare, my life will be in trouble...
I have a new Layer-3/4 protocol! Yay! (cont'd)
• Kernel programming sucks
  • LKM? Can still cause a kernel panic...
  • Click? Only for routers/middleboxes, not for end hosts
• Slow evolution
• VM? Hmm, I'm a lazy guy...
Slow evolution of network stack
[Figure 1: TCP options deployment over time. Ratio of flows using the SACK, Timestamp and Window Scale options, inbound and outbound, 2007 to 2012.]
"...happen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely. For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux ..."
Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, No. 2, April 2014
Virtual Machine?
Poll: "When you download and run software, how often do you use a virtual machine (to reduce security risks)?"
Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, Microsoft Research Technical Report MSR-TR-2010-51, May 2010
Meanwhile in the filesystem world...
• There is Filesystem in Userspace (FUSE)
  • Userspace code can host a new filesystem (sshfs, GmailFS, etc.)
  • Performance is poor, but that doesn't matter
  • Flexibility and functionality do matter
http://fuse.sourceforge.net/
Problem Statements
• Slow evolution of the network stack
• Interference with the host OS (which must stay untouched)
• VMs are too heavyweight
What's NUSE?
• Network stack in Userspace
  • userspace as much as possible
  • like FUSE (Filesystem in Userspace)
• Library version of the network stack (of a monolithic kernel)
  • kernel bypassed
• (UNIX) process-based virtualization
What can you do with NUSE?
• Host operating system
  • Linux (for the moment)
• Guest operating systems
  • Linux (3.17-rc1 based)
  • FreeBSD (ongoing)
• Works well with kernel-bypass technologies
  • DPDK/netmap with a (full) network stack + (existing) applications
• Applications
  • ping, iperf, nginx (partially working)
FUSE vs NUSE
[Diagram: FUSE example: "ls -l /tmp/fuse" (glibc) → kernel VFS → FUSE (alongside ext3, NFS, ...) → userspace filesystem "example" (libfuse, glibc). NUSE example: application (glibc) → libnuse with TCP/IP and ARP/ndisc in userspace → raw socket / netmap / DPDK (etc.) → NIC, with the kernel bypassed.]
Design Goals
• No modification to userspace apps
• No modification to kernel space either
• Transparent
  • LD_PRELOADable
• Same (1x) performance as the native OS
Recipe
[Diagram: NUSE architecture, top to bottom: Application → POSIX glue → Kernel layer (TCP, UDP, DCCP, SCTP; ICMP, ARP; IPv6, IPv4; Qdisc; Netfilter, Bridging; Netlink; IPSec, Tunneling) → NUSE core (bottom halves/RCU/timer/interrupt, struct net_device, petit-scheduler) → RAW / DPDK / netmap / ... → NIC]
1. (Monolithic) kernel source
2. Petit-scheduler
3. POSIX glue
  • redirects system calls (at libc level)
4. Network I/O
  • raw socket, DPDK, netmap, etc.
1) kernel build 
• Patch the kernel tree
  • adds a new (hardware-independent) architecture (arch/sim)
  • robust to (frequent) mainline changes
• Build the kernel source tree with the patch
  • make menuconfig ARCH=sim
  • make library ARCH=sim
  • ➔ libnuse-linux-3.17-rc1.so
2) petit scheduler 
• offers replacements for the kernel's execution-context primitives
• interrupts, timers, threads, bottom halves (tasklet, workqueue, waiter, etc.)
• Implemented with POSIX threads
  • easily debuggable
  • ucontext fibers for lower overhead (not yet implemented)
3) POSIX glue code 
• Hijack function calls
  • socket => nuse_socket
  • read => nuse_read
• libc-level hijack
  • apps are not aware of it
  • LD_PRELOAD=libnuse.so ..
  • can't catch raw system calls (int 0x80)
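#include <errno.h>

struct socket;
extern int sim_sock_socket (int, int, int, struct socket **);
extern void sim_update_jiffies (void);
extern void sim_softirq_wakeup (void);

/* Replaces libc's socket(): creates the socket inside the library kernel */
int socket (int family, int type, int proto)
{
  struct socket *kernel_socket = NULL;  /* allocated by sim_sock_socket() */
  int ret;

  sim_update_jiffies ();                /* bring kernel time up to date */
  ret = sim_sock_socket (family, type, proto, &kernel_socket);
  if (ret < 0) {                        /* kernel convention: -errno */
    errno = -ret;
    return -1;
  }
  g_fd_table[curfd++] = kernel_socket;  /* NUSE-private fd table (simplified) */
  sim_softirq_wakeup ();                /* let pending bottom halves run */
  return curfd - 1;
}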
https://github.com/thehajime/net-next-nuse/blob/nuse/arch/sim/nuse-glue.c
4) network I/O 
• Connect NUSE to the NIC
• Options
  • raw socket (general)
  • DPDK (if available)
  • netmap (if available)
  • tap?
/* Transmit: hand the frame to the userspace I/O backend (raw socket, DPDK, netmap) */
static netdev_tx_t
kernel_dev_xmit(struct sk_buff *skb, struct net_device *dev)
{
  netif_stop_queue(dev);
  /* the cast assumes struct net_device sits at the start of struct SimDevice */
  sim_dev_xmit ((struct SimDevice *)dev, skb->data, skb->len);
  dev_kfree_skb(skb);
  netif_wake_queue(dev);
  return NETDEV_TX_OK;
}

static const struct net_device_ops sim_dev_ops = {
  .ndo_start_xmit = kernel_dev_xmit,
  .ndo_set_mac_address = eth_mac_addr,
};

/* Receive: called by the I/O backend when a frame arrives from the NIC */
void sim_dev_rx (struct SimDevice *device, struct SimDevicePacket packet)
{
  struct sk_buff *skb = packet.token;
  struct net_device *dev = &device->dev;

  skb->protocol = eth_type_trans(skb, dev);
  /* Do the TCP checksum (FIXME: should be configurable) */
  skb->ip_summed = CHECKSUM_PARTIAL;
  netif_rx (skb);   /* queue the frame for the stack's softirq processing */
}
https://github.com/thehajime/net-next-nuse/blob/nuse/arch/sim/sim-device.c
How to use NUSE?
• Download
  • git clone git://github.com/thehajime/net-next-nuse
• Compile
  • make library ARCH=sim NETMAP=yes
• Execute
  • sudo ./nuse (application)
  • success? Lucky you!
  • failure? Add the missing hijack calls
Alternatives 
• Containers (LXC, OpenVZ, vimage)
  • share the kernel with the host operating system (no flexibility)
• Virtual machines (KVM, Xen, UML)
  • flexible/functional, but heavy bootstrap
• Library OS
  • written from scratch: mTCP, Mirage, lwIP
  • ported: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
  • glue layer: LKL (Linux 2.6), Rump (NetBSD)
Alternatives (cont'd)
Rump kernel
• https://github.com/rumpkernel/wiki/wiki
• One binary runs everywhere
  • Linux, xBSD, Solaris, Cygwin hosts
  • Xen DomU
  • Bare metal (hardware, KVM, VirtualBox)
• Well-defined API (hypercalls)
• Only the NetBSD network stack is available
Evaluation
• Performance?
  • not good so far...
• Generality
  • Can it run all applications? Limited by POSIX coverage
Next time...
Ongoing work
• (efficient) thread scheduling
• batch Tx/Rx
• fork(2)/exec(2)
• multiple processes
• => migrate to rumpkernel?
Summary 
• Network Stack in Userspace (NUSE) 
  • network stack library
  • lightweight virtualization
  • fast evolution, easy deployment
https://github.com/thehajime/net-next-nuse
GASPP: A GPU-Accelerated Stateful Packet Processing Framework
Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis, GASPP: A GPU-Accelerated Stateful Packet Processing Framework, USENIX ATC 2014, June 2014
