Library Operating System for Linux #netdev01

Hajime Tazaki
Hajime TazakiIIJ Innovation Institute
Library Operating
System with Mainline
Linux Network Stack
!
Hajime Tazaki, Ryo Nakamura, Yuji Sekiya	
netdev0.1, Feb. 2015
Motivation
Why kernel space ?	
Packets were expensive in 1970’	
Why not userspace ?	
well grown in decades, costs degrades	
obtain network stack personalization	
controllable by userspace utilities
2
Userspace network stacks
A lot of userspace network stack	
full scratch: mTCP, Mirage, lwIP	
Porting: OSv, Sandstorm, libuinet (FreeBSD),
Arrakis (lwIP), OpenOnload (lwIP?) 	
Motivated by their own problems (specialized NIC,
cloud, high-speed Apps)	
Writing a network stack is 1-week DIY,	
but writing opera-table network stack is decades
DIY (which is not DIY)
3
Questions
How to benefit matured network stack
in userspace ?	
How to trivially introduce your idea
on network stack ?	
xxTCP, IPvX, etc..	
How to flexibly test your code with a
complex scenario ?
4
The answers
Using Linux network stack as-is	
!
as a userspace Library (library
operating system)
5
This talk is about
an introduction of a library
operating system for Linux	
and its implementation	
with a couple of useful use cases
6
Outlook (design)
hardware-independent arch (arch/lib)	
3 components	
Host backend layer	
Kernel layer	
POSIX layer
7
https://github.com/libos-nuse/net-next-nuse
Outlook (cont’d)
8
ARP
Qdisc
TCP UDP DCCP SCTP
ICMP IPv4IPv6
Netlink
BridgingNetfilter
IPSec Tunneling
Kernel layer
Host backend layer
bottom halves/
rcu/timer/
interrupt
struct
net_device
scheduler
netdev
clock
source
POSIX glue layer
Application
1) Build Linux srctree
w/ glues as a library
2) put backend!
(vNIC, clock source,!
scheduler) and bind
3) add POSIX glue code
4) applications
magically runs
Kernel glue code
9
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c
void schedule(void)!
{!
! lib_task_wait();!
}!
signed long schedule_timeout(signed long timeout)!
{!
! u64 ns;!
! struct SimTask *self;!
!
! if (timeout == MAX_SCHEDULE_TIMEOUT) {!
! ! lib_task_wait();!
! ! return MAX_SCHEDULE_TIMEOUT;!
! }!
! lib_assert(timeout >= 0);!
! ns = ((__u64)timeout) * (1000000000 / HZ);!
! self = lib_task_current();!
! lib_event_schedule_ns(ns, &trampoline, self);!
! lib_task_wait();!
! /* we know that we are always perfectly on time. */!
! return 0;!
}
POSIX glue code
10
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c
int nuse_socket(int domain, int type, int protocol)!
{!
! lib_update_jiffies();!
! struct socket *kernel_socket = malloc(sizeof(struct socket));!
! int ret, real_fd;!
!
! memset(kernel_socket, 0, sizeof(struct socket));!
! ret = lib_sock_socket(domain, type, protocol, &kernel_socket);!
! if (ret < 0)!
! ! errno = -ret;!
(snip)!
! lib_softirq_wakeup();!
! return real_fd;!
}!
weak_alias(nuse_socket, socket);
Implementations
(Instances)
Direct Code Execution (DCE)	
network simulator integration (ns-3)	
for more testing	
Network Stack in Userspace (NUSE)	
gives new platform of Linux network stack	
for ad-hoc network stack
11
Direct Code Execution
ns-3 integration	
deterministic scheduler	
single-process model virtualization	
dlmopen(3)-like virtualization	
full control over multiple network stacks
12
Execution (DCE)
main() => dlmopen(ping,liblinux.so)

=> main()=>socket(2)=>dce_socket()

=> (do whatever)
13
14
15
Network Stack in
Userspace
Userspace network stack running on
Linux (POSIX) platform	
Network stack personalization	
Full features by design (full stack)	
ARP/ND, UDP/TCP (all cc algorithm), SCTP,
DCCP, QDISC, XFRM, netfilter, etc.
16
17
Application
ARP
Qdisc
TCP UDP DCCP SCTP
ICMP IPv4IPv6
Netlink
BridgingNetfilter
IPSec Tunneling
Kernel layer
Host backend layer (NUSE)
POSIX glue layer
bottom halves/
rcu/timer/
interrupt
struct
net_device
RAW DPDK netmap ...
NIC
scheduler
netdev
clock
source
system call hijack
Application
master process slave processes
rump
syscall
proxy
rump
server
Execution (NUSE)
LD_PRELOAD=libnuse-linux.so 

ping www.google.com	
ping(8) => socket(2) => nuse_socket()
=> raw(7) => (network)
18
When it’s useful?
ad-hoc network stack (network stack
personalization)	
LD_PRELOAD=liblinux-mptcp.so firefox	
Bundle with kernel bypasses	
Intel DPDK / netmap / PF_RING / etc.	
debugging/testing with ns-3
19
Testing workflow
1.Write/modify code (patches)	
2.Write a test code (incl. packet
exchanges)	
3.if PASS; accept pull-request

else; rejects
20
continuous integration
(CI)
21
http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/
T1) write a patch
22
Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")!
Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>!
---!
net/ipv6/xfrm6_policy.c | 1 +!
1 file changed, 1 insertion(+)!
!
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c!
index 48bf5a0..8d2d01b 100644!
--- a/net/ipv6/xfrm6_policy.c!
+++ b/net/ipv6/xfrm6_policy.c!
@@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int
reverse)!
!
#if IS_ENABLED(CONFIG_IPV6_MIP6)!
! ! case IPPROTO_MH:!
+! ! ! offset += ipv6_optlen(exthdr);!
! ! ! if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {!
! ! ! ! struct ip6_mh *mh;!
http://patchwork.ozlabs.org/patch/436351/
T2) write a test
As ns-3 scenario	
C++ or python	
create a topology	
config nodes	
run/check results
(e.g., ping6)
23
+-----------+!
| HA |!
+-----------+!
|sim0!
+----------+------------+!
|sim0 |sim0!
sim2+----+---+ +----+---+!
- - -| AR1 | | AR2 |!
+---+----+ +----+---+!
|sim1 |sim1!
| |!
sim0 sim0!
+----+------+ (Movement) +----+-----+!
| MR | <=====> | MR |!
+-----------+ +----------+!
|sim1 |sim1!
+---------+ +---------+!
| MNN | | MNN |!
+---------+ +---------+!
http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc
24
#!/usr/bin/python!
!
from ns.dce import *!
from ns.core import *!
!
nodes = NodeContainer()!
nodes.Create (100)!
dce = DceManagerHelper()!
dce.SetNetworkStack ("liblinux.so")!
dce.Install (nodes)!
!
app = DceApplicationHelper()!
app.SetBinary ("ping6")!
app.Install (nodes)!
(snip)!
!
NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname!
<< " did not return successfully: " << g_testError)!
!
Simulator.Stop (Seconds(1000.0))!
Simulator.Run ()
Performance of NUSE
10G Ethernet back-to-back	
transmission	
IP forwarding	
native Linux, raw socket, tap, dpdk,
netmap
25
Performance: setup
26
10G10G
NUSE node Tx/Rx nodes
CPU
Xeon E5-2650v2 @ 2.60GHz 	
(16 core)
Xeon L3426 @ 1.87GHz 	
(8 core)
Memory 32GB 4GB
NIC Intel X520 Intel X520
OS
host:3.13.0-32	
nuse: 3.17.0-rc1
host:3.13.0-32
ping!
flowgen
vnstat!
(packet count)
Tx NUSE Rx
ping!
flowgen
Host Tx
27
RxNUSE
ping (RTT)
throughput	
(1024byte,UDP)
0
1000
2000
3000
4000
5000
6000
dpdk native netmap raw tap
Throughput(Mbps)
0
0.2
0.4
0.6
0.8
1
dpdk native netmap raw tap
RTT(ms)
native: ping A.B.C.D!
others: ./nuse ping A.B.C.D
L3 Routing	
Sender->NUSE->Receiver
28
Tx RxNUSE
ping (RTT)
throughput	
(1024byte,UDP)
0
1000
2000
3000
4000
5000
6000
dpdk native netmap raw tap
Throughput(Mbps)
0
0.2
0.4
0.6
0.8
1
dpdk native netmap raw tap
RTT(ms)
Alternatives
UML/LKL (1proc/1vm, no POSIX i/f)	
Containers (can’t change kernel)	
scratch-based (mTCP,Mirage)	
rumpkernel (in NetBSD)
29
Limitations
ad-hoc kernel glues required	
when we changed a member of a struct,
LibOS needs to follow it	
Performance drawbacks on NUSE	
adapt known techniques (mTCP)
30
(not) Conclusions
An abstraction for multiple benefits	
Conservative	
Use past decades effort as much	
with a small amount of effort	
Planing to RFC for upstreaming
31
github: https://github.com/libos-nuse/net-
next-nuse	
DCE: http://bit.ly/ns-3-dce	
twitter: @thehajime
32
Backups
Bug reproducibility
34
Wi-Fi Wi-Fi
Home Agent
AP1 AP2
handoff
ping6
mobile node
correspondent
node
(gdb) b mip6_mh_filter if dce_debug_nodeid()==0

Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.	

<continue>	

(gdb) bt 4	

#0  mip6_mh_filter	

(sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0)	

at net/ipv6/mip6.c:109 	

#1  0x00007ffff2831418 in ipv6_raw_deliver	

(skb=0x7ffff7cde8b0, nexthdr=135) 

at net/ipv6/raw.c:199 	

#2  0x00007ffff2831697 in raw6_local_deliver	

(skb=0x7ffff7cde8b0, nexthdr=135) 

at net/ipv6/raw.c:232 	

#3  0x00007ffff27e6068 in ip6_input_finish	

(skb=0x7ffff7cde8b0) 	

at net/ipv6/ip6_input.c:197
Debugging
Memory error detection	
among distributed nodes	
in a single process	
using Valgrind	
!
!
35
==5864== Memcheck, a memory error detector	

==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.	

==5864== UsingValgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose	

==5864== 	

==5864== Conditional jump or move depends on uninitialised value(s)
==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)	

==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)	

==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)	

==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)	

==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)	

==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318)	

==5864== by 0x7D2313F: process_backlog (dev.c:3368)	

==5864== by 0x7D23455: net_rx_action (dev.c:3526)	

==5864== by 0x7CF2477: do_softirq (softirq.c:65)	

==5864== by 0x7CF2544: softirq_task_function (softirq.c:21)	

==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage
==5864== Uninitialised value was created by a stack allocation
==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)	

==5864==
Fine-grained parameter coverage
36
Code coverage measurement with DCE	
With fine-grained network, node, protocol parameters
1) kernel build
build kernel source tree w/ the patch	
make menuconfig ARCH=sim	
make library ARCH=sim	
➔ libnuse-linux-3.17-rc1.so
37
Example: How timer
works
38
add_timer()
TIMER_SOFTIRQ
timer_list
run_timer_softirq ()
timer handler
timer thread
(timer_create (2))
Tx callgraph
39
sendmsg () (socket API)
lib_sock_sendmsg () (NUSE)
sock_sendmsg ()
ip_send_skb ()
ip_finish_output2 ()
dst_neigh_output () (existing
neigh_resolve_output () -kernel)
arp_solicit ()
dev_queue_xmit ()
lib_dev_xmit () (NUSE)
nuse_vif_raw_write ()
start_thread () (pthread)
nuse_netdev_rx_trampoline ()
nuse_vif_raw_read () (NUSE)
lib_dev_rx ()
netif_rx () (ex-kernel)
Rx callgraph
40
start_thread () (pthread)
do_softirq () (NUSE)
net_rx_action ()
process_backlog () (ex-kernel)
__netif_receive_skb_core ()
ip_rcv ()
vNIC!
rx
softirq!
rx
1 of 40

Recommended

eBPF maps 101 by
eBPF maps 101eBPF maps 101
eBPF maps 101SUSE Labs Taipei
4.2K views64 slides
NUSE (Network Stack in Userspace) at #osio by
NUSE (Network Stack in Userspace) at #osioNUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioHajime Tazaki
6.7K views42 slides
Staring into the eBPF Abyss by
Staring into the eBPF AbyssStaring into the eBPF Abyss
Staring into the eBPF AbyssSasha Goldshtein
3.1K views35 slides
Understanding DPDK by
Understanding DPDKUnderstanding DPDK
Understanding DPDKDenys Haryachyy
102.4K views42 slides
Ethernetの受信処理 by
Ethernetの受信処理Ethernetの受信処理
Ethernetの受信処理Takuya ASADA
24.5K views62 slides
Introduction to eBPF and XDP by
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDPlcplcp1
5.5K views57 slides

More Related Content

What's hot

ネットワーク ゲームにおけるTCPとUDPの使い分け by
ネットワーク ゲームにおけるTCPとUDPの使い分けネットワーク ゲームにおけるTCPとUDPの使い分け
ネットワーク ゲームにおけるTCPとUDPの使い分けモノビット エンジン
61.4K views63 slides
BGP Unnumbered で遊んでみた by
BGP Unnumbered で遊んでみたBGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみたakira6592
5.7K views21 slides
コンテナネットワーキング(CNI)最前線 by
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線Motonori Shindo
31.6K views34 slides
Kernel Recipes 2019 - XDP closer integration with network stack by
Kernel Recipes 2019 -  XDP closer integration with network stackKernel Recipes 2019 -  XDP closer integration with network stack
Kernel Recipes 2019 - XDP closer integration with network stackAnne Nicolas
1.7K views27 slides
大規模DCのネットワークデザイン by
大規模DCのネットワークデザイン大規模DCのネットワークデザイン
大規模DCのネットワークデザインMasayuki Kobayashi
17.6K views32 slides
LinuxCon 2015 Linux Kernel Networking Walkthrough by
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughThomas Graf
26.2K views20 slides

What's hot(20)

BGP Unnumbered で遊んでみた by akira6592
BGP Unnumbered で遊んでみたBGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみた
akira65925.7K views
コンテナネットワーキング(CNI)最前線 by Motonori Shindo
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線
Motonori Shindo31.6K views
Kernel Recipes 2019 - XDP closer integration with network stack by Anne Nicolas
Kernel Recipes 2019 -  XDP closer integration with network stackKernel Recipes 2019 -  XDP closer integration with network stack
Kernel Recipes 2019 - XDP closer integration with network stack
Anne Nicolas1.7K views
大規模DCのネットワークデザイン by Masayuki Kobayashi
大規模DCのネットワークデザイン大規模DCのネットワークデザイン
大規模DCのネットワークデザイン
Masayuki Kobayashi17.6K views
LinuxCon 2015 Linux Kernel Networking Walkthrough by Thomas Graf
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
Thomas Graf26.2K views
10GbE時代のネットワークI/O高速化 by Takuya ASADA
10GbE時代のネットワークI/O高速化10GbE時代のネットワークI/O高速化
10GbE時代のネットワークI/O高速化
Takuya ASADA59.4K views
Tutorial: Using GoBGP as an IXP connecting router by Shu Sugimoto
Tutorial: Using GoBGP as an IXP connecting routerTutorial: Using GoBGP as an IXP connecting router
Tutorial: Using GoBGP as an IXP connecting router
Shu Sugimoto20.7K views
マルチコアとネットワークスタックの高速化技法 by Takuya ASADA
マルチコアとネットワークスタックの高速化技法マルチコアとネットワークスタックの高速化技法
マルチコアとネットワークスタックの高速化技法
Takuya ASADA16.9K views
BPF Internals (eBPF) by Brendan Gregg
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
Brendan Gregg15.3K views
マイクロサービスバックエンドAPIのためのRESTとgRPC by disc99_
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
disc99_19.9K views
High-Performance Networking Using eBPF, XDP, and io_uring by ScyllaDB
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB3.3K views
DockerとKubernetesをかけめぐる by Kohei Tokunaga
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐる
Kohei Tokunaga3.6K views
"SRv6の現状と展望" ENOG53@上越 by Kentaro Ebisawa
"SRv6の現状と展望" ENOG53@上越"SRv6の現状と展望" ENOG53@上越
"SRv6の現状と展望" ENOG53@上越
Kentaro Ebisawa10.5K views
最近のOpenStackを振り返ってみよう by Takashi Kajinami
最近のOpenStackを振り返ってみよう最近のOpenStackを振り返ってみよう
最近のOpenStackを振り返ってみよう
Takashi Kajinami1.4K views
Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security by Thomas Graf
Cilium - Bringing the BPF Revolution to Kubernetes Networking and SecurityCilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Thomas Graf2.6K views
今話題のいろいろなコンテナランタイムを比較してみた by Kohei Tokunaga
今話題のいろいろなコンテナランタイムを比較してみた今話題のいろいろなコンテナランタイムを比較してみた
今話題のいろいろなコンテナランタイムを比較してみた
Kohei Tokunaga61.8K views
HTML5 + JavaScriptでDRMつきMPEG-DASHを再生させる by Gaprot
HTML5 + JavaScriptでDRMつきMPEG-DASHを再生させるHTML5 + JavaScriptでDRMつきMPEG-DASHを再生させる
HTML5 + JavaScriptでDRMつきMPEG-DASHを再生させる
Gaprot24.5K views

Similar to Library Operating System for Linux #netdev01

LibOS as a regression test framework for Linux networking #netdev1.1 by
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1Hajime Tazaki
1.6K views44 slides
May2010 hex-core-opt by
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
459 views101 slides
[CCC-28c3] Post Memory Corruption Memory Analysis by
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory AnalysisMoabi.com
2.5K views94 slides
Direct Code Execution - LinuxCon Japan 2014 by
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
1.3K views51 slides
Network Stack in Userspace (NUSE) by
Network Stack in Userspace (NUSE)Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)Hajime Tazaki
6.7K views27 slides
Cray XT Porting, Scaling, and Optimization Best Practices by
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
782 views122 slides

Similar to Library Operating System for Linux #netdev01(20)

LibOS as a regression test framework for Linux networking #netdev1.1 by Hajime Tazaki
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
Hajime Tazaki1.6K views
May2010 hex-core-opt by Jeff Larkin
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
Jeff Larkin459 views
[CCC-28c3] Post Memory Corruption Memory Analysis by Moabi.com
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
Moabi.com2.5K views
Direct Code Execution - LinuxCon Japan 2014 by Hajime Tazaki
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
Hajime Tazaki1.3K views
Network Stack in Userspace (NUSE) by Hajime Tazaki
Network Stack in Userspace (NUSE)Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)
Hajime Tazaki6.7K views
Cray XT Porting, Scaling, and Optimization Best Practices by Jeff Larkin
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
Jeff Larkin782 views
Building Network Functions with eBPF & BCC by Kernel TLV
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
Kernel TLV3K views
05 - Bypassing DEP, or why ASLR matters by Alexandre Moneger
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters
Alexandre Moneger1.5K views
Picobgp - A simple deamon for routing advertising by Claudio Mignanti
Picobgp - A simple deamon for routing advertisingPicobgp - A simple deamon for routing advertising
Picobgp - A simple deamon for routing advertising
Claudio Mignanti325 views
Network Programming: Data Plane Development Kit (DPDK) by Andriy Berestovskyy
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
Andriy Berestovskyy2.3K views
The true story_of_hello_world by fantasy zheng
The true story_of_hello_worldThe true story_of_hello_world
The true story_of_hello_world
fantasy zheng2.5K views
Pragmatic Optimization in Modern Programming - Demystifying the Compiler by Marina Kolpakova
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Marina Kolpakova1.6K views
SFO15-500: VIXL by Linaro
SFO15-500: VIXLSFO15-500: VIXL
SFO15-500: VIXL
Linaro1.2K views
The Fuzzing Project - 32C3 by hannob
The Fuzzing Project - 32C3The Fuzzing Project - 32C3
The Fuzzing Project - 32C3
hannob1.7K views
Understanding eBPF in a Hurry! by Ray Jenkins
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
Ray Jenkins1.5K views
BUD17-300: Journey of a packet by Linaro
BUD17-300: Journey of a packetBUD17-300: Journey of a packet
BUD17-300: Journey of a packet
Linaro3.4K views
Share the Experience of Using Embedded Development Board by Jian-Hong Pan
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
Jian-Hong Pan271 views
Golang Performance : microbenchmarks, profilers, and a war story by Aerospike
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
Aerospike10.3K views
[Kiwicon 2011] Post Memory Corruption Memory Analysis by Moabi.com
[Kiwicon 2011] Post Memory Corruption Memory Analysis[Kiwicon 2011] Post Memory Corruption Memory Analysis
[Kiwicon 2011] Post Memory Corruption Memory Analysis
Moabi.com807 views
Joblib: Lightweight pipelining for parallel jobs (v2) by Marcel Caraciolo
Joblib:  Lightweight pipelining for parallel jobs (v2)Joblib:  Lightweight pipelining for parallel jobs (v2)
Joblib: Lightweight pipelining for parallel jobs (v2)
Marcel Caraciolo2.8K views

More from Hajime Tazaki

Linux rumpkernel - ABC2018 (AsiaBSDCon 2018) by
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Hajime Tazaki
2K views65 slides
Network stack personality in Android phone - netdev 2.2 by
Network stack personality in Android phone - netdev 2.2Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2Hajime Tazaki
1.1K views26 slides
Playing BBR with a userspace network stack by
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stackHajime Tazaki
2.1K views43 slides
Linux Kernel Library - Reusing Monolithic Kernel by
Linux Kernel Library - Reusing Monolithic KernelLinux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic KernelHajime Tazaki
2.1K views49 slides
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese) by
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)Hajime Tazaki
3.2K views39 slides
mTCP使ってみた by
mTCP使ってみたmTCP使ってみた
mTCP使ってみたHajime Tazaki
15.7K views26 slides

More from Hajime Tazaki(8)

Linux rumpkernel - ABC2018 (AsiaBSDCon 2018) by Hajime Tazaki
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Hajime Tazaki2K views
Network stack personality in Android phone - netdev 2.2 by Hajime Tazaki
Network stack personality in Android phone - netdev 2.2Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2
Hajime Tazaki1.1K views
Playing BBR with a userspace network stack by Hajime Tazaki
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
Hajime Tazaki2.1K views
Linux Kernel Library - Reusing Monolithic Kernel by Hajime Tazaki
Linux Kernel Library - Reusing Monolithic KernelLinux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic Kernel
Hajime Tazaki2.1K views
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese) by Hajime Tazaki
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
IIJlab seminar - Linux Kernel Library: Reusable monolithic kernel (in Japanese)
Hajime Tazaki3.2K views
mTCP使ってみた by Hajime Tazaki
mTCP使ってみたmTCP使ってみた
mTCP使ってみた
Hajime Tazaki15.7K views
Direct Code Execution @ CoNEXT 2013 by Hajime Tazaki
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013
Hajime Tazaki2.3K views
Kernelvm 201312-dlmopen by Hajime Tazaki
Kernelvm 201312-dlmopenKernelvm 201312-dlmopen
Kernelvm 201312-dlmopen
Hajime Tazaki2.3K views

Recently uploaded

handbook for web 3 adoption.pdf by
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdfLiveplex
22 views16 slides
PharoJS - Zürich Smalltalk Group Meetup November 2023 by
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023Noury Bouraqadi
126 views17 slides
Special_edition_innovator_2023.pdf by
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdfWillDavies22
17 views6 slides
Unit 1_Lecture 2_Physical Design of IoT.pdf by
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdfStephenTec
12 views36 slides
Transcript: The Details of Description Techniques tips and tangents on altern... by
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...BookNet Canada
135 views15 slides
AMAZON PRODUCT RESEARCH.pdf by
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdfJerikkLaureta
19 views13 slides

Recently uploaded(20)

handbook for web 3 adoption.pdf by Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex22 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi126 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2217 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec12 views
Transcript: The Details of Description Techniques tips and tangents on altern... by BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada135 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta19 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker33 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma31 views
Black and White Modern Science Presentation.pptx by maryamkhalid2916
Black and White Modern Science Presentation.pptxBlack and White Modern Science Presentation.pptx
Black and White Modern Science Presentation.pptx
maryamkhalid291616 views
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023
Michael Price19 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst476 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf

Library Operating System for Linux #netdev01

  • 1. Library Operating System with Mainline Linux Network Stack ! Hajime Tazaki, Ryo Nakamura, Yuji Sekiya netdev0.1, Feb. 2015
  • 2. Motivation Why kernel space ? Packets were expensive in 1970’ Why not userspace ? well grown in decades, costs degrades obtain network stack personalization controllable by userspace utilities 2
  • 3. Userspace network stacks A lot of userspace network stack full scratch: mTCP, Mirage, lwIP Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?) Motivated by their own problems (specialized NIC, cloud, high-speed Apps) Writing a network stack is 1-week DIY, but writing opera-table network stack is decades DIY (which is not DIY) 3
  • 4. Questions How to benefit matured network stack in userspace ? How to trivially introduce your idea on network stack ? xxTCP, IPvX, etc.. How to flexibly test your code with a complex scenario ? 4
  • 5. The answers Using Linux network stack as-is ! as a userspace Library (library operating system) 5
  • 6. This talk is about an introduction of a library operating system for Linux and its implementation with a couple of useful use cases 6
  • 7. Outlook (design) hardware-independent arch (arch/lib) 3 components Host backend layer Kernel layer POSIX layer 7 https://github.com/libos-nuse/net-next-nuse
  • 8. Outlook (cont’d) 8 ARP Qdisc TCP UDP DCCP SCTP ICMP IPv4IPv6 Netlink BridgingNetfilter IPSec Tunneling Kernel layer Host backend layer bottom halves/ rcu/timer/ interrupt struct net_device scheduler netdev clock source POSIX glue layer Application 1) Build Linux srctree w/ glues as a library 2) put backend! (vNIC, clock source,! scheduler) and bind 3) add POSIX glue code 4) applications magically runs
  • 9. Kernel glue code 9 https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c void schedule(void)! {! ! lib_task_wait();! }! signed long schedule_timeout(signed long timeout)! {! ! u64 ns;! ! struct SimTask *self;! ! ! if (timeout == MAX_SCHEDULE_TIMEOUT) {! ! ! lib_task_wait();! ! ! return MAX_SCHEDULE_TIMEOUT;! ! }! ! lib_assert(timeout >= 0);! ! ns = ((__u64)timeout) * (1000000000 / HZ);! ! self = lib_task_current();! ! lib_event_schedule_ns(ns, &trampoline, self);! ! lib_task_wait();! ! /* we know that we are always perfectly on time. */! ! return 0;! }
  • 10. POSIX glue code 10 https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c int nuse_socket(int domain, int type, int protocol)! {! ! lib_update_jiffies();! ! struct socket *kernel_socket = malloc(sizeof(struct socket));! ! int ret, real_fd;! ! ! memset(kernel_socket, 0, sizeof(struct socket));! ! ret = lib_sock_socket(domain, type, protocol, &kernel_socket);! ! if (ret < 0)! ! ! errno = -ret;! (snip)! ! lib_softirq_wakeup();! ! return real_fd;! }! weak_alias(nuse_socket, socket);
  • 11. Implementations (Instances) Direct Code Execution (DCE) network simulator integration (ns-3) for more testing Network Stack in Userspace (NUSE) gives new platform of Linux network stack for ad-hoc network stack 11
  • 12. Direct Code Execution ns-3 integration deterministic scheduler single-process model virtualization dlmopen(3)-like virtualization full control over multiple network stacks 12
  • 13. Execution (DCE) main() => dlmopen(ping,liblinux.so)
 => main()=>socket(2)=>dce_socket()
 => (do whatever) 13
  • 14. 14
  • 15. 15
  • 16. Network Stack in Userspace Userspace network stack running on Linux (POSIX) platform Network stack personalization Full features by design (full stack) ARP/ND, UDP/TCP (all cc algorithm), SCTP, DCCP, QDISC, XFRM, netfilter, etc. 16
  • 17. 17 Application ARP Qdisc TCP UDP DCCP SCTP ICMP IPv4IPv6 Netlink BridgingNetfilter IPSec Tunneling Kernel layer Host backend layer (NUSE) POSIX glue layer bottom halves/ rcu/timer/ interrupt struct net_device RAW DPDK netmap ... NIC scheduler netdev clock source system call hijack Application master process slave processes rump syscall proxy rump server
  • 18. Execution (NUSE) LD_PRELOAD=libnuse-linux.so 
 ping www.google.com ping(8) => socket(2) => nuse_socket() => raw(7) => (network) 18
  • 19. When it’s useful? ad-hoc network stack (network stack personalization) LD_PRELOAD=liblinux-mptcp.so firefox Bundle with kernel bypasses Intel DPDK / netmap / PF_RING / etc. debugging/testing with ns-3 19
  • 20. Testing workflow 1.Write/modify code (patches) 2.Write a test code (incl. packet exchanges) 3.if PASS; accept pull-request
 else; rejects 20
  • 22. T1) write a patch 22 Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")! Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>! ---! net/ipv6/xfrm6_policy.c | 1 +! 1 file changed, 1 insertion(+)! ! diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c! index 48bf5a0..8d2d01b 100644! --- a/net/ipv6/xfrm6_policy.c! +++ b/net/ipv6/xfrm6_policy.c! @@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)! ! #if IS_ENABLED(CONFIG_IPV6_MIP6)! ! ! case IPPROTO_MH:! +! ! ! offset += ipv6_optlen(exthdr);! ! ! ! if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {! ! ! ! ! struct ip6_mh *mh;! http://patchwork.ozlabs.org/patch/436351/
  • 23. T2) write a test As ns-3 scenario C++ or python create a topology config nodes run/check results (e.g., ping6) 23 +-----------+! | HA |! +-----------+! |sim0! +----------+------------+! |sim0 |sim0! sim2+----+---+ +----+---+! - - -| AR1 | | AR2 |! +---+----+ +----+---+! |sim1 |sim1! | |! sim0 sim0! +----+------+ (Movement) +----+-----+! | MR | <=====> | MR |! +-----------+ +----------+! |sim1 |sim1! +---------+ +---------+! | MNN | | MNN |! +---------+ +---------+! http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc
  • 24. 24 #!/usr/bin/python! ! from ns.dce import *! from ns.core import *! ! nodes = NodeContainer()! nodes.Create (100)! dce = DceManagerHelper()! dce.SetNetworkStack ("liblinux.so")! dce.Install (nodes)! ! app = DceApplicationHelper()! app.SetBinary ("ping6")! app.Install (nodes)! (snip)! ! NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname! << " did not return successfully: " << g_testError)! ! Simulator.Stop (Seconds(1000.0))! Simulator.Run ()
  • 25. Performance of NUSE 10G Ethernet back-to-back transmission IP forwarding native Linux, raw socket, tap, dpdk, netmap 25
  • 26. Performance: setup 26 10G10G NUSE node Tx/Rx nodes CPU Xeon E5-2650v2 @ 2.60GHz (16 core) Xeon L3426 @ 1.87GHz (8 core) Memory 32GB 4GB NIC Intel X520 Intel X520 OS host:3.13.0-32 nuse: 3.17.0-rc1 host:3.13.0-32 ping! flowgen vnstat! (packet count) Tx NUSE Rx ping! flowgen
  • 27. Host Tx 27 RxNUSE ping (RTT) throughput (1024byte,UDP) 0 1000 2000 3000 4000 5000 6000 dpdk native netmap raw tap Throughput(Mbps) 0 0.2 0.4 0.6 0.8 1 dpdk native netmap raw tap RTT(ms) native: ping A.B.C.D! others: ./nuse ping A.B.C.D
  • 28. L3 Routing Sender->NUSE->Receiver 28 Tx RxNUSE ping (RTT) throughput (1024byte,UDP) 0 1000 2000 3000 4000 5000 6000 dpdk native netmap raw tap Throughput(Mbps) 0 0.2 0.4 0.6 0.8 1 dpdk native netmap raw tap RTT(ms)
  • 29. Alternatives UML/LKL (1proc/1vm, no POSIX i/f) Containers (can’t change kernel) scratch-based (mTCP,Mirage) rumpkernel (in NetBSD) 29
  • 30. Limitations ad-hoc kernel glues required when we changed a member of a struct, LibOS needs to follow it Performance drawbacks on NUSE adapt known techniques (mTCP) 30
  • 31. (not) Conclusions An abstraction for multiple benefits Conservative Use past decades effort as much with a small amount of effort Planing to RFC for upstreaming 31
  • 34. Bug reproducibility 34 Wi-Fi Wi-Fi Home Agent AP1 AP2 handoff ping6 mobile node correspondent node (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
 Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88. <continue> (gdb) bt 4 #0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109 #1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) 
 at net/ipv6/raw.c:199 #2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) 
 at net/ipv6/raw.c:232 #3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
  • 35. Debugging Memory error detection among distributed nodes in a single process using Valgrind ! ! 35 ==5864== Memcheck, a memory error detector ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al. ==5864== UsingValgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose ==5864== ==5864== Conditional jump or move depends on uninitialised value(s) ==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782) ==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532) ==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496) ==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576) ==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696) ==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226) ==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318) ==5864== by 0x7D2313F: process_backlog (dev.c:3368) ==5864== by 0x7D23455: net_rx_action (dev.c:3526) ==5864== by 0x7CF2477: do_softirq (softirq.c:65) ==5864== by 0x7CF2544: softirq_task_function (softirq.c:21) ==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage ==5864== Uninitialised value was created by a stack allocation ==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522) ==5864==
  • 36. Fine-grained parameter coverage 36 Code coverage measurement with DCE With fine-grained network, node, protocol parameters
  • 37. 1) kernel build build kernel source tree w/ the patch make menuconfig ARCH=sim make library ARCH=sim ➔ libnuse-linux-3.17-rc1.so 37
  • 39. Tx callgraph 39 sendmsg () (socket API) lib_sock_sendmsg () (NUSE) sock_sendmsg () ip_send_skb () ip_finish_output2 () dst_neigh_output () (existing neigh_resolve_output () -kernel) arp_solicit () dev_queue_xmit () lib_dev_xmit () (NUSE) nuse_vif_raw_write ()
  • 40. start_thread () (pthread) nuse_netdev_rx_trampoline () nuse_vif_raw_read () (NUSE) lib_dev_rx () netif_rx () (ex-kernel) Rx callgraph 40 start_thread () (pthread) do_softirq () (NUSE) net_rx_action () process_backlog () (ex-kernel) __netif_receive_skb_core () ip_rcv () vNIC! rx softirq! rx