1
Playing BBR with a userspace
network stack
Hajime Tazaki
IIJ
April, 2017, Linux netdev 2.1, Montreal, Canada
2 . 1
Linux Kernel Library
A library of Linux kernel code
to be reusable on various platforms
On userspace applications (can be FUSE, NUSE, BUSE)
As a core of Unikernel
With network simulation (under development)
Use cases
Operating system personality
Tiny guest operating system (single process)
Testing/Debugging
2 . 2
Motivation
MegaPipe [OSDI '12]
outperforms baseline Linux .. 582% (for short connections).
New API for applications (no existing applications benefit)
mTCP [NSDI '14]
improve... by a factor of 25 compared to the latest Linux TCP
implement with very limited TCP extensions
SandStorm [SIGCOMM '14]
our approach ..., demonstrating 2-10x improvements
specialized (no existing applications benefit)
Arrakis [OSDI '14]
improvements of 2-5x in latency and 9x in throughput .. to Linux
utilize simplified TCP/IP stack (lwip) (lose feature-rich extensions)
IX [OSDI '14]
improves throughput ... by 3.6x and reduces latency by 2x
utilize simplified TCP/IP stack (lwip) (lose feature-rich extensions)
2 . 3
Sigh
2 . 4
Motivation (cont'd)
1. Reuse feature-rich network stack, not re-implement or port
re-implement: give up (matured) decades' effort
port: hard to track the latest version
2. Reuse preserves various semantics
syntax level (command line)
API level
operation level (utility scripts)
3. Reasonable speed with generalized userspace network stack
x1 speed of the original
2 . 5
LKL outlooks
h/w independent (arch/lkl)
various platforms
Linux userspace
Windows userspace
FreeBSD user space
qemu/kvm (x86, arm) (unikernel)
uEFI (EFIDroid)
existing applications support
musl libc bind
cross build toolchain
EFIDroid: http://efidroid.org/
2 . 6
Demo
2 . 7
userspace network stack ?
Concerns about timing accuracy
how LKL behaves with BBR (requires higher timing accuracy) ?
Having the network stack in userspace may complicate various
optimizations
LKL at netdev1.2
https://youtu.be/xP9crHI0aAU?t=34m18s
3 . 1
Playing BBR with LKL
3 . 2
TCP BBR
Bottleneck Bandwidth and Round-trip propagation time
Control Tx rate
congestion control not based on packet loss
estimate MinRTT and MaxBW (on each ACK)
http://queue.acm.org/detail.cfm?id=3022184
3 . 3
TCP BBR (cont'd)
On Google's B4 WAN
(across North America, EU, Asia)
Migrated from cubic to bbr in 2016
x2 - x25 improvements
4 . 1
1st Benchmark (Oct. 2016)
netperf (TCP_STREAM, -K bbr/cubic)
2-node 10Gbps b2b link
tap+bridge (LKL)
direct ixgbe (native)
No loss, no bottleneck, close link
netperf(client) netserver
+------+ +--------+
| | | |
|sender+--------------+|receiver|
| |==============>| |
| | | |
+------+ +--------+
Linux-4.9-rc4 Linux-4.6
(host,LKL)
bbr/cubic cubic,fq_codel
(default)
4 . 2
1st Benchmark
cc      tput (Linux)    tput (LKL)
bbr     9414.40 Mbps    456.43 Mbps
cubic   9411.46 Mbps    9385.28 Mbps
4 . 3
What ??
Only BBR + LKL performs badly
Investigation
the ACK timestamp used for RTT measurement needs a precise
clock source
providing a high-resolution timestamp improves BBR performance
4 . 4
Change HZ (tick interval)
cc      tput (Linux, hz1000)    tput (LKL, hz100)    tput (LKL, hz1000)
bbr     9414.40 Mbps            456.43 Mbps          6965.05 Mbps
cubic   9411.46 Mbps            9385.28 Mbps         9393.35 Mbps
4 . 5
Timestamp on (ack) receipt
From
unsigned long long __weak sched_clock(void)
{
        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                        * (NSEC_PER_SEC / HZ);
}
To
unsigned long long sched_clock(void)
{
        return lkl_ops->time(); // i.e., clock_gettime()
}
cc      tput (Linux)    tput (LKL, hz100)    tput (LKL sched_clock, hz100)
bbr     9414.40 Mbps    456.43 Mbps          9409.98 Mbps
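The effect of the tick-based clock on RTT samples can be sketched as follows. This is a minimal illustration, not LKL code: `tick_clock_ns` and `measured_rtt_us` are hypothetical helpers that model a clock which only advances once per tick, in the style of the weak sched_clock() above.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical model of a jiffies-based clock: time only advances
 * once per tick, i.e. every NSEC_PER_SEC / hz nanoseconds. */
uint64_t tick_clock_ns(uint64_t real_ns, unsigned int hz)
{
	uint64_t tick_ns = NSEC_PER_SEC / hz;
	uint64_t jiffies = real_ns / tick_ns;	/* whole ticks elapsed so far */

	return jiffies * tick_ns;
}

/* RTT in microseconds as seen through the tick-based clock. */
uint64_t measured_rtt_us(uint64_t send_ns, uint64_t ack_ns, unsigned int hz)
{
	return (tick_clock_ns(ack_ns, hz) - tick_clock_ns(send_ns, hz)) / 1000;
}
```

With hz = 100, a real 50 usec round trip is measured either as 0 usec (both timestamps fall in the same tick) or as 10000 usec (the timestamps straddle a tick boundary), so the RTT sample is never close to the true value on a back-to-back link.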
4 . 6
What happens if no sched_clock() ?
low throughput due to longer RTT measurement
A patch (by Neal Cardwell) to tolerate the lower resolution of jiffies
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 56fe736..b0f1426 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3196,6 +3196,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_
ca_rtt_us = skb_mstamp_us_delta(now, &sack->last_sackt);
}
sack->rate->rtt_us = ca_rtt_us; /* RTT of last (S)ACKed packet, or -1 */
+ if (sack->rate->rtt_us == 0)
+ sack->rate->rtt_us = jiffies_to_usecs(1);
rtt_update = tcp_ack_update_rtt(sk, flag, seq_rtt_us, sack_rtt_us,
ca_rtt_us);
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 9be1581..981c48e 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
cc      tput (Linux)    tput (LKL, hz100)    tput (LKL patched, hz100)
bbr     9414.40 Mbps    456.43 Mbps          9413.51 Mbps
https://groups.google.com/forum/#!topic/bbr-dev/sNwlUuIzzOk
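The two added lines in the tcp_input.c hunk replace a zero RTT sample with one jiffy. A standalone sketch of that arithmetic is below; `one_jiffy_us` and `clamp_rtt_us` are hypothetical names, and the tick rate is passed explicitly for illustration where the kernel uses the compile-time HZ.

```c
#include <stdint.h>

/* Microseconds per jiffy for a given tick rate (assumes hz divides
 * 1,000,000 evenly, as 100 and 1000 do). */
uint32_t one_jiffy_us(unsigned int hz)
{
	return 1000000U / hz;
}

/* Replace a zero RTT sample with one jiffy; other values, including
 * the -1 "no sample" marker, pass through unchanged. */
int64_t clamp_rtt_us(int64_t rtt_us, unsigned int hz)
{
	if (rtt_us == 0)
		return one_jiffy_us(hz);
	return rtt_us;
}
```

At hz = 100 a zero sample becomes 10000 usec, and at hz = 1000 it becomes 1000 usec.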
5 . 1
2nd Benchmark
delayed, lossy network on 10Gbps
netem (middlebox)
netperf(client) netserver
+------+ +---------+ +--------+
| | | | | |
|sender+--------+middlebox+------+|receiver|
| |======= |======== |======>| |
| | | | | |
+------+ +---------+ +--------+
cc: BBR 1% pkt loss
fq-enabled 100ms delay
tcp_wmem=100M
cc      tput (Linux)    tput (LKL)
bbr     8602.40 Mbps    145.32 Mbps
cubic   632.63 Mbps     118.71 Mbps
5 . 2
Memory w/ TCP
configurable parameter for socket, TCP
sysctl -w net.ipv4.tcp_wmem="4096 16384 100000000"
delay and loss w/ TCP requires increased buffer
"LKL_SYSCTL=net.ipv4.tcp_wmem=4096 16384 100000000"
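Why a buffer of this order is needed can be checked with the bandwidth-delay product. The sketch below is only arithmetic; `bdp_bytes` is a hypothetical helper, and it assumes the 100 ms netem delay is the whole round-trip time.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes: link rate in bits per second
 * multiplied by the round-trip time in milliseconds. */
uint64_t bdp_bytes(uint64_t link_bps, uint64_t rtt_ms)
{
	return link_bps / 8 * rtt_ms / 1000;
}
```

For a 10 Gbps link and 100 ms, `bdp_bytes(10000000000ULL, 100)` gives 125,000,000 bytes, the same order of magnitude as the 100,000,000-byte maximum configured above.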
5 . 3
Memory w/ TCP (cont'd)
default memory size (of LKL): 64MiB
the size affects the sndbuf size
static bool tcp_should_expand_sndbuf(const struct sock *sk)
{
(snip)
/* If we are under global TCP memory pressure, do not expand. */
if (tcp_under_memory_pressure(sk))
return false;
(snip)
}
5 . 4
Timer related
CONFIG_HIGH_RES_TIMERS enabled
the fq scheduler uses it
to pace packets properly at the probed BW
fq configuration
instead of tc qdisc add fq
5 . 5
fq scheduler
Every scheduled fq_flow entry schedules a timer event
with a high-resolution timer (in nsec)
static struct sk_buff *fq_dequeue()
=> void qdisc_watchdog_schedule_ns()
=> hrtimer_start()
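The reason pacing needs nanosecond timers can be seen from the inter-packet gap. This is a simplified sketch of rate-based pacing, not the sch_fq source: `next_tx_time_ns` is a hypothetical helper that ignores TSO bursts and per-flow state.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Earliest time the next packet of a flow may leave, given the length
 * of the packet just sent and the pacing rate in bytes per second. */
uint64_t next_tx_time_ns(uint64_t now_ns, uint32_t pkt_len,
			 uint64_t rate_bytes_per_sec)
{
	return now_ns + (uint64_t)pkt_len * NSEC_PER_SEC / rate_bytes_per_sec;
}
```

For a 1500-byte packet paced at 10 Gbps (1,250,000,000 bytes/s) the gap is 1200 nsec, so a timer that fires tens of microseconds late cannot hold that rate packet by packet.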
5 . 6
How slow is the high-resolution timer?
Delay = (nsec of expiration) - (nsec of scheduled)
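The same definition of delay can be tried on the host with POSIX timers. The sketch below measures how late clock_nanosleep(2) wakes up relative to an absolute deadline; it is an assumption-laden stand-in for the measurement in the slides, which was taken on LKL's hrtimer, and `timer_delay_ns` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Current CLOCK_MONOTONIC time in nanoseconds. */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* Sleep until an absolute deadline `after_ns` from now and return
 * (time of actual wakeup) - (scheduled time), in nanoseconds. */
int64_t timer_delay_ns(uint64_t after_ns)
{
	uint64_t deadline = now_ns() + after_ns;
	struct timespec ts = {
		.tv_sec = (time_t)(deadline / NSEC_PER_SEC),
		.tv_nsec = (long)(deadline % NSEC_PER_SEC),
	};

	/* Retry if a signal interrupts the sleep. */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL) == EINTR)
		;
	return (int64_t)(now_ns() - deadline);
}
```

Calling `timer_delay_ns(100000)` repeatedly and collecting the results gives a delay distribution for the host; the values depend on the machine and its load.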
5 . 7
Scheduler improvement
LKL's scheduler
outsourced to the host thread implementation (green/native)
minimum delay of the (LKL-emulated) timer interrupt
>60 usec (green thread)
>20 usec (native thread)
5 . 8
Scheduler improvement
1. avoid the system call (clock_nanosleep) when blocking
busy poll (watch the clock instead) if the sleep is < 10 usec
60 usec => 20 usec
2. reuse the green thread stack (avoid an mmap per timer irq)
20 usec => 3 usec
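#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int sleep_ns(uint64_t nsec) {
        /* fast path: busy poll the clock for sleeps shorter than 10 usec */
        if (nsec < 10 * 1000) {
                uint64_t start = now_ns();
                while (now_ns() - start < nsec)
                        ;
                return 0;
        }
        /* slow path: block in clock_nanosleep(2) */
        struct timespec req = { nsec / 1000000000ULL, nsec % 1000000000ULL };
        return clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
}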
5 . 9
Timer delay improved ?
Before (top), After (bottom)
5 . 10
Results (TCP_STREAM, bbr/cubic)
netperf(client) netserver
+------+ +---------+ +--------+
| | | | | |
|sender+--------+middlebox+------+|receiver|
| |======= |======== |======>| |
| | | | | |
+------+ +---------+ +--------+
cc: BBR 1% pkt loss
fq-enabled 100ms delay
tcp_wmem=100M
6 . 1
Patched LKL
1. add sched_clock()
2. add sysctl configuration i/f (net.ipv4.tcp_wmem)
3. make system memory configurable (net.ipv4.tcp_mem)
4. enable CONFIG_HIGH_RES_TIMERS
5. add sch-fq configuration
6. scheduler hacked (uspace specific)
avoid syscall for short sleep
avoid memory allocation for each (thread) stack
7. (TSO, csum offload, by Jerry/Yuan from Google, netdev1.2)
6 . 2
Next possible steps
profile the cases where LKL performs worse
e.g., context switch of uspace threads
Various short-cuts
busy polling I/Os (packet, clock, etc)
replacing packet I/O (packet_mmap)
short packet performance (i.e., 64B)
practical workload (e.g., HTTP)
> 10Gbps link
6 . 3
on qemu/kvm ?
Based on rumprun unikernel
Performance under investigation
No scheduler issue (not depending on syscall)
- http://www.linux.com/news/enterprise/cloud-computing/751156-are-cloud-operating-systems-the-next-big-thing-
Summary
Timing accuracy concern was right
performance obstacle in userspace execution
scheduler related
alleviated to some extent
Timing-sensitive features degrade compared with native Linux
other options (unikernel)
The benefit of reusable code
6 . 4
6 . 5
References
LKL
Other related repos
https://github.com/lkl/linux
https://github.com/libos-nuse/lkl-linux
https://github.com/libos-nuse/frankenlibc
https://github.com/libos-nuse/rumprun
6 . 6
Backup
6 . 7
Alternatives
Full Virtualization
KVM
Para Virtualization
Xen
UML
Lightweight Virtualization
Container/namespaces
6 . 8
What is not LKL ?
not specific to a userspace network stack
Is a reusable library that we can use everywhere (in theory)
6 . 9
How others think about userspace ?
DPDK is not Linux (@ netdev 1.2)
The model of DPDK isn't compatible with Linux
break security model (protection never works)
XDP is Linux
6 . 10
userspace network stack (checklist)
Performance
Safety
Typos take the entire system down
Developer pervasiveness
Kernel reboot is disruptive
Traffic loss
ref: XDP Inside and Out
https://github.com/iovisor/bpf-docs/blob/master/XDP_Inside_and_Out.pdf
6 . 11
TCP BBR (cont'd)
BBR requires
packet pacing
precise RTT measurement
function onAck(packet)
  rtt = now - packet.sendtime
  update_min_filter(RTpropFilter, rtt)
  delivered += packet.size
  delivered_time = now
  deliveryRate = (delivered - packet.delivered) /
                 (delivered_time - packet.delivered_time)
  if (deliveryRate > BtlBwFilter.currentMax || ! packet.app_limited)
    update_max_filter(BtlBwFilter, deliveryRate)
  if (app_limited_until > 0)
    app_limited_until = app_limited_until - packet.size
http://queue.acm.org/detail.cfm?id=3022184
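The onAck pseudocode can be sketched in C as follows. This is a simplified illustration, not the Linux tcp_bbr implementation: the struct layouts and names are hypothetical, and the windowed min/max filters are replaced by a plain running minimum and maximum.

```c
#include <stdbool.h>
#include <stdint.h>

struct packet {				/* state recorded when the packet was sent */
	uint64_t sendtime_us;
	uint64_t size;			/* bytes */
	uint64_t delivered;		/* connection's delivered count at send time */
	uint64_t delivered_time_us;	/* connection's delivered_time at send time */
	bool app_limited;
};

struct bbr_conn {
	uint64_t rtprop_us;		/* running min RTT (no window expiry here) */
	double btlbw;			/* running max delivery rate, bytes per usec */
	uint64_t delivered;
	uint64_t delivered_time_us;
	int64_t app_limited_until;
};

void bbr_init(struct bbr_conn *c)
{
	*c = (struct bbr_conn){ .rtprop_us = UINT64_MAX };
}

void on_ack(struct bbr_conn *c, const struct packet *p, uint64_t now_us)
{
	uint64_t rtt = now_us - p->sendtime_us;
	uint64_t elapsed;

	if (rtt < c->rtprop_us)		/* update_min_filter */
		c->rtprop_us = rtt;

	c->delivered += p->size;
	c->delivered_time_us = now_us;

	/* A zero interval (clock too coarse) yields no usable rate sample. */
	elapsed = c->delivered_time_us - p->delivered_time_us;
	if (elapsed > 0) {
		double rate = (double)(c->delivered - p->delivered) / (double)elapsed;

		if ((rate > c->btlbw || !p->app_limited) && rate > c->btlbw)
			c->btlbw = rate;	/* update_max_filter as running max */
	}

	if (c->app_limited_until > 0)
		c->app_limited_until -= (int64_t)p->size;
}
```

Both estimates depend directly on `now_us`: a coarse clock inflates or zeroes `rtt` and `elapsed`, which is why the timestamp source matters for BBR.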
6 . 12
How timer works ?
1. schedule an event
2. add (hr)timer list queue (hrtimer_start())
3. (check expired timers in timer irq) (hrtimer_interrupt())
4. invoke callbacks (__run_hrtimer())
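The four steps above can be sketched with a toy timer queue. This is not the kernel's hrtimer code (which keeps timers in a sorted tree); it is a minimal model with a fixed-size array, and all names in it are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TIMERS 16

struct timer {
	uint64_t expires_ns;
	void (*fn)(void *);
	void *arg;
	int armed;
};

struct timer_queue {
	struct timer t[MAX_TIMERS];
};

/* Example callback: counts how many times it was invoked. */
void count_cb(void *arg)
{
	++*(int *)arg;
}

/* Steps 1-2: schedule an event by adding it to the timer queue. */
int timer_start(struct timer_queue *q, uint64_t expires_ns,
		void (*fn)(void *), void *arg)
{
	for (size_t i = 0; i < MAX_TIMERS; i++) {
		if (!q->t[i].armed) {
			q->t[i] = (struct timer){ expires_ns, fn, arg, 1 };
			return 0;
		}
	}
	return -1;	/* queue full */
}

/* Steps 3-4: called from the timer interrupt; invoke the callback of
 * every expired timer and return how many fired. */
int timer_interrupt(struct timer_queue *q, uint64_t now_ns)
{
	int fired = 0;

	for (size_t i = 0; i < MAX_TIMERS; i++) {
		if (q->t[i].armed && q->t[i].expires_ns <= now_ns) {
			q->t[i].armed = 0;
			q->t[i].fn(q->t[i].arg);
			fired++;
		}
	}
	return fired;
}
```

A callback only runs when `timer_interrupt` is called, so any lateness of the interrupt itself is added to every timer's delay.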
6 . 13
6 . 14
IPv6 ready
6 . 15
how timer interrupt works ?
native thread ver.
1. timer_settime(2)
instantiate a pthread
2. wakeup the thread
3. trigger a timer interrupt (of LKL)
update jiffies, invoke handlers
green thread ver.
1. instantiate a green thread
malloc/mmap, add to sched queue
2. schedule an event
(clock_nanosleep(2) until next event)
or do something (goto above)
3. trigger a timer interrupt
6 . 16
TSO/Checksum offload
virtio based
guest-side: use Linux driver
TCP_STREAM (cubic, no delay)

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

Playing BBR with a userspace network stack using LKL

  • 1. 1 Playing BBR with a userspace network stack Hajime Tazaki IIJ April, 2017, Linux netdev 2.1, Montreal, Canada
  • 2. 2 . 1 Linux Kernel Library A library of Linux kernel code to be reusable on various platforms On userspace applications (can be FUSE, NUSE, BUSE) As a core of Unikernel With network simulation (under development) Use cases Operating system personality Tiny guest operating system (single process) Testing/Debugging
  • 3. 2 . 2 Motivation
      MegaPipe [OSDI '12]: outperforms baseline Linux .. 582% (for short connections); new API for applications (no existing applications benefit)
      mTCP [NSDI '14]: improve... by a factor of 25 compared to the latest Linux TCP; implemented with very limited TCP extensions
      SandStorm [SIGCOMM '14]: our approach ..., demonstrating 2-10x improvements; specialized (no existing applications benefit)
      Arrakis [OSDI '14]: improvements of 2-5x in latency and 9x in throughput .. to Linux; uses a simplified TCP/IP stack (lwip) (loses feature-rich extensions)
      IX [OSDI '14]: improves throughput ... by 3.6x and reduces latency by 2x; uses a simplified TCP/IP stack (lwip) (loses feature-rich extensions)
  • 5. 2 . 4 Motivation (cont'd)
      1. Reuse feature-rich network stack, not re-implement or port
           re-implement: give up (matured) decades' effort
           port: hard to track the latest version
      2. Reuse preserves various semantics
           syntax level (command line)
           API level
           operation level (utility scripts)
      3. Reasonable speed with generalized userspace network stack
           x1 speed of the original
  • 6. 2 . 5 LKL outlooks h/w independent (arch/lkl) various platforms Linux userspace Windows userspace FreeBSD user space qemu/kvm (x86, arm) (unikernel) uEFI (EFIDroid) existing applications support musl libc bind cross build toolchain EFIDroid: http://efidroid.org/
  • 8. 2 . 7 userspace network stack ?
      Concerns about timing accuracy
        how LKL behaves with BBR (requires higher timing accuracy) ?
      Having network stack in userspace may complicate various optimizations
      LKL at netdev1.2 https://youtu.be/xP9crHI0aAU?t=34m18s
  • 9. 3 . 1 Playing BBR with LKL
  • 10. 3 . 2 TCP BBR
      Bottleneck Bandwidth and Round-trip propagation time
      Controls the Tx rate
        congestion control not based on packet loss
        estimates MinRTT and MaxBW (on each ACK)
      http://queue.acm.org/detail.cfm?id=3022184
  • 11. 3 . 3 TCP BBR (cont'd) On Google's B4 WAN (across North America, EU, Asia) Migrated from cubic to bbr in 2016 x2 - x25 improvements
  • 12. 4 . 1 1st Benchmark (Oct. 2016)
      netperf (TCP_STREAM, -K bbr/cubic)
      2-node 10Gbps b2b link
        tap+bridge (LKL)
        direct ixgbe (native)
      No loss, no bottleneck, close link
      netperf(client)         netserver
       +------+               +--------+
       |      |               |        |
       |sender+--------------+|receiver|
       |      |==============>|        |
       |      |               |        |
       +------+               +--------+
      Linux-4.9-rc4           Linux-4.6
      (host,LKL)
      bbr/cubic               cubic,fq_codel (default)
  • 13. 4 . 2 1st Benchmark
      sender (netperf client; Linux-4.9-rc4; host,LKL; bbr/cubic) ==> receiver (netserver; Linux-4.6; cubic,fq_codel (default))
      cc      tput (Linux)    tput (LKL)
      bbr     9414.40 Mbps    456.43 Mbps
      cubic   9411.46 Mbps    9385.28 Mbps
  • 14. 4 . 3 What ??
      only BBR + LKL performs badly
      Investigation
        the ack timestamp used for RTT measurement needs a precise time event (clock)
        providing a high-resolution timestamp improves the BBR performance
  • 15. 4 . 4 Change HZ (tick interval)
      cc      tput (Linux,hz1000)   tput (LKL,hz100)   tput (LKL,hz1000)
      bbr     9414.40 Mbps          456.43 Mbps        6965.05 Mbps
      cubic   9411.46 Mbps          9385.28 Mbps       9393.35 Mbps
  • 16. 4 . 5 Timestamp on (ack) receipt
      From
        unsigned long long __weak sched_clock(void)
        {
                return (unsigned long long)(jiffies - INITIAL_JIFFIES) * (NSEC_PER_SEC / HZ);
        }
      To
        unsigned long long sched_clock(void)
        {
                return lkl_ops->time(); // i.e., clock_gettime()
        }
      cc      tput (Linux)    tput (LKL,hz100)   tput (LKL sched_clock,hz100)
      bbr     9414.40 Mbps    456.43 Mbps        9409.98 Mbps
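The effect of the jiffies-based sched_clock() can be illustrated with a small sketch. It is not LKL code: `jiffies_clock_ns` and `measured_rtt_ns` are hypothetical helpers that only model a clock rounded down to the tick, assuming HZ divides one second evenly.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Timestamp a tick-based clock would report for a true time of true_ns,
 * given hz ticks per second: the true time rounded down to a tick. */
static uint64_t jiffies_clock_ns(uint64_t true_ns, unsigned int hz)
{
    uint64_t tick_ns = NSEC_PER_SEC / hz;

    return (true_ns / tick_ns) * tick_ns;
}

/* RTT as observed through the tick-based clock (hypothetical helper). */
static uint64_t measured_rtt_ns(uint64_t send_ns, uint64_t ack_ns,
                                unsigned int hz)
{
    return jiffies_clock_ns(ack_ns, hz) - jiffies_clock_ns(send_ns, hz);
}
```

With hz = 100, a true RTT of 100 usec is reported either as 0 (both timestamps fall in the same tick) or as a full 10 msec (the ACK crosses a tick boundary), which is the kind of error a nanosecond-resolution clock avoids.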
  • 17. 4 . 6 What happens if no sched_clock() ?
      low throughput due to longer RTT measurement
      A patch (by Neal Cardwell) to tolerate the lower resolution of jiffies
        diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
        index 56fe736..b0f1426 100644
        --- a/net/ipv4/tcp_input.c
        +++ b/net/ipv4/tcp_input.c
        @@ -3196,6 +3196,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_
                        ca_rtt_us = skb_mstamp_us_delta(now, &sack->last_sackt);
                }
                sack->rate->rtt_us = ca_rtt_us; /* RTT of last (S)ACKed packet, or -1 */
        +       if (sack->rate->rtt_us == 0)
        +               sack->rate->rtt_us = jiffies_to_usecs(1);
                rtt_update = tcp_ack_update_rtt(sk, flag, seq_rtt_us, sack_rtt_us, ca_rtt_us);
        diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
        index 9be1581..981c48e 100644
        --- a/net/ipv4/tcp_rate.c
        +++ b/net/ipv4/tcp_rate.c
      cc      tput (Linux)    tput (LKL,hz100)   tput (LKL patched,hz100)
      bbr     9414.40 Mbps    456.43 Mbps        9413.51 Mbps
      https://groups.google.com/forum/#!topic/bbr-dev/sNwlUuIzzOk
  • 18. 5 . 1 2nd Benchmark
      delayed, lossy network on 10Gbps
      netem (middlebox)
      netperf(client)                      netserver
       +------+        +---------+       +--------+
       |      |        |         |       |        |
       |sender+--------+middlebox+------+|receiver|
       |      |======= |======== |======>|        |
       |      |        |         |       |        |
       +------+        +---------+       +--------+
      cc: BBR          1% pkt loss
      fq-enabled       100ms delay
      tcp_wmem=100M
      cc      tput (Linux)    tput (LKL)
      bbr     8602.40 Mbps    145.32 Mbps
      cubic   632.63 Mbps     118.71 Mbps
  • 19. 5 . 2 Memory w/ TCP
      configurable parameter for socket, TCP
        sysctl -w net.ipv4.tcp_wmem="4096 16384 100000000"
      delay and loss w/ TCP requires increased buffer
        "LKL_SYSCTL=net.ipv4.tcp_wmem=4096 16384 100000000"
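A rough way to see why the send buffer has to grow on a delayed link is the bandwidth-delay product. The sketch below is an illustration only: `bdp_bytes` is a hypothetical helper, and treating the 100 ms netem delay as the whole round-trip time is an assumption.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes: link rate (bits/s) times RTT (ms).
 * This is roughly the amount of data a sender must keep in flight,
 * and hence buffered, to fill the link. */
static uint64_t bdp_bytes(uint64_t rate_bps, uint64_t rtt_ms)
{
    return rate_bps / 8 * rtt_ms / 1000;
}
```

For example, `bdp_bytes(10000000000ULL, 100)` gives 125000000 bytes for a 10 Gbps link with a 100 ms RTT, the same order of magnitude as the 100000000-byte maximum set through tcp_wmem above.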
  • 20. 5 . 3 Memory w/ TCP (cont'd)
      default memory size (of LKL): 64MiB
      the size affects the sndbuf size
        static bool tcp_should_expand_sndbuf(const struct sock *sk)
        {
                (snip)
                /* If we are under global TCP memory pressure, do not expand. */
                if (tcp_under_memory_pressure(sk))
                        return false;
                (snip)
        }
  • 21. 5 . 4 Timer related
      CONFIG_HIGH_RES_TIMERS enabled
        the fq scheduler uses it to properly transmit packets at the probed BW
      fq configuration
        instead of tc qdisc add fq
  • 22. 5 . 5 fq scheduler
      Every fq_flow entry is scheduled: schedule a timer event with the high-resolution timer (in nsec)
        static struct sk_buff *fq_dequeue()
          => void qdisc_watchdog_schedule_ns()
            => hrtimer_start()
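To give a feel for the timescale these timer events work at, the following sketch computes the gap between packets at a given pacing rate. `pacing_gap_ns` is a hypothetical helper, not the fq implementation, and it ignores batching such as TSO bursts.

```c
#include <stdint.h>

/* Time (in nsec) between the starts of two consecutive packets of
 * pkt_bytes each when sending at rate_bps bits per second. */
static uint64_t pacing_gap_ns(uint64_t pkt_bytes, uint64_t rate_bps)
{
    return pkt_bytes * 8ULL * 1000000000ULL / rate_bps;
}
```

A 1500-byte packet paced at 10 Gbps gives a gap of 1200 nsec, whereas one tick at HZ=100 is 10,000,000 nsec, so a tick-based timer cannot express such deadlines.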
  • 23. 5 . 6 How slow is the high-resolution timer ?
      Delay = (nsec of expiration) - (nsec of scheduled)
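The same kind of delay can be observed from plain userspace. The sketch below is not the measurement used in the talk: it assumes a POSIX system, and `timer_delay_ns` is a hypothetical helper that sleeps until an absolute deadline and reports how late the wakeup was.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Sleep until (now + sleep_ns) on CLOCK_MONOTONIC and return the delay:
 * (nsec of actual wakeup) - (nsec of scheduled wakeup). */
static int64_t timer_delay_ns(int64_t sleep_ns)
{
    struct timespec now, deadline;
    int64_t deadline_ns;

    clock_gettime(CLOCK_MONOTONIC, &now);
    deadline_ns = ts_to_ns(&now) + sleep_ns;
    deadline.tv_sec = deadline_ns / 1000000000LL;
    deadline.tv_nsec = deadline_ns % 1000000000LL;

    /* clock_nanosleep() returns the error number directly; retry on EINTR. */
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL)
           == EINTR)
        ;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return ts_to_ns(&now) - deadline_ns;
}
```

Calling `timer_delay_ns(100000)` repeatedly and collecting the results gives a distribution of wakeup delays for a 100 usec sleep; the values depend on the host and its load.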
  • 24. 5 . 7 Scheduler improvement
      LKL's scheduler is outsourced to the host thread implementation (green/native)
      minimum delay of the timer interrupt (LKL emulated)
        >60 usec (green thread)
        >20 usec (native thread)
  • 25. 5 . 8 Scheduler improvement
      1. avoid the system call (clock_nanosleep) when blocking
           busy poll (watch the clock instead) if the sleep is < 10 usec
           60 usec => 20 usec
      2. reuse the green thread stack (avoid an mmap per timer irq)
           20 usec => 3 usec
        /* simplified pseudocode */
        int sleep(u64 nsec)
        {
                /* fast path: busy poll the clock for short sleeps */
                if (nsec < 10*1000) {
                        clock_gettime(CLOCK_MONOTONIC, &start);
                        while (1) {
                                clock_gettime(CLOCK_MONOTONIC, &now);
                                if (now - start > nsec)
                                        return 0;
                        }
                }
                /* slow path */
                return syscall(SYS_clock_nanosleep);
        }
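A self-contained version of the busy-poll idea might look like the sketch below. It is not the LKL patch itself: `hybrid_sleep` and `mono_ns` are hypothetical names, the 10 usec threshold is taken from the slide, and a POSIX system with clock_nanosleep() is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define BUSY_POLL_LIMIT_NS 10000LL /* 10 usec, threshold from the slide */

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t mono_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Sleep for at least nsec nanoseconds; returns 0 on success. */
static int hybrid_sleep(int64_t nsec)
{
    struct timespec req;
    int ret;

    /* fast path: spin on the clock instead of blocking in the kernel */
    if (nsec < BUSY_POLL_LIMIT_NS) {
        int64_t start = mono_ns();

        while (mono_ns() - start < nsec)
            ;
        return 0;
    }

    /* slow path: relative sleep, restarted with the remaining time
     * if a signal interrupts it */
    req.tv_sec = nsec / 1000000000LL;
    req.tv_nsec = nsec % 1000000000LL;
    while ((ret = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &req)) == EINTR)
        ;
    return ret;
}
```

The trade-off is CPU time for latency: the fast path burns a core for up to 10 usec but avoids the wakeup cost of a blocking sleep.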
  • 26. 5 . 9 Timer delay improved ? Before (top), After (bottom)
  • 27. 5 . 10 Results (TCP_STREAM, bbr/cubic)
      sender (netperf client; cc: BBR, fq-enabled, tcp_wmem=100M) ==> middlebox (1% pkt loss, 100ms delay) ==> receiver (netserver)
  • 28. 6 . 1 Patched LKL
      1. add sched_clock()
      2. add sysctl configuration i/f (net.ipv4.tcp_wmem)
      3. make system memory configurable (net.ipv4.tcp_mem)
      4. enable CONFIG_HIGH_RES_TIMERS
      5. add sch-fq configuration
      6. scheduler hacked (uspace specific)
           avoid syscall for short sleep
           avoid memory allocation for each (thread) stack
      7. (TSO, csum offload, by Jerry/Yuan from Google, netdev1.2)
  • 29. 6 . 2 Next possible steps
      do profiling (of the lower LKL performance)
        e.g., context switch of uspace threads
      Various short-cuts
        busy polling I/Os (packet, clock, etc)
        replacing packet I/O (packet_mmap)
      short packet performance (i.e., 64B)
      practical workload (e.g., HTTP)
      > 10Gbps link
  • 30. 6 . 3 on qemu/kvm ?
      Based on rumprun unikernel
      Performance under investigation
      No scheduler issue (not depending on syscall)
      - http://www.linux.com/news/enterprise/cloud-computing/751156-are-cloud-operating-systems-the-next-big-thing-
  • 31. Summary Timing accuracy concern was right performance obstacle in userspace execution scheduler related alleviated somehow Timing severe features degraded from Linux other options (unikernel) The benefit of reusable code
  • 32. 6 . 4, 6 . 5 References
      LKL
        https://github.com/lkl/linux
      Other related repos
        https://github.com/libos-nuse/lkl-linux
        https://github.com/libos-nuse/frankenlibc
        https://github.com/libos-nuse/rumprun
  • 34. 6 . 7 Alternatives Full Virtualization KVM Para Virtualization Xen UML Lightweight Virtualization Container/namespaces
  • 35. 6 . 8 What is not LKL ? not specific to a userspace network stack Is a reusable library that we can use everywhere (in theory)
  • 36. 6 . 9 How others think about userspace ? DPDK is not Linux (@ netdev 1.2) The model of DPDK isn't compatible with Linux break security model (protection never works) XDP is Linux
  • 37. 6 . 10 userspace network stack (checklist) Performance Safety Typos take the entire system down Developer pervasiveness Kernel reboot is disruptive Traffic loss ref: XDP Inside and Out (https://github.com/iovisor/bpf-docs/blob/master/XDP_Inside_and_Out.pdf)
  • 38. 6 . 11 TCP BBR (cont'd)
      BBR requires
        packet pacing
        precise RTT measurement
      function onAck(packet)
        rtt = now - packet.sendtime
        update_min_filter(RTpropFilter, rtt)
        delivered += packet.size
        delivered_time = now
        deliveryRate = (delivered - packet.delivered) / (delivered_time - packet.delivered_time)
        if (deliveryRate > BtlBwFilter.currentMax || ! packet.app_limited)
          update_max_filter(BtlBwFilter, deliveryRate)
        if (app_limited_until > 0)
          app_limited_until = app_limited_until - packet.size
      http://queue.acm.org/detail.cfm?id=3022184
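The onAck pseudocode can be turned into a compilable sketch as follows. This is not the Linux tcp_bbr implementation: the struct layouts and names are assumptions, and the windowed min/max filters are replaced by a plain running minimum and maximum to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

struct packet {
    uint64_t sendtime_us;        /* when the packet was sent */
    uint64_t size;               /* bytes */
    uint64_t delivered;          /* connection's delivered count at send time */
    uint64_t delivered_time_us;  /* connection's delivered_time at send time */
    bool app_limited;            /* sent while the application had no data */
};

struct bbr_state {
    uint64_t rtprop_us;          /* min RTT seen; start at UINT64_MAX */
    double btlbw;                /* max delivery rate seen, bytes per usec */
    uint64_t delivered;
    uint64_t delivered_time_us;
    uint64_t app_limited_until;
};

static void on_ack(struct bbr_state *s, const struct packet *p,
                   uint64_t now_us)
{
    uint64_t rtt = now_us - p->sendtime_us;
    uint64_t interval;

    /* update_min_filter(): running minimum instead of a windowed filter */
    if (rtt < s->rtprop_us)
        s->rtprop_us = rtt;

    s->delivered += p->size;
    s->delivered_time_us = now_us;

    interval = s->delivered_time_us - p->delivered_time_us;
    if (interval > 0) {
        double rate = (double)(s->delivered - p->delivered) / (double)interval;

        /* update_max_filter(): with a running maximum the app_limited
         * test has no extra effect; it matters for a windowed filter */
        if ((rate > s->btlbw || !p->app_limited) && rate > s->btlbw)
            s->btlbw = rate;
    }

    /* clamp at zero to avoid unsigned underflow (an addition here) */
    if (s->app_limited_until > 0)
        s->app_limited_until = s->app_limited_until > p->size ?
                               s->app_limited_until - p->size : 0;
}
```

Both estimates depend directly on timestamps (`now_us`, `sendtime_us`, `delivered_time_us`), so a coarse clock distorts the RTT and the delivery rate at the same time.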
  • 39. 6 . 12 How timer works ?
      1. schedule an event
      2. add (hr)timer list queue (hrtimer_start())
      3. (check expired timers in timer irq) (hrtimer_interrupt())
      4. invoke callbacks (__run_hrtimer())
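The four steps can be mimicked with a tiny userspace model. This is only a sketch: the kernel's hrtimer code keeps timers in a red-black tree and is far more involved, whereas this example uses a sorted linked list, and all `sim_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct sim_timer {
    uint64_t expires_ns;
    void (*fn)(struct sim_timer *t);
    struct sim_timer *next;
};

static struct sim_timer *timer_queue; /* sorted by expiry, earliest first */
static int fired_count;               /* used by the sample callback */

/* Sample callback: just counts invocations. */
static void count_cb(struct sim_timer *t)
{
    (void)t;
    fired_count++;
}

/* Steps 1-2: schedule an event by inserting it into the sorted queue. */
static void sim_timer_start(struct sim_timer *t, uint64_t expires_ns,
                            void (*fn)(struct sim_timer *))
{
    struct sim_timer **pp = &timer_queue;

    t->expires_ns = expires_ns;
    t->fn = fn;
    while (*pp && (*pp)->expires_ns <= expires_ns)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Steps 3-4: on a timer interrupt, pop every expired timer and invoke
 * its callback. Returns the number of callbacks run. */
static int sim_timer_interrupt(uint64_t now_ns)
{
    int fired = 0;

    while (timer_queue && timer_queue->expires_ns <= now_ns) {
        struct sim_timer *t = timer_queue;

        timer_queue = t->next;
        t->next = NULL;
        t->fn(t);
        fired++;
    }
    return fired;
}
```

A callback only runs when the interrupt handler is entered, so any lateness in delivering the (emulated) timer interrupt adds directly to the delay seen by the timer's user.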
  • 40. Timer delay improved ? Before (top), After (bottom)
  • 41. 6 . 13, 6 . 14 IPv6 ready
  • 42. 6 . 15 how timer interrupt works ?
      native thread ver.
        1. timer_settime(2)
             instantiate a pthread
        2. wakeup the thread
        3. trigger a timer interrupt (of LKL)
             update jiffies, invoke handlers
      green thread ver.
        1. instantiate a green thread
             malloc/mmap, add to sched queue
        2. schedule an event
             clock_nanosleep(2) until next event, or do something (goto above)
        3. trigger a timer interrupt
  • 43. 6 . 16 TSO/Checksum offload virtio based guest-side: use Linux driver TCP_STREAM (cubic, no delay)