Gary Lin
Software Engineer, SUSE Labs
glin@suse.com
Introduction to eBPF and XDP
Taiwan Linux Kernel Hackers
eBPF
BPF?
Berkeley Packet Filter
BPF
No Red
BPF Program
The BSD Packet Filter: A New Architecture for User-level Packet Capture
December 19, 1992
SCO lawsuit, August 2003
BPF ASM
ldh [12]
jne #0x800, drop
ldb [23]
jneq #1, drop
# get a random uint32 number
ld rand
mod #4
jneq #1, drop
ret #-1
drop: ret #0
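The same filter can be restated in plain C, which may make the byte offsets easier to follow. This is only a sketch: the function name is made up, the random value is passed in as a parameter so the result is deterministic, and it assumes an untagged Ethernet frame (no VLAN header), so the EtherType sits at offset 12 and the IPv4 protocol field at offset 23.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain-C restatement of the filter above. `rnd` stands in for the value
 * that `ld rand` would load; the function name is hypothetical. */
uint32_t icmp_sample_filter(const uint8_t *pkt, size_t len, uint32_t rnd)
{
    /* A classic BPF load past the end of the packet ends the filter with 0,
     * so a frame too short to hold byte 23 is never accepted. */
    if (len < 24)
        return 0;

    /* ldh [12]; jne #0x800, drop : EtherType must be IPv4 */
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    if (ethertype != 0x0800)
        return 0;

    /* ldb [23]; jneq #1, drop : IPv4 protocol must be ICMP (1) */
    if (pkt[23] != 1)
        return 0;

    /* ld rand; mod #4; jneq #1, drop : keep roughly one packet in four */
    if (rnd % 4 != 1)
        return 0;

    return 0xffffffffu;   /* ret #-1 : accept the whole packet */
}
```

Calling it with an IPv4 ICMP frame and `rnd = 5` accepts the packet; `rnd = 4` drops it.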
BPF Bytecode
struct sock_filter code[] = {
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 8, 0x000086dd },
{ 0x30, 0, 0, 0x00000014 },
{ 0x15, 2, 0, 0x00000084 },
{ 0x15, 1, 0, 0x00000006 },
{ 0x15, 0, 17, 0x00000011 },
{ 0x28, 0, 0, 0x00000036 },
{ 0x15, 14, 0, 0x00000016 },
{ 0x28, 0, 0, 0x00000038 },
{ 0x15, 12, 13, 0x00000016 },
...
};
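To show how such a table is executed, here is a minimal interpreter sketch. It is not the kernel's implementation: the type and function names are made up, it understands only four opcodes (0x28 `ldh`, 0x30 `ldb`, 0x15 `jeq`, 0x06 `ret`), and the `icmp_only` program is hand-assembled for illustration rather than taken from the table above.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the entries above: opcode, jump-if-true offset,
 * jump-if-false offset, 32-bit constant. The type name is made up. */
struct cbpf_insn {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* Interprets only the four opcodes this sketch needs. */
uint32_t cbpf_run(const struct cbpf_insn *prog, size_t n,
                  const uint8_t *pkt, size_t len)
{
    uint32_t a = 0;                      /* the accumulator register */

    for (size_t pc = 0; pc < n; pc++) {
        const struct cbpf_insn *i = &prog[pc];

        switch (i->code) {
        case 0x28:                       /* ldh [k]: 16-bit big-endian load */
            if (i->k > len || len - i->k < 2)
                return 0;                /* out-of-bounds load: drop */
            a = ((uint32_t)pkt[i->k] << 8) | pkt[i->k + 1];
            break;
        case 0x30:                       /* ldb [k]: 8-bit load */
            if (i->k >= len)
                return 0;
            a = pkt[i->k];
            break;
        case 0x15:                       /* jeq #k: skip jt or jf instructions */
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case 0x06:                       /* ret #k */
            return i->k;
        default:                         /* opcode outside this sketch */
            return 0;
        }
    }
    return 0;                            /* fell off the end: drop */
}

/* "Accept IPv4 ICMP, drop the rest", hand-assembled for this sketch. */
const struct cbpf_insn icmp_only[6] = {
    { 0x28, 0, 0, 0x0000000c },          /* ldh [12]                    */
    { 0x15, 0, 3, 0x00000800 },          /* jeq #0x800, else goto drop  */
    { 0x30, 0, 0, 0x00000017 },          /* ldb [23]                    */
    { 0x15, 0, 1, 0x00000001 },          /* jeq #1, else goto drop      */
    { 0x06, 0, 0, 0xffffffff },          /* ret #-1 (accept)            */
    { 0x06, 0, 0, 0x00000000 },          /* drop: ret #0                */
};
```

Jump offsets are relative to the next instruction, which is why the loop's own `pc++` is still applied after a jump.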
Virtual Machine
kind of
BPF JIT
# find arch -name 'bpf_jit*'
arch/sparc/net/bpf_jit_asm_64.S
...
arch/arm/net/bpf_jit_32.c
arch/arm/net/bpf_jit_32.h
arch/arm64/net/bpf_jit_comp.c
arch/arm64/net/bpf_jit.h
arch/powerpc/net/bpf_jit_asm64.S
arch/powerpc/net/bpf_jit_asm.S
...
arch/s390/net/bpf_jit_comp.c
arch/s390/net/bpf_jit.S
...
arch/mips/net/bpf_jit_asm.S
...
arch/x86/net/bpf_jit_comp.c
arch/x86/net/bpf_jit.S
Stable and Fast!
Linux 3.15
Extended BPF
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8
From BPF to eBPF
●
2 32-bit registers → 10 64-bit registers
●
New instructions
BPF_MOV, BPF_JNE, BPF_CALL, …
●
Helper functions
●
eBPF verifier: kernel/bpf/verifier.c
Loading programs from user space
●
eBPF map
BPF Calling Convention
●
R0
Return value from in-kernel function, and exit value for eBPF program
●
R1 – R5
Arguments from eBPF program to in-kernel function
●
R6 – R9
Callee saved registers that in-kernel function will preserve
●
R10
Read-only frame pointer to access stack
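The convention can be illustrated with a toy interpreter. This is a sketch under stated assumptions, not kernel code: the struct mirrors the field layout of the kernel's `struct bpf_insn`, but the type name, the helper table, and `helper_add` are hypothetical, and only five opcodes are handled. Zeroing R1 to R5 after a call is this sketch's way of modelling that the caller may not rely on them afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Field layout follows the kernel's struct bpf_insn; the name is made up. */
struct ebpf_insn {
    uint8_t code;          /* opcode */
    uint8_t dst_reg : 4;   /* destination register */
    uint8_t src_reg : 4;   /* source register */
    int16_t off;           /* signed offset (unused here) */
    int32_t imm;           /* signed immediate */
};

typedef uint64_t (*helper_fn)(uint64_t, uint64_t, uint64_t, uint64_t, uint64_t);

/* Hypothetical helper #1: adds its first two arguments. */
static uint64_t helper_add(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
    (void)c; (void)d; (void)e;
    return a + b;
}

static const helper_fn helpers[] = { NULL, helper_add };

/* Runs a program and returns R0 at exit; returns 0 on anything unsupported. */
uint64_t ebpf_run(const struct ebpf_insn *prog, size_t n)
{
    uint64_t r[11] = {0};  /* R0..R10 */

    for (size_t pc = 0; pc < n; pc++) {
        const struct ebpf_insn *i = &prog[pc];

        if (i->dst_reg > 10 || i->src_reg > 10)
            return 0;
        switch (i->code) {
        case 0xb7:         /* BPF_ALU64 | BPF_MOV | BPF_K: dst = imm */
            if (i->dst_reg == 10)
                return 0;  /* R10 is read-only */
            r[i->dst_reg] = (uint64_t)(int64_t)i->imm;
            break;
        case 0xbf:         /* BPF_ALU64 | BPF_MOV | BPF_X: dst = src */
            if (i->dst_reg == 10)
                return 0;
            r[i->dst_reg] = r[i->src_reg];
            break;
        case 0x0f:         /* BPF_ALU64 | BPF_ADD | BPF_X: dst += src */
            if (i->dst_reg == 10)
                return 0;
            r[i->dst_reg] += r[i->src_reg];
            break;
        case 0x85:         /* BPF_JMP | BPF_CALL: R0 = helper(R1..R5) */
            if (i->imm < 1 || i->imm >= (int32_t)(sizeof helpers / sizeof helpers[0]))
                return 0;
            r[0] = helpers[i->imm](r[1], r[2], r[3], r[4], r[5]);
            r[1] = r[2] = r[3] = r[4] = r[5] = 0;  /* arguments do not survive */
            break;
        case 0x95:         /* BPF_JMP | BPF_EXIT: R0 is the exit value */
            return r[0];
        default:
            return 0;
        }
    }
    return 0;
}

/* r6 = 7; r1 = 20; r2 = 22; r0 = helper#1(r1, r2); r0 += r6; exit */
const struct ebpf_insn demo[6] = {
    { 0xb7, 6, 0, 0, 7 },
    { 0xb7, 1, 0, 0, 20 },
    { 0xb7, 2, 0, 0, 22 },
    { 0x85, 0, 0, 0, 1 },
    { 0x0f, 0, 6, 0, 0 },
    { 0x95, 0, 0, 0, 0 },
};
```

`ebpf_run(demo, 6)` returns 49: the helper's result (42) arrives in R0, and R6 still holds 7 after the call because it is callee-saved.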
x86_64 Register Mapping
R0 → rax
R1 → rdi
R2 → rsi
R3 → rdx
R4 → rcx
R5 → r8
R6 → rbx
R7 → r13
R8 → r14
R9 → r15
R10 → rbp
BPF Helper Functions
●
/usr/include/linux/bpf.h
– bpf_probe_read
– bpf_ktime_get_ns
– bpf_trace_printk
– bpf_get_smp_processor_id
– bpf_perf_event_output
– ...
eBPF Verifier
●
Instructions limit: 4096
●
Two-Step Verification
– Directed acyclic graph check
– Execution Simulation
Directed Acyclic Graph Check
●
Back-edge detection
●
Unreachable instructions
PERMISSION DENIED
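Both checks can be sketched as a depth-first walk over the control-flow graph. The code below is a toy model with made-up instruction kinds and names, not the verifier itself, and it recurses where a production implementation would more likely iterate. An edge into an instruction that is still on the current DFS path is a back-edge (a loop); any instruction never reached from instruction 0 is unreachable.

```c
#include <stdint.h>
#include <string.h>

enum toy_kind { K_NEXT, K_JMP, K_COND, K_EXIT };  /* toy instruction kinds */

struct toy_insn {
    enum toy_kind kind;
    int off;               /* jump target = index + 1 + off */
};

enum { CFG_OK = 0, CFG_BACK_EDGE = -1, CFG_UNREACHABLE = -2, CFG_BAD_JUMP = -3 };
enum { WHITE = 0, GREY = 1, BLACK = 2 };  /* unvisited, on DFS path, finished */

#define TOY_MAX_INSNS 64

static int toy_visit(const struct toy_insn *p, int n, int i, uint8_t *state)
{
    int rc = CFG_OK;

    if (i < 0 || i >= n)
        return CFG_BAD_JUMP;   /* jump or fall-through leaves the program */
    if (state[i] == GREY)
        return CFG_BACK_EDGE;  /* target is still on the DFS path: a loop */
    if (state[i] == BLACK)
        return CFG_OK;         /* already fully explored */

    state[i] = GREY;
    switch (p[i].kind) {
    case K_EXIT:
        break;
    case K_NEXT:
        rc = toy_visit(p, n, i + 1, state);
        break;
    case K_JMP:
        rc = toy_visit(p, n, i + 1 + p[i].off, state);
        break;
    case K_COND:               /* explore fall-through, then the branch */
        rc = toy_visit(p, n, i + 1, state);
        if (rc == CFG_OK)
            rc = toy_visit(p, n, i + 1 + p[i].off, state);
        break;
    }
    state[i] = BLACK;
    return rc;
}

int toy_check_cfg(const struct toy_insn *p, int n)
{
    uint8_t state[TOY_MAX_INSNS];
    int rc;

    if (n <= 0 || n > TOY_MAX_INSNS)
        return CFG_BAD_JUMP;
    memset(state, WHITE, sizeof state);

    rc = toy_visit(p, n, 0, state);
    if (rc != CFG_OK)
        return rc;
    for (int i = 0; i < n; i++)
        if (state[i] == WHITE)
            return CFG_UNREACHABLE;  /* never reached from instruction 0 */
    return CFG_OK;
}
```

A conditional jump with a negative offset that lands on an earlier instruction of the same path yields `CFG_BACK_EDGE`; an unconditional jump over an instruction that nothing else targets yields `CFG_UNREACHABLE`.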
Execution Simulation
●
Reading an uninitialized register
●
Arithmetic on two valid pointers
●
Load or store registers of invalid types
●
Read stack before writing data into stack
PERMISSION DENIED
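A toy version of the simulation step tracks only the type held by each register. Everything here is an assumption made for illustration: the three register types, the four operations, and the names are invented, and the sketch handles straight-line code only, so it covers two of the listed checks (uninitialized reads and pointer-plus-pointer arithmetic) and nothing about the stack or branches.

```c
#include <stddef.h>

enum reg_type { NOT_INIT, SCALAR, POINTER };             /* tracked register state */
enum sim_op { OP_MOV_IMM, OP_MOV_REG, OP_ADD_REG, OP_EXIT };

struct sim_insn {
    enum sim_op op;
    int dst;
    int src;
};

/* Returns 0 if the program is accepted, -1 if it is rejected. */
int simulate(const struct sim_insn *prog, size_t n)
{
    enum reg_type r[11];

    for (int i = 0; i < 11; i++)
        r[i] = NOT_INIT;
    r[1] = POINTER;    /* R1 holds the context pointer on entry */
    r[10] = POINTER;   /* R10 is the frame pointer */

    for (size_t pc = 0; pc < n; pc++) {
        const struct sim_insn *i = &prog[pc];

        if (i->dst < 0 || i->dst > 10 || i->src < 0 || i->src > 10)
            return -1;
        if (i->op != OP_EXIT && i->dst == 10)
            return -1;                       /* R10 is read-only */

        switch (i->op) {
        case OP_MOV_IMM:                     /* dst = constant */
            r[i->dst] = SCALAR;
            break;
        case OP_MOV_REG:                     /* dst = src */
            if (r[i->src] == NOT_INIT)
                return -1;                   /* read of uninitialized register */
            r[i->dst] = r[i->src];
            break;
        case OP_ADD_REG:                     /* dst += src */
            if (r[i->dst] == NOT_INIT || r[i->src] == NOT_INIT)
                return -1;
            if (r[i->dst] == POINTER && r[i->src] == POINTER)
                return -1;                   /* arithmetic on two pointers */
            if (r[i->src] == POINTER)
                r[i->dst] = POINTER;         /* scalar + pointer is a pointer */
            break;
        case OP_EXIT:                        /* R0 must hold an exit value */
            return r[0] == NOT_INIT ? -1 : 0;
        }
    }
    return -1;                               /* ran off the end without exit */
}
```

For example, `r0 = r2; exit` is rejected because R2 was never written, while `r0 = 0; exit` is accepted.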
Stable, Fast, and Secure!
eBPF Maps
eBPF Map Types
●
Hash
●
Array
●
Tail Call Array
●
Per-CPU Hash/Array
●
Stack Trace
●
cgroup Array
●
LRU (per-CPU) Hash
●
Longest-Prefix Matching Trie
●
Array/Hash of Maps
●
Net device Map
●
Socket Map
https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#tables-aka-maps
eBPF Map Commands (bpf() syscall)
●
BPF_MAP_CREATE
●
BPF_MAP_LOOKUP_ELEM
●
BPF_MAP_UPDATE_ELEM
●
BPF_MAP_DELETE_ELEM
●
BPF_MAP_GET_NEXT_KEY
●
BPF_MAP_GET_NEXT_ID
●
BPF_MAP_GET_FD_BY_ID
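The shape of the element operations can be modelled in userspace. This is a toy stand-in, not the kernel interface: the real commands are issued through the bpf() syscall with a `union bpf_attr`, whereas the names, the fixed capacity, and the key/value types below are invented for this sketch. It mainly shows the GET_NEXT_KEY iteration idiom: ask with no key for the first key, then keep asking with the previous key until the call fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_SLOTS 8

/* Toy map: fixed capacity, 32-bit keys, 64-bit values. */
struct toy_map {
    bool     used[TOY_MAP_SLOTS];
    uint32_t key[TOY_MAP_SLOTS];
    uint64_t val[TOY_MAP_SLOTS];
};

static int toy_find(const struct toy_map *m, uint32_t key)
{
    for (int i = 0; i < TOY_MAP_SLOTS; i++)
        if (m->used[i] && m->key[i] == key)
            return i;
    return -1;
}

/* Models BPF_MAP_UPDATE_ELEM: overwrite the key or take the first free slot. */
int toy_map_update(struct toy_map *m, uint32_t key, uint64_t val)
{
    int slot = toy_find(m, key);

    for (int i = 0; slot < 0 && i < TOY_MAP_SLOTS; i++)
        if (!m->used[i])
            slot = i;
    if (slot < 0)
        return -1;                       /* map is full */
    m->used[slot] = true;
    m->key[slot] = key;
    m->val[slot] = val;
    return 0;
}

/* Models BPF_MAP_LOOKUP_ELEM: copy the value out, or fail if the key is absent. */
int toy_map_lookup(const struct toy_map *m, uint32_t key, uint64_t *val)
{
    int slot = toy_find(m, key);

    if (slot < 0)
        return -1;
    *val = m->val[slot];
    return 0;
}

/* Models BPF_MAP_DELETE_ELEM. */
int toy_map_delete(struct toy_map *m, uint32_t key)
{
    int slot = toy_find(m, key);

    if (slot < 0)
        return -1;
    m->used[slot] = false;
    return 0;
}

/* Models BPF_MAP_GET_NEXT_KEY: a NULL (or unknown) key yields the first key;
 * the call fails once nothing follows `key`. */
int toy_map_get_next_key(const struct toy_map *m, const uint32_t *key,
                         uint32_t *next)
{
    int start = key ? toy_find(m, *key) + 1 : 0;

    for (int i = start; i < TOY_MAP_SLOTS; i++) {
        if (m->used[i]) {
            *next = m->key[i];
            return 0;
        }
    }
    return -1;
}

/* Iteration idiom: walk every key and add up the values. */
uint64_t toy_map_sum(const struct toy_map *m)
{
    uint64_t sum = 0, v;
    uint32_t cur, next;
    int rc = toy_map_get_next_key(m, NULL, &next);

    while (rc == 0) {
        cur = next;
        if (toy_map_lookup(m, cur, &v) == 0)
            sum += v;
        rc = toy_map_get_next_key(m, &cur, &next);
    }
    return sum;
}
```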
eBPF
[Figure: a user program in userspace loads BPF bytecode into the kernel with BPF_PROG_LOAD and reaches the map with BPF_MAP_* commands; the BPF bytecode in the kernel accesses the same map]
[Figure: the User Program in userspace ("whatever you want") and the eBPF Kernel Program in the kernel ("as simple as possible") share data through an eBPF map]
BTW
clang >= 3.7 with bpf target
$ clang -O2 -target bpf -c source.c -o code.o
eBPF Projects
●
Networking
tc, socket, XDP, cilium, ...
●
System Tracing and Monitoring
kprobe/uprobe/tracepoint/perf event/usdt
●
Security
Landlock LSM, seccomp
●
System Error Handler Testing
eBPF directed error injection
XDP
RX Packet Processing
[Figure: RX path: NIC → Driver → Network Stack (kernel) → Network Program (userspace)]
DDoS
[Figure: RX path: NIC → Driver → Network Stack (kernel) → Network Program (userspace)]
netfilter
[Figure: RX path: NIC → Driver → Network Stack (kernel) → Network Program (userspace); netfilter, inside the network stack, drops the packet (DROP)]
Traffic Control
[Figure: RX path: NIC → Driver → Network Stack (kernel) → Network Program (userspace); TC ingress drops the packet (DROP)]
eXpress Data Path
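[Figure: RX path: NIC → Driver → Network Stack (kernel) → Network Program (userspace); the eBPF program runs in the driver before skb allocation and can DROP the packet or send it back out (TX)]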
XDP
●
A high performance, programmable network data path
Attaching eBPF programs through netlink (IFLA_XDP)
●
No specialized hardware
●
No kernel bypass
●
Works with the existing network stack
●
Direct packet write
[Figure: hooks along the RX path after the driver, in order: generic XDP, tc ingress, netfilter ingress, then the rest of the network stack and the Network Program in userspace]
Generic XDP
virtnet_poll [virtio_net]() {
  receive_buf [virtio_net]() {
    receive_mergeable [virtio_net]() {
      bpf_prog_run_xdp();  ---- Native XDP
      page_to_skb [virtio_net]() {
        __napi_alloc_skb() {
          __build_skb();
        }
        skb_put();
      }
    }
    skb_gro_reset_offset();
    tcp4_gro_receive() {
      tcp_gro_receive();
    }
    netif_receive_skb_internal() {
      netif_receive_generic_xdp();  ---- generic XDP
      __netif_receive_skb() {
        __netif_receive_skb_core() {
          sch_handle_ingress();  ---- TC ingress
          nf_ingress();  ---- Netfilter Ingress
          ip_rcv() {
            nf_hook_slow() {  ---- Netfilter RAW Pre-routing
              ipv4_conntrack_defrag [nf_defrag_ipv4]();
              ipv4_conntrack_in [nf_conntrack_ipv4]() {
                nf_conntrack_in [nf_conntrack]() {
                  ipv4_get_l4proto [nf_conntrack_ipv4]();
                  __nf_ct_l4proto_find [nf_conntrack]();
                  tcp_error [nf_conntrack]() {
                    nf_ip_checksum();
                  }
                  nf_ct_get_tuple [nf_conntrack]() {
                    ipv4_pkt_to_tuple [nf_conntrack_ipv4]();
                    tcp_pkt_to_tuple [nf_conntrack]();
                  }
                  hash_conntrack_raw [nf_conntrack]();
                  __nf_conntrack_find_get [nf_conntrack]();
                  tcp_get_timeouts [nf_conntrack]();
                  tcp_packet [nf_conntrack]() {
                    tcp_in_window [nf_conntrack]() {
                      nf_ct_seq_offset [nf_conntrack]();
                      tcp_options.isra.11 [nf_conntrack]();
                    }
                    __nf_ct_refresh_acct [nf_conntrack]();
                  }
                }
              }
            }
            ip_rcv_finish() {
              tcp_v4_early_demux();
              ip_route_input_noref();
              ip_local_deliver() {  ---- routing decisions
                nf_hook_slow() {  ---- Netfilter filter Input
                  ipt_do_table [ip_tables]();
                  ipv4_helper [nf_conntrack_ipv4]();
                  ipv4_confirm [nf_conntrack_ipv4]();
                }
                ip_local_deliver_finish() {
                  raw_local_deliver();
                  tcp_v4_rcv() {  ---- L4 Protocol Handler
                    tcp_filter() {
                      security_sock_rcv_skb();
                    }
                    tcp_prequeue();
                    tcp_v4_do_rcv() {
                      tcp_rcv_state_process() {
                        tcp_parse_options();
                        tcp_ack() {
                          ...
#define KBUILD_MODNAME "foo" /* for some headers */
#include <uapi/linux/bpf.h>
...
SEC("xdp_prog")
int xdp_prog(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    ...
    return XDP_DROP; /* the action */
}
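The elided body typically starts with a bounds check against `data_end` before any header field is read. The sketch below shows that pattern in ordinary userspace C so it can be tried without loading anything into the kernel. The assumptions: the function takes `data` and `data_end` directly instead of a `struct xdp_md`, `struct eth_hdr` is a byte-only stand-in for `struct ethhdr`, the action values are redefined locally to match the kernel's `enum xdp_action`, and the policy (drop IPv4 ICMP and truncated frames, pass everything else) is purely illustrative.

```c
#include <stdint.h>

/* Local copies of the action codes; values match the kernel's enum xdp_action. */
enum { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Byte-only stand-in for struct ethhdr, so no alignment is assumed. */
struct eth_hdr {
    uint8_t dst[6];
    uint8_t src[6];
    uint8_t proto[2];      /* EtherType, network byte order */
};

/* Hypothetical policy: drop IPv4 ICMP and truncated frames, pass the rest. */
int xdp_prog_sketch(void *data, void *data_end)
{
    uint8_t *end = data_end;
    struct eth_hdr *eth = data;
    uint8_t *ip;

    /* Bounds check before touching the Ethernet header. */
    if ((uint8_t *)(eth + 1) > end)
        return XDP_DROP;

    /* Anything other than IPv4 (EtherType 0x0800) goes up the stack. */
    if (eth->proto[0] != 0x08 || eth->proto[1] != 0x00)
        return XDP_PASS;

    /* Bounds check again before reading the minimal 20-byte IPv4 header. */
    ip = (uint8_t *)(eth + 1);
    if (ip + 20 > end)
        return XDP_DROP;

    /* Byte 9 of the IPv4 header is the protocol field; 1 means ICMP. */
    return ip[9] == 1 ? XDP_DROP : XDP_PASS;
}
```

In an actual XDP program the same comparisons against `data_end` are what allow the verifier to accept the subsequent packet reads.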
XDP Actions
●
XDP_ABORTED
Indicate eBPF program error (treat as XDP_DROP)
●
XDP_DROP
Drop the packet
●
XDP_PASS
Pass the packet up to the stack
●
XDP_TX
Transmit the packet out through the same NIC
●
XDP_REDIRECT (4.14)
Redirect the packet to another NIC or CPU
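As a small illustration of XDP_TX, a program that bounces a frame back to its sender usually swaps the Ethernet source and destination addresses in place before returning the action. The function below is a userspace sketch of that step only; its name and the `ACT_*` constants are made up (the values mirror XDP_DROP and XDP_TX), and an in-kernel program would typically use a compiler builtin rather than the libc `memcpy`.

```c
#include <stdint.h>
#include <string.h>

#define ACT_DROP 1   /* same value as XDP_DROP */
#define ACT_TX   3   /* same value as XDP_TX */

/* Swap destination and source MAC in place, then ask for retransmission. */
int reflect_frame(uint8_t *data, uint8_t *data_end)
{
    uint8_t tmp[6];

    /* An Ethernet header is 14 bytes; refuse anything shorter. */
    if (data_end - data < 14)
        return ACT_DROP;

    memcpy(tmp, data, 6);          /* save the destination MAC */
    memcpy(data, data + 6, 6);     /* destination = old source */
    memcpy(data + 6, tmp, 6);      /* source = old destination */
    return ACT_TX;
}
```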
XDP Restrictions
●
Memory model change in driver
– One packet per memory page (memory waste)
– ixgbe and i40e using refcnt instead of one packet per page
●
No per-RX-queue XDP instance yet
●
XDP_REDIRECT only supported by limited drivers
●
eBPF program limitations
Current Status
●
XDP Core: 4.8
●
Supported Drivers
– mlx4: 4.8
– mlx5: 4.9
– nfp, qed, virtio_net: 4.10
– ixgbe, generic_xdp, thunderx: 4.12
– i40e: 4.13
– veth, tap: 4.14
XDP Benchmarks (mlx4)
●
Generated using pktgen
●
Single core
– ip routing drop: ~3.6 Mpps
– tc clsact using bpf: ~4.2 Mpps
– XDP drop: 20 Mpps (< 10% cpu util)
https://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016
XDP Benchmarks (virtio-net)
●
Generated using pktgen
●
Host: i7-4790 CPU @ 3.6 GHz
●
Single core qemu guest
– iptables drop (raw preroute): ~3.0 Mpps
– tc clsact using bpf: ~3.0 Mpps
– generic XDP drop: ~3.5 Mpps
– native XDP drop: ~4.0 Mpps
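These rates translate into a per-packet time budget. The helpers below are a back-of-the-envelope sketch with made-up names; applying the host's 3.6 GHz clock to the guest's single core is an assumption, not something the benchmark states.

```c
/* Nanoseconds available per packet at a given rate in Mpps. */
double ns_per_packet(double mpps)
{
    return 1000.0 / mpps;   /* 1e9 ns per second / (mpps * 1e6 packets per second) */
}

/* Approximate CPU cycles per packet at a given clock in GHz (cycles per ns). */
double cycles_per_packet(double mpps, double ghz)
{
    return ns_per_packet(mpps) * ghz;
}
```

At ~4.0 Mpps (native XDP drop) that is about 250 ns, or roughly 900 cycles at 3.6 GHz, per packet; at ~3.0 Mpps it is about 333 ns.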
XDP Use Cases
●
DDoS attack mitigation
●
Load Balancing
●
Tunnelling: packet header handling
●
Network sampling and monitoring
●
And more
Questions?
Thank You!
References
●
BPF and XDP Reference Guide
http://cilium.readthedocs.io/en/stable/bpf/
●
Dive into BPF: a list of reading material
https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/
●
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
Documentation/networking/filter.txt
Join Us at www.opensuse.org
License
This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any
derivative work is distributed under the same license.
Details can be found at https://creativecommons.org/licenses/by-sa/4.0/
General Disclaimer
This document is not to be construed as a promise by any participating organisation to develop, deliver, or
market a product. It is not a commitment to deliver any material, code, or functionality, and should not be
relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect
to the contents of this document, and specifically disclaims any express or implied warranties of
merchantability or fitness for any particular purpose. The development, release, and timing of features or
functionality described for openSUSE products remains at the sole discretion of openSUSE. Further,
openSUSE reserves the right to revise this document and to make changes to its content, at any time,
without obligation to notify any person or entity of such revisions or changes. All openSUSE marks
referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States
and other countries. All third-party trademarks are the property of their respective owners.
Credits
Template
Richard Brown
rbrown@opensuse.org
Design & Inspiration
openSUSE Design Team
http://opensuse.github.io/branding-guidelines/
