Introduction to eBPF and XDP

Gary Lin
Software Engineer, SUSE Labs
glin@suse.com
Taiwan Linux Kernel Hackers

The BSD Packet Filter:
A New Architecture for User-level
Packet Capture
December 19, 1992

BPF ASM
ldh [12]
jne #0x800, drop
ldb [23]
jneq #1, drop
# get a random uint32 number
ld rand
mod #4
jneq #1, drop
ret #-1
drop: ret #0
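
Read as plain C, the filter above does roughly the following: it keeps about one in four IPv4 ICMP packets and drops everything else. This is only a sketch. The function name `sample_icmp` and the `rnd` parameter (standing in for `ld rand`) are made up for illustration, and the packet is assumed to be a raw Ethernet frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the filter; rnd stands in for "ld rand". */
uint32_t sample_icmp(const uint8_t *pkt, size_t len, uint32_t rnd)
{
    assert(pkt != NULL);

    /* ldh [12]: EtherType. Classic BPF returns 0 on an out-of-bounds load. */
    if (len < 14)
        return 0;
    uint32_t ethertype = ((uint32_t)pkt[12] << 8) | pkt[13];
    if (ethertype != 0x800)      /* jne #0x800, drop: not IPv4 */
        return 0;

    /* ldb [23]: IPv4 protocol field (14-byte Ethernet header + offset 9) */
    if (len < 24)
        return 0;
    if (pkt[23] != 1)            /* jneq #1, drop: 1 is ICMP */
        return 0;

    if (rnd % 4 != 1)            /* ld rand; mod #4; jneq #1, drop */
        return 0;

    return 0xffffffffu;          /* ret #-1: keep the whole packet */
}
```

The return value is the number of bytes to keep, so 0 drops the packet and 0xffffffff keeps all of it.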

BPF Bytecode
struct sock_filter code[] = {
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 8, 0x000086dd },
{ 0x30, 0, 0, 0x00000014 },
{ 0x15, 2, 0, 0x00000084 },
{ 0x15, 1, 0, 0x00000006 },
{ 0x15, 0, 17, 0x00000011 },
{ 0x28, 0, 0, 0x00000036 },
{ 0x15, 14, 0, 0x00000016 },
{ 0x28, 0, 0, 0x00000038 },
{ 0x15, 12, 13, 0x00000016 },
...
};
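
Each entry is { code, jt, jf, k }: an opcode, the jump offsets for true and false, and a constant. A minimal sketch of how such an array can be interpreted is shown here. It handles only the four opcodes needed by a small ICMP-only program (`icmp_only`, written for this example and not the program on the slide), and `cbpf_run` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_* opcode constants */

/* Illustrative program: keep IPv4 ICMP frames, drop everything else. */
static const struct sock_filter icmp_only[] = {
    { 0x28, 0, 0, 0x0000000c },   /* ldh [12]                    */
    { 0x15, 0, 3, 0x00000800 },   /* jeq #0x800, else skip 3     */
    { 0x30, 0, 0, 0x00000017 },   /* ldb [23]                    */
    { 0x15, 0, 1, 0x00000001 },   /* jeq #1, else skip 1         */
    { 0x06, 0, 0, 0xffffffff },   /* ret #-1                     */
    { 0x06, 0, 0, 0x00000000 },   /* drop: ret #0                */
};

/* Toy interpreter for a small subset of classic BPF. */
uint32_t cbpf_run(const struct sock_filter *prog, size_t n,
                  const uint8_t *pkt, size_t len)
{
    uint32_t a = 0;               /* accumulator register A */
    size_t pc = 0;

    assert(prog != NULL && pkt != NULL);
    while (pc < n) {
        const struct sock_filter *in = &prog[pc++];
        size_t k = in->k;

        switch (in->code) {
        case BPF_LD | BPF_H | BPF_ABS:      /* 0x28: load half-word */
            if (k >= len || len - k < 2)
                return 0;
            a = ((uint32_t)pkt[k] << 8) | pkt[k + 1];
            break;
        case BPF_LD | BPF_B | BPF_ABS:      /* 0x30: load byte */
            if (k >= len)
                return 0;
            a = pkt[k];
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:     /* 0x15: offsets are relative to the next insn */
            pc += (a == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:               /* 0x06: return constant */
            return in->k;
        default:
            return 0;                       /* unsupported opcode in this sketch */
        }
    }
    return 0;
}
```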

# find arch -name 'bpf_jit*'
arch/sparc/net/bpf_jit_asm_64.S
...
arch/arm/net/bpf_jit_32.c
arch/arm/net/bpf_jit_32.h
arch/arm64/net/bpf_jit_comp.c
arch/arm64/net/bpf_jit.h
arch/powerpc/net/bpf_jit_asm64.S
arch/powerpc/net/bpf_jit_asm.S
...
arch/s390/net/bpf_jit_comp.c
arch/s390/net/bpf_jit.S
...
arch/mips/net/bpf_jit_asm.S
...
arch/x86/net/bpf_jit_comp.c
arch/x86/net/bpf_jit.S

Extended BPF
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8

From BPF to eBPF
●
Two 32-bit registers (A, X) → ten 64-bit registers (R0–R9) plus a read-only frame pointer (R10)
●
New instructions
BPF_MOV, BPF_JNE, BPF_CALL, …
●
Helper functions
●
eBPF verifier: kernel/bpf/verifier.c
Loading programs from user space
●
eBPF map

BPF Calling Convention
●
R0
Return value from in-kernel function, and exit value for eBPF program
●
R1 – R5
Arguments from eBPF program to in-kernel function
●
R6 – R9
Callee saved registers that in-kernel function will preserve
●
R10
Read-only frame pointer to access stack
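
As a rough illustration of the convention, the instruction array below calls a helper, keeps the result in a callee-saved register, and sets the exit value in R0. It uses `struct bpf_insn` and the constants from `<linux/bpf.h>`. The program itself and the `count_calls` utility are made up for this example, and nothing here is loaded into the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/bpf.h>   /* struct bpf_insn, BPF_REG_*, BPF_FUNC_* */

/* r0 = bpf_ktime_get_ns(); r6 = r0; r0 = 0; exit */
static const struct bpf_insn call_helper[] = {
    /* call bpf_ktime_get_ns: it takes no arguments, so R1-R5 are not set up */
    { .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_ktime_get_ns },
    /* r6 = r0: the return value arrives in R0; R6 survives later calls */
    { .code = BPF_ALU64 | BPF_MOV | BPF_X,
      .dst_reg = BPF_REG_6, .src_reg = BPF_REG_0 },
    /* r0 = 0: exit value of the eBPF program */
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    /* exit */
    { .code = BPF_JMP | BPF_EXIT },
};

/* Illustrative utility: count helper calls in a program. */
size_t count_calls(const struct bpf_insn *prog, size_t n)
{
    size_t calls = 0;

    assert(prog != NULL);
    for (size_t i = 0; i < n; i++)
        if (prog[i].code == (BPF_JMP | BPF_CALL))
            calls++;
    return calls;
}
```

A helper that takes arguments would need R1 to R5 loaded before the call instruction, and those registers must be treated as clobbered afterwards.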

x86_64 Register Mapping
R0 → rax
R1 → rdi
R2 → rsi
R3 → rdx
R4 → rcx
R5 → r8
R6 → rbx
R7 → r13
R8 → r14
R9 → r15
R10 → rbp

BPF Helper Functions
●
/usr/include/linux/bpf.h
– bpf_probe_read
– bpf_ktime_get_ns
– bpf_trace_printk
– bpf_get_smp_processor_id
– bpf_perf_event_output
– ...

eBPF Verifier
●
Instruction limit: 4096 per program
●
Two-Step Verification
– Directed acyclic graph check
– Execution Simulation

Directed Acyclic Graph Check
●
Back-edge detection
●
Unreachable instructions
PERMISSION DENIED
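
The two checks can be sketched with a depth-first walk over the control-flow graph. This is a simplified model, not the code in kernel/bpf/verifier.c: the instruction kinds, `struct cfg_insn`, and `check_cfg` are invented for the example, and the walk is recursive for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified instruction kinds (hypothetical, not real eBPF opcodes). */
enum cfg_kind { K_NEXT, K_JUMP, K_COND, K_EXIT };
struct cfg_insn {
    enum cfg_kind kind;
    int off;                 /* jump offset, relative to the next instruction */
};

enum { UNSEEN, ON_PATH, DONE };
#define CFG_MAX_INSNS 4096

/* Returns 0 if ok, -1 on a back-edge or an edge that leaves the program. */
static int cfg_visit(const struct cfg_insn *p, int n, int i, unsigned char *state)
{
    if (i < 0 || i >= n)
        return -1;           /* falls or jumps outside the program */
    if (state[i] == ON_PATH)
        return -1;           /* back-edge: the program contains a loop */
    if (state[i] == DONE)
        return 0;            /* already fully explored */

    state[i] = ON_PATH;
    if (p[i].kind != K_EXIT) {
        /* fall-through edge (everything except an unconditional jump) */
        if (p[i].kind != K_JUMP && cfg_visit(p, n, i + 1, state))
            return -1;
        /* jump edge (conditional and unconditional jumps) */
        if (p[i].kind != K_NEXT && cfg_visit(p, n, i + 1 + p[i].off, state))
            return -1;
    }
    state[i] = DONE;
    return 0;
}

/* Returns 0 if the program is a DAG with no unreachable instructions. */
int check_cfg(const struct cfg_insn *p, int n)
{
    unsigned char state[CFG_MAX_INSNS] = {0};

    assert(p != NULL && n > 0 && n <= CFG_MAX_INSNS);
    if (cfg_visit(p, n, 0, state))
        return -1;
    for (int i = 0; i < n; i++)
        if (state[i] != DONE)
            return -1;       /* instruction never reached from the entry */
    return 0;
}
```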

Execution Simulation
●
Reading an uninitialized register
●
Arithmetic on two valid pointers
●
Load or store registers of invalid types
●
Reading the stack before writing data to it
PERMISSION DENIED
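
A toy version of the simulation step is sketched below: it tracks one type per register and rejects the first two cases in the list with -EACCES. It handles straight-line code only, and the opcodes, `struct sim_insn`, and `simulate` are invented for the example. The real verifier tracks much more state and follows every branch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sim_type { NOT_INIT, SCALAR, POINTER };
enum sim_op { OP_MOV_IMM, OP_MOV_REG, OP_ADD_REG, OP_EXIT };  /* hypothetical */
struct sim_insn { enum sim_op op; int dst, src; };

/* Returns 0 if the program is accepted, a negative errno otherwise. */
int simulate(const struct sim_insn *p, size_t n)
{
    enum sim_type r[11] = { NOT_INIT };   /* R0..R10, all uninitialized */

    assert(p != NULL);
    r[1] = POINTER;      /* R1 holds the context pointer on entry */
    r[10] = POINTER;     /* R10 is the read-only frame pointer */

    for (size_t i = 0; i < n; i++) {
        int d = p[i].dst, s = p[i].src;

        if (d < 0 || d > 10 || s < 0 || s > 10)
            return -EINVAL;
        if (p[i].op != OP_EXIT && d == 10)
            return -EACCES;               /* R10 must not be written */

        switch (p[i].op) {
        case OP_MOV_IMM:
            r[d] = SCALAR;
            break;
        case OP_MOV_REG:
            if (r[s] == NOT_INIT)
                return -EACCES;           /* reading an uninitialized register */
            r[d] = r[s];
            break;
        case OP_ADD_REG:
            if (r[d] == NOT_INIT || r[s] == NOT_INIT)
                return -EACCES;
            if (r[d] == POINTER && r[s] == POINTER)
                return -EACCES;           /* arithmetic on two pointers */
            if (r[s] == POINTER)
                r[d] = POINTER;           /* scalar + pointer gives a pointer */
            break;
        case OP_EXIT:
            /* R0 carries the exit value, so it must be initialized */
            return r[0] == NOT_INIT ? -EACCES : 0;
        }
    }
    return -EACCES;      /* ran off the end without an exit */
}
```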

eBPF Map Types
●
Hash
●
Array
●
Tail Call Array
●
Per-CPU Hash/Array
●
Stack Trace
●
cgroup Array
●
LRU (per-CPU) Hash
●
Longest-Prefix Matching Trie
●
Array/Hash of Maps
●
Net device Map
●
Socket Map
https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#tables-aka-maps

eBPF Map Syscalls
●
BPF_MAP_CREATE
●
BPF_MAP_LOOKUP_ELEM
●
BPF_MAP_UPDATE_ELEM
●
BPF_MAP_DELETE_ELEM
●
BPF_MAP_GET_NEXT_KEY
●
BPF_MAP_GET_NEXT_ID
●
BPF_MAP_GET_FD_BY_ID
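
These are commands of the single bpf(2) system call rather than separate syscalls. A minimal sketch of creating a hash map and accessing it from user space follows. The wrapper names (`sys_bpf`, `map_create_hash`, `map_update`, `map_lookup`) are made up for the example, and the calls may fail with EPERM when the process lacks the required privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>      /* union bpf_attr, BPF_MAP_* commands */

/* glibc has no bpf() wrapper, so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Returns a map file descriptor, or -1 with errno set. */
int map_create_hash(uint32_t key_size, uint32_t value_size, uint32_t max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;
    return (int)sys_bpf(BPF_MAP_CREATE, &attr);
}

/* Insert or overwrite one element. Returns 0 on success. */
int map_update(int fd, const void *key, const void *value)
{
    union bpf_attr attr;

    assert(key != NULL && value != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.value = (uint64_t)(uintptr_t)value;
    attr.flags = BPF_ANY;
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
}

/* Copy the value stored under key into *value. Returns 0 on success. */
int map_lookup(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    assert(key != NULL && value != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.value = (uint64_t)(uintptr_t)value;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
}
```

An eBPF program in the kernel reaches the same map through helper functions, which is how the two sides exchange data.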

[Diagram: a user program in userspace loads BPF bytecode into the kernel with BPF_PROG_LOAD and accesses maps with BPF_MAP_* commands; the BPF bytecode running in the kernel accesses the same map.]

[Diagram: the eBPF kernel program should be as simple as possible, while the user program in userspace can do whatever you want; the two share data through an eBPF map.]

clang >= 3.7 with bpf target
$ clang -O2 -target bpf -c source.c -o code.o

eBPF Projects
●
Networking
tc, socket, XDP, cilium, ...
●
System Tracing and Monitoring
kprobe/uprobe/tracepoint/perf event/usdt
●
Security
LandLock LSM, seccomp
●
System Error Handler Testing
eBPF directed error injection

RX Packet Processing
[Diagram: a packet travels from the NIC through the driver and the network stack in the kernel to the network program in userspace.]

netfilter
[Diagram: NIC → driver → network stack → network program; netfilter drops the packet (DROP) inside the network stack.]

Traffic Control
[Diagram: NIC → driver → TC ingress → network stack → network program; TC ingress drops the packet (DROP) before it reaches the network stack.]

eXpress Data Path
[Diagram: NIC → driver → network stack → network program; the eBPF program runs in the driver before the skb is allocated and can drop the packet (DROP) or send it back out (TX).]

XDP
●
A high performance, programmable network data path
Attaching eBPF programs through netlink (IFLA_XDP)
●
No specialized hardware
●
No kernel bypass
●
Works with the existing network stack
●
Direct packet write

Generic XDP
[Diagram: NIC → driver → generic XDP → tc ingress → netfilter ingress → network stack → network program.]

virtnet_poll [virtio_net]() {
  receive_buf [virtio_net]() {
    receive_mergeable [virtio_net]() {
      bpf_prog_run_xdp();  ---- Native XDP
      page_to_skb [virtio_net]() {
        __napi_alloc_skb() {
          __build_skb();
        }
        skb_put();
      }
    }
    skb_gro_reset_offset();
    tcp4_gro_receive() {
      tcp_gro_receive();
    }
    netif_receive_skb_internal() {
      netif_receive_generic_xdp();  ---- generic XDP
      __netif_receive_skb() {
        __netif_receive_skb_core() {
          sch_handle_ingress();  ---- TC ingress
          nf_ingress();  ---- Netfilter Ingress
          ip_rcv() {
            nf_hook_slow() {  ---- Netfilter RAW Pre-routing
              ipv4_conntrack_defrag [nf_defrag_ipv4]();
              ipv4_conntrack_in [nf_conntrack_ipv4]() {
                nf_conntrack_in [nf_conntrack]() {
                  ipv4_get_l4proto [nf_conntrack_ipv4]();
                  __nf_ct_l4proto_find [nf_conntrack]();
                  tcp_error [nf_conntrack]() {
                    nf_ip_checksum();
                  }
                  nf_ct_get_tuple [nf_conntrack]() {
                    ipv4_pkt_to_tuple [nf_conntrack_ipv4]();
                    tcp_pkt_to_tuple [nf_conntrack]();
                  }
                  hash_conntrack_raw [nf_conntrack]();
                  __nf_conntrack_find_get [nf_conntrack]();
                  tcp_get_timeouts [nf_conntrack]();
                  tcp_packet [nf_conntrack]() {
                    tcp_in_window [nf_conntrack]() {
                      nf_ct_seq_offset [nf_conntrack]();
                      tcp_options.isra.11 [nf_conntrack]();
                    }
                    __nf_ct_refresh_acct [nf_conntrack]();
                  }
                }
              }
            }
            ip_rcv_finish() {
              tcp_v4_early_demux();
              ip_route_input_noref();
              ip_local_deliver() {  ---- routing decisions
                nf_hook_slow() {  ---- Netfilter filter Input
                  ipt_do_table [ip_tables]();
                  ipv4_helper [nf_conntrack_ipv4]();
                  ipv4_confirm [nf_conntrack_ipv4]();
                }
                ip_local_deliver_finish() {
                  raw_local_deliver();
                  tcp_v4_rcv() {  ---- L4 Protocol Handler
                    tcp_filter() {
                      security_sock_rcv_skb();
                    }
                    tcp_prequeue();
                    tcp_v4_do_rcv() {
                      tcp_rcv_state_process() {
                        tcp_parse_options();
                        tcp_ack() {
                          ...

#define KBUILD_MODNAME "foo" /* for some headers */
#include <uapi/linux/bpf.h>
...
SEC("xdp_prog")
int xdp_prog(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    ...
    return XDP_DROP; /* the action */
}
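
The elided body typically has to check the packet bounds before it reads any header field, since the verifier rejects unchecked packet access. The sketch below shows one possible decision function, written so it also compiles and runs in user space. The name `xdp_decide` and the policy (drop IPv4, pass the rest) are assumptions for illustration and are not taken from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>        /* htons */
#include <linux/bpf.h>        /* enum xdp_action */
#include <linux/if_ether.h>   /* struct ethhdr, ETH_P_IP */

/* Hypothetical policy: drop IPv4 frames, pass everything else. */
int xdp_decide(void *data, void *data_end)
{
    struct ethhdr *eth = data;

    /* user-space sanity check only; an eBPF program would not use assert */
    assert(data != NULL && data_end != NULL);

    /* Bounds check: the whole Ethernet header must lie inside the packet */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* h_proto is in network byte order */
    if (eth->h_proto == htons(ETH_P_IP))
        return XDP_DROP;

    return XDP_PASS;
}
```

Inside an XDP program, `data` and `data_end` would come from `ctx->data` and `ctx->data_end` as in the slide.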

XDP Actions
●
XDP_ABORTED
Indicates an eBPF program error (treated as XDP_DROP)
●
XDP_DROP
Drop the packet
●
XDP_PASS
Pass the packet up to the stack
●
XDP_TX
Transmit the packet out through the same NIC
●
XDP_REDIRECT (4.14)
Redirect the packet to another NIC or CPU

XDP Restrictions
●
Memory model change in driver
– One packet per memory page (memory waste)
– ixgbe and i40e use page reference counting instead of one packet per page
●
No per-RX-queue XDP instance yet
●
XDP_REDIRECT only supported by limited drivers
●
eBPF program limitations

Current Status
●
XDP Core: 4.8
●
Supported Drivers
– mlx4: 4.8
– mlx5: 4.9
– nfp, qed, virtio_net: 4.10
– ixgbe, generic_xdp, thunderx: 4.12
– i40e: 4.13
– veth, tap: 4.14

XDP Benchmarks (mlx4)
●
Generated using pktgen
●
Single core
– ip routing drop: ~3.6 Mpps
– tc clsact using bpf: ~4.2 Mpps
– XDP drop: 20 Mpps (< 10% cpu util)
https://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016
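
One way to read these rates is as the average interval between packets: 20 Mpps leaves 50 ns per packet, while 3.6 Mpps corresponds to roughly 278 ns. The helper below only does that arithmetic; `ns_per_packet` is a name made up for this example.

```c
#include <assert.h>

/* Average time between packets, in nanoseconds, at a rate given in Mpps. */
double ns_per_packet(double mpps)
{
    assert(mpps > 0.0);
    /* 1e9 ns per second divided by (mpps * 1e6) packets per second */
    return 1000.0 / mpps;
}
```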

XDP Benchmarks (virtio-net)
●
Generated using pktgen
●
Host: i7-4790 CPU @ 3.6 GHz
●
Single core qemu guest
– iptables drop (raw preroute): ~3.0 Mpps
– tc clsact using bpf: ~3.0 Mpps
– generic XDP drop: ~3.5 Mpps
– native XDP drop: ~4.0 Mpps

XDP Use Cases
●
DDoS attack mitigation
●
Load Balancing
●
Tunnelling: packet header handling
●
Network sampling and monitoring
●
And more

References
●
BPF and XDP Reference Guide
http://cilium.readthedocs.io/en/stable/bpf/
●
Dive into BPF: a list of reading material
https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/
●
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
Documentation/networking/filter.txt

License
This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any
derivative work is distributed under the same license.
Details can be found at https://creativecommons.org/licenses/by-sa/4.0/
General Disclaimer
This document is not to be construed as a promise by any participating organisation to develop, deliver, or
market a product. It is not a commitment to deliver any material, code, or functionality, and should not be
relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect
to the contents of this document, and specifically disclaims any express or implied warranties of
merchantability or fitness for any particular purpose. The development, release, and timing of features or
functionality described for openSUSE products remains at the sole discretion of openSUSE. Further,
openSUSE reserves the right to revise this document and to make changes to its content, at any time,
without obligation to notify any person or entity of such revisions or changes. All openSUSE marks
referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States
and other countries. All third-party trademarks are the property of their respective owners.
Credits
Template: Richard Brown, rbrown@opensuse.org
Design & Inspiration: openSUSE Design Team, http://opensuse.github.io/branding-guidelines/
