eBPF Trace from Kernel to Userspace
SUSE Labs Taipei Technology Sharing Day 2016
1. eBPF Trace from Kernel to Userspace. Gary Lin, Software Engineer, SUSE Labs. Technology Sharing Day 2016
2. Tracer
3. (Screenshot: raw tracer output, a wall of kernel function names such as tick_nohz_idle_enter, ktime_get, and uprobe_mmap, followed by a register dump of rax/rbx/rcx and friends.)
4. kprobe (kernel) and uprobe (userspace):
   /sys/kernel/debug/tracing/kprobe_events
   /sys/kernel/debug/tracing/uprobe_events
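As an aside, these tracefs files can drive a probe with no BPF involved at all; a minimal sketch (run as root; the probe name my_kfree_skb is made up for the example):

    # echo 'p:my_kfree_skb kfree_skb' >> /sys/kernel/debug/tracing/kprobe_events
    # echo 1 > /sys/kernel/debug/tracing/events/kprobes/my_kfree_skb/enable
    # cat /sys/kernel/debug/tracing/trace_pipe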
5. eBPF
6. BPF?
7. Berkeley Packet Filter
8. (Diagram) BPF / No Red / BPF Program
9. The BSD Packet Filter: A New Architecture for User-level Packet Capture (December 19, 1992)
10. SCO lawsuit, August 2003
11. Old
12. Stable
13. BPF ASM
    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #1, drop
    # get a random uint32 number
    ld rand
    mod #4
    jneq #1, drop
    ret #-1
    drop: ret #0
    (The filter loads the EtherType at offset 12 and drops non-IPv4 frames, loads the IP protocol byte at offset 23 and drops non-ICMP packets, then passes roughly one in four of the rest.)
14. BPF Bytecode
    struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c },
        { 0x15, 0, 8, 0x000086dd },
        { 0x30, 0, 0, 0x00000014 },
        { 0x15, 2, 0, 0x00000084 },
        { 0x15, 1, 0, 0x00000006 },
        { 0x15, 0, 17, 0x00000011 },
        { 0x28, 0, 0, 0x00000036 },
        { 0x15, 14, 0, 0x00000016 },
        { 0x28, 0, 0, 0x00000038 },
        { 0x15, 12, 13, 0x00000016 },
        ...
    };
15. Virtual Machine (kind of)
16. BPF JIT
17. BPF JIT: BPF Bytecode → Native Machine Code
18. $ find arch/ -name bpf_jit*
    arch/sparc/net/bpf_jit_comp.c
    arch/sparc/net/bpf_jit_asm.S
    arch/sparc/net/bpf_jit.h
    arch/arm/net/bpf_jit_32.c
    arch/arm/net/bpf_jit_32.h
    arch/arm64/net/bpf_jit_comp.c
    arch/arm64/net/bpf_jit.h
    arch/powerpc/net/bpf_jit_comp.c
    arch/powerpc/net/bpf_jit_asm.S
    arch/powerpc/net/bpf_jit.h
    arch/s390/net/bpf_jit_comp.c
    arch/s390/net/bpf_jit.S
    arch/s390/net/bpf_jit.h
    arch/mips/net/bpf_jit.c
    arch/mips/net/bpf_jit_asm.S
    arch/mips/net/bpf_jit.h
    arch/x86/net/bpf_jit_comp.c
    arch/x86/net/bpf_jit.S
19. Stable and Efficient
20. eBPF
21. Extended BPF
22. (Diagram) Userspace loads an eBPF program into the kernel with BPF_PROG_LOAD; a program may hold at most 4096 instructions.
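To make the loading step concrete, here is a minimal sketch of BPF_PROG_LOAD through the raw bpf(2) syscall (there was no libc wrapper, hence the illustrative bpf_sys() helper; 64-bit build assumed, error handling trimmed). The two-instruction program just returns 0:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>
    #include <linux/version.h>

    static int bpf_sys(int cmd, union bpf_attr *attr)
    {
            return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
    }

    int main(void)
    {
            /* "mov r0, 0; exit" -- the smallest program the verifier accepts */
            struct bpf_insn prog[] = {
                    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 },
                    { .code = BPF_JMP | BPF_EXIT },
            };
            char log[4096] = "";
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.prog_type = BPF_PROG_TYPE_KPROBE;
            attr.insns = (unsigned long)prog;
            attr.insn_cnt = sizeof(prog) / sizeof(prog[0]);
            attr.license = (unsigned long)"GPL";
            attr.log_buf = (unsigned long)log;
            attr.log_size = sizeof(log);
            attr.log_level = 1;
            /* kprobe programs must declare the kernel version they target */
            attr.kern_version = LINUX_VERSION_CODE;

            int fd = bpf_sys(BPF_PROG_LOAD, &attr);
            if (fd < 0)
                    fprintf(stderr, "BPF_PROG_LOAD failed:\n%s\n", log);
            else
                    printf("program loaded, fd = %d\n", fd);
            return 0;
    }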
23-24. Agenda: Extended Registers, eBPF Verifier, eBPF Map, Probe Event
25. Registers: Classic BPF is 32-bit, Extended BPF is 64-bit
26. Register count: Classic BPF has A and X (2); Extended BPF has R0 – R9 (10) plus the read-only R10
27. For the x86_64 JIT:
    R0 → rax, R1 → rdi, R2 → rsi, R3 → rdx, R4 → rcx, R5 → r8,
    R6 → rbx, R7 → r13, R8 → r14, R9 → r15, R10 → rbp
28. BPF Calling Convention
    ● R0: return value from an in-kernel function, and exit value for the eBPF program
    ● R1 – R5: arguments from the eBPF program to an in-kernel function
    ● R6 – R9: callee-saved registers that in-kernel functions will preserve
    ● R10: read-only frame pointer for stack access
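A raw-instruction fragment showing the convention: the program calls the in-kernel helper bpf_ktime_get_ns() (it takes no arguments), its result lands in R0, and R0 is then reused as the program's exit value. This is a sketch, not a complete loader:

    struct bpf_insn insns[] = {
            /* call helper bpf_ktime_get_ns(); return value arrives in R0 */
            { .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_ktime_get_ns },
            /* overwrite R0 with the program's exit value */
            { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
            { .code = BPF_JMP | BPF_EXIT },
    };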
29. Agenda: Extended Registers, eBPF Verifier, eBPF Map, Probe Event
30. Two-Step Verification
31. Step 1: Directed Acyclic Graph Check
32-33. Rejects loops and unreachable instructions
34. Step 2: Simulate the Execution
35-36. Rejects programs that read a never-written register, do arithmetic on two valid pointers, load or store registers of invalid types, or read the stack before writing data into it
37. Agenda: Extended Registers, eBPF Verifier, eBPF Map, Probe Event
38. (Diagram) A user program and an in-kernel eBPF program share data through a map, driven by the BPF_MAP_* commands
39. eBPF Map Types
    ● BPF_MAP_TYPE_HASH
    ● BPF_MAP_TYPE_ARRAY
    ● BPF_MAP_TYPE_PROG_ARRAY
    ● BPF_MAP_TYPE_PERF_EVENT_ARRAY
40. eBPF Map Syscalls
    ● BPF_MAP_CREATE
    ● BPF_MAP_LOOKUP_ELEM
    ● BPF_MAP_UPDATE_ELEM
    ● BPF_MAP_DELETE_ELEM
    ● BPF_MAP_GET_NEXT_KEY
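A minimal sketch of the element commands against bpf(2): create a hash map, update one element, and read it back (bpf_sys() is the same illustrative wrapper as above; error handling trimmed):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    static int bpf_sys(int cmd, union bpf_attr *attr)
    {
            return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
    }

    int main(void)
    {
            union bpf_attr attr;
            long long key = 1, value = 42, out = 0;

            /* BPF_MAP_CREATE: a hash map with 64-bit keys and values */
            memset(&attr, 0, sizeof(attr));
            attr.map_type = BPF_MAP_TYPE_HASH;
            attr.key_size = sizeof(key);
            attr.value_size = sizeof(value);
            attr.max_entries = 256;
            int map_fd = bpf_sys(BPF_MAP_CREATE, &attr);

            /* BPF_MAP_UPDATE_ELEM: insert or replace key -> value */
            memset(&attr, 0, sizeof(attr));
            attr.map_fd = map_fd;
            attr.key = (unsigned long)&key;
            attr.value = (unsigned long)&value;
            attr.flags = BPF_ANY;
            bpf_sys(BPF_MAP_UPDATE_ELEM, &attr);

            /* BPF_MAP_LOOKUP_ELEM: copy the value for key into out */
            memset(&attr, 0, sizeof(attr));
            attr.map_fd = map_fd;
            attr.key = (unsigned long)&key;
            attr.value = (unsigned long)&out;
            bpf_sys(BPF_MAP_LOOKUP_ELEM, &attr);

            printf("key %lld -> value %lld\n", key, out);
            return 0;
    }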
41. Agenda: Extended Registers, eBPF Verifier, eBPF Map, Probe Event
42. New ioctl request: PERF_EVENT_IOC_SET_BPF
43. Kprobe
44. (Diagram) The user program loads the eBPF program with BPF_PROG_LOAD (getting one fd) and opens a kprobe perf event (getting another fd); the PERF_EVENT_IOC_SET_BPF ioctl on the event fd attaches the program fd to the kprobe
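A sketch of this flow in C, assuming prog_fd came back from BPF_PROG_LOAD. On kernels of this era the kprobe itself is still created through tracefs; the probe name my_kfree_skb is illustrative and error handling is trimmed:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int attach_kprobe_prog(int prog_fd)
    {
            FILE *f;
            int id;

            /* 1. register a kprobe on kfree_skb through tracefs */
            f = fopen("/sys/kernel/debug/tracing/kprobe_events", "a");
            fprintf(f, "p:my_kfree_skb kfree_skb\n");
            fclose(f);

            /* 2. read the tracepoint id of the new event */
            f = fopen("/sys/kernel/debug/tracing/events/kprobes/my_kfree_skb/id", "r");
            fscanf(f, "%d", &id);
            fclose(f);

            /* 3. open a perf event on that tracepoint
             * (one event per CPU in practice; CPU 0 shown) */
            struct perf_event_attr pattr;
            memset(&pattr, 0, sizeof(pattr));
            pattr.type = PERF_TYPE_TRACEPOINT;
            pattr.size = sizeof(pattr);
            pattr.config = id;
            pattr.sample_period = 1;
            pattr.wakeup_events = 1;
            int efd = syscall(__NR_perf_event_open, &pattr,
                              -1 /* pid */, 0 /* cpu */,
                              -1 /* group */, 0 /* flags */);

            /* 4. attach the eBPF program, then enable the event */
            ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
            ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
            return efd;
    }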
45. Registration
    perf_tp_event_init()        kernel/events/core.c
    perf_trace_init()           kernel/trace/trace_event_perf.c
    perf_trace_event_init()     kernel/trace/trace_event_perf.c
    perf_trace_event_reg()      kernel/trace/trace_event_perf.c
        ret = tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
    kprobe_register()           kernel/trace/trace_kprobe.c
    enable_trace_kprobe()       kernel/trace/trace_kprobe.c
    enable_kprobe()             kernel/kprobes.c
46. Attach
    perf_ioctl()                kernel/events/core.c
    _perf_ioctl()               kernel/events/core.c
        case PERF_EVENT_IOC_SET_BPF:
            return perf_event_set_bpf_prog(event, arg);
    perf_event_set_bpf_prog()   kernel/events/core.c
        prog = bpf_prog_get(prog_fd);
        event->tp_event->prog = prog;
47. Dispatch Event
    kprobe_dispatcher()         kernel/trace/trace_kprobe.c
    kprobe_perf_func()          kernel/trace/trace_kprobe.c
        if (prog && !trace_call_bpf(prog, regs))
            return;
    trace_call_bpf()            kernel/trace/bpf_trace.c
    BPF_PROG_RUN()              include/linux/filter.h
    __bpf_prog_run()            kernel/bpf/core.c
48. (Diagram) bpf_tracer.c loads the BPF bytecode with BPF_PROG_LOAD and manages the map with BPF_MAP_* from userspace; in the kernel, a kprobe on kfree_skb() runs the bytecode, which writes the map that userspace reads back:
    kfree_skb(struct sk_buff *skb)
    {
        if (unlikely(!skb))
            return;
        ....
    }
49. Uprobe
50. (Diagram) Same flow as the kprobe case: BPF_PROG_LOAD returns a program fd, opening a uprobe perf event returns an event fd, and PERF_EVENT_IOC_SET_BPF attaches the former to the latter
51. (Diagram) bpf_tracer.c loads the BPF bytecode; a uprobe on glibc's __libc_malloc() runs it whenever the probed process calls malloc:
    void *__libc_malloc(size_t bytes)
    {
        arena_lookup(ar_ptr);
        arena_lock(ar_ptr, bytes);
        ....
    }
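The uprobe_events file works the same way as kprobe_events, except it takes a binary path plus an offset; a sketch, where the libc path and the 0x84130 offset are illustrative and must first be resolved from the symbol table:

    # readelf -s /lib64/libc.so.6 | grep __libc_malloc
    # echo 'p:libc_malloc /lib64/libc.so.6:0x84130' >> /sys/kernel/debug/tracing/uprobe_events
    # echo 1 > /sys/kernel/debug/tracing/events/uprobes/libc_malloc/enable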
52. How to use eBPF?
53. Linux Kernel >= 4.1
54. Kernel Config
    ● CONFIG_BPF=y
    ● CONFIG_BPF_SYSCALL=y
    ● CONFIG_BPF_JIT=y
    ● CONFIG_HAVE_BPF_JIT=y
    ● CONFIG_BPF_EVENTS=y
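A quick way to verify a running kernel is to grep its config; which of the two files exists varies by distribution, so both commands below are a sketch:

    $ zcat /proc/config.gz | grep -E 'CONFIG(_HAVE)?_BPF'
    $ grep -E 'CONFIG(_HAVE)?_BPF' /boot/config-$(uname -r)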
55. BPF ASM
56. BPF ASM → Restricted C
57. LLVM >= 3.7
58. clang: -emit-llvm; llc: -march=bpf
59. C code → LLVM IR Bitcode → BPF Bytecode (clang, then llc)
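Spelled out as commands (file names illustrative):

    $ clang -O2 -emit-llvm -c bpf_tracer.c -o bpf_tracer.bc    # C -> LLVM IR bitcode
    $ llc bpf_tracer.bc -march=bpf -filetype=obj -o bpf_tracer.o    # bitcode -> BPF bytecode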
60. (Diagram) The kernel program should be as simple as possible; the user program can be whatever you want; they meet at the eBPF map
61. BPF Compiler Collection
62. obs://Base:System/bcc
63. C & Python library with a built-in BPF compiler
64. Hello World
    from bcc import BPF
    bpf_prog = """
    void kprobe__sys_clone(void *ctx) {
        bpf_trace_printk("Hello, World\n");
    }
    """
    BPF(text=bpf_prog).trace_print()
65. Access Map
    In bitehist.c:
        BPF_HISTOGRAM(dist);
        dist.increment(bpf_log2l(req->__data_len / 1024));
    In bitehist.py:
        b = BPF(src_file = "bitehist.c")
        b["dist"].print_log2_hist("kbytes")
66. Access Map (Cont'd)
    # ./bitehist.py
    Tracing... Hit Ctrl-C to end.
    ^C
    kbytes        : count    distribution
        0 -> 1    : 8        |******                                  |
        2 -> 3    : 0        |                                        |
        4 -> 7    : 51       |****************************************|
        8 -> 15   : 8        |******                                  |
        16 -> 31  : 1        |                                        |
        32 -> 63  : 3        |**                                      |
        64 -> 127 : 2        |*                                       |
67. memleak.py
    if not kernel_trace:
        print("Attaching to malloc and free in pid %d, Ctrl+C to quit." % pid)
        bpf_program.attach_uprobe(name="c", sym="malloc", fn_name="alloc_enter", pid=pid)
        bpf_program.attach_uretprobe(name="c", sym="malloc", fn_name="alloc_exit", pid=pid)
        bpf_program.attach_uprobe(name="c", sym="free", fn_name="free_enter", pid=pid)
    else:
        print("Attaching to kmalloc and kfree, Ctrl+C to quit.")
        bpf_program.attach_kprobe(event="__kmalloc", fn_name="alloc_enter")
        bpf_program.attach_kretprobe(event="__kmalloc", fn_name="alloc_exit")
        bpf_program.attach_kprobe(event="kfree", fn_name="free_enter")
68. memleak.py (alloc_enter)
    BPF_HASH(sizes, u64);
    BPF_HASH(allocs, u64, struct alloc_info_t);
    int alloc_enter(struct pt_regs *ctx, size_t size)
    {
        ...
        u64 pid = bpf_get_current_pid_tgid();
        u64 size64 = size;
        sizes.update(&pid, &size64);
        ...
    }
69. memleak.py (alloc_exit)
    BPF_HASH(sizes, u64);
    BPF_HASH(allocs, u64, struct alloc_info_t);
    int alloc_exit(struct pt_regs *ctx)
    {
        u64 address = ctx->ax;
        u64 pid = bpf_get_current_pid_tgid();
        u64 *size64 = sizes.lookup(&pid);
        struct alloc_info_t info = {0};
        if (size64 == 0)
            return 0; // missed alloc entry
        info.size = *size64;
        sizes.delete(&pid);
        info.timestamp_ns = bpf_ktime_get_ns();
        info.num_frames = grab_stack(ctx, &info) - 2;
        allocs.update(&address, &info);
        ...
    }
70. memleak.py (free)
    BPF_HASH(sizes, u64);
    BPF_HASH(allocs, u64, struct alloc_info_t);
    int free_enter(struct pt_regs *ctx, void *address)
    {
        u64 addr = (u64)address;
        struct alloc_info_t *info = allocs.lookup(&addr);
        if (info == 0)
            return 0;
        allocs.delete(&addr);
        ...
    }
71. Demo
72. Questions?
73. Thank You
74. References
    ● Documentation/networking/filter.txt
    ● http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
    ● https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
    ● https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/
    ● https://lkml.org/lkml/2013/9/30/627
    ● https://lwn.net/Articles/612878/
    ● https://lwn.net/Articles/650953/
    ● https://github.com/iovisor/bcc