6. Off-CPU wakeup path analysis
● Requirements
○ Function argument (target: who is being woken up)
○ Stack trace (cause: who the caller is)
○ Context information (process ID, CPU ID, timing, etc.)
● Benefits
○ Identify performance bottlenecks
○ Account for all running applications system-wide
○ Operate in real time
http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
7. Program Specification
int try_to_wake_up(struct task_struct *p,
unsigned int state, int wake_flags)
1. Function argument
(target, who I'm waking up)
2. Context information
(source process ID, CPU ID, timing, etc.)
3. Stack trace
(cause, why am I called)
8. Naive jprobe
● Based on kprobe, but more accessible
● Attachable at any kernel function
● Keeps the original function arguments
● Calls Linux functions:
○ save_stack_trace
○ trace_printk
http://www.cs.dartmouth.edu/~reeves/kprobes-2016.pdf
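A naive jprobe handler along these lines might look as follows. This is a kernel-module sketch only, under the assumption of a kernel that still ships jprobes (they were removed in Linux 4.15); the 16-frame cap and message format are choices made here, not the talk's:

```c
/* Sketch: jprobe on try_to_wake_up. A jprobe handler receives the
 * same arguments as the probed function. */
#include <linux/kprobes.h>
#include <linux/stacktrace.h>
#include <linux/sched.h>

static int jtry_to_wake_up(struct task_struct *p,
                           unsigned int state, int wake_flags)
{
    unsigned long entries[16];
    struct stack_trace trace = {
        .max_entries = 16,
        .entries     = entries,
    };

    save_stack_trace(&trace);            /* cause: who called us */
    trace_printk("wakeup target pid=%d from pid=%d cpu=%d\n",
                 p->pid, current->pid, smp_processor_id());

    jprobe_return();                     /* mandatory for jprobe handlers */
    return 0;
}

static struct jprobe wakeup_jprobe = {
    .entry = (kprobe_opcode_t *)jtry_to_wake_up,
    .kp    = { .symbol_name = "try_to_wake_up" },
};
/* register_jprobe(&wakeup_jprobe) in module init,
 * unregister_jprobe(&wakeup_jprobe) in module exit. */
```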
24. eBPF
● Extended Berkeley Packet Filter
● Subset of C compiled to virtual-machine bytecode via LLVM
● Verifiably safe; no loops allowed
● Extended from two registers to ten general-purpose registers (plus a frame pointer)
● Originally used in network filters
● Easy loading and compilation with bcc
(https://github.com/iovisor/bcc)
25. No more tree walker
McCanne, S., & Jacobson, V. (1993). The BSD Packet Filter: A New Architecture for User-level Packet Capture. In USENIX Winter (Vol. 46).
26. bpf source code
● Attaching to kprobe
SEC("kprobe/try_to_wake_up")
int bpf_prog1(struct pt_regs *ctx)
● Reading a pointer value
bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8));
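Putting the two fragments together, a minimal probe might look like this. This is a sketch in the style of the kernel's samples/bpf of that era; the helper usage is as it existed then, but the exact headers and message format are assumptions:

```c
/* Sketch: eBPF kprobe program on try_to_wake_up (samples/bpf style). */
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include "bpf_helpers.h"

SEC("kprobe/try_to_wake_up")
int bpf_prog1(struct pt_regs *ctx)
{
    /* First argument of try_to_wake_up: the task being woken. */
    struct task_struct *p = (struct task_struct *)PT_REGS_PARM1(ctx);
    int target_pid = 0;

    /* Kernel pointers cannot be dereferenced directly; the verifier
     * requires pointer chasing to go through bpf_probe_read. */
    bpf_probe_read(&target_pid, sizeof(target_pid), &p->pid);

    char fmt[] = "wakeup target pid=%d\n";
    bpf_trace_printk(fmt, sizeof(fmt), target_pid);
    return 0;
}

char _license[] SEC("license") = "GPL";
```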
27. Micro benchmark
● 10 million syscalls, each writing 512 bytes of zeros to /dev/null
○ Consistent measure of syscall overhead
○ Baseline ~92 nsec
● Try it yourself
○ dd if=/dev/zero of=/dev/null bs=512 count=1000k
29. Conclusion
● Optimized kprobe outperforms other implementations by 3-6x
○ with a combined overhead of slightly under 400ns per syscall
● In-kernel aggregation could further reduce overhead to 200ns
○ by deferring printing cost to analysis time
● With overhead well under a microsecond, tracing can be enabled on production systems without sampling to capture hard-to-reproduce bugs
○ lock contention
○ I/O latency
30. Future work
● Implement in-kernel aggregation with persistent tracing
● Integrate with ARM devices by applying the kprobe patchset
(https://lwn.net/Articles/676434/)
● Investigate perf integration with eBPF
(https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1f45b1d49073541947193bd7dac9e904142576aa)