If you look under the hood, Linux containers are just processes with some isolation features and resource quotas sprinkled on top. In this talk, we will apply modern Linux performance tools to container analysis: get high-level resource utilization on running containers with docker stats, htop, and nsenter; dig into high-CPU issues with perf; detect slow filesystem latency with BPF-based tools; and generate flame graphs of interesting event call stacks.
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP and Regional Director, Pluralsight and O'Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing -- across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
You can find more details on the meetup page - https://www.meetup.com/Tel-Aviv-Yafo-Linux-Kernel-Meetup/events/245319189/
2. The Plan
• This talk explains how to analyze Linux container performance and which challenges you’ll encounter along the way
• You’ll learn:
• How to retrieve basic resource utilization per container
• Which deployment strategies to use with container performance tools
• How to dig into high-CPU issues using perf
• How to make symbol resolution work across namespaces
• How to use BPF-based tools with containers
4. Containers Under The Hood in 30 Seconds
• Control groups restrict usage (place quotas):
• cpu,cpuacct: used to cap CPU usage and apply CPU shares (docker run --cpus, --cpu-shares)
• memory: used to cap user and kernel memory usage (docker run --memory, --kernel-memory)
• blkio: used to cap IOPS and throughput per block device, and to assign weights (shares)
• Namespaces restrict visibility:
• PID: container gets its own PIDs
• mnt: container gets its own /
• net: container gets its own network interfaces
• user: container gets its own user and group ids
• etc.
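A quick way to see namespaces in action from a shell: every process links to its namespaces under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when those links resolve to the same identity. A host-side sketch on a stock Linux box, no container required:

```shell
# Each entry below is one namespace this shell belongs to (pid, mnt,
# net, user, ...); a containerized process would show different inodes.
ls /proc/self/ns/
# The link target encodes the namespace identity, e.g. pid:[4026531836]
readlink /proc/self/ns/pid
```

Comparing `readlink` output for two PIDs tells you whether they run in the same container.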
5. USE Checklist for Linux Systems
http://www.brendangregg.com/USEmethod/use-linux.html
[Diagram: hardware components (CPU cores, LLC, RAM, memory controller, GFX, I/O controller, SSD, E1000 NIC, FSB, PCIe) annotated with Utilization/Saturation/Errors tools, including perf, mpstat, vmstat 1, sar -n DEV 1, ifconfig, free -m, sar -B, dmesg, iostat 1, and iostat -xz 1]
6. Retrieving Container Resource Utilization
• High-level, Docker-specific: docker stats
• htop with cgroup column (highly unwieldy)
• systemd-cgtop
• Execute a command in the container:
• nsenter -t $PID -p -m top
• docker exec -it $CID top
• Control group metrics
• E.g. /sys/fs/cgroup/cpu,cpuacct/*/*/*
• Third-party monitoring solutions: cAdvisor, Intel Snap, PCP, collectd
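As a sketch of the control-group metrics route: with cgroup v1 and the default Docker driver (an assumption about your setup), cpuacct.usage under /sys/fs/cgroup/cpu,cpuacct/docker/&lt;CID&gt;/ holds the cgroup's cumulative CPU time in nanoseconds, so a one-liner converts it to seconds. A sample value stands in for the file here so the snippet is self-contained:

```shell
# Real source (cgroup v1, docker driver):
#   /sys/fs/cgroup/cpu,cpuacct/docker/<CID>/cpuacct.usage
# The file contains one number: cumulative CPU time in nanoseconds.
echo 2500000000 | awk '{ printf "CPU time: %.2f s\n", $1 / 1e9 }'
# → CPU time: 2.50 s
```

Sampling the counter twice and dividing the delta by the interval gives a per-container CPU utilization figure.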
9. Tool Deployment: On The Host
[Diagram: perf_events sits in the kernel; node:6, openjdk:8, and ubuntu:xenial containers run in user space; perf is invoked on the host]
# perf record -G … -g -F 97
10. Tool Deployment: In Container
[Diagram: same layout, but perf is invoked inside the ubuntu:xenial container ☢]
# perf record -G … -g -F 97
11. Problems
• On the host:
• Need privileged access to the host (most container orchestrators try to
abstract this away from you)
• Tools need to understand container pid namespace, mount namespace, and
other details (discussed later)
• In the container:
• Need to run the container with extra privileges (e.g. for perf: enable
perf_event_open syscall, perf_event_paranoid sysctl)
• Bloats container images with performance tools that are not always used, and
increases attack surface
• Performance tools may be throttled by quotas placed on the container
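Before choosing a deployment strategy, it helps to check how much unprivileged profiling the host kernel allows; the perf_event_paranoid sysctl controls this. The docker flags in the comment are standard options shown purely as an illustration:

```shell
# Lower values allow more unprivileged use of perf_event_open;
# 2 (a common default) restricts unprivileged kernel profiling.
cat /proc/sys/kernel/perf_event_paranoid
# Granting one container profiling rights (illustrative; CAP_SYS_ADMIN
# is broad, so on 5.8+ kernels prefer --cap-add PERFMON):
#   docker run --cap-add SYS_ADMIN ubuntu:xenial ...
```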
15. Recording CPU Stacks With perf
• To find a CPU bottleneck, record stacks at timed intervals:
# system-wide
perf record -ag -F 97
# specific process
perf record -p 188 -g -F 97
# specific cgroup
perf record -G docker-1ae… -g -F 97
Legend
-a all CPUs
-p specific process
-G specific cgroup
-g capture call stacks
-F frequency of samples (Hz)
-c # of events in each sample
16. Symbol Resolution Challenges
• Address-to-module and symbol resolution, as well as dynamic instrumentation, require access to debug information
• Because of the mount namespace, the container’s binaries and debuginfo are not visible to the host (/lib64/libc.so.6 – what?)
• Need to enter the container’s namespace or share the binaries
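One workaround sketch from the host side: /proc/&lt;pid&gt;/root exposes a process's mount-namespace root, so the container's binaries can be reached without entering the namespace. PID 188 below is illustrative; --symfs is perf's flag for redirecting symbol lookup to an alternate root:

```shell
# For the current (non-containerized) shell the root link is just /;
# for a containerized PID it leads into the container's filesystem.
readlink /proc/self/root
# Illustrative: resolve a containerized process's symbols from the host:
#   perf report --symfs=/proc/188/root
```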
17. Perf Maps: Containerized Java/Node/.NET…
• For interpreted or JIT-compiled languages, map files need to be
generated at runtime
$ cat /tmp/perf-1882.map
7f2cd1108880 1e8 Ljava/lang/System;::arraycopy
7f2cd1108c00 200 Ljava/lang/String;::hashCode
7f2cd1109120 2e0 Ljava/lang/String;::indexOf
• When profiling from the host:
• PID namespace – the runtime names the file after its in-container PID (always /tmp/perf-1.map inside the container, which is not the host PID)
• Mount namespace – the container’s /tmp/perf-1.map is not visible to the host
• perf handles this automatically in Linux 4.13+, as do the BCC tools (discussed next)
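A host-side sketch of bridging both gaps at once (NSpid requires Linux 4.1+; PID 1882 is illustrative):

```shell
# NSpid lists the process's PID in each nested PID namespace; the
# rightmost value is what a containerized JIT uses to name its map file.
grep NSpid /proc/self/status
# The map itself is reachable through the mount-namespace root link:
#   cat /proc/1882/root/tmp/perf-1.map   # 1882 = host PID (illustrative)
```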
19. BPF Tracing
[Diagram: BPF tracing data flow]
• A user-space control program installs a BPF program and attaches it to events: kprobes and tracepoints in the kernel, uprobes and USDT probes in the application
• Events invoke the BPF program in the in-kernel BPF runtime
• The BPF program updates a map, or pushes a new event to a perf_events buffer shared with user-space
• The user-space program is invoked with data from the shared buffer, or reads statistics from the map and clears it if necessary
21. Summary
• We have seen:
• How to retrieve basic resource utilization per container
• Which deployment strategies to use with container performance tools
• How to dig into high-CPU issues using perf
• How to make symbol resolution work across namespaces
• How to use BPF-based tools with containers