QCon 2015 Broken Performance Tools

Brendan Gregg
Senior Performance Architect, Netflix
Nov 2015
CAUTION: PERFORMANCE TOOLS
• Over 60 million subscribers
• AWS EC2 Linux cloud
• FreeBSD CDN
• Awesome place to work
This Talk
• Observability, benchmarking, anti-patterns, and lessons
• Broken and misleading things that are surprising
Note: problems with current implementations are discussed, which
may be fixed/improved in the future
Observability:
System Metrics
LOAD AVERAGES
RFC 546
Load Averages (1, 5, 15 min)
• "load"
– Usually CPU demand (scheduler run queue length/latency)
– On Linux, task demand: CPU + uninterruptible disk I/O (?)
• "average"
– Exponentially damped moving sum
• "1, 5, and 15 minutes"
– Constants used in the equation
• Don't study these for longer than 10 seconds
$ uptime
22:08:07 up 9:05, 1 user, load average: 11.42, 11.87, 12.12
[Chart: load average vs. time. A single-thread load begins at t=0; the 1, 5, and 15 minute averages rise at different rates. At 1 minute, the 1 minute average is =~ 0.62.]
Load Average
"1 minute load average"
really means…
"The exponentially damped moving sum of
CPU + uninterruptible disk I/O that uses
a value of 60 seconds in its equation"
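As a minimal sketch of the damping math behind that ~0.62 figure (this is not the kernel's fixed-point code; the 5-second update interval and a single runnable thread are assumptions for illustration):

import math

def damped_load(n_runnable=1, interval_s=5, window_s=60, duration_s=60):
    # Exponentially damped moving sum: new = old*decay + n*(1 - decay)
    decay = math.exp(-interval_s / float(window_s))
    load = 0.0
    for _ in range(int(duration_s / interval_s)):
        load = load * decay + n_runnable * (1.0 - decay)
    return load

# One thread running for 60s: the continuous math gives 1 - e^-1 =~ 0.63;
# the kernel's fixed-point constants land slightly lower, =~ 0.62.
print(round(damped_load(), 2))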
TOP %CPU
top %CPU
• Who is consuming CPU?
• And by how much?
$ top - 20:15:55 up 19:12, 1 user, load average: 7.96, 8.59, 7.05
Tasks: 470 total, 1 running, 468 sleeping, 0 stopped, 1 zombie
%Cpu(s): 28.1 us, 0.4 sy, 0.0 ni, 71.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.1 st
KiB Mem: 61663100 total, 61342588 used, 320512 free, 9544 buffers
KiB Swap: 0 total, 0 used, 0 free. 3324696 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11959 apiprod 20 0 81.731g 0.053t 14476 S 935.8 92.1 13568:22 java
12595 snmp 20 0 21240 3256 1392 S 3.6 0.0 2:37.23 snmp-pass
10447 snmp 20 0 51512 6028 1432 S 2.0 0.0 2:12.12 snmpd
18463 apiprod 20 0 23932 1972 1176 R 0.7 0.0 0:00.07 top
[…]
top: Missing %CPU
• Short-lived processes can be missing entirely
– A process can be created and exit in between samples of /proc,
e.g., during software builds.
– Try atop(1), or sampling using perf(1)
• Stop clearing the screen!
– No option to turn this off. Your eyes can miss updates.
– I often use pidstat(1) on Linux instead. Scroll back for history.
top: Misinterpreting %CPU
• Different top(1)s use different calculations
- On different OSes, check the man page, and run a test!
• %CPU can mean:
– A) Sum of per-CPU percents (0-Ncpu x 100%) consumed
during the last interval
– B) Percentage of total CPU capacity (0-100%) consumed
during the last interval
– C) (A) but historically damped (like load averages)
– D) (B) but historically damped (like load averages)
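A small sketch of how the same measurement looks on the two scales (the 4-CPU count and per-process CPU seconds are made-up values, roughly mirroring the iperf example on the next slide):

NCPU = 4
INTERVAL_S = 1.0
cpu_seconds = {"iperf-a": 1.00, "iperf-b": 0.89}    # CPU time consumed per interval

for name, secs in cpu_seconds.items():
    pct_sum_of_cpus = 100.0 * secs / INTERVAL_S      # interpretation A: 0..NCPU*100%
    pct_of_capacity = pct_sum_of_cpus / NCPU          # interpretation B: 0..100%
    print(f"{name}: {pct_sum_of_cpus:.1f}% (sum of per-CPU %), "
          f"{pct_of_capacity:.1f}% (of total capacity)")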
top: %Cpu vs %CPU
• This 4 CPU system is consuming:
– 130% total CPU, via %Cpu(s)
– 190% total CPU, via %CPU
• Which one is right? Is either?
$ top - 15:52:58 up 10 days, 21:58, 2 users, load average: 0.27, 0.53, 0.41
Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 24.5 sy, 0.0 ni, 67.2 id, 0.2 wa, 0.0 hi, 6.6 si, 0.4 st
KiB Mem: 2872448 total, 2778160 used, 94288 free, 31424 buffers
KiB Swap: 4151292 total, 76 used, 4151216 free. 2411728 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12678 root 20 0 96812 1100 912 S 100.4 0.0 0:23.52 iperf
12675 root 20 0 170544 1096 904 S 88.8 0.0 0:20.83 iperf
215 root 20 0 0 0 0 S 0.3 0.0 0:27.73 jbd2/sda1-8
[…]
CPU Summary Statistics
• %Cpu row is from /proc/stat
• linux/Documentation/cpu-load.txt:
• /proc/stat is used by everything for CPU stats
In most cases the `/proc/stat' information reflects
the reality quite closely, however due to the nature
of how/when the kernel collects this data
sometimes it can not be trusted at all.
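A sketch of how such a summary is derived from /proc/stat (two samples one second apart; treating iowait as "not busy" is one of several possible choices):

import time

def read_cpu_ticks():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    ticks = list(map(int, fields[1:]))
    idle = ticks[3] + ticks[4]            # idle + iowait counted as "not busy"
    return sum(ticks), idle

total1, idle1 = read_cpu_ticks()
time.sleep(1)
total2, idle2 = read_cpu_ticks()
busy_pct = 100.0 * (1 - (idle2 - idle1) / (total2 - total1))
print(f"CPU busy over 1s: {busy_pct:.1f}%")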
%CPU
What is %CPU anyway?
• "Good" %CPU:
– Retiring instructions (provided they aren't a spin loop)
– High IPC (Instructions-Per-Cycle)
• "Bad" %CPU:
– Stall cycles waiting on resources, usually memory I/O
– Low IPC
– Buying faster processors may make little difference
• %CPU alone is ambiguous
– Would love top(1) to split %CPU into cycles retiring vs stalled
– Although, it gets worse…
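Before it gets worse, a sketch of the retiring-vs-stalled distinction using IPC (the counts are invented examples; on Linux they could come from something like perf stat -e instructions,cycles, and the 1.0 threshold is only a rough rule of thumb):

samples = {
    "retiring-heavy": {"instructions": 40_000_000_000, "cycles": 16_000_000_000},
    "stall-heavy":    {"instructions":  4_000_000_000, "cycles": 16_000_000_000},
}
for name, c in samples.items():
    ipc = c["instructions"] / c["cycles"]
    verdict = ("mostly retiring instructions" if ipc > 1.0
               else "likely stalled, e.g., on memory I/O")
    print(f"{name}: IPC = {ipc:.2f} -> {verdict}")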
A CPU Mystery…
• As load increased, CPU ms per request lowered (blue)
– up to 1.84x faster
• Was it due to:
- Cache warmth? no
- Different code? no
- Turbo boost? no
• (Same test, but
problem fixed, is
shown in red)
CPU Speed Variation
• Clock speed can vary thanks to:
– Intel Turbo Boost: by hardware, based on power, temp, etc
– Intel Speed Step: by software, controlled by the kernel
• %CPU is still ambiguous, given IPC. Need to know the
clock speed as well
• CPU counters nowadays have "reference cycles"
Out-of-order Execution
• CPUs execute uops out-of-order and in parallel across multiple functional units
• %CPU doesn't account for how many units are active
• Accounting for each cycle as "stalled" or "retiring" is a simplification
• Nowadays it's a lot of work to truly understand what CPUs are doing
https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
I/O WAIT
I/O Wait
• Suggests system is disk I/O bound, but often misleading
• Comparing I/O wait between system A and B:
- higher might be bad: slower disks, more blocking
- lower might be bad: slower processor and architecture
consumes more CPU, obscuring I/O wait
• Solaris implementation was also broken and later
hardwired to zero
• Can be very useful when understood: another idle state
$ mpstat -P ALL 1
08:06:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
08:06:44 PM all 53.45 0.00 3.77 0.00 0.00 0.39 0.13 0.00 42.26
[…]
I/O Wait Venn Diagram
"CPU" "I/O Wait""CPU"
"Idle"
CPU Waiting for disk I/O
Per CPU:
FREE MEMORY
Free Memory
• "free" is near-zero: I'm running
out of memory!
- No, it's in the file system cache,
and is still free for apps to use
• Linux free(1) explains it, but
other tools, e.g. vmstat(1), don't
• Some file systems (e.g., ZFS)
may not be shown in the
system's cached metrics at all
www.linuxatemyram.com
$ free -m
total used free shared buffers cached
Mem: 3750 1111 2639 0 147 527
-/+ buffers/cache: 436 3313
Swap: 0 0 0
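A sketch of reading the same information from /proc/meminfo directly, counting page-cache memory as available (MemAvailable, on newer kernels, is the kernel's own estimate):

meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, rest = line.split(":", 1)
        meminfo[key] = int(rest.split()[0])           # values are in kB

free_ish = meminfo["MemFree"] + meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)
print(f"MemFree:            {meminfo['MemFree']} kB")
print(f"free + buff/cache:  {free_ish} kB")
if "MemAvailable" in meminfo:                         # kernel's own estimate
    print(f"MemAvailable:       {meminfo['MemAvailable']} kB")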
VMSTAT
vmstat(1)
• Linux: first line has some summary since boot values —
confusing!
• Other implementations:
- "r" may be sampled once per second. Almost useless.
- Columns like "de" for deficit, making much less sense for non-
page scanned situations
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 0 0 1620 149 552 0 0 1 179 77 12 25 34 0 0
7 0 0 1598 149 552 0 0 0 0 205 186 46 13 0 0
8 0 0 1617 149 552 0 0 0 8 210 435 39 21 0 0
8 0 0 1589 149 552 0 0 0 0 218 219 42 17 0 0
[…]
NETSTAT -S
netstat -s
$ netstat -s
Ip:
7962754 total packets received
8 with invalid addresses
0 forwarded
0 incoming packets discarded
7962746 incoming packets delivered
8019427 requests sent out
Icmp:
382 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 125
timeout in transit: 257
3410 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 3410
IcmpMsg:
InType3: 125
InType11: 257
OutType3: 3410
Tcp:
17337 active connections openings
395515 passive connection openings
8953 failed connection attempts
240214 connection resets received
3 connections established
7198375 segments received
7504939 segments send out
62696 segments retransmited
10 bad segments received.
1072 resets sent
InCsumErrors: 5
Udp:
759925 packets received
3412 packets to unknown port received.
0 packet receive errors
784370 packets sent
UdpLite:
TcpExt:
858 invalid SYN cookies received
8951 resets received for embryonic SYN_RECV sockets
14 packets pruned from receive queue because of socket buffer overrun
6177 TCP sockets finished time wait in fast timer
293 packets rejects in established connections because of timestamp
733028 delayed acks sent
89 delayed acks further delayed because of locked socket
Quick ack mode was activated 13214 times
336520 packets directly queued to recvmsg prequeue.
43964 packets directly received from backlog
11406012 packets directly received from prequeue
1039165 packets header predicted
7066 packets header predicted and directly queued to user
1428960 acknowledgments not containing data received
1004791 predicted acknowledgments
1 times recovered from packet loss due to fast retransmit
5044 times recovered from packet loss due to SACK data
2 bad SACKs received
Detected reordering 4 times using SACK
Detected reordering 11 times using time stamp
13 congestion windows fully recovered
11 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 39
2384 congestion windows recovered after partial ack
228 timeouts after SACK recovery
100 timeouts in loss state
5018 fast retransmits
39 forward retransmits
783 retransmits in slow start
32455 other TCP timeouts
TCPLossProbes: 30233
TCPLossProbeRecovery: 19070
992 sack retransmits failed
18 times receiver scheduled too late for direct processing
705 packets collapsed in receive queue due to low socket buffer
13658 DSACKs sent for old packets
8 DSACKs sent for out of order packets
13595 DSACKs received
33 DSACKs for out of order packets received
32 connections reset due to unexpected data
108 connections reset due to early user close
1608 connections aborted due to timeout
TCPSACKDiscard: 4
TCPDSACKIgnoredOld: 1
TCPDSACKIgnoredNoUndo: 8649
TCPSpuriousRTOs: 445
TCPSackShiftFallback: 8588
TCPRcvCoalesce: 95854
TCPOFOQueue: 24741
TCPOFOMerge: 8
TCPChallengeACK: 1441
TCPSYNChallenge: 5
TCPSpuriousRtxHostQueues: 1
TCPAutoCorking: 4823
IpExt:
InOctets: 1561561375
OutOctets: 1509416943
InNoECTPkts: 8201572
InECT1Pkts: 2
InECT0Pkts: 3844
InCEPkts: 306
netstat -s
[…]
Tcp:
17337 active connections openings
395515 passive connection openings
8953 failed connection attempts
240214 connection resets received
3 connections established
7198870 segments received
7505329 segments send out
62697 segments retransmited
10 bad segments received.
1072 resets sent
InCsumErrors: 5
[…]
netstat -s
• Many metrics on Linux (can be over 200)
• Still doesn't include everything: getting better, but don't
assume everything is there
• Includes typos & inconsistencies
• Might be more readable to:
cat /proc/net/snmp /proc/net/netstat
• Totals since boot can be misleading
• On Linux, -s needs -c support
• Often no documentation outside kernel source code
• Requires expertise to comprehend
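One way around totals-since-boot and the missing -c support is to compute per-interval deltas yourself from the same source file; a sketch for the TCP retransmit rate (field names are those in /proc/net/snmp):

import time

def tcp_counters():
    with open("/proc/net/snmp") as f:
        lines = [l.split() for l in f if l.startswith("Tcp:")]
    header, values = lines[0][1:], list(map(int, lines[1][1:]))
    return dict(zip(header, values))

a = tcp_counters()
time.sleep(1)
b = tcp_counters()
out = b["OutSegs"] - a["OutSegs"]
retrans = b["RetransSegs"] - a["RetransSegs"]
rate = 100.0 * retrans / out if out else 0.0
print(f"TCP OutSegs/s: {out}  RetransSegs/s: {retrans}  retransmit rate: {rate:.2f}%")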
DISK METRICS
Disk Metrics
• All disk metrics are misleading
• Disk %utilization / %busy
– Logical devices (volume managers) can process requests in
parallel, and may accept more I/O at 100%
• Disk IOPS
– High IOPS is "bad"? That depends…
• Disk latency
– Does it matter? File systems and volume managers try hard
to hide latency and make latency asynchronous
– Better to measure latency via application->FS calls (see the sketch below)
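A sketch of measuring at that level: time the application's own reads rather than trusting disk statistics (the path is a hypothetical example, and cached reads are included on purpose, since that is what the application experiences):

import os, time

PATH = "/var/log/syslog"           # hypothetical example; any readable file
fd = os.open(PATH, os.O_RDONLY)
lat_us = []
for offset in range(0, 100 * 4096, 4096):
    t0 = time.perf_counter()
    os.pread(fd, 4096, offset)      # what the app sees: cache hits included
    lat_us.append((time.perf_counter() - t0) * 1e6)
os.close(fd)

lat_us.sort()
print(f"reads: {len(lat_us)}  p50: {lat_us[len(lat_us) // 2]:.1f} us  "
      f"max: {lat_us[-1]:.1f} us")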
Rules for Metrics Makers
• They must work
– As well as possible. Clearly document caveats.
• They must be useful
– Document a real use case (eg, my example.txt files).
If you get stuck, it's not useful – ditch it.
• Aim to be intuitive
– Document it. If it's too weird to explain, redo it.
• As few as possible
– Respect end-user's time
• Good system examples:
– iostat -x: workload columns, then resulting perf columns
– Linux sar: consistency, units on columns, logical groups
Observability:
Profilers
PROFILERS
Linux perf
• Can sample stack traces and summarize output:
# perf report -n --stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Too Much Output
… as a Flame Graph
PROFILER VISIBILITY
System Profilers with Java
[Diagram: thread time over a time axis, with regions labeled Java, JVM, GC, kernel (TCP/IP, epoll, locks), and an idle thread.]
System Profilers with Java
• e.g., Linux perf
• Visibility
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Typical problems (x86):
– Stacks missing for Java and other runtimes
– Symbols missing for Java methods
• Profile everything except Java and similar runtimes
Java Profilers
[Diagram: Java profiler view, with regions labeled Java, GC, and (kernel, libraries, JVM).]
Java Profilers
• Visibility
– Java method execution
– Object usage
– GC logs
– Custom Java context
• Typical problems:
– Sampling often happens at safety/yield points (skew)
– Method tracing has massive observer effect
– Misidentifies RUNNING as on-CPU (e.g., epoll)
– Doesn't include or profile GC or JVM CPU time
– Tree views not quick (proportional) to comprehend
• Inaccurate (skewed) and incomplete profiles
COMPILER OPTIMIZATIONS
Broken System Stack Traces
• Profiling Java on x86 using perf
• The stacks are 1 or 2 levels deep, and have junk values
# perf record -F 99 -a -g -- sleep 30
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
Why Stacks are Broken
• On x86 (x86_64), hotspot uses the frame pointer register
(RBP) as general purpose
• This "compiler optimization" breaks stack walking
• Once upon a time, x86 had fewer registers, and this
made much more sense
• gcc provides -fno-omit-frame-pointer to avoid
doing this
– JDK8u60+ now has this as -XX:+PreserveFramePointer
Missing Symbols
• Missing symbols may show up as hex; e.g., Linux perf:

not broken:
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main

broken:
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
Fixing Symbols
• For applications, install debug symbol package
• For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map
• Find a way to create this for your runtime
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
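A sketch of the map file format itself, which is simply "START SIZE SYMBOL" per line, in hex (the addresses and method names here are hypothetical; real runtimes, e.g., a JVM agent for Java, emit the actual JIT mappings):

import os

symbols = [
    (0x7f4179101c00, 0x200, "Ljava/lang/String;::hashCode"),   # hypothetical
    (0x7f4179101e00, 0x480, "Lcom/example/Foo;::bar"),          # hypothetical
]
with open(f"/tmp/perf-{os.getpid()}.map", "w") as f:
    for start, size, name in symbols:
        f.write(f"{start:x} {size:x} {name}\n")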
INSTRUCTION PROFILING
Instruction Profiling
# perf annotate -i perf.data.noplooper --stdio
Percent | Source code & Disassembly of noplooper
--------------------------------------------------------
: Disassembly of section .text:
:
: 00000000004004ed <main>:
0.00 : 4004ed: push %rbp
0.00 : 4004ee: mov %rsp,%rbp
20.86 : 4004f1: nop
0.00 : 4004f2: nop
0.00 : 4004f3: nop
0.00 : 4004f4: nop
19.84 : 4004f5: nop
0.00 : 4004f6: nop
0.00 : 4004f7: nop
0.00 : 4004f8: nop
18.73 : 4004f9: nop
0.00 : 4004fa: nop
0.00 : 4004fb: nop
0.00 : 4004fc: nop
19.08 : 4004fd: nop
0.00 : 4004fe: nop
0.00 : 4004ff: nop
0.00 : 400500: nop
21.49 : 400501: jmp 4004f1 <main+0x4>
• Often broken nowadays due to skid, out-of-order execution, and sampling the resumption instruction
• Better with PEBS support
Observability:
Overhead
TCPDUMP
tcpdump
• Packet tracing doesn't scale. Overheads:
– CPU cost of per-packet tracing (improved by [e]BPF)
• Consider CPU budget per-packet at 10/40/100 GbE
– Transfer to user-level (improved by ring buffers)
– File system storage (more CPU, and disk I/O)
– Possible additional network transfer
• Can also drop packets when overloaded
• You should only trace send/receive as a last resort
– I solve problems by tracing lower frequency TCP events
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
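A back-of-the-envelope sketch of that per-packet CPU budget (assumes full-size 1500-byte frames and a single CPU handling every packet; real NICs spread load across queues and CPUs):

FRAME_BITS = 1500 * 8                      # assume full-size Ethernet frames
for gbits in (10, 40, 100):
    pps = gbits * 1e9 / FRAME_BITS
    budget_ns = 1e9 / pps
    print(f"{gbits:>3} GbE: ~{pps / 1e6:.2f} Mpps -> ~{budget_ns:.0f} ns of CPU per packet")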
STRACE
strace
• Before:
• After:
• 442x slower. This is worst case.
• strace(1) pauses the process twice for each syscall.
This is like putting metering lights on your app.
– "BUGS: A traced process runs slowly." – strace(1) man page
– Use buffered tracing / in-kernel counters instead, e.g. DTrace
$ dd if=/dev/zero of=/dev/null bs=1 count=500k
[…]
512000 bytes (512 kB) copied, 0.103851 s, 4.9 MB/s
$ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
[…]
512000 bytes (512 kB) copied, 45.9599 s, 11.1 kB/s
DTRACE
DTrace
• Overhead often negligible, but not always
• Before:
• After:
• 384x slower. Fairly worst case: frequent pid probes.
# time wc systemlog
262600 2995200 23925200 systemlog
real 0m1.098s
user 0m1.085s
sys 0m0.012s
# time dtrace -n 'pid$target:::entry { @[probefunc] = count(); }' -c 'wc systemlog'
dtrace: description 'pid$target:::entry ' matched 3756 probes
262600 2995200 23925200 systemlog
[…]
real 7m2.896s
user 7m2.650s
sys 0m0.572s
Tracing Dangers
• Overhead potential exists for all tracers
– Overhead = event instrumentation cost X frequency of event
• Costs
– Lower: counters, in-kernel aggregations
– Higher: event dumps, stack traces, string copies, copyin/outs
• Frequencies
– Lower: process creation & destruction, disk I/O (usually), …
– Higher: instructions, functions in I/O hot path, malloc/free,
Java methods, …
• Advice
– < 10,000 events/sec, probably ok
– > 100,000 events/sec, overhead may start to be measurable
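A sketch of that arithmetic (the per-event costs are illustrative assumptions, not measurements of any particular tracer):

cost_us = {"in-kernel count": 0.5, "stack trace + copyout": 10.0}   # assumed costs
for rate in (10_000, 100_000, 1_000_000):                           # events/sec
    for what, us in cost_us.items():
        overhead_pct = rate * us / 1e6 * 100                        # % of one CPU
        print(f"{rate:>9} ev/s  {what:<22} ~{overhead_pct:>6.1f}% of one CPU")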
DTraceToolkit
• My own tools that can cause massive overhead:
– dapptrace/dappprof: can trace all native functions
– Java/j_flow.d, ...: can trace all Java methods with
+ExtendedDTraceProbes
• Useful for debugging, but should warn about overheads
# j_flow.d
C PID TIME(us) -- CLASS.METHOD
0 311403 4789112583163 -> java/lang/Object.<clinit>
0 311403 4789112583207 -> java/lang/Object.registerNatives
0 311403 4789112583323 <- java/lang/Object.registerNatives
0 311403 4789112583333 <- java/lang/Object.<clinit>
0 311403 4789112583343 -> java/lang/String.<clinit>
0 311403 4789112583732 -> java/lang/String$CaseInsensitiveComparator.<init>
0 311403 4789112583743 -> java/lang/String$CaseInsensitiveComparator.<init>
0 311403 4789112583752 -> java/lang/Object.<init>
[...]
VALGRIND
Valgrind
• A suite of tools including an extensive leak detector
• To its credit it does warn the end user
"Your program will run much slower
(eg. 20 to 30 times) than normal"
– http://valgrind.org/docs/manual/quick-start.html
JAVA PROFILERS
Java Profilers
• Some Java profilers have two modes:
– Sampling stacks: eg, at 100 Hertz
– Tracing methods: instrumenting and timing every method
• Method timing has been described as "highly accurate",
despite slowing the target by up to 1000x!
• Issues & advice already covered at QCon:
– Nitsan Wakart "Profilers are Lying Hobbitses" earlier today
– Java track tomorrow
Observability:
Monitoring
MONITORING
Monitoring
• By now you should recognize these pathologies:
– Let's just graph the system metrics!
• That's not the problem that needs solving
– Let's just trace everything and post process!
• Now you have one million problems per second
• Monitoring adds additional problems:
– Let's have a cloud-wide dashboard update per-second!
• From every instance? Packet overheads?
– Now we have billions of metrics!
Observability:
Statistics
STATISTICS
Statistics
"Then there is the man who drowned crossing a
stream with an average depth of six inches."
– W.I.E. Gates
Statistics
• Averages can be misleading
– Hide latency outliers
– Per-minute averages can hide multi-second issues
• Percentiles can be misleading
– Probability of hitting 99.9th latency may be more than 1/1000
after many dependency requests
• Show the distribution:
– Summarize: histogram, density plot, frequency trail
– Over-time: scatter plot, heat map
• See Gil Tene's "How Not to Measure Latency" QCon talk
from earlier today
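A sketch of the percentile point: assuming independent requests, the chance that at least one of N dependency calls lands beyond the 99.9th percentile grows quickly with N:

for n in (1, 10, 100, 1000):
    p_hit = 1 - 0.999 ** n
    print(f"{n:>5} dependency requests: {100 * p_hit:5.1f}% chance of at least one "
          f"99.9th-percentile outlier")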
Average Latency
• When the index of central tendency isn't…
Observability:
Visualizations
VISUALIZATIONS
Tachometers
…especially with arbitrary color highlighting
Pie Charts
…for real-time metrics
[pie chart of usr / sys / wait / idle]
Doughnuts
[doughnut chart of usr / sys / wait / idle]
…like pie charts but worse
Traffic Lights
…when used for subjective metrics
These can be used for objective metrics
RED == BAD (usually)
GREEN == GOOD (hopefully)
Benchmarking
BENCHMARKING
~100% of benchmarks are wrong
"Most popular benchmarks are flawed"
Source: Traeger, A., E. Zadok, N. Joukov, and C. Wright.
“A Nine Year Study of File System and Storage
Benchmarking,” ACM Transactions on Storage, 2008.
Not only can a popular benchmark be broken, but so can
all alternates.
REFUTING BENCHMARKS
The energy needed
to refute benchmarks
is multiple orders of magnitude
bigger than to run them
It can take 1-2 weeks of senior performance engineering
time to debug a single benchmark.
Benchmarking
• Benchmarking is a useful form of experimental analysis
– Try observational first; benchmarks can perturb
• Accurate and realistic benchmarking is vital for technical
investments that improve our industry
• However, benchmarking is error prone
COMMON MISTAKES
Common Mistakes
1. Testing the wrong target
– eg, FS cache instead of disk; misconfiguration
2. Choosing the wrong target
– eg, disk instead of FS cache … doesn’t resemble real world
3. Invalid results
– benchmark software bugs
4. Ignoring errors
– error path may be fast!
5. Ignoring variance or perturbations
– real workload isn't steady/consistent, which matters
6. Misleading results
– you benchmark A, but actually measure B, and conclude you
measured C
PRODUCT EVALUATIONS
Product Evaluations
• Benchmarking is used for product evaluations & sales
• The Benchmark Paradox:
– If your product’s chances of winning a benchmark are
50/50, you’ll usually lose
– To justify a product switch, a customer may run several
benchmarks, and expect you to win them all
– May mean winning a coin toss at least 3 times in a row
– http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
• Solving this seeming paradox (and benchmarking):
– Confirm benchmark is relevant to intended workload
– Ask: why isn't it 10x?
Active Benchmarking
• Root cause performance analysis while the
benchmark is still running
– Use observability tools
– Identify the limiter (or suspected limiter) and include it with the
benchmark results
– Answer: why not 10x?
• This takes time, but uncovers most mistakes
MICRO BENCHMARKS
Micro Benchmarks
• Test a specific function in isolation. e.g.:
– File system maximum cached read operations/sec
– Network maximum throughput
• Examples of bad microbenchmarks:
– getpid() in a tight loop
– speed of /dev/zero and /dev/null
• Common problems:
– Testing a workload that is not very relevant
– Missing other workloads that are relevant
MACRO BENCHMARKS
Macro Benchmarks
• Simulate application user load. e.g.:
– Simulated web client transaction
• Common problems:
– Misplaced trust: believed to be realistic, but misses variance,
errors, perturbations, etc.
– Complex to debug, verify, and root cause
KITCHEN SINK BENCHMARKS
Kitchen Sink Benchmarks
• Run everything!
– Mostly random benchmarks found on the Internet, where
most are broken or irrelevant
– Developers focus on collecting more benchmarks than
verifying or fixing the existing ones
• Myth that more benchmarks == greater accuracy
– No, use active benchmarking (analysis)
AUTOMATION
Automated Benchmarks
• Completely automated procedure. e.g.:
– Cloud benchmarks: spin up an instance, benchmark, destroy.
Automate.
• Little or no provision for debugging
• Automation is only part of the solution
Benchmarking:
More Examples
BONNIE++
bonnie++
• "simple tests of hard drive and file system performance"
• First metric printed by (thankfully) older versions:
per character sequential output
• What was actually tested:
– 1 byte writes to libc (via putc())
– 4 Kbyte writes from libc -> FS (depends on OS; see setbuffer())
– 128 Kbyte async writes to disk (depends on storage stack)
– Any file system throttles that may be present (eg, ZFS)
– C++ code, to some extent (bonnie++ 10% slower than Bonnie)
• Actual limiter:
– Single threaded write_block_putc() and putc() calls
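A rough Python analogy of why the "per character" number says little about the drive (this is not bonnie++ itself; it only shows that tiny buffered writes are dominated by per-call and buffering overhead rather than storage):

import tempfile, time

def write_throughput(chunk, total=1 << 20):
    with tempfile.TemporaryFile() as f:
        t0 = time.perf_counter()
        for _ in range(total // len(chunk)):
            f.write(chunk)                 # buffered writes, like putc() via libc
        f.flush()
        elapsed = time.perf_counter() - t0
    return total / elapsed / 1e6           # MB/s

print(f"1-byte writes:   {write_throughput(b'x'):.1f} MB/s")
print(f"128 KB writes:   {write_throughput(b'x' * (128 * 1024)):.1f} MB/s")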
APACHE BENCH
Apache Bench
• HTTP web server benchmark
• Single thread limited (use wrk for multi-threaded)
• Keep-alive option (-k):
– without: Can become an unrealistic TCP session benchmark
– with: Can become an unrealistic server throughput test
• Performance issues of ab's own code
UNIXBENCH
UnixBench
• The original kitchen-sink micro benchmark from 1984,
published in BYTE magazine
• Innovative & useful for the time, but that time has passed
• More problems than I can shake a stick at
• Starting with…
COMPILERS
UnixBench Makefile
• Default (by ./Run) for Linux. Would you edit it? Then what?
## Very generic
#OPTON = -O
## For Linux 486/Pentium, GCC 2.7.x and 2.8.x
#OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
#   -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2
## For Linux, GCC previous to 2.7.0
#OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math -m486
#OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
#   -m386 -malign-loops=1 -malign-jumps=1 -malign-functions=1
## For Solaris 2, or general-purpose GCC 2.7.x
OPTON = -O2 -fomit-frame-pointer -fforce-addr -ffast-math -Wall
## For Digital Unix v4.x, with DEC cc v5.x
#OPTON = -O4
#CFLAGS = -DTIME -std1 -verbose -w0
UnixBench Makefile
• "Fixing" the Makefile improved the first result, Dhrystone
2, by 64%
• Is everyone "fixing" it the same way, or not? Are they
using the same compiler version? Same OS? (No.)
UnixBench Documentation
"The results will depend not only on your
hardware, but on your operating system,
libraries, and even compiler."
"So you may want to make sure that all your
test systems are running the same version of
the OS; or at least publish the OS and
compiler versions with your results."
SYSTEM MICROBENCHMARKS
UnixBench Tests
• Results summarized as "The BYTE Index". From USAGE:
• What can go wrong? Everything.
system:
dhry2reg Dhrystone 2 using register variables
whetstone-double Double-Precision Whetstone
syscall System Call Overhead
pipe Pipe Throughput
context1 Pipe-based Context Switching
spawn Process Creation
execl Execl Throughput
fstime-w File Write 1024 bufsize 2000 maxblocks
fstime-r File Read 1024 bufsize 2000 maxblocks
fstime File Copy 1024 bufsize 2000 maxblocks
fsbuffer-w File Write 256 bufsize 500 maxblocks
fsbuffer-r File Read 256 bufsize 500 maxblocks
fsbuffer File Copy 256 bufsize 500 maxblocks
fsdisk-w File Write 4096 bufsize 8000 maxblocks
fsdisk-r File Read 4096 bufsize 8000 maxblocks
fsdisk File Copy 4096 bufsize 8000 maxblocks
shell1 Shell Scripts (1 concurrent) (runs "looper 60 multi.sh 1")
shell8 Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 8")
shell16 Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 16")
Anti-Patterns
ANTI-PATTERNS
Street Light Anti-Method
1. Pick observability tools that are:
– Familiar
– Found on the Internet
– Found at random
2. Run tools
3. Look for obvious issues
Blame Someone Else Anti-Method
1. Find a system or environment component you are not
responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Performance Tools Team
• Having a separate performance tools team, who creates
tools but doesn't use them (no production exposure)
• At Netflix:
– The performance engineering team builds tools and uses
tools for both service consulting and live production triage
• Mogul, Vector, …
– Other teams (CORE, traffic, …) also build performance tools
and use them during issues
• Good performance tools are built out of necessity
Messy House Fallacy
• Fallacy: my code is a mess, I bet yours is immaculate,
therefore the bug must be mine
• Reality: everyone's code is terrible and buggy
• When analyzing performance, don't overlook the system:
kernel, libraries, etc.
Lessons
PERFORMANCE TOOLS
Observability
• Trust nothing, verify everything
– Cross-check with other observability tools
– Write small "known" workloads, and confirm metrics match
– Find other sanity tests: e.g. check known system limits
– Determine how metrics are calculated, averaged, updated
• Find metrics to solve problems
– Instead of understanding hundreds of system metrics
– What problems do you want to observe? What metrics would
be sufficient? Find, verify, and use those. e.g., USE Method.
– The metric you want may not yet exist
• File bugs, get these fixed/improved
Observe Everything
• Use functional diagrams to pose Q's and find missing metrics:
Profile Everything
[Java mixed-mode flame graph: Java, JVM, GC, and kernel frames]
Visualize Everything
Benchmark Nothing
• Trust nothing, verify everything
• Do Active Benchmarking:
1. Configure the benchmark to run in steady state, 24x7
2. Do root-cause analysis of benchmark performance
3. Answer: why is it not 10x?
Links & References
• https://www.rfc-editor.org/rfc/rfc546.pdf
• https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
• http://www.linuxatemyram.com/
• Traeger, A., E. Zadok, N. Joukov, and C. Wright. “A Nine Year Study of File System
and Storage Benchmarking,” ACM Transactions on Storage, 2008.
• http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
• http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
• http://www.brendangregg.com/ActiveBenchmarking/bonnie++.html
• https://blogs.oracle.com/roch/entry/decoding_bonnie
• http://www.brendangregg.com/blog/2014-05-02/compilers-love-messing-with-benchmarks.html
• https://code.google.com/p/byte-unixbench/
• https://qconsf.com/sf2015/presentation/how-not-measure-latency
• https://qconsf.com/sf2015/presentation/profilers-lying
• Caution signs drawn by me, inspired by real-world signs
THANKS
Thanks
• Questions?
• http://techblog.netflix.com
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg
Nov 2015
1 of 128

Recommended

FreeBSD 2014 Flame Graphs by
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsBrendan Gregg
17.8K views66 slides
OSSNA 2017 Performance Analysis Superpowers with Linux BPF by
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
5.1K views68 slides
EuroBSDcon 2017 System Performance Analysis Methodologies by
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesBrendan Gregg
15.8K views65 slides
Blazing Performance with Flame Graphs by
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
323.6K views170 slides
Performance Analysis: new tools and concepts from the cloud by
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
5.4K views86 slides
Linux Performance 2018 (PerconaLive keynote) by
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
426.3K views19 slides

More Related Content

What's hot

ACM Applicative System Methodology 2016 by
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016Brendan Gregg
158K views57 slides
Velocity 2015 linux perf tools by
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
1.1M views142 slides
re:Invent 2019 BPF Performance Analysis at Netflix by
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
5.5K views65 slides
Systems@Scale 2021 BPF Performance Getting Started by
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
1.5K views30 slides
USENIX ATC 2017: Visualizing Performance with Flame Graphs by
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsBrendan Gregg
672.9K views66 slides
Performance Wins with BPF: Getting Started by
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
2K views24 slides

What's hot(20)

ACM Applicative System Methodology 2016 by Brendan Gregg
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg158K views
Velocity 2015 linux perf tools by Brendan Gregg
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
Brendan Gregg1.1M views
re:Invent 2019 BPF Performance Analysis at Netflix by Brendan Gregg
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
Brendan Gregg5.5K views
Systems@Scale 2021 BPF Performance Getting Started by Brendan Gregg
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
Brendan Gregg1.5K views
USENIX ATC 2017: Visualizing Performance with Flame Graphs by Brendan Gregg
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame Graphs
Brendan Gregg672.9K views
Performance Wins with BPF: Getting Started by Brendan Gregg
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
Brendan Gregg2K views
YOW2020 Linux Systems Performance by Brendan Gregg
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
Brendan Gregg1.9K views
LSFMM 2019 BPF Observability by Brendan Gregg
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
Brendan Gregg8.3K views
BPF: Tracing and more by Brendan Gregg
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
Brendan Gregg200.3K views
YOW2021 Computing Performance by Brendan Gregg
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
Brendan Gregg2K views
Open Source Systems Performance by Brendan Gregg
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
Brendan Gregg33.4K views
Container Performance Analysis by Brendan Gregg
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg448.6K views
Analyzing OS X Systems Performance with the USE Method by Brendan Gregg
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE Method
Brendan Gregg20.3K views
Linux Performance Profiling and Monitoring by Georg Schönberger
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
Georg Schönberger8.5K views
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt by Anne Nicolas
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Anne Nicolas5.6K views
Extreme Linux Performance Monitoring and Tuning by Milind Koyande
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
Milind Koyande2K views
Tuning parallelcodeonsolaris005 by dflexer
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
dflexer521 views
Linux Performance Tools by Brendan Gregg
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
Brendan Gregg232.8K views

Viewers also liked

Linux Systems Performance 2016 by
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
504.5K views72 slides
Monitorama 2015 Netflix Instance Analysis by
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisBrendan Gregg
34.8K views69 slides
Designing Tracing Tools by
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
5.7K views40 slides
SREcon 2016 Performance Checklists for SREs by
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
206.5K views79 slides
Java Performance Analysis on Linux with Flame Graphs by
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsBrendan Gregg
24.7K views71 slides
JavaOne 2015 Java Mixed-Mode Flame Graphs by
JavaOne 2015 Java Mixed-Mode Flame GraphsJavaOne 2015 Java Mixed-Mode Flame Graphs
JavaOne 2015 Java Mixed-Mode Flame GraphsBrendan Gregg
23.1K views92 slides

Viewers also liked(20)

Linux Systems Performance 2016 by Brendan Gregg
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
Brendan Gregg504.5K views
Monitorama 2015 Netflix Instance Analysis by Brendan Gregg
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance Analysis
Brendan Gregg34.8K views
Designing Tracing Tools by Brendan Gregg
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
Brendan Gregg5.7K views
SREcon 2016 Performance Checklists for SREs by Brendan Gregg
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
Brendan Gregg206.5K views
Java Performance Analysis on Linux with Flame Graphs by Brendan Gregg
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg24.7K views
JavaOne 2015 Java Mixed-Mode Flame Graphs by Brendan Gregg
JavaOne 2015 Java Mixed-Mode Flame GraphsJavaOne 2015 Java Mixed-Mode Flame Graphs
JavaOne 2015 Java Mixed-Mode Flame Graphs
Brendan Gregg23.1K views
Linux 4.x Tracing Tools: Using BPF Superpowers by Brendan Gregg
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg210.2K views
Linux 4.x Tracing: Performance Analysis with bcc/BPF by Brendan Gregg
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Brendan Gregg10.7K views
Linux Performance Analysis: New Tools and Old Secrets by Brendan Gregg
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg603.9K views
"Graphite — как построить миллион графиков". Дмитрий Куликовский, Яндекс by Yandex
"Graphite — как построить миллион графиков". Дмитрий Куликовский, Яндекс"Graphite — как построить миллион графиков". Дмитрий Куликовский, Яндекс
"Graphite — как построить миллион графиков". Дмитрий Куликовский, Яндекс
Yandex11.3K views
LISA2010 visualizations by Brendan Gregg
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizations
Brendan Gregg8.7K views
The New Systems Performance by Brendan Gregg
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
Brendan Gregg7.6K views
Linux Performance Tools 2014 by Brendan Gregg
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
Brendan Gregg116.8K views
DTrace Topics: Introduction by Brendan Gregg
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: Introduction
Brendan Gregg3.4K views
From DTrace to Linux by Brendan Gregg
From DTrace to LinuxFrom DTrace to Linux
From DTrace to Linux
Brendan Gregg23.3K views
Lisa12 methodologies by Brendan Gregg
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologies
Brendan Gregg16.3K views
Icinga Camp San Francisco 2017 - Icinga Director - Managing your configuration by Icinga
Icinga Camp San Francisco 2017 - Icinga Director - Managing your configurationIcinga Camp San Francisco 2017 - Icinga Director - Managing your configuration
Icinga Camp San Francisco 2017 - Icinga Director - Managing your configuration
Icinga3.7K views
Icinga Camp Berlin 2017 - Integrations all the way by Icinga
Icinga Camp Berlin 2017 - Integrations all the wayIcinga Camp Berlin 2017 - Integrations all the way
Icinga Camp Berlin 2017 - Integrations all the way
Icinga7.7K views

Similar to QCon 2015 Broken Performance Tools

Broken Performance Tools by
Broken Performance ToolsBroken Performance Tools
Broken Performance ToolsC4Media
763 views131 slides
In Memory Database In Action by Tanel Poder and Kerry Osborne by
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneEnkitec
1.5K views40 slides
Oracle Database In-Memory Option in Action by
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionTanel Poder
7.1K views40 slides
UKOUG, Lies, Damn Lies and I/O Statistics by
UKOUG, Lies, Damn Lies and I/O StatisticsUKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O StatisticsKyle Hailey
3.3K views77 slides
MeetBSD2014 Performance Analysis by
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisBrendan Gregg
32K views60 slides
Finding an unusual cause of max_user_connections in MySQL by
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLOlivier Doucet
572 views47 slides

Similar to QCon 2015 Broken Performance Tools(20)

Broken Performance Tools by C4Media
Broken Performance ToolsBroken Performance Tools
Broken Performance Tools
C4Media763 views
In Memory Database In Action by Tanel Poder and Kerry Osborne by Enkitec
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec1.5K views
Oracle Database In-Memory Option in Action by Tanel Poder
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in Action
Tanel Poder7.1K views
UKOUG, Lies, Damn Lies and I/O Statistics by Kyle Hailey
UKOUG, Lies, Damn Lies and I/O StatisticsUKOUG, Lies, Damn Lies and I/O Statistics
UKOUG, Lies, Damn Lies and I/O Statistics
Kyle Hailey3.3K views
MeetBSD2014 Performance Analysis by Brendan Gregg
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
Brendan Gregg32K views
Finding an unusual cause of max_user_connections in MySQL by Olivier Doucet
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQL
Olivier Doucet572 views
Performance Monitoring: Understanding Your Scylla Cluster by ScyllaDB
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
ScyllaDB5.1K views
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC by Kristofferson A
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A10.5K views
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,... by ScyllaDB
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
ScyllaDB1.2K views
Performance tunning by lokesh777
Performance tunningPerformance tunning
Performance tunning
lokesh777271 views
Debugging linux issues with eBPF by Ivan Babrou
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPF
Ivan Babrou1.7K views
Analyzing and Interpreting AWR by pasalapudi
Analyzing and Interpreting AWRAnalyzing and Interpreting AWR
Analyzing and Interpreting AWR
pasalapudi64.2K views
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014 by Amazon Web Services
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
Amazon Web Services7.4K views
Performance tweaks and tools for Linux (Joe Damato) by Ontico
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
Ontico2.1K views
Percona Live UK 2014 Part III by Alkin Tezuysal
Percona Live UK 2014  Part IIIPercona Live UK 2014  Part III
Percona Live UK 2014 Part III
Alkin Tezuysal401 views
VMworld 2016: vSphere 6.x Host Resource Deep Dive by VMworld
VMworld 2016: vSphere 6.x Host Resource Deep DiveVMworld 2016: vSphere 6.x Host Resource Deep Dive
VMworld 2016: vSphere 6.x Host Resource Deep Dive
VMworld8.5K views
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1 by Tanel Poder
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder14K views
Apache Spark At Scale in the Cloud by Databricks
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks3.2K views

More from Brendan Gregg

IntelON 2021 Processor Benchmarking by
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
1K views17 slides
Performance Wins with eBPF: Getting Started (2021) by
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
1.4K views30 slides
Computing Performance: On the Horizon (2021) by
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
92.7K views113 slides
BPF Internals (eBPF) by
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
15.3K views122 slides
UM2019 Extended BPF: A New Type of Software by
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
33.1K views48 slides
LISA2019 Linux Systems Performance by
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
374.3K views64 slides

More from Brendan Gregg(19)

IntelON 2021 Processor Benchmarking by Brendan Gregg
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
Brendan Gregg1K views
Performance Wins with eBPF: Getting Started (2021) by Brendan Gregg
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
Brendan Gregg1.4K views
Computing Performance: On the Horizon (2021) by Brendan Gregg
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
Brendan Gregg92.7K views
BPF Internals (eBPF) by Brendan Gregg
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
Brendan Gregg15.3K views
UM2019 Extended BPF: A New Type of Software by Brendan Gregg
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
Brendan Gregg33.1K views
LISA2019 Linux Systems Performance by Brendan Gregg
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
Brendan Gregg374.3K views
LPC2019 BPF Tracing Tools by Brendan Gregg
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
Brendan Gregg1.8K views
YOW2018 CTO Summit: Working at netflix by Brendan Gregg
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
Brendan Gregg3.5K views
YOW2018 Cloud Performance Root Cause Analysis at Netflix by Brendan Gregg
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
Brendan Gregg142.4K views
NetConf 2018 BPF Observability by Brendan Gregg
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
Brendan Gregg2.7K views
ATO Linux Performance 2018 by Brendan Gregg
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
Brendan Gregg3.3K views
How Netflix Tunes EC2 Instances for Performance by Brendan Gregg
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
Brendan Gregg524.1K views
LISA17 Container Performance Analysis by Brendan Gregg
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
Brendan Gregg9.3K views
Kernel Recipes 2017: Using Linux perf at Netflix by Brendan Gregg
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
Brendan Gregg1.5M views
Kernel Recipes 2017: Performance Analysis with BPF by Brendan Gregg
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
Brendan Gregg5.3K views
USENIX ATC 2017 Performance Superpowers with Enhanced BPF by Brendan Gregg
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
Brendan Gregg7.1K views
Velocity 2017 Performance analysis superpowers with Linux eBPF by Brendan Gregg
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
Brendan Gregg735.7K views

Recently uploaded

The Research Portal of Catalonia: Growing more (information) & more (services) by
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
80 views25 slides
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Safe Software
280 views86 slides
PharoJS - Zürich Smalltalk Group Meetup November 2023 by
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023Noury Bouraqadi
132 views17 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
39 views1 slide
"Running students' code in isolation. The hard way", Yurii Holiuk by
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk Fwdays
17 views34 slides
Info Session November 2023.pdf by
Info Session November 2023.pdfInfo Session November 2023.pdf
Info Session November 2023.pdfAleksandraKoprivica4
13 views15 slides

Recently uploaded(20)

Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software280 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi132 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays17 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker40 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Powerful Google developer tools for immediate impact! (2023-24) by wesley chun
Powerful Google developer tools for immediate impact! (2023-24)Powerful Google developer tools for immediate impact! (2023-24)
Powerful Google developer tools for immediate impact! (2023-24)
wesley chun10 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn22 views

QCon 2015 Broken Performance Tools

  • 17. %CPU
  • 18. What is %CPU anyway? • "Good" %CPU: – Retiring instructions (provided they aren't a spin loop) – High IPC (Instructions-Per-Cycle) • "Bad" %CPU: – Stall cycles waiting on resources, usually memory I/O – Low IPC – Buying faster processors may make little difference • %CPU alone is ambiguous – Would love top(1) to split %CPU into cycles retiring vs stalled – Although, it gets worse…
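A quick way to check whether a given %CPU is retiring instructions or stalling is to measure IPC with perf stat; a minimal sketch (system-wide for 10 seconds, which may need root; counter availability varies by CPU and kernel):
    $ perf stat -e cycles,instructions -a sleep 10
    # perf derives instructions per cycle (IPC) from these two counters:
    # a low IPC (well under 1.0 on typical x86) suggests stall cycles,
    # a higher IPC suggests the CPUs are mostly retiring instructions.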
  • 19. A CPU Mystery… • As load increased, CPU ms per request lowered (blue) – up to 1.84x faster • Was it due to: - Cache warmth? no - Different code? no - Turbo boost? no • (Same test, but problem fixed, is shown in red)
  • 20. CPU Speed Variation • Clock speed can vary thanks to: – Intel Turbo Boost: by hardware, based on power, temp, etc – Intel SpeedStep: by software, controlled by the kernel • Even with IPC, %CPU is still ambiguous: you also need to know the clock speed • CPU counters nowadays have "reference cycles"
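To see whether the clock itself was varying, reference cycles can be compared with actual cycles; a sketch (the ref-cycles event exists on many recent Intel CPUs, but not on all systems):
    $ perf stat -e cycles,ref-cycles -a sleep 10
    # cycles counts at the actual (Turbo Boost / SpeedStep-scaled) clock rate;
    # ref-cycles counts at a fixed reference rate, so a cycles:ref-cycles ratio
    # above 1.0 suggests the CPUs ran above their base clock during the sample.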
  • 21. Out-of-order Execution • CPUs execute uops out-of-order and in parallel across multiple functional units • %CPU doesn't account for how many units are active • Accounting for each cycle as "stalled" or "retiring" is a simplification • Nowadays it's a lot of work to truly understand what CPUs are doing https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
  • 23. I/O Wait • Suggests system is disk I/O bound, but often misleading • Comparing I/O wait between system A and B: - higher might be bad: slower disks, more blocking - lower might be bad: a slower processor or architecture consumes more CPU time, obscuring I/O wait • Solaris implementation was also broken and later hardwired to zero • Can be very useful when understood: another idle state $ mpstat -P ALL 1 08:06:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle 08:06:44 PM all 53.45 0.00 3.77 0.00 0.00 0.39 0.13 0.00 42.26 […]
  • 24. I/O Wait Venn Diagram (per CPU): two overlapping sets, "CPU" and "Waiting for disk I/O". Time that is idle while disk I/O is outstanding is "I/O Wait"; busy time counts as "CPU" regardless of outstanding I/O; the remainder is "Idle".
  • 26. Free Memory • "free" is near-zero: I'm running out of memory! - No, it's in the file system cache, and is still free for apps to use • Linux free(1) explains it, but other tools, e.g. vmstat(1), don't • Some file systems (e.g., ZFS) may not be shown in the system's cached metrics at all www.linuxatemyram.com $ free -m total used free shared buffers cached Mem: 3750 1111 2639 0 147 527 -/+ buffers/cache: 436 3313 Swap: 0 0 0
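On newer kernels there is also a more direct answer than doing the buffers/cache arithmetic by hand; a sketch (MemAvailable was added in Linux 3.14, and caches outside the page cache, such as the ZFS ARC, may still not be included):
    $ grep -E 'MemFree|MemAvailable' /proc/meminfo
    # MemAvailable estimates memory available to new workloads without swapping,
    # including easily reclaimable page cache - usually a better "am I running
    # out?" signal than MemFree alone.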
  • 28. vmstat(1) • Linux: some columns in the first line of output are averages since boot — confusing! • Other implementations: - "r" may be sampled once per second. Almost useless. - Columns like "de" for deficit make much less sense for non-page-scanned situations $ vmstat -Sm 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 8 0 0 1620 149 552 0 0 1 179 77 12 25 34 0 0 7 0 0 1598 149 552 0 0 0 0 205 186 46 13 0 0 8 0 0 1617 149 552 0 0 0 8 210 435 39 21 0 0 8 0 0 1589 149 552 0 0 0 0 218 219 42 17 0 0 […]
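One way to avoid being misled by that first sample is simply to drop it; a sketch (it assumes the since-boot summary is the first data row, i.e. the third line of output, which may differ between vmstat versions):
    $ vmstat -Sm 1 | awk 'NR != 3'    # keep the headers, skip the since-boot sample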
  • 30. netstat -s $ netstat -s Ip: 7962754 total packets received 8 with invalid addresses 0 forwarded 0 incoming packets discarded 7962746 incoming packets delivered 8019427 requests sent out Icmp: 382 ICMP messages received 0 input ICMP message failed. ICMP input histogram: destination unreachable: 125 timeout in transit: 257 3410 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 3410 IcmpMsg: InType3: 125 InType11: 257 OutType3: 3410 Tcp: 17337 active connections openings 395515 passive connection openings 8953 failed connection attempts 240214 connection resets received 3 connections established 7198375 segments received 7504939 segments send out 62696 segments retransmited 10 bad segments received. 1072 resets sent InCsumErrors: 5 Udp: 759925 packets received 3412 packets to unknown port received. 0 packet receive errors 784370 packets sent UdpLite: TcpExt: 858 invalid SYN cookies received 8951 resets received for embryonic SYN_RECV sockets 14 packets pruned from receive queue because of socket buffer overrun 6177 TCP sockets finished time wait in fast timer 293 packets rejects in established connections because of timestamp 733028 delayed acks sent 89 delayed acks further delayed because of locked socket Quick ack mode was activated 13214 times 336520 packets directly queued to recvmsg prequeue. 43964 packets directly received from backlog 11406012 packets directly received from prequeue 1039165 packets header predicted 7066 packets header predicted and directly queued to user 1428960 acknowledgments not containing data received 1004791 predicted acknowledgments 1 times recovered from packet loss due to fast retransmit 5044 times recovered from packet loss due to SACK data 2 bad SACKs received Detected reordering 4 times using SACK Detected reordering 11 times using time stamp 13 congestion windows fully recovered 11 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 39 2384 congestion windows recovered after partial ack 228 timeouts after SACK recovery 100 timeouts in loss state 5018 fast retransmits 39 forward retransmits 783 retransmits in slow start 32455 other TCP timeouts TCPLossProbes: 30233 TCPLossProbeRecovery: 19070 992 sack retransmits failed 18 times receiver scheduled too late for direct processing 705 packets collapsed in receive queue due to low socket buffer 13658 DSACKs sent for old packets 8 DSACKs sent for out of order packets 13595 DSACKs received 33 DSACKs for out of order packets received 32 connections reset due to unexpected data 108 connections reset due to early user close 1608 connections aborted due to timeout TCPSACKDiscard: 4 TCPDSACKIgnoredOld: 1 TCPDSACKIgnoredNoUndo: 8649 TCPSpuriousRTOs: 445 TCPSackShiftFallback: 8588 TCPRcvCoalesce: 95854 TCPOFOQueue: 24741 TCPOFOMerge: 8 TCPChallengeACK: 1441 TCPSYNChallenge: 5 TCPSpuriousRtxHostQueues: 1 TCPAutoCorking: 4823 IpExt: InOctets: 1561561375 OutOctets: 1509416943 InNoECTPkts: 8201572 InECT1Pkts: 2 InECT0Pkts: 3844 InCEPkts: 306
  • 31. netstat -s […] Tcp: 17337 active connections openings 395515 passive connection openings 8953 failed connection attempts 240214 connection resets received 3 connections established 7198870 segments received 7505329 segments send out 62697 segments retransmited 10 bad segments received. 1072 resets sent InCsumErrors: 5 […]
  • 32. netstat -s • Many metrics on Linux (can be over 200) • Still doesn't include everything: getting better, but don't assume everything is there • Includes typos & inconsistencies • Might be more readable to: cat /proc/net/snmp /proc/net/netstat • Totals since boot can be misleading • On Linux, -s needs -c support • Often no documentation outside kernel source code • Requires expertise to comprehend
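When only a couple of counters matter, it can be more reliable to read them straight from /proc than to eyeball the full dump; a sketch that computes the TCP retransmit ratio (field names as found in /proc/net/snmp; these are totals since boot, so diff two samples for a current rate):
    $ awk '/^Tcp:/ && !hdr++ { for (i = 1; i <= NF; i++) name[i] = $i; next }
           /^Tcp:/ { for (i = 1; i <= NF; i++) val[name[i]] = $i
                     printf "retransmits: %s / %s segs (%.2f%%)\n",
                            val["RetransSegs"], val["OutSegs"],
                            100 * val["RetransSegs"] / val["OutSegs"] }' /proc/net/snmp
    # the first rule records the header line, so counters are looked up by name
    # rather than by position, which guards against field reordering.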
  • 34. Disk Metrics • All disk metrics are misleading • Disk %utilization / %busy – Logical devices (volume managers) can process requests in parallel, and may accept more I/O at 100% • Disk IOPS – High IOPS is "bad"? That depends… • Disk latency – Does it matter? File systems and volume managers try hard to hide latency and make it asynchronous – Better to measure latency via the application's file system calls
  • 35. Rules for Metrics Makers • They must work – As well as possible. Clearly document caveats. • They must be useful – Document a real use case (eg, my example.txt files). If you get stuck, it's not useful – ditch it. • Aim to be intuitive – Document it. If it's too weird to explain, redo it. • As few as possible – Respect end-user's time • Good system examples: – iostat -x: workload columns, then resulting perf columns – Linux sar: consistency, units on columns, logical groups
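For instance, iostat -x reads left to right as applied workload, then resulting performance; a usage sketch (exact column names differ between sysstat versions):
    $ iostat -x 1
    # r/s, w/s and the request-size columns describe the workload being applied;
    # await (average I/O time, ms) and %util describe the resulting performance,
    # keeping in mind the %util caveat from the disk metrics slide.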
  • 38. Linux perf • Can sample stack traces and summarize output: # perf report -n --stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …]
  • 40. … as a Flame Graph
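The usual recipe for producing one from perf output is the FlameGraph scripts; a sketch that assumes they have been cloned into the current directory from https://github.com/brendangregg/FlameGraph:
    # perf record -F 99 -a -g -- sleep 30        # sample all CPUs at 99 Hertz for 30 secs
    # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
    # open flame.svg in a browser: frame widths are proportional to sample counts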
  • 42. System Profilers with Java (diagram: a profile timeline spanning Java, the JVM, GC, kernel TCP/IP, locks, epoll, and the idle thread)
  • 43. System Profilers with Java • e.g., Linux perf • Visibility – JVM (C++) – GC (C++) – libraries (C) – kernel (C) • Typical problems (x86): – Stacks missing for Java and other runtimes – Symbols missing for Java methods • Profile everything except Java and similar runtimes
  • 45. Java Profilers • Visibility – Java method execution – Object usage – GC logs – Custom Java context • Typical problems: – Sampling often happens at safety/yield points (skew) – Method tracing has massive observer effect – Misidentifies RUNNING as on-CPU (e.g., epoll) – Doesn't include or profile GC or JVM CPU time – Tree views not quick (proportional) to comprehend • Inaccurate (skewed) and incomplete profiles
  • 47. Broken System Stack Traces • Profiling Java on x86 using perf • The stacks are 1 or 2 levels deep, and have junk values # perf record -F 99 -a -g -- sleep 30 # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) […]
  • 48. Why Stacks are Broken • On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose • This "compiler optimization" breaks stack walking • Once upon a time, x86 had fewer registers, and this made much more sense • gcc provides -fno-omit-frame-pointer to avoid doing this – JDK8u60+ now has this as -XX:+PreserveFramePointer
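A minimal sketch of putting that flag to use (the jar name is a placeholder; requires JDK8u60 or later):
    $ java -XX:+PreserveFramePointer -jar app.jar &   # hypothetical application
    # perf record -F 99 -a -g -- sleep 30             # stacks should now walk through Java frames
    # perf script | less                              # symbols may still be missing: next slides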
  • 49. Missing Symbols • Missing symbols may show up as hex; e.g., Linux perf: (not broken) 12.06% 62 sed sed [.] re_search_internal | --- re_search_internal | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main (broken) 71.79% 334 sed sed [.] 0x000000000001afc1 | |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5
  • 50. Fixing Symbols • For applications, install the debug symbol package • For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map • Find a way to create this for your runtime # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […]
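For Java, one commonly used way to generate /tmp/perf-PID.map is the perf-map-agent project; this is a sketch only, with the repository and script path given as an assumption (check that project's README for current build and usage instructions):
    $ git clone https://github.com/jvm-profiling-tools/perf-map-agent
    # build per its README, then, while the target JVM is running:
    $ ./create-java-perf-map.sh $(pgrep -n java)      # writes /tmp/perf-<pid>.map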
  • 52. Instruction Profiling # perf annotate -i perf.data.noplooper --stdio Percent | Source code & Disassembly of noplooper -------------------------------------------------------- : Disassembly of section .text: : : 00000000004004ed <main>: 0.00 : 4004ed: push %rbp 0.00 : 4004ee: mov %rsp,%rbp 20.86 : 4004f1: nop 0.00 : 4004f2: nop 0.00 : 4004f3: nop 0.00 : 4004f4: nop 19.84 : 4004f5: nop 0.00 : 4004f6: nop 0.00 : 4004f7: nop 0.00 : 4004f8: nop 18.73 : 4004f9: nop 0.00 : 4004fa: nop 0.00 : 4004fb: nop 0.00 : 4004fc: nop 19.08 : 4004fd: nop 0.00 : 4004fe: nop 0.00 : 4004ff: nop 0.00 : 400500: nop 21.49 : 400501: jmp 4004f1 <main+0x4> • Often broken nowadays due to skid, out-of-order execution, and sampling the resumption instruction • Better with PEBS support
  • 55. tcpdump • Packet tracing doesn't scale. Overheads: – CPU cost of per-packet tracing (improved by [e]BPF) • Consider CPU budget per-packet at 10/40/100 GbE – Transfer to user-level (improved by ring buffers) – File system storage (more CPU, and disk I/O) – Possible additional network transfer • Can also drop packets when overloaded • You should only trace send/receive as a last resort – I solve problems by tracing lower frequency TCP events $ tcpdump -i eth0 -w /tmp/out.tcpdump tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes ^C7985 packets captured 8996 packets received by filter 1010 packets dropped by kernel
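An example of the "lower frequency TCP events" approach using dynamic tracing; a sketch (kernel function names are not a stable interface, so the probe may need adjusting, and this needs root):
    # perf probe tcp_retransmit_skb                        # add a dynamic kernel probe
    # perf record -e probe:tcp_retransmit_skb -a -g -- sleep 30
    # perf report --stdio                                  # who is retransmitting, with stacks
    # perf probe --del tcp_retransmit_skb                  # remove the probe when done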
  • 57. strace • Before: • After: • 442x slower. This is worst case. • strace(1) pauses the process twice for each syscall. This is like putting metering lights on your app. – "BUGS: A traced process runs slowly." – strace(1) man page – Use buffered tracing / in-kernel counters instead, e.g. DTrace $ dd if=/dev/zero of=/dev/null bs=1 count=500k […] 512000 bytes (512 kB) copied, 0.103851 s, 4.9 MB/s $ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k […] 512000 bytes (512 kB) copied, 45.9599 s, 11.1 kB/s
  • 59. DTrace • Overhead often negligible, but not always • Before: • After: • 384x slower. Fairly worst case: frequent pid probes. # time wc systemlog 262600 2995200 23925200 systemlog real 0m1.098s user 0m1.085s sys 0m0.012s # time dtrace -n 'pid$target:::entry { @[probefunc] = count(); }' -c 'wc systemlog' dtrace: description 'pid$target:::entry ' matched 3756 probes 262600 2995200 23925200 systemlog […] real 7m2.896s user 7m2.650s sys 0m0.572s
  • 60. Tracing Dangers • Overhead potential exists for all tracers – Overhead = event instrumentation cost X frequency of event • Costs – Lower: counters, in-kernel aggregations – Higher: event dumps, stack traces, string copies, copyin/outs • Frequencies – Lower: process creation & destruction, disk I/O (usually), … – Higher: instructions, functions in I/O hot path, malloc/free, Java methods, … • Advice – < 10,000 events/sec, probably ok – > 100,000 events/sec, overhead may start to be measurable
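One way to apply that advice is to count the event rate cheaply, in-kernel, before attaching anything heavier; a sketch using perf in counting mode (tracepoint names vary by kernel version):
    # perf stat -e block:block_rq_issue -a sleep 10    # disk I/O issue count over 10 secs
    # perf stat -e sched:sched_switch -a sleep 10      # context switch count
    # divide the count by 10: rates well above ~100,000/sec deserve caution before
    # per-event tracing, stack collection, or string copying.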
  • 61. DTraceToolkit • My own tools that can cause massive overhead: – dapptrace/dappprof: can trace all native functions – Java/j_flow.d, ...: can trace all Java methods with +ExtendedDTraceProbes • Useful for debugging, but should warn about overheads # j_flow.d C PID TIME(us) -- CLASS.METHOD 0 311403 4789112583163 -> java/lang/Object.<clinit> 0 311403 4789112583207 -> java/lang/Object.registerNatives 0 311403 4789112583323 <- java/lang/Object.registerNatives 0 311403 4789112583333 <- java/lang/Object.<clinit> 0 311403 4789112583343 -> java/lang/String.<clinit> 0 311403 4789112583732 -> java/lang/String$CaseInsensitiveComparator.<init> 0 311403 4789112583743 -> java/lang/String$CaseInsensitiveComparator.<init> 0 311403 4789112583752 -> java/lang/Object.<init> [...]
  • 63. Valgrind • A suite of tools including an extensive leak detector • To its credit it does warn the end user "Your program will run much slower (eg. 20 to 30 times) than normal" – http://valgrind.org/docs/manual/quick-start.html
  • 65. Java Profilers • Some Java profilers have two modes: – Sampling stacks: eg, at 100 Hertz – Tracing methods: instrumenting and timing every method • Method timing has been described as "highly accurate", despite slowing the target by up to 1000x! • Issues & advice already covered at QCon: – Nitsan Wakart "Profilers are Lying Hobbitses" earlier today – Java track tomorrow
  • 68. Monitoring • By now you should recognize these pathologies: – Let's just graph the system metrics! • That's not the problem that needs solving – Let's just trace everything and post process! • Now you have one million problems per second • Monitoring adds additional problems: – Let's have a cloud-wide dashboard update per-second! • From every instance? Packet overheads? – Now we have billions of metrics!
  • 71. Statistics "Then there is the man who drowned crossing a stream with an average depth of six inches." – W.I.E. Gates
  • 72. Statistics • Averages can be misleading – Hide latency outliers – Per-minute averages can hide multi-second issues • Percentiles can be misleading – Probability of hitting 99.9th latency may be more than 1/1000 after many dependency requests • Show the distribution: – Summarize: histogram, density plot, frequency trail – Over-time: scatter plot, heat map • See Gil Tene's "How Not to Measure Latency" QCon talk from earlier today
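For example, if a user-facing operation fans out to N independent dependency requests, the chance that at least one of them exceeds the 99.9th-percentile latency is 1 - 0.999^N; a quick check with an illustrative N of 100:
    $ awk 'BEGIN { n = 100; printf "%.1f%%\n", 100 * (1 - 0.999 ^ n) }'
    # prints about 9.5% - so the "three nines" latency is felt far more often
    # than 0.1% of user-visible operations.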
  • 73. Average Latency • When the index of central tendency isn't…
  • 77. Pie Charts … for real-time metrics (example: a pie chart of usr/sys/wait/idle CPU time)
  • 78. Doughnuts … like pie charts but worse (the same usr/sys/wait/idle example as a doughnut chart)
  • 79. Traffic Lights … when used for subjective metrics. RED == BAD (usually), GREEN == GOOD (hopefully). These can be used for objective metrics.
  • 82. ~100% of benchmarks are wrong
  • 83. "Most popular benchmarks are flawed" Source: Traeger, A., E. Zadok, N. Joukov, and C. Wright. “A Nine Year Study of File System and Storage Benchmarking,” ACM Transactions on Storage, 2008. Not only can a popular benchmark be broken, but so can all alternates.
  • 85. The energy needed to refute benchmarks is multiple orders of magnitude bigger than to run them. It can take 1-2 weeks of senior performance engineering time to debug a single benchmark.
  • 86. Benchmarking • Benchmarking is a useful form of experimental analysis – Try observational first; benchmarks can perturb • Accurate and realistic benchmarking is vital for technical investments that improve our industry • However, benchmarking is error prone
  • 88. Common Mistakes 1. Testing the wrong target – eg, FS cache instead of disk; misconfiguration 2. Choosing the wrong target – eg, disk instead of FS cache … doesn’t resemble real world 3. Invalid results – benchmark software bugs 4. Ignoring errors – error path may be fast! 5. Ignoring variance or perturbations – real workload isn't steady/consistent, which matters 6. Misleading results – you benchmark A, but actually measure B, and conclude you measured C
  • 90. Product Evaluations • Benchmarking is used for product evaluations & sales • The Benchmark Paradox: – If your product’s chances of winning a benchmark are 50/50, you’ll usually lose – To justify a product switch, a customer may run several benchmarks, and expect you to win them all – May mean winning a coin toss at least 3 times in a row – http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html • Solving this seeming paradox (and benchmarking): – Confirm benchmark is relevant to intended workload – Ask: why isn't it 10x?
  • 91. Active Benchmarking • Root cause performance analysis while the benchmark is still running – Use observability tools – Identify the limiter (or suspected limiter) and include it with the benchmark results – Answer: why not 10x? • This takes time, but uncovers most mistakes
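A sketch of that workflow ("./benchmark" and its flag are placeholders for whatever is actually being run):
    $ ./benchmark --duration 600 &        # hypothetical benchmark, left running in steady state
    $ mpstat -P ALL 1                     # CPU bound? which CPUs, %usr vs %sys?
    $ iostat -x 1                         # disk bound? check await and %util
    $ pidstat 1                           # which processes are actually on-CPU?
    # perf record -F 99 -a -g -- sleep 30 # profile the suspected limiter, then ask: why not 10x?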
  • 93. Micro Benchmarks • Test a specific function in isolation. e.g.: – File system maximum cached read operations/sec – Network maximum throughput • Examples of bad microbenchmarks: – getpid() in a tight loop – speed of /dev/zero and /dev/null • Common problems: – Testing a workload that is not very relevant – Missing other workloads that are relevant
  • 95. Macro Benchmarks • Simulate application user load. e.g.: – Simulated web client transaction • Common problems: – Misplaced trust: believed to be realistic, but misses variance, errors, perturbations, etc. – Complex to debug, verify, and root cause
  • 97. Kitchen Sink Benchmarks • Run everything! – Mostly random benchmarks found on the Internet, where most are broken or irrelevant – Developers focus on collecting more benchmarks than verifying or fixing the existing ones • Myth that more benchmarks == greater accuracy – No, use active benchmarking (analysis)
  • 99. Automated Benchmarks • Completely automated procedure. e.g.: – Cloud benchmarks: spin up an instance, benchmark, destroy. Automate. • Little or no provision for debugging • Automation is only part of the solution
  • 102. bonnie++ • "simple tests of hard drive and file system performance" • First metric printed by (thankfully) older versions: per character sequential output • What was actually tested: – 1 byte writes to libc (via putc()) – 4 Kbyte writes from libc -> FS (depends on OS; see setbuffer()) – 128 Kbyte async writes to disk (depends on storage stack) – Any file system throttles that may be present (eg, ZFS) – C++ code, to some extent (bonnie++ 10% slower than Bonnie) • Actual limiter: – Single threaded write_block_putc() and putc() calls
  • 104. Apache Bench • HTTP web server benchmark • Single thread limited (use wrk for multi-threaded) • Keep-alive option (-k): – without: Can become an unrealistic TCP session benchmark – with: Can become an unrealistic server throughput test • Performance issues of ab's own code
  • 106. UnixBench • The original kitchen-sink micro benchmark from 1984, published in BYTE magazine • Innovative & useful for the time, but that time has passed • More problems than I can shake a stick at • Starting with…
  • 108. UnixBench Makefile • Default (by ./Run) for Linux. Would you edit it? Then what? ## Very generic #OPTON = -O ## For Linux 486/Pentium, GCC 2.7.x and 2.8.x #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math # -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2 ## For Linux, GCC previous to 2.7.0 #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math -m486 #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math # -m386 -malign-loops=1 -malign-jumps=1 -malign-functions=1 ## For Solaris 2, or general-purpose GCC 2.7.x OPTON = -O2 -fomit-frame-pointer -fforce-addr -ffast-math -Wall ## For Digital Unix v4.x, with DEC cc v5.x #OPTON = -O4 #CFLAGS = -DTIME -std1 -verbose -w0
  • 109. UnixBench Makefile • "Fixing" the Makefile improved the first result, Dhrystone 2, by 64% • Is everyone "fixing" it the same way, or not? Are they using the same compiler version? Same OS? (No.)
  • 110. UnixBench Documentation "The results will depend not only on your hardware, but on your operating system, libraries, and even compiler." "So you may want to make sure that all your test systems are running the same version of the OS; or at least publish the OS and compiler versions with your results."
  • 112. UnixBench Tests • Results summarized as "The BYTE Index". From USAGE: • What can go wrong? Everything. system: dhry2reg Dhrystone 2 using register variables whetstone-double Double-Precision Whetstone syscall System Call Overhead pipe Pipe Throughput context1 Pipe-based Context Switching spawn Process Creation execl Execl Throughput fstime-w File Write 1024 bufsize 2000 maxblocks fstime-r File Read 1024 bufsize 2000 maxblocks fstime File Copy 1024 bufsize 2000 maxblocks fsbuffer-w File Write 256 bufsize 500 maxblocks fsbuffer-r File Read 256 bufsize 500 maxblocks fsbuffer File Copy 256 bufsize 500 maxblocks fsdisk-w File Write 4096 bufsize 8000 maxblocks fsdisk-r File Read 4096 bufsize 8000 maxblocks fsdisk File Copy 4096 bufsize 8000 maxblocks shell1 Shell Scripts (1 concurrent) (runs "looper 60 multi.sh 1") shell8 Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 8") shell16 Shell Scripts (8 concurrent) (runs "looper 60 multi.sh 16")
  • 115. Street Light Anti-Method 1. Pick observability tools that are: – Familiar – Found on the Internet – Found at random 2. Run tools 3. Look for obvious issues
  • 116. Blame Someone Else Anti-Method 1. Find a system or environment component you are not responsible for 2. Hypothesize that the issue is with that component 3. Redirect the issue to the responsible team 4. When proven wrong, go to 1
  • 117. Performance Tools Team • Having a separate performance tools team, who creates tools but doesn't use them (no production exposure) • At Netflix: – The performance engineering team builds tools and uses tools for both service consulting and live production triage • Mogul, Vector, … – Other teams (CORE, traffic, …) also build performance tools and use them during issues • Good performance tools are built out of necessity
  • 118. Messy House Fallacy • Fallacy: my code is a mess, I bet yours is immaculate, therefore the bug must be mine • Reality: everyone's code is terrible and buggy • When analyzing performance, don't overlook the system: kernel, libraries, etc.
  • 121. Observability • Trust nothing, verify everything – Cross-check with other observability tools – Write small "known" workloads, and confirm metrics match – Find other sanity tests: e.g. check known system limits – Determine how metrics are calculated, averaged, updated • Find metrics to solve problems – Instead of understanding hundreds of system metrics – What problems do you want to observe? What metrics would be sufficient? Find, verify, and use those. e.g., USE Method. – The metric you want may not yet exist • File bugs, get these fixed/improved
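A sketch of one such "known" workload sanity test (sizes are illustrative; the cache drop needs root and is Linux-specific):
    $ dd if=/dev/urandom of=/tmp/testfile bs=1M count=100   # create a 100 Mbyte file
    $ sync && echo 3 > /proc/sys/vm/drop_caches             # drop the page cache
    $ dd if=/tmp/testfile of=/dev/null bs=1M                # should cause ~100 Mbytes of disk reads
    # now cross-check: did iostat -x (and your monitoring) report roughly 100 Mbytes
    # read from the expected device over that interval?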
  • 122. Observe Everything • Use functional diagrams to pose Q's and find missing metrics:
  • 123. Java Mixed-Mode Flame Graph Profile Everything Java JVM Kernel GC
  • 125. Benchmark Nothing • Trust nothing, verify everything • Do Active Benchmarking: 1. Configure the benchmark to run in steady state, 24x7 2. Do root-cause analysis of benchmark performance 3. Answer: why is it not 10x?
  • 126. Links & References • https://www.rfc-editor.org/rfc/rfc546.pdf • https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg • http://www.linuxatemyram.com/ • Traeger, A., E. Zadok, N. Joukov, and C. Wright. "A Nine Year Study of File System and Storage Benchmarking," ACM Transactions on Storage, 2008. • http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html • http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html • http://www.brendangregg.com/ActiveBenchmarking/bonnie++.html • https://blogs.oracle.com/roch/entry/decoding_bonnie • http://www.brendangregg.com/blog/2014-05-02/compilers-love-messing-with-benchmarks.html • https://code.google.com/p/byte-unixbench/ • https://qconsf.com/sf2015/presentation/how-not-measure-latency • https://qconsf.com/sf2015/presentation/profilers-lying • Caution signs drawn by me, inspired by real-world signs
  • 127. THANKS
  • 128. Thanks • Questions? • http://techblog.netflix.com • http://slideshare.net/brendangregg • http://www.brendangregg.com • bgregg@netflix.com • @brendangregg Nov 2015

Editor's Notes

  1. G'Day, I'm Brendan. I normally give talks about performance tools that work, but today I'm going to talk about those that don't: broken and misleading tools and metrics.
  2. As a warning: when you use performance tools, you will get hurt. It's like a surface that's slippery when wet, or dry. It's like, always slippery.
  3. Awesome place to work. CentOS and Ubuntu.
  4. I'm not really going to talk about software bugs. I'm going to talk about things that appear to work that don't. Things that are counter-intuitive. Understanding those is what separates beginners from experts, and is the hardest part to learn.
  5. Is this ok? Does that answer our questions? Pretty simple, right?
  6. imagine you went to another country and hired a car, and were told that "green" on a traffic light means "probably ok, but you might get T-boned"… That'd make the traffic lights pretty useless.
  7. Why are you still running top?
  8. "Maybe it's the network… Can you check with the network team if they have had dropped packets … or something?"
  9. We're hiring