- The document discusses debugging Node.js applications in production environments at Netflix, which has strict uptime requirements. It describes techniques used such as collecting stack traces from running processes using perf and visualizing them in flame graphs to identify performance bottlenecks. It also covers configuring Node.js to dump core files on errors to enable post-mortem debugging without affecting uptime. The techniques help Netflix reduce latency, increase throughput, and fix runtime crashes and memory leaks in production Node.js applications.
15. Snapshot What’s Currently Executing
Stacktrace: A stack trace is a report of the active stack frames
at a certain point in time during the execution of a program.
> console.log(ex, ex.stack.split("\n"))
ReferenceError: ex is not defined
at repl:1:13
at REPLServer.defaultEval (repl.js:132:27)
at bound (domain.js:254:14)
at REPLServer.runBound [as eval] (domain.js:267:12)
at REPLServer.<anonymous> (repl.js:279:12)
at REPLServer.emit (events.js:107:17)
at REPLServer.Interface._onLine (readline.js:214:10)
at REPLServer.Interface._line (readline.js:553:8)
at REPLServer.Interface._ttyWrite (readline.js:830:14)
at ReadStream.onkeypress (readline.js:109:10)
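As a rough illustration of the definition above, the following sketch captures a stack trace from inside nested calls and splits it into frames. The names `captureFrames`, `inner`, and `outer` are hypothetical, and the exact frame text varies with the V8 version.

```javascript
// Hypothetical helper: returns the current stack as an array of frame strings.
function captureFrames() {
  // In V8, the stack string is captured when the Error object is created.
  const stack = new Error('snapshot').stack;
  // The first line is the error message; the remaining lines are "at ..." frames.
  return stack.split('\n').slice(1).map((line) => line.trim());
}

function inner() {
  return captureFrames();
}

function outer() {
  return inner();
}

// Innermost frame first: captureFrames, then inner, then outer.
const frames = outer();
console.log(frames.slice(0, 3).join('\n'));
```

This only snapshots the stack from inside the process at a point the code chooses, which is why sampling a running process from outside needs a different tool.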
16. Two Problems
1) How to sample stack traces from a running process?
2) How to do 1) without affecting the process?
17. Linux Perf Events
PERF(1) perf Manual PERF(1)
NAME
perf - Performance analysis tools for Linux
SYNOPSIS
perf [--version] [--help] COMMAND [ARGS]
DESCRIPTION
Performance counters for Linux are a new kernel-based subsystem
that provide a framework for all things performance analysis.
It covers hardware level (CPU/PMU, Performance Monitoring Unit)
features and software features (software counters, tracepoints)
as well.
18. Sample Stack Traces w/ perf(1)
# perf record -F 99 -p `pgrep -n node` -g -- sleep 30
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.524 MB perf.data (~22912 samples) ]
28. Flamegraph
❖ Each box represents a function in the stack (stack frame)
❖ x-axis: percent of time on CPU
❖ y-axis: stack depth
❖ colors: random, or can be a dimension
❖ https://github.com/brendangregg/FlameGraph
[Figure: flame graph; legend: v8, libc, JS, built-ins]
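To make the x-axis concrete, here is a minimal sketch of how sampled stacks can be collapsed into the "folded" text form that the FlameGraph scripts consume: one line per unique stack, frames joined by `;`, followed by a sample count. The function `foldStacks` and the sample data are hypothetical; in practice the repository's own stackcollapse scripts do this step on perf output.

```javascript
// Collapse sampled stacks into folded lines: "root;child;leaf count".
function foldStacks(samples) {
  const counts = new Map();
  for (const frames of samples) {
    const key = frames.join(';');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Sort by stack string so identical prefixes end up adjacent.
  return [...counts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([stack, n]) => `${stack} ${n}`);
}

// Hypothetical samples, each listed root-first.
const samples = [
  ['main', 'handleRequest', 'parseJson'],
  ['main', 'handleRequest', 'parseJson'],
  ['main', 'handleRequest', 'render'],
  ['main', 'gc'],
];

const folded = foldStacks(samples);
console.log(folded.join('\n'));
// main;gc 1
// main;handleRequest;parseJson 2
// main;handleRequest;render 1
```

A frame's box width corresponds to the number of samples in which it appears, so `handleRequest` (3 of 4 samples) would span three quarters of the graph here.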
45. - Chafin, R. "Pioneer F & G Telemetry and Command Processor Core Dump Program." JPL Technical Report XVI, no. 32-1526 (1971): 174.
“The method described in this article was designed to provide a core dump… with a minimal impact on the spacecraft… as the resumption of data acquisition from the spacecraft is the highest priority.”
48. Core Dumps — A Brief History
❖ Magnetic core memory
❖ Dump out the contents of “core” memory for debugging
❖ “Core dump” was born
❖ Initially printed on paper!
❖ Postmortem debugging was born!
50. Production Constraints
❖ Uptime is critical
❖ Not easily reproducible
❖ Can’t simulate environment
❖ Resume normal operations ASAP
53. Node Post Mortem Tooling
❖ Netflix uses Linux in Prod
❖ Linux — Work in progress
❖ https://github.com/tjfontaine/lldb-v8
❖ https://github.com/indutny/llnode
❖ Solaris — Full featured, compatible with Linux cores
❖ https://github.com/joyent/mdb_v8
55. Socks & Duct Tape: Set Up a Debug Solaris Instance
EC2: http://omnios.omniti.com/wiki.php/Installation#IntheCloud
VM: http://omnios.omniti.com/wiki.php/Installation#Quickstart
78. Memory Leak Strategy
❖ Look at objects on heap for suspicious objects
❖ Take successive core dumps and compare object counts
❖ Growing object counts are likely leaking
❖ Inspect object for more context
❖ Walk reverse references to find root object
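The "compare object counts" step of the strategy above can be sketched as a diff between two tallies. This assumes the per-constructor counts have already been extracted from two successive core dumps into plain objects; the function `growingCounts` and the numbers are hypothetical.

```javascript
// Compare two {constructorName: count} tallies and report the ones that grew,
// largest growth first.
function growingCounts(before, after) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const growth = [];
  for (const [name, count] of Object.entries(after)) {
    const delta = count - (has(before, name) ? before[name] : 0);
    if (delta > 0) growth.push({ name, delta });
  }
  return growth.sort((a, b) => b.delta - a.delta);
}

// Hypothetical tallies taken from two dumps some minutes apart.
const first = { Module: 120, Buffer: 40, Socket: 12 };
const second = { Module: 4620, Buffer: 41, Socket: 12 };

const growth = growingCounts(first, second);
console.log(growth);
```

A large, steadily increasing delta across several dumps is the signal worth inspecting further; a small one-off change, like `Buffer` here, is more likely normal churn.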
93. Spot the Leak
var cache = {};
function checkCache(someModule) {
  var mod = cache[someModule];
  if (!mod) {
    try {
      // Module could be client only, must catch
      mod = require(someModule);
      cache[someModule] = mod;
      return mod;
    } catch (e) {
      // Should cache the fact we caught an exception here
      return {};
    }
  }
  return mod;
}
94. Root Cause
❖ Node caches metadata for each module
❖ If the require process throws an exception, the module metadata is leaked (bug?)
❖ The client-side module meant we were throwing during every request, and not caching the fact we tried to require it
❖ Each request leaks 3+ module metadata objects
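One possible way to cache the failed lookup, so that `require` is attempted at most once per module name, is sketched below. This is an assumption about what a fix could look like, not necessarily the change that was shipped; `makeCheckCache`, `MISSING`, and the injectable `load` parameter are hypothetical names introduced so the behavior can be demonstrated without a real module.

```javascript
// Sentinel stored for modules that failed to load, so they are never retried.
const MISSING = Object.freeze({});

// `load` is injectable for illustration; by default it is Node's require.
function makeCheckCache(load = require) {
  const cache = new Map();
  return function checkCache(someModule) {
    if (cache.has(someModule)) return cache.get(someModule);
    let mod;
    try {
      mod = load(someModule);
    } catch (e) {
      mod = MISSING; // remember that loading this module failed
    }
    cache.set(someModule, mod);
    return mod;
  };
}

// Demonstration with a loader that always throws, like a client-only module.
let attempts = 0;
const checkCache = makeCheckCache((name) => {
  attempts += 1;
  throw new Error(`Cannot find module '${name}'`);
});

checkCache('client-only-module');
checkCache('client-only-module');
console.log(attempts); // 1
```

Using `Map.has` instead of a truthiness check means a cached failure is served on every later request rather than triggering another throwing `require`.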
95. Memory Leaks
❖ Take successive core dumps (gcore(1))
❖ Compare object counts (::findjsobjects)
❖ Growing objects are likely leaking
❖ Inspect object for more context (::jsprint)
❖ Walk reverse references to find root obj (::findjsobjects -r)
97. More State than Just Logs
❖ Detailed stack trace (::jsstack)
❖ Function args for each frame (::jsstack -vn0)
❖ Get state of any object and its provenance (::jsprint, ::jsconstructor)
❖ Get source code of any function (::jssource)
❖ Find arbitrary JS objects (::findjsobjects)
❖ Unmodified Node binary!
100. Production Debugging
❖ Runtime Performance
❖ CPU profiling/flame graphs
❖ Runtime Crashes
❖ Inspect program state with core dumps and mdb
❖ Memory leaks
❖ Analyze objects and references with core dumps and mdb
102. Epilogue — State of Tooling
❖ Join Working Group https://github.com/nodejs/post-mortem
❖ Help make mdb_v8 cross platform https://github.com/joyent/mdb_v8
❖ Contribute to https://github.com/tjfontaine/lldb-v8 and https://github.com/indutny/llnode
103. Acknowledgements
❖ mdb_v8
❖ Dave Pacheco, TJ Fontaine, Julien Gilli, Bryan Cantrill
❖ CPU Profiling/Flamegraphs
❖ Brendan Gregg, Google v8 team, Ali Ijaz Sheikh
❖ Linux Perf
❖ Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, Peter Zijlstra
❖ lldb-v8
❖ TJ Fontaine
❖ llnode
❖ Fedor Indutny
107. Citations
❖ Slides 29-32 used with permission from “Java Mixed-Mode Flame Graphs”, Brendan Gregg, Oct 2015
❖ Slide 26 used with permission from http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html