Node Interactive Debugging Node.js In Production

Debugging Node.js in Production
Yunong Xiao
@yunongx
Software Engineer
Node Platform

Node.js @ Netflix
❖ 65+ Million Subscribers
❖ Website (netflix.com)
❖ Dynamic asset packager
❖ PaaS on Node
❖ Internal Services

“Let's work the problem, people. Let's not make
things any worse by guessing.”
- Gene Kranz, Flight Director, Apollo 13

Apply the Scientific Method
1. Construct a Hypothesis
2. Collect data
3. Analyze data and draw a conclusion
4. Repeat

Production Crisis
❖ Runtime Performance
❖ Runtime Crashes
❖ Memory Leaks

Gather Request Data
http://restify.com
http://github.com/restify/node-restify
Observable REST Framework

to the Rescue
[2014-12-09T14:07:26.293Z] INFO: shakti/restify-audit/20067: handled:
200, latency=1402
(req_id=b3fa3820-7fac-11e4-8908-a5c7b70d676f, latency=1435)
GET / HTTP/1.1
host: www.netflix.com
--
HTTP/1.1 200 OK
x-netflix.client.instance: i-057e47ef
x-frame-options: DENY
content-type: text/html
--
req.timers: {
"parseBody": 700123,
"apiRpc": 701911,
"render": 400031
}
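A minimal sketch of how per-handler timings like the req.timers entries above could be collected. The helper name timeHandler and the choice of microseconds as the unit are assumptions made for illustration; this is not restify's implementation.

```javascript
// Hypothetical helper (not restify's code): wraps a synchronous handler
// and records how long it ran, in microseconds, under the given name.
function timeHandler(timers, name, handler) {
    return function timedHandler() {
        const start = process.hrtime();
        try {
            return handler.apply(this, arguments);
        } finally {
            const elapsed = process.hrtime(start); // [seconds, nanoseconds]
            timers[name] = Math.round(elapsed[0] * 1e6 + elapsed[1] / 1e3);
        }
    };
}

const timers = {};
const render = timeHandler(timers, 'render', function render(n) {
    let total = 0;
    for (let i = 0; i < n; i++) {
        total += i;
    }
    return total;
});

const renderResult = render(1000);
console.log(renderResult, timers);
```

Logging the timers object alongside the request ID makes it possible to see which handler dominated a slow request.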

req.timers: {
"apiRpc": 301911,
"render": 400031
}
On CPU

CPU is Critical
❖ Node is essentially “single threaded”
❖ Cascading effect on ALL requests in process

req.timers: {
"apiRpc": 301911,
"render": 400031
}
Can’t process ANY other request for 1.1 seconds
On CPU
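The cascading effect can be illustrated with a deliberately simplified model. It assumes that all requests arrive at time 0 and that each one runs to completion on a single thread in arrival order. The function name and the numbers are hypothetical, not measurements.

```javascript
// Simplified model (an assumption, not a measurement): requests run one
// after another on a single thread, so each waits for all earlier ones.
function completionTimes(onCpuMs) {
    let clock = 0;
    return onCpuMs.map(function finish(cost) {
        clock += cost;
        return clock;
    });
}

// One request that stays on-CPU for 1100 ms delays two cheap 5 ms requests.
const finished = completionTimes([1100, 5, 5]);
console.log(finished);
```

In this model the two cheap requests cannot finish before the expensive one has released the CPU.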

How Much Code?
$ find . -name "*.js*" | xargs cat | wc -l
6 042 301

Statistically Sample Stack Traces

Snapshot What’s Currently Executing
Stacktrace: A stack trace is a report of the active stack frames
at a certain point in time during the execution of a program.
> console.log(ex, ex.stack.split("\n"))
ReferenceError: ex is not defined
at repl:1:13
at REPLServer.defaultEval (repl.js:132:27)
at bound (domain.js:254:14)
at REPLServer.runBound [as eval] (domain.js:267:12)
at REPLServer.<anonymous> (repl.js:279:12)
at REPLServer.emit (events.js:107:17)
at REPLServer.Interface._onLine (readline.js:214:10)
at REPLServer.Interface._line (readline.js:553:8)
at REPLServer.Interface._ttyWrite (readline.js:830:14)
at ReadStream.onkeypress (readline.js:109:10)
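A small sketch of capturing the active stack frames from inside a running program. It relies on the V8-specific stack property of Error objects; the helper name currentStack is an assumption made for illustration.

```javascript
// Returns the current call stack as an array of frame strings,
// dropping the leading "Error" line that V8 puts first.
function currentStack() {
    return new Error().stack.split('\n').slice(1).map(function trim(line) {
        return line.trim();
    });
}

function inner() {
    return currentStack();
}

function outer() {
    return inner();
}

const frames = outer();
console.log(frames.slice(0, 3));
```

The innermost frame comes first, followed by its callers.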

Two Problems
1) How to sample stack traces from a running
process?
2) How to do 1) without affecting the process?

Linux Perf Events
PERF(1) perf Manual PERF(1)
NAME
perf - Performance analysis tools for Linux
SYNOPSIS
perf [--version] [--help] COMMAND [ARGS]
DESCRIPTION
Performance counters for Linux are a new kernel-based subsystem
that provide a framework for all things performance analysis.
It covers hardware level (CPU/PMU, Performance Monitoring Unit)
features and software features (software counters, tracepoints)
as well.

Sample Stack Traces w/ perf(1)
# perf record -F 99 -p `pgrep -n node` -g -- sleep 30
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.524 MB perf.data
(~22912 samples) ]

Sample Stack Trace
ab2fee v8::internal::Heap::DeoptMarkedAllocationSites() (/apps/node/bin/
a69754 v8::internal::StackGuard::HandleInterrupts() (/apps/node/bin/node)
c9f13b v8::internal::Runtime_StackGuard(int, v8::internal::Object**
3c793e3060bb (/tmp/perf-5382.map)
3c793e3060bb (/tmp/perf-5382.map) (repeated 30 more lines)
8e6b2f v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>,
int, v8::Local<v8::Value>*) (/apps/node/bin/node)
8f2281 v8::Function::Call(v8::Local<v8::Value>, int,
v8::Local<v8::Value>*) (/apps/node/bin/node)
df599a node::MakeCallback(node::Environment*, v8::Local<v8::Value>,...
df5ccb node::CheckImmediate(uv_check_s*) (/apps/node/bin/node)
fb1597 uv__run_check (/apps/node/bin/node)
fabcee uv_run (/apps/node/bin/node)
dfaa50 node::Start(int, char**) (/apps/node/bin/node)
7fcc3ef6876d __libc_start_main (/lib/x86_64-linux-gnu/libc-2.15.so)
Missing JS Frames

Why? V8 compiles JavaScript just in time (JIT),
so perf has no symbols for the JS frames

node --perf_basic_prof_only_functions
“outputs the files in a format that the existing perf tool
can consume.”

node --perf_basic_prof_only_functions
Available right now in Node v5.x
Coming soon to Node v4.3:
https://github.com/nodejs/node/pull/3609

Results
node 5382 cpu-clock:
3c793e38b0c1 LazyCompile:DELETE native runtime.js:349 (/tmp/perf-5382.map)
3c793e31981d Builtin:JSConstructStubGeneric (/tmp/perf-5382.map)
3c793ff2ca94 (/tmp/perf-5382.map)
3c793e98a10f LazyCompile:~AtlasClient._run /apps/node/webapp/node_modules/nf-atlas-client/lib/client/AtlasClient.js:85 (/tmp/perf-5382.map)
3c793f47de29 LazyCompile:*AtlasClient.timer /apps/node/webapp/node_modules/nf-atlas-client/lib/client/AtlasClient.js:70 (/tmp/perf-5382.map)
3c793e9eee38 LazyCompile:~fetchSingleGetCallback /apps/node/webapp/singletons/ShaktiFetcher.js:120 (/tmp/perf-5382.map)
3c793f6cffee LazyCompile:*Model.get /apps/node/webapp/node_modules/nf-models/lib/Model.js:90 (/tmp/perf-5382.map)
3c793ed3e2ad (/tmp/perf-5382.map)
3c7940e4357b Handler:ca (/tmp/perf-5382.map)
3c793f060e3c Function:~ /apps/node/webapp/node_modules/vasync/lib/vasync.js:134 (/tmp/perf-5382.map)
3c79404edbfa (/tmp/perf-5382.map)
3c79401fd3f7 (/tmp/perf-5382.map)
3c79400e307b LazyCompile:*fetchMulti /apps/node/webapp/singletons/ShaktiFetcher.js:50 (/tmp/perf-5382.map)
3c793fb9a59f LazyCompile:*fetch /apps/node/webapp/singletons/ShaktiFetcher.js:32 (/tmp/perf-5382.map)
3c793e896697 (/tmp/perf-5382.map)
3c7943aaabbe (/tmp/perf-5382.map)
3c793ef4c53c Function:~ /apps/node/webapp/node_modules/vasync/lib/vasync.js:245 (/tmp/perf-5382.map)
3c793eaf4f01 LazyCompile:* /apps/node/webapp/node_modules/nf-packager/lib/index.js:194 (/tmp/perf-5382.map)
3c793eab130a LazyCompile:processImmediate timers.js:352 (/tmp/perf-5382.map)
3c793e319f7d Builtin:JSEntryTrampoline (/tmp/perf-5382.map)
3c793e3189e2 Stub:JSEntryStub (/tmp/perf-5382.map)
a65baf v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*, bool) (/apps/node/bin/node)
8e6b2f v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
8f2281 v8::Function::Call(v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
df599a node::MakeCallback(node::Environment*, v8::Local<v8::Value>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
df5ccb node::CheckImmediate(uv_check_s*) (/apps/node/bin/node)
fb1597 uv__run_check (/apps/node/bin/node)
fabcee uv_run (/apps/node/bin/node)
dfaa50 node::Start(int, char**) (/apps/node/bin/node)
7fcc3ef6876d __libc_start_main (/lib/x86_64-linux-gnu/libc-2.15.so)
JS Frames
Native Frames

Problem: Too Many Traces
$ cat out.nodestacks01 | grep cpu-clock | wc -l
744
$ wc -l out.nodestacks01
58116

Flamegraph
❖ Each box represents a function in
the stack (stack frame)
❖ x-axis: percent of time on CPU
❖ y-axis: stack depth
❖ colors: random, or can be a
dimension
❖ https://github.com/brendangregg/FlameGraph
v8
libc
JS
built ins
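Before a flame graph can be drawn, identical stacks are usually merged and counted. The sketch below illustrates that folding step on hypothetical samples (frame names are invented, root frame first). The FlameGraph repository linked above ships its own scripts for real perf output; this is only an illustration of the idea.

```javascript
// Collapse sampled stacks into "folded" lines: frames joined by ";"
// (root first), followed by the number of samples with that exact stack.
function foldStacks(samples) {
    const counts = new Map();
    samples.forEach(function countSample(stackFrames) {
        const key = stackFrames.join(';');
        counts.set(key, (counts.get(key) || 0) + 1);
    });
    return Array.from(counts, function toLine(entry) {
        return entry[0] + ' ' + entry[1];
    });
}

// Hypothetical samples, root frame first.
const sampledStacks = [
    ['main', 'fetch', 'render'],
    ['main', 'fetch', 'render'],
    ['main', 'fetch'],
    ['main', 'gc']
];
const folded = foldStacks(sampledStacks);
console.log(folded.join('\n'));
```

Hundreds of raw traces shrink to one line per distinct stack, which is what makes the data small enough to visualize.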

Flame Graph Interpretation
[Figure: example flame graph with stack frames a() through i()]

Top edge shows who is running on-CPU,
and how much (width)
[Figure: example flame graph with stack frames a() through i(), top edge highlighted]

Top-down shows ancestry
e.g., from g()
[Figure: example flame graph with stack frames a() through i(), ancestry of g() highlighted]

Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
[Figure: example flame graph with stack frames a() through i(), b() and h() compared]
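The width of a box can be read as the share of samples in which that function, or any of its children, was on the stack. A minimal sketch with hypothetical samples follows; the sample data is invented for illustration and is not taken from the figure.

```javascript
// Fraction of samples in which `name` appears anywhere in the stack,
// i.e. the function itself or one of its children was on-CPU.
function inclusiveShare(samples, name) {
    if (samples.length === 0) {
        return 0;
    }
    const hits = samples.filter(function hasFrame(stackFrames) {
        return stackFrames.indexOf(name) !== -1;
    }).length;
    return hits / samples.length;
}

// Hypothetical samples, root frame first.
const exampleSamples = [
    ['a', 'b', 'c', 'd', 'e'],
    ['a', 'b', 'c', 'd', 'f', 'g'],
    ['a', 'b', 'c', 'd', 'f', 'g'],
    ['a', 'h', 'i']
];
console.log(inclusiveShare(exampleSamples, 'b'), inclusiveShare(exampleSamples, 'h'));
```

With these samples b() would be drawn three times as wide as h().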

function merge(object) {
var args = arguments,
length = 2;
...

Flame Graphs
Help you find 1 LoC out of 6 Million

Results
❖ Dramatically reduced request latency
❖ Reduced CPU utilization
❖ Increased throughput

Runtime Performance Technique
❖ Sample stack traces via perf(1)
❖ Visualize code distribution with CPU flame graphs
❖ Identify candidate code paths for performance
improvement
❖ Repeat

“The method described in this article was designed
to provide a core dump… with a minimal impact
on the spacecraft… as the resumption of data
acquisition from the spacecraft is the highest
priority.”
- Chafin, R. "Pioneer F & G Telemetry and Command Processor Core Dump
Program." JPL Technical Report XVI, no. 32-1526 (1971): 174.

Core Dumps — A Brief History
❖ Magnetic core memory
❖ Dump out the contents of
“core” memory for debugging
❖ “Core dump” was born
❖ Initially printed on paper!
❖ Postmortem debugging was
born!

Production Constraints
❖ Uptime is critical
❖ Not easily reproducible
❖ Can’t simulate environment
❖ Resume normal operations ASAP

Postmortem Debugging
Production: Take core dump -> Restart app -> Continue serving traffic
Engineer: Load core dump elsewhere -> Debug -> Fix

Configure Node to Dump Core on Error
$ node --abort_on_uncaught_exception throw.js
Uncaught Error
FROM
Object.<anonymous> (/Users/yunong/throw.js:1:63)
Module._compile (module.js:435:26)
Object.Module._extensions..js (module.js:442:10)
Module.load (module.js:356:32)
Function.Module._load (module.js:311:12)
Function.Module.runMain (module.js:467:10)
startup (node.js:134:18)
node.js:961:3
[1] 4131 illegal hardware instruction (core dumped) node --abort_on_uncaught_exception throw.js

Node Post Mortem Tooling
❖ Netflix uses Linux in Prod
❖ Linux — Work in progress
❖ https://github.com/tjfontaine/lldb-v8
❖ https://github.com/indutny/llnode
❖ Solaris — Full featured, compatible with Linux cores
❖ https://github.com/joyent/mdb_v8

Socks & Duct Tape: Set Up a Debug Solaris Instance
EC2: http://omnios.omniti.com/wiki.php/Installation#IntheCloud
VM: http://omnios.omniti.com/wiki.php/Installation#Quickstart

Post Mortem Methodology
❖ Where: Inspect stack trace
❖ Why: Inspect heap and stack variable state

mdb(1) JS commands
❖ ::help <cmd>
❖ ::jsstack
❖ ::jsprint
❖ ::jssource
❖ ::jsconstructor
❖ ::findjsobjects
❖ ::jsfunctions

Load the Core Dump
# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.7186
> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled
Annotations: Linux node binary; core dump; load mdb_v8 module

::jsstack
> ::jsstack
js: test
js: storeHeader
js: <anonymous> (as OutgoingMessage._storeHeader)
js: <anonymous> (as ServerResponse.writeHead)
js: restifyWriteHead
js: _cb
js: send
js: <anonymous> (as <anon>)
js: <anonymous> (as ReactRenderer._renderLayout)
js: <anonymous> (as dispatchHandler)
js: runHooks
js: runTransitionToHooks
js: <anonymous> (as assign.to)
js: runHooks
js: runTransitionFromHooks
js: <anonymous> (as assign.from)
js: <anonymous> (as React.createClass.statics.dispatch)
native: _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0...
native: v8::internal::Execution::Call+0xc8
native: v8::internal::Runtime_Apply+0x1ce
frame type
func name

Always name your functions!
var foo = function foo() {};
Foo.prototype.bar = function bar() {};
foo(function bar() {});
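A quick way to see why naming matters: capture a stack inside a named callback and inside an anonymous one, then compare the innermost frame. The names captureStack and namedCallback are hypothetical; the stack property is V8-specific.

```javascript
// Calls `cb` and returns whatever it returns (here: a stack string).
function captureStack(cb) {
    return cb();
}

const namedStack = captureStack(function namedCallback() {
    return new Error().stack;
});
const anonymousStack = captureStack(function () {
    return new Error().stack;
});

// The second line of each stack is the innermost frame.
console.log(namedStack.split('\n')[1].trim());
console.log(anonymousStack.split('\n')[1].trim());
```

The named callback's frame carries its name, while the anonymous one can only be identified by file and position.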

::jsstack -v Frame Source
> ::jsstack -v
js: storeHeader
file: http.js
posn: position 18774
this: 2ad561306c91 (<unknown>)
arg1: 3bd67e0669b9 (JSObject: ServerResponse)
arg2: 3dfe966ae299 (JSObject: Object)
arg3: 34d5391d8859 (SeqAsciiString)
652
653 function storeHeader(self, state, field, value) {
654 // Protect against response splitting. The if statement is there to
655 // minimize the performance impact in the common case.
656 if (/[\r\n]/.test(value))
657 value = value.replace(/[\r\n]+[ \t]*/g, '');
658
659 state.messageHeader += field + ': ' + value + CRLF;
660
661 if (connectionExpression.test(field)) {
662 state.sentConnectionHeader = true;
663 if (closeExpression.test(value)) {
664 self._last = true;
665 } else {
666 self.shouldKeepAlive = true;
667 }
668

::jsstack -vn0 Frame and Function Args
> ::jsstack -vn0
js: test
file: native regexp.js
posn: position 2677
this: 2421205bd4d9 (JSRegExp)
js: storeHeader
file: http.js
file: http.js
this: 3bd67e0669b9 (JSObject: ServerResponse)
arg1: 3dfe966ae271 (ConsString)
arg2: 3dfe966add99 (JSObject: Object)
file: /apps/node/webapp/node_modules/restify/lib/response.js
posn: position 6964
(1 internal frame elided)
js: _cb
Func Name
JS File
Line #
Func Args

::jsstack Function Args
> ::jsstack -vn0
js: test
file: native regexp.js
posn: position 2677
this: 2421205bd4d9 (JSRegExp)
js: storeHeader
file: http.js
file: http.js
arg1: 3dfe966ae271 (ConsString)
arg2: 3dfe966add99 (JSObject: Object)
file: /apps/node/webapp/node_modules/restify/lib/response.js
posn: position 6964
(1 internal frame elided)
js: _cb
Memory Address of Var Var Type

::jsprint Print JS Objects
> 3bd67e0669b9::jsprint
{
"_time": 1437690472539,
"_headers": {
"content-type": "text/html",
"req_id": "5b7f18f2-7f12-4c68-b07f-3cd75698ba65",
"set-cookie": "CENSORED; Domain=.netflix.com; Expires=Fri, 24 Jul 2015 10:27:52 GMT",
"x-frame-options": "DENY",
"x-ua-compatible": "IE=edge",
"x-netflix.client.instance": "i-c420596c",
},
"output": [],
"_last": false,
"_hangupClose": false,
"_hasBody": true,
"socket": {
"_connecting": false,
"_handle": [...],
"_readableState": [...],
"readable": true,
"domain": null,
"_events": [...],
"_maxListeners": 10,
"_writableState": [...],
"writable": true,
"allowHalfOpen": true,
Actual JS Object Instance

::jsconstructor Show Object Constructor
> 3bd67e0669b9::jsconstructor -v
ServerResponse (JSFunction: 2421205bced9)

::jssource Print f() Source
> 2421205bced9::jssource
file: http.js
1066 function ServerResponse(req) {
1067 OutgoingMessage.call(this);
1068
1069 if (req.method === 'HEAD') this._hasBody = false;
1070
1071 this.sendDate = true;
1072
1073 if (req.httpVersionMajor < 1 || req.httpVersionMinor < 1) {
1074 this.useChunkedEncodingByDefault = chunkExpression.test(req.headers.te);
1075 this.shouldKeepAlive = false;
1076 }
1077 }
1078 util.inherits(ServerResponse, OutgoingMessage);

Core Dump === Complete Process State

gcore(1) GNU Tools gcore(1)
NAME
gcore - Generate a core file for a running process
SYNOPSIS
gcore [-o filename] pid

Take a Core Dump!
root@demo:~# gcore `pgrep node`
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7facaeffd700 (LWP 5650)]
[New Thread 0x7facaf7fe700 (LWP 5649)]
[New Thread 0x7facaffff700 (LWP 5648)]
[New Thread 0x7facbc967700 (LWP 5647)]
[New Thread 0x7facbd168700 (LWP 5617)]
[New Thread 0x7facbd969700 (LWP 5616)]
[New Thread 0x7facbe16a700 (LWP 5615)]
[New Thread 0x7facbe96b700 (LWP 5614)]
0x00007facbea5b5a9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
Saved corefile core.5602

::findjsobjects
NAME
findjsobjects - find JavaScript objects
SYNOPSIS
[ addr ] ::findjsobjects [-vb] [-r | -c cons | -p prop]

::findjsobjects Find ALL JS Objects on Heap
> ::findjsobjects
OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROPS
...
3dfe97453121 18 6721 Array
157a020e01 1304 101 <anonymous> (as Constructor): ...
8f1a53211 13879 12 ReactDOMComponent: _tag, tagName, props, ...
8f1a05691 85776 2 Array
3dfe97451a99 36 5589 Array
23e5d7d44351 1 218020 Object: .2f5hpw2hgjk.1.0.3, ...
8f1a05f31 40533 6 <anonymous> (as ReactElement): type, ...
8f1a04da1 252133 1 Array
8f1a04dc1 125869 7 Array
8f1a04f01 114914 8 Array
8f1a04d39 230924 7 Module: id, exports, parent, filename, ...

Memory Leak Strategy
❖ Look at objects on heap for suspicious objects
❖ Take successive core dumps and compare object counts
❖ Growing object counts are likely leaking
❖ Inspect object for more context
❖ Walk reverse references to ﬁnd root object

Look at Object Delta Between
Successive Core Dumps

Uptime = 45 mins
> ::findjsobjects
...
8f1a04d39 230924 7 Module: id, exports, parent, filename, ...

Uptime = 90 mins
> ::findjsobjects
...

Representative Object
> ::findjsobjects
...
Representative Object, 1 of 323454

Look Closer
> 8f1a04d39::jsprint
{
"id": "/apps/node/webapp/ui/js/pages/akiraClient.js",
"exports": {},
"parent": {
"id": "/apps/node/webapp/middleware/autoClientStrings.js",
"exports": function autoExposeClientStrings,
"parent": [...],
"filename": "/apps/node/webapp/middleware/autoClientStrings.js",
"loaded": true,
"children": [...],
"paths": [...],
},
"filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",

Use ::findjsobjects to Find All “Module” Objects
> 8f1a04d39::findjsobjects
8f1a04d39
3fd996bffb39
3fd996bfcff1
3fd996bfbac1
3fd996bf8a19
3fd996bf7949
3fd996bf3ce9
3fd996bf0f19
3fd996bead71
3fd996bea821
3fd996bea001
3fd996be92b1
3fd996be73d1
3fd996be58d1
3fd996bd88b1
3fd996bcb459
3fd996bcaa41
3fd996bc7009

Custom Querying With Pipes and Unix Tools
8f1a04d39::findjsobjects | ::jsprint ! grep filename | sort | uniq -c

Results
...
1 "filename": "/apps/node/webapp/ui/js/akira/components/messaging/paymentHold.js",
2 "filename": "/apps/node/webapp/ui/js/common/commonCore.js",
playPrediction/playPrediction.js",
presentationTracking/presentationTracking.js",
111061 "filename": "/apps/node/webapp/ui/js/common/playPrediction/playPrediction.js",
7103 "filename": "/apps/node/webapp/ui/js/pages/reactClientRender.js",
111061 "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",
118257 "filename": "/apps/node/webapp/middleware/autoClientStrings.js",
... Client Side Modules

What’s holding on to these
modules?

Walk Reverse Refs with ::findjsobjects -r
> 8f1a04d39::findjsobjects -r
8f1a04d39 referred to by 14fd6c5b13c1.parent

Root Object
> 1f313791bb41::jsprint
[
{
"exports": [...],
"parent": [...],
"loaded": false,
"children": [...],
"paths": [...],
},
{
"exports": [...],
"parent": [...],
"loaded": false,
"children": [...],
"paths": [...],
},
{
"exports": [...],
"parent": [...],

Spot the Leak
var cache = {};
function checkCache(someModule) {
    var mod = cache[someModule];
    if (!mod) {
        try {
            mod = require(someModule);
            cache[someModule] = mod;
            return mod;
        } catch (e) {
            return {};
        }
    }
    return mod;
}
Module could be client only, must catch
Should cache the fact we caught an exception here

Root Cause
❖ Node caches metadata for each module
❖ If require process throws an exception, the module
metadata is leaked (bug?)
❖ Client side module meant we were throwing during
every request, and not caching the fact we tried to
require it
❖ Each request leaks 3+ module metadata objects

Memory Leaks
❖ Take successive core dumps (gcore(1))
❖ Compare object counts (::findjsobjects)
❖ Growing objects are likely leaking
❖ Inspect object for more context (::jsprint)
❖ Walk reverse references to find root obj (::findjsobjects -r)

Post Mortem Debugging is Critical to
Large Scale Prod Node Deployments

More State than Just Logs
❖ Detailed stack trace (::jsstack)
❖ Function args for each frame (::jsstack -vn0)
❖ Get state of any object and its provenance
(::jsprint, ::jsconstructor)
❖ Get source code of any function (::jssource)
❖ Find arbitrary JS objects (::findjsobjects)
❖ Unmodified Node binary!

Production Failures are Inevitable

Production Debugging
❖ Runtime Performance
❖ CPU profiling/flame graphs
❖ Runtime Crashes
❖ Inspect program state with core dumps and mdb
❖ Memory leaks
❖ Analyze objects and references with core dumps and
mdb

Epilogue — State of Tooling
❖ Join Working Group https://github.com/nodejs/post-mortem
❖ Help make mdb_v8 cross platform https://github.com/joyent/mdb_v8
❖ Contribute to https://github.com/tjfontaine/lldb-v8
and https://github.com/indutny/llnode

Acknowledgements
❖ mdb_v8
❖ Dave Pacheco, TJ Fontaine, Julien Gilli, Bryan Cantrill
❖ CPU Profiling/Flamegraphs
❖ Brendan Gregg, Google V8 team, Ali Ijaz Sheikh
❖ Linux Perf
❖ Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, Peter Zijlstra
❖ lldb-v8
❖ TJ Fontaine
❖ llnode
❖ Fedor Indutny

THANKS
❖ Questions? We’re Hiring!
❖ yunong@netflix.com
❖ @yunongx

Citations
❖ Slides 29-32 used with permission from “Java Mixed-Mode Flame Graphs”, Brendan Gregg, Oct 2015
❖ Slide 26 used with permission from http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
