node.js in production: Reflections on three years of riding the unicorn


My presentation at #NodeSummit, December 3, 2013. Video is at

Published in: Technology

  1. node.js in production: Reflections on three years of riding the unicorn
     Bryan Cantrill, SVP, Engineering (@bcantrill)
     Tuesday, December 3, 13
  2. Production systems
     • Production systems are ones doing real work: when they misbehave, users or other systems are affected
     • Production systems value reliability, performance and ease of deployment, usually in that order
     • Contrast this with development systems, which value ease of development and speed of development, in that order
     • These values can be in tension: new languages and environments typically arise for their development values, not their production ones
     • Would node.js be any different?
  3. node.js advantages
     • In terms of production suitability, node.js had (and still has) a couple of major advantages going for it:
         • It’s built on a VM (V8) that itself was designed for performance
         • It leverages extant (Unix) abstractions
         • It’s not a new language
         • Its pure event-oriented model aligns ease of programming with scalability with respect to load
     • As the stewards of both node and SmartOS, Joyent had another advantage: we could change, improve or leverage SmartOS to accommodate node in production
  4. node.js challenges
     • But node.js also has a couple of major challenges:
         • JavaScript closures make it easy to accidentally reference memory
         • Because node.js is often used to connect backend components, failure to propagate back pressure can induce memory explosion and death
         • Single-threaded execution of JavaScript means that compute-bound code can entirely impede progress
         • High performance VM also implies inscrutable core dumps and very limited instrumentation
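The back pressure point can be made concrete with node's stream API: `write()` returns `false` once a writable's internal buffer reaches its `highWaterMark`, and a producer that ignores this keeps queueing data in memory. The following is a minimal sketch, not code from the talk; the slow sink and the helper names (`makeSlowSink`, `writeUntilFull`) are hypothetical.

```javascript
const { Writable } = require('stream');

// Hypothetical slow consumer: signals back pressure once 16 bytes are
// queued, and takes 10 ms to "process" each chunk.
function makeSlowSink() {
  return new Writable({
    highWaterMark: 16,
    write(chunk, encoding, callback) {
      setTimeout(callback, 10);
    }
  });
}

// Hand chunks to the sink until it reports that its buffer is full.
// Returns how many chunks were written; the caller should wait for
// 'drain' before writing any more.
function writeUntilFull(sink, chunks) {
  let written = 0;
  for (const chunk of chunks) {
    written++;
    if (!sink.write(chunk)) {
      break; // buffer is at or past highWaterMark: stop producing
    }
  }
  return written;
}

const sink = makeSlowSink();
const chunks = [1, 2, 3, 4, 5].map(() => Buffer.alloc(10));
let offset = 0;

// Write what the sink will take, then resume only when it has drained.
function resume() {
  const n = writeUntilFull(sink, chunks.slice(offset));
  offset += n;
  console.log('wrote', n, 'chunk(s); total so far:', offset);
  if (offset < chunks.length) {
    sink.once('drain', resume);
  } else {
    sink.end();
  }
}

resume();
```

A producer that called `sink.write()` in a plain loop would still complete, but every chunk beyond the high-water mark would sit in the process heap until the sink caught up, which is the memory explosion the slide warns about.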
  5. August 2010: DTrace in node.js
     • Added simple user-level statically defined tracing (USDT) probes for node.js on platforms that support DTrace (e.g., Mac OS X, SmartOS)
     • Probes were around connection establishment, serving HTTP requests, etc.
     • Allowed questions to be dynamically asked of running, production node.js servers, e.g.:
         dtrace -n 'node*:::http-server-request{ printf("%s of %s from %s\n", args[0]->method, args[0]->url, args[1]->remoteAddress)}'
         dtrace -n http-server-request'{ @[args[1]->remoteAddress] = count()}'
         dtrace -n gc-start'{self->ts = timestamp}' -n gc-done'/self->ts/{@ = quantize(timestamp - self->ts)}'
  6. August 2010: Deploying 0.2.x
     • In August 2010, we deployed our first node.js-based service into production: a Node Knockout leaderboard that used node.js DTrace probes to geolocate connections to contestants in real time
     • Results were promising: it was surprisingly easy to develop and deploy a node.js-based service, and the service consumed very little CPU
     • Watching the Node Knockout contestants in production revealed they were all light on CPU
     • But there was a storm cloud...
  7. August 2010: Deploying 0.2.x, cont.
     • We had a memory leak that resulted in heap exhaustion after several hours under heavy load
     • Our service was stateless and load balanced for HA, so this was more disconcerting than debilitating...
     • ...but we also had quite a few contestants that would run their RSS up and crash; there was clearly a larger issue
  8. February 2011: 0.4.0
     • In February 2011, we deployed our first major node.js-based service (on 0.4.0)
     • Service was able to be built remarkably quickly, but with some pain points around Connect
     • Despite being potentially a compute-bound service, CPU consumption was (again) a non-issue
     • And with an updated node (and many fixed node leaks), memory consumption wasn’t necessarily as acute...
     • …but we hit our first “spinning black hole” problem
  9. January 2011: node-dtrace-provider
     • Our DTrace probes in node were proving to be too low-level for higher-level services: we needed to allow USDT probes to be expressed in JavaScript
     • Fortunately, DTrace community member Chris Andrews extended his libusdt to node.js, allowing statically defined probes in JavaScript, e.g.:
         var d = require('dtrace-provider');
         var dtp = d.createDTraceProvider('foo');
         var probe = dtp.addProbe('foo-start');
         // The callback runs only when the probe is enabled, so building
         // the arguments costs nothing otherwise.
         probe.fire(function () { return ([ { bar: 123, baz: 'bar' } ]); });
  10. April 2011: Restify
     • Based on our experiences with Connect/Express, we wanted to build a node module that was purpose-built to implement HTTP-based API endpoints
     • Based on Chris Andrews’ work, we wanted to have first-class support for DTrace
     • Joyent’s Mark Cavage developed node-restify, which quickly became the foundation for all of our services
     • Built-in DTrace support allows full observability into per-route/per-handler latency, a capability that we could not live without at this point
  11. November 2011: MDB support for V8
     • In mid-2011, Joyent’s Dave Pacheco dared to dream the impossible dream: full postmortem support for V8 for MDB, the debugger native to SmartOS
     • Several unspeakable layer violations later, mdb_v8 brought postmortem debugging to node.js
         • ::jsstack prints full stack including both native C++ frames and JavaScript frames
         • ::jsprint prints JavaScript objects from the dump
     • Thanks to mdb_v8, we were able to go back to a core dump from that infinite loop in our service deployed several months earlier, and nail it
  12. December 2011: DTrace ustack helper
     • mdb_v8 was actually a way station to an even bolder dream: a DTrace ustack helper for node.js
     • A ustack helper is a bit of code that accompanies a binary and assists DTrace in probe context to resolve stack frames to their higher-level names
     • Once completed, allows user-level stack traces to be associated with in-kernel events, like profiling events
     • Can use the DTrace profile provider to determine how a node.js program is consuming CPU via stack sampling
  13. December 2011: Flame graphs
     • Poring over stack traces can make hot functions difficult to visualize
     • Joyent’s Brendan Gregg developed flame graphs, which allow us to easily visualize thousands of sampled stacks
  14. January 2012: Bunyan
     • Logging was becoming more and more of a problem for us, especially as we were developing distributed systems in node.js
     • Joyent’s Trent Mick developed node-bunyan, a simple and fast JSON logging library for node.js
     • Provides standardized, JSON, line-based log output that can be easily processed with JSON tools, e.g.:
         {"name":"moray","hostname":"d1cfb6c7-c975-4ed8-a689-fb18f94b6bfc","pid":8393,"component":"manatee","path":"/manatee/sdc/election","level":20,"db":{"available":2,"max":15,"size":2,"waiting":0},"options":{"async":false,"read":true},"msg":"pg: entered","time":"2013-12-03T02:54:24.565Z","v":0}
     • Also includes command line tool, bunyan, for displaying Bunyan logs
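Because each Bunyan record is one JSON object per line, ordinary JSON tooling is enough to process a log. As a rough sketch (not part of node-bunyan itself), the snippet below parses line-delimited records and filters them by level and component using only node's built-ins. The numeric levels are Bunyan's (trace=10 through fatal=60); the helper names and the second, error-level sample record are hypothetical.

```javascript
// Bunyan's numeric log levels.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

// Parse newline-delimited JSON log text, skipping blank or non-JSON lines.
function parseLogLines(text) {
  const records = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      // Not a JSON record (for example, stray stderr output); ignore it.
    }
  }
  return records;
}

// Keep records at or above a named level, optionally for one component.
function filterRecords(records, minLevel, component) {
  return records.filter((rec) =>
    rec.level >= LEVELS[minLevel] &&
    (component === undefined || rec.component === component));
}

// The first record is abbreviated from the slide; the second is made up
// for illustration.
const sample = [
  '{"name":"moray","pid":8393,"component":"manatee","level":20,"msg":"pg: entered","time":"2013-12-03T02:54:24.565Z","v":0}',
  '{"name":"moray","pid":8393,"component":"manatee","level":50,"msg":"hypothetical failure","time":"2013-12-03T02:54:25.000Z","v":0}',
  'not json'
].join('\n');

const errors = filterRecords(parseLogLines(sample), 'error', 'manatee');
console.log(errors.map((rec) => rec.time + ' ' + rec.msg));
```

In practice the bunyan command line tool covers this kind of filtering; the point of the sketch is only that the format needs no special parser.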
  15. February 2012: npm shrinkwrap
     • npm allows for fine-grained semver control over package dependencies, but we found that nested dependencies could result in non-replicable installs
     • “npm shrinkwrap” generates a file that shrinkwraps all nested dependencies into npm-shrinkwrap.json, thereby locking down all nested versions
     • Guarantees that all installs will have same semver versions of dependencies
     • Doesn’t necessarily guarantee identical installs, however; for this, one needs private npm repositories
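To see what "locking down all nested versions" means in practice, it can help to flatten a shrinkwrap tree into one line per pinned dependency, so that two builds can be compared with a plain diff. The sketch below assumes the nested `dependencies`/`version` layout used by npm-shrinkwrap.json at the time; the helper name `listPinned` and every package name and version in the sample are made up.

```javascript
// Walk an npm-shrinkwrap.json-style tree and list every pinned dependency
// as "parent > child@version".
function listPinned(node, trail) {
  const lines = [];
  const deps = node.dependencies || {};
  for (const name of Object.keys(deps).sort()) {
    const path = trail.concat(name);
    lines.push(path.join(' > ') + '@' + deps[name].version);
    // Recurse so that nested (transitive) dependencies are listed too.
    lines.push(...listPinned(deps[name], path));
  }
  return lines;
}

// Hypothetical shrinkwrap contents, for illustration only.
const shrinkwrap = {
  name: 'example-service',
  version: '1.0.0',
  dependencies: {
    'lib-a': {
      version: '2.1.0',
      dependencies: { 'lib-c': { version: '0.3.2' } }
    },
    'lib-b': { version: '1.4.7' }
  }
};

console.log(listPinned(shrinkwrap, []).join('\n'));
```

Without the shrinkwrap file, the version of the nested `lib-c` would be resolved at install time from whatever range `lib-a` declares, which is how two installs of the same top-level package could differ.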
  16. April 2012: node-vasync
     • There are a number of modules that deal with some of the mechanics of asynchronous control flow…
     • But we found that libraries that handle …
     • We found we needed one that emphasized debugging, and in particular, …
     • node-vasync captures a number of popular flow patterns and allows state to be inspected via MDB
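The idea of control flow whose state can be inspected is easier to see in code. The sketch below is an illustration of the approach only, not node-vasync's actual API or implementation: a parallel helper that returns a plain status object immediately, so that the progress of each operation lives in an ordinary object (findable in a debugger or core dump) rather than being hidden in closure variables. All names here are hypothetical.

```javascript
// Run callback-style functions in parallel and return a status object
// right away, so that in-flight state can be inspected at any time.
function parallelWithState(funcs, done) {
  const state = {
    operations: funcs.map((func) => ({
      funcname: func.name || '(anon)',
      status: 'pending',
      err: undefined,
      result: undefined
    })),
    ndone: 0,
    nerrors: 0
  };

  funcs.forEach((func, i) => {
    func((err, result) => {
      const op = state.operations[i];
      if (op.status !== 'pending') return; // ignore duplicate callbacks
      op.status = err ? 'fail' : 'ok';
      op.err = err;
      op.result = result;
      if (err) state.nerrors++;
      if (++state.ndone === funcs.length) {
        done(state.nerrors > 0 ? new Error(state.nerrors + ' operation(s) failed') : null, state);
      }
    });
  });

  return state;
}

// Two hypothetical operations: one completes, one never calls back
// (simulating a hung request).
function loadConfig(callback) { callback(null, { port: 8080 }); }
function waitForever(callback) { /* never calls back */ }

const state = parallelWithState([loadConfig, waitForever], () => {});
console.log(state.operations.map((op) => op.funcname + ': ' + op.status));
```

With a hung operation, the status object shows at a glance which function never completed, which is the question one usually needs answered when a service stops making progress.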
  17. May 2012: ::findjsobjects
     • Building on Dave Pacheco’s mdb_v8, we implemented a debugger command that iterates over all of memory in a core dump, looking for JavaScript objects
     • Entirely brute force, but allows one to take a swing at a nasty node.js issue: semantic memory leaks
         > ::findjsobjects
         OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
         95709ac1      195      3 Object: socket, type, handle
         957093f9       66      9 Object: uid, windowsVerbatimArguments, stdio, …
         95f13181      130      5 <anonymous> (as exports.StringDecoder): …
         8432ff55      222      3 Buffer: length, offset, parent
         843304dd       91      9 Object: refreservation, creation, name, type, …
         8432cc55       99      9 Object: time, msg, level, hostname, pid, action, …
         95f08545       66     14 ChildProcess: _closesNeeded, stdio, …
         8432f2e1      546      2 Array
         9570cafd       47     24 Object: <sliced string>, <sliced string>, …
         8432be95      415      3 Array
         8432fb09       67     19 Socket: errorEmitted, _bytesDispatched, …
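A semantic memory leak is one where the objects are still reachable, so the garbage collector is behaving correctly, but the program no longer needs them. As a hypothetical illustration (the class and method names are not from the talk), the sketch below shows a request tracker that marks entries as finished without ever removing them, next to the corrected version.

```javascript
// Hypothetical request tracker: each request is recorded so that it can
// be inspected while in flight.
class RequestTracker {
  constructor() {
    this.inflight = new Map();
  }

  start(id, payload) {
    this.inflight.set(id, { id, payload, startedAt: Date.now() });
  }

  // Leaky version: marks the entry done but never removes it, so every
  // payload stays reachable from the Map and can never be collected.
  finishLeaky(id) {
    this.inflight.get(id).done = true;
  }

  // Fixed version: drop the reference once the request completes.
  finish(id) {
    this.inflight.delete(id);
  }
}

// Handle `count` requests and report how many entries remain retained.
function simulate(finishMethod, count) {
  const tracker = new RequestTracker();
  for (let i = 0; i < count; i++) {
    tracker.start(i, Buffer.alloc(1024));
    tracker[finishMethod](i);
  }
  return tracker.inflight.size;
}

console.log('retained (leaky):', simulate('finishLeaky', 1000));
console.log('retained (fixed):', simulate('finish', 1000));
```

In a core dump of the leaky version, a census like the one above would show an unexpectedly large number of objects sharing the same property set (`id`, `payload`, `startedAt`, `done`), which is the kind of signal that points to the retaining structure.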
  18. May 2012: ::findjsobjects -p
     • Searching by property name allows one to find particular objects in the JavaScript heap, e.g.:
         > ::findjsobjects -p ip4addr | ::findjsobjects | ::jsprint -a
         8432b109: {
             ip4addr: 9aee115d: "",
             VLAN: 9aee1199: "0",
             Host Interface: 9aee1185: "e1000g0",
             Link Status: 9aee1175: "up",
             MAC Address: 9aee113d: "02:08:20:47:93:82",
         }
         …
     • While designed for postmortem debugging, this allows mdb_v8 to be used for in situ debugging in development
     • Also guides one to a best practice: towards unique property names (which we have historically done in the operating system via structure prefixing)
  19. July 2012: node-fast
     • While HTTP makes it very easy to put together a distributed system, parsing and connection management can become prohibitively expensive
     • In building Manta, we found that we needed something lighter/faster; Joyent’s Mark Cavage built node-fast
     • Only what you need: fully async/duplex/persistent connections, simple on-wire protocol (JSON), etc.
     • None of what you don’t want: no IDL madness, no object model, no binary translation madness, etc.
     • Deliberately light and limited: HTTP is still the right answer until it isn’t
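To give a feel for how little a "simple on-wire protocol (JSON)" over a persistent connection needs, here is a minimal framing sketch. It is explicitly not node-fast's wire format: it assumes a hypothetical scheme of a 4-byte big-endian length followed by a UTF-8 JSON body, and the message fields (`msgid`, `method`, `args`) are made up for illustration.

```javascript
// Encode one message as a 4-byte big-endian length plus UTF-8 JSON.
function encodeFrame(message) {
  const body = Buffer.from(JSON.stringify(message), 'utf8');
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Incremental decoder: feed it arbitrary chunks from a socket and it
// returns every complete message, keeping any partial frame for later.
class FrameDecoder {
  constructor() {
    this.pending = Buffer.alloc(0);
  }

  push(chunk) {
    this.pending = Buffer.concat([this.pending, chunk]);
    const messages = [];
    while (this.pending.length >= 4) {
      const length = this.pending.readUInt32BE(0);
      if (this.pending.length < 4 + length) break; // frame not complete yet
      const body = this.pending.subarray(4, 4 + length);
      messages.push(JSON.parse(body.toString('utf8')));
      this.pending = this.pending.subarray(4 + length);
    }
    return messages;
  }
}

// Two hypothetical requests sharing one connection; a message id lets
// replies be matched to requests even if they complete out of order.
const wire = Buffer.concat([
  encodeFrame({ msgid: 1, method: 'getObject', args: ['a'] }),
  encodeFrame({ msgid: 2, method: 'getObject', args: ['b'] })
]);

const decoder = new FrameDecoder();
const first = decoder.push(wire.subarray(0, 10)); // partial first frame
const rest = decoder.push(wire.subarray(10));     // remainder of both frames
console.log(first.length, rest.map((m) => m.msgid));
```

Compared with HTTP, there are no headers to parse and no per-request connection setup, which is the trade the slide describes: less generality in exchange for lower overhead.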
  20. October 2012: Bunyan + DTrace
     • With all of our services using Bunyan, we could enable dynamic logging by adding DTrace USDT probes
     • Can use the raw DTrace probes:
         # dtrace -qn log-debug'{printf("%s\n", copyinstr(arg0))}' -x strsize=8k
         {"name":"wf-moray-backend","hostname":"414ffb35-adee-47b7-bdf4-d21cb039386c","pid":10952,"component":"MorayClient","host":"","port":2020,"req_id":"bddb180f-1770-edcf-8df2-b3a81d97e9b1","level":20,"bucket":"wf_runners","key":"414ffb35-adee-47b7-bdf4-d21cb039386c","value":{"active_at":"2013-12-03T07:22:25.125Z","idle":false},"msg":"putObject: entered","time":"2013-12-03T07:22:25.135Z","v":0}
         ...
     • Added the json() subroutine to DTrace to make this easier to process
     • Can also use “bunyan -p” and avoid the lower-level DTrace details entirely
  21. May 2013: --abort-on-uncaught-exception
     • Crash dumps are great, but aborting after an uncaught exception makes it very difficult to determine the true origin of the exception
     • Dave Pacheco implemented a V8 patch to induce a process abort (and a core dump) on an uncaught exception
     • This allows us to use postmortem debugging to debug our everyday logic errors
     • Available starting in 0.10.x, and we use it wherever we have it!
  22. July 2013: Thoth
     • One of the most important systems we have built in node is Manta, our object store featuring in situ compute
     • Manta is an excellent platform for building data-based services, especially for large data objects
     • We built manta-thoth, a platform for core and crash dump analysis that allows us to debug core dumps without moving them
     • Thoth has become critically important for us to track and automatically debug production node.js services
  23. December 2013: Dump analysis on Linux
     • Postmortem debugging has been a (the) tremendous breakthrough for node.js in production…
     • ...but despite node’s postmortem support all being open source, it has been limited to SmartOS
     • Some have toyed with porting MDB to Linux; this is in principle possible, but will be rough sledding
     • Joyent’s TJ Fontaine (of node core fame) observed what we had done with dump analysis on Manta and had a simpler idea…
     • What about making Linux dumps consumable on SmartOS, and therefore Manta?
  24. December 2013: Linux support in libproc
     • Over the course of a multiday engineering hackathon, TJ and Joyent’s Max Bruning added support for Linux crash dumps in SmartOS’s libproc
     • Fortunately, because of the way the postmortem work was done by Dave Pacheco, it Just Works
     • Do this yourself:
     • For Linux users: put your Linux dumps to Manta, and you can finally debug those pesky leaks and crashes!
     • Use --abort-on-uncaught-exception and you can use Manta and postmortem debugging to debug more quotidian programming errors!
  25. Node.js in production!
     • For us at Joyent, the tooling that we have built into node.js has resulted in what we believe to be the best dynamic environment for production use
     • Yes, even when compared to much older platforms like Java and Erlang...
     • There is still work to be done, especially around add-on development (see TJ’s shim work!) and potentially better bundling of objects…
     • We will continue to emphasize production deployment and use in our stewardship of node.js!
  26. Thank you
     • @dapsays, the Patron Saint of node.js in production, for DTrace support, MDB support, node-vasync, Manta, etc.
     • @mcavage for node-restify, node-fast, Manta, etc.
     • @trentmick for node-bunyan
     • @chrisandrews for node-dtrace-provider
     • @brendangregg for flame graphs
     • @tjfontaine for bringing postmortem debugging to an entirely new audience with Linux support for libproc!