Towards 100% uptime with node


Eliminating every last bit of downtime caused by deployment and application errors takes some work. Learn how a combination of domains, sensible handling of uncaught exceptions, graceful connection termination, and process management with the cluster module and its friends can give you confidence that your application is always available.

Published in: Technology

  1. 1. Towards 100% Uptime with Node.js
  2. 2. 9M uniques / month. 75K+ users, some are paid subscribers.
  3. 3. ( We | you | users ) hate downtime.
  4. 4. Important, but out of scope: Redundant infrastructure. Backups. Disaster recovery.
  5. 5. In scope: Application errors. Deploys. Node.js stuff: Domains. Cluster. Express.
  6. 6. Keys to 100% uptime.
  7. 7. 1. Sensibly handle uncaught exceptions.
  8. 8. 2. Use domains to catch and contain errors.
  9. 9. 3. Manage processes with cluster.
  10. 10. 4. Gracefully terminate connections.
  11. 11. 1. Sensibly handle uncaught exceptions.
  12. 12. Uncaught exceptions happen when: An exception is thrown but not caught. An error event is emitted but nothing is listening for it.
  13. 13. From node/lib/events.js:
      EventEmitter.prototype.emit = function(type) {
        // If there is no 'error' event listener then throw.
        if (type === 'error') {
          ...
          } else if (er instanceof Error) {
            throw er; // Unhandled 'error' event
          } else {
            ...
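A minimal sketch of the second case on the previous slide (an 'error' event with nobody listening), assuming a current Node.js runtime. `emitError` is a hypothetical helper written for this illustration, not part of Node:

```javascript
const EventEmitter = require('events').EventEmitter;

// Hypothetical helper: emit 'error' on the emitter and report what happened.
function emitError(emitter) {
  try {
    emitter.emit('error', new Error('uh-oh'));
    return 'handled by listener';
  } catch (err) {
    // With no 'error' listener attached, emit() throws the error itself.
    return 'thrown: ' + err.message;
  }
}

const silent = new EventEmitter();      // nothing listens for 'error'
const listening = new EventEmitter();
listening.on('error', function (err) { /* log it somewhere */ });

const results = [emitError(silent), emitError(listening)];
console.log(results);
```

Here the throw is caught by the surrounding try/catch only because `emit` is called synchronously. When the emit happens later, from I/O or a timer, there is no such stack to catch it.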
  14. 14. An uncaught exception crashes the process.
  15. 15. If the process is a server:  x 100s??
  16. 16. It starts with...
  17. 17. Domains. 2. Use domains to catch and contain errors.
  18. 18. try/catch doesn't do async.
      try {
        var f = function() {
          throw new Error("uh-oh");
        };
        setTimeout(f, 100);
      } catch (ex) {
        console.log("try / catch won't catch", ex);
      }
  19. 19. Domains are a bit like try/catch for async.
      var d = require('domain').create();
      d.on('error', function (err) {
        console.log("domain caught", err);
      });
      var f = d.bind(function() {
        throw new Error("uh-oh");
      });
      setTimeout(f, 100);
  20. 20. The active domain is domain.active.
      var domain = require('domain');
      var d = domain.create();
      console.log(domain.active); // <-- null
      var f = d.bind(function() {
        console.log(domain.active === d); // <-- true
        console.log(process.domain === domain.active); // <-- true
        throw new Error("uh-oh");
      });
  21. 21. New EventEmitters bind to the active domain.
      EventEmitter.prototype.emit = function(type) {
        if (type === 'error') {
          if (this.domain) { // This is important!
            ...
            this.domain.emit('error', er);
          } else if ...
  22. 22. Log the error. Helpful additional fields: error.domain, error.domainEmitter, error.domainBound, error.domainThrown.
  23. 23. Then it's up to you. Ignore. Retry. Abort (e.g., return 500). Throw (becomes an unknown error).
  24. 24. Do I have to create a new domain every time I do an async operation?
  25. 25. Use middleware. More convenient.
  26. 26. In Express, this might look like:
      var domainWrapper = function(req, res, next) {
        var reqDomain = domain.create();
        reqDomain.add(req);
        reqDomain.add(res);
        reqDomain.once('error', function(err) {
          res.send(500); // or next(err);
        });
        reqDomain.run(next);
      };
      Based on
  27. 27. Domain methods. add: bind an EE to the domain. run: run a function in context of domain. bind: bind one function. intercept: like bind but handles 1st arg err. dispose: cancels IO and timers.
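As a small illustration of `intercept` from the list above, here is a sketch that calls an intercepted callback directly, once with data and once with an error. The `seen` object and the callback arguments are made up for the example, and it assumes a Node.js version that still ships the `domain` module:

```javascript
const domain = require('domain');

const d = domain.create();
const seen = { errors: [], data: [] };

d.on('error', function (err) {
  // domainThrown is false here: the error was passed to a callback, not thrown.
  seen.errors.push({ message: err.message, thrown: err.domainThrown });
});

// intercept() wraps a Node-style callback. An Error in the first argument is
// routed to the domain's 'error' handler; otherwise the callback is invoked
// with the remaining arguments (the error slot is stripped).
const onRead = d.intercept(function (data) {
  seen.data.push(data);
});

onRead(null, 'file contents');
onRead(new Error('read failed'));

console.log(seen);
```

The practical effect is that the callback body no longer needs its own `if (err)` branch, since errors end up in the same place as thrown exceptions.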
  28. 28. Domains are great until they're not.
  29. 29. node-mongodb-native does not play well with active domain.
      console.log(domain.active); // a domain
      AppModel.findOne(function(err, doc) {
        console.log(domain.active); // undefined
        next();
      });
      See
  30. 30. Fix with explicit binding.
      console.log(domain.active); // a domain
      AppModel.findOne(domain.active.bind(function(err, doc) {
        console.log(domain.active); // still a domain
        next();
      }));
  31. 31. What other operations don't play well with domain.active? Good question! Package authors could note this. If you find one, let the package author know.
  32. 32. Can 100% uptime be achieved just by using domains? No. Not if only one instance of your app is running.
  33. 33. 3. Manage processes with cluster.
  34. 34. Cluster module. Node = one thread per process. Most machines have multiple CPUs. One process per CPU = cluster.
  35. 35. master / workers 1 master process forks n workers. Master and workers communicate state via IPC. When workers want to listen to a socket, master registers them for it. Each new connection to socket is handed off to a worker. No shared application state between workers.
  36. 36. What about when a worker isn't working anymore? Some coordination is needed.
  37. 37. 1. Worker tells cluster master it's done accepting new connections. 2. Cluster master forks replacement. 3. Worker dies.
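The three steps above could be wired up roughly as follows. This is only a sketch: the `shuttingDown` message name and the `forkSupervised`, `startMaster`, and `announceShutdown` helpers are assumptions of this example, not something the cluster module or recluster defines. Nothing is forked until `startMaster()` is called.

```javascript
const cluster = require('cluster');
const os = require('os');

// Master side: fork a worker and replace it as soon as it says it is
// shutting down (or exits without saying so).
// `forkFn` is injectable so the logic can be tried without real processes.
function forkSupervised(forkFn) {
  const worker = forkFn();
  let replaced = false;
  function replaceOnce() {
    if (replaced) return; // never fork two replacements for one worker
    replaced = true;
    forkSupervised(forkFn);
  }
  worker.on('message', function (msg) {
    if (msg && msg.cmd === 'shuttingDown') replaceOnce();
  });
  worker.on('exit', replaceOnce);
  return worker;
}

// Master side: one supervised worker per CPU.
function startMaster() {
  os.cpus().forEach(function () {
    forkSupervised(function () { return cluster.fork(); });
  });
}

// Worker side: step 1, tell the master before starting to shut down.
function announceShutdown() {
  if (process.send) process.send({ cmd: 'shuttingDown' });
}
```

Forking the replacement on the announcement, rather than on the worker's exit, means capacity is restored while the old worker is still draining its connections.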
  38. 38. Another use case for cluster: Deployment. Want to replace all existing servers. Something must manage that = cluster master process.
  39. 39. Zero downtime deployment. When master starts, give it a symlink to worker code. After deploy new code, update symlink. Send signal to master: fork new workers! Master tells old workers to shut down, forks new workers from new code. Master process never stops running.
  40. 40. Signals. A way to communicate with running processes. SIGHUP: reload workers (some like SIGUSR2).
      $ kill -s HUP <pid>
      $ service <node-service-name> reload
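A rough sketch of what the master might do on SIGHUP, following the deploy steps on the previous slides. The `reloadAll` and `currentWorkers` helpers and the `shutdown` message are hypothetical, and a real master would also need workers that react to that message:

```javascript
const cluster = require('cluster');

// Fork a replacement for every old worker, then ask the old ones to stop.
// `forkFn` is injectable so the logic can be tried without real processes.
function reloadAll(oldWorkers, forkFn) {
  const fresh = oldWorkers.map(function () { return forkFn(); });
  oldWorkers.forEach(function (worker) {
    worker.send({ cmd: 'shutdown' }); // hypothetical message name
  });
  return fresh;
}

// Snapshot of the workers the master currently knows about.
function currentWorkers() {
  const workers = cluster.workers || {};
  return Object.keys(workers).map(function (id) { return workers[id]; });
}

let reloadCount = 0;
process.on('SIGHUP', function () {
  reloadCount += 1;
  reloadAll(currentWorkers(), function () { return cluster.fork(); });
});
```

With this in the master, `kill -s HUP <pid>` triggers a reload without the master process ever stopping.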
  41. 41. Process management options.
  42. 42. Forever Has been around...forever. No cluster awareness — used on a single process. Simply restarts the process when it dies. More comparable to Upstart or Monit.
  43. 43. Naught Newer. Cluster aware. Zero downtime errors and deploys. Runs as daemon. Handles log compression, rotation.
  44. 44. Recluster Newer. Cluster aware. Zero downtime errors and deploys. Does not run as daemon. Log agnostic. Simple, relatively easy to reason about.
  45. 45. We went with recluster. Happy so far.
  46. 46. I have been talking about starting / stopping workers as if it's atomic. It's not.
  47. 47. 4. Gracefully terminate connections when needed.
  48. 48. Don't call process.exit too soon! Give it a grace period to clean up.
  49. 49. Need to clean up: In-flight requests. HTTP keep-alive (open TCP) connections.
  50. 50. Revisiting our middleware from earlier:
      var domainWrapper = function(afterErrorHook) {
        return function(req, res, next) {
          var reqDomain = domain.create();
          reqDomain.add(req);
          reqDomain.add(res);
          reqDomain.once('error', function(err) {
            next(err);
            if (afterErrorHook) afterErrorHook(err); // Hook.
          });
          reqDomain.run(next);
        };
      };
  51. 51. 1. Call server.close.
      var afterErrorHook = function(err) {
        server.close(); // <-- ensure no new connections
      }
  52. 52. 2. Shut down keep-alive connections.
      var afterErrorHook = function(err) {
        app.set("isShuttingDown", true); // <-- set state
        server.close();
      }
      var shutdownMiddle = function(req, res, next) {
        if (app.get("isShuttingDown")) { // <-- check state
          req.connection.setTimeout(1); // <-- kill keep-alive
        }
        next();
      }
      Idea from
  53. 53. 3. Then call process.exit in server.close callback.
      var afterErrorHook = function(err) {
        app.set("isShuttingDown", true);
        server.close(function() {
          process.exit(1); // <-- all clear to exit
        });
      }
  54. 54. Set a timer. If timeout period expires and server is still around, call process.exit.
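One way the timer could be combined with the `server.close` callback is sketched below. `shutdownGracefully` is a hypothetical helper, and its `exit` parameter exists only so the demo can run without killing the process (it defaults to `process.exit`). The demo assumes the loopback interface is available.

```javascript
const http = require('http');

// Hypothetical helper: stop accepting connections, exit once existing ones
// have ended, and force the exit if that takes longer than graceMs.
function shutdownGracefully(server, graceMs, exit) {
  exit = exit || process.exit;
  let exited = false;
  function finish(reason) {
    if (exited) return; // whichever path gets here first wins
    exited = true;
    clearTimeout(timer);
    exit(1, reason); // process.exit ignores the extra argument
  }
  // Grace period: force the exit if connections have not drained in time.
  const timer = setTimeout(function () { finish('timeout'); }, graceMs);
  // Stop accepting new connections; the callback runs once open ones end.
  server.close(function () { finish('drained'); });
}

// Demo with an idle server and a stand-in for process.exit.
const exits = [];
const server = http.createServer(function (req, res) { res.end('ok'); });
server.listen(0, '127.0.0.1', function () {
  shutdownGracefully(server, 2000, function (code, reason) {
    exits.push({ code: code, reason: reason });
    console.log('would exit with code', code, 'because', reason);
  });
});
```

Because the demo server has no open connections, the close callback should fire well before the timer. A server with lingering keep-alive connections would instead rely on the timeout path.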
  55. 55. Summing up: Our ideal server.
  56. 56. On startup: Cluster master comes up (for example, via Upstart). Cluster master forks workers from symlink. Each worker's server starts accepting connections.
  57. 57. On deploy: Point symlink to new version. Send signal to cluster master. Master tells existing workers to stop accepting new connections. Master forks new workers from new code. Existing workers shut down gracefully.
  58. 58. On error: Server catches it via domain. Next action depends on you: retry? abort? rethrow? etc.
  59. 59. On uncaught exception: ??
      // The infamous "uncaughtException" event!
      process.on('uncaughtException', function(err) {
        // ??
      })
  60. 60. Back to where we started: 1. Sensibly handle uncaught exceptions. We have minimized these by using domains. But they can still happen.
  61. 61. Node docs say not to keep running. An unhandled exception means your application — and by extension node.js itself — is in an undefined state. Blindly resuming means anything could happen. You have been warned.
  62. 62. What to do? First, log the error so you know what happened.
  63. 63. Then, you've got to kill the process.
  64. 64. It's not so bad. We can now do so with minimal trouble.
  65. 65. On uncaught exception: Log error. Server stops accepting new connections. Worker tells cluster master it's done. Master forks a replacement worker. Worker exits gracefully when all connections are closed, or after timeout.
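The sequence above might be expressed as a handler like the one below. It is a sketch under assumptions: `createCrashHandler` and its `log`, `beginShutdown`, and `notifyMaster` dependencies are hypothetical names that the application would supply, with `beginShutdown` standing for whatever closes the server and exits after a grace period.

```javascript
// Hypothetical factory for an 'uncaughtException' handler.
// deps.log(err):        record the error
// deps.beginShutdown(): stop accepting connections, exit when drained or on timeout
// deps.notifyMaster():  tell the cluster master this worker is done
function createCrashHandler(deps) {
  let crashed = false;
  return function onUncaughtException(err) {
    if (crashed) return; // a second error during shutdown changes nothing
    crashed = true;
    deps.log(err);          // 1. log the error
    deps.beginShutdown();   // 2. server stops accepting new connections
    deps.notifyMaster();    // 3. master can fork a replacement
  };
}

// Stubbed run, recording the order of the steps.
const steps = [];
const handler = createCrashHandler({
  log: function (err) { steps.push('log: ' + err.message); },
  beginShutdown: function () { steps.push('begin shutdown'); },
  notifyMaster: function () { steps.push('notify master'); }
});
handler(new Error('boom'));
console.log(steps);
```

In a worker this would be registered with `process.on('uncaughtException', handler)`, and `notifyMaster` could be a `process.send(...)` call using a message the master understands.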
  66. 66. What about the request that killed the worker? How does the dying worker gracefully respond to it? Good question!
  67. 67. People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it... -felixge
  68. 68. This is too bad, because you always want to return a response, even on error.
  69. 69. This is Towards 100% Uptime b/c these approaches don't guarantee response for every request. But we can get very close.
  70. 70. Fortunately, given what we've seen, uncaughts shouldn't happen often. And when they do, only one connection will be left hanging.
  71. 71. Must restart cluster master when: Upgrade Node. Cluster master code changes.
  72. 72. During timeout periods, might have: More workers than CPUs. Workers running different versions (old/new). Should be brief. Probably preferable to downtime.
  73. 73. Tip: Be able to produce errors on demand on your dev and staging servers. (Disable this in production.)
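A possible shape for this, sketched as Connect-style middleware. The `errorOnDemand` name and the `/_debug/...` paths are made up for the example:

```javascript
// Hypothetical middleware factory: lets dev and staging trigger errors on
// purpose, and does nothing at all when env is 'production'.
function errorOnDemand(env) {
  return function (req, res, next) {
    if (env === 'production') return next(); // never enabled in production
    if (req.url === '/_debug/throw') {
      throw new Error('on-demand synchronous error');
    }
    if (req.url === '/_debug/throw-async') {
      // Thrown outside the request's call stack, like a real async failure.
      setTimeout(function () {
        throw new Error('on-demand asynchronous error');
      }, 0);
      return;
    }
    next();
  };
}
```

It could be mounted with something like `app.use(errorOnDemand(process.env.NODE_ENV))`, which makes it easy to watch how the domain handling, worker replacement, and graceful exit behave before a real error does it for you.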
  74. 74. Tip: Keep cluster master simple. It needs to run for a long time without being updated.
  75. 75. Things change. I've been talking about:
      {
        "node": "~0.10.20",
        "express": "~3.4.0",
        "connect": "~2.9.0",
        "mongoose": "~3.6.18",
        "recluster": "=0.3.4"
      }
  76. 76. The Future: Node 0.11 / 0.12 For example, cluster module has some changes.
  77. 77. Cluster is experimental. Domains are unstable.
  78. 78. Good reading: Node.js Best Practice Exception Handling (some answers more helpful than others); Remove uncaught exception handler?; Isaacs stands by killing on uncaught; Domains don't incur performance hits compared to try catch; Rejected PR to add domains to Mongoose, with discussion; Don't call enter / exit across async; Comparison of naught and forever; What's changing in cluster.
  79. 79. If you thought this was interesting, We're hiring.
  80. 80. Thanks! @williamjohnbert