Towards 100% Uptime with Node.js
9M uniques / month.

75K+ users, some are paid
subscribers.
( We | you | users )
hate downtime.
Important, but
out of scope:
Redundant infrastructure.
Backups.
Disaster recovery.
In scope:
Application errors.
Deploys.
Node.js stuff:
Domains.
Cluster.
Express.
Keys to 100% uptime.
1. Sensibly handle
uncaught exceptions.
2. Use domains
to catch and contain errors.
3. Manage processes
with cluster.
4. Gracefully terminate
connections.
1. Sensibly handle uncaught
exceptions.
Uncaught exceptions happen when:
An exception is thrown but not caught.
An error event is emitted but nothing is listening...
From node/lib/events.js:
EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  ...
An uncaught exception
crashes the process.
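The second case (an error event with no listener) can be reproduced with a bare EventEmitter. This is a minimal sketch; the helper and variable names are illustrative and not from the talk:

```javascript
const { EventEmitter } = require('events');

// Emit an 'error' event and report whether emit() threw.
function emitError(emitter) {
  try {
    emitter.emit('error', new Error('uh-oh'));
    return null;
  } catch (err) {
    return err;
  }
}

// No 'error' listener: emit() throws the error itself.
const thrownWithoutListener = emitError(new EventEmitter());

// With a listener, the error is delivered to it and nothing is thrown.
const handled = [];
const listening = new EventEmitter();
listening.on('error', (err) => handled.push(err.message));
const thrownWithListener = emitError(listening);

console.log(thrownWithoutListener instanceof Error, thrownWithListener, handled);
```

Without the try / catch in the helper, the first emit would surface as an uncaught exception.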
If the process is a server: 

x 100s??
It starts with...
Domains.

2. Use domains to catch and contain errors.
try / catch doesn't do async.
try {
  var f = function() {
    throw new Error("uh-oh");
  };
  setTimeout(f, 100);
} catch (ex) {
  ...
Domains are a bit like try / catch for async.
var d = require('domain').create();
d.on('error', function (er) {
  console.log...
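A runnable sketch of the same idea, assuming only Node's built-in `domain` module. The promise wrapper is an addition for demonstration, so the caught error can be inspected afterwards:

```javascript
const domain = require('domain');

const d = domain.create();

// Resolves with the first error the domain catches.
const caught = new Promise((resolve) => {
  d.once('error', resolve);
});

// The function throws later, from a timer, long after any surrounding
// try / catch would have returned. Binding it to the domain routes the
// error to the domain's 'error' handler instead.
const f = d.bind(() => {
  throw new Error('uh-oh');
});
setTimeout(f, 10);

caught.then((err) => console.log('domain caught:', err.message));
```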
The active domain is domain.active.
var domain = require('domain');
var d = domain.create();
console.log(domain.active); // <- null
var f = d.bind(...
New EventEmitters bind
to the active domain.
EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (th...
Log the error.
Helpful additional fields:
error.domain
error.domainEmitter
error.domainBound
error.domainThrown
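The implicit binding and the extra fields can be seen together in a short sketch. It assumes Node's built-in `domain` module; the `name` label on the emitter is a hypothetical field added only to make the log entry readable:

```javascript
const domain = require('domain');
const { EventEmitter } = require('events');

const d = domain.create();
const logged = [];

d.on('error', (err) => {
  // Record the extra fields alongside the message.
  logged.push({
    message: err.message,
    sameDomain: err.domain === d,
    emitterName: err.domainEmitter && err.domainEmitter.name,
    domainThrown: err.domainThrown,
  });
});

let emitter;
let activeInsideRun;
d.run(() => {
  activeInsideRun = domain.active; // the domain we are running in
  emitter = new EventEmitter();    // implicitly bound to that domain
  emitter.name = 'worker-emitter'; // hypothetical label for logging
});

// No 'error' listener on the emitter itself, so the error is routed
// to the domain the emitter was created in.
emitter.emit('error', new Error('uh-oh'));

console.log(logged);
```

Here `domainThrown` is false because the error was emitted rather than thrown.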
Then it's up to you.
Ignore.
Retry.
Abort (e.g., return 500).
Throw (becomes an unknown error).
Do I have to create a new domain
every time I do an async operation?
Use middleware.
More convenient.
In Express, this might look like:
var domainWrapper = function(req, res, next) {
  var reqDomain = domain.create();
  reqDomain.add(r...
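To try the per-request domain pattern without installing Express, the sketch below uses plain EventEmitters as stand-ins for the request and response. The stand-ins, and the choice to set `statusCode` and call `end` instead of Express's `res.send`, are assumptions for demonstration:

```javascript
const domain = require('domain');
const { EventEmitter } = require('events');

// Express-style middleware: one domain per request.
function domainWrapper(req, res, next) {
  const reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);
  reqDomain.once('error', () => {
    // Abort this request only; other requests are unaffected.
    res.statusCode = 500;
    res.end('Internal Server Error');
  });
  reqDomain.run(next);
}

// Stand-ins for a request and response, for demonstration only.
const req = new EventEmitter();
const res = Object.assign(new EventEmitter(), {
  statusCode: 200,
  body: null,
  end(body) { this.body = body; },
});

// "next" plays the rest of the middleware chain; here it fails by
// emitting an unhandled 'error' on the request.
domainWrapper(req, res, () => {
  req.emit('error', new Error('boom'));
});

console.log(res.statusCode, res.body);
```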
Domain methods.
add: bind an EE to the domain.
run: run a function in context of domain.
bind: bind one function.
int...
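Of these methods, `intercept` is the least obvious. A small sketch of how it might be used with a Node-style callback; the callback and variable names are illustrative:

```javascript
const domain = require('domain');

const d = domain.create();
const errors = [];
d.on('error', (err) => errors.push(err.message));

const results = [];
// intercept() strips the leading error argument: the wrapped function
// only sees the remaining arguments on success.
const onRead = d.intercept((data) => results.push(data));

onRead(null, 'file contents');    // success path
onRead(new Error('read failed')); // error path, goes to the domain

console.log(results, errors);
```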
Domains

are great
until they're not.
node-mongodb-native does not
play well with active domain.
console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
...
Fix with explicit binding.
console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
...
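The effect of explicit binding can be shown without a database. In the sketch below, `fakeDriver` is a hypothetical stand-in for any library that queues callbacks and later runs them from its own internals. It is not a model of how node-mongodb-native actually works:

```javascript
const domain = require('domain');

// Hypothetical stand-in for a driver that queues callbacks and runs
// them later from its own internals, outside the caller's domain.
const pending = [];
const fakeDriver = {
  findOne(callback) { pending.push(callback); },
  flush() {
    while (pending.length > 0) pending.shift()(null, { name: 'doc' });
  },
};

const d = domain.create();
const seen = {};

d.run(() => {
  // Unbound: the callback runs in whatever context the driver is in.
  fakeDriver.findOne(() => { seen.unbound = domain.active; });

  // Explicitly bound: the callback re-enters the domain first.
  fakeDriver.findOne(domain.active.bind(() => { seen.bound = domain.active; }));
});

fakeDriver.flush(); // runs outside d

console.log(Boolean(seen.unbound), seen.bound === d);
```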
What other operations don't play well
with domain.active?
Good question!
Package authors could note this.
If y...
Can 100% uptime be achieved
just by using domains?

No.

Not if only one instance of your app
is running.
3. Manage processes
with cluster.
Cluster module.
Node = one thread per process.
Most machines have multiple CPUs.
One process per CPU = cluster.
master / workers
1 master process forks n
workers.
Master and workers communicate state via IPC.
When workers want to list...
What about when a worker
isn't working anymore?
Some coordination is needed.
1. Worker tells cluster master it's done accepting new connections.
2. Cluster master forks replacement.
3. Worker dies.
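The three steps above could be wired up in the master roughly as follows. This is a sketch under stated assumptions: the `{ cmd: 'shutting-down' }` message shape is hypothetical, and `clusterApi` is a parameter only so the logic can be exercised without forking real processes:

```javascript
const cluster = require('cluster');

// Fork `count` workers. Fork a replacement as soon as a worker says it
// is shutting down, or when a worker dies without warning.
function superviseWorkers(count, clusterApi = cluster) {
  const replaced = new Set(); // ids of workers that already have a replacement

  for (let i = 0; i < count; i += 1) {
    clusterApi.fork();
  }

  // Step 1 arrives as a message from the worker; step 2 is the fork.
  clusterApi.on('message', (worker, message) => {
    if (message && message.cmd === 'shutting-down' && !replaced.has(worker.id)) {
      replaced.add(worker.id);
      clusterApi.fork();
    }
  });

  // Step 3: the worker exits. Fork only if it was not replaced already.
  clusterApi.on('exit', (worker) => {
    if (!replaced.delete(worker.id)) {
      clusterApi.fork();
    }
  });
}
```

In a real master this would be called once at startup, for example with the number of CPUs reported by `os.cpus().length`, and each worker would announce its shutdown with `process.send`.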
Another use case for cluster:

Deployment.

Want to replace all existing servers.
Something must manage that = cluster mas...
Zero downtime deployment.
When master starts, give it a symlink to worker code.
After deploy new code, update symlink.
Sen...
Signals.
A way to communicate with running processes.
SIGHUP: reload workers (some like SIGUSR2).
$ kill -s HUP <pid>
...
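On the master side, a reload handler might look like the sketch below. It assumes the cluster module's `workers` map, `fork()`, the worker `'listening'` event and `worker.disconnect()`. The function name is illustrative, and tools such as recluster implement their own, more careful version of this:

```javascript
// Replace every current worker: fork a new one, and once the new one is
// listening, ask the old one to disconnect.
function reloadWorkers(clusterApi) {
  const oldWorkers = Object.values(clusterApi.workers);
  for (const oldWorker of oldWorkers) {
    const replacement = clusterApi.fork();
    replacement.once('listening', () => oldWorker.disconnect());
  }
  return oldWorkers.length;
}

// In the master process, this could be hooked to the signal:
//   const cluster = require('cluster');
//   process.on('SIGHUP', () => reloadWorkers(cluster));
```

Waiting for `'listening'` before disconnecting the old worker means there is never a moment with no worker accepting connections.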
Process management options.
Forever
github.com/nodejitsu/forever
Has been around...forever.
No cluster awareness — used on a single process.
Simply re...
Naught
github.com/superjoe30/naught
Newer.
Cluster aware.
Zero downtime errors and deploys.
Runs as daemon.
Handles log co...
Recluster
github.com/doxout/recluster
Newer.
Cluster aware.
Zero downtime errors and deploys.
Does not run as daemon.
Log ...
We went with recluster.
Happy so far.
I have been talking about
starting / stopping workers
as if it's atomic.

It's not.
4. Gracefully terminate
connections
when needed.
Don't call process.exit too soon!
Give it a grace period to clean up.
Need to clean up:
In-flight requests.
HTTP keep-alive (open TCP) connections.
Revisiting our middleware from earlier:
var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
  ...
1. Call server.close.
var afterErrorHook = function(err) {
  server.close(); // <- ensure no new connections
}
2. Shut down keep-alive
connections.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <- set state
  ...
3. Then call process.exit
in server.close callback.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  ...
Set a timer.
If timeout period expires and server is still around, call process.exit.
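Putting the three steps and the timer together, one possible shape is sketched below. Several things here are assumptions rather than the talk's code: the shared `state` object replaces `app.set` / `app.get`, the `exit` and `timeoutMs` options exist so the sketch can be tried without ending the process, and the middleware sends a `Connection: close` header instead of setting a socket timeout:

```javascript
// Build an "after error" hook for one server.
function createShutdownHook(server, state, options = {}) {
  const exit = options.exit || ((code) => process.exit(code));
  const timeoutMs = options.timeoutMs || 10000;
  let exited = false;

  return function afterErrorHook() {
    if (state.isShuttingDown) return; // already on the way out
    state.isShuttingDown = true;      // the middleware below checks this

    const finish = () => {
      if (exited) return;
      exited = true;
      clearTimeout(timer);
      exit(1);
    };

    // Safety net: exit anyway if connections never drain.
    const timer = setTimeout(finish, timeoutMs);
    timer.unref();

    // Stop accepting new connections; exit once existing ones close.
    server.close(finish);
  };
}

// Middleware that asks clients of a shutting-down worker not to reuse
// their keep-alive connection.
function shutdownMiddleware(state) {
  return function (req, res, next) {
    if (state.isShuttingDown) {
      res.setHeader('Connection', 'close');
    }
    next();
  };
}
```

The grace period is a trade-off: too short cuts off in-flight requests, too long keeps a worker in an unknown state around.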
Summing up:

Our ideal server.
On startup:
Cluster master comes up (for example, via Upstart).
Cluster master forks workers from symlink.
Each worker's s...
On deploy:
Point symlink to new version.
Send signal to cluster master.
Master tells existing workers to stop accepting ne...
On error:
Server catches it via domain.
Next action depends on you: retry? abort? rethrow? etc.
On uncaught exception:
??
// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  ...
Back to where we started:

1. Sensibly handle uncaught
exceptions.
We have minimized these by using domains.
But they can ...
Node docs say not to keep running.

An unhandled exception means your
application — and by extension node.js
itself — is i...
What to do?
First, log the error so you know what happened.
Then, you've got to
kill the process.
It's not so bad. We can now do so
with minimal trouble.
On uncaught exception:
Log error.
Server stops accepting new connections.
Worker tells cluster master it's done.
Master fo...
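The sequence above might be packaged as a single handler. In this sketch all collaborators are passed in and their names are illustrative: `log` records the error, `startShutdown` stops the server and eventually exits, and `notifyMaster` tells the cluster master to fork a replacement:

```javascript
// Build a handler suitable for process.on('uncaughtException', ...).
function createUncaughtHandler({ log, startShutdown, notifyMaster }) {
  let handling = false;
  return function onUncaughtException(err) {
    if (handling) return; // a second failure while already shutting down
    handling = true;
    log(err);             // log first, so the cause is recorded
    startShutdown(err);   // stop accepting connections, exit when drained
    notifyMaster();       // let the master fork a replacement
  };
}
```

In a worker, the handler would be registered once with `process.on('uncaughtException', ...)`, with `notifyMaster` implemented on top of `process.send`.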
What about the request
that killed the worker?
How does the dying worker
gracefully respond to it?
Good question!
People are also under the illusion that it is
possible to trace back [an uncaught]
exception to the http request that caus...
This is too bad, because you
always want to return a response,
even on error.
This is Towards 100% Uptime b/c these approaches don't
guarantee response for every request.

But we can get very close.
Fortunately, given what we've seen,
uncaughts shouldn't happen often.
And when they do, only one
connection will be left h...
Must restart cluster master when:
Upgrade Node.
Cluster master code changes.
During timeout periods, might have:
More workers than CPUs.
Workers running different versions (old/new).
Should be brief....
Tip:

Be able to produce errors on demand
on your dev and staging servers.
(Disable this in production.)
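One way to do this is a middleware that fails on purpose when a special path is requested. The path, the `NODE_ENV` check and the injectable `fail` option are all assumptions made for this sketch:

```javascript
// Middleware factory: requests to `path` trigger a deliberate failure,
// except in production.
function errorOnDemand(options = {}) {
  const path = options.path || '/__error';
  const env = options.env || process.env.NODE_ENV;
  // Default failure: throw from a later tick, so the error escapes any
  // try / catch upstream the way a real uncaught exception would.
  const fail = options.fail || (() => {
    setImmediate(() => { throw new Error('error on demand'); });
  });

  return function (req, res, next) {
    if (env !== 'production' && req.url === path) {
      fail();
      return;
    }
    next();
  };
}
```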
Tip:

Keep cluster master simple.
It needs to run for a long time without being updated.
Things change.
I've been talking about:
{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  ...
The Future:
Node 0.11 / 0.12
For example, cluster module has some changes.
Cluster is experimental.
Domains are unstable.
Good reading:
Node.js Best Practice Exception Handling (some answers more
helpful than others)
Remove uncaught exception h...
If you thought this was interesting,

We're hiring.
careers.fluencia.com
Thanks!
@williamjohnbert
github.com/sandinmyjoints/towards-100-pct-uptime
github.com/sandinmyjoints/towards-100-pct-uptime...

Eliminating every last bit of downtime caused by deployment and application errors takes some work. Learn how a combination of domains, sensible handling of uncaught exceptions, graceful connection termination, and process management with the cluster module and its friends can give you confidence that your application is always available.

Towards 100% uptime with node

  1. 1. Towards 100% Uptime with Node.js
  2. 2. 9M uniques / month. 75K+ users, some are paid subscribers.
  3. 3. ( We | you | users ) hate downtime.
  4. 4. Important, but out of scope: Redundant infrastructure. Backups. Disaster recovery.
  5. 5. In scope: Application errors. Deploys. Node.js stuff: Domains. Cluster. Express.
  6. 6. Keys to 100% uptime.
  7. 7. 1. Sensibly handle uncaught exceptions.
  8. 8. 2. Use domains to catch and contain errors.
  9. 9. 3. Manage processes with cluster.
  10. 10. 4. Gracefully terminate connections.
  11. 11. 1. Sensibly handle uncaught exceptions.
  12. 12. Uncaught exceptions happen when: An exception is thrown but not caught. An error event is emitted but nothing is listening for it.
  13. 13. From node/lib/events.js:
        EventEmitter.prototype.emit = function(type) {
          // If there is no 'error' event listener then throw.
          if (type === 'error') {
            ...
            } else if (er instanceof Error) {
              throw er; // Unhandled 'error' event
            } else {
            ...
  14. 14. An uncaught exception crashes the process.
  15. 15. If the process is a server:  x 100s??
  16. 16. It starts with...
  17. 17. Domains. 2. Use domains to catch and contain errors.
  18. 18. try / catch doesn't do async.
        try {
          var f = function() {
            throw new Error("uh-oh");
          };
          setTimeout(f, 100);
        } catch (ex) {
          console.log("try / catch won't catch", ex);
        }
  19. 19. Domains are a bit like try / catch for async.
        var d = require('domain').create();
        d.on('error', function (er) {
          console.log("domain caught", er);
        });
        var f = d.bind(function() {
          throw new Error("uh-oh");
        });
        setTimeout(f, 100);
  20. 20. The active domain is domain.active.
        var domain = require('domain');
        var d = domain.create();
        console.log(domain.active); // <- null
        var f = d.bind(function() {
          console.log(domain.active == d); // <- true
          console.log(process.domain == domain.active); // <- true
          throw new Error("uh-oh");
        });
  21. 21. New EventEmitters bind to the active domain.
        EventEmitter.prototype.emit = function(type) {
          if (type === 'error') {
            if (this.domain) { // This is important!
              ...
              this.domain.emit('error', er);
            } else if ...
  22. 22. Log the error. Helpful additional fields: error.domain, error.domainEmitter, error.domainBound, error.domainThrown
  23. 23. Then it's up to you. Ignore. Retry. Abort (e.g., return 500). Throw (becomes an unknown error).
  24. 24. Do I have to create a new domain every time I do an async operation?
  25. 25. Use middleware. More convenient.
  26. 26. In Express, this might look like:
        var domainWrapper = function(req, res, next) {
          var reqDomain = domain.create();
          reqDomain.add(req);
          reqDomain.add(res);
          reqDomain.once('error', function(err) {
            res.send(500); // or next(err);
          });
          reqDomain.run(next);
        };
      Based on https://github.com/brianc/node-domain-middleware and https://github.com/mathrawka/express-domain-errors
  27. 27. Domain methods.
        add: bind an EE to the domain.
        run: run a function in context of domain.
        bind: bind one function.
        intercept: like bind but handles 1st arg err.
        dispose: cancels IO and timers.
  28. 28. Domains are great until they're not.
  29. 29. node-mongodb-native does not play well with active domain.
        console.log(domain.active); // a domain
        AppModel.findOne(function(err, doc) {
          console.log(domain.active); // undefined
          next();
        });
      See https://github.com/LearnBoost/mongoose/pull/1337
  30. 30. Fix with explicit binding.
        console.log(domain.active); // a domain
        AppModel.findOne(domain.active.bind(function(err, doc) {
          console.log(domain.active); // still a domain
          next();
        }));
  31. 31. What other operations don't play well with domain.active? Good question! Package authors could note this. If you find one, let package author know.
  32. 32. Can 100% uptime be achieved just by using domains? No. Not if only one instance of your app is running.
  33. 33. 3. Manage processes with cluster.
  34. 34. Cluster module. Node = one thread per process. Most machines have multiple CPUs. One process per CPU = cluster.
  35. 35. master / workers 1 master process forks n workers. Master and workers communicate state via IPC. When workers want to listen to a socket, master registers them for it. Each new connection to socket is handed off to a worker. No shared application state between workers.
  36. 36. What about when a worker isn't working anymore? Some coordination is needed.
  37. 37. 1. Worker tells cluster master it's done accepting new connections. 2. Cluster master forks replacement. 3. Worker dies.
  38. 38. Another use case for cluster: Deployment. Want to replace all existing servers. Something must manage that = cluster master process.
  39. 39. Zero downtime deployment. When master starts, give it a symlink to worker code. After deploy new code, update symlink. Send signal to master: fork new workers! Master tells old workers to shut down, forks new workers from new code. Master process never stops running.
  40. 40. Signals. A way to communicate with running processes. SIGHUP: reload workers (some like SIGUSR2).
        $ kill -s HUP <pid>
        $ service <node-service-name> reload
  41. 41. Process management options.
  42. 42. Forever github.com/nodejitsu/forever Has been around...forever. No cluster awareness — used on a single process. Simply restarts the process when it dies. More comparable to Upstart or Monit.
  43. 43. Naught github.com/superjoe30/naught Newer. Cluster aware. Zero downtime errors and deploys. Runs as daemon. Handles log compression, rotation.
  44. 44. Recluster github.com/doxout/recluster Newer. Cluster aware. Zero downtime errors and deploys. Does not run as daemon. Log agnostic. Simple, relatively easy to reason about.
  45. 45. We went with recluster. Happy so far.
  46. 46. I have been talking about starting / stopping workers as if it's atomic. It's not.
  47. 47. 4. Gracefully terminate connections when needed.
  48. 48. Don't call process.exit too soon! Give it a grace period to clean up.
  49. 49. Need to clean up: In-flight requests. HTTP keep-alive (open TCP) connections.
  50. 50. Revisiting our middleware from earlier:
        var domainWrapper = function(afterErrorHook) {
          return function(req, res, next) {
            var reqDomain = domain.create();
            reqDomain.add(req);
            reqDomain.add(res);
            reqDomain.once('error', function(err) {
              next(err);
              if (afterErrorHook) afterErrorHook(err); // Hook.
            });
            reqDomain.run(next);
          };
        };
  51. 51. 1. Call server.close.
        var afterErrorHook = function(err) {
          server.close(); // <- ensure no new connections
        }
  52. 52. 2. Shut down keep-alive connections.
        var afterErrorHook = function(err) {
          app.set("isShuttingDown", true); // <- set state
          server.close();
        }
        var shutdownMiddle = function(req, res, next) {
          if (app.get("isShuttingDown")) { // <- check state
            req.connection.setTimeout(1); // <- kill keep-alive
          }
          next();
        }
      Idea from https://github.com/mathrawka/express-graceful-exit
  53. 53. 3. Then call process.exit in server.close callback.
        var afterErrorHook = function(err) {
          app.set("isShuttingDown", true);
          server.close(function() {
            process.exit(1); // <- all clear to exit
          });
        }
  54. 54. Set a timer. If timeout period expires and server is still around, call process.exit.
  55. 55. Summing up: Our ideal server.
  56. 56. On startup: Cluster master comes up (for example, via Upstart). Cluster master forks workers from symlink. Each worker's server starts accepting connections.
  57. 57. On deploy: Point symlink to new version. Send signal to cluster master. Master tells existing workers to stop accepting new connections. Master forks new workers from new code. Existing workers shut down gracefully.
  58. 58. On error: Server catches it via domain. Next action depends on you: retry? abort? rethrow? etc.
  59. 59. On uncaught exception: ??
        // The infamous "uncaughtException" event!
        process.on('uncaughtException', function(err) {
          // ??
        })
  60. 60. Back to where we started: 1. Sensibly handle uncaught exceptions. We have minimized these by using domains. But they can still happen.
  61. 61. Node docs say not to keep running. An unhandled exception means your application — and by extension node.js itself — is in an undefined state. Blindly resuming means anything could happen. You have been warned. http://nodejs.org/api/process.html#process_event_uncaughtexception
  62. 62. What to do? First, log the error so you know what happened.
  63. 63. Then, you've got to kill the process.
  64. 64. It's not so bad. We can now do so with minimal trouble.
  65. 65. On uncaught exception: Log error. Server stops accepting new connections. Worker tells cluster master it's done. Master forks a replacement worker. Worker exits gracefully when all connections are closed, or after timeout.
  66. 66. What about the request that killed the worker? How does the dying worker gracefully respond to it? Good question!
  67. 67. People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it... -felixge, https://github.com/joyent/node/issues/2582
  68. 68. This is too bad, because you always want to return a response, even on error.
  69. 69. This is Towards 100% Uptime b/c these approaches don't guarantee response for every request. But we can get very close.
  70. 70. Fortunately, given what we've seen, uncaughts shouldn't happen often. And when they do, only one connection will be left hanging.
  71. 71. Must restart cluster master when: Upgrade Node. Cluster master code changes.
  72. 72. During timeout periods, might have: More workers than CPUs. Workers running different versions (old/new). Should be brief. Probably preferable to downtime.
  73. 73. Tip: Be able to produce errors on demand on your dev and staging servers. (Disable this in production.)
  74. 74. Tip: Keep cluster master simple. It needs to run for a long time without being updated.
  75. 75. Things change. I've been talking about:
        {
          "node": "~0.10.20",
          "express": "~3.4.0",
          "connect": "~2.9.0",
          "mongoose": "~3.6.18",
          "recluster": "=0.3.4"
        }
  76. 76. The Future: Node 0.11 / 0.12 For example, cluster module has some changes.
  77. 77. Cluster is experimental. Domains are unstable.
  78. 78. Good reading: Node.js Best Practice Exception Handling (some answers more helpful than others) Remove uncaught exception handler? Isaacs stands by killing on uncaught Domains don't incur performance hits compared to try catch Rejected PR to add domains to Mongoose, with discussion Don't call enter / exit across async Comparison of naught and forever What's changing in cluster
  79. 79. If you thought this was interesting, We're hiring. careers.fluencia.com
  80. 80. Thanks! @williamjohnbert github.com/sandinmyjoints/towards-100-pct-uptime github.com/sandinmyjoints/towards-100-pct-uptimeexamples