Towards 100% uptime with node

Towards

100% Uptime
with Node.js

9M uniques / month.

75K+ users, some are paid
subscribers.

( We | you | users )
hate downtime.

Important, but
out of scope:
Redundant infrastructure.
Backups.
Disaster recovery.

In scope:
Application errors.
Deploys.
Node.js stuff:
Domains.
Cluster.
Express.

1. Sensibly handle
uncaught exceptions.

2. Use domains
to catch and contain errors.

3. Manage processes
with cluster.

4. Gracefully terminate
connections.

1. Sensibly handle uncaught
exceptions.

Uncaught exceptions happen when:
An exception is thrown but not caught.
An error event is emitted but nothing is listening for it.
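The second case can be reproduced in a few lines. This is a minimal sketch, not code from the talk; the helper name `throwsOnError` is made up for illustration.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: returns true if emitting 'error' on the emitter throws.
function throwsOnError(emitter) {
  try {
    emitter.emit('error', new Error('uh-oh'));
    return false;
  } catch (err) {
    return true;
  }
}

const unheard = new EventEmitter();
const heard = new EventEmitter();
heard.on('error', function (err) { /* handled here: log it, count it, etc. */ });

console.log(throwsOnError(unheard)); // no listener: emit() throws the error
console.log(throwsOnError(heard));   // listener attached: nothing is thrown
```

In a real server nothing wraps the emit in try / catch, so the throw in the first case goes uncaught.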

From node/lib/events.js:
EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  if (type === 'error') {
    ...
    } else if (er instanceof Error) {
      throw er; // Unhandled 'error' event
    } else {
    ...

An uncaught exception
crashes the process.

If the process is a server:

x 100s??

Domains.

2. Use domains to catch and contain errors.

try / catch doesn't do async.
try {
  var f = function() {
    throw new Error("uh-oh");
  };
  setTimeout(f, 100);
} catch (ex) {
  console.log("try / catch won't catch", ex);
}

Domains are a bit like try / catch for async.
var d = require('domain').create();
d.on('error', function (err) {
  console.log("domain caught", err);
});
var f = d.bind(function() {
  throw new Error("uh-oh");
});
setTimeout(f, 100);

The active domain is domain.active.
var domain = require('domain');
var d = domain.create();
console.log(domain.active); // <-- null
var f = d.bind(function() {
  console.log(domain.active === d); // <-- true
  console.log(process.domain === domain.active); // <-- true
  throw new Error("uh-oh");
});

New EventEmitters bind
to the active domain.
EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (this.domain) { // This is important!
      ...
      this.domain.emit('error', er);
    } else if ...

Log the error.
Helpful additional fields:
error.domain
error.domainEmitter
error.domainBound
error.domainThrown
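As a rough sketch of what logging these fields might look like, the helper below flattens an error into a loggable object. The function name and the shape of the returned object are assumptions for illustration, not part of the domain API.

```javascript
// Hypothetical helper: summarize an error received by a domain's 'error' handler.
function describeDomainError(err) {
  return {
    message: err.message,
    stack: err.stack,
    // The domain module sets these; they are undefined on errors
    // that never passed through a domain.
    thrown: err.domainThrown,               // true if thrown, false if emitted or passed to a bound callback
    hasEmitter: Boolean(err.domainEmitter), // the emitter that emitted 'error', if any
    hasBoundFn: Boolean(err.domainBound)    // the bound callback that received the error, if any
  };
}

// Possible use inside a domain error handler:
// d.on('error', function (err) { console.error(describeDomainError(err)); });
```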

Then it's up to you.
Ignore.
Retry.
Abort (e.g., return 500).
Throw (becomes an unknown error).

Do I have to create a new domain
every time I do an async operation?

Use middleware.
More convenient.

In Express, this might look like:
var domainWrapper = function(req, res, next) {
  var reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);
  reqDomain.once('error', function(err) {
    res.send(500); // or next(err);
  });
  reqDomain.run(next);
};
Based on
https://github.com/brianc/node-domain-middleware
https://github.com/mathrawka/express-domain-errors

Domain methods.
add: bind an EE to the domain.
run: run a function in context of domain.
bind: bind one function.
intercept: like bind but handles 1st arg err.
dispose: cancels IO and timers.
Domains

are great
until they're not.

node-mongodb-native does not
play well with active domain.
console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
  console.log(domain.active); // undefined
  next();
});
See https://github.com/LearnBoost/mongoose/pull/1337

Fix with explicit binding.
console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
  console.log(domain.active); // still a domain
  next();
}));

What other operations don't play well
with domain.active?
Good question!
Package authors could note this.
If you find one, let package author know.

Can 100% uptime be achieved
just by using domains?

No.

Not if only one instance of your app
is running.

Cluster module.
Node = one thread per process.
Most machines have multiple CPUs.
One process per CPU = cluster.

master / workers
1 master process forks n
workers.
Master and workers communicate state via IPC.
When workers want to listen to a socket, master registers them
for it.
Each new connection to socket is handed off to a worker.
No shared application state between workers.
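A minimal sketch of the master / worker split, assuming one worker per CPU. The names `forkWorkers` and `start` and the `./server` module path are hypothetical; `cluster.isMaster` is the name used by the Node versions this talk covers.

```javascript
const cluster = require('cluster');
const os = require('os');

// Fork `count` workers using the supplied fork function and return them.
// Passing `fork` in keeps the loop testable without spawning processes.
function forkWorkers(count, fork) {
  const workers = [];
  for (let i = 0; i < count; i++) {
    workers.push(fork());
  }
  return workers;
}

// Entry point: the master forks, each worker loads the actual server.
function start() {
  if (cluster.isMaster) {
    forkWorkers(os.cpus().length, function () { return cluster.fork(); });
  } else {
    require('./server'); // hypothetical worker module that calls listen()
  }
}
```

Calling `start()` from the main script would run the same file in both the master and each worker, with the branch deciding which role it plays.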

What about when a worker
isn't working anymore?
Some coordination is needed.

1. Worker tells cluster master it's done accepting new connections.
2. Cluster master forks replacement.
3. Worker dies.
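These three steps could be wired up with cluster's IPC channel roughly as follows. This is a sketch under assumptions: the message type `worker:retiring` and the function names are invented here, and the fork and send functions are injected so the logic can be exercised without real processes.

```javascript
const cluster = require('cluster');

// Message name is an assumption for this sketch, not a cluster built-in.
const RETIRING = 'worker:retiring';

// Master side (step 2): fork a replacement when a worker announces it is retiring.
function handleWorkerMessage(msg, fork) {
  if (msg && msg.type === RETIRING) {
    fork();
    return true;
  }
  return false;
}

// Worker side (steps 1 and 3): tell the master, then disconnect.
// worker.disconnect() closes the worker's servers and then the IPC channel.
function retire(worker, send) {
  send({ type: RETIRING });
  worker.disconnect();
}

// Master wiring: fork a worker and listen for its retirement message.
function forkAndWatch() {
  const worker = cluster.fork();
  worker.on('message', function (msg) {
    handleWorkerMessage(msg, forkAndWatch);
  });
  return worker;
}
```

In a worker, the call would look like `retire(cluster.worker, process.send.bind(process))`.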

Another use case for cluster:

Deployment.

Want to replace all existing servers.
Something must manage that = cluster master process.

Zero downtime deployment.
When master starts, give it a symlink to worker code.
After deploying new code, update the symlink.
Send signal to master: fork new workers!
Master tells old workers to shut down, forks new workers from
new code.
Master process never stops running.
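A sketch of what the master's reload step might look like. The function name `reloadWorkers` is hypothetical, and the cluster object is passed in so the logic can be tried against a stand-in. The wiring comments assume `WORKER_PATH` is the symlink and use `cluster.setupMaster`, the name from the Node versions this talk covers.

```javascript
const cluster = require('cluster');

// Replace every current worker with a fresh one.
// `cl` defaults to the real cluster module; a stand-in can be passed for testing.
function reloadWorkers(cl) {
  cl = cl || cluster;
  const old = Object.keys(cl.workers).map(function (id) { return cl.workers[id]; });
  old.forEach(function (worker) {
    cl.fork();           // new worker starts from whatever the symlink points at now
    worker.disconnect(); // old worker stops accepting and exits once its connections end
  });
  return old.length;
}

// Wiring, assuming this file is the cluster master:
// cluster.setupMaster({ exec: WORKER_PATH });
// process.on('SIGHUP', function () { reloadWorkers(); });
```

The list of old workers is captured before any fork, so newly forked workers are never told to disconnect.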

Signals.
A way to communicate with running processes.
SIGHUP: reload workers (some like SIGUSR2).
$ kill -s HUP <pid>
$ service <node-service-name> reload

Forever
github.com/nodejitsu/forever
Has been around...forever.
No cluster awareness — used on a single process.
Simply restarts the process when it dies.
More comparable to Upstart or Monit.

Naught
github.com/superjoe30/naught
Newer.
Cluster aware.
Zero downtime errors and deploys.
Runs as daemon.
Handles log compression, rotation.

Recluster
github.com/doxout/recluster
Newer.
Cluster aware.
Zero downtime errors and deploys.
Does not run as daemon.
Log agnostic.
Simple, relatively easy to reason about.

We went with recluster.
Happy so far.

I have been talking about
starting / stopping workers
as if it's atomic.

It's not.

4. Gracefully terminate
connections
when needed.

Don't call process.exit too soon!
Give it a grace period to clean up.

Need to clean up:
In-flight requests.
HTTP keep-alive (open TCP) connections.

Revisiting our middleware from earlier:
var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);
    reqDomain.once('error', function(err) {
      next(err);
      if (afterErrorHook) afterErrorHook(err); // Hook.
    });
    reqDomain.run(next);
  };
};

1. Call server.close.
var afterErrorHook = function(err) {
  server.close(); // <-- ensure no new connections
}

2. Shut down keep-alive
connections.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <-- set state
  server.close();
}
var shutdownMiddle = function(req, res, next) {
  if (app.get("isShuttingDown")) { // <-- check state
    req.connection.setTimeout(1); // <-- kill keep-alive
  }
  next();
}
Idea from https://github.com/mathrawka/express-graceful-exit

3. Then call process.exit
in server.close callback.
var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1); // <-- all clear to exit
  });
}

Set a timer.
If timeout period expires and server is still around, call
process.exit.
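One way the timer could be combined with server.close is sketched below. The helper name `shutdownWithDeadline` and its parameters are assumptions; the exit function is injectable so the behaviour can be checked without ending the process.

```javascript
// Hypothetical helper: close the server and exit when done,
// or force the exit once the grace period runs out.
function shutdownWithDeadline(server, graceMs, exit) {
  exit = exit || process.exit;

  const timer = setTimeout(function () {
    exit(1); // grace period expired with connections still open
  }, graceMs);
  timer.unref(); // the timer alone should not keep the process alive

  server.close(function () {
    clearTimeout(timer);
    exit(1); // every connection has closed: all clear to exit
  });
  return timer;
}
```

A call such as `shutdownWithDeadline(server, 10000)` would give in-flight requests up to ten seconds; the right grace period depends on how long your slowest legitimate request takes.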

Summing up:

Our ideal server.

On startup:
Cluster master comes up (for example, via Upstart).
Cluster master forks workers from symlink.
Each worker's server starts accepting connections.

On deploy:
Point symlink to new version.
Send signal to cluster master.
Master tells existing workers to stop accepting new connections.
Master forks new workers from new code.
Existing workers shut down gracefully.

On error:
Server catches it via domain.
Next action depends on you: retry? abort? rethrow? etc.

On uncaught exception:
??
// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  // ??
})

Back to where we started:

1. Sensibly handle uncaught
exceptions.
We have minimized these by using domains.
But they can still happen.

Node docs say not to keep running.

An unhandled exception means your
application — and by extension node.js
itself — is in an undefined state. Blindly
resuming means anything could happen.
You have been warned.
http://nodejs.org/api/process.html#process_event_uncaughtexception

What to do?
First, log the error so you know what happened.

Then, you've got to
kill the process.

It's not so bad. We can now do so
with minimal trouble.

On uncaught exception:
Log error.
Server stops accepting new connections.
Worker tells cluster master it's done.
Master forks a replacement worker.
Worker exits gracefully when all connections are closed, or after
timeout.
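Put together, the worker's side of this sequence might look like the sketch below. Every option name (`log`, `server`, `notifyMaster`, `exit`, `graceMs`) is an assumption made for illustration, and forking the replacement is left to the master.

```javascript
// Sketch: build an 'uncaughtException' handler from injected pieces.
function makeUncaughtHandler(opts) {
  let dying = false;
  return function onUncaught(err) {
    opts.log(err);              // always log, even if shutdown is already under way
    if (dying) return;          // a second uncaught error must not restart the sequence
    dying = true;

    const timer = setTimeout(function () {
      opts.exit(1);             // timeout: exit even if connections remain
    }, opts.graceMs);
    timer.unref();

    opts.server.close(function () { // stop accepting new connections
      clearTimeout(timer);
      opts.exit(1);                 // all connections closed: exit
    });

    opts.notifyMaster();        // tell the master so it can fork a replacement
  };
}

// Possible wiring in a worker (the message shape is hypothetical):
// process.on('uncaughtException', makeUncaughtHandler({
//   log: console.error,
//   server: server,
//   notifyMaster: function () { process.send({ type: 'worker:retiring' }); },
//   exit: process.exit,
//   graceMs: 10000
// }));
```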

What about the request
that killed the worker?
How does the dying worker
gracefully respond to it?
Good question!

People are also under the illusion that it is
possible to trace back [an uncaught]
exception to the http request that caused
it...
-felixge, https://github.com/joyent/node/issues/2582

This is too bad, because you
always want to return a response,
even on error.

This is why the title says Towards 100% Uptime: these approaches don't
guarantee a response for every request.

But we can get very close.

Fortunately, given what we've seen,
uncaughts shouldn't happen often.
And when they do, only one
connection will be left hanging.

Must restart cluster master when:
Upgrade Node.
Cluster master code changes.

During timeout periods, might have:
More workers than CPUs.
Workers running different versions (old/new).
Should be brief. Probably preferable to downtime.

Tip:

Be able to produce errors on demand
on your dev and staging servers.
(Disable this in production.)
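A sketch of an Express-style middleware that throws on demand. The factory name `errorOnDemand`, its parameters, and the route path in the usage note are hypothetical.

```javascript
// `env` would typically be process.env.NODE_ENV.
// `schedule` defaults to setImmediate, so the throw happens asynchronously
// and escapes any try / catch around the route handler.
function errorOnDemand(env, schedule) {
  schedule = schedule || setImmediate;
  return function (req, res, next) {
    if (env === 'production') return next(); // disabled in production
    schedule(function () {
      throw new Error('error on demand');
    });
  };
}
```

Mounted as, say, `app.get('/debug/throw', errorOnDemand(process.env.NODE_ENV))`, a single request lets you watch the domain handler, the graceful shutdown, and the master's replacement fork work end to end.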

Tip:

Keep cluster master simple.
It needs to run for a long time without being updated.

Things change.
I've been talking about:
{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  "recluster": "=0.3.4"
}

The Future:
Node 0.11 / 0.12
For example, cluster module has some changes.

Cluster is experimental.
Domains are unstable.

Good reading:
Node.js Best Practice Exception Handling (some answers more
helpful than others)
Remove uncaught exception handler?
Isaacs stands by killing on uncaught
Domains don't incur performance hits compared to try catch
Rejected PR to add domains to Mongoose, with discussion
Don't call enter / exit across async
Comparison of naught and forever
What's changing in cluster

If you thought this was interesting,

We're hiring.
careers.fluencia.com

Thanks!
@williamjohnbert
github.com/sandinmyjoints/towards-100-pct-uptime
github.com/sandinmyjoints/towards-100-pct-uptimeexamples
