5 Essential Techniques for
Building Fault-tolerant Systems
DIEGO BERRUETA | ENGINEERING PRINCIPAL | ATLASSIAN
Preston Rhea; Flickr (www.flickr.com/photos/prestonrhea/), CC-by
ALERT!
Incident response
REVERSE PROXY
CASCADING FAILURE
aotaro; Flickr (www.flickr.com/photos/aotaro/), CC-by
Fault-tolerant system
Continues to operate in the event of faults in some
of its components
Fault
Deviation from the normal state, either due to
internal or external causes
Failure
Observable impact of a fault in a system, usually
seen as reduced availability
Bob Yeats (OSU); Flickr (www.flickr.com/photos/oregonstateuniversity/), CC-by
Faults happen
Preventing them may be technically or
economically impractical
Systems can be designed and
built to tolerate faults
FAULT TOLERANCE
STABLE SYSTEM | UNSTABLE SYSTEM
Essential
techniques
Contain
Fail fast
Escape
Adjust
Learn
Contain the fault
Using physical or logical barriers
Norlando Pobre; Flickr (www.flickr.com/photos/npobre/), CC-by
Shawn O’Neil; Flickr (www.flickr.com/photos/oneilsh/), CC-by
Pool separation
Different pools for each task or dependency
Brian Cantoni; Flickr (www.flickr.com/photos/cantoni/), CC-by
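A minimal sketch of pool separation, assuming a Python service that calls two dependencies. The dependency names ("payments", "search") and the pool sizes are illustrative assumptions, not taken from the talk. Each dependency gets its own bounded thread pool, so a slow dependency can only exhaust its own workers.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency (hypothetical names and sizes).
POOLS = {
    "payments": ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments"),
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
}


def submit(dependency, fn, *args):
    """Run fn on the pool reserved for this dependency only."""
    return POOLS[dependency].submit(fn, *args)
```

If "search" hangs, at most two threads are stuck and "payments" keeps its four workers.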
Asynchronous communication
Sender and receiver fail independently,
receiver can catch up later
Synchronous communication
Propagates failures, context may be lost
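The contrast above can be sketched with an in-process queue standing in for a message broker (an assumption made for brevity; the function names are hypothetical). The sender never waits for the receiver, and the receiver drains whatever accumulated while it was unavailable.

```python
import queue

# Bounded buffer between sender and receiver.
outbox = queue.Queue(maxsize=1000)


def send(message):
    """Sender returns immediately; it never waits for the receiver."""
    try:
        outbox.put_nowait(message)
        return True
    except queue.Full:
        return False  # buffer full: the caller decides whether to drop or retry


def drain(handle):
    """Receiver catches up on everything queued while it was away."""
    processed = 0
    while True:
        try:
            message = outbox.get_nowait()
        except queue.Empty:
            return processed
        handle(message)
        processed += 1
```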
Contain: in practice
CLIENT ISOLATION
NEW CLIENT
CLIENT A
CLIENT B
Avoid SPOF
Find the components which
compromise the system
How to contain faults
Invest in redundancy
Improve availability by having
more than one of everything
Build bulkheads
Set up logical walls to
reduce the blast radius
A quick error is better than a slow response
FAIL FAST PRINCIPLE
Validate early
Anticipate problems and change course
Carl Wycoff; Wikimedia Commons, CC-by
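One possible reading of "validate early", sketched for a hypothetical money-transfer request (the field names and the transfer scenario are assumptions for illustration). All checks run before any side effect, so a request that cannot complete is refused up front.

```python
def validate_transfer(request):
    """Return a list of problems; an empty list means the request may proceed."""
    problems = []
    amount = request.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount <= 0:
        problems.append("amount must be a positive number")
    if not request.get("source"):
        problems.append("source account is required")
    if not request.get("target"):
        problems.append("target account is required")
    return problems


def transfer(request, execute):
    """Validate first; only call execute when nothing is wrong."""
    problems = validate_transfer(request)
    if problems:
        raise ValueError("; ".join(problems))  # fail before any side effect
    return execute(request)
```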
Reject politely
Fast refusal is better than slow error
Taber A.B.; Flickr (www.flickr.com/photos/andrewbain/), CC-by
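A sketch of polite rejection, assuming an HTTP-style handler that returns a (status, headers, body) tuple. The LoadShedder name, the in-flight limit and the Retry-After value are illustrative assumptions. When all slots are taken the request is refused at once rather than queued.

```python
import threading


class LoadShedder:
    """Refuse work immediately once max_in_flight requests are running."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: refuse at once instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 503, {"Retry-After": "5"}, None
        try:
            return 200, {}, work()
        finally:
            self._slots.release()
```

The 503 with Retry-After tells the client when to come back, which is cheaper for both sides than a request that times out.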
Never wait long
Protect long tasks with timeouts
Robert C-B.; Flickr (www.flickr.com/photos/29233640@N07/), CC-by
Watch out
for slowness
In-process locks
Database queries
Sockets
Remote APIs
Albert Herring, Wikimedia Commons, CC-by
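The sources of slowness listed above can each be given a time bound. The following is a sketch using only the Python standard library; the helper names and default timeouts are assumptions. A database driver would normally expose its own query timeout setting, which is not shown here.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_lock = threading.Lock()
_pool = ThreadPoolExecutor(max_workers=4)


def with_lock(fn, timeout=1.0):
    """In-process lock: give up instead of blocking forever."""
    if not _lock.acquire(timeout=timeout):
        raise TimeoutError("lock not acquired in time")
    try:
        return fn()
    finally:
        _lock.release()


def open_connection(host, port, timeout=2.0):
    """Socket: the timeout bounds the connect and later reads and writes."""
    return socket.create_connection((host, port), timeout=timeout)


def call_with_deadline(fn, timeout=2.0):
    """Any blocking call: the caller stops waiting once the deadline passes."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        raise TimeoutError("operation exceeded its deadline") from None
```

Note that call_with_deadline frees the caller but does not stop the worker thread, so the underlying call should still carry its own timeout where the API allows it.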
Fail fast: in practice
DATABASE FAILOVER
SYNC
Decline service
When overloaded, ask clients
to come back later
How to fail fast
Never wait long
Set a timeout for blocking
calls and slow operations
Validate early
Avoid starting something that
cannot be completed
Circuit breakers
Separate the failing parts
REST CIRCUIT-BREAKER
CLIENT SERVER
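A minimal circuit breaker sketch, not the API of any particular fault-tolerance library. The class name, thresholds and the injectable clock are assumptions. After max_failures consecutive errors the breaker opens and skips calls; once reset_after seconds have passed it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the client fails in microseconds and the struggling server gets room to recover.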
Fault-tolerance libraries
Circuit breaking
Avoid cascading failures during
periods of turbulence
Monitoring and alerting
Observe the behaviour of all your
dependencies
Timeouts
Time-bound any operation
Fall-back
Recover using an alternative path
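Of the features above, fall-back is the simplest to sketch by hand. This example assumes a hypothetical fetch(key) callable and keeps the last good value in a plain dictionary. On failure it serves that value, or a default, and labels the result as degraded.

```python
# Last successful value per key (in-memory; illustrative only).
_last_good = {}


def fetch_with_fallback(key, fetch, default=None):
    """Try the primary path; on failure serve the last good value or a default."""
    try:
        value = fetch(key)
    except Exception:
        return _last_good.get(key, default), "degraded"
    _last_good[key] = value
    return value, "fresh"
```

Returning the "degraded" marker lets callers and dashboards see that an alternative path was used.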
Escape: in practice
SERVICE EVOLUTION
TENANT A
TENANT B
TENANT C
Anticipate failure
If it is not going to work, do not even try
How to escape
Degrade gracefully
A cached result or a default
value may be an alternative
Detect problems
Compare all interactions
against error thresholds
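Comparing interactions against an error threshold could look like the sketch below, which tracks the outcome of the most recent calls to one dependency. The class name, window size, threshold and minimum sample size are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent call outcomes and flag when the error rate is too high."""

    def __init__(self, window=100, threshold=0.5, min_calls=10):
        self._outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, ok):
        self._outcomes.append(not ok)

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def unhealthy(self):
        # Require a minimum sample so one early error does not trip the alarm.
        enough = len(self._outcomes) >= self.min_calls
        return enough and self.error_rate() > self.threshold
```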
Communication
Share initial expectations and status updates
Adjust
Page size
Growth
Flow
Retry
Paginate queries
Avoid unbounded database and
remote queries
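A sketch of a bounded, paginated read. It assumes a hypothetical fetch_page(after_id, limit) callable that returns rows ordered by an "id" field; neither the callable nor the field name comes from the talk. Both the page size and the total number of rows are capped.

```python
def fetch_all(fetch_page, page_size=100, max_items=10_000):
    """Yield rows page by page, never asking for more than page_size at once."""
    last_id = None
    yielded = 0
    while yielded < max_items:
        page = fetch_page(after_id=last_id, limit=page_size)
        if not page:
            return  # no more rows
        for item in page:
            yield item
            yielded += 1
            if yielded >= max_items:
                return  # hard cap on the total cost of the query
        last_id = page[-1]["id"]
```

Paging by "rows after the last id seen" keeps each query cheap even deep into a large table, unlike a growing offset.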
Beware of things that grow
Clean up old data and set size limits
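An in-memory cache is a common example of something that grows without bound. The sketch below applies both remedies from the slide: a size limit and clean-up of old entries. The class name, limits and injectable clock are assumptions.

```python
import time
from collections import OrderedDict


class BoundedCache:
    """Cache with a maximum number of entries and a time-to-live per entry."""

    def __init__(self, max_entries=1000, ttl=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self._clock = clock
        self._data = OrderedDict()  # key -> (stored_at, value), oldest first

    def put(self, key, value):
        self._data.pop(key, None)  # re-inserting moves the key to the end
        self._data[key] = (self._clock(), value)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl:
            del self._data[key]  # expired: clean up on access
            return default
        return value

    def __len__(self):
        return len(self._data)
```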
Negotiate speed
Adjust dynamically using rate limits,
request throttling, buffering and
autoscaling
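Rate limiting is often implemented with a token bucket. The version below is a sketch with assumed names and an injectable clock, not a reference to any specific product. Tokens refill at a steady rate, short bursts up to the capacity are allowed, and anything beyond that is refused.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def try_acquire(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to the time elapsed, never above capacity.
        refill = (now - self._updated) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # over the limit: throttle or reject this request
```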
Insist smartly
Retry transient errors with timeouts
and back-off
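A sketch of retrying with exponential back-off and jitter. TransientError is a hypothetical exception type standing in for whatever errors are known to be retryable, and the delays are illustrative. The operation passed in is assumed to carry its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(fn, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn, retrying transient errors with capped exponential back-off."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lock-step.
            sleep(random.uniform(0, delay))
```

Only TransientError is retried; any other exception propagates on the first attempt, since repeating a permanent failure just adds load.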
Adjust: in practice
KILLER HEALTH CHECK
Insist smartly
Transient errors can be
retried with back-off
How to adjust
Report availability
Apply back pressure to
prevent congestion
Negotiate size
Limit the cost of the job
Monitor the behaviour
Define metrics, observe trends, set up
alerts, capture logs
Learn from
experience
Observe
Test
Reflect
Test with chaos
Verify your hypothesis about
fault-tolerance by continuously
introducing chaos
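At the smallest scale, introducing chaos can mean wrapping a dependency call so that it fails some of the time. This is a sketch under that assumption; the names and the failure rate are hypothetical, and real chaos experiments usually act on infrastructure rather than on a single function.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose to simulate a failing dependency."""


def chaotic(fn, failure_rate=0.1, rng=random.random):
    """Wrap fn so that a fraction of calls fail with an injected fault."""

    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault(f"injected fault in {fn.__name__}")
        return fn(*args, **kwargs)

    return wrapper
```

Running the test suite or a staging environment with such a wrapper in place checks whether timeouts, fall-backs and retries behave as hypothesised.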
Learn from mistakes
Each incident is an opportunity to
make the system more robust
Learn: in practice
PARTIAL FAILURE
Monitor and alert
Understand the behaviour of
the system in production
How to learn
Reflect on incidents
Analyse the root cause and
prevent recurrences
Test what if…?
Deliberately introduce chaos
to assess fault-tolerance
Hope is not a strategy
Test, observe, reflect
Life starts after releasing
“Code complete” is not “production ready”
Be cynical
Do not trust anybody
Robustness is an attitude
Some disasters can be prevented
Build and test with failure in mind
Faults are unavoidable
Any possible fault will eventually happen
Leo Hidalgo; Flickr (www.flickr.com/photos/ileohidalgo/), CC-by
References
Release It!, Michael Nygard
Site Reliability Engineering, Google
Thank you!