5 Essential Techniques for Building Fault-tolerant Systems

DIEGO BERRUETA | ENGINEERING PRINCIPAL | ATLASSIAN

Preston Rhea; Flickr (www.flickr.com/photos/prestonrhea/), CC-by

REVERSE PROXY
CASCADING FAILURE

aotaro; Flickr (www.flickr.com/photos/aotaro/), CC-by

Fault-tolerant system
Continues to operate in the event of faults in some
of its components
Fault
Deviation from the normal state, either due to
internal or external causes
Failure
Observable impact of a fault on a system; it usually
manifests as reduced availability

Bob Yeats (OSU); Flickr (www.flickr.com/photos/oregonstateuniversity/), CC-by
Faults happen
Preventing them may be technically or
economically impractical

Systems can be designed and
built to tolerate faults
FAULT TOLERANCE

Essential
techniques
Contain
Fail fast
Escape
Adjust
Learn

Contain the fault
Using physical or logical barriers

Norlando Pobre; Flickr (www.flickr.com/photos/npobre/), CC-by

Shawn O’Neil; Flickr (www.flickr.com/photos/oneilsh/), CC-by

Pool separation
Different pools for each 
task or dependency
Brian Cantoni; Flickr (www.flickr.com/photos/cantoni/), CC-by
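Pool separation can be sketched as one bounded thread pool per dependency, so that a slow dependency can only exhaust its own workers. This is a minimal illustration, not a prescribed design: the dependency names (`payments`, `search`), the pool sizes and the `submit` helper are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One small, bounded pool per dependency (names and sizes are hypothetical).
POOLS = {
    "payments": ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments"),
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
}


def submit(dependency, fn, *args):
    """Run fn on the pool reserved for the given dependency."""
    return POOLS[dependency].submit(fn, *args)


def current_thread_name():
    """Helper used only to show which pool a task ran on."""
    return threading.current_thread().name
```

If every `search` worker is stuck, calls submitted to `payments` still run, because the two pools share no threads.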

Asynchronous communication
Sender and receiver fail independently,
receiver can catch up later
Synchronous communication
Propagates failures, context may be lost
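The difference can be illustrated with a bounded in-process queue standing in for a message broker (an assumption made to keep the sketch self-contained). The sender never waits for the receiver, and the receiver drains whatever accumulated while it was away. The names `mailbox`, `send` and `drain` are hypothetical.

```python
import queue

# Bounded, so an absent receiver cannot make the buffer grow without limit.
mailbox = queue.Queue(maxsize=100)


def send(message):
    """Enqueue without waiting for the receiver; False means the buffer is full."""
    try:
        mailbox.put_nowait(message)
        return True
    except queue.Full:
        return False


def drain():
    """Receiver catches up by taking everything queued so far, in order."""
    processed = []
    while True:
        try:
            processed.append(mailbox.get_nowait())
        except queue.Empty:
            return processed
```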

CLIENT ISOLATION
NEW CLIENT
CLIENT A
CLIENT B

How to contain faults
Avoid single points of failure (SPOF)
Find the components whose failure would
compromise the whole system
Invest in redundancy
Improve availability by having
more than one of everything
Build bulkheads
Set up logical walls to 
reduce the blast radius

A quick error is better 
than a slow response
FAIL FAST PRINCIPLE

Validate early
Anticipate problems and 
change course
Carl Wycoff; Wikimedia Commons, CC-by

Reject politely
Fast refusal is better 
than slow error
Taber A.B.; Flickr (www.flickr.com/photos/andrewbain/), CC-by

Never wait long
Protect long tasks 
with timeouts
Robert C-B.; Flickr (www.flickr.com/photos/29233640@N07/), CC-by
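One way to put a time bound on a slow call in Python is to run it on a worker thread and wait for the result with a deadline. This is a sketch under stated assumptions: `call_with_timeout` and its `default` parameter are hypothetical names, and the approach only stops the caller from waiting; it does not interrupt work that has already started.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=2)


def call_with_timeout(fn, timeout_s, default=None):
    """Wait at most timeout_s seconds for fn(); return default if it takes longer."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # cancel() only helps if the task has not started yet;
        # a task that is already running keeps going in the background.
        future.cancel()
        return default
```

Where the underlying client supports it (socket timeouts, database statement timeouts), setting the timeout there as well is usually preferable, since it also releases the worker.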

Watch out
for slowness
In-process locks
Database queries
Sockets
Remote APIs
Albert Herring, Wikimedia Commons, CC-by

How to fail fast
Decline service
When overloaded, ask clients
to come back later
Never wait long
Set a timeout for blocking
calls and slow operations
Validate early
Avoid starting something that
cannot be completed

Circuit breakers
Separate the failing parts

REST CIRCUIT-BREAKER
CLIENT SERVER
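A circuit breaker sitting between client and server can be sketched in a few lines. This is a simplified, single-threaded illustration rather than the behaviour of any particular library: the class name, the thresholds and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Half-open: the cool-down has elapsed, so one trial call goes through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a server that is already struggling, which also gives that server room to recover.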

Fault-tolerance libraries
Circuit breaking
Avoid cascading failures during
periods of turbulence
Monitoring and alerting
Observe the behaviour of all your
dependencies
Timeouts
Time-bound any operation
Fall-back
Recover using an alternative path

SERVICE EVOLUTION
TENANT A
TENANT B
TENANT C

How to escape
Anticipate failure
If it is not going to work, 
do not even try
Degrade gracefully
A cached result or a default
value may be an alternative
Detect problems
Compare all interactions
against error thresholds
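Graceful degradation with a cached result or a default value might look like the sketch below. The function name `fetch_with_fallback` and the module-level `_cache` dictionary are hypothetical, and a real cache would normally also bound its size and expire entries.

```python
_cache = {}  # last known good value per key


def fetch_with_fallback(key, fetch, default=None):
    """Try the live source; on failure serve the cached value, else the default."""
    try:
        value = fetch(key)
    except Exception:
        # Degrade: a stale answer (or a default) instead of an error.
        return _cache.get(key, default)
    _cache[key] = value
    return value
```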

Communication
Share initial expectations  
and status updates

Adjust
Page size
Growth
Flow
Retry
Paginate queries
Avoid unbounded database and
remote queries
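Paginating a query means the caller never asks for more than one page at a time. In this sketch `fetch_page` is a stand-in for a bounded database or remote query (for example a SQL query with `LIMIT` and `OFFSET`); both function names and the default page size are assumptions.

```python
def fetch_page(items, offset, limit):
    """Stand-in for a bounded query: return at most `limit` rows from `offset`."""
    return items[offset:offset + limit]


def iter_all(items, page_size=100):
    """Yield every item while requesting at most page_size of them per query."""
    offset = 0
    while True:
        page = fetch_page(items, offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)
```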

Beware of things that grow
Clean up old data and set size limits
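A size limit can be as simple as evicting the oldest entry once a collection reaches its cap. The `BoundedCache` class below is a hypothetical illustration of that idea for an in-memory cache; the same principle applies to tables, queues and log files.

```python
from collections import OrderedDict


class BoundedCache:
    """Dictionary-like cache that never holds more than max_entries items."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # most recently written goes last
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently written

    def get(self, key, default=None):
        return self._data.get(key, default)

    def __len__(self):
        return len(self._data)
```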

Negotiate speed
Adjust dynamically using rate limits,
request throttling, buffering and
autoscaling
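Of the mechanisms listed, rate limiting is the easiest to show in a few lines. The token bucket below is one common approach, sketched here with a hypothetical class name and an injectable `clock` so that it can be tested without sleeping; it is not thread-safe as written.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s  # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated_at = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```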

Insist smartly
Retry transient errors with timeouts
and back-off
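Retrying with back-off can be sketched as a loop that retries only errors known to be transient, doubles the delay cap on each attempt and adds random jitter. `TransientError`, the default delays and the injectable `sleep` parameter are assumptions for the example; the wrapped call is expected to enforce its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(fn, attempts=4, base_delay_s=0.1, max_delay_s=2.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with capped exponential back-off."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter spreads retries out so clients do not all return at once.
            sleep(random.uniform(0, delay))
```

Any other exception type propagates on the first attempt, since retrying a permanent error only adds load.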

How to adjust
Insist smartly
Transient errors can be
retried with back-off
Report availability
Apply back pressure to
prevent congestion
Negotiate size
Limit the cost of the job
Monitor the behaviour
Define metrics, observe trends, set up
alerts, capture logs
Learn from
experience
Observe
Test
Reflect

Test with chaos
Verify your hypothesis about
fault-tolerance by continuously
introducing chaos

Learn from mistakes
Each incident is an opportunity to
make the system more robust

How to learn
Monitor and alert
Understand the behaviour of
the system in production
Reflect on incidents
Analyse the root cause and
prevent recurrences
Test what if…?
Deliberately introduce chaos
to assess fault-tolerance
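At the smallest scale, introducing chaos deliberately can mean wrapping a dependency so that a configurable fraction of calls fail, then checking that the caller copes. The wrapper below is a toy sketch with hypothetical names (`with_chaos`, `InjectedFault`); dedicated chaos tooling works at the infrastructure level instead.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper instead of calling the real function."""


def with_chaos(fn, failure_rate, rng=random.random):
    """Wrap fn so that roughly failure_rate of the calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("chaos: injected failure")
        return fn(*args, **kwargs)
    return wrapper
```

Passing a fixed `rng` makes the injected faults reproducible in a test run.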

Robustness is an attitude
Hope is not a strategy
Test, observe, reflect
Life starts after releasing
“Code complete” is not “production ready”
Be cynical
Do not trust anybody

Some disasters 
can be prevented
Build and test with failure in mind
Faults are unavoidable
Any possible fault 
will eventually happen

Leo Hidalgo; Flickr (www.flickr.com/photos/ileohidalgo/), CC-by

References
Release It!
Michael Nygard
Site Reliability Engineering
Google

Thank you!
DIEGO BERRUETA | ENGINEERING PRINCIPAL | ATLASSIAN
