Embracing Failure
(not my life story)
Setting the Mood
•Understand that failures WILL happen
•Failures are not binary
•Impact determines importance
•Deadlines for fixes are variable
Terminology
•Website
•Production
•Downtime
Monitor Failures
What is Monitoring?
•Graphs. Everywhere.
•Alerts on failures
•phone calls
•texts
•Answers: Are we failing?
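A minimal sketch of "Are we failing?" expressed as a check, assuming an error-rate threshold over a window of samples. The names `Sample`, `error_rate`, `check_and_alert`, and the `notify` callback (a stand-in for whatever service places the phone calls and texts) are hypothetical and not from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One monitoring sample: requests served and how many of them failed."""
    requests: int
    errors: int


def error_rate(samples: List[Sample]) -> float:
    """Fraction of failed requests across the window (0.0 when there is no traffic)."""
    total = sum(s.requests for s in samples)
    failed = sum(s.errors for s in samples)
    return failed / total if total else 0.0


def check_and_alert(
    samples: List[Sample],
    threshold: float,
    notify: Callable[[str], None],
) -> bool:
    """Answer "are we failing?" and call notify() when the threshold is exceeded."""
    rate = error_rate(samples)
    failing = rate > threshold
    if failing:
        # Assumption: notify() hands the message to a paging system.
        notify(f"error rate {rate:.1%} exceeds {threshold:.1%}")
    return failing
```

Because failures are not binary, a real check would likely use several thresholds tied to impact rather than the single cutoff used here.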
healthcare.gov
•Know when you’re down before CNN
Postmortems
(fool me once. shame on you.
fool me twice. shame on me.)
Postmortems
1. Reconstruct the factual timeline
2. Root cause analysis
3. Remediation items
Postmortems
•Why did we fail?
•Blameless
•Moderated
Gamedays
(You wouldn’t wing a talk.
Don’t wing a hot fix)
Gameday
•Best defense is a good offense
•Simulate possible failures
•Do it in production
kill -9
1. Draw a block diagram
2. Cut every connection
3. Watch the fireworks
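The steps above can be rehearsed on a very small scale. This sketch is a local stand-in for a gameday, not a production drill: it starts a throwaway child process, sends it SIGKILL (the signal `kill -9` sends), and checks that a replacement can be started. The function names are hypothetical, and `signal.SIGKILL` is only available on POSIX systems.

```python
import signal
import subprocess
import sys
from typing import Tuple


def start_worker() -> subprocess.Popen:
    """Start a throwaway child process that stands in for one block of the diagram."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_dash_nine(proc: subprocess.Popen) -> int:
    """Send SIGKILL, as `kill -9 <pid>` would, and return the child's exit code."""
    proc.send_signal(signal.SIGKILL)
    return proc.wait(timeout=10)


def run_gameday() -> Tuple[int, bool]:
    """Kill one worker, then verify that a replacement comes up."""
    worker = start_worker()
    exit_code = kill_dash_nine(worker)

    replacement = start_worker()
    recovered = replacement.poll() is None  # None means it is still running

    # Clean up the replacement so the exercise leaves nothing behind.
    replacement.kill()
    replacement.wait(timeout=10)
    return exit_code, recovered


if __name__ == "__main__":
    code, ok = run_gameday()
    print(f"killed worker exit code: {code}, replacement running: {ok}")
```

On POSIX, a process ended by a signal reports the negated signal number as its return code, which is how the exercise confirms the kill really happened.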
SafeMachine
(like a state machine … but safer)
Try, Try, Try again
•What if we could just retry failures?
•Side effects are the root of all evil
•Safe failures vs Unsafe failures
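One way to read the safe/unsafe distinction is sketched below, assuming that a "safe" failure is one that happened before any side effect and an "unsafe" one may have left a side effect behind. The exception classes and `retry_safe` are hypothetical names for illustration, not the talk's implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SafeFailure(Exception):
    """The action failed before causing any side effect; retrying is harmless."""


class UnsafeFailure(Exception):
    """The action may have caused a side effect; a blind retry could repeat it."""


def retry_safe(action: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Retry only safe failures; let unsafe failures surface immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except SafeFailure:
            if attempt == attempts:
                raise  # out of attempts: report the last safe failure
            time.sleep(delay)
        # UnsafeFailure is deliberately not caught: it needs a human or a
        # compensating step, not another attempt.
    raise ValueError("attempts must be at least 1")
```

The design choice is that the caller never decides whether to retry. The action itself classifies its failure, since only the action knows whether a side effect may already have happened.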
What’s in a SafeMachine
•Actions
•States
[Diagram: START → Computed File → Uploaded File → END, with the transitions labeled compute, upload, record successful. State labels shown: initialize_succeeded, initialize_failed, initialize_inprogress, computed_succeeded]
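A minimal sketch of how actions and states could fit together, assuming each action is recorded as `<action>_inprogress`, then `<action>_succeeded` or `<action>_failed`, loosely following the state labels on the slide. The class `SafeMachineSketch` and its methods are hypothetical. This is not the talk's SafeMachine, whose code is not shown.

```python
from typing import Callable, List, Tuple

# An action is a name plus a callable that raises on failure.
Action = Tuple[str, Callable[[], None]]


class SafeMachineSketch:
    """Runs actions in order and records a state for every step taken."""

    def __init__(self, actions: List[Action]) -> None:
        self.actions = actions
        self.completed = 0                 # number of actions that have succeeded
        self.history: List[str] = ["START"]

    def run(self) -> str:
        """Run the remaining actions; stop at the first failure and report it."""
        if self.history[-1] == "END":
            return "END"
        while self.completed < len(self.actions):
            name, func = self.actions[self.completed]
            self.history.append(f"{name}_inprogress")
            try:
                func()
            except Exception:
                self.history.append(f"{name}_failed")
                return self.history[-1]
            self.history.append(f"{name}_succeeded")
            self.completed += 1
        self.history.append("END")
        return "END"
```

Calling `run()` again after a failure resumes at the failed action rather than at START, so work that already succeeded is not repeated. Whether that resume is appropriate depends on whether the failed action is safe to retry.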
[Diagram: START → a1, a1 → a2, a2, a2 → a3, a3, a3 → END]
The Pipeline
[Diagram: START → Computed File → Uploaded File → END, with the three transitions labeled Safe, Unsafe, Safe]
Embracing Failure
•Monitor
•Postmortems
•Gamedays - you wouldn’t wing a talk?
•SafeMachine
@chriswu_
Additional resources
• Postmortems - https://codeascraft.com/2012/05/22/blameless-postmortems/
• Gamedays - https://stripe.com/blog/game-day-exercises-at-stripe
• links at the bottom of this post are also great
• Error Tracking - https://getsentry.com/welcome/
