
Embracing Failure


A Learning Night talk by Chris Wu on how Stripe deals with system failures

Published in: Technology


  1. Embracing Failure (not my life story)
  2. Setting the Mood
     • Understand that they WILL happen
     • Failures are not binary
     • Impact determines importance
     • Deadlines for fixes are variable
  3. Terminology
     • Website
     • Production
     • Downtime
  4. Monitor Failures
  5. What is Monitoring?
     • Graphs. Everywhere.
     • Alerts on failures
     • Phone calls
     • Texts
     • Answers: Are we failing?
  6. healthcare.gov
     • Know when you’re down before CNN
  7. Postmortems (fool me once. shame on you. fool me twice. shame on me.)
  8. Postmortems
     1. Reconstruct the factual timeline
     2. Root cause analysis
     3. Remediation items
  9. Postmortems
     • Why did we fail?
     • Blameless
     • Moderated
  10. Gamedays (You wouldn’t wing a talk. Don’t wing a hot fix)
  11. Gameday
     • Best defense is a good offense
     • Simulate possible failures
     • Do it in production
  12. kill -9
     1. Draw a block diagram
     2. Cut every connection
     3. Watch the fireworks
  13. SafeMachine (like a state machine … but safer)
  14. Try, Try, Try again
     • What if we could just retry failures?
     • Side effects are the root of all evil
     • Safe failures vs Unsafe failures
  15. What’s in a SafeMachine
     • Actions
     • States
     [Diagram: START → (compute) → Computed File → (upload) → Uploaded File → (record successful) → END]
  16. [State names: initialize_succeeded, initialize_failed, initialize_inprogress, computed_succeeded]
  17. The Pipeline
     [Diagram labels: START, a1, a1, a2, a2, a2, a3, a3, a3, END]
  18. The Pipeline
     [Diagram: START → Computed File → Uploaded File → END, with the three steps labeled Safe, Unsafe, Safe]
  19. Embracing Failure
     • Monitor
     • Postmortems
     • Gamedays - you wouldn’t wing a talk?
     • SafeMachine
  20. @chriswu_
  21. Additional resources
     • Postmortems - https://codeascraft.com/2012/05/22/blameless-postmortems/
     • Gamedays - https://stripe.com/blog/game-day-exercises-at-stripe (links at the bottom of this post are also great)
     • Error Tracking - https://getsentry.com/welcome/
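
The SafeMachine slides (13 to 18) describe actions, states, and a pipeline whose steps are labeled Safe, Unsafe, Safe. The slides do not show any code, so what follows is only a minimal Python sketch of the idea, not Stripe's implementation. The `Action` and `SafeMachine` classes, the `UnsafeFailure` exception, the `max_attempts` parameter, and the `<action>_inprogress` / `_failed` / `_succeeded` history entries are all assumptions made for illustration, loosely modeled on the state names on slide 16. The sketch retries a safe action when it fails and stops immediately when an unsafe action fails, since repeating a side effect blindly could do damage.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One step of the pipeline (hypothetical type, not from the slides)."""
    name: str
    run: Callable[[], None]
    safe: bool  # assumption: "safe" means a failed attempt can simply be retried


class UnsafeFailure(Exception):
    """An unsafe action failed, so the machine stops instead of retrying."""


@dataclass
class SafeMachine:
    actions: List[Action]
    max_attempts: int = 3  # assumed retry budget for safe actions
    history: List[str] = field(default_factory=lambda: ["START"])

    def run(self) -> List[str]:
        for action in self.actions:
            self._run_action(action)
        self.history.append("END")
        return self.history

    def _run_action(self, action: Action) -> None:
        for attempt in range(1, self.max_attempts + 1):
            self.history.append(f"{action.name}_inprogress")
            try:
                action.run()
            except Exception as exc:
                self.history.append(f"{action.name}_failed")
                if not action.safe:
                    # Side effects may have partially happened: do not retry.
                    raise UnsafeFailure(action.name) from exc
                if attempt == self.max_attempts:
                    raise  # safe action, but the retry budget is exhausted
            else:
                self.history.append(f"{action.name}_succeeded")
                return


# Usage: a compute step that fails twice before succeeding.
attempts = {"compute": 0}


def flaky_compute() -> None:
    attempts["compute"] += 1
    if attempts["compute"] < 3:
        raise RuntimeError("transient failure")


machine = SafeMachine([
    Action("compute", flaky_compute, safe=True),   # Safe
    Action("upload", lambda: None, safe=False),    # Unsafe
    Action("record", lambda: None, safe=True),     # Safe
])
history = machine.run()
print(history)
```

Recording every transition in `history` is one way to make a failed run resumable and easy to reconstruct in a postmortem; a real system would presumably persist that record rather than keep it in memory.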
