Doug Barth discusses how PagerDuty started injecting failure into its production systems with minimal effort and the full support of the development teams. He explains why you should start proactively injecting failure and the exact steps you can take. He also covers the importance of setting an agenda and keeping a log of the actions taken and the to-dos uncovered. Finally, he talks about the benefits your company will get from causing all this chaos.
7. 9/15/14
Designed for reliability
FAILURE FRIDAY!
Downstream providers fail
3 phone providers
3 email providers
6 SMS providers
PagerDuty providers fail
2 cloud providers
3 data centers
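The redundancy listed on this slide implies that a notification can fall through to another provider when one fails. A minimal sketch of that idea follows, assuming each provider is a plain callable that raises on failure; the function names, the provider stubs, and the simple in-order failover are illustrative assumptions, not PagerDuty's actual implementation.

```python
from typing import Callable, List, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def send_with_failover(message: str,
                       providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider(message)
        except Exception as exc:  # any provider error triggers failover
            errors.append(exc)
    raise AllProvidersFailed(errors)


# Hypothetical provider stubs, for illustration only.
def flaky_sms(message: str) -> str:
    raise ConnectionError("provider unreachable")


def backup_sms(message: str) -> str:
    return f"sent via backup: {message}"


result = send_with_failover("disk full on db1", [flaky_sms, backup_sms])
```

Injecting failure into the first provider, as the stub does here, is the kind of exercise that checks whether the fallback path really works.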
8. 9/15/14
Hung up on details
Bugs in exceptional code paths
Systems not recovering as quickly as expected
What is normal when things are abnormal?
17. 9/15/14
Keep a log
Keep track of actions taken
Times are super important
Also track discoveries and TODOs
Share dashboards/metrics
Chat rooms make this easy
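The logging practice on this slide can be sketched as a small timestamped log that separates actions from discoveries and TODOs. This is only an illustration under stated assumptions: the `FailureLog` class, its entry kinds, and the rendered line format are hypothetical, and in practice a chat room transcript serves the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

KINDS = ("action", "discovery", "todo")  # assumed categories, from the slide


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str
    note: str


@dataclass
class FailureLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, note: str,
               when: Optional[datetime] = None) -> LogEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        entry = LogEntry(when or datetime.now(timezone.utc), kind, note)
        self.entries.append(entry)
        return entry

    def todos(self) -> List[str]:
        """Return the follow-up items uncovered during the session."""
        return [e.note for e in self.entries if e.kind == "todo"]

    def render(self) -> str:
        """Format the log as one line per entry, suitable for pasting into chat."""
        return "\n".join(
            f"{e.timestamp:%H:%M:%S} [{e.kind}] {e.note}" for e in self.entries
        )
```

Recording the time of every action makes it possible to line the log up against dashboards and metrics afterwards.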
26. 9/15/14
Issues fixed
Aggressive restarts by monit
Large files on ext3 volumes
Failing to restart due to bad /etc/fstab file
High latency from network-isolated cache
Low capacity with a lost DC
Missing alerts/metrics
29. 9/15/14
Break more things
Start testing whole DC outages
Break multiple services at once
Distribute failure testing to teams
Automate