Failure Friday: Start Injecting Failure Today!

9/15/14
@dougbarth
DEVOPSDAYS TORONTO 2014
Failure Friday!

Dev
Ops

DevOps Engineer

“DO NOT FEAR FAILURE” BY TOMASZ STASIUK

How is babby PagerDuty formed?

Designed for reliability
Downstream providers fail
3 phone providers
3 email providers
6 SMS providers
PagerDuty providers fail
2 cloud providers
3 data centers

Hung up on details
Bugs in exceptional code paths
Systems not recovering as quickly as expected
What is normal when things are abnormal?

Simian Army
Chaos Monkey
Latency Monkey
Chaos Gorilla
Chaos Kong
“WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817

Keep it simple
“KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES

Process
“HOW TO DRAW AN OWL” BY CHESTER

Get buy in
“ANGRY BOSS” BY KAUSHAL KARKHANIS

Schedule
1 hour recurring meeting
Developers & Operations
List the attacks and identify the victim
Finish as much as possible

Before starting
Disable cron jobs & CM system
Announce the start
Open up relevant dashboards
Leave alarms enabled

Attacks
Test a single host and then DC
5 minutes per attack
Return to a working state
Stop if things break
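
The attack loop above (inject, hold for 5 minutes, return to a working state, stop if things break) could be sketched as a small wrapper. This is an illustration only: `run` and `timed_attack` are hypothetical helper names, `ATTACK_SECONDS` and `DRY_RUN` are assumed variables, and dry-run is the default so nothing is executed unless it is switched off.

```shell
#!/bin/sh
# Sketch only: run and timed_attack are hypothetical helper names.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'would run: %s\n' "$1"
    else
        sh -c "$1"
    fi
}

# timed_attack INJECT_CMD REVERT_CMD
# Injects a failure, holds it for ATTACK_SECONDS (default 300, the
# 5 minutes from the slide), then reverts.
timed_attack() {
    inject=$1
    revert=$2
    hold=${ATTACK_SECONDS:-300}

    # "Stop if things break": an interrupt (Ctrl-C) still runs the revert.
    trap 'run "$revert"; exit 130' INT TERM

    # Only hold the failure if the injection itself succeeded.
    if run "$inject"; then
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'would hold for %ss\n' "$hold"
        else
            sleep "$hold"
        fi
    fi

    # "Return to a working state" before the next attack.
    run "$revert"
    trap - INT TERM
}

# Example (dry run):
#   timed_attack "service cassandra stop" "service cassandra start"
```

Keeping the revert command next to the inject command means every attack is written down together with its way back before anyone runs it.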

Keep a log
Keep track of actions taken
Times are super important
Also track discoveries and TODOs
Share dashboards/metrics
Chat rooms make this easy
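
If a chat room is not available, the same log can be kept from the shell. A minimal sketch, assuming a local file is acceptable: `LOG_FILE`, `log_entry` and `list_todos` are hypothetical names, and the entry kinds (ACTION, DISCOVERY, TODO) simply mirror the bullets above.

```shell
#!/bin/sh
# Sketch only. LOG_FILE is a hypothetical path; point it wherever
# the team keeps its Failure Friday notes.
LOG_FILE=${LOG_FILE:-./failure-friday.log}

# log_entry KIND MESSAGE...
# Appends one line: UTC timestamp, kind (ACTION, DISCOVERY, TODO), message.
log_entry() {
    kind=$1
    shift
    printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$LOG_FILE"
}

# Lists the TODO entries so they can be moved to the issue tracker.
list_todos() {
    grep -F ' [TODO] ' "$LOG_FILE"
}

# Example:
#   log_entry ACTION "stopped cassandra on one host"
#   log_entry TODO "add missing alert"
#   list_todos
```

UTC timestamps make it easier to line the log up against dashboards and graphs afterwards.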

Graphs are awesome

Finishing up
Sound the all clear
Re-enable cron jobs & CM system
Move TODOs to issue tracker

Attack Strategies
“UNICORN ATTACK!” BY SAM HOWZIT

service cassandra stop

shutdown -r now

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP

iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
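
The slide shows only how to add the DROP rules. Returning to a working state means deleting the same rules afterwards, which `iptables -D` does when given the identical rule specification. A sketch under assumptions: `run` and `isolate_rules` are hypothetical helpers, the ports are the two from the slide (9160 and 7000, the default Cassandra Thrift client and inter-node ports), and dry-run is the default.

```shell
#!/bin/sh
# Sketch only: run and isolate_rules are hypothetical helper names.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
PORTS="9160 7000"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# isolate_rules ACTION
# ACTION is -I (insert at the head of the chain, isolating the node)
# or -D (delete the matching rule, rejoining the node).
isolate_rules() {
    action=$1
    for port in $PORTS; do
        run iptables "$action" INPUT -p tcp --dport "$port" -j DROP
        run iptables "$action" OUTPUT -p tcp --sport "$port" -j DROP
    done
}

# Example (dry run):
#   isolate_rules -I   # start of the attack
#   isolate_rules -D   # return to a working state
```

Generating both directions from one rule list keeps the cleanup from drifting out of sync with the attack.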

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
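
The netem rule above adds a 500ms delay with 100ms of jitter and 5% packet loss, and it stays in place until it is removed. A sketch of the pair of commands, with assumptions labeled: `run`, `degrade_network` and `restore_network` are hypothetical helpers, `IFACE` defaults to the `eth0` shown on the slide, and dry-run is the default.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
IFACE=${IFACE:-eth0}

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Adds 500ms delay with 100ms jitter and 5% loss (values from the slide).
degrade_network() {
    run tc qdisc add dev "$IFACE" root netem delay 500ms 100ms loss 5%
}

# Removes the netem qdisc again, restoring normal latency.
restore_network() {
    run tc qdisc del dev "$IFACE" root netem
}

# Example (dry run):
#   degrade_network
#   restore_network
```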

“RESULTS READER BOARD” BY ROSA SAY

Issues fixed
Aggressive restarts by monit
Large files on ext3 volumes
Failing to restart due to bad /etc/fstab file
High latency from a network-isolated cache
Low capacity with a lost DC
Missing alerts/metrics

Cultural impact
Knowledge sharing
Highlights untestable systems
Keeps failure handling on everyone’s mind

Future plans
“ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY

Break more things
Start testing whole DC outages
Break multiple services at once
Distribute failure testing to teams
Automate

Summary
Failures will happen
Proactively test failure handling now
Choose something easy: app server, cache
Automate later

pagerduty.com/jobs
Thank you.
