NETFLIX’S CHAOS MONKEY
Michael Whitehead
“EVERYTHING FAILS ALL THE TIME”
- WERNER VOGELS
CHAOS MONKEY
A service that causes failure and wreaks havoc on instances in Auto Scaling Groups
A member of the Simian Army developed by Netflix
WHY WOULD WE INTENTIONALLY CAUSE FAILURE?!?
• It is inevitable
• Infrastructure is complex
• Forcing failure puts you in control
• Identify faults in your architecture
  • Do your load balancers reroute traffic correctly?
  • Do your instances function correctly when they come back up?
  • Are your monitoring tools alerting you on important events?
GETTING STARTED WITH CHAOS MONKEY
• Amazon Web Services
• Must be using Auto Scaling Groups
• Uses Amazon SimpleDB for event storage
• Simple Email Service setup (optional, for notifications)
• Can be used with Netflix’s Asgard (optional)
• Java 7 JDK or newer
EXAMPLE WITH CLOUDFORMATION
BUILDING & CONFIGURATION
• Clone the SimianArmy repo from GitHub
• Builds using Gradle
• Runs 6 times a day during business hours, 9am to 3pm
• Does not run on holidays or weekends
• Timeframes and frequency of runs can be configured
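The schedule described above can be sketched in a few lines of Python. This is only an illustration of the rule (business hours, no weekends, no holidays), not SimianArmy's actual scheduler; the function names, the hourly frequency, and the holiday set are assumptions made for the example.

```python
from datetime import date, datetime


def is_monkey_time(now, open_hour=9, close_hour=15, holidays=frozenset()):
    """Return True if a run would be allowed at `now` under the slide's rules.

    Hypothetical helper: the 9 and 15 defaults mirror "9am to 3pm" from the
    slide; `holidays` is a set of datetime.date values supplied by the caller.
    """
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return open_hour <= now.hour < close_hour


def runs_per_day(open_hour=9, close_hour=15, frequency_hours=1):
    """Number of runs in one business day, assuming one run per `frequency_hours`."""
    return (close_hour - open_hour) // frequency_hours


if __name__ == "__main__":
    print(is_monkey_time(datetime(2024, 1, 8, 10, 0)))  # a Monday morning
    print(runs_per_day())
```

With the defaults, an hourly run between 9am and 3pm gives the 6 runs a day mentioned above; changing the hours or the frequency changes that count.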
IMPORTANT PROPERTIES
• Enabling Chaos Monkey
  • Set simianarmy.chaos.enabled = true
  • Set simianarmy.chaos.leashed = false
• Probability of 1 instance being terminated per day per ASG
  • simianarmy.chaos.ASG.probability = 1.0
• Opt-in or opt-out model
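To get a feel for what a daily probability means when the monkey runs several times a day, here is a small worked calculation. It assumes the daily probability is split evenly across the day's runs; that model is an assumption made for this sketch, not something the slides state, and the function names are hypothetical.

```python
def per_run_probability(daily_probability, runs_per_day):
    """Per-run termination chance if the daily value is split evenly (assumed model)."""
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return min(1.0, daily_probability / runs_per_day)


def chance_of_termination_today(daily_probability, runs_per_day):
    """Chance that at least one run in the day terminates an instance."""
    p = per_run_probability(daily_probability, runs_per_day)
    return 1.0 - (1.0 - p) ** runs_per_day


if __name__ == "__main__":
    p = per_run_probability(1.0, 6)
    print(f"per run: {p:.4f}")
    print(f"at least once today: {chance_of_termination_today(1.0, 6):.4f}")
```

Under this assumption a setting of 1.0 with 6 runs gives each run a 1-in-6 chance, so a termination on any given day is likely but not guaranteed. Check the Chaos Monkey Configuration page listed under LINKS for how the value is actually applied.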
OPT-IN / OPT-OUT MODEL
• Set to false = opt-in; set to true = opt-out
  • simianarmy.chaos.ASG.enabled = false
• When opt-in (false), you must enable each Auto Scaling Group you want Chaos Monkey to run in
  • simianarmy.chaos.ASG.<<auto scaling group name>>.enabled = true
• When opt-out (true), you must disable each Auto Scaling Group you do not want it to run in
  • simianarmy.chaos.ASG.<<auto scaling group name>>.enabled = false
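The opt-in / opt-out rule boils down to "a group-specific setting wins, otherwise the default applies". A minimal Python sketch of that lookup follows. It assumes the properties have already been loaded into a plain dict of strings and that per-group keys take the form simianarmy.chaos.ASG.<group name>.enabled; the function name and the group names in the usage are made up for illustration.

```python
DEFAULT_KEY = "simianarmy.chaos.ASG.enabled"


def is_group_enabled(props, group_name):
    """Decide whether Chaos Monkey may act on one Auto Scaling Group.

    Hypothetical helper: a per-group key overrides the default key.
    The key layout used here is an assumption, as noted above.
    """
    group_key = f"simianarmy.chaos.ASG.{group_name}.enabled"
    default_value = props.get(DEFAULT_KEY, "false")
    value = props.get(group_key, default_value)
    return value.strip().lower() == "true"


if __name__ == "__main__":
    # Opt-in: default is false, only "web-asg" is switched on.
    opt_in = {
        "simianarmy.chaos.ASG.enabled": "false",
        "simianarmy.chaos.ASG.web-asg.enabled": "true",
    }
    print(is_group_enabled(opt_in, "web-asg"))
    print(is_group_enabled(opt_in, "db-asg"))
```

In opt-out mode the same function applies with the default set to "true" and individual groups set to "false".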
EMAIL NOTIFICATIONS
ARE TERMINATIONS ALL IT CAN DO?
• Block all network traffic
• Burn CPU
• Burn IO
• Fill Disk
• Kill Processes
• Network Loss
• Null-Route
  • All EC2 <-> EC2 traffic
SSH REQUIRED
• Detach all EBS volumes
• Fail DNS
• Fail EC2 API
• Fail S3 API
• Fail DynamoDB API
• Network Corruption
• Network Latency
LINKS
• CloudFormation Template:
  https://github.com/joehack3r/aws/blob/master/cloudformation/templates/chaosMonkey.json
• Chaos Monkey Announcement:
  http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Simian Army Quick Start Guide:
  https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide
• Chaos Monkey Configuration:
  https://github.com/Netflix/SimianArmy/wiki/Chaos-Settings
• Chaos Monkey Army:
  https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army
