Jesse Robbins
Cofounder, Opscode

@jesserobbins
jesse@opscode.com




Join Us!!!
“You don’t choose the moment,
  the moment chooses you.

You only choose how prepared
    you are when it does.”
             -Fire Chief Mike Burtch




Operations is Work that Matters




GameDay



define:
 GameDay
   An exercise designed to increase
   Resilience through large-scale fault
   injection across critical systems.

   Part of a larger discipline called
   Resilience Engineering.

   Not new, just new to us ;-)
define:
 Resilience
   Resilience is the ability
   of a System to adapt to
   changes, failures, &
   disturbances.
define:
 System
   People
   Culture
   Processes
   Applications & Services
   Infrastructure
   Software
   Hardware
This will be on the test:
Resilience is a product of
    People & Culture


Copyright © 2010 Opscode, Inc - All Rights Reserved
This will be on the test:
FAILURE HAPPENS!
“multiple & unexpected interactions of
    failures are inevitable”
                       -Charles Perrow
Catastrophic Potential
[Figure: 2x2 grid of Coupling (Loose to Tight) against Complexity (Simple to Complex); the Tight + Complex quadrant is marked "KEEP OUT!!!"]
Created by Jesse Robbins
"Catastrophic Potential" adapted from Normal Accidents by Charles Perrow
define:
 The Nines (roughly, downtime per year)
   99%       5256 min (3.65 days)
   99.9%     526 min (8.8 hours)
   99.99%    53 min
   99.999%   5 min
   99.9999%  32 seconds
   99.99999% 3 seconds
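The table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the figures are downtime per 365-day year; the function name `downtime_minutes_per_year` is illustrative, not from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines}% -> {minutes:.2f} min/year")
```

Each extra nine cuts the yearly downtime budget by a factor of ten.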
99.9% *
99.9% *
99.9%
   =
99.7% (oops!)
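The arithmetic behind the "oops" can be sketched as follows, assuming the three components fail independently and the service needs all of them up at once (neither assumption is stated on the slide). The helper name `serial_availability` is hypothetical.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


combined = serial_availability(0.999, 0.999, 0.999)
print(f"{combined:.2%}")  # three "three nines" parts in series: about 99.7%
```

Every dependency added in series lowers the ceiling on the availability of the whole.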
MTTR > MTBF
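One way to read this slogan, assuming the textbook steady-state formula availability = MTBF / (MTBF + MTTR), which the slides do not state: halving recovery time buys the same availability as doubling the time between failures. The numbers below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=1000, mttr_hours=1.0)       # hypothetical starting point
faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)  # halve MTTR
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1.0)   # double MTBF

print(f"baseline:        {baseline:.6f}")
print(f"halve MTTR:      {faster_recovery:.6f}")
print(f"double MTBF:     {fewer_failures:.6f}")
```

Under this model the two improvements are equivalent, so the choice comes down to which one is cheaper to achieve and practice.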


I ❤ MTTR
GameDay


Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
http://www.flickr.com/photos/dnorman/2678090600
Useful Ops Personality Defects
   25% Pyromaniac
   75% Paranoid
setting a good example
GameDay increases Resilience in 3 ways
 Preparation
  ‣ Identification and mitigation of risks and impact from
    failure
  ‣ Reduces frequency of failure (raises MTBF)
  ‣ Reduces duration of recovery (lowers MTTR)
 Participation
  ‣ Builds confidence & competence responding to failure
    and under stress.
  ‣ Strengthens individual and cultural ability to anticipate,
    mitigate, respond to, and recover from failures of all
    types.
 Exercises
  ‣ Trigger and expose “latent defects”
  ‣ Choose when to discover them, instead of letting that be
    determined by the next real disaster.
start small...
http://www.flickr.com/photos/oakleyoriginals/5674150237
increase awareness
http://www.flickr.com/photos/maunzy/5099921731
build confidence
http://www.flickr.com/photos/skevbo/4864249944
full scale, live fire exercises
http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
safety standards & “building codes”
http://www.flickr.com/photos/peregrinari/3801964067
GameDay increases Resilience in 3 ways
 Preparation
  ‣ Identification and mitigation of risks and impact from
    failure
  ‣ Reduces frequency of failure (raises MTBF)
  ‣ Reduces duration of recovery (lowers MTTR)
 Participation
  ‣ Builds confidence & competence responding to failure
    and under stress.
  ‣ Strengthens individual and cultural ability to anticipate,
    mitigate, respond to, and recover from failures of all
    types.
 Exercises
  ‣ Trigger and expose “latent defects”
  ‣ Choose when to discover them, instead of letting that be
    determined by the next real disaster.
no substitutes for experience...
 Failure free operations require
 experience with failure.
Ana Grillo © Ana Grillo Photography
The “OODA” Loop
Observe, Orient, Decide, Act



OODA: Observe, Orient, Decide, Act




             http://en.wikipedia.org/wiki/OODA_loop




“You don’t choose the moment,
  the moment chooses you.

You only choose how prepared
    you are when it does.”
             -Fire Chief Mike Burtch




Jesse Robbins
Cofounder, Opscode

@jesserobbins
jesse@opscode.com




Please See:	

 John Allspaw
  ‣ Resilience Engineering:
    http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/

  ‣ Advanced Post Mortem Fu:
    http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011




 Dr. Richard Cook
  ‣ How Complex Systems Fail
    http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf





GameDay: Creating Resiliency Through Destruction - LISA11