GameDay: Creating Resiliency Through Destruction - LISA11
Jesse Robbins (Cofounder of Opscode) explains GameDay, an exercise designed to increase Resilience through large-scale fault injection across critical systems.

Published in: Technology, Business

  1. Jesse Robbins, Cofounder, Opscode - @jesserobbins - jesse@opscode.com
  2. Join Us!!!
  3. “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
  4. Operations is Work that Matters
  5. GameDay
  6. define: GameDay
      An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
  7. define: Resilience
      Resilience is the ability of a System to adapt to changes, failures, & disturbances.
  8. define: System
      People, Culture, Processes, Applications & Services, Infrastructure, Software, Hardware
  9. This will be on the test: Resilience is a product of People & Culture
  10. Copyright © 2010 Opscode, Inc - All Rights Reserved
  11. This will be on the test: FAILURE HAPPENS!
  12. “multiple & unexpected interactions of failures are inevitable” -Charles Perrow
  13. Catastrophic Potential (chart): Complexity axis from Simple to Complex, Coupling axis from Loose to Tight, with a region marked “KEEP OUT!!!”. Created by Jesse Robbins; "Catastrophic Potential" adapted from Normal Accidents by Charles Perrow.
  14. define: The Nines (roughly), as allowed downtime per year
      99%         5256 min (3.5 days)
      99.9%       528 min (8.8 hours)
      99.99%      53 min
      99.999%     5 min
      99.9999%    30 seconds
      99.99999%   3 seconds
  15. 99.9% * 99.9% * 99.9% = 99.7% (oops!)
  16. MTTR > MTBF
  17. I ❤ MTTR
  18. Copyright © 2010 Opscode, Inc - All Rights Reserved
  19. GameDay. Slide courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr (http://www.flickr.com/photos/dnorman/2678090600)
  20. Useful Ops Personality Defects: 25% Pyromaniac, 75% Paranoid
  21. setting a good example
  22. GameDay increases Resilience in 3 ways
      Preparation
        ‣ Identification and mitigation of risks and impact from failure
        ‣ Reduces frequency of failure (raises MTBF)
        ‣ Reduces duration of recovery (lowers MTTR)
      Participation
        ‣ Builds confidence & competence responding to failure and under stress.
        ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
        ‣ Trigger and expose “latent defects”
        ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
  23. start small... (http://www.flickr.com/photos/oakleyoriginals/5674150237)
  24. increase awareness (http://www.flickr.com/photos/maunzy/5099921731)
  25. build confidence (http://www.flickr.com/photos/skevbo/4864249944)
  26. full scale, live fire exercises (http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html)
  27. safety standards & “building codes” (http://www.flickr.com/photos/peregrinari/3801964067)
  28. GameDay increases Resilience in 3 ways
      Preparation
        ‣ Identification and mitigation of risks and impact from failure
        ‣ Reduces frequency of failure (raises MTBF)
        ‣ Reduces duration of recovery (lowers MTTR)
      Participation
        ‣ Builds confidence & competence responding to failure and under stress.
        ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
        ‣ Trigger and expose “latent defects”
        ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
  29. no substitutes for experience... Failure-free operations require experience with failure. (© Ana Grillo Photography)
  30. The “OODA” Loop: Observe, Orient, Decide, Act
  31. OODA: Observe, Orient, Decide, Act (http://en.wikipedia.org/wiki/OODA_loop)
  32. “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
  33. Jesse Robbins, Cofounder, Opscode - @jesserobbins - jesse@opscode.com
  34. Please See:
      John Allspaw
        ‣ Resilience Engineering: http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/
        ‣ Advanced Post Mortem Fu: http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011
      Dr. Richard Cook
        ‣ How Complex Systems Fail: http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
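
The arithmetic behind the "Nines", "99.9% * 99.9% * 99.9%", and "MTTR > MTBF" slides can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names are hypothetical, a 365-day year is assumed, components are assumed to fail independently and to all be required (a serial chain), and availability is modeled with the standard steady-state formula MTBF / (MTBF + MTTR). The slide's downtime figures are rounded; under these assumptions 99% works out to 5,256 minutes, or about 3.65 days, per year.

```python
"""Availability arithmetic for "the nines", chained components, and MTBF/MTTR."""

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes the components fail independently of one another.
    """
    combined = 1.0
    for availability in components:
        combined *= availability
    return combined


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; MTBF and MTTR must share a time unit."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(percent / 100)
        print(f"{percent}% -> {minutes:.1f} min of downtime per year")

    chain = serial_availability(0.999, 0.999, 0.999)
    print(f"three 99.9% components in series: {chain:.2%}")

    # Hypothetical numbers, in hours, chosen only for illustration.
    baseline = steady_state_availability(mtbf=1000, mttr=1)
    faster_recovery = steady_state_availability(mtbf=1000, mttr=0.5)
    fewer_failures = steady_state_availability(mtbf=2000, mttr=1)
    print(f"baseline:      {baseline:.5%}")
    print(f"halve MTTR:    {faster_recovery:.5%}")
    print(f"double MTBF:   {fewer_failures:.5%}")
```

Under this model, halving MTTR yields exactly the same availability as doubling MTBF, which is one way to read the slides' emphasis on recovery time: shortening recovery is a lever of equal weight to preventing failures, and it is the one a GameDay exercises directly.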
