GameDay: Creating Resiliency Through Destruction - LISA11

  1. Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
  2. Join Us!!!
  3. “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
  4. Operations is Work that Matters
  5. GameDay
  6. define: GameDay. An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
  7. define: Resilience. Resilience is the ability of a System to adapt to changes, failures, & disturbances.
  8. define: System. People, Culture, Processes, Applications & Services, Infrastructure, Software, Hardware
  9. This will be on the test: Resilience is a product of People & Culture
  10. Copyright © 2010 Opscode, Inc - All Rights Reserved
  11. This will be on the test: FAILURE HAPPENS!
  12. “multiple & unexpected interactions of failures are inevitable” -Charles Perrow
  13. [Figure: "Catastrophic Potential" chart. Axes: Complexity (Simple to Complex) and Coupling (Loose to Tight), with a region marked "KEEP OUT!!!". Created by Jesse Robbins, adapted from Normal Accidents by Charles Perrow.]
  14. define: The Nines (roughly, as downtime per year). 99%: 5256 min (3.65 days); 99.9%: 526 min (8.8 hours); 99.99%: 53 min; 99.999%: 5 min; 99.9999%: 30 seconds; 99.99999%: 3 seconds
  15. 99.9% * 99.9% * 99.9% = 99.7% (oops!)
  16. MTTR > MTBF
  17. I ❤ MTTR
  19. GameDay. Slide courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  20. Useful Ops Personality Defects: 25% Pyromaniac, 75% Paranoid
  21. setting a good example
  22. GameDay increases Resilience in 3 ways. Preparation: ‣ Identification and mitigation of risks and impact from failure ‣ Reduces frequency of failure (increases MTBF) ‣ Reduces duration of recovery (MTTR). Participation: ‣ Builds confidence & competence responding to failure and under stress ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types. Exercises: ‣ Trigger and expose “latent defects” ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
  23. start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
  24. increase awareness http://www.flickr.com/photos/maunzy/5099921731
  25. build confidence http://www.flickr.com/photos/skevbo/4864249944
  26. full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
  27. safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
  28. GameDay increases Resilience in 3 ways. Preparation: ‣ Identification and mitigation of risks and impact from failure ‣ Reduces frequency of failure (increases MTBF) ‣ Reduces duration of recovery (MTTR). Participation: ‣ Builds confidence & competence responding to failure and under stress ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types. Exercises: ‣ Trigger and expose “latent defects” ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
  29. no substitutes for experience... Failure-free operations require experience with failure. Photo © Ana Grillo Photography
  30. The “OODA” Loop: Observe, Orient, Decide, Act
  31. OODA: Observe, Orient, Decide, Act http://en.wikipedia.org/wiki/OODA_loop
  32. “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
  33. Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
  34. Please See: John Allspaw: ‣ Resilience Engineering: http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/ ‣ Advanced Post Mortem Fu: http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011 Dr. Richard Cook: ‣ How Complex Systems Fail: http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
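
The availability arithmetic on slides 14 and 15 can be checked with a short sketch. This is not from the talk: the function names are hypothetical, a 365-day year is assumed, and the product rule assumes the components sit in series (every one must be up) and fail independently.

```python
# Minimal sketch of "the nines" arithmetic; names and assumptions are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year, in minutes, for an availability fraction (0.999 = 99.9%)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Downtime budget for each level of "nines" listed on slide 14.
    for availability in (0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
        minutes = downtime_minutes_per_year(availability)
        print(f"{availability * 100:.5f}% -> {minutes:.2f} min/year")

    # Three 99.9% services in series (slide 15).
    combined = serial_availability(0.999, 0.999, 0.999)
    print(f"combined: {combined * 100:.2f}%")
```

Chaining three 99.9% components gives roughly 99.7%, which is about three times the yearly downtime budget of any single component.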
