GameDay: Creating Resiliency Through Destruction - LISA11

Jesse Robbins (Cofounder of Opscode) explains GameDay, an exercise designed to increase Resilience through large-scale fault injection across critical systems.

    GameDay: Creating Resiliency Through Destruction - LISA11 Presentation Transcript

    • Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
    • Join Us!!!
    • “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
    • Operations is Work that Matters
    • GameDay
    • define: GameDay. An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
    • define: Resilience. Resilience is the ability of a System to adapt to changes, failures, & disturbances.
    • define: System. People, Culture, Processes, Applications & Services, Infrastructure, Software, Hardware.
    • This will be on the test: Resilience is a product of People & Culture
    • Copyright © 2010 Opscode, Inc - All Rights Reserved
    • This will be on the test: FAILURE HAPPENS!
    • “multiple & unexpected interactions of failures are inevitable” -Charles Perrow
    • [Figure: “Catastrophic Potential” chart. Axes: Complexity (Simple to Complex) and Coupling (Loose to Tight), with a region marked “KEEP OUT!!!”. Created by Jesse Robbins, adapted from Normal Accidents by Charles Perrow.]
    • define: The Nines (roughly; downtime per year)
        ‣ 99%: 5256 min (3.5 days)
        ‣ 99.9%: 528 min (8.8 hours)
        ‣ 99.99%: 53 min
        ‣ 99.999%: 5 min
        ‣ 99.9999%: 30 seconds
        ‣ 99.99999%: 3 seconds
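The figures on this slide are labeled "roughly", and they can be recomputed from the availability percentage. The sketch below assumes a 365-day year (525,600 minutes); the function name `downtime_minutes_per_year` is made up for illustration and is not from the talk. Exact results differ slightly from the slide's rounded values (for example, 99% works out to about 3.65 days, and 99.9% to about 525.6 minutes).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        minutes = downtime_minutes_per_year(nines)
        # Print both minutes and seconds so the small budgets stay readable.
        print(f"{nines}% -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```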
    • 99.9% * 99.9% * 99.9% = 99.7% (oops!)
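The "oops" is the arithmetic of chained dependencies: when a request needs every component to be up, the availabilities multiply. A minimal sketch, assuming the components fail independently (an assumption, not something the slide states); `serial_availability` is a hypothetical helper name.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components)


if __name__ == "__main__":
    combined = serial_availability([0.999, 0.999, 0.999])
    print(f"three 99.9% components in series: {combined:.4%}")
```

Each extra dependency in the chain lowers the ceiling further, which is why three "three nines" services together deliver only about 99.7%.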
    • MTTR > MTBF
    • I ❤ MTTR
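One way to read these two slides is through the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula is not on the slides, and the numbers below are invented for illustration. Under that formula, halving recovery time buys exactly the same availability as doubling the time between failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only: fails every 1000 h, takes 4 h to recover.
    baseline = availability(1000, 4)
    double_mtbf = availability(2000, 4)  # fail half as often
    halve_mttr = availability(1000, 2)   # recover twice as fast

    print(f"baseline:    {baseline:.5%}")
    print(f"double MTBF: {double_mtbf:.5%}")
    print(f"halve MTTR:  {halve_mttr:.5%}")
```

The arithmetic only shows the two improvements are equivalent; the preference for MTTR is the speaker's position, presumably because faster recovery is often the more attainable of the two.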
    • GameDay. Slide courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
    • Useful Ops Personality Defects: 25% Pyromaniac, 75% Paranoid
    • setting a good example
    • GameDay increases Resilience in 3 ways
      Preparation
        ‣ Identification and mitigation of risks and impact from failure
        ‣ Reduces frequency of failure (MTBF)
        ‣ Reduces duration of recovery (MTTR)
      Participation
        ‣ Builds confidence & competence responding to failure and under stress.
        ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
        ‣ Trigger and expose “latent defects”
        ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
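The "Exercises" point, triggering latent defects on purpose, can be illustrated at toy scale. The sketch below is not how GameDay is run (the talk describes large-scale exercises across real systems); it is a minimal, hypothetical fault-injection wrapper around a stand-in backend. The names `FaultInjector`, `call_with_retries`, and `run_drill` are all invented for this example.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a fraction of calls fail (hypothetical helper)."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a drill is repeatable

    def __call__(self, *args, **kwargs):
        # random() is in [0, 1), so rate 0.0 never fails and rate 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def call_with_retries(func, attempts=3):
    """Call func up to `attempts` times; re-raise if the final attempt fails."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise


def run_drill(failure_rate, requests=1000, attempts=3):
    """Return the fraction of requests that succeed under injected faults."""
    backend = FaultInjector(lambda: "ok", failure_rate)
    succeeded = 0
    for _ in range(requests):
        try:
            call_with_retries(backend, attempts)
            succeeded += 1
        except ConnectionError:
            pass  # the client gave up: a failure the drill has exposed
    return succeeded / requests


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.9):
        print(f"failure rate {rate:.0%}: {run_drill(rate):.1%} of requests succeeded")
```

Running the drill at increasing failure rates shows where the retry budget stops covering for the backend, which is the kind of weakness an exercise is meant to surface before a real outage does.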
    • start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
    • increase awareness http://www.flickr.com/photos/maunzy/5099921731
    • build confidence http://www.flickr.com/photos/skevbo/4864249944
    • full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
    • safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
    • GameDay increases Resilience in 3 ways
      Preparation
        ‣ Identification and mitigation of risks and impact from failure
        ‣ Reduces frequency of failure (MTBF)
        ‣ Reduces duration of recovery (MTTR)
      Participation
        ‣ Builds confidence & competence responding to failure and under stress.
        ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
        ‣ Trigger and expose “latent defects”
        ‣ Choose to discover them, instead of letting that be determined by the next real disaster.
    • no substitutes for experience... Failure free operations require experience with failure. (Photo © Ana Grillo Photography)
    • The “OODA” Loop: Observe, Orient, Decide, Act
    • OODA: Observe, Orient, Decide, Act http://en.wikipedia.org/wiki/OODA_loop
    • “You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” -Fire Chief Mike Burtch
    • Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
    • Please See:
      John Allspaw
        ‣ Resilience Engineering: http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/
        ‣ Advanced Post Mortem Fu: http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011
      Dr. Richard Cook
        ‣ How Complex Systems Fail: http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf