GameDay: Creating Resiliency Through Destruction - LISA11
Dec. 7, 2011
Jesse Robbins (Cofounder of Opscode) explains GameDay, an exercise designed to increase Resilience through large-scale fault injection across critical systems.
6. define:
GameDay
An exercise designed to increase
Resilience through large-scale fault
injection across critical systems.
Part of a larger discipline called
Resilience Engineering.
Not new, just new to us ;-)
7. define:
Resilience
Resilience is the ability
of a System to adapt to
changes, failures, &
disturbances.
8. define:
System
People
Culture
Processes
Applications & Services
Infrastructure
Software
Hardware
9. This will be on the test:
Resilience is a product of
People & Culture
14. Catastrophic Potential
[Figure: chart with one axis for Complexity (Simple to Complex) and one for Coupling (Loose to Tight); a region of the chart is marked "KEEP OUT!!!"]
Created by Jesse Robbins
"Catastrophic Potential" adapted from Normal Accidents by Charles Perrow
15. define:
The Nines (roughly)
Availability   Downtime per year
99%            5256 min (3.65 days)
99.9%          526 min (8.8 hours)
99.99%         53 min
99.999%        5 min
99.9999%       30 seconds
99.99999%      3 seconds
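The figures on this slide can be reproduced with a little arithmetic. The sketch below is not from the talk; it assumes a 365-day year (525,600 minutes) and treats the downtime budget as the unavailable fraction of that year. The function name `downtime_minutes_per_year` is made up for illustration.

```python
# Assumption: a non-leap year of 365 days, i.e. 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for each target listed on the slide.
    for target in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
        minutes = downtime_minutes_per_year(float(target))
        print(f"{target}%: {minutes:.2f} min ({minutes * 60:.1f} s)")
```

Each extra nine divides the budget by ten, which is why the later rows are easier to read in seconds than in minutes.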
23. GameDay increases Resilience in 3 ways
Preparation
‣ Identification and mitigation of risks and impact from
failure
‣ Reduces frequency of failure (raises MTBF)
‣ Reduces duration of recovery (MTTR)
Participation
‣ Builds confidence & competence responding to failure
and under stress.
‣ Strengthens individual and cultural ability to anticipate,
mitigate, respond to, and recover from failures of all
types.
Exercises
‣ Trigger and expose “latent defects”
‣ Choose to discover them, instead of letting that be
determined by the next real disaster.
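The slide names MTBF and MTTR but gives no formula. One common way to relate them is the steady-state availability ratio MTBF / (MTBF + MTTR); the sketch below uses that ratio to show how fewer failures and faster recovery both raise availability. The function name and the hour figures are hypothetical, chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus repair time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical baseline: one failure every 30 days, 4 hours to recover.
    baseline = availability(mtbf_hours=720, mttr_hours=4)
    # Same failure rate, but recovery takes half as long (lower MTTR).
    faster_recovery = availability(mtbf_hours=720, mttr_hours=2)
    # Same recovery time, but failures are half as frequent (higher MTBF).
    fewer_failures = availability(mtbf_hours=1440, mttr_hours=4)

    print(f"baseline:        {baseline:.4%}")
    print(f"faster recovery: {faster_recovery:.4%}")
    print(f"fewer failures:  {fewer_failures:.4%}")
```

Under this ratio, halving MTTR and doubling MTBF yield the same availability, so either lever can be pulled depending on which is cheaper to improve.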
35. Please See:
John Allspaw
‣ Resilience Engineering:
http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/
‣ Advanced Post Mortem Fu:
http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011
Dr. Richard Cook
‣ How Complex Systems Fail:
http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf