Brighttalk outage insurance- what you need to know - final

Published in: Technology, Health & Medicine
Transcript of "Brighttalk outage insurance- what you need to know - final"

  1. 1. Outage Insurance: Everything You Need to Know
  2. 2. Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions including the leader of the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command. Andrew White Cloud and Smarter Infrastructure Solution Specialist IBM Corporation
  3. 3. http://weheartit.com/entry/12433848
  4. 4. Ground rules for this session… •  If you can’t tell if I am trying to be funny… –  GO AHEAD AND LAUGH! •  Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees •  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
  5. 5. I am here today to share some of what I have learned about
  6. 6. We (IT) sell promises… The value of these promises depends on the customer’s perception that we are willing and capable of making good on the promise when the time comes. This perception is affected by the interactions they have with us.
  7. 7. http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/ Objective #1: Users Love Our IT Systems…
  8. 8. Anatomy of an Outage. Components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS, Database, WAS, Database, zOS MQ, DB2. Timeline: 5:45-ish pm: CICS ABENDs start flooding the console, but not at a level high enough to ticket. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics. 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue.
  9. 9. http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/ Bad Experience!!!
  10. 10. http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg Why did this happen?
  11. 11. Why is problem solving hard? Non-transparency (lack of clarity of the situation): commencement opacity, continuation opacity. Polytely (multiple goals): inexpressiveness, opposition, transience. Complexity (large numbers of items, interrelations, and decisions): enumerability, connectivity (hierarchy relation, communication relation, allocation relation), heterogeneity. Dynamics (time considerations): temporal constraints, temporal sensitivity, phase effects, dynamic unpredictability.
  12. 12. Boyd’s Loop (diagram: Observe, Orient, Decide, Act). Diagram labels: Observation, Outside Information, Unfolding Circumstances, Unfolding Interaction With Environment, Implicit Guidance & Control, Feedback, Feed Forward, Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information, Decision (Hypothesis), Action (Test). •  Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window. •  Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection. From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
  13. 13. Where the Breakdown Occurs. Observe, Orient, Decide, Act. Situational Awareness: Level 1, Perception of Elements in Current Situation; Level 2, Comprehension of Current Situation; Level 3, Projection of Future Status. Then Decision and Performance of Actions, with Feedback to the Current State. Individual Influences: Goals & Objectives, Preconceptions, Expectations, Abilities, Experience, Training, Long Term Memory, Automaticity, Cognitive Processes. Systemic Influences: System Capability, Interface Design, Stress & Workload, Complexity, Automation. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  14. 14. Incident Life Cycle Down Time Detection Time Response Time Repair Time Recovery Time Outage Detection Diagnosis Repair Recover Restore Observe Orient Decide Act
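The life cycle above splits down time into phases that can be measured once each milestone has a timestamp. A minimal sketch, using the approximate clock times from the Anatomy of an Outage slide; the calendar date is a placeholder, and treating these particular events as phase boundaries is an assumption rather than something the deck states.

```python
from datetime import datetime, timedelta

# Approximate clock times from the outage timeline slide. The date is a
# placeholder, and using these events as phase boundaries is an assumption.
DAY = "2000-01-01"
events = [
    ("CICS ABENDs start", "17:45"),
    ("Synthetic transactions fail", "18:04"),
    ("P0 incident created", "18:14"),
    ("Back-end problem identified", "18:54"),
    ("CICS reset", "22:29"),
]


def parse(clock: str) -> datetime:
    """Turn an HH:MM clock time into a datetime on the placeholder day."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")


def phase_durations(timeline):
    """Return (from_event, to_event, elapsed) for each consecutive pair."""
    stamped = [(name, parse(clock)) for name, clock in timeline]
    return [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(stamped, stamped[1:])
    ]


def total_elapsed(timeline) -> timedelta:
    """Elapsed time between the first and last event."""
    return parse(timeline[-1][1]) - parse(timeline[0][1])


for start, end, elapsed in phase_durations(events):
    print(f"{start} -> {end}: {elapsed}")
print("Total:", total_elapsed(events))
```

With these inputs the longest single gap is the one between identifying a back-end problem and resetting CICS, which is the kind of finding the rest of the deck tries to shorten.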
  15. 15. Problem Life Cycle: Evaluation, Recognition, Observation, Analysis, Solution, Validation, Control
  16. 16. Point of Observation Past Behavior • The observation period used to feed the forecasting models Future Behavior • The performance period the model is trying to predict Predictive Modeling Timeline
  17. 17. Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
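As one illustration of an observation period feeding a forecast, the sketch below fits a straight line to past free-space readings and projects when the drive would reach zero. The readings, the hourly sampling, and the choice of ordinary least squares are all assumptions made for the example; the deck does not prescribe a particular model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hours_until_empty(hours, free_gb):
    """Project how long after the last reading free space reaches zero.

    Returns None when the trend is flat or growing.
    """
    intercept, slope = fit_line(hours, free_gb)
    if slope >= 0:
        return None
    zero_at = -intercept / slope
    return zero_at - hours[-1]


# Hypothetical observation period: free space sampled once an hour.
hours = [0, 1, 2, 3, 4]
free_gb = [50, 48, 46, 44, 42]
print(hours_until_empty(hours, free_gb))
```

A real drive rarely fills at a constant rate, so a projection like this is best treated as an early warning to investigate rather than a deadline.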
  18. 18. What Matters Most? Dr. Lee Goldman, Cook County Hospital, Chicago, IL §  Is the patient feeling unstable angina? §  Is there fluid in the patient’s lungs? §  Is the patient’s systolic blood pressure below 100? The Goldman Algorithm. Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours (scale 0 to 100), Traditional Techniques vs. Goldman Algorithm. By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  19. 19. The Goldman Algorithm ECG Evidence of Acute Ischemia? ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age) Observation Unit Inpatient Telemetry Unit High Risk Low Risk Very Low Risk Moderate Risk Yes No Coronary Care Unit No ECG Evidence of Acute Myocardial Infarction (MI)? ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age) Yes Patient suspected of Acute Cardiac Ischemia Perform Electrocardiogram (EKG) 0 Factors 2 or 3 Factors 1 Factor 0 or 1 Factors 2 or 3 Factors Urgent Factors Present? Rales Above Both Lung Bases Systolic Blood Pressure <100 mm Hg Unstable Ischemic Heart Disease Urgent Factors Present? Rales Above Both Lung Bases Systolic Blood Pressure <100 mm Hg Unstable Ischemic Heart Disease
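The point of the flowchart for this talk is how few inputs it needs: two ECG questions and a count of the three urgent factors. The sketch below is one reading of the flattened chart; the arrows did not survive in the transcript, so the assignment of each branch to a risk level is an assumption, and the function name and arguments are hypothetical. It is an illustration of a small decision rule, not clinical guidance.

```python
def goldman_risk(ecg_mi: bool, ecg_ischemia: bool, urgent_factors: int) -> str:
    """Classify risk from two ECG findings and a count of urgent factors.

    Branch-to-risk mapping is an assumed reading of the slide's flowchart.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_mi:
        return "high"
    if ecg_ischemia:
        return "high" if urgent_factors >= 2 else "moderate"
    if urgent_factors >= 2:
        return "moderate"
    if urgent_factors == 1:
        return "low"
    return "very low"


print(goldman_risk(ecg_mi=False, ecg_ischemia=True, urgent_factors=1))
print(goldman_risk(ecg_mi=False, ecg_ischemia=False, urgent_factors=0))
```

The same shape carries over to operations: a handful of well-chosen indications can sort an incident into the right response path faster than a flood of undifferentiated alerts.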
  20. 20. First… … we need to talk a little bit about your brain
  21. 21. The Triune Brain Reptilian Brain (basal ganglia) Mammalian Brain (limbic system) Cognitive Brain (neocortex)
  22. 22. Our Thought Process *** not very reliable Cognition Limbic Center (hippocampus and amygdala) Cortex (hippocampus and amygdala) Conscious Choice (via motor centers) Most primitive, seat of unconscious Long-term memory Conscious, meaning, choice Perception (via the senses)*** Pre-Frontal Cortex (hippocampus and amygdala) Stimulus
  23. 23. Short Term Memory Your Brain Working Memory Understanding Judgement Relationship Short-term memory is where the real work of sense-making takes place Short-term memory has a limited amount of space (The estimate is 7 ± 2)
  24. 24. The big-data dilemma Time Quantity Information the brain can consume
  25. 25. Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University
  26. 26. From Data to Wisdom. Information: Patterns, Comparisons, Organization. Intelligence: Decisions, Skill, Adaptation. Knowledge: Trends, Generalizations, Beliefs. Wisdom: Accountability, Foresight, Synthesis. Data: Symbols, Metrics, Facts. Diagram labels: Correlation, Analysis, Application, Understanding, Complexity, Context, Communication, Repetition.
  27. 27. Data, Information, Knowledge (plot of y against x with the fitted line $y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$)
  28. 28. Past Future Abstract Tangible Information Intelligence Knowledge Wisdom Data Knowledge is the point of transition Why Knowledge?
  29. 29. All You Need Love
  30. 30. 1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64. Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.
  31. 31. Our success in any endeavor depends directly on our ability to solve problems What do we need to do that?
  32. 32. You Gotta Have Skillz…!
  33. 33. Common Problem Types §  Design Problems §  Creative Problems §  Daily Problems §  People Problems Rule-Based Approach Event Based Approach
  34. 34. The Problem with the Rules-Based Approach •  Solutions are driven by accepted conventions •  Best practices are coveted and are adopted without understanding how and why they were developed •  There must always be a right answer •  No logical analysis is required •  People are frequently seen as the “root cause” •  The outcomes are enforced using “re-dos” and punitive actions (or the looming threat of these things)
  35. 35. Event-Based Problem Solving •  Appreciative Understanding •  Know What We Are Solving •  Create A Common Reality •  Solutions Based on Causes
  36. 36. The Pre-Mortem Process Define the Problem Chart the Causal Relationships and Add Evidence Identify Solutions Implement the Solutions
  37. 37. Step 1: Define the Problem
  38. 38. Problem Definition •  What: •  When: Date/Time: Relative: what was happening at the time of this event? •  Where: Specific: Relative: logical dependencies? •  Significance: availability: environment: costs: revenue maintenance? other miscellaneous costs frequency:
  39. 39. Gut Check… You should be able to answer all of the following: •  Why are we working on this? •  How much time should we spend? •  What people do we need? •  How much money should we spend?
  40. 40. The What Statement •  It is used as “The Primary Effect (PE)” – It is a statement of what we want to prevent from happening again •  There may be more than one – If they are unrelated, perform separate RCAs – If they are related and you can’t decide which to use, pick the one that is nearest to the present time •  Noun/verb statement
  41. 41. Step 2: Add Causal Relationships and Evidence
  42. 42. The T: Drive reached 0 Bytes free The database stopped processing queries The application server was timing out Users were getting 500 errors on the website Customers called the helpdesk to complain Add more hard drive space Have you seen something like this before? What do we really know?
  43. 43. It’s never that simple Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND-
  44. 44. Rules for Causal Relationships Database Down (Effect) Drive Full (Cause/Effect) Logs Not Truncated (Cause) ①  Causes are effects, and effects are causes
  45. 45. Rules for Causal Relationships End of the Universe (Effect) Database Down (Primary Effect) Drive Full (Cause/Effect) Logs Not Truncated (Cause/Effect) Beginning of Time (Cause) ②  You can keep identifying causes – there is no limit
  46. 46. Two Important Questions End of the Universe (Effect) Database Down (Primary Effect) Drive Full (Cause/Effect) Logs Not Truncated (Cause/Effect) Beginning of Time (Cause) Ask “Why?” Ask “What”
  47. 47. Rules for Causal Relationships ③  An Effect is often the result of multiple causes SQL Server was not processing queries (Effect) Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control -AND- -AND- -AND-
  48. 48. Rules for Causal Relationships ④  Causes need to be both necessary and sufficient SQL Server was not processing queries (Effect) Transaction log was unable to grow (Transitory Cause) T: Drive at 0 Bytes free (Non-transitory Cause & Effect) Logs were not truncated (Transitory Cause & Effect) DBA on honeymoon vacation in Fiji (Transitory Cause) Logs are truncated manually (Non-Transitory Cause) Company has only 1 DBA (Non-Transitory Cause) “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause) Space allocations are fixed (Non-Transitory Cause) Lack of Control -AND- -AND- -AND-
  49. 49. How Fire Works Time Oxygen Heat Fuel Fire Match Strike Transitory Non-Transitory Fire Oxygen Heat Fuel Match Strike -AND- • Transitory Causes act as catalysts to bring about change (think Transition) • Non-Transitory Causes are objects, properties/attributes, and status
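The causal rules can be held in a small tree where every effect lists the causes joined to it by -AND-, each tagged as transitory or non-transitory. The sketch below is a hypothetical data structure, not something from the deck: the nesting of the SQL Server example is a guess at the original diagram, and using "each effect has at least one cause of each kind" as a completeness check is an assumption drawn from the fire example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cause:
    label: str
    transitory: Optional[bool] = None  # None for the primary effect itself
    causes: List["Cause"] = field(default_factory=list)  # joined by -AND-


def why_chain(node, depth=0):
    """Ask "Why?" repeatedly: list each effect followed by its causes."""
    lines = ["  " * depth + node.label]
    for cause in node.causes:
        lines.extend(why_chain(cause, depth + 1))
    return lines


def missing_kinds(node):
    """Labels of effects whose listed causes are not a mix of both kinds."""
    gaps = []
    if node.causes:
        kinds = {cause.transitory for cause in node.causes}
        if kinds != {True, False}:
            gaps.append(node.label)
        for cause in node.causes:
            gaps.extend(missing_kinds(cause))
    return gaps


# Assumed nesting of the slide's example.
primary_effect = Cause("SQL Server was not processing queries", causes=[
    Cause("Transaction log was unable to grow", True, [
        Cause("T: Drive at 0 Bytes free", False, [
            Cause("Logs were not truncated", True, [
                Cause("DBA on honeymoon vacation in Fiji", True),
                Cause("Logs are truncated manually", False),
                Cause("Company has only 1 DBA", False),
            ]),
            Cause("Space allocations are fixed", False),
        ]),
    ]),
])

print("\n".join(why_chain(primary_effect)))
print("More information needed for:", missing_kinds(primary_effect))
```

Effects flagged by `missing_kinds` are candidates for the "More Information Needed" boxes on the RCA diagram: the chart shows only one kind of cause for them so far.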
  50. 50. RCA Diagram Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND-
  51. 51. Add Evidence Customers Complaining Web Server returning 500 errors The application server was timing out SQL Server was not processing queries Transaction log was unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation Space allocations are fixed Lack of Control Only one database cluster in use DR SQL Cluster DR Cluster being used for UAT testing More Information Needed Only one application server exists More Information Needed Trying to do business on the website Desired Condition -AND- -AND- -AND- -AND- -AND- -AND- -AND- Statistical Data Situational Observation
  52. 52. Examples of Evidence •  Personal experience or observation •  Statistical data (Monitoring Metrics) •  Examples, particular events, or situations that illustrate •  Analogies (comparisons with similar situations) •  Informed opinion (the opinions of experts and authorities) •  Historical documentation •  Experimental evidence
  53. 53. Ideas for Finding Causes Causes Management Organization Process Knowledge Technology People Information Applications Infrastructure Capital
  54. 54. Step 3: Find Solutions
  55. 55. Failure Modes Analysis SQL Server Not Available Transaction log is unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation (Condition Cause) Space allocations are fixed (Condition Cause) Lack of Control SQL is unable to cache query results Available RAM at 0 Bytes Free C: Drive at 0 Bytes free Minidump is configured to write to C: Drive Server was ASRing frequently Software distributions were leaving files in the TEMP folder %TEMP% configured to C:\Temp Kernel able to write to page file -AND- -AND- -AND- -AND- -OR- -AND- -OR-
  56. 56. Picking Monitors SQL Server Not Available Transaction log is unable to grow T: Drive at 0 Bytes free Logs were not truncated DBA on honeymoon vacation in Fiji Logs are truncated manually Company has only 1 DBA “Backup” DBA was not aware the logs require truncation (Condition Cause) Space allocations are fixed (Condition Cause) Lack of Control SQL is unable to cache query results Available RAM at 0 Bytes Free C: Drive at 0 Bytes free Minidump is configured to write to C: Drive Server was ASRing frequently Software distributions were leaving files in the TEMP folder %TEMP% configured to C:\Temp Kernel able to write to page file -AND- -AND- -AND- -AND- -OR- -AND- -OR- Monitor the intersections at the “OR’s” At least one point along each branch after the “OR”
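The monitor-picking rule can be applied mechanically once the failure modes are stored as a tree with AND/OR gates. A minimal sketch under stated assumptions: the `Node` type and `pick_monitors` function are hypothetical, the tree is a simplified guess at how the slide's boxes connect, and "one point along each branch" is interpreted as the first node on each branch below an OR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    gate: str = "AND"  # how the child causes combine: "AND" or "OR"
    children: List["Node"] = field(default_factory=list)


def pick_monitors(node):
    """Pick each OR intersection plus the first node on every branch below it."""
    picks = []
    if node.gate == "OR" and node.children:
        picks.append(node.label)
        picks.extend(child.label for child in node.children)
    for child in node.children:
        for label in pick_monitors(child):
            if label not in picks:
                picks.append(label)
    return picks


# Simplified, assumed shape of the failure-mode tree from the slide.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", children=[
        Node("T: Drive at 0 Bytes free"),
    ]),
    Node("SQL is unable to cache query results", children=[
        Node("C: Drive at 0 Bytes free", "OR", [
            Node("Minidump is configured to write to C: Drive"),
            Node("Software distributions were leaving files in the TEMP folder"),
        ]),
    ]),
])

for label in pick_monitors(tree):
    print(label)
```

Monitoring at the OR points tells the support team which branch is in play, which is what lets them rule the other branches out quickly.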
  57. 57. FMEA Matrix (Impact Calculation) Severity: Negligible (1-2): no loss in functionality, mostly cosmetic; Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time; Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed; Serious (7-8): the problem will not resolve itself and no workaround is possible, functionality is impaired or lost but the system is usable to some extent; Catastrophic (9-10): the system is completely unusable. Frequency: Improbable (1-2): less than 1 time per year; Remote (3-4): 1 time per year; Occasional (5-6): 1 time per month; Probable (7-8): 1 time per day; Chronic (9-10): 1 or more times per day. Likelihood of detection: Very high (1-2): during the design phase; High (3-4): during peer review or unit testing; Moderate (5-6): during system testing or acceptance testing; Remote (7-8): during or immediately after production deployment; Very Remote (9-10): only after heavy usage by users.
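The slide gives three 1-10 scales but not the arithmetic that combines them. The sketch below assumes the usual FMEA convention of multiplying the three scores into a risk priority number; that formula, the function names, and the example scores are assumptions, while the band names come from the slide.

```python
SEVERITY_BANDS = ["Negligible", "Marginal", "Critical", "Serious", "Catastrophic"]
FREQUENCY_BANDS = ["Improbable", "Remote", "Occasional", "Probable", "Chronic"]
DETECTION_BANDS = ["Very high", "High", "Moderate", "Remote", "Very Remote"]


def band(score: int, names) -> str:
    """Map a 1-10 score onto the slide's five two-point bands."""
    if not 1 <= score <= 10:
        raise ValueError("scores run from 1 to 10")
    return names[(score - 1) // 2]


def risk_priority(severity: int, frequency: int, detection: int) -> int:
    """Assumed convention: risk priority number = severity x frequency x detection."""
    for score in (severity, frequency, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores run from 1 to 10")
    return severity * frequency * detection


# Hypothetical scoring of the "T: Drive at 0 Bytes free" failure mode.
severity, frequency, detection = 9, 4, 8
print(band(severity, SEVERITY_BANDS),
      band(frequency, FREQUENCY_BANDS),
      band(detection, DETECTION_BANDS))
print(risk_priority(severity, frequency, detection))
```

Ranking failure modes by a number like this is one way to decide which ones get monitoring and mitigation first.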
  58. 58. FMEA Matrix (Evidence) These are the events that help us to RULE IN a failure mode as a possible cause These are the events that help us RULE OUT the failure mode as not relevant
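Rule-in and rule-out evidence can be stored per failure mode and checked against whatever events have been observed so far. In this sketch the event names are made up, and letting rule-out evidence win over rule-in evidence is an assumed policy that a team may well choose differently.

```python
def triage(failure_modes, observed):
    """Classify each failure mode from the events observed so far.

    Assumed policy: any rule-out event wins over rule-in events.
    """
    result = {}
    for name, evidence in failure_modes.items():
        if evidence["rule_out"] & observed:
            result[name] = "ruled out"
        elif evidence["rule_in"] & observed:
            result[name] = "ruled in"
        else:
            result[name] = "needs investigation"
    return result


# Hypothetical evidence catalogue; every event name is illustrative.
failure_modes = {
    "Transaction log is unable to grow": {
        "rule_in": {"t_drive_full_alert"},
        "rule_out": {"t_drive_free_space_ok"},
    },
    "SQL is unable to cache query results": {
        "rule_in": {"ram_exhausted_alert"},
        "rule_out": {"ram_free_ok"},
    },
    "Only one application server exists": {
        "rule_in": {"app_server_down"},
        "rule_out": {"app_server_heartbeat_ok"},
    },
}

observed = {"t_drive_full_alert", "ram_free_ok"}
for name, status in triage(failure_modes, observed).items():
    print(f"{status}: {name}")
```

The output is the accounting an incident manager would read at the start of a bridge call: what is ruled in, what is ruled out, and what still needs someone assigned to it.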
  59. 59. Application-Technology Matrix Maps services, applications and technologies enabling: • Monitoring investment prioritization • Monitoring maturity • Which templates need to be deployed when new hardware is acquired • Whether a service has sufficient monitoring coverage based on its application components • This approach allows for anticipating changes to a customer’s monitoring needs Scores indicate: 0 – No Strategy 1 – Limited Monitoring 2 – Fully Integrated Strategy
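A coverage check against the matrix can be as small as a lookup. In the sketch below the technologies, their scores, and the service composition are hypothetical, and treating a technology that is missing from the matrix as score 0 is an assumption.

```python
# Scores use the slide's scale: 0 = No Strategy, 1 = Limited Monitoring,
# 2 = Fully Integrated Strategy. The technologies and scores are hypothetical.
TECH_SCORES = {"WAS": 2, "DB2": 1, "MQ": 0}


def coverage_gaps(components, scores, minimum=1):
    """Return the components of a service that score below `minimum`.

    A technology missing from the matrix is treated as score 0.
    """
    return [c for c in components if scores.get(c, 0) < minimum]


def has_sufficient_coverage(components, scores, minimum=1):
    """True when no component of the service falls below `minimum`."""
    return not coverage_gaps(components, scores, minimum)


order_entry = ["WAS", "MQ", "DB2"]  # hypothetical service
print(coverage_gaps(order_entry, TECH_SCORES))
print(has_sufficient_coverage(["WAS", "DB2"], TECH_SCORES))
```

Raising `minimum` to 2 turns the same check into a maturity report, listing every component that is not yet on a fully integrated strategy.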
  60. 60. Step 4: Use this knowledge intelligently
  61. 61. During Service Support •  Command Centers and Support Teams –  Use the failure modes to rule out causes –  Each failure mode will have a documented process to follow to mitigate the impact once the likely failure mode is identified •  Incident Managers –  Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated –  Coordinate the investigation assignments and consolidate the investigation results
  62. 62. Facilitating Production Assurance •  CritSits –  Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated –  Initiate investigations / experiments by assigning potential failure modes to the incident response teams •  Problem Management –  Document the causal elements as new failure modes –  Disseminate new failure modes to Architecture, the Monitoring Team, and the Command Center/Service Desk •  Reporting –  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects –  Incorporate failure modes into “Fault Line” analysis
  63. 63. During the Design Process •  Architects –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk –  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented •  Developers –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk –  Certify the designs implement the mitigation strategies
  64. 64. Improving Enterprise Processes and Tools •  Systems Management and Monitoring –  Develop new monitoring requirements using the documented indications and contraindications •  Event Management –  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently •  Configuration Management –  Develop new discovery patterns using the documented indications and contraindications –  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
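An automation that detects a failure-mode condition and raises an event might look like the sketch below, which uses the drive-space example from earlier in the deck. The event fields and threshold are hypothetical, and forwarding the event to a real Event Management System is left out because the deck names no product or API. The `usage` parameter exists only so the check can be exercised without touching a real disk.

```python
import shutil
from collections import namedtuple

# Stand-in for the (total, used, free) result of shutil.disk_usage, used below
# to exercise the check without depending on the state of a real disk.
Usage = namedtuple("Usage", "total used free")


def check_free_space(path, min_free_bytes, usage=shutil.disk_usage):
    """Return an event dict when free space on `path` is below the threshold.

    Returns None when the condition is not present. The event fields are
    hypothetical; a real deployment would match its event system's schema.
    """
    free = usage(path).free
    if free < min_free_bytes:
        return {
            "failure_mode": "Drive out of free space",
            "path": path,
            "free_bytes": free,
            "threshold_bytes": min_free_bytes,
        }
    return None


def nearly_full(path):
    """Fake reading: a 100-byte volume with 5 bytes free."""
    return Usage(total=100, used=95, free=5)


print(check_free_space("T:\\", min_free_bytes=10, usage=nearly_full))
print(check_free_space("T:\\", min_free_bytes=1, usage=nearly_full))
```

Leaving `usage` at its default makes the same function read the real volume through `shutil.disk_usage`.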
  65. 65. A few final thoughts…
  66. 66. Running a Good Pre-Mortem Defer Judgment Encourage Wild Ideas Build on Ideas Stay Focused One Person at a Time Be Visual Go for Quantity SUCCESSFUL RCA
  67. 67. Here is Why It Works RCA Process Re- Establishes Personal Relationships Social Networks Cooling-Off Period De- Escalating Gestures Confidence- Building Measures Trust Building Respect
  68. 68. Don’t try to create everything at once. Knowledge is something that is created over time. Iterative Development
  69. 69. Let’s keep the conversation going… Andrew.P.White@Gmail.com ReverendDrew SystemsManagementZen.Wordpress.com systemsmanagementzen.wordpress.com/feed/ @SystemsMgmtZen ReverendDrew APWhite@us.ibm.com 614-306-3434