Brighttalk outage insurance - what you need to know - final
© All Rights Reserved

Presentation Transcript

  • Outage Insurance: Everything You Need to Know
  • Andrew White, Cloud and Smarter Infrastructure Solution Specialist, IBM Corporation. Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions, including leading the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command.
  • Image: http://weheartit.com/entry/12433848
  • Ground rules for this session… • If you can’t tell if I am trying to be funny… – GO AHEAD AND LAUGH! • Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees • If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
  • I am here today to share some of what I have learned about
  • We (IT) sell promises… The value of these promises depends on the customer’s perception that we are willing and capable of making good on the promise when the time comes. This perception is affected by the interactions they have with us.
  • Objective #1: Users Love Our IT Systems… (Image: http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/)
  • Anatomy of an Outage
    • Components shown: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS, CICS, WAS, Database, WAS, Database, zOS, MQ, DB2
    1. 5:45-ish pm: CICS ABENDs start flooding the console, but not high enough to ticket
    2. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
    3. 6:04 pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
    4. 6:54 pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
    5. 10:29 pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
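To make the timeline above concrete, here is a minimal Python sketch that computes the gaps between the logged events. It treats the two "-ish" times as exact and uses a placeholder date, since the slide gives only times of day; both are assumptions, and the function names are illustrative.

```python
from datetime import datetime

# Times from the slide; "5:45-ish" and "6:00-ish" are treated as exact (assumption).
# The calendar date is a placeholder: the slide only gives times of day.
DAY = "2000-01-01"
EVENTS = [
    ("CICS ABENDs flood the console", "17:45"),
    ("MQ flows interrupted", "18:00"),
    ("Synthetic transactions fail", "18:04"),
    ("Ops Center creates P0 incident", "18:14"),
    ("Flows judged a back-end problem", "18:54"),
    ("MQ ruled out, CICS reset", "22:29"),
]

def parse(hhmm: str) -> datetime:
    """Turn an HH:MM string into a datetime on the placeholder day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")

def gaps(events):
    """Return (label, minutes since previous event) for each event after the first."""
    times = [(label, parse(t)) for label, t in events]
    return [
        (label, int((t - prev_t).total_seconds() // 60))
        for (_, prev_t), (label, t) in zip(times, times[1:])
    ]

def total_minutes(events) -> int:
    """Minutes from the first logged event to the last."""
    return int((parse(events[-1][1]) - parse(events[0][1])).total_seconds() // 60)

if __name__ == "__main__":
    for label, minutes in gaps(EVENTS):
        print(f"+{minutes:>3} min  {label}")
    print(f"First symptom to last entry: {total_minutes(EVENTS)} min")
```

Under these assumptions, 215 of the 284 minutes fall between the "back-end" finding at 6:54 pm and the 10:29 pm entry.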
  • Bad Experience!!! (Image: http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/)
  • Why did this happen?! (Image: http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg)
  • Why is problem solving hard?
    • Non-transparency (lack of clarity of the situation): commencement opacity, continuation opacity
    • Polytely (multiple goals): inexpressiveness, opposition, transience
    • Complexity (large numbers of items, interrelations, and decisions): enumerability, connectivity (hierarchy relation, communication relation, allocation relation), heterogeneity
    • Dynamics (time considerations): temporal constraints, temporal sensitivity, phase effects, dynamic unpredictability
  • Boyd’s Loop: Observe, Orient, Decide, Act
    • Observe: Observation, fed by Outside Information, Unfolding Circumstances, and Unfolding Interaction With Environment
    • Orient: Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information
    • Decide: Decision (Hypothesis)
    • Act: Action (Test)
    • Links: Feed Forward, Feedback, Implicit Guidance & Control
    • Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
    • Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
    • From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
  • Where the Breakdown Occurs (Observe, Orient, Decide, Act)
    • Situational Awareness
        – Level 1: Perception of Elements in Current Situation
        – Level 2: Comprehension of Current Situation
        – Level 3: Projection of Future Status
    • Decision
    • Performance of Actions
    • Feedback to Current State
    • Individual Influences: Goals & Objectives, Preconceptions, Expectations; Abilities, Experience, Training; Long Term Memory, Automaticity, Cognitive Processes
    • Systemic Influences: System Capability, Interface Design, Stress & Workload, Complexity, Automation
    • Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  • Incident Life Cycle: Down Time spans Detection Time, Response Time, Repair Time, and Recovery Time. Milestones: Outage, Detection, Diagnosis, Repair, Recover, Restore. Loop stages: Observe, Orient, Decide, Act.
  • Problem Life Cycle: Evaluation, Recognition, Observation, Analysis, Solution, Validation, Control
  • Predictive Modeling Timeline
    • Past Behavior: the observation period used to feed the forecasting models
    • Point of Observation
    • Future Behavior: the performance period the model is trying to predict
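A minimal sketch of that timeline split, assuming timestamped samples: everything up to the point of observation feeds the model, and everything after it is the performance period the model is judged on. The sample values, the naive mean forecast, and all names are hypothetical stand-ins, not the presenter's method.

```python
from statistics import mean

def split_at_observation(samples, point_of_observation):
    """Split (timestamp, value) samples into the observation period (past)
    and the performance period (future) around the point of observation."""
    past = [(t, v) for t, v in samples if t <= point_of_observation]
    future = [(t, v) for t, v in samples if t > point_of_observation]
    return past, future

def naive_forecast(past):
    """Stand-in model (assumption): predict the mean of the observation period."""
    return mean(v for _, v in past)

# Hypothetical hourly disk-usage percentages, keyed by hour number.
samples = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0), (4, 70.0), (5, 74.0)]
past, future = split_at_observation(samples, point_of_observation=3)
prediction = naive_forecast(past)

# Forecast error over the performance period: actual minus predicted.
errors = [round(v - prediction, 2) for _, v in future]
```

Any real forecasting model would replace `naive_forecast`; the point is only that the model never sees the performance period while it is being built.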
  • Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
  • What Matters Most? The Goldman Algorithm (Dr. Lee Goldman, Cook County Hospital, Chicago, IL)
    § Is the patient feeling unstable angina?
    § Is there fluid in the patient’s lungs?
    § Is the patient’s systolic blood pressure below 100?
    [Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours; Traditional Techniques vs. Goldman Algorithm; scale 0 to 100]
    By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  • The Goldman Algorithm (flowchart)
    • Start: Patient suspected of Acute Cardiac Ischemia; Perform Electrocardiogram (EKG)
    • ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No) ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
    • ECG Evidence of Acute Ischemia? (Yes / No) ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age)
    • Urgent Factors Present? Rales Above Both Lung Bases; Systolic Blood Pressure <100 mm Hg; Unstable Ischemic Heart Disease
    • Branch labels: 0 Factors, 1 Factor, 2 or 3 Factors; 0 or 1 Factors, 2 or 3 Factors
    • Risk levels: Very Low Risk, Low Risk, Moderate Risk, High Risk
    • Destinations: Observation Unit, Inpatient Telemetry Unit, Coronary Care Unit
  • First… … we need to talk a little bit about your brain
  • The Triune Brain Reptilian Brain (basal ganglia) Mammalian Brain (limbic system) Cognitive Brain (neocortex)
  • Our Thought Process
    • Stimulus
    • Perception (via the senses)***
    • Limbic Center (hippocampus and amygdala)
    • Cortex (hippocampus and amygdala)
    • Pre-Frontal Cortex (hippocampus and amygdala)
    • Cognition
    • Conscious Choice (via motor centers)
    • Labels: Most primitive, seat of unconscious; Long-term memory; Conscious, meaning, choice
    *** not very reliable
  • Short Term Memory. Your Brain: Working Memory (Understanding, Judgement, Relationship). Short-term memory is where the real work of sense-making takes place. Short-term memory has a limited amount of space (the estimate is 7 ± 2).
  • The big-data dilemma (chart: Quantity over Time, against the information the brain can consume)
  • Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University
  • From Data to Wisdom
    • Data: Symbols, Metrics, Facts
    • Information: Patterns, Comparisons, Organization
    • Knowledge: Trends, Generalizations, Beliefs
    • Intelligence: Decisions, Skill, Adaptation
    • Wisdom: Accountability, Foresight, Synthesis
    • Other labels: Correlation, Analysis, Application, Understanding, Complexity, Context, Communication, Repetition
  • Data, Information, Knowledge (illustrated with a plot of y against x and the linear model $y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$)
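The slide pairs a scatter of x and y with a linear model. As a hedged illustration of that step from data to information, the sketch below fits a straight line by ordinary least squares. It assumes a single shared slope (called `a1` here) and an intercept `a0`; the data points are made up, and the slide's own subscripts did not survive extraction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Hypothetical data: the raw (x, y) points play the role of "data".
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# The fitted coefficients summarize the points: the "information".
a0, a1 = fit_line(xs, ys)

# Residuals are what the line leaves unexplained (the error term).
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
```

With these four points the fit is exact, so every residual is zero; real monitoring data would leave a non-zero error term.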
  • Why Knowledge? Knowledge is the point of transition. (Diagram: Data, Information, Knowledge, Intelligence, Wisdom placed along two axes, Past to Future and Abstract to Tangible.)
  • All You Need Love
  • Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed. (1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.)
  • Our success in any endeavor depends directly on our ability to solve problems. What do we need to do that?
  • You Gotta Have Skillz…!
  • Common Problem Types § Design Problems § Creative Problems § Daily Problems § People Problems. Approaches: Rule-Based Approach, Event-Based Approach
  • The Problem with the Rules-Based Approach •  Solutions are driven by accepted conventions •  Best practices are coveted and are adopted without understanding how and why they were developed •  There must always be a right answer •  No logical analysis is required •  People are frequently seen as the “root cause” •  The outcomes are enforced using “re-dos” and punitive actions (or the looming threat of these things)
  • Event-Based Problem Solving •  Appreciative Understanding •  Know What We Are Solving •  Create A Common Reality •  Solutions Based on Causes
  • The Pre-Mortem Process: (1) Define the Problem, (2) Chart the Causal Relationships and Add Evidence, (3) Identify Solutions, (4) Implement the Solutions
  • Step 1: Define the Problem
  • Problem Definition
    • What:
    • When:
        – Date/Time:
        – Relative: what was happening at the time of this event?
    • Where:
        – Specific:
        – Relative: logical dependencies?
    • Significance:
        – availability:
        – environment:
        – costs: revenue maintenance? other miscellaneous costs
        – frequency:
  • Gut Check… You should be able to answer all of the following: • Why are we working on this? • How much time should we spend? • What people do we need? • How much money should we spend?
  • The What Statement • It is used as “The Primary Effect (PE)” – It is a statement of what we want to prevent from happening again • There may be more than one – If they are unrelated, perform separate RCAs – If they are related and you can’t decide which to use, pick the one that is nearest to the present time • Noun/verb statement
  • Step 2: Add Causal Relationships and Evidence
  • Have you seen something like this before? The T: Drive reached 0 Bytes free → The database stopped processing queries → The application server was timing out → Users were getting 500 errors on the website → Customers called the helpdesk to complain. Solution: Add more hard drive space. What do we really know?
  • It’s never that simple (diagram nodes, joined by -AND- connectors):
    • Customers Complaining
    • Web Server returning 500 errors
    • The application server was timing out
    • SQL Server was not processing queries
    • Transaction log was unable to grow
    • T: Drive at 0 Bytes free
    • Logs were not truncated
    • DBA on honeymoon vacation in Fiji
    • Logs are truncated manually
    • Company has only 1 DBA
    • “Backup” DBA was not aware the logs require truncation
    • Space allocations are fixed
    • Lack of Control
    • Only one database cluster in use
    • DR SQL Cluster
    • DR Cluster being used for UAT testing
    • More Information Needed
    • Only one application server exists
    • More Information Needed
    • Trying to do business on the website
    • Desired Condition
  • Rules for Causal Relationships ① Causes are effects, and effects are causes. Example: Logs Not Truncated (Cause) → Drive Full (Cause/Effect) → Database Down (Effect)
  • Rules for Causal Relationships ② You can keep identifying causes; there is no limit. Example: Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)
  • Two Important Questions: Ask “Why?” and Ask “What”. Chain: Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)
  • Rules for Causal Relationships ③ An Effect is often the result of multiple causes. Example (causes joined by -AND-): SQL Server was not processing queries (Effect); Transaction log was unable to grow; T: Drive at 0 Bytes free; Logs were not truncated; DBA on honeymoon vacation in Fiji; Logs are truncated manually; Company has only 1 DBA; “Backup” DBA was not aware the logs require truncation; Space allocations are fixed; Lack of Control
  • Rules for Causal Relationships ④ Causes need to be both necessary and sufficient. Example (causes joined by -AND-): SQL Server was not processing queries (Effect); Transaction log was unable to grow (Transitory Cause); T: Drive at 0 Bytes free (Non-Transitory Cause & Effect); Logs were not truncated (Transitory Cause & Effect); DBA on honeymoon vacation in Fiji (Transitory Cause); Logs are truncated manually (Non-Transitory Cause); Company has only 1 DBA (Non-Transitory Cause); “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause); Space allocations are fixed (Non-Transitory Cause); Lack of Control
  • How Fire Works: Fire results from Oxygen -AND- Heat -AND- Fuel -AND- Match Strike. Over time, the Match Strike is the Transitory cause; Oxygen, Heat, and Fuel are Non-Transitory. • Transitory Causes act as catalysts to bring about change (think Transition) • Non-Transitory Causes are objects, properties/attributes, and status
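The fire example can be sketched as a small data structure. This is an illustration under assumptions, not the presenter's tooling: the `Cause` class and function names are hypothetical, and marking Oxygen, Heat, and Fuel as non-transitory (standing conditions) with the Match Strike as the transitory catalyst is a reading of the slide's two definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cause:
    name: str
    transitory: bool  # True: an action/catalyst; False: a standing condition

# The fire example; the transitory flags are an interpretation of the slide.
FIRE_CAUSES = [
    Cause("Oxygen", transitory=False),
    Cause("Heat", transitory=False),
    Cause("Fuel", transitory=False),
    Cause("Match Strike", transitory=True),
]

def effect_occurs(causes, present):
    """-AND- semantics: the effect occurs only if every cause is present."""
    return all(c.name in present for c in causes)

def missing(causes, present):
    """Causes whose absence prevents the effect: candidate places for a solution."""
    return [c.name for c in causes if c.name not in present]

everything = {"Oxygen", "Heat", "Fuel", "Match Strike"}
no_strike = everything - {"Match Strike"}
```

Because the causes are joined by -AND-, removing any single one prevents the effect, which is why each cause on the chart is a potential solution point.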
  • RCA Diagram: the same cause-and-effect chart as “It’s never that simple”, running from Customers Complaining down to Lack of Control, with its -AND- connectors, its “More Information Needed” branches, and the Desired Condition (Trying to do business on the website).
  • Add Evidence: the same chart again, with evidence labels such as Statistical Data and Situational Observation added.
  • Examples of Evidence •  Personal experience or observation •  Statistical data (Monitoring Metrics) •  Examples, particular events, or situations that illustrate •  Analogies (comparisons with similar situations) •  Informed opinion (the opinions of experts and authorities) •  Historical documentation •  Experimental evidence
  • Ideas for Finding Causes (cause categories): Management, Organization, Process, Knowledge, Technology, People, Information, Applications, Infrastructure, Capital
  • Step 3: Find Solutions
  • Failure Modes Analysis (diagram nodes, joined by -AND- and -OR- connectors):
    • SQL Server Not Available
    • Transaction log is unable to grow
    • T: Drive at 0 Bytes free
    • Logs were not truncated
    • DBA on honeymoon vacation in Fiji
    • Logs are truncated manually
    • Company has only 1 DBA
    • “Backup” DBA was not aware the logs require truncation (Condition Cause)
    • Space allocations are fixed (Condition Cause)
    • Lack of Control
    • SQL is unable to cache query results
    • Available RAM at 0 Bytes Free
    • C: Drive at 0 Bytes free
    • Minidump is configured to write to C: Drive
    • Server was ASRing frequently
    • Software distributions were leaving files in the TEMP folder
    • %TEMP% configured to C:\Temp
    • Kernel able to write to page file
  • Picking Monitors (on the Failure Modes Analysis tree for SQL Server Not Available): Monitor the intersections at the “ORs”, plus at least one point along each branch after the “OR”.
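One way to read that monitoring rule is as a walk over the failure-mode tree. The sketch below is a simplified, hypothetical version: the `Node` class, the `gate` values, and the reduced tree are assumptions, since the exact branch structure of the slide's diagram is not recoverable from the transcript.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)

def pick_monitors(node, picks=None):
    """Pick each OR intersection and the first node on every branch below it."""
    if picks is None:
        picks = []
    if node.gate == "OR":
        for candidate in [node, *node.children]:
            if candidate.name not in picks:
                picks.append(candidate.name)
    # Keep walking so that ORs deeper in the tree are found as well.
    for child in node.children:
        pick_monitors(child, picks)
    return picks

# Simplified, hypothetical reading of the tree; the real slide has more branches.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
    ]),
])
monitors = pick_monitors(tree)
```

The design choice is that an OR is where the possible explanations diverge, so a monitor on each branch tells you which explanation is in play.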
  • FMEA Matrix (Impact Calculation): three scales, each scored 1 to 10
    • Impact: Negligible (1-2): no loss in functionality, mostly cosmetic; Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time; Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed; Serious (7-8): the problem will not resolve itself and no workaround is possible, functionality is impaired or lost but the system is usable to some extent; Catastrophic (9-10): the system is completely unusable
    • Frequency: Improbable (1-2): less than 1 time per year; Remote (3-4): 1 time per year; Occasional (5-6): 1 time per month; Probable (7-8): 1 time per day; Chronic (9-10): 1 or more times per day
    • Detection (when the problem would be found): Very high (1-2): during the design phase; High (3-4): during peer review or unit testing; Moderate (5-6): during system testing or acceptance testing; Remote (7-8): during or immediately after production deployment; Very Remote (9-10): only after heavy usage by users
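The slide gives three 1-10 scales (how bad, how often, how late the problem is caught) but does not spell out how they combine. A minimal sketch, assuming the conventional FMEA risk priority number (the product of the three scores); the class, field names, and example scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    impact: int      # 1-10, from Negligible to Catastrophic
    frequency: int   # 1-10, from Improbable to Chronic
    detection: int   # 1-10, from "design phase" to "only after heavy usage"

    def __post_init__(self):
        # Reject scores outside the 1-10 range used by all three scales.
        for field_name in ("impact", "frequency", "detection"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be 1-10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk priority number: the conventional FMEA product (assumption)."""
        return self.impact * self.frequency * self.detection

def prioritize(modes):
    """Order failure modes from highest to lowest risk priority number."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

# Hypothetical scores for two failure modes named in the deck.
modes = [
    FailureMode("C: Drive at 0 Bytes free", impact=7, frequency=3, detection=8),
    FailureMode("T: Drive at 0 Bytes free", impact=9, frequency=4, detection=8),
]
ranked = prioritize(modes)
```

Ranking by the product keeps a rare but catastrophic, hard-to-detect failure mode from being buried under frequent cosmetic ones.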
  • FMEA Matrix (Evidence): One set of events helps us RULE IN a failure mode as a possible cause. Another set of events helps us RULE OUT the failure mode as not relevant.
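As a sketch of how rule-in and rule-out evidence could drive triage, the function below sorts failure modes into three buckets based on the events observed so far. The catalogue, the event names, and the choice to let a rule-out event win over a rule-in event are all assumptions made for illustration.

```python
def triage(failure_modes, observed_events):
    """Classify failure modes using indications (rule in) and
    contraindications (rule out). A contraindication wins over an indication."""
    result = {"ruled_out": [], "ruled_in": [], "investigate": []}
    for name, evidence in failure_modes.items():
        if evidence["contraindications"] & observed_events:
            result["ruled_out"].append(name)
        elif evidence["indications"] & observed_events:
            result["ruled_in"].append(name)
        else:
            # No evidence either way yet: someone still has to look.
            result["investigate"].append(name)
    return result

# Hypothetical catalogue; event names are illustrative, not from a real tool.
FAILURE_MODES = {
    "Transaction log unable to grow": {
        "indications": {"t_drive_free_bytes_zero"},
        "contraindications": {"log_growth_succeeded"},
    },
    "SQL unable to cache query results": {
        "indications": {"available_ram_zero"},
        "contraindications": {"ram_headroom_ok"},
    },
    "Application server timing out": {
        "indications": {"app_server_timeouts"},
        "contraindications": set(),
    },
}
observed = {"t_drive_free_bytes_zero", "ram_headroom_ok"}
report = triage(FAILURE_MODES, observed)
```

The three buckets match the accounting an incident manager would give on a bridge call: what is in play, what has been ruled out, and what still needs investigating.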
  • Application-Technology Matrix: maps services, applications and technologies, enabling: • Monitoring investment prioritization • Monitoring maturity • Which templates need to be deployed when new hardware is acquired • Whether a service has sufficient monitoring coverage based on its application components • This approach allows for anticipating changes to a customer’s monitoring needs. Scores indicate: 0 – No Strategy; 1 – Limited Monitoring; 2 – Fully Integrated Strategy
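A minimal sketch of how such a matrix might answer the coverage and prioritization questions. The 0/1/2 scores follow the slide; the services, technologies, their scores, and the rule that a service is only as well covered as its weakest component are hypothetical.

```python
# Scores per the slide: 0 = No Strategy, 1 = Limited Monitoring,
# 2 = Fully Integrated Strategy.
# Hypothetical technology scores and service-to-technology mapping.
TECH_SCORES = {"SQL Server": 2, "MQ": 1, "CICS": 0, "Web Server": 2}
SERVICES = {
    "Online Banking": ["Web Server", "MQ", "CICS"],
    "Reporting": ["Web Server", "SQL Server"],
}

def coverage(service, services=SERVICES, scores=TECH_SCORES):
    """Return (weakest score, unmonitored technologies) for a service.
    Treating the weakest component as the service's coverage is an assumption."""
    techs = services[service]
    weakest = min(scores[t] for t in techs)
    gaps = [t for t in techs if scores[t] == 0]
    return weakest, gaps

def investment_priorities(services=SERVICES, scores=TECH_SCORES):
    """Technologies ordered by score, then by how many services depend on them."""
    usage = {}
    for techs in services.values():
        for t in techs:
            usage[t] = usage.get(t, 0) + 1
    return sorted(usage, key=lambda t: (scores[t], -usage[t], t))
```

For example, `coverage("Online Banking")` flags CICS as the gap, and `investment_priorities()` puts CICS first because it has no monitoring strategy at all.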
  • Step 4: Use this knowledge intelligently
  • During Service Support •  Command Centers and Support Teams –  Use the failure modes to rule out causes –  Each failure mode will have a documented process to follow to mitigate the impact once the likely failure mode is identified •  Incident Managers –  Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated –  Coordinate the investigation assignments and consolidate the investigation results
  • Facilitating Production Assurance •  CritSits –  Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated –  Initiate investigations / experiments by assigning potential failure modes to the incident response teams •  Problem Management –  Document the causal elements as new failure modes –  Disseminate new failure modes to Architecture, the Monitoring Team, and the Command Center/Service Desk •  Reporting –  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects –  Incorporate failure modes into “Fault Line” analysis
  • During the Design Process •  Architects –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk –  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented •  Developers –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk –  Certify the designs implement the mitigation strategies
  • Improving Enterprise Processes and Tools •  Systems Management and Monitoring –  Develop new monitoring requirements using the documented indications and contraindications •  Event Management –  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently •  Configuration Management –  Develop new discovery patterns using the documented indications and contraindications –  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
  • A few final thoughts…
  • Running a Good Pre-Mortem (SUCCESSFUL RCA): Defer Judgment, Encourage Wild Ideas, Build on Ideas, Stay Focused, One Person at a Time, Be Visual, Go for Quantity
  • Here is Why It Works: RCA Process Re-Establishes Personal Relationships, Social Networks, Cooling-Off Period, De-Escalating Gestures, Confidence-Building Measures, Trust Building, Respect
  • Iterative Development: Don’t try to create everything at once. Knowledge is something that is created over time.
  • Let’s keep the conversation going… Andrew.P.White@Gmail.com, ReverendDrew, SystemsManagementZen.Wordpress.com, systemsmanagementzen.wordpress.com/feed/, @SystemsMgmtZen, ReverendDrew, APWhite@us.ibm.com, 614-306-3434