Bringing Back the Love: How Situational Awareness Improves User Experience
http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/

Andrew White
Manager of Systems and Event Management at Nationwide Insurance

Mr. White leads a team of software developers focused on creating tools that collect and analyze health information from Nationwide's IT systems. These tools have a wide variety of applications, from fault detection and problem investigation to trend reporting and capacity planning.

Andrew has over ten years of experience designing and managing the deployment of systems management software. Prior to joining Nationwide, Andrew developed solutions for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, and the US Navy Facilities and Engineering Command.

GROUND RULES FOR THIS SESSION…
1. If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH!
2. Feel free to text, tweet, yammer, or whatever. People gotta hear this!
3. If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
Follow Us: #ITSMSummit

My name is Andrew White. I lead a Systems and Event Management team.

I am here today to talk about Situational Awareness.

SITUATION – [SI-CHƏ-WĀ-SHƏN]
-noun
1. manner of being situated; location or position with reference to environment: The situation of the house allowed for a beautiful view.
2. condition; case; plight: He is in a desperate situation.
3. the state of affairs; combination of circumstances: The present international situation is dangerous.
4. a state of affairs of special or critical significance in the course of a play, novel, etc.

AWARENESS – [UH-WAIR-NIS]
-noun
1. having knowledge; conscious; cognizant: aware of danger.
2. informed; alert; knowledgeable; sophisticated: She is one of the most politically aware young women around.

When you put them together we get:
The perception of and reaction to a set of changing events in terms of what can be done instead of merely the recollection of a stimulus. [1]
Most outages are the result of a lack of situational awareness.
[1] Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

SOMETIMES WE MISS WHAT IS GOING ON
Say… what’s a mountain goat doing all the way up here in a cloud bank?

WHICH DO YOU USE WHEN?
We don’t have a tooling problem… we have an understanding problem!
[Diagram: several tools mapped against technology areas]

Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

BOYD’S OODA “LOOP”
[Figure: Observe → Orient → Decide (Hypothesis) → Act (Test). Observation takes in unfolding circumstances, outside information, and the unfolding interaction with the environment. Orientation combines cultural norms, knowledge life cycle, cognitive abilities, new information, and prior wisdom. Each stage feeds forward to the next; feedback from Decide and Act returns to Observe; Orient exerts implicit guidance and control over Observe and Act.]
• Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
• Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.

WHERE THE BREAKDOWN OCCURS
[Figure: Current State → Situational Awareness → Decision → Performance of Actions, with feedback to the current state. The OODA phases (Observe, Orient, Decide, Act) are shown beneath.]
Situational Awareness levels:
• Level 1: Perception of Elements in Current Situation
• Level 2: Comprehension of Current Situation
• Level 3: Projection of Future Status
Systemic Influences:
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation
Individual Influences:
• Goals & Objectives
• Preconceptions
• Expectations
• Cognitive Processes (Long Term Memory, Automaticity)
• Abilities
• Experience
• Training
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Maybe.
Let me show you why this is important…

WE (IT) SELL PROMISES…
The value of these promises depends on the customer’s perception that we are willing and capable of making good on the promise when the time comes. This perception is affected by the interactions they have with us.

Objective #1: Users Love Our IT Systems…
http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/

WHAT THIS MEANS TO US…
There are a few inescapable facts we face:
1. We need reliable systems to store the promises we make to our customers
2. Our systems mirror the complexity of the businesses they support
3. Our environments must be massive to scale to handle the workload
4. There is too much activity for a single person to be totally situationally aware
5. If the users can’t use it, it doesn’t work

EVENT MANAGEMENT FOCUS…
In addition to monitoring for performance, we are here to help manage availability.
Our Formula:
1. Continually collect, categorize, and analyze all events from as many sources as possible
2. Correlate events and analyze them using previous outages as patterns to identify situations worth investigating
3. Notify a support team so the situation can be mitigated before becoming an outage

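The three-step formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` record, the category names, the pattern thresholds, and the team name are all hypothetical and do not come from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # system that emitted the event (illustrative values only)
    category: str  # normalized event category (hypothetical taxonomy)


# Hypothetical pattern distilled from a previous outage:
# category -> minimum number of events that makes the situation worth a look.
PAST_OUTAGE_PATTERN = {"abend": 3, "flow_interrupted": 1}


def collect(raw_events):
    """Step 1: turn raw (source, category) tuples into categorized events."""
    return [Event(source, category) for source, category in raw_events]


def matches_pattern(events, pattern):
    """Step 2: True when every category in the pattern reaches its minimum."""
    counts = Counter(e.category for e in events)
    return all(counts[cat] >= minimum for cat, minimum in pattern.items())


def notify(team, events):
    """Step 3: build the notification text (a real system would page the team)."""
    return f"{team}: possible outage forming, {len(events)} correlated events"


raw = [("CICS", "abend")] * 3 + [("MQ", "flow_interrupted")]
events = collect(raw)
if matches_pattern(events, PAST_OUTAGE_PATTERN):
    print(notify("mainframe-support", events))
```

Matching on simple per-category counts is the crudest possible correlation; it is used here only to show where a past-outage pattern would plug in.
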
When all of these happen at the same time…
Ugh…

CLEANING UP THE LANDSCAPE
[Diagram labels: Launch Pad, Silo, Monolithic, Niche, Framework, Information Bus]
Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly, 13 Nov 2009. https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391

ONE INTEGRATED ENVIRONMENT
[Diagram components:]
• Presentation Framework
• CMDB, Paging, Service Desk, 3rd Party Providers, Knowledge, Asset Mgmt
• Enrichment & Correlation
• Event API, Event Pool, Event Catalog, Predictive
• Event sources: Business Telemetry, Mainframe, Distributed, Database, Network, Middleware, Storage
• Operational Data Warehouse

CONCEPTUALIZING SITUATIONAL AWARENESS
[Diagram: Real-Time Event Streams, Causal Relationship Patterns from Past RCAs, and Historical Data feed the Situational Awareness Engine, which produces Detected and Predicted Situations]
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

SITUATIONAL AWARENESS MODEL DESIGN
Data → Information → Knowledge → Intelligence
[Diagram: Event Sources feed an Event Pipeline that passes through five levels:]
• Level 1: Event Taxonomy and Enrichment
• Level 2: Event Tracking
• Level 3: Situation Detection
• Level 4: Predictive Analysis
• Level 5: Runbook Automation
[Supporting elements: Historical Event Archive; Causal Relationship Patterns from Past RCAs; Historical Data; Solicitations for User Interaction via the Visualization Framework]
Adapted from the JDL: Steinberg, A., & Bowman, C., Handbook of Multisensor Data Fusion, CRC Press, 2001

REQUIREMENTS FOR UNITY OF EFFORT
[Diagram: three interlocking elements: 1. Command and Control, 2. Shared Experience, 3. Situational Awareness]
Symptoms of Missing Elements
• Command and Control (No Leadership)
  • The team lacks a clear direction
  • Lots of activity, lack of progress
• Shared Experience (Poor Relationships)
  • Us vs. Them mentality
  • Unhealthy competition
• Situational Awareness (Poor Communication)
  • Focused on cooperation, not collaboration
  • Blame culture
  • Infrequent or non-existent communication

Our success in any endeavor depends directly on our ability to solve problems.
What do we need to do that?

WHAT MATTERS MOST?
Cook County Hospital, Chicago, IL
Dr. Lee Goldman
The Goldman Algorithm
§ Is the patient feeling unstable angina?
§ Is there fluid in the patient’s lungs?
§ Is the patient’s systolic blood pressure below 100?
[Chart: Prediction of Patients Who Will Have a Heart Attack Within 72 Hours, in percent (0 to 100), Traditional Techniques vs. Goldman Algorithm]
By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.

THE GOLDMAN ALGORITHM
[Flowchart:]
Patient enters ED with suspected Acute Cardiac Ischemia → Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)?
  ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
  Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
  • Yes → High Risk (Coronary Care Unit)
  • No → ECG Evidence of Acute Ischemia?
      ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
      T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or
      Left Bundle-Branch Block (New or Unknown Age)
      • Yes → Urgent Factors Present?
          • 2 or 3 Factors → High Risk (Coronary Care Unit)
          • 0 or 1 Factors → Moderate Risk (Inpatient Telemetry Unit)
      • No → Urgent Factors Present?
          • 2 or 3 Factors → Moderate Risk (Inpatient Telemetry Unit)
          • 1 Factor → Low Risk (Observation Unit)
          • 0 Factors → Very Low Risk (Observation Unit)
Urgent Factors:
• Rales Above Both Lung Bases
• Systolic Blood Pressure <100 mm Hg
• Unstable Ischemic Heart Disease

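To show how little input the decision tree on this slide needs, here is a minimal Python sketch of it. It follows one reading of the flattened flowchart (the mapping of factor counts to risk levels is an assumption where the slide layout is ambiguous), the function and parameter names are invented for illustration, and it is an example of a compact decision rule, not clinical guidance.

```python
def goldman_risk(ecg_mi: bool, ecg_ischemia: bool,
                 rales: bool, low_systolic_bp: bool,
                 unstable_heart_disease: bool) -> str:
    """Return the risk band for the given findings.

    Assumed reading of the flowchart: ECG evidence of MI is always high risk;
    otherwise the count of urgent factors (0 to 3) decides the band.
    """
    urgent_factors = sum([rales, low_systolic_bp, unstable_heart_disease])

    if ecg_mi:
        return "high"
    if ecg_ischemia:
        return "high" if urgent_factors >= 2 else "moderate"
    if urgent_factors >= 2:
        return "moderate"
    return "low" if urgent_factors == 1 else "very low"


# Example: no ECG evidence, one urgent factor (low systolic blood pressure).
print(goldman_risk(False, False, False, True, False))
```

The point for monitoring is the same as on the slide: one ECG reading and three yes/no factors drive the whole decision, and everything else is treated as noise.
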
WHAT GOOD MONITORING LOOKS LIKE
Elements of Good Monitoring
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
[Diagram: the numbered elements overlaid on an infrastructure path: Corporate LANs & VPNs, Firewall, Switch, Load Balancers, Web Server Farm, DataPower, Middleware, Database, Mainframe]

FINDING METRICS THAT MATTER
Evaluating the Effectiveness of a Metric
§ Will the metric be used in a report? If so, which one? How is it used in the report?
§ Will the metric be used in a dashboard? If so, which one? How will it be used?
§ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
§ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
§ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
§ Is the metric always associated with a single problem? Could this metric become a false indicator?
§ What is the impact if this goes undetected?
§ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?

ANATOMY OF AN OUTAGE
IM01109089: P0 - Affecting Multiple apps & Internet Sales West
• 5:45-ish pm: CICS ABENDs start flooding MainView but not high enough to ticket
• 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
• 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
• 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
• 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
[Diagram: the five numbered events overlaid on the service path: Corporate LANs & VPNs, Firewall, Load Balancer, Web Servers, WAS, Message Queue (MQ on zOS), CICS (zOS), DB2 Database]

COMMON PROBLEM TYPES
§ Design Problems
§ Creative Problems
§ Daily Problems
§ People Problems
[Diagram: the problem types placed between a Rule-Based Approach and an Event-Based Approach]

EVENT-BASED PROBLEM SOLVING
§ Appreciative Understanding
§ Know What We Are Solving
§ Create A Common Reality
§ Solutions Based on Causes

CAUSAL RELATIONSHIPS
① Causes are effects, and effects are causes
Database Down (Effect) ← Drive Full (Cause/Effect) ← Logs Not Truncated (Cause)

CAUSAL RELATIONSHIPS
② You can keep identifying causes – there is no limit
End of the Universe (Effect) ← Database Down (Primary Effect) ← Drive Full (Cause/Effect) ← Logs Not Truncated (Cause/Effect) ← Beginning of Time (Cause)

TWO IMPORTANT QUESTIONS
Ask “Why?”
End of the Universe (Effect) ← Database Down (Primary Effect) ← Drive Full (Cause/Effect) ← Logs Not Truncated (Cause/Effect) ← Beginning of Time (Cause)
Ask “What?”

RULES FOR CAUSAL RELATIONSHIPS
③ An Effect is often the result of multiple causes
Effect: SQL Server was not processing queries
[Diagram: causes joined by -AND- gates, read right to left:]
• Transaction log was unable to grow
• Logs were not truncated
• T: Drive at 0 Bytes free
• DBA on honeymoon vacation in Fiji
• Logs are truncated manually
• Company has only 1 DBA
• “Backup” DBA was not aware the logs require truncation
• Space allocations are fixed
• Lack of Control

RULES FOR CAUSAL RELATIONSHIPS
④ Causes need to be both necessary and sufficient
Effect: SQL Server was not processing queries
[Diagram: causes joined by -AND- gates, each labeled by type:]
• Transaction log was unable to grow (Transitory Cause)
• Logs were not truncated (Transitory Cause & Effect)
• T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
• DBA on honeymoon vacation in Fiji (Transitory Cause)
• Logs are truncated manually (Non-Transitory Cause)
• Company has only 1 DBA (Non-Transitory Cause)
• “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
• Space allocations are fixed (Non-Transitory Cause)
• Lack of Control

HOW FIRE WORKS
[Diagram: Oxygen, Heat, Fuel, and Match Strike combine (-AND-) over time to produce Fire; each cause is marked Transitory or Non-Transitory]
• Transitory Causes act as catalysts to bring about change (think Transition)
• Non-Transitory Causes are objects, properties/attributes, and status

TAKE A SOLOGIC RCA DIAGRAM
[Diagram: cause-and-effect tree joined by -AND- gates, read from the primary effect back to its causes:]
• Customers Complaining
  • Trying to do business on the website (Desired Condition)
  • Web Server returning 500 errors
    • The application server was timing out
      • Only one application server exists (More Information Needed)
      • SQL Server was not processing queries
        • Transaction log was unable to grow
          • Logs were not truncated
            • DBA on honeymoon vacation in Fiji
            • Logs are truncated manually
            • Company has only 1 DBA
            • “Backup” DBA was not aware the logs require truncation
          • T: Drive at 0 Bytes free
            • Space allocations are fixed (Lack of Control)
      • Only one database cluster in use
        • DR SQL Cluster
        • DR Cluster being used for UAT testing (More Information Needed)

ADD THE EVIDENCE
[Diagram: the same RCA tree as on the previous slide, from “Customers Complaining” back to “DBA on honeymoon vacation in Fiji”, with each cause annotated by its type of evidence: Statistical Data, Observation, or Situational]

FAILURE MODES AND EFFECT ANALYSIS
[Diagram: failure-mode tree for “SQL Server Not Available”, with two branches joined by -OR-:]
• Transaction log is unable to grow (-AND-)
  • Logs were not truncated (-AND-)
    • DBA on honeymoon vacation in Fiji
    • Logs are truncated manually
    • Company has only 1 DBA
    • “Backup” DBA was not aware the logs require truncation
  • T: Drive at 0 Bytes free (Condition Cause)
    • Space allocations are fixed (Condition Cause; Lack of Control)
• SQL is unable to cache query results (-AND-)
  • C: Drive at 0 Bytes free (-OR-)
    • Minidump is configured to write to C: Drive (-AND-) Server was ASRing frequently
    • Software distributions were leaving files in the TEMP folder (-AND-) %TEMP% configured to C:\Temp
  • Available RAM at 0 Bytes Free
  • Kernel able to write to page file

GETTING TO OUR REQUIREMENTS
[Diagram: the same failure-mode tree for “SQL Server Not Available” as on the previous slide, annotated with where to monitor:]
• Monitor the intersections at the “OR’s”
• Monitor at least one point along each branch after the “OR”

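The two annotations on this slide amount to a simple tree walk. The sketch below is one way to express it, assuming a simplified version of the failure-mode tree; the `Cause` class and the choice of the first node on each branch as "the point along the branch" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    name: str
    gate: str = ""  # "AND", "OR", or "" for a leaf cause
    children: list = field(default_factory=list)


def monitoring_points(node):
    """Collect the OR intersections plus one point on each branch after an OR."""
    points = []
    if node.gate == "OR":
        points.append(node.name)  # the intersection itself
        points.extend(child.name for child in node.children)  # one per branch
    for child in node.children:
        for point in monitoring_points(child):
            if point not in points:
                points.append(point)
    return points


# Simplified, assumed reading of the diagram on the slide.
tree = Cause("SQL Server Not Available", "OR", [
    Cause("Transaction log is unable to grow", "AND", [
        Cause("Logs were not truncated"),
        Cause("T: Drive at 0 Bytes free"),
    ]),
    Cause("SQL is unable to cache query results", "AND", [
        Cause("C: Drive at 0 Bytes free", "OR", [
            Cause("Minidump is configured to write to C: Drive"),
            Cause("Software distributions were leaving files in the TEMP folder"),
        ]),
        Cause("Available RAM at 0 Bytes Free"),
    ]),
])

for point in monitoring_points(tree):
    print(point)
```

Causes that sit only under AND gates are not selected, because knowing that one branch of the nearest OR fired already tells the support team which path to investigate.
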
FMEA MATRIX (IMPACT CALCULATION)
[Scale: when the problem is likely to be detected]
• Very high (1-2): during the design phase
• High (3-4): during peer review or unit testing
• Moderate (5-6): during system testing or acceptance testing
• Remote (7-8): during or immediately after production deployment
• Very Remote (9-10): only after heavy usage by users
[Scale: how severe the effect is]
• Negligible (1-2): no loss in functionality, mostly cosmetic
• Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
• Critical (5-6): the problem will not resolve itself but a work around exists allowing the problem to be bypassed
• Serious (7-8): the problem will not resolve itself and no work around is possible. Functionality is impaired or lost but the system is usable to some extent
• Catastrophic (9-10): the system is completely unusable
[Scale: how often the problem occurs]
• Improbable (1-2): less than 1 time per year
• Remote (3-4): 1 time per year
• Occasional (5-6): 1 time per month
• Probable (7-8): 1 time per day
• Chronic (9-10): 1 or more times per day

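The slide gives three 1-to-10 scales but not the formula that combines them. A common FMEA convention, assumed here, is the risk priority number: severity × occurrence × detection. The function names below are illustrative, and the band labels are copied from the severity scale on the slide.

```python
SEVERITY_LABELS = ["Negligible", "Marginal", "Critical", "Serious", "Catastrophic"]


def band_label(score, labels):
    """Map a 1-10 score to its two-point band (1-2, 3-4, ..., 9-10)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    return labels[(score - 1) // 2]


def risk_priority_number(severity, occurrence, detection):
    """Standard FMEA risk priority number (assumed, not stated on the slide)."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores must be between 1 and 10")
    return severity * occurrence * detection


# Example: a serious problem (8) that occurs monthly (6) and is usually
# found only around production deployment (7).
print(band_label(8, SEVERITY_LABELS), risk_priority_number(8, 6, 7))
```

Ranking failure modes by this product gives a rough order for where to invest monitoring first; the absolute value carries no meaning on its own.
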
FMEA MATRIX (EVIDENCE)
• These are the events that help us RULE OUT the failure mode as not relevant
• These are the events that help us to RULE IN a failure mode as a possible cause

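Ruling failure modes in or out from observed events can be sketched as a set lookup. The failure-mode names, event names, and the rule that a contraindication outweighs an indication are all assumptions made for this example.

```python
# Hypothetical evidence matrix: which events rule a failure mode in or out.
FAILURE_MODES = {
    "transaction log unable to grow": {
        "indications": {"t_drive_full", "log_not_truncated"},
        "contraindications": {"t_drive_has_free_space"},
    },
    "query cache unavailable": {
        "indications": {"c_drive_full", "ram_exhausted"},
        "contraindications": {"c_drive_has_free_space"},
    },
    "network partition": {
        "indications": {"packet_loss"},
        "contraindications": {"ping_ok"},
    },
}


def classify(observed, failure_modes):
    """Label each failure mode from the set of observed event names."""
    status = {}
    for name, evidence in failure_modes.items():
        if observed & evidence["contraindications"]:
            status[name] = "ruled out"      # assumed to outweigh indications
        elif observed & evidence["indications"]:
            status[name] = "ruled in"
        else:
            status[name] = "needs investigation"
    return status


observed = {"t_drive_full", "c_drive_has_free_space"}
for mode, state in classify(observed, FAILURE_MODES).items():
    print(f"{mode}: {state}")
```

The three resulting buckets match the accounting an incident manager gives at the start of a bridge call: what is in play, what has been ruled out, and what still needs someone to look.
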
HOW TO DETERMINE EVENT SEVERITY
• The event severity is determined with respect to the component generating the event
• The event severity does not consider impact or urgency
• The incident priority is not determined by event severity
• The event severity helps drive an effective triage when multiple events arrive at approximately the same time
• Only after the affected components and their relationships to each other have been determined can impact and urgency be determined
[Diagram: a Logical Server made up of a Physical Server hosting Virtual Machines 1 and 2, and Server Logical Volumes 1 and 2 built on Volume Groups 1 and 2 over Physical Volumes (Hard Drives 1, 2, and 3)]
Six Levels of Severity
Severity         Description
Critical         The component has completely failed
Major            The component is operating but is in a degraded or crippled state
Minor            The component is functioning normally but is at risk of a more serious failure
Informational    The component is functioning normally but is reporting a change in state
Unknown          The component has changed its operating state but the effect is not known
Clear            The component is operating normally or a higher severity event has been resolved

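A small Python sketch of severity-driven triage follows. The six names come from the table above; their numeric ranking (in particular, placing Unknown below Informational, as listed on the slide) and the sample events are assumptions.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ranked in the order the slide lists them, highest first (assumed ranking).
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events):
    """Order (component, severity) pairs so the most severe are handled first.

    sorted() is stable, so events of equal severity keep their arrival order.
    """
    return sorted(events, key=lambda event: event[1], reverse=True)


arrivals = [
    ("Hard Drive 2", Severity.MINOR),
    ("Volume Group 1", Severity.CRITICAL),
    ("Virtual Machine 1", Severity.MAJOR),
    ("Hard Drive 3", Severity.MINOR),
]
for component, severity in triage(arrivals):
    print(f"{severity.name:<10} {component}")
```

Severity here describes only the component that raised the event. Impact and urgency still have to be worked out from the component relationships, as the slide notes.
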
MONITORING BASED ON PATTERNS
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the server is provisioned
• As a server is customized to fit its role, additional templates are deployed
• Templates are stacked on top of each other until no gaps remain
• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution

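Template stacking can be pictured as merging dictionaries of checks, with later layers overriding earlier ones. The template contents, metric names, and thresholds below are hypothetical.

```python
# Hypothetical templates: check name -> alert threshold.
OS_TEMPLATE = {"cpu_busy_pct": 90, "disk_free_pct": 10}
WEB_SERVER_TEMPLATE = {"http_5xx_per_min": 5, "cpu_busy_pct": 80}
CUSTOM_TEMPLATE = {"checkout_latency_ms": 2000}


def stack_templates(*templates):
    """Merge templates in deployment order; later layers override earlier ones."""
    merged = {}
    for template in templates:
        merged.update(template)
    return merged


def coverage_gaps(required_checks, deployed):
    """Return the required checks that no deployed template provides yet."""
    return sorted(set(required_checks) - set(deployed))


deployed = stack_templates(OS_TEMPLATE, WEB_SERVER_TEMPLATE, CUSTOM_TEMPLATE)
required = ["cpu_busy_pct", "disk_free_pct", "http_5xx_per_min", "error_log_scan"]
print(coverage_gaps(required, deployed))
```

In this sketch, "until no gaps remain" becomes a concrete check: keep adding templates until `coverage_gaps` returns an empty list.
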
APPLICATION-TECHNOLOGY MATRIX
Maps services, applications and technologies enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy

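A matrix like this is easy to query once it is held as nested dictionaries. In the sketch below the 0/1/2 scale comes from the slide, while the services, technologies, scores, and the "every component at 2" coverage rule are assumptions for illustration.

```python
# Hypothetical matrix: service -> technology -> monitoring score (0, 1, or 2).
MATRIX = {
    "Internet Sales": {"WAS": 2, "MQ": 1, "DB2": 0},
    "Claims": {"WAS": 2, "DB2": 2},
}


def under_monitored(matrix, service, minimum=2):
    """Technologies behind a service that score below the required minimum."""
    return sorted(tech for tech, score in matrix[service].items() if score < minimum)


def has_sufficient_coverage(matrix, service, minimum=2):
    """A service is covered when every component meets the minimum score."""
    return not under_monitored(matrix, service, minimum)


for service in MATRIX:
    gaps = under_monitored(MATRIX, service)
    state = "covered" if not gaps else "gaps: " + ", ".join(gaps)
    print(f"{service}: {state}")
```

Sorting services by the number of gaps would give a first cut at monitoring investment prioritization.
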
FACILITATING PRODUCTION ASSURANCE
§ CritSits
  § Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  § Include other potential failure modes in the KT matrix
§ Problem Management
  § Document the causal elements as new failure modes
  § Disseminate new failure modes to Architecture, ESM and the Command Center
§ Reporting
  § Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
  § Incorporate failure modes into “Fault Line” analysis

DURING THE DESIGN PROCESS
• Architects
  • Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  • Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented completely
• Developers
  • Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  • Certify the designs implement the mitigation strategies correctly

IMPROVING ENTERPRISE TOOLS
§ Systems Management
  § Develop new monitoring requirements using the documented indications and contraindications
§ Event Management
  § Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
§ Configuration Management
  § Develop new discovery patterns using the documented indications and contraindications
  § Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System

DURING SERVICE SUPPORT
• Command Centers and Support Teams
  – Use the documented failure modes to rule out potential causes
  – Each failure mode will have a documented process to follow to mitigate the impact once a failure mode is identified
• Incident Managers
  – Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  – Coordinate the investigation assignments and consolidate the investigation results