Lost Cause Analysis; Approaches to trigger incident resolution.
Dan Young (dan@firstoccurrence.com)
January 2012
Alarm management systems are often accused of overwhelming their users, which results in
missed outages or an elevated Mean Time to Repair. A frequently cited resolution to this
problem is “Root Cause Analysis” (RCA). RCA in alarm management is the automated
detection of the most probable cause of an alarm. An outage or an alarm is
caused by something, and it becomes an incident when it impacts a service or a
business function. Generally, the larger the impact, the greater the number of events, and the
impact and cause of an outage can easily be lost in the flood of events. Events are
isolated and don’t naturally relate to other events, so RCA attempts to
piece together these unrelated pieces of data (plus data from other data sources) and pinpoint the
probable cause. The benefits of this are easy to see:
1. Operators focus on the problem rather than sifting through events.
2. Ideally, Mean Time To Repair (MTTR) is reduced.
3. Ultimately, the response to the outage could be further automated, e.g. by raising a truck roll.
Unfortunately, identifying the true root cause is nearly impossible with current
monitoring levels, since an alarm is almost always a consequence or a symptom of
something else that is not monitored or measurable. This might include a power failure, a
software bug, a configuration problem and so on. Even if a very advanced RCA system is able to
isolate an offline segment of the network to a particular WAN
interface via network topology and event and polled investigations, that WAN interface being
offline is not the root cause; it is simply the consequence. The outage is far more likely due
to an unauthorised change or a carrier issue than to the interface being disabled.
Additionally, such a tidy scenario of network segments going offline is
increasingly unlikely. It is far more likely that a modern network will have many layers of
redundancy and duplication that make RCA unnecessary, since the impact of a failure is isolated and
traffic is routed around it.
An RCA has truly found the root cause only if it can trigger an automated resolution action.
This is difficult and rare, since the possible root causes of any outage in any given IT
network environment are nearly limitless and ever expanding. The value of developing
RCA becomes even harder to justify over time, because technologies become more stable as
they mature. Over time, networks and IT systems are becoming more robust,
more redundant and better designed, with greater capacity and resiliency. These new
technologies often require analysis technologies of their own, and the alarm management
RCA policies for them generally lag far behind, if they ever arrive.
Despite these inherent limitations, RCA remains the pinnacle of alarm management.
For all but the most static networks or systems, we propose to
deprioritise automated RCA capabilities, leaving that task to human operators and element
managers, and to extend alarm management with an alarm framework that
helps operations deliver lower MTTR and business focus by identifying actionable
abnormalities.
Alarm Configuration Database
We define an actionable alarm as an outage that impacts a service and/or requires
intervention to resolve. Not all alarms require operator intervention, and our analyses have
found that a majority of alarms resolve themselves in such a short time that
manual intervention wouldn’t be possible. To provide operations focus, we suggest the
following criteria that every alarm must pass before being presented or ticketed. The answer
must be yes to each of the following questions to warrant raising a ticket:
1. Does the alarm indicate an outage that could impact the business operation?
2. Can the alarm/outage resolve itself?
3. Is the alarm out of the ordinary (i.e. has it never happened, or does it rarely happen)?
4. Has the alarm gone past the point of resolving itself?
These are simple rules that should be defined in an Alarm Configuration Database
(AMDB). An AMDB provides visibility into what is being managed and enables users to
quickly adapt alarm behavior to changing network or customer needs.
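The four questions above can be sketched as a single check. This is only an illustration: the names AlarmAssessment and should_raise_ticket, and the way the answers are stored, are assumptions and not part of any particular alarm management product. How each answer is actually determined would come from the AMDB and the alarm history.

```python
from dataclasses import dataclass


@dataclass
class AlarmAssessment:
    """Answers to the four questions for one alarm (all names are hypothetical)."""
    impacts_business: bool         # 1. could it impact the business operation?
    can_self_resolve: bool         # 2. can the alarm/outage resolve itself?
    out_of_ordinary: bool          # 3. has it never happened, or only rarely?
    past_self_resolve_point: bool  # 4. has it gone past the point of resolving itself?


def should_raise_ticket(a: AlarmAssessment) -> bool:
    """Raise a ticket only when every answer is yes, as the text requires."""
    return all((a.impacts_business, a.can_self_resolve,
                a.out_of_ordinary, a.past_self_resolve_point))


# A routine alarm that is still expected to clear on its own: no ticket.
routine_alarm = AlarmAssessment(True, True, False, False)
# An unusual alarm that has outlived its normal clearing time: ticket.
stuck_outage = AlarmAssessment(True, True, True, True)

print(should_raise_ticket(routine_alarm))  # False
print(should_raise_ticket(stuck_outage))   # True
```

Keeping the answers as plain data makes it straightforward to store them per alarm type and change them without touching the logic.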
Alarm History
Alarm history is a rich source of data that can be tapped for improved operations.
Standard metrics like Mean Time To Repair (MTTR) or Mean Time Between Failures
(MTBF) can be applied to incident management so that we can learn a great deal about alarms
and escalate them when something is out of the ordinary.
Based on our experience we have found the following rule matrix:
Metric               Indicates
Low MTTR             Likely to resolve itself.
High MTTR            Unlikely to resolve itself.
Low MTBF             Likely to resolve itself.
High MTBF            Unlikely to resolve itself. Likely to have a bigger business impact.
No MTBF (unknown)    Unlikely to resolve itself. Likely to have a business impact.
These metrics can be used to assign priorities to alarms and to provide powerful escalation
rules for them.
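As a rough sketch of how the rule matrix could be derived from alarm history, the code below computes MTTR and MTBF for one alarm type from a list of (raised, cleared) timestamps and maps them onto the matrix. The function names, the history format and both thresholds (5 minutes and 7 days) are assumptions chosen for illustration; MTBF is taken here as the mean gap between one occurrence clearing and the next being raised.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional, Tuple

Occurrence = Tuple[datetime, datetime]  # (raised, cleared) for one alarm occurrence


def mttr(history: List[Occurrence]) -> Optional[timedelta]:
    """Mean time from raise to clear; None when there is no history."""
    if not history:
        return None
    return timedelta(seconds=mean((c - r).total_seconds() for r, c in history))


def mtbf(history: List[Occurrence]) -> Optional[timedelta]:
    """Mean gap between one clear and the next raise; None with fewer than two occurrences."""
    if len(history) < 2:
        return None
    ordered = sorted(history)
    gaps = [(nxt[0] - prev[1]).total_seconds() for prev, nxt in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))


def interpret(history: List[Occurrence],
              mttr_threshold: timedelta = timedelta(minutes=5),  # assumed threshold
              mtbf_threshold: timedelta = timedelta(days=7)      # assumed threshold
              ) -> List[str]:
    """Map the metrics onto the rule matrix above."""
    notes = []
    repair = mttr(history)
    if repair is not None:
        notes.append("Low MTTR: likely to resolve itself" if repair <= mttr_threshold
                     else "High MTTR: unlikely to resolve itself")
    between = mtbf(history)
    if between is None:
        notes.append("No MTBF: unlikely to resolve itself, likely business impact")
    elif between <= mtbf_threshold:
        notes.append("Low MTBF: likely to resolve itself")
    else:
        notes.append("High MTBF: unlikely to resolve itself, likely bigger business impact")
    return notes


t0 = datetime(2012, 1, 1, 9, 0)
history = [(t0, t0 + timedelta(minutes=2)),
           (t0 + timedelta(hours=1), t0 + timedelta(hours=1, minutes=4))]
print(mttr(history), mtbf(history))
print(interpret(history))
```

In practice the thresholds would differ per alarm type, which is one reason to hold them in the AMDB and not in code.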
Alarm Cluster Analysis
A generic RCA pattern we have had success with is also the simplest. It has been
applied and proven in a number of network technologies that have a degree of scale. That
approach is “Alarm Clustering” or “Alarm Grouping”.
X events in Y time in Z something
X is the number of events
Y is the time period
Z is a logical group (i.e. location/network/customer/product)
e.g. 10 events in 5 minutes at the Melbourne Branch indicates that something is
happening there and that it should be investigated.
We suggest that, to get the best value from this approach, the AMDB should provide the
ability to configure or view these X, Y and Z thresholds.
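The "X events in Y time in Z" rule can be sketched as a sliding window per logical group. This is a minimal illustration and not a reference implementation: the class name ClusterDetector and its interface are hypothetical, and it assumes events for a group arrive in time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class ClusterDetector:
    """Flags a group (Z) once it has seen X events within a time window Y."""

    def __init__(self, x_events: int, y_window: timedelta):
        self.x_events = x_events
        self.y_window = y_window
        self._recent = defaultdict(deque)  # group name -> timestamps still inside the window

    def observe(self, group: str, when: datetime) -> bool:
        """Record one event and report whether the group has reached the threshold."""
        window = self._recent[group]
        window.append(when)
        # Drop events that are older than the window, measured from the newest event.
        while window and when - window[0] > self.y_window:
            window.popleft()
        return len(window) >= self.x_events


# The example from the text: 10 events in 5 minutes at the Melbourne Branch.
detector = ClusterDetector(x_events=10, y_window=timedelta(minutes=5))
start = datetime(2012, 1, 1, 9, 0)
results = [detector.observe("Melbourne Branch", start + timedelta(seconds=20 * i))
           for i in range(10)]
print(results[-1])  # True: the tenth event within five minutes triggers the cluster
```

Because X and Y are constructor arguments and Z is just a key, all three could be read from configuration per group.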
Conclusion
In general, we believe RCA is powerful and remains the ultimate goal. However, with network
and IT technologies moving at a faster pace than alarm management, we suggest that an alarm
management system should still be smart but also be flexible. The first step towards this is
understanding alarm behavior and providing simple tools that allow operations to define
which alarms are actionable and under what conditions they become so.