Lost Cause Analysis: Approaches to trigger incident resolution
Dan Young (dan@firstoccurrence.com)
January 2012
Alarm management systems are often accused of overwhelming their users, resulting in
missed outages or an elevated Mean Time to Repair (MTTR). A frequently cited resolution
to this problem is “Root cause analysis” (RCA). RCA in alarm management is the process
of automatically detecting the most probable cause of an alarm. An outage or an alarm will
be caused by something, and it becomes an incident when it impacts a service or a
business function. Generally, the larger the impact, the greater the number of events, and
the impact and cause of an outage can easily be lost in the flood of events. Events are
isolated and do not naturally relate to other events, so the process of RCA attempts to
piece together all of these unrelated pieces of data (plus data from other data sources)
and pinpoint the probable cause. The benefits of this are easy to see:
1. Operators focus on the problem rather than sifting through events.
2. Ideally, MTTR is reduced.
3. Ultimately, resolution of the outage could be further automated, for example by raising a truck roll.
Unfortunately, identification of the true root cause is nearly impossible at current
monitoring levels, since an alarm is almost always a consequence or a symptom of
something else that is not monitored or measurable, such as a power failure, a software
bug or a configuration error. Even if a very advanced RCA system is able to isolate an
offline segment of the network to a particular WAN interface via network topology and
event and polling investigations, that WAN interface being offline is not the root cause; it is
simply a consequence. The outage is far more likely due to an unauthorised change or a
carrier issue than to the interface being disabled. Additionally, such a tidy scenario of a
network segment going offline is itself increasingly unlikely. A modern network is far more
likely to have many layers of redundancy and duplication that make RCA unnecessary,
since the impact of a failure is isolated and routed around.
An RCA system has truly found the root cause only if it can trigger an automated
resolution action. This is difficult and rare, since the possible root causes of any outage in
any given IT network environment are nearly limitless and ever expanding. The value of
developing RCA becomes harder to justify over time as technologies mature and become
more stable. Over time, networks and IT systems are becoming more robust, more
redundant and better designed, with greater capacity and resiliency. These new
technologies often require new analysis techniques themselves, and alarm management
RCA policies generally lag far behind, if they ever arrive.
Despite these inherent limitations, RCA remains the pinnacle of alarm management. For
all but the most static network or system, we propose deprioritising automated RCA
capabilities, leaving that analysis to human operators and element managers, and instead
extending alarm management with a framework that supports operations in delivering
lower MTTR and business focus by identifying actionable abnormalities.
Alarm Configuration Database
We define an actionable alarm as an outage that impacts a service and/or requires
intervention to resolve. Not all alarms require operator intervention; our analyses have
found that the majority of alarms resolve themselves quickly enough that manual
intervention would not be possible anyway. To provide operations focus, we suggest the
following criteria that every alarm must pass before being presented or ticketed. The
answer must be yes to all of the following questions to warrant the raising of a ticket:
1. Does the alarm indicate an outage that could impact the business operation?
2. Can the alarm/outage resolve itself?
3. Is the alarm out of the ordinary (i.e. has it never happened or does it rarely happen)?
4. Has the alarm gone past the point of resolving itself?
These are simple rules that should be defined in an Alarm Configuration Database
(AMDB). An AMDB will provide visibility into what is being managed and enable users to
quickly adapt alarm behavior to changing network or customer needs.
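As an illustrative sketch only, the following Python fragment shows how these four questions might be encoded as a per-alarm-type rule held in the AMDB and evaluated before ticketing. The field names (business_impact, self_resolve_window and so on) are assumptions made for illustration, not a prescribed schema.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class AlarmRule:
        # One AMDB entry per alarm type; all field names are illustrative.
        business_impact: bool           # Q1: could an outage impact the business?
        can_self_resolve: bool          # Q2: can the alarm/outage resolve itself?
        self_resolve_window: timedelta  # Q4: time after which self-resolution is unlikely

    def should_ticket(rule: AlarmRule, first_seen: datetime,
                      prior_occurrences: int, now: datetime) -> bool:
        # Raise a ticket only when the answer to all four questions is yes.
        q1 = rule.business_impact
        q2 = rule.can_self_resolve
        q3 = prior_occurrences <= 1                       # Q3: never or rarely seen
        q4 = now - first_seen > rule.self_resolve_window  # Q4: past self-resolution
        return q1 and q2 and q3 and q4

Holding these rules as data in the AMDB rather than in code is the point: operations can adjust a threshold such as self_resolve_window as network or customer needs change, without redeploying the alarm pipeline.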
Alarm History
Alarm history is a rich source of data that can be tapped for improved operations.
Standard metrics like MTTR or Mean Time Between Failure (MTBF) can be applied to
incident management, so we can learn a great deal about alarms and escalate them when
something is out of the ordinary.
Based on our experience, we have found the following rule matrix:

Metric              Indicates
Low MTTR            Likely to resolve itself.
High MTTR           Unlikely to resolve itself.
Low MTBF            Likely to resolve itself.
High MTBF           Unlikely to resolve itself. Likely to have a bigger business impact.
No MTBF (unknown)   Unlikely to resolve itself. Likely to have a business impact.
These metrics can be used to assign priorities to alarms and to drive powerful escalation
rules.
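As a sketch of how these metrics might be derived, assuming a hypothetical alarm-history format of chronological (raised_at, cleared_at) pairs per alarm type and purely illustrative thresholds, MTTR and MTBF can be computed and mapped onto the matrix above:

    from statistics import mean

    def mttr_mtbf(history):
        # history: chronological list of (raised_at, cleared_at) datetime pairs
        # for a single alarm type.
        repairs = [(cleared - raised).total_seconds() for raised, cleared in history]
        mttr = mean(repairs) if repairs else None
        gaps = [(history[i + 1][0] - history[i][1]).total_seconds()
                for i in range(len(history) - 1)]
        mtbf = mean(gaps) if gaps else None  # None maps to "No MTBF (unknown)"
        return mttr, mtbf

    def escalation_priority(mttr, mtbf, low_mttr=300.0, high_mtbf=30 * 86400.0):
        # Thresholds (5 minutes, 30 days) are illustrative, not prescriptive.
        if mtbf is None or mtbf > high_mtbf:
            return "high"    # rare or never seen: unlikely to self-resolve
        if mttr is not None and mttr < low_mttr:
            return "low"     # historically clears itself quickly
        return "medium"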
Alarm Cluster Analysis
A generic RCA pattern we have seen success with is also the simplest. It has been
applied and proven in a number of network technologies that have a degree of scale. That
approach is “Alarm Clustering” or “Alarm Grouping”:
X events in Y time in Z something, where:
X is the number of events
Y is the time period
Z is a logical group (e.g. location, network, customer or product)
e.g. 10 events in 5 minutes at the Melbourne Branch indicates something is
happening there and it should be investigated.
We suggest that to get the best value from this approach, the AMDB should allow these X,
Y and Z thresholds to be configured and viewed.
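A minimal sketch of such a clustering rule follows, using a hypothetical in-memory sliding window keyed by group; in practice X, Y and Z would be read from the AMDB rather than hard-coded:

    from collections import defaultdict, deque

    class ClusterRule:
        # Flag a group when X events arrive within a Y-second window in group Z.
        def __init__(self, x_events, y_seconds):
            self.x = x_events
            self.y = y_seconds
            self.windows = defaultdict(deque)  # group -> recent event timestamps

        def on_event(self, group, timestamp):
            window = self.windows[group]
            window.append(timestamp)
            # Discard events that have fallen outside the Y-second window.
            while window and timestamp - window[0] > self.y:
                window.popleft()
            return len(window) >= self.x  # True: this group warrants investigation

    # e.g. 10 events in 5 minutes, grouped by location (timestamps in epoch seconds):
    rule = ClusterRule(x_events=10, y_seconds=300)
    # rule.on_event("Melbourne Branch", event_time) returns True once the tenth
    # event lands inside the window, prompting investigation of that site.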
Conclusion
In general we believe RCA is powerful and remains the ultimate goal. However, with
network and IT technologies moving at a faster pace than alarm management, we suggest
that an alarm management system should be not only smart but also flexible. The first
step is to understand alarm behavior and to provide simple tools that allow operations to
define which alarms are actionable and under what conditions they become so.