The elusive root cause

The Elusive Root Cause Of IT Problems
And How To Easily Identify It

Noam Biran
Director of Product Management

Introduction
Mr. Biran
• Director of Product Management at Neebula
• 20 years of experience in systems management & BSM
• Innovation Product Management at BMC
• Co-founder of Appilog (now HP uCMDB & DDMA)

About Neebula
Neebula provides the first and only automatic service-centric IT management
solution allowing IT organizations to improve the service provided to the business
by shifting from managing disparate technology silos to managing the services
running in the data center. Leveraging unique technology that automatically maps
business services to the underlying infrastructure, Neebula enables the IT team to
increase availability of the main services they manage and reduce the time to
repair of problems.

Agenda
• Introduction
• Root cause analysis defined
• The problem resolution process
• Problem detection
• Root cause analysis methods
• Improving root cause analysis processes

Root Cause Analysis Definition
ITIL V3
An Activity that identifies the Root Cause of
an Incident or Problem.
Root Cause Analysis typically concentrates on
IT Infrastructure failures.

Wikipedia
Root Cause Analysis is any structured
approach to identify the factors that resulted
in the harmful consequences of one or more
past events

The importance of Root Cause Analysis
• Root Cause Analysis has a high impact on
– IT processes
• The efficiency of the overall incident/problem
management process
• Good RCA discipline requires well established
configuration management
– Organizational goals
• Meeting internal and external SLAs
• Financial (budget & revenue) implications
• Brand / customer loyalty

The Critical Role of Root Cause Analysis
• Improper (or lack of) identification of the real
root cause may yield:
– Repeating problems
– Increased downtime
– Waste of human resources on “fixing” the wrong issues
– Risk to the business

The Life of The Operator
We expect the operator to:
– Handle thousands of cryptic events
– Understand the impact on hundreds of services
– Understand the correlation to customer service complaints
– Understand what changed
– Orchestrate the resolution
And to make these decisions within minutes to reduce MTTR

Are we giving our operators the tools to
succeed?

Problem Resolution Process
• Events come in to the NOC
• The NOC performs some investigation
• Root cause analysis is shared between the NOC
& 2nd/3rd level support (admins)
• Low level diagnostics & problem resolution
are done by 2nd/3rd level support (admins)

Involved Parties & Tools

• Tools
– Monitoring tools
– Configuration management tools
• People
– Users
– NOC
– Admins – specialized teams focused on specific
area, e.g. system, database, network
– Application support / developers

The Common Process – Blame Game
• No structured process
• Lack of overall cross-domain view
• Each team has its own terminology and view
• Each team is working on its own

Potential Problem Symptoms
• Lack of certain functionality
– A certain transaction does not work
• Performance degradation
– Fund transfer response time is above 2 sec.
• Availability issue
– Application doesn’t work
• None
– Unnoticeable failure due to high availability
configuration

Problem Detection
• Good problem detection methods are key for a
structured root cause analysis process
• Problem detection tools should provide sufficient
data to the root cause analysis process
• There are various distinct methods each with its
pros and cons
• There is no single superior detection method

Detection – Users
• What it does
– Compensates for unknown / unreported
problems
• What it doesn’t
– Supposedly accurate, but might actually point in the wrong direction
– Usually takes place too late for a quick fix, after the business is already impacted

Detection – Infrastructure Monitoring
• What it does
– Monitor each technical element
comprising the service
– Great way to identify
specific availability failures
• What it doesn’t
– Hard to correlate with real user experience
– Too many false positives
– Lots of events on symptoms rather than on the actual problem

Detection – End User Experience
• What it does
– Measure overall response time of user transactions
– Synthetic or real user transactions
– The ultimate problem detection method
• What it doesn’t
– No real breakdown to assist in pinpointing the problem or even the domain
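
As a rough illustration of synthetic end-user monitoring, the sketch below times one scripted transaction and flags it when the response time exceeds a threshold. The 2-second default echoes the fund transfer example from the symptoms slide. The names `run_synthetic_check` and `fake_fund_transfer` are hypothetical, and a real probe would drive the actual application instead of sleeping.

```python
import time
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    seconds: float
    degraded: bool


def run_synthetic_check(name: str, transaction: Callable[[], None],
                        threshold_seconds: float = 2.0) -> CheckResult:
    """Time one scripted transaction and compare it with the threshold."""
    start = time.perf_counter()
    transaction()
    elapsed = time.perf_counter() - start
    return CheckResult(name, elapsed, elapsed > threshold_seconds)


def fake_fund_transfer() -> None:
    # Stand-in for a real scripted user transaction (assumption).
    time.sleep(0.05)


result = run_synthetic_check("fund transfer", fake_fund_transfer)
print(f"{result.name}: {result.seconds:.3f}s degraded={result.degraded}")
```

The result only says that the whole transaction was slow. It carries no per-tier timing, which is the limitation this slide points out.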

Detection – Transaction Breakdown
• What it does
– Discovery of each transaction’s path
within the data center
– Highlight potential performance
problems within the transaction
execution
• What it doesn’t
– No correlation to infrastructure monitoring
– Cannot cover the entire data center (domain specific)

Detection – Domain Specific Tools
• What it does
– Drill down in a specific application
– Great analysis & diagnostics within an application
• What it doesn’t
– No data center wide view
– Lack of insight into the connections between applications

Potential Root Cause Types

• Configuration change
• Version upgrade
• Hardware fault
• Software bug
• Capacity problem
• Resource collision

Common Ways for Root Cause Analysis

• War room scenario
• The log file approach
• APM tools
• Transaction management
• Manual event correlation / analysis

War Room Scenario

• Getting everyone in the same room
• Each participant has their own data and terminology
• Blame game
• Takes a lot of time

The Log File Approach

• An admin sits and analyzes log files and
other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
are not the root cause
(distractions)

APM Tools

• An admin sits and analyzes log files and
other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
are not the root cause
(distractions)

Transaction Management

• A great tool to point to the probable area
where the root cause resides
• Limited to specific domains
• Inability to correlate with infrastructure
metrics / failures

Manual Event Correlation / Analysis

• Requires cross-domain expertise
• Requires understanding of dependencies
between components
• Time consuming
• Lack of insight into other
non-event data

Improving Root Cause Analysis Processes

Making The Best Of Existing Tools

• Choose problem detection methods that
assist in the root cause analysis process
• Turn the root cause analysis into a
structured process
– Internal team processes
– Inter-team processes
• Common language & visibility between
teams

New Methods: Mapping

• Mapping of business services & applications
to the supporting infrastructure
• Ties symptoms (user) to problems
(technology)
• Introduces a common language between
teams
• Enables a high level cross-domain view
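
One way to picture such a map is as a dependency graph that can be walked in either direction. The sketch below is a minimal illustration, not any vendor's data model: the service names, host names, and the `DEPENDS_ON` structure are all made up for the example.

```python
from collections import deque

# Hypothetical topology: each entry lists what the key depends on.
DEPENDS_ON = {
    "Online Banking": ["web-01", "web-02"],
    "Payroll": ["app-02"],
    "web-01": ["app-01"],
    "web-02": ["app-01"],
    "app-01": ["db-01"],
    "app-02": ["db-02"],
    "db-01": ["san-01"],
    "db-02": ["san-01"],
}
BUSINESS_SERVICES = {"Online Banking", "Payroll"}


def supporting_infrastructure(service):
    """Return every component the service depends on, directly or indirectly."""
    seen = set()
    queue = deque(DEPENDS_ON.get(service, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(DEPENDS_ON.get(node, []))
    return seen


def impacted_services(component):
    """Return the business services whose map contains the component."""
    return sorted(s for s in BUSINESS_SERVICES
                  if component in supporting_infrastructure(s))


print(impacted_services("db-01"))   # ['Online Banking']
print(impacted_services("san-01"))  # ['Online Banking', 'Payroll']
```

Walking from a service down to its components gives the technology view. Walking from a failed component back up to services gives the user-facing view. Both views use the same names, which is what gives the teams a common language.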

New Methods: Structured Process

• Define a structured process for problem
investigation and root cause analysis
• Define how collaboration should occur
during root cause analysis between teams

New Methods: Tools

• Use tools that provide a historical
dimension for problem investigation
• Use tools that enable the correlation of
problems to configuration changes
• Use topology-based correlation instead of
rule-based (or manual) correlation
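
To make the last two points concrete, here is a minimal sketch of topology-based correlation followed by a lookup of recent configuration changes. It assumes a simple dependency map and change log; all component names, timestamps, and function names are hypothetical. For brevity it checks direct dependencies only, where a fuller version would follow them transitively.

```python
from datetime import datetime, timedelta

# Hypothetical topology: component -> components it depends on.
DEPENDS_ON = {
    "web-01": ["app-01"],
    "app-01": ["db-01"],
    "db-01": ["san-01"],
    "san-01": [],
}

# Hypothetical change log: (component, time of change, description).
CHANGES = [
    ("db-01", datetime(2024, 1, 10, 8, 30), "connection pool size changed"),
    ("web-01", datetime(2024, 1, 3, 14, 0), "TLS certificate renewed"),
]


def root_cause_candidates(alerting):
    """Keep alerting components none of whose direct dependencies are alerting.

    An alert on a component whose dependency is also alerting is treated
    as a symptom of the failure further down the topology.
    """
    alerting = set(alerting)
    return sorted(
        c for c in alerting
        if not any(dep in alerting for dep in DEPENDS_ON.get(c, []))
    )


def recent_changes(component, incident_time, window=timedelta(hours=24)):
    """List changes made to the component within the window before the incident."""
    return [
        desc for comp, when, desc in CHANGES
        if comp == component and incident_time - window <= when <= incident_time
    ]


incident_time = datetime(2024, 1, 10, 9, 0)
candidates = root_cause_candidates(["web-01", "app-01", "db-01"])
print(candidates)  # ['db-01']
for c in candidates:
    print(c, recent_changes(c, incident_time))
```

Three events collapse into one candidate without any hand-written correlation rules, because the topology itself tells which alerts are downstream symptoms. The change lookup then gives the investigator a first place to look.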
