The elusive root cause

Published in: Technology, Education

  1. The Elusive Root Cause Of IT Problems
     And How To Easily Identify It
     Noam Biran, Director of Product Management
  2. Introduction
     Mr. Biran
     • Director of Product Management at Neebula
     • 20 years' experience in systems management & BSM
     • Innovation Product Management at BMC
     • Co-founder of Appilog (now HP uCMDB & DDMA)
     About Neebula
     Neebula provides the first and only automatic service-centric IT management solution, allowing IT organizations to improve the service provided to the business by shifting from managing disparate technology silos to managing the services running in the data center. Leveraging unique technology that automatically maps business services to the underlying infrastructure, Neebula enables the IT team to increase the availability of the main services they manage and reduce the time to repair problems.
  3. Agenda
     • Introduction
     • Root cause analysis defined
     • The problem resolution process
     • Problem detection
     • Root cause analysis methods
     • Improving root cause analysis processes
  4. Root Cause Analysis Definition
     ITIL V3: An Activity that identifies the Root Cause of an Incident or Problem. Root Cause Analysis typically concentrates on IT Infrastructure failures.
     Wikipedia: Root Cause Analysis is any structured approach to identify the factors that resulted in the harmful consequences of one or more past events.
  5. The Importance of Root Cause Analysis
     • Root Cause Analysis has a high impact on
       – IT processes
         • The efficiency of the overall incident/problem management process
         • Good RCA discipline requires well established configuration management
       – Organizational goals
         • Meeting internal and external SLAs
         • Financial (budget & revenue) implications
         • Brand / customer loyalty
  6. Root Cause Analysis Nowadays
  7. The Critical Role of Root Cause Analysis
     • Improper (or lack of) identification of the real root cause may yield:
       – Repeating problems
       – Increased downtime
       – Waste of human resources on “fixing” the wrong issues
       – Risk to the business
  8. The Life of The Operator
     We expect the operator
       – To handle thousands of cryptic events
       – To understand the impact on hundreds of services
       – To understand the correlation to customer service complaints
       – To understand what changed
       – To orchestrate the resolution
     And to make these decisions within minutes to reduce MTTR
     Are we giving our operators the tools to succeed?
  9. Problem Resolution Process
  10. Problem Resolution Process
      • Events come in to the NOC
      • NOC performs some investigation
      • Root cause analysis is shared between NOC & 2nd/3rd level support (admins)
      • Low level diagnostics & problem resolution are done by 2nd/3rd level support (admins)
  11. Involved Parties & Tools
      • Tools
        – Monitoring tools
        – Configuration management tools
      • People
        – Users
        – NOC
        – Admins: specialized teams focused on a specific area, e.g. system, database, network
        – Application support / developers
  12. The Common Process – Blame Game
      • No structured process
      • Lack of overall cross-domain view
      • Each team has its own terminology and view
      • Each team is working on its own
  13. Problem Detection
  14. Potential Problem Symptoms
      • Lack of certain functionality
        – A certain transaction does not work
      • Performance degradation
        – Fund transfer response time is above 2 sec.
      • Availability issue
        – Application doesn’t work
      • None
        – Unnoticeable failure due to high availability configuration
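
The four symptom types on slide 14 can be sketched as a small classifier. This is only an illustration: the 2-second limit comes from the slide's fund transfer example, while the `Probe` structure, its field names, and `classify_symptom` are hypothetical and not part of the original deck.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the slide's example ("response time is above 2 sec.").
RESPONSE_TIME_LIMIT_SEC = 2.0


@dataclass
class Probe:
    """One observation of a service (hypothetical structure for illustration)."""
    reachable: bool                      # did the application respond at all?
    transaction_ok: bool                 # did the tested transaction complete?
    response_time_sec: Optional[float]   # None when there was no response
    failed_over: bool = False            # a redundant node silently took over


def classify_symptom(p: Probe) -> str:
    """Map a probe result onto the symptom types listed on the slide."""
    if not p.reachable:
        return "availability issue"
    if not p.transaction_ok:
        return "lack of functionality"
    if p.response_time_sec is not None and p.response_time_sec > RESPONSE_TIME_LIMIT_SEC:
        return "performance degradation"
    if p.failed_over:
        # The user sees nothing, but a failure still occurred.
        return "none (masked by high availability)"
    return "healthy"
```

The last symptom type is the one most easily missed: a user-facing check reports nothing, so only a signal from the infrastructure side (here the assumed `failed_over` flag) reveals it.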
  15. Problem Detection
      • Good problem detection methods are key for a structured root cause analysis process
      • Problem detection tools should provide sufficient data to the root cause analysis process
      • There are various distinct methods, each with its pros and cons
      • There is no single superior detection method
  16. Detection – Users
      • What it does
        – Compensates for unknown / unreported problems
      • What it doesn’t
        – Supposedly accurate, but might actually point in the wrong direction
        – Usually happens too late for a quick fix, after the business is already impacted
  17. Detection – Infrastructure Monitoring
      • What it does
        – Monitors each technical element comprising the service
        – Great way to identify specific availability failures
      • What it doesn’t
        – Hard to correlate with real user experience
        – Too many false positives
        – Lots of events on symptoms rather than on the actual problem
  18. Detection – End User Experience
      • What it does
        – Measures overall response time of user transactions
        – Synthetic or real user transactions
        – The ultimate problem detection method
      • What it doesn’t
        – No real breakdown to assist in pinpointing the problem or even the domain
  19. Detection – Transaction Breakdown
      • What it does
        – Discovers each transaction’s path within the data center
        – Highlights potential performance problems within the transaction execution
      • What it doesn’t
        – No correlation to infrastructure monitoring
        – Cannot cover the entire data center (domain specific)
  20. Detection – Domain Specific Tools
      • What it does
        – Drills down in a specific application
        – Great analysis & diagnostics within an application
      • What it doesn’t
        – No data center wide view
        – Lack of insight into the connections between applications
  21. Detection – Synergy
  22. Root Cause Analysis Methods
  23. Potential Root Cause Types
      • Configuration change
      • Version upgrade
      • Hardware fault
      • Software bug
      • Capacity problem
      • Resource collision
  24. Common Ways for Root Cause Analysis
      • War room scenario
      • The log file approach
      • APM tools
      • Transaction management
      • Manual event correlation / analysis
  25. War Room Scenario
      • Getting everyone in the same room
      • Each has its own data and terminology
      • Blame game
      • Takes a lot of time
  26. The Log File Approach
      • An admin sits and analyzes log files and other historical data from various sources
      • A domain specific approach
      • Certain degree of structured process
      • Might identify problems that are not the root cause (distractions)
  27. APM Tools
      • An admin sits and analyzes log files and other historical data from various sources
      • A domain specific approach
      • Certain degree of structured process
      • Might identify problems that are not the root cause (distractions)
  28. Transaction Management
      • A great tool to point to the probable area where the root cause resides
      • Limited to specific domains
      • Inability to correlate with infrastructure metrics / failures
  29. Manual Event Correlation / Analysis
      • Requires cross-domain expertise
      • Requires understanding of dependencies between components
      • Time consuming
      • Lack of insight into other non-event data
  30. Improving Root Cause Analysis Processes
  31. Making The Best Of Existing Tools
      • Choose problem detection methods that assist in the root cause analysis process
      • Turn the root cause analysis into a structured process
        – Internal team processes
        – Inter-team processes
      • Common language & visibility between teams
  32. New Methods: Mapping
      • Mapping of business services & applications and the supporting infrastructure
      • Ties symptoms (user) to problems (technology)
      • Introduces a common language between teams
      • Enables a high level cross-domain view
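
The mapping idea on slide 32 can be sketched as a dependency graph that is walked in both directions: from a user-facing service down to the infrastructure that supports it, and from a failing component up to the services it affects. The graph contents, node names, and function names below are hypothetical; the slides do not describe how any particular product stores or discovers its map.

```python
# Hypothetical service map: each node lists the nodes it directly depends on.
DEPENDS_ON = {
    "fund-transfer": ["app-01"],
    "online-banking": ["web-01"],
    "web-01": ["app-01"],
    "app-01": ["db-01", "mq-01"],
    "reporting": ["app-02"],
    "app-02": ["db-02"],
}

BUSINESS_SERVICES = ["fund-transfer", "online-banking", "reporting"]


def supporting_components(node, graph=DEPENDS_ON):
    """Return every component the node depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def impacted_services(component, services=BUSINESS_SERVICES, graph=DEPENDS_ON):
    """Return the business services that rely on the given component."""
    return sorted(s for s in services if component in supporting_components(s, graph))
```

With this map, a user complaint about `fund-transfer` narrows the investigation to `app-01`, `db-01`, and `mq-01`, and a fault on `db-01` is immediately tied to the two services that depend on it. That shared view is what gives the teams a common language.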
  33. New Methods: Structured Process
      • Define a structured process for problem investigation and root cause analysis
      • Define how collaboration between teams should occur during root cause analysis
  34. New Methods: Tools
      • Use tools that provide a historical dimension for problem investigation
      • Use tools that enable the correlation of problems to configuration changes
      • Use topology-based correlation instead of rule-based (or manual) correlation
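
The last two points on slide 34 can be illustrated together. This is a minimal sketch under stated assumptions, not a description of any vendor's algorithm. Topology-based correlation is approximated as "an alerting component is a root cause candidate if nothing it depends on is also alerting". Change correlation is approximated as "list the configuration changes made to that component shortly before the problem". The graph, the change-record format, the 24-hour window, and all names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical topology: each node lists the nodes it directly depends on.
DEPENDS_ON = {
    "web-01": ["app-01"],
    "app-01": ["db-01", "mq-01"],
}


def all_dependencies(node, graph):
    """Return every node reachable from `node` by following dependencies."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def root_cause_candidates(alerting, graph=DEPENDS_ON):
    """Keep only alerting nodes whose own dependencies are all quiet.

    Alerts on nodes that sit above another alerting node are treated as
    symptoms of that lower-level problem.
    """
    alerting = set(alerting)
    return sorted(n for n in alerting if not (all_dependencies(n, graph) & alerting))


def recent_changes(changes, component, problem_time, window=timedelta(hours=24)):
    """Return changes to `component` made within `window` before the problem."""
    return [
        c for c in changes
        if c["component"] == component
        and timedelta(0) <= problem_time - c["time"] <= window
    ]


# Usage example with made-up data.
problem_time = datetime(2012, 5, 10, 14, 0)
changes = [
    {"component": "db-01", "time": datetime(2012, 5, 10, 12, 0), "what": "index rebuilt"},
    {"component": "db-01", "time": datetime(2012, 5, 7, 9, 0), "what": "patch applied"},
    {"component": "web-01", "time": datetime(2012, 5, 10, 13, 0), "what": "cache resized"},
]
candidates = root_cause_candidates(["web-01", "app-01", "db-01"])
suspects = recent_changes(changes, candidates[0], problem_time)
```

Unlike rule-based correlation, nothing here has to be written per event type: the same traversal keeps working when the topology changes, provided the map is kept up to date.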
