The elusive root cause



Published in: Technology, Education

  • Introduction to the subject. Webinar logistics: presentation first; send questions during the talk; questions are answered at the end.
  • RCA is problematic even to define. ITIL definition -> useless; ITIL failed. Wikipedia definition: structured; factors; consequences; past events (I’ll call them symptoms).
  • Talk about each bullet.
  • Many data sources (event feeds). All are mixed and funneled into the NOC. The NOC needs to filter and order them based on relevance and on whether each event is a source event or a derived one. But the NOC doesn’t have the tools or processes to do this. There is no structured way to do this filtering (though the NOC is used to structured processes such as run books).
  • Taking care of the symptoms and not the problems. Associating the wrong events -> arriving at an incorrect root cause.
  • The NOC is used to structured processes (such as run books). We don’t give them tools. We don’t give them structured processes (or any processes). They usually don’t possess cross-domain knowledge.
  • Isolation and diagnostics. The NOC’s investigation may result in forwarding to the wrong team, and therefore in the wrong analysis being done in the wrong context.
  • Explain each. How do they all tie together? Usually they don’t.
  • Problem detection begins with the symptoms. The same symptoms may be caused by different problems.
  • We need a combination of tools. Choose the right mix to assist in the RCA process. Need synergy between the methods.
  • Cross-domain. Cross-discipline. Requires deep understanding.
  • Not a structured approach.
  • Transcript

    • 1. The Elusive Root Cause Of IT Problems And How To Easily Identify It
        • Noam Biran, Director of Product Management
    • 2. Introduction
        • Mr. Biran
            – Director of Product Management at Neebula
            – 20 years of experience in systems management & BSM
            – Innovation Product Management at BMC
            – Co-founder of Appilog (now HP uCMDB & DDMA)
        • About Neebula
            – Neebula provides the first and only automatic service-centric IT management solution, allowing IT organizations to improve the service provided to the business by shifting from managing disparate technology silos to managing the services running in the data center. Leveraging unique technology that automatically maps business services to the underlying infrastructure, Neebula enables the IT team to increase availability of the main services they manage and reduce the time to repair of problems.
    • 3. Agenda
        • Introduction
        • Root cause analysis defined
        • The problem resolution process
        • Problem detection
        • Root cause analysis methods
        • Improving root cause analysis processes
    • 4. Root Cause Analysis Definition
        • ITIL V3: An Activity that identifies the Root Cause of an Incident or Problem. Root Cause Analysis typically concentrates on IT Infrastructure failures.
        • Wikipedia: Root Cause Analysis is any structured approach to identify the factors that resulted in the harmful consequences of one or more past events.
    • 5. The importance of Root Cause Analysis
        • Root Cause Analysis has a high impact on
            – IT processes
                • The efficiency of the overall incident/problem management process
                • Good RCA discipline requires well-established configuration management
            – Organizational goals
                • Meeting internal and external SLAs
                • Financial (budget & revenue) implications
                • Brand / customer loyalty
    • 6. Root Cause Analysis Nowadays
    • 7. The Critical Role of Root Cause Analysis
        • Improper (or lack of) identification of the real root cause may yield:
            – Repeating problems
            – Increased downtime
            – Waste of human resources on “fixing” the wrong issues
            – Risk to the business
    • 8. The Life of The Operator
        • We expect the operator
            – To handle 1000s of cryptic events
            – To understand the impact on 100s of services
            – To understand the correlation to customer service complaints
            – To understand what changed
            – To orchestrate the resolution
        • And to make these decisions within minutes to reduce MTTR
        • Are we giving our operators the tools to succeed?
    • 9. Problem Resolution Process
    • 10. Problem Resolution Process
        • Events coming in to the NOC
        • NOC performs some investigation
        • Root cause analysis is shared between NOC & 2nd/3rd level support (admins)
        • Low-level diagnostics & problem resolution are done by 2nd/3rd level support (admins)
    • 11. Involved Parties & Tools
        • Tools
            – Monitoring tools
            – Configuration management tools
        • People
            – Users
            – NOC
            – Admins: specialized teams focused on a specific area, e.g. system, database, network
            – Application support / developers
    • 12. The Common Process: Blame Game
        • No structured process
        • Lack of overall cross-domain view
        • Each team has its own terminology and view
        • Each team is working on its own
    • 13. Problem Detection
    • 14. Potential Problem Symptoms
        • Lack of certain functionality
            – A certain transaction does not work
        • Performance degradation
            – Fund transfer response time is above 2 sec.
        • Availability issue
            – Application doesn’t work
        • None
            – Unnoticeable failure due to high availability configuration
    • 15. Problem Detection
        • Good problem detection methods are key for a structured root cause analysis process
        • Problem detection tools should provide sufficient data to the root cause analysis process
        • There are various distinct methods, each with its pros and cons
        • There is no single superior detection method
    • 16. Detection – Users
        • What it does
            – Compensates for unknown / unreported problems
        • What it doesn’t
            – Supposedly accurate, but might actually point in the wrong direction
            – Usually takes place too late for a quick fix, once the business is already impacted
    • 17. Detection – Infrastructure Monitoring
        • What it does
            – Monitors each technical element comprising the service
            – Great way to identify specific availability failures
        • What it doesn’t
            – Hard to correlate with real user experience
            – Too many false positives
            – Lots of events on symptoms rather than on the actual problem
    • 18. Detection – End User Experience
        • What it does
            – Measures overall response time of user transactions
            – Synthetic or real user transactions
            – The ultimate problem detection method
        • What it doesn’t
            – No real breakdown to assist in pinpointing the problem or even the domain
    • 19. Detection – Transaction Breakdown
        • What it does
            – Discovers each transaction’s path within the data center
            – Highlights potential performance problems within the transaction execution
        • What it doesn’t
            – No correlation to infrastructure monitoring
            – Cannot cover the entire data center (domain specific)
    • 20. Detection – Domain Specific Tools
        • What it does
            – Drill down in a specific application
            – Great analysis & diagnostics within an application
        • What it doesn’t
            – No data center wide view
            – Lack of insight into the connections between applications
    • 21. Detection - Synergy
    • 22. Root Cause Analysis Methods
    • 23. Potential Root Cause Types
        • Configuration change
        • Version upgrade
        • Hardware fault
        • Software bug
        • Capacity problem
        • Resource collision
    • 24. Common Ways for Root Cause Analysis
        • War room scenario
        • The log file approach
        • APM tools
        • Transaction management
        • Manual event correlation / analysis
    • 25. War Room Scenario
        • Getting everyone in the same room
        • Each has their own data and terminology
        • Blame game
        • Takes a lot of time
    • 26. The Log File Approach
        • An admin sits and analyzes log files and other historical data from various sources
        • A domain specific approach
        • Certain degree of structured process
        • Might identify problems that are not the root cause (distractions)
    • 27. APM Tools
        • An admin sits and analyzes log files and other historical data from various sources
        • A domain specific approach
        • Certain degree of structured process
        • Might identify problems that are not the root cause (distractions)
    • 28. Transaction Management
        • A great tool to point to the probable area where the root cause resides
        • Limited to specific domains
        • Inability to correlate with infrastructure metrics / failures
    • 29. Manual Event Correlation / Analysis
        • Requires cross-domain expertise
        • Requires understanding of dependencies between components
        • Time consuming
        • Lack of insight into other non-event data
    • 30. Improving Root Cause Analysis Processes
    • 31. Making The Best Of Existing Tools
        • Choose problem detection methods that assist in the root cause analysis process
        • Turn the root cause analysis into a structured process
            – Internal team processes
            – Inter-team processes
        • Common language & visibility between teams
    • 32. New Methods: Mapping
        • Mapping of business services & applications to the supporting infrastructure
        • Ties symptoms (user) to problems (technology)
        • Introduces a common language between teams
        • Enables a high level cross-domain view
    • 33. New Methods: Structured Process
        • Define a structured process for problem investigation and root cause analysis
        • Define how collaboration should occur during root cause analysis between teams
    • 34. New Methods: Tools
        • Use tools that provide a historical dimension for problem investigation
        • Use tools that enable the correlation of problems to configuration changes
        • Use topology-based correlation instead of rule-based (or manual) correlation
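
The ideas on slides 32 and 34 (a service map, topology-based correlation, and tying problems to configuration changes) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the presentation or from any product: the `DEPENDS_ON` map, the component names, the `Event` and `Change` records, the 24-hour window, and the simple heuristic that an alerting component is a root cause candidate only if none of its own dependencies are alerting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical service map: each component lists what it depends on.
DEPENDS_ON = {
    "fund-transfer": ["web-01", "app-01"],
    "web-01": ["app-01"],
    "app-01": ["db-01"],
    "db-01": ["san-01"],
    "san-01": [],
}


@dataclass(frozen=True)
class Event:
    component: str
    message: str
    at: datetime


@dataclass(frozen=True)
class Change:
    component: str
    description: str
    at: datetime


def dependencies_of(component, topology):
    """Return every component reachable from `component` via depends-on edges."""
    seen, stack = set(), list(topology.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology.get(node, []))
    return seen


def root_cause_candidates(service, events, topology):
    """Alerting components under `service` with no alerting dependency.

    An alerting component that depends on another alerting component is
    treated as a derived symptom; events outside the service are ignored.
    """
    in_scope = dependencies_of(service, topology) | {service}
    alerting = {e.component for e in events if e.component in in_scope}
    return sorted(
        c for c in alerting if not (dependencies_of(c, topology) & alerting)
    )


def recent_changes(candidates, changes, incident_start,
                   window=timedelta(hours=24)):
    """Changes made to candidate components within `window` before the incident."""
    return [
        ch for ch in changes
        if ch.component in candidates
        and incident_start - window <= ch.at <= incident_start
    ]


# Illustrative data only.
now = datetime(2024, 1, 1, 12, 0)
events = [
    Event("web-01", "HTTP 500 rate high", now),
    Event("app-01", "connection pool exhausted", now),
    Event("db-01", "slow queries", now),
    Event("mail-01", "queue length high", now),  # unrelated to the service
]
changes = [
    Change("db-01", "index dropped", now - timedelta(hours=3)),
    Change("web-01", "certificate renewed", now - timedelta(hours=2)),
]

candidates = root_cause_candidates("fund-transfer", events, DEPENDS_ON)
suspects = recent_changes(candidates, changes, now)
print(candidates)                          # ['db-01']
print([ch.description for ch in suspects])  # ['index dropped']
```

In this toy run the web and application server events are classified as derived symptoms because the database they depend on is also alerting, and the unrelated mail server event is filtered out because it is not part of the service map. A real tool would need a discovered, up-to-date topology and a far more careful heuristic; the sketch only shows why a dependency map lets correlation be computed rather than hand-written as rules.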