The Elusive Root Cause Of IT Problems
And How To Easily Identify It


Noam Biran
Director of Product Management
Introduction
               Mr. Biran
               •    Director of Product Management at Neebula
               •    20 years of experience in systems management & BSM
               •    Innovation Product Management at BMC
               •    Co-founder of Appilog (now HP uCMDB & DDMA)



 About Neebula
  Neebula provides the first and only automatic service-centric IT management
  solution allowing IT organizations to improve the service provided to the business
  by shifting from managing disparate technology silos to managing the services
  running in the data center. Leveraging unique technology that automatically maps
  business services to the underlying infrastructure, Neebula enables the IT team to
  increase the availability of the main services they manage and reduce the time to
  repair problems.
Agenda
•   Introduction
•   Root cause analysis defined
•   The problem resolution process
•   Problem detection
•   Root cause analysis methods
•   Improving root cause analysis processes
Root Cause Analysis Definition
   ITIL V3
              An Activity that identifies the Root Cause of
              an Incident or Problem.
              Root Cause Analysis typically concentrates on
              IT Infrastructure failures.



  Wikipedia
              Root Cause Analysis is any structured
              approach to identify the factors that resulted
              in the harmful consequences of one or more
              past events
The Importance of Root Cause Analysis
• Root Cause Analysis has a high impact on
  – IT processes
     • The efficiency of the overall incident/problem
       management process
     • Good RCA discipline requires well established
       configuration management
  – Organizational goals
     • Meeting internal and external SLAs
     • Financial (budget & revenue) implications
     • Brand / customer loyalty
Root Cause Analysis Nowadays
The Critical Role of Root Cause Analysis
• Improper (or lack of) identification of the real
  root cause may yield:
   – Repeating problems
   – Increased downtime
   – Waste of human resources on “fixing” the
     wrong issues
   – Risk to the business
The Life of The Operator
We expect the operator
    – To handle thousands of cryptic events
    – To understand the impact on hundreds of services
    – To understand the correlation to
       customer service complaints
    – To understand what changed
    – To orchestrate the resolution
And to make these decisions within minutes to
reduce MTTR

   Are we giving our operators the tools to
   succeed?
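MTTR (mean time to repair) can be made concrete with a small calculation. The sketch below is illustrative only: the incident timestamps are hypothetical, and it assumes MTTR is measured from detection to resolution, which is one common convention rather than something the slides define.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Return the average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)

# Hypothetical incidents as (detected, resolved) timestamp pairs.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_repair(incidents)
print(mttr)
```

Every minute spent identifying the root cause is counted in this interval, which is why slow root cause analysis shows up directly in MTTR.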
Problem Resolution Process
Problem Resolution Process
• Events come in to the NOC
• NOC performs some investigation
• Root cause analysis is shared between NOC
  & 2nd/3rd level support (admins)
• Low level diagnostics & problem resolution
  is done by 2nd/3rd level support (admins)
Involved Parties & Tools

• Tools
  – Monitoring tools
  – Configuration management tools
• People
  – Users
  – NOC
  – Admins – specialized teams, each focused on a specific
    area, e.g. system, database, network
  – Application support / developers
The Common Process – Blame Game
•   No structured process
•   Lack of overall cross-domain view
•   Each team has its own terminology and view
•   Each team is working on its own
Problem Detection
Potential Problem Symptoms
• Lack of certain functionality
  – A certain transaction does not work
• Performance degradation
  – Fund transfer response time is above 2 sec.
• Availability issue
  – Application doesn’t work
• None
  – Unnoticeable failure due to high availability
    configuration
Problem Detection
• Good problem detection methods are key for a
  structured root cause analysis process
• Problem detection tools should provide sufficient
  data to the root cause analysis process
• There are various distinct methods, each with its
  own pros and cons
• There is no single superior detection method
Detection – Users
• What it does
  – Compensates for unknown / unreported
    problems
• What it doesn’t
  – Supposedly accurate, but might actually point in
    the wrong direction
  – Usually takes place too late for a quick fix,
    after the business is already impacted
Detection – Infrastructure Monitoring
• What it does
  – Monitor each technical element
    comprising the service
  – Great way to identify
    specific availability failures
• What it doesn’t
  – Hard to correlate with real user experience
  – Too many false positives
  – Lots of events on symptoms rather than on the actual problem
Detection – End User Experience
• What it does
  – Measure overall response time of user transactions
  – Synthetic or real user transactions
  – The ultimate problem detection method
• What it doesn’t
  – No real breakdown to assist
    in pinpointing the problem
    or even the domain
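A synthetic transaction check can be sketched in a few lines. This is a minimal illustration, not any product's implementation: `run_synthetic_transaction` and `fake_fund_transfer` are hypothetical names, and the 2 second default threshold is borrowed from the fund transfer example earlier in the deck.

```python
import time

def run_synthetic_transaction(transaction, threshold_s=2.0):
    """Time one call of `transaction` and flag it if it exceeds the threshold."""
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except Exception:
        # Any exception is treated as a failed transaction in this sketch.
        succeeded = False
    elapsed = time.perf_counter() - start
    return {
        "succeeded": succeeded,
        "elapsed_s": elapsed,
        "degraded": succeeded and elapsed > threshold_s,
    }

def fake_fund_transfer():
    # Hypothetical stand-in for a real end-to-end user transaction.
    time.sleep(0.05)

# A deliberately low threshold so that the stand-in counts as degraded.
result = run_synthetic_transaction(fake_fund_transfer, threshold_s=0.01)
print(result)
```

The result only says that the transaction as a whole was slow or failed. It carries no information about which tier or domain caused the delay, which is the limitation listed above.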
Detection – Transaction Breakdown
• What it does
  – Discovery of each transaction’s path
    within the data center
  – Highlight potential performance
    problems within the transaction
    execution
• What it doesn’t
  – No correlation to infrastructure
    monitoring
  – Cannot cover the entire data center
    – domain specific
Detection – Domain Specific Tools
• What it does
  – Drill down in a specific application
  – Great analysis & diagnostics within an application
• What it doesn’t
  – No data center wide view
  – Lack of insight into the
    connections between
    applications
Detection – Synergy
Root Cause Analysis Methods
Potential Root Cause Types

•   Configuration change
•   Version upgrade
•   Hardware fault
•   Software bug
•   Capacity problem
•   Resource collision
Common Ways for Root Cause Analysis

•   War room scenario
•   The log file approach
•   APM tools
•   Transaction management
•   Manual event correlation / analysis
War Room Scenario

•   Getting everyone in the same room
•   Each participant has their own data and terminology
•   Blame game
•   Takes a lot of time
The Log File Approach

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
APM Tools

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
Transaction Management

• A great tool to point to the probable area
  where the root cause resides
• Limited to specific domains
• Inability to correlate with infrastructure
  metrics / failures
Manual Event Correlation / Analysis

• Requires cross-domain expertise
• Requires understanding of dependencies
  between components
• Time consuming
• Lack of insight into other
  non-event data
Improving Root Cause Analysis Processes
Making The Best Of Existing Tools

• Choose problem detection methods that
  assist in the root cause analysis process
• Turn the root cause analysis into a
  structured process
  – Internal team processes
  – Inter-team processes
• Common language & visibility between
  teams
New Methods: Mapping

• Mapping of business services & applications
  to the supporting infrastructure
• Ties symptoms (user) to problems
  (technology)
• Introduces a common language between
  teams
• Enables a high level cross-domain view
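The idea of a service map can be illustrated with a small dependency graph. The sketch below is an assumption-laden example, not how any particular tool stores its map: the component names, the `DEPENDS_ON` structure and the `impacted_services` helper are all hypothetical.

```python
from collections import deque

# Hypothetical map: each key depends on the components listed as its value.
DEPENDS_ON = {
    "fund-transfer": ["web-01", "app-01"],
    "online-statements": ["web-01", "app-02"],
    "app-01": ["db-01"],
    "app-02": ["db-02"],
    "web-01": [],
    "db-01": [],
    "db-02": [],
}
BUSINESS_SERVICES = {"fund-transfer", "online-statements"}

def impacted_services(failed_component):
    """Return business services that depend, directly or indirectly, on the component."""
    # Invert the map so we can walk from a component up to whatever uses it.
    dependents = {}
    for node, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    # Breadth-first walk upward from the failed component.
    seen = {failed_component}
    queue = deque([failed_component])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & BUSINESS_SERVICES)

print(impacted_services("db-01"))   # ['fund-transfer']
print(impacted_services("web-01"))  # ['fund-transfer', 'online-statements']
```

With such a map, a database alert and a user complaint about fund transfers can be discussed in the same terms by the NOC, the admins and application support.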
New Methods: Structured Process

• Define a structured process for problem
  investigation and root cause analysis
• Define how collaboration should occur
  during root cause analysis between teams
New Methods: Tools

• Use tools that provide a historical
  dimension for problem investigation
• Use tools that enable the correlation of
  problems to configuration changes
• Use topology based correlation instead of
  rule based (or manual based) correlation
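Topology based correlation and change correlation can be sketched together. This is a simplified illustration under stated assumptions: the topology, alerts and change records are hypothetical, only direct dependencies are checked, and the 24 hour look-back window is an arbitrary choice.

```python
from datetime import datetime, timedelta

# Hypothetical topology: each key depends on the components listed as its value.
DEPENDS_ON = {
    "fund-transfer": ["app-01"],
    "app-01": ["db-01"],
    "db-01": [],
}

def root_cause_candidates(alerting, depends_on):
    """Alerting components whose direct dependencies are all healthy."""
    return sorted(
        component
        for component in alerting
        if not any(dep in alerting for dep in depends_on.get(component, []))
    )

def recent_changes(component, changes, problem_time, window=timedelta(hours=24)):
    """Changes to `component` made within `window` before the problem started."""
    return [
        description
        for changed, when, description in changes
        if changed == component
        and timedelta(0) <= problem_time - when <= window
    ]

# Hypothetical input: three alerts and two recorded configuration changes.
alerting = {"fund-transfer", "app-01", "db-01"}
changes = [
    ("db-01", datetime(2024, 1, 1, 8, 0), "connection pool size lowered"),
    ("app-01", datetime(2023, 12, 1, 8, 0), "version upgrade"),
]
problem_time = datetime(2024, 1, 1, 9, 0)

candidates = root_cause_candidates(alerting, DEPENDS_ON)
print(candidates)  # ['db-01']
print(recent_changes("db-01", changes, problem_time))
```

No per-event rules are written here. The alerts on the service and the application server are treated as symptoms only because the topology says they depend on the alerting database.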
The elusive root cause

  • 1. The Elusive Root Cause Of IT Problems And How To Easily Identify It Noam Biran Director of Product Management
  • 2. Introduction Mr. Biran • Director of Product Management at Neebula • 20 years experience in systems management & BSM • Innovation Product Management at BMC • Co-founder of Appilog (now HP uCMDB & DDMA) About Neebula Neebula provides the first and only automatic service-centric IT management solution allowing IT organizations to improve the service provided to the business by shifting from managing disparate technology silos to managing the services running in the data center. Leveraging unique technology that automatically maps business services to the underlying infrastructure, Neebula enables the IT team to increase availability of the main services they manage and reduce the time to repair of problems.
  • 3. Agenda • Introduction • Root cause analysis defined • The problem resolution process • Problem detection • Root cause analysis methods • Improving root cause analysis processes
  • 4. Root Cause Analysis Definition ITIL V3 An Activity that identifies the Root Cause of an Incident or Problem. Root Cause Analysis typically concentrates on IT Infrastructure failures. Wikipedia Root Cause Analysis is any structured approach to identify the factors that resulted in the harmful consequences of one or more past events
  • 5. The importance of Root Cause Analysis • Root Cause Analysis has a high impact on – IT processes • The efficiency of the overall incident/problem management process • Good RCA discipline requires well established configuration management – Organizational goals • Meeting internal and external SLAs • Financial (budget & revenue) implications • Brand / customer loyalty
  • 7. The Critical Role of Root Cause Analysis • Improper (or lack of) identification of the real root cause may yield: – Repeating problems – Increased downtime – Waste of human resources on “fixing” the wrong issues – Risk to the business
  • 8. The Life of The Operator We expect the operator – To handle 1000’s of cryptic events – Understand impact on 100’s of services – Understand the correlation to customers service complaints – Understand what changed – Orchestrate the resolution And make these decisions within minutes to reduce MTTR Are we giving our operators the tools to succeed?
  • 10. Problem Resolution Process • Events coming in to the NOC • NOC performs some investigation • Root cause analysis is shared between NOC & 2nd/3rd level support (admins) • Low level diagnostics & problem resolution is done by 2nd/3rd level support (admins)
  • 11. Involved Parties & Tools • Tools – Monitoring tools – Configuration management tools • People – Users – NOC – Admins – specialized teams focused on specific area, e.g. system, database, network – Application support / developers
  • 12. The Common Process – Blame Game • No structured process • Lack of overall cross-domain view • Each team has its own terminology and view • Each team is working on its own
  • 14. Potential Problem Symptoms • Lack of certain functionality – A certain transaction does not work • Performance degradation – Fund transfer response time is above 2 sec. • Availability issue – Application doesn’t work • None – Unnoticeable failure due to high availability configuration
  • 15. Problem Detection • Good problem detection methods are key for a structured root cause analysis process • Problem detection tools should provide sufficient data to the root cause analysis process • There are various distinct methods each with its pros and cons • There is no single superior detection method
  • 16. Detection – Users • What it does – Compensates for unknown / unreported problems • What it doesn’t – Supposedly accurate – actually might point in the wrong direction – Usually takes place too late for a quick fix, after the business is already impacted
  • 17. Detection – Infrastructure Monitoring • What it does – Monitor each technical element comprising the service – Great way to identify specific availability failures • What it doesn’t – Hard to correlate with real user experience – Too many false positives – Lots of events on symptoms rather than on the actual problem
  • 18. Detection – End User Experience • What it does – Measure overall response time of user transactions – Synthetic or real user transactions – The ultimate problem detection method • What it doesn’t – No real breakdown to assist in pinpointing the problem or even the domain
  • 19. Detection – Transaction Breakdown • What it does – Discovery of each transaction’s path within the data center – Highlight potential performance problems within the transaction execution • What it doesn’t – No correlation to infrastructure monitoring – Cannot cover the entire data center – domain specific
  • 20. Detection – Domain Specific Tools • What it does – Drill down in a specific application – Great analysis & diagnostics within an application • What it doesn’t – No data center wide view – Lack of insight into the connections between applications
  • 23. Potential Root Cause Types • Configuration change • Version upgrade • Hardware fault • Software bug • Capacity problem • Resource collision
  • 24. Common Ways for Root Cause Analysis • War room scenario • The log file approach • APM tools • Transaction management • Manual event correlation / analysis
  • 25. War Room Scenario • Getting everyone in the same room • Each has its own data and terminology • Blame game • Takes a lot of time
  • 26. The Log File Approach • An admin sits and analyzes log files and other historical data from various sources • A domain specific approach • Certain degree of structured process • Might identify problems that are not the root cause (distractions)
  • 27. APM Tools • An admin sits and analyzes log files and other historical data from various sources • A domain specific approach • Certain degree of structured process • Might identify problems that are not the root cause (distractions)
  • 28. Transaction Management • A great tool to point to the probable area where the root cause resides • Limited to specific domains • Inability to correlate with infrastructure metrics / failures
  • 29. Manual Event Correlation / Analysis • Requires cross-domain expertise • Requires understanding of dependencies between components • Time consuming • Lack of insight into other non-event data
  • 30. Improving Root Cause Analysis Processes
  • 31. Making The Best Of Existing Tools • Choose problem detection methods that assist in the root cause analysis process • Turn the root cause analysis into a structured process – Internal team processes – Inter-team processes • Common language & visibility between teams
  • 32. New Methods: Mapping • Mapping of Business service & applications and the supporting infrastructure • Ties symptoms (user) to problems (technology) • Introduces a common language between teams • Enables a high level cross-domain view
  • 33. New Methods: Structured Process • Define a structured process for problem investigation and root cause analysis • Define how collaboration should occur during root cause analysis between teams
  • 34. New Methods: Tools • Use tools that provide a historical dimension for problem investigation • Use tools that enable the correlation of problems to configuration changes • Use topology based correlation instead of rule based (or manual based) correlation
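
The last two tool recommendations (correlating problems to configuration changes, and topology based correlation) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular product: the service map `DEPENDS_ON`, the component names, the alert set, and the change records are all hypothetical, and the heuristic used (an alerting component with no alerting dependency of its own is a root cause candidate) is one simple assumption about how topology can replace hand-written correlation rules.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical service map: each component lists what it depends on.
DEPENDS_ON = {
    "fund-transfer-service": ["web-01", "app-01"],
    "web-01": ["app-01"],
    "app-01": ["db-01"],
    "db-01": ["san-01"],
    "san-01": [],
}


def dependencies_of(component, depends_on):
    """Return every component reachable from `component` via dependency edges."""
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def root_cause_candidates(service, alerting, depends_on):
    """Alerting components under `service` with no alerting dependency of their own."""
    in_scope = (dependencies_of(service, depends_on) | {service}) & alerting
    return sorted(
        c for c in in_scope
        if not (dependencies_of(c, depends_on) & alerting)
    )


def recent_changes(candidates, changes, detected_at, window):
    """Configuration changes on candidate components within `window` before detection."""
    return [
        (comp, when, what)
        for comp, when, what in changes
        if comp in candidates and detected_at - window <= when <= detected_at
    ]


# Hypothetical events: the service, the app server and the database all alert.
alerting = {"fund-transfer-service", "app-01", "db-01"}
candidates = root_cause_candidates("fund-transfer-service", alerting, DEPENDS_ON)

# Hypothetical change log: (component, timestamp, description).
detected_at = datetime(2024, 1, 1, 12, 0)
changes = [
    ("db-01", datetime(2024, 1, 1, 11, 30), "connection pool size changed"),
    ("web-01", datetime(2024, 1, 1, 11, 45), "TLS certificate renewed"),
    ("db-01", datetime(2023, 12, 25, 9, 0), "patch applied"),
]
recent = recent_changes(candidates, changes, detected_at, timedelta(hours=2))

print(candidates)  # components to investigate first
print(recent)      # recent changes on those components
```

In this example the service and app server alerts are treated as symptoms because something they depend on is also alerting, which leaves the database as the single candidate, and the change log then points at the one change made to it shortly before detection. A real data center would need a discovered, continuously updated map rather than a hand-written dictionary.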

Editor's Notes

  1. Introduction to the subject. Webinar logistics: presentation first, send questions during, answer questions at the end.
  2. RCA is problematic even to define. ITIL definition -> useless. ITIL failed. Wikipedia: structured, factors, consequences, past events (I’ll call them symptoms).
  3. Talk about each bullet
  4. Many data sources (event feeds). All are mixed and funneled into the NOC. NOC needs to filter and make order in them based on: relevance; source / derived. But the NOC doesn’t have the tools or processes to do this. No structured way to do this filtering (though the NOC is used to structured processes like run book).
  5. Taking care of the symptoms and not the problems. Associating wrong events -> figuring out the incorrect root cause.
  6. NOC is used to structured processes (like run book). We don’t give them tools. We don’t give them structured processes (or any processes). They usually don’t possess cross-domain knowledge.
  7. Isolation – diagnostics. NOC’s investigation may yield forwarding to the wrong team and therefore wrong analysis done in the wrong context.
  8. Explain each. How do they all tie together? Usually they don’t.
  9. Problem detection begins with the symptoms. Same symptoms may be caused by different problems.
  10. We need a combination of tools. Choose the right mix to assist in the RCA process. Need synergy between the methods.
  11. Cross domain. Cross discipline. Require deep understanding.
  12. Not a structured approach