Root Cause Analysis
Fact and Fiction
Dustin Collins @dustinmm80
CyberArk Conjur
Origins of Root Cause Analysis
- First implemented in 1958 in Toyota manufacturing plants (5 Whys)
- Has since been adopted and tailored to many other industries
Why do RCA?
To better understand the underlying causes of problems so we can address them and prevent them from happening again.
When to do RCA?
- When issues happen more than once
- When an outage affects many users
- When a system is not functioning as designed
5 Whys
- Start with a problem statement
- Ask why
- Repeat until root cause is found
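The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed tool: `five_whys` and the `answers` table are hypothetical, with answers loosely modeled on the disk-full example later in the deck. The sketch follows exactly one chain of answers, which is the behavior the following slides criticize.

```python
from typing import Callable, List, Optional


def five_whys(
    problem: str,
    ask_why: Callable[[str], Optional[str]],
    max_depth: int = 5,
) -> List[str]:
    """Follow a single chain of "why?" answers from a problem statement.

    Returns [problem, cause_1, ..., cause_n]; the last entry is the candidate
    root cause. Stops when no further answer exists or after max_depth whys.
    """
    chain = [problem]
    for _ in range(max_depth):
        cause = ask_why(chain[-1])  # ask "why?" about the most recent answer
        if cause is None:           # nobody can answer further: stop here
            break
        chain.append(cause)
    return chain


# Hypothetical answers a team might give during a review.
answers = {
    "Internal webservice is not responding": "The host's disk is full",
    "The host's disk is full": "Old Docker images were never pruned",
}

chain = five_whys("Internal webservice is not responding", answers.get)
for depth, step in enumerate(chain):
    print("  " * depth + step)
```

Note that the result depends entirely on the `ask_why` answers supplied, which is one reason different people can reach different "root causes".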
Issues with 5 Whys
An incorrect or leading problem statement can point to the wrong issue.
Not very useful in complex situations, where you can’t answer “why?” in the moment.
Issues with 5 Whys
It’s not repeatable:
- Different people may reach different results
- The same people may reach different results at a different time
Issues with 5 Whys
Linear thinking leads teams to drive towards one root cause.
There is usually more than one root cause.
Human error is not a valid root cause.
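One way to avoid driving toward a single answer is to record causes as a graph rather than a chain, then collect every cause that has no further explanation. The sketch below assumes a hypothetical `causes_of` mapping (effect to contributing causes) built during the review; the function name and example entries are illustrative, loosely based on the CI pipeline example later in the deck.

```python
from typing import Dict, List


def root_causes(problem: str, causes_of: Dict[str, List[str]]) -> List[str]:
    """Return every cause reachable from `problem` that has no recorded cause itself.

    Unlike a single 5 Whys chain, this follows all branches, so several
    root causes can be reported. A `seen` set avoids revisiting shared causes.
    """
    roots, seen, stack = [], set(), [problem]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = causes_of.get(node, [])
        if not parents and node != problem:
            roots.append(node)  # nothing explains this cause further
        stack.extend(parents)
    return sorted(roots)


# Hypothetical cause graph: effect -> contributing causes.
causes = {
    "CI pipeline failures": ["Node resource contention", "Test timeouts"],
    "Test timeouts": ["Node resource contention"],
    "Node resource contention": [
        "Pipelines scheduled on the same node",
        "Peak daytime build volume",
    ],
}

print(root_causes("CI pipeline failures", causes))
# ['Peak daytime build volume', 'Pipelines scheduled on the same node']
```

Here two independent contributing causes surface, where a linear 5 Whys session would likely have stopped at whichever branch it happened to follow.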
Biases
Hindsight bias
- “Should have known better”.
Outcome bias
- “Since the outcome was bad, the plan was bad”.
What changed?
Cynefin
A conceptual framework created by Dave Snowden at IBM to organize intellectual capital.
It uses quadrants to organize problems by complexity (with Disorder at the center) and suggests a course of action for each.
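The course of action the deck gives for each domain can be captured as a simple lookup table. This is a sketch only: the `Domain` enum and `course_of_action` helper are hypothetical names, and the step sequences are copied from the quadrant slides in this deck (Reduce, Analyze, Iterate for Disorder is this deck's framing).

```python
from enum import Enum


class Domain(Enum):
    OBVIOUS = "Simple/Obvious"
    COMPLICATED = "Complicated"
    COMPLEX = "Complex"
    CHAOTIC = "Chaotic"
    DISORDER = "Disorder"


# Step sequences as presented in this deck, one per domain.
STEPS = {
    Domain.OBVIOUS: ("Sense", "Categorize", "Respond"),
    Domain.COMPLICATED: ("Sense", "Analyze", "Respond"),
    Domain.COMPLEX: ("Probe", "Sense", "Respond"),
    Domain.CHAOTIC: ("Act", "Sense", "Respond"),
    Domain.DISORDER: ("Reduce", "Analyze", "Iterate"),
}


def course_of_action(domain: Domain) -> str:
    """Render the suggested order of steps for a domain."""
    return " -> ".join(STEPS[domain])


for domain in Domain:
    print(f"{domain.value:15} {course_of_action(domain)}")
```

The hard part, deciding which domain a problem is in, stays a human judgment; the table only records what to do once that call is made.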
Cynefin
[Diagram: each quadrant described by its steps to take, causality, and knowledge]
Cynefin
[Diagram: the Cynefin quadrants, with labels “Entropy” and “Progress”]
Simple/Obvious
Sense → Categorize → Respond
Problem: Internal webservice is not responding
- Sense: “Disk almost full” alert
- Categorize: Disk full
- Respond: Prune Docker images
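In the obvious domain the response is already known, so the sense-categorize-respond steps can be automated. A minimal sketch, assuming hypothetical thresholds and a hypothetical runbook mapping; `shutil.disk_usage` is the standard-library call, and the code only returns the Docker command as text rather than running it.

```python
import shutil
from typing import Optional

# Hypothetical thresholds; tune them for your own hosts.
ALMOST_FULL = 0.90
FULL = 0.98

# Hypothetical runbook: each known category maps to a known response.
RESPONSES = {
    "ok": None,
    "almost_full": "docker image prune",  # removes dangling images
    "full": "docker image prune --all",   # removes all images not used by a container
}


def categorize(used: int, total: int) -> str:
    """Categorize: place the sensed disk usage into a known bucket."""
    ratio = used / total
    if ratio >= FULL:
        return "full"
    if ratio >= ALMOST_FULL:
        return "almost_full"
    return "ok"


def respond(path: str = "/") -> Optional[str]:
    """Sense disk usage at `path`, categorize it, and return the known fix (if any)."""
    usage = shutil.disk_usage(path)  # sense
    return RESPONSES[categorize(usage.used, usage.total)]


print(respond("/"))
```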
Complicated
Sense → Analyze → Respond
Problem: Increasing CI pipeline failures, happening only during the day
- Sense: Failing pipelines ran at the same time as other builds
- Analyze: Concurrent pipeline collisions cause node resource contention
- Respond: Update pipelines to splay across nodes
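“Splaying” pipelines could take several forms depending on the CI system. The sketch below shows one hypothetical approach: assign pipelines round-robin across nodes and stagger their start times over a window. The `splay` function, node names, and the 300-second window are assumptions for illustration, not a feature of any particular CI tool.

```python
from itertools import cycle
from typing import List, Tuple


def splay(
    pipelines: List[str], nodes: List[str], window_s: int = 300
) -> List[Tuple[str, str, int]]:
    """Spread pipelines round-robin across nodes and stagger their start times.

    Returns (pipeline, node, start_offset_seconds) tuples. With two or more
    nodes, consecutive pipelines land on different nodes, and starts are spaced
    evenly over `window_s` seconds instead of all firing at once.
    """
    if not nodes:
        raise ValueError("need at least one node")
    node_cycle = cycle(nodes)
    step = window_s // max(len(pipelines), 1)
    return [(name, next(node_cycle), i * step) for i, name in enumerate(pipelines)]


plan = splay(["build-a", "build-b", "build-c", "build-d", "build-e"], ["node-1", "node-2"])
for pipeline, node, offset in plan:
    print(f"{pipeline} on {node} starting at +{offset}s")
```

Round-robin ignores how heavy each pipeline is; if pipeline sizes vary a lot, assigning each to the least-loaded node may spread resource use more evenly.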
Complex
Probe → Sense → Respond
Problem: Discovery service became overloaded, triggering cascading failures in other services
- Probe: Investigate logs; inspect exponential backoff code; run load tests
- Sense: View existing alerts; analyze performance from the load test
- Respond: Add backpressure throttling to the discovery service
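The two mechanisms named on this slide can be sketched side by side. This is an illustrative sketch, not the deck's actual fix: `backoff_delays` uses capped exponential backoff with “full jitter” on the client side, and the hypothetical `BackpressureGate` caps in-flight requests on the server side and sheds excess load with an assumed HTTP-style 503 status instead of queueing it.

```python
import random
import threading
from typing import Callable, List


def backoff_delays(
    attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Client side: capped exponential backoff with "full jitter".

    Retry n waits a random time in [0, min(cap, base * 2**n)), so clients that
    failed together do not all retry at the same instant.
    """
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


class BackpressureGate:
    """Server side: admit at most `max_in_flight` concurrent requests, shed the rest."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work: Callable[[], int]) -> int:
        # Non-blocking acquire: if every slot is busy, reject immediately
        # instead of queueing more work on an already overloaded service.
        if not self._slots.acquire(blocking=False):
            return 503  # tell the caller to back off and retry later
        try:
            return work()
        finally:
            self._slots.release()


gate = BackpressureGate(max_in_flight=1)
# A second request arriving while the only slot is busy gets shed.
print(gate.handle(lambda: gate.handle(lambda: 200)))  # 503
print([round(d, 3) for d in backoff_delays()])
```

The two halves work together: rejecting early keeps the overloaded service responsive, and jittered backoff keeps rejected clients from retrying in a synchronized wave.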
Chaotic
Act → Sense → Respond
Problem: Unable to connect to instances in AWS us-east-1a. No AWS service warnings.
- Act: Move instances to another AZ
- Sense: Still unable to connect; cannot connect in the new AZ either
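The act-first loop of the chaotic domain can be sketched as “apply a mitigation, then check whether it helped”. Everything here is hypothetical: `act_sense_respond`, the health check, and the second mitigation (“Fail over to another region”) are illustrative placeholders, and real actions would call your own infrastructure tooling.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, Callable[[], None]]


def act_sense_respond(
    actions: List[Action], is_healthy: Callable[[], bool]
) -> List[Tuple[str, bool]]:
    """Try mitigations in order, sensing after each one.

    In a chaotic situation you act first to stabilize, then check whether
    the action helped before choosing what to do next.
    """
    log = []
    for name, act in actions:
        act()                   # act: apply the mitigation
        healthy = is_healthy()  # sense: did it help?
        log.append((name, healthy))
        if healthy:             # respond: stop once service is restored
            break
    return log


# Hypothetical mitigations (no-ops here) and a health check that keeps failing.
mitigations = [
    ("Move instances to another AZ", lambda: None),
    ("Fail over to another region", lambda: None),
]
results = act_sense_respond(mitigations, is_healthy=lambda: False)
for name, healthy in results:
    print(f"{name}: {'recovered' if healthy else 'still unable to connect'}")
```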
Disorder
Reduce → Analyze → Iterate
Disorder stems from a lack of agreement on the problem.
- Reduce: What do we know for sure?
- Analyze: What do we agree on?
- Iterate: Move to a quadrant, then continue
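One hypothetical way to make “What do we agree on?” concrete is to collect each responder's observations and intersect them: facts everyone reports are common ground, and the rest are candidates to verify before moving to a quadrant. The `common_ground` function and the example notes are illustrative assumptions.

```python
from typing import Dict, Set, Tuple


def common_ground(observations: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split responders' observations into agreed facts and disputed ones.

    Agreed: stated by everyone. Disputed: stated by at least one person but not all.
    """
    if not observations:
        return set(), set()
    sets = list(observations.values())
    agreed = set.intersection(*sets)
    disputed = set.union(*sets) - agreed
    return agreed, disputed


# Hypothetical notes from two responders during an incident.
notes = {
    "alice": {"Cannot connect to instances in us-east-1a", "No AWS service warnings"},
    "bob": {"Cannot connect to instances in us-east-1a", "A deploy went out this morning"},
}
agreed, disputed = common_ground(notes)
print("Agreed:", sorted(agreed))
print("To verify:", sorted(disputed))
```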
Takeaways
- Be aware of the limitations of the RCA techniques you use
- Emergent behavior arises from complexity and increased rate of change
- Consider trying Cynefin to help you approach complex problems
Thank you!
Dustin Collins @dustinmm80
