SlideShare a Scribd company logo
1 of 56
Don’t Just Fix It,
Learn From it
The Importance of Incident Management When Something Breaks
Something’s broken, something’s broken
It’s your fault, It’s your fault
Are you gonna fix it? Are you gonna fix it?
Right away, Right away
― Ethan (5 years old)
Maybe 10 minutes
Maybe 1 day
Maybe a year
Let's make the process better
Is the smoke actually
a fire
Houston, we have a
problem
Document, document,
document
Keep those who need to
know informed
Stop trying to be the
hero
Let’s learn from this
I’m Rick Clymer
QR Lead @ RocketReach
Sev 1, High Priority, Drop Everything and Fix Survivor
Is the smoke
actually a fire?
Let’s figure out if it’s smoke or fire
Does an issue
actually exist?
Does An Issue Actually Exist
Are customers (internal or external) impacted?
Does the alert actually mean something?
If not resolved quickly, will revenue be lost due to the impact?
Is someone screaming fire?
Do you have a gut feel that something needs investigated?
Let’s figure out where to go next
Do the conditions of
the issue meet any
criteria used to
determine severity?
Does an issue
actually exist?
Example Severity Mapping
Customer facing issue
with no workaround.
Business is severely
impacted
Experiencing or
potential for data loss
Customer facing issue
with a workaround.
Business sees an
impact
Core functionality
broken
Customer facing issue
that is more of an
annoyance
Performance degradation
Secondary functionality
broken
1 2 3
Severity vs Priority
What is the impact of the issue on my
software?
Typically higher the severity, faster the
remediation time but can vary
Owner - Subject manager expert able to
determine technical impact, business
folks able to determine business impact
Severity
Severity vs Priority
How fast does the issue need resolved?
If we handle resolving this issue, what is
going to get pushed to the side?
A high priority fix can have a low severity
Owner - Typically leadership, product
management
What is the impact of the issue on my
software?
Typically higher the severity, faster the
remediation time but can vary
Owner - Subject manager expert able to
determine technical impact, business
folks able to determine business impact
Severity Priority
Let’s figure out where to go next
Do the conditions of
the issue meet any
criteria used to
determine severity?
Does an issue
actually exist?
Is the condition
expected and will
self resolve?
Houston,
we have a
problem
Initiate Fire Fighting mode
Have a communication process in place to inform others an incident exists
Use your alerting system to initiate a new incident
Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc..
Setup a central meeting point that anyone can join
Initiate Fire Fighting mode
Have a communication process in place to inform others an incident exists
Did something get deployed to production and cause the issue?
Did someone or something start using the system in a way that hasn’t happened previously?
Do the monitors show any infrastructure related issues?
Is there anything in the error logs indicating a new system issue that hasn't been seen before?
What changed between 10 minutes ago and now?
Use your alerting system to initiate a new incident
Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc..
Setup a central meeting point that anyone can join
Initiate Fire Fighting mode
Have a communication process in place to inform others an incident exists
Did something get deployed to production and cause the issue?
Did someone or something start using the system in a way that hasn’t happened previously?
Do the monitors show any infrastructure related issues?
Is there anything in the error logs indicating a new system issue that hasn't been seen before?
What changed between 10 minutes ago and now?
As you have ideas of what’s going on, make sure to get them out of your head and somewhere others can access
Make it easy for others to join in the fire fighting
Use your alerting system to initiate a new incident
Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc..
Setup a central meeting point that anyone can join
Trust your hunches… Hunches are
usually based on facts filed away just
below the conscious level.
― Joyce Brothers, American psychologist
Don’t just go on
your intuition
Second set of eyes is the most valuable tool
in a moment of incident
Time of day, stress levels, prior activities, etc
all play a role in your decisions
Stop trying to
be the hero
No individual can win a
game by himself
― Pele, international soccer star
Stop the bleeding, keep your company’s reputation with your
customers, get things back to normal
Resolve your issue
Fool me once, shame on you. Fool me twice, shame on me.
Don’t let this bite you
again
Innovation comes only from readily and seamlessly sharing
information rather than hoarding it - Tom Peters
Stop creating silos
Incident Management Team
Commander Tech Lead Communicator
●Gets the appropriate
people involved
●Holds all positions until
others are delegated
●Keeps incident docs
up to date
●Handles the incident
resolution process
●Delegates to those
who are working on
resolving
●Controls what is
changed
●Face of the incident
response
●Issues periodic
updates to
stakeholders
●Keeps incident docs
up to date
Additional Roles You Might Consider
Customer Support Lead
Subject Matter Experts
External Communication Lead
Dedicated Scribe
Root Cause Analyst
Be prepared to
have backups
for any role
Document
Document
Document
Key pieces of data to keep track of
Any monitors, alerts,
logs that are related to
the incident
Who is assigned a role
at what point
What has been
attempted to remedy
the system
Support ticket counts
related to the specific
incident
Any ideas of what the
root cause could be
What the actual
problem that is trying to
be remedied
Key time related metrics
What time was the first alert, notification, smoking server, etc… signaled?
What time did someone acknowledge the issue and begin working?
What time were system’s back to a usable state?
Any time a potential fix was pushed to a production environment?
What time was the system back to a fully restored state?
How to keep documentation
Keep those
who need to
know informed
Teamwork begins by building trust.
And the only way to do that is to
overcome our need for invulnerability.
― Patrick Lencioni, author
Open internal line of communication
Know your audience
Over-updating is better than keeping everyone in the dark
Typically try to follow a 15-30 minute or sooner if new information becomes available timeline
Provide timely updates
Incident response started, complete or partial restoration, new communicator taking duties over
Inform on key milestones
Keep the message appropriate for who you are delivering it to
Avoid providing exact times for system to be restored
Try to keep all messaging from the same person
Keep your
customers in the
know
Provide your customers with the tools when
an incident arises
Internal and external communication are
two very different things
Lesson learned: Let someone in customer
support provide external messaging
Let’s learn
from this
Most important part of
incident response:
Most important part of
incident response:
Post Mortems
What a post mortem isn’t
Somewhere to cast blame on a person, team, or some other specific resource
Post mortems should be available for any internal stakeholder to review
This can be a trust rebuilding process with the rest of the business as issues arise to show how you’ve taken action
to ensure this doesn’t happen again
A journal entry intended to be tucked under your pillow at night
Post mortems should follow the same template every time an incident occurs
Ensure key information is captured in a repeatable fashion
Scattered thoughts in a document
Post mortems are intended to be learning experiences on issues in our systems and processes, not to point fingers
Blameless post mortems allow the team to work closer to making sure this incident won’t happen again
Goals for your post mortems
Ensure the incident
is documented
Goals for your post mortems
Root cause is
understood for how
the incident occured
Ensure the incident
is documented
Goals for your post mortems
Root cause is
understood for how
the incident occured
Ensure the incident
is documented
What steps have
been identified and
either put into place
or planned to be put
into place to reduce
the change of the
incident occurring
again
Post Mortem Template
Incident Title
Date (Note time zones and keep consistent)
Author(s)
Status
Summary (280 characters max)
Impact (Technical and business)
Root Cause(s)
Trigger
Resolution
How was incident detected
Action Items (Item, owner, link to ticket system)
Lessons Learned:
● What went well
● What went wrong
● Where we got lucky
Timeline (Note time zone, use military time)
Key Incident metrics:
● Start/Stop Time
● Time to detect, acknowledge, recovery
Supporting information
Post Mortem Example
Post mortem tips
Keep all post mortems in a repository
Our minds are mush, try to complete the document and distribute for a review to initial team as soon as possible
Distribute to the team for review within 24 hours of the initial incident
Not for everyone, but once the incident is over, your post mortem document is ready for review
Use the post mortem template as your incident note template
Use a markdown template and create an incident mgmt GitHub repo
Create a directory in Google Docs or Office 365 to store the documents
These are intended to be learning lessons to prevent as well as trust builders for key stakeholders
Make available to key stakeholders within 72 hours of an incident
No matter what happens, or how bad it
seems today, life does go on, and it
will be better tomorrow.
― Maya Angelou
Thank you!
clymerrm
https://bit.ly/IncidentMgmtQAOTHW2023
Additional Resources
Google SRE Books
Atlassian Incident Management
Something’s Broken PagerDuty Alert Sound
PagerDuty Incident Management
Presentation - bit.ly/IncidentMgmtQAOTHW2023

More Related Content

Similar to Rick Clymer - Incident Management.pdf

Retrospectives - Agile Ottawa Meetup
Retrospectives - Agile Ottawa MeetupRetrospectives - Agile Ottawa Meetup
Retrospectives - Agile Ottawa MeetupDag Rowe
 
Nerd herding ntc11nerd - Howe
Nerd herding ntc11nerd - HoweNerd herding ntc11nerd - Howe
Nerd herding ntc11nerd - HoweGrant M Howe
 
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012TEST Huddle
 
React Faster and Better: New Approaches for Advanced Incident Response
React Faster and Better: New Approaches for Advanced Incident ResponseReact Faster and Better: New Approaches for Advanced Incident Response
React Faster and Better: New Approaches for Advanced Incident ResponseSilvioPappalardo
 
Blameless postmoretem
Blameless postmoretemBlameless postmoretem
Blameless postmoretemCoTransition
 
ITIL Incident Management Workflow PowerPoint Presentation Slides
ITIL Incident Management Workflow PowerPoint Presentation SlidesITIL Incident Management Workflow PowerPoint Presentation Slides
ITIL Incident Management Workflow PowerPoint Presentation SlidesSlideTeam
 
Log Forensics from CEIC 2007
Log Forensics from CEIC 2007Log Forensics from CEIC 2007
Log Forensics from CEIC 2007Anton Chuvakin
 
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.pptabhichowdary16
 
Using Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A JobUsing Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A JobGary Clement
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialWill Gallego
 
Enterprise security management II
Enterprise security management   IIEnterprise security management   II
Enterprise security management IIzapp0
 
Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Mike Cardus
 
Brian HughesTheimportanceof evidencecollectionan.docx
Brian HughesTheimportanceof evidencecollectionan.docxBrian HughesTheimportanceof evidencecollectionan.docx
Brian HughesTheimportanceof evidencecollectionan.docxjasoninnes20
 
Physical Security Assessments
Physical Security AssessmentsPhysical Security Assessments
Physical Security AssessmentsTom Eston
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call Raygun
 
Incident response methodology
Incident response methodologyIncident response methodology
Incident response methodologyPiyush Jain
 
Get things done : pragmatic project management
Get things done : pragmatic project managementGet things done : pragmatic project management
Get things done : pragmatic project managementStan Carrico
 
Automatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsAutomatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsJan Henry Nystrom
 
Approaches to unraveling a complex test problem
Approaches to unraveling a complex test problemApproaches to unraveling a complex test problem
Approaches to unraveling a complex test problemJohan Hoberg
 

Similar to Rick Clymer - Incident Management.pdf (20)

Retrospectives - Agile Ottawa Meetup
Retrospectives - Agile Ottawa MeetupRetrospectives - Agile Ottawa Meetup
Retrospectives - Agile Ottawa Meetup
 
Nerd herding ntc11nerd - Howe
Nerd herding ntc11nerd - HoweNerd herding ntc11nerd - Howe
Nerd herding ntc11nerd - Howe
 
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012
Rekard Edgren - Curing Our Binary Disease - EuroSTAR 2012
 
React Faster and Better: New Approaches for Advanced Incident Response
React Faster and Better: New Approaches for Advanced Incident ResponseReact Faster and Better: New Approaches for Advanced Incident Response
React Faster and Better: New Approaches for Advanced Incident Response
 
Blameless postmoretem
Blameless postmoretemBlameless postmoretem
Blameless postmoretem
 
ITIL Incident Management Workflow PowerPoint Presentation Slides
ITIL Incident Management Workflow PowerPoint Presentation SlidesITIL Incident Management Workflow PowerPoint Presentation Slides
ITIL Incident Management Workflow PowerPoint Presentation Slides
 
Log Forensics from CEIC 2007
Log Forensics from CEIC 2007Log Forensics from CEIC 2007
Log Forensics from CEIC 2007
 
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt
11-Incident Response, Risk Management Sample Question and Answer-24-06-2023.ppt
 
Using Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A JobUsing Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A Job
 
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose TutorialArchitecting a Post Mortem - Velocity 2018 San Jose Tutorial
Architecting a Post Mortem - Velocity 2018 San Jose Tutorial
 
Enterprise security management II
Enterprise security management   IIEnterprise security management   II
Enterprise security management II
 
Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify
 
Brian HughesTheimportanceof evidencecollectionan.docx
Brian HughesTheimportanceof evidencecollectionan.docxBrian HughesTheimportanceof evidencecollectionan.docx
Brian HughesTheimportanceof evidencecollectionan.docx
 
Physical Security Assessments
Physical Security AssessmentsPhysical Security Assessments
Physical Security Assessments
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call
 
Incident response methodology
Incident response methodologyIncident response methodology
Incident response methodology
 
Get things done : pragmatic project management
Get things done : pragmatic project managementGet things done : pragmatic project management
Get things done : pragmatic project management
 
Automatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang ApplicationsAutomatic Assessment of Failure Recovery in Erlang Applications
Automatic Assessment of Failure Recovery in Erlang Applications
 
Root Cause Analysis Services | Securium Solutions
Root Cause Analysis Services | Securium SolutionsRoot Cause Analysis Services | Securium Solutions
Root Cause Analysis Services | Securium Solutions
 
Approaches to unraveling a complex test problem
Approaches to unraveling a complex test problemApproaches to unraveling a complex test problem
Approaches to unraveling a complex test problem
 

More from QA or the Highway

KrishnaToolComparisionPPT.pdf
KrishnaToolComparisionPPT.pdfKrishnaToolComparisionPPT.pdf
KrishnaToolComparisionPPT.pdfQA or the Highway
 
Ravi Lakkavalli - World Quality Report.pptx
Ravi Lakkavalli - World Quality Report.pptxRavi Lakkavalli - World Quality Report.pptx
Ravi Lakkavalli - World Quality Report.pptxQA or the Highway
 
Caleb Crandall - Testing Between the Buckets.pptx
Caleb Crandall - Testing Between the Buckets.pptxCaleb Crandall - Testing Between the Buckets.pptx
Caleb Crandall - Testing Between the Buckets.pptxQA or the Highway
 
Thomas Haver - Mobile Testing.pdf
Thomas Haver - Mobile Testing.pdfThomas Haver - Mobile Testing.pdf
Thomas Haver - Mobile Testing.pdfQA or the Highway
 
Thomas Haver - Example Mapping.pdf
Thomas Haver - Example Mapping.pdfThomas Haver - Example Mapping.pdf
Thomas Haver - Example Mapping.pdfQA or the Highway
 
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdf
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdfJoe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdf
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdfQA or the Highway
 
Sarah Geisinger - Continious Testing Metrics That Matter.pdf
Sarah Geisinger - Continious Testing Metrics That Matter.pdfSarah Geisinger - Continious Testing Metrics That Matter.pdf
Sarah Geisinger - Continious Testing Metrics That Matter.pdfQA or the Highway
 
Jeff Sing - Quarterly Service Delivery Reviews.pdf
Jeff Sing - Quarterly Service Delivery Reviews.pdfJeff Sing - Quarterly Service Delivery Reviews.pdf
Jeff Sing - Quarterly Service Delivery Reviews.pdfQA or the Highway
 
Leandro Melendez - Chihuahua Load Tests.pdf
Leandro Melendez - Chihuahua Load Tests.pdfLeandro Melendez - Chihuahua Load Tests.pdf
Leandro Melendez - Chihuahua Load Tests.pdfQA or the Highway
 
Robert Fornal - ChatGPT as a Testing Tool.pptx
Robert Fornal - ChatGPT as a Testing Tool.pptxRobert Fornal - ChatGPT as a Testing Tool.pptx
Robert Fornal - ChatGPT as a Testing Tool.pptxQA or the Highway
 
Federico Toledo - Extra-functional testing.pdf
Federico Toledo - Extra-functional testing.pdfFederico Toledo - Extra-functional testing.pdf
Federico Toledo - Extra-functional testing.pdfQA or the Highway
 
Andrew Knight - Managing the Test Data Nightmare.pptx
Andrew Knight - Managing the Test Data Nightmare.pptxAndrew Knight - Managing the Test Data Nightmare.pptx
Andrew Knight - Managing the Test Data Nightmare.pptxQA or the Highway
 
Melissa Tondi - Automation We_re Doing it Wrong.pdf
Melissa Tondi - Automation We_re Doing it Wrong.pdfMelissa Tondi - Automation We_re Doing it Wrong.pdf
Melissa Tondi - Automation We_re Doing it Wrong.pdfQA or the Highway
 
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdf
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdfJeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdf
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdfQA or the Highway
 
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptx
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptxDesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptx
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptxQA or the Highway
 
Damian Synadinos - Word Smatter.pdf
Damian Synadinos - Word Smatter.pdfDamian Synadinos - Word Smatter.pdf
Damian Synadinos - Word Smatter.pdfQA or the Highway
 
Lee Barnes - What Successful Test Automation is.pdf
Lee Barnes - What Successful Test Automation is.pdfLee Barnes - What Successful Test Automation is.pdf
Lee Barnes - What Successful Test Automation is.pdfQA or the Highway
 
Jordan Powell - API Testing with Cypress.pptx
Jordan Powell - API Testing with Cypress.pptxJordan Powell - API Testing with Cypress.pptx
Jordan Powell - API Testing with Cypress.pptxQA or the Highway
 
Carlos Kidman - Exploring AI Applications in Testing.pptx
Carlos Kidman - Exploring AI Applications in Testing.pptxCarlos Kidman - Exploring AI Applications in Testing.pptx
Carlos Kidman - Exploring AI Applications in Testing.pptxQA or the Highway
 
Ben Oconis - Breaking Down Silos.pdf
Ben Oconis - Breaking Down Silos.pdfBen Oconis - Breaking Down Silos.pdf
Ben Oconis - Breaking Down Silos.pdfQA or the Highway
 

More from QA or the Highway (20)

KrishnaToolComparisionPPT.pdf
KrishnaToolComparisionPPT.pdfKrishnaToolComparisionPPT.pdf
KrishnaToolComparisionPPT.pdf
 
Ravi Lakkavalli - World Quality Report.pptx
Ravi Lakkavalli - World Quality Report.pptxRavi Lakkavalli - World Quality Report.pptx
Ravi Lakkavalli - World Quality Report.pptx
 
Caleb Crandall - Testing Between the Buckets.pptx
Caleb Crandall - Testing Between the Buckets.pptxCaleb Crandall - Testing Between the Buckets.pptx
Caleb Crandall - Testing Between the Buckets.pptx
 
Thomas Haver - Mobile Testing.pdf
Thomas Haver - Mobile Testing.pdfThomas Haver - Mobile Testing.pdf
Thomas Haver - Mobile Testing.pdf
 
Thomas Haver - Example Mapping.pdf
Thomas Haver - Example Mapping.pdfThomas Haver - Example Mapping.pdf
Thomas Haver - Example Mapping.pdf
 
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdf
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdfJoe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdf
Joe Colantonio - Actionable Automation Awesomeness in Testing Farm.pdf
 
Sarah Geisinger - Continious Testing Metrics That Matter.pdf
Sarah Geisinger - Continious Testing Metrics That Matter.pdfSarah Geisinger - Continious Testing Metrics That Matter.pdf
Sarah Geisinger - Continious Testing Metrics That Matter.pdf
 
Jeff Sing - Quarterly Service Delivery Reviews.pdf
Jeff Sing - Quarterly Service Delivery Reviews.pdfJeff Sing - Quarterly Service Delivery Reviews.pdf
Jeff Sing - Quarterly Service Delivery Reviews.pdf
 
Leandro Melendez - Chihuahua Load Tests.pdf
Leandro Melendez - Chihuahua Load Tests.pdfLeandro Melendez - Chihuahua Load Tests.pdf
Leandro Melendez - Chihuahua Load Tests.pdf
 
Robert Fornal - ChatGPT as a Testing Tool.pptx
Robert Fornal - ChatGPT as a Testing Tool.pptxRobert Fornal - ChatGPT as a Testing Tool.pptx
Robert Fornal - ChatGPT as a Testing Tool.pptx
 
Federico Toledo - Extra-functional testing.pdf
Federico Toledo - Extra-functional testing.pdfFederico Toledo - Extra-functional testing.pdf
Federico Toledo - Extra-functional testing.pdf
 
Andrew Knight - Managing the Test Data Nightmare.pptx
Andrew Knight - Managing the Test Data Nightmare.pptxAndrew Knight - Managing the Test Data Nightmare.pptx
Andrew Knight - Managing the Test Data Nightmare.pptx
 
Melissa Tondi - Automation We_re Doing it Wrong.pdf
Melissa Tondi - Automation We_re Doing it Wrong.pdfMelissa Tondi - Automation We_re Doing it Wrong.pdf
Melissa Tondi - Automation We_re Doing it Wrong.pdf
 
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdf
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdfJeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdf
Jeff Van Fleet and John Townsend - Transition from Testing to Leadership.pdf
 
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptx
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptxDesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptx
DesiradhaRam Gadde - Testers _ Testing in ChatGPT-AI world.pptx
 
Damian Synadinos - Word Smatter.pdf
Damian Synadinos - Word Smatter.pdfDamian Synadinos - Word Smatter.pdf
Damian Synadinos - Word Smatter.pdf
 
Lee Barnes - What Successful Test Automation is.pdf
Lee Barnes - What Successful Test Automation is.pdfLee Barnes - What Successful Test Automation is.pdf
Lee Barnes - What Successful Test Automation is.pdf
 
Jordan Powell - API Testing with Cypress.pptx
Jordan Powell - API Testing with Cypress.pptxJordan Powell - API Testing with Cypress.pptx
Jordan Powell - API Testing with Cypress.pptx
 
Carlos Kidman - Exploring AI Applications in Testing.pptx
Carlos Kidman - Exploring AI Applications in Testing.pptxCarlos Kidman - Exploring AI Applications in Testing.pptx
Carlos Kidman - Exploring AI Applications in Testing.pptx
 
Ben Oconis - Breaking Down Silos.pdf
Ben Oconis - Breaking Down Silos.pdfBen Oconis - Breaking Down Silos.pdf
Ben Oconis - Breaking Down Silos.pdf
 

Recently uploaded

Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 

Recently uploaded (20)

Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 

Rick Clymer - Incident Management.pdf

  • 1. Don’t Just Fix It, Learn From it The Importance of Incident Management When Something Breaks
  • 2. Something’s broken, something’s broken It’s your fault, It’s your fault Are you gonna fix it? Are you gonna fix it? Right away, Right away ― Ethan (5 years old)
  • 3.
  • 4.
  • 5.
  • 6. Maybe 10 minutes Maybe 1 day Maybe a year
  • 7. Let's make the process better Is the smoke actually a fire Houston, we have a problem Document, document, document Keep those who need to know informed Stop trying to be the hero Let’s learn from this
  • 8. I’m Rick Clymer QR Lead @ RocketReach Sev 1, High Priority, Drop Everything and Fix Survivor
  • 10. Let’s figure out if it’s smoke or fire Does an issue actually exist?
  • 11. Does An Issue Actually Exist Are customers (internal or external) impacted? Does the alert actually mean something? If not resolved quickly, will revenue be lost due to the impact? Is someone screaming fire? Do you have a gut feel that something needs investigated?
  • 12. Let’s figure out where to go next Do the conditions of the issue meet any criteria used to determine severity? Does an issue actually exist?
  • 13. Example Severity Mapping Customer facing issue with no workaround. Business is severely impacted Experiencing or potential for data loss Customer facing issue with a workaround. Business sees an impact Core functionality broken Customer facing issue that is more of an annoyance Performance degradation Secondary functionality broken 1 2 3
  • 14. Severity vs Priority What is the impact of the issue on my software? Typically higher the severity, faster the remediation time but can vary Owner - Subject manager expert able to determine technical impact, business folks able to determine business impact Severity
  • 15. Severity vs Priority How fast does the issue need resolved? If we handle resolving this issue, what is going to get pushed to the side? A high priority fix can have a low severity Owner - Typically leadership, product management What is the impact of the issue on my software? Typically higher the severity, faster the remediation time but can vary Owner - Subject manager expert able to determine technical impact, business folks able to determine business impact Severity Priority
  • 16. Let’s figure out where to go next Do the conditions of the issue meet any criteria used to determine severity? Does an issue actually exist? Is the condition expected and will self resolve?
  • 18. Initiate Fire Fighting mode Have a communication process in place to inform others an incident exists Use your alerting system to initiate a new incident Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc.. Setup a central meeting point that anyone can join
  • 19. Initiate Fire Fighting mode Have a communication process in place to inform others an incident exists Did something get deployed to production and cause the issue? Did someone or something start using the system in a way that hasn’t happened previously? Do the monitors show any infrastructure related issues? Is there anything in the error logs indicating a new system issue that hasn't been seen before? What changed between 10 minutes ago and now? Use your alerting system to initiate a new incident Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc.. Setup a central meeting point that anyone can join
  • 20. Initiate Fire Fighting mode Have a communication process in place to inform others an incident exists Did something get deployed to production and cause the issue? Did someone or something start using the system in a way that hasn’t happened previously? Do the monitors show any infrastructure related issues? Is there anything in the error logs indicating a new system issue that hasn't been seen before? What changed between 10 minutes ago and now? As you have ideas of what’s going on, make sure to get them out of your head and somewhere others can access Make it easy for others to join in the fire fighting Use your alerting system to initiate a new incident Setup integrations for things like your messaging tool (Slack, Teams, etc…), monitoring systems, virtual meetings, etc.. Setup a central meeting point that anyone can join
  • 21.
  • 22. Trust your hunches… Hunches are usually based on facts filed away just below the conscious level. ― Joyce Brothers, American psychologist
  • 23. Don’t just go on your intuition Second set of eyes is the most valuable tool in a moment of incident Time of day, stress levels, prior activities, etc all play a role in your decisions
  • 24. Stop trying to be the hero
  • 25. No individual can win a game by himself ― Pele, international soccer star
  • 26. Stop the bleeding, keep your company’s reputation with your customers, get things back to normal Resolve your issue
  • 27. Fool me once, shame on you. Fool me twice, shame on me. Don’t let this bite you again
  • 28. Innovation comes only from readily and seamlessly sharing information rather than hoarding it - Tom Peters Stop creating silos
  • 29. Incident Management Team Commander Tech Lead Communicator ●Gets the appropriate people involved ●Holds all positions until others are delegated ●Keeps incident docs up to date ●Handles the incident resolution process ●Delegates to those who are working on resolving ●Controls what is changed ●Face of the incident response ●Issues periodic updates to stakeholders ●Keeps incident docs up to date
  • 30. Additional Roles You Might Consider Customer Support Lead Subject Matter Experts External Communication Lead Dedicated Scribe Root Cause Analyst
  • 31. Be prepared to have backups for any role
  • 33. Key pieces of data to keep track of Any monitors, alerts, logs that are related to the incident Who is assigned a role at what point What has been attempted to remedy the system Support ticket counts related to the specific incident Any ideas of what the root cause could be What the actual problem that is trying to be remedied
  • 34. Key time related metrics What time was the first alert, notification, smoking server, etc… signaled? What time did someone acknowledge the issue and begin working? What time were system’s back to a usable state? Any time a potential fix was pushed to a production environment? What time was the system back to a fully restored state?
  • 35. How to keep documentation
  • 36. Keep those who need to know informed
  • 37. Teamwork begins by building trust. And the only way to do that is to overcome our need for invulnerability. ― Patrick Lencioni, author
  • 38. Open internal line of communication Know your audience Over-updating is better than keeping everyone in the dark Typically try to follow a 15-30 minute or sooner if new information becomes available timeline Provide timely updates Incident response started, complete or partial restoration, new communicator taking duties over Inform on key milestones Keep the message appropriate for who you are delivering it to Avoid providing exact times for system to be restored Try to keep all messaging from the same person
  • 39. Keep your customers in the know Provide your customers with the tools when an incident arises Internal and external communication are two very different things Lesson learned: Let someone in customer support provide external messaging
  • 41. Most important part of incident response:
  • 42. Most important part of incident response: Post Mortems
  • 43. What a post mortem isn’t Somewhere to cast blame on a person, team, or some other specific resource Post mortems should be available for any internal stakeholder to review This can be a trust rebuilding process with the rest of the business as issues arise to show how you’ve taken action to ensure this doesn’t happen again A journal entry intended to be tucked under your pillow at night Post mortems should follow the same template every time an incident occurs Ensure key information is captured in a repeatable fashion Scattered thoughts in a document Post mortems are intended to be learning experiences on issues in our systems and processes, not to point fingers Blameless post mortems allow the team to work closer to making sure this incident won’t happen again
  • 44. Goals for your post mortems Ensure the incident is documented
  • 45. Goals for your post mortems Root cause is understood for how the incident occured Ensure the incident is documented
  • 46. Goals for your post mortems Root cause is understood for how the incident occured Ensure the incident is documented What steps have been identified and either put into place or planned to be put into place to reduce the change of the incident occurring again
  • 47. Post Mortem Template Incident Title Date (Note time zones and keep consistent) Author(s) Status Summary (280 characters max) Impact (Technical and business) Root Cause(s) Trigger Resolution How was incident detected Action Items (Item, owner, link to ticket system) Lessons Learned: ● What went well ● What went wrong ● Where we got lucky Timeline (Note time zone, use military time) Key Incident metrics: ● Start/Stop Time ● Time to detect, acknowledge, recovery Supporting information
  • 49. Post mortem tips Keep all post mortems in a repository Our minds are mush, try to complete the document and distribute for a review to initial team as soon as possible Distribute to the team for review within 24 hours of the initial incident Not for everyone, but once the incident is over, your post mortem document is ready for review Use the post mortem template as your incident note template Use a markdown template and create an incident mgmt GitHub repo Create a directory in Google Docs or Office 365 to store the documents These are intended to be learning lessons to prevent as well as trust builders for key stakeholders Make available to key stakeholders within 72 hours of an incident
  • 50. No matter what happens, or how bad it seems today, life does go on, and it will be better tomorrow. ― Maya Angelou
  • 51.
  • 52.
  • 53.
  • 54.
  • 56. Additional Resources Google SRE Books Atlassian Incident Management Something’s Broken PagerDuty Alert Sound PagerDuty Incident Management Presentation - bit.ly/IncidentMgmtQAOTHW2023