World-Class Incident
Response Management
Keith Smith
Cloud Site Reliability Engineering
IncidentOps.com
Incident Ops
Introduction
As a Cloud Site Reliability and Distributed
Service Engineer at Microsoft, Keith Smith
has worked on highly-available distributed
cloud telemetry pipeline operations at
massive scale for Xbox and Windows.
Keith manages all AWS / Azure Cloud
Operations at Imagine Learning and has
helped the company to move to agile
Incident Response Management by
facilitating a culture of communication and
collaboration between support,
development, and operations.
He enjoys spending time with his family, rock
climbing, and biking.
He is the founder of Incident Ops, a
Microsoft Azure Partner specializing in Site
Reliability, Cloud Architecture, and Incident
Response.
Agenda
Incident Definition
Incident Response Management
On-Call Procedures
Keeping Services Healthy
Agenda
Incident Definition
• Introduction / Level Setting
• Incident Timeline
• Prioritization
Incident Response Management
On-Call Procedures
Keeping Services Healthy
“ An incident is defined as an event
that has a measurable impact on the
customer experience. ”
Keith Smith
Incident Introduction
There are two major measurements when it comes to service health:
• Mean Time Between Failures (MTBF)
• Mean Time to Resolve Incidents (MTTR)
MTTR can be further broken down:
• Time to detect incident
• Time to engage (acknowledge) incident
• Time to mitigate incident impact
• Time to resolve incident
Incident Timeline
[Figure: timeline from 08:00 to 18:00 with hourly ticks. The time between incidents illustrates Mean Time Between Failures (MTBF).]
09:31 Incident Start
09:46 Alert Acknowledged
10:43 Impact Mitigated
11:11 Incident Resolved
09:31 - 11:11 Incident Window
09:31 - 09:46 MTTA (15 min)
09:46 - 10:43 MTTM (57 min)
10:43 - 11:11 MTTR (28 min)
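The phase durations on this timeline can be derived directly from the four timestamps. The following is a minimal sketch rather than anything from the original deck: the function names and dictionary keys are illustrative assumptions, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

def incident_phases(started: str, acknowledged: str, mitigated: str, resolved: str) -> dict:
    """Split one incident window into the phases shown on the timeline."""
    return {
        "time_to_acknowledge": minutes_between(started, acknowledged),
        "time_to_mitigate": minutes_between(acknowledged, mitigated),
        "time_to_resolve": minutes_between(mitigated, resolved),
        "incident_window": minutes_between(started, resolved),
    }

# Timestamps taken from the slide's example incident.
phases = incident_phases("09:31", "09:46", "10:43", "11:11")
print(phases)
```

Averaging each phase over many incidents would give the "mean time" figures (MTTA, MTTM, MTTR) that the deck tracks.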
How do you Prioritize an Incident?
A non-maskable interrupt (NMI) is a computer processor interrupt
that cannot be ignored by standard interrupt masking techniques in
the system. It is typically used to signal attention for non-recoverable
hardware errors.
Answers.com, emphasis added - http://www.answers.com/Q/What_is_non_maskable_interrupt_interrupt
An incident is a development and operations interrupt that cannot be
ignored by standard feature development. It is typically used to signal
attention for non-recoverable issues that cause customer impact.
Inspired by Bryan Sparks, CTO – Imagine Learning Inc.
August 2015
Agenda
Incident Definition
Incident Response Management
• Incident Lifecycle
• Incident Resolution
• Root Cause Analysis
• Post-Mortem Review
On-Call Procedures
Keeping Services Healthy
“ What gets measured,
gets managed. ”
Peter Drucker
Incident Lifecycle
Cloud Services: cloud services are monitored using the tools of your choice.
Incident Lifecycle
Incident Begins: a customer-impacting incident is triggered.
Incident Lifecycle
[Diagram: Incident Begins, then Source catches incident and alerts on-call, then Incident Acknowledged by On-Call (MTTA/MTTD).]
Monitoring catches the incident and routes the alert to the incident management system and the on-call individuals.
Incident Lifecycle
[Diagram adds: Investigation Begins, Severity Assessed, Investigation Ongoing.]
On-call begins investigating. Impact is assessed, and the company status page is updated as needed.
Incident Lifecycle
[Diagram adds two escalation paths.]
Escalate to Ops/Dev SME: additional help is required to determine the cause and mitigate the incident. Subject Matter Experts (Service Owners) are brought in as needed.
Escalate to TDO: the incident is severe or lengthy (>30 minutes). For high-impact incidents, the Technical Duty Officer (Dev and/or Operations Manager) is looped in to coordinate team activities.
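The two escalation paths can be sketched as a small decision function. This is an illustration under stated assumptions, not the author's tooling: the `Incident` fields and the `escalation_targets` name are hypothetical, and "severe" is reduced to a boolean. Only the 30-minute threshold and the SME and TDO roles come from the slide.

```python
from dataclasses import dataclass

# Threshold from the slide: incidents longer than 30 minutes go to the TDO.
LENGTHY_MINUTES = 30

@dataclass
class Incident:
    minutes_open: int   # time since the incident began
    severe: bool        # hypothetical flag set during severity assessment
    needs_help: bool    # on-call cannot determine cause or mitigate alone

def escalation_targets(incident: Incident) -> list:
    """Return who should be pulled in, in addition to the on-call engineer."""
    targets = []
    if incident.needs_help:
        targets.append("SME")  # Subject Matter Expert (service owner)
    if incident.severe or incident.minutes_open > LENGTHY_MINUTES:
        targets.append("TDO")  # Technical Duty Officer coordinates the teams
    return targets

print(escalation_targets(Incident(minutes_open=45, severe=False, needs_help=True)))
```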
Incident Lifecycle
[Diagram adds: Incident Impact Mitigated (Temporary Fix), temporary workaround implemented (MTTM).]
Most companies stop here.
“ For every effect there is a root cause. Find
and address the root cause rather than try to
fix the effect, as there is no end to the latter. ”
Celestine Chua
Writer and Founder of Personal Excellence, life coach
Incident Resolution
Two criteria are required for an incident to be resolved (closed):
• Impact has been mitigated.
• Root cause of the issue has been identified.
Work items to address root cause are completed and released to
production immediately when possible.
At times additional long-term work is required to address root cause.
In this case the work item is logged as a Bug and the Shield team
works on the fix (described in more detail later).
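The two closing criteria can be expressed as a simple guard. This is a minimal sketch: the `IncidentRecord` type, its field names, and `can_close` are assumptions made for illustration and do not describe any particular work-tracking tool.

```python
from dataclasses import dataclass

@dataclass
class IncidentRecord:
    impact_mitigated: bool = False
    root_cause: str = ""       # free text; empty until root cause is identified
    followup_bug_id: str = ""  # hypothetical ID of long-term work logged as a Bug

def can_close(record: IncidentRecord) -> bool:
    """An incident is resolved only when impact is mitigated AND root cause is known."""
    return record.impact_mitigated and bool(record.root_cause.strip())

mitigated_only = IncidentRecord(impact_mitigated=True)
fully_resolved = IncidentRecord(impact_mitigated=True, root_cause="expired certificate")
print(can_close(mitigated_only), can_close(fully_resolved))
```

Long-term fixes do not block closing in this sketch; they are only recorded through the hypothetical `followup_bug_id` field.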
Creating a Root Cause Culture
Don’t stop until the incident is resolved.
• This is an expectation, and won’t always be popular.
Make root cause part of your Acceptance Criteria.
• Record root cause of issues in work tracking software (JIRA, VSTS, etc.) for incident work items.
Post-Mortem discussion is mandatory for incident participants.
Incident Lifecycle
[Diagram unchanged, ending at Incident Impact Mitigated (Temporary Fix), MTTM.]
Most companies stop here. Don’t stop here!
Incident Lifecycle
[Diagram adds: Root Cause Determined (cause determined but not mitigated), alongside Incident Impact Mitigated (Temporary Fix).]
Finding root cause is the single most important step in the Incident Lifecycle.
Incident Lifecycle
[Diagram adds: Incident Resolved (Permanent Fix Implemented).]
Root cause has been addressed and the incident is truly resolved at this point.
Incident Lifecycle
[Diagram adds: Post-Mortem Discussion (Retrospective), Repair Items Identified.]
A review of past incidents is performed at regular intervals (weekly, monthly, etc.).
Post-Mortem Discussion
The Post-Mortem Retrospective is a team gathering
where no blame is tolerated.
It’s a great opportunity to learn and
grow from each other’s experiences
and to take time to reflect on the
current strengths and weaknesses in
company services.
Livesite Review
1. Discuss actions taken to address incidents.
2. What we could have done better during the incident.
3. Review work items required to ensure incidents do not happen again.
4. Suggest other things we can do to continually improve our services.
Agenda
Incident Definition
Incident Response Management
On-Call Procedures
• Dual Paging
• Procedures Step-by-Step
• Incident Fatigue
Keeping Services Healthy
“ Pain sure does bring out the best in
people, doesn’t it? ”
Bob Dylan
Singer, Songwriter, Painter, Writer, and Nobel Prize Laureate
Dual-Paging
Live site issues generally fall into two categories:
• Infrastructure issues.
• Code issues.
The goal is the same for both: Reduce MTTR by resolving issues as
quickly as possible.
But we don’t know which category an issue falls into when an
incident starts.
On-Call Procedure
[Diagram: a Cloud Alert is dual-paged to the Operations Team (Cloud Operations primary and secondary rotations) and the Engineering Team (Engineering primary and secondary rotations).]
An incident alert triggers a phone call / SMS message
to both the Operations and Engineering teams.
A secondary is always available should the
primary on-call be unavailable.
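Dual paging with a secondary fallback can be sketched as follows. This is a simplified illustration: the `Rotation` type, the responder names, and the idea of passing in a set of available people are all assumptions. A real paging system would more likely page the primary first and fall back to the secondary only when the page goes unacknowledged.

```python
from dataclasses import dataclass

@dataclass
class Rotation:
    team: str
    primary: str
    secondary: str

def dual_page(rotations, available):
    """Pick one responder per team: the primary if available, otherwise the secondary."""
    pages = {}
    for rotation in rotations:
        if rotation.primary in available:
            pages[rotation.team] = rotation.primary
        else:
            pages[rotation.team] = rotation.secondary
    return pages

# Hypothetical rotations: every alert goes to BOTH teams.
rotations = [
    Rotation("Operations", "ops-primary", "ops-secondary"),
    Rotation("Engineering", "eng-primary", "eng-secondary"),
]
paged = dual_page(rotations, available={"ops-primary"})
print(paged)
```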
On-Call Procedure
[Diagram adds: both teams Initiate Bridge.]
All active on-call personnel join a voice
conference bridge using Skype, Slack, or an
equivalent tool to coordinate the incident
investigation.
On-Call Procedure
[Diagram adds: a Service Subject Matter Expert (SME) on each side, Join Conference Bridge.]
Sometimes a little extra help is needed. Service
Subject Matter Experts (Engineering and/or Ops)
may be called to join the conference bridge.
On-Call Procedure
[Diagram adds: Operations Team Lead and Engineering Team Lead.]
Lengthy / severe issues are escalated to team leads to assist in coordinating the incident.
Incident Fatigue
An important side note: incidents are urgent and stressful. Avoid
creating unnecessary incidents whenever possible.
Every alert should be actionable.
If an alert isn’t actionable 100% of the time, either adjust the
monitoring as an incident action item or have the alert send only
notification emails (not create incidents).
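The "actionable 100% of the time" rule can be sketched as a check over an alert's history. This is a hypothetical illustration: the `delivery_for` name and the history format (one boolean per past firing) are assumptions, as is the choice to page for an alert that has no history yet.

```python
def delivery_for(actionable_history):
    """Decide how an alert is delivered, given whether each past firing was actionable.

    Returns "page" (creates an incident) only if every past firing was actionable;
    otherwise "email" (notification only). An alert with no history is paged by
    default, which is an assumption of this sketch.
    """
    if all(actionable_history):
        return "page"
    return "email"

print(delivery_for([True, True, True]))   # every firing needed action
print(delivery_for([True, False, True]))  # one firing was noise
```

An alert that lands in the "email" bucket is a candidate for the monitoring adjustment described above.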
Agenda
Incident Definition
Incident Response Management
On-Call Procedures
Keeping Services Healthy
• Alert Management Systems
• Shield Teams
• Bug Cap
• Error Rate Zero
“ Early to bed and early to rise, makes a
man healthy, wealthy, and wise. ”
Benjamin Franklin
Founding Father of the United States, Inventor, Author, Scientist
Quick Recap: Incident Primary Goals
Mitigate impact as quickly as possible (when able).
Determine root cause.
Identify action items to address root cause (permanently).
Alert Management System
At the core of a World-Class Incident Response Management pipeline
is an Alert Management System.
This system will aggregate monitoring alerts into a centralized system
and route these alerts to the correct teams / personnel.
Alerts are always routed via phone / SMS. Email is not real-time, and
inboxes are too noisy for incident alerts.
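Aggregation and routing can be sketched with a small routing table. Everything named here is an assumption for illustration: the `Alert` fields, the service names, the `ROUTES` table, and the rule that a service already reported by one monitor is not reported again. It does not describe any specific alert management product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    source: str   # monitoring tool that raised the alert (hypothetical names)
    service: str
    summary: str

# Hypothetical routing table: service name -> teams to notify.
ROUTES = {
    "web-frontend": ["operations", "engineering"],
    "telemetry-pipeline": ["operations"],
}
DEFAULT_TEAMS = ["operations"]
CHANNELS = ("phone", "sms")  # per the slide, incident alerts never go by email

def notifications(alerts):
    """Aggregate alerts from all sources into (team, channel, service) notifications."""
    seen_services = set()
    out = []
    for alert in alerts:
        if alert.service in seen_services:
            continue  # this service was already reported by another monitor
        seen_services.add(alert.service)
        for team in ROUTES.get(alert.service, DEFAULT_TEAMS):
            for channel in CHANNELS:
                out.append((team, channel, alert.service))
    return out

alerts = [
    Alert("monitor-a", "telemetry-pipeline", "ingestion lag"),
    Alert("monitor-b", "telemetry-pipeline", "ingestion lag high"),
]
print(notifications(alerts))
```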
Integrations
The alert management system should integrate with the tools your
team is familiar with using, and engineers can work out their own
flow for addressing incidents.
Make it easy to accept and use, and people will adopt it.
Shield Teams
Engineering Shield Teams are an obvious extension to dual paging.
They help engineers focus and avoid interrupt-driven work.
Feature teams work on backlog of
new feature development.
Shield teams address bugs and
interruptions to feature team.
Shield Teams are a concept I learned and experienced while working at Microsoft,
where many Engineering teams use them.
Shield Teams
Shield Teams rotate at each iteration (sprint). This spreads the load,
provides cross-training opportunities, and safeguards against incident
fatigue.
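One way to rotate shield duty each sprint is a simple round-robin. This is a sketch under assumptions: the engineer names, the shield size of two, and the `shield_team_for` function are hypothetical, and the deck does not prescribe a particular rotation scheme.

```python
def shield_team_for(sprint_number, engineers, shield_size=2):
    """Round-robin: take shield_size consecutive engineers, shifting every sprint."""
    start = (sprint_number * shield_size) % len(engineers)
    return [engineers[(start + i) % len(engineers)] for i in range(shield_size)]

engineers = ["Ana", "Ben", "Cho", "Dee", "Eli"]  # hypothetical team
for sprint in range(3):
    print(sprint, shield_team_for(sprint, engineers))
```

Everyone not on shield duty in a given sprint stays on the feature backlog, so the interrupt load moves around the team over time.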
Bug Cap
Bug Cap is a concept I learned from Microsoft, and it is an amazing
way to address technical debt.
Team Size x 4 = Bug Cap
The rule is simple:
If bug count exceeds bug cap, stop working on new features until
bugs are resolved.
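The bug cap rule is simple enough to express directly. In this minimal sketch the function names are illustrative assumptions; the multiplier of 4 and the "stop feature work when exceeded" rule come from the slide.

```python
BUGS_PER_ENGINEER = 4  # from the slide: Team Size x 4 = Bug Cap

def bug_cap(team_size: int) -> int:
    return team_size * BUGS_PER_ENGINEER

def feature_work_allowed(team_size: int, open_bugs: int) -> bool:
    """Feature work stops only when the open bug count EXCEEDS the cap."""
    return open_bugs <= bug_cap(team_size)

print(bug_cap(6))                   # cap for a six-person team
print(feature_work_allowed(6, 25))  # one bug over the cap
```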
Bug Cap
Bug Cap violations should be tracked as a metric for each team and
reviewed in management discussions.
This metric is great for standup, retrospective, and planning
discussions.
Error Rate Zero
Which is easier to monitor?
What is the baseline for graph A? for B?
Low error rates create actionable monitoring and alerting.
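A rough way to see why a low error baseline makes alerting actionable is to derive an alert threshold from recent error counts. This is a sketch with made-up numbers: both baselines are hypothetical, and the mean plus three standard deviations rule is a common heuristic chosen for illustration, not something taken from the deck.

```python
from statistics import mean, pstdev

def alert_threshold(baseline_errors, k=3.0):
    """Threshold = mean + k standard deviations of recent error counts."""
    return mean(baseline_errors) + k * pstdev(baseline_errors)

def is_anomalous(value, baseline_errors, k=3.0):
    return value > alert_threshold(baseline_errors, k)

noisy = [40, 55, 38, 61, 47, 52]  # hypothetical errors per minute, noisy service
quiet = [0, 0, 0, 0, 0, 0]        # hypothetical service running at error rate zero

print(is_anomalous(5, quiet))   # any error stands out against a zero baseline
print(is_anomalous(60, noisy))  # a real problem can hide inside the noise
```

With a zero baseline the threshold is zero, so every error is a clear signal; with a noisy baseline the threshold sits well above normal traffic and small regressions go unnoticed.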
Error Rate Zero
Don’t tolerate bugs … Ever.
The goal is to be able to treat
them as incidents, and
eliminate them with the
highest priority.
Questions?
Please connect with me on LinkedIn:
https://www.linkedin.com/in/keithbradsmith
Interested in a training or in partnering with Incident Ops?

World-Class Incident Response Management

  • 1. World-Class Incident Response Management Keith Smith Cloud Site Reliability Engineering IncidentOps.com Incident Ops
  • 2. Introduction As an Cloud Site Reliability and Distributed Service Engineer at Microsoft, Keith Smith has worked on highly-available distributed cloud telemetry pipeline operations at massive scale for Xbox and Windows. Keith manages all AWS / Azure Cloud Operations at Imagine Learning and has helped the company to move to agile Incident Response Management by facilitating a culture of communication and collaboration between support, development, and operations. He enjoys spending time with his family, rock climbing, and biking. He is the founder of Incident Ops, a Microsoft Azure Partner specializing in Site Reliability, Cloud Architecture, and Incident Response.
  • 3. Agenda Incident Definition Incident Response Management On-Call Procedures Keeping Services Healthy
  • 4. Agenda Incident Definition  Introduction / Level Setting  Incident Timeline  Prioritization Incident Response Management On-Call Procedures Keeping Services Healthy
  • 5. “ An incident is defined as an event that has a measurable impact on the customer experience. ” Keith Smith
  • 6. Incident Introduction There are two major measurements when it comes to service health:  Mean Time Between Failures – MTBF  Mean Time to Resolve Incidents – MTTR MTTR can be further broken down: Time to detect incident Time to engage (acknowledge) incident Time to mitigate incident impact Time to resolve incident
  • 7. Incident Timeline 08:00 18:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 09:31 11:11 10:00 Mean Time Between Failures - MTBF 09:31 Incident Start 11:11 Incident Resolved 09:31 - 11:11 Incident Window 11:11 Incident Resolved 09:31 Incident Start 09:46 Alert Acknowledged 09:31 - 09:46 MTTA – min 09:46 - 10:43 MTTM – min 10:43 - 11:11 MTTR – min 10:43 Impact Mitigated
  • 8. How do you Prioritize an Incident? A non-maskable interrupt (NMI) is a computer processor interrupt that cannot be ignored by standard interrupt masking techniques in the system. It is typically used to signal attention for non-recoverable hardware errors. Answers.com, emphasis added - http://www.answers.com/Q/What_is_non_maskable_interrupt_interrupt An incident is a development and operations interrupt that cannot be ignored by standard feature development. It is typically used to signal attention for non-recoverable issues that cause customer impact. Inspired by Bryan Sparks, CTO – Imagine Learning Inc. August 2015
  • 9. Agenda: Incident Definition, Incident Response Management (Incident Lifecycle; Incident Resolution; Root Cause Analysis; Post-Mortem Review), On-Call Procedures, Keeping Services Healthy
  • 10. “ What gets measured, gets managed. ” Peter Drucker
  • 11. Incident Lifecycle: Cloud Services. Cloud services are monitored using the monitoring tools of your choice.
  • 12. Incident Lifecycle: Incident Begins. A customer-impacting incident triggers.
  • 13. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call; MTTA/MTTD). Monitoring catches the incident and routes the alert to the incident management system and on-call individuals.
  • 14. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing). On-call begins investigating. Impact is assessed and updates to the company status page are made as needed.
  • 15. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). Subject Matter Experts (Service Owners) are escalated to as needed. For high-impact incidents, the Technical Duty Officer (Dev and/or Operations Manager) is looped in to coordinate team activities.
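
The escalation rules on this slide can be sketched as a small decision function. This is only an illustration: the 30-minute threshold comes from the slide, while the `IncidentState` fields and the returned labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# From the slide: an incident is "lengthy" once it runs past 30 minutes.
TDO_ESCALATION_MINUTES = 30


@dataclass
class IncidentState:
    minutes_open: int
    severe: bool          # hypothetical flag set during severity assessment
    needs_sme_help: bool  # hypothetical flag: on-call needs help finding the cause


def escalation_targets(state: IncidentState) -> List[str]:
    """Return who should be pulled in, given the current incident state."""
    targets = []
    if state.needs_sme_help:
        targets.append("Ops/Dev SME")
    if state.severe or state.minutes_open > TDO_ESCALATION_MINUTES:
        targets.append("TDO")
    return targets


print(escalation_targets(IncidentState(minutes_open=10, severe=False, needs_sme_help=False)))
print(escalation_targets(IncidentState(minutes_open=45, severe=False, needs_sme_help=True)))
```

The two checks are independent, so an incident can escalate to an SME, to the TDO, or to both at once.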
  • 16. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing) → Incident Impact Mitigated (temporary fix: temporary workaround implemented; MTTM). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). Most companies stop here.
  • 17. “ For every effect there is a root cause. Find and address the root cause rather than try to fix the effect, as there is no end to the latter. ” Celestine Chua Writer and Founder of Personal Excellence, life coach
  • 18. Incident Resolution: Two criteria are required for an incident to be resolved (closed): (1) impact has been mitigated; (2) the root cause of the issue has been identified. Work items to address root cause are completed and released to production immediately when possible. At times additional long-term work is required to address root cause. In this case the work item is logged as a Bug and the Shield team works on the fix (described in more detail later).
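
The two closure criteria can be expressed as a simple check. This is a hedged sketch rather than any particular tracker's workflow: the `Incident` class, its field names, and the status strings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    impact_mitigated: bool = False
    root_cause_identified: bool = False


def can_close(incident: Incident) -> bool:
    """Both criteria from the slide must hold before an incident is closed."""
    return incident.impact_mitigated and incident.root_cause_identified


def status(incident: Incident) -> str:
    """Hypothetical status labels used only for this example."""
    if can_close(incident):
        return "resolved"
    if incident.impact_mitigated:
        return "mitigated, root cause pending"
    return "open"


print(status(Incident()))
print(status(Incident(impact_mitigated=True)))
print(status(Incident(impact_mitigated=True, root_cause_identified=True)))
```

Keeping mitigation and root cause as separate flags makes the "most companies stop here" state visible instead of letting it pass as resolved.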
  • 19. Creating a Root Cause Culture: Don’t stop until the incident is resolved (this is an expectation, and won’t always be popular). Make root cause part of your Acceptance Criteria: record the root cause of issues in work tracking software (JIRA, VSTS, etc.) for incident work items. Post-Mortem discussion is mandatory for incident participants.
  • 20. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing) → Incident Impact Mitigated (temporary fix: temporary workaround implemented; MTTM). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). Most companies stop here. Don’t stop here!
  • 21. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing) → Incident Impact Mitigated (temporary fix: temporary workaround implemented; MTTM) → Root Cause Determined (cause determined but not mitigated). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). Finding Root Cause is the single most important step in the Incident Lifecycle.
  • 22. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing) → Incident Impact Mitigated (temporary fix: temporary workaround implemented) → Root Cause Determined (cause determined but not mitigated) → Incident Resolved (permanent fix implemented). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). Root Cause has been addressed and the incident is truly resolved at this point.
  • 23. Incident Lifecycle: Incident Begins → Incident Acknowledged by On-Call (source catches incident and alerts on-call) → Investigation Begins (severity assessed, investigation ongoing) → Incident Impact Mitigated (temporary fix: temporary workaround implemented) → Root Cause Determined (cause determined but not mitigated) → Incident Resolved (permanent fix implemented) → Post-Mortem Discussion (retrospective: repair items identified). Escalation paths: Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident); Escalate to TDO (incident severe or lengthy, >30 minutes). A review of past incidents is performed at regular intervals (weekly, monthly, etc.).
  • 24. Post-Mortem Discussion The Post-Mortem Retrospective is a team gathering where no blame is tolerated. It’s a great opportunity to learn and grow from each other’s experiences and to take time to reflect on the current strengths and weaknesses in company services. Livesite Review 1. Discuss actions taken to address incidents. 2. Identify what we could have done better during the incident. 3. Review work items required to ensure incidents do not happen again. 4. Suggest other things we can do to continually improve our services.
  • 25. Agenda: Incident Definition, Incident Response Management, On-Call Procedures (Dual Paging; Procedures Step-by-Step; Incident Fatigue), Keeping Services Healthy
  • 26. “ Pain sure does bring out the best in people, doesn’t it? ” Bob Dylan Singer, Songwriter, Painter, Writer, and Nobel Prize Laureate
  • 27. Dual-Paging: Live site issues generally fall into two categories: infrastructure issues and code issues. The goal is the same for both: reduce MTTR by resolving issues as quickly as possible. But we don’t know which category an issue falls into when an incident starts.
  • 28. On-Call Procedure: Cloud Alert (dual page to Operations and Engineering teams) → Operations Team (Cloud Operations Primary Rotation, Cloud Operations Secondary Rotation) and Engineering Team (Engineering Team Primary Rotation, Engineering Team Secondary Rotation). An incident alert triggers a phone call / SMS message to both the Operations and Engineering teams. A secondary is always available should the primary on-call be unavailable.
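
A minimal sketch of the dual-paging idea: every alert pages one person from each team, falling back to the secondary when the primary is unavailable. The `Rotation` class and the on-call names are hypothetical, and a real alert management system would handle acknowledgement timeouts rather than a static "unavailable" set.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rotation:
    team: str
    primary: str
    secondary: str


# Hypothetical on-call names, for illustration only.
ROTATIONS = [
    Rotation("Cloud Operations", primary="ops-primary", secondary="ops-secondary"),
    Rotation("Engineering", primary="eng-primary", secondary="eng-secondary"),
]


def people_to_page(
    rotations: List[Rotation], unavailable: FrozenSet[str] = frozenset()
) -> List[str]:
    """Page one person per team: the primary, or the secondary as a fallback."""
    paged = []
    for rotation in rotations:
        if rotation.primary in unavailable:
            paged.append(rotation.secondary)
        else:
            paged.append(rotation.primary)
    return paged


print(people_to_page(ROTATIONS))
print(people_to_page(ROTATIONS, unavailable=frozenset({"eng-primary"})))
```

Paging both teams up front reflects the point made on the previous slide: at the start of an incident nobody knows yet whether the cause is infrastructure or code.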
  • 29. On-Call Procedure: Cloud Alert (dual page to Operations and Engineering teams) → Operations Team (Cloud Operations Primary Rotation, Cloud Operations Secondary Rotation) and Engineering Team (Engineering Team Primary Rotation, Engineering Team Secondary Rotation) → Initiate Bridge. All active on-call personnel join a voice conference bridge using Skype, Slack, or an equivalent tool to coordinate the incident investigation.
  • 30. On-Call Procedure: Cloud Alert (dual page to Operations and Engineering teams) → Operations Team (Cloud Operations Primary Rotation, Cloud Operations Secondary Rotation) and Engineering Team (Engineering Team Primary Rotation, Engineering Team Secondary Rotation) → Initiate Bridge → Service Subject Matter Expert (SME) joins the conference bridge. Sometimes a little extra help is needed. Service Subject Matter Experts (Engineering and/or Ops) may be called to join the conference bridge.
  • 31. On-Call Procedure: Cloud Alert (dual page to Operations and Engineering teams) → Operations Team (Cloud Operations Primary Rotation, Cloud Operations Secondary Rotation) and Engineering Team (Engineering Team Primary Rotation, Engineering Team Secondary Rotation) → Initiate Bridge → Service Subject Matter Expert (SME) joins the conference bridge → Operations Team Lead and Engineering Team Lead. Lengthy / severe issues are escalated to team leads to assist in coordinating the incident.
  • 32. Incident Fatigue An important side note. Incidents are urgent and stressful. Avoid creating unnecessary incidents whenever possible. Every alert should be actionable. If an alert isn’t actionable 100% of the time, its monitoring needs to be adjusted as an incident action item, or it should only send notification emails (not create incidents).
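
One way to apply the "actionable 100% of the time" rule is to look at an alert's firing history. The sketch below is an assumption-laden illustration: the history format, the policy labels, and the choice to page for an alert with no history yet are all hypothetical, not something the deck specifies.

```python
from typing import List


def notification_policy(history: List[bool]) -> str:
    """Decide how an alert should notify.

    history[i] is True if the i-th past firing required someone to act.
    Assumption: an alert with no history is treated as actionable until
    proven otherwise, so it pages.
    """
    if all(history):
        return "page"        # phone / SMS, creates an incident
    return "email-only"      # notification email, no incident created


print(notification_policy([True, True, True]))
print(notification_policy([True, False, True]))
```

An alert that drops to "email-only" is also a candidate for the monitoring adjustment the slide describes, logged as an incident action item.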
  • 33. Agenda: Incident Definition, Incident Response Management, On-Call Procedures, Keeping Services Healthy (Alert Management Systems; Shield Teams; Bug Cap; Error Rate Zero)
  • 34. “ Early to bed and early to rise, makes a man healthy, wealthy, and wise. ” Benjamin Franklin Founding Father of the United States, Inventor, Author, Scientist
  • 35. Quick Recap: Incident Primary Goals Mitigate impact as quickly as possible (when able). Determine root cause. Identify action items to address root cause (permanently).
  • 36. Alert Management System At the core of a World-Class Incident Response Management pipeline is an Alert Management System. This system will aggregate monitoring alerts into a centralized system and route these alerts to the correct teams / personnel. Alerts are always routed via phone / SMS. Email is not real-time and too much noise exists in email.
  • 37. Integrations The alert management system should integrate with the tools your team is familiar with using, and engineers can work out their own flow for addressing incidents. Make it easy to accept and use, and people will adopt it.
  • 38. Shield Teams Engineering Shield Teams are an obvious extension to dual paging. They help engineers focus and avoid interrupt-driven work. Feature teams work on the backlog of new feature development. Shield teams address bugs and interruptions to the feature team. Shield Teams are a concept I learned about and experienced while working at Microsoft, where many Engineering teams use them.
  • 39. Shield Teams Shield Teams rotate at each iteration (sprint). This spreads the load, provides cross-training opportunities, and safeguards against incident fatigue. Feature teams work on the backlog of new feature development. Shield teams address bugs and interruptions to the feature team. Shield Teams are a concept I learned about and experienced while working at Microsoft, where many Engineering teams use them.
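
The deck says Shield Teams rotate each sprint but does not say how the next team is chosen. As one possible sketch, assuming a simple round-robin order and hypothetical team names:

```python
from typing import List


def shield_team_for_sprint(teams: List[str], sprint_number: int) -> str:
    """Pick the shield team for a 1-based sprint number, round-robin."""
    if not teams:
        raise ValueError("at least one team is required")
    if sprint_number < 1:
        raise ValueError("sprint numbers start at 1")
    return teams[(sprint_number - 1) % len(teams)]


# Hypothetical team names, for illustration only.
TEAMS = ["team-a", "team-b", "team-c"]

for sprint in range(1, 5):
    print(sprint, shield_team_for_sprint(TEAMS, sprint))
```

Round-robin spreads the load evenly, which matches the stated goals of cross-training and guarding against incident fatigue; other schedules would work as long as every team takes a turn.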
  • 40. Bug Cap Bug Cap is a concept I learned from Microsoft, and it is an amazing answer to addressing technical debt. Team Size x 4 = Bug Cap The rule is simple: If bug count exceeds bug cap, stop working on new features until bugs are resolved.
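
The Bug Cap rule is plain arithmetic and can be sketched directly. The multiplier of 4 is from the slide; the function names are illustrative.

```python
# From the slide: Team Size x 4 = Bug Cap.
BUGS_PER_TEAM_MEMBER = 4


def bug_cap(team_size: int) -> int:
    """Maximum number of open bugs a team may carry."""
    return team_size * BUGS_PER_TEAM_MEMBER


def feature_work_allowed(team_size: int, open_bugs: int) -> bool:
    """Feature work stops only when the bug count exceeds the cap."""
    return open_bugs <= bug_cap(team_size)


team_size = 6
print(bug_cap(team_size))
print(feature_work_allowed(team_size, open_bugs=24))
print(feature_work_allowed(team_size, open_bugs=25))
```

For a team of six the cap is 24: at exactly 24 open bugs feature work continues, and the 25th bug halts it until the count comes back down.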
  • 41. Bug Cap Bug Cap violations should be tracked as a metric for each team and reviewed in management discussions. This metric is great for standup, retrospective, and planning discussions.
  • 42. Error Rate Zero Which is easier to monitor? What is the baseline for graph A? for B? Low error rates create actionable monitoring and alerting.
  • 43. Error Rate Zero Don’t tolerate bugs … Ever. The goal is to be able to treat them as incidents, and eliminate them with the highest priority.
  • 44. Questions? Please connect with me on LinkedIn: https://www.linkedin.com/in/keithbradsmith Interested in a training or in partnering with Incident Ops?

Editor's Notes

  1. Operations in Highly-Scalable Distributed Cloud Services in an Agile or DevOps culture / organization
  2. Incident Definition: defining each of the characteristics of an incident; explaining the differences between a bug and an incident. Incident Response Management: a strongly defined and repeatable process for managing and responding to incidents. Root Cause Culture: discuss the importance of Root Cause analysis and where most companies fall short. Keeping Services Healthy: this is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time.
  3. Incident Definition: defining each of the characteristics of an incident; explaining the differences between a bug and an incident. Incident Response Management: a strongly defined and repeatable process for managing and responding to incidents. Root Cause Culture: discuss the importance of Root Cause analysis and where most companies fall short. Keeping Services Healthy: this is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time.
  4. Three important takeaways here: An incident is an event. It has a clear start and end. Impact is measurable. Customers (whether internal or external) are the true measure of impact.
  5. Bryan Sparks, CTO at Imagine Learning, described incidents as NMIs. He gave me 100% discretion over EVERY Support, Operations, Dev, and PM in the company. At any given moment, I can tap any resource to help with an incident if I feel that person can help an incident to be resolved more quickly.
  6. Incident Definition: defining each of the characteristics of an incident; explaining the differences between a bug and an incident. Incident Response Management: a strongly defined and repeatable process for managing and responding to incidents. Root Cause Culture: discuss the importance of Root Cause analysis and where most companies fall short. Keeping Services Healthy: this is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time.
  7. I like this quote, often attributed to Peter Drucker… but I found no actual proof online that he said it. The simple act of paying attention to something will cause you to make connections you never did before, and you'll improve in those areas - almost without any extra effort. This process takes preparation and discipline, but once it is set up and generally accepted… it’s a breeze to use and extend.
  8. Email is NOT a reliable tool for incident management. Email disrupts us all regularly throughout the day/night, and incidents need to break out as something more than just another email. An incident MUST only notify via phone / SMS. Emails are ok for auditing, but are not a primary tool for on-call.
  9. Tell the story of my sister’s back pain: treating the symptom and not the cause. Example: Recycling the app pool daily instead of figuring out why the service crashes every once in a while. One is a mitigation, the other is root cause.
  10. That’s right, I put all 3 names for this meeting in a single slide!
  11. Incident Definition: defining each of the characteristics of an incident; explaining the differences between a bug and an incident. Incident Response Management: a strongly defined and repeatable process for managing and responding to incidents. Root Cause Culture: discuss the importance of Root Cause analysis and where most companies fall short. Keeping Services Healthy: this is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time.
  12. Bring the pain forward. Dev teams write better code when they are on the hook for fixing it in production. We’ve NEVER had anyone cite on-call as a reason for leaving the company (yet…)
  13. [DEMO] Show on-call rotations for Cloud Operations and Cloud Infrastructure (Have them pre-loaded) Central to EVERY person in this chain is communication / collaboration. The first thing done in every incident is to combine efforts and start a VOICE discussion. Chat is used for tracking and updates, but is too slow for incident collaboration.
  14. [DEMO] Show on-call rotations for Cloud Operations and Cloud Infrastructure (Have them pre-loaded) Central to EVERY person in this chain is communication / collaboration. The first thing done in every incident is to combine efforts and start a VOICE discussion. Chat is used for tracking and updates, but is too slow for incident collaboration.
  15. [DEMO] Show on-call rotations for Cloud Operations and Cloud Infrastructure (Have them pre-loaded) Central to EVERY person in this chain is communication / collaboration. The first thing done in every incident is to combine efforts and start a VOICE discussion. Chat is used for tracking and updates, but is too slow for incident collaboration.
  16. It’s easy to sit and do nothing 30 minutes into an incident. The team lead can drive individual accountability during incidents.
  17. Incident Definition: defining each of the characteristics of an incident; explaining the differences between a bug and an incident. Incident Response Management: a strongly defined and repeatable process for managing and responding to incidents. Root Cause Culture: discuss the importance of Root Cause analysis and where most companies fall short. Keeping Services Healthy: this is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time.
  18. Root Cause Methodology: Note that these are not numbered, and will regularly be addressed in different order. These three items are the goals of every incident, and the driving force behind all activities within the incident lifecycle. Every action should strive to reach one of these goals FASTER, with more precision, and be more COMPLETE. These go in JIRA with special tags and are discussed in Post-Mortem.
  19. Show Slack integration with OpsGenie [DEMO] Statuspage integration?
  20. Interruptions are expensive. Feature teams do not work on bugs or address incidents unless needed. Shield teams do this work during their assigned iteration (like being assigned to active duty) and only do feature work as able; nothing is ever assigned to them for that iteration.
  21. Shield teams rotate.
  22. Describe the Zen of Inbox Zero. -Inbox items are a todo list. Anything in the inbox is a task requiring follow up in a set period, such as a day.