Drawing on principles learned over many years at several companies, including Microsoft, this presentation describes the process of creating a strongly defined, repeatable Incident Response Management pipeline. Its goal is to increase companies' ability to maintain healthy cloud services throughout the entire application lifecycle. It covers how companies should identify, respond to, and manage incidents; on-call procedures; and organizational practices that reduce incident fatigue and keep services consistently reliable and available.
2. Introduction
As a Cloud Site Reliability and Distributed Service Engineer at Microsoft, Keith Smith has worked on highly available distributed cloud telemetry pipeline operations at massive scale for Xbox and Windows.
Keith manages all AWS / Azure Cloud Operations at Imagine Learning and has helped the company move to agile Incident Response Management by facilitating a culture of communication and collaboration between support, development, and operations.
He enjoys spending time with his family, rock climbing, and biking.
He is the founder of Incident Ops, a Microsoft Azure Partner specializing in Site Reliability, Cloud Architecture, and Incident Response.
5. “An incident is defined as an event that has a measurable impact on the customer experience.”
Keith Smith
6. Incident Introduction
There are two major measurements when it comes to service health:
Mean Time Between Failures – MTBF
Mean Time to Resolve Incidents – MTTR
MTTR can be further broken down:
Time to detect incident (MTTD)
Time to engage (acknowledge) incident (MTTA)
Time to mitigate incident impact (MTTM)
Time to resolve incident
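A minimal sketch of how this breakdown falls out of incident timestamps (the field names here are hypothetical; real incident-management tools record equivalents under different names):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # Hypothetical timestamp fields marking each lifecycle stage.
    began: datetime         # impact starts
    detected: datetime      # monitoring catches it
    acknowledged: datetime  # on-call engages
    mitigated: datetime     # temporary fix in place
    resolved: datetime      # root cause addressed

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

def mttr_breakdown(incidents: list[Incident]) -> dict[str, timedelta]:
    return {
        "MTTD": mean([i.detected - i.began for i in incidents]),
        "MTTA": mean([i.acknowledged - i.detected for i in incidents]),
        "MTTM": mean([i.mitigated - i.acknowledged for i in incidents]),
        "MTTR": mean([i.resolved - i.began for i in incidents]),
    }
```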
8. How do you Prioritize an Incident?
“A non-maskable interrupt (NMI) is a computer processor interrupt that cannot be ignored by standard interrupt masking techniques in the system. It is typically used to signal attention for non-recoverable hardware errors.”
Answers.com, emphasis added – http://www.answers.com/Q/What_is_non_maskable_interrupt_interrupt
“An incident is a development and operations interrupt that cannot be ignored by standard feature development. It is typically used to signal attention for non-recoverable issues that cause customer impact.”
Inspired by Bryan Sparks, CTO – Imagine Learning Inc., August 2015
13. Incident Lifecycle
Incident Begins → Source catches incident and alerts to on-call → Incident Acknowledged by On-Call.
Monitoring catches the incident and routes the alert to the incident management system and on-call individuals. (MTTA/MTTD)
14. Incident Lifecycle
… → Investigation Begins → Severity Assessed → Investigation Ongoing.
On-call begins investigating. Impact is assessed and updates to the company status page are made as needed.
15. Incident Lifecycle
… → Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident) → Escalate to TDO (incident severe or lengthy, >30 minutes).
Subject Matter Experts (Service Owners) are escalated to as needed. For high-impact incidents, the Technical Duty Officer (Dev and/or Operations Manager) is looped in to coordinate team activities.
16. Incident Lifecycle
… → Incident Impact Mitigated (Temporary Fix): temporary workaround implemented. (MTTM)
Most companies stop here.
17. “For every effect there is a root cause. Find and address the root cause rather than try to fix the effect, as there is no end to the latter.”
Celestine Chua
Writer, life coach, and founder of Personal Excellence
18. Incident Resolution
Two criteria are required for an incident to be resolved (closed):
Impact has been mitigated.
Root cause of the issue has been identified.
Work items to address the root cause are completed and released to production immediately when possible. At times additional long-term work is required to address the root cause; in this case the work item is logged as a Bug and the Shield team works on the fix (described in more detail later).
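A minimal sketch of enforcing these two gates in tooling; `incident` and `tracker` are hypothetical objects standing in for your incident-management and work-tracking systems:

```python
def close_incident(incident, tracker) -> None:
    """Close an incident only when both resolution criteria hold."""
    if not incident.impact_mitigated:
        raise ValueError("Impact not yet mitigated")
    if incident.root_cause is None:
        raise ValueError("Root cause not yet identified")
    if incident.needs_long_term_fix:
        # Long-term root-cause work is logged as a Bug for the Shield team.
        tracker.create_bug(
            title=f"Root cause fix: {incident.title}",
            root_cause=incident.root_cause,
            team="shield",
        )
    incident.close()
```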
19. Creating a Root Cause Culture
Don’t stop until the incident is resolved. This is an expectation, and it won’t always be popular.
Make root cause part of your Acceptance Criteria.
Record the root cause of issues in work-tracking software (JIRA, VSTS, etc.) for incident work items.
Post-Mortem discussion is mandatory for incident participants.
20. Incident Lifecycle
… → Incident Impact Mitigated (Temporary Fix): temporary workaround implemented. (MTTM)
Most companies stop here. Don’t stop here!
21. Incident Lifecycle
… → Root Cause Determined: cause determined but not mitigated.
Finding root cause is the single most important step in the Incident Lifecycle.
22. Incident Lifecycle
… → Incident Resolved: permanent fix implemented.
Root cause has been addressed and the incident is truly resolved at this point.
23. Incident Lifecycle
… → Post-Mortem Discussion (Retrospective) → Repair Items Identified.
A review of past incidents is performed at regular intervals (weekly, monthly, etc.).
24. Post-Mortem Discussion
The Post-Mortem Retrospective is a team gathering where no blame is tolerated. It’s a great opportunity to learn and grow from each other’s experiences and to take time to reflect on the current strengths and weaknesses in company services.
Livesite Review agenda:
1. Discuss actions taken to address incidents.
2. What we could have done better during the incident.
3. Review work items required to ensure incidents do not happen again.
4. Suggest other things we can do to continually improve our services.
26. “Pain sure does bring out the best in people, doesn’t it?”
Bob Dylan
Singer, Songwriter, Painter, Writer, and Nobel Prize Laureate
27. Dual-Paging
Live site issues generally fall into two categories:
Infrastructure issues.
Code issues.
The goal is the same for both: reduce MTTR by resolving issues as quickly as possible. But we don’t know which category an issue falls into when an incident starts.
28. On-Call Procedure
Cloud Alert (dual page to Operations and Engineering teams):
Operations Team: Cloud Operations Primary Rotation, backed by Cloud Operations Secondary Rotation.
Engineering Team: Engineering Team Primary Rotation, backed by Engineering Team Secondary Rotation.
An incident alert triggers a phone call / SMS message to both the Operations and Engineering teams. A secondary is always available should the primary on-call be unavailable.
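A minimal sketch of the dual-page fan-out with secondary fallback. The rotation data and the `is_acked` callback are hypothetical; real alert managers (OpsGenie, PagerDuty, etc.) express this declaratively as schedules and escalation policies:

```python
import time

# Hypothetical rotations; each entry is (primary, secondary) for the team.
ROTATIONS = {
    "operations": ("ops-primary", "ops-secondary"),
    "engineering": ("eng-primary", "eng-secondary"),
}
ACK_TIMEOUT_SECONDS = 300  # escalate after 5 unacknowledged minutes

def page(person: str, incident_id: str) -> None:
    print(f"phone/SMS page -> {person} for {incident_id}")  # placeholder

def dual_page(incident_id: str, is_acked) -> None:
    """Page both teams at once (we don't yet know whether the issue is
    infrastructure or code), then fall back to each team's secondary if
    the primary has not acknowledged in time."""
    for team, (primary, _secondary) in ROTATIONS.items():
        page(primary, incident_id)
    time.sleep(ACK_TIMEOUT_SECONDS)
    for team, (_primary, secondary) in ROTATIONS.items():
        if not is_acked(team, incident_id):
            page(secondary, incident_id)
```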
29. On-Call Procedure
Both on-call primaries initiate a conference bridge. All active on-call personnel join the voice bridge using Skype, Slack, or an equivalent tool to coordinate the incident investigation.
30. On-Call Procedure
Sometimes a little extra help is needed. Service Subject Matter Experts (SMEs, Engineering and/or Operations) may be called to join the conference bridge.
31. On-Call Procedure
Lengthy / severe issues are escalated to the Operations and Engineering team leads to assist in coordinating the incident.
32. Incident Fatigue
An important side note: incidents are urgent and stressful. Avoid creating unnecessary incidents whenever possible.
Every alert should be actionable. If an alert isn’t actionable 100% of the time, the monitoring needs to be adjusted as an incident action item, or the alert should only send notification emails (not create incidents).
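A minimal sketch of that routing rule; the `actionable` flag and the `pager`/`mailer` objects are hypothetical stand-ins for your monitoring configuration and integrations:

```python
def route_alert(alert, pager, mailer) -> None:
    """Page only for actionable alerts; demote the rest to email."""
    if alert.actionable:
        pager.create_incident(alert)  # phone / SMS page to on-call
    else:
        mailer.send(alert)            # informational only; never pages
```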
34. “Early to bed and early to rise, makes a man healthy, wealthy, and wise.”
Benjamin Franklin
Founding Father of the United States, Inventor, Author, Scientist
35. Quick Recap: Incident Primary Goals
Mitigate impact as quickly as possible (when able).
Determine root cause.
Identify action items to address root cause (permanently).
36. Alert Management System
At the core of a World-Class Incident Response Management pipeline is an Alert Management System. This system aggregates monitoring alerts centrally and routes them to the correct teams and personnel.
Alerts are always routed via phone / SMS. Email is not real-time, and too much noise exists in email.
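A sketch of the aggregate-then-route core, assuming hypothetical alert fields (`service`, `check`) and a hypothetical routing table; duplicate alerts fold into one open incident so the team is paged once:

```python
from collections import defaultdict

# Hypothetical in-memory store; a real alert manager persists this.
open_incidents: dict[tuple[str, str], list] = defaultdict(list)

def owning_team(service: str) -> str:
    # Hypothetical routing table; unknown services fall through.
    return {"telemetry": "cloud-operations"}.get(service, "engineering")

def page_team(team: str, key) -> None:
    print(f"phone/SMS page -> {team} for {key}")  # placeholder, never email

def ingest(alert) -> None:
    """Aggregate raw monitoring alerts into one incident per
    (service, check) pair, then page the owning team once."""
    key = (alert.service, alert.check)
    first_alert = not open_incidents[key]
    open_incidents[key].append(alert)
    if first_alert:
        page_team(owning_team(alert.service), key)
```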
37. Integrations
The alert management system should integrate with the tools your team is already familiar with, so engineers can work out their own flow for addressing incidents. Make it easy to accept and use, and people will adopt it.
38. Shield Teams
Engineering Shield Teams are an obvious extension to dual paging. They help engineers focus and avoid interrupt-driven work.
Feature teams work on the backlog of new feature development. Shield teams address bugs and interruptions for the feature team.
Shield Teams are a concept I learned from and experienced working at Microsoft, where they are used with many engineering teams.
39. Shield Teams
Shield Teams rotate at each iteration (sprint). This spreads the load, provides cross-training opportunities, and safeguards against incident fatigue.
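A toy illustration of per-sprint rotation (not from the deck, and not a scheduling tool); it simply walks the roster so everyone serves equally often:

```python
def shield_team(engineers: list[str], sprint_number: int, size: int = 2) -> list[str]:
    """Pick this sprint's shield team by rotating through the roster."""
    start = (sprint_number * size) % len(engineers)
    # Wrap around the roster so the duty spreads evenly across sprints.
    return [engineers[(start + i) % len(engineers)] for i in range(size)]

# Example: sprint 0 -> ['ana', 'ben']; sprint 1 -> ['carol', 'dev']
print(shield_team(["ana", "ben", "carol", "dev", "eli"], sprint_number=1))
```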
40. Bug Cap
Bug Cap is a concept I learned from Microsoft, and it is an amazing answer to addressing technical debt.
Bug Cap = Team Size × 4
The rule is simple: if the bug count exceeds the bug cap, stop working on new features until the bugs are resolved.
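The gate is easy to automate; a minimal sketch (where the open-bug count would come from your tracker):

```python
BUGS_PER_ENGINEER = 4  # the deck's multiplier: Bug Cap = Team Size x 4

def bug_cap(team_size: int) -> int:
    return team_size * BUGS_PER_ENGINEER

def feature_work_allowed(open_bugs: int, team_size: int) -> bool:
    """If the bug count exceeds the cap, new feature work stops
    until bugs are resolved."""
    return open_bugs <= bug_cap(team_size)

# Example: a team of 5 may carry at most 20 open bugs.
assert bug_cap(5) == 20
assert not feature_work_allowed(open_bugs=23, team_size=5)
```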
41. Bug Cap
Bug Cap violations should be tracked as a metric for each team and reviewed with management. The metric is also valuable in standup, retrospective, and planning discussions.
42. Error Rate Zero
[Two error-rate graphs, A and B.] Which is easier to monitor? What is the baseline for graph A? For B?
Low error rates create actionable monitoring and alerting.
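A sketch of why a zero baseline matters for alerting (hypothetical metric window):

```python
def should_page(errors_last_5m: int, baseline: int = 0) -> bool:
    """With a zero-error baseline, any error is signal, so the alert
    rule is trivially actionable: more than zero errors pages. With a
    noisy baseline, you are forced into fragile statistical thresholds."""
    return errors_last_5m > baseline
```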
43. Error Rate Zero
Don’t tolerate bugs… ever. The goal is to be able to treat them as incidents and eliminate them with the highest priority.
44. Questions?
Please connect with me on LinkedIn:
https://www.linkedin.com/in/keithbradsmith
Interested in a training or in partnering with Incident Ops?
Editor's Notes
Operations in Highly-Scalable Distributed Cloud Services in an Agile or DevOps culture / organization
Incident Definition
-Defining each of the characteristics of an incident
-Explain the differences between a bug and an incident
Incident Response Management
-A strongly defined and repeatable process for managing and responding to incidents
Root Cause Culture
-Discuss the importance of Root Cause analysis and where most companies fall short
Keeping Services Healthy
-This is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time
Three important takeaways here:
An incident is an event. It has a clear start and end.
Impact is measurable.
Customers (both internal and external) are the true measure of impact.
Bryan Sparks, CTO at Imagine Learning, described incidents as NMIs.
He gave me 100% discretion over EVERY Support, Operations, Dev, and PM resource in the company. At any given moment, I can tap any resource to help with an incident if I feel that person can help the incident be resolved more quickly.
I like this quote, often attributed to Peter Drucker… but I found no actual proof online that he said it.
The simple act of paying attention to something will cause you to make connections you never did before, and you'll improve in those areas - almost without any extra effort.
This process takes preparation and discipline, but once it is set up and generally accepted… it’s a breeze to use and extend.
Email is NOT a reliable tool for incident management. Email disrupts us all regularly throughout the day/night, and incidents need to break out as something more than just another email.
An incident MUST only notify via phone / SMS. Emails are ok for auditing, but are not a primary tool for on-call.
Tell the story of my sister’s back pain – treating the symptom and not the cause.
Example: Recycling the app pool daily instead of figuring out why the service crashes every once in a while. One is a mitigation, the other is root cause.
That’s right, I put all 3 names for this meeting in a single slide!
Bring the pain forward. Dev teams write better code when they are on the hook for fixing it in production.
We’ve NEVER had anyone cite on-call as a reason for leaving the company (yet…)
[DEMO] Show on-call rotations for Cloud Operations and Cloud Infrastructure (Have them pre-loaded)
Central to EVERY person in this chain is communication / collaboration. The first thing done in every incident is to combine efforts and start a VOICE discussion. Chat is used for tracking and updates, but is too slow for incident collaboration.
It’s easy to sit and do nothing 30 minutes into an incident. The team lead can drive individual accountability during incidents.
Root Cause Methodology:
Note that these are not numbered, and will regularly be addressed in different order.
These three items are the goals of every incident, and the driving force behind all activities within the incident lifecycle.
Every action should strive to reach one of these goals FASTER, with more precision, and be more COMPLETE.
These go in JIRA with special tags and are discussed in Post-Mortem.
Show Slack integration with OpsGenie
[DEMO]
Statuspage integration?
Interruptions are expensive.
Feature teams do not work on bugs or address incidents unless needed. Shield teams do this work during their assigned iteration (like being assigned to active duty) and only do feature work as able; nothing is assigned to them for that iteration.
Shield teams rotate.
Describe the Zen of Inbox Zero.
-Inbox items are a to-do list. Anything in the inbox is a task requiring follow-up within a set period, such as a day.