Drawing on principles learned over many years at several companies, including Microsoft, this presentation describes the process of creating a strongly defined, repeatable Incident Response Management pipeline. Its goal is to increase companies' ability to maintain healthy cloud services throughout the entire application lifecycle. It covers how companies should identify, respond to, and manage incidents; on-call procedures; and organizational practices that reduce incident fatigue and keep services consistently reliable and available.
2. Introduction
As a Cloud Site Reliability and Distributed Service Engineer at Microsoft, Keith Smith has worked on highly available distributed cloud telemetry pipeline operations at massive scale for Xbox and Windows.
Keith manages all AWS / Azure Cloud Operations at Imagine Learning and has helped the company move to agile Incident Response Management by facilitating a culture of communication and collaboration between support, development, and operations.
He enjoys spending time with his family, rock climbing, and biking.
He is the founder of Incident Ops, a Microsoft Azure Partner specializing in Site Reliability, Cloud Architecture, and Incident Response.
5. “An incident is defined as an event that has a measurable impact on the customer experience.”
Keith Smith
6. Incident Introduction
There are two major measurements when it comes to service health:
Mean Time Between Failures – MTBF
Mean Time to Resolve Incidents – MTTR
MTTR can be further broken down:
Time to detect incident (MTTD)
Time to engage (acknowledge) incident (MTTA)
Time to mitigate incident impact (MTTM)
Time to resolve incident
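A minimal sketch of how this breakdown falls out of incident timestamps (the field names here are hypothetical; real incident-management tools record equivalents under different names):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # Hypothetical timestamp fields marking each lifecycle stage.
    began: datetime         # impact starts
    detected: datetime      # monitoring catches it
    acknowledged: datetime  # on-call engages
    mitigated: datetime     # temporary fix in place
    resolved: datetime      # root cause addressed

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

def mttr_breakdown(incidents: list[Incident]) -> dict[str, timedelta]:
    return {
        "MTTD": mean([i.detected - i.began for i in incidents]),
        "MTTA": mean([i.acknowledged - i.detected for i in incidents]),
        "MTTM": mean([i.mitigated - i.acknowledged for i in incidents]),
        "MTTR": mean([i.resolved - i.began for i in incidents]),
    }
```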
8. How do you Prioritize an Incident?
“A non-maskable interrupt (NMI) is a computer processor interrupt that cannot be ignored by standard interrupt masking techniques in the system. It is typically used to signal attention for non-recoverable hardware errors.”
Answers.com, emphasis added – http://www.answers.com/Q/What_is_non_maskable_interrupt_interrupt
“An incident is a development and operations interrupt that cannot be ignored by standard feature development. It is typically used to signal attention for non-recoverable issues that cause customer impact.”
Inspired by Bryan Sparks, CTO – Imagine Learning Inc., August 2015
13. Incident Lifecycle
Incident Begins → Source catches incident and alerts to on-call → Incident Acknowledged by On-Call.
Monitoring catches the incident and routes the alert to the incident management system and on-call individuals. (MTTA/MTTD)
14. Incident Lifecycle
… → Investigation Begins → Severity Assessed → Investigation Ongoing.
On-call begins investigating. Impact is assessed and updates to the company status page are made as needed.
15. Incident Lifecycle
… → Escalate to Ops/Dev SME (additional help required to determine cause and mitigate incident) → Escalate to TDO (incident severe or lengthy, >30 minutes).
Subject Matter Experts (Service Owners) are escalated to as needed. For high-impact incidents, the Technical Duty Officer (Dev and/or Operations Manager) is looped in to coordinate team activities.
16. Incident Lifecycle
… → Incident Impact Mitigated (Temporary Fix): temporary workaround implemented. (MTTM)
Most companies stop here.
17. “For every effect there is a root cause. Find and address the root cause rather than try to fix the effect, as there is no end to the latter.”
Celestine Chua
Writer, life coach, and founder of Personal Excellence
18. Incident Resolution
Two criteria are required for an incident to be resolved (closed):
Impact has been mitigated.
Root cause of the issue has been identified.
Work items to address the root cause are completed and released to production immediately when possible. At times additional long-term work is required to address the root cause; in this case the work item is logged as a Bug and the Shield team works on the fix (described in more detail later).
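A minimal sketch of enforcing these two gates in tooling; `incident` and `tracker` are hypothetical objects standing in for your incident-management and work-tracking systems:

```python
def close_incident(incident, tracker) -> None:
    """Close an incident only when both resolution criteria hold."""
    if not incident.impact_mitigated:
        raise ValueError("Impact not yet mitigated")
    if incident.root_cause is None:
        raise ValueError("Root cause not yet identified")
    if incident.needs_long_term_fix:
        # Long-term root-cause work is logged as a Bug for the Shield team.
        tracker.create_bug(
            title=f"Root cause fix: {incident.title}",
            root_cause=incident.root_cause,
            team="shield",
        )
    incident.close()
```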
19. Creating a Root Cause Culture
Don’t stop until the incident is resolved. This is an expectation, and it won’t always be popular.
Make root cause part of your Acceptance Criteria.
Record the root cause of issues in work-tracking software (JIRA, VSTS, etc.) for incident work items.
Post-Mortem discussion is mandatory for incident participants.
20. Incident Lifecycle
… → Incident Impact Mitigated (Temporary Fix): temporary workaround implemented. (MTTM)
Most companies stop here. Don’t stop here!
21. Incident Lifecycle
… → Root Cause Determined: cause determined but not mitigated.
Finding root cause is the single most important step in the Incident Lifecycle.
22. Incident Lifecycle
… → Incident Resolved: permanent fix implemented.
Root cause has been addressed and the incident is truly resolved at this point.
23. Incident Lifecycle
… → Post-Mortem Discussion (Retrospective) → Repair Items Identified.
A review of past incidents is performed at regular intervals (weekly, monthly, etc.).
24. Post-Mortem Discussion
The Post-Mortem Retrospective is a team gathering where no blame is tolerated. It’s a great opportunity to learn and grow from each other’s experiences and to take time to reflect on the current strengths and weaknesses in company services.
Livesite Review agenda:
1. Discuss actions taken to address incidents.
2. What we could have done better during the incident.
3. Review work items required to ensure incidents do not happen again.
4. Suggest other things we can do to continually improve our services.
26. “Pain sure does bring out the best in people, doesn’t it?”
Bob Dylan
Singer, Songwriter, Painter, Writer, and Nobel Prize Laureate
27. Dual-Paging
Live site issues generally fall into two categories:
Infrastructure issues.
Code issues.
The goal is the same for both: reduce MTTR by resolving issues as quickly as possible. But we don’t know which category an issue falls into when an incident starts.
28. On-Call Procedure
Cloud Alert (dual page to Operations and Engineering teams):
Operations Team: Cloud Operations Primary Rotation, backed by Cloud Operations Secondary Rotation.
Engineering Team: Engineering Team Primary Rotation, backed by Engineering Team Secondary Rotation.
An incident alert triggers a phone call / SMS message to both the Operations and Engineering teams. A secondary is always available should the primary on-call be unavailable.
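A minimal sketch of the dual-page fan-out with secondary fallback. The rotation data and the `is_acked` callback are hypothetical; real alert managers (OpsGenie, PagerDuty, etc.) express this declaratively as schedules and escalation policies:

```python
import time

# Hypothetical rotations; each entry is (primary, secondary) for the team.
ROTATIONS = {
    "operations": ("ops-primary", "ops-secondary"),
    "engineering": ("eng-primary", "eng-secondary"),
}
ACK_TIMEOUT_SECONDS = 300  # escalate after 5 unacknowledged minutes

def page(person: str, incident_id: str) -> None:
    print(f"phone/SMS page -> {person} for {incident_id}")  # placeholder

def dual_page(incident_id: str, is_acked) -> None:
    """Page both teams at once (we don't yet know whether the issue is
    infrastructure or code), then fall back to each team's secondary if
    the primary has not acknowledged in time."""
    for team, (primary, _secondary) in ROTATIONS.items():
        page(primary, incident_id)
    time.sleep(ACK_TIMEOUT_SECONDS)
    for team, (_primary, secondary) in ROTATIONS.items():
        if not is_acked(team, incident_id):
            page(secondary, incident_id)
```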
29. On-Call Procedure
Both on-call primaries initiate a conference bridge. All active on-call personnel join the voice bridge using Skype, Slack, or an equivalent tool to coordinate the incident investigation.
30. On-Call Procedure
Sometimes a little extra help is needed. Service Subject Matter Experts (SMEs, Engineering and/or Operations) may be called to join the conference bridge.
31. On-Call Procedure
Lengthy / severe issues are escalated to the Operations and Engineering team leads to assist in coordinating the incident.
32. Incident Fatigue
An important side note: incidents are urgent and stressful. Avoid creating unnecessary incidents whenever possible.
Every alert should be actionable. If an alert isn’t actionable 100% of the time, the monitoring needs to be adjusted as an incident action item, or the alert should only send notification emails (not create incidents).
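A minimal sketch of that routing rule; the `actionable` flag and the `pager`/`mailer` objects are hypothetical stand-ins for your monitoring configuration and integrations:

```python
def route_alert(alert, pager, mailer) -> None:
    """Page only for actionable alerts; demote the rest to email."""
    if alert.actionable:
        pager.create_incident(alert)  # phone / SMS page to on-call
    else:
        mailer.send(alert)            # informational only; never pages
```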
34. “Early to bed and early to rise, makes a man healthy, wealthy, and wise.”
Benjamin Franklin
Founding Father of the United States, Inventor, Author, Scientist
35. Quick Recap: Incident Primary Goals
Mitigate impact as quickly as possible (when able).
Determine root cause.
Identify action items to address root cause (permanently).
36. Alert Management System
At the core of a World-Class Incident Response Management pipeline is an Alert Management System. This system aggregates monitoring alerts centrally and routes them to the correct teams and personnel.
Alerts are always routed via phone / SMS. Email is not real-time, and too much noise exists in email.
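A sketch of the aggregate-then-route core, assuming hypothetical alert fields (`service`, `check`) and a hypothetical routing table; duplicate alerts fold into one open incident so the team is paged once:

```python
from collections import defaultdict

# Hypothetical in-memory store; a real alert manager persists this.
open_incidents: dict[tuple[str, str], list] = defaultdict(list)

def owning_team(service: str) -> str:
    # Hypothetical routing table; unknown services fall through.
    return {"telemetry": "cloud-operations"}.get(service, "engineering")

def page_team(team: str, key) -> None:
    print(f"phone/SMS page -> {team} for {key}")  # placeholder, never email

def ingest(alert) -> None:
    """Aggregate raw monitoring alerts into one incident per
    (service, check) pair, then page the owning team once."""
    key = (alert.service, alert.check)
    first_alert = not open_incidents[key]
    open_incidents[key].append(alert)
    if first_alert:
        page_team(owning_team(alert.service), key)
```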
37. Integrations
The alert management system should integrate with the tools your team is already familiar with, so engineers can work out their own flow for addressing incidents. Make it easy to accept and use, and people will adopt it.
38. Shield Teams
Engineering Shield Teams are an obvious extension to dual paging. They help engineers focus and avoid interrupt-driven work.
Feature teams work on the backlog of new feature development. Shield teams address bugs and interruptions for the feature team.
Shield Teams are a concept I learned from and experienced working at Microsoft, where they are used with many engineering teams.
39. Shield Teams
Shield Teams rotate at each iteration (sprint). This spreads the load, provides cross-training opportunities, and safeguards against incident fatigue.
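A toy illustration of per-sprint rotation (not from the deck, and not a scheduling tool); it simply walks the roster so everyone serves equally often:

```python
def shield_team(engineers: list[str], sprint_number: int, size: int = 2) -> list[str]:
    """Pick this sprint's shield team by rotating through the roster."""
    start = (sprint_number * size) % len(engineers)
    # Wrap around the roster so the duty spreads evenly across sprints.
    return [engineers[(start + i) % len(engineers)] for i in range(size)]

# Example: sprint 0 -> ['ana', 'ben']; sprint 1 -> ['carol', 'dev']
print(shield_team(["ana", "ben", "carol", "dev", "eli"], sprint_number=1))
```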
40. Bug Cap
Bug Cap is a concept I learned from Microsoft, and it is an amazing answer to addressing technical debt.
Bug Cap = Team Size × 4
The rule is simple: if the bug count exceeds the bug cap, stop working on new features until the bugs are resolved.
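The gate is easy to automate; a minimal sketch (where the open-bug count would come from your tracker):

```python
BUGS_PER_ENGINEER = 4  # the deck's multiplier: Bug Cap = Team Size x 4

def bug_cap(team_size: int) -> int:
    return team_size * BUGS_PER_ENGINEER

def feature_work_allowed(open_bugs: int, team_size: int) -> bool:
    """If the bug count exceeds the cap, new feature work stops
    until bugs are resolved."""
    return open_bugs <= bug_cap(team_size)

# Example: a team of 5 may carry at most 20 open bugs.
assert bug_cap(5) == 20
assert not feature_work_allowed(open_bugs=23, team_size=5)
```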
41. Bug Cap
Bug Cap violations should be tracked as a metric for each team and reviewed with management. The metric is also valuable in standup, retrospective, and planning discussions.
42. Error Rate Zero
[Two error-rate graphs, A and B.] Which is easier to monitor? What is the baseline for graph A? For B?
Low error rates create actionable monitoring and alerting.
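A sketch of why a zero baseline matters for alerting (hypothetical metric window):

```python
def should_page(errors_last_5m: int, baseline: int = 0) -> bool:
    """With a zero-error baseline, any error is signal, so the alert
    rule is trivially actionable: more than zero errors pages. With a
    noisy baseline, you are forced into fragile statistical thresholds."""
    return errors_last_5m > baseline
```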
43. Error Rate Zero
Don’t tolerate bugs… ever. The goal is to be able to treat them as incidents and eliminate them with the highest priority.
44. Questions?
Please connect with me on LinkedIn:
https://www.linkedin.com/in/keithbradsmith
Interested in a training or in partnering with Incident Ops?
Editor's Notes
Operations in Highly-Scalable Distributed Cloud Services in an Agile or DevOps culture / organization
Incident Definition
-Defining each of the characteristics of an incident
-Explain the differences between a bug and an incident
Incident Response Management
-A strongly defined and repeatable process for managing and responding to incidents
Root Cause Culture
-Discuss the importance of Root Cause analysis and where most companies fall short
Keeping Services Healthy
-This is me sharing a few secrets of success in managing technical debt and iteratively improving services all the time
Three important takeaways here:
An incident is an event. It has a clear start and end.
Impact is measurable.
Customers (both internal and external) are the true measure of impact.
Bryan Sparks, CTO at Imagine Learning, described incidents as NMIs.
He gave me 100% discretion over EVERY Support, Operations, Dev, and PM resource in the company. At any given moment, I can tap any resource to help with an incident if I feel that person can help the incident be resolved more quickly.
I like this quote, often attributed to Peter Drucker… but I found no actual proof online that he said it.
The simple act of paying attention to something will cause you to make connections you never did before, and you'll improve in those areas - almost without any extra effort.
This process takes preparation and discipline, but once it is set up and generally accepted… it’s a breeze to use and extend.
Email is NOT a reliable tool for incident management. Email disrupts us all regularly throughout the day/night, and incidents need to break out as something more than just another email.
An incident MUST only notify via phone / SMS. Emails are ok for auditing, but are not a primary tool for on-call.
Tell the story of my sister’s back pain – treating the symptom and not the cause.
Example: Recycling the app pool daily instead of figuring out why the service crashes every once in a while. One is a mitigation, the other is root cause.
That’s right, I put all 3 names for this meeting in a single slide!
Bring the pain forward. Dev teams write better code when they are on the hook for fixing it in production.
We’ve NEVER had anyone cite on-call as a reason for leaving the company (yet…)
[DEMO] Show on-call rotations for Cloud Operations and Cloud Infrastructure (Have them pre-loaded)
Central to EVERY person in this chain is communication / collaboration. The first thing done in every incident is to combine efforts and start a VOICE discussion. Chat is used for tracking and updates, but is too slow for incident collaboration.
It’s easy to sit and do nothing 30 minutes into an incident. The team lead can drive individual accountability during incidents.
Root Cause Methodology:
Note that these are not numbered, and will regularly be addressed in different order.
These three items are the goals of every incident, and the driving force behind all activities within the incident lifecycle.
Every action should strive to reach one of these goals FASTER, with more precision, and be more COMPLETE.
These go in JIRA with special tags and are discussed in Post-Mortem.
Show Slack integration with OpsGenie
[DEMO]
Statuspage integration?
Interruptions are expensive.
Feature teams do not work on bugs or address incidents unless needed. Shield teams do this work during their assigned iteration (like being assigned to active duty) and only do feature work as able; nothing is assigned to them for that iteration.
Shield teams rotate.
Describe the Zen of Inbox Zero.
-Inbox items are a to-do list. Anything in the inbox is a task requiring follow-up within a set period, such as a day.