I gave this talk at PagerDuty's 2016 Summit. It's all about how to leverage incident data to improve your incident response culture and drive improvements in your service delivery.
17. #PDSummit16
3 Qualities of Highly Effective
Incident Management
• Measurement
• If it Moves….Graph It!
• Credit - Ian Malpass – Etsy
• https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
• Transparency
• System Health & Availability
• State of the Incident
• Collaboration
• Effective Cross Team Troubleshooting
• Effective Prevention Efforts
[Diagram: overlapping circles labeled Measurement, Transparency, Collaboration]
23. #PDSummit16
Alert or Incident?
Incidents
• Service Impacting
• Important to Detail & Understand
• Example: “The Site is Down”
Alerts
• Tactical & Explicit
• Important to Trend & Remediate
• Examples: “CPU > 99%”, “Disk Space @ 95%”
24. #PDSummit16
Measurement: Alert Analysis
• Trend Alert Totals Over Time (see the sketch after this list)
• Try to remove incident-related alerts
• Group by:
• Alert Types – CPU, Memory, Etc..
• Source Host – Common Themes
• Host Types – Database, App, Network, etc..
• Prioritize Time to Remediate
• Short Term & Long Term
• Manage Alert Fatigue
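A minimal sketch of the kind of alert trending described above, assuming alerts have been exported to a CSV with created_at, alert_type, source_host, and host_type columns; the file name and column names are assumptions, not a PagerDuty export format.

```python
import pandas as pd

# Hypothetical alert export; adjust the file and column names to your data.
alerts = pd.read_csv("alerts_export.csv", parse_dates=["created_at"])

# Weekly alert totals over time; spikes often point at a larger incident,
# so consider filtering out alerts already linked to an incident first.
weekly_totals = alerts.resample("W", on="created_at").size()
print(weekly_totals)

# Group by alert type (CPU, Memory, etc.) to pick remediation targets.
by_type = alerts.groupby("alert_type").size().sort_values(ascending=False)
print(by_type.head(10))

# Common themes by host type (database, app, network) and source host.
by_host = (
    alerts.groupby(["host_type", "source_host"])
    .size()
    .sort_values(ascending=False)
)
print(by_host.head(10))
```

The grouped counts are what feed the short-term vs. long-term remediation prioritization and help make alert fatigue visible.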
26. #PDSummit16
Measurement: Incident Rates & Cost
• Trend Incident Rates by Service
• Establishes Frequency & MTTR Trends
• Enables benchmarking (& Comparison)
• Enables forecasting to effectively plan time
• Establish Cost Metrics (a worked sketch follows this list)
• Recovery Efforts
• Capture the # of engineers involved in recovery efforts
• Capture the hours of engineering effort involved in recovery
• Customer Impact
• Correlate customer contacts to specific incidents
• Establish business metrics that can reflect customer impact
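A rough sketch of per-service incident frequency, MTTR, and recovery cost, assuming a hypothetical incident export with service, opened_at, resolved_at, and engineer_hours columns; the blended hourly rate is a made-up placeholder, not a real figure.

```python
import pandas as pd

# Hypothetical incident export; swap in your own column names and rate.
incidents = pd.read_csv(
    "incidents_export.csv", parse_dates=["opened_at", "resolved_at"]
)
incidents["ttr_hours"] = (
    incidents["resolved_at"] - incidents["opened_at"]
).dt.total_seconds() / 3600

BLENDED_HOURLY_RATE = 75  # assumed engineering cost per hour

per_service = incidents.groupby("service").agg(
    incident_count=("service", "size"),
    mttr_hours=("ttr_hours", "mean"),
    recovery_engineer_hours=("engineer_hours", "sum"),
)
per_service["recovery_cost"] = (
    per_service["recovery_engineer_hours"] * BLENDED_HOURLY_RATE
)
print(per_service.sort_values("recovery_cost", ascending=False))
```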
27. #PDSummit16
Measurement: Incident Cause & Recovery
• Analyze Cause with the Organization
• Potential Causes:
• Change Released – Reference Change Ticket
• Establish Objective Confidence Levels for Change (by Service) – see the sketch after this list
• Code/Infra/Bug Issues – Reference Bug Ticket
• Creates a Tangible Cost to Priority Discussion
• 3rd Party Service Dependency (Cloud, Monitoring, ISP, Etc…)
• Tangible Business Impact
• Recovery
• Corrective Action
• Monitoring Effectiveness
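One way to put a number on “objective confidence levels for change” is the fraction of changes per service that did not cause an incident; this small sketch uses made-up figures and is an assumption about the calculation, not a prescribed formula.

```python
# Made-up numbers, purely for illustration.
changes_by_service = {"orders-api": 120, "billing": 45, "search": 80}
incident_causing_changes = {"orders-api": 3, "billing": 9, "search": 2}

for service, total_changes in changes_by_service.items():
    bad_changes = incident_causing_changes.get(service, 0)
    confidence = 1 - bad_changes / total_changes
    print(f"{service}: change confidence {confidence:.1%} "
          f"({bad_changes}/{total_changes} changes caused incidents)")
```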
29. #PDSummit16
Transparency: Current State & Historical
• Current State of Services & Incidents:
• Maintain a Service Status Page (Internal & External) – see the sketch after this list
• Service Status – Outage, Degraded, etc….
• Incident Dashboard
• Severity
• Establishes Urgency Expectations
• Referenceable History
• Simplify Searching History
• Link Recovery Documentation to past Incidents
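A hypothetical shape for the data behind a status page and incident dashboard; the field names, severity scale, and URL are assumptions, shown only to make the bullets above concrete.

```python
import json

# Hypothetical records behind an internal status page / incident dashboard.
service_status = [
    {"service": "orders-api", "status": "degraded", "open_incident": "INC-1042"},
    {"service": "billing", "status": "operational", "open_incident": None},
]

incident = {
    "id": "INC-1042",
    "severity": "SEV-2",  # severity communicates urgency expectations
    "status": "investigating",
    # Link recovery documentation so past incidents stay referenceable.
    "runbook": "https://wiki.example.com/runbooks/orders-api",
    "timeline": [
        "2016-09-14T09:02Z detected",
        "2016-09-14T09:10Z responders paged",
    ],
}

print(json.dumps({"services": service_status, "incidents": [incident]}, indent=2))
```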
30. #PDSummit16
Collaboration:
• Transparency to data has a cultural influence
• Fix it Together
• Inquisitive Troubleshooting
• Fix it Long Term
• Team recognizes impact and create empathy
• Product Team Engagement
• Objective data on product performance
31. #PDSummit16
Measurement + Transparency + Collaboration
• Incident Response & Recovery Times Decrease
• Incident Frequency Decreases
• Incident Recovery Cost Decreases
• Engineering Output Increases
• Decision Making Abilities Improve
• Team Morale Improves
• And Most Importantly….
• Happy Confident Customers
32. #PDSummit16
Tips for Getting There
• Measure stuff
• Be transparent with your metrics
• Don’t try to do it all at once
• Don’t make your Incident process bulky
• Consistent Ceremonies
We’ll go through a generic example of what commonly happens in an organization that doesn’t do Incident Management well.
Once upon a time…. The site was down
Good thing executives noticed before anyone else. Pretty sure she’s just sitting in there with her door closed clicking refresh all day…..
Site outages are always the database! (please note in the middle that the Network person left the room immediately)
That ended up being timely, as the DB team assumed it was the Network’s issue
Now we start to see customer contacts coming in as a result of this issue
We have now entered the phase of the incident I refer to as the “Buckshot” phase: everyone communicating poorly to everyone else, with none of it contributing to resolving the issue
Turn it off and turn it on again, brilliant!
Poke that database performance as a root cause one more time….
If you’re looking for me you can find me with my face in my palm at my desk….
Congratulations, you just contributed to Global Warming in your own way… Someone get me some marshmallows!
You just made your customer do this, or worse yet, they spent the time they were waiting for your site to come back online looking for a new service provider to replace you.
But it’s not just about the outage response here. There is so much more to Incident Management than response and resolution.
OK, I’m just having fun with the story here, but this is a terrible experience for customers and employees. Nobody talented wants to work in an environment like this and nobody wants to do business with a company that operates like this.
At SPS Commerce we’ve fostered a great culture around Incident Management
Microservices architecture & hybrid cloud, leveraging PagerDuty as a notification hub as well as an event emitter to help us start to trigger automated responses.
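As a rough illustration of using PagerDuty as an event emitter, here is a minimal sketch that triggers an event through the PagerDuty Events API v2; the routing key and payload values are placeholders, and this is not necessarily how SPS Commerce wired it up.

```python
import requests

event = {
    "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
    "event_action": "trigger",
    "payload": {
        "summary": "Checkout error rate above threshold",
        "source": "checkout-app-01",
        "severity": "critical",
    },
}

# Events API v2 endpoint; downstream webhooks or automation can react to the alert.
resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
resp.raise_for_status()
print(resp.json())
```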
There is so much data out there, where do we start?
If you’re not already looking at some of this data I definitely recommend starting with the basics. And the PD UI does a great job of getting you there quickly.
When you analyze your data, what you will typically find is that “spikes” in alert counts & MTTR are indicative of a much larger issue, one that likely had a more significant impact on your service performance than, say, a single disk space issue.
PagerDuty recently released to Beta some work they’re doing on event processing and analysis. I think this is the beginning of something very interesting. This beta is a great way to visualize the concept of an alert versus an incident.
You can see here (first animation) that we’re looking at a large group and volume of alerts compared to normal. I would likely assess this as an incident, probably the result of a shared dependency having performance issues. In this 2nd example you can see just a few small alert groupings, which are probably more indicative of isolated issues like CPU or disk space.
This is something PD’s new OCC is going to be awesome for!
What we have done
Alert rates
Got us thinking more about “Incidents” v. “Alerts”
Alerts are by nature more isolated, small-scale issues that are important to respond to, as well as to remediate at a larger scale to reduce go-forward risk
Incidents typically impact your customers, or risk doing so, and will absolutely cost you valuable engineering time; it’s critical you leverage that to motivate a healthy short- and long-term response process.
There are so many dimensions to an Incident, It’s important to start small and iterate.
Recovery Efforts – hours and headcounts need not be perfect; give it a good swag.
Establishing business metrics – Canary Testing
Change helps you shape your confidence level with true numbers
Code/Bug – Tangible priority discussion around cost of Time, Customer UX, and frequency of recurrence.
Recovery – If you have a service that is regularly corrected by restarting it, maybe it’s time to automate that?
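A toy sketch of that kind of automation: a tiny webhook receiver that restarts a whitelisted service when an alert webhook arrives. The payload shape, service name, and port are assumptions; a real setup needs authentication and guard rails before anything like this runs in production.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

RESTARTABLE = {"checkout-app"}  # only services known to be safe to bounce

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length) or b"{}")
        service = alert.get("service", "")
        if service in RESTARTABLE:
            # Restart the unit; in production add auth, rate limits, and logging.
            subprocess.run(["systemctl", "restart", service], check=False)
            self.send_response(200)
        else:
            self.send_response(400)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), AlertHandler).serve_forever()
```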
Visible Status Page – the current state of service health should always be visible in a shared location (not in an email thread; that’s exclusive, and we need inclusive communication during an incident)
Referenceable – props to PD road map on Event efforts.
Decision Making Improves – you make better decisions as a business