I gave this talk at PagerDuty's 2016 Summit. It's all about how to leverage incident data to improve your incident response culture and drive improvements in your service delivery.
17. #PDSummit16
3 Qualities of Highly Effective
Incident Management
• Measurement
• If it Moves….Graph It!
• Credit - Ian Malpass – Etsy
• https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
• Transparency
• System Health & Availability
• State of the Incident
• Collaboration
• Effective Cross Team Troubleshooting
• Effective Prevention Efforts
[Diagram: overlapping circles labeled Measurement, Transparency, Collaboration]
23. #PDSummit16
Alert or Incident?
Incidents
• Service Impacting
• Important to Detail & Understand
• Example: “The Site is Down”
Alerts
• Tactical & Explicit
• Important to Trend & Remediate
• Examples: “CPU > 99%”, “Disk Space @ 95%”
24. #PDSummit16
Measurement: Alert Analysis
• Trend Alert Totals Over Time (see the sketch after this list)
• Try to remove incident-related alerts
• Group by:
• Alert Types – CPU, Memory, Etc..
• Source Host – Common Themes
• Host Types – Database, App, Network, etc..
• Prioritize Time to Remediate
• Short Term & Long Term
• Manage Alert Fatigue
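A minimal sketch of the kind of alert trending described above, assuming alerts have been exported to a CSV with created_at, alert_type, source_host, and host_type columns; the file name and column names are assumptions, not a PagerDuty export format.

```python
import pandas as pd

# Hypothetical alert export; adjust the file and column names to your data.
alerts = pd.read_csv("alerts_export.csv", parse_dates=["created_at"])

# Weekly alert totals over time; spikes often point at a larger incident,
# so consider filtering out alerts already linked to an incident first.
weekly_totals = alerts.resample("W", on="created_at").size()
print(weekly_totals)

# Group by alert type (CPU, Memory, etc.) to pick remediation targets.
by_type = alerts.groupby("alert_type").size().sort_values(ascending=False)
print(by_type.head(10))

# Common themes by host type (database, app, network) and source host.
by_host = (
    alerts.groupby(["host_type", "source_host"])
    .size()
    .sort_values(ascending=False)
)
print(by_host.head(10))
```

The grouped counts are what feed the short-term vs. long-term remediation prioritization and help make alert fatigue visible.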
26. #PDSummit16
Measurement: Incident Rates & Cost
• Trend Incident Rates by Service
• Establishes Frequency & MTTR Trends
• Enables benchmarking (& Comparison)
• Enables forecasting to effectively plan time
• Establish Cost Metrics (a worked sketch follows this list)
• Recovery Efforts
• Capture the # of engineers involved in recovery efforts
• Capture the hours of engineering effort involved in recovery
• Customer Impact
• Correlate customer contacts to specific incidents
• Establish business metrics that can reflect customer impact
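A rough sketch of per-service incident frequency, MTTR, and recovery cost, assuming a hypothetical incident export with service, opened_at, resolved_at, and engineer_hours columns; the blended hourly rate is a made-up placeholder, not a real figure.

```python
import pandas as pd

# Hypothetical incident export; swap in your own column names and rate.
incidents = pd.read_csv(
    "incidents_export.csv", parse_dates=["opened_at", "resolved_at"]
)
incidents["ttr_hours"] = (
    incidents["resolved_at"] - incidents["opened_at"]
).dt.total_seconds() / 3600

BLENDED_HOURLY_RATE = 75  # assumed engineering cost per hour

per_service = incidents.groupby("service").agg(
    incident_count=("service", "size"),
    mttr_hours=("ttr_hours", "mean"),
    recovery_engineer_hours=("engineer_hours", "sum"),
)
per_service["recovery_cost"] = (
    per_service["recovery_engineer_hours"] * BLENDED_HOURLY_RATE
)
print(per_service.sort_values("recovery_cost", ascending=False))
```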
27. #PDSummit16
Measurement: Incident Cause & Recovery
• Analyze Cause with the Organization
• Potential Causes:
• Change Released – Reference Change Ticket
• Establish Objective Confidence Levels for Change (by Service) – see the sketch after this list
• Code/Infra/Bug Issues – Reference Bug Ticket
• Creates a Tangible Cost to Priority Discussion
• 3rd Party Service Dependency (Cloud, Monitoring, ISP, Etc…)
• Tangible Business Impact
• Recovery
• Corrective Action
• Monitoring Effectiveness
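One way to put a number on “objective confidence levels for change” is the fraction of changes per service that did not cause an incident; this small sketch uses made-up figures and is an assumption about the calculation, not a prescribed formula.

```python
# Made-up numbers, purely for illustration.
changes_by_service = {"orders-api": 120, "billing": 45, "search": 80}
incident_causing_changes = {"orders-api": 3, "billing": 9, "search": 2}

for service, total_changes in changes_by_service.items():
    bad_changes = incident_causing_changes.get(service, 0)
    confidence = 1 - bad_changes / total_changes
    print(f"{service}: change confidence {confidence:.1%} "
          f"({bad_changes}/{total_changes} changes caused incidents)")
```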
29. #PDSummit16
Transparency: Current State & Historical
• Current State of Services & Incidents:
• Maintain a Service Status Page (Internal & External) – see the sketch after this list
• Service Status – Outage, Degraded, etc….
• Incident Dashboard
• Severity
• Establishes Urgency Expectations
• Referenceable History
• Simplify Searching History
• Link Recovery Documentation to past Incidents
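A hypothetical shape for the data behind a status page and incident dashboard; the field names, severity scale, and URL are assumptions, shown only to make the bullets above concrete.

```python
import json

# Hypothetical records behind an internal status page / incident dashboard.
service_status = [
    {"service": "orders-api", "status": "degraded", "open_incident": "INC-1042"},
    {"service": "billing", "status": "operational", "open_incident": None},
]

incident = {
    "id": "INC-1042",
    "severity": "SEV-2",  # severity communicates urgency expectations
    "status": "investigating",
    # Link recovery documentation so past incidents stay referenceable.
    "runbook": "https://wiki.example.com/runbooks/orders-api",
    "timeline": [
        "2016-09-14T09:02Z detected",
        "2016-09-14T09:10Z responders paged",
    ],
}

print(json.dumps({"services": service_status, "incidents": [incident]}, indent=2))
```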
30. #PDSummit16
Collaboration:
• Transparency to data has a cultural influence
• Fix it Together
• Inquisitive Troubleshooting
• Fix it Long Term
• Team recognizes impact and create empathy
• Product Team Engagement
• Objective data on product performance
31. #PDSummit16
Measurement + Transparency + Collaboration
• Incident Response & Recovery Times Decrease
• Incident Frequency Decreases
• Incident Recovery Cost Decreases
• Engineering Output Increases
• Decision Making Abilities Improve
• Team Morale Improves
• And Most Importantly….
• Happy Confident Customers
32. #PDSummit16
Tips for Getting There
• Measure stuff
• Be transparent with your metrics
• Don’t try to do it all at once
• Don’t make your Incident process bulky
• Consistent Ceremonies
We’ll go through a generic example of what commonly happens in an organization that doesn’t do Incident Management well.
Once upon a time…. The site was down
Good thing executives noticed before anyone else. Pretty sure she’s just sitting in there with her door closed clicking refresh all day…..
Site outages are always the database! (please note in the middle that the Network person left the room immediately)
That ended up being timely, as the DB team assumed it was the Network’s issue
Now we start to see customer contacts coming in as a result of this issue
We have now entered the phase of the incident I refer to as the “Buckshot” phase: everyone communicating poorly to everyone else, with none of it contributing to resolving the issue
Turn it off and turn it on again, brilliant!
Poke that database performance as a root cause one more time….
If you’re looking for me you can find me with my face in my palm at my desk….
Congratulations, you just contributed to Global Warming in your own way… Someone get me some marshmallows!
You just made your customer do this, or worse yet, they spent the time they were waiting for your site to come back online looking for a new service provider to replace you.
But it’s not just about the outage response here. There is so much more to Incident Management than response and resolution.
OK, I’m just having fun with the story here, but this is a terrible experience for customers and employees. Nobody talented wants to work in an environment like this and nobody wants to do business with a company that operates like this.
At SPS Commerce we’ve fostered a great culture around Incident Management
Microservices architecture & hybrid cloud, leveraging PagerDuty as a notification hub as well as an event emitter to help us start to trigger automated responses.
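As a rough illustration of using PagerDuty as an event emitter, here is a minimal sketch that triggers an event through the PagerDuty Events API v2; the routing key and payload values are placeholders, and this is not necessarily how SPS Commerce wired it up.

```python
import requests

event = {
    "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
    "event_action": "trigger",
    "payload": {
        "summary": "Checkout error rate above threshold",
        "source": "checkout-app-01",
        "severity": "critical",
    },
}

# Events API v2 endpoint; downstream webhooks or automation can react to the alert.
resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
resp.raise_for_status()
print(resp.json())
```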
There is so much data out there, where do we start?
If you’re not already looking at some of this data I definitely recommend starting with the basics. And the PD UI does a great job of getting you there quickly.
When you analyze your data, what you will typically find is that “spikes” in alert counts & MTTR are indicative of a much larger issue, one that likely had a more significant impact on your service performance than, say, a single disk space issue.
PagerDuty recently released to Beta some work they’re doing on event processing and analysis. I think this is the beginning of something very interesting. This beta is a great way to visualize the concept of an alert versus an incident.
You can see here (first animation) that we’re looking at a large group and volume of alerts compared to normal. I would likely assess this as an incident, probably the result of a shared dependency having performance issues. In this 2nd example you can see just a few small alert groupings, which are probably more indicative of isolated issues like CPU or disk space.
This is something PD’s new OCC is going to be awesome for!
What we have done
Alert rates
Got us thinking more about “Incidents” v. “Alerts”
Alerts are by nature more isolated, small-scale issues that are important to respond to, as well as to remediate at a larger scale to reduce go-forward risk
Incidents typically impact your customers, or risk doing so, and will absolutely cost you valuable engineering time; it’s critical you leverage that to motivate a healthy short- and long-term response process.
There are so many dimensions to an Incident, It’s important to start small and iterate.
Recovery Efforts – hours and headcounts need not be perfect; give it a good swag.
Establishing business metrics – Canary Testing
Change helps you shape your confidence level with true numbers
Code/Bug – Tangible priority discussion around cost of Time, Customer UX, and frequency of recurrence.
Recovery – If you have a service that is regularly corrected by restarting it, maybe it’s time to automate that?
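A toy sketch of that kind of automation: a tiny webhook receiver that restarts a whitelisted service when an alert webhook arrives. The payload shape, service name, and port are assumptions; a real setup needs authentication and guard rails before anything like this runs in production.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

RESTARTABLE = {"checkout-app"}  # only services known to be safe to bounce

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length) or b"{}")
        service = alert.get("service", "")
        if service in RESTARTABLE:
            # Restart the unit; in production add auth, rate limits, and logging.
            subprocess.run(["systemctl", "restart", service], check=False)
            self.send_response(200)
        else:
            self.send_response(400)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), AlertHandler).serve_forever()
```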
Visible Status Page – the current state of service health should always be visible in a shared location (not in an email thread; that’s exclusive, and we need inclusive communication during an incident)
Referenceable – props to PD road map on Event efforts.
Decision Making Improves – you make better decisions as a business