#PDSummit16
Using Incident Data to Build Better Internal Processes
Andy Domeier
Director – System Operations
SPS Commerce
Twitter: @ajdomie
Agenda
• Incident Story Time
• SPS Commerce
• 3 Qualities of Effective Incident Management
• Tips for Getting There
Incident Story Time
A Tale of an Unhealthy Incident Culture
Congratulations
Most Importantly
Or Worse…
It’s not just about the outage response…
Has it happened before?
Will it happen again?
Why did this happen?
• Supply Chain Communications Network
• Connecting over 60,000 Trading Partners Globally
• Services:
• Fulfillment
• Integration
• Item Management
• Analytics
[Diagram labels: Core, UX, Logging, APM; SysOps (Level 1, India); Engineer (On-Call); MGMT (On-Call); ChatOps; Automation]
3 Qualities of Highly Effective Incident Management
• Measurement
  • If it Moves… Graph It!
  • Credit - Ian Malpass – Etsy
  • https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
• Transparency
  • System Health & Availability
  • State of the Incident
• Collaboration
  • Effective Cross Team Troubleshooting
  • Effective Prevention Efforts
Measurement: Where to start?
Measurement: Start with the Basics
• Basics:
  • Total Counts
  • MTTR
  • Escalations
• Group by:
  • Service
  • Team
  • Severity
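The basics above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the record fields (`service`, `severity`, `opened`, `resolved`, `escalated`) and the sample data are assumptions, not a PagerDuty export schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"service": "fulfillment", "severity": "sev1",
     "opened": "2016-09-01T10:00:00", "resolved": "2016-09-01T11:30:00", "escalated": True},
    {"service": "fulfillment", "severity": "sev2",
     "opened": "2016-09-02T14:00:00", "resolved": "2016-09-02T14:30:00", "escalated": False},
    {"service": "analytics", "severity": "sev2",
     "opened": "2016-09-03T09:00:00", "resolved": "2016-09-03T09:45:00", "escalated": False},
]


def summarize(records, key):
    """Group incidents by `key`; return count, MTTR (minutes), and escalations."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)

    summary = {}
    for name, recs in buckets.items():
        # Time to resolve each incident, in minutes.
        minutes = [
            (datetime.fromisoformat(r["resolved"])
             - datetime.fromisoformat(r["opened"])).total_seconds() / 60
            for r in recs
        ]
        summary[name] = {
            "count": len(recs),
            "mttr_minutes": sum(minutes) / len(minutes),
            "escalations": sum(1 for r in recs if r["escalated"]),
        }
    return summary


by_service = summarize(incidents, "service")
by_severity = summarize(incidents, "severity")
```

The same `summarize` call works for any grouping key present in the records, so adding a `team` field would give the third grouping from the slide.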
Measurement: Make Sense of the Spikes
• Spikes typically indicate an issue that is larger in scope
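One simple way to surface such spikes is to flag days whose alert count sits well above the average. The sketch below assumes a plain list of daily counts and a cutoff of two standard deviations; both the data and the threshold are illustrative choices, not something the talk prescribes.

```python
from statistics import mean, pstdev


def find_spikes(daily_counts, threshold=2.0):
    """Return indexes of days whose count exceeds mean + threshold * std dev."""
    avg = mean(daily_counts)
    spread = pstdev(daily_counts)
    cutoff = avg + threshold * spread
    return [i for i, count in enumerate(daily_counts) if count > cutoff]


# Hypothetical alert totals for one week; day 5 is the obvious outlier.
daily_alert_counts = [4, 5, 3, 6, 4, 40, 5]
spikes = find_spikes(daily_alert_counts)
```

Days returned by `find_spikes` are candidates for a closer look: they are where many isolated alerts may really be one larger incident.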
Trivia:
• A Group of Geese is a Flock
• A Group of Cows is a Herd
• A Group of Tigers is a Streak
• A Group of Alerts is ?????
PagerDuty:
Infrastructure Health Module (Preview)
Alert or Incident
Incidents
• Service Impacting
• Important to Detail & Understand
• Example: “The Site is Down”
VS
Alerts
• Tactical & Explicit
• Important to Trend & Remediate
• Examples: “CPU > 99%”, “Disk Space @ 95%”
Measurement: Alert Analysis
• Trend Alert Totals Over Time
• Try to remove incident related alerts
• Group by:
• Alert Types – CPU, Memory, Etc..
• Source Host – Common Themes
• Host Types – Database, App, Network, etc..
• Prioritize Time to Remediate
• Short Term & Long Term
• Manage Alert Fatigue
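The grouping step above might look like the following sketch. The alert fields (`type`, `host`, `host_type`, `incident_id`) and host names are hypothetical; alerts already tied to an incident are filtered out first, as the slide suggests.

```python
from collections import Counter

# Hypothetical alert records; an `incident_id` of None means the alert
# was not part of a larger incident.
alerts = [
    {"type": "cpu", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "disk", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "cpu", "host": "app-03", "host_type": "app", "incident_id": None},
    {"type": "cpu", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "memory", "host": "app-07", "host_type": "app", "incident_id": "INC-42"},
]

# Remove incident-related alerts so the trend reflects isolated issues.
standalone = [a for a in alerts if a["incident_id"] is None]


def top_groups(records, key, n=3):
    """Return the n most frequent values of `key` with their counts."""
    return Counter(r[key] for r in records).most_common(n)


by_type = top_groups(standalone, "type")
by_host = top_groups(standalone, "host")
by_host_type = top_groups(standalone, "host_type")
```

The top entries in each grouping are a reasonable starting point for deciding where remediation time goes first.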
“Alert trends ignored
today are tomorrow’s
incidents…”
Measurement: Incident Rates & Cost
• Trend Incident Rates by Service
• Establishes Frequency & MTTR Trends
• Enables benchmarking (& Comparison)
• Enables forecasting to effectively plan time
• Establish Cost Metrics
• Recovery Efforts
• Capture the # of engineers involved in recovery efforts
• Capture the hours of engineering effort involved in recovery
• Customer Impact
• Correlate customer contacts to specific incidents
• Establish business metrics that can reflect customer impact
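A rough cost metric can be built from the two numbers the slide asks you to capture. In this sketch the `IncidentEffort` type, the sample figures, and the flat `HOURLY_RATE` are all assumptions for illustration; estimates are expected to be approximate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IncidentEffort:
    service: str
    engineers: int          # number of engineers involved in recovery
    engineer_hours: float   # total engineering hours spent on recovery
    customer_contacts: int  # customer contacts correlated to the incident


HOURLY_RATE = 100.0  # assumed placeholder rate, not a figure from the talk


def recovery_cost(effort, hourly_rate=HOURLY_RATE):
    """Estimate recovery cost as total engineering hours times an hourly rate."""
    return effort.engineer_hours * hourly_rate


# Hypothetical incidents with rough effort estimates.
efforts = [
    IncidentEffort("fulfillment", engineers=4, engineer_hours=6.0, customer_contacts=12),
    IncidentEffort("analytics", engineers=2, engineer_hours=1.0, customer_contacts=0),
]

cost_by_service = defaultdict(float)
for effort in efforts:
    cost_by_service[effort.service] += recovery_cost(effort)
cost_by_service = dict(cost_by_service)
```

Tracking `customer_contacts` alongside cost keeps the customer-impact side of the slide visible in the same record.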
Measurement: Incident Cause & Recovery
• Analyze Cause with Organization
• Potential Causes:
• Change Released – Reference Change Ticket
• Establish Objective Confidence Levels for Change (by Service)
• Code/Infra/Bug Issues – Reference Bug Ticket
• Creates a Tangible Cost to Priority Discussion
• 3rd Party Service Dependency (Cloud, Monitoring, ISP, Etc…)
• Tangible Business Impact
• Recovery
• Corrective Action
• Monitoring Effectiveness
You can’t measure Incidents you avoided,
so be sure to also measure
success.
Transparency: Current State & Historical
• Current State of Services & Incidents:
• Maintain a Service Status Page (Internal & External)
• Service Status – Outage, Degraded, etc….
• Incident Dashboard
• Severity
• Establishes Urgency Expectations
• Referenceable History
• Simplify Searching History
• Link Recovery Documentation to past Incidents
Collaboration:
• Transparency to data has a cultural influence
• Fix it Together
• Inquisitive Troubleshooting
• Fix it Long Term
• Team recognizes impact and creates empathy
• Product Team Engagement
• Objective data on product performance
Measurement + Transparency + Collaboration
• Incident Response & Recovery Times Decrease
• Incident Frequency Decreases
• Incident Recovery Cost Decreases
• Engineering Output Increases
• Decision Making Abilities Improve
• Team Morale Improves
• And Most Importantly….
• Happy Confident Customers
Tips for Getting There
• Measure stuff
• Be transparent with your metrics
• Don’t try to do it all at once
• Don’t make your Incident process bulky
• Consistent Ceremonies
“Foster a Culture
that Challenges &
Learns from Failure…”
Thanks for listening!
Twitter: @ajdomie
Please provide
feedback for this
session by filling out
the feedback survey

PDSummit16 - Using Incident Data to Improve your Business

Editor's Notes

  1. We’ll go through a generic example of what commonly happens in an organization that doesn’t do Incident Management well.
  2. Once upon a time…. The site was down
  3. Good thing executives noticed before anyone else. Pretty sure she’s just sitting in there with her door closed clicking refresh all day… Site outages are always the database! (Please note that, in the middle of this, the Network person left the room immediately.) That ended up being timely, as the DB team assumed it was the Network’s issue.
  4. Now we start to see customer contacts coming in as a result of this issue
  5. We have now entered the phase of the incident I refer to as the “Buckshot” phase: everyone communicating poorly with everyone else, with none of it contributing to resolving the issue.
  6. Turn it off and turn it on again, brilliant! Poke that database performance as a root cause one more time….
  7. If you’re looking for me you can find me with my face in my palm at my desk….
  8. Congratulations, you just contributed to Global Warming in your own way… Someone get me some marshmallows!
  9. You just made your customer do this, or worse yet, they spent the time they were waiting for your site to come online looking for new service providers.
  10. They are spending the time wasted by your site being down looking for a new provider to replace you!
  11. But it’s not just about the outage response here. There is so much more to Incident Management than response and resolution. OK, I’m just having fun with the story here, but this is a terrible experience for customers and employees. Nobody talented wants to work in an environment like this and nobody wants to do business with a company that operates like this.
  12. At SPS Commerce we’ve fostered a great culture around Incident Management.
  13. Microservices architecture and hybrid cloud, leveraging PagerDuty as a notification hub as well as an event emitter to help us start to trigger automated responses.
  14. There is so much data out there, where do we start?
  15. If you’re not already looking at some of this data I definitely recommend starting with the basics. And the PD UI does a great job of getting you there quickly.
  16. When you analyze your data, you will typically find that “spikes” in alert counts and MTTR indicate a much larger issue, one that likely had a more significant impact on your service performance than, for example, a single disk space issue.
  17. PagerDuty recently released to Beta some work they’re doing on event processing and analysis. I think this is the beginning of something very interesting. This beta is a great way to visualize the concept of an alert versus an incident. You can see here (first animation) that we’re looking at a large group and volume of alerts compared to normal. I would likely assess this as an incident that is probably the result of a shared dependency having performance issues. In this 2nd example you can see just a few small alert groupings, which are probably more indicative of isolated issues like CPU or Disk Space.
  18. This is something PD’s new OCC is going to be awesome for! What we have done: alert rates got us thinking more about “Incidents” vs. “Alerts”. Alerts are by nature more isolated, small-scale issues; it is important to respond to them and to remediate the go-forward risks at a larger scale. Incidents typically impact your customers, or risk doing so, and will absolutely cost you valuable engineering time. It’s critical that you leverage that to motivate a healthy short- and long-term response process.
  19. There are so many dimensions to an Incident; it’s important to start small and iterate. Recovery Efforts – hours and #’s do not need to be perfect, give it a good SWAG. Establishing business metrics – Canary Testing.
  20. Change – helps you shape your confidence level with true numbers. Code/Bug – tangible priority discussion around cost of time, customer UX, and frequency of recurrence. Recovery – if you have a service that is regularly corrected by restarting it, maybe it’s time to automate that?
  21. Visible Status Page – current state of service health should always be visible in a shared location (not in an email thread; that’s exclusive, and we need to be inclusive during an incident). Referenceable – props to PD road map on Event efforts.
  22. Decision Making Improves – you make better decisions as a business