This presentation describes how to introduce incident management and get more value from monitoring as part of a DevOps transformation. It covers what DevOps and incident management are, the phases of incident response, and how to set up alerting, write runbooks, create postmortems, and prepare for on-call response. It then identifies additional steps that strengthen incident management: training, and developing a mindset of collaboration and continuous improvement. The benefits of adopting DevOps and incident management include self-organizing teams, more experimentation, better service quality, faster recovery times, increased collaboration, and continuous improvement.
Bridgestone Mobility Solutions
Lena Standke
(formerly Olena Kharchenko)
• Project Lead for Incident Management (IcM)
• DevOps Change Agent
• Agile Coach
Agenda
DevOps and Incident Management
Incident Management in Practice
Additional Steps to Strengthen IcM
Benefits of Introducing DevOps and IcM
Q&A
What is DevOps
The Three Ways of DevOps:
1. Principles of Flow
• Making work visible
• Limiting WIP
• Reducing batch sizes and handoffs
2. Principles of Feedback
• Swarm and solve problems to build knowledge
• Push quality closer to the source
• Optimize for downstream work centers
3. Principles of Continuous Learning
• Enable organizational learning and a safety culture
• Institutionalize the improvement of daily work
What is Incident Management
• An incident is an unplanned interruption to a service or a reduction in the quality of a service.
• Incident management is the practice of minimizing the negative impact of incidents by restoring normal service operation as quickly as possible.
How does it work in practice?
How do I reach out to the team?
Phases of Incident Response
[Figure: six numbered phases of incident response. Phase labels: Preparation, Detection, Communication, Incident analysis, Postmortem, Recovery. Supporting elements: Monitoring and Alerting, Observability, Runbooks, Guidelines, Root Cause Analysis, Status Update.]
The Most Frequent Questions
1. How to set up alerting?
2. How to write a runbook?
3. How to make a postmortem?
4. How to prepare for on-call?
How to Improve Alerting
• Introduce an alert management tool
• Automate alert routing
• Create escalation rules
• Use an effective contact method
• Improve alerts:
o Use meaningful alert names (service name and problem description)
o Prioritize alerts (24/7 only for high and critical)
o Constantly check for false positives/negatives
[Figure labels: Team A, Team B, Team C]
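The routing, escalation, and prioritization points on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the configuration of any particular alert management tool: the `Alert` type, the `ROUTES` table, the service names, the working hours, and the escalation delays are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical routing table: service name -> owning team (assumed names).
ROUTES = {"billing-api": "Team A", "tracking": "Team B", "maps": "Team C"}

# Hypothetical escalation chain: (minutes without acknowledgement, contact).
ESCALATION = [(0, "on-call engineer"), (15, "backup on-call"), (30, "team lead")]

# Only these priorities page someone outside working hours.
PAGE_24_7 = {"high", "critical"}

# Assumed working hours for lower-priority alerts.
WORK_START, WORK_END = time(9, 0), time(17, 0)


@dataclass(frozen=True)
class Alert:
    service: str
    problem: str
    priority: str  # "low", "medium", "high", or "critical"

    @property
    def name(self) -> str:
        # Meaningful alert name: service name plus problem description.
        return f"{self.service}: {self.problem}"


def route(alert: Alert) -> str:
    """Return the team that owns the alerting service."""
    return ROUTES.get(alert.service, "unrouted")


def should_page(alert: Alert, now: time) -> bool:
    """Page around the clock only for high and critical alerts."""
    if alert.priority in PAGE_24_7:
        return True
    return WORK_START <= now < WORK_END


def contacts_due(minutes_unacknowledged: int) -> list[str]:
    """List everyone who should have been contacted by now."""
    return [who for delay, who in ESCALATION if minutes_unacknowledged >= delay]
```

Keeping the routing table and escalation chain as plain data makes them easy to review when checking for false positives or misrouted alerts.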
Service Runbooks
• Prepare a runbook template to be used across the company
• Create runbooks in your knowledge base where everyone can find them
• Label or mark runbooks to find them quickly
• Keep it short and easy to follow
Runbook Template
1. Provide a service overview
2. Define SLA, SLO, SLI
3. Include dependencies
4. Create rescue cards (table structure)
• Alert/incident name
• Impact description and severity
• Step-by-step instructions with links and screenshots
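One way to keep runbooks consistent across teams is to treat the template as a small data structure. The sketch below mirrors the four template items (overview, SLA/SLO/SLI, dependencies, rescue cards). The class names, field names, and the plain-text layout produced by `render` are assumptions made for illustration; the slides do not prescribe a format.

```python
from dataclasses import dataclass, field


@dataclass
class RescueCard:
    alert_name: str   # alert/incident name
    impact: str       # impact description
    severity: str     # e.g. "high"; the severity scale is an assumption
    steps: list[str]  # step-by-step instructions (links can go inline)


@dataclass
class Runbook:
    service: str
    overview: str
    sla: str  # service level agreement
    slo: str  # service level objective
    sli: str  # service level indicator
    dependencies: list[str] = field(default_factory=list)
    rescue_cards: list[RescueCard] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as plain text in the order of the template."""
        lines = [
            f"Runbook: {self.service}",
            f"Overview: {self.overview}",
            f"SLA: {self.sla}",
            f"SLO: {self.slo}",
            f"SLI: {self.sli}",
            "Dependencies: " + (", ".join(self.dependencies) or "none"),
        ]
        for card in self.rescue_cards:
            lines.append(f"Rescue card: {card.alert_name}")
            lines.append(f"  Impact: {card.impact} (severity: {card.severity})")
            lines += [f"  {n}. {step}" for n, step in enumerate(card.steps, 1)]
        return "\n".join(lines)
```

Using the alert name as the rescue card title lets an on-call engineer search the knowledge base for the exact text of the alert they received.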
Finalizing Runbook
• Proofread the runbook
• Test the runbook
• Improve the runbook after incidents
Creating a Postmortem
1. Dig into the causes to find the root cause
• Ask why, not who; use the 5 Whys technique
2. How did you solve it? What could have gone better?
• Describe your solution and mitigation, and analyze them
3. Is your monitoring/alerting setup good enough?
• How was the incident discovered? By a customer or by monitoring?
• How were you notified?
4. Is the runbook good enough?
• Could you easily follow the runbook? Was it helpful?
• Is there a better/faster way to solve the incident?
5. What are your lessons learnt?
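The five postmortem questions can be captured as a simple record so that no section is forgotten. The following is a minimal sketch, assuming a hypothetical `Postmortem` class whose field names are invented here; treating the last answer in the why-chain as the root cause is a simplification of the 5 Whys technique.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Postmortem:
    title: str
    whys: list[str] = field(default_factory=list)  # answers to successive "why?"
    resolution: str = ""        # how it was solved, what could have gone better
    detected_by: str = ""       # e.g. "monitoring" or "customer"
    notified_via: str = ""      # how the responder was notified
    runbook_feedback: str = ""  # was the runbook helpful, is there a faster way
    lessons: list[str] = field(default_factory=list)

    def root_cause(self) -> Optional[str]:
        """Treat the last answer in the why-chain as the root cause."""
        return self.whys[-1] if self.whys else None

    def open_questions(self) -> list[str]:
        """Name the sections of the postmortem that are still empty."""
        checks = {
            "root cause (why-chain)": self.whys,
            "resolution and mitigation": self.resolution,
            "detection": self.detected_by,
            "notification": self.notified_via,
            "runbook review": self.runbook_feedback,
            "lessons learnt": self.lessons,
        }
        return [section for section, value in checks.items() if not value]
```

Note that the record has no field for who caused the incident, in line with "ask why, not who".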
On-call Preparation
• Check the legal aspects of on-call and define guidelines
• Make on-call conditions attractive
• Identify people willing to take on on-call duty
• Cluster teams if they have only a few on-call participants
• Provide hands-on onboarding and training
Starting On-call
• Start by testing on-call during working hours only
• Start the on-call test with a daily rotation
• Test your runbooks
• Improve monitoring and alerting during the test phase
• Try out different contact methods and rotation setups
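A daily rotation restricted to working hours can be generated with a short script. The sketch below is an assumption-laden example: the function name `daily_rotation`, the 09:00 to 17:00 shift, and the choice to skip weekends are illustrative and not taken from the slides.

```python
from datetime import date, datetime, time, timedelta
from itertools import cycle


def daily_rotation(people, start, days,
                   shift_start=time(9, 0), shift_end=time(17, 0)):
    """Assign one person per weekday, limited to working hours.

    Returns a list of (person, shift_begin, shift_end) tuples covering
    `days` working days starting at `start`. Weekends are skipped because
    the test phase is assumed to cover working hours only.
    """
    if not people:
        raise ValueError("need at least one on-call participant")
    schedule = []
    rota = cycle(people)
    day = start
    while len(schedule) < days:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            schedule.append((
                next(rota),
                datetime.combine(day, shift_start),
                datetime.combine(day, shift_end),
            ))
        day += timedelta(days=1)
    return schedule
```

Changing `shift_start`, `shift_end`, or the order of `people` is a cheap way to try out different rotation setups before moving to 24/7 coverage.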
Training
• Organize hands-on training sessions
• Create guidelines, best practices, and how-tos in your knowledge base
• Find experts to act as go-to people for questions
• Set up a knowledge-sharing platform
Developing a Continuous Improvement and Collaboration Mindset
• Create an atmosphere of trust to involve people
• Take questions, feedback, and worries seriously
• Foster collaboration and knowledge sharing among teams
• Continuously improve processes and ways of working
• Concentrate on lessons learned, not on failures
Benefits of Introducing DevOps and IcM into Your Company
• Self-organizing teams
• More experimentation and new ideas
• Better service quality and stability
• Faster mean time to recover
• More collaboration between different teams
• Continuous improvement
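To check whether mean time to recover is actually improving, it helps to compute it the same way every time. The sketch below assumes recovery time is measured from detection to restored service; some teams measure from the start of the outage instead. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (recovered - detected) over all incidents.

    `incidents` is an iterable of (detected_at, recovered_at) datetime pairs.
    """
    durations = [recovered - detected for detected, recovered in incidents]
    if not durations:
        raise ValueError("need at least one incident")
    # sum() needs a timedelta start value; dividing by an int is supported.
    return sum(durations, timedelta()) / len(durations)
```

Comparing this value per quarter, before and after introducing runbooks and on-call, gives a simple trend to discuss in postmortem reviews.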