Another frustrating day full of incidents. Everything that was planned had to be pushed back to handle them, and some incidents even had to be ignored.
Even with incident management in place, the sheer volume of anomalies leaves open questions. Which incidents should we handle, and which should we ignore? Who is best placed to handle them? How do we filter out the false positives? How do we organize to turn things around quickly?
The Accelerate report finds that organizations able to sustain a continuous, stable flow of software iterations are the ones that make the difference in the market. For the others, incidents will remain a daily routine until their available resources are exhausted.
This presentation shares a methodology to bring your incident flow under control, at the cost of some sacrifice and courage.
Quality Engineering Incident Discipline
1. Incidents
The Shorter, the Better
qeunit.com
Antoine CRASKE
#digital #architecture
#transformation
#qualityengineering #qe
#testautomation #opensource
@acraske_
linkedin/acraske
2. La Redoute
Director of Technology Transformation
Director of Architecture & Technology
Senior Director of Engineering
Senior Engineering Manager
Previous positions: Project Director, IT Manager, Project Manager, Software Engineer
Entrepreneurship
Co-founder, atale.io
Co-founder, Cerberus Testing
Co-founder, Test Automation Camp
Communities
Speaker at Software, DevOps, Testing, Quality, Open source conferences
QE Unit, founder & organizer of the Quality Engineering community
TICE.Leiria, Meetup founder & organizer
Ministry of Testing Leiria, Meetup founder & organizer
Apache Kafka User Group Portugal, Meetup founder & organizer
Archilocus, Architecture community co-founder & co-organizer
Publications
On Defining Quality Engineering, QE Unit - with Rémi Dewitte (on Leanpub, Amazon)
Improving La Redoute's CI/CD Pipeline and DevOps Processes by Applying Machine Learning Techniques, ResearchGate.
Collecting Data from Continuous Practices: an Infrastructure to Support Team Development, ResearchGate.
Who am I
4. We are full of failures, and far from success
Source: The 2022 Accelerate State of DevOps Report
5. Incidents - if it was so easy
1,200 incidents/month, with 5 major ones resolved in 5.81 hours
30k€ of direct costs, 100k of indirect costs, plus brand impact
96% report an inability to learn from previous incidents
Source : Quocirca (2017), Damage Control – The impact of critical IT incidents
6. What we all have done
Incident management
methods, organization, tooling
Prioritization matrix
Survive the latest “P1”(s)
Source : Tech Target
Source : Blameless.io
Source : istockphoto
7. Our questions
Which incidents to address or ignore?
Who is the minimal set of people to involve?
How to reverse the incident trend?
10. A sum of probabilities
Incident
Risk A
Risk B
Risk C
Risk D
Problem 1
Problem 2
Order application does not handle retries
Financial application has downtime
Entire order processing flow is impacted
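As a rough illustration of the "sum of probabilities" idea, the sketch below combines independent risks in two ways: the chance that at least one of several risks materializes, and the chance that several conditions hold together (the financial application is down while the order application does not retry). All probability values and the helper names `p_any` and `p_all` are hypothetical, and independence between risks is an assumption that rarely holds exactly in practice.

```python
from math import prod


def p_any(probabilities):
    """Probability that at least one independent risk materializes."""
    return 1 - prod(1 - p for p in probabilities)


def p_all(probabilities):
    """Probability that all independent risks materialize together."""
    return prod(probabilities)


# Hypothetical monthly probabilities, for illustration only.
financial_app_downtime = 0.10
order_app_does_not_retry = 1.0  # a known gap, present until it is fixed

# The order flow is impacted only when both conditions hold together.
incident = p_all([financial_app_downtime, order_app_does_not_retry])

# Four individually small, independent risks add up quickly.
four_risks = p_any([0.05, 0.05, 0.05, 0.05])

print(f"incident: {incident:.2f}, any of four risks: {four_risks:.3f}")
```

Under these assumptions, handling retries in the order application lowers the second factor and therefore the incident probability, without touching the financial application at all.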
11. With contributing factors influencing the system
[Diagram: an Incident linked to Problem 1 and Problem 2, with several Risks, each surrounded by Contributing factors]
Contributing factors:
● Internal/external
● Process/tools
● Human/skills
● Organization
● …
Source: Divya Vohra Behl, Susan Ferreira, Systems Thinking: An Analysis of Key Factors and Relationships, Complex Adaptive Systems.
Source: Ryan Kitchens, SREcon 2019: “the focus should be on remediating the system, not the individual.”
12. “Success is nothing more than a few simple disciplines, practiced every day.”
14. #1 - Anti-fragility¹
Failure is inevitable
● We cannot stop the business
● More speed, more risks
● It’s about building an adaptive capacity
“the ability to continue to adapt to changing environments, stakeholders, demands, contexts”
Invest for guided continuous improvements
● Identify safety boundaries
● Reduce impacts at boundaries
● Inputs for upstream remediation
Source: Riccardo Patriarca, Dynamic Models To Enhance Space Safety. Space Safety Magazine.
¹Nassim Nicholas Taleb, Antifragile: Things That Gain From Disorder.
15. #2 - Raise incidents
MTTA, MTTD, and MTTR are not sufficient on their own
● Mean is an average
● But… incidents are not average
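A minimal sketch of why a mean can mislead: with a skewed set of incident durations, one long outage pulls the average far away from what a typical incident looks like. The durations below are invented for illustration and do not come from the sources cited on this slide.

```python
from statistics import mean, median, quantiles

# Hypothetical time-to-restore values in minutes, for illustration only.
durations = [12, 15, 18, 20, 22, 25, 30, 35, 40, 720]

avg = mean(durations)
mid = median(durations)
# Last of the nine decile cut points, i.e. the 90th percentile.
p90 = quantiles(durations, n=10, method="inclusive")[-1]
below_mean = sum(1 for d in durations if d < avg)

print(f"mean={avg:.1f} min, median={mid:.1f} min, p90={p90:.1f} min")
print(f"{below_mean} of {len(durations)} incidents were shorter than the mean")
```

Reporting the median and a high percentile next to the mean, or the absolute values themselves, keeps the single long outage visible instead of averaging it away.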
If you have to pick three indicators
● TTD (Time To Detect) in absolute value
● SLI then SLO
● Volume of people and teams involved
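The "SLI then SLO" ordering can be sketched as follows: first measure the indicator (here, the share of good events), then compare it to an objective and derive how much of the error budget is left. The function `evaluate`, the `SloReport` type, and the numbers are assumptions made for this example, not the method used at La Redoute.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured share of good events
    slo: float               # target share of good events
    budget_remaining: float  # fraction of the error budget still unspent


def evaluate(good_events: int, total_events: int, slo: float) -> SloReport:
    """Compute an SLI and compare it to an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events
    allowed_bad = (1 - slo) * total_events
    actual_bad = total_events - good_events
    # With a 100% objective there is no budget to spend.
    budget_remaining = 1 - actual_bad / allowed_bad if allowed_bad else 0.0
    return SloReport(sli=sli, slo=slo, budget_remaining=budget_remaining)


# Hypothetical figures: 100,000 order requests, 50 of them failed.
report = evaluate(good_events=99_950, total_events=100_000, slo=0.999)
print(f"SLI={report.sli:.4%}, SLO={report.slo:.1%}, "
      f"budget left={report.budget_remaining:.0%}")
```

A negative `budget_remaining` would mean the objective was missed for the period, which is one possible trigger for prioritizing reliability work.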
Source : 2021 VOID Report - the Verica Open Incident Database
Source : La Redoute internal, not authorized for disclosure.
Source: Alex Ewerlöf, How to Best Use MTT* Metrics to Optimize Your Incident Response. InfoQ article.
16. #3 - Post-mortem, no excuses
All incidents are opportunities to learn
● Increase knowledge of the system
● Incidents have risk and luck factors
● Near-misses are equally important
Develop an organizational discipline
● 0 excuses
● 100% follow-up with executive support
● Build up operational excellence
Source : 2021 VOID Report - the Verica Open Incident Database
Source : La Redoute internal, not authorized for disclosure.
17. #4 - Root cause(s)
Software is a complex socio-technological system
● “Complex systems fail in complex ways”
● Contributing factors at the source of root causes
● Systemic approach instead of problem resolution
Source : Systems Thinking: Managing Chaos and Complexity. Jamshid Gharajedaghi.
Source : What is the Difference Between Root Cause and Contributing Factor, Peedia (2022)
18. #5 - Blameless transparency
Leverage the “Speed of Trust”¹
● Transparency builds relationships
● Transparency gives space to fix what’s broken
● The more you understand, the more you can trust
Tackling the hard parts
● “When things go wrong, we all experience fear”
● There’s no “blameless retrospective”
● Make it progressive
Source: Uber concealed huge data breach, BBC news
¹Covey, S. M. R. (2008). The speed of trust: the one thing that changes everything. Simon & Schuster.
Source : La Redoute internal, not authorized for disclosure.
Engineering transparency
Organizational transparency
Stakeholders transparency
Public transparency
Source: Transparency in incident response, Squadcast
19. #6 - Learn
Solving an incident is not fixing an incident
● Siloed investigations by software engineers
● Investigators are not forensic examiners
● Identify themes and narratives leading to root causes
Dedicated “Incident Analysis” organization
● Staff strong Incident Analysts
● Block continuous time for Problem Management
● Ensure ongoing executive support
“Incident analysis is not actually about the incident, it’s an opportunity we have to see the delta between how we think our organization works and how it actually works”
—Nora Jones, CEO Jeli.io & Founder, LFI
[Diagram: Support, Delivery, Incident Analyst]
20. #7 - Step-by-step
Act on the existing system first
● Already multiple contributing factors
● Don’t change too many system factors
● Build up the adaptive capacity
Iterate on realistic targets with maturity
● Evolving a system takes time
● Ensure continuity in specific periods
● Industrialize SLIs, then SLOs, and only then start SRE
22. For more Quality Engineering
#peer-review #support #content-sharing
#mentoring #content
And also
Tech.rocks
moderntesting.org & AB Testing Podcast, Slack
platformengineering.org
qeunit.com