Quality Engineering Incident Discipline

Incidents
The Shorter, the Better
qeunit.com
Antoine CRASKE
#digital #architecture
#transformation
#qualityengineering #qe
#testautomation #opensource
@acraske_
linkedin/acraske
qeunit.com

Who am I
La Redoute
Director of Technology Transformation
Director of Architecture & Technology
Senior Director of Engineering
Senior Engineering Manager
Previous positions: Project Director, IT Manager, Project Manager, Software Engineer
Entrepreneurship
Co-founder, atale.io
Co-founder, Cerberus Testing
Co-founder, Test Automation Camp
Communities
Speaker at Software, DevOps, Testing, Quality, and Open Source conferences
QE Unit, founder & organizer of the Quality Engineering community
TICE.Leiria, Meetup founder & organizer
Ministry of Testing Leiria, Meetup founder & organizer
Apache Kafka User Group Portugal, Meetup founder & organizer
Archilocus, Architecture community co-founder & co-organizer
Publications
On Defining Quality Engineering, QE Unit, with Rémi Dewitte (on Leanpub, Amazon)
Improving La Redoute's CI/CD Pipeline and DevOps Processes by Applying Machine Learning Techniques, ResearchGate.
Collecting Data from Continuous Practices: an Infrastructure to Support Team Development, ResearchGate.

"You have failures
because you are successful"
—Dr. Richard Cook, How Complex Systems Fail

We are full of failures, and far from success
Source: The 2022 Accelerate State of DevOps Report

Incidents: if only it were that easy
1,200 incidents per month, including 5 major ones resolved in 5.81 hours
€30k of direct costs, plus 100k of indirect costs and brand impact
96% report an inability to learn from previous incidents
Source: Quocirca (2017), Damage Control – The impact of critical IT incidents

What we all have done
Incident management
methods, organization, tooling
Prioritization matrix
Survive the latest “P1”(s)
Source : Tech Target
Source : Blameless.io
Source : istockphoto

Our questions
Which incidents to address or ignore?
Who is the minimal set of people to involve?
How to reverse the incident trend?

“Complex systems fail in complex ways”

Complexity is not only in software

A sum of probabilities
[Diagram: Risks A, B, C, and D feed Problem 1 and Problem 2, which combine into the Incident]
Problem 1: Order application does not handle retries
Problem 2: Financial application has downtime
Incident: Entire order processing flow is impacted
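The idea that an incident emerges from a combination of risks can be illustrated with a little arithmetic. The sketch below is only an illustration: the probability values are hypothetical, the grouping of Risks A to D under the two problems is an assumption, and the events are treated as independent, which real systems rarely are.

```python
from math import prod

def p_any(probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1 - prod(1 - p for p in probabilities)

def p_all(probabilities):
    """Probability that all of several independent events occur together."""
    return prod(probabilities)

# Hypothetical monthly probabilities; the risk-to-problem grouping is assumed.
p_problem_1 = p_any([0.10, 0.05])  # Risk A or Risk B: orders are not retried
p_problem_2 = p_any([0.02, 0.01])  # Risk C or Risk D: financial application is down

# In this sketch the incident needs both problems to coincide.
p_incident = p_all([p_problem_1, p_problem_2])

print(f"Problem 1: {p_problem_1:.3f}")
print(f"Problem 2: {p_problem_2:.4f}")
print(f"Incident:  {p_incident:.4f}")
```

Under these assumptions the incident is much less likely than either problem alone, which is one reason the individual risks tend to stay unnoticed until they line up.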

With contributing factors influencing the system
[Diagram: contributing factors act on each risk; risks lead to Problem 1 and Problem 2, which lead to the Incident]
Categories of contributing factors:
● Internal/external
● Process/tools
● Human/skills
● Organization
● …
Source: Divya Vohra Behl, Susan Ferreira, Systems Thinking: An Analysis of Key Factors and Relationships, Complex Adaptive Systems.
Source: Ryan Kitchens, SREcon 2019: “the focus should be on remediating the system, not the individual.”
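One possible way to make contributing factors visible across incidents is to record them with a category and count how often each category recurs. This is a minimal sketch, not a prescribed data model: the class names, the `recurring_categories` helper, and the second example incident are hypothetical; only the four categories come from the slide.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum

class FactorCategory(Enum):
    INTERNAL_EXTERNAL = "internal/external"
    PROCESS_TOOLS = "process/tools"
    HUMAN_SKILLS = "human/skills"
    ORGANIZATION = "organization"

@dataclass
class ContributingFactor:
    description: str
    category: FactorCategory

@dataclass
class Incident:
    title: str
    factors: list[ContributingFactor] = field(default_factory=list)

def recurring_categories(incidents):
    """Count in how many incidents each category appears at least once."""
    counts = Counter()
    for incident in incidents:
        # A set avoids counting a category twice within the same incident.
        counts.update({factor.category for factor in incident.factors})
    return counts

# Hypothetical records for illustration only.
incidents = [
    Incident("Order processing flow impacted", [
        ContributingFactor("Order application does not retry", FactorCategory.PROCESS_TOOLS),
        ContributingFactor("Financial application downtime", FactorCategory.INTERNAL_EXTERNAL),
    ]),
    Incident("Checkout slowdown", [
        ContributingFactor("Runbook out of date", FactorCategory.PROCESS_TOOLS),
        ContributingFactor("Single on-call expert", FactorCategory.HUMAN_SKILLS),
    ]),
]

counts = recurring_categories(incidents)
for category, n in counts.most_common():
    print(f"{category.value}: {n} incident(s)")
```

Counting per incident rather than per factor keeps the focus on the system: a category that shows up in many incidents is a candidate for upstream remediation.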

“Success is nothing more than a few
simple disciplines, practiced every day.”

Quality Engineering Incident Discipline
1. Anti-fragility
2. Raise incidents
3. Post-mortem, no excuses
4. Root cause(s)
5. Blameless transparency
6. Learn
7. Step-by-step

#1 - Anti-fragility¹
Failure is inevitable
● We cannot stop the business
● More speed, more risks
● It’s about building an adaptive capacity
“the ability to continue to adapt to changing environments, stakeholders, demands, contexts”
Invest in guided continuous improvement
● Identify safety boundaries
● Reduce impact at the boundaries
● Provide inputs for upstream remediation
Source: Riccardo Patriarca, Dynamic Models To Enhance Space Safety. Space Safety Magazine.
¹Nassim Nicholas Taleb, Antifragile: Things That Gain From Disorder.

#2 - Raise incidents
MTTA, MTTD, and MTTR are not sufficient on their own
● A mean is only an average
● But… incidents are not average
If you have to pick three indicators
● TTD (Time To Detect) in absolute value
● SLI then SLO
● Volume of people and teams involved
Source: 2021 VOID Report - the Verica Open Incident Database
Source: La Redoute internal, not authorized for disclosure.
Source: Alex Ewerlöf, How to Best Use MTT* Metrics to Optimize Your Incident Response. InfoQ article.
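The point that incidents are not average can be seen on a small data set. The sketch below uses hypothetical time-to-detect values (in minutes) with one long outlier; the `summarize` helper and the 30-minute threshold are assumptions made for illustration.

```python
from statistics import mean, median, quantiles

# Hypothetical time-to-detect values in minutes: mostly quick, one long outlier.
ttd_minutes = [4, 5, 6, 7, 8, 9, 10, 12, 15, 480]

def summarize(durations):
    """Return the mean next to order statistics that expose the tail."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = quantiles(durations, n=10, method="inclusive")[-1]
    return {
        "mean": mean(durations),
        "median": median(durations),
        "p90": p90,
        "max": max(durations),
    }

summary = summarize(ttd_minutes)

# Absolute values: list the detections that exceeded a hypothetical 30-minute threshold.
slow_detections = [t for t in ttd_minutes if t > 30]

for name, value in summary.items():
    print(f"{name}: {value}")
print(f"slow detections: {slow_detections}")
```

Here the mean is several times the median because a single incident dominates it, so looking at each TTD in absolute value tells more than the mean does.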

#3 - Post-mortem, no excuses
All incidents are opportunities to learn
● Increase knowledge of the system
● Incidents have risk and luck factors
● Near-misses are equally important
Develop an organizational discipline
● 0 excuses
● 100% follow-up with executive support
● Build up operational excellence
Source : 2021 VOID Report - the Verica Open Incident Database

#4 - Root cause(s)
Software is a complex socio-technological system
● “Complex systems fail in complex ways”
● Contributing factors are at the source of root causes
● Take a systemic approach instead of isolated problem resolution
Source: Systems Thinking: Managing Chaos and Complexity. Jamshid Gharajedaghi.
Source: What is the Difference Between Root Cause and Contributing Factor, Peedia (2022)

#5 - Blameless transparency
Leverage the “Speed of Trust”¹
● Transparency builds relationships
● Transparency gives space to fix what’s broken
● The more you understand, the more you can trust
Tackling the hard parts
● “When things go wrong, we all experience fear”
● There’s no “blameless retrospective”
● Make it progressive
Source: Uber concealed huge data breach, BBC news
¹Covey, S. M. R. (2008). The speed of trust: the one thing that changes everything. Simon & Schuster.
Engineering transparency
Organizational transparency
Stakeholder transparency
Public transparency
Source: Transparency in incident response, Squadcast

#6 - Learn
Solving an incident is not fixing an incident
● Siloed investigations by software engineers
● Investigators are not forensic experts
● Identify themes and narratives leading to root causes
Dedicated “Incident Analysis” organization
● Staff strong Incident Analysts
● Block continuous time for Problem Management
● Ensure ongoing executive support
“Incident analysis is not actually about the incident, it’s an
opportunity we have to see the delta between how we
think our organization works and how it actually works”
—Nora Jones, CEO Jeli.io & Founder, LFI
[Diagram labels: Support, Delivery, Incident Analyst]

#7 - Step-by-step
Act on the existing system first
● Already multiple contributing factors
● Don’t change too many system factors
● Build up the adaptive capacity
Iterate on realistic targets as maturity grows
● Evolving a system takes time
● Ensure continuity in specific periods
● Industrialize SLIs, then SLOs, and only then start SRE
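The SLI-then-SLO ordering can be sketched in a few lines: first measure an indicator, then compare it to a target. The function names, the request counts, and the 99.9% target below are hypothetical and only illustrate the mechanics.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that met the success criterion."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests

def check_slo(sli, slo_target):
    """Compare a measured SLI with its SLO target."""
    return {
        "sli": sli,
        "slo": slo_target,
        "met": sli >= slo_target,
        "margin": sli - slo_target,  # negative once the objective is missed
    }

# Hypothetical measurement window and target.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
report = check_slo(sli, slo_target=0.999)

print(f"SLI {report['sli']:.4%} vs SLO {report['slo']:.1%}: met={report['met']}")
```

Starting with the indicator alone already gives a shared measurement; the objective can be introduced once the measurement is trusted.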

For more Quality Engineering
#peer-review #support #content-sharing
#mentoring #content
And also
Tech.rocks
moderntesting.org & AB Testing Podcast, Slack
platformengineering.org
qeunit.com
