Characterizing the spectrum of organizational practice for SRE using the Shuhari and Dreyfus frameworks.
Presented at USENIX SREcon17EMEA, 31 August 2017 in Dublin, Ireland.
3. Is Change Going to Stop?
All things are compounded objects
in a continuous change of condition
Article by Mathias Lafeldt, “Impermanence: The Single Root Cause”
Dave Zwieback Beyond Blame:
The root cause for both the functioning and malfunctions in all complex systems is impermanence (i.e., the fact that all
systems are changeable by nature). Knowing the root cause, we no longer seek it, and instead look for the many
conditions that allowed a particular situation to manifest. We accept that not all conditions are knowable or fixable.
https://medium.com/production-ready/impermanence-the-single-root-cause-bd9ebadf1e8e
4. Stages of Practice
Shu (obey)
Ha (detach)
Ri (separate)
There are many conceptual frameworks for skill acquisition. Shuhari describes three stages:
1 - Learn the mechanical basics, the forms of the art
2 - Learn to innovate on the basic forms
3 - Transcend the forms to flow intuitively with the elements
5. Stages of Practice
Innocent
Shu (obey) Novice
Beginner
Competent
Ha (detach) Proficient
Ri (separate) Master
Expert/Researcher
Around 1980, Stuart and Hubert Dreyfus wrote a paper proposing a five-stage model for skill acquisition. The Dreyfus
brothers had some particular components in mind with their model, and others have debated them, but the general
concepts are:
Novice - follows rules as given, without context
Beginner - limited “situational perception”
Competent - active decision making in choosing a course of action
Proficient - prioritizes importance of aspects, perceives deviations from the normal pattern
Master - (I’ve adjusted the top term a bit) intuitive grasp of situations based on deep, tacit understanding
And other writers have added “boundary” condition states as well:
Innocent - Have “heard about XXX”, No acquaintance with a concept or process
Expert/Researcher - Write books, An advanced state characterized by teaching others and pushing the definitions
forward
“Five” is a nice, in-between count; so let’s look at SRE practices using the Dreyfus model
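One way to read the two frameworks side by side is as a lookup from Dreyfus level to Shuhari stage. The sketch below is illustrative only: the names `Dreyfus`, `SHUHARI_STAGE`, and `shuhari_stage` are hypothetical, and the grouping follows the stage slide above (Novice through Competent under Shu, Proficient under Ha, Master under Ri). The boundary states Innocent and Expert/Researcher are left out.

```python
from enum import IntEnum


class Dreyfus(IntEnum):
    """The five Dreyfus levels, ordered from least to most practiced."""
    NOVICE = 1
    BEGINNER = 2
    COMPETENT = 3
    PROFICIENT = 4
    MASTER = 5


# Grouping as shown on the "Stages of Practice" slide (an interpretation,
# not a formal definition from either framework).
SHUHARI_STAGE = {
    Dreyfus.NOVICE: "Shu",
    Dreyfus.BEGINNER: "Shu",
    Dreyfus.COMPETENT: "Shu",
    Dreyfus.PROFICIENT: "Ha",
    Dreyfus.MASTER: "Ri",
}


def shuhari_stage(level):
    """Return the Shuhari stage that a Dreyfus level falls under."""
    return SHUHARI_STAGE[level]


if __name__ == "__main__":
    for level in Dreyfus:
        print(f"{level.name.title():<11} {shuhari_stage(level)}")
```

Using an ordered enum keeps comparisons such as `Dreyfus.BEGINNER < Dreyfus.COMPETENT` meaningful, which is handy when comparing practice areas against each other.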
6. Signposts of SRE Practice
Incident Response
Incident Prevention
Post Mortems
SL[AOI]s
Monitoring
8. Shu Signposts:
Incident Response
Novice “Alarmed” by incidents
Primarily externally sourced, with inconsistent
response
Beginner “Fears” incidents
Effective response requires specific people
Competent “Aware” that incidents are normal
Well-defined handling process
Novice - alarmed by incidents - which come mainly from external notification
Beginner - fears incidents and responding well requires particular people
Competent - “aware” that incidents are normal, processes are more established
9. Ha-Ri Signposts:
Incident Response
Proficient “Accept” incidents as normal
Some inter-team coordination planning
Master “Embrace” incidents as learning experiences
Well documented processes and procedures with
learning inputs to the process
Proficient - accept that incidents are normal
Master - embraces incidents as a learning experience and has a strong feedback framework
11. Shu Signposts:
Incident Prevention
Novice Focus on remediation (docs & metrics) for
manually identified, static, contributory causes
Beginner Documentation done to an “acceptable” level
Static & action-based causes recognized
Competent Focus on team response to incidents, maintaining
docs
Novice - Manually identified, static causes
Beginner - Better documentation, recognizes both static and action-based causes
Competent - Looks to improve how the team responds to incidents
12. Ha-Ri Signposts:
Incident Prevention
Proficient Early phases of chaos engineering - scheduled
Master Randomized chaos engineering
Focus on general hygiene of operational
environment
Proficient - early, scheduled chaos engineering; a little please, but not too much
Master - “bring it on”, randomized chaos engineering
14. Shu Signposts:
Post Mortems
Novice “Blameful”, only for crisis incidents
Looking for a scapegoat
Beginner Only performed for major incidents
Looking for a cause with a focus on mistakes
Competent More common, starting to look past blaming
Focus on improving local processes
Novice - blameful, looking for a scapegoat
Beginner - looking for a cause, mainly around “mistakes”
Competent - starting to look past blaming
15. Ha-Ri Signposts:
Post Mortems
Proficient “Blameless”, used consistently
Action items feed back to improve systems &
processes
Master Used to derive “meta”-learnings
Applying learnings across the system
Proficient - blameless, consistent processes feeding back into the organization
Master - a step above, looking for larger themes and applying across the entire system
17. Shu Signposts:
SL[AOI]s
Novice Externally imposed (SLA), if any
On paper, not necessarily measured
May be manually calculated for contractual needs
Beginner Recognizes the difference in these terms
Measures “easy” things
Competent Defined and measured primary characteristics
Measures internal SLOs, not just contractual
performance
Novice - externally imposed if any
Beginner - understands the differences, measures what is easy
Competent - primary measures in place
18. Ha-Ri Signposts:
SL[AOI]s
Proficient Well developed cascade of measures
Historical record and correlation to events
Master Meaningful measures throughout the system
Proficient - well developed sets of measures with historical records/baselines
Master - meaningful measures throughout the system
20. Shu Signposts:
Monitoring
Novice No baseline metrics established
Beginner “OS level” or “out of the box”, inconsistent
monitoring
Partial baselines being developed
Competent Consistent baseline monitoring across entire
system
Able to determine statistical anomalies
Novice - no baseline
Beginner - “out of the box”, spotty coverage
Competent - consistent monitoring
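The Competent level above is described as able to determine statistical anomalies. A minimal sketch of one common approach follows, assuming a simple z-score test against a baseline window. The function name `find_anomalies`, the threshold of three standard deviations, and the sample latency values are all assumptions made for illustration; the talk does not prescribe a method.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=3.0):
    """Return values in `recent` that lie more than `threshold` sample
    standard deviations away from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline points
    if sigma == 0:
        # A perfectly flat baseline: any different value is an anomaly.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
    recent_ms = [101, 99, 140]
    print(find_anomalies(baseline_ms, recent_ms))  # → [140]
```

A z-score test only makes sense once a baseline exists, which is why this capability shows up at Competent rather than at Novice or Beginner.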
21. Ha-Ri Signposts:
Monitoring
Proficient Thorough instrumentation of all service components
Able to correlate internal and external measures
Master Data observable upon demand
Automated correlation and anomaly detection
Proficient - thorough measures
Master - observable upon demand with automated anomaly detection
22. Other Potential Areas to Evaluate
Error Budget Definition and Usage
Change Management Practices
Demand Forecasting / Cost to Serve
23. More Potential Areas to Evaluate
Provisioning
Efficiency
Do Your Services “Plan for Retirement”?
24. Even More Potential Areas to Evaluate
New Services: Intro to Stability
MTTS (hat tip to Etsy) or INsomnia
Toil Fraction
25. Assessing Your Organization’s Level of Practice
Mock Assessment
Figure: Flamegraph showing degrees of execution, one stack per practice area (Monitoring, IResponse, SLx, IPrevent, PM), with levels ranging from Novice to Proficient
30. And the Beat Goes On
Each ‘9’ will cost you more than the one before it
Org-wide Practice Adoption?
Everything as a Service
Customer Reliability Engineering
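The point about each ‘9’ can be made concrete with a little arithmetic. The sketch below does not model cost at all. It only shows, under the assumption of a 365-day year, how the yearly downtime allowance shrinks tenfold with every added nine, which is one plausible reason each step gets harder. The function name `allowed_downtime_minutes` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(nines):
    """Yearly downtime allowance, in minutes, for an availability target
    with the given number of nines (2 means 99%, 3 means 99.9%, ...)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(2, 6):
        target = 100 - 100 / 10 ** n  # availability as a percentage
        print(f"{target:.3f}%  {allowed_downtime_minutes(n):10.2f} min/year")
```

At two nines the allowance is 5,256 minutes a year (about 3.65 days); at five nines it is roughly 5.3 minutes.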
31. . . . it was only the beginning of the real story . . . which goes on forever:
in which every chapter is better than the one before.
Continuing the conversation. . .
Twitter: @DrKurtA
LinkedIn: https://linkedin.com/in/kurta1