THE NEVER-ENDING
STORY:
SITE RELIABILITY
Automating Ourselves
Out of a Job ?!?
Is Change Going to Stop?
All things are compounded objects
in a continuous change of condition
Article by Mathias Lafeldt, “Impermanence: The Single Root Cause”
Dave Zwieback, Beyond Blame:
The root cause for both the functioning and malfunctions in all complex systems is impermanence (i.e., the fact that all
systems are changeable by nature). Knowing the root cause, we no longer seek it, and instead look for the many
conditions that allowed a particular situation to manifest. We accept that not all conditions are knowable or fixable.
https://medium.com/production-ready/impermanence-the-single-root-cause-bd9ebadf1e8e
Stages of Practice
Shu (obey)
Ha (detach)
Ri (separate)
There are many conceptual frameworks for skill acquisition. This one has three stages:
1 - Learn the mechanical basics, the forms of the art
2 - Learn to innovate on the basic forms
3 - Transcend the forms to flow intuitively with the elements
Stages of Practice
Innocent
Shu (obey): Novice, Beginner, Competent
Ha (detach): Proficient
Ri (separate): Master
Expert/Researcher
Around 1980, Stuart and Hubert Dreyfus wrote a paper proposing a five-stage model of skill acquisition. The Dreyfus
brothers had some particular components in mind that others have since debated, but the general concepts are:
Novice - follows rules as given, without context
Beginner - limited “situational perception”
Competent - active decision making in choosing a course of action
Proficient - prioritizes importance of aspects, perceives deviations from the normal pattern
Master - (I’ve adjusted the top term a bit) intuitive grasp of situations based on deep, tacit understanding
Other writers have added “boundary” states as well:
Innocent - has only “heard about XXX”; no acquaintance with a concept or process
Expert/Researcher - writes books; an advanced state characterized by teaching others and pushing the definitions
forward
“Five” is a nice, in-between count, so let’s look at SRE practices using the Dreyfus model.
Signposts of SRE Practice
Incident Response
Incident Prevention
Post Mortems
SL[AOI]s
Monitoring
Signposts:
Incident Response
(hat tip to J. Paul Reed)
Shu Signposts:
Incident Response
Novice: “Alarmed” by incidents; primarily externally sourced, with inconsistent response
Beginner: “Fears” incidents; effective response requires specific people
Competent: “Aware” that incidents are normal; well-defined handling process
Novice - alarmed by incidents, which come mainly from external notification
Beginner - fears incidents; responding well requires particular people
Competent - “aware” that incidents are normal, processes are more established
Ha-Ri Signposts:
Incident Response
Proficient: “Accepts” incidents as normal; some inter-team coordination planning
Master: “Embraces” incidents as learning experiences; well-documented processes and procedures with learning inputs to the process
Proficient - accepts that incidents are normal
Master - embraces incidents as a learning experience and has a strong feedback framework
Signposts:
Incident Prevention
(hat tip to J. Paul Reed)
Shu Signposts:
Incident Prevention
Novice: Focus on remediation (docs & metrics) for manually identified, static, contributory causes
Beginner: Documentation done to an “acceptable” level; static & action-based causes recognized
Competent: Focus on team response to incidents, maintaining docs
Novice - Manually identified, static causes
Beginner - Better documentation, recognizes both static and action-based causes
Competent - Looks to improve how the team responds to incidents
Ha-Ri Signposts:
Incident Prevention
Proficient: Early phases of chaos engineering - scheduled
Master: Randomized chaos engineering; focus on general hygiene of operational environment
Proficient - early, scheduled chaos engineering; a little please, but not too much
Master - “bring it on”, randomized chaos engineering
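The difference between the scheduled and randomized stages comes down to how the next experiment target is chosen. The following is a minimal sketch under stated assumptions: the service names and both function names are hypothetical, and it does not represent any real chaos engineering tool.

```python
import random

# Hypothetical service names, used only for illustration.
SERVICES = ["checkout", "search", "email"]


def scheduled_experiment(week: int) -> str:
    """Scheduled stage: rotate through targets on a fixed, announced schedule."""
    return SERVICES[week % len(SERVICES)]


def randomized_experiment(rng: random.Random) -> str:
    """Randomized stage: the target is drawn at random, so nobody can prepare for it."""
    return rng.choice(SERVICES)
```

With a schedule, every team knows when its turn is coming; with a random draw, the experiment tests the steady-state readiness of the whole environment.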
Signposts:
Post Mortems
Shu Signposts:
Post Mortems
Novice: “Blameful”, only for crisis incidents; looking for a scapegoat
Beginner: Only performed for major incidents; looking for a cause with a focus on mistakes
Competent: More common, starting to look past blaming; focus on improving local processes
Novice - blameful, looking for a scapegoat
Beginner - looking for a cause, mainly around “mistakes”
Competent - starting to look past blaming
Ha-Ri Signposts:
Post Mortems
Proficient: “Blameless”, used consistently; action items feed back to improve systems & processes
Master: Used to derive “meta”-learnings; applying learnings across the system
Proficient - blameless, consistent processes feeding back into the organization
Master - a step above, looking for larger themes and applying across the entire system
Signposts:
SLAs / SLOs / SLIs
Shu Signposts:
SL[AOI]s
Novice: Externally imposed (SLA), if any; on paper, not necessarily measured; may be manually calculated for contractual needs
Beginner: Recognizes the difference in these terms; measures “easy” things
Competent: Defined and measured primary characteristics; measures internal SLOs, not just contractual performance
Novice - externally imposed if any
Beginner - understands the differences, measures what is easy
Competent - primary measures in place
Ha-Ri Signposts:
SL[AOI]s
Proficient: Well-developed cascade of measures; historical record and correlation to events
Master: Meaningful measures throughout the system
Proficient - well developed sets of measures with historical records/baselines
Master - meaningful measures throughout the system
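To make the difference between the three terms concrete, here is a rough sketch: the SLI is what gets measured, the SLO is the internal target, and the SLA is the contractual floor. All names and numbers below are assumptions for illustration; they are not from the talk.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    name: str
    slo: float  # internal target, e.g. 0.999 (assumed value)
    sla: float  # contractual floor, typically looser than the SLO (assumed value)


def availability_sli(good: int, total: int) -> float:
    """SLI: the measured fraction of requests that were served well."""
    if total == 0:
        return 1.0  # no traffic, so nothing failed
    return good / total


def evaluate(obj: Objective, sli: float) -> str:
    """Compare a measured SLI against the internal and contractual targets."""
    if sli < obj.sla:
        return "SLA breached"
    if sli < obj.slo:
        return "SLO missed"
    return "within SLO"
```

For example, `evaluate(Objective("checkout", slo=0.999, sla=0.99), availability_sli(9_985, 10_000))` reports a missed SLO while the contract is still being met, which is the gap an internal SLO is meant to expose.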
Signposts:
Monitoring
Shu Signposts:
Monitoring
Novice: No baseline metrics established
Beginner: “OS level” or “out of the box”, inconsistent monitoring; partial baselines being developed
Competent: Consistent baseline monitoring across entire system; able to determine statistical anomalies
Novice - no baseline
Beginner - “out of the box”, spotty coverage
Competent - consistent monitoring
Ha-Ri Signposts:
Monitoring
Proficient: Thorough instrumentation of all service components; able to correlate internal and external measures
Master: Data observable upon demand; automated correlation and anomaly detection
Proficient - thorough measures
Master - observable upon demand with automated anomaly detection
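One simple way to read “statistical anomalies” against a baseline is a z-score check. This is a sketch of that one technique, chosen here for illustration; the talk does not name a specific method, and the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a value that sits more than `threshold` standard deviations from the baseline mean.

    The baseline needs at least two samples for a standard deviation to exist.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any departure from it counts as anomalous.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

Without an established baseline there is nothing to compare against, which is why anomaly detection only becomes possible from the Competent stage onward.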
Other Potential Areas to Evaluate
Error Budget Definition and Usage
Change Management Practices
Demand Forecasting / Cost to Serve
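As a rough illustration of error budget usage, the sketch below uses the common definition in which the budget is the failure fraction an SLO permits (1 minus the SLO). The function name and the request counts are hypothetical, and the slide itself does not prescribe a formula.

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so none of the budget is spent
    allowed_bad = (1.0 - slo) * total  # failures the SLO permits
    actual_bad = total - good          # failures actually observed
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% SLO over 1,000,000 requests, 500 failures leave about half of the budget; 2,000 failures overspend it.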
More Potential Areas to Evaluate
Provisioning
Efficiency
Do Your Services “Plan for Retirement”?
Even More Potential Areas to Evaluate
New Services: Intro to Stability
MTTS (hat tip to Etsy) or INsomnia
Toil Fraction
Assessing Your Organization’s Level of Practice
[Figure: mock assessment, a flamegraph showing degrees of execution across Monitoring, Incident Response, SLx, Incident Prevention, and Post Mortems]
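An assessment like this can be recorded as a simple mapping from practice area to Dreyfus level. The sketch below is one possible representation; the ratings are hypothetical and are not read off the mock assessment figure.

```python
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    NOVICE = 1
    BEGINNER = 2
    COMPETENT = 3
    PROFICIENT = 4
    MASTER = 5


def stage(level: Level) -> str:
    """Map a Dreyfus level onto the Shu-Ha-Ri stage it belongs to."""
    if level <= Level.COMPETENT:
        return "Shu"
    return "Ha" if level == Level.PROFICIENT else "Ri"


# Hypothetical ratings for illustration only.
assessment: Dict[str, Level] = {
    "Incident Response": Level.BEGINNER,
    "Incident Prevention": Level.NOVICE,
    "Post Mortems": Level.NOVICE,
    "SLx": Level.BEGINNER,
    "Monitoring": Level.COMPETENT,
}


def weakest_areas(ratings: Dict[str, Level]) -> List[str]:
    """Return the areas rated at the lowest level, in alphabetical order."""
    lowest = min(ratings.values())
    return sorted(area for area, lvl in ratings.items() if lvl == lowest)
```

Listing the weakest areas first gives a starting point for deciding where to invest next.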
ARE WE
AUTOMATING
OURSELVES OUT OF A
JOB !?!
And the Beat Goes On
Each ‘9’ will cost you more than the one before it
 
Org-wide Practice Adoption ?
Everything as a Service
Customer Reliability Engineering
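The arithmetic behind the remark about each ‘9’ can be sketched as follows: every additional nine cuts the permitted downtime by a factor of ten. This shows only the shrinking allowance, not the cost itself; the function name is an assumption, and leap years are ignored.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted by an availability target with the given number of nines.

    For example, nines=3 means a 99.9% availability target.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability
```

Three nines leave roughly 525.6 minutes of downtime a year, four nines roughly 52.6, and five nines a little over 5.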
. . . it was only the beginning of the real story . . . which goes on forever:
in which every chapter is better than the one before.
Continuing the conversation. . .
Twitter: @DrKurtA
LinkedIn: https://linkedin.com/in/kurta1
