SlideShare a Scribd company logo
Rethinking Site Reliability
Engineering for ITSM
Principal Product Manager (BMC Software) and ITIL® 4 Contributing Author
SDI “New Ways of Working” virtual event: 7th May 2020
Jon Stevens-Hall
A brief introduction to SRE
• First published 2016
• Free online: https://landing.google.com/sre/books/
“SRE is what happens when you ask a software engineer
to design an operations team”
• Some SREs at Google are software engineers hired in
the standard way
• Others were close to being recruited as a software
engineer, and brought other infra skills
• SREs help software developers run their services but
also expect accountability from them.
@jonstevenshall
Me
• ITSM veteran – 23 years
• Principal Product Manager with BMC software, for
our enterprise ITSM toolset, BMC Helix ITSM
• ITIL® 4 contributing author
• Strong focus on DevOps and new ways of working
• Asked by SDI to talk about SRE today
Twitter: @JonStevensHall
• https://www.bmc.com/blogs/author/jonhall/
• https://medium.com/@jonhall_
@jonstevenshall
Eliminating Toil
Subjects covered in the SRE handbook
Risk
Service Level
Objectives
Monitoring
Distributed Systems
Automation
Release Engineering
Alerting
On-Call
Handling Incidents
Postmortem SchedulingHuman Welfare
Communication
Risk
Service Level
Objectives
Eliminating Toil
@jonstevenshall
Defining Toil
AutomatableRepetitiveManual Tactical
Devoid of
enduring
value
Linearly
scaling
• SRE teams can only spend a maximum of 50% on work
considered to be “Toil”.
• Google’s SRE handbook describes “toil” as work which is:
@jonstevenshall
Service Level Objective
Availability =
Successful requests
Total requests
Availability =
Uptime
Uptime + Downtime
SLOs are not SLAs. Not all Google services have SLAs, but some do.
SLOs define appropriate technical bounds for service performance.
“99.9% of RPC calls will
complete in less than 100 ms"
“The service will be available
99.99% of the time”
@jonstevenshall
Error Budget
Defines how unreliable the service is allowed to be,
within a given time period.
100%
Required performance (SLO target)
Remaining error budget
Outages to-date
“An outage is no longer a bad thing.
It is an expected part of the process of innovation,
and an occurrence that development and SRE teams
manage rather than fear”
Outages are okay, within error budget tolerances
@jonstevenshall
Error Budget
99.9% SLO 0.05% outages already
Error budget remaining is 0.05%
Hence an additional outage which consumes 0.02% of the quarter’s
availability would cost 40% of the remaining error budget.
Balancing innovation and reliability
• Large budget: developers can take
more risks
• Budget nearly drained: developers tend
to slow down
• Budget used up: releases temporarily
halted
Error Budget shapes ways of working
Martin Vorel via Libreshot
@jonstevenshall
How does SRE align to ITSM as we know it?
“Site Reliability Engineering (SRE) is an
emerging IT service management
(ITSM) framework”
(First line of the report, published 1st April 2020)
@jonstevenshall
Will SRE replace ITSM as we know it?
“As a service management alternative, SRE
updates traditional ITSM activities with
innovative and self-organizing concepts such as…
• Management to service level objectives
• Error budgets
• Toil reduction
• Release engineering
• Monitoring/observability
• Embracing risk”
Jayne Groll, DevOps.com, 2020
@jonstevenshall
A counterpoint: Most of us are not Google.
“If you think spinning up an ops team and calling it
SRE is ‘an implementation of DevOps’ you’ve
swallowed the worst poison pill the DevOps talk
circuit can deal to you.”
• Unless you are Google scale, a separate
production operations team is inefficient
• DevOps showed that “throw over the wall
from Dev to Ops” was wrong
• Throwing over the wall from Dev to SRE
doesn’t improve that
The Agile Admin – Blog (2018).
@jonstevenshall
One size doesn’t fit all
“Toyota opens its factory doors to us again and
again, but I imagine Toyota’s leaders may also be
shaking their heads and thinking, ‘Sure, come
have a look. But why are you so interested in the
solutions we develop for our specific problems?
Why do you never study how we go about
developing those solutions?’”
Mike Rother – Toyota Kata (2009)
@jonstevenshall
• “Site Reliability Engineering”…
• In many (most?) cases, IT services aren’t “sites”
• A lot of ITSM work isn’t software engineering
• Reliability is a broad term
© Copyright 2020 BMC Software, Inc.
What’s in a name? Rethinking SRE for ITSM
SRE is already an established profession
Job advert for ”disruptive Fintech”, retrieved May 2020
@jonstevenshall
ITIL® 4 maps SRE principles to 14 practices
• Availability management
• Capacity and performance management
• Change enablement
• Incident management
• Infrastructure and platform management
• Monitoring and event management
• Problem management
• Service design
• Software development and management
• Deployment management
• Organizational change management
• Release management
• Service configuration management
• Service validation and testing
(but at a very high level)
@jonstevenshall
• Current service desk performance: 98% call pickup, on a target of 90%
• Agents fully utilised on calls
Example of an “SRE-like” change for IT support
Before:
• Schedule a proportion of each agent’s time to be away from calls
• Use that time for proactive reduction of toil (e.g. building self service
offerings, developing knowledge content)
• The previous 8% over-performance in call pickup is spent as “error budget”
Applying SRE thinking:
@jonstevenshall
• Level of change approval determined by perceived risk of change type
• Pressure from DevOps to remove approval barriers (particularly CAB)
SRE applied to Change Enablement
Before:
• Dynamically assess approval level based on confidence level in team
• Implementation success rate of previous changes
• Issues caused by previous changes
• Adherence to guardrails in the DevOps toolchain
• Minimal intervention, applied more smartly
Applying SRE thinking:
@jonstevenshall
• Site Reliability Engineering is a bad name for many aspects of ITSM.
• SRE is already defined (and recruited) as a specialist operations
developer role.
• Some key, novel themes have great potential but need redefinition
for much of ITSM.
Summary of issues
@jonstevenshall
• “Rebrand” the SRE concept for ITSM: New name, new thinking.
• Identify new ways to apply key novel principles such as reliability and toil
reduction.
• Invest in ITSM initiatives such as KCS which share good principles with SRE.
Proposal
@jonstevenshall
© Copyright 2020 BMC Software, Inc.
Referenced materials and further reading
https://theagileadmin.com/2018/10/02/sre-
the-biggest-lie-since-kanban/
https://devops.com/sre-is-the-most-
innovative-approach-to-itsm-since-itil/
@ernestmueller @JayneGroll @RealMikeRother
http://www-personal.umich.edu/
~mrother/Homepage.html
https://www.bmc.com/blogs
/sre-vs-itops/
@DevDroid9
https://medium.com/@JonHall_/what-can-site-reliability-
engineers-teach-service-desks-about-toil-e1df38ca284e
@jonstevenshall
https://landing.google.com/sre/books/
@srebook
Rethinking Site Reliability Engineering for ITSM - SDI virtual event "New Ways of Working", May 2020

More Related Content

What's hot

Sre summary
Sre summarySre summary
Sre summary
Yogesh Shah
 
Site (Service) Reliability Engineering
Site (Service) Reliability EngineeringSite (Service) Reliability Engineering
Site (Service) Reliability Engineering
Mark Underwood
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)
Ravi Tadwalkar
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
Rauno De Pasquale
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
New Relic
 
Site reliability engineering
Site reliability engineeringSite reliability engineering
Site reliability engineering
Jason Loeffler
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
Setyo Legowo
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
Ashutosh Agarwal
 
Cloud Native Engineering with SRE and GitOps
Cloud Native Engineering with SRE and GitOpsCloud Native Engineering with SRE and GitOps
Cloud Native Engineering with SRE and GitOps
Weaveworks
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
Hussain Mansoor
 
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud MigrationCapgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Floyd DCosta
 
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
Tori Wieldt
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
Ashutosh Agarwal
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
ITSM Academy, Inc.
 
Cloud Strategy First
 Cloud Strategy First Cloud Strategy First
Cloud Strategy First
Amazon Web Services
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
DevOpsDays Tel Aviv
 
AIOps - The next 5 years
AIOps - The next 5 yearsAIOps - The next 5 years
AIOps - The next 5 years
Moogsoft
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE Assessments
Marc Hornbeek
 
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
Splunk
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
Ladislav Prskavec
 

What's hot (20)

Sre summary
Sre summarySre summary
Sre summary
 
Site (Service) Reliability Engineering
Site (Service) Reliability EngineeringSite (Service) Reliability Engineering
Site (Service) Reliability Engineering
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Site reliability engineering
Site reliability engineeringSite reliability engineering
Site reliability engineering
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
 
Cloud Native Engineering with SRE and GitOps
Cloud Native Engineering with SRE and GitOpsCloud Native Engineering with SRE and GitOps
Cloud Native Engineering with SRE and GitOps
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud MigrationCapgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
 
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
 
Cloud Strategy First
 Cloud Strategy First Cloud Strategy First
Cloud Strategy First
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
AIOps - The next 5 years
AIOps - The next 5 yearsAIOps - The next 5 years
AIOps - The next 5 years
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE Assessments
 
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 

Similar to Rethinking Site Reliability Engineering for ITSM - SDI virtual event "New Ways of Working", May 2020

Site Reliability Engineering: Harnessing (and redefining) it for ITSM
Site Reliability Engineering: Harnessing (and redefining) it for ITSMSite Reliability Engineering: Harnessing (and redefining) it for ITSM
Site Reliability Engineering: Harnessing (and redefining) it for ITSM
Jon Stevens-Hall
 
Galorath - IT Data Collection, Analysis and Benchmarking: From Processes and...
Galorath -  IT Data Collection, Analysis and Benchmarking: From Processes and...Galorath -  IT Data Collection, Analysis and Benchmarking: From Processes and...
Galorath - IT Data Collection, Analysis and Benchmarking: From Processes and...
International Software Benchmarking Standards Group (ISBSG)
 
Devops intro
Devops introDevops intro
Devops intro
Pallavi Mudaliar
 
Is DevOps Really Changing IT Support?
Is DevOps Really Changing IT Support?Is DevOps Really Changing IT Support?
Is DevOps Really Changing IT Support?
Jon Stevens-Hall
 
Microservices
MicroservicesMicroservices
Microservices
PT.JUG
 
Un Architecture
Un ArchitectureUn Architecture
Un Architecture
chrisonea
 
Leveraging Failure to Succeed in DevOps
Leveraging Failure to Succeed in DevOpsLeveraging Failure to Succeed in DevOps
Leveraging Failure to Succeed in DevOps
Steve Brown
 
Avoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation FailureAvoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation Failure
Nathaniel Payne
 
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
NUS-ISS
 
Making sense of microservices, service mesh, and serverless
Making sense of microservices, service mesh, and serverlessMaking sense of microservices, service mesh, and serverless
Making sense of microservices, service mesh, and serverless
Christian Posta
 
DBA Role Shift in a DevOps World
DBA Role Shift in a DevOps WorldDBA Role Shift in a DevOps World
DBA Role Shift in a DevOps World
Datavail
 
How to Drive Maximum Business Value from IT Investments with the Flow Framework
How to Drive Maximum Business Value from IT Investments with the Flow FrameworkHow to Drive Maximum Business Value from IT Investments with the Flow Framework
How to Drive Maximum Business Value from IT Investments with the Flow Framework
Tasktop
 
How AI is transforming DevOps | Calidad Infotech
How AI is transforming DevOps | Calidad InfotechHow AI is transforming DevOps | Calidad Infotech
How AI is transforming DevOps | Calidad Infotech
Calidad Infotech
 
How Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamHow Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamSalesforce Developers
 
DevOps: What, who, why and how?
DevOps: What, who, why and how?DevOps: What, who, why and how?
DevOps: What, who, why and how?
Red Gate Software
 
The Future of Change Management and DevOps for Dummies
The Future of Change Management and DevOps for DummiesThe Future of Change Management and DevOps for Dummies
The Future of Change Management and DevOps for Dummies
DBmaestro - Database DevOps
 
Owasp summit debrief v1.0 (jun 2017)
Owasp summit debrief v1.0 (jun 2017)Owasp summit debrief v1.0 (jun 2017)
Owasp summit debrief v1.0 (jun 2017)
owaspsummit
 
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps LifestyleDevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
Mark Heckler
 
IoT to Cloud the DevOps Way
IoT to Cloud the DevOps WayIoT to Cloud the DevOps Way
IoT to Cloud the DevOps Way
Mark Heckler
 
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPMAMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
Matt Wright
 

Similar to Rethinking Site Reliability Engineering for ITSM - SDI virtual event "New Ways of Working", May 2020 (20)

Site Reliability Engineering: Harnessing (and redefining) it for ITSM
Site Reliability Engineering: Harnessing (and redefining) it for ITSMSite Reliability Engineering: Harnessing (and redefining) it for ITSM
Site Reliability Engineering: Harnessing (and redefining) it for ITSM
 
Galorath - IT Data Collection, Analysis and Benchmarking: From Processes and...
Galorath -  IT Data Collection, Analysis and Benchmarking: From Processes and...Galorath -  IT Data Collection, Analysis and Benchmarking: From Processes and...
Galorath - IT Data Collection, Analysis and Benchmarking: From Processes and...
 
Devops intro
Devops introDevops intro
Devops intro
 
Is DevOps Really Changing IT Support?
Is DevOps Really Changing IT Support?Is DevOps Really Changing IT Support?
Is DevOps Really Changing IT Support?
 
Microservices
MicroservicesMicroservices
Microservices
 
Un Architecture
Un ArchitectureUn Architecture
Un Architecture
 
Leveraging Failure to Succeed in DevOps
Leveraging Failure to Succeed in DevOpsLeveraging Failure to Succeed in DevOps
Leveraging Failure to Succeed in DevOps
 
Avoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation FailureAvoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation Failure
 
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method fo...
 
Making sense of microservices, service mesh, and serverless
Making sense of microservices, service mesh, and serverlessMaking sense of microservices, service mesh, and serverless
Making sense of microservices, service mesh, and serverless
 
DBA Role Shift in a DevOps World
DBA Role Shift in a DevOps WorldDBA Role Shift in a DevOps World
DBA Role Shift in a DevOps World
 
How to Drive Maximum Business Value from IT Investments with the Flow Framework
How to Drive Maximum Business Value from IT Investments with the Flow FrameworkHow to Drive Maximum Business Value from IT Investments with the Flow Framework
How to Drive Maximum Business Value from IT Investments with the Flow Framework
 
How AI is transforming DevOps | Calidad Infotech
How AI is transforming DevOps | Calidad InfotechHow AI is transforming DevOps | Calidad Infotech
How AI is transforming DevOps | Calidad Infotech
 
How Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamHow Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering Team
 
DevOps: What, who, why and how?
DevOps: What, who, why and how?DevOps: What, who, why and how?
DevOps: What, who, why and how?
 
The Future of Change Management and DevOps for Dummies
The Future of Change Management and DevOps for DummiesThe Future of Change Management and DevOps for Dummies
The Future of Change Management and DevOps for Dummies
 
Owasp summit debrief v1.0 (jun 2017)
Owasp summit debrief v1.0 (jun 2017)Owasp summit debrief v1.0 (jun 2017)
Owasp summit debrief v1.0 (jun 2017)
 
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps LifestyleDevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
DevOps Beyond the Buzzwords: What it Means to Embrace the DevOps Lifestyle
 
IoT to Cloud the DevOps Way
IoT to Cloud the DevOps WayIoT to Cloud the DevOps Way
IoT to Cloud the DevOps Way
 
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPMAMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
 

More from Jon Stevens-Hall

Expanding our Understanding: Complex Adaptive Systems
Expanding our Understanding: Complex Adaptive SystemsExpanding our Understanding: Complex Adaptive Systems
Expanding our Understanding: Complex Adaptive Systems
Jon Stevens-Hall
 
Velocity19 Berlin: Swarming, Cynefin… and avoiding the problems of becoming a...
Velocity19 Berlin: Swarming, Cynefin…and avoiding the problems of becoming a...Velocity19 Berlin: Swarming, Cynefin…and avoiding the problems of becoming a...
Velocity19 Berlin: Swarming, Cynefin… and avoiding the problems of becoming a...
Jon Stevens-Hall
 
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
Jon Stevens-Hall
 
SRVision 2019, Utrecht: Swarming and Cynefin
SRVision 2019, Utrecht: Swarming and CynefinSRVision 2019, Utrecht: Swarming and Cynefin
SRVision 2019, Utrecht: Swarming and Cynefin
Jon Stevens-Hall
 
SDI19: Swarming and Devops for ITSM
SDI19: Swarming and Devops for ITSMSDI19: Swarming and Devops for ITSM
SDI19: Swarming and Devops for ITSM
Jon Stevens-Hall
 
Support at scale in a DevOps world How Swarming and Cynefin can save you from...
Support at scale in a DevOps world How Swarming and Cynefin can save you from...Support at scale in a DevOps world How Swarming and Cynefin can save you from...
Support at scale in a DevOps world How Swarming and Cynefin can save you from...
Jon Stevens-Hall
 
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
Jon Stevens-Hall
 
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
Jon Stevens-Hall
 
DevOpsDays Riga - Swarming Presentation
DevOpsDays Riga - Swarming PresentationDevOpsDays Riga - Swarming Presentation
DevOpsDays Riga - Swarming Presentation
Jon Stevens-Hall
 
ITSM, Swarming and Devops
ITSM, Swarming and DevopsITSM, Swarming and Devops
ITSM, Swarming and Devops
Jon Stevens-Hall
 
Devops In The Enterprise: How Swarming Can Fix The Problem Of Becoming A 3rd-...
Devops In The Enterprise:How Swarming Can Fix The Problem Of Becoming A 3rd-...Devops In The Enterprise:How Swarming Can Fix The Problem Of Becoming A 3rd-...
Devops In The Enterprise: How Swarming Can Fix The Problem Of Becoming A 3rd-...
Jon Stevens-Hall
 
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
Jon Stevens-Hall
 
Configuration Management Camp 2018: The problem of becoming "3rd line support...
Configuration Management Camp 2018: The problem of becoming "3rd line support...Configuration Management Camp 2018: The problem of becoming "3rd line support...
Configuration Management Camp 2018: The problem of becoming "3rd line support...
Jon Stevens-Hall
 
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge ManagementBMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
Jon Stevens-Hall
 
Devopsdays Edinburgh 2017 - Ignite talk - Swarming
Devopsdays Edinburgh 2017 - Ignite talk - SwarmingDevopsdays Edinburgh 2017 - Ignite talk - Swarming
Devopsdays Edinburgh 2017 - Ignite talk - Swarming
Jon Stevens-Hall
 
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
Jon Stevens-Hall
 
Knowledge Management in BMC Remedy 9.1
Knowledge Management in BMC Remedy 9.1Knowledge Management in BMC Remedy 9.1
Knowledge Management in BMC Remedy 9.1
Jon Stevens-Hall
 
How the Internet of Things and 20 billion devices will change your job
How the Internet of Things and 20 billion devices will change your jobHow the Internet of Things and 20 billion devices will change your job
How the Internet of Things and 20 billion devices will change your job
Jon Stevens-Hall
 
IAITAM ACE 2016, New Orleans - Presentation
IAITAM ACE 2016, New Orleans - PresentationIAITAM ACE 2016, New Orleans - Presentation
IAITAM ACE 2016, New Orleans - Presentation
Jon Stevens-Hall
 
Evolving Service for the Digital Workplace
Evolving Service for the Digital WorkplaceEvolving Service for the Digital Workplace
Evolving Service for the Digital Workplace
Jon Stevens-Hall
 

More from Jon Stevens-Hall (20)

Expanding our Understanding: Complex Adaptive Systems
Expanding our Understanding: Complex Adaptive SystemsExpanding our Understanding: Complex Adaptive Systems
Expanding our Understanding: Complex Adaptive Systems
 
Velocity19 Berlin: Swarming, Cynefin… and avoiding the problems of becoming a...
Velocity19 Berlin: Swarming, Cynefin…and avoiding the problems of becoming a...Velocity19 Berlin: Swarming, Cynefin…and avoiding the problems of becoming a...
Velocity19 Berlin: Swarming, Cynefin… and avoiding the problems of becoming a...
 
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
 
SRVision 2019, Utrecht: Swarming and Cynefin
SRVision 2019, Utrecht: Swarming and CynefinSRVision 2019, Utrecht: Swarming and Cynefin
SRVision 2019, Utrecht: Swarming and Cynefin
 
SDI19: Swarming and Devops for ITSM
SDI19: Swarming and Devops for ITSMSDI19: Swarming and Devops for ITSM
SDI19: Swarming and Devops for ITSM
 
Support at scale in a DevOps world How Swarming and Cynefin can save you from...
Support at scale in a DevOps world How Swarming and Cynefin can save you from...Support at scale in a DevOps world How Swarming and Cynefin can save you from...
Support at scale in a DevOps world How Swarming and Cynefin can save you from...
 
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
Swarming: How a new approach to support can save DevOps teams from 3rd-line t...
 
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
DevOps Enterprise Summit Las Vegas 2018: The Problem of Becoming a 3rd-Line S...
 
DevOpsDays Riga - Swarming Presentation
DevOpsDays Riga - Swarming PresentationDevOpsDays Riga - Swarming Presentation
DevOpsDays Riga - Swarming Presentation
 
ITSM, Swarming and Devops
ITSM, Swarming and DevopsITSM, Swarming and Devops
ITSM, Swarming and Devops
 
Devops In The Enterprise: How Swarming Can Fix The Problem Of Becoming A 3rd-...
Devops In The Enterprise:How Swarming Can Fix The Problem Of Becoming A 3rd-...Devops In The Enterprise:How Swarming Can Fix The Problem Of Becoming A 3rd-...
Devops In The Enterprise: How Swarming Can Fix The Problem Of Becoming A 3rd-...
 
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
Service Manager Dag, Netherlands 2018: Why we should ditch the 3-tier support...
 
Configuration Management Camp 2018: The problem of becoming "3rd line support...
Configuration Management Camp 2018: The problem of becoming "3rd line support...Configuration Management Camp 2018: The problem of becoming "3rd line support...
Configuration Management Camp 2018: The problem of becoming "3rd line support...
 
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge ManagementBMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
BMC Engage 2015: Optimizing Service Desk Interactions with Knowledge Management
 
Devopsdays Edinburgh 2017 - Ignite talk - Swarming
Devopsdays Edinburgh 2017 - Ignite talk - SwarmingDevopsdays Edinburgh 2017 - Ignite talk - Swarming
Devopsdays Edinburgh 2017 - Ignite talk - Swarming
 
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
devopsdays Stockholm Ignite talk: Aligning DevOps with Enterprise-scale custo...
 
Knowledge Management in BMC Remedy 9.1
Knowledge Management in BMC Remedy 9.1Knowledge Management in BMC Remedy 9.1
Knowledge Management in BMC Remedy 9.1
 
How the Internet of Things and 20 billion devices will change your job
How the Internet of Things and 20 billion devices will change your jobHow the Internet of Things and 20 billion devices will change your job
How the Internet of Things and 20 billion devices will change your job
 
IAITAM ACE 2016, New Orleans - Presentation
IAITAM ACE 2016, New Orleans - PresentationIAITAM ACE 2016, New Orleans - Presentation
IAITAM ACE 2016, New Orleans - Presentation
 
Evolving Service for the Digital Workplace
Evolving Service for the Digital WorkplaceEvolving Service for the Digital Workplace
Evolving Service for the Digital Workplace
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 

Rethinking Site Reliability Engineering for ITSM - SDI virtual event "New Ways of Working", May 2020

  • 1. Rethinking Site Reliability Engineering for ITSM Principal Product Manager (BMC Software) and ITIL® 4 Contributing Author SDI “New Ways of Working” virtual event: 7th May 2020 Jon Stevens-Hall
  • 2. A brief introduction to SRE • First published 2016 • Free online: https://landing.google.com/sre/books/ “SRE is what happens when you ask a software engineer to design an operations team” • Some SREs at Google are software engineers hired in the standard way • Others were close to being recruited as a software engineer, and brought other infra skills • SREs help software developers run their services but also expect accountability from them. @jonstevenshall
  • 3. Me • ITSM veteran – 23 years • Principal Product Manager with BMC software, for our enterprise ITSM toolset, BMC Helix ITSM • ITIL® 4 contributing author • Strong focus on DevOps and new ways of working • Asked by SDI to talk about SRE today Twitter: @JonStevensHall • https://www.bmc.com/blogs/author/jonhall/ • https://medium.com/@jonhall_ @jonstevenshall
  • 4. Eliminating Toil Subjects covered in the SRE handbook Risk Service Level Objectives Monitoring Distributed Systems Automation Release Engineering Alerting On-Call Handling Incidents Postmortem SchedulingHuman Welfare Communication Risk Service Level Objectives Eliminating Toil @jonstevenshall
  • 5. Defining Toil AutomatableRepetitiveManual Tactical Devoid of enduring value Linearly scaling • SRE teams can only spend a maximum of 50% on work considered to be “Toil”. • Google’s SRE handbook describes “toil” as work which is: @jonstevenshall
  • 6. Service Level Objective Availability = Successful requests Total requests Availability = Uptime Uptime + Downtime SLOs are not SLAs. Not all Google services have SLAs, but some do. SLOs define appropriate technical bounds for service performance. “99.9% of RPC calls will complete in less than 100 ms" “The service will be available 99.99% of the time” @jonstevenshall
  • 7. Error Budget Defines how unreliable the service is allowed to be, within a given time period. 100% Required performance (SLO target) Remaining error budget Outages to-date
  • 8. “An outage is no longer a bad thing. It is an expected part of the process of innovation, and an occurrence that development and SRE teams manage rather than fear” Outages are okay, within error budget tolerances @jonstevenshall
  • 9. Error Budget 99.9% SLO 0.05% outages already Error budget remaining is 0.05% Hence an additional outage which consumes 0.02% of the quarter’s availability would cost 40% of the remaining error budget.
  • 10. Balancing innovation and reliability • Large budget: developers can take more risks • Budget nearly drained: developers tend to slow down • Budget used up: releases temporarily halted Error Budget shapes ways of working Martin Vorel via Libreshot @jonstevenshall
  • 11. How does SRE align to ITSM as we know it? “Site Reliability Engineering (SRE) is an emerging IT service management (ITSM) framework” (First line of the report, published 1st April 2020) @jonstevenshall
  • 12. Will SRE replace ITSM as we know it? “As a service management alternative, SRE updates traditional ITSM activities with innovative and self-organizing concepts such as… • Management to service level objectives • Error budgets • Toil reduction • Release engineering • Monitoring/observability • Embracing risk” Jayne Groll, DevOps.com, 2020 @jonstevenshall
  • 13. A counterpoint: Most of us are not Google. “If you think spinning up an ops team and calling it SRE is ‘an implementation of DevOps’ you’ve swallowed the worst poison pill the DevOps talk circuit can deal to you.” • Unless you are Google scale, a separate production operations team is inefficient • DevOps showed that “throw over the wall from Dev to Ops” was wrong • Throwing over the wall from Dev to SRE doesn’t improve that The Agile Admin – Blog (2018). @jonstevenshall
  • 14. One size doesn’t fit all “Toyota opens its factory doors to us again and again, but I imagine Toyota’s leaders may also be shaking their heads and thinking, ‘Sure, come have a look. But why are you so interested in the solutions we develop for our specific problems? Why do you never study how we go about developing those solutions?’” Mike Rother – Toyota Kata (2009) @jonstevenshall
  • 15. • “Site Reliability Engineering”… • In many (most?) cases, IT services aren’t “sites” • A lot of ITSM work isn’t software engineering • Reliability is a broad term © Copyright 2020 BMC Software, Inc. What’s in a name? Rethinking SRE for ITSM
  • 16. SRE is already an established profession Job advert for ”disruptive Fintech”, retrieved May 2020 @jonstevenshall
  • 17. ITIL® 4 maps SRE principles to 14 practices • Availability management • Capacity and performance management • Change enablement • Incident management • Infrastructure and platform management • Monitoring and event management • Problem management • Service design • Software development and management • Deployment management • Organizational change management • Release management • Service configuration management • Service validation and testing (but at a very high level) @jonstevenshall
  • 18. • Current service desk performance: 98% call pickup, on a target of 90% • Agents fully utilised on calls Example of an “SRE-like” change for IT support Before: • Schedule a proportion of each agent’s time to be away from calls • Use that time for proactive reduction of toil (e.g. building self service offerings, developing knowledge content) • The previous 8% over-performance in call pickup is spent as “error budget” Applying SRE thinking: @jonstevenshall
  • 19. • Level of change approval determined by perceived risk of change type • Pressure from DevOps to remove approval barriers (particularly CAB) SRE applied to Change Enablement Before: • Dynamically assess approval level based on confidence level in team • Implementation success rate of previous changes • Issues caused by previous changes • Adherence to guardrails in the DevOps toolchain • Minimal intervention, applied more smartly Applying SRE thinking: @jonstevenshall
  • 20. • Site Reliability Engineering is a bad name for many aspects of ITSM. • SRE is already defined (and recruited) as a specialist operations developer role. • Some key, novel themes have great potential but need redefinition for much of ITSM. Summary of issues @jonstevenshall
  • 21. • “Rebrand” the SRE concept for ITSM: New name, new thinking. • Identify new ways to apply key novel principles such as reliability and toil reduction. • Invest in ITSM initiatives such as KCS which share good principles with SRE. Proposal @jonstevenshall
  • 22. © Copyright 2020 BMC Software, Inc. Referenced materials and further reading https://theagileadmin.com/2018/10/02/sre- the-biggest-lie-since-kanban/ https://devops.com/sre-is-the-most- innovative-approach-to-itsm-since-itil/ @ernestmueller @JayneGroll @RealMikeRother http://www-personal.umich.edu/ ~mrother/Homepage.html https://www.bmc.com/blogs /sre-vs-itops/ @DevDroid9 https://medium.com/@JonHall_/what-can-site-reliability- engineers-teach-service-desks-about-toil-e1df38ca284e @jonstevenshall https://landing.google.com/sre/books/ @srebook