Clearing the Way
For SRE in the Enterprise
Damon Edwards
@damonedwards
Community
Ops Improvement
DevOps
Ops Tools
Damon Edwards
Digital
Agile
DevOps
CI/CD
Cloud
Docker
Kubernetes
Microservices
CHANGE
Wow
That is cool
I wish I could
work there
OpsBusiness
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and Dev...
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Jane Doe
Systems Administrator
Jane Doe
Systems Administrator
We have
SysAdmins
Jane Doe
Systems Administrator
They should be
SREs!
Jane Doe
SRE
They should be
SREs!
ITIL Book 1
ITIL Book 2
ITIL Book 3
ITIL Book 4
ITIL Book 5
Quality!
is job
#1
Sys
Admin
CAB CALENDAR
Your new title is SR...
SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again...
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Principles of SRE are what set SRE apart
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator...
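One way to read the SLO and error budget slide as arithmetic: the error budget is the gap between 100% and the Service Level Objective, and the Service Level Indicator shows how much of that gap has been spent. The sketch below is a minimal illustration and is not taken from the talk. The function name, the event counts, and the `exhausted` flag (standing in for whatever "consequences" an organization agrees on) are all assumptions.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class BudgetReport:
    sli_percent: Fraction  # measured Service Level Indicator for the window
    allowed_bad: Fraction  # error budget, expressed as a number of bad events
    remaining: Fraction    # budget left (negative once overspent)
    exhausted: bool        # hypothetical trigger for the agreed consequences


def error_budget(slo_percent: str, total_events: int, bad_events: int) -> BudgetReport:
    """Compare a measured SLI against an SLO over one window.

    slo_percent is passed as a string (e.g. "99.9") so that Fraction can
    represent it exactly and avoid floating-point rounding.
    """
    slo = Fraction(slo_percent)
    if not 0 < slo <= 100:
        raise ValueError("SLO must be in the range (0, 100]")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= bad_events <= total_events")

    sli = Fraction(100 * (total_events - bad_events), total_events)
    allowed_bad = total_events * (100 - slo) / 100
    remaining = allowed_bad - bad_events
    return BudgetReport(sli, allowed_bad, remaining, remaining < 0)


if __name__ == "__main__":
    report = error_budget("99.9", 1_000_000, 400)
    print(float(report.sli_percent), int(report.allowed_bad),
          int(report.remaining), report.exhausted)
```

The point of the shared number is that Dev, Ops, and the business all look at the same remaining budget when deciding what to do next.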
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to mak...
Principles of SRE are what set SRE apart
Stephen Thorne
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https:...
Forces That Undermine SRE Principles
Silos Queues
Excessive Toil Low Trust
Backlog Information
I need X
PrioritiesTools
Silos
Backlog
I do X
Requests
for X
Silo A
Information
Priorities
Silo B
Tools
Silos cause disconnects and mismatches
Backlog Information
I need X
PrioritiesTools
Backlog
I do X
Requests
for X
Silo A
I...
1
2
3
Silos Interfere with feedback loops
Producer Consumer
Ops
Ops
Ops
Function A
Function B
Function C
Silos create labor pools of functional specialists
Requests fulfilled by semi-
manual or ...
Silos Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
2. SREs have time to make tomorrow ...
Forces That Undermine SRE Principles
Silos Queues
Toil Low Trust
How do we cover for our cross-silo disconnects and mismatches?
Silo A Silo B
Ticket
Queue
??
Silo A Silo B
We all know how well that works
Ticket
Queue
Request queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Varia...
What do queues do to value streams?
Queue
A
Queue
B
Queues disintegrate and
obfuscate value streams
Ticket queues become “snowflake makers”
??
Silo A Silo B
Ticket
Queue
Snowflakes
(each unique, technically acceptable but...
Ticket Queues Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
2. SREs have time to make t...
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Toil is the enemy of SRE
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitiv...
Toil vs. Engineering Work
Toil Engineering Work
Lacks Enduring Value Builds Enduring Value
Rote, Repetitive Creative, Iter...
Excessive toil prevents fixing the system
Toil Engineering Work
E.W.Toil
Reduce toil
Improve the business
No capacity to...
Excessive Toil Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
2. SREs have time to make...
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Where are decisions made? Who can take action?
escalate
1° 2° 3° 4°
escalate escalateor
Decisions made here
All work is contextual
rm -rf $PATHNAME
Is this dangerous?
Answer is always “it depends”
John Allspaw
escalate
1° 2° 3° 4°
escalate escalateor
Context
Where are decisions made? Who can take action?
Low trust + approvals = illusion of control
Ticket
System
Add up the total number of approval requests and
…subtract the i...
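The subtraction exercise on this slide can be run against an export of approval records. The following is a rough sketch under stated assumptions: the reason labels, the `(reason, outcome)` record shape, and the function name are hypothetical, and someone still has to classify each approval by hand. The categories mirror the ones the slide asks you to subtract (info radiators, CYAs, too removed to judge).

```python
from collections import Counter

# Hypothetical labels for why an approver is on the request.
INFO_RADIATOR = "info_radiator"  # "I need to be in the loop"
CYA = "cya"                      # "Prove you followed the process"
TOO_REMOVED = "too_removed"      # too far from the work, "mostly guessing"
REAL_DECISION = "real_decision"  # approver had the context to judge

SUBTRACT = {INFO_RADIATOR, CYA, TOO_REMOVED}


def audit_approvals(approvals):
    """Tally approvals given as (reason, outcome) pairs.

    outcome is "approved" or "rejected". Returns the total, the counts
    subtracted per category, how many approvals are left, and how many
    of those were rejected.
    """
    reasons = Counter(reason for reason, _ in approvals)
    left = [(reason, outcome) for reason, outcome in approvals
            if reason not in SUBTRACT]
    return {
        "total": len(approvals),
        "subtracted": {label: reasons[label] for label in sorted(SUBTRACT)},
        "left": len(left),
        "rejected": sum(1 for _, outcome in left if outcome == "rejected"),
    }


if __name__ == "__main__":
    sample = (
        [(INFO_RADIATOR, "approved")] * 5
        + [(CYA, "approved")] * 3
        + [(TOO_REMOVED, "approved")] * 4
        + [(REAL_DECISION, "approved")] * 2
        + [(REAL_DECISION, "rejected")]
    )
    print(audit_approvals(sample))
```

Whether a remaining approval "was the right call" is a judgment the tally cannot make, so the sketch leaves that question to a human review.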
Low Trust Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
2. SREs have time to make tomo...
Forces That Undermine Operations
Silos Queues
Toil Low Trust
So what can we do differently?
Lean on Lean to find what to fix
[Diagram: process flow annotated with countermeasures]
Get rid of as many silos as possible
Old Silo A Old Silo B Old Silo C Old Silo D
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Get ri...
Shared responsibility matters more than org model
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
...
Why focus on getting rid of handoffs?
1. Your people are your most valuable assets
2. The SRE skillset is expensive
3. Sta...
SREs are expensive, stay out of their way!
Ticket
Queue ✅Ticket
Queue
Ticket
Queue
Ticket
Queue
Backlog
Ticket
Queue
Ticke...
What about the handoffs you can’t get rid of?
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Fu...
Operations as a Service: Turn handoffs into self-service
Operations as a Service
On
Demand
On
Demand
On
Demand
On
Demand
O...
Development Team 1
Development Team 2
Development Team n
Ops/SRE
Team
Operations as a Service
On
Demand
On
Demand
On
Deman...
Operations as a Service: Popular Uses for SRE
Environment
"I could fix it, if I could get to it”
Operations as a Service: Popular Uses for SRE
“Avoiding the dogpile”
I think it's a problem with
dbcluster07-store2.uswest....
“I don’t read wikis. I’m an expert.”
docs
Service has changed. This flag is now
required or bad things will happen!
Pause m...
Operations as a Service: Popular Uses for SRE
“Uneven and hidden skills”
I don’t know
how to do X.
I know how
to do X.
I k...
“Let me do that for you again… and again”
Done.
I need you to
do X
Later…
Ticket
Other
work
Done.
I need you to
do X
Later...
Use tickets only for what they are good for
1. Documenting true problems/issues/exceptions
2. Routing for necessary approval...
But won’t Security or Compliance stop you?
Operations as a Service
On
Demand
On
Demand
On
Demand
On
Demand
Ops
(embedded)C...
But what about ITIL®?
• Ask ITIL people and they say SRE is ITIL compatible
• Ask people who have seen ITIL implemented a...
“Shift Left” the ability to take action
Push the ability to take action this direction
escalate
1° 2° 3° 4°
escalate escal...
Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with empha...
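Tracking toil per team and checking it against a limit can start as a very small calculation. The sketch below assumes each team logs hours as either toil or engineering work. The function name, the input shape, and the 50% default limit are illustrative assumptions, not figures from the talk.

```python
def toil_report(hours_by_team, limits=None, default_limit=0.5):
    """Compute each team's toil share and flag teams over their limit.

    hours_by_team maps team name -> (toil_hours, engineering_hours).
    limits maps team name -> maximum acceptable toil share (0.0 to 1.0);
    teams without an entry fall back to default_limit (an assumed value).
    """
    limits = limits or {}
    report = {}
    for team, (toil_hours, engineering_hours) in hours_by_team.items():
        total = toil_hours + engineering_hours
        share = toil_hours / total if total else 0.0
        limit = limits.get(team, default_limit)
        report[team] = {
            "toil_share": share,
            "limit": limit,
            "over_limit": share > limit,
        }
    return report


if __name__ == "__main__":
    hours = {"payments": (30, 10), "platform": (10, 30)}
    for team, row in toil_report(hours, limits={"platform": 0.3}).items():
        print(team, row)
```

A team flagged as over its limit is a candidate for funded toil-reduction work.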
Start a book club
Recap
SRE is more than a title
Leverage the Operations as a
Service design pattern
“Shift-Left” control and decision
makin...
Let’s talk…
@damonedwards
damon@rundeck.com
https://www.rundeck.com/oaas
Dive Deeper Into Operations as a Service:

Clearing the Way For SRE In the Enterprise - Damon Edwards at SREcon EU 2018

Slides from Damon Edwards presentation at SREcon in Dusseldorf, Germany on 30 Aug 2018.

Video will be available here:
https://www.usenix.org/conference/srecon18europe/presentation/edwards

Published in: Technology

  1. 1. Clearing the Way For SRE in the Enterprise Damon Edwards @damonedwards
  2. 2. Community Ops Improvement DevOps Ops Tools Damon Edwards
  3. 3. Digital Agile DevOps CI/CD Cloud Docker Kubernetes Microservices CHANGE Wow That is cool I wish I could work there
  4. 4. OpsBusiness Idea Shorter Time-to-Market Fast Feedback from Users Dev Ops Running Services Improved Quality Digital and DevOps Availability Auditing Security Compliance "Go faster!" “Open up!” “Lock it down!” “Great for Dev, but what about Ops?”
  5. 5. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it.
  6. 6. Jane Doe Systems Administrator
  7. 7. Jane Doe Systems Administrator We have SysAdmins
  8. 8. Jane Doe Systems Administrator They should be SREs!
  9. 9. Jane Doe SRE They should be SREs!
  10. 10. ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  11. 11. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Executive View: Everything takes too long, costs too much, and breaks too often!
  12. 12. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Executive View: Everything takes too long, costs too much, and breaks too often! (False) SRE Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Executive View: Everything takes too long, costs too much, and breaks too often!
  13. 13. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
  14. 14. Principles of SRE are what set SRE apart
  15. 15. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  17. 17. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  18. 18. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops
  19. 19. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  20. 20. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  21. 21. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  22. 22. Principles of SRE are what set SRE apart Stephen Thorne At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  23. 23. Forces That Undermine SRE Principles Silos Queues Excessive Toil Low Trust
  25. 25. Silos Backlog Information PrioritiesTools
  26. 26. Backlog Information I need X PrioritiesTools Silos
  27. 27. Backlog Information I need X PrioritiesTools Silos Backlog I do X Requests for X Silo A Information Priorities Silo B Tools
  28. 28. Silos cause disconnects and mismatches Backlog Information I need X PrioritiesTools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity
  29. 29. 1 2 3 Silos Interfere with feedback loops
  30. 30. 1 2 3 Silos Interfere with feedback loops Producer Consumer Ops Ops Ops
  31. 31. Function A Function B Function C Silos create labor pools of functional specialists Requests fulfilled by semi- manual or manual effort Primary management focus is on protecting team capacity
  32. 32. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  33. 33. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X
  34. 34. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X
  35. 35. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X Struggling to keep up with demand and unable to protect capacity X
  36. 36. Forces That Undermine SRE Principles Silos Queues Toil Low Trust
  37. 37. How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B
  38. 38. How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B Ticket Queue
  39. 39. ?? Silo A Silo B We all know how well that works Ticket Queue
  40. 40. Request queues are an expensive way to manage work Ticket Queue Queues Create… Longer Cycle Time Increased Risk More Variability More Overhead Lower Quality Less Motivation Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
  41. 41. What do queues do to value streams?
  42. 42. What do queues do to value streams? Queue A Queue B
  43. 43. What do queues do to value streams? Queue A Queue B Queues disintegrate and obfuscate value streams
  44. 44. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue
  45. 45. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue Snowflakes (each unique, technically acceptable but unreproducible and brittle)
  46. 46. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  47. 47. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X
  48. 48. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X
  49. 49. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X Queues obfuscate the pressure being put on request fulfillers X
  50. 50. Forces That Undermine Operations Silos Queues Toil Low Trust
  51. 51. Toil is the enemy of SRE
  52. 52. Toil is the enemy of SRE “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” -Vivek Rau Google
  53. 53. Toil vs. Engineering Work Toil Engineering Work Lacks Enduring Value Builds Enduring Value Rote, Repetitive Creative, Iterative Tactical Strategic Increases With Scale Enables Scaling Can Be Automated Requires Human Creativity
  54. 54. Excessive toil prevents fixing the system Toil Engineering Work E.W.Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  56. 56. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  57. 57. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X
  58. 58. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X
  59. 59. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X Buried in toil… no capacity for engineering work to reduce toil. X
  60. 60. Forces That Undermine Operations Silos Queues Toil Low Trust
  61. 61. Where are decisions made? Who can take action? escalate 1° 2° 3° 4° escalate escalateor Decisions made here
  62. 62. All work is contextual John Allspaw
  63. 63. All work is contextual rm -rf $PATHNAME John Allspaw
  64. 64. All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  67. 67. All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  69. 69. All work is contextual rm -rf $PATHNAME Answer is always “it depends” John Allspaw
  70. 70. escalate 1° 2° 3° 4° escalate escalateor Context Where are decisions made? Who can take action?
  71. 71. Low trust + approvals = illusion of control Ticket System
  72. 72. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and
  73. 73. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”)
  74. 74. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”)
  75. 75. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  78. 78. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with?
  79. 79. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call?
  80. 80. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
  82. 82. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  83. 83. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X
  84. 84. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X
  85. 85. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X People aren’t trusted to plan or design their own work X
  86. 86. Forces That Undermine Operations Silos Queues Toil Low Trust
  87. 87. So what can we do differently?
  88. 88. Lean on Lean to find what to fix [Diagram: value stream map annotated with countermeasures and improvement storyboards] 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan)
  89. 89. Lean on Lean to find what to fix [Diagram: value stream map annotated with countermeasures and improvement storyboards] 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline
  90. 90. Lean on Lean to find what to fix [Diagram: value stream map annotated with countermeasures and improvement storyboards] 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery
  91. 91. Lean on Lean to find what to fix [Diagram: value stream map annotated with countermeasures and improvement storyboards] 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata)
  92. 92. Lean on Lean to find what to fix [Diagram: value stream map annotated with countermeasures and improvement storyboards] 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata) Make it a part of your organization’s discipline
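A value stream map can be reduced to simple arithmetic. The sketch below is illustrative only and is not from the talk: it assumes each mapped step records active work time and queue time (the `Step` fields, step names, and hours are hypothetical), then computes total lead time, flow efficiency, and the step with the longest wait, which is often a sensible first place to look for a countermeasure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    process_hours: float  # time someone is actively working on the item
    wait_hours: float     # time the item sits in a queue before this step


def lead_time(steps: List[Step]) -> float:
    """Total elapsed time from request to done, including waiting."""
    return sum(s.process_hours + s.wait_hours for s in steps)


def flow_efficiency(steps: List[Step]) -> float:
    """Fraction of lead time spent on actual work (0.0 if nothing recorded)."""
    total = lead_time(steps)
    if total == 0:
        return 0.0
    return sum(s.process_hours for s in steps) / total


def biggest_wait(steps: List[Step]) -> Step:
    """The step whose queue contributes the most waiting."""
    return max(steps, key=lambda s: s.wait_hours)


# Hypothetical numbers from mapping one recent delivery.
steps = [
    Step("request environment", 0.5, 48.0),
    Step("provision", 2.0, 24.0),
    Step("deploy", 1.0, 4.0),
    Step("verify", 0.5, 0.0),
]

print(lead_time(steps), flow_efficiency(steps), biggest_wait(steps).name)
```

With these made-up numbers, almost all of the lead time is waiting in queues rather than work, which is the pattern the mapping exercise is meant to expose.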
  93. 93. Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D
  94. 94. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible
  95. 95. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible Key 1: get rid of as many handoffs as possible
  96. 96. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible Key 1: get rid of as many handoffs as possible Key 2: “Horizontal” shared responsibility, not everyone doing everything!
  97. 97. Shared responsibility matters more than org model “Netflix” Model: Cross-Functional Team 1, Cross-Functional Team 2, Cross-Functional Team n “Google” Model: Development Team 1, Development Team 2, Development Team n, SRE Team (clear handoff requirements, error budget consequences)
  99. 99. Shared responsibility matters more than org model “Netflix” Model: Cross-Functional Team 1, Cross-Functional Team 2, Cross-Functional Team n “Google” Model: Development Team 1, Development Team 2, Development Team n, SRE Team (clear handoff requirements, error budget consequences) Same high-quality, high-velocity results!
  100. 100. Why focus on getting rid of handoffs?
  101. 101. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets
  102. 102. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive
  103. 103. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive 3. Stay out of their way!
  104. 104. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)]
  105. 105. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)] SRE OODA Loop (Observe, Orient, Decide, Action) Reduce friction:
  106. 106. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)] SRE OODA Loop (Observe, Orient, Decide, Action) Reduce friction: Observe: Invest in the right instrumentation
  107. 107. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)] SRE OODA Loop (Observe, Orient, Decide, Action) Reduce friction: Observe: Invest in the right instrumentation Orient: Invest in collaboration, checklists, investigatory tools
  108. 108. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)] SRE OODA Loop (Observe, Orient, Decide, Action) Reduce friction: Observe: Invest in the right instrumentation Orient: Invest in collaboration, checklists, investigatory tools Decide: Empower them to make decisions!
  109. 109. SREs are expensive, stay out of their way! [Diagram contrasting “Not this:” (work waiting in multiple ticket queues) with “This:” (work flowing from a backlog)] SRE OODA Loop (Observe, Orient, Decide, Action) Reduce friction: Observe: Invest in the right instrumentation Orient: Invest in collaboration, checklists, investigatory tools Decide: Empower them to make decisions! Action: Empower them to take action!
  110. 110. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities
  111. 111. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue
  112. 112. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue
  113. 113. Operations as a Service: Turn handoffs into self-service Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist
  114. 114. Development Team 1 Development Team 2 Development Team n Ops/SRE Team Operations as a Service On Demand On Demand On Demand On Demand Ops (builds & operates) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Operations as a Service: Works with any org model
  115. 115. Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it”
  116. 116. Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it” Environment OaaS
  117. 117. Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” “I think it’s a problem with dbcluster07-store2.uswest.acme” [Diagram: many people each run “$ top” on dbcluster07-store2.uswest.acme]
  118. 118. Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” “I think it’s a problem with dbcluster07-store2.uswest.acme” [Diagram: many people each run “$ top” on dbcluster07-store2.uswest.acme] OaaS: “I think it’s a problem with dbcluster07-store2.uswest.acme” “$ top” “Healthcheck store2 - all”
  119. 119. Operations as a Service: Popular Uses for SRE “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later…
  120. 120. Operations as a Service: Popular Uses for SRE “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later… OaaS Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart” I’ve done this before. I’ve got this. Environment Later… Update Restart Job ✅ OaaS
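The “Update Restart Job” idea on the slide above is that procedure changes live in the job, not in a wiki the expert never reads. The sketch below is one hypothetical way to express that, not the talk's implementation: the flag name, the `monitoring pause` / `monitoring resume` commands, and the `run` callable are all assumptions made for illustration.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List

# Hypothetical flag name; the slide only says "this flag is now required".
REQUIRED_FLAGS = ["--new-required-flag"]

Runner = Callable[[List[str]], None]


@contextmanager
def monitoring_paused(run: Runner) -> Iterator[None]:
    """Pause monitoring around a risky step so nobody gets woken up."""
    run(["monitoring", "pause"])
    try:
        yield
    finally:
        # Always resume, even if the wrapped step fails.
        run(["monitoring", "resume"])


def restart_job(run: Runner) -> None:
    """What the caller sees as a plain 'restart'.

    The job owner updates this function when the service changes, so
    callers pick up new flags and steps without reading any docs.
    """
    with monitoring_paused(run):
        run(["restart"] + REQUIRED_FLAGS)


# Record the commands instead of executing them, to show the sequence.
executed: List[List[str]] = []
restart_job(executed.append)
for cmd in executed:
    print(" ".join(cmd))
```

In a real setup `run` would hand each command to whatever executes jobs in your environment; here it only records them so the sequence is visible.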
  121. 121. Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y.
  122. 122. Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y. OaaS “Do X” “Define Y Procedure” “Define X Procedure” “Do Y” “Do X+Y”
  123. 123. Operations as a Service: Popular Uses for SRE “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh… Done. I need you to do X Ticket Other work
  124. 124. Operations as a Service: Popular Uses for SRE “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh… Done. I need you to do X Ticket Other work OaaS Do X Later… Other work 1 Later… Other work 2 Other work 3 Do X Do X OaaS OaaS
  125. 125. Use tickets only for what they are good for Ticket System
  126. 126. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions Ticket System
  127. 127. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Ticket System
  128. 128. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Not as a general purpose work management system! Ticket System
  129. 129. But won’t Security or Compliance stop you? Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Build in Security Here Build in Compliance Here
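One reading of “build in” security and compliance is that the self-service layer itself enforces who may run what and records every attempt. The sketch below illustrates that reading under stated assumptions: the `Job` and `Catalog` classes, role names, and job name are hypothetical and do not describe any particular product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Job:
    name: str
    action: Callable[[], str]  # the vetted procedure, defined by its owner
    allowed_roles: Set[str]    # security: who may trigger it


@dataclass
class Catalog:
    jobs: Dict[str, Job] = field(default_factory=dict)
    # compliance: every attempt is recorded as (user, job, outcome)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, job: Job) -> None:
        self.jobs[job.name] = job

    def run(self, user: str, role: str, job_name: str) -> str:
        job = self.jobs[job_name]
        allowed = role in job.allowed_roles
        self.audit_log.append((user, job_name, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} ({role}) may not run {job_name}")
        return job.action()


catalog = Catalog()
catalog.register(Job("healthcheck-store2", lambda: "ok", {"dev", "sre"}))

# A developer runs the check on demand: no ticket, no shell access needed.
result = catalog.run("jane", "dev", "healthcheck-store2")
print(result, catalog.audit_log)
```

Because access checks and the audit trail sit in one place, an on-demand request can leave the same evidence a ticket would, without the wait.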
  130. 130. But what about ITIL® ?
  131. 131. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible
  132. 132. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?”
  133. 133. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature
  134. 134. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model
  135. 135. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model • Straight talk: are we doing contortions to defend a sunk cost?
  136. 136. “Shift Left” the ability to take action [Diagram: support tiers 1° 2° 3° 4°, with escalation from each tier to the next]
  137. 137. “Shift Left” the ability to take action [Diagram: support tiers 1° 2° 3° 4°, with escalation from each tier to the next] Push the ability to take action this direction (left, toward 1°)
  138. 138. “Shift Left” the ability to take action [Diagram: support tiers 1° 2° 3° 4°, with escalation from each tier to the next] Push the ability to take action this direction (left, toward 1°) OaaS Enablement and tooling
  139. 139. Reduce Toil
  140. 140. Reduce Toil 1. Track toil levels for each team
  141. 141. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team
  142. 142. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team 3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
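The three steps above can be sketched as a small calculation. This is an illustration under assumptions, not a method from the talk: the `TeamWeek` fields, team names, and hours are hypothetical, and the default limit of 50% echoes the cap described in Google's SRE book rather than anything stated on these slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamWeek:
    team: str
    toil_hours: float   # manual, repetitive, interrupt-driven work
    total_hours: float  # all hours the team worked that week


def toil_fraction(week: TeamWeek) -> float:
    """Step 1: track the toil level as a share of total capacity."""
    if week.total_hours == 0:
        return 0.0
    return week.toil_hours / week.total_hours


def teams_over_limit(weeks: List[TeamWeek], limit: float = 0.5) -> List[TeamWeek]:
    """Steps 2 and 3: apply the limit and rank who needs funding first."""
    over = [w for w in weeks if toil_fraction(w) > limit]
    return sorted(over, key=toil_fraction, reverse=True)


# Hypothetical data for one week.
weeks = [
    TeamWeek("db", 30.0, 40.0),
    TeamWeek("web", 10.0, 40.0),
    TeamWeek("platform", 24.0, 40.0),
]

for w in teams_over_limit(weeks):
    print(f"{w.team}: {toil_fraction(w):.0%} toil")
```

Sorting by toil fraction puts the most overloaded team first, matching the slide's emphasis on funding teams that are over their limit.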
  143. 143. Start a book club
  144. 144. Recap SRE is more than a title. Understand the forces undermining SRE. Focus on removing silos and queues. Leverage the Operations as a Service design pattern. “Shift-Left” control and decision making. Reduce toil to create capacity to change. [Diagrams repeated from earlier slides: old silos vs. cross-functional teams; Operations as a Service; toil vs. engineering work, where toil at a manageable percentage of capacity leaves room to reduce toil and improve the business, and toil at an unmanageable percentage of capacity (“Engineering Bankruptcy”) leaves no capacity for either; ITIL books, CAB calendar, provisioning process, “Your new title is SRE. Now write code and be better at ops.”] Dilbert characters © Scott Adams www.dilbert.com
  145. 145. Let’s talk… @damonedwards damon@rundeck.com Dive Deeper Into Operations as a Service: https://www.rundeck.com/oaas
