Clearing the Way
For SRE in the Enterprise
Damon Edwards
@damonedwards
Community
Ops Improvement
DevOps
Ops Tools
Damon Edwards
Digital
Agile
DevOps
CI/CD
Cloud
Docker
Kubernetes
Microservices
CHANGE
Wow
That is cool
I wish I could
work there
Ops
Business
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and DevOps
Availability Auditing
Security Compliance
"Go faster!"
“Open up!”
“Lock it down!”
“Great for Dev, but what about Ops?”
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Jane Doe
Systems Administrator
We have
SysAdmins
They should be
SREs!
Jane Doe
SRE
ITIL Book 1
ITIL Book 2
ITIL Book 3
ITIL Book 4
ITIL Book 5
Quality!
is job
#1
Sys
Admin
CAB CALENDAR
Your new title is SRE.
Now write code and be better at ops.
PROVISIONING PROCESS
Dilbert characters © Scott Adams www.dilbert.com
SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, costs too much,
and breaks too often!
Executive
View
(False) SRE
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, costs too much,
and breaks too often!
Executive
View
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
SLO and Error Budgets: Tools for Shared Responsibility
[Diagram: a scale from 0 to 100 showing the Service Level Indicator, the Service Level Objective, and the Error Budget* between the objective and 100]
(*Use this to improve the service)
Shared by Dev, Biz, and Ops
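The relationship between the three terms on this slide can be sketched in code. This is a minimal illustration rather than anything shown in the talk: the function name, the request counts, and the 99.9% objective are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli_percent: float  # measured Service Level Indicator
    allowed_bad: float  # bad events the SLO permits over the window
    actual_bad: int     # bad events actually observed
    remaining: float    # error budget left (negative means overspent)


def error_budget(slo_percent: float, good: int, total: int) -> BudgetReport:
    """Compare a measured SLI against an SLO and report the error budget."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    sli_percent = 100.0 * good / total
    allowed_bad = total * (100.0 - slo_percent) / 100.0
    actual_bad = total - good
    return BudgetReport(sli_percent, allowed_bad, actual_bad,
                        allowed_bad - actual_bad)


# Hypothetical window: 1,000,000 requests, 500 of which failed.
report = error_budget(slo_percent=99.9, good=999_500, total=1_000_000)
print(f"SLI {report.sli_percent:.2f}%, "
      f"budget remaining: {report.remaining:.0f} requests")
```

In this made-up window roughly half of the budget is left. A negative `remaining` is where the "with consequences" part of the first principle would come into play.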
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Source: Stephen Thorne, “Principles of SRE”, at DevOps Enterprise Summit London 2018
https://youtu.be/c-w_GYvi0eA
Forces That Undermine SRE Principles
Silos Queues
Excessive Toil Low Trust
Silos
[Diagram: Silo A (“I need X”) and Silo B (“I do X”, with incoming “Requests for X”); each silo has its own Backlog, Information, Priorities, and Tools]
Silos cause disconnects and mismatches
[Diagram: the same two silos, mismatched in Context, Process, Tooling, and Capacity]
Silos Interfere with feedback loops
Producer Consumer
Silos create labor pools of functional specialists
[Diagram: separate Ops groups for Function A, Function B, and Function C]
Requests fulfilled by semi-manual or manual effort
Primary management focus is on protecting team capacity
Silos Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
   X Disjointed silos make meaningful SLOs and shared responsibility almost impossible
2. SREs have time to make tomorrow better than today?
   X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity
3. SRE teams have the ability to regulate their workload?
   X Struggling to keep up with demand and unable to protect capacity
Forces That Undermine SRE Principles
Silos Queues
Toil Low Trust
How do we cover for our cross-silo disconnects and mismatches?
Silo A Silo B
Ticket
Queue
??
Silo A Silo B
We all know how well that works
Ticket
Queue
Request queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
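One way to see why queues lengthen cycle time is the textbook single-server (M/M/1) queueing model, in which average time in the system grows sharply as the worker gets busier. The sketch below is only an illustration under that model's assumptions (random arrivals, one worker, first come first served). The one-hour service time is a hypothetical figure, and real ticket queues are messier.

```python
def average_cycle_time(service_time: float, utilization: float) -> float:
    """Average time a request spends waiting plus being worked (M/M/1).

    W = service_time / (1 - utilization)
    """
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Hypothetical team whose requests take one hour of hands-on work each.
for busy in (0.5, 0.8, 0.9, 0.95):
    hours = average_cycle_time(service_time=1.0, utilization=busy)
    print(f"{busy:.0%} busy -> about {hours:.1f} hours per request")
```

Under this model, a team kept busier does not just get slower in proportion. The wait for each new request climbs steeply as utilization approaches 100%.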
What do queues do to value streams?
Queue
A
Queue
B
Queues disintegrate and
obfuscate value streams
Ticket queues become “snowflake makers”
??
Silo A Silo B
Ticket
Queue
Snowflakes
(each unique, technically acceptable but unreproducible and brittle)
Ticket Queues Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
   X Tickets reinforce siloed behaviors and obfuscate the value stream
2. SREs have time to make tomorrow better than today?
   X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity
3. SRE teams have the ability to regulate their workload?
   X Queues obfuscate the pressure being put on request fulfillers
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Toil is the enemy of SRE
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
- Vivek Rau, Google
Toil vs. Engineering Work
Toil                  | Engineering Work
Lacks Enduring Value  | Builds Enduring Value
Rote, Repetitive      | Creative, Iterative
Tactical              | Strategic
Increases With Scale  | Enables Scaling
Can Be Automated      | Requires Human Creativity
Excessive toil prevents fixing the system
Toil at manageable percentage of capacity: Engineering Work (E.W.) is available to reduce toil and improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business
Excessive Toil Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
   X Being buried in toil keeps the team from contributing engineering work to uphold its end of the shared responsibility deal
2. SREs have time to make tomorrow better than today?
   X Buried in toil… no capacity for engineering work to reduce toil.
3. SRE teams have the ability to regulate their workload?
   X Buried in toil… no capacity for engineering work to reduce toil.
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Where are decisions made? Who can take action?
[Diagram: support tiers 1°, 2°, 3°, 4°, with work escalated from each tier to the next]
Decisions made here
All work is contextual (John Allspaw)
rm -rf $PATHNAME
Is this dangerous?
Answer is always “it depends”
Where are decisions made? Who can take action?
[Diagram: support tiers 1°, 2°, 3°, 4°, with work escalated from each tier to the next]
Context
Low trust + approvals = illusion of control
Ticket
System
Add up the total number of approval requests and
…subtract the info radiators (“I need to be in the loop”)
…subtract the CYAs (“Prove you followed the process”)
…subtract the too removed to judge (“mostly guessing”)
How many are you left with?
How many were the right call?
How many got rejected?
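The subtraction exercise above can be run against an export of approval records. The sketch below is a minimal illustration: the category labels, field names, and sample records are hypothetical and not taken from any real ticket system. “How many were the right call?” still needs human review, so the code does not attempt it.

```python
# Categories the slide says to subtract. The labels are hypothetical.
NOT_REAL_DECISIONS = {"info_radiator", "cya", "too_removed"}


def audit_approvals(approvals):
    """Count approvals that were real decisions, and how many were rejected."""
    real = [a for a in approvals if a["kind"] not in NOT_REAL_DECISIONS]
    rejected = sum(1 for a in real if a["outcome"] == "rejected")
    return {"total": len(approvals), "left": len(real), "rejected": rejected}


# Hypothetical sample of approval requests, hand-labeled by category.
sample = [
    {"kind": "info_radiator", "outcome": "approved"},
    {"kind": "info_radiator", "outcome": "approved"},
    {"kind": "cya", "outcome": "approved"},
    {"kind": "too_removed", "outcome": "approved"},
    {"kind": "judgment", "outcome": "approved"},
    {"kind": "judgment", "outcome": "rejected"},
]
print(audit_approvals(sample))
```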
Low Trust Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
   X Cultures of low trust have a really difficult time with shared responsibility
2. SREs have time to make tomorrow better than today?
   X People closest to problems know what to fix, but tasking, priorities, and decisions are largely out of their control
3. SRE teams have the ability to regulate their workload?
   X People aren’t trusted to plan or design their own work
Forces That Undermine Operations
Silos Queues
Toil Low Trust
So what can we do differently?
Lean on Lean to find what to fix
1. Map the end-to-end flow of information and artifacts (using a recent delivery or event)
2. Identify what slows lead times, undermines quality, and impacts flow
3. Identify countermeasures and create improvement storyboards (justification/plan)
All processes should be studied with an improvement discipline
Incidents are just as much a “process” as delivery
Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata)
Make it a part of your organization’s discipline
Get rid of as many silos as possible
[Diagram: Old Silo A, Old Silo B, Old Silo C, and Old Silo D reorganized into Cross-Functional Team 1, Cross-Functional Team 2, ... Cross-Functional Team n]
Key 1: get rid of as many handoffs as possible
Key 2: “Horizontal” shared responsibility, not “everyone does everything”!
Shared responsibility matters more than org model
“Netflix” Model: Cross-Functional Team 1, Cross-Functional Team 2, ... Cross-Functional Team n
“Google” Model: Development Team 1, Development Team 2, ... Development Team n, working with an SRE Team (clear handoff requirements, error budget consequences)
Same high-quality, high-velocity results!
Why focus on getting rid of handoffs?
1. Your people are your most valuable assets
2. The SRE skillset is expensive
3. Stay out of their way!
SREs are expensive, stay out of their way!
[Diagram: two routes from Backlog to ✅, one labeled “Not this:” and one labeled “This:”, contrasting how many Ticket Queues the work has to pass through]
Reduce friction in the SRE OODA Loop:
Observe: Invest in the right instrumentation
Orient: Invest in collaboration, checklists, investigatory tools
Decide: Empower them to make decisions!
Act: Empower them to take action!
What about the handoffs you can’t get rid of?
[Diagram: Old Silos A through D reorganized into Cross-Functional Teams 1 through n, which still reach the remaining Specialist Capabilities through Ticket Queues]
Operations as a Service: Turn handoffs into self-service
[Diagram: Operations as a Service. Ops builds and operates On Demand Ops Capabilities, each of which can be used by an SRE, Dev, or Specialist]
Operations as a Service: Works with any org model
[Diagram: the same On Demand Ops Capabilities serve Cross-Functional Product Teams 1 through n with embedded Ops, and Development Teams 1 through n with an Ops/SRE Team]
Operations as a Service: Popular Uses for SRE
“I could fix it, if I could get to it”
[Diagram: OaaS provides access to the Environment]
Operations as a Service: Popular Uses for SRE
“Avoiding the dogpile”
Without OaaS: “I think it’s a problem with dbcluster07-store2.uswest.acme”, and everyone runs “$ top” on it at once
With OaaS: “I think it’s a problem with dbcluster07-store2.uswest.acme”, and one “Healthcheck store2 - all” runs through OaaS
Operations as a Service: Popular Uses for SRE
“I don’t read wikis. I’m an expert.”
Without OaaS: the docs are updated (“Service has changed. This flag is now required or bad things will happen!” and “Pause monitoring first or we all get woken up!”). Later… “I’ve done this before. I’ve got this.” runs “restart -doit -now” against the Environment.
With OaaS: the same changes go in as an update to the Restart Job. Later… “I’ve done this before. I’ve got this.” runs “restart” through OaaS ✅
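The idea that the procedure travels with the self-service job, instead of living in docs nobody reads, can be sketched as follows. Everything here is hypothetical: the function names, the service name, and the `--new-required-flag` placeholder stand in for whatever a real job would run. The steps are only recorded as strings, not executed.

```python
from contextlib import contextmanager


@contextmanager
def monitoring_paused(service, steps):
    """Pause alerting for the duration of the block, then always resume it."""
    steps.append(f"pause monitoring for {service}")
    try:
        yield
    finally:
        steps.append(f"resume monitoring for {service}")


def restart_job(service):
    """Self-service restart that carries the current procedure with it."""
    steps = []
    with monitoring_paused(service, steps):
        # The newly required flag lives in the job, so callers cannot forget it.
        steps.append(f"restart {service} --new-required-flag")
    return steps


for step in restart_job("store2"):
    print(step)
```

The expert who has “done this before” still just asks for a restart. The person who owns the procedure updates the job once, and every later caller gets the new behavior.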
Operations as a Service: Popular Uses for SRE
“Uneven and hidden skills”
Without OaaS: “I don’t know how to do X.” / “I know how to do X.” / “I know how to do Y.” / “I don’t know how to do Y.”
With OaaS: those who know how contribute “Define X Procedure” and “Define Y Procedure”; anyone can then run “Do X”, “Do Y”, or “Do X+Y”
Operations as a Service: Popular Uses for SRE
“Let me do that for you again… and again”
Without OaaS: “I need you to do X” arrives as a Ticket and interrupts other work. “Done.” Later… the same request again. “Done.” Later… again. “Sigh.. Done.”
With OaaS: the requester runs “Do X” through OaaS each time, while Other work 1, Other work 2, and Other work 3 continue.
Use tickets only for what they are good for
1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general-purpose work management system!
But won’t Security or Compliance stop you?
[Diagram: Operations as a Service. Ops builds and operates On Demand Ops Capabilities used by SREs, Devs, or Specialists in Cross-Functional Product Teams 1 through n with embedded Ops]
[Annotations on the diagram: “Build in Security here” and “Build in Compliance here”]
But what about ITIL®?
• Ask ITIL people and they say SRE is ITIL compatible
• Ask people who have seen ITIL implemented and they say “how?”
• Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature
• ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model
• Straight talk: are we doing contortions to defend a sunk cost?
“Shift Left” the ability to take action
[Diagram: support tiers 1°, 2°, 3°, 4°, with work escalated from each tier to the next]
Push the ability to take action this direction (left, toward 1°)
OaaS Enablement and tooling
Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
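The three steps above can be sketched as a small report. This is an illustration only: the team names and hours are made up, and the default limit of 50% is an assumption (a figure commonly cited in Google's SRE writing), not a number from this talk.

```python
def toil_report(hours_by_team, limit=0.5):
    """Return teams over the toil limit, worst first, with their toil share."""
    over = []
    for team, hours in hours_by_team.items():
        total = hours["toil"] + hours["engineering"]
        if total <= 0:
            continue  # no recorded hours, nothing to report
        share = hours["toil"] / total
        if share > limit:
            over.append((team, share))
    return sorted(over, key=lambda item: item[1], reverse=True)


# Step 1: tracked hours per team for one week (hypothetical numbers).
tracked = {
    "payments": {"toil": 30, "engineering": 10},
    "search": {"toil": 12, "engineering": 28},
    "platform": {"toil": 24, "engineering": 16},
}

# Steps 2 and 3: apply the limit and list where funding should go first.
for team, share in toil_report(tracked):
    print(f"{team}: {share:.0%} toil, over the limit")
```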
Start a book club
Recap
• SRE is more than a title
• Leverage the Operations as a Service design pattern
• “Shift-Left” control and decision making
• Focus on removing silos and queues
• Reduce toil to create capacity to change
• Understand the forces undermining SRE
Let’s talk…
@damonedwards
damon@rundeck.com
Dive Deeper Into Operations as a Service: https://www.rundeck.com/oaas

Clearing the Way For SRE In the Enterprise

  • 1.
    Clearing the Way ForSRE in the Enterprise Damon Edwards @damonedwards
  • 2.
  • 3.
  • 4.
    OpsBusiness Idea Shorter Time-to-Market Fast Feedback fromUsers Dev Ops Running Services Improved Quality Digital and DevOps Availability Auditing Security Compliance "Go faster!" “Open up!” “Lock it down!” “Great for Dev, but what about Ops?”
  • 5.
    Our transformation haslargely ignored Ops. Any ideas? Have you heard of SRE? Google does it.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
    ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  • 11.
    SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  • 12.
    SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View (False) SRE Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  • 13.
    Changing job titles or adding individual skills doesn’t make systems administrators SREs.
  • 14.
    Principles of SRE are what set SRE apart
  • 15.
    Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  • 16.
    Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  • 17.
    SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  • 18.
    SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops
  • 19.
    Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  • 20.
    Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  • 21.
    Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 22.
    Principles of SRE are what set SRE apart Stephen Thorne At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 23.
    Forces That Undermine SRE Principles Silos Queues Excessive Toil Low Trust
  • 24.
    Forces That Undermine SRE Principles Silos Queues Excessive Toil Low Trust
  • 25.
  • 26.
    Backlog Information I need X Priorities Tools Silos
  • 27.
    Backlog Information I need X Priorities Tools Silos Backlog I do X Requests for X Silo A Information Priorities Silo B Tools
  • 28.
    Silos cause disconnects and mismatches Backlog Information I need X Priorities Tools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity
  • 29.
  • 30.
    1 2 3 Silos interfere with feedback loops Producer Consumer Ops Ops Ops
  • 31.
    Function A Function B Function C Silos create labor pools of functional specialists Requests fulfilled by semi-manual or manual effort Primary management focus is on protecting team capacity
  • 32.
    Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  • 33.
    Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X
  • 34.
    Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X
  • 35.
    Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X Struggling to keep up with demand and unable to protect capacity X
  • 36.
    Forces That Undermine SRE Principles Silos Queues Toil Low Trust
  • 37.
    How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B
  • 38.
    How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B Ticket Queue
  • 39.
    ?? Silo A Silo B We all know how well that works Ticket Queue
  • 40.
    Request queues are an expensive way to manage work Ticket Queue Queues Create… Longer Cycle Time Increased Risk More Variability More Overhead Lower Quality Less Motivation Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
  • 41.
    What do queues do to value streams?
  • 42.
    What do queues do to value streams? Queue A Queue B
  • 43.
    What do queues do to value streams? Queue A Queue B Queues disintegrate and obfuscate value streams
  • 44.
    Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue
  • 45.
    Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue Snowflakes (each unique, technically acceptable but unreproducible and brittle)
  • 46.
    Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  • 47.
    Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X
  • 48.
    Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X
  • 49.
    Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X Queues obfuscate the pressure being put on request fulfillers X
  • 50.
    Forces That Undermine Operations Silos Queues Toil Low Trust
  • 51.
    Toil is the enemy of SRE
  • 52.
    Toil is the enemy of SRE “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” - Vivek Rau, Google
  • 53.
    Toil vs. Engineering Work (Toil / Engineering Work): Lacks Enduring Value / Builds Enduring Value; Rote, Repetitive / Creative, Iterative; Tactical / Strategic; Increases With Scale / Enables Scaling; Can Be Automated / Requires Human Creativity
  • 54.
    Excessive toil prevents fixing the system Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  • 55.
    Excessive toil prevents fixing the system Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  • 56.
    Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  • 57.
    Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X
  • 58.
    Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X
  • 59.
    Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X Buried in toil… no capacity for engineering work to reduce toil. X
  • 60.
    Forces That Undermine Operations Silos Queues Toil Low Trust
  • 61.
    Where are decisions made? Who can take action? escalate 1° 2° 3° 4° escalate escalate or Decisions made here
  • 62.
    All work is contextual John Allspaw
  • 63.
    All work is contextual rm -rf $PATHNAME John Allspaw
  • 64.
    All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  • 65.
    All work is contextual rm -rf $PATHNAME John Allspaw
  • 66.
    All work is contextual rm -rf $PATHNAME John Allspaw
  • 67.
    All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  • 68.
    All work is contextual rm -rf $PATHNAME John Allspaw
  • 69.
    All work is contextual rm -rf $PATHNAME Answer is always “it depends” John Allspaw
  • 70.
    escalate 1° 2° 3° 4° escalate escalate or Context Where are decisions made? Who can take action?
  • 71.
    Low trust + approvals = illusion of control Ticket System
  • 72.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and
  • 73.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”)
  • 74.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”)
  • 75.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  • 76.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  • 77.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  • 78.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with?
  • 79.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call?
  • 80.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
  • 81.
    Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
  • 82.
    Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  • 83.
    Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X
  • 84.
    Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X
  • 85.
    Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X People aren’t trusted to plan or design their own work X
  • 86.
    Forces That Undermine Operations Silos Queues Toil Low Trust
  • 87.
    So what can we do differently?
  • 88.
    Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan)
  • 89.
    Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline
  • 90.
    Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery
  • 91.
    Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata)
  • 92.
    Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata) Make it a part of your organization’s discipline
  • 93.
    Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D
  • 94.
    Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n
  • 95.
    Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Key 1: get rid of as many handoffs as possible
  • 96.
    Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Key 1: get rid of as many handoffs as possible Key 2: “Horizontal” shared responsibility, not everyone doing everything!
  • 97.
    Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model
  • 98.
    Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model
  • 99.
    Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model Same high-quality, high-velocity results!
  • 100.
    Why focus on getting rid of handoffs?
  • 101.
    Why focus on getting rid of handoffs? 1. Your people are your most valuable assets
  • 102.
    Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive
  • 103.
    Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive 3. Stay out of their way!
  • 104.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This:
  • 105.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: SRE OODA Loop: Observe Orient Decide Action Reduce friction:
  • 106.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: SRE OODA Loop: Observe Orient Decide Action Reduce friction: Invest in the right instrumentation
  • 107.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: SRE OODA Loop: Observe Orient Decide Action Reduce friction: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools
  • 108.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: SRE OODA Loop: Observe Orient Decide Action Reduce friction: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools Empower them to make decisions!
  • 109.
    SREs are expensive, stay out of their way! Ticket Queue ✅ Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: SRE OODA Loop: Observe Orient Decide Action Reduce friction: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools Empower them to make decisions! Empower them to take action!
  • 110.
    What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities
  • 111.
    What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue
  • 112.
    What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue
  • 113.
    Operations as a Service: Turn handoffs into self-service Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist
  • 114.
    Operations as a Service: Works with any org model Development Team 1 Development Team 2 Development Team n Ops/SRE Team Operations as a Service On Demand On Demand On Demand On Demand Ops (builds & operates) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist
  • 115.
    Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it”
  • 116.
    Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it” Environment OaaS
  • 117.
    Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” “$ top”
  • 118.
    Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “Healthcheck store2 - all” OaaS
  • 119.
    Operations as a Service: Popular Uses for SRE “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later…
  • 120.
    Operations as a Service: Popular Uses for SRE “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later… OaaS Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart” I’ve done this before. I’ve got this. Environment Later… Update Restart Job ✅ OaaS
  • 121.
    Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y.
  • 122.
    Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y. OaaS “Do X” “Define Y Procedure” “Define X Procedure” “Do Y” “Do X+Y”
  • 123.
    Operations as a Service: Popular Uses for SRE “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh.. Done. I need you to do X Ticket Other work
  • 124.
    Operations as a Service: Popular Uses for SRE “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh.. Done. I need you to do X Ticket Other work OaaS Do X Later… Other work 1 Later… Other work 2 Other work 3 Do X Do X OaaS OaaS
  • 125.
    Use tickets only for what they are good for Ticket System
  • 126.
    Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions Ticket System
  • 127.
    Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Ticket System
  • 128.
    Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Not as a general purpose work management system! Ticket System
  • 129.
    But won’t Security or Compliance stop you? Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Build-in Security Here Build-in Compliance Here
  • 130.
  • 131.
    But what about ITIL®? • Ask ITIL people and they say SRE is ITIL compatible
  • 132.
    But what about ITIL®? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?”
  • 133.
    But what about ITIL®? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature
  • 134.
    But what about ITIL®? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model
  • 135.
    But what about ITIL®? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model • Straight talk: are we doing contortions to defend a sunk cost?
  • 136.
    “Shift Left” the ability to take action escalate 1° 2° 3° 4° escalate escalate or
  • 137.
    “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or
  • 138.
    “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or OaaS Enablement and tooling
  • 139.
  • 140.
    Reduce Toil 1. Track toil levels for each team
  • 141.
    Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team
  • 142.
    Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team 3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
  • 143.
  • 144.
    Recap SRE is more than a title Leverage the Operations as a Service design pattern “Shift-Left” control and decision making. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Focus on removing silos and queues Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Reduce toil to create capacity to change Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”) Understand the forces undermining SRE ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  • 145.
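
Returning to the SLO and error budget slides (17 and 18): the relationship between a Service Level Objective, a Service Level Indicator, and the error budget can be sketched with simple arithmetic. This is a minimal illustration, not the deck's own tooling. It assumes a request-based availability SLI, and the function name, the 99.9% target, and the request counts are hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based availability SLO.

    slo_target is a fraction such as 0.999; the other arguments are counts
    over the same measurement window (all values here are illustrative).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # SLI: the measured fraction of successful requests.
    sli = 1.0 - failed_requests / total_requests
    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_exhausted": remaining < 0,
    }


report = error_budget_report(
    slo_target=0.999, total_requests=1_000_000, failed_requests=250
)
print(report)
```

What happens when `budget_exhausted` is true is the "consequences" part of the first principle. The specific policy (for example, prioritizing reliability work over new features) is something Dev, Biz, and Ops would agree on together and is not encoded here.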