SRE for Everyone: Making Tomorrow Better Than Today

Damon Edwards

@damonedwards
2019

Not that far away, maybe in a company just like yours…

Overloaded. Constant firefighting.
[Figure: Project A and Project B buried under a pile of tickets, stamped "DUE: Yesterday!" and "DUE: Tomorrow!"]

Waiting in ticket queues for everything.


Things break. Break again. And again.
[Figure: later… the same thing breaks, and later… the same again. Each "Help!" becomes a ticket, a wait, and an interrupt.]

Everyone is busy, but it doesn’t get any better.
[Figure labels: Improvement Project, Business Delivery, Incidents]

Everything takes too long, costs
too much, and breaks too often!
Executives

Have you heard of SRE?
Google does it.

“SRE… When you ask software engineers to do operations”
“SRE… Next-generation, cloud-native Operations”
“Class SRE implements DevOps”
“SRE… When Ops does more engineering than Ops”

[Figure: business card for "Jane Doe, Systems Administrator", with the title changed to "SRE"]
“We have SysAdmins.”
“They should be SREs!”

[Figure: a "Sys Admin" surrounded by ITIL Books 1 to 5, a "Quality! is job #1" poster, the CAB calendar, and the provisioning process]
“Your new title is SRE. Now write code and be better at ops.”
Dilbert characters © Scott Adams www.dilbert.com

SysAdmins View
• Overloaded. Constant firefighting.
• Waiting in ticket queues for everything.
• Things break. Break again. And again.
• Everyone is busy, but it doesn’t get any better.

Executive View
• Everything takes too long, costs too much, and breaks too often!
• Our transformation has largely ignored Ops. Any ideas?
• Have you heard of SRE? Google does it.

SRE (new name) View
• Still firefighting. Still waiting in ticket queues for everything. Things break. Break again. And again.

Changing job titles or adding individual skills
doesn’t make systems administrators SREs.

• Observability
• Programming Skills
• Distributed Systems Arch.
• Blameless Post-Mortems

Not SRE

SRE is a rethinking of how Operations work gets
done.

Principles are what make SRE different

Stephen Thorne, Google
At DevOps Enterprise Summit London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA

1. SRE needs Service Level Objectives, with consequences

SLO and Error Budgets: Tools for Shared Responsibility
[Figure: a scale from 0 to 100. The Service Level Indicator is measured against the Service Level Objective; the gap between the objective and 100 is the Error Budget*. Dev, Biz, and Ops all sit around the same error budget.]
(*Use this to improve the service)
SLO takes priority!!
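To make the relationship between an SLO and its error budget concrete, here is a minimal sketch. It assumes an availability-style SLI (good events over total events in a window); the function names, the 99.9% target, and the request counts are illustrative assumptions, not figures from the talk.

```python
def error_budget(slo_target, total_events):
    """Events allowed to fail in the window while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target, total_events, failed_events):
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # No traffic in the window yet, so nothing has been spent.
        return 1.0
    return 1.0 - failed_events / budget


def slo_takes_priority(slo_target, total_events, failed_events):
    """True once the budget is spent, i.e. reliability work comes first."""
    return budget_remaining(slo_target, total_events, failed_events) <= 0.0


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
    print(slo_takes_priority(0.999, 1_000_000, 1_200))
```

While budget remains, it can be spent on change; once it is gone, the "consequences" kick in and the SLO takes priority for Dev, Biz, and Ops alike.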

Principles of SRE are what set SRE apart


2. SREs have time to make tomorrow better than today

Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
Vivek Rau, Google

Toil vs. Engineering Work

Toil                    | Engineering Work
Lacks Enduring Value    | Builds Enduring Value
Rote, Repetitive        | Creative, Iterative
Tactical                | Strategic
Increases With Scale    | Enables Scaling
Can Be Automated        | Requires Human Creativity

Excessive Toil Prevents Fixing the System
[Figure: team capacity split between Engineering Work (E.W.) and Toil]
• Toil at manageable percentage of capacity: engineering work can reduce toil and improve the business
• Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business
Downward spiral is inevitable!


3. SRE teams have the ability to regulate their workload

Example:
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
“?!?”

Where to start (the practical approach)



Company-wide culture change (hard!)



Reduce toil. 
Everybody wins!

Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets … stay out of their way!

Start reducing toil today

1. Track toil levels for each team

• Standardize (e.g. meetings and email are “overhead” not “toil”)
• Track
  • Self-reporting
  • Periodic surveys
  • SM or PM interview/sampling
• Don’t get lost in time tracking weeds!
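The tracking step above can be kept deliberately coarse. The sketch below assumes each self-report or survey answer is reduced to a (team, category, hours) tuple and that there are only three standardized categories. The category names, and the choice to count overhead in the denominator, are assumptions for illustration rather than rules from the talk.

```python
from collections import defaultdict

# Hypothetical standardized categories: overhead (meetings, email) is
# recorded separately so it is never mistaken for toil.
CATEGORIES = {"toil", "engineering", "overhead"}


def toil_share(entries):
    """Return each team's toil as a fraction of all hours it reported.

    entries: iterable of (team, category, hours) tuples gathered from
    self-reporting, periodic surveys, or interview sampling.
    """
    total_hours = defaultdict(float)
    toil_hours = defaultdict(float)
    for team, category, hours in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total_hours[team] += hours
        if category == "toil":
            toil_hours[team] += hours
    return {
        team: toil_hours[team] / hours
        for team, hours in total_hours.items()
        if hours > 0
    }


if __name__ == "__main__":
    # Made-up weekly numbers for two hypothetical teams.
    reports = [
        ("db", "toil", 20), ("db", "engineering", 15), ("db", "overhead", 5),
        ("web", "toil", 10), ("web", "engineering", 30),
    ]
    print(toil_share(reports))
```

Rough weekly buckets like these are usually enough to see the trend, which keeps the effort out of the time-tracking weeds.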

2. Set toil limit for each team (50% is conventional wisdom)

3. Fund efforts to reduce toil (with emphasis on teams already over limit)
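One way to turn a toil limit into a funding shortlist is sketched below. It assumes toil levels are already available as a fraction of capacity per team; the team names and numbers are made up, and only the 50% default comes from the slide.

```python
TOIL_LIMIT = 0.50  # "50% is conventional wisdom"


def teams_over_limit(toil_by_team, limit=TOIL_LIMIT):
    """Return (team, overage) pairs for teams above the limit, worst first."""
    over = [
        (team, share - limit)
        for team, share in toil_by_team.items()
        if share > limit
    ]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toil levels as a fraction of each team's capacity.
    levels = {"db": 0.70, "web": 0.25, "net": 0.55}
    for team, overage in teams_over_limit(levels):
        print(f"{team}: {overage:.0%} over the limit")
```

Sorting by overage puts the teams closest to "engineering bankruptcy" at the top of the list for toil-reduction funding.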
Example process: “Code Yellow”
Michael Kehoe and Todd Palino (LinkedIn)
At SREcon Americas 2019

Where to focus?
• Reduce Technical Debt
• Re-Engineer Processes
• Enable Self-Service

Self-Service (“Do X.”)
• Eliminate Interruptions
• Eliminate Waiting
• … and a lot less toil

How to enable self-service?
Empower teams to spot and fix the anti-patterns.

“Do this for me, do it again, then do it again.”
Before: every “I need you to do X” becomes a ticket. The requester waits, you are interrupted, you do X, “Done.” Later… it happens again, and again, while your other work piles up (x2, x3).
After: Self-Service. Requesters do X themselves, and your other work stays your other work.

“I could fix it, but I can’t get to it.”
Before: “I could fix it if I could get to it.” The Environment is out of reach, so it is wait and interrupt.
After: “I’ve got this!” Self-Service provides access to the Environment.

“The dog-pile.”
Before: “I think it’s a problem with db07-store2.uswest.acme” !! Everyone logs in to db07-store2.uswest.acme and runs “$ top”.
After: “I think it’s a problem with db07-store2.uswest.acme”. A Self-Service “healthcheck store2 -all” job runs against db07-store2.uswest.acme.

“I’m an expert, I don’t read the wiki.”
Before: the docs are updated (“Service has changed. Use this flag or bad things will happen!” and “Pause monitoring first or we all get woken up!”). Later… “I’ve done this before. I’ve got this…” and “restart -doit -now” is run against the Environment anyway.
After: the Self-Service Restart Job is updated instead. Later… “I’ve got this.” runs “restart” against the Environment ✅
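The "After" picture amounts to moving the wiki's warnings into the job itself, so the expert who skips the docs still follows the current procedure. A minimal sketch, assuming the real pause and restart commands are supplied by the caller: every name here is hypothetical, and resuming monitoring afterwards is an added assumption (the slide only mentions pausing it first).

```python
from contextlib import contextmanager


@contextmanager
def monitoring_paused(log):
    """Pause monitoring for the duration of the block, then always resume it."""
    log.append("pause monitoring")
    try:
        yield
    finally:
        # Assumption: monitoring should come back even if the restart fails.
        log.append("resume monitoring")


def restart_job(log, restart=lambda: None):
    """Self-service restart: the documented warnings become unskippable steps.

    `restart` stands in for whatever command actually restarts the service.
    """
    with monitoring_paused(log):
        log.append("restart service")
        restart()
    return log


if __name__ == "__main__":
    print(restart_job([]))
```

When the service changes again, only the job definition is updated; nobody has to notice a wiki edit before pressing the button.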

“Known issue… doesn’t get permanent fix”

Self-Service Operations Design Pattern (in a nutshell)
[Figure: a Consumer of Ops Capabilities uses Self-Service, on demand, to reach Ops Capabilities that are backed by Specialist Knowledge]
• Pull-Based
• Accept tools/languages that teams want to use
• Define “guardrails” to provide work safety
• Let people who “push buttons” define the buttons
• Build in security and compliance
Self-Service is ultimately about user experience
[Figure: a Consumer of Ops Capabilities uses Self-Service, on demand, to reach Ops Capabilities that are backed by Specialist Knowledge]
1. Work how they want to work (GUI, API, CLI)
2. “Guardrails” (Smart options that helpfully constrain)
3. Dynamic resource model (Up-to-date details of your environment)
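A dynamic resource model means the choices offered to a user are computed from the current inventory at request time rather than from a list someone typed in months ago. A minimal sketch, assuming the inventory is available as a dictionary of node attributes: the `role` and `env` attributes and the `fetch_inventory` callable are hypothetical stand-ins for whatever source of truth is in use.

```python
def matching_nodes(inventory, **filters):
    """Return the names of nodes whose attributes match every filter."""
    return sorted(
        name
        for name, attrs in inventory.items()
        if all(attrs.get(key) == value for key, value in filters.items())
    )


def restart_targets(fetch_inventory):
    """Build the job's node choices at request time, never from a stale list."""
    return matching_nodes(fetch_inventory(), role="db", env="prod")


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from a CMDB,
    # cloud API, or similar source.
    inventory = {
        "db07-store2.uswest.acme": {"role": "db", "env": "prod"},
        "web01.uswest.acme": {"role": "web", "env": "prod"},
    }
    print(restart_targets(lambda: inventory))
```

Because the filter is evaluated on every request, a newly provisioned database node shows up as a valid target without anyone editing the job.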

Self-Service can also be a foundation
for strategic initiatives

Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF

[Figure: DEV, STAGE, and PROD environments, each with Services, Monitoring, and Scripts/Tools. Approved jobs are promoted from one environment to the next, and Self-Service empowers Dev & QA, NOC/Ops, and Dev.]
• Reduced MTTR by 92%
• Reduced escalations by 50%
• Reduced overall support costs by 55%

Strategic: Reduce compliance burden & improve
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg

Optimized for compliance
• 86,000+ employees
• 60+ countries
• Highly regulated

[Figure: lines of business (LOB #1, LOB #2, LOB #3, LOB …n) use Self-Service to reach Services and Scripts/Tools across data centers and clouds, with Compliance and Consistency built in]

12 months:
• Saved 28 person-years of time
• 13,000+ ops tasks in privileged environments that didn’t require a review
• ~200 fewer customer-impacting events

Recap: Make Tomorrow Better Than Today
• SRE is more than a title
• Be practical and start focusing on toil
• Find and fix toil anti-patterns
• Error Budgets and Toil Limits
• Apply Self-Service Operations design pattern
• SRE is a new way to think about Ops work

Let’s talk…
@damonedwards
damon@rundeck.com
