SRE for Everyone:
Making Tomorrow Better Than Today

Damon Edwards

@damonedwards
2019
Not that far away, maybe in a company just like yours…
Not that far away, maybe in a company just like yours…
Overloaded. Constant firefighting.
Ticket
Ticket
Project A
···
Project B
···
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
DUE: Yesterday! DUE: Tomorrow!
Ticket
Ticket
Ticket
Waiting in ticket queues for everything.
Not that far away, maybe in a company just like yours…
Waiting in ticket queues for everything.
Ticket
Not that far away, maybe in a company just like yours…
Waiting in ticket queues for everything.
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket
Not that far away, maybe in a company just like yours…
Things break. Break again. And again.
Later…
Later…
same
same
Help!
Ticket
Wait Interrupt
Help!
Ticket
Wait Interrupt
Help!
Ticket
Wait Interrupt
Not that far away, maybe in a company just like yours…
Everyone is busy, but it doesn’t get any better.
Improvement
Project
Business
Delivery
Incidents
Business
Delivery
Business
Delivery
Not that far away, maybe in a company just like yours…
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Not that far away, maybe in a company just like yours…
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Not that far away, maybe in a company just like yours…
Everything takes too long, costs
too much, and breaks too often!
Executives

Have you heard of SRE?
Google does it.
“SRE…
When you ask
software engineers
to do operations”
“SRE…
Next-generation,
cloud-native
Operations”
Class SRE implements DevOps
“SRE…
When Ops does
more engineering
than Ops”
“SRE…
When you ask
software engineers
to do operations”
“SRE…
Next-generation,
cloud-native
Operations”
Class SRE implements DevOps
“SRE…
When Ops does
more engineering
than Ops”
SRE
Have you heard of SRE?
Google does it.
Jane Doe
Systems Administrator
Jane Doe
Systems Administrator
We have
SysAdmins
Jane Doe
Systems Administrator
They should be
SREs!
Jane Doe
SRE
They should be
SREs!
ITIL Book 1
ITIL Book 2
ITIL Book 3
ITIL Book 4
ITIL Book 5
Quality!
is job
#1
Sys
Admin
CAB CALENDAR
Your new title is SRE.
Now write code and be better at ops.
PROVISIONING PROCESS
Dilbert characters © Scott Adams www.dilbert.com
Sys
Admin
CAB CALENDAR
our new title is SRE.
w write code and be better at ops.
PROVISIONING PROCESS
Dilbert characters © Scott Adams www.dilbert.com
SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
ansformation has largely
nored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, cost too much, and
break too often!
Executive

View
SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
ansformation has largely
nored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, cost too much, and
break too often!
Executive

View
SRE (new name)
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
Our transformation has largely
ignored Ops. Any ideas?
Have you h
Google
Everything takes too
long, cost too much, and
break too often!
Executive

View
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Observability
Programming
Skills
Distributed
Systems Arch.
Blameless
Post-Mortems
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Observability
Programming
Skills
Distributed
Systems Arch.
Blameless
Post-Mortems
000000000000000
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Not SRE
Observability
Programming
Skills
Distributed
Systems Arch.
Blameless
Post-Mortems
000000000000000
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
SRE is a rethinking of how Operations work gets
done.
Principles are what makes SRE different
Principles are what makes SRE different
Stephen Thorne, Google

At DevOps Enterprise Summit

London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
Principles are what makes SRE different
1. SRE needs Service Level Objectives, with consequences
Stephen Thorne, Google

At DevOps Enterprise Summit

London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
DEV
BIZ
Ops
SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
DEV
BIZ
Ops
SLO takes priority!!
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
Toil: Name For a Problem We’ve All Felt
Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau

Google
Toil vs. Engineering Work
Toil Engineering Work
Lacks Enduring Value Builds Enduring Value
Rote, Repetitive Creative, Iterative
Tactical Strategic
Increases With Scale Enables Scaling
Can Be Automated Requires Human Creativity
Excessive Toil Prevents Fixing the System
Toil Engineering Work
E.W.Toil
Reduce toil
Improve the business ǡ
No capacity to reduce toil
No capacity to improve business
Toil at manageable percentage of capacity
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
Excessive Toil Prevents Fixing the System
Toil Engineering Work
E.W.Toil
Reduce toil
Improve the business ǡ
No capacity to reduce toil
No capacity to improve business
Toil at manageable percentage of capacity
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
Excessive Toil Prevents Fixing the System
Toil Engineering Work
E.W.Toil
Reduce toil
Improve the business ǡ
No capacity to reduce toil
No capacity to improve business
Toil at manageable percentage of capacity
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
Downward spiral is inevitable!
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
SRE teams have the ability to regulate their workload
SRE teams have the ability to regulate their workload
Example:
SRE teams have the ability to regulate their workload
Example:
What if handing-off responsibility to SRE/Ops wasn’t a right?
SRE teams have the ability to regulate their workload
Example:
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
SRE teams have the ability to regulate their workload
Example:
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
“?!?”
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Where to start (the practical approach)
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today

3. SRE teams have the ability to regulate their workload
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today

3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today

3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Company-wide culture change (hard!)
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today

3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Company-wide culture change (hard!)
Reduce toil.

Everybody wins!
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today

3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Company-wide culture change (hard!)
Reduce toil.

Everybody wins!
Why focus on reducing toil?
Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are you most expensive assets

… stay out of their way!
Why focus on reducing toil?
1. Lots of value independent of “SRE”
Start reducing toil today
Toil
Start reducing toil today
1. Track toil levels for each team
Toil
Start reducing toil today
1. Track toil levels for each team
Toil
Track toil levels for each team
Track toil levels for each team
• Standardize (e.g. meetings and email are “overhead" not “toil”)
Track toil levels for each team
• Standardize (e.g. meetings and email are “overhead" not “toil”)
• Track

• Self-reporting

• Periodic surveys

• SM or PM interview/sampling
Track toil levels for each team
• Standardize (e.g. meetings and email are “overhead" not “toil”)
• Track

• Self-reporting

• Periodic surveys

• SM or PM interview/sampling
• Don’t get lost in time tracking weeds!
Start reducing toil today
1. Track toil levels for each team
Toil
Start reducing toil today
1. Track toil levels for each team
Toil
2. Set toil limit for each team (50% is conventional wisdom)
Start reducing toil today
1. Track toil levels for each team

2. Set toil limit for each team (50% is conventional wisdom)

3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
Start reducing toil today
1. Track toil levels for each team

2. Set toil limit for each team (50% is conventional wisdom)

3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
Michael Kehoe

Todd Palino 

(LinkedIn)

At SREcon Americas 2019

Example
Process
“Code Yellow”
Where to focus?
Toil
Where to focus?
Toil
Reduce
Technical Debt
Where to focus?
Toil
Reduce
Technical Debt
Re-Engineer

Processes
Where to focus?
Toil
Reduce
Technical Debt
Re-Engineer

Processes
Enable
Self-Service
Where to focus?
Toil
Reduce
Technical Debt
Re-Engineer

Processes
Enable
Self-Service
Eliminate Interruptions
Eliminate Waiting
Eliminate Interruptions
Eliminate Waiting
Self-Service
Do X.
Eliminate Interruptions
Eliminate Waiting
Self-Service
Do X.
… and a lot less toil
How to enable self-service?
Empower teams to spot and fix the anti-patterns.
“Do this for me, do it again, then do it again.”
Done.I need you
to do X
Your
other
work
I need you
to do X
I need you
to do X
Ticket
Do X
Later…
Do X
Do X
Done.
Done.
Your
other
work
Self-Service
Self-Service
Self-Service
Your
other
work x2
Your
other
work x3
Later…Later…
Later…
Your
other
work
Your
other
work
After
Before
Wait Interrupt
Ticket
Wait Interrupt
Ticket
Wait Interrupt
“Do this for me, do it again, then do it again.”
Done.I need you
to do X
Your
other
work
I need you
to do X
I need you
to do X
Ticket
Do X
Later…
Do X
Do X
Done.
Done.
Your
other
work
Self-Service
Self-Service
Self-Service
Your
other
work x2
Your
other
work x3
Later…Later…
Later…
Your
other
work
Your
other
work
After
Before
Wait Interrupt
Ticket
Wait Interrupt
Ticket
Wait Interrupt
“I could fix it, but I can’t get to it.”
Environment
I could fix it if I
could get to it
Before
Wait
Interrupt
“I could fix it, but I can’t get to it.”
Environment
I could fix it if I
could get to it
Before
Wait
Interrupt
After
I’ve got this!
Environment
Self-
Service
“The dog-pile.”
!!
I think its a problem with
db07-store2.uswest.acme
“$ top”
“$ top”
db07store2.
uswest.acme
“$ top”
“$ top”
“$ top”
!!
“$ top”
!!
!!
!!
healthcheck
store2 -all
db07store2.
uswest.acme
Self-Service
1.
2.
3.
I think its a problem with
db07-store2.uswest.acme
“I’m an expert, I don’t read the wiki.”
docs
Service has changed. Use this flag or
bad things will happen!
Pause monitoring first or
we all get woken up!
“restart -doit -now”
I’ve done this before.
I’ve got this…
Environment
docs
Later…
Before
“I’m an expert, I don’t read the wiki.”
docs
Service has changed. Use this flag or
bad things will happen!
Pause monitoring first or
we all get woken up!
“restart -doit -now”
I’ve done this before.
I’ve got this…
Environment
docs
Later…
Before
“I’m an expert, I don’t read the wiki.”
docs
Service has changed. Use this flag or
bad things will happen!
Pause monitoring first or
we all get woken up!
“restart -doit -now”
I’ve done this before.
I’ve got this…
Environment
docs
Later…
Before
Service has changed. Use this flag or
bad things will happen!
Pause monitoring first or
we all get woken up!
“restart”
Environment
Later…
Update
Restart Job
✅
I’ve done this before.
I’ve got this.
Self-Service
Self-Service
After
“Known issue… doesn’t get permanent fix”
“Known issue… doesn’t get permanent fix”
Self-Service Operations Design Pattern (in a nutshell)
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Accept tools/languages
that teams want to use
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Accept tools/languages
that teams want to use
Define “guardrails” to
provide work safety
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Accept tools/languages
that teams want to use
Let people who
“push buttons”
define the buttons
Define “guardrails” to
provide work safety
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Accept tools/languages
that teams want to use
Let people who
“push buttons”
define the buttons
Build in security
and compliance
Define “guardrails” to
provide work safety
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service is ultimately about user experience
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Self-Service is ultimately about user experience
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
1.Work how they want to work (GUI, API, CLI)
Self-Service is ultimately about user experience
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
1.Work how they want to work (GUI, API, CLI)
2. “Guardrails” (Smart options that helpfully constrain)
Self-Service is ultimately about user experience
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
1.Work how they want to work (GUI, API, CLI)
2. “Guardrails” (Smart options that helpfully constrain)
3.Dynamic resource model

(Up-to-date details of your environment)
Self-Service can also be a foundation
for strategic initiatives
Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF
Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF
Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools
DEV STAGE PROD
Dev & QA NOC/Ops Dev
Promote
approved
jobs
Self-Service Self-Service
Empower
Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF
Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools
DEV STAGE PROD
Dev & QA NOC/Ops Dev
Promote
approved
jobs
Self-Service Self-Service
Empower
Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF
Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools
DEV STAGE PROD
Dev & QA NOC/Ops Dev
Promote
approved
jobs
Self-Service Self-Service
Empower
• Reduced MTTR by 92%

• Reduced escalations by 50%

• Reduced overall support costs by 55%
Strategic: Reduce compliance burden & improve
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Strategic: Reduce compliance burden & improve
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees

• 60+ countries

• Highly regulated
Strategic: Reduce compliance burden & improve
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees

• 60+ countries

• Highly regulated
LOB #1
LOB #2 LOB #3
LOB …n
Services Scripts/Tools
Data Center
Services Scripts/Tools
Data Center
Services Scripts/Tools
Data Center Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Self-Service
ComplianceConsistency
Strategic: Reduce compliance burden & improve
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees

• 60+ countries

• Highly regulated
LOB #1
LOB #2 LOB #3
LOB …n
Services Scripts/Tools
Data Center
Services Scripts/Tools
Data Center
Services Scripts/Tools
Data Center Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Services Scripts/Tools
Cloud
Self-Service
ComplianceConsistency
12 months: 

• Saved 28 person years of time

• 13,000+ ops tasks in privileged environments that
didn’t require a review

• ~200 less customer impacting events
Recap: Make Tomorrow Better Than Today
SRE is more than a title
Be practical and start focusing
on toil
Find and fix toil anti-patterns
Error Budgets and Toil Limits
Apply Self-Service Operations
design pattern
Toil Engineering Work
E.W.Toil
Reduce toil
Improve the business ǡ
No capacity to reduce toil
No capacity to improve business
Toil at manageable percentage of capacity
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
SRE is a new way to think
about Ops work
ITIL Book 1
ITIL Book 2
ITIL Book 3
ITIL Book 4
ITIL Book 5
Quality!
is job
#1
Sys
Admin
CAB CALENDAR
Your new title is SRE.
Now write code and be better at ops.
PROVISIONING PROCESS
Dilbert characters © Scott Adams www.dilbert.com
1. SRE needs Service Level
Objectives, with consequences

2. SREs have time to make
tomorrow better than today

3. SRE teams have the ability to
regulate their workload
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
Done.I need you
to do X
Your
other
work
I need you
to do X
I need you
to do X
Ticket
Do X
Later…
Do X
Do X
Done.
Done.
Your
other
work
Self-Service
Self-Service
Self-Service
Your
other
work x2
Your
other
work x3
Later…Later…
Later…
Your
other
work
Your
other
work
After
Before
Wait Interrupt
Ticket
Wait Interrupt
Ticket
Wait Interrupt
Consumer of
Ops Capabilities
Self-Service Operation
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge
Toil
Let’s talk…
@damonedwards
damon@rundeck.com

SRE for Everyone: Making Tomorrow Better Than Today

  • 1.
    SRE for Everyone: MakingTomorrow Better Than Today Damon Edwards @damonedwards 2019
  • 3.
    Not that faraway, maybe in a company just like yours…
  • 4.
    Not that faraway, maybe in a company just like yours… Overloaded. Constant firefighting. Ticket Ticket Project A ··· Project B ··· Ticket Ticket Ticket Ticket Ticket Ticket Ticket Ticket Ticket Ticket Ticket DUE: Yesterday! DUE: Tomorrow! Ticket Ticket Ticket
  • 5.
    Waiting in ticketqueues for everything. Not that far away, maybe in a company just like yours…
  • 6.
    Waiting in ticketqueues for everything. Ticket Not that far away, maybe in a company just like yours…
  • 7.
    Waiting in ticketqueues for everything. Ticket Ticket Ticket Ticket Ticket Ticket Not that far away, maybe in a company just like yours…
  • 8.
    Things break. Breakagain. And again. Later… Later… same same Help! Ticket Wait Interrupt Help! Ticket Wait Interrupt Help! Ticket Wait Interrupt Not that far away, maybe in a company just like yours…
  • 9.
    Everyone is busy,but it doesn’t get any better. Improvement Project Business Delivery Incidents Business Delivery Business Delivery Not that far away, maybe in a company just like yours…
  • 10.
    Overloaded. Constant firefighting. Waitingin ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Not that far away, maybe in a company just like yours…
  • 11.
    Overloaded. Constant firefighting. Waitingin ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Not that far away, maybe in a company just like yours… Everything takes too long, costs too much, and breaks too often! Executives Have you heard of SRE? Google does it.
  • 13.
    “SRE… When you ask softwareengineers to do operations” “SRE… Next-generation, cloud-native Operations” Class SRE implements DevOps “SRE… When Ops does more engineering than Ops”
  • 14.
    “SRE… When you ask softwareengineers to do operations” “SRE… Next-generation, cloud-native Operations” Class SRE implements DevOps “SRE… When Ops does more engineering than Ops” SRE
  • 15.
    Have you heardof SRE? Google does it.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
    ITIL Book 1 ITILBook 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com Sys Admin CAB CALENDAR our new title is SRE. w write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  • 21.
    SysAdmins Overloaded. Constant firefighting. Waiting inticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. ansformation has largely nored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, cost too much, and break too often! Executive View
  • 22.
    SysAdmins Overloaded. Constant firefighting. Waiting inticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. ansformation has largely nored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, cost too much, and break too often! Executive View SRE (new name) Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you h Google Everything takes too long, cost too much, and break too often! Executive View
  • 23.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs.
  • 24.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs.
  • 25.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs. Observability Programming Skills Distributed Systems Arch. Blameless Post-Mortems
  • 26.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs. Observability Programming Skills Distributed Systems Arch. Blameless Post-Mortems 000000000000000
  • 27.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs. Not SRE Observability Programming Skills Distributed Systems Arch. Blameless Post-Mortems 000000000000000
  • 28.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs.
  • 29.
    Changing job titlesor adding individual skills doesn’t make systems administrators SREs. SRE is a rethinking of how Operations work gets done.
  • 30.
    Principles are whatmakes SRE different
  • 31.
    Principles are whatmakes SRE different Stephen Thorne, Google At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 32.
    Principles are whatmakes SRE different 1. SRE needs Service Level Objectives, with consequences Stephen Thorne, Google At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 33.
    SLO and ErrorBudgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  • 34.
    SLO and ErrorBudgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  • 35.
    SLO and ErrorBudgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops
  • 36.
    SLO and ErrorBudgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops SLO takes priority!!
  • 37.
    Principles of SREare what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  • 38.
    Principles of SREare what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  • 39.
    Toil: Name Fora Problem We’ve All Felt
  • 40.
    Toil: Name Fora Problem We’ve All Felt “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” -Vivek Rau Google
  • 41.
    Toil vs. EngineeringWork Toil Engineering Work Lacks Enduring Value Builds Enduring Value Rote, Repetitive Creative, Iterative Tactical Strategic Increases With Scale Enables Scaling Can Be Automated Requires Human Creativity
  • 42.
    Excessive Toil PreventsFixing the System Toil Engineering Work E.W.Toil Reduce toil Improve the business ǡ No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  • 43.
    Excessive Toil PreventsFixing the System Toil Engineering Work E.W.Toil Reduce toil Improve the business ǡ No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  • 44.
    Excessive Toil PreventsFixing the System Toil Engineering Work E.W.Toil Reduce toil Improve the business ǡ No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”) Downward spiral is inevitable!
  • 45.
    Principles of SREare what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  • 46.
    Principles of SREare what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 47.
    SRE teams havethe ability to regulate their workload
  • 48.
    SRE teams havethe ability to regulate their workload Example:
  • 49.
    SRE teams havethe ability to regulate their workload Example: What if handing-off responsibility to SRE/Ops wasn’t a right?
  • 50.
    SRE teams havethe ability to regulate their workload Example: What if handing-off responsibility to SRE/Ops wasn’t a right? (separate the “running in production” from “run by SRE/Ops”)
  • 51.
    SRE teams havethe ability to regulate their workload Example: What if handing-off responsibility to SRE/Ops wasn’t a right? (separate the “running in production” from “run by SRE/Ops”) “?!?”
  • 52.
    Principles of SREare what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 53.
    Where to start(the practical approach)
  • 54.
    Where to start(the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 55.
    Where to start(the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!)
  • 56.
    Where to start(the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!) Company-wide culture change (hard!)
  • 57.
    Where to start(the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!) Company-wide culture change (hard!) Reduce toil.
 Everybody wins!
  • 58.
    Where to start(the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!) Company-wide culture change (hard!) Reduce toil.
 Everybody wins!
  • 59.
    Why focus onreducing toil?
  • 60.
    Why focus onreducing toil? 1. Lots of value independent of “SRE”
  • 61.
    2. Your peopleare you most expensive assets
 … stay out of their way! Why focus on reducing toil? 1. Lots of value independent of “SRE”
  • 62.
  • 63.
    Start reducing toiltoday 1. Track toil levels for each team Toil
  • 64.
    Start reducing toiltoday 1. Track toil levels for each team Toil
  • 65.
    Track toil levelsfor each team
  • 66.
    Track toil levelsfor each team • Standardize (e.g. meetings and email are “overhead" not “toil”)
  • 67.
    Track toil levelsfor each team • Standardize (e.g. meetings and email are “overhead" not “toil”) • Track • Self-reporting • Periodic surveys • SM or PM interview/sampling
  • 68.
    Track toil levelsfor each team • Standardize (e.g. meetings and email are “overhead" not “toil”) • Track • Self-reporting • Periodic surveys • SM or PM interview/sampling • Don’t get lost in time tracking weeds!
  • 69.
    Start reducing toiltoday 1. Track toil levels for each team Toil
  • 70.
    Start reducing toiltoday 1. Track toil levels for each team Toil 2. Set toil limit for each team (50% is conventional wisdom)
  • 71.
    Start reducing toiltoday 1. Track toil levels for each team 2. Set toil limit for each team (50% is conventional wisdom) 3. Fund efforts to reduce toil (with emphasis on teams already over limit) Toil
  • 72.
    Start reducing toiltoday 1. Track toil levels for each team 2. Set toil limit for each team (50% is conventional wisdom) 3. Fund efforts to reduce toil (with emphasis on teams already over limit) Toil Michael Kehoe Todd Palino (LinkedIn) At SREcon Americas 2019 Example Process “Code Yellow”
  • 73.
  • 74.
  • 75.
    Where to focus? Toil Reduce TechnicalDebt Re-Engineer Processes
  • 76.
    Where to focus? Toil Reduce TechnicalDebt Re-Engineer Processes Enable Self-Service
  • 77.
    Where to focus? Toil Reduce TechnicalDebt Re-Engineer Processes Enable Self-Service
  • 79.
  • 80.
  • 81.
  • 82.
    How to enableself-service? Empower teams to spot and fix the anti-patterns.
  • 83.
    “Do this forme, do it again, then do it again.” Done.I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later…Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt
  • 84.
    “Do this forme, do it again, then do it again.” Done.I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later…Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt
  • 85.
    “I could fixit, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt
  • 86.
    “I could fixit, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt After I’ve got this! Environment Self- Service
  • 87.
    “The dog-pile.” !! I thinkits a problem with db07-store2.uswest.acme “$ top” “$ top” db07store2. uswest.acme “$ top” “$ top” “$ top” !! “$ top” !! !! !! healthcheck store2 -all db07store2. uswest.acme Self-Service 1. 2. 3. I think its a problem with db07-store2.uswest.acme
  • 88.
    “I’m an expert,I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before
  • 89.
    “I’m an expert,I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before
  • 90.
    “I’m an expert,I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart” Environment Later… Update Restart Job ✅ I’ve done this before. I’ve got this. Self-Service Self-Service After
  • 91.
    “Known issue… doesn’tget permanent fix”
  • 92.
    “Known issue… doesn’tget permanent fix”
  • 93.
    Self-Service Operations DesignPattern (in a nutshell) Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 94.
    Self-Service Operations DesignPattern (in a nutshell) Pull-Based Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 95.
    Self-Service Operations DesignPattern (in a nutshell) Pull-Based Accept tools/languages that teams want to use Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 96.
    Self-Service Operations DesignPattern (in a nutshell) Pull-Based Accept tools/languages that teams want to use Define “guardrails” to provide work safety Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 97.
    Self-Service Operations DesignPattern (in a nutshell) Pull-Based Accept tools/languages that teams want to use Let people who “push buttons” define the buttons Define “guardrails” to provide work safety Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 98.
    Self-Service Operations DesignPattern (in a nutshell) Pull-Based Accept tools/languages that teams want to use Let people who “push buttons” define the buttons Build in security and compliance Define “guardrails” to provide work safety Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 99.
    Self-Service is ultimatelyabout user experience Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 100.
    Self-Service is ultimatelyabout user experience Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge 1.Work how they want to work (GUI, API, CLI)
  • 101.
    Self-Service is ultimatelyabout user experience Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge 1.Work how they want to work (GUI, API, CLI) 2. “Guardrails” (Smart options that helpfully constrain)
  • 102.
    Self-Service is ultimatelyabout user experience Consumer of Ops Capabilities Self-Service On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge 1.Work how they want to work (GUI, API, CLI) 2. “Guardrails” (Smart options that helpfully constrain) 3.Dynamic resource model
 (Up-to-date details of your environment)
  • 103.
    Self-Service can alsobe a foundation for strategic initiatives
  • 104.
    Strategic: Improve incidentresponse times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF
  • 105.
    Strategic: Improve incidentresponse times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools DEV STAGE PROD Dev & QA NOC/Ops Dev Promote approved jobs Self-Service Self-Service Empower
  • 106.
    Strategic: Improve incidentresponse times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools DEV STAGE PROD Dev & QA NOC/Ops Dev Promote approved jobs Self-Service Self-Service Empower
  • 107.
    Strategic: Improve incidentresponse times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF Services Monitoring Scripts/Tools Services Monitoring Scripts/ToolsServices Monitoring Scripts/Tools DEV STAGE PROD Dev & QA NOC/Ops Dev Promote approved jobs Self-Service Self-Service Empower • Reduced MTTR by 92% • Reduced escalations by 50% • Reduced overall support costs by 55%
  • 108.
    Strategic: Reduce complianceburden & improve Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg
  • 109.
    Strategic: Reduce complianceburden & improve Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated
  • 110.
    Strategic: Reduce complianceburden & improve Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated LOB #1 LOB #2 LOB #3 LOB …n Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Self-Service ComplianceConsistency
  • 111.
    Strategic: Reduce complianceburden & improve Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated LOB #1 LOB #2 LOB #3 LOB …n Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Self-Service ComplianceConsistency 12 months: • Saved 28 person years of time • 13,000+ ops tasks in privileged environments that didn’t require a review • ~200 less customer impacting events
  • 112.
    Recap: Make TomorrowBetter Than Today SRE is more than a title Be practical and start focusing on toil Find and fix toil anti-patterns Error Budgets and Toil Limits Apply Self-Service Operations design pattern Toil Engineering Work E.W.Toil Reduce toil Improve the business ǡ No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”) SRE is a new way to think about Ops work ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) Done.I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later…Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Toil
  • 113.