1. SysAdmin to SRE:
Creating Capacity to Make Tomorrow Better Than Today
How Runbook Automation for Incident Management, and Other Self-Service Operations Practices
Can Ignite the Way to True SRE Outcomes
Jorn Knuttila
@jorn_knuttila
2. Not that far away, maybe in a company just like yours…
🔥
Overloaded. Constant firefighting.
[Figure: Project A and Project B buried under a pile of tickets, marked “DUE: Yesterday!” and “DUE: Tomorrow!”]
3. 🔥
Waiting in ticket queues for everything.
Not that far away, maybe in a company just like yours…
[Figure: a queue of tickets]
4. 🧨
Things break. Break again. And again.
Not that far away, maybe in a company just like yours…
[Figure: the same failure recurs later, and then later again; each call for help turns into a ticket, a wait, and an interrupt]
5. ⁉
Everyone is busy, but it doesn’t get any better.
Not that far away, maybe in a company just like yours…
[Figure: work items labeled Improvement Project, Business Delivery, and Incidents]
6. 🔥
Overloaded. Constant firefighting.
🔥
Waiting in ticket queues for everything.
🔥
Things break. Break again. And again.
🔥
Everyone is busy, but it doesn’t get any better.
Not that far away, maybe in a company just like yours…
Everything takes too long, costs
too much, and breaks too often!
Executives
11. SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break again.
And again.
Everyone is busy, but it
doesn’t get any better.
Everything takes too
long, costs too much, and
breaks too often!
Executive
View
SRE (new name)
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break again.
And again.
Everyone is busy, but it
doesn’t get any better.
Everything takes too
long, costs too much, and
breaks too often!
Executive
View
12. Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Observability
Programming
Skills
Distributed
Systems Arch.
Blameless
Post-Mortems
14. Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
SRE is a rethinking of how Operations work gets
done.
15. Principles are what make SRE different
1. SRE needs Service Level Objectives, with consequences
Stephen Thorne, Google: “Principles of SRE”
At DevOps Enterprise Summit London 2018
https://youtu.be/c-w_GYvi0eA
16. SLO and Error Budgets: Tools for Shared Responsibility
[Figure: a 0 to 100 scale marking the Service Level Indicator, the Service Level Objective, and the Error Budget*]
(*Use this to improve the service)
17. SLO and Error Budgets: Tools for Shared Responsibility
[Figure: Dev, Biz, and Ops around a 0 to 100 scale marking the Service Level Indicator, the Service Level Objective, and the Error Budget*]
SLO takes priority!!
(*Use this to improve the service)
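The relationship between the three terms on this slide can be sketched with a little arithmetic. What follows is a minimal Python sketch, assuming a request-based availability SLI (good requests divided by total requests). The function name `error_budget`, its parameters, and the sample numbers are illustrative assumptions, not something taken from the talk.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the SLI and error budget for one measurement window.

    Assumes an availability-style SLI: good requests / total requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # Service Level Indicator: what was actually measured.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 400 failures.
report = error_budget(0.999, 1_000_000, 400)
```

With these sample numbers the SLO tolerates roughly 1,000 failed requests, so about 40% of the budget is spent. A team could use the remaining budget for riskier changes, and shift priority to reliability work once `budget_remaining` approaches zero.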
18. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
19. Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.” -Vivek Rau (Google)
20. Toil vs. Engineering Work
Toil | Engineering Work
Lacks Enduring Value | Builds Enduring Value
Rote, Repetitive | Creative, Iterative
Tactical | Strategic
Increases With Scale | Enables Scaling
Can Be Automated | Requires Human Creativity
21. Excessive Toil Prevents Fixing the System
Toil at manageable percentage of capacity: engineering work can reduce toil and improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve the business
Downward spiral is inevitable!
22. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
23. SRE teams have the ability to regulate their workload
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
“?!?”
24. Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Company-wide culture change (hard!)
Reduce toil.
Everybody wins!
25. Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets… stay out of their way!
26. Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team (50% is conventional wisdom)
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
↳ Refactor apps, tools, and processes
↳ Apply self-service design pattern
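The three steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming each team reports toil hours and total hours per week; the `TeamWeek` record, the function names, and the sample figures are hypothetical. Only the 50% limit comes from the slide.

```python
from dataclasses import dataclass

TOIL_LIMIT = 0.50  # the "conventional wisdom" limit named on the slide


@dataclass
class TeamWeek:
    """One team's self-reported hours for a week (hypothetical record shape)."""
    team: str
    toil_hours: float
    total_hours: float


def toil_fraction(entry: TeamWeek) -> float:
    """Step 1: track the toil level as a share of total capacity."""
    if entry.total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return entry.toil_hours / entry.total_hours


def teams_over_limit(entries: list[TeamWeek], limit: float = TOIL_LIMIT) -> list[tuple[str, float]]:
    """Steps 2 and 3: apply the limit and rank the teams to fund first."""
    over = [(e.team, toil_fraction(e)) for e in entries if toil_fraction(e) > limit]
    # Worst offenders first, since funding emphasizes teams already over the limit.
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Made-up sample data for illustration only.
week = [
    TeamWeek("platform", toil_hours=30, total_hours=40),
    TeamWeek("data", toil_hours=10, total_hours=40),
    TeamWeek("web", toil_hours=24, total_hours=40),
]
priority = teams_over_limit(week)
```

Here `priority` lists the platform team ahead of the web team, and the data team does not appear because it is under the limit. In practice the hours would come from ticket or time-tracking data rather than a hand-written list.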
27. How to enable self-service?
Empower teams to spot and fix the anti-patterns.
28. “Do this for me, do it again, then do it again.”
29. “I could fix it, but I can’t get to it.”
35. Self-Service Operations Design Pattern (in a nutshell)
Pull-Based
Accept tools/languages that teams want to use
Let people who “push buttons” define the buttons
Build in security and compliance
Define “guardrails” to provide work safety
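One way to picture the pattern is a small job catalog: operators define the "buttons", and anyone with the right role can pull them within guardrails, with every attempt recorded for compliance. The following is a minimal Python sketch under those assumptions. The `Job`, `Catalog`, and `GuardrailError` names, the role and target strings, and the restart action are all hypothetical and do not describe any particular product.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


class GuardrailError(Exception):
    """Raised when a request falls outside a job's guardrails."""


@dataclass
class Job:
    """A "button": an action plus the guardrails its author defined."""
    name: str
    action: Callable[[str], str]
    allowed_roles: set[str]
    allowed_targets: set[str]


@dataclass
class Catalog:
    jobs: dict[str, Job] = field(default_factory=dict)
    audit_log: list[tuple[str, str, str, str]] = field(default_factory=list)

    def register(self, job: Job) -> None:
        self.jobs[job.name] = job

    def run(self, job_name: str, user: str, role: str, target: str) -> str:
        job = self.jobs[job_name]
        # Guardrails: check role and target before anything executes.
        if role not in job.allowed_roles or target not in job.allowed_targets:
            self.audit_log.append((user, job_name, target, "denied"))
            raise GuardrailError(f"{user} may not run {job_name} on {target}")
        result = job.action(target)
        # Compliance: every successful run is recorded as well.
        self.audit_log.append((user, job_name, target, "ok"))
        return result


# Hypothetical button defined by the operators who used to do this by hand.
catalog = Catalog()
catalog.register(Job(
    name="restart-service",
    action=lambda target: f"restarted {target}",
    allowed_roles={"developer", "sre"},
    allowed_targets={"staging-web"},
))
```

A developer calling `catalog.run("restart-service", "ana", "developer", "staging-web")` gets the restart without filing a ticket, while the same call against a target outside `allowed_targets` raises `GuardrailError` and is still written to `audit_log`.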
36. Recap: Creating Capacity to Make Tomorrow Better Than Today
SRE is more than a title
Be practical and start focusing on toil
Find and fix toil anti-patterns
Error Budgets and Toil Limits
Apply Self-Service Operations design pattern
SRE is a new way to think about Ops work
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload