Tickets Make Operations Work Unnecessarily Miserable

Damon Edwards
@damonedwards

Community
Ops Improvement
DevOps in Enterprise
Ops Tools
Damon Edwards

Please note:
No, I’m not against writing things down  
(nor do I advocate for anarchy)
Ticket queues aren’t the only villains
(there are many unindicted co-conspirators)

Developers have had an unfair advantage.

Ops
Ah-ha!
Dev
Ka-ching!
Agile
2001
ITIL
1989

Ops
Business
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and DevOps
Availability Auditing
Security Compliance
"Go faster!"
“Open up!”
“Lock it down!”
2019

Digital
Agile
DevOps
SRE
Cloud
Docker
Kubernetes
Microservices
CHANGE
Wow
That is cool
I wish I could
work there

But nobody was talking about what
happened after deployment…

It was just another Tuesday…

NOC
NOC
Biz
Manager
Escalate!
NOC NOC
NOC
(Bob)
Open
Incident
Ticket
9:30am 10:00am
NOC (Bob)
Biz Manager
Ticket
Context Wagon
Yes, but this
looks different
Haven’t there been
some intermittent
errors this week?
v3
?!

NOC
(Bob)
Open
Incident
Ticket
Ticket
Biz
Manager
App-specific
SREs
“Try this.”
“Try that.”
SRE
SysAdmin
with Prod Access
(Steve)
SRE
SRE
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
fixed?
fixed?
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x SRE
Ticket
Context Wagon

SRE
“It’s a problem
with the Foo
service”
SRE
SRE
Foo
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
Foo
Service
No.
NOC
(Bob)
Update
Ticket
Ticket
Foo
Lead Dev
+ add
12:00pm
NOC (Bob)
Biz Manager
Foo SRE
Ticket
Context Wagon
Can you
fix it?

Foo
Lead Dev
(Karen)
ding!
Ignore.
App
Manager
Hey did you see
that ticket?
Foo
Lead Dev
(Karen)
sigh.
I’ll take a look
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
Scrum
Ticket
Context Wagon

Foo
Lead Dev
(Karen)
I’m going to need
more log files
Ticket
SysAdmin Team
+ add
Update
Ticket
Chat
“Can someone with
access to Foo Service
in Prod01 help me with
ticket #42516?”
SysAdmin
(Lee) Ticket
“logs
attached”
Foo
Lead Dev
(Karen)
Ticket
“no the
other ones”
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Ticket
Context Wagon

Foo
Lead Dev
(Karen)
Logs
-Who restarted these services? (and why?)
-They didn’t use the correct environment
variables!
-This entire service pool needs to be restarted!
Ticket
Update
Ticket
NOC
(Bob)
Update
Ticket
Ticket
Middleware Team
+ add
“Middleware, please
urgent restart this entire
app pool with the correct
environment variable”
2:00pm
Ticket
Context Wagon

NOC
(Bob)
Middleware
Manager
(Melissa)
No way. It’s the middle
of the day! You need
business approval.
NOC
(Bob)
Update
Ticket
Ticket
SVP for Line of
Business
+ add
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
Ticket
Context Wagon
2:30pm

Update
Ticket
Ticket
SVP for Line of
Business
+ add
SVP
(Susan)
Chief of
Staff
Tech VP
Tech VP
Update
Ticket
Ticket
“Restart approved”
Customer
impact?
Ticket
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon

SharePoint
Ticket
Middleware
Manager
(Melissa)
Who knows these
production services
the best?
Ellen!
Middleware Middleware
(Scott)
Ellen
to
Europe
office
Middleware
(Scott)
Trial and error
.doc
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Ticket
Context Wagon

SharePoint
Middleware
(Scott)
Trial and error
.doc
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Ticket
Context Wagon
Middleware
(Scott)
Bar
Service
10 min Middleware
(Scott)
Waiting for
Acme Service
Acme startup
failed
Bar
Service
6:00pm

Come on.. no.no.no.
What? Why?
Middleware
(Scott)

-Bar app startup timed out. Error says can’t
connect to Acme service.
- I looked at Acme but it seems to be running
-Is this error message correct? Why can’t Bar
connect?
Ticket
Update
Ticket
Middleware
(Scott)
Bar SRE
+ add
Bar SRE
(Linda)
Middleware
(Scott)
-URGENT: Network
connection issue
between Bar and
Acme
Ticket
Update
Ticket
Network
SRE Team
+ add
6:45pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Ticket
Context Wagon
The new environment pre-flight
check is preventing startup.
Looks like Bar’s connection to
Acme is being blocked.

Bar SRE
(Linda)
Middleware
(Scott)
-URGENT: Network
connection issue
between Bar and
Acme
Ticket
Update
Ticket
Network
SRE Team
+ add
Bar
Lead Dev
6:45pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Customers are
calling. What
is going on?
The new environment pre-flight
check is preventing startup.
Looks like Bar’s connection to
Acme is being blocked.
Bar
Lead Dev
(Liu)
Business
Managers
I can comment out
the test… But the
CD pipeline only
goes to QA ENV!

Network Dir
(Carlos)
Middleware
(Scott)
Carlos, I need a favor.
Can you escalate?
Middleware
Manager
(Melissa)
Customers are
calling. What
is going on?
Last week..
Net SRE
VP
VP
Priority!
Different
Incident!
Net SRE Net SRE
Net SRE
It’s the network!
Business
Managers
Your
network is
broken!
Business
Managers
We are already
working on it!
Network VPs

Network
SRE
(Hari)
The firewall is
blocking the traffic
You’ll have to take
it up with the
Firewall Team
-URGENT: Firewall is
blocking connection
between Bar and Acme
Ticket
Open
Firewall
Ticket
Firewall
Team
+ add
Firewall Engineer
(Freddie)
Middleware
(Scott)
Paging on-call…
Open bridge…
Can’t be the firewall, it hasn’t
changed since last Thursday.
No, it’s the firewall.
8:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Ticket
Context Wagon

Firewall Engineer
(Freddie)
Middleware
(Scott)
Can’t be the firewall, it hasn’t
changed since last Thursday.
No, it’s the firewall.
There was a rule change last
Thursday that would stop Bar
from talking to Acme.
Can you change it back?
Sure we make changes on
Thursday…
Chief of
Staff
SVP and VPs are livid… this was
supposed to be a safe change!!
Freddie, we’ve got customers calling.
Update
Firewall
Ticket
Firewall Engineer
(Freddie)
8:00pm

ESCALATE:
Emergency
production firewall
rule change review
Ticket
Update
Firewall
Ticket
NetSec
+ add
Firewall Engineer
(Freddie)
Paging on-call…
NetSec
(Nicole)
This is production so I’ll have
to get others on the Network
CAB…
Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
9:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon

Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
APPROVE: Emergency
firewall rule change
Ticket
Update
Firewall
Ticket
NetSec
(Nicole)
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
Firewall
(Freddie)
Net L2
(Bob)
Middleware
(Scott)
Firewall
change
Restart Bar
9:30pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)

Middleware
(Scott)
Update
Ticket
Ticket
Customer Engagement
Manager
+ add
Policy
!!
“Ready for
API tests”
9:45pm
Firewall
(Freddie)
Net L2
(Bob)
Middleware
(Scott)
Firewall
change
Restart Bar
I think we
are good!
Middleware
Manager
VP
VP
Bar
Lead Dev
You
“think?”

“Ready for
API tests”
Customer
Engagement
Manager
(Varsha)
NOC
(Bob)
Customer Engagement
Manager
(Varsha)
Update
Ticket
Ticket
“APIs OK”
Middleware
(Scott)
11:00pm
Ticket
Context Wagon

Ticket
“APIs OK”
Middleware
(Scott)
Update
Ticket
Ticket
“Services
restarted OK”
NOC
NOC
Lights are green…
I guess it is fixed.
Close
Ticket
NOC
(Bob)
Zzz
11:30pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)
Cust. Engmt. (Varsha)

NOC
Lights are green…
Close
Ticket
NOC
(Bob)
Zzz
Next Day
SVP
(Susan)
Whose fault is this?!
Why are we so bad at change?
What additional processes
and approvals are you
adding to never let this
happen again?!
VP
VP
Dir
Dir
VP
Dir
VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
NetSec (Nicole)

We’ve invested in Cloud, Agile,
DevOps, Containers…
Why does everything still take too
long and cost too much?
Executive Team
Our transformation has
largely ignored Ops

Most companies chase the symptoms…

…by following the conventional wisdom:

“We need better tools”

“We need more people”

“We need more discipline and attention to detail”
“We need more change reviews/approvals”

Challenge the conventional
wisdom about operations work

Forces That Undermine Operations
Silos
Ticket Queues
Excessive Toil
Low Trust

Where are decisions made? Who can take action?
[Diagram: support tiers 1° → 2° → 3° → 4°, escalating at each step]
Decisions made here

All work is contextual
John
Allspaw

rm -rf $PATHNAME
Is this dangerous?
John
Allspaw

rm -rf $PATHNAME
Answer is always
“it depends”
John
Allspaw

[Diagram: support tiers 1° → 2° → 3° → 4°, escalating at each step]
Context

Psychological Safety
Psychological safety is a shared belief that the team is safe for
interpersonal risk taking. It can be defined as "being able to show
and employ one's self without fear of negative consequences of
self-image, status or career."
- William Kahn

Boston University

1990

Google: most important characteristic
to predict team effectiveness?
2016
Psychological safety!

Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau

Google

Toil vs. Engineering Work
Toil | Engineering Work
Lacks Enduring Value | Builds Enduring Value
Rote, Repetitive | Creative, Iterative
Tactical | Strategic
Increases With Scale | Enables Scaling
Can Be Automated | Requires Human Creativity

Excessive Toil Prevents Fixing the System
E.W. Toil
Reduce toil
Improve the business
No capacity to reduce toil
No capacity to improve business
Toil at manageable percentage of capacity
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)

Excessive Toil Prevents Fixing the System
E.W. Toil
Reduce toil
No capacity to improve business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
Downward spiral is inevitable!

Forces That Undermine Operations
Silos
Queues
Excessive Toil
Low Trust

Backlog Information
I need X
Priorities Tools
Silos
Backlog
I do X
Requests
for X
Silo A
Information
Priorities
Silo B
Tools

Silos cause disconnects and mismatches
Backlog Information
I need X
Priorities Tools
Backlog
I do X
Requests
for X
Silo A
Information
Priorities
Silo B
Tools
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity

1
2
3
Silos interfere with feedback loops
Producer Consumer
Ops
Ops
Ops

Function A
Function B
Function C
Silos create labor pools of functional specialists
Requests fulfilled by semi-manual
or manual effort

Primary management focus is
on protecting team capacity

How do we cover for our silos’ disconnects and mismatches?
Silo A Silo B
Ticket
Queue

??
Silo A Silo B
We all know how well that works
Ticket
Queue

Ticket queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
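
The "Longer Cycle Time" item can be made concrete with Little's Law: in a stable queue, average wait equals average items waiting divided by average completion rate. The sketch below is only an illustration under that steady-state assumption. The queue names echo the story earlier in the deck, but every number is invented for the example.

```python
def average_wait_days(tickets_in_queue: float, tickets_closed_per_day: float) -> float:
    """Little's Law: average time in a queue = average items waiting / average throughput."""
    if tickets_closed_per_day <= 0:
        raise ValueError("throughput must be positive")
    return tickets_in_queue / tickets_closed_per_day


# Hypothetical request that has to cross three silo queues in sequence.
# Each entry is (tickets waiting, tickets closed per day); the figures are made up.
queues = {
    "middleware": (40, 10),
    "network": (30, 5),
    "firewall": (12, 2),
}

waits = {name: average_wait_days(wip, rate) for name, (wip, rate) in queues.items()}
total_wait = sum(waits.values())

for name, days in waits.items():
    print(f"{name}: about {days:.1f} days waiting")
print(f"end to end: about {total_wait:.1f} days before any work is counted")
```

The point of the arithmetic is that waits add up across handoffs, so each extra queue in a value stream lengthens cycle time even when every team is working at its normal pace.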

What do queues do to value streams?

Queue
A
Queue
B
Queues disintegrate and
obfuscate value streams

Ticket queues are “snowflake makers”
??
Silo A Silo B
Ticket
Queue
Snowflakes
Technically acceptable, but brittle and unreproducible

“Queues don’t learn”
??
Silo A Silo B
Ticket
Queue
Scott Prugh

CSGi

“Shift Left” the ability to take action
[Diagram: support tiers 1° → 2° → 3° → 4°, escalating at each step]

Push the ability to take action this direction
[Diagram: support tiers 1° → 2° → 3° → 4°, escalating at each step]
Tools
Enablement and tooling

Start reducing toil today
Toil

1. Track toil levels for each team
Toil

Track toil levels for each team

• Standardize (e.g. meetings and email are “overhead” not “toil”)

• Track

• Self-reporting

• Periodic surveys

• SM or PM interview/sampling
• Don’t get lost in time tracking weeds!

Toil
2. Set toil limit for each team (50% is conventional wisdom)



3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
Michael Kehoe

Todd Palino

(LinkedIn)

At SREcon Americas 2019

Example
Process
“Code Yellow”
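
The three steps above (track toil, set a limit, fund the teams over the limit) can be sketched in a few lines. This is a rough illustration, not a method from the talk: the `TeamReport` structure, the team names, the hours, and the choice to count overhead in the denominator are all assumptions. Only the 50% limit comes from the slide.

```python
from dataclasses import dataclass

TOIL_LIMIT = 0.50  # the 50% "conventional wisdom" limit from the slide


@dataclass
class TeamReport:
    """One team's self-reported hours for a period (hypothetical structure)."""
    team: str
    toil_hours: float
    engineering_hours: float
    overhead_hours: float = 0.0  # meetings and email: tracked separately, not toil

    @property
    def toil_fraction(self) -> float:
        # Assumption: toil is measured against all reported hours, overhead included.
        total = self.toil_hours + self.engineering_hours + self.overhead_hours
        return self.toil_hours / total if total else 0.0


def teams_over_limit(reports, limit=TOIL_LIMIT):
    """Return the teams above the toil limit, worst first."""
    over = [r for r in reports if r.toil_fraction > limit]
    return sorted(over, key=lambda r: r.toil_fraction, reverse=True)


# Invented sample data for three teams.
reports = [
    TeamReport("middleware", toil_hours=28, engineering_hours=8, overhead_hours=4),
    TeamReport("network", toil_hours=15, engineering_hours=20, overhead_hours=5),
    TeamReport("firewall", toil_hours=22, engineering_hours=12, overhead_hours=6),
]

funding_priority = [r.team for r in teams_over_limit(reports)]
print(funding_priority)
```

Coarse self-reported numbers are enough here, which fits the warning about not getting lost in the time-tracking weeds.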

Where to focus?
Toil
Reduce
Technical Debt
Re-Engineer

Processes
Enable
Self-Service

Eliminate Interruptions
Eliminate Waiting

Eliminate Waiting
Self-Service
Do X.
… and a lot less toil

How to enable self-service?
Empower teams to spot and fix the anti-patterns.

“Do this for me, do it again, then do it again.”
Done.
I need you
to do X
Your
other
work
I need you
to do X
I need you
to do X
Ticket
Do X
Later…
Do X
Do X
Done.
Done.
Your
other
work
Self-Service
Self-Service
Self-Service
Your
other
work x2
Your
other
work x3
Later…
Later…
Later…
Your
other
work
Your
other
work
After
Before
Wait Interrupt
Ticket
Wait Interrupt
Ticket
Wait Interrupt

“I could fix it, but I can’t get to it.”
Environment
I could fix it if I
could get to it
Before
Wait
Interrupt
After
I’ve got this!
Environment
Self-
Service

“What’s the BEST way to do this?!”
Environment
3:00 AM
sar -n DEV,EDEV 1
sar -n TCP,ETCP 1
cat /etc/resolv.conf
mpstat -P ALL 1
tcpretrans
tcpconnect
tcpaccept
netstat -rnv
check firewall config
netstat -s
Check Network
Earlier…
After
Self-Service
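
One way to read the "After" side of this slide: the expert's checklist is captured once as a named job that anyone on call can run. The sketch below is an assumption-heavy illustration, not Rundeck's API. The `run_job` function and `CHECK_NETWORK` name are hypothetical, a sample count of 5 is added so `sar` and `mpstat` terminate, and the tracing tools and the manual firewall check are left out because they do not exit on their own.

```python
import shlex
import subprocess

# Read-only checks taken from the slide, with a sample count added (an assumption)
# so that each command finishes by itself.
CHECK_NETWORK = [
    "sar -n DEV,EDEV 1 5",
    "sar -n TCP,ETCP 1 5",
    "cat /etc/resolv.conf",
    "mpstat -P ALL 1 5",
    "netstat -rnv",
    "netstat -s",
]


def run_job(commands, dry_run=True, timeout=30):
    """Run each command in order and collect (command, output) pairs.

    With dry_run=True nothing is executed; the plan is returned with None outputs.
    """
    results = []
    for command in commands:
        argv = shlex.split(command)  # no shell, so no quoting surprises
        if dry_run:
            results.append((command, None))
            continue
        try:
            done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            results.append((command, done.stdout))
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing tool or a hung command should not stop the remaining checks.
            results.append((command, f"failed: {exc}"))
    return results


plan = run_job(CHECK_NETWORK)  # dry run: shows what the job would do
for command, _ in plan:
    print(command)
```

Calling `run_job(CHECK_NETWORK, dry_run=False)` on a Linux host with sysstat and net-tools installed would execute the checks; hosts without those tools get a "failed" entry rather than an exception.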

“Oh no… I’m the Brent.”
Ticket
Ticket
Ticket
Ticket
Ticket
Ticket Ticket
Ticket Ticket
Before

“The dog-pile.”
!!
I think it’s a problem with
db07-store2.uswest.acme
“$ top”
“$ top”
db07store2.
uswest.acme
“$ top”
“$ top”
“$ top”
!!
“$ top”
!!
!!
!!
healthcheck
store2 -all
db07store2.
uswest.acme
Self-Service
1.
2.
3.
I think it’s a problem with
db07-store2.uswest.acme

“I’m an expert, I don’t read the wiki.”
docs
Service has changed. Use this flag or
bad things will happen!
Pause monitoring first or
we all get woken up!
“restart -doit -now”
I’ve done this before.
I’ve got this…
Environment
docs
Later…
Before

“I’m an expert, I don’t read the wiki.”
docs
“restart -doit -now”
I’ve got this…
Environment
docs
Later…
Before
“restart”
Environment
Later…
Update
Restart Job
✅
I’ve got this.
Self-Service
Self-Service
After

“You have a workaround, deal with it.”

Self-Service Operations Design Pattern (in a nutshell)
Consumer of
Ops Capabilities
Self-Service
On
Demand
Ops Capability
Specialist
Knowledge
Ops Capability
Specialist
Knowledge

Pull-Based
Accept tools/languages
that teams want to use
Define “guardrails” to
provide work safety
Let people who
“push buttons”
define the buttons
Build in security
and compliance

Self-Service is ultimately about user experience
1. Work how they want to work (GUI, API, CLI)
2. “Guardrails” (Smart options that helpfully constrain)
3. Dynamic resource model
(Up-to-date details of your environment)
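
A guardrail in this sense can be as simple as validating a request against an allowed set of actions and a current resource model before anything runs. The following is a minimal sketch under stated assumptions: the `validate_request` function, the node names, and the approval rule are hypothetical and only loosely modeled on the Foo Service and Prod01 names from the story.

```python
# Hypothetical resource model; in practice it would be refreshed from an inventory
# source so the options on offer always match the real environment.
RESOURCE_MODEL = {
    "qa": ["foo-qa-01"],
    "prod01": ["foo-prod-01", "foo-prod-02"],
}

# The only "buttons" exposed to consumers of this capability (an assumption).
ALLOWED_ACTIONS = {"status", "restart"}


def validate_request(action, environment, node, approved=False):
    """Return a list of guardrail violations; an empty list means the job may run."""
    problems = []
    if action not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {action}")
    nodes = RESOURCE_MODEL.get(environment)
    if nodes is None:
        problems.append(f"unknown environment: {environment}")
    elif node not in nodes:
        problems.append(f"{node} is not in {environment}")
    if action == "restart" and environment.startswith("prod") and not approved:
        problems.append("prod restarts need an approval on record")
    return problems


ok = validate_request("status", "prod01", "foo-prod-01")
blocked = validate_request("restart", "prod01", "foo-prod-01")
typo = validate_request("restart", "qa", "foo-prod-01")
print(ok, blocked, typo)
```

The same check can sit behind a GUI, an API, or a CLI, so people keep working the way they want while the constraint lives in one place.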

Obvious: Get rid of as many silos as possible
Old Silo A Old Silo B Old Silo C Old Silo D

Cross-Functional Team 1
Cross-Functional Team n

“Horizontal” shared
responsibility, not
everyone do everything!

Shared and dedicated responsibility is key
Development Team 1
Development Team 2
Development Team n
SRE
Team
Clear handoff requirements
Error budget with consequences
“Netflix”
Model
“Google”
Model
Same
high-quality,
high-velocity
results!

But what about the cross-cutting concerns?
Specialist
Capabilities
Specialist
Capabilities
Specialist
Capabilities
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue

Self-Service Operations: Turn handoffs into self-service
Self-Service Operations
On
Demand
On
Demand
On
Demand
On
Demand
Ops
(operates platform)
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Cross-Functional Product Team 1
Ops (embedded)
Cross-Functional Product Team 2
Ops (embedded)
Cross-Functional Product Team n
Ops (embedded)

Self-Service Operations: Works with any org model
Development Team 1
Development Team 2
Development Team n
Ops/SRE
Team
On
Demand
On
Demand
On
Demand
On
Demand
Ops
(operates platform)
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist

But, what about security and compliance?
Build in
Security
Here
Build in
Compliance
Here

Are all tickets bad?
Ticket
System
No. Just use tickets for what they are good for

1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general purpose work management system!
Ticket
System

Self-Service can also be a foundation
for strategic initiatives

Strategic: Improve incident response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF

Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools
DEV STAGE PROD
Dev & QA NOC/Ops Dev
Promote
approved
jobs
Self-Service Self-Service
Empower
• Reduced MTTR by 92%

• Reduced escalations by 50%

• Reduced overall support costs by 55%

Strategic: Reduce compliance burden & improve consistency
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg

Optimized for compliance
• 86,000+ employees

• 60+ countries

• Highly regulated


LOB #1
LOB #2 LOB #3
LOB …n
Data Center
Data Center
Data Center Services Scripts/Tools
Cloud
Cloud
Cloud
Cloud
Self-Service
Compliance
Consistency
12 months:

• Saved 28 person-years of time

• 13,000+ ops tasks in privileged environments that
didn’t require a review

• ~200 fewer customer-impacting events

rundeck.com/self-service
Read for free online:
Working on documenting the
Self-Service Operations design pattern.
Where I need your help…

Recap
Don’t forget about Ops.
Challenge conventional wisdom.
Understand the forces
undermining operations work
Focus on removing silos
Use self-service to reduce
remaining ticket queues
“Shift-Left” control and decision
making.
Learn from SRE: Focus on
understanding & reducing toil

Let’s talk…
@damonedwards
damon@rundeck.com
rundeck.com/self-service
