Damon Edwards
October 13-15, 2020
Runbook Automation:
Old News or a Key to Unlock Performance?
Damon Edwards
Damon Edwards
Damon Edwards
Lean/Flow
Fast Feedback Loops
CI/CD
Small Batch Sizes
Shift Left
Cloud Platforms
Microservices
Continuous Testing
Infrastructure as Code
3 Ways / 5 Ideals
Automated Governance
SLOs
Kanban
You build it, you run it
Lean/Flow
Fast Feedback Loops
CI/CD
Small Batch Sizes
Shift Left
Cloud Platforms
Microservices
Continuous Testing
Infrastructure as Code
3 Ways / 5 Ideals
Automated Governance
SLOs
Kanban
You build it, you run it
What’s Next?
Thesis:
The Next Great Unlocks Will
Come from Ops…
Incident Management
Thesis:
The Next Great Unlocks Will
Come from Ops…
Incident Management
Service Requests
Thesis:
The Next Great Unlocks Will
Come from Ops…
Incident Management
Incident Management
What do we all want?
Incident Management
What do we all want?
Shorter Incidents.
Fewer Escalations.
Incident Management
What do we all want?
Shorter Incidents.
Fewer Escalations.
What always gets in the way?
Incident Management
What do we all want?
Shorter Incidents.
Fewer Escalations.
What always gets in the way?
Complexity.
Our world is complex,
and not deterministic.
J. Paul Reed
Our world is complex,
and not deterministic!
John Allspaw
Development vs Operations
(deterministic POV)
Development vs Operations
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Data changes
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Tailoring
Data changes
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Tailoring
Richard Cook, M.D.
Data changes
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
“System as imagined”
Tailoring
Richard Cook, M.D.
Data changes
(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )
Development vs Operations
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
“System as imagined” “System as found”
Tailoring
Richard Cook, M.D.
Data changes
What are people doing in “systems as found”:
Richard Cook, M.D.
1. Monitoring
What are people doing in “systems as found”:
Richard Cook, M.D.
1. Monitoring
2. Responding (making sense of what they are seeing)
What are people doing in “systems as found”:
Richard Cook, M.D.
1. Monitoring
2. Responding (making sense of what they are seeing)
3. Adapting (more tailoring of the complex system)
What are people doing in “systems as found”:
Richard Cook, M.D.
1. Monitoring
2. Responding (making sense of what they are seeing)
3. Adapting (more tailoring of the complex system)
4. Learning (feedback loops and understanding)
What are people doing in “systems as found”:
Richard Cook, M.D.
1. Monitoring
2. Responding (making sense of what they are seeing)
3. Adapting (more tailoring of the complex system)
4. Learning (feedback loops and understanding)
What are people doing in “systems as found”:
Automation alone
can’t do this
Most important lesson from other high-consequence fields:
Most important lesson from other high-consequence fields:
The role of automation is to support the human operator,
not to replace the human operator.
“We must develop trust in our operators.” 
Richard Cook, M.D.
“We must develop trust in our operators.” 
“Too much design goes into preventing
people from doing things”
Richard Cook, M.D.
“We must develop trust in our operators.” 
“Too much design goes into preventing
people from doing things”
“We need to reveal the actual controls
that are available”
Richard Cook, M.D.
Abstraction
Abstraction
Too high: Black Box. (bad)
Abstraction
Too high: Black Box. (bad)
Dr. Lisanne
Bainbridgee
“The Ironies of
Automation”
1982
Abstraction
Too high: Black Box. (bad)
Too low: ssh, sudo, and 🙏 (bad)
Dr. Lisanne
Bainbridgee
“The Ironies of
Automation”
1982
Instead your
experts become
the “abstraction”
Instead your
experts become
the “abstraction”
Tickets
Tickets
Tickets
Ticket
Ticket
Ticket
Self-Service
Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
An Incident Management Example…
3 Options:
1. Decipher the wiki3 Options:
1. Decipher the wiki 2. Ad-hoc tool/script usage3 Options:
1. Decipher the wiki 2. Ad-hoc tool/script usage 3. ESCALATE!!3 Options:
Service Requests Too…
Enable New Org Models…
I build it.
I run it.
Platform Engineering
What’s the magnitude of impact?…
(#YMMV)
What’s possible?
CFO
?
(#YMMV)
What’s possible?
CFO
?
60%
Shorter
Incidents
(#YMMV)
What’s possible?
CFO
?
60%
Shorter
Incidents
50%
Fewer
Escalations
(#YMMV)
What’s possible?
CFO
?
60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
(#YMMV)
What’s possible?
60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
CFO
(#YMMV)
What’s possible?
60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
CFO
We’ve got
Budget!
(#YMMV)
What’s possible?
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Creating Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Creating Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Creating Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Do
Creating Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
View Do
Creating Self-Service
TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
AIOps
Observability
View Do
Creating Self-Service
Self-Service Operations
The next great unlocks?
Runbook Automation
AIOps / Observability
Let’s Talk!
@damonedwards
damon@pagerduty.com
Damon Edwards
"How Complex Systems Fail” - 1998
Dr Richard Cook, MD
https://how.complexsystems.fail/
“Ironies of Automation” - 1982
Dr. Lisanne Bainbridge
https://www.ise.ncsu.edu/wp-content/uploads/
2017/02/Bainbridge_1983_Automatica.pdf
https://youtu.be/URDBE4q_IgM
The Troubles of Automating “All the Things” - 2019
J. Paul Reed
How Your Systems Keep Running - 2017
John Allspaw
https://youtu.be/xA5U85LSk0M
Slides: rundeck.com/does20
https://youtu.be/1zUtBLZ4Lus
Operations: The Last Mile
Damon Edwards
Slides: rundeck.com/does20
Operations:
The Last Mile
Incident Management Meets DevOps
Bhavik Gudka & Surya Avirneni
https://youtu.be/6W-HuG5a8L0

Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]