Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]

Damon Edwards
October 13-15, 2020
Runbook Automation:
Old News or a Key to Unlock Performance?

Lean/Flow
Fast Feedback Loops
CI/CD
Small Batch Sizes
Shift Left
Cloud Platforms
Microservices
Continuous Testing
Infrastructure as Code
3 Ways / 5 Ideals
Automated Governance
SLOs
Kanban
You build it, you run it

Lean/Flow
Fast Feedback Loops
CI/CD
Small Batch Sizes
Shift Left
Cloud Platforms
Microservices
Continuous Testing
Infrastructure as Code
3 Ways / 5 Ideals
Automated Governance
SLOs
Kanban
You build it, you run it
What’s Next?

Thesis:
The Next Great Unlocks Will
Come from Ops…

Incident Management
Thesis:
Come from Ops…

Incident Management
Service Requests
Thesis:
Come from Ops…

Incident Management
What do we all want?

Incident Management
Shorter Incidents.
Fewer Escalations.

Incident Management
Shorter Incidents.
Fewer Escalations.
What always gets in the way?

Incident Management
Shorter Incidents.
Fewer Escalations.
What always gets in the way?
Complexity.

Our world is complex,
and not deterministic.
J. Paul Reed
Our world is complex,
and not deterministic!
John Allspaw

(deterministic POV)
Development vs Operations

(deterministic POV)
Code → Build → Run? ( 👍 / 👎 )

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Data changes

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Tailoring
Data changes

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
Tailoring
Richard Cook, M.D.
Data changes

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
“System as imagined”
Tailoring
Richard Cook, M.D.
Data changes

(deterministic POV)
SAIL/cornell.edu
(stochastic POV)
Network traffic
Usage patterns
Config updates
Platform changes
API performance
Library updates
Hardware variation
Tuning
“System as imagined” “System as found”
Tailoring
Richard Cook, M.D.
Data changes

What are people doing in “systems as found”:
Richard Cook, M.D.

1. Monitoring
Richard Cook, M.D.

1. Monitoring
2. Responding (making sense of what they are seeing)
Richard Cook, M.D.

1. Monitoring
3. Adapting (more tailoring of the complex system)
Richard Cook, M.D.

1. Monitoring
4. Learning (feedback loops and understanding)
Richard Cook, M.D.

1. Monitoring
4. Learning (feedback loops and understanding)
Automation alone
can’t do this

Most important lesson from other high-consequence fields:

Most important lesson from other high-consequence fields:
The role of automation is to support the human operator,
not to replace the human operator.

“We must develop trust in our operators.”
Richard Cook, M.D.

“Too much design goes into preventing
people from doing things”
Richard Cook, M.D.

“Too much design goes into preventing
people from doing things”
“We need to reveal the actual controls
that are available”
Richard Cook, M.D.

Abstraction
Too high: Black Box. (bad)

Abstraction
Dr. Lisanne
Bainbridgee
“The Ironies of
Automation”
1982

Abstraction
Too low: ssh, sudo, and 🙏 (bad)
Dr. Lisanne
Bainbridgee
“The Ironies of
Automation”
1982

Instead your
experts become
the “abstraction”

Self-Service
TEAM A
TEAM B
TEAM C
TEAM D

An Incident Management Example…

1. Decipher the wiki3 Options:

1. Decipher the wiki 2. Ad-hoc tool/script usage3 Options:

1. Decipher the wiki 2. Ad-hoc tool/script usage 3. ESCALATE!!3 Options:

I build it.
I run it.
Platform Engineering

What’s the magnitude of impact?…

CFO
?
(#YMMV)
What’s possible?

CFO
?
60%
Shorter
Incidents
(#YMMV)
What’s possible?

CFO
?
60%
Shorter
Incidents
50%
Fewer
Escalations
(#YMMV)
What’s possible?

CFO
?
60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
(#YMMV)
What’s possible?

60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
CFO
(#YMMV)
What’s possible?

60%
Shorter
Incidents
50%
Fewer
Escalations
99%Faster
Turnaround
Time
CFO
We’ve got
Budget!
(#YMMV)
What’s possible?

TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Creating Self-Service

TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
Do

TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
View Do

TEAM A
TEAM B
TEAM C
TEAM D
Runbook
Automation
AIOps
Observability
View Do

Self-Service Operations
The next great unlocks?
Runbook Automation
AIOps / Observability

Let’s Talk!
@damonedwards
damon@pagerduty.com
Damon Edwards

"How Complex Systems Fail” - 1998
Dr Richard Cook, MD
https://how.complexsystems.fail/
“Ironies of Automation” - 1982
Dr. Lisanne Bainbridge
https://www.ise.ncsu.edu/wp-content/uploads/
2017/02/Bainbridge_1983_Automatica.pdf
https://youtu.be/URDBE4q_IgM
The Troubles of Automating “All the Things” - 2019
J. Paul Reed
How Your Systems Keep Running - 2017
John Allspaw
https://youtu.be/xA5U85LSk0M
Slides: rundeck.com/does20

https://youtu.be/1zUtBLZ4Lus
Operations: The Last Mile
Damon Edwards
Slides: rundeck.com/does20
Operations:
The Last Mile
Incident Management Meets DevOps
Bhavik Gudka & Surya Avirneni
https://youtu.be/6W-HuG5a8L0

Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]

More Related Content

What's hot

Similar to Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]

More from Rundeck

Recently uploaded

Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]