SlideShare a Scribd company logo
Failure Happens:
Improving Incident Response In Enterprises
Damon Edwards
@damonedwards
Damon
Edwards
Ops Improvement
DevOps Consulting
Ops Tools
Community
1. This is not a talk about monitoring
2. This talk is about the enterprise
Please note….
Let’s look at an (unfortunately) typical incident…
[Get PDF here: https://rundeck.co/incident_lisa2017]
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Let’s condense that…
Take a DevOps approach to improvement…
1. Use Lean thinking and analyze like any process
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TSM
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
W
W
W W
W
EP
EP
H HH
M
M
M
M
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
What got in the way? (Wastes)
PD - Partially Done
TS - Task Switching
W - Waiting
M - Motion / Manual
D - Defects
EP - Extra Process
EF - Extra Features
H - Heroics
Structured analysis building on DevOps
adaptation of “7 deadly wastes” from Lean / Agile:
TS
TS
TS
TS
TS
TS TS
PD
PD
PD
W
W
W W
W
EP
EP
H HH
M
M
M
M
Tip: Try techniques like Value Stream Mapping
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect?
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
MTTD?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Detect? Diagnose!
Diagnose > Detect
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Repair!
DEV
MTTR?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Restore?
Restore is good. Repair (fix) is ultimate measure.
Repair!
DEV
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
Cost?
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Don’t underestimate the expense of context switching!
HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
Context:
2
Context:
10
Context:
6
Context:
7
Context:
10
Context:
12
Context:
13
Context:
13
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Shift-Left
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here
Shift-Left
1
-NOC alerts
-Biz Manager
phone call
-Escalate to
L2
-L2 bridge call
-SysAdmin try
this & that
-Consensus:
Foo service
-Interrupt Foo
app dev
-Find sysadmin
with access
-Multiple
roundtrips to
get logs
2 3
-Identify
incorrect
restarts
-Attempt to get
middleware to
restart
-Need approval
4
-Escalate to
SVP
-VPs try to
assess impact
-Approve
5 6
-Find
middleware
engineer
-Review docs
-Trial & error
-Need API
tests per policy
-Find tester
-Pass tests
7
-Middleware
calls restart
done
-NOC sees
green
-Ticket closed
8
10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm
+00:30
after alert
+02:30
after alert
+04:30
after alert
+05:00
after alert
+07:30
after alert
+08:30
after alert
+12:00
after alert
+13:00
after alert
Ticket
Context Wagon
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x L2
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo L2
SysAdmin (Lee)
Middleware Mngr.
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Cust. Engage.
(Varsha)
Solve here or here NOT HERE
Shift-Left
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
But what gets
in the way?
Empower those closest to the issue
escalate escalate
1° 2° 3°
escalate
4°
Push the ability to take action this direction
SilosBut what gets
in the way?
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
Silos ruin everything
Backlog Context
I need X
Backlog
I do X
Requests
for X
Silo A
Priorities
Context
Priorities
Silo B
Tools Tools
Ticket-Driven Request Queues Are Best Indicator of Silos
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder
Ticket-Driven Request Queues Are Best Indicator of Silos
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder Snowflake Maker
Ticket-Driven Request Queues Are Best Indicator of Silos
Popular: Replace Silos with Cross Functional Team
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Dev/Test Release OperatePlanning
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
EnvironmentsDBAs Network Security NOC
Dev/Test Release OperatePlanning
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Popular: Replace Silos with Cross Functional Team
Cross-Functional Teams
Cross-Functional Teams
Cross-Functional Teams
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder
Get rid of as many ticket-driven request queues as possible
Team A
(Dev)
Team B
(Ops)
Ticket
System
??
Silo Builder Snowflake Maker
Get rid of as many ticket-driven request queues as possible
… replace with Operations as a Service design pattern
Team A
(Dev)
Team B
(Ops)Ticket
System
Operations
as a
Service
Execute
On Demand
Define
Procedures
Vet
Procedures
Define
Policies
Actual Exceptions
Execute
On Demand
Changes how your organization thinks
about automated procedures…
Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
Automated procedures are comprised of three parts
Definition of the automated procedure
Execution of the automated procedure
Governance of the automated procedure
Define
Execute
Govern
(security, oversight, compliance, etc.)
Traditional Ops Silo
Define
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Rigid Self-Service
Define
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Define
Execute
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Rigid Self-Service (limited)
High-Velocity Handoffs
Define
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Self-Service Operations
Define
Govern
Execute
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
Self-Service Operations
Define
Govern
Execute
Govern
“Consumers of Ops”
(Dev, QA, Release, NOC, Security, etc.)
Ops
fdfd
Operations as a Service
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
fdfd
Operations as a Service
Move definition, execution, and governance to where you
get the most effective use of labor and best flow of work
Operations
as a
Service
ED G
Team B
(Ops)
Vet
Procedures
Define
Policies
Execute
On Demand
Team A
(Dev)
Define
Procedures
Execute
On Demand
Rundeck: Open Source Platform For Operations as a Service
#! ! "# $
Scripts APIs Tools Cloud VMs Containers
Orchestration &
Scheduling of Workflows
Collect and
Process Output
Infrastructure
details and state
from multiple
sources
Config.
Man.
CMDB
Monitor.
Metrics
Cloud
Corp
Directory
Authentication
and roles
ITSM Tickets, work
status, approvals
>_
Create workflows ● Define ACL policies ● Execute workflows
Web GUI API CLI
Common implementation pattern
for Operations as a Service…
Step 1: Establish a Secure Ops Hub
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 2: Establish a SDLC for Ops Procedures
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Execute
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 3: Connect with Enterprise Management Systems
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
Step 4: Make Compliance Really Happy
Service Desk
CustomersOps Support get
visibility and audit trail
updated by support tools
Service Ticket
Execute
Software
Supply Chain
Ops integrate
with artifact
flow
Who reviewed it? Who ran it? When? Where? Approval trail?
Who created the procedure?
Who created the policy?
Operations as a Service
Engineers get visibility
and controlled self-service
Secrets
Ops Procedures
“Status”
“Firewall Change”
"Restart"
deny
allow
Identity Audit Logs
Infrastructure view
Service health
System metrics
Ops Support use for
remediation procedures
Inventory and Health
Source Code
Repo
if (($state==wait))
then
kill -9 $PID
fi
Change
Product Engineers
produce automated
procedures and health
checks.
RISKY
Automated Procedures
and Health Checks
FIX
Code review
+ Monitoring Tools
Security and Ops manages
access, configuration, and compliance
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Encourage Developers to think “Operable First”
Operations is the business (unless you literally sell packaged software)
Deployability, configurability, monitoring are service features
Build configurability into the service, don’t externalize it
Demand “prod-like” environments everywhere
Make any handoff between teams “verification-driven”
Create immutable versioned artifacts and use standard packaging
Integration tests over unit tests
Encourage Developers to think “Operable First”
Let’s look at a company improving it’s
incident response capability…
Mark
Maun
Jody
Mulkey
Justin
Dean
90% Reduction in MTTR
50% Reduction in escalations
55% Reduction of overall support costs
1. New org, support, and escalation model
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
2. Key: Push the ability to take action closest to the problem
escalate
1° 2° 3° 4°
escalate escalate
1. New org, support, and escalation model
EMT ER Trauma
Surgeon
Specialist
Surgeon
TOC (NOC) SRE
Production Eng.
Scrum Teams
Data Services
Platform Eng.
Global Network
> 15 min > 30 min > 60 min
3. Longterm investment in operability
(deployment, configuration, monitoring, automated runbooks)
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
“Support at the Edge”
Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ
http://rundeck.org/stories/mark_maun.html
• Automated Ops procedures written/
vetted by the delivery teams
• Ops remained in full control of what
can run and security policy
• Empowered NOC and other support
teams with self-service ops tasks
• Empowered developers with limited
self-service operations
• Combined with new incident response
and escalation model
1. Use Lean thinking and analyze like any process
2. Make sure you are measuring the right things
3. “Shift Left” control and decision making
4. Get rid of silos
5. Establish Operations as a Service
6. Encourage Developers to think “Operable First”
Recap!
Let’s talk…
@damonedwards
damon@rundeck.com

More Related Content

What's hot

SysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile ProblemSysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile Problem
Rundeck
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still Happens
Rundeck
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
Rundeck
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management
Rundeck
 
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
Helping Ops Help You:  Development’s Role in Enabling Self-Service OperationsHelping Ops Help You:  Development’s Role in Enabling Self-Service Operations
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
Rundeck
 
Tickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily MiserableTickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily Miserable
Rundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
Rundeck
 
Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise
Rundeck
 
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of OperationsMaking Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
Rundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
Rundeck
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
Rundeck
 
SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise
Rundeck
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
Grier Johnson
 
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today  SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
Rundeck
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Rundeck
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOps
Rundeck
 
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Patrick McDonnell
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09
Andrew Shafer
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
John Willis
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
C4Media
 

What's hot (20)

SysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile ProblemSysAdmin to SRE: Solving the Last Mile Problem
SysAdmin to SRE: Solving the Last Mile Problem
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still Happens
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management
 
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
Helping Ops Help You:  Development’s Role in Enabling Self-Service OperationsHelping Ops Help You:  Development’s Role in Enabling Self-Service Operations
Helping Ops Help You: Development’s Role in Enabling Self-Service Operations
 
Tickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily MiserableTickets Make Operations Work Unnecessarily Miserable
Tickets Make Operations Work Unnecessarily Miserable
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise Clearing the Way For SRE In the Enterprise
Clearing the Way For SRE In the Enterprise
 
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of OperationsMaking Tomorrow Better than Today - Unlocking the Full Potential of Operations
Making Tomorrow Better than Today - Unlocking the Full Potential of Operations
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile
 
SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today  SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOps
 
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
Continuously Deploying Culture: Scaling Culture at Etsy - Velocity Europe 2012
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 

Similar to Failure Happens: Improving Incident Response In Enterprises

DevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh CityDevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh City
Christian Trabold
 
Devops is (not ) a buzzword
Devops is (not ) a buzzwordDevops is (not ) a buzzword
Devops is (not ) a buzzword
Miguel Fonseca
 
Lead Developer: A Day in the Life
Lead Developer: A Day in the LifeLead Developer: A Day in the Life
Lead Developer: A Day in the Life
John Valentino
 
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User ExperiencesBizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
Klaus Enzenhofer
 
Chicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - MediaflyChicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - Mediafly
Mediafly
 
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, FasterSaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
Thomas Jackson
 
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the LadderThe DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
tlevey
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon Otto
Peter Bittner
 
Build It and They Will Come-Pliant
Build It and They Will Come-PliantBuild It and They Will Come-Pliant
Build It and They Will Come-Pliant
Julie Tsai
 
Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every Team
Sven Peters
 
⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?
François-Guillaume Ribreau
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點
William Yeh
 
Continuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptnContinuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptn
Andreas Grabner
 
Realising the true value of DevOps
Realising the true value of DevOpsRealising the true value of DevOps
Realising the true value of DevOps
tlevey
 
DevOps with Serverless
DevOps with ServerlessDevOps with Serverless
DevOps with Serverless
Yan Cui
 
Node fips
Node fipsNode fips
Node fips
Michael Dawson
 
Challenges and best practices of database continuous delivery
Challenges and best practices of database continuous deliveryChallenges and best practices of database continuous delivery
Challenges and best practices of database continuous delivery
DBmaestro - Database DevOps
 
Serverless 101 in Montreal
Serverless 101 in MontrealServerless 101 in Montreal
Serverless 101 in Montreal
Aaron Williams
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
Joakim Lindbom
 
Dash presentation
Dash presentationDash presentation
Dash presentation
Calvin French-Owen
 

Similar to Failure Happens: Improving Incident Response In Enterprises (20)

DevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh CityDevOps Training - Ho Chi Minh City
DevOps Training - Ho Chi Minh City
 
Devops is (not ) a buzzword
Devops is (not ) a buzzwordDevops is (not ) a buzzword
Devops is (not ) a buzzword
 
Lead Developer: A Day in the Life
Lead Developer: A Day in the LifeLead Developer: A Day in the Life
Lead Developer: A Day in the Life
 
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User ExperiencesBizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
BizOps Done Right: Breaking DevOps Silos to Deliver Great User Experiences
 
Chicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - MediaflyChicago Docker Meetup Presentation - Mediafly
Chicago Docker Meetup Presentation - Mediafly
 
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, FasterSaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster
 
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the LadderThe DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
The DevOps Pay Raise: Quantifying Your Value to Move Up the Ladder
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon Otto
 
Build It and They Will Come-Pliant
Build It and They Will Come-PliantBuild It and They Will Come-Pliant
Build It and They Will Come-Pliant
 
Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every Team
 
⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點
 
Continuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptnContinuous Delivery and Automated Operations on k8s with keptn
Continuous Delivery and Automated Operations on k8s with keptn
 
Realising the true value of DevOps
Realising the true value of DevOpsRealising the true value of DevOps
Realising the true value of DevOps
 
DevOps with Serverless
DevOps with ServerlessDevOps with Serverless
DevOps with Serverless
 
Node fips
Node fipsNode fips
Node fips
 
Challenges and best practices of database continuous delivery
Challenges and best practices of database continuous deliveryChallenges and best practices of database continuous delivery
Challenges and best practices of database continuous delivery
 
Serverless 101 in Montreal
Serverless 101 in MontrealServerless 101 in Montreal
Serverless 101 in Montreal
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
Dash presentation
Dash presentationDash presentation
Dash presentation
 

More from Rundeck

Rundeck Community Office Hours: Using Variables with Job Steps
Rundeck Community Office Hours:  Using Variables with Job Steps Rundeck Community Office Hours:  Using Variables with Job Steps
Rundeck Community Office Hours: Using Variables with Job Steps
Rundeck
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process Automation
Rundeck
 
How to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in RundeckHow to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in Rundeck
Rundeck
 
Lunch and learn: Getting started with Rundeck & Ansible
Lunch and learn:  Getting started with Rundeck & AnsibleLunch and learn:  Getting started with Rundeck & Ansible
Lunch and learn: Getting started with Rundeck & Ansible
Rundeck
 
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
Rundeck
 
Rundeck Office Hours: Best Practices Access Control Policies
Rundeck Office Hours:  Best Practices Access Control PoliciesRundeck Office Hours:  Best Practices Access Control Policies
Rundeck Office Hours: Best Practices Access Control Policies
Rundeck
 
Mastering Secrets Management in Rundeck
Mastering Secrets Management in RundeckMastering Secrets Management in Rundeck
Mastering Secrets Management in Rundeck
Rundeck
 
What's New in Rundeck 3.4
What's New in Rundeck 3.4   What's New in Rundeck 3.4
What's New in Rundeck 3.4
Rundeck
 
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
Rundeck
 
Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation
Rundeck
 
Introduction to Rundeck
Introduction to Rundeck Introduction to Rundeck
Introduction to Rundeck
Rundeck
 
Automated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + SensuAutomated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + Sensu
Rundeck
 
Modernizing Incident Response
Modernizing Incident Response Modernizing Incident Response
Modernizing Incident Response
Rundeck
 
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Rundeck
 
Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020
Rundeck
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
Rundeck
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Rundeck
 
Advanced Cluster Settings
Advanced Cluster Settings Advanced Cluster Settings
Advanced Cluster Settings
Rundeck
 
Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration
Rundeck
 
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Rundeck
 

More from Rundeck (20)

Rundeck Community Office Hours: Using Variables with Job Steps
Rundeck Community Office Hours:  Using Variables with Job Steps Rundeck Community Office Hours:  Using Variables with Job Steps
Rundeck Community Office Hours: Using Variables with Job Steps
 
Introducing PagerDuty Process Automation
Introducing PagerDuty Process AutomationIntroducing PagerDuty Process Automation
Introducing PagerDuty Process Automation
 
How to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in RundeckHow to Build a Custom Plugin in Rundeck
How to Build a Custom Plugin in Rundeck
 
Lunch and learn: Getting started with Rundeck & Ansible
Lunch and learn:  Getting started with Rundeck & AnsibleLunch and learn:  Getting started with Rundeck & Ansible
Lunch and learn: Getting started with Rundeck & Ansible
 
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...Self Service Cloud Operations:  Safely Delegate the Management of your Cloud ...
Self Service Cloud Operations: Safely Delegate the Management of your Cloud ...
 
Rundeck Office Hours: Best Practices Access Control Policies
Rundeck Office Hours:  Best Practices Access Control PoliciesRundeck Office Hours:  Best Practices Access Control Policies
Rundeck Office Hours: Best Practices Access Control Policies
 
Mastering Secrets Management in Rundeck
Mastering Secrets Management in RundeckMastering Secrets Management in Rundeck
Mastering Secrets Management in Rundeck
 
What's New in Rundeck 3.4
What's New in Rundeck 3.4   What's New in Rundeck 3.4
What's New in Rundeck 3.4
 
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...Automate Yourself Out of a Job:  Safely Delegate the Management of your Azure...
Automate Yourself Out of a Job: Safely Delegate the Management of your Azure...
 
Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation Super-Charge Your Site Reliability Practices with Runbook Automation
Super-Charge Your Site Reliability Practices with Runbook Automation
 
Introduction to Rundeck
Introduction to Rundeck Introduction to Rundeck
Introduction to Rundeck
 
Automated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + SensuAutomated Remediation with Rundeck + Sensu
Automated Remediation with Rundeck + Sensu
 
Modernizing Incident Response
Modernizing Incident Response Modernizing Incident Response
Modernizing Incident Response
 
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
Runbook Automation: Old News or a Key to Unlock Performance? [DOES2020]
 
Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020Datadog + Rundeck at DASH 2020
Datadog + Rundeck at DASH 2020
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
 
Advanced Cluster Settings
Advanced Cluster Settings Advanced Cluster Settings
Advanced Cluster Settings
 
Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration Maximizing Your Rundeck Migration
Maximizing Your Rundeck Migration
 
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
Business Continuity for Humans: Keeping Your Business Running When Your Peopl...
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 

Failure Happens: Improving Incident Response In Enterprises

  • 1. Failure Happens: Improving Incident Response In Enterprises Damon Edwards @damonedwards
  • 3. 1. This is not a talk about monitoring 2. This talk is about the enterprise Please note….
  • 4. Let’s look at an (unfortunately) typical incident… [Get PDF here: https://rundeck.co/incident_lisa2017]
  • 5. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Let’s condense that…
  • 6. Take a DevOps approach to improvement…
  • 7. 1. Use Lean thinking and analyze like any process
  • 8. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile:
  • 9. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: M M M M
  • 10. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TSM M M M
  • 11. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W M M M M
  • 12. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W EP EP M M M M
  • 13. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS W W W W W EP EP H HH M M M M
  • 14. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) What got in the way? (Wastes) PD - Partially Done TS - Task Switching W - Waiting M - Motion / Manual D - Defects EP - Extra Process EF - Extra Features H - Heroics Structured analysis building on DevOps adaptation of “7 deadly wastes” from Lean / Agile: TS TS TS TS TS TS TS PD PD PD W W W W W EP EP H HH M M M M
  • 15. Tip: Try techniques like Value Stream Mapping
  • 16. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things
  • 17. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha)
  • 18. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect?
  • 19. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect? Diagnose!
  • 20. MTTD? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Detect? Diagnose! Diagnose > Detect
  • 21. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha)
  • 22. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore?
  • 23. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore? Repair! DEV
  • 24. MTTR? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Restore? Restore is good. Repair (fix) is ultimate measure. Repair! DEV
  • 25. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5
  • 26. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5 Context: 2 Context: 10 Context: 6 Context: 7 Context: 10 Context: 12 Context: 13 Context: 13
  • 27. Cost? 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Don’t underestimate the expense of context switching! HC: 2 HC: 9 HC: 2 HC: 3 HC: 4 HC: 1.5 HC: 1.5 HC: 1.5 Context: 2 Context: 10 Context: 6 Context: 7 Context: 10 Context: 12 Context: 13 Context: 13
  • 28. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making
  • 29. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Shift-Left
  • 30. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Solve here or here Shift-Left
  • 31. 1 -NOC alerts -Biz Manager phone call -Escalate to L2 -L2 bridge call -SysAdmin try this & that -Consensus: Foo service -Interrupt Foo app dev -Find sysadmin with access -Multiple roundtrips to get logs 2 3 -Identify incorrect restarts -Attempt to get middleware to restart -Need approval 4 -Escalate to SVP -VPs try to assess impact -Approve 5 6 -Find middleware engineer -Review docs -Trial & error -Need API tests per policy -Find tester -Pass tests 7 -Middleware calls restart done -NOC sees green -Ticket closed 8 10:00am 12:00pm 2:00pm 2:30pm 5:00pm 6:00pm 9:30pm 10:30pm +00:30 after alert +02:30 after alert +04:30 after alert +05:00 after alert +07:30 after alert +08:30 after alert +12:00 after alert +13:00 after alert Ticket Context Wagon NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x L2 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo L2 SysAdmin (Lee) Middleware Mngr. SVP Chief of Staff 2 x Tech VP Middleware (Scott) Cust. Engage. (Varsha) Solve here or here NOT HERE Shift-Left
  • 32. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4°
  • 33. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction
  • 34. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction But what gets in the way?
  • 35. Empower those closest to the issue escalate escalate 1° 2° 3° escalate 4° Push the ability to take action this direction SilosBut what gets in the way?
  • 36. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos
  • 37. Silos ruin everything Backlog Context I need X Backlog I do X Requests for X Silo A Priorities Context Priorities Silo B Tools Tools
  • 38. Silos ruin everything Backlog Context I need X Backlog I do X Requests for X Silo A Priorities Context Priorities Silo B Tools Tools
  • 39. Ticket-Driven Request Queues Are Best Indicator of Silos Team A (Dev) Team B (Ops) Ticket System ??
  • 40. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Ticket-Driven Request Queues Are Best Indicator of Silos
  • 41. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Snowflake Maker Ticket-Driven Request Queues Are Best Indicator of Silos
  • 42. Popular: Replace Silos with Cross Functional Team Dev/Test Release OperatePlanning
  • 43. Popular: Replace Silos with Cross Functional Team Dev/Test Release OperatePlanning Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 44. Dev/Test Release OperatePlanning Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 45. EnvironmentsDBAs Network Security NOC Dev/Test Release OperatePlanning Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 46. EnvironmentsDBAs Network Security NOC Dev/Test Release OperatePlanning Team A (Dev) Team B (Ops) Ticket System ?? Popular: Replace Silos with Cross Functional Team Cross-Functional Teams Cross-Functional Teams Cross-Functional Teams
  • 47. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service
  • 48. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Get rid of as many ticket-driven request queues as possible
  • 49. Team A (Dev) Team B (Ops) Ticket System ?? Silo Builder Snowflake Maker Get rid of as many ticket-driven request queues as possible
  • 50. … replace with Operations as a Service design pattern Team A (Dev) Team B (Ops)Ticket System Operations as a Service Execute On Demand Define Procedures Vet Procedures Define Policies Actual Exceptions Execute On Demand
  • 51. Changes how your organization thinks about automated procedures…
  • 52. Automated procedures are comprised of three parts Definition of the automated procedure Execution of the automated procedure Governance of the automated procedure Define Execute Govern
  • 53. Automated procedures are comprised of three parts Definition of the automated procedure Execution of the automated procedure Governance of the automated procedure Define Execute Govern (security, oversight, compliance, etc.)
  • 54. Traditional Ops Silo Define Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 55. Rigid Self-Service Define Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 56. Define Execute Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops Rigid Self-Service (limited)
  • 57. High-Velocity Handoffs Define Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 58. Self-Service Operations Define Govern Execute “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 59. Self-Service Operations Define Govern Execute Govern “Consumers of Ops” (Dev, QA, Release, NOC, Security, etc.) Ops
  • 60. fdfd Operations as a Service Operations as a Service ED G Team B (Ops) Vet Procedures Define Policies Execute On Demand Team A (Dev) Define Procedures Execute On Demand
  • 61. fdfd Operations as a Service Move definition, execution, and governance to where you get the most effective use of labor and best flow of work Operations as a Service ED G Team B (Ops) Vet Procedures Define Policies Execute On Demand Team A (Dev) Define Procedures Execute On Demand
  • 62. Rundeck: Open Source Platform For Operations as a Service #! ! "# $ Scripts APIs Tools Cloud VMs Containers Orchestration & Scheduling of Workflows Collect and Process Output Infrastructure details and state from multiple sources Config. Man. CMDB Monitor. Metrics Cloud Corp Directory Authentication and roles ITSM Tickets, work status, approvals >_ Create workflows ● Define ACL policies ● Execute workflows Web GUI API CLI
  • 63. Common implementation pattern for Operations as a Service…
  • 64. Step 1: Establish a Secure Ops Hub Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Execute + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 65. Step 2: Establish a SDLC for Ops Procedures Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Execute Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 66. Step 3: Connect with Enterprise Management Systems Service Desk CustomersOps Support get visibility and audit trail updated by support tools Service Ticket Execute Software Supply Chain Ops integrate with artifact flow Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 67. Step 4: Make Compliance Really Happy Service Desk CustomersOps Support get visibility and audit trail updated by support tools Service Ticket Execute Software Supply Chain Ops integrate with artifact flow Who reviewed it? Who ran it? When? Where? Approval trail? Who created the procedure? Who created the policy? Operations as a Service Engineers get visibility and controlled self-service Secrets Ops Procedures “Status” “Firewall Change” "Restart" deny allow Identity Audit Logs Infrastructure view Service health System metrics Ops Support use for remediation procedures Inventory and Health Source Code Repo if (($state==wait)) then kill -9 $PID fi Change Product Engineers produce automated procedures and health checks. RISKY Automated Procedures and Health Checks FIX Code review + Monitoring Tools Security and Ops manages access, configuration, and compliance
  • 68. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service 6. Encourage Developers to think “Operable First”
  • 69. Encourage Developers to think “Operable First”
  • 70. Operations is the business (unless you literally sell packaged software) Encourage Developers to think “Operable First”
  • 71. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Encourage Developers to think “Operable First”
  • 72. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Encourage Developers to think “Operable First”
  • 73. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Encourage Developers to think “Operable First”
  • 74. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Encourage Developers to think “Operable First”
  • 75. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Create immutable versioned artifacts and use standard packaging Encourage Developers to think “Operable First”
  • 76. Operations is the business (unless you literally sell packaged software) Deployability, configurability, monitoring are service features Build configurability into the service, don’t externalize it Demand “prod-like” environments everywhere Make any handoff between teams “verification-driven” Create immutable versioned artifacts and use standard packaging Integration tests over unit tests Encourage Developers to think “Operable First”
  • 77. Let’s look at a company improving it’s incident response capability…
  • 78.
  • 79.
  • 81. 90% Reduction in MTTR 50% Reduction in escalations 55% Reduction of overall support costs
  • 82.
  • 83. 1. New org, support, and escalation model
  • 84. escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model
  • 85. escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min
  • 86. 2. Key: Push the ability to take action closest to the problem escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min
  • 87. 2. Key: Push the ability to take action closest to the problem escalate 1° 2° 3° 4° escalate escalate 1. New org, support, and escalation model EMT ER Trauma Surgeon Specialist Surgeon TOC (NOC) SRE Production Eng. Scrum Teams Data Services Platform Eng. Global Network > 15 min > 30 min > 60 min 3. Longterm investment in operability (deployment, configuration, monitoring, automated runbooks)
  • 88. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html
  • 89. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams
  • 90. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy
  • 91. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks
  • 92. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks • Empowered developers with limited self-service operations
  • 93. “Support at the Edge” Sources: https://www.youtube.com/watch?v=_hr4KiB19bQ http://rundeck.org/stories/mark_maun.html • Automated Ops procedures written/ vetted by the delivery teams • Ops remained in full control of what can run and security policy • Empowered NOC and other support teams with self-service ops tasks • Empowered developers with limited self-service operations • Combined with new incident response and escalation model
  • 94. 1. Use Lean thinking and analyze like any process 2. Make sure you are measuring the right things 3. “Shift Left” control and decision making 4. Get rid of silos 5. Establish Operations as a Service 6. Encourage Developers to think “Operable First” Recap!