SysAdmin to SRE:
Solving the Last Mile Problem
Damon Edwards
@damonedwards
Operations:
The Last Mile
Silos
Queues
Excessive Toil
Low Trust
https://www.youtube.com/watch?v=1zUtBLZ4Lus
SRE
(Site Reliability Engineering)
“SRE… When you ask software engineers to do operations”
“SRE… Next-generation, cloud-native Operations”
Class SRE implements DevOps
“SRE… When Ops does more engineering than Ops”
SRE
Why SRE?
Simon Sinek
Start with
“why?”
Story time….
It was just another Thursday…
Call Center
Agent
Call Center
Agent
My browser times out!
Wow, this is so slow!
I can’t login
What a c#@p service!
I can’t login
Barely works
It’s broken
Customers
Thursday 10:00am PDT
(1200 Agents)
Call Center
Agent
Technical
Support
Service
Desk
Many tickets
Many calls
Customers
“Stuff
isn’t
working”
VIP Customers
“…but monitoring
is all green”
Service
Desk
OK
OK
OK
OK
OK
Ops Ops
Call Center
Agent
Customer
Now it works Now it works
Service
Desk
?
Ops Ops
3:30pm
The next day…
Call Center
Agent
Call Center
Agent
My browser times out!
Wow, this is so slow!
I can’t login
Are you kidding me?
How hard is it to run a website?
Soo Sloooow
It’s broken
Customers
Friday 9:00am PDT
Call Center
Agent
Technical
Support
Service
Desk
Many tickets
Many calls
Customers
“Stuff
isn’t
working”
VIP Customers
“…but monitoring
is all green”
Service
Desk
OK
OK
OK
OK
OK
Service
Desk
Escalate!
Incident
Commander
Ticket
Launch the
incident bridge
Ops
Incident
Commander
Ops
Dev
Sec
Ops
Bridge
Call
Ops
Not me…
Not me…
Not me…
Not me…
No code
updates
Probably not the new server
hardening process or the network
changes…
Headcount: 40
Ops
Ops
Ops
Uhh... WHAT new server hardening process and network changes?
Sec
We were going to fail audit… you didn’t get the email?
Dev
Bridge
Call
No code
updates
War
Room
SysAdmin
“Try
this”
Test
Platform
“Try
this”
Test
Network
“Try
this”
Test
Security
“Try
this”
Test
Storage
“Try
this”
Test
SysEng
“Try
this”
Test
Incident
Commander
“Theory: new
security updates”
Call Center
Agent
Customer
Now it works Now it works
Call Center
Manager
What is going
on?
3:30pm
Headcount: 30
Ops
Ops
Sec
Ops
Ops Ops
Rollback:
-OS changes
-Network changes
Over the weekend
QA
Headcount: 10
Monday morning…
Call Center
Agent
Call Center
Agent
… so frustrating
Not again…
I can’t login
Are you kidding me?
How hard is it to run a website?
Soo Sloooow
It’s broken
Customers
Monday 10:00am PDT
Call Center
Agent
Technical
Support
Service
Desk
Many tickets
Many calls
Customers
“Stuff
isn’t
working”
VIP Customers
“…but monitoring
is all green”
Service
Desk
OK
OK
OK
OK
OK
Customer Systems
Lead Dev
ding!
Ignore.
Incident
Commander
Hey did you see that ticket?
sigh. I’ll take a look
Scrum
Customer Systems
Lead Dev
Something is wrong with
the database connection…
… But our code didn’t
change.
DBA
No recent database
updates.
Dev
Bridge
Call
No code
updates
War
Room
DBA
“Try
this”
Test
SysAdmin
“Try
this”
Test
Network
“Try
this”
Test
Security
“Try
this”
Test
SysEng
“Try
this”
Test
Incident
Commander
“New Theory: It’s the database connection”
Call Center
Manager
What is going
on?
Call Center
Agent
Customer
Now it works Now it works
Call Center
Manager
What is going
on?
4:00pm
Headcount: 20
The next day…
Dev
Bridge
Call
No code
updates
War
Room
DBA
“Try
this”
Test
DBA
“Try
this”
Test
SysAdmin
“Try
this”
Test
SysEng
“Try
this”
Test
SysEng
“Try
this”
Test
Incident
Commander
“New Theory: problem with stored procedures… but not sure what”
Incident
Commander
DB Vendor phone
support isn’t
cutting it.
Call Center
Manager
What is going
on?
Call Center
Director
What is being
done?
Tuesday 10:00am PDT
Call Center
Agent
Call Center
Agent
… so frustrating
Not again…
I can’t login
Are you kidding me?
How hard is it to run a website?
Soo Sloooow
It’s broken
Customers
Dev
No code
updates
War
Room
Test
Test
Test
Test
Test
Incident
Commander
Incident
Commander
Vendor
Management
DB Vendor phone
support isn’t
cutting it.
We only paid for
bronze support
Approval
Request
“Need to upgrade support”
Finance
??
The next day…
Dev
Bridge
Call
No code
updates
War
Room
Vendor
Consultant
“Let’s see what the vendor consultant says”
Call Center
Manager
What is going
on?
Call Center
Director
What is being
done?
OK, let me take a
look.
Wednesday 10:00am PDT
Call Center
Agent
Call Center
Agent
… so frustrating
Not again…
I can’t login
Are you kidding me?
How hard is it to run a website?
Soo Sloooow
It’s broken
Customers
Headcount: 15
Dev
No code
updates
War
Room
Call Center
Manager
What is going
on?
Call Center
Director
What is being
done?
Vendor
Consultant
So?
Someone toggled on the new
performance analysis feature
DBA
3:00pm
It’s been choking on a particular stored procedure you use everywhere…
This stored procedure has almost 400 parameters.
It’s 1 million lines of code
but… it’s been working for years!
?
?
?
DBA
Dev
Ops
SysEng
QA
Ops
QA
DBA
change
config
load
test
Dev
1:00am
Headcount: 10
Post mortem…
Vendor
Consultant
Dir
Finance
No budget
GM, Line of
Business
Stay on
schedule
You should really
fix that…
Ops
It’s not fixed.
It’s just turned off.
VP Ops
I’m told bug
#8543 is P1, but
was rejected?
Ops
Refactor it before
it bites us again.
VP Dev
It’s not a bug.
You already have
a fix.
Dev
wins
Dev
wins
Dev
No time.
Dev
Their change
broke it.
Dev vs Ops
Response labor: $270,000
Lost call center productivity: $620,000
$890,000
(+ project delays)
(+ brand damage)
> $1,000,000
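The totals above can be checked with a few lines of arithmetic. This is only a sketch: the two dollar figures come from the slide, while the variable names and the "implied minimum" calculation are assumptions added for illustration. The slide does not put a number on project delays or brand damage; it only says the overall cost exceeds $1,000,000.

```python
# Figures as stated on the slide (US dollars).
response_labor = 270_000
lost_call_center_productivity = 620_000

# Directly measured cost of the incident.
direct_cost = response_labor + lost_call_center_productivity
print(f"Direct cost: ${direct_cost:,}")

# Assumption for illustration: if the overall cost exceeds $1,000,000,
# the unquantified items (project delays, brand damage) must together
# exceed this amount.
claimed_floor = 1_000_000
implied_minimum_other = claimed_floor - direct_cost
print(f"Project delays + brand damage: more than ${implied_minimum_other:,}")
```

Keeping the measured and unmeasured parts separate makes it clear which portion of the "> $1,000,000" figure is accounted for and which is an estimate.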
How did they end up here?
Corporate Plan
Annual Budget
Project Plan
Requirements
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
What were they thinking?
26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
ITIL Processes
The same as everyone else.
Encourages
Silos
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
Encourages
Silos
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
Command and Control Management
26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
Encourages
Silos
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
Command and Control Management
Deming
“3. Cease dependence on
inspection to achieve
quality.”
26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
Encourages
Silos
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
Command and Control Management
Deming
“3. Cease dependence on
inspection to achieve
quality.”
Charity Majors
“Distributed systems have an
infinite list of almost impossible
failure scenarios”
26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
Encourages Silos
(Context, Process, Tooling, Capacity)
Command and Control Management
Deming
“3. Cease dependence on
inspection to achieve
quality.”
Charity Majors
“Distributed systems have an
infinite list of almost impossible
failure scenarios”
Is there a different way?
The Rise of a New IT Operations Support Model
By 2015, DevOps will evolve from a niche strategy employed by large cloud providers into a mainstream strategy employed by 20% of Global 2000 organizations.
Why DevOps will emerge:
• DevOps is not usually driven from the top down and, thus, may be more easily accepted by IT operations teams.
• ITIL and other best practices frameworks are acknowledged to have not delivered on their goals, enabling IT organizations to look for new models.
• The growing interest in tools such as Chef, Puppet, etc., will help stimulate demand for OSS-based management.
Why DevOps will not emerge:
• Cultural changes are the hardest to implement, and DevOps requires a significant rethinking of IT operations conventional wisdom.
• There is a large body of work with respect to ITIL and other best practices frameworks that is already accepted within the industry.
• Open source (OSS) management tools, which are more aligned with this approach, have not seen significant enterprise market share traction.
March 18, 2011
Cameron Haight
DevOps vs ITIL?
DevOps…
Product, Not Project
Continuous Delivery
Shift Left
(and more!)
…then comes SRE
Error Budgets
Toil Limits
Cloud Native
(and more!)
Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native
“Value-Aligned” and Self-Regulating
Dev, Ops
Cross-Functional Team
Shared Responsibility Model
“DevOps is a
deconstructive
movement”
Jon Hall
(Diagram labels: Developer, Release Plan, Deploy Feature to Production, Old Release Still Running, Bugs)
Immutable microservice deployment
scales, is faster with large teams and
diverse platform components
Adrian Cockcroft
https://www.youtube.com/watch?v=nMTaS07i3jk
DockerCon EU 2014
Architecture enables
speed.
Speed is the advantage.
Keeps the people out of
their own way!
What is the innovation of SRE?
Principles are what make SRE different
Stephen Thorne, Google
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
1. SRE needs Service Level Objectives, with consequences
SLO and Error Budgets: Tools for Shared Responsibility
Service Level Indicator (scale: 0 to 100)
Service Level Objective
Error Budget*
(*Use this to improve the service)
DEV
BIZ
Ops
SLO takes priority!!
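A minimal sketch of the arithmetic behind the SLI, SLO, and error budget, assuming a request-based SLI (fraction of successful requests). The class name, the field names, and the "freeze releases" consequence are illustrative assumptions, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (all names are illustrative)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    @property
    def sli(self) -> float:
        """Service Level Indicator: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Failed requests the SLO tolerates in this window: (1 - SLO) * total."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        if self.error_budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / self.error_budget


def release_decision(window: SloWindow) -> str:
    """Assumed consequence policy: once the budget is spent, the SLO takes priority."""
    if window.budget_remaining <= 0:
        return "freeze feature releases; prioritize reliability work"
    return "ship features; budget remains"


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {window.sli:.4%}")
    print(f"Error budget remaining: {window.budget_remaining:.0%}")
    print(release_decision(window))
```

The point of the sketch is that Dev, Biz, and Ops can all read the same number: while budget remains it can be spent on change, and once it is gone the agreed consequence applies.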
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
Vivek Rau, Google
Toil vs. Engineering Work
Toil                    | Engineering Work
Lacks Enduring Value    | Builds Enduring Value
Rote, Repetitive        | Creative, Iterative
Tactical                | Strategic
Increases With Scale    | Enables Scaling
Can Be Automated        | Requires Human Creativity
Excessive Toil Prevents Fixing the System
Toil at manageable percentage of capacity
• Engineering work: reduce toil, improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
• No capacity to reduce toil
• No capacity to improve business
Downward spiral is inevitable!
Toil is a Naturally Occurring Force
General Evolution of Automation
1. No automation
2. Externally maintained system-specific automation
3. Externally maintained generic automation
4. Internally maintained system-specific automation
5. Systems that don’t need any automation
Niall Murphy
Microsoft Azure
(Diagram labels: Launch (ToDos & Unknowns), Toil, Mature)
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
SRE teams have the ability to regulate their workload
SRE can say no.
Example:
What if handing off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
What's the Difference Between DevOps and SRE?
(class SRE implements DevOps)
@sethvargo @lizthegrey
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences
   Company-wide culture change (hard!)
2. SREs have time to make tomorrow better than today
   Reduce toil. Everybody wins!
3. SRE teams have the ability to regulate their workload
   Company-wide culture change (hard!)
Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets
… stay out of their way!
Start reducing toil today
1. Track toil levels for each team
Track toil levels for each team
• Standardize (e.g. meetings and email are “overhead”, not “toil”)
• Track
  • Self-reporting
  • Periodic surveys
  • SM or PM interview/sampling
• Don’t get lost in time-tracking weeds!
Start reducing toil today
1. Track toil levels for each team
2. Set toil limit for each team (50% is conventional wisdom)
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
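The three steps above can be sketched in a few lines, assuming teams self-report hours in three categories. The team names, the hours, and the choice to measure toil as a share of all reported hours (overhead included) are hypothetical; only the 50% limit comes from the slide.

```python
from collections import defaultdict

TOIL_LIMIT = 0.50  # the slide's "conventional wisdom" limit

# Hypothetical self-reported hours: (team, category, hours).
# "overhead" (meetings, email) is recorded but never counted as toil.
REPORTS = [
    ("platform", "toil", 22), ("platform", "engineering", 14), ("platform", "overhead", 4),
    ("database", "toil", 12), ("database", "engineering", 24), ("database", "overhead", 4),
]


def toil_levels(reports):
    """Step 1: return {team: toil hours as a share of all reported hours}."""
    toil = defaultdict(float)
    total = defaultdict(float)
    for team, category, hours in reports:
        total[team] += hours
        if category == "toil":
            toil[team] += hours
    return {team: toil[team] / total[team] for team in total if total[team] > 0}


def teams_over_limit(levels, limit=TOIL_LIMIT):
    """Steps 2 and 3: teams above the limit, worst first (funding candidates)."""
    over = [(team, share) for team, share in levels.items() if share > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    levels = toil_levels(REPORTS)
    for team, share in teams_over_limit(levels):
        print(f"{team}: {share:.0%} toil, over the {TOIL_LIMIT:.0%} limit")
```

Coarse categories are deliberate here: a rough weekly share per team is enough to see who is over the limit without getting lost in time tracking.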
Example Process: “Code Yellow”
Michael Kehoe, Todd Palino (LinkedIn)
At SREcon Americas 2019
Where to focus?
Reduce Technical Debt
Re-Engineer Processes
Enable Self-Service
Eliminate Interruptions
Eliminate Waiting
Self-Service
(runbooks)
Do X.
… and a lot less toil
Empower teams to spot and fix the anti-patterns.
“Fix this for me, fix it again, then fix it again.”
Before: “I need you to do X” → Ticket → Wait → Interrupt → “Do X” → “Done.” Your other work: “Later…”
After: “I need you to do X” → Self-Service → “Done.” Your other work: x2, x3
“I could fix it, but I can’t get to it.”
Before: “I could fix it if I could get to it” (Environment: Wait, Interrupt)
After: “I’ve got this!” (Environment: Self-Service)
“The dog-pile.”
“I think it’s a problem with db07-store2.uswest.acme”
Everyone runs “$ top” on db07-store2.uswest.acme
Self-Service: “healthcheck store2 -all”
“I’m an expert, I don’t read the wiki.”
Before:
docs: “Service has changed. Use this flag or bad things will happen!”
docs: “Pause monitoring first or we all get woken up!”
Later… “I’ve done this before. I’ve got this…” runs “restart -doit -now” on the Environment
After:
Update the Self-Service Restart Job: “Service has changed. Use this flag or bad things will happen!” “Pause monitoring first or we all get woken up!”
Later… “I’ve done this before. I’ve got this.” runs Self-Service “restart” on the Environment ✅
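A minimal sketch of what "updating the restart job" could look like: the two warnings from the docs ("pause monitoring first", "use this flag") are encoded in the job itself, so the expert who skips the wiki still gets the current procedure. The environment class and the flag name are hypothetical stand-ins, not a real tool's API.

```python
from contextlib import contextmanager


class FakeEnvironment:
    """Stand-in for real tooling (hypothetical); records each action in order."""

    def __init__(self):
        self.log = []

    def pause_monitoring(self):
        self.log.append("pause monitoring")

    def resume_monitoring(self):
        self.log.append("resume monitoring")

    def restart(self, flags):
        self.log.append("restart " + " ".join(flags))


# Hypothetical flag standing in for "use this flag or bad things will happen".
REQUIRED_FLAGS = ["--new-required-flag"]


@contextmanager
def monitoring_paused(env):
    """Pause monitoring for the duration of the block, and always resume it."""
    env.pause_monitoring()
    try:
        yield
    finally:
        env.resume_monitoring()


def restart_job(env):
    """The self-service job: the current procedure lives here, not in the wiki."""
    with monitoring_paused(env):
        env.restart(REQUIRED_FLAGS)
    return env.log


if __name__ == "__main__":
    for step in restart_job(FakeEnvironment()):
        print(step)
```

When the service changes again, the owning team updates `restart_job` once, and everyone who runs "restart" later picks up the change without reading anything.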
“Known issue… doesn’t get permanent fix”
Recap: Make Tomorrow Better Than Today
Beware: impact of traditional management structures
Be practical and start focusing on toil
Find and fix toil anti-patterns
Empower with Self-Service Runbooks
SRE is a new way to think about Ops work
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Use DevOps and SRE to improve speed and quality
Let’s talk…
@damonedwards
damon@rundeck.com

SysAdmin to SRE: Solving the Last Mile Problem

  • 1.
    SysAdmin to SRE: Solvingthe Last Mile Problem Damon Edwards @damonedwards
  • 3.
  • 4.
    Operations: The Last Mile SilosQueues Excessive ToilLow Trust
  • 5.
  • 6.
  • 8.
    “SRE… When you ask softwareengineers to do operations” “SRE… Next-generation, cloud-native Operations” Class SRE implements DevOps “SRE… When Ops does more engineering than Ops”
  • 9.
    “SRE… When you ask softwareengineers to do operations” “SRE… Next-generation, cloud-native Operations” Class SRE implements DevOps “SRE… When Ops does more engineering than Ops” SRE
  • 10.
  • 11.
  • 12.
    Its was justanother Thursday…
  • 13.
    Call Center Agent Call Center Agent Mybrowser times out!Wow, this is so slow! I can’t login What a c#@p service! I can’t login Barely works It’s broken Customers Thursday 10:00am PDT (1200 Agents)
  • 14.
    t a c#@p ervice! rksCall Center Agent Technical Support Service Desk Many tickets Many calls Customers “Stuff isn’t working” VIP Customers
  • 15.
    Call Center Agent Technical Support Service Desk Many tickets Manycalls “Stuff isn’t working” “…but monitoring is all green” Service Desk OK OK OK OK OK Ops Ops
  • 16.
    …but monitoring is allgreen” OK OK OK OK OK Call Center Agent Customer Now it works Now it works Service Desk ? Ops Ops 3:30pm
  • 17.
  • 18.
    Call Center Agent Call Center Agent Mybrowser times out!Wow, this is so slow! I can’t login Are you kidding me? How hard is it to run a website? Soo Sloooow It’s broken Customers Custo VIP Cu Friday 9:00am PDT
  • 19.
    Call Center Agent Technical Support Service Desk Many tickets Manycalls Customers “Stuff isn’t working” VIP Customers “…but monitoring is all green” Service Desk OK OK OK OK OK
  • 20.
    Service Desk Escalate! Incident Commander Ticket Launch the incident bridge Ops Incident Commander Ops Dev Sec Ops Bridge Call Ops Notme… Not me… Not me… Not me… No code updates Probably not the new server hardening process or the network changes… Headcount: 40
  • 21.
    ev No code updates Probably notthe new server dening process or the network changes… Ops Ops Ops Uhh.. WHAT new server hardening process and network changes? Sec We were going to fail audit… you didn’t get the email?
  • 22.
  • 23.
  • 24.
  • 25.
    Call Center Agent Call Center Agent …so frustrating Not again… I can’t login Are you kidding me? How hard is it to run a website? Soo Sloooow It’s broken Customers Custo VIP Cus Monday 10:00am PDT
  • 26.
    Call Center Agent Technical Support Service Desk Many tickets Manycalls Customers “Stuff isn’t working” VIP Customers “…but monitoring is all green” Service Desk OK OK OK OK OK
  • 27.
    “…but monitoring is allgreen” Service Desk OK OK OK OK OK Customer Systems Lead Dev ding! Ignore. Incident Commander Hey did you s that ticket? Scrum
  • 28.
    ustomer Systems Lead Dev Ignore. Incident Commander Heydid you see that ticket? sigh. I’ll take a look Scrum Customer Systems Lead Dev Customer S Lead D Somet the data
  • 29.
    . I’ll take alook r Systems d Dev Customer Systems Lead Dev Something is wrong with the database connection… … But our code didn’t change. DBA No recent database updates.
  • 30.
  • 31.
    Dev Bridge Call No code updates War Room DBA “Try this” Test SysAdmin “Try this” Test Network “Try this” Test Security “Try this” Test SysEng “Try this” Test Incident Commander “New Theory:Its the database connection” Call Center Agent Customer Now it works Now it works Call Center Manager What is going on? 4:00pm Headcount: 20
  • 32.
  • 33.
    Dev Bridge Call No code updates War Room DBA “Try this” Test DBA “Try this” Test SysAdmin “Try this” Test SysEng “Try this” Test SysEng “Try this” Test Incident Commander“New Theory:“problem with stored procedures… but not sure what” Incident Commander DB Vendor phone support isn’t cutting it. Call Center Manager What is going on? Call Center Director What is being done? Tuesday 10:00am PDT Call Center Agent Call Center Agent … so frustrating Not again… I can’t login Are you kidding me? How hard is it to run a website? Soo Sloooow It’s broken Customers
  • 34.
    Dev No code updates War Room Test Test Test Test Test Incident Commander Incident Commander Vendor Management DB Vendorphone support isn’t cutting it. We only paid for bronze support Call Center Manager What is going on? Call Center Director What is being done? Approval Request “Need to upgrade support” Finance ??
  • 35.
  • 36.
    Dev Bridge Call No code updates War Room Vendor Consultant “Let’s seewith the vendor consultant says” Call Center Manager What is going on? Call Center Director What is being done? OK, let me take a look. Ven Cons So per Wednesday 10:00am PDT Call Center Agent Call Center Agent … so frustrating Not again… I can’t login Are you kidding me? How hard is it to run a website? Soo Sloooow It’s broken Customers Headcount: 15
  • 37.
    Dev e No code updates War Room Call Center Manager Whatis going on? Call Center Director What is being done? Vendor Consultant So? Someone toggled on the new performance analysis feature DBA 3:00pm dcount: 15
  • 38.
    So? Vendor Consultant Its been chokingon a particular stored procedure you use everywhere… This stored procedure has almost 400 parameters. It’s 1 million lines of code but… its been working for years! ? ? ?DBA Dev m
  • 39.
    but… its been workingfor years! ? ? ? Ops SysEng QA Ops QA DBA change config load test Dev 1:00am Headcount: 10
  • 40.
    but… its been workingfor years! ? ? ? Ops SysEng QA Ops QA DBA change config load test Dev 1:00am Headcount: 10 .
  • 41.
  • 42.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 43.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 44.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 45.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 46.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 47.
    Vendor Consultant Dir Finance No budget GM, Lineof Business Stay on schedule You should really fix that… Ops It’s not fixed. It’s just turned off. VP Ops I’m told bug #8543 is P1, but was rejected? Ops Refactor it before it bites us again. VP Dev It’s not a bug. You already have a fix. Dev wins Dev wins Dev No time. Dev Their change broke it.Dev vs Ops
  • 48.
    Incident timeline (1200 Agents)
    Thursday 10:00am PDT: Customers: “My browser times out!” “Wow, this is so slow!” “I can’t login” “What a c#@p service!” “Barely works” “It’s broken”. Call Center Agent, Technical Support, Service Desk: many tickets, many calls. Customers, VIP Customers: “Stuff isn’t working” “…but monitoring is all green” (OK OK OK OK OK). Service Desk: ? Ops.
    3:30pm: Customer, Call Center Agent: “Now it works.”
    Friday 9:00am PDT: Customers: “My browser times out!” “Wow, this is so slow!” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”. Many tickets, many calls. “Stuff isn’t working” “…but monitoring is all green”. Service Desk: Escalate! Ticket. Incident Commander: Launch the incident bridge. Bridge Call (Ops, Dev, Sec): “Not me…” “Not me…” “Not me…” “Not me…” Dev: No code updates. “Probably not the new server hardening process or the network changes…” Ops: “Uhh.. WHAT new server hardening process and network changes?” Sec: “We were going to fail audit… you didn’t get the email?” War Room: SysAdmin, Platform, Network, Security, Storage, SysEng: “Try this”, Test. Incident Commander: “Theory: new security updates”. Call Center Manager: What is going on? Ops: Rollback: OS changes, Network changes.
    3:30pm: Customer, Call Center Agent: “Now it works.”
    Over the weekend: QA. Headcount: 40. Headcount: 30. Headcount: 10.
    Monday 10:00am PDT: Customers: “… so frustrating” “Not again…” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”. Many tickets, many calls. “Stuff isn’t working” “…but monitoring is all green”. Bridge Call: DBA, SysAdmin, Network, Security, SysEng: “Try this”. “New Theory: It’s the database connection”. Customer Systems Lead Dev: ding! Ignore. Incident Commander: “Hey did you see that ticket?” “sigh. I’ll take a look”. Scrum. Customer Systems Lead Dev: “Something is wrong with the database connection… But our code didn’t change.” DBA: No recent database updates. Headco
    Tuesday 10:00am PDT: Bridge Call. Dev: No code updates. War Room: DBA, DBA, SysAdmin, SysEng, SysEng: “Try this”, Test. Incident Commander: “New Theory: problem with stored procedures… but not sure what”. Incident Commander, Vendor Management: “DB Vendor phone support isn’t cutting it.” “We only paid for bronze support.” Call Center Manager: What is going on? Call Center Director: What is being done? Approval Request: “Need to upgrade support”. Finance: ??
    Wednesday 10:00am PDT: Customers: “… so frustrating” “Not again…” “I can’t login” “Are you kidding me?” “How hard is it to run a website?” “Soo Sloooow” “It’s broken”. Bridge Call. Dev: No code updates. War Room, Vendor Consultant: “Let’s see what the vendor consultant says”. Call Center Manager: What is going on? Call Center Director: What is being done? Vendor Consultant: “OK, let me take a look.” “So?” Vendor Consultant: “It’s been choking on a particular stored procedure you use everywhere… Someone toggled on the new performance analysis feature.” “This stored procedure has almost 400 parameters. It’s 1 million lines of code but… it’s been working for years!” ? ? ? Ops, Sys Ops, QA: change config, load test.
    3:00pm: DBA, Dev. Headcount: 15. Headcount: 10.
  • 49.
    Response labor: $270,000 + Lost call center productivity: $620,000 = $890,000
  • 50.
    Response labor: $270,000 + Lost call center productivity: $620,000 = $890,000 (+ project delays)
  • 51.
    Response labor: $270,000 + Lost call center productivity: $620,000 = $890,000 (+ project delays) (+ brand damage)
  • 52.
    Response labor: $270,000 + Lost call center productivity: $620,000 = $890,000 (+ project delays) (+ brand damage) > $1,000,000
  • 53.
    How did they end up here?
  • 55.
  • 56.
  • 57.
  • 58.
    Corporate Plan Annual Budget Project Plan Requirements Context Context Process Process Tooling Tooling Capacity Capacity
  • 59.
    What were they thinking?
  • 60.
    26 ITIL Processes: Service Validation & Testing; Strategy Management for IT Services; Supplier Management; The 7 Step Improvement; Transition Planning & Support; Access Management; Availability Management; Business Relationship Management; Capacity Management; Change Management; Change Evaluation; Demand Management; Design Coordination; Event Management; Financial Management for IT Services; Incident Management; Information Security Management; IT Service Continuity Management; Knowledge Management Process; Problem Management Process; Release & Deployment Management; Request Fulfillment Process; Service Asset & Configuration Management; Service Catalog Management; Service Level Management; Service Portfolio Management. ITIL Processes: The same as everyone else.
  • 65.
    26 ITIL Processes. Encourages Silos (Context, Context, Process, Process, Tooling, Tooling, Capacity, Capacity).
  • 66.
    26 ITIL Processes. Encourages Silos (Context, Context, Process, Process, Tooling, Tooling, Capacity, Capacity). Command and Control Management.
  • 67.
    26 ITIL Processes. Encourages Silos (Context, Context, Process, Process, Tooling, Tooling, Capacity, Capacity). Command and Control Management. Deming: “3. Cease dependence on inspection to achieve quality.”
  • 68.
    26 ITIL Processes. Encourages Silos (Context, Context, Process, Process, Tooling, Tooling, Capacity, Capacity). Command and Control Management. Deming: “3. Cease dependence on inspection to achieve quality.” Charity Majors: “Distributed systems have an infinite list of almost impossible failure scenarios”
  • 69.
    26 ITIL Processes. Encourages Silos (Context, Context, Process, Process, Tooling, Tooling, Capacity, Capacity). Command and Control Management. Deming: “3. Cease dependence on inspection to achieve quality.” X X X X X X Charity Majors: “Distributed systems have an infinite list of almost impossible failure scenarios”
  • 70.
    Is there a different way?
  • 71.
    The Rise of a New IT Operations Support Model
    By 2015, DevOps will evolve from a niche strategy employed by large cloud providers into a mainstream strategy employed by 20% of Global 2000 organizations.
    Why DevOps will emerge:
      - DevOps is not usually driven from the top down and, thus, may be more easily accepted by IT operations teams.
      - ITIL and other best practices frameworks are acknowledged to have not delivered on their goals, enabling IT organizations to look for new models.
      - The growing interest in tools such as Chef, Puppet, etc., will help stimulate demand for OSS-based management
    Why DevOps will not emerge:
      - Cultural changes are the hardest to implement, and DevOps requires a significant rethinking of IT operations conventional wisdom.
      - There is a large body of work with respect to ITIL and other best practices frameworks that is already accepted within the industry.
      - Open source (OSS) management tools, which are more aligned with this approach, have not seen significant enterprise market share traction.
    March 18, 2011, Cameron Haight
    DevOps vs ITIL?
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
    Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native. “Value-Aligned” and Self-Regulating. Dev, Ops. Cross-Functional Team, Cross-Functional Team.
  • 79.
    Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native. “Value-Aligned” and Self-Regulating. Dev, Ops. Cross-Functional Team, Cross-Functional Team. Shared Responsibility Model.
  • 80.
    Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native. “Value-Aligned” and Self-Regulating. Dev, Ops. Cross-Functional Team, Cross-Functional Team. Shared Responsibility Model. “DevOps is a deconstructive movement” (Jon Hall)
  • 81.
    Diagram labels: Developer, Release Plan, Deploy Feature to Production (repeated per developer); Bugs; Old Release Still Running. Immutable microservice deployment scales, is faster with large teams and diverse platform components. Adrian Cockcroft, DockerCon EU 2014, https://www.youtube.com/watch?v=nMTaS07i3jk Architecture enables speed. Speed is the advantage.
  • 83.
    Diagram: developers, each with a release plan, each deploying a feature to production; bugs; old release still running. Immutable microservice deployment scales, is faster with large teams and diverse platform components. Adrian Cockcroft, DockerCon EU 2014: https://www.youtube.com/watch?v=nMTaS07i3jk Architecture enables speed. Speed is the advantage. Keeps the people out of their own way!
  • 84.
    What is the innovation of SRE?
  • 85.
    Principles are what makes SRE different
  • 86.
    Principles are what makes SRE different. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 87.
    Principles are what makes SRE different. 1. SRE needs Service Level Objectives, with consequences. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 88.
    SLO and Error Budgets: Tools for Shared Responsibility. Scale from 0 to 100: Service Level Indicator, Service Level Objective, Error Budget* (*Use this to improve the service)
  • 90.
    SLO and Error Budgets: Tools for Shared Responsibility. Scale from 0 to 100: Service Level Indicator, Service Level Objective, Error Budget* (*Use this to improve the service). DEV, BIZ, Ops.
  • 91.
    SLO and Error Budgets: Tools for Shared Responsibility. Scale from 0 to 100: Service Level Indicator, Service Level Objective, Error Budget* (*Use this to improve the service). DEV, BIZ, Ops. SLO takes priority!!
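To make the relationship between the three terms on this slide concrete, here is a minimal sketch in Python. It assumes a request-based SLI (good events divided by total events); the class name, field names, and sample numbers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one service over one measurement window (illustrative)."""
    slo_target: float    # e.g. 0.999 means 99.9% of events must be good
    total_events: int    # all requests (or probes) seen in the window
    bad_events: int      # events that failed the SLI's "good" criterion

    @property
    def sli(self) -> float:
        """Measured service level: the fraction of good events."""
        if self.total_events == 0:
            return 1.0
        return 1 - self.bad_events / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """Size of the budget: how many bad events the SLO tolerates."""
        return (1 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent (negative once overspent)."""
        if self.allowed_bad_events == 0:
            return 0.0 if self.bad_events else 1.0
        return 1 - self.bad_events / self.allowed_bad_events

    def slo_takes_priority(self) -> bool:
        """True once the budget is spent, so reliability work comes first."""
        return self.remaining_fraction <= 0


# Hypothetical window: one million requests against a 99.9% objective.
healthy = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
overspent = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=1_500)
```

In this sketch the healthy service has spent about a quarter of its budget and can keep shipping, while the overspent one is the case where "SLO takes priority" applies. What the consequence actually is (a release freeze, reprioritized work, something else) is a decision for Dev, Biz, and Ops to agree on together.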
  • 92.
    Principles of SRE are what set SRE apart. 1. SRE needs Service Level Objectives, with consequences. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 93.
    Principles of SRE are what set SRE apart. 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 94.
    Toil: Name For a Problem We’ve All Felt
  • 95.
    Toil: Name For a Problem We’ve All Felt. “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” (Vivek Rau, Google)
  • 96.
    Toil vs. Engineering Work
    Toil                  | Engineering Work
    Lacks Enduring Value  | Builds Enduring Value
    Rote, Repetitive      | Creative, Iterative
    Tactical              | Strategic
    Increases With Scale  | Enables Scaling
    Can Be Automated      | Requires Human Creativity
  • 97.
    Excessive Toil Prevents Fixing the System. Toil at manageable percentage of capacity: engineering work can reduce toil and improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business.
  • 99.
    Excessive Toil Prevents Fixing the System. Toil at manageable percentage of capacity: engineering work can reduce toil and improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business. Downward spiral is inevitable!
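The spiral can be illustrated with a toy model. This is only a sketch under invented assumptions, not something presented in the talk: toil is treated as a share of team capacity, it grows by a fixed rate each period as the service scales, and whatever capacity is left over goes to engineering work that removes toil. The function name and both parameters are hypothetical.

```python
def simulate_toil(initial_toil, periods, growth=0.10, payoff=0.15):
    """Return toil as a share of team capacity (0.0 to 1.0), one value per period.

    growth: assumed fractional growth of toil per period as the service scales.
    payoff: assumed toil removed per unit of capacity spent on engineering work.
    """
    toil = initial_toil
    history = [toil]
    for _ in range(periods):
        engineering = 1.0 - toil                  # capacity that toil leaves over
        toil = toil * (1 + growth) - payoff * engineering
        toil = min(1.0, max(0.0, toil))           # keep the share within 0..1
        history.append(toil)
    return history


manageable = simulate_toil(0.30, periods=10)   # starts below the tipping point
bankrupt = simulate_toil(0.70, periods=10)     # starts above the tipping point
```

With these made-up parameters the tipping point sits at payoff / (growth + payoff) = 60% toil. A team starting below it works its toil down; a team starting above it never has enough engineering capacity to catch up, which is the "engineering bankruptcy" case on the slide.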
  • 100.
    Toil is a Naturally Occurring Force. General Evolution of Automation: 1. No automation; 2. Externally maintained system-specific automation; 3. Externally maintained generic automation; 4. Internally maintained system-specific automation; 5. Systems that don’t need any automation. (Niall Murphy, Microsoft Azure)
  • 101.
    Toil is a Naturally Occurring Force. General Evolution of Automation: 1. No automation; 2. Externally maintained system-specific automation; 3. Externally maintained generic automation; 4. Internally maintained system-specific automation; 5. Systems that don’t need any automation. (Niall Murphy, Microsoft Azure) Timeline: Launch (ToDos & Unknowns) to Mature.
  • 102.
    Toil is a Naturally Occurring Force. General Evolution of Automation: 1. No automation; 2. Externally maintained system-specific automation; 3. Externally maintained generic automation; 4. Internally maintained system-specific automation; 5. Systems that don’t need any automation. (Niall Murphy, Microsoft Azure) Timeline: Launch (ToDos & Unknowns) to Mature, with toil marked along the way.
  • 103.
    Principles of SRE are what set SRE apart. 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 104.
    Principles of SRE are what set SRE apart. 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Stephen Thorne, Google, at DevOps Enterprise Summit London 2018: “Principles of SRE” https://youtu.be/c-w_GYvi0eA
  • 105.
    SRE teams have the ability to regulate their workload
  • 106.
    SRE teams have the ability to regulate their workload. SRE can say no.
  • 107.
    SRE teams have the ability to regulate their workload. SRE can say no. Example:
  • 108.
    SRE teams have the ability to regulate their workload. SRE can say no. Example: What if handing off responsibility to SRE/Ops wasn’t a right?
  • 109.
    SRE teams have the ability to regulate their workload. SRE can say no. Example: What if handing off responsibility to SRE/Ops wasn’t a right? (Separate “running in production” from “run by SRE/Ops”.)
  • 110.
    Principles of SRE are what set SRE apart. 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload.
  • 111.
    What's the Difference Between DevOps and SRE? (class SRE implements DevOps) @sethvargo @lizthegrey
  • 112.
    Where to start (the practical approach)
  • 113.
    Where to start (the practical approach). 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload.
  • 114.
    Where to start (the practical approach). 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Company-wide culture change (hard!)
  • 115.
    Where to start (the practical approach). 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Company-wide culture change (hard!) Company-wide culture change (hard!)
  • 116.
    Where to start (the practical approach). 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Company-wide culture change (hard!) Company-wide culture change (hard!) Reduce toil. Everybody wins!
  • 118.
    Why focus on reducing toil?
  • 119.
    Why focus on reducing toil? 1. Lots of value independent of “SRE”
  • 120.
    Why focus on reducing toil? 1. Lots of value independent of “SRE” 2. Your people are your most expensive assets … stay out of their way!
  • 121.
  • 122.
    Start reducing toil today. 1. Track toil levels for each team.
  • 124.
    Track toil levels for each team
  • 125.
    Track toil levels for each team • Standardize (e.g. meetings and email are “overhead”, not “toil”)
  • 126.
    Track toil levels for each team • Standardize (e.g. meetings and email are “overhead”, not “toil”) • Track: self-reporting; periodic surveys; SM or PM interview/sampling
  • 127.
    Track toil levels for each team • Standardize (e.g. meetings and email are “overhead”, not “toil”) • Track: self-reporting; periodic surveys; SM or PM interview/sampling • Don’t get lost in the time-tracking weeds!
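A lightweight way to act on these bullets is to agree on a small category vocabulary and total up coarse self-reported hours per team. The sketch below is one hypothetical way to do that in Python; the category names, the tuple layout, and the sample figures are assumptions. The only rule taken from the slide is that meetings and email count as "overhead", not "toil".

```python
from collections import defaultdict

# Assumed category vocabulary; only the overhead-vs-toil distinction
# comes from the slide.
CATEGORIES = ("toil", "engineering", "overhead")


def toil_share_by_team(entries):
    """Return {team: fraction of reported hours spent on toil}.

    entries: iterable of (team, category, hours) tuples gathered from
    self-reports, periodic surveys, or interviews.
    """
    hours = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0.0))
    for team, category, spent in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        hours[team][category] += spent

    shares = {}
    for team, by_category in hours.items():
        total = sum(by_category.values())
        shares[team] = by_category["toil"] / total if total else 0.0
    return shares


# Hypothetical weekly self-reports, deliberately rounded to whole hours.
reports = [
    ("payments", "toil", 12),
    ("payments", "engineering", 20),
    ("payments", "overhead", 8),    # meetings and email
    ("search", "toil", 26),
    ("search", "engineering", 10),
    ("search", "overhead", 4),
]
shares = toil_share_by_team(reports)
```

Rejecting unknown categories keeps the vocabulary standardized, and whole-hour estimates are intentional: the goal is a trend per team, not precise time tracking. Measuring toil against all reported hours (rather than excluding overhead) is a design choice of this sketch.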
  • 128.
    Start reducing toil today. 1. Track toil levels for each team.
  • 129.
    Start reducing toil today. 1. Track toil levels for each team. 2. Set toil limit for each team (50% is conventional wisdom).
  • 130.
    Start reducing toil today. 1. Track toil levels for each team. 2. Set toil limit for each team (50% is conventional wisdom). 3. Fund efforts to reduce toil (with emphasis on teams already over limit).
  • 131.
    Start reducing toil today. 1. Track toil levels for each team. 2. Set toil limit for each team (50% is conventional wisdom). 3. Fund efforts to reduce toil (with emphasis on teams already over limit). Example process: “Code Yellow”, Michael Kehoe and Todd Palino (LinkedIn), at SREcon Americas 2019.
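Steps 2 and 3 amount to comparing each team's measured toil against its limit and funding the worst cases first. A minimal Python sketch follows, assuming toil is already expressed as a fraction of each team's time; the function name and the sample teams are hypothetical, and the 50% default is the conventional figure quoted on the slide.

```python
TOIL_LIMIT = 0.50  # the "conventional wisdom" figure from the slide


def teams_over_limit(toil_by_team, limit=TOIL_LIMIT):
    """Return (team, excess) pairs for teams above the limit, worst first."""
    over = [
        (team, share - limit)
        for team, share in toil_by_team.items()
        if share > limit
    ]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical measured toil shares (fraction of each team's time).
measured = {"database": 0.70, "web": 0.40, "network": 0.55}
funding_queue = teams_over_limit(measured)
```

Here the database team comes first in the funding queue, followed by network, and web stays off it. A per-team limit could be passed through the `limit` argument instead of the shared default.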
  • 132.
  • 133.
  • 134.
    Where to focus? Toil: Reduce Technical Debt; Re-Engineer Processes
  • 135.
    Where to focus? Toil: Reduce Technical Debt; Re-Engineer Processes; Enable Self-Service
  • 138.
  • 139.
  • 140.
  • 141.
    Empower teams to spot and fix the anti-patterns.
  • 142.
    “Fix this for me, fix it again, then fix it again.” Before: each “I need you to do X” goes through Ticket, Wait, Interrupt, “Do X”, “Done.”, and then repeats later, cutting into your other work each time. After: each request for X is handled through Self-Service, and your other work continues (x2, x3).
  • 144.
    “I could fix it, but I can’t get to it.” Before: “I could fix it if I could get to it” (Wait, Interrupt; no access to the environment).
  • 145.
    “I could fix it, but I can’t get to it.” Before: “I could fix it if I could get to it” (Wait, Interrupt; no access to the environment). After: “I’ve got this!” (Self-Service access to the environment).
  • 146.
    “The dog-pile.” Before: “I think it’s a problem with db07-store2.uswest.acme”, then several people each run “$ top” on db07-store2.uswest.acme. After (Self-Service): “healthcheck store2 -all”.
  • 147.
    “I’m an expert, I don’t read the wiki.” Before: the docs say “Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up!” Later… “I’ve done this before. I’ve got this…” and runs “restart -doit -now” against the environment.
  • 149.
    “I’m an expert, I don’t read the wiki.” Before: the docs say “Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up!” Later… “I’ve done this before. I’ve got this…” and runs “restart -doit -now” against the environment. After: the same warnings lead to an updated Self-Service Restart Job ✅. Later… “I’ve done this before. I’ve got this.” and runs “restart” through Self-Service.
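The "After" picture works because the procedure lives in the job rather than in a wiki page. As a rough sketch of that idea in Python: the three callables and the flag value below are hypothetical stand-ins for whatever tooling actually pauses monitoring and restarts the service, and nothing here is taken from a real CLI.

```python
from typing import Callable, List


def make_restart_job(
    pause_monitoring: Callable[[], None],
    restart_service: Callable[[str], None],
    resume_monitoring: Callable[[], None],
    required_flag: str,
) -> Callable[[], List[str]]:
    """Build a self-service restart job that always follows the current procedure."""

    def run() -> List[str]:
        log = []
        pause_monitoring()                  # "Pause monitoring first or we all get woken up!"
        log.append("monitoring paused")
        try:
            restart_service(required_flag)  # "Use this flag or bad things will happen!"
            log.append(f"restarted with {required_flag}")
        finally:
            resume_monitoring()             # never leave monitoring paused
            log.append("monitoring resumed")
        return log

    return run


# Stand-in implementations that only record what was called.
calls = []
job = make_restart_job(
    pause_monitoring=lambda: calls.append("pause"),
    restart_service=lambda flag: calls.append(("restart", flag)),
    resume_monitoring=lambda: calls.append("resume"),
    required_flag="--example-flag",         # placeholder, not a real option
)
job_log = job()
```

When the service changes, the team that owns it updates the job once, and the expert who "has done this before" picks up the new steps without reading anything.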
  • 150.
    “Known issue… doesn’t get permanent fix”
  • 152.
    Recap: Make Tomorrow Better Than Today. Beware: impact of traditional management structures. Be practical and start focusing on toil. Find and fix toil anti-patterns. Empower with Self-Service Runbooks. SRE is a new way to think about Ops work: 1. SRE needs Service Level Objectives, with consequences. 2. SREs have time to make tomorrow better than today. 3. SRE teams have the ability to regulate their workload. Use DevOps and SRE to improve speed and quality.
  • 153.