Ops Meta-Metrics
The Currency You Use to Pay For Change




John Allspaw
VP Operations
  Etsy.com
                                   http://www.flickr.com/photos/wwarby/3296379139
Warning

Graphs and numbers in this presentation are sort of made up
/usr/nagios/libexec/check_ops.pl
How R U Doing?
            http://www.flickr.com/photos/a4gpa/190120662/
We track bugs already...




       Example: https://issues.apache.org/jira/browse/TS
We should track these, too...

Changes (Who/What/When/Type)
Incidents (Type/Severity)
Response to Incidents (TTR/TTD)
trepidation
noun
1 a feeling of fear or agitation about something that may happen: the men set off in fear and trepidation.
2 archaic trembling motion.
DERIVATIVES
trepidatious adjective
ORIGIN late 15th cent.: from Latin trepidatio(n-), from trepidare ‘be agitated, tremble,’ from trepidus ‘alarmed.’
Change

Required.
Often feared.
Why?



                http://www.flickr.com/photos/20408885@N03/3570184759/
This is why
la de da, everything’s fine → change happens → OMGWTF OUTAGES!!!1!!
Change PTSD?




         http://www.flickr.com/photos/tzofia/270800047/
Brace For Impact?
But wait.... (OMGWTF)

la de da, everything’s fine → change happens

How much change is this?
What kind of change?
How often does this happen?
Need to raise confidence that


change != outage
...incidents can be handled well




                 http://www.flickr.com/photos/axiepics/3181170364/
...root causes can be fixed quick enough




                   http://www.flickr.com/photos/ljv/213624799/
...change can be safe enough




     http://www.flickr.com/photos/marksetchell/43252686/
But how?
How do we have confidence in anything
in our infrastructure?



          We measure it.
          And graph it.
          And alert on it.
Tracking Change
1. Type
2. Frequency/Size
3. Results of those changes
Types of Change

Layers         | Examples
App code       | PHP/Rails/etc or ‘front-end’ code
Services code  | Apache, MySQL, DB schema, PHP/Ruby versions, etc.
Infrastructure | OS/Servers, Switches, Routers, Datacenters, etc.

(you decide what these are for your architecture)
Code Deploys: Who/What/When

WHEN · WHO (guy who pushed the button) · WHAT (link to diff)
(http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
Code Deploys: Who/What/When

                      Last 2 prod deploys
Last 2 Chef changes
other changes




(insert whatever ticketing/tracking you have)
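Whatever the storage, the record itself is tiny. A minimal sketch of a who/what/when deploy log (field names and the log-file path are my assumptions, not Etsy's actual format):

```python
import json
import time

def record_deploy(who, what, diff_url, change_type, logfile="deploys.log"):
    """Append a who/what/when record for one production change."""
    entry = {
        "when": time.time(),   # timestamp of the deploy
        "who": who,            # person who pushed the button
        "what": what,          # short description of the change
        "diff": diff_url,      # link to the diff
        "type": change_type,   # e.g. "app code", "config", "schema"
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One JSON line per change is enough to answer "what changed right before the graphs went sideways?"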
Frequency
Size
Tracking Incidents
        http://www.flickr.com/photos/47684393@N00/4543311558/
Incident Frequency
Incident Size
[graph: Big Outage, TTR still going]
Tracking Incidents

1. Frequency
2. Severity
3. Root Cause
4. Time-To-Detect (TTD)
5. Time-To-Resolve (TTR)
The How Doesn’t Matter




          http://www.flickr.com/photos/matsuyuki/2328829160/
Incident/Degradation Tracking

Date   | Start Time | Detect Time | Resolve Time | Severity | Root Cause | PostMortem Done?
1/2/08 | 12:30 ET   | 12:32 ET    | 12:45 ET     | Sev1     | DB Change  | Yes
3/7/08 | 18:32 ET   | 18:40 ET    | 18:47 ET     | Sev2     | Capacity   | Yes
5/3/08 | 17:55 ET   | 17:55 ET    | 18:14 ET     | Sev3     | Hardware   | Yes
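From the three timestamps per row, TTD and TTR fall out directly. A sketch using the rows above (I take TTD = detect minus start and TTR = resolve minus start; pick whichever convention you like, but keep it consistent):

```python
from datetime import datetime

def minutes_between(a, b, fmt="%m/%d/%y %H:%M"):
    """Elapsed minutes between two 'M/D/YY HH:MM' timestamps."""
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

# date, start, detect, resolve, severity, root cause (rows from the table)
incidents = [
    ("1/2/08", "12:30", "12:32", "12:45", "Sev1", "DB Change"),
    ("3/7/08", "18:32", "18:40", "18:47", "Sev2", "Capacity"),
    ("5/3/08", "17:55", "17:55", "18:14", "Sev3", "Hardware"),
]

for date, start, detect, resolve, sev, cause in incidents:
    ttd = minutes_between(f"{date} {start}", f"{date} {detect}")   # Time-To-Detect
    ttr = minutes_between(f"{date} {start}", f"{date} {resolve}")  # Time-To-Resolve
    print(f"{date} {sev} ({cause}): TTD {ttd:.0f}m, TTR {ttr:.0f}m")
```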
These will give you context for your rates of change.

(You’ll need them for postmortems, anyway.)
Change:Incident Ratio

Important.
Not because all changes are equal.
Not because all incidents are equal, or change-related.

But because humans will irrationally make a permanent connection between the two.
               http://www.flickr.com/photos/michelepedrolli/449572596/
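The ratio itself is trivial to compute; the point is having the number on hand when someone's memory of one bad outage is doing the math instead. A sketch (the example figures are illustrative):

```python
def change_incident_ratio(changes, change_related_incidents):
    """Change-related incidents per change over some period: the raw
    material for the impression people form, fair or not."""
    if changes == 0:
        return 0.0
    return change_related_incidents / changes

# e.g. 420 deploys and 5 change-related incidents in a quarter:
# change_incident_ratio(420, 5) is roughly 0.012, i.e. ~1.2% of deploys
```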
Severity

Not all incidents are created equal.
Something like:

SEV1 Full outage, or effectively unusable.
SEV2 Significant degradation for subset of users.
SEV3 Minor impact on user experience.
SEV4 No impact, but time-sensitive failure.
Root Cause?
(Not all incidents are change related)

Something like:

1. Hardware Failure
2. Datacenter Issue
3. Change: Code Issue
4. Change: Config Issue
5. Capacity/Traffic Issue
6. Other

Note: this can be difficult to categorize.
http://en.wikipedia.org/wiki/Root_cause_analysis
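Once each incident carries one of these categories, "what percentage of incidents are change-related?" is one tally away. A sketch (category strings from the list above; which categories count as change-related, and the function name, are my choices):

```python
from collections import Counter

# the two "Change:" buckets from the category list above
CHANGE_CATEGORIES = {"Change: Code Issue", "Change: Config Issue"}

def change_related_share(root_causes):
    """Fraction of incidents whose root cause is a change category."""
    counts = Counter(root_causes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[c] for c in CHANGE_CATEGORIES) / total

causes = ["Change: Code Issue", "Hardware Failure", "Change: Config Issue",
          "Capacity/Traffic Issue", "Change: Code Issue", "Other"]
print(f"{change_related_share(causes):.0%} change-related")  # 3 of 6 incidents
```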
Recording Your Response




                (worth the hassle)


              http://www.flickr.com/photos/mattblaze/2695044170/
Time

la de da, everything’s fine → change happens → Noticed there was a problem → Figured out what the cause is → Fixed the problem → la de da, everything’s fine

Fixed the problem =
• rolled back
• rolled forward
• temporary solution
• etc

While figuring out what the cause is:
• Coordinate troubleshooting/diagnosis
• Communicate to support/community/execs

While fixing the problem:
• Coordinate responses*
• Communicate to support/community/execs
(* usually, “One Thing At A Time” responses)

After the fix:
• Confirm stability, resolving steps
• Communicate to support/community/execs
Communications
http://etsystatus.com




twitter.com/etsystatus
The same timeline ends with a PostMortem.

Time To Detect (TTD): from “change happens” until the problem is noticed.
Time To Resolve (TTR): from “change happens” until everything’s fine again.
Hypothetical Example:
 “We’re So Nimble!”
Nimble, But Stumbling? Is There Any Pattern?

Maybe this is too much suck?
Maybe you’re changing too much at once?
Happening too often?
What percentage of incidents are related to change?




                            http://www.flickr.com/photos/78364563@N00/2467989781/
What percentage of change-related incidents are “off-hours”?
Do they have higher or lower TTR?

http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
What types of change have the worst success rates?
Which ones have the best success rates?

http://www.flickr.com/photos/lwr/2257949828/
Does your TTD/TTR increase depending on the:

- SIZE?
- FREQUENCY?

http://www.flickr.com/photos/45409431@N00/2521827947/
Side effect is that you’re also tracking successful changes to production as well

http://www.flickr.com/photos/wwworks/2313927146
Q2 2010

Type           | Successes | Failures | Success Rate | Incident Minutes (Sev1/2)
App code       | 420       | 5        | 98.81        | 8
Config         | 404       | 3        | 99.26        | 5
DB Schema      | 15        | 1        | 93.33        | 10
DNS            | 45        | 0        | 100          | 0
Network (misc) | 5         | 0        | 100          | 0
Network (core) | 1         | 0        | 100          | 0
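The table's Success Rate column appears to be computed as 100 × (1 − failures/successes): that reproduces 98.81 for 420/5 and 93.33 for 15/1, where successes/(successes+failures) would give 98.82 and 93.75. A sketch of that calculation (the formula is my inference from the figures):

```python
def success_rate(successes, failures):
    """Percentage rate matching the table: 100 * (1 - failures/successes)."""
    if successes == 0:
        return 0.0
    return 100.0 * (1 - failures / successes)

# a few rows from the Q2 2010 table
rows = {
    "App code":  (420, 5),
    "Config":    (404, 3),
    "DB Schema": (15, 1),
    "DNS":       (45, 0),
}
for change_type, (s, f) in rows.items():
    print(f"{change_type}: {success_rate(s, f):.2f}%")
```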
Some Observations

Incident Observations
[graph: Morale vs. Length of Incident/Outage]

Incident Observations
[graph: Mistakes vs. Length of Incident/Outage]
Change Observations
[graph: Change Size vs. Change Frequency]

Huge changesets deployed rarely (high TTR)
Tiny changesets deployed often (low TTR)
Specifically....

la de da, everything’s fine → change happens

What if this was only 5 lines of code that were changed?
Does that feel safer?
(it should)
Pay attention to this stuff
                http://www.flickr.com/photos/plasticbag/2461247090/
We’re Hiring Ops!
SF & NYC
In May:

-   $22.9M of goods were sold by the community
-   1,895,943 new items listed
-   239,340 members joined
The End
Bonus Time!!1!
Continuous Deployment

Described in 6 graphs
(Originally Cal Henderson’s idea)

Editor's Notes

  • #2 This is about metrics about YOU! Metrics *about* the metrics-makers!
  • #3 They are basically taken from both Flickr and Etsy.
  • #4 HOW MANY: write public-facing app code? maintain the release tools? release process? respond to incidents? have had an outage or notable degradation this month? that was change-related?
  • #5 Too fast? Too loose? Too many issues? Too many upset and stressed out humans?
  • #6 Everyone is used to bug tracking, it’s something worthwhile....
  • #10 If this is a feeling you have often, please read on.
  • #12 All you need is to see this happen once, and it’s hard to get out of your memory. No wonder why some people can start to think “code deploy = outage”.
  • #13 Mild version of “Critical Incident Stress Management”? Change = risk, and sometimes risk = outage. And outages are stressful.
  • #14 Not supposed to feel like this.
  • #16 Details about the change play a huge role in your ability to respond to change-related incidents.
  • #17 Details about the change play a huge role in your ability to respond to change-related incidents.
  • #18 Details about the change play a huge role in your ability to respond to change-related incidents.
  • #20 We do this by tracking our responses to outages and incidents.
  • #21 We can do this by tracking our change, and learning from the results.
  • #22 We need to raise confidence that we’re moving as fast as we can while still being safe enough to do so. And we can adjust the change to meet our requirements...
  • #23 Why should change and results of changes be any different?
  • #24 Type = code, schema, infrastructure, etc. Frequency/Size = how often each type is changed, implies risk Results = how often each change results in an incident/degradation
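The note above lists the fields worth capturing for every change: type, frequency/size, and results. A minimal sketch of such a change record (field names here are illustrative assumptions, not from the talk):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Change:
    """One production change: who/what/when/type, plus size and outcome."""
    who: str                        # person or team making the change
    what: str                       # short description of the change
    when: datetime                  # when the change went out
    change_type: str                # e.g. "code", "schema", "infrastructure"
    size: int                       # e.g. lines of code changed
    caused_incident: bool = False   # did it result in an incident/degradation?

# Example record: a 5-line code deploy
c = Change(who="alice", what="deploy search fix",
           when=datetime(2010, 6, 1, 14, 3),
           change_type="code", size=5)
```

Even a spreadsheet with these columns is enough to start; the point is that every type of change (code, schema, network) gets a row.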
  • #25 Lots of different types here. Might be different for everyone. Not all types of change bring the same amount of risk.
  • #26 This info should be considered mandatory. This should also be done for db schema changes, network changes, changes in any part of the stack, really.
  • #27 The header of our metrics tools has these statistics, too.
  • #28 The tricky part: getting all prod changes written down without too much hassle.
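One way to keep the hassle low is to have the deploy tooling write the change down itself, instead of relying on humans. A sketch of a helper a deploy script could call (the log path and field layout are assumptions):

```python
import getpass
import json
import time

def log_change(what, change_type, size, logfile="changes.log"):
    """Append one production change to a shared append-only log."""
    entry = {
        "who": getpass.getuser(),   # who made the change
        "what": what,               # what changed
        "when": time.time(),        # when it happened
        "type": change_type,        # code / schema / infrastructure / ...
        "size": size,               # e.g. lines changed
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Called automatically at the end of a deploy, so nobody has to remember to do it
entry = log_change("deploy web r1234", "code", 12)
```

Because the deploy can't finish without logging, "getting all prod changes written down" stops being a discipline problem.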
  • #29 Here’s one type of change....
  • #30 Here’s another type of change....
  • #31 Here’s yet another type of change...
  • #32 Size does turn out to be important. Size = lines of code, level of SPOF risk, etc.
  • #33 This seems like something you should do. Also: “incidents” = outages or degradations.
  • #34 Just an example. This looks like it’s going well! Getting better!
  • #35 Maybe I can’t say that it’s getting better, actually....
  • #37 Some folks have Techcrunch as their incident log keeper. You could just use a spreadsheet.
  • #38 An example!
  • #39 You *are* doing postmortems on incidents that happen, right? Doing them comes at a certain point in your evolution.
  • #43 Without the statistics, even a rare but severe outage can make the impression that change == outage.
  • #44 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #45 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #46 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #47 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #48 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #49 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #50 Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • #51 Just examples. This normally comes from a postmortem meeting. A good pointer on Root Cause Analysis is Eric Ries’ material on Five Whys, and the wikipedia page for RCA.
  • #52 http://www.flickr.com/photos/mattblaze/2695044170/
  • #53 What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • #54 What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • #55 What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • #56 What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • #57 What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • #58 Th
  • #59 Th
  • #60 This might also be known as a ‘diagnose’ point.
  • #61 This might also be known as a ‘diagnose’ point.
  • #62 These events usually spawn other events.
  • #63 These events usually spawn other events.
  • #64 This should be standard operating procedure at this point.
  • #65 These events usually spawn other events.
  • #66 Some folks might notice a “Time To Diagnose” missing here. ALSO: it’s usually more complex than this, but this is the gist of it.
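The timeline above boils down to a few timestamps per incident. A sketch of computing TTD and TTR from them (here TTR is measured from occurrence; measuring from detection is equally defensible, and the simplification the note mentions applies):

```python
from datetime import datetime

def response_metrics(occurred, detected, resolved):
    """Time To Detect and Time To Resolve, in minutes, from three timestamps."""
    ttd = (detected - occurred).total_seconds() / 60.0
    ttr = (resolved - occurred).total_seconds() / 60.0
    return {"TTD_min": ttd, "TTR_min": ttr}

# Made-up incident: occurred 14:00, detected 14:06, resolved 14:45
m = response_metrics(
    occurred=datetime(2010, 6, 1, 14, 0),
    detected=datetime(2010, 6, 1, 14, 6),
    resolved=datetime(2010, 6, 1, 14, 45),
)
# TTD = 6 minutes, TTR = 45 minutes
```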
  • #69 Do incidents increase with size of change? With frequency? With frequency/size of different types?
  • #72 If you don’t track: Change, Incidents, and Responses, you’ll never have answers for these questions.
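Once change and incident records exist, the question "do incidents increase with size of change?" becomes a simple aggregation. A sketch over made-up data (in keeping with the talk's warning slide), with an arbitrary small/large threshold:

```python
def incident_rate(changes, small_threshold=100):
    """Compare incident rates for small vs. large changesets (by lines changed)."""
    small = [c for c in changes if c["size"] <= small_threshold]
    large = [c for c in changes if c["size"] > small_threshold]

    def rate(cs):
        return sum(c["caused_incident"] for c in cs) / len(cs) if cs else 0.0

    return {"small": rate(small), "large": rate(large)}

# Made-up change log: size in lines, and whether the change caused an incident
changes = [
    {"size": 5,    "caused_incident": False},
    {"size": 8,    "caused_incident": False},
    {"size": 12,   "caused_incident": True},
    {"size": 40,   "caused_incident": False},
    {"size": 900,  "caused_incident": True},
    {"size": 1500, "caused_incident": True},
]

rates = incident_rate(changes)
# small changesets: 1 incident in 4 changes; large: 2 in 2
```

The same grouping by change type or frequency answers the other questions in the note; none of it is possible without the underlying records.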
  • #75 Reasonable questions.
  • #76 *YOU* get to decide what is “small” and “frequent”.
  • #79 THIS is what can help give you confidence. Or not.
  • #81 The longer an outage lasts, the bigger of a bummer it is for all those who are working on fixing it.
  • #82 The longer an outage lasts, the more mistakes people make. (and, as the night gets longer) Red herrings...
  • #83  put two points on this graph
  • #84  put two points on this graph
  • #85  put two points on this graph
  • #86  put two points on this graph
  • #87 It should, because it is.
  • #88 How we feel about change and how it can (or not) cause outages is important. Some of the nastiest relationships emerge between dev and ops because of these things.
  • #93 “Normal” = lots of change done at regular intervals, change = big, time = long.
  • #94 2 weeks? 5000 lines?
  • #95 Scary Monster of Change! Each incident-causing deploy has only one recourse: roll it all back. Even code that was ok and unrelated to the incident. Boo!
  • #96 Silly Monster of Nothing to Be Afraid Of Because His Teeth Are Small.
  • #97 Problem? Roll that little piece back. Or better yet, roll it forward!
  • #98 This looks like an adorable monster. Like a Maurice Sendak monster.