-
1.
Ops Meta-Metrics
The Currency You Use to Pay For Change
John Allspaw
VP Operations
Etsy.com
http://www.flickr.com/photos/wwarby/3296379139
-
2.
Warning
Graphs and numbers in this
presentation
are sort of made up
-
3.
/usr/nagios/libexec/check_ops.pl
-
4.
How R U Doing?
http://www.flickr.com/photos/a4gpa/190120662/
-
5.
We track bugs already...
Example: https://issues.apache.org/jira/browse/TS
-
6.
We should track
these, too...
-
7.
We should track
these, too...
Changes (Who/What/When/Type)
-
8.
We should track
these, too...
Changes (Who/What/When/Type)
Incidents (Type/Severity)
-
9.
We should track
these, too...
Changes (Who/What/When/Type)
Incidents (Type/Severity)
Response to Incidents (TTR/TTD)
-
10.
trepidation
noun
1 a feeling of fear or agitation about something that may happen: the men set off in fear and trepidation.
2 archaic trembling motion.
DERIVATIVES
trepidatious adjective
ORIGIN late 15th cent.: from Latin trepidatio(n-), from
trepidare ‘be agitated, tremble,’ from trepidus ‘alarmed.’
-
11.
Change
Required.
Often feared.
Why?
http://www.flickr.com/photos/20408885@N03/3570184759/
-
12.
This is why
OMGWTF OUTAGES!!!1!!
la de da,
everything’s fine
change
happens
-
13.
Change
PTSD?
http://www.flickr.com/photos/tzofia/270800047/
-
14.
Brace For Impact?
-
15.
Brace For Impact?
-
16.
But wait....
(OMGWTF)
la de da,
everything’s fine
change
happens
-
17.
But wait....
(OMGWTF)
la de da, everything’s fine
change happens
How much change is this?
-
18.
But wait....
(OMGWTF)
la de da, everything’s fine
change happens
How much change is this?
What kind of change?
-
19.
But wait....
(OMGWTF)
la de da, everything’s fine
change happens
How much change is this?
What kind of change?
How often does this happen?
-
20.
Need to raise confidence that
change != outage
-
21.
...incidents can be
handled well
http://www.flickr.com/photos/axiepics/3181170364/
-
22.
...root causes can be fixed
quick enough
http://www.flickr.com/photos/ljv/213624799/
-
23.
...change can be
safe enough
http://www.flickr.com/photos/marksetchell/43252686/
-
24.
But how?
How do we have confidence in anything
in our infrastructure?
We measure it.
And graph it.
And alert on it.
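check_ops.pl a few slides back was a joke, but the point stands: these meta-metrics can be measured, graphed, and alerted on like any other metric. Here is a minimal sketch of a Nagios-style check in Python; the thresholds and the data source are made up, so wire it to whatever incident log you actually keep.

#!/usr/bin/env python
# Hypothetical sketch: alert when too many severe incidents pile up in a week.
# get_severe_incidents_this_week() is a stub; replace it with a query against
# your own incident tracker.
import sys

WARN, CRIT = 1, 3   # made-up thresholds: Sev1/Sev2 incidents per week

def get_severe_incidents_this_week():
    return 0   # stub

count = get_severe_incidents_this_week()
if count >= CRIT:
    print("CRITICAL - %d Sev1/Sev2 incidents this week" % count)
    sys.exit(2)          # standard Nagios exit code for CRITICAL
elif count >= WARN:
    print("WARNING - %d Sev1/Sev2 incidents this week" % count)
    sys.exit(1)          # WARNING
print("OK - %d Sev1/Sev2 incidents this week" % count)
sys.exit(0)              # OK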
-
25.
Tracking Change
1. Type
2. Frequency/Size
3. Results of those changes
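One record per change is enough to get started. A sketch of what that record could look like; the field names here are illustrative, not a prescribed schema.

# Illustrative change-event record; field names are assumptions.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Change:
    when: datetime   # when it hit production
    who: str         # the person who pushed the button
    what: str        # link to the diff or ticket
    layer: str       # type: "app code", "services", "infrastructure", ...
    size: int        # rough size, e.g. lines changed

change = Change(datetime(2010, 6, 1, 14, 2), "someone",
                "http://example.com/diff/123", "app code", 5)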
-
26.
Types of Change
Layers and examples:
App code: PHP/Rails/etc or ‘front-end’ code
Services code: Apache, MySQL, DB schema, PHP/Ruby versions, etc.
Infrastructure: OS/Servers, Switches, Routers, Datacenters, etc.
(you decide what these are for your architecture)
-
27.
Code Deploys:
Who/What/When
WHEN WHO WHAT
(guy who pushed the button) (link to diff)
(http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
-
28.
Code Deploys:
Who/What/When
Last 2 prod deploys
Last 2 Chef changes
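Those deploy markers are cheap to produce: emit a data point whenever a deploy finishes, and your dashboards can draw deploy lines next to everything else. A sketch using Graphite’s plaintext protocol; the hostname and metric names are assumptions, and any metrics store you already graph from would do just as well.

# Sketch: drop a "deploy happened" data point into Graphite so dashboards
# can draw deploy lines. Host and metric names are assumptions; 2003 is
# Graphite's default plaintext listener port.
import socket
import time

def mark_deploy(kind="prod", host="graphite.example.com", port=2003):
    line = "deploys.%s 1 %d\n" % (kind, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. as the last step of the deploy script:
# mark_deploy("prod")   or   mark_deploy("chef")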
-
29.
other changes
(insert whatever ticketing/tracking you have)
-
30.
Frequency
-
31.
Frequency
-
32.
Frequency
-
33.
Size
-
34.
Tracking Incidents
http://www.flickr.com/photos/47684393@N00/4543311558/
-
35.
Incident Frequency
-
36.
Incident Size
Big Outage
TTR still going
-
37.
Tracking Incidents
1. Frequency
2. Severity
3. Root Cause
4. Time-To-Detect (TTD)
5. Time-To-Resolve (TTR)
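A sketch of a record carrying those five things. The field names, and the choice to measure both TTD and TTR from the start of the incident, are illustrative rather than a standard.

# Illustrative incident record; names and definitions are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime
    severity: int        # 1-4, per the severity scale later in the deck
    root_cause: str      # "change: code", "hardware", "capacity", ...

    @property
    def ttd(self) -> timedelta:          # Time-To-Detect
        return self.detected - self.started

    @property
    def ttr(self) -> timedelta:          # Time-To-Resolve
        return self.resolved - self.started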
-
38.
The How
Doesn’t
Matter
http://www.flickr.com/photos/matsuyuki/2328829160/
-
39.
Incident/Degradation
Tracking
Date     Start Time  Detect Time  Resolve Time  Severity  Root Cause  PostMortem Done?
1/2/08   12:30 ET    12:32 ET     12:45 ET      Sev1      DB Change   Yes
3/7/08   18:32 ET    18:40 ET     18:47 ET      Sev2      Capacity    Yes
5/3/08   17:55 ET    17:55 ET     18:14 ET      Sev3      Hardware    Yes
-
40.
Incident/Degradation
Tracking
Date     Start Time  Detect Time  Resolve Time  Severity  Root Cause  PostMortem Done?
These will give you context for your rates of change.
(You’ll need them for postmortems, anyway.)
-
41.
Change:Incident Ratio
-
42.
Change:Incident Ratio
Important.
-
43.
Change:Incident Ratio
Important.
Not because all changes are equal.
-
44.
Change:Incident Ratio
Important.
Not because all changes are equal.
Not because all incidents are equal, or
change-related.
-
45.
Change:Incident Ratio
But because humans will irrationally make a permanent connection between the two.
http://www.flickr.com/photos/michelepedrolli/449572596/
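All the more reason to have the actual numbers in front of people. A minimal sketch of the ratio computed week by week; the two input lists are stand-ins for whatever your deploy log and incident tracker export.

# Sketch: weekly change:incident ratio from two event logs (made-up data).
from collections import Counter
from datetime import date

deploys   = [date(2010, 6, 1), date(2010, 6, 1), date(2010, 6, 3)]
incidents = [date(2010, 6, 3)]

def by_week(days):
    # bucket by ISO (year, week)
    return Counter(d.isocalendar()[:2] for d in days)

changes_per_week   = by_week(deploys)
incidents_per_week = by_week(incidents)

for week in sorted(changes_per_week):
    c = changes_per_week[week]
    i = incidents_per_week.get(week, 0)
    print("%s-W%02d  %d changes : %d incidents" % (week[0], week[1], c, i))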
-
46.
Severity
-
47.
Severity
Not all incidents are created equal.
-
48.
Severity
Not all incidents are created equal.
Something like:
-
49.
Severity
Not all incidents are created equal.
Something like:
-
50.
Severity
Not all incidents are created equal.
Something like:
SEV1 Full outage, or effectively unusable.
-
51.
Severity
Not all incidents are created equal.
Something like:
SEV1 Full outage, or effectively unusable.
SEV2 Significant degradation for subset of users.
-
52.
Severity
Not all incidents are created equal.
Something like:
SEV1 Full outage, or effectively unusable.
SEV2 Significant degradation for subset of users.
SEV3 Minor impact on user experience.
-
53.
Severity
Not all incidents are created equal.
Something like:
SEV1 Full outage, or effectively unusable.
SEV2 Significant degradation for subset of users.
SEV3 Minor impact on user experience.
SEV4 No impact, but time-sensitive failure.
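One way to encode that scale so incident records can be counted and compared. The wording comes from the slide; the enum itself is just an illustration.

# Illustrative severity scale as an enum.
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1   # full outage, or effectively unusable
    SEV2 = 2   # significant degradation for a subset of users
    SEV3 = 3   # minor impact on user experience
    SEV4 = 4   # no impact, but time-sensitive failure (e.g. loss of redundancy)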
-
54.
Root Cause?
(Not all incidents are change related)
Something like:
Note: this can be difficult to categorize.
http://en.wikipedia.org/wiki/Root_cause_analysis
-
55.
Root Cause?
(Not all incidents are change related)
Something like:
1. Hardware Failure
2. Datacenter Issue
3. Change: Code Issue
4. Change: Config Issue
5. Capacity/Traffic Issue
6. Other
Note: this can be difficult to categorize.
http://en.wikipedia.org/wiki/Root_cause_analysis
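Once each incident carries one of these labels, the question asked a few slides from now, what percentage of incidents are related to change, becomes one line of arithmetic. A sketch with made-up categories and counts.

# Sketch: bucket incidents by root-cause category and compute the fraction
# that are change-related. Categories and data are illustrative.
from collections import Counter

CHANGE_RELATED = {"change: code", "change: config"}

root_causes = ["hardware", "change: code", "change: config",
               "capacity/traffic", "change: code"]      # made up

counts = Counter(root_causes)
change_related = sum(n for cause, n in counts.items() if cause in CHANGE_RELATED)
print("%.0f%% of incidents were change-related"
      % (100.0 * change_related / len(root_causes)))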
-
56.
Recording Your Response
(worth the hassle)
http://www.flickr.com/photos/mattblaze/2695044170/
-
57.
Time
-
58.
la de da,
everything’s fine
Time
-
59.
la de da,
everything’s fine
Time
change
happens
-
60.
Noticed there
was a problem
la de da,
everything’s fine
Time
change
happens
-
61.
Noticed there was a problem
Figured out what the cause is
la de da, everything’s fine
change happens
Time
-
62.
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
63.
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
64.
• Coordinate troubleshooting/diagnosis
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
65.
• Coordinate troubleshooting/diagnosis
• Communicate to support/community/execs
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
66.
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
67.
• Coordinate responses*
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
* usually, “One Thing At A Time” responses
-
68.
• Coordinate responses*
• Communicate to support/community/execs
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
* usually, “One Thing At A Time” responses
-
69.
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
70.
• Confirm stability, resolving steps
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
71.
• Confirm stability, resolving steps
• Communicate to support/community/execs
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
-
72.
Communications
http://etsystatus.com
twitter.com/etsystatus
-
73.
Noticed there was a problem
Figured out what the cause is
Fixed the problem
•rolled back
•rolled forward
•temporary solution
•etc
la de da, everything’s fine
change happens
Time
PostMortem
-
74.
Time To Detect (TTD)
Time To Resolve (TTR)
la de da, everything’s fine
change happens
Time
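A worked example using the first row of the earlier tracking table. Here both TTD and TTR are measured from the start of the incident; measuring TTR from detection instead is an equally reasonable choice.

# Worked example: 1/2/08, started 12:30, detected 12:32, resolved 12:45 ET.
from datetime import datetime

started  = datetime(2008, 1, 2, 12, 30)
detected = datetime(2008, 1, 2, 12, 32)
resolved = datetime(2008, 1, 2, 12, 45)

ttd = detected - started    # 0:02:00  Time-To-Detect
ttr = resolved - started    # 0:15:00  Time-To-Resolve
print("TTD=%s TTR=%s" % (ttd, ttr))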
-
75.
Hypothetical Example:
“We’re So Nimble!”
-
76.
Nimble, But Stumbling?
-
77.
Is There Any Pattern?
-
78.
Nimble, But Stumbling?
+
-
79.
Nimble, But Stumbling?
+
-
80.
Maybe you’re changing too much at once?
Happening too often?
Maybe this is too much suck?
-
81.
What percentage of incidents are related to
change?
http://www.flickr.com/photos/78364563@N00/2467989781/
-
82.
What percentage of change-
related incidents are “off-hours”?
http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
-
83.
What percentage of change-
related incidents are “off-hours”?
Do they have higher or
lower TTR?
http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
-
84.
What types of change have the worst success
rates?
http://www.flickr.com/photos/lwr/2257949828/
-
85.
What types of change have the worst success
rates?
Which ones have the best
success rates?
http://www.flickr.com/photos/lwr/2257949828/
-
86.
Does your TTD/TTR increase
depending on the:
- SIZE?
- FREQUENCY?
http://www.flickr.com/photos/45409431@N00/2521827947/
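One hedged way to start answering that: bucket change-related incidents by the size of the change behind them and compare TTR. The data, bucket edges, and field meanings below are all made up.

# Sketch: does TTR grow with change size? Group incidents by the size of the
# deploy that caused them and compare median TTR. Data is made up.
from statistics import median

# (lines_changed_in_the_deploy, ttr_minutes) for change-related incidents
samples = [(5, 4), (12, 9), (300, 42), (850, 95), (40, 15)]

def bucket(lines):
    return "small (<50)" if lines < 50 else "large (>=50)"

by_bucket = {}
for lines, ttr in samples:
    by_bucket.setdefault(bucket(lines), []).append(ttr)

for name, ttrs in sorted(by_bucket.items()):
    print("%-13s median TTR = %.0f min over %d incidents"
          % (name, median(ttrs), len(ttrs)))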
-
87.
A side effect is that you’re also tracking successful changes to production.
http://www.flickr.com/photos/wwworks/2313927146
-
88.
Q2 2010
Type            Successes  Failures  Success Rate  Incident Minutes (Sev1/2)
App code        420        5         98.81         8
Config          404        3         99.26         5
DB Schema       15         1         93.33         10
DNS             45         0         100           0
Network (misc)  5          0         100           0
Network (core)  1          0         100           0
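The Success Rate column’s arithmetic works out if the first number is read as total attempts per change type, i.e. rate = (attempts - failures) / attempts. A quick sketch recomputing it from the slide’s (made-up) numbers.

# Recompute the Success Rate column, reading the first number as attempts.
rows = {
    "App code":       (420, 5),
    "Config":         (404, 3),
    "DB Schema":      (15, 1),
    "DNS":            (45, 0),
    "Network (misc)": (5, 0),
    "Network (core)": (1, 0),
}
for change_type, (attempts, failures) in rows.items():
    rate = 100.0 * (attempts - failures) / attempts
    print("%-14s %6.2f%%" % (change_type, rate))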
-
89.
Q2 2010
Type            Successes  Failures  Success Rate  Incident Minutes (Sev1/2)
App code        420        5         98.81         8
Config          404        3         99.26         5
DB Schema       15         1         93.33         10
DNS             45         0         100           0
Network (misc)  5          0         100           0
Network (core)  1          0         100           0
-
90.
Some Observations
-
91.
Incident Observations
Morale
Length of Incident/Outage
-
92.
Incident Observations
Mistakes
Length of Incident/Outage
-
93.
Change Observations
Change
Size
Change Frequency
-
94.
Change Observations
Huge changesets
deployed rarely
Change
Size
Change Frequency
-
95.
Change Observations
Huge changesets (high TTR)
deployed rarely
Change
Size
Change Frequency
-
96.
Change Observations
Huge changesets deployed rarely (high TTR)
Tiny changesets deployed often
Change Size
Change Frequency
-
97.
Change Observations
Huge changesets deployed rarely (high TTR)
Tiny changesets deployed often (low TTR)
Change Size
Change Frequency
-
98.
Specifically....
la de da, everything’s fine
change happens
What if this was only 5 lines of code that were changed?
Does that feel safer?
(it should)
-
99.
Pay attention to this stuff
http://www.flickr.com/photos/plasticbag/2461247090/
-
100.
We’re Hiring Ops!
SF & NYC
In May:
- $22.9M of goods were sold by the community
- 1,895,943 new items listed
- 239,340 members joined
-
101.
The End
-
102.
Bonus Time!!1!
-
103.
Continuous
Deployment
Described in 6 graphs
(Originally Cal Henderson’s idea)
This is about metrics about YOU! Metrics *about* the metrics-makers!
They are basically taken from both Flickr and Etsy.
HOW MANY of you: write public-facing app code? maintain the release tools or release process? respond to incidents? have had an outage or notable degradation this month? One that was change-related?
Too fast? Too loose? Too many issues? Too many upset and stressed out humans?
Everyone is used to bug tracking; it’s something worthwhile....
If this is a feeling you have often, please read on.
All you need is to see this happen once, and it’s hard to get out of your memory.
No wonder some people start to think “code deploy = outage”.
Mild version of “Critical Incident Stress Management”?
Change = risk, and sometimes risk = outage. And outages are stressful.
Not supposed to feel like this.
Details about the change play a huge role in your ability to respond to change-related incidents.
We do this by tracking our responses to outages and incidents.
We can do this by tracking our change, and learning from the results.
We need to raise confidence that we’re moving as fast as we can while still being safe enough to do so. And we can adjust the change to meet our requirements...
Why should change and results of changes be any different?
Type = code, schema, infrastructure, etc.
Frequency/Size = how often each type is changed, implies risk
Results = how often each change results in an incident/degradation
Lots of different types here. Might be different for everyone.
Not all types of change bring the same amount of risk.
This info should be considered mandatory. This should also be done for db schema changes, network changes, changes in any part of the stack, really.
The header of our metrics tools has these statistics, too.
The tricky part: getting all prod changes written down without too much hassle.
Here’s one type of change....
Here’s another type of change....
Here’s yet another type of change...
Size does turn out to be important. Size = lines of code, level of SPOF risk, etc.
This seems like something you should do. Also: “incidents” = outages or degradations.
Just an example. This looks like it’s going well! Getting better!
Maybe I can’t say that it’s getting better, actually....
Some folks have Techcrunch as their incident log keeper. You could just use a spreadsheet.
An example!
You *are* doing postmortems on incidents that happen, right? Doing them comes at a certain point in your evolution.
Without the statistics, even a rare but severe outage can make the impression that change == outage.
Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
Just examples. This normally comes from a postmortem meeting. A good pointer on Root Cause Analysis is Eric Ries’ material on Five Whys, and the wikipedia page for RCA.
What happens in our response to a change-related incident is just as important as the occurrence of the incident.
This might also be known as a ‘diagnose’ point.
These events usually spawn other events.
This should be standard operating procedure at this point.
Some folks might notice a “Time To Diagnose” missing here.
ALSO: it’s usually more complex than this, but this is the gist of it.
Do incidents increase with size of change? With frequency? With frequency/size of different types?
If you don’t track: Change, Incidents, and Responses, you’ll never have answers for these questions.
Reasonable questions.
*YOU* get to decide what is “small” and “frequent”.
THIS is what can help give you confidence. Or not.
The longer an outage lasts, the bigger of a bummer it is for all those who are working on fixing it.
The longer an outage lasts, the more mistakes people make. (and, as the night gets longer)
Red herrings...
put two points on this graph
It should, because it is.
How we feel about change and how it can (or not) cause outages is important.
Some of the nastiest relationships emerge between dev and ops because of these things.
“Normal” = lots of change done at regular intervals, change = big, time = long.
2 weeks? 5000 lines?
Scary Monster of Change! Each incident-causing deploy has only one recourse: roll it all back. Even code that was ok and unrelated to the incident. Boo!
Silly Monster of Nothing to Be Afraid Of Because His Teeth Are Small.
Problem? Roll that little piece back. Or better yet, roll it forward!
This looks like an adorable monster. Like a Maurice Sendak monster.