Ops Meta-Metrics: The Currency You Pay For Change
Speaker notes
  • This is about metrics about YOU! Metrics *about* the metrics-makers!
  • They are basically taken from both Flickr and Etsy.
  • HOW MANY of you: write public-facing app code? maintain the release tools? run the release process? respond to incidents? have had an outage or notable degradation this month? ...one that was change-related?
  • Too fast? Too loose? Too many issues? Too many upset and stressed out humans?
  • Everyone is used to bug tracking; it’s accepted as something worthwhile....



  • If this is a feeling you have often, please read on.

  • You only need to see this happen once for it to stick in your memory.
    No wonder some people start to think “code deploy = outage”.
  • Mild version of “Critical Incident Stress Management”?
    Change = risk, and sometimes risk = outage. And outages are stressful.

  • Not supposed to feel like this.

  • Details about the change play a huge role in your ability to respond to change-related incidents.

  • We do this by tracking our responses to outages and incidents.
  • We can do this by tracking our change, and learning from the results.

  • We need to raise confidence that we’re moving as fast as we can while still being safe enough to do so. And we can adjust the change to meet our requirements...
  • Why should change and results of changes be any different?
  • Type = code, schema, infrastructure, etc.
    Frequency/Size = how often each type is changed, implies risk
    Results = how often each change results in an incident/degradation
  • Lots of different types here. Might be different for everyone.
    Not all types of change bring the same amount of risk.
  • This info should be considered mandatory. This should also be done for db schema changes, network changes, changes in any part of the stack, really.
  • The header of our metrics tools has these statistics, too.
  • The tricky part: getting all prod changes written down without too much hassle.
  • Here’s one type of change....
  • Here’s another type of change....
  • Here’s yet another type of change...
  • Size does turn out to be important. Size = lines of code, level of SPOF risk, etc.
  • This seems like something you should do. Also: “incidents” = outages or degradations.
  • Just an example. This looks like it’s going well! Getting better!
  • Maybe I can’t say that it’s getting better, actually....

  • Some folks have Techcrunch as their incident log keeper. You could just use a spreadsheet.
  • An example!
  • You *are* doing postmortems on incidents that happen, right? Doing them comes at a certain point in your evolution.



  • Without the statistics, even a rare but severe outage can make the impression that change == outage.
  • Just examples. It’s important to categorize these things so you can count the ones that matter for the user’s experience. #4 Loss of redundancy
  • Just examples. This normally comes from a postmortem meeting. A good pointer on Root Cause Analysis is Eric Ries’ material on Five Whys, and the Wikipedia page for RCA.
  • http://www.flickr.com/photos/mattblaze/2695044170/
  • What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • This might also be known as a ‘diagnose’ point.
  • These events usually spawn other events.
  • This should be standard operating procedure at this point.
  • Some folks might notice a “Time To Diagnose” missing here.
    ALSO: it’s usually more complex than this, but this is the gist of it.


  • Do incidents increase with size of change? With frequency? With frequency/size of different types?


  • If you don’t track change, incidents, and responses, you’ll never have answers to these questions.


  • Reasonable questions.

  • *YOU* get to decide what is “small” and “frequent”.


  • THIS is what can help give you confidence. Or not.

  • The longer an outage lasts, the bigger a bummer it is for everyone working on fixing it.
  • The longer an outage lasts, the more mistakes people make (and the later the night gets).
    Red herrings...
  • put two points on this graph
  • It should, because it is.
  • How we feel about change and how it can (or not) cause outages is important.
    Some of the nastiest relationships emerge between dev and ops because of these things.





  • “Normal” = lots of change done at regular intervals, change = big, time = long.
  • 2 weeks? 5000 lines?
  • Scary Monster of Change! Each incident-causing deploy has only one recourse: roll it all back. Even code that was ok and unrelated to the incident. Boo!
  • Silly Monster of Nothing to Be Afraid Of Because His Teeth Are Small.
  • Problem? Roll that little piece back. Or better yet, roll it forward!
  • This looks like an adorable monster. Like a Maurice Sendak monster.

  • Transcript

    • 1. Ops Meta-Metrics: The Currency You Use to Pay For Change. John Allspaw, VP Operations, Etsy.com http://www.flickr.com/photos/wwarby/3296379139
    • 2. Warning: Graphs and numbers in this presentation are sort of made up
    • 3. /usr/nagios/libexec/check_ops.pl
    • 4. How R U Doing? http://www.flickr.com/photos/a4gpa/190120662/
    • 5. We track bugs already... Example: https://issues.apache.org/jira/browse/TS
    • 6. We should track these, too...
    • 7. We should track these, too... Changes (Who/What/When/Type)
    • 8. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity)
    • 9. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity) Response to Incidents (TTR/TTD)
    • 10. trepidation noun 1 a feeling of fear or agitation about something that may happen : the men set off in fear and trepidation. 2 archaic trembling motion. DERIVATIVES trepidatious adjective ORIGIN late 15th cent.: from Latin trepidatio(n-), from trepidare ‘be agitated, tremble,’ from trepidus ‘alarmed.’
    • 11. Change Required. Often feared. Why? http://www.flickr.com/photos/20408885@N03/3570184759/
    • 12. This is why: la de da, everything’s fine → change happens → OMGWTF OUTAGES!!!1!!
    • 13. Change PTSD? http://www.flickr.com/photos/tzofia/270800047/
    • 14. Brace For Impact?
    • 15. Brace For Impact?
    • 16. But wait.... la de da, everything’s fine → change happens → (OMGWTF)
    • 17. But wait.... How much change is this?
    • 18. But wait.... How much change is this? What kind of change?
    • 19. But wait.... How much change is this? What kind of change? How often does this happen?
    • 20. Need to raise confidence that change != outage
    • 21. ...incidents can be handled well http://www.flickr.com/photos/axiepics/3181170364/
    • 22. ...root causes can be fixed quick enough http://www.flickr.com/photos/ljv/213624799/
    • 23. ...change can be safe enough http://www.flickr.com/photos/marksetchell/43252686/
    • 24. But how? How do we have confidence in anything in our infrastructure? We measure it. And graph it. And alert on it.
    • 25. Tracking Change 1. Type 2. Frequency/Size 3. Results of those changes
    • 26. Types of Change (Layers → Examples): App code → PHP/Rails/etc, or ‘front-end’ code; Services → Apache, MySQL, DB schema, PHP/Ruby versions, etc.; Infrastructure → OS/Servers, Switches, Routers, Datacenters, etc. (you decide what these are for your architecture)
    • 27. Code Deploys: Who/What/When. WHO (guy who pushed the button), WHAT (link to diff), WHEN. (http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/) A logging sketch appears after the transcript.
    • 28. Code Deploys: Who/What/When Last 2 prod deploys Last 2 Chef changes
    • 29. other changes (insert whatever ticketing/tracking you have)
    • 30. Frequency
    • 31. Frequency
    • 32. Frequency
    • 33. Size
    • 34. Tracking Incidents http://www.flickr.com/photos/47684393@N00/4543311558/
    • 35. Incident Frequency
    • 36. Incident Size Big Outage TTR still going
    • 37. Tracking Incidents 1. Frequency 2. Severity 3. Root Cause 4. Time-To-Detect (TTD) 5. Time-To-Resolve (TTR)
    • 38. The How Doesn’t Matter http://www.flickr.com/photos/matsuyuki/2328829160/
    • 39. Incident/Degradation Tracking:
      Date   | Start Time | Detect Time | Resolve Time | Severity | Root Cause | PostMortem Done?
      1/2/08 | 12:30 ET   | 12:32 ET    | 12:45 ET     | Sev1     | DB Change  | Yes
      3/7/08 | 18:32 ET   | 18:40 ET    | 18:47 ET     | Sev2     | Capacity   | Yes
      5/3/08 | 17:55 ET   | 17:55 ET    | 18:14 ET     | Sev3     | Hardware   | Yes
    • 40. Incident/Degradation Tracking: These will give you context for your rates of change. (You’ll need them for postmortems, anyway.)
    • 41. Change:Incident Ratio
    • 42. Change:Incident Ratio Important.
    • 43. Change:Incident Ratio Important. Not because all changes are equal.
    • 44. Change:Incident Ratio Important. Not because all changes are equal. Not because all incidents are equal, or change-related.
    • 45. Change:Incident Ratio But because humans will irrationally make a permanent connection between the two. http://www.flickr.com/photos/michelepedrolli/449572596/
    • 46. Severity
    • 47. Severity Not all incidents are created equal.
    • 48. Severity Not all incidents are created equal. Something like:
    • 49. Severity Not all incidents are created equal. Something like:
    • 50. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable.
    • 51. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users.
    • 52. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience.
    • 53. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience. SEV4 No impact, but time-sensitive failure.
    • 54. Root Cause? (Not all incidents are change related) Something like: Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
    • 55. Root Cause? (Not all incidents are change related) Something like: 1. Hardware Failure 2. Datacenter Issue 3. Change: Code Issue 4. Change: Config Issue 5. Capacity/Traffic Issue 6. Other Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
    • 56. Recording Your Response (worth the hassle) http://www.flickr.com/photos/mattblaze/2695044170/
    • 57. Time
    • 58. Timeline: la de da, everything’s fine
    • 59. Timeline: la de da, everything’s fine → change happens
    • 60. Timeline: → Noticed there was a problem
    • 61. Timeline: → Figured out what the cause is
    • 62. Timeline: → Fixed the problem (rolled back / rolled forward / temporary solution / etc.)
    • 63. (same timeline)
    • 64. During diagnosis: Coordinate troubleshooting/diagnosis
    • 65. During diagnosis: Coordinate troubleshooting/diagnosis; Communicate to support/community/execs
    • 66. (same timeline)
    • 67. During response: Coordinate responses (usually, “One Thing At A Time” responses)
    • 68. During response: Coordinate responses; Communicate to support/community/execs
    • 69. (same timeline)
    • 70. After the fix: Confirm stability, resolving steps
    • 71. After the fix: Confirm stability, resolving steps; Communicate to support/community/execs
    • 72. Communications http://etsystatus.com twitter.com/etsystatus
    • 73. Same timeline, ending with: PostMortem
    • 74. Time To Detect (TTD) and Time To Resolve (TTR) on the timeline: TTD runs from ‘change happens’ to detection; TTR runs from detection back to ‘la de da, everything’s fine’. A TTD/TTR sketch appears after the transcript.
    • 75. Hypothetical Example: “We’re So Nimble!”
    • 76. Nimble, But Stumbling?
    • 77. Is There Any Pattern?
    • 78. Nimble, But Stumbling? +
    • 79. Nimble, But Stumbling? +
    • 80. Maybe this is too much suck? Maybe you’re changing too much at once? Happening too often?
    • 81. What percentage of incidents are related to change? http://www.flickr.com/photos/78364563@N00/2467989781/
    • 82. What percentage of change- related incidents are “off-hours”? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
    • 83. What percentage of change- related incidents are “off-hours”? Do they have higher or lower TTR? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
    • 84. What types of change have the worst success rates? http://www.flickr.com/photos/lwr/2257949828/
    • 85. What types of change have the worst success rates? Which ones have the best success rates? http://www.flickr.com/photos/lwr/2257949828/
    • 86. Does your TTD/TTR increase depending on the SIZE? The FREQUENCY? (see the bucketing sketch after the transcript) http://www.flickr.com/photos/45409431@N00/2521827947/
    • 87. A side effect is that you’re also tracking successful changes to production http://www.flickr.com/photos/wwworks/2313927146
    • 88. Q2 2010 (a success-rate sketch appears after the transcript):
      Type           | Successes | Failures | Success Rate | Incident Minutes (Sev1/2)
      App code       | 420       | 5        | 98.81        | 8
      Config         | 404       | 3        | 99.26        | 5
      DB Schema      | 15        | 1        | 93.33        | 10
      DNS            | 45        | 0        | 100          | 0
      Network (misc) | 5         | 0        | 100          | 0
      Network (core) | 1         | 0        | 100          | 0
    • 89. The same Q2 2010 table, with one row called out (!)
    • 90. Some Observations
    • 91. Incident Observations (graph): Morale vs. Length of Incident/Outage
    • 92. Incident Observations (graph): Mistakes vs. Length of Incident/Outage
    • 93. Change Observations (graph): Change Size vs. Change Frequency
    • 94. Huge changesets, deployed rarely
    • 95. Huge changesets, deployed rarely (high TTR)
    • 96. Tiny changesets, deployed often
    • 97. Tiny changesets, deployed often (low TTR)
    • 98. Specifically.... la de da, everything’s fine → change happens. What if this was only 5 lines of code that were changed? Does that feel safer? (it should)
    • 99. Pay attention to this stuff http://www.flickr.com/photos/plasticbag/2461247090/
    • 100. We’re Hiring Ops! SF & NYC In May: - $22.9M of goods were sold by the community - 1,895,943 new items listed - 239,340 members joined
    • 101. The End
    • 102. Bonus Time!!1!
    • 103. Continuous Deployment Described in 6 graphs (Originally Cal Henderson’s idea)
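
Code sketches (referenced from slides 27, 39/74, 86, and 88)

Slide 27’s who/what/when record is easy to automate. A minimal sketch in Python, assuming a JSON-lines log file and hypothetical field names (the deck doesn’t prescribe a format):

    import json
    import time

    CHANGELOG = "/var/log/prod-changes.jsonl"  # hypothetical location

    def record_change(who, what, change_type, size=None):
        """Append one record per production change.
        who  = the person who pushed the button
        what = a link to the diff (or ticket)
        change_type = 'app code', 'config', 'db schema', 'dns', ...
        size = e.g. lines changed (slide 33: size matters)"""
        entry = {
            "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "who": who,
            "what": what,
            "type": change_type,
            "size": size,
        }
        with open(CHANGELOG, "a") as f:
            f.write(json.dumps(entry) + "\n")

    # e.g., as the last step of a deploy script:
    # record_change("jallspaw", "http://example.com/diff/abc123", "app code", size=5)

Calling this from the deploy tooling itself, rather than asking humans to remember, is the low-hassle path the speaker notes hint at.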
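Slides 37-39 list the incident fields, and slide 74 marks TTD and TTR on the timeline. A sketch of deriving both from the three timestamps in the slide-39 table; following the slide-74 diagram, TTD is measured from incident start to detection and TTR from detection to resolution (measuring TTR from incident start instead is a one-line change):

    from datetime import datetime

    SEVERITIES = ("Sev1", "Sev2", "Sev3", "Sev4")           # slides 50-53
    ROOT_CAUSES = ("Hardware Failure", "Datacenter Issue",  # slide 55
                   "Change: Code Issue", "Change: Config Issue",
                   "Capacity/Traffic Issue", "Other")

    def ttd_ttr(start, detect, resolve, fmt="%Y-%m-%d %H:%M"):
        """Time To Detect = detect - start; Time To Resolve = resolve - detect."""
        t0, t1, t2 = (datetime.strptime(t, fmt) for t in (start, detect, resolve))
        return t1 - t0, t2 - t1

    # First row of the slide-39 table (1/2/08, Sev1, DB Change):
    ttd, ttr = ttd_ttr("2008-01-02 12:30", "2008-01-02 12:32", "2008-01-02 12:45")
    print(ttd, ttr)  # -> 0:02:00 0:13:00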
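The per-type table on slide 88 falls out of counting changes and change-related incidents by type. A sketch assuming two plain lists of type names; note the slide computes the rate as (changes - failures) / changes:

    from collections import Counter

    def success_table(change_types, failed_types):
        """change_types: the type of every production change in the period;
        failed_types: the type of every change that caused an incident."""
        total, failed = Counter(change_types), Counter(failed_types)
        print(f"{'Type':<15} {'Changes':>8} {'Failures':>9} {'Rate %':>7}")
        for ctype in sorted(total):
            n, f = total[ctype], failed.get(ctype, 0)
            print(f"{ctype:<15} {n:>8} {f:>9} {100.0 * (n - f) / n:>7.2f}")

    # With the slide-88 numbers for two of the types:
    success_table(["app code"] * 420 + ["config"] * 404,
                  ["app code"] * 5 + ["config"] * 3)
    # -> app code 98.81, config 99.26, matching the table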
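Slide 86 asks whether TTD/TTR increases with the size or frequency of change. Once the change log and incident log can be joined, a crude first pass is to bucket change-related incidents by the size of the causing change and compare median TTR per bucket (the pair layout here is an assumption, not anything the deck specifies):

    from statistics import median

    def ttr_by_change_size(incidents, bucket=100):
        """incidents: (lines_changed, ttr_minutes) pairs,
        one per change-related incident."""
        buckets = {}
        for size, ttr in incidents:
            buckets.setdefault(size // bucket, []).append(ttr)
        for b in sorted(buckets):
            ttrs = buckets[b]
            print(f"{b * bucket:>5}-{(b + 1) * bucket:<5} lines: "
                  f"median TTR {median(ttrs):5.1f} min (n={len(ttrs)})")

    # e.g. ttr_by_change_size([(12, 4), (90, 9), (450, 35), (5200, 120)])

The same grouping by deploys-per-day instead of lines changed answers the frequency half of the question.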
