Ops Meta-Metrics: The Currency You Pay For Change
Speaker Notes

  • This is about metrics about YOU! Metrics *about* the metrics-makers!
  • They are basically taken from both Flickr and Etsy.
  • HOW MANY of you: write public-facing app code? maintain the release tools? own the release process? respond to incidents? have had an outage or notable degradation this month? ...one that was change-related?
  • Too fast? Too loose? Too many issues? Too many upset and stressed-out humans?
  • Everyone is used to bug tracking; it's something worthwhile....
  • If this is a feeling you have often, please read on.
  • All you need is to see this happen once, and it's hard to get out of your memory. No wonder some people start to think "code deploy = outage".
  • A mild version of "Critical Incident Stress Management"? Change = risk, and sometimes risk = outage. And outages are stressful.
  • Not supposed to feel like this.
  • Details about the change play a huge role in your ability to respond to change-related incidents.
  • We do this by tracking our responses to outages and incidents.
  • We can do this by tracking our change, and learning from the results.
  • We need to raise confidence that we're moving as fast as we can while still being safe enough to do so. And we can adjust the change to meet our requirements...
  • Why should change, and the results of changes, be any different?
  • Type = code, schema, infrastructure, etc. Frequency/Size = how often each type is changed; implies risk. Results = how often each change results in an incident/degradation.
  • Lots of different types here. Might be different for everyone. Not all types of change bring the same amount of risk.
  • This info should be considered mandatory. This should also be done for DB schema changes, network changes, changes in any part of the stack, really.
  • The header of our metrics tools has these statistics, too.
  • The tricky part: getting all prod changes written down without too much hassle.
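One way to lower that hassle is to have the tooling write the record for you. A minimal sketch in Python (the log path and function name are hypothetical, not from the talk): call log_change() from the deploy script, the schema-migration wrapper, the config push, and so on.

```python
#!/usr/bin/env python
"""Append a who/what/when/type record for every production change."""
import csv
import getpass
import time

LOG_PATH = "/var/log/prod_changes.tsv"  # hypothetical location

def log_change(change_type, what):
    """change_type: 'app code', 'config', 'db schema', 'network', ...
    what: a link to the diff or ticket describing the change."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([
            time.strftime("%Y-%m-%d %H:%M:%S"),  # WHEN
            getpass.getuser(),                   # WHO pushed the button
            change_type,                         # TYPE of change
            what,                                # WHAT (link to diff)
        ])

if __name__ == "__main__":
    log_change("app code", "https://example.com/diffs/1234")
```

If the button that ships the change also writes the log line, coverage stops depending on anyone's memory.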
  • Here's one type of change....
  • Here's another type of change....
  • Here's yet another type of change...
  • Size does turn out to be important. Size = lines of code, level of SPOF risk, etc.
  • This seems like something you should do. Also: "incidents" = outages or degradations.
  • Just an example. This looks like it's going well! Getting better!
  • Maybe I can't say that it's getting better, actually....
  • Some folks have TechCrunch as their incident log keeper. You could just use a spreadsheet.
  • An example!
  • You *are* doing postmortems on incidents that happen, right? Doing them comes at a certain point in your evolution.
  • Without the statistics, even a rare but severe outage can create the impression that change == outage.
  • Just examples. It's important to categorize these things so you can count the ones that matter for the user's experience. #4: Loss of redundancy.
  • Just examples. This normally comes from a postmortem meeting. Good pointers on Root Cause Analysis are Eric Ries' material on Five Whys and the Wikipedia page for RCA.
  • http://www.flickr.com/photos/mattblaze/2695044170/
  • What happens in our response to a change-related incident is just as important as the occurrence of the incident.
  • This might also be known as a 'diagnose' point.
  • These events usually spawn other events.
  • This should be standard operating procedure at this point.
  • Some folks might notice a "Time To Diagnose" missing here. ALSO: it's usually more complex than this, but this is the gist of it.
  • Do incidents increase with size of change? With frequency? With frequency/size of different types?
  • If you don't track Change, Incidents, and Responses, you'll never have answers to these questions.
  • Reasonable questions.
  • *YOU* get to decide what is "small" and "frequent".
  • THIS is what can help give you confidence. Or not.
  • The longer an outage lasts, the bigger a bummer it is for everyone working on fixing it.
  • The longer an outage lasts, the more mistakes people make (and more so as the night wears on). Red herrings...
  • Put two points on this graph.
  • It should, because it is.
  • How we feel about change, and how it can (or can't) cause outages, is important. Some of the nastiest relationships between dev and ops emerge because of these things.
  • "Normal" = lots of change done at regular intervals; change = big, time = long.
  • 2 weeks? 5000 lines?
  • Scary Monster of Change! Each incident-causing deploy has only one recourse: roll it all back. Even code that was ok and unrelated to the incident. Boo!
  • Silly Monster of Nothing to Be Afraid Of Because His Teeth Are Small.
  • Problem? Roll that little piece back. Or better yet, roll it forward!
  • This looks like an adorable monster. Like a Maurice Sendak monster.

Ops Meta-Metrics: The Currency You Pay For Change Presentation Transcript

  • 1. Ops Meta-Metrics The Currency You Use to Pay For Change John Allspaw VP Operations Etsy.com http://www.flickr.com/photos/wwarby/3296379139
  • 2. Warning Graphs and numbers in this presentation are sort of made up
  • 3. /usr/nagios/libexec/check_ops.pl
  • 4. How R U Doing? http://www.flickr.com/photos/a4gpa/190120662/
  • 5. We track bugs already... Example: https://issues.apache.org/jira/browse/TS
  • 6. We should track these, too...
  • 7. We should track these, too... Changes (Who/What/When/Type)
  • 8. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity)
  • 9. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity) Response to Incidents (TTR/TTD)
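Read concretely, those three trackables reduce to two record shapes plus arithmetic; the response metrics (TTD/TTR) fall out of the incident timestamps. A sketch (field names are inferred from the slides, not prescribed by them):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    who: str                # who pushed the button
    what: str               # link to the diff or ticket
    when: datetime
    type: str               # "app code", "config", "db schema", ...
    lines_changed: int = 0  # one possible measure of size

@dataclass
class Incident:
    type: str       # root-cause category, e.g. "Change: Code Issue"
    severity: int   # 1 = full outage ... 4 = no user impact
    start: datetime
    detected: datetime
    resolved: datetime

    @property
    def ttd(self) -> timedelta:
        """Time-To-Detect: incident start until someone noticed."""
        return self.detected - self.start

    @property
    def ttr(self) -> timedelta:
        """Time-To-Resolve: incident start until everything's fine."""
        return self.resolved - self.start
```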
  • 10. trepidation noun 1 a feeling of fear or agitation about something that may happen: the men set off in fear and trepidation. 2 archaic trembling motion. DERIVATIVES trepidatious adjective ORIGIN late 15th cent.: from Latin trepidatio(n-), from trepidare ‘be agitated, tremble,’ from trepidus ‘alarmed.’
  • 11. Change Required. Often feared. Why? http://www.flickr.com/photos/20408885@N03/3570184759/
  • 12. This is why: [timeline: la de da, everything’s fine → change happens → OMGWTF OUTAGES!!!1!!]
  • 13. Change PTSD? http://www.flickr.com/photos/tzofia/270800047/
  • 14-15. Brace For Impact?
  • 16. But wait.... [the same timeline: la de da, everything’s fine → change happens → (OMGWTF)]
  • 17. But wait.... How much change is this?
  • 18. But wait.... How much change is this? What kind of change?
  • 19. But wait.... How much change is this? What kind of change? How often does this happen?
  • 20. Need to raise confidence that change != outage
  • 21. ...incidents can be handled well http://www.flickr.com/photos/axiepics/3181170364/
  • 22. ...root causes can be fixed quick enough http://www.flickr.com/photos/ljv/213624799/
  • 23. ...change can be safe enough http://www.flickr.com/photos/marksetchell/43252686/
  • 24. But how? How do we have confidence in anything in our infrastructure? We measure it. And graph it. And alert on it.
  • 25. Tracking Change 1. Type 2. Frequency/Size 3. Results of those changes
  • 26. Types of Change:
    Layers         | Examples
    App code       | PHP/Rails/etc or ‘front-end’ code
    Services code  | Apache, MySQL, DB schema, PHP/Ruby versions, etc.
    Infrastructure | OS/Servers, Switches, Routers, Datacenters, etc.
    (you decide what these are for your architecture)
  • 27. Code Deploys: Who/What/When. WHEN (timestamp), WHO (the guy who pushed the button), WHAT (link to diff). (http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
  • 28. Code Deploys: Who/What/When Last 2 prod deploys Last 2 Chef changes
  • 29. other changes (insert whatever ticketing/tracking you have)
  • 30-32. Frequency [graphs]
  • 33. Size
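The data behind Frequency and Size graphs like these is plain aggregation over the change log. A sketch reusing the Change records above (daily buckets and lines-changed-as-size are assumptions, not the talk's prescription):

```python
from collections import Counter

def deploys_per_day(changes):
    """Frequency: how many changes of each type ship per day."""
    return Counter((c.when.date(), c.type) for c in changes)

def lines_per_day(changes):
    """Size: total lines changed per day, per type."""
    sizes = Counter()
    for c in changes:
        sizes[(c.when.date(), c.type)] += c.lines_changed
    return sizes
```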
  • 34. Tracking Incidents http://www.flickr.com/photos/47684393@N00/4543311558/
  • 35. Incident Frequency
  • 36. Incident Size [graph; labels: “Big Outage”, “TTR still going”]
  • 37. Tracking Incidents 1. Frequency 2. Severity 3. Root Cause 4. Time-To-Detect (TTD) 5. Time-To-Resolve (TTR)
  • 38. The How Doesn’t Matter http://www.flickr.com/photos/matsuyuki/2328829160/
  • 39. Incident/Degradation Tracking:
    Date   | Start Time | Detect Time | Resolve Time | Severity | Root Cause | PostMortem Done?
    1/2/08 | 12:30 ET   | 12:32 ET    | 12:45 ET     | Sev1     | DB Change  | Yes
    3/7/08 | 18:32 ET   | 18:40 ET    | 18:47 ET     | Sev2     | Capacity   | Yes
    5/3/08 | 17:55 ET   | 17:55 ET    | 18:14 ET     | Sev3     | Hardware   | Yes
  • 40. Incident/Degradation Tracking: These will give you context for your rates of change. (You’ll need them for postmortems, anyway.)
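Working the example rows above gives the response numbers directly. A sketch, assuming (as the timeline later in the deck does) that TTR is measured from the start of the incident:

```python
from datetime import datetime

ROWS = [  # date, start, detect, resolve, severity, root cause
    ("1/2/08", "12:30", "12:32", "12:45", "Sev1", "DB Change"),
    ("3/7/08", "18:32", "18:40", "18:47", "Sev2", "Capacity"),
    ("5/3/08", "17:55", "17:55", "18:14", "Sev3", "Hardware"),
]

def minutes(date, t1, t2):
    """Minutes elapsed between two clock times on the same date."""
    fmt = "%m/%d/%y %H:%M"
    delta = (datetime.strptime(f"{date} {t2}", fmt)
             - datetime.strptime(f"{date} {t1}", fmt))
    return delta.total_seconds() / 60

for date, start, detect, resolve, sev, cause in ROWS:
    print(f"{date} {sev} ({cause}): "
          f"TTD={minutes(date, start, detect):.0f}m, "
          f"TTR={minutes(date, start, resolve):.0f}m")
# 1/2/08 Sev1 (DB Change): TTD=2m, TTR=15m
# 3/7/08 Sev2 (Capacity): TTD=8m, TTR=15m
# 5/3/08 Sev3 (Hardware): TTD=0m, TTR=19m
```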
  • 41. Change:Incident Ratio
  • 42. Change:Incident Ratio Important.
  • 43. Change:Incident Ratio Important. Not because all changes are equal.
  • 44. Change:Incident Ratio Important. Not because all changes are equal. Not because all incidents are equal, or change-related.
  • 45. Change:Incident Ratio But because humans will irrationally make a permanent connection between the two. http://www.flickr.com/photos/michelepedrolli/449572596/
  • 46-53. Severity: Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience. SEV4 No impact, but time-sensitive failure.
  • 54. Root Cause? (Not all incidents are change-related.) Something like: Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
  • 55. Root Cause? (Not all incidents are change-related.) Something like: 1. Hardware Failure 2. Datacenter Issue 3. Change: Code Issue 4. Change: Config Issue 5. Capacity/Traffic Issue 6. Other Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
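Encoding the categories keeps the counting consistent: every incident lands in exactly one severity and one root-cause bucket. A sketch using the slides' example categories (grouping the two "Change:" buckets as change-related is the one assumption made here):

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "Full outage, or effectively unusable"
    SEV2 = "Significant degradation for subset of users"
    SEV3 = "Minor impact on user experience"
    SEV4 = "No impact, but time-sensitive failure"

class RootCause(Enum):
    HARDWARE = "Hardware Failure"
    DATACENTER = "Datacenter Issue"
    CODE_CHANGE = "Change: Code Issue"
    CONFIG_CHANGE = "Change: Config Issue"
    CAPACITY = "Capacity/Traffic Issue"
    OTHER = "Other"

# Incidents whose root-cause category is one of these count as
# change-related (used for the ratio questions later in the deck):
CHANGE_RELATED = {RootCause.CODE_CHANGE.value, RootCause.CONFIG_CHANGE.value}
```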
  • 56. Recording Your Response (worth the hassle) http://www.flickr.com/photos/mattblaze/2695044170/
  • 57-62. [An incident timeline, built up one element per slide over a Time axis:] la de da, everything’s fine → change happens → Noticed there was a problem → Figured out what the cause is → Fixed the problem (rolled back / rolled forward / temporary solution / etc.)
  • 63-65. [While diagnosing:] Coordinate troubleshooting/diagnosis. Communicate to support/community/execs.
  • 66-68. [While fixing:] Coordinate responses (usually “One Thing At A Time” responses). Communicate to support/community/execs.
  • 69-71. [After the fix:] Confirm stability, resolving steps. Communicate to support/community/execs.
  • 72. Communications http://etsystatus.com twitter.com/etsystatus
  • 73. [The same timeline, with one more step at the end:] PostMortem
  • 74. Time To Detect (TTD) and Time To Resolve (TTR), marked on the timeline: TTD runs from “change happens” until the problem is noticed; TTR runs until everything’s fine again.
  • 75. Hypothetical Example: “We’re So Nimble!”
  • 76. Nimble, But Stumbling?
  • 77. Is There Any Pattern?
  • 78. Nimble, But Stumbling? +
  • 79. Nimble, But Stumbling? +
  • 80. Maybe this is too much suck? Maybe you’re changing too much at once? Happening too often?
  • 81. What percentage of incidents are related to change? http://www.flickr.com/photos/78364563@N00/2467989781/
  • 82. What percentage of change-related incidents are “off-hours”? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  • 83. What percentage of change-related incidents are “off-hours”? Do they have higher or lower TTR? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  • 84. What types of change have the worst success rates? http://www.flickr.com/photos/lwr/2257949828/
  • 85. What types of change have the worst success rates? Which ones have the best success rates? http://www.flickr.com/photos/lwr/2257949828/
  • 86. Does your TTD/TTR increase depending on the: - SIZE? - FREQUENCY? http://www.flickr.com/photos/45409431@N00/2521827947/
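With the Incident records and CHANGE_RELATED set sketched above, several of these questions become a few lines each. The 9am-6pm definition of "off-hours" below is a hypothetical cut; pick whatever matches your on-call reality:

```python
def pct_change_related(incidents):
    """What percentage of incidents are related to change?"""
    hits = sum(1 for i in incidents if i.type in CHANGE_RELATED)
    return 100.0 * hits / len(incidents)

def is_off_hours(incident):
    """Hypothetical cut: anything outside 9am-6pm is 'off-hours'."""
    return not (9 <= incident.start.hour < 18)

def mean_ttr_minutes(incidents):
    """Average Time-To-Resolve, in minutes."""
    return sum(i.ttr.total_seconds() / 60 for i in incidents) / len(incidents)

# Do off-hours change-related incidents have higher or lower TTR?
# changey = [i for i in incidents if i.type in CHANGE_RELATED]
# print(mean_ttr_minutes([i for i in changey if is_off_hours(i)]),
#       mean_ttr_minutes([i for i in changey if not is_off_hours(i)]))
```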
  • 87. Side effect is that you’re also tracking successful changes to production as well http://www.flickr.com/photos/wwworks/2313927146
  • 88-89. Q2 2010:
    Type           | Successes | Failures | Success Rate | Incident Minutes (Sev1/2)
    App code       | 420       | 5        | 98.81        | 8
    Config         | 404       | 3        | 99.26        | 5
    DB Schema      | 15        | 1        | 93.33        | 10
    DNS            | 45        | 0        | 100          | 0
    Network (misc) | 5         | 0        | 100          | 0
    Network (core) | 1         | 0        | 100          | 0
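The Success Rate column is just the share of changes of each type that didn't result in an incident. A sketch, spot-checked against the (sort of made up) table above:

```python
def success_rate(successes, failures):
    """Percentage of changes of a given type with no resulting incident."""
    return 100.0 * successes / (successes + failures)

print(round(success_rate(404, 3), 2))  # Config -> 99.26, as in the table
print(round(success_rate(45, 0), 2))   # DNS    -> 100.0
```

Note the volumes, too: one core network change all quarter means that row's 100% says very little on its own; the counts matter as much as the rate.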
  • 90. Some Observations
  • 91. Incident Observations [graph: Morale vs. Length of Incident/Outage]
  • 92. Incident Observations [graph: Mistakes vs. Length of Incident/Outage]
  • 93-97. Change Observations [graph: Change Size vs. Change Frequency, with two points]: Huge changesets deployed rarely (high TTR); tiny changesets deployed often (low TTR).
  • 98. Specifically.... What if this was only 5 lines of code that were changed? Does that feel safer? (it should)
  • 99. Pay attention to this stuff http://www.flickr.com/photos/plasticbag/2461247090/
  • 100. We’re Hiring Ops! SF & NYC In May: - $22.9M of goods were sold by the community - 1,895,943 new items listed - 239,340 members joined
  • 101. The End
  • 102. Bonus Time!!1!
  • 103. Continuous Deployment Described in 6 graphs (Originally Cal Henderson’s idea)