Ops Meta-Metrics: The Currency You Pay For Change

Published in: Business, Technology

  1. 1. Ops Meta-Metrics: The Currency You Use to Pay For Change. John Allspaw, VP Operations, Etsy.com. http://www.flickr.com/photos/wwarby/3296379139
  2. 2. Warning Graphs and numbers in this presentation are sort of made up
  3. 3. /usr/nagios/libexec/check_ops.pl
  4. 4. How R U Doing? http://www.flickr.com/photos/a4gpa/190120662/
  5. 5. We track bugs already... Example: https://issues.apache.org/jira/browse/TS
  6. 6. We should track these, too...
  7. 7. We should track these, too... Changes (Who/What/When/Type)
  8. 8. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity)
  9. 9. We should track these, too... Changes (Who/What/When/Type) Incidents (Type/Severity) Response to Incidents (TTR/TTD)
  10. 10. trepidation noun 1 a feeling of fear or agitation about something that may happen : the men set off in fear and trepidation. 2 archaic trembling motion. DERIVATIVES trepidatious adjective ORIGIN late 15th cent.: from Latin trepidatio(n-), from trepidare ‘be agitated, tremble,’ from trepidus ‘alarme
  11. 11. Change Required. Often feared. Why? http://www.flickr.com/photos/20408885@N03/3570184759/
  12. 12. This is why OMGWTF OUTAGES!!!1!! la de da, everything’s fine change happens
  13. 13. Change PTSD? http://www.flickr.com/photos/tzofia/270800047/
  14. 14. Brace For Impact?
  15. 15. Brace For Impact?
  16. 16. But wait.... (OMGWTF) la de da, everything’s fine; change happens
  17. 17. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this?
  18. 18. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this? What kind of change?
  19. 19. But wait.... (OMGWTF) la de da, everything’s fine; change happens. How much change is this? What kind of change? How often does this happen?
  20. 20. Need to raise confidence that change != outage
  21. 21. ...incidents can be handled well http://www.flickr.com/photos/axiepics/3181170364/
  22. 22. ...root causes can be fixed quickly enough http://www.flickr.com/photos/ljv/213624799/
  23. 23. ...change can be safe enough http://www.flickr.com/photos/marksetchell/43252686/
  24. 24. But how? How do we have confidence in anything in our infrastructure? We measure it. And graph it. And alert on it.
  25. 25. Tracking Change 1. Type 2. Frequency/Size 3. Results of those changes
  26. 26. Types of Change
      Layers         | Examples
      App code       | PHP/Rails/etc or ‘front-end’ code
      Services code  | Apache, MySQL, DB schema, PHP/Ruby versions, etc.
      Infrastructure | OS/Servers, Switches, Routers, Datacenters, etc.
      (you decide what these are for your architecture)
  27. 27. Code Deploys: Who/What/When. WHEN; WHO (guy who pushed the button); WHAT (link to diff). (http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/)
  28. 28. Code Deploys: Who/What/When Last 2 prod deploys Last 2 Chef changes
  29. 29. other changes (insert whatever ticketing/tracking you have)
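The who/what/when/type record described on the slides above could be captured with something as small as the following. This is a minimal sketch, not the deck's actual tooling: `ChangeRecord`, `last_changes`, the names, and the `example.com` links are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeRecord:
    """One change to production: when, who, what, and what type."""
    when: datetime
    who: str          # the person who pushed the button
    what: str         # link to the diff or ticket (hypothetical URLs below)
    change_type: str  # e.g. "app code", "config", "db schema"


def last_changes(log, change_type, n=2):
    """Return the n most recent changes of one type, newest first."""
    matching = [c for c in log if c.change_type == change_type]
    return sorted(matching, key=lambda c: c.when, reverse=True)[:n]


# Illustrative entries only; every name, link, and time is invented.
log = [
    ChangeRecord(datetime(2010, 6, 1, 10, 15), "alice", "https://example.com/diff/101", "app code"),
    ChangeRecord(datetime(2010, 6, 1, 14, 40), "bob", "https://example.com/diff/102", "config"),
    ChangeRecord(datetime(2010, 6, 2, 9, 5), "carol", "https://example.com/diff/103", "app code"),
    ChangeRecord(datetime(2010, 6, 2, 16, 30), "alice", "https://example.com/diff/104", "app code"),
]

# Something like a "last 2 prod deploys" view.
for change in last_changes(log, "app code"):
    print(change.when.isoformat(), change.who, change.what)
```

Where the records live (a ticketing system, a database, a flat file) matters less than recording the same four fields for every change.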
  30. 30. Frequency
  31. 31. Frequency
  32. 32. Frequency
  33. 33. Size
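Frequency and size can both be derived from the same change log. A minimal sketch, assuming each deploy is recorded with its date and a size measured in lines changed; the data and the function names are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical deploy log: (day of deploy, lines changed). Invented values.
deploys = [
    (date(2010, 6, 1), 12),
    (date(2010, 6, 1), 40),
    (date(2010, 6, 2), 5),
    (date(2010, 6, 4), 300),
]


def deploys_per_day(entries):
    """Frequency: number of deploys on each day that had at least one."""
    return Counter(day for day, _ in entries)


def average_size(entries):
    """Size: mean lines changed per deploy (0.0 for an empty log)."""
    return sum(size for _, size in entries) / len(entries) if entries else 0.0


frequency = deploys_per_day(deploys)
print(frequency[date(2010, 6, 1)])
print(average_size(deploys))
```

Lines changed is only one possible size measure; files touched or commits per deploy would fit the same shape.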
  34. 34. Tracking Incidents http://www.flickr.com/photos/47684393@N00/4543311558/
  35. 35. Incident Frequency
  36. 36. Incident Size Big Outage TTR still going
  37. 37. Tracking Incidents 1. Frequency 2. Severity 3. Root Cause 4. Time-To-Detect (TTD) 5. Time-To-Resolve (TTR)
  38. 38. The How Doesn’t Matter http://www.flickr.com/photos/matsuyuki/2328829160/
  39. 39. Incident/Degradation Tracking
      Date   | Start Time | Detect Time | Resolve Time | Severity | Root Cause | PostMortem Done?
      1/2/08 | 12:30 ET   | 12:32 ET    | 12:45 ET     | Sev1     | DB Change  | Yes
      3/7/08 | 18:32 ET   | 18:40 ET    | 18:47 ET     | Sev2     | Capacity   | Yes
      5/3/08 | 17:55 ET   | 17:55 ET    | 18:14 ET     | Sev3     | Hardware   | Yes
  40. 40. Incident/Degradation Tracking (columns: Date, Start Time, Detect Time, Resolve Time, Severity, Root Cause, PostMortem Done?) These will give you context for your rates of change. (You’ll need them for postmortems, anyway.)
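Once start, detect, and resolve times are recorded, TTD and TTR fall out as simple differences. The sketch below uses the three example rows from the tracking table. It assumes TTD runs from start to detect and TTR from detect to resolve (some teams measure TTR from the start time instead), and that all three times fall on the same day in the same time zone. `minutes_between` and `ttd_ttr` are hypothetical helper names.

```python
from datetime import datetime

# Rows from the example table: (date, start, detect, resolve, severity, root cause).
ROWS = [
    ("1/2/08", "12:30", "12:32", "12:45", "Sev1", "DB Change"),
    ("3/7/08", "18:32", "18:40", "18:47", "Sev2", "Capacity"),
    ("5/3/08", "17:55", "17:55", "18:14", "Sev3", "Hardware"),
]


def minutes_between(day, earlier, later):
    """Whole minutes between two clock times on the same day."""
    fmt = "%m/%d/%y %H:%M"
    t0 = datetime.strptime(f"{day} {earlier}", fmt)
    t1 = datetime.strptime(f"{day} {later}", fmt)
    return int((t1 - t0).total_seconds() // 60)


def ttd_ttr(row):
    """Return (time to detect, time to resolve) in minutes for one row."""
    day, start, detect, resolve = row[:4]
    return minutes_between(day, start, detect), minutes_between(day, detect, resolve)


for row in ROWS:
    ttd, ttr = ttd_ttr(row)
    print(row[0], row[4], f"TTD={ttd}m", f"TTR={ttr}m")
```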
  41. 41. Change:Incident Ratio
  42. 42. Change:Incident Ratio Important.
  43. 43. Change:Incident Ratio Important. Not because all changes are equal.
  44. 44. Change:Incident Ratio Important. Not because all changes are equal. Not because all incidents are equal, or change-related.
  45. 45. Change:Incident Ratio But because humans will irrationally make a permanent connection between the two. http://www.flickr.com/photos/michelepedrolli/449572596/
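One way to put a number on the change:incident ratio is changes shipped per change-related incident, per period. A minimal sketch with invented weekly counts; `changes_per_incident` is a hypothetical helper, and deciding which incidents count as change-related is left to whatever root-cause categories you use.

```python
# Hypothetical weekly counts: (week, changes deployed, change-related incidents).
weeks = [("W1", 50, 1), ("W2", 80, 2), ("W3", 30, 0)]


def changes_per_incident(changes, incidents):
    """Changes shipped per change-related incident; None when there were none."""
    return changes / incidents if incidents else None


for label, changes, incidents in weeks:
    print(label, changes_per_incident(changes, incidents))
```

Returning `None` for an incident-free week keeps it from showing up as a division error or a misleading infinite ratio.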
  46. 46. Severity
  47. 47. Severity Not all incidents are created equal.
  48. 48. Severity Not all incidents are created equal. Something like:
  49. 49. Severity Not all incidents are created equal. Something like:
  50. 50. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable.
  51. 51. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users.
  52. 52. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience.
  53. 53. Severity Not all incidents are created equal. Something like: SEV1 Full outage, or effectively unusable. SEV2 Significant degradation for subset of users. SEV3 Minor impact on user experience. SEV4 No impact, but time-sensitive failure.
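The four levels above map naturally onto an ordered enumeration, which makes it easy to filter on "Sev1 or Sev2" later. A sketch only: the class and function names are assumptions, and the comments repeat the slide's "something like" definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe."""
    SEV1 = 1  # Full outage, or effectively unusable.
    SEV2 = 2  # Significant degradation for subset of users.
    SEV3 = 3  # Minor impact on user experience.
    SEV4 = 4  # No impact, but time-sensitive failure.


def parse_severity(label):
    """Turn a label such as 'Sev1' into a Severity member."""
    return Severity[label.upper()]


def is_sev1_or_sev2(severity):
    """True for the two most severe levels."""
    return severity <= Severity.SEV2


print(parse_severity("Sev1"), is_sev1_or_sev2(parse_severity("Sev3")))
```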
  54. 54. Root Cause? (Not all incidents are change related) Something like: Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
  55. 55. Root Cause? (Not all incidents are change related) Something like: 1. Hardware Failure 2. Datacenter Issue 3. Change: Code Issue 4. Change: Config Issue 5. Capacity/Traffic Issue 6. Other Note: this can be difficult to categorize. http://en.wikipedia.org/wiki/Root_cause_analysis
  56. 56. Recording Your Response (worth the hassle) http://www.flickr.com/photos/mattblaze/2695044170/
  57. 57. Time
  58. 58. Timeline (x-axis: Time): la de da, everything’s fine
  59. 59. Timeline (x-axis: Time): la de da, everything’s fine; change happens
  60. 60. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem
  61. 61. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is
  62. 62. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  63. 63. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  64. 64. • Coordinate troubleshooting/diagnosis. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  65. 65. • Coordinate troubleshooting/diagnosis • Communicate to support/community/execs. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  66. 66. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  67. 67. • Coordinate responses* (* usually, “One Thing At A Time” responses). Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  68. 68. • Coordinate responses* • Communicate to support/community/execs (* usually, “One Thing At A Time” responses). Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  69. 69. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  70. 70. • Confirm stability, resolving steps. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  71. 71. • Confirm stability, resolving steps • Communicate to support/community/execs. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.)
  72. 72. Communications http://etsystatus.com twitter.com/etsystatus
  73. 73. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Noticed there was a problem; Figured out what the cause is; Fixed the problem (rolled back, rolled forward, temporary solution, etc.); PostMortem
  74. 74. Timeline (x-axis: Time): la de da, everything’s fine; change happens; Time To Detect (TTD); Time To Resolve (TTR); la de da, everything’s fine
  75. 75. Hypothetical Example: “We’re So Nimble!”
  76. 76. Nimble, But Stumbling?
  77. 77. Is There Any Pattern?
  78. 78. Nimble, But Stumbling? +
  79. 79. Nimble, But Stumbling? +
  80. 80. Maybe this is too much suck? Maybe you’re changing too much at once? Happening too often?
  81. 81. What percentage of incidents are related to change? http://www.flickr.com/photos/78364563@N00/2467989781/
  82. 82. What percentage of change-related incidents are “off-hours”? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  83. 83. What percentage of change-related incidents are “off-hours”? Do they have higher or lower TTR? http://www.flickr.com/photos/jeffreyanthonyrafolpiano/3266123838
  84. 84. What types of change have the worst success rates? http://www.flickr.com/photos/lwr/2257949828/
  85. 85. What types of change have the worst success rates? Which ones have the best success rates? http://www.flickr.com/photos/lwr/2257949828/
  86. 86. Does your TTD/TTR increase depending on the: - SIZE? - FREQUENCY? http://www.flickr.com/photos/45409431@N00/2521827947/
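With root cause, timing, and TTR recorded per incident, the questions above reduce to a few filters and averages. A minimal sketch over invented incidents: the field names (`root_cause`, `off_hours`, `ttr_min`) are assumptions, and "change-related" is taken to mean a root cause that starts with "Change:", matching the category names used earlier in the deck.

```python
from statistics import mean

# Hypothetical incident records; all values are invented for illustration.
incidents = [
    {"root_cause": "Change: Code Issue", "off_hours": False, "ttr_min": 10},
    {"root_cause": "Change: Config Issue", "off_hours": True, "ttr_min": 30},
    {"root_cause": "Hardware Failure", "off_hours": True, "ttr_min": 45},
    {"root_cause": "Change: Code Issue", "off_hours": True, "ttr_min": 20},
]


def is_change_related(incident):
    return incident["root_cause"].startswith("Change:")


def percent(part, whole):
    """Percentage of whole represented by part (0.0 if whole is 0)."""
    return 100 * part / whole if whole else 0.0


def mean_ttr(group):
    """Mean TTR in minutes, or None for an empty group."""
    return mean(i["ttr_min"] for i in group) if group else None


change_related = [i for i in incidents if is_change_related(i)]
off_hours = [i for i in change_related if i["off_hours"]]
business_hours = [i for i in change_related if not i["off_hours"]]

print("change-related %:", percent(len(change_related), len(incidents)))
print("of those, off-hours %:", round(percent(len(off_hours), len(change_related)), 2))
print("mean TTR off-hours vs. business hours:", mean_ttr(off_hours), mean_ttr(business_hours))
```

The same grouping works for change size or frequency buckets in place of the off-hours flag.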
  87. 87. A side effect is that you’re also tracking successful changes to production http://www.flickr.com/photos/wwworks/2313927146
  88. 88. Q2 2010
      Type           | Successes | Failures | Success Rate | Incident Minutes (Sev1/2)
      App code       | 420       | 5        | 98.81        | 8
      Config         | 404       | 3        | 99.26        | 5
      DB Schema      | 15        | 1        | 93.33        | 10
      DNS            | 45        | 0        | 100          | 0
      Network (misc) | 5         | 0        | 100          | 0
      Network (core) | 1         | 0        | 100          | 0
  89. 89. Q2 2010 (with a “!” callout on the slide)
      Type           | Successes | Failures | Success Rate | Incident Minutes (Sev1/2)
      App code       | 420       | 5        | 98.81        | 8
      Config         | 404       | 3        | 99.26        | 5
      DB Schema      | 15        | 1        | 93.33        | 10
      DNS            | 45        | 0        | 100          | 0
      Network (misc) | 5         | 0        | 100          | 0
      Network (core) | 1         | 0        | 100          | 0
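The Success Rate column can be recomputed from the two count columns. The figures shown match 100 × (1 − failures / successes); that formula is inferred from the numbers, not stated in the deck, and the deck itself warns that its numbers are sort of made up. The sketch below reproduces the slide's values and, for comparison, the stricter successes / (successes + failures). Both function names are hypothetical.

```python
# Rows from the Q2 2010 table: (change type, successes, failures).
Q2_2010 = [
    ("App code", 420, 5),
    ("Config", 404, 3),
    ("DB Schema", 15, 1),
    ("DNS", 45, 0),
    ("Network (misc)", 5, 0),
    ("Network (core)", 1, 0),
]


def slide_success_rate(successes, failures):
    """Rate as it appears to be computed on the slide (an inference)."""
    if successes == 0:
        return None
    return round(100 * (1 - failures / successes), 2)


def strict_success_rate(successes, failures):
    """Successful changes as a share of all attempted changes."""
    total = successes + failures
    return round(100 * successes / total, 2) if total else None


for change_type, ok, failed in Q2_2010:
    print(change_type, slide_success_rate(ok, failed), strict_success_rate(ok, failed))

# Which type of change has the worst success rate?
worst = min(Q2_2010, key=lambda row: slide_success_rate(row[1], row[2]))
print("worst:", worst[0])
```

Either formula ranks DB schema changes last in this table; the two only diverge noticeably when the number of changes is small.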
  90. 90. Some Observations
  91. 91. Incident Observations (graph: Morale vs. Length of Incident/Outage)
  92. 92. Incident Observations (graph: Mistakes vs. Length of Incident/Outage)
  93. 93. Change Observations (graph: Change Size vs. Change Frequency)
  94. 94. Change Observations (graph: Change Size vs. Change Frequency): Huge changesets deployed rarely
  95. 95. Change Observations (graph: Change Size vs. Change Frequency): Huge changesets deployed rarely (high TTR)
  96. 96. Change Observations (graph: Change Size vs. Change Frequency): Huge changesets deployed rarely (high TTR); Tiny changesets deployed often
  97. 97. Change Observations (graph: Change Size vs. Change Frequency): Huge changesets deployed rarely (high TTR); Tiny changesets deployed often (low TTR)
  98. 98. Specifically.... la de da, everything’s fine; change happens. What if this was only 5 lines of code that were changed? Does that feel safer? (it should)
  99. 99. Pay attention to this stuff http://www.flickr.com/photos/plasticbag/2461247090/
  100. 100. We’re Hiring Ops! SF & NYC. In May: $22.9M of goods were sold by the community; 1,895,943 new items listed; 239,340 members joined.
  101. 101. The End
  102. 102. Bonus Time!!1!
  103. 103. Continuous Deployment Described in 6 graphs (Originally Cal Henderson’s idea)
