Cloud Tech III: Actionable Metrics

Presentation for Cloud Tech III on How Netflix Thinks of Metrics

Published in: Technology


  1. Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment. Cloud Tech III, October 6, 2012. Roy Rapoport (@royrapoport, rsr@netflix.com). Thursday, October 18, 12
  2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1195 days (3y:3m:1w) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)
  3. Metrics Humor
  4. Metrics Humor
  5. Metrics Humor
  6. Metrics Humor
  7. Metrics Humor • % of instances with even public IP addresses
  8. Technology Overview
  9. Technology Overview • SoA, REST, Mostly Java
  10. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture:
  11. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture:
  12. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture:
  13. Culture Overview
  14. Culture Overview • Freedom and Responsibility
  15. Culture Overview • Freedom and Responsibility • Distributed Operations
  16. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of Developers
  17. The Metric Lifecycle
  18. The Metric Lifecycle • Send
  19. The Metric Lifecycle • Send • Look
  20. The Metric Lifecycle • Send • Look • Alert
  21. Systems • Flexible • Scalable • Self-Service
  22. Telemetry: Flexible, Scalable, Self-Service

          import netflix.metrics
          [...]
          self.nm = netflix.metrics.Metrics("core_cag")
          [...]
          def api(self):
              self.nm.nfCounter("api")
              [...]
              self.nm.nfCounter("application_%s" % application)
              [...]
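
The `netflix.metrics` module on slide 22 is internal to Netflix, and the deck does not show how it behaves. As a rough sketch only, assuming `nfCounter` increments a named counter under the namespace passed to `Metrics` (which would match the `core_cag_api` metric name that appears on later slides), a stand-in might look like the following. `StubMetrics`, `Gateway`, and the `amount` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter


class StubMetrics:
    """Hypothetical in-memory stand-in; NOT the real netflix.metrics API."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def nfCounter(self, name, amount=1):
        # Assumption: each call increments a counter named <namespace>_<name>.
        self.counts["%s_%s" % (self.namespace, name)] += amount


class Gateway:
    """Hypothetical service that emits counters the way slide 22 does."""

    def __init__(self):
        self.nm = StubMetrics("core_cag")

    def api(self, application):
        # One counter for the endpoint, one per calling application.
        self.nm.nfCounter("api")
        self.nm.nfCounter("application_%s" % application)
```

The point of the shape is self-service: adding a metric is a single line in application code, with no central registration step.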
  23. Visualization: Flexible, Scalable, Self-Service
  24. Visualization: Flexible, Scalable, Self-Service
  25. Visualization: Flexible, Scalable, Self-Service
  26. Visualization: Flexible, Scalable, Self-Service
  27. Visualization: Flexible, Scalable, Self-Service
  28. Visualization: Flexible, Scalable, Self-Service
  29. Alerting: Flexible, Scalable, Self-Service
  30. Alerting: Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds
  31. Alerting: Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Compare to history
  32. For Example ... • Last 3 hours’ core_tools.core_cag_api • What the ...
  33. For Example ... Visualization (Continued) • Last 4 days’ core_tools.core_cag_api • even more questions!
  34. For Example ... Visualization (Continued) • Last 10 days’ core_tools.core_cag_api • What caused the spike?
  35. For Example ... Visualization (Continued) • Show alert volume per application • Someone had a rough few days...
  36. Don’t Like Surprises...

          {
            "alerts": [
              {
                "applyTo": "cluster",
                "condition": {
                  "minPercent": 90.0,
                  "noise": 0.2,
                  "maxPercent": 25.0,
                  "type": "DoubleExponential"
                },
                "metricName": "core_cag_api",
                "severity": "major"
              }
            ],
            "clusters": [
              "core_tools"
            ]
          }
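
The alert configuration on slide 36 names a `DoubleExponential` condition, but the deck does not show how Netflix's alerting system evaluates it or what `minPercent`, `maxPercent`, and `noise` mean. As a minimal sketch of the general idea of comparing a metric to its own history, the code below uses textbook Holt double exponential smoothing to forecast each point from the ones before it and flags points that stray too far from the forecast. The function names and the `alpha`, `beta`, and `tolerance` parameters are assumptions made for illustration and do not map onto the config fields above.

```python
def holt_forecasts(values, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from Holt's double exponential smoothing.

    forecasts[i] is the prediction for values[i] made from values[:i].
    The first two entries are None because no trend estimate exists yet.
    """
    if len(values) < 2:
        return [None] * len(values)
    level = values[1]
    trend = values[1] - values[0]
    forecasts = [None, None]
    for x in values[2:]:
        forecasts.append(level + trend)
        prev_level = level
        # Smooth the level toward the new observation ...
        level = alpha * x + (1 - alpha) * (level + trend)
        # ... and smooth the trend toward the latest change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


def surprises(values, tolerance=0.5, **smoothing):
    """Indexes where a value differs from its forecast by more than
    `tolerance`, expressed as a fraction of the forecast."""
    flagged = []
    forecasts = holt_forecasts(values, **smoothing)
    for i, (x, f) in enumerate(zip(values, forecasts)):
        if f is None or f == 0:
            continue
        if abs(x - f) / abs(f) > tolerance:
            flagged.append(i)
    return flagged
```

On a steadily growing series nothing is flagged, while a sudden spike is. The point right after a spike may be flagged as well, because the spike drags the forecast upward; a production system would need some way to damp that.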
  37. Threshold Tuning • An Abbreviated History ...
  38. Threshold Tuning (in the beginning) • "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  39. Threshold Tuning (in the beginning) • Systems owned by IT
  40. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket
  41. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket • Want to tune an alert? Submit a ticket
  42. Threshold Tuning (It gets better)
  43. Threshold Tuning (It gets better) • You get to configure your own threshold
  44. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom!
  45. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom! • Also, you have to configure your own thresholds
  46. Threshold Tuning (Are we there yet?)
  47. Threshold Tuning (Are we there yet?) • Play with historical data
  48. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference
  49. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference • Still falls short
  50. Threshold Tuning (Yeah, that’s the ticket)
  51. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  52. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  53. Threshold Tuning (Yeah, that’s the ticket)
  54. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  55. Threshold Tuning (Yeah, that’s the ticket)
  56. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  57. If Time Allows ...
  58. Events vs Metrics
  59. Events vs Metrics • Irregular Interval
  60. Events vs Metrics • Irregular Interval • Point in time
  61. Events vs Metrics • Irregular Interval • Point in time • Lack magnitude
  62. Why Build It?
  63. Why Build It? • Change management • Vs Change control
  64. Why Build It? • Change management • Vs Change control • What Changed?
  65. Why Build It? • Change management • Vs Change control • What Changed? • Better Alerting
  66. Chronos
  67. Chronos • Rapidly Prototyped
  68. Chronos • Rapidly Prototyped • Adapters and reporters
  69. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying
  70. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened)
  71. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened ... X times in Y minutes)
  72. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened ... X times in Y minutes; Something didn’t happen)
  73. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming • Medium volume
  74. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming • Medium volume • Recursive • Recursive
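
The Chronos slides list three alarm shapes for events: something happened, it happened X times in Y minutes, and something didn't happen. The deck does not show how Chronos stores or queries events, so the following is only a sketch of those three checks over a sorted list of event timestamps in seconds. The function names and the timestamp representation are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def happened_too_often(timestamps, now, x, y_minutes):
    """True if at least `x` events fall in the last `y_minutes` before `now`.

    `timestamps` must be a sorted list of event times in seconds.
    With x=1 this is the plain "something happened" alarm.
    """
    window_start = now - y_minutes * 60
    lo = bisect_left(timestamps, window_start)
    hi = bisect_right(timestamps, now)
    return hi - lo >= x


def did_not_happen(timestamps, now, y_minutes):
    """True if no event was seen in the last `y_minutes` before `now`."""
    return not happened_too_often(timestamps, now, 1, y_minutes)
```

For example, with deployment events at seconds 0, 60, 120, and 3600, `happened_too_often(deploys, 130, 3, 5)` reports three deployments inside five minutes, and `did_not_happen(deploys, 3000, 10)` reports that nothing was deployed in the ten minutes before second 3000.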
  75. End Result
  76. End Result • Massive decrease in change control tickets
  77. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI
  78. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes
  79. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR
  80. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments
  81. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments • You should do this
  82. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Open Connect • Jobs
  83. Questions?
