Bring the Noise

This talk was given at Velocity '13 in Santa Clara by Abe Stanway and Jon Cowie. It covers how Etsy makes sense of the 250k metrics it gathers, using its new Kale stack.

Published in: Technology, News & Politics

Bring the Noise

  1. BRING THE NOISE! MAKING SENSE OF A HAILSTORM OF METRICS. Abe Stanway (@abestanway), Jon Cowie (@jonlives). Tuesday, 9 July 13
  2. Ninety minutes is a long time. This talk: motivations (~10 min), skyline (~25), oculus (~30), demo! (~10), questions (~15)
  3. Ninety minutes is a long time. This talk: motivations (~10 min), skyline (~25), oculus (~30), demo! (~10), questions (~15). But we have some sweet stuff to show you.
  4. Background and Motivations
  5. (no text)
  6. 1.5 billion page views, $117 million of goods sold, 950 thousand users
  7. 1.5 billion page views, $117 million of goods sold, 950 thousand users (in December ’12)
  8. We practice continuous deployment.
  9. de•ploy /diˈploi/ Verb: To release your code for the world to see, hopefully without breaking the Internet
  10. Everyone deploys. 250+ committers.
  11. Day one: DEPLOY
  12. (no text)
  13. 30+ DEPLOYS A DAY (~8 commits per deploy!)
  14. “30 deploys a day? Is that safe?”
  15. We optimize for quick recovery by anticipating problems...
  16. ...instead of fearing human error.
  17. Can’t fix what you don’t measure! - W. Edwards Deming
  18. Homemade: StatsD, Skyline, Oculus, Supergrep. Not homemade: Graphite, Nagios, Ganglia.
  19. Real time error logging
  20. “Not all things that break throw errors.” - Oscar Wilde
  21. StatsD
  22. StatsD::increment("foo.bar")
  23. If it moves, graph it!
  24. If it moves, graph it! we would graph them ➞
  25. If it doesn’t move, graph it anyway (it might make a run for it)
  26. DASHBOARDS!
  27. [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,20] [1358731200,60] [1358731200,20] [1358731200,20]
  28. DASHBOARDS! x 250,000
  29. (no text)
  30. lol nagios
  31. “...but there are also unknown unknowns - there are things we do not know we don’t know.”
  32. Unknown anomalies
  33. Unknown correlations
  34. Kale.
  35. Kale: leaves, green stuff
  36. Kale: leaves, green stuff. OCULUS, SKYLINE
  37. Q). How do you analyze a timeseries for anomalies in real time?
  38. A). Lots of HTTP requests to Graphite’s API!
  39. Q). How do you analyze a quarter million timeseries for anomalies in real time?
  40. SKYLINE
  41. SKYLINE
  42. A real time anomaly detection system
  43. Real time?
  44. Kinda.
  45. StatsD: ten second resolution
  46. Ganglia: one minute resolution
  47. Best case: ~10s (~1min)
  48. Takes about 90 seconds with our throughput.
  49. Still faster than you would have discovered it otherwise.
  50. Memory > Disk
  51. (no text)
  52. Q). How do you get a quarter million timeseries into Redis on time?
  53. STREAM IT!
  54. Graphite’s relay agent: original graphite, backup graphite
  55. Graphite’s relay agent: original graphite, backup graphite. Pickles: [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]], [statsd.numStats, [1365603412, 73421]]
  56. Graphite’s relay agent: original graphite, skyline. Pickles: [statsd.numStats, [1365603422, 82345]], [statsd.numStats, [1365603432, 80611]], [statsd.numStats, [1365603412, 73421]]
  57. We import from Ganglia too.
  58. Storing timeseries
  59. Minimize I/O. Minimize memory.
  60. redis.append(): strings, constant time, one operation per update
  61. JSON?
  62. => get statsD.numStats: “[1358711400, 51],”
  63. => get statsD.numStats: “[1358711400, 51], [1358711410, 23],”
  64. => get statsD.numStats: “[1358711400, 51], [1358711410, 23], [1358711420, 45],”
  65. OVER HALF of CPU time spent decoding JSON
  66. [1,2]
  67. [ 1 , 2 ]: stuff we care about, extra junk
  68. MESSAGEPACK
  69. MESSAGEPACK: a binary-based serialization protocol
  70. \x93\x01\x02: array size (16 or 32 bit big endian integer), things we care about
  71. \x93\x01\x02 \x93\x02\x03: array size (16 or 32 bit big endian integer), things we care about
  72. Run time + memory used: CUT IN HALF
  73. ROOMBA.PY CLEANS THE DATA
  74. “Wait...you wrote this in Python?”
  75. Great statistics libraries. Not fun for parallelism.
  76. The Analyzer: assign Redis keys to each process; process decodes and analyzes
  77. The Analyzer: anomalous metrics written as JSON; setInterval() retrieves from front end
  78. (no text)
  79. What does it mean to be anomalous?
  80. Consensus model
  81. Implement everything you can get your hands on
  82. Basic algorithm: “A metric is anomalous if its latest datapoint is over three standard deviations above its moving average.”
  83. Grubbs’ test, ordinary least squares
  84. Histogram binning
  85. Four horsemen of the modelpocalypse
  86. 1. Seasonality 2. Spike influence 3. Normality 4. Parameters
  87. Anomaly?
  88. Nope.
  89. Spikes artificially raise the moving average. Anomaly detected (yay!) Bigger moving average. Anomaly missed :(
  90. Real world data doesn’t necessarily follow a perfect normal distribution.
  91. Too many metrics to fit parameters for them all!
  92. A robust set of algorithms is the current focus of this project.
  93. Q). How do you analyze a quarter million timeseries for correlations?
  94. OCULUS
  95. Image comparison is expensive and slow
  96. Use raw timeseries instead of raw graphs: “[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570], [586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]”
  97. HARD PROBLEMS: Naming Things, Cache Invalidation, Numerical Comparison?
  98. HARD PROBLEMS: Naming Things, Cache Invalidation, Numerical Comparison?
  99. Euclidean Distance
  100. Dynamic Time Warping (helps with phase shifts)
  101. We’ve solved it!
  102. O(N²)
  103. O(N²) x 250k
  104. Too slow!
  105. doesn’t
  106. No need to run DTW on all 250k.
  107. Discard obviously dissimilar metrics.
  108. Shape Description Alphabet: “975 643 643 750 992 992 992 580” “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”
  109. Shape Description Alphabet: “975 643 643 750 992 992 992 580” “24 4 4 11 25 25 25 0 1” (normalization step) “sharpdecrement flat increment sharpincrement flat flat sharpdecrement”
  110. (no text)
  111. Search for shape description fingerprint in Elasticsearch
  112. Run DTW on results as final polish
  113. O(N²) on ~10k metrics
  114. Still too slow.
  115. Fast DTW - O(N): coarsen, project, refine
  116. Elasticsearch Details: phrase search for first pass scores across shape description fingerprints
  117. Elasticsearch Details: phrase search for first pass scores across shape description fingerprints; custom FastDTW and Euclidean distance plugins to score across the remaining filtered timeseries
  118. Elasticsearch Structure: { :id => "statsd.numStats", :fingerprint => "sdec inc sinc sdec", :values => "10 1 2 15 4" }
  119. Mappings: specify tokenizers, “untouched” fields
  120. First pass query (shape description fingerprint): :match => { :fingerprint => { :query => "sdec inc sinc sdec inc", :type => "phrase", :slop => 20 } }
  121. Refinement query (raw timeseries): { :custom_score => { :query => <first_pass_query>, :script => "oculus_dtw", :params => { :query_value => "10 20 20 10 30", :query_field => "values.untouched" } } }
  122. KALE: Skyline, Elasticsearch, Resque, Sinatra, Ganglia, Graphite, StatsD, Flask
  123. Populating Elasticsearch
  124. ES Index, resque workers
  125. Too slow to update and search
  126. New Index, Last Index, Webapp
  127. Sinatra frontend: queries ES, renders results
  128. Collections
  129. devops <3
  130. (no text)
  131. Special thanks to: Dr. Neil Gunther, PerfDynamics; Dr. Brian Whitman, Echonest; Burc Arpat, Facebook; Seth Walker, Etsy; Rafe Colburn, Etsy; Mike Rembetsy, Etsy; John Allspaw, Etsy
  132. Thanks! @abestanway @jonlives github.com/etsy/skyline github.com/etsy/oculus
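
Slides 54 to 56 describe Graphite's relay agent streaming pickled datapoints to a second listener. As a rough sketch of what such a stream can look like, the code below frames a batch the way Graphite's documented pickle protocol does: a pickled list of (metric, (timestamp, value)) tuples preceded by a 4-byte big-endian length. The function names `encode_batch` and `decode_stream` are hypothetical, and this is not taken from the Skyline source.

```python
import pickle
import struct


def encode_batch(datapoints):
    """Frame a list of (metric, (timestamp, value)) tuples for a pickle listener."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length prefix
    return header + payload


def decode_stream(data):
    """Yield (metric, timestamp, value) from concatenated length-prefixed batches.

    pickle.loads must only ever be used on data from a trusted sender.
    """
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from("!L", data, offset)
        offset += 4
        batch = pickle.loads(data[offset:offset + length])
        offset += length
        for metric, (timestamp, value) in batch:
            yield metric, timestamp, value


# The three datapoints shown on slide 55.
batch = [
    ("statsd.numStats", (1365603422, 82345)),
    ("statsd.numStats", (1365603432, 80611)),
    ("statsd.numStats", (1365603412, 73421)),
]
wire = encode_batch(batch)
points = list(decode_stream(wire))
```

Because every batch carries its own length, a listener can read several batches from one TCP stream without any delimiter between them.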
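
Slides 60 to 72 argue for appending binary-packed datapoints to a single Redis string instead of JSON text. The sketch below illustrates why that works: each packed pair is self-delimiting, so appended chunks can be decoded one after another. It hand-rolls a tiny subset of the MessagePack format (non-negative integers only) and uses a plain dict in place of Redis, so `pack_pair`, `unpack_stream`, and `append` are illustrative assumptions; a real deployment would more likely use a MessagePack library and the Redis APPEND command. Note that the format assigns the prefix byte 0x92 to a two-element array, while the slide text shows 0x93, which the format assigns to three elements.

```python
import struct


def pack_uint(n):
    """Encode a non-negative int as MessagePack positive fixint or uint 32."""
    if 0 <= n < 0x80:
        return bytes([n])                      # positive fixint: one byte
    if 0 <= n <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", n)  # uint 32, big-endian
    raise ValueError("value out of range for this sketch")


def pack_pair(timestamp, value):
    """Encode [timestamp, value] as a two-element fixarray (prefix 0x92)."""
    return b"\x92" + pack_uint(timestamp) + pack_uint(value)


def unpack_stream(buf):
    """Decode a byte string made of concatenated packed pairs."""
    offset = 0
    while offset < len(buf):
        if buf[offset] != 0x92:
            raise ValueError("expected a two-element array")
        offset += 1
        pair = []
        for _ in range(2):
            tag = buf[offset]
            if tag < 0x80:
                pair.append(tag)
                offset += 1
            elif tag == 0xCE:
                pair.append(struct.unpack_from(">I", buf, offset + 1)[0])
                offset += 5
            else:
                raise ValueError("unsupported type in this sketch")
        yield tuple(pair)


def append(store, key, chunk):
    """Stand-in for Redis APPEND: one constant-time write per update."""
    store[key] = store.get(key, b"") + chunk


store = {}
for ts, val in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    append(store, "statsD.numStats", pack_pair(ts, val))

decoded = list(unpack_stream(store["statsD.numStats"]))
```

Each datapoint costs seven bytes here (one array byte, five for the timestamp, one for the small value), with no brackets, commas, or whitespace to parse.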
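
Slides 80 to 89 describe a consensus model built from several simple detectors, and the way an old spike can hide a new anomaly. A minimal sketch of that idea follows, assuming a trailing window as the "moving average". `three_sigma` follows the rule quoted on slide 82; `median_deviation`, its threshold of 6, and the `is_anomalous` voting helper are assumptions added for illustration and are not claimed to match Skyline's algorithms.

```python
from statistics import mean, median, stdev


def three_sigma(series):
    """Latest point is over three standard deviations above the window mean."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False
    return latest > mean(history) + 3 * stdev(history)


def median_deviation(series, threshold=6.0):
    """Latest point is far from the median, measured in median absolute deviations."""
    history, latest = series[:-1], series[-1]
    if not history:
        return False
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return latest != med
    return abs(latest - med) / mad > threshold


def is_anomalous(series, detectors, consensus):
    """Flag the series when at least `consensus` detectors agree."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= consensus


detectors = [three_sigma, median_deviation]

steady = [20, 21, 19, 20, 22, 20, 19, 21, 20, 20]
jump = steady + [60]        # clear jump after a quiet window
quiet = steady + [21]       # ordinary datapoint
# Same jump, but an earlier spike of 200 inflates the mean and deviation.
after_spike = [20, 21, 19, 20, 200, 20, 19, 21, 20, 20, 60]
```

With these inputs, both detectors flag `jump`, neither flags `quiet`, and only the median-based detector flags `after_spike`, which is the spike influence problem from slide 89: the mean-based rule misses the second anomaly because the first one widened its threshold.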
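
Slides 99 to 104 contrast Euclidean distance with Dynamic Time Warping and note the O(N²) cost. The sketch below is the textbook dynamic-programming form of DTW with an absolute-difference step cost, not the FastDTW plugin the talk mentions; it is here only to show why DTW tolerates phase shifts and where the quadratic cost comes from.

```python
import math


def euclidean(a, b):
    """Point-by-point distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic DTW distance: fills an (N+1) x (M+1) table, hence O(N*M) time."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


# The same bump, shifted one step earlier in the second series.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
flat = [5] * 7
```

For these inputs `euclidean(a, b)` is 2.0 while `dtw(a, b)` is 0.0, because warping lets the two bumps line up. Running the nested loops against every one of 250k metrics is what slide 103 calls too slow.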
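
Slides 108 and 109 show a series being turned into a "shape description" of words such as flat, increment, and sharpdecrement, which can then be phrase-searched as text. A minimal sketch of one way to produce such a fingerprint is below. The 0 to 25 scaling, the rounding, and the `sharp=10` threshold are assumptions chosen for illustration, so the intermediate normalized numbers need not match the slide's; `normalize` and `fingerprint` are hypothetical names.

```python
def normalize(values, top=25):
    """Scale values to integers in 0..top so shapes compare across magnitudes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [round((v - lo) / (hi - lo) * top) for v in values]


def fingerprint(values, sharp=10):
    """Describe each step between consecutive normalized points with one word."""
    scaled = normalize(values)
    tokens = []
    for prev, cur in zip(scaled, scaled[1:]):
        delta = cur - prev
        if delta == 0:
            word = "flat"
        else:
            word = "increment" if delta > 0 else "decrement"
            if abs(delta) >= sharp:
                word = "sharp" + word
        tokens.append(word)
    return " ".join(tokens)


# The example series from slide 108.
series = [975, 643, 643, 750, 992, 992, 992, 580]
shape = fingerprint(series)
```

Two metrics with very different magnitudes but the same ups and downs produce the same word sequence, which is what makes a cheap text search usable as a first-pass filter before DTW.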
