Measure All the Things! - Austin Data Day 2014

Slides used during presentation that covered metrics gathering and analysis

Published in: Technology

  1. 1. Measure All The Things! Gary Dusbabek, Rackspace, @gdusbabek
  2. 2. Motivation; What You Really Want; Kinds of Metrics; How To Do It; Prognostication
  3. 3. Motivation
  4. 4. It’s all about the data
  5. 5. We are generating data at an insane rate.
  7. 7. 2006: IDC estimates 161 exabytes of data on the Internet. That is 161 million 1 TB drives.
  8. 8. 2009: 988 exabytes of data. 6x growth in 4 years. Almost 1 billion 1 TB drives. A zettabyte: 21 zeroes. Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
  9. 9. 2012: The Internet was estimated to be shipping roughly 2.5 exabytes of data daily. Daily! Not counting the NSA.
  10. 10. Transferring Data Generates Data
  11. 11. Metadata!
  12. 12. Secondary Information
  13. 13. A by-product
  14. 14. Example 1
  15. 15. Cloud Monitoring: Is the website up? GET HTTP/1.1
  16. 16. Status=200, Bytes=432, Time to connect=15ms, Time to first byte=21ms, Duration=28ms
  17. 17. Example 2
  18. 18. Netflix You want to watch an episode of Buffy
  19. 19. Observations: what titles you click on; what time of day you started watching; when you paused; parts you re-watched; when you finished (if you finished).
  20. 20. Useless to people consuming the primary data.
  21. 21. Priceless when you’re trying to understand behavior.
  22. 22. behavior
  23. 23. Understanding = Knowledge
  24. 24. In these cases all the data generated is time-series
  25. 25. Time Series Data: related events sorted by time of occurrence
  26. 26. Example: 0600 – Wake up; 0601 – Checked Hacker News; 0605 – Shower; 0630 – Breakfast; 0630 – Checked Hacker News; 0700 – Left for work; 0730 – Arrived at work; Etc…
  27. 27. Think about how you’d store something like this if you were building a backend system
  28. 28. Relational Database Much?
  29. 29. You
        When | What
        0600 | Wake up
        0601 | Checked Hacker News
        0605 | Shower
        0630 | Breakfast
        0630 | Checked Hacker News
        0700 | Left for work
        0730 | Arrive at work
        0731 | Checked Hacker News
  30. 30. Who | When | What
        You | 0600 | Wake up
        You | 0601 | Checked Hacker News
        You | 0605 | Shower
        You | 0630 | Breakfast
        You | 0630 | Checked Hacker News
        You | 0700 | Left for work
        You | 0730 | Arrive at work
        You | 0731 | Checked Hacker News
  31. 31. Who | When | What
        You | 0600 | Wake up
        You | 0601 | Checked Hacker News
        Friend | 0603 | Wake up
        Friend | 0604 | Checked Hacker News
        You | 0605 | Shower
        You | 0630 | Breakfast
        You | 0630 | Checked Hacker News
        You | 0700 | Left for work
        Friend | 0715 | Left for work
        You | 0730 | Arrive at work
        You | 0731 | Checked Hacker News
  32. 32. Other Ways?
  33. 33. Less Appealing
  34. 34. Column Oriented
        You: 0600 Wake up | 0601 Checked Hacker News | 0605 Shower | 0630 Breakfast | 0630 Checked Hacker News | 0700 Left for work | 0730 Arrive at work | 0731 Checked Hacker News
        Friend: 0603 Wake up | 0604 Checked Hacker News | 0715 Left for work
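
The difference between the relational and column-oriented layouts can be sketched in a few lines of Python. This is only an illustration under simple assumptions: events live in memory, times are zero-padded strings, and the names `to_wide_rows` and `time_slice` are made up for this example rather than taken from any particular database.

```python
from collections import defaultdict

# Relational-style rows: (who, when, what), as in the tables above.
rows = [
    ("You", "0600", "Wake up"),
    ("You", "0601", "Checked Hacker News"),
    ("Friend", "0603", "Wake up"),
    ("Friend", "0604", "Checked Hacker News"),
    ("You", "0605", "Shower"),
    ("Friend", "0715", "Left for work"),
]

def to_wide_rows(rows):
    """Group events by 'who'; each group keeps its events sorted by time."""
    wide = defaultdict(list)
    for who, when, what in rows:
        wide[who].append((when, what))
    for events in wide.values():
        # Zero-padded HHMM strings sort chronologically.
        events.sort(key=lambda event: event[0])
    return dict(wide)

def time_slice(wide, who, start, end):
    """Return one series' events with start <= when <= end."""
    return [(t, what) for t, what in wide.get(who, []) if start <= t <= end]

wide = to_wide_rows(rows)
print(time_slice(wide, "You", "0601", "0605"))
```

Reading one person's events over a time range touches a single wide row here, whereas the relational layout has to filter every row on both `who` and `when`.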
  35. 35. What You Really Want
  36. 36. You run a business
  37. 37. You want to make money
  38. 38. You want to make money Show me the money!
  39. 39. You need to make decisions
  40. 40. You need to make the right decisions
  41. 41. How do you do that?
  42. 42. With your gut
  43. 43. With data
  44. 44. Example
  45. 45. API responses are taking a long time.
  46. 46. It’s probably the database.
  47. 47. You add a few indexes. You allocate more memory. You get faster disks. You get bigger processors.
  48. 48. Maybe it’s the network…
  49. 49. You replace ethernet adapters. You get faster switches. You replace the cabling.
  50. 50. Crap!
  51. 51. Trace it!
  52. 52. 500 ms for entire request: 15 ms on the wire getting there; 200 ms to auth; 50 ms looking up account; 50 ms looking up other stuff; 15 ms on the wire getting back; 170 ms rendering in the browser.
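
Once each stage has a timing, ranking the stages by their share of the total shows where to spend effort. The sketch below uses the numbers from the trace above; the stage labels and the `breakdown` helper are illustrative names, not part of any tracing tool.

```python
# Stage timings in milliseconds, copied from the trace above.
stages = {
    "wire (request)": 15,
    "auth": 200,
    "account lookup": 50,
    "other lookups": 50,
    "wire (response)": 15,
    "browser rendering": 170,
}

def breakdown(stages):
    """Return (name, ms, share_of_total) tuples, slowest stage first."""
    total = sum(stages.values())
    ranked = sorted(stages.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]

report = breakdown(stages)
for name, ms, share in report:
    print(f"{name:18s} {ms:4d} ms  {share:5.1%}")
```

With these numbers, auth accounts for 40% of the request and the two lookups together for 20%, so the database and the network were the wrong places to start.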
  54. 54. Make the right decisions with data.
  55. 55. You need a metrics system
  56. 56. Take these things into account: availability, redundancy, accuracy
  57. 57. And your budget
  58. 58. Example: Pretty Graphs
  59. 59. If graphs go away, do you lose money?
  60. 60. The CEO likes them.
  61. 61. Do graphs help you make decisions?
  62. 62. Example: Usage Billing
  63. 63. Will losing data cost you money?
  64. 64. Data Lifecycle
  65. 65. When can I throw it away?
  66. 66. How much work is throwing it away?
  68. 68. More work means it probably won’t happen.
  69. 69. Kinds of Metrics
  70. 70. {Volume, Frequency} ⨯ {Low, High}
  71. 71. Low Volume, High Frequency (5,6,5,6): Things observed infrequently. Almost always changes. Low storage overhead. Bulk operations are easy. Usually uninteresting.
  72. 72. Low Volume, Low Frequency (5,5,5,6): Roughly the same as LVHF.
  73. 73. High Volume, Low Frequency (5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,7,7): Constantly observed, but doesn’t change much. Optimizations! Detect and record only level changes. Requires caching.
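
The "record only level changes" optimization for high volume, low frequency data can be sketched as follows. This is a minimal in-memory illustration; `ChangeRecorder` and its methods are hypothetical names, and the cache is simply a dict holding the last value stored per metric.

```python
class ChangeRecorder:
    """Store a sample only when the value differs from the last one stored."""

    def __init__(self):
        self._last = {}    # cache: metric name -> last recorded value
        self.stored = []   # (timestamp, name, value) tuples actually kept

    def observe(self, timestamp, name, value):
        """Return True if the sample was stored, False if it was skipped."""
        if name in self._last and self._last[name] == value:
            return False   # level unchanged: skip the write
        self._last[name] = value
        self.stored.append((timestamp, name, value))
        return True

    def value_at(self, name, timestamp):
        """Reconstruct the level at a given time from the stored changes."""
        current = None
        for t, n, v in self.stored:
            if n == name and t <= timestamp:
                current = v
        return current

recorder = ChangeRecorder()
samples = [5, 5, 5, 6, 6, 6, 6, 7, 7]   # shortened version of the sequence above
for t, v in enumerate(samples):
    recorder.observe(t, "level", v)

print(recorder.stored)
```

Nine observations collapse to three stored points, and any intermediate value can still be recovered by looking up the most recent change at or before the requested time.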
  74. 74. High Volume, High Frequency 34,4,7,345,6,4,2,54,67,5,6,55,74,5,3,2,5,6745…
  76. 76. Numeric vs String: Most will be numeric. Some are strings. Usually low frequency. Special handling.
  77. 77. Numeric vs String: High-frequency strings are a sign you’re doing something wrong or need a different system.
  78. 78. Gauges: Current value of something. Operation: snapshot. Examples: speedometer, thermometer, CPU utilization.
  79. 79. Counter: Exists as a set of operations (increment, decrement). Read by selecting over time and summing. Example: hits on a website. Different from unique hits.
  80. 80. Set (StatsD): Number of uniquely seen items. Think: conditional counter. Example: number of unique visitors.
  81. 81. Timer: How long something takes. Statistics (mean, median, min, max, percentiles). How many times it has happened. Rate at which it has happened. Uses a sliding window.
  82. 82. Histograms: Distribution of data. Example: when people visit your site.
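
The metric kinds above can be sketched as small Python classes. None of this is the API of StatsD or of the metrics libraries mentioned in this talk; the class names are invented for illustration, and the timer uses a count-based sliding window (the last N samples) as a simplifying assumption where a real implementation might use a time-based one.

```python
import statistics
from collections import deque

class Gauge:
    """Current value of something; reading it is a snapshot."""
    def __init__(self):
        self.value = None
    def set(self, value):
        self.value = value

class HitCounter:
    """Stored as increment/decrement operations; read by summing a time range."""
    def __init__(self):
        self.ops = []                       # (timestamp, delta)
    def increment(self, timestamp, n=1):
        self.ops.append((timestamp, n))
    def decrement(self, timestamp, n=1):
        self.ops.append((timestamp, -n))
    def total(self, start, end):
        return sum(delta for t, delta in self.ops if start <= t <= end)

class UniqueSet:
    """Conditional counter: only items not seen before raise the count."""
    def __init__(self):
        self._seen = set()
    def add(self, item):
        self._seen.add(item)
    def count(self):
        return len(self._seen)

class Timer:
    """Keeps the most recent `window` durations and summarizes them."""
    def __init__(self, window=1000):
        self.durations = deque(maxlen=window)   # count-based sliding window
    def record(self, ms):
        self.durations.append(ms)
    def summary(self):
        d = list(self.durations)
        return {"count": len(d), "min": min(d), "max": max(d),
                "mean": statistics.mean(d), "median": statistics.median(d)}

def histogram(values, bucket_size):
    """Count values per bucket; keys are bucket lower bounds."""
    counts = {}
    for v in values:
        bucket = (v // bucket_size) * bucket_size
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

hits = HitCounter()
for t in (1, 2, 10):
    hits.increment(t)

visitors = UniqueSet()
for user in ("alice", "bob", "alice"):
    visitors.add(user)

timer = Timer(window=3)
for ms in (10, 20, 30, 40):
    timer.record(ms)

visit_hours = histogram([1, 2, 9, 13, 14, 22], bucket_size=6)
```

Three hits from two distinct users give a counter total of 3 but a set count of 2, which is the "hits versus unique hits" distinction. Note that the timer's `count` here covers only the samples still inside the window.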
  83. 83. How Do You Do It?
  84. 84. If you make software: Instrument it! Java? https://github.com/codahale/metrics Node.js? https://github.com/mikejihbe/metrics Others? Of course.
  85. 85. If you run systems: Instrument them! Get data via agent. Get data via pollers. Considerations: inside or outside of your network.
  86. 86. StatsD (https://github.com/etsy/statsd): Ingests, aggregates, flushes. Use a client to send your data. Pushes aggregations to: Graphite, databases, flat files of JSON, wherever.
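
As a rough idea of what a StatsD client does, the sketch below formats metrics in StatsD's plain-text line format (`<name>:<value>|<type>`) and sends them over UDP. The metric names, the `statsd_line` and `send` helpers, and the assumption that a StatsD daemon listens on 127.0.0.1 at its default port 8125 are all illustrative; a maintained client library is the better choice in practice.

```python
import socket

# StatsD type codes: counter, gauge, timer (milliseconds), set.
TYPES = {"counter": "c", "gauge": "g", "timer": "ms", "set": "s"}

def statsd_line(name, value, kind):
    """Format one metric as <name>:<value>|<type>."""
    return f"{name}:{value}|{TYPES[kind]}"

def send(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget: join the lines with newlines and send one UDP packet."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Hypothetical metric names for an API service.
lines = [
    statsd_line("api.requests", 1, "counter"),
    statsd_line("api.latency", 28, "timer"),
    statsd_line("api.unique_visitors", "user42", "set"),
]
print(lines)
```

Calling `send(lines)` would hand the batch to the daemon, which aggregates it and flushes the results to whichever backend is configured. Because the transport is UDP, the application does not block or fail when the daemon is down.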
  87. 87. Graphite (http://graphite.wikidot.com): Makes graphs. Pluggable backends (NEW!!!11). Scaling problems.
  88. 88. Buy Enterprise Software These exist, but I’m an open source hacker and can’t say much about them.
  89. 89. Roll Your Own: Easier than you think. Harder than you think.
  90. 90. Roll Your Own: Three components: ingestion, aggregation/rollup, query/graphing.
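
The three components can be sketched in miniature. This is a toy, in-memory version under obvious assumptions (integer timestamps in seconds, everything fits in a dict); the function names `rollup` and `query` are invented for this example.

```python
from collections import defaultdict

def rollup(samples, window):
    """Aggregate (timestamp, value) samples into fixed-size time windows.

    Returns {window_start: {"count", "min", "max", "avg"}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:                 # ingestion
        buckets[timestamp - timestamp % window].append(value)
    return {                                          # aggregation/rollup
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }

def query(rolled, start, end):
    """Query: rolled-up points whose window starts in [start, end)."""
    return [(t, agg) for t, agg in rolled.items() if start <= t < end]

samples = [(0, 5), (10, 7), (59, 6), (60, 10), (119, 20), (120, 1)]
rolled = rollup(samples, window=60)
print(query(rolled, 60, 121))
```

The hard parts a real system adds on top of this are durability, late or out-of-order samples, and doing the rollup incrementally instead of over the whole data set.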
  91. 91. Avoid Pileups: 1 sample per second is 3,600 samples per hour, 86,400 samples per day, 31,536,000 samples per year. At roughly 1k of storage per sample, that is about 32 gigabytes.
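
The pileup arithmetic is easy to check, and it also shows how much a rollup helps. The sketch below takes "1k" to mean 1,000 bytes per sample; the 5-minute rollup interval is an assumption added for comparison and does not come from the slides.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365   # 31,536,000 samples at 1 per second
BYTES_PER_SAMPLE = 1000                 # the rough "1k" per sample

def yearly_bytes(samples_per_second=1.0, bytes_per_sample=BYTES_PER_SAMPLE):
    """Storage needed for one metric for one year."""
    return SECONDS_PER_YEAR * samples_per_second * bytes_per_sample

raw_gb = yearly_bytes() / 1e9              # one sample every second
rolled_gb = yearly_bytes(1 / 300) / 1e9    # one rolled-up point every 5 minutes (assumed)

print(f"raw: {raw_gb:.1f} GB/year, rolled up: {rolled_gb:.3f} GB/year")
```

That is per metric, so a few thousand raw per-second metrics reach tens of terabytes a year.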
  92. 92. No!
  93. 93. Measure all the right things!
  94. 94. Does this measurement matter? Probably not if you don’t care about it when it changes, you aren’t doing anything with it, or you can’t figure out what actions to take from it (it’s meaningless).
  95. 95. Recent data will almost always be most important.
  96. 96. Monitoring vs Aggregation: Graphite collects data that is already aggregated. You are observing history. Looking for patterns. No alerting.
  97. 97. Where Things Are Going
  98. 98. Complex Event Analysis: ESPER (my favorite). Mostly open source. Not enough projects though :(
  99. 99. Data Intelligence: You need this if you don’t know what questions you ought to ask. Correlating signals in order to draw useful conclusions.
  100. 100. Thanks! @gdusbabek
  101. 101. Photos from the Flickr CC collection: train data dump truck traffic byproduct watching numbers birds moons cake business guts data 2 choices flowers metrics gauge counter marbles timer windmills logs train tower
        http://www.flickr.com/photos/vxla/4673817364/sizes/z/
        http://www.flickr.com/photos/tensafefrogs/3649985674/sizes/z/
        http://www.flickr.com/photos/seanhobson/3906189027/sizes/l/
        http://www.flickr.com/photos/shankaronline/7291507876/sizes/l/
        http://www.flickr.com/photos/honou/3350764803/sizes/l/
        http://www.flickr.com/photos/jdickert/2152739544/sizes/l/
        http://www.flickr.com/photos/28misguidedsouls/6517859113/sizes/z/
        http://www.flickr.com/photos/55176801@N02/7911595842/sizes/o/
        http://www.flickr.com/photos/johnkay/3764457497/sizes/l/
        http://www.flickr.com/photos/andykirk/412600169/sizes/l/
        http://www.flickr.com/photos/jeff-anderson/4385042770/sizes/l/
        http://www.flickr.com/photos/sgis/6532363/sizes/o/
        http://www.flickr.com/photos/whatbettertime/405735418/sizes/l/
        http://www.flickr.com/photos/rachubarama/2709346242/sizes/l/
        http://www.flickr.com/photos/femto-photography/4604878864/sizes/o/
        http://www.flickr.com/photos/pixx0ne/5689978130/sizes/l/
        http://www.flickr.com/photos/ruth_w/8432567657/sizes/l/
        http://www.flickr.com/photos/wesley_lelieveld/8571911541/sizes/l/
        http://www.flickr.com/photos/lifeasart/242208550/sizes/l/
        http://www.flickr.com/photos/mrsenil/2219108948/sizes/l/
        http://www.flickr.com/photos/cristic/2773883011/sizes/l/
        http://www.flickr.com/photos/mattblaze/4491948497/sizes/l/
        http://www.flickr.com/photos/kentish/43788618/sizes/o/
        http://www.flickr.com/photos/dtanist/10809534755/sizes/l/
        http://www.flickr.com/photos/jarodcarruthers/10372829184/sizes/l/
