Measure All the Things! - Austin Data Day 2014

Slides used during presentation that covered metrics gathering and analysis

Transcript

  • 1. Measure All The Things! Gary Dusbabek, Rackspace, @gdusbabek
  • 2. Motivation; What You Really Want; Kinds of Metrics; How To Do It; Prognostication
  • 3. Motivation
  • 4. It’s all about the data
  • 5. We are generating data at an insane rate.
  • 6. We are generating data at an insane rate.
  • 7. 2006: IDC estimates 161 exabytes of data on the Internet. That is 161 million 1 TB drives.
  • 8. 2009: 988 exabytes of data. 6x growth in 4 years. Almost 1 billion 1 TB drives. A zettabyte: 21 zeroes. Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
  • 9. 2012: the Internet was estimated to be shipping roughly 2.5 exabytes of data daily. Daily. Not counting the NSA.
  • 10. Transferring Data Generates Data
  • 11. Metadata!
  • 12. Secondary Information
  • 13. A by-product
  • 14. Example 1
  • 15. Cloud Monitoring Is the website up? GET HTTP/1.1
  • 16. Status=200 Bytes=432 Time to connect=15ms Time to first byte=21ms Duration=28ms
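A minimal sketch of how a check like the one on slides 15 and 16 could produce those five measurements. The function names `check` and `summarize` are hypothetical, and the sketch assumes a plain (non-TLS) HTTP endpoint; it is not how any particular monitoring product works.

```python
import socket
import time


def summarize(raw, started, connected, first_byte, finished):
    """Turn one raw HTTP response plus four timestamps into check metrics."""
    status_line = raw.split(b"\r\n", 1)[0]      # e.g. b"HTTP/1.1 200 OK"
    status = int(status_line.split(b" ")[1])

    def to_ms(seconds):
        return round(seconds * 1000, 1)

    return {
        "status": status,
        "bytes": len(raw),                      # headers and body together
        "connect_ms": to_ms(connected - started),
        "first_byte_ms": to_ms(first_byte - started),
        "duration_ms": to_ms(finished - started),
    }


def check(host, port=80, path="/", timeout=5.0):
    """Fetch one page over plain HTTP and time each phase of the exchange."""
    started = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        chunks = [sock.recv(4096)]              # blocks until the first bytes arrive
        first_byte = time.monotonic()
        while chunks[-1]:                       # empty read means the server closed
            chunks.append(sock.recv(4096))
    finished = time.monotonic()
    return summarize(b"".join(chunks), started, connected, first_byte, finished)
```

None of these numbers is the page itself; each one is a by-product of fetching it, which is the point the slides are making.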
  • 17. Example 2
  • 18. Netflix You want to watch an episode of Buffy
  • 19. Observations: what titles you click on; what time of day you started watching; when you paused; parts you re-watched; when you finished (if you finished).
  • 20. Useless to people consuming the primary data.
  • 21. Priceless when you’re trying to understand behavior.
  • 22. behavior
  • 23. Understanding = Knowledge
  • 24. In these cases all the data generated is time-series
  • 25. Time Series Data Related events sorted by time of occurrence
  • 26. Example: 0600 – Wake up; 0601 – Checked Hacker News; 0605 – Shower; 0630 – Breakfast; 0630 – Checked Hacker News; 0700 – Left for work; 0730 – Arrived at work; etc.
  • 27. Think about how you’d store something like this if you were building a backend system
  • 28. Relational Database Much?
  • 29. You
        When  What
        0600  Wake up
        0601  Checked Hacker News
        0605  Shower
        0630  Breakfast
        0630  Checked Hacker News
        0700  Left for work
        0730  Arrive at work
        0731  Checked Hacker News
  • 30.
        Who  When  What
        You  0600  Wake up
        You  0601  Checked Hacker News
        You  0605  Shower
        You  0630  Breakfast
        You  0630  Checked Hacker News
        You  0700  Left for work
        You  0730  Arrive at work
        You  0731  Checked Hacker News
  • 31.
        Who     When  What
        You     0600  Wake up
        You     0601  Checked Hacker News
        Friend  0603  Wake up
        Friend  0604  Checked Hacker News
        You     0605  Shower
        You     0630  Breakfast
        You     0630  Checked Hacker News
        You     0700  Left for work
        Friend  0715  Left for work
        You     0730  Arrive at work
        You     0731  Checked Hacker News
  • 32. Other Ways?
  • 33. Less Appealing
  • 34. Column Oriented
        You:    0600 Wake up | 0601 Checked Hacker News | 0605 Shower | 0630 Breakfast | 0630 Checked Hacker News | 0700 Left for work | 0730 Arrive at work | 0731 Checked Hacker News
        Friend: 0603 Wake up | 0604 Checked Hacker News | 0715 Left for work
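The column-oriented layout on slide 34 can be sketched in memory as one row per person, with that person's events kept in time order. `EventStore` is a hypothetical name, and times are assumed to be integers in HHMM form; a real column store would add persistence and distribution on top of this idea.

```python
import bisect
from collections import defaultdict


class EventStore:
    """One row per person; each row is a time-ordered list of (when, what)."""

    def __init__(self):
        self._rows = defaultdict(list)

    def record(self, who, when, what):
        # insort keeps the row sorted, so reads never need to re-sort.
        bisect.insort(self._rows[who], (when, what))

    def between(self, who, start, end):
        """Events for `who` with start <= when <= end (HHMM integers)."""
        row = self._rows.get(who, [])
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_left(row, (end + 1,))
        return row[lo:hi]


store = EventStore()
store.record("You", 600, "Wake up")
store.record("Friend", 603, "Wake up")
store.record("You", 601, "Checked Hacker News")
store.record("Friend", 715, "Left for work")
store.record("You", 700, "Left for work")

print(store.between("You", 600, 630))
```

A time-range read for one person touches a single contiguous slice of a single row, which is the access pattern time-series queries tend to need.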
  • 35. What You Really Want
  • 36. You run a business
  • 37. You want to make money
  • 38. You want to make money Show me the money!
  • 39. You need to make decisions
  • 40. You need to make the right decisions
  • 41. How do you do that?
  • 42. With your gut
  • 43. With data
  • 44. Example
  • 45. API responses are taking a long time.
  • 46. It’s probably the database.
  • 47. You add a few indexes. You allocate more memory. You get faster disks. You get bigger processors.
  • 48. Maybe it’s the network…
  • 49. You replace ethernet adapters. You get faster switches. You replace the cabling.
  • 50. Crap!
  • 51. Trace it!
  • 52. 500 ms for entire request: 15 ms on the wire getting there; 200 ms to auth; 50 ms looking up account; 50 ms looking up other stuff; 15 ms on the wire getting back; 170 ms rendering in the browser.
  • 53. 500 ms for entire request: 15 ms on the wire getting there; 200 ms to auth; 50 ms looking up account; 50 ms looking up other stuff; 15 ms on the wire getting back; 170 ms rendering in the browser.
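The trace on slides 52 and 53 can be checked with a few lines of arithmetic. The span labels below are shortened versions of the slide text; the numbers are the slide's own.

```python
# Per-phase timings from the trace, in milliseconds.
spans_ms = {
    "wire (request)": 15,
    "auth": 200,
    "account lookup": 50,
    "other lookups": 50,
    "wire (response)": 15,
    "browser render": 170,
}


def share(spans):
    """Percentage of the total request time spent in each span."""
    total = sum(spans.values())
    return {name: round(100 * ms / total, 1) for name, ms in spans.items()}


total_ms = sum(spans_ms.values())
slowest = max(spans_ms, key=spans_ms.get)

print(f"total: {total_ms} ms, slowest span: {slowest}")
for name, pct in sorted(share(spans_ms).items(), key=lambda item: -item[1]):
    print(f"{pct:5.1f}%  {name}")
```

The two database lookups together account for 100 ms of the 500 ms, and the network for 30 ms, so the guesses on slides 46 to 49 were aimed at the smaller parts of the request.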
  • 54. Make the right decisions with data.
  • 55. You need a metrics system
  • 56. Take these things into account: availability, redundancy, accuracy
  • 57. And your budget
  • 58. Example: Pretty Graphs
  • 59. If graphs go away, do you lose money?
  • 60. The CEO likes them.
  • 61. Do graphs help you make decisions?
  • 62. Example: Usage Billing
  • 63. Will losing data cost you money?
  • 64. Data Lifecycle
  • 65. When can I throw it away?
  • 66. How much work is throwing it away?
  • 67. How much work is throwing it away?
  • 68. More work means it probably won’t happen.
  • 69. Kinds of Metrics
  • 70. {Volume, Frequency} ⨯ {Low, High}
  • 71. Low Volume, High Frequency (5,6,5,6): things observed infrequently; almost always changes; low storage overhead; bulk operations are easy; usually uninteresting.
  • 72. Low Volume, Low Frequency (5,5,5,6): roughly the same as LVHF.
  • 73. High Volume, Low Frequency (5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,7,7): constantly observed, but doesn’t change much. Optimizations! Detect and record only level changes. Requires caching.
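The "record only level changes" optimization from slide 73 might look like the sketch below. The function name `level_changes` is hypothetical; the single remembered value `last` plays the role of the cache the slide says is required.

```python
def level_changes(samples):
    """Yield (index, value) only when a sample differs from the last one kept."""
    last = object()  # sentinel that never equals a real sample
    for index, value in enumerate(samples):
        if value != last:
            yield index, value
            last = value  # the "cache": one remembered value per metric


# A series shaped like the slide's: long runs of the same reading.
samples = [5] * 9 + [6] * 12 + [7] * 2
changes = list(level_changes(samples))

print(f"{len(samples)} samples observed, {len(changes)} stored: {changes}")
```

The index (or a timestamp in its place) has to be stored with each change, otherwise the original series cannot be reconstructed.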
  • 74. High Volume, High Frequency 34,4,7,345,6,4,2,54,67,5,6,55,74,5,3,2,5,6745…
  • 75. High Volume, High Frequency 34,4,7,345,6,4,2,54,67,5,6,55,74,5,3,2,5,6745…
  • 76. Numeric vs String: most will be numeric. Some are strings: usually low frequency; special handling.
  • 77. Numeric vs String: high frequency strings are a sign you’re doing something wrong or need a different system.
  • 78. Gauges: current value of something. Operation: snapshot. Examples: speedometer, thermometer, CPU utilization.
  • 79. Counter: exists as a set of operations (increment, decrement). Read by selecting over time and summing. Example: hits on a website. Different from unique hits.
  • 80. Set (StatsD): number of uniquely seen items. Think: conditional counter. Example: number of unique visitors.
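The first three metric kinds (slides 78 to 80) can be sketched as small classes. All class and method names here are hypothetical and do not correspond to any particular metrics library; the sketch only illustrates the difference between the three.

```python
class Gauge:
    """Current value of something; each snapshot replaces the previous one."""

    def __init__(self):
        self.value = None

    def snapshot(self, value):
        self.value = value


class Counter:
    """Stored as a stream of operations; read by summing over a time range."""

    def __init__(self):
        self._ops = []  # (when, delta) pairs

    def increment(self, when, by=1):
        self._ops.append((when, by))

    def decrement(self, when, by=1):
        self._ops.append((when, -by))

    def total(self, start, end):
        return sum(delta for when, delta in self._ops if start <= when <= end)


class UniqueSet:
    """Counts distinct items: a counter that only moves for unseen values."""

    def __init__(self):
        self._seen = set()

    def add(self, item):
        self._seen.add(item)

    def count(self):
        return len(self._seen)


hits, visitors = Counter(), UniqueSet()
for second, user in [(1, "alice"), (2, "bob"), (3, "alice"), (70, "alice")]:
    hits.increment(when=second)
    visitors.add(user)

print(hits.total(0, 59), visitors.count())
```

Four page views came from two people, which is why hits and unique hits need different metric types.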
  • 81. Timer: how long something takes. Statistics (mean, median, min, max, percentiles); how many times it has happened; rate at which it has happened. Uses a sliding window.
  • 82. Histograms: distribution of data. Example: when people visit your site.
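A rough sketch of a timer and a histogram as described on slides 81 and 82. The names `Timer`, `percentile`, and `histogram` are hypothetical. The sliding window is assumed to be "the most recent N samples" (time-based windows are equally common), percentiles use the simple nearest-rank method, and rate tracking is left out to keep the sketch short.

```python
import math
import statistics
from collections import Counter, deque


def percentile(sorted_data, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_data) / 100))
    return sorted_data[rank - 1]


class Timer:
    def __init__(self, window=1000):
        self._durations = deque(maxlen=window)  # old samples fall off the left
        self.count = 0                          # every call ever recorded

    def record(self, ms):
        self._durations.append(ms)
        self.count += 1

    def stats(self):
        data = sorted(self._durations)
        return {
            "min": data[0],
            "max": data[-1],
            "mean": statistics.mean(data),
            "median": statistics.median(data),
            "p95": percentile(data, 95),
        }


def histogram(values, bucket_size):
    """Count how many values fall in each bucket, keyed by bucket start."""
    return dict(Counter((v // bucket_size) * bucket_size for v in values))


timer = Timer(window=3)
for ms in (10, 20, 30, 40):
    timer.record(ms)
print(timer.count, timer.stats())

# Hour of day for each visit, grouped into six-hour buckets.
print(histogram([1, 9, 9, 13, 21, 21, 21], bucket_size=6))
```

With a window of three, the statistics describe only the last three durations, while `count` still reports all four calls.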
  • 83. How Do You Do It?
  • 84. If you make software: instrument it! Java? https://github.com/codahale/metrics Node.js? https://github.com/mikejihbe/metrics Others? Of course.
  • 85. If you run systems: instrument them! Get data via agent. Get data via pollers. Considerations: inside or outside of your network.
  • 86. StatsD (https://github.com/etsy/statsd): ingests, aggregates, flushes. Use a client to send your data. Pushes aggregations to Graphite, databases, flat files of JSON, wherever.
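For a sense of what "use a client to send your data" means on slide 86, here is a minimal sketch of a StatsD client using only the standard library. StatsD accepts plain-text lines of the form `name:value|type` over UDP, by default on port 8125. The metric names and the helper names `format_metric` and `send` are made up for illustration; a real project would more likely use an existing client library.

```python
import socket


def format_metric(name, value, kind):
    """Build one StatsD line. kind: c (counter), g (gauge), ms (timer), s (set)."""
    return f"{name}:{value}|{kind}"


def send(lines, host="127.0.0.1", port=8125):
    """Send one or more metric lines in a single UDP packet (fire and forget)."""
    payload = "\n".join(lines).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    send([
        format_metric("api.requests", 1, "c"),        # one more hit
        format_metric("api.auth_time", 200, "ms"),    # a timing sample
        format_metric("api.visitors", "alice", "s"),  # unique visitor
        format_metric("host.cpu", 42, "g"),           # current value
    ])
```

Because the transport is UDP, the sender neither waits for nor learns about delivery, which keeps instrumentation cheap at the cost of occasionally losing samples.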
  • 87. Graphite (http://graphite.wikidot.com): makes graphs; pluggable backends (NEW!!!11); scaling problems.
  • 88. Buy Enterprise Software These exist, but I’m an open source hacker and can’t say much about them.
  • 89. Roll Your Own Easier than you think Harder than you think
  • 90. Roll Your Own: three components: ingestion, aggregation/rollup, query/graphing.
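Of the three components on slide 90, the aggregation/rollup step is the easiest to sketch. The function below is a hypothetical illustration, assuming samples arrive as (unix_seconds, value) pairs; ingestion and querying are left out.

```python
from collections import defaultdict


def rollup(samples, bucket_seconds=60):
    """Collapse (unix_seconds, value) samples into per-bucket summaries."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


raw = [(0, 1), (30, 3), (60, 10), (119, 20)]
for start, summary in rollup(raw).items():
    print(start, summary)
```

Keeping min, max, and count alongside the mean lets coarser rollups be computed later from these summaries without going back to the raw samples.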
  • 91. Avoid Pileups: 1 sample per second = 3,600 samples per hour = 86,400 samples per day = 31,536,000 samples per year. 1k of storage per sample? Roughly 32 gigabytes.
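The arithmetic on slide 91 can be reproduced as follows. The 1,000 bytes per sample is the slide's rough "1k" figure, taken here as an assumption; the per-minute comparison is added only to show how quickly the total shrinks when samples are kept less often.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
BYTES_PER_SAMPLE = 1000  # assumption: the slide's rough "1k" per stored sample


def samples_per_year(interval_seconds=1):
    """How many samples one metric produces in a 365-day year."""
    return SECONDS_PER_YEAR // interval_seconds


def yearly_gigabytes(interval_seconds=1, bytes_per_sample=BYTES_PER_SAMPLE):
    """Storage for one metric over a year, in decimal gigabytes."""
    return samples_per_year(interval_seconds) * bytes_per_sample / 1e9


print(samples_per_year(1), round(yearly_gigabytes(1), 1))    # one sample per second
print(samples_per_year(60), round(yearly_gigabytes(60), 2))  # one sample per minute
```

That total is for a single metric, so the figure multiplies by however many metrics are being collected.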
  • 92. No!
  • 93. Measure all the right things!
  • 94. Does this measurement matter? You don’t care about it when it changes. You aren’t doing anything with it. You can’t figure out what actions to take from it (it’s meaningless).
  • 95. Recent data will almost always be most important.
  • 96. Monitoring vs Aggregation: Graphite collects data that is already aggregated. You are observing history, looking for patterns. No alerting.
  • 97. Where Things Are Going
  • 98. Complex Event Analysis: Esper (my favorite). Mostly open source. Not enough projects though :(
  • 99. Data Intelligence: you need this if you don’t know what questions you ought to ask. Correlating signals in order to make useful conclusions.
  • 100. Thanks! @gdusbabek
  • 101. Photos from the Flickr CC collection: train data dump truck traffic byproduct watching numbers birds moons cake business guts data 2 choices flowers metrics gauge counter marbles timer windmills logs train tower
        http://www.flickr.com/photos/vxla/4673817364/sizes/z/
        http://www.flickr.com/photos/tensafefrogs/3649985674/sizes/z/
        http://www.flickr.com/photos/seanhobson/3906189027/sizes/l/
        http://www.flickr.com/photos/shankaronline/7291507876/sizes/l/
        http://www.flickr.com/photos/honou/3350764803/sizes/l/
        http://www.flickr.com/photos/jdickert/2152739544/sizes/l/
        http://www.flickr.com/photos/28misguidedsouls/6517859113/sizes/z/
        http://www.flickr.com/photos/55176801@N02/7911595842/sizes/o/
        http://www.flickr.com/photos/johnkay/3764457497/sizes/l/
        http://www.flickr.com/photos/andykirk/412600169/sizes/l/
        http://www.flickr.com/photos/jeff-anderson/4385042770/sizes/l/
        http://www.flickr.com/photos/sgis/6532363/sizes/o/
        http://www.flickr.com/photos/whatbettertime/405735418/sizes/l/
        http://www.flickr.com/photos/rachubarama/2709346242/sizes/l/
        http://www.flickr.com/photos/femto-photography/4604878864/sizes/o/
        http://www.flickr.com/photos/pixx0ne/5689978130/sizes/l/
        http://www.flickr.com/photos/ruth_w/8432567657/sizes/l/
        http://www.flickr.com/photos/wesley_lelieveld/8571911541/sizes/l/
        http://www.flickr.com/photos/lifeasart/242208550/sizes/l/
        http://www.flickr.com/photos/mrsenil/2219108948/sizes/l/
        http://www.flickr.com/photos/cristic/2773883011/sizes/l/
        http://www.flickr.com/photos/mattblaze/4491948497/sizes/l/
        http://www.flickr.com/photos/kentish/43788618/sizes/o/
        http://www.flickr.com/photos/dtanist/10809534755/sizes/l/
        http://www.flickr.com/photos/jarodcarruthers/10372829184/sizes/l/