It's all about telemetry

The ins and outs of monitoring your technology enabled business.

It's all about telemetry Presentation Transcript

  • 1. It’s all about telemetry: Monitoring what matters in a useful way. Tuesday, June 26, 12
  • 2. Theo Schlossnagle (@postwait). I write software. I write books. I give talks. I participate in the industry. I speak frankly about industry issues.
  • 3. Data, data, everywhere. A billion pageviews / month. 100k database queries / second. 1MM memcache queries / second. 500k MQ messages / second. 10MM I/O operations / second.
  • 4. Big Data: Most new big data problems are solvable.
  • 5. Big Data: Most new big data problems are created by our solutions, and thus solvable despite their ROI.
  • 6. That’s a whole lot of data. Think in terms of logs (too many do). About 26 trillion log lines / month @ 40 bytes compressed: 1PB / month. Just because it is possible does not mean it will return on investment (and does not mean it won’t).
  • 7. It’s all “useful”; which data? Think in terms of cost/benefit. Sure, the data is useful, but it costs money to store. Does it cost you more to have it or not to have it? Maybe the right approach is to keep that level of detail for a few days?
  • 8. Double-edged sword. Eroding granularity over time keeps storage under control.
  • 9. Double-edged sword. Eroding granularity over time keeps storage under control. [Overlay: MISTAKE]
  • 10. 1 year: at a glance.
  • 11. 1 week: looks normalish.
  • 12. 1 day: confidence of normalcy increases.
  • 13. 1 week: that looks different.
  • 14. 1 day: yup, that’s not at all like that other week.
  • 15. Other methods. What do you store? How do you store it? Why is it useful? Winning the cost-benefit game by reducing costs more significantly than reducing benefits.
  • 16. [Chart: Benefit and Cost vs. monitoring activity] Positive Value. Be in the green.
  • 17. [Chart: Cost and Benefit vs. monitoring activity, over a wider range] There’s a bigger picture. It’s not as easy as you think.
  • 18. [Chart: Benefit and Cost vs. monitoring activity] Value is difference, not area. Green can be misleading.
  • 19. [Chart: value vs. monitoring activity] Value = Benefit - Cost. Green means we have positive return.
  • 20. [Chart: value vs. monitoring activity] It’s not about return. Well, it’s not only about return.
  • 21. [Chart: value vs. monitoring activity] It’s about maximizing return. This is a bit like black magic.
  • 22. Technique 1: text. Store changes.
  • 23. Technique 2: numeric. Store rollups (i.e. statistical aggregates over fixed windows). Over 1 minute, store min/max/avg/stddev/covariance/50%/95%/99%. Lots of information. Heavy lossy compression of high-frequency data. Loses population distribution information.
  • 24. Database replication. Lag (green) and rate of lag change (purple).
  • 25. Storage Usage. We can see growth. More useful, we can use this to project.
  • 26. Storage Usage. We can see growth. More useful, we can use this to project.
  • 27. With simple numeric data.
  • 28. With simple numeric data. Unknowns can be predicted.
  • 29. With simple numeric data. In sane ways with confidence.
  • 30. Full Disclosure. You see awesome examples of predictive analytics, like the real-world one on the previous slide. In practice, almost all data streams predict one thing: they have no fucking clue.
  • 31. Technique 3: histograms. Store histograms. Over 1 minute, store counts of datapoints seen in various buckets. Retains complete population distribution. Loss of precision.
  • 32. Histograms 101. This. This is a histogram. It shows the frequency of values within a population. Height represents frequency.
  • 33. Histograms 101. This. This is a histogram. It shows the frequency of values within a population. Now, height and color represent frequency.
  • 34. Histograms 101. This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency.
  • 35. Histograms 101. This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency.
  • 36. Histograms ➠ time series. This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency.
  • 37. Histograms ➠ time series. This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency.
  • 38. Histograms ➠ time series. This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency at a single time interval.
  • 39. API Service Times. We can see a full population shift of several milliseconds.
  • 40. Combining techniques. In our system (as a reference point), arbitrary numbers of numeric data points on a single stream occupy 32 bytes of space for statistical aggregates and about 2k of space for a histogram. This means we can store these transforms on numeric data in perpetuity.
  • 41. Combining techniques. Text is a bit harder. You need to be careful. Some data sources can be constantly changing, producing gobs of change data. You’re doing it wrong. Find these and fix them.
  • 42. Correlating Events. Change Management vs. Performance.
  • 43. Correlating Events. Change Management vs. Performance.
  • 44. What to monitor? Most people don’t monitor the things that matter most.
  • 45. Monitor the Business. Financials: Revenues. Costs. Margins. AR. Account delinquency. Marketing: Web analytics. Campaigns. Costs. Returns. Convergence.
  • 46. Monitor the Support. Customer Service: Problems. Time investment. Customer satisfaction. Resolution time.
  • 47. Monitor the Engineering. Engineering: Deployments. Test coverage. Bug reports. Bug fixes. Effort spent. Operations: Faults. Pages. Escalations. Provisioning time. Equipment defect rates. 3rd party failure rates.
  • 48. Monitor the Service. Systems: Networks. Systems. Storage. Databases: Performance. Error rates. Backups. Middleware: Herein lies the magic and room for awesomeness.
  • 49. Monitor the Middleware. Your systems are complex. Monitor their interactions. Messaging, APIs, etc.
  • 50. Monitor all the things. But, perhaps most importantly...
  • 51. Monitor all the things. But, perhaps most importantly... USE UNIFIED TOOLING.
  • 52. What we use... reconnoiter: SNMP, nad, resmon, statsd, HTTP traps, jdbc, etc. statsd (clients). javascript beacons.
  • 53. Middleware mix. API service times, traffic, user signup rates.
  • 54.
  • 55. Thank you!
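
The storage estimate on slide 6 can be sanity-checked with a few lines of arithmetic. This is only a check of the slide's own figures (26 trillion log lines per month at 40 bytes each). The decimal definition of a petabyte (10^15 bytes) is an assumption, since the slide does not say which unit it uses.

```python
# Sanity check of slide 6: 26 trillion log lines / month at 40 bytes each.
LINES_PER_MONTH = 26 * 10**12   # figure taken from the slide
BYTES_PER_LINE = 40             # compressed size per line, from the slide
PETABYTE = 10**15               # assumption: decimal petabyte

bytes_per_month = LINES_PER_MONTH * BYTES_PER_LINE
petabytes_per_month = bytes_per_month / PETABYTE

print(f"{petabytes_per_month:.2f} PB / month")
```

Under that assumption the product comes out at roughly one petabyte per month, in line with the slide's rounded figure.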
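
Slide 22's "store changes" technique for text data might look something like the sketch below. It is a minimal illustration, not the speaker's implementation: the function name `changes_only` and the sample version strings are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

def changes_only(samples: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, str]]:
    """Yield (timestamp, value) only when the value differs from the previous sample."""
    first = True
    previous = None
    for timestamp, value in samples:
        if first or value != previous:
            yield timestamp, value
        previous = value
        first = False

# Hypothetical text metric: a deployed version string polled once a minute.
samples = [
    (0, "v1.2.0"),
    (60, "v1.2.0"),
    (120, "v1.2.1"),
    (180, "v1.2.1"),
    (240, "v1.2.0"),
]
stored = list(changes_only(samples))
```

Five polled samples reduce to three stored records. As slide 41 warns, a text source that changes on every poll gains nothing from this approach and is worth finding and fixing.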
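
The rollups described on slide 23 can be sketched as follows. This is an illustrative reduction under stated assumptions, not the speaker's system: the nearest-rank percentile method, the population standard deviation, and the names `rollup` and `percentile` are all choices made for this example. Covariance is left out because it needs a second series to compare against.

```python
import math
import statistics
from collections import defaultdict

WINDOW_SECONDS = 60  # the slide's 1-minute window

def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]

def rollup(samples):
    """Group (timestamp, value) samples into fixed windows and summarize each one."""
    windows = defaultdict(list)
    for timestamp, value in samples:
        windows[timestamp - timestamp % WINDOW_SECONDS].append(value)

    result = {}
    for start, values in sorted(windows.items()):
        values.sort()
        result[start] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "avg": statistics.fmean(values),
            "stddev": statistics.pstdev(values),
            "p50": percentile(values, 0.50),
            "p95": percentile(values, 0.95),
            "p99": percentile(values, 0.99),
        }
    return result

# Synthetic data: one sample every 5 seconds for two minutes.
samples = [(t, float(t % 7)) for t in range(0, 120, 5)]
summary = rollup(samples)
```

Each window collapses to a fixed-size record no matter how many samples arrived, which is the heavy lossy compression the slide mentions. The individual values, and with them the shape of the distribution, are gone.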
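
The projection idea on slides 25 to 29 could be approximated with an ordinary least-squares line, as in the sketch below. The slides do not say which model was used, so the linear fit, the helper names `fit_line` and `days_until_full`, and the sample numbers are all assumptions. Slide 30's caveat applies: most real data streams do not extrapolate this cleanly.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_full(points, capacity):
    """Days after the last sample until the fitted line reaches capacity.

    Returns None when usage is flat or shrinking, since the line never gets there.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    last_day = points[-1][0]
    return (capacity - intercept) / slope - last_day

# Hypothetical daily storage usage in GB: (day, used).
usage = [(0, 100.0), (1, 102.0), (2, 104.0), (3, 106.0), (4, 108.0)]
remaining = days_until_full(usage, capacity=200.0)
```

With perfectly linear growth of 2 GB per day the fit is exact. Noisy or seasonal data would need a wider window and some measure of confidence before the number is worth acting on.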
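
Slide 31's histogram technique can be sketched in the same style. The fixed bucket width and the name `histogram_rollup` are assumptions made for illustration; the talk does not describe the bucketing scheme actually used.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60   # the slide's 1-minute window
BUCKET_WIDTH = 10.0   # assumption: fixed-width buckets in the metric's own units

def histogram_rollup(samples):
    """Count (timestamp, value) samples per bucket within each window."""
    windows = defaultdict(Counter)
    for timestamp, value in samples:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        bucket_floor = (value // BUCKET_WIDTH) * BUCKET_WIDTH
        windows[window_start][bucket_floor] += 1
    return {start: dict(counts) for start, counts in windows.items()}

# Hypothetical latencies in milliseconds: (timestamp, value).
samples = [(1, 12.0), (2, 17.5), (30, 48.0), (61, 3.2), (75, 12.9)]
hist = histogram_rollup(samples)
```

The counts keep the shape of the whole population in each window, while every value is rounded down to its bucket boundary. That is the trade the slide describes: the distribution is retained and precision is lost.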