1. It’s all about telemetry
Monitoring what matters in a useful way.
Tuesday, June 26, 12
2. Theo Schlossnagle @postwait
I write software
I write books
I give talks
I participate in the industry
I speak frankly about industry issues
4. Big Data
Most new big data problems are
solvable
5. Big Data
Most new big data problems are
created by our solutions, and thus
solvable
despite their ROI
6. That’s a whole lot of data
Think in terms of logs (too many do)
About 26 trillion log lines / month
@ 40 bytes compressed: about 1 PB / month
Just because it is possible
does not mean it will return on investment
(and does not mean it won’t)
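As a rough check of the arithmetic on this slide, here is a minimal sketch. The line count and bytes per line come from the slide; treating a petabyte as 10^15 bytes is an assumption.

```python
# Back-of-the-envelope check of the log volume quoted on the slide.
LINES_PER_MONTH = 26 * 10**12   # about 26 trillion log lines / month (from the slide)
BYTES_PER_LINE = 40             # compressed size per line (from the slide)
PETABYTE = 10**15               # assumption: decimal petabyte


def monthly_petabytes(lines: int, bytes_per_line: int) -> float:
    """Return the monthly storage volume in petabytes."""
    return lines * bytes_per_line / PETABYTE


if __name__ == "__main__":
    print(f"{monthly_petabytes(LINES_PER_MONTH, BYTES_PER_LINE):.2f} PB / month")
```

The result lands just over one petabyte per month, which is where the slide's round figure comes from.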
7. It’s all “useful”; which data?
Think in terms of cost/benefit.
Sure, the data is useful, but it costs money to store
Does it cost you more to have it or not to have it?
Maybe the right approach is to keep that level of detail
for a few days?
8. Double-edged sword.
Eroding granularity over time
keeps storage under control
9. Double-edged sword.
Eroding granularity over time
keeps storage under control
[Stamped across the slide: MISTAKE]
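One way to see the other edge of the sword is to erode a series by averaging and look at what survives. This is a sketch with made-up numbers, not data from the talk; `downsample` is a hypothetical helper.

```python
from statistics import mean


def downsample(points, factor):
    """Average consecutive groups of `factor` points into one point."""
    return [mean(points[i:i + factor]) for i in range(0, len(points), factor)]


# Hypothetical data: 60 one-minute latency samples (ms) with one bad minute.
minute_samples = [10.0] * 60
minute_samples[37] = 910.0  # a single one-minute spike

# Erode the hour of one-minute points down to a single hourly point.
hourly = downsample(minute_samples, 60)

if __name__ == "__main__":
    print("max at 1-minute granularity:", max(minute_samples))
    print("hourly value after erosion:", hourly[0])
```

The eroded series is sixty times smaller, but the spike that mattered is averaged into a value that looks only mildly elevated.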
12. 1 day
confidence of normalcy increases
13. 1 week
that looks different
14. 1 day
yup, that’s not at all like that other week
15. Other methods
What do you store?
How do you store it?
Why is it useful?
Winning the cost benefit game by
reducing costs more significantly than
reducing benefits
16. Positive Value
[Chart: Benefit and Cost curves plotted against monitoring activity]
Be in the green.
17. There’s a bigger picture
[Chart: Cost and Benefit curves plotted against monitoring activity, over a wider range]
It’s not as easy as you think.
18. Value is difference, not area
[Chart: Benefit and Cost curves plotted against monitoring activity]
Green can be misleading
19. Value = Benefit - Cost
[Chart: Value plotted against monitoring activity]
Green means we have positive return
20. It’s not about return
[Chart: Value plotted against monitoring activity]
Well, it’s not only about return
21. It’s about maximizing return
[Chart: Value plotted against monitoring activity]
This is a bit like black magic
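The difference between "positive return" and "maximum return" can be sketched numerically. The benefit and cost curves below are invented for illustration (diminishing-returns benefit, linear cost); the talk does not give formulas, so every function and constant here is an assumption.

```python
import math


def benefit(activity: float) -> float:
    """Hypothetical benefit curve with diminishing returns."""
    return 1.0 - math.exp(-activity)


def cost(activity: float) -> float:
    """Hypothetical cost curve: linear in monitoring activity."""
    return 0.25 * activity


def value(activity: float) -> float:
    """Value = Benefit - Cost."""
    return benefit(activity) - cost(activity)


# Grid search over monitoring activity in [0, 5] in steps of 0.01.
candidates = [i / 100 for i in range(501)]
best_activity = max(candidates, key=value)

# Largest activity level on the grid that still has a positive return.
break_even = max(a for a in candidates if value(a) > 0)

if __name__ == "__main__":
    print(f"return is positive up to about {break_even:.2f}")
    print(f"return is maximized at about {best_activity:.2f}")
```

With these made-up curves the return stays positive long after it has peaked, so "in the green" covers a lot of activity that is still the wrong amount.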
23. Technique 2: numeric
Store rollups
(i.e. statistical aggregates over fixed windows)
over 1 minute store
min/max/avg/stddev/covariance/50%/95%/99%
lots of information
heavy lossy compression of high-frequency data
loses population distribution information
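A minimal sketch of a one-minute rollup, assuming samples arrive as `(timestamp_seconds, value)` pairs. The function names, the nearest-rank percentile method, and the population standard deviation are choices made for this illustration, not a description of the author's system. Covariance is left out because it needs a second stream.

```python
import math
from statistics import mean, pstdev


def percentile(sorted_values, p):
    """Nearest-rank percentile (integer p in 1..100) of an already sorted list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def rollup(samples, window=60):
    """Group (timestamp, value) samples into fixed windows and summarize each."""
    windows = {}
    for ts, v in samples:
        windows.setdefault(int(ts // window) * window, []).append(v)
    out = {}
    for start, values in sorted(windows.items()):
        values.sort()
        out[start] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "avg": mean(values),
            "stddev": pstdev(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
    return out


# Hypothetical input: one sample per second for two minutes.
samples = [(t, float(t % 60)) for t in range(120)]
rollups = rollup(samples)

if __name__ == "__main__":
    for start, stats in rollups.items():
        print(start, stats)
```

Each window collapses to a fixed handful of numbers no matter how many samples it held, which is the compression. The shape of the population inside the window is what gets lost.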
24. Database replication
Lag (green) and rate of lag change (purple)
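The second series on a graph like this is just the first derivative of the first. A small sketch, assuming lag is sampled as `(timestamp_seconds, lag_seconds)` pairs; the readings are hypothetical.

```python
def rate_of_change(samples):
    """Per-second change between consecutive (timestamp, lag_seconds) samples."""
    return [
        (t1, (lag1 - lag0) / (t1 - t0))
        for (t0, lag0), (t1, lag1) in zip(samples, samples[1:])
    ]


# Hypothetical replication lag readings taken every 60 seconds.
lag = [(0, 2.0), (60, 2.0), (120, 8.0), (180, 20.0), (240, 14.0)]
rates = rate_of_change(lag)

if __name__ == "__main__":
    for ts, rate in rates:
        print(ts, rate)
```

A positive rate means the replica is falling further behind; a negative rate means it is catching up, even while the lag itself is still high.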
25. Storage Usage
We can see growth.
More useful, we can use this to project.
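A projection of this kind can be sketched as a straight-line fit. The samples and capacity below are invented, the helper names are hypothetical, and a linear fit is an assumption that only holds while growth stays roughly linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def days_until_full(days, used_gb, capacity_gb):
    """Days after the last sample at which the fitted line reaches capacity."""
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None  # no growth, so no projected fill date
    return (capacity_gb - intercept) / slope - days[-1]


# Hypothetical daily samples: 500 GB used on day 0, growing 10 GB per day.
days = list(range(30))
used = [500.0 + 10.0 * d for d in days]
remaining = days_until_full(days, used, capacity_gb=1000.0)

if __name__ == "__main__":
    print(f"projected days until full: {remaining:.1f}")
```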
30. Full Disclosure
You see awesome examples of predictive analytics
Like the real-world one on the previous slide
In practice, almost all data streams predict one thing:
they have no fucking clue.
31. Technique 3: histograms
Store histograms
over 1 minute store
counts of datapoints seen in various buckets
retains complete population distribution
loss of precision
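A minimal sketch of the trade described on this slide: the whole distribution survives, but each value is only known to within its bucket. Fixed-width 5 ms buckets and the helper names are assumptions for illustration; the talk does not say how its buckets are laid out.

```python
from collections import Counter

BUCKET_WIDTH_MS = 5  # assumption: fixed-width buckets


def to_histogram(values_ms):
    """Count datapoints per bucket, keyed by the bucket's lower bound."""
    return Counter(int(v // BUCKET_WIDTH_MS) * BUCKET_WIDTH_MS for v in values_ms)


def approx_percentile(hist, p):
    """Lower bound of the bucket holding the p-th percentile (nearest rank)."""
    total = sum(hist.values())
    target = max(1, -(-p * total // 100))  # integer ceil(p * total / 100)
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return bucket


# Hypothetical service times (ms) seen in one minute.
samples = [3.2, 4.8, 7.1, 7.9, 8.4, 12.0, 12.5, 13.3, 41.7, 44.0]
hist = to_histogram(samples)

if __name__ == "__main__":
    print(sorted(hist.items()))
    print("median is somewhere in the bucket starting at", approx_percentile(hist, 50))
```

Unlike a rollup, any percentile can be asked for after the fact, and the slow cluster around 40 ms stays visible instead of being folded into an average.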
32. Histograms 101
This.
This is a histogram.
It shows the frequency of
values within a population.
Height represents frequency
33. Histograms 101
This.
This is a histogram.
It shows the frequency of
values within a population.
Now, height and color
represent frequency
34. Histograms 101
This.
This is a histogram.
It shows the frequency of
values within a population.
Now, only color
represents frequency
36. Histograms ➠ time series
This.
This is a histogram.
It shows the frequency of
values within a population.
Now, only color
represents frequency
38. Histograms ➠ time series
This.
This is a histogram.
It shows the frequency of
values within a population.
Now, only color
represents frequency
at a single time interval
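Turning histograms into a time series means storing one histogram per interval and drawing each as a column where shade stands in for count. The sketch below does that in plain text. Bucket width, window size, the shade characters, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

BUCKET_WIDTH_MS = 10   # assumption: fixed-width buckets
WINDOW_S = 60          # one histogram per minute
SHADES = " .:*#"       # darker character = more datapoints in the cell


def histogram_series(samples):
    """Map window start -> Counter of (bucket lower bound -> count)."""
    series = {}
    for ts, value_ms in samples:
        window = int(ts // WINDOW_S) * WINDOW_S
        bucket = int(value_ms // BUCKET_WIDTH_MS) * BUCKET_WIDTH_MS
        series.setdefault(window, Counter())[bucket] += 1
    return series


def render(series):
    """One text row per bucket (highest first), one column per time window."""
    windows = sorted(series)
    buckets = sorted({b for h in series.values() for b in h}, reverse=True)
    peak = max(c for h in series.values() for c in h.values())
    rows = []
    for b in buckets:
        cells = "".join(
            SHADES[(len(SHADES) - 1) * series[w][b] // peak] for w in windows
        )
        rows.append(f"{b:>4} ms |{cells}|")
    return "\n".join(rows)


# Hypothetical data: two minutes at about 12 ms, then a shift to about 25 ms.
samples = [(t, 12.0) for t in range(120)] + [(t, 25.0) for t in range(120, 180)]
series = histogram_series(samples)

if __name__ == "__main__":
    print(render(series))
```

In the rendered grid the dark cells sit in the 10 ms row for the first two columns and jump to the 20 ms row in the third, which is what a shift of the whole population looks like in this view.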
39. API Service Times
We can see a full population shift
of several milliseconds
40. Combining techniques
In our system (as a reference point)
Arbitrary numbers of numeric data points
on a single stream
occupy 32 bytes of space for statistical aggregates and
occupy about 2k of space for a histogram
This means we can store these transforms of
numeric data in perpetuity
41. Combining techniques
Text is a bit harder
You need to be careful
Some data sources can be constantly changing
Producing gobs of change data
You’re doing it wrong
Find these and fix them
42. Correlating Events
Change Management vs. Performance
44. What to monitor?
Most people don’t monitor the things that matter most
45. Monitor the Business
Financials:
Revenues. Costs. Margins. AR. Account delinquency.
Marketing:
Web analytics. Campaigns. Costs. Returns.
Convergence.
46. Monitor the Support
Customer Service:
Problems. Time investment. Customer satisfaction.
Resolution time.
47. Monitor the Engineering
Engineering:
Deployments. Test coverage.
Bug reports. Bug fixes. Effort spent.
Operations:
Faults. Pages. Escalations. Provisioning time.
Equipment defect rates. 3rd party failure rates.
48. Monitor the Service
Systems:
Networks. Systems. Storage.
Databases:
Performance. Error rates. Backups.
Middleware:
Herein lies the magic and room for awesomeness
49. Monitor the Middleware
Your systems are complex
Monitor their interactions
Messaging, APIs, etc.
51. Monitor all the things.
But, perhaps most importantly...
USE UNIFIED TOOLING
52. What we use...
Reconnoiter
SNMP, nad, resmon, statsd, HTTP traps, JDBC, etc.
statsd (clients)
JavaScript beacons
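For a sense of how little a statsd client has to do, here is a minimal sketch of the commonly documented statsd line format (`<name>:<value>|<type>`) sent over UDP. The metric names, the localhost target, and the helper functions are hypothetical. This is not the client the author's team uses, and it says nothing about how Reconnoiter ingests the data.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> bytes:
    """Build a statsd line: <name>:<value>|<type> (c=counter, ms=timer, g=gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; 8125 is the usual statsd port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names for a signup counter and an API timer.
    send_metric(format_metric("api.signup.count", 1, "c"))
    send_metric(format_metric("api.service_time", 23, "ms"))
```

Because the send is a single UDP datagram with no reply, instrumenting application code this way adds very little overhead, at the price of silently dropped metrics when nothing is listening.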
53. Middleware mix
API service times, traffic, user signup rates.