
Metrics stack 2.0


Most metrics systems link timeseries to a string key; some add a few tags. They often lack information, use inconsistent formats and terminology, and are poorly organized. As more people and software generate, process, store and visualize metrics, this approach becomes very cumbersome, and there is a lot to be gained from taking a step back and rethinking metric identifiers and metadata.

Metrics 2.0 is a set of conventions around metrics: with barely any extra work, metrics become self-describing and standardized. Compatibility between tools increases dramatically: dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, and anomaly detection and aggregators can work more autonomously and avoid common mistakes. The result: less micromanaging of software and configuration, quicker results, more clarity, less frustration and less room for error.

This talk will also cover the tools that turn this concept into production-ready reality:

Graph-Explorer is an application that integrates with Graphite. Enter an expression that represents an information need and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, etc.
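
A rough sketch of how such an expression might select metrics, assuming each metric is a plain dictionary of tags. The `matches` helper and its rules are a simplification for illustration, not Graph-Explorer's actual parser: `key=value` is an exact match, `key:pattern` is a regular-expression match on one tag, a bare word matches any tag value, and a leading `!` negates the term. Clauses such as `group by` or `avg over` are not handled.

```python
import re

# A term is either key=value, key:pattern, or a bare pattern.
TERM = re.compile(r"^(?P<key>\w+)(?P<op>[=:])(?P<value>.+)$")

def matches(metric, expression):
    """Return True if the tag dict `metric` satisfies every term (hypothetical helper)."""
    for raw in expression.split():
        negate = raw.startswith("!")
        term = raw[1:] if negate else raw
        parsed = TERM.match(term)
        if parsed is None:
            # Bare word: regex search over all tag values.
            hit = any(re.search(term, str(v)) for v in metric.values())
        elif parsed["key"] not in metric:
            hit = False
        elif parsed["op"] == "=":
            hit = str(metric[parsed["key"]]) == parsed["value"]
        else:
            hit = re.search(parsed["value"], str(metric[parsed["key"]])) is not None
        if hit == negate:
            return False
    return True

# Made-up tag sets, loosely modelled on the examples in the slides.
metrics = [
    {"site": "api.vimeo.com", "unit": "Req/s", "port": 80},
    {"site": "vimeo.com", "unit": "Req/s", "port": 443},
    {"site": "api.vimeo.com", "unit": "B", "port": 80},
]
selected = [m for m in metrics if matches(m, "site:api.vimeo.com unit=Req/s")]
```

Only the first tag set is selected here: the second fails the `site` pattern and the third has a different unit.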

Statsdaemon is an aggregation daemon like Etsy's Statsd that expresses the aggregations and statistical operations it performs by updating the metric's tags, so that the metric metadata always corresponds to the data.
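
A minimal sketch of that idea, not Statsdaemon's own code: when an aggregator turns raw values into a rate or into timer statistics, it can rewrite the tags so they stay truthful. The function names, the tag-dictionary shape, the 10-second flush interval and the percentile rule are all assumptions made for illustration.

```python
def flush_counter(tags, values, interval_s=10):
    """Sum the values of one flush interval and emit a per-second rate."""
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"   # e.g. unit=B becomes unit=B/s
    out["target_type"] = "rate"
    return out, sum(values) / interval_s

def flush_timer(tags, values):
    """Emit one series per statistic; the unit is kept and a stat tag is added."""
    ordered = sorted(values)
    # Keep the lowest 90% of samples (at least one) for upper_90.
    keep = max(1, int(round(len(ordered) * 0.9)))
    stats = {
        "mean": sum(ordered) / len(ordered),
        "upper_90": ordered[keep - 1],
    }
    return [(dict(tags, stat=name), value) for name, value in stats.items()]

# Hypothetical input: three byte counts and ten timings from one interval.
rate_tags, rate = flush_counter({"unit": "B", "target_type": "count"}, [100, 200, 300])
timer_series = flush_timer({"unit": "ms"}, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

The input tags are left untouched; each output series carries its own copy with the updated `unit` or the added `stat`.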

Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.

Published in: Engineering

Metrics stack 2.0

  2. Credit: user niteroi @ panoramio.com
  3. vimeo.com/43800150
  11. 1  Metrics 2.0 concepts
      2  Implementation
      3  Advanced stuff
  12. “Dieter” ?
  13. Peter → Deter
  14. Terminology sync
  15. (1234567890, 82)
      (1234567900, 123)
      (1234567910, 109)
      (1234567920, 77)
      db15.mysql.queries_running
      host=db15 mysql.queries_running
  17. How many page requests/s is vimeo.com doing?
  18. ● stats.hits.vimeo_com
      ● stats_counts.hits.vimeo_com
  20. stats.<host>.requesthostport.vimeo_com_443
  21. stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90
  22. O(X*Y*Z)
      X = # apps
      Y = # people
      Z = # aggregators
  23. How long does it take to retrieve an object from swift?
  24. stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>
      stats.timers.<host>.object-server.<http_method>.timing.<stat>
      target=stats.timers.dfs*.object*GET*timing.mean ?
      target=groupByNode(stats.timers.dfs*.proxy-server.object.GET.*.timing.mean,2,"avg")
      target=stats.timers.dfs*.object-server.GET.timing.mean
  25. swift_type=object stat=mean timing GET avg by http_code
  28. O((DxV)^2)
      D = # dimensions
      V = # values per dim
  29. collectd.db.disk.sda1.disk_time.write
  32. What should I name my metric?
  33. 10 100 1000 10000 100000 1000000
  35. Metrics 2.0
  36. Old:
      ● information lacking
      ● fields unclear & inconsistent
      ● cumbersome strings / trees
      ● forbidden characters
      New:
      ● Self-describing
      ● Standardized
      ● all dimensions in orthogonal tag-space
      ● Allow some useful characters
  37. stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90
      {
          "server": "dfvimeodfsproxy5",
          "http_method": "GET",
          "http_code": "200",
          "unit": "ms",
          "target_type": "gauge",
          "stat": "upper_90",
          "swift_type": "object",
          "plugin": "swift_proxy_server"
      }
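
One way to get from the dotted string to a tag set like this is a pattern that names each position in the key. The sketch below is illustrative only: the regular expression, the `parse_swift_timer` name and the fixed `unit`, `target_type` and `plugin` values are assumptions, and it keeps the host exactly as it appears in the key rather than expanding `dfs5` to a full server name.

```python
import re

# Names each dot-separated position of a swift proxy-server timer key.
SWIFT_PROXY_TIMER = re.compile(
    r"^stats\.timers\.(?P<server>[^.]+)\.proxy-server\."
    r"(?P<swift_type>[^.]+)\.(?P<http_method>[^.]+)\.(?P<http_code>[^.]+)"
    r"\.timing\.(?P<stat>[^.]+)$"
)

def parse_swift_timer(key):
    """Return a tag dict for a matching key, or None (hypothetical helper)."""
    m = SWIFT_PROXY_TIMER.match(key)
    if m is None:
        return None
    tags = m.groupdict()
    # Information the string never carried has to be supplied by the rule itself.
    tags.update(unit="ms", target_type="gauge", plugin="swift_proxy_server")
    return tags

tags = parse_swift_timer(
    "stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90"
)
```

Keys that do not follow the pattern, such as `stats.hits.vimeo_com`, yield `None` and would need their own rule.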
  38. Main advantages:
      ● Immediate understanding of metric meaning (ideally)
      ● Minimize time to graphs, dashboards, alerting rules
  39. github.com/vimeo/graph-explorer/wiki
  40. SI + IEC
      B   Err   Warn   Conn   Job   File   Req   ...
      MB/s   Err/d   Req/h   ...
  41. {
          "site": "vimeo.com",
          "port": 80,
          "unit": "Req/s",
          "direction": "in",
          "service": "webapp_php",
          "server": "webxx"
      }
  43. Carbon-tagger:
      ...
      service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890
      …
      Statsdaemon:
      ..unit=B..unit=B...     → unit=B/s
      ..unit=ms..unit=ms..    → unit=ms stat=mean
                              → unit=ms stat=upper_90
                              → ...
  46. Graph-Explorer queries 101
      site:api.vimeo.com unit=Req/s
      requesthostport api_vimeo_com
  48. Smoothing
      avg over 10M
      avg over ...
  50. Aggregation, compare port 80 vs 443
      avg by <dimension>
      sum by <dimension>
      sum by server
  52. Compare port 80 traffic among servers
      site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M
  54. Graph-Explorer queries 201
      proxy-server swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>
  59. Compare object put/get
      Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server
  61. Comparing servers
      http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none
  63. Compare http codes for GET, per swift type
      http_method=GET avg by server group by swift_type
  65. transcode unit=Job/s avg over <time> from <datetime> to <datetime>
  66. Note: data is obfuscated
  67. Bucketing
      !queue sum by zone:ap-southeast|eu-west|us-east|us-west|sa-east|vimeo-df|vimeo-lv group by state
  68. Note: data is obfuscated
  69. Compare job states per region (zones bucket)
      group by zone
  70. Note: data is obfuscated
  71. Unit conversion
      unit=Mb/s network dfvimeorpc sum by server
  74. unit=MB
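
A request such as `unit=MB` against data stored as `unit=B` comes down to scaling by the right prefix factor. The sketch below handles only SI and IEC prefixes on an otherwise identical unit (for example B to MB, or B/s to MiB/s). The `convert` function and its prefix table are assumptions for illustration, and bit/byte conversion such as B/s to Mb/s is deliberately left out.

```python
PREFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # SI
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,   # IEC
}

def split_prefix(unit, base):
    """Return the prefix of `unit` relative to `base`, e.g. 'M' for 'MB/s' and 'B/s'."""
    if not unit.endswith(base):
        raise ValueError(f"{unit!r} is not a multiple of {base!r}")
    prefix = unit[: len(unit) - len(base)]
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix {prefix!r}")
    return prefix

def convert(value, from_unit, to_unit, base):
    """Rescale `value` between two prefixed forms of the same base unit."""
    src = PREFIXES[split_prefix(from_unit, base)]
    dst = PREFIXES[split_prefix(to_unit, base)]
    return value * src / dst

megabytes = convert(5_000_000, "B", "MB", "B")
mebibytes = convert(1, "GiB", "MiB", "B")
```

Because the stored metric states its unit explicitly, the target unit can be chosen per query instead of being baked into the metric name.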
  77. {
          server=dfvimeodfs1
          plugin=diskspace
          mountpoint=_srv_node_dfs5
          unit=B
          type=used
          target_type=gauge
      }
  78. server:dfvimeodfs unit=GB type=free srv node
  80. unit=GB/d group by mountpoint
  87. Dashboard definition
      queries = [
          'cpu usage sum by core',
          'mem unit=B !total group by type:swap',
          'stack network unit=b/s',
          'unit=B (free|used) group by =mountpoint'
      ]
  89. stats.dfvimeocliapp2.twitter.error
      {
          "n1": "dfvimeocliapp2",
          "n2": "twitter",
          "n3": "error",
          "plugin": "catchall_statsd",
          "source": "statsd",
          "target_type": "rate",
          "unit": "unknown/s"
      }
  90. Two hard things in computer science
  91. stats.gauges.files.id_boundary_7day
      stats.gauges.files.id_boundary_ceil
  92. unit=File id_boundary_7d
      {
          "unit": "File",
          "n1": "id_boundary_7d"
      }
  93. {
          "intrinsic": {
              "site": "vimeo.com",
              "unit": "Req/s"
          },
          "extrinsic": {
              "agent": "diamond",
              "processed_by": "statsd1",
              "src": "index.php:135",
              "replaces": "vimeo_com_reqps"
          }
      }
  94. site=vimeo.com unit=Req/s  processed_by=statsd1 src=index.php:135 added_by=dieter 123 1234567890
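
The intrinsic/extrinsic split suggests that only the intrinsic tags identify a series, while extrinsic tags are annotations that can change without creating a new metric. Under that reading (an interpretation of the slides, not something they spell out), an identifier could be derived as follows. The `metric_id` function and the choice of SHA-1 are assumptions.

```python
import hashlib

def metric_id(intrinsic):
    """Derive a stable id from the intrinsic tags, independent of key order."""
    canonical = ",".join(f"{k}={intrinsic[k]}" for k in sorted(intrinsic))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

metric = {
    "intrinsic": {"site": "vimeo.com", "unit": "Req/s"},
    "extrinsic": {"agent": "diamond", "processed_by": "statsd1"},
}
before = metric_id(metric["intrinsic"])
metric["extrinsic"]["processed_by"] = "statsd2"   # only an annotation changes
after = metric_id(metric["intrinsic"])
```

Changing an intrinsic tag, for instance the unit, would produce a different id and therefore a different series.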
  96. Equivalence
      servers.host.cpu.total.iowait → "core": "_sum_"
      servers.host.cpu.<core-number>.iowait
      servers.host.loadavg.15
  97. Rollups & aggregation
  98. /etc/carbon/storage-aggregation.conf
      [min]
      pattern = .min$
      aggregationMethod = min
      [max]
      pattern = .max$
      aggregationMethod = max
      [sum]
      pattern = .count$
      aggregationMethod = sum
      [default_average]
      pattern = .*
      aggregationMethod = average
  100. 2 kinds of graphite users
  101. Self-describing metrics
       stat=upper/lower/mean/...
       target_type=counter..
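
With tags like these, the rollup method no longer has to be inferred from the tail of the metric name, as the storage-aggregation.conf patterns do. A hedged sketch of the idea follows. The `rollup_method` function and its mapping from tags to aggregation methods are assumptions, not part of Graphite or of the tools in this talk; the returned strings are the method names carbon accepts for `aggregationMethod`.

```python
def rollup_method(tags):
    """Pick a rollup method from a metric's tags (hypothetical mapping)."""
    stat = tags.get("stat", "")
    if stat == "min" or stat.startswith("lower"):
        return "min"
    if stat == "max" or stat.startswith("upper"):
        return "max"
    target_type = tags.get("target_type")
    if target_type == "count":
        return "sum"        # per-interval counts add up over a longer interval
    if target_type == "counter":
        return "last"       # an ever-increasing counter keeps its latest reading
    return "average"

method = rollup_method({"unit": "ms", "stat": "upper_90"})
```

Whether `upper_90` should roll up with `max` is itself a judgment call; the point is only that the decision can key off explicit tags rather than naming conventions.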
  102. ● stats.timers.render_time.histogram.bin_0.01
       ● stats.timers.render_time.histogram.bin_0.1
       ● stats.timers.render_time.histogram.bin_1     → unit=Freq_abs bin_upper=1
       ● stats.timers.render_time.histogram.bin_10
       ● stats.timers.render_time.histogram.bin_50
       ● stats.timers.render_time.histogram.bin_inf
       ● stats.timers.render_time.lower               → unit=ms stat=lower
       ● stats.timers.render_time.mean                → unit=ms stat=mean
       ● stats.timers.render_time.mean_90             → ...
       ● stats.timers.render_time.median
       ● stats.timers.render_time.std
       ● stats.timers.render_time.upper
       ● stats.timers.render_time.upper_90
  103. Also..
       ● graphite API functions such as "cumulative", "summarize" and "smartSummarize"
       ● Graph renderers
  105. From: dygraphs.com
  111. Facet based suggestions
  113. Metric types
       ● gauge
       ● count & rate
       ● counter
       ● timer
  118. gauge
       ● Multiple values in same interval
       ● “sticky”
  120. Count & Rate
  121. Counter
  122. Timer..
  124. http://janabeck.com/blog/2012/10/12/lessons-learned-from-100/
  125. Timer..
  126. ● What should a metric be?
       ● Stickiness?
       ● Behavior on no packets received
       ● Behavior on multiple packets received
  127. My personal takeaways
  128. Conclusion
       ● Building graphs, setting up alerting cumbersome
       ● Esp. changing information needs (troubleshooting, exploring, ..)
       ● Esp. complicated information needs
       → PAIN
       ● Structuring metrics
       ● Self-describing metrics
       ● Standardized metrics
       ● Native metrics 2.0
       → BREEZE
  129. Conclusion
       ● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.
       ● Converting information needs into graph defs, alerting rules
       ● Graph-Explorer, carbon-tagger, statsdaemon, …
       ● Graphite-ng (native metrics 2.0)
       ● Metrics 2.0 in your apps, agents, aggregators?
       ● Build out structured metrics library
  130. github.com/vimeo
       github.com/Dieterbe
       twitter.com/Dieter_be
       dieter.plaetinck.be
