Your SlideShare is downloading. ×
Metrics stack 2.0
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Metrics stack 2.0

1,816
views

Published on

Most metrics systems link timeseries to a string key, some add a few tags. They often lack information, use inconsistent formats and terminology, and are poorly organized. As the amount of people and …

Most metrics systems link timeseries to a string key, some add a few tags. They often lack information, use inconsistent formats and terminology, and are poorly organized. As the amount of people and software generating, processing, storing and visualizing metrics grows, this approach becomes very cumbersome and there is a lot to be gained from taking a step back and re-thinking metric identifiers and metadata.

Metrics 2.0 is a set of conventions around metrics: With barely any extra work metrics become self-describing and standardized. Compatibility between tools increases dramatically, dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, anomaly detection and aggregators can work more autonomously and avoid common mistakes. Result: less micromanaging of software and configuration, quicker results, more clarity. Less frustration and room for errors.

This talk will also cover the tools that turn this concept into production-ready reality:

Graph-Explorer is an application that integrates with Graphite. Enter an expression that represents an information need and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, etc.

Statsdaemon is an aggregation daemon like Etsy's Statsd that expresses performed aggregations and statistical operations by updating the metrics tags, making sure that the metric metadata always corresponds to the data.

Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.

Published in: Engineering

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,816
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
11
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1.    
  • 2.     Credit: user niteroi @ panoramio.com
  • 3.     vimeo.com/43800150
  • 4.    
  • 5.    
  • 6.    
  • 7.    
  • 8.    
  • 9.    
  • 10.    
  • 11.     1  Metrics 2.0 concepts 2  Implementation 3  Advanced stuff
  • 12.     “Dieter” ?
  • 13.     Peter   Deter→
  • 14.     Terminology sync
  • 15.     (1234567890, 82) (1234567900, 123) (1234567910, 109) (1234567920, 77) db15.mysql.queries_running host=db15 mysql.queries_running
  • 16.    
  • 17.     How many pagerequests/s is vimeo.com  doing?
  • 18.     ● stats.hits.vimeo_com ● stats_counts.hits.vimeo_com
  • 19.    
  • 20.     stats.<host>.requesthostport.vimeo_ com_443
  • 21.     stats.timers.dfs5.proxy­ server.object.GET.200.timing .upper_90
  • 22.     O(X*Y*Z) X = # apps                 Y = # people              Z = # aggregators     
  • 23.     How long does it take to retrieve an object from swift?
  • 24.     stats.timers.<host>.proxy­ server.<swift_type>.<http_method>. <http_code>.timing.<stat> stats.timers.<host>.object­ server.<http_method>. timing.<stat> target=stats.timers.dfs*.object*GET*timing.mean ? target=groupByNode(stats.timers.dfs*.proxy ­server.object.GET.*.timing.mean,2,"avg") target=stats.timers.dfs*.object­ server.GET.timing.mean
  • 25.     swift_type=object stat=mean timing GET avg by http_code
  • 26.    
  • 27.    
  • 28.     O((DxV)^2) D = # dimensions              V = # values per dim             
  • 29.     collectd.db.disk.sda1.dis k_time.write
  • 30.    
  • 31.    
  • 32.     What should I name my metric?
  • 33.     10 100 1000 10000 100000 1000000
  • 34.    
  • 35.     Metrics 2.0
  • 36.     Old: ● information lacking ● fields unclear & inconsistent ● cumbersome strings / trees ● forbidden characters New: ● Self­describing ● Standardized ● all dimensions in orthogonal tag­space ● Allow some useful characters
  • 37.     stats.timers.dfs5.proxy­server.object.GET.200.timing.upper_90 {     “server”: “dfvimeodfsproxy5”,     “http_method”: “GET”,     “http_code”: “200”,     “unit”: “ms”,     “target_type”: “gauge”,     “stat”: “upper_90”,     “swift_type”: “object”     “plugin”: “swift_proxy_server” }
  • 38.     Main advantages: ● Immediate understanding of metric meaning (ideally) ● Minimize time to graphs, dashboards, alerting rules 
  • 39.     github.com/vimeo/graph­explorer/wiki
  • 40.     SI + IEC B   Err   Warn   Conn   Job   File   Req    ... MB/s   Err/d   Req/h   ...
  • 41.     {     “site”: “vimeo.com”,     “port”: 80,     “unit”: “Req/s”,     “direction”: “in”,     “service”: “webapp_php”,     “server”:  “webxx” }
  • 42.    
  • 43.     Carbon­tagger: ...  service=foo.instance=host.target_type=gauge.type=calculatio n.unit=B 123 1234567890 … Statsdaemon: ..unit=B..unit=B...        unit=B/s→ ..unit=ms..unit=ms..    unit=ms stat=mean→                                    → unit=ms stat=upper_90                                    → ...
  • 44.    
  • 45.    
  • 46.     Graph­Explorer queries 101 site:api.vimeo.com unit=Req/s requesthostport api_vimeo_com
  • 47.    
  • 48.     Smoothing avg over 10M avg over ...
  • 49.    
  • 50.     Aggregation, compare port 80 vs 443 avg by <dimension> sum by <dimension> sum by server
  • 51.    
  • 52.     Compare 80 traffic amongt servers site:api.vimeo.com unit=Req/s port=80 group by none avg  over 10M
  • 53.    
  • 54.     Graph­Explorer queries 201 proxy­server swift server:regex upper_90 unit=ms from  <datetime> to <datetime> avg over <timespec> 
  • 55.    
  • 56.    
  • 57.    
  • 58.    
  • 59.     Compare object put/get Stack .. http_method:(PUT|GET) swift_type=object avg by  http_code,server
  • 60.    
  • 61.     Comparing servers http_method:(PUT|GET) avg by  http_code,swift_type,http_method group by none
  • 62.    
  • 63.     Compare http codes for GET, per swift type http_method=GET avg by server group by swift_type
  • 64.    
  • 65.     transcode unit=Job/s avg over <time> from <datetime> to  <datetime>
  • 66.     Note: data is obfuscated
  • 67.     Bucketing !queue sum by zone:ap­southeast|eu­west|us­east|us­ west|sa­east|vimeo­df|vimeo­lv group by state
  • 68.     Note: data is obfuscated
  • 69.     Compare job states per region (zones bucket) group by zone
  • 70.     Note: data is obfuscated
  • 71.     Unit conversion unit=Mb/s network dfvimeorpc sum by server
  • 72.    
  • 73.    
  • 74.     unit=MB
  • 75.    
  • 76.    
  • 77.     {     server=dfvimeodfs1     plugin=diskspace     mountpoint=_srv_node_dfs5     unit=B     type=used     target_type=gauge }
  • 78.     server:dfvimeodfs unit=GB type=free srv node
  • 79.    
  • 80.     unit=GB/d group by mountpoint
  • 81.    
  • 82.    
  • 83.    
  • 84.    
  • 85.    
  • 86.    
  • 87.     Dashboard definition  queries = [    'cpu usage sum by core',    'mem unit=B !total group by type:swap',    'stack network unit=b/s',    'unit=B (free|used) group by =mountpoint'  ]
  • 88.    
  • 89.     stats.dfvimeocliapp2.twitter.error {     “n1”: “dfvimeocliapp2”,     “n2”: “twitter”,     “n3”: “error”,     “plugin”: “catchall_statsd”,     “source”: “statsd”,     “target_type”: “rate”,     “unit”: “unknown/s” }
  • 90.     Two hard things in computer science
  • 91.     stats.gauges.files. id_boundary_7day stats.gauges.files. id_boundary_ceil
  • 92.     unit=File id_boundary_7d  {    “unit”: “File”,    “n1”: “id_boundary_7d”, }
  • 93.     {     “intrinsic”: {         “site”: “vimeo.com”,         “unit”: “Req/s”     },     “extrinsic”: {         “agent”: “diamond”,         “processed_by”: “statsd1”,         “src”: “index.php:135”,         “replaces”: “vimeo_com_reqps”     } }
  • 94.     site=vimeo.com unit=Req/s    processed_by=statsd1   src=index.php:135 added_by=dieter  123 1234567890
  • 95.    
  • 96.     Equivalence servers.host.cpu.total.iowait   “core” : “_sum_”→ servers.host.cpu.<core­number>.iowait servers.host.loadavg.15
  • 97.     Rollups & aggregation
  • 98.     /etc/carbon/storage­aggregation.conf [min] pattern = .min$ aggregationMethod = min [max] pattern = .max$ aggregationMethod = max [sum] pattern = .count$ aggregationMethod = sum [default_average] pattern = .* aggregationMethod = average
  • 99.    
  • 100.     2 kinds of graphite users
  • 101.     Self­describing metrics stat=upper/lower/mean/... target_type=counter..
  • 102.     ●    stats.timers.render_time.histogram.bin_0.01 ●    stats.timers.render_time.histogram.bin_0.1 ●    stats.timers.render_time.histogram.bin_1           unit=Freq_abs bin_upper=1→ ●    stats.timers.render_time.histogram.bin_10 ●    stats.timers.render_time.histogram.bin_50 ●    stats.timers.render_time.histogram.bin_inf ●    stats.timers.render_time.lower                            unit=ms stat=lower→ ●    stats.timers.render_time.mean                            unit=ms stat=mean→ ●    stats.timers.render_time.mean_90                      ...→ ●    stats.timers.render_time.median ●    stats.timers.render_time.std ●    stats.timers.render_time.upper ●    stats.timers.render_time.upper_90
  • 103.     Also.. ● graphite API functions such as "cumulative", "summarize"  and "smartSummarize" ● Graph renderers
  • 104.    
  • 105.     From: dygraphs.com
  • 106.    
  • 107.    
  • 108.    
  • 109.    
  • 110.    
  • 111.     Facet based suggestions
  • 112.    
  • 113.     Metric types ● gauge ● count & rate ● counter ● timer
  • 114.    
  • 115.    
  • 116.    
  • 117.    
  • 118.     gauge ● Multiple values in same interval ● “sticky”
  • 119.    
  • 120.     Count & Rate
  • 121.     Counter
  • 122.     Timer..
  • 123.    
  • 124.     http://janabeck.com/blog/2012/10/12/lessons­learned­from­100/
  • 125.     Timer..
  • 126.     ● What should a metric be? ● Stickyness? ● Behavior on no packets received ● Behavior on multiple packets received
  • 127.     My personal takeaways
  • 128.     Conclusion ● Building graphs, setting up alerting cumbersome ● Esp. changing information needs (troubleshooting, exploring, ..) ● Esp. Complicated information needs    → PAIN ● Structuring metrics ● Self­describing metrics ● Standardized metrics ● Native metrics 2.0 ●  → BREEZE 
  • 129.     Conclusion ● Metrics can be so much more usable and useful. Let's talk about  tagging, standardisation, retaining information throughout the  pipeline. ● Converting information needs into graph defs, alerting rules ● Graph­Explorer, carbon­tagger, statsdaemon, … ● Graphite­ng (native metrics 2.0) ● Metrics 2.0 in your apps, agents, aggregators? ● Build out structured metrics library
  • 130.     github.com/vimeo github.com/Dieterbe twitter.com/Dieter_be dieter.plaetinck.be
  • 131.