
How to measure everything - a million metrics per second with minimal developer overhead


Krux is an infrastructure provider for many of the websites you
use online today, like Wikia and NBCU. For every request on those
properties, Krux will get one or more requests as well. We grew
from zero traffic to several billion requests per day in the span
of 2 years, and we did so exclusively in AWS.

To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost, and without burdening developers is a tremendous
challenge.

Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2000/month using off-the-shelf Open Source software and
some code we've released as Open Source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.

This content will be applicable to anyone collecting, or looking to
collect, vast amounts of metrics in a cloud or datacenter setting
and making sense of them.

Published in: Engineering

  1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead - Jos Boumans, @jiboumans
  2. RIPE NCC Engineering manager for RIPE Database
  3. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10
  4. KRUX VP of Operations & Infrastructure
  7. AVERAGE DATA EVENTS / SEC [chart comparing Twitter: New Tweets, Wikipedia: Page Views, Facebook: Messages Sent, Krux: New Data Points]
  8. MONTHLY UNIQUE USERS [chart]
  9. DATA IS EVERYTHING Always know what’s going on
  10. UNIQUE METRICS Unique metrics received, per second
  11. METRICS & VISUALIZATION … and a little bit of monitoring
  12. VISUALIZATION MATTERS Humans are good at patterns & shapes
  13. INSIGHT MATTERS We consider it a core competence
  14. SHOW EVERYONE And better yet, encourage people to add their own
  16. KEY CHARACTERISTICS … of our metrics collection
  17. WHAT TO VISUALIZE Pick your operational KPIs
  18. REQUEST & ERROR RATES The baseline for everything else
  19. WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
  20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  23. WHAT TO CAPTURE Everything. No, really.
  24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second
  25. GRAPHITE, STATSD & COLLECTD The Trifecta
  26. COLLECTD Open Source Monitoring Tool
  27. STATSD Simple stats collector service
  28. STATSD NAMING SCHEME
      stats.               # to distinguish from events
      $environment.        # prod, dev, etc
      $cluster_name.       # api-ash, www-dub, etc
      $application.        # webapp, login, etc
      $metric_name_here.   # any key the app wants
      $hostname            # node the stat came from
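The scheme above is easy to encode as a small helper so every app builds keys the same way. This is a sketch; `statsd_key` is a hypothetical name, not part of krux-stdlib:

```python
import socket

def statsd_key(environment, cluster_name, application, metric_name):
    """Build a metric key following the naming scheme:
    stats.$environment.$cluster_name.$application.$metric_name_here.$hostname
    """
    hostname = socket.gethostname().split('.')[0]  # short node name
    return '.'.join(['stats', environment, cluster_name,
                     application, metric_name, hostname])

# e.g. statsd_key('prod', 'api-ash', 'webapp', 'login_time')
#      -> 'stats.prod.api-ash.webapp.login_time.<node>'
```

Putting the hostname last means cluster-wide aggregates can be computed with a single wildcard on the final component.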
  29. STATSD CONFIGURATION
      {
        graphite: {
          globalPrefix: stats.$env.$cluster_name,
          globalSuffix: require('os').hostname().split('.')[0],
          legacyNamespace: false,
        },
        percentThreshold: [ 95, 99 ],
        deleteIdleStats: true,
      }
  30. GRAPHITE Metric store & Graph UI
  31. GRAPHITE SETUP At least one graphite server per data center
  32. DATA RETENTION
      [default]
      pattern = .*
      priority = 110
      retentions = 10:6h,60:15d,600:5y
      xFilesFactor = 0
  33. STANDARD AGGREGATIONS
      # Average & Sum for timers
      <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>
      <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>
      # Min / Max for Lower / Upper
      <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper
      <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
  34. PERFORMANCE First problem: IOPS. Second problem: CPU
  35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted. Zabbix: OSS self hosted monitoring
  36. GRAPHITE.JS Custom dashboards using jQuery
  37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint
  38. INSTRUMENTATION Instrument your infrastructure, not just your apps
  39. APACHE Use mod_statsd to capture stats directly from the Apache request
  40. BASIC CONFIGURATION
      <Location /api>
        Statsd On
        StatsdPrefix apache
      </Location>
      $ curl http://localhost/api/foo?id=42
      Stat: …|ms
  41. VARNISH Use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request
  42. BASIC CONFIGURATION
      # pseudo code
      import statsd;
      import timers;
      sub vcl_deliver {
        statsd.timing(
          $backend +   # from req.backend
          $hit_miss +  # from obj.hits
          $resp_code,  # from obj.status
          timers.req_response_time()
        );
      }
  43. SAMPLE GRAPH The request per second & response time graphs are coming straight from Varnish
  44. PYTHON Create a base library in your language of choice
  45. KRUX-STDLIB $ pip install krux-stdlib
  46. BASIC APP USING STDLIB
      $ sample-app -h
      […]
      logging:
        --log-level {info,debug,critical,warning,error}
                              Verbosity of logging. (default: warning)
      stats:
        --stats               Enable sending statistics to statsd. (default: False)
        --stats-host STATS_HOST
                              Statsd host to send statistics to. (default: localhost)
        --stats-port STATS_PORT
                              Statsd port to send statistics to. (default: 8125)
        --stats-environment STATS_ENVIRONMENT
                              Statsd environment. (default: dev)
  47. BASIC APP USING STDLIB
      class App(krux.cli.Application):
          def __init__(self):
              ### Call to the superclass to bootstrap.
              super(App, self).__init__(name = 'sample-app')

          def run(self):
              stats = self.stats
              log = self.logger

              with stats.timer('run'):
        'running...')
                  ...
  48. CLI
      echo 'events.deploy.appname:1|c' | nc -u localhost 8125
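The same counter can be sent from any script without shelling out to nc. A minimal Python equivalent of the one-liner above; statsd speaks plain UDP datagrams, so this is fire-and-forget and won't block or fail if statsd is down:

```python
import socket

def send_stat(metric, host='localhost', port=8125):
    """Send one raw statsd datagram, e.g. 'events.deploy.appname:1|c'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(metric.encode('ascii'), (host, port))
    sock.close()

# Same counter as the nc one-liner:
send_stat('events.deploy.appname:1|c')
```

UDP is exactly why this is cheap enough to sprinkle everywhere: a lost datagram costs you one sample, never a request.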
  49. JAVASCRIPT Use a simple HTTP endpoint to send stats
  50. SUPERVISOR Instrument Supervisord using Sulphite
  51. BASIC CONFIGURATION
      # Install from PyPi
      $ pip install sulphite

      # Setup as eventlistener in Supervisor
      [eventlistener:sulphite]
      command=sulphite --graphite-server=…
      events=PROCESS_STATE
      numprocs=1
  52. FATAL PROCESS EXITS Processes that exited unexpectedly, and supervisor was unable to restart after N retries
  53. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite
  54. KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd
  55. BASIC USAGE
      # Charge to date for $service
      $ mon-get-stats EstimatedCharges \
          --namespace "AWS/Billing" \
          --statistics Sum \
          --dimensions "ServiceName=${service}" \
          --start-time $date
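Gluing that CLI call to statsd might look like the sketch below. The gauge key and the output parsing (last line, third whitespace-separated column of mon-get-stats) are assumptions for illustration, not the documented output format:

```python
import socket
import subprocess

def charge_metric(service, dollars):
    """Format an estimated-charge sample as a statsd gauge (key is made up)."""
    return 'aws.billing.%s:%s|g' % (service, dollars)

def report_charges(service, statsd=('localhost', 8125)):
    """Fetch the charge to date for a service via the CloudWatch CLI
    and emit it as a statsd gauge. Assumes mon-get-stats is installed
    and its credentials configured."""
    out = subprocess.check_output(
        ['mon-get-stats', 'EstimatedCharges',
         '--namespace', 'AWS/Billing',
         '--statistics', 'Sum',
         '--dimensions', 'ServiceName=%s' % service]).decode()
    # Assumption: the dollar amount is the third column of the newest sample.
    dollars = out.strip().splitlines()[-1].split()[2]
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(charge_metric(service, dollars).encode('ascii'), statsd)
    sock.close()
```

Run from cron, a graph of these gauges makes a runaway bill visible the same day instead of at the end of the month.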
  56. Q & A @jiboumans