How to measure everything - a million metrics per second with minimal developer overhead

Krux is an infrastructure provider for many of the websites you
use online today, like Wikia and NBCU. For every request on those
properties, Krux will get one or more requests as well. We grew
from zero traffic to several billion requests per day in the span
of two years, and we did so exclusively in AWS.

To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost, and without burdening developers is a tremendous
challenge.

Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2,000/month using off-the-shelf Open Source software and
some code we've released as Open Source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.

This content will be applicable to anyone collecting, or looking to
collect, vast amounts of metrics in a cloud or datacenter setting
and making sense of them.

Published in: Engineering



  • 1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead Jos Boumans - @jiboumans
  • 2. RIPE NCC Engineering manager for RIPE Database
  • 3. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10
  • 4. KRUX VP of Operations & Infrastructure
  • 7. 0 35,000 70,000 105,000 140,000 AVERAGE DATA EVENTS / SEC Twitter: New Tweets Wikipedia: Page Views Facebook: Messages Sent Krux: New Data Points
  • 8. 0 500,000,000 1,000,000,000 1,500,000,000 2,000,000,000 MONTHLY UNIQUE USERS
  • 9. DATA IS EVERYTHING Always know what’s going on
  • 10. UNIQUE METRICS Unique metrics received, per second
  • 11. METRICS & VISUALIZATION … and a little bit of monitoring
  • 12. VISUALIZATION MATTERS Humans are good at patterns & shapes
  • 13. INSIGHT MATTERS We consider it a core competence
  • 14. SHOW EVERYONE And better yet, encourage people to add their own
  • 16. KEY CHARACTERISTICS … of our metrics collection
  • 17. WHAT TO VISUALIZE Pick your operational KPIs
  • 18. REQUEST & ERROR RATES The baseline for everything else
  • 19. WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
  • 20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  • 21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  • 22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  • 23. WHAT TO CAPTURE Everything. No, really.
  • 24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second
  • 25. GRAPHITE, STATSD & COLLECTD The Trifecta
  • 26. COLLECTD Open Source Monitoring Tool
  • 27. STATSD Simple stats collector service
  • 28. STATSD NAMING SCHEME
        stats.             # to distinguish from events
        $environment.      # prod, dev, etc
        $cluster_name.     # api-ash, www-dub, etc
        $application.      # webapp, login, etc
        $metric_name_here. # any key the app wants
        $hostname          # node the stat came from
  • 29. STATSD CONFIGURATION
        {
          graphite: {
            globalPrefix: stats.$env.$cluster_name,
            globalSuffix: require('os').hostname().split('.')[0],
            legacyNamespace: false,
          },
          percentThreshold: [ 95, 99 ],
          deleteIdleStats: true,
        }
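What percentThreshold: [ 95, 99 ] does, in miniature: each flush interval, statsd sorts the timer samples and reports statistics over only the fastest N percent, so one slow outlier doesn't dominate. A simplified sketch (statsd's own rounding at the cutoff differs slightly):

```python
import math

def upper_percentile(values, pct):
    """Largest value among the fastest pct% of samples, in the spirit of
    statsd's upper_95 / upper_99 timer stats (cutoff rounding simplified)."""
    ordered = sorted(values)
    keep = int(math.ceil(pct / 100.0 * len(ordered)))
    return max(ordered[:keep])

# One slow outlier in 20 samples is excluded from upper_95...
print(upper_percentile(list(range(1, 20)) + [1000], 95))  # 19
# ...but still shows up in upper_99.
print(upper_percentile(list(range(1, 20)) + [1000], 99))  # 1000
```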
  • 30. GRAPHITE Metric store & Graph UI
  • 31. GRAPHITE SETUP At least one graphite server per data center
  • 32. DATA RETENTION
        [default]
        pattern = .*
        priority = 110
        retentions = 10:6h,60:15d,600:5y
        xFilesFactor = 0
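A retention line like this fixes each metric's disk footprint up front, because whisper preallocates a fixed-size record (roughly 12 bytes) per datapoint. A quick sizing check, treating the result as approximate:

```python
# retentions = 10:6h,60:15d,600:5y, expressed as (step, duration) in seconds
RETENTIONS = [(10, 6 * 3600), (60, 15 * 86400), (600, 5 * 365 * 86400)]
BYTES_PER_POINT = 12  # whisper stores a fixed-size record per datapoint

points = sum(duration // step for step, duration in RETENTIONS)
print(points)                                      # 286560 datapoints
print(round(points * BYTES_PER_POINT / 2**20, 1))  # ~3.3 MB per metric
```

At a million unique metrics that is on the order of terabytes, which is why the next slides call out IOPS and per-datacenter Graphite servers.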
  • 33. STANDARD AGGREGATIONS
        # Average & Sum for timers
        <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>
        <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>

        # Min / Max for Lower / Upper
        <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper
        <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
  • 34. PERFORMANCE First problem: IOPS Second problem: CPU
  • 35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self hosted monitoring
  • 36. GRAPHITE.JS Custom dashboards using jQuery
  • 37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint
  • 38. INSTRUMENTATION Instrument your infrastructure, not just your apps
  • 39. APACHE Use mod_statsd to capture stats directly from the Apache request
  • 40. BASIC CONFIGURATION
        <Location /api>
          Statsd On
          StatsdPrefix apache
        </Location>

        $ curl http://localhost/api/foo?id=42
        Stat:|ms
  • 41. VARNISH use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request
  • 42. BASIC CONFIGURATION
        # pseudo code
        import statsd;
        import timers;

        sub vcl_deliver {
          statsd.timing(
            $backend +   # from req.backend
            $hit_miss +  # from obj.hits
            $resp_code,  # from obj.status
            timers.req_response_time()
          );
        }
  • 43. SAMPLE GRAPH The request per second & response time graphs are coming straight from varnish
  • 44. PYTHON Create a base library in your language of choice
  • 45. KRUX-STDLIB $ pip install krux-stdlib
  • 46. BASIC APP USING STDLIB
        $ sample-app -h
        […]
        logging:
          --log-level {info,debug,critical,warning,error}
                                Verbosity of logging. (default: warning)
        stats:
          --stats               Enable sending statistics to statsd. (default: False)
          --stats-host STATS_HOST
                                Statsd host to send statistics to. (default: localhost)
          --stats-port STATS_PORT
                                Statsd port to send statistics to. (default: 8125)
          --stats-environment STATS_ENVIRONMENT
                                Statsd environment. (default: dev)
  • 47. BASIC APP USING STDLIB
        class App(krux.cli.Application):
            def __init__(self):
                ### Call to the superclass to bootstrap.
                super(App, self).__init__(name='sample-app')

            def run(self):
                stats = self.stats
                log = self.logger

                with stats.timer('run'):
          'running...')
                    ...
  • 48. CLI echo 'events.deploy.appname:1|c' | nc -u localhost 8125
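The nc one-liner works because statsd's wire format is just key:value|type in a UDP datagram; the same counter can be sent from Python with nothing but the standard library. A minimal fire-and-forget sketch:

```python
import socket

def send_counter(key, value=1, host='localhost', port=8125):
    """Emit a statsd counter ('key:value|c') over UDP, fire-and-forget."""
    payload = '%s:%d|c' % (key, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode('ascii'), (host, port))
    sock.close()
    return payload  # returned only so callers can inspect what was sent

send_counter('events.deploy.appname')  # same datagram as the nc one-liner
```

Because it is UDP, the sender never blocks and never fails the surrounding job if statsd is down, which is exactly the property you want in a deploy hook.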
  • 49. JAVASCRIPT Use a simple HTTP endpoint to send stats
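The slide doesn't specify the endpoint's implementation; one plausible shape is a tiny HTTP handler that extracts a key from the query string and relays it to statsd over UDP (function names here are hypothetical):

```python
import socket
from urllib.parse import urlparse, parse_qs

def payload_from_path(path):
    """Map a request path like /stats?key=js.page.load to 'js.page.load:1|c'."""
    query = parse_qs(urlparse(path).query)
    key = query.get('key', ['unknown'])[0]
    return '%s:1|c' % key

def relay(path, host='localhost', port=8125):
    """Forward the counter extracted from an HTTP path to statsd over UDP."""
    payload = payload_from_path(path)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode('ascii'), (host, port))
    sock.close()
    return payload

relay('/stats?key=js.page.load')  # sends 'js.page.load:1|c'
```

In production such an endpoint would sit behind the same Apache/Varnish tier as everything else, so the relay itself shows up in the request and error-rate graphs too.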
  • 50. SUPERVISOR Instrument Supervisord using Sulphite
  • 51. BASIC CONFIGURATION
        # Install from PyPI
        $ pip install sulphite

        # Setup as eventlistener in Supervisor
        [eventlistener:sulphite]
        command=sulphite --graphite-server=…
        events=PROCESS_STATE
        numprocs=1
  • 52. FATAL PROCESS EXITS Processes that exited unexpectedly and that Supervisor was unable to restart after N retries
  • 53. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite
  • 54. KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd
  • 55. BASIC USAGE
        # Charge to date for $service
        $ mon-get-stats EstimatedCharges \
            --namespace "AWS/Billing" \
            --statistics Sum \
            --dimensions "ServiceName=${service}" \
            --start-time $date
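EstimatedCharges is a cumulative month-to-date figure, so to graph daily spend you send the delta between successive samples rather than the raw value. A sketch of that step (helper name hypothetical):

```python
def daily_spend(samples):
    """Convert cumulative month-to-date charges (oldest first) into
    per-interval deltas suitable for sending to statsd as gauges."""
    return [round(later - earlier, 2)
            for earlier, later in zip(samples, samples[1:])]

# Three daily samples of the cumulative bill → two days of actual spend:
print(daily_spend([10.0, 14.5, 21.0]))  # [4.5, 6.5]
```

With the deltas in Graphite next to request rates, a jump in spend can be lined up against the traffic or deploy event that caused it.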
  • 56. Q & A @jiboumans