How to measure everything - a million metrics per second with minimal developer overhead



Krux is an infrastructure provider for many of the websites you
use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For
every request on those properties, Krux will get one or more as
well. We grew from zero traffic to several billion requests per
day in the span of 2 years, and we did so exclusively in AWS.

To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost and without burdening developers is a tremendous
challenge.

Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2000/month using off-the-shelf Open Source software and
some code we've released as Open Source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.

This content will be applicable for anyone collecting or desiring to
collect vast amounts of metrics in a cloud or datacenter setting and
making sense of them.



Presentation Transcript

  • HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead Jos Boumans - @jiboumans http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
  • RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  • CANONICAL http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 Engineering manager for Ubuntu Server 10.04 & 10.10 http://www.ubuntu.com/business/server/overview
  • KRUX VP of Operations & Infrastructure http://www.krux.com/
  • A LOT OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  • AVERAGE DATA EVENTS / SEC [Bar chart comparing Twitter (new tweets), Wikipedia (page requests) and Krux (new data points); axis from 0 to 40,000] http://www.statisticbrain.com/twitter-statistics/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  • MONTHLY UNIQUE USERS [Bar chart; axis from 0 to 1,200,000,000] http://en.wikipedia.org/wiki/Wikipedia http://www.statisticbrain.com/twitter-statistics/
  • DATA IS EVERYTHING Always know what’s going on http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
  • UNIQUE METRICS Unique metrics received, per second
  • METRICS & VISUALIZATION … and a little bit of monitoring http://getfit101.files.wordpress.com/2012/04/visualization.jpg
  • VISUALIZATION MATTERS Humans are good at patterns & shapes http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
  • INSIGHT MATTERS We consider it a core competence http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
  • SHOW EVERYONE And better yet, encourage people to add their own http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
  • KEY CHARACTERISTICS … of our metrics collection http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
  • WHAT TO VISUALIZE Pick your operational KPIs http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
  • REQUEST & ERROR RATES The baseline for everything else
  • WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
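A minimal sketch of what "worst upper 95th across a cluster" could mean in practice. It assumes the upper percentile is the largest sample left after discarding the slowest few percent, which is how StatsD's `percentThreshold` is commonly described; the function names and sample data are hypothetical.

```python
def upper_percentile(timings_ms, percent):
    """Largest sample among the fastest `percent` of the samples."""
    if not timings_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(timings_ms)
    # Number of samples inside the threshold, rounded half up, at least one.
    keep = max(1, int(len(ordered) * percent / 100 + 0.5))
    return ordered[keep - 1]


def worst_across_cluster(per_node_timings, percent):
    """Take each node's upper percentile and report the worst (highest) one."""
    return max(upper_percentile(t, percent) for t in per_node_timings.values())


if __name__ == "__main__":
    cluster = {
        "web1": list(range(1, 101)),  # 1..100 ms
        "web2": [200] * 10,           # one consistently slow node
    }
    print(worst_across_cluster(cluster, 95))
```

Taking the maximum across nodes, rather than an average, keeps a single slow machine visible on the cluster graph.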
  • TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  • CAPACITY / THRESHOLDS How much traffic can your service sustain?
  • SINGLE SERVICE OVERVIEW Create a single graph for every service
  • WHAT TO CAPTURE Everything. No, really. http://arkansasagnews.uark.edu/monarchs95.jpg
  • INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
  • COLLECTD Open Source Monitoring Tool https://collectd.org/ https://collectd.org/wiki/index.php/Plugin:StatsD
  • STATSD Simple stats collector service https://github.com/etsy/statsd http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
  • STATSD NAMING SCHEME
      stats.               # to distinguish from events
      $environment.        # prod, dev, etc
      $clustername.        # api-ash, www-dub, etc
      $application.        # webapp, login, etc
      $metric_name_here.   # any key the app wants
      $hostname            # node the stat came from
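The naming scheme above can be sketched as a small helper that assembles the full Graphite path. This is only an illustration: in the setup described in the slides, StatsD itself adds the prefix and hostname suffix, and the function name and arguments here are hypothetical.

```python
import socket


def metric_key(environment, cluster, application, metric, hostname=None):
    """Build stats.$environment.$clustername.$application.$metric.$hostname."""
    # Use only the short hostname so the dots in a FQDN do not add path levels.
    host = (hostname or socket.gethostname()).split(".")[0]
    parts = ["stats", environment, cluster, application, metric, host]
    if not all(parts):
        raise ValueError("every component of the metric key must be non-empty")
    return ".".join(parts)


if __name__ == "__main__":
    print(metric_key("prod", "api-ash", "webapp", "login.success", "web01.example.com"))
```

Keeping the hostname as the last component means a wildcard on the final path level selects every node in a cluster.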
  • STATSD CONFIGURATION
      {
        graphite: {
          // $env and $clustername are placeholders, filled in per environment
          globalPrefix: "stats.$env.$clustername",
          globalSuffix: require('os').hostname().split('.')[0],
          legacyNamespace: false,
        },
        percentThreshold: [ 95, 99 ],
        deleteIdleStats: true,
      }
      https://github.com/etsy/statsd/blob/master/exampleConfig.js
  • GRAPHITE Metric store & Graph UI http://graphite.wikidot.com/ http://graphite.readthedocs.org/en/latest/
  • GRAPHITE SETUP At least one graphite server per data center
  • DATA RETENTION
      [default]
      pattern = .*
      priority = 110
      retentions = 10:6h,60:15d,600:5y
      xFilesFactor = 0
      http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
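To get a feel for what the retention string above costs in storage, the following sketch counts the data points each archive holds. It assumes a year is 365 days and that Whisper stores roughly 12 bytes per point (file header overhead is ignored); the helper names are hypothetical.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}
POINT_BYTES = 12  # assumption: 4-byte timestamp plus 8-byte value per point


def parse_retentions(spec):
    """Turn '10:6h,60:15d' into [(seconds_per_point, number_of_points), ...]."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        seconds_per_point = int(precision)
        total_seconds = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        archives.append((seconds_per_point, total_seconds // seconds_per_point))
    return archives


def bytes_per_metric(spec):
    """Approximate on-disk size of one metric, ignoring the file header."""
    return sum(points for _, points in parse_retentions(spec)) * POINT_BYTES


if __name__ == "__main__":
    spec = "10:6h,60:15d,600:5y"
    print(parse_retentions(spec))
    print(bytes_per_metric(spec))
```

Under these assumptions each metric takes a little over 3 MB, so the number of unique metrics, more than the data rate, drives disk usage.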
  • STANDARD AGGREGATIONS
      # Average & Sum for timers
      <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>
      <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>
      # Min / Max for Lower / Upper
      <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper
      <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
      http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
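The effect of the four aggregation rules above can be sketched in a few lines. This is an illustration of the arithmetic only, not how carbon-aggregator is implemented; the function name and the input layout (node name mapped to a dict of stat types) are assumptions.

```python
def roll_up(per_node):
    """Combine per-node timer stats into cluster-wide '_totals' values."""
    totals = {}
    stat_types = sorted({t for stats in per_node.values() for t in stats})
    for stat_type in stat_types:
        values = [stats[stat_type] for stats in per_node.values() if stat_type in stats]
        # Rule 1: every stat type gets an average across nodes.
        totals[stat_type + ".avg"] = sum(values) / len(values)
        # Rule 2: sums skip types starting with "upper" or "lower",
        # since adding up extremes is not meaningful.
        if not stat_type.startswith(("upper", "lower")):
            totals[stat_type + ".sum"] = sum(values)
        # Rules 3 and 4: the cluster's upper is the max, its lower the min.
        if stat_type == "upper":
            totals["upper"] = max(values)
        if stat_type == "lower":
            totals["lower"] = min(values)
    return totals


if __name__ == "__main__":
    nodes = {
        "web1": {"count": 10, "upper": 50, "lower": 2},
        "web2": {"count": 30, "upper": 80, "lower": 1},
    }
    print(roll_up(nodes))
```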
  • PERFORMANCE First problem: IOPS Second problem: CPU http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
  • GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self-hosted monitoring http://circonus.com http://zabbix.com https://github.com/lyft/circonus-statsd-backend https://github.com/dlecocq/statsd-zabbix
  • GRAPHITE.JS Custom dashboards using jQuery https://github.com/prestontimmons/graphitejs http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
  • COST Optimize for adoption rates in your organization by eliminating cost as a constraint http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
  • INSTRUMENTATION Instrument your infrastructure, not just your apps http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
  • APACHE Use mod_statsd to capture stats directly from the Apache request http://kaleidos.net/files/images/apache318x260.png http://httpd.apache.org/ https://github.com/jib/mod_statsd
  • BASIC CONFIGURATION
      <Location /api>
        Statsd On
        StatsdPrefix apache
      </Location>
      https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
      $ curl http://localhost/api/foo?id=42
      Stat: apache.api.foo.GET.200:31|ms
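The mapping from request to stat shown in the example above (prefix, path segments, method, status code, then the timing in milliseconds) can be sketched as follows. This mirrors only the single example on the slide; it is not mod_statsd's code, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def request_stat(prefix, method, url, status, duration_ms):
    """Format a StatsD timer line for one HTTP request."""
    # Drop the query string and turn each path segment into a key level.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    key = ".".join([prefix, *segments, method, str(status)])
    return f"{key}:{duration_ms}|ms"


if __name__ == "__main__":
    print(request_stat("apache", "GET", "http://localhost/api/foo?id=42", 200, 31))
```

Dropping the query string keeps the number of distinct keys bounded; path segments that contain dots or unbounded IDs would need extra handling that this sketch leaves out.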
  • VARNISH Use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A https://www.varnish-cache.org/ https://github.com/jib/libvmod-statsd
  • BASIC CONFIGURATION
      # pseudo code
      import statsd;
      import timers;
      sub vcl_deliver {
        statsd.timing(
          $backend +      # from req.backend
          $hit_miss +     # from obj.hits
          $resp_code,     # from obj.status
          timers.req_response_time()
        )
      }
      https://github.com/jib/libvmod-statsd/blob/master/README.rst
      http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  • SAMPLE GRAPH The request per second & response time graphs are coming straight from varnish
  • PYTHON Create a base library in your language of choice
  • KRUX-STDLIB
      pip install krux-stdlib --extra-index-url=https://staticfiles.krxd.net/foss/pypi/
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • BASIC APP USING STDLIB
      $ sample-app -h
      […]
      logging:
        --log-level {info,debug,critical,warning,error}
                              Verbosity of logging. (default: warning)
      stats:
        --stats               Enable sending statistics to statsd. (default: False)
        --stats-host STATS_HOST
                              Statsd host to send statistics to. (default: localhost)
        --stats-port STATS_PORT
                              Statsd port to send statistics to. (default: 8125)
        --stats-environment STATS_ENVIRONMENT
                              Statsd environment. (default: dev)
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • BASIC APP USING STDLIB
      import krux.cli
      class App(krux.cli.Application):
          def __init__(self):
              ### Call to the superclass to bootstrap.
              super(App, self).__init__(name = 'sample-app')
          def run(self):
              stats = self.stats
              log = self.logger
              # Time the block and report it to statsd as 'run'
              with stats.timer('run'):
                  log.info('running...')
                  ...
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
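For readers without krux-stdlib at hand, the idea behind a `stats.timer('run')` block can be sketched with the standard library alone. This is not the krux-stdlib implementation; the `TimerSketch` class and its `send` callback are hypothetical stand-ins used only to show the pattern.

```python
import time
from contextlib import contextmanager


class TimerSketch:
    """Times a block of code and hands a StatsD timer line to `send`."""

    def __init__(self, prefix, send):
        self._prefix = prefix
        self._send = send  # any callable that accepts one string

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Report even if the block raised, so failures are timed too.
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self._send(f"{self._prefix}.{name}:{elapsed_ms}|ms")


if __name__ == "__main__":
    stats = TimerSketch("sample-app", print)
    with stats.timer("run"):
        time.sleep(0.05)
```

Wrapping the timing in a context manager is what keeps the developer overhead low: instrumenting a code path costs one `with` line.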
  • CLI
      echo 'events.deploy.appname:1|c' | nc localhost -u 8125
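The same deploy event can be sent from Python with a plain UDP socket. A minimal sketch, assuming a StatsD daemon listens on UDP port 8125 as in the command above; the function names are hypothetical.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line such as 'events.deploy.appname:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="localhost", port=8125):
    """Fire-and-forget a counter over UDP and return the bytes that were sent."""
    payload = format_counter(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    send_counter("events.deploy.appname")
```

Because UDP is connectionless, the call does not wait for an acknowledgement, so a missing or slow StatsD daemon does not hold up the application.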
  • JAVASCRIPT Use a simple HTTP endpoint to send stats
  • PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite http://docs.puppetlabs.com/guides/reporting.html https://github.com/krux/puppet-module-graphite-report
  • KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • BASIC USAGE
      # Charge to date for $service
      $ mon-get-stats EstimatedCharges \
          --namespace "AWS/Billing" \
          --statistics Sum \
          --dimensions "ServiceName=${service}" \
          --start-time $date
      http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • Q & A http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html @jiboumans http://slideshare.net/jiboumans
  • SUPERVISOR Instrument Supervisord using Sulphite http://www.dilbertcelart.com/dale/c26.jpg http://supervisord.org/ https://github.com/jib/sulphite
  • BASIC CONFIGURATION
      # Install from Krux pypi repo
      pip install sulphite --extra-index-url=https://staticfiles.krxd.net/foss/pypi/
      # Setup as eventlistener in Supervisor
      [eventlistener:sulphite]
      command=sulphite --graphite-server=…
      events=PROCESS_STATE
      numprocs=1
      http://supervisord.org/events.html
      https://github.com/jib/sulphite/blob/master/README.md
  • MOST DEVIANT NODES Find machines that behave differently from the others
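One simple way to find machines that behave differently from the others is to rank nodes by their distance from the cluster median. The slides do not say which method was used, so this is only a sketch under that assumption; the function name and sample values are hypothetical.

```python
from statistics import median


def most_deviant(node_values, top=3):
    """Return the `top` node names furthest from the cluster median."""
    center = median(node_values.values())
    ranked = sorted(
        node_values.items(),
        key=lambda item: abs(item[1] - center),
        reverse=True,
    )
    return [name for name, _ in ranked[:top]]


if __name__ == "__main__":
    response_ms = {"web1": 100, "web2": 102, "web3": 98, "web4": 250}
    print(most_deviant(response_ms, top=1))
```

The median is used instead of the mean so that one badly behaving node does not shift the baseline it is compared against.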
  • TRENDS OVER TIME Day over day is not always the most representative