How to measure everything - a million metrics per second with minimal developer overhead

Krux is an infrastructure provider for many of the websites you
use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For
every request on those properties, Krux receives one or more
requests as well. We grew from zero traffic to several billion
requests per day in the span of 2 years, and we did so exclusively in AWS.

To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost and without burdening developers is a tremendous
challenge.

Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2000/month using off-the-shelf open source software and
some code we've released as open source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.

This content will be applicable for anyone collecting or desiring to
collect vast amounts of metrics in a cloud or datacenter setting and
making sense of them.

Published in: Engineering

Transcript

  1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead. Jos Boumans - @jiboumans http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
  2. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  3. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  4. KRUX VP of Operations & Infrastructure http://www.krux.com/
  5. SOME OF OUR CUSTOMERS
  6. A LOT OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  7. 0 35,000 70,000 105,000 140,000 AVERAGE DATA EVENTS / SEC http://investor.fb.com/results.cfm Twitter: New Tweets Wikipedia: Page Views Facebook: Messages Sent Krux: New Data Points http://www.statisticbrain.com/twitter-statistics/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  8. 0 500,000,000 1,000,000,000 1,500,000,000 2,000,000,000 MONTHLY UNIQUE USERS http://reportcard.wmflabs.org/ http://www.statisticbrain.com/twitter-statistics/ http://newsroom.fb.com/company-info/
  9. DATA IS EVERYTHING Always know what’s going on http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
  10. UNIQUE METRICS Unique metrics received, per second
  11. METRICS & VISUALIZATION … and a little bit of monitoring http://getfit101.files.wordpress.com/2012/04/visualization.jpg
  12. VISUALIZATION MATTERS Humans are good at patterns & shapes http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
  13. INSIGHT MATTERS We consider it a core competence http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
  14. SHOW EVERYONE And better yet, encourage people to add their own http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
  15. THE BOTTOM LINE
  16. KEY CHARACTERISTICS … of our metrics collection http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
  17. WHAT TO VISUALIZE Pick your operational KPIs http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
  18. REQUEST & ERROR RATES The baseline for everything else
  19. WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
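A minimal sketch of the idea on this slide, assuming each node reports its raw timings: compute an "upper Nth percentile" per node (the largest sample left after discarding the slowest tail, in the spirit of StatsD's `upper_95`), then take the worst value across the cluster. The function names and the sample data layout are illustrative assumptions, not taken from the talk.

```python
def upper_percentile(values, pct):
    """Largest sample left after dropping the slowest (100 - pct)% of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Number of samples discarded from the top; always keep at least one.
    drop = int(round((100 - pct) * len(ordered) / 100.0))
    keep = max(1, len(ordered) - drop)
    return ordered[keep - 1]


def worst_across_cluster(per_node_samples, pct):
    """Worst (highest) upper percentile seen on any node in the cluster."""
    return max(
        upper_percentile(samples, pct)
        for samples in per_node_samples.values()
        if samples
    )
```

Taking the maximum rather than the average keeps a single slow node visible on the cluster graph.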
  20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  23. WHAT TO CAPTURE Everything. No, really. http://arkansasagnews.uark.edu/monarchs95.jpg
  24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
  25. GRAPHITE, STATSD & COLLECTD The Trifecta
  26. COLLECTD Open Source Monitoring Tool https://collectd.org/ https://collectd.org/wiki/index.php/Plugin:StatsD
  27. STATSD Simple stats collector service https://github.com/etsy/statsd http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
  28. STATSD NAMING SCHEME
      stats.              # to distinguish from events
      $environment.       # prod, dev, etc
      $cluster_name.      # api-ash, www-dub, etc
      $application.       # webapp, login, etc
      $metric_name_here.  # any key the app wants
      $hostname           # node the stat came from
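The naming scheme above can be sketched as a small helper. This is an illustration only: the function name and argument names are assumptions, and the short-hostname handling simply mirrors the `split('.')[0]` used for the suffix in the StatsD configuration slide.

```python
def metric_name(environment, cluster, application, metric, hostname):
    """Build a full metric path following the scheme on the slide."""
    # Keep only the short hostname so dots in the FQDN do not add levels.
    short_host = hostname.split(".")[0]
    parts = ["stats", environment, cluster, application, metric, short_host]
    if not all(parts):
        raise ValueError("every component of the metric name must be non-empty")
    return ".".join(parts)
```

Putting the hostname last means a wildcard on the final component selects every node in a cluster for the same metric.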
  29. STATSD CONFIGURATION
      {
        graphite: {
          // $env and $cluster_name are placeholders
          globalPrefix: "stats.$env.$cluster_name",
          globalSuffix: require('os').hostname().split('.')[0],
          legacyNamespace: false,
        },
        percentThreshold: [ 95, 99 ],
        deleteIdleStats: true,
      }
      https://github.com/etsy/statsd/blob/master/exampleConfig.js
  30. GRAPHITE Metric store & Graph UI http://graphite.wikidot.com/ http://graphite.readthedocs.org/en/latest/
  31. GRAPHITE SETUP At least one graphite server per data center
  32. DATA RETENTION
      [default]
      pattern = .*
      priority = 110
      retentions = 10:6h,60:15d,600:5y
      xFilesFactor = 0
      http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
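To get a feel for what this retention policy costs on disk, here is a rough sketch that parses the `retentions` string and estimates the size of one Whisper file. The byte sizes (12 bytes per data point, a 16-byte file header, 12 bytes of header per archive) and the 365-day year are assumptions about the Whisper format, so treat the result as an estimate.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}


def parse_retentions(spec):
    """Turn '10:6h,60:15d' into a list of (seconds_per_point, points)."""
    archives = []
    for part in spec.split(","):
        step_text, duration = part.strip().split(":")
        step = int(step_text)
        if duration[-1].isdigit():
            seconds = int(duration)  # bare number: already in seconds
        else:
            seconds = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        archives.append((step, seconds // step))
    return archives


def estimated_bytes(archives, point_size=12, file_header=16, archive_header=12):
    """Approximate on-disk size of a single metric (sizes are assumptions)."""
    points = sum(count for _, count in archives)
    return file_header + archive_header * len(archives) + points * point_size
```

Under those assumptions, `estimated_bytes(parse_retentions("10:6h,60:15d,600:5y"))` comes to roughly 3.4 MB per metric, which makes the disk budget for a given number of unique metrics easy to estimate.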
  33. STANDARD AGGREGATIONS
      # Average & Sum for timers
      <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>
      <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>

      # Min / Max for Lower / Upper
      <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper
      <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
      http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
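The effect of these rules can be sketched in a few lines: per-node timer stats are rolled up into cluster-wide `_totals`, using the average for every type, the sum for everything except `upper` and `lower`, the maximum for `upper`, and the minimum for `lower`. The function name and the input layout (hostname mapped to a dict of stat type and value) are assumptions made for illustration; carbon-aggregator itself works on the metric stream, not on dictionaries.

```python
def aggregate_timers(per_node):
    """Roll up {hostname: {stat_type: value}} into cluster-wide totals."""
    types = set()
    for stats in per_node.values():
        types.update(stats)

    totals = {}
    for stat_type in sorted(types):
        values = [s[stat_type] for s in per_node.values() if stat_type in s]
        totals[stat_type + ".avg"] = sum(values) / float(len(values))
        if stat_type == "upper":
            totals["upper"] = max(values)  # worst case across nodes
        elif stat_type == "lower":
            totals["lower"] = min(values)  # best case across nodes
        else:
            totals[stat_type + ".sum"] = sum(values)
    return totals
```

Summing `upper` or `lower` across nodes would produce a number with no meaning, which is why the sum rule excludes them.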
  34. PERFORMANCE First problem: IOPS Second problem: CPU http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
  35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self hosted monitoring http://circonus.com http://zabbix.com https://github.com/lyft/circonus-statsd-backend https://github.com/dlecocq/statsd-zabbix
  36. GRAPHITE.JS Custom dashboards using jQuery https://github.com/prestontimmons/graphitejs http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
  37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
  38. INSTRUMENTATION Instrument your infrastructure, not just your apps http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
  39. APACHE Use mod_statsd to capture stats directly from the Apache request http://kaleidos.net/files/images/apache318x260.png http://httpd.apache.org/ https://github.com/jib/mod_statsd
  40. BASIC CONFIGURATION
      <Location /api>
        Statsd On
        StatsdPrefix apache
      </Location>

      $ curl http://localhost/api/foo?id=42

      Stat: apache.api.foo.GET.200:31|ms
      https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
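The stat shown above follows a recognizable pattern: prefix, path segments, HTTP method, status code, then the elapsed time as a timer. The sketch below reproduces that pattern for illustration; it is not mod_statsd's implementation, and the function name and arguments are assumptions.

```python
from urllib.parse import urlsplit


def request_stat(prefix, method, url, status, elapsed_ms):
    """Format a StatsD timer line shaped like the example on the slide."""
    # Drop the query string and turn each path segment into a name level.
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    name = ".".join([prefix] + segments + [method, str(status)])
    return "%s:%d|ms" % (name, elapsed_ms)
```

For the request on the slide, `request_stat("apache", "GET", "http://localhost/api/foo?id=42", 200, 31)` yields the same line as the one shown. Path segments that contain dots would add extra levels to the metric name, so a real implementation would need to escape them.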
  41. VARNISH use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A https://www.varnish-cache.org/ https://github.com/jib/libvmod-statsd
  42. BASIC CONFIGURATION
      # pseudo code
      import statsd;
      import timers;

      sub vcl_deliver {
        statsd.timing(
          $backend +    # from req.backend
          $hit_miss +   # from obj.hits
          $resp_code,   # from obj.status
          timers.req_response_time()
        );
      }
      https://github.com/jib/libvmod-statsd/blob/master/README.rst http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  43. SAMPLE GRAPH The request per second & response time graphs are coming straight from varnish
  44. PYTHON Create a base library in your language of choice https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  45. KRUX-STDLIB $ pip install krux-stdlib https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  46. BASIC APP USING STDLIB
      $ sample-app -h
      […]

      logging:
        --log-level {info,debug,critical,warning,error}
              Verbosity of logging. (default: warning)
      stats:
        --stats
              Enable sending statistics to statsd. (default: False)
        --stats-host STATS_HOST
              Statsd host to send statistics to. (default: localhost)
        --stats-port STATS_PORT
              Statsd port to send statistics to. (default: 8125)
        --stats-environment STATS_ENVIRONMENT
              Statsd environment. (default: dev)
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  47. BASIC APP USING STDLIB
      import krux.cli

      class App(krux.cli.Application):
          def __init__(self):
              # Call the superclass to bootstrap.
              super(App, self).__init__(name='sample-app')

          def run(self):
              stats = self.stats
              log = self.logger

              # Time the body of run() and report it under the 'run' key.
              with stats.timer('run'):
                  log.info('running...')
                  ...
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/ https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
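For readers without krux-stdlib at hand, the `stats.timer('run')` pattern can be approximated with a small context manager. This is a stand-in sketch, not the krux-stdlib API: the `timer` function and its `emit` and `clock` parameters are hypothetical and exist only to keep the example self-contained.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name, emit=print, clock=time.time):
    """Time the enclosed block and emit a StatsD-style timer line."""
    start = clock()
    try:
        yield
    finally:
        # Report even if the block raised, so failures are still timed.
        elapsed_ms = int(round((clock() - start) * 1000))
        emit("%s:%d|ms" % (name, elapsed_ms))
```

Wrapping work in `with timer('run'):` keeps the instrumentation to a single line per block, which is the kind of low developer overhead the talk argues for.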
  48. CLI
      $ echo 'events.deploy.appname:1|c' | nc -u localhost 8125
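The same one-liner can be written in Python with nothing but the standard library. A minimal sketch, assuming a StatsD daemon listening on UDP port 8125 as in the slide; the helper names `format_counter` and `send_stat` are illustrative.

```python
import socket


def format_counter(name, value=1):
    """Format a StatsD counter increment, e.g. 'events.deploy.appname:1|c'."""
    return "%s:%d|c" % (name, value)


def send_stat(payload, host="localhost", port=8125):
    """Send one stat over UDP; fire-and-forget, no reply is expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

A deploy script could call `send_stat(format_counter("events.deploy.appname"))` to mark the deploy as an event. Because UDP is connectionless, the call does not block or fail when no daemon is listening, so instrumentation cannot take the application down.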
  49. JAVASCRIPT Use a simple HTTP endpoint to send stats
  50. SUPERVISOR Instrument Supervisord using Sulphite http://www.dilbertcelart.com/dale/c26.jpg http://supervisord.org/ https://github.com/jib/sulphite
  51. BASIC CONFIGURATION
      # Install from PyPi
      $ pip install sulphite

      # Setup as eventlistener in Supervisor
      [eventlistener:sulphite]
      command=sulphite --graphite-server=…
      events=PROCESS_STATE
      numprocs=1
      http://supervisord.org/events.html https://github.com/jib/sulphite/blob/master/README.md
  52. FATAL PROCESS EXITS Processes that exited unexpectedly, and supervisor was unable to restart after N retries
  53. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite http://docs.puppetlabs.com/guides/reporting.html https://github.com/krux/puppet-module-graphite-report
  54. KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  55. BASIC USAGE
      # Charge to date for $service
      $ mon-get-stats EstimatedCharges \
          --namespace "AWS/Billing" \
          --statistics Sum \
          --dimensions "ServiceName=${service}" \
          --start-time $date
      http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  56. Q & A http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html @jiboumans http://slideshare.net/jiboumans
