How to measure everything - a million metrics per second with minimal developer overhead

Krux is an infrastructure provider for many of the websites you
use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For
every request on those properties, Krux will get one or more as
well. We grew from zero traffic to several billion requests per
day in the span of 2 years, and we did so exclusively in AWS.

To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost and without burdening developers is a tremendous
challenge.

Join me in this session to learn how we overcame this challenge
at Krux; I will share with you the details of how we set up our
global infrastructure, entirely managed by Puppet, to capture over
a million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself,
for under $2000/month using off-the-shelf Open Source software and
some code we've released as Open Source ourselves. In addition, I’ll
show you how you can take (a subset of) these metrics and send them
to advanced analytics and alerting tools like Circonus or Zabbix.

This content will be applicable for anyone collecting or desiring to
collect vast amounts of metrics in a cloud or datacenter setting and
making sense of them.

Transcript

  • 1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead Jos Boumans - @jiboumans http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
  • 2. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  • 3. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  • 4. KRUX VP of Operations & Infrastructure http://www.krux.com/
  • 5. SOME OF OUR CUSTOMERS
  • 6. A LOT OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  • 7. AVERAGE DATA EVENTS / SEC [bar chart, axis from 0 to 140,000] Twitter: New Tweets, Wikipedia: Page Views, Facebook: Messages Sent, Krux: New Data Points http://investor.fb.com/results.cfm http://www.statisticbrain.com/twitter-statistics/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  • 8. MONTHLY UNIQUE USERS [bar chart, axis from 0 to 2,000,000,000] http://reportcard.wmflabs.org/ http://www.statisticbrain.com/twitter-statistics/ http://newsroom.fb.com/company-info/
  • 9. DATA IS EVERYTHING Always know what’s going on http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
  • 10. UNIQUE METRICS Unique metrics received, per second
  • 11. METRICS & VISUALIZATION … and a little bit of monitoring http://getfit101.files.wordpress.com/2012/04/visualization.jpg
  • 12. VISUALIZATION MATTERS Humans are good at patterns & shapes http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
  • 13. INSIGHT MATTERS We consider it a core competence http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
  • 14. SHOW EVERYONE And better yet, encourage people to add their own http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
  • 15. THE BOTTOM LINE
  • 16. KEY CHARACTERISTICS … of our metrics collection http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
  • 17. WHAT TO VISUALIZE Pick your operational KPIs http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
  • 18. REQUEST & ERROR RATES The baseline for everything else
  • 19. WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
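The "worst response times" idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a nearest-rank percentile and hypothetical per-node timing samples; statsd's own `upper_95` calculation may round differently, and the names `upper_percentile` and `worst_across_cluster` are made up for this example.

```python
import math


def upper_percentile(samples, percent):
    """Return the largest value within the lowest `percent` of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: keep ceil(percent% of n) samples and report the largest.
    keep = max(1, math.ceil(len(ordered) * percent / 100))
    return ordered[keep - 1]


def worst_across_cluster(per_node_samples, percent):
    """Take each node's upper percentile, then the maximum over all nodes."""
    return max(upper_percentile(s, percent) for s in per_node_samples.values())


# Hypothetical response times in milliseconds for two nodes.
timings = {
    "web1": list(range(1, 101)),  # 1..100 ms
    "web2": [5, 7, 9, 250],
}
print(worst_across_cluster(timings, 95))
```

Taking the maximum rather than the average of the per-node values keeps a single slow node visible on a cluster-wide graph.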
  • 20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  • 21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  • 22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  • 23. WHAT TO CAPTURE Everything. No, really. http://arkansasagnews.uark.edu/monarchs95.jpg
  • 24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
  • 25. GRAPHITE, STATSD & COLLECTD The Trifecta
  • 26. COLLECTD Open Source Monitoring Tool https://collectd.org/ https://collectd.org/wiki/index.php/Plugin:StatsD
  • 27. STATSD Simple stats collector service https://github.com/etsy/statsd http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
  • 28. STATSD NAMING SCHEME
        stats.                # to distinguish from events
        $environment.         # prod, dev, etc
        $cluster_name.        # api-ash, www-dub, etc
        $application.         # webapp, login, etc
        $metric_name_here.    # any key the app wants
        $hostname             # node the stat came from
  • 29. STATSD CONFIGURATION
        {
          graphite: {
            globalPrefix: "stats.$env.$cluster_name",
            globalSuffix: require('os').hostname().split('.')[0],
            legacyNamespace: false,
          },
          percentThreshold: [ 95, 99 ],
          deleteIdleStats: true,
        }
        https://github.com/etsy/statsd/blob/master/exampleConfig.js
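The naming scheme and the prefix/suffix configuration above can be pictured as one function that assembles the full metric name. This is a minimal sketch under stated assumptions: the helper `metric_name` and the sample values are hypothetical, and it follows only the six-part scheme from the slide. The aggregation rules later in the deck suggest the stored name also carries a type segment such as `timers`, which this sketch leaves out.

```python
def metric_name(environment, cluster, application, metric, hostname):
    """Join the parts of the naming scheme into one dotted metric name."""
    parts = [
        "stats",                 # distinguishes stats from events
        environment,             # prod, dev, etc
        cluster,                 # api-ash, www-dub, etc
        application,             # webapp, login, etc
        metric,                  # any key the app wants (may contain dots)
        hostname.split(".")[0],  # short host name, as in the globalSuffix
    ]
    if any(not part for part in parts):
        raise ValueError("every part of the metric name must be non-empty")
    return ".".join(parts)


# Hypothetical values, chosen only to illustrate the layout.
print(metric_name("prod", "api-ash", "webapp", "login.success", "web1.example.com"))
```

Because the environment, cluster and host parts come from configuration, the application only has to supply its own name and the metric key.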
  • 30. GRAPHITE Metric store & Graph UI http://graphite.wikidot.com/ http://graphite.readthedocs.org/en/latest/
  • 31. GRAPHITE SETUP At least one graphite server per data center
  • 32. DATA RETENTION
        [default]
        pattern = .*
        priority = 110
        retentions = 10:6h,60:15d,600:5y
        xFilesFactor = 0
        http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
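To get a feel for what the retention line above costs on disk, the arithmetic can be worked through in Python. This is a rough sketch: the parser handles only the `seconds:duration` form shown on the slide, a year is taken as 365 days, and the 12 bytes per datapoint is an assumption about the storage format (file headers are ignored).

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}
POINT_BYTES = 12  # assumed size of one stored datapoint (timestamp + value)


def parse_retentions(spec):
    """Turn '10:6h,60:15d' into a list of (seconds_per_point, points)."""
    archives = []
    for archive in spec.split(","):
        step_text, history_text = archive.strip().split(":")
        step = int(step_text)
        history = int(history_text[:-1]) * UNIT_SECONDS[history_text[-1]]
        archives.append((step, history // step))
    return archives


archives = parse_retentions("10:6h,60:15d,600:5y")
total_points = sum(points for _, points in archives)

print(archives)                    # one (step, points) pair per archive
print(total_points * POINT_BYTES)  # approximate bytes per metric
```

Under these assumptions each metric needs 286,560 datapoints, or roughly 3.3 MiB, so the number of unique metrics drives storage and IOPS far more than the sample rate does.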
  • 33. STANDARD AGGREGATIONS
        # Average & Sum for timers
        <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>
        <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>
        # Min / Max for Lower / Upper
        <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper
        <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
        http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
  • 34. PERFORMANCE First problem: IOPS Second problem: CPU http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
  • 35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self hosted monitoring http://circonus.com http://zabbix.com https://github.com/lyft/circonus-statsd-backend https://github.com/dlecocq/statsd-zabbix
  • 36. GRAPHITE.JS Custom dashboards using jQuery https://github.com/prestontimmons/graphitejs http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
  • 37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
  • 38. INSTRUMENTATION Instrument your infrastructure, not just your apps http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
  • 39. APACHE Use mod_statsd to capture stats directly from the Apache request http://kaleidos.net/files/images/apache318x260.png http://httpd.apache.org/ https://github.com/jib/mod_statsd
  • 40. BASIC CONFIGURATION
        <Location /api>
          Statsd On
          StatsdPrefix apache
        </Location>
        $ curl http://localhost/api/foo?id=42
        Stat: apache.api.foo.GET.200:31|ms
        https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
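The mapping from a request to a stat in the example above can be mimicked in Python. This sketch only reproduces the one example on the slide (prefix, path segments, method, status code, then the timing in milliseconds); it is not mod_statsd's implementation, and `stat_line` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def stat_line(prefix, method, url, status, millis):
    """Build a statsd timer line from the pieces of one HTTP request."""
    # Drop the query string and split the path into its segments.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    key = ".".join([prefix, *segments, method, str(status)])
    return f"{key}:{millis}|ms"


print(stat_line("apache", "GET", "http://localhost/api/foo?id=42", 200, 31))
```

Keeping the query string out of the key matters: otherwise every distinct `id` would create a new metric.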
  • 41. VARNISH use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A https://www.varnish-cache.org/ https://github.com/jib/libvmod-statsd
  • 42. BASIC CONFIGURATION
        # pseudo code
        import statsd;
        import timers;
        sub vcl_deliver {
          statsd.timing(
            $backend +      # from req.backend
            $hit_miss +     # from obj.hits
            $resp_code,     # from obj.status
            timers.req_response_time()
          );
        }
        https://github.com/jib/libvmod-statsd/blob/master/README.rst http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  • 43. SAMPLE GRAPH The requests per second & response time graphs are coming straight from Varnish
  • 44. PYTHON Create a base library in your language of choice https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  • 45. KRUX-STDLIB $ pip install krux-stdlib https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • 46. BASIC APP USING STDLIB
        $ sample-app -h
        […]
        logging:
          --log-level {info,debug,critical,warning,error}
                                Verbosity of logging. (default: warning)
        stats:
          --stats               Enable sending statistics to statsd. (default: False)
          --stats-host STATS_HOST
                                Statsd host to send statistics to. (default: localhost)
          --stats-port STATS_PORT
                                Statsd port to send statistics to. (default: 8125)
          --stats-environment STATS_ENVIRONMENT
                                Statsd environment. (default: dev)
        https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • 47. BASIC APP USING STDLIB
        import krux.cli

        class App(krux.cli.Application):
            def __init__(self):
                # Call to the superclass to bootstrap.
                super(App, self).__init__(name='sample-app')

            def run(self):
                stats = self.stats
                log = self.logger

                # Time the body of run() and report it to statsd.
                with stats.timer('run'):
                    log.info('running...')
                    ...
        https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/ https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
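What a `stats.timer(...)` context manager does can be shown with a standalone sketch. This is an illustration of the pattern only, not krux-stdlib's implementation: the `Stats` class, its `emit` callback and its `clock` parameter are hypothetical and exist here so the example runs without a statsd server.

```python
import time
from contextlib import contextmanager


class Stats:
    """Tiny stand-in for a stats client that emits statsd-style lines."""

    def __init__(self, emit=print, clock=time.monotonic):
        self._emit = emit    # where finished stat lines go
        self._clock = clock  # injectable so the timing can be tested

    @contextmanager
    def timer(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Report elapsed wall time in whole milliseconds, even on errors.
            elapsed_ms = int(round((self._clock() - start) * 1000))
            self._emit(f"{name}:{elapsed_ms}|ms")


stats = Stats()
with stats.timer("run"):
    time.sleep(0.01)  # stand-in for real work
```

Wrapping the timing in a context manager keeps the instrumentation to a single line at the call site, which is what keeps the developer overhead low.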
  • 48. CLI echo 'events.deploy.appname:1|c' | nc -u localhost 8125
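The same one-line event can be sent from Python with nothing but the standard library. This is a minimal sketch, assuming a statsd daemon listening for UDP on port 8125 of the local machine; `send_counter` is a hypothetical helper, and `127.0.0.1` is used in place of `localhost` only to avoid a name lookup.

```python
import socket


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Send one statsd counter increment as a single UDP datagram."""
    payload = f"{name}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: no connection and no acknowledgement.
        sock.sendto(payload, (host, port))
    return payload


# Mark a deploy, as in the CLI example above.
send_counter("events.deploy.appname")
```

Because the datagram is not acknowledged, a missing or overloaded statsd daemon does not slow down or break the caller.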
  • 49. JAVASCRIPT Use a simple HTTP endpoint to send stats
  • 50. SUPERVISOR Instrument Supervisord using Sulphite http://www.dilbertcelart.com/dale/c26.jpg http://supervisord.org/ https://github.com/jib/sulphite
  • 51. BASIC CONFIGURATION
        # Install from PyPI
        $ pip install sulphite
        # Set up as an event listener in Supervisor
        [eventlistener:sulphite]
        command=sulphite --graphite-server=…
        events=PROCESS_STATE
        numprocs=1
        http://supervisord.org/events.html https://github.com/jib/sulphite/blob/master/README.md
  • 52. FATAL PROCESS EXITS Processes that exited unexpectedly and that Supervisor was unable to restart after N retries
  • 53. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite http://docs.puppetlabs.com/guides/reporting.html https://github.com/krux/puppet-module-graphite-report
  • 54. KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • 55. BASIC USAGE
        # Charge to date for $service
        $ mon-get-stats EstimatedCharges \
            --namespace "AWS/Billing" \
            --statistics Sum \
            --dimensions "ServiceName=${service}" \
            --start-time $date
        http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • 56. Q & A http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html @jiboumans http://slideshare.net/jiboumans