Krux is an infrastructure provider for many of the websites you
use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For
every request to those properties, Krux receives one or more requests
as well. We grew from zero traffic to several billion requests per
day in the span of 2 years, and we did so exclusively in AWS.
To make the right decisions in such a volatile environment, we
knew that data is everything; without it, you can't possibly make
informed decisions. However, collecting it efficiently, at scale,
at minimal cost and without burdening developers is a tremendous
challenge.
Join me in this session to learn how we overcame this challenge
at Krux. I will share the details of how we set up our global
infrastructure, entirely managed by Puppet, to capture over a
million data points every second on virtually every part of the
system, including inside the web server, user apps and Puppet itself.
We do this for under $2000/month using off-the-shelf Open Source
software and some code we've released as Open Source ourselves. In
addition, I'll show you how you can take (a subset of) these metrics
and send them to advanced analytics and alerting tools like Circonus
or Zabbix.
This content will be applicable for anyone collecting or desiring to
collect vast amounts of metrics in a cloud or datacenter setting and
making sense of them.
How to measure everything - a million metrics per second with minimal developer overhead
1. HOW TO MEASURE EVERYTHING
A million metrics per second with minimal developer overhead
Jos Boumans - @jiboumans
http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
6. A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
7. AVERAGE DATA EVENTS / SEC
[Chart, axis from 0 to 140,000 events/sec, comparing Twitter: New Tweets, Wikipedia: Page Views, Facebook: Messages Sent and Krux: New Data Points]
http://investor.fb.com/results.cfm
http://www.statisticbrain.com/twitter-statistics/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
11. METRICS & VISUALIZATION
… and a little bit of monitoring
http://getfit101.files.wordpress.com/2012/04/visualization.jpg
12. VISUALIZATION MATTERS
Humans are good at patterns & shapes
http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
13. INSIGHT MATTERS
We consider it a core competence
http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
14. SHOW EVERYONE
And better yet, encourage people to add their own
http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
16. KEY CHARACTERISTICS
… of our metrics collection
http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
17. WHAT TO VISUALIZE
Pick your operational KPIs
http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
23. WHAT TO CAPTURE
Everything.
No, really.
http://arkansasagnews.uark.edu/monarchs95.jpg
24. INFRASTRUCTURE
Everything needed to create, capture and
act on a million metrics per second
http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
26. COLLECTD
Open Source Monitoring Tool
https://collectd.org/
https://collectd.org/wiki/index.php/Plugin:StatsD
27. STATSD
Simple stats collector service
https://github.com/etsy/statsd
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
28. STATSD NAMING SCHEME
stats. # to distinguish from events
$environment. # prod, dev, etc
$cluster_name. # api-ash, www-dub, etc
$application. # webapp, login, etc
$metric_name_here. # any key the app wants
$hostname # node the stat came from
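The naming scheme above can be sketched in Python as follows. This is a minimal illustration, not Krux's actual base library: the function names `build_key`, `format_timer` and `send_timer`, the example values, and the StatsD address `localhost:8125` are all assumptions. Replacing dots in the hostname is also an assumption, made so that a fully qualified hostname stays a single level in the key. The `<key>:<value>|ms` wire format is the standard StatsD timer format.

```python
import socket


def build_key(environment, cluster_name, application, metric_name, hostname):
    """Join the components of the naming scheme into one dotted StatsD key."""
    # Assumption: dots in the hostname would add extra levels to the key,
    # so they are replaced. The metric name is left alone, because the app
    # may use dots on purpose to build its own hierarchy.
    safe_hostname = hostname.replace(".", "_")
    parts = ["stats", environment, cluster_name, application,
             metric_name, safe_hostname]
    return ".".join(parts)


def format_timer(key, milliseconds):
    """Format a StatsD timer datagram: <key>:<value>|ms"""
    return "{}:{}|ms".format(key, milliseconds)


def send_timer(key, milliseconds, host="localhost", port=8125):
    """Send one timer to StatsD over UDP (fire-and-forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        payload = format_timer(key, milliseconds).encode("ascii")
        sock.sendto(payload, (host, port))
    finally:
        sock.close()


# Hypothetical example values
key = build_key("prod", "api-ash", "webapp", "login_time", "web01.example.com")
print(format_timer(key, 320))
```

Because the key is assembled in one place, an application only has to supply the metric name; the environment, cluster, application and hostname can come from configuration.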
33. STANDARD AGGREGATIONS
# Average & Sum for timers
<prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>

<prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>

# Min / Max for Lower / Upper
<prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper

<prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
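To make the effect of these rules concrete, here is a rough Python sketch of the same roll-up applied to an in-memory set of per-node timer values. It is not how carbon-aggregator is implemented; the function name `aggregate_timers`, the sample data and the omission of the `<prefix>.timers.` part of the path are simplifications chosen for illustration.

```python
from collections import defaultdict


def aggregate_timers(samples, site="ash"):
    """Roll per-node timer values up into per-site totals.

    `samples` maps (key, node, stat_type) to a value, mirroring the
    <key>.<node>.<type> tail of the input patterns in the rules.
    """
    grouped = defaultdict(list)
    for (key, _node, stat_type), value in samples.items():
        grouped[(key, stat_type)].append(value)

    totals = {}
    for (key, stat_type), values in grouped.items():
        base = "{}._totals.{}.{}".format(key, site, stat_type)
        # The avg rule applies to every type.
        totals[base + ".avg"] = sum(values) / len(values)
        # The sum rule skips types starting with upper/lower, like the
        # (?!upper|lower) lookahead in the rule.
        if not stat_type.startswith(("upper", "lower")):
            totals[base + ".sum"] = sum(values)
        # Upper and lower get a max and a min across nodes instead.
        if stat_type == "upper":
            totals[base] = max(values)
        elif stat_type == "lower":
            totals[base] = min(values)
    return totals


# Hypothetical per-node values for one timer
samples = {
    ("req", "web01", "mean"): 10.0,
    ("req", "web02", "mean"): 30.0,
    ("req", "web01", "upper"): 50.0,
    ("req", "web02", "upper"): 90.0,
    ("req", "web01", "lower"): 2.0,
    ("req", "web02", "lower"): 1.0,
}
totals = aggregate_timers(samples)
for name in sorted(totals):
    print(name, totals[name])
```

Summing an upper or lower bound across nodes has no useful meaning, which is why those types take the maximum and minimum instead.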
34. PERFORMANCE
First problem: IOPS
Second problem: CPU
http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
35. GRAPHITE ALTERNATIVES
Circonus: All the insights you ever wanted
Zabbix: OSS self hosted monitoring
http://circonus.com
http://zabbix.com
https://github.com/lyft/circonus-statsd-backend
https://github.com/dlecocq/statsd-zabbix
36. GRAPHITE.JS
Custom dashboards using jQuery
https://github.com/prestontimmons/graphitejs
http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
37. COST
Optimize for adoption rates in your organization by
eliminating cost as a constraint
http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
38. INSTRUMENTATION
Instrument your infrastructure, not just your apps
http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
39. APACHE
Use mod_statsd to capture stats
directly from the Apache request
http://kaleidos.net/files/images/apache318x260.png
http://httpd.apache.org/
https://github.com/jib/mod_statsd
41. VARNISH
Use libvmod-statsd & libvmod-timers to capture
stats directly from the Varnish request
http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A
https://www.varnish-cache.org/
https://github.com/jib/libvmod-statsd
42. BASIC CONFIGURATION
# pseudo code
import statsd; import timers;
sub vcl_deliver {
    statsd.timing(
        $backend +    # from req.backend
        $hit_miss +   # from obj.hits
        $resp_code,   # from obj.status
        timers.req_response_time() );
}
https://github.com/jib/libvmod-statsd/blob/master/README.rst
http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
43. SAMPLE GRAPH
The requests per second & response time graphs
are coming straight from Varnish
44. PYTHON
Create a base library in your language of choice
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
50. SUPERVISOR
Instrument Supervisord using Sulphite
http://www.dilbertcelart.com/dale/c26.jpg
http://supervisord.org/
https://github.com/jib/sulphite
51. BASIC CONFIGURATION
# Install from PyPI
$ pip install sulphite

# Set up as an event listener in Supervisor
[eventlistener:sulphite]
command=sulphite --graphite-server=…
events=PROCESS_STATE
numprocs=1
http://supervisord.org/events.html
https://github.com/jib/sulphite/blob/master/README.md
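For readers unfamiliar with Supervisor event listeners, the following Python sketch shows the general shape of one: it speaks the documented READY / RESULT protocol on stdin and stdout and turns each process state event into a line in Graphite's plaintext format (`<path> <value> <timestamp>`). This is not Sulphite's code. The function names, the `supervisor` prefix and the metric path layout are assumptions made for illustration.

```python
import time


def parse_tokens(line):
    """Parse Supervisor's 'key:value key:value' token format into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def event_to_graphite(header_line, payload, prefix="supervisor", now=None):
    """Build one Graphite plaintext line for a PROCESS_STATE_* event."""
    headers = parse_tokens(header_line)
    data = parse_tokens(payload)
    timestamp = int(time.time() if now is None else now)
    # e.g. PROCESS_STATE_FATAL -> fatal
    state = headers["eventname"].replace("PROCESS_STATE_", "").lower()
    # Hypothetical path layout: <prefix>.<process>.<state>
    path = "{}.{}.{}".format(prefix, data["processname"], state)
    return "{} 1 {}\n".format(path, timestamp)


def run_listener(stdin, stdout, emit):
    """Event listener loop: announce READY, read an event, acknowledge it."""
    while True:
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break  # Supervisor closed our stdin
        headers = parse_tokens(header_line)
        # 'len' is the payload size; for ASCII payloads, characters
        # and bytes are the same count.
        payload = stdin.read(int(headers["len"]))
        emit(event_to_graphite(header_line, payload))
        stdout.write("RESULT 2\nOK")
        stdout.flush()
```

In a real listener, `run_listener(sys.stdin, sys.stdout, emit)` would be called with an `emit` function that writes each line to a socket connected to the Graphite server.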
52. FATAL PROCESS EXITS
Processes that exited unexpectedly and that Supervisor was
unable to restart after N retries
53. PUPPET
Use the Puppet module graphite-report to send Puppet
reporting data directly to Graphite
http://docs.puppetlabs.com/guides/reporting.html
https://github.com/krux/puppet-module-graphite-report
54. KEEP TRACK OF COSTS
Use the CloudWatch CLI tools and send the results to StatsD
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
55. BASIC USAGE
# Charge to date for $service
$ mon-get-stats EstimatedCharges \
    --namespace "AWS/Billing" \
    --statistics Sum \
    --dimensions "ServiceName=${service}" \
    --start-time $date
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
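Once the charge to date for each service is known, sending it to StatsD is a matter of formatting one gauge per service. The sketch below assumes the charges have already been extracted from the CLI output into a dictionary; the function name `charges_to_gauges`, the key prefix and the dollar amounts are hypothetical. A gauge (`<key>:<value>|g`) fits here because EstimatedCharges is a running total rather than a per-interval count.

```python
def charges_to_gauges(charges, prefix="stats.prod.billing"):
    """Turn {service: charge_to_date} into StatsD gauge lines."""
    lines = []
    for service, amount in sorted(charges.items()):
        key = "{}.{}.estimated_charges".format(prefix, service.lower())
        # Gauge format: <key>:<value>|g
        lines.append("{}:{:.2f}|g".format(key, amount))
    return lines


# Hypothetical charges to date, in USD
gauges = charges_to_gauges({"AmazonEC2": 1234.5, "AmazonS3": 87.123})
for line in gauges:
    print(line)
```

Each line can then be sent to StatsD as a UDP datagram, the same way as any other metric.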
56. Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@jiboumans
http://slideshare.net/jiboumans