How to measure everything - a million metrics per second with minimal developer overhead

HOW TO MEASURE EVERYTHING
A million metrics per second with minimal developer overhead
Jos Boumans - @jiboumans
http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg

RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db

CANONICAL
Engineering manager for Ubuntu Server 10.04 & 10.10
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775
http://www.ubuntu.com/business/server/overview

KRUX
VP of Operations & Infrastructure
http://www.krux.com/

A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

AVERAGE DATA EVENTS / SEC
[Bar chart, scale 0 to 140,000 events per second]
Twitter: New Tweets
Wikipedia: Page Views
Facebook: Messages Sent
Krux: New Data Points
http://investor.fb.com/results.cfm
http://www.statisticbrain.com/twitter-statistics/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

MONTHLY UNIQUE USERS
[Bar chart, scale 0 to 2,000,000,000 users]
http://reportcard.wmflabs.org/
http://www.statisticbrain.com/twitter-statistics/
http://newsroom.fb.com/company-info/

DATA IS EVERYTHING
Always know what’s going on
http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg

UNIQUE METRICS
Unique metrics received, per second

METRICS & VISUALIZATION
… and a little bit of monitoring
http://getfit101.files.wordpress.com/2012/04/visualization.jpg

VISUALIZATION MATTERS
Humans are good at patterns & shapes
http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg

INSIGHT MATTERS
We consider it a core competence
http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg

SHOW EVERYONE
And better yet, encourage people to add their own
http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg

KEY CHARACTERISTICS
… of our metrics collection
http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg

WHAT TO VISUALIZE
Pick your operational KPIs
http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg

REQUEST & ERROR RATES
The baseline for everything else

WORST RESPONSE TIMES
Track the worst upper 95th & upper 99th across a cluster

TRACK EVENTS
Did a code change or batch job cause a change in
behaviour?

CAPACITY / THRESHOLDS
How much traffic can your service sustain?

SINGLE SERVICE OVERVIEW
Create a single graph for every service

WHAT TO CAPTURE
Everything.
No, really.
http://arkansasagnews.uark.edu/monarchs95.jpg

INFRASTRUCTURE
Everything needed to create, capture and
act on a million metrics per second
http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg

GRAPHITE, STATSD & COLLECTD
The Trifecta

COLLECTD
Open Source Monitoring Tool
https://collectd.org/
https://collectd.org/wiki/index.php/Plugin:StatsD

STATSD
Simple stats collector service
https://github.com/etsy/statsd
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg

STATSD NAMING SCHEME
stats. # to distinguish from events
$environment. # prod, dev, etc
$cluster_name. # api-ash, www-dub, etc
$application. # webapp, login, etc
$metric_name_here. # any key the app wants
$hostname # node the stat came from
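
A minimal sketch of how the scheme above composes into a single metric path. The function name and the example values are hypothetical; only the order of the components comes from the slide.

```python
def metric_name(environment, cluster, application, key, hostname):
    """Join the naming-scheme components into one dotted metric path."""
    # Keep only the short hostname so a fully qualified name does not
    # add extra levels to the path.
    short_host = hostname.split(".")[0]
    return ".".join(["stats", environment, cluster, application, key, short_host])


print(metric_name("prod", "api-ash", "webapp", "login.success", "web01.example.com"))
```

Keeping the hostname last means every level before it can be aggregated across nodes with a single wildcard.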

STATSD CONFIGURATION
{
  graphite: {
    // $env and $cluster_name are placeholders to fill in per deployment
    globalPrefix: "stats.$env.$cluster_name",
    globalSuffix: require('os').hostname().split('.')[0],
    legacyNamespace: false
  },
  percentThreshold: [ 95, 99 ],
  deleteIdleStats: true
}
https://github.com/etsy/statsd/blob/master/exampleConfig.js
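
To illustrate what percentThreshold: [ 95, 99 ] asks for, here is a sketch of an "upper" percentile over one flush interval of timer samples. It is a nearest-rank style calculation written for this example; StatsD's own rounding may differ in edge cases, and the function name is hypothetical.

```python
import math


def upper_percentile(values, pct):
    """Return the largest sample among the lowest `pct` percent of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    count = len(ordered)
    # Number of samples inside the threshold, rounded half up.
    in_threshold = int(math.floor(pct * count / 100.0 + 0.5))
    in_threshold = max(1, min(count, in_threshold))
    return ordered[in_threshold - 1]


timings_ms = list(range(1, 101))  # 1 ms .. 100 ms
print(upper_percentile(timings_ms, 95), upper_percentile(timings_ms, 99))
```

The upper 95th hides the slowest 5% of requests, so outliers do not dominate the graph while slow tails still show up.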

GRAPHITE
Metric store & Graph UI
http://graphite.wikidot.com/
http://graphite.readthedocs.org/en/latest/

GRAPHITE SETUP
At least one graphite server per data center

DATA RETENTION
[default]
pattern = .*
priority = 110
retentions = 10:6h,60:15d,600:5y
xFilesFactor = 0
http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
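
A rough sizing sketch for the retention line above. It assumes the precision is in plain seconds (as on the slide), that a year counts as 365 days, and that a Whisper file uses a 16-byte header, 12 bytes per archive definition and 12 bytes per data point; treat those byte sizes as assumptions to verify against your Whisper version.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_retentions(spec):
    """Turn '10:6h,60:15d' into a list of (seconds_per_point, points)."""
    archives = []
    for archive in spec.split(","):
        precision, history = archive.strip().split(":")
        seconds_per_point = int(precision)
        history_seconds = int(history[:-1]) * UNIT_SECONDS[history[-1]]
        archives.append((seconds_per_point, history_seconds // seconds_per_point))
    return archives


def whisper_file_bytes(archives):
    """Estimate the on-disk size of one metric (assumed Whisper layout)."""
    return 16 + 12 * len(archives) + 12 * sum(points for _, points in archives)


archives = parse_retentions("10:6h,60:15d,600:5y")
print(archives, whisper_file_bytes(archives))
```

Multiplying the per-metric size by the number of unique metrics gives a first estimate of disk needs per Graphite server.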

STANDARD AGGREGATIONS
# Average & Sum for timers
<prefix>.timers.<key>._totals.ash.<type>.avg (10) =
avg <<prefix>>.timers.<<key>>.<node>.<type>

<prefix>.timers.<key>._totals.ash.<type>.sum (10) =
sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>

# Min / Max for Lower / Upper
<prefix>.timers.<key>._totals.ash.upper (10) =
max <<prefix>>.timers.<<key>>.<node>.upper

<prefix>.timers.<key>._totals.ash.lower (10) =
min <<prefix>>.timers.<<key>>.<node>.lower
http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
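
The effect of these rules can be sketched in plain Python: per-node timer stats are folded into cluster-wide totals, with avg for every type, sum for everything except upper/lower, and max/min for upper/lower. The data layout and function name are hypothetical; carbon-aggregator does this on the metric stream itself.

```python
def cluster_totals(per_node):
    """Combine per-node timer stats into cluster-wide totals.

    per_node maps node name -> {stat type -> value}.
    """
    by_type = {}
    for stats in per_node.values():
        for stat_type, value in stats.items():
            by_type.setdefault(stat_type, []).append(value)

    totals = {}
    for stat_type, values in by_type.items():
        totals[stat_type + ".avg"] = sum(values) / len(values)
        # Summing upper/lower bounds is meaningless, so skip them,
        # mirroring the (?!upper|lower) exclusion in the sum rule.
        if not stat_type.startswith(("upper", "lower")):
            totals[stat_type + ".sum"] = sum(values)
    if "upper" in by_type:
        totals["upper"] = max(by_type["upper"])  # worst case across nodes
    if "lower" in by_type:
        totals["lower"] = min(by_type["lower"])  # best case across nodes
    return totals


nodes = {
    "web01": {"count": 100, "mean": 30.0, "upper": 120.0, "lower": 5.0},
    "web02": {"count": 300, "mean": 50.0, "upper": 90.0, "lower": 2.0},
}
print(cluster_totals(nodes))
```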

PERFORMANCE
First problem: IOPS
Second problem: CPU
http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg

GRAPHITE ALTERNATIVES
Circonus: All the insights you ever wanted
Zabbix: OSS self-hosted monitoring
http://circonus.com
http://zabbix.com
https://github.com/lyft/circonus-statsd-backend
https://github.com/dlecocq/statsd-zabbix

GRAPHITE.JS
Custom dashboards using jQuery
https://github.com/prestontimmons/graphitejs
http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/

COST
Optimize for adoption rates in your organization by
eliminating cost as a constraint
http://www.examiner.com/images/blog/wysiwyg/image/money].jpg

INSTRUMENTATION
Instrument your infrastructure, not just your apps
http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg

APACHE
Use mod_statsd to capture stats
directly from the Apache request
http://kaleidos.net/files/images/apache318x260.png
http://httpd.apache.org/
https://github.com/jib/mod_statsd

BASIC CONFIGURATION
<Location /api>
    Statsd On
    StatsdPrefix apache
</Location>

$ curl http://localhost/api/foo?id=42

Stat: apache.api.foo.GET.200:31|ms
https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
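
For readers new to the wire format, a small sketch that takes apart the stat line shown above. The key:value|type layout is the StatsD line protocol; splitting the key into path, method and status is only read off this one example and is an assumption about mod_statsd's key layout.

```python
def parse_stat(line):
    """Split a StatsD line 'key:value|type[|@rate]' into its parts."""
    key, _, rest = line.partition(":")
    fields = rest.split("|")
    value = float(fields[0])
    metric_type = fields[1]
    # An optional third field carries the sample rate, e.g. '@0.1'.
    sample_rate = float(fields[2][1:]) if len(fields) > 2 else 1.0
    return key, value, metric_type, sample_rate


key, value, metric_type, rate = parse_stat("apache.api.foo.GET.200:31|ms")
# Assumed layout: <prefix and path>.<method>.<status>
path, method, status = key.rsplit(".", 2)
print(path, method, status, value, metric_type)
```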

VARNISH
Use libvmod-statsd & libvmod-timers to capture
stats directly from the Varnish request
http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A
https://www.varnish-cache.org/
https://github.com/jib/libvmod-statsd

BASIC CONFIGURATION
# pseudo code
import statsd; import timers;
sub vcl_deliver {
    statsd.timing(
        $backend +      # from req.backend
        $hit_miss +     # from obj.hits
        $resp_code,     # from obj.status
        timers.req_response_time() );
}
https://github.com/jib/libvmod-statsd/blob/master/README.rst
http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/

SAMPLE GRAPH
The requests per second & response time graphs
are coming straight from Varnish

PYTHON
Create a base library in your language of choice
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search

KRUX-STDLIB
$ pip install krux-stdlib
https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/

BASIC APP USING STDLIB
$ sample-app -h
[…]

logging:
  --log-level {info,debug,critical,warning,error}
        Verbosity of logging. (default: warning)
stats:
  --stats
        Enable sending statistics to statsd. (default: False)
  --stats-host STATS_HOST
        Statsd host to send statistics to. (default: localhost)
  --stats-port STATS_PORT
        Statsd port to send statistics to. (default: 8125)
  --stats-environment STATS_ENVIRONMENT
        Statsd environment. (default: dev)

BASIC APP USING STDLIB
import krux.cli

class App(krux.cli.Application):
    def __init__(self):
        ### Call to the superclass to bootstrap.
        super(App, self).__init__(name='sample-app')

    def run(self):
        stats = self.stats
        log = self.logger

        # Time the block and report it to statsd under the key 'run'.
        with stats.timer('run'):
            log.info('running...')
            ...
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
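
What a timer like stats.timer('run') does can be sketched with the standard library alone. This is not krux-stdlib's implementation; the timer function and the send callback are hypothetical stand-ins that show the idea of measuring a block and emitting a StatsD timing line.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(key, send):
    """Time the enclosed block and pass 'key:<ms>|ms' to `send`."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Report even if the block raised, so failures are timed too.
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send("%s:%d|ms" % (key, elapsed_ms))


sent = []
with timer("run", sent.append):
    time.sleep(0.01)
print(sent)
```

Putting this in a shared base library is what keeps developer overhead minimal: instrumenting a code path costs one line.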

CLI
echo 'events.deploy.appname:1|c' | nc -u localhost 8125
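
The same deploy event can be sent from Python with a plain UDP socket. A minimal sketch; send_stat is a hypothetical helper, and the host and port defaults mirror the command above.

```python
import socket


def send_stat(line, host="localhost", port=8125):
    """Send one StatsD line over UDP (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Count one deploy of 'appname'.
    send_stat("events.deploy.appname:1|c")
```

UDP means the sender never blocks on the stats server, which is why instrumentation is safe to leave on in production code paths.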

JAVASCRIPT
Use a simple HTTP endpoint to send stats

SUPERVISOR
Instrument Supervisord using Sulphite
http://www.dilbertcelart.com/dale/c26.jpg
http://supervisord.org/
https://github.com/jib/sulphite

BASIC CONFIGURATION
# Install from PyPI
$ pip install sulphite

# Set up as an eventlistener in Supervisor
[eventlistener:sulphite]
command=sulphite --graphite-server=…
events=PROCESS_STATE
numprocs=1
http://supervisord.org/events.html
https://github.com/jib/sulphite/blob/master/README.md
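
For context, a sketch of the Supervisor event listener protocol that a tool like Sulphite builds on: announce READY, read a header line and a payload of the advertised length, act on it, then acknowledge with RESULT. This is not Sulphite's code; the metric path layout and the emit callback are assumptions for illustration.

```python
import sys
import time


def parse_tokens(text):
    """Parse 'a:1 b:2' into {'a': '1', 'b': '2'}."""
    return dict(token.split(":", 1) for token in text.split())


def graphite_line(header, payload, now=None):
    """Build one Graphite plaintext line: 'path value timestamp'."""
    timestamp = int(now if now is not None else time.time())
    state = header["eventname"].replace("PROCESS_STATE_", "").lower()
    # Hypothetical path layout: events.supervisor.<process>.<state>
    path = "events.supervisor.%s.%s" % (payload["processname"], state)
    return "%s 1 %d" % (path, timestamp)


def listen(stdin=sys.stdin, stdout=sys.stdout,
           emit=lambda line: sys.stderr.write(line + "\n")):
    """Run the listener loop until Supervisor closes stdin."""
    while True:
        stdout.write("READY\n")            # tell Supervisor we accept events
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break
        header = parse_tokens(header_line)
        payload = parse_tokens(stdin.read(int(header["len"])))
        emit(graphite_line(header, payload))
        stdout.write("RESULT 2\nOK")       # acknowledge the event
        stdout.flush()
```

stdout is reserved for the protocol, so the sketch emits metric lines to stderr by default; a real listener would send them to the Graphite server instead.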

FATAL PROCESS EXITS
Processes that exited unexpectedly and that Supervisor was
unable to restart after N retries

PUPPET
Use the Puppet module graphite-report to send Puppet
reporting data directly to Graphite
http://docs.puppetlabs.com/guides/reporting.html
https://github.com/krux/puppet-module-graphite-report

KEEP TRACK OF COSTS
Use CloudWatch CLI tools and send to Statsd
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html

BASIC USAGE
# Charge to date for $service
$ mon-get-stats EstimatedCharges \
    --namespace "AWS/Billing" \
    --statistics Sum \
    --dimensions "ServiceName=${service}" \
    --start-time $date
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html

Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@jiboumans
http://slideshare.net/jiboumans
