HOW TO MEASURE EVERYTHING
A million metrics per second with minimal developer overhead
Jos Boumans - @jiboumans
http://...
RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db
CANONICAL
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775
Engineering manager for Ubuntu Server 10.04 & 10.1...
KRUX
VP of Operations & Infrastructure
http://www.krux.com/
SOME OF OUR CUSTOMERS
A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
AVERAGE DATA EVENTS / SEC
http://investor.fb.com/results.cfm
http://www.statisticbrain.com/twitter-statistics/
http://stat...
MONTHLY UNIQUE USERS
(Bar chart; axis from 0 to 2,000,000,000)
http://reportcard.wmflabs.org/
http://www.sta...
DATA IS EVERYTHING
Always know what’s going on
http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-...
UNIQUE METRICS
Unique metrics received, per second
METRICS & VISUALIZATION
… and a little bit of monitoring
http://getfit101.files.wordpress.com/2012/04/visualization.jpg
VISUALIZATION MATTERS
Humans are good at patterns & shapes
http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/Y...
INSIGHT MATTERS
We consider it a core competence
http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final...
SHOW EVERYONE
And better yet, encourage people to add their own
http://www.kissimmee.org/ftp/KCC/events/views/images/crowd...
THE BOTTOM LINE
KEY CHARACTERISTICS
… of our metrics collection
http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/...
WHAT TO VISUALIZE
Pick your operational KPIs
http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600...
REQUEST & ERROR RATES
The baseline for everything else
WORST RESPONSE TIMES
Track the worst upper 95th and upper 99th percentile response times across a cluster
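A minimal sketch of what "worst upper 95th across a cluster" can mean in practice: compute an upper percentile per node, then take the highest of those. The nearest-rank rounding and the sample node names and timings below are assumptions for illustration; statsd's own rounding may differ.

```python
import math


def upper_percentile(timings_ms, pct):
    """Largest value among the fastest `pct` percent of samples (nearest rank)."""
    if not timings_ms:
        raise ValueError("no samples")
    ordered = sorted(timings_ms)
    # Keep ceil(pct% of n) samples, but always at least one.
    keep = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[keep - 1]


def worst_across_cluster(per_node_timings, pct):
    """Highest per-node upper percentile, i.e. the worst node in the cluster."""
    return max(upper_percentile(t, pct) for t in per_node_timings.values())


# Hypothetical response times in milliseconds, keyed by node name.
cluster = {
    "api-ash-1": [12, 15, 11, 14, 250],
    "api-ash-2": [13, 12, 16, 18, 17],
}
worst_95 = worst_across_cluster(cluster, 95)
```

Taking the maximum rather than the average keeps a single slow node visible instead of letting healthy nodes hide it.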
TRACK EVENTS
Did a code change or batch job cause a change in behaviour?
CAPACITY / THRESHOLDS
How much traffic can your service sustain?
SINGLE SERVICE OVERVIEW
Create a single graph for every service
WHAT TO CAPTURE
Everything.
No, really.
http://arkansasagnews.uark.edu/monarchs95.jpg
INFRASTRUCTURE
Everything needed to create, capture and act on a million metrics per second
http://discussamerica.org/...
GRAPHITE, STATSD & COLLECTD
The Trifecta
COLLECTD
Open Source Monitoring Tool
https://collectd.org/
https://collectd.org/wiki/index.php/Plugin:StatsD
STATSD
Simple stats collector service
https://github.com/etsy/statsd
http://codeascraft.com/2011/02/15/measure-anything-me...
STATSD NAMING SCHEME
stats. # to distinguish from events	
$environment. # prod, dev, etc	
$cluster_name. # api-ash, www-du...
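The naming scheme on this slide (listed in full in the transcript: stats, environment, cluster name, application, metric name, hostname) can be sketched as a small helper. The function name and the validation are assumptions; the short-hostname step mirrors the `hostname().split('.')[0]` suffix used in the statsd configuration slide.

```python
import socket


def metric_name(environment, cluster, application, metric, hostname=None):
    """Compose stats.<env>.<cluster>.<app>.<metric>.<short hostname>."""
    if hostname is None:
        hostname = socket.gethostname()
    short_host = hostname.split(".")[0]  # drop the domain part
    parts = ["stats", environment, cluster, application, metric, short_host]
    if not all(parts):
        raise ValueError("every component of the metric name must be non-empty")
    return ".".join(parts)


# Hypothetical values, taken from the examples on the slide where possible.
name = metric_name("prod", "api-ash", "webapp", "requests", "web01.example.com")
```

Putting the hostname last keeps per-node series under one metric key, which makes wildcard queries and cluster-level roll-ups straightforward.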
STATSD CONFIGURATION
{ graphite: {
    globalPrefix: stats.$env.$cluster_name,
    globalSuffix: require('os').hostname().split(...
GRAPHITE
Metric store & Graph UI
http://graphite.wikidot.com/
http://graphite.readthedocs.org/en/latest/
GRAPHITE SETUP
At least one graphite server per data center
DATA RETENTION
[default]
pattern = .*
priority = 110
retentions = 10:6h,60:15d,600:5y
xFilesFactor = 0
http://grap...
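To get a feel for what `retentions = 10:6h,60:15d,600:5y` costs on disk, the sketch below counts the data points in each archive and estimates the size of one metric file. The parser is a simplified stand-in, not carbon's own, and the byte sizes (16-byte file header, 12 bytes per archive header, 12 bytes per point) are assumptions about the whisper file layout.

```python
# Seconds per unit suffix; a year is treated as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_retentions(spec):
    """Return [(seconds_per_point, number_of_points), ...] for a retentions string."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        seconds_per_point = int(precision)
        total_seconds = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        archives.append((seconds_per_point, total_seconds // seconds_per_point))
    return archives


def estimated_file_size(archives, header=16, archive_header=12, point=12):
    """Estimated bytes per metric file (layout sizes are assumptions)."""
    points = sum(count for _, count in archives)
    return header + archive_header * len(archives) + point * points


archives = parse_retentions("10:6h,60:15d,600:5y")
size_bytes = estimated_file_size(archives)
```

Under these assumptions each metric takes roughly 3.4 MB, so the number of unique metric names, not the event rate, drives disk usage.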
STANDARD AGGREGATIONS
# Average & Sum for timers
<prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>...
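The roll-up these rules describe (spelled out in full in the transcript) can be sketched in plain Python: average every timer type across nodes, sum every type except the upper and lower families, and keep the maximum of `upper` and the minimum of `lower`. This only illustrates the arithmetic; it is not how carbon-aggregator is implemented, and the node names and values are hypothetical.

```python
def cluster_totals(per_node):
    """per_node maps node -> {timer type -> value} for one key and one interval."""
    totals = {}
    types = sorted({t for metrics in per_node.values() for t in metrics})
    for t in types:
        values = [metrics[t] for metrics in per_node.values() if t in metrics]
        totals[f"{t}.avg"] = sum(values) / len(values)
        # Summing bounds or percentiles is meaningless, so skip upper*/lower*.
        if not t.startswith(("upper", "lower")):
            totals[f"{t}.sum"] = sum(values)
    if "upper" in types:
        totals["upper"] = max(m["upper"] for m in per_node.values() if "upper" in m)
    if "lower" in types:
        totals["lower"] = min(m["lower"] for m in per_node.values() if "lower" in m)
    return totals


# Hypothetical per-node timer values for one interval.
totals = cluster_totals({
    "node-a": {"count": 100, "mean": 10, "upper": 40, "lower": 2},
    "node-b": {"count": 300, "mean": 20, "upper": 90, "lower": 5},
})
```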
PERFORMANCE
First problem: IOPS
Second problem: CPU
http://www.organisationscience.com/styled-6/files/dt-improved-perfor...
GRAPHITE ALTERNATIVES
Circonus: All the insights you ever wanted
Zabbix: OSS self-hosted monitoring
http://circonus.com
h...
GRAPHITE.JS
Custom dashboards using jQuery
https://github.com/prestontimmons/graphitejs
http://dashboarddude.com/blog/2013...
COST
Optimize for adoption rates in your organization by eliminating cost as a constraint
http://www.examiner.com/images/b...
INSTRUMENTATION
Instrument your infrastructure, not just your apps
http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAA...
APACHE
Use mod_statsd to capture stats directly from the Apache request
http://kaleidos.net/files/images/apache318x260....
BASIC CONFIGURATION
<Location /api>
    Statsd On
    StatsdPrefix apache
</Location>
https://github.com/jib/mod_statsd/blob...
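The transcript shows the stat this configuration produces for `GET /api/foo?id=42`: `apache.api.foo.GET.200:31|ms`. The sketch below reproduces that shape so the key layout is easy to see. It is an illustration only: the function name is hypothetical, and mod_statsd's real sanitizing rules (for example for dots inside path segments) are not covered.

```python
def apache_stat_line(prefix, request_uri, method, status, duration_ms):
    """Build '<prefix>.<path segments>.<method>.<status>:<ms>|ms'."""
    path = request_uri.split("?", 1)[0]  # drop the query string
    segments = [segment for segment in path.split("/") if segment]
    key = ".".join([prefix, *segments, method, str(status)])
    return f"{key}:{duration_ms}|ms"


line = apache_stat_line("apache", "/api/foo?id=42", "GET", 200, 31)
```

Because method and status code are part of the key, error rates per endpoint fall out of the naming without any application code.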
VARNISH
Use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request
http://www.adammalone.net...
BASIC CONFIGURATION
# pseudo code	
import statsd; import timers;	
sub vcl_deliver {	
statsd.timing(	
$backend + # from req...
SAMPLE GRAPH
The requests per second & response time graphs are coming straight from Varnish
PYTHON
Create a base library in your language of choice
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=sea...
KRUX-STDLIB
$ pip install krux-stdlib
https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
BASIC APP USING STDLIB
$ sample-app -h
[…]
logging:
  --log-level {info,debug,critical,warning,error}
      Verbosity of log...
BASIC APP USING STDLIB
class App(krux.cli.Application):
    def __init__(self):
        ### Call to the superclass to bootstrap.
        su...
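The full version of this example in the transcript wraps the work in `with stats.timer('run'):`. A generic sketch of how such a timer context manager can work is shown below. It is not krux-stdlib's implementation; the `send` callback and the injectable `clock` are assumptions made so the example stays self-contained.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name, send, clock=time.monotonic):
    """Time the enclosed block and pass '<name>:<ms>|ms' to `send`."""
    start = clock()
    try:
        yield
    finally:
        # Report even if the block raised, so failures still show up in timings.
        elapsed_ms = int(round((clock() - start) * 1000))
        send(f"{name}:{elapsed_ms}|ms")


sent = []
with timer("run", sent.append):
    sum(range(1000))  # stand-in for real work
```

Hiding the timing behind a context manager is what keeps developer overhead minimal: instrumenting a code path costs one extra line.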
CLI
echo 'events.deploy.appname:1|c' | nc -u localhost 8125
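The same counter can be sent from Python with the standard library alone, which is a reasonable starting point for a base library. This is a minimal sketch; the function name is hypothetical, and the default host and port follow the slide.

```python
import socket


def send_stat(line, host="localhost", port=8125):
    """Send one statsd line over UDP; fire and forget, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Usage: send_stat("events.deploy.appname:1|c")
```

UDP means a missing or overloaded statsd never blocks the caller, at the price of silently dropped stats.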
JAVASCRIPT
Use a simple HTTP endpoint to send stats
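The slide gives no detail on the endpoint, so the following is only one possible shape for it: a tiny HTTP handler that validates query parameters and relays them to statsd over UDP. The parameter names (`key`, `value`, `type`), the key whitelist, the response codes, and the port in the final comment are all assumptions.

```python
import re
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

KEY_RE = re.compile(r"[A-Za-z0-9_.-]+")   # hypothetical whitelist for metric keys
VALUE_RE = re.compile(r"-?[0-9]+(\.[0-9]+)?")
ALLOWED_TYPES = {"c", "ms", "g"}          # counter, timer, gauge


def stat_line_from_query(query):
    """Turn 'key=...&value=...&type=...' into a statsd line, or None if invalid."""
    params = parse_qs(query)
    key = params.get("key", [""])[0]
    value = params.get("value", ["1"])[0]
    kind = params.get("type", ["c"])[0]
    if not (KEY_RE.fullmatch(key) and VALUE_RE.fullmatch(value) and kind in ALLOWED_TYPES):
        return None
    return f"{key}:{value}|{kind}"


class StatsHandler(BaseHTTPRequestHandler):
    statsd_address = ("localhost", 8125)

    def do_GET(self):
        line = stat_line_from_query(urlsplit(self.path).query)
        if line is None:
            self.send_response(400)  # reject anything that is not a clean stat
        else:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
                sock.sendto(line.encode("ascii"), self.statsd_address)
            self.send_response(204)
        self.end_headers()


# To serve on a hypothetical port:
# HTTPServer(("", 8080), StatsHandler).serve_forever()
```

Validating the key on the server matters here because browsers are untrusted clients and could otherwise create arbitrary metric names.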
PUPPET
Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite
http://docs.puppetlabs.com...
Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@jiboumans	

http://slideshare.net/j...

How to Measure Everything: A Million Metrics Per Second with Minimal Developer Overhead - PuppetCo

How to Measure Everything: A Million Metrics Per Second with Minimal Developer Overhead - Jos Boumans, Krux

Published in: Technology

  1. 1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead Jos Boumans - @jiboumans http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
  2. 2. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  3. 3. CANONICAL http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 Engineering manager for Ubuntu Server 10.04 & 10.10 http://www.ubuntu.com/business/server/overview
  4. 4. KRUX VP of Operations & Infrastructure http://www.krux.com/
  5. 5. SOME OF OUR CUSTOMERS
  6. 6. A LOT OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  7. 7. AVERAGE DATA EVENTS / SEC (bar chart, axis from 0 to 140,000; series: Twitter: New Tweets, Wikipedia: Page Views, Facebook: Messages Sent, Krux: New Data Points) http://investor.fb.com/results.cfm http://www.statisticbrain.com/twitter-statistics/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  8. 8. MONTHLY UNIQUE USERS (bar chart, axis from 0 to 2,000,000,000) http://reportcard.wmflabs.org/ http://www.statisticbrain.com/twitter-statistics/ http://newsroom.fb.com/company-info/
  9. 9. DATA IS EVERYTHING Always know what’s going on http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
  10. 10. UNIQUE METRICS Unique metrics received, per second
  11. 11. METRICS & VISUALIZATION … and a little bit of monitoring http://getfit101.files.wordpress.com/2012/04/visualization.jpg
  12. 12. VISUALIZATION MATTERS Humans are good at patterns & shapes http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
  13. 13. INSIGHT MATTERS We consider it a core competence http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
  14. 14. SHOW EVERYONE And better yet, encourage people to add their own http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
  15. 15. THE BOTTOM LINE
  16. 16. KEY CHARACTERISTICS … of our metrics collection http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
  17. 17. WHAT TO VISUALIZE Pick your operational KPIs http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
  18. 18. REQUEST & ERROR RATES The baseline for everything else
  19. 19. WORST RESPONSE TIMES Track the worst upper 95th and upper 99th percentile response times across a cluster
  20. 20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  21. 21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  22. 22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  23. 23. WHAT TO CAPTURE Everything. No, really. http://arkansasagnews.uark.edu/monarchs95.jpg
  24. 24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per second http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
  25. 25. GRAPHITE, STATSD & COLLECTD The Trifecta
  26. 26. COLLECTD Open Source Monitoring Tool https://collectd.org/ https://collectd.org/wiki/index.php/Plugin:StatsD
  27. 27. STATSD Simple stats collector service https://github.com/etsy/statsd http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
  28. 28. STATSD NAMING SCHEME stats. # to distinguish from events $environment. # prod, dev, etc $cluster_name. # api-ash, www-dub, etc $application. # webapp, login, etc $metric_name_here. # any key the app wants $hostname # node the stat came from
  29. 29. STATSD CONFIGURATION { graphite: { globalPrefix: stats.$env.$cluster_name, globalSuffix: require('os').hostname().split('.')[0], legacyNamespace: false, }, percentThreshold: [ 95, 99 ], deleteIdleStats: true, } https://github.com/etsy/statsd/blob/master/exampleConfig.js
  30. 30. GRAPHITE Metric store & Graph UI http://graphite.wikidot.com/ http://graphite.readthedocs.org/en/latest/
  31. 31. GRAPHITE SETUP At least one graphite server per data center
  32. 32. DATA RETENTION [default] pattern = .* priority = 110 retentions = 10:6h,60:15d,600:5y xFilesFactor = 0 http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
  33. 33. STANDARD AGGREGATIONS # Average & Sum for timers <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type> <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type> # Min / Max for Lower / Upper <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
  34. 34. PERFORMANCE First problem: IOPS Second problem: CPU http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
  35. 35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self-hosted monitoring http://circonus.com http://zabbix.com https://github.com/lyft/circonus-statsd-backend https://github.com/dlecocq/statsd-zabbix
  36. 36. GRAPHITE.JS Custom dashboards using jQuery https://github.com/prestontimmons/graphitejs http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
  37. 37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
  38. 38. INSTRUMENTATION Instrument your infrastructure, not just your apps http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
  39. 39. APACHE Use mod_statsd to capture stats directly from the Apache request http://kaleidos.net/files/images/apache318x260.png http://httpd.apache.org/ https://github.com/jib/mod_statsd
  40. 40. BASIC CONFIGURATION <Location /api> Statsd On StatsdPrefix apache </Location> https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION $ curl http://localhost/api/foo?id=42 Stat: apache.api.foo.GET.200:31|ms
  41. 41. VARNISH Use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A https://www.varnish-cache.org/ https://github.com/jib/libvmod-statsd
  42. 42. BASIC CONFIGURATION # pseudo code import statsd; import timers; sub vcl_deliver { statsd.timing( $backend + # from req.backend $hit_miss + # from obj.hits $resp_code, # from obj.status timers.req_response_time() ); } https://github.com/jib/libvmod-statsd/blob/master/README.rst http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  43. 43. SAMPLE GRAPH The requests per second & response time graphs are coming straight from Varnish
  44. 44. PYTHON Create a base library in your language of choice https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  45. 45. KRUX-STDLIB $ pip install krux-stdlib https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  46. 46. BASIC APP USING STDLIB $ sample-app -h […] logging: --log-level {info,debug,critical,warning,error} Verbosity of logging. (default: warning) stats: --stats Enable sending statistics to statsd. (default: False) --stats-host STATS_HOST Statsd host to send statistics to. (default: localhost) --stats-port STATS_PORT Statsd port to send statistics to. (default: 8125) --stats-environment STATS_ENVIRONMENT Statsd environment. (default: dev) https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  47. 47. BASIC APP USING STDLIB class App(krux.cli.Application): def __init__(self): ### Call to the superclass to bootstrap. super(App, self).__init__( name = 'sample-app') def run(self): stats = self.stats log = self.logger with stats.timer('run'): log.info('running...') ... https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/ https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  48. 48. CLI echo 'events.deploy.appname:1|c' | nc -u localhost 8125
  49. 49. JAVASCRIPT Use a simple HTTP endpoint to send stats
  50. 50. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite http://docs.puppetlabs.com/guides/reporting.html https://github.com/krux/puppet-module-graphite-report
  51. 51. Q & A http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html @jiboumans http://slideshare.net/jiboumans
