HOW TO MEASURE EVERYTHING 
A million metrics per second with minimal developer overhead 
Jos Boumans - @jiboumans 
http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
RIPE NCC 
Engineering manager for RIPE Database 
http://www.ripe.net/db
CANONICAL 
Engineering manager for Ubuntu Server 10.04 & 10.10 
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 
http://www.ubuntu.com/business/server/overview
KRUX 
VP of Operations & Infrastructure 
http://www.krux.com/
SOME OF OUR CUSTOMERS
A LOT OF TRAFFIC 
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
AVERAGE DATA EVENTS / SEC
[Bar chart, axis 0 to 140,000: Twitter (new tweets), Wikipedia (page views), Facebook (messages sent), Krux (new data points)]
http://investor.fb.com/results.cfm
http://www.statisticbrain.com/twitter-statistics/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
MONTHLY UNIQUE USERS
[Bar chart, axis 0 to 2,000,000,000]
http://reportcard.wmflabs.org/
http://www.statisticbrain.com/twitter-statistics/
http://newsroom.fb.com/company-info/
DATA IS EVERYTHING 
Always know what’s going on 
http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
UNIQUE METRICS 
Unique metrics received, per second
METRICS & VISUALIZATION 
… and a little bit of monitoring 
http://getfit101.files.wordpress.com/2012/04/visualization.jpg
VISUALIZATION MATTERS 
Humans are good at patterns & shapes 
http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
INSIGHT MATTERS 
We consider it a core competence 
http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
SHOW EVERYONE 
And better yet, encourage people to add their own 
http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
THE BOTTOM LINE
KEY CHARACTERISTICS 
… of our metrics collection 
http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
WHAT TO VISUALIZE 
Pick your operational KPIs 
http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
REQUEST & ERROR RATES 
The baseline for everything else
WORST RESPONSE TIMES 
Track the worst upper 95th & upper 99th across a cluster
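As a rough illustration of this slide, the sketch below computes an upper 95th and 99th percentile per node and then takes the worst value across the cluster. It uses a simple nearest-rank percentile, which is an assumption and may not match StatsD's own rounding; the node names and timings are invented for the example.

```python
def upper_percentile(timings_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(timings_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def worst_across_cluster(timings_by_node, pct):
    """Take the per-node percentile and keep the worst (largest) one."""
    return max(upper_percentile(t, pct) for t in timings_by_node.values() if t)


# Hypothetical per-node response times in milliseconds.
cluster = {
    "web1": list(range(1, 101)),
    "web2": [10, 20, 30, 400],
}
worst_95 = worst_across_cluster(cluster, 95)
worst_99 = worst_across_cluster(cluster, 99)
print(worst_95, worst_99)
```

Taking the maximum rather than the average keeps a single slow node visible instead of letting healthy nodes mask it.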
TRACK EVENTS 
Did a code change or batch job cause a change in 
behaviour?
CAPACITY / THRESHOLDS 
How much traffic can your service sustain?
SINGLE SERVICE OVERVIEW 
Create a single graph for every service
WHAT TO CAPTURE 
Everything. 
No, really. 
http://arkansasagnews.uark.edu/monarchs95.jpg
INFRASTRUCTURE 
Everything needed to create, capture and 
act on a million metrics per second
http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
GRAPHITE, STATSD & COLLECTD 
The Trifecta
COLLECTD 
Open Source Monitoring Tool 
https://collectd.org/ 
https://collectd.org/wiki/index.php/Plugin:StatsD
STATSD 
Simple stats collector service 
https://github.com/etsy/statsd 
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ 
https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg
STATSD NAMING SCHEME 
stats. # to distinguish from events 
$environment. # prod, dev, etc 
$cluster_name. # api-ash, www-dub, etc 
$application. # webapp, login, etc 
$metric_name_here. # any key the app wants 
$hostname # node the stat came from
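The naming scheme above can be sketched as a small helper that joins the components in order. The function name and the sample values are hypothetical; the short-hostname handling mirrors the `split('.')[0]` used in the StatsD configuration slide.

```python
def metric_name(environment, cluster, application, key, hostname):
    """Build a full metric path following the scheme on this slide."""
    short_host = hostname.split(".")[0]  # keep only the node name, not the domain
    parts = ["stats", environment, cluster, application, key, short_host]
    return ".".join(parts)


# Hypothetical values for illustration.
name = metric_name("prod", "api-ash", "webapp", "requests.ok", "web1.example.com")
print(name)
```

Putting the hostname last means a wildcard or aggregation over the final component covers every node in a cluster.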
STATSD CONFIGURATION 
{
  graphite: {
    // $env and $cluster_name are placeholders to fill in per deployment
    globalPrefix: "stats.$env.$cluster_name",
    globalSuffix: require('os').hostname().split('.')[0],
    legacyNamespace: false
  },
  percentThreshold: [ 95, 99 ],
  deleteIdleStats: true
}
https://github.com/etsy/statsd/blob/master/exampleConfig.js
GRAPHITE 
Metric store & Graph UI 
http://graphite.wikidot.com/ 
http://graphite.readthedocs.org/en/latest/
GRAPHITE SETUP 
At least one graphite server per data center
DATA RETENTION 
[default] 
pattern = .* 
priority = 110 
retentions = 10:6h,60:15d,600:5y 
xFilesFactor = 0 
http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
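To get a feel for what the retention line above costs in storage, the sketch below counts the datapoints kept per metric. It assumes the `step:history` form with steps in seconds, a 365-day year, and roughly 12 bytes per stored point with file headers ignored; treat the byte figure as an estimate, not a Whisper guarantee.

```python
# Seconds per unit suffix; the 365-day year is an assumption.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}


def parse_retentions(spec):
    """Turn '10:6h,60:15d' into a list of (step_seconds, number_of_points)."""
    archives = []
    for archive in spec.split(","):
        step, history = archive.split(":")
        seconds = int(history[:-1]) * UNIT_SECONDS[history[-1]]
        archives.append((int(step), seconds // int(step)))
    return archives


archives = parse_retentions("10:6h,60:15d,600:5y")
total_points = sum(points for _, points in archives)
approx_bytes = total_points * 12  # assumed bytes per point, headers ignored
print(archives, total_points, approx_bytes)
```

Multiplying the per-metric estimate by the number of unique metrics gives a first approximation of disk needs per Graphite server.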
STANDARD AGGREGATIONS 
# Average & Sum for timers
<prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type>

<prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type>

# Min / Max for Lower / Upper
<prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper

<prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower
http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
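The intent of these rules can be illustrated with a simplified sketch: per-node timer values are rolled up into cluster-wide `_totals`, using the maximum for `upper`, the minimum for `lower`, and average plus sum for the other types. This is not how carbon-aggregator is implemented, and unlike the first rule it emits no average for `upper` and `lower`. The node names and numbers are made up.

```python
def cluster_totals(per_node):
    """Roll per-node timer stats up into one set of cluster-wide values."""
    types = set()
    for stats in per_node.values():
        types.update(stats)

    totals = {}
    for stat_type in sorted(types):
        values = [stats[stat_type] for stats in per_node.values() if stat_type in stats]
        if stat_type == "upper":
            totals["upper"] = max(values)   # worst case across nodes
        elif stat_type == "lower":
            totals["lower"] = min(values)   # best case across nodes
        else:
            totals[stat_type + ".avg"] = sum(values) / len(values)
            totals[stat_type + ".sum"] = sum(values)
    return totals


# Hypothetical timer stats for two nodes in one cluster.
per_node = {
    "web1": {"count": 100, "mean": 30.0, "upper": 120.0, "lower": 4.0},
    "web2": {"count": 300, "mean": 50.0, "upper": 310.0, "lower": 2.0},
}
totals = cluster_totals(per_node)
print(totals)
```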
PERFORMANCE 
First problem: IOPS 
Second problem: CPU 
http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
GRAPHITE ALTERNATIVES 
Circonus: All the insights you ever wanted
Zabbix: OSS self-hosted monitoring
http://circonus.com
http://zabbix.com
https://github.com/lyft/circonus-statsd-backend 
https://github.com/dlecocq/statsd-zabbix
GRAPHITE.JS 
Custom dashboards using jQuery 
https://github.com/prestontimmons/graphitejs 
http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
COST 
Optimize for adoption rates in your organization by 
eliminating cost as a constraint 
http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
INSTRUMENTATION 
Instrument your infrastructure, not just your apps 
http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
APACHE 
Use mod_statsd to capture stats 
directly from the Apache request 
http://kaleidos.net/files/images/apache318x260.png 
http://httpd.apache.org/ 
https://github.com/jib/mod_statsd
BASIC CONFIGURATION 
<Location /api>
    Statsd On
    StatsdPrefix apache
</Location>

$ curl http://localhost/api/foo?id=42

Stat: apache.api.foo.GET.200:31|ms
https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
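To make the mapping from request to stat explicit, here is a sketch that reproduces the example output above from a request's method, URL, status and duration. It only mimics the key shown on the slide (prefix, path segments, method, status) and is not mod_statsd's actual implementation; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def stat_line(prefix, method, url, status, elapsed_ms):
    """Build a StatsD timer line like the one shown on the slide."""
    path = urlsplit(url).path                      # drops the query string
    segments = [s for s in path.split("/") if s]   # '/api/foo' -> ['api', 'foo']
    key = ".".join([prefix] + segments + [method, str(status)])
    return "{}:{}|ms".format(key, elapsed_ms)


line = stat_line("apache", "GET", "http://localhost/api/foo?id=42", 200, 31)
print(line)
```

Dropping the query string keeps the number of unique keys bounded, which matters when every distinct key becomes its own metric.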
VARNISH 
Use libvmod-statsd & libvmod-timers to capture 
stats directly from the Varnish request 
http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A 
https://www.varnish-cache.org/ 
https://github.com/jib/libvmod-statsd
BASIC CONFIGURATION 
# pseudo code
import statsd; import timers;
sub vcl_deliver {
    statsd.timing(
        $backend +      # from req.backend
        $hit_miss +     # from obj.hits
        $resp_code,     # from obj.status
        timers.req_response_time() );
}
https://github.com/jib/libvmod-statsd/blob/master/README.rst 
http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
SAMPLE GRAPH 
The request per second & response time graphs 
are coming straight from varnish
PYTHON 
Create a base library in your language of choice 
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
KRUX-STDLIB 
$ pip install krux-stdlib 
https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
BASIC APP USING STDLIB 
$ sample-app -h
[…]

logging:
  --log-level {info,debug,critical,warning,error}
        Verbosity of logging. (default: warning)
stats:
  --stats
        Enable sending statistics to statsd. (default: False)
  --stats-host STATS_HOST
        Statsd host to send statistics to. (default: localhost)
  --stats-port STATS_PORT
        Statsd port to send statistics to. (default: 8125)
  --stats-environment STATS_ENVIRONMENT
        Statsd environment. (default: dev)
https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
BASIC APP USING STDLIB 
import krux.cli

class App(krux.cli.Application):
    def __init__(self):
        # Call to the superclass to bootstrap.
        super(App, self).__init__(name='sample-app')

    def run(self):
        stats = self.stats
        log = self.logger

        # Time the whole run and report it to statsd under the key 'run'.
        with stats.timer('run'):
            log.info('running...')
            ...
https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/ 
https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
CLI 
echo 'events.deploy.appname:1|c' | nc -u localhost 8125
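The same deploy event can be sent from a script without `nc`. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes a StatsD daemon listening on UDP port 8125 as in the command above.

```python
import socket


def format_counter(key, value=1):
    """Build a StatsD counter line such as 'events.deploy.appname:1|c'."""
    return "{}:{}|c".format(key, value)


def send_stat(line, host="localhost", port=8125):
    """Fire-and-forget a single stat over UDP; no reply is expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


line = format_counter("events.deploy.appname")
print(line)
```

Calling `send_stat(line)` at the end of a deploy script records the event. Because UDP is connectionless, a missing or slow StatsD daemon does not block the deploy.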
JAVASCRIPT 
Use a simple HTTP endpoint to send stats
SUPERVISOR 
Instrument Supervisord using Sulphite 
http://www.dilbertcelart.com/dale/c26.jpg 
http://supervisord.org/ 
https://github.com/jib/sulphite
BASIC CONFIGURATION 
# Install from PyPI
$ pip install sulphite

# Set up as an eventlistener in Supervisor
[eventlistener:sulphite]
command=sulphite --graphite-server=…
events=PROCESS_STATE
numprocs=1
http://supervisord.org/events.html 
https://github.com/jib/sulphite/blob/master/README.md
FATAL PROCESS EXITS 
Processes that exited unexpectedly and that Supervisor was 
unable to restart after N retries
PUPPET 
Use the Puppet module graphite-report to send Puppet 
reporting data directly to Graphite 
http://docs.puppetlabs.com/guides/reporting.html 
https://github.com/krux/puppet-module-graphite-report
KEEP TRACK OF COSTS 
Use CloudWatch CLI tools and send to Statsd 
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
BASIC USAGE 
# Charge to date for $service
$ mon-get-stats EstimatedCharges \
    --namespace "AWS/Billing" \
    --statistics Sum \
    --dimensions "ServiceName=${service}" \
    --start-time $date
http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
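Once a charge-to-date figure has been obtained, it can be forwarded to StatsD as a gauge. The sketch below only shows the formatting step; the `stats.aws.billing` prefix, the function name and the sample amount are assumptions made for illustration, and fetching the value from CloudWatch is left out.

```python
def billing_gauge(service, charge_usd, prefix="stats.aws.billing"):
    """Format a charge-to-date value as a StatsD gauge line (key:value|g)."""
    key = "{}.{}".format(prefix, service.lower())
    return "{}:{:.2f}|g".format(key, charge_usd)


# Hypothetical service name and amount.
line = billing_gauge("AmazonEC2", 1234.5)
print(line)
```

A gauge fits here because the estimated charge is a current level rather than a count of events.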
Q & A 
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html 
@jiboumans 
http://slideshare.net/jiboumans


How to measure everything - a million metrics per second with minimal developer overhead

  • 1. HOW TO MEASURE EVERYTHING A million metrics per second with minimal developer overhead ! Jos Boumans - @jiboumans http://www.imagemediapartners.com/Portals/20286/images/MeasuringTape-s.jpg
  • 2. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  • 3. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  • 4. KRUX VP of Operations & Infrastructure http://www.krux.com/
  • 5. SOME OF OUR CUSTOMERS
  • 6. A LOT OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  • 7. 0 35,000 70,000 105,000 140,000 AVERAGE DATA EVENTS / SEC http://investor.fb.com/results.cfm Twitter: New Tweets Wikipedia: Page Views Facebook: Messages Sent Krux: New Data Points http://www.statisticbrain.com/twitter-statistics/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  • 8. 0 500,000,000 1,000,000,000 1,500,000,000 2,000,000,000 MONTHLY UNIQUE USERS http://reportcard.wmflabs.org/ http://www.statisticbrain.com/twitter-statistics/ http://newsroom.fb.com/company-info/
  • 9. DATA IS EVERYTHING Always know what’s going on http://perpetual-wonder.com/blog/wp-content/uploads/2012/09/Where-do-we-go-from-here.jpg
  • 10. UNIQUE METRICS Unique metrics received, per second
  • 11. METRICS & VISUALIZATION … and a little bit of monitoring http://getfit101.files.wordpress.com/2012/04/visualization.jpg
  • 12. VISUALIZATION MATTERS Humans are good at patterns & shapes http://1.bp.blogspot.com/-CO-8FK9bohE/T89rD8dTyEI/AAAAAAAAAEE/YUZ00v_filk/s1600/live_like_it_matters_by_mythirll-d3iqcxt.jpg
  • 13. INSIGHT MATTERS We consider it a core competence http://yourselfseries.com/teens/files/2013/05/suicide_bonus_Insight_final.jpg
  • 14. SHOW EVERYONE And better yet, encourage people to add their own http://www.kissimmee.org/ftp/KCC/events/views/images/crowd_cheer.jpg
  • 16. KEY CHARACTERISTICS … of our metrics collection http://www.fullcirclefeedback.com.au/resources/wp-content/uploads/2014/01/Key-skills-and-characteristics-of-good-HR-leaders.jpg
  • 17. WHAT TO VISUALIZE Pick your operational KPIs http://1.bp.blogspot.com/-nrB1A9hamEk/UVZui_JUG1I/AAAAAAAAAdI/zGqHuanZNVU/s1600/missed-opportunities.jpg
  • 18. REQUEST & ERROR RATES The baseline for everything else
  • 19. WORST RESPONSE TIMES Track the worst upper 95th & upper 99th across a cluster
  • 20. TRACK EVENTS Did a code change or batch job cause a change in behaviour?
  • 21. CAPACITY / THRESHOLDS How much traffic can your service sustain?
  • 22. SINGLE SERVICE OVERVIEW Create a single graph for every service
  • 23. WHAT TO CAPTURE Everything. No, really. http://arkansasagnews.uark.edu/monarchs95.jpg
  • 24. INFRASTRUCTURE Everything needed to create, capture and act on a million metrics per seconds http://discussamerica.org/remer-blog/images/Freeway_Interchange2.jpg
  • 25. GRAPHITE, STATSD & COLLECTD The Trifecta
  • 26. COLLECTD Open Source Monitoring Tool https://collectd.org/ https://collectd.org/wiki/index.php/Plugin:StatsD
  • 27. STATSD Simple stats collector service https://github.com/etsy/statsd http://codeascraft.com/2011/02/15/measure-anything-measure-everything/ https://wwwx.cs.unc.edu/~sparkst/howto/http://emps.exeter.ac.uk/media/universityofexeter/emps/eisa/exista-splash.jpg network_tuning.php
  • 28. STATSD NAMING SCHEME stats. # to distinguish from events $environment. # prod, dev, etc $cluster_name. # api-ash, www-dub, etc $application. # webapp, login, etc $metric_name_here. # any key the app wants $hostname # node the stat came from
  • 29. STATSD CONFIGURATION { graphite: { globalPrefix: stats.$env.$cluster_name, globalSuffix: require(‘os').hostname().split('.')[0], legacyNamespace: false, }, percentThreshold: [ 95, 99 ], deleteIdleStats: true, } https://github.com/etsy/statsd/blob/master/exampleConfig.js
  • 30. GRAPHITE Metric store & Graph UI http://graphite.wikidot.com/ http://graphite.readthedocs.org/en/latest/
  • 31. GRAPHITE SETUP At least one graphite server per data center
  • 32. DATA RETENTION [default] pattern = .* priority = 110 retentions = 10:6h,60:15d,600:5y xFilesFactor = 0 http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf
  • 33. STANDARD AGGREGATIONS # Average & Sum for timers <prefix>.timers.<key>._totals.ash.<type>.avg (10) = avg <<prefix>>.timers.<<key>>.<node>.<type> ! <prefix>.timers.<key>._totals.ash.<type>.sum (10) = sum <<prefix>>.timers.<<key>>.<node>.(?!upper|lower)<type> ! # Min / Max for Lower / Upper <prefix>.timers.<key>._totals.ash.upper (10) = max <<prefix>>.timers.<<key>>.<node>.upper ! <prefix>.timers.<key>._totals.ash.lower (10) = min <<prefix>>.timers.<<key>>.<node>.lower http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
  • 34. PERFORMANCE First problem: IOPS Second problem: CPU http://www.organisationscience.com/styled-6/files/dt-improved-performance.jpg
  • 35. GRAPHITE ALTERNATIVES Circonus: All the insights you ever wanted Zabbix: OSS self hosted monitoring http://circonus.com http://zabbix.com https://github.com/lyft/circonus-statsd-backend https://github.com/dlecocq/statsd-zabbix
  • 36. GRAPHITE.JS Custom dashboards using jQuery https://github.com/prestontimmons/graphitejs http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/
  • 37. COST Optimize for adoption rates in your organization by eliminating cost as a constraint http://www.examiner.com/images/blog/wysiwyg/image/money].jpg
  • 38. INSTRUMENTATION Instrument your infrastructure, not just your apps http://2.bp.blogspot.com/-bL9D8VMtor4/TiNBDEJmvOI/AAAAAAAAByc/Y0Uc3GVPNl0/s400/SeminaGestaoPessoasOrquestraROB4428.jpg
  • 39. APACHE Use mod_statsd to capture stats directly from the Apache request http://kaleidos.net/files/images/apache318x260.png http://httpd.apache.org/ https://github.com/jib/mod_statsd
  • 40. BASIC CONFIGURATION
      <Location /api>
        Statsd On
        StatsdPrefix apache
      </Location>
      $ curl http://localhost/api/foo?id=42
      Stat: apache.api.foo.GET.200:31|ms
      https://github.com/jib/mod_statsd/blob/master/DOCUMENTATION
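The stat line above uses the plain-text StatsD format `name:value|type`. Here is a small sketch of building, parsing and sending such a line over UDP. The function names are hypothetical, and the default host and port are assumptions based on the usual StatsD port 8125.

```python
import socket


def format_timer(name, milliseconds):
    """Build a StatsD timer line such as 'apache.api.foo.GET.200:31|ms'."""
    return "{}:{}|ms".format(name, milliseconds)


def parse_stat(line):
    """Split a StatsD line into (name, value, type); any sample rate is ignored."""
    name, rest = line.rsplit(":", 1)
    fields = rest.split("|")
    return name, float(fields[0]), fields[1]


def send_stat(line, host="localhost", port=8125):
    """Send one stat line as a single UDP datagram (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    line = format_timer("apache.api.foo.GET.200", 31)
    print(line)
    print(parse_stat(line))
```

UDP keeps the sender cheap: the web server never waits for the stats collector, at the price of occasionally losing a datagram.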
  • 41. VARNISH Use libvmod-statsd & libvmod-timers to capture stats directly from the Varnish request http://www.adammalone.net/sites/default/files/styles/blog_image/public/varnish-bunny.png?itok=1bBDTA1A https://www.varnish-cache.org/ https://github.com/jib/libvmod-statsd
  • 42. BASIC CONFIGURATION
      # pseudo code
      import statsd;
      import timers;
      sub vcl_deliver {
        statsd.timing(
          $backend +      # from req.backend
          $hit_miss +     # from obj.hits
          $resp_code,     # from obj.status
          timers.req_response_time()
        );
      }
      https://github.com/jib/libvmod-statsd/blob/master/README.rst
      http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  • 43. SAMPLE GRAPH The requests per second & response time graphs are coming straight from Varnish
  • 44. PYTHON Create a base library in your language of choice https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  • 45. KRUX-STDLIB $ pip install krux-stdlib https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • 46. BASIC APP USING STDLIB
      $ sample-app -h
      […]
      logging:
        --log-level {info,debug,critical,warning,error}
                              Verbosity of logging. (default: warning)
      stats:
        --stats               Enable sending statistics to statsd. (default: False)
        --stats-host STATS_HOST
                              Statsd host to send statistics to. (default: localhost)
        --stats-port STATS_PORT
                              Statsd port to send statistics to. (default: 8125)
        --stats-environment STATS_ENVIRONMENT
                              Statsd environment. (default: dev)
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
  • 47. BASIC APP USING STDLIB
      import krux.cli
      class App(krux.cli.Application):
          def __init__(self):
              ### Call to the superclass to bootstrap.
              super(App, self).__init__(name='sample-app')
          def run(self):
              stats = self.stats
              log = self.logger
              # Time the body of run() and report it as the 'run' timer
              with stats.timer('run'):
                  log.info('running...')
                  ...
      https://staticfiles.krxd.net/foss/docs/pypi/krux-stdlib/
      https://pypi.python.org/pypi?%3Aaction=search&term=krux&submit=search
  • 49. JAVASCRIPT Use a simple HTTP endpoint to send stats
  • 50. SUPERVISOR Instrument Supervisord using Sulphite http://www.dilbertcelart.com/dale/c26.jpg http://supervisord.org/ https://github.com/jib/sulphite
  • 51. BASIC CONFIGURATION
      # Install from PyPI
      $ pip install sulphite
      # Set up as an eventlistener in Supervisor
      [eventlistener:sulphite]
      command=sulphite --graphite-server=…
      events=PROCESS_STATE
      numprocs=1
      http://supervisord.org/events.html
      https://github.com/jib/sulphite/blob/master/README.md
  • 52. FATAL PROCESS EXITS Processes that exited unexpectedly and that Supervisor was unable to restart after N retries
  • 53. PUPPET Use the Puppet module graphite-report to send Puppet reporting data directly to Graphite http://docs.puppetlabs.com/guides/reporting.html https://github.com/krux/puppet-module-graphite-report
  • 54. KEEP TRACK OF COSTS Use CloudWatch CLI tools and send to Statsd http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • 55. BASIC USAGE
      # Charge to date for $service
      $ mon-get-stats EstimatedCharges \
          --namespace "AWS/Billing" \
          --statistics Sum \
          --dimensions "ServiceName=${service}" \
          --start-time $date
      http://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/SetupCLI.html
  • 56. Q & A http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html @jiboumans http://slideshare.net/jiboumans