SlideShare a Scribd company logo
1 of 35
Download to read offline
Fact-based Monitoring 
puppetconf 2014 
Alexis Lê-Quôc @alq
Alexis Lê-Quôc, @alq 
CTO at Datadog
Poll: Monitoring makes me… 
happy 
proud 
cry 
want to hide
Puppet brings Automation to 
Systems Management
Improve 
Monitoring 
the way Puppet has 
improved 
Systems Management
“The good old days” 
• Your “CMDB” was Excel 
• SSH in and hack away 
• Little time for anything else
Then Puppet came… 
• Expressive rules that capture expected result 
• Using facts and classifiers, a.k.a. metadata to figure out where to 
apply changes 
• That freed up a lot of our time* 
* on a per-machine basis
“Puppet brings immunity of configuration to change in 
infrastructure” 
–Me (just now)
I have seen this before…
“[SQL brings] immunity of application to change in storage 
structure and access strategy” 
–C.J. Date (1977) 
http://www.cs.berkeley.edu/~brewer/cs262/SystemR.pdf
SQL 
• 1974 IBM introduces System R and its Structured Query Language 
• Expressive rules that capture expected result 
• Using facts and predicates, a.k.a. metadata to figure out what data 
to get 
• That freed up a lot of development time
SQL 
• From a time-consuming, imperative mess (“how”) 
• … to expressive data queries (“what”) 
SQL query 
SELECT (desired facts) 
FROM (existing facts) 
WHERE (matching criteria)
Puppet 
• From a time-consuming, imperative mess (“how”) 
• … to expressive configuration queries (“what”) 
puppet apply 
CHANGE (desired facts) 
FROM (existing puppet facts) 
WHERE (matching puppet classes)
Is there a pattern?
“Break free from ever more complex naming conventions for 
hostnames as a means of identity. Use a very rich set of meta 
data provided by each machine to address them.” 
–MCollective overview
MCollective 
• From a time-consuming, imperative mess (“how”) 
• … to expressive orchestration queries (“what”) 
mco rpc service restart service=nginx 
-F webpool=A 
EXEC (desired actions) 
FROM (existing puppet facts) 
WHERE (matching puppet classes)
Back to monitoring 
• Monitoring is to behavior what Puppet is to configuration 
• Monitoring is to behavior what MCollective is to orchestration
Monitoring 
• From a time-consuming, imperative mess (“how”) 
• … to expressive monitoring queries (“what”) 
Monitoring query 
MONITOR (desired behavior) 
FROM (existing heartbeats/metrics) 
WHERE (matching puppet facts)
Examples 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
• “At least one ActiveMQ server is up to support mcollective" 
• Never mention a hostname
Hosts are not the center of the 
monitoring universe. 
Facts are! 
Hosts are just places where facts occur.
The proof is in the pudding…
Hosts at the center of the universe 
a.k.a. the Wrong Way
“Its fairly straightforward, so hopefully you find things easy to 
understand…” 
–Nagios Core 4 manual on monitoring clusters
Host-centric: Monitor a DNS cluster 
check_command 
check_service_cluster!"DNS Cluster"!0!1! 
$SERVICESTATEID:host1:DNS Service$,$SERVICESTATEID:host2:DNS 
Service$,$SERVICESTATEID:host3:DNS Service$ 
Where do host1, host2, host3 come from?
Host-centric: can’t use facts directly 
• “Host groups solve this problem”. No, they don’t. 
• Combinatorial explosion, e.g. trivially 
• 4 data centers (us-1, us-2, eu, apac) 
• 5 classes (web, db, cache, appserver, hadoop) 
• 3 environments (test, staging, prod) 
• => up to 119 materialized host groups
Nagios-bashing? 
• No! 
• Same fatal flaw with all host-centric monitoring tools 
• Host-centric monitoring forces an extra, expensive step: 
• replicate fact-based conditionals in host-centric templates
“Please note that this module is not for the faint of heart. Even I 
(the author) have my head hurt each time I have to make 
modifications to it…” 
–puppet-nagios author
Facts at the center of the universe 
a.k.a. the Right Way 
"De Revolutionibus manuscript p9b" by Nicolas Copernicus - www.bj.uj.edu.pl. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:De_Revolutionibus_manuscript_p9b.jpg#mediaviewer/ 
File:De_Revolutionibus_manuscript_p9b.jpga
Earlier Examples 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
• “At least one ActiveMQ server is up to support mcollective"
In Sensu (heartbeats) 
• “All PostgreSQL servers must have a postgres: bgwriter process 
running” 
class postgres::monitoring::sensu { 
sensu::subscription { 'postgres': } 
} 
• Monitoring using a fact-based query 
• Is node of class “postgres” and subscribed to “postgres” or not? 
• If so, it will execute the postgres check
In Datadog (metrics) 
• “All provisioned web servers in the production environment, 
datacenter ABC must respond to queries within 200ms” 
$ puppet module install datadog-datadog_agent 
class { 
‘datadog_agent’: 
api_key => …, 
tags => [$environment], 
fact_to_tags => [“datacenter”] 
} 
include datadog_agent::integrations::nginx
In Datadog (metrics) 
• Monitoring using a fact-based query 
• Puppet facts directly reused 
max(nginx.request.latency{production,datacenter:ABC}) < 200
What to take away
Fact-based monitoring 
1. Hosts are not at the center of the monitoring universe 
2. Expressive monitoring uses queries 
3. Monitoring queries should use Puppet facts
Thank you!

More Related Content

What's hot

Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudDatadog
 
Events and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsEvents and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsDatadog
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integrationaspyker
 
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事smalltown
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016aspyker
 
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hairRENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hairJohn Constable
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Dave Holland
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
Introduction to Akka-Streams
Introduction to Akka-StreamsIntroduction to Akka-Streams
Introduction to Akka-Streamsdmantula
 
QCon NYC: Distributed systems in practice, in theory
QCon NYC: Distributed systems in practice, in theoryQCon NYC: Distributed systems in practice, in theory
QCon NYC: Distributed systems in practice, in theoryAysylu Greenberg
 
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...DataStax Academy
 
Sf bay area Kubernetes meetup dec8 2016 - deployment models
Sf bay area Kubernetes meetup dec8 2016 - deployment modelsSf bay area Kubernetes meetup dec8 2016 - deployment models
Sf bay area Kubernetes meetup dec8 2016 - deployment modelsPeter Ss
 
What's new in Kubernetes
What's new in KubernetesWhat's new in Kubernetes
What's new in KubernetesDaniel Smith
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Handling Redis failover with ZooKeeper
Handling Redis failover with ZooKeeperHandling Redis failover with ZooKeeper
Handling Redis failover with ZooKeeperryanlecompte
 
Python & Cassandra - Best Friends
Python & Cassandra - Best FriendsPython & Cassandra - Best Friends
Python & Cassandra - Best FriendsJon Haddad
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...Ruslan Meshenberg
 
Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Jon Haddad
 
How Yelp does Service Discovery
How Yelp does Service DiscoveryHow Yelp does Service Discovery
How Yelp does Service DiscoveryJohn Billings
 
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...Jakob Karalus
 

What's hot (20)

Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Events and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsEvents and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of Webops
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
AgileTW Feat. DevOpsTW: 維運 Kubernetes 的兩三事
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
 
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hairRENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair
RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Introduction to Akka-Streams
Introduction to Akka-StreamsIntroduction to Akka-Streams
Introduction to Akka-Streams
 
QCon NYC: Distributed systems in practice, in theory
QCon NYC: Distributed systems in practice, in theoryQCon NYC: Distributed systems in practice, in theory
QCon NYC: Distributed systems in practice, in theory
 
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
 
Sf bay area Kubernetes meetup dec8 2016 - deployment models
Sf bay area Kubernetes meetup dec8 2016 - deployment modelsSf bay area Kubernetes meetup dec8 2016 - deployment models
Sf bay area Kubernetes meetup dec8 2016 - deployment models
 
What's new in Kubernetes
What's new in KubernetesWhat's new in Kubernetes
What's new in Kubernetes
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Handling Redis failover with ZooKeeper
Handling Redis failover with ZooKeeperHandling Redis failover with ZooKeeper
Handling Redis failover with ZooKeeper
 
Python & Cassandra - Best Friends
Python & Cassandra - Best FriendsPython & Cassandra - Best Friends
Python & Cassandra - Best Friends
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
 
Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014
 
How Yelp does Service Discovery
How Yelp does Service DiscoveryHow Yelp does Service Discovery
How Yelp does Service Discovery
 
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...
Managing Stateful Services with the Operator Pattern in Kubernetes - Kubernet...
 

Similar to Fact-Based Monitoring

Cassandra
CassandraCassandra
Cassandraexsuns
 
CBDW2014 - MockBox, get ready to mock your socks off!
CBDW2014 - MockBox, get ready to mock your socks off!CBDW2014 - MockBox, get ready to mock your socks off!
CBDW2014 - MockBox, get ready to mock your socks off!Ortus Solutions, Corp
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵Amazon Web Services Korea
 
Kubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewKubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewLei (Harry) Zhang
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018Roy Russo
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disquszeeg
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqEnis Afgan
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationJonathan Katz
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesLeo Loobeek
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Learn you some Ansible for great good!
Learn you some Ansible for great good!Learn you some Ansible for great good!
Learn you some Ansible for great good!David Lapsley
 
Using Apache Spark and MySQL for Data Analysis
Using Apache Spark and MySQL for Data AnalysisUsing Apache Spark and MySQL for Data Analysis
Using Apache Spark and MySQL for Data AnalysisSveta Smirnova
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examplesPeter Lawrey
 
Ansible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less CoffeeAnsible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less CoffeeSarah Z
 
Python Utilities for Managing MySQL Databases
Python Utilities for Managing MySQL DatabasesPython Utilities for Managing MySQL Databases
Python Utilities for Managing MySQL DatabasesMats Kindahl
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Bryan Bende
 

Similar to Fact-Based Monitoring (20)

Cassandra
CassandraCassandra
Cassandra
 
CBDW2014 - MockBox, get ready to mock your socks off!
CBDW2014 - MockBox, get ready to mock your socks off!CBDW2014 - MockBox, get ready to mock your socks off!
CBDW2014 - MockBox, get ready to mock your socks off!
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
Kubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewKubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical View
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management Application
 
TIAD : Automating the modern datacenter
TIAD : Automating the modern datacenterTIAD : Automating the modern datacenter
TIAD : Automating the modern datacenter
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Learn you some Ansible for great good!
Learn you some Ansible for great good!Learn you some Ansible for great good!
Learn you some Ansible for great good!
 
Tech4Africa 2014
Tech4Africa 2014Tech4Africa 2014
Tech4Africa 2014
 
Using Apache Spark and MySQL for Data Analysis
Using Apache Spark and MySQL for Data AnalysisUsing Apache Spark and MySQL for Data Analysis
Using Apache Spark and MySQL for Data Analysis
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examples
 
Ansible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less CoffeeAnsible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less Coffee
 
Python Utilities for Managing MySQL Databases
Python Utilities for Managing MySQL DatabasesPython Utilities for Managing MySQL Databases
Python Utilities for Managing MySQL Databases
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014
 

More from Datadog

What it Means to be a Next-Generation Managed Service Provider
What it Means to be a Next-Generation Managed Service ProviderWhat it Means to be a Next-Generation Managed Service Provider
What it Means to be a Next-Generation Managed Service ProviderDatadog
 
Datadog + VictorOps Webinar
Datadog + VictorOps WebinarDatadog + VictorOps Webinar
Datadog + VictorOps WebinarDatadog
 
Dataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDatadog
 
PyData NYC 2015 - Automatically Detecting Outliers with Datadog
PyData NYC 2015 - Automatically Detecting Outliers with Datadog PyData NYC 2015 - Automatically Detecting Outliers with Datadog
PyData NYC 2015 - Automatically Detecting Outliers with Datadog Datadog
 
Treating Infrastructure as Garbage
Treating Infrastructure as GarbageTreating Infrastructure as Garbage
Treating Infrastructure as GarbageDatadog
 
Big (IT) data
Big (IT) dataBig (IT) data
Big (IT) dataDatadog
 
Deep dive into Nagios analytics
Deep dive into Nagios analyticsDeep dive into Nagios analytics
Deep dive into Nagios analyticsDatadog
 
Just enough web ops for web developers
Just enough web ops for web developersJust enough web ops for web developers
Just enough web ops for web developersDatadog
 
Customer Ops: DevOps &lt;3 customer support
Customer Ops: DevOps &lt;3 customer supportCustomer Ops: DevOps &lt;3 customer support
Customer Ops: DevOps &lt;3 customer supportDatadog
 
I &lt;3 graphs in 20 slides
I &lt;3 graphs in 20 slidesI &lt;3 graphs in 20 slides
I &lt;3 graphs in 20 slidesDatadog
 
Effective monitoring with StatsD
Effective monitoring with StatsDEffective monitoring with StatsD
Effective monitoring with StatsDDatadog
 
Alerting: more signal, less noise, less pain
Alerting: more signal, less noise, less painAlerting: more signal, less noise, less pain
Alerting: more signal, less noise, less painDatadog
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoringDatadog
 
Monitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-toMonitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-toDatadog
 
What’s in this Cookbook? - Mike Fiedler
What’s in this Cookbook? - Mike FiedlerWhat’s in this Cookbook? - Mike Fiedler
What’s in this Cookbook? - Mike FiedlerDatadog
 
I Love Graphs - Alexis Lê-Quôc
I Love Graphs - Alexis Lê-QuôcI Love Graphs - Alexis Lê-Quôc
I Love Graphs - Alexis Lê-QuôcDatadog
 
Why Puppet Sucks - Rob Terhaar
Why Puppet Sucks - Rob TerhaarWhy Puppet Sucks - Rob Terhaar
Why Puppet Sucks - Rob TerhaarDatadog
 
Welcome to a Computing Revolution - Alex Lesser
Welcome to a Computing Revolution - Alex LesserWelcome to a Computing Revolution - Alex Lesser
Welcome to a Computing Revolution - Alex LesserDatadog
 
Cosa Nostra - Tom Santero
Cosa Nostra - Tom SanteroCosa Nostra - Tom Santero
Cosa Nostra - Tom SanteroDatadog
 
Bulk Exporting from Cassandra - Carlo Cabanilla
Bulk Exporting from Cassandra - Carlo CabanillaBulk Exporting from Cassandra - Carlo Cabanilla
Bulk Exporting from Cassandra - Carlo CabanillaDatadog
 

More from Datadog (20)

What it Means to be a Next-Generation Managed Service Provider
What it Means to be a Next-Generation Managed Service ProviderWhat it Means to be a Next-Generation Managed Service Provider
What it Means to be a Next-Generation Managed Service Provider
 
Datadog + VictorOps Webinar
Datadog + VictorOps WebinarDatadog + VictorOps Webinar
Datadog + VictorOps Webinar
 
Dataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDataday Texas 2016 - Datadog
Dataday Texas 2016 - Datadog
 
PyData NYC 2015 - Automatically Detecting Outliers with Datadog
PyData NYC 2015 - Automatically Detecting Outliers with Datadog PyData NYC 2015 - Automatically Detecting Outliers with Datadog
PyData NYC 2015 - Automatically Detecting Outliers with Datadog
 
Treating Infrastructure as Garbage
Treating Infrastructure as GarbageTreating Infrastructure as Garbage
Treating Infrastructure as Garbage
 
Big (IT) data
Big (IT) dataBig (IT) data
Big (IT) data
 
Deep dive into Nagios analytics
Deep dive into Nagios analyticsDeep dive into Nagios analytics
Deep dive into Nagios analytics
 
Just enough web ops for web developers
Just enough web ops for web developersJust enough web ops for web developers
Just enough web ops for web developers
 
Customer Ops: DevOps &lt;3 customer support
Customer Ops: DevOps &lt;3 customer supportCustomer Ops: DevOps &lt;3 customer support
Customer Ops: DevOps &lt;3 customer support
 
I &lt;3 graphs in 20 slides
I &lt;3 graphs in 20 slidesI &lt;3 graphs in 20 slides
I &lt;3 graphs in 20 slides
 
Effective monitoring with StatsD
Effective monitoring with StatsDEffective monitoring with StatsD
Effective monitoring with StatsD
 
Alerting: more signal, less noise, less pain
Alerting: more signal, less noise, less painAlerting: more signal, less noise, less pain
Alerting: more signal, less noise, less pain
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoring
 
Monitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-toMonitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-to
 
What’s in this Cookbook? - Mike Fiedler
What’s in this Cookbook? - Mike FiedlerWhat’s in this Cookbook? - Mike Fiedler
What’s in this Cookbook? - Mike Fiedler
 
I Love Graphs - Alexis Lê-Quôc
I Love Graphs - Alexis Lê-QuôcI Love Graphs - Alexis Lê-Quôc
I Love Graphs - Alexis Lê-Quôc
 
Why Puppet Sucks - Rob Terhaar
Why Puppet Sucks - Rob TerhaarWhy Puppet Sucks - Rob Terhaar
Why Puppet Sucks - Rob Terhaar
 
Welcome to a Computing Revolution - Alex Lesser
Welcome to a Computing Revolution - Alex LesserWelcome to a Computing Revolution - Alex Lesser
Welcome to a Computing Revolution - Alex Lesser
 
Cosa Nostra - Tom Santero
Cosa Nostra - Tom SanteroCosa Nostra - Tom Santero
Cosa Nostra - Tom Santero
 
Bulk Exporting from Cassandra - Carlo Cabanilla
Bulk Exporting from Cassandra - Carlo CabanillaBulk Exporting from Cassandra - Carlo Cabanilla
Bulk Exporting from Cassandra - Carlo Cabanilla
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Fact-Based Monitoring

  • 1. Fact-based Monitoring puppetconf 2014 Alexis Lê-Quôc @alq
  • 2. Alexis Lê-Quôc, @alq CTO at Datadog
  • 3. Poll: Monitoring makes me… happy proud cry want to hide
  • 4. Puppet brings Automation to Systems Management
  • 5. Improve Monitoring the way Puppet has improved Systems Management
  • 6. “The good old days” • Your “CMDB” was Excel • SSH in and hack away • Little time for anything else
  • 7. Then Puppet came… • Expressive rules that capture expected result • Using facts and classifiers, a.k.a. metadata to figure out where to apply changes • That freed up a lot of our time* * on a per-machine basis
  • 8. “Puppet brings immunity of configuration to change in infrastructure” –Me (just now)
  • 9. I have seen this before…
  • 10. “[SQL brings] immunity of application to change in storage structure and access strategy” –C.J. Date (1977) http://www.cs.berkeley.edu/~brewer/cs262/SystemR.pdf
  • 11. SQL • 1974 IBM introduces System R and its Structured Query Language • Expressive rules that capture expected result • Using facts and predicates, a.k.a. metadata to figure out what data to get • That freed up a lot of development time
  • 12. SQL • From a time-consuming, imperative mess (“how”) • … to expressive data queries (“what”) SQL query SELECT (desired facts) FROM (existing facts) WHERE (matching criteria)
  • 13. Puppet • From a time-consuming, imperative mess (“how”) • … to expressive configuration queries (“what”) puppet apply CHANGE (desired facts) FROM (existing puppet facts) WHERE (matching puppet classes)
  • 14. Is there a pattern?
  • 15. “Break free from ever more complex naming conventions for hostnames as a means of identity. Use a very rich set of meta data provided by each machine to address them.” –MCollective overview
  • 16. MCollective • From a time-consuming, imperative mess (“how”) • … to expressive orchestration queries (“what”) mco rpc service restart service=nginx -F webpool=A EXEC (desired actions) FROM (existing puppet facts) WHERE (matching puppet classes)
  • 17. Back to monitoring • Monitoring is to behavior what Puppet is to configuration • Monitoring is to behavior what MCollective is to orchestration
  • 18. Monitoring • From a time-consuming, imperative mess (“how”) • … to expressive monitoring queries (“what”) Monitoring query MONITOR (desired behavior) FROM (existing heartbeats/metrics) WHERE (matching puppet facts)
  • 19. Examples • “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms” • “All PostgreSQL servers must have a postgres: bgwriter process running” • “At least one ActiveMQ server is up to support mcollective" • Never mention a hostname
  • 20. Hosts are not the center of the monitoring universe. Facts are! Hosts are just places where facts occur.
  • 21. The proof is in the pudding…
  • 22. Hosts at the center of the universe a.k.a. the Wrong Way
  • 23. “Its fairly straightforward, so hopefully you find things easy to understand…” –Nagios Core 4 manual on monitoring clusters
  • 24. Host-centric: Monitor a DNS cluster check_command check_service_cluster!"DNS Cluster"!0!1! $SERVICESTATEID:host1:DNS Service$,$SERVICESTATEID:host2:DNS Service$,$SERVICESTATEID:host3:DNS Service$ Where do host1, host2, host3 come from?
  • 25. Host-centric: can’t use facts directly • “Host groups solve this problem”. No, they don’t. • Combinatorial explosion, e.g. trivially • 4 data centers (us-1, us-2, eu, apac) • 5 classes (web, db, cache, appserver, hadoop) • 3 environments (test, staging, prod) • => up to 119 materialized host groups
  • 26. Nagios-bashing? • No! • Same fatal flaw with all host-centric monitoring tools • Host-centric monitoring forces an extra, expensive step: • replicate fact-based conditionals in host-centric templates
  • 27. “Please note that this module is not for the faint of heart. Even I (the author) have my head hurt each time I have to make modifications to it…” –puppet-nagios author
  • 28. Facts at the center of the universe a.k.a. the Right Way "De Revolutionibus manuscript p9b" by Nicolas Copernicus - www.bj.uj.edu.pl. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:De_Revolutionibus_manuscript_p9b.jpg#mediaviewer/ File:De_Revolutionibus_manuscript_p9b.jpga
  • 29. Earlier Examples • “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms” • “All PostgreSQL servers must have a postgres: bgwriter process running” • “At least one ActiveMQ server is up to support mcollective"
  • 30. In Sensu (heartbeats) • “All PostgreSQL servers must have a postgres: bgwriter process running” class postgres::monitoring::sensu { sensu::subscription { 'postgres': } } • Monitoring using a fact-based query • Is node of class “postgres” and subscribed to “postgres” or not? • If so, it will execute the postgres check
  • 31. In Datadog (metrics) • “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms” $ puppet module install datadog-datadog_agent class { ‘datadog_agent’: api_key => …, tags => [$environment], fact_to_tags => [“datacenter”] } include datadog_agent::integrations::nginx
  • 32. In Datadog (metrics) • Monitoring using a fact-based query • Puppet facts directly reused max(nginx.request.latency{production,datacenter:ABC}) < 200
  • 33. What to take away
  • 34. Fact-based monitoring 1. Hosts are not at the center of the monitoring universe 2. Expressive monitoring uses queries 3. Monitoring queries should use Puppet facts