SlideShare a Scribd company logo

Systems Monitoring with Prometheus (Devops Ireland April 2015)

Monitoring means many things to many people. This talk looks at Systems Monitoring, that is how to keep an eye on a given system and use this as part of overall management of a system. This talk will cover Why one monitors, What to monitor, How to monitor, the general design of a monitoring system and how Prometheus is a good fit for this in terms of instrumentation, consoles, alerts, general system health and sanity. Prometheus is a next-generation monitoring system publicly announced earlier this year, developed by companies including SoundCloud, locals Boxever and Docker. Since launch there has been wide-spread interest, and many community contributions. For more information see http://prometheus.io or http://www.boxever.com/tag/monitoring

1 of 46
Systems Monitoring with Prometheus
Devops Ireland, April 2015
Brian Brazil
Senior Software Engineer
Boxever
What is monitoring?
What is monitoring?
Host-based checks?
• Typically shell scripts with success/fail
• Failure causes alerts
• More blackbox than whitebox
• About machines, not services
Brian’s Pet Peeve #1
Thinking in terms of machines rather than services.
It’s the future, it’s not the “Webserver machine” it’s one
machine what happens to run a Webserver as part of the
Webserver service.
What is monitoring?
Highly granular information about a subsystem?
• Tends to be focused on one subsystem, such as
incoming http requests
• No visibility into the rest of the system
What is monitoring?
High frequency high granularity profiling?
• Great for debugging once you’ve narrowed down the
problem
• Not so useful for general monitoring
Ad

Recommended

Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusMarco Pas
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
 

More Related Content

What's hot

Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to PrometheusJulien Pivotto
 
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialTim Vaillancourt
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basicsJuraj Hantak
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaYoungHeon (Roy) Kim
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsMarco Pracucci
 

What's hot (20)

Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and Grafana
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Prometheus 101
Prometheus 101Prometheus 101
Prometheus 101
 
Prometheus and Grafana
Prometheus and GrafanaPrometheus and Grafana
Prometheus and Grafana
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & Grafana
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Prometheus monitoring
Prometheus monitoringPrometheus monitoring
Prometheus monitoring
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for Logs
 

Similar to Systems Monitoring with Prometheus (Devops Ireland April 2015)

Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systemsBill Buchan
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...eZ Systems
 
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation ProjectsAmazon Web Services
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Brian Brazil
 
Eric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler
 
Best Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseBest Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseTaylor Lovett
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
How Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectHow Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectWan Leung Wong
 
SharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughSharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughGabrijela Orsag
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 
Using Apache Camel as AKKA
Using Apache Camel as AKKAUsing Apache Camel as AKKA
Using Apache Camel as AKKAJohan Edstrom
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to MicroservicesAd van der Veer
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 

Similar to Systems Monitoring with Prometheus (Devops Ireland April 2015) (20)

Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
 
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
 
Eric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New Contexts
 
Best Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseBest Practices for WordPress in Enterprise
Best Practices for WordPress in Enterprise
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Redundant devops
Redundant devopsRedundant devops
Redundant devops
 
How Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectHow Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your Project
 
SharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughSharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonough
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Using Apache Camel as AKKA
Using Apache Camel as AKKAUsing Apache Camel as AKKA
Using Apache Camel as AKKA
 
soa
soasoa
soa
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 

More from Brian Brazil

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)Brian Brazil
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Brian Brazil
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Brian Brazil
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an ExporterBrian Brazil
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLBrian Brazil
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Brian Brazil
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 

More from Brian Brazil (20)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 

Recently uploaded

Obstructive jaundice is a medical condition characterized by the yellowing of...
Obstructive jaundice is a medical condition characterized by the yellowing of...Obstructive jaundice is a medical condition characterized by the yellowing of...
Obstructive jaundice is a medical condition characterized by the yellowing of...ssuser7b7f4e
 
Regulation is Coming - Trusted Media Summit 2023
Regulation is Coming - Trusted Media Summit 2023Regulation is Coming - Trusted Media Summit 2023
Regulation is Coming - Trusted Media Summit 2023Damar Juniarto
 
Red shadows ringing in Japan's Cyberspace
Red shadows ringing in Japan's CyberspaceRed shadows ringing in Japan's Cyberspace
Red shadows ringing in Japan's Cyberspacesttyk
 
Modern Red Teaming - subverting mature defenses on a budget
Modern Red Teaming - subverting mature defenses on a budgetModern Red Teaming - subverting mature defenses on a budget
Modern Red Teaming - subverting mature defenses on a budgetmatt806068
 
history of tau gamma architect.1968.....
history of tau gamma architect.1968.....history of tau gamma architect.1968.....
history of tau gamma architect.1968.....josephiigo
 
UGB INTERNETBANKING FACILITY LAUNCHED.pptx
UGB INTERNETBANKING FACILITY LAUNCHED.pptxUGB INTERNETBANKING FACILITY LAUNCHED.pptx
UGB INTERNETBANKING FACILITY LAUNCHED.pptxRitesh Sahu
 
Model Jaringan network jaringan komputer.pdf
Model Jaringan network jaringan komputer.pdfModel Jaringan network jaringan komputer.pdf
Model Jaringan network jaringan komputer.pdfgalfinprihardiputra0
 
AWS Overview of AWS Clarify, Feature Store, Hyper parameter Tuning
AWS Overview of AWS  Clarify, Feature Store, Hyper parameter TuningAWS Overview of AWS  Clarify, Feature Store, Hyper parameter Tuning
AWS Overview of AWS Clarify, Feature Store, Hyper parameter TuningVarun Garg
 
Augmented and Mixed Reality Solutions for Frontline Medical Professionals
Augmented and Mixed Reality Solutions for Frontline Medical ProfessionalsAugmented and Mixed Reality Solutions for Frontline Medical Professionals
Augmented and Mixed Reality Solutions for Frontline Medical Professionalsthirdeyegen65
 
Augmented and Mixed Reality Solutions for Aerospace & Defense
Augmented and Mixed Reality Solutions for Aerospace & DefenseAugmented and Mixed Reality Solutions for Aerospace & Defense
Augmented and Mixed Reality Solutions for Aerospace & Defensethirdeyegen65
 

Recently uploaded (10)

Obstructive jaundice is a medical condition characterized by the yellowing of...
Obstructive jaundice is a medical condition characterized by the yellowing of...Obstructive jaundice is a medical condition characterized by the yellowing of...
Obstructive jaundice is a medical condition characterized by the yellowing of...
 
Regulation is Coming - Trusted Media Summit 2023
Regulation is Coming - Trusted Media Summit 2023Regulation is Coming - Trusted Media Summit 2023
Regulation is Coming - Trusted Media Summit 2023
 
Red shadows ringing in Japan's Cyberspace
Red shadows ringing in Japan's CyberspaceRed shadows ringing in Japan's Cyberspace
Red shadows ringing in Japan's Cyberspace
 
Modern Red Teaming - subverting mature defenses on a budget
Modern Red Teaming - subverting mature defenses on a budgetModern Red Teaming - subverting mature defenses on a budget
Modern Red Teaming - subverting mature defenses on a budget
 
history of tau gamma architect.1968.....
history of tau gamma architect.1968.....history of tau gamma architect.1968.....
history of tau gamma architect.1968.....
 
UGB INTERNETBANKING FACILITY LAUNCHED.pptx
UGB INTERNETBANKING FACILITY LAUNCHED.pptxUGB INTERNETBANKING FACILITY LAUNCHED.pptx
UGB INTERNETBANKING FACILITY LAUNCHED.pptx
 
Model Jaringan network jaringan komputer.pdf
Model Jaringan network jaringan komputer.pdfModel Jaringan network jaringan komputer.pdf
Model Jaringan network jaringan komputer.pdf
 
AWS Overview of AWS Clarify, Feature Store, Hyper parameter Tuning
AWS Overview of AWS  Clarify, Feature Store, Hyper parameter TuningAWS Overview of AWS  Clarify, Feature Store, Hyper parameter Tuning
AWS Overview of AWS Clarify, Feature Store, Hyper parameter Tuning
 
Augmented and Mixed Reality Solutions for Frontline Medical Professionals
Augmented and Mixed Reality Solutions for Frontline Medical ProfessionalsAugmented and Mixed Reality Solutions for Frontline Medical Professionals
Augmented and Mixed Reality Solutions for Frontline Medical Professionals
 
Augmented and Mixed Reality Solutions for Aerospace & Defense
Augmented and Mixed Reality Solutions for Aerospace & DefenseAugmented and Mixed Reality Solutions for Aerospace & Defense
Augmented and Mixed Reality Solutions for Aerospace & Defense
 

Systems Monitoring with Prometheus (Devops Ireland April 2015)

  • 1. Systems Monitoring with Prometheus Devops Ireland, April 2015 Brian Brazil Senior Software Engineer Boxever
  • 3. What is monitoring? Host-based checks? • Typically shell scripts with success/fail • Failure causes alerts • More blackbox than whitebox • About machines, not services
  • 4. Brian’s Pet Peeve #1 Thinking in terms of machines rather than services. It’s the future, it’s not the “Webserver machine” it’s one machine what happens to run a Webserver as part of the Webserver service.
  • 5. What is monitoring? Highly granular information about a subsystem? • Tends to be focused on one subsystem, such as incoming http requests • No visibility into the rest of the system
  • 6. What is monitoring? High frequency high granularity profiling? • Great for debugging once you’ve narrowed down the problem • Not so useful for general monitoring
  • 7. What is monitoring? Tailing logs? • Easy to miss something • Tend to get very noisy • We have computers, why are humans doing repetitive tasks?
  • 8. Step Back Why do we want monitoring?
  • 9. Why: Alerting We want to know when things go wrong We want to know when things aren’t quite right We want to know in advance of problems
  • 10. Why: Debugging When something is up, you need to debug. You want to go from high-level problem, and drill down to what’s causing it. Need to be able to reason about things. Sometimes want to go from code back to metrics.
  • 11. Brian’s Pet Peeve #2 Instrumentation that you need to read the code to understand. Make the names such that a random person not intimately familiar with the system would have a good chance at guessing what it means. Specify your units.
  • 12. Why: Trending How the various bits of a system are being used. For example, how many static requests per dynamic request? How many sessions active at once? How many hit a certain corner case? For some stats, also want to know how they change over time for capacity planning and design discussions.
  • 13. A different approach What if we instrumented everything? • RPCs • Interfaces between subsystems • Business logic • Every time you’d log something What if we monitored systems and subsystems to know how everything is generally doing?
  • 14. That’s a lot of metrics That could be tens of thousands of codepoints across an entire system. You’d need some way to make it easy to instrument all code, not just the externally facing parts of applications. You’d need something able to handle a million time series.
  • 15. Presenting Prometheus An open-source service monitoring system and time series database. Started in 2012, primarily developed in Soundcloud with committers also in Boxever and Docker. Publicly announced January 2015, many contributions and users since then.
  • 17. The Server • Can handle over a million time series in one instance • No dependencies such as HBase or Cassandra • Stores data on local disk • Written in Go • Easy to run
  • 18. Data Model Tired of aggregating and alerting off metrics like http. responses.500.myserver.mydc.production? Time series have structured key-value pairs, e.g. http_responses_total{ response_code=”500”,instance=”myserver”, dc=”mydc”,env=”production”}
  • 19. Brian’s Pet Peeve #3 Munging structured data in a way that loses the structure Is it so much to ask for some escaping, or at least sanitizing any separators in the data?
  • 20. Query Language Aggregation based on the key-value labels Arbitrarily complex math And all of this can be used in pre-computed rules and alerts
  • 21. Query Language: Example Column families with the 10 highest read rates per second topk(10, sum by(job, keyspace, columnfamily) ( rate(cassandra_columnfamily_readlatency[5m]) ) )
  • 22. Client Libraries How you instrument your code • Official: Python, Java, Go, Ruby • Unofficial: Bash, NodeJS, .Net Text-based format, easy to write new client libraries
  • 24. Client Libraries: In and Out Client libraries don’t tie you to Prometheus instrumentation Custom collectors allow pulling data from other instrumentation systems into Prometheus client library Similarly, can pull data out of client library and expose as you wish
  • 25. Client Libraries: What to Instrument
  • 26. Client Libraries: What to Instrument Everything
  • 27. Things to have • Client and server qps/errors/latency • Every log message should be a metric • Every failure should be a metric • Threadpool/queue size, in progress, latency • Business logic inputs and outputs • Data sizes in/out • Process cpu/ram/language internals (e.g. GC) • Blackbox and end-to-end monitoring
  • 28. Batch/Offline Processing Metrics • Last time it succeeded • Records processed/throughput • Duration of major batch stages • Heartbeats for end-to-end testing Use the PushGateway for ephemeral jobs
  • 29. Brian’s Pet Peeve #4 Wrapping instrumentation libraries to make them “simpler” Tend to confuse abstractions, encourage bad practices and make it difficult to write correct and useable instrumentation e.g. Prometheus values are doubles, if you only allow ints then end user has to do math to convert back to seconds
  • 30. Speaking of Correct Instrumentation It’s better to have math done in the server, not the client Many instrumentation systems are exponentially decaying Do you really want to do calculus during an outage? Prometheus has monotonic counters Races and missed scrapers don’t lose data
  • 31. Integrations More powerful data model needs integrations and instrumentation written to take advantage of that Machine (Node Exporter), HAProxy, CloudWatch, Statsd, Collectd, JMX, Mesos, Consul, MySQL Direct instrumentation: cadvisor, etcd
  • 32. Dashboards • Promdash: Ruby on Rails web app • Console templates: More power for those who like checking things in • Expression browser: Ad-hoc queries • JSON interface: Roll your own
  • 34. Dashboards What to put on dashboards?
  • 35. Dashboards Goal: Make it easy to logically drill down a problem Most services are a graph. Make it easy to go from a console about one service, see which of it’s backends is the problem, and repeat.
  • 36. Brian’s Pet Peeve #5 Dashboard anti-patterns: ● Graph of a hundred plots ● Page of a thousand graphs ● Consoles that their creator can barely understand ● “Put it on a console somewhere”
  • 37. Dashboard Guidelines • Don’t put every possible metric on the dashboard • Focus on the top few metrics, based on the most likely failure modes and things you’ll want to know • No more than 5 graphs per console, 5 plots per graph • Have units, y-labels, legends and descriptions • Split out or trim if dashboards are getting too complex • It’s difficult for a dashboard to serve two masters • Dashboards are not for alerting
  • 38. Alerting Alertmanager aggregates alerts from Prometheus servers Supports notifications to Pagerduty, Email, Pushover Best practices: • Alert on symptoms not causes • Have a way to deal with non-critical alerts
  • 39. What to Alert On Online Serving: Overall latency, errors Offline Processing: Propagation/Processing Delay Batch Jobs: When it last Suceeded
  • 40. The Live Demo please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work
  • 41. The Future Many features on roadmap: • Service discovery • Federation • Long term storage • HA Alertmanager • More exporters, client libraries and integrations
  • 42. Final Word Systems Monitoring is your first port of call in an emergency, keep it working without needing lots of effort. Prometheus is awesome - can be lead to non-critical data taking lots of management, crowding out critical metrics. At some point, have to move non-critical things to generic data processing system.
  • 43. Finaler Word How do you know your monitoring system is good? • When you have superb monitoring for everything? • When it causes a HDD to fail? • When it finds two bugs in Go’s DNS library?
  • 44. Finaler Word How do you know your monitoring system is good? • When you have superb monitoring for everything? • When it causes a HDD to fail? • When it finds two bugs in Go’s DNS library? No, when the unittests find a bug in your filesystem!
  • 45. Try it out! http://www.boxever.com/tag/monitoring has step-by-step instructions on monitoring: • Machines • Cassandra • HAProxy • Python Batch host-based jobs • Java applications Problems? We’re on #prometheus on Freenode