Julien Pivotto
@roidelapluie
Monitoring in a fast-changing world with
Prometheus
October 2021
Monitoring
@roidelapluie
• Applications are short lived
• Updated often
• Infrastructure changes
(Nothing new...)
A fast changing world
@roidelapluie
• Monitoring an infrastructure
• Monitoring user experience
... together (dev&ops)
Monitoring in a "fast changing world"
@roidelapluie
CPU Usage, Disk space, Memory, Open file descriptors, ...
Infrastructure monitoring
@roidelapluie
Request Rate, Request Errors, Request Duration
Utilization, Saturation, Errors
User experience monitoring
@roidelapluie RED method by Tom Wilkie, USE method by Brendan Gregg
• High level overview of the state of a service/component
• Performance
• Availability
• Technical components
What is going on?
What is monitoring?
@roidelapluie
• Understand how your services behave
• As if you were in their place
• Without specific code
Why is this going on?
What's observability?
@roidelapluie
• Monitoring is required
• Some monitoring systems are designed for observability
• If lucky, monitoring is enough
• Observability is removing luck
How do monitoring and observability connect?
@roidelapluie
Three pillars:
• Metrics
• Logs
• Traces
What's observability - in Practice?
@roidelapluie
Metrics
@roidelapluie https://play.grafana.org/
Logs
@roidelapluie
Traces
@roidelapluie https://www.jaegertracing.io/img/trace-detail-ss.png
• Culture: building together
• Automation
• Measurement
• Sharing
A DevOps world
@roidelapluie John Willis and Damon Edwards, 2010
Prometheus
@roidelapluie
• Open Source monitoring solution
• Graduated CNCF Project
• Born in 2012, publicly announced in 2015
• Collects metrics
• Plenty of integrations
• Service discovery integrations, such as Kubernetes
• Easy to use query language
• Built-in alerting
Prometheus
@roidelapluie
• A community
• A server and many other components
• An ecosystem
What "Prometheus" means
@roidelapluie
• Open Source
• Pull-based Monitoring over HTTP
• Powerful query language
• Optimized TSDB
The Prometheus server
@roidelapluie
How does it work?
@roidelapluie
Metrics
@roidelapluie
Prometheus scrapes metrics over HTTP.
caddy_http_requests_total{code="200",method="POST",path="/load"} 1
Dimensional data model, for filtering and aggregation.
Metrics and Labels
@roidelapluie
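As a rough illustration of the dimensional data model, the sketch below renders one sample in the text exposition format shown on the slide. The helper name `render_sample` is hypothetical, and real applications would normally use an official client library rather than formatting lines by hand.

```python
def render_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"

    def esc(v):
        # Label values escape backslash, double quote and newline.
        return v.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    pairs = ",".join(f'{k}="{esc(v)}"' for k, v in labels.items())
    return f"{name}{{{pairs}}} {value}"


line = render_sample(
    "caddy_http_requests_total",
    {"code": "200", "method": "POST", "path": "/load"},
    1,
)
print(line)
```

Each distinct combination of label values is its own time series, which is what makes filtering and aggregation by label possible.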
sum(rate(caddy_http_requests_total{code=~"5.."}[5m]))
Gets the rate of all 5xx HTTP responses (server errors).
Querying metrics and Labels
@roidelapluie
sum(rate(caddy_http_requests_total{code=~"5.."}[5m])) without(code)
/
sum(rate(caddy_http_requests_total[5m])) without(code)
Gets the ratio of 5xx HTTP responses (server errors) to all responses, a value between 0 and 1. Multiply by 100 for a percentage.
Querying metrics and Labels
@roidelapluie
• Metrics do not represent problems
• Metrics represent a state, give insights
• Metrics can be graphed
• You can alert based on them
Metrics and monitoring
@roidelapluie
In general you can just expose counters, and let the monitoring server do the
real maths.
That keeps the overhead on apps very low.
Exposed metrics are "raw"
@roidelapluie
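To make "let the monitoring server do the real maths" concrete, here is a simplified sketch of what the server derives from raw counters. The sample values are made up, and `simple_rate` is a hypothetical helper: PromQL's real `rate()` also handles counter resets and extrapolates to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) pair.

    Simplified on purpose: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical raw counter samples over a 5-minute window: (seconds, value)
errors = [(0, 10), (300, 40)]      # requests with a 5xx code
total = [(0, 1000), (300, 1600)]   # all requests

error_rate = simple_rate(errors)   # errors per second
total_rate = simple_rate(total)    # requests per second
ratio = error_rate / total_rate    # share of requests that failed
print(error_rate, total_rate, ratio)
```

The application only increments two counters; the rate and the ratio are computed at query time.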
Architecture
@roidelapluie
• Prometheus server
• Alertmanager
• Exporters
Prometheus components
@roidelapluie
• Single binary
• No clustering
• No dependency on distributed FS
Prometheus server
@roidelapluie
• Single binary
• Clustering (gossip protocol)
Alertmanager
@roidelapluie
Automation
@roidelapluie
Let's see what makes Prometheus play nicely with automation tools.
Automating Prometheus
@roidelapluie
• Works on your machine
• Container ready
• Not tied to Kubernetes (see prometheus-operator)
Deploy anywhere
@roidelapluie
• Reloads on SIGHUP
• /-/reload endpoint (--web.enable-lifecycle)
• Working to have less and less overhead on reloads
Reloading Prometheus
@roidelapluie
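A minimal sketch of triggering a reload from an automation script, assuming Prometheus listens on `localhost:9090` and was started with `--web.enable-lifecycle`. The function names are hypothetical.

```python
import urllib.request


def reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090"):
    """Ask Prometheus to reload its configuration; True on HTTP 200.

    urlopen raises urllib.error.HTTPError if the reload is rejected,
    for example when the new configuration is invalid.
    """
    with urllib.request.urlopen(reload_request(base_url), timeout=10) as resp:
        return resp.status == 200
```

Validating the file with `promtool check config` before calling the endpoint avoids sending a reload that would be rejected.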
- template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml
    validate: /usr/bin/promtool check config %s
Also: check rules, check web-config.
Ansible
@roidelapluie
Plenty of situations do not require a reload of Prometheus:
• Password files
• TLS certificates
Prometheus will read them before use, no reload needed!
Not reloading Prometheus
@roidelapluie
HashiCorp Vault enables retrieving temporary secrets and writing them to a file.
./vault agent -config vault-agent.hcl
Using vault
@roidelapluie https://inuits.eu/blog/prometheus-consul-vault-228/
scrape_configs:
  - job_name: consul-services
    consul_sd_configs:
      - server: localhost:8500
        authorization:
          credentials_file: consul_token
Reading token from vault
@roidelapluie
Prometheus offers native TLS and basic auth.
tls_server_config:
  cert_file: server.crt
  key_file: server.key
basic_auth_users:
  alice: $2y$10$mDwo.lAisC94iLAyP8
  bob: $2y$10$hLqFl9jSjoAAy95Z/zw8
TLS and basic auth
@roidelapluie
The "web-config" file is read on every request:
• No need to reload
• Instantly change passwords, cert files
Shared config format between Prometheus and exporters!
TLS and basic auth
@roidelapluie
Prometheus has a snapshot API.
Enable with --web.enable-admin-api
curl -d{} http://localhost:9090/api/v1/admin/tsdb/snapshot
Prometheus TSDB is made of immutable blocks. Snapshots use hard links.
Backups
@roidelapluie
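The snapshot call can be scripted as part of a backup job. The sketch below assumes the admin API is enabled and that the API answers with a JSON body carrying the snapshot name, which is then found under `snapshots/` in the data directory. The data directory `/var/lib/prometheus` and the function names are assumptions for illustration.

```python
import json
import urllib.request
from pathlib import Path


def snapshot_path(response_body, data_dir):
    """Turn the snapshot API response into the snapshot directory path."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return Path(data_dir) / "snapshots" / payload["data"]["name"]


def take_snapshot(base_url="http://localhost:9090", data_dir="/var/lib/prometheus"):
    """POST to the snapshot endpoint and return where the snapshot was written."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/admin/tsdb/snapshot", data=b"", method="POST"
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return snapshot_path(resp.read(), data_dir)
```

The returned directory can then be copied away by whatever backup tool is already in place.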
Service Discovery
@roidelapluie
• Easier to know what's down with Pull
• Easy debugging (curl)
• Easier to spread the load
• Central configuration point
• High Availability
Prometheus pull model
@roidelapluie
• Prometheus must know what to pull
• Source of Truth
• Service Discovery != Auto discovery
• Event based when possible
Service Discovery
@roidelapluie
• Kubernetes
• Consul
• Cloud providers (Azure, AWS, GCP, DigitalOcean, Hetzner, Scaleway, Linode)
• Docker & Docker Swarm
• And more! 20+ external SD in total.
Sources of Truth
@roidelapluie
• Static SD: into Prometheus main config
• File SD: Files on disk
• HTTP SD: HTTP endpoints
Generic Service Discovery
@roidelapluie https://inuits.eu/blog/prometheus-http-service-discovery/
[
  {
    "targets": ["10.0.10.2:9100"],
    "labels": {
      "__meta_datacenter": "london"
    }
  }
]
Generic SD format (file & http SD)
@roidelapluie
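One way to produce this format from your own source of truth is sketched below. The inventory, the port and both function names are hypothetical. The file is written to a temporary name and then renamed, so Prometheus should never read a half-written file.

```python
import json
import os
import tempfile


def to_sd_groups(inventory, port=9100):
    """Group (host, datacenter) pairs into file/HTTP SD target groups."""
    groups = {}
    for host, datacenter in inventory:
        groups.setdefault(datacenter, []).append(f"{host}:{port}")
    return [
        {"targets": sorted(targets), "labels": {"__meta_datacenter": dc}}
        for dc, targets in sorted(groups.items())
    ]


def write_file_sd(path, groups):
    """Write the target groups atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; make it readable by Prometheus
    os.replace(tmp, path)


# Hypothetical inventory: (host, datacenter)
inventory = [("10.0.10.2", "london"), ("10.0.10.3", "london"), ("10.0.20.2", "paris")]
print(json.dumps(to_sd_groups(inventory), indent=2))
```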
• Both integrate your own SD systems into Prometheus
• File SD is event based (inotify)
• HTTP SD can be integrated in your apps
File SD vs HTTP SD
@roidelapluie
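Integrating HTTP SD into an app can be as small as one endpoint that returns the target groups as JSON. The sketch below uses only the Python standard library. The handler name, the static `TARGET_GROUPS` list and the port are assumptions, and a real app would build the list from its own data.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical static answer; a real app would compute this per request.
TARGET_GROUPS = [
    {"targets": ["10.0.10.2:9100"], "labels": {"__meta_datacenter": "london"}},
]


class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # HTTP SD expects a 200 response with a JSON list of target groups.
        body = json.dumps(TARGET_GROUPS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Create the SD server on localhost; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), SDHandler)


# To run it standalone: make_server(8000).serve_forever()
```

Prometheus would then be pointed at the endpoint with an `http_sd_configs` entry whose `url` is this server's address.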
Labels can be used to configure targets.
• __address__: 127.0.0.1:9090
• __metrics_path__: /metrics
• __scheme__: http or https
• __param_<name>: http parameter
• __scrape_interval__, __scrape_timeout__: 1m
Labels
@roidelapluie
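The sketch below shows how these special labels could combine into the URL that gets scraped. It is an illustration of the idea, not Prometheus's actual code. The function name is hypothetical, and the defaults (`http`, `/metrics`) are the usual ones.

```python
from urllib.parse import urlencode


def scrape_url(labels):
    """Build the scrape URL a target's special labels describe."""
    scheme = labels.get("__scheme__", "http")
    path = labels.get("__metrics_path__", "/metrics")
    # Every __param_<name> label becomes the HTTP query parameter <name>.
    params = {
        key[len("__param_"):]: value
        for key, value in labels.items()
        if key.startswith("__param_")
    }
    query = f"?{urlencode(params)}" if params else ""
    return f"{scheme}://{labels['__address__']}{path}{query}"


print(scrape_url({"__address__": "127.0.0.1:9090"}))
print(scrape_url({
    "__address__": "127.0.0.1:9115",
    "__metrics_path__": "/probe",
    "__param_target": "http://prometheus.io",
    "__param_module": "http_2xx",
}))
```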
Additionally, extra labels are added by SD.
• __meta_kubernetes_pod_label_app
• __meta_digitalocean_region
• __meta_linode_public_ipv6
• __meta_scaleway_instance_status
Meta labels
@roidelapluie https://prometheus.io/docs/prometheus/latest/configuration/configuration/
A fundamental principle in Prometheus.
Transform input labels into a new set of labels.
Relabeling
@roidelapluie
• Rename, merge, replace labels
• Conditionally drop label sets
• Only keep matching label sets
Relabeling actions
@roidelapluie
• Get lots of labels as input
• Turns them into targets
• Remove labels prefixed with __
• Can use "special labels"
Target relabeling
@roidelapluie
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://prometheus.io
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
Target relabeling example
@roidelapluie
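To follow what the three rules above do, here is a small simulation of the default `replace` action. It is a simplified sketch and not Prometheus's implementation: it only covers `replace`, and it ignores details such as dropping labels whose new value is empty.

```python
import re


def relabel(labels, rules):
    """Apply 'replace' relabel rules (the default action) in order."""
    labels = dict(labels)
    for rule in rules:
        sources = rule.get("source_labels", [])
        separator = rule.get("separator", ";")
        # Source label values are joined with the separator, then matched.
        value = separator.join(labels.get(name, "") for name in sources)
        match = re.fullmatch(rule.get("regex", "(.*)"), value, flags=re.DOTALL)
        if match is None:
            continue  # no match: the rule leaves the labels untouched
        # Prometheus writes capture groups as $1; Python's expand() wants \1.
        template = re.sub(r"\$(\d+)", r"\\\1", rule.get("replacement", "$1"))
        labels[rule["target_label"]] = match.expand(template)
    return labels


rules = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9115"},
]
result = relabel({"__address__": "http://prometheus.io"}, rules)
print(result)
```

The probed URL ends up as the `target` query parameter and as the `instance` label, while the scrape itself goes to the blackbox exporter at `127.0.0.1:9115`.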
https://relabeler.promlabs.com/
Relabeler
@roidelapluie
puppetdb_sd_configs:
  - url: http://127.0.0.1:8080/
    query: 'resources { type = "Apache::Vhost" }'
    include_parameters: true
relabel_configs:
  - source_labels: [__meta_puppetdb_parameter_servername]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115
Relabel from SD
@roidelapluie
Recap
@roidelapluie
• Simple to deploy
• Behaves as you expect
• Easy to reload
Prometheus is automation friendly
@roidelapluie
• Easy password/cert rotation
• Service discovery to keep up with infrastructure changes
Prometheus is change friendly
@roidelapluie
https://prometheus.io/community
Prometheus is open, join us!
@roidelapluie
Julien Pivotto
@roidelapluie
roidelapluie@inuits.eu
Essensteenweg 31
2930 Brasschaat
Belgium
Contact:
info@inuits.eu
+32-3-8082105
