The hitchhiker’s guide to
Remco Overdijk
"A Metric, The Hitchhiker's Guide to Prometheus says, is
about the most massively useful thing someone doing
Monitoring can have. It has great practical value. You can
wave your Metric in emergencies as a distress signal, and
produce pretty Graphs at the same time."
1. The Landscape
What are we running and why?
2. Core Concepts
How does Prometheus work?
3. Demo Time!
It’s a Tools in Action talk after all, right?
4. Tips & Tricks
Getting the most out of your Prometheus Experience
5. Questions?
I’m probably going to answer “42” to most of them..
So many things to tell, so little time..
The Hitchhiker’s Guide to Prometheus
• Started out in TES, doing Metrics, Monitoring & Logging.
(Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. )
• Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud.
That requires a lot of monitoring…
• Member of the Cloud9 MML Circle, doing Prometheus
• Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources
within Cloud9
• Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring
production systems.
• NightOwl for SRT Platform; I know how pagers work.
Who are you, and why are you telling us this?
Introduction
The Landscape
What are we running?
Data Center VS Cloud
VMs and servers VS containers in Kubernetes

                Data Center                       Cloud
Monitoring      Nagios + Thruk + Lookingglass     Prometheus
Metrics         Graphite + Statsd                 Prometheus (+ InfluxDB/Thanos)
Alerting        SMS modems in physical servers    AlertManager, Iris, OnCall, Grafana
Visualization   Grafana                           Grafana
Logging         ElasticSearch + Kibana            StackDriver, ElasticSearch + Kibana
•Applications in Kubernetes are much more dynamic than we’re used to.
• No Static IP addresses.
• No static number of servers (well, pods actually..)
• Kubernetes can reschedule / relocate pods at will.
• Prometheus uses Service Discovery to find targets
•Both Nagios and Graphite have scaling issues and are too rigid.
• Prometheus is Pull instead of Push based and doesn’t require execution for every single check
• Combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
•Being inspired by Google's Borgmon, it works out of the box with a lot of Kubernetes /
Cloud native components and the services supporting them.
•StackDriver is not a full fledged alternative due to features, retention and cost.
Why didn’t you come up with something else?
So, why Prometheus?
•Out of the box, Prometheus also doesn’t scale endlessly without compromises
(But Thanos will)
•Scalability is solved through retention, manual sharding and vertical scaling,
which all have clear drawbacks.
•HA is solved through duplication (Polling twice from independent instances
with individual TSDB’s).
•Prometheus development is very focused, which shows in certain aspects.
Well.. No.
Is this the answer to everything then?
All the pods & services
Infrastructure Overview
[Diagram: Kubernetes {DEV, STG, PRO} Clusters and Datacenters, with Prometheus, AlertManager, Grafana, PushGateway, IRIS, OnCall, SMS / Call Provider, HipChat, Operator, Remote Storage Adapter, InfluxDB, YOUR App!, Kubernetes and Exporters]
Core Concepts
How does it work and what makes it tick?
- Counters
- A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only
increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
- Gauges
- A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
(1, 4, 2, 5, 8)
- Histograms
- A histogram samples observations (usually things like request durations or response sizes) and counts them in
configurable buckets. It also provides a sum of all observed values.
- Summaries
- Similar to a histogram, a summary samples observations (usually things like request durations and response
sizes). While it also provides a total count of observations and a sum of all observed values, it calculates
configurable quantiles over a sliding time window.
- Quantiles are convenient when (for example) expressing the median (0.5-quantile) and the 95th percentile (0.95-quantile).
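To make the histogram type more concrete, here is a minimal sketch in plain Python. It is not the Prometheus client library; the class name `Histogram` and the bucket bounds are assumptions chosen for illustration. It shows how observations land in buckets and how the exposed buckets are cumulative, with a running sum and count next to them.

```python
import bisect
import math


class Histogram:
    """Toy histogram with cumulative buckets (illustration only)."""

    def __init__(self, upper_bounds):
        # Upper bounds are sorted; +Inf catches everything larger.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        # Each "less than or equal" bucket includes all smaller buckets.
        running, result = 0, []
        for bound, n in zip(self.bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


hist = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds in seconds
for seconds in (0.05, 0.3, 0.3, 0.7, 2.5):
    hist.observe(seconds)
print(hist.cumulative(), hist.count, hist.total)
```

A summary would instead keep enough state to report quantiles directly, which is why its quantiles cannot be re-aggregated across instances the way buckets can.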
Supported Types
Making Metrics
- Instead of creating separate checks for every metric that should be monitored for your
application, you expose a single (or multiple..) HTTP Endpoint containing all metrics.
- It’s your responsibility to make this endpoint Available, Fast and Reliable.
- Multiple frameworks and libraries can help you provision and maintain such an
endpoint.
- Axle comes with built-in support for MicroMeter, which does everything for you.
- Backspin support is coming soon™.
- Example: http://localhost:30000/metrics
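As a rough illustration of what such an endpoint returns, the sketch below renders a few metrics in the text format that Prometheus scrapes. The function name `render_metrics`, the tuple layout and the metric names are assumptions made for this example; in practice a client library or framework would generate this page for you.

```python
def render_metrics(metrics):
    """Render (name, type, help_text, value) tuples as a metrics page.

    The tuple layout is an assumption made for this sketch.
    """
    lines = []
    for name, metric_type, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application metrics.
page = render_metrics([
    ("app_requests_total", "counter", "Total number of handled requests.", 1027),
    ("app_queue_length", "gauge", "Current number of queued jobs.", 3),
])
print(page, end="")
```

Serving this string over HTTP on a path such as /metrics is all that a scrape needs.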
The concept of Scraping HTTP Metric Endpoints
Exposing Metrics: Push VS Pull
# HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
# TYPE prometheus_tsdb_head_min_time gauge
prometheus_tsdb_head_min_time 1.5282792e+12
# HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 2.9485092e+07
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 19956
# HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
# TYPE prometheus_tsdb_head_series_created_total counter
prometheus_tsdb_head_series_created_total 56888
- An actual Query Language that looks a lot more like SQL than Graphite.
- You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for
monitoring and long term metrics.
- Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
- Supports functions, operators, regex, arithmetic and expressions.
- Four expression types are supported:
- Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"})
- Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp
(instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time
series that have this metric name.
- Range Vectors (like http_requests_total{job="prometheus"}[5m] )
- Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant.
Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time
values should be fetched for each resulting range vector element.
- Scalars
- Strings
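The way an instant vector selector picks series can be sketched in a few lines of Python. This is only a model of the matching rules, not how Prometheus implements them: the function name `matches` and the sample series are assumptions, and Python's `re` syntax differs in details from the RE2 syntax that PromQL uses. What it does show is that regex matchers are fully anchored and that all matchers must hold at once.

```python
import re


def matches(labels, matchers):
    """Return True if the labels satisfy every (label, op, value) matcher."""
    for label, op, value in matchers:
        actual = labels.get(label, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator {op!r}")
        if not ok:
            return False
    return True


# Hypothetical series, identified by their label sets.
series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
]
# http_requests_total{environment=~"staging|testing|development", method!="GET"}
selector = [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
selected = [s for s in series if matches(s, selector)]
print(selected)
```

A range vector uses the same selection and then returns every sample in the window for each selected series, instead of one sample per series.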
PromQL
Querying Metrics
- Custom Resource Type provided by Prometheus-operator
- Abstraction of Prometheus “job” and Service Discovery
- Allows for easy ingestion of new endpoints through their k8s service
- Example:
ServiceMonitors
Getting your endpoint monitored
[Diagram: YOUR App!, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: node-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
spec:
  ports:
  - name: https
    port: 9100
    protocol: TCP
    targetPort: https
  selector:
    app: node-exporter
  type: ClusterIP
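The link between the two objects is label matching: the ServiceMonitor's `selector.matchLabels` is compared against the labels in the Service's metadata, while the Service's own `selector` is what picks the pods. A small sketch of that matching rule follows; the function name `selects` is an assumption for illustration, and it ignores other selector features such as match expressions.

```python
def selects(match_labels, object_labels):
    """True if every key/value pair in matchLabels is present on the object."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# Values taken from the example manifests above.
service_metadata_labels = {"k8s-app": "node-exporter"}
service_pod_selector = {"app": "node-exporter"}
monitor_match_labels = {"k8s-app": "node-exporter"}

# The ServiceMonitor finds the Service through the Service's metadata labels.
print(selects(monitor_match_labels, service_metadata_labels))

# The Service's pod selector uses a different label key, so comparing the
# ServiceMonitor selector against it would not match.
print(selects(monitor_match_labels, service_pod_selector))
```

If a new endpoint does not show up as a target, a mismatch between these two label sets is a common first thing to check.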
- The same tool you were probably already using.
- The central interface for cloud insights
- Contains a specialized query editor for Prometheus data sources.
- Prometheus currently doesn’t store metrics older than one month for performance reasons.
- Multiple solutions for long term metrics exist, but it’s a work in progress.
Dashboarding with Grafana
Creating Insights
[Diagram: Prometheus, Grafana, HipChat, Remote Storage Adapter, InfluxDB]
Trouble in Paradise
Creating Alerts, choosing your weapon
WARNINGS – Notifications During workhours
- No direct intervention is required
- Usually picked up by members of the team
developing / maintaining a system.
- Alert delivery is NOT guaranteed.
Use Grafana with HipChat or Email alerts
CRITICALS – 24x7 Text Messages with Escalation
- Actionable events that require immediate attention
by an Engineer on Duty, who does not necessarily
have intimate knowledge of your system.
- Response is required to silence/end the alert.
- Provisioned through RuleList (R2D2 / Operator)
Use AlertManager / Iris / Oncall
Yes, It’s PromQL as well!
Alert Basics
%YAML 1.1
---
kind: PrometheusAlertRule
data:
  test.rules: |
    groups:
    - name: Load
      interval: 30s
      rules:
      - alert: HighLoad
        expr: rate(web_http_responses_total[1m]) > 1
        for: 1m
        labels:
          severity: attention
        annotations:
          description: The rate of HTTP requests is too high.
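The `for: 1m` clause means the expression has to stay true for a full minute before the alert fires; until then it is only pending. The sketch below models that behaviour for a single series. The function name `alert_states` and the sample values are assumptions, and the real rule evaluator tracks this per label set rather than for one list of values.

```python
def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `value > threshold`."""
    states = []
    active_for = None  # seconds the condition has been continuously true
    for value in samples:
        if value > threshold:
            active_for = 0 if active_for is None else active_for + interval_seconds
            states.append("firing" if active_for >= for_seconds else "pending")
        else:
            # Any false evaluation resets the clock.
            active_for = None
            states.append("inactive")
    return states


# Hypothetical request rates, evaluated every 30s as in the rule group above.
states = alert_states(
    [0.5, 1.2, 1.4, 1.6, 0.8, 1.3],
    threshold=1,
    for_seconds=60,
    interval_seconds=30,
)
print(states)
```

A short spike that drops back below the threshold before the minute is over never leaves the pending state, which helps keep flapping values from paging anyone.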
- Alerts should be actionable: Somebody has to do something, now.
- They should be simple: Someone without intimate knowledge of the system should ideally be
able to solve the alert.
- They should be urgent and require human intervention: No point in waking someone up if they
shouldn’t have to do something, or when tomorrow afternoon would be soon enough.
- Provide accurate descriptions and a playbook where possible.
- Basic system monitoring should be based on SLI/SLO’s rather than infra metrics.
- Prefer AM/Iris/OnCall if you’re serious about your alert.
Creating the perfect alert
Alert Perfection
[Diagram: Prometheus, AlertManager, Grafana, IRIS, OnCall, SMS / Call Provider, HipChat]
• A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
• A number of these come preconfigured with our Kubernetes clusters and provide additional metrics
When artisanal endpoints don’t cut the cake
Exporters - Additional sources of metrics
Databases
Aerospike exporter
ClickHouse exporter
Consul exporter (official)
CouchDB exporter
ElasticSearch exporter
Memcached exporter (official)
MongoDB exporter
MSSQL server exporter
MySQL server exporter (official)
OpenTSDB Exporter
Oracle DB Exporter
PgBouncer exporter
PostgreSQL exporter
ProxySQL exporter
RavenDB exporter
Redis exporter
RethinkDB exporter
SQL exporter
Tarantool metric library
Hardware related
apcupsd exporter
Collins exporter
IoT Edison exporter
IPMI exporter
knxd exporter
Node/system metrics exporter (official)
Ubiquiti UniFi exporter
Messaging systems
Beanstalkd exporter
Gearman exporter
Kafka exporter
NATS exporter
NSQ exporter
Mirth Connect exporter
MQTT blackbox exporter
RabbitMQ exporter
RabbitMQ Management Plugin exporter
Storage
Ceph exporter
Ceph RADOSGW exporter
Gluster exporter
Hadoop HDFS FSImage exporter
Lustre exporter
ScaleIO exporter
HTTP
Apache exporter
HAProxy exporter (official)
Nginx metric library
Nginx VTS exporter
Passenger exporter
Tinyproxy exporter
Varnish exporter
WebDriver exporter
APIs
AWS ECS exporter
AWS Health exporter
AWS SQS exporter
Cloudflare exporter
DigitalOcean exporter
Docker Cloud exporter
Docker Hub exporter
GitHub exporter
InstaClustr exporter
Mozilla Observatory exporter
OpenWeatherMap exporter
Pagespeed exporter
Rancher exporter
Speedtest exporter
Logging
Fluentd exporter
Google's mtail log data extractor
Grok exporter
Other monitoring systems
Akamai Cloudmonitor exporter
AWS CloudWatch exporter (official)
Cloud Foundry Firehose exporter
Collectd exporter (official)
Google Stackdriver exporter
Graphite exporter (official)
Heka dashboard exporter
Heka exporter
InfluxDB exporter (official)
JavaMelody exporter
JMX exporter (official)
Munin exporter
Nagios / Naemon exporter
New Relic exporter
NRPE exporter
Osquery exporter
Pingdom exporter
scollector exporter
Sensu exporter
SNMP exporter (official)
StatsD exporter (official)
Miscellaneous
Bamboo exporter
BIG-IP exporter
BIND exporter
Bitbucket exporter
Blackbox exporter (official)
BOSH exporter
cAdvisor
Confluence exporter
Dovecot exporter
eBPF exporter
Jenkins exporter
JIRA exporter
Kannel exporter
Kemp LoadBalancer exporter
Meteor JS web framework exporter
Minecraft exporter module
PHP-FPM exporter
PowerDNS exporter
Process exporter
rTorrent exporter
SABnzbd exporter
Script exporter
Shield exporter
SMTP/Maildir MDA blackbox prober
SoftEther exporter
Transmission exporter
Unbound exporter
Xen exporter
• StackDriver Exporter – Get your GCP Project’s native metrics into Prometheus.
• Blackbox Exporter – Monitor Golden Signals on any system, without knowledge of its inner workings
• Nginx exporter – used in Ingresses
• SNMP Exporter – Bring your own MIB’s.
• Statsd Exporter – Push your statsd metrics to a sidecar container
• Node Exporter – Provides system metrics for VM and Physical systems (like kubernetes nodes)
• cAdvisor – Get generic container metrics
• Etcd
• Kubernetes
• Minio (Gitlab Runner Caching)
The most commonly used
Exporters - Highlights
[Diagram: Exporter, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
• For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
• Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
• Allows you to push metrics (through an HTTP call) to a buffering service, which in turn exposes them to
Prometheus.
• Metrics will live forever on the Gateway, so be careful of what you push and how you name them.
• Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if
and when possible.
• PRO-Tip: If you have an ephemeral job, also push the timestamp of last successful job completion.
The Push Gateway
Metrics for ephemeral jobs
[Diagram: YOUR App!, Push Gateway, Prometheus]
echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
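The push URL encodes the job name and the grouping labels as path segments, which is how the labels in the resulting series come about. A minimal Python sketch of building that URL is shown here. The helper names `pushgateway_url` and `push` are assumptions, label values containing a slash are not handled, and `push` is defined but not called because it needs a reachable gateway.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def pushgateway_url(base, job, grouping):
    """Build a push URL of the form /metrics/job/<job>/<label>/<value>/..."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in grouping.items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def push(url, body):
    """POST a text-format body to the gateway (not called in this sketch)."""
    request = Request(url, data=body.encode("utf-8"), method="POST")
    with urlopen(request, timeout=5) as response:
        return response.status


url = pushgateway_url(
    "http://gateway:9091",
    "magrathea",
    {"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
)
body = "ultimate_answer 42.0\n"
print(url)
```

Because every distinct combination of job and grouping labels stays on the gateway until it is deleted, it pays to keep these values stable across runs of the same job.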
Demo Time!
• Kubernetes Running on Docker for macOS.
• Out of the box Prometheus on Kubernetes from
https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
• Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
• We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match
it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and
creating an alert for it.
• Prometheus: http://localhost:30000/graph
• AlertManager: http://localhost:31000/#/alerts
• Grafana: http://localhost:32000/d/9dP_FHImz/pods
Getting started in 5 minutes
Today’s Quick Demo
Tips & Tricks
Getting the most out of your Prometheus Experience
• Metrics in Prometheus are multidimensional: they consist of names and labels.
• Names are generic identifiers to tell WHAT you are measuring, in what format.
• Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit. (bytes, seconds,
meters)
• Labels describe characteristics, and are usually used to identify WHERE those metrics are coming from,
and can be multi faceted.
• Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure
label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames,
internet IP addresses, hashes).
• Read https://prometheus.io/docs/practices/naming/ before you start making your own!
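A quick back-of-the-envelope calculation shows why label cardinality matters. In the worst case, the number of time series for one metric name is the product of the number of distinct values per label. The function name `series_count` and the label values below are assumptions for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric name."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a handful of series.
bounded = series_count({
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
})

# One unbounded label (usernames) multiplies everything else.
unbounded = series_count({
    "method": ["GET", "POST", "DELETE"],
    "username": [f"user{i}" for i in range(100_000)],
})
print(bounded, unbounded)
```

The second metric would need up to 300,000 series on its own, and that count is multiplied again by every instance that exposes it.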
Keep things running smoothly by not making a mess.
Metric Naming
api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
api_request_duration_seconds { stage="extract|transform|load" }
api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
•An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of
the level of service that is provided.
•An SLO is a service level objective: a target value or range of values for a service level that is
measured by an SLI. A natural structure for SLOs is thus
[SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
•Symptoms vs Causes: Monitor things that users will notice when using your system.
•Latency - The time it takes to service a request.
•Traffic - A measure of how much demand is being placed on your system, measured in a
high-level system-specific metric. For a web service, this measurement is usually HTTP
requests per second.
•Errors - The rate of requests that fail (like HTTP 500’s)
•Saturation - How "full" your service is. A measure of your system fraction, emphasizing the
resources that are most constrained.
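The SLI and SLO definitions above translate directly into a small calculation. The sketch below uses availability as the SLI and an SLO of the form [lower bound ≤ SLI]. The function names, the request counts and the 99.9% target are assumptions picked for the example.

```python
def availability_sli(total_requests, failed_requests):
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic: treated as meeting the objective (an assumption)
    return (total_requests - failed_requests) / total_requests


def meets_slo(sli, lower_bound):
    """SLO of the form [lower bound <= SLI]."""
    return sli >= lower_bound


# Hypothetical numbers for one measurement window.
sli = availability_sli(total_requests=10_000, failed_requests=12)
ok = meets_slo(sli, lower_bound=0.999)
print(sli, ok)
```

With 12 failures in 10,000 requests the SLI is 0.9988, which misses a 99.9% objective even though nothing looks dramatic on an infrastructure graph.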
What should you be monitoring?
The Golden Signals
•BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors)
•Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors)
•Your own application’s Metrics for insights, details and under-the-hood view.
Combining Metric Sources for an unbiased view
Bringing it all together
[Diagram: Your App, Blackbox Exporter, Ingress; Prometheus scrapes Poll Metrics, Ingress Metrics and App Metrics]
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx] # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://myapp.behindingress.io # Target to probe with http
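What Prometheus ends up requesting from the Blackbox Exporter is a probe URL that carries the module and the target as query parameters. The sketch below builds such a URL in Python. The exporter address `http://blackbox-exporter:9115` and the function name `probe_url` are assumptions, and in a typical setup relabeling rules (not shown on the slide) move each listed target into the `target` parameter and point the scrape at the exporter itself.

```python
from urllib.parse import urlencode


def probe_url(exporter, module, target):
    """Build the /probe URL for one target and module."""
    query = urlencode({"module": module, "target": target})
    return f"{exporter.rstrip('/')}/probe?{query}"


url = probe_url(
    "http://blackbox-exporter:9115",  # hypothetical exporter address
    "http_2xx",
    "http://myapp.behindingress.io",
)
print(url)
```

Opening a URL like this in a browser or with curl is a convenient way to see which probe metrics come back for a target before wiring it into a scrape job.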
Prometheus scrape
•Introducing the GenericServiceMonitor and DCServiceMonitor
•These types allow you to define endpoints outside of Kubernetes, and allow
you to monitor on-premise services.
•DCServiceMonitor works based on bol_applications and as such is bol.com
specific:
•GenericServiceMonitor works on static endpoints
My stuff runs in the DC and I want to keep it there.
So what about non-Cloud resources?
kind: Prometheus/DCServiceMonitor
name: tst-sdd-app
spec:
  port: 8080
  path: /internal/metrics

kind: Prometheus/GenericServiceMonitor
name: dev-atscale-app
spec:
  hosts:
  - ip: 1.2.3.4
    hostname: some.host.name
  port: 8080
  path: /internal/metrics
  opex: srt-bificsps
•Always initialize your metrics at zero when possible, or you won’t know the significance of the
first value.
•How do you know if your application is OK when the metrics stopped working? The up metric
might also disappear when Service Discovery no longer detects your service. Always use
absent() to check for existence of up!
•(i)rate()/increase() then sum(), not sum() then (i)rate()/increase(), since those
are the only safe functions to deal with resets.
•The rate function takes a time series over a time range and, based on the first and last data
points within that range, calculates a per-second average rate of increase (http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1 )
•By contrast, irate is an instant rate. It only looks at the last two points within the
range passed to it and calculates a per-second rate.
•To complement the saturation signal; Prometheus has predict_linear() for Gauges.
•All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
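The advice to apply rate() before sum() is easier to see with numbers. The sketch below is a simplified model in Python: it handles counter resets the way the tips describe, but leaves out the extrapolation to the range boundaries that the real rate() performs. The function names mirror PromQL for readability, and the two pods and their samples are made up for the example.

```python
def increase(samples):
    """Total increase of a counter over (timestamp, value) samples, reset-aware."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += current if current < previous else current - previous
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    return increase(samples) / (samples[-1][0] - samples[0][0])


def irate(samples):
    """Per-second increase based on the last two samples only."""
    return rate(samples[-2:])


pod_a = [(0, 100.0), (30, 130.0), (60, 10.0), (90, 40.0)]  # restarted before t=60
pod_b = [(0, 50.0), (30, 80.0), (60, 110.0), (90, 140.0)]

# rate() per series, then sum: the reset is seen and handled per series.
rate_then_sum = rate(pod_a) + rate(pod_b)

# sum first, then rate(): the drop in the summed series is misread as a
# reset of the whole sum, which inflates the result.
summed = [(t, a + b) for (t, a), (_, b) in zip(pod_a, pod_b)]
sum_then_rate = rate(summed)

print(rate_then_sum, sum_then_rate, irate(pod_a))
```

Here the per-series approach gives roughly 1.78 requests per second, while summing first reports about 2.67 for the same traffic.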
Things you’ll encounter once you start making queries
Other tips
Questions?
Don’t bother to ask me the Ultimate Question of Life, the
Universe and Everything, because you already know the answer.
(and yes, I know where my towel is.)
Remco Overdijk
roverdijk@bol.com
So Long!
And thanks for all the fish.

More Related Content

What's hot

Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusMarco Pas
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMartin Etmajer
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with PrometheusShiao-An Yuan
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenParis Container Day
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetesRishabh Indoria
 
CI/CD Tools Universe: The Ultimate List
CI/CD Tools Universe: The Ultimate ListCI/CD Tools Universe: The Ultimate List
CI/CD Tools Universe: The Ultimate ListPlutora
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법Open Source Consulting
 
Kubernetes Networking
Kubernetes NetworkingKubernetes Networking
Kubernetes NetworkingCJ Cullen
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaYoungHeon (Roy) Kim
 
OpenTelemetry For Developers
OpenTelemetry For DevelopersOpenTelemetry For Developers
OpenTelemetry For DevelopersKevin Brockhoff
 
Kubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideKubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideBytemark
 

What's hot (20)

Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on Kubernetes
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and Grafana
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
CI/CD Tools Universe: The Ultimate List
CI/CD Tools Universe: The Ultimate ListCI/CD Tools Universe: The Ultimate List
CI/CD Tools Universe: The Ultimate List
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
Kubernetes Networking
Kubernetes NetworkingKubernetes Networking
Kubernetes Networking
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & Grafana
 
OpenTelemetry For Developers
OpenTelemetry For DevelopersOpenTelemetry For Developers
OpenTelemetry For Developers
 
Kubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideKubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory Guide
 

Similar to Prometheus monitoring

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Brian Brazil
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scaleAdam Hamsik
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scaleJuraj Hantak
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Distributed tracing 101
Distributed tracing 101Distributed tracing 101
Distributed tracing 101Itiel Shwartz
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of PrometheusGeorge Luong
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusWeaveworks
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 

Similar to Prometheus monitoring (20)

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Prometheus monitoring

  • 1. The Hitchhiker’s Guide to Prometheus Remco Overdijk 1 "A Metric, The Hitchhiker's Guide to Prometheus says, is about the most massively useful thing someone doing Monitoring can have. It has great practical value. You can wave your Metric in emergencies as a distress signal, and produce pretty Graphs at the same time."
  • 2. 1. The Landscape What are we running and why? 2. Core Concepts How does Prometheus work? 3. Demo Time! It’s a Tools in Action talk after all, right? 4. Tips & Tricks Getting the most out of your Prometheus Experience 5. Questions? I’m probably going to answer “42” to most of them.. So many things to tell, so little time.. 2 The Hitchhiker’s Guide to Prometheus
  • 3. • Started out in TES, doing Metrics, Monitoring & Logging. (Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. ) • Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud. That requires a lot of monitoring… • Member of the Cloud9 MML Circle, doing Prometheus • Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources within Cloud9 • Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring production systems. • NightOwl for SRT Platform; I know how pagers work. Who are you, and why are you telling us this? 3 Introduction
  • 5. Data Center VS Cloud VM’s and Servers VS containers in Kubernetes 5 Monitoring Prometheus Metrics Prometheus (+ InfluxDB/Thanos) Alerting AlertManager, Iris, OnCall, Grafana Visualization Grafana Logging StackDriver, ElasticSearch + Kibana Monitoring Nagios + Thruk + Lookingglass Metrics Graphite + Statsd Alerting SMS modems in physical servers Visualization Grafana Logging ElasticSearch + Kibana
  • 6. •Applications in Kubernetes are much more dynamic than we’re used to. • No static IP addresses. • No static number of servers (well, pods actually..) • Kubernetes can reschedule / relocate pods at will. • Prometheus uses Service Discovery to find targets •Both Nagios and Graphite have scaling issues and are too rigid. • Prometheus is Pull instead of Push based and doesn’t require execution for every single check • Combines Metrics & Monitoring into a single stack, but focuses on Monitoring. •Being based on BorgMon, it works out of the box with a lot of Kubernetes / Cloud native components and the services supporting them. •StackDriver is not a full-fledged alternative due to features, retention and cost. Why didn’t you come up with something else? 6 So, why Prometheus?
  • 7. •Out of the box, Prometheus also doesn’t scale endlessly without compromises (But Thanos will) •Scalability is solved through retention, manual sharding and vertical scaling, which all have clear drawbacks. •HA is solved through duplication (Polling twice from independent instances with individual TSDB’s). •Prometheus development is very focused, which shows in certain aspects. Well.. No. 7 Is this the answer to everything then?
  • 8. All the pods & services 8 Infrastructure Overview Kubernetes {DEV, STG, PRO} Clusters Datacenters Prometheus Prometheus AlertManager AlertManager AlertManager Grafana PushGateway IRIS OnCall SMS / Call Provider HipChat Operator Remote Storage Adapter InfluxDB YOUR App! Kubernetes Exporters
  • 9. Core Concepts How does it work and what makes it tick?
  • 10. - Counters - A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7) - Gauges - A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. (1, 4, 2, 5, 8) - Histograms - A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values. - Summaries - Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window. - Quantiles are convenient when (for example) expressing the median (0.5-quantile) and the 95th percentile (0.95-quantile). Supported Types 10 Making Metrics
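The histogram description above can be sketched in a few lines of plain Python. This is an illustration only, not how any client library is implemented: the class name `SketchHistogram` and the bucket bounds are assumptions, and real libraries also handle labels, concurrency and exposition.

```python
import bisect
import math


class SketchHistogram:
    """Illustrative only: per-bucket counts plus a running sum and count."""

    def __init__(self, upper_bounds):
        # Sorted bucket upper bounds, with a final +Inf bucket that catches everything.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value ("less or equal" buckets).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        # Exposed buckets are cumulative: each one includes all smaller buckets.
        running, result = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


hist = SketchHistogram([0.1, 0.5, 1.0])  # hypothetical request-duration buckets, in seconds
for seconds in (0.05, 0.3, 0.3, 0.7, 2.0):
    hist.observe(seconds)
print(hist.cumulative())
```

Because only bucket counts, a sum and a count are stored, quantiles from a histogram are estimated at query time, whereas a summary calculates them in the application.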
  • 11. - Instead of creating separate checks for every metric that should be monitored for your application, you expose a single (or multiple..) HTTP Endpoint containing all metrics. - It’s your responsibility to make this endpoint Available, Fast and Reliable. - Multiple Frameworks and Libraries can help you provision and maintain such an endpoint. - Axle comes with built-in support for Micrometer, which does everything for you. - Backspin support is coming soon™. - Example: http://localhost:30000/metrics The concept of Scraping HTTP Metric Endpoints 11 Exposing Metrics: Push VS Pull
      # HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
      # TYPE prometheus_tsdb_head_min_time gauge
      prometheus_tsdb_head_min_time 1.5282792e+12
      # HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
      # TYPE prometheus_tsdb_head_samples_appended_total counter
      prometheus_tsdb_head_samples_appended_total 2.9485092e+07
      # HELP prometheus_tsdb_head_series Total number of series in the head block.
      # TYPE prometheus_tsdb_head_series gauge
      prometheus_tsdb_head_series 19956
      # HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
      # TYPE prometheus_tsdb_head_series_created_total gauge
      prometheus_tsdb_head_series_created_total 56888
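To make the endpoint format above concrete, here is a minimal sketch that renders one metric family as text with a HELP line, a TYPE line and one line per sample. The function name `render_metric` is an assumption for illustration; in practice a client library (such as Micrometer, mentioned above) produces this output for you.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the text format shown on the slide.

    `samples` is a list of (labels_dict, value) pairs. Label values are not
    escaped here, so this sketch assumes they contain no quotes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


page = render_metric(
    "prometheus_tsdb_head_series",
    "gauge",
    "Total number of series in the head block.",
    [({}, 19956)],
)
print(page, end="")
```

Serving the returned string over HTTP at a path such as /metrics is all a scrape needs; keeping that handler fast and reliable is the part that remains your responsibility.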
  • 12. - An actual Query Language that looks a lot more like SQL than Graphite. - You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for monitoring and long term metrics. - Allows for a lot of flexibility, but can be a bit harder to grasp when starting out. - Supports functions, operators, regex, arithmetic and expressions. - Four expression types are supported: - Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"}) - Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time series that have this metric name. - Range Vectors (like http_requests_total{job="prometheus"}[5m] ) - Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element. - Scalars - Strings PromQL 12 Querying Metrics
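How an instant vector selector narrows down time series can be sketched in plain Python. This is a simplified model of label matching, not Prometheus code: the function name `matches` and the sample series are assumptions, and Python's `re` module stands in for the RE2 syntax Prometheus uses. Two behaviours are modelled on purpose: regex matchers are fully anchored, and a missing label behaves like an empty value.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical series, identified by their label sets.
series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
]
# http_requests_total{environment=~"staging|testing|development", method!="GET"}
selector = [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
selected = [s for s in series if matches(s, selector)]
print(selected)
```

Only the staging POST series survives: the GET series is excluded by `!=`, and the production series fails the regex.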
  • 13. - Custom Resource Type provided by Prometheus-operator - Abstraction of Prometheus “job” and Service Discovery - Allows for easy ingestion of new endpoints through their k8s service - Example: ServiceMonitors 13 Getting your endpoint monitored Prometheus Prometheus Operator YOUR App! K8s Service ServiceMonitor
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      spec:
        endpoints:
        - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
          interval: 30s
          port: https
          scheme: https
          tlsConfig:
            insecureSkipVerify: true
        jobLabel: k8s-app
        selector:
          matchLabels:
            k8s-app: node-exporter
      ---
      apiVersion: v1
      kind: Service
      metadata:
        labels:
          k8s-app: node-exporter
        name: node-exporter
      spec:
        ports:
        - name: https
          port: 9100
          protocol: TCP
          targetPort: https
        selector:
          app: node-exporter
        type: ClusterIP
  • 14. - The same tool you were probably already using. - The central interface for cloud insights - Contains a specialized query editor for Prometheus data sources. - Prometheus currently doesn’t store metrics older than one month for performance reasons. - Multiple solutions for long term metrics exist, but it’s a work in progress. Dashboarding with Grafana 14 Creating Insights Prometheus Prometheus Grafana HipChat Remote Storage Adapter InfluxDB
  • 15. Trouble in Paradise Creating Alerts, choosing your weapon 15 WARNINGS – Notifications during working hours - No direct intervention is required - Usually picked up by members of the team developing / maintaining a system. - Alert delivery is NOT guaranteed. Use Grafana with HipChat or Email alerts CRITICALS – 24x7 Text Messages with Escalation - Actionable events that require immediate attention by an Engineer on Duty, who does not necessarily have intimate knowledge of your system. - Response is required to silence/end the alert. - Provisioned through RuleList (R2D2 / Operator) Use AlertManager / Iris / OnCall
  • 16. Yes, It’s PromQL as well! 16 Alert Basics
      %YAML 1.1
      ---
      kind: PrometheusAlertRule
      data:
        test.rules: |
          groups:
          - name: Load
            interval: 30s
            rules:
            - alert: HighLoad
              expr: rate(web_http_responses_total[1m]) > 1
              for: 1m
              labels:
                severity: attention
              annotations:
                description: The rate of HTTP requests is too high.
  • 17. - Alerts should be actionable: Somebody has to do something, now. - They should be simple: Someone without intimate knowledge of the system should ideally be able to solve the alert. - They should be urgent and require human intervention: No point in waking someone up if they shouldn’t have to do something, or when tomorrow afternoon would be soon enough. - Provide accurate descriptions and a playbook where possible. - Basic system monitoring should be based on SLI/SLO’s rather than infra metrics. - Prefer AM/Iris/OnCall if you’re serious about your alert. Creating the perfect alert 17 Alert Perfection Prometheus AlertManager AlertManager AlertManager Grafana IRIS OnCall SMS / Call Provider HipChat
  • 18. • A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/ • A number of these come preconfigured with our Kubernetes clusters and provide additional metrics When artisanal endpoints don’t cut the cake 18 Exporters - Additional sources of metrics
      Databases: Aerospike exporter, ClickHouse exporter, Consul exporter (official), CouchDB exporter, ElasticSearch exporter, Memcached exporter (official), MongoDB exporter, MSSQL server exporter, MySQL server exporter (official), OpenTSDB Exporter, Oracle DB Exporter, PgBouncer exporter, PostgreSQL exporter, ProxySQL exporter, RavenDB exporter, Redis exporter, RethinkDB exporter, SQL exporter, Tarantool metric library
      Hardware related: apcupsd exporter, Collins exporter, IoT Edison exporter, IPMI exporter, knxd exporter, Node/system metrics exporter (official), Ubiquiti UniFi exporter
      Messaging systems: Beanstalkd exporter, Gearman exporter, Kafka exporter, NATS exporter, NSQ exporter, Mirth Connect exporter, MQTT blackbox exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter
      Storage: Ceph exporter, Ceph RADOSGW exporter, Gluster exporter, Hadoop HDFS FSImage exporter, Lustre exporter, ScaleIO exporter
      HTTP: Apache exporter, HAProxy exporter (official), Nginx metric library, Nginx VTS exporter, Passenger exporter, Tinyproxy exporter, Varnish exporter, WebDriver exporter
      APIs: AWS ECS exporter, AWS Health exporter, AWS SQS exporter, Cloudflare exporter, DigitalOcean exporter, Docker Cloud exporter, Docker Hub exporter, GitHub exporter, InstaClustr exporter, Mozilla Observatory exporter, OpenWeatherMap exporter, Pagespeed exporter, Rancher exporter, Speedtest exporter
      Logging: Fluentd exporter, Google's mtail log data extractor, Grok exporter
      Other monitoring systems: Akamai Cloudmonitor exporter, AWS CloudWatch exporter (official), Cloud Foundry Firehose exporter, Collectd exporter (official), Google Stackdriver exporter, Graphite exporter (official), Heka dashboard exporter, Heka exporter, InfluxDB exporter (official), JavaMelody exporter, JMX exporter (official), Munin exporter, Nagios / Naemon exporter, New Relic exporter, NRPE exporter, Osquery exporter, Pingdom exporter, scollector exporter, Sensu exporter, SNMP exporter (official), StatsD exporter (official)
      Miscellaneous: Bamboo exporter, BIG-IP exporter, BIND exporter, Bitbucket exporter, Blackbox exporter (official), BOSH exporter, cAdvisor, Confluence exporter, Dovecot exporter, eBPF exporter, Jenkins exporter, JIRA exporter, Kannel exporter, Kemp LoadBalancer exporter, Meteor JS web framework exporter, Minecraft exporter module, PHP-FPM exporter, PowerDNS exporter, Process exporter, rTorrent exporter, SABnzbd exporter, Script exporter, Shield exporter, SMTP/Maildir MDA blackbox prober, SoftEther exporter, Transmission exporter, Unbound exporter, Xen exporter
  • 19. • StackDriver Exporter – Get your GCP Project’s native metrics into Prometheus. • Blackbox Exporter – Monitor Golden Signals on any system, without knowledge about the inner workings • Nginx exporter – used in Ingresses • SNMP Exporter – Bring your own MIB’s. • Statsd Exporter – Push your statsd metrics to a sidecar container • Node Exporter – Provides system metrics for VM and Physical systems (like kubernetes nodes) • cAdvisor – Get generic container metrics • Etcd • Kubernetes • Minio (Gitlab Runner Caching) The most commonly used 19 Exporters - Highlights Prometheus Prometheus Operator Exporter K8s Service ServiceMonitor
  • 20. • For situations where you are unable to serve an HTTP metrics page for a reliable period of time. • Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc. • Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to Prometheus. • Metrics will live forever on the Gateway, so be careful of what you push and how you name them. • Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if and when possible. • PRO-Tip: If you have an ephemeral job, also push the timestamp of last successful job completion. The Push Gateway 20 Metrics for ephemeral jobs Prometheus Prometheus YOUR App! Push Gateway
      echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
      ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
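The curl call above can also be made from a short-lived Python job using only the standard library. This is a hedged sketch: the helper names `push_url`, `success_body` and `push` are assumptions, as is the metric name `job_last_success_timestamp_seconds` used to follow the PRO-Tip; the host `gateway:9091` and the grouping labels are taken from the slide.

```python
import urllib.request
from urllib.parse import quote


def push_url(base, job, grouping):
    """Build /metrics/job/<job>/<label>/<value>/... from the grouping labels."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in grouping.items():
        parts += [name, quote(value, safe="")]
    return "/".join(parts)


def success_body(unix_seconds):
    # Hypothetical metric name; the body has to end with a newline.
    return f"job_last_success_timestamp_seconds {unix_seconds}\n"


def push(base, job, grouping, body):
    """POST the metrics text to the gateway (needs a reachable Pushgateway)."""
    request = urllib.request.Request(
        push_url(base, job, grouping), data=body.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


url = push_url(
    "http://gateway:9091",
    "magrathea",
    {"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
)
print(url)
```

Calling `push(...)` with `success_body(int(time.time()))` as the last step of a job would let you alert on jobs that have not succeeded recently, instead of guessing from the absence of data.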
  • 22. • Kubernetes Running on Docker for macOS. • Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus • Services are running without an Ingress, so we’re accessing them directly, using NodePorts. • We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and creating an alert for it. • Prometheus: http://localhost:30000/graph • AlertManager: http://localhost:31000/#/alerts • Grafana: http://localhost:32000/d/9dP_FHImz/pods Getting started in 5 minutes 22 Today’s Quick Demo
  • 23. Tips & Tricks Getting the most out of your Prometheus Experience
  • 24. • Metrics in Prometheus are multi-dimensional; they consist of names and labels. • Names are generic identifiers to tell WHAT you are measuring, in what format. • Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit. (bytes, seconds, meters) • Labels describe characteristics, and are usually used to identify WHERE those metrics are coming from, and can be multi faceted. • Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames, internet IP addresses, hashes). • Read https://prometheus.io/docs/practices/naming/ before you start making your own! Keep things running smoothly by not making a mess. 24 Metric Naming
      api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
      api_request_duration_seconds { stage="extract|transform|load" }
      api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
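The cardinality warning above is simple arithmetic: in the worst case a metric produces one time series per combination of label values. A small sketch, assuming every combination actually occurs; the function name `series_count` and the figure of 100,000 usernames are made up for illustration.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series for one metric name.

    `label_values` maps each label name to the set of values it can take.
    A metric without labels is a single series (the empty product is 1).
    """
    return prod(len(values) for values in label_values.values())


api_errors = {
    "endpoint": {"listProducts", "updatePricing"},
    "code": {"500", "404", "418"},
}
print(series_count(api_errors))  # 2 endpoints x 3 codes

# Adding an unbounded label multiplies everything that was already there.
with_usernames = dict(api_errors, username={f"user{i}" for i in range(100_000)})
print(series_count(with_usernames))
```

The first metric stays at 6 series; the same metric with a username label grows to 600,000, which is why usernames, IP addresses and hashes make bad labels.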
  • 25. •An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided. •An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus [SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound]. •Symptoms vs Causes: Monitor things that users will notice when using your system. •Latency - The time it takes to service a request. •Traffic - A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second. •Errors - The rate of requests that fail (like HTTP 500’s) •Saturation - How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. What should you be monitoring? 25 The Golden Signals
  • 26. •BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors) •Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors) •Your own application’s Metrics for insights, details and under-the-hood view. Combining Metric Sources for an unbiased view 26 Bringing it all together Your App Blackbox Exporter Ingress Poll Metrics Ingress Metrics App Metrics
      Prometheus scrape:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]  # Look for a HTTP 200 response.
        static_configs:
        - targets:
          - http://myapp.behindingress.io  # Target to probe with http
  • 27. •Introducing the GenericServiceMonitor and DCServiceMonitor •These types allow you to define endpoints outside of Kubernetes, and allow you to monitor on-premise services. •DCServiceMonitor works based on bol_applications and as such is bol.com specific. •GenericServiceMonitor works on static endpoints. My stuff runs in the DC and I want to keep it there. 27 So what about non-Cloud resources? kind: Prometheus/DCServiceMonitor name: tst-sdd-app spec: port: 8080 path: /internal/metrics kind: Prometheus/GenericServiceMonitor name: dev-atscale-app spec: hosts: - ip: 1.2.3.4 hostname: some.host.name port: 8080 path: /internal/metrics opex: srt-bificsps
  • 28. •Always initialize your metrics at zero when possible, or you won’t know the significance of the first value. •How do you know if your application is OK when the metrics stopped working? The up metric might also disappear when Service Discovery no longer detects your service. Always use absent() to check for existence of up! •(i)rate()/increase() then sum(), not sum() then (i)rate()/increase(), since those are the only safe functions to deal with resets. •The rate function takes a time series over a time range and, based on the first and last data points within that range, calculates a per-second rate (http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1 ) •By contrast irate is an instant rate. It only looks at the last two points within the range passed to it and calculates a per-second rate. •To complement the saturation signal, Prometheus has predict_linear() for Gauges. •All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22} Things you’ll encounter once you start making queries 28 Other tips
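The rate/irate difference and the "rate then sum" rule can be checked with a small model. This is a deliberately simplified sketch, not Prometheus's implementation: the real rate() also extrapolates towards the edges of the range, which is left out here. The function names and the sample values are assumptions; the first series reuses the counter example (1, 2, 5, 9, 0, 2, 7) from the Supported Types slide, scraped every 10 seconds.

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second rate over all (timestamp, value) samples in the range."""
    first_ts, last_ts = samples[0][0], samples[-1][0]
    return increase_with_resets([v for _, v in samples]) / (last_ts - first_ts)


def simple_irate(samples):
    """Per-second rate based on the last two samples only."""
    return simple_rate(samples[-2:])


samples = list(zip(range(0, 70, 10), [1, 2, 5, 9, 0, 2, 7]))
print(simple_rate(samples), simple_irate(samples))

# Why the order matters: pod_a restarts (counter reset), pod_b does not.
pod_a = [10, 20, 0, 10]
pod_b = [100, 110, 120, 130]
rate_then_sum = increase_with_resets(pod_a) + increase_with_resets(pod_b)
sum_then_rate = increase_with_resets([a + b for a, b in zip(pod_a, pod_b)])
print(rate_then_sum, sum_then_rate)
```

Summing first hides the reset of pod_a inside a small dip of the total, which then gets misread as a reset of the whole sum and inflates the result; handling resets per series and summing afterwards gives the real increase.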
  • 29. Questions? Don’t bother to ask me the Ultimate Question of Life, the Universe and Everything, because you already know the answer. (and yes, I know where my towel is.)