Prometheus – a next-gen Monitoring System

Prometheus
A next-generation monitoring system
Fabian Reinartz – Production Engineer, SoundCloud Ltd.

Monitoring at SC 2012 – from monolith ...

Monitoring at SC 2012
Service A
Service B
Service C
StatsD Graphite

History – monitoring at SoundCloud 2012
Source: http://eugenedvorkin.com/seven-micro-services-architecture-problems-and-solutions/

Source: http://blog.sflow.com/2011/12/using-ganglia-to-monitor-java-virtual.html

Source: http://www.bellarmine.edu/faculty/amahmood/tier3/monitoring.html

Prometheus
- started by Matt Proud and Julius Volz as an Open Source project
- first commit 24-11-2012
- public announcement in January 2015
- inspired by Borgmon
- not Borgmon

Features – multi-dimensional data model
http_requests_total{instance="web-1", path="/index", status="401", method="GET"}
#metrics x #labels x #values ▶ millions of time series
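The data model above can be sketched in a few lines of Go: a series is identified by its metric name plus its full label set, and the number of possible series for one metric is bounded by the product of the distinct values per label. This is only an illustration, not how Prometheus stores series internally; `seriesKey`, `maxSeries`, and the cardinalities in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identifier from a metric name and label set.
// Labels are sorted so the same set always yields the same key, regardless
// of map iteration order. (Hypothetical helper, not a Prometheus API.)
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

// maxSeries returns the upper bound on series for one metric: the product
// of the number of distinct values each label can take.
func maxSeries(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	fmt.Println(seriesKey("http_requests_total", map[string]string{
		"instance": "web-1", "path": "/index", "status": "401", "method": "GET",
	}))
	// Assumed cardinalities, for illustration only.
	fmt.Println(maxSeries(map[string]int{
		"instance": 100, "path": 50, "status": 10, "method": 4,
	}))
}
```

Even modest per-label cardinalities multiply quickly, which is why a single metric can fan out into a very large number of series.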

Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
histogram_quantile(0.99, sum by(le, path) (
  rate(http_request_duration_seconds_bucket[5m])
))

Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
{path="/api/comments", method="POST"} 105.4
{path="/api/user/:id", method="GET"} 34.122
{path="/api/comment/:id/edit", method="POST"} 29.31
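The aggregation in this query can be pictured as two steps: group the per-series rates by the chosen labels and sum each group, then keep the k largest sums. The Go sketch below assumes the rates have already been computed and filtered to 5xx statuses; `sample`, `sumBy`, `topK`, and the input values are hypothetical names and data for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one series value at a single point in time.
type sample struct {
	labels map[string]string
	value  float64
}

// sumBy groups samples by the given label names and sums each group,
// keeping only the grouping labels on the result.
func sumBy(samples []sample, by ...string) []sample {
	sums := map[string]*sample{}
	var order []string
	for _, s := range samples {
		vals := make([]string, len(by))
		kept := make(map[string]string, len(by))
		for i, name := range by {
			vals[i] = s.labels[name]
			kept[name] = s.labels[name]
		}
		key := strings.Join(vals, "\x00")
		if _, ok := sums[key]; !ok {
			sums[key] = &sample{labels: kept}
			order = append(order, key)
		}
		sums[key].value += s.value
	}
	out := make([]sample, 0, len(order))
	for _, key := range order {
		out = append(out, *sums[key])
	}
	return out
}

// topK returns the k samples with the largest values, highest first.
func topK(k int, samples []sample) []sample {
	sorted := append([]sample(nil), samples...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].value > sorted[j].value
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	// Assumed per-series 5xx rates, for illustration only.
	rates := []sample{
		{labels: map[string]string{"path": "/a", "method": "GET", "status": "500"}, value: 1},
		{labels: map[string]string{"path": "/a", "method": "GET", "status": "503"}, value: 2},
		{labels: map[string]string{"path": "/b", "method": "POST", "status": "500"}, value: 5},
	}
	for _, s := range topK(3, sumBy(rates, "path", "method")) {
		fmt.Println(s.labels["path"], s.labels["method"], s.value)
	}
}
```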

Features – easy to use, yet scalable
- single static binary, no dependencies
$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus
- local storage
- high-throughput [millions of time series, 380,000 samples/sec]
- efficient compression

Instrument – natively
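import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// namespace is assumed to be a string constant defined elsewhere; register
// the histogram once at startup with prometheus.MustRegister(httpDuration).
var httpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: namespace,
		Name:      "http_request_duration_seconds",
		Help:      "A histogram of HTTP request durations.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 1.5, 25),
	},
	[]string{"path", "method", "status"},
)

func handleAPI(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	status := strconv.Itoa(http.StatusOK) // placeholder: set from the actual response
	// do work
	httpDuration.WithLabelValues(r.URL.Path, r.Method, status).
		Observe(time.Since(start).Seconds())
}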
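The bucket layout chosen above is easier to judge when the boundaries are written out. The sketch below re-implements the arithmetic behind ExponentialBuckets(0.0001, 1.5, 25) with the standard library only: each upper bound is the previous one multiplied by the factor. The local `exponentialBuckets` function is a stand-in written for this illustration, not the client library's code.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor for each following bucket.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.0001, 1.5, 25)
	// Smallest and largest finite bucket bounds, in seconds.
	fmt.Printf("first=%gs last=%.4fs buckets=%d\n", b[0], b[len(b)-1], len(b))
}
```

With these parameters the finite buckets span from 100 microseconds to roughly 1.7 seconds; slower requests land in the implicit +Inf bucket.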

Features – built-in expression browser

Features – native Grafana support

Features – federation & sharding
Cluster A Cluster B
Cluster C
service metrics container metrics

Service Discovery

DNS SRV
$ dig +short SRV all.foo-api.srv.int.example.com
0 0 4738 ip-10-22-11-32.int.example.com.
[...]

DNS SRV
scrape_configs:
  - job_name: "foo-api"
    metrics_path: "/metrics"
    dns_sd_configs:
      - names: ["all.foo-api.srv.int.example.com"]
        refresh_interval: 10s
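What DNS-based discovery boils down to can be sketched with Go's standard resolver: look up the SRV records for the configured name and turn each record into a host:port scrape target. This is an illustration of the idea rather than Prometheus's discovery code; `srvTarget` and `targetsFromSRV` are hypothetical helpers, and the lookup only succeeds where the example name actually resolves.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// srvTarget formats one SRV answer as a scrape address, dropping the
// trailing dot of the fully qualified host name.
func srvTarget(host string, port uint16) string {
	return net.JoinHostPort(strings.TrimSuffix(host, "."), strconv.Itoa(int(port)))
}

// targetsFromSRV converts SRV records into host:port targets.
func targetsFromSRV(records []*net.SRV) []string {
	targets := make([]string, 0, len(records))
	for _, r := range records {
		targets = append(targets, srvTarget(r.Target, r.Port))
	}
	return targets
}

func main() {
	// With empty service and proto, LookupSRV queries the name directly.
	_, records, err := net.LookupSRV("", "", "all.foo-api.srv.int.example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, t := range targetsFromSRV(records) {
		fmt.Println(t)
	}
}
```

Repeating the lookup every refresh_interval is what keeps the target list in step with the DNS zone.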

Fancy SD
- Consul
- Kubernetes
- Zookeeper
- EC2
- Mesos-Marathon
- … any via file-based plugins
Relabel based on SD data.

Relabeling
relabel_configs:
  - action: replace
    source_labels: [__address__, __telemetry_port]
    target_label: __address__
    regex: (.+):(.+);(.+)
    replacement: $1:$3
IN
  "__address__": "10.44.12.135:25431"
  "__telemetry_port": "82432"
  "cluster": "AB"
  "environment": "production"
OUT
  "__address__": "10.44.12.135:82432"
  "__telemetry_port": "82432"
  "cluster": "AB"
  "environment": "production"
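The replace action shown here can be traced step by step: the values of the source labels are joined with ";", the regex is matched against the whole joined string, and the expanded replacement is written to the target label. The Go sketch below follows those steps under that assumption; `relabelReplace` is a hypothetical function for illustration, not Prometheus's relabeling code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace applies one "replace" rule to a copy of the label set.
// If the regex does not match, the labels are returned unchanged.
func relabelReplace(labels map[string]string, sourceLabels []string,
	regex, targetLabel, replacement string) (map[string]string, error) {
	// Anchor the expression so it must match the entire joined value.
	re, err := regexp.Compile("^(?:" + regex + ")$")
	if err != nil {
		return nil, err
	}
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name])
	}
	joined := strings.Join(values, ";")

	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	match := re.FindStringSubmatchIndex(joined)
	if match == nil {
		return out, nil
	}
	// Expand $1, $3, ... in the replacement from the capture groups.
	out[targetLabel] = string(re.ExpandString(nil, replacement, joined, match))
	return out, nil
}

func main() {
	in := map[string]string{
		"__address__":      "10.44.12.135:25431",
		"__telemetry_port": "82432",
		"cluster":          "AB",
		"environment":      "production",
	}
	out, err := relabelReplace(in,
		[]string{"__address__", "__telemetry_port"},
		"(.+):(.+);(.+)", "__address__", "$1:$3")
	if err != nil {
		fmt.Println("bad regex:", err)
		return
	}
	fmt.Println(out["__address__"])
}
```

Here the joined input is "10.44.12.135:25431;82432", so $1 captures the host and $3 the telemetry port.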

AWS EC2
scrape_configs:
  - job_name: "foo-api"
    metrics_path: "/metrics"
    ec2_sd_configs:
      - region: us-east-1
        refresh_interval: 60s
        port: 80
The following meta labels are available during relabeling:
- __meta_ec2_instance_id: the EC2 instance ID
- __meta_ec2_public_ip: the public IP address of the instance
- __meta_ec2_private_ip: the private IP address of the instance, if present
- __meta_ec2_tag_<tagkey>: each tag value of the instance

AWS EC2 – relabeling
relabel_configs:
  - source_labels: [__meta_ec2_tag_Type]
    action: keep
    regex: foo-api
  - source_labels: [__meta_ec2_tag_Deployment]
    action: replace
    target_label: deployment
    regex: (.+)
    replacement: $1
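The keep rule acts as a filter over discovered instances: only targets whose Type tag matches the regex in full survive, everything else is dropped before scraping. A minimal Go sketch of that filtering step follows, assuming a fully anchored match; `target` and `keep` are hypothetical names, and the tag values in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// target is the label set of one discovered instance.
type target map[string]string

// keep returns only the targets whose source label fully matches regex.
func keep(targets []target, sourceLabel, regex string) ([]target, error) {
	re, err := regexp.Compile("^(?:" + regex + ")$")
	if err != nil {
		return nil, err
	}
	var kept []target
	for _, t := range targets {
		if re.MatchString(t[sourceLabel]) {
			kept = append(kept, t)
		}
	}
	return kept, nil
}

func main() {
	// Assumed discovery results, for illustration only.
	targets := []target{
		{"__meta_ec2_tag_Type": "foo-api", "__meta_ec2_tag_Deployment": "canary"},
		{"__meta_ec2_tag_Type": "bar-worker"},
	}
	kept, err := keep(targets, "__meta_ec2_tag_Type", "foo-api")
	if err != nil {
		fmt.Println("bad regex:", err)
		return
	}
	fmt.Println(len(kept), "target(s) kept")
}
```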

Alerting
- no opinions
- directly defined on time series data
- verbose on firing ▶ compact but detailed on notification

Alerting
ALERT HighErrorRate
  IF sum by(job, path)(rate(http_requests_total{status=~"5.."}[5m])) /
     sum by(job, path)(rate(http_requests_total[5m])) * 100 > 1
  FOR 10m
  SUMMARY "high number of 5xx errors"
  DESCRIPTION "{{$labels.job}} has {{$value}}% 5xx errors on {{ $labels.path }}"
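The FOR clause is what separates a brief spike from a real alert: the condition has to hold continuously for the given duration before the alert fires. A small Go sketch of that state machine is below, with time given as plain seconds to keep it short. The `alert` type, its `eval` method, and `errorPercent` are hypothetical and only approximate how a rule evaluator behaves.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// alert tracks one rule's state across evaluations; times are in seconds.
type alert struct {
	holdFor     float64 // the FOR duration
	activeSince float64 // when the condition first became true
	state       alertState
}

// errorPercent is the ratio from the rule above, in percent.
// Simplification: an idle service (no traffic) counts as 0% errors.
func errorPercent(errorRate, totalRate float64) float64 {
	if totalRate == 0 {
		return 0
	}
	return errorRate / totalRate * 100
}

// eval advances the state machine for one evaluation at time now.
func (a *alert) eval(now float64, conditionTrue bool) alertState {
	if !conditionTrue {
		a.state = inactive // any gap resets the FOR timer
		return a.state
	}
	if a.state != pending && a.state != firing {
		a.activeSince = now
		a.state = pending
	}
	if now-a.activeSince >= a.holdFor {
		a.state = firing
	}
	return a.state
}

func main() {
	a := &alert{holdFor: 600} // FOR 10m
	// Assumed rates, evaluated once a minute for illustration.
	for now := 0.0; now <= 660; now += 60 {
		fmt.Println(now, a.eval(now, errorPercent(5, 100) > 1))
	}
}
```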

Alerting
{path="/api/comments", method="POST"} 5.43
{path="/api/user/:id", method="GET"} 1.22
{path="/api/comment/:id/edit", method="POST"} 1.01

Alerting
ALERT HighErrorRate
  IF ... * 100 > 1
  FOR 10m
  WITH { severity = "warning" } …
ALERT HighErrorRate
  IF ... * 100 > 3
  FOR 10m
  WITH { severity = "critical" } …

Alertmanager
alerts ▶ silence, inhibit, group, dedup, route ▶ PagerDuty, Mail, Slack, ...

Alerting
ALERT DiskWillFillIn4Hours
IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
FOR 5m
  SUMMARY "device filling up"
  DESCRIPTION "{{$labels.device}} mounted on {{$labels.mountpoint}} on
    {{$labels.instance}} will fill up within 4 hours."
http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
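The idea behind predict_linear is an ordinary least-squares fit over the samples in the range, extrapolated a number of seconds past the evaluation time. The Go sketch below shows that calculation under simplifying assumptions (timestamps in seconds, at least two samples at distinct times). `point`, `predictLinear`, and the disk values are hypothetical and for illustration only.

```go
package main

import "fmt"

// point is one sample: a timestamp in seconds and a value.
type point struct{ t, v float64 }

// predictLinear fits a straight line through the points by least squares
// and returns the value expected horizon seconds after now. The second
// return value is false when no line can be fitted.
func predictLinear(points []point, now, horizon float64) (float64, bool) {
	if len(points) < 2 {
		return 0, false
	}
	n := float64(len(points))
	var sumX, sumY, sumXY, sumX2 float64
	for _, p := range points {
		x := p.t - now // measure time relative to the evaluation time
		sumX += x
		sumY += p.v
		sumXY += x * p.v
		sumX2 += x * x
	}
	denom := n*sumX2 - sumX*sumX
	if denom == 0 {
		return 0, false // all samples share one timestamp
	}
	slope := (n*sumXY - sumX*sumY) / denom
	intercept := (sumY - slope*sumX) / n // fitted value at time now
	return intercept + slope*horizon, true
}

func main() {
	// Assumed free-space samples over the last hour, in bytes.
	free := []point{{0, 1000}, {1800, 850}, {3600, 700}}
	predicted, ok := predictLinear(free, 3600, 4*3600)
	if ok && predicted < 0 {
		fmt.Println("disk predicted to fill within 4 hours:", predicted)
	}
}
```

Alerting on the trend instead of a fixed threshold means a slowly filling disk and a rapidly filling one are both caught with the same lead time.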

Turing complete
http://www.robustperception.io/conways-life-in-prometheus/

Recording rules
job:http_requests:rate5m = sum by(job) (
rate(http_requests_total[5m])
)
