We all know the wall of small, illegible graphs that we like to present to (senior) management and auditors as proof that we do observability. It doesn't matter that the person who built it (and the associated knowledge) has long since left the company, nor that the dashboard doesn't auto-refresh and still shows the same data as when we turned on the monitoring PC.
In an ever-changing IT landscape we deserve better than that. We shouldn't have to rely on the gut instinct of the senior engineer on deck about the general shape of the dashboard to know where to start fixing whatever needs fixing.
We should aim to collect and present information from both the IT and the business side, so that a relatively inexperienced on-call engineer can tell the difference between a P1 incident and a major client, customer, or continent going to bed.
We should be telling beautiful stories with data and dashboards, in such a way that (even) management can pull up a global dashboard and determine business-relevant information like "is our advertisement campaign having any impact". We should also not be afraid to have multiple dashboards that show different relevant information to different stakeholders (e.g. sales figures for management and links to runbooks for engineers).
2. Confidential and Proprietary
~ ❯ whoami
• Used to be a Molecular Biologist
• Then became a Dev, now an Ops
• Currently Cloud Engineer @ Seaplane.io
• Amsterdam HUG organizer
• Author of a soon-to-be-released LGTM book
Metrics
● Prometheus: open-source tool for metrics collection and storage
● Started at SoundCloud in 2012, publicly announced in 2015
● CNCF graduated project
● HTTP pull model
● PromQL DSL
https://prometheus.io/
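To make the HTTP pull model concrete: the service exposes its current counter values as plain text on an endpoint, and Prometheus fetches that page on a schedule. The following is a minimal sketch using only the Python standard library. The `REQUESTS` dictionary, the `render_metrics` and `serve` helpers, and port 8000 are illustrative assumptions and not part of the talk; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by (method, handler).
REQUESTS = {("GET", "/grafana"): 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, handler), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",handler="{handler}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on http://127.0.0.1:<port>/metrics."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `127.0.0.1:8000` would be enough for Prometheus to start pulling `http_requests_total`; the service never pushes anything.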
PromQL
http_requests_total{job="nginx", handler="/grafana"}[5m]
sum by (job) ( rate(http_requests_total[5m]) )
sum by (region) (bird_protocol_up{ip_version="ipv4",proto="BGP",env="$environment"})
https://prometheus.io/docs/prometheus/latest/querying/basics/
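For readers new to PromQL, it can help to see roughly what `sum by (job) (rate(...[5m]))` computes. The sketch below is a simplification under stated assumptions: `simple_rate` and `sum_by` are hypothetical helper names, and the real `rate()` additionally extrapolates to the edges of the range window, which is left out here.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart);
        # count the new value as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def sum_by(label, series):
    """Sum values of series that share the same value for `label`.

    series: list of (labels_dict, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)
```

With two nginx instances each serving one request per second, `sum_by("job", ...)` over their individual rates gives a single value of 2.0 for the `nginx` job, which is the shape of result the second query above returns.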
Dashboards
● Grafana is a multi-platform, open-source analytics and interactive visualization web application
● Since 2014, by Torkel Ödegaard
https://grafana.com/oss
/grafana/
Data should tell beautiful stories
https://www.youtube.com/watch?v=hVimVzgtD6w
https://www.youtube.com/watch?v=jbkSRLYSojo
Alerts are PromQL queries too
- alert: PrometheusJobMissing
  expr: absent(up{job="prometheus"})
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus job missing (instance {{ $labels.instance }})
    description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
https://awesome-prometheus-alerts.grep.to/
R.E.D & U.S.E
U = Utilization, as canonically defined
S = Saturation, measured here as Concurrency
E = Error Rate, as a throughput metric
R = Request Throughput, in requests per second
E = Request Error Rate, as either a throughput metric or a fraction of overall throughput
D = Duration: Latency, Residence Time, or Response Time; all three are widely used
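As an illustration of the R.E.D. half, the three signals can be derived from nothing more than a list of finished requests. This is a minimal sketch under assumptions that are not from the talk: the `Request` record and `red_summary` function are hypothetical, errors are taken to be HTTP status 500 and above, and duration is summarised as a nearest-rank 95th percentile.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code returned


def red_summary(requests, window_s):
    """Compute Rate, Errors and Duration for requests seen in a window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else None
    return {
        "rate_rps": total / window_s,                     # R
        "error_rps": errors / window_s,                   # E as a throughput
        "error_ratio": errors / total if total else 0.0,  # E as a fraction
        "p95_s": p95,                                     # D
    }
```

Both forms of the error signal are returned because the definition above allows either; the fraction is usually the one compared against an SLO.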
Only alert on SLO violation
And definitely on HDD space usage
Don’t alert on symptoms
Do alert on the rate at which you are using up the error budget
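The rate of error budget consumption is commonly expressed as a burn rate: the observed error ratio divided by the ratio the SLO allows. The sketch below assumes that definition; the function names are hypothetical, and the default threshold of 14.4 is only an example value (it corresponds to spending 2% of a 30-day budget in one hour), not a recommendation from the talk.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo_target:  availability objective, e.g. 0.999 for 99.9%.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both a long and a short window exceed the threshold.

    The short window confirms the problem is still happening, so the
    alert clears quickly once the error ratio recovers.
    """
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )
```

A burn rate of 1 means the budget lasts exactly the SLO period; with a 99.9% target, a sustained 2% error ratio is a burn rate of about 20 and would trigger a page under these example settings.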