DevOps Fest 2019. Björn Rabenstein. Applied Alerting Philosophy

CONTINUOUS DELIVERY. CONTINUOUS DEVOPS.
Professional conference on DevOps practices
6th April 2019, Kyiv

Björn “Beorn” Rabenstein
Applied Alerting Philosophy

My Philosophy on Alerting
based on my observations while I was a Site Reliability Engineer at Google
Author: Rob Ewaschuk <rob@infinitepigeons.org>
Introduction
Vernacular
Monitor for your users
Cause-based alerts are bad (but sometimes necessary)
Alerting from the spout (or beyond!)
Causes are still useful
Tickets, Reports and Email
Playbooks
Tracking & Accountability
You're being naïve!
Summary
Summary
When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:
goo.gl/2vrpSO

“What” versus “why” is one of the most important
distinctions in writing good monitoring with maximum
signal and minimum noise.
Chapter 6: Monitoring Distributed Systems
Symptoms vs. causes
Source: Betsy Beyer et al. “Site Reliability Engineering – How Google Runs Production Systems”

Expected response            SRE book  SoundCloud lingo                               Delivered to                      Based on
Act immediately              Alerts    Pages (severity="critical")                    Pager                             Symptoms
Act eventually               Tickets   Tickets / “email alerts” (severity="warning")  Issue tracker / chat / email :-(  Symptoms or causes
None (for diagnostics only)  Logs      Informational alerts (severity="info")         Nowhere / dashboards              Causes

“Alerts” according to Prometheus: all three kinds in the “SoundCloud lingo” column.
Pages vs. tickets

“One person’s symptom is another person’s cause.”
“Not-yet-occurring but imminent problems.”
“Zero-redundancy (N + 0) situations count as imminent,
as do ‘nearly full’ parts of your service!”
What also counts as “symptoms”…
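
As an illustration of a "not-yet-occurring but imminent" problem, the rule below is a minimal sketch of a "nearly full" alert. It is not taken from the talk: the alert name, the group name, the 6h look-back, and the 4h horizon are assumptions, and it presumes the node exporter's node_filesystem_avail_bytes metric is being scraped.

```yaml
groups:
  - name: imminent-problems-example        # hypothetical group name
    rules:
      - alert: FilesystemPredictedFull     # hypothetical alert name
        # Extrapolate the last 6h of free space linearly; fire if the
        # filesystem is predicted to run out of space within 4 hours.
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          # Treated as a symptom here, so it pages; a team might
          # reasonably choose "warning" instead.
          severity: "critical"
```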

white-box (needs instrumentation) vs. black-box (no changes required)
host-based (“traditional”) vs. service-based (“modern”)
Black-box monitoring FTW?!?
Prometheus Blackbox Exporter
Black-box vs. white-box

Probing with real user traffic in multi-tiered services
[Diagram] User traffic reaches the frontend service (instrumented), which calls backend service A and backend service B.
The frontend measures A’s and B’s latency, rps, errors… and alerts the owners of A or B.

https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic
AMPELMANN GmbH CC BY-SA 4.0

- record: backend:http_errors_per_response:ratio_rate5m
  expr: |2
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[5m]
      ))
    /
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann"}[5m]
      ))
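
The alerting rule further down also reads backend:http_errors_per_response:ratio_rate1h, which is not shown on the slides. A plausible sketch, assuming it follows the same pattern as the 5m rule with only the range changed (the group name is hypothetical):

```yaml
groups:
  - name: ampelmann-error-ratios   # hypothetical group name
    rules:
      # Same ratio as the 5m rule, averaged over the long (1h) window.
      - record: backend:http_errors_per_response:ratio_rate1h
        expr: |2
            sum by (backend)(rate(
              haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[1h]
            ))
          /
            sum by (backend)(rate(
              haproxy_backend_http_responses_total{job="ampelmann"}[1h]
            ))
```

Evaluating a 1h rate for every backend is more expensive than the 5m one, which is one reason to precompute it as a recording rule rather than inline it in the alert expression.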

- record: backend:error_slo:percent
  labels:
    backend: "api-v4"
  expr: 0.1
- record: backend:error_slo:percent
  labels:
    backend: "api-v3"
  expr: 0.2
# ... Many more backends.

SLO budget consumption  Time window  Burn rate  Notification
2%                      1 hour       14.4       Page
5%                      6 hours      6          Page
10%                     3 days       1          Ticket
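
The burn rates in this table can be reproduced from the other two columns. The derivation below is a sketch that assumes a 30-day (720-hour) SLO period; the period is not stated on the slide, but it is the value that yields the numbers shown.

```latex
\[
\text{burn rate} \;=\;
\frac{\text{budget consumed} \times \text{SLO period}}{\text{time window}}
\]
\[
\frac{0.02 \times 720\,\mathrm{h}}{1\,\mathrm{h}} = 14.4,
\qquad
\frac{0.05 \times 720\,\mathrm{h}}{6\,\mathrm{h}} = 6,
\qquad
\frac{0.10 \times 30\,\mathrm{d}}{3\,\mathrm{d}} = 1
\]
```

A burn rate of 1 means the error budget is used up exactly at the end of the SLO period; a burn rate of 14.4 would exhaust it in about 50 hours.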

expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      or
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      )
severity: page

expr: job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
severity: ticket

- alert: AmpelmannErrorBudgetBurn
  expr: |2
      (
          100 * backend:http_errors_per_response:ratio_rate1h
        > on (backend)
          14.4 * backend:error_slo:percent
      )
    and
      (
          100 * backend:http_errors_per_response:ratio_rate5m
        > on (backend)
          14.4 * backend:error_slo:percent
      )
  for: 2m
  labels:
    system: "{{$labels.backend}}"
    severity: "critical"
    window: "1h"
  annotations:
    summary: "a backend burns its error budget very fast"
    description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx …"
    runbook: "http://runbooks.soundcloud.com/runbooks/ampelmann/#ampelmannerrorbudgetburn"

Alert   Long window  Short window  for duration  Burn rate factor  Error budget consumed
Page    1h           5m            2m            14.4              2%
Page    6h           30m           15m           6                 5%
Ticket  1d           2h            1h            3                 10%
Ticket  3d           6h            1h            1                 10%
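
Each row of this table maps onto one alerting rule shaped like AmpelmannErrorBudgetBurn. As a sketch, the second row (6h long window, 30m short window, 15m for duration, factor 6) might look as follows. The recording rules ratio_rate6h and ratio_rate30m are assumed to exist and are not shown in the talk; reusing the alert name with a different window label is likewise an assumption.

```yaml
# Assumes recording rules for the 6h and 30m windows, defined like the 5m one.
- alert: AmpelmannErrorBudgetBurn
  expr: |2
      (
          100 * backend:http_errors_per_response:ratio_rate6h
        > on (backend)
          6 * backend:error_slo:percent
      )
    and
      (
          100 * backend:http_errors_per_response:ratio_rate30m
        > on (backend)
          6 * backend:error_slo:percent
      )
  for: 15m
  labels:
    system: "{{$labels.backend}}"
    severity: "critical"   # "Page" row; the "Ticket" rows would use "warning"
    window: "6h"
```

The long window decides whether enough budget has been burned to matter, while the short window makes the alert stop firing soon after the error rate has recovered.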

route:
  receiver: prodeng-warn
  group_by:
  - alertname
  - zone
  - system
  routes:
  - receiver: api-team-warn
    match:
      system: api-v4
    routes:
    - receiver: api-team-crit
      match:
        severity: critical
      group_wait: 20s
      group_interval: 5m
      repeat_interval: 3h
    - receiver: api-team-info
      match:
        severity: info
https://prometheus.io
https://github.com/beorn7/talks
https://developers.soundcloud.com/blog
