DevOps Fest 2019. Björn Rabenstein. Applied Alerting Philosophy

More than five years ago, Rob Ewaschuk created an innocuous Google doc titled “My Philosophy On Alerting”. It became kind of viral and later formed the foundation of a chapter in the famous book Site Reliability Engineering – How Google Runs Production Systems. In parallel, the metrics-based monitoring and alerting system Prometheus was developed at SoundCloud. It is the open-source tool to put Rob’s philosophy into practice. Thus, I would like to present “applied alerting philosophy” and explain how we use Prometheus at SoundCloud to create meaningful and actionable alerts. In particular, SoundCloud follows a fairly radical “you build it – you run it” approach, which requires additional care to route alerts to the right group of engineers. Prometheus’s “label everything” mantra proves to be very helpful here.


  1. CONTINUOUS DELIVERY. CONTINUOUS DEVOPS. Professional conference on DevOps practices. 6th April 2019, Kyiv.
  2. 6th April 2019, Kyiv. Björn “Beorn” Rabenstein: Applied Alerting Philosophy
  3. My Philosophy on Alerting, based on my observations while I was a Site Reliability Engineer at Google
      Author: Rob Ewaschuk <rob@infinitepigeons.org>
      Contents: Introduction; Vernacular; Monitor for your users; Cause-based alerts are bad (but sometimes necessary); Alerting from the spout (or beyond!); Causes are still useful; Tickets, Reports and Email; Playbooks; Tracking & Accountability; You're being naïve!; Summary
      Summary: “When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:”
      goo.gl/2vrpSO
  4. Symptoms vs. causes
      “‘What’ versus ‘why’ is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.”
      Source: Betsy Beyer et al., “Site Reliability Engineering – How Google Runs Production Systems”, Chapter 6: Monitoring Distributed Systems
  5. Pages vs. tickets

      Expected response           | SRE book | SoundCloud lingo           | “Alerts” according to Prometheus | Delivered to                        | Based on
      Act immediately             | Alerts   | Pages                      | severity="critical"              | Pager                               | Symptoms
      Act eventually              | Tickets  | Tickets / “email alerts”   | severity="warning"               | Issue tracker / Chat / email :-(    | Symptoms or causes
      None (for diagnostics only) | Logs     | Informational alerts       | severity="info"                  | Nowhere / dashboards                | Causes
  6. What also counts as “symptoms”…
      “One person’s symptom is another person’s cause.”
      “Not-yet-occurring but imminent problems.”
      “Zero-redundancy (N + 0) situations count as imminent, as do ‘nearly full’ parts of your service!”
  7. Black-box vs. white-box
      Axes: white-box (needs instrumentation) vs. black-box (no changes required); host-based (“traditional”) vs. service-based (“modern”).
      Black-box monitoring FTW?!?
      Prometheus Blackbox Exporter
  8. Probing with real user traffic in multi-tiered services
      User traffic → Frontend service (instrumented) → Backend service A, Backend service B
      The frontend measures A’s and B’s latency, rps, errors… and alerts the owners of A or B.
  9. Alerting on SLOs at
  10. https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic
      Image credit: AMPELMANN GmbH, CC BY-SA 4.0
  11. Recording rule for the per-backend error ratio:

      - record: backend:http_errors_per_response:ratio_rate5m
        expr: |2
            sum by (backend)(rate(
              haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[5m]
            ))
          /
            sum by (backend)(rate(
              haproxy_backend_http_responses_total{job="ampelmann"}[5m]
            ))

  12. Recording rules for the per-backend error SLO (in percent):

      - record: backend:error_slo:percent
        labels:
          backend: "api-v4"
        expr: 0.1
      - record: backend:error_slo:percent
        labels:
          backend: "api-v3"
        expr: 0.2
      # ... Many more backends.
  13. SLO budget consumption | Time window | Burn rate | Notification
      2%                     | 1 hour      | 14.4      | Page
      5%                     | 6 hours     | 6         | Page
      10%                    | 3 days      | 1         | Ticket
  14. expr: (
            job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
          or
            job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      )
      severity: page

      expr: job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
      severity: ticket
  15. - alert: AmpelmannErrorBudgetBurn
        expr: |2
            (
                100 * backend:http_errors_per_response:ratio_rate1h
              > on (backend)
                14.4 * backend:error_slo:percent
            )
          and
            (
                100 * backend:http_errors_per_response:ratio_rate5m
              > on (backend)
                14.4 * backend:error_slo:percent
            )
        for: 2m
        labels:
          system: "{{$labels.backend}}"
          severity: "critical"
          window: "1h"
        annotations:
          summary: "a backend burns its error budget very fast"
          description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx
          runbook: "http://runbooks.soundcloud.com/runbooks/ampelmann/#ampelmannerrorbudgetburn"
  16. Alert  | Long window | Short window | for duration | Burn rate factor | Error budget consumed
      Page   | 1h          | 5m           | 2m           | 14.4             | 2%
      Page   | 6h          | 30m          | 15m          | 6                | 5%
      Ticket | 1d          | 2h           | 1h           | 3                | 10%
      Ticket | 3d          | 6h           | 1h           | 1                | 10%
  17. Who gets the page?
  18. - alert: AmpelmannErrorBudgetBurn
        expr: |2
            (
                100 * backend:http_errors_per_response:ratio_rate1h
              > on (backend)
                14.4 * backend:error_slo:percent
            )
          and
            (
                100 * backend:http_errors_per_response:ratio_rate5m
              > on (backend)
                14.4 * backend:error_slo:percent
            )
        for: 2m
        labels:
          system: "{{$labels.backend}}"
          severity: "critical"
          window: "1h"
        annotations:
          summary: "a backend burns its error budget very fast"
          description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx
          runbook: "http://runbooks.soundcloud.com/runbooks/ampelmann/#ampelmannerrorbudgetburn"
  19. route:
        receiver: prodeng-warn
        group_by:
        - alertname
        - zone
        - system
        routes:
        - receiver: api-team-warn
          match:
            system: api-v4
          routes:
          - receiver: api-team-crit
            match:
              severity: critical
            group_wait: 20s
            group_interval: 5m
            repeat_interval: 3h
          - receiver: api-team-info
            match:
              severity: info
  20. https://prometheus.io
      https://github.com/beorn7/talks
      https://developers.soundcloud.com/blog
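
The burn-rate factors on slides 13 and 16 can be checked with a little arithmetic. The following is a minimal sketch, not code from the talk: it assumes the SLO is evaluated over a 30-day period (the slides do not state the period, but 30 days reproduces their numbers), and the names `SLO_PERIOD_HOURS`, `burn_rate` and `should_fire` are made up for illustration. It works with plain ratios, whereas the rules on slides 12 and 15 work in percent.

```python
# Sketch: reproduce the burn-rate factors from slides 13 and 16.
# ASSUMPTION: the SLO period is 30 days; the slides do not state it.
SLO_PERIOD_HOURS = 30 * 24  # 720 hours


def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the error budget is
    consumed within `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def should_fire(long_ratio: float, short_ratio: float,
                slo_ratio: float, factor: float) -> bool:
    """Multi-window condition in the style of slide 15: the error ratio
    must exceed factor * SLO in both the long and the short window."""
    threshold = factor * slo_ratio
    return long_ratio > threshold and short_ratio > threshold


if __name__ == "__main__":
    # (budget consumed, long window in hours) for the four rows of slide 16.
    for budget, hours in [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]:
        print(f"{budget:.0%} of budget in {hours}h: "
              f"burn rate {burn_rate(budget, hours):g}")
```

With an SLO of 0.1% errors (ratio 0.001) and a factor of 14.4, `should_fire` only returns true when both windows show an error ratio above 0.0144. Requiring the short window as well is what lets the alert stop firing soon after the errors stop.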
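
To see how the `system` and `severity` labels from slide 18 decide who gets the page, here is a deliberately simplified model of the routing tree on slide 19. It is an illustration only, not Alertmanager's implementation: it handles exact-match label checks and "first matching child wins", and ignores grouping, timing, regex matchers and the `continue` option. The `Route` class and `pick_receiver` function are hypothetical names, and the nesting of the routes is one reading of the flattened slide.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label name -> required value
    routes: list = field(default_factory=list)  # child routes, checked in order


def pick_receiver(route: Route, labels: dict) -> str:
    """Descend into the first child whose matchers all hold; if no child
    matches, the current route's receiver handles the alert."""
    for child in route.routes:
        if all(labels.get(name) == value for name, value in child.match.items()):
            return pick_receiver(child, labels)
    return route.receiver


# Routing tree mirroring slide 19 (nesting is an assumption, see above).
TREE = Route("prodeng-warn", routes=[
    Route("api-team-warn", {"system": "api-v4"}, routes=[
        Route("api-team-crit", {"severity": "critical"}),
        Route("api-team-info", {"severity": "info"}),
    ]),
])

if __name__ == "__main__":
    alert = {"alertname": "AmpelmannErrorBudgetBurn",
             "system": "api-v4", "severity": "critical"}
    print(pick_receiver(TREE, alert))
```

In this model, an alert for a backend without its own route falls through to the root receiver, which matches the idea of a default team catching anything that is not claimed by a more specific route.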
