Five Causes of Alert Fatigue -- and how to prevent them

“Alert Spam” is a major recurring pain brought up by Ops teams: the constant flood of noisy alerts from your monitoring stack. This presentation discusses five types of spammy alerts that we hear about most often (and how we’d like to see them resolved). Most of them will sound familiar to you.

Published in: Technology, Design

  • Drama! Open source, SaaS, and continuous integration have disrupted datacenters. Tectonic shift from a dev perspective: agility & speed of development. Accelerated the service disruptions (CEO feels the effect). Lengthened MTTR.

    1. Alert Fatigue - and what to do about it
       Elik Eizenberg, VP R&D
       http://www.bigpanda.io
    2. alert fatigue (noun)
       A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.
       Synonyms: alert overload, alert spam
    3. - Poor Signal-to-Noise Ratio
       - Delayed Response
       - Wrong Prioritization
       - Constant Context Switching
    4. Common Pitfalls
    5. Alert Per Host
       What you see: 20 critical Nagios / Zabbix alerts, all at once
       What happened:
         - Unexpected traffic to your app
         - You get an alert from practically every host in the cluster
       In an ideal world:
         - 1 alert, indicating 80% of the cluster has problems
         - Don’t wake me up unless at least some % of the cluster is down
    6. Important != Urgent
       What you see: Low disk space alert on a MongoDB host
       What happened:
         - DB disk is slowly filling up as expected
         - Will become urgent in a few weeks
       In an ideal world:
         - No need for an alert at all!
         - Automatically issue a Jira ticket and assign it to me
    7. Non-Adaptive Thresholds
       What you see: The same high-load alerts, every Monday after lunch
       What happened:
         - Monday is busy by definition
         - You can’t use the same thresholds every day
       In an ideal world:
         - Dynamically update your thresholds
         - Or focus only on anomalies (e.g. etsy/skyline)
    8. Same Issue, Different System
       What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…
       What happened:
         - Data corruption in a couple of Mongo nodes
         - Resulting in heavy disk IO and some transaction errors
         - This kind of error manifests itself at the server, application & user levels
       In an ideal world:
         - Auto-correlate highly related alerts from different systems
         - Show me one high-level incident, instead of low-level alerts
    9. Transient Alerts
       What you see: Issue pops up for a couple of minutes, then disappears.
       What happened:
         - Maybe a cronjob over-utilizes the network
         - Or a random race condition in the app
         - Or a rarely used product feature that causes the backend to crash
       In an ideal world:
         - No need for an alert every time it happens
         - Give me a monthly report of common short-lived alerts
    10. Give us a try - http://www.bigpanda.io
        http://twitter.com/bigpanda
        Thanks for listening!
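
The "Alert Per Host" slide asks for one cluster-level alert instead of one per host. A minimal sketch of that idea follows. The names `HostAlert` and `cluster_alerts`, the `min_fraction` paging threshold, and the `cluster_sizes` mapping are all hypothetical and are not part of Nagios, Zabbix, or any other tool named in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostAlert:
    host: str
    cluster: str
    critical: bool


def cluster_alerts(alerts, cluster_sizes, min_fraction=0.5):
    """Collapse per-host critical alerts into at most one alert per cluster.

    A summary is emitted only when the fraction of affected hosts in a
    cluster reaches min_fraction (an assumed paging threshold).
    """
    affected = {}
    for alert in alerts:
        if alert.critical:
            # A set deduplicates repeated alerts from the same host.
            affected.setdefault(alert.cluster, set()).add(alert.host)

    summaries = []
    for cluster, hosts in sorted(affected.items()):
        total = cluster_sizes[cluster]
        fraction = len(hosts) / total
        if fraction >= min_fraction:
            summaries.append(
                f"{cluster}: {len(hosts)}/{total} hosts critical ({fraction:.0%})"
            )
    return summaries
```

With 20 critical hosts in a 25-host cluster, this yields a single summary line for that cluster, while a lone failing host in another cluster stays below the threshold and produces nothing.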
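
For the "Important != Urgent" slide, one way to separate the two is to estimate how long until the disk is actually full and route on that. The sketch below assumes linear growth. The function names and the 2-day and 30-day cutoffs are illustrative assumptions, and the "ticket" result merely stands in for whatever creates the Jira ticket (no Jira API is called here).

```python
def days_until_full(used_gb, capacity_gb, growth_gb_per_day):
    """Estimate days until the disk is full, assuming linear growth."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing, so it never fills up
    return max(capacity_gb - used_gb, 0) / growth_gb_per_day


def route_disk_alert(used_gb, capacity_gb, growth_gb_per_day,
                     page_within_days=2, ticket_within_days=30):
    """Return "page", "ticket", or "ignore" based on the time remaining."""
    remaining = days_until_full(used_gb, capacity_gb, growth_gb_per_day)
    if remaining <= page_within_days:
        return "page"    # urgent: someone has to act now
    if remaining <= ticket_within_days:
        return "ticket"  # important but not urgent: file it for later
    return "ignore"
```

A disk at 800 GB of 1000 GB growing 10 GB per day has about 20 days left, so it becomes a ticket rather than a page.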
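
The "Non-Adaptive Thresholds" slide suggests updating thresholds dynamically. A simple sketch, assuming a separate baseline per weekday and hour and a threshold of mean plus `k` standard deviations, is shown below. This is not how etsy/skyline works; it is a much simpler stand-in, and `build_baselines`, `is_anomalous`, `k`, and `min_std` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baselines(samples):
    """Build a (mean, stdev) baseline per (weekday, hour) bucket.

    samples: iterable of (datetime, value) pairs from historical data.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def is_anomalous(baselines, ts, value, k=3.0, min_std=1e-9):
    """True if value exceeds the bucket's mean by more than k stdevs."""
    key = (ts.weekday(), ts.hour)
    if key not in baselines:
        return False  # no history for this bucket: stay quiet
    mu, sigma = baselines[key]
    return value > mu + k * max(sigma, min_std)
```

With a history where Monday 13:00 load averages around 9 and Tuesday 13:00 load averages around 3, a load of 9.5 is normal on a Monday but flagged on a Tuesday.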
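
For the "Same Issue, Different System" slide, a rough sketch of correlation is to group alerts that share a service and arrive close together in time. It assumes every alert already carries a common `service` tag, which real monitoring tools rarely agree on without extra mapping work. `Alert`, `Incident`, `correlate`, and `window_seconds` are hypothetical names, and this heuristic is only one of many possible approaches.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert
    service: str      # assumed shared tag across tools
    timestamp: float  # seconds since some common epoch
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def sources(self):
        return sorted({a.source for a in self.alerts})


def correlate(alerts, window_seconds=300):
    """Group alerts per service; a gap above window_seconds starts a new incident."""
    incidents = []
    latest = {}  # service -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current and alert.timestamp - current.alerts[-1].timestamp <= window_seconds:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest[alert.service] = current
    return incidents
```

Three alerts about the same service from three different tools within a few minutes end up in one incident, and `incident.sources` lists which tools reported it.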
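
The "Transient Alerts" slide asks for a periodic report instead of a notification for every short-lived alert. The sketch below works on alerts that have already resolved; a live system would instead have to hold each notification back for the minimum duration before deciding. `split_transient`, `monthly_report`, and the 300-second cutoff are assumptions for illustration.

```python
from collections import Counter


def split_transient(events, min_duration_seconds=300):
    """Separate short-lived alerts from ones worth a notification.

    events: iterable of (name, start_ts, end_ts) for resolved alerts.
    Returns (names_to_notify, counts_of_transient_alerts).
    """
    to_notify = []
    transient_counts = Counter()
    for name, start, end in events:
        if end - start < min_duration_seconds:
            transient_counts[name] += 1  # count it, but do not notify
        else:
            to_notify.append(name)
    return to_notify, transient_counts


def monthly_report(transient_counts, top=5):
    """List the most frequent short-lived alerts, most common first."""
    return [f"{name}: {count}" for name, count in transient_counts.most_common(top)]
```

Feeding a month of resolved alerts through `split_transient` and passing the counts to `monthly_report` gives a short ranked list of the recurring blips, which is usually enough to decide which ones deserve a real fix.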
