Less Alarming Alerts!

975 views

Published on

Pretty much every company that has computers on the internet has someone who gets called when those computers go down. While this practice isn’t surprising, what is surprising is that we spend very little time as an industry discussing the right way to design and implement alerts. Not from a technical sense; what we need to discuss are how to make alerts something that are actually of value for the business, and worth the disruption they cause in peoples lives. That may sound a bit dramatic, but “pager fatigue” is a real risk to business, and “phantom pages” are a sign that things have gotten out of hand. We have terms for the bad things, it’s time to start talking about the good things. Topics we’ll cover include:

* The difference between metrics, alerts, alarms, and other particulars.
* How do you determine who should be called when a problem arises.
* Simple and effective techniques for your team to responding to alerts & alarms.
* How to attack your monitoring setup to eliminate alerts without adding risk.
* Defining what “production ready” ready software is in a way that the business people will agree to.

At OmniTI, we’re often forced to walk into the middle of an existing infrastructure that is already set on fire. The only thing worse than having no alerts in that situation is having hundreds of alerts screaming at you constantly. Over the years we’ve had to come up with a way to help keep our operations team sane while also providing business value, and most importantly giving comfort to the folks that have brought us in. The methods that we’ve developed can be used by any operations team to help bring sanity back to their world, and end the cycle of “pager fatigue”.

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
975
On SlideShare
0
From Embeds
0
Number of Embeds
25
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Less Alarming Alerts!

  1. 1. Less Alarming Alerts / Robert Treat
  2. 2. Hello /@robtreat2 Former • Web Developer • SysAdmin • Database Administrator
  3. 3. Hello /@robtreat2 Now COO @OMNITI
  4. 4. Hello /@robtreat2 Who Cares What Some Suite Thinks?
  5. 5. Hello /@robtreat2 Phantom Pages
  6. 6. Memory Lane / @robtreat2 Benny
  7. 7. Memory Lane / @robtreat2 MyFirstPager
  8. 8. Memory Lane / @robtreat2 Multiple Rotations
  9. 9. Memory Lane / @robtreat2 always available, phone only no pager for 1.5 years
  10. 10. Why God Why? paging is useful “broken systems should not be just another day at the office” -- me
  11. 11. Why God Why? paging is useful Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  12. 12. Why God Why? paging is useful How many alerts were received in the past week that were not actionable? (no human action was required)
  13. 13. Why God Why? paging CAN BE useful
  14. 14. Can We Fix It? how to improve?
  15. 15. Can We Fix It? sales@omniti.com we offer operationally focused services to help build and manage your infrastructure :-)
  16. 16. Terms • Metrics • (anything which can be measured)
  17. 17. Terms • • Metrics • (anything which can be measured) Graphs • (trending systems)
  18. 18. Terms • • • Metrics • (anything which can be measured) Graphs • (trending systems) Notices / Alerts • (notification of event; email)
  19. 19. Terms • • • • Metrics • (anything which can be measured) Graphs • (trending systems) Notices / Alerts • (notification of event; email) ALARMS • (wake’n you up; pages)
  20. 20. Terms • • • • Metrics • (anything which can be measured) Graphs • (trending systems) Notices / Alerts • (notification of event; email) ALARMS • (wake’n you up; pages)
  21. 21. Onward and Upward If you want to improve your alerts think in terms your business can get on board with
  22. 22. Onward and Upward for every alert you receive What is the business impact of this alert?
  23. 23. Onward and Upward for every alert you receive What is the remediation for this alert?
  24. 24. Onward and Upward remediation: • • • • Summarize the problem What was done to solve the problem? Who was notified? Can this be prevented?
  25. 25. Onward and Upward send the answer to these questions to everyone on the team every time
  26. 26. Onward and Upward • • • Knowledge Transfer Gaps Exposed Patterns will emerge
  27. 27. Onward and Upward you might be a bad alert • • • • cannot determine business impact no remediation necessary no one needs to be told work arounds are available
  28. 28. Onward and Upward in case of bad alarm • • • remove the alarm convert the alarm to an alert implement fixes
  29. 29. Onward and Upward if you can’t fix it, you don’t need to wake up for it
  30. 30. Onward and Upward if it can wait until morning, you don’t need to wake up for it
  31. 31. Onward and Upward pro tip: never let anyone add an alarm unless they can answer these questions first
  32. 32. Can We Really Do This? this is partially an organizational issue
  33. 33. Can We Really Do This? thought exercise: if you launched a new web site today, you really only need one alarm
  34. 34. Can We Really Do This? “I don’t care if my servers are on fire, as long as I am still making money” -- Kevin, actual OmniTI customer
  35. 35. This sounds good but... Most SA/SRE types want to be pro-active, not re-active. ie. alarm on leading indicators, not on problems
  36. 36. This sounds good but... Carrie: I-I'm just making sure we don't get hit again. Saul: Well, I'm glad someone's looking out for us, Carrie. Carrie: I'm serious. I-I missed something once before, I won't... I can't let that happen again. Saul: It was ten years ago. Everyone missed something that day. Carrie: Yeah, everyone's not me.
  37. 37. Based On A True Story site down: monitor was checking 200 response code. failed to notice absence of response code. easily fixed, but reactive
  38. 38. Based On A True Story “root cause” ==> OOM why don’t we alarm on OOM? OOM does not consistently cause outages
  39. 39. Based On A True Story too many false positives leads to ignoring alarms
  40. 40. Digression Görges M, Markewitz BA, Westenskow DR Improving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context http://www.ncbi.nlm.nih.gov/pubmed/19372334 “In an intensive care unit, alarms are used to call attention to a patient, to alert a change in the patient's physiology, or to warn of a failure in a medical device; however, up to 94% of the alarms are false.” Friendman, Naparstek, Taussing-Rubbo, Alarmingly Useless, The Case For Banning Car Alarms In NYC http://transalt.org/files/news/reports/caralarms/report.pdf Blackstone, Buck, Hakim Evaluation of alternative policies to combat false emergency calls http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf Wickens, Rice, Keller, Hutchins, Hughes, Clayton False Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect? http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf
  41. 41. Digression AESOP The Boy Who Cried Wolf
  42. 42. Based On A True Story • • • send notice of OOM? fix the cause of OOM? make a useful alarm?
  43. 43. Based On A True Story useful alarming • • • • script that checks for OOM restart app server when found find offending process; kill it spin up new node; kill old node in the event all of these fail, send an alarm?
  44. 44. Based On A True Story thought exercise: if you launched a new web site today, you really only need one alarm
  45. 45. In Conclusion if we need software that runs 24x7, we should design resiliency into our software, not human intervention
  46. 46. In Conclusion thinking doesn’t scale especially at 2AM
  47. 47. In Conclusion thanks! more: @robtreat2 @omniti

×