Less Alarming Alerts!

Pretty much every company that has computers on the internet has someone who gets called when those computers go down. While this practice isn’t surprising, what is surprising is how little time we spend as an industry discussing the right way to design and implement alerts. Not in a technical sense; what we need to discuss is how to make alerts something that is actually of value to the business, and worth the disruption they cause in people’s lives. That may sound a bit dramatic, but “pager fatigue” is a real risk to business, and “phantom pages” are a sign that things have gotten out of hand. We have terms for the bad things; it’s time to start talking about the good things. Topics we’ll cover include:

* The difference between metrics, alerts, alarms, and other particulars.
* How to determine who should be called when a problem arises.
* Simple and effective techniques your team can use to respond to alerts and alarms.
* How to attack your monitoring setup to eliminate alerts without adding risk.
* Defining what “production ready” software is in a way that the business people will agree to.

At OmniTI, we’re often forced to walk into the middle of an existing infrastructure that is already on fire. The only thing worse than having no alerts in that situation is having hundreds of alerts screaming at you constantly. Over the years we’ve had to come up with a way to keep our operations team sane while also providing business value and, most importantly, giving comfort to the folks who have brought us in. The methods we’ve developed can be used by any operations team to bring sanity back to their world and end the cycle of “pager fatigue”.

Published in: Technology, Business

Transcript

  • 1. Less Alarming Alerts / Robert Treat
  • 2. Hello / @robtreat2 Former: Web Developer, SysAdmin, Database Administrator
  • 3. Hello / @robtreat2 Now: COO @OMNITI
  • 4. Hello / @robtreat2 Who Cares What Some Suit Thinks?
  • 5. Hello / @robtreat2 Phantom Pages
  • 6. Memory Lane / @robtreat2 Benny
  • 7. Memory Lane / @robtreat2 MyFirstPager
  • 8. Memory Lane / @robtreat2 Multiple Rotations
  • 9. Memory Lane / @robtreat2 always available, phone only no pager for 1.5 years
  • 10. Why God Why? paging is useful “broken systems should not be just another day at the office” -- me
  • 11. Why God Why? paging is useful Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
  • 12. Why God Why? paging is useful How many alerts were received in the past week that were not actionable? (no human action was required)
  • 13. Why God Why? paging CAN BE useful
  • 14. Can We Fix It? how to improve?
  • 15. Can We Fix It? sales@omniti.com we offer operationally focused services to help build and manage your infrastructure :-)
  • 16. Terms: Metrics (anything which can be measured)
  • 17. Terms: Metrics (anything which can be measured); Graphs (trending systems)
  • 18. Terms: Metrics (anything which can be measured); Graphs (trending systems); Notices / Alerts (notification of event; email)
  • 19. Terms: Metrics (anything which can be measured); Graphs (trending systems); Notices / Alerts (notification of event; email); ALARMS (wake’n you up; pages)
  • 20. Terms: Metrics (anything which can be measured); Graphs (trending systems); Notices / Alerts (notification of event; email); ALARMS (wake’n you up; pages)
  • 21. Onward and Upward If you want to improve your alerts think in terms your business can get on board with
  • 22. Onward and Upward for every alert you receive What is the business impact of this alert?
  • 23. Onward and Upward for every alert you receive What is the remediation for this alert?
  • 24. Onward and Upward remediation: Summarize the problem. What was done to solve the problem? Who was notified? Can this be prevented?
  • 25. Onward and Upward send the answer to these questions to everyone on the team every time
  • 26. Onward and Upward: Knowledge Transfer; Gaps Exposed; Patterns will emerge
  • 27. Onward and Upward you might be a bad alert: cannot determine business impact; no remediation necessary; no one needs to be told; workarounds are available
  • 28. Onward and Upward in case of bad alarm: remove the alarm; convert the alarm to an alert; implement fixes
  • 29. Onward and Upward if you can’t fix it, you don’t need to wake up for it
  • 30. Onward and Upward if it can wait until morning, you don’t need to wake up for it
  • 31. Onward and Upward pro tip: never let anyone add an alarm unless they can answer these questions first
  • 32. Can We Really Do This? this is partially an organizational issue
  • 33. Can We Really Do This? thought exercise: if you launched a new web site today, you really only need one alarm
  • 34. Can We Really Do This? “I don’t care if my servers are on fire, as long as I am still making money” -- Kevin, actual OmniTI customer
  • 35. This sounds good but... Most SA/SRE types want to be proactive, not reactive, i.e. alarm on leading indicators, not on problems
  • 36. This sounds good but... Carrie: I-I'm just making sure we don't get hit again. Saul: Well, I'm glad someone's looking out for us, Carrie. Carrie: I'm serious. I-I missed something once before, I won't... I can't let that happen again. Saul: It was ten years ago. Everyone missed something that day. Carrie: Yeah, everyone's not me.
  • 37. Based On A True Story site down: monitor was checking 200 response code. failed to notice absence of response code. easily fixed, but reactive
  • 38. Based On A True Story “root cause” ==> OOM why don’t we alarm on OOM? OOM does not consistently cause outages
  • 39. Based On A True Story too many false positives leads to ignoring alarms
  • 40. Digression
      Görges M, Markewitz BA, Westenskow DR. Improving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context. http://www.ncbi.nlm.nih.gov/pubmed/19372334 “In an intensive care unit, alarms are used to call attention to a patient, to alert a change in the patient's physiology, or to warn of a failure in a medical device; however, up to 94% of the alarms are false.”
      Friedman, Naparstek, Taussig-Rubbo. Alarmingly Useless: The Case For Banning Car Alarms In NYC. http://transalt.org/files/news/reports/caralarms/report.pdf
      Blackstone, Buck, Hakim. Evaluation of alternative policies to combat false emergency calls. http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf
      Wickens, Rice, Keller, Hutchins, Hughes, Clayton. False Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect? http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf
  • 41. Digression AESOP The Boy Who Cried Wolf
  • 42. Based On A True Story: send notice of OOM? fix the cause of OOM? make a useful alarm?
  • 43. Based On A True Story useful alarming: script that checks for OOM; restart app server when found; find offending process, kill it; spin up new node, kill old node; in the event all of these fail, send an alarm?
  • 44. Based On A True Story thought exercise: if you launched a new web site today, you really only need one alarm
  • 45. In Conclusion if we need software that runs 24x7, we should design resiliency into our software, not human intervention
  • 46. In Conclusion thinking doesn’t scale, especially at 2AM
  • 47. In Conclusion thanks! more: @robtreat2 @omniti
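
The "useful alarming" idea from slides 37 and 43 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used at OmniTI: the names `is_healthy`, `remediate`, `probe`, and `send_alarm` are hypothetical, and the remediation steps are plain callables standing in for "restart app server", "kill offending process", and "spin up new node". The sketch shows two things: a missing response counts as a failure (the gap described on slide 37), and a human is paged only after every automated step has been tried and the service is still unhealthy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


def is_healthy(status_code: Optional[int]) -> bool:
    """Healthy only on an explicit 200. None means no response at all,
    which is treated as a failure rather than silently skipped."""
    return status_code == 200


@dataclass
class RemediationResult:
    recovered: bool
    steps_tried: List[str] = field(default_factory=list)
    alarm_sent: bool = False


def remediate(
    probe: Callable[[], Optional[int]],            # hypothetical: returns a status code or None
    steps: List[Tuple[str, Callable[[], None]]],   # hypothetical: (name, action) in escalation order
    send_alarm: Callable[[str], None],             # hypothetical: pages a human
) -> RemediationResult:
    """Try each automated step in order; alarm only if all of them fail."""
    result = RemediationResult(recovered=is_healthy(probe()))
    if result.recovered:
        return result
    for name, action in steps:
        result.steps_tried.append(name)
        try:
            action()
        except Exception:
            continue  # a step that raises counts as failed; escalate to the next one
        if is_healthy(probe()):
            result.recovered = True
            return result
    send_alarm("automated remediation exhausted: " + ", ".join(result.steps_tried))
    result.alarm_sent = True
    return result


if __name__ == "__main__":
    # Simulated probe: no response, then a 500, then a healthy 200.
    codes = iter([None, 500, 200])
    outcome = remediate(
        probe=lambda: next(codes),
        steps=[
            ("restart app server", lambda: None),
            ("kill offending process", lambda: None),
            ("replace node", lambda: None),
        ],
        send_alarm=print,
    )
    print(outcome)
```

In the simulated run the service recovers after the second step, so nobody is paged. Whether a step like "replace node" is safe to automate is an organizational decision as much as a technical one, which is the point made on slide 32.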