Pretty much every company that has computers on the internet has someone who gets called when those computers go down. While this practice isn’t surprising, what is surprising is that we spend very little time as an industry discussing the right way to design and implement alerts. Not from a technical sense; what we need to discuss are how to make alerts something that are actually of value for the business, and worth the disruption they cause in peoples lives. That may sound a bit dramatic, but “pager fatigue” is a real risk to business, and “phantom pages” are a sign that things have gotten out of hand. We have terms for the bad things, it’s time to start talking about the good things. Topics we’ll cover include:
* The difference between metrics, alerts, alarms, and other particulars.
* How do you determine who should be called when a problem arises.
* Simple and effective techniques for your team to responding to alerts & alarms.
* How to attack your monitoring setup to eliminate alerts without adding risk.
* Defining what “production ready” ready software is in a way that the business people will agree to.
At OmniTI, we’re often forced to walk into the middle of an existing infrastructure that is already set on fire. The only thing worse than having no alerts in that situation is having hundreds of alerts screaming at you constantly. Over the years we’ve had to come up with a way to help keep our operations team sane while also providing business value, and most importantly giving comfort to the folks that have brought us in. The methods that we’ve developed can be used by any operations team to help bring sanity back to their world, and end the cycle of “pager fatigue”.