In recent years it’s become evident that alerting is one of the biggest challenges facing modern Operations Engineers. Conference talks, hallway tracks, meetups, and the like are rife with discussions of poor signal-to-noise ratios in alerts, fatigue from false positives, and a general lack of actionability.
Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach, examining alerting requirements and practices in the abstract. We put forth a comprehensive model, with best practices your team should follow regardless of your tool of choice.
This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, such as:
Defining organizational policy about where and when to set alerts
Ensuring the on-call engineer is armed with the information needed to take action
Best practices for configuring an alert
Fire-fighting after an alert has triggered
Performing analysis across your organization-wide history of alerts
26. What did he just say?
• Notifications are expensive: they hurt people and productivity
• Make people work harder to send them by requiring runbooks
• Runbooks add context to alerts; other kinds of context, like graphs, are valuable too
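One way to put the "require runbooks" rule into practice is a validation step (for example, in CI) that rejects any alert definition missing a runbook link or a graph link. The sketch below is a minimal illustration; the dict schema and field names (`runbook_url`, `graph_url`) are assumptions, not any particular tool’s format:

```python
# Validate that every alert definition carries the context an on-call
# engineer needs: a runbook URL and a link to a graph of the metric.
# The alert schema here is hypothetical, not tied to a specific tool.

def validate_alert(alert: dict) -> list[str]:
    """Return a list of problems; an empty list means the alert passes."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append(f"alert {alert.get('name', '?')!r} has no runbook_url")
    if not alert.get("graph_url"):
        problems.append(f"alert {alert.get('name', '?')!r} has no graph_url")
    return problems

alerts = [
    {"name": "disk_full", "runbook_url": "https://wiki/runbooks/disk_full"},
    {"name": "high_latency",
     "runbook_url": "https://wiki/runbooks/latency",
     "graph_url": "https://graphs/latency"},
]

for alert in alerts:
    for problem in validate_alert(alert):
        print(problem)
```

Wiring a check like this into review or CI is the "make people work harder to send notifications" idea made concrete: an alert without context never ships.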
51. WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL
52. What did he just say?
• Monitoring isn’t a thing; it’s just part of the engineering process
• We’re treating it like a thing that only some types of engineers might want to do, and that’s giving us broken feedback
• Aerospace engineers are rad; they don’t do that
• Fix your monitoring and your alerts will follow
96. What did he just say?
• Choose metrics that tell you about the things you care about
• Alert when the things you care about hit limits you understand
• All alerts below critical go to chatrooms, ticket systems, or dashboards
• Critical alerts use an automated escalation service that enforces on-call policy
• Escalated alerts require acknowledgement
• Escalated alerts require runbook URLs and/or links to graphs of the metric
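The severity-based routing described above can be sketched as a small dispatch function. The sink names (`chatroom`, `escalation_service`) and the `Alert` shape are illustrative assumptions, not a real service’s API:

```python
# Route alerts by severity: anything below critical goes to low-urgency
# sinks (chat, tickets, dashboards); critical alerts go to an escalation
# service that pages on-call and requires acknowledgement. Critical
# alerts are refused entirely unless they carry a runbook URL.

from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str          # "info", "warning", or "critical"
    runbook_url: str = ""  # required before a critical alert may page

def route(alert: Alert) -> str:
    if alert.severity != "critical":
        return "chatroom"          # low-urgency sink; nobody gets woken up
    if not alert.runbook_url:
        raise ValueError(f"critical alert {alert.name!r} needs a runbook_url")
    return "escalation_service"    # pages on-call, enforces the ack policy

print(route(Alert("high_latency", "warning")))                    # chatroom
print(route(Alert("disk_full", "critical", "https://wiki/disk")))  # escalation_service
```

Raising on a runbook-less critical alert, rather than paging anyway, is the point: the policy is enforced by the pipeline, not by convention.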
106. The Ultimate Recap
• Enforce a notification policy that requires context
• Make monitoring an engineering process
• Use the same signal for all metrics introspection and notification
• Encourage everyone to rely on telemetry data (graphs or it didn’t happen!)
• Everyone who collects a metric gets the keys to dashboard and alert design