In this talk from Doximity's Ben Abrams (from Sensu Summit 2018), you'll learn:
- Why alert fatigue is dangerous
- How we can solve it
- Sensu core components
- Filters
- Round robin subscriptions
- Check dependencies
- Check hooks (not strictly alert fatigue but auto triage can really help in general)
- Sensu community components
- Auto remediation: use the handler, not hooks
- External tools for on-call management and paging (such as PagerDuty)
- General tuning
- Reduction in noise in alerting (as opposed to monitoring)
2. Sensu And I:
● Ben Abrams / @majormoses
● Systems Engineer @doximity
● Sensu Experience
○ 2014: Started a pet project to replace Nagios with Sensu
○ 2015: Sensu was production capable and started contributing back to the community
○ 2017: Became a maintainer for various areas across the sensu ecosystem
■ Plugins
■ Chef Cookbooks
■ Slack
■ OSS mentorship to other maintainers in other areas
○ Maintain 200+ repositories for the Sensu community
3. What is it?
Alert fatigue occurs when one is exposed to a large
number of frequent alarms (alerts) and consequently
becomes desensitized to them.
5. The Problem
● We are not computers
● Costly extended outages
● Burnout / Retention
6. Agnostic tips to reduce or eradicate alert fatigue
● Not Actionable == Not my problem
● If an alert can wait until the morning, hold it until business hours
● Consolidate Alerts
● Ensure alerts come with contextual awareness
● Service Ownership
● Effective On-call scheduling
● Wake me up when it’s over / Snooze
● Monitoring should be reviewed at the end of your on-call handoffs
7. How Sensu can help
● Token Substitution
● Filters
● Handlers
● Check Hooks
● Proxy/JIT Clients + Round Robin Subscriptions
● Check Dependencies
● Aggregate Checks
● Flap Detection
● Silencing
● Safe Mode
10. Sensu Filters (1.x)
● Runs to determine if a handler should run
● Inclusive and Exclusive Filters
● Allows running anything you can write in Ruby (which means anything)
● Days and Times
● Documentation: https://docs.sensu.io/sensu-core/1.4/reference/filters/
11. Inclusive Filter: Nine to Five
{
  "filters": {
    "nine_to_fiver_eastern": {
      "negate": false,
      "attributes": {
        "timestamp": "eval: ENV['TZ'] = 'America/New_York'; [1,2,3,4,5].include?(Time.at(value).wday) && Time.at(value).hour.between?(9,17)"
      }
    }
  }
}
24. Q&A
● Github / Slack: @majormoses
● Email: me@benabrams.it
● Open positions at doximity on my team:
○ Security focused DevOps Engineer: https://grnh.se/4a4116de1
○ Kafka Focused DevOps Engineer: https://grnh.se/fb3d19641
● Open Positions at doximity on other teams:
○ https://workat.doximity.com/positions
Editor's Notes
The boy who cried wolf
Humans have a far smaller queue buffer than RabbitMQ.
Before talking about Sensu-specific solutions, let’s talk in general terms.
If an alert can wait until the morning, hold it until business hours: if I can’t fix it now, why are you telling me now?
Wake me up when it’s over: short-term relief by temporarily snoozing alerts for a predetermined time.
These are Sensu features that can directly or indirectly help manage alert fatigue.
You can set per client thresholds but use the same check definition.
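As a rough sketch of per-client thresholds with token substitution, the check below reads its thresholds from client attributes and falls back to defaults when a client does not define them. The check name, subscription, client name, address, and the custom `disk` attribute are illustrative assumptions, not taken from the slides.

```json
{
  "checks": {
    "check_disk_usage": {
      "command": "check-disk-usage.rb -w :::disk.warning|80::: -c :::disk.critical|90:::",
      "subscribers": ["base"],
      "interval": 60
    }
  }
}
```

A client that needs tighter thresholds would then override them in its own definition, while every other client keeps the 80/90 defaults:

```json
{
  "client": {
    "name": "db-01",
    "address": "10.0.0.10",
    "subscriptions": ["base"],
    "disk": {
      "warning": 75,
      "critical": 85
    }
  }
}
```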
Filters in 2.x work very differently; you can write a Ruby extension as a gRPC service.
Inclusive filtering: by setting the filter definition attribute "negate": false (the default), only events that match the defined filter attributes are handled.
This will run any handlers if it’s a weekday and the hour is between 9 AM and 5 PM Eastern Time
Exclusive filtering: by setting the filter definition attribute "negate": true, events are only handled if they do not match the defined filter attributes.
This runs any handlers where the event’s occurrences value is less than the check’s configured occurrences OR 60
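As a simpler illustration of exclusive filtering than the occurrences example, the sketch below drops events from any client whose custom `environment` attribute is `development`. The filter name and the `environment` client attribute are assumptions made for illustration.

```json
{
  "filters": {
    "ignore_development": {
      "negate": true,
      "attributes": {
        "client": {
          "environment": "development"
        }
      }
    }
  }
}
```

A handler opts in by listing the filter name in its `filters` array; events from development clients then never reach that handler.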
Automate all the things!
While you can use check hooks for remediation, they are best used for triage; handlers are best used for remediation because they have extra context such as occurrences.
Valid hook names include (in order of precedence): “1”-“255”, “ok”, “warning”, “critical”, “unknown”, and “non-zero”.
This appends the result of a single ping to the default gateway when the check is unable to reach the public internet (in this case, Google’s load-balanced DNS server). Other common use cases include listing running processes when CPU, memory, or load is high, or showing the top x directories when a disk is nearly full.
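A check hook along these lines might look like the following sketch. The check name, subscription, interval, and the exact shell pipeline used to find the default gateway are assumptions; the `non-zero` hook name comes from the list of valid hook names above.

```json
{
  "checks": {
    "ping_public_dns": {
      "command": "ping -c 1 8.8.8.8",
      "subscribers": ["base"],
      "interval": 60,
      "hooks": {
        "non-zero": {
          "command": "ping -c 1 `ip route | grep default | cut -d ' ' -f 3`",
          "timeout": 10
        }
      }
    }
  }
}
```

The hook only runs when the check exits non-zero, so the gateway ping output is attached to the event exactly when someone needs it for triage.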
`remediator` handler
occurrences can be a range or a list such as `["1-5"]`, `["1+"]`, `["1,3,5"]`, etc.
You can technically have multiple levels of remediation. See: https://github.com/sensu-plugins/sensu-plugins-sensu/blob/3.0.0/bin/handler-sensu.rb#L32-L64 for more advanced usage
You can set severities for `0` OK (not sure the use case), `1` WARN, `2` CRIT, or `3` UNKNOWN.
Sudoers.d needs an entry allowing you to run it.
Restrict running to only the exact client rather than running it on all nodes with the subscription
An unpublished check is not scheduled automatically by Sensu. This allows it to be triggered by other events (handlers), and it can be used for various scenarios such as auto remediation, automatically updating a maintenance/status page, etc. Unpublished also prevents standalone checks from self-scheduling.
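Putting the remediation notes together, a configuration could look roughly like the sketch below. It follows the general shape documented in the `handler-sensu.rb` header linked above, but the check names, commands, subscription, and interval are assumptions for illustration, and the handler path should be adjusted to wherever the plugin is installed.

```json
{
  "handlers": {
    "remediator": {
      "type": "pipe",
      "command": "handler-sensu.rb"
    }
  },
  "checks": {
    "check_nginx": {
      "command": "check-process.rb -p nginx",
      "subscribers": ["web"],
      "interval": 60,
      "handlers": ["remediator"],
      "remediation": {
        "remediate_nginx": {
          "occurrences": ["1-5"],
          "severities": [2]
        }
      }
    },
    "remediate_nginx": {
      "command": "sudo service nginx restart",
      "subscribers": [],
      "publish": false
    }
  }
}
```

Here `remediate_nginx` is never scheduled on its own because of `"publish": false`; it only runs when the remediator handler requests it for a critical event within the first five occurrences.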
In the subscription, the `roundrobin:` prefix attempts a round robin by letting the first eligible node retrieve the request from RabbitMQ. In the source you define the client name you want (such as the whole cluster vs. per-machine checks); this example also uses token substitution to set the cluster name.
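A minimal sketch of such a check follows. The check name, the `check-cluster-health.sh` command (a hypothetical placeholder script), the `elasticsearch` subscription, and the custom `cluster.name` client attribute are all assumptions; only the `roundrobin:` prefix and the use of `source` with token substitution come from the note above.

```json
{
  "checks": {
    "check_cluster_health": {
      "command": "check-cluster-health.sh",
      "subscribers": ["roundrobin:elasticsearch"],
      "source": ":::cluster.name:::",
      "interval": 60
    }
  }
}
```

Each interval, only one subscribed node executes the check, and the result is reported under a proxy client named after the cluster instead of the node that happened to run it.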
Think of aggregate checks like a check against a load balancer, not every node needs to be functional just some number/percentage does.
`handle: false` is not required; this is just one place to demonstrate a way to reduce alerts that are not actionable.
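For illustration, a per-node check that feeds an aggregate without alerting on its own might be sketched like this. The check name, URL, subscription, and aggregate name are assumptions.

```json
{
  "checks": {
    "web_health": {
      "command": "check-http.rb -u http://localhost:8080/health",
      "subscribers": ["web"],
      "interval": 30,
      "aggregate": "web_service",
      "handle": false
    }
  }
}
```

Individual node failures are recorded in the `web_service` aggregate but trigger no handlers; a separate check against the aggregate can then alert only when too many nodes are unhealthy.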
This is useful when performing maintenance
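As a sketch, a silence entry for a maintenance window can be created by sending a body like the one below to the Sensu 1.x API `POST /silenced` endpoint. The client name, check name, creator, and reason are illustrative assumptions; `expire` is in seconds.

```json
{
  "subscription": "client:web-01",
  "check": "check_nginx",
  "expire": 3600,
  "reason": "Planned nginx upgrade",
  "creator": "majormoses"
}
```

Setting an expiry means the silence clears itself after the maintenance window, so a forgotten entry cannot hide a real outage indefinitely.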
While its main purpose is security (preventing malicious code from being executed via Sensu), it can also be used to prevent, say, monitoring from alerting on checks that might still be in the process of being installed during initial node bootstrap.
Some handlers, such as ones that provide a single pane of glass, should set these to true (which overrides the default)
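Safe mode is enabled per client. In the sketch below, the client name, address, and subscription are assumptions; `safe_mode` is the relevant attribute.

```json
{
  "client": {
    "name": "web-01",
    "address": "10.0.0.20",
    "subscriptions": ["web"],
    "safe_mode": true
  }
}
```

With this set, the client only executes a requested check if a matching check definition also exists locally, so checks whose plugins have not been installed yet are simply not run.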