
Alert Fatigue: Avoidance and Course Correction


In this talk from Sensu Summit 2018, Doximity's Ben Abrams covers:

- Why alert fatigue is dangerous
- How we can solve it
- Sensu core components
- Filters
- Round robin subscriptions
- Check dependencies
- Check hooks (not strictly alert fatigue, but auto triage can really help in general)
- Sensu community components
  - Auto remediation: use the handler, not hooks
- External tools for on-call management and paging (such as PagerDuty)
- General tuning
- Reduction in noise in alerting (as opposed to monitoring)

Published in: Technology

  1. Alert Fatigue: Avoidance and Course Correction
     Follow along at: https://goo.gl/jh5RB7
  2. Sensu And I:
     ● Ben Abrams / @majormoses
     ● Systems Engineer @doximity
     ● Sensu Experience
       ○ 2014: Started a pet project to replace Nagios with Sensu
       ○ 2015: Sensu was production capable; started contributing back to the community
       ○ 2017: Became a maintainer for various areas across the Sensu ecosystem
         ■ Plugins
         ■ Chef Cookbooks
         ■ Slack
         ■ OSS mentorship to other maintainers in other areas
       ○ Maintain 200+ repositories for the Sensu community
  3. What is it?
     Alert fatigue occurs when one is exposed to a large number of frequent alarms (alerts) and consequently becomes desensitized to them.
  4. Or, Simply Put:
  5. The Problem
     ● We are not computers
     ● Costly extended outages
     ● Burnout / Retention
  6. Agnostic tips to reduce or eradicate alert fatigue
     ● Not Actionable == Not my problem
     ● If an alert can wait until the morning, hold it until business hours
     ● Consolidate alerts
     ● Ensure alerts come with contextual awareness
     ● Service ownership
     ● Effective on-call scheduling
     ● Wake me up when it's over / Snooze
     ● Monitoring should be reviewed at the end of your on-call handoffs
  7. How Sensu can help
     ● Token Substitution
     ● Filters
     ● Handlers
     ● Check Hooks
     ● Proxy/JIT Clients + Round Robin Subscriptions
     ● Check Dependencies
     ● Aggregate Checks
     ● Flap Detection
     ● Silencing
     ● Safe Mode
  8. Alert Reduction
     With token substitution and Sensu filters
  9. Setting Thresholds with Token Substitution
     {
       "checks": {
         "check_cpu": {
           "command": "check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90::: --sleep 5",
           "subscribers": ["base"],
           "interval": 30,
           "occurrences": ":::cpu.occurrences|4:::"
         }
       }
     }

     {
       "client": {
         "name": "i-424242",
         "address": "10.10.10.10",
         "subscriptions": ["base", "etl"],
         "safe_mode": true,
         "cpu": {
           "crit": 100,
           "warn": 95,
           "occurrences": 10
         }
       }
     }
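
The `:::key|default:::` syntax on this slide can be read as "look up `key` in the client definition, otherwise use `default`". The following is a minimal sketch of that lookup, not Sensu's own implementation: `substitute_tokens` and the `TOKEN` pattern are hypothetical names, and the client hashes are illustrative.

```ruby
# Hypothetical helper: resolves :::path.to.key|default::: tokens against a
# client attribute hash. A sketch of the idea, not Sensu's implementation.
TOKEN = /:::([^:|]+)(?:\|([^:]*))?:::/

def substitute_tokens(template, client)
  unmatched = []
  result = template.gsub(TOKEN) do
    path = Regexp.last_match(1)
    default = Regexp.last_match(2)
    # Walk the dotted path through nested hashes; nil when a key is missing.
    value = path.split(".").reduce(client) do |node, key|
      node.is_a?(Hash) ? node[key] : nil
    end
    value = default if value.nil?
    unmatched << path if value.nil?
    value.to_s
  end
  [result, unmatched]
end

client = {
  "name" => "i-424242",
  "cpu" => { "warn" => 95, "crit" => 100, "occurrences" => 10 }
}
command = "check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90::: --sleep 5"

overridden, _ = substitute_tokens(command, client)
defaulted, _ = substitute_tokens(command, { "name" => "i-000000" })
puts overridden
puts defaulted
```

A client that defines `cpu.warn` and `cpu.crit` gets its own thresholds, while every other client falls back to 80 and 90. The sketch also collects tokens that have neither a value nor a default, since those are the ones worth surfacing as an error.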
  10. Sensu Filters (1.x)
      ● Run to determine if a handler should run
      ● Inclusive and Exclusive Filters
      ● Allows running anything you can write in Ruby (which means anything)
      ● Days and Times
      ● Documentation: https://docs.sensu.io/sensu-core/1.4/reference/filters/
  11. Inclusive Filter: Nine to Five ("negate" defaults to false)
      {
        "filters": {
          "nine_to_fiver_eastern": {
            "negate": false,
            "attributes": {
              "timestamp": "eval: ENV['TZ'] = 'America/New_York'; [1,2,3,4,5].include?(Time.at(value).wday) && Time.at(value).hour.between?(9,17)"
            }
          }
        }
      }
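
To see what the `eval:` expression in this filter decides, it can help to run the same Ruby outside Sensu. The sketch below wraps the expression in a method; `nine_to_five_eastern?` is a hypothetical name, the two sample timestamps are illustrative, and it assumes the host has time zone data for `America/New_York`.

```ruby
# Hypothetical wrapper around the expression from the filter above.
# In the filter, `value` is the event attribute being matched, here the
# event timestamp in epoch seconds.
def nine_to_five_eastern?(value)
  ENV["TZ"] = "America/New_York"
  [1, 2, 3, 4, 5].include?(Time.at(value).wday) &&
    Time.at(value).hour.between?(9, 17)
end

# Sample timestamps (illustrative): 15:00 UTC is 11:00 in New York in August.
wednesday_morning = Time.utc(2018, 8, 22, 15, 0, 0).to_i
saturday_morning  = Time.utc(2018, 8, 25, 15, 0, 0).to_i

puts nine_to_five_eastern?(wednesday_morning)
puts nine_to_five_eastern?(saturday_morning)
```

Note that `between?(9, 17)` is inclusive at both ends, so any event with a local hour of 17 (17:00 through 17:59) still matches. Use `between?(9, 16)` if the handler should stop at 17:00 sharp.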
  12. Automate Triage and Remediation with check hooks and handlers
  13. Automate Triage with check hooks
      {
        "checks": {
          "ping_four8s": {
            "command": "check-ping.rb -h 8.8.8.8 -T 5",
            "subscribers": ["base"],
            "interval": 5,
            "hooks": {
              "non-zero": {
                "command": "ping -c 1 `route -n | awk '$1 == \"0.0.0.0\" { print $2 }'`"
              }
            }
          }
        }
      }
  14. Automate Remediation with handler Part 1
      {
        "checks": {
          "check_process_foo": {
            "command": "check-process.rb -p foo",
            "subscribers": ["foo_service"],
            "handlers": ["pagerduty", "remediator"],
            "remediation": {
              "foo_process_remediate": {
                "occurrences": ["1-5"],
                "severities": [2]
              }
            }
          }
        }
      }
  15. Automate Remediation with handler Part 2
      {
        "checks": {
          "foo_process_remediate": {
            "publish": false,
            "command": "sudo -u sensu service foo restart",
            "subscribers": ["foo_service", "client:CLIENT_NAME"],
            "handlers": ["pagerduty"],
            "interval": 10
          }
        }
      }
  16. Consolidating Alerts
      With Proxy/JIT clients + Round Robin subscriptions, Aggregate checks, and check dependencies
  17. Proxy/JIT + Round Robin
      {
        "checks": {
          "check_es5_cluster": {
            "command": "check-es-cluster-status.rb -h :::address:::",
            "subscribers": ["roundrobin:es5"],
            "interval": 30,
            "source": ":::es5.cluster.name:::",
            "ttl": 120
          }
        }
      }
  18. Check Dependencies
      {
        "checks": {
          "check_foo_open_files": {
            "command": "check-open-files.rb -u foo -p foo -w 80 -c 90",
            "subscribers": ["foo_service"],
            "handlers": ["pagerduty"],
            "dependencies": ["client:CLIENT_NAME/check_foo_process"]
          }
        }
      }
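
The idea behind `dependencies` is that an alert about open files is noise if the process it depends on is already known to be down. A rough sketch of that decision follows, under stated assumptions: `suppressed_by_dependency?` is a hypothetical name, `active_events` is an invented stand-in for the set of currently failing checks, and the client name `i-424242` is illustrative. Sensu's own dependency handling may resolve these strings differently.

```ruby
# Hypothetical sketch, not the Sensu implementation.
# active_events: array of [scope, check_name] pairs for checks that are
# currently failing (an assumed shape, chosen for illustration).
def suppressed_by_dependency?(check, active_events)
  Array(check["dependencies"]).any? do |dependency|
    # "client:NAME/check_name" splits into a scope and a check name.
    scope, check_name = dependency.split("/", 2)
    active_events.include?([scope, check_name])
  end
end

open_files = {
  "name" => "check_foo_open_files",
  "dependencies" => ["client:i-424242/check_foo_process"]
}

process_down = [["client:i-424242", "check_foo_process"]]

puts suppressed_by_dependency?(open_files, process_down)
puts suppressed_by_dependency?(open_files, [])
```

With the process check already failing, the open-files event is suppressed and only one page goes out. With no active events, the open-files alert is handled normally.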
  19. Sensu Aggregate checks
      {
        "checks": {
          "sensu_rabbitmq_amqp_alive": {
            "command": "check-rabbitmq-amqp-alive.rb",
            "subscribers": ["sensu-rabbitmq"],
            "interval": 60,
            "ttl": 180,
            "aggregates": ["sensu_rabbitmq"],
            "handle": false
          }
        }
      }
  20. {
        "checks": {
          "sensu_rabbitmq_amqp_alive_aggregate": {
            "command": "check-aggregate.rb --check sensu_rabbitmq_amqp_alive --critical_count 2 --age 180",
            "aggregate": "sensu_rabbitmq",
            "source": "sensu-rabbitmq",
            "hooks": {
              "non-zero": {
                "command": "curl -s -S localhost:4567/aggregates/sensu_rabbitmq/results/critical | jq .[].check --raw-output"
              }
            }
          }
        }
      }
  21. Flap Detection
      {
        "checks": {
          "check_cpu": {
            "command": "check-cpu.rb -w 80 -c 90 --sleep 5",
            "subscribers": ["base"],
            "interval": 30,
            "low_flap_threshold": ":::cpu.low_flap_threshold|25:::",
            "high_flap_threshold": ":::cpu.high_flap_threshold|50:::"
          }
        }
      }
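
The two thresholds on this slide work as a pair: a check starts flapping once its rate of state changes reaches the high threshold, and stops only after the rate drops to the low one. Below is a simplified sketch of that hysteresis. `state_change_percent` and `flapping?` are hypothetical names, and the percentage here is a plain unweighted ratio. Sensu's own calculation may differ, for example in how many results it considers or how it weights recent changes.

```ruby
# Percentage of consecutive status pairs that differ (unweighted sketch).
def state_change_percent(history)
  return 0 if history.size < 2
  changes = history.each_cons(2).count { |previous, current| previous != current }
  (changes.fdiv(history.size - 1) * 100).round
end

# Hysteresis: entering the flapping state requires reaching `high`;
# leaving it requires dropping to `low` or below.
def flapping?(history, was_flapping, low:, high:)
  percent = state_change_percent(history)
  was_flapping ? percent > low : percent >= high
end

# Exit statuses of recent check runs (0 = OK, 2 = CRITICAL); illustrative data.
history = [0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2]

puts state_change_percent(history)
puts flapping?(history, false, low: 25, high: 50)
puts flapping?(history, true, low: 25, high: 50)
```

The same history gives different answers depending on the previous state: at 30% state change, a check that was not flapping stays quiet, while one that was already flapping remains in the flapping state until it settles to 25% or below.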
  22. Silence a client or check
      $ curl -s -i -X POST \
          -H 'Content-Type: application/json' \
          -d '{"subscription": "load-balancer", "check": "check_haproxy", "expire": 3600 }' \
          http://localhost:4567/silenced
      HTTP/1.1 201 Created
      Access-Control-Allow-Credentials: true
      Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
      Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
      Access-Control-Allow-Origin: *
      Connection: close
      Content-length: 0
  23. Happy monitoring now with fewer alerts
  24. Q&A
      ● Github / Slack: @majormoses
      ● Email: me@benabrams.it
      ● Open positions at doximity on my team:
        ○ Security focused DevOps Engineer: https://grnh.se/4a4116de1
        ○ Kafka focused DevOps Engineer: https://grnh.se/fb3d19641
      ● Open positions at doximity on other teams:
        ○ https://workat.doximity.com/positions
