Alert Fatigue: Avoidance and Course Correction

Follow along at: https://goo.gl/jh5RB7

Sensu and I:
● Ben Abrams / @majormoses
● Systems Engineer @doximity
● Sensu Experience
○ 2014: Started a pet project to replace Nagios with Sensu
○ 2015: Sensu became production capable; started contributing back to the community
○ 2017: Became a maintainer for various areas across the sensu ecosystem
■ Plugins
■ Chef Cookbooks
■ Slack
■ OSS mentorship to other maintainers in other areas
○ Maintain 200+ repositories for the sensu community

What is it?
Alert fatigue occurs when one is exposed to a large
number of frequent alarms (alerts) and consequently
becomes desensitized to them.

The Problem
● We are not computers
● Costly extended outages
● Burnout / Retention

Agnostic tips to reduce or eradicate alert fatigue
● Not Actionable == Not my problem
● If an alert can wait until the morning, hold it until business hours
● Consolidate Alerts
● Ensure alerts come with contextual awareness
● Service Ownership
● Effective On-call scheduling
● Wake me up when it’s over / Snooze
● Monitoring should be reviewed at the end of your on-call handoffs

How Sensu can help
● Token Substitution
● Filters
● Handlers
● Check Hooks
● Proxy/JIT Clients + Round Robin Subscriptions
● Check Dependencies
● Aggregate Checks
● Flap Detection
● Silencing
● Safe Mode

Alert Reduction
With token substitution and sensu filters

Setting Thresholds with Token Substitution
{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w \":::cpu.warn|80:::\" -c \":::cpu.crit|90:::\" --sleep 5",
      "subscribers": ["base"],
      "interval": 30,
      "occurrences": ":::cpu.occurrences|4:::"
    }
  }
}

{
  "client": {
    "name": "i-424242",
    "address": "10.10.10.10",
    "subscriptions": ["base", "etl"],
    "safe_mode": true,
    "cpu": {
      "crit": 100,
      "warn": 95,
      "occurrences": 10
    }
  }
}
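To make the token syntax concrete, here is a minimal sketch of how a `:::path|default:::` token could be resolved against client attributes. It is an illustration only, not Sensu's implementation; the function name `substitute_tokens` and the error raised for an unresolved token are assumptions made for this example.

```ruby
# Resolve :::path.to.attr|default::: tokens against a client hash.
# Simplified illustration; substitute_tokens is a hypothetical helper.
def substitute_tokens(template, client)
  token = /:::([^:|]+?)(?:\|([^:]*?))?:::/
  template.gsub(token) do
    path = Regexp.last_match(1)
    default = Regexp.last_match(2)
    # Walk the dotted path through nested hashes; nil if any step is missing.
    value = path.split(".").reduce(client) do |node, key|
      node.is_a?(Hash) ? node[key] : nil
    end
    resolved = value.nil? ? default : value
    raise ArgumentError, "unmatched token: #{path}" if resolved.nil?
    resolved.to_s
  end
end

command = "check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90::: --sleep 5"
etl_client = { "name" => "i-424242", "cpu" => { "warn" => 95, "crit" => 100 } }
plain_client = { "name" => "i-000000" } # hypothetical client with no overrides

puts substitute_tokens(command, etl_client)   # check-cpu.rb -w 95 -c 100 --sleep 5
puts substitute_tokens(command, plain_client) # check-cpu.rb -w 80 -c 90 --sleep 5
```

The same check definition therefore yields stricter or looser thresholds per client, with the value after `|` used when the client does not define the attribute.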

Sensu Filters (1.x)
● Runs to determine if a handler should run
● Inclusive and Exclusive Filters
● Allows running anything you can write in Ruby (which means anything)
● Days and Times
● Documentation: https://docs.sensu.io/sensu-core/1.4/reference/filters/
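The inclusive/exclusive distinction comes down to the filter's `negate` flag. A minimal sketch, assuming a hypothetical `matcher` lambda that stands in for the filter's attribute comparison (the `environment` attribute is also made up for the example):

```ruby
# Decide whether an event continues on to the handler.
# negate: false -> inclusive filter, only matching events are handled.
# negate: true  -> exclusive filter, matching events are dropped.
def passes_filter?(event, negate:, matcher:)
  matched = matcher.call(event)
  negate ? !matched : matched
end

production_only = ->(event) { event.dig(:client, :environment) == "production" }
prod_event = { client: { environment: "production" } }
dev_event  = { client: { environment: "development" } }

puts passes_filter?(prod_event, negate: false, matcher: production_only) # true
puts passes_filter?(dev_event,  negate: false, matcher: production_only) # false
puts passes_filter?(prod_event, negate: true,  matcher: production_only) # false
```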

Inclusive Filter: Nine to Five
{
  "filters": {
    "nine_to_fiver_eastern": {
      "negate": false, # default: false
      "attributes": {
        "timestamp": "eval: ENV['TZ'] = 'America/New_York'; [1,2,3,4,5].include?(Time.at(value).wday) && Time.at(value).hour.between?(9,17)"
      }
    }
  }
}
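The eval expression can be tried outside Sensu. The sketch below uses the same predicate but takes a `Time` with an explicit `-05:00` offset instead of setting `ENV['TZ']`, so the result does not depend on the machine's time zone data; the helper name `nine_to_five?` and the sample dates are assumptions for illustration.

```ruby
# Same weekday/hour test as the filter's eval expression.
def nine_to_five?(time)
  [1, 2, 3, 4, 5].include?(time.wday) && time.hour.between?(9, 17)
end

monday_morning = Time.new(2018, 3, 5, 10, 0, 0, "-05:00")
monday_1745    = Time.new(2018, 3, 5, 17, 45, 0, "-05:00")
monday_1800    = Time.new(2018, 3, 5, 18, 0, 0, "-05:00")
saturday_noon  = Time.new(2018, 3, 3, 12, 0, 0, "-05:00")

puts nine_to_five?(monday_morning) # true
puts nine_to_five?(monday_1745)    # true: hour 17 is inside between?(9, 17)
puts nine_to_five?(monday_1800)    # false
puts nine_to_five?(saturday_noon)  # false
```

Because `between?(9, 17)` is inclusive on the hour, events up to 17:59 still pass; `between?(9, 16)` would cut off at 17:00 sharp.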

Automate Triage and Remediation with
check hooks and handlers

Automate Triage with check hooks
{
  "checks": {
    "ping_four8s": {
      "command": "check-ping.rb -h 8.8.8.8 -T 5",
      "interval": 5,
      "hooks": {
        "non-zero": {
          "command": "ping -c 1 `route -n | awk '$1 == \"0.0.0.0\" { print $2 }'`"
        }
      }
    }
  }
}
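A check can define several hooks, and only one of them runs for a given result. The sketch below shows one way that selection could work, assuming the precedence is exit status number first, then severity name, then `non-zero`. That ordering and the helper name `select_hook` are assumptions for this example, so confirm them against the hooks reference for your Sensu version.

```ruby
# Pick the hook to run for a check's exit status.
# Assumed precedence: exact status (1-255), severity name, then "non-zero".
def select_hook(hooks, exit_status)
  severity = { 0 => "ok", 1 => "warning", 2 => "critical" }.fetch(exit_status, "unknown")
  candidates = []
  candidates << exit_status.to_s if exit_status.between?(1, 255)
  candidates << severity
  candidates << "non-zero" unless exit_status.zero?
  name = candidates.find { |candidate| hooks.key?(candidate) }
  name && hooks[name]
end

hooks = {
  "critical" => { "command" => "ps aux" },          # hypothetical triage command
  "non-zero" => { "command" => "ping -c 1 10.0.0.1" } # hypothetical gateway address
}

p select_hook(hooks, 2) # the "critical" hook
p select_hook(hooks, 1) # falls back to the "non-zero" hook
p select_hook(hooks, 0) # nil, nothing to run on OK
```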

Automate Remediation with handler Part 1
{
  "checks": {
    "check_process_foo": {
      "command": "check-process.rb -p foo",
      "subscribers": ["foo_service"],
      "handlers": ["pagerduty", "remediator"],
      "remediation": {
        "foo_process_remediate": {
          "occurrences": ["1-5"],
          "severities": [2]
        }
      }
    }
  }
}

Automate Remediation with handler Part 2
{
  "checks": {
    "foo_process_remediate": {
      "publish": false,
      "command": "sudo -u sensu service foo restart",
      "subscribers": ["foo_service", "client:CLIENT_NAME"],
      "handlers": ["pagerduty"],
      "interval": 10
    }
  }
}
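The `remediation` block in Part 1 says when the unpublished check from Part 2 should be triggered: only for critical results (severity 2) and only on occurrences 1 through 5. A minimal sketch of that decision, assuming occurrence specs may be an integer, a range such as `"1-5"`, or an open-ended `"N+"` string; the helper names are hypothetical and this is not the remediator handler's actual code.

```ruby
# Does a single occurrence spec cover the event's occurrence count?
def occurrence_match?(spec, count)
  case spec
  when Integer
    count == spec
  when /\A(\d+)-(\d+)\z/
    count.between?(Regexp.last_match(1).to_i, Regexp.last_match(2).to_i)
  when /\A(\d+)\+\z/
    count >= Regexp.last_match(1).to_i
  else
    false
  end
end

# Return the names of remediation checks to trigger for this event.
def remediations_for(remediation, occurrences:, status:)
  remediation.select do |_check_name, rule|
    rule["severities"].include?(status) &&
      rule["occurrences"].any? { |spec| occurrence_match?(spec, occurrences) }
  end.keys
end

remediation = {
  "foo_process_remediate" => { "occurrences" => ["1-5"], "severities" => [2] }
}

p remediations_for(remediation, occurrences: 3, status: 2) # ["foo_process_remediate"]
p remediations_for(remediation, occurrences: 6, status: 2) # []
p remediations_for(remediation, occurrences: 3, status: 1) # []
```

After the fifth failed occurrence the restart is no longer attempted, and the `pagerduty` handler on the original check remains the path to a human.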

Consolidating Alerts
With Proxy/JIT clients + Round Robin subscriptions,
Aggregate checks, and check dependencies

Proxy/JIT + Round Robin
{
  "checks": {
    "check_es5_cluster": {
      "command": "check-es-cluster-status.rb -h :::address:::",
      "subscribers": ["roundrobin:es5"],
      "interval": 30,
      "source": ":::es5.cluster.name:::",
      "ttl": 120
    }
  }
}
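The effect of combining a round-robin subscription with a `source` can be sketched as follows: each check request is executed by only one subscriber, yet every result is reported under the same proxy client name, so a cluster-wide problem raises one event rather than one per node. The strict rotation, the node names, and the `es5-prod` cluster name are assumptions for illustration; in Sensu the distribution is handled by the transport.

```ruby
# Hand one check request to the next executor and attribute the
# result to the proxy client named by the check's "source".
def dispatch(check, executors)
  {
    check: check.fetch("name"),
    executed_by: executors.next,
    reported_as: check.fetch("source")
  }
end

executors = ["es5-node-1", "es5-node-2", "es5-node-3"].cycle # hypothetical clients
check = { "name" => "check_es5_cluster", "source" => "es5-prod" }

4.times { p dispatch(check, executors) }
# executed_by rotates node-1, node-2, node-3, node-1;
# reported_as is always "es5-prod"
```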

Check Dependencies
{
  "checks": {
    "check_foo_open_files": {
      "command": "check-open-files.rb -u foo -p foo -w 80 -c 90",
      "subscribers": ["foo_service"],
      "handlers": ["pagerduty"],
      "dependencies": ["client:CLIENT_NAME/check_foo_process"]
    }
  }
}
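The idea behind check dependencies is that an event is not worth alerting on when something it depends on is already failing. A simplified sketch, assuming dependencies are written as `check_name` or `client/check_name` and that currently failing checks are available as a list of `[client, check]` pairs; the helper name and data shapes are hypothetical, and a real filter would look up current events through the Sensu API.

```ruby
# Suppress an event when any of its dependencies is already failing.
def suppress?(check, client_name, active_events)
  check.fetch("dependencies", []).any? do |dependency|
    client, check_name =
      dependency.include?("/") ? dependency.split("/", 2) : [client_name, dependency]
    active_events.include?([client, check_name])
  end
end

open_files = {
  "name" => "check_foo_open_files",
  "dependencies" => ["check_foo_process"]
}

process_down = [["i-424242", "check_foo_process"]]

p suppress?(open_files, "i-424242", process_down) # true: the process alert covers it
p suppress?(open_files, "i-424242", [])           # false: alert normally
```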

Sensu Aggregate checks
{
  "checks": {
    "sensu_rabbitmq_amqp_alive": {
      "command": "check-rabbitmq-amqp-alive.rb",
      "subscribers": ["sensu-rabbitmq"],
      "interval": 60,
      "ttl": 180,
      "aggregates": ["sensu_rabbitmq"],
      "handle": false
    }
  }
}

{
  "checks": {
    "sensu_rabbitmq_amqp_alive_aggregate": {
      "command": "check-aggregate.rb --check sensu_rabbitmq_amqp_alive --critical_count 2 --age 180",
      "aggregate": "sensu_rabbitmq",
      "source": "sensu-rabbitmq",
      "hooks": {
        "non-zero": {
          "command": "curl -s -S localhost:4567/aggregates/sensu_rabbitmq/results/critical | jq .[].check --raw-output"
        }
      }
    }
  }
}
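The two definitions above split the work: the per-node check reports into an aggregate with `"handle": false`, and a single aggregate check alerts only when enough members are unhealthy. A minimal sketch of that threshold decision, assuming the count is taken over non-OK results and compared with `>=`; how the real `check-aggregate.rb` counts is not shown on the slide, so both choices and the node names are assumptions.

```ruby
# Collapse many per-node results into one status.
# Returns 2 (critical) once at least critical_count members are non-OK.
def aggregate_status(results, critical_count:)
  failing = results.count { |result| result[:status] != 0 }
  failing >= critical_count ? 2 : 0
end

results = [
  { client: "rabbit-1", status: 0 }, # hypothetical cluster members
  { client: "rabbit-2", status: 2 },
  { client: "rabbit-3", status: 0 }
]

p aggregate_status(results, critical_count: 2) # 0: one node down, nobody is paged
results[2][:status] = 2
p aggregate_status(results, critical_count: 2) # 2: two nodes down, one alert
```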

Flap Detection
{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w 80 -c 90 --sleep 5",
      "interval": 30,
      "low_flap_threshold": ":::cpu.low_flap_threshold|25:::",
      "high_flap_threshold": ":::cpu.high_flap_threshold|50:::"
    }
  }
}
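The two thresholds act as hysteresis: a check starts flapping once its state-change percentage reaches the high threshold and stops only after it falls to the low threshold. The sketch below is a simplified, unweighted version of that idea; it is not Sensu's exact algorithm, and the helper names and sample histories are made up for the example.

```ruby
# Percentage of consecutive results in the history that changed status.
def state_change_percent(history)
  return 0 if history.size < 2
  changes = history.each_cons(2).count { |previous, current| previous != current }
  (changes * 100.0 / (history.size - 1)).round
end

# Enter flapping at >= high; once flapping, stay there while > low.
def flapping?(history, low:, high:, was_flapping: false)
  percent = state_change_percent(history)
  was_flapping ? percent > low : percent >= high
end

noisy    = [0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0] # 10 changes in 10 transitions
settling = [0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0] # 4 changes in 10 transitions

p state_change_percent(noisy)                                        # 100
p flapping?(noisy, low: 25, high: 50)                                # true
p flapping?(settling, low: 25, high: 50)                             # false: 40 < 50
p flapping?(settling, low: 25, high: 50, was_flapping: true)         # true: 40 > 25
```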

Silence a client or check
$ curl -s -i -X POST \
  -H 'Content-Type: application/json' \
  -d '{"subscription": "load-balancer", "check": "check_haproxy", "expire": 3600 }' \
  http://localhost:4567/silenced
HTTP/1.1 201 Created
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Connection: close
Content-length: 0

Happy monitoring now with fewer alerts

Q&A
● GitHub / Slack: @majormoses
● Email: me@benabrams.it
● Open positions at Doximity on my team:
○ Security focused DevOps Engineer: https://grnh.se/4a4116de1
○ Kafka Focused DevOps Engineer: https://grnh.se/fb3d19641
● Open positions at Doximity on other teams:
○ https://workat.doximity.com/positions
