Alert Fatigue
Avoidance and course correction
Follow along at: https://goo.gl/jh5RB7
Sensu And I:
● Ben Abrams / @majormoses
● Systems Engineer @doximity
● Sensu Experience
○ 2014: Started a pet project to replace Nagios with Sensu
○ 2015: Sensu was production capable and started contributing back to the community
○ 2017: Became a maintainer for various areas across the sensu ecosystem
■ Plugins
■ Chef Cookbooks
■ Slack
■ OSS mentorship to other maintainers in other areas
○ Maintain over 200 repositories for the Sensu community
What is it?
Alert fatigue occurs when one is exposed to a large
number of frequent alarms (alerts) and consequently
becomes desensitized to them.
Or, Simply Put:
The Problem
● We are not computers
● Costly extended outages
● Burnout / Retention
Agnostic tips to reduce or eradicate alert fatigue
● Not Actionable == Not my problem
● If an alert can wait until the morning, hold it until business hours
● Consolidate Alerts
● Ensure alerts come with contextual awareness
● Service Ownership
● Effective On-call scheduling
● Wake me up when it’s over / Snooze
● Monitoring should be reviewed at the end of your on-call handoffs
How Sensu can help
● Token Substitution
● Filters
● Handlers
● Check Hooks
● Proxy/JIT Clients + Round Robin Subscriptions
● Check Dependencies
● Aggregate Checks
● Flap Detection
● Silencing
● Safe Mode
Alert Reduction
With token substitution and sensu filters
Setting Thresholds with Token Substitution
{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90::: --sleep 5",
      "subscribers": ["base"],
      "interval": 30,
      "occurrences": ":::cpu.occurrences|4:::"
    }
  }
}
{
  "client": {
    "name": "i-424242",
    "address": "10.10.10.10",
    "subscriptions": ["base", "etl"],
    "safe_mode": true,
    "cpu": {
      "crit": 100,
      "warn": 95,
      "occurrences": 10
    }
  }
}
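As a rough illustration of how the check and client definitions above combine, the sketch below resolves `:::key|default:::` tokens against a client hash. It is a simplified stand-in and not Sensu's own implementation; the `substitute_tokens` helper name and the second client (`i-000000`) are hypothetical.

```ruby
# Minimal sketch of Sensu-style token substitution (NOT Sensu's own code).
# `substitute_tokens` is a hypothetical helper name used for illustration.
def substitute_tokens(template, client)
  template.gsub(/:::([^:|]+)(?:\|([^:]*))?:::/) do
    path = Regexp.last_match(1)
    default = Regexp.last_match(2)
    # Walk the dotted path (e.g. "cpu.warn") through the nested client hash.
    value = path.split(".").reduce(client) do |node, key|
      node.is_a?(Hash) ? node[key] : nil
    end
    # Fall back to the inline default when the client has no such attribute.
    value = default if value.nil?
    raise ArgumentError, "unmatched token: #{path}" if value.nil?
    value.to_s
  end
end

client = {
  "name" => "i-424242",
  "cpu" => { "crit" => 100, "warn" => 95, "occurrences" => 10 }
}
command = "check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90::: --sleep 5"

# Client with overrides uses its own thresholds.
puts substitute_tokens(command, client)
# Client without a "cpu" attribute (hypothetical) falls back to the defaults.
puts substitute_tokens(command, { "name" => "i-000000" })
```

The same check definition therefore yields different commands per client, which is what lets one check serve hosts with different thresholds.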
Sensu Filters (1.x)
● Runs to determine if a handler should run
● Inclusive and Exclusive Filters
● Allows running anything you can write in Ruby (which means anything)
● Days and Times
● Documentation: https://docs.sensu.io/sensu-core/1.4/reference/filters/
Inclusive Filter: Nine to Five
{
  "filters": {
    "nine_to_fiver_eastern": {
      "negate": false,
      "attributes": {
        "timestamp": "eval: ENV['TZ'] = 'America/New_York'; [1,2,3,4,5].include?(Time.at(value).wday) && Time.at(value).hour.between?(9,17)"
      }
    }
  }
}
Note: "negate" defaults to false.
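To make the filter expression above easier to reason about, here is the same predicate as a plain Ruby method. This is a sketch under one loud assumption: a fixed UTC offset stands in for `America/New_York`, so daylight saving time is ignored. The `business_hours?` name is hypothetical.

```ruby
# Sketch of the "nine to five" predicate as a plain Ruby method.
# Assumption: a fixed UTC offset replaces the named time zone, so DST is
# ignored here; the filter on the slide sets ENV['TZ'] instead.
def business_hours?(timestamp, utc_offset: "-05:00")
  local = Time.at(timestamp).getlocal(utc_offset)
  weekday = (1..5).cover?(local.wday)     # Monday (1) through Friday (5)
  in_hours = local.hour.between?(9, 17)   # inclusive, so 09:00 through 17:59
  weekday && in_hours
end

monday_9am    = Time.utc(2018, 1, 1, 14, 0).to_i  # 09:00 at UTC-5, a Monday
monday_6pm    = Time.utc(2018, 1, 1, 23, 0).to_i  # 18:00 at UTC-5, same Monday
saturday_noon = Time.utc(2018, 1, 6, 17, 0).to_i  # 12:00 at UTC-5, a Saturday

puts business_hours?(monday_9am)
puts business_hours?(monday_6pm)
puts business_hours?(saturday_noon)
```

Because `between?(9, 17)` is inclusive on the hour, events up to 17:59 local time still match; use `between?(9, 16)` if handlers should stop at 17:00.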
Automate Triage and Remediation with
check hooks and handlers
Automate Triage with check hooks
{
  "checks": {
    "ping_four8s": {
      "command": "check-ping.rb -h 8.8.8.8 -T 5",
      "subscribers": ["base"],
      "interval": 5,
      "hooks": {
        "non-zero": {
          "command": "ping -c 1 `route -n | awk '$1 == \"0.0.0.0\" { print $2 }'`"
        }
      }
    }
  }
}
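The hook above is keyed `non-zero`, the most general hook name. The sketch below shows one way hook selection by exit status could work, assuming the precedence order of exact status, then severity name, then `non-zero`. It is an illustration rather than Sensu's client code; `select_hook` and the sample hook commands are hypothetical.

```ruby
# Sketch of choosing a check hook by exit status (NOT Sensu's own code).
# Assumed precedence: exact status ("1".."255"), then severity name,
# then the catch-all "non-zero".
SEVERITY_NAMES = %w[ok warning critical].freeze

def select_hook(hooks, status)
  severity = SEVERITY_NAMES[status] || "unknown"
  candidates = [status.to_s, severity]
  candidates << "non-zero" unless status.zero?
  name = candidates.find { |candidate| hooks.key?(candidate) }
  name && hooks[name]
end

# Hypothetical hook set used only for this example.
hooks = {
  "critical" => { "command" => "echo critical-triage" },
  "non-zero" => { "command" => "echo generic-triage" }
}

p select_hook(hooks, 0)  # no hook runs for an OK result
p select_hook(hooks, 1)  # falls through to "non-zero"
p select_hook(hooks, 2)  # "critical" wins over "non-zero"
```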
Automate Remediation with handler Part 1
{
  "checks": {
    "check_process_foo": {
      "command": "check-process.rb -p foo",
      "subscribers": ["foo_service"],
      "handlers": ["pagerduty", "remediator"],
      "remediation": {
        "foo_process_remediate": {
          "occurrences": ["1-5"],
          "severities": [2]
        }
      }
    }
  }
}
Automate Remediation with handler Part 2
{
  "checks": {
    "foo_process_remediate": {
      "publish": false,
      "command": "sudo -u sensu service foo restart",
      "subscribers": ["foo_service", "client:CLIENT_NAME"],
      "handlers": ["pagerduty"],
      "interval": 10
    }
  }
}
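The `remediation` block in Part 1 says when the unpublished check in Part 2 should be triggered. The sketch below shows one way such a trigger could be evaluated. It assumes only three occurrence forms (an integer, a range like `"1-5"`, and an open-ended `"1+"`) and is not the remediator handler's actual code; both helper names are hypothetical.

```ruby
# Sketch of matching an event against a `remediation` trigger.
# Handles integers, "N", "N-M" and "N+" only; other forms are out of scope.
# `occurrence_match?` and `remediation_due?` are hypothetical helper names.
def occurrence_match?(spec, occurrences)
  case spec.to_s
  when /\A(\d+)\z/
    occurrences == Regexp.last_match(1).to_i
  when /\A(\d+)-(\d+)\z/
    occurrences.between?(Regexp.last_match(1).to_i, Regexp.last_match(2).to_i)
  when /\A(\d+)\+\z/
    occurrences >= Regexp.last_match(1).to_i
  else
    false
  end
end

def remediation_due?(trigger, occurrences:, severity:)
  trigger["severities"].include?(severity) &&
    trigger["occurrences"].any? { |spec| occurrence_match?(spec, occurrences) }
end

trigger = { "occurrences" => ["1-5"], "severities" => [2] }
puts remediation_due?(trigger, occurrences: 3, severity: 2)  # inside the range
puts remediation_due?(trigger, occurrences: 6, severity: 2)  # past the range
puts remediation_due?(trigger, occurrences: 3, severity: 1)  # wrong severity
```

Capping the range at five occurrences means the restart is attempted a bounded number of times, after which only the paging handler remains.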
Consolidating Alerts
With Proxy/JIT clients + Round Robin subscriptions,
Aggregate checks, and check dependencies
Proxy/JIT + Round Robin
{
  "checks": {
    "check_es5_cluster": {
      "command": "check-es-cluster-status.rb -h :::address:::",
      "subscribers": ["roundrobin:es5"],
      "interval": 30,
      "source": ":::es5.cluster.name:::",
      "ttl": 120
    }
  }
}
Check Dependencies
{
  "checks": {
    "check_foo_open_files": {
      "command": "check-open-files.rb -u foo -p foo -w 80 -c 90",
      "subscribers": ["foo_service"],
      "handlers": ["pagerduty"],
      "dependencies": ["client:CLIENT_NAME/check_foo_process"]
    }
  }
}
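The idea behind `dependencies` is that an alert is noise if something it depends on is already known to be broken. A minimal sketch of that decision follows. The event and dependency shapes used here (`"client/check"` or a bare `"check"` on the same client) are simplifying assumptions and not the exact format Sensu accepts; `dependency_failing?` and the `web-01` client are hypothetical.

```ruby
# Sketch of dependency-based suppression: skip alerting when any dependency
# already has an open event. Data shapes are assumptions for illustration.
def dependency_failing?(event, open_events)
  event.fetch("dependencies", []).any? do |dep|
    # "client/check" targets another client; a bare name means the same client.
    client, check = dep.include?("/") ? dep.split("/", 2) : [event["client"], dep]
    open_events.any? { |e| e["client"] == client && e["check"] == check }
  end
end

open_events = [{ "client" => "web-01", "check" => "check_foo_process" }]
event = {
  "client" => "web-01",
  "check" => "check_foo_open_files",
  "dependencies" => ["web-01/check_foo_process"]
}

# The process check is already failing, so the open-files alert is suppressed.
puts dependency_failing?(event, open_events)
puts dependency_failing?(event, [])
```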
Sensu Aggregate checks
{
  "checks": {
    "sensu_rabbitmq_amqp_alive": {
      "command": "check-rabbitmq-amqp-alive.rb",
      "subscribers": ["sensu-rabbitmq"],
      "interval": 60,
      "ttl": 180,
      "aggregates": ["sensu_rabbitmq"],
      "handle": false
    }
  }
}
{
  "checks": {
    "sensu_rabbitmq_amqp_alive_aggregate": {
      "command": "check-aggregate.rb --check sensu_rabbitmq_amqp_alive --critical_count 2 --age 180",
      "aggregate": "sensu_rabbitmq",
      "source": "sensu-rabbitmq",
      "hooks": {
        "non-zero": {
          "command": "curl -s -S localhost:4567/aggregates/sensu_rabbitmq/results/critical | jq .[].check --raw-output"
        }
      }
    }
  }
}
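The two definitions above split the work: members report into the aggregate without alerting, and a single check alerts on the group. The sketch below illustrates the group-level decision only. It assumes the threshold means "at least N members are critical" and is not the logic of `check-aggregate.rb` itself; `aggregate_status` and the `mq-*` client names are hypothetical.

```ruby
# Sketch of the aggregate idea: alert on the group, not on each member.
# Statuses follow the usual exit codes (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN).
# Assumption: the group is critical once `critical_count` members are critical.
def aggregate_status(results, critical_count:)
  failing = results.count { |result| result["status"] == 2 }
  failing >= critical_count ? 2 : 0
end

results = [
  { "client" => "mq-01", "status" => 0 },
  { "client" => "mq-02", "status" => 2 },
  { "client" => "mq-03", "status" => 0 }
]

# One failing member out of three stays below the threshold of two.
puts aggregate_status(results, critical_count: 2)

# A second failure crosses the threshold and produces a single alert.
results[2]["status"] = 2
puts aggregate_status(results, critical_count: 2)
```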
Flap Detection
{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w 80 -c 90 --sleep 5",
      "subscribers": ["base"],
      "interval": 30,
      "low_flap_threshold": ":::cpu.low_flap_threshold|25:::",
      "high_flap_threshold": ":::cpu.high_flap_threshold|50:::"
    }
  }
}
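The two thresholds above act as a hysteresis band on how often a check changes state. The sketch below shows the general idea with an unweighted state-change percentage. This is a simplification chosen for clarity, not Sensu's exact calculation, and both helper names are hypothetical.

```ruby
# Sketch of flap detection with low/high thresholds (simplified, unweighted).
# `history` is a list of recent exit statuses, oldest first.
def state_change_percent(history)
  return 0.0 if history.size < 2
  changes = history.each_cons(2).count { |previous, current| previous != current }
  changes * 100.0 / (history.size - 1)
end

# Start flapping at or above `high`; stop only once at or below `low`.
def flapping?(history, low:, high:, was_flapping: false)
  percent = state_change_percent(history)
  was_flapping ? percent > low : percent >= high
end

steady   = [0, 0, 0, 0, 0]
unstable = [0, 2, 0, 2, 0]

puts state_change_percent(unstable)
puts flapping?(steady, low: 25, high: 50)
puts flapping?(unstable, low: 25, high: 50)
```

The gap between the thresholds keeps a check from bouncing in and out of the flapping state when its change rate hovers near a single cutoff.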
Silence a client or check
$ curl -s -i -X POST \
  -H 'Content-Type: application/json' \
  -d '{"subscription": "load-balancer", "check": "check_haproxy", "expire": 3600 }' \
  http://localhost:4567/silenced
HTTP/1.1 201 Created
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Connection: close
Content-length: 0
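For scripting the same call, for example from a deploy or maintenance task, the request can be built in Ruby with the standard library. This sketch only constructs the request and prints its body; the endpoint and payload mirror the curl example, while the `build_silence_request` helper is hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Builds (but does not send) the POST request for a silence entry.
# Assumption: the Sensu 1.x API is reachable at the given base URL.
def build_silence_request(base_url, subscription:, check:, expire:)
  uri = URI.join(base_url, "/silenced")
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  payload = { "subscription" => subscription, "check" => check, "expire" => expire }
  request.body = JSON.generate(payload)
  [uri, request]
end

uri, request = build_silence_request(
  "http://localhost:4567",
  subscription: "load-balancer", check: "check_haproxy", expire: 3600
)
puts request.body

# To actually create the entry, send the request:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Setting `expire` keeps the silence from outliving the maintenance window if nobody remembers to clear it.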
Happy monitoring now with fewer alerts
Q&A
● Github / Slack: @majormoses
● Email: me@benabrams.it
● Open positions at doximity on my team:
○ Security focused DevOps Engineer: https://grnh.se/4a4116de1
○ Kafka Focused DevOps Engineer: https://grnh.se/fb3d19641
● Open Positions at doximity on other teams:
○ https://workat.doximity.com/positions

Editor's Notes

  1. The boy who cried wolf
  2. We have an inferior queue buffer to RabbitMQ
  3. Before talking about Sensu-specific solutions, let’s talk in general terms. If an alert can wait until the morning, hold it until business hours: if I can’t fix it now, why are you telling me now? Wake me up when it’s over: short-term relief from temporarily snoozing alerts for a predetermined time.
  4. These are sensu features that either directly or indirectly can help manage alert fatigue
  5. You can set per client thresholds but use the same check definition.
  6. Filters in 2.x work very differently; you can write a Ruby extension as a gRPC service.
  7. Inclusive filtering: by setting the filter definition attribute "negate": false, only events that match the defined filter attributes are handled. This will run any handlers if it’s a weekday and the hour is between 9 and 17 Eastern Time (9:00 AM through 5:59 PM).
  8. Exclusive filtering: by setting the filter definition attribute "negate": true, events are only handled if they do not match the defined filter attributes. This runs any handlers where the occurrences count is less than the check’s configured occurrences OR 60.
  9. Automate all the things! While you can use check hooks for remediation, they are best used for triage, and handlers are best used for remediation because they have extra context such as occurrences.
  10. Valid hook names include (in order of precedence): “1”-“255”, “ok”, “warning”, “critical”, “unknown”, and “non-zero”. This appends the result of a single ping to the default gateway in the event of being unable to reach the public internet (in this case, Google’s load-balanced public DNS server). Other common use cases include listing running processes when CPU, memory, or load is high, or showing the top x directories when a disk is nearly full.
  11. `remediator` handler occurrences can be given as ranges or lists such as `[“1-5”]`, `[“1+”]`, `[“1,3,5”]`, etc. You can technically have multiple levels of remediation. See: https://github.com/sensu-plugins/sensu-plugins-sensu/blob/3.0.0/bin/handler-sensu.rb#L32-L64 for more advanced usage. You can set severities for `0` OK (not sure of the use case), `1` WARN, `2` CRIT, or `3` UNKNOWN.
  12. Sudoers.d needs an entry allowing you to run it. Restrict running to only the exact client rather than running it on all nodes with the subscription. An unpublished check prevents Sensu from scheduling it automatically. This allows it to be triggered by other events (handlers) and can be used for various scenarios such as auto-remediation, automatically updating a maintenance/status page, etc. Unpublished also prevents standalone checks from self-scheduling.
  13. In the subscription, the `roundrobin:` prefix attempts a round robin by letting the first eligible node retrieve the request from RabbitMQ. In the source you define the client name you want (such as the whole cluster vs. per-machine checks); this also uses token substitution to set the cluster name.
  14. Think of aggregate checks like a check against a load balancer: not every node needs to be functional, just some number or percentage of them. `handle: false` is not required; this is just one place to demonstrate a way to reduce alerts that are not actionable.
  15. This is useful when performing maintenance
  16. While its main purpose is security (preventing malicious code from being executed via Sensu), it can also be used to prevent, say, monitoring from alerting on checks that might still be in the process of being installed during initial node bootstrap.
  17. Some handlers such as ones that provide single pane of glass should set these to true (which overrides the default)