Stop using Nagios (so it can die peacefully)

Andy Sykes, DevOps Engineer at Forward3D
Please stop using Nagios
(so it can die peacefully)
Andy Sykes
Devops @ Forward3D
@supersheep
andy@forward3d.com
Do you use Nagios?
Tell me why you picked it.
Go on.
If you don't, why don't you?
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
So why did you pick Nagios?
Because it's the "safe", default choice.
Because we've grown accustomed to the things
that really, really suck about it.
It's a little like we've all got Stockholm
Syndrome.
What Nagios gets right
Incredibly simple plugin model.
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.
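
The plugin model is simple because a check is just a program that prints one line and exits with a status code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. Here is a minimal sketch of that convention. The load-average check, the thresholds, and the function names are illustrative assumptions, not anything from a real plugin.

```python
import os

# Nagios plugin convention: the exit status carries the check state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_load(load, warn=4.0, crit=8.0):
    """Map a 1-minute load average to (status, one-line message).

    The thresholds are illustrative defaults, not recommendations.
    """
    if load is None:
        return UNKNOWN, "LOAD UNKNOWN - could not read load average"
    if load >= crit:
        status = CRITICAL
    elif load >= warn:
        status = WARNING
    else:
        status = OK
    # Text after the "|" is performance data ("perfdata").
    return status, f"LOAD {STATE_NAMES[status]} - load1={load:.2f} | load1={load:.2f}"


def main():
    """Read the local load average, print the status line, return the exit status."""
    try:
        load = os.getloadavg()[0]
    except (AttributeError, OSError):
        # getloadavg() is missing or fails on some platforms.
        load = None
    status, message = evaluate_load(load)
    print(message)
    return status
```

To use this as an actual plugin, a script would end with `sys.exit(main())` so that the monitoring system sees the status as the process exit code.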
Nagios, I hate thee; let me count thy ways
Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).
* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.
Expanding on the config problem
Configuration has to be in two places:
Server has to know what checks to invoke
via NRPE.
Client has to know what checks it will be
asked to invoke with NRPE.
THIS IS MADNESS.
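
To make the duplication concrete, here is a small sketch that compares what a server asks for with what a client is able to run. Both config fragments are hypothetical, and the `check_nrpe!name` form assumes the common site convention of a `check_nrpe` command that takes the remote command name as its argument.

```python
import re

# Hypothetical fragments, for illustration only.
SERVER_SERVICES = """
define service {
    host_name            web01
    service_description  Load
    check_command        check_nrpe!check_load
}
define service {
    host_name            web01
    service_description  Disk
    check_command        check_nrpe!check_disk
}
"""

CLIENT_NRPE_CFG = """
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
"""


def server_side_checks(text):
    """Names the server will ask the client to run via check_nrpe."""
    return set(re.findall(r"check_command\s+check_nrpe!(\w+)", text))


def client_side_checks(text):
    """Names the client's NRPE daemon has a command defined for."""
    return set(re.findall(r"^command\[(\w+)\]=", text, flags=re.MULTILINE))


def drift(server_text, client_text):
    """Report checks that exist on only one side of the server/client pair."""
    server = server_side_checks(server_text)
    client = client_side_checks(client_text)
    return {
        "requested_but_undefined": sorted(server - client),
        "defined_but_unused": sorted(client - server),
    }


print(drift(SERVER_SERVICES, CLIENT_NRPE_CFG))
```

Every mismatch in that report is a check that silently does nothing or fails at run time, which is the price of keeping two copies in step by hand.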
Scaling, or lack of it
No such thing as a Nagios cluster.
More checks = more work = longer before you
know something's happened!
Every check increases your master's load
average.
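
A back-of-the-envelope model of that claim, assuming a single master, checks that all take the same time, and a fixed number of checks running in parallel. The figures in the usage lines are made up, and real schedulers are messier than this.

```python
import math


def seconds_per_pass(num_checks, seconds_per_check, parallel_workers):
    """Rough time for one master to run every check once.

    Assumes uniform check duration and workers that are always busy.
    """
    batches = math.ceil(num_checks / parallel_workers)
    return batches * seconds_per_check


# Doubling the number of checks doubles the worst-case wait before a
# failure is noticed, unless the single master gets more capacity.
print(seconds_per_pass(5000, 2, 20))
print(seconds_per_pass(10000, 2, 20))
```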
Okay, yes, there’s mod_gearman
But it’s a hack at best.
No redundancy for the machine that distributes
the checks, so it’s not a real cluster.
API poverty
Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra
Master has to be told about a client before
things can happen.
The bandaids we make
Interface:
Opsview, Icinga, Shinken, others

API:
Parsing status.dat, NDO

Client wire format:
Opsview's NRPE, NRD

Config management:
Puppet types, Chef cookbooks
None of it is good enough.
The take-home point:

"If we keep using Nagios,
we'll never get anything
better."
(Writing monitoring systems is hard, and needs community involvement and
real world adoption. Nagios steals mindshare by being just good enough. It's
the monitoring system we deserve, but not the one we need right now.)
So, smart guy. What do we do?

Steal all the things that are great about Nagios.
(existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome.
(scalable, API-ready, config management friendly, modern!)
THIS DOESN’T MEAN WRITING
YOUR OWN MONITORING SYSTEM
Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?
My suggestion:

Like OMD, but better.
Wrap up a series of “best in breed” tools to
make one kickass monitoring tool.
What we need:
Core
Agent
Graphing
Anomaly detection
Alerting
UI
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
There’s something we can use for this.
Sensu!
Sensu is often described as the “monitoring router”.
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [
        "production"
      ],
      "interval": 60,
      "handlers": [
        "pagerduty",
        "irc"
      ]
    }
  }
}

This check definition lives only on the server.
Client requires no registration for the server
to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to
RabbitMQ
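
The reason no registration is needed is the subscription model: a check names its subscribers, and any client that declares a matching subscription picks up the work. Here is a toy, in-memory sketch of that matching. The client names and their subscriptions are hypothetical, and this is not Sensu's own code.

```python
import json

CHECK_CONFIG = json.loads("""
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": ["production"],
      "interval": 60,
      "handlers": ["pagerduty", "irc"]
    }
  }
}
""")

# Hypothetical clients and the subscriptions each one declares locally.
CLIENT_SUBSCRIPTIONS = {
    "web01": ["production", "webserver"],
    "db01": ["production"],
    "staging01": ["staging"],
}


def clients_for_check(check, client_subscriptions):
    """Return the clients whose subscriptions overlap the check's subscribers."""
    wanted = set(check["subscribers"])
    return sorted(
        name
        for name, subscriptions in client_subscriptions.items()
        if wanted & set(subscriptions)
    )


print(clients_for_check(CHECK_CONFIG["checks"]["chef_client"], CLIENT_SUBSCRIPTIONS))
```

Adding a new production machine means giving it the "production" subscription in its own config; the server-side check definition does not change.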
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing
Anomaly detection
Alerting
UI
Graphing is easy now.
If you’re not using Graphite, you should be.
Sensu “metric” checks can pump data to it.
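
Getting a number into Graphite is mostly a matter of formatting: its plaintext protocol is one line per data point, `path value timestamp`, sent to the Carbon listener (port 2003 by default). A minimal sketch follows; the hostname and metric path are placeholders, and how a metric check's output actually reaches Graphite depends on the handler you configure.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="graphite.example.com", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener.

    The host is a placeholder; nothing is sent unless this is called.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


print(graphite_line("servers.web01.load1", 0.42, 1700000000), end="")
```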
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI
Anomaly detection is hard.
We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
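
For a sense of what "checking" metric data can mean at its simplest, here is a three-sigma test: flag a point that sits far from the recent mean. This is a deliberately naive sketch and not the algorithm used by Skyline, Oculus, or Grok; the sample data is invented.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, sigmas=3.0):
    """Return True if `latest` is more than `sigmas` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # A perfectly flat series: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) > sigmas * sd


recent = [10, 11, 9, 10, 12, 10, 11, 9]
print(is_anomalous(recent, 11))
print(is_anomalous(recent, 25))
```

Real metrics have trends, seasonality, and noise that break this kind of test quickly, which is a large part of why this box is still marked "???".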
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI
Alerting is tricky, but mostly solved.
Flapjack! - flapjack.io
Alerting is not the concern of your monitoring tool.
Push all alerts at Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
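
The two steps above (define gateways, then link checks to them) can be sketched as a tiny router. This illustrates the idea only. It is not Flapjack's API, and the class and method names are made up.

```python
from collections import defaultdict


class AlertRouter:
    """Toy alert router: checks are linked to gateways, alerts fan out to them."""

    def __init__(self):
        self.gateways = {}              # gateway name -> callable(message)
        self.routes = defaultdict(set)  # check name -> set of gateway names

    def define_gateway(self, name, send):
        """Register a delivery function under a gateway name."""
        self.gateways[name] = send

    def link(self, check, gateway):
        """Create a relationship between a check and an existing gateway."""
        if gateway not in self.gateways:
            raise KeyError(f"unknown gateway: {gateway}")
        self.routes[check].add(gateway)

    def alert(self, check, state, summary):
        """Deliver one alert to every linked gateway; return the gateways used."""
        message = f"{check} is {state}: {summary}"
        delivered = []
        for gateway in sorted(self.routes.get(check, ())):
            self.gateways[gateway](message)
            delivered.append(gateway)
        return delivered


router = AlertRouter()
router.define_gateway("pagerduty", lambda message: print("PAGE:", message))
router.define_gateway("email", lambda message: print("MAIL:", message))
router.link("chef_client", "pagerduty")
router.link("chef_client", "email")
router.alert("chef_client", "CRITICAL", "no successful run recently")
```

The monitoring core only has to emit "check X is in state Y"; who gets woken up is decided entirely on the alerting side.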
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
User interfaces are hard.
What do we need from it?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list
The Sensu Dashboard sucks.
No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.
This won’t do.
I’m going to have to write a UI. Sigh.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
- ???
In Summary

Nagios sucks.
There are good tools for each concern
of monitoring.
If we can package them together, we
can have something that rocks.
Thank You.

Contact
andy@forward3d.com (@supersheep)