Please stop using Nagios
(so it can die peacefully)
Andy Sykes
Devops @ Forward3D
@supersheep
andy@forward3d.com
Do you use Nagios?
Tell me why you picked it.
Go on.
If you don't, why don't you?
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
So why did you pick Nagios?
Because it's the "safe" default choice.
Because we've grown accustomed to the things
that really, really suck about it.
It's a little like we've all got Stockholm
Syndrome.
What Nagios gets right
Incredibly simple plugin model (see the sketch below).
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.
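"Simple" means: a plugin is any executable that prints one line of status and exits 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. A minimal sketch in Ruby (hypothetical thresholds, reads Linux's /proc/loadavg):

#!/usr/bin/env ruby
# Hypothetical Nagios-style plugin: alert on the 1-minute load average.
# The entire plugin contract: print one line, exit 0/1/2/3.
load1 = File.read("/proc/loadavg").split.first.to_f
if load1 > 8.0
  puts "CRITICAL - 1min load average is #{load1}"
  exit 2
elsif load1 > 4.0
  puts "WARNING - 1min load average is #{load1}"
  exit 1
else
  puts "OK - 1min load average is #{load1}"
  exit 0
end

That contract is why the "billions of plugins" exist: anything that can print and exit can be a check.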
Nagios, I hate thee; let me count thy ways
Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).
* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.
Expansion about config
Configuration has to be in two places:
Server has to know what checks to invoke
via NRPE.
Client has to know what checks it will be
asked to run via NRPE.
THIS IS MADNESS.
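To see the duplication concretely (host and command names hypothetical), the same check has to appear in the server's Nagios object config:

define service {
    host_name             web01
    service_description   Load
    check_command         check_nrpe!check_load
}

...and again in the client's nrpe.cfg, mapping that name to the actual plugin invocation:

command[check_load]=/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4

Change one side without the other and the check falls over.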
Scaling, or lack of it
No such thing as a Nagios cluster.
More checks = more work = longer before you
know something's happened!
Every check increases your master's load
average.
Okay, yes, there’s mod_gearman
But it’s a hack at best.
No redundancy for the machine that distributes
the checks, so it’s not a real cluster.
API poverty
Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra
Master has to be told about a client before
things can happen.
The bandaids we make
Interface:
Opsview, Icinga, Shinken, others

API:
Parsing status.dat, NDO

Client wire format:
Opsview's NRPE, NRD

Config management:
Puppet types, Chef cookbooks
None of it is good enough.
The take-home point:

"If we keep using Nagios,
we'll never get anything
better."
(Writing monitoring systems is hard, and needs community involvement and
real world adoption. Nagios steals mindshare by being just good enough. It's
the monitoring system we deserve, but not the one we need right now.)
So, smart guy. What do we do?

Steal all the things that are great about Nagios.
(existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome.
(scalable, API-ready, config management friendly, modern!)
THIS DOESN’T MEAN WRITING
YOUR OWN MONITORING SYSTEM
Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?
My suggestion:

Like OMD, but better.
Wrap up a series of “best in breed” tools to
make one kickass monitoring tool.
What we need:
Core
Agent
Graphing
Anomaly detection
Alerting
UI
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
There’s something we can use for this.
Sensu!
Sensu is often described as the “monitoring router”.
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [
        "production"
      ],
      "interval": 60,
      "handlers": [
        "pagerduty",
        "irc"
      ]
    }
  }
}

Only on the server
Client requires no registration for the server
to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to
RabbitMQ
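All the client carries is a local config saying who it is, what it subscribes to, and where RabbitMQ lives. A minimal sketch (names and credentials hypothetical):

{
  "client": {
    "name": "web01",
    "address": "10.0.0.5",
    "subscriptions": [ "production" ]
  },
  "rabbitmq": {
    "host": "rabbitmq.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "secret"
  }
}

Because it subscribes to "production", it picks up every check published to that subscription - including the chef_client check above - without the server ever being told this client exists.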
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing
Anomaly detection
Alerting
UI
Graphing is easy now.
If you’re not using Graphite, you should be.
Sensu “metric” checks can pump data to it.
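A "metric" check is just a plugin whose output is Graphite's plaintext protocol - "dotted.metric.path value unix-timestamp", one line per measurement - relayed to Carbon by a Graphite handler on the server. A sketch (handler wiring assumed, reads Linux's /proc/loadavg):

#!/usr/bin/env ruby
# Hypothetical Sensu metric check: emit load averages in Graphite's
# plaintext format, one metric per line.
hostname = `hostname -s`.strip
load1, load5, load15 = File.read("/proc/loadavg").split.first(3)
now = Time.now.to_i
puts "#{hostname}.load.1min #{load1} #{now}"
puts "#{hostname}.load.5min #{load5} #{now}"
puts "#{hostname}.load.15min #{load15} #{now}"
exit 0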
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI
Anomaly detection is hard.
We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI
Alerting is tricky, but mostly solved.
Flapjack! - flapjack.io
Alerting is not the concern of your monitoring tool.
Push all alerts to Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
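Under the hood, "pushing an alert to Flapjack" means dropping a JSON event onto the Redis queue it watches. A rough sketch of what a handler does (field names as I understand Flapjack's v1 event format; entity and summary hypothetical):

require "json"
require "redis"

# Push one check result onto Flapjack's "events" queue in Redis.
event = {
  "entity"  => "web01",
  "check"   => "chef_client",
  "type"    => "service",
  "state"   => "critical",
  "summary" => "chef-client has not run for 2 hours",
  "time"    => Time.now.to_i
}
Redis.new(:host => "localhost").lpush("events", event.to_json)

Flapjack then works out who gets paged and who gets an email from the check-to-gateway relationships you've defined.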
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
User interfaces are hard.
What do we need from one?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list
The Sensu Dashboard sucks.
No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.
This won’t do.
I’m going to have to write a UI. Sigh.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
- ???
In Summary

Nagios sucks.
There are good tools for each concern
of monitoring.
If we can package them together, we
can have something that rocks.
Thank You.

Contact
andy@forward3d.com (@supersheep)
