Stop using Nagios (so it can die peacefully)
 


You shouldn't use Nagios any more - it sucks. Let's build a new, better, more awesome monitoring system.

Comments

  • We've been using Nagios for several years in our consulting company, and it is true that it works, but the presentation gets it right: there are too many drawbacks. The bright side for us was that it gave us the idea to build our own monitoring and alerting service, initially for cloud servers but eventually for bare metal too. We wanted something easy to set up, but we weren't keen on installing closed-source agents on our servers with no way of knowing how secure they are. We also wanted a touch-friendly interface, because we were a small company and we'd rather not take shifts in front of our laptops just in case Apache needed a restart. So we created https://mist.io

    You can also host it internally, as the interface and management code are open source (https://github.com/mistio/mist.io), and use the service only for the monitoring and alerting part. Check it out if you're looking for alternatives.
  • @superdupersheep please let us know when you figure this out, or where you are working on this (e.g. Github repo)
  • Nagios does not get SSL right: it's a trivial MITM, since it's not using certs but a shared 'secret' that's in the source.
  • Sure, you can go and piece together the things in the slides, or you can just sign up for Stackdriver and go get a beer.
    http://www.stackdriver.com/
  • Hi Andy,

    I kind of share your pain... Also have a look at Observium:
    * http://www.observium.com/wiki/Main_Page
    * http://demo.observium.org/ (enter with demo/demo)
    There are some really good parts of the UI that you could take some inspiration from...

    Stop using Nagios (so it can die peacefully): Presentation Transcript

    • Please stop using Nagios (so it can die peacefully) Andy Sykes Devops @ Forward3D @supersheep andy@forward3d.com
    • Do you use Nagios? Tell me why you picked it. Go on. If you don't, why don't you?
    • Reasons for choosing Nagios
        • stupid simple plugin system
        • billions* of existing plugins
        • years of development behind it
        • you can hire people who know it
      "Everybody uses it."**
      * may not actually be true
      ** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.
    • So why did you pick Nagios? Because it's the "safe", default choice. Because we've grown accustomed to the things that really, really suck about it. It's a little like we've all got Stockholm Syndrome.
    • What Nagios gets right Incredibly simple plugin model. Fairly secure (SSL between agents + master). Very simple conceptually. Reliable.
    • Nagios, I hate thee; let me count thy ways
        • Doesn't scale. At all.
        • World's second most horrible configuration*.
        • Horrendous interface**.
        • Assumes a static infrastructure.
        • No decent programmatic interfaces***.
        • Throws away perfdata.
        • Stupid wire format for clients (NRPE/NSCA).
      * the world's most horrible configuration is, obviously, Sendmail.
      ** even the paid Nagios XI one is ugly as sin and unusable.
      *** if I catch you parsing status.dat, I will beat your ass.
    • Expansion about config Configuration has to be in two places: Server has to know what checks to invoke via NRPE. Client has to know what checks it will be asked to invoke with NRPE. THIS IS MADNESS.
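To make the "two places" problem concrete, here is a minimal sketch of the kind of drift check people end up writing. It assumes the usual nrpe.cfg syntax of `command[name]=command line` on the client; the example plugin paths, thresholds, check names and both helper functions are hypothetical and not from the talk.

```python
import re

# Matches nrpe.cfg lines such as:
#   command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
NRPE_COMMAND = re.compile(r"^\s*command\[(?P<name>[^\]]+)\]\s*=\s*(?P<cmdline>.+)$")


def client_commands(nrpe_cfg_text):
    """Return the set of command names defined in a client's nrpe.cfg text."""
    names = set()
    for line in nrpe_cfg_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = NRPE_COMMAND.match(line)
        if match:
            names.add(match.group("name"))
    return names


def missing_on_client(server_checks, nrpe_cfg_text):
    """Return the commands the server will request that the client never defines."""
    return sorted(set(server_checks) - client_commands(nrpe_cfg_text))


# Hypothetical example data: what the client defines vs. what the server asks for.
EXAMPLE_NRPE_CFG = """
# nrpe.cfg on the client
command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
"""
EXAMPLE_SERVER_CHECKS = ["check_load", "check_disk", "check_chef_client"]
```

Calling `missing_on_client(EXAMPLE_SERVER_CHECKS, EXAMPLE_NRPE_CFG)` reports `check_chef_client`: the server would ask for a check the client has never heard of, which is exactly the failure mode that keeping configuration in two places invites.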
    • Scaling, or lack of it No such thing as a Nagios cluster. More checks = more work = longer before you know something's happened! Every check increases your master's load average.
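A back-of-the-envelope model shows why more checks means slower detection on a single master. This is only a sketch: it assumes every check takes the same time and that the master runs a fixed number of checks in parallel, neither of which is a claim about how Nagios actually schedules. The function name and the numbers are made up for illustration.

```python
import math


def sweep_seconds(num_checks, seconds_per_check, parallel_workers):
    """Time for one master to run every check once, in a simple batch model.

    Assumption: checks run in batches of `parallel_workers`, and every check
    takes exactly `seconds_per_check`.
    """
    if parallel_workers < 1:
        raise ValueError("parallel_workers must be at least 1")
    batches = math.ceil(num_checks / parallel_workers)
    return batches * seconds_per_check
```

Under these assumptions, 1,000 two-second checks with 50 running at a time take 40 seconds per sweep, and 10,000 checks take 400 seconds. With one master, the only ways to shorten the worst-case time before a failure is noticed are fewer checks or a bigger box.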
    • Okay, yes, there’s mod_gearman But it’s a hack at best. No redundancy for the machine that distributes the checks, so it’s not a real cluster.
    • API poverty
        • Can't easily integrate with other systems.
        • Can't easily write custom dashboards.
        • Can't get information out again!
    • Assumes a static infra
        • Master has to be told about a client before things can happen.
    • The bandaids we make
        • Interface: Opsview, Icinga, Shinken, others
        • API: Parsing status.dat, NDO
        • Client wire format: Opsview's NRPE, NRD
        • Config management: Puppet types, Chef cookbooks
      None of it is good enough.
    • The take-home point: "If we keep using Nagios, we'll never get anything better." (Writing monitoring systems is hard, and needs community involvement and real world adoption. Nagios steals mindshare by being just good enough. It's the monitoring system we deserve, but not the one we need right now.)
    • So, smart guy. What do we do? Steal all the things that are great about Nagios. (existing plugin investment, simplicity, security, reliability) Strap them to something more awesome. (scalable, API-ready, config management friendly, modern!)
    • THIS DOESN’T MEAN WRITING YOUR OWN MONITORING SYSTEM
    • Points for thought: ●  What else are people using? ●  Should we greenfield or lift existing tools? ●  What tools could we go with?
    • My suggestion: Like OMD, but better. Wrap up a series of “best in breed” tools to make one kickass monitoring tool.
    • What we need: Core Agent Graphing Anomaly detection Alerting UI
    • Core: Holds configuration about hosts / services Distributed across X masters Check execution (poke) Results queue (poke response)
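The "poke" and "poke response" split can be sketched with two kinds of queue: one inbox per subscribed client for check requests, and one shared queue for results. This is a generic in-process illustration, not the implementation of any particular tool; `Transport`, `run_client`, `demo` and the client names are all hypothetical.

```python
import queue


class Transport:
    """In-process stand-in for a message bus (hypothetical, for illustration)."""

    def __init__(self):
        self.inboxes = {}             # subscription name -> list of client inboxes
        self.results = queue.Queue()  # shared results queue ("poke response")

    def subscribe(self, subscription):
        """Register a new client inbox under a subscription and return it."""
        inbox = queue.Queue()
        self.inboxes.setdefault(subscription, []).append(inbox)
        return inbox

    def publish(self, subscription, check_name):
        """Send a check request ("poke") to every client on the subscription."""
        for inbox in self.inboxes.get(subscription, []):
            inbox.put(check_name)


def run_client(name, inbox, transport, run_check):
    """Drain pending check requests and push each result onto the results queue."""
    while True:
        try:
            check_name = inbox.get_nowait()
        except queue.Empty:
            return
        status = run_check(check_name)
        transport.results.put({"client": name, "check": check_name, "status": status})


def demo():
    """Two clients on one subscription both answer a single published check."""
    transport = Transport()
    clients = {name: transport.subscribe("production") for name in ("web1", "web2")}
    transport.publish("production", "chef_client")
    for name, inbox in clients.items():
        run_client(name, inbox, transport, run_check=lambda check: 0)
    results = []
    while not transport.results.empty():
        results.append(transport.results.get())
    return results
```

The point of the design is that the core only publishes to a subscription name and reads from the results queue. It never needs a list of hosts, so adding a master or a client does not require touching the other side.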
    • There’s something we can use for this. Sensu! Sensu is often described as the “monitoring router”.
    • {
        "checks": {
          "chef_client": {
            "command": "check-chef-client.rb",
            "subscribers": [ "production" ],
            "interval": 60,
            "handlers": [ "pagerduty", "irc" ]
          }
        }
      }
      Only on the server
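Because the check definition above is plain JSON, it is easy to read programmatically. The sketch below loads the same definition and groups check names by subscription; the helper `checks_by_subscription` is hypothetical and is not part of Sensu.

```python
import json

# The check definition from the slide, reproduced as a string.
CHECKS_JSON = """
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [ "production" ],
      "interval": 60,
      "handlers": [ "pagerduty", "irc" ]
    }
  }
}
"""


def checks_by_subscription(config_text):
    """Map each subscription name to the list of checks published to it."""
    config = json.loads(config_text)
    grouped = {}
    for name, check in config.get("checks", {}).items():
        for subscription in check.get("subscribers", []):
            grouped.setdefault(subscription, []).append(name)
    return grouped
```

Here `checks_by_subscription(CHECKS_JSON)` maps `production` to the single check `chef_client`. Compare this with scraping status.dat: the configuration is already a data structure, so tooling and config management can generate or inspect it directly.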
    • Client requires no registration for the server to know about it Uses Nagios status return codes Doesn’t talk to the server - talks to RabbitMQ
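The Nagios status convention that makes existing plugins reusable is small: a plugin prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of a disk check in that style follows; the thresholds, function names and message wording are assumptions chosen for illustration.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Turn a disk usage percentage into (exit_code, status_line)."""
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL - disk {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"WARNING - disk {used_percent:.1f}% used"
    return OK, f"OK - disk {used_percent:.1f}% used"


def check_disk(path="/"):
    """Measure real disk usage for `path`; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"
    return evaluate_disk(100.0 * usage.used / usage.total)


# A real plugin script would finish with something like:
#   code, line = check_disk(); print(line); sys.exit(code)
```

Any monitoring core that understands these four codes can run the existing plugin collection unchanged, which is why keeping the convention matters.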
    • What we need: Core - Sensu-server Agent - Sensu-client Graphing Anomaly detection Alerting UI
    • Graphing is easy now. If you’re not using Graphite, you should be. Sensu “metric” checks can pump data to it.
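Part of why graphing is easy is Graphite's plaintext protocol: one `path value timestamp` line per data point, sent to Carbon, which listens on TCP port 2003 by default. The sketch below formats and sends such lines; the metric path, the host and both function names are placeholders, not anything from the talk.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="localhost", port=2003, timeout=5.0):
    """Send preformatted lines to a Carbon plaintext listener.

    Assumes a Carbon daemon is reachable at host:port; raises OSError otherwise.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

For example, `graphite_line("servers.web1.load.one", 0.42, 1392755031)` produces the line `servers.web1.load.one 0.42 1392755031`. A metric check only has to print lines in this shape for a handler to forward them.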
    • What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection Alerting UI
    • Anomaly detection is hard. We’ve got all this metric data, but how do we check it? - Skyline/Oculus (Etsy) - Grok (very early days) - ???
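To give a sense of what "checking" metric data can mean, here is one of the simplest possible detectors: flag a point that sits more than three standard deviations from the recent mean. This is a textbook heuristic used for illustration only. It is not a description of how Skyline, Oculus or Grok work, and the function name and sample data are made up.

```python
import statistics


def is_anomalous(history, latest, sigmas=3.0):
    """Return True if `latest` is more than `sigmas` standard deviations from the mean.

    Assumption: `history` is a short window of recent, roughly stationary values.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > sigmas * stdev


# Hypothetical recent values of some metric.
EXAMPLE_HISTORY = [10, 11, 9, 10, 12, 10, 11, 9]
```

With `EXAMPLE_HISTORY`, a new value of 11 is not flagged and a new value of 30 is. Real metrics have trends, seasonality and bursts that defeat a rule this naive, which is a large part of why the slide calls the problem hard.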
    • What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting UI
    • Alerting is tricky, but mostly solved. Flapjack! - flapjack.io
      Alerting is not the concern of your monitoring tool. Push all alerts at Flapjack:
        - define gateways (PagerDuty, email)
        - create relationships between checks and gateways
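The idea of "relationships between checks and gateways" can be sketched as a routing table that is kept apart from the checks themselves. This is a hypothetical model for illustration, not Flapjack's API or data model; the check names, gateway names and `route_alert` are all assumptions.

```python
# Hypothetical routing table: check name -> gateways that should be notified.
ROUTES = {
    "chef_client": ["pagerduty", "irc"],
    "disk_usage": ["email"],
}
DEFAULT_GATEWAYS = ["email"]


def route_alert(event, routes=ROUTES, default=DEFAULT_GATEWAYS):
    """Return (gateway, message) pairs for a failing check; nothing for an OK one."""
    if event["status"] == 0:
        return []
    message = f"{event['client']}/{event['check']} is in state {event['status']}"
    return [(gateway, message) for gateway in routes.get(event["check"], default)]
```

A critical `chef_client` event from `web1` would yield one message each for `pagerduty` and `irc`, while an OK event yields nothing. Keeping this table outside the monitoring core is the separation of concerns the slide argues for: the core reports state, and something else decides who gets woken up.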
    • What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI
    • User interfaces are hard. What do we need from it?
        - What’s broken
        - When it broke, when it broke in the past
        - Say “OK, I know it’s broken”
        - View graphs to see how quickly it broke
        - See every check everywhere, and filter the list
    • The Sensu Dashboard sucks. No history! Acknowledgements aren’t easy to do. No graphing. Can’t see anything that’s reporting an OK status. This won’t do.
    • I’m going to have to write a UI. Sigh.
    • What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI - ???
    • In Summary Nagios sucks. There are good tools for each concern of monitoring. If we can package them together, we can have something that rocks.
    • Thank You. Contact andy@forward3d.com (@supersheep)