Sensu at brightpearl
  • EXPLAIN WHY NAGIOS CHECKS ARE BAD – an NRPE check is fired to each server; the more checks there are, the more they queue up. A check can fire on a server before the previous one has completed, so a result never comes back. Chef kind of helps with configuration, but not by a lot. As there are more servers, there are more exceptions that configuration management doesn't cover so easily.
    What follows NAGIOSAGEDDON? Mail queue overload and eventual crash. Alerts stop altogether, which nobody notices, because they're ignoring them.
  • If the list of checks doesn't get processed before the next check is due... we may never get results back for the later checks in the list. Or, suppose the server is able to process the required checks within the time “window” (e.g. 1 minute for checks that are made every minute) – what if the number of checks is doubled? Tripled?
  • Reliability – when was the last time you saw the Nagios daemon crash? It's usually things external to Nagios that are the problem.
    Painful setting up – there are bolt-ons like Groundworks to improve setting up, but they're not that much better than arsing about with configuration files.
    Deviation = non-static hostnames in the cloud. Generally, in a datacentre most things are static.
  • A lot has changed in 15 years – the biggest changes being that a) everyone's running more servers and more services, and b) most people are relying on the cloud = many, many non-static IP addresses.
    Nagios is 15 years old, give or take – released in 1999, and the design hasn't changed much in years. It's not fair to expect them to have predicted the changes back then, but neither has the software moved with the times.
    Configuration duplication – the server has to be aware of what checks it wants clients to make, and the client has to be aware of what checks it's going to be expected to run. Absolutely crazy setup.
  • Stockholm syndrome not just in our company or even with me – everyone seems to have it. Reference everyone defending Nagios when it's basically shit.
  • “Sensu” from the Japanese word for “fan” – relates to the “fanout exchange”, one of the exchange types used by RabbitMQ.
  • Server – orchestrates check executions, processes the results, and passes events from results to handlers. You can run more than one server, and tasks are distributed amongst them.
    Client – receives check execution requests, executes the checks, and publishes the results.
    API – provides a REST-like interface to Sensu data, such as registered clients and current events. You can run more than one.
    Dashboard – UI for Sensu. Not great.

Sensu at brightpearl Presentation Transcript

  • 1. Sensu at Brightpearl: Turning a hatred of Nagios into a love of Sensu (www.brightpearl.com)
  • 2. Who the hell am I? Dave Tibbs, @LowlySysadm1n
      ● Systems Administrator at Brightpearl Inc
      ● Started at Brightpearl UK in October 2010
      ● Back then, only about 20 people in the company – I was the only Systems Administrator/General IT Dogsbody
      ● ~7 years' experience as a sysadmin working with various flavours of Linux
  • 3. Monitoring – who needs it anyway?
      ● Basically everyone – if you're running production software that people depend on, you need to know what's going on with your servers
      ● You can't rely on screaming users to let you know when things go wrong
      ● Certain metrics can be a very good indicator of failures before they happen – think disk space, memory consumption, failed backups, web requests/sec, etc.
  • 4. Monitoring in place when I started
  • 5. Right, better get some monitoring. Nagios, then?
      ● Reputation of being the default, safe choice
      ● Claims to be “Industry Standard” on its website
      ● Historically, people were put off by the extortionate cost of enterprise software (e.g. HP Openview) – and now cloud-based software still requires a subscription.
      ● Hey, Nagios is free.
      ● Neckbeards rejoice – it's open source.
  • 6. In the beginning, it was joyous.
      ● MONITOR ALL TEH THINGZ
      ● (Relatively) low server count meant it was still manageable. Easy to tune alerts to specific servers.
      ● All the plugins you can imagine meant we could monitor RDS instances, internal office servers, UPS, etc.
      ● Email alerts for warnings keep us abreast of things that might happen
      ● PagerDuty integration for critical alerts
      ● Configuration assisted with Chef.
  • 7. But then...
      ● As the number of servers increases, so does the configuration required
      ● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others.
      ● Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings. And especially if stuff is still working.
  • 8. A quick note on Nagios checks.
      ● Monitoring host sends check command over NRPE and waits for a response
      ● The queue of checks is processed one by one – if networking to certain hosts is slow, the whole list is slower to process.
      ● If the list of checks doesn't get processed before the next check is due...
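The queueing problem on this slide can be put into numbers. The sketch below is a deliberately simplified model that assumes strictly one-at-a-time processing, as the slide describes; the check counts, durations and 60-second interval are hypothetical figures chosen for illustration, not measurements from Brightpearl.

```python
def sequential_pass_seconds(check_durations):
    """Time for one monitoring host to work through its queue one check at a time."""
    return sum(check_durations)


def fits_in_window(check_durations, interval_seconds):
    """True if the whole queue finishes before the next round of checks is due."""
    return sequential_pass_seconds(check_durations) <= interval_seconds


# Hypothetical numbers: 100 checks at 0.5 s each, scheduled every 60 s.
baseline = [0.5] * 100
print(fits_in_window(baseline, 60))    # 50 s of work fits in a 60 s window

# Doubling the number of checks overruns the same window.
doubled = baseline * 2
print(fits_in_window(doubled, 60))     # 100 s of work does not fit

# A handful of slow hosts has the same effect without adding any checks.
slow_network = [0.5] * 90 + [3.0] * 10
print(sequential_pass_seconds(slow_network))
```

Once a pass takes longer than the interval, the checks at the end of the list are the ones that go without results.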
  • 9. So Nagios sucks then?
      ● Well, Nagios gets some things right
          ● The plugin model is simple (4 exit codes!) and reasonably well-designed
          ● It's pretty reliable
          ● SSL support = secure
      ● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works – but still with a lot of painful setting up
      ● But as soon as you deviate from this, it all goes wrong.
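The “4 exit codes” are the conventional Nagios plugin return values: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A minimal sketch of a plugin-style disk check follows; the function names and the 80%/90% thresholds are illustrative assumptions, not taken from the talk.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, status_line).

    The default thresholds are illustrative only.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path="/"):
    """Run the check against a real filesystem path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything that stops the check from running is reported as UNKNOWN.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return classify_disk_usage(100.0 * usage.used / usage.total)


def main():
    """Print one status line and return the exit code for the caller."""
    code, message = check_disk()
    print(message)
    return code

# As a standalone plugin, the script would end with:
#   import sys; sys.exit(main())
```

Because the contract is just one line of output plus an exit code, the same script can be run by any scheduler that understands the convention, which is why existing plugins carry over to other tools.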
  • 10. Yes, basically Nagios sucks.
      ● A lot has changed in the IT world in 15 years – Nagios hasn't.
      ● It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on the master
      ● The configuration format is horrible – Chef/Puppet only slightly dulls the pain
      ● It has a horrendous interface – even if you pay for Nagios XI, which isn't cheap
      ● It assumes a static infrastructure, which in the days of the cloud is almost never the case.
      ● Configuration has to be duplicated in two places
  • 11. So what to do?
      ● Reached the limit of Nagios pain – determined to shake the Stockholm Syndrome we all appear to have
      ● Alerts are pretty much ignored by all; once the flood gets large enough they WILL end up filtered. Nagios has been stopped for days without anybody noticing.
      ● A monitoring system that people ignore is utterly pointless.
      ● Started to investigate alternatives.
  • 12. Alternatives to Nagios
      ● Nagios XI – $$$ and apparently not much better.
      ● Zabbix – Not as much support as Nagios, lots of people seem to think it's worse. Configuration possibly even more complex
      ● ZenOSS – Confusing config, issues with false positives and massive numbers of alerts
      ● Then I found Sensu.
  • 13. What is this Sensu then?
      ● Much, much better model (queue-subscriber)
      ● Purpose-built for this, best tool for the job. Think Graphite for graphing, PagerDuty for alerting.
      ● Supports existing Nagios plugins
      ● Integrates with Graphite and PagerDuty
      ● Easy to scale – automatically handles clustering.
      ● Great REST API – you can do most things with it
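The queue-subscriber model can be illustrated with a toy, in-memory stand-in for the message broker. Everything here (the `ToyBroker` class, `make_client`, the hostnames and the `check_http` name) is hypothetical and written for this sketch; it is not Sensu's or RabbitMQ's API, only the shape of the idea.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (not RabbitMQ's API)."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # subscription name -> client callbacks
        self.results = []                      # result queue the server would read from

    def subscribe(self, subscription, client):
        self.subscribers[subscription].append(client)

    def publish_check_request(self, subscription, check_name):
        # Fanout: every client bound to the subscription gets a copy of the request.
        for client in self.subscribers[subscription]:
            self.results.append(client(check_name))


def make_client(hostname):
    """Return a client that 'runs' a check and reports a result dict."""
    def run(check_name):
        return {"client": hostname, "check": check_name, "status": 0}
    return run


broker = ToyBroker()
broker.subscribe("webservers", make_client("web-01"))
broker.subscribe("webservers", make_client("web-02"))   # a new server only has to subscribe
broker.subscribe("databases", make_client("db-01"))

broker.publish_check_request("webservers", "check_http")
print([r["client"] for r in broker.results])   # ['web-01', 'web-02']
```

The publisher never names individual hosts: it addresses a subscription, so adding a server means adding a subscriber rather than editing a central host list.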
  • 14. No really, what is it?
      ● Often described as a “monitoring router”
      ● Results of “check” scripts are passed on to one or more handlers, depending on certain conditions
      ● Written in Ruby (yay!)
      ● Configuration is all in JSON
      ● Four main components:
          ● Server
          ● Client
          ● API
          ● Dashboard
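To make the “monitoring router” idea concrete, here is a rough sketch of a JSON check definition plus a toy routing rule. The key names follow the general style of Sensu check definitions, but the command, subscription and handler names are hypothetical, and the `MIN_STATUS` rule is an assumption made for this example rather than Sensu's own filtering logic.

```python
import json

# Hypothetical check definition in the style of a Sensu JSON config file.
check_config = {
    "checks": {
        "check_disk": {
            "command": "check_disk.py",          # hypothetical plugin script
            "subscribers": ["webservers"],
            "interval": 60,
            "handlers": ["email", "pagerduty"],
        }
    }
}

# Toy routing rule (an assumption, not Sensu's filter logic):
# each handler accepts results at or above a minimum exit status.
MIN_STATUS = {"email": 1, "pagerduty": 2}


def route(check_name, status):
    """Return the handlers that should receive a result with this exit status."""
    handlers = check_config["checks"][check_name]["handlers"]
    return [h for h in handlers if status >= MIN_STATUS[h]]


print(json.dumps(check_config, indent=2))
print(route("check_disk", 1))   # warning  -> ['email']
print(route("check_disk", 2))   # critical -> ['email', 'pagerduty']
```

This mirrors the split described earlier in the talk, with warnings going to email and critical alerts going to PagerDuty.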
  • 15. Compared to Nagios, this is good
      ● Hosting our infrastructure in the cloud, we need our monitoring solution to be
          ● able to cope with changing instances/infrastructure
          ● aware of new servers without us having to remember to tell it
          ● able to cope with possibly rapid expansion
      ● Sensu fulfills these objectives reasonably well.
  • 16. So is Sensu perfect?
      ● No, nothing is.
      ● The dashboard is immature – basically still a bit rubbish
      ● Current release is only version 0.12 – so the software as a whole is fairly immature.
      ● Fairly complicated install process, with dependencies on quite a bit more than Nagios. It's been Chef'd (and Chef'd well), but it seems easy for these dependencies to break with version inconsistencies.
  • 17. But it's still immeasurably better.
      ● It'll scale well when our infrastructure expands
      ● Has performed great in a test environment
      ● Looking forward to rolling it out to production!