Sensu at Brightpearl
Turning a hatred of Nagios into a
love of Sensu
Who the hell am I?
Systems Administrator at Brightpearl Inc
Started at Brightpearl UK in October 2010
Back then, only about 20 people in the company
– I was the only Systems Administrator/General
~7 years' experience as a sysadmin working with
various flavours of Linux
Monitoring – who needs it anyway?
Basically everyone – if you're running production
software that people depend on, you need to know
what's going on with your servers
You can't rely on screaming users to let you know
when things go wrong
Certain metrics can be a very good indicator of
failures before they happen – think disk space,
memory consumption, failed backups, web
Right, better get some monitoring.
Reputation of being the default, safe choice
Claim to be “Industry Standard” on their website
Historically people were put off by extortionate
costs of enterprise software (e.g. HP OpenView) –
now cloud-based software still requires a
Hey, Nagios is free.
Neckbeards rejoice – it's open source.
In the beginning, it was joyous.
MONITOR ALL TEH THINGZ
(Relatively) low server count meant it was still
manageable. Easy to tune alerts to specific
All the plugins you can imagine meant we could
monitor RDS instances, internal office servers,
UPS, etc.
Email alerts for warnings kept us abreast of
things that might happen
PagerDuty integration for critical alerts
Configuration assisted with Chef.
As the number of servers increases, so does the
...and so do the spurious alerts, where the
thresholds aren't so simple to set. Hosting cost
constraints mean sometimes running close to the
wire on some servers but not others.
Because of this, NAGIOSAGEDDON in your
email inbox. Soon enough, everyone's ignoring
them, especially the warnings. And especially if
stuff is still working
A quick note on Nagios checks.
Monitoring host sends check command over NRPE and waits for a response
The queue of checks is processed one by one – if networking to certain hosts is
slow, the whole list takes longer to process.
If the list of checks doesn't get processed before the next check is due.....
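A rough way to see the problem on this slide: if checks run strictly one after another, a full pass over the queue takes the sum of every response time, so a handful of slow hosts can push the pass beyond the check interval. The sketch below is a toy model of that behaviour only; the check counts, latencies and the 60 second interval are made-up numbers, not measurements from our setup.

```python
def time_to_finish_round(latencies_s):
    """Checks run one after another, so a pass takes the sum of all response times."""
    return sum(latencies_s)


def falls_behind(latencies_s, interval_s):
    """True if one full pass over the queue takes longer than the check interval."""
    return time_to_finish_round(latencies_s) > interval_s


# Hypothetical queue: 200 checks that normally answer in 0.2 s each.
fast = [0.2] * 200
# Same queue when 20 of those checks sit behind a slow network path (5 s each).
slow = [0.2] * 180 + [5.0] * 20

INTERVAL_S = 60  # assumed check interval, for illustration only

for label, queue in (("fast", fast), ("slow", slow)):
    took = time_to_finish_round(queue)
    print(f"{label}: pass takes {took:.0f}s, falls behind: {falls_behind(queue, INTERVAL_S)}")
```

With these assumed numbers the healthy queue finishes well inside the interval, while the queue with a few slow hosts is still working through the previous pass when the next one is due.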
So Nagios sucks then?
Well, Nagios gets some things right –
The plugin model is simple (4 exit codes!) and
It's pretty reliable
SSL support = secure
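To show how small the plugin model really is, here is a minimal sketch of a disk check. By convention the four exit codes are 0 = OK, 1 = WARNING, 2 = CRITICAL and 3 = UNKNOWN, and the plugin prints one line of status text. The function names and the 80/90 percent thresholds are illustrative assumptions, not taken from any shipped plugin.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to an exit code (thresholds are illustrative)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, one_line_message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not as a failure.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn, crit)
    return code, f"DISK {STATUS_NAMES[code]} - {used_pct:.1f}% used on {path}"
```

A real plugin would print the message and then call `sys.exit(code)`; the monitoring system only looks at that exit code and the line of output.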
If you're running a small office/datacentre with
servers and requirements that rarely or never
change it works – but still with a lot of painful
But as soon as you deviate from this, it all goes
Yes, basically Nagios sucks.
A lot has changed in the IT world in 15 years –
It's completely unscalable. There is no such
thing as a Nagios cluster. More checks = more
server load on master
The configuration format is horrible –
Chef/Puppet only slightly dulls the pain
It has a horrendous interface – even if you pay
for Nagios XI, which isn't cheap
It assumes a static infrastructure, which in the
days of cloud is almost never the case.
Configuration has to be duplicated in two places
So what to do?
Reached the limit of Nagios pain – determined to
shake the Stockholm Syndrome we all appear to
Alerts are pretty much ignored by all; once the flood
gets large enough they WILL end up filtered.
Nagios has been stopped for days without
A monitoring system that people ignore is utterly
Started to investigate alternatives.
Alternatives to Nagios
Nagios XI – $$$ and apparently not much better.
Zabbix – Not as much support as Nagios, lots of
people seem to think it's worse. Configuration
possibly even more complex
Zenoss – Confusing config, issues with false
positives and massive numbers of alerts
Then I found Sensu.
What is this Sensu then?
Much, much better model (queue-subscriber)
Purpose-built for this, best tool for the job. Think
Graphite for graphing, PagerDuty for alerting.
Supports existing Nagios plugins
Integrates with Graphite, PagerDuty
Easy to scale – automatically handles clustering.
Great REST API – you can do most things with it
No really, what is it?
Often described as a “monitoring router”
Results of “check” scripts are passed onto one
or more handlers, depending on certain
Written in Ruby (yay!)
Configuration is all in JSON
Four main components:
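Going back to the “monitoring router” description on this slide, the idea can be sketched in a few lines: a check result comes in and is handed to every handler registered for that check. This is a toy illustration of the concept only, not Sensu's code or API; the `Router` class, the result fields and the two list-based handlers are all hypothetical.

```python
from collections import defaultdict


class Router:
    """Toy 'monitoring router': passes each check result to its registered handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)  # check name -> list of handler callables

    def subscribe(self, check_name, handler):
        self._handlers[check_name].append(handler)

    def route(self, result):
        """Hand a non-OK result to every handler for that check; return how many ran."""
        if result["status"] == 0:  # 0 = OK, nothing to hand off in this sketch
            return 0
        handlers = self._handlers.get(result["check"], [])
        for handler in handlers:
            handler(result)
        return len(handlers)


# Stand-ins for real handlers (email, paging): they just collect what they receive.
emailed, paged = [], []

router = Router()
router.subscribe("check_disk", emailed.append)
router.subscribe("check_disk", paged.append)

ok = {"client": "web01", "check": "check_disk", "status": 0, "output": "DISK OK"}
bad = {"client": "web01", "check": "check_disk", "status": 2, "output": "DISK CRITICAL"}

handled_ok = router.route(ok)    # OK result: no handler runs
handled_bad = router.route(bad)  # critical result: both handlers run
```

The point of the model is that checks and handlers are decoupled: adding a new way to alert means registering another handler, not touching the check.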
Compared to Nagios, this is good
Hosting our infrastructure in the cloud, we need
to have our monitoring solution be
able to cope with changing
aware of new servers without us having to
remember to tell it
able to cope with possibly rapid expansion
Sensu fulfills these objectives reasonably well.
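One common way to get the “aware of new servers without us having to remember to tell it” property is to have every server announce itself with a periodic heartbeat, so the monitoring side registers a client the first time it hears from it. The sketch below assumes that approach; the `ClientRegistry` class and the 120 second staleness window are hypothetical and are not a description of Sensu's internals.

```python
class ClientRegistry:
    """Toy registry: clients appear on their first heartbeat, no central host list."""

    def __init__(self, stale_after_s=120):
        self.stale_after_s = stale_after_s
        self.last_seen = {}  # client name -> timestamp of last heartbeat (seconds)

    def heartbeat(self, name, now_s):
        """Record a heartbeat; return True if this client had not been seen before."""
        is_new = name not in self.last_seen
        self.last_seen[name] = now_s
        return is_new

    def stale(self, now_s):
        """Names of clients whose last heartbeat is older than the staleness window."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now_s - seen > self.stale_after_s
        )


registry = ClientRegistry(stale_after_s=120)
first = registry.heartbeat("web01", now_s=0)    # new server announces itself
again = registry.heartbeat("web01", now_s=30)   # already known
registry.heartbeat("web02", now_s=100)          # a second server appears later
gone_quiet = registry.stale(now_s=200)          # web01 last seen 170 s ago
```

A server that is launched simply starts sending heartbeats, and one that is terminated shows up as stale, which is the behaviour a changing cloud estate needs.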
So is Sensu perfect?
No, nothing is.
The dashboard is immature – basically still a
Current release is only version 0.12 – so the
software as a whole is fairly immature.
Fairly complicated install process, with
dependencies on quite a bit more than Nagios.
It's been Chef'd (and Chef'd well) but it seems
easy for these dependencies to break with
But it's still immeasurably better.
It'll scale well when our infrastructure expands
Has performed great in a test environment
Looking forward to rolling it out to production!