Dave Williams' presentation on Multi-Tenant Nagios Monitoring.
The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
First, some background about me – where I’ve been and where I am
Then a description of the Nagios multi-tenant solution – why we needed it and some of the design decisions made
Over 35 years working in the IT industry
Coding real time systems and then on to Operating system support
GeCOS & Transaction Processing – assembler / machine code
Network Processor software development
Customer Facing
IBM Mainframes MVS / VM – SNA VTAM / NCP – Bureau Environment / Service Orientated
Bull – Honeywell Bull / CIIHB / Honeywell - worked at R&D in Toronto / Minneapolis / Grenoble
Bull Toronto was the R&D centre for system monitoring, SNMP & graphical systems
OpenView on HP kit for IBM Bureau ISDN / Dial-In systems (from FT / Legal document systems)
NetView for IBM SNA networks
Open Master – an OpenView-like system written by Bull, AIX based
Saw Netsaint 0.6, finally compiled by Chris Rothecker, tried it, decided I could do a better port, and did it. Built an AIX installp file that contained everything (GD etc.), released via the Bull freeware site.
Followed project and continued to work with Nagios under AIX.
Later evangelized Nagios in Bull and worked on the Linux based versions
Support multiple customers throughout Europe – different hardware / software sets and different levels of involvement: simple reporting / OS patching / Database / Citrix support / Application support. Only one 3rd-line support team, so we need a system to tell them what is happening before the Service Desk gets involved. Need for monitoring 24x7, 365/6 days a year, keeping relevant history for problem analysis.
All sizes and shapes: 2 – 3000 hosts, maybe lots of services, maybe only a few (but very important)
SLA reporting – the service must as a minimum achieve the service levels contracted
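As a rough illustration of the arithmetic behind an SLA check, the sketch below computes availability over a reporting period and compares it with a contracted target. The function names and the 99.9% figure are assumptions made for the example, not values from the presentation.

```python
def availability_percent(period_seconds: int, downtime_seconds: int) -> float:
    """Return the percentage of the reporting period the service was up."""
    if period_seconds <= 0:
        raise ValueError("reporting period must be positive")
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must lie within the reporting period")
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


def meets_sla(period_seconds: int, downtime_seconds: int, target_percent: float) -> bool:
    """True when measured availability reaches the contracted minimum."""
    return availability_percent(period_seconds, downtime_seconds) >= target_percent


# Hypothetical case: a 30-day month, 30 minutes of outage, 99.9% target.
THIRTY_DAYS = 30 * 24 * 3600
print(meets_sla(THIRTY_DAYS, 30 * 60, 99.9))
```

A 99.9% target over 30 days leaves about 43 minutes of allowed downtime, so a 30-minute outage passes and a 60-minute outage does not.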
Each Customer is unique in terms of alerting and reporting.
Virtual central system. XenServer because it’s ‘free’ and supports clusters.
QNAP for storage because it was there; could use FreeNAS or other shareable storage with replication.
No need to implement Nagios clustering or multiple nodes as it’s all taken care of at hardware & hypervisor level
The usual things to allow the system to stay up – bonded interfaces, multiple switches, multiple nodes in the hardware cluster, hardware replication of disk storage.
Need to cater for site loss – data replicated to 2nd geographic site, identical hardware ready to start up – IP addresses swapped by telecoms provider
Each customer needs at least one appliance to do the actual monitoring. The size and number of the appliances depends on the needs of the customer. Raspberry Pi is good for the smaller end of SMEs; the Netbook serves the larger customers well. By ‘gold disking’ the appliance and pulling the Nagios config files back to the centre every day, recovery of the appliance is easy – most large sites hold a spare appliance.
Using Core with LiveStatus to provide data access. Speeds local access and gives better filter facilities.
LiveStatus also allows multiple remote Nagios inputs - https
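To give a feel for the filtering Livestatus offers, here is a minimal sketch that builds a Livestatus query and reads the JSON reply. It assumes a local Unix socket; the socket path, helper names and chosen columns are illustrative assumptions, and a remote or encrypted transport would need a different connection step.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Build a Livestatus query: GET line, headers, then a blank line."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query):
    """Send the query to a Livestatus Unix socket and decode the JSON rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# All services currently in CRITICAL state (state = 2).
critical_services = build_query(
    "services", ["host_name", "description", "state"], ["state = 2"]
)
# rows = query_livestatus("/var/lib/nagios/rw/live", critical_services)  # hypothetical path
```

The filter is evaluated inside the Nagios process, so only matching rows cross the socket.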
Using Thruk for Visual representation – derives data from livestatus, gives SLA reporting in PDF format and a Dashboard.
Graylog2 / Elasticsearch used to absorb Nagios logs + Syslog (Cisco devices sure can generate syslog) + Windows Event logs. Allows searching and some correlation analysis.
Asterisk used to alert ‘wetware’ – SMS or emails are easily ignored, a voice in the dark reading out error messages is a lot harder to ignore (can hit landlines as well as mobiles!)
NSCA-ng used because I wanted to submit commands other than ENABLE/DISABLE_NOTIFICATIONS and ACKNOWLEDGE_SVC_PROBLEM – had ACKNOWLEDGE_HOST_PROBLEM in my sights as well.
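For reference, a Nagios external command is a single line of the form "[timestamp] COMMAND;arg1;arg2;...". The sketch below only formats such a line for ACKNOWLEDGE_HOST_PROBLEM; the helper names are assumptions, the field order follows the Nagios external command documentation as I understand it, and handing the line to NSCA-ng is deliberately left out.

```python
import time


def external_command(name, *args, timestamp=None):
    """Format one Nagios external command line: [epoch] NAME;arg;arg..."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    fields = [name] + [str(a) for a in args]
    if any("\n" in f for f in fields):
        raise ValueError("external commands must fit on a single line")
    return f"[{ts}] " + ";".join(fields)


def acknowledge_host_problem(host, author, comment, sticky=True,
                             notify=True, persistent=False, timestamp=None):
    """Build ACKNOWLEDGE_HOST_PROBLEM;host;sticky;notify;persistent;author;comment."""
    return external_command(
        "ACKNOWLEDGE_HOST_PROBLEM",
        host,
        2 if sticky else 1,      # 2 keeps the acknowledgement until the host is UP
        1 if notify else 0,      # notify contacts about the acknowledgement
        1 if persistent else 0,  # keep the comment across restarts
        author,
        comment,
        timestamp=timestamp,
    )


# Hypothetical host and author names, fixed timestamp for repeatability.
line = acknowledge_host_problem("web01", "dave", "Investigating", timestamp=1413200000)
print(line)
```

Building the line in one place keeps the semicolon-separated field order consistent across whatever scripts submit commands.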
Email handling a whole new topic – sending is easy but routing inbound emails a lot harder – acks , new alerts
If nothing else available (on a customer basis) use OTRS to hold alerts and commentary
Just the usual stuff running on the remote platform. OpenVPN as well, because we normally use a Cisco IPsec VPN connection but have used other software when the Cisco ports are blocked (https anyone?)
Slightly stretched version of ITIL Service Catalogue – services and their characteristics , processes are related mainly to change control and configuration item relationships
Service catalogues are implemented in a manner that facilitates the registration, discovery, request, execution, and tracking of desired services for catalogue users. Each service within the catalogue typically includes traits and elements such as:
Clear ownership of and accountability for the service (a person and often an organization)
A name or identification label for the service
A description of the service
A service categorization or type that allows it to be grouped with other similar services
Related service request types
Any supporting or underpinning services
Service level agreement (SLA) data and information that helps service providers set expectations for their service requestors
Who is entitled to request/view the service
Associated costs (if any)
How to request the service and how its delivery is fulfilled
Escalation points and key contacts
The more descriptive the service details are, the easier it is for end users of the service catalog to find and invoke the services they desire.
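The traits listed above map naturally onto a simple record type. The sketch below is one possible shape for a catalogue entry; every field name and the sample values are assumptions for illustration, not the schema used in the solution described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceCatalogueEntry:
    """One service in the catalogue, with the traits listed above."""
    name: str
    description: str
    owner: str                      # accountable person or organisation
    category: str                   # groups similar services together
    request_types: List[str] = field(default_factory=list)
    supporting_services: List[str] = field(default_factory=list)
    sla_target_percent: Optional[float] = None
    entitled_groups: List[str] = field(default_factory=list)
    cost: Optional[str] = None
    how_to_request: str = ""
    escalation_contacts: List[str] = field(default_factory=list)


def services_in_category(catalogue, category):
    """Return the names of all catalogue entries in the given category."""
    return [entry.name for entry in catalogue if entry.category == category]


# Hypothetical catalogue with two entries.
catalogue = [
    ServiceCatalogueEntry("Citrix support", "Published desktop support",
                          "3rd line team", "application", sla_target_percent=99.5),
    ServiceCatalogueEntry("OS patching", "Monthly patch cycle",
                          "3rd line team", "platform"),
]
print(services_in_category(catalogue, "platform"))
```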
The agreed list is the hardest to get – like pulling teeth. No one wants to be alerted (or the guy that does is no use). If there is an automated way of doing it, please do it.
Really hard to get people to do this – ‘job preservation’ – work smarter.
Will get round to Puppet soon……
Joys of multi-tenant – naming conventions & physical locations combine to make it hard, and then change control for CIs is important.
‘Cloud’ – it’s a lovely buzzword – sometimes it fits a solution like a glove – sometimes not. Horizontal scaling and moving the costs directly back to the customer is good. Cloud geographic issues don’t apply – no local data held.
Engineer around real-life problems. For $40, why spend weeks / months tuning & developing when a simple hardware / software add-on will do the job?
Business process by drag & drop – simple but effective. Explains why a process is at warning and not critical for example.
Dashboard by drag & drop – can restrict to be locked down for a particular user – only screen they can see (Nagios contacts based)