Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

October 14th 2014 Dave Williams
Technical Architect
Multi-Tenant Nagios Monitoring
© Bull, 2014 1

Agenda
Background
Multi-Tenant Monitoring
Why Multi-Tenant
Multi-Tenant Design
Service Catalogue
Futures & ‘Blue Sky thinking’
Questions

Background
UK based
Mainframe (IBM & Honeywell)
Unix (HP-UX, AIX, Solaris)
Linux (RedHat, SLES, Debian)
Network (CASE, 3COM, CISCO)
Working for Bull
French Computer Manufacturer
Mainframes, Unix, HPC,
Security, Managed Services,
Advisory Services

Background
System Monitoring
OpenView
Netview
Open Master
Open Source Monitoring
NetSaint on AIX
Nagios

Why Multi-Tenant?
Outsourcing Support & Monitoring
Multiple Customers
–Different Levels of security
–Different Hardware / Software Platforms
One Support Team
–Only need to know about real problems
–Can be driven by support ticket not Nagios
Required 365 x 24
–Infrastructure must survive all outages without loss of service

Multi-Tenant Design
Each customer may have 2,000-3,000 hosts
10-100 services per host
Real time monitoring
Customer profile
SLA Reporting
Batch Event completion
Different SLAs for each Business Process per customer
Different alerting & escalation methods per customer
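As a rough sizing aid, the host and service figures above can be turned into an average check rate. This is a minimal sketch: it reads '2-3000' as 2,000 to 3,000 hosts and assumes every service is actively checked once every 5 minutes. Both are assumptions, and the function name is hypothetical.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_s: int) -> float:
    """Average active checks per second if each service runs once per interval."""
    return hosts * services_per_host / interval_s


# Assumed 5-minute check interval; the interval is not stated on the slide.
INTERVAL_S = 300

low = checks_per_second(2000, 10, INTERVAL_S)    # smallest customer profile
high = checks_per_second(3000, 100, INTERVAL_S)  # largest customer profile

print(f"{low:.1f} to {high:.1f} checks per second per customer")
```

Under these assumptions the range spans more than an order of magnitude, which helps explain why check execution is pushed out to customer sites.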

Multi-Tenant Design
Hardware Platform – Central Support
Virtualised Platform (Intel based)
–XenServer Hypervisor
• Allows clustering with shared storage
• Inexpensive Licensing
Shared Storage
–NAS
• Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
• Network connection using dual interfaces bound across multiple switches
• Could have used FreeNAS
LAN Infrastructure
–Dual connections to all hardware
–SNMP managed switches

Hardware Platform – Basic Schematic

Multi-Tenant Design
Hardware Platform – Resilience
• If the primary node fails, the cluster will ‘spin up’ the image on the second node
Same data / logs (Shared storage)
LAN Infrastructure
• Bonded interfaces for NAS access – no data loss / access loss with failure
• SNMP managed switches

Hardware Setup

Multi-Tenant Design
Hardware Platform – Recovery
• If the primary site fails, the recovery site will spin up the image
• Internet access fails over using BGP
Shared Storage – replicated from Prime Site
–NAS
• Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
• Using RTRR (Real Time Remote Replication) between sites
• Network connection using dual interfaces bound across multiple switches
LAN Infrastructure
• Bonded interfaces for NAS access – no data loss / access loss with failure
• SNMP managed switches

Hardware Platform - Resilience

Hardware Platform – Customer Site
Using generic netbooks
Minimum requirement
–1 GB memory, Atom processor, Ethernet port
–Running CentOS 6.4 64-bit Operating System
Can use Raspberry Pi for small customers
–512 MB memory, ARM processor, Ethernet port
–Running Raspbian Operating System

Software Platform – Central Site
Nagios – Core
Running latest 4.0.8
Using MK Livestatus for interfacing
Using Thruk for Visualisation
Graylog2 / Elasticsearch
Store all logs & Syslog in ‘Big Data’ repository using MongoDB
Asterisk PBX
Allow all alerting to use standard dial-up with speech synthesis + IVR
SMS-Client
Still using TAPI to SMS Text contacts

Software Platform – Central Site (contd)
NRPE
Running 2.15
NSCA & NSCA-ng
Using NSCA for external communication
Using NSCA-ng for issuing remote commands
Postfix / Procmail
Used to generate emails but also handle responses.
Routes unsolicited alerting emails (HP Insight, Pingdom)
OTRS
Record alerts, track issues

Software Platform – Remote Site
Nagios – Core
Running latest 4.0.8
NRPE
Running 2.14
NSCA
Using NSCA for external communication
OpenVPN
Communication via SSL/TLS VPN tunnel

Customer Multi-Tenant

Service Catalogue
Agreed list of servers / services
With importance levels
With alerting paths
With escalation paths
Recovery options
Feeds into Service Level Agreements and Operational Level Agreements
Basis of agreed reporting structures

Examples
Basic Spreadsheet plus Shell script
Usually easy to create; the shell script differs for each customer but is based on an initial standard script
Chef or Puppet
Use Exported Resources
Nagios Cookbook – Nagios Conference 2012 Presentation
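The spreadsheet approach can be sketched as follows. The talk describes a per-customer shell script; this is a Python stand-in, not the author's script. The CSV column names, the `Custloc1` prefix, the `generic-host` / `generic-service` templates and the `check_nrpe!check_load` command are all assumptions made for illustration.

```python
import csv
import io

HOST_TMPL = """define host {{
    use             generic-host
    host_name       {host_name}
    address         {address}
    contact_groups  {contact_group}
}}
"""

SERVICE_TMPL = """define service {{
    use                  generic-service
    host_name            {host_name}
    service_description  {service}
    check_command        {check_command}
    contact_groups       {contact_group}
}}
"""


def render_catalogue(catalogue_csv: str, prefix: str) -> str:
    """Turn catalogue rows (one per service) into Nagios object definitions."""
    seen_hosts = set()
    blocks = []
    for row in csv.DictReader(io.StringIO(catalogue_csv)):
        # Prefix the customer's own name so every 'server01' stays unique.
        host_name = f"{prefix}-{row['host']}"
        if host_name not in seen_hosts:
            seen_hosts.add(host_name)
            blocks.append(HOST_TMPL.format(
                host_name=host_name,
                address=row["address"],
                contact_group=row["contact_group"],
            ))
        blocks.append(SERVICE_TMPL.format(
            host_name=host_name,
            service=row["service"],
            check_command=row["check_command"],
            contact_group=row["contact_group"],
        ))
    return "\n".join(blocks)


# Hypothetical catalogue export: one row per agreed service.
SAMPLE = """host,address,service,check_command,contact_group
server01,192.0.2.10,SSH,check_ssh,custa-admins
server01,192.0.2.10,Load,check_nrpe!check_load,custa-admins
"""

config = render_catalogue(SAMPLE, "Custloc1")
print(config)
```

Importance levels and escalation paths from the catalogue could be carried as further columns and mapped onto contact groups or escalation objects in the same way.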

Multi Tenant Issues
Naming conventions
Every customer has a server01
Customers' naming conventions are obscure
Customers have multiple physical locations or levels of security
–This gives rise to Nagios host names that differ from the actual host names:
–Custloc1-swfeltsw01
–Custloc2-nwfeltsw01
Not so smart when a non-Nagios originated alert is received
–For example, ‘swfeltsw01 – RAID battery backup failure’ from HP Insight
–The external alert processor has to perform table lookups before building the appropriate NSCA command
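The table lookup can be sketched as below. This is a minimal illustration, not the actual alert processor: the lookup key (a customer identifier plus the bare host name), the customer label `custa` and the service description `HP Insight` are assumptions. The output follows the tab-separated line that send_nsca reads on standard input: host name, service description, return code, plugin output.

```python
# Hypothetical lookup table: (customer, bare host name) -> Nagios host name.
HOST_TABLE = {
    ("custa", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("custa", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}

STATE_CRITICAL = 2  # Nagios return code for CRITICAL


def nsca_line(customer: str, bare_host: str, service: str,
              state: int, message: str, table: dict) -> str:
    """Build one passive service check result line for send_nsca."""
    # A KeyError here means the alert names a host missing from the table.
    nagios_host = table[(customer, bare_host.lower())]
    return f"{nagios_host}\t{service}\t{state}\t{message}\n"


line = nsca_line(
    "custa", "swfeltsw01", "HP Insight",
    STATE_CRITICAL, "RAID battery backup failure", HOST_TABLE,
)
print(line, end="")
```

The resulting line would be piped to send_nsca; how the customer is identified from the incoming email (sender domain, mailbox, and so on) is outside this sketch.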

Futures & Blue Sky thinking
The Nagios Visualisation is resource heavy
All Customers want their own Dashboard
All Customers want a different screen layout
Why not move the visualisation into the cloud?
Use an Amazon EC2 image to access central Livestatus via https
Allow end user to authenticate
Customer portal allows ‘spin up’ & ‘spin down’ of images
–Move billing to the customer
–Scale horizontally for Visualisation
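A per-customer dashboard would only need that customer's objects from Livestatus. The sketch below builds a Livestatus query for one tenant's problem services, assuming tenants are separated by a host name prefix such as `Custloc1`. The function name is hypothetical, and the transport (how the query reaches the central Livestatus over https) is not shown.

```python
import re


def tenant_problem_query(prefix: str) -> str:
    """Livestatus query for all non-OK services on hosts with this prefix."""
    lines = [
        "GET services",
        "Columns: host_name description state plugin_output",
        # '~' is the Livestatus regular-expression match operator.
        f"Filter: host_name ~ ^{re.escape(prefix)}-",
        "Filter: state != 0",
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


query = tenant_problem_query("Custloc1")
print(query, end="")
```

Filtering by prefix on the central side keeps one customer's dashboard from ever receiving another customer's host names.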

Load Sharing
Plugins like check_wmi_plus put a strain on the monitoring system: they issue a large number of queries that take wall-clock time to complete and parse.
Better to have ‘worker nodes’, via Merlin, Mod-Gearman or similar, perform these functions – a Raspberry Pi, for example.
No great expense to add 2 or 3 Pis to customer site configurations, and easy fallback if they fail – no unique locally stored data.
