Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

Nagios Is Down and
Your Boss Wants to See You
Andrew Widdersheim

awiddersheim@inetu.net

Nooooooooooo!!!

2012 2

Breaking News!

2012 3

Nagios High Availability Options

Merlin by op5
Classic method described in
Nagios Core documentation
Some type of virtualized
solution like VMWare

or…

2012 4

Nagios High Availability

+

= Win

2012 5

DRBD magic

Linbit
Free
Runs in Kernel either by module or in the mainline
code if Kernel is new enough
Each server gets its own independent storage
Able to maintain the data’s consistency between
the nodes
Resource level fencing

2012 7

DRBD considerations

DRBD is as fast as the slowest node
Network latency
Replication over great distances can be done
DRBD proxy can increase performance over great
distances but does cost money
Recommend using dedicated cross-over link for best
performance
Protocol Choices
Protocol A: write IO is reported as completed, if it has reached local disk and
local TCP send buffer.
Protocol B: write IO is reported as completed, if it has reached local disk and
remote buffer cache.
Protocol C: write IO is reported as completed, if it has reached both local
and remote disk.

2012 8

Pacemaker + DRBD + Nagios
Nagios

Apache

NPCD

NCSA

Nagios Stuff rrdcached

VIP 192.168.1.57

Filesystem ext4

DRBD Primary Secondary
Resource
Manager
Pacemaker

Messaging CoroSync / Heartbeat

Hardware Node1 Node2

2012 10

Nagios

Apache

NPCD

NCSA

Nagios Stuff rrdcached

VIP 192.168.1.57

Filesystem ext4

DRBD Primary Secondary
Resource
Manager
Pacemaker



2012 11

Nagios

Apache

NPCD

NCSA

rrdcached

192.168.1.57

ext4

DRBD Secondary Primary
Resource
Manager
Pacemaker



2012 12

Pacemaker and Nagios

2012 13


primitive p_fs_nagios ocf:heartbeat:Filesystem
params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4“ options="noatime"
op start interval="0" timeout="60s"
op stop interval="0" timeout="180s"
op monitor interval="30s" timeout="40s"

primitive p_nagios lsb:nagios
op start interval="0" timeout="180s"
op stop interval="0" timeout="40s"
op monitor interval="30s"
meta target-role="Started"

group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql
p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache
p_nagios_syslog-ng p_nagios
meta target-role="Started"

2012 14


2012 15

Pacemaker considerations

Redundant communication links are a must
Recommend use of crossover to help
accomplish this
Init scripts for Nagios must be LSB compliant…
some are not

2012 16

What to replicate?

Configuration
Host
Service
Multi check command files
Webinject command files
PNP4Nagios RRD’s
Nagios log files
retention.dat
Mail Queue (eh…)

2012 17

Everything else?

Binaries and main configuration files installed using
packages independently on each server
Able to update one node at a time
Easy to roll back should there be an issue
Version/change management
Consistent build process
NDO and MySQL hosted on separate HA cluster

2012 18

RPM’s

Build and maintain our own RPM’s
Lets us configure everything to our liking
Lets us update at our own pace
Controlled through SVN with a post-commit to
automatically update our own Nagios repository with new
packages/updates. Then it is as simple as doing “yum
update” on your servers.
A lot of upfront work but was worth it

2012 19

How has this helped?

Have been able to repair, upgrade and move
hardware with minimal downtime
Updated OS and restart server with minimal
downtime
Able to update to 3.4.1 and promptly patch issue
affecting Nagios downtime’s that was not caught in
QA
CGI pages of death

2012 20

What doesn’t this solve?

Having an HA cluster is great but there are still
things that can go wrong having a cluster does not
solve
Configuration issues are probably the most
prevalent thing we run into that might bring down
Nagios without there being a major hardware/DC
issue
We make use of NagiosQL which does a backup
when a configuration is changed. This allows us to
rollback unwanted changes but isn’t the best.

2012 21

Two is better than one

Setting up another cluster for “development” with
similar hardware and software is a great way to test
things outside of production
Lets you spot potential problems before they
become a problem

2012 22

Monitoring your cluster

check_crm
http://exchange.nagios.org/directory/Plugins/Clustering-
and-High-2DAvailability/Check-CRM/details
check_drbd
http://exchange.nagios.org/directory/Plugins/Operating-
Systems/Linux/check_drbd/details
check_heartbeat_link
http://exchange.nagios.org/directory/Plugins/Operating-
Systems/Linux/check_heartbeat_link/details

2012 23

Gotcha’s

RPM’s and symlinks in an HA solution are bad
Symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc when
node is secondary and you update RPM your symlink will
get blown away
Restarting services controlled by Pacemaker
should be done within Pacemaker

crm resource restart p_nagios

2012 24

Quick Stats

Thousands of host and service checks
Average check latency ~.300 sec
Average checks per second ~70
Mostly active checks polling every 5 minutes
DL360 G5
6 146GB 10k SAS drives in RAID10
2 quad core E5450 @ 3.00GHz
8GB Memory

2012 25

Tuning
RAM disk for check results queue, NPCD queue, objects.cache and
status.dat
NDOUtils with async patch
Built in since version 1.5
Limit what you send to NDOUtils
Bulk Mode with npcdmod
rrdcached
Restarting Nagios through external command eventually resulted in
higher latencies for some reason
Large installation tweaks
Disable environment macros
A lot of trial and error with scheduling and reaper frequencies
Small amount of check optimization
Measuring Nagios performance using PNP4Nagios is a must
2012 26

RAM disk + ndo-async + rrdcached

2012 27

non-external command file restarts

2012 28

One Year’s Progress

2012 30

How we run today

2012 31

Quick Stats

Questions?

2012 32

Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

Recommended

Recommended

More Related Content

Similar to Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

Similar to Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you (20)

More from Nagios

More from Nagios (20)

Recently uploaded

Recently uploaded (20)

Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you