Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you
Upcoming SlideShare
Loading in...5
×
 

Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

on

  • 3,113 views

Andrew Widdersheim's presentation on using Nagios high availability. ...

Andrew Widdersheim's presentation on using Nagios high availability.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Statistics

Views

Total Views
3,113
Views on SlideShare
3,082
Embed Views
31

Actions

Likes
0
Downloads
27
Comments
0

2 Embeds 31

http://exchange.nagios.org 30
https://twitter.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you Presentation Transcript

    • Nagios Is Down andYour Boss Wants to See You Andrew Widdersheim awiddersheim@inetu.net
    • Nooooooooooo!!! 2012 2
    • Breaking News! 2012 3
    • Nagios High Availability Options Merlin by op5 Classic method described in Nagios Core documentation Some type of virtualized solution like VMWare or… 2012 4
    • Nagios High Availability + = Win 2012 5
    • DRBD magic 2012 6
    • DRBD magic Linbit Free Runs in Kernel either by module or in the mainline code if Kernel is new enough Each server gets its own independent storage Able to maintain the data’s consistency between the nodes Resource level fencing 2012 7
    • DRBD considerations DRBD is as fast as the slowest node Network latency Replication over great distances can be done DRBD proxy can increase performance over great distances but does cost money Recommend using dedicated cross-over link for best performance Protocol Choices Protocol A: write IO is reported as completed, if it has reached local disk and local TCP send buffer. Protocol B: write IO is reported as completed, if it has reached local disk and remote buffer cache. Protocol C: write IO is reported as completed, if it has reached both local and remote disk. 2012 8
    • Pacemaker 2012 9
    • Pacemaker + DRBD + Nagios Nagios Apache NPCD NCSA Nagios Stuff rrdcached VIP 192.168.1.57 Filesystem ext4 DRBD Primary Secondary Resource Manager Pacemaker Messaging CoroSync / Heartbeat Hardware Node1 Node2 2012 10
    • Pacemaker + DRBD + Nagios Nagios Apache NPCD NCSA Nagios Stuff rrdcached VIP 192.168.1.57 Filesystem ext4 DRBD Primary Secondary Resource Manager Pacemaker Messaging CoroSync / Heartbeat Hardware Node1 Node2 2012 11
    • Pacemaker + DRBD + Nagios Nagios Apache NPCD NCSA rrdcached 192.168.1.57 ext4 DRBD Secondary Primary Resource Manager Pacemaker Messaging CoroSync / Heartbeat Hardware Node1 Node2 2012 12
    • Pacemaker and Nagios 2012 13
    • Pacemaker and Nagiosprimitive p_fs_nagios ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4“ options="noatime" op start interval="0" timeout="60s" op stop interval="0" timeout="180s" op monitor interval="30s" timeout="40s" primitive p_nagios lsb:nagios op start interval="0" timeout="180s" op stop interval="0" timeout="40s" op monitor interval="30s" meta target-role="Started"group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache p_nagios_syslog-ng p_nagios meta target-role="Started" 2012 14
    • Pacemaker and Nagios 2012 15
    • Pacemaker considerations Redundant communication links are a must Recommend use of crossover to help accomplish this Init scripts for Nagios must be LSB compliant… some are not 2012 16
    • What to replicate? Configuration Host Service Multi check command files Webinject command files PNP4Nagios RRD’s Nagios log files retention.dat Mail Queue (eh…) 2012 17
    • Everything else? Binaries and main configuration files installed using packages independently on each server Able to update one node at a time Easy to roll back should there be an issue Version/change management Consistent build process NDO and MySQL hosted on separate HA cluster 2012 18
    • RPM’s Build and maintain our own RPM’s Lets us configure everything to our liking Lets us update at our own pace Controlled through SVN with a post-commit to automatically update our own Nagios repository with new packages/updates. Then it is as simple as doing “yum update” on your servers. A lot of upfront work but was worth it 2012 19
    • How has this helped? Have been able to repair, upgrade and move hardware with minimal downtime Updated OS and restart server with minimal downtime Able to update to 3.4.1 and promptly patch issue affecting Nagios downtime’s that was not caught in QA CGI pages of death 2012 20
    • What doesn’t this solve? Having an HA cluster is great but there are still things that can go wrong having a cluster does not solve Configuration issues are probably the most prevalent thing we run into that might bring down Nagios without there being a major hardware/DC issue We make use of NagiosQL which does a backup when a configuration is changed. This allows us to rollback unwanted changes but isn’t the best. 2012 21
    • Two is better than one Setting up another cluster for “development” with similar hardware and software is a great way to test things outside of production Lets you spot potential problems before they become a problem 2012 22
    • Monitoring your cluster check_crm http://exchange.nagios.org/directory/Plugins/Clustering- and-High-2DAvailability/Check-CRM/details check_drbd http://exchange.nagios.org/directory/Plugins/Operating- Systems/Linux/check_drbd/details check_heartbeat_link http://exchange.nagios.org/directory/Plugins/Operating- Systems/Linux/check_heartbeat_link/details 2012 23
    • Gotcha’s RPM’s and symlinks in an HA solution are bad Symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc when node is secondary and you update RPM your symlink will get blown away Restarting services controlled by Pacemaker should be done within Pacemaker crm resource restart p_nagios 2012 24
    • Quick Stats Thousands of host and service checks Average check latency ~.300 sec Average checks per second ~70 Mostly active checks polling every 5 minutes DL360 G5 6 146GB 10k SAS drives in RAID10 2 quad core E5450 @ 3.00GHz 8GB Memory 2012 25
    • Tuning RAM disk for check results queue, NPCD queue, objects.cache and status.dat NDOUtils with async patch Built in since version 1.5 Limit what you send to NDOUtils Bulk Mode with npcdmod rrdcached Restarting Nagios through external command eventually resulted in higher latencies for some reason Large installation tweaks Disable environment macros A lot of trial and error with scheduling and reaper frequencies Small amount of check optimization Measuring Nagios performance using PNP4Nagios is a must 2012 26
    • RAM disk + ndo-async + rrdcached 2012 27
    • non-external command file restarts 2012 28
    • nsca-2.9 2012 29
    • One Year’s Progress 2012 30
    • How we run today 2012 31
    • Quick Stats Questions? 2012 32