Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

Andrew Widdersheim's presentation on using Nagios high availability.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Published in: Technology

Transcript of "Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you"

  1. Nagios Is Down and Your Boss Wants to See You
     Andrew Widdersheim, awiddersheim@inetu.net
  2. Nooooooooooo!!!
  3. Breaking News!
  4. Nagios High Availability Options
     • Merlin by op5
     • Classic method described in Nagios Core documentation
     • Some type of virtualized solution like VMware or…
  5. Nagios High Availability: [image] + [image] = Win
  6. DRBD magic
  7. DRBD magic
     • Linbit
     • Free
     • Runs in the kernel, either as a module or in the mainline code if the kernel is new enough
     • Each server gets its own independent storage
     • Able to maintain the data’s consistency between the nodes
     • Resource-level fencing
  8. DRBD considerations
     • DRBD is as fast as the slowest node
     • Network latency
         • Replication over great distances can be done
         • DRBD Proxy can increase performance over great distances but does cost money
     • Recommend using a dedicated crossover link for best performance
     • Protocol choices
         • Protocol A: write I/O is reported as completed if it has reached local disk and local TCP send buffer.
         • Protocol B: write I/O is reported as completed if it has reached local disk and remote buffer cache.
         • Protocol C: write I/O is reported as completed if it has reached both local and remote disk.
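The three protocol definitions on the slide above differ only in how far a write must have travelled before it is acknowledged. A minimal Python sketch of that rule follows; it is an illustrative model, not a DRBD API, and the `WriteProgress` class and `write_completed` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class WriteProgress:
    """Where one write has landed so far (hypothetical model, not part of DRBD)."""
    local_disk: bool = False
    local_tcp_send_buffer: bool = False
    remote_buffer_cache: bool = False
    remote_disk: bool = False


def write_completed(protocol: str, progress: WriteProgress) -> bool:
    """Return True if the write would be reported as completed under the protocol."""
    if protocol == "A":
        # Asynchronous: local disk plus the local TCP send buffer.
        return progress.local_disk and progress.local_tcp_send_buffer
    if protocol == "B":
        # Memory-synchronous: local disk plus the peer's buffer cache.
        return progress.local_disk and progress.remote_buffer_cache
    if protocol == "C":
        # Synchronous: both local and remote disk.
        return progress.local_disk and progress.remote_disk
    raise ValueError(f"unknown protocol: {protocol!r}")


# A write that is on the local disk and queued for sending, but not yet on the peer.
in_flight = WriteProgress(local_disk=True, local_tcp_send_buffer=True)
for name in ("A", "B", "C"):
    print(name, write_completed(name, in_flight))
```

Under this model only Protocol A acknowledges the in-flight write, which is the trade-off the slide describes: A gives the lowest write latency, C the strongest guarantee that the peer has the data.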
  9. Pacemaker
  10. Pacemaker + DRBD + Nagios (stack diagram, top to bottom)
      • Nagios Stuff: Nagios, Apache, NPCD, NSCA, rrdcached
      • VIP: 192.168.1.57
      • Filesystem: ext4
      • DRBD: Primary | Secondary
      • Resource Manager: Pacemaker
      • Messaging: Corosync / Heartbeat
      • Hardware: Node1 | Node2
  11. Pacemaker + DRBD + Nagios (stack diagram, top to bottom)
      • Nagios Stuff: Nagios, Apache, NPCD, NSCA, rrdcached
      • VIP: 192.168.1.57
      • Filesystem: ext4
      • DRBD: Primary | Secondary
      • Resource Manager: Pacemaker
      • Messaging: Corosync / Heartbeat
      • Hardware: Node1 | Node2
  12. Pacemaker + DRBD + Nagios (stack diagram, top to bottom)
      • Nagios, Apache, NPCD, NSCA, rrdcached
      • 192.168.1.57
      • ext4
      • DRBD: Secondary | Primary
      • Resource Manager: Pacemaker
      • Messaging: Corosync / Heartbeat
      • Hardware: Node1 | Node2
  13. Pacemaker and Nagios
  14. Pacemaker and Nagios
        primitive p_fs_nagios ocf:heartbeat:Filesystem \
            params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4" options="noatime" \
            op start interval="0" timeout="60s" \
            op stop interval="0" timeout="180s" \
            op monitor interval="30s" timeout="40s"
        primitive p_nagios lsb:nagios \
            op start interval="0" timeout="180s" \
            op stop interval="0" timeout="40s" \
            op monitor interval="30s" \
            meta target-role="Started"
        group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache p_nagios_syslog-ng p_nagios \
            meta target-role="Started"
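The order of members in the `g_nagios` group matters: a Pacemaker group starts its members in the order listed and stops them in reverse. The Python sketch below only illustrates that ordering for the group on the slide; `start_order` and `stop_order` are hypothetical helpers, not Pacemaker calls.

```python
# Members of g_nagios, copied in the order given on the slide.
G_NAGIOS = [
    "p_fs_nagios",
    "p_nagios_ip",
    "p_nagios_bacula",
    "p_nagios_mysql",
    "p_nagios_rrdcached",
    "p_nagios_npcd",
    "p_nagios_nsca",
    "p_nagios_apache",
    "p_nagios_syslog-ng",
    "p_nagios",
]


def start_order(members):
    """A group starts its members in the order they are listed."""
    return list(members)


def stop_order(members):
    """A group stops its members in the reverse of the listed order."""
    return list(reversed(members))


print("start:", " -> ".join(start_order(G_NAGIOS)))
print("stop: ", " -> ".join(stop_order(G_NAGIOS)))
```

This is why the DRBD-backed filesystem comes first and `p_nagios` last: Nagios only starts once its replicated data and supporting services are up, and it is the first thing stopped on failover.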
  15. Pacemaker and Nagios
  16. Pacemaker considerations
      • Redundant communication links are a must
          • Recommend use of crossover to help accomplish this
      • Init scripts for Nagios must be LSB compliant… some are not
  17. What to replicate?
      • Configuration
          • Host
          • Service
          • Multi check command files
          • Webinject command files
      • PNP4Nagios RRDs
      • Nagios log files
      • retention.dat
      • Mail queue (eh…)
  18. Everything else?
      • Binaries and main configuration files installed using packages independently on each server
          • Able to update one node at a time
          • Easy to roll back should there be an issue
          • Version/change management
          • Consistent build process
      • NDO and MySQL hosted on a separate HA cluster
  19. RPMs
      • Build and maintain our own RPMs
          • Lets us configure everything to our liking
          • Lets us update at our own pace
      • Controlled through SVN with a post-commit hook to automatically update our own Nagios repository with new packages/updates. Then it is as simple as doing “yum update” on your servers.
      • A lot of upfront work but was worth it
  20. How has this helped?
      • Have been able to repair, upgrade and move hardware with minimal downtime
      • Updated the OS and restarted servers with minimal downtime
      • Able to update to 3.4.1 and promptly patch an issue affecting Nagios downtimes that was not caught in QA
      • CGI pages of death
  21. What doesn’t this solve?
      • Having an HA cluster is great, but there are still things that can go wrong that a cluster does not solve
      • Configuration issues are probably the most prevalent thing we run into that might bring down Nagios without there being a major hardware/DC issue
      • We make use of NagiosQL, which does a backup when a configuration is changed. This allows us to roll back unwanted changes but isn’t the best.
  22. Two is better than one
      • Setting up another cluster for “development” with similar hardware and software is a great way to test things outside of production
      • Lets you spot potential problems before they become a problem
  23. Monitoring your cluster
      • check_crm
          http://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
      • check_drbd
          http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_drbd/details
      • check_heartbeat_link
          http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_heartbeat_link/details
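To give a feel for what a DRBD health check looks at, here is a minimal Python sketch that classifies one resource line in the style of `/proc/drbd`. It is not the `check_drbd` plugin linked above; the sample line assumes the DRBD 8.x status format (`cs:`, `ro:`, `ds:` fields), and `check_drbd_line` is a hypothetical name. The return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed sample of a healthy resource line in DRBD 8.x /proc/drbd output.
SAMPLE = " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----"

# Picks out the connection state (cs), roles (ro) and disk states (ds).
FIELD = re.compile(r"\b(cs|ro|ds):(\S+)")


def check_drbd_line(line):
    """Return (exit_code, message) for one resource line."""
    fields = dict(FIELD.findall(line))
    cs, ds = fields.get("cs"), fields.get("ds")
    if cs is None or ds is None:
        return UNKNOWN, "could not parse DRBD status line"
    if cs == "Connected" and ds == "UpToDate/UpToDate":
        return OK, f"cs:{cs} ds:{ds}"
    if cs in ("SyncSource", "SyncTarget"):
        # A resync is running; data is not yet fully redundant.
        return WARNING, f"resync in progress, cs:{cs} ds:{ds}"
    return CRITICAL, f"cs:{cs} ds:{ds}"


code, message = check_drbd_line(SAMPLE)
print(code, message)
```

A real plugin would read every resource from the status file and exit with the worst code found; this sketch stops at the per-line decision.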
  24. Gotchas
      • RPMs and symlinks in an HA solution are bad
          • Symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc: when the node is secondary and you update the RPM, your symlink will get blown away
      • Restarting services controlled by Pacemaker should be done within Pacemaker
          • crm resource restart p_nagios
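One way to catch the symlink gotcha early is to verify the link after every package update. The Python sketch below is one possible check, not something from the talk: `symlink_intact` is a hypothetical helper, and the demo assumes that "blown away" means the link was replaced by a plain directory, simulated here inside a temporary directory rather than under `/usr/local/nagios`.

```python
import os
import tempfile


def symlink_intact(link_path, expected_target):
    """Return True if link_path is still a symlink pointing at expected_target."""
    link_path = link_path.rstrip("/") or "/"
    if not os.path.islink(link_path):
        return False
    return os.path.normpath(os.readlink(link_path)) == os.path.normpath(expected_target)


def demo():
    """Simulate the gotcha: the symlink is replaced by a plain directory."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "drbd", "r1", "nagios", "etc")
        link = os.path.join(root, "nagios-etc")
        os.makedirs(target)
        os.symlink(target, link)
        before = symlink_intact(link + "/", target)
        # Assumed effect of the package update: the link becomes a real directory.
        os.unlink(link)
        os.mkdir(link)
        after = symlink_intact(link, target)
        return before, after


print(demo())
```

On a real node the call would look like `symlink_intact("/usr/local/nagios/etc/", "/drbd/r1/nagios/etc")`, run on both the primary and the secondary after an update.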
  25. Quick Stats
      • Thousands of host and service checks
      • Average check latency ~.300 sec
      • Average checks per second ~70
      • Mostly active checks polling every 5 minutes
      • DL360 G5
          • 6 146GB 10k SAS drives in RAID10
          • 2 quad-core E5450 @ 3.00GHz
          • 8GB memory
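The check rate and polling interval above give a rough sense of scale. The Python sketch below multiplies the two; it assumes every check runs exactly once per 5-minute interval, which the slide only says is "mostly" true, so the result is a ballpark figure rather than a count from the talk. `estimated_active_checks` is a hypothetical helper.

```python
def estimated_active_checks(checks_per_second, interval_seconds):
    """Checks needed to sustain a given rate if each runs once per interval."""
    return round(checks_per_second * interval_seconds)


# ~70 checks per second, each check polled every 5 minutes (300 seconds).
print(estimated_active_checks(70, 5 * 60))
```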
  26. Tuning
      • RAM disk for check results queue, NPCD queue, objects.cache and status.dat
      • NDOUtils with async patch
          • Built in since version 1.5
          • Limit what you send to NDOUtils
      • Bulk mode with npcdmod
      • rrdcached
      • Restarting Nagios through external command eventually resulted in higher latencies for some reason
      • Large installation tweaks
      • Disable environment macros
      • A lot of trial and error with scheduling and reaper frequencies
      • Small amount of check optimization
      • Measuring Nagios performance using PNP4Nagios is a must
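Several of the tuning items above map to directives in the Nagios Core main configuration file: `use_large_installation_tweaks`, `enable_environment_macros`, `check_result_path`, `status_file` and `object_cache_file`. The Python sketch below audits a config excerpt for them. The excerpt, the `/mnt/ramdisk` mount point and both function names are assumptions made for illustration; the talk does not show its actual settings.

```python
# Hypothetical nagios.cfg excerpt; paths and values are made up for the example.
SAMPLE = """
# main config excerpt
use_large_installation_tweaks=1
enable_environment_macros=1
check_result_path=/mnt/ramdisk/checkresults
status_file=/usr/local/nagios/var/status.dat
object_cache_file=/mnt/ramdisk/objects.cache
"""


def parse_main_cfg(text):
    """Parse key=value lines, skipping blanks and comments.

    Repeatable directives such as cfg_file keep only their last value here,
    which is fine for the single-valued settings this sketch inspects.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def tuning_findings(settings, ramdisk):
    """Return the directives that do not match the tuning advice on the slide."""
    findings = []
    if settings.get("use_large_installation_tweaks") != "1":
        findings.append("use_large_installation_tweaks")
    if settings.get("enable_environment_macros") != "0":
        findings.append("enable_environment_macros")
    for key in ("check_result_path", "status_file", "object_cache_file"):
        if not settings.get(key, "").startswith(ramdisk):
            findings.append(key)
    return findings


print(tuning_findings(parse_main_cfg(SAMPLE), "/mnt/ramdisk"))
```

In the sample, environment macros are still enabled and `status_file` lives on regular disk, so those two directives are reported.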
  27. RAM disk + ndo-async + rrdcached
  28. non-external command file restarts
  29. nsca-2.9
  30. One Year’s Progress
  31. How we run today
  32. Quick Stats Questions?