Staying Sane with Nagios

From an invited talk I did at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out.

In the ensuing years, I've automated more, but still have the same kind of mindset about inheritance and so on.

Staying Sane with Nagios

  1. Staying Sane with Nagios
       • Matt Simmons
       • @standaloneSA
       • standalone.sysadmin@gmail.com
       • http://www.standalone-sysadmin.com
  2. Introduction & Outline
       • Confessions:
           • I am not actually a Nagios expert
           • I do actually LIKE Nagios
       • Outline:
           • Global Sanity
           • Small & Medium Shops
           • Large Scale Shops
           • Add Ons
           • Warnings
           • Additional Resources
  3. I know what you're thinking... Nagios? Sane??? Unlikely!!! Serenity Now!!!
  4. Nagios? SANE?!? Serenity Now!!!
  5. Global Sanity
       • Universal advice
       • Affects installations of all sizes
       • Documentation
       • Centralized Authentication
       • Plugin Development
  6. Global Sanity: Documentation
       • Read the documentation
           • Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
           • Use 3_0 when searching
           • Bookmark the good ones
       • Nagiosbook.org will soon be coming out with 3.x docs
           • http://www.nagiosbook.org/
  7. Global Sanity: Central Auth
       • Centralized Authentication
           • LDAP / AD with Apache (I use Likewise Open)
           • Domain users -> Nagios contacts
               • msimmons@EXAMPLE.COM
           • Access to CGI interface
  8. Global Sanity: Do Not Reinvent the Wheel...
       • Nagios Exchange
           • http://exchange.nagios.org/
       • Pros:
           • Nearly 2000 listings
           • >1600 plugins
       • Cons:
           • Varying quality and reliability
           • Old, unmaintained, code rot, etc.
  9. Global Sanity: ...unless you have to
       • Writing your own Nagios plugins
           • Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
           • Extended output
           • Huge community
           • Any language you want
  10. Small & Medium Shops
       • Not exclusively small or medium, just a non-automatic way of doing things
       • For people who:
           • Manually edit / create entries in config files
           • Don't use extensive 3rd party management software
           • Have a small team of responsible admins
           • Don't require large distributed monitoring networks
  11. Configuration Sanity
       • When:
           • Creating new configs
           • Working with existing configs
           • Testing
           • Responding to events
  12. Syntax Highlighting: This?
  13. Syntax Highlighting: Or this?
  14. Config File Hierarchy
       • Default config is stupid.
       • cfg_dir directive is key
           • *.cfg – recursively
       • Hierarchy should resemble “real life”
       • Allows for additional “group” security
       • Use what makes sense to you and document it
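
As a minimal sketch of the cfg_dir approach from slide 14: the fragment below would go in the main nagios.cfg file. The /usr/local/nagios/etc/objects prefix is an assumption (packaged installs often live under /etc/nagios instead), and the subdirectory names simply mirror the example hierarchy used in this talk.

```cfg
# nagios.cfg (main configuration file)
# Assumed install prefix: /usr/local/nagios. Adjust to your layout.

# Each cfg_dir is searched, including its subdirectories, for files
# ending in .cfg, so new subdirectories need no change to this file.
cfg_dir=/usr/local/nagios/etc/objects/commands
cfg_dir=/usr/local/nagios/etc/objects/computers
cfg_dir=/usr/local/nagios/etc/objects/network
cfg_dir=/usr/local/nagios/etc/objects/misc
```

After changing the layout, running `nagios -v` against the main config file (for example `nagios -v /usr/local/nagios/etc/nagios.cfg`, path assumed as above) verifies the configuration before a restart.
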
  15. Config File Hierarchy: Example
       Output of “tree -d” on my Nagios objects directory:

       |-- commands
       |-- computers
       |   |-- groups
       |   |-- linux
       |   |   `-- services
       |   `-- windows
       |-- misc
       `-- network
           |-- firewalls
           |-- links
           |-- routers
           `-- switches
  16. Regular Expressions
       • Not all regexes are created equal
       • use_regexp_matching
           • Only when object names contain * or ?
       • use_true_regexp_matching
           • 'man regex'
           • All object names
           • Caution: unintended consequences
  17. Better Object Formatting: This?
  18. Better Object Formatting: Or this?
  19. Revision Control
       • CVS/SVN/git(?)
       • Simple, maintainable, recoverable
       • Self-documenting (if done correctly)
  20. (ab)Use Inheritance
       • Templates
           • register = 0
       • Multiple Inheritance
       • Beware the spaghetti code
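
To make slide 20 concrete, here is a hedged sketch of template inheritance in Nagios 3 object syntax. The generic-host name and the linux-servers hostgroup come from the hostgroup example in this deck; check-host-alive and the workhours timeperiod are assumed to exist as in the stock sample configuration; the quiet-after-hours template is purely hypothetical. This is not a complete config: other required directives (contacts, check_period and so on) are left out for brevity.

```cfg
# Base template: "register 0" means it is never loaded as a real host.
define host{
    name                    generic-host        ; template name, referenced by "use"
    check_command           check-host-alive
    max_check_attempts      3
    notification_interval   60
    register                0
}

# A second template layered on top of the first.
define host{
    name                    linux-host
    use                     generic-host
    hostgroups              linux-servers
    register                0
}

# Hypothetical template, present only to show multiple inheritance.
define host{
    name                    quiet-after-hours
    notification_period     workhours
    register                0
}

# A real host. When templates disagree, the leftmost one listed wins.
define host{
    use                     linux-host,quiet-after-hours
    host_name               linux01
    address                 192.168.0.10
}
```

Two or three layers like this stay readable; much deeper chains are where the spaghetti warning starts to apply.
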
  21. Use Hostgroups

       define service{
           service_description   SSH Service Check
           check_command         check_ssh
           host_name             linux01, linux02, linux03, ... linux50
       }

  22. Use Hostgroups

       define hostgroup{
           hostgroup_name        linux-servers
       }

       define host{
           use                   generic-host
           host_name             linux01
           address               192.168.0.10
           hostgroups            linux-servers
       }

       define service{
           service_description   SSH service check
           check_command         check_ssh
           hostgroup_name        linux-servers
       }

  23. Script / Automate
       • Automate as much as possible
           • New services
           • New hosts
           • Commands
       • mkhost.sh as a template
  24. Use alternate contacts file when testing new features
       • Coworkers are under enough stress as it is
       • No messy explanations
       • Use symlinks to point to “real” contacts file
  25. Plugin Sanity
       • Thoughts about writing, configuring, and using Nagios plugins
  26. SNMP
       • Use it whenever possible. Really.
  27. NRPE vs check_by_ssh
       • NRPE: Nagios Remote Plugin Executor
       • Skip it when possible
           • Use SNMP
  28. When checking disk usage
       • Do not specify the partitions to check
       • Instead, specify the partitions to NOT check
           • Too easy to forget to add new partitions
       • If possible, use a plugin that produces statistics for graphing usage trends
  29. Notification Sanity
       • Notifications suck.
       • Here are some ways to make them not suck as much.
  30. Alternate Communication Method
       • When the network is down, email is down too
       • Have a non-email contact method
           • SMS, cell modem, smoke signals
       • Test it occasionally
  31. Use parents
       • Establish a path FROM THE NAGIOS SERVER
       • Failure will trigger “unreachable” states
           • “u” notification flag
       • Typically only useful for non-local-subnet hosts
           • If the local switch dies, alerts don't go out anyway
           • Typically
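
A small sketch of the parents idea from slide 31. The host names and addresses are made up for illustration, and generic-host is assumed to be a template like the one in the hostgroup example earlier in the deck.

```cfg
# The router that sits between the Nagios server and the remote subnet.
define host{
    use                     generic-host
    host_name               router01
    address                 192.168.0.1
}

# A host behind that router. If router01 is down, Nagios reports
# web01 as UNREACHABLE instead of DOWN.
define host{
    use                     generic-host
    host_name               web01
    address                 10.10.0.20
    parents                 router01
    notification_options    d,r      ; no "u": stay quiet when merely unreachable
}
```

With this in place, a dead router should produce one DOWN alert for router01 rather than one alert per host behind it.
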
  32. Use Dependencies
       • Available for both hosts and services
           • The disks didn't blow up, SNMP crashed
           • What do you mean, the website is unavailable when the database crashes
       • Dependencies != parents
           • Parents establish a line between the host and Nagios
           • Dependencies establish logical object relationships
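
The website-and-database case from slide 32 could be expressed roughly as follows. The host names and service descriptions (db01, MySQL, web01, Website) are hypothetical and would need to match services that already exist in your configuration.

```cfg
# "Website" on web01 depends on "MySQL" on db01.
define servicedependency{
    host_name                       db01
    service_description             MySQL
    dependent_host_name             web01
    dependent_service_description   Website
    execution_failure_criteria      n        ; n = none: keep checking the website
    notification_failure_criteria   w,u,c    ; suppress website alerts while MySQL is not OK
}
```

The design choice here is to keep collecting check results for the dependent service while silencing only its notifications, so the history still shows what the website was doing during the database outage.
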
  33. Notifications are Commands
       • Use them
           • Execute what you need, when you need, where you need through extra-nagios scripts
       • Your imagination is the limit
           • Electrical relays?
           • Flashing lights?
           • HALON release?
               • Please don't.
  34. Use Passive Checks (when necessary / appropriate)
       • For “normal” passive checks, specify freshness checks
       • Useful for SNMP traps
           • Combine with snmptrapd
       • Distributed Monitoring
           • Use for capacity reasons
           • Physical separation calls for separate Nagios installs (in my opinion)
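
As a sketch of the freshness advice on slide 34, assuming a nightly job that submits its own passive result: the service name, the 26-hour threshold, and the no-backup-report command name are illustrative choices, and generic-service is assumed to be a template like the one in the stock sample configuration. check_dummy is the standard plugin that returns whatever state it is given.

```cfg
# Runs only when the last passive result has gone stale;
# check_dummy simply returns the state passed to it (2 = CRITICAL).
define command{
    command_name    no-backup-report
    command_line    $USER1$/check_dummy 2 "CRITICAL: no backup result received"
}

define service{
    use                     generic-service
    host_name               linux01
    service_description     Nightly Backup
    active_checks_enabled   0        ; results arrive passively only
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     93600    ; seconds (26 hours)
    check_command           no-backup-report
}
```

Without the freshness check, a job that silently stops reporting would leave the service showing its last OK result indefinitely.
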
  35. Macros GOOD
       • 60 bajillion available
           • http://nagios.sourceforge.net/docs/3_0/macrolist.html
       • On Demand Macros
           • Specify “remote” macros from other hosts
           • $HOSTMACRO:SOMEHOST$
       • Custom Variable Macros
           • _MACADDRESS 00:01:02:03:04:05
           • $_HOSTMACADDRESS$
       • Available as environment variables in scripts
           • $NAGIOS_MACRONAME
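
The two macro styles from slide 35 might be used like this. The _MACADDRESS value is the one from the slide; wake_host.sh is a hypothetical script (not a real plugin), router01 is a made-up host name, and the check_ping thresholds are arbitrary.

```cfg
# Custom variable: defined on the host with a leading underscore.
define host{
    use             generic-host
    host_name       linux01
    address         192.168.0.10
    _MACADDRESS     00:01:02:03:04:05
}

# Custom variable macro: $_HOST<NAME>$ expands to the value set above.
# wake_host.sh is hypothetical and stands in for your own script.
define command{
    command_name    wake-host
    command_line    $USER1$/wake_host.sh $_HOSTMACADDRESS$
}

# On-demand macro: read the address of a different host by name.
define command{
    command_name    check-gateway-ping
    command_line    $USER1$/check_ping -H $HOSTADDRESS:router01$ -w 100.0,20% -c 500.0,60%
}
```
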
  36. Use Flap Detection
       • Or not. Who wants a charged cellphone battery?
       • Measures state changes:
           • Weighted measure of the last 21 checks
           • More recent counts higher
  37. Large Shops
       • Too many nodes to easily configure by hand, or too many nodes to deal with using one server
           • Scaling Nagios
           • Centralized Management
           • Web Configurators
  38. Scaling Nagios
       • large_installation_tweaks
           • No summary macros, memory handling is different, and processes fork() less
       • Distributed monitoring
           • Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
       • Check tuning docs:
           • http://nagios.sourceforge.net/docs/3_0/tuning.html
  39. Centralized Management
       • Puppet / Chef / CFEngine / whatever
           • Distribute nagios user's key if necessary
           • Install Nagios agents (NSCA / NRPE)
           • Automate configuration build
       • Puppet's built-in Nagios types sound convenient...sort of
  40. Nagios Web Configuration
       • Dozens, if not hundreds
       • I don't know of a great one.
       • May be worth building or finding one that matches your inventory system
           • Don't double up on data if you don't have to
  41. Malproductive Practices
       • Overreliance on Event Handlers
           • Please don't do anything terribly important.
           • Edge cases are scary.
       • Overabuse of inheritance
           • Spaghetti code
           • Hard to trace
       • Overcomplication
           • Simple is nearly always better
  42. Learn More
       • Mailing List
           • Nagios Users
           • https://lists.sourceforge.net/lists/listinfo/nagios-users
       • LinkedIn
           • Nagios Users
           • http://www.linkedin.com/groupAnswers?viewQuestions=&gid=
