Staying Sane with Nagios

From an invited talk I did at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out.

In the ensuing years, I've automated more, but still have the same kind of mindset about inheritance and so on.

Published in: Technology

Transcript

  • 1. Staying Sane with Nagios Matt Simmons @standaloneSA standalone.sysadmin@gmail.com http://www.standalone-sysadmin.com
  • 2. Introduction & Outline  Confessions: I am not actually a Nagios expert; I do actually LIKE Nagios.  Outline: Global Sanity; Small & Medium Shops; Large Scale Shops; Add Ons; Warnings; Additional Resources
  • 3. I know what you're thinking... Nagios? Sane??? Unlikely!!! Serenity Now!!!
  • 4. Nagios? SANE?!? Serenity Now!!!
  • 5. Global Sanity  Universal Advice  Affects installations of all sizes  Documentation  Centralized Authentication  Plugin Development
  • 6. Global Sanity: Documentation  Read the documentation  Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html  Use 3_0 when searching  Bookmark the good ones  Nagiosbook.org will soon be coming out with 3.x docs: http://www.nagiosbook.org/
  • 7. Global Sanity: Central Auth  Centralized Authentication  LDAP / AD with Apache   (I use Likewise Open) Domain users -> Nagios Contacts   msimmons@EXAMPLE.COM Access to CGI interface
  • 8. Global Sanity: Do Not Reinvent the Wheel...  Nagios Exchange  http://exchange.nagios.org/  Pros:    Nearly 2000 Listings >1600 plugins Cons:   Varying quality and reliability Old, unmaintained, code rot, etc
  • 9. Global Sanity: ...unless you have to  Writing your own Nagios Plugins  Great guide  http://nagiosplug.sourceforge.net/developer-guidelines.html  Extended Output  Huge Community  Any language you want
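A minimal sketch of a plugin in Python, assuming the conventions described in the developer guidelines linked above: one line of status text, optional performance data after a `|`, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The load-average check and its thresholds are hypothetical and not from the talk.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios plugin: report the 1-minute load average."""
import os

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a Nagios state (higher is worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(state, value, warn, crit):
    """Build 'status text | perfdata' with perfdata as label=value;warn;crit."""
    text = "LOAD %s - load1 is %.2f" % (LABELS[state], value)
    perfdata = "load1=%.2f;%.2f;%.2f" % (value, warn, crit)
    return "%s | %s" % (text, perfdata)


def main(warn=4.0, crit=8.0):  # hypothetical thresholds
    """Print one status line and return the state to use as the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        print("LOAD UNKNOWN - could not read load average")
        return UNKNOWN
    state = evaluate(load1, warn, crit)
    print(format_output(state, load1, warn, crit))
    return state

# A real plugin would finish with: import sys; sys.exit(main())
```

The exit code is what Nagios acts on; the text is for humans and the perfdata is for graphing add-ons.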
  • 10. Small & Medium Shops   Not exclusively small or medium, just a nonautomatic way of doing things For people who:  Manually edit / create entries in config files  Don't use extensive 3rd party management software  Have a small team of responsible admins  Don't require large distributed monitoring networks
  • 11. Configuration Sanity  When:  Creating new configs  Working with existing configs  Testing  Responding to events
  • 12. Syntax Highlighting This?
  • 13. Syntax Highlighting Or this?
  • 14. Config File Hierarchy  Default config is stupid.  cfg_dir directive is key  *.cfg – recursively  Hierarchy should resemble “real life”  Allows for additional “group” security  Use what makes sense to you and document it
  • 15. Config File Hierarchy: Example Output of “tree -d” on my Nagios objects directory |-- commands |-- computers | |-- groups | |-- linux | | `-- services | `-- windows |-- misc `-- network |-- firewalls |-- links |-- routers `-- switches
  • 16. Regular Expressions  Not all regexes are created equal  use_regexp_matching: only when object names contain * or ?  use_true_regexp_matching: all object names are treated as regexes (see 'man regex')  Caution: Unintended consequences
  • 17. Better Object Formatting This?
  • 18. Better Object Formatting Or this?
  • 19. Revision Control  CVS/SVN/git(?)  Simple, maintainable, recoverable  Self-documenting (if done correctly)
  • 20. (ab)Use Inheritance  Templates  register = 0  Multiple Inheritance  Beware the spaghetti code
  • 21. Use Hostgroups define service{    service_description SSH Service Check    check_command check_ssh    host_name linux01, linux02, linux03, ... linux50 }
  • 22. Use Hostgroups define hostgroup{    hostgroup_name linux-servers } define host{    use generic-host    host_name linux01    address 192.168.0.10    hostgroups linux-servers } define service{    service_description SSH service check    check_command check_ssh    hostgroup_name linux-servers }
  • 23. Script / Automate  Automate as much as possible: New Services, New Hosts, Commands  mkhost.sh as a template
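The slide's mkhost.sh is not reproduced in the transcript, so the following is only a sketch of the same idea in Python: fill a host definition template from a few arguments. The function name `make_host` and the template layout are assumptions; the directive names mirror the hostgroup example on slide 22.

```python
HOST_TEMPLATE = """define host{{
    use         {template}
    host_name   {name}
    address     {address}
    hostgroups  {hostgroups}
}}
"""


def make_host(name, address, hostgroups, template="generic-host"):
    """Render one host definition; hypothetical stand-in for mkhost.sh."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("host_name must be non-empty and contain no whitespace")
    return HOST_TEMPLATE.format(
        template=template,
        name=name,
        address=address,
        hostgroups=",".join(hostgroups),
    )


# Example: print a definition that could be saved under the objects directory.
print(make_host("linux01", "192.168.0.10", ["linux-servers"]))
```

Writing the output into a file under a cfg_dir tree and committing it to revision control keeps generated hosts consistent with hand-written ones.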
  • 24. Use alternate contacts file when testing new features  Coworkers are under enough stress as it is  No messy explanations  Use symlinks to point to “real” contacts file
  • 25. Plugin Sanity Thoughts about writing, configuring, and using Nagios plugins
  • 26. SNMP Use it whenever possible. Really.
  • 27. NRPE vs check_by_ssh  NRPE: Nagios Remote Plugin Executor  Skip it when possible  Use SNMP
  • 28. When checking disk usage  Do not specify the partitions to check  Instead, specify the partitions to NOT check  Too easy to forget to add new partitions.  If possible, use a plugin that produces statistics for graphing usage trends
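The exclude-list approach can be sketched as below. This is an illustration rather than any particular plugin's logic: `check_disks` takes a hypothetical mapping of mount point to (used, total) bytes, checks everything that is not explicitly excluded, and emits performance data for graphing. The thresholds are assumptions.

```python
def select_partitions(mounted, excluded):
    """Everything mounted is checked unless it is explicitly excluded."""
    skip = set(excluded)
    return [mount for mount in mounted if mount not in skip]


def check_disks(usage, excluded=(), warn=80.0, crit=90.0):
    """Return (state, perfdata) where state is 0 OK, 1 WARNING, 2 CRITICAL.

    usage maps mount point -> (used_bytes, total_bytes).
    """
    state = 0
    perfdata = []
    for mount in select_partitions(sorted(usage), excluded):
        used, total = usage[mount]
        if total == 0:
            continue  # pseudo-filesystems report no capacity
        pct = 100.0 * used / total
        if pct >= crit:
            state = max(state, 2)
        elif pct >= warn:
            state = max(state, 1)
        perfdata.append("%s=%.1f%%;%.1f;%.1f" % (mount, pct, warn, crit))
    return state, " ".join(perfdata)
```

A partition added later shows up in `usage` and is checked without any configuration change, which is the point of the slide.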
  • 29. Notification Sanity   Notifications suck. Here are some ways to make them not suck as much.
  • 30. Alternate Communication Method  When the network is down, email is down too  Have a non-email contact method  SMS, cell modem, smoke signals  Test it occasionally
  • 31. Use parents  Establish a path FROM THE NAGIOS SERVER  Failure will trigger “unreachable” states   “u” notification flag Only useful for non-local-subnet hosts typically  If the local switch dies, alerts don't go out anyway  Typically
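A simplified model of how parent relationships separate "down" from "unreachable", assuming a host that fails its check counts as DOWN when at least one parent is up (or it has no parents) and UNREACHABLE otherwise. The function and data layout are hypothetical, and Nagios's real reachability logic has more cases than this sketch.

```python
def classify(host, parents, is_up):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for host.

    parents: dict of host -> list of parents on the path toward the Nagios server
    is_up:   dict of host -> bool, the result of that host's own check
    """
    if is_up[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        return "DOWN"  # nothing between this host and the Nagios server
    if any(classify(p, parents, is_up) == "UP" for p in host_parents):
        return "DOWN"  # a parent answers, so the fault is at this host
    return "UNREACHABLE"  # every path to the host is already broken
```

With web01 behind switch1 behind router1, a dead switch1 yields one DOWN (switch1) and one UNREACHABLE (web01), so only the switch needs to page anyone if the "u" flag is left off.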
  • 32. Use Dependencies  Available for both hosts and services  “The disks didn't blow up, SNMP crashed”  “What do you mean, the website is unavailable when the database crashes?”  Dependencies != parents: parents establish a line between the host and Nagios; dependencies establish logical object relationships
  • 33. Notifications are Commands  Use Them   Execute what you need, when you need, where you need through extra-nagios scripts Your imagination is the limit  Electrical relays?  Flashing lights?  HALON release?  Please don't.
  • 34. Use Passive Checks (when necessary / appropriate)  For “normal” passive checks, specify freshness checks  Useful for SNMP traps: combine with snmptrapd  Distributed Monitoring: use for capacity reasons; physical separation calls for separate Nagios installs (in my opinion)
  • 35. Macros GOOD  60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html  On Demand Macros: specify “remote” macros from other hosts, e.g. $HOSTMACRO:SOMEHOST$  Custom Variable Macros: _MACADDRESS 00:01:02:03:04:05 is referenced as $_HOSTMACADDRESS$  Available as environment variables in scripts: $NAGIOS_MACRONAME
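A sketch of a notification or event script reading macros from its environment, following the $NAGIOS_MACRONAME pattern on the slide. It assumes environment macros are enabled in the Nagios configuration, and that the custom host variable from the slide arrives as NAGIOS__HOSTMACADDRESS (prefix plus the macro name, hence the double underscore). The helper names are hypothetical.

```python
import os


def macro(name, environ=None, default=""):
    """Look up a Nagios macro exported as NAGIOS_<name>."""
    env = os.environ if environ is None else environ
    return env.get("NAGIOS_" + name, default)


def build_message(environ=None):
    """Compose a one-line notification from standard and custom macros."""
    host = macro("HOSTNAME", environ)
    state = macro("HOSTSTATE", environ)
    # Custom variable macro $_HOSTMACADDRESS$ (assumed env name, see above).
    mac = macro("_HOSTMACADDRESS", environ, default="unknown")
    return "%s is %s (MAC %s)" % (host, state, mac)
```

Passing `environ` explicitly makes the script easy to exercise outside Nagios before wiring it up as a command.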
  • 36. Use Flap Detection  Or not. Who wants a charged cellphone battery?  Measures state changes:  Weighted measure of the last 21 checks  More recent counts higher
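The weighted measure can be sketched as follows. Twenty-one check results give twenty possible state changes; each change is weighted so that recent ones count more, and the weighted total is expressed as a percentage. The linear weights from 0.75 (oldest) to 1.25 (newest) are an assumption for illustration, and the exact curve Nagios uses may differ.

```python
def flap_percent(states, low=0.75, high=1.25):
    """Weighted percent state change over a history, oldest result first.

    low/high are the assumed weights for the oldest and newest change.
    """
    changes = len(states) - 1
    if changes < 1:
        return 0.0
    total = 0.0
    for i in range(changes):
        if states[i] != states[i + 1]:
            if changes > 1:
                weight = low + (high - low) * i / (changes - 1)
            else:
                weight = 1.0
            total += weight
    return 100.0 * total / changes
```

Under these assumptions a single change in the most recent check contributes more than the same change twenty checks ago; the result is then compared against the configured low and high flap thresholds.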
  • 37. Large Shops Too many nodes to easily configure by hand, or too many nodes to deal with using one server  Scaling Nagios  Centralized Management  Web Configurators
  • 38. Scaling Nagios  large_installation_tweaks (the use_large_installation_tweaks option): no summary macros, memory handling is different, and processes fork() less  Distributed monitoring: assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)  Check tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html
  • 39. Centralized Management  Puppet / chef / cfengine / whatever  Distribute nagios user's key if necessary  Install nagios agents (NSCA / NRPE)  Automate Configuration Build  Puppet's built-in Nagios types sound convenient...sort of
  • 40. Nagios Web Configuration  Dozens, if not hundreds  I don't know of a great one.  May be worth building or finding one that matches your inventory system  Don't double-up on data if you don't have to
  • 41. Malproductive Practices  Overreliance on Event Handlers: please don't do anything terribly important; edge cases are scary.  Overabuse of inheritance: spaghetti code, hard to trace.  Overcomplication: simple is nearly always better.
  • 42. Learn More  Mailing List  Nagios Users   https://lists.sourceforge.net/lists/listinfo/nagios-users LinkedIn  Nagios Users  http://www.linkedin.com/groupAnswers?viewQuestions=&gid=