Staying Sane with Nagios

From an invited talk I did at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out.

In the ensuing years, I've automated more, but still have the same kind of mindset about inheritance and so on.


Staying Sane with Nagios: Presentation Transcript

  • Staying Sane with Nagios
    Matt Simmons (@standaloneSA)
    standalone.sysadmin@gmail.com
    http://www.standalone-sysadmin.com
  • Introduction & Outline
    Confessions:
    - I am not actually a Nagios expert
    - I do actually LIKE Nagios
    Outline:
    - Global Sanity
    - Small & Medium Shops
    - Large Scale Shops
    - Add Ons
    - Warnings
    - Additional Resources
  • I know what you're thinking... Nagios? Sane??? Unlikely!!! Serenity Now!!!
  • Nagios? SANE?!? Serenity Now!!!
  • Global Sanity
    - Universal advice
    - Affects installations of all sizes
    - Documentation
    - Centralized Authentication
    - Plugin Development
  • Global Sanity: Documentation
    - Read the documentation
    - Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
    - Use "3_0" when searching
    - Bookmark the good ones
    - Nagiosbook.org will soon be coming out with 3.x docs: http://www.nagiosbook.org/
  • Global Sanity: Central Auth
    - Centralized Authentication
    - LDAP / AD with Apache (I use Likewise Open)
    - Domain users -> Nagios contacts (msimmons@EXAMPLE.COM)
    - Access to CGI interface
  • Global Sanity: Do Not Reinvent the Wheel...
    - Nagios Exchange: http://exchange.nagios.org/
    - Pros: nearly 2000 listings, >1600 plugins
    - Cons: varying quality and reliability; old, unmaintained, code rot, etc.
  • Global Sanity: ...unless you have to
    - Writing your own Nagios plugins
    - Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
    - Extended output
    - Huge community
    - Any language you want
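
To make the plugin advice concrete, here is a minimal sketch of what a plugin boils down to: one line of output (optionally followed by "|" and performance data) and an exit code of 0, 1, 2 or 3. The function names and the "higher is worse" threshold logic are illustrative assumptions, not something from the talk.

```python
"""Sketch of a Nagios-style plugin: one status line plus an exit code."""
import sys

# Exit codes from the plugin developer guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a plugin state (assumes higher is worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, warn, crit):
    """Build the status line; the text after '|' is performance data."""
    state = evaluate(value, warn, crit)
    text = f"{label.upper()} {STATE_NAMES[state]} - {label} is {value}"
    perfdata = f"{label}={value};{warn};{crit}"
    return state, f"{text} | {perfdata}"


def run(label, value, warn, crit):
    """Print the status line and exit with the matching code."""
    state, line = format_output(label, value, warn, crit)
    print(line)
    sys.exit(state)
```

A real plugin would measure something and then call something like run("load", measured_value, 10, 20); anything it cannot determine should exit with UNKNOWN rather than guess.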
  • Small & Medium Shops
    Not exclusively small or medium, just a non-automated way of doing things. For people who:
    - Manually edit / create entries in config files
    - Don't use extensive 3rd-party management software
    - Have a small team of responsible admins
    - Don't require large distributed monitoring networks
  • Configuration Sanity
    When:
    - Creating new configs
    - Working with existing configs
    - Testing
    - Responding to events
  • Syntax Highlighting This?
  • Syntax Highlighting Or this?
  • Config File Hierarchy
    - Default config is stupid.
    - cfg_dir directive is key: picks up *.cfg recursively
    - Hierarchy should resemble "real life"
    - Allows for additional "group" security
    - Use what makes sense to you and document it
  • Config File Hierarchy: Example
    Output of "tree -d" on my Nagios objects directory:
    |-- commands
    |-- computers
    |   |-- groups
    |   |-- linux
    |   |   `-- services
    |   `-- windows
    |-- misc
    `-- network
        |-- firewalls
        |-- links
        |-- routers
        `-- switches
  • Regular Expressions
    - Not all regexes are created equal
    - use_regexp_matching: only when object names contain * or ?
    - use_true_regexp_matching: all object names (see 'man regex')
    - Caution: unintended consequences
  • Better Object Formatting This?
  • Better Object Formatting Or this?
  • Revision Control
    - CVS / SVN / git(?)
    - Simple, maintainable, recoverable
    - Self-documenting (if done correctly)
  • (ab)Use Inheritance
    - Templates
    - register 0 (marks a definition as a template only)
    - Multiple inheritance
    - Beware the spaghetti code
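
One way to keep inheritance from turning into spaghetti is to be able to print what an object actually ends up with. The sketch below is a simplified model of template resolution, not Nagios's own code: it assumes local values override inherited ones, that the first template named in a comma-separated "use" wins over later ones, and that "name" and "register" are not inherited. The resolve function and the dictionary layout are hypothetical.

```python
def resolve(name, objects, _seen=None):
    """Return the effective directives for an object, following 'use'.

    objects maps a template/object name to its dict of directives.
    Local values win; among several templates, the first listed wins.
    """
    seen = _seen or ()
    if name in seen:
        raise ValueError(f"inheritance loop at {name!r}")
    own = objects[name]
    merged = {}
    parents = [p.strip() for p in own.get("use", "").split(",") if p.strip()]
    # Apply templates last-to-first so earlier templates overwrite later ones.
    for parent in reversed(parents):
        merged.update(resolve(parent, objects, seen + (name,)))
    merged.update(own)
    # 'use', 'name' and 'register' describe the definition itself, not the result.
    for key in ("use", "name", "register"):
        merged.pop(key, None)
    return merged
```

Running something like this over the parsed config makes it easier to trace where a surprising value came from before the template tree gets too deep.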
  • Use Hostgroups
    define service{
        service_description  SSH Service Check
        check_command        check_ssh
        host_name            linux01, linux02, linux03, ... linux50
    }
  • Use Hostgroups
    define hostgroup{
        hostgroup_name       linux-servers
    }
    define host{
        use                  generic-host
        host_name            linux01
        address              192.168.0.10
        hostgroups           linux-servers
    }
    define service{
        service_description  SSH service check
        check_command        check_ssh
        hostgroup_name       linux-servers
    }
  • Script / Automate
    - Automate as much as possible: new services, new hosts, commands
    - mkhost.sh as a template
  • Use alternate contacts file when testing new features
    - Coworkers are under enough stress as it is
    - No messy explanations
    - Use symlinks to point to "real" contacts file
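
The slides mention mkhost.sh but do not show it, so the following is only a sketch of the same idea in Python: generate the host definition from a few arguments instead of typing it by hand. The render_host function, its parameters, and the default template name are assumptions for illustration.

```python
def render_host(host_name, address, hostgroups, template="generic-host"):
    """Render one 'define host' block as text."""
    if not host_name or any(c.isspace() for c in host_name):
        raise ValueError("host_name must be non-empty and contain no whitespace")
    lines = [
        "define host{",
        f"    use         {template}",
        f"    host_name   {host_name}",
        f"    address     {address}",
    ]
    if hostgroups:
        lines.append(f"    hostgroups  {','.join(hostgroups)}")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The output can be written to a new .cfg file under the directory tree that cfg_dir already scans, then checked into revision control like any hand-written config.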
  • Plugin Sanity Thoughts about writing, configuring, and using Nagios plugins
  • SNMP Use it whenever possible. Really.
  • NRPE vs check_by_ssh
    - NRPE: Nagios Remote Plugin Executor
    - Skip it when possible; use SNMP
  • When checking disk usage
    - Do not specify the partitions to check
    - Instead, specify the partitions to NOT check
    - It's too easy to forget to add new partitions
    - If possible, use a plugin that produces statistics for graphing usage trends
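
A rough sketch of the exclude-list approach: the check walks every mount point it is given and skips only the ones named in an exclusion set, so a newly added partition is covered by default. The check_partitions function, its parameters, and the injectable usage callback are illustrative assumptions; how the list of mount points is discovered is left to the caller.

```python
import shutil


def check_partitions(mount_points, exclude, warn_pct, crit_pct, usage=shutil.disk_usage):
    """Check every mount point except those in 'exclude'.

    Returns (exit_code, status_line) in the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL.
    """
    worst = 0
    perfdata = []
    problems = []
    for mount in mount_points:
        if mount in exclude:
            continue
        total, used, _free = usage(mount)
        pct = round(100.0 * used / total, 1) if total else 0.0
        # Perfdata for every checked partition, so usage trends can be graphed.
        perfdata.append(f"{mount}={pct}%;{warn_pct};{crit_pct}")
        if pct >= crit_pct:
            worst = max(worst, 2)
            problems.append(f"{mount} {pct}%")
        elif pct >= warn_pct:
            worst = max(worst, 1)
            problems.append(f"{mount} {pct}%")
    label = ("OK", "WARNING", "CRITICAL")[worst]
    summary = ", ".join(problems) if problems else "all checked partitions below thresholds"
    return worst, f"DISK {label} - {summary} | {' '.join(perfdata)}"
```

Passing the usage function in as a parameter is only there to make the logic testable without touching real filesystems.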
  • Notification Sanity
    Notifications suck. Here are some ways to make them not suck as much.
  • Alternate Communication Method
    - When the network is down, email is down too
    - Have a non-email contact method
    - SMS, cell modem, smoke signals
    - Test it occasionally
  • Use parents
    - Establish a path FROM THE NAGIOS SERVER
    - Failure will trigger "unreachable" states ("u" notification flag)
    - Typically only useful for non-local-subnet hosts
    - If the local switch dies, alerts don't go out anyway (typically)
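
The DOWN versus UNREACHABLE distinction can be sketched as a small function. This is a simplified model under stated assumptions, not Nagios's actual host-check logic: a host whose check fails is treated as DOWN if it has no parents or at least one parent is UP, and UNREACHABLE otherwise. The classify function and both dictionaries are hypothetical.

```python
def classify(host, parents, check_ok, _cache=None):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for a host.

    parents maps a host to the hosts between it and the Nagios server
    (an empty list means it sits on the server's own segment).
    check_ok maps a host to whether its own check succeeded.
    """
    cache = {} if _cache is None else _cache
    if host in cache:
        return cache[host]
    if check_ok[host]:
        state = "UP"
    else:
        ups = parents.get(host, [])
        # A failing host is only DOWN if it has no parents or at least one
        # parent is UP; otherwise the path to it is broken.
        if not ups or any(classify(p, parents, check_ok, cache) == "UP" for p in ups):
            state = "DOWN"
        else:
            state = "UNREACHABLE"
    cache[host] = state
    return state
```

With a chain of router, switch and web01, a dead router leaves only the router DOWN and everything behind it UNREACHABLE, which is what keeps one failure from paging for every host behind it.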
  • Use Dependencies
    - Available for both hosts and services
    - "The disks didn't blow up, SNMP crashed"
    - "What do you mean, the website is unavailable when the database crashes?"
    - Dependencies != parents
    - Parents establish a line between the host and Nagios
    - Dependencies establish logical object relationships
  • Notifications are Commands
    - Use them
    - Execute what you need, when you need, where you need, through extra-Nagios scripts
    - Your imagination is the limit: electrical relays? Flashing lights? HALON release? Please don't.
  • Use Passive Checks (when necessary / appropriate)
    - For "normal" passive checks, specify freshness checks
    - Useful for SNMP traps: combine with snmptrapd
    - Distributed monitoring: use for capacity reasons
    - Physical separation calls for separate Nagios installs (in my opinion)
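
For reference, a passive service result is submitted as a single PROCESS_SERVICE_CHECK_RESULT line in the external command format. The sketch below only builds that line; the passive_service_result helper and its validation rules are assumptions added for illustration.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    if any(";" in field or "\n" in field for field in (host, service)):
        raise ValueError("host and service names must not contain ';' or newlines")
    timestamp = int(time.time() if now is None else now)
    # Keep the plugin output on a single line; the command is newline-terminated.
    text = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{text}\n"
```

The resulting line is written to the external command file named by command_file in nagios.cfg (the path is installation-specific), or handed to NSCA when the result comes from another machine.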
  • Macros GOOD
    - 60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html
    - On-demand macros: specify "remote" macros from other hosts ($HOSTMACRO:SOMEHOST$)
    - Custom variable macros: _MACADDRESS 00:01:02:03:04:05 becomes $_HOSTMACADDRESS$
    - Available as environment variables in scripts: $NAGIOS_MACRONAME
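
As a small illustration of the environment-variable form, a notification or event script can read macros straight from its environment. The notification_summary function and the message layout are hypothetical; the sketch assumes environment macros are enabled (enable_environment_macros in nagios.cfg) and that the host defines the custom variable _MACADDRESS from the slide.

```python
import os


def notification_summary(environ=os.environ):
    """Build a one-line message from Nagios macro environment variables."""
    def macro(name, default="(unset)"):
        # Macros are exported with a NAGIOS_ prefix, e.g. $HOSTNAME$ -> NAGIOS_HOSTNAME.
        return environ.get("NAGIOS_" + name, default)

    line = f"{macro('NOTIFICATIONTYPE')}: {macro('HOSTNAME')} is {macro('HOSTSTATE')}"
    # Custom host variable _MACADDRESS -> $_HOSTMACADDRESS$ -> NAGIOS__HOSTMACADDRESS.
    mac = environ.get("NAGIOS__HOSTMACADDRESS")
    if mac:
        line += f" (MAC {mac})"
    return line
```

Taking the environment as a parameter makes the script easy to try outside Nagios by passing in a plain dictionary.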
  • Use Flap Detection
    - Or not. Who wants a charged cellphone battery?
    - Measures state changes: a weighted measure of the last 21 checks
    - More recent counts higher
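
The "weighted measure of the last 21 checks" can be sketched as follows. The slide gives no numbers, so the linear weighting from 0.75 for the oldest possible change to 1.25 for the newest is an assumption based on my reading of the Nagios flap detection documentation, and percent_state_change is a hypothetical name.

```python
def percent_state_change(states, low_weight=0.75, high_weight=1.25):
    """Weighted percent state change over a history of check states.

    states is ordered oldest to newest; each adjacent pair that differs
    counts as one change, with newer changes weighted more heavily.
    """
    slots = len(states) - 1  # 21 states give 20 possible changes
    if slots < 1:
        return 0.0
    total = 0.0
    for i in range(slots):
        if states[i] != states[i + 1]:
            # Weight rises linearly from low_weight (oldest) to high_weight (newest).
            frac = i / (slots - 1) if slots > 1 else 1.0
            total += low_weight + frac * (high_weight - low_weight)
    return 100.0 * total / slots
```

The resulting percentage is what gets compared against the configured low and high flap thresholds; a single recent change scores higher than a single old one, which is the "more recent counts higher" point from the slide.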
  • Large Shops
    Too many nodes to easily configure by hand, or too many nodes to deal with using one server
    - Scaling Nagios
    - Centralized Management
    - Web Configurators
  • Scaling Nagios
    - use_large_installation_tweaks: no summary macros, memory handling is different, and processes fork() less
    - Distributed monitoring: assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
    - Check the tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html
  • Centralized Management
    - Puppet / Chef / CFEngine / whatever
    - Distribute nagios user's key if necessary
    - Install Nagios agents (NSCA / NRPE)
    - Automate configuration build
    - Puppet's built-in Nagios types sound convenient... sort of
  • Nagios Web Configuration
    - Dozens, if not hundreds
    - I don't know of a great one.
    - May be worth building or finding one that matches your inventory system
    - Don't double up on data if you don't have to
  • Malproductive Practices
    - Overreliance on event handlers: please don't do anything terribly important. Edge cases are scary.
    - Overabuse of inheritance: spaghetti code, hard to trace
    - Overcomplication: simple is nearly always better
  • Learn More
    - Mailing list: Nagios Users, https://lists.sourceforge.net/lists/listinfo/nagios-users
    - LinkedIn: Nagios Users, http://www.linkedin.com/groupAnswers?viewQuestions=&gid=