How to deploy and maintain a large scale Nagios installation covering multiple locations. The system is version controlled and distributed, so every location can configure their own hosts and services to monitor.
10.09.2008, Robert M. Albrecht / Large scale Nagios
Large scale Nagios design
Recommendations & ideas for building a large scale distributed Nagios installation.
Who are you listening to ?
T-Systems Enterprise Services GmbH
Business Enabling Infrastructure
Senior expert for technical infrastructure
Fedora Project
Ambassador
Package Maintainer
Nagios
Nagios PlugIns
PNP4Nagios, checkmulti
Manpagesde, ...
Who are you listening to ?
In my first life
First Linux kernel compiled on Minix
about 30 books on programming languages and operating systems
years of magazine articles
12 years of training
What is T-Systems ?
T-Home
Endusers (soho)
T-Mobile
Mobile communications
T-Systems Business Services GmbH
160.000 business customers
T-Systems Enterprise Services GmbH
60 large customers
Who is T-Systems Enterprise ?
IT-Operations
Data centers
IT-Outsourcing
Desktop support & HelpDesk
Systems Integrations
Research & development
8.500 people
25 locations (in Germany)
Fedora Project
Fedora is one of the most popular Linux distributions.
Red Hat Enterprise Linux draws packages from us, like Ubuntu does from Debian.
Good relations with our brothers and sisters like Red Hat Enterprise Linux and CentOS.
We adhere to the FSF's definition of free software.
Fedora Project
Excerpts from: http://fedoraproject.org/wiki/Objectives
Fedora is about the rapid progress of Free, Open Source software and content.
Fedora believes in the statement "once free, always free."
To do as much of the development work as possible directly in the upstream
packages.
Fedora & Nagios
Today we have
Nagios 3.0.3
Nagios Plugins 1.4.12
PNP for Nagios 0.4.10
Check_Multi (work in progress)
Also available in EPEL for Red Hat Enterprise Linux & CentOS
Agenda
Plan
I will show you how we planned the architecture.
Build
You all know this.
Run
and show you how to keep it running.
Nagios: Our main problems
We have several locations.
Most locations have their own administrators.
Every location has their own services and hosts.
Our management and user helpdesk need common reporting on availability and outages.
Looking into 25 different web interfaces is a bad solution.
Privacy and multitenancy are required because of different departments, subcontractors, freelancers, ...
Nagios: What do we need ?
What do we need ?
Reporting: Centralized
Configuration: Decentralized
Performance: Decentralized
Multitenancy: Enabled
We need a decentralized configuration mechanism for a distributed but centralized
system which is only partly visible. :-)
Basic ideas: Plan
For the performance:
We need a distributed monitoring setup (NSCA). We put Nagios-servers in the
different locations.
For the reporting:
We add a Nagios-master for consolidating the data. The Nagios-slave-servers will
send their data to the Nagios-master.
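The slave-to-master data flow can be sketched as a small forwarding script in the style of the Nagios distributed-monitoring docs. This is only a minimal sketch, not the setup used in the talk: the helper names format_result and forward_result and the send_nsca.cfg path are assumptions.

```shell
#!/bin/sh
# Sketch only: format one service check result the way NSCA expects it
# (tab-separated: host, service, numeric return code, plugin output).

# format_result HOST SERVICE STATE OUTPUT  (hypothetical helper)
format_result() {
    case "$3" in
        OK)       code=0 ;;
        WARNING)  code=1 ;;
        CRITICAL) code=2 ;;
        *)        code=3 ;;   # UNKNOWN or anything unexpected
    esac
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$code" "$4"
}

# forward_result MASTER HOST SERVICE STATE OUTPUT  (hypothetical helper)
# Pipes the formatted line to send_nsca; the config path is an assumption.
forward_result() {
    master="$1"; shift
    format_result "$@" | send_nsca -H "$master" -c /etc/nagios/send_nsca.cfg
}
```

On a slave, something like forward_result could be called from the ocsp_command with the $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$ macros as arguments.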
Basic ideas: Plan
For the configuration
We use Subversion for managing the configuration files.
but
Nagios-master and Nagios-slaves need more or less the same config-files.
Problem:
Having all items for all locations on the Nagios-Master could result in naming collisions for templates, hosts, services, …
As Nagios does not have namespaces like Java or C++, we need to build our own
concept.
Also we need to synchronize the configuration files on all servers.
Basic ideas: Run
Where are the difficulties in keeping a large scale distributed installation running ?
Every location has to define their own hosts, services, checks, …
All these config-files have to be synchronized on all servers.
If one admin commits a faulty configuration, the whole system can potentially break :-(
Folder structure
To isolate the locations we use folders:
/etc/nagios3
    conf.d/
        location1/
            plugins/
            commands.cfg
            contactgroups.cfg
            dependencies.cfg
            escalations.cfg
            hostgroups.cfg
            server1.cfg
            server2.cfg
        location2/
        global/
    nagios.cfg
Folder structure
All definitions for one server are in one file:
fqdn.cfg
For example:
koji.fedoraproject.org.cfg
Provided are the templates generic-host and generic-service that must be used (see
next slide).
As we use FQDN we have a namespace and thus no naming collisions.
Since Nagios 3 there is no need for separate hostextinfo and serviceextinfo files
anymore. Thanks Ethan. Simplifies life a lot.
Every host and every service must have a contact, otherwise the access rights to the web interface won't work later.
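As an illustration of the one-file-per-server rule, the following sketch prints a skeleton fqdn.cfg that uses both mandatory templates and sets a contact group on the host and on the service. The helper name make_host_cfg, the IP address and the check_ping arguments are assumptions for the example, not taken from the talk.

```shell
#!/bin/sh
# make_host_cfg FQDN ADDRESS CONTACTGROUP  (hypothetical helper)
# Prints a minimal host file: one host and one service definition, both
# based on the mandatory templates and both with a contact group.
make_host_cfg() {
    cat <<EOF
define host {
    use             generic-host
    host_name       $1
    address         $2
    contact_groups  $3
}

define service {
    use                  generic-service
    host_name            $1
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
    contact_groups       $3
}
EOF
}
```

For example: make_host_cfg koji.fedoraproject.org 192.0.2.10 de_hb_BackupOperators > koji.fedoraproject.org.cfg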
Mandatory templates
There are mandatory templates: generic-host and generic-service that MUST be
used and may not be overwritten !
Why ? Host / Service definitions on Master and Slaves are different !
We can use the same host / service definitions on master and slaves, as we can hide
the differences in the used templates.
Nagios has some good docs for this:
http://nagios.sourceforge.net/docs/2_0/distributed.html
                        Master      Slave
active_checks           disabled    enabled
retain_*_information    enabled     disabled
check_command           none        active
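How a template can hide the master/slave difference might look like the sketch below. It prints a generic-service template for a given role; the helper name service_template is hypothetical, and the exact set of directives is an assumption based on the table, not the template used in the talk.

```shell
#!/bin/sh
# service_template ROLE  (hypothetical helper; ROLE is "master" or "slave")
# Prints a generic-service template whose settings depend on the role.
service_template() {
    case "$1" in
        master) active=0; retain=1 ;;
        slave)  active=1; retain=0 ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
    cat <<EOF
define service {
    name                          generic-service
    register                      0
    active_checks_enabled         $active
    passive_checks_enabled        1
    retain_status_information     $retain
    retain_nonstatus_information  $retain
}
EOF
}
```

The host and service files stay identical everywhere; only the template file generated for each role differs.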
Introducing namespaces
We do not have namespaces for hostgroups & contactgroups, therefore collisions
could happen. For example the contactgroup BackupOperators may be defined in
multiple locations.
/etc/nagios
    conf.d/
        location1/
            contactgroups.cfg
            hostgroups.cfg
        location2/
            contactgroups.cfg
            hostgroups.cfg
We construct namespaces by prefixing them:
Hostgroup: de_hb_BackupServers
Contactgroup: de_hb_BackupOperators
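The prefix convention can be enforced mechanically. The sketch below is one possible check, not something shown in the talk: the helper name check_prefix is hypothetical, and it assumes one directive per line in the .cfg files.

```shell
#!/bin/sh
# check_prefix DIR PREFIX  (hypothetical helper)
# Lists hostgroup, contactgroup and command names in DIR/*.cfg that do not
# start with "PREFIX_". Returns non-zero if any such name is found.
check_prefix() {
    awk -v p="${2}_" '
        $1 ~ /^(hostgroup_name|contactgroup_name|command_name)$/ &&
        index($2, p) != 1 {
            print FILENAME ": " $1 " " $2
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"/*.cfg
}
```

For example, check_prefix conf.d/location1 de_hb would report a contactgroup named BackupOperators but accept de_hb_BackupOperators.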
Namespace prefixes
Location    Address             Prefix
Berlin      Kuhdamm             de_b
Bremen      Bahnhofstrasse      de_hb
Munich      Viktualienmarkt     de_m
Barcelona   La Rambla           es_bcn
ISO 3166 country codes:
http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm
Contacts are global
Contacts do not get prefixes.
Not all locations have their own administrators, so contacts should not be site-specific.
Contacts get defined in global/contacts.cfg
/etc/nagios
    conf.d/
        global/
            contacts.cfg
Local plugins
A location can put their own checks in
/etc/nagios
    conf.d/
        location1/
            plugins/
The command-name has to be prefixed:
define command {
    command_name de_hb_CheckTea
    command_line /etc/nagios2/conf.d/de_hb/plugins/CheckTea.pl
}
As checking for tea is essential for all locations, using site-specific plugins is not
recommended.
It is better to file a change request to include the new check in the global configuration.
Folder structure
All servers (Master & Slave) need essentially the same config-files.
But there are differences !
Example:
/etc/nagios
    conf.d/
        location1/
        location2/
        global/
    nagios.cfg
The master needs to load the global configuration and all locations. A slave needs only parts of global (e.g. the templates and contacts) and its own location.
The nagios.cfg is different on master and slaves because they perform different roles.
Folder structure
So we can not simply rsync the files on all servers (actually we can, but the files have
to be modified first).
As all Nagios-Slaves are nearly identical, we can simply copy a nagios.cfg.slave to all
slaves and rename it to nagios.cfg .
Creating DNS CNAMEs like
nagios_de_hb.fedoraproject.org
makes it easy to copy the files:
svn checkout http://repo.internal.com/repos/nagios /tmp/workingcopy
cd /tmp/workingcopy/conf.d
# Copy each location folder to the slave behind the matching CNAME.
for i in location*/; do
    i="${i%/}"
    scp -r "$i" "nagios_${i}.fedoraproject.org:/etc/nagios/conf.d/"
done
Changing the config-files
We need some additional pattern matching to adapt nagios.cfg.slave so that the cfg_dir path points to the correct location folder:
/etc/nagios
    conf.d/
        location1/
You can use AWK, Perl, whatever you like. After that, you copy the files via rsync, scp,
… to the slave servers.
Geeky solution: CFEngine and BCFG2 can do both: altering and copying the files.
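A sed one-liner is enough for the pattern matching. The sketch below assumes that nagios.cfg.slave contains a placeholder @LOCATION@ in its cfg_dir line; both the placeholder convention and the helper name render_slave_cfg are assumptions for illustration.

```shell
#!/bin/sh
# render_slave_cfg TEMPLATE LOCATION  (hypothetical helper)
# Replaces the placeholder @LOCATION@ (an assumed convention) in the
# template and prints the resulting nagios.cfg to stdout.
render_slave_cfg() {
    sed "s|@LOCATION@|$2|g" "$1"
}
```

With a template line such as cfg_dir=/etc/nagios/conf.d/@LOCATION@, calling render_slave_cfg nagios.cfg.slave de_hb > nagios.cfg yields the configuration for the Bremen slave.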
Build it
Nagios original docs on distributed monitoring:
http://nagios.sourceforge.net/docs/3_0/distributed.html
Nagios large installation tweaks
http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
Tuning Nagios for maximum performance:
http://nagios.sourceforge.net/docs/3_0/tuning.html
Startup times:
http://nagios.sourceforge.net/docs/3_0/faststartup.html
Buy fast hardware.
Don't use performance data (pnp4nagios) on the master server.
No secret ingredients here.
Avoiding breakdown
Remember: the object-definitions are replicated to master and slave-servers.
If a local administrator makes a change, the global server could break = VERY BAD.
Nagios could be more forgiving (optionally) by simply ignoring defective configurations.
But Nagios isn't.
So we need to check configuration changes before using them.
As the administrator makes configuration changes through Subversion, Subversion would be the ideal place to do this.
Luckily Subversion supports that: hook scripts.
Subversion hook scripts
Subversion has several hook scripts, that are invoked, when changes to the
repository are made.
Wording: commit-transaction = transfer the files + writing them into the repository
Start-commit: This is run before the commit transaction is even created. It is
typically used to decide if the user has commit privileges at all.
Pre-commit: This is run when the transaction is complete, but before it is
committed. Typically, this hook is used to protect against commits that are
disallowed due to content or location. Subversion has some examples of how to
create fine grained write-access controls.
This would be ideal, but sadly does not work for us: at this point there is no working copy of the new data that we could run nagios -v against to see whether the change breaks our configuration. We could check metadata, though.
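A metadata check in pre-commit could, for instance, make sure that a commit only touches one location's folder. The sketch below parses the output of svnlook changed; the helper name outside_location, the conf.d/LOCATION/ repository layout and the example location are assumptions, and paths containing spaces are not handled.

```shell
#!/bin/sh
# outside_location LOCATION  (hypothetical helper)
# Reads "svnlook changed" output on stdin (status columns, then the path)
# and prints every changed path that is not below conf.d/LOCATION/.
# Returns non-zero if at least one such path exists.
outside_location() {
    awk -v dir="conf.d/$1/" '
        index($2, dir) != 1 { print $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# In a pre-commit hook ($1 = repository path, $2 = transaction name) this
# could be used as follows; the location "de_hb" is only an example:
#   svnlook changed -t "$2" "$1" | outside_location de_hb >&2 || exit 1
```

This only restricts where a committer may write. It says nothing about whether the resulting configuration is valid.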
Subversion hook scripts
Post-commit: This is run after the transaction is committed, and a new revision is
created. Most people use this hook to send out descriptive emails about the
commit or to make a backup of the repository.
This does work for us.
Sort of.
Subversion hook scripts
Subversion has some sample hook-scripts:
$ ls repos/hooks/
post-commit.tmpl post-unlock.tmpl pre-revprop-change.tmpl
post-lock.tmpl pre-commit.tmpl pre-unlock.tmpl
post-revprop-change.tmpl pre-lock.tmpl start-commit.tmpl
$
You can simply rename post-commit.tmpl to post-commit and make it executable; Subversion then uses it.
Hook scripts are really simple. This one would be a start-commit hook, which receives the username in $2:
#!/bin/sh
USER="$2"
if [ "$USER" = "JohnDoe" ]; then exit 0; fi
echo "Only John may commit" >&2
exit 1
Anything printed to stderr is given back to the client.
Subversion hook scripts
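What you need is something like this:
REPOS="$1"; REV="$2"
AUTHOR=$(svnlook author -r "$REV" "$REPOS")   # committer of this revision
if "$NagiosBin" -v "$NagiosCfgFile" > /dev/null 2>&1; then
    mail -s "Thanks for your attention." "$AUTHOR@fedoraproject.org" < /dev/null
    # start syncing
else
    mail -s "You idiot!" "$AUTHOR@fedoraproject.org" < /dev/null
    # restore backup
fi
Subversion passes the repository path in $1 and the new revision number in $2 to post-commit; the committer's username can be looked up with svnlook author.
Exit 1 (error in nagios -v) does nothing: sadly, Subversion ignores the return code of post-commit and does not drop the commit, as one might expect. The revision already exists at this point.
Exit 0 does nothing either. After a successful check you can sync the new object definitions.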
Subversion hook scripts
Problem of this solution: it does not work :-(
We need some more complexity.
Even if you have Subversion and Nagios on the same machine, you can't run Nagios directly out of the repository. Inside Subversion's folders, the files are not readable by Nagios. The files might even reside inside a database, depending on your Subversion setup.
So you need to check out a working copy and run nagios -v against it. The working copy has to be a master configuration: some errors, like duplicate contactgroups, might only show up on the master, as only the master loads all object definitions.
svn checkout http://repo.internal.com/repos/nagios /tmp/nagios
nagios -v /tmp/nagios/nagios.cfg
You may want to use a chroot-environment or another machine, to deal with absolute
paths (for example in checkcommands.cfg).
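Put together, the check against a fresh working copy might look like the following sketch. The helper names validate_config and check_revision are hypothetical, and the sketch assumes the repository root contains the master nagios.cfg.

```shell
#!/bin/sh
# validate_config NAGIOS_BIN CFG_FILE  (hypothetical helper)
# Runs the pre-flight check; on failure the last lines of the output are
# printed to stderr so they can be mailed to the committer.
validate_config() {
    if output=$("$1" -v "$2" 2>&1); then
        return 0
    fi
    printf '%s\n' "$output" | tail -n 20 >&2
    return 1
}

# check_revision REPO_URL NAGIOS_BIN  (hypothetical helper)
# Checks out a fresh working copy and validates its master nagios.cfg.
check_revision() {
    wc=$(mktemp -d) || return 1
    svn checkout -q "$1" "$wc" && validate_config "$2" "$wc/nagios.cfg"
    status=$?
    rm -rf "$wc"
    return "$status"
}
```

A post-commit hook could call check_revision and start the synchronization only when it returns 0.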
Subversion hook script
What to do if the nagios -v fails ?
You could mail the committer and blame him for all the evil in the world.
You could make a backup of your repository in hook-script pre-commit and do an
automatic restore.
What to do if everything checks out ?
Start your synchronization and reload your Nagios daemons.
If you use precached configuration files, recreate the cache:
nagios -pv /etc/nagios3/nagios.cfg
Privacy & multitenancy
Currently everyone can peek into every customer and division in the web interface.
Not every division should see everything.
Not every employee is permitted to see every customer's infrastructure.
HelpDesk, freelancers and students should only see what is needed.
Nagios has a very simple form of this: authenticated contacts.
If an authenticated user's name (authenticated by the web server) matches a contact name, the user becomes an authenticated contact.
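Because the match is purely by name, a web user without a contact of the same name sees nothing. The sketch below lists such users; it assumes an Apache htpasswd file ("user:hash" per line) and one contact_name directive per line, and the helper name users_without_contact is hypothetical.

```shell
#!/bin/sh
# users_without_contact HTPASSWD_FILE CONTACTS_CFG  (hypothetical helper)
# Prints every user name from the htpasswd file (format "user:hash") that
# has no matching contact_name in the contacts file.
users_without_contact() {
    cut -d: -f1 "$1" | while read -r user; do
        awk -v u="$user" '$1 == "contact_name" && $2 == u { found = 1 }
                          END { exit !found }' "$2" || echo "$user"
    done
}
```

Run against global/contacts.cfg, an empty result means every web login maps to a Nagios contact.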
Make it work
After enabling this feature
use_authentication=1 (cgi.cfg)
only authenticated contacts can view information on hosts and services for which they are contacts.
All host- and service-oriented CGIs honor authenticated contacts.
Some CGIs (event log, alert log, …) only display the parts that are relevant to you.
Drawbacks and further configuration
Nobody is a contact for:
another contact
contact group
time periods ...
So no one can view them in the webinterface :-(
You can grant global rights by adding contacts to
authorized_for_system_information
authorized_for_configuration_information
authorized_for_all_hosts
authorized_for_all_services ...
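As a small illustration, the corresponding cgi.cfg lines for a list of global administrators could be generated like this. The helper name cgi_global_rights and the user names in the example are assumptions.

```shell
#!/bin/sh
# cgi_global_rights USERS  (hypothetical helper; USERS is comma-separated)
# Prints the cgi.cfg directives that grant the given users global rights.
cgi_global_rights() {
    for key in authorized_for_system_information \
               authorized_for_configuration_information \
               authorized_for_all_hosts \
               authorized_for_all_services; do
        printf '%s=%s\n' "$key" "$1"
    done
}
```

For example, cgi_global_rights nagiosadmin,albrecht prints four lines that can be merged into cgi.cfg. Only a few trusted administrators should be listed here, since these rights bypass the per-contact restrictions.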
What have we built ?
What goals are finished ?
Reporting: Centralized
Configuration: Decentralized
Performance: Decentralized
Multitenancy: Works, but some CGI information is not accessible
We have a decentralized configuration mechanism for a distributed but centralized
system which is only partly visible. :-)