Nagios Conference 2012 - Dan Wittenberg - Case Study: Scaling Nagios Core at Fortune 50 Company

Dan Wittenberg's presentation on using Nagios at a Fortune 50 Company
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna


  1. Scaling Nagios at a Giant Insurance Company
     Daniel Wittenberg
     dwittenberg2008@gmail.com
     https://github.com/dwittenberg2008/nagios
  2. Personal Background
     - Certified for HP-UX in the mid-90s, then RHCE in '99, and AIX in the early 2000s.
     - Worked on many technologies and solutions, including HA, SAN/iSCSI, Forensics/Security, Backups/Disaster Recovery, Performance Tuning, Capacity Planning, Monitoring/Trending, Networking/Protocol Analysis, Virtualization and Cloud Computing.
     - Consulted and worked in many industries, including insurance, banking, accounting, construction, embedded hardware design, printing/publishing, early education, higher education, and ISP/hosting providers.
  3. Topics
     - Hardware
     - Operating System
     - Nagios Core
     - Plugins
     - Other Add-ons
     - Event Brokers
     - Other Software
     - Performance Monitoring
     - General
  4. Overview
  5. Highest Counts Seen
  6. Hardware
     - Hardware vs VMware
       - High forking rate is not a good fit for VMware (livecheck/4.0)
     - CPU requirements
       - Quantity vs quality
     - Memory
       - Typically memory efficient, but have enough for ramdisk(s)
       - Affected by your plugins if using active checks
     - Disk I/O
       - The faster the better!
  7. VMware Performance Comparison

     Isolated VMware ESX 4
     Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
     4      8 GB    1000   182          65            5             12084         6042
     8      8 GB    1000   162          47            4             12084         6042
     8      8 GB    600    87           60            8             7272          3636

     Physical Dell PowerEdge R710 (new)
     Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
     4      16 GB   1000   0.19         10            1.25          12084         6042
     8      8 GB    1000   0.38         25            1.15          12084         6042

     Physical HP ProLiant DL380 G4 (~8 years old)
     Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
     4      4 GB    800    0.29         32            1.95          9684          4842
     8      4 GB    1000   0.47         37            4.43          12084         6042
  8. Operating System
     - CentOS / RHEL 6.3
     - Strip down the running services
     - Create ramdisk in Nagios RC script
       - First one for status.dat, checkresults, temp_file
       - nagios.rc on GitHub for the full rc script (will be default in 4.0)

         # Mount the tmpfs ramdisk only if it is not already mounted.
         if ! mount | grep -q "/var/nagios/ramcache"; then
             mkdir -p -m 755 /var/nagios/ramcache
             mount -t tmpfs -o size=128m tmpfs /var/nagios/ramcache
             # Check results get their own directory on the ramdisk.
             mkdir -p -m 755 /var/nagios/ramcache/checkresults
             chown -R nagios:nagios /var/nagios/ramcache
         fi
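
The ramdisk only helps once Nagios is told to write there. As a minimal sketch, the helper below prints the three nagios.cfg directives that correspond to the files named on the slide. The directive names (`status_file`, `check_result_path`, `temp_file`) are standard nagios.cfg settings; the function name and the `nagios.tmp` file name are assumptions made for illustration, not taken from the talk.

```shell
# Print the nagios.cfg directives that move the hot files onto the ramdisk.
# Hypothetical helper: the function name and the nagios.tmp file name are
# illustrative; the mount point defaults to the one used in the rc script.
ramcache_directives() {
    ramdir="${1:-/var/nagios/ramcache}"
    cat <<EOF
status_file=${ramdir}/status.dat
check_result_path=${ramdir}/checkresults
temp_file=${ramdir}/nagios.tmp
EOF
}

# Show the directives for the default mount point.
ramcache_directives
```

Because tmpfs contents vanish at reboot, the rc script has to recreate the directories before Nagios starts, which is why the mount logic lives there.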
  9. Operating System
     - Make sure there are no ulimit restrictions
         ulimit -a
     - Renice daemons and services
         daemon -15 --user=$user $exec -ud $config
         perfdata_file_run_cmd = /bin/nice -n 20 /usr/libexec/pnp4nagios/process_perfdata.pl
     - Puppet runs are also re-niced (/etc/sysconfig/puppet: NICELEVEL=19)
     - Watch your other running services and cron jobs interactively for a while to see what spikes; you might be surprised!
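
Reading `ulimit -a` by eye is easy to forget after a rebuild, so the check can be scripted. The following is a sketch only: `check_limit` is a hypothetical helper, and the threshold of 1024 open files is a placeholder rather than a value from the talk.

```shell
# Warn when a resource limit is lower than the monitoring server needs.
# Usage: check_limit NAME CURRENT MINIMUM   ("unlimited" always passes)
check_limit() {
    name="$1"; current="$2"; minimum="$3"
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$minimum" ]; then
        echo "OK: $name=$current"
        return 0
    fi
    echo "WARN: $name=$current is below $minimum"
    return 1
}

# Check the open-files limit of the current shell; the threshold is a placeholder.
check_limit "open files" "$(ulimit -n)" 1024 || true
```

Run it as the nagios user, since limits are per user and the root shell may report different values.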
  10. Nagios Core
      - Currently using Nagios 3.4.1 / 4.0
        - Stock, with the exception of a custom rc script in 3.4.1
      - Large Scale Suggestions Doc
      - Pre-caching objects
      - Re-write RC script to optimize restart time (use -vx)
      - Don't allow restart/stop if the config is broken
      - Limit use of macros (resources.cfg)
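
The rule about refusing to restart on a broken config can be sketched as a small guard function. This is an illustration, not the rc script from the GitHub repository: `guarded_restart` is a hypothetical name, and the paths in the usage comment assume a typical RHEL layout.

```shell
# Run a verification command and restart only if it succeeds.
# Usage: guarded_restart "<verify command>" "<restart command>"
guarded_restart() {
    verify_cmd="$1"; restart_cmd="$2"
    if ! sh -c "$verify_cmd" >/dev/null 2>&1; then
        echo "Configuration check failed; leaving the running daemon alone." >&2
        return 1
    fi
    sh -c "$restart_cmd"
}

# Example (paths are assumptions for a typical RHEL layout):
#   guarded_restart "nagios -v /etc/nagios/nagios.cfg" "service nagios restart"
```

The important property is that a failed check leaves the old process running, so a typo in one object file never takes monitoring offline.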
  11. Nagios Core
      - Remove use of CGIs, disable in Apache
        - Using Livestatus/Multisite/livestatus-slave
      - Limit use of OS backups (crazy, huh?)
      - Keep logging level low in core and in all plugins/brokers
      - Keep comments limited; delete if more than X in number or older than Y days
      - status_update_interval=20 (default is 10): how often to update status.dat, in seconds
      - enable_environment_macros=0 (default is 1): pass macros as ENV variables
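
One way to prune old comments is to turn a comment listing into Nagios external commands. The sketch below assumes the listing arrives as `id;entry_time;is_service` lines, which is how a Livestatus query on the comments table with those three columns would print them; that input format, the function name, and the socket paths in the usage comment are assumptions. `DEL_HOST_COMMENT` and `DEL_SVC_COMMENT` are standard external commands.

```shell
# Read "id;entry_time;is_service" lines on stdin and print a Nagios external
# command for every comment older than MAX_DAYS. NOW is a Unix timestamp.
# Usage: stale_comment_commands NOW MAX_DAYS
stale_comment_commands() {
    now="$1"; max_days="$2"
    awk -F';' -v now="$now" -v days="$max_days" '
        now - $2 > days * 86400 {
            cmd = ($3 == 1) ? "DEL_SVC_COMMENT" : "DEL_HOST_COMMENT"
            printf "[%s] %s;%s\n", now, cmd, $1
        }'
}

# Example (socket and command-file paths are assumptions):
#   printf 'GET comments\nColumns: id entry_time is_service\n\n' \
#     | unixcat /var/nagios/rw/live \
#     | stale_comment_commands "$(date +%s)" 30 > /var/nagios/rw/nagios.cmd
```

Printing the commands instead of writing them directly makes a dry run trivial: inspect the output first, then redirect it to the command file.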
  12. Plugins
      - check_nrpe
      - check_logfiles
      - check_hpasm / check_dell_sensors / check_dell_omreport
      - check_oracle_health, check_mysql_health
      - check_ps.sh (re-written for perf data, correct calculations)
      - nagios_auto_service
      - Return perf data whenever possible
      - Many other custom and one-off plugins
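
Returning perf data costs one extra field in the plugin output. As a minimal sketch of the standard plugin output format (status text, a pipe, then `label=value;warn;crit`), the hypothetical `check_threshold` function below compares an integer against two thresholds; it is not one of the plugins listed on the slide.

```shell
# Compare VALUE against WARN and CRIT and report in Nagios plugin format:
#   STATUS - text | label=value;warn;crit
# Returns the plugin status code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_threshold() {
    label="$1"; value="$2"; warn="$3"; crit="$4"
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - $label value '$value' is not an integer"
            return 3 ;;
    esac
    perf="$label=$value;$warn;$crit"
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value | $perf"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value | $perf"; return 1
    fi
    echo "OK - $label is $value | $perf"; return 0
}
```

Everything after the pipe is what PNP4Nagios or a perf-data broker graphs, so even a plugin that only ever reports OK becomes useful for trending.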
  13. Other Add-ons
      - NRPE
        - Patched to allow large buffer size (20480 bytes)
      - NSCA (NRDP in the future?)
        - Patched MAX_PLUGINOUTPUT_LENGTH to 4096
        - max_packet_age=60, forward and back time patch
        - Run from xinetd to allow larger/faster connections and hang protection
          - MUST use instances = UNLIMITED
          - Recommend per_source = UNLIMITED
          - Recommend cps = 5000 3
      - NSClient++/NSCP
        - Many updates for buffering, data truncation, queueing
      - PNP4Nagios
        - rrdcached
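
The three xinetd limits from the slide sit inside an ordinary service stanza. The sketch below prints one; only `instances`, `per_source` and `cps` come from the talk, while the server path, config path, user, and the function name are assumptions for a typical install.

```shell
# Print an xinetd service stanza for NSCA using the limits from the slide.
# The server path, config path and user are assumptions for a typical install.
# xinetd looks the service name up in /etc/services, so an "nsca 5667/tcp"
# entry is expected to exist there.
nsca_xinetd_stanza() {
    server="${1:-/usr/sbin/nsca}"
    config="${2:-/etc/nagios/nsca.cfg}"
    cat <<EOF
service nsca
{
    socket_type  = stream
    protocol     = tcp
    wait         = no
    user         = nagios
    server       = ${server}
    server_args  = -c ${config} --inetd
    instances    = UNLIMITED
    per_source   = UNLIMITED
    cps          = 5000 3
    disable      = no
}
EOF
}

# Example: nsca_xinetd_stanza > /etc/xinetd.d/nsca
```

In `cps = 5000 3`, the first number is the allowed connections per second and the second is how many seconds xinetd pauses the service once that rate is exceeded.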
  14. Event Brokers
      - DNX
      - Mod-Gearman
      - MK Livestatus
      - Performance Data Splunker (custom)
      - Log separator (custom; reduces grepping for messages)
  15. Other Software
      - Puppet
        - Manage entire server, from OS to .cfg
      - Splunk
        - Log files, performance data, sampled from servers (25GB/day+)
      - Cacti
        - Nagiostats template, updated to use livestatus instead of CGI
      - Custom Control Panel
        - Build host groups based on templates, auto-config based on host info
      - ConSol Labs
        - check_logfiles, check_hpasm, mod_gearman, check_mysql_health, check_oracle_health
  16. Performance Monitoring
      - How to watch your system to determine bottlenecks
        - vmstat
        - iostat
        - top
        - iptraf
        - sar
        - strace
        - esxtop (if you have to use a VM)
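
The hardware slide names a high forking rate as the reason VMware was a poor fit, and that rate is easy to measure on Linux. The sketch below reads the cumulative fork counter from the `processes` line of `/proc/stat` and turns two samples into forks per second; the function names are hypothetical.

```shell
# Cumulative number of forks since boot, from the "processes" line of /proc/stat.
# An alternative file can be passed for testing.
fork_count() {
    awk '/^processes / {print $2}' "${1:-/proc/stat}"
}

# Forks per second between two samples taken INTERVAL seconds apart
# (integer division).
fork_rate() {
    before="$1"; after="$2"; interval="$3"
    echo $(( (after - before) / interval ))
}

# Example: sample twice, five seconds apart (Linux only).
#   a=$(fork_count); sleep 5; b=$(fork_count); fork_rate "$a" "$b" 5
```

`vmstat -f` reports the same cumulative counter if a one-off reading is enough.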
  17. General Configs
      - Host config files are standalone configurations that tell everything about a host.
        - Hosts are tied to a hostgroup
        - Hostgroups are tied to a servicegroup
        - Services are tied to a servicegroup
        - host.cfg → hostgroups → service ← servicegroups
      - This allows for easy drop-in and removal of hosts, but also requires that at least 1 host be assigned to a management server
      - Limitations: harder to make per-server, per-service customizations
      - Hosts are built/assigned from the control panel (round-robin distribution)
      - Parents are built automatically from the topology database, updated nightly (ESX hourly)
      - Parents only ping once a day unless there are problems; uses fping
      - Some alerts do trigger eventhandlers; automate fixes as much as possible
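
The host.cfg → hostgroups → service ← servicegroups chain can be illustrated by generating the two object definitions that carry it. This is a sketch of the pattern, not the control panel's actual output: the function names, the `generic-host` and `generic-service` templates, and every object name in the example are hypothetical. The directives themselves (`hostgroups`, `hostgroup_name`, `servicegroups`) are standard Nagios object attributes.

```shell
# Print a standalone host definition; membership in a hostgroup is the only
# link between the host and the services it receives.
render_host_cfg() {
    host="$1"; address="$2"; hostgroup="$3"
    cat <<EOF
define host {
    use         generic-host
    host_name   ${host}
    address     ${address}
    hostgroups  ${hostgroup}
}
EOF
}

# Print a service attached to a hostgroup rather than to individual hosts.
render_service_cfg() {
    hostgroup="$1"; description="$2"; command="$3"; servicegroup="$4"
    cat <<EOF
define service {
    use                  generic-service
    hostgroup_name       ${hostgroup}
    service_description  ${description}
    check_command        ${command}
    servicegroups        ${servicegroup}
}
EOF
}

# Example with made-up names:
#   render_host_cfg web01 10.0.0.5 linux-servers > web01.cfg
#   render_service_cfg linux-servers "CPU Load" check_load os-checks
```

Because services never name a host directly, dropping a host file into the config directory (or deleting it) is enough to add or remove all of its checks.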
  18. Example Template Config
  19. General Configs
      - Types of things being monitored:
        - CPU load, CPU stats (idle/wait/user/system), disk space, log files/Event Log, hardware, processes, swap, memory usage, service ports, NTP drift, cron job completion, UPS
        - Nagios configtest, livestatus connectivity
        - PNP4Nagios/check_results directory size (keeping up on processing)
        - Performance (CPU/memory) usage of certain processes
        - Puppet update time, to make sure it doesn't get behind
        - DB response times and health (oracle/mysql/postgresql)
        - Apache stats
        - Custom app status (user accounts, response times, loads, etc.)
        - Various SNMP/WMI values (mostly network-related stats)
        - ActiveMQ/Mule ESB
  20. Links: where to find this stuff
      - My Stuff: https://github.com/dwittenberg2008/nagios
      - MK Livestatus: http://mathias-kettner.de/checkmk_livestatus.html
      - LivestatusSlave: http://nagios.larsmichelsen.com/livestatusslave/
      - PNP4Nagios: http://docs.pnp4nagios.org/pnp-0.6/start
      - ConSol Labs: http://labs.consol.de/
      - Puppet: http://puppetlabs.com/
      - Cacti Template (Base): http://forums.cacti.net/about33806.html
  21. Future?
      - Nagios 4.0 will save the world!
  22. Nagios 4.0 Initial Specs
      - Memory usage wasn't too good during initial testing...
  23. Nagios 3.4.1 vs 4.0 -v Times
      - Final numbers: 1,423,345 services, 36,254 hosts, 255,108 service dependencies
      - NEVER would have done a complete -v; now completes in 1:51:00!
  24. Questions? Suggestions?
