Nagios Conference 2011 - Daniel Wittenberg - Scaling Nagios At A Giant Insurance Company

Daniel Wittenberg's presentation on a reference story for a German health insurance company. The presentation was given during the Nagios World Conference North America, held Sept 27-29, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna



Scaling Nagios At A Giant Insurance Company
Daniel Wittenberg [email_address]
https://github.com/dwittenberg2008/nagios

Personal Background
- Certified for HP-UX in the mid-90s, then RHCE in '99, and AIX in the early 2000s.
- Worked on many technologies and solutions, including HA, SAN/iSCSI, Forensics/Security, Backups/Disaster Recovery, Performance Tuning, Capacity Planning, Monitoring/Trending, Networking/Protocol Analysis, Virtualization, and Cloud Computing.
- Consulted and worked in many industries, including insurance, banking, accounting, construction, embedded hardware design, printing/publishing, early education, higher education, and ISP/hosting providers.

Topics
- Hardware
- Operating System
- Nagios Core
- Plugins
- Other Add-ons
- Event Brokers
- Other Software
- Performance Monitoring
- General

Overview

Hardware
- Hardware vs VMware
    - High forking rate is not a good fit for VMware
- CPU Requirements
    - Quantity vs Quality
- Memory
    - Typically memory efficient, but have enough for ramdisk(s)
    - Affected by your plugins if using active checks
- Disk I/O
    - The faster the better!

VMware Performance Comparison

Isolated VMware ESX 4
Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
4      8 GB    1000   182          65            5             12084         6042
8      8 GB    1000   162          47            4             12084         6042
8      8 GB    600    87           60            8             7272          3636

Physical Dell PowerEdge R710 (new)
Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
4      16 GB   1000   0.19         10            1.25          12084         6042
8      8 GB    1000   0.38         25            1.15          12084         6042

Physical HP ProLiant DL380 G4 (~8 years old)
Procs  Memory  Hosts  Avg Svc Lat  Avg CPU Util  Avg CPU Load  # Act Checks  # Pass Checks
4      4 GB    800    0.29         32            1.95          9684          4842
8      4 GB    1000   0.47         37            4.43          12084         6042

Operating System
- CentOS / RHEL 5.6
- Strip down the running services
- Create ramdisks in the Nagios RC script
    - The first holds status.dat, checkresults, and temp_file
    - The second holds the PNP processing directory; files are moved there by the Nagios perf command
    - See nagios.rc on GitHub for the full RC script

    # Create the tmpfs ramdisk only if it is not already mounted
    ramdisk=$(mount | grep "/var/nagios/ramcache")
    if [ -z "$ramdisk" ]; then
        mkdir -p -m 755 /var/nagios/ramcache
        mount -t tmpfs -o size=128m tmpfs /var/nagios/ramcache
        # checkresults lives on the ramdisk next to status.dat and temp_file
        mkdir -p -m 755 /var/nagios/ramcache/checkresults
        chown -R nagios:nagios /var/nagios/ramcache
    fi
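
The mount test in the RC script relies on grep, which matches substrings, so a path such as /var/nagios/ramcache2 would also count as mounted. A minimal sketch of a stricter check follows, assuming a Linux /proc/mounts table; the helper name is_mounted is hypothetical and is not part of the author's nagios.rc.

```shell
#!/bin/sh
# is_mounted MOUNT_POINT [MOUNTS_TABLE]
# Succeeds only if MOUNT_POINT appears exactly in the second column of the
# mounts table. The table defaults to /proc/mounts (a Linux assumption).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical use in an RC script:
#   if ! is_mounted /var/nagios/ramcache; then
#       mount -t tmpfs -o size=128m tmpfs /var/nagios/ramcache
#   fi
```

The second argument exists so the check can be exercised against a sample table without mounting anything.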

Operating System
- Make sure there are no ulimit restrictions
    - ulimit -a
- Renice daemons and services
    - daemon -15 --user=$user $exec -d $config
    - perfdata_file_run_cmd = /bin/nice -n 20 /usr/libexec/pnp4nagios/process_perfdata.pl
    - Puppet runs are also re-niced (/etc/sysconfig/puppet: NICELEVEL=19)
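
Reading ulimit -a by eye is easy to forget after a rebuild, so the check can be scripted. The sketch below is one possible approach, not the author's: the names limit_ok and check_nagios_limits and the 4096 threshold are assumptions chosen for illustration. It uses bash because ulimit -u is not available in every POSIX shell.

```shell
#!/bin/bash
# limit_ok VALUE MINIMUM
# Succeeds if VALUE is "unlimited" or a number of at least MINIMUM.
limit_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ] 2>/dev/null
}

# check_nagios_limits [MINIMUM]
# Prints a warning for each limit below MINIMUM (4096 is a hypothetical
# default, not a value from the presentation).
check_nagios_limits() {
    min=${1:-4096}
    limit_ok "$(ulimit -n)" "$min" || echo "open files limit is below $min"
    limit_ok "$(ulimit -u)" "$min" || echo "max user processes limit is below $min"
}

# Hypothetical use, run as the nagios user before starting the daemon:
#   check_nagios_limits 4096
```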

Nagios Core
- Currently using Nagios 3.2.3
    - Empty hostgroups patch
    - Non-existent device patch (causes a segfault with pipes)
- Large Scale Suggestions Doc
    - Pre-caching objects
    - Re-write the RC script to optimize restart time
    - Don't allow restart/stop if the config is broken
    - Limit use of macros (resources.cfg)
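
The "don't allow restart/stop if the config is broken" rule can be sketched as a small guard that runs a verification command first and performs the action only when verification succeeds. This is an illustration under assumptions, not the author's RC script: guarded_action is a hypothetical name, and the binary and config paths in the usage comment are assumed. Nagios 3 verifies a configuration with its -v option.

```shell
#!/bin/sh
# guarded_action VERIFY_CMD ACTION_CMD
# Runs ACTION_CMD only if VERIFY_CMD exits successfully. Both arguments are
# simple command strings and are word-split by the shell on purpose.
guarded_action() {
    verify_cmd=$1
    action_cmd=$2
    if $verify_cmd >/dev/null 2>&1; then
        $action_cmd
    else
        echo "Configuration check failed; refusing to run: $action_cmd" >&2
        return 1
    fi
}

# Hypothetical use in an RC script (paths are assumptions):
#   guarded_action "/usr/sbin/nagios -v /etc/nagios/nagios.cfg" "restart_nagios"
```

Keeping the running daemon untouched when the check fails means a bad config edit cannot take monitoring offline.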

Nagios Core
- Remove use of CGIs
    - Using Livestatus/Multisite/livestatus-slave
- Limit use of backups (crazy, huh?)
- Keep the logging level low in core and in all plugins/brokers
- Keep comments limited; delete them if there are more than X or they are more than Y days old
- status_update_interval=20 (default is 10): how often to update status.dat, in seconds
- enable_environment_macros=0 (default is 1): whether macros are passed as environment variables
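
Pruning old comments can be scripted with Livestatus and the Nagios external command file. The sketch below covers only the age rule and makes several assumptions: the function name stale_comment_commands is hypothetical, the input is expected as semicolon-separated id;entry_time;is_service lines (the shape a Livestatus query on the comments table would give with those three columns), and the socket and command-file paths in the usage comment are examples. DEL_HOST_COMMENT and DEL_SVC_COMMENT are standard Nagios external commands.

```shell
#!/bin/sh
# stale_comment_commands MAX_AGE_DAYS [NOW]
# Reads "id;entry_time;is_service" lines on stdin and prints one external
# command per comment older than MAX_AGE_DAYS. NOW defaults to the current
# Unix time and is a parameter so the logic can be checked with fixed input.
stale_comment_commands() {
    max_age_days=$1
    now=${2:-$(date +%s)}
    awk -F';' -v now="$now" -v days="$max_age_days" '
        $2 + days * 86400 < now {
            cmd = ($3 == 1) ? "DEL_SVC_COMMENT" : "DEL_HOST_COMMENT"
            printf "[%s] %s;%s\n", now, cmd, $1
        }'
}

# Hypothetical use (socket and command file paths are assumptions):
#   printf 'GET comments\nColumns: id entry_time is_service\n\n' \
#       | unixcat /var/nagios/rw/live \
#       | stale_comment_commands 30 >> /var/nagios/rw/nagios.cmd
```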

Plugins
- check_nrpe
- check_logfiles
- check_hpasm / check_dell_sensors / check_dell_omreport
- check_oracle_health, check_mysql_health
- check_ps.sh (re-written for perf data and correct calculations)
- nagios_auto_service
- Return perf data whenever possible
- Many other custom and one-off plugins

Other Add-ons
- NRPE
    - Patched to allow a larger buffer size
- NSCA (NRDP in the future?)
    - max_packet_age=60, forward and back time patch
    - Run from xinetd to allow larger/faster connections
    - MUST use instances = UNLIMITED
    - Recommend per_source = UNLIMITED
    - Recommend cps = 5000 3
- NSClient++
    - Many updates for buffering, data truncation, queueing
- PNP4Nagios
    - 0.6 branch, since php53 RPMs are available
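
For reference, the three xinetd settings above would sit in a service stanza along these lines. This is a config sketch modeled on the sample xinetd file that ships with NSCA, not the author's file: the server path, config path, and user/group are assumptions, and it relies on an nsca entry (normally port 5667) existing in /etc/services.

```
# /etc/xinetd.d/nsca (hypothetical location and paths)
service nsca
{
    flags           = REUSE
    socket_type     = stream
    wait            = no
    user            = nagios
    group           = nagios
    server          = /usr/sbin/nsca
    server_args     = -c /etc/nagios/nsca.cfg --inetd
    log_on_failure  += USERID
    disable         = no
    # Settings from the slide: no cap on concurrent instances or
    # per-source connections, and a high connections-per-second limit
    # (5000) with a 3 second pause if it is ever exceeded.
    instances       = UNLIMITED
    per_source      = UNLIMITED
    cps             = 5000 3
}
```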

Event Brokers
- DNX
- Mod-Gearman (under evaluation)
- MK Livestatus
- Performance Data Splunker (custom)
- Log separator (custom; reduces grepping for messages)

Other Software
- Puppet
    - Manages the entire server
- Splunk
    - Log files, performance data
- Graylog2? (not implemented yet)
- Cacti
    - Nagiostats template, updated to use livestatus instead of CGI
- Custom Control Panel
    - Builds host groups based on templates, auto-config based on host info
- ConSol Labs
    - check_logfiles, check_hpasm, mod_gearman, check_mysql_health, check_oracle_health

Performance Monitoring
How to watch your system to determine bottlenecks:
- vmstat
- iostat
- top
- iptraf
- sar
- strace
- esxtop (if you have to use a VM)
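
Since the fork rate is what makes Nagios a poor fit for VMware, it is worth measuring directly alongside the tools above. A minimal sketch follows, assuming a Linux host where /proc/stat exposes a cumulative "processes" counter (forks since boot); the function names read_forks and fork_rate are hypothetical.

```shell
#!/bin/sh
# read_forks [STAT_FILE]
# Prints the cumulative fork count from the "processes" line.
# STAT_FILE defaults to /proc/stat (a Linux assumption).
read_forks() {
    awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"
}

# fork_rate BEFORE AFTER SECONDS
# Prints the integer number of forks per second between two samples.
fork_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Hypothetical use: sample twice, five seconds apart
#   a=$(read_forks); sleep 5; b=$(read_forks); fork_rate "$a" "$b" 5
```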

General Configs
- Host config files are standalone configurations that tell everything about a host.
- Hosts are tied to a hostgroup
- Hostgroups are tied to a servicegroup
- Services are tied to a servicegroup
- host.cfg -> hostgroups -> servicegroups <- services
- This allows for easy drop-in and removal of hosts, but also requires that at least 1 host be assigned to a management server
- Limitation: harder to make per-server, per-service customizations
- Hosts are built/assigned from the control panel (round-robin distribution)
- Parents are built automatically from the topology database, updated nightly (ESX hourly)
- Parents are only pinged once a day unless there are problems; uses fping
- Some alerts do trigger event handlers: auto-fix as much as possible
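
The control panel itself is custom and not shown in the slides, but the round-robin idea is simple enough to illustrate. The sketch below is an assumption-laden example, not the author's code: assign_round_robin is a hypothetical name, and the host and server names in the usage comment are made up.

```shell
#!/bin/sh
# assign_round_robin SERVER...
# Reads one host name per line on stdin and prints "host server" pairs,
# cycling through the given management servers in order.
assign_round_robin() {
    if [ $# -eq 0 ]; then
        echo "usage: assign_round_robin SERVER..." >&2
        return 1
    fi
    awk -v servers="$*" '
        BEGIN { n = split(servers, s, " ") }
        NF    { print $1, s[(count++ % n) + 1] }'
}

# Hypothetical use (host and server names are examples):
#   printf 'web1\nweb2\ndb1\n' | assign_round_robin mon1 mon2
```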

General Configs
Types of things being monitored:
- CPU load, CPU stats (idle/wait/user/system), disk space, log files/Event Log, hardware, processes, swap, memory usage, service ports, NTP drift, cron job completion, UPS
- Nagios configtest, livestatus connectivity
- PNP4Nagios directory size (to confirm processing is keeping up)
- Performance (CPU/memory) usage of certain processes
- Puppet update time, to make sure it doesn't fall behind
- DB response times (Oracle/MySQL/PostgreSQL)
- Apache stats
- Custom app status (user accounts, response times, loads, etc.)
- Various SNMP/WMI values (mostly network-related stats)
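
As an example of monitoring the monitor, a check on the PNP4Nagios spool can follow the usual plugin conventions: exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, with perf data after the pipe character. The sketch below counts waiting files rather than measuring bytes, which is an assumption about what "directory size" means here; the function name, spool path, and thresholds are all hypothetical.

```shell
#!/bin/sh
# check_spool_backlog DIR WARN CRIT
# Counts regular files under DIR and reports a Nagios-style status line
# with perf data in the form label=value;warn;crit;min.
check_spool_backlog() {
    dir=$1
    warn=$2
    crit=$3
    if [ ! -d "$dir" ]; then
        echo "UNKNOWN - $dir is not a directory"
        return 3
    fi
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    perf="files=$count;$warn;$crit;0"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count files waiting in $dir | $perf"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count files waiting in $dir | $perf"
        return 1
    fi
    echo "OK - $count files waiting in $dir | $perf"
    return 0
}

# Hypothetical use as a plugin body (path and thresholds are assumptions):
#   check_spool_backlog /var/nagios/perfcache 500 2000; exit $?
```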

Links (where to find this stuff)
- My Stuff - https://github.com/dwittenberg2008/nagios
- MK Livestatus - http://mathias-kettner.de/checkmk_livestatus.html
- LivestatusSlave - http://nagios.larsmichelsen.com/livestatusslave/
- PNP4Nagios - http://docs.pnp4nagios.org/pnp-0.6/start
- ConSol Labs - http://labs.consol.de/
- Puppet - http://puppetlabs.com/
- DNX - http://dnx.sourceforge.net/
- Cacti Template (Base) - http://forums.cacti.net/about33806.html

Questions? Suggestions?
