Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4

Daniel Wittenberg's presentation on Scaling Nagios Core 4.
The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Published in: Technology

  1. Scaling Nagios 4
     Daniel Wittenberg
     daniel.wittenberg@ipsoft.com
  2. About Me
     ● Unix/Linux admin since mid 90's
     ● Nagios/Netsaint user since early 2000's
     ● Owned/operated consulting business for almost 10 years that provided distributed monitoring using Nagios
     ● Previously employed by Fortune 50 Insurance company
     ● Currently Monitoring Platform Manager at IPsoft Inc.
  3. About IPsoft
     ● Provider of Remote Infrastructure Management and automation services
     ● ITIL and 6 Sigma compliance management framework
     ● Automation that resolves 56% of all incidents, and 90% L1
     ● Monitoring, Automation, Event Correlation, Management....
     ● Offices around the world in ten countries
     ● http://www.ipsoft.com
  4. Last year...
  5. My Configuration
     ● ~700 Nagios Servers
     ● ~130,000 Monitored Devices
     ● ~3,000,000 Service Checks
     ● Mix of customized Nagios 3.2.3 and 4.0.0
     ● Scientific Linux 6.2/6.4
     ● Managed by Puppet 3.x
     ● 2/3 on VMware ESX, the rest on bare metal
     ● Adding new Nagios servers almost daily
  6. What's different with Nagios 4
     SPEED!
     ● Current testing shows on average 500% faster over 3.2.3
  7. What's different with Nagios 4
     Some things that would impact performance/stability
     http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
     ● Embedded Perl – Gone
     ● external_command_buffer_slots - Gone
     ● -x option to not verify circular paths no longer needed in rc scripts
     ● Configuration Verification algorithm changes, massive startup speed increase
     ● Event Queue algorithm changes, helps with CPU utilization (* Andreas 2012 Pres.)
     ● Disk I/O reduced to virtually 0
     ● NEW query handler interface, better communication with core
     ● NEW core workers – reduces I/O, memory, CPU
     ● Completely re-written spec file for better installs, debug modes
  8. Perf Testing Lab Setup
     ● Servers are all ESX 5 based VM's on the same cluster
     ● Variable CPU cores, 4GB memory
     ● Metrics used to consider a test failure:
       ● CPU Block Queue > 3
       ● CPU I/O Wait > 3
       ● CPU Idle < 10%
       ● Service Check Latency > 1s
       ● Host Check Latency > 1s
     ● 30 minute run time, > 3% failure rate failed the test
     ● Fully automated increasing work load, consistent results
     ● Add 1 host + 1 service check at a time, try to get “best case” numbers without check latency
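
The failure criteria on the lab-setup slide can be expressed as a small check. This is only a minimal sketch in Python, not the test harness used in the talk: the `Sample` type and its field names are hypothetical, the I/O wait threshold is assumed to be a percentage, and "> 3% failure rate" is read here as the fraction of sampled intervals that breach any threshold during the 30 minute run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One measurement interval (hypothetical field names)."""
    blocked_queue: float      # blocked processes, assumed to be vmstat's "b" column
    io_wait_pct: float        # CPU I/O wait, assumed to be a percentage
    idle_pct: float           # CPU idle, percent
    service_latency_s: float  # service check latency, seconds
    host_latency_s: float     # host check latency, seconds


def sample_failed(s: Sample) -> bool:
    """Apply the thresholds listed on the slide to a single sample."""
    return (
        s.blocked_queue > 3
        or s.io_wait_pct > 3
        or s.idle_pct < 10
        or s.service_latency_s > 1
        or s.host_latency_s > 1
    )


def run_failed(samples: List[Sample], max_failure_rate: float = 0.03) -> bool:
    """A run fails when more than 3% of its samples breach a threshold."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(sample_failed(s) for s in samples)
    return failures / len(samples) > max_failure_rate
```

A load-ramping script could call `run_failed` after each 30 minute run and stop adding hosts and services once it returns `True`.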
  9. Test Lab Architecture
  10. Test Results

      CPU Cores   Service Checks     Service Checks       Difference
                  (Version 3.2.3)    (Version 4.0.0rc1)
      1           1700               10500                617%
      2           3300               20800                630%
      4           6500               35300                543%
      8           11700              45100                385%
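
The "Difference" column in the test results appears to be the 4.0.0rc1 figure expressed as a percentage of the 3.2.3 figure, truncated to a whole number. The short Python sketch below recomputes it from the numbers on the slide and also prints checks per core, which is a derived figure that does not appear in the talk.

```python
# (cores, service checks on 3.2.3, service checks on 4.0.0rc1), copied from the slide
RESULTS = [
    (1, 1700, 10500),
    (2, 3300, 20800),
    (4, 6500, 35300),
    (8, 11700, 45100),
]


def ratio_pct(old: int, new: int) -> int:
    """New capacity as a percentage of the old one, truncated to an integer."""
    return new * 100 // old


if __name__ == "__main__":
    for cores, old, new in RESULTS:
        per_core = new / cores
        print(f"{cores} cores: {ratio_pct(old, new)}% of 3.2.3, "
              f"{per_core:.0f} checks per core on 4.0.0rc1")
```

Read this way, 617% means roughly 6.2 times the capacity. The checks-per-core figure for 4.0.0rc1 falls as cores are added, which is consistent with the smaller gain at 8 cores.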
  11. Other software used
      ● Customized livestatus based on Andreas updates for Nagios 4
        ● https://github.com/ageric/livestatus
      ● Developing custom “single pane” interface to replace CGI/Check_mk Multisite
      ● Developing full REST API to talk to QH, livestatus and config files
      ● nagios-qh.rb Query Handler interface to gather loadctl metrics
        ● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
      ● Custom load control daemon that talks to QH
      ● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
      ● Custom agent, like NRPE on steroids without limitations like buffer size
  12. Other performance tweaks
      ● Sysctl Changes
        ● net.ipv4.tcp_fin_timeout
        ● net.ipv4.tcp_keepalive_probes
        ● net.ipv4.tcp_tw_recycle
        ● net.ipv4.tcp_tw_reuse
      ● No longer need RAMDISK, but still in the default sysconfig/RC script for now
      ● Keep logging levels as low as possible
      ● Disable CGI's whenever possible
      ● Disable Environment Macros
      ● Don't use resource macros when you don't need to, they are not cached
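
The slide names the sysctl tunables but not the values used, so none are suggested here. As a starting point for auditing a server, the following Python sketch reads the current values from `/proc/sys`. It uses the kernel's spelling of the tunable names; the helper names are hypothetical. Some tunables may be missing on newer kernels (`tcp_tw_recycle` was removed in Linux 4.12), in which case `None` is returned.

```python
from pathlib import Path
from typing import Optional

# Tunables named on the slide, spelled as the kernel spells them.
TUNABLES = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_probes",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value as text, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in TUNABLES:
        print(f"{name} = {read_sysctl(name)}")
```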
  13. Other performance tweaks
      ● /etc/security/limits.d/nagios.conf
        ● ipmon soft nofile 131072
        ● ipmon hard nofile 131072
        ● ipmon soft nproc 131072
        ● ipmon hard nproc 131072
      ● Nearly disable OOM killer for the nagios process, saves it until last
        ● echo '-16' > /proc/<nagios pid>/oom_adj
      ● Re-nice puppet to run at 10 so less impacting (true for any extra services)
        ● /etc/sysconfig/puppet – NICELEVEL=10
      ● This should apply to any other running services that might take resources
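
One way to confirm that the limits.d settings took effect is to inspect the limits of a process started as the monitoring user. A minimal Python sketch using the standard `resource` module (Linux only) follows; the `meets_target` helper is hypothetical, and the target of 131072 is taken from the slide.

```python
import resource

TARGET = 131072  # value used for nofile and nproc on the slide


def current_limits() -> dict:
    """Return (soft, hard) limits for open files and processes."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }


def meets_target(limit: tuple, target: int = TARGET) -> bool:
    """True when the soft limit is unlimited or at least the target."""
    soft, _hard = limit
    return soft == resource.RLIM_INFINITY or soft >= target


if __name__ == "__main__":
    for name, limit in current_limits().items():
        status = "ok" if meets_target(limit) else "below target"
        print(f"{name}: soft={limit[0]} hard={limit[1]} ({status})")
```

Run it as the user the limits were set for, since limits are applied per user at login.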
  14. Common Perf Tools
      ● vmstat / top – cpu/memory
      ● iostat / iotop – disk usage
      ● iptraf – network
      ● sar – cpu/memory/disk
      ● strace – immediate debugging, also debugging QA
      ● esxtop – VM stats
      ● tuned – can dynamically tune system
      ● perf record -p <pid> / perf list / perf top -u nagios
  15. How to keep it running well
      ● Monitor everything...you can never have too much info!
      ● CPU load and CPU stats (idle/wait/user/system)
      ● Disk space, inodes free
      ● All application/system logs (apache, syslog, nagios.log, etc.)
      ● Hardware status
      ● Swap / Physical Memory Usage
      ● Puppet state (state.yaml)
      ● Apache Stats (if you have a GUI/API)
      ● Network performance and stats (errors, throughput, etc.)
      ● NTP time and drift (more important on VM's)
  16. Our Platform Architecture (simplified)
  17. Known Issues (and complaints)
      ● Number of workers on smaller (1-2 core) systems easily overloaded
      ● No remote workers (yet)
      ● Still have to restart to add new hosts/services
      ● No REST API natively
      ● Livestatus (or similar) not native
  18. Questions ?
      ● Daniel.Wittenberg@ipsoft.com
      ● dwittenberg2008@gmail.com
      ● @dwittenberg2008
      ● www.linkedin.com/in/dwittenberg
      ● nagios and nagios-devel IRC
      ● Nagios Users and Devel mailing lists
      ● Always looking to hire new people so contact me!