Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4

Daniel Wittenberg's presentation on Scaling Nagios Core 4.
The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4 Presentation Transcript

  • 1. Scaling Nagios 4 Daniel Wittenberg daniel.wittenberg@ipsoft.com
  • 2. About Me
      ● Unix/Linux admin since mid 90's
      ● Nagios/Netsaint user since early 2000's
      ● Owned/operated consulting business for almost 10 years that provided distributed monitoring using Nagios
      ● Previously employed by Fortune 50 Insurance company
      ● Currently Monitoring Platform Manager at IPsoft Inc.
  • 3. About IPsoft
      ● Provider of Remote Infrastructure Management and automation services
      ● ITIL and 6 Sigma compliance management framework
      ● Automation that resolves 56% of all incidents, and 90% L1
      ● Monitoring, Automation, Event Correlation, Management....
      ● Offices around the world in ten countries
      ● http://www.ipsoft.com
  • 4. Last year...
  • 5. My Configuration
      ● ~700 Nagios Servers
      ● ~130,000 Monitored Devices
      ● ~3,000,000 Service Checks
      ● Mix of customized Nagios 3.2.3 and 4.0.0
      ● Scientific Linux 6.2/6.4
      ● Managed by Puppet 3.x
      ● 2/3 on VMware ESX, rest are bare metal
      ● Adding new Nagios servers almost daily
  • 6. What's different with Nagios 4
      SPEED!
      ● Current testing shows on average 500% faster over 3.2.3
  • 7. What's different with Nagios 4
      Some things that would impact performance/stability
      http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
      ● Embedded Perl – Gone
      ● external_command_buffer_slots – Gone
      ● -x option to not verify circular paths no longer needed in rc scripts
      ● Configuration Verification algorithm changes, massive startup speed increase
      ● Event Queue algorithm changes, helps with CPU utilization * Andreas 2012 Pres.
      ● Disk I/O reduced to virtually 0
      ● NEW query handler interface, better communication with core
      ● NEW core workers – reduces I/O, memory, CPU
      ● Completely re-written spec file for better installs, debug modes
  • 8. Perf Testing Lab Setup
      ● Servers are all ESX 5 based VMs on the same cluster
      ● Variable CPU cores, 4GB memory
      ● Metrics used to consider a test failure:
          ● CPU Block Queue > 3
          ● CPU I/O Wait > 3
          ● CPU Idle < 10%
          ● Service Check Latency > 1s
          ● Host Check Latency > 1s
      ● 30 minute run time, > 3% failure rate failed the test
      ● Fully automated increasing work load, consistent results
      ● Add 1 host + 1 service check, try to get “best case” numbers without check latency
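The pass/fail rule on slide 8 can be sketched in a few lines of Python. This is only an illustration, not the speaker's test harness: the `Sample` class, its field names, and the reading of "> 3% failure rate" as "more than 3% of collected samples breached a threshold" are all assumptions, as is treating CPU idle as a percentage and latencies as seconds.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement taken during a test run (hypothetical structure)."""
    cpu_block_queue: float    # blocked processes, as reported by the sampling tool
    cpu_io_wait: float        # I/O wait, in the units the sampling tool reports
    cpu_idle: float           # idle CPU, assumed to be a percentage
    service_latency_s: float  # service check latency in seconds
    host_latency_s: float     # host check latency in seconds


def sample_fails(s: Sample) -> bool:
    """Apply the per-sample thresholds listed on the slide."""
    return (
        s.cpu_block_queue > 3
        or s.cpu_io_wait > 3
        or s.cpu_idle < 10
        or s.service_latency_s > 1
        or s.host_latency_s > 1
    )


def run_failed(samples, max_failure_rate=0.03):
    """A run fails when more than 3% of its samples breach a threshold (assumed reading)."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(sample_fails(s) for s in samples)
    return failures / len(samples) > max_failure_rate
```

With this reading, a 30 minute run sampled once per second would tolerate up to 54 bad samples out of 1800 before the run counts as failed.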
  • 9. Test Lab Architecture
  • 10. Test Results

      CPU Cores   Service Checks (Version 3.2.3)   Service Checks (Version 4.0.0rc1)   Difference
      1           1700                             10500                               617%
      2           3300                             20800                               630%
      4           6500                             35300                               543%
      8           11700                            45100                               385%
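The "Difference" column on slide 10 appears to be the 4.0.0rc1 figure expressed as a percentage of the 3.2.3 figure, truncated to a whole number (rounding would give 618% for the single-core row). The short Python sketch below reproduces the column from the slide's own numbers and also derives checks per core, which is not on the slide; the function names are illustrative.

```python
# (cores: (service checks on 3.2.3, service checks on 4.0.0rc1)), copied from the slide
RESULTS = {
    1: (1700, 10500),
    2: (3300, 20800),
    4: (6500, 35300),
    8: (11700, 45100),
}


def ratio_percent(old: int, new: int) -> int:
    """New capacity as a percentage of old capacity, truncated like the slide."""
    return new * 100 // old


def per_core(cores: int, checks: int) -> float:
    """Service checks sustained per CPU core."""
    return checks / cores


for cores, (v3, v4) in RESULTS.items():
    print(f"{cores} cores: {ratio_percent(v3, v4)}% "
          f"({per_core(cores, v3):.1f} -> {per_core(cores, v4):.1f} checks per core)")
```

Read this way, 617% means 4.0.0rc1 sustained roughly 6.2 times as many service checks as 3.2.3 on one core, and the per-core figures show the advantage narrowing as cores are added.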
  • 11. Other software used
      ● Customized livestatus based on Andreas updates for Nagios 4
          ● https://github.com/ageric/livestatus
      ● Developing custom “single pane” interface to replace CGI/Check_mk Multisite
      ● Developing full REST API to talk to QH, livestatus and config files
      ● nagios-qh.rb Query Handler interface to gather loadctl metrics
          ● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
      ● Custom load control daemon that talks to QH
      ● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
      ● Custom agent, like NRPE on steroids without limitations like buffer size
  • 12. Other performance tweaks
      ● Sysctl Changes
          ● net.ipv4.tcp_fin_timeout
          ● net.ipv4.tcp_keepalive_probes
          ● net.ipv4.tcp_tw_recycle
          ● net.ipv4.tcp_tw_reuse
      ● No longer need RAMDISK, but still in the default sysconfig/RC script for now
      ● Keep logging levels as low as possible
      ● Disable CGIs whenever possible
      ● Disable Environment Macros
      ● Don't use resource macros when you don't need to, they are not cached
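The slide names the sysctl keys but gives no values, so the sketch below only reads the current settings rather than suggesting any. It assumes the standard Linux mapping of a sysctl name to a file under /proc/sys, and it uses the kernel's spellings tcp_keepalive_probes and tcp_tw_reuse. The helper names are hypothetical.

```python
from pathlib import Path

# Keys from the slide, using the kernel's spellings. tcp_tw_recycle was
# removed in Linux 4.12, so it may be missing on newer hosts.
TUNABLES = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_probes",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys"):
    """Return the current value as a string, or None if the key is unavailable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


for key in TUNABLES:
    print(f"{key} = {read_sysctl(key)}")
```

Recording these values before and after a tuning change makes it easier to tie a change in check latency back to a specific setting.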
  • 13. Other performance tweaks
      ● /etc/security/limits.d/nagios.conf
          ● ipmon soft nofile 131072
          ● ipmon hard nofile 131072
          ● ipmon soft nproc 131072
          ● ipmon hard nproc 131072
      ● Nearly disable OOM killer for the nagios process, saves it until last
          ● echo '-16' > /proc/<nagios pid>/oom_adj
      ● Re-nice puppet to run at 10 so it has less impact (true for any extra services)
          ● /etc/sysconfig/puppet – NICELEVEL=10
      ● This should apply to any other running services that might take resources
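The OOM killer adjustment on slide 13 is a single write to a /proc file, sketched below in Python. The function name and the `proc_root` parameter (there only so the function can be exercised against a scratch directory) are assumptions. The legacy oom_adj file accepts values from -17 to 15; newer kernels favor oom_score_adj, which uses a different range (-1000 to 1000), so the slide's value of -16 should not be written there unchanged.

```python
from pathlib import Path


def set_oom_adj(pid: int, value: int = -16, proc_root: str = "/proc") -> None:
    """Write an OOM adjustment for a process, like the slide's echo command.

    Lowering the value normally requires root. -17 would exempt the process
    entirely; -16 (the slide's choice) only makes it a very unlikely victim.
    """
    if not -17 <= value <= 15:
        raise ValueError("oom_adj must be between -17 and 15")
    Path(proc_root, str(pid), "oom_adj").write_text(f"{value}\n")
```

A call such as `set_oom_adj(nagios_pid)` would have to be repeated after every Nagios restart, because the setting belongs to the process and not to the service.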
  • 14. Common Perf Tools
      ● vmstat / top – cpu/memory
      ● iostat / iotop – disk usage
      ● iptraf – network
      ● sar – cpu/memory/disk
      ● strace – immediate debugging, also debugging QA
      ● esxtop – VM stats
      ● tuned – can dynamically tune system
      ● perf record -p <pid> / perf list / perf top -u nagios
  • 15. How to keep it running good
      ● Monitor everything...you can never have too much info!
      ● CPU load and CPU stats (idle/wait/user/system)
      ● Disk space, inodes free
      ● All application/system logs (apache, syslog, nagios.log, etc.)
      ● Hardware status
      ● Swap / Physical Memory Usage
      ● Puppet state (state.yaml)
      ● Apache Stats (if have GUI/API)
      ● Network performance and stats (errors, throughput, etc.)
      ● NTP time and drift (more important on VMs)
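As one concrete instance of the "inodes free" item, here is a minimal sketch of a check in the usual Nagios plugin style (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). It is not a check from the presentation: the 20% and 10% thresholds and the function names are placeholders chosen for illustration.

```python
import os

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def inode_free_percent(path: str):
    """Percentage of inodes still available to unprivileged users, or None."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode counts
        return None
    return st.f_favail * 100.0 / st.f_files


def evaluate(free_pct, warn: float = 20.0, crit: float = 10.0):
    """Turn a free-inode percentage into a (status code, message) pair."""
    if free_pct is None:
        return UNKNOWN, "UNKNOWN - filesystem does not report inode counts"
    if free_pct < crit:
        return CRITICAL, f"CRITICAL - {free_pct:.1f}% inodes free"
    if free_pct < warn:
        return WARNING, f"WARNING - {free_pct:.1f}% inodes free"
    return OK, f"OK - {free_pct:.1f}% inodes free"


def main(argv) -> int:
    """Print the status line and return the plugin exit code."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = evaluate(inode_free_percent(path))
    print(message)
    return code
```

To run it as a plugin, end the script with `import sys` and `sys.exit(main(sys.argv))`, then point a Nagios command definition at it.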
  • 16. Our Platform Architecture (simplified)
  • 17. Known Issues (and complaints)
      ● Number of workers on smaller (1-2 core) systems easily overloaded
      ● No remote workers (yet)
      ● Still have to restart to add new hosts/services
      ● No REST API natively
      ● Livestatus (or similar) not native
  • 18. Questions?
      ● Daniel.Wittenberg@ipsoft.com
      ● dwittenberg2008@gmail.com
      ● @dwittenberg2008
      ● www.linkedin.com/in/dwittenberg
      ● nagios and nagios-devel IRC
      ● Nagios Users and Devel mailing lists
      ● Always looking to hire new people so contact me!