How monitoring should be automated without jeopardizing accuracy.
I will present a ready-to-use system that lets system admins set up their servers to be automagically picked up by Naemon, and lets them tweak their settings without requiring access to the monitoring system and, most notably, without even restarting or reloading it.
I will also present a working (I hope) demo of dynamic thresholds in Naemon, using various helpers in a request/response system.
1. … more than software
Naemon & Nostalgia
Andreas Ericsson
ageric79@gmail.com
2. … more than software
Agenda
●Agenda
●Ego slide
●About op5
●IT now and then
●Naemon
●Roadmap progress
●Up and coming
3. … more than software
Ego slide
●Programming since I was seven
●Core architect at op5 since 2003
●Nagios core developer 2009-2013
●Performance fanatic
●Author of Merlin and Nagios 4
●Naemon maintainer
●Voted “most likely to invent the lightsaber and then accidentally kill himself with it” when we last played that particular drinking game.
●Motivation: “He does a lot of dumb shit but he's really smart on the inside”
4. … more than software
About op5
●Founded 2003
●+900 customers
●97% renewal rate
●Focus on large installations
●http://www.op5.com
6. … more than software
“IT” ca 1970
●Computer performance measured in CPM (cards per minute) as often as in MHz
●1 year of training before you were allowed to touch a machine
●Unix development began to create a multitasking, multiuser system
●First time a computer passes a college-level calculus course
●First successful ARPANET test
●IANA formed
●Computer-powered devices per person: 0.00000001
●Average CPU speed: 1 MHz
●admin:computer ratio 300:1
7. … more than software
IT ca 1980
●One-man computers were gaining ground (Apple II)
●PacMan!
●Portable computers were being developed
●Ethernet standards introduced
●ARPANET, BitNet, CSNET et al began merging
●TCP/IP standards formalized
●SMTP, DNS etc quickly followed
●Average CPU speed: 8 MHz
●admin:computer ratio 10:1
8. … more than software
IT ca 1990
●IBM PC style computers becoming popular
●GUIs become popular
●First web page created
●Linux invented
●Internet had suffered its first worm (Morris)
●“Software installation” was actually part of a job description
●Average CPU speed: 66 MHz
●admin:computer ratio 1:1
9. … more than software
IT ca 2000
●WiFi standards emerge
●The dot-com era
●Google overtakes AltaVista as the most popular search engine
●It was ok for non-nerds to get computers
●Monitoring starts to become a thing
●Nagios development starts
●Average CPU speed: 800 MHz
●admins:computers ratio 1:10
10. … more than software
IT ca now
●An average smartphone has 120 million times the computing power of the first commercially available general-purpose computer (Ferranti Mark 1)
●An average smartphone has 4 million times the amount of main memory
●Giant datacenters house 100,000+ servers
●Average CPU speed: 2.4 GHz
●admins:computers ratio 1:300
11. … more than software
Hands up if...
●… the number of servers you manage has grown faster than the staff you have to monitor and manage them
●… you use more than two tools just to manage your servers
●… you have people dedicated to managing the servers that monitor and manage your servers
12. … more than software
Conclusions
●Manpower is getting scarce
●Sharing resources is more important than ever
●The most expensive resources are the most important to share
●Using what works now but doesn't lock one into a corner is key to remaining effective
●Developers have a duty to minimize the job they do (laziness is important! :-p)
●Developers have a duty to minimize the job sysadmins do
●Keep stuff simple. If it breaks, you not only get to keep the pieces, but you get to do the same job again in a different way
13. … more than software
Last year's Naemon roadmap
●Completed
●External commands via query handler
●Dropdir support
●Livestatus
●In progress
●Check result transformer
●Object creation/modification at runtime
●Scheduler-controlled helper daemons (well, kinda)
●Scrapped
●Runtime-modifiable main-config – no use case found
●Object extensions – custom variables fill the same role
14. … more than software
Up and coming
●Backlogged: Runtime object creation and modification
●Backlogged: Check result transformer
●Backlogged: Scheduler-controlled helper daemons
●Performance data handling
●Active agents
●Report data export
●Because http://www.youtube.com/watch?v=8yVFkMXy8rw
15. … more than software
Runtime object creation/modification
●User story:
●If users prefer to configure their monitoring on the monitored hosts, we should automagically add them to the monitoring config without reloading it.
●On-call schedule handover
●Added as a queryhandler extension
●Housekeeping events every X seconds
●Allows new stuff to call in and start getting monitored automagically
●Object creation may not happen if I can't get it stable in a reasonable timeframe, because config reload is superfast nowadays
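A minimal sketch of how a newly installed host might "call in" over the query handler socket. The framing used here ('#' for a one-shot request or '@' to keep the connection open, then the handler name, a space, the command and a terminating NUL byte) is my understanding of the Nagios 4 style query handler and should be treated as an assumption. The handler name `objects`, the command syntax and the socket path are purely hypothetical, since the runtime object extension is not finished.

```python
import socket


def build_qh_request(handler: str, command: str, keep_open: bool = False) -> bytes:
    """Frame a query handler request.

    Assumed framing: '#' (one-shot) or '@' (keep connection open),
    the handler name, a space, the command, and a NUL terminator.
    """
    prefix = "@" if keep_open else "#"
    return f"{prefix}{handler} {command}".encode() + b"\0"


def send_qh_request(socket_path: str, request: bytes, timeout: float = 5.0) -> bytes:
    """Send one request over a Unix domain socket and read until the peer closes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Hypothetical handler and command: a host announcing itself to the core.
request = build_qh_request("objects", "add host name=web01 address=192.0.2.10")
```

Only `build_qh_request` runs at import time; `send_qh_request` would be called with the path of the query handler socket on the monitoring server, which depends on the installation.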
16. … more than software
Check result transformer
●User story:
●Since anomalies in network and application behaviour often indicate errors, it's important that we can detect them and notify about them
●External helper connects to NERD
●Events zip to the helper
●Helper can alter state/output/perfdata/whatever
●Helper zaps result back to core via QH
●Allows for adaptive thresholds
●The monitoring system requires little or no configuration
●Inspired by Bischeck (which we will likely end up using as the engine)
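The adaptive-threshold idea can be sketched as a helper that learns a baseline from recent values and flags results that stray too far from it. This is only an illustration under my own assumptions: the rolling mean and standard deviation approach, the class name `AdaptiveThreshold` and its parameters are not taken from Naemon or from the engine mentioned above.

```python
from collections import deque
from statistics import mean, pstdev

OK, WARNING, CRITICAL = 0, 1, 2


class AdaptiveThreshold:
    """Classify a value by its distance from a rolling baseline (hypothetical helper)."""

    def __init__(self, window=20, min_samples=5, warn_sigma=2.0, crit_sigma=3.0):
        self.history = deque(maxlen=window)  # most recent observed values
        self.min_samples = min_samples       # values needed before judging anything
        self.warn_sigma = warn_sigma
        self.crit_sigma = crit_sigma

    def evaluate(self, value: float) -> int:
        """Return the state for `value`, then add it to the history."""
        state = OK
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            deviation = abs(value - baseline)
            # A perfectly flat history (spread == 0) is left as OK in this sketch.
            if spread > 0:
                if deviation > self.crit_sigma * spread:
                    state = CRITICAL
                elif deviation > self.warn_sigma * spread:
                    state = WARNING
        self.history.append(value)
        return state


threshold = AdaptiveThreshold()
states = [threshold.evaluate(v) for v in [10, 11, 10, 11, 10, 11, 10, 11, 50]]
```

In the flow described above, the helper would receive each check result as an event, run something like `evaluate` on the metric, and send the altered state back to the core. No thresholds need to be configured per service, which is the point of the design.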
17. … more than software
Scheduler-controlled helper daemon
●User story:
●Users must be able to trust that exported data is complete
●New twist: Naemon will connect to other systems' sockets instead
●Will most likely be implemented as a module
18. … more than software
Performance data handling
●User story:
●Users should be able to produce graphs of all metrics they monitor with as little impact on available resources as possible.
●Performance data can now be streamed from Naemon
●Reduces I/O from spoolfile writing
●Feeds data to PNP or Graphite
●Gets rid of the last synchronously executed system call
●Already completed
●Builds on top of already-implemented interfaces
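To illustrate the "feeds data to Graphite" step, here is a rough sketch that turns a plugin perfdata string (`label=value[UOM];warn;crit;min;max`) into lines for Graphite's plaintext protocol (`path value timestamp`). The function names, the `naemon` prefix and the metric path layout are assumptions made for this example, not the way Naemon itself streams performance data.

```python
import re

# One perfdata item: 'quoted label' or bare label, '=', numeric value, optional unit.
PERF_ITEM = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<bare>[^\s=']+))"
    r"=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s]*)"
)


def sanitize(part: str) -> str:
    """Replace characters that would break a dotted Graphite path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part)


def perfdata_to_graphite(host, service, perfdata, timestamp, prefix="naemon"):
    """Return one Graphite plaintext line per perfdata item.

    Thresholds, min/max and the unit are ignored in this sketch.
    """
    lines = []
    for match in PERF_ITEM.finditer(perfdata):
        label = match.group("quoted") or match.group("bare")
        path = ".".join(sanitize(p) for p in (prefix, host, service, label))
        lines.append(f"{path} {match.group('value')} {timestamp}")
    return lines


lines = perfdata_to_graphite(
    "web01", "HTTP", "time=0.012s;1;2;0 size=1024B;;;0", 1400000000
)
```

Each resulting line could be written to a Graphite carbon listener followed by a newline. Because the data is streamed, nothing has to be written to a spool file first.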
19. … more than software
Perfdata handling design
[Diagram boxes: Naemon, NERD, PNP/Graphite/?]
20. … more than software
Active agents
●User stories:
●Naemon should scale to as close to infinite sizes as possible
●To save time, new hosts should report what metrics they're offering and Naemon should automagically monitor them.
●collectd or a pushing version of check_mk
●Input for automagic host/service creation
●Large providers already write and ship compatible plugins
●Complements active checks, but doesn't replace them
●Improves security in many setups (especially over NRPE)
●Allows us to reuse existing code (i.e., be lazy and do less work)
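A receiver/evaluator for pushed metrics might look roughly like the sketch below: it compares a value sent by an agent against thresholds and formats the outcome as a `PROCESS_SERVICE_CHECK_RESULT` external command, the standard way of submitting a passive result. The function names, the fixed "higher is worse" thresholds and the example metric are assumptions for illustration. How the command is delivered (command pipe or query handler) is left out.

```python
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(value: float, warn: float, crit: float) -> int:
    """Simple 'higher is worse' threshold evaluation."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def passive_result_command(host, service, metric, value, warn, crit, timestamp):
    """Format a pushed metric as a passive service check result."""
    state = evaluate(value, warn, crit)
    # Plugin output, then perfdata after the '|' separator.
    output = f"{STATE_NAMES[state]} - {metric} is {value}|{metric}={value};{warn};{crit}"
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}"


command = passive_result_command("web01", "load", "load1", 5.2, 4, 8, 1400000000)
```

Because the agent pushes its values, the monitoring server does not need to open a connection to each host. The same stream of incoming metrics could also be used as input for creating hosts and services.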
21. … more than software
Active agents design
[Diagram boxes: Naemon, query-handler, Receiver/Evaluator, livestatus, collectd (×4)]
22. … more than software
Report data export
●User story:
●It should be easy to get performance data from Naemon into SystemX in order to take advantage of SystemX's graphing power
●Streams host- and service statechanges, downtime and start/stop events from Naemon
●Builds on top of existing interfaces
●Provides excellent performance
●Allows using extreme-performance data warehouse software to store (possibly huge amounts of) report data.
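As a rough illustration of the export path, the sketch below takes a batch of state-change events and stores them in a table. SQLite stands in for the data warehouse here, and the event fields (`timestamp`, `host`, `service`, `state`) and the table layout are assumptions made for the example rather than an actual export format.

```python
import sqlite3


def store_state_changes(conn: sqlite3.Connection, events) -> int:
    """Insert a batch of state-change events and return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state_change "
        "(timestamp INTEGER, host TEXT, service TEXT, state INTEGER)"
    )
    # Host events carry no service name; store an empty string for them.
    rows = [
        (e["timestamp"], e["host"], e.get("service", ""), e["state"])
        for e in events
    ]
    with conn:  # commit the whole batch in one transaction
        conn.executemany("INSERT INTO state_change VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
stored = store_state_changes(conn, [
    {"timestamp": 1400000000, "host": "web01", "state": 1},
    {"timestamp": 1400000060, "host": "web01", "service": "HTTP", "state": 2},
])
```

Batching the inserts into one transaction keeps the write overhead low, which matters when the stream of events is large.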
23. … more than software
Report data export design
[Diagram boxes: Naemon, NERD, Data warehouse]
24. … more than software
Questions?
●ageric79@gmail.com
●http://www.naemon.org/
●http://www.youtube.com/watch?v=8yVFkMXy8rw
●Or just talk to me. I'm not dangerous until I have that lightsaber ;-)