Nagios Conference 2013 - David Stern - The Nagios Light Bar


David Stern's presentation on The Nagios Light Bar.
The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit:

Published in: Technology

  • The light-bar is a device that visually replicates Nagios states. Each light-bar represents a separate network. There was an earlier project called Nampel (Nagios Ampel; Ampel is the German word for traffic light). It involved a great deal of hardware engineering: soldering things onto the motherboard of a dedicated machine.
  • Sitting in our offices, we have no way of knowing what is happening on a remote network. It’s also eye candy.
  • Avtech sells a number of factory- or industrial-grade sensors, including thermometers and water, light and power sensors. The light-bar and all sensors require a base unit. The unit pictured has ports for network and power on one side and ports for the light-bar on the other side. The base unit also happens to have a built-in thermometer. The ports on the far left of the unit are for additional sensors. Avtech sells thermistors with rather long cables (50 feet) that plug into these ports on the base unit. If you are going to monitor temperature, it would be a good idea to also install nagiosgraph to see temperature trends in your data center. Data from the thermistors can be retrieved via SNMP. The light-bar is meant to be controlled by SNMP. We had some interesting talks with our security people, convincing them it’s OK to make a hole in the wall of a closed area for the light-bar cable.
  • Here’s an unencumbered view of the front of the base unit showing the power, network and sensor ports.
  • Here’s a view of the back of the base unit showing where and how the light-bar plugs in. The smaller ribbon-cable segment controls sound; the larger one controls the lights.
  • The equipment includes Discovery software for both Linux and Windows. You can specify a range or an individual IP address. The default address the device comes up on is a non-routable class C address. The host you run the software on must be on the same subnet as the base unit. As that’s not likely to be the case when you first configure it, you need to attach the base unit to a laptop via a crossover cable. And since most server-class computers have multiple NICs, you can make this your operational configuration, i.e. permanently connect the light-bar to one of several NICs on a server via crossover cable. Highlight the device to configure in this GUI and click the Web button to go to the base unit’s home webpage.
  • By clicking on the light segments, you can turn each light on and off. Note the sound icons below the light-bar: one is for a slow stream of beeps, the other for a quicker stream of beeps. Note that sound is currently turned off. Also note the built-in thermometer and the greyed-out areas for two more sensors. From the main STATUS page, we can click on the Settings tab.
  • Probably the most important settings are the network information. You probably only need to set the IP address, gateway and netmask.
  • In addition to the network settings, you must choose the appropriate signal tower device (red, yellow and green lights with sound, or just red and green lights).
  • Depending on your environment or security mindset, you may wish to add authentication information.
  • Also optional: you can set the time zone and temperature units. When you finish with your settings, click “Save Settings”. The light-bar then automatically reboots. During reboot, it tests all light segments and sound for about 20 seconds. There is a volume control on the light-bar, but it cannot turn the sound off completely; it can only muffle it.
  • After a power outage (planned or otherwise), the light-bar is likely to fail. The network port MUST be active before you plug in power. Once the network is up, just cycling power will get the light-bar and base unit back on the network. As mentioned, on power-up the light-bar turns on all lights and beeps, ideally for no more than 30 seconds. But if it can’t reach the network, this condition may continue. We used a static IP. At first, I thought this was a mistake; perhaps it would auto-recover from an outage if we used DHCP. But this is really a timing issue: after an outage, the light-bar is almost certainly going to power up before the network is available anyway. Sometimes the light-bar may disappear from the network, so it’s a good idea to have a Nagios ping test to ensure it’s accessible. Obviously, if the test fails, Nagios can’t notify you via the light-bar; it’s assumed that at some point you’ll get onto the closed network and look at the webpage. If the light-bar loses connectivity to the network, it retains the last state it knew about, e.g. just the GREEN light lit.
  • I already had another project in mind, associated with this one, where WGET would be useful, so I decided to control the light-bar using WGET, even though the light-bar is SUPPOSED to be controlled by SNMP. Even as a child, I could never color within the lines.
  • Although not in the documentation, this was obtained from Avtech.
  • The first URL is the action from clicking on the Nagios SERVICES button.
  • Setting a flag-file indicates whether this is the first alert. The alarm rings ONLY the first time a status goes red. Our shop doesn’t use Nagios warnings: a condition is either OK or requires attention. A disk that’s 80% full will likely fill completely soon.
  • This script is run via cron on the Nagios server every 5 minutes.
  • Initially, we had a Nagios install only for Core services. Lab managers resisted having a Nagios server in their labs; they didn’t care if a user turned off a host. Newer security requirements, however, mandated that we account for ALL time gaps in audit records. So how about remote Nagios servers in each lab that somehow communicate back to the Core server? The challenge was to make the Core server aware of the other labs without throwing a red light. It would also be nice if we could zoom into the remote labs from the Core server to see the specific problem. This is a WORK IN PROGRESS, so I don’t have any source code to show. We’ve been using this for several months now.
  • Note the lower left frame. I basically copied and tailored a stanza from above, and also added an HTML tag to refresh every 5 minutes. If there is a green dot next to the lab, everything is fine. If there is a red dot, there’s an ALERT. If you click on the lab in question, the right frame will zoom into that lab, showing the specific problem. Make sure to back up your files (side.html, side.php) before a Nagios upgrade or you’ll lose your work. This is still run by the light-bar cron job.
  • Once you finish these steps, all that remains is deciding how to present the data. Should you zoom into the main Nagios page of each sub-site, or zoom into the service page?
  • Here’s what it would look like if you zoomed into the sub-site’s Main page. Clicking on anything in the rightmost frame would zoom into that item on the sub-site. Clicking on anything in the leftmost frame would zoom into the Core.
  • Others preferred the clickable sub-site link going directly to the Service page of the sub-site. Clicking on services (or anything above the “Labs” stanza) in the leftmost frame will return you to the Core Nagios server main page. But this too was confusing. The winning combination was changing Target=“MAIN” in the side Frame to Target=“_Blank”. This opens a new tab or window into the sub-site. Operationally, you see a red dot on a sub-site on the Nagios Core page and click the link. This opens a tab to the sub-site. You look around to identify the problem, then click the “X” to the right of the tab to close that tab. And you’re back at the Nagios Core page.
  • Sysadmins love Nagios; it’s very extensible. It makes us look like wizards. A parting thought: sysadmins tend to be a cocky bunch. After all, we’re doing something new every single day. But we only got to be this good because of those who came before us.
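
The flag-file idea from the notes above (beep only the first time the status goes red) can be sketched roughly as follows. This is an illustration in Python, not the presenter's script, which is not shown in the transcript. The function name `plan_actions`, the flag-file path argument, and the `(target, on)` tuples are all hypothetical, and the yellow "unacknowledged alerts" light is left out because the deck does not say how acknowledgement is detected.

```python
import os


def plan_actions(has_problems, flag_path):
    """Decide what the light-bar should show on this cron run.

    Returns a list of (target, on) pairs. The names and the flag-file
    convention are assumptions made for this sketch.
    """
    if has_problems:
        # The flag file is absent only on the first run that sees a problem.
        first_alert = not os.path.exists(flag_path)
        if first_alert:
            # Create the flag so later runs stay red but silent.
            open(flag_path, "w").close()
        return [("green", False), ("red", True), ("beep", first_alert)]

    # All clear: remove the flag so the next problem beeps again.
    if os.path.exists(flag_path):
        os.remove(flag_path)
    return [("red", False), ("beep", False), ("green", True)]
```

Run every 5 minutes, this rings once when the status first turns red, stays red and silent while the problem persists, and re-arms after a return to green. It does not beep for an additional alert that arrives while the bar is already red.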

    1. 1. The Nagios light-bar David Stern
    2. 2. What is it?
    3. 3. Why do we need it? DoD work requires air-gapped networks. How would you know if a service is down if you’re not on the network?
    4. 4. The Hardware
    5. 5. An unobstructed View
    6. 6. Backside of base unit: connection to light-bar
    7. 7. Equipment network discovery program. Initial connection requires crossover cable.
    8. 8. Base unit generates a webpage
    9. 9. The Settings tab allows you to configure the device
    10. 10. You need to identify the type of device
    11. 11. Security settings
    12. 12. Optional settings
    13. 13. The network MUST be active before you power on the unit. It might be a smart idea to set the light-bar to use a DHCP address. It would be a smart idea to monitor that the light-bar is available.
    14. 14. If I can control the light-bar from the web, why not use WGET
    15. 15. The Secret Sauce; undocumented
        http://light-bar/cmd.cgi?action=ST&t=A2&a=1 makes the light-bar go beep
        http://light-bar/cmd.cgi?action=ST&t=A2&a=0 turns off the noise
        http://light-bar/cmd.cgi?action=ST&t=GR&a=1 makes the green light go on
        http://light-bar/cmd.cgi?action=ST&t=GR&a=0 turns off the green light
        Substitute OR for GR to affect the orange (yellow) light.
        Substitute RE for GR to affect the red light.
    16. 16. Getting Nagios status. We can get the Nagios status from the Service page:
        http://nagios-server/nagios/cgi-bin/status?host=all
        Just search it for “serviceTotalsPROBLEMS”.
        N.B. You may need to insert authentication information in the URL:
        http://nagiosadmin:nag-password@nagios-server/nagios/cgi-bin/status?host=all
        You can use the same format for the light-bar authentication.
    17. 17. Mission Creep. How about if we get the light-bar to beep the first time a new alert occurs? And since we don’t use Nagios WARNING conditions, let’s use the yellow light to indicate unacknowledged alerts. Let’s put it all together…
    18. 18. This is NOT a plugin
    19. 19. Other cool things you can do with WGET. Hierarchical Nagios: a core Nagios aware of other installs.
    20. 20. What’s different on this nagios page?
    21. 21. Configuring “hierarchical” Nagios
        • Back up Nagios before and after these changes
        • Install a Nagios server in each lab/sub-site
        • Edit side.{html,php}, add a stanza for each sub-site and set refresh=300
        • Tag each Nagios sub-site: Main.{html,php} and status.c status.cgi
        • Modify the light-bar cron job to check each sub-site and swap red and green dots as needed
    22. 22. One way to present the pages
    23. 23. Another way to present the data
    24. 24. The Results: faster response time/higher uptime; better awareness of our networks.
    25. 25. Questions?
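
The URL patterns on the "Secret Sauce" and "Getting Nagios status" slides can be captured in a few lines. The sketch below uses Python rather than the WGET calls the talk describes, and it only builds the light-bar URLs and inspects an already fetched status page; the fetching itself is left to WGET or any HTTP client. The function names and the `TARGETS` mapping are hypothetical. The target codes (A2, GR, OR, RE) and the search string are taken from the slides.

```python
from urllib.parse import urlencode

# Target codes as listed on the slide: A2 = beeper, GR = green,
# OR = orange/yellow, RE = red. The friendly names are assumptions.
TARGETS = {"beep": "A2", "green": "GR", "yellow": "OR", "red": "RE"}


def light_bar_url(host, target, on):
    """Build the cmd.cgi URL that switches one light or the beeper."""
    query = urlencode({"action": "ST", "t": TARGETS[target], "a": 1 if on else 0})
    return f"http://{host}/cmd.cgi?{query}"


def has_service_problems(status_html):
    """Apply the slide's rule: search the page for the marker string."""
    return "serviceTotalsPROBLEMS" in status_html
```

For example, `light_bar_url("light-bar", "red", True)` yields the red-light counterpart of the green-light URL on the slide, which could then be passed to WGET.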