Leveraging and Understanding
Performance Data and Graphs
Troy Lea
troy@box293.com
Twitter: @Box293
http://exchange.nagios....
2
About Me
IT Consultant
Nagios Developer
Love tinkering with Nagios
Why Nagios XI?
It’s a virtual appliance - ready to go
3
About This Presentation
Understanding how performance data is stored
in the back end and how Nagios accesses it
Goal is ...
4
Basic Concepts - Part 1
5
Basic Concepts - Part 2
./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95
C: - total: 39.99 Gb - use...
6
Basic Concepts - Part 3
Service check command is executed by the monitoring engine
Monitoring engine receives the result...
7
Plugins
The power of Nagios is in the plugins!
Monitor what you want, how you want!
Resources available that clearly def...
8
Plugin Output Explained - Part 1
Plugins produce data divided into two parts
The pipe symbol “|” is used as a delimiter
...
9
Plugin Output Explained - Part 2
The exit code Nagios receives from the plugin
determines the state of the service
0 = O...
10
Plugin Output Explained - Part 3
No performance data = no pretty graphs
You can create a plugin using whatever
language...
11
Plugin Output Explained - Part 4
Examples:
Shell script
Something you might want to check on the Nagios
host itself
per...
12
Performance Data Specifics - Part 1
Asterisk (*) fields are required fields, everything
else is optional
In this instanc...
13
Performance Data Specifics - Part 2
Multiple DS
Each DS is separated by a space
rta=2.687ms;3000.000;5000.000;0; pl=0%;...
14
Basic Plugin - Part 1
Example shell script demonstrating how a plugin
outputs performance data
NUMBER1=$[ ( $RANDOM % 1...
15
Basic Plugin - Part 2
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number ...
16
Basic Plugin - Part 3
Performance data
displayed as a
pretty graph
Demonstration of
how you can
generate
performance da...
17
Basic Plugin - Part 4
Now let's add warning and critical thresholds to
the performance data string
Number1
WARNING @ 50
...
18
Basic Plugin - Part 5
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Num...
19
Basic Plugin - Part 6
This demonstrates
how the
performance data
does not have any
effect on the state
of the service
W...
20
.rrd and .xml files
Used for recording the results from Nagios checks
Useful for observing daily trends of your environ...
21
Location of .rrd and .xml files
When a service check returns performance data,
Nagios dumps this into:
/usr/local/nagio...
22
Extract .rrd data
You can extract data from an .rrd file
Example (from the CLI):
rrdtool fetch
/usr/local/nagios/share/...
23
.rrd and .xml Gotchya - Part 1
The .xml file can contain sensitive data
<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!...
24
.rrd and .xml Gotchya - Part 2
Perhaps use a central credential file
<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!
che...
25
.rrd and .xml Gotchya - Part 3
RRD Data is averaged out over time
Looking at performance graphs for past day / week /
m...
26
Graphs - How Templates Are Used - Part 1
http://docs.pnp4nagios.org/pnp-0.4/tpl
27
Graphs - How Templates Are Used - Part 2
PNP4Nagios queries the XML file for the
<TEMPLATE> tag
Each datasource has it’...
28
Graphs - How Templates Are Used - Part 3
From the example graphs:
<TEMPLATE>check-host-alive</TEMPLATE>
<TEMPLATE>check...
29
Graphs - How Templates Are Used - Part 4
check-host-alive
/usr/local/nagios/share/pnp/templates.dist/check-host-
alive....
30
Graphs - Creating Your Own Template - Part 1
The check_command name is what Nagios uses
to insert into the <TEMPLATE> t...
31
Graphs - Creating Your Own Template - Part 2
The service definition using the new command
32
Graphs - Creating Your Own Template - Part 3
The graph currently being generated
Default Template being used
Check Comm...
33
Graphs - Creating Your Own Template - Part 4
Copy the file:
/usr/local/nagios/share/pnp/templates.dist/default.php
To t...
34
Graphs - Creating Your Own Template - Part 5
In the graph we are removing the bottom two lines
Default Template
Check C...
35
Graphs - Creating Your Own Template - Part 6
How easy was that!
Updated graph
Template Name and Check Command removed
36
PNP Templates In Detail - Part 1
Let's get into specifics
Template we just
modified
It’s not that
complicated! (LOL)
37
PNP Templates In Detail - Part 2
.rrd files can have multiple datasources (DS)
Round Trip Time and Packet Loss for exam...
38
PNP Templates In Detail - Part 3
Example of .rrd file with five DS
Two graphs generated using these DS
39
PNP Templates In Detail - Part 4
Default Template creates one graph per DS
This is a simple PHP foreach loop
The code w...
40
PNP Templates In Detail - Part 5
This section of the template uses three DS
One graph will be generated using three DS
...
41
PNP Templates In Detail - Part 6
Number formatting
Our modified template and the corresponding code
The relevant information...
42
PNP Templates In Detail - Part 7
The three DS template and the corresponding code
The relevant information:
%4.0lf
43
PNP Templates In Detail - Part 8
Numbers are displayed with four decimal places
%3.4lf
Numbers are displayed as whole n...
44
PNP Templates In Detail - Part 9
PNP documentation defines the number
formatting using the printf standard defined here...
45
PNP Templates In Detail - Part 10
width
When the number is generated on the graph, it will
allocate a minimum specific ...
46
PNP Templates In Detail - Part 11
%3.4lf
width = 3
precision = .4
hence the displayed number is 25.3800
%4.0lf
width = ...
47
MRTG - Part 1
MRTG = Multi Router Traffic Grapher
Nagios Addon that is useful for monitoring
network switch and router ...
48
MRTG - Part 2
Nagios XI Wizard called “Network Switch /
Router” automates the configuration of MRTG
MRTG configuration ...
49
MRTG - Part 3
When MRTG runs, it gathers data from the
devices defined in the mrtg.cfg file
It dumps this data into the...
50
MRTG Gotchya - Part 1
When the Wizard populates the mrtg.cfg file it will
add ALL ports on the switch to the config fil...
51
MRTG Gotchya - Part 2
On a 48 port switch this might not concern you
But in a stack of two 48 port switches this
become...
52
MRTG Gotchya - Part 3
Suggestion
Clean up the mrtg.cfg file
Remove the ports you do not wish to gather data on
Can this...
53
MRTG Gotchya - Part 4
Problem 2 - Adding a switch (or module) to an
existing switch
Monitoring additional ports later u...
54
MRTG Gotchya - Part 5
Solutions to Problems 1 & 2
cfgmaker
This is how the Wizard configures mrtg.cfg
The wizard update...
55
MRTG Gotchya - Part 6
Problem 3 - With a frequently changing
environment, keep mrtg.cfg clean
Monitoring WAN links for ...
56
MRTG Gotchya - Part 7
Problem 4 - Firmware Upgrade causes port
numbering to change
Major firmware revision applied to s...
57
Questions
Questions ?
58
Discount Offer
But wait, there's more ...
When visiting the Nagios XI use my affiliate link
http://www.nagios.com/#ref=...
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance Data and Graphs

Troy Lea's presentation on Leveraging and Understanding Performance Data and Graphs.
The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Published in: Technology
  • Good afternoon all and thank you for coming to my session. My name is Troy Lea and I'm here to talk to you about leveraging and understanding performance data and graphs in Nagios.
  • First a little about me. I'm primarily a Windows tech, starting back in DOS 6 and Windows 3.1. I've worked in a variety of support roles over the years, and my last role involved the development and maintenance of a cloud computing platform based on Windows Remote Desktop, where I primarily looked after the backend infrastructure. I've been using Nagios XI since 2009. I originally tried Nagios before XI was released; however, being a Windows guy, there were some Linux barriers that I just could not get my head around. I love Nagios XI because it is delivered as a virtual appliance. Within minutes of importing that VM and powering it on you have a fully functional monitoring product. Before I caught the Nagios bug, my programming experience was all Windows related: batch files, VB scripts and PowerShell. I had dabbled in a little HTML but only because I had to. Since then I've learnt HTML, PHP, CSS, JavaScript, Perl, Bash ... whatever is required to get the result I needed.
  • In the world of monitoring there is more to Nagios than sending alerts because a server is about to run out of hard disk space. Collecting and storing performance data is one of the most useful features in Nagios; with this information you can get an understanding of your environment's day to day trends. Analysing this data can be very helpful, perhaps to look at growth or to identify performance bottlenecks. This session is about understanding how the performance data is stored in the back end and how Nagios accesses it. Topics covered in this session are: • Basic concepts • Understanding the .rrd and .xml files • Understanding how PNP generates graphs • Creating custom graph templates in PNP • Writing plugins that will output the performance data you want • Understanding how MRTG works. Everything I will talk about is documented on the Internet; however, that information does not always appear on the first page of your Google search results. It's especially difficult when you are learning a new language or concept: the information out there is not always helpful, or it can get overwhelming. Even though this is an advanced technical session, it's aimed at delivering the core concepts and information to help you get the results you need (and impress the boss). As I've mentioned before, I'm primarily a Windows tech. Some of the material I talk about might be obvious to a Linux tech but frustrating to a Windows tech, so my goal here is to make the content accessible to anyone. This presentation is centered around Nagios XI. There are references to locations of files and components; your implementation of Nagios may differ slightly, but the concepts are still the same.
  • I'll start off quickly explaining the basic concepts. Let's look at a common service that is used in monitoring, a free disk space check. Here is the service configuration and the current service status.
  • Here is this command and the output we see when we execute it from the CLI. The data after the pipe symbol is the performance data, I will explain this in more detail later on. Here is the Advanced Status Detail of the service showing the performance data string.
  • Here is the performance graph for this service, the end result. The chain of events that occur are ...
  • When I first began using Nagios, it became apparent that the power behind Nagios came with plugins. The ability to monitor what you want, how you want, using a variety of different methods really appealed to me. I think everyone who starts developing plugins for Nagios has a very similar journey. We modify an existing plugin to make it suit our environment. We then create a simple plugin, using an existing one, to do something completely different. Before we know it we are writing very complex plugins. There are two exceptional resources available that clearly define the guidelines around creating plugins. Nagios Plug-in Developer Guidelines: http://nagiosplug.sourceforge.net/developer-guidelines.html The information here is very clear and easy to understand, and I am constantly referring to it. PNP Documentation: http://docs.pnp4nagios.org/pnp-0.4/doc_complete This has some more detailed information and examples in relation to the performance data and how it needs to be formatted.
  • Taken directly from the PNP documentation: When the plugin produces performance data, it is divided into two parts. The pipe symbol ("|") is used as a delimiter. Example check_icmp: OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; Something I want to make really clear here is: The data to the left of the pipe symbol is processed by the monitoring engine. The data to the right of the pipe symbol is used for inserting into RRD files for performance data.
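As a rough illustration of the split described above, this shell sketch separates one line of plugin output at the first pipe symbol using plain parameter expansion. The function names plugin_text and plugin_perfdata are made up for this example and are not part of Nagios or PNP, and only the single-line output format shown in the example is handled.

```shell
# Hypothetical helpers (not part of Nagios or PNP): split one line of plugin
# output at the first "|".

# Print the text part (left of the pipe), with trailing spaces trimmed.
plugin_text() {
    text=${1%%|*}
    text=${text%"${text##*[! ]}"}
    printf '%s\n' "$text"
}

# Print the performance data (right of the pipe), with leading spaces trimmed.
# Prints an empty line when the output contains no pipe symbol at all.
plugin_perfdata() {
    case $1 in
        *'|'*) perf=${1#*|} ;;
        *)     perf='' ;;
    esac
    perf=${perf#"${perf%%[! ]*}"}
    printf '%s\n' "$perf"
}

output='OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;'
plugin_text "$output"
plugin_perfdata "$output"
```

The first call prints what the monitoring engine would display as status text; the second prints the string that ends up in the performance data files.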
  • The only information not shown here is the exit code Nagios receives from the plugin that determines the state of the check 0 = OK 1 = WARNING 2 = CRITICAL 3 = UNKNOWN
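Because the exit code is not printed with the plugin output, it has to be read from the shell after the plugin finishes. The sketch below maps the four codes listed above to their state names; state_name is a hypothetical helper, and `sh -c 'exit 2'` is only a stand-in for running a real plugin. Codes outside 0 to 3 are not covered by the list above, so the sketch simply labels them INVALID.

```shell
# Hypothetical helper: translate a plugin exit code into a state name.
state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo INVALID ;;   # not one of the four documented codes
    esac
}

# Stand-in for a real plugin run: a command that exits with code 2.
code=0
sh -c 'exit 2' || code=$?
echo "exit code $code = $(state_name "$code")"
```

When testing a real plugin from the CLI, running `echo $?` straight after the plugin shows the same number.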
  • If your plugin does not output performance data, then graphs will not be available for that service. So it's as basic as that. You can create your plugin using whatever language you need to, as it fits your purpose and needs. All that matters is the end result which is returned back to Nagios when the plugin has finished running.
  • Shell script: something you might want to check on the Nagios host itself. Perl script: remotely checking a device using SNMP, or using third party APIs like the VMware vSphere SDK to remotely access virtual environments. Visual Basic script: using NSClient on a Windows host to perform a check (like RDP usage).
  • Here is a breakdown of the performance data. The asterisk (*) fields are required fields, everything else is optional. In this instance, rta is the FIRST datasource, or datasource 1.
  • A plugin can output multiple datasources. Each datasource is separated by a space and the format is the same. Example: rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; The label can have spaces if you desire; however, the label MUST be enclosed by single quotes. Example: 'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
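To make the layout of a single datasource easier to see, here is a small shell sketch that breaks one datasource into its parts. It assumes the field order label=value;warn;crit;min;max, as suggested by the examples above; parse_ds is a hypothetical helper, and splitting a whole performance data string into datasources (which is harder when quoted labels contain spaces) is deliberately left out.

```shell
# Hypothetical helper: break ONE datasource into its fields.
# Assumed layout, taken from the examples above: label=value;warn;crit;min;max
parse_ds() {
    ds=$1
    label=${ds%%=*}        # everything before the first "="
    label=${label#\'}      # drop the optional surrounding single quotes
    label=${label%\'}
    fields=${ds#*=}        # value;warn;crit;min;max
    old_ifs=$IFS
    IFS=';'
    set -f                 # no filename expansion while splitting
    set -- $fields         # split on ";" into $1..$5
    set +f
    IFS=$old_ifs
    echo "label=$label value=${1-} warn=${2-} crit=${3-} min=${4-} max=${5-}"
}

parse_ds 'rta=2.687ms;3000.000;5000.000;0;'
parse_ds "'Packet Loss'=0%;80;100;;"
```

Fields that the plugin left empty simply come out empty, which shows that the semicolons act as position markers.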
  • Here is a basic plugin I have created to demonstrate outputting performance data using a shell script. This is just a simple script that generates two random numbers and outputs them. For demonstration purposes this script will always return an OK state.
      NUMBER1=$(( (RANDOM % 100) + 1 ))
      NUMBER2=$(( (RANDOM % 1000) + 1 ))
      echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
      exit 0
  • Here is the output each time it is run:
  • Here are the graphs displayed after the check has been running for a while.
  • Now I am going to define a warning and critical threshold in the performance data string; this will show you how they appear in the graphs. Number1: WARNING @ 50, CRITICAL @ 75. Number2: WARNING @ 500, CRITICAL @ 750.
      echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
  • Here is the output each time it is run:
  • This demonstrates how the performance data does not have any effect on the state of the services.   Also, if you were to look into the XML file generated for this service, this is where the warning and critical thresholds are stored.  
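Since the thresholds in the performance data string are only informational, a plugin that should actually change state has to compare the value itself and return the matching exit code. The following is a minimal sketch of that idea, assuming the example thresholds of 50 and 75 and a "greater than or equal" comparison (the comparison rule is an assumption, not something the demo plugin defines). check_number is a hypothetical function; a real plugin would end with `exit` instead of `return`.

```shell
# Hypothetical sketch: compare a value with its thresholds, print plugin-style
# output including performance data, and return the matching exit code.
check_number() {
    value=$1
    warn=$2
    crit=$3
    if [ "$value" -ge "$crit" ]; then
        state=CRITICAL
        code=2
    elif [ "$value" -ge "$warn" ]; then
        state=WARNING
        code=1
    else
        state=OK
        code=0
    fi
    echo "$state - Number 1: $value | 'Number 1'=$value;$warn;$crit;;"
    return "$code"
}

# Demo calls; "|| echo" reports the exit code when it is not 0.
check_number 4 50 75  || echo "exit code: $?"
check_number 60 50 75 || echo "exit code: $?"
check_number 80 50 75 || echo "exit code: $?"
```

The performance data string is identical in style to the demo plugin; the only difference is that the status text and exit code now follow the thresholds.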
  • What are Performance Data Files? Performance data files are used for recording the results from Nagios checks, which in turn become useful for observing the daily trends of your environment. Being able to look at hourly/daily/weekly/monthly/yearly historical data can be invaluable when trying to resolve performance issues. It helps get to the bottom of those customer complaints like "the server is slow". There are two files created by Nagios for every check that generates performance data. The RRD file is a Round Robin Database. That means that after some time the oldest data will be dropped at the "end" and it will be replaced by new values "at the beginning". This is the file that contains all the historical data. The XML file contains detailed information about the check that generated the performance data, such as the warning and critical thresholds and the names of the checks. This file is updated at the same time as the RRD file, so it always holds the information from when the check was last run. How are these files used? When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. PNP4Nagios uses the XML and RRD files to generate these graphs. PNP4Nagios allows you to create your own customised graphs based on the information in the XML file and then displays the historical data in the RRD file. A service check needs to run a couple of times to collect performance data before you will see performance graphs. How long it takes to see the data in the performance graphs depends on the frequency of your service checks.
  • Initially, when a service check returns performance data, Nagios dumps this into: /usr/local/nagios/var/spool/perfdata Another background process will then detect this spooled perfdata and create/update the relevant .rrd and .xml files. The Performance Data files live in: /usr/local/nagios/share/perfdata/<host> There is a folder for each host. The host object files are called _HOST_ (the check_icmp command that determines if a host is up or down). All the other files are relevant to the service objects defined for each host.
  • If you want to extract the data from an .rrd file you can do it with the following command: rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h If you don’t specify start and end times the data retrieved will be from the past 1 day.
  • The .xml file can contain sensitive data. When the .xml file is created/updated, a lot of information is stored in this file that is relevant to the check command that was run, which could have a password stored in plain text. For example here is a service check that has a password stored in the definition. And here is the line in the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
  • There are many methods to work around this behaviour if you are not comfortable with it. For example this service check uses a file that contains the credentials. And you can see that the credentials are not inside the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
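One way a plugin could use such a credential file is sketched below. This is only an illustration: the file format (one KEY=value pair per line) and the read_credential function are assumptions made for this example, and the format of the file used by the check shown above is not known here.

```shell
# Hypothetical helper: look up one key in a credential file.
# Assumed file format (made up for this sketch): one KEY=value pair per line.
read_credential() {
    file=$1
    key=$2
    while IFS='=' read -r name value; do
        if [ "$name" = "$key" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done < "$file"
    return 1    # key not found
}
```

A plugin would then be passed only the path or name of the file, for example `password=$(read_credential /path/to/credential_file password)`, so the password itself never appears in the check command that gets written to the .xml file. The credential file should be readable only by the user that runs the checks.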
  • RRD data is averaged out over time. When you look at performance graphs for the past day / week / month / year, they will show results with less spiky data. This generally only occurs with data that has lots of peaks and troughs: the troughs cause the overall average to be lower, so the peaks will appear lower. Something like active user sessions will have a peak through business hours and then a drop to almost nothing out of hours. Constant data like disk space used will generally not average out that much. It all depends on your environment! When reviewing RRD data you need to take these factors into consideration, as it's all relative.
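The effect of averaging on spiky data can be shown with a little arithmetic. The sketch below is not how RRDtool consolidates data internally; it only averages a list of made-up readings to show why a peak shrinks once many samples are folded into one value. The function names and the sample numbers are invented for this example.

```shell
# Average a list of samples, similar in spirit to folding several raw values
# into one stored data point.
average() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Largest sample in the list, for comparison with the average.
peak() {
    printf '%s\n' "$@" | awk 'NR == 1 || $1 > max { max = $1 } END { print max }'
}

# Twelve made-up readings of active user sessions: busy hours, then idle hours.
samples="40 42 45 41 38 44 2 1 0 1 2 0"
echo "peak:    $(peak $samples)"
echo "average: $(average $samples)"
```

With these numbers the peak is 45 sessions but the average over the whole period is only about 21, which is the kind of flattening you see in the longer-range graphs.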
  • When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. Here are two examples. The difference between the two graphs is that the first one has a PNP template and hence it's a little prettier, compared to the second graph that is generic and tells you that it is using the Default Template.
  • So how does this work? http://docs.pnp4nagios.org/pnp-0.4/tpl When the RRD and XML files are created / updated, the check_command directive defined in the service object is added to the XML file under each <DATASOURCE> tag as the TEMPLATE tag. In relation to distributed monitoring, if PNP finds a string enclosed in brackets at the end of performance data it will be recognized as the check command and will be used as the PNP template. OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp] When PNP goes to display the graph, it queries the XML file and gets the TEMPLATE tag for each datasource.
  • For the example graphs shown on previous slides, these values are: <TEMPLATE>check-host-alive</TEMPLATE> and <TEMPLATE>check_local_load_alt</TEMPLATE> So the template names are check-host-alive and check_local_load_alt. PNP then looks in the following folders to see if it can find a PHP file that has one of these names: /usr/local/nagios/share/pnp/templates.dist /usr/local/nagios/share/pnp/templates
  • In the first example above it finds the following file: /usr/local/nagios/share/pnp/templates.dist/check-host-alive.php So it uses this PHP file to generate the performance graph   In the second example above it cannot find any file named check_local_load_alt.php so it uses the default template which is: /usr/local/nagios/share/pnp/templates.dist/default.php
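The lookup described above can be sketched in shell as follows. This is not PNP's own code (PNP is written in PHP); find_template is a hypothetical function, and checking the templates folder before templates.dist is an assumption made here so that a custom template can take the place of a shipped one.

```shell
# Hypothetical sketch of the template lookup. $1 is the PNP base folder,
# $2 is the value of the TEMPLATE tag (the check command name).
# Assumption: the custom "templates" folder is checked before "templates.dist".
find_template() {
    base=$1
    name=$2
    for dir in "$base/templates" "$base/templates.dist"; do
        if [ -f "$dir/$name.php" ]; then
            echo "$dir/$name.php"
            return 0
        fi
    done
    # No matching file found: fall back to the default template.
    echo "$base/templates.dist/default.php"
}

find_template /usr/local/nagios/share/pnp check_local_load_alt
```

On a system without a check_local_load_alt.php file, the call prints the path of default.php, which matches the second example graph.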
  • Creating your own templates isn't too hard, but it is a little complex and will require some trial and error. The best starting point is to find an existing template and modify it to your liking. As described in the previous slide, the name of the check_command is what Nagios uses to insert into the <TEMPLATE> tag in the XML file (how PNP determines which template to use). So for this example I have created a copy of an existing command called "check_xi_service_nsclient_alt". You can see the command is identical to the original command except for the name.
  • Here is the service I am using that I want to view custom graphs for, you can now see it is using the new command.
  • And here is the graph being generated by this service. You can see it is currently using the default template and it is also telling you the check command. So that's our starting point: we know the data currently exists in the RRD and XML files and we are ready to create our custom template.
  • Copy the file: /usr/local/nagios/share/pnp/templates.dist/default.php To the following location with the name: /usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php Edit the file check_xi_service_nsclient_alt.php
  • I am going to remove the bottom two lines of the graph, "Default Template" and "Check Command <command name>", which are produced by lines 62 and 63 of the file:
      $def[$i] .= 'COMMENT:"Default Template\r" ';
      $def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';
    Save the file, and then go and reload the performance graph and we will see the new template.
  • Reload the performance graph and we will see the new template. The blue arrow I've added to the graph is showing where the template name and command name used to be. How easy was that!
  • Now I'll get a little more technical. Here is the modified template we just created. There are a few sections in here that can get overwhelming, but once you understand it, it's not that complicated.
  • An RRD file can have multiple data sources. An example of this is the check-host-alive command, a ping test used for host definitions. The performance data returned from this check contains two datasources: Round Trip Time and Packet Loss. When you view the graphs for this service you actually see two graphs. Each datasource increases the size of the .rrd file.
  • Here is a check command that generates five data sources, and the PNP template uses these to generate two performance graphs. The first graph uses three datasources and the second graph uses two datasources.
  • So going back to the template we modified. The default template is designed to create one graph per data source. It does this by looking at the RRD, looping through each datasource and generating a graph for it. This is a simple PHP foreach loop, and the code within the loop references the relevant datasource by the $i variable. So that's how individual graphs can be generated for each datasource in a generic fashion.
  • In a previous slide I showed you a check command that generated five datasources and the first graph contained three of these datasources. Because I created the check command I know that it will always output five data sources in the performance data and they will always be outputted in the same numerical order. I will explain this in further detail later on when we get to the section on creating your own plugins.   Here is the first part of the template that shows you how this is achieved: On line 10 we define var1 as the 1st datasource $DS[ 1 ] On line 11 we define var2 as the 2nd datasource $DS[ 2 ] On line 12 we define var3 as the 3rd datasource $DS[ 3 ]   And then throughout the rest of the code the graphs that are generated are pulling the specific data from the RRD files for each specific datasource   $opt[1] and $def[1] is a reference for the first graph being generated. Not shown here is the code that generates the second graph which are referenced as $opt[2] and $def[2]
  • The last part I will talk about in relation to templates is the number formatting. Things here can get very complex indeed. Here is an example of the numbers displayed on the custom template we modified and the corresponding code. The relevant information I am going to refer to is %3.4lf
  • Here is an example of the numbers displayed for the five datasource .rrd file and the corresponding code. The relevant information I am going to refer to is %4.0lf
  • What I am highlighting here is: On the first graph, the numbers are displayed with four decimal places. On the second graph, the numbers are displayed as whole numbers.
  • The PNP documentation defines the number formatting using the printf standard defined here: http://en.wikipedia.org/wiki/Printf I must point out that, as the number one (1) and the lower case letter "L" look alike, the format %3.4lf contains a lower case "L". The syntax is %[parameter][flags][width][.precision][length]type
  • Specifically I am going to focus on: width When the number is generated on the graph, it will allocate a minimum specific width, this helps you align numbers in a column style precision Determines if the number displayed is a whole number, or a number with a specific number of digits following the decimal place
  • %3.4lf: width = 3, precision = .4, hence the displayed number is 25.3800. %4.0lf: width = 4, precision = .0, hence the displayed number is 14. Because the precision is 0, no decimal place is used. To be honest I haven't spent time looking into the other options available in the formatting style, as width and precision were the only options I needed to get the results I was after.
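Width and precision are easy to try out from the CLI, because the shell's own printf command follows the same conventions. The sketch below uses printf as a stand-in for the formatting done when the graph is drawn; format_number is a hypothetical helper, and the "l" length modifier from the template is left out because the shell passes numbers to printf as plain text arguments.

```shell
# Hypothetical helper: format a number with a printf-style specification.
# LC_ALL=C keeps the decimal separator a dot regardless of system locale.
format_number() {
    LC_ALL=C printf "$1\n" "$2"
}

format_number '%3.4f' 25.38   # precision .4: four digits after the decimal point
format_number '%4.0f' 14      # precision .0: whole number, padded to width 4
format_number '%8.2f' 25.38   # a wider field makes the padding easier to see
```

Note that width is a minimum: 25.3800 is already seven characters long, so a width of 3 has no visible effect, while the second and third calls are padded with leading spaces so that numbers line up in a column.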
  • MRTG stands for the Multi Router Traffic Grapher
  • In Nagios XI, MRTG uses a config file (/etc/mrtg/mrtg.cfg) that contains all the devices and their ports that it is going to gather data on. When you run the Network Switch / Router wizard, it will populate the MRTG config file with the device you just queried. MRTG is run as a cron job every 5 minutes and is defined in /etc/cron.d/mrtg The name cron comes from the Greek word for time, χρόνος [chronos]. Hence cron is a software utility on Linux which is a time-based job scheduler. In the Windows world it's the Task Scheduler.
  • When MRTG runs, it gathers the data from the devices defined in the mrtg.cfg file and dumps this data into the folder /var/lib/mrtg For every port monitored an .rrd file is created. NOTE: there is no .xml file generated. In Nagios XI, the service checks defined for the ports you want to monitor will run a command that looks for the .rrd file in the "/var/lib/mrtg" folder and then puts this information into the regular location for performance data "/usr/local/nagios/share/perfdata/<host>/<service>"
  • As I explained before, when you run the Network Switch / Router wizard, it will populate the MRTG config file with the details about the device you just queried. In the wizard you may have only selected to monitor 10 ports on the switch. Regardless of the selections you make in the wizard, mrtg.cfg will be populated with all ports on the switch. Nagios itself will only have the service definitions for the 10 ports you selected to monitor.
  • What you can do here is edit the mrtg.cfg file and remove all of the ports that you do not wish to gather data on. However this can cause another issue in the future, which I will explain here. Let's say that you now need to monitor an additional two ports on that switch. Running the Network Switch / Router wizard again takes you through all the steps, where you select these ports. However, due to how the wizard works, when it detects that this switch already exists in the mrtg.cfg file, it will not update the mrtg.cfg file. Even though you have edited the mrtg.cfg file in the past and removed these ports, the wizard does not look for this level of detail.
  • A similar behaviour occurs in relation to switch stacking. For example I have a stack of two 48 port switches (96 ports in total). In the past I ran the wizard and monitored everything I needed. Now we have added an additional 48 port switch to the stack, taking the total ports to 144. Because this is a stack of switches, it is all monitored through one IP address. So the same behaviour explained above occurs. Running the Network Switch / Router wizard again takes you through all the steps, where you select these additional ports. However, due to how the wizard works, when it detects that this switch already exists in the mrtg.cfg file, it will not update the mrtg.cfg file.
  • Use the cfgmaker tool to update the mrtg.cfg file
  • When you are monitoring an environment that changes frequently, it helps to keep the mrtg.cfg file clean. For example, in my environment we have clients that have multiple WAN links connected in a private IP cloud. We monitor the client routers on these WAN links. From time to time WAN links are decommissioned. While we remove these client routers from the Nagios XI configuration, MRTG is still trying to collect data from these client routers. If the WAN IP no longer exists, then it is going to time out while trying to contact these routers. These timeouts are going to have an effect, especially as your mrtg.cfg file contains more and more decommissioned client routers. Keeping in mind that MRTG runs every five minutes, these timeouts can cause MRTG to run longer and hence it's not really running every five minutes anymore.
  • Firmware upgrades on client routers can cause issues as well. Specifically we've noticed this behaviour on SonicWALL firewalls. What can happen is that when a major firmware revision is released, the numbering of ports inside the firmware changes. For example the WAN port we monitored was port 1 and the LAN port was port 2. After the firmware upgrade the WAN port became port 0 and the LAN port became port 1. We are only monitoring the WAN port using MRTG; however, MRTG is still trying to gather data from the SonicWALL for port 1, so now your MRTG graphs are going to reflect the data for the LAN port on the router and not the WAN port. What we saw was a massive jump in the graphs because we were collecting all the local LAN traffic passing through that port, when we were only interested in the WAN port activity.

    1. Leveraging and Understanding Performance Data and Graphs. Troy Lea, troy@box293.com, Twitter: @Box293, http://exchange.nagios.org/directory/Owner/Box293/1
    2. About Me: IT Consultant. Nagios Developer. Love tinkering with Nagios. Why Nagios XI? It’s a virtual appliance - ready to go.
    3. About This Presentation: Understanding how performance data is stored in the back end and how Nagios accesses it. Goal is to give you key pieces of information. A good reference for understanding concepts. This presentation is centered around Nagios XI. Valid for other Nagios implementations.
    4. Basic Concepts - Part 1
    5. Basic Concepts - Part 2: Command: ./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95 Output: C: - total: 39.99 Gb - used: 25.28 Gb (63%) - free 14.71 Gb (37%) | 'C: Used Space'=25.28Gb;32.00;38.00;0.00;39.99
    6. Basic Concepts - Part 3: Service check command is executed by the monitoring engine. Monitoring engine receives the result of the check. Data received has performance data. Performance data is anything after the | (pipe). The performance data is inserted into an RRD file. When viewing the performance graph, PNP4Nagios retrieves the performance data from the RRD file and generates a pretty graph. Every time the service check receives performance data, it inserts this performance data into the RRD file, which allows you to look at trends over time.
    7. Plugins: The power of Nagios is in the plugins! Monitor what you want, how you want! Resources are available that clearly define the guidelines around creating plugins. Nagios Plug-in Developer Guidelines: http://nagiosplug.sourceforge.net/developer-guidelines.html PNP Documentation: http://docs.pnp4nagios.org/pnp-0.4/doc_complete
    8. Plugin Output Explained - Part 1: Plugins produce data divided into two parts. The pipe symbol “|” is used as a delimiter. Example check_icmp: OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; Data to the left of the pipe symbol is processed by the monitoring engine. Data to the right of the pipe symbol is used for inserting into RRD and XML files.
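
As a minimal sketch of the split described on this slide, the following shell snippet separates the check_icmp sample output at the first pipe symbol. The variable names are illustrative only; the monitoring engine performs this split internally.

```shell
#!/bin/bash
# Sample output copied from the check_icmp example on the slide.
output='OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;'

# Everything before the first pipe is the human-readable status text.
status_text="${output%%|*}"
# Everything after the first pipe is the performance data.
perfdata="${output#*|}"

# Trim the single space that sits on each side of the pipe in this sample.
status_text="${status_text% }"
perfdata="${perfdata# }"

echo "Status text: $status_text"
echo "Perfdata:    $perfdata"
```
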
    9. Plugin Output Explained - Part 2: The exit code Nagios receives from the plugin determines the state of the service: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. The exit code is not “visible” when running a check from the command line or looking at the output returned from the plugin.
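
Although the exit code is not printed, the shell keeps it in `$?` immediately after a command finishes. The sketch below shows one way to capture it and map it to the four states listed above. `fake_plugin` is a made-up stand-in for a real plugin, and its output is invented for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for a real plugin: prints a status line and
# returns 1, which Nagios would treat as WARNING.
fake_plugin() {
    echo "WARNING - disk usage at 85% | 'used'=85%;80;95;;"
    return 1
}

# Capture the output and the exit code separately.
exit_code=0
plugin_output="$(fake_plugin)" || exit_code=$?

# Map the exit code to the state names from the slide.
case "$exit_code" in
    0) state="OK" ;;
    1) state="WARNING" ;;
    2) state="CRITICAL" ;;
    3) state="UNKNOWN" ;;
    *) state="UNKNOWN" ;;   # assumption: treat any other code as UNKNOWN
esac

echo "Output: $plugin_output"
echo "Exit code: $exit_code ($state)"
```

When running a real plugin by hand, `echo $?` straight after the command gives the same number.
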
    10. Plugin Output Explained - Part 3: No performance data = no pretty graphs. You can create a plugin using whatever language and tools are available. All that matters is the end result, which is returned to Nagios when the plugin has finished running.
    11. Plugin Output Explained - Part 4: Examples: Shell script: something you might want to check on the Nagios host itself. Perl script: remotely checking a device using SNMP, or using third party APIs like the VMware vSphere SDK to remotely access virtual environments. Visual Basic script: using NSClient on a Windows host to perform a check (like RDP usage).
    12. Performance Data Specifics - Part 1: Asterisk (*) fields are required fields, everything else is optional. In this instance, rta is the FIRST DS, or DS 1.
    13. Performance Data Specifics - Part 2: Multiple DS. Each DS is separated by a space: rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; The label can contain spaces, but then the label MUST be enclosed in single quotes: 'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
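
To make the field layout concrete, here is a rough sketch that breaks a single DS string into its parts, assuming the layout from the Nagios plugin guidelines: 'label'=value[UOM];[warn];[crit];[min];[max]. The function name `parse_ds` is invented for this example, and it handles one DS at a time rather than a whole performance data string.

```shell
#!/bin/bash
# parse_ds is a hypothetical helper, not part of Nagios or PNP4Nagios.
# It splits ONE datasource string into its individual fields.
parse_ds() {
    local ds="$1"
    local label rest value_uom warn crit min max value uom

    label=${ds%%=*}      # text before the first "="
    rest=${ds#*=}        # text after the first "="
    label=${label#\'}    # drop the optional opening single quote
    label=${label%\'}    # drop the optional closing single quote

    # The remaining fields are separated by semicolons.
    IFS=';' read -r value_uom warn crit min max <<EOF
$rest
EOF

    # Split the number from its unit of measure (ms, %, Gb, ...).
    value=${value_uom%%[!0-9.+-]*}
    uom=${value_uom#"$value"}

    echo "label=$label value=$value uom=$uom warn=$warn crit=$crit min=$min max=$max"
}

parse_ds "'Round Trip Average'=2.687ms;3000.000;5000.000;0;"
parse_ds "'Packet Loss'=0%;80;100;;"
```

Empty fields simply come back empty, which mirrors how optional fields are left blank between the semicolons.
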
    14. Basic Plugin - Part 1: Example shell script demonstrating how a plugin outputs performance data: NUMBER1=$(( ( RANDOM % 100 ) + 1 )); NUMBER2=$(( ( RANDOM % 1000 ) + 1 )); echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"; exit 0
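
Written out over several lines, the slide's script might look like the sketch below. Wrapping the body in a function (the name `random_plugin` is made up here) is a choice for readability and testing, not something the slide does.

```shell
#!/bin/bash
# Multi-line form of the slide's example plugin.
random_plugin() {
    local number1=$(( (RANDOM % 100) + 1 ))    # random value from 1 to 100
    local number2=$(( (RANDOM % 1000) + 1 ))   # random value from 1 to 1000

    # Status text on the left of the pipe, performance data on the right.
    echo "OK - Number 1: $number1 Number 2: $number2 | 'Number 1'=$number1;;;; 'Number 2'=$number2;;;;"
    return 0   # 0 = OK
}

random_plugin
```

A standalone plugin file would finish with `exit $?` so that Nagios receives the return code.
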
    15. Basic Plugin - Part 2: Here is the output each time it is run: OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;; OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;; OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;; OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;; OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;
    16. Basic Plugin - Part 3: Performance data displayed as a pretty graph. Demonstration of how you can generate performance data in a plugin.
    17. Basic Plugin - Part 4: Now let's add warning and critical thresholds to the performance data string. Number1: WARNING @ 50, CRITICAL @ 75. Number2: WARNING @ 500, CRITICAL @ 750. echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
    18. Basic Plugin - Part 5: Here is the output each time it is run: OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;; OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;; OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;; OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;; OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;
    19. Basic Plugin - Part 6: This demonstrates that the performance data does not have any effect on the state of the service. For example, 758 exceeds the critical threshold of 750, yet the check still reports OK because the script always exits with 0. Warning and Critical thresholds are stored inside the .xml file.
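
For contrast, a plugin that wants the thresholds to affect the service state has to compare the value itself and return the matching exit code. The sketch below is one way to do that for a single number. `check_number` is a hypothetical name, and treating a value at or above the threshold as a breach is an assumption.

```shell
#!/bin/bash
# check_number is a hypothetical helper for this sketch.
# Assumption: a value at or above a threshold triggers that state.
check_number() {
    local value="$1" warn="$2" crit="$3"
    local state="OK" code=0

    if [ "$value" -ge "$crit" ]; then
        state="CRITICAL"; code=2
    elif [ "$value" -ge "$warn" ]; then
        state="WARNING"; code=1
    fi

    # The thresholds are still written into the performance data so they
    # are recorded with the value, but only the return code sets the state.
    echo "$state - Number 1: $value | 'Number 1'=$value;$warn;$crit;;"
    return "$code"
}

code=0
check_number 52 50 75 || code=$?
echo "exit code: $code"
```
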
    20. .rrd and .xml files: Used for recording the results from Nagios checks. Useful for observing daily trends of your environment. Invaluable for helping resolve performance issues. RRD = Round Robin Database. XML = information about the Nagios check. PNP4Nagios uses the RRD and XML files to generate pretty graphs.
    21. Location of .rrd and .xml files: When a service check returns performance data, Nagios dumps this into: /usr/local/nagios/var/spool/perfdata A background process detects the spooled data and creates / updates the relevant .rrd and .xml files. The performance data files live in: /usr/local/nagios/share/perfdata/<host>
    22. Extract .rrd data: You can extract data from an .rrd file. Example (from the CLI): rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
    23. .rrd and .xml Gotchya - Part 1: The .xml file can contain sensitive data: <NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
    24. .rrd and .xml Gotchya - Part 2: Perhaps use a central credential file: <NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
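
One way to find out whether any existing .xml files already expose a password is to search them for the argument pattern shown in the check_emc_clariion example. This is only a sketch: the function name is invented, and the pattern assumes the password is passed as `!-p <value>` on the same line as the tag. Other plugins may use different flags, so the pattern would need adjusting.

```shell
#!/bin/bash
# find_exposed_passwords is a hypothetical helper name.
# Assumption: passwords appear as "!-p <value>" inside the check command.
find_exposed_passwords() {
    local perfdata_dir="${1:-/usr/local/nagios/share/perfdata}"
    # grep -l prints only the names of files that contain a match.
    find "$perfdata_dir" -type f -name '*.xml' \
        -exec grep -lE 'NAGIOS_SERVICECHECKCOMMAND>.*!-p [^!]+' {} +
}
```

Calling `find_exposed_passwords` with no argument searches the default performance data folder; a different folder can be passed as the first argument.
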
    25. .rrd and .xml Gotchya - Part 3: RRD data is averaged out over time. Looking at performance graphs for the past day / week / month / year will show results with less spiky data. This generally only occurs with data that has lots of peaks and troughs. Constant data like disk space used will generally not average out that much. It all depends on your environment! When reviewing RRD data you need to take these factors into consideration; it’s all relative!
    26. Graphs - How Templates Are Used - Part 1: http://docs.pnp4nagios.org/pnp-0.4/tpl
    27. Graphs - How Templates Are Used - Part 2: PNP4Nagios queries the XML file for the <TEMPLATE> tag. Each datasource has its own <TEMPLATE> tag: <TEMPLATE>check-host-alive</TEMPLATE> The template can also be given as a trailing string in the performance data (good for distributed monitoring): OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp]
    28. Graphs - How Templates Are Used - Part 3: From the example graphs: <TEMPLATE>check-host-alive</TEMPLATE> <TEMPLATE>check_local_load_alt</TEMPLATE> PNP4Nagios looks for a PHP file with this name in the following folders: /usr/local/nagios/share/pnp/templates.dist /usr/local/nagios/share/pnp/templates
    29. Graphs - How Templates Are Used - Part 4: check-host-alive: /usr/local/nagios/share/pnp/templates.dist/check-host-alive.php This PHP file generates the performance graph. check_local_load_alt: check_local_load_alt.php does NOT exist. Default template is used: /usr/local/nagios/share/pnp/templates.dist/default.php
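
The lookup described on these slides can be approximated in a few lines of shell. This is a sketch, not PNP4Nagios code: `resolve_template` is a made-up name, and the search order (the custom templates folder before templates.dist, then default.php) is an assumption based on the PNP documentation.

```shell
#!/bin/bash
# resolve_template is a hypothetical helper that mimics the lookup.
# Assumption: templates/ is searched before templates.dist/, and
# default.php is the fallback when no matching file exists.
resolve_template() {
    local name="$1"
    local base="${2:-/usr/local/nagios/share/pnp}"
    local candidate

    for candidate in \
        "$base/templates/$name.php" \
        "$base/templates.dist/$name.php" \
        "$base/templates/default.php" \
        "$base/templates.dist/default.php"
    do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

resolve_template check-host-alive || echo "no template found"
```
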
    30. Graphs - Creating Your Own Template - Part 1: The check_command name is what Nagios inserts into the <TEMPLATE> tag in the XML file (this is how PNP determines which template to use). So for this example I have created a copy of an existing command: check_xi_service_nsclient_alt
    31. Graphs - Creating Your Own Template - Part 2: The service definition using the new command.
    32. Graphs - Creating Your Own Template - Part 3: The graph currently being generated. Default Template being used. Check Command being used. .rrd and .xml files currently contain valid data.
    33. Graphs - Creating Your Own Template - Part 4: Copy the file: /usr/local/nagios/share/pnp/templates.dist/default.php To the following location with the name: /usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php Edit check_xi_service_nsclient_alt.php
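
The copy step could be scripted along the following lines. The helper name `copy_default_template` is invented, and refusing to overwrite an existing template is a precaution added for this sketch rather than something the slide calls for.

```shell
#!/bin/bash
# copy_default_template is a hypothetical helper name.
# It copies default.php to templates/<command name>.php.
copy_default_template() {
    local command_name="$1"
    local base="${2:-/usr/local/nagios/share/pnp}"
    local target="$base/templates/$command_name.php"

    # Precaution (not from the slide): never overwrite an existing template.
    if [ -e "$target" ]; then
        echo "Refusing to overwrite existing template: $target" >&2
        return 1
    fi

    cp "$base/templates.dist/default.php" "$target" && echo "$target"
}
```

For the example in the slides, the call would be `copy_default_template check_xi_service_nsclient_alt`, after which the new file can be edited.
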
    34. Graphs - Creating Your Own Template - Part 5: In the graph we are removing the bottom two lines ("Default Template" and "Check Command command name"), which are lines 62 and 63: $def[$i] .= 'COMMENT:"Default Template\r" '; $def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" '; Save check_xi_service_nsclient_alt.php
    35. Graphs - Creating Your Own Template - Part 6: How easy was that! Updated graph. Template Name and Check Command removed.
    36. PNP Templates In Detail - Part 1: Let's get into specifics. Template we just modified. It’s not that complicated! (LOL)
    37. PNP Templates In Detail - Part 2: .rrd files can have multiple datasources (DS), Round Trip Time and Packet Loss for example.
    38. PNP Templates In Detail - Part 3: Example of .rrd file with five DS. Two graphs generated using these DS.
    39. PNP Templates In Detail - Part 4: Default Template creates one graph per DS. This is a simple PHP foreach loop. The code within the loop references the relevant DS by the $i variable.
    40. PNP Templates In Detail - Part 5: This section of the template uses three DS. One graph will be generated using three DS. $opt[1] and $def[1] are references for the first graph being generated.
    41. PNP Templates In Detail - Part 6: Number formatting. Our modified template and the relevant code. The relevant information: %3.4lf
    42. PNP Templates In Detail - Part 7: The three DS template and the relevant code. The relevant information: %4.0lf
    43. PNP Templates In Detail - Part 8: Numbers are displayed with four decimal places: %3.4lf Numbers are displayed as whole numbers: %4.0lf
    44. PNP Templates In Detail - Part 9: PNP documentation defines the number formatting using the printf standard defined here: http://en.wikipedia.org/wiki/Printf The number one (1) and the letter "L" look alike; %3.4lf contains a lower case "L". The syntax is %[parameter][flags][width][.precision][length]type
    45. PNP Templates In Detail - Part 10: width: when the number is generated on the graph, it will be allocated a minimum width, which helps you align numbers in a column style. precision: determines whether the number displayed is a whole number, or a number with a specific number of digits following the decimal point.
    46. PNP Templates In Detail - Part 11: %3.4lf: width = 3, precision = .4, hence the displayed number is 25.3800. %4.0lf: width = 4, precision = .0, hence the displayed number is 14. Because the precision is 0, NO decimal place is used.
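
The effect of width and precision can be tried out with the shell's own printf, which follows the same conventions. This is a stand-in for the graph formatting rather than the template code itself: the "l" length modifier is left out because the shell does not need it, and 14.2 is a made-up input value.

```shell
#!/bin/bash
# Force the C locale so "." is always the decimal separator.
export LC_ALL=C

# Precision .4 gives four digits after the decimal point.
four_decimals="$(printf '%3.4f' 25.38)"
# Precision .0 gives a whole number; width 4 pads it to four characters.
whole_number="$(printf '%4.0f' 14.2)"    # 14.2 is a made-up input value

# Brackets make the padding visible.
echo "[$four_decimals]"
echo "[$whole_number]"
```

The width is only a minimum: a result that is already longer than the width is printed in full, while a shorter result is padded with leading spaces.
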
    47. MRTG - Part 1: MRTG = Multi Router Traffic Grapher. Nagios addon that is useful for monitoring network switch and router bandwidth using SNMP. The configuration can be complicated to understand.
    48. MRTG - Part 2: The Nagios XI Wizard called “Network Switch / Router” automates the configuration of MRTG. MRTG configuration file: /etc/mrtg/mrtg.cfg MRTG runs as a cron job every five minutes. cron comes from the Greek word for time, χρόνος [chronos]. Hence cron is a software utility on Linux which is a time-based job scheduler. In the Windows world it's the Task Scheduler.
    49. MRTG - Part 3: When MRTG runs, it gathers data from the devices defined in the mrtg.cfg file. It dumps this data into the folder /var/lib/mrtg. For every port monitored, an .rrd file is created (no .xml file is created at this point). Another background process will then take the data in /var/lib/mrtg and put it into the correct location: /usr/local/nagios/share/perfdata/<host>
    50. MRTG Gotchya - Part 1: When the Wizard populates the mrtg.cfg file it will add ALL ports on the switch to the config file, even if you only selected 10 ports on the switch to monitor. The Nagios XI Service Configuration will only have 10 ports defined as service definitions. Every time the MRTG cron job runs, it will collect data from all ports on the switch (as defined in the mrtg.cfg file). Extra CPU cycles, extra disk space.
    51. MRTG Gotchya - Part 2: On a 48 port switch this might not concern you. But in a stack of two 48 port switches this becomes 96 ports, plus other internal ports like link aggregation ports (another 32 ports perhaps). So these additional 128 ports have now added 8700+ configuration lines to the mrtg.cfg file. 128 ports consume about 24 MB of .rrd disk space. In my past environment, the mrtg.cfg file was 59,000 lines long!
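
A quick way to see how large a given mrtg.cfg has become is to count its Target entries (one per monitored port) alongside the total number of lines. The sketch below does just that; `count_mrtg_targets` is an invented name, and it relies only on the fact that each port definition in mrtg.cfg starts with `Target[`.

```shell
#!/bin/bash
# count_mrtg_targets is a hypothetical helper name.
# It reports how many Target[...] entries and total lines a config has.
count_mrtg_targets() {
    local cfg="${1:-/etc/mrtg/mrtg.cfg}"
    local targets lines

    if [ ! -r "$cfg" ]; then
        echo "Cannot read $cfg" >&2
        return 1
    fi

    # grep -c exits non-zero when the count is 0, so tolerate that.
    targets=$(grep -c '^Target\[' "$cfg") || true
    lines=$(wc -l < "$cfg")

    # $((lines)) strips any padding that wc adds on some systems.
    echo "targets=$targets lines=$((lines))"
}
```

Running `count_mrtg_targets` with no argument inspects /etc/mrtg/mrtg.cfg; dividing lines by targets gives a rough lines-per-port figure for that environment.
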
    52. MRTG Gotchya - Part 3: Suggestion: clean up the mrtg.cfg file. Remove the ports you do not wish to gather data on. Can this cause problems? Yes! Problem 1: Monitoring additional ports later using the wizard will not work. The wizard will NOT re-add the ports to the mrtg.cfg file, because it detects that the switch / router is already in the mrtg.cfg file.
    53. MRTG Gotchya - Part 4: Problem 2 - Adding a switch (or module) to an existing switch: Monitoring additional ports later using the wizard will not work. The wizard will NOT add newly detected ports to the mrtg.cfg file, because it detects that the switch / router is already in the mrtg.cfg file. Very similar behaviour to Problem 1. Only relevant when the new switch / module is managed through the existing IP Address / FQDN. Common with stacked switches, when adding another switch to the stack.
    54. MRTG Gotchya - Part 5: Solutions to Problems 1 & 2: cfgmaker. This is how the Wizard configures mrtg.cfg. The wizard updates the existing mrtg.cfg using a PHP function (not available from the CLI). Run cfgmaker at the CLI to generate a config file. Add the contents of the config file to the existing mrtg.cfg: cfgmaker --noreversedns "public@192.168.1.1" --output=output.txt
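
Once cfgmaker has written output.txt, the "add the contents" step might be handled as in the sketch below. The function name `merge_mrtg_config` is made up, and taking a backup first is a precaution added here. The generated file lists every port cfgmaker found, so it is worth trimming it down to the new ports before merging to avoid duplicate entries.

```shell
#!/bin/bash
# merge_mrtg_config is a hypothetical helper name.
# It backs up the existing config, then appends the generated file to it.
merge_mrtg_config() {
    local generated="$1"
    local cfg="${2:-/etc/mrtg/mrtg.cfg}"

    # Keep a copy of the current config in case the merge needs undoing.
    cp -p "$cfg" "$cfg.bak" || return 1

    # Append the reviewed cfgmaker output to the live config.
    cat "$generated" >> "$cfg"
}
```

With the file from the slide, the call would be `merge_mrtg_config output.txt`.
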
    55. MRTG Gotchya - Part 6: Problem 3 - With a frequently changing environment, keep mrtg.cfg clean. Monitoring WAN links for remote routers? WAN link no longer exists? Disable / delete the service definition(s) in Core Configuration Manager (CCM). You will also NEED to remove the device from mrtg.cfg. Why? MRTG will still try to collect data from WAN links that are no longer accessible. This causes delays and can make MRTG run past the default 5 minute schedule ... which can cause graph anomalies.
    56. MRTG Gotchya - Part 7: Problem 4 - Firmware upgrade causes port numbering to change. Major firmware revision applied to switch / router. New data collected for ports no longer follows the same pattern. Internal port numbering has changed. mrtg.cfg queries specific port numbers; it does not use port names or descriptions. Example: Old Firmware: WAN = Port 1, LAN = Port 2. New Firmware: WAN = Port 0, LAN = Port 1. Have seen this behaviour on SonicWALL firewalls.
    57. Questions?
    58. Discount Offer: But wait, there's more ... When visiting the Nagios XI use my affiliate link: http://www.nagios.com/#ref=3oHG00
