Monitoring. The title of my talk lays emphasis on two words: Automated and Adaptive.

Automated? In general, an automated task is one that is controlled by code or control systems so as to reduce manual effort. Automating a monitoring solution means exactly that: once your scripts are configured, your systems are monitored.

So why did I add the word Adaptive? Automation does the trick, right? Imagine a scenario where you have an infrastructure that is scalable, a cloud solution, say. How do you avoid manually planting the monitoring scripts on every new box that is spawned?

The adaptive part comes into effect when your solution can accommodate application-specific monitoring. A web server running Passenger would be configured with scripts that fetch the status of the Passenger service, whereas a Go agent (consider Go the default CI server from here on) would be configured with scripts that monitor the agent service and not Passenger.

So basically all custom services are monitored adaptively, and this happens automatically when the nodes are spawned.
One way to achieve automated and adaptive monitoring is to use Chef as the configuration management tool, Nagios as the monitoring tool and Graphite as the graphing/plotting tool.
For any sort of automation in infrastructure, the first and foremost requirement is a configuration management tool, Chef in this case.

Chef works on a client-server setup in which all chef-clients, known as nodes in Chef terminology, are registered with the Chef server. Chef has its own DSL, in which all the code used to configure (and, here, monitor) the nodes is written. Code written this way is packaged as a cookbook. Chef also lets you run searches to gather information about the entire infrastructure. And it allows you to scale: each new node has a client configuration file in which the Chef server URL is set to point to the Chef server configured for your environment.

There is a misconception about Chef, one I heard in one of the previous talks, that it can only be used to configure servers. That's not the case. Chef can configure any system; what the system does is decided by the cookbooks we assign to it.

Another doubt that is often raised: when I change a cookbook and upload it to the Chef server, do my nodes get a trigger to compile and execute the cookbooks? The answer is NO. You can write a script to enforce it.
Starting with the most important component of a Chef configuration management system: the chef-server. It is responsible for:
- Central distribution point for cookbooks
- Management and authentication of nodes
- Enabling search

Components:
- API service: chef-server, chef-server-api running on port 4000
- Management web UI: chef-server-webui running on port 4040
- Indexer: chef-solr-indexer running on port 8983
- Queuing server: RabbitMQ-server. Whenever data needs to be indexed, the server sends a message and the data to the queue and the indexer picks it up
- Datastore: CouchDB running on port 5984. It stores data about nodes, roles etc. as JSON objects
Coming to the chef-client. A chef-client is a node registered with a chef-server. It is the node on which all the cookbooks are compiled and executed.

How does the client know which server to talk to? This is set in client.rb, the configuration file for the client setup.

In essence, all searchable properties of a node are stored as keys in its JSON, with values that are set when the cookbook is compiled on the node, i.e. at run time.
Cookbooks are the fundamental units of distribution in Chef; basically, all the magic is configured in them. A cookbook can contain:
- Attributes: values on the node that can be given defaults and used throughout the cookbook.
- Libraries: can be written to add helpers to the code.
- Files: this directory contains static files that can be distributed to the nodes using the cookbook_file resource.
- Templates: these are rendered on chef-clients, and the values within them are set dynamically while the cookbook executes on the client node.
- Recipes: Ruby scripts that are executed on the client; they specify the resources to manage and the order of their execution.
- Metadata: contains information about the recipes in the cookbook, version constraints, supported platforms, dependencies etc.
- Definitions: allow you to create reusable collections of one or more resources.
Roles let the admin set up nodes based on functionality, and consist of attributes and a run_list. A run_list includes the cookbooks required to achieve the functionality we want the node to have. When chef-client runs, it merges its own attributes and run_list with those of any roles assigned to it.

For example: if I have a list of recipes that I require to set up a CentOS node and another set of recipes that I require to set up a web server, I can easily club the recipes together and create two roles, CentOS_node and web_server.

Roles can be created in four ways, but the simplest is to create them from a Ruby script. If you have set up environment-specific recipes, you can also define per-environment run lists in the role: env_run_lists "prod" => ["recipe[apache2]"], "staging" => ["recipe[apache2::staging]"]
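To make the merge concrete, here is a rough Python sketch of how a node's run_list could be expanded through its roles, honouring env_run_lists when the role defines one for the current environment. It is a simplified model for illustration, not Chef's actual implementation; the function name `expand_run_list` and the recipes `ntp` and `nrpe` are assumptions, only the two role names and the apache2 entries come from the example above.

```python
def expand_run_list(run_list, roles, environment="_default"):
    """Flatten a run_list, replacing role[...] entries with that role's items."""
    expanded = []
    for item in run_list:
        if item.startswith("role[") and item.endswith("]"):
            role = roles[item[5:-1]]
            # Use the environment-specific list if the role defines one,
            # otherwise fall back to the role's default run_list.
            role_items = role.get("env_run_lists", {}).get(environment, role["run_list"])
            nested = expand_run_list(role_items, roles, environment)
        else:
            nested = [item]
        for entry in nested:
            if entry not in expanded:  # keep only the first occurrence
                expanded.append(entry)
    return expanded


# Hypothetical role data; "ntp" and "nrpe" are made-up recipe names.
ROLES = {
    "CentOS_node": {"run_list": ["recipe[ntp]", "recipe[nrpe]"]},
    "web_server": {
        "run_list": ["recipe[apache2]"],
        "env_run_lists": {
            "prod": ["recipe[apache2]"],
            "staging": ["recipe[apache2::staging]"],
        },
    },
}

node_run_list = ["role[CentOS_node]", "role[web_server]", "recipe[nrpe]"]
print(expand_run_list(node_run_list, ROLES, "staging"))
# ['recipe[ntp]', 'recipe[nrpe]', 'recipe[apache2::staging]']
```

The same node in "prod" would end up with recipe[apache2] instead of the staging recipe, without touching the node's own run_list.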
Nagios is a very effective monitoring tool which provides, among other features:
- The ability to monitor your scalable infrastructure.
- A dashboard for consolidated, centralized visibility.
- Outage detection based on custom warning and critical levels.
- Outage alerts via email, SMS, pager etc.
- Problem acknowledgement for known issues or problems where work is still going on.
- Alert escalations.
Nagios reads its configuration data from text files. These files are located in the /etc/nagios directory. The primary config file is nagios.cfg. Specific configuration files are located under /etc/nagios/objects, which include:
- commands.cfg
- contacts.cfg
- hosts.cfg
- localhost.cfg
- services.cfg
- timeperiods.cfg
- templates.cfg
Nagios templates are exceptionally useful if you plan to customise each host. These are some sample templates that are widely used in Nagios configuration files: a generic-host, a generic-service and a generic-contact. For an infrastructure with a mixture of hosts, say Windows and Linux, we can create host templates accordingly and finally “use” them in the hosts and services configuration files. We will see this in the coming slides, where we discuss a small automated Nagios setup.
After setting up the Nagios server, we need to understand how it monitors the remote infrastructure. This is enabled by the NRPE plugin:
- It retrieves the status of remote services.
- It consists of the check_nrpe plugin and the NRPE daemon.
- The config file is stored at /etc/nrpe.

When Nagios needs to monitor a resource or service on a remote Linux/Unix machine:
- Nagios executes the check_nrpe plugin and tells it which service needs to be checked.
- The check_nrpe plugin contacts the NRPE daemon on the remote host over an (optionally) SSL-protected connection.
- The NRPE daemon runs the appropriate Nagios plugin to check the service or resource.
- The results of the service check are passed from the NRPE daemon back to the check_nrpe plugin, which then returns them to the Nagios process.
Suppose I write a plugin to query my nodes and retrieve the status of their memory. A custom plugin can be defined as above. It is generally planted in the plugins directory of the Nagios setup. Nagios recognises four exit values for every plugin, built-in or custom:
- Exit value 0 implies OK, and the display on the dashboard stays green.
- Exit value 1 implies Warning, and the display on the dashboard changes to yellow.
- Exit value 2 implies Critical, and the display on the dashboard changes to red.
- Exit value 3 implies Unknown, and the display on the dashboard changes to orange.
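As a rough illustration of the exit-value convention, here is a minimal Python sketch of the decision logic such a memory plugin might use. The function name `check_memory`, the default thresholds and the `mem_used` label are assumptions made for this example; how the usage percentage is actually measured is left out.

```python
# Plugin exit values understood by Nagios.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_memory(used_percent, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for a memory usage percentage."""
    if used_percent is None or not 0 <= used_percent <= 100:
        return UNKNOWN, "MEMORY UNKNOWN - could not determine memory usage"
    if used_percent >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif used_percent >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the "|" is performance data: label=value;warn;crit;min;max
    perfdata = f"mem_used={used_percent:.1f}%;{warn:g};{crit:g};0;100"
    return code, f"MEMORY {label} - {used_percent:.1f}% used | {perfdata}"


for sample in (42.0, 85.0, 95.0, None):
    code, line = check_memory(sample)
    print(code, line)
```

A real plugin would print the status line and then exit with the returned code (for example via `sys.exit(code)`), since the exit value is what Nagios acts on.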
Well, we can’t just write a plugin and leave the rest for Nagios and NRPE to figure out. We need to show them where we are putting the plugin, where to pick the plugin file from and what command to run to retrieve the data. The essential setting for a custom plugin is the command definition that will use it. In the NRPE config file we can add a list of directories in which we define the commands that use the plugins we write.
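To give a feel for what those command definitions look like, the sketch below generates NRPE `command[name]=command line` entries from a Python dictionary. The helper `render_nrpe_commands`, the plugin path and the thresholds are hypothetical. The generated lines would live in a file inside a directory that the main NRPE config names with its `include_dir` directive.

```python
def render_nrpe_commands(commands):
    """Render NRPE command definitions, one `command[name]=command line` per entry."""
    return "".join(
        f"command[{name}]={command_line}\n"
        for name, command_line in sorted(commands.items())
    )


# Hypothetical plugin path and thresholds, for illustration only.
custom_commands = {
    "check_mem": "/usr/lib64/nagios/plugins/check_mem -w 80 -c 90",
}
print(render_nrpe_commands(custom_commands), end="")
# command[check_mem]=/usr/lib64/nagios/plugins/check_mem -w 80 -c 90
```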
So we have set up NRPE to point to the right command and shown the config file where to pick up the plugin from. Now we go one level up and show the Nagios server what it needs to do to monitor our nodes. For the Nagios server to run the same plugin, we need to specify two things:
- The command that the Nagios server will run.
- The service definition that it will query.
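The two definitions can be sketched as follows. The Python helper `render_object` is hypothetical and exists only to print the objects; the host name `web01`, the service description and the NRPE command name `check_mem` are likewise assumptions. The `define command` / `define service` syntax and the `$USER1$`, `$HOSTADDRESS$` and `$ARG1$` macros are standard Nagios.

```python
def render_object(kind, directives):
    """Render one Nagios object definition from (directive, value) pairs."""
    width = max(len(key) for key, _ in directives)
    body = "".join(f"    {key.ljust(width)}  {value}\n" for key, value in directives)
    return f"define {kind} {{\n{body}}}\n"


# The command the Nagios server runs: check_nrpe against the remote host.
command = render_object("command", [
    ("command_name", "check_nrpe"),
    ("command_line", "$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$"),
])

# The service that uses it; "check_mem" must match the NRPE command name
# defined on the remote node, and "web01" is a made-up host.
service = render_object("service", [
    ("use", "generic-service"),
    ("host_name", "web01"),
    ("service_description", "Memory"),
    ("check_command", "check_nrpe!check_mem"),
])

print(command)
print(service)
```

The `!` in check_command separates the command name from its argument, so `check_mem` is what arrives in `$ARG1$`.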
So finally, when we talk of a custom Nagios server, there are quite a few settings that an admin can configure: adding hosts, and defining timeperiods, contacts, contactgroups, services and the commands that the plugins run against the local host or a remote host that the Nagios server is monitoring.
So we now understand how Nagios and NRPE work together. How do we make our infrastructure scalable, automated and adaptive? Here comes Chef.

While setting up a Nagios server, the first task is to fetch all the nodes registered with the chef-server. A simple search can be written in the recipe to achieve this, and the results are then passed as variables to the templates we covered in the earlier slides.

The approach used here is to templatize all the Nagios server config files in the /etc/nagios/objects directory. During the compilation and execution of the recipe, the nodes_from_solr collection is iterated over and an individual host definition is populated in the hosts config for each node. You can also see here a small chunk of code that sets up a host group by searching over a role.
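The idea behind the templates can be sketched outside Chef. The Python below is a stand-in for the template logic, not the cookbook code itself: the list `nodes_from_solr` is filled by hand instead of by a Chef search, and the helper names, node names and addresses are made up for illustration.

```python
def render_hosts(nodes):
    """Render one Nagios host definition per node returned by the search."""
    blocks = []
    for node in sorted(nodes, key=lambda n: n["fqdn"]):
        blocks.append(
            "define host {\n"
            "    use        generic-host\n"
            f"    host_name  {node['fqdn']}\n"
            f"    address    {node['ipaddress']}\n"
            "}\n"
        )
    return "\n".join(blocks)


def render_hostgroup(name, nodes, role):
    """Render a host group containing every node that carries the given role."""
    members = sorted(n["fqdn"] for n in nodes if role in n.get("roles", []))
    return (
        "define hostgroup {\n"
        f"    hostgroup_name  {name}\n"
        f"    members         {','.join(members)}\n"
        "}\n"
    )


# Stand-in for the search results; in the recipe these come from the chef-server.
nodes_from_solr = [
    {"fqdn": "web01", "ipaddress": "10.0.0.11", "roles": ["web_server"]},
    {"fqdn": "go-agent01", "ipaddress": "10.0.0.21", "roles": ["go_agent"]},
]

print(render_hosts(nodes_from_solr))
print(render_hostgroup("web-servers", nodes_from_solr, "web_server"))
```

When a new node registers with the chef-server, the next run on the Nagios server simply sees one more entry in the collection and writes one more host definition.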
As with the hosts config file, the services config file is populated in the same way, and it is updated on later Chef runs whenever new nodes have been added.
This is how the config files look on the Nagios server. You can see how the Ruby code has been replaced with the actual entries for the nodes in the infrastructure.
And finally we set up our Graphite for plotting:
- Real-time graphing server.
- Frontend: webapp.
- Backend: storage application (Carbon), made up of carbon-agent.py, carbon-cache.py and carbon-persister.py.

Agents connect to Carbon and send their data, and it is then Carbon’s job to make the data available for real-time graphing.

Carbon is made up of three processes. The primary process is carbon-agent.py, which starts up the other two processes in a pipeline. When agents connect to the Carbon server, carbon-agent accepts the connection and receives the formatted data. The data is then forwarded to carbon-cache, where data points are grouped by their associated metric. Carbon-cache then feeds these data points to the persister, which reads them and writes them to disk using Whisper, a fixed-size database similar in design to an RRD.
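The "formatted data" an agent sends is plain text: one line per data point containing the metric path, the value and a Unix timestamp. Here is a minimal Python sketch of an agent, assuming Carbon is listening for plaintext metrics on localhost port 2003 (the usual default, but check your own setup). The function names and the metric path are made up for this example.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one data point as a Carbon plaintext line: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=5.0):
    """Open a TCP connection to Carbon and send the already formatted lines."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


line = format_metric("servers.web01.load", 0.42, 1700000000)
print(line, end="")
# servers.web01.load 0.42 1700000000
```

Calling `send_metrics([line])` would deliver the point to a running Carbon instance; it is not invoked here because it needs a live server.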
We have seen how Chef and Nagios can be integrated to set up a monitoring solution. Now we come to the plotting part. It is made fairly simple by an open source project called Graphios, made by Shawn Sterling. Graphios is a script that puts the Nagios perfdata into the Graphite server; all you need is a running Nagios server and a running Carbon server.

Add this code segment to your nagios.cfg file; it sets up how the Nagios perfdata is sent to the Graphite server. Then we need to add the Graphite-specific commands to the commands config file in the Nagios objects directory. In the Chef cookbook, we add a Graphite prefix to the hosts config template file.
And the services config template has a postfix attached to it. So basically the node data is read by the Graphite server as graphiteprefix.nodename.graphitepostfix.

So now we have our Graphite commands in place, the Graphite configuration in place, and the prefixes and postfixes set up. The only thing left to do is pick up graphios.py and graphios.init from Shawn Sterling’s GitHub account, and drop the init file in the init.d directory and the Python file in /opt/nagios/bin. It can be anywhere, but the directory structure should be nagios/bin/graphios.py. He has an elaborate README which takes you through the configuration steps in detail.
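To show how the prefix, node name and postfix come together, here is a simplified Python sketch that turns a Nagios perfdata string into Graphite metric lines. It is an illustration of the naming scheme only and not the Graphios source; the function name, the regular expression and the sample values are assumptions.

```python
import re

# Matches "label=number" at the start of each perfdata item; units and
# the ;warn;crit;min;max thresholds that follow are ignored.
_PERF = re.compile(r"(?P<label>[^=\s]+)=(?P<value>-?\d+(?:\.\d+)?)")


def metric_lines(prefix, hostname, postfix, perfdata, timestamp):
    """Build 'prefix.hostname.postfix.label value timestamp' lines from perfdata."""
    base = ".".join(part for part in (prefix, hostname, postfix) if part)
    return [
        f"{base}.{m['label']} {m['value']} {timestamp}"
        for m in _PERF.finditer(perfdata)
    ]


for line in metric_lines(
    "monitoring", "web01", "ping",
    "rta=0.5ms;100;500;0 pl=0%;20;60;0",
    1700000000,
):
    print(line)
# monitoring.web01.ping.rta 0.5 1700000000
# monitoring.web01.ping.pl 0 1700000000
```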