LOPSA East 2013 - Building a More Effective Monitoring Environment

Building a More Effective Monitoring Environment
Mike Julian
Friday, May 3, 13

Who am I?
• Oak Ridge National Lab operations staff
• LOPSA technical staff (tech-team)
• Generalist IT engineer for ~10 years
• I am obsessed with monitoring.

This isn’t a beginner’s talk.

This isn’t a talk about how to set up Nagios or Zabbix or whatever. I’m not going to talk about why one system is better than another. This is a talk for those who have some basic monitoring in place already and want to get more out of it.

I’m going to work on the assumption that you already have some experience with monitoring systems. I’m going to try to stay as tool-agnostic as possible, but I will be talking about some specific tools in some cases.

Monitoring is not a solved problem.

A quick glance at Wikipedia’s entry for monitoring software shows 52 packages, and I know they’re missing many. That’s a lot of different systems, and kinda underscores that this is an ongoing problem and not at all solved. New tools are being written every day and there is a strong community dedicated to this problem. If you find yourself banging your head on your desk, screaming about why your monitoring system isn’t doing what you want, rest assured: you aren’t alone.

There is no one-size-fits-all solution.

Monitoring isn’t a single problem; it’s several problems combined under one label. There’s not a single tool out there that is going to magically solve everything for you.

The only way you’re going to be totally happy with your monitoring system is if you write it from scratch, tailored to your environment. That’s not feasible for most of us, but that doesn’t mean you can’t write some code to solve a problem every now and then. Don’t be afraid to create something new: you’re going to have to, to get great monitoring. Think of it as building a better wheel.

A more effective monitoring system
• Automated
• Low noise
• Dynamically notifies

Everything I’m going to talk about falls into these three categories.

Automated
• Config Management
• IPAM & CMDB
• Service discovery
• Self-healing

I love automation. Automation appeals to my inner laziness. I’m sure you all understand.

There are a lot of different ways to handle the automation, and different facets to it. Since there are so many different tools available for automation, and different monitoring systems have different levels of built-in automation, I want to talk about the general approach.

In my current environment, we have the luxury of knowing definitively all the hosts on our network through our IPAM/CMDB system. We also know who runs them, what team they belong to, and a few other useful details. A few SQL queries go a long way toward automating our configuration. Rather than rely on our configuration management, I’ve chosen to rely on the CMDB, only because that’s more central to my environment. Yours is probably different from mine, and that’s a key point: do what works for *your* environment.

Service discovery is a bit trickier. I’ve been relying on the built-in functionality in my monitoring system to handle this for me, but this could easily be set up from a config management system.

One thing I’ve been toying with that you might consider: self-healing. Most systems support the execution of a script when an event occurs. For example, I have a network device whose SNMP engine likes to fall over from time to time. When my SNMP check fails, I could have that kick off an expect script to log into the device and restart the SNMP engine. Slightly related, you could also have a script automatically create a ticket in your ticketing system, or toss the event info into a database. Lots of options there.
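
To make the "a few SQL queries" idea concrete, here is a minimal sketch of generating monitoring configuration from a CMDB. Everything specific in it is an assumption for illustration: the `hosts` table and its columns, the sample rows, the Nagios-style output, and the `generic-host` template name. An in-memory SQLite database stands in for whatever your real CMDB runs on.

```python
import sqlite3

# Hypothetical CMDB schema: one row per host, with the team that owns it.
SCHEMA = """
CREATE TABLE hosts (
    hostname TEXT PRIMARY KEY,
    address  TEXT NOT NULL,
    team     TEXT NOT NULL
)
"""

# Nagios-style host definition; "generic-host" is an assumed template name.
HOST_TEMPLATE = """define host {{
    use            generic-host
    host_name      {hostname}
    address        {address}
    contact_groups {team}
}}
"""


def render_hosts(conn):
    """Return one host definition per CMDB row, ordered by hostname."""
    rows = conn.execute(
        "SELECT hostname, address, team FROM hosts ORDER BY hostname"
    )
    return "\n".join(
        HOST_TEMPLATE.format(hostname=h, address=a, team=t) for h, a, t in rows
    )


def demo_connection():
    """Build a throwaway in-memory CMDB with two made-up hosts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?)",
        [
            ("web-01", "192.0.2.10", "web-team"),
            ("db-01", "192.0.2.20", "dba-team"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(render_hosts(demo_connection()))
```

Run something like this from cron, write the output to a file your monitoring system includes, and reload. Because the team name doubles as the contact group, a host that changes owners in the CMDB changes who gets alerted without anyone touching the monitoring config.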

Low noise
• Dependencies & Parenting
• Handling redundant/HA services
• Dynamic Thresholds / Predictive Monitoring

I hate a noisy monitoring system. It’s one of the biggest failures of any monitoring project. People get a deluge of email, most of it false alarms, and then everyone just starts ignoring everything. It just gets worse the larger your environment is.

There are a few ways to help fix this problem.

Parenting & Dependencies
• SVR-01 depends on RACK-1-SW-01, which depends on DC-EAST-RTR-01?
• Website depends on the SQL cluster

Most modern monitoring systems support the concept of parent-child relationships. If you know something depends on something else, then configure that.

This may not seem like a huge deal, but what if that router melts? Without parenting configured, your monitoring system is going to suddenly dump a whole lot of emails on you about not being able to reach every single thing in that datacenter.

Another example: if you know your website depends on SQL being up, then configure that, so you aren’t alerted twice for the same problem. That way, if SQL goes down, you get only an alert for SQL, not an alert for both the website and SQL.

Bonus: now you can run reports that will tell you exactly what will be affected by taking down particular components of your network.
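
If your monitoring system doesn’t do parenting for you, the logic is small enough to sketch. The following is an illustration only, not how any particular product implements it: the `PARENTS` map reuses the names from the slide (`sql-cluster` and `website` are made-up identifiers), and each node is assumed to have at most one parent.

```python
# Hypothetical dependency map: child -> the parent it depends on.
PARENTS = {
    "SVR-01": "RACK-1-SW-01",
    "RACK-1-SW-01": "DC-EAST-RTR-01",
    "website": "sql-cluster",
}


def ancestors(node, parents):
    """Yield every parent above node, nearest first (cycle-safe)."""
    seen = set()
    while node in parents and node not in seen:
        seen.add(node)
        node = parents[node]
        yield node


def alerts_to_send(down, parents):
    """Keep only the failures that are not explained by a failed ancestor."""
    down = set(down)
    return sorted(
        n for n in down if not any(a in down for a in ancestors(n, parents))
    )


def affected_by(node, parents):
    """List everything that directly or indirectly depends on node."""
    return sorted(c for c in parents if node in ancestors(c, parents))


if __name__ == "__main__":
    # The router melts: three things are unreachable, one alert goes out.
    print(alerts_to_send(["SVR-01", "RACK-1-SW-01", "DC-EAST-RTR-01"], PARENTS))
    # The "bonus" report: what breaks if we take the router down?
    print(affected_by("DC-EAST-RTR-01", PARENTS))
```

The same map answers both questions: which alert is the root cause, and what a planned outage will take down with it.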

Redundancy & HA Monitoring
• Clusters
• Redundant hosts/services

Let’s say you have a farm of web servers, and you know that if one dies, it’s not a show-stopper. Why, then, should you get woken up at 3am because of it? This even applies on a small scale, such as a simple load balanced two-server setup. If you know that one server can handle all of the traffic by itself, then it’s not worth losing sleep because the other server went down. This applies equally well to a service cluster (such as DNS servers) as it does to host clusters (compute systems).

You can even get more complex and build some math in, for example, only alert me when 20% of all nodes are down.

This is one area where you may have to write some code.
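
The "20% of all nodes" rule is the kind of code you may end up writing yourself. A minimal sketch, assuming you can already get an up/down status per node from somewhere; the function name, the node names, and the choice to alert when there is no data at all are mine, not from any particular tool.

```python
def cluster_should_alert(node_up, max_down_percent=20):
    """Return True when at least max_down_percent of the nodes are down.

    node_up maps a node name to True (up) or False (down).
    """
    if not node_up:
        # Assumption: no status data at all is itself worth an alert.
        return True
    down = sum(1 for up in node_up.values() if not up)
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return down * 100 >= max_down_percent * len(node_up)


if __name__ == "__main__":
    farm = {f"web-{i:02d}": True for i in range(1, 11)}
    farm["web-03"] = False
    print(cluster_should_alert(farm))  # one of ten down: stay asleep
    farm["web-07"] = False
    print(cluster_should_alert(farm))  # two of ten down: wake someone up
```

For the two-server load-balanced case, set `max_down_percent=100` so the page only goes out when both servers are gone; a single dead node can go to a ticket queue instead.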

Static thresholds
• Alert me when 20% of my disk is free
• Alert me when CPU utilization is at 80%
• etc, etc

Problem: how big is the disk? 20% free is 2 TB on a 10 TB volume but only 2 GB on a 10 GB one. Does 20% apply everywhere? Leaves a lot of questions unanswered...

Dynamic thresholds
• A spike or cliff is interesting, but what if it’s below the static threshold?
• Holt-Winters Forecasting
• Averaging & standard deviation
• Aberrant behavior detection
• A holy grail of monitoring & a WIP

I’m going to assume that you’ve already tweaked your thresholds to suit what you need. Unfortunately, the typical threshold approach only goes so far.

Take the example of firewall connections: how do you know what a normal usage is? How do you know what an abnormal usage is? I might alert when I hit 1000 denied connections per minute, but what if my baseline is actually 100 per minute and I suddenly spike to 500 per minute? That’s something worth knowing. This is called aberrant behavior detection, and is a mainstay in the security monitoring arena, but hasn’t gotten much attention in the non-security operational monitoring area.

You can write some code to run checks against data stored in rrdtool or Graphite and send results back to your monitoring system. Or use Splunk.
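
Of the approaches on the slide, averaging and standard deviation is the simplest to sketch. The example below is that technique only, not Holt-Winters and not rrdtool’s implementation; the function name, the three-sigma default, and the sample numbers are assumptions chosen to mirror the firewall example.

```python
from statistics import mean, stdev


def is_aberrant(history, current, sigmas=3.0):
    """Flag current if it is more than `sigmas` standard deviations from
    the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to know what normal looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline  # perfectly flat history: any change is news
    return abs(current - baseline) > sigmas * spread


if __name__ == "__main__":
    # Denied connections per minute, hovering around a baseline of 100.
    denied = [95, 100, 105, 98, 102, 100, 97, 103]
    print(is_aberrant(denied, 104))  # ordinary wobble
    print(is_aberrant(denied, 500))  # far below a static 1000, still abnormal
```

A plain mean ignores daily and weekly cycles, which is where Holt-Winters earns its keep. A cheap middle ground is to build `history` only from the same hour on previous days.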

Tools to do this
• rrdtool
• Graphite
• Splunk
• And a bit of code to tie it all together
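
As one illustration of the "bit of code", here is a sketch of a check that reads Graphite data and reports in the Nagios plugin style (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). It assumes you have already fetched JSON from Graphite’s render endpoint (`format=json`), which returns a list of series, each with a `datapoints` list of `[value, timestamp]` pairs. The metric name, thresholds, and function names are made up.

```python
import json

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the newest non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def check(render_json, warn, crit):
    """Map the latest datapoint onto an exit code and a status line."""
    value = latest_value(render_json)
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value is {value}"
    if value >= warn:
        return WARNING, f"WARNING - value is {value}"
    return OK, f"OK - value is {value}"


if __name__ == "__main__":
    # Sample response; Graphite pads the most recent slot with null.
    sample = (
        '[{"target": "fw.denied", "datapoints": '
        "[[100.0, 1367600000], [520.0, 1367600060], [null, 1367600120]]}]"
    )
    code, message = check(sample, warn=300, crit=500)
    print(message)
    raise SystemExit(code)
```

Swapping the fixed `warn` and `crit` for a baseline computed from the older datapoints turns this into a dynamic threshold check.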

Dynamically notifies

Just use PagerDuty. Seriously.

I don’t mean to be a shill for them, but this is a seriously awesome service.

PagerDuty
• Flexible notification system
• Users create their own schedules, including exceptions (eg, vacation)
• Email, SMS, voice
• Ability to escalate on demand or automatically

Anyone who has ever configured contacts, escalations, and time periods in Nagios knows that it’s painful. PagerDuty solves this problem by putting that configuration into a web interface, then allowing each contact to set their own notification methods and time period exceptions (vacation, sick day, etc).

PagerDuty allows for inputs from multiple different sources, with different rules (or the same!) for each service. Anything that can send an email or communicate over the API can be handled by PagerDuty.

One of the coolest things about PagerDuty is how it handles notifications.
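
For the "communicate over the API" route, a rough sketch of triggering an incident from a script follows. It is written against PagerDuty’s Events API v2 as I understand it (a JSON POST with `routing_key`, `event_action`, and a `payload` carrying `summary`, `source`, and `severity`). That API version is newer than this talk, so check the current PagerDuty documentation before relying on any field. The function names, the placeholder key, and the example text are made up.

```python
import json
import urllib.request

# Events API v2 endpoint (assumption: verify against current PagerDuty docs).
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event."""
    event = {
        "routing_key": routing_key,  # the integration key for one service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Repeated events with the same key should map to the same incident.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, url=EVENTS_URL, timeout=10):
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    event = build_trigger_event(
        "YOUR-INTEGRATION-KEY",  # placeholder, not a real key
        "SNMP check failed on RACK-1-SW-01",
        "monitoring-host",
        dedup_key="snmp/RACK-1-SW-01",
    )
    print(json.dumps(event, indent=2))  # call send_event(event) to deliver it
```

Hooking a script like this up as the notification command in your monitoring system leaves schedules and escalations entirely on the PagerDuty side.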

PagerDuty

The first time I got an alert, it blew my mind. Right in the text, I can acknowledge the alert, close the alert, or escalate it. So cool.

A few neat tools
• collectd
• Graphite

I said I would try not to talk about specific tools much, but these are too valuable not to mention.

Not everything in your environment is going to fit nicely into the paradigm of a single monitoring system. A great example: a guy from another division approached me about graphing input and output voltages on some R&D solar equipment. Not exactly the typical monitoring we sysadmins do. The key aspect was that he needed it as near real-time as possible: less than three seconds between each value. I decided for various technical and business reasons to not put them into my usual system, but to use something different.

collectd is a poller. It’s got a bunch of plugins to collect data and then send it somewhere. In my particular case, I used collectd’s SNMP plugin to poll the devices every second, then write the data to Graphite.

Graphite is a super cool graphing and metric storage system. It accepts data over a TCP port, in an easy format. The web interface has a really nice looking visualization library, and has lots of different functions you can apply for some really awesome results.

Of course, you can always run checks against Graphite data, thereby integrating things.
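
The "easy format" is Graphite’s plaintext protocol: one line per datapoint, `metric.path value timestamp`, terminated by a newline and sent to the carbon listener, which by default is on TCP port 2003. A minimal sketch, with a made-up metric name and an assumed Graphite host on `localhost`:

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one datapoint in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003, timeout=5.0):
    """Send one already formatted line to carbon's plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    line = format_metric("solar.array1.input_voltage", 48.2)
    print(line, end="")
    # send_metric(line)  # uncomment once a Graphite host is reachable
```

collectd does this for you, but the format is simple enough that any script able to open a socket can feed Graphite directly.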

This is a really simple graph of the bandwidth usage on the LOPSA production server in the last 24 hours. Literally everything you see on this graph is customizable.

Resources
• PagerDuty.com
• collectd.org
• graphite.readthedocs.org
• rrdtool Aberrant Behavior Detection (google it)
• Slides (and more) at mikejulian.com

That’s all, firstname.lastname@example.org

Thank You for Attending LOPSA-East 13
Please fill out the Trainer Evaluation: http://lopsa-east.org/2013/training-survey
Rate LOPSA-East 13: http://www.lopsa-east.org/2013/rate-lopsa-east-13