Building a More Effective
Monitoring Environment
Mike Julian
Friday, May 3, 13
Who am I?
• Oak Ridge National Lab operations staff
• LOPSA technical staff (tech-team)
• Generalist IT engineer for ~10 years
• I am obsessed with monitoring.
This isn’t a beginner’s talk.
This isn’t a talk about how to set up Nagios or Zabbix or whatever. I’m not going to talk about why one system is better than
another. This is a talk for those who have some basic monitoring in place already and want to get more out of it.
I’m going to work on the assumption that you already have some experience with monitoring systems. I’m going to try to stay as
tool-agnostic as possible, but I will be talking about some specific tools in some cases.
Monitoring is not a solved problem.
A quick glance at Wikipedia’s entry for monitoring software shows 52 packages, and I know they’re missing many. That’s a lot of
different systems, and kinda underscores that this is an ongoing problem and not at all solved. New tools are being written every
day and there is a strong community dedicated to this problem. If you find yourself banging your head on your desk, screaming
about why your monitoring system isn’t doing what you want, rest assured: you aren’t alone.
There is no one-size-fits-all solution.
Monitoring isn’t a single problem; it’s several problems combined under one label. There’s not a single tool out there that is
going to magically solve everything for you.
The only way you’re going to be totally happy with your monitoring system is if you write it from scratch, tailored to your
environment. That’s not feasible for most of us, but that doesn’t mean you can’t write some code to solve a problem every now
and then. Don’t be afraid to create something new: you’re going to have to, to get great monitoring. Think of it as building a
better wheel.
A more effective
monitoring system
• Automated
• Low noise
• Dynamically notifies
Everything I’m going to talk about falls into these three categories.
Automated
• Config Management
• IPAM & CMDB
• Service discovery
• Self-healing
I love automation. Automation appeals to my inner laziness. I’m sure you all understand.
There are a lot of different ways to handle automation, and different facets to it.
Since there are so many different tools available for automation, and different monitoring systems have different levels of built-in
automation, I want to talk about the general approach.
In my current environment, we have the luxury of knowing definitively all the hosts on our network through our IPAM/CMDB
system. We also know who runs them, what team they belong to, and a few other useful details. A few SQL queries go a long
way to automating our configuration. Rather than rely on our configuration management, I’ve chosen to rely on the CMDB, only
because that’s more central to my environment. Yours is probably different than mine, and that’s a key point: do what works for
*your* environment.
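To make the "few SQL queries" idea concrete, here is a minimal sketch that turns CMDB rows into Nagios-style host definitions. The `hosts` table, its columns, and the sample data are all hypothetical, and an in-memory SQLite database stands in for whatever your IPAM/CMDB actually runs on. Treat it as an illustration of the approach, not as the setup described in this talk.

```python
import sqlite3

# Nagios-style host object; the template name "generic-host" is an assumption.
HOST_TEMPLATE = """define host {{
    use             generic-host
    host_name       {name}
    address         {address}
    contact_groups  {team}
}}
"""


def render_hosts(conn):
    """Build one host definition per row in the (hypothetical) hosts table."""
    rows = conn.execute(
        "SELECT name, address, team FROM hosts ORDER BY name"
    ).fetchall()
    return "\n".join(
        HOST_TEMPLATE.format(name=name, address=address, team=team)
        for name, address, team in rows
    )


def demo_connection():
    """Stand-in CMDB: an in-memory SQLite database with made-up hosts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hosts (name TEXT, address TEXT, team TEXT)")
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?)",
        [
            ("svr-01", "10.0.0.11", "web-team"),
            ("db-01", "10.0.0.21", "dba-team"),
        ],
    )
    return conn


if __name__ == "__main__":
    # Write this output to a config file and reload the monitoring system.
    print(render_hosts(demo_connection()))
```

Run it from cron or a post-commit hook and the monitoring config follows the CMDB instead of somebody's memory.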
Service discovery is a bit trickier. I’ve been relying on the built-in functionality in my monitoring system to handle this for me,
but this could easily be set up from a config management system.
One thing I’ve been toying with that you might consider: self-healing. Most systems support the execution of a script when an
event occurs. For example, I have a network device whose SNMP engine likes to fall over from time to time. When my SNMP
check fails, I could have that kick off an expect script to log into the device and restart the SNMP engine. Slightly related, you
could also have a script automatically create a ticket in your ticketing system, or toss the event info into a database. Lots of
options there.
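A rough sketch of what such an event handler's decision logic might look like. It assumes the monitoring system passes in a state ("OK", "WARNING", "CRITICAL") and a state type ("SOFT" or "HARD"), as Nagios-style event handlers can. The `restart` and `open_ticket` callables are hypothetical placeholders for your expect script and your ticketing integration.

```python
def handle_event(state, state_type, restart, open_ticket):
    """Decide what to do when a check changes state.

    restart:     callable returning True if the restart worked (hypothetical)
    open_ticket: callable taking a message string (hypothetical)
    """
    # Only act once the failure is confirmed (HARD), not on the first blip.
    if state == "CRITICAL" and state_type == "HARD":
        if restart():
            return "restarted"
        # Self-healing failed, so hand the problem to a human.
        open_ticket("Automatic SNMP engine restart failed")
        return "ticketed"
    return "ignored"


if __name__ == "__main__":
    tickets = []
    # Pretend the restart fails so the ticket path is exercised.
    print(handle_event("CRITICAL", "HARD", lambda: False, tickets.append))
    print(tickets)
```

Keeping the decision separate from the action makes the handler easy to test without touching a real device.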
Low noise
• Dependencies & Parenting
• Handling redundant/HA services
• Dynamic Thresholds / Predictive Monitoring
I hate a noisy monitoring system. It’s one of the biggest failures of any monitoring project. People get a deluge of email, most of
it false alarms, and then everyone just starts ignoring everything. It just gets worse the larger your environment is.
There are a few ways to help fix this problem.
Parenting &
Dependencies
• SVR-01 depends on RACK-1-SW-01, which
depends on DC-EAST-RTR-01?
• Website depends on the SQL cluster
Most modern monitoring systems support the concept of parent-child relationships. If you know something depends on
something else, then configure that.
This may not seem like a huge deal, but what if that router melts? Your monitoring system is going to suddenly dump a whole
lot of emails on you about not being able to reach every single thing in that datacenter.
Another example: if you know your website depends on SQL being up, then configure that, so you aren’t alerted twice for the
same problem. That way, if SQL goes down, you get only an alert for SQL, not an alert for both the website and SQL.
Bonus: now you can run reports that will tell you exactly what will be affected by taking down particular components of your
network.
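Most monitoring systems do this for you once parents are configured, but the logic is simple enough to sketch. The example below uses the host names from the slide in a hypothetical parent map (assumed to be free of cycles). It shows both halves of the idea: alert only on the topmost failed item, and report what sits underneath a given component.

```python
# child -> parent; None marks the top of the tree. Names are from the slide.
PARENTS = {
    "SVR-01": "RACK-1-SW-01",
    "RACK-1-SW-01": "DC-EAST-RTR-01",
    "DC-EAST-RTR-01": None,
}


def ancestors(host, parents):
    """Yield every parent above host, nearest first."""
    node = parents.get(host)
    while node is not None:
        yield node
        node = parents.get(node)


def root_causes(down, parents):
    """Down hosts with no down ancestor: the only ones worth alerting on."""
    return {h for h in down if not any(a in down for a in ancestors(h, parents))}


def impacted_by(target, parents):
    """Everything that depends, directly or indirectly, on target."""
    return {h for h in parents if target in ancestors(h, parents)}


if __name__ == "__main__":
    everything_down = {"SVR-01", "RACK-1-SW-01", "DC-EAST-RTR-01"}
    print(root_causes(everything_down, PARENTS))
    print(impacted_by("DC-EAST-RTR-01", PARENTS))
```

With the router melted, three hosts are unreachable but only one alert is worth sending.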
Redundancy & HA
Monitoring
• Clusters
• Redundant hosts/services
Let’s say you have a farm of web servers, and you know that if one dies, it’s not a show-stopper. Why, then, should you get
woken up at 3am because of it? This even applies on a small scale, such as a simple load balanced two-server setup. If you know
that one server can handle all of the traffic by itself, then it’s not worth losing sleep because the other server went down. This
applies equally well to a service cluster (such as DNS servers) as it does to host clusters (compute systems).
You can even get more complex and build some math in: for example, only alert me when 20% of all nodes are down.
This is one area where you may have to write some code.
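Here is one way that bit of code might look: a minimal sketch of a percentage-based cluster check. The node names and the `min_down_percent` parameter are made up for illustration. How the per-node up/down status gets collected is left to your monitoring system.

```python
def cluster_alert(node_up, min_down_percent=20):
    """Return True when at least min_down_percent of nodes are down.

    node_up maps a node name to True (up) or False (down).
    """
    total = len(node_up)
    if total == 0:
        # No known nodes is itself suspicious, so treat it as a problem.
        return True
    down = sum(1 for up in node_up.values() if not up)
    # Integer math avoids floating-point surprises at the boundary.
    return down * 100 >= min_down_percent * total


if __name__ == "__main__":
    farm = {f"web-{i:02d}": True for i in range(10)}
    farm["web-03"] = False
    print(cluster_alert(farm))  # one of ten down: stay asleep
    farm["web-07"] = False
    print(cluster_alert(farm))  # two of ten down: wake someone up
```

For the two-server case where one box can carry the full load, set `min_down_percent=100` so you are only paged when both are down.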
Static thresholds
• Alert me when 20% of my disk is free
• Alert me when CPU utilization is at 80%
• etc, etc
Problem: how big is the disk? Does 20% apply everywhere? Leaves a lot of questions unanswered...
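A quick bit of arithmetic shows why one percentage does not fit every disk. The disk sizes here are made-up examples.

```python
def free_gb_at_threshold(size_gb, percent_free=20):
    """How many GB are still free at the moment a percent-free alert fires."""
    return size_gb * percent_free // 100


if __name__ == "__main__":
    for size_gb in (100, 10_000):
        remaining = free_gb_at_threshold(size_gb)
        print(f"{size_gb} GB disk alerts with {remaining} GB still free")
```

The same 20% rule fires with 20 GB left on a small disk and with 2 TB left on a large one, which are very different levels of urgency.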
Dynamic thresholds
• A spike or cliff is interesting--but what if it’s
below the static threshold?
• Holt-Winters Forecasting
• Averaging & standard deviation
• Aberrant behavior detection
• A holy grail of monitoring & a WIP
I’m going to assume that you’ve already tweaked your thresholds to suit what you need.
Unfortunately, the typical threshold approach only goes so far.
Take the example of firewall connections: how do you know what normal usage is? How do you know what abnormal usage
is? I might alert when I hit 1000 denied connections per minute, but what if my baseline is actually 100 per minute and I suddenly
spike to 500 per minute? That’s something worth knowing. This is called aberrant behavior detection, and is a mainstay in the
security monitoring arena, but hasn’t gotten much attention in the non-security operational monitoring area.
You can write some code to run checks against data stored in rrdtool or Graphite and send results back to your monitoring
system. Or use Splunk.
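As a minimal sketch of the "averaging & standard deviation" approach (not Holt-Winters, which also models trend and seasonality), the check below flags a value that sits too far from the recent baseline. The sample history and the three-sigma cutoff are assumptions for illustration. In practice the history would come from rrdtool or Graphite.

```python
from statistics import mean, pstdev


def is_aberrant(history, current, sigmas=3.0):
    """True when current is more than `sigmas` standard deviations from the mean."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != baseline
    return abs(current - baseline) > sigmas * spread


if __name__ == "__main__":
    # Denied connections per minute, hovering around 100 (made-up data).
    history = [95, 102, 98, 105, 100, 97, 103, 100]
    static_threshold = 1000
    spike = 500
    print("static alert:", spike >= static_threshold)
    print("aberrant:", is_aberrant(history, spike))
```

The spike to 500 never trips the static threshold of 1000, but it stands far outside the baseline, so the dynamic check catches it.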
Graphite forecasting
Tools to do this
• rrdtool
• Graphite
• Splunk
• And a bit of code to tie it all together
Dynamically notifies
Just use PagerDuty.
Seriously.
I don’t mean to be a shill for them, but this is a seriously awesome service.
PagerDuty
• Flexible notification system
• Users create their own schedules,
including exceptions (eg, vacation)
• Email, SMS, voice
• Ability to escalate on demand or
automatically
Anyone who has ever configured contacts, escalations, and time periods in Nagios knows that it’s painful. PagerDuty solves this
problem by putting that configuration into a web interface, then allowing each contact to set their own notification methods
and time period exceptions (vacation, sick day, etc).
PagerDuty allows for inputs from multiple different sources, with different rules (or the same!) for each service. Anything that
can send an email or communicate over the API can be handled by PagerDuty.
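For the API route, here is a hedged sketch of building a trigger event. The field names follow PagerDuty's Events API v2 as I understand it, which is newer than this talk, so check the current PagerDuty documentation before relying on them. The integration key, host name, and dedup key are placeholders, and the example only builds the JSON body rather than sending it.

```python
import json

# Assumed endpoint for the Events API v2; verify against PagerDuty's docs.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Return the JSON body for a 'trigger' event."""
    event = {
        "routing_key": routing_key,   # per-service integration key
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Repeated events with the same key are grouped into one incident.
        event["dedup_key"] = dedup_key
    return json.dumps(event)


if __name__ == "__main__":
    body = build_trigger_event(
        "YOUR_INTEGRATION_KEY",          # placeholder, not a real key
        "SNMP check failing on DC-EAST-RTR-01",
        "DC-EAST-RTR-01",
        dedup_key="snmp/DC-EAST-RTR-01",
    )
    print(f"POST {EVENTS_URL}")
    print(body)
```

Any script that can make an HTTPS POST with this body can feed alerts in, which is what makes homegrown checks easy to hook up.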
One of the coolest things about PagerDuty is how it handles notifications.
PagerDuty
The first time I got an alert, it blew my mind. Right in the text, I can acknowledge the alert, close the alert, or escalate it. So cool.
A few neat tools
• collectd
• Graphite
I said I would try not to talk about specific tools much, but these are too valuable not to mention.
Not everything in your environment is going to fit nicely into the paradigm of a single monitoring system. A great example:
a guy from another division approached me about graphing input and output voltages on some R&D solar equipment. Not
exactly the typical monitoring we sysadmins do. The key aspect was that he needed it as near real-time as possible: less than
three seconds between each value. I decided for various technical and business reasons not to put this data into my usual system,
but to use something different.
collectd is a poller. It’s got a bunch of plugins to collect data and then send it somewhere. In my particular case, I used collectd’s
SNMP plugin to poll the devices every second, then write the data to Graphite.
Graphite is a super cool graphing and metric storage system. It accepts data over a TCP port, in an easy format. The web
interface has a really nice looking visualization library, and has lots of different functions you can apply for some really awesome
results.
Of course, you can always run checks against Graphite data, thereby integrating things.
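The "easy format" is Graphite's plaintext protocol: one line per data point, made up of a metric path, a value, and a Unix timestamp, sent to the carbon listener (port 2003 by default). A minimal sketch follows. The metric name and the host are made up, and collectd would normally do this part for you.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one line to a carbon listener over TCP (host is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    line = format_metric("solar.panel1.input_voltage", 48.2)
    print(line, end="")
    # Uncomment once a carbon listener is reachable:
    # send_metric(line, host="graphite.example.com")
```

Because the format is this simple, nearly anything that can open a socket can push data into Graphite.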
This is a really simple graph of the bandwidth usage on the LOPSA production server in the last 24 hours.
Literally everything you see on this graph is customizable.
Resources
• PagerDuty.com
• collectd.org
• graphite.readthedocs.org
• rrdtool Aberrant Behavior Detection (google it)
• Slides (and more) at mikejulian.com
That’s all, folks.
mike@mikejulian.com
Thank You for Attending LOPSA-East '13
Please fill out the Trainer Evaluation
http://lopsa-east.org/2013/training-survey
Rate LOPSA-East '13
http://www.lopsa-east.org/2013/rate-lopsa-east-13