Beyond Nagios
Beyond Nagios Presentation Transcript

  • Beyond Nagios. NYC DevOps, 2011/07/21. Alexis Lê-Quôc - alq@datadoghq.com
  • What I’m Going To Talk About
      • Super-quick Nagios summary
      • Monitoring/Alerting pathologies
      • How to fix it
  • What Is Nagios?
      • “Industry Standard in IT Infrastructure Monitoring”
      • For once it’s true...
      • Scheduler & Notification server
  • (+) Robust, mature code-base
    (-) Configuration can be daunting
    (-) Not human-friendly
  • “OVERWHELMING”
  • A “NORMAL” HOUR
  • THE “OTHER” NAGIOS UI
  • THE HAPPY START (diagram labels: Receive alerts; Process alerts & fix things; Add more checks)
  • THE SPIRAL OF DEATH (diagram labels: Add more checks; Ignore alerts; Missed alerts)
  • FIGHT OR FLIGHT (chart labels: Quality of life; # of alerts; Few checks, few alerts; More checks; Too many alerts)
  • THE TROUGH OF DESPAIR (chart labels: Effective Coverage; Scale; Complexity; Checks n^2; Fault-tolerant; Less urgency; Few checks, few alerts, every host counts; More checks, too many alerts, every host still counts)
  • IF ONLY I ADDED MORE CHECKS... (chart labels: Effective Coverage; Scale)
  • Reset!
  • Way Out
      ‣ Breathe!
      ‣ Measure
      ‣ Look for Patterns
      ‣ Put Alerts in Context
      ‣ Focus on the Business
  • MEASURE
      • Turn Nagios logs into structured data
      • Analyze

               day         | success_pct | warning_pct | error_pct | events
      ---------------------+-------------+-------------+-----------+--------
       2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
       2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
       2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
       2011-07-15 00:00:00 |          89 |           0 |         2 |   9531
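
The MEASURE step (turn Nagios logs into structured data, then analyze) could be sketched roughly as follows. This is not the speaker's tooling, only a minimal illustration. It assumes the standard `SERVICE ALERT` line layout of the Nagios log (`[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`). The mapping of OK, WARNING and CRITICAL onto the slide's `success_pct`, `warning_pct` and `error_pct` columns is a guess, and `parse_line` and `daily_summary` are hypothetical names.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Matches log entries such as:
# [1310428860] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_line(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.fromtimestamp(int(record["ts"]), tz=timezone.utc)
    record["attempt"] = int(record["attempt"])
    return record


def daily_summary(lines):
    """Aggregate parsed alerts per UTC day into percentage columns."""
    per_day = defaultdict(Counter)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            per_day[record["ts"].date()][record["state"]] += 1

    summary = {}
    for day, counts in sorted(per_day.items()):
        total = sum(counts.values())
        # Assumed mapping: OK -> success, WARNING -> warning, CRITICAL -> error.
        # Other states (e.g. UNKNOWN) only contribute to the event count.
        summary[day.isoformat()] = {
            "success_pct": 100 * counts["OK"] // total,
            "warning_pct": 100 * counts["WARNING"] // total,
            "error_pct": 100 * counts["CRITICAL"] // total,
            "events": total,
        }
    return summary
```

Calling `daily_summary(open(path))` on a log file (the path depends on the installation) would give one row per day, which can then be loaded into a database or a plotting tool.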
  • VISUALIZATION MATTERS (the slide repeats the daily summary table from the MEASURE slide)
  • LOOK FOR PATTERNS
      • In time
      • Flapping
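
One way to look for the two patterns named on this slide, once alerts are available as structured events, is sketched below. It is a deliberately simple illustration and not Nagios's built-in flap detection: it counts state changes per host/service pair and tallies non-OK events per hour of day. The event tuple layout `(epoch_seconds, host, service, state)`, the function names and the `min_changes` threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_state_changes(events):
    """Count state transitions per (host, service), ordered by timestamp."""
    history = defaultdict(list)
    for ts, host, service, state in events:
        history[(host, service)].append((ts, state))

    changes = {}
    for key, items in history.items():
        items.sort(key=lambda pair: pair[0])
        states = [state for _, state in items]
        changes[key] = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return changes


def flapping(events, min_changes=4):
    """Return the (host, service) pairs with at least min_changes transitions."""
    counts = count_state_changes(events)
    return sorted(key for key, n in counts.items() if n >= min_changes)


def alerts_by_hour(events):
    """Tally non-OK events per hour of day (UTC) to expose time patterns."""
    return Counter((ts // 3600) % 24 for ts, _, _, state in events if state != "OK")
```

A service that appears in `flapping(events)` is a candidate for a different threshold or retry setting rather than for more notifications, and a spike in `alerts_by_hour(events)` at a fixed hour often points at a scheduled job.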
  • PUT ALERTS IN CONTEXT https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0
  • FOCUS ON THE BUSINESS
      • Ultimate (hard) question
          ‣ Does this alert impact the business?
          ‣ If so, by how much?
          ‣ Assumes that you track business metrics...
          ‣ And that they can be accessed programmatically
  • What applies to Nagios... applies to other sources too, etc.
  • Thanks. http://datadoghq.com
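
If a business metric is tracked and can be read programmatically, the "by how much?" question can be approximated by comparing the metric during the alert with an equally long window just before it. The sketch below is only one naive way to do that. The sample layout `(epoch_seconds, value)`, the function name `business_impact` and the choice of baseline window are assumptions, not something taken from the talk.

```python
from statistics import mean


def business_impact(samples, alert_start, alert_end):
    """Relative change of a metric during an alert versus the window before it.

    samples: iterable of (epoch_seconds, value) pairs for a business metric,
    for example orders per minute (hypothetical). Returns a fraction such as
    -0.5 for a 50% drop, or None when there is not enough data to compare.
    """
    window = alert_end - alert_start
    baseline = [v for t, v in samples if alert_start - window <= t < alert_start]
    during = [v for t, v in samples if alert_start <= t < alert_end]
    if not baseline or not during:
        return None
    base = mean(baseline)
    if base == 0:
        return None  # relative change is undefined against a zero baseline
    return (mean(during) - base) / base
```

A result close to zero suggests the alert had little visible effect on the business, which is an argument for lowering its urgency. A real comparison would also need to account for normal daily and weekly variation in the metric.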