Beyond Nagios

  • Beyond Nagios

    1. Beyond Nagios. NYC DevOps, 2011/07/21. Alexis Lê-Quôc - alq@datadoghq.com
    2. Beyond Nagios. NYC DevOps, 2011/07/21. Alexis Lê-Quôc - alq@datadoghq.com
    3. What I’m Going To Talk About • Super-quick Nagios summary • Monitoring/Alerting Pathologies • How to fix it
    4. What Is Nagios? • “Industry Standard in IT Infrastructure Monitoring” • For once it’s true... • Scheduler & Notification server
    5. (+) Robust, mature code-base • (-) Configuration can be daunting • (-) Not human-friendly
    6. “OVERWHELMING”
    7. A “NORMAL” HOUR
    8. THE “OTHER” NAGIOS UI
    9. THE HAPPY START: Process alerts & Fix things • Receive alerts • Add more checks
    10. THE SPIRAL OF DEATH: Missed alerts • Ignore alerts • Add more checks
    11. FIGHT OR FLIGHT (chart labels: Quality of life, # of alerts): Few checks, few alerts • More checks, too many alerts
    12. THE TROUGH OF DESPAIR (chart labels: Effective Coverage, Scale, Complexity): Checks n^2 • Fault-tolerant, less urgency • Few checks, few alerts, every host counts • More checks, too many alerts, every host still counts
    13. IF ONLY I ADDED MORE CHECKS... (chart labels: Effective Coverage, Scale)
    14. Reset!
    15. Way Out ‣ Breathe! ‣ Measure ‣ Look for Patterns ‣ Put Alerts in Context ‣ Focus on the Business
    16. MEASURE: Turn Nagios logs into structured data • Analyze

                 day         | success_pct | warning_pct | error_pct | events
        ---------------------+-------------+-------------+-----------+--------
         2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
         2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
         2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
         2011-07-15 00:00:00 |          89 |           0 |         2 |   9531

    17. VISUALIZATION MATTERS (shows the same table as slide 16)
    18. LOOK FOR PATTERNS: In Time • Flapping
    19. PUT ALERTS IN CONTEXT https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0
    20. FOCUS ON THE BUSINESS: Ultimate (hard) question ‣ Does this alert impact the business? ‣ If so, by how much? ‣ Assumes that you track business metrics... ‣ And that they can be accessed programmatically
    21. What applies to Nagios... applies to other sources too, etc...
    22. Thanks. http://datadoghq.com
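
The MEASURE slide suggests turning Nagios logs into structured data and analyzing them per day. A minimal sketch of that idea follows, assuming the standard Nagios log layout of `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` (and the shorter `HOST ALERT` form). The function names and the mapping of states to the `success`, `warning` and `error` columns are assumptions for illustration; the talk does not say how its percentages were computed.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Nagios alert lines look like "[<epoch>] <TYPE>: <semicolon-separated fields>".
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE ALERT|HOST ALERT): (.*)$")

# Assumed grouping of Nagios states into the columns shown on the slide.
SUCCESS_STATES = {"OK", "UP"}
WARNING_STATES = {"WARNING"}
ERROR_STATES = {"CRITICAL", "DOWN", "UNREACHABLE"}


def parse_line(line):
    """Return a dict for an alert line, or None for any other log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    epoch, kind, rest = match.groups()
    fields = rest.split(";")
    # SERVICE ALERT: host;service;state;...   HOST ALERT: host;state;...
    state_index = 2 if kind == "SERVICE ALERT" else 1
    if len(fields) <= state_index:
        return None
    day = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date()
    return {"day": day.isoformat(), "host": fields[0], "state": fields[state_index]}


def daily_summary(lines):
    """Aggregate alert lines into per-day percentages and event counts."""
    counts = defaultdict(Counter)
    for line in lines:
        event = parse_line(line)
        if event is not None:
            counts[event["day"]][event["state"]] += 1

    def pct(states, group, total):
        return round(100 * sum(states[s] for s in group) / total)

    summary = {}
    for day in sorted(counts):
        states = counts[day]
        total = sum(states.values())
        summary[day] = {
            "success_pct": pct(states, SUCCESS_STATES, total),
            "warning_pct": pct(states, WARNING_STATES, total),
            "error_pct": pct(states, ERROR_STATES, total),
            "events": total,
        }
    return summary
```

Passing an open log file to `daily_summary` yields one row per UTC day, which can then be loaded into a database or plotted. States outside the three groups (for example `UNKNOWN`) count toward `events` only, so the percentages need not sum to 100.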

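The LOOK FOR PATTERNS slide names flapping as one pattern worth finding. One rough way to spot it is to measure how often a check changes state over its most recent results. The sketch below is an unweighted simplification written for illustration, not Nagios's built-in flap detection; the function names, the 21-result window and the 50% threshold are all assumed values.

```python
def state_change_pct(states, window=21):
    """Percentage of consecutive result pairs in the window whose state differs."""
    recent = states[-window:]
    if len(recent) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
    return 100.0 * changes / (len(recent) - 1)


def is_flapping(states, window=21, threshold_pct=50.0):
    """Flag a check as flapping when its state-change rate reaches the threshold."""
    return state_change_pct(states, window) >= threshold_pct
```

Here `states` is the chronological list of states for one host or service, for example as extracted from parsed log events. A check that alternates between `OK` and `CRITICAL` on every result scores 100, while one that changes state once and stays there scores low.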