Beyond Nagios

Transcript

  • 1. Beyond Nagios • NYC DevOps 2011/07/21 • Alexis Lê-Quôc - alq@datadoghq.com
  • 3. What I’m Going To Talk About • Super-quick Nagios summary • Monitoring/Alerting Pathologies • How to fix it
  • 4. What Is Nagios • “Industry Standard in IT Infrastructure Monitoring” • For once it’s true... • Scheduler & Notification server
  • 5. (+) Robust, mature code-base • (-) Configuration can be daunting • (-) Not human-friendly
  • 6. “OVERWHELMING”
  • 7. A “NORMAL” HOUR
  • 8. THE “OTHER” NAGIOS UI
  • 9. Process alerts & fix things • Receive alerts • Add more checks • THE HAPPY START
  • 10. Missed alerts • Ignore alerts • Add more checks • THE SPIRAL OF DEATH
  • 11. Quality of life • Few checks, few alerts • More checks, too many alerts • # of alerts • FIGHT OR FLIGHT
  • 12. Effective coverage • Checks n^2 • Fault-tolerant, less urgency • Few checks, few alerts, every host counts • More checks, too many alerts, every host still counts • Scale, complexity • THE TROUGH OF DESPAIR
  • 13. Effective coverage • Scale • IF ONLY I ADDED MORE CHECKS...
  • 14. Reset!
  • 15. Way Out ‣ Breathe! ‣ Measure ‣ Look for Patterns ‣ Put Alerts in Context ‣ Focus on the Business
  • 16. Turn Nagios logs into structured data • Analyze • MEASURE
             day         | success_pct | warning_pct | error_pct | events
    ---------------------+-------------+-------------+-----------+--------
     2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
     2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
     2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
     2011-07-15 00:00:00 |          89 |           0 |         2 |   9531
  • 17. VISUALIZATION MATTERS
             day         | success_pct | warning_pct | error_pct | events
    ---------------------+-------------+-------------+-----------+--------
     2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
     2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
     2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
     2011-07-15 00:00:00 |          89 |           0 |         2 |   9531
  • 18. In Time • Flapping • LOOK FOR PATTERNS
  • 19. PUT ALERTS IN CONTEXT https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0
  • 20. Ultimate (hard) question ‣ Does this alert impact the business? ‣ If so, by how much? ‣ Assumes that you track business metrics... ‣ And that they can be accessed programmatically • FOCUS ON THE BUSINESS
  • 21. What applies to Nagios... • Applies to other sources too • etc...
  • 22. Thanks • http://datadoghq.com
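
The "Turn Nagios logs into structured data" and "Look for patterns" steps (slides 16 and 18) could be sketched roughly as follows. This is only an illustration, not the queries used in the talk, which the slides do not show. It assumes the standard `SERVICE ALERT` line layout found in nagios.log; the sample log lines, the function names, and the mapping of OK/WARNING/CRITICAL onto the success/warning/error columns are all hypothetical. Note that alert lines record state changes only, so the percentages here describe alert events rather than every check result.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Assumed layout of a nagios.log service alert line:
# [epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_line(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    event = m.groupdict()
    event["time"] = datetime.fromtimestamp(int(event.pop("ts")), tz=timezone.utc)
    event["attempt"] = int(event["attempt"])
    return event


def daily_summary(events):
    """Per UTC day: integer share of each state, plus the event count."""
    per_day = defaultdict(Counter)
    for e in events:
        per_day[e["time"].date()][e["state"]] += 1
    summary = {}
    for day, counts in sorted(per_day.items()):
        total = sum(counts.values())
        summary[day] = {
            "success_pct": 100 * counts["OK"] // total,
            "warning_pct": 100 * counts["WARNING"] // total,
            "error_pct": 100 * counts["CRITICAL"] // total,
            "events": total,
        }
    return summary


def state_changes(events):
    """Count alerts per (host, service); a high count hints at flapping."""
    return Counter((e["host"], e["service"]) for e in events)


# Hypothetical sample data, invented for illustration only.
SAMPLE_LOG = """\
[1310443200] SERVICE ALERT: web1;HTTP;CRITICAL;HARD;3;Connection refused
[1310443260] SERVICE ALERT: web1;HTTP;OK;HARD;3;HTTP OK
[1310443320] SERVICE ALERT: web1;HTTP;CRITICAL;SOFT;1;Connection refused
[1310443380] SERVICE ALERT: db1;Disk;WARNING;HARD;3;DISK WARNING - free space low
[1310443400] Auto-save of retention data completed successfully.
"""

events = [e for e in map(parse_line, SAMPLE_LOG.splitlines()) if e]

if __name__ == "__main__":
    for day, row in daily_summary(events).items():
        print(day, row)
    print(state_changes(events).most_common(3))
```

Once the events are plain dictionaries, they can be loaded into a database or a plotting tool, which is where the per-day table and the visualizations on slides 16 to 18 would come from.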