Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios
Mike Guthrie's presentation on distributed monitoring solutions for Nagios. The presentation was given during the Nagios World Conference North America held Sept 27-29th, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Published in: Technology
Transcript

  • 1. Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie [email_address]
  • 2. Distributed Monitoring Introduction
    • Basic Definition: Splitting up your monitoring server over multiple machines
    • Why use distributed monitoring?
      • Multiple sites with firewall restrictions
      • Large installations that exceed the CPU and memory resources that a single machine can offer
  • 5. Understanding CPU Limitations
    • The primary task of the Nagios Core engine is to schedule checks
    • Example Monitoring Server
      • 1000 hosts, 4 services per host, 5-minute interval
      • Check load = ( 5000 checks / 5 minutes ) / 60 seconds
        • About 16.7 checks per second
        • In 1 second: about 16 scripts or binary processes are launched, and about 16 sets of results come in, are processed by Nagios, and are written to disk
    • When the check schedule exceeds CPU limitations, you get “check latency”
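
The check-load arithmetic on this slide can be reproduced with a short sketch. It assumes the slide's 5000 checks are the 4000 service checks plus one host check for each of the 1000 hosts; the transcript does not spell out that breakdown, and the function name is made up for illustration.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_minutes: float) -> float:
    """Average number of checks that must be launched each second."""
    # Assumption: one host check per host in addition to its service checks,
    # which yields the slide's total of 5000 for 1000 hosts with 4 services each.
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{rate:.1f} checks per second")
```

This is only an average. Real schedules are not perfectly even, so short bursts can exceed the figure.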
  • 10. Picking the Right Distributed Model
    • Pick the right model for your environment
    • Think logistics: PLAN before implementation
      • Every hour spent in planning logistics will save tens or even hundreds of man hours later on
      • A 30-minute task on 1 server = 5 hours on 10 servers
    • Consider how to effectively view information across multiple machines
      • As data quantity increases, discerning useful information from it becomes more important
      • Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information
  • 16. The Classic Distributed Model
    • [Diagram: distributed servers running active checks and forwarding results, after every check, to a central server (passive only)]
  • 17. The Classic Distributed Model
  • 18. The Classic Distributed Model
    • Central Monitoring vs Central Viewing?
      • OCSP vs Event Handlers
        • OCSP runs after every check
        • Event handlers run only on state changes
    • Freshness checking ensures current data
    • Child servers can also do local monitoring without forwarding results
    • Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
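
The forwarding and freshness ideas on this slide can be sketched roughly as follows. The helper names are hypothetical, and the state-change rule simplifies how Nagios actually triggers event handlers. The command string follows the `PROCESS_SERVICE_CHECK_RESULT` external command format from the Nagios Core documentation; the transport that carries it to the central server is not named in the slides and is left out.

```python
import time


def should_forward(mode, previous_state, current_state):
    """Decide whether a distributed server forwards this check result."""
    if mode == "ocsp":
        # OCSP runs after every check, so every result is forwarded.
        return True
    if mode == "event_handler":
        # Simplification: forward only when the state changed.
        return previous_state != current_state
    raise ValueError(f"unknown mode: {mode}")


def passive_result_command(host, service, return_code, output, now=None):
    """Format a passive service result as a Nagios external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")


def is_stale(last_result_time, freshness_threshold, now):
    """True when the central server has waited longer than the threshold."""
    return (now - last_result_time) > freshness_threshold


# Hypothetical host and service names, for illustration only.
if should_forward("event_handler", previous_state=0, current_state=2):
    print(passive_result_command("web01", "HTTP", 2, "CRITICAL - timeout"))
```

The trade-off is visible in `should_forward`: event-handler forwarding sends far fewer results, so the central server depends on freshness checking to notice a distributed server that has gone silent.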
  • 23. The Classic Distributed Model
    • Strengths:
      • Well tested, well documented, proven solution
      • All built into the Nagios Core package
      • Extremely flexible for checks, performance graphing, notifications, etc.
      • Can be combined with other distributed models
    • Challenges:
      • Maintaining configs on multiple machines
      • Which server issued the check?
      • Where to process/view performance data?
  • 29. The Classic Distributed Model
    • Workarounds:
      • Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers.
      • Use templating as much as possible
        • Read Core Docs on “Object Inheritance”
        • Keep template definitions separate
      • Use naming conventions to keep configs organized
      • Nagios XI distributed tools:
        • Inbound and Outbound Checks
        • Unconfigured Objects
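
As one possible illustration of templating plus naming conventions, the sketch below generates host definitions that inherit from a shared template through the `use` directive. The template name `generic-host`, the `site-role-number` naming scheme, and the addresses are assumptions for the example, not something the slides prescribe.

```python
def host_definition(host_name, address, template="generic-host"):
    """Render a Nagios host object that inherits its settings from a template."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def conventional_name(site, role, index):
    """Hypothetical naming convention: <site>-<role>-<two-digit number>."""
    return f"{site}-{role}-{index:02d}"


# Example inventory (made-up addresses from the documentation range).
inventory = [("stp", "web", 1, "192.0.2.10"), ("stp", "db", 1, "192.0.2.20")]
config = "\n".join(
    host_definition(conventional_name(site, role, idx), addr)
    for site, role, idx, addr in inventory
)
print(config)
```

Generating the file from one inventory means the same output can be synced to both the distributed and the central server, which keeps the two sets of definitions from drifting apart.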
  • 34. The Cluster Model – Nagios Load Balancing
    • Nagios checks are managed by a sub-process and distributed evenly across multiple servers
    • Works like a load balancer
    • Two Popular Examples:
      • DNX: Distributed Nagios eXecutor
      • Mod Gearman
    • Check results and configs are all managed at the central server
  • 38. The Cluster Model – DNX
  • 39. The Cluster Model – DNX
    • DNX: How it works
      • When a check is scheduled to execute, the job is passed to a worker node
      • Worker node executes the check and sends results directly to the results queue
      • Checks are not associated with any particular worker node
      • Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
      • If a worker goes down, all checks continue
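
The job-queue pattern described above can be sketched in a single process with threads standing in for worker nodes. This is an analogy for the pattern only, not DNX's actual protocol or API, and the check names and functions are hypothetical.

```python
import queue
import threading


def run_checks(checks, worker_count):
    """Run every check on whichever worker is free; return results by check name."""
    jobs = queue.Queue()
    results = queue.Queue()
    for name, check in checks.items():
        jobs.put((name, check))

    def worker():
        # Workers pull jobs until the queue is empty. No check is tied to a
        # particular worker, so any remaining worker can pick up the slack.
        while True:
            try:
                name, check = jobs.get_nowait()
            except queue.Empty:
                return
            results.put((name, check()))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected = {}
    while not results.empty():
        name, state = results.get()
        collected[name] = state
    return collected


# Hypothetical checks returning Nagios-style return codes (0 = OK, 2 = CRITICAL).
checks = {"web01/HTTP": lambda: 0, "db01/MySQL": lambda: 2, "web02/HTTP": lambda: 0}
print(run_checks(checks, worker_count=2))
```

Because the jobs live in one shared queue, running with fewer workers still completes every check, only more slowly. The queue itself is the single point of failure, which mirrors the master-down challenge listed for this model.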
  • 44. The Cluster Model – DNX
    • DNX: Strengths:
      • Central configuration management
      • Checks redistributed if a worker is down
      • Worker nodes can be added at any time
    • Challenges:
      • Performance data is still handled at the central server
      • If the master goes down, all checks cease
  • 48. The Cluster Model – Mod Gearman
  • 49. The Cluster Model – Mod Gearman
    • Strengths:
      • Central configuration management
      • Checks can be split by hostgroups or servicegroups, which is useful if groups are located in different network segments
    • Challenges:
      • Performance data is still handled at the central server
      • If the master goes down, all checks cease
      • Effectively viewing more than 10k services on a single machine
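
Splitting checks by hostgroup amounts to a routing decision made before a job is queued. The sketch below shows that decision in isolation. The queue names, the `hostgroups` field, and the group names are illustrative placeholders, not necessarily what Mod Gearman uses internally.

```python
def pick_queue(check, dedicated_hostgroups, default_queue="service"):
    """Return the queue a check should be placed on.

    A check whose host belongs to a dedicated hostgroup goes to that group's
    queue, so only workers inside that network segment pick it up.
    """
    for group in check["hostgroups"]:
        if group in dedicated_hostgroups:
            return f"hostgroup_{group}"
    return default_queue


# Hypothetical checks and groups, for illustration only.
dedicated = {"dmz", "branch-office"}
checks = [
    {"name": "web01/HTTP", "hostgroups": ["linux", "dmz"]},
    {"name": "db01/MySQL", "hostgroups": ["linux"]},
]
for check in checks:
    print(check["name"], "->", pick_queue(check, dedicated))
```

Workers behind a firewall would then subscribe only to their own group's queue, while general-purpose workers drain the default one.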
  • 53. The Central Dashboard Model
    • Checks are executed and managed on multiple distributed servers
    • Central viewer unifies all servers
    • Central viewer polls data from each server and displays tactical data in the UI
    • Examples:
  • 59. The Central Dashboard Model
  • 60. The Central Dashboard Model: Nagios Fusion
    • Displays tactical overview for each server
    • Monitoring and object configurations compartmentalized to each server
    • Good for geographically distributed servers where local management is required
    • Unified login for all XI servers (basic auth still required for Core machines)
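
A central dashboard of this kind reduces each server's raw status data to a tactical summary. The sketch below shows that aggregation step only; polling is replaced with in-memory sample data, and the server names and data layout are assumptions rather than Nagios Fusion's real interface.

```python
from collections import Counter


def tactical_overview(polled):
    """Summarize service states per server and across all servers.

    `polled` maps a server name to the list of service states it reported.
    """
    per_server = {server: Counter(states) for server, states in polled.items()}
    combined = Counter()
    for counts in per_server.values():
        combined.update(counts)
    return per_server, combined


# Sample data standing in for the results of polling two monitoring servers.
polled = {
    "nagios-east": ["OK", "OK", "WARNING", "CRITICAL"],
    "nagios-west": ["OK", "OK", "OK"],
}
per_server, combined = tactical_overview(polled)
for server, counts in per_server.items():
    print(server, dict(counts))
print("all servers", dict(combined))
```

The dashboard handles only counts, never check execution, which is consistent with the low CPU usage claimed for this model.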
  • 64. The Central Dashboard Model: Nagios Fusion
    • Strengths:
      • Easy to add new servers
      • User-level control of server views
      • High level overview
      • Very little CPU usage
      • Commercial solution with support
    • Challenges:
      • Not a monitoring solution by itself
      • Free 60 day trial, requires a license
  • 70. The Central Dashboard Model: Nagios Fusion
  • 71. The Central Dashboard Model: MNTOS
  • 72. The Central Dashboard Model: Multisite
  • 73. Single Server – Distributed Parts
    • Not all environments require check distribution
      • Offload NDOUtils (DB backend) to a different machine
      • Offload performance data processing to a different machine
      • Mount disk I/O intensive files to a RAM disk
    • A Nagios Core install can run between 10k and 20k checks depending on what is being checked and how it is configured
  • 77. Where To Go From Here?
    • Future of Distributed Monitoring?
      • Improved information viewing instead of just raw data
      • Aggregated reporting and statistics
      • Business process views and monitoring
    • What do you, as admins, need to see in this area of software development?
  • 81. Conclusion
    • Pick the right setup for your environment
    • Any of these models can be mixed and combined
    • PLAN before implementation:
      • Plan for efficient maintenance
      • An environment with 250k services overseen by a single server took almost an entire year of planning and implementation to do right
      • Environments can scale even larger with the right logistics planning in place
  • 86. Conference Resources
    • Daniel Wittenberg: “Scaling Nagios At A Giant Insurance Company” @2pm Thursday
      • 35,000 hosts and 1.4 million services
    • Mike Weber: “Reducing Server Load with Mod Gearman” @10:30am Friday
    • Dave Williams: Author of DNX
