Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios

Mike Guthrie's presentation on distributed monitoring solutions for Nagios. The presentation was given during the Nagios World Conference North America held Sept 27-29th, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Published in: Technology

Distributed Monitoring with Nagios: Past, Present, Future
  • Mike Guthrie [email_address]
Distributed Monitoring Introduction
  • Basic definition: splitting up your monitoring server over multiple machines
  • Why use distributed monitoring?
      • Multiple sites with firewall restrictions
      • Large installations that exceed the CPU and memory resources that a single machine can offer
Understanding CPU Limitations
  • The primary task of the Nagios Core engine is to schedule checks
  • Example monitoring server
      • 1000 hosts, 4 services per host, 5-minute check interval
      • Check load = (5000 checks / 5 minutes) / 60 seconds per minute
          • About 16.7 checks per second
          • In 1 second: about 16 scripts or binary processes are launched, and about 16 sets of results come in, are processed by Nagios, and are written to disk
          • When the check schedule exceeds CPU limitations, you get “check latency”
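The check-load arithmetic on this slide can be reproduced with a short Python sketch. It assumes the slide's 5000 checks are 4000 service checks plus one host check per host; that breakdown is not stated in the slides, and the function name and parameters are hypothetical.

```python
def checks_per_second(hosts, services_per_host, interval_minutes,
                      include_host_checks=True):
    """Average number of checks Nagios must launch each second."""
    service_checks = hosts * services_per_host
    # Assumption: each host also gets one host check per interval.
    host_checks = hosts if include_host_checks else 0
    total_checks = service_checks + host_checks
    return total_checks / (interval_minutes * 60)


if __name__ == "__main__":
    rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
    print(f"{rate:.1f} checks per second")
```

Halving the interval or doubling the host count doubles the rate, which is why check latency tends to appear suddenly as an installation grows.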
Picking the Right Distributed Model
  • Pick the right model for your environment
  • Think logistics: PLAN before implementation
      • Every hour spent planning logistics will save tens or even hundreds of man-hours later on
      • A 30-minute task on 1 server = 5 hours on 10 servers
      • Consider how to view information effectively across multiple machines
      • As data quantity increases, discerning useful information from it becomes more important
      • Viewing 10,000 hosts and 50,000 services on one page is too much raw data to be effective information
The Classic Distributed Model
  • [Diagram: distributed servers run active checks and forward results after every check to a central server (passive only)]
The Classic Distributed Model
  • Central monitoring vs. central viewing?
      • OCSP (Obsessive Compulsive Service Processor) vs. event handlers
      • OCSP runs after every check
      • Event handlers run only on state changes
  • Freshness checking ensures current data
  • Child servers can also do local monitoring without forwarding results
  • Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
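As a rough illustration of the forwarding choices above, the following Python sketch models when a distributed server would forward a result under each approach, and how a central server might decide that a passive result has gone stale. The class and function names are hypothetical. The tab-separated line follows the usual `send_nsca` input layout (host, service, return code, plugin output), but treat that as an assumption to verify against your NSCA version. The state-change rule is a simplification of how Nagios triggers event handlers.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    host: str
    service: str
    return_code: int  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    output: str


def to_send_nsca_line(result: CheckResult) -> str:
    """Format a result as one tab-separated line for a forwarder such as send_nsca."""
    return f"{result.host}\t{result.service}\t{result.return_code}\t{result.output}\n"


def should_forward(mode: str, previous_code: int, result: CheckResult) -> bool:
    """OCSP forwards after every check; an event handler only on a state change."""
    if mode == "ocsp":
        return True
    if mode == "event_handler":
        return result.return_code != previous_code
    raise ValueError(f"unknown forwarding mode: {mode}")


def is_stale(last_result_time: float, freshness_threshold: float, now: float) -> bool:
    """Central-server view: a passive result older than the threshold is stale."""
    return now - last_result_time > freshness_threshold


if __name__ == "__main__":
    result = CheckResult("web01", "HTTP", 2, "CRITICAL - connection timed out")
    if should_forward("ocsp", previous_code=2, result=result):
        print(to_send_nsca_line(result), end="")
```

The trade-off the slide hints at shows up directly: forwarding on every check keeps the central server current but multiplies traffic, while forwarding only on state changes is cheaper but makes freshness checking on the central server less meaningful.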
The Classic Distributed Model
  • Strengths:
      • Well tested, well documented, proven solution
      • All built into the Nagios Core package
      • Extremely flexible for checks, performance graphing, notifications, etc.
      • Can be combined with other distributed models
  • Challenges:
      • Maintaining configs on multiple machines
      • Which server issued the check?
      • Where to process/view performance data?
The Classic Distributed Model
  • Workarounds:
      • Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers
      • Use templating as much as possible
          • Read the Core docs on “Object Inheritance”
          • Keep template definitions separate
      • Use naming conventions to keep configs organized
      • Nagios XI distributed tools:
          • Inbound and Outbound Checks
          • Unconfigured Objects
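To show why templating reduces config maintenance, here is a minimal Python sketch of single-parent template resolution in the spirit of the `use` directive. It is a simplification, not Nagios's actual resolver: real object inheritance also supports multiple templates and additive values. The object names and values are made up for illustration, and treating `name`, `use`, and `register` as not inherited is an assumption of this sketch.

```python
# Bookkeeping directives that this sketch does not pass down to children.
NOT_INHERITED = {"name", "use", "register"}


def resolve(name, definitions):
    """Return the effective directives for an object, following its `use` chain."""
    definition = definitions[name]
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        inherited = resolve(parent, definitions)
        merged.update({k: v for k, v in inherited.items() if k not in NOT_INHERITED})
    # Directives set on the object itself override inherited ones.
    merged.update(definition)
    return merged


if __name__ == "__main__":
    definitions = {
        "generic-service": {"name": "generic-service", "register": "0",
                            "check_interval": "5", "max_check_attempts": "3"},
        "remote-service": {"name": "remote-service", "use": "generic-service",
                           "register": "0", "check_interval": "10"},
        "web01-http": {"use": "remote-service", "host_name": "web01",
                       "service_description": "HTTP", "check_command": "check_http"},
    }
    for key, value in sorted(resolve("web01-http", definitions).items()):
        print(f"{key} = {value}")
```

With templates kept in their own files, a change to a shared value such as the check interval touches one definition, which can then be synced to every distributed server.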
The Cluster Model – Nagios Load Balancing
  • Nagios checks are managed by a sub-process and distributed evenly across multiple servers
  • Works like a load balancer
  • Two popular examples:
      • DNX: Distributed Nagios eXecutor
      • Mod Gearman
  • Check results and configs are all managed at the central server
The Cluster Model – DNX
  • DNX: how it works
      • When a check is scheduled to execute, the job is passed to a worker node
      • The worker node executes the check and sends the results directly to the results queue
      • Checks are not associated with any particular worker node
      • Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
      • If a worker goes down, all checks continue
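The job-queue idea described above can be sketched in Python with threads standing in for worker nodes. This is only an in-process analogy under stated assumptions: DNX itself dispatches jobs to remote workers over the network, and `run_checks`, `execute`, and the job names here are hypothetical.

```python
import queue
import threading


def run_checks(jobs, worker_count, execute):
    """Run every job on whichever worker is free and collect the results."""
    job_queue = queue.Queue()
    result_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    def worker():
        # No job belongs to a particular worker: each one takes the next job.
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            result_queue.put((job, execute(job)))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # All workers have finished, so the result queue can be drained safely.
    results = {}
    while not result_queue.empty():
        job, outcome = result_queue.get()
        results[job] = outcome
    return results


if __name__ == "__main__":
    jobs = [f"check-{i}" for i in range(20)]
    results = run_checks(jobs, worker_count=3, execute=lambda job: 0)
    print(f"{len(results)} checks completed")
```

Because jobs are pulled from a shared queue, running with fewer workers still completes every check, only more slowly, which mirrors the point that checks continue when a worker goes down.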
The Cluster Model – DNX
  • DNX strengths:
      • Central configuration management
      • Checks are redistributed if a worker is down
      • Worker nodes can be added at any time
  • Challenges:
      • Performance data is still handled at the central server
      • If the master goes down, all checks cease
The Cluster Model – Mod Gearman
  • Strengths:
      • Central configuration management
      • Checks can be split by hostgroups or servicegroups, which is useful if groups are located in different network segments
  • Challenges:
      • Performance data is still handled at the central server
      • If the master goes down, all checks cease
      • Effectively viewing more than 10k services on a single machine
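Splitting checks by hostgroup amounts to choosing a job queue per check, so that only workers inside a given network segment pick up that segment's checks. The Python sketch below illustrates the routing decision only. The function name, the `hostgroup_` queue prefix, and the default queue name are assumptions for illustration, not a statement of Mod Gearman's configuration or API.

```python
def pick_queue(host_groups, routed_hostgroups, default_queue="service"):
    """Choose a job queue: a dedicated one for routed hostgroups, else the default."""
    for group in host_groups:
        if group in routed_hostgroups:
            # Assumed naming scheme: one queue per routed hostgroup.
            return f"hostgroup_{group}"
    return default_queue


if __name__ == "__main__":
    routed = {"dmz", "branch-office"}
    print(pick_queue(["linux", "dmz"], routed))
    print(pick_queue(["linux"], routed))
```

A worker placed behind a firewall would then subscribe only to its own hostgroup queue, while general-purpose workers serve the default queue.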
The Central Dashboard Model
  • Checks are executed and managed on multiple distributed servers
  • Central viewer unifies all servers
  • Central viewer polls data from each server and displays tactical data in the UI
  • Examples:
      • Nagios Fusion
      • MNTOS
      • check_MK Multisite
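The central-viewer idea can be sketched as a small aggregation step in Python: each monitoring server reports its own service states, and the dashboard only summarizes them. How the states are fetched from each server is deliberately left out. The function name, server names, and data shape are hypothetical and do not reflect how Fusion, MNTOS, or Multisite work internally.

```python
from collections import Counter


def tactical_overview(server_statuses):
    """Summarize service states per server and across all servers."""
    per_server = {}
    total = Counter()
    for server, states in server_statuses.items():
        counts = Counter(states)
        per_server[server] = dict(counts)
        total.update(counts)
    return per_server, dict(total)


if __name__ == "__main__":
    # Assumed input: one list of current service states per polled server.
    polled = {
        "nagios-east": ["OK", "OK", "CRITICAL"],
        "nagios-west": ["OK", "WARNING"],
    }
    per_server, total = tactical_overview(polled)
    for server, counts in per_server.items():
        print(server, counts)
    print("all servers", total)
```

Since the viewer only counts states that the monitoring servers have already computed, it needs very little CPU compared with the servers doing the checks.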
The Central Dashboard Model: Nagios Fusion
  • Displays a tactical overview for each server
  • Monitoring and object configurations are compartmentalized to each server
  • Good for geographically distributed servers where local management is required
  • Unified login for all XI servers (basic auth still required for Core machines)
The Central Dashboard Model: Nagios Fusion
  • Strengths:
      • Easy to add new servers
      • User-level control of server views
      • High-level overview
      • Very little CPU usage
      • Commercial solution with support
  • Challenges:
      • Not a monitoring solution by itself
      • Free 60-day trial, then requires a license
The Central Dashboard Model: Nagios Fusion
The Central Dashboard Model: MNTOS
The Central Dashboard Model: Multisite
Single Server – Distributed Parts
  • Not all environments require check distribution
      • Offload NDOUtils (DB backend) to a different machine
      • Offload performance data processing to a different machine
      • Mount disk I/O-intensive files on a RAM disk
      • A Nagios Core install can run between 10k and 20k checks, depending on what is being checked and how it is configured
Where To Go From Here?
  • Future of distributed monitoring?
      • Improved information viewing instead of just raw data
      • Aggregated reporting and statistics
      • Business process views and monitoring
      • What do you, as admins, need to see in this area of software development?
Conclusion
  • Pick the right setup for your environment
  • Any of these models can be mixed and combined
  • PLAN before implementation:
      • Plan for efficient maintenance
      • One environment with 250k services overseen by a single server took almost an entire year of planning and implementation to do right
      • Environments can scale even larger with the right logistics planning in place
Conference Resources
  • Daniel Wittenberg: “Scaling Nagios At A Giant Insurance Company” @ 2pm Thursday
      • 35,000 hosts and 1.4 million services
  • Mike Weber: “Reducing Server Load with Mod Gearman” @ 10:30am Friday
  • Dave Williams: Author of DNX
