Monitoring with Nagios and Ganglia

How can one build a sophisticated, open-source monitoring solution that is highly scalable and easy to deploy?

I gave this talk at one of the biggest Linux conferences in Poland: the 11th Linux Session (11. Sesja Linuksowa), which took place in Wrocław on 5-6 April 2014.

Monitoring with Nagios and Ganglia

  1. Ganglia & Nagios. Maciej Lasyk, 11. Sesja Linuksowa, Wrocław, 2014-04-06
  2. Ganglia... what? Ganglia: a cluster or group of neurons found outside the central nervous system
  6. Just a little about monitoring
     - the need for monitoring
     - measuring availability
     - measuring performance
     - gathering additional metrics
  14. Monitoring is critical for HA
      How to measure availability?
      A = Uptime / (Uptime + Downtime)
      - MTTD (Mean Time to Diagnose): the average time it takes to diagnose a problem
      - MTTR (Mean Time to Repair): the average time it takes to fix a problem
      - MTTF (Mean Time to Failure): the average time the service behaves correctly
      - MTBF (Mean Time Between Failures): the average time between consecutive failures of the service
      A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
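The availability formula on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `availability` and the sample numbers are assumptions made for illustration, and all three inputs are expected in the same time unit.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """Return A = MTTF / MTBF, with MTBF = MTTF + MTTD + MTTR."""
    mtbf = mttf + mttd + mttr
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return mttf / mtbf


# Hypothetical numbers, in hours: 719 h of correct behaviour,
# then 0.5 h to diagnose and 0.5 h to repair each failure.
a = availability(mttf=719.0, mttd=0.5, mttr=0.5)
print(f"{a:.4%}")  # prints 99.8611%
```

Shortening MTTD (faster detection and diagnosis through monitoring) raises A just as much as shortening MTTR by the same amount, which is the point the slide makes about monitoring and HA.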
  16. What should we monitor?
      - hardware housing
      - devices
      - storage
      - network
      - hosts
      - software (very deep hole)
      Think dependencies!
  20. When an outage hits us: don't panic!
      - Notifications
      - Escalations: L1 <-> L2 <-> L3 <-> L4 lol ;)
        desktop support / devs / ops / networking / storage / middleware / dc / security
      - Clock is ticking: it should be simple
      - What if a cell phone is offline or someone is out?
  24. Monitoring: notification issues
      - false positives
      - major events
      - failover notifications?
      - tolerance & critical thresholds
  28. Monitoring: reporting
      - baseline
      - correlation between incidents and change management
      - trending info
      - reporting
  35. Monitoring: good practices
      - don't NIH!
      - DVCS
      - testing envs
      - think usability!
      - passive checks
      - automate, don't hardcode
      - security
  36. Monitoring: good practices
      Last but not least... “Quis custodiet ipsos custodes?” (Who will guard the guards?)
  42. Nagios recap: Host / Services / Contacts
      - hosts, hostgroups
      - services, service groups
      - templates
      - time periods
      - host and service dependencies
      - regular expressions
  47. Nagios recap: Checks and states
      - frequencies & thresholds
      - scheduling downtimes
      - outages and flapping
  52. Nagios recap: Notifications
      - periods
      - groups
      - which states to be notified about?
      - escalations / rotations
      - custom notification methods
  53. Nagios recap: Monitoring remotes
      - NRPE daemons
      - checks via SSH
  57. Nagios recap: Web interface
      - tactical overview
      - availability reports
      - trends
      - network maps
  60. Networking recap
      - Unicast
      - Multicast
      - Broadcast
  61. Ganglia: what is it?
      - Built for problems of big scale: 20k hosts with a zillion metrics probed every 10 seconds
      - It is fully redundant (until you break it)
      - It is very scalable
      - Regexp searches and ad hoc creation of views :)
  63. Ganglia: architecture
  69. Ganglia: topologies
      - Default multicast topology
      - Deaf / mute multicast topology
      - Unicast topology
      - Gmetad topology
      - Gmetad HA topology (active-active)
      - Gmetad hierarchical topology
  70. Ganglia: RRDcached
  71. Ganglia: sFlow
  76. Ganglia: web views
      - grid view
      - cluster view
      - physical view
      - host view
      - compare hosts
  77. Ganglia: web (events)
      - Events have a JSON-based API
      - Think: integration with any app :)
  78. Ganglia: web (dashboards)
      - Create view -> apply as dashboard
      - Create dashboard from XML
      - Generate graphs and add to views
  79. Ganglia: web (graphs)
  87. Ganglia: metrics
      - base / extended metrics
      - own modules
      - C / C++
      - mod_python
      - spoofing
      - gmetric
      - gmetric4j / Java
      - Which to choose? gmetric / Python / C/C++?
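For the mod_python option, a gmond Python metric module is a plain Python file that exposes `metric_init` and `metric_cleanup` and returns a list of metric descriptors. The sketch below follows that layout; the metric name `example_dir_entries`, the `path` parameter and the choice of counting directory entries are all assumptions picked only to keep the example portable.

```python
import os

# Directory to inspect; overwritten from the module parameters in metric_init.
_path = "."


def count_entries(name):
    """Callback called by gmond with the metric name; returns the current value."""
    return len(os.listdir(_path))


def metric_init(params):
    """Called once by gmond; `params` holds the module's config parameters."""
    global _path
    _path = params.get("path", ".")  # hypothetical parameter name
    return [{
        "name": "example_dir_entries",  # hypothetical metric name
        "call_back": count_entries,
        "time_max": 60,
        "value_type": "uint",
        "units": "entries",
        "slope": "both",
        "format": "%u",
        "description": "Number of entries in a directory",
        "groups": "example",
    }]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release here."""
    pass


if __name__ == "__main__":
    # Stand-alone test run, outside gmond.
    for descriptor in metric_init({"path": "."}):
        print(descriptor["name"], descriptor["call_back"](descriptor["name"]))
```

Running the file directly exercises the callback without gmond, which is a convenient way to debug a module before deploying it.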
  88. Ganglia and logfiles? ganglia-logtailer
      - https://bitbucket.org/maplebed/ganglia-logtailer
      - parses logfiles (in real time)
      - pushes data to Ganglia (via gmetric)
      - yup, it is based on specific log formats
      - still, it is open source, so poke around ;)
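The "parse a log, push via gmetric" idea can be illustrated independently of ganglia-logtailer itself. The sketch below is not how ganglia-logtailer is implemented; it assumes an access log in common log format, and the metric name `http_5xx` and both function names are made up for the example. It only builds the `gmetric` argument list and leaves running it to the caller.

```python
import re

# Matches the quoted request followed by the 3-digit status code.
STATUS_RE = re.compile(r'"[^"]*" (?P<status>\d{3})\b')


def count_5xx(lines):
    """Count log lines whose HTTP status is 5xx; unparsable lines are skipped."""
    total = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group("status").startswith("5"):
            total += 1
    return total


def gmetric_argv(name, value, value_type="uint32", units="count"):
    """Build the argument list for a gmetric call (to pass to subprocess.run)."""
    return [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={value_type}",
        f"--units={units}",
    ]


sample = [
    '127.0.0.1 - - [06/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
    '127.0.0.1 - - [06/Apr/2014:10:00:01 +0200] "GET /api HTTP/1.1" 503 0',
]
print(gmetric_argv("http_5xx", count_5xx(sample)))
```

On a host with Ganglia installed, the resulting list could be handed to `subprocess.run(argv, check=True)` on each collection interval.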
  89. So... Nagios + Ganglia! 3 ways of integration:
      - ganglia-web/nagios (PHP & bash based): https://github.com/ganglia/ganglia-web
      - ganglia-nagios-bridge (Python & cron based): https://github.com/ganglia/ganglia-nagios-bridge
      - check-ganglia-metric (Python): https://github.com/ganglia/ganglia_contrib
  91. Nagios + Ganglia: ganglia-web/nagios
      https://github.com/ganglia/ganglia-web
      - Sending Nagios data to Ganglia: service_perfdata_command
      - Or replace Nagios checks with Ganglia! Nagios pulls info from Ganglia via HTTP:
        - Check heartbeat.
        - Check a single metric on a specific host.
        - Check multiple metrics on a specific host.
        - Check multiple metrics across a regex-defined range of hosts.
  92. Nagios + Ganglia: ganglia-nagios-bridge
      - https://github.com/ganglia/ganglia-nagios-bridge
      - Python script run e.g. from crontab
      - pulls data from Ganglia XML via sockets
      - parses the XML
      - sends data to Nagios
      - Nagios only processes passive check results
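The parse-and-submit flow listed on this slide can be sketched as follows. This is not the code of ganglia-nagios-bridge; it is a minimal illustration that assumes the Ganglia XML layout (`HOST` elements containing `METRIC` elements with `NAME` and `VAL` attributes) and the Nagios external command `PROCESS_SERVICE_CHECK_RESULT`. The sample XML, the function name and the `status_for` callback are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample of a Ganglia XML dump.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
  <GRID NAME="example-grid">
    <CLUSTER NAME="web">
      <HOST NAME="host.example.com" IP="192.0.2.10">
        <METRIC NAME="cpu_idle" VAL="4.5" TYPE="float" UNITS="%"/>
        <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>"""


def passive_check_lines(xml_text, metric_name, status_for, now):
    """Turn one numeric metric per host into Nagios passive check result lines.

    status_for maps the metric value to a Nagios return code (0, 1, 2 or 3).
    """
    root = ET.fromstring(xml_text)
    lines = []
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") != metric_name:
                continue
            value = float(metric.get("VAL"))
            code = status_for(value)
            lines.append(
                f"[{now}] PROCESS_SERVICE_CHECK_RESULT;"
                f"{host.get('NAME')};{metric_name};{code};{metric_name}={value}"
            )
    return lines


# Critical (2) when the CPU is less than 10% idle, OK (0) otherwise.
for line in passive_check_lines(SAMPLE_XML, "cpu_idle",
                                lambda v: 2 if v < 10 else 0, now=1396778400):
    print(line)
# prints [1396778400] PROCESS_SERVICE_CHECK_RESULT;host.example.com;cpu_idle;2;cpu_idle=4.5
```

In a real setup each line would be appended to the Nagios external command file, and the service name used here would have to match a service defined in Nagios to accept passive checks.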
  93. Nagios + Ganglia: check_ganglia_metric
      - https://pypi.python.org/pypi/check_ganglia_metric/
      - basically a Nagios plugin
      - pulls data from Ganglia XML via sockets
      - check_ganglia_metric.py --gmetad_host=gmetad-server.example.com --metric_host=host.example.com --metric_name=cpu_idle
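To show what "a Nagios plugin that pulls Ganglia XML via sockets" involves, here is a hedged sketch. It is not the source of check_ganglia_metric: the function names and the "higher is worse" threshold direction are assumptions, and port 8651 is assumed to be gmetad's XML port as in a default configuration. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket
import xml.etree.ElementTree as ET

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_xml(host, port=8651, timeout=5.0):
    """Read the whole XML dump that the daemon sends before closing the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_metric(xml_data, host_name, metric_name, warning, critical):
    """Return (exit_code, message) for one numeric metric on one host."""
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        if host.get("NAME") != host_name:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") != metric_name:
                continue
            value = float(metric.get("VAL"))
            if value >= critical:
                return CRITICAL, f"CRITICAL - {metric_name}={value}"
            if value >= warning:
                return WARNING, f"WARNING - {metric_name}={value}"
            return OK, f"OK - {metric_name}={value}"
    return UNKNOWN, f"UNKNOWN - {metric_name} not found for {host_name}"
```

A plugin wrapper would call `fetch_xml("gmetad-server.example.com")`, pass the result to `check_metric`, print the message and exit with the returned code via `sys.exit`.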
  95. Nagios + Ganglia: which integration should I use?
      Seriously: try them yourself and test
  96. Freenode #ganglia, https://lists.sourceforge.net/lists/listinfo/ganglia-general
  97. Sources?
      - “Monitoring with Ganglia” book
      - also nagios.org
      - and “Web Operations” book
      - plus some experience ;)
  98. Thank you :) Maciej Lasyk, 11. Sesja Linuksowa, 2014-04-06, Wrocław, http://maciek.lasyk.info/sysop, maciek@lasyk.info, @docent-net
