Monitoring with Nagios and Ganglia

How can one build a sophisticated, open-source monitoring solution that is highly scalable and easy to deploy?

I gave this talk at one of the biggest Linux conferences in Poland, the 11th Linux Session (11. Sesja Linuksowa), which took place in Wrocław on 5–6 April 2014.

Presentation Transcript

  • Ganglia & Nagios
    Maciej Lasyk, 11. Sesja Linuksowa, Wrocław, 2014-04-06
  • Ganglia... what?
    Ganglia: clusters (groups) of neurons found outside the central nervous system
  • Just a little about monitoring
    - the need for monitoring
    - measuring availability
    - measuring performance
    - gathering additional metrics
  • Monitoring is critical for HA
    How to measure availability?
    A = Uptime / (Uptime + Downtime)
    - MTTD (Mean Time to Diagnose): the average time it takes to diagnose a problem
    - MTTR (Mean Time to Repair): the average time it takes to fix a problem
    - MTTF (Mean Time to Failure): the average time the service behaves correctly
    - MTBF (Mean Time Between Failures): the average time between successive failures of the service
    A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
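The availability formula on this slide can be worked through numerically. The snippet below is only a sketch: the function name `availability` and the sample figures (720 h of correct behaviour, 0.5 h to diagnose, 1.5 h to repair) are made up for illustration and do not come from the talk.

```python
def availability(mttf_hours: float, mttd_hours: float, mttr_hours: float) -> float:
    """Return A = MTTF / (MTTF + MTTD + MTTR) as a fraction between 0 and 1."""
    # MTBF is the full cycle: correct behaviour, then diagnosis, then repair.
    mtbf_hours = mttf_hours + mttd_hours + mttr_hours
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return mttf_hours / mtbf_hours


# Hypothetical figures, chosen only to illustrate the arithmetic.
a = availability(720.0, 0.5, 1.5)
print(f"availability: {a:.4%}")
```

Because MTTD and MTTR both sit in the denominator, shortening diagnosis time raises availability just as much as shortening repair time.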
  • What should we monitor?
    - hardware housing
    - devices
    - storage
    - network
    - hosts
    - software (a very deep hole)
    Think dependencies!
  • When an outage hits us – don't panic!
    - Notifications
    - Escalations: L1 <-> L2 <-> L3 <-> L4 lol ;) (desktop support / devs / ops / networking / storage / middleware / dc / security)
    - The clock is ticking – it should be simple
    - What if a cell phone is offline or someone is out?
  • Monitoring: notification issues
    - false positives
    - major events
    - failover notifications?
    - tolerance & critical thresholds
  • Monitoring: reporting
    - baseline
    - correlation between incidents and change management
    - trending info
    - reporting
  • Monitoring: good practices
    - don't NIH (Not Invented Here)!
    - DVCS
    - testing envs
    - think usability!
    - passive checks
    - automate – don't hardcode
    - security
    Last but not least... “Quis custodiet ipsos custodes?” (Who will guard the guards?)
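One way to read "automate – don't hardcode" is to generate Nagios object definitions from an inventory instead of writing them by hand. The sketch below assumes a hypothetical inventory dictionary and a host template named `generic-host` that supplies all remaining directives; neither comes from the talk.

```python
# Nagios object definition for one host; the template named in "use" is
# assumed to provide check intervals, contacts and other required settings.
HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""


def render_hosts(inventory, template="generic-host"):
    """Render one 'define host' block per inventory entry, sorted by host name."""
    blocks = []
    for name, address in sorted(inventory.items()):
        blocks.append(HOST_TEMPLATE.format(template=template, name=name, address=address))
    return "\n".join(blocks)


# Hypothetical inventory; in practice this could come from a CMDB or a DVCS-tracked file.
inventory = {"web01": "10.0.0.11", "db01": "10.0.0.21"}
print(render_hosts(inventory))
```

Keeping the inventory and the generator in version control covers the DVCS point from the same slide.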
  • Nagios recap: Hosts / Services / Contacts
    - hosts, host groups
    - services, service groups
    - templates
    - time periods
    - host and service dependencies
    - regular expressions
  • Nagios recap: Checks and states
    - frequencies & thresholds
    - scheduling downtimes
    - outages and flapping
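Thresholds map onto the return codes that Nagios plugins use: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A minimal sketch follows, assuming a metric where higher values are worse; the function name `classify` and the sample limits are illustrative only.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warning, critical):
    """Map a measurement to a Nagios status, assuming higher values are worse."""
    if value is None:
        # No measurement available, so the state cannot be determined.
        return UNKNOWN
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


# Hypothetical load-average limits: warn at 4.0, go critical at 8.0.
for load in (1.2, 5.0, 9.3, None):
    print(load, classify(load, warning=4.0, critical=8.0))
```

A real plugin would print one line of status text and exit with the returned code.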
  • Nagios recap: Notifications
    - periods
    - groups
    - which states to be notified about?
    - escalations / rotations
    - custom notification methods
  • Nagios recap: Monitoring remotes
    - NRPE daemons
    - checks via SSH
  • Nagios recap: Web interface
    - tactical overview
    - availability reports
    - trends
    - network maps
  • Networking recap
    - Unicast
    - Multicast
    - Broadcast
  • Ganglia – what is it?
    - Built for problems of big scale: 20k hosts with a zillion metrics probed every 10 seconds
    - It is fully redundant (until you break it)
    - It is very scalable
    - Regexp searches and ad hoc creation of views :)
  • Ganglia – architecture
  • Ganglia – topologies
    - Default multicast topology
    - Deaf / mute multicast topology
    - Unicast topology
    - Gmetad topology
    - Gmetad HA topology (active-active)
    - Gmetad hierarchical topology
  • Ganglia – RRDcached
  • Ganglia – sFlow
  • Ganglia – web
    - grid view
    - cluster view
    - physical view
    - host view
    - compare hosts
    - events: events have a JSON-based API; think integration with whatever app :)
    - dashboards: create a view and apply it as a dashboard, create a dashboard from XML, generate graphs and add them to views
    - graphs
  • Ganglia – metrics
    - base / extended metrics
    - own modules
    - C / C++
    - mod_python
    - spoofing
    - gmetric
    - gmetric4j / Java
    - Which to choose? gmetric / Python / C/C++?
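Of the options on this slide, gmetric is the quickest to try, because it is a command-line tool that publishes a single value. The sketch below only builds the argument list and does not run anything. The metric name, value and spoofed host are hypothetical, and the `--spoof` value uses the colon-separated `IP:hostname` form.

```python
import shlex


def gmetric_command(name, value, metric_type="float", units="", spoof=None):
    """Build the argument list for one gmetric invocation without running it."""
    argv = ["gmetric", f"--name={name}", f"--value={value}", f"--type={metric_type}"]
    if units:
        argv.append(f"--units={units}")
    if spoof:
        # Report the metric on behalf of another host ("IP:hostname").
        argv.append(f"--spoof={spoof}")
    return argv


# Hypothetical metric published on behalf of a hypothetical worker host.
cmd = gmetric_command("queue_depth", 42, metric_type="uint32", units="jobs",
                      spoof="10.0.0.15:worker15.example.com")
print(shlex.join(cmd))
```

On a host with gmond configured, the list could be passed to `subprocess.run(cmd, check=True)`. For metrics collected at high frequency, a gmond module (mod_python or C) avoids spawning a process per sample.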
  • Ganglia and logfiles? ganglia-logtailer
    - https://bitbucket.org/maplebed/ganglia-logtailer
    - parses logfiles (in real time)
    - pushes data to Ganglia (via gmetric)
    - yup – it relies on specific log formats
    - still, it is open source, so poke around ;)
  • So... Nagios + Ganglia! 3 ways of integration:
    - ganglia-web/nagios (PHP & bash based): https://github.com/ganglia/ganglia-web
    - ganglia-nagios-bridge (Python & cron based): https://github.com/ganglia/ganglia-nagios-bridge
    - check-ganglia-metric (Python): https://github.com/ganglia/ganglia_contrib
  • Nagios + Ganglia: ganglia-web/nagios
    https://github.com/ganglia/ganglia-web
    Nagios pulls info from Ganglia via HTTP.
    Sending Nagios data to Ganglia: service_perfdata_command
    Or replace Nagios checks with Ganglia:
    - check heartbeat
    - check a single metric on a specific host
    - check multiple metrics on a specific host
    - check multiple metrics across a regex-defined range of hosts
  • Nagios + Ganglia: ganglia-nagios-bridge
    - https://github.com/ganglia/ganglia-nagios-bridge
    - Python script run e.g. from crontab
    - pulls data from Ganglia XML via sockets
    - parses the XML
    - sends data to Nagios
    - Nagios only processes the results as passive checks
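To show what a passive check result looks like, the sketch below formats a `PROCESS_SERVICE_CHECK_RESULT` line for the Nagios external command file. This is a general Nagios mechanism and not necessarily how ganglia-nagios-bridge itself hands results over. The host name, service description and timestamp are hypothetical.

```python
import time


def passive_service_result(host, service, status, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT line for the Nagios external command file."""
    if status not in (0, 1, 2, 3):
        raise ValueError("status must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())
    # Fields are separated by semicolons and the command ends at the newline,
    # so both are replaced in the free-text output as a precaution.
    clean_output = output.replace(";", ",").replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{status};{clean_output}\n"


# Hypothetical result for a metric that crossed its warning threshold.
line = passive_service_result("host.example.com", "cpu_idle", 1,
                              "WARNING - cpu_idle is 8.5%", timestamp=1396771200)
print(line, end="")
```

The service must exist in the Nagios configuration with passive checks enabled, otherwise the submitted result is ignored.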
  • Nagios + Ganglia: check_ganglia_metric
    - https://pypi.python.org/pypi/check_ganglia_metric/
    - basically a Nagios plugin
    - pulls data from Ganglia XML via sockets
    - check_ganglia_metric.py --gmetad_host=gmetad-server.example.com --metric_host=host.example.com --metric_name=cpu_idle
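"Pulls data from Ganglia XML via sockets" refers to the XML dump that gmetad (and gmond) write to any TCP client that connects. The sketch below shows the idea and is not the plugin's actual code. The port default of 8651 assumes gmetad's usual XML port, and the embedded sample document and function names are illustrative.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_ganglia_xml(host, port=8651, timeout=10.0):
    """Read the XML dump a Ganglia daemon writes to each TCP client, until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    # Return bytes so the XML parser can honour the declared encoding.
    return b"".join(chunks)


def metric_value(xml_data, metric_host, metric_name):
    """Return the VAL attribute of one metric as a string, or None if it is absent."""
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        if host.get("NAME") != metric_host:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return metric.get("VAL")
    return None


# Trimmed, hand-written sample in the shape of a gmetad dump (illustrative only).
SAMPLE = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
<GRID NAME="example"><CLUSTER NAME="web">
<HOST NAME="host.example.com" IP="10.0.0.5">
<METRIC NAME="cpu_idle" VAL="93.5" TYPE="float" UNITS="%"/>
</HOST></CLUSTER></GRID></GANGLIA_XML>"""

print(metric_value(SAMPLE, "host.example.com", "cpu_idle"))
```

Against a live daemon the call would be `metric_value(fetch_ganglia_xml("gmetad-server.example.com"), "host.example.com", "cpu_idle")`, and the returned string can then be converted and compared against warning and critical thresholds.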
  • Nagios + Ganglia: which integration should I use?
    Seriously – try them yourself and test
  • Freenode #ganglia
    https://lists.sourceforge.net/lists/listinfo/ganglia-general
  • Sources?
    - “Monitoring with Ganglia” book
    - also nagios.org
    - and the “Web Operations” book
    - plus some experience ;)
  • Thank you :)
    Maciej Lasyk, http://maciek.lasyk.info/sysop, maciek@lasyk.info, @docent-net