Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06
Ganglia & Nagios
Ganglia.. what?
Ganglia – cluster / group of neurons found outside
the central nervous system
Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics
Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time the service behaves correctly before it fails
MTBF (Mean Time Between Failures)
The average time between consecutive failures of the service
A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
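A worked example of the formula above, as a minimal Python sketch. The function name and the numbers are made up for illustration; only the formula comes from the slide.

```python
def availability(mttf_h: float, mttd_h: float, mttr_h: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR (all in hours)."""
    mtbf_h = mttf_h + mttd_h + mttr_h
    return mttf_h / mtbf_h


# Hypothetical service: runs 720 h (30 days) between failures on average,
# needs 0.5 h to diagnose and 1.5 h to repair each failure.
a = availability(mttf_h=720.0, mttd_h=0.5, mttr_h=1.5)
print(f"availability: {a:.4%}")
```

Note that shortening MTTD (faster detection and diagnosis, which is what monitoring buys you) raises A just as much as shortening MTTR by the same amount.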
What should we monitor?
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
Think dependencies!
When an outage hits us – don't panic!
- Notifications
- Escalations
  L1 <-> L2 <-> L3 <-> L4 lol ;)
  desktop support / devs / ops / networking / storage / middleware / dc / security
- Clock is ticking – it should be simple
- What if a cell phone is offline or someone is out?
Monitoring: notification issues
- false positives
- major events
- failover notifications?
- tolerance & critical thresholds
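One way to picture warning/critical thresholds plus a tolerance against false positives, as a minimal Python sketch. The return codes 0/1/2/3 are the standard Nagios plugin states; the function names, the threshold values and the "3 bad results in a row" rule are assumptions for illustration (Nagios itself does something comparable with soft/hard states and max_check_attempts).

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin return codes


def classify(value, warn, crit):
    """Map a 'higher is worse' metric (e.g. load) to a Nagios state."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def should_notify(states, max_attempts=3):
    """Tolerance: notify only after max_attempts consecutive non-OK results."""
    recent = states[-max_attempts:]
    return len(recent) == max_attempts and all(s != OK for s in recent)


# Hypothetical load samples; a single spike does not page anyone.
samples = [0.5, 9.0, 1.0, 9.5, 9.7, 9.9]
states = [classify(v, warn=4.0, crit=8.0) for v in samples]
print(states, should_notify(states[:4]), should_notify(states))
```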
Monitoring: reporting
- baseline
- correlation between incidents and change management
- trending info
- reporting
Monitoring: good practices
- don't NIH!
- DVCS
- testing envs
- think usability!
- passive checks
- automate – don't hardcode
- security
Last but not least...
“Quis custodiet ipsos custodes?”
(Who will guard the guards?)
Nagios recap
Host / Services / Contacts
- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and services dependencies
- regular expressions
Checks and states
- frequencies & thresholds
- scheduling downtimes
- outages and flapping
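To make "flapping" concrete: a service flaps when its state changes too often within the recent check history. The Python sketch below is a simplified illustration, not Nagios's actual algorithm (Nagios weights recent state changes more heavily); the function names and the 20% / 50% thresholds are assumptions.

```python
def percent_state_change(history):
    """Unweighted share of state transitions between consecutive checks."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def is_flapping(history, was_flapping=False, low=20.0, high=50.0):
    """Hysteresis: start flapping at `high`, stop only once below `low`."""
    pct = percent_state_change(history)
    if was_flapping:
        return pct >= low
    return pct >= high


# 0 = OK, 2 = CRITICAL; the first history bounces on every check.
print(is_flapping([0, 2, 0, 2, 0, 2, 0]))
print(is_flapping([0, 0, 0, 0, 2, 2, 2]))
```

The two thresholds keep a service from bouncing in and out of the flapping state itself, which would only generate more notifications.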
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods
Monitoring remotes
- NRPE daemons
- checks via SSH
Web interface – tactical overview
Web interface – availability reports
Web interface – trends
Web interface – network maps
Networking recap
Unicast
Multicast
Broadcast
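A quick way to tell the three delivery modes apart by destination address, sketched with Python's standard ipaddress module. The function name and the example network are assumptions; 239.2.11.71 is used as the multicast example because it is the group a stock gmond configuration commonly joins, so treat it as an example value rather than a given for your setup.

```python
import ipaddress


def delivery_mode(dst, network):
    """Classify an IPv4 destination as unicast, multicast or broadcast."""
    addr = ipaddress.ip_address(dst)
    net = ipaddress.ip_network(network)
    if addr.is_multicast:  # 224.0.0.0/4
        return "multicast"
    if addr == net.broadcast_address or str(addr) == "255.255.255.255":
        return "broadcast"
    return "unicast"


for dst in ("192.168.1.10", "239.2.11.71", "192.168.1.255"):
    print(dst, delivery_mode(dst, "192.168.1.0/24"))
```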
Ganglia – what is it?
Problems of big scale:
20k hosts with a zillion metrics polled every 10 seconds
It is fully redundant (until you break it)
It is very scalable
Ad hoc regexp searches and view creation :)
Ganglia – architecture
Ganglia – topologies
Default multicast topology
Deaf / mute multicast topology
Unicast topology
Gmetad topology
Gmetad HA topology (active - active)
Gmetad hierarchical topology
Ganglia – RRDcached
Ganglia – sFlow
Ganglia – web (grid view)
Ganglia – web (cluster view)
Ganglia – web (physical view)
Ganglia – web (host view)
Ganglia – web (compare hosts)
Ganglia – web (events)
Events have a JSON-based API
Think – integration with whatever app :)
Ganglia – web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views
Ganglia – web (graphs)
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
- gmetric
- gmetric4j / java
- Which to choose? gmetric / python / c/c++?
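For the Python route, a gmond module is a plain Python file exposing metric_init() and metric_cleanup(), where metric_init() returns a list of metric descriptors whose call_back functions gmond polls. The sketch below follows that interface; the metric itself (spool_files), the spool_dir parameter and all values in the descriptor are made up for illustration.

```python
import os

_spool_dir = "/tmp"  # hypothetical default, overridden via module params


def spool_files(name):
    """Callback: gmond passes the metric name and expects the current value."""
    try:
        return len(os.listdir(_spool_dir))
    except OSError:
        return 0


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    global _spool_dir
    _spool_dir = params.get("spool_dir", _spool_dir)
    return [{
        "name": "spool_files",
        "call_back": spool_files,
        "time_max": 60,
        "value_type": "uint",
        "units": "files",
        "slope": "both",
        "format": "%u",
        "description": "Number of entries in the spool directory",
        "groups": "demo",
    }]


def metric_cleanup():
    """Called once when gmond shuts down; nothing to release here."""
    pass


if __name__ == "__main__":
    # Stand-alone smoke test, handy before wiring the module into gmond.
    for d in metric_init({"spool_dir": "."}):
        print(d["name"], d["call_back"](d["name"]))
```

A rough rule of thumb for the question above: gmetric is the quickest for one-off or cron-driven values, a Python module suits metrics gmond should poll itself, and C/C++ is worth it mainly when overhead matters.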
Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to Ganglia (via gmetric)
- yup – based on specific log formats
- yet still – open source, so poke around ;)
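The general idea (parse a batch of log lines, then push counters through gmetric) can be sketched in a few lines of Python. This is an independent illustration, not ganglia-logtailer's code: the regex assumes Apache/nginx "combined"-style lines, and the metric names and the helper functions are hypothetical. The commands are only built, not executed.

```python
import re
from collections import Counter

# Assumes "combined"-style access log lines; adjust the regex to your format.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_statuses(lines):
    """Count responses per status class (2xx, 4xx, ...) in a batch of lines."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)[0] + "xx"] += 1
    return counts


def gmetric_commands(counts, prefix="http"):
    """Build one gmetric invocation per counter (argv lists, not executed)."""
    return [
        ["gmetric", "--name", f"{prefix}_{cls}", "--value", str(n),
         "--type", "uint32", "--units", "requests"]
        for cls, n in sorted(counts.items())
    ]


sample = [
    '127.0.0.1 - - [06/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "curl"',
    '127.0.0.1 - - [06/Apr/2014:10:00:01 +0200] "GET /x HTTP/1.1" 404 0 "-" "curl"',
]
for argv in gmetric_commands(count_statuses(sample)):
    print(" ".join(argv))
```

To actually submit the values, each argv list could be handed to subprocess.run() on a host where gmetric is installed.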
So... Nagios + Ganglia!
Three ways to integrate:
- ganglia-web/nagios (PHP & bash based)
  https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
  https://github.com/ganglia/ganglia-nagios-bridge
- check_ganglia_metric (Python)
  https://github.com/ganglia/ganglia_contrib
Nagios + Ganglia: ganglia-web/nagios
https://github.com/ganglia/ganglia-web
Sending Nagios Data to Ganglia
service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts
Nagios pulls info from Ganglia via HTTP
Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run from e.g. crontab
- pulls data from Ganglia XML via sockets
- parses the XML
- sends data to Nagios
- Nagios only handles passive checks here
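The flow in the list above (read Ganglia XML, parse it, hand results to Nagios as passive checks) might look roughly like this in Python. It is a sketch of the idea, not the bridge's actual implementation: the XML is a trimmed, hand-written sample instead of a socket read, the service description is assumed to equal the metric name, the thresholds are invented, and the output uses Nagios's PROCESS_SERVICE_CHECK_RESULT external command format as one possible delivery path.

```python
import time
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample of what gmond/gmetad serve on their XML port.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
 <CLUSTER NAME="web">
  <HOST NAME="host.example.com" IP="192.0.2.10" REPORTED="1396778395">
   <METRIC NAME="cpu_idle" VAL="3.5" TYPE="float" UNITS="%"/>
   <METRIC NAME="load_one" VAL="0.4" TYPE="float" UNITS=""/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""


def passive_results(xml_text, metric, warn_below, crit_below, now=None):
    """Turn one Ganglia metric into Nagios passive service check results."""
    now = int(time.time()) if now is None else now
    lines = []
    for host in ET.fromstring(xml_text).iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") != metric:
                continue
            value = float(m.get("VAL"))
            # "Lower is worse" metric: 2 = CRITICAL, 1 = WARNING, 0 = OK
            state = 2 if value < crit_below else 1 if value < warn_below else 0
            lines.append(
                f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{host.get('NAME')};"
                f"{metric};{state};{metric}={value}"
            )
    return lines


for line in passive_results(SAMPLE_XML, "cpu_idle", warn_below=20, crit_below=5):
    print(line)
```

One fetch of the XML covers every host and metric in the grid, which is why this approach scales better than running one active check per host.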
Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py \
    --gmetad_host=gmetad-server.example.com \
    --metric_host=host.example.com --metric_name=cpu_idle
Nagios + Ganglia
Which integration should I use?
Seriously – try them yourself and test
Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general
sources?
- “Monitoring with Ganglia” book
- also nagios.org
- and “Web Operations” book
- plus some experience ;)
Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
Thank you :)
