Icinga 2012 at ACOnet on 6th TF-NOC Meeting
1. Monitoring @ ACOnet
Robert Wein, ACOnet NOC
TF-NOC, Dublin, 2012-06-05
Tuesday, 5 June 2012
2. ACOnet
■ ACOnet is the Austrian NREN, connecting
■ (all) Universities & Academies
■ Colleges & Research Institutes
■ Austrian School Network (edunet), Dormitories
■ Museums, educational and cultural institutions
■ Hospitals
■ Ministries, Federal Agencies
■ Federal Chancellery, Presidential Offices
■ Provincial Government and Administration
■ …
■ Legal Entity & Management: University of Vienna
■ Operation: UniVie + other universities, fiber backbone by telco
4. Vienna Internet Exchange (VIX)
■ neutral and not-for-profit IXP
■ founded 1996
■ 107 participants (different AS-Numbers)
■ 65 Gbps peak traffic in May 2012
■ redundant setup - 2 sites
5. Monitoring status December 2010
■ Nagios/Cacti
■ integration in configuration authority database (ACOnetDB)
■ integration in web-portal
■ (intensive) use of check_rrd
■ outsourced maintenance and development, together with UniVie Campus
■ troubles
■ check_rrd causes high I/O load
■ integration of new platform in backbone (Cisco ASR9k)
■ high CPU load from SNMP on Catalysts due to polling for both values/thresholds and statistics
■ outsourced maintenance and development
■ flow sampling to Arbor boxes
■ VIX: additional sFlow sampling, "VIXflow"
6. new monitoring setup
■ Icinga
■ Nagios fork
■ Developer@ACOnet-Team
■ pnp4nagios
■ takes perfdata and puts it into rrds
■ check_mk
■ keeping inventory
■ generates Icinga-config
■ one active check for one device
■ python - just a small job to write your own checks :)
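To illustrate how small such a Python check can be, here is a minimal sketch. It is not taken from the ACOnet setup: the function name, the thresholds and the metric name `temp` are hypothetical. It only assumes the usual Nagios/Icinga state codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a return value of state, text and performance data, roughly in the style of a check_mk check function.

```python
# State codes follow the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(reading_c, warn_c=60.0, crit_c=70.0):
    """Hypothetical check: classify one temperature reading against thresholds.

    Returns (state, text, perfdata); perfdata is a list of
    (name, value, warn, crit) tuples that a grapher could store in an RRD.
    """
    if reading_c is None:
        return UNKNOWN, "no temperature reading", []

    perfdata = [("temp", reading_c, warn_c, crit_c)]
    if reading_c >= crit_c:
        state = CRITICAL
    elif reading_c >= warn_c:
        state = WARNING
    else:
        state = OK
    text = "temperature %.1f C (warn %.1f, crit %.1f)" % (reading_c, warn_c, crit_c)
    return state, text, perfdata


if __name__ == "__main__":
    print(check_temperature(45.0))
    print(check_temperature(72.5))
```

The same reading feeds both the threshold decision and the performance data, so no second query is needed for graphing.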
7. Monitoring@ACOnet
■ integration
■ ACOnet Database/VIX Database
■ configuration authority
■ dispatcher writes dictionaries for check_mk and calls check_mk to generate the config
■ display of statistics in portal (per participant)
■ weathermap (standalone php)
■ display of relevant status data/checkresults in portal
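The dispatcher step can be pictured roughly as follows. This is a sketch under stated assumptions, not the ACOnet dispatcher: the record fields, the variable name `device_parameters` and the example hosts are all hypothetical, and the slides do not say which variables check_mk is given. It shows only the first half of the step (turning database records into a Python dictionary written as text); the subsequent call to check_mk is left out.

```python
import ast
import pprint


def render_dictionary(variable_name, devices):
    """Render device records as a Python assignment, keyed by hostname.

    `devices` stands in for rows read from the configuration authority
    database; the field names used here are assumptions for illustration.
    """
    by_host = {
        d["hostname"]: {"address": d["address"], "role": d["role"]}
        for d in devices
    }
    return "%s = %s\n" % (variable_name, pprint.pformat(by_host))


# Hypothetical records (documentation addresses from 192.0.2.0/24).
devices = [
    {"hostname": "router1.example.net", "address": "192.0.2.1", "role": "backbone"},
    {"hostname": "switch1.example.net", "address": "192.0.2.2", "role": "vix"},
]

rendered = render_dictionary("device_parameters", devices)

# Round trip: the generated text parses back into the same dictionary.
parsed = ast.literal_eval(rendered.split(" = ", 1)[1])

if __name__ == "__main__":
    print(rendered)
```

Keeping the database as the single source and regenerating the file on every change means the monitoring configuration is never edited by hand.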
8. Monitoring@ACOnet
■ characteristics
■ one active check per device
■ results used in many passive checks
■ SNMPv2 (except older power-measurement-devices)
■ no traps
■ perfdata in RRDs
■ OID cache
■ SNMPv2 bulkwalks
■ ido2db - postgresql
■ one poll for statistics and threshold decision
■ use of rrdcached speeds up the whole thing
■ Icinga classic UI
■ two monitoring hosts at different locations
■ dedicated hardware for monitoring
■ commodity HP hardware
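The idea of one poll feeding many passive checks can be sketched like this. It is an illustration, not the mechanism check_mk uses internally: the function, host and service names and the thresholds are hypothetical. The line format is the Nagios/Icinga 1.x external command `PROCESS_SERVICE_CHECK_RESULT`, with performance data appended after the `|` in the plugin output.

```python
def passive_results(host, snapshot, thresholds, now):
    """Turn one polled snapshot into one passive check result per service.

    `snapshot` maps service name to the polled value, `thresholds` maps
    service name to (warn, crit), `now` is a Unix timestamp.
    """
    lines = []
    for service, value in sorted(snapshot.items()):
        warn, crit = thresholds[service]
        if value >= crit:
            state = 2  # CRITICAL
        elif value >= warn:
            state = 1  # WARNING
        else:
            state = 0  # OK
        # Same value serves the threshold decision and the perfdata.
        output = "%s=%s | %s=%s;%s;%s" % (service, value, service, value, warn, crit)
        lines.append(
            "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s"
            % (now, host, service, state, output)
        )
    return lines


# Hypothetical values from a single poll of one device.
snapshot = {"cpu": 95, "temp": 40}
thresholds = {"cpu": (80, 90), "temp": (60, 70)}
lines = passive_results("router1", snapshot, thresholds, 1338854400)

if __name__ == "__main__":
    for line in lines:
        print(line)
```

Because the device is queried once per interval, the SNMP load on the device does not grow with the number of services derived from that poll.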
10. Monitoring@ACOnet
■ what do we check/graph
■ traffic/packets/errors/discards
■ CoS (QoS) - basis for cost sharing model
■ module status
■ BGP
■ incl. Prefix count
■ @Cisco ASR9K also IPv6
■ ICMP RTT in v4 and v6
■ Memory/CPU usage
■ temperatures
■ DOM
■ .....
■ @VIX
■ power consumption (for billing of RUs)
■ bird BGP-daemon
■ special: Proxy ARP check
11. Monitoring@ACOnet
■ Enhancements
■ ASR9k integrated
■ checks and statistics in < 45 s per device
■ check latency > 200 s when using Cacti/Nagios
■ less CPU consumed from SNMP on monitored devices
■ load on monitoring host between 0.3 and 0.9
■ compared to 5 (Nagios/Cacti)
■ VIX routeserver (bird) monitoring established
■ reduced IO-load due to rrdcached
■ easy (?) implementation of new checks
■ advantages of Icinga
■ active development
■ e.g. flexible downtime, multiple acknowledgements, …
■ easy bringing in of new ideas :)