SlideShare a Scribd company logo
Computing Facilities
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF
Hardware monitoring with collectd
Luca Gardi - luca.gardi@cern.ch
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Introduction
● explain the differences between Lemon and collectd
● summarize needed changes for hardware monitoring
● explain the choices made during the process
● provide a status update
● explain current issues and proposed fixes
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 1 - Lemon and collectd
Lemon
● developed by CERN
● in production since 2006 (at least)
● old monitoring infrastructure has been replaced
● retirement efforts started mid-2017
collectd
● open source project
● collects system and service metrics
● optimized to handle thousands of metrics
● modular and portable with community plugins
● easy to develop new plugins in Python/Java/C/Perl
● continuously improving and well documented
m
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 2 - Why collectd?
Pros:
● community-driven and rich ecosystem
● alarms and plugins definitions are puppet-based
● better reusability, documentation
● easier to set up for quick metric collection
● easier metric dispatch in plugins
Cons:
● alarms generated on transition
● existing plugins require re-writing
● MONIT provides a lemon-sensor wrapper but is deprecated
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 3 - HW monitoring in the Lemon era
Agent sensors:
● lemon-sensor-smart: SMART logs monitoring
● lemon-sensor-tw: 3ware RAID controllers
● lemon-sensor-megaraidsas: LSI MegaRAID controllers
● lemon-sensor-adaptec: Adaptec RAID controllers
● lemon-sensor-sasarray: JBODs monitoring
● lemon-sensor-blockdevice-drives: log parser for SCSI errors
● lemon-sensor-ipmi: IPMI monitoring
On-behalf monitoring (centralized):
● pdu-xmas: centralized out-of-band PDU monitoring (SNMPv2)
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 4 - Moving towards a collectd era
● very specific and complex needs
○ heterogeneity of hardware and configurations
○ hardware RAID controllers
○ intense use of IPMI
● no community sensors we could adopt
● good news! code can be ported from lemon sensors
● adopt TDD (Test-Driven Development)
● compatibility with python 2.4, 2.7, 3.4
● Continuous Development (CI/CD) using GitLab
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 5 - Plugin architecture
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 6 - The big migration
Collectd plugins:
● collectd-mdstat: in production (new) ✔
● collectd-smart-tests: in production ✔
● collectd-megaraidsas: in QA ✔
● collectd-sasarray: in development
● collectd-blockdevices: in pipeline
● collectd-adaptec: in pipeline
● mcelog: from the community
Centralized monitoring:
● CINNAMON: in production ✔ (requires minor changes)
● PODIUM: in development
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 7 - Plugin development workflow
● identify output metrics and write the tests
● write the plugin
● if tests.color == green: plugin.puppet_deploy()
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 8 - Plugin deployment workflow
● RPM packaging and repositories using Koji
● Collectd plugin definition on Puppet
○ it-puppet-module-cerncollectd_contrib on GitLab
● standard CERN CRM QA -> Production pipeline (1 week)
● deployed on physical machines
○ it-puppet-module-hardware: physical.pp
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF 9 - Alarms
● based on standard collectd Threshold plugin
● checks local metrics against defined thresholds
● states: OK, WARNING, FAILURE
● puppet defined (metricmgr is already read-only)
● Service Managers can override thresholds and SNOW targets
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF
● finish porting of the sensors to collectd
● start retirement of old lemon sensors
● too many tickets:
○ fine tuning of the alarms is necessary
○ waiting for better SNOW tickets deduplication
● tickets are not very descriptive:
○ a pull request has been sent to the upstream community
● no lemon-host-check
○ do we need it?
○ is collectdctl enough?
10 - What’s left and current issues
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF
● collectd provides a mature environment for HW monitoring
● using puppet for alarms definition is definitely a plus for
versioning and maintenance, compared to metricmgr
● after an initial series of delays, mainly due to our early adoption,
we are now more than half-way there and progressing steadily
● targeting end of the year for finishing the migration
● a good occasion for collaboration with IT-CM-MM
11 - Conclusions
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Hardware monitoring with collectd
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Backup slides - Plugin definition
class cerncollectd_contrib::plugin::mdstat (
Integer $interval,
String $mdstat_path,
) {
require ::cerncollectd_contrib
package { 'collectd-mdstat':
ensure => present,
}
collectd::plugin::python::module { 'collectd_mdstat':
ensure => present,
config => [{
'INTERVAL' => $interval,
'MDSTAT_PATH' => $mdstat_path,
}],
require => Package['collectd-mdstat'],
}
}
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Backup slides - Alarm definition
class cerncollectd_contrib::alarm::mdstat_wrong (
Integer $failure_max,
Integer $hits,
Boolean $persist,
Boolean $interesting,
Optional[Hash] $custom_targets,
Optional[String] $actuator,
) {
::cerncollectd::alarms::threshold::plugin {'mdstat_wrong':
plugin => 'mdstat',
type => 'disk_error',
failure_max => $failure_max,
hits => $hits,
persist => $persist,
interesting => $interesting,
}
::cerncollectd::alarms::extra {'mdstat_wrong':
ctd_namespace => 'mdstat',
targets => $custom_targets,
actuator => $actuator,
}
}
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Backup slides - Plugin deployment
if (versioncmp($::operatingsystemmajrelease,'6') >= 0) or (versioncmp($::operatingsystemmajrelease,'7') >= 0){
# Software RAID failures (see target in YAML file data)
include ::cerncollectd_contrib::alarm::mdstat_wrong
include ::cerncollectd_contrib::plugin::mdstat
# SMART attributes failures
include ::cerncollectd_contrib::alarm::smart_wrong
include ::cerncollectd_contrib::plugin::smart_tests
# MegaRAID failures
include ::cerncollectd_contrib::alarm::megaraidsas::bbu_status_wrong
include ::cerncollectd_contrib::alarm::megaraidsas::controller_status_wrong
include ::cerncollectd_contrib::alarm::megaraidsas::controller_correctable_errors
include ::cerncollectd_contrib::alarm::megaraidsas::controller_uncorrectable_errors
include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_faulty_bbu_wrong
include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_raid_array_wrong
include ::cerncollectd_contrib::alarm::megaraidsas::raid_array_status_wrong
include ::cerncollectd_contrib::alarm::megaraidsas::missing_drives
include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_good_drives
include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_bad_drives
include ::cerncollectd_contrib::alarm::megaraidsas::offline_drives
if (versioncmp($::operatingsystemmajrelease,'6') >= 0){
class {'::cerncollectd_contrib::plugin::megaraidsas':
lsmod_path => '/sbin/lsmod',
}
} else {
include ::cerncollectd_contrib::plugin::megaraidsas
}
}
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Backup slides - collectdctl output
● collectd namespace:
○ <hostname>/<plugin>-<plugin_instance>/<type>-<type_instance>
● listing values:
○ [root@lxfsrd08c04 ~]# collectdctl listval
lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_on_faulty_bbu/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_wrong_on_raid_array/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_memory_correctable_errors/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_memory_uncorrectable_errors/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_status/count-c0
lxfsrd08c04.cern.ch/megaraidsas-missing_drives/count
lxfsrd08c04.cern.ch/megaraidsas-offline_drives/count-c0
lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd0
lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd1
lxfsrd08c04.cern.ch/megaraidsas-unconfigured_bad_drives/count-c0
lxfsrd08c04.cern.ch/megaraidsas-unconfigured_good_drives/count-c0
● getting values:
○ [root@lxfsrd08c04 ~]# collectdctl getval lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0
value=0.000000e+00
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF Backup slides - GitLab CI/CD

More Related Content

What's hot

Closing the RISC-V compliance gap via fuzzing
Closing the RISC-V compliance gap via fuzzingClosing the RISC-V compliance gap via fuzzing
Closing the RISC-V compliance gap via fuzzing
RISC-V International
 
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
Shawn Wells
 
An Essential Relationship between Real-time and Resource Partitioning
An Essential Relationship between Real-time and Resource PartitioningAn Essential Relationship between Real-time and Resource Partitioning
An Essential Relationship between Real-time and Resource Partitioning
Yoshitake Kobayashi
 
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation GuideBKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
Linaro
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V International
 
BKK16-400A LuvOS and ACPI Compliance Testing
BKK16-400A LuvOS and ACPI Compliance TestingBKK16-400A LuvOS and ACPI Compliance Testing
BKK16-400A LuvOS and ACPI Compliance Testing
Linaro
 
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
Macpaul Lin
 
LAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android NLAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android N
Linaro
 
LAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development LifecycleLAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development Lifecycle
Linaro
 
Kernel Recipes 2013 - Deciphering Oopsies
Kernel Recipes 2013 - Deciphering OopsiesKernel Recipes 2013 - Deciphering Oopsies
Kernel Recipes 2013 - Deciphering Oopsies
Anne Nicolas
 
Sprint 126
Sprint 126Sprint 126
Sprint 126
ManageIQ
 
Tech talk with lampro mellon an open source solution for accelerating verific...
Tech talk with lampro mellon an open source solution for accelerating verific...Tech talk with lampro mellon an open source solution for accelerating verific...
Tech talk with lampro mellon an open source solution for accelerating verific...
RISC-V International
 
Tommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
Tommaso Cucinotta - Low-latency and power-efficient audio applications on LinuxTommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
Tommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
linuxlab_conf
 
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
Linaro
 
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSDLAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
Linaro
 
LAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
LAS16-501: Introduction to LLVM - Projects, Components, Integration, InternalsLAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
LAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
Linaro
 
Перехват функций (хуки) под Windows в приложениях с помощью C/C++
Перехват функций (хуки) под Windows в приложениях с помощью C/C++Перехват функций (хуки) под Windows в приложениях с помощью C/C++
Перехват функций (хуки) под Windows в приложениях с помощью C/C++
corehard_by
 
Andes enhancing verification coverage for risc v vector extension using riscv-dv
Andes enhancing verification coverage for risc v vector extension using riscv-dvAndes enhancing verification coverage for risc v vector extension using riscv-dv
Andes enhancing verification coverage for risc v vector extension using riscv-dv
RISC-V International
 
TMT SequenceL customer use cases and results
TMT SequenceL customer use cases and resultsTMT SequenceL customer use cases and results
TMT SequenceL customer use cases and results
Doug Norton
 
y2038 issue
y2038 issuey2038 issue
y2038 issue
SZ Lin
 

What's hot (20)

Closing the RISC-V compliance gap via fuzzing
Closing the RISC-V compliance gap via fuzzingClosing the RISC-V compliance gap via fuzzing
Closing the RISC-V compliance gap via fuzzing
 
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
2008-07-15 zNTP Conference, Red Hat Enterprise Solutions for System z
 
An Essential Relationship between Real-time and Resource Partitioning
An Essential Relationship between Real-time and Resource PartitioningAn Essential Relationship between Real-time and Resource Partitioning
An Essential Relationship between Real-time and Resource Partitioning
 
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation GuideBKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor Family
 
BKK16-400A LuvOS and ACPI Compliance Testing
BKK16-400A LuvOS and ACPI Compliance TestingBKK16-400A LuvOS and ACPI Compliance Testing
BKK16-400A LuvOS and ACPI Compliance Testing
 
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
從u-boot 移植 NDS32 談 嵌入式系統開放原始碼開發的 一些經驗
 
LAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android NLAS16-201: ART JIT in Android N
LAS16-201: ART JIT in Android N
 
LAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development LifecycleLAS16-106: GNU Toolchain Development Lifecycle
LAS16-106: GNU Toolchain Development Lifecycle
 
Kernel Recipes 2013 - Deciphering Oopsies
Kernel Recipes 2013 - Deciphering OopsiesKernel Recipes 2013 - Deciphering Oopsies
Kernel Recipes 2013 - Deciphering Oopsies
 
Sprint 126
Sprint 126Sprint 126
Sprint 126
 
Tech talk with lampro mellon an open source solution for accelerating verific...
Tech talk with lampro mellon an open source solution for accelerating verific...Tech talk with lampro mellon an open source solution for accelerating verific...
Tech talk with lampro mellon an open source solution for accelerating verific...
 
Tommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
Tommaso Cucinotta - Low-latency and power-efficient audio applications on LinuxTommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
Tommaso Cucinotta - Low-latency and power-efficient audio applications on Linux
 
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
LAS16-407: Internet of Tiny Linux (IoTL): the sequel.
 
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSDLAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
 
LAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
LAS16-501: Introduction to LLVM - Projects, Components, Integration, InternalsLAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
LAS16-501: Introduction to LLVM - Projects, Components, Integration, Internals
 
Перехват функций (хуки) под Windows в приложениях с помощью C/C++
Перехват функций (хуки) под Windows в приложениях с помощью C/C++Перехват функций (хуки) под Windows в приложениях с помощью C/C++
Перехват функций (хуки) под Windows в приложениях с помощью C/C++
 
Andes enhancing verification coverage for risc v vector extension using riscv-dv
Andes enhancing verification coverage for risc v vector extension using riscv-dvAndes enhancing verification coverage for risc v vector extension using riscv-dv
Andes enhancing verification coverage for risc v vector extension using riscv-dv
 
TMT SequenceL customer use cases and results
TMT SequenceL customer use cases and resultsTMT SequenceL customer use cases and results
TMT SequenceL customer use cases and results
 
y2038 issue
y2038 issuey2038 issue
y2038 issue
 

Similar to Hardware monitoring with collectd at CERN

OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
NETWAYS
 
Introduction to Civil Infrastructure Platform
Introduction to Civil Infrastructure PlatformIntroduction to Civil Infrastructure Platform
Introduction to Civil Infrastructure Platform
SZ Lin
 
Shifter singularity - june 7, 2018 - bw symposium
Shifter  singularity - june 7, 2018 - bw symposiumShifter  singularity - june 7, 2018 - bw symposium
Shifter singularity - june 7, 2018 - bw symposium
inside-BigData.com
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
Alex Maestretti
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
NVIDIA
 
FIWARE Tech Summit - Stream Processing with Kurento Media Server
FIWARE Tech Summit - Stream Processing with Kurento Media ServerFIWARE Tech Summit - Stream Processing with Kurento Media Server
FIWARE Tech Summit - Stream Processing with Kurento Media Server
FIWARE
 
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
OW2
 
2022 Cauldron analyzer talk from david malcolm
2022 Cauldron analyzer talk from david malcolm2022 Cauldron analyzer talk from david malcolm
2022 Cauldron analyzer talk from david malcolm
ssuser866937
 
Presentation 12c grid_upgrade
Presentation 12c grid_upgradePresentation 12c grid_upgrade
Presentation 12c grid_upgrade
Jacques Kostic
 
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with StylePuppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
Puppet
 
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
NETWAYS
 
HPC on OpenStack
HPC on OpenStackHPC on OpenStack
HPC on OpenStack
Erich Birngruber
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
aaajjj4
 
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, TrustedNVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
Jeremy Eder
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
Jack (Jaegeun) Han
 
Bounded Model Checking for C Programs in an Enterprise Environment
Bounded Model Checking for C Programs in an Enterprise EnvironmentBounded Model Checking for C Programs in an Enterprise Environment
Bounded Model Checking for C Programs in an Enterprise Environment
AdaCore
 
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using KurentoFIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE
 
Data science workflows: from notebooks to production
Data science workflows: from notebooks to productionData science workflows: from notebooks to production
Data science workflows: from notebooks to production
Marissa Saunders
 
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
Alexandre Gouaillard
 
Programming embedded systems ii
Programming embedded systems iiProgramming embedded systems ii
Programming embedded systems ii
vtsplgroup
 

Similar to Hardware monitoring with collectd at CERN (20)

OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
 
Introduction to Civil Infrastructure Platform
Introduction to Civil Infrastructure PlatformIntroduction to Civil Infrastructure Platform
Introduction to Civil Infrastructure Platform
 
Shifter singularity - june 7, 2018 - bw symposium
Shifter  singularity - june 7, 2018 - bw symposiumShifter  singularity - june 7, 2018 - bw symposium
Shifter singularity - june 7, 2018 - bw symposium
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
FIWARE Tech Summit - Stream Processing with Kurento Media Server
FIWARE Tech Summit - Stream Processing with Kurento Media ServerFIWARE Tech Summit - Stream Processing with Kurento Media Server
FIWARE Tech Summit - Stream Processing with Kurento Media Server
 
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
Benmarking Orange Forge with CLIF, OW2con'15, November 17, Paris
 
2022 Cauldron analyzer talk from david malcolm
2022 Cauldron analyzer talk from david malcolm2022 Cauldron analyzer talk from david malcolm
2022 Cauldron analyzer talk from david malcolm
 
Presentation 12c grid_upgrade
Presentation 12c grid_upgradePresentation 12c grid_upgrade
Presentation 12c grid_upgrade
 
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with StylePuppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
Puppet Camp Berlin 2015: Configuration Management @ CERN: Going Agile with Style
 
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
Puppet Camp Berlin 2015: Andrea Giardini | Configuration Management @ CERN: G...
 
HPC on OpenStack
HPC on OpenStackHPC on OpenStack
HPC on OpenStack
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
 
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, TrustedNVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Bounded Model Checking for C Programs in an Enterprise Environment
Bounded Model Checking for C Programs in an Enterprise EnvironmentBounded Model Checking for C Programs in an Enterprise Environment
Bounded Model Checking for C Programs in an Enterprise Environment
 
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using KurentoFIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
 
Data science workflows: from notebooks to production
Data science workflows: from notebooks to productionData science workflows: from notebooks to production
Data science workflows: from notebooks to production
 
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
 
Programming embedded systems ii
Programming embedded systems iiProgramming embedded systems ii
Programming embedded systems ii
 

Recently uploaded

一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
aozcue
 
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
nudduv
 
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
andreassenrolf537
 
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDARLORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
lorraineandreiamcidl
 
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
uyesp1a
 
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
1jtj7yul
 
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
byfazef
 
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
1jtj7yul
 
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
eydeofo
 
加急办理美国南加州大学毕业证文凭毕业证原版一模一样
加急办理美国南加州大学毕业证文凭毕业证原版一模一样加急办理美国南加州大学毕业证文凭毕业证原版一模一样
加急办理美国南加州大学毕业证文凭毕业证原版一模一样
u0g33km
 
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
kuehcub
 
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
2g3om49r
 
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
xuqdabu
 
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
zpc0z12
 
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
Peter Gallagher
 
按照学校原版(QU文凭证书)皇后大学毕业证快速办理
按照学校原版(QU文凭证书)皇后大学毕业证快速办理按照学校原版(QU文凭证书)皇后大学毕业证快速办理
按照学校原版(QU文凭证书)皇后大学毕业证快速办理
8db3cz8x
 
Production.pptxd dddddddddddddddddddddddddddddddddd
Production.pptxd ddddddddddddddddddddddddddddddddddProduction.pptxd dddddddddddddddddddddddddddddddddd
Production.pptxd dddddddddddddddddddddddddddddddddd
DanielOliver74
 
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
nudduv
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
aozcue
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
peuce
 

Recently uploaded (20)

一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
一比一原版(UCSB毕业证)圣塔芭芭拉社区大学毕业证如何办理
 
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
一比一原版(ANU文凭证书)澳大利亚国立大学毕业证如何办理
 
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
欧洲杯冠军-欧洲杯冠军网站-欧洲杯冠军|【​网址​🎉ac123.net🎉​】领先全球的买球投注平台
 
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDARLORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
LORRAINE ANDREI_LEQUIGAN_GOOGLE CALENDAR
 
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
按照学校原版(Columbia文凭证书)哥伦比亚大学毕业证快速办理
 
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
按照学校原版(SUT文凭证书)斯威本科技大学毕业证快速办理
 
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
一比一原版(Greenwich文凭证书)格林威治大学毕业证如何办理
 
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
按照学校原版(UVic文凭证书)维多利亚大学毕业证快速办理
 
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
一比一原版(UOL文凭证书)利物浦大学毕业证如何办理
 
加急办理美国南加州大学毕业证文凭毕业证原版一模一样
加急办理美国南加州大学毕业证文凭毕业证原版一模一样加急办理美国南加州大学毕业证文凭毕业证原版一模一样
加急办理美国南加州大学毕业证文凭毕业证原版一模一样
 
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
一比一原版(KCL文凭证书)伦敦国王学院毕业证如何办理
 
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
1比1复刻澳洲皇家墨尔本理工大学毕业证本科学位原版一模一样
 
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
一比一原版(Monash文凭证书)莫纳什大学毕业证如何办理
 
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
按照学校原版(UST文凭证书)圣托马斯大学毕业证快速办理
 
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalR - Slides Onl...
 
按照学校原版(QU文凭证书)皇后大学毕业证快速办理
按照学校原版(QU文凭证书)皇后大学毕业证快速办理按照学校原版(QU文凭证书)皇后大学毕业证快速办理
按照学校原版(QU文凭证书)皇后大学毕业证快速办理
 
Production.pptxd dddddddddddddddddddddddddddddddddd
Production.pptxd ddddddddddddddddddddddddddddddddddProduction.pptxd dddddddddddddddddddddddddddddddddd
Production.pptxd dddddddddddddddddddddddddddddddddd
 
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide文凭证书)阿德莱德大学毕业证如何办理
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证如何办理
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证如何办理
 

Hardware monitoring with collectd at CERN

  • 1. Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Hardware monitoring with collectd Luca Gardi - luca.gardi@cern.ch
  • 2. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Introduction ● explain the differences between Lemon and collectd ● summarize needed changes for hardware monitoring ● explain the choices made during the process ● provide a status update ● explain current issues and proposed fixes
  • 3. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 1 - Lemon and collectd Lemon ● developed by CERN ● in production since 2006 (at least) ● old monitoring infrastructure has been replaced ● retirement efforts started mid-2017 collectd ● open source project ● collects system and service metrics ● optimized to handle thousands of metrics ● modular and portable with community plugins ● easy to develop new plugins in Python/Java/C/Perl ● continuously improving and well documented m
  • 4. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 2 - Why collectd? Pros: ● community-driven and rich ecosystem ● alarms and plugins definitions are puppet-based ● better reusability, documentation ● easier to set up for quick metric collection ● easier metric dispatch in plugins Cons: ● alarms generated on transition ● existing plugins require re-writing ● MONIT provides a lemon-sensor wrapper but is deprecated
  • 5. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 3 - HW monitoring in the Lemon era Agent sensors: ● lemon-sensor-smart: SMART logs monitoring ● lemon-sensor-tw: 3ware RAID controllers ● lemon-sensor-megaraidsas: LSI MegaRAID controllers ● lemon-sensor-adaptec: Adaptec RAID controllers ● lemon-sensor-sasarray: JBODs monitoring ● lemon-sensor-blockdevice-drives: log parser for SCSI errors ● lemon-sensor-ipmi: IPMI monitoring On-behalf monitoring (centralized): ● pdu-xmas: centralized out-of-band PDU monitoring (SNMPv2)
  • 6. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 4 - Moving towards a collectd era ● very specific and complex needs ○ heterogeneity of hardware and configurations ○ hardware RAID controllers ○ intense use of IPMI ● no community sensors we could adopt ● good news! code can be ported from lemon sensors ● adopt TDD (Test-Driven Development) ● compatibility with python 2.4, 2.7, 3.4 ● Continuous Development (CI/CD) using GitLab
  • 7. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 5 - Plugin architecture
  • 8. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 6 - The big migration Collectd plugins: ● collectd-mdstat: in production (new) ✔ ● collectd-smart-tests: in production ✔ ● collectd-megaraidsas: in QA ✔ ● collectd-sasarray: in development ● collectd-blockdevices: in pipeline ● collectd-adaptec: in pipeline ● mcelog: from the community Centralized monitoring: ● CINNAMON: in production ✔ (requires minor changes) ● PODIUM: in development
  • 9. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 7 - Plugin development workflow ● identify output metrics and write the tests ● write the plugin ● if tests.color == green: plugin.puppet_deploy()
  • 10. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 8 - Plugin deployment workflow ● RPM packaging and repositories using Koji ● Collectd plugin definition on Puppet ○ it-puppet-module-cerncollectd_contrib on GitLab ● standard CERN CRM QA -> Production pipeline (1 week) ● deployed on physical machines ○ it-puppet-module-hardware: physical.pp
  • 11. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF 9 - Alarms ● based on standard collectd Threshold plugin ● checks local metrics against defined thresholds ● states: OK, WARNING, FAILURE ● puppet defined (metricmgr is already read-only) ● Service Managers can override thresholds and SNOW targets
  • 12. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF ● finish porting of the sensors to collectd ● start retirement of old lemon sensors ● too many tickets: ○ fine tuning of the alarms is necessary ○ waiting for better SNOW tickets deduplication ● tickets are not very descriptive: ○ a pull request has been sent to the upstream community ● no lemon-host-check ○ do we need it? ○ is collectdctl enough? 10 - What’s left and current issues
  • 13. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF ● collectd provides a mature environment for HW monitoring ● using puppet for alarms definition is definitely a plus for versioning and maintenance, compared to metricmgr ● after an initial series of delays, mainly due to our early adoption, we are now more than half-way there and progressing steadily ● targeting end of the year for finishing the migration ● a good occasion for collaboration with IT-CM-MM 11 - Conclusions
  • 14. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Hardware monitoring with collectd
  • 15. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Backup slides - Plugin definition class cerncollectd_contrib::plugin::mdstat ( Integer $interval, String $mdstat_path, ) { require ::cerncollectd_contrib package { 'collectd-mdstat': ensure => present, } collectd::plugin::python::module { 'collectd_mdstat': ensure => present, config => [{ 'INTERVAL' => $interval, 'MDSTAT_PATH' => $mdstat_path, }], require => Package['collectd-mdstat'], } }
  • 16. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Backup slides - Alarm definition class cerncollectd_contrib::alarm::mdstat_wrong ( Integer $failure_max, Integer $hits, Boolean $persist, Boolean $interesting, Optional[Hash] $custom_targets, Optional[String] $actuator, ) { ::cerncollectd::alarms::threshold::plugin {'mdstat_wrong': plugin => 'mdstat', type => 'disk_error', failure_max => $failure_max, hits => $hits, persist => $persist, interesting => $interesting, } ::cerncollectd::alarms::extra {'mdstat_wrong': ctd_namespace => 'mdstat', targets => $custom_targets, actuator => $actuator, } }
  • 17. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Backup slides - Plugin deployment if (versioncmp($::operatingsystemmajrelease,'6') >= 0) or (versioncmp($::operatingsystemmajrelease,'7') >= 0){ # Software RAID failures (see target in YAML file data) include ::cerncollectd_contrib::alarm::mdstat_wrong include ::cerncollectd_contrib::plugin::mdstat # SMART attributes failures include ::cerncollectd_contrib::alarm::smart_wrong include ::cerncollectd_contrib::plugin::smart_tests # MegaRAID failures include ::cerncollectd_contrib::alarm::megaraidsas::bbu_status_wrong include ::cerncollectd_contrib::alarm::megaraidsas::controller_status_wrong include ::cerncollectd_contrib::alarm::megaraidsas::controller_correctable_errors include ::cerncollectd_contrib::alarm::megaraidsas::controller_uncorrectable_errors include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_faulty_bbu_wrong include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_raid_array_wrong include ::cerncollectd_contrib::alarm::megaraidsas::raid_array_status_wrong include ::cerncollectd_contrib::alarm::megaraidsas::missing_drives include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_good_drives include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_bad_drives include ::cerncollectd_contrib::alarm::megaraidsas::offline_drives if (versioncmp($::operatingsystemmajrelease,'6') >= 0){ class {'::cerncollectd_contrib::plugin::megaraidsas': lsmod_path => '/sbin/lsmod', } } else { include ::cerncollectd_contrib::plugin::megaraidsas } }
  • 18. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Backup slides - collectdctl output ● collectd namespace: ○ <hostname>/<plugin>-<plugin_instance>/<type>-<type_instance> ● listing values: ○ [root@lxfsrd08c04 ~]# collectdctl listval lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0 lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_on_faulty_bbu/count-c0 lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_wrong_on_raid_array/count-c0 lxfsrd08c04.cern.ch/megaraidsas-controller_memory_correctable_errors/count-c0 lxfsrd08c04.cern.ch/megaraidsas-controller_memory_uncorrectable_errors/count-c0 lxfsrd08c04.cern.ch/megaraidsas-controller_status/count-c0 lxfsrd08c04.cern.ch/megaraidsas-missing_drives/count lxfsrd08c04.cern.ch/megaraidsas-offline_drives/count-c0 lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd0 lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd1 lxfsrd08c04.cern.ch/megaraidsas-unconfigured_bad_drives/count-c0 lxfsrd08c04.cern.ch/megaraidsas-unconfigured_good_drives/count-c0 ● getting values: ○ [root@lxfsrd08c04 ~]# collectdctl getval lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0 value=0.000000e+00
  • 19. CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Backup slides - GitLab CI/CD