Server Hardware Monitoring 
done right! 
Werner Fischer, Thomas-Krenn.AG
Status quo
_ Do you monitor your server hardware?
Yes No
After this talk
you will monitor more securely
and more comprehensively
(at least I hope so... ;-)
Status quo
_ Which technologies do you use?
IPMI / SNMP NRPE CAM CAT
CAMera
CATinspection → satisfaction?
Agenda 
_ IPMI (20') 
_ RAM (5') 
_ RAID (10') 
_ SMART (5') 
_ GPU (5')
monitor your
IPMI-Sensors!
Intelligent Platform Management Interface
Functions
_ Monitoring (temp, fans, ...)
_ Recovery Control (on/off/reset)
_ Logging (System Event Log)
_ Inventory (FRU information)
Architecture
[Diagram: IPMI block diagram. The Baseboard Management Controller (BMC) on the motherboard is connected through private mgmt. busses and the IPMB to its sensors & controls (fan sensor, temp. sensor, power control, reset control, …), to non-volatile storage (SDR, SEL, FRU), to the chassis mgmt. satellite controller, and to the chassis, processor, memory and redundant power boards with their FRU data and temp. sensors. Further connections: ICMB bridge, auxiliary IPMB connector, PCI mgmt. bus to a remote mgmt. card (KVM over IP, ...), LAN interface via the network (LAN) controller and LAN connector, serial/modem interface via serial port sharing and the serial connector, and the system interface on the system bus.]
_ Access over the network: with user name & password
_ Access over the system interface: with root privileges
IPMI Sensor Classes
Discrete (True/False)
Several states possible:
● up to 15 states possible
● each state = 1 bit
● several state bits can be active at once
Provides:
● generic states
● sensor-specific states
Similar class: OEM
● meaning of the states is defined by the OEM
Threshold
State depends on:
● comparison of the analog reading with the thresholds
Provides:
● analog reading
● discrete status
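The one-bit-per-state encoding of discrete sensors can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `decode_discrete` is hypothetical, and the bit positions in `POWER_SUPPLY_STATES` are an assumption that mirrors the order of the "PS2 Status" states shown on the next slide.

```python
# Assumed bit positions for a "Power Supply" discrete sensor (illustrative only).
POWER_SUPPLY_STATES = {
    0: "Presence detected",
    1: "Failure detected",
    2: "Predictive failure",
    3: "Power Supply AC lost",
}


def decode_discrete(mask, names):
    """Return the names of all asserted states in a 15-bit state mask."""
    asserted = []
    for bit in range(15):  # up to 15 states, one bit each
        if mask & (1 << bit):
            asserted.append(names.get(bit, f"state bit {bit}"))
    return asserted


# Two bits set at once: the supply is present and has lost AC input.
print(decode_discrete(0b1001, POWER_SUPPLY_STATES))
```

Because several bits can be set at the same time, a monitoring check has to look at every asserted state rather than a single value.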
IPMI Sensor Classes
Discrete
[root@test ~]# ipmitool sdr get "PS2 Status"
Sensor ID : PS2 Status (0x71)
Entity ID : 10.2 (Power Supply)
Sensor Type (Discrete): Power Supply
States Asserted : Power Supply
[Presence detected]
[Power Supply AC lost]
Assertion Events : Power Supply
[Presence detected]
[Power Supply AC lost]
Assertions Enabled : Power Supply
[Presence detected]
[Failure detected]
[Predictive failure]
[Power Supply AC lost]
[...]
Deassertions Enabled : Power Supply
[...]
Threshold
[root@test ~]# ipmitool sdr get "Fan 1"
Sensor ID : Fan 1 (0x50)
Entity ID : 29.1 (Fan Device)
Sensor Type (Analog) : Fan
Sensor Reading : 5719 (+/- 0) RPM
Status : ok
Nominal Reading : 6708.000
Normal Minimum : 2451.000
Normal Maximum : 10965.000
Lower critical : 1720.000
Lower non-critical : 1978.000
Positive Hysteresis : 86.000
Negative Hysteresis : 86.000
Minimum sensor range : Unspecified
Maximum sensor range : Unspecified
Event Message Control : Per-threshold
Readable Thresholds : lcr lnc
Settable Thresholds : lcr lnc
Threshold Read Mask : lcr lnc
Assertion Events :
Assertions Enabled : lnc- lcr-
Deassertions Enabled : lnc- lcr-
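How a threshold sensor turns an analog reading into a status can be sketched with the "Fan 1" values above. This is a simplified illustration: the function name `classify_lower` is hypothetical, only the two lower thresholds (lnc, lcr) are handled, hysteresis is ignored, and treating a reading equal to a threshold as already crossed is an assumption.

```python
def classify_lower(reading, lnc, lcr):
    """Classify a reading against lower thresholds only (lnc > lcr)."""
    if reading <= lcr:
        return "critical"
    if reading <= lnc:
        return "warning"
    return "ok"


# Values from the "Fan 1" output: reading 5719 RPM, lnc 1978, lcr 1720.
print(classify_lower(5719, lnc=1978, lcr=1720))
```

A real BMC also applies the positive and negative hysteresis values, so the status does not flap when the reading hovers around a threshold.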
IPMI Sensors
$ sudo ipmi-sensors --output-sensor-state --interpret-oem-data
Password:
ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
71 | Peripheral Temp | Temperature | Nominal | 35.00 | C | 'OK'
138 | CPU Temp | OEM Reserved | Nominal | N/A | N/A | 'Low'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
…
942 | VBAT | Voltage | Nominal | 3.15 | V | 'OK'
1009 | VSB | Voltage | Nominal | 3.34 | V | 'OK'
1076 | AVCC | Voltage | Nominal | 3.38 | V | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
[Slide callouts on the State column: OK, Critical]
IPMI Sensors (Discrete)
$ cat /etc/freeipmi/freeipmi_interpret_sensor.conf 
[…] 
## IPMI_Physical_Security 
# 
# IPMI_Physical_Security_No_Event Nominal 
# IPMI_Physical_Security_General_Chassis_Intrusion Critical 
# IPMI_Physical_Security_Drive_Bay_Intrusion Critical 
[…] 
# IPMI_Power_Supply_No_Event Nominal 
# IPMI_Power_Supply_Presence_Detected Nominal 
# IPMI_Power_Supply_Power_Supply_Failure_Detected Critical 
# IPMI_Power_Supply_Predictive_Failure Critical 
# IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC Critical 
[…]
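The mapping from discrete sensor events to monitoring states in this file can be read with a few lines of code. The sketch below is only an illustration: the function name `parse_interpret_conf` is hypothetical, it assumes that commented-out lines document the shipped defaults and that an uncommented line overrides them, and the "Warning" override in the sample is made up.

```python
def parse_interpret_conf(text):
    """Return (active, defaults): option name -> state for each kind of line."""
    active, defaults = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        target = active
        if line.startswith("#"):
            # Commented line: treated here as a documented default.
            line = line.lstrip("#").strip()
            target = defaults
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("IPMI_"):
            target[parts[0]] = parts[1]
    return active, defaults


SAMPLE = """\
## IPMI_Physical_Security
#
# IPMI_Physical_Security_No_Event Nominal
# IPMI_Physical_Security_General_Chassis_Intrusion Critical
IPMI_Physical_Security_General_Chassis_Intrusion Warning
"""

active, defaults = parse_interpret_conf(SAMPLE)
print(active)
print(defaults)
```

With such an override, an open chassis would no longer be reported as Critical, which is one way to deal with the "Chassis Intru" sensor shown earlier.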
IPMI Plugin
$ ./check_ipmi_sensor -H 192.168.255.5 -f ipmi.cfg -vv
IPMI Status: OK | 'System Temp'=27.00 'Peripheral Temp'=35.00 'FAN 1'=1800.00 'Vcore'=0.98 '3.3VCC'=3.36 '12V'=11.93 'VDIMM'=1.53 '5VCC'=5.09 '-12V'=-12.09 'VBAT'=3.15 'VSB'=3.34 'AVCC'=3.38
System Temp = 27.00 (Status: Nominal)
Peripheral Temp = 35.00 (Status: Nominal)
CPU Temp = 'Low' (Status: Nominal)
FAN 1 = 1800.00 (Status: Nominal)
Vcore = 0.98 (Status: Nominal)
3.3VCC = 3.36 (Status: Nominal)
12V = 11.93 (Status: Nominal)
VDIMM = 1.53 (Status: Nominal)
5VCC = 5.09 (Status: Nominal)
-12V = -12.09 (Status: Nominal)
VBAT = 3.15 (Status: Nominal)
VSB = 3.34 (Status: Nominal)
AVCC = 3.38 (Status: Nominal)
Chassis Intru = 'OK' (Status: Nominal)
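The first line of the plugin output follows the usual Nagios/Icinga layout: status text, a `|`, then performance data as `'label'=value` pairs. A minimal sketch of how a script could split that line is shown here; the function name `parse_perfdata` is hypothetical, and the sketch assumes plain numeric values without unit suffixes, as in the output above.

```python
import shlex


def parse_perfdata(line):
    """Split 'STATUS | perfdata' and return (status, {label: float})."""
    status, _, perf = line.partition("|")
    values = {}
    for token in shlex.split(perf):      # honours the single-quoted labels
        label, _, rest = token.rpartition("=")
        value = rest.split(";")[0]       # drop optional warn/crit fields
        values[label] = float(value)
    return status.strip(), values


LINE = "IPMI Status: OK | 'System Temp'=27.00 'FAN 1'=1800.00 '-12V'=-12.09"
status, values = parse_perfdata(LINE)
print(status, values["FAN 1"], values["-12V"])
```

Splitting on the last `=` keeps labels such as `-12V` intact even though the value itself is negative.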
IPMI Plugin
#!/usr/bin/perl
# check_ipmi_sensor: Nagios/Icinga plugin to check IPMI sensors
#
# Copyright (C) 2009-2014 Thomas-Krenn.AG,
# additional contributors see changelog.txt
#
# This program is free software; you can redistribute it and/or modify it under
[…]
Version 3.5 20141031
* Fix LAN Driver if called on localhost
Version 3.4 20140929
* Fix implicit array warning with split
* Add option to disable LAN protocol version 2.0
Version 3.3 20140606
* Print a warning if ipmi-sensors only returned a single output row
* Ignore sudo errors and warnings in IPMI command output
(Thanks to Robert Heinzmann for contributing)
* Use LAN protocol version 2.0 per default
* Print empty output error only if return code was 0
* Exit the plugin with return code 3 if fru command fails
* Added an include list option to only include specific sensors
Version 3.2 20131028
* Added FRU serial number to output
So far, so good?
Intelligent? Platform Management Interface
The eavesdropping system in your computer
"The Eavesdropping System in Your Computer"
(Bruce Schneier, Schneier on Security blog, 31 Jan 2013)
230,000 1U servers
→ a stack 10,223.5 m high
(Mount Everest: 8,848 m)
IPMI Firmware by ATEN / AMI
_ Mainboard vendors customize the firmware
_ OS = embedded Linux
_ Parts of the IPMI firmware are closed source
"We recommend not exposing administrative access such as IPMI, or for that matter SSH services, openly on the Internet, but instead using a firewall/VPN to allow only authorized persons to reach such services."
What if you do it anyway?
Enable
& DROP
IPMI top 3
security tips
#1 - Network
#2 – User Management
sjfaiklaz afjhuijoh
Administrator
User
#2 – User Management
"In short, the authentication process for IPMI 2.0 mandates that the server send a salted SHA1 or MD5 hash of the requested user's password to the client, prior to the client authenticating."
A Penetration Tester's Guide to IPMI and BMCs (rapid7.com)
msf > use auxiliary/scanner/ipmi/ipmi_dumphashes
msf auxiliary(ipmi_dumphashes) > set RHOSTS 10.1.102.141
RHOSTS => 10.1.102.141
msf auxiliary(ipmi_dumphashes) > set THREADS 128
THREADS => 128
msf auxiliary(ipmi_dumphashes) > run
[+] 10.1.102.141:623 - IPMI - Hash found: admin:14667523250000004ec525d3852f4fa73c93b674788217fe0000000000000000000000000000000000000000000000000000000000000000140561646d696e:2c76e372d89ac7cd4e3bfecb423962f708d0741c
#2 – User Management
$ ./cudaHashcat64.bin --outfile=ipmi.out -m 7300 hash.txt -a 3 ?lu?lu?lu?lu?lu?lu
[...]
Session.Name...: cudaHashcat
Status.........: Exhausted
Input.Mode.....: Mask (?lu?lu?lu?lu?lu?lu) [12]
Hash.Target....: 54414378fb2db5ff365e4bc5856adaf4c1b8a2f2153efd1b81fb54dfe1bf56478788ea7ba154375b40167e34f026e1020010d21d1ea31625040561646d696e:0a0b160231e204a6d0bd086e26718002409b35b7
Hash.Type......: IPMI2 RAKP HMAC-SHA1
Time.Started...: Thu Sep 18 10:11:17 2014 (6 secs)
Time.Estimated.: 0 secs
Speed.GPU.#1...: 52732.3 kH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 308915776/308915776 (100.00%)
Skipped........: 0/308915776 (0.00%)
Rejected.......: 0/308915776 (0.00%)
HWMon.GPU.#1...: -1% Util, 41c Temp, 31% Fan
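The numbers in this run show why password length matters. The mask has six variable lowercase positions (`?l`) with a fixed `u` after each, which gives 26^6 = 308,915,776 candidates and matches the Progress line. The sketch below redoes that arithmetic; the helper names are hypothetical, and the 20-character comparison (62 letters and digits per position) is an assumed example, not a figure from the talk.

```python
def keyspace(charset_size, variable_positions):
    """Number of candidates for a mask with the given variable positions."""
    return charset_size ** variable_positions


def seconds_to_exhaust(candidates, hashes_per_second):
    """Time to try every candidate at a constant hash rate."""
    return candidates / hashes_per_second


RATE = 52732.3e3  # "Speed.GPU.#1...: 52732.3 kH/s" from the output above

demo = keyspace(26, 6)  # six ?l positions; the 'u' characters are fixed
print(demo, round(seconds_to_exhaust(demo, RATE), 1), "seconds")

# Assumed comparison: 20 characters drawn from 62 letters and digits.
years = seconds_to_exhaust(keyspace(62, 20), RATE) / (365 * 24 * 3600)
print(f"{years:.1e} years")
```

The first result agrees with the roughly six seconds reported by hashcat; the second grows far beyond any practical attack time at the same hash rate.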
#2 – User Management
20
Complex
& long
passwords
#3 – Limit services
monitor your RAM! 
(it's ECC, isn't it?)
3%
at least 1 CE per year (DDR2)
Google 2009, Jaguar cluster 2012
70%
CEs before UEs
Google 2009
1.3%
servers with UEs per year
Google 2009
(CE = correctable error, UE = uncorrectable error)
Linux EDAC
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# ls -l
total 0
-r--r--r-- 1 root root 4096 Nov 12 09:02 ce_count
-r--r--r-- 1 root root 4096 Nov 12 09:02 ch0_ce_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 ch0_dimm_label
-r--r--r-- 1 root root 4096 Nov 12 09:02 ch1_ce_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 ch1_dimm_label
-r--r--r-- 1 root root 4096 Nov 12 09:02 dev_type
-r--r--r-- 1 root root 4096 Nov 12 09:02 edac_mode
-r--r--r-- 1 root root 4096 Nov 12 09:02 mem_type
drwxr-xr-x 2 root root 0 Nov 12 09:02 power
-r--r--r-- 1 root root 4096 Nov 12 09:02 size_mb
lrwxrwxrwx 1 root root 0 Nov 12 09:02 subsystem -> ../../../../../../bus/mc0
-r--r--r-- 1 root root 4096 Nov 12 09:02 ue_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 uevent
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ce_count
0
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ue_count
0
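The `ce_count` and `ue_count` files shown above can be collected for all memory controllers with a short script. This is a sketch under assumptions, not one of the plugins from the talk: the function names are hypothetical, and mapping "any CE" to warning and "any UE" to critical is an assumed policy.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {(controller, csrow): (ce_count, ue_count)} found below root."""
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        try:
            ce = int((csrow / "ce_count").read_text())
            ue = int((csrow / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable, skip this row
        counts[(csrow.parent.name, csrow.name)] = (ce, ue)
    return counts


def edac_status(counts):
    """Nagios-style code: 2 on any UE, 1 on any CE, otherwise 0."""
    if any(ue > 0 for _, ue in counts.values()):
        return 2
    if any(ce > 0 for ce, _ in counts.values()):
        return 1
    return 0


counts = read_edac_counts()
print(edac_status(counts), counts)
```

On a machine without a loaded EDAC driver the directory is empty or missing, so the script reports status 0 with no rows; a real check would probably treat that case as unknown.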
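Linux EDAC support matrix
Driver module | CPUs | Kernel | Supported architectures
amd64_edac.c | AMD | 2.6.31 | K8 and F10
amd64_edac.c | AMD | 2.6.39 | F15
amd64_edac.c | AMD | 3.10 | F16
amd64_edac.c | AMD | 3.13 | F15_M30H
amd64_edac.c | AMD | 3.15 | F16_M30H
i7core_edac.c | Intel Single/Dual | 2.6.35 | Nehalem/Westmere
ie31200_edac.c | Intel Single-CPU | 3.17 | Sandy & Ivy Bridge, Haswell
sb_edac.c | Intel Dual-CPU | 3.2 | Sandy Bridge
sb_edac.c | Intel Dual-CPU | 3.13 | Ivy Bridge
sb_edac.c | Intel Dual-CPU | 3.17 | Haswell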
IPMI SEL (System Event Log)
$ ipmi-sel
ID | Date | Time | Name | State | Event
1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
...
_ Supported as of check_ipmi_sensor v3.6 (planned for 12/2014)
_ Independent of the OS
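Because the SEL is plain pipe-separated text, repeated correctable memory errors are easy to count per DIMM. The sketch below is an illustration only and does not show how check_ipmi_sensor evaluates the SEL; the function name `count_memory_errors` is hypothetical and the column layout is assumed to match the `ipmi-sel` output above.

```python
from collections import Counter


def count_memory_errors(sel_output):
    """Count 'Correctable memory error' events per sensor name."""
    per_dimm = Counter()
    for line in sel_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6 or fields[0] == "ID":
            continue  # header, ellipsis or malformed row
        _id, _date, _time, name, _state, event = fields
        if event == "Correctable memory error":
            per_dimm[name] += 1
    return per_dimm


SAMPLE = """\
ID | Date | Time | Name | State | Event
1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
...
"""
print(count_memory_errors(SAMPLE))
```

A DIMM whose count keeps rising is a candidate for replacement before an uncorrectable error occurs.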
monitor your RAID!
Linux Software RAID
LSI / Adaptec Hardware RAID
Avago MegaRAID (LSI)
root@debian-test:~# storcli64
Storage Command Line Tool Ver 1.13.06 Sep 03, 2014
(c)Copyright 2014, LSI Corporation, All Rights Reserved.
help - lists all the commands with their usage. E.g. storcli help
<command> help - gives details about a particular command. E.g. storcli add help
List of commands:
Commands Description
-------------------------------------------------------------------
add Adds/creates a new element to controller like VD,Spare..etc
delete Deletes an element like VD,Spare
show Displays information about an element
set Set a particular value to a property
get Get a particular value to a property
compare Compares particular value to a property
start Start background operation
stop Stop background operation
pause Pause background operation
resume Resume background operation
download Downloads file to given device
expand expands size of given drive
insert inserts new drive for missing
transform downgrades the controller
/cx Controller specific commands
/ex Enclosure specific commands
/sx Slot/PD specific commands
/vx Virtual drive specific commands
/dx Disk group specific commands
/fall Foreign configuration specific commands
/px Phy specific commands
/[bbu|cv] Battery Backup Unit, Cachevault commands
check_lsi_raid
$ /usr/lib/nagios/plugins/check_lsi_raid -vv
Warning (LD Warn) [c0/v0_Consist = Warning (No)]|CV_Temperature=22;70;85 ROC_Temperature=57;80;90 c0/e252/s0_Drive_Temperature=21;40;45 c0/e252/s1_Drive_Temperature=21;40;45
Used storcli commands:
- /usr/bin/sudo /usr/sbin/storcli64 /c0 /cv show status
- /usr/bin/sudo /usr/sbin/storcli64 adpallinfo a0
- /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show all
- /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show init
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show all
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show initialization
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show rebuild
Warning sensors:
- c0/v0_Consist (No)
Why adpallinfo a0?
"storcli /0 show all … blocks the whole raid card i/o for … up to ~4 seconds"
check_lsi_raid
$ /usr/lib/nagios/plugins/check_lsi_raid -h
check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
Pulgin version: 2.0
Copyright (C) 2013-2014 Thomas-Krenn.AG
Current updates available at http://git.thomas-krenn.com/check_lsi_raid.git
This Nagios/Icinga Plugin checks LSI RAID controllers for controller,
physical device, logical device, BBU and CV warnings and errors.
In order for this plugin to work properly you need to add the nagios
user to your sudoers file (or create a new one in /etc/sudoers.d/).
Usage:
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level.
No -v is the normal single line output for Nagios/Icinga, -v is a
more detailed version but still usable in Nagios. -vv is a
multiline output for debugging configuration errors or more
detailed information. -vvv is for plugin problem diagnosis.
For further information please visit:
http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN39
[ -V --version ]
Displays the plugin and, if available, the version if StorCLI.
[ -C <num> | --controller <num> ]
Specifies a controller number, defaults to 0.
...
VMware? → CIM Provider
VMware? → Plugin
check_esxi_hardware.py check_vmware_esx.pl
Hardware VMware in general
python-pywbem VMware Perl SDK
Claudio Kuenzler et al.
Info:
Martin Fürstenau
VMware? check_esxi_hardware.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
#
# Script for checking global health of host running VMware ESX/ESXi
#
# Licence : GNU General Public Licence (GPL) http://www.gnu.org/
# This program is free software; you can redistribute it and/or
...
# Copyright (c) 2008 David Ligeret
# Copyright (c) 2009 Joshua Daniel Franklin
# Copyright (c) 2010 Branden Schneider
# Copyright (c) 2010-2014 Claudio Kuenzler
# Copyright (c) 2010 Samir Ibradzic
# Copyright (c) 2010 Aaron Rogers
# Copyright (c) 2011 Ludovic Hutin
# Copyright (c) 2011 Carsten Schoene
# Copyright (c) 2011-2012 Phil Randal
# Copyright (c) 2011 Fredrik Aslund
# Copyright (c) 2011 Bertrand Jomin
# Copyright (c) 2011 Ian Chard
# Copyright (c) 2012 Craig Hart
# Copyright (c) 2013 Carl R. Friend
Adaptec by PMC
$ sudo arcconf
| UCLI | Adaptec by PMC uniform command line interface
| UCLI | Version 1.6 (B21062)
| UCLI | (C) Adaptec by PMC 2003-2014
| UCLI | All Rights Reserved 
ATAPASSWORD | setting password on a physical drive 
COPYBACK | toggles controller copy back mode 
CREATE | creates a logical device 
CONSISTENCYCHECK | toggles the controller background consistency check mode 
DELETE | deletes one or more logical devices 
ERRORTUNABLE | sets error tunable profiles on the controller 
EXPANDERLIST | Lists the Expanders Connected to the Controller 
EXPANDERUPGRADE | updates expander firmware 
FAILOVER | toggles the controller automatic failover mode 
GETCONFIG | prints controller information 
GETLOGS | gets controller log information 
GETPERFORM | gets the parameters for a performance mode 
GETSMARTSTATS | gets the SMART statistics 
GETSTATUS | displays the status of running tasks 
GETVERSION | prints version information for all controllers 
IDENTIFY | blinks LEDS on device(s) connected to a controller 
IMAGEUPDATE | update physical device firmware 
KEY | installs a Feature Key onto a controller 
MODIFY | performs RAID Level Migration or Online Capacity Expansion 
PHYERRORLOG | displays PHY error logs for controller or device or an 
| expander PHY 
PRESERVECACHE | changes the cache preservation settings on the controller 
RESCAN | checks for new or removed drives 
RESETSTATISTICSCOUNTERS | resets the controller statistics counters 
ROMUPDATE | updates controller firmware 
SAVESUPPORTARCHIVE | saves the support archive 
SETALARM | controls the controller alarm, if present 
...
check_adaptec_raid (update planned for 2015)
$ ./check_adaptec_raid -p /usr/sbin/arcconf
AACRAID CRITICAL (Ctrl #1): [ZMM critical]
$ ./check_adaptec_raid -h
Thomas-Krenn Adaptec Raid Controller Nagios/Icinga Plugin Version: 1.0
Copyright (C) 2009-2013 Thomas-Krenn.AG
Current updates available via git at:
http://git.thomas-krenn.com/check_adaptec_raid.git
This Nagios/Icinga Plugin checks ADAPTEC RAID-Controllers for Controller,
Physical-Device and Logical Device warnings and errors.
In order for this plugin to work properly you need to add the nagios-user
to your sudoers file (or create a new one in /etc/sudoers.d/).
This is required as arcconf must be called with sudo permissions.
Usage:
[ -C <Controller number> ] [ -LD <Logical device number> ]
[ -PD <Physical device number> ] [ -T <Warning Temp., Crit. Temp.> ]
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level
no -v single line output for Nagios/Icinga
-v single line with more details
...
VMware? → CIM Provider expected
_ currently:
_ "CIM Provider" for remote arcconf
_ Adaptec MSM in a VM
_ in future:
_ a "real" CIM Provider
be smart,
use SMART ;-)
Self-Monitoring, Analysis & Reporting Technology
Standardized:
_ data format
_ commands
_ error logs
_ tests
NOT standardized:
_ attributes
_ documentation from the vendor is required (often not public, except Intel/Samsung)
check_smart_attributes
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d /dev/sda \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
OK (sda) |sda_Media_Wearout_Indicator=098;16;6 sda_Host_Writes_32MiB=575272 sda_Host_Reads_32MiB=723527
/etc/nagios-plugins/config/check_smartdb.json
...
"Intel DC S3700" : {
  "Device" : ["Intel DC S3700 Series SSDs","INTEL SSDSC2BA100G3",
  "ID#" : {
    "5" : "RAW_VALUE", # Re-allocated Sector Count
    ...
    "194" : "RAW_VALUE", # Temperature - Device Internal Te
    ...
    "232" : "VALUE", # Available Reserved Space
    "233" : "VALUE", # Media Wearout Indicator
    "234" : "VALUE", # Thermal Throttle Status
    "241" : "RAW_VALUE", # Total LBAs Written (32MiB)
    "242" : "RAW_VALUE", # Total LBAs Read (32MiB)
    "1024" : "VALUE" # ATA error count (custom)
  },
  "Threshs" : {
    "5" : ["20","40"],
    ...
    "232" : ["16:","11:"],
    "233" : ["16:","6:"],
    "1024" : ["0","10"]
  },
  "Perfs" : ["194","233","241","242"]
},
...
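The "Threshs" entries come in two shapes: a plain number such as "20" and a number with a trailing colon such as "16:". The sketch below assumes these follow the Nagios plugin range notation ("N" alerts above N, "N:" alerts below N) and that each pair is [warning, critical]; both points are assumptions, and the function names are hypothetical.

```python
def violates(value, spec):
    """True if value lies outside a Nagios-style range ('N' or 'N:')."""
    if spec.endswith(":"):
        return value < float(spec[:-1])      # 'N:' -> alert below N
    return value < 0 or value > float(spec)  # 'N'  -> alert outside 0..N


def evaluate(value, threshs):
    """Return OK, Warning or Critical for a [warning, critical] pair."""
    warn, crit = threshs
    if violates(value, crit):
        return "Critical"
    if violates(value, warn):
        return "Warning"
    return "OK"


# Attribute 233 (Media Wearout Indicator) counts down, so low values are bad.
print(evaluate(98, ["16:", "6:"]))
# Attribute 5 (Re-allocated Sector Count) counts up, so high values are bad.
print(evaluate(25, ["20", "40"]))
```

Under these assumptions a wearout value of 98 is fine, which is consistent with the "OK (sda)" result and the `098;16;6` performance data in the plugin output.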
/etc/nagios-plugins/config/check_smartdb.json
...
_ New SSDs & HDDs all the time
_ Updates?
_ Thank Git ;-)
OK, cool, but what about RAID controllers?
...
[-d|--device <path to device being checked>]
Specify the device being monitored. If multiple devices should be
checked provide the '-d' option multiple times.
E.g. '-d /dev/sda -d /dev/sdb'
For devices behind LSI RAID controllers specify 'megaraid' and then the
device number, e.g. '-d megaraid6'. Use storcli to find out the
corresponding device numbers.
For devices behind Adaptec RAID controllers specify '/dev/sg<X>' where
<X> is the number for your device. Use e.g. sg_scan to find the device.
You must also use '-O sat' or '-O scsi' according to the device
interface. This are extra options only necessary for '/dev/sg<X>'
devices.
...
OK, cool, but what about RAID controllers?
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d megaraid6 \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
OK (megaraid6) |megaraid6_Temperature_Internal=26 megaraid6_Media_Wearout_Indicator=100;16;6 megaraid6_Host_Writes_32MiB=70283 megaraid6_Host_Reads_32MiB=1650800
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d megaraid7 \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]|megaraid7_Temperature_Internal=34 megaraid7_Media_Wearout_Indicator=098;16;6 megaraid7_Host_Writes_32MiB=189904 megaraid7_Host_Reads_32MiB=29658
monitor your GPU!
check_gpu_sensor
$ /usr/lib/nagios/plugins/check_gpu_sensor -db 0000:83:00.0
OK - Tesla K20 |ECCL2AggSgl=0;1;2; ECCTexAggSgl=0;1;2; memUtilRate=0 PWRUsage=49.81;150;200; ECCRegAggSgl=0;1;2; SMClock=705 ECCL1AggSgl=0;1;2; GPUTemperature=38;85;100; memClock=2600 usedMemory=0.24;95;99; fanSpeed=30;80;95; graphicsClock=705 GPUUtilRate=0 ECCMemAggSgl=0;1;2;
NVIDIA: "The displayed fan speed does not tell you whether the fan is actually spinning."
"It is the speed at which the fan algorithm is trying to run the fan."
We recommend: "temperature sensor"
Plugins - Future
_ Monitoring of firmware versions
_ RAID consistency checks
_ Temperature of 10 Gbit NICs (see Intel X540 FAQs)
So, what now?
Relax ...
_ all plugins are available at git.thomas-krenn.com
_ all plugins comply with the Plugin Developer Guidelines (-h for help)
_ "Plugin Entwicklung für Einsteiger" ("Plugin development for beginners") by Alexander Wirt, today at 14:15
Relax, start ...
_ Create a list of your servers
_ Configure IPMI securely
_ Set up the relevant plugins
Relax, start and have fun at

Icinga Camp Berlin 2017 - 10 Tips for better Hardware Monitoring
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
 
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...
 
managing your network environment
managing your network environmentmanaging your network environment
managing your network environment
 
SR-IOV Introduce
SR-IOV IntroduceSR-IOV Introduce
SR-IOV Introduce
 
emips_overview_apr08
emips_overview_apr08emips_overview_apr08
emips_overview_apr08
 
Nvidia tegra K1 Presentation
Nvidia tegra K1 PresentationNvidia tegra K1 Presentation
Nvidia tegra K1 Presentation
 
Android Things in action
Android Things in actionAndroid Things in action
Android Things in action
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Kernel Debugging & Profiling
Kernel Debugging & ProfilingKernel Debugging & Profiling
Kernel Debugging & Profiling
 
05 module managing your network enviornment
05  module managing your network enviornment05  module managing your network enviornment
05 module managing your network enviornment
 
Kernel Debugging & Profiling
Kernel Debugging & ProfilingKernel Debugging & Profiling
Kernel Debugging & Profiling
 
Important cisco-chow-commands
Important cisco-chow-commandsImportant cisco-chow-commands
Important cisco-chow-commands
 
Txt Introduction
Txt IntroductionTxt Introduction
Txt Introduction
 
LCA13: CPUIDLE: One driver to rule them all?
LCA13: CPUIDLE: One driver to rule them all?LCA13: CPUIDLE: One driver to rule them all?
LCA13: CPUIDLE: One driver to rule them all?
 
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
 
Positive Hack Days. Pavlov. Network Infrastructure Security Assessment
Positive Hack Days. Pavlov. Network Infrastructure Security AssessmentPositive Hack Days. Pavlov. Network Infrastructure Security Assessment
Positive Hack Days. Pavlov. Network Infrastructure Security Assessment
 
CCA security answers chapter 2 test
CCA security answers chapter 2 testCCA security answers chapter 2 test
CCA security answers chapter 2 test
 

Recently uploaded

What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 

Recently uploaded (20)

Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 

OSMC 2014: Server Hardware Monitoring done right | Werner Fischer

  • 1. Server Hardware Monitoring done right! Werner Fischer, Thomas-Krenn.AG
  • 2. Status quo: Do you monitor your server hardware? Yes / No
  • 3. After this talk you will monitor more reliably and more comprehensively
  • 4. After this talk you will monitor more reliably and more comprehensively (at least I hope so... ;-)
  • 5. Status quo: Which technologies do you use? IPMI / SNMP, NRPE, CAM, CAT
  • 7. CATinspection → satisfaction?
  • 8. Agenda: IPMI (20'), RAM (5'), RAID (10'), SMART (5'), GPU (5')
  • 9. monitor your IPMI-Sensors!
  • 12. Intelligent Platform Management Interface
  • 13. Functions: Monitoring (temp, fans, ...); Recovery Control (on/off/reset); Logging (System Event Log); Inventory (FRU information)
  • 14. Architecture (block diagram). Motherboard: Baseboard Management Controller (BMC) with non-volatile storage (SDR, SEL, FRU), sensors & controls (fan sensor, temp. sensor, power control, reset control, …), system interface to the system bus, LAN interface (network controller, LAN connector), serial/modem interface (serial port sharing, serial connector), PCI mgmt. bus to a remote mgmt. card (KVM over IP, ...), auxiliary IPMB connector, and ICMB bridge. Via IPMB and private mgmt. busses: chassis mgmt. (satellite controller), chassis board, processor board, memory board, and redundant power board, each with FRU data and temp. sensors. Labels: access with username & password; access with root privileges.
  • 15. IPMI sensor classes. Discrete (true/false): several states are possible (up to 15 states, each state = 1 bit, several state bits can be active at once); delivers generic states and sensor-specific states. Similar class OEM: the meaning of the states is defined by the OEM. Threshold: the state depends on comparing the analog reading with the thresholds; delivers an analog reading and a discrete status.
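The two sensor classes on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bit-to-name table and both function names are hypothetical, and real sensors define their own state meanings per sensor type or per OEM. The threshold values are taken from the "Fan 1" example in the deck.

```python
# Hypothetical bit-to-name table, for illustration only.
POWER_SUPPLY_STATES = {
    0: "Presence detected",
    1: "Failure detected",
    2: "Predictive failure",
    3: "Power Supply AC lost",
}


def decode_discrete(mask: int, names: dict) -> list:
    """Return the names of all asserted state bits of a discrete sensor."""
    if not 0 <= mask < (1 << 15):
        raise ValueError("a discrete sensor carries at most 15 state bits")
    return [
        names.get(bit, f"state bit {bit}")
        for bit in range(15)
        if mask & (1 << bit)
    ]


def threshold_status(reading: float, lower_critical: float,
                     lower_non_critical: float) -> str:
    """Derive a discrete status by comparing an analog reading with thresholds."""
    if reading <= lower_critical:
        return "lower critical"
    if reading <= lower_non_critical:
        return "lower non-critical"
    return "ok"


if __name__ == "__main__":
    # Several state bits can be active at once.
    print(decode_discrete(0b1001, POWER_SUPPLY_STATES))
    # Fan reading and thresholds (lcr, lnc) as shown for "Fan 1".
    print(threshold_status(5719, 1720, 1978))
```

A discrete sensor therefore reports a set of states, while a threshold sensor reports one reading plus the status derived from it.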
  • 16. IPMI sensor classes: Discrete vs. Threshold
    Discrete:
    [root@test ~]# ipmitool sdr get "PS2 Status"
    Sensor ID : PS2 Status (0x71)
    Entity ID : 10.2 (Power Supply)
    Sensor Type (Discrete): Power Supply
    States Asserted : Power Supply [Presence detected] [Power Supply AC lost]
    Assertion Events : Power Supply [Presence detected] [Power Supply AC lost]
    Assertions Enabled : Power Supply [Presence detected] [Failure detected] [Predictive failure] [Power Supply AC lost] [...]
    Deassertions Enabled : Power Supply [...]
    Threshold:
    [root@test ~]# ipmitool sdr get "Fan 1"
    Sensor ID : Fan 1 (0x50)
    Entity ID : 29.1 (Fan Device)
    Sensor Type (Analog) : Fan
    Sensor Reading : 5719 (+/- 0) RPM
    Status : ok
    Nominal Reading : 6708.000
    Normal Minimum : 2451.000
    Normal Maximum : 10965.000
    Lower critical : 1720.000
    Lower non-critical : 1978.000
    Positive Hysteresis : 86.000
    Negative Hysteresis : 86.000
    Minimum sensor range : Unspecified
    Maximum sensor range : Unspecified
    Event Message Control : Per-threshold
    Readable Thresholds : lcr lnc
    Settable Thresholds : lcr lnc
    Threshold Read Mask : lcr lnc
    Assertion Events :
    Assertions Enabled : lnc- lcr-
    Deassertions Enabled : lnc- lcr-
  • 17. IPMI sensors (highlighted states: OK, Critical)
    $ sudo ipmi-sensors --output-sensor-state --interpret-oem-data
    Password:
    ID | Name | Type | State | Reading | Units | Event
    4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
    71 | Peripheral Temp | Temperature | Nominal | 35.00 | C | 'OK'
    138 | CPU Temp | OEM Reserved | Nominal | N/A | N/A | 'Low'
    205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
    …
    942 | VBAT | Voltage | Nominal | 3.15 | V | 'OK'
    1009 | VSB | Voltage | Nominal | 3.34 | V | 'OK'
    1076 | AVCC | Voltage | Nominal | 3.38 | V | 'OK'
    1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
  • 18. IPMI sensors (discrete)
    $ cat /etc/freeipmi/freeipmi_interpret_sensor.conf
    […]
    ## IPMI_Physical_Security
    #
    # IPMI_Physical_Security_No_Event Nominal
    # IPMI_Physical_Security_General_Chassis_Intrusion Critical
    # IPMI_Physical_Security_Drive_Bay_Intrusion Critical
    […]
    # IPMI_Power_Supply_No_Event Nominal
    # IPMI_Power_Supply_Presence_Detected Nominal
    # IPMI_Power_Supply_Power_Supply_Failure_Detected Critical
    # IPMI_Power_Supply_Predictive_Failure Critical
    # IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC Critical
    […]
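Once FreeIPMI has interpreted each sensor as Nominal, Warning, or Critical, a monitoring check only has to reduce the State column to a single exit code. The following is a minimal sketch of that reduction, not the logic of any existing plugin: the function names are hypothetical, the input is assumed to be the pipe-separated table layout shown in the deck, and the mapping to the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is an assumption.

```python
# Assumed mapping from the interpreted sensor state to Nagios exit codes.
NAGIOS_CODES = {"Nominal": 0, "Warning": 1, "Critical": 2}
COLUMNS = ("id", "name", "type", "state", "reading", "units", "event")


def parse_sensor_rows(output: str) -> list:
    """Parse pipe-separated sensor rows, skipping the header and other lines."""
    rows = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(COLUMNS) or fields[0] == "ID":
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def overall_exit_code(rows: list) -> int:
    """Worst state wins; unknown states or no rows at all yield UNKNOWN (3)."""
    return max((NAGIOS_CODES.get(row["state"], 3) for row in rows), default=3)


SAMPLE = """\
ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
"""

if __name__ == "__main__":
    rows = parse_sensor_rows(SAMPLE)
    print(overall_exit_code(rows))
```

With the chassis intrusion sensor interpreted as Critical, the whole check turns critical, which is why adjusting the interpretation file matters for sensors that are expected to be asserted.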
  • 19. IPMI plugin
    $ ./check_ipmi_sensor -H 192.168.255.5 -f ipmi.cfg -vv
    IPMI Status: OK | 'System Temp'=27.00 'Peripheral Temp'=35.00 'FAN 1'=1800.00 'Vcore'=0.98 '3.3VCC'=3.36 '12V'=11.93 'VDIMM'=1.53 '5VCC'=5.09 '-12V'=-12.09 'VBAT'=3.15 'VSB'=3.34 'AVCC'=3.38
    System Temp = 27.00 (Status: Nominal)
    Peripheral Temp = 35.00 (Status: Nominal)
    CPU Temp = 'Low' (Status: Nominal)
    FAN 1 = 1800.00 (Status: Nominal)
    Vcore = 0.98 (Status: Nominal)
    3.3VCC = 3.36 (Status: Nominal)
    12V = 11.93 (Status: Nominal)
    VDIMM = 1.53 (Status: Nominal)
    5VCC = 5.09 (Status: Nominal)
    -12V = -12.09 (Status: Nominal)
    VBAT = 3.15 (Status: Nominal)
    VSB = 3.34 (Status: Nominal)
    AVCC = 3.38 (Status: Nominal)
    Chassis Intru = 'OK' (Status: Nominal)
  • 20. IPMI plugin
    #!/usr/bin/perl
    # check_ipmi_sensor: Nagios/Icinga plugin to check IPMI sensors
    #
    # Copyright (C) 2009-2014 Thomas-Krenn.AG,
    # additional contributors see changelog.txt
    #
    # This program is free software; you can redistribute it and/or modify it under
    […]
    Version 3.5 20141031
    * Fix LAN Driver if called on localhost
    Version 3.4 20140929
    * Fix implicit array warning with split
    * Add option to disable LAN protocol version 2.0
    Version 3.3 20140606
    * Print a warning if ipmi-sensors only returned a single output row
    * Ignore sudo errors and warnings in IPMI command output (Thanks to Robert Heinzmann for contributing)
    * Use LAN protocol version 2.0 per default
    * Print empty output error only if return code was 0
    * Exit the plugin with return code 3 if fru command fails
    * Added an include list option to only include specific sensors
    Version 3.2 20131028
    * Added FRU serial number to output
  • 21. So far so good?
  • 24. The Eavesdropping System in Your Computer (Bruce Schneier, Schneier on Security blog, 31 Jan 2013)
  • 27. 230,000 1U servers → a stack 10,223.5 m high (Mount Everest: 8,848 m)
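The height on this slide can be checked with a short calculation. The sketch assumes the standard rack unit of 1.75 inches (44.45 mm), which the slide does not state explicitly; the function name is made up for illustration.

```python
# One rack unit is 1.75 in = 44.45 mm; kept as an integer number of
# hundredths of a millimetre to avoid floating-point rounding.
RACK_UNIT_HUNDREDTHS_MM = 4445


def stack_height_m(servers: int, units_per_server: int = 1) -> float:
    """Height in metres of a stack of servers with the given rack-unit size."""
    hundredths_mm = servers * units_per_server * RACK_UNIT_HUNDREDTHS_MM
    return hundredths_mm / 100_000  # 100 hundredths per mm, 1000 mm per m


if __name__ == "__main__":
    height = stack_height_m(230_000)
    print(f"{height} m, Everest: 8848 m, taller: {height > 8848}")
```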
  • 29. IPMI firmware by ATEN / AMI: mainboard vendors adapt the firmware; OS = embedded Linux; parts of the IPMI firmware are closed source
  • 30. We recommend not running administrative access such as IPMI, or SSH services for that matter, openly on the Internet, but instead using a firewall/VPN so that only authorized persons can reach such services.
  • 31. What if it is exposed anyway? Enable & DROP
  • 32. IPMI: top 3 security tips
  • 33. #1 - Network
  • 34. #1 - Network
  • 35. #2 - User Management: sjfaiklaz, afjhuijoh; Administrator, User
  • 36. #2 - User Management
    "In short, the authentication process for IPMI 2.0 mandates that the server send a salted SHA1 or MD5 hash of the requested user's password to the client, prior to the client authenticating." (A Penetration Tester's Guide to IPMI and BMCs, rapid7.com)
    msf > use auxiliary/scanner/ipmi/ipmi_dumphashes
    msf auxiliary(ipmi_dumphashes) > set RHOSTS 10.1.102.141
    RHOSTS => 10.1.102.141
    msf auxiliary(ipmi_dumphashes) > set THREADS 128
    THREADS => 128
    msf auxiliary(ipmi_dumphashes) > run
    [+] 10.1.102.141:623 - IPMI - Hash found: admin:14667523250000004ec525d3852f4fa73c93b674788217fe00000000000000
    00000000000000000000000000000000000000000000000000140561646d696e:2c7
    6e372d89ac7cd4e3bfecb423962f708d0741c
  • 37. #2 - User Management
    $ ./cudaHashcat64.bin --outfile=ipmi.out -m 7300 hash.txt -a 3 ?lu?lu?lu?lu?lu?lu
    [...]
    Session.Name...: cudaHashcat
    Status.........: Exhausted
    Input.Mode.....: Mask (?lu?lu?lu?lu?lu?lu) [12]
    Hash.Target....: 54414378fb2db5ff365e4bc5856adaf4c1b8a2f2153efd1b81fb54dfe1bf56478788
    ea7ba154375b40167e34f026e1020010d21d1ea31625040561646d696e:0a0b16023
    1e204a6d0bd086e26718002409b35b7
    Hash.Type......: IPMI2 RAKP HMAC-SHA1
    Time.Started...: Thu Sep 18 10:11:17 2014 (6 secs)
    Time.Estimated.: 0 secs
    Speed.GPU.#1...: 52732.3 kH/s
    Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
    Progress.......: 308915776/308915776 (100.00%)
    Skipped........: 0/308915776 (0.00%)
    Rejected.......: 0/308915776 (0.00%)
    HWMon.GPU.#1...: -1% Util, 41c Temp, 31% Fan
  • 38. #2 - User Management: 20, complex & long passwords
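Why long passwords matter follows directly from the numbers in the hashcat run. The progress counter of 308915776 candidates equals 26^6, i.e. six positions drawn from a 26-letter alphabet, and at the reported 52732.3 kH/s that space is exhausted in about six seconds. The sketch below reproduces this and extrapolates to longer passwords; the function names, the 62-character alphabet, and the lengths in the loop are assumptions chosen for illustration, not values from the talk.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of candidate passwords of exactly `length` characters."""
    return alphabet_size ** length


def seconds_to_exhaust(alphabet_size: int, length: int,
                       hashes_per_second: float) -> float:
    """Worst-case time to try every candidate at a fixed hash rate."""
    return keyspace(alphabet_size, length) / hashes_per_second


RATE = 52_732_300  # 52732.3 kH/s, as reported by the hashcat run

if __name__ == "__main__":
    print(keyspace(26, 6))                  # matches the Progress counter
    print(seconds_to_exhaust(26, 6, RATE))  # roughly the 6 seconds reported
    # Assumed alphabet of 62 characters (a-z, A-Z, 0-9) for longer passwords.
    for length in (8, 12, 20):
        years = seconds_to_exhaust(62, length, RATE) / (3600 * 24 * 365)
        print(f"length {length}: about {years:.3g} years")
```

Each added character multiplies the search space by the alphabet size, so length helps far more than a faster attacker GPU hurts.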
  • 39. #3 - Limit services
  • 42. monitor your RAM! (it's ECC, isn't it?)
  • 44. 3%: at least 1 correctable error (CE) per year (DDR2); Google 2009, Jaguar cluster 2012
  • 45. 70%: CEs before uncorrectable errors (UEs); Google 2009
  • 46. 1.3%: servers with UEs per year; Google 2009
  • 47. Linux EDAC
    root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# ls -l
    total 0
    -r--r--r-- 1 root root 4096 Nov 12 09:02 ce_count
    -r--r--r-- 1 root root 4096 Nov 12 09:02 ch0_ce_count
    -rw-r--r-- 1 root root 4096 Nov 12 09:02 ch0_dimm_label
    -r--r--r-- 1 root root 4096 Nov 12 09:02 ch1_ce_count
    -rw-r--r-- 1 root root 4096 Nov 12 09:02 ch1_dimm_label
    -r--r--r-- 1 root root 4096 Nov 12 09:02 dev_type
    -r--r--r-- 1 root root 4096 Nov 12 09:02 edac_mode
    -r--r--r-- 1 root root 4096 Nov 12 09:02 mem_type
    drwxr-xr-x 2 root root 0 Nov 12 09:02 power
    -r--r--r-- 1 root root 4096 Nov 12 09:02 size_mb
    lrwxrwxrwx 1 root root 0 Nov 12 09:02 subsystem -> ../../../../../../bus/mc0
    -r--r--r-- 1 root root 4096 Nov 12 09:02 ue_count
    -rw-r--r-- 1 root root 4096 Nov 12 09:02 uevent
    root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ce_count
    0
    root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ue_count
    0
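The `ce_count` and `ue_count` files in the listing lend themselves to a simple periodic check. The following is a minimal sketch, assuming the `mc*/csrow*` sysfs layout shown on the slide (other kernels may expose a different layout); the function names and the choice to treat any correctable error as a warning and any uncorrectable error as critical are illustrative assumptions.

```python
from pathlib import Path


def read_edac_counts(base: str = "/sys/devices/system/edac/mc") -> dict:
    """Collect (ce_count, ue_count) per memory controller and csrow."""
    counts = {}
    for csrow in sorted(Path(base).glob("mc*/csrow*")):
        ce = int((csrow / "ce_count").read_text().strip())
        ue = int((csrow / "ue_count").read_text().strip())
        counts[f"{csrow.parent.name}/{csrow.name}"] = (ce, ue)
    return counts


def nagios_state(counts: dict) -> int:
    """Assumed policy: any UE is critical (2), any CE is a warning (1)."""
    if any(ue > 0 for _, ue in counts.values()):
        return 2
    if any(ce > 0 for ce, _ in counts.values()):
        return 1
    return 0


if __name__ == "__main__":
    counts = read_edac_counts()
    for row, (ce, ue) in counts.items():
        print(f"{row}: CE={ce} UE={ue}")
    raise SystemExit(nagios_state(counts))
```

On a machine without a loaded EDAC driver the directory is empty, so the sketch reports nothing and exits with 0; a production check would probably want to flag that case separately.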
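  • 48. Linux EDAC support matrix
    Driver module | CPUs | Kernel | Supported architectures
    amd64_edac.c | AMD | 2.6.31 | K8 and F10
    amd64_edac.c | AMD | 2.6.39 | F15
    amd64_edac.c | AMD | 3.10 | F16
    amd64_edac.c | AMD | 3.13 | F15_M30H
    amd64_edac.c | AMD | 3.15 | F16_M30H
    i7core_edac.c | Intel Single/Dual | 2.6.35 | Nehalem/Westmere
    ie31200_edac.c | Intel Single-CPU | 3.17 | Sandy & Ivy Bridge, Haswell
    sb_edac.c | Intel Dual-CPU | 3.2 | Sandy Bridge
    sb_edac.c | Intel Dual-CPU | 3.13 | Ivy Bridge
    sb_edac.c | Intel Dual-CPU | 3.17 | Haswell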
  • 49. IPMI SEL (System Event Log): support starting with check_ipmi_sensor v3.6 (planned for 12/2014)
    $ ipmi-sel
    ID | Date | Time | Name | State | Event
    1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
    2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
    3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
    4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
    ...
  • 50. IPMI SEL (System Event Log): OS independent
    $ ipmi-sel
    ID | Date | Time | Name | State | Event
    1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
    2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
    3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
    4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
    ...
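Because the SEL lives in the BMC, memory errors can be counted without any help from the operating system. A minimal sketch of such a count per DIMM, assuming the pipe-separated `ipmi-sel` layout shown on the slide; the function name is hypothetical and this is not how check_ipmi_sensor implements its SEL support.

```python
from collections import Counter


def count_correctable_errors(sel_output: str) -> Counter:
    """Count 'Correctable memory error' SEL events per sensor name."""
    per_dimm = Counter()
    for line in sel_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expected columns: ID, Date, Time, Name, State, Event
        if len(fields) == 6 and fields[5] == "Correctable memory error":
            per_dimm[fields[3]] += 1
    return per_dimm


SAMPLE_SEL = """\
ID | Date | Time | Name | State | Event
1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
"""

if __name__ == "__main__":
    for dimm, errors in count_correctable_errors(SAMPLE_SEL).items():
        print(f"{dimm}: {errors} correctable errors")
```

A rising count on one DIMM, as in the sample, is the kind of trend worth alerting on before an uncorrectable error occurs.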
  • 53. Linux software RAID; LSI / Adaptec hardware RAID
  • 55. root@debian-test:~# storcli64
    Storage Command Line Tool Ver 1.13.06 Sep 03, 2014
    (c)Copyright 2014, LSI Corporation, All Rights Reserved.
    help - lists all the commands with their usage. E.g. storcli help
    <command> help - gives details about a particular command. E.g. storcli add help
    List of commands:
    Commands Description
    -------------------------------------------------------------------
    add Adds/creates a new element to controller like VD,Spare..etc
    delete Deletes an element like VD,Spare
    show Displays information about an element
    set Set a particular value to a property
    get Get a particular value to a property
    compare Compares particular value to a property
    start Start background operation
    stop Stop background operation
    pause Pause background operation
    resume Resume background operation
    download Downloads file to given device
    expand expands size of given drive
    insert inserts new drive for missing
    transform downgrades the controller
    /cx Controller specific commands
    /ex Enclosure specific commands
    /sx Slot/PD specific commands
    /vx Virtual drive specific commands
    /dx Disk group specific commands
    /fall Foreign configuration specific commands
    /px Phy specific commands
    /[bbu|cv] Battery Backup Unit, Cachevault commands
  • 56. check_lsi_raid
    $ /usr/lib/nagios/plugins/check_lsi_raid -vv
    Warning (LD Warn) [c0/v0_Consist = Warning (No)]| CV_Temperature=22;70;85 ROC_Temperature=57;80;90 c0/e252/s0_Drive_Temperature=21;40;45 c0/e252/s1_Drive_Temperature=21;40;45
    Used storcli commands:
    - /usr/bin/sudo /usr/sbin/storcli64 /c0 /cv show status
    - /usr/bin/sudo /usr/sbin/storcli64 adpallinfo a0
    - /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show all
    - /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show init
    - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show all
    - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show initialization
    - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show rebuild
    Warning sensors:
    - c0/v0_Consist (No)
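Everything after the `|` in the plugin output is performance data in the `label=value;warn;crit` form. The sketch below splits such a string into its parts so the values can be graphed or compared; the function name is hypothetical, values are assumed to be plain numbers without a unit suffix, and thresholds are kept as strings because they may be range expressions.

```python
import shlex


def parse_perfdata(perfdata: str) -> dict:
    """Map each label to (value, warn, crit); missing thresholds become None."""
    metrics = {}
    # shlex handles quoted labels such as 'System Temp'=27.00
    for token in shlex.split(perfdata):
        label, _, rest = token.rpartition("=")
        parts = rest.split(";")
        value = float(parts[0])
        warn = parts[1] if len(parts) > 1 and parts[1] else None
        crit = parts[2] if len(parts) > 2 and parts[2] else None
        metrics[label] = (value, warn, crit)
    return metrics


if __name__ == "__main__":
    line = ("CV_Temperature=22;70;85 ROC_Temperature=57;80;90 "
            "c0/e252/s0_Drive_Temperature=21;40;45")
    for label, (value, warn, crit) in parse_perfdata(line).items():
        print(f"{label}: value={value} warn={warn} crit={crit}")
```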
  • 57. Why adpallinfo a0? "storcli /0 show all … blocks the whole raid card i/o for … upto ~4 seconds"
  • 58. Why adpallinfo a0? "storcli /0 show all … blocks the whole raid card i/o for … upto ~4 seconds"
  • 59. check_lsi_raid
    $ /usr/lib/nagios/plugins/check_lsi_raid -h
    check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
    Pulgin version: 2.0
    Copyright (C) 2013-2014 Thomas-Krenn.AG
    Current updates available at http://git.thomas-krenn.com/check_lsi_raid.git
    This Nagios/Icinga Plugin checks LSI RAID controllers for controller, physical device, logical device, BBU and CV warnings and errors.
    In order for this plugin to work properly you need to add the nagios user to your sudoers file (or create a new one in /etc/sudoers.d/).
    Usage:
    [ -h | --help ] Display this help page
    [ -v | -vv | -vvv | --verbose ] Sets the verbosity level. No -v is the normal single line output for Nagios/Icinga, -v is a more detailed version but still usable in Nagios. -vv is a multiline output for debugging configuration errors or more detailed information. -vvv is for plugin problem diagnosis. For further information please visit: http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN39
    [ -V --version ] Displays the plugin and, if available, the version if StorCLI.
    [ -C <num> | --controller <num> ] Specifies a controller number, defaults to 0.
    ...
  • 60. VMware? → CIM provider
  • 61. VMware? → Plugin: check_esxi_hardware.py (hardware; uses python-pywbem; info: Claudio Kuenzler et al.) and check_vmware_esx.pl (VMware in general; uses the VMware Perl SDK; info: Martin Fürstenau)
  • 62. VMware? check_esxi_hardware.py
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    #
    # Script for checking global health of host running VMware ESX/ESXi
    #
    # Licence : GNU General Public Licence (GPL) http://www.gnu.org/
    # This program is free software; you can redistribute it and/or ...
    # Copyright (c) 2008 David Ligeret
    # Copyright (c) 2009 Joshua Daniel Franklin
    # Copyright (c) 2010 Branden Schneider
    # Copyright (c) 2010-2014 Claudio Kuenzler
    # Copyright (c) 2010 Samir Ibradzic
    # Copyright (c) 2010 Aaron Rogers
    # Copyright (c) 2011 Ludovic Hutin
    # Copyright (c) 2011 Carsten Schoene
    # Copyright (c) 2011-2012 Phil Randal
    # Copyright (c) 2011 Fredrik Aslund
    # Copyright (c) 2011 Bertrand Jomin
    # Copyright (c) 2011 Ian Chard
    # Copyright (c) 2012 Craig Hart
    # Copyright (c) 2013 Carl R. Friend
  • 64. $ sudo arcconf
    | UCLI | Adaptec by PMC uniform command line interface
    | UCLI | Version 1.6 (B21062)
    | UCLI | (C) Adaptec by PMC 2003-2014
    | UCLI | All Rights Reserved
    ATAPASSWORD | setting password on a physical drive
    COPYBACK | toggles controller copy back mode
    CREATE | creates a logical device
    CONSISTENCYCHECK | toggles the controller background consistency check mode
    DELETE | deletes one or more logical devices
    ERRORTUNABLE | sets error tunable profiles on the controller
    EXPANDERLIST | Lists the Expanders Connected to the Controller
    EXPANDERUPGRADE | updates expander firmware
    FAILOVER | toggles the controller automatic failover mode
    GETCONFIG | prints controller information
    GETLOGS | gets controller log information
    GETPERFORM | gets the parameters for a performance mode
    GETSMARTSTATS | gets the SMART statistics
    GETSTATUS | displays the status of running tasks
    GETVERSION | prints version information for all controllers
    IDENTIFY | blinks LEDS on device(s) connected to a controller
    IMAGEUPDATE | update physical device firmware
    KEY | installs a Feature Key onto a controller
    MODIFY | performs RAID Level Migration or Online Capacity Expansion
    PHYERRORLOG | displays PHY error logs for controller or device or an expander PHY
    PRESERVECACHE | changes the cache preservation settings on the controller
    RESCAN | checks for new or removed drives
    RESETSTATISTICSCOUNTERS | resets the controller statistics counters
    ROMUPDATE | updates controller firmware
    SAVESUPPORTARCHIVE | saves the support archive
    SETALARM | controls the controller alarm, if present
    ...
  • 65. check_adaptec_raid: update planned (2015)
    $ ./check_adaptec_raid -p /usr/sbin/arcconf
    AACRAID CRITICAL (Ctrl #1): [ZMM critical]
    $ ./check_adaptec_raid -h
    Thomas-Krenn Adaptec Raid Controller Nagios/Icinga Plugin
    Version: 1.0
    Copyright (C) 2009-2013 Thomas-Krenn.AG
    Current updates available via git at: http://git.thomas-krenn.com/check_adaptec_raid.git
    This Nagios/Icinga Plugin checks ADAPTEC RAID-Controllers for Controller, Physical-Device and Logical Device warnings and errors.
    In order for this plugin to work properly you need to add the nagios-user to your sudoers file (or create a new one in /etc/sudoers.d/). This is required as arcconf must be called with sudo permissions.
    Usage:
    [ -C <Controller number> ]
    [ -LD <Logical device number> ]
    [ -PD <Physical device number> ]
    [ -T <Warning Temp., Crit. Temp.> ]
    [ -h | --help ] Display this help page
    [ -v | -vv | -vvv | --verbose ] Sets the verbosity level
    no -v single line output for Nagios/Icinga
    -v single line with more details
    ...
  • 66. VMware? → CIM provider expected _ currently: _ "CIM provider" for remote arcconf _ Adaptec MSM in a VM _ in future: _ "real" CIM provider
  • 67. be smart, use SMART ;-)
  • 68. Self-Monitoring, Analysis & Reporting Technology
  • 69. Standardized: data format, commands, error logs, tests. NOT standardized: attributes (vendor documentation required; often not public, except Intel/Samsung).
  • 70. check_smart_attributes
     $ /usr/lib/nagios/plugins/check_smart_attributes \
         -d /dev/sda \
         -dbj /etc/nagios-plugins/config/check_smartdb.json
     OK (sda) |sda_Media_Wearout_Indicator=098;16;6 sda_Host_Writes_32MiB=575272 sda_Host_Reads_32MiB=723527
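The part after the `|` in the output above is Nagios performance data in the form `label=value;warn;crit`. A minimal Python sketch of splitting it into its fields follows; the names `PerfValue` and `parse_perfdata` are made up for illustration and are not part of the plugin, and the sketch assumes values without unit suffixes, as in the slide.

```python
from typing import List, NamedTuple, Optional


class PerfValue(NamedTuple):
    """One performance-data entry (hypothetical helper type)."""
    label: str
    value: float
    warn: Optional[str]
    crit: Optional[str]


def parse_perfdata(plugin_output: str) -> List[PerfValue]:
    """Split 'STATUS text | label=value;warn;crit ...' into entries."""
    # Everything after the first '|' is performance data.
    _, _, perf = plugin_output.partition("|")
    entries = []
    for token in perf.split():
        label, _, data = token.partition("=")
        fields = data.split(";")
        entries.append(
            PerfValue(
                label=label,
                value=float(fields[0]),  # assumes no unit suffix such as '%'
                warn=fields[1] if len(fields) > 1 and fields[1] else None,
                crit=fields[2] if len(fields) > 2 and fields[2] else None,
            )
        )
    return entries


# Output line copied from the slide.
output = (
    "OK (sda) |sda_Media_Wearout_Indicator=098;16;6 "
    "sda_Host_Writes_32MiB=575272 sda_Host_Reads_32MiB=723527"
)
for entry in parse_perfdata(output):
    print(entry)
```

Only the wearout indicator carries warning and critical thresholds here; the two host counters are plain values, so their `warn` and `crit` fields come back as `None`.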
  • 71. /etc/nagios-plugins/config/check_smartdb.json
     ...
     "Intel DC S3700" : {
       "Device" : ["Intel DC S3700 Series SSDs","INTEL SSDSC2BA100G3", ...],
       "ID#" : {
         "5" : "RAW_VALUE",     # Re-allocated Sector Count
         ...
         "194" : "RAW_VALUE",   # Temperature - Device Internal Te ...
         "232" : "VALUE",       # Available Reserved Space
         "233" : "VALUE",       # Media Wearout Indicator
         "234" : "VALUE",       # Thermal Throttle Status
         "241" : "RAW_VALUE",   # Total LBAs Written (32MiB)
         "242" : "RAW_VALUE",   # Total LBAs Read (32MiB)
         "1024" : "VALUE"       # ATA error count (custom)
       },
       "Threshs" : {
         "5" : ["20","40"],
         ...
         "232" : ["16:","11:"],
         "233" : ["16:","6:"],
         "1024" : ["0","10"]
       },
       "Perfs" : ["194","233","241","242"]
     },
     ...
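The `Threshs` entries look like Nagios threshold ranges: `"16:"` would alert when the value drops below 16, while a bare `"20"` would alert when the value leaves 0..20. The following Python sketch evaluates such ranges under the assumption that the plugin follows the range syntax of the Nagios Plugin Development Guidelines; it is not taken from the plugin's source, and `outside_range` and `check` are hypothetical names.

```python
def outside_range(value: float, spec: str) -> bool:
    """Return True if `value` should raise an alert for a Nagios range spec.

    Supported forms: '10' (0..10), '10:' (10..inf), '~:10' (-inf..10),
    '10:20', and a leading '@' to invert the check.
    """
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        start_text, end_text = "0", spec
    if start_text == "~":
        start = float("-inf")
    else:
        start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else float("inf")
    inside = start <= value <= end
    return inside if inverted else not inside


def check(value: float, warn: str, crit: str) -> str:
    """Map a value to a Nagios status; critical takes precedence."""
    if outside_range(value, crit):
        return "CRITICAL"
    if outside_range(value, warn):
        return "WARNING"
    return "OK"


# Thresholds from the slide: "233" : ["16:", "6:"] and "5" : ["20", "40"].
print(check(98, "16:", "6:"))  # OK
print(check(12, "16:", "6:"))  # WARNING
print(check(3, "16:", "6:"))   # CRITICAL
print(check(25, "20", "40"))   # WARNING
```

Read this way, the Media Wearout Indicator (ID 233) warns below 16 and goes critical below 6, which fits a value that counts down as the SSD wears.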
  • 74. /etc/nagios-plugins/config/check_smartdb.json: new SSDs & HDDs all the time. Updates?
  • 76. OK cool, but what about RAID controllers?
     ...
     [-d | --device <path to device being checked>]
        Specify the device being monitored. If multiple devices should be checked provide the '-d' option multiple times. E.g. '-d /dev/sda -d /dev/sdb'
        For devices behind LSI RAID controllers specify 'megaraid' and then the device number, e.g. '-d megaraid6'. Use storcli to find out the corresponding device numbers.
        For devices behind Adaptec RAID controllers specify '/dev/sg<X>' where <X> is the number for your device. Use e.g. sg_scan to find the device. You must also use '-O sat' or '-O scsi' according to the device interface. This are extra options only necessary for '/dev/sg<X>' devices.
     ...
  • 77. OK cool, but what about RAID controllers?
     $ /usr/lib/nagios/plugins/check_smart_attributes \
         -d megaraid6 \
         -dbj /etc/nagios-plugins/config/check_smartdb.json
     OK (megaraid6) | megaraid6_Temperature_Internal=26 megaraid6_Media_Wearout_Indicator=100;16;6 megaraid6_Host_Writes_32MiB=70283 megaraid6_Host_Reads_32MiB=1650800
     $ /usr/lib/nagios/plugins/check_smart_attributes \
         -d megaraid7 \
         -dbj /etc/nagios-plugins/config/check_smartdb.json
     Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]| megaraid7_Temperature_Internal=34 megaraid7_Media_Wearout_Indicator=098;16;6 megaraid7_Host_Writes_32MiB=189904 megaraid7_Host_Reads_32MiB=29658
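When several drives behind one controller are checked separately, as above, the per-drive results usually need to be folded into one overall state. The Python sketch below picks the worst status from a list of output lines. The ranking of UNKNOWN between OK and WARNING is an assumption, the helper names are hypothetical, and the sample lines are the slide's outputs with the performance data shortened.

```python
from typing import Iterable

# Ranking used to pick the "worst" result; this ordering is an assumption.
RANK = {"OK": 0, "UNKNOWN": 1, "WARNING": 2, "CRITICAL": 3}
# Standard Nagios plugin exit codes.
EXIT_CODE = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def status_of(plugin_output: str) -> str:
    """Read the status from the first word of a plugin output line."""
    words = plugin_output.split(None, 1)
    if not words:
        return "UNKNOWN"
    status = words[0].upper()
    return status if status in RANK else "UNKNOWN"


def worst_status(outputs: Iterable[str]) -> str:
    """Return the most severe status among several plugin output lines."""
    statuses = [status_of(line) for line in outputs]
    if not statuses:
        return "UNKNOWN"
    return max(statuses, key=RANK.__getitem__)


# Output lines from the slide, performance data shortened.
results = [
    "OK (megaraid6) | megaraid6_Temperature_Internal=26",
    "Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]| "
    "megaraid7_Temperature_Internal=34",
]
overall = worst_status(results)
print(overall, EXIT_CODE[overall])
```

In practice the monitoring system can also treat each drive as its own service check; a combined state is mainly useful for summary views.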
  • 79. check_gpu_sensor
     $ /usr/lib/nagios/plugins/check_gpu_sensor -db 0000:83:00.0
     OK - Tesla K20 |ECCL2AggSgl=0;1;2; ECCTexAggSgl=0;1;2; memUtilRate=0 PWRUsage=49.81;150;200; ECCRegAggSgl=0;1;2; SMClock=705 ECCL1AggSgl=0;1;2; GPUTemperature=38;85;100; memClock=2600 usedMemory=0.24;95;99; fanSpeed=30;80;95; graphicsClock=705 GPUUtilRate=0 ECCMemAggSgl=0;1;2;
  • 80. NVIDIA: "the reported fan speed does not indicate whether the fan is actually spinning." "It is the speed at which the fan algorithm is trying to run the fan." We recommend: "temperature sensor"
  • 81. Plugins - Future _ monitoring of firmware versions _ RAID consistency checks _ temperature of 10 Gbit NICs (see Intel X540 FAQs)
  • 82. so, what now?
  • 83. Relax ... _ all plugins are available at git.thomas-krenn.com _ all plugins comply with the Plugin Developer Guidelines (-h for help) _ "Plugin Entwicklung für Einsteiger" (plugin development for beginners) by Alexander Wirt, today at 14:15
  • 84. Relax, start ... _ create a server list _ configure IPMI securely _ set up the relevant plugins
  • 85. Relax, start and have fun at