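How do you monitor server hardware with Zabbix?
For a Linux system engineer, server administration is not only about maintaining the operating system. Just as important is monitoring the hardware: whether the RAID status is healthy (a faulty RAID controller degrades read/write speed), whether the disks are healthy (a failed disk can, in the worst case, lose data), whether a power supply has failed, and so on. You also need to monitor the temperature of key components such as the CPU, memory, and processor zones, and raise an alert when a temperature exceeds the server's threshold.
For hardware management, HP ships its own tool for its servers, hpacucli, which reports the server's RAID and disk information.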
1) Install hpacucli (download: HP hpacucli management tool)
- [root@monitor ~]# rpm -ivh hpacucli-9.40-12.0.x86_64.rpm
2) Check the server's RAID information and whether the disks are healthy.
- [root@monitor ~]# hpacucli ctrl all show config
- Smart Array P410i in Slot 0 (Embedded) (sn: 5001438018042FF0)
- array A (SAS, Unused Space: 0 MB)
- logicaldrive 1 (279.4 GB, RAID 1, OK)
- physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
- physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
3) The hpacucli ctrl all show config detail command shows the RAID and disk information in full detail.
- [root@monitor ~]# hpacucli ctrl all show config detail
- Smart Array P410i in Slot 0 (Embedded)
- Bus Interface: PCI
- Slot: 0
- Serial Number: 5001438018042FF0
- Cache Serial Number: PBCDH0CRH1FH62
- RAID 6 (ADG) Status: Disabled
- Controller Status: OK
- Chassis Slot:
- Hardware Revision: Rev C
- Firmware Version: 5.14
- Rebuild Priority: Medium
- Expand Priority: Medium
- Surface Scan Delay: 15 secs
- Monitor and Performance Delay: 60 min
- Elevator Sort: Enabled
- Degraded Performance Optimization: Disabled
- Inconsistency Repair Policy: Disabled
- Post Prompt Timeout: 0 secs
- Cache Board Present: True
- Cache Status: OK
- Accelerator Ratio: 25% Read / 75% Write
- Drive Write Cache: Disabled
- Total Cache Size: 512 MB
- No-Battery Write Cache: Disabled
- Cache Backup Power Source: Capacitors
- Battery/Capacitor Count: 1
- Battery/Capacitor Status: OK
- SATA NCQ Supported: True
- Array: A
- Interface Type: SAS
- Unused Space: 0 MB
- Status: OK
- Logical Drive: 1
- Size: 279.4 GB
- Fault Tolerance: RAID 1
- Heads: 255
- Sectors Per Track: 32
- Cylinders: 65535
- Stripe Size: 128 KB
- Status: OK
- Array Accelerator: Enabled
- Unique Identifier: 600508B1001034373220202020200002
- Disk Name: /dev/cciss/c0d0
- Mount Points: /boot 99 MB
- Logical Drive Label: A00ADBD9PR7AMU1472 898D
- Mirror Group 0:
- physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
- Mirror Group 1:
- physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
- physicaldrive 1I:1:1
- Port: 1I
- Box: 1
- Bay: 1
- Status: OK
- Drive Type: Data Drive
- Interface Type: SAS
- Size: 300 GB
- Rotational Speed: 10000
- Firmware Revision: HPD4
- Serial Number: ECA1PC80GTS31234
- Model: HP EG0300FBDSP
- PHY Count: 2
- PHY Transfer Rate: 6.0GBPS, Unknown
- physicaldrive 1I:1:2
- Port: 1I
- Box: 1
- Bay: 2
- Status: OK
- Drive Type: Data Drive
- Interface Type: SAS
- Size: 300 GB
- Rotational Speed: 10000
- Firmware Revision: HPD7
- Serial Number: PMX6902D
- Model: HP EG0300FBDBR
- PHY Count: 2
- PHY Transfer Rate: 6.0GBPS, Unknown
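The summary printed by `hpacucli ctrl all show config` can be reduced to a single OK/NO value that is easy to feed into monitoring. The following is a minimal sketch, not the author's script: `drives_status` is a hypothetical helper name, and it assumes every logical and physical drive line ends with its status in the form `OK)`, as in the output shown earlier.

```shell
#!/bin/sh
# Sketch only: drives_status is a hypothetical helper, not part of hpacucli.
# It reads `hpacucli ctrl all show config` output on stdin and prints OK
# only when every logicaldrive/physicaldrive line ends with "OK)".
drives_status() {
    awk '
        /logicaldrive|physicaldrive/ {
            total++
            if ($NF == "OK)") ok++
        }
        END {
            if (total > 0 && total == ok) print "OK"; else print "NO"
        }
    '
}

# Sample input copied from the command output shown earlier.
sample='Smart Array P410i in Slot 0 (Embedded) (sn: 5001438018042FF0)
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (279.4 GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)'

printf '%s\n' "$sample" | drives_status
```

On a real host the function would be fed directly, for example `hpacucli ctrl all show config | drives_status`. Input with no drive lines at all yields NO, so a failed or missing hpacucli is not mistaken for a healthy array.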
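HP also provides the hpasmcli management tool, which reports detailed temperature information for the CPU, memory, processor zones, power supplies, and other components.
1) Install hpasmcli, which ships in the hp-health package (download: HP hpasmcli management tool)
- [root@monitor ~]# rpm -ivh hp-health-9.40-1602.44.rhel6.x86_64.rpm
2) hpasmcli shows the temperature of each component. Temp is the component's current temperature and Threshold is its critical temperature; any sensor whose current temperature exceeds its threshold needs attention.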
- [root@monitor ~]# hpasmcli -s 'show temp'
- Sensor Location Temp Threshold
- ------ -------- ---- ---------
- #1 AMBIENT 23C/73F 42C/107F
- #2 CPU#1 40C/104F 82C/179F
- #3 CPU#2 40C/104F 82C/179F
- #4 MEMORY_BD 33C/91F 87C/188F
- #5 MEMORY_BD 33C/91F 78C/172F
- #6 MEMORY_BD - 87C/188F
- #7 MEMORY_BD 32C/89F 78C/172F
- #8 MEMORY_BD 32C/89F 87C/188F
- #9 MEMORY_BD 32C/89F 78C/172F
- #10 MEMORY_BD - 87C/188F
- #11 MEMORY_BD 32C/89F 78C/172F
- #12 POWER_SUPPLY_BAY 33C/91F 59C/138F
- #13 POWER_SUPPLY_BAY 47C/116F 73C/163F
- #14 MEMORY_BD 29C/84F 72C/161F
- #15 PROCESSOR_ZONE 32C/89F 73C/163F
- #16 PROCESSOR_ZONE 30C/86F 64C/147F
- #17 MEMORY_BD 28C/82F 63C/145F
- #18 PROCESSOR_ZONE 39C/102F 69C/156F
- #19 SYSTEM_BD 35C/95F 69C/156F
- #20 SYSTEM_BD 38C/100F 71C/159F
- #21 SYSTEM_BD 44C/111F 65C/149F
- #22 SYSTEM_BD 45C/113F 71C/159F
- #23 SYSTEM_BD 39C/102F 69C/156F
- #24 SYSTEM_BD 47C/116F 69C/156F
- #25 SYSTEM_BD 35C/95F 63C/145F
- #26 SYSTEM_BD 45C/113F 66C/150F
- #27 SCSI_BACKPLANE_ZONE 35C/95F 60C/140F
- #28 SYSTEM_BD 73C/163F 110C/230F
3) hpasmcli -s 'show' prints a help-like list of the available subcommands. For monitoring, focus on DIMM (memory), FANS (fans), POWERSUPPLY (power supply modules), SERVER (system), CPU, and TEMP (temperature).
- [root@monitor ~]# hpasmcli -s 'show'
- Invalid Arguments
- SHOW ASR
- SHOW BOOT
- SHOW DIMM [ SPD ]
- SHOW F1
- SHOW FANS
- SHOW HT
- SHOW IML
- SHOW IPL
- SHOW NAME
- SHOW PORTMAP
- SHOW POWERMETER
- SHOW POWERSUPPLY
- SHOW PXE
- SHOW SERIAL [ BIOS | EMBEDDED | VIRTUAL ]
- SHOW SERVER
- SHOW TEMP
- SHOW TPM
- SHOW UID
- SHOW WOL
4) Some commonly used hpasmcli examples.
- Memory information: hpasmcli -s 'show dimm'|egrep -i 'module|stat'
- Fan information: hpasmcli -s 'show fans'
- Hardware temperatures: hpasmcli -s 'show temp'
- Power supply modules: hpasmcli -s 'show powersupply'
- Machine model, serial number, CPU, and memory size: hpasmcli -s 'show server'
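The Temp and Threshold columns of `hpasmcli -s 'show temp'` can be compared mechanically. The sketch below is one possible way to do it, under the assumption that each sensor row has the four whitespace-separated fields seen in the table above (sensor, location, temperature, threshold). The helper names `hot_sensors` and `max_temp` are hypothetical.

```shell
#!/bin/sh
# Sketch only: hot_sensors and max_temp are hypothetical helper names.
# Both read `hpasmcli -s 'show temp'` output on stdin.

hot_sensors() {
    # Print every sensor whose current temperature exceeds its threshold.
    # Sensors that report "-" (no reading) are skipped.
    awk '
        $1 ~ /^#[0-9]+$/ && $3 != "-" {
            temp = $3;  sub(/C.*/, "", temp)    # "40C/104F" -> "40"
            limit = $4; sub(/C.*/, "", limit)   # "82C/179F" -> "82"
            if (temp + 0 > limit + 0) {
                print $1, $2, temp "C", "threshold", limit "C"
            }
        }
    '
}

max_temp() {
    # usage: max_temp LOCATION_PREFIX
    # Print the highest Celsius reading among sensors whose location starts
    # with the given prefix (0 if none of them report a reading).
    awk -v loc="$1" '
        index($2, loc) == 1 && $3 != "-" {
            t = $3; sub(/C.*/, "", t)
            if (t + 0 > max) max = t + 0    # "+ 0" forces a numeric comparison
        }
        END { print max + 0 }
    '
}

# A few rows copied from the temperature table above.
sample='#1 AMBIENT 23C/73F 42C/107F
#2 CPU#1 40C/104F 82C/179F
#3 CPU#2 40C/104F 82C/179F
#4 MEMORY_BD 33C/91F 87C/188F
#6 MEMORY_BD - 87C/188F
#12 POWER_SUPPLY_BAY 33C/91F 59C/138F
#13 POWER_SUPPLY_BAY 47C/116F 73C/163F'

printf '%s\n' "$sample" | hot_sensors                 # prints nothing here
printf '%s\n' "$sample" | max_temp POWER_SUPPLY_BAY   # prints 47
printf '%s\n' "$sample" | max_temp CPU                # prints 40
```

Matching on a location prefix lets `CPU` cover both `CPU#1` and `CPU#2`. On a real host the input would come from `hpasmcli -s 'show temp'` instead of the sample variable.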
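Because server vendors differ and each ships its own management tools, Zabbix has no detailed, comprehensive built-in solution for server hardware. dl528888 previously wrote about monitoring Dell servers from Zabbix with the OMSA tool, which is a good approach that I have borrowed from; many thanks for it.
Zabbix monitoring boils down to two approaches. In the first, the server pulls data through agentd, which requires defining a UserParameter, that is, a key. In the second, the server receives data through trapper items: the monitored host actively pushes the data to the server or proxy with zabbix_sender. I use the second, trapper-based approach here. With the first approach the server sometimes fails to get a value:
- became not supported: Received value [] is not suitable for value type [Numeric (unsigned)] and data type [Decimal]
and reports the error above.
First, here is my monitoring script. Because it follows the trapper approach, each line of log_file holds, in order, the monitored server's hostname, the item key, and the value.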
- [root@monitor scripts]# cat hpacuclizabbix.sh
- #!/bin/sh
- #create by sfzhang 20140517
- #This script monitors an HP server: Smart Array status, hardware status and server temperatures.
- zabbix_server="*.*.*.*" #IP of the Zabbix server or proxy that the data is sent to.
- zabbix_sender="/usr/local/zabbix/bin/zabbix_sender"
- log_file='/tmp/hpacuclizabbix.log' #File that holds one "host key value" line per monitored item
- hpacucli='/usr/sbin/hpacucli'
- options='ctrl all show config detail'
- hpacucli_log="/tmp/result.log"
- PATH=$PATH:/usr/sbin:/sbin
- ${hpacucli} ${options} > ${hpacucli_log}
- Cache_status=`cat ${hpacucli_log} |awk '/Cache Status:/{print $NF}'`
- Controller_status=`cat ${hpacucli_log} |awk '/Controller Status:/{print $NF}'`
- Battery_capacitor_status=`cat ${hpacucli_log} |awk '/Battery\/Capacitor Status:/{print $NF}'`
- Physicaldrive_status=$(awk -v total=`hpacucli ctrl slot=0 pd all show status |grep physicaldrive |wc -l` -v normal=`hpacucli ctrl slot=0 pd all show status|awk '/physicaldrive/{if($NF=="OK") count+=1}END{print count}'` 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
- Memory_status=$(awk -v total=`hpasmcli -s 'SHOW DIMM'|grep -i 'Status' |wc -l` -v normal=`hpasmcli -s 'SHOW DIMM' |awk '/Status:/{if($NF=="Ok") count+=1}END{print count}'` 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
- Fans_status=$(awk -v total=`hpasmcli -s 'SHOW FANS' |grep "#" |wc -l` -v normal=`hpasmcli -s 'SHOW FANS' |awk '/#/{if($3=="Yes") count+=1}END{print count}'` 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
- Power_status=$(awk -v total=`hpasmcli -s 'SHOW POWERSUPPLY' |grep "Power supply" |wc -l` -v normal=`hpasmcli -s 'SHOW POWERSUPPLY' |awk '/Condition:/{if ($NF=="Ok") count+=1}END{print count}'` 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
- Processor_status=$(awk -v total=`hpasmcli -s 'SHOW SERVER' |grep "Processor:" |wc -l` -v normal=`hpasmcli -s 'SHOW SERVER' |awk '/Status/{if ($NF=="Ok") count+=1}END{print count}'` 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
- Power_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/POWER_SUPPLY_BAY/{print $3}'|awk -F"C" '{print $1}'|awk 'BEGIN {max = 0} {if ($1+0>max) max=$1+0} END {print max}')
- Ambient_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/AMBIENT/{print $3}'|awk -F"C" '{print $1}')
- Cpu_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/CPU/{print $3}'|awk -F"C" '{print $1}' |awk 'BEGIN {max = 0} {if ($1+0>max) max=$1+0} END {print max}')
- Memory_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/MEMORY_BD/{print $3}'|awk -F"C" '{print $1}' |awk 'BEGIN {max = 0} {if ($1+0>max) max=$1+0} END {print max}')
- System_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/SYSTEM_BD/{print $3}'|awk -F"C" '{print $1}' |awk 'BEGIN {max = 0} {if ($1+0>max) max=$1+0} END {print max}')
- Processor_temp_num=$(hpasmcli -s 'SHOW TEMP' |awk '/PROCESSOR_ZONE/{print $3}'|awk -F"C" '{print $1}' |awk 'BEGIN {max = 0} {if ($1+0>max) max=$1+0} END {print max}')
- echo $HOSTNAME hp_smart_array.cache_status $Cache_status >${log_file}
- echo $HOSTNAME hp_smart_array.controller_status $Controller_status >>${log_file}
- echo $HOSTNAME hp_smart_array.battery_capacitor_status $Battery_capacitor_status >>${log_file}
- echo $HOSTNAME hp_hardware.hpysicaldrive_status $Physicaldrive_status >>${log_file}
- echo $HOSTNAME hp_hardware.memory_status $Memory_status >>${log_file}
- echo $HOSTNAME hp_hardware.fans_status $Fans_status >>${log_file}
- echo $HOSTNAME hp_hardware.power_status $Power_status >>${log_file}
- echo $HOSTNAME hp_hardware.processor_status $Processor_status >>${log_file}
- echo $HOSTNAME hp_power.temp_num $Power_temp_num >> ${log_file}
- echo $HOSTNAME hp_ambient.temp_num $Ambient_temp_num >> ${log_file}
- echo $HOSTNAME hp_cpu.temp_num $Cpu_temp_num >> ${log_file}
- echo $HOSTNAME hp_memory.temp_num $Memory_temp_num >> ${log_file}
- echo $HOSTNAME hp_system.temp_num $System_temp_num >> ${log_file}
- echo $HOSTNAME hp_processor.temp_num $Processor_temp_num >> ${log_file}
- $zabbix_sender -z $zabbix_server -i ${log_file} > /tmp/zabbix.temp
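The file passed to zabbix_sender with -i holds one whitespace-separated "hostname key value" line per item. If one of the hardware commands fails, the corresponding shell variable is empty and the line is incomplete. The sketch below shows one way to guard against that; `emit_item` is a hypothetical helper and `monitor` is a placeholder hostname, neither is part of the original script.

```shell
#!/bin/sh
# Sketch only: emit_item is a hypothetical helper name.
# usage: emit_item <host> <key> <value>
# Prints one "<host> <key> <value>" line in the format that
# `zabbix_sender -i <file>` reads. When the value is empty, nothing is
# printed on stdout and a warning goes to stderr instead.
emit_item() {
    if [ -n "$3" ]; then
        printf '%s %s %s\n' "$1" "$2" "$3"
    else
        printf 'skipping %s: empty value\n' "$2" >&2
    fi
}

host=monitor   # placeholder; the script above uses $HOSTNAME

emit_item "$host" hp_smart_array.cache_status OK
emit_item "$host" hp_cpu.temp_num 40
emit_item "$host" hp_memory.temp_num ""     # skipped, warning on stderr
```

In the script, the output of such calls would be redirected into ${log_file} before zabbix_sender runs. Skipped items then simply receive no new value on the server.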
Finally, schedule the script in crontab so that it runs every 5 minutes.
- [root@monitor ~]# echo "*/5 * * * * /etc/zabbix/scripts/hpacuclizabbix.sh" >> /var/spool/cron/root
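Appending to the cron file with echo adds a duplicate entry every time the command is repeated. As a hedged alternative, the sketch below filters an existing crontab and appends the entry only when it is missing; `add_cron_line` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: add_cron_line is a hypothetical helper name.
# usage: <existing crontab on stdin> | add_cron_line '<entry>'
# Prints the existing crontab, followed by the entry if it is not
# already present as an exact line.
add_cron_line() {
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -qxF -- "$1"; then
        printf '%s\n' "$1"
    fi
}

entry='*/5 * * * * /etc/zabbix/scripts/hpacuclizabbix.sh'

# Demonstration on text only; nothing is installed here.
printf '%s\n' '0 3 * * * /usr/sbin/logrotate /etc/logrotate.conf' | add_cron_line "$entry"
```

To apply it to the current user's crontab, the function could be used as `crontab -l 2>/dev/null | add_cron_line "$entry" | crontab -`.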
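The item keys that Zabbix uses to monitor HP server hardware are all defined as trapper items, so all data is collected through the trapper.
In the trigger definitions for HP server hardware, the nodata(600) trigger guards against data collection breaking on the monitored host, for example a crontab that no longer runs or a script that was deleted by mistake. If the server receives no data from the monitored host for 10 minutes, it raises an alert.
The Latest data view on the Zabbix server shows the values the server has received through the trapper.
It also shows the temperature of each component of the monitored server.
When something goes wrong on the monitored host, Zabbix raises an alert promptly.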
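Summary: how to monitor HP server hardware with Zabbix:
1) Install the hpacucli and hpasmcli management tools on the HP server.
2) Set zabbix_server in the hpacuclizabbix.sh script to the IP address of your own server or proxy, and add the script to crontab.
3) Import the attached template and link it to the hosts you want to monitor.
4) If you run into other problems, feel free to get in touch.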