Thursday, October 14, 2010

VMware ESXi Hardware Monitoring

So I have some ESXi servers running and needed to monitor their hardware with Nagios.

I found check_esx_wbem.py, a Python script that reads the hardware status through VMware's CIM interface (if you need to enable CIM, read more here).

The script requires Python and the pywbem module. In my case, I installed both with aptitude ;)

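The usage is simple really. Note that the hostname argument is the full https:// URL including port 5989, the standard port for CIM over HTTPS: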

Usage : ./check_esx_wbem.py hostname user password [verbose]
Example : ./check_esx_wbem.py https://myesxi:5989 root password
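If you are curious what a check like this does under the hood, here is a rough sketch. It is not the code of check_esx_wbem.py. The function names (connect, first_status, collect_statuses) are made up for illustration, and I am assuming pywbem's WBEMConnection and EnumerateInstances calls and the root/cimv2 namespace. The class names are the ones that show up in the script's verbose output.

```python
# Class names as they appear in the script's verbose output.
CLASSES = (
    "CIM_ComputerSystem", "CIM_NumericSensor", "CIM_Memory",
    "CIM_Processor", "CIM_RecordLog", "OMC_DiscreteSensor",
    "VMware_StorageExtent", "VMware_Controller", "VMware_StorageVolume",
    "VMware_Battery", "VMware_SASSATAPort",
)


def connect(url, user, password):
    """Open a WBEM connection (hypothetical helper).

    pywbem is imported here so the rest of the sketch can be tried
    without the module installed. The namespace is an assumption.
    """
    import pywbem
    return pywbem.WBEMConnection(url, (user, password),
                                 default_namespace="root/cimv2")


def first_status(instance):
    """Return the first OperationalStatus value, or None if there is none."""
    try:
        statuses = instance["OperationalStatus"]
    except KeyError:
        return None
    return statuses[0] if statuses else None


def collect_statuses(conn, classes=CLASSES):
    """Return (class, element name, status) for every instance found."""
    results = []
    for cls in classes:
        for inst in conn.EnumerateInstances(cls):
            results.append((cls, inst["ElementName"], first_status(inst)))
    return results


# Usage against a real host would look roughly like this:
#   conn = connect("https://myesxi:5989", "root", "password")
#   for cls, name, status in collect_statuses(conn):
#       print(cls, name, status)
```

ESXi hosts usually present a self-signed certificate, so depending on your pywbem version you may also have to deal with certificate verification when connecting.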


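Using verbose, you get one entry per hardware element. The Op Status values look like CIM OperationalStatus codes, where 0 means Unknown and 2 means OK: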

20101014 17:09:14 Check classe CIM_ComputerSystem
20101014 17:09:15 Element Name = System Board 7:1
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:2
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:3
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:4
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:5
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:6
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:7
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:8
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Board 7:9
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:1
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:2
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:3
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:4
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:5
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:6
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = System Internal Expansion Board 16:7
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Element Name = esxi.example.com
20101014 17:09:15 Element Name = Hardware Management Controller (Node 0)
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Check classe CIM_NumericSensor
20101014 17:09:15 Element Name = System Board 8 Power Meter
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 7 Temp 24
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 6 Temp 23
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 5 Temp 22
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Drive Backplane 1 Temp 21
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 9 Temp 20
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Processor 3 Temp 19
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 7 Temp 18
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 6 Temp 17
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 5 Temp 16
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 4 Temp 15
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 3 Temp 14
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 2 Temp 13
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Internal Expansion Board 1 Temp 12
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 8 Temp 11
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 7 Temp 10
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 6 Temp 9
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 4 Temp 7
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 3 Temp 6
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 2 Temp 5
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Memory Module 1 Temp 4
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = Processor 1 Temp 2
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = External Environment 1 Temp 1
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 4 Fans
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 2 Fan 2
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Element Name = System Board 1 Fan 1
20101014 17:09:15 Element Op Status = 2
20101014 17:09:15 Check classe CIM_Memory
20101014 17:09:16 Element Name = Proc 1 Level-1 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-1 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-1 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-1 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-2 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-2 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-2 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-2 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Proc 1 Level-3 Cache
20101014 17:09:16 Element Op Status = 0
20101014 17:09:16 Element Name = Memory
20101014 17:09:16 Element Op Status = 2
20101014 17:09:16 Check classe CIM_Processor
20101014 17:09:16 Element Name = Proc 1
20101014 17:09:16 Element Op Status = 2
20101014 17:09:16 Check classe CIM_RecordLog
20101014 17:09:16 Element Name = IPMI SEL
20101014 17:09:16 Element Op Status = 2
20101014 17:09:16 Check classe OMC_DiscreteSensor
20101014 17:09:16 Element Name = Power Supply 3 Power Supplies
20101014 17:09:16 Element Op Status = 2
20101014 17:09:16 Element Name = System Chassis 3 Ext. Health LED
20101014 17:09:16 Element Name = System Chassis 2 Int. Health LED
20101014 17:09:16 Element Name = System Chassis 1 UID Light
20101014 17:09:16 Check classe VMware_StorageExtent
20101014 17:09:16 Check classe VMware_Controller
20101014 17:09:17 Check classe VMware_StorageVolume
20101014 17:09:17 Check classe VMware_Battery
20101014 17:09:17 Check classe VMware_SASSATAPort
OK
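That is a lot to read by eye, so here is a small sketch that parses verbose output like the above and counts the elements per status. The line format is taken from the output above. The names parse_verbose, summarize and SAMPLE are mine, and the status labels assume that Op Status is the CIM OperationalStatus value.

```python
import re
from collections import Counter

# Assumed meaning of the codes (CIM OperationalStatus).
OP_STATUS = {0: "Unknown", 1: "Other", 2: "OK", 3: "Degraded",
             4: "Stressed", 5: "Predictive Failure", 6: "Error",
             7: "Non-Recoverable Error"}

# Timestamp prefix, e.g. "20101014 17:09:15 ".
STAMP = re.compile(r"^\d{8} \d{2}:\d{2}:\d{2} (.*)$")


def parse_verbose(text):
    """Return a list of (class, element name, status or None)."""
    elements = []
    current_class = None
    pending = None  # element name still waiting for its status line
    for line in text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue  # e.g. the final "OK" line
        message = match.group(1)
        if message.startswith("Check classe "):
            if pending is not None:
                elements.append((current_class, pending, None))
                pending = None
            current_class = message[len("Check classe "):]
        elif message.startswith("Element Name = "):
            if pending is not None:
                # Previous element had no status line.
                elements.append((current_class, pending, None))
            pending = message[len("Element Name = "):]
        elif message.startswith("Element Op Status = "):
            status = int(message[len("Element Op Status = "):])
            elements.append((current_class, pending, status))
            pending = None
    if pending is not None:
        elements.append((current_class, pending, None))
    return elements


def summarize(elements):
    """Count elements per status label."""
    counts = Counter()
    for _cls, _name, status in elements:
        if status is None:
            label = "not reported"
        else:
            label = OP_STATUS.get(status, "code %d" % status)
        counts[label] += 1
    return counts


# A few lines copied from the output above.
SAMPLE = """\
20101014 17:09:14 Check classe CIM_ComputerSystem
20101014 17:09:15 Element Name = esxi.example.com
20101014 17:09:15 Element Name = Hardware Management Controller (Node 0)
20101014 17:09:15 Element Op Status = 0
20101014 17:09:15 Check classe CIM_NumericSensor
20101014 17:09:15 Element Name = System Board 8 Power Meter
20101014 17:09:15 Element Op Status = 2
OK
"""
```

Calling summarize(parse_verbose(SAMPLE)) gives you a Counter keyed by status label, which makes it easy to spot anything that is neither OK nor Unknown.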

Nagios Integration

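Create a check command definition in Nagios such as this. Adjust the path to wherever you saved the script, and leave out verbose, since Nagios shows only the first line of plugin output as the service status:

define command {
        command_name            check_esxi
        command_line            /usr/bin/python /usr/lib/nagios/plugins/check_esx_wbem.py https://'$HOSTADDRESS$':5989 '$ARG1$' '$ARG2$'
}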
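Nagios decides the service state from the plugin's exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. Here is a minimal sketch of how element statuses could be turned into such a result. The mapping (0 and 2 healthy, 3 to 5 warning, everything else critical) is my own assumption and not necessarily what check_esx_wbem.py does, and the function names are hypothetical.

```python
import sys

# Exit codes defined by the Nagios plugin convention.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

# Assumed mapping of CIM OperationalStatus codes.
HEALTHY = {0, 2}      # Unknown, OK
DEGRADED = {3, 4, 5}  # Degraded, Stressed, Predictive Failure


def nagios_state(elements):
    """elements: iterable of (name, op_status) pairs.

    Returns (exit_code, status_line) for the worst element found.
    """
    worst = NAGIOS_OK
    problems = []
    for name, status in elements:
        if status is None or status in HEALTHY:
            continue
        level = NAGIOS_WARNING if status in DEGRADED else NAGIOS_CRITICAL
        worst = max(worst, level)
        problems.append("%s (status %d)" % (name, status))
    if worst == NAGIOS_OK:
        return worst, "OK"
    label = "WARNING" if worst == NAGIOS_WARNING else "CRITICAL"
    return worst, "%s : %s" % (label, ", ".join(problems))


def report(elements):
    """Print the single status line and exit with the matching code."""
    code, line = nagios_state(elements)
    print(line)
    sys.exit(code)
```

The point is that a plugin prints one short status line and lets the exit code carry the state, which is what the check command above relies on.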

Create a service tied to a host:

define service {
        host_name               ESXi-server
        service_description     Hardware ESXi
        use                     generic-service
        check_command           check_esxi!root!password
        register                1
}

Restart Nagios and presto, you are now monitoring the hardware on your ESXi server.
