Hello,
I’m currently facing an issue with our Centreon monitoring and would like to get some advice from the community.
We are monitoring several customer virtual machines (mostly Windows) with Centreon, using standard checks such as:
- ICMP (ping)
- SNMP (CPU, memory, etc.)
- Some service checks
Recently, one of our VMs crashed with a Windows BSOD. However, no alert was triggered in Centreon:
- The host was still responding to ping
- SNMP checks were still returning values
- Services appeared as OK
So from Centreon’s perspective, everything looked healthy, while the OS was actually not operational.
We would like to improve our monitoring to detect this kind of situation (OS freeze, BSOD, or unresponsive system), not just network availability.
My questions are:
- What would be the best way to detect that a Windows OS is really alive and responsive?
- Would using NSClient++ or Centreon Monitoring Agent be the recommended approach in this case?
- Are there specific checks (uptime, heartbeat, process, eventlog…) that you recommend to detect OS hangs or freezes?
- Has anyone implemented a “heartbeat” mechanism or similar workaround?
Our goal is to move from simple connectivity checks to real OS-level health monitoring.
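To illustrate the kind of heartbeat check I have in mind, here is a minimal Python sketch. It assumes something running inside the guest OS (for example a scheduled task) refreshes a timestamp at regular intervals, and that a check on the monitoring side only looks at how old the last timestamp is. The function name `evaluate_heartbeat` and the thresholds are hypothetical; only the exit codes follow the usual Nagios/Centreon plugin convention. How the timestamp actually reaches the poller is left open.

```python
import time

# Conventional Nagios/Centreon plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_heartbeat(last_beat, now, warn_after=120.0, crit_after=300.0):
    """Map the age of the last heartbeat to a plugin status.

    last_beat  -- Unix timestamp of the last heartbeat, or None if never seen
    now        -- current Unix timestamp on the monitoring side
    warn_after -- hypothetical warning threshold, in seconds
    crit_after -- hypothetical critical threshold, in seconds
    Returns a (exit_code, message) tuple.
    """
    if last_beat is None:
        return UNKNOWN, "UNKNOWN - no heartbeat received yet"

    age = now - last_beat
    if age < 0:
        # A heartbeat from the future usually means the clocks disagree.
        return UNKNOWN, "UNKNOWN - heartbeat timestamp is in the future"
    if age >= crit_after:
        return CRITICAL, f"CRITICAL - last heartbeat {age:.0f}s ago"
    if age >= warn_after:
        return WARNING, f"WARNING - last heartbeat {age:.0f}s ago"
    return OK, f"OK - last heartbeat {age:.0f}s ago"


if __name__ == "__main__":
    # Demo with a heartbeat that is about 30 seconds old.
    current = time.time()
    code, message = evaluate_heartbeat(last_beat=current - 30, now=current)
    print(code, message)
```

The idea is that a frozen or crashed OS stops refreshing the timestamp even if ping and SNMP keep answering, so the check turns WARNING and then CRITICAL based on staleness alone.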
Thanks in advance for your feedback and suggestions.
Julien
