
Hello everyone, I have a small problem that affects only a few of the hosts I monitor.
The most blatant cases are “Server1” and “Server2” (see pictures).
Server1 and Server2 both use “Linux-SNMP-Uptime” as their check command (defined by their hostgroup). However, they do not behave like the other hosts in their hostgroup, nor do they behave like each other, and I cannot understand why.
 

As you can see in this image, Server1 (sometimes) reports that the warning threshold is wrong, and Server2 reports that the uptime is Critical at 465 days.

 

The problem is that I never entered any of these weird warning values for Server1, and other servers with as much uptime as Server2 do not show any critical state.

Here we can see that the Server1 command acts as if some weird values were used to check the warning and critical states.

Server2 seems to have regular options.

So that is what they show about half of the time: they appear either as down or in critical status for a reason I do not understand. However, sometimes they return to normal, and I really cannot grasp why.
An example of this:


These are the same servers, but sometimes their values return to normal.
We can see here that Server1 no longer shows any of the warning or critical uptime values.

So the two servers alternate between these two states and I really cannot grasp why. Why does Server1 sometimes use weird values for warning or critical, and sometimes not? I could not correlate any event with this problem appearing or disappearing; I'd say it switches roughly every 10 minutes.

Does anybody have any idea what might be happening here?

Thanks a lot for reading this, and have a great day.

 

Noam Monmarché

 

hello

What are your macro options for these services?

The default is “nothing” for both warning and critical.

If they are blank, there is no alert, just the performance data, which gives you a graph where you can see reboots occurring.

 

The syntax for the alert threshold is: X:Y

X is the minimum uptime and Y is the maximum uptime. If you don’t add “--unit=Z”, the values for X and Y are in seconds.

--unit  Select the time unit for thresholds. May be 's' for seconds, 'm'
            for minutes, 'h' for hours, 'd' for days, 'w' for weeks. Default
            is seconds.
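
To make the unit handling concrete, here is a minimal Python sketch of the conversion the help text describes. It is not the plugin's code; the names UNIT_SECONDS and to_seconds are made up for illustration. It just shows how the same number means very different things depending on the unit.

```python
# Hypothetical helper, not part of the Centreon plugin: it only mirrors
# the unit letters listed in the --unit help text above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def to_seconds(amount, unit="s"):
    """Convert a threshold value to seconds; the default unit is seconds."""
    return amount * UNIT_SECONDS[unit]


if __name__ == "__main__":
    # An uptime of 465 days, expressed in seconds.
    print(to_seconds(465, "d"))
    # A threshold of 30 with no unit is only 30 seconds.
    print(to_seconds(30))
```

So a threshold written without a unit is compared against the uptime in seconds, and a small number would be exceeded almost immediately after boot.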

 

If you put only one number X in the threshold, it will be a “maximum uptime”, as it is converted to “0:X”.

I don’t really understand why you have 3 values in your checks, but this looks like the thresholds for CPU or load (which take 3 thresholds for multiple time periods, for example 1 min, 5 min and 15 min).

 

If you want a minimum uptime and no maximum: “X:” (to get an alert if a server rebooted without your knowledge).

If you want a maximum and no minimum: “X” (to force a reboot on a server that has not rebooted for too long).
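
The threshold forms above (blank, “X”, “X:”, “X:Y”) can be sketched in a few lines of Python. This is only an illustration of the range convention described in this post, not the plugin's implementation; parse_range and is_alert are hypothetical names, and the value and the threshold are assumed to be in the same unit.

```python
import math


def parse_range(spec):
    """Turn a threshold string into (low, high); None means no threshold."""
    spec = spec.strip()
    if not spec:
        return None  # blank threshold: no alert, only performance data
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # "X" is treated as "0:X"
    low = float(low_text) if low_text else 0.0
    high = float(high_text) if high_text else math.inf  # "X:" has no maximum
    return low, high


def is_alert(value, spec):
    """Return True when value falls outside the range given by spec."""
    parsed = parse_range(spec)
    if parsed is None:
        return False
    low, high = parsed
    return value < low or value > high


if __name__ == "__main__":
    uptime_days = 465
    print(is_alert(uptime_days, ""))      # blank: never alerts
    print(is_alert(uptime_days, "400"))   # maximum uptime of 400
    print(is_alert(uptime_days, "1:"))    # minimum uptime of 1
    print(is_alert(0.5, "1:"))            # recently rebooted host
```

In this sketch a comma-separated triple such as 1,2,3 is not a valid single range and parse_range raises a ValueError for it.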

 

It is explained here: Understanding metrics | Centreon Documentation

 


Hello, and thank you for your answer. As I said before, I have no options configured for this check command.
Here is my host configuration as well as its host template.

I fail to understand why, as I showed earlier, the command sent for the host will sometimes pick up some warning and critical thresholds (which do indeed resemble server load). That is more or less the point of my topic. I do not understand why these values are sometimes sent and sometimes blank (see the second and the last pictures in my previous post).
For 5 minutes it will send the command with these parameters: warning xx,xx,xx ; critical xx,xx,xx
and 5 minutes later: warning : ; critical :
so it alternates between up and down in my overview.
I also have the problem with Server2, which is configured without any extra options but is seen as down every 5 minutes or so, despite sending back perfectly normal uptime data.

For context, these two servers take their configuration from two different templates, each shared with about 10 identical servers (8 use the same template as Server1 and 11 use the same template as Server2). They are in two hostgroups, together with all the other servers built from the same templates.
Despite that fact, they are the only two servers with these problems, which is very weird to me, as I fail to find any difference in configuration between them and the others I configured.

Sorry that this post is such a mess; I struggle to present this clearly.

Thank you again

Noam Monmarché
 


Commenting because I still have the problem and could not find the reason why.


hello

I’m a bit confused about how you have set up your “host” and your “service”.

That is a “host” check-alive command, not a service.

 

A host should inherit from a host template, for example “os-linux-snmp-custom”.

This creates a “host” with a ping to check whether the host is alive,

and the services associated with the host are created as well (if you say yes to “create services linked to the template”).

 

The host template “os-linux-snmp-custom” comes with the Linux SNMP plugin pack from Centreon, and is “linked” to multiple services, including the “snmp uptime” check.

Here are the relations in the host template menu:

(the uptime service may not be there by default; I don’t remember whether I added it manually)

 

That host template inherits from the default “generic active host”, which includes the “host check” with the ping command.

 

Why does it not work correctly? I don’t know, but I do know that the “host check” command should not be an SNMP check, because you cannot pass parameters to it; the host check command is different from a service check.

 


On a side note, if you don’t want to use the full Linux SNMP host template:

create a simple host with this template

then, in the services menu, ADD a service,

set the name to uptime, enter your host name

and use the template.

 

Always use the “-custom” version of your templates, as the non-“-custom” ones have no custom macros.


Thank you for your response.
I did indeed set up all my Linux hosts with the check command OS-Linux-SNMP-Uptime, because that is information I find interesting when I sort them by host, and I thought it could do the job as well as a ping for knowing whether the host is down. I do not want to add any metrics or macros, so I didn’t care much about it not being a service and just used it as a check command.

 


This is in fact working well on almost every host I monitor; I was only wondering why it does not work on just two of them, and above all why it behaves so strangely.
I understand from what you say that it is not good practice to use an SNMP check as a host check command, and I will keep that in mind, but I am still wondering about these weird errors (for example, times when the uptime check is passed parameters from I don’t know where).

Anyway, thank you for taking the time to answer me.
 


OK, I can understand your need.

What you could do is duplicate the SNMP uptime check command and hard-code the thresholds in it, then use that duplicate, which is sure to be called with the correct thresholds, as your host check command.

