Hello everyone, I have a small problem that affects only a few of the hosts I monitor. The most blatant are “Server1” and “Server2” (cf. pictures). Server1 and Server2 both use “Linux-SNMP-Uptime” as their check command (defined by their hostgroup). However, they do not behave like the other hosts in their hostgroup, nor do they behave like each other, and I cannot understand why.
As you can see in this image, Server1 (sometimes) says that the warning threshold is wrong, and Server2 says that the uptime is Critical for being 465 days.
The problem is that I never entered any of these weird warning values for Server1, and other servers with as much uptime as Server2 do not show any critical state.
Here we can see that Server1's command acts as if some weird values were used to check the warning and critical states.
Server2 seems to have regular options.
So this is what they show half of the time: either down or in critical status because of something I do not understand. However, sometimes they return to normal, and I really cannot grasp why. An example of this:
These are the same servers, but sometimes their values return to normal. We can see here that Server1 no longer shows any of the warning or critical uptime values.
So the two servers alternate between these two states. Why does Server1 sometimes use weird values for warning or critical, and sometimes not? I could not find any correlation between any event and this problem appearing or disappearing; I would say it switches roughly every 10 minutes.
Does anybody have any idea what might be happening here?
Thank you very much for reading this, and have a great day.
Noam Monmarché
Hello,
What are your macro options for these services?
The default is “nothing” in warning and critical.
If they are blank, there is no alert, just the performance data, which gives you a graph where you can see the reboots occurring.
The syntax for the alert threshold is: X:Y
X is the minimum uptime and Y is the maximum uptime. If you don’t add “--unit=Z”, the values for X and Y are in seconds.
--unit Select the time unit for thresholds. May be 's' for seconds, 'm' for minutes, 'h' for hours, 'd' for days, 'w' for weeks. Default is seconds.
If you put only one number X in the threshold, it is treated as a “maximum uptime”, as it is converted to “0:X”.
I don’t really understand why you have 3 values in your checks, but this looks like the thresholds for CPU or load (which have 3 thresholds for multiple time periods: 1s, 5s, 1min, for example).
If you want a minimum uptime and no “max”: “X:” (to get an alert if a server rebooted without your knowledge).
If you want a max and no min: “X” (to flag a server that has gone too long without a reboot).
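To make the threshold rules above concrete, here is a minimal Python sketch of how such a threshold could be interpreted. It is not the plugin's actual code: the function names `parse_threshold` and `is_alert` are hypothetical, and the logic only follows the description given in this post (blank means no alert, “X” becomes “0:X”, “X:” is a minimum only, and `--unit` scales the values).

```python
from typing import Optional, Tuple

# Seconds per unit, following the --unit help text quoted above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_threshold(threshold: str, unit: str = "s") -> Tuple[Optional[float], Optional[float]]:
    """Return (minimum, maximum) uptime in seconds; None means "no bound".

    Hypothetical helper that mirrors the rules described in the post,
    not the real plugin implementation.
    """
    threshold = threshold.strip()
    if threshold == "":
        # Blank threshold: no alert, only performance data.
        return (None, None)
    factor = UNIT_SECONDS[unit]
    if ":" not in threshold:
        # A single number X is treated as "0:X" (maximum uptime).
        return (0.0, float(threshold) * factor)
    low, high = threshold.split(":", 1)
    minimum = float(low) * factor if low else None
    maximum = float(high) * factor if high else None
    return (minimum, maximum)


def is_alert(uptime_seconds: float, threshold: str, unit: str = "s") -> bool:
    """True when the uptime falls outside the [minimum, maximum] range."""
    minimum, maximum = parse_threshold(threshold, unit)
    if minimum is not None and uptime_seconds < minimum:
        return True
    if maximum is not None and uptime_seconds > maximum:
        return True
    return False
```

With this sketch, `is_alert(465 * 86400, "400", unit="d")` is true (uptime above the maximum), while a blank threshold never alerts. A comma-separated value such as `"4,3,2"` is not a valid range here and makes `float()` raise a `ValueError`; the real plugin may handle such input differently.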
Hello, and thank you for your answer. As I said before, I have no options configured for this check command. Here is my host configuration as well as its host template.
I fail to understand why, as I showed earlier, the command sent by the host will sometimes pick up some warning and critical values (resembling server load thresholds, indeed). That is more or less the point of my topic. I do not understand why these values are sometimes sent and sometimes blank (cf. the second and the last pictures in my previous post). For 5 minutes it sends the command with these parameters: warning xx,xx,xx ; critical xx,xx,xx, and 5 minutes later: warning : ; critical : so the host alternates between up and down in my overview. I also have the problem with Server2, which is configured without any extra options but is seen as down every 5 minutes or so, despite sending back perfectly normal uptime data.
For context, these two servers take their configuration from two different templates, each used by roughly 10 identical servers (8 use the same template as Server1 and 11 use the same template as Server2). They are in two hostgroups together with all the other servers made from the same templates. Despite that fact, they are the only two servers with these problems, which is very strange to me, as I cannot find any difference in configuration between them and the others I configured.
Sorry for this post being such a mess; I struggle to present this clearly.
Thank you again
Noam Monmarché
Commenting because I still have the problem and could not find the reason why.
Hello,
I’m a bit confused about how you have set up your “host” and your “service”.
That is a “host” check alive command, not a service.
A host should inherit from a host template, for example “os-linux-snmp-custom”.
This will create a “host” with a ping to check whether the host is alive,
and the services associated with the host will be created (if you say yes to “create services linked to the template”).
The host template “os-linux-snmp-custom” comes with the Linux SNMP plugin pack from Centreon and is “linked” to multiple services, including the “snmp uptime” check.
Here are the relations in the host template menu:
(the uptime may not be here, I don’t remember if I added it here manually)
That host template inherits from the default “generic active host”, which includes the “host check” with the ping command.
Why does it not work correctly? I don’t know, but I do know the “host check” command should not be an SNMP check, because you cannot pass parameters; the host check command is different from a service check.
On a side note, if you don’t want to use the full Linux SNMP host template:
create a simple host with this template,
then, in the services menu, add a service,
set the name to uptime and enter your host name,
and use the template.
Always use the “-custom” version of your templates, as the non “-custom” ones have no custom macros.
Thank you for your response. I did indeed set up all my Linux hosts with the check command OS-Linux-SNMP-Uptime, because that is information I find interesting when I sort them by host, and I thought it could do the job as well as a ping for knowing whether the host is down. I do not want to add any metrics or macros, so I did not care much about it not being a service and just used it as a check command.
This is in fact working well on almost every host I monitor, and I was only wondering why it would not work on just two of them, and even more why it was behaving so strangely. I understand from what you say that it is not good practice to use an SNMP check as a host check command, and I will keep that in mind, but I am still wondering about these weird errors (for example, the times when the uptime check passes parameters that come from I don't know where).
Anyway, thank you for taking the time to answer me.
OK, I can understand your need.
What you could do is duplicate the SNMP uptime check command and hard-code the settings in it, so that the thresholds are always the correct ones, then use that duplicate, which is sure to be called correctly, as your host check command.