
What is the best practice to handle stale processes from services?

Most of my services have a 5-minute polling interval, although some of them have their last check reaching back hours.

When I list the processes spawned by centreon-engine, they seem to be stuck even with max_execution_time set to 60s.

These scripts basically walk large MIB tables, which takes 5 seconds on average, but for some reason they pile up in the process list.

I’d like to schedule a script to clean up those stuck processes. Since I don’t want to mess with the system crontab, using a Centreon service could be the best way to schedule it. How can I set the right permissions so that centreon-engine can kill these processes through this script?

 

 

Hello

I had this issue in a specific case: I was using a poller on a “burstable” VM on Azure with a 10% CPU limit (B1 model).

This was causing processes to get stuck and never end, because I no longer had CPU credits once I started monitoring more and more services.

Is that your case?

If it is, try allocating more resources.

 

if not :

I have never seen that behaviour, unless your script is simply not exiting and sending its exit code/output text.

Could you send your engine configuration, with the “Freshness” settings under “Check options”, and your “Tuning” section? Maybe something is not correctly set up there.

Also your check command line.

(Does your script exit correctly and with the right exit code when you run the command manually on the poller shell? You can check the exit code with “echo $?” right after your command.)
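For example, something like this on the poller shell (the plugin path and options below are hypothetical, use your actual check command line):

 su - centreon-engine -s /bin/bash        # switch to the same user the engine uses
 /usr/lib/centreon/plugins/your_check.pl --hostname 10.0.0.1 --snmp-community public
 echo $?                                  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN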

 

The Centreon engine is a scheduler that will run the command and wait for the exit code; if it never arrives, it should time out at some point.

Also, it should not wait for the previous execution to finish and should run at each normal check interval, even if the previous one has not exited (I’m not 100% sure about this).

 

That reminds me, and you should start there: what is your “normal check interval” on these services (or on the service template if you use one)?

You can check this on the resource detail page; in my example, the “Next Check” is 5 min after the last check, and my normal check interval is “5”.

Is your “next check” correctly scheduled according to the “normal check interval”?

 

To your question: there is no best way to manage stale processes, because there shouldn’t be any :-/

This will list the PIDs of any process named “yourprocessname” running for more than 1 h (3600 s):

 ps -C yourprocessname -o pid,etimes,command | awk '(NR>1){if($2>3600) print $1}'

You could kill that list, but if the process name is just “perl”, “php”, or “python”, it could have bad consequences (you might match processes that are not checks at all).
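For example (a sketch only: double-check what the filter matches before adding the kill, and prefer a plain SIGTERM over kill -9):

 ps -C yourprocessname -o pid,etimes,command | awk '(NR>1){if($2>3600) print $1}' | xargs -r kill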

 

Killing a process launched by Centreon Engine should result in a “no output” Unknown status. I’m not sure if that has an impact on the engine; it should not. The processes normally exit on their own, so having something that kills such processes should not cause issues (except the Unknown status).

 

And the crontab is not really an issue on the whole; you can add whatever you want there.


I realize I didn’t read your last question correctly and missed the part about the rights to kill processes.

All the check commands are run as the user “centreon-engine”.

So your script that kills all your stale processes will be run by the same user and should have no issue killing them; you should not need specific rights to kill your own processes or to run the “kill” command, and no sudo or root is required.
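If you do schedule it as a Centreon service like you mentioned, the cleanup script just has to behave like any other plugin: print one line of output and exit with a plugin return code. A minimal sketch, assuming a hypothetical process name and the 1 h threshold from above:

 #!/bin/bash
 # kill_stale_checks.sh - run by centreon-engine like any other check command
 NAME="yourprocessname"   # hypothetical: the process name of your SNMP script
 MAXAGE=3600              # seconds before a check process is considered stale
 PIDS=$(ps -C "$NAME" -o pid=,etimes= | awk -v max="$MAXAGE" '$2>max {print $1}')
 if [ -z "$PIDS" ]; then
     echo "OK: no stale $NAME process older than ${MAXAGE}s"
     exit 0
 fi
 kill $PIDS               # SIGTERM; these processes belong to centreon-engine too
 echo "WARNING: killed stale $NAME processes:" $PIDS
 exit 1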

 

But you really should find the root cause of your problem; a kill script should be the last resort.


Hi @christophe.niel-ACT, thank you for your answer.

Is that your case?

 

It might be my case. My resources for this poller look like this:

[screenshots: poller statistics, check options, tuning]

I have never seen that behaviour, unless your script is simply not exiting and sending its exit code/output text.

 

I have checked one of my scripts that behaves like this, and it has an exit line for every condition.

 

The Centreon engine is a scheduler that will run the command and wait for the exit code; if it never arrives, it should time out at some point.

 

Indeed, it should time out and return exit code 3, but as you can see it remains OK (code 0).

 

Also, it should not wait for the previous execution to finish and should run at each normal check interval.

 

It looks like it only schedules the next execution once I kill the stuck process.

 

ps -C yourprocessname -o pid,etimes,command | awk '(NR>1){if($2>3600) print $1}'

 

I have managed to write a working script using a similar filter, executed by centreon-engine, but I really should find the root cause of this resource/performance issue.

 

Thank you very much


A few things from your engine settings may have an impact on the stale processes:

  • enable the two “orphaned” check options
  • set a value in the service freshness check option (I have “60”)

In the tuning, you have an empty value for the cached service check; I’m not sure if it matters in your specific case, but I have a 15-second value on my pollers (see the snippet below).
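For reference, in the centengine.cfg that Centreon generates, those UI options roughly correspond to Nagios-style directives like the following (set them through the UI rather than editing the file by hand; the values are just the ones mentioned above):

 check_orphaned_services=1
 check_orphaned_hosts=1
 check_service_freshness=1
 service_freshness_check_interval=60
 cached_service_check_horizon=15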

 

From your stats, I see a value that worries me: the host check execution time. The average is pretty high; it should be around 1. Also, your max latencies are through the roof, but the averages seem low enough.

(I’m comparing to my battery of pollers of various sizes and loads.)

I don’t have one poller handling 7k services, though; the max I have on one poller is 4k services (and 500 hosts on that one).

Just for comparison, it has 6 CPUs and 4 GB RAM, and its load is way over the critical threshold, but I don’t have latency problems.

(The average latency is below 0.15 and the max is around 0.5, and I do have some custom Perl/SNMP scripts running around for exotic checks.)

 

For your scripts using big SNMP queries/tables, I don’t know if it applies in your case, but I highly suggest caching the data on disk if you can (storing the big table with a timestamp, then only querying the device if the cache is more than 1 or 5 minutes old), and having all the scripts that need the same data read from that cache and refresh it as needed (or have a cron job do it).

That’s how Centreon did the VMware plugin, using a daemon to cache all the data locally, and how they do some of the disk inventory on some hardware, storing the OID/name of the volumes to reduce the number of SNMP queries.

It’s case dependent, but it can help a lot to reduce the load.
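A minimal sketch of the disk-cache idea, assuming Net-SNMP’s snmpwalk and a hypothetical cache path and OID; every script that needs the same table parses the cache file instead of walking the device again:

 #!/bin/bash
 # Refresh the cached MIB table only if it is older than 5 minutes.
 CACHE=/var/lib/centreon/snmp_cache_myhost_iftable.txt   # hypothetical path
 MAXAGE=300                                              # seconds
 HOST=10.0.0.1
 COMMUNITY=public
 OID=1.3.6.1.2.1.2.2                                     # ifTable, as an example
 age=$(( $(date +%s) - $(stat -c %Y "$CACHE" 2>/dev/null || echo 0) ))
 if [ "$age" -gt "$MAXAGE" ]; then
     snmpwalk -v2c -c "$COMMUNITY" -On "$HOST" "$OID" > "$CACHE.tmp" \
         && mv "$CACHE.tmp" "$CACHE"   # replace in one step so readers never see a partial table
 fi
 # ...then parse "$CACHE" instead of querying the device again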

(Also, maybe your script gets stuck because the remote equipment is throttling the SNMP queries, making your check stall/time out and never finish correctly. I have this on some hardware: when there is too much data, the SNMP engine is just slow and then stops responding if there is load on the equipment; SNMP is not a priority for it.)

 

 

