Troubleshooting
Problem
Why does the parameter JOB_IDLE configured in lsb.queues sometimes not take effect?
Symptom
JOB_IDLE configured in lsb.queues specifies a threshold for idle job exception handling.
The value is a number between 0.0 and 1.0 that represents the job idle factor, calculated as CPU time divided by run time.
If the job idle factor is less than the specified threshold, LSF invokes LSF_SERVERDIR/eadmin, which triggers the action that sends an email reporting the job idle exception.
The interval at which eadmin is invoked is controlled by the EADMIN_TRIGGER_DURATION parameter in lsb.params.
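The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not LSF code: the function names idle_factor and is_below_threshold are hypothetical, and the sketch assumes CPU time and run time are both measured in seconds.

```python
def idle_factor(cpu_time_s: float, run_time_s: float) -> float:
    """Return CPU time divided by run time (the value compared against JOB_IDLE)."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    return cpu_time_s / run_time_s


def is_below_threshold(cpu_time_s: float, run_time_s: float, job_idle: float) -> bool:
    """True when the idle factor is less than the JOB_IDLE threshold."""
    return idle_factor(cpu_time_s, run_time_s) < job_idle


# A job that ran for 1000 seconds without consuming CPU time,
# checked against JOB_IDLE=0.6.
print(is_below_threshold(0.0, 1000.0, 0.6))
```

A job that used 900 seconds of CPU time over the same 1000 seconds would have an idle factor of 0.9 and would not fall below a threshold of 0.6.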
Sometimes, however, the administrator does not receive a job idle exception email from LSF after setting JOB_IDLE for a specific queue (here, the queue normal) in lsb.queues.
For example:
1. Set JOB_IDLE=0.6 for the queue normal in lsb.queues.
2. Set EADMIN_TRIGGER_DURATION=2 (minutes) in lsb.params.
3. Submit a job that runs for 1000 seconds and has a job idle factor of 0, well below the threshold of 0.6.
In this case, the administrator does not receive a job idle exception email from LSF.
Cause
The DETECT_IDLE_JOB_AFTER parameter in lsb.params controls how long a job must run before mbatchd reports it as idle, and it has not been adjusted for short-running jobs.
Diagnosing The Problem
Syntax
DETECT_IDLE_JOB_AFTER=time_minutes
Description
The minimum job run time before mbatchd reports that the job is idle.
Default
20 (mbatchd checks if the job is idle after 20 minutes of run time)
Resolving The Problem
The default value of DETECT_IDLE_JOB_AFTER is 20 minutes. When a job's run time is less than 20 minutes, the job finishes before mbatchd has a chance to report it as idle, so the administrator does not receive a job idle exception email from LSF. In the example, the job runs for 1000 seconds (about 16.7 minutes), which is shorter than the 20-minute default.
In the scenario above, setting DETECT_IDLE_JOB_AFTER=1 in lsb.params allows the administrator to receive the job idle exception email from LSF.
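The interaction between the two parameters can be modeled with a short Python sketch. It is a simplified model under stated assumptions, not mbatchd's actual logic: the function name idle_exception_reported is hypothetical, a job whose run time is shorter than the DETECT_IDLE_JOB_AFTER value is assumed never to be reported, and EADMIN_TRIGGER_DURATION is ignored.

```python
def idle_exception_reported(
    cpu_time_s: float,
    run_time_s: float,
    job_idle: float,
    detect_idle_job_after_min: float = 20.0,  # documented default, in minutes
) -> bool:
    """Simplified model: would a job idle exception be reported for this job?"""
    # Assumption: a job that finishes before the minimum run time is never checked.
    if run_time_s < detect_idle_job_after_min * 60:
        return False
    # Otherwise compare the idle factor (CPU time / run time) with JOB_IDLE.
    return (cpu_time_s / run_time_s) < job_idle


# The scenario from this document: a 1000-second job with an idle factor of 0.
print(idle_exception_reported(0.0, 1000.0, 0.6))       # default of 20 minutes
print(idle_exception_reported(0.0, 1000.0, 0.6, 1.0))  # DETECT_IDLE_JOB_AFTER=1
```

With the default, 1000 seconds is less than 1200 seconds (20 minutes), so the model never reaches the threshold comparison. With a value of 1, the job exceeds the 60-second minimum and its idle factor of 0 falls below 0.6.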
Document Information
Modified date:
17 June 2018
UID
isg3T1026333