IBM Support

Available backoff strategies for probes

Question & Answer


Question

Backoff Strategy : How frequently do probes reconnect to their target?

Cause


Not every probe has its own backoff strategy, but all probes can be placed under process control to manage reconnections.

Answer

Process Control

When probes are run under process control (nco_pad), the probe process will be restarted after it exits, using the following backoff strategy:



2,4,8,16,32,64,128,256 -> 2,4,8,16,32,64,128,256 ...

That is, the process will be restarted after 2 seconds, then 4 seconds, then 8 seconds, and so on, doubling each time.
After the eighth interval (256 seconds), the retry time is reset to 2 seconds and doubles again as before.
The total number of retries is set for each process in the nco_pa.conf file using the RetryCount setting. A value of 0, as in the example below, places no limit on the number of restart attempts:

nco_process 'MTTrapd probe'
{
    Command '$OMNIHOME/bin/nco_p_mttrapd -propsfile $OMNIHOME/probes/solaris2/my_mttrapd.props' run as 0
    Host = 'probehost'
    Managed = True
    RestartMsg = '${NAME} running as ${EUID} has been restored on ${HOST}.'
    AlertMsg = '${NAME} running as ${EUID} has died on ${HOST}.'
    RetryCount = 0
    ProcessType = PaPA_AWARE
}
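
The restart schedule described above can be sketched in a few lines of Python. This is only an illustration of the documented sequence, not how nco_pad is implemented: the names PAD_DELAYS and pad_restart_delays are hypothetical, and the retry_count argument is a simplified stand-in for RetryCount that assumes 0 means "no limit".

```python
from itertools import count, islice

# Restart delays in seconds, taken from the schedule described above.
PAD_DELAYS = [2, 4, 8, 16, 32, 64, 128, 256]


def pad_restart_delays(retry_count=0):
    """Yield successive restart delays in seconds.

    retry_count is a simplified stand-in for RetryCount (assumption):
    0 is treated as "no limit"; a positive value stops the sequence
    after that many restarts.
    """
    for attempt in count():
        if retry_count and attempt >= retry_count:
            return
        # Wrap back to 2 seconds after the eighth interval (256 seconds).
        yield PAD_DELAYS[attempt % len(PAD_DELAYS)]


if __name__ == "__main__":
    # First ten delays with no retry limit: one full cycle, then the reset.
    print(list(islice(pad_restart_delays(), 10)))
    # Time spent waiting in one full cycle of eight restarts.
    print(sum(PAD_DELAYS))
```

One full cycle of eight restarts therefore adds up to 510 seconds of waiting before the delay returns to 2 seconds.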

The nco_pad process is usually started from a wrapper script. The nco_pad options in that script can be updated to allow for non-native probes, longer retry times, and rogue processes:

KillProcessGroup : If specified, when the process agent daemon stops a process, it also sends a signal to kill any processes in the same operating system process group.

RetryTime : Specifies the number of seconds that a process started by process control must run to be considered a successful start. The default is 5.

RogueTimeout : Specifies the time in seconds to wait for the process to shut down. The default is 30.

For example, you can edit the nco script and add the extra arguments:
vi /etc/init.d/nco
#! /bin/sh
# My Pad settings
MY_ARGUMENTS="-killprocessgroup -roguetimeout 120 -retrytime 60"
...
if [ "$SECURE" = "Y" ]; then
    ${OMNIHOME}/bin/nco_pad -name ${NCO_PA} -authenticate PAM -secure ${MY_ARGUMENTS} > /dev/null 2> /dev/null
else
    ${OMNIHOME}/bin/nco_pad -name ${NCO_PA} -authenticate PAM ${MY_ARGUMENTS} > /dev/null 2> /dev/null
fi
:wq


Probe Specific properties
Some probes have their own backoff strategies.

For example, many CORBA probes use the Retry property:

Retry : 'true'

With Retry set to 'true' the probe tries to reestablish a connection after one second, two seconds, then four seconds, and so on, up to a maximum of 4096 seconds. Once the connection is made to the CORBA interface, the probe tries to log in to the device. If the probe fails to log in, it shuts down and tries to connect again. The backoff strategy remains in place until a successful login occurs.

As you can see, this backoff strategy is similar to the one used by nco_pad:

1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -> 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 ...

Except that it starts at 1 second and doubles 12 times, giving 13 intervals, before the retry period is reset.
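
As a rough illustration, the doubling schedule described for Retry (1 second, doubling up to a maximum of 4096 seconds) can be reproduced as follows. The function name probe_retry_delays is hypothetical, and the code only models the documented intervals, not the probe's internal logic.

```python
def probe_retry_delays(max_delay=4096):
    """Return one full cycle of doubling reconnect delays, in seconds."""
    delays = []
    delay = 1
    while delay <= max_delay:
        delays.append(delay)
        delay *= 2  # each reconnection attempt waits twice as long
    return delays


cycle = probe_retry_delays()
print(cycle)                   # the intervals in one cycle
print(len(cycle), sum(cycle))  # number of intervals and total seconds
```

Under these assumptions one cycle has 13 intervals that add up to 8191 seconds (about 2 hours 16 minutes), compared with 510 seconds for the eight nco_pad intervals of 2 to 256 seconds.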

However, unlike nco_pad, there are no additional options to adjust retry behaviour.

In general, nco_pad provides greater control over the backoff strategy. When a probe is run under nco_pad, any probe-specific backoff strategy should be disabled, for example:

Retry : 'false'
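
If several probes are run under nco_pad, it may help to check their properties files for this setting. The following is a minimal sketch, assuming the usual "Name : value" layout with "#" comment lines. The helper retry_setting is hypothetical, and treating the last uncommented Retry line as the effective one is an assumption.

```python
import re

# Matches a line such as:  Retry : 'false'   (single or double quotes)
_RETRY_LINE = re.compile(r"""^\s*Retry\s*:\s*['"]?(\w+)['"]?\s*(?:#.*)?$""")


def retry_setting(props_text):
    """Return the last uncommented Retry value, lower-cased, or None."""
    value = None
    for line in props_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out properties
        match = _RETRY_LINE.match(line)
        if match:
            value = match.group(1).lower()
    return value


if __name__ == "__main__":
    sample = "# Retry : 'true'\nRetry : 'false'\n"
    print(retry_setting(sample))
```

A result of None means the file does not set Retry at all, in which case the probe's default applies.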

Product: Tivoli Netcool/OMNIbus
Versions: 7.4.0; 8.1.0
Platforms: AIX, HP-UX, Linux, Solaris

Document Information

Modified date:
17 June 2018

UID

swg21694650