[UNIX, Linux, Windows, IBM i]

Queue manager health check behavior

The queue manager carries out periodic health checks to ensure stable and reliable performance. This topic describes some of the health checks that the queue manager makes and explains how they can be configured based on environmental requirements.

In most environments, the default configuration is suitable and there is no need to change how frequently these checks are made. Even using the default settings, it can be useful to understand how the queue manager behaves when a problem is detected and the environmental problems that might cause a check to fail. This topic is intended to explain some of those behaviors.
CAUTION:
Do not make any changes to the frequency of these checks unless you are advised to do so by IBM® Support.

The different components of the queue manager use various means to detect and resolve inconsistencies, and this topic is not intended to describe all such mechanisms. For example, IBM MQ processes use various mechanisms to ensure that other processes that they depend on are still running. The behaviors described here are the checks made periodically by the execution controller to spot environmental or other unexpected situations. (The execution controller is the primary IBM MQ process that starts and manages most other queue manager processes.) Because they are periodic checks, they are made at specific intervals that can be modified to some extent by setting the appropriate tuning parameters.

Some of the checks that are described are made by a dedicated health checking thread. If a problem is detected with the health checking thread itself, a warning message AMQ5066 is written to the queue manager error logs.

The behaviors that are described in this topic are subject to change in future releases, for example if a different default value is observed to be more stable on a particular platform or configuration.

General health checks

The queue manager performs a variety of checks at a regular interval. By default, these checks are performed every 10 seconds (in some cases, a check allows two cycles before reporting an error, so up to 20 seconds can pass before such a check reports a problem). During the checks, the queue manager makes sure that the various processes that run as part of the queue manager are still running. For a Native HA queue manager, it also checks that data is being successfully replicated to the standby instances.

If a critical check fails at this time (for example, if the amqzmuc0 process is no longer running), the queue manager is unable to continue running. However, most of the checks are made to tidy up system resources that are no longer needed and might simply result in a message being written to the queue manager error logs.

In most cases, it is not necessary to change the frequency of these general health checks. The majority of events in the queue manager or in the operating environment are detected instantly, without the need for the general health check process to detect them; this process serves as a periodic check for anything not detected elsewhere in the queue manager. If necessary, the frequency can be configured using the ECHeartBeatLen tuning parameter. The minimum value is 10000 milliseconds (10 seconds) and the maximum value is 60000 milliseconds (60 seconds). Because some checks allow two cycles before an error is reported, setting the maximum value of 60000 might result in a delay of up to two minutes for those checks.
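If you are advised to change this value, the following minimal sketch shows one way the parameter might be set. It assumes that, like other queue manager tuning parameters, ECHeartBeatLen is specified in the TuningParameters stanza of the qm.ini file and takes a value in milliseconds; confirm the exact stanza and syntax against the tuning parameter documentation for your release.

    TuningParameters:
       ECHeartBeatLen=30000

In this sketch, 30000 (30 seconds) is an illustrative value within the documented range of 10000 to 60000 milliseconds. Changes to qm.ini are normally picked up when the queue manager is restarted.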

Checks that log progress is being made

The queue manager checks that writes to the log are being made at a reasonable rate. This is not a check that the performance of the logger is optimal; it is designed to spot conditions that might need further attention, for example if the disk on which the log files are stored is particularly slow, or if the queue manager is not receiving sufficient CPU time in a containerized environment to perform all of its work.

If this check fails, the action taken by the queue manager depends on the type of queue manager being used:
  • On a non-HA queue manager:
    • An xecL_W_PERFORMANCE_BOTTLENECK FDC is written. This can be used as an indication that some part of the system might need further attention. The queue manager remains running. If xecL_W_PERFORMANCE_BOTTLENECK FDCs are seen in the errors directory, it might be necessary to work with your storage or platform team to understand whether the underlying system resources are sufficient for IBM MQ to run. If IBM MQ is being run in containers on over-committed nodes, IBM MQ might not receive enough scheduled CPU time to perform all of its messaging workload.
    • [MQ 9.3.0 Jun 2022]From IBM MQ 9.3.0, a warning message AMQ5068W is written to the queue manager error logs and no xecL_W_PERFORMANCE_BOTTLENECK FDC is written. If AMQ5068W messages are seen in the logs, it might be necessary to work with your storage or platform team to understand whether the underlying system resources are sufficient for IBM MQ to run. If IBM MQ is being run in containers on over-committed nodes, IBM MQ might not receive enough scheduled CPU time to perform all of its messaging workload. If five AMQ5068W warning messages are written in succession, an xecL_W_PERFORMANCE_BOTTLENECK FDC is written.
  • On a multi-instance queue manager:
    • If the log progress health check fails, the primary instance ends. If a standby instance is available, it starts and becomes the primary instance.
    • [MQ 9.3.0 Jun 2022]From IBM MQ 9.3.0, the primary instance checks whether a standby instance is available before ending. If a standby queue manager is available to fail over to, the primary instance ends. Additionally, a warning message AMQ5068W is written to the queue manager error logs.
  • On a Native HA queue manager, this check behaves in the same way as a non-HA queue manager.
  • On an RDQM (replicated data queue manager), this check behaves in the same way as a non-HA queue manager.

It is possible that a problem with the progression of the IBM MQ log is caused by a performance issue in the queue manager itself.

By default this check is made every 60 seconds, although the queue manager waits for two cycles of the check before taking an action. This means that with the default settings, two minutes would have to pass before the queue manager wrote an error message (or failed over in the case of an HA queue manager).

In most cases, the default behavior is suitable even where the file system is slow or the queue manager is allocated a small amount of CPU time, because other checks, such as file locking (see Checks that file locks are still held) and basic file system operation, cause a primary instance to fail over before this check is made. If necessary, the frequency of this check can be configured using the LivenessHeartBeatLen tuning parameter. The maximum value that it can be configured to is 600 seconds (10 minutes). The minimum value of 0 has the effect of disabling the check altogether. For a non-HA queue manager, the only effect of the check is an extra warning message in the queue manager error logs. For a multi-instance queue manager, you can configure LivenessHeartBeatLen to cause a primary instance of the queue manager to fail over more quickly (by reducing the value) or more slowly (by increasing the value). Increasing the value to reduce the frequency of the log progress check can be useful if your environment occasionally experiences very slow file system IO but you would prefer the primary instance of the queue manager to remain running. This might be useful if you have applications that are not designed to automatically reconnect to the standby instance and that require manual intervention to restart.

Note: If the ECHeartBeatLen has been increased, this affects the timing of the LivenessHeartBeatLen checks. Log progress checks are made when the general health checks are performed, so reducing the frequency of the general health checks (ECHeartBeatLen) might result in log progress checks being made up to 30 seconds after the configured LivenessHeartBeatLen.
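If you are advised to adjust this check, the following sketch shows one possible setting for a multi-instance queue manager that occasionally experiences very slow file system IO. It assumes that LivenessHeartBeatLen is also specified in the TuningParameters stanza of qm.ini and takes a value in seconds; confirm the syntax against the tuning parameter documentation for your release.

    TuningParameters:
       LivenessHeartBeatLen=300

Here 300 seconds (5 minutes) is an illustrative value within the documented range of 0 to 600 seconds; a value of 0 would disable the check altogether, and because two cycles are allowed before action is taken, a value of 300 would mean roughly 10 minutes before a failed log progress check causes the primary instance to end.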
[MQ 9.3.4 Oct 2023]

General logging filesystem performance

From IBM MQ 9.3.4, a warning message AMQ6729W is issued in the queue manager error log if general read/write operations to the logging filesystem appear to be taking longer than expected. You can use the AMQ_IODELAY environment variables to fine-tune diagnostics and timings, either to aid in diagnosing storage performance issues or to increase tolerance for such delays. For more information, see AMQ_IODELAY, AMQ_IODELAY_INMS and AMQ_IODELAY_FFST.
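As a sketch only, environment variables such as these are typically set in the environment from which the queue manager is started. The values below are illustrative placeholders, not recommendations; the exact units and behavior of each variable are described in the topic referenced above, and QM1 is a hypothetical queue manager name.

    # Illustrative placeholders only; see the AMQ_IODELAY documentation for the real semantics
    export AMQ_IODELAY=2000
    export AMQ_IODELAY_FFST=1
    strmqm QM1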

Checks that file locks are still held

For a multi-instance queue manager, the execution controller periodically checks to make sure that it still holds the exclusive lock on the primary multi-instance file. In many cases, if the lock is lost due to a problem with the NFS server, the primary instance fails over almost immediately (before this check is made). Additional periodic file lock checks are made to ensure that the primary queue manager fails over in the event of an unusual file system problem.

By default, these file lock checks are made every 20 seconds. If necessary, this value can be changed by setting the FileLockHeartBeatLen tuning parameter. The default value for the tuning parameter is 10 seconds (the queue manager allows two cycles of the check before taking an action, resulting in the default behavior of checking every 20 seconds). The minimum value of the tuning parameter is 10 seconds; the maximum value is 600 seconds (10 minutes).

Note: If the ECHeartBeatLen has been increased, this affects the timing of the FileLockHeartBeatLen checks. File lock checks are made when the general health checks are performed so reducing the frequency of the general health checks (ECHeartBeatLen) might result in file lock checks being made up to 30 seconds after the configured FileLockHeartBeatLen.
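If you are advised to change this value, the following minimal sketch again assumes the TuningParameters stanza of qm.ini and a value in seconds; confirm the syntax against the tuning parameter documentation for your release.

    TuningParameters:
       FileLockHeartBeatLen=30

The value 30 is illustrative and lies within the documented range of 10 to 600 seconds; because the queue manager allows two cycles of the check before taking action, it would correspond to roughly 60 seconds before a failed file lock check causes the primary instance to fail over.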

Checks on user application health

The queue manager periodically checks that any locally bound applications that are no longer running performed an MQDISC MQI call before terminating. These checks are performed at the same time as the general health checks described in General health checks. The default interval for such checks is therefore 10000 milliseconds (10 seconds), and changing the value of the ECHeartBeatLen tuning parameter changes the frequency at which they are made. This check is primarily to ensure that any resources associated with a connected application are freed; it does not cause an HA or a non-HA queue manager to end or to fail over to an alternative instance.

IBM MQ client applications that have terminated without issuing an MQDISC MQI call are separately detected by the agent process and any resources associated with the connection are released.