Monitoring Db2 Mirror resources

The health center is a feature of Db2® Mirror that is used to continuously monitor for issues in the Db2 Mirror environment and act when an issue is detected.

The health of the Db2 Mirror environment is determined by monitoring two classes of resources: critical resources and product-related resources. Critical resources, such as storage and communication, are checked more frequently than product-related resources. When an issue is detected, the health center either suspends replication or notifies the operator of the issue by sending an alert to the QSYSOPR message queue. A user can configure the intervals that determine how frequently the Db2 Mirror environment checks these resources, in addition to other health center properties.

Health Center Controls

The health center properties can be viewed and changed in the Db2 Mirror GUI by selecting Health Center Controls from the left navigation Health Center menu, as shown in the following figure.
Figure 1. Selecting Health Center Controls from the Health Center menu
The default view displays the health center properties for SYSBAS and all database IASPs that are active on the primary node. Separate tabs at the top of the page can be used to change which node is displayed. The Both Nodes tab displays the properties on both nodes side-by-side, as shown in the following figure.
Figure 2. Displaying Health Center Controls for both nodes
The health center properties for the current node can also be displayed by using the MIRROR_HEALTH_MONITOR_INFO view.
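
As a minimal sketch, the view can be queried with an SQL statement such as the following. The statement assumes that the view resides in the QSYS2 schema, and it selects all columns rather than naming specific ones.

```sql
-- Display the health center properties for the current node.
-- Assumption: the view is qualified with the QSYS2 schema.
SELECT * FROM QSYS2.MIRROR_HEALTH_MONITOR_INFO;
```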

Resource check intervals

Two interval properties control how frequently the health center monitors the Db2 Mirror environment. These properties are used to specify the time interval between critical and product-related resource checks made by the health center. Changing these interval properties affects both nodes.

Critical resource checks

The critical resource check interval property controls the time interval that is used for monitoring critical resources. This property can be set to a value between 5 and 60 seconds. Critical resource monitoring involves the following checks:
  • Checking the status of the RDMA communication links between the nodes.
  • Checking the status of disk unit connectivity.
  • Checking storage availability. For more information, see Storage threshold monitoring.

If a communication or storage outage is detected, the health center suspends replication as an unplanned outage. Unplanned outages are eligible for automatic takeover. For more information, see Automatic takeover. When replication is suspended due to available storage threshold monitoring, it is not treated as an unplanned outage and therefore is not eligible for automatic takeover.

When the health center detects a communication outage, it does not immediately suspend replication for an unplanned outage. The health center waits a specified amount of time, in seconds, before it suspends replication. The suspend wait time interval property can be updated to wait between 15 and 3600 seconds (60 minutes) before a suspend of replication is initiated. Suspending replication due to a communication outage can be disabled by setting the suspend wait time interval property value to NOMAX.

The current value for the critical resource check interval and the suspend wait time properties can be viewed and updated in the Db2 Mirror GUI by using the techniques that are described in Health Center Controls. In addition, the properties can be updated by using the CHANGE_MIRROR_HEALTH_MONITOR procedure. The following example sets the time interval for checking critical resources to every 30 seconds and sets the amount of time to wait before a suspend is initiated when communication is lost to 300 seconds.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(CRITICAL_RESOURCE_CHECK_INTERVAL => 30,
                                        SUSPEND_WAIT_TIME                => 300);
Note: Decreasing the critical resource check interval makes the health center more sensitive to temporary outages, and replication might be suspended unnecessarily for short transient outages. Increasing the critical resource check interval or the suspend wait time decreases the risk that a transient outage causes a suspend.
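
A call such as the following sketch might be used to disable the suspend of replication for communication outages. It assumes that the CHANGE_MIRROR_HEALTH_MONITOR procedure is qualified with the QSYS2 schema and that the NOMAX special value is passed as a quoted string. Verify both assumptions against the procedure reference before use.

```sql
-- Do not suspend replication when a communication outage is detected.
-- Assumption: the special value NOMAX is passed as a character string.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(SUSPEND_WAIT_TIME => 'NOMAX');
```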

Storage threshold monitoring

The health center monitors the amount of storage available in each auxiliary storage pool (ASP). The monitoring is performed for each ASP that comprises SYSBAS and each registered database IASP. When the amount of available storage for an ASP falls beneath the available storage threshold percentage property, replication is suspended.

When the amount of available storage for a user ASP defined within SYSBAS falls beneath the configured threshold, then a suspend of replication is initiated regardless of the amount of available storage in other ASPs that are a part of SYSBAS. Similarly, if an ASP group has secondary ASPs, then the available storage threshold is monitored for each ASP within the ASP group. A suspend of database IASP replication is initiated when the amount of available storage of any one ASP within the ASP group falls beneath the configured threshold for the database IASP. The suspend of SYSBAS replication by the health center also suspends replication for all registered database IASPs.

If a suspend of replication is initiated because an ASP falls beneath the configured threshold, the issue must be resolved before active replication can be resumed. Resolving the issue might involve deleting objects to make more storage available or changing the available storage threshold property. While replication is suspended, you have two options for removing replicated objects on the node that is blocked. One option is to exclude the objects from replication before deleting them. The second option is to change the available storage threshold property to a value that allows active replication to resume. The deletion of replicated objects on the secondary node is allowed after active replication is resumed.

The available storage threshold property can be viewed and updated in the Db2 Mirror GUI by using the techniques that are described in Health Center Controls. The property can also be updated by specifying both the IASP_NAME and AVAILABLE_STORAGE_THRESHOLD parameters on the CHANGE_MIRROR_HEALTH_MONITOR procedure. The following example sets the available storage threshold to 5.00% for SYSBAS on the current node.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(IASP_NAME                   => '*SYSBAS',
                                        AVAILABLE_STORAGE_THRESHOLD => 5.00);
To disable the monitoring of available storage for an ASP, the special value NONE can be specified for the available storage threshold property. The following example disables the monitoring of available storage on the current node that is associated with the database IASP named DBIASP1.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(IASP_NAME                   => 'DBIASP1',
                                        AVAILABLE_STORAGE_THRESHOLD => 'NONE');

Product resource checks

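The resource check interval property controls the time interval that is used for monitoring Db2 Mirror product resources. This property can be set to a value between 1 and 1440 minutes (24 hours). The monitoring of Db2 Mirror product resources involves periodically checking the status of various resources that might jeopardize the Db2 Mirror environment, such as:
  • Db2 Mirror jobs. For more information, see Monitoring Db2 Mirror jobs.
  • License keys. For more information, see License expiration monitoring.
  • Other less-critical resources, such as system clocks and registered exit programs. For more information, see Monitoring other resources.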

When an issue is detected with one or more of the Db2 Mirror product resources, the health center notifies the user by sending alerts to the QSYSOPR message queue. For steps on accessing Db2 Mirror alerts, see Accessing Db2 Mirror QSYSOPR messages from the Db2 Mirror GUI.

The resource check interval property can be viewed and updated in the Db2 Mirror GUI by using the techniques that are described in Health Center Controls. The property can also be updated by specifying the RESOURCE_CHECK_INTERVAL parameter on the CHANGE_MIRROR_HEALTH_MONITOR procedure. The following example changes the time interval for checking less-critical product resources to every 5 minutes on both nodes.
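
A minimal sketch of such a call follows. It assumes that the procedure is qualified with the QSYS2 schema and that the interval value is expressed in minutes.

```sql
-- Check product-related resources every 5 minutes.
-- Assumption: the procedure is qualified with the QSYS2 schema.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(RESOURCE_CHECK_INTERVAL => 5);
```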

Monitoring Db2 Mirror jobs

The health center checks the status of various Db2 Mirror system and user jobs that are required for the proper operation of the Db2 Mirror product. The jobs are monitored to verify that they are active and available for use by Db2 Mirror to manage the environment and replicate objects. The jobs that are monitored by the health center are described in Db2 Mirror jobs.

When the health center detects an issue with one of the jobs, it attempts to recover that job. If the job is successfully recovered, then a CPDC925 message is sent to the QSYSOPR message queue. The message contains the details about the type of job recovered. If the job fails to recover, then a CPDC922 message is sent to the QSYSOPR message queue. The message includes the details about the job that failed to recover. When a job fails to recover, the health center might also suspend replication as an unplanned outage if the replication of objects is jeopardized.

License expiration monitoring

The health center also monitors the license keys that are required for the Db2 Mirror product. The license expiration threshold property is set separately on each node. It defines how many days before a license key expires the health center begins sending alerts. When the health center detects that a license key is approaching expiration, a CPDC924 message is sent to the QSYSOPR message queue. The message contains details on how many days remain for the license key. Expired or missing license keys for Db2 Data Mirroring (5770SS1 Option 48) or Db2 Mirror Enablement (5770DBM Option 1) can cause Db2 Mirror nodes to remain suspended.

The current value for the license expiration threshold property can be viewed and updated in the Db2 Mirror GUI by using the techniques that are described in Health Center Controls. The property can also be updated by specifying the LICENSE_EXPIRATION_THRESHOLD parameter on the CHANGE_MIRROR_HEALTH_MONITOR procedure. The following example configures the health center on the current node to begin sending alerts 30 days before a license key expires.
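
As a sketch, the call might look like the following. It assumes that the procedure is qualified with the QSYS2 schema and that the threshold value is expressed in days.

```sql
-- Begin license key expiration alerts 30 days before expiration.
-- Assumption: the procedure is qualified with the QSYS2 schema.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(LICENSE_EXPIRATION_THRESHOLD => 30);
```
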
To disable alerts for license keys that are approaching expiration, specify the special value NONE for the license expiration threshold property. The following example disables these alerts on the current node.
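
A sketch of the disabling call follows. It assumes that the procedure is qualified with the QSYS2 schema and that NONE is passed as a quoted string, in the same way as for the available storage threshold.

```sql
-- Turn off license key expiration alerts on the current node.
-- Assumption: the special value NONE is passed as a character string.
CALL QSYS2.CHANGE_MIRROR_HEALTH_MONITOR(LICENSE_EXPIRATION_THRESHOLD => 'NONE');
```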

Monitoring other resources

The health center also checks other less-critical product resources. These checks include ensuring that the system clocks remain synchronized and verifying that exit programs that were registered by the Db2 Mirror product remain registered. When an issue is detected for one of these less-critical resources, a CPDC924 message that indicates the type of issue is sent to the QSYSOPR message queue.

