DB2 Version 10.1 for Linux, UNIX, and Windows

DB2® high availability disaster recovery (HADR) standby database states

At any time, a high availability disaster recovery (HADR) standby database is in one of five states: local catchup, remote catchup pending, remote catchup, peer, or disconnected peer. The states are defined by log shipping status. Regardless of the state, log replay runs concurrently and replays all available logs.

The primary log position, standby log receive position, and standby log replay position are all reported by the standard monitoring interfaces for HADR: the MON_GET_HADR table function and the db2pd command with the -hadr parameter. The state of a standby is reported in the HADR_STATE field. If a primary database is connected to a standby database, the monitoring interface reports the state of the standby as its HADR_STATE; otherwise, it reports DISCONNECTED.
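As a rough illustration of reading the HADR_STATE field, the following Python sketch parses text laid out as "KEY = VALUE" lines, which is the general shape of db2pd -hadr output. The sample text and its values are invented for the example, and the helper names parse_hadr_fields and standby_state are hypothetical; real output contains many more fields.

```python
# Illustrative sample only: field values are made up, and real
# db2pd -hadr output contains many more fields than shown here.
SAMPLE_OUTPUT = """\
            HADR_ROLE = PRIMARY
        HADR_SYNCMODE = SYNC
           HADR_STATE = PEER
"""


def parse_hadr_fields(text):
    """Collect 'KEY = VALUE' lines into a dict; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # only lines that contain an '=' separator
            fields[key.strip()] = value.strip()
    return fields


def standby_state(fields):
    """Return the reported HADR_STATE, or 'UNKNOWN' if the field is absent."""
    return fields.get("HADR_STATE", "UNKNOWN")
```

For example, standby_state(parse_hadr_fields(SAMPLE_OUTPUT)) yields the string PEER for the sample text above.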

Figure 1 shows the progression through the different standby database states.

Figure 1. States of the standby database

Local catchup state

With the HADR feature, when a standby database is started, it enters local catchup state and reads the log files in its local log path to determine which logs are available locally. In this state, logs are not retrieved from the archive, even if you configured a log archiving method. A connection to the primary database is also not required in this state; however, if a connection does not exist, the standby database tries to connect to the primary database. When the standby reaches the end of its local log files, it enters remote catchup pending state.

Remote catchup pending state

If a connection to the primary has not been established when the standby enters remote catchup pending state, the standby waits for a connection. After a connection is established, the standby obtains the primary's current log chain information. If a log archive is configured, this information enables the standby to retrieve log files from the archive and verify that the log files are valid.

If the standby loses its connection to the primary while in remote catchup state, or while in peer state with no peer window configured, it goes back to remote catchup pending state. When the connection is reestablished, the standby tries to retrieve the logs from the archive. Thus, if you configure a shared archive device, the standby might be able to find more logs than would be available if it were using a separate archiving device. This behavior favors retrieving from the archive over shipping from the primary through the HADR connection, which minimizes the impact on the primary database.

Remote catchup state

In remote catchup state, the primary database reads log data from its log path or by way of a log archiving method and sends it to the standby database. The primary and standby databases enter peer state when the standby database has received all the on-disk log data of the primary database. If you are using the SUPERASYNC synchronization mode, the primary and standby never enter peer state. They stay in remote catchup state permanently, which avoids the possibility that primary log writing is blocked, as can happen in peer state.

If the connection between the primary and standby databases is lost when the databases are in remote catchup state, the standby database enters remote catchup pending state.

Peer state

In peer state, log data is shipped directly from the primary's log write buffer to the standby whenever the primary flushes its log pages to disk. The HADR synchronization mode specifies whether the primary waits for the standby to send an acknowledgement message that log data has been received. The log pages are always written to the local log files on the standby database. This behavior guards against a crash and allows a file to be archived on the new primary in case of takeover, if it has not been archived on the old primary. After being written to local disk, the received log pages can then be replayed on the standby database. If log spooling is disabled (the default), replay reads logs only from the log receive buffer.

If log replay is slow, the receive buffer can fill up, and the standby stops receiving new logs. If this happens, primary log writing is blocked. If you enable log spooling, part of the receive buffer is released even if it has not been replayed yet, and log replay reads the log data back from disk later. If the spooling device fills up or the configured spool limit is reached, the standby still stops receiving, and the primary can still be blocked.
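The blocking rule described above can be sketched as a small Python function. This is a deliberately simplified model, not how the standby is implemented: the function name, its parameters, and the use of abstract "pages" as the unit are all assumptions made for the illustration.

```python
def standby_can_receive(buffer_used, buffer_size,
                        spooling_enabled=False,
                        spool_used=0, spool_limit=0):
    """Simplified model: can the standby accept more log data?

    All arguments are hypothetical page counts. Without spooling, the
    standby stops receiving once the receive buffer is full. With
    spooling, buffer space can be released before replay, so receiving
    stops only when the spool limit is also reached.
    """
    if buffer_used < buffer_size:
        return True   # room left in the receive buffer
    if not spooling_enabled:
        return False  # buffer full and nothing can be released early
    return spool_used < spool_limit  # buffer full; depends on spool space
```

When the function returns False in this model, the standby has stopped receiving, which is the situation in which primary log writing can be blocked.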

If the connection between the primary and standby databases is lost when the databases are in peer state, the outcome depends on the hadr_peer_window database configuration parameter. If the parameter is set to 0 (the default), the standby database enters remote catchup pending state. If you set the parameter to a nonzero value (meaning that you configured a peer window), the standby database enters disconnected peer state.
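The transitions described so far can be summarized as a small state machine. The Python sketch below is only a reading aid under stated assumptions: the State and Event names and the next_state function are hypothetical, the move from remote catchup pending to remote catchup is modeled as a single CONNECTED event (archive retrieval is not modeled), and exits from disconnected peer state are left out because they are not spelled out in this section.

```python
from enum import Enum


class State(Enum):
    LOCAL_CATCHUP = "local catchup"
    REMOTE_CATCHUP_PENDING = "remote catchup pending"
    REMOTE_CATCHUP = "remote catchup"
    PEER = "peer"
    DISCONNECTED_PEER = "disconnected peer"


class Event(Enum):  # hypothetical event names for this sketch
    END_OF_LOCAL_LOGS = 1
    CONNECTED = 2        # assumption: archive retrieval is not modeled
    CAUGHT_UP = 3        # standby received all on-disk primary log data
    CONNECTION_LOST = 4


def next_state(state, event, sync_mode="SYNC", peer_window=0):
    """Return the standby state after an event; unknown pairs keep the state."""
    if state is State.LOCAL_CATCHUP and event is Event.END_OF_LOCAL_LOGS:
        return State.REMOTE_CATCHUP_PENDING
    if state is State.REMOTE_CATCHUP_PENDING and event is Event.CONNECTED:
        return State.REMOTE_CATCHUP
    if state is State.REMOTE_CATCHUP:
        # SUPERASYNC never enters peer state.
        if event is Event.CAUGHT_UP and sync_mode != "SUPERASYNC":
            return State.PEER
        if event is Event.CONNECTION_LOST:
            return State.REMOTE_CATCHUP_PENDING
    if state is State.PEER and event is Event.CONNECTION_LOST:
        # A nonzero peer window leads to disconnected peer state.
        if peer_window > 0:
            return State.DISCONNECTED_PEER
        return State.REMOTE_CATCHUP_PENDING
    return state
```

For instance, next_state(State.PEER, Event.CONNECTION_LOST, peer_window=120) returns State.DISCONNECTED_PEER, while the same call with the default peer_window of 0 returns State.REMOTE_CATCHUP_PENDING.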

Disconnected peer state

If you configured a peer window and the primary database loses its connection with the standby database in peer state, the primary database continues to behave as though the primary and standby databases were in peer state for the configured amount of time (called the peer window), or until the standby reconnects, whichever happens first. When the primary database and standby database are disconnected but behave as though they were in peer state, this state is called disconnected peer.
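The "whichever happens first" rule can be written as a short calculation. The following Python sketch is a simplified timing model only: the function and parameter names are hypothetical, times are plain numbers in seconds, and the window is assumed to be measured from the moment the connection is lost.

```python
def disconnected_peer_ends_at(lost_at, peer_window, reconnect_at=None):
    """Return the time at which disconnected peer behavior ends.

    lost_at      -- time the connection was lost (seconds, assumed origin)
    peer_window  -- configured peer window in seconds
    reconnect_at -- time the standby reconnects, or None if it never does
    """
    window_end = lost_at + peer_window
    if reconnect_at is not None and reconnect_at >= lost_at:
        # The state ends at the earlier of reconnection and window expiry.
        return min(window_end, reconnect_at)
    return window_end
```

With a 120-second peer window and a connection lost at time 100, the state ends at time 220 if the standby stays away, or at time 150 if the standby reconnects at time 150.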

The advantage of configuring a peer window is that it lowers the risk of transaction loss during multiple or cascading failures. Without the peer window, when the primary database loses its connection with the standby database, the primary database moves out of peer state immediately and continues transaction processing. These transactions are not replicated to the standby. If the primary server fails shortly after it loses its connection to the standby, the risk of transaction loss is high in a failover. With the peer window enabled, the primary database blocks transaction processing for a certain amount of time after losing its connection to the standby in peer state, which guards against cascading failures. Furthermore, the standby can take over within the peer window time with no risk of data loss.

The disadvantage of configuring a peer window is that transactions on the primary database can take longer or even time out while the primary database is in the peer window, waiting for the connection with the standby database to be restored or for the peer window to expire. In addition, intermittent network failures can severely affect primary transaction processing.

You can determine the peer window size, which is the value of the hadr_peer_window database configuration parameter, by using the MON_GET_HADR table function, or the db2pd command with the -hadr parameter.

Manually copying log files from the primary database to the standby database

One method for synchronizing the primary and standby databases is to manually copy the primary database log files into the standby database log path or overflow log path (if configured). This can be especially helpful if there is a large log gap between the primary and standby (for example, because the standby database was down for a long time). Copying the files spares the standby the delay of retrieving the logs from the archive, or spares the primary the cost of shipping these log files (which the primary would likely have to retrieve from the archive). You must complete this step before you activate the standby database. After the standby database is activated, it proceeds with searching local log files, attempting to retrieve from the archive, and engaging the primary for log shipping, as described above. Copying the log files to the standby after it has been activated interferes with its normal operation.
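A manual copy of this kind could be scripted along the following lines. This Python sketch is illustrative only: the function name copy_log_files is hypothetical, the file name pattern (S followed by seven digits and .LOG) is an assumption about how the log files are named, and the directories are whatever paths apply on your systems. Run something like this only while the standby database is deactivated.

```python
import re
import shutil
from pathlib import Path

# Assumed log file naming pattern, for example S0000012.LOG.
LOG_NAME = re.compile(r"^S\d{7}\.LOG$")


def copy_log_files(source_dir, standby_log_dir):
    """Copy log files that the standby log path does not already contain.

    Returns the names of the files that were copied, in sorted order.
    Files already present in the target directory are left untouched.
    """
    source = Path(source_dir)
    target = Path(standby_log_dir)
    copied = []
    for path in sorted(source.iterdir()):
        if not (path.is_file() and LOG_NAME.match(path.name)):
            continue  # skip directories and files that are not log files
        destination = target / path.name
        if destination.exists():
            continue  # never overwrite a log file the standby already has
        shutil.copy2(path, destination)
        copied.append(path.name)
    return copied
```

Skipping files that already exist in the target is a conservative design choice for the sketch, so that a repeated run does not overwrite anything in the standby log path.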