HADR takeover operations in a Db2 pureScale environment

When an HADR standby database takes over as the primary database in a Db2 pureScale environment, there are a number of important differences from HADR in other environments.

With HADR, there are two types of takeover: role switch and failover. Role switch, sometimes called graceful takeover or non-forced takeover, can be performed only when the primary is available; it switches the roles of the primary and standby. Failover, or forced takeover, can be performed when the primary is not available. It is commonly used in primary failure cases to make the standby the new primary. In a forced takeover, the old primary remains in the primary role, but the standby sends it a message to disable it. Both types of takeover are supported in a Db2 pureScale environment, and both can be issued from any standby database member, not just the current replay member. However, after the standby completes the transition to the primary role, the database is started only on the member that served as the replay member before the takeover. The database can be started on the other members by issuing an ACTIVATE DATABASE command or implicitly through a client connection.
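The two takeover types and the follow-up activation map onto Db2 CLP commands roughly as sketched here. The database name HADRDB and the helper functions takeover_cmd and activate_cmd are hypothetical; the helpers only print the command text so that it can be reviewed before it is run on a standby member.

```shell
#!/bin/sh
# Hypothetical database name; substitute your own.
DB=HADRDB

# Print the Db2 CLP command for the requested takeover type.
#   graceful -> role switch (the primary must be available)
#   forced   -> failover (the primary may be unavailable)
takeover_cmd() {
    case "$1" in
        graceful) echo "db2 TAKEOVER HADR ON DATABASE $DB" ;;
        forced)   echo "db2 TAKEOVER HADR ON DATABASE $DB BY FORCE" ;;
        *)        echo "usage: takeover_cmd graceful|forced" >&2
                  return 1 ;;
    esac
}

# After the takeover, the database is active only on the former replay
# member; this prints the command that starts it on the other members.
activate_cmd() {
    echo "db2 ACTIVATE DATABASE $DB"
}
```

For example, `takeover_cmd forced` prints the failover command, which can then be issued from any member of the standby cluster.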

Role switch

After a role switch, which is initiated by issuing the TAKEOVER HADR command from any standby member, the standby cluster becomes the primary cluster and vice versa. Role switch helps ensure that no data is lost between the old primary and new primary. You can initiate a role switch in the following circumstances only:

No crash recovery is occurring on the primary cluster; this includes member crash recovery that is pending or in progress.
All the log streams are in peer or assisted remote catchup state.
All the log streams are in remote catchup state or in assisted remote catchup state, and the synchronization mode is SUPERASYNC.
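The conditions above can be expressed as a small pre-check. This is a minimal sketch under one reading of the list: the crash recovery condition must always hold, and either of the two log stream conditions is then sufficient. The function name can_role_switch and the lowercase state labels are hypothetical; they are not values reported by Db2 and would have to be derived from your own monitoring output.

```shell
#!/bin/sh
# Return 0 if a role switch is allowed under the conditions listed above.
#   $1   crash recovery pending or in progress on the primary: yes | no
#   $2   synchronization mode, for example SYNC or SUPERASYNC
#   $3+  one state label per log stream (hypothetical labels):
#        peer | remote_catchup | assisted_remote_catchup
can_role_switch() {
    [ "$#" -ge 3 ] || return 2
    crash_recovery=$1
    syncmode=$2
    shift 2

    # Crash recovery (including member crash recovery) blocks a role switch.
    [ "$crash_recovery" = "no" ] || return 1

    peer_ok=yes      # every stream in peer or assisted remote catchup
    catchup_ok=yes   # every stream in remote or assisted remote catchup
    for state in "$@"; do
        case "$state" in
            peer)                    catchup_ok=no ;;
            remote_catchup)          peer_ok=no ;;
            assisted_remote_catchup) ;;
            *)                       return 1 ;;
        esac
    done

    [ "$peer_ok" = "yes" ] && return 0
    [ "$catchup_ok" = "yes" ] && [ "$syncmode" = "SUPERASYNC" ] && return 0
    return 1
}
```

For example, `can_role_switch no SUPERASYNC remote_catchup assisted_remote_catchup` succeeds, whereas the same streams with SYNC as the synchronization mode do not qualify.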

Before you initiate a role switch in remote catchup or assisted remote catchup state, check the log gap between the primary and standby log streams. A large gap can result in a long takeover time because all of the logs in that gap must be shipped and replayed first.
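One way to check the gap is to read the HADR_LOG_GAP field that db2pd reports for each log stream. The sketch below assumes that `db2pd -hadr` prints one `HADR_LOG_GAP(bytes) = <value>` line per log stream; the function name max_log_gap and the database name HADRDB are hypothetical.

```shell
#!/bin/sh
# Read db2pd -hadr output on stdin and print the largest HADR_LOG_GAP
# value (in bytes) found across all reported log streams.
# Assumes lines of the form "HADR_LOG_GAP(bytes) = <number>".
max_log_gap() {
    awk -F' = ' '
        $1 ~ /HADR_LOG_GAP/ {
            gap = $2 + 0
            if (gap > max) max = gap
        }
        END { print max + 0 }
    '
}

# Usage (hypothetical database name):
#   db2pd -db HADRDB -hadr | max_log_gap
```

A result that is large relative to your log shipping and replay rate suggests waiting for the standby to catch up before you issue the TAKEOVER HADR command.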

During a role switch, the following steps occur on the primary:

New connections are rejected on all members, any open transactions are rolled back, and all remaining logs are shipped to the standby.
The primary cluster's database role changes to standby.
A member that has a direct connection to the standby is chosen as the replay member, with preference given to the preferred replay member (that is, the member that HADR was started from).
Log receiving and replay starts on the replay member.
The database is shut down on the other non-replay members of the cluster.

The following steps occur on the standby:

Log receiving is stopped on the replay member after the end of logs is reached on each log stream, helping ensure no data loss.
The replay member finishes replaying all received logs.
After it is confirmed that the primary cluster is now in the standby role, the replay member changes the standby cluster's role to primary.
The database is opened for client connections, but it is activated only on the member that was previously the standby replay member.

Failover

After a failover, which is initiated by issuing the TAKEOVER HADR command with the BY FORCE option from any standby member, the standby cluster becomes the primary cluster. The old primary cluster is sent a disabling message, but its role is not changed. Any member on the primary that receives this message disables the whole primary cluster. By initiating a failover, you are accepting the trade-off between potential data loss and having a working database. You cannot initiate a failover if the databases are in local catchup state.

Note: Unlike in previous releases, you can now initiate a failover even if log archive retrieval is in progress.

During a failover, the following steps occur on the primary (assuming it is online and connected to the standby):

After the primary receives the disabling message, the database is shut down and log writing is stopped.

The following steps occur on the standby, all of which are carried out from the replay member:

A disabling message is sent to the primary, if it is connected.
Log shipping and log retrieval are stopped, which entails a risk of data loss.
The replay member finishes replaying all received logs (that is, the logs that are stored in the log path).
Any open transactions are rolled back.
The replay member changes the standby cluster's role to primary.
The database is opened for client connections, but it is activated only on the member that was previously the standby replay member.

You can reintegrate the old primary as a new standby only if its log streams did not diverge from the new primary's log streams. Before you can start HADR, the database must be offline on all of the old primary's members; the cluster caching facilities, however, can stay online. If any members are online, kill them instead of issuing the DEACTIVATE DATABASE command on them.
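The reintegration precondition can be sketched as a guard around the START HADR command. The function name reintegrate_cmd and the online/offline labels are hypothetical; determining each member's actual state is left to your own tooling, and the helper only prints the command rather than running it.

```shell
#!/bin/sh
# Print the command that restarts HADR on the old primary as a standby,
# but only if the database is reported offline on every member.
#   $1   database name
#   $2+  one label per member of the old primary (hypothetical labels):
#        online | offline
reintegrate_cmd() {
    [ "$#" -ge 2 ] || return 2
    db=$1
    shift
    for member_state in "$@"; do
        if [ "$member_state" != "offline" ]; then
            echo "database is still online on a member; stop it first" >&2
            return 1
        fi
    done
    echo "db2 START HADR ON DATABASE $db AS STANDBY"
}
```

For example, `reintegrate_cmd HADRDB offline offline` prints the START HADR command, whereas any member reported as online causes the helper to refuse.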