Recovering the CDC Replication Engine for Oracle databases after a database failover operation

CDC Replication requires a specific procedure to recover from an Oracle database failover operation.

About this task

In a disaster recovery configuration, users often set up their environments with a primary database and a physical standby database that are connected through Oracle Data Guard services. CDC Replication is normally configured to replicate data from the primary database.

These databases have two mutually exclusive roles: primary and standby. The roles can be interchanged, which is known as a role transition. A role transition can be planned, or it can result from a database failure.

There are two main scenarios:

Planned switchover: This scenario is a planned operation in which the primary database and the standby database change roles. A switchover guarantees no data loss. This planned operation occurs without having to re-instantiate either of the databases.
Failover: This scenario happens when the primary database fails, becomes unreachable and cannot be recovered in a timely manner. Failover might or might not result in data loss, depending on the data protection mode in use at the time of the failover. This type of transition requires a re-instantiation of the newly activated database.

A planned switchover consists of a series of steps that users follow to switch over their databases. The CDC Replication Engine for Oracle databases should be taken into consideration as part of that plan.

The recovery mode of the CDC Replication Engine for Oracle databases provides a solution for some of the unplanned failover cases. This mode does not support automatic failover. The failover procedure is still manual. Implementing the manual failover procedure results in a new configuration in which CDC Replication replicates from the new primary database (B).

The procedure that follows does not describe how to move the CDC Replication Engine for Oracle databases from one machine to another. It assumes that the software is ready to start replication from the new primary database (B).

A failover operation requires a re-instantiation of the new primary database (B). Opening the new primary database with the RESETLOGS option re-instantiates the database. A RESETLOGS operation does the following:

Archives the current online redo logs if they are accessible
Erases the contents of the online redo logs
Resets the log sequence number to 1
Creates the online redo log files if they do not currently exist
Updates all current data files and online redo logs and all subsequent archived redo logs with a new RESETLOGS SCN and timestamp

The RESETLOGS operation creates a new incarnation of the database. If the CDC Replication Engine for Oracle databases is started normally after the RESETLOGS operation, replication fails because CDC Replication can read logs only from the current incarnation of the database.

To enable CDC Replication to resume replication, CDC Replication must be run in recovery mode. In recovery mode, CDC Replication reads logs from the previous incarnation of the database. The CDC Replication Engine for Oracle databases uses the dmfailoverrecovery command-line utility to implement this recovery mode.

The dmfailoverrecovery command enables the CDC Replication Engine for Oracle databases to read logs from the previous incarnation of the database until all required logs are processed. If the recovery step finishes successfully, the CDC Replication Engine for Oracle databases resumes normal replication of the new database incarnation.

The dmfailoverrecovery command starts replication for all configured subscriptions and mirrors data until all logs from the previous incarnation are processed and the last available SCN on the previous incarnation is reached.

To continue replication after a database failover, perform the following procedure after executing the manual failover procedure and configuring the CDC Replication Engine for Oracle databases with the new primary database (B):

Procedure

Run the dmfailoverrecovery command with the -d option. The -d option enables you to validate the parameters that CDC Replication uses during the recovery. The output of the command shows the current and previous reset SCNs.

<CDC_INSTALL_HOME>/bin>./dmfailoverrecovery -I MYINSTANCE -d
Failover recovery will run for all configured subscriptions up to SCN <scn number>. Please re-run this command using the option to start
failover recovery. Active Reset SCN is: <scn number> , Previous Reset SCN is: <scn number>.
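
If the dry run is scripted, the reported SCNs can be extracted from the output and compared with the values that the database administrator expects. The following is a minimal sketch that assumes the output wording shown above. The function name extract_scn is hypothetical and is not part of CDC Replication.

```shell
# Hypothetical helper (not shipped with CDC Replication).
# Reads dmfailoverrecovery -d output on stdin and prints the number that
# follows the given label, for example "Previous Reset SCN is:".
extract_scn() {
    label="$1"
    # Join wrapped lines, then keep the first number that follows the
    # last occurrence of the label.
    tr '\n' ' ' | sed -n "s/.*${label}[^0-9]*\([0-9][0-9]*\).*/\1/p"
}
```

For example, `./dmfailoverrecovery -I MYINSTANCE -d | extract_scn 'Previous Reset SCN is:'` would print only the previous reset SCN, provided that the output follows the format shown in this step.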

CDC Replication cannot perform a full recovery if the output of Step 1 does not match expected values. In this case, a refresh operation is required for all tables.

Attempting to resume replication without running the dmfailoverrecovery command results in failure, accompanied by the following error in the event log:

Archived log corresponding to SCN position {0} does not belong to the current database incarnation (it may belongs to previous DB 
Incarnation). Current DB Incarnation is {1}. Log position is {2}. If you performed a failover onto a physical standby database, 
you might need to run the failover recovery procedure running command line dmfailoverrecovery. If not, please contact IBM 
support for more Information.
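
To check for this condition in a script, the distinctive part of the message can be searched for in event log text. The following is a sketch only. It assumes that the event log messages were exported to plain text by some other means, and the function name needs_failover_recovery is hypothetical.

```shell
# Hypothetical helper (not shipped with CDC Replication).
# Returns 0 if the text on stdin contains the incarnation-mismatch error,
# and a nonzero status otherwise.
needs_failover_recovery() {
    grep -q 'does not belong to the current database incarnation'
}
```

A script might call it as `needs_failover_recovery < exported_events.txt`, where exported_events.txt is a placeholder for wherever the exported messages are stored.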

After validating the information that is given by the command in Step 1, run the command with the -r option. This option starts the recovery process. Note that the command starts replication for all subscriptions. Recovery time depends on the size and number of database logs that the CDC Replication Engine for Oracle databases must process. All subscriptions end replication when the recovery process is complete.
```
<CDC_INSTALL_HOME>/bin>./dmfailoverrecovery -I MYINSTANCE -r
Failover Recovery has been completed successfully from previous reset SCN 144265867
```
Validate that all subscriptions ended replication gracefully. Subscriptions that do not end gracefully must be refreshed.
When the recovery process ends, you can resume normal replication.
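
The two-step sequence in this procedure can be wrapped in a small script. The following is a sketch under stated assumptions, not a supported tool. The DMFR variable, the failover_recover function, and the check for the string "completed successfully" are illustrative. Only the -I, -d, and -r options come from this topic, and the exit codes of dmfailoverrecovery are not relied on because they are not described here.

```shell
# Sketch only. DMFR and failover_recover are hypothetical names.
# Run from <CDC_INSTALL_HOME>/bin, or set DMFR to the full path of the command.
DMFR="${DMFR:-./dmfailoverrecovery}"

# Usage: failover_recover INSTANCE EXPECTED_PREVIOUS_RESET_SCN
failover_recover() {
    local instance="$1" expected_prev_scn="$2" dry prev out

    # Step 1: dry run that reports the SCNs the recovery would use.
    dry="$("$DMFR" -I "$instance" -d 2>&1)"
    printf '%s\n' "$dry"

    # Extract the previous reset SCN from the dry-run output.
    prev="$(printf '%s' "$dry" | tr '\n' ' ' |
        sed -n 's/.*Previous Reset SCN is:[^0-9]*\([0-9][0-9]*\).*/\1/p')"

    # Stop before the recovery if the value is not the one expected.
    if [ "$prev" != "$expected_prev_scn" ]; then
        echo "Previous reset SCN does not match; a refresh might be required." >&2
        return 1
    fi

    # Step 2: start the recovery for all configured subscriptions.
    out="$("$DMFR" -I "$instance" -r 2>&1)"
    printf '%s\n' "$out"

    # Assumption: success is recognized by the message wording shown above.
    case "$out" in
        *"completed successfully"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A call such as `failover_recover MYINSTANCE 144265867` would run the recovery only when the dry run reports that previous reset SCN. The script does not replace the final validation that all subscriptions ended replication gracefully.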