Technical Blog Post
Abstract
75 ways to demystify DB2: #14 : Expert Advice: Why does the standby keep going into RemoteCatchUp state when log files are available locally?
Body
Consider the following scenario in a DB2 HADR configuration using ASYNC mode. The log files were extracted from TSM to a local file system on the standby server. The standby database was then deactivated and re-activated, causing the standby to enter LocalCatchUp state. However, it is unable to read the local log file because the file is considered "stale", and the standby re-enters RemoteCatchUp state. This process was repeated two or three times; each time the standby failed to read the log file and switched back to RemoteCatchUp state. DB2 generates messages similar to the following in the db2diag.log:
2015-01-28-12.26.52.860613+060 I6092198A457 LEVEL: Warning
PID : 1282122 TID : 3856 PROC : db2sysc 0
INSTANCE: db2inst NODE : 000 DB : MYDB
APPHDL : 0-8 APPID: *LOCAL.db2inst.110729112647
AUTHID : DB2BIP
EDUID : 3856 EDUNAME: db2agent (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21151
MESSAGE : Info: HADR Startup has begun.
2015-01-28-12.26.53.138837+060 E6096113A369 LEVEL: Event
PID : 1282122 TID : 9254 PROC : db2sysc 0
INSTANCE: db2inst NODE : 000
EDUID : 9254 EDUNAME: db2hadrs (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-LocalCatchup (was S-Boot)
2015-01-28-12.26.53.286362+060 I6100019A470 LEVEL: Warning
PID : 1282122 TID : 5142 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 5142 EDUNAME: db2lfr (MYDB) 0
FUNCTION: DB2 UDB, recovery manager, sqlplfrFMReadLog, probe:5150
MESSAGE : Found a log on a newer chain. Updating chain number. extNum / chainId
DATA #1 : unsigned integer, 4 bytes
597293
DATA #2 : unsigned integer, 4 bytes
12
2015-01-28-12.26.55.487649+060 I6101311A360 LEVEL: Warning
PID : 1282122 TID : 5142 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 5142 EDUNAME: db2lfr (MYDB) 0
FUNCTION: DB2 UDB, recovery manager, sqlplfrFMOpenLog, probe:600
MESSAGE : Extent 597294 in log path may be stale. Trying archive.
As a result, the standby enters RemoteCatchUp state and requests that the primary retrieve the log files required to roll the database forward.
2015-01-28-12.27.00.179704+060 I6102036A364 LEVEL: Warning
PID : 1282122 TID : 4371 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 4371 EDUNAME: db2logmgr (MYDB) 0
FUNCTION: DB2 UDB, data protection services, sqlpgRetrieveLogFile, probe:4130
MESSAGE : Started retrieve for log file S0597294.LOG.
2015-01-28-12.27.00.259102+060 E6102975A431 LEVEL: Warning
PID : 1282122 TID : 4371 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 4371 EDUNAME: db2logmgr (MYDB) 0
FUNCTION: DB2 UDB, data protection services, sqlpgRetrieveLogFile, probe:4165
MESSAGE : ADM1847W Failed to retrieve log file "S0597294.LOG" on chain "12" to "/db2/MYDB/log_dir/NODE0000/".
2015-01-28-12.27.00.558157+060 E6103820A385 LEVEL: Event
PID : 1282122 TID : 9254 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 9254 EDUNAME: db2hadrs (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)
2015-01-28-12.27.00.656618+060 E6104206A386 LEVEL: Event
PID : 1282122 TID : 9254 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 9254 EDUNAME: db2hadrs (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-RemoteCatchup (was S-RemoteCatchupPending)
In this case, why does the standby keep switching to RemoteCatchUp state when the log files are available locally? RemoteCatchUp state may take much longer to complete than LocalCatchUp state, so what can be done to resolve this issue?
Explanation:
The problem is a timing issue: it depends on when you issue the deactivate database command. If a log file is in the process of being transferred when you deactivate the standby database, the local copy of that file will be incomplete. This is because the remote catchup process does not wait for a log file to be transferred completely before it starts applying it on the standby database. The consequence is that when you subsequently re-activate the standby database, it goes into LocalCatchUp state, checks the next log it needs, finds it (because of the interrupted transfer) incomplete, rejects it as "stale", and goes into RemoteCatchUp state again.
Hence, unless you replace the logs that the remote catchup process downloaded from the primary, the standby will always pick up where it left off, i.e. with the log file it was processing when the database was deactivated, and that file is always incomplete. Keep in mind that the effect of an incompletely transferred log file is masked, because log files are allocated at full size and then populated; a log file being transferred does not start at zero bytes and grow to full size. Since it is allocated at full size up front, when you list the log directory it appears to be the same size as a complete log file.
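The masking effect described above can be illustrated with a small sketch (not real DB2 code; the file name and sizes are stand-ins): a file preallocated to full size looks identical in a directory listing whether its transfer completed or not.

```python
# Illustrative only: shows why an incompletely transferred log file is
# hard to spot. The file is preallocated to full size first and then
# populated, so its size in a directory listing equals a complete log's.
import os
import tempfile

FULL_SIZE = 4096  # stand-in for the configured log extent size
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "S0597294.LOG")

# Preallocate the file to full size, as the transfer does up front...
with open(path, "wb") as f:
    f.truncate(FULL_SIZE)

# ...then populate only part of it before the deactivate interrupts.
with open(path, "r+b") as f:
    f.write(b"\x01" * 1024)  # only the first 1 KB was actually transferred

# A directory listing reports the full size, even though most of the
# file is still zero-filled padding and the log is unusable.
print(os.path.getsize(path))  # 4096
```

This is why comparing file sizes between primary and standby is not enough to spot the stale extent; the standby's consistency check on the log content is what rejects it.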
Resolution:
1. Deactivate the database on the standby. Do not stop HADR, as that switches the database role to standard.
2. Replace the (stale) log file that the standby was processing when it was deactivated with a full copy of the log file from the primary.
3. Copy over some subsequent logs from the primary for good measure.
4. Re-activate the database on the standby.
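The steps above can be sketched as follows. The DB2 commands are shown as comments because they require a live instance; the copy step is simulated here with throwaway directories, and the paths, database name (MYDB) and log file name (S0597294.LOG, taken from the diag entries above) are illustrative. In a real system PRIMARY_LOGS would be a staging area holding logs copied from the primary, and STANDBY_LOGS would be the standby's active log path (e.g. /db2/MYDB/log_dir/NODE0000 from the ADM1847W message above).

```shell
# Step 1 (on a live standby): db2 deactivate database MYDB   -- not STOP HADR

# Simulated directories standing in for the primary staging area and the
# standby's active log path.
PRIMARY_LOGS=$(mktemp -d)/NODE0000
STANDBY_LOGS=$(mktemp -d)/NODE0000
mkdir -p "$PRIMARY_LOGS" "$STANDBY_LOGS"
printf 'complete log data' > "$PRIMARY_LOGS/S0597294.LOG"   # full copy from primary
printf 'partial' > "$STANDBY_LOGS/S0597294.LOG"             # stale copy on standby

# Steps 2-3: overwrite the stale extent (and any subsequent extents staged
# from the primary) in the standby's log path.
for f in "$PRIMARY_LOGS"/S*.LOG; do
  cp "$f" "$STANDBY_LOGS/$(basename "$f")"
done

# Step 4 (on a live standby): db2 activate database MYDB
cmp -s "$PRIMARY_LOGS/S0597294.LOG" "$STANDBY_LOGS/S0597294.LOG" && echo "logs match"
```

After the copy, the standby's local catch-up can read the now-complete extent instead of rejecting it as stale.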