Recovering table space errors on an HADR standby database
In an HADR environment, when a standby database has a tablespace in an invalid or error state, the replay of transactions on this tablespace stops. However, the replay of transactuions on other valid tablespaces will continue. The primary database is not affected, and the condition of this tablespace on the standby can go unnoticed.
If this tablespace condition exists on the standby database, then sometime later when a TAKEOVER operation is performed on the standby database, applications can be impacted by the unavailability of this tablespace. The erroneous tablespaces can be recovered on the standby database by either reinitializing the affected tablespaces, or reinitializing the entire database.
Monitoring and identifying erroneous tablespaces on the standby
Techniques for monitoring this condition and identifying the erroneous tablespaces are described in the technote, Monitoring and Identifying tablespaces in invalid or error state on the HADR Standby database.
Identifying and rectifying the cause of the erroneous tablespaces
After the erroneous tablespaces are identified, the cause of the error condition must be determined. Common causes include insufficient filesystem space, a filesystem that is not mounted, a filesystem error, or a load copy-yes image cannot be found. A severe log replay error can also be a cause.
The administration notification log (<instance_name>.nfy) and db2diag.log must be examined.
2016-03-15-22.51.15.490605 Instance:rsiinp16 Node:000
PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
buffer pool services sqlbIncPoolState Probe:3604 Database:MYDB1
ADM12512W Log replay on the HADR standby has stopped on table space "MYTBSPACE3"
(ID "7") because it has been put into "ROLLFORWARD PENDING" state.
2016-03-15-22.51.15.565961 Instance:rsiinp16 Node:000
PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
data management sqldHandleBadPool Probe:40 Database:MYDB1
ADM5550C The table space "MYTBSPACE3" (ID "7") is being removed from the
rollforward set. The SQLCODE is "-980".
2016-03-15-22.51.15.189467-240 E763181A1307 LEVEL: Error (OS)
PID : 10551302 TID : 19277 PROC : db2sysc 0
INSTANCE: myinst1 NODE : 000 DB : MYDB1APPHDL : 0-8
APPID: *LOCAL.DB2.160221104421
HOSTNAME: somehost.ibm.com
EDUID : 19277 EDUNAME: db2redow (MYDB1) 0
FUNCTION: Db2 UDB, oper system services, sqloseekwrite64, probe:40
MESSAGE : ZRC=0x860F0003=-2045837309=SQLO_DERR "disk error occurred (DOS)"
DIA8402C A disk error has occurred.
CALLED : OS, -, pwrite
OSERR : EIO (5) "I/O error"
If the cause of the error condition is determined to be a localized issue, such as insufficient filesystem space, a filesystem problem, or similar, the issue must be rectified before you proceed to reinitialize the table space on the standby.
If the cause of the error condition is determined to be a severe replay error or an internal db2 error, with no observable localized cause, retain all diagnostic information before you open a PMR.
Reinitializing the erroneous tablespaces on the standby database
Depending on your Db2® version, refer to the respective links to reinitialize the erroneous table spaces on the standby database.
- A full reinitialization of the standby database is the only supported method.
- Refer to Initializing high availability disaster recovery (HADR).
- A full reinitialization of the standby database is the only supported method.
- Refer to Initializing high availability disaster recovery (HADR).
- On the standby host, deactivate the standby
database:
[Standby]$db2 "deactivate db MYDB1"
- On the primary host, perform a FLUSH BUFFERPOOL operation before the backup operation. This is
to ensure that the backup image to be taken at step 3 has a later recovery starting point than
recovery starting point of the standby database. This operation is required in order to repair the
table space on the standby
database.
[Primary]$db2"flush bufferpools all"
Note: Databases with heavy workloads can experience short performance issues due to aggressive I/O during the FLUSH BUFFERPOOL operation - On the primary host, perform a backup of the erroneous tablespaces.
(Note: An 'online' backup allows the primary database to remain available during the backup operation).
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a STOP HADR operation:[Primary]$db2 "backup db MYDB1 tablespace MYTBSPC3 online to /bkp_image_path_pri/ "
[Standby]$db2 "stop hadr on db MYDB1"
- Copy or FTP the backup image from the primary host path
(/bkp_image_path_pri/ in this example) to the standby host path
(/bkp_image_path_stdby/ in this example). On the standby host, perform an
offline restore of the table space from the backup
image:
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a START HADR AS STANDBY operation:[Standby]$db2 "restore db MYDB1 tablespace (MYTBSPC3) from /bkp_image_path_stdby/"
[Standby]$db2 start hadr on db MYDB1 as standby
- On the standby host, reactivate the database (for Versions 11.1.0.0 or 11.1.1.1, this step is
not required because of the start hadr operation
above):
[Standby]$db2 "activate db MYDB1"
If a TAKEOVER operation was performed before realizing that tablespace(s) were invalid on the Standby:
SQL0290N Table space access is not allowed SQLSTATE=55039
Additionally, the
db2diag.log on the new-Primary may contain a message similar to this:
yyyy-mm-dd-hh.mm.ss... I32132788A530 LEVEL: Warning
PID : ... TID : ... PROC : db2sysc 0
INSTANCE: ... NODE : ... DB : ...
APPHDL : ... APPID: ...
EDUID : ... EDUNAME: db2agent (...) 0
FUNCTION: Db2 UDB, recovery manager, sqlpGetTablespacesForFilter,probe:1570
DATA #1 : preformatted
Tablespace 6 is in rollforward pending state. Another rollforward will
be needed to bring this tablespace online.
To confirm which tablespaces are in rollforward-pending state, the "db2pd -tablespaces" command can be used as described in technote #1993013 "Monitoring and Identifying tablespaces in invalid or error state on the HADR Standby database". http://www-01.ibm.com/support/docview.wss?uid=swg21993013
There are three methods to resolve this situation:
If the old-Primary database is still online: then a TAKEOVER operation can be performed on the old-Primary database to make it the Primary database again, and the invalid tablespaces on the Standby database can be re-initialized using the instructions above. This method is preferred since it reduces the duration of time that applications are unable to access the tablespace data.
db2 deactivate db sample
db2 stop hadr on db sample
Perform the rollforward operation:
db2 rollforward db dbname to end of logs
Upon completion of the rollforward, tablespaces should be in normal state, and the HADR Standby database will need to be re-initialized.