Technical Blog Post
Abstract
Why db2 was hung while waiting for log files to be archived to third party vendor ?
Body
Occasionally you may see DB2 hung and it may be because of a problem in the third party vendor used to archive db2 logs. We have tried to explain the problem, its identification and solution below. The third party vendor used in our example below is TSM, please note we have seen the below issue with Veritas Netbackup as well.
You may find the following messages in db2diag.log while db2 is hung :
….and finally DB2 stays hung indefinitely in this state and logs above message in db2diag.log often.
Let us look at exactly what happened here :
1> The db2diag.log file contains last entry of log archival at starting at 18.20 hrs. Between that time and 20.17 hrs,there were entries from db2loggr processes ("DB2 is waiting for log files to be archived.") as noted before, but there was not a single entry from any db2logmgr processes.
2> The db2logmgr processes should be running (since databases were activated and db2loggr processes were running), and should be writing entries to db2diag.log from time to time ("Successfully archived log file ..." message, or various error messages when hitting problems). In particular, if db2logmgr was archiving a log to TSM and that returned an error rc, we should see a sqlpgArchiveLogFile probe 3160 "Failed to archive log file ..." message in db2diag.log.
3> It is the same process (db2logmgr) that is responsible for archiving logs to TSM (third party vendor). The db2logmgr calls the Vendor code and then the db2loggr waits for the db2logmgr to post that the archive attempt succeeded/failed. However, in above example, the Vendor (TSM) mechanism did not return back to the db2logmgr with the expected success/failure code.
4> Since the Vendor code has control, there is nothing that DB2 can do until the Vendor code returns control back to db2, in this example with an archive attempt failed due to problem within TSM library.
If it had returned (FAILED) with some error code for a number of times (database configuration parameter NUMARCHRETRY, default is 5) then log file/s are archived to the FAILARCHPATH (if configured).
5> But in the example "archive to TSM” was HUNG (not failed) because of the problem with TSM library, then db2logmgr would be hanging and won’t return to db2 at all for db2 to take any action on it. This hang condition in the vendor code must somehow be broken by TSM/vendor to allow DB2 to once again operate properly.
Note, we have seen many such problems from various vendors used for log archival by customers so starting in Db2 Version 11.1.3.3, the archival of log files (LOGARCHMETH1/2) using VENDOR or TSM methods can now be configured with a timeout on Unix environments. If the transmission of log data between Db2 and the TSM or vendor API is unresponsive for the specified timeout period, then Db2 will interrupt the archive log attempt and follow normal log archive failure protocol. For details see "--VENDOR_ARCHIVE_TIMEOUT" in the
UID
ibm11140190