Troubleshooting
Problem
When DB2 archiving a log file to vendor program (for example TSM), if TSM does not respond to DB2 (neither success nor failed), and if the database is deactivating during the "non-response" window, DB2 will get stuck in the deactivting process and hence cannot accept new connections to the database
Symptom
DB2 appears hung and does not accept new connections
Cause
DB2 log manager does not get response from TSM, and because user's database is not explicitly activated, when the last connection leaves the database, the deactivate process begins. During the deactivate process, log reader will check if there are any logs waiting to be archived, because TSM did not respond to DB2's archive request, log reader will wait for the log to be archived and eventually get timed out. Because the database is in the process of deactivating, it cannot accept any new connections at this stage until the deactivating is finished. This is why user cannot connect to the database and the database appears to be "hung"
Diagnosing The Problem
This issue has the following pattern:
1. DB2 log manager starts a log archive request for TSM (this is vendor in our example) to archive one of its logs:
2014-09-01-00.12.11.806209-420 E11852947A377 LEVEL: Info
EDUID : 7311 EDUNAME: db2logmgr (XXXXX) 0
Started archive for log file S0133372.LOG.
2. DB2 log manager does not get response from TSM, and because user's database is not explicitly activated, when the last connection leaves the database, the deactivate process begins. During the deactivate process, log reader will check if there are any logs waiting to be archived, because TSM did not respond to DB2's archive request, log reader will wait for the log to be archived and eventually get timed out. Because the database is in the process of deactivating, it cannot accept any new connections at this stage until the deactivating is finished. This is why user cannot connect to the database and the database appears to be "hung" :
2014-09-01-02.43.35.379463-420 I11867573A374 LEVEL: Info
EDUID : 15760 EDUNAME: db2loggr (XXXXX) 0
MESSAGE : DB2 is waiting for log files to be archived.
…
2014-09-01-02.47.35.381139-420 I11869935A482 LEVEL: Info
EDUID : 15760 EDUNAME: db2loggr (XXXXX) 0
DB2 was unable to confirm logs were archived.
Return code -2029059911, FirstArchNum 133372, FirstArchNum2 4294967295, HeadExtentID 133382
…
3. DB2 sqlpterm() function is triggered after log reader is timed out, it tries to shut down the EDUs on the deactivating database, but it cannot shut down the log archive EDU because it is waiting for TSM's response for the log archive. Hence we see the following messages:
2014-09-01-02.52.39.468809-420 E11873692A466 LEVEL: Warning
EDUID : 39250 EDUNAME: db2agent (XXXXX) 0
MESSAGE : Some daemon is taking too long to shut down.
…
2014-09-01-03.18.28.680576-420 E11904313A466 LEVEL: Warning
EDUID : 39250 EDUNAME: db2agent (XXXXX) 0
MESSAGE : Some daemon is taking too long to shut down.
…
4. Finally the archive fails after over 3 hours because of vendor error (in this case, TSM is the vendor):
2014-09-01-03.23.13.210702-420 E11934995A572 LEVEL: Error
EDUID : 7311 EDUNAME: db2logmgr (XXXXX) 0
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, probe:3160
MESSAGE : ZRC=0x86100024=-2045771740=SQLP_SHRLIB_VENDOR_DEV_ERR
"A vendor device reported a shared library error."
DATA #1 : <preformatted>
Failed to archive log file S0133372.LOG to TSM chain 0 from /db2data/MBTS/db2inst1/NODE0000/SQL00003/SQLOGDIR/.
5. Also, from procstack, we can clearly see that DB2 log manager is waiting for the vendor (the vendor is TSM):
---------- tid# 72089699 (pthread ID: 7311) ----------
0x090000000002db14 read(??, ??, ??) + 0x274
0x090000000828a2e4 sqloReadNamedPipe(0x90000000828a6dc, 0x400, 0x7000000113fd2a0, 0x0, 0x7000000113fd2b0, 0xffffffffffffffff) + 0x98
0x090000000828a3fc @83@sqloReadVendorRC(SQLO_VENDOR_HANDLE* const,const int)(0x200000000000000, 0xffffffffffffffff) + 0x7c
0x090000000744dbe8 sqloInvokeVendorFunction(0x100000001, 0x500000005, 0xffffffffffffffff) + 0x130
0x09000000077377b0 sqluVendorCallWrapper(SQLO_VENDOR_HANDLE*,sqluFencedVendorCommBuf*,const unsigned int,sqluFencedVendorProcessState*,const int)(0x7700000c00427af, 0x7700000c004279a, 0x100000001, 0x440048020000002f, 0xffffffffffffffff) + 0x788
0x090000000aa74e6c sqluFencedVendorCB::callWrapper(const unsigned int,const int)(0x7700000c004279a, 0x100000001, 0x90000000bcdc1cc) + 0x50
0x090000000aac38e0 sqlpInitVendorDevice(SQLPG_VENDOR_CB*,const char*,unsigned int,unsigned long)(??, ??, ??, ??) + 0x3ac
0x090000000aac96f0 sqlpgArchiveLogVendor(SQLP_DBCB*,SQLPG_VENDOR_CB*,unsigned long,char*,const char*,unsigned int,unsigned int*,unsigned int*,unsigned int*,unsigned long*,unsigned int*)(??, ??, ??, ??, ??, ??, ??, ??) + 0x358
0x090000000aac4ec4 sqlpgArchiveLogFile(SQLP_DBCB*,unsigned int,unsigned long,unsigned long,char*,unsigned long)(0x208f9, 0x208fc000208fc, 0x1, 0x2, 0x7000000113fe82f, 0x0) + 0xd08
0x090000000aac04e4 sqlpgArchivePendingLogs(SQLP_DBCB*,unsigned long,unsigned long,char*,unsigned int*,unsigned int*,unsigned long*,unsigned int*)(0x7700000c0000348, 0x208f9000208f9, 0x2, 0x0, 0x7000000113feb00, 0x18c6, 0x0, 0x0) + 0x198
0x090000000aabc938 sqlpgLogManager(char*,unsigned int)(??, ??) + 0xe88
0x0900000007f84f6c sqloEDUEntry.fdpr.chunk.7(??) + 0x18
0x0900000000790d50 _pthread_body(??) + 0xf0
Resolving The Problem
The root cause is with the vendor, TSM in this example. They would have to fix the no-response to DB2 archiving issue.
From DB2 side, there's some work arounds, although they cannot fix the issue, but can help reduce the Hang problem:
1. When this issue occurs, we can terminate the vendor program (in this case, TSM) by using the command "db2pd -db <db name> -fvp lam1 term ". This will force the vendor program to set lam1(LOGARCHMETH1) to restart, and hence database can deactivate successfully.
2. Because this issue only happens when the database is deactivating, we can deliberately keep the database active by using command "db2 activate db <dbname>"
3. Change the log archive method to DISK instead of vendor program.
Note, starting in Db2 Version 11.1.3.3, the archival of log files (LOGARCHMETH1/2) using VENDOR or TSM methods can now be configured with a timeout on Unix environments. If the transmission of log data between Db2 and the TSM or vendor API is unresponsive for the specified timeout period, then Db2 will interrupt the archive log attempt and follow normal log archive failure protocol. For details see "--VENDOR_ARCHIVE_TIMEOUT" in the Configuration parameters for database logging Knowledge Center article.
Was this topic helpful?
Document Information
Modified date:
16 June 2018
UID
swg21690867