Technical Blog Post
Abstract
DB2 pureScale: SEVERE messages logged in db2diag.log when the CF cannot access a page in time during a Write and Register command
Body
Several DB2 pureScale customers have recently observed SEVERE messages logged in db2diag.log with no associated operational problems in the database.
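Records like these can be pulled out of db2diag.log with a simple text search. The sketch below is illustrative only; the helper function and log path are hypothetical, not DB2 tooling:

```shell
# Sketch: list db2diag.log records coming from the CF write-and-register path.
# Function name and path are illustrative, not part of any DB2 utility.
find_sal_severe() {
  grep -n 'SAL_WriteAndRegister' "$1"
}

# Example (hypothetical diagnostic path):
# find_sal_severe /home/db2inst1/sqllib/db2dump/db2diag.log
```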
The SEVERE messages logged in db2diag.log look as follows:
2017-07-10-11.40.16.000086+540 I202891A5897 LEVEL: Severe
PID : 23068794 TID : 433823 PROC : db2sysc 1
INSTANCE: db2inst1 NODE : 001 DB : inst1
APPHDL : 1-6610 APPID: *N1.db2inst1.170710024007
AUTHID : DB2inst1 HOSTNAME: MEMBER1
EDUID : 433823 EDUNAME: db2agent (inst1) 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF,
SAL_GBP_HANDLE::SAL_WriteAndRegister, probe:24519
MESSAGE : CA RC= 2148073503
DATA #1 : SAL CF Index, PD_TYPE_SAL_CF_INDEX, 8 bytes
1
DATA #2 : SAL CF Node Number, PD_TYPE_SAL_CF_NODE_NUM, 2 bytes
128
DATA #3 : Page Name, PD_TYPE_SD_PAGE_NAME, 16 bytes
{pgNum:poolId:objId:oType} = {81798:3:60:1} (decimal)
DATA #4 : Page header, PD_TYPE_SQLB_PAGE_HEAD, 48 bytes
pageHead: {tbspSeed:3;obj:60;type:1} bphPPNum:81798 bphOPNum:81798
begoff: 48
datlen: 16336
revnum: 1280
pagelsn: 0000000759A577F8
flag: 2a
future32: 0
cbits1to31: 2852e
cbits32to63: 0
DATA #5 : Page Address, PD_TYPE_SD_PAGE_ADDR, 8 bytes
0x0a000303d5b14000
DATA #6 : Page Size, PD_TYPE_SD_PAGE_SIZE, 8 bytes
16384
DATA #7 : Page Size, PD_TYPE_SD_PAGE_SIZE, 8 bytes
0
DATA #8 : XI Index, PD_TYPE_SD_XI_INDEX, 4 bytes
2060517
DATA #9 : Castout Class, PD_TYPE_SD_CASTOUT_CLASS, 2 bytes
921
DATA #10: CF Process ID, PD_TYPE_SD_CASTOUT_PROCESS_ID, 8 bytes
0
DATA #11: SAL CF encoded state, PD_TYPE_SAL_ENCODED_STATE, 8 bytes
0x0002000200020004
DATA #12: Castout Count, PD_TYPE_SD_CASTOUT_COUNT, 4 bytes
0
DATA #13: Directory Change Count, PD_TYPE_SAL_DIRCHANGE_COUNT, 4 bytes
0
DATA #14: Max LSN, PD_TYPE_SD_MAXLSN, 8 bytes
31568787485
DATA #15: LsnNat, PD_TYPE_SD_LSNNAT, 8 bytes
31568787447
DATA #16: LsnNat, PD_TYPE_SD_LSNNAT, 8 bytes
31568787448
DATA #17: Codepath, 8 bytes
9:12:15:19:28:30:31:32:33
DATA #18: SAL_GBP_HANDLE, PD_TYPE_SAL_GBP_HANDLE, 9200 bytes
m_caStructName: db2.db2inst1.inst1.gbp
m_caKey:
m_caKey: Name=db2.db2inst1.inst1.gbp_caKey sample=0x2000200020004
m_primaryWaiters (sample): 0
m_waitForPrimaryDisabled: 0
History:
Last 2 out of 2 changes
- 0x2000200020004 at 2017-07-10-00.26.01
- 0x1000100000000 at 2017-07-10-00.26.01
m_connectKey:
m_caKey: Name=db2.db2inst1.inst1.gbp_connectKey sample=0x2000200020004
History:
Last 2 out of 2 changes
- 0x2000200020004 at 2017-07-10-00.26.01
- 0x1000100000000 at 2017-07-10-00.26.01
m_disconnectKey:
m_caKey: Name=db2.db2inst1.inst1.gbp_disconnectKey
sample=0x2000200020004
History:
Last 3 out of 3 changes
- 0x2000200020004 at 2017-07-10-00.26.01
- 0x2000100020000 at 2017-07-10-00.26.01
- 0x1000100000000 at 2017-07-10-00.26.01
m_reconstructKey:
m_caKey: Name=db2.db2inst1.inst1.gbp_reconstructKey
sample=0x2000200020004
History:
Last 2 out of 2 changes
- 0x2000200020004 at 2017-07-10-00.26.01
- 0x1000100000000 at 2017-07-10-00.26.01
m_catchupPercentComplete: 100 %
m_catchupTimePassed(secs): 0
m_pCatchupLotch: 0x0a000301cb7eba80: 00080001000000000000000076
SQLP_VALLOCK (SALCATCHUP: caStructType=1) (held: F)
m_pWriteProtectLotch: 0x0a000301cb7ebc40: 000B0001000000000000000076
SQLP_VALLOCK (SALWRITEPRITECT: caStructType=1) (held: F)
m_pSimplexLotch: 0x0a000301cb7ebe00: 00090001000000000000000076
SQLP_VALLOCK (SALSIMPLEX: caStructType=1) (held: F)
m_pauseDurationSoAPIretryIsntSuperFastMilliSec: 500
m_catchupRetryStartTimestamp: 2017-07-10-00.26.01
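One sanity check worth doing on a dump like this: the pagelsn in the page header (DATA #4, hex) should line up with the LSN values reported in decimal further down. Converting it with printf shows it equals the second LsnNat value (DATA #16):

```shell
# Convert the hex pagelsn from DATA #4 to decimal.
printf '%d\n' 0x0000000759A577F8
# prints 31568787448, the same value as DATA #16 (LsnNat)
```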
These SEVERE messages report "CA RC= 2148073503". Check what that return code means:
$ db2diag -cfrc 2148073503
Input ZRC string '2148073503' parsed as 0x8009001F (-2146893793).
Attempting to lookup value 0x8009001F (-2146893793) as a ZRC
ZRC value to map: 0x97C9001F (-1748434913)
ZRC class :
Cluster caching facility transport common errors and warnings
(Class Index: 23)
Component:
CF_XPORT_COMMON ; cluster caching facility xport common
(Component Index: 201)
Reason Code:
31 (0x001F)
Identifer:
CF_XPORT_RECV_MRB_FAILURE
Identifer (without component):
SQLZ_RC_CF_XPORT_RECV_MRB_FAILURE
Description:
Receive MRB failure
Associated information:
Sqlcode -2051
SQL2051N There was a communication failure between a DB2 member and a
CF. Reason code = "3". CF identifier: "". Host name: "".
Number of sqlca tokens : 3
Diaglog message number: 1
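The first line of that output, where decimal 2148073503 is parsed as 0x8009001F, is just a base conversion, which you can confirm directly:

```shell
# 2148073503 in hex; the low two bytes (0x001F) are reason code 31.
printf '0x%08X\n' 2148073503
# prints 0x8009001F
```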
Checking SQL2051N reason code 3:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.messages.sql.doc/doc/msql02051n.html
So we know there was a communication failure between a DB2 member and a cluster caching facility (CF). SQL2051N is returned when the database manager detects problems that interfere with communication between a DB2 member and a CF.
The fact that customers saw no impact on their normal database operations makes this difficult to analyze.
Usually, such errors are seen when some type of timeout occurs during member-CF communication.
CF_XPORT_RECV_MRB_FAILURE is reported when operations against the CF are delayed and worker threads fail to receive the MRB (Message Response Block) within the retry limit.
It can also be caused by a uDAPL communication delay between a DB2 member and a CF, or by busy CF worker threads on the CF server (an insufficient number of CF worker threads for the requested workload).
Check the cfdiag logs to see whether they provide more detail on the problem.
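A quick way to correlate is to scan all cfdiag logs for DMA return codes around the failure timestamp. This is a sketch; the helper name and the diagnostic directory are hypothetical, not DB2 tooling:

```shell
# Sketch: summarize DMA_RC occurrences across the CF diagnostic logs.
# Usage: scan_cfdiag DIR, where DIR holds the cfdiag*.log files (hypothetical path below).
scan_cfdiag() {
  grep -h 'DMA_RC' "$1"/cfdiag*.log 2>/dev/null | sort | uniq -c
}

# Example:
# scan_cfdiag /home/db2inst1/sqllib/db2dump
```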
In this example case, we saw DMA_RC -2146762722 at the same time as the error message in db2diag.log:
2017-07-10-11.40.02.0208306000+540 E123456789A256 LEVEL : Error
PID : 4653300 TID : 14393
NODE : 000
HOSTNAME : CF1
FUNCTION : CA trace, log_error
MESSAGE : CA server has encountered an error.
DATA #1 :
DMA_RC: (-2146762722)
2017-07-10-11.40.16.0008748000+540 E123456789A372 LEVEL : Event
PID : 4653300 TID : 11309
NODE : 000
HOSTNAME : CF1
FUNCTION : CA svr_clstrmgr, conn_control
MESSAGE : Changed Connection Status : NOT ALLOWED
DATA #1 :
Node ID : 2
Instance ID : 0
EP : 0X44DD7B30
HCA ID : 3
Worker ID : 3
Client ID : 4
2017-07-10-11.40.16.0024466000+540 E123456789A320 LEVEL : Event
PID : 4653300 TID : 16192
HOSTNAME : CF1
FUNCTION : CA svr_clstrmgr, allowlist::reject_conn
MESSAGE : allowlist Table: Found node in list.
DATA #1 :
Sequence Number: 1256
Node Type : 1
Node ID : 2
Instance ID : 0
2017-07-10-11.40.16.0027289000+540 E123456789A374 LEVEL : Event
PID : 4653300 TID : 16192
HOSTNAME : CF1
FUNCTION : CA svr_clstrmgr, UWorkMgr::assign_client_to_worker
MESSAGE : Changed Connection Status : ALLOWED
DATA #1 :
Node ID : 2
Instance ID : 0
EP : 0X4470BFB0
HCA ID : 3
Worker ID : 5
Client ID : 5
This translates to an RDMA error: SQL2051N, a communication failure between a DB2 member and a CF, with reason code "1".
RC=1 means the error occurred or was detected in the User-Level Direct Access Programming Library (uDAPL).
As shown in the CF diagnostic log above, connections were NOT ALLOWED for a few milliseconds before the status changed back to ALLOWED.
So it looks like there is an intermittent issue with the RDMA switch (uDAPL communication).
Customers have so far not noticed any impact on DB2 operations because the problem exists only for a very short time (milliseconds), but if it persists longer you may notice delays, dropped connections, slow SQL processing, and so on in DB2. DB2 here is merely the victim of the underlying RDMA timeout errors.
These problems should be reported to AIX support to investigate the root cause of the uDAPL DMA timeout errors.
UID
ibm13286233