Technical Blog Post
Abstract
What do these db2diag.log messages "mc_dispatch() call came back with error" mean?
Body
What do the following repetitive messages (in db2diag.log) mean? 2015-12-15-23.25.30.209153-300 I101318E684 LEVEL: Info PID : 9697 TID : 140736926312192 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 33 EDUNAME: db2ard 1 FUNCTION: DB2 UDB, base sys utilities, sqeARChainBuilder::processChain, probe:900 DATA #1 : String, 40 bytes Autorestart thread chain processing ends DATA #2 : signed integer, 4 bytes 0 DATA #3 : Boolean, 1 bytes false DATA #4 : Boolean, 1 bytes false DATA #5 : Boolean, 1 bytes false DATA #6 : unsigned integer, 2 bytes 32 DATA #7 : Boolean, 1 bytes true DATA #8 : Boolean, 1 bytes true DATA #9 : unsigned integer, 4 bytes 2015-12-15-23.28.41.217662-300 I102003E1343 LEVEL: Severe PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3633 MESSAGE : ZRC=0x827300AA=-2106392406=HA_ZRC_FAILED "SQLHA API call error" DATA #1 : String, 39 bytes mc_dispatch() call came back with error DATA #2 : rsct error, PD_TYPE_SQLHA_RSCT_ERROR_RETURN, 4 bytes 0x00000003 DATA #3 : String, 66 bytes 2610-603 The session with the RMC subsystem has been interrupted. DATA #4 : String, 27 bytes Instancename, Member Number DATA #5 : String, 8 bytes db2inst1 DATA #6 : signed integer, 2 bytes 1 CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol) [0] 0x00007FFFC69400E7 _Z51rocmMemberHCAMonitorStartSessionRegisterAndDispatchP16ROCM_HCA_MONITOR + 0x4C7 [1] 0x00007FFFC693F9E9 rocmMemberHCAMonitor + 0x39 [2] 0x000000000041EADB _ZN26sqeMemberAdapterMonitorEdu6RunEDUEv + 0x10B [3] 0x00007FFFF1590A27 _ZN9sqzEDUObj9EDUDriverEv + 0xF7 [4] 0x00007FFFF0D5C3A3 sqloEDUEntry + 0x303 [5] 0x000000373AC07A51 /lib64/libpthread.so.0 + 0x7A51 [6] 0x000000373A8E893D clone + 0x6D 2015-12-15-23.28.41.219398-300 I103347E1176 LEVEL: Severe PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3633 MESSAGE : ZRC=0x827300AA=-2106392406=HA_ZRC_FAILED "SQLHA API call error" DATA #1 : String, 39 bytes mc_dispatch() call came back with error DATA #2 : String, 27 bytes Instancename, Member Number DATA #3 : String, 8 bytes db2inst1 DATA #4 : signed integer, 2 bytes 1 CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol) [0] 0x00007FFFC693FDD8 _Z51rocmMemberHCAMonitorStartSessionRegisterAndDispatchP16ROCM_HCA_MONITOR + 0x1B8 [1] 0x00007FFFC693F9E9 rocmMemberHCAMonitor + 0x39 [2] 0x000000000041EADB _ZN26sqeMemberAdapterMonitorEdu6RunEDUEv + 0x10B [3] 0x00007FFFF1590A27 _ZN9sqzEDUObj9EDUDriverEv + 0xF7 [4] 0x00007FFFF0D5C3A3 sqloEDUEntry + 0x303 [5] 0x000000373AC07A51 /lib64/libpthread.so.0 + 0x7A51 [6] 0x000000373A8E893D clone + 0x6D 2015-12-15-23.28.41.220944-300 I104524E445 LEVEL: Event PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3693 DATA #1 : rsct error, PD_TYPE_SQLHA_RSCT_ERROR_RETURN, 4 bytes 0x00000000 2015-12-15-23.28.42.221493-300 I104970E1333 LEVEL: Severe PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3608 MESSAGE : ZRC=0x827300AA=-2106392406=HA_ZRC_FAILED "SQLHA API call error" DATA #1 : String, 27 bytes Error starting RSCT session DATA #2 : rsct error, PD_TYPE_SQLHA_RSCT_ERROR_RETURN, 4 bytes 0x00000002 DATA #3 : String, 68 bytes 2610-602 A session could not be established with the RMC subsystem. DATA #4 : String, 27 bytes Instancename, Member Number DATA #5 : String, 8 bytes db2inst1 DATA #6 : signed integer, 2 bytes 1 CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol) [0] 0x00007FFFC69400E7 _Z51rocmMemberHCAMonitorStartSessionRegisterAndDispatchP16ROCM_HCA_MONITOR + 0x4C7 [1] 0x00007FFFC693F9E9 rocmMemberHCAMonitor + 0x39 [2] 0x000000000041EADB _ZN26sqeMemberAdapterMonitorEdu6RunEDUEv + 0x10B [3] 0x00007FFFF1590A27 _ZN9sqzEDUObj9EDUDriverEv + 0xF7 [4] 0x00007FFFF0D5C3A3 sqloEDUEntry + 0x303 [5] 0x000000373AC07A51 /lib64/libpthread.so.0 + 0x7A51 [6] 0x000000373A8E893D clone + 0x6D Carried on for a few minutes then : 2015-12-15-23.28.47.239430-300 I121029E1164 LEVEL: Severe PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3608 MESSAGE : ZRC=0x827300AA=-2106392406=HA_ZRC_FAILED "SQLHA API call error" DATA #1 : String, 27 bytes Error starting RSCT session DATA #2 : String, 27 bytes Instancename, Member Number DATA #3 : String, 8 bytes db2inst1 DATA #4 : signed integer, 2 bytes 1 CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol) [0] 0x00007FFFC693FDD8 _Z51rocmMemberHCAMonitorStartSessionRegisterAndDispatchP16ROCM_HCA_MONITOR + 0x1B8 [1] 0x00007FFFC693F9E9 rocmMemberHCAMonitor + 0x39 [2] 0x000000000041EADB _ZN26sqeMemberAdapterMonitorEdu6RunEDUEv + 0x10B [3] 0x00007FFFF1590A27 _ZN9sqzEDUObj9EDUDriverEv + 0xF7 [4] 0x00007FFFF0D5C3A3 sqloEDUEntry + 0x303 [5] 0x000000373AC07A51 /lib64/libpthread.so.0 + 0x7A51 [6] 0x000000373A8E893D clone + 0x6D 2015-12-15-23.28.47.240659-300 I122194E445 LEVEL: Event PID : 9697 TID : 140736968255232 PROC : db2sysc 1 INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX EDUID : 23 EDUNAME: db2clstrRscMon 1 FUNCTION: DB2 UDB, high avail services, rocmMemberHCAMonitorStartSessionRegisterAndDispatch, probe:3693 DATA #1 : rsct error, PD_TYPE_SQLHA_RSCT_ERROR_RETURN, 4 bytes 0x00000000 2015-12-15-23.29.15.594177-300 I122640E532 LEVEL: Event PID : 35180 TID : 46912839558752 PROC : db2rocme 1 [db2inst1] INSTANCE: db2inst1 NODE : 001 HOSTNAME: XXX.XXX.XXX FUNCTION: DB2 UDB, high avail services, rocmGenerateEvent, probe:888 MESSAGE : Completed fast member departure. FMD valid flag and start/end times as follows. DATA #1 : Boolean, 1 bytes true DATA #2 : String, 26 bytes 12/15/2015 23:29:15.164410 DATA #3 : String, 26 bytes 12/15/2015 23:29:15.593999
The "mc_*" calls are used to establish RSCT monitoring of the HCAs on the server (Highspeed Communincation Adapter ports). The code calling rocmMemberHCAMonitorStartSessionRegisterAndDispatch, rocmMemberHCAMonitor, states that establishing this RSCT monitoring can encounter timing issues wrt the underlying RSCT infrastructure. This is a transient state. Due to this transient state, DB2 has built in retries in order to ensure the monitoring is established. DB2 will retry up to 30 times to establish the HCA monitoring. With each iteration that does fail, logging will occur as seen in above db2diag.log snip.
Later in db2diag.log we see that DB2 successfully registered this HCA monitoring on the eighth attempt.
DB2 is working as designed for this transient RSCT state.
UID
ibm11140472