Technical Blog Post
Abstract
75 ways to demystify DB2: #26: Tech tip: In a DB2 pureScale environment, all members are showing UNKNOWN state
Body
Problem:
In a DB2 pureScale environment, all members are showing the UNKNOWN state.
Symptoms:
db2instance -list
ID   TYPE    STATE    HOME_HOST      CURRENT_HOST   ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
--   ----    -----    ---------      ------------   -----  ----------------  ------------  -------
0    MEMBER  UNKNOWN  purescalem020  purescalem020  NO     0                 0             purescalem020-ib0
1    MEMBER  UNKNOWN  purescalem021  purescalem021  NO     0                 0             purescalem021-ib0
2    MEMBER  UNKNOWN  purescalem022  purescalem022  NO     0                 0             purescalem022-ib0
3    MEMBER  UNKNOWN  purescalem023  purescalem023  NO     0                 0             purescalem023-ib0
128  CF      PEER     purescacf001   purescacf001   NO     -                 0             purescacf001-ib0,purescacf001-ib1
129  CF      PRIMARY  purescacf002   purescacf002   NO     -                 0             purescacf002-ib0,purescacf002-ib1
HOSTNAME       STATE   INSTANCE_STOPPED  ALERT
--------       -----   ----------------  -----
purescacf002   ACTIVE  NO                NO
purescacf001   ACTIVE  NO                NO
purescalem023  ACTIVE  NO                NO
purescalem022  ACTIVE  UNKNOWN           YES
purescalem021  ACTIVE  NO                NO
purescalem020  ACTIVE  NO                NO
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
UNKNOWN is a transient state; re-issue the command after a few seconds. However, if the UNKNOWN state continues to re-appear, check the db2diag.log for failure messages pertaining to the member or CF in this state.
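As a starting point, you can scan the diagnostic log from the instance owner's shell. This is a minimal sketch that assumes the default diagnostic path ~/sqllib/db2dump/; adjust it if the DIAGPATH configuration parameter has been changed:

# Show severe/error records from the last hour
db2diag -level Severe,Error -H 1h
# Or search the raw log directly
grep -i "unknown" ~/sqllib/db2dump/db2diag.log | tail -20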
When running 'db2cluster -cm -list -alert', we get:
NON-FATAL ASSERTION FAILED!!!
Time = Fri Apr 24 13:29:40 2015
Expr = Invalid block eye-catcher (0x1) found at:
PID = 32768000
TID = 1
File = sqlhaInterface.C
Line = 21755
Stacktrace =
_ossMemAllocCheck (+0x730)
_ossMemFree (+0x278)
_ossMemFree@glue8B (+0x50)
sqlhaDeleteClusterObjectListStructure__FP25SQLHA_CLUSTER_OBJECT_LIST (+0x200)
sqlhaSysIBMGetAlerts__FPP17sqlerDB2AlertDataPUiP5sqlcaP19SQLHA_CONTROL_BLOCK (+0x330)
rocmDb2clusterActionDisplayAlerts__FPcCiUi (+0x1A4)
rocmDoDb2clusterEngineAction__FP11ROCM_ACTION (+0x1C8)
main (+0x24A4)
__start (+0x70)
?unknown
?unknown
?unknown
?unknown
?unknown
?unknown
?unknown
NON-FATAL ASSERTION FAILED!!!
Time = Fri Apr 24 13:29:40 2015
Expr = Invalid pad type (0x10570750) found at:
PID = 32768000
TID = 1
File = sqlhaInterface.C
Line = 21755
Stacktrace =
_ossMemAllocCheck (+0x6CC)
_ossMemFree (+0x278)
_ossMemFree@glue8B (+0x50)
sqlhaDeleteClusterObjectListStructure__FP25SQLHA_CLUSTER_OBJECT_LIST (+0x200)
sqlhaSysIBMGetAlerts__FPP17sqlerDB2AlertDataPUiP5sqlcaP19SQLHA_CONTROL_BLOCK (+0x330)
rocmDb2clusterActionDisplayAlerts__FPcCiUi (+0x1A4)
rocmDoDb2clusterEngineAction__FP11ROCM_ACTION (+0x1C8)
main (+0x24A4)
__start (+0x70)
?unknown
?unknown
?unknown
?unknown
?unknown
?unknown
?unknown
Cluster alerts could not be retrieved from the cluster manager. See the db2diag.log for more information.
In addition, if you are unable to log in to the problematic server remotely and the OS is reporting memory-related issues, then the host has run into high memory utilization, and that is what drove its resources into the UNKNOWN state.
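If any session on the host is still responsive, a few standard OS commands can confirm the memory pressure. A minimal sketch, assuming AIX (which the errpt reference below implies); use vmstat/free equivalents on Linux:

lsps -a     # paging-space utilization
vmstat 5 3  # free frames and paging activity
svmon -G    # global memory usage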
Solution:
If we can confirm the following:
mmgetstate -a
 Node number  Node name      GPFS state
------------------------------------------
       1      purescacf001   active
       2      purescacf002   active
       3      purescalem020  active
       4      purescalem021  active
       5      purescalem022  unknown
       6      purescalem023  active
This confirms that the GPFS resource state itself is unknown. The solution is to redirect traffic to the remaining members, if you can, and perform a hard reboot of the problematic member (a sketch follows below).
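If a console session is still responsive, the reboot can be forced from there; otherwise use the hardware console (for example, the HMC for an AIX LPAR). A minimal sketch, assuming AIX:

shutdown -Fr   # fast, immediate reboot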
In the meantime, for further verification, we can check the AIX error report ('errpt -a') or the GPFS logs located under /var/adm/ras/ (i.e., mmfs.log.latest and mmfs.log.previous), as shown below.
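For example, on the problematic host once it is reachable (the log paths are the GPFS defaults quoted above):

errpt -a | more
tail -100 /var/adm/ras/mmfs.log.latest
tail -100 /var/adm/ras/mmfs.log.previous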
Once the member restarts successfully, we should have the following:
db2instance -list
ID   TYPE    STATE    HOME_HOST      CURRENT_HOST   ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
--   ----    -----    ---------      ------------   -----  ----------------  ------------  -------
0    MEMBER  STARTED  purescalem020  purescalem020  NO     0                 0             purescalem020-ib0
1    MEMBER  STARTED  purescalem021  purescalem021  NO     0                 0             purescalem021-ib0
2    MEMBER  STARTED  purescalem022  purescalem022  NO     0                 0             purescalem022-ib0
3    MEMBER  STARTED  purescalem023  purescalem023  NO     0                 0             purescalem023-ib0
128  CF      PEER     purescacf001   purescacf001   NO     -                 0             purescacf001-ib0,purescacf001-ib1
129  CF      PRIMARY  purescacf002   purescacf002   NO     -                 0             purescacf002-ib0,purescacf002-ib1
HOSTNAME       STATE   INSTANCE_STOPPED  ALERT
--------       -----   ----------------  -----
purescacf002   ACTIVE  NO                NO
purescacf001   ACTIVE  NO                NO
purescalem023  ACTIVE  NO                NO
purescalem022  ACTIVE  NO                NO
purescalem021  ACTIVE  NO                NO
purescalem020  ACTIVE  NO                NO
mmgetstate -a
 Node number  Node name      GPFS state
------------------------------------------
       1      purescacf001   active
       2      purescacf002   active
       3      purescalem020  active
       4      purescalem021  active
       5      purescalem022  active
       6      purescalem023  active
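At this point, the alert listing should also run cleanly again (no assertion failures) and report no outstanding alerts:

db2cluster -cm -list -alert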
If the reboot doesn't help, then please collect the following and engage DB2 Support (a sample collection sketch follows this list):
1. db2support output (with the -s and -f options) from the problematic member, along with 'db2instance -list' output
2. GPFS logs
3. getsdata -a output
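A minimal collection sketch, run as the instance owner on the problematic member; the output directory /tmp/collect and database name SAMPLE are placeholders to substitute with your own:

db2support /tmp/collect -d SAMPLE -s -f
db2instance -list > /tmp/collect/db2instance_list.out
getsdata -a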