Technical Blog Post
Abstract
Db2 pureScale: Unable to enter CM maintenance mode
Body
Here is a demonstration of the problem.
db2inst5@machine007 ~ $ export DB2INSTANCE=db2inst5
db2inst5@machine007 ~ $ sudo -E /opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -enter -maintenance -all
Unable to enter maintenance mode for domain 'db2domain_20180214154000' because host 'machine008' is already in maintenance mode. Issue the command 'db2cluster -cm -exit -maintenance' on this host to remove it from maintenance mode and then re-issue this command. A diagnostic log has been saved to '/tmp/ibm.db2.cluster.KUa4LB'.
Here are the contents of /tmp/ibm.db2.cluster.KUa4LB, which give no clue to the cause of the issue:
2019-07-12-11.57.37.593086+600 I1E1841 LEVEL: Event
PID : 5577 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, RAS/PD component, pdLogInternal, probe:120
START : New Diagnostic Log file
DATA #1 : Build Level, 248 bytes
Instance "db2inst5" uses "64" bits and DB2 code release "SQL11013"
with level identifier "0204010F".
Informational tokens are "DB2 v11.1.3.3", "special_37607", "DYN1803021700AMD64_37607", Fix Pack "3".
DATA #2 : System Info, 664 bytes
System: Linux machine007 10 3 x86_64
CPU: total:24 online:24 Cores per socket:6 Threading degree per core:2 SIMD:Y CPU Features:SSSE3,SSE4,AVX2
CPU binding: not in use
Physical Memory(MB): total:128648 free:116712
Virtual Memory(MB): total:132744 free:120808
Swap Memory(MB): total:4096 free:4096
Kernel Params: msgMaxMessageSize:65536 msgMsgMap:65536 msgMaxQueueIDs:129024
msgNumberOfHeaders:65536 msgMaxQueueSize:65536
msgMaxSegmentSize:16 shmMax:9223372036854775807 shmMin:1
shmIDs:32256 shmSegments:32256 semMap:256000 semIDs:32256
semNum:256000 semUndo:256000 semNumPerID:250 semOps:32
semUndoSize:20 semMaxVal:32767 semAdjustOnExit:32767
Cur cpu time limit (seconds) = 0xFFFFFFFFFFFFFFFF
Cur file size limit (bytes) = 0xFFFFFFFFFFFFFFFF
Cur data size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur stack size (bytes) = 0x0000000000800000
Cur core size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur memory size (bytes) = 0xFFFFFFFFFFFFFFFF
nofiles (descriptors) = 0x000000000000FFFE
Cur maximum number of processes = 0x000000000000FFFE
Information in this record is only valid at the time when this file was
created (see this record's time stamp)
2019-07-12-11.57.37.594761+600 I1843E393 LEVEL: Info
PID : 5577 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIParseCmdArgs, probe:3335
DATA #1 : String, 67 bytes
/opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -enter -maintenance -all
2019-07-12-11.57.37.608460+600 E2237E405 LEVEL: Error
PID : 5577 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIMain, probe:1527
MESSAGE : ECF=0x9000062F=-1879046609=ECF_SQLHA_ALREADY_IN_MAINT
Cluster is already in maintenance mode.
Likewise, an attempt to exit CM maintenance mode fails with a message saying that the host is currently not in maintenance mode.
db2inst5@machine007 ~ $ export DB2INSTANCE=db2inst5
db2inst5@machine007 ~ $ sudo -E /opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -exit -maintenance
Host 'machine007' is currently not in maintenance mode.
A diagnostic log has been saved to '/tmp/ibm.db2.cluster.tlXcll'.
Here are the contents of /tmp/ibm.db2.cluster.tlXcll:
2019-07-12-11.58.19.292649+600 I1E1841 LEVEL: Event
PID : 6800 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, RAS/PD component, pdLogInternal, probe:120
START : New Diagnostic Log file
DATA #1 : Build Level, 248 bytes
Instance "db2inst5" uses "64" bits and DB2 code release "SQL11013"
with level identifier "0204010F".
Informational tokens are "DB2 v11.1.3.3", "special_37607", "DYN1803021700AMD64_37607", Fix Pack "3".
DATA #2 : System Info, 664 bytes
System: Linux machine007 10 3 x86_64
CPU: total:24 online:24 Cores per socket:6 Threading degree per core:2 SIMD:Y CPU Features:SSSE3,SSE4,AVX2
CPU binding: not in use
Physical Memory(MB): total:128648 free:116675
Virtual Memory(MB): total:132744 free:120771
Swap Memory(MB): total:4096 free:4096
Kernel Params: msgMaxMessageSize:65536 msgMsgMap:65536 msgMaxQueueIDs:129024
msgNumberOfHeaders:65536 msgMaxQueueSize:65536
msgMaxSegmentSize:16 shmMax:9223372036854775807 shmMin:1
shmIDs:32256 shmSegments:32256 semMap:256000 semIDs:32256
semNum:256000 semUndo:256000 semNumPerID:250 semOps:32
semUndoSize:20 semMaxVal:32767 semAdjustOnExit:32767
Cur cpu time limit (seconds) = 0xFFFFFFFFFFFFFFFF
Cur file size limit (bytes) = 0xFFFFFFFFFFFFFFFF
Cur data size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur stack size (bytes) = 0x0000000000800000
Cur core size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur memory size (bytes) = 0xFFFFFFFFFFFFFFFF
nofiles (descriptors) = 0x000000000000FFFE
Cur maximum number of processes = 0x000000000000FFFE
Information in this record is only valid at the time when this file was
created (see this record's time stamp)
2019-07-12-11.58.19.294314+600 I1843E387 LEVEL: Info
PID : 6800 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIParseCmdArgs, probe:3335
DATA #1 : String, 61 bytes
/opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -exit -maintenance
2019-07-12-11.58.19.304998+600 I2231E2010 LEVEL: Error
PID : 6800 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIExitMaintenanceModeForCM, probe:4831
MESSAGE : ECF=0x90000614=-1879046636=ECF_SQLHA_CLUSTER_NOT_IN_MAINTENANCE_MODE
The specified cluster on local node is not in maintenance mode.
DATA #1 : String, 78 bytes
An active domain was found on this host. This host is not in maintenance mode.
DATA #2 : Codepath, 8 bytes
2:5
DATA #3 : signed integer, 4 bytes
0
DATA #4 : signed integer, 4 bytes
0
DATA #5 : Boolean, 1 bytes
false
DATA #6 : Boolean, 1 bytes
false
DATA #7 : Boolean, 1 bytes
true
DATA #8 : Boolean, 1 bytes
false
DATA #9 : String, 39 bytes
/var/db2/.db2_maintenance_machine007
DATA #10: String, 13 bytes
machine007
DATA #11: String, 8 bytes
db2inst5
DATA #12: String, 24 bytes
db2domain_20180214154000
DATA #13: String, 0 bytes
Object not dumped: Address: 0x00007FFD51489A10 Size: 0 Reason: Zero-length data
DATA #14: String, 0 bytes
Object not dumped: Address: 0x00007FFD51488CB0 Size: 0 Reason: Zero-length data
DATA #15: String, 26 bytes
/home/db2/db2inst5/sqllib/
DATA #16: String, 0 bytes
Object not dumped: Address: 0x00007FFD51489200 Size: 0 Reason: Zero-length data
DATA #17: signed integer, 4 bytes
0
DATA #18: String, 0 bytes
Object not dumped: Address: 0x00007FFD51486A04 Size: 0 Reason: Zero-length data
DATA #19: String, 18 bytes
/home/db2/db2inst5
DATA #20: Boolean, 1 bytes
false
DATA #21: unsigned integer, 8 bytes
0
DATA #22: Boolean, 1 bytes
false
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x00000000004EFD51 _Z31sqlhaUIExitMaintenanceModeForCMP9sqlhaUICB + 0x2461
[1] 0x000000000042FE7B main + 0x3C9B
[2] 0x00007F9D1CF14C05 __libc_start_main + 0xF5
[3] 0x000000000042C119 /opt/IBM/DB2/V11.1.3.s/bin/db2cluster + 0x2C119
2019-07-12-11.58.19.306048+600 E4242E444 LEVEL: Error
PID : 6800 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5 NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIMain, probe:1556
MESSAGE : ECF=0x90000614=-1879046636=ECF_SQLHA_CLUSTER_NOT_IN_MAINTENANCE_MODE
The specified cluster on local node is not in maintenance mode.
The error returned by the enter command ("Unable to enter maintenance mode for domain .... is already in maintenance mode") indicates that the failure to enter maintenance is due to the quorum vote being disabled on some of the nodes.
When exiting maintenance, Db2 checks whether the node is online in the domain. Because the node is online here, the exit maintenance command returns a message indicating that the host is not in maintenance mode.
The RSCT command lsrsrc IBM.PeerNode shows which hosts have the quorum bit unset (this data is also captured in db2support.zip_unpack/ctsnap_out/lsrsrc/lsrsrc_IBM.PeerNode.out):
Resource Persistent and Dynamic Attributes for IBM.PeerNode
Name NodeList RSCTVersion ClassVersions CritRsrcProtMethod IsQuorumNode IsPreferredGSGL NodeUUID HostName TBPriority ActivePeerDomain NodeNameList OpState ConfigChanged CritRsrcActive OpUsabilityState
"machine006" {2} "3.2.3.1" {} 0 0 1 "" "machine006.xxx.xxx.xxxx.xxx" 0 "db2domain_20180214154000" {"machine006.xxx.xxx.xxxx.xxx"} 1 0 0 1
"machine007" {5} "3.2.3.1" {} 0 0 1 "" "machine007.xxx.xxx.xxxx.xxx" 0 "db2domain_20180214154000" {"machine007.xxx.xxx.xxxx.xxx"} 1 0 0 1
"machine008" {4} "3.2.3.1" {} 0 0 1 "" "machine008.xxx.xxx.xxxx.xxx" 0 "db2domain_20180214154000" {"machine008.xxx.xxx.xxxx.xxx"} 1 0 0 1
"machine009" {1} "3.2.3.1" {} 0 0 1 "" "machine009.xxx.xxx.xxxx.xxx" 0 "db2domain_20180214154000" {"machine009.xxx.xxx.xxxx.xxx"} 1 0 0 1
"machine005" {6} "3.2.3.1" {} 0 1 1 "" "machine005.xxx.xxx.xxxx.xxx" 0 "db2domain_20180214154000" {"machine005.xxx.xxx.xxxx.xxx"} 1 0 0 1
We can see that nodes machine006, machine007, machine008 and machine009 have this bit unset (IsQuorumNode is 0); only machine005 has it set.
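Reading the IsQuorumNode column by eye is error-prone on a wide table, so a small filter can help. The following is a minimal sketch, not part of Db2 or RSCT: the function name unset_quorum_nodes is hypothetical, and it assumes the whitespace-separated table layout shown above, with no spaces inside any column value.

```shell
# Hypothetical helper: print the Name of every node whose IsQuorumNode is 0.
# Reads table-style lsrsrc IBM.PeerNode output (as shown above) on stdin.
unset_quorum_nodes() {
  awk '
    # Header row: remember which column holds IsQuorumNode.
    $1 == "Name" {
      for (i = 1; i <= NF; i++) if ($i == "IsQuorumNode") col = i
      next
    }
    # Data rows start with a quoted node name.
    col && $1 ~ /^"/ && $col == "0" {
      gsub(/"/, "", $1)
      print $1
    }
  '
}

# Example use against the file collected by db2support:
#   unset_quorum_nodes < db2support.zip_unpack/ctsnap_out/lsrsrc/lsrsrc_IBM.PeerNode.out
```

For the table above, this would be expected to list machine006, machine007, machine008 and machine009, one per line.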
You can set this bit via the following command (for each of the nodes that have this unset):
CT_MANAGEMENT_SCOPE=2 chrsrc -s "Name=='<NodeName>'" IBM.PeerNode IsQuorumNode=1
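When several hosts are affected, it can be convenient to generate one chrsrc command per host and review them before running anything. The sketch below only prints the commands; the function name quorum_fix_commands is hypothetical, and the host names are the ones from this example.

```shell
# Hypothetical helper: print (do not run) one chrsrc command per node name.
# The command text is the one given above, with <NodeName> filled in.
quorum_fix_commands() {
  for node in "$@"; do
    printf "CT_MANAGEMENT_SCOPE=2 chrsrc -s \"Name=='%s'\" IBM.PeerNode IsQuorumNode=1\n" "$node"
  done
}

# Print the commands for the four hosts that have the bit unset in this example.
quorum_fix_commands machine006 machine007 machine008 machine009
```

After reviewing the printed lines, they can be run one at a time, or piped to a shell if you are comfortable doing so.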
Once you have done this for each affected host, you can verify that the quorum bit is set on all hosts via the following command:
CT_MANAGEMENT_SCOPE=2 lsrsrc IBM.PeerNode Name IsQuorumNode
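The verification can also be scripted so that it fails loudly if any host was missed. This is a rough sketch under an assumption: that the command above prints each attribute on its own line in the form `Name = "<host>"` and `IsQuorumNode = <value>`. The function name all_quorum_bits_set is hypothetical; adjust the parsing if your output looks different.

```shell
# Hypothetical helper: read "attribute = value" style lsrsrc output on stdin.
# Prints the name of every node whose IsQuorumNode is not 1 and returns
# non-zero if any such node exists or if no IsQuorumNode line was seen.
all_quorum_bits_set() {
  awk '
    $1 == "Name" && $2 == "=" { gsub(/"/, "", $3); name = $3 }
    $1 == "IsQuorumNode" && $2 == "=" {
      seen++
      if ($3 != "1") { print name; bad++ }
    }
    END { exit (bad > 0 || seen == 0) }
  '
}

# Example use:
#   CT_MANAGEMENT_SCOPE=2 lsrsrc IBM.PeerNode Name IsQuorumNode | all_quorum_bits_set \
#     && echo "quorum bit set on all hosts"
```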
The IsQuorumNode attribute is set to 1 at the time the cluster is created.
This value is updated by DB2 in only two cases: when a node enters maintenance (IsQuorumNode=0) or when it exits maintenance (IsQuorumNode=1).
However, if the user exits CM maintenance via startrpnode/startrpdomain, this bit remains unset.
This situation can therefore arise when any host in the cluster enters CM maintenance via db2cluster but exits maintenance via startrpnode/startrpdomain.
Thanks,
Shashank Kharche
IBM DB2 LUW Lab
UID
ibm13285789