Db2 pureScale: Unable to enter CM maintenance mode

Technical Blog Post

Abstract

Body

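The following session shows the problem. An attempt to put all hosts into CM maintenance mode fails with a message saying that another host is already in maintenance mode.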

db2inst5@machine007 ~ $ export DB2INSTANCE=db2inst5
db2inst5@machine007 ~ $ sudo -E /opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -enter -maintenance -all
Unable to enter maintenance mode for domain 'db2domain_20180214154000' because host 'machine008' is already in maintenance mode. Issue the command 'db2cluster -cm -exit -maintenance' on this host to remove it from maintenance mode and then re-issue this command. A diagnostic log has been saved to '/tmp/ibm.db2.cluster.KUa4LB'.

Here are the contents of /tmp/ibm.db2.cluster.KUa4LB, which give no clue about the underlying cause:

2019-07-12-11.57.37.593086+600 I1E1841               LEVEL: Event
PID     : 5577                 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, RAS/PD component, pdLogInternal, probe:120
START   : New Diagnostic Log file
DATA #1 : Build Level, 248 bytes
Instance "db2inst5" uses "64" bits and DB2 code release "SQL11013"
with level identifier "0204010F".
Informational tokens are "DB2 v11.1.3.3", "special_37607", "DYN1803021700AMD64_37607", Fix Pack "3".
DATA #2 : System Info, 664 bytes
System: Linux machine007 10 3 x86_64
CPU: total:24 online:24 Cores per socket:6 Threading degree per core:2 SIMD:Y CPU Features:SSSE3,SSE4,AVX2
CPU binding: not in use
Physical Memory(MB): total:128648 free:116712
Virtual Memory(MB): total:132744 free:120808
Swap     Memory(MB): total:4096 free:4096
Kernel   Params: msgMaxMessageSize:65536 msgMsgMap:65536 msgMaxQueueIDs:129024
                 msgNumberOfHeaders:65536 msgMaxQueueSize:65536
                 msgMaxSegmentSize:16 shmMax:9223372036854775807 shmMin:1
                 shmIDs:32256 shmSegments:32256 semMap:256000 semIDs:32256
                 semNum:256000 semUndo:256000 semNumPerID:250 semOps:32
                 semUndoSize:20 semMaxVal:32767 semAdjustOnExit:32767
Cur cpu time limit (seconds) = 0xFFFFFFFFFFFFFFFF
Cur file size limit (bytes) = 0xFFFFFFFFFFFFFFFF
Cur data size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur stack size (bytes) = 0x0000000000800000
Cur core size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur memory size (bytes) = 0xFFFFFFFFFFFFFFFF
nofiles (descriptors) = 0x000000000000FFFE
Cur maximum number of processes = 0x000000000000FFFE

Information in this record is only valid at the time when this file was
created (see this record's time stamp)

2019-07-12-11.57.37.594761+600 I1843E393             LEVEL: Info
PID     : 5577                 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIParseCmdArgs, probe:3335
DATA #1 : String, 67 bytes
/opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -enter -maintenance -all

2019-07-12-11.57.37.608460+600 E2237E405             LEVEL: Error
PID     : 5577                 TID : 140696991897472 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIMain, probe:1527
MESSAGE : ECF=0x9000062F=-1879046609=ECF_SQLHA_ALREADY_IN_MAINT
          Cluster is already in maintenance mode.

Likewise, an attempt to exit CM maintenance mode fails with a message saying that the host is currently not in maintenance mode.

db2inst5@machine007 ~ $ export DB2INSTANCE=db2inst5
db2inst5@machine007 ~ $ sudo -E /opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -exit -maintenance
Host 'machine007' is currently not in maintenance mode.
A diagnostic log has been saved to '/tmp/ibm.db2.cluster.tlXcll'.

Here are the contents of /tmp/ibm.db2.cluster.tlXcll:

2019-07-12-11.58.19.292649+600 I1E1841               LEVEL: Event
PID     : 6800                 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, RAS/PD component, pdLogInternal, probe:120
START   : New Diagnostic Log file
DATA #1 : Build Level, 248 bytes
Instance "db2inst5" uses "64" bits and DB2 code release "SQL11013"
with level identifier "0204010F".
Informational tokens are "DB2 v11.1.3.3", "special_37607", "DYN1803021700AMD64_37607", Fix Pack "3".
DATA #2 : System Info, 664 bytes
System: Linux machine007 10 3 x86_64
CPU: total:24 online:24 Cores per socket:6 Threading degree per core:2 SIMD:Y CPU Features:SSSE3,SSE4,AVX2
CPU binding: not in use
Physical Memory(MB): total:128648 free:116675
Virtual Memory(MB): total:132744 free:120771
Swap     Memory(MB): total:4096 free:4096
Kernel   Params: msgMaxMessageSize:65536 msgMsgMap:65536 msgMaxQueueIDs:129024
                 msgNumberOfHeaders:65536 msgMaxQueueSize:65536
                 msgMaxSegmentSize:16 shmMax:9223372036854775807 shmMin:1
                 shmIDs:32256 shmSegments:32256 semMap:256000 semIDs:32256
                 semNum:256000 semUndo:256000 semNumPerID:250 semOps:32
                 semUndoSize:20 semMaxVal:32767 semAdjustOnExit:32767
Cur cpu time limit (seconds) = 0xFFFFFFFFFFFFFFFF
Cur file size limit (bytes) = 0xFFFFFFFFFFFFFFFF
Cur data size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur stack size (bytes) = 0x0000000000800000
Cur core size (bytes) = 0xFFFFFFFFFFFFFFFF
Cur memory size (bytes) = 0xFFFFFFFFFFFFFFFF
nofiles (descriptors) = 0x000000000000FFFE
Cur maximum number of processes = 0x000000000000FFFE

Information in this record is only valid at the time when this file was
created (see this record's time stamp)

2019-07-12-11.58.19.294314+600 I1843E387             LEVEL: Info
PID     : 6800                 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIParseCmdArgs, probe:3335
DATA #1 : String, 61 bytes
/opt/IBM/DB2/V11.1.3.s/bin/db2cluster -cm -exit -maintenance

2019-07-12-11.58.19.304998+600 I2231E2010            LEVEL: Error
PID     : 6800                 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIExitMaintenanceModeForCM, probe:4831
MESSAGE : ECF=0x90000614=-1879046636=ECF_SQLHA_CLUSTER_NOT_IN_MAINTENANCE_MODE
          The specified cluster on local node is not in maintenance mode.
DATA #1 : String, 78 bytes
An active domain was found on this host. This host is not in maintenance mode.
DATA #2 : Codepath, 8 bytes
2:5
DATA #3 : signed integer, 4 bytes
0
DATA #4 : signed integer, 4 bytes
0
DATA #5 : Boolean, 1 bytes
false
DATA #6 : Boolean, 1 bytes
false
DATA #7 : Boolean, 1 bytes
true
DATA #8 : Boolean, 1 bytes
false
DATA #9 : String, 39 bytes
/var/db2/.db2_maintenance_machine007
DATA #10: String, 13 bytes
machine007
DATA #11: String, 8 bytes
db2inst5
DATA #12: String, 24 bytes
db2domain_20180214154000
DATA #13: String, 0 bytes
Object not dumped: Address: 0x00007FFD51489A10 Size: 0 Reason: Zero-length data
DATA #14: String, 0 bytes
Object not dumped: Address: 0x00007FFD51488CB0 Size: 0 Reason: Zero-length data
DATA #15: String, 26 bytes
/home/db2/db2inst5/sqllib/
DATA #16: String, 0 bytes
Object not dumped: Address: 0x00007FFD51489200 Size: 0 Reason: Zero-length data
DATA #17: signed integer, 4 bytes
0
DATA #18: String, 0 bytes
Object not dumped: Address: 0x00007FFD51486A04 Size: 0 Reason: Zero-length data
DATA #19: String, 18 bytes
/home/db2/db2inst5
DATA #20: Boolean, 1 bytes
false
DATA #21: unsigned integer, 8 bytes
0
DATA #22: Boolean, 1 bytes
false
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x00000000004EFD51 _Z31sqlhaUIExitMaintenanceModeForCMP9sqlhaUICB + 0x2461
[1] 0x000000000042FE7B main + 0x3C9B
[2] 0x00007F9D1CF14C05 __libc_start_main + 0xF5
[3] 0x000000000042C119 /opt/IBM/DB2/V11.1.3.s/bin/db2cluster + 0x2C119

2019-07-12-11.58.19.306048+600 E4242E444             LEVEL: Error
PID     : 6800                 TID : 140312855496576 PROC : db2cluster
INSTANCE: db2inst5             NODE : 129
HOSTNAME: machine007
FUNCTION: DB2 UDB, high avail services, sqlhaUIMain, probe:1556
MESSAGE : ECF=0x90000614=-1879046636=ECF_SQLHA_CLUSTER_NOT_IN_MAINTENANCE_MODE
          The specified cluster on local node is not in maintenance mode.

Although the command reports "Unable to enter maintenance mode for domain .... is already in maintenance mode", the actual cause of the failure is that the quorum vote (the IsQuorumNode attribute) is disabled on some of the nodes.
When exiting maintenance, db2cluster checks whether the node is online in the domain. Because the node is online here, the exit maintenance command reports that the host is not in maintenance mode.

The RSCT command lsrsrc IBM.PeerNode shows which hosts have the quorum bit unset (this data is also captured in db2support.zip_unpack/ctsnap_out/lsrsrc/lsrsrc_IBM.PeerNode.out):

Resource Persistent and Dynamic Attributes for IBM.PeerNode
Name            NodeList RSCTVersion ClassVersions CritRsrcProtMethod IsQuorumNode IsPreferredGSGL NodeUUID HostName                        TBPriority ActivePeerDomain           NodeNameList                       OpState ConfigChanged CritRsrcActive OpUsabilityState
"machine006"    {2}      "3.2.3.1"   {}            0                  0            1               ""       "machine006.xxx.xxx.xxxx.xxx"    0          "db2domain_20180214154000" {"machine006.xxx.xxx.xxxx.xxx"}    1       0             0              1
"machine007"    {5}      "3.2.3.1"   {}            0                  0            1               ""       "machine007.xxx.xxx.xxxx.xxx"    0          "db2domain_20180214154000" {"machine007.xxx.xxx.xxxx.xxx"}    1       0             0              1
"machine008"    {4}      "3.2.3.1"   {}            0                  0            1               ""       "machine008.xxx.xxx.xxxx.xxx"    0          "db2domain_20180214154000" {"machine008.xxx.xxx.xxxx.xxx"}    1       0             0              1
"machine009"    {1}      "3.2.3.1"   {}            0                  0            1               ""       "machine009.xxx.xxx.xxxx.xxx"    0          "db2domain_20180214154000" {"machine009.xxx.xxx.xxxx.xxx"}    1       0             0              1
"machine005"    {6}      "3.2.3.1"   {}            0                  1            1               ""       "machine005.xxx.xxx.xxxx.xxx"    0          "db2domain_20180214154000" {"machine005.xxx.xxx.xxxx.xxx"}    1       0             0              1

We can see that IsQuorumNode is 0 on machine006, machine007, machine008 and machine009. Only machine005 has the bit set.
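
On a larger cluster it can be easier to pull the affected hosts out of the table than to read it by eye. The following is a minimal sketch, not part of any Db2 or RSCT tooling: nodes_without_quorum is a hypothetical helper name, and it assumes table-format output like the listing above, with a header row starting with Name and no whitespace inside any field.

```shell
#!/bin/sh
# Hypothetical helper: read table-format lsrsrc IBM.PeerNode output on stdin
# and print the name of every node whose IsQuorumNode column is 0.
nodes_without_quorum() {
    awk '
        # Header row: remember which column holds IsQuorumNode.
        $1 == "Name" {
            for (i = 1; i <= NF; i++)
                if ($i == "IsQuorumNode") col = i
            next
        }
        # Data rows start with a quoted node name.
        col && $1 ~ /^".*"$/ && $col == "0" {
            gsub(/"/, "", $1)   # strip the surrounding quotes
            print $1
        }
    '
}
```

For example, against the file collected by db2support:

nodes_without_quorum < db2support.zip_unpack/ctsnap_out/lsrsrc/lsrsrc_IBM.PeerNode.out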

You can set this bit with the following command, run once for each node that has it unset:
CT_MANAGEMENT_SCOPE=2 chrsrc -s "Name=='<NodeName>'" IBM.PeerNode IsQuorumNode=1
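
When several nodes are affected, the same command has to be repeated with a different node name each time. The sketch below only prints the commands so they can be reviewed before anything is changed. quorum_fix_commands is a hypothetical helper name, and it assumes one node name per line on stdin.

```shell
#!/bin/sh
# Hypothetical helper: for each node name read from stdin, print (do not run)
# the chrsrc command shown above with <NodeName> filled in.
quorum_fix_commands() {
    while IFS= read -r node; do
        # Skip blank lines.
        [ -n "$node" ] || continue
        printf 'CT_MANAGEMENT_SCOPE=2 chrsrc -s "Name==%s" IBM.PeerNode IsQuorumNode=1\n' "'$node'"
    done
}
```

For the hosts in this example:

printf 'machine006\nmachine007\nmachine008\nmachine009\n' | quorum_fix_commands

Printing rather than executing is a deliberate choice here, since the generated lines change cluster configuration and should be checked first.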

After running the above for every affected host, verify that the quorum bit is set on all hosts with the following command:
CT_MANAGEMENT_SCOPE=2 lsrsrc IBM.PeerNode Name IsQuorumNode
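
The verification can also be turned into a pass/fail check. This is a rough sketch under an assumption: that the command above prints one "IsQuorumNode = <value>" line per node, as in the usual long-format lsrsrc output. all_quorum_set is a hypothetical helper name. If the output on your system looks different, the pattern needs adjusting.

```shell
#!/bin/sh
# Hypothetical helper: read lsrsrc output on stdin and succeed only if at
# least one IsQuorumNode line was seen and every one of them is 1.
# Assumes lines of the form:  IsQuorumNode = 1
all_quorum_set() {
    awk '
        $1 == "IsQuorumNode" && $2 == "=" {
            seen++
            if ($3 != "1") bad++
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}
```

Usage:

CT_MANAGEMENT_SCOPE=2 lsrsrc IBM.PeerNode Name IsQuorumNode | all_quorum_set && echo "quorum bit is set on all hosts"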

The IsQuorumNode attribute is set to 1 when the cluster is created.
DB2 updates this value in only two cases: when a node enters maintenance (IsQuorumNode=0) and when it exits maintenance (IsQuorumNode=1).

However, if the user exits CM maintenance via startrpnode/startrpdomain instead of db2cluster, this bit remains unset.
This situation can therefore arise when any host in the cluster enters CM maintenance via db2cluster but exits maintenance via startrpnode/startrpdomain.

Thanks,
Shashank Kharche
IBM DB2 LUW Lab

UID

ibm13285789
