Diagnosing a cluster file system failure that occurred during restart light

A member attempts to perform a restart light, but a cluster file system error occurs, which causes the restart light to fail.

Symptoms

The objective of this topic is to diagnose the cause of the failure. The following sample output from the db2instance -list command shows an environment with three members and two cluster caching facilities (CFs):

ID        TYPE             STATE             HOME_HOST    CURRENT_HOST    ALERT   PARTITION_NUMBER   LOGICAL_PORT    NETNAME
--        ----             -----             ---------    ------------    -----   ----------------   ------------    -------
0         MEMBER           RESTARTING        hostA        hostB           NO                     0              1    hostB-ib0
1         MEMBER           STARTED           hostB        hostB           NO                     0              0    hostB-ib0
2         MEMBER           STARTED           hostC        hostC           NO                     0              0    hostC-ib0
128       CF               PRIMARY           hostD        hostD           NO                     -              0    hostD-ib0
129       CF               PEER              hostE        hostE           NO                     -              0    hostE-ib0

HOSTNAME       STATE      INSTANCE_STOPPED ALERT
--------       -----      ---------------- -----
hostA          INACTIVE   NO               YES
hostB          ACTIVE     NO               NO
hostC          ACTIVE     NO               NO
hostD          ACTIVE     NO               NO
hostE          ACTIVE     NO               NO

Member 0 has a state of RESTARTING, and its current host is hostB although its home host is hostA. This output indicates that member 0 is actively performing a restart light on hostB.

hostA has a state of INACTIVE with a corresponding alert. This indicates an abnormal host shutdown, such as a power failure, or a failure to access the host because of a network communication failure.
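The checks described above (a member whose current host differs from its home host, and a host that carries an alert) can be automated when the output is captured to a file. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the whitespace-separated column layout shown in the sample output, which might differ between Db2 releases.

```python
# Sample taken from the db2instance -list output shown in this topic.
SAMPLE = """\
ID        TYPE             STATE             HOME_HOST    CURRENT_HOST    ALERT   PARTITION_NUMBER   LOGICAL_PORT    NETNAME
--        ----             -----             ---------    ------------    -----   ----------------   ------------    -------
0         MEMBER           RESTARTING        hostA        hostB           NO                     0              1    hostB-ib0
1         MEMBER           STARTED           hostB        hostB           NO                     0              0    hostB-ib0
128       CF               PRIMARY           hostD        hostD           NO                     -              0    hostD-ib0

HOSTNAME       STATE      INSTANCE_STOPPED ALERT
--------       -----      ---------------- -----
hostA          INACTIVE   NO               YES
hostB          ACTIVE     NO               NO
"""


def parse_db2instance_list(text):
    """Split the output into member/CF rows and host rows (hypothetical helper)."""
    members, hosts = [], []
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or set(fields[0]) == {"-"}:
            continue  # skip blank lines and the dashed separator rows
        if fields[0] == "ID":
            section = "members"
        elif fields[0] == "HOSTNAME":
            section = "hosts"
        elif section == "members" and len(fields) >= 6:
            members.append({
                "id": int(fields[0]), "type": fields[1], "state": fields[2],
                "home_host": fields[3], "current_host": fields[4],
                "alert": fields[5].upper() == "YES",
            })
        elif section == "hosts" and len(fields) >= 4:
            hosts.append({
                "hostname": fields[0], "state": fields[1],
                "instance_stopped": fields[2].upper() == "YES",
                "alert": fields[3].upper() == "YES",
            })
    return members, hosts


def members_off_home_host(members):
    """Return members whose current host is not their home host."""
    return [m for m in members
            if m["type"] == "MEMBER" and m["home_host"] != m["current_host"]]


members, hosts = parse_db2instance_list(SAMPLE)
for m in members_off_home_host(members):
    print(f"member {m['id']}: {m['state']} on {m['current_host']} "
          f"(home host {m['home_host']})")
for h in hosts:
    if h["alert"]:
        print(f"host {h['hostname']}: {h['state']} with alert")
```

Splitting on whitespace works here because none of the columns in the sample contain embedded spaces. If a column is ever empty, a fixed-width parser would be a safer choice.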

A subsequent db2instance -list output shows that the restart light failed on hostB. The member then attempts the restart light on hostC and is successful. Member 0 is left in the WAITING_FOR_FAILBACK state because hostA remains offline.

ID        TYPE             STATE                HOME_HOST    CURRENT_HOST    ALERT   PARTITION_NUMBER    LOGICAL_PORT    NETNAME
--        ----             -----                ---------    ------------    -----   ----------------    ------------    -------
0         MEMBER           WAITING_FOR_FAILBACK hostA       hostC           YES                    0               1    hostC-ib0
1         MEMBER           STARTED              hostB        hostB           NO                     0               0    hostB-ib0
2         MEMBER           STARTED              hostC        hostC           NO                     0               0    hostC-ib0
128       CF               PRIMARY              hostD        hostD           NO                     -               0    hostD-ib0
129       CF               PEER                 hostE        hostE           NO                     -               0    hostE-ib0
	
HOSTNAME       STATE      INSTANCE_STOPPED ALERT
--------       -----      ---------------- -----
hostA          INACTIVE   NO               YES
hostB          ACTIVE     NO               NO
hostC          ACTIVE     NO               NO
hostD          ACTIVE     NO               NO
hostE          ACTIVE     NO               NO

If hostA comes online, the host state returns to ACTIVE, and member 0 fails back to hostA, as shown in the following db2instance -list output:

ID        TYPE          STATE        HOME_HOST   CURRENT_HOST    ALERT   PARTITION_NUMBER    LOGICAL_PORT    NETNAME
--        ----          -----        ---------   ------------    -----   ----------------    ------------    -------
0         MEMBER        STARTED      hostA       hostA           YES                    0               0    hostA-ib0
1         MEMBER        STARTED      hostB       hostB           NO                     0               0    hostB-ib0
2         MEMBER        STARTED      hostC       hostC           NO                     0               0    hostC-ib0
128       CF            PRIMARY      hostD       hostD           NO                     -               0    hostD-ib0
129       CF            PEER         hostE       hostE           NO                     -               0    hostE-ib0
	
HOSTNAME       STATE      INSTANCE_STOPPED ALERT
--------       -----      ---------------- -----
hostA          ACTIVE     NO               NO
hostB          ACTIVE     NO               NO
hostC          ACTIVE     NO               NO
hostD          ACTIVE     NO               NO
hostE          ACTIVE     NO               NO

Diagnosis and resolution

To help diagnose the restart light error, take one or more of the following steps:

Check the $INSTHOME/sqllib/db2dump/db2diag.log file. There should be a message corresponding to the time of the restart light attempt on hostB, as shown in the following output:

2009-08-27-23.37.52.416270-240 I6733A457            LEVEL: Event
PID     : 1093874              TID  : 1             KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0

Check the db2diag log file to analyze the error message corresponding to the restart light failure on hostB. The following output is generated during the restart light attempt:

2009-08-27-23.37.53.260582-240 E596165A1624       LEVEL: Error (OS)
PID     : 667736               TID  : 1           PROC : db2rocm
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, oper system services, sqloflock, probe:100
MESSAGE : ZRC=0x83000037=-2097151945
CALLED  : OS, -, fcntl
OSERR   : EINPROGRESS (55) "Operation now in progress"
DATA #1 : File handle, PD_TYPE_SQO_FILE_HDL, 8 bytes
  File Handle              = 3
  File System Block Size   = 32768 bytes
  File System Type         = mmfs
  File Handle Flags :
    Require Sector Align   = No
    DIO/CIO Mode           = No
    Raw Block Device       = No
    Reserved Handle        = No
    Flush On Close         = No
    Thread-Level Lock      = No
    Write-through Mode     = No
    File Not Tracked       = No

The output shows that the fcntl() operating system function was called and generated an error message and code of EINPROGRESS (55) "Operation now in progress". The File System Type is mmfs, which is the General Parallel File System (GPFS).
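When many records must be reviewed, the fields discussed above can be pulled out of a db2diag.log record programmatically. The following Python sketch is a minimal illustration, not a supported tool: the helper names are hypothetical, and the regular expressions assume the record layout shown in the sample. It also confirms that the two values in the MESSAGE line are the same number, because the decimal value is the hexadecimal ZRC read as a signed 32-bit integer.

```python
import re

# Record copied (and shortened) from the sample db2diag.log output in this topic.
RECORD = """\
2009-08-27-23.37.53.260582-240 E596165A1624       LEVEL: Error (OS)
PID     : 667736               TID  : 1           PROC : db2rocm
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, oper system services, sqloflock, probe:100
MESSAGE : ZRC=0x83000037=-2097151945
CALLED  : OS, -, fcntl
OSERR   : EINPROGRESS (55) "Operation now in progress"
DATA #1 : File handle, PD_TYPE_SQO_FILE_HDL, 8 bytes
  File Handle              = 3
  File System Block Size   = 32768 bytes
  File System Type         = mmfs
"""


def summarize_record(record):
    """Extract the fields that matter for this diagnosis (hypothetical helper)."""
    def grab(pattern):
        match = re.search(pattern, record, re.MULTILINE)
        return match.group(1).strip() if match else None

    return {
        "level": grab(r"LEVEL:\s*(.+)$"),
        "hostname": grab(r"^HOSTNAME:\s*(\S+)"),
        "function": grab(r"^FUNCTION:\s*(.+)$"),
        "called": grab(r"^CALLED\s*:\s*(.+)$"),
        "oserr": grab(r"^OSERR\s*:\s*(\w+)"),
        "fs_type": grab(r"File System Type\s*=\s*(\S+)"),
    }


def zrc_as_signed32(hex_text):
    """Interpret a hexadecimal ZRC as a signed 32-bit integer."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value & 0x80000000 else value


summary = summarize_record(RECORD)
print(summary["hostname"], summary["called"], summary["oserr"], summary["fs_type"])
print(zrc_as_signed32("0x83000037"))
```

A record that combines an operating system error level with a file system type of mmfs is the pattern that points toward GPFS in this scenario.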

Run the errpt -a operating system command to view the contents of the AIX errpt system log. In this scenario, as shown in the following sample output, the AIX errpt log from hostB contains MMFS_* messages (for example, MMFS_MOREINFO) that were generated at approximately the same time as the restart light failure message in the db2diag log file. These MMFS_* messages indicate that the problem originated in the GPFS subsystem.
```
LABEL:          MMFS_MOREINFO
IDENTIFIER:     E7AA68A1

Date/Time:       Thu Aug 27 23:37:53 EDT 2009
Sequence Number: 356562
Machine Id:      0006DA8AD700
Node Id:         hostB
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   mmfs

Description
SOFTWARE PROGRAM ERROR
```
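The time correlation between the errpt entries and the db2diag.log record can also be checked with a short script. The following Python sketch is an assumption-laden illustration: the function names and the 30-second window are hypothetical, it assumes that both logs record local time in the same time zone, and it assumes the English errpt date format shown in the sample.

```python
from datetime import datetime, timedelta

# Entry copied from the sample errpt -a output in this topic.
ERRPT_SAMPLE = """\
LABEL:          MMFS_MOREINFO
IDENTIFIER:     E7AA68A1

Date/Time:       Thu Aug 27 23:37:53 EDT 2009
Sequence Number: 356562
Machine Id:      0006DA8AD700
Node Id:         hostB
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   mmfs

Description
SOFTWARE PROGRAM ERROR
"""


def parse_errpt_entries(text):
    """Collect label, time, and node for each entry (hypothetical helper)."""
    entries, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "LABEL":
            current = {"label": value}
            entries.append(current)
        elif current is not None and key == "Date/Time":
            parts = value.split()
            del parts[4]  # drop the time zone token, which strptime parses unreliably
            current["time"] = datetime.strptime(" ".join(parts),
                                                "%a %b %d %H:%M:%S %Y")
        elif current is not None and key == "Node Id":
            current["node"] = value
    return entries


def db2diag_time(stamp):
    """Parse the local-time part of a db2diag.log timestamp, ignoring the UTC offset."""
    return datetime.strptime(stamp[:26], "%Y-%m-%d-%H.%M.%S.%f")


def gpfs_entries_near(entries, when, window=timedelta(seconds=30)):
    """Return MMFS_ entries logged within the window around the given time."""
    return [e for e in entries
            if e["label"].startswith("MMFS_") and "time" in e
            and abs(e["time"] - when) <= window]


failure_time = db2diag_time("2009-08-27-23.37.53.260582-240")
for entry in gpfs_entries_near(parse_errpt_entries(ERRPT_SAMPLE), failure_time):
    print(entry["label"], entry["node"], entry["time"])
```

The window size is a judgment call. A narrow window reduces unrelated matches, but GPFS messages might precede or follow the Db2 error by several seconds.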

See the topic Diagnosing a host reboot with a restart light for help in diagnosing the cause of the host failure that initiated the restart light.

Further assistance for host and restart light problems is available on the IBM Technical Support website (Db2® support), where you can get information about known problems based on symptoms or error log messages. In addition, if you want IBM Technical Support to analyze your diagnostic data, obtain a db2support package by running db2support <output directory> -d <database name> -s on each node in the cluster, where <output directory> and <database name> are placeholders for your own values. To submit data to IBM Technical Support, see Submitting diagnostic information to IBM Technical Support for problem determination.