Diagnosing a cluster file system failure that occurred during restart light
A member attempts to perform a restart light, but a cluster file system error occurs, which causes the restart light to fail.
Symptoms
The following sample output of the db2instance -list command shows a cluster with three members and two cluster caching facilities (CFs):

ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER RESTARTING hostA     hostB        NO    0                1            hostB-ib0
1   MEMBER STARTED    hostB     hostB        NO    0                0            hostB-ib0
2   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostD     hostD        NO    -                0            hostD-ib0
129 CF     PEER       hostE     hostE        NO    -                0            hostE-ib0

HOSTNAME STATE    INSTANCE_STOPPED ALERT
-------- -----    ---------------- -----
hostA    INACTIVE NO               YES
hostB    ACTIVE   NO               NO
hostC    ACTIVE   NO               NO
hostD    ACTIVE   NO               NO
hostE    ACTIVE   NO               NO
Member 0 has a state of RESTARTING on hostB, while its home host is hostA. This output indicates that member 0 is actively performing a restart light on hostB.

HostA has a state of INACTIVE with a corresponding alert. This indicates an abnormal host shutdown, such as a power failure, or a failure to access the host because of a network communication failure.
In the following sample output, member 0 is in the WAITING_FOR_FAILBACK state on hostC with an alert, and hostA is still INACTIVE:

ID  TYPE   STATE                HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----                --------- ------------ ----- ---------------- ------------ -------
0   MEMBER WAITING_FOR_FAILBACK hostA     hostC        YES   0                1            hostC-ib0
1   MEMBER STARTED              hostB     hostB        NO    0                0            hostB-ib0
2   MEMBER STARTED              hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY              hostD     hostD        NO    -                0            hostD-ib0
129 CF     PEER                 hostE     hostE        NO    -                0            hostE-ib0

HOSTNAME STATE    INSTANCE_STOPPED ALERT
-------- -----    ---------------- -----
hostA    INACTIVE NO               YES
hostB    ACTIVE   NO               NO
hostC    ACTIVE   NO               NO
hostD    ACTIVE   NO               NO
hostE    ACTIVE   NO               NO
In the following sample output, hostA is ACTIVE again and member 0 is in the STARTED state on its home host, but the ALERT field for member 0 is YES:

ID  TYPE   STATE   HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----   --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED hostA     hostA        YES   0                0            hostA-ib0
1   MEMBER STARTED hostB     hostB        NO    0                0            hostB-ib0
2   MEMBER STARTED hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY hostD     hostD        NO    -                0            hostD-ib0
129 CF     PEER    hostE     hostE        NO    -                0            hostE-ib0

HOSTNAME STATE  INSTANCE_STOPPED ALERT
-------- -----  ---------------- -----
hostA    ACTIVE NO               NO
hostB    ACTIVE NO               NO
hostC    ACTIVE NO               NO
hostD    ACTIVE NO               NO
hostE    ACTIVE NO               NO
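The checks described in this section (a member running away from its home host, a raised alert, an inactive host) can be automated when the status output is captured to text. The following is a minimal sketch, not part of Db2: the function names parse_db2instance_list and find_anomalies are hypothetical, and the parser assumes the whitespace-separated layout shown in the sample output of this topic.

```python
SAMPLE = """\
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER WAITING_FOR_FAILBACK hostA hostC YES 0 1 hostC-ib0
1 MEMBER STARTED hostB hostB NO 0 0 hostB-ib0
2 MEMBER STARTED hostC hostC NO 0 0 hostC-ib0
128 CF PRIMARY hostD hostD NO - 0 hostD-ib0
129 CF PEER hostE hostE NO - 0 hostE-ib0
HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
hostA INACTIVE NO YES
hostB ACTIVE NO NO
hostC ACTIVE NO NO
hostD ACTIVE NO NO
hostE ACTIVE NO NO
"""


def parse_db2instance_list(text):
    """Split captured status output into member/CF rows and host rows."""
    members, hosts = [], []
    header, target = None, None
    for line in text.splitlines():
        fields = line.split()
        if not fields or set(fields[0]) == {"-"}:
            continue  # skip blank lines and the dashed separator rows
        if fields[0] == "ID":
            header, target = fields, members
        elif fields[0] == "HOSTNAME":
            header, target = fields, hosts
        elif header is not None and len(fields) == len(header):
            target.append(dict(zip(header, fields)))
    return members, hosts


def find_anomalies(members, hosts):
    """Report members off their home host, member alerts, and unhealthy hosts."""
    findings = []
    for m in members:
        if m["TYPE"] != "MEMBER":
            continue  # CFs are not subject to restart light
        if m["CURRENT_HOST"] != m["HOME_HOST"]:
            findings.append(
                f"member {m['ID']} is {m['STATE']} on {m['CURRENT_HOST']}, "
                f"not on home host {m['HOME_HOST']}"
            )
        if m["ALERT"].upper() == "YES":
            findings.append(f"member {m['ID']} has an alert")
    for h in hosts:
        if h["STATE"] != "ACTIVE" or h["ALERT"].upper() == "YES":
            findings.append(
                f"host {h['HOSTNAME']} is {h['STATE']} with alert {h['ALERT']}"
            )
    return findings


if __name__ == "__main__":
    for finding in find_anomalies(*parse_db2instance_list(SAMPLE)):
        print(finding)
```

Comparing the ALERT value case-insensitively is a deliberate choice in this sketch, so that a value such as No is treated the same as NO.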
Diagnosis and resolution
To help diagnose the restart light error, take one or more of the following steps:
- Check the $INSTHOME/sqllib/db2dump/db2diag.log file.
There should be a message corresponding to the time of the restart
light attempt on hostB, as shown in the following output:
2009-08-27-23.37.52.416270-240 I6733A457          LEVEL: Event
PID     : 1093874              TID  : 1           KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
- Check the db2diag log file to analyze the error message corresponding to
the restart light failure on hostB. The following output is generated
during the restart light attempt:
2009-08-27-23.37.53.260582-240 E596165A1624       LEVEL: Error (OS)
PID     : 667736               TID  : 1           PROC : db2rocm
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, oper system services, sqloflock, probe:100
MESSAGE : ZRC=0x83000037=-2097151945
CALLED  : OS, -, fcntl
OSERR   : EINPROGRESS (55) "Operation now in progress"
DATA #1 : File handle, PD_TYPE_SQO_FILE_HDL, 8 bytes
  File Handle              = 3
  File System Block Size   = 32768 bytes
  File System Type         = mmfs
  File Handle Flags :
    Require Sector Align   = No
    DIO/CIO Mode           = No
    Raw Block Device       = No
    Reserved Handle        = No
    Flush On Close         = No
    Thread-Level Lock      = No
    Write-through Mode     = No
    File Not Tracked       = No
The output shows that the fcntl() operating system function was called and generated an error message and code of EINPROGRESS (55) "Operation now in progress". The File System Type is mmfs, which is the General Parallel File System (GPFS).
- Run the errpt -a operating system command to view the contents of the
AIX errpt system log. In this scenario, as shown in the following sample
output, the AIX errpt log from hostB contains MMFS_* messages (for example,
MMFS_MOREINFO) that were generated at approximately the same time as the
preceding restart light error message. These MMFS_* messages indicate that
the problem originated in the GPFS subsystem.
LABEL:           MMFS_MOREINFO
IDENTIFIER:      E7AA68A1

Date/Time:       Thu Aug 27 23:37:53 EDT 2009
Sequence Number: 356562
Machine Id:      0006DA8AD700
Node Id:         hostB
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   mmfs

Description
SOFTWARE PROGRAM ERROR
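The correlation described in the preceding steps (an OS-level error in the db2diag log that lines up in time with MMFS_* entries in the AIX errpt log) can be sketched in a short script. This is an illustration only: the function names, the five-second window, and the assumption that both logs record the same local wall-clock time on hostB are choices made for this sketch, not Db2 or AIX features. The sample data is an abbreviated copy of the records shown in this topic.

```python
import re
from datetime import datetime, timedelta

DIAG_RECORD = """\
2009-08-27-23.37.53.260582-240 E596165A1624       LEVEL: Error (OS)
PROC    : db2rocm
HOSTNAME: hostB
FUNCTION: Db2, oper system services, sqloflock, probe:100
MESSAGE : ZRC=0x83000037=-2097151945
CALLED  : OS, -, fcntl
OSERR   : EINPROGRESS (55) "Operation now in progress"
DATA #1 : File handle, PD_TYPE_SQO_FILE_HDL, 8 bytes
  File System Type         = mmfs
"""

ERRPT_SAMPLE = """\
LABEL:           MMFS_MOREINFO
IDENTIFIER:      E7AA68A1

Date/Time:       Thu Aug 27 23:37:53 EDT 2009
Node Id:         hostB
Resource Name:   mmfs
"""


def parse_db2diag_record(record):
    """Extract the timestamp and a few fields from one db2diag record."""
    # The first 19 characters hold the date and time down to the second.
    stamp = datetime.strptime(record.lstrip()[:19], "%Y-%m-%d-%H.%M.%S")
    fields = {"timestamp": stamp}
    for key in ("FUNCTION", "CALLED", "OSERR"):
        match = re.search(rf"^{key}\s*:\s*(.+)$", record, re.MULTILINE)
        if match:
            fields[key] = match.group(1).strip()
    match = re.search(r"File System Type\s*=\s*(\S+)", record)
    if match:
        fields["fs_type"] = match.group(1)
    return fields


def parse_errpt_entries(text):
    """Extract the label and timestamp from each entry of errpt -a output."""
    entries = []
    for block in re.split(r"(?m)^(?=LABEL:)", text):
        label = re.search(r"^LABEL:\s*(\S+)", block, re.MULTILINE)
        when = re.search(r"^Date/Time:\s*(.+)$", block, re.MULTILINE)
        if not (label and when):
            continue
        parts = when.group(1).split()
        # Drop the time zone abbreviation (for example, EDT) before parsing.
        stamp = datetime.strptime(
            " ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y"
        )
        entries.append({"label": label.group(1), "timestamp": stamp})
    return entries


def related_gpfs_errors(diag, errpt_entries, window_seconds=5):
    """Return MMFS_* entries logged within the window around the db2diag error."""
    window = timedelta(seconds=window_seconds)
    return [
        entry
        for entry in errpt_entries
        if entry["label"].startswith("MMFS_")
        and abs(entry["timestamp"] - diag["timestamp"]) <= window
    ]


if __name__ == "__main__":
    diag = parse_db2diag_record(DIAG_RECORD)
    for entry in related_gpfs_errors(diag, parse_errpt_entries(ERRPT_SAMPLE)):
        print(entry["label"], entry["timestamp"], "matches", diag["OSERR"])
```

A match between an mmfs file system type in the db2diag record and nearby MMFS_* entries in the errpt log is only a hint that points to GPFS; the log entries themselves still need to be reviewed.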
See the topic Diagnosing a host reboot with a restart light for help in diagnosing the cause of the host failure that initiated the restart light.
Further assistance for host and restart light problems is available on the IBM Technical Support website (Db2® support), where you can get information about known problems based on symptoms or error log messages. In addition, if you want IBM Technical Support to analyze your diagnostic data, obtain a db2support package by running db2support <output directory> -d <database name> -s on each node in the cluster, where <output directory> and <database name> are placeholders for your own values. To submit data to IBM Technical Support, see Submitting diagnostic information to IBM Technical Support for problem determination.