Disk and file system issues

This section describes solutions for potential disk and file system issues.

AIX volume group commands fail during a varyon operation on a shared volume group

Problem

The /var/hacmp/log/hacmp.out file shows that AIX volume group commands fail during a varyon operation on a shared volume group. This failure can occur when the shared volume group is set to vary on automatically at system restart, so the volume group is already active on another node when the varyon operation is attempted.

Solution
When you configure a shared volume group, set the Activate volume group AUTOMATICALLY at system restart? field to no on the SMIT System Management (C-SPOC) > PowerHA SystemMirror Logical Volume Management > Shared Volume Groups > Create a Shared Volume Group panel. After you import the shared volume group on the other cluster nodes, use the following command to ensure that the volume group on each node is not set to autovaryon at boot:
chvg -an vgname
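
For example, after you import the volume group on each node, you can confirm the setting with commands similar to the following. This is a minimal sketch; sharedvg is a hypothetical volume group name.
lsvg sharedvg | grep -i "AUTO ON"   # the field should report: AUTO ON: no
chvg -an sharedvg                   # disable autovaryon if the field still shows yes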

varyonvg command fails on a volume group

This section describes different problems that can cause the varyonvg command to fail on a volume group.

Problem 1

The PowerHA® SystemMirror® software (the /var/hacmp/log/hacmp.out file) indicates that the varyonvg command failed when trying to vary on a volume group.

Solution

Ensure that the volume group is not set to autovaryon on any node and that the volume group (unless it is in concurrent access mode) is not already varied on by another node.

Use the lsvg -o command to determine whether the shared volume group is active. Enter lsvg volume_group_name on the node that has the volume group activated, and check the AUTO ON field to determine whether the volume group is set to vary on automatically. If the AUTO ON field is set to yes, change the setting by using the chvg -an volume_group_name command.
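
A minimal command sketch of this check follows, assuming a hypothetical shared volume group named sharedvg. Run the varyoffvg command only on the node that currently has the volume group active and must release it.
lsvg -o                     # list the volume groups that are currently varied on (active) on this node
lsvg sharedvg               # display the volume group attributes; check the AUTO ON field
chvg -an sharedvg           # turn off autovaryon at system restart if AUTO ON is set to yes
varyoffvg sharedvg          # release the volume group so that another node can vary it on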

Problem 2

The volume group information on disk differs from the information that is present in the Device Configuration Database.

Solution

Correct the Device Configuration Database on the nodes that have incorrect information:

  1. Use the smit exportvg fast path to export the volume group information. The volume group information is removed from the Device Configuration Database.
  2. Use the smit importvg fast path to import the volume group. A new Device Configuration Database entry is created directly from the information on disk. After you import the volume group, ensure that the volume group is not set to autovaryon at the next system restart. A command-line sketch of steps 1 and 2 follows this list.
  3. In SMIT, select Problem Determination Tools > Recover From PowerHA SystemMirror Script Failure. The clruncmd command is run and the Cluster Manager resumes cluster processing.
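
The following is a hedged command-line sketch of steps 1 and 2. The names sharedvg and hdisk2 are hypothetical; substitute the volume group and disk that apply to your cluster.
exportvg sharedvg            # step 1: remove the volume group definition from the Device Configuration Database
importvg -y sharedvg hdisk2  # step 2: re-create the definition from the information on disk
chvg -an sharedvg            # ensure that the volume group is not set to autovaryon at the next system restart
varyoffvg sharedvg           # leave the volume group inactive so that PowerHA SystemMirror can manage it
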
Problem 3

The PowerHA SystemMirror software indicates that the varyonvg command failed because the volume group might not be found.

Solution

The volume group is not defined to the system. If a new volume group is created and exported, or if an mksysb system backup is restored, you must import the volume group. Follow the steps described in the solution for Problem 2, and verify that the correct volume group name is being referenced.

Problem 4

The PowerHA SystemMirror software indicates that the varyonvg command failed because the logical volume is incomplete.

Solution

The varyonvg command fails when the forced varyon attribute is configured for the volume group in SMIT but PowerHA SystemMirror did not find a complete copy of the specified logical volume for the volume group during the forced varyon operation. It is also possible that you requested a forced varyon operation but did not specify the super strict allocation policy for the mirrored logical volumes. In this case, the varyon operation might not be successful.

cl_nfskill command fails when performing a forced unmount operation

Problem

The /var/hacmp/log/hacmp.out file shows that the cl_nfskill command fails during a forced unmount operation of an NFS-mounted file system. NFS provides a level of file system locking that resists a forced unmount operation by the cl_nfskill command.

Solution

Make a copy of the /etc/locks file in a separate directory before you run the cl_nfskill command. Delete the original /etc/locks file and run the cl_nfskill command. After the command succeeds, re-create the /etc/locks file by using the saved copy.
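
A minimal sketch of this procedure, assuming that /etc/locks is a regular file as described above and that /tmp is used as the separate directory:
cp -p /etc/locks /tmp/locks.save    # save a copy of /etc/locks in a separate directory
rm /etc/locks                       # delete the original /etc/locks file
# ... run the cl_nfskill command here ...
cp -p /tmp/locks.save /etc/locks    # re-create /etc/locks from the saved copy after the command succeeds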

cl_scdiskreset command fails

Problem

The cl_scdiskreset command logs error messages to the /var/hacmp/log/hacmp.out file. To break the reserve held by one system on a SCSI device, the PowerHA SystemMirror disk utilities issue the cl_scdiskreset command. The cl_scdiskreset command might fail if back-level hardware exists on the SCSI bus (adapters, cables, or devices) or if a SCSI ID conflict exists on the bus.

Solution

See the appropriate sections in the Using cluster log files topic to check the SCSI adapters, cables, and devices. Ensure that you have the latest adapters and cables. The SCSI IDs for each SCSI device must be different.
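
For example, you might list the configured disks and check the SCSI ID of an adapter with commands similar to the following. The adapter name scsi0 is an example, and the attribute that holds the SCSI ID can vary by adapter type.
lsdev -Cc disk                 # list the disks that are configured on the bus
lsattr -E -l scsi0 -a id       # display the SCSI ID that is assigned to the adapter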

fsck command fails at boot time

Problem

At boot time, AIX runs the fsck command to check all the file systems listed in /etc/filesystems with the check=true attribute. If it cannot check a file system, AIX displays an error.

Solution

For file systems controlled by PowerHA SystemMirror, this message typically does not indicate a problem. The file system check fails because the volume group that contains the file system is not varied on. The boot procedure does not automatically vary on PowerHA SystemMirror-controlled volume groups.

To prevent this message, make sure that all the file systems under PowerHA SystemMirror control do not have the check=true attribute in their /etc/filesystems stanzas. To delete this attribute or change it to check=false, edit the /etc/filesystems file.
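
The following is a sketch of an /etc/filesystems stanza for a PowerHA SystemMirror-controlled file system with the check attribute set to false. The file system, logical volume, and log names are hypothetical.
/sharedfs:
        dev             = /dev/sharedlv
        vfs             = jfs2
        log             = /dev/sharedloglv
        mount           = false
        check           = false
        account         = false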

System cannot mount specified file systems

Problem

The /etc/filesystems file is not updated to reflect changes to the log name for a logical volume. If you change the name of a logical volume after the file system is created for that logical volume, the /etc/filesystems entry for the log is not updated. When you mount the file system, the PowerHA SystemMirror software tries to get the required information about the logical volume from the old log name. Because this information is not updated, the file system cannot be mounted.

Solution

Update the /etc/filesystems file after you modify logical volume names.
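
For example, if you rename the log logical volume for a file system, a hedged sequence might look like the following. The names loglv00 and sharedloglv are hypothetical.
chlv -n sharedloglv loglv00     # rename the log logical volume
# Then edit /etc/filesystems and update the log attribute in the affected stanza:
#         log             = /dev/sharedloglv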

Cluster disk replacement operation fails

Problem

The disk replacement operation failed to complete due to a node_down event.

Solution

After the node is online, export the volume group, and then import the volume group before you start PowerHA SystemMirror on this node.

File system change not recognized by lazy update

Problem

If you change the name of a file system, or remove a file system and then perform a lazy update, lazy update does not run the imfs -lx command before the imfs command is run. This action might lead to a failure during fallover or prevent a successful restart of the PowerHA SystemMirror cluster services.

Solution

Use the C-SPOC utility to change or remove file systems to ensure that the imfs -lx command is run before the imfs command and that the changes are updated on all nodes in the cluster.

The AIX Error Reporting facility provides detailed information about inconsistencies in the volume group state across the cluster. If such an inconsistency is reported, take manual corrective action. If the file system changes are not updated on all nodes, update those nodes manually with this information.

clam_nfsv4 application monitor fails

Problem

The clam_nfsv4 application monitor takes more than 60 seconds to complete. The monitor is considered unresponsive and is stopped, and a fallover occurs on the Network File System (NFS) node. This fallover usually occurs when the system that hosts the application monitor is experiencing a heavy workload.

Solution

You must reduce the system workloads to correct this problem. You can also apply APAR IV08873 to your system, which reduces the amount of time it takes to run the clam_nfsv4 application monitor script.

Troubleshooting LVM split-site mirroring

Problem
PowerHA SystemMirror and LVM do not have information about the physical location for disks, other than the information that was specified when the mirror pools were defined.
Solution
Review the following information to identify possible solutions for problems with LVM split-site mirroring (a command-line sketch of similar checks follows the list):
  • Verify the assignment of disks to mirror pools by entering smitty cl_mirrorpool_mgt from the command line and selecting Show Mirror Pools for a Volume Group.
  • Verify that the mirroring for individual file systems and logical volumes is correct by entering smitty cl_lv from the command line and selecting Show Characteristics of a Logical Volume.
  • Verify that your volume groups are configured as super strict by entering smitty cl_vgsc from the command line and selecting Change/Show characteristics of a Volume Group.
  • If resynchronization fails, examine the AIX error log for problems that are associated with the disks in the volume group. You can manually resynchronize the volume group by entering smitty cl_syncvg from the command line and selecting Synchronize LVM Mirrors by Volume Group.
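
The following is a hedged command-line sketch of similar checks. The names sharedvg and sharedlv are hypothetical, and the available options depend on your AIX level.
lsmp -A sharedvg      # show the mirror pools that are defined for the volume group and the disks in each pool
lslv sharedlv         # check the COPIES value and the allocation (strictness) settings for the logical volume
lsvg sharedvg         # confirm the mirror pool strictness setting for the volume group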

Troubleshooting repository disks

Problem
If any node in the cluster encounters errors with the repository disk or a failure while accessing the disk, the cluster enters a limited or restricted mode of operation. In this mode of operation most topology-related operations are not allowed, and any node that is restarted cannot rejoin the cluster.
Solution
When the repository disk fails, you are notified of the disk failure. PowerHA SystemMirror continues to notify you of the repository disk failure until it is resolved. To determine what the problem is with the repository disk, you can view the following log files:
  • hacmp.out
  • AIX error log (using the errpt command)
Example: hacmp.out log

Following is an example of an error message in the hacmp.out log file when a repository disk fails:
ERROR: rep_disk_notify : Tue Jan 10 13:38:22 CST 2012 : Node "r6r4m32"(0x54628FEA1D0611E183EE001A64B90DF0) on Cluster r6r4m31_32_33_34 has lost access to repository disk hdisk75.

Example: AIX error log
When a node loses access to the repository disk, an entry is made in the AIX error log of each node that has a problem. Following is an example of an error message in the error log file when a repository disk fails.
Note: To view the AIX error log, you must use the errpt command.
LABEL:          OPMSG
IDENTIFIER:     AA8AB241

Date/Time:       Tue Jan 10 13:38:22 CST 2012
Sequence Number: 21581
Machine Id:      00CDB2C14C00
Node Id:         r6r4m32
Class:           O
Type:            TEMP
WPAR:            Global
Resource Name:   clevmgrd

Description
OPERATOR NOTIFICATION

User Causes
ERRLOGGER COMMAND

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
MESSAGE FROM ERRLOGGER COMMAND
Error: Node 0x54628FEA1D0611E183EE001A64B90DF0 has lost access to repository disk hdisk75.
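
To review these entries in detail from the command line, you can filter the AIX error log by the error identifier or resource name that is shown in the example output above:
errpt -a -j AA8AB241     # show detailed entries for this error identifier
errpt -N clevmgrd        # list entries that were logged against the clevmgrd resource
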
Replacing a failed or lost repository disk

If a repository disk fails, the repository disk must be recovered on a different disk to restore all cluster operations. The circumstances for your cluster environment and the type of the repository disk failure determine the possible methods for recovering the repository disk.

Automatic Repository Disk Replacement (ARR)

PowerHA SystemMirror Version 7.2.0, or later, uses the ARR capability of CAA (available in AIX Version 7.2, or later, and in AIX Version 7.1 with Technology Level 4, or later) to handle repository disk failures. ARR automatically replaces a failed repository disk with a backup repository disk. The ARR function is available only if you configure a backup repository disk by using PowerHA SystemMirror. For more information about ARR, see the Repository disk failure topic.

You must clean up the failed repository disk yourself because ARR cannot clean the disk while it is not accessible. To clean up the failed repository disk, use the following command:
CAA_FORCE_ENABLED=true rmcluster -r <disk name>
The following are two possible scenarios in which a repository disk fails, along with the possible methods for restoring the repository disk on a new storage disk.
Repository disk fails but the cluster is still operational
In this scenario, access to the repository disk is lost on one or more nodes in the cluster. When this failure occurs, Cluster Aware AIX (CAA) continues to operate in restricted mode by using the repository disk information that is cached in memory. If CAA remains active on at least one node in the cluster, the information from the previous repository disk can be used to rebuild a new repository disk.
To rebuild the repository disk after a failure, complete the following steps from any node where CAA is still active:
  1. Verify that CAA is active on the node by using the lscluster -c command and then the lscluster -m command (see the sketch after these steps).
  2. Replace the repository disk by completing the steps in the Replacing a repository disk with SMIT topic. PowerHA SystemMirror recognizes the problem and interacts with CAA to rebuild the repository disk on the new storage disk.
    Note: This step updates the repository information that is stored in the PowerHA SystemMirror configuration data.

    You do not need to perform Step 1 and Step 2 if the ARR function is available.

  3. Synchronize the PowerHA SystemMirror cluster configuration information by selecting Cluster Nodes and Networks > Verify and Synchronize Cluster Configuration from the SMIT interface.
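
The following is a hedged sketch of the verification in step 1; run these commands from a node where CAA is still active. The output fields vary by AIX release.
lscluster -c     # display the CAA cluster configuration, including the repository disk
lscluster -m     # list the cluster nodes; the local node state should be UP
lscluster -d     # after the replacement, confirm that CAA reports the new repository disk
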
Repository disk fails and the nodes in the cluster are restarted
In this rare scenario, a series of critical failures results in a worst-case situation where access to the repository disk is lost and all nodes in the cluster are rebooted. Because none of the nodes in the cluster remained online during the failure, you cannot rebuild the repository disk from the AIX operating system's memory. When the nodes are brought back online, they cannot start CAA because a repository disk is not present in the cluster. To fix this problem, it is ideal to bring back the original repository disk and allow the cluster to self-heal. If that is not possible, you must rebuild the repository disk on a new storage disk and use it to start the CAA cluster.
To rebuild the repository disk and start cluster services, complete the following steps:
  1. On a node in the cluster, rebuild the repository by completing the steps in the Replacing a repository disk with SMIT topic. PowerHA SystemMirror recognizes the problem and interacts with CAA to rebuild the repository disk on the new storage disk.
    Note: This step updates the repository information that is stored in the PowerHA SystemMirror configuration data and rebuilds the repository disk from the CAA cluster cache file.

    If the ARR function is available, you do not need to perform Step 1, and the disk is replaced automatically.

    After the repository disk is replaced, run the verify and synchronization operations. If some of the nodes are down, the verify and synchronization operations might fail with errors. To run the verify and synchronization operations successfully, enter the following command:
    #/usr/es/sbin/cluster/utilities/cldare -f -dr
    
    You can ignore any cl_rsh errors.
  2. Start cluster services on the node that hosts the repository disk by completing the steps in the Starting cluster services topic.
  3. All other nodes in the cluster continue to attempt to access the original repository disk. You must configure these nodes to use the new repository disk and start CAA cluster services. Verify that the CAA cluster is not active on any of these nodes by using the lscluster -m command. If the CAA cluster is not active or the local node is in the DOWN state, enter the following commands to remove the old repository disk information:
    export CAA_FORCE_ENABLED=true
    clusterconf -fu
  4. To have other nodes join the CAA cluster, use the following command on the active node with the newly created repository disk (a condensed sketch of steps 3 and 4 follows this list):
    clusterconf -p

    For AIX Version 7.1 with Technology Level 4, or later, you do not need to perform Step 3 and Step 4. After you complete Step 2, all nodes that were rebooted must wait for about 10 minutes to use the new repository disk.

  5. Verify that CAA is active by first using the lscluster -c command and then the lscluster -m command.
  6. Synchronize the PowerHA SystemMirror cluster configuration information about the newly created repository disk to all other nodes by selecting Cluster Nodes and Networks > Verify and Synchronize Cluster Configuration from the SMIT interface.
  7. Start PowerHA SystemMirror cluster services on all nodes (besides the first node where the repository disk was created) by selecting System Management (C-SPOC) > PowerHA SystemMirror Services > Start Cluster Services from the SMIT interface.
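
The following is a condensed sketch of steps 3 and 4. The node roles are as described in the steps above; you do not need these commands on AIX Version 7.1 with Technology Level 4, or later.
# On each node that cannot reach the old repository disk (the lscluster -m command reports the local node as DOWN):
export CAA_FORCE_ENABLED=true
clusterconf -fu                  # force removal of the old repository disk information
# On the active node that owns the newly created repository disk:
clusterconf -p                   # allow the other nodes to join the CAA cluster
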
Snapshot migration and repository disk

The snapshot migration process for an online cluster requires that the cluster information in the snapshot matches the online cluster information. This requirement also applies to repository disks. If you change a repository disk configuration, you must update the snapshot to reflect these changes and then complete the snapshot migration process.

Troubleshooting disk fencing

The disk fencing feature is available only for the quarantine policies in PowerHA SystemMirror.
Problem 1
Disk fencing is no longer needed for your environment. You can disable disk fencing and release the reservation for a disk or a volume group.
Solution
To disable disk fencing and release the reservation for a disk or a volume group, complete the following steps:
  1. Stop the cluster services on all cluster nodes.
  2. From the command line, enter smit sysmirror.
  3. From the SMIT interface, select Custom Cluster Configuration > Cluster Nodes and Networks > Initial Cluster Setup (Custom) > Configure Cluster Split and Merge Policy > Quarantine Policy > Disk Fencing, and press Enter.
  4. Specify No for the Disk Fencing field, and press Enter to save your changes.
  5. Verify and synchronize the cluster.
  6. Start the cluster services on all cluster nodes.
Problem 2
A resource group goes into an error state in an active cluster. The resource group is put into an error state because a node fails to register and put a reserve on a single volume group in the resource group.
Solution
To fix this problem with the resource group, complete the following steps:
  1. From the command line, enter smit sysmirror.
  2. From the SMIT interface, select Problem Determination Tools > Recover Resource Group from SCSI Persistent Reserve Error, and press Enter.
  3. Select the resource that is in an error state, and press Enter.
  4. From the SMIT interface, select System Management (C-SPOC) > Resource Group and Applications > Bring a Resource Group Online, and press Enter.
  5. Select the resource group that you want to bring back online, and press Enter.
Note: If the problem persists, contact IBM® support.
Problem 3
If the quarantine policy is Disk Fencing, PowerHA SystemMirror sets up the SCSI Persistent Reserve state for all shared disks when it is started. PowerHA SystemMirror also sets up the Persistent Reserve keys for all paths to the devices. If new or changed paths are added to a device later, the Persistent Reserve keys are not set up for those paths.
Solution
To update the SCSI Persistent Reserves for the disk paths that are new or changed, complete the following steps from the command line:
  • To release the existing disk paths reservations, run one of the following commands:
    clmgr modify physical_volume <disk> SCSIPR_ACTION=clear
    clmgr modify volume_group <vg> SCSIPR_ACTION=clear
    where disk is the name of a disk that is part of the volume group and vg is the name of the volume group.
  • To restore the reservations, stop the cluster services by using the unmanage option and then restart the cluster services. You must stop and start the cluster services on each cluster node by using the following commands (a per-node example follows the commands):
clmgr stop node MANAGE=unmanage
clmgr start node
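
For example, on each cluster node in turn you might run commands similar to the following. The node name nodeA is hypothetical, and the available clmgr options depend on your PowerHA SystemMirror release.
clmgr stop node nodeA MANAGE=unmanage    # stop cluster services on the node while leaving resources active but unmanaged
clmgr start node nodeA                   # restart cluster services so that the reservations are restored for the new paths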