ESS known issues

Known issues in ESS version 6.0.2.x

For information about ESS 5.3.7.x known issues, see Known issues in ESS 5.3.7.x Quick Deployment Guide.

The following table describes the known issues in IBM Elastic Storage® System (ESS) version 6.0.2.x and how to resolve these issues.
Issue | Resolution or action

The existence of the xcat repo files (xcat-otherpkgsX) might cause update issues.

If a PXE deployment was done recently, the xcat-otherpkgs{0,1,..X} repository files might exist and subsequently cause issues when you upgrade a node from the container by using the essrun command.

The following issue might occur:
rc: 1
  start: '2021-07-23 01:44:22.441566'
  stderr: |-
    Warning: failed loading '/etc/yum.repos.d/xCAT-otherpkgs2.repo', skipping.
    Error: No matching repo to modify: yum, /install/rhels8.2/ppc64le/BaseOS, for, repository, configured, xCAT.
Product
  • ESS 3000
  • ESS 5000
To fix this issue, complete the following steps:
  1. Log in to each ESS and remove the xcat repos.
    # cd /etc/yum.repos.d ; rm -f *xcat*
    yum clean all
    
  2. Rerun the upgrade.
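For example, a rerun of the upgrade from the container might look like the following (a hedged sketch; essio1 and essio2 are placeholders for your I/O node management names):

essrun -N essio1,essio2 update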
The IBM.ESAGENT subsystem fails to start due to a wrong JAVA_HOME value, which might cause the ESA startup to fail.
In the following example, note how Java™ is pointing to the wrong location. This causes the ESA startup to fail:
[root@ems1 alternatives]# ls -alt
total 20
drwxr-xr-x.   2 root root  4096 Nov 22 15:02 .
lrwxrwxrwx    1 root root    62 Nov 22 15:02 java -> /usr/lib/jvm/java-11-openjdk-11.0.ea.28-7.el7.ppc64le/bin/java
lrwxrwxrwx    1 root root    70 Nov 22 15:02 java.1.gz -> /usr/share/man/man1/java-java-11-openjdk-11.0.ea.28-7.el7.ppc64le.1.gz
lrwxrwxrwx    1 root root    61 Nov 22 15:02 jjs -> /usr/lib/jvm/java-11-openjdk-11.0.ea.28-7.el7.ppc64le/bin/jjs
Product
  • ESS 3000
To fix the issue, remove the current java symbolic link and update the java pointer, then retry ESA activation.
  1. Remove the current java symbolic link.
    # cd /etc/alternatives/
    # rm java
    rm: remove symbolic link ‘java’? y
    
  2. Update the java pointer.
    # ln -s /usr/lpp/mmfs/java java
    # ls -alt | grep -i java
    lrwxrwxrwx    1 root root    18 Nov 22 16:03 java -> /usr/lpp/mmfs/java
    
    # cd /opt/ibm/
    # ln -s /etc/alternatives/java java-ppc64le-80
    # ls -alt
    total 0
    drwxr-xr-x.  5 root      root       62 Nov 22 16:04 .
    lrwxrwxrwx   1 root      root       22 Nov 22 16:04 java-ppc64le-80 -> /etc/alternatives/java
    dr-xr-x---  12 root      root      151 Nov 22 15:48 esa
    drwxr-xr-x. 10 root      root      119 Nov  7 16:09 ..
    drwx------   8 scalemgmt scalemgmt 121 Nov  7 16:00 wlp
    drwxr-xr-x.  7 root      root       68 Nov  7 14:36 gss
    
    # vi /opt/ibm/esa/runtime/conf/javaHome.sh
    
    # cat /opt/ibm/esa/runtime/conf/javaHome.sh
    JAVA_HOME=/opt/ibm/java-ppc64le-80/jre
  3. Retry the ESA activation.
# /opt/ibm/esa/bin/activator -C -p 5024 -w -Y
The hardware CPU validation GPFS callback is active for only one node in the cluster.

This callback prevents GPFS from starting if a CPU socket is missing.

Product
  • ESS 3000
No action is required.
During rolling upgrade, mmhealth might show the error local_exported_fs_unavail even though the file system is still mounted.
Product
  • ESS 3000
  • ESS 5000

During a rolling upgrade (updating one ESS I/O node at a time while maintaining quorum), mmhealth might display an error indicating that the local exported file system is unavailable. This message is erroneous.


Component    Status    Status Change Reasons
------------------------------------------------------------
GPFS         HEALTHY   6 min. ago    -
NETWORK      HEALTHY   20 min. ago   -
FILESYSTEM   DEGRADED  18 min. ago   local_exported_fs_unavail(gpfs1)
DISK         HEALTHY   6 min. ago    -
NATIVE_RAID  HEALTHY   6 min. ago    -
PERFMON      HEALTHY   19 min. ago   -
THRESHOLD    HEALTHY   20 min. ago   -
The workaround is to restart mmsysmon on each node called out by mmhealth.
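A minimal sketch of that restart, assuming the IBM Spectrum Scale mmsysmoncontrol utility is available on the node, is:

mmsysmoncontrol restart

Run it on each node that mmhealth calls out, and then recheck the state with mmhealth node show.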
During an upgrade, if the container loses connection with the target canister(s) unexpectedly, there might be a timeout of up to 2 hours in the Ansible® update task.
Product
  • ESS 3000
Wait for the timeout and retry the essrun update task.
During a storage MES upgrade, you are required to update the drive firmware to complete the task. Some of the drives might not update on the first pass of running the command.
Product
  • ESS 3000
Rerun the mmchfirmware --type drive command, which should resolve the issue and update the remaining drives.
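For example, a second pass might simply repeat the command (a hedged sketch; you can also target specific nodes or a node class with the -N option):

mmchfirmware --type drive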
When running essrun commands, you might see messages such as these:
Thursday 16 April 2020 20:52:44 +0000
(0:00:00.572) 0:13:19.792 ********
Thursday 16 April 2020 20:52:45 +0000
(0:00:00.575) 0:13:20.367 ********
Thursday 16 April 2020 20:52:46 +0000
(0:00:00.577) 0:13:20.944 ********
Product
  • ESS 3000
  • ESS 5000
This is a restriction in the Ansible timestamp module. It shows timestamps even for the “skipped” tasks. If you want to remove timestamps from the output, change the ansible.cfg file inside the container as follows:
  1. vim /etc/ansible/ansible.cfg
  2. Remove ,profile_tasks on line 7.
  3. Save and quit: esc + :wq
When running the essrun config load command, you might see a failure such as this:
stderr: |-
rc=2 code=186
Failed to obtain the enclosure device
name with rc=2
rc=2 code=669
Product
  • ESS 3000
This failure means that the pems module is not running on the canister. To fix this, do the following:
  1. Log in to the failed canister and run the following commands:
    cd /install/ess/otherpkgs/rhels8/x86_64/gpfs
    yum reinstall gpfs.ess.platform.ess3k*
  2. When the installation finishes, wait until the lsmod | grep pems command returns output similar to this:
    pemsmod 188416 0
    scsi_transport_sas 45056 1 pemsmod
Running the essrun -N node1,node2,… config load command with high-speed names causes issues with the upgrade task that uses the -G flag.
Product
  • ESS 3000
  • ESS 5000
The essrun config load command is an Ansible wrapper that attempts to discover the ESS 3000 canister node positions, place them into groups, and fix the SSH keys between the servers. This command must always be run by using the low-speed or management names; you must not use the high-speed names with this command.

For example:

essrun -N ess3k1a,ess3k1b config load

If you run this command using the high-speed or cluster names, this might result in issues when performing the update task.

Example of what not to do:

essrun -N ess3k1a-hs,ess3k1b-hs config load

To confirm that the config run is set up correctly, use the lsdef command. It should return only the low-speed or management names that are defined in /etc/hosts.
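For example (an illustrative sketch; the node names are placeholders for your management names):

lsdef
ess3k1a  (node)
ess3k1b  (node)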

After a reboot of an ESS 5000 node, systemd might be loaded incorrectly.
Users might see the following error when trying to start GPFS:
Failed to activate service 'org.freedesktop.systemd1': timed out
Product
  • ESS 5000
Power off the system and then power it on again.
  1. Run the following command from the container:
    rpower <node name> off
  2. Wait for at least 30 seconds and run the following command to verify that the system is off:
    rpower <node name> status
  3. Restart the system with the following command:
    rpower <node name> on
In the ESS 5000 SLx series, if a hard drive is pulled out long enough for it to finish draining, the drive might not be recovered when you reinsert it.
Product
  • ESS 5000
Run the following command from the EMS or an I/O node to revive the drive:
mmvdisk pdisk change --rg RGName --pdisk PdiskName --revive

Where RGName is the recovery group that the drive belongs to and PdiskName is the drive's pdisk name.
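If you need to identify the recovery group and pdisk name first, a hedged sketch (assuming the standard mmvdisk listing options are available at your code level) is:

mmvdisk recoverygroup list
mmvdisk pdisk list --recovery-group all --not-ok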

After the deployment is complete, if firmware on the enclosure, drive, or HBA adapter does not match the expected level and you run essinstallcheck, the following mmvdisk settings related error message is displayed:
[ERROR] mmvdisk settings do NOT match best practices. 
Run mmvdisk server configure --verify --node-class  ess5k_ppc64le_mmvdisk to debug.  
Product
  • ESS 3000
  • ESS 5000

The error about the mmvdisk settings can be ignored. The resolution is to update the mismatched firmware levels on the enclosure, drive, or HBA adapter to the correct levels.

You can run the mmvdisk configuration check command to confirm whether the mmvdisk settings match best practices:

mmvdisk server configure --verify --node-class <nodeclass>

To list the mmvdisk node classes, run:

mmvdisk nc list
Note: essinstallcheck detects inconsistencies from mmvdisk best practices for all node classes in the cluster and stops immediately if an issue is found.
When running essinstallcheck you might see an error message similar to:
System Firmware could not be obtained which will lead to a false-positive PASS message when the script completes.
Product
  • ESS 5000

Run vpdupdate on each I/O node.

Rerun essinstallcheck, which should then properly query the firmware level.
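A minimal sketch of that sequence, run locally on each I/O node, is:

vpdupdate
essinstallcheck -N localhost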
When running the essrun -N Node healthcheck command, the essinstallcheck script might fail due to incorrect error verification, which might give the impression that there is a problem where there is none.

Command:

essrun -N <node> healthcheck
Product
  • ESS 3000
  • ESS 5000
This health check command (essrun -N Node healthcheck) has been removed from the ESS documentation, and it is advised to use the manual commands to verify system health after deployment. Run the following commands for a health check:
  • gnrhealthcheck
  • mmhealth node show -a
  • essinstallcheck -N localhost
    Note: This command needs to be run on each node.
During command-less disk replacement, there is a limit on how many disks can be replaced at one time.
Product
  • ESS 3000
  • ESS 5000
For command-less disk replacement, replace only up to 2 disks at a time. If command-less disk replacement is enabled and more than 2 disks are replaceable, replace the first 2 disks, and then use the commands to replace the 3rd and subsequent disks.
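For the disks that you replace with commands, a hedged sketch (RGName and PdiskName are placeholders for the recovery group and pdisk that are reported as replaceable) is:

mmvdisk pdisk list --recovery-group all --replace
mmvdisk pdisk replace --recovery-group RGName --pdisk PdiskName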
Issue reported with command-less disk replacement warning LEDs.
Product
  • ESS 5000
The replaceable disk has its amber LED turned on, but not blinking. Disk replacement should still succeed.
After upgrading an ESS 3000 node to version 6.0.2.6, the pmsensors service needs to be manually started.
Product
  • ESS 3000
After the ESS 3000 upgrade is complete, the pmsensors service does not automatically start. You must manually start the service for performance monitoring to be restored. On each ESS 3000 canister, run the following command:
systemctl start pmsensors
To check the status of the service, run the following command:
systemctl status --no-pager pmsensors
ESS commands such as essstoragequickcheck and essinstallcheck must be run by using -N localhost. If a hostname such as -N ess3k1a is used, an error occurs.
Product
  • ESS 3000
  • ESS 5000
There is currently an issue with running the ESS deployment commands by using the hostname of a node. The workaround is to run checks locally on each node by using localhost. For example, instead of using essstoragequickcheck -N ess3k1a, use the following command:
essstoragequickcheck -N localhost
Hyperthreading might be enabled on an ESS 3000 system due to an incorrect kernel grub flag being set.
Product
  • ESS 3000
Hyperthreading needs to be disabled on ESS 3000 systems. This is ensured in the following ways:
  • Disabled in BIOS
  • Disabled using the tuned profile
  • Disabled using the grub command line
When disabled with the grub command line, the issue occurs because the grub configuration had an incorrect flag set in earlier versions. To resolve this issue, do the following:
  1. Edit the /etc/grub2.cfg file to replace nohup with nosmt.
    Before change:
    set default_kernelopts="root=UUID=9a4a93b8-2e6b-4ba6-bda4-a7f8c3cb908f 
    ro nvme.sgl_threshold=0 sshd=1 pcie_ports=native nohup 
    resume=UUID=c939121b-526a-4d44-8d33-693f2fb7f018 
    rd.md.uuid=f6dbf6f2:8ac82ed6:875ca663:0094ac11 
    rd.md.uuid=06c2d5b0:c6603a1e:5df4b4d3:98fd5adc rhgb quiet crashkernel=4096M"
    After change:
    set default_kernelopts="root=UUID=9a4a93b8-2e6b-4ba6-bda4-a7f8c3cb908f 
    ro nvme.sgl_threshold=0 sshd=1 pcie_ports=native nosmt 
    resume=UUID=c939121b-526a-4d44-8d33-693f2fb7f018 
    rd.md.uuid=f6dbf6f2:8ac82ed6:875ca663:0094ac11 
    rd.md.uuid=06c2d5b0:c6603a1e:5df4b4d3:98fd5adc rhgb quiet crashkernel=4096M"
  2. Reboot the node for the changes to take effect.
The main change is replacing the nohup option with nosmt.
Note: After you make the change, reboot the node.
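After the reboot, you can optionally verify that hyperthreading is disabled, for example:

lscpu | grep -i 'thread(s) per core'

A value of 1 thread per core indicates that SMT is off.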
The ESS 3000 container contains the rhels8.2-ppc64le-install-ces image. However, PXE installation cannot be done by using this image because the repository is not created in the container.
An example is as follows:
root@cems0legacy:/ # lsdef -t osimage
rhels7.9-ppc64le-install-ems  (osimage)
rhels8.2-ppc64le-install-ces  (osimage)
rhels8.2-ppc64le-install-ems  (osimage)
rhels8.2-x86_64-install  (osimage)
ESS 3000 CONTAINER root@cems0legacy:/ #
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
P8 protocol node update is not supported.
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
With the ESS 5000 container and a P9 I/O node, PXE install is not supported.
Product
  • ESS 5000
This issue has been resolved in the 6.1.1.1 build.
For the ESS 3000 container on a P9 EMS node, PXE install is not supported on a P9 protocol node.
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
In an existing cluster where the number of quorum nodes does not exceed 7, the addition of new nodes fails irrespective of the firmware level.
Product
  • ESS 3000
  • ESS 5000
This is not considered a problem; thus, no workaround is needed.