ESS known issues

Known issues in ESS version 6.0.2.x

For information about ESS 5.3.7.x known issues, see Known issues in ESS 5.3.7.x Quick Deployment Guide.

The following table describes the known issues in IBM Elastic Storage® System (ESS) version 6.0.2.x and how to resolve these issues.
Issue | Resolution or action

The existence of the xcat repo files (xcat-otherpkgsX) might cause update issues.

If a PXE deployment was done recently, the xcat-otherpkgs{0,1,..X} repository files might exist and subsequently cause issues when you upgrade a node from the container by using the essrun command.

The following issue might occur:
rc: 1
  start: '2021-07-23 01:44:22.441566'
  stderr: |-
    Warning: failed loading '/etc/yum.repos.d/xCAT-otherpkgs2.repo', skipping.
    Error: No matching repo to modify: yum, /install/rhels8.2/ppc64le/BaseOS, for, repository, configured, xCAT.
Product
  • ESS 3000
  • ESS 5000
To fix this issue, complete the following steps:
  1. Log in to each ESS and remove the xcat repos.
    # cd /etc/yum.repos.d ; rm -f *xcat*
    yum clean all
    
  2. Rerun the upgrade.
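For example, a rerun of the upgrade from the container might look like the following (a hedged sketch; essio1 and essio2 are placeholders for your I/O node management names):

essrun -N essio1,essio2 update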
The IBM.ESAGENT subsystem fails to start due to a wrong JAVA_HOME value, which might cause the ESA startup to fail.
In the following example, note how Java™ is pointing to the wrong location. This causes the ESA startup to fail:
[root@ems1 alternatives]# ls -alt
total 20
drwxr-xr-x.   2 root root  4096 Nov 22 15:02 .
lrwxrwxrwx    1 root root    62 Nov 22 15:02 java -> /usr/lib/jvm/java-11-openjdk-11.0.ea.28-7.el7.ppc64le/bin/java
lrwxrwxrwx    1 root root    70 Nov 22 15:02 java.1.gz -> /usr/share/man/man1/java-java-11-openjdk-11.0.ea.28-7.el7.ppc64le.1.gz
lrwxrwxrwx    1 root root    61 Nov 22 15:02 jjs -> /usr/lib/jvm/java-11-openjdk-11.0.ea.28-7.el7.ppc64le/bin/jjs
Product
  • ESS 3000
To fix the issue, remove the current java symbolic link and update the java pointer, then retry ESA activation.
  1. Remove the current java symbolic link.
    # cd /etc/alternatives/
    # rm java
    rm: remove symbolic link ‘java’? y
    
  2. Update the java pointer.
    # ln -s /usr/lpp/mmfs/java java
    # ls -alt | grep -i java
    lrwxrwxrwx    1 root root    18 Nov 22 16:03 java -> /usr/lpp/mmfs/java
    
    # cd /opt/ibm/
    # ln -s /etc/alternatives/java java-ppc64le-80
    # ls -alt
    total 0
    drwxr-xr-x.  5 root      root       62 Nov 22 16:04 .
    lrwxrwxrwx   1 root      root       22 Nov 22 16:04 java-ppc64le-80 -> /etc/alternatives/java
    dr-xr-x---  12 root      root      151 Nov 22 15:48 esa
    drwxr-xr-x. 10 root      root      119 Nov  7 16:09 ..
    drwx------   8 scalemgmt scalemgmt 121 Nov  7 16:00 wlp
    drwxr-xr-x.  7 root      root       68 Nov  7 14:36 gss
    
    # vi /opt/ibm/esa/runtime/conf/javaHome.sh
    
    # cat /opt/ibm/esa/runtime/conf/javaHome.sh
    JAVA_HOME=/opt/ibm/java-ppc64le-80/jre
  3. Retry the ESA activation.
# /opt/ibm/esa/bin/activator -C -p 5024 -w -Y
The hardware CPU validation GPFS callback is active for only one node in the cluster.

This callback prevents GPFS from starting if a CPU socket is missing.

Product
  • ESS 3000
No action is required.
During rolling upgrade, mmhealth might show the error local_exported_fs_unavail even though the file system is still mounted.
Product
  • ESS 3000
  • ESS 5000

During a rolling upgrade (updating one ESS I/O node at a time while maintaining quorum), mmhealth might display an error indicating that the local exported file system is unavailable. This message is erroneous.


Component    Status    Status Change Reasons
------------------------------------------------------------
GPFS         HEALTHY   6 min. ago    -
NETWORK      HEALTHY   20 min. ago   -
FILESYSTEM   DEGRADED  18 min. ago   local_exported_fs_unavail(gpfs1)
DISK         HEALTHY   6 min. ago    -
NATIVE_RAID  HEALTHY   6 min. ago    -
PERFMON      HEALTHY   19 min. ago   -
THRESHOLD    HEALTHY   20 min. ago   -
The workaround is to restart mmsysmon on each node called out by mmhealth.
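A minimal sketch of that restart, assuming the IBM Spectrum Scale mmsysmoncontrol utility is available on the node, is:

mmsysmoncontrol restart

Run it on each node that mmhealth calls out, and then recheck the state with mmhealth node show.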
During an upgrade, if the container loses connection with the target canister(s) unexpectedly, there might be a timeout of up to 2 hours in the Ansible® update task.
Product
  • ESS 3000
Wait for the timeout and retry the essrun update task.
During a storage MES upgrade, you are required to update the drive firmware to complete the task. Some of the drives might not update on the first pass of running the command.
Product
  • ESS 3000
Rerun the mmchfirmware --type drive command, which should resolve the issue and update the remaining drives.
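For example, a second pass might simply repeat the command (a hedged sketch; you can also target specific nodes or a node class with the -N option):

mmchfirmware --type drive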
When running essrun commands, you might see messages such as these:
Thursday 16 April 2020 20:52:44 +0000
(0:00:00.572) 0:13:19.792 ********
Thursday 16 April 2020 20:52:45 +0000
(0:00:00.575) 0:13:20.367 ********
Thursday 16 April 2020 20:52:46 +0000
(0:00:00.577) 0:13:20.944 ********
Product
  • ESS 3000
  • ESS 5000
This is a restriction in the Ansible timestamp module. It shows timestamps even for the “skipped” tasks. If you want to remove timestamps from the output, change the ansible.cfg file inside the container as follows:
  1. vim /etc/ansible/ansible.cfg
  2. Remove ,profile_tasks on line 7.
  3. Save and quit: esc + :wq
When running the essrun config load command, you might see a failure such as this:
stderr: |-
rc=2 code=186
Failed to obtain the enclosure device
name with rc=2
rc=2 code=669
Product
  • ESS 3000
This failure means that the pems module is not running on the canister. To fix this, do the following:
  1. Log in to the failed canister and run the following commands:
    cd /install/ess/otherpkgs/rhels8/x86_64/gpfs
    yum reinstall gpfs.ess.platform.ess3k*
  2. When the installation finishes, wait until the lsmod | grep pems command returns output similar to this:
    pemsmod 188416 0
    scsi_transport_sas 45056 1 pemsmod
Running the essrun -N node1,node2,… config load command with high-speed names causes issues with the upgrade task that uses the -G flag.
Product
  • ESS 3000
  • ESS 5000
The essrun config load command is an Ansible wrapper that attempts to discover the ESS 3000 canister node positions, place them into groups, and fix the SSH keys between the servers. This command must always be run by using the low-speed or management names; you must not use the high-speed names with this command.

For example:

essrun -N ess3k1a,ess3k1b config load

If you run this command using the high-speed or cluster names, this might result in issues when performing the update task.

Example of what not to do:

essrun -N ess3k1a-hs,ess3k1b-hs config load

To confirm that the config run is set up correctly, use the lsdef command. It should return only the low-speed or management names that are defined in /etc/hosts.
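For example (an illustrative sketch; the node names are placeholders for your management names):

lsdef
ess3k1a  (node)
ess3k1b  (node)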

After a reboot of an ESS 5000 node, systemd might be loaded incorrectly.
Users might see the following error when trying to start GPFS:
Failed to activate service 'org.freedesktop.systemd1': timed out
Product
  • ESS 5000
Power off the system and then power it on again.
  1. Run the following command from the container:
    rpower <node name> off
  2. Wait for at least 30 seconds and run the following command to verify that the system is off:
    rpower <node name> status
  3. Restart the system with the following command:
    rpower <node name> on
In the ESS 5000 SLx series, if a hard drive is pulled out long enough for it to finish draining, the drive might not be recovered when you reinsert it.
Product
  • ESS 5000
Run the following command from the EMS or an I/O node to revive the drive:
mmvdisk pdisk change --rg RGName --pdisk PdiskName --revive

Where RGName is the recovery group that the drive belongs to and PdiskName is the drive's pdisk name.
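If you need to identify the recovery group and pdisk name first, a hedged sketch (assuming the standard mmvdisk listing options are available at your code level) is:

mmvdisk recoverygroup list
mmvdisk pdisk list --recovery-group all --not-ok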

After the deployment is complete, if firmware on the enclosure, drive, or HBA adapter does not match the expected level and you run essinstallcheck, the following mmvdisk settings related error message is displayed:
[ERROR] mmvdisk settings do NOT match best practices. 
Run mmvdisk server configure --verify --node-class  ess5k_ppc64le_mmvdisk to debug.  
Product
  • ESS 3000
  • ESS 5000

The error about the mmvdisk settings can be ignored. The resolution is to update the mismatched firmware levels on the enclosure, drive, or HBA adapter to the correct levels.

You can run the mmvdisk configuration check command to confirm whether the mmvdisk settings match best practices:

mmvdisk server configure --verify --node-class <nodeclass>

To list the mmvdisk node classes, run:

mmvdisk nc list
Note: essinstallcheck detects inconsistencies from mmvdisk best practices for all node classes in the cluster and stops immediately if an issue is found.
When running essinstallcheck you might see an error message similar to:
System Firmware could not be obtained which will lead to a false-positive PASS message when the script completes.
Product
  • ESS 5000

Run vpdupdate on each I/O node.

Rerun essinstallcheck, which should then properly query the firmware level.
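A minimal sketch of that sequence, run locally on each I/O node, is:

vpdupdate
essinstallcheck -N localhost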
When running the essrun -N Node healthcheck command, the essinstallcheck script might fail due to incorrect error verification, which might give the impression that there is a problem where there is none.

Command:

essrun -N <node> healthcheck
Product
  • ESS 3000
  • ESS 5000
This health check command (essrun -N Node healthcheck) has been removed from the ESS documentation, and it is advised to use the manual commands to verify system health after deployment. Run the following commands for a health check:
  • gnrhealthcheck
  • mmhealth node show -a
  • essinstallcheck -N localhost
    Note: This command needs to be run on each node.
During command-less disk replacement, there is a limit on how many disks can be replaced at one time.
Product
  • ESS 3000
  • ESS 5000
For command-less disk replacement, replace only up to 2 disks at a time. If command-less disk replacement is enabled and more than 2 disks are replaceable, replace the first 2 disks, and then use the commands to replace the 3rd and subsequent disks.
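For the disks that you replace with commands, a hedged sketch (RGName and PdiskName are placeholders for the recovery group and pdisk that are reported as replaceable) is:

mmvdisk pdisk list --recovery-group all --replace
mmvdisk pdisk replace --recovery-group RGName --pdisk PdiskName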
Issue reported with command-less disk replacement warning LEDs.
Product
  • ESS 5000
The replaceable disk has its amber LED turned on, but not blinking. Disk replacement should still succeed.
After upgrading an ESS 3000 node to version 6.0.2.6, the pmsensors service needs to be manually started.
Product
  • ESS 3000
After the ESS 3000 upgrade is complete, the pmsensors service does not automatically start. You must manually start the service for performance monitoring to be restored. On each ESS 3000 canister, run the following command:
systemctl start pmsensors
To check the status of the service, run the following command:
systemctl status --no-pager pmsensors
ESS commands such as essstoragequickcheck and essinstallcheck must be run by using -N localhost. If a hostname such as -N ess3k1a is used, an error occurs.
Product
  • ESS 3000
  • ESS 5000
There is currently an issue with running the ESS deployment commands by using the hostname of a node. The workaround is to run checks locally on each node by using localhost. For example, instead of using essstoragequickcheck -N ess3k1a, use the following command:
essstoragequickcheck -N localhost
Hyperthreading might be enabled on an ESS 3000 system due to an incorrect kernel grub flag being set.
Product
  • ESS 3000
Hyperthreading needs to be disabled on ESS 3000 systems. This is ensured in the following ways:
  • Disabled in BIOS
  • Disabled using the tuned profile
  • Disabled using the grub command line
When disabled with the grub command line, the issue occurs because the grub configuration had an incorrect flag set in earlier versions. To resolve this issue, do the following:
  1. Edit the /etc/grub2.cfg file to replace nohup with nosmt.
    Before change:
    set default_kernelopts="root=UUID=9a4a93b8-2e6b-4ba6-bda4-a7f8c3cb908f 
    ro nvme.sgl_threshold=0 sshd=1 pcie_ports=native nohup 
    resume=UUID=c939121b-526a-4d44-8d33-693f2fb7f018 
    rd.md.uuid=f6dbf6f2:8ac82ed6:875ca663:0094ac11 
    rd.md.uuid=06c2d5b0:c6603a1e:5df4b4d3:98fd5adc rhgb quiet crashkernel=4096M"
    After change:
    set default_kernelopts="root=UUID=9a4a93b8-2e6b-4ba6-bda4-a7f8c3cb908f 
    ro nvme.sgl_threshold=0 sshd=1 pcie_ports=native nosmt 
    resume=UUID=c939121b-526a-4d44-8d33-693f2fb7f018 
    rd.md.uuid=f6dbf6f2:8ac82ed6:875ca663:0094ac11 
    rd.md.uuid=06c2d5b0:c6603a1e:5df4b4d3:98fd5adc rhgb quiet crashkernel=4096M"
  2. Reboot the node for the changes to take effect.
The main change is replacing the nohup option with nosmt.
Note: After you make the change, reboot the node.
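After the reboot, you can optionally verify that hyperthreading is disabled, for example:

lscpu | grep -i 'thread(s) per core'

A value of 1 thread per core indicates that SMT is off.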
The ESS 3000 container contains the rhels8.2-ppc64le-install-ces image. However, PXE installation cannot be done by using this image because the repository is not created in the container.
An example is as follows:
root@cems0legacy:/ # lsdef -t osimage
rhels7.9-ppc64le-install-ems  (osimage)
rhels8.2-ppc64le-install-ces  (osimage)
rhels8.2-ppc64le-install-ems  (osimage)
rhels8.2-x86_64-install  (osimage)
ESS 3000 CONTAINER root@cems0legacy:/ #
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
P8 protocol node update is not supported.
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
With the ESS 5000 container and a P9 I/O node, PXE install is not supported.
Product
  • ESS 5000
This issue has been resolved in the 6.1.1.1 build.
For the ESS 3000 container on a P9 EMS node, PXE install is not supported on a P9 protocol node.
Product
  • ESS 3000
This issue has been resolved in the 6.1.1.1 build.
In an existing cluster where the number of quorum nodes does not exceed 7, the addition of new nodes fails irrespective of the firmware level.
Product
  • ESS 3000
  • ESS 5000
This is not considered a problem; thus, no workaround is needed.