Known issues

This topic describes known issues for ESS.

ESS 5.3.7.6 issues

The following table describes known issues in ESS 5.3.7.6 and how to resolve them. Depending on which fix level you are installing, these issues might or might not apply to you.
Table 1. Known issues in ESS 5.3.7.6
Issue Environment affected Description Resolution or action
The gssgennetworks script requires high-speed host names to be derived from I/O server (xCAT) host names using suffix, prefix, or both.

High-speed network generation

Type: Install

Version: All

Arch: Both

Affected nodes: All

gssgennetworks requires that the target host names provided with the -N or -G option are reachable in order to create the high-speed network on the target nodes. If the xCAT node name does not contain the same base name as the high-speed name, you might be affected by this issue. A typical deployment scenario is:
gssio1 // xCAT name
gssio1-hs // high-speed
An issue scenario is:
gssio1 // xCAT name
foo1abc-hs // high-speed name
Create entries in the /etc/hosts file with node names that are reachable over the management network, such that the high-speed host names can be derived from them using some combination of suffix and prefix. For example, if the high-speed host names are foo1abc-hs and goo1abc-hs:
  1. Add foo1 and goo1 to /etc/hosts on the EMS node only, using their (reachable) management network addresses.
  2. Use: gssgennetworks -N foo1,goo1 --suffix=abc-hs --create-bond
  3. Remove the foo1 and goo1 entries from the /etc/hosts file on the EMS node once the high-speed networks are created.
Example of how to fix (/etc/hosts):
// Before
<IP> <Long Name> <Short Name>
192.168.40.21 gssio1.gpfs.net gssio1
192.168.40.22 gssio2.gpfs.net gssio2
X.X.X.X foo1abc-hs.gpfs.net foo1abc-hs
X.X.X.Y goo1abc-hs.gpfs.net goo1abc-hs
// Fix
192.168.40.21 gssio1.gpfs.net gssio1 foo1
192.168.40.22 gssio2.gpfs.net gssio2 goo1
X.X.X.X foo1abc-hs.gpfs.net foo1abc-hs
X.X.X.Y goo1abc-hs.gpfs.net goo1abc-hs
gssgennetworks -N foo1,goo1 --suffix=abc-hs --create-bond
Running essutils over PuTTY might show horizontal lines as “qqq” and vertical lines as “xxx”.

ESS Install and Deployment Toolkit

Type: Install or Upgrade

Version: All

Arch: Both

Affected nodes: All

The PuTTY default remote character set (UTF-8) might not translate the horizontal and vertical line character sets correctly.
  1. In the PuTTY configuration, under Window > Translation, change Remote character set from UTF-8 to ISO-8859-1:1998 (Latin-1, West Europe) (the first option after UTF-8).
  2. Open the session.
gssinstallcheck might flag an error regarding page pool size in multi-building block situations if the physical memory sizes differ.

Software Validation

Type: Install or Upgrade

Arch: Both

Version: All

Affected nodes: I/O server nodes

gssinstallcheck is a tool, introduced in ESS 3.5, that helps validate software, firmware, and configuration settings. If you add (or install) building blocks with a different memory footprint, gssinstallcheck flags this as an error. Best practice is that all of your I/O server nodes have the same memory footprint, and thus the same pagepool value. The pagepool is currently set to approximately 60% of the physical memory of each I/O server node.

Example from gssinstallcheck: [ERROR] pagepool: found 142807662592 expected range 147028338278 - 179529339371

1. Confirm each I/O server node's individual memory footprint.
From the EMS node, run the following command against your I/O xCAT group: xdsh gss_ppc64 "cat /proc/meminfo | grep MemTotal"
Note: This value is in KB.

If the physical memory varies between servers or building blocks, consider adding memory and recalculating the pagepool to ensure consistency.
2. Validate the pagepool settings in IBM Spectrum Scale: mmlsconfig | grep -A 1 pagepool
Note: This value is in MB.
If the pagepool setting is not roughly 60% of physical memory, consider recalculating and setting an updated value, as outlined in the sketch that follows. For information about how to update the pagepool value, see the IBM Spectrum Scale documentation.
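A minimal sketch of the recalculation, assuming an I/O server node named gssio1 (a hypothetical name) and a target of roughly 60% of physical memory; verify the exact percentage for your ESS level before applying:
# MemTotal is reported in KB; compute ~60% and convert to MiB
memtotal_kb=$(ssh gssio1 "awk '/MemTotal/ {print \$2}' /proc/meminfo")
pagepool_mib=$(( memtotal_kb * 60 / 100 / 1024 ))
mmchconfig pagepool=${pagepool_mib}M -N gssio1
mmlsconfig pagepool
# The new pagepool value takes effect after GPFS is restarted on the node.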
Creating small file systems in the GUI (below 16G) will result in incorrect sizes

GUI

Type: Install or Upgrade

Arch: Both

Version: All

Affected nodes: All

When creating file systems smaller than 16 GB in the GUI (usually done to create CES_ROOT for protocol nodes), the resulting size is larger than expected.

There is currently no resolution. The smallest size you might be able to create is 16 GB. Experienced users might consider creating a custom vdisk stanza file for the specific sizes that they require.

You can try one of the following workarounds:
  • Use three-way replication on the GUI when creating small file systems.
  • Use gssgenvdisks, which supports the creation of small file systems, especially for CES_ROOT purposes (refer to the --crcesfs flag). A sketch follows this list.
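A minimal sketch of the gssgenvdisks approach; the flags other than --crcesfs reflect typical usage and should be verified with gssgenvdisks --help at your ESS level:
gssgenvdisks --create-vdisk --create-filesystem --crcesfs
This creates a small shared-root file system suitable for CES without hand-editing a vdisk stanza.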
Canceling disk replacement through GUI leaves original disk in unusable state

GUI

Type: Install or Upgrade

Arch: Both

Version: All

Affected nodes: I/O server nodes

Canceling a disk replacement can lead to an unstable system state and must not be performed. Do not cancel a disk replacement from the GUI. However, if you did, use the following command to recover the disk to the ok state:

mmchpdisk <RG> --pdisk <pdisk> --resume

During firmware upgrades on PPC64LE, update_flash might show the following warning:

Unit kexec.service could not be found.

Firmware

Type: Installation or Upgrade

Arch: Little Endian

Version: All

Affected nodes: N/A

  This warning can be ignored.
Infiniband with multiple fabric is not supported with gssgennetworks

Type: Install and Upgrade

Arch: Both

Version: All

In a multiple fabric network, the Infiniband fabric ID might not be properly appended to the verbsPorts configuration statement during cluster creation. An incorrect verbsPorts setting might cause an outage of the IB network.
Do the following to ensure that the verbsPorts setting is accurate:
  1. Use gssgennetworks to properly set up IB or Ethernet bonds on the ESS system.
  2. Create a cluster. During cluster creation, the verbsPorts setting is applied, and there is a probability that the IB network becomes unreachable if multiple fabrics are set up during the cluster deployment.
  3. Ensure that the GPFS daemon is running and then run the mmfsadm test verbs config | grep verbsPorts command.
# mmfsadm test verbs config | grep verbsPorts
mmfs verbsPorts: mlx5_0/1/4 mlx5_1/1/7
In this example, port 1 of adapter mlx5_0 is connected to fabric 4 and port 1 of adapter mlx5_1 is connected to fabric 7. Run the following command and ensure that verbsPorts is correctly configured in the GPFS cluster.
# mmlsconfig | grep verbsPorts
verbsPorts mlx5_0/1 mlx5_1/1
Here, it can be seen that the fabric is not included in the configuration even though IB was configured with multiple fabrics. This is a known issue.
To work around it, modify the verbsPorts setting for each node or node class to take the fabric (subnet) into account.
# verbsPorts="$(echo $(mmfsadm test verbs config | grep verbsPorts | awk '{ $1=""; $2=""; $3=""; print $0 }'))"
# echo $verbsPorts
mlx5_0/1/4 mlx5_1/1/7
# mmchconfig verbsPorts="$verbsPorts" -N gssio1
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all affected nodes.
This is an asynchronous process.
Here, the node can be any GPFS node or node class.
Thereafter, verify that the new, correct verbsPorts setting is listed in the output.
# mmlsconfig | grep verbsPorts
verbsPorts mlx5_0/1/4 mlx5_1/1/7
A failed disk's state is not changed to drained or replace in new enclosures that were added by the MES procedure and have never been used for any file system.

Type: IBM Spectrum Scale RAID

Arch: Little Endian

Version: ESS 5.3.7

Affected Nodes: N/A

If a user runs mmvdisk pdisk change --simulate-failing to fail two pdisks in new enclosures that were added by using the MES procedure and have never been used for any file system, the state of the second pdisk stays at simulate-failing. The GUI then cannot detect that the second failed disk is replaceable, and replacement from the command line also fails because of the state of the disk.
Run mmchpdisk --diagnose on the failing disk. Or, run mmshutdown and mmstartup on the I/O server node that serves the recovery group that the simulate-failing pdisk belongs to.
Important: This workaround causes a failover.
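A minimal sketch of the two options, using hypothetical names (recovery group rg_gssio1-hs, pdisk e5s01, I/O server node gssio1-hs):
# Option 1: re-diagnose the failing pdisk
mmchpdisk rg_gssio1-hs --pdisk e5s01 --diagnose
# Option 2: restart GPFS on the serving I/O server node (causes a failover)
mmshutdown -N gssio1-hs
mmstartup -N gssio1-hs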
Cable pulls might result in I/O hang and application failure.

Type: IBM Spectrum Scale RAID

Arch: Both

Version: ESS 5.3.7

Affected Nodes: I/O server nodes

If SAS cables are pulled, I/O might hang for an extended period of time during RG or path recovery, which could lead to application failures. Change nsdRAIDEventLogShortTermDelay to 30 ms (the default is 3000 ms):
  1. Run mmchconfig nsdRAIDEventLogShortTermDelay=30.
  2. Restart GPFS.
gssinstall_<arch> and gssinstallcheck report NOT_INST for a few GPFS group RPMs.

Type: Deployment

Arch: Little Endian

Version: ESS 5.3.7

Affected Nodes: ALL

By default, deployment does not install RPMs for file audit logging support. This is the expected behavior.
If the file audit logging feature is required, you can manually install these packages from the GPFS repository on the EMS node (a sketch follows this list):
  • gpfs.kafka
  • gpfs.libkafka
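A minimal sketch, assuming the EMS yum repository is already configured on the I/O server nodes and that the xCAT group is named gss_ppc64 as in the earlier examples; the package names are as listed above:
xdsh gss_ppc64 "yum install -y gpfs.kafka gpfs.libkafka"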
When doing a disk revive operation, you might see it fail with a return code 5.

Type: RAS

Arch: Both

Version: ESS 5.3.7

Affected Nodes: N/A

If a pdisk fails (or is simulated to fail), a user might perform a revive operation by using mmchpdisk. This command fails with a return code of 5.

Command: err 5:
tschpdisk --recovery-group rg_essio52-ce --pdisk e2s024 --state ok
2019-02-25_10:11:29.742-0600: Input/output error
No workaround is required. Although the command shows an error and a bad return code, the operation works (the pdisk is revived) after a few minutes.

When running gssinstallcheck, the system profile setting is reported as activated instead of the expected setting scale

Type: RAS

Arch: Both

Version: ESS 5.3.7

Affected Nodes: N/A

Users run gssinstallcheck to verify various settings on an ESS 5.3.7 deployment or upgrade. In rare situations, the output of the system profile check is incorrect. This happens when the tuned service running on the node has hit a problem.

To verify that tuned has failed, run:
systemctl status tuned

The workaround is to restart the tuned service on the failed node and rerun gssinstallcheck (or manually check by using tuned-adm verify).

To restart the service, run:
systemctl restart tuned
During enclosure firmware updates, GPFS might crash and the file system might show a stale file handle.

Type: Upgrade

Arch: Both

Version: ESS 5.3.7

Affected Nodes: I/O

During the enclosure firmware upgrade, you might see a GPFS crash with signal 11 (visible in /var/adm/ras/mmfs.log.latest).

The current workaround is to restart GPFS on the affected nodes.

Deleting a loghome vdisk and recreating it without deleting the corresponding recovery group might lead to loss of access or loss of data.

Type: Recovery usage

Arch: Both

Version: All

Affected Nodes: I/O

A loghome vdisk is created when a recovery group is created using the gssgenclusterrgs script. If you delete the loghome vdisk, the corresponding recovery group must also be deleted. If a loghome vdisk is recreated without deleting the corresponding recovery group, a rare race condition could arise in which key metadata is not properly flushed to the disks associated with the recovery group of the loghome. This might lead to internal metadata inconsistencies, resulting in loss of access or loss of data.

Ensure that if a loghome vdisk is deleted, the corresponding recovery group is also deleted. After the recovery group is deleted, you can recreate it by using gssgenclusterrgs and proceed with normal usage.
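A minimal sketch of the cleanup order, using a hypothetical recovery group name rg_gssio1-hs; the gssgenclusterrgs options shown are illustrative and should match your deployment:
# Delete the recovery group that the deleted loghome vdisk belonged to
mmdelrecoverygroup rg_gssio1-hs
# Recreate the recovery groups with the wrapper script and continue as normal
gssgenclusterrgs -G gss_ppc64 --suffix=-hs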
The Pools page of the ESS GUI might indicate "No capacity available." for the given system.

Type: GUI

Versions: All

Arch: Both

Affected nodes: All

In certain scenarios during a deployment or an upgrade, the ESS GUI might show on the Pools page that no capacity is available. Wait up to 24 hours for all GUI refresh tasks to update. If you still see the problem, you can obtain the correct capacity from the command line, as sketched below.
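A minimal sketch of checking capacity from the command line, assuming a file system named gpfs0 (a placeholder):
# Show free and total capacity per pool for the file system
mmdf gpfs0
# Or list defined vdisk sets and their sizes at the IBM Spectrum Scale RAID level
mmvdisk vdiskset list --vdisk-set all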
gssgenclusterrgs might fail because mmvdisk node class already exists.

Type: Install

Versions: All

Arch: Both

Affected nodes: I/O

When performing a new installation, the creation of the recovery groups might fail due to a timing conflict with the mmvdisk node class generation. This occurs when using the wrapper scripts, which now use mmvdisk by default.

New installations only (not for adding building blocks)

Check if the mmvdisk node class exists:
mmvdisk nc list
If so, do the following (a sketch follows this list):
  1. Unconfigure the servers: mmvdisk server unconfigure --node-class <class>
  2. Delete the node class: mmvdisk nodeclass delete --node-class <class>
  3. Try gssgenclusterrgs again.
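A minimal sketch of the recovery steps, using a hypothetical node class name ess_ppc64le_mmvdisk; the gssgenclusterrgs options are illustrative and should match your deployment:
mmvdisk nodeclass list
mmvdisk server unconfigure --node-class ess_ppc64le_mmvdisk
mmvdisk nodeclass delete --node-class ess_ppc64le_mmvdisk
gssgenclusterrgs -G gss_ppc64 --suffix=-hs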
gssgenclusterrgs might fail due to longer than expected recovery group names.

Type: Install/Add

Versions: All

Arch: Both

Affected nodes: I/O

The gssgenclusterrgs command might fail because mmvdisk does not support names over a certain length.

Use mmvdisk commands directly and create recovery groups with shorter names, as sketched below.
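A minimal sketch, assuming an existing mmvdisk node class named nc1 and a server pair for which the shorter names rg1 and rg2 are chosen; verify the syntax against the mmvdisk documentation for your level:
mmvdisk recoverygroup create --recovery-group rg1,rg2 --node-class nc1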
gssgenvdisks does not work properly if using --vdisk-size or --use-da

Type: Install/Add

Versions: All

Arch: Both

Affected nodes: I/O

The gssgenvdisks command might not return the expected result in situations where the use of a single pool or a hybrid environment requires exact data sizes.

For example, --vdisk-size when used with --use-da might not give the intended result.

Use mmvdisk commands directly, or modify the vdisk stanza, when precise vdisk sizes are required. A sketch of the mmvdisk approach follows.
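A minimal sketch of defining a vdisk set with an explicit size directly through mmvdisk; the set name, recovery groups, RAID code, block size, and size are placeholders:
# Define a vdisk set with an exact total size, then create it
mmvdisk vdiskset define --vdisk-set vs_data1 --recovery-group rg1,rg2 --code 8+2p --block-size 8M --set-size 20T
mmvdisk vdiskset create --vdisk-set vs_data1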
Duplicate PMRs might be generated for node call home

Versions: ESS 5.3.7

Arch: Little Endian

Affected nodes: All

Duplicate PMRs might be generated unless a step is taken to avoid this behavior.

When doing the call home setup by using the gsscallhomeconf command, use the --stop-auto-event-report flag. Using this flag resolves this issue.
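A hedged sketch of the call home configuration; the -E, -N, --suffix, and --register values are illustrative and the exact invocation should match your deployment guide, with --stop-auto-event-report added as described above:
gsscallhomeconf -E ems1 -N ems1,gss_ppc64 --suffix=-hs --register=all --stop-auto-event-report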
MTM does not match the failing unit information reference in the PMR log file

Versions: ESS 5.3.7

Arch: Little Endian

Affected nodes: All

The MTM does not match the failing unit reference in the PMR log file in cases such as a power cord pull or a power supply (PS) pull. For more information, see the example at the end of this topic. Note that the failing unit might be reported as the EMS node, but the PMR details reference the failing node and part correctly.
During a rolling upgrade, mmhealth might show the error local_exported_fs_unavail even though the file system is still mounted.

Area: RAS

Type: Upgrade

Arch: Both

Version: ESS 5.3.7

During an online rolling upgrade (updating one ESS I/O node at a time while maintaining quorum), mmhealth might display an error indicating that the local exported file system is unavailable. This message is erroneous. For more information, see the example output at the end of this topic.

Restart mmsysmon on each node called out by mmhealth.
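A minimal sketch of restarting the monitor on an affected node; mmsysmoncontrol ships with IBM Spectrum Scale, and the node name essio1-hs is a placeholder:
ssh essio1-hs "/usr/lpp/mmfs/bin/mmsysmoncontrol restart"
mmhealth node show -N essio1-hs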

Hardware details for the EMS and I/O server nodes do not show the Asset number.

Type: Install / Add

Version: All

Arch: All

Affected Nodes: EMS node.

The hardware details for the EMS and I/O server nodes do not show the Asset number. Use the mmlscomp command to view this information.
xCAT-related commands do not work on the EMS node if security is enabled using the gss_security tool.

Type: Install / upgrade

Version: 5.3.7

Arch: All

Affected Nodes: EMS node

After running the gss_security tool to enable security, the httpd daemon is shut down as part of the security enablement. Due to this, xCAT commands fail. For more information, see Enabling security in ESS. Disable security on the EMS node by using the gss_security -d command and retry the xCAT commands.
Failed to start the IBM.ESAGENT subsystem due to the wrong Java version being installed

Type: Upgrade

Version: 5.3.x.x

Arch: All

Affected Nodes: All

Upon upgrade, the Java pointer might be incorrect, which might cause issues when starting ESA. To fix this issue, run the following command:
yum reinstall java-1.8.0-openjdk-headless-1.8.0.222.b03-1.el7.ppc64le
To verify that the issue is fixed, run the following command:
which java; java -version
The expected output is:
/bin/java
 openjdk version "1.8.0_222-ea"
 OpenJDK Runtime Environment (build 1.8.0_222-ea-b03)
 OpenJDK 64-Bit Server VM (build 25.222-b03, mixed mode)
After these steps, restart ESA as follows:
systemctl restart esactl
Verify that the service has started as follows:
systemctl status esactl
Email notification by using ESA does not work

Type: Install/Upgrade

Version: 5.3.x.x

Arch: All

Affected Nodes: All

An ESA patch is required for SNMP notifications to work as designed. A patch is being tested to check whether it addresses the issue. Contact support to check if a fix is available.
POWER8 protocol nodes are not supported with Fiber Channel (or other storage adapters).

Type: Install

Version: 5.3.x.x

Arch: All

Affected Nodes: All

ESS deployment currently does not support POWER8 protocol nodes that have storage adapters inserted. You might still use these nodes, but ESS deployment installation and upgrade are not supported for them.
Call home setup has a requirement that all nodes of PPC64LE architecture must be registered in ESA.

Type: Install

Version: 5.3.x.x

Arch: All

Affected Nodes: All

There is currently a limitation wherein, for call home to work correctly, all PPC64LE nodes in the cluster must be registered with ESA. If all nodes of that architecture are not registered, call home might not work as designed.

This is usually the case when PPC64LE client or protocol nodes are in the same cluster as ESS nodes but they are not registered with ESA.

 
A race condition in opal-elog can cause a kernel panic in the function elog_work_fn. This is experienced when the GUI runs HW_INVENTORY commands against POWER servers.

Type: Kernel panic

Version: All

Arch: PPC64LE

Affected Nodes: EMS, I/O, and protocol nodes

This issue was found with the RHEL 7 kernel (Bugzilla 1873189) while opal-elog is handling an excessive number of OPAL error log events.

The GUI runs ipmi fru print commands as part of its HW_INVENTORY checks. The bug might be hit during these intervals because an excessive number of OPAL events are being generated.

Red Hat is working on a fix that provides a new kernel to address this race condition.

There is a known issue with OPAL on Power nodes wherein too many OPAL requests might cause a system hang. This issue does not affect ESS 3000 nodes.

In response, consider disabling the HW_INVENTORY GUI task to reduce requests to the FSP.
/usr/lpp/mmfs/gui/cli/chtask HW_INVENTORY --inactive
Upgrades from 5.3.7.3 to 6.1.2.x (legacy) might fail due to the lower kernel level carried in 6.1.2.x

Type: Deployment upgrade

Version: 5.3.7.3/6.1.2.x

Affected Nodes: All legacy nodes (including protocol)

When upgrading a container from ESS 5.3.7.3 to 6.1.2.x for legacy POWER8 nodes, yum update fails because of the lower kernel level in 6.1.2.x.

ESS 5.3.7.3 has Linux kernel 3.10.0-1160.45.1 while ESS 6.1.2.1 legacy has kernel 3.10.0-1160.41.1.

The lower-level kernel creates some dependency issues during the upgrade. Before the upgrade, downgrade the Linux kernel packages of ESS 5.3.7.3 by issuing the following command:
yum downgrade kernel-tools kernel-tools-libs
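A minimal sketch of verifying the downgrade before starting the container upgrade; the package names follow the command above:
rpm -q kernel-tools kernel-tools-libs
uname -r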
Example of MTM not matching failing unit information in PMR log
**************  FAILING UNIT INFORMATION **************
Service Agent Date, Time:  2019-11-14 17:19:46 UTC

UTC:                       2019-11-14 17:19:46 GMT                       
Machine Type/Feat:       5148       <------------- This data continues to be from the ems node  XXX
Model:                   21L        <------------- This data continues to be from the ems node  XXX
Serial:                  005789A    <------------- This data continues to be from the ems node  XXX
Unit Name:               essio41-fo
Sys Feature/Fnc 20:
Bundled Problem Report:
Indicator Mode (LP/GL):
Sys Attn/Info Act (Y/N):
Example output of mmhealth command

Component       Status        Status Change     Reasons
---------------------------------------------------------------------------------------------
GPFS            HEALTHY       6 min. ago        -
NETWORK         HEALTHY       20 min. ago       -
FILESYSTEM      DEGRADED      18 min. ago       local_exported_fs_unavail(gpfs1, gpfs0)
DISK            HEALTHY       6 min. ago        -
NATIVE_RAID     HEALTHY       6 min. ago        -
PERFMON         HEALTHY       19 min. ago       -
THRESHOLD       HEALTHY       20 min. ago       -