IBM Support

IBM Storage Fusion Hotfix for the GDP OCP issues

IBM Storage Fusion versions 2.7.1 and 2.7.2 are impacted by an IBM Storage Scale container native failure on a few of the OCP versions.

A recent Linux kernel update to address the CVE-2024-25744 Linux security vulnerability resulted in an mmbuildgpl command failure while building the IBM Storage Scale kernel portability layer. IBM Storage Scale cannot get into an active state on a node with the updated kernel.
For more information about the kernel changes, see the Red Hat support page.

RHEL 9.2 kernel 5.14.0-284.66.1.el9_2.x86_64 and higher versions impact IBM Storage Scale.

OpenShift levels containing the kernels (x86_64 only) that impact the IBM Storage Scale Container Native:

  • 4.14.25 or higher versions
  • 4.13.42 or higher versions
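To confirm whether a node kernel falls in the affected range, you can compare it against 5.14.0-284.66.1 with a version sort. The following is a minimal sketch; the KERNEL value is an illustrative sample, so substitute the kernel that your nodes report.

```shell
# Minimal sketch: compare a kernel string against the first affected level
# with a version sort. KERNEL below is an illustrative sample value;
# substitute the kernel that your nodes report.
AFFECTED="5.14.0-284.66.1"
KERNEL="5.14.0-284.66.1.el9_2.x86_64"
if [ "$(printf '%s\n' "$AFFECTED" "$KERNEL" | sort -V | head -n1)" = "$AFFECTED" ]; then
  echo "kernel $KERNEL is at or above $AFFECTED"
else
  echo "kernel $KERNEL is below $AFFECTED"
fi
```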
     

How to identify the problem
 

  • The following example steps show how to identify the error and recover from a failure of IBM Storage Scale Container Native after an OpenShift upgrade.
  • The steps apply to IBM Storage Scale Container Native 5.1.9.1 and 5.1.9.3 versions that are upgraded to any of the following OpenShift levels:
    4.14.25 or higher versions
    4.13.42 or higher versions
 
  1. In the output of the oc get nodes -o wide command, check whether the KERNEL-VERSION column lists the kernel 5.14.0-284.66.1 or a higher version. If the kernel version is lower than 5.14.0-284.66.1, then you might not face this issue.

  2. Run the following command to check whether a worker node is in the SchedulingDisabled state. It indicates the next worker node that is scheduled for a Red Hat OpenShift machine config upgrade rollout.

    oc get nodes -o wide
     

    Example output:

    NAME                                         STATUS
    master0.cp.fyre.ibm.com                      Ready
    master1.cp.fyre.ibm.com                      Ready
    master2.cp.fyre.ibm.com                      Ready
    worker0.cp.fyre.ibm.com                      Ready
    worker1.cp.fyre.ibm.com                      Ready
    worker2.cp.fyre.ibm.com                      Ready
    worker3.cp.fyre.ibm.com                      Ready,SchedulingDisabled
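The SchedulingDisabled node can also be extracted from the listing with a small filter. The sketch below runs the filter over a sample that mirrors the example output (the sample_nodes helper is illustrative only); on a live cluster, pipe oc get nodes -o wide into the same awk command.

```shell
# Minimal sketch: filter the node listing down to cordoned nodes. The
# sample_nodes helper below only mirrors the example output; on a live
# cluster, run: oc get nodes -o wide | awk '$2 ~ /SchedulingDisabled/ {print $1}'
sample_nodes() {
cat <<'EOF'
NAME                      STATUS
worker2.cp.fyre.ibm.com   Ready
worker3.cp.fyre.ibm.com   Ready,SchedulingDisabled
EOF
}
sample_nodes | awk '$2 ~ /SchedulingDisabled/ {print $1}'
```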
    
  3. Run the following command to check whether at least one scale-core pod is in the Init: CrashLoopBackOff state.

    oc get pods -o wide
     

    Example output:

    NAME                                        STATUS
    ibm-spectrum-scale-gui-0                    Running
    ibm-spectrum-scale-gui-1                    Running
    ibm-spectrum-scale-pmcollector-0            Running
    ibm-spectrum-scale-pmcollector-1            Running
    worker0                                     Running
    worker1                                     Running
    worker2                                     Init: CrashLoopBackOff
    worker3                                     Running
    worker4                                     Running
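A pod that is stuck in an init state can be surfaced with a simple filter over the pod listing. The sketch below uses a sample_pods helper that only mirrors the example output; on a live cluster, pipe oc get pods -o wide into the same awk command.

```shell
# Minimal sketch: list any pod that is not plainly Running. The sample_pods
# helper mirrors the example output; on a live cluster, run:
#   oc get pods -o wide | awk 'NR > 1 && $0 !~ /Running/ {print}'
sample_pods() {
cat <<'EOF'
NAME       STATUS
worker1    Running
worker2    Init: CrashLoopBackOff
worker3    Running
EOF
}
sample_pods | awk 'NR > 1 && $0 !~ /Running/ {print}'
```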
    
  4. Run the following command to check the logs from the mmbuildgpl init container of the worker node that is in the Init: CrashLoopBackOff state and search for the __st_ino error, which is a signature of this issue.

    oc logs worker2 -c mmbuildgpl
    • The example output shows that the worker2 node is in the Init: CrashLoopBackOff state because mmbuildgpl fails to compile the portability layer that serves as the kernel tie-in for IBM Storage Scale Container Native. The mmbuildgpl command failed due to a defect that created an incompatibility with the RHCOS 9 EUS kernel level 5.14.0-284.66.1.el9_2 and higher versions.
    • The OpenShift Machine Config Operator (MCO) rolled out a new configuration on the underlying worker2 node. It could not progress to the next node because IBM Storage Scale, protecting cluster integrity, prevented the draining of the next scale core pod. As a result, the MCO rollout is held for a long time and the Red Hat OpenShift upgrade stalls.

    Example output:
    ....
    Invoking Kbuild...
    /usr/bin/make -C /usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64 ARCH=x86_64 M=/usr/lpp/mmfs/src/gpl-linux CONFIGDIR=/usr/lpp/mmfs/src/config ; \

    if [ $? -ne 0 ]; then \
    exit 1;\
    fi
    make[2]: Entering directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
    CC [M] /usr/lpp/mmfs/src/gpl-linux/tracelin.o
    CC [M] /usr/lpp/mmfs/src/gpl-linux/tracedev-ksyms.o
    CC [M] /usr/lpp/mmfs/src/gpl-linux/ktrccalls.o
    CC [M] /usr/lpp/mmfs/src/gpl-linux/relaytrc.o
    LD [M] /usr/lpp/mmfs/src/gpl-linux/tracedev.o
    CC [M] /usr/lpp/mmfs/src/gpl-linux/mmfsmod.o
    LD [M] /usr/lpp/mmfs/src/gpl-linux/mmfs26.o
    CC [M] /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o
    In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:61,
                     from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54:
    /usr/lpp/mmfs/src/gpl-linux/kx.c: In function 'vstat':
    /usr/lpp/mmfs/src/gpl-linux/kx.c:238:12: error: 'struct stat' has no member named '__st_ino'; did you mean 'st_ino'?
      238 | statbuf->__st_ino = vattrp->va_ino;
          | ^~~~~~~~
          | st_ino
    make[3]: *** [scripts/Makefile.build:321: /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o] Error 1
    make[2]: *** [Makefile:1923: /usr/lpp/mmfs/src/gpl-linux] Error 2
    make[2]: Leaving directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
    make[1]: *** [makefile:140: modules] Error 1
    make[1]: Leaving directory '/usr/lpp/mmfs/src/gpl-linux'
    make: *** [makefile:145: Modules] Error 1
    --------------------------------------------------------
    mmbuildgpl: Building GPL module failed at Mon May 20 17:27:51 UTC 2024.
    --------------------------------------------------------
    mmbuildgpl: Command failed. Examine previous error messages to determine cause.
    cleanup run
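A quick way to test for this signature is to grep the init-container log for __st_ino. The sketch below runs the same grep over a copy of the error line from the log above; on a live cluster, pipe the oc logs output into grep instead.

```shell
# Minimal sketch: grep for the __st_ino signature. On a live cluster, run:
#   oc logs worker2 -c mmbuildgpl | grep "__st_ino"
# Here the same grep runs over a copy of the error line from the log above.
grep "__st_ino" <<'EOF'
/usr/lpp/mmfs/src/gpl-linux/kx.c:238:12: error: 'struct stat' has no member named '__st_ino'; did you mean 'st_ino'?
EOF
```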

     

    Resolution

    1. Add a field enableManualInstallation: true to the scalemanager CR.

      apiVersion: cns.isf.ibm.com/v1
      kind: ScaleManager
      metadata:
        name: scalemanager
        namespace: ibm-spectrum-fusion-ns
      spec:
        creator: Fusion
        enableManualInstallation: true
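As an alternative to editing the CR by hand, the same field can be set with a merge patch. The oc patch invocation below is a sketch only (it assumes the scalemanager CR name and ibm-spectrum-fusion-ns namespace shown above); the echo line validates the merge-patch JSON locally before you apply it.

```shell
# Sketch only: the oc patch command assumes the scalemanager CR name and
# ibm-spectrum-fusion-ns namespace shown in the resolution above.
#   oc patch scalemanager scalemanager -n ibm-spectrum-fusion-ns \
#     --type merge -p '{"spec":{"enableManualInstallation":true}}'
# Validate the merge-patch JSON locally before applying it:
echo '{"spec":{"enableManualInstallation":true}}' | python3 -m json.tool
```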
      
    2. Recover from the failure state where a single scale-core pod is in the Init: CrashLoopBackOff state. If IBM Storage Scale is at version 5.1.9.1 or 5.1.9.3, then follow the upgrade steps that are mentioned in the IBM Storage Scale Container Native 5.1.9 documentation.

      • The upgrade documentation states not to proceed when all pods are not Running. However, it is alright to proceed when only a single scale-core pod is in the Init: CrashLoopBackOff state and all other scale-core pods are in a Running state.
      • The scale-core pod that is already in the Init: CrashLoopBackOff state can remain in the same state even after the scale-core pods are updated. In such a case, delete the single scale-core pod that is in the Init: CrashLoopBackOff state.

      • The deletion causes the pod to recycle and reach a Running state, and IBM Storage Scale no longer blocks the Red Hat OpenShift Machine Config Operator (MCO).

      • Run the oc get mcp command to check the Machine Config Operator, and update the rest of the nodes to complete the Red Hat OpenShift upgrade. Follow the upgrade instructions to validate the upgrade status and ensure that all pods are in a Running state.
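The MCO progress check can be scripted as a filter over the oc get mcp columns. The sketch below runs the filter on a sample listing (the sample_mcp helper and the rendered config names are illustrative only); on a live cluster, pipe oc get mcp into the same awk command.

```shell
# Minimal sketch: flag MachineConfigPools that are not fully updated. The
# sample_mcp helper is illustrative; on a live cluster, run:
#   oc get mcp | awk 'NR > 1 && ($3 != "True" || $4 != "False") {print $1 " still updating"}'
sample_mcp() {
cat <<'EOF'
NAME     CONFIG                UPDATED   UPDATING   DEGRADED
master   rendered-master-abc   True      False      False
worker   rendered-worker-def   False     True       False
EOF
}
sample_mcp | awk 'NR > 1 && ($3 != "True" || $4 != "False") {print $1 " still updating"}'
```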

    3. After you complete the upgrade steps, all core pods must be in a Running state with the new versions.

      oc get pods -n ibm-spectrum-scale
       

      Example output:

      NAME                                        STATUS
      compute-1-ru5                               Running
      compute-1-ru6                               Running
      compute-1-ru7                               Running
      control-1-ru2                               Running
      control-1-ru3                               Running
      control-1-ru4                               Running
      ibm-spectrum-scale-gui-0                    Running
      ibm-spectrum-scale-gui-1                    Running
      ibm-spectrum-scale-pmcollector-0            Running
      ibm-spectrum-scale-pmcollector-1            Running
      
    4. The GDP service status will be in a critical state due to the applied hotfix.

    5. To resolve this issue, apply the following patch to replace the isf-storage-operator-controller-manager image with a new image in the installed operator CSV of the ibm-spectrum-fusion-ns namespace.

      isf-storage-operator - cp.icr.io/cp/isf/isf-storage-operator@sha256:4c56ddb5a745a196e24b3ba7d7af166c64e67cac302ec859e4ce3c556ac5625c

    6. Ensure that the new pod is in a running state by running the following command.

      oc get pods -n ibm-spectrum-fusion-ns | grep isf-storage-operator-controller-manager
       

      Example output:

      NAME                                                                                  STATUS                                                                                                                   
      isf-storage-operator-controller-manager-86ccc69c4d-hjh65                              Running
      


Document Information

Modified date:
17 July 2024

UID

ibm17156911