
IBM Storage Scale versions 5.1.8.0 through 5.1.9.3 and 5.2.0.0 with RHEL 9.2 EUS kernel 5.14.0-284.66.1 and higher (x86 only): mmbuildgpl fails with a struct stat error (__st_ino) due to a change in the Linux kernel to address CVE-2024-25744

Flashes (Alerts)


Abstract

A recent Linux kernel update that addresses the CVE-2024-25744 Linux security vulnerability causes the mmbuildgpl command to fail when building the IBM Storage Scale kernel portability layer. IBM Storage Scale is unable to reach an active state on a node with this updated kernel.

This affects IBM Storage Scale releases 5.1.8.0 through 5.1.9.3 and 5.2.0.0. The fix is contained in releases 5.1.9.4 and 5.2.0.1.
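To check whether a node is exposed, compare the installed Scale level with the running kernel. A minimal sketch for RPM-based (non-containerized) installs; gpfs.base is the core Storage Scale package:

# rpm -q gpfs.base
# uname -r

The node is exposed if gpfs.base reports a level from 5.1.8.0 through 5.1.9.3 or 5.2.0.0 and uname -r reports 5.14.0-284.66.1.el9_2.x86_64 or higher.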

Red Hat's CVE page tracking the kernel change: https://access.redhat.com/security/cve/CVE-2024-25744

Based on the above link, the following kernels (x86_64 only) impact IBM Storage Scale:

- RHEL9.2 5.14.0-284.66.1.el9_2.x86_64 and higher

OpenShift levels containing kernels (x86_64 only) that impact IBM Storage Scale Container Native:

- 4.15.13 and higher

- 4.14.25 and higher

- 4.13.42 and higher
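
On a Container Native cluster, a quick sketch to check the OpenShift level and the kernel running on each node:

# oc get clusterversion
# oc get nodes -o wide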

Content

Problem Determination:
Example 1: mmbuildgpl failure. Note the distinctive error in struct stat: no member named '__st_ino'.
This example shows the error signature in the mmbuildgpl output: __st_ino.
This error signature is consistent across all deployments of Scale and is referred to in all examples within this flash.
...
Invoking Kbuild...
/usr/bin/make -C /usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64 ARCH=x86_64 M=/usr/lpp/mmfs/src/gpl-linux CONFIGDIR=/usr/lpp/mmfs/src/config  ; \
if [ $? -ne 0 ]; then \
	exit 1;\
fi
make[2]: Entering directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/tracelin.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/tracedev-ksyms.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/ktrccalls.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/relaytrc.o
  LD [M]  /usr/lpp/mmfs/src/gpl-linux/tracedev.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/mmfsmod.o
  LD [M]  /usr/lpp/mmfs/src/gpl-linux/mmfs26.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o
In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:61,
                 from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54:
/usr/lpp/mmfs/src/gpl-linux/kx.c: In function 'vstat':
/usr/lpp/mmfs/src/gpl-linux/kx.c:238:12: error: 'struct stat' has no member named '__st_ino'; did you mean 'st_ino'?
  238 |   statbuf->__st_ino       = vattrp->va_ino;
      |            ^~~~~~~~
      |            st_ino   <----------------------- signature of this problem
make[3]: *** [scripts/Makefile.build:321: /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o] Error 1
make[2]: *** [Makefile:1923: /usr/lpp/mmfs/src/gpl-linux] Error 2
make[2]: Leaving directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
make[1]: *** [makefile:140: modules] Error 1
make[1]: Leaving directory '/usr/lpp/mmfs/src/gpl-linux'
make: *** [makefile:145: Modules] Error 1
--------------------------------------------------------
mmbuildgpl: Building GPL module failed at Thu May 16 19:18:34 UTC 2024.
--------------------------------------------------------
mmbuildgpl: Command failed. Examine previous error messages to determine cause.
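To confirm the offending assignment is present in the portability-layer source shipped on a node, a quick check (sketch):

# grep -n '__st_ino' /usr/lpp/mmfs/src/gpl-linux/kx.c

A match at the statbuf->__st_ino assignment confirms the node will hit this build failure on an affected kernel.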
Example 2: Identifying and recovering from an OpenShift upgrade failure and Scale node failure on Storage Scale Container Native
This example shows how to identify the error and recover from a failure of Storage Scale Container Native after an OpenShift upgrade.
It applies to Storage Scale Container Native 5.2.0.0, 5.1.9.1, and 5.1.9.3 after an upgrade to any of these OpenShift levels: 4.15.13+, 4.14.25+, 4.13.42+.
  • First, check that the KERNEL-VERSION column lists kernel 5.14.0-284.66.1 or higher. If the kernel level is lower, you are not hitting this issue.
  • Also confirm that a worker node is set to SchedulingDisabled. This indicates the next worker node that OpenShift has selected for rollout of the upgraded machine config. Note this node for when the recovery procedure commences.
# oc get nodes -o wide
NAME                                    STATUS                     ROLES                  AGE    VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master0.cp.fyre.ibm.com   Ready                      control-plane,master   3d3h   v1.28.9+8ca71f7   10.17.105.70    <none>        Red Hat Enterprise Linux CoreOS 415.92.202405070140-0 (Plow)   5.14.0-284.66.1.el9_2.x86_64   cri-o://1.28.6-5.rhaos4.15.gita02fb1e.el9
master1.cp.fyre.ibm.com   Ready                      control-plane,master   3d3h   v1.28.9+8ca71f7   10.17.106.60    <none>        Red Hat Enterprise Linux CoreOS 415.92.202405070140-0 (Plow)   5.14.0-284.66.1.el9_2.x86_64   cri-o://1.28.6-5.rhaos4.15.gita02fb1e.el9
master2.cp.fyre.ibm.com   Ready                      control-plane,master   3d3h   v1.28.9+8ca71f7   10.17.106.236   <none>        Red Hat Enterprise Linux CoreOS 415.92.202405070140-0 (Plow)   5.14.0-284.66.1.el9_2.x86_64   cri-o://1.28.6-5.rhaos4.15.gita02fb1e.el9
worker0.cp.fyre.ibm.com   Ready                      worker                 3d3h   v1.28.7+f1b5f6c   10.17.108.41    <none>        Red Hat Enterprise Linux CoreOS 415.92.202403270524-0 (Plow)   5.14.0-284.59.1.el9_2.x86_64   cri-o://1.28.4-8.rhaos4.15.git24f50b9.el9
worker1.cp.fyre.ibm.com   Ready                      worker                 3d3h   v1.28.7+f1b5f6c   10.17.113.236   <none>        Red Hat Enterprise Linux CoreOS 415.92.202403270524-0 (Plow)   5.14.0-284.59.1.el9_2.x86_64   cri-o://1.28.4-8.rhaos4.15.git24f50b9.el9
worker2.cp.fyre.ibm.com   Ready                      worker                 3d3h   v1.28.9+8ca71f7   10.17.120.85    <none>        Red Hat Enterprise Linux CoreOS 415.92.202405070140-0 (Plow)   5.14.0-284.66.1.el9_2.x86_64   cri-o://1.28.6-5.rhaos4.15.gita02fb1e.el9
worker3.cp.fyre.ibm.com   Ready,SchedulingDisabled   worker                 3d3h   v1.28.7+f1b5f6c   10.17.124.102   <none>        Red Hat Enterprise Linux CoreOS 415.92.202403270524-0 (Plow)   5.14.0-284.59.1.el9_2.x86_64   cri-o://1.28.4-8.rhaos4.15.git24f50b9.el9
worker4.cp.fyre.ibm.com   Ready                      worker                 3d3h   v1.28.7+f1b5f6c   10.17.124.158   <none>        Red Hat Enterprise Linux CoreOS 415.92.202403270524-0 (Plow)   5.14.0-284.59.1.el9_2.x86_64   cri-o://1.28.4-8.rhaos4.15.git24f50b9.el9
  • Confirm that a single scale-core pod (worker2 in this example) is in the state Init:CrashLoopBackOff:
# oc get pods -o wide
NAME                               READY   STATUS                  RESTARTS        AGE     IP              NODE                                    NOMINATED NODE   READINESS GATES
ibm-spectrum-scale-gui-0           4/4     Running                 7 (26h ago)     2d19h   10.254.20.11    worker1.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-gui-1           4/4     Running                 5 (18h ago)     2d19h   10.254.28.9     worker4.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-pmcollector-0   2/2     Running                 0               2d19h   10.254.20.12    worker1.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-pmcollector-1   2/2     Running                 0               26m     10.254.16.4     worker2.cp.fyre.ibm.com   <none>           <none>
worker0                            2/2     Running                 0               2d19h   10.17.108.41    worker0.cp.fyre.ibm.com   <none>           <none>
worker1                            2/2     Running                 1 (2d19h ago)   2d19h   10.17.113.236   worker1.cp.fyre.ibm.com   <none>           <none>
worker2                            0/2     Init:CrashLoopBackOff   9 (47s ago)     25m     10.17.120.85    worker2.cp.fyre.ibm.com   <none>           <none>
worker3                            2/2     Running                 1 (2d19h ago)   2d19h   10.17.124.102   worker3.cp.fyre.ibm.com   <none>           <none>
worker4                            2/2     Running                 0               2d19h   10.17.124.158   worker4.cp.fyre.ibm.com   <none>           <none>
  • Check the logs from the mmbuildgpl container of the worker pod that is in Init:CrashLoopBackOff above. Look for the __st_ino error, which is the signature of this issue:
# oc logs worker2 -c mmbuildgpl

....
Invoking Kbuild...
/usr/bin/make -C /usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64 ARCH=x86_64 M=/usr/lpp/mmfs/src/gpl-linux CONFIGDIR=/usr/lpp/mmfs/src/config  ; \
if [ $? -ne 0 ]; then \
	exit 1;\
fi
make[2]: Entering directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/tracelin.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/tracedev-ksyms.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/ktrccalls.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/relaytrc.o
  LD [M]  /usr/lpp/mmfs/src/gpl-linux/tracedev.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/mmfsmod.o
  LD [M]  /usr/lpp/mmfs/src/gpl-linux/mmfs26.o
  CC [M]  /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o
In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:61,
                 from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54:
/usr/lpp/mmfs/src/gpl-linux/kx.c: In function 'vstat':
/usr/lpp/mmfs/src/gpl-linux/kx.c:238:12: error: 'struct stat' has no member named '__st_ino'; did you mean 'st_ino'?
  238 |   statbuf->__st_ino       = vattrp->va_ino;
      |            ^~~~~~~~
      |            st_ino
make[3]: *** [scripts/Makefile.build:321: /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.o] Error 1
make[2]: *** [Makefile:1923: /usr/lpp/mmfs/src/gpl-linux] Error 2
make[2]: Leaving directory '/usr/src/kernels/5.14.0-284.66.1.el9_2.x86_64'
make[1]: *** [makefile:140: modules] Error 1
make[1]: Leaving directory '/usr/lpp/mmfs/src/gpl-linux'
make: *** [makefile:145: Modules] Error 1
--------------------------------------------------------
mmbuildgpl: Building GPL module failed at Mon May 20 17:27:51 UTC 2024.
--------------------------------------------------------
mmbuildgpl: Command failed. Examine previous error messages to determine cause.
cleanup run
  • What the above shows:
    • Worker2 is in Init:CrashLoopBackOff because mmbuildgpl is failing to compile the portability layer that ties Storage Scale Container Native into the kernel. mmbuildgpl fails because of a defect that makes Scale incompatible with RHCOS 9 EUS kernel 5.14.0-284.66.1.el9_2 and higher. The OpenShift Machine Config Operator (MCO) has successfully rolled out a new config on the underlying worker2 RHCOS node, but it cannot progress to the next worker node because Scale is protecting cluster integrity and preventing the next scale-core pod from being drained. This holds up the OpenShift MCO rollout and, ultimately, the OpenShift upgrade itself.
  • To recover from this failure state, where a single scale-core pod is in Init:CrashLoopBackOff while the OpenShift Machine Config Operator is unable to progress to the next worker and finish the OpenShift upgrade:
    • If at 5.2.0.0, follow the 5.2.0 upgrade steps in the Storage Scale Container Native documentation: https://www.ibm.com/docs/en/scalecontainernative/5.2.0?topic=upgrading-storage-scale-container-native. See the details below before proceeding with the upgrade.
    • If at 5.1.9.1 or 5.1.9.3, follow the 5.1.9 upgrade steps in the Storage Scale Container Native documentation: https://www.ibm.com/docs/en/scalecontainernative/5.1.9?topic=upgrading-storage-scale-container-native. See the details below before proceeding with the upgrade.
      • All Storage Scale Container Native upgrade instructions point to an install.yaml in the 5.2.0.x or 5.1.9.x branch of the public GitHub repository. These branches are always updated with the latest fix pack or efix, and upgrading from them applies the fix for this problem (the branches were updated with the fix on May 22, 2024). The upgrade and install processes for efixes and fix packs are identical to those for major releases.
      • Follow the upgrade instructions in their entirety, with one exception. The upgrade documentation states not to proceed unless all pods are up. In this case it is OK to proceed, provided that only a single scale-core pod is in Init:CrashLoopBackOff and all other scale-core pods are in a Running state. After the scale-core pods update, the pod previously in Init:CrashLoopBackOff may remain in that state. When this occurs, and only if a single scale-core pod is in this state, delete that pod. Deletion causes the pod to recycle and reach a Running state. Once it does, the OpenShift Machine Config Operator (MCO) is no longer blocked by Storage Scale from updating nodes. Watch the MCO (oc get mcp) update the rest of the nodes and complete the OpenShift upgrade, as sketched below. Then follow the remainder of the upgrade instructions to validate that the upgrade succeeded and that all pods are in a Running state afterwards.
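A minimal sketch of the recovery sequence. The pod name worker2 and the namespace ibm-spectrum-scale are illustrative; the scale-core pods may run under a different name or namespace in your cluster:

# oc get pods -n ibm-spectrum-scale -o wide
# oc delete pod worker2 -n ibm-spectrum-scale
# oc get pods -n ibm-spectrum-scale -w
# oc get mcp -w

The delete forces the stuck pod to recycle; once it reports Running, oc get mcp should show the worker MachineConfigPool resume updating nodes until the upgrade completes.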
Recommendations:
Workaround for non-containerized Scale deployments:
Option 1: Back out the affected kernel.
Option 2: If unable to back out the affected kernel, apply the workaround below, or reach out to Scale support for assistance.
This problem can be avoided by removing the offending line from the file /usr/lpp/mmfs/src/gpl-linux/kx.c:
Step 1: Open the file /usr/lpp/mmfs/src/gpl-linux/kx.c in a text editor.

Step 2: Find the following code block:

#ifdef STAT64_HAS_BROKEN_ST_INO
  /* Linux has 2 struct stat64 definitions:
       1) /usr/include/asm/stat.h
       2) /usr/include/bits/stat.h
     Of course, they differ
       1) st_dev & st_rdev is 2 bytes in (1) and 8 bytes in (2),
          but the 2 definitions overlap.
       2) st_ino is 8 bytes in (1) and 4 bytes in (2)
          and they are in different places!
     Fortunately, (1) defines an ifdef telling us to assign st_ino
     to a second variable which just happens to exactly match the
     definition in (2). */
  statbuf->__st_ino       = vattrp->va_ino;
#endif

Step 3: Remove the line: statbuf->__st_ino       = vattrp->va_ino;

Step 4: Save the file.

Step 5: Run mmbuildgpl.
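Equivalently, a minimal shell sketch of the same edit. Back up kx.c first; the sed pattern assumes the statbuf->__st_ino assignment appears only once, as in the block above, and the mmbuildgpl path assumes the default install location:

# cp /usr/lpp/mmfs/src/gpl-linux/kx.c /usr/lpp/mmfs/src/gpl-linux/kx.c.bak
# sed -i '/statbuf->__st_ino/d' /usr/lpp/mmfs/src/gpl-linux/kx.c
# /usr/lpp/mmfs/bin/mmbuildgpl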
Workaround for Storage Scale Container Native:
An upgrade of Storage Scale Container Native to 5.1.9 or 5.2.0 performed on or after May 23, 2024 picks up the efixes required to avoid and/or recover from this situation. This is because the documentation always points to manifests in the 5.2.0.x and 5.1.9.x public GitHub branches, which are kept updated as new efixes and fix packs are released.
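As part of the documented upgrade, the updated manifest is applied with oc. A sketch; take the exact install.yaml URL for your release from the linked upgrade documentation (the placeholder below is not a real URL):

# oc apply -f <install.yaml URL from the 5.2.0.x or 5.1.9.x branch>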
Instructions for requesting an efix:
If upgrading the Scale code or downgrading to a non-problematic kernel is not possible, contact IBM Support and request an efix for this problem. Refer to APAR IJ51225: https://www.ibm.com/support/pages/apar/IJ51225
Until the fix can be applied:
Downgrade to a kernel that does not contain the fix for CVE-2024-25744 (for RHEL 9.2, a kernel level lower than 5.14.0-284.66.1.el9_2.x86_64), or use the workaround above.
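A minimal sketch of selecting an earlier kernel on RHEL 9.2 with grubby. The 5.14.0-284.59.1 level is one example of an unaffected kernel; the kernel you set as default must already be installed on the node:

# grubby --info=ALL | grep ^title
# grubby --set-default /boot/vmlinuz-5.14.0-284.59.1.el9_2.x86_64
# reboot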
Note: Internal Reference D.329851

[{"Type":"MASTER","Line of Business":{"code":"LOB69","label":"Storage TPS"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSRNVQG","label":"IBM Storage Scale"},"ARM Category":[{"code":"a8m3p000000PC7yAAG","label":"non-GPFS"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"5.1.8;5.1.9;5.2.0"}]

Document Information

Modified date:
14 June 2024

UID

ibm17155787