IBM Support

NVIDIA GPU cluster policy stuck in Not Ready state following major kernel update

Troubleshooting


Problem

When referencing a driver version in the GPU cluster policy by tag, the NVIDIA GPU Operator appends the machine's kernel version to the image name.

This can cause problems when the cluster picks up a major kernel update (e.g., RHEL 8.8 -> 8.9).

Cloud providers often delay moving to the latest kernel until validation and testing confirm that it is production ready, which leaves the node's reported OS version and its running kernel mismatched until the new kernel is rolled out.

This results in the nvidia-driver-daemonset pod being stuck in the ImagePullBackOff state, in one of two ways:

1. Failing to pull the corresponding tag (see Workaround #1)

  Normal  BackOff  114s (x6344 over 24h)  kubelet  Back-off pulling image "nvcr.io/nvidia/driver:525.105.17-rhel8.9"

2. Failing to download packages through the subscription manager (see Workaround #2)

Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.9 install kernel-headers-4.18.0-477.27.1.el8_8.x86_64 kernel-devel-4.18.0-477.27.1.el8_8.x86_64
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
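
Both failures come down to the OS version reported in /etc/os-release differing from the RHEL minor release encoded in the running kernel (8.9 versus el8_8 in the log above). The following is a minimal shell sketch for checking this on a node. The function names are illustrative and not part of any NVIDIA or Red Hat tooling, and the sketch assumes the kernel release string carries an elN_M suffix as in the example.

```shell
# Print the RHEL version encoded in a kernel release string.
# Prints nothing if the string has no elN_M suffix.
# Example: 4.18.0-477.27.1.el8_8.x86_64 gives 8.8
kernel_rhel_version() {
  printf '%s\n' "$1" | sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\).*/\1.\2/p'
}

# Print VERSION_ID from an os-release style file (default: /etc/os-release).
os_version_id() {
  sed -n 's/^VERSION_ID="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "${1:-/etc/os-release}"
}

# Usage on a GPU node:
#   kernel_rhel_version "$(uname -r)"
#   os_version_id
```

If the two values differ, the node is likely affected by the mismatch described here.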




Symptom

- The NVIDIA GPU Operator ClusterPolicy is in the "notReady" state
- The nvidia-driver-daemonset has an "ImagePullBackOff" status

Resolving The Problem

Workaround #1

Modify the GPU cluster policy to reference an NVIDIA driver version by checksum, using the checksum of a tag listed in the NVIDIA catalog, then restart the nvidia-driver-daemonset pod.

driver.version=sha256:<checksum>
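
One way to apply this setting from the command line is sketched below. It assumes the ClusterPolicy is named gpu-cluster-policy, the driver pods run in the nvidia-gpu-operator namespace, and they carry the label app=nvidia-driver-daemonset; these are assumptions that may differ on your cluster, and the function names are hypothetical. The checksum itself must be taken from the NVIDIA catalog.

```shell
# Assumed names; override them to match your cluster.
POLICY_NAME="${POLICY_NAME:-gpu-cluster-policy}"
NAMESPACE="${NAMESPACE:-nvidia-gpu-operator}"

# Build a JSON merge patch that sets spec.driver.version to an image checksum.
# Rejects values that do not start with "sha256:".
driver_digest_patch() {
  digest="$1"
  case "$digest" in
    sha256:*) ;;
    *) echo "checksum must start with sha256:" >&2; return 1 ;;
  esac
  printf '{"spec":{"driver":{"version":"%s"}}}' "$digest"
}

# Patch the ClusterPolicy, then delete the driver pods so that the
# DaemonSet recreates them with the new image reference.
apply_driver_digest() {
  oc patch clusterpolicy "$POLICY_NAME" --type merge -p "$(driver_digest_patch "$1")" &&
  oc delete pod -n "$NAMESPACE" -l app=nvidia-driver-daemonset
}

# Usage:
#   apply_driver_digest "sha256:<checksum>"
```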

Workaround #2

Edit the file /etc/os-release on the host node so that the VERSION_ID field matches the version of the kernel, then restart the nvidia-driver-daemonset pod.

In this example, set VERSION_ID to 8.8, because the driver pod is trying to install packages for the 8.8 kernel.

sh-4.4# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
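
The edit to VERSION_ID can be scripted. The sketch below uses a hypothetical helper, set_version_id, which keeps a backup copy of the file and rewrites only the VERSION_ID line. It is an illustration under the assumption that you run it as root on the affected node, not an official procedure.

```shell
# Set VERSION_ID in an os-release style file (default: /etc/os-release).
# A backup of the original is kept next to it with a .bak suffix.
set_version_id() {
  new_version="$1"
  file="${2:-/etc/os-release}"
  cp -p "$file" "$file.bak" &&
  sed "s/^VERSION_ID=.*/VERSION_ID=\"$new_version\"/" "$file.bak" > "$file"
}

# Usage on the affected node, as root:
#   set_version_id 8.8
```

Keeping the backup makes it straightforward to restore the original value once the node's kernel has been updated to match the OS version.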

Document Location

Worldwide

Product: IBM Maximo Application Suite (MAS Applications > Visual Inspection)
Versions: 8.10.0, 8.11.0
Platform: Platform Independent
ARM Case Number: TS015292098

Document Information

Modified date:
13 February 2024

UID

ibm17112272