Troubleshooting
Problem
When the GPU cluster policy references a driver version by tag, the NVIDIA GPU Operator appends the node's operating system version (for example, rhel8.9) to the image tag.
This can cause problems when the cluster picks up a new RHEL minor release (for example, RHEL 8.8 -> 8.9).
Cloud providers can delay moving to the latest kernel until it has been validated and tested for production readiness, so the reported release and the running kernel do not match until that work is complete.
This issue leaves the nvidia-driver-daemonset pod stuck in the ImagePullBackOff state, in one of two ways:
1. Failing to pull the corresponding tag (see workaround 1)
Normal BackOff 114s (x6344 over 24h) kubelet Back-off pulling image "nvcr.io/nvidia/driver:525.105.17-rhel8.9"
2. Failing to download packages through Subscription Manager (see workaround 2)
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.9 install kernel-headers-4.18.0-477.27.1.el8_8.x86_64 kernel-devel-4.18.0-477.27.1.el8_8.x86_64
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
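The mismatch behind both failures can be checked by hand on a node. The following is a minimal shell sketch, not part of the GPU Operator: all function names are hypothetical, and the tag format is inferred from the image name in the event shown in case 1. It derives the RHEL release from a kernel release string such as the one in the dnf command, reads VERSION_ID from an os-release file, and reports whether the two agree.

```shell
# Hypothetical helpers for illustration; not part of the GPU Operator.

# Derive the RHEL release from a kernel release string.
# Example input: 4.18.0-477.27.1.el8_8.x86_64
# Prints nothing if the string has no ".el<major>_<minor>." part.
rhel_release_from_kernel() {
    printf '%s\n' "$1" |
        sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p'
}

# Read VERSION_ID from an os-release file (default: /etc/os-release).
# The file is sourced in a subshell so no variables leak into the caller.
os_release_version_id() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$VERSION_ID" )
}

# Build a tag in the form seen in the pull event: <driver>-rhel<VERSION_ID>.
driver_image_tag() {
    printf '%s-rhel%s\n' "$1" "$2"
}

# Report whether the kernel and os-release agree on the RHEL release.
check_release_mismatch() {
    kernel_rel=$(rhel_release_from_kernel "$1")
    os_rel=$(os_release_version_id "$2")
    if [ "$kernel_rel" = "$os_rel" ]; then
        echo "match: $os_rel"
    else
        echo "mismatch: kernel=$kernel_rel os-release=$os_rel"
    fi
}

# Usage on a node:
#   check_release_mismatch "$(uname -r)" /etc/os-release
```

A reported mismatch suggests that one of the two workarounds below applies.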
Resolving The Problem
Workaround #1
Modify the GPU cluster policy so that it references the NVIDIA driver image by checksum rather than by tag, taking the checksum of a suitable tag from the NVIDIA catalog, then restart the nvidia-driver-daemonset pod. The setting takes this form:
driver.version=sha256:<checksum>
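One way to apply this setting is a merge patch against the ClusterPolicy resource. The sketch below is an illustration under assumptions: the helper name build_driver_patch is hypothetical, and the policy name, namespace, and pod label in the commented commands vary by installation and should be checked against your cluster. The helper only validates the checksum format and prints the patch body, which places the value under spec.driver.version.

```shell
# Hypothetical helper: print a merge patch that sets the driver version
# to a checksum. Rejects anything that is not "sha256:" plus 64 lowercase
# hex digits.
build_driver_patch() {
    digest=$1
    hex=${digest#sha256:}
    if [ "$hex" = "$digest" ] || [ "${#hex}" -ne 64 ] ||
        printf '%s' "$hex" | grep -q '[^0-9a-f]'; then
        echo "expected sha256:<64 hex digits>, got: $digest" >&2
        return 1
    fi
    printf '{"spec":{"driver":{"version":"%s"}}}\n' "$digest"
}

# Usage (policy name, namespace, and label are assumptions):
#   kubectl patch clusterpolicy gpu-cluster-policy --type merge \
#       -p "$(build_driver_patch "sha256:<checksum>")"
#   kubectl delete pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
```

Deleting the pod lets the daemonset recreate it with the updated image reference.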
Workaround #2
Edit the file /etc/os-release on the host node so that the VERSION_ID field matches the RHEL release of the running kernel, then restart the nvidia-driver-daemonset pod.
In this case, update VERSION_ID to 8.8, because the driver container is trying to install packages for the 8.8 kernel (4.18.0-477.27.1.el8_8). The file before the change:
sh-4.4# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
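The edit described in this workaround can be sketched as a single sed substitution. This is an illustration under assumptions: the function name set_version_id is hypothetical, and it assumes the file is writable on the node. It keeps a copy of the original with a .bak suffix so the change can be reverted later.

```shell
# Hypothetical helper: set VERSION_ID in an os-release file.
# Arguments: <new version> [file, default /etc/os-release]
# Only the VERSION_ID line is changed; the original is kept as <file>.bak.
set_version_id() {
    version=$1
    file=${2:-/etc/os-release}
    sed -i.bak "s/^VERSION_ID=.*/VERSION_ID=\"$version\"/" "$file"
}

# Usage on the host node (run as root):
#   set_version_id 8.8
#   grep '^VERSION_ID=' /etc/os-release
```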
Document Location
Worldwide
Document Information
Modified date:
13 February 2024
UID
ibm17112272