Configuring a GPU worker node
Verify that your GPU worker nodes are ready for deployments.
Updating your GPU driver version
You can update your GPU driver before or after installation of IBM® Cloud Private.
Important: The NVIDIA Container Runtime is not used by the IBM Cloud Private environment. The NVIDIA Container Runtime and its dependencies must not be installed on any IBM Cloud Private GPU nodes.
- To update your GPU driver version, see http://www.nvidia.com/Download/index.aspx.
- After you update your GPU driver version, restart the kubelet so that Kubernetes picks up the changes to the GPU driver:
systemctl restart kubelet
Verifying that nodes are ready for deployment
You must run these verification steps from the worker node that has the NVIDIA GPU driver installed.
- Check that the NVIDIA driver is up and running:
nvidia-smi
The output resembles the following code:
Thu Nov  9 16:44:28 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   47C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   36C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If an error message is returned, reinstall the GPU driver on the node. See http://www.nvidia.com/Download/index.aspx.
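If you want to script this check rather than read the table, nvidia-smi also provides a machine-readable query mode (`--query-gpu` and `--format` are standard nvidia-smi options). The helper below is an illustrative sketch, not part of IBM Cloud Private; it reads the CSV lines on stdin, so it can be exercised without GPU hardware.

```shell
# Hypothetical helper: verify that every GPU reports the expected driver
# version. Feed it the output of:
#   nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
check_driver_version() {
  expected="$1"
  status=0
  while IFS= read -r line; do
    # The driver version is the last comma-separated field.
    ver=$(printf '%s\n' "$line" | awk -F', ' '{print $NF}')
    if [ "$ver" != "$expected" ]; then
      echo "driver mismatch: $line"
      status=1
    fi
  done
  [ "$status" -eq 0 ] && echo "all GPUs on driver $expected"
  return "$status"
}

# On a GPU worker node you would run:
#   nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader \
#     | check_driver_version 375.66
```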
- Ensure that the GPU driver folders are available.
  - Check whether the /var/lib/kubelet/device-plugins/nvidia-driver folder exists.
  - Check whether there are at least two folders under the /var/lib/kubelet/device-plugins/nvidia-driver folder. The folder names are init and <driver-version-number>.
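These folder checks can be scripted. The following helper is an illustrative sketch (the function name is not part of IBM Cloud Private); it accepts an optional path argument so it can be tested against any directory.

```shell
# Hypothetical helper: confirm that the kubelet device-plugin driver folder
# exists and contains at least two subfolders (init and <driver-version-number>).
check_nvidia_driver_dirs() {
  dir="${1:-/var/lib/kubelet/device-plugins/nvidia-driver}"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  # Count immediate subfolders only; tr strips any padding from wc output.
  count=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
  if [ "$count" -lt 2 ]; then
    echo "expected at least 2 subfolders under $dir, found $count"
    return 1
  fi
  echo "ok: $count subfolders under $dir"
}
```

Run it with no argument on the worker node to check the default path.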
If any folder does not exist, ensure that the GPU drivers are correctly installed. Then, run the following commands:
  - Delete the folder that has the driver files.
    rm -rf /var/lib/kubelet/device-plugins/nvidia-driver
  - Restart the kubelet service.
    systemctl restart kubelet
  - Check whether the /var/lib/kubelet/device-plugins/nvidia-driver folder and its subfolders are re-created.
- Verify whether the GPU resources are available for Kubernetes to use:
kubectl describe nodes
In the command output, the node with GPU must have the following entries:
Capacity:
 [snip]
 nvidia.com/gpu:  2
 [snip]
Allocatable:
 [snip]
 nvidia.com/gpu:  2
If you do not see nvidia.com/gpu entries for the node that has NVIDIA GPUs, the likely cause is an incorrect installation of the GPU driver. You might need to reinstall the GPU driver.
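To make this check scriptable rather than visual, you can filter the describe output. The awk helper below is an illustrative sketch (the function name is an assumption, not an IBM Cloud Private tool); it prints the allocatable nvidia.com/gpu count from `kubectl describe nodes` output supplied on stdin.

```shell
# Hypothetical helper: print the allocatable nvidia.com/gpu count from
# "kubectl describe nodes" output read on stdin.
gpu_allocatable() {
  awk '/^Allocatable:/ { in_alloc = 1; next }   # enter the Allocatable block
       /^[A-Za-z]/     { in_alloc = 0 }         # any new top-level key ends it
       in_alloc && $1 == "nvidia.com/gpu:" { print $2 }'
}

# On a node with kubectl configured you would run:
#   kubectl describe nodes | gpu_allocatable
```

An empty result indicates that no node is advertising GPU resources.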
You are now ready to deploy applications that use GPU resources on your worker node. See Creating a deployment with attached GPU resources.