Configuring a GPU worker node

Verify that your GPU worker nodes are ready for deployments.

Updating your GPU driver version

You can update your GPU driver before or after installation of IBM® Cloud Private.

Important: The IBM Cloud Private environment does not use the NVIDIA Container Runtime. Do not install the NVIDIA Container Runtime or its dependencies on any IBM Cloud Private GPU node.
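
To confirm that the runtime is not present on a GPU node, you can run a quick check. The following commands are a minimal sketch that assumes an RPM-based or Debian-based worker node; use the check that matches your package manager.

    # No output from these checks means that the NVIDIA Container Runtime is not installed
    command -v nvidia-container-runtime
    rpm -qa | grep -i nvidia-container     # RHEL worker nodes
    dpkg -l | grep -i nvidia-container     # Ubuntu worker nodes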

  1. Update your GPU driver version. To download the driver, see http://www.nvidia.com/Download/index.aspx.

  2. After you update the GPU driver, restart the kubelet so that Kubernetes picks up the driver changes.

    systemctl restart kubelet
    
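    After the restart, you can confirm that the kubelet came back up before you continue. This check is a minimal sketch that uses standard systemd commands:

    # The service must report "active"; review recent log entries if it does not
    systemctl is-active kubelet
    journalctl -u kubelet --since "5 minutes ago" --no-pager | tail -n 20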

Verifying that nodes are ready for deployment

You must run these verification steps from the worker node that has the NVIDIA GPU driver installed.

  1. Check that the NVIDIA driver is up and running.

    nvidia-smi
    

    The output resembles the following example:

    Thu Nov  9 16:44:28 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
    | N/A   47C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
    | N/A   36C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    If an error message is returned, reinstall the GPU driver on the node. To download the driver, see http://www.nvidia.com/Download/index.aspx.
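
    If you want a compact, script-friendly check, nvidia-smi can report only the GPU names and driver version. The following command is an optional sketch that uses the standard nvidia-smi query options:

    # Prints one line per GPU with its index, model name, and installed driver version
    nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader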

  2. Ensure that the GPU driver folders are available.

    1. Check whether the /var/lib/kubelet/device-plugins/nvidia-driver folder exists.
    2. Check whether there are at least two folders under the /var/lib/kubelet/device-plugins/nvidia-driver folder. The folder names are init and <driver-version-number>.
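
    For example, you can check both conditions with a directory listing. This is a minimal sketch that assumes the default device plugin path that is shown in the previous steps:

    # The directory must exist and must contain the init folder plus a folder that is
    # named after the installed driver version, for example 375.66
    ls -l /var/lib/kubelet/device-plugins/nvidia-driver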

    If any folder does not exist, ensure that the GPU driver is correctly installed. Then, complete the following steps:

    1. Delete the folder that has the driver files.

      rm -rf /var/lib/kubelet/device-plugins/nvidia-driver
      
    2. Restart the kubelet service.

      systemctl restart kubelet
      
  3. Verify that the GPU resources are available for Kubernetes to use.

    kubectl describe nodes
    

    In the command output, the node that has GPUs must show the following entries:

    Capacity:
    [snip]
    nvidia.com/gpu:     2
    [snip]
    Allocatable:
    [snip]
    nvidia.com/gpu:     2
    

    If you do not see nvidia.com/gpu entries for a node that has NVIDIA GPUs, the GPU driver is probably not installed correctly. You might need to reinstall the GPU driver on that node.
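
    You can also list the GPU capacity of every node directly instead of scanning the full kubectl describe nodes output. The following command is an optional sketch that uses a kubectl custom-columns expression; the dot in the resource name is escaped with a backslash:

    # Nodes without GPUs show <none> in the GPU column
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"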

You are now ready to deploy applications that use GPU resources on your worker node. See Creating a deployment with attached GPU resources.
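
Optionally, before you create a full deployment, you can run a quick scheduling test by starting a short-lived pod that requests one GPU. The following commands are a minimal sketch; the gpu-smoke-test name and the busybox image are examples only, and the test confirms only that the nvidia.com/gpu resource can be requested and scheduled, not that your application can use the GPU.

    # Request one GPU; the pod should be scheduled onto the GPU worker node
    kubectl run gpu-smoke-test --image=busybox --restart=Never \
      --limits='nvidia.com/gpu=1' --command -- sh -c 'echo GPU granted && sleep 300'

    # If the pod stays in Pending, run kubectl describe pod gpu-smoke-test and look
    # for an insufficient nvidia.com/gpu message
    kubectl get pod gpu-smoke-test -o wide

    # Remove the test pod when you are done
    kubectl delete pod gpu-smoke-test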