Configuring MIG support in Red Hat® OpenShift

You can configure NVIDIA Multi-Instance GPU (MIG) support in an OpenShift® cluster.

To configure MIG support, follow the official NVIDIA guide for configuring MIG support.
Note: Do not run any workload on a node while it is being reconfigured. If you do, the configuration process fails.
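Before you reconfigure a node, you can cordon and drain it so that no workloads run during the reconfiguration. The following commands are a minimal sketch; node1 is a placeholder, and the drain flags might need adjusting for the workloads that run on your nodes:
oc adm cordon node1
oc adm drain node1 --ignore-daemonsets --delete-emptydir-data
# Uncordon the node after the new MIG configuration is applied
oc adm uncordon node1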
To assign a MIG profile to a node, label it by using this command:
oc label nodes node1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
Note: To find a list of supported MIG profiles for your GPU, refer to the mig-parted-config configmap in the GPU operator namespace.
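For example, you can list the profile names in the configmap and check whether the MIG manager finished applying the configuration. This sketch assumes that the GPU operator runs in the nvidia-gpu-operator namespace and that the MIG manager records its status in the nvidia.com/mig.config.state node label:
# List the MIG profile names that mig-parted knows about
oc get configmap mig-parted-config -n nvidia-gpu-operator -o yaml | grep "all-"
# Check the status of the reconfiguration on the node
oc describe node node1 | grep nvidia.com/mig.config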

The standard setup uses a single MIG profile across the entire Cloud Pak for Data cluster and does not require any custom runtime definitions to be configured. To use a standard setup, label all nodes with the same MIG profile.
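For example, the following command applies one profile to every node that matches a label selector. It is a sketch that assumes your GPU nodes carry the node-role.kubernetes.io/worker label; substitute the selector and profile that fit your cluster:
oc label nodes -l node-role.kubernetes.io/worker nvidia.com/mig.config=all-1g.10gb --overwrite=true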

As a result, users can start a GPU runtime and select a single GPU to get a MIG device assigned.
Note: Assigning more than one MIG device to a runtime is not recommended, because tasks such as training machine learning models cannot easily be distributed across MIG devices, which are isolated entities by design. If your use case requires assigning more than one MIG device to a runtime, refer to Configuring multiple MIG profiles within a cluster.
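To confirm that the MIG devices are visible to the scheduler, you can inspect the GPU resources that a labeled node advertises. With the GPU operator's default single strategy, each MIG slice typically appears as an nvidia.com/gpu resource; node1 is a placeholder:
oc describe node node1 | grep nvidia.com/gpu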

Configuring multiple MIG profiles within a cluster

To enable different MIG profiles, assign a different profile to each worker node and then create new custom runtime definitions.
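For example, you might assign a different profile to each of two worker nodes before you create the runtime definitions. The node names and the second profile are placeholders; use profile names that the mig-parted-config configmap lists for your GPU model:
oc label nodes node1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
oc label nodes node2 nvidia.com/mig.config=all-2g.20gb --overwrite=true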

To create new custom runtime definitions:
  1. Download the runtime definition for the GPU runtime (for example, jupyter-231n-pygpu). Refer to Downloading the runtime configuration.
  2. Inside the runtime definition, add a nodeAffinity property that specifies the MIG profile that you assigned to the node:
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": "nvidia.com/mig.config",
                            "operator": "In",
                            "values": ["all-1g.10gb"]
                        }
                    ]
                }
            ]
        }
    }
  3. Upload the custom runtime definition, as described in Uploading the custom configuration.
  4. If your cluster contains other GPUs that do not provide MIG support, you must taint these nodes and work with a custom runtime definition that contains tolerations. For more information on taints and tolerations, see Taints and tolerations.
    Here is an example of tolerations:
    "tolerations": [
      {
        "key": "DontSchedule",
        "operator": "Exists",
        "effect": "NoExecute"
      }
    ]
    This ensures that you do not accidentally select a complete GPU instead of a single MIG device.

    If you want to prevent other workloads from being scheduled onto a MIG node, taint that node as well and add a matching toleration to the runtime definition, as in the sketch that follows.
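
    A taint that matches the example toleration could be applied with a command like the following. This is a sketch; node2 is a placeholder for the node that you want to taint:
    oc adm taint nodes node2 DontSchedule:NoExecute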

After the custom runtime definitions are uploaded, users can create new environments in their projects and select the nodes that offer a certain MIG profile.