Configuring MIG support in Red Hat® OpenShift
You can configure NVIDIA Multi-Instance GPU (MIG) support in an OpenShift® cluster.
To configure MIG support, follow the official NVIDIA guide for configuring MIG support.
Note: Do not run any workload on a node while it is being reconfigured. If you do, the configuration process fails.
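One way to check the note above before changing a profile is to list the pods that are still scheduled on the node. The following is a minimal sketch, not an official procedure: the function name pods_on_node is hypothetical, and node1 is the example node name used in this topic. The output also includes system and operator pods, so look for GPU workloads specifically.

```shell
# Sketch: list every pod currently scheduled on a node.
# pods_on_node is a hypothetical helper name, not part of oc.
pods_on_node() {
  local node="$1"
  # spec.nodeName is a field selector that oc supports for pods.
  oc get pods --all-namespaces --field-selector "spec.nodeName=${node}" -o wide
}

# Usage: pods_on_node node1
```

If you also want to keep new pods off the node while it is being reconfigured, `oc adm cordon node1` (and later `oc adm uncordon node1`) is one option; whether that fits your maintenance process is a judgment call.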
To assign a MIG profile to a node, label the node by using the following command:
oc label nodes node1 nvidia.com/mig.config=all-1g.10gb --overwrite=true
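After labeling, it can help to confirm which profile the node requests and whether the reconfiguration finished. The sketch below assumes that the GPU operator reports progress through a nvidia.com/mig.config.state node label, as described in NVIDIA's GPU Operator documentation; treat that label name as an assumption if your operator version differs. The function name mig_status is hypothetical.

```shell
# Sketch: show the requested MIG profile and the reported state as extra columns.
# mig_status is a hypothetical helper name; the state label is an assumption
# based on the NVIDIA GPU Operator documentation.
mig_status() {
  local node="$1"
  oc get node "$node" -L nvidia.com/mig.config,nvidia.com/mig.config.state
}

# Usage: mig_status node1
```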
Note: To find the list of supported MIG profiles for your GPU, refer to the mig-parted-config configmap in the GPU operator namespace.
The standard setup uses a single MIG profile across the entire Cloud Pak for Data cluster and does not require any custom runtime definitions to be configured. To use the standard setup, label all nodes with the same MIG profile.
As a result, users who start a GPU runtime and select a single GPU get one MIG device assigned.
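The standard setup can be sketched as a small loop that applies the same label to every GPU node. This is an illustration under assumptions: the function names are hypothetical, the node names are placeholders, and the namespace and configmap names in the usage comments (nvidia-gpu-operator, default-mig-parted-config) are typical values that may differ in your cluster.

```shell
# Sketch: print the configmap that lists the supported MIG profiles.
# Both arguments must be supplied; nothing is assumed about their values.
show_mig_profiles() {
  local namespace="$1"
  local configmap="$2"
  oc get configmap "$configmap" -n "$namespace" -o yaml
}

# Sketch: apply one MIG profile to every node passed as an argument.
label_nodes_with_profile() {
  local profile="$1"
  shift
  local node
  for node in "$@"; do
    oc label nodes "$node" "nvidia.com/mig.config=${profile}" --overwrite=true
  done
}

# Usage (names are assumptions, adjust to your cluster):
#   show_mig_profiles nvidia-gpu-operator default-mig-parted-config
#   label_nodes_with_profile all-1g.10gb node1 node2 node3
```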
Note: Assigning more than one MIG device to a runtime is not recommended. MIG devices are, by design, isolated entities, so tasks such as training machine learning models cannot easily be distributed across them. If your use case requires assigning more than one MIG device to a runtime, refer to Configuring multiple MIG profiles within a cluster.
Configuring multiple MIG profiles within a cluster
To enable different MIG profiles, assign a different profile to each worker node and then create new custom runtime definitions.
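Assigning a different profile to each worker node could look like the following sketch. The function name assign_profiles and the node names are hypothetical, and all-2g.20gb is only an example of a second profile; whether it is available depends on your GPU model and the profiles listed in the configmap.

```shell
# Sketch: label each node with its own MIG profile, then list the result.
# Arguments are node=profile pairs; assign_profiles is a hypothetical helper.
assign_profiles() {
  local pair node profile
  for pair in "$@"; do
    node="${pair%%=*}"     # text before the first "="
    profile="${pair#*=}"   # text after the first "="
    oc label nodes "$node" "nvidia.com/mig.config=${profile}" --overwrite=true
  done
  # -L adds the label value as a column so each node's profile is visible.
  oc get nodes -L nvidia.com/mig.config
}

# Usage: assign_profiles node1=all-1g.10gb node2=all-2g.20gb
```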
To create new custom runtime definitions:
- Download the runtime definition for the GPU runtime (for example, jupyter-231n-pygpu). Refer to Downloading the runtime configuration.
- Inside the runtime definition, add a nodeAffinity property that specifies the newly created MIG profile. The value must match the nvidia.com/mig.config label on the node exactly:
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "nvidia.com/mig.config",
              "operator": "In",
              "values": ["all-1g.10gb"]
            }
          ]
        }
      ]
    }
  }
- Upload the custom runtime definition, as described in Uploading the custom configuration.
- If your cluster contains other GPUs that do not provide MIG support, you must taint these nodes and work with a custom runtime definition that contains tolerations. This ensures that you do not accidentally select a complete GPU instead of a single MIG device. For more information on taints and tolerations, see Taints and tolerations. Here is an example of tolerations:
  "tolerations": [
    {
      "key": "DontSchedule",
      "operator": "Exists",
      "effect": "NoExecute"
    }
  ]
If you want to avoid other workloads being scheduled onto a MIG node, taint this node as well and add a toleration to the runtime definition.
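A taint that the example toleration would match can be applied with oc adm taint. The sketch below is an assumption-laden illustration: the function names are hypothetical, the taint value true is arbitrary (a toleration with the Exists operator ignores the value), and node2 is a placeholder. Keep in mind that the NoExecute effect also evicts pods that are already running on the node and do not tolerate the taint.

```shell
# Sketch: add a NoExecute taint to a node. The key defaults to DontSchedule,
# the key used in the example toleration; pass a second argument to override it.
taint_gpu_node() {
  local node="$1"
  local key="${2:-DontSchedule}"
  oc adm taint nodes "$node" "${key}=true:NoExecute"
}

# Sketch: remove the taint again (the trailing "-" deletes it).
untaint_gpu_node() {
  local node="$1"
  local key="${2:-DontSchedule}"
  oc adm taint nodes "$node" "${key}:NoExecute-"
}

# Usage: taint_gpu_node node2
```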
After the custom runtime definitions are uploaded, users can create new environments in their projects and select the nodes that offer a certain MIG profile.
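Because a required nodeAffinity rule only schedules a runtime onto nodes whose label matches exactly, it can be worth checking that at least one node carries the profile named in a custom runtime definition. A minimal sketch, with nodes_for_profile as a hypothetical helper name:

```shell
# Sketch: list the nodes whose nvidia.com/mig.config label equals the profile.
# An empty result means a runtime that requires this profile cannot be scheduled.
nodes_for_profile() {
  local profile="$1"
  oc get nodes -l "nvidia.com/mig.config=${profile}" -o name
}

# Usage: nodes_for_profile all-1g.10gb
```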