Submitting jobs that require GPU resources

Use the bsub -gpu option to specify GPU resource requirements during job submission, or submit your job to a queue or application profile that has GPU resource requirements configured in the GPU_REQ parameter. For complex GPU resource requirements (including alternative or compound resource requirements), use the bsub -R option.

Procedure

  • Use the bsub -gpu option to submit a job with GPU resource requirements.

    bsub -gpu - | "[num=num_gpus[/task | host]] [:mode=shared | exclusive_process] [:mps=yes[,shared][,nocvd] | no | per_socket[,shared][,nocvd] | per_gpu[,shared][,nocvd]] [:j_exclusive=yes | no] [:aff=yes | no] [:block=yes | no] [:gpack=yes | no] [:glink=yes] [:gvendor=amd | nvidia] [:gmodel=model_name[-mem_size]] [:gtile=tile_num|'!'] [:gmem=mem_value] [:mig=GI_size[/CI_size]]"

    -
    Specifies that the job does not set job-level GPU requirements. Use the hyphen with no letter to set the effective GPU requirements, which are defined at the cluster, queue, or application profile level.

    If a GPU requirement is specified at the cluster, queue, and application profile level, each option (num, mode, mps, j_exclusive, gmodel, gtile, gmem, and nvlink) of the GPU requirement is merged separately. Application profile level overrides queue level, which overrides the cluster level default GPU requirement.

    If there are no GPU requirements defined at the cluster, queue, or application profile level, the default value is "num=1:mode=shared:mps=no:j_exclusive=no".

    num=num_gpus[/task | host]
    The number of physical GPUs required by the job. By default, the number is per host. You can also specify that the number is per task by specifying /task after the number.

    If you specify that the number is per task, if the ngpus_physical resource in the lsb.resources file is configured as PER_TASK, or if the RESOURCE_RESERVE_PER_TASK=Y parameter is set in the lsb.params file, this number is the requested count per task.
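
    For example, a submission such as the following (an illustrative sketch that reuses the ./myjob executable from the other examples in this section) requests 4 tasks with 2 GPUs for each task, rather than 2 GPUs per host:

    bsub -n 4 -gpu "num=2/task" ./myjob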

    mode=shared | exclusive_process
    The GPU mode when the job is running, either shared or exclusive_process. The default mode is shared.

    The shared mode corresponds to the Nvidia or AMD DEFAULT compute mode. The exclusive_process mode corresponds to the Nvidia EXCLUSIVE_PROCESS compute mode.

    Note: Do not specify exclusive_process when you are using AMD GPUs (that is, when gvendor=amd is specified).
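
    For example, a submission such as the following (an illustrative sketch) requests one GPU in exclusive_process mode:

    bsub -gpu "num=1:mode=exclusive_process" ./myjob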
    mps=yes[,nocvd][,shared] | per_socket[,shared][,nocvd] | per_gpu[,shared][,nocvd] | no
    Enables or disables the Nvidia Multi-Process Service (MPS) for the GPUs that are allocated to the job. Using MPS effectively causes the EXCLUSIVE_PROCESS mode to behave like the DEFAULT mode for all MPS clients. MPS always allows multiple clients to use the GPU through the MPS server.
    Note: To avoid inconsistent behavior, do not enable mps when you are using AMD GPUs (that is, when gvendor=amd is specified). If the result of merging the GPU requirements at the cluster, queue, application, and job levels is gvendor=amd and mps is enabled (for example, if gvendor=amd is specified at the job level without specifying mps=no, but mps=yes is specified at the application, queue, or cluster level), LSF ignores the mps requirement.

    MPS is useful for both shared and exclusive process GPUs, and allows more efficient sharing of GPU resources and better GPU utilization. See the Nvidia documentation for more information and limitations.

    When using MPS, use the EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration among all CUDA processes for that GPU.

    You can also enable MPS daemon sharing by adding the shared keyword with a comma and no space (for example, mps=yes,shared enables MPS daemon sharing on the host). If sharing is enabled, all jobs that are submitted by the same user with the same resource requirements share the same MPS daemon on the host, socket, or GPU.

    LSF starts MPS daemons on a per-host, per-socket, or per-GPU basis depending on the value of the mps keyword:

    • If mps=yes is set, LSF starts one MPS daemon per host per job.

      When sharing is enabled (that is, if mps=yes,shared is set), LSF starts one MPS daemon per host for all jobs that are submitted by the same user with the same resource requirements. These jobs all use the same MPS daemon on the host.

      By default, LSF sets only the CUDA_VISIBLE_DEVICES<number> environment variables for tasks, not CUDA_VISIBLE_DEVICES; LSF MPI converts the CUDA_VISIBLE_DEVICES<number> environment variables into CUDA_VISIBLE_DEVICES and sets that for the tasks. When the CUDA_VISIBLE_DEVICES environment variable is disabled (that is, if mps=yes,nocvd is set), LSF does not set the CUDA_VISIBLE_DEVICES<number> environment variables for tasks, so LSF MPI does not set CUDA_VISIBLE_DEVICES for the tasks.

    • If mps=per_socket is set, LSF starts one MPS daemon per socket per job. When sharing is enabled (that is, if mps=per_socket,shared is set), LSF starts one MPS daemon per socket for all jobs that are submitted by the same user with the same resource requirements. These jobs all use the same MPS daemon for the socket.
    • If mps=per_gpu is set, LSF starts one MPS daemon per GPU per job. When sharing is enabled (that is, if mps=per_gpu,shared is set), LSF starts one MPS daemon per GPU for all jobs that are submitted by the same user with the same resource requirements. These jobs all use the same MPS daemon for the GPU.
    Important: Using EXCLUSIVE_THREAD mode with MPS is not supported and might cause unexpected behavior.
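
    For example, a submission such as the following (an illustrative sketch) requests 2 exclusive_process GPUs and starts one shared MPS daemon per GPU, so that jobs from the same user with the same resource requirements share those daemons:

    bsub -gpu "num=2:mode=exclusive_process:mps=per_gpu,shared" ./myjob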
    j_exclusive=yes | no
    Specifies whether the allocated GPUs can be used by other jobs. When the mode is set to exclusive_process, the j_exclusive=yes option is set automatically.
    aff=yes | no
    Specifies whether to enforce strict GPU-CPU affinity binding. If set to no, LSF relaxes GPU affinity while maintaining CPU affinity. By default, aff=yes is set to maintain strict GPU-CPU affinity binding.
    Note: The aff=yes setting conflicts with block=yes (distribute allocated GPUs as blocks when the number of tasks is greater than the requested number of GPUs). This is because strict CPU-GPU binding allocates GPUs to tasks based on the CPU NUMA ID, which conflicts with the distribution of allocated GPUs as blocks. If aff=yes and block=yes are both specified in the GPU requirements string, the block=yes setting takes precedence and strict CPU-GPU affinity binding is disabled (that is, aff=no is automatically set).
    block=yes | no
    Specifies whether to enable block distribution, that is, to distribute the allocated GPUs of a job as blocks when the number of tasks is greater than the requested number of GPUs. If set to yes, LSF distributes all the allocated GPUs of a job as blocks when the number of tasks is bigger than the requested number of GPUs. By default, block=no is set so that allocated GPUs are not distributed as blocks.

    For example, if a GPU job requests to run on a host with 4 GPUs and 40 tasks, block distribution assigns GPU0 for ranks 0-9, GPU1 for ranks 10-19, GPU2 for ranks 20-29, and GPU3 for ranks 30-39.

    Note: The block=yes setting conflicts with aff=yes (strict CPU-GPU affinity binding). This is because strict CPU-GPU binding allocates GPUs to tasks based on the CPU NUMA ID, which conflicts with the distribution of allocated GPUs as blocks. If block=yes and aff=yes are both specified in the GPU requirements string, the block=yes setting takes precedence and strict CPU-GPU affinity binding is disabled (that is, aff=no is automatically set).
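
    For example, a submission such as the following (an illustrative sketch of the 4 GPU, 40 task scenario that is described above, using the standard span[hosts=1] requirement to keep the tasks on one host) distributes the allocated GPUs as blocks across the tasks:

    bsub -n 40 -R "span[hosts=1]" -gpu "num=4:block=yes" ./myjob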
    gpack=yes | no
    For shared mode jobs only. Specifies whether to enable pack scheduling. If set to yes, LSF packs multiple shared mode GPU jobs to allocated GPUs. LSF schedules shared mode GPUs as follows:
    1. LSF sorts the candidate hosts (from largest to smallest) based on the number of shared GPUs that already have running jobs, then by the number of GPUs that are not exclusive.

      If the order[] keyword is defined in the resource requirements string, after sorting order[], LSF re-sorts the candidate hosts by the gpack policy (by shared GPUs that already have running jobs first, then by the number of GPUs that are not exclusive). The gpack policy sort priority is higher than the order[] sort.

    2. LSF sorts the candidate GPUs on each host (from largest to smallest) based on the number of running jobs.

    After scheduling, LSF packs the shared mode GPU job onto the allocated shared GPU that sorts first, rather than onto a new shared GPU.

    If Docker attribute affinity is enabled, the candidate hosts are sorted by Docker attribute affinity before they are sorted by GPUs.

    By default, gpack=no is set so that pack scheduling is disabled.
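
    For example, a submission such as the following (an illustrative sketch) requests one shared mode GPU and allows LSF to pack it onto a shared GPU that already has running jobs:

    bsub -gpu "num=1:mode=shared:j_exclusive=no:gpack=yes" ./myjob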

    gvendor=amd | nvidia
    Specifies the GPU vendor type. LSF allocates GPUs with the specified vendor type.

    Specify amd to request AMD GPUs, or specify nvidia to request Nvidia GPUs.

    By default, LSF requests Nvidia GPUs.
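
    For example, a submission such as the following (an illustrative sketch) requests 2 AMD GPUs:

    bsub -gpu "num=2:gvendor=amd" ./myjob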

    gmodel=model_name[-mem_size]
    Specifies GPUs with the specific model name and, optionally, its total GPU memory. By default, LSF allocates the GPUs with the same model, if available.

    The gmodel keyword supports the following formats:

    gmodel=model_name
    Requests GPUs with the specified brand and model name (for example, TeslaK80).
    gmodel=short_model_name
    Requests GPUs with a specific brand name (for example, Tesla, Quadro, or NVS) or model type name (for example, K80 or P100).
    gmodel=model_name-mem_size
    Requests GPUs with the specified brand name and total GPU memory size. The GPU memory size consists of a number and its unit (M, G, T, MB, GB, or TB), for example, 12G.

    To find the available GPU model names on each host, run the lsload -gpuload, lshosts -gpu, or bhosts -gpu commands. The model name string does not contain space characters. In addition, the slash (/) and hyphen (-) characters are replaced with the underscore character (_). For example, the GPU model name "Tesla C2050 / C2070" is converted to "TeslaC2050_C2070" in LSF.

    gmem=mem_value
    Specifies the GPU memory that the job requires on each GPU. The format of mem_value is the same as for other resource values (for example, mem or swap) in the rusage section of the job resource requirements (-R).
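
    For example, a submission such as the following (an illustrative sketch) reserves 12 GB of GPU memory on the allocated GPU:

    bsub -gpu "num=1:gmem=12G" ./myjob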

    gtile=! | tile_num
    Specifies the number of GPUs per socket. Specify a number to explicitly define the number of GPUs per socket on the host, or specify an exclamation mark (!) to enable LSF to automatically calculate the number, which evenly divides the GPUs along all sockets on the host. LSF guarantees the gtile requirement even for affinity jobs, which means that LSF might allocate GPUs that do not have affinity with the allocated CPUs when the gtile requirement cannot otherwise be satisfied.

    If the gtile keyword is not specified for an affinity job, LSF attempts to allocate enough GPUs on the sockets that have the allocated CPUs. If there are not enough GPUs on the optimal sockets, the job cannot go to this host.

    If the gtile keyword is not specified for a non-affinity job, LSF attempts to allocate enough GPUs on the same socket. If that is not possible, LSF might allocate GPUs on separate sockets.
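
    For example, a submission such as the following (an illustrative sketch) requests 4 GPUs and places 2 GPUs on each socket of the allocated host:

    bsub -gpu "num=4:gtile=2" ./myjob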

    nvlink=yes
    Obsolete in LSF, Version 10.1 Fix Pack 11. Use the glink keyword instead. Enables job enforcement of NVLink connections among GPUs. LSF must allocate GPUs with NVLink connections.
    glink=yes
    Enables job enforcement for special connections among GPUs. LSF must allocate GPUs with the special connections that are specific to the GPU vendor.

    If the job requests AMD GPUs, LSF must allocate GPUs with the xGMI connection. If the job requests Nvidia GPUs, LSF must allocate GPUs with the NVLink connection.

    Do not use glink together with the obsolete nvlink keyword.

    By default, LSF can allocate GPUs without special connections when there are not enough GPUs with these connections.
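
    For example, a submission such as the following (an illustrative sketch) requests 2 Nvidia GPUs that are connected with NVLink:

    bsub -gpu "num=2:gvendor=nvidia:glink=yes" ./myjob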

    mig=GI_size[/CI_size]
    Specifies Nvidia Multi-Instance GPU (MIG) device requirements.

    Specify the requested number of GPU instances for the MIG job. Valid GPU instance sizes are 1, 2, 3, 4, 7.

    Optionally, specify the requested number of compute instances after the specified GPU instance size and a slash character (/). The requested compute instance size must be less than or equal to the requested GPU instance size. In addition, Nvidia MIG does not support the following GPU/compute instance size combinations: 4/3, 7/5, 7/6. If this is not specified, the default compute instance size is 1.
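
    For example, a submission such as the following (an illustrative sketch) requests one GPU with a MIG device that has a GPU instance size of 4 and a compute instance size of 2:

    bsub -gpu "num=1:mig=4/2" ./myjob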

    If the GPU_REQ_MERGE parameter is defined as Y or y in the lsb.params file and a GPU requirement is specified at multiple levels (at least two of the default cluster, queue, application profile, or job level requirements), each option of the GPU requirement is merged separately. Job level overrides application level, which overrides queue level, which overrides the default cluster GPU requirement. For example, if the mode option of the GPU requirement is defined on the -gpu option and the mps option is defined in the queue, the mode value from the job level and the mps value from the queue level are used.

    If the GPU_REQ_MERGE parameter is not defined as Y or y in the lsb.params file and a GPU requirement is specified at multiple levels (at least two of the default cluster, queue, application profile, or job level requirements), the entire GPU requirement string is replaced. The entire job level GPU requirement string overrides application level, which overrides queue level, which overrides the default GPU requirement.
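
    For example, assume a hypothetical queue named gpu_queue (as in the queue example later in this procedure) that defines GPU_REQ="num=2:mps=yes" in the lsb.queues file, and a job that is submitted as follows:

    bsub -q gpu_queue -gpu "mode=exclusive_process" ./myjob

    If GPU_REQ_MERGE=Y is set in the lsb.params file, the options are merged and the effective requirement is num=2:mode=exclusive_process:mps=yes. If GPU_REQ_MERGE is not set, the entire job-level string replaces the queue-level string, so the unspecified options take their default values (for example, num=1 and mps=no).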

    The esub parameter LSB_SUB4_GPU_REQ modifies the value of the -gpu option.

    The GPU requirements are converted to rusage resource requirements for the job. For example, num=2 is converted to rusage[ngpus_physical=2]. Use the bjobs, bhist, and bacct commands to see the merged resource requirement.

    For more details on using the bsub -gpu command option, refer to bsub -gpu.

    For example, the following command requests 2 Tesla GPUs with a total memory size of 12 GB on each GPU, reserves 8 GB of GPU memory on each GPU, and evenly spreads the GPUs over all sockets on the host. The job submission enables Nvidia MPS on the GPUs, and any allocated GPUs must have NVLink connections:

    bsub -gpu "num=2:gmodel=Tesla-12G:gmem=8G:gtile='!':mps=yes:glink=yes" ./myjob

  • If you have GPU resource requirements specified in an application profile or queue, submit a job to the application profile or queue with GPU resource requirements.

    bsub -app gpu_app ./myjob

    bsub -q "gpu_queue" ./myjob

  • Use the bsub -R command option to submit a job with complex GPU resource requirements.

    There might be complex GPU resource requirements that the bsub -gpu option and GPU_REQ parameter syntax cannot cover, including compound GPU requirements (for different GPU requirements for jobs on different hosts, or for different parts of a parallel job) and alternate GPU requirements (if more than one set of GPU requirements might be acceptable for a job to run).

    The following bsub -R options are specific for GPU resource requirements:

    bsub -R " span[gtile=tile_num | '!'] rusage[ngpus_physical=num_gpus:gmodel=model_name[-mem_size]:gmem=mem_value:glink=yes]"

    • In the span[] section, use the gtile keyword to specify the number of GPUs requested on each socket.
    • In the rusage[] section, use the ngpus_physical resource to request the number of physical GPUs, together with the gmodel option to specify the GPU model (and the optional total memory size), the gmem option to specify the amount of reserved GPU memory, and the glink option to request only GPUs with special connections (xGMI connections for AMD GPUs or NVLink connections for Nvidia GPUs).

    If you need to use the mode, j_exclusive, and mps options, you must use simple GPU requirements. You cannot use these options in the rusage[] section for complex GPU resource requirements.

    Important: If you specify complex GPU resource requirements at the job level with the bsub -R command option, the queue level with the RES_REQ parameter in the lsb.queues file, or the application profile level with the RES_REQ parameter in the lsb.applications file, LSF ignores the simple GPU requirements at all levels (that is, LSF ignores the bsub -gpu command option, the LSB_GPU_REQ parameter in the lsf.conf file, and GPU_REQ parameter in the lsb.queues and lsb.applications files).

    For more details on specifying resource requirements, including complex GPU resource requirements, see Specifying resource requirements.

    For example, the following alternative resource requirement string specifies a parallel job with 2 tasks, with each task running on a separate host that either has 4 K80 GPUs or 2 P100 GPUs evenly divided on each socket:

    bsub -n 2 -R "span[ptile=1,gtile=!] rusage[ngpus_physical=4:gmodel=K80 || ngpus_physical=2:gmodel=P100]" ./myjob
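
    A compound GPU requirement can be sketched in a similar way. For example, the following illustrative command (an assumed sketch that uses the standard compound resource requirement syntax, where each term applies to a different part of the parallel job) runs 1 task on a host with 2 K80 GPUs and the remaining 4 tasks on hosts with 1 GPU each:

    bsub -n 5 -R "1*{rusage[ngpus_physical=2:gmodel=K80]} + 4*{rusage[ngpus_physical=1]}" ./myjob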

Results

The order of GPU conditions that LSF uses when allocating GPUs is as follows:

  1. The largest GPU compute capability (gpu_factor value).
  2. GPUs with direct NVLink connections.
  3. GPUs with the same model, including the GPU total memory size.
  4. The largest available GPU memory.
  5. The number of concurrent jobs on the same GPU.
  6. The current GPU mode.