Enabling jobs to use GPU resources

LSF jobs can specify GPU resource requirements in one statement.

Note: Extended GPU resource requirement specification is now enabled and fixed at LSF_GPU_NEW_SYNTAX=extend.

You can specify all GPU requirements for your job together with the bsub -gpu option, or configure them in a queue, in an application profile, or as a default GPU requirement. Your job's resource requirements cannot use the legacy GPU resources (ngpus_shared, ngpus_excl_t, ngpus_excl_p).

If the PREEMPTABLE_RESOURCES parameter in the lsb.params file includes the ngpus_physical resource, GPU preemption is enabled with one restriction: higher priority GPU jobs cannot preempt GPU jobs that specify mode=shared in their GPU resource requirements if multiple jobs are running on the GPU. As of Fix Pack 14, this restriction is removed for higher priority GPU jobs with j_exclusive=yes or mode=exclusive_process settings. These jobs can preempt shared-mode GPU jobs even if multiple jobs are running on the GPU. Configure the MIG, RERUNNABLE, or REQUEUE parameters properly to ensure that GPU resources are released after a job is preempted.
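The following Python sketch models only the shared-mode preemption restriction described above, which can help when reasoning about how the Fix Pack 14 change affects a given pair of jobs. It is not LSF scheduler code: the GpuReq class, the shared_gpu_preemption_allowed() function, and the fix_pack integer are hypothetical. The sketch assumes that ngpus_physical is already listed in PREEMPTABLE_RESOURCES and that every other preemption condition (job priority, preemptive and preemptable queues) is already satisfied.

```python
from dataclasses import dataclass


@dataclass
class GpuReq:
    # Hypothetical model of the two GPU requirement options that matter here.
    mode: str = "shared"        # "shared" or "exclusive_process"
    j_exclusive: str = "no"     # "yes" or "no"


def shared_gpu_preemption_allowed(preemptor: GpuReq, victim: GpuReq,
                                  jobs_on_gpu: int, fix_pack: int) -> bool:
    """Return True if the shared-mode restriction does not block preemption.

    Assumes ngpus_physical is in PREEMPTABLE_RESOURCES and that all other
    preemption conditions are already met (this is a sketch, not LSF logic).
    """
    # The restriction only concerns shared-mode victims on a GPU that is
    # running more than one job.
    restricted = victim.mode == "shared" and jobs_on_gpu > 1
    if not restricted:
        return True
    if fix_pack < 14:
        # Before Fix Pack 14 the restriction applies to every preemptor.
        return False
    # Fix Pack 14 and later: only preemptors with j_exclusive=yes or
    # mode=exclusive_process may preempt the shared-mode job.
    return preemptor.j_exclusive == "yes" or preemptor.mode == "exclusive_process"


shared = GpuReq()
exclusive = GpuReq(mode="exclusive_process")
print(shared_gpu_preemption_allowed(exclusive, shared, jobs_on_gpu=2, fix_pack=13))
print(shared_gpu_preemption_allowed(exclusive, shared, jobs_on_gpu=2, fix_pack=14))
```

With two jobs sharing the GPU, the first call prints False and the second prints True, reflecting the Fix Pack 14 change. A higher priority job that is itself mode=shared with j_exclusive=no is still blocked in both cases.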

If an option of the GPU requirement is not defined, its default value is used. The defaults are "num=1:mode=shared:mps=no:j_exclusive=no". To specify a different default GPU resource requirement, use the LSB_GPU_REQ parameter in the lsf.conf file.
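As an illustration of how defaults fill in undefined options, here is a minimal Python sketch. The helper names parse_gpu_req() and apply_defaults() are hypothetical, only the four options listed above are modelled, and no validation is done; real LSF syntax accepts more keywords than this.

```python
# Built-in defaults, as listed above.
DEFAULT_GPU_REQ = "num=1:mode=shared:mps=no:j_exclusive=no"


def parse_gpu_req(req: str) -> dict:
    """Split a 'key=value:key=value' string into a dict (no validation)."""
    options = {}
    for part in req.split(":"):
        if not part:
            continue  # tolerate empty strings and stray colons
        key, _, value = part.partition("=")
        options[key.strip()] = value.strip()
    return options


def apply_defaults(req: str) -> dict:
    """Use each option from req if present, else its default value."""
    resolved = parse_gpu_req(DEFAULT_GPU_REQ)  # start from the defaults
    resolved.update(parse_gpu_req(req))        # options given in req win
    return resolved


print(apply_defaults("num=2:mode=exclusive_process"))
# {'num': '2', 'mode': 'exclusive_process', 'mps': 'no', 'j_exclusive': 'no'}
```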

You can also specify GPU resource requirements with the GPU_REQ parameter in a queue (lsb.queues file) or application profile (lsb.applications file).

If a GPU requirement is specified at more than one level (the cluster level in the lsf.conf file, a queue, an application profile, or the job level), each option (num, mode, mps, and j_exclusive) of the GPU requirement is merged separately. Job-level settings override application profile settings, which override queue settings, which override cluster-level configuration. For example, if the mode option is defined with the bsub -gpu option and the mps option is defined in the queue, the job uses the job-level mode value and the queue's mps value.
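The per-option merge can be sketched in Python by applying each level in order of increasing precedence, so that a later level overwrites only the options it defines. The function names are hypothetical, and placing the built-in defaults beneath the cluster level is an assumption drawn from the rule that undefined options take their default values; this is a model for reasoning, not LSF's implementation.

```python
# Built-in defaults, assumed to sit beneath every configured level.
DEFAULT_GPU_REQ = "num=1:mode=shared:mps=no:j_exclusive=no"


def parse_gpu_req(req):
    """Split a 'key=value:key=value' string into a dict; None gives {}."""
    options = {}
    for part in (req or "").split(":"):
        if part:
            key, _, value = part.partition("=")
            options[key.strip()] = value.strip()
    return options


def merge_gpu_req(cluster=None, queue=None, app=None, job=None):
    """Merge GPU requirements option by option.

    Precedence, lowest first: built-in default < cluster (LSB_GPU_REQ)
    < queue (GPU_REQ in lsb.queues) < application profile (GPU_REQ in
    lsb.applications) < job (bsub -gpu).
    """
    merged = parse_gpu_req(DEFAULT_GPU_REQ)
    for level in (cluster, queue, app, job):
        # Each level overrides only the options it actually defines.
        merged.update(parse_gpu_req(level))
    return merged


# The example from the text: mode at job level, mps in the queue.
print(merge_gpu_req(queue="mps=yes", job="mode=exclusive_process"))
# {'num': '1', 'mode': 'exclusive_process', 'mps': 'yes', 'j_exclusive': 'no'}
```

Because the merge works per option rather than per string, a job that sets only mode still inherits mps from the queue instead of falling back to the default.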