Control job locality using compute units

Compute units are groups of hosts laid out by the LSF administrator and configured to mimic the network architecture, minimizing communications overhead for optimal placement of parallel jobs. Different granularities of compute units provide the flexibility to model a large cluster accurately and to run larger jobs over larger compute units.

Resource requirement keywords within the compute unit section can be used to allocate resources throughout compute units in a manner analogous to host resource allocation. Compute units then replace hosts as the basic unit of allocation for a job.

High performance computing clusters running large parallel jobs spread over many hosts benefit from using compute units. Communications bottlenecks within the network architecture of a large cluster can be isolated through careful configuration of compute units. With compute units instead of hosts as the basic allocation unit, scheduling policies can be applied on a larger scale.

Note:

Configure each individual host as a compute unit to use the compute unit features for host level job allocation.

[Figure: Types of compute unit]

As shown in the figure, two types of compute units have been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= enclosure! rack

! indicates the default compute unit type. The first type listed (enclosure) is the finest granularity and the only type of compute unit containing hosts and host groups. Coarser granularity rack compute units can only contain enclosures.
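
A job selects the compute unit type with the type keyword inside the cu[] section of its resource requirement string (described under Syntax below); when type is omitted, the default type marked with ! is used. For example (illustrative commands; the application name and slot counts are placeholders):

bsub -n 32 -R "cu[]" ./app               # allocated in terms of the default type, enclosure
bsub -n 128 -R "cu[type=rack]" ./app     # allocated in terms of rack compute units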

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows (some lines omitted):

Begin ComputeUnit 
NAME         MEMBER             CONDENSED TYPE
enclosure1   (host1[01-16])     Y         enclosure
...
enclosure8   (host8[01-16])     Y         enclosure
rack1        (enclosure[1-2])   Y         rack
rack2        (enclosure[3-4])   Y         rack
rack3        (enclosure[5-6])   Y         rack
rack4        (enclosure[7-8])   Y         rack
End ComputeUnit
This example defines 12 compute units, all of which have condensed output:
  • enclosure1 through enclosure8 are the finest granularity, and each contains 16 hosts.
  • rack1, rack2, rack3, and rack4 are the coarsest granularity, and each contains 2 enclosures.
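
After editing the ComputeUnit section of lsb.hosts, the configuration must be reloaded before it takes effect, and the resulting membership can then be inspected. A typical sequence is sketched below (the bmgroup -cu option lists compute units and their members; exact output varies by LSF version):

badmin reconfig     # reload lsb.hosts and the rest of the LSF batch configuration
bmgroup -cu         # display compute units and their member hosts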

Syntax

The cu string supports the following syntax:

cu[balance]
All compute units used for this job should contribute the same number of slots (to within one slot). It provides a balanced allocation over the fewest possible compute units.
cu[pref=bestfit]
The job is placed so that it spans as few compute units as possible (given the current resource availability), preferring already occupied resources in order to reduce fragmentation in the cluster. Do not use with the balance keyword.
cu[pref=config]
Compute units for this job are considered in the order that they appear in the lsb.hosts configuration file. This is the default value.
cu[pref=minavail]
Compute units with the fewest available slots are considered first for this job. It is useful for smaller jobs (both sequential and parallel) since this minimizes the possible fragmentation of compute units, leaving whole compute units free for larger jobs.
cu[pref=maxavail]
Compute units with the most available slots are considered first for this job.
cu[maxcus=number]
Maximum number of compute units the job can run across.
cu[usablecuslots=number]

All compute units used for this job should contribute the same minimum number of slots. At most the final allocated compute unit can contribute fewer than number slots.

cu[type=cu_type]
Type of compute unit being used, where cu_type is one of the types defined by COMPUTE_UNIT_TYPES in lsb.params. The default is the compute unit type listed first in lsb.params.
cu[excl]
Compute units used exclusively for the job. Must be enabled by EXCLUSIVE in lsb.queues.
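
Several of these keywords can be combined in a single cu[...] section, separated by colons, and the cu[...] section can appear alongside other resource requirement sections such as span[] and rusage[]. The following submission is an illustrative sketch only (the slot count, memory value, and application name are placeholders):

bsub -n 64 -R "rusage[mem=1000] cu[type=enclosure:balance:maxcus=4] span[ptile=4]" ./app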

Continuing with the example shown above, assume that lsb.queues contains the parameter definition EXCLUSIVE=CU[rack], and that the slots available for each compute unit are shown under MAX in the condensed display from bhosts, where HOST_NAME refers to the compute unit:

HOST_NAME    STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
enclosure1   ok       -     64   34     34    0      0      0
enclosure2   ok       -     64   54     54    0      0      0
enclosure3   ok       -     64   46     46    0      0      0
enclosure4   ok       -     64   44     44    0      0      0
enclosure5   ok       -     64   45     45    0      0      0
enclosure6   ok       -     64   44     44    0      0      0
enclosure7   ok       -     64   0      0     0      0      0
enclosure8   ok       -     64   0      0     0      0      0
rack1        ok       -     128  88     88    0      0      0
rack2        ok       -     128  90     90    0      0      0
rack3        ok       -     128  89     89    0      0      0
rack4        ok       -     128  0      0     0      0      0

Based on the 12 configured compute units, jobs can be submitted with a variety of compute unit requirements.

Use compute units

  1. bsub -R "cu[]" -n 64 ./app

    This job is restricted to compute units of the default type enclosure. The default pref=config applies, with compute units considered in configuration order. The job runs on 30 slots in enclosure1, 10 slots in enclosure2, 18 slots in enclosure3, and 6 slots in enclosure4 for a total of 64 slots.

  2. Compute units can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job.

    bsub -R "cu[pref=minavail]" -n 32 ./app

    This job is restricted to compute units of the default type enclosure in the order pref=minavail. Compute units with the fewest free slots are considered first. The job runs on 10 slots in enclosure2, 18 slots in enclosure3 and 4 slots in enclosure5 for a total of 32 slots.

  3. bsub -R "cu[pref=maxavail]" -n 64 ./app

    This job is restricted to compute units of the default type enclosure in the order pref=maxavail. Compute units with the most free slots are considered first. The job runs on 64 slots in enclosure8.
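
The hosts actually allocated to a job within each compute unit can be checked after submission. For example (the job ID is a placeholder; bhosts -X shows the per-host, uncondensed counterpart of the condensed display above):

bjobs -l 1234     # the long-format output lists the execution hosts for the job
bhosts -X         # display individual hosts instead of condensed compute unit totals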

Localized allocations

Jobs can be run over a limited number of compute units using the maxcus keyword.

  1. bsub -R "cu[pref=maxavail:maxcus=1]" ./app

    This job is restricted to a single enclosure, and compute units with the most free slots are considered first. The job requirements are satisfied by enclosure8 which has 64 free slots.

  2. bsub -n 64 -R "cu[maxcus=3]" ./app
    This job requires a total of 64 slots over at most 3 enclosures. Compute units are considered in configuration order. The job requirements are satisfied by the following allocation:
    Compute unit    Number of slots
    enclosure1      30
    enclosure3      18
    enclosure4      16

Balanced slot allocations

Balanced allocations split jobs evenly between compute units, which increases the efficiency of some applications.

  1. bsub -n 80 -R "cu[balance:maxcus=4]" ./app

    This job requires a balanced allocation over the fewest possible compute units of type enclosure (the default type), with a total of 80 slots. Since no enclosure has 80 slots available, 2 compute units with 40 slots each are used, satisfying the maxcus requirement of at most 4 compute units.

    The keyword pref is not included, so the default order of pref=config is used. The job requirements are satisfied by 40 slots each in enclosure7 and enclosure8, for a total of 80 slots.

  2. bsub -n 64 -R "cu[balance:type=rack:pref=maxavail]" ./app

    This job requires a balanced allocation over the fewest possible compute units of type rack, with a total of 64 slots. Compute units with the most free slots are considered first, in the order rack4, rack1, rack3, rack2. The job requirements are satisfied by rack4.

  3. bsub -n "40,80" -R "cu[balance:pref=minavail]" ./app

    This job requires a balanced allocation over compute units of the default type enclosure, with a range of 40 to 80 slots. Only the minimum number of slots is considered when a range is specified along with the keyword balance, so the job needs 40 slots. Compute units with the fewest free slots are considered first.

    Because balance uses the fewest possible compute units, enclosures with 40 or more free slots are considered, namely enclosure7 and enclosure8; all job requirements are satisfied by 40 slots in enclosure8.

Balanced host allocations

Using balance and ptile together within the requirement string results in a balanced host allocation over compute units, and the same number of slots from each host. The final host may provide fewer slots if required.

  • bsub -n 64 -R "cu[balance] span[ptile=4]" ./app

    This job requires a balanced allocation over the fewest possible compute units of type enclosure, with a total of 64 slots. Each host used must provide 4 slots. Since enclosure8 has 64 slots available over 16 hosts (4 slots per host), it satisfies the job requirements.

    Had enclosure8 not satisfied the requirements, other possible allocations in order of consideration (fewest compute units first) include:
    Number of compute units    Number of hosts
    2                          8+8
    3                          5+5+6
    4                          4+4+4+4
    5                          3+3+3+3+4

Minimum slot allocations

Minimum slot allocations result in jobs spreading over fewer compute units, and in compute units with few available slots being ignored.

  1. bsub -n 45 -R "cu[usablecuslots=10:pref=minavail]" ./app
    This job requires an allocation of at least 10 slots in each enclosure, except possibly the last one. Compute units with the fewest free slots are considered first. The requirements are satisfied by a slot allocation of:
    Compute unit    Number of slots
    enclosure2      10
    enclosure5      19
    enclosure4      16

  2. bsub -n "1,140" -R "cu[usablecuslots=20]" ./app
    This job requires an allocation of at least 20 slots in each enclosure, except possibly the last one. Compute units are considered in configuration order, and as close to 140 slots as possible are allocated. The requirements are satisfied by an allocation of 140 slots, where only the last compute unit has fewer than 20 slots allocated, as follows:
    Compute unit    Number of slots
    enclosure1      30
    enclosure4      20
    enclosure6      20
    enclosure7      64
    enclosure8      6

Exclusive compute unit jobs

Because EXCLUSIVE=CU[rack] in lsb.queues, jobs may use compute units of type rack or finer granularity type enclosure exclusively. Exclusive jobs lock all compute units they run in, even if not all slots are being used by the job. Running compute unit exclusive jobs minimizes communications slowdowns resulting from shared network bandwidth.
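
The EXCLUSIVE=CU[rack] setting is a queue-level parameter. A minimal lsb.queues queue definition of this kind might look like the following sketch (the queue name, priority, and description are illustrative only):

Begin Queue
QUEUE_NAME   = cu_exclusive
PRIORITY     = 40
EXCLUSIVE    = CU[rack]
DESCRIPTION  = Queue that accepts compute unit exclusive (cu[excl]) jobs
End Queue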

  1. bsub -R "cu[excl:type=enclosure]" ./app

    This job requires exclusive use of an enclosure with compute units considered in configuration order. The first enclosure not running any jobs is enclosure7.

  2. Using excl with usablecuslots, the job avoids compute units where a large portion of the hosts are unavailable.

    bsub -n 90 -R "cu[excl:usablecuslots=12:type=enclosure]" ./app

    This job requires exclusive use of compute units, and will not use a compute unit if fewer than 12 slots are available. Compute units are considered in configuration order. In this case the job requirements are satisfied by 64 slots in enclosure7 and 26 slots in enclosure8.

  3. bsub -R "cu[excl:type=rack]" ./app

    This job requires exclusive use of a rack with compute units considered in configuration order. The only rack not running any jobs is rack4.

Reservation

Compute unit constraints such as the keywords maxcus, balance, and excl can result in inaccurately predicted start times from default LSF resource reservation. Time-based resource reservation provides a more accurate predicted start time for pending jobs. When calculating a job's time-based predicted start time, LSF considers job scheduling constraints and requirements, including, for example, job topology and resource limits.
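
Time-based resource reservation is typically enabled with the LSB_TIME_RESERVE_NUMJOBS parameter in lsb.params; the value limits how many pending jobs use time-based reservation, and the number shown here is illustrative only (check the lsb.params reference for your LSF version):

LSB_TIME_RESERVE_NUMJOBS = 4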

Host-level compute units

Configuring each individual host as a compute unit allows you to use the compute unit features for host-level job allocation. Consider an example where one type of compute unit has been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= host!

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows:

Begin ComputeUnit 
NAME  MEMBER   TYPE
h1    host1    host
h2    host2    host
...
h50   host50   host
End ComputeUnit

Each configured compute unit of default type host contains a single host.

Order host allocations

Using the compute unit keyword pref, hosts can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job:

  1. bsub -R "cu[]" ./app

    Compute units of default type host, each containing a single host, are considered in configuration order.

  2. bsub -R "cu[pref=minavail]" ./app

    Compute units of default type host each contain a single host. Compute units with the fewest free slots are considered first.

  3. bsub -n 20 -R "cu[pref=maxavail]" ./app

    Compute units of default type host each contain a single host. Compute units with the most free slots are considered first. A total of 20 slots are allocated for this job.

Limit hosts in allocations

Use the compute unit keyword maxcus to set the maximum number of hosts allocated to a job:

  • bsub -n 12 -R "cu[pref=maxavail:maxcus=3]" ./app

    Compute units of default type host each contain a single host. Compute units with the most free slots are considered first. This job requires an allocation of 12 slots over at most 3 hosts.

Balanced slot allocations

Using the compute unit keyword balance, jobs can be evenly distributed over hosts:

  1. bsub -n 9 -R "cu[balance]" ./app

    Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are:


    Compute units    Hosts    Number of slots per host
    1                1        9
    2                2        4, 5
    3                3        3, 3, 3
    4                4        2, 2, 2, 3
    5                5        2, 2, 2, 2, 1
    6                6        2, 2, 2, 1, 1, 1
    7                7        2, 2, 1, 1, 1, 1, 1
    8                8        2, 1, 1, 1, 1, 1, 1, 1
    9                9        1, 1, 1, 1, 1, 1, 1, 1, 1

  2. bsub -n 9 -R "cu[balance:maxcus=3]" ./app

    Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are 1 host with 9 slots, 2 hosts with 4 and 5 slots, or 3 hosts with 3 slots each.

Minimum slot allocations

Using the compute unit keyword usablecuslots, hosts are only considered if they have a minimum number of slots free and usable for this job:

  1. bsub -n 16 -R "cu[usablecuslots=4]" ./app

    Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available and not occupied by a running job are considered. Each host (except possibly the last host allocated) must contribute at least 4 slots to the job.

  2. bsub -n 16 -R "rusage[mem=1000] cu[usablecuslots=4]" ./app

    Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available, not occupied by a running job, and with 1000 memory units are considered. A host with 10 slots and 2000 units of memory, for example, will only have 2 slots free that satisfy the memory requirements of this job.

Best fit for job placement

Some users may want LSF to place jobs optimally with respect to compute units, without having to specify different requirements for different jobs. For this purpose, LSF provides the bestfit value for the pref keyword. For example, bsub -R "cu[pref=bestfit]" ...

When this algorithm is used, LSF places the job to span as few compute units as possible (given the current resource availability) while preferring to use already occupied resources, in order to reduce fragmentation in the cluster. This is done by considering the underlying network topology of the cluster, as modeled by LSF compute units (CUs).

The "bestfit" value is different from other compute unit placement algorithms in that it considers multiple levels of the compute unit hierarchy, if applicable. It also allows both large and small jobs to use the same compute unit requirements. Ultimately, it simplifies the specification of compute unit requirements.

The [pref=bestfit] option can be used together with:
  • maxcus
  • type=<type>
  • usablecuslots=<num>
  • excl
  • host preference (the -m option)

The [pref=bestfit] algorithm works in two phases: the first phase determines how many compute units the job needs, and the second phase selects the compute units and hosts for the allocation.

Examples of the usage of [pref=bestfit] and how the algorithm finds the best fit of compute units:
  • bsub -R "cu[excl:pref=bestfit]" -n 10 myjob

    An exclusive job on a best fit of compute units

  • bsub -R "cu[pref=bestfit:usablecuslots=5]" -n 15 myjob

    Best-fit job placement with minimum slot allocations for the default level and below.

  • bsub -m "cu1+10 cu3+1" -R "cu[pref=bestfit]" -n 10 myjob

    Best-fit job placement while taking host preference into consideration. Host preference has a higher priority.

  • bsub -R "cu[pref=bestfit:maxcus=2:type=switch]" -n 10 myjob

    Best-fit job placement with specified maxcus for the default level and below. For the second phase, LSF considers the lowest maxcus among the user-specified value and the number of compute units that is calculated in the first phase.

  • bsub -R "8*{select[LN]} + {cu[pref=bestfit]}" myjob

    A job with a compound resource requirement.

Specify the compute unit order with the host preference feature

Before LSF 10.1, the compute unit order was determined only by the compute unit pref policies (cu[pref=config | maxavail | minavail]). Host preference (specified by -m or by HOSTS in lsb.queues) affected only the host order within each compute unit. In LSF 10.1, this behavior was changed so that the compute unit order can be specified more flexibly.

LSF now allows the host preference to specify the compute unit order along with the cu[pref=config | maxavail | minavail] policy. The following example illustrates use of the -m preference to specify the compute unit order as cu1>cu2>cu3>cu4.
bsub -n 2 -m "cu1+10 cu2+5 cu3+1 cu4" -R "cu[]" ./app 
Host preference works along with cu[pref=config | maxavail | minavail] in the following manner to determine compute unit order:
  1. LSF calculates the compute unit preference according to the host preference, taking the highest preference of hosts within the compute unit as the compute unit preference. For example:
    1. In the following example, in which h1 and h2 belong to cu1 and h3 and h4 belong to cu2, according to the candidate host preference, LSF determines that the cu1 preference is 10 and the cu2 preference is 0.
      bsub -n 2 -m "h1+10 others" -R "cu[pref=minavail]" ./app 
    2. In the following example, in which h1 and h2 belong to cu1 and h3 and h4 belong to cu2, according to the candidate host preference, LSF determines that the cu1 preference is 10 and the cu2 preference is 9.
      bsub -n 2 -m "h1+10 h2+1 h3+9 h4+9" -R "cu[pref=minavail]" ./app 
  2. LSF determines the compute unit order as follows:
    1. When the compute unit preference calculated in step 1., above, differs, LSF orders the compute unit by the compute unit preference, considering the compute unit with the higher preference first. For example:
      1. Because cu1 has a higher preference than cu2 in the following example, LSF first considers cu1, then cu2, without regard for the cu[pref=config | maxavail | minavail] policy.
        bsub -n 2 -m "cu1+10 cu2" -R "cu[pref=minavail]" ./app
      2. cu[pref=maxavail] and cu[pref=config] also follow this policy.
    2. When the compute unit preference calculated in step 1., above, is the same, LSF orders the compute unit using cu[pref=config | maxavail | minavail]. For example:
      1. Because none of the compute units in the following example has a preference, LSF uses standard logic to determine compute unit order. pref=minavail takes effect.
        bsub -n 2 -R "cu[pref=minavail]" ./app 
      2. Because all compute units in the following example have the same preference, LSF uses standard logic to determine compute unit order. pref=minavail takes effect.
        bsub -n 2 -m "cu1+10 cu2+10" -R "cu[pref=minavail]" ./app
      3. cu[pref=maxavail] and cu[pref=config] also follow this policy.
  3. After LSF determines the compute unit order, the LSF scheduler adjusts the candidate host list according to the compute unit order.

For resource reservation, the host preference is considered when determining which compute unit's resources are reserved first; the default, pref=config, however, is always used.

The first-execution host works with the compute unit feature as follows:
  • When mandatory first-execution hosts are used together with the compute unit feature, the compute unit that contains the first-execution host is given first preference among compute units. The remaining compute units are ordered according to the calculated compute unit preference. As previously in LSF, exactly one of the candidate first-execution hosts can be used in the job's allocation.
Note:
  1. Compute unit policy pref=config|maxavail|minavail does not affect the order of compute units specified as first-execution hosts.
    This means that even under the following circumstances:
    1. You use the following options:
       -m "cu1! cu2! cu3!" -R "cu[pref=minavail]"
    2. cu1 has 15 free slots.
    3. cu2 has 10 free slots.
    4. cu3 has 20 free slots.
    LSF does not reorder the first-execution host list under the compute unit cu[pref=config|maxavail|minavail] policy.
  2. When using host preference to determine compute unit preference, the host specified as first-execution host is not considered. Namely, LSF counts only the preference of hosts that are not specified as first-execution hosts.

    For example, let cu1, cu2, cu3, cu4 denote compute units, all of the same type. Let hg denote a host group containing one host from each compute unit.

    The user submits the following job:
    bsub -n 64 -m "hg! cu1+1 cu2+2 cu3+3 cu4+4" -R "cu[pref=config]" ./app
    
    When the job is dispatched, exactly one host from hg appears in the job's allocation list. (This host should appear in the first position of the list.) Next in the list are zero or more hosts from the same compute unit as the first-execution host. The remaining hosts from the other compute units appear grouped by compute unit, with the groups themselves appearing in order of compute unit preference, from highest to lowest. For example:
    1. If h1 from cu1 is selected as the first-execution host, the final compute unit order would be cu1>cu4>cu3>cu2
    2. If h2 from cu2 is selected as the first-execution host, the final compute unit order would be cu2>cu4>cu3>cu1
    3. If h3 from cu3 is selected as the first-execution host, the final compute unit order would be cu3>cu4>cu2>cu1
    4. If h4 from cu4 is selected as the first-execution host, the final compute unit order would be cu4>cu3>cu2>cu1