Using the ddlrun tool
DDL-enabled programs should be launched using the ddlrun tool.
This tool performs the following tasks automatically:
- Determines the necessary arguments to pass to MPI based on the current environment and version of MPI.
- Tests connections to the hosts, including the correct setup of ssh keys.
- Verifies that DDL is installed across the hosts.
- Verifies that the hosts all have the same hardware configuration.
- Detects the hosts' hardware configuration, including GPU count, and generates a valid topology.
- Generates the necessary rankfile, providing options to specify more specific topology details.
- Constructs, displays and executes the mpirun command needed to distribute jobs to each node.
Note: Using the --rankfile option to pass in a custom rank file bypasses much of the automatic error checking, including connection tests and key validation.
Various options can be passed to ddlrun to control how and where the processes are executed.
Usage: ddlrun [options] <program> [<args>]
Options
- <program>
- The program executable. This is identified as the first non-recognized argument to ddlrun.
- <args>
- Pass these run-time arguments to every instance of <program> started by mpirun.
- -m,--m,-mode,--mode <arg0>
- Specify <arg0> as the ddl mode to use. Default is 'p'.
- -ddlarg,--ddlarg,-ddloption,--ddloption <arg0>
- Pass <arg0> through as an argument to ddl.
- -no_ddloptions,--no_ddloptions
- Do not pass the ddloptions environment variable to MPI.
- -rf,--rf,-rankfile,--rankfile <arg0>
- Provide an MPI rank file. (Skips most error checking, including connection testing and key validation)
- -H,--H,-host,--host <host1,host2,...,hostN>
- List of hosts to use when generating a rankfile.
- -hostfile,--hostfile <arg0>
- File containing list of hosts to use when generating a rankfile, with one host on each line.
- -aisles,--aisles <arg0>
- Number of aisles. If no option is given, 1 is assumed.
- -racks,--racks <arg0>
- Number of racks in each aisle. If no option is given, 1 is assumed.
- -nodes,--nodes <arg0>
- Number of nodes per rack. If no option is given, the number of hosts given is assumed.
- -accelerators,--accelerators <arg0>
- Number of accelerators per node. If no option is given, the number in the current machine is assumed. Values larger than 16 produce an error.
- -sockets,--sockets <arg0>
- Number of CPU sockets per node. Values can be 1 or any even number 2 or larger. If no option is given, the number in the current machine is assumed.
- -cores,--cores <arg0>
- Number of cores per socket. If no option is given, the number in the current machine is assumed.
- -all_cores,--all_cores
- Each instance can use all cores. Default is to equally divide cores between instances.
- -mpiarg,--mpiarg <arg0>
- Pass <arg0> through as an argument to the mpirun command.
- -x VAR,--x VAR
- Export an environment variable, optionally specifying a value. For example: -x foo or -x foo=bar.
- -tcp,-TCP,--tcp,--TCP
- Have MPI use TCP instead of InfiniBand to communicate between nodes.
- -v,--v,-verbose,--verbose
- Provide more verbose output.
- -s,--s,-skipchecks,--skipchecks
- Skip all diagnostic checks.
- -h,--h,-help,--help
- Display this help and exit.
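To make the topology options concrete, the following sketch (a hypothetical helper, not part of ddlrun) computes how many MPI ranks the generated rankfile would contain: one rank per accelerator, multiplied across the aisle, rack, and node counts.

```python
def total_ranks(aisles=1, racks=1, nodes=1, accelerators=1):
    """Hypothetical illustration of how the topology options determine
    the number of MPI ranks (one rank per accelerator). This is a
    sketch, not ddlrun's actual implementation."""
    if accelerators > 16:
        # ddlrun errors on more than 16 accelerators per node.
        raise ValueError("accelerators per node must be 16 or fewer")
    return aisles * racks * nodes * accelerators

# Two nodes with four GPUs each -> 8 ranks in the rankfile.
print(total_ranks(nodes=2, accelerators=4))  # 8
```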
Example usage
The following example uses ddlrun to launch the mnist-env.py script across two nodes named host1 and host2. Make sure a conda environment with the ddl-tensorflow package installed is activated before running the following command.
ddlrun -H host1,host2 python mnist-env.py
DDL modes
DDL can perform an allreduce using different algorithms. These are specified with the mode argument to ddlrun. If you do not select a mode, ddlrun selects p mode. Note that different modes can be used between different subsections of a cluster. For example, n mode can be used between GPUs within the same node, and r mode can be used between GPUs on different nodes. To specify this combination, use mode nr. DDL supports the following modes:
- b mode: Use the enhanced NCCL allreduce algorithm.
- n mode: Use the NCCL allreduce algorithm.
- r mode: Use an MPI-based ring algorithm.
- c mode: Use an MPI-based recursive algorithm. This mode only works when the number of nodes is a power of 2.
- m mode: Use MPI's allreduce algorithm. This mode can work on machines without GPUs.
- p mode: Automatically pick the best mode to use.
DDL options
There are a number of runtime options for the DDL engine. These options can be passed to ddlrun using the --ddloption flag.
- -mpi_mode <N>
- Turn MPI-based communication on or off. By default, the value is 1.
- -cross_node_nccl <N>
- Allow cross-node communication to occur using NCCL as the backend. When N is non-zero, NCCL can be used for cross-node communication. By default, the value is 1.
- -pipe <N>
- Run DDL in an N-stage pipeline. The default is 0, which means off. Setting N to 1 has no performance benefit; setting it to 2 can sometimes provide a performance benefit. Note: the -pipe option does not work with modes r, c, or m.
- -pack <F>
- Use 16-bit floating point numbers when F is non-zero. F is a scaling factor: for example, if F is 2.0, values are scaled up by 2 before being converted to 16-bit floating point numbers, then scaled by 0.5 after being converted back to 32-bit floating point numbers.
- -cpu_only <N>
- When N is non-zero, DDL allreduce operations work only with arrays on the CPU and operate only on the CPU. By default, the value is 0. Note: This option only works if running on a system with CUDA installed.
- -dbg_level <0,1,2>
- Set to 0 for no debug messages, 1 for mild debug messages, or 2 for detailed debug messages.
- -dump_iter <N>
- Outputs DDL performance information every N iterations.
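To illustrate what the -pack scaling factor is for, the sketch below (assumed behaviour, not DDL's code) round-trips values through IEEE half precision with and without scaling. Small gradient values that would underflow to zero in 16-bit floating point can survive when scaled up before conversion and scaled back down afterwards.

```python
import struct

def f16_roundtrip(x):
    """Round-trip a Python float through IEEE half precision
    (struct format code 'e')."""
    return struct.unpack('e', struct.pack('e', x))[0]

def pack_roundtrip(values, f):
    """Sketch of -pack <F>: scale by F, convert to 16-bit floats for
    transfer, then convert back and scale by 1/F. Illustrative only,
    not DDL's actual implementation."""
    return [f16_roundtrip(v * f) / f for v in values]

# Unscaled, 1e-8 is below half precision's smallest subnormal and
# rounds to zero; scaling up by 16 first preserves a non-zero value.
print(pack_roundtrip([1e-8], 1.0))   # [0.0]
print(pack_roundtrip([1e-8], 16.0))  # small but non-zero
```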