Tutorial: Pytorch with DDL

This tutorial explains the necessary steps for enabling distributed deep learning (DDL) from within the Pytorch script examples provided in the WML CE distribution. DDL is directly integrated into the Pytorch distributed communication package, torch.distributed, as the backend ddl.

Additional information regarding torch.distributed: https://pytorch.org/docs/stable/distributed.html

Enabling DDL in a Pytorch program

The DDL Pytorch integration makes it simple to run a Pytorch program on a cluster. To enable DDL, you simply need to initialize the Pytorch package torch.distributed with the ddl backend before making any other torch.distributed call in the program. The init_method needs to be set to env://, as shown in this example:

torch.distributed.init_process_group('ddl', init_method='env://')
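
For example, a minimal program skeleton might look like the following sketch. It assumes the script is launched with ddlrun, which exports the environment variables that init_method='env://' reads:

import torch.distributed as dist

# Initialize the DDL backend before any other distributed call in the program.
dist.init_process_group('ddl', init_method='env://')

# After initialization, each DDL instance can query its identity in the cluster.
rank = dist.get_rank()               # index of this instance
world_size = dist.get_world_size()   # total number of instances
print('DDL instance {} of {}'.format(rank, world_size))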

For the distributed program to work most efficiently, the training data needs to be split among the DDL instances. The following are two approaches for splitting the data:

Splitting training data manually

In this approach, you split the data and make the dist.all_reduce call manually, as illustrated in the sketch at the end of this subsection.

The mnist.py script was modified in this manner to split the training data and enable DDL. The modified script can be found at CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/mnist/mnist.py.

For more detailed information about how the data is split, see the Pytorch tutorial on Writing Distributed Applications.
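
As an illustration, the following sketch splits a dataset by rank and averages the gradients with dist.all_reduce. It assumes a ddlrun launch with one GPU per DDL instance; the model, data, and hyperparameters are illustrative only and are not taken from mnist.py:

import torch
import torch.distributed as dist

dist.init_process_group('ddl', init_method='env://')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Illustrative model and synthetic data; mnist.py uses its own network and loader.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(64, 10).cuda()
targets = torch.randn(64, 1).cuda()

# Manual split: each instance trains on every world_size-th sample.
inputs = inputs[rank::world_size]
targets = targets[rank::world_size]

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Manually average the gradients across all DDL instances before stepping.
for param in model.parameters():
    dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
    param.grad.data /= world_size
optimizer.step()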

Splitting training data with the Pytorch modules DistributedDataParallel and DistributedSampler

Pytorch offers a DistributedSampler module that splits the training data among the DDL instances, and a DistributedDataParallel module that averages the gradients during the backward pass. Note: DDL does not support the num_workers argument of torch.utils.data.DataLoader; passing it may cause unexpected behavior.

# world_size and rank are available from torch.distributed once it is initialized
world_size = torch.distributed.get_world_size()
rank = torch.distributed.get_rank()
kwargs = {'pin_memory': True}
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset, world_size, rank)
train_data = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=False,
                                         sampler=train_sampler, **kwargs)

To average the gradients across DDL instances, wrap your model with the DistributedDataParallel module:

model = torch.nn.parallel.DistributedDataParallel(model)

The mnist_ddp.py script was modified in this manner to split the training data and enable DDL. The modified script can be found at CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/mnist/mnist_ddp.py.
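
Putting these pieces together, a minimal training loop might look like the following sketch. It assumes a ddlrun launch with one GPU per DDL instance; the model, dataset, and hyperparameters are illustrative and are not taken from mnist_ddp.py:

import torch
import torch.distributed as dist

dist.init_process_group('ddl', init_method='env://')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Illustrative model and synthetic dataset; mnist_ddp.py defines its own.
model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(10, 1).cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 10), torch.randn(256, 1))

# DistributedSampler shards the data; DistributedDataParallel averages gradients.
sampler = torch.utils.data.distributed.DistributedSampler(dataset, world_size, rank)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=False,
                                     sampler=sampler, pin_memory=True)

for epoch in range(2):
    sampler.set_epoch(epoch)          # reshuffle the shards each epoch
    for inputs, targets in loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()               # gradients are averaged across instances here
        optimizer.step()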

Running the script using ddlrun

DDL-enabled programs are launched using the ddlrun tool. See Using the ddlrun tool and ddlrun -h for more information about ddlrun.

The following is an example of how to call ddlrun:
ddlrun -H host1,host2 --accelerators 4 python \
    CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/mnist/mnist.py \
    --dist-backend ddl
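
The -H option lists the hosts to run on; as in the Imagenet example later in this tutorial, it can be omitted to run on a single host, for example:
ddlrun --accelerators 4 python \
    CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/mnist/mnist.py \
    --dist-backend ddl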

For additional information about how to use ddlrun, see CONDA_PREFIX/doc/ddl/README.md.

Running the DDL-enabled Imagenet script

The Imagenet script available from Pytorch has been modified to take advantage of DDL.

The following is an example of how to run the DDL-enabled Imagenet script:
ddlrun --accelerators 4 python -u \
    CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/imagenet/imagenet_ddp.py \
    --dist-backend ddl --seed 1234 --dist-url env:// /dir/to/training_data

For additional information regarding the DDL-enabled Imagenet script, see:
CONDA_PREFIX/lib/python$PY_VER/site-packages/torch/examples/ddl_examples/imagenet/README.md