Getting started with DDL
WML CE Distributed Deep Learning (DDL) is an MPI-based communication library that is specifically optimized for deep learning training. An application integrated with DDL becomes an MPI application, which allows the ddlrun command to invoke the job in parallel across a cluster of systems. DDL understands multi-tier network environments and uses different libraries (for example, NCCL) and algorithms to get the best performance in multi-node, multi-GPU environments.
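As a sketch of what a launch looks like, a DDL-integrated training script might be started like this (the hostnames and script name are placeholders, not part of any actual configuration):

```shell
# Launch train.py across two nodes; ddlrun handles the MPI setup
# and starts ranks on each listed host.
ddlrun -H host1,host2 python train.py
```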
Initial setup
Some configuration steps are common to all uses of DDL:
- WML CE frameworks must be installed at the same version on all nodes in the DDL cluster.
- The DDL master node must be able to log in to all the nodes in the cluster by using ssh keys. Keys can be created and added as follows:
  - Generate an ssh private/public key pair on the master node by using: ssh-keygen
  - Copy the generated public key in ~/.ssh/id_rsa.pub to each node's ~/.ssh/authorized_keys file: ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@$HOST
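The key steps above can be scripted for a whole cluster; a minimal sketch, assuming password authentication is available for the initial copy (HOSTS is a placeholder list of hypothetical worker hostnames):

```shell
# Generate a key pair once on the master node (skip if one already exists).
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# HOSTS is a placeholder: replace with the hostnames of your cluster nodes.
HOSTS="node1 node2 node3"
for h in $HOSTS; do
    # Appends the public key to $h's ~/.ssh/authorized_keys.
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$USER@$h"
done
```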
- Linux system firewalls might need to be adjusted to pass MPI traffic. This adjustment might be done broadly as shown. Note: Opening only the required ports would be more secure; the required ports vary with configuration.
sudo iptables -A INPUT -p tcp --dport 1024:65535 -j ACCEPT
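On distributions that manage the firewall with firewalld rather than raw iptables rules, a roughly equivalent adjustment might look like the following; the port range simply mirrors the iptables example above and is not a DDL requirement:

```shell
# Open the same TCP port range via firewalld, then reload the rules.
sudo firewall-cmd --permanent --add-port=1024-65535/tcp
sudo firewall-cmd --reload
```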