Getting started with DDL
WML CE Distributed Deep Learning (DDL) is an MPI-based communication library that is specifically optimized for deep learning training. An application integrated with DDL becomes an MPI application, which allows the ddlrun command to invoke the job in parallel across a cluster of systems. DDL understands multi-tier network environments and uses different libraries (for example, NCCL) and algorithms to get the best performance in multi-node, multi-GPU environments.
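As a sketch of what a launch looks like, a DDL-integrated training script might be started like this (the hostnames and script name are placeholders, not part of any actual configuration):

```shell
# Launch train.py across two nodes; ddlrun handles the MPI setup
# and starts ranks on each listed host.
ddlrun -H host1,host2 python train.py
```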
Initial setup
Some configuration steps are common to all uses of DDL:
- WML CE frameworks must be installed at the same version on all nodes in the DDL cluster.
- The DDL master node must be able to log in to all the nodes in the cluster by using ssh keys. Keys can be created and added as follows:
  - Generate an ssh private/public key pair on the master node by using: ssh-keygen
  - Copy the generated public key in ~/.ssh/id_rsa.pub to each node's ~/.ssh/authorized_keys file: ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@$HOST
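The key steps above can be scripted for a whole cluster; a minimal sketch, assuming password authentication is available for the initial copy (HOSTS is a placeholder list of hypothetical worker hostnames):

```shell
# Generate a key pair once on the master node (skip if one already exists).
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# HOSTS is a placeholder: replace with the hostnames of your cluster nodes.
HOSTS="node1 node2 node3"
for h in $HOSTS; do
    # Appends the public key to $h's ~/.ssh/authorized_keys.
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$USER@$h"
done
```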
- Linux system firewalls might need to be adjusted to pass MPI traffic. This adjustment might be done broadly as shown. Note: Opening only the required ports would be more secure; the required ports vary with configuration.
sudo iptables -A INPUT -p tcp --dport 1024:65535 -j ACCEPT
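On distributions that manage the firewall with firewalld rather than raw iptables rules, a roughly equivalent adjustment might look like the following; the port range simply mirrors the iptables example above and is not a DDL requirement:

```shell
# Open the same TCP port range via firewalld, then reload the rules.
sudo firewall-cmd --permanent --add-port=1024-65535/tcp
sudo firewall-cmd --reload
```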