Getting started with DDL
IBM PowerAI distributed deep learning (DDL) is an MPI-based communication library
that is specifically optimized for deep learning training. An application integrated with DDL
becomes an MPI application, which allows the ddlrun command to
invoke the job in parallel across a cluster of systems. DDL understands multi-tier network
environments and uses different libraries (for example, NCCL) and algorithms to get the best
performance in multi-node, multi-GPU environments.
Initial set up
Some configuration steps are common to all uses of DDL:
- PowerAI frameworks must be installed at the same version on all nodes in the DDL cluster.
- The DDL master node must be able to log in to all the nodes in the cluster by using ssh keys.
  Keys can be created and added as follows:
  - Generate an ssh private/public key pair on the master node:
    ssh-keygen
  - Copy the generated public key in ~/.ssh/id_rsa.pub to the ~/.ssh/authorized_keys file on each node:
    ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@$HOST
- Linux system firewalls might need to be adjusted to pass MPI traffic. The following rule
  opens a broad range of TCP ports. Note: Opening only the required ports is more secure; the required ports vary with configuration.
  sudo iptables -A INPUT -p tcp --dport 1024:65535 -j ACCEPT
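The ssh key distribution step above has to be repeated for every node, so it can be convenient to script it. The following is a minimal sketch, not part of DDL itself: the function names (run, setup_node, setup_all), the DRY_RUN switch, and the LOGIN and PUBKEY variables are all hypothetical helpers introduced for illustration. It assumes the key pair already exists at ~/.ssh/id_rsa.pub and that the same user name is valid on every node. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: copy the master node's public key to each node, then verify key-based login.
# PUBKEY and LOGIN are illustrative defaults; adjust them for your cluster.
PUBKEY="${PUBKEY:-$HOME/.ssh/id_rsa.pub}"
LOGIN="${USER:-$(id -un)}"

# With DRY_RUN=1 (the default in this sketch) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_node() {
    # Append the public key to the node's ~/.ssh/authorized_keys file.
    run ssh-copy-id -i "$PUBKEY" "$LOGIN@$1" || return 1
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a zero exit status indicates that key-based login works.
    run ssh -o BatchMode=yes -o ConnectTimeout=5 "$LOGIN@$1" true
}

setup_all() {
    # Arguments: the host names of the nodes in the cluster.
    for node in "$@"; do
        setup_node "$node" || echo "key-based login to $node failed" >&2
    done
}
```

For example, `setup_all node1 node2` (placeholder host names) prints the four commands for review, and `DRY_RUN=0 setup_all node1 node2` runs them. The first real run prompts for each node's password once; after that, the BatchMode check should succeed without a prompt.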