Getting started with DDL

Find common configuration steps for getting started with distributed deep learning (DDL).

Some configuration steps are common to all uses of DDL:

  • PowerAI frameworks must be installed at the same version on all nodes in the DDL cluster.
  • The DDL master node must be able to log in to all the nodes in the cluster by using ssh keys. To create and distribute the keys:
    • Generate an ssh private/public key pair on the master node:
      ssh-keygen
    • Copy the generated public key in ~/.ssh/id_rsa.pub to the ~/.ssh/authorized_keys file on each node:
      ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@$HOST
  • Linux system firewalls might need to be adjusted to pass MPI traffic. The rule shown here does this broadly by accepting inbound TCP traffic on all unprivileged ports (1024 through 65535).
    Note: Opening only required ports would be more secure. Required ports vary with configuration.
    sudo iptables -A INPUT -p tcp --dport 1024:65535 -j ACCEPT
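
After the keys are copied, it can help to confirm that the master node really can log in to every node without a password prompt. The following is a minimal sketch, not part of DDL itself: the function names check_node and check_cluster, the SSH_CMD override variable, and the host names in the usage comment are all hypothetical. It relies only on standard OpenSSH client options (-n, BatchMode, ConnectTimeout).

```shell
#!/usr/bin/env bash
# Sketch: verify key-based ssh login from the master node to each cluster node.
# Assumption: the names below are illustrative and are not DDL or PowerAI commands.

# check_node HOST
# Succeeds only if a non-interactive login to HOST works.
# SSH_CMD is a hypothetical hook that lets the check be dry-run with another command.
check_node() {
    local host="$1"
    # -n keeps ssh from reading stdin; BatchMode=yes makes ssh fail
    # instead of prompting for a password or passphrase.
    ${SSH_CMD:-ssh} -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}

# check_cluster HOST...
# Prints one status line per host and returns non-zero if any host fails.
check_cluster() {
    local failed=0 host
    for host in "$@"; do
        if check_node "$host" 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage (placeholder host names):
#   check_cluster node1 node2 node3
```

A host reported as FAIL usually means that the public key is missing from that node's ~/.ssh/authorized_keys file or that the node is unreachable. Running ssh manually against that host shows the specific error.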