Troubleshooting DDL
Known problems and solutions for using distributed deep learning (DDL).
NVProf with Spectrum MPI
Problem
Currently, CUDA 10.0+ nvprof suffers from a problem that may affect running with Spectrum MPI. When attempting to launch nvprof through Spectrum MPI, the LD_PRELOAD environment variable gets set incorrectly, which causes the CUDA hooks to fail on launch. Any application that relies on LD_PRELOAD could potentially see the same issue.
ddlrun -H host1 nvprof python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py
+ /mnt/pai/home/anaconda3/envs/py27/bin/mpirun -x PATH -x LD_LIBRARY_PATH -gpu --rankfile /tmp/DDLRUN/DDLRUN.kuCMN7IB7MV2/RANKFILE -x 'DDL_OPTIONS=-mode r:4x1x1x1 ' -n 4 -x 'DDL_HOST_PORT=2200' -x 'DDL_HOST_LIST=host1:0,1,2,3' bash -c 'source /mnt/pai/home/anaconda3/etc/profile.d/conda.sh && conda activate /mnt/pai/home/anaconda3/envs/py27 > /dev/null 2>&1 && nvprof python /mnt/pai/home/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py'
DDL: Execution Mode=Graph.
DDL: DDL_GROUP_SIZE=10000000.
DDL: DDL_SYNC_METHOD=default.
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
[host1:54798] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
[host1:54797] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: host1
Framework: pml
Solution
There are two workarounds for this issue.
- Passing the --openmp-profiling off option should prevent LD_PRELOAD from being improperly set and allow nvprof to proceed normally:
ddlrun -H host1 nvprof --openmp-profiling off python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py
- Running nvprof as a daemon also prevents the CUDA hook failure, because all data is collected in a separate session from the run; see the sketch after this list. Refer to the NVIDIA documentation on MPI Profiling for details.
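One way to use the daemon approach, as a minimal sketch: start nvprof in daemon mode in a separate session on the host, then launch the job without nvprof on the ddlrun command line. This assumes nvprof's --profile-all-processes mode and reuses the example script above; adjust hostnames and paths for your setup.
# In a separate session on host1; %p expands to the PID of each profiled process
nvprof --profile-all-processes -o /tmp/nvprof-output.%p.nvvp
# In the original session, launch the job normally (no nvprof on the command line)
ddlrun -H host1 python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py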
ddlrun hangs with --rankfile option
Use of the --rankfile option to pass a custom rank file to ddlrun bypasses much of the automatic error checking, including connection tests and key validation. This can lead to hangs if SSH keys are not properly configured on all hosts.
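Because those connection tests are skipped in this mode, it can help to verify passwordless SSH to every host yourself before launching. A minimal sketch, assuming example hostnames host1 and host2 (substitute the hosts listed in your rank file):
# BatchMode=yes makes ssh fail instead of prompting for a password
for h in host1 host2; do ssh -o BatchMode=yes "$h" hostname || echo "SSH to $h failed"; done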
ddlrun hanging on hosts with multiple interfaces
It is possible in a distributed environment to have more than one network interface on each host. In such a scenario, OpenMPI by default uses any and all interfaces that are "up" to communicate with a host. To avoid problems in such cases, you can tell MPI to use a given interface. For example:
ddlrun --mpiarg "--mca btl_tcp_if_include ib0" ...
ddlrun --mpiarg "--mca btl_tcp_if_exclude lo,enp1s0f2" ...
More details are available on the Spectrum MPI website.
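To decide which interface names to pass to these options, you can list each host's interfaces and their states; a minimal sketch using the iproute2 ip tool (interface names such as ib0 or enp1s0f2 will vary by system):
# Show each interface, its state, and its addresses
ip -brief addr show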
Unable to install Mellanox OFED within a container
Some versions of the Mellanox OFED (MOFED) installer contain a bug that prevents installation with the "user-space" flag inside a container. MOFED version 4.7-1 can be used to work around this problem.
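As a hedged sketch of what such an install typically looks like inside a container, assuming the mlnxofedinstall script from the unpacked MOFED 4.7-1 bundle and its user-space install flag (verify the exact option names with ./mlnxofedinstall --help on your version):
# Run from the unpacked MOFED 4.7-1 directory inside the container
# (flag names are an assumption; confirm with --help)
./mlnxofedinstall --user-space-only --without-fw-update --force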
Using Ubuntu as the host OS
Spectrum MPI requires the use of ptrace. By default, Ubuntu does not allow this. To turn on this capability, run the following command on the host system:
sudo bash -c "echo '0' > /proc/sys/kernel/yama/ptrace_scope"
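This change does not persist across reboots. A minimal sketch of making it permanent via sysctl, which controls the same /proc/sys/kernel/yama/ptrace_scope setting:
# Apply immediately and persist across reboots
sudo sysctl -w kernel.yama.ptrace_scope=0
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf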