Troubleshooting DDL
Known problems and solutions for using distributed deep learning (DDL).
NVProf with Spectrum MPI
Problem
Currently, CUDA 10.0+ nvprof suffers from a problem that may affect running with Spectrum MPI. When attempting to launch nvprof through Spectrum MPI, the LD_PRELOAD environment variable gets set incorrectly, which causes the CUDA hooks to fail on launch. Any application that relies on LD_PRELOAD could potentially see the same issue.
ddlrun -H host1 nvprof python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py
+ /mnt/pai/home/anaconda3/envs/py27/bin/mpirun -x PATH -x LD_LIBRARY_PATH -gpu --rankfile /tmp/DDLRUN/DDLRUN.kuCMN7IB7MV2/RANKFILE -x 'DDL_OPTIONS=-mode r:4x1x1x1 ' -n 4 -x 'DDL_HOST_PORT=2200' -x 'DDL_HOST_LIST=host1:0,1,2,3' bash -c 'source /mnt/pai/home/anaconda3/etc/profile.d/conda.sh && conda activate /mnt/pai/home/anaconda3/envs/py27 > /dev/null 2>&1 && nvprof python /mnt/pai/home/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py'
DDL: Execution Mode=Graph.
DDL: DDL_GROUP_SIZE=10000000.
DDL: DDL_SYNC_METHOD=default.
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
[host1:54798] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
[host1:54797] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: host1
Framework: pml
Solution
There are two workarounds for this issue.
- Passing the --openmp-profiling off option should prevent LD_PRELOAD from being improperly set and allow nvprof to proceed normally:
ddlrun -H host1 nvprof --openmp-profiling off python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py
- Running nvprof as a daemon also prevents the CUDA hook failure, because all data is collected in a separate session from the run; see the sketch after this list. Refer to the NVIDIA documentation on MPI Profiling for details.
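One way to use the daemon approach, as a minimal sketch: start nvprof in daemon mode in a separate session on the host, then launch the job without nvprof on the ddlrun command line. This assumes nvprof's --profile-all-processes mode and reuses the example script above; adjust hostnames and paths for your setup.
# In a separate session on host1; %p expands to the PID of each profiled process
nvprof --profile-all-processes -o /tmp/nvprof-output.%p.nvvp
# In the original session, launch the job normally (no nvprof on the command line)
ddlrun -H host1 python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py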
ddlrun hangs with --rankfile option
Use of the --rankfile option to pass a custom rank file to ddlrun bypasses much of the automatic error checking, including connection tests and key validation. This can lead to hangs if SSH keys are not properly configured on all hosts.
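Because those connection tests are skipped in this mode, it can help to verify passwordless SSH to every host yourself before launching. A minimal sketch, assuming example hostnames host1 and host2 (substitute the hosts listed in your rank file):
# BatchMode=yes makes ssh fail instead of prompting for a password
for h in host1 host2; do ssh -o BatchMode=yes "$h" hostname || echo "SSH to $h failed"; done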
ddlrun hanging on hosts with multiple interfaces
It is possible in a distributed environment to have more than one network interface on each host. In such a scenario, OpenMPI by default uses any and all interfaces that are "up" to communicate with a host. To avoid problems in such cases, you can tell MPI to use a given interface. For example:
ddlrun --mpiarg "--mca btl_tcp_if_include ib0" ...
ddlrun --mpiarg "--mca btl_tcp_if_exclude lo,enp1s0f2" ...
More details are available on the Spectrum MPI website.
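To decide which interface names to pass to these options, you can list each host's interfaces and their states; a minimal sketch using the iproute2 ip tool (interface names such as ib0 or enp1s0f2 will vary by system):
# Show each interface, its state, and its addresses
ip -brief addr show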
Unable to install Mellanox OFED within a container
Some versions of the Mellanox OFED (MOFED) installer contain a bug that prevents installation with the "user-space" flag inside a container. MOFED version 4.7-1 can be used to work around this problem.
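As a hedged sketch of what such an install typically looks like inside a container, assuming the mlnxofedinstall script from the unpacked MOFED 4.7-1 bundle and its user-space install flag (verify the exact option names with ./mlnxofedinstall --help on your version):
# Run from the unpacked MOFED 4.7-1 directory inside the container
# (flag names are an assumption; confirm with --help)
./mlnxofedinstall --user-space-only --without-fw-update --force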
Using Ubuntu as the host OS
Spectrum MPI requires the use of ptrace. By default, Ubuntu does not allow this. To turn on this capability, run the following command on the host system:
sudo bash -c "echo '0' > /proc/sys/kernel/yama/ptrace_scope"
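This change does not persist across reboots. A minimal sketch of making it permanent via sysctl, which controls the same /proc/sys/kernel/yama/ptrace_scope setting:
# Apply immediately and persist across reboots
sudo sysctl -w kernel.yama.ptrace_scope=0
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf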