Troubleshooting, FAQs, and support

Use this information to help troubleshoot issues with IBM Watson® Machine Learning Accelerator and the components that make up IBM Watson Machine Learning Accelerator.

NVProf with Spectrum MPI

Problem

Currently, nvprof in CUDA 10.0 and later suffers from a problem that can affect runs launched with Spectrum MPI. When nvprof is launched through Spectrum MPI, the LD_PRELOAD environment variable is set incorrectly, which causes the CUDA hooks to fail at launch. Any application that relies on LD_PRELOAD could potentially see the same issue. For example:

ddlrun -H host1 nvprof python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py 
+ /mnt/pai/home/anaconda3/envs/py27/bin/mpirun -x PATH -x LD_LIBRARY_PATH -gpu --rankfile /tmp/DDLRUN/DDLRUN.kuCMN7IB7MV2/RANKFILE -x 'DDL_OPTIONS=-mode r:4x1x1x1 ' -n 4 -x 'DDL_HOST_PORT=2200' -x 'DDL_HOST_LIST=host1:0,1,2,3' bash -c 'source /mnt/pai/home/anaconda3/etc/profile.d/conda.sh && conda activate /mnt/pai/home/anaconda3/envs/py27 > /dev/null 2>&1 && nvprof python /mnt/pai/home/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py'
DDL: Execution Mode=Graph.
DDL: DDL_GROUP_SIZE=10000000.
DDL: DDL_SYNC_METHOD=default.

FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
FATAL ERROR: Couldn't load pami_cudahook
[host1:54798] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
[host1:54797] Error: common_pami.c:857 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      host1
  Framework: pml
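
To confirm that LD_PRELOAD is being altered, you can print the value that the launched environment actually receives, without involving nvprof. This is a hypothetical diagnostic, not part of the product; it assumes that ddlrun passes an arbitrary command through to mpirun, as shown in the trace above.

    # Print the LD_PRELOAD value seen inside the launched environment; with
    # Spectrum MPI it should reference the PAMI CUDA hook (pami_cudahook).
    ddlrun -H host1 bash -c 'echo "LD_PRELOAD=$LD_PRELOAD"'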

Solution

There are two workarounds that prevent this issue.

  1. Pass the --openmp-profiling off option to nvprof. This prevents LD_PRELOAD from being set improperly and allows nvprof to proceed normally:
    ddlrun -H host1 nvprof --openmp-profiling off python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py
  2. Run nvprof as a daemon. This prevents the CUDA hook failure because all profiling data is collected in a separate session from the run; see the sketch after this list. For details, refer to the NVIDIA documentation on MPI Profiling.
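
The following is a minimal sketch of the daemon-mode approach. It uses nvprof's documented --profile-all-processes mode; the output path and the %h (hostname) and %p (process ID) filename substitutions are illustrative, so adjust them for your environment.

    # Session 1 (on each host to be profiled): start nvprof in daemon mode.
    # It attaches to every CUDA process this user starts afterward and writes
    # one profile per process.
    nvprof --profile-all-processes -o /tmp/nvprof-output.%h.%p.nvvp

    # Session 2: launch the job normally, without wrapping it in nvprof, so
    # the LD_PRELOAD set up by Spectrum MPI (pami_cudahook) is left intact.
    ddlrun -H host1 python ~/tensorflow_examples/ddl-tensorflow/examples/mnist/mnist-env.py

    # After the job finishes, press Ctrl+C in session 1 to stop the daemon;
    # the per-process profiles can then be opened in the NVIDIA Visual Profiler.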