Troubleshooting for IBM Spectrum Conductor Deep Learning Impact

Use this information to help troubleshoot issues with IBM Spectrum Conductor Deep Learning Impact.

  • Problem: After applying interim fix 531936, the nginx service fails to start.
    If the nginx service fails to start, the following issues occur:
    • The Elastic Distributed Inference page in the cluster management console has an error. Log in to the cluster management console and go to Workload > Spark > Deep Learning > Elastic Distributed Inference. If the nginx service is not running, the following error is displayed:
      Error 500: Failed to list Elastic Distributed Inference Models: Connect to hostxx:9000 [host/5.5.5.1] failed: Connection refused (Connection refused)
      where host/5.5.5.1 is an example hostname and IP address.
    • Elastic distributed inference services are started, but the nginx service is not started and the edi-access.log file has not been created in the $EGO_TOP/dlim/logs directory. Run the egosh service list command to see which services are started.
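
    The two symptom checks above can be scripted roughly as follows. This is a minimal sketch, not part of the product: the function names edi_log_exists and check_symptoms are hypothetical, and the sketch assumes that EGO_TOP is set and that egosh is on the PATH (for example, after sourcing the cluster profile).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if edi-access.log exists under the
# given EGO_TOP directory (log location taken from the text above).
edi_log_exists() {
    [ -f "$1/dlim/logs/edi-access.log" ]
}

# Hypothetical helper: print the service list and report on the log file.
check_symptoms() {
    # List the EGO services and their states.
    egosh service list
    if edi_log_exists "$EGO_TOP"; then
        echo "edi-access.log found"
    else
        echo "edi-access.log not found; this is consistent with nginx not having started"
    fi
}
```

    Call check_symptoms on the management host and compare the result with the symptoms listed above.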

    Solution:

    To resolve this issue, do the following:
    1. Ensure that all the configurations in the dlim.conf, etcd.conf, and nginx.conf files under $EGO_CONFDIR/../../dlim/conf are correct.
    2. Check for any missing files by running the following command:
      ldd $EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx
    3. Manually start the nginx process. For example:
      $EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx -c $EGO_CONFDIR/../../dlim/conf/nginx.conf
    4. Review the output and determine a possible cause.
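
    Steps 2 and 3 can be combined in a small script. The following is a sketch under assumptions: the function names missing_libs and check_nginx are hypothetical, the 1.2.3 and linux-ppc64le path components are copied from the example above and might differ on your hosts, and the -t option is the standard nginx option for testing a configuration file without starting the server.

```shell
#!/bin/sh
# Hypothetical helper: keep only the ldd output lines for shared
# libraries that could not be resolved.
missing_libs() {
    grep "not found"
}

# Hypothetical helper: check libraries, test the configuration, start nginx.
check_nginx() {
    # Paths copied from the steps above; adjust version and architecture.
    nginx_bin="$EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx"
    nginx_conf="$EGO_CONFDIR/../../dlim/conf/nginx.conf"

    if ldd "$nginx_bin" | missing_libs; then
        echo "Unresolved libraries are listed above." >&2
        return 1
    fi
    # Validate the configuration file first, then start nginx manually
    # so that any startup error is printed to the terminal.
    "$nginx_bin" -t -c "$nginx_conf" && "$nginx_bin" -c "$nginx_conf"
}
```

    If check_nginx prints unresolved libraries, install the missing packages before you try to start nginx again.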
  • Problem: Cannot create dataset. Dataset is already in use.

    If a Spark instance group was previously deleted, any dataset names used in the previous Spark instance group cannot be used in a new Spark instance group.

    Solution:

    To reuse a dataset name that was used in a deleted Spark instance group, make sure that all of the dataset data is removed. To remove the data, do one of the following:
    • Clean up your environment by running a curl command. For example:
      curl -k -X DELETE -u Admin:Admin <DLPD_REST_BASE_URL_1>deeplearning/v1/admin/cleanup
    • Configure auto cleaning in the dlpd.conf file by setting AUTOCLEANMODE to on:
      "AUTOCLEANMODE" : "on"
      Auto cleaning is triggered only when the dlpd service is restarted; it does not run periodically. After you update dlpd.conf, restart the dlpd service to trigger an environment cleanup:
      egosh service stop dlpd
      egosh service start dlpd
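
    Either option can be wrapped in a small shell function. This is a sketch only: the function names cleanup_url, cleanup_now, and restart_dlpd are hypothetical, and the sketch assumes that the REST base URL ends with a slash, as the placeholder in the curl example above suggests.

```shell
#!/bin/sh
# Hypothetical helper: append the cleanup path from the example above
# to a REST base URL that is expected to end with "/".
cleanup_url() {
    printf '%sdeeplearning/v1/admin/cleanup' "$1"
}

# Option 1: request an immediate cleanup.
#   $1 = REST base URL, $2 = credentials in user:password form
cleanup_now() {
    # -k skips TLS certificate verification, as in the example above.
    curl -k -X DELETE -u "$2" "$(cleanup_url "$1")"
}

# Option 2: after setting "AUTOCLEANMODE" : "on" in dlpd.conf,
# restart dlpd so that auto cleaning runs once.
restart_dlpd() {
    egosh service stop dlpd
    egosh service start dlpd
}
```

    For example, run cleanup_now "<DLPD_REST_BASE_URL_1>" Admin:Admin for the first option, or run restart_dlpd after you edit dlpd.conf for the second option.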