Troubleshooting for IBM Spectrum Conductor Deep Learning Impact
Use this information to help troubleshoot issues with IBM Spectrum Conductor Deep Learning Impact.
- Problem: After applying interim fix 531936, the nginx service fails to start. If the nginx service fails to start, the following issues occur:
  - The Elastic Distributed Inference page in the cluster management console shows an error. Log in to the cluster management console and open the Elastic Distributed Inference page. If the nginx service is not running, the following error is displayed:
    Error 500: Failed to list Elastic Distributed Inference Models: Connect to hostxx:9000 [host/5.5.5.1] failed: Connection refused (Connection refused)
    where host/5.5.5.1 is an example host name and IP address.
  - Elastic distributed inference services are started, but the nginx service is not started and the edi-access.log file in the $EGO_TOP/dlim/logs directory has not been created. Run the egosh service list command to see which services are started.
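  The Connection refused error above means nothing is accepting connections on the inference port. As a minimal sketch (the helper name is my own, and the host and port are the examples from the error message), you can probe the port with bash's /dev/tcp redirection before digging into the service itself:

  ```shell
  #!/bin/bash
  # Hypothetical helper: check whether a TCP port accepts connections,
  # using bash's built-in /dev/tcp redirection (no extra tools needed).
  check_port() {
    if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
      echo "reachable"
    else
      echo "unreachable: check whether the nginx service is running"
    fi
  }

  # Example values from the error message; substitute your own host.
  check_port localhost 9000
  ```

  If the port is unreachable on the host where nginx should be running, continue with the solution steps below.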
Solution:
  To resolve this issue, do the following:
  - Ensure that all the configurations in the dlim.conf, etcd.conf, and nginx.conf files under $EGO_CONFDIR/../../dlim/conf are correct.
  - Check for any missing library dependencies by running the following command:
    ldd $EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx
  - Manually start the nginx process. For example:
    $EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx -c $EGO_CONFDIR/../../dlim/conf/nginx.conf
  - Review the output and determine a possible cause.
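  The dependency check in the steps above can be scripted; this is a sketch only (the helper name and the echo message are my own), using the nginx binary path shown in the steps:

  ```shell
  #!/bin/sh
  # Hypothetical helper: report shared libraries that ldd marks "not found".
  check_missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found"
  }

  # Binary path from the steps above; adjust for your installation.
  if check_missing_libs "$EGO_TOP/dlim/1.2.3/linux-ppc64le/etc/nginx"; then
    echo "Missing libraries detected; resolve them before starting nginx"
  fi
  ```

  Any line that grep prints names a library that must be installed (or made visible on the loader path) before the manual start in the next step can succeed.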
- Problem: Cannot create dataset. Dataset is already in use.
If a Spark instance group was previously deleted, any dataset names used in the previous Spark instance group cannot be used in a new Spark instance group.
Solution:
To use the same dataset name that was used in a deleted Spark instance group, make sure that all data is removed. To ensure that all dataset data is removed, do one of the following:
  - Clean up your environment by running the curl command, for example:
    curl -k -X DELETE -u Admin:Admin <DLPD_REST_BASE_URL_1>deeplearning/v1/admin/cleanup
  - Configure auto-cleaning in the dlpd.conf file by setting AUTOCLEANMODE to on:
    "AUTOCLEANMODE" : "on"
    Auto-cleaning is triggered only when the dlpd service is restarted; it does not run periodically. After updating dlpd.conf, restart the dlpd service to trigger an environment cleanup:
    egosh service stop dlpd
    egosh service start dlpd
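  As a sketch of that configuration change (the helper name is my own, and it assumes dlpd.conf stores the setting as a JSON-style "key" : "value" pair on one line), the edit can be done with sed; the demo below runs against a throwaway copy, but in practice you would point it at $EGO_CONFDIR/../../dlim/conf/dlpd.conf and then restart dlpd as shown above:

  ```shell
  #!/bin/sh
  # Hypothetical helper: flip AUTOCLEANMODE from "off" to "on" in place.
  enable_autoclean() {
    sed -i 's/"AUTOCLEANMODE"[[:space:]]*:[[:space:]]*"off"/"AUTOCLEANMODE" : "on"/' "$1"
  }

  # Demo on a throwaway file so the sketch is safe to run anywhere.
  tmp=$(mktemp)
  printf '"AUTOCLEANMODE" : "off"\n' > "$tmp"
  enable_autoclean "$tmp"
  cat "$tmp"   # prints "AUTOCLEANMODE" : "on"
  rm -f "$tmp"
  ```

  Remember that the change takes effect only after the dlpd service restart, since auto-cleaning runs at service startup.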