Limitations and known issues

These limitations and known problems exist in IBM Spectrum Conductor Deep Learning Impact 1.2.2.

Found in version 1.2.2

  • If you are using IBM Watson Studio Local with IBM Watson Machine Learning Accelerator, a PyTorch elastic distributed training job fails if the following setting is used: csRmqMeta=None

    Ensure that a value is specified for csRmqMeta. If a job has failed as a result of this issue, delete the task, and try again.
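    For example, when you submit the training job, pass a real value instead of None. A minimal sketch (the value shown is a placeholder, not a working setting; use the RabbitMQ metadata that applies to your cluster):
    csRmqMeta=<rabbitmq_metadata_for_your_cluster>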

  • After a Spark instance group is deleted, its datasets and models might not be removed completely, and datasets and models with the same names cannot be recreated
    If you delete a Spark instance group, a known issue exists where the related datasets and models might not be removed completely. This issue prevents you from recreating datasets and models with the same names. To resolve this issue, run the following cleanup command:
    curl -k -X DELETE -u Admin:Admin <DLPD_REST_BASE_URL_1>deeplearning/v1/admin/cleanup
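    For example, with a hypothetical REST base URL of https://mycluster.example.com:9280/platform/rest/ (substitute the DLPD REST base URL and administrator credentials for your own cluster), the command looks like this:
    curl -k -X DELETE -u Admin:Admin https://mycluster.example.com:9280/platform/rest/deeplearning/v1/admin/cleanup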

Found in earlier versions

  • Some operations fail when upgrading IBM Spectrum Conductor Deep Learning Impact from version 1.2 to 1.2.1.

    When you run the egoupgrade command, a known issue exists where the dlinsights-monitor and dlinsights-optimizer services are logged as not started even though these services do start.

    By default, during an upgrade, the system waits 60 seconds for each service to start. If a service does not start within those 60 seconds, a message is logged stating that the service cannot be started. However, the system continues to start these services even after the 60-second wait time elapses. To ensure that the dlinsights-monitor and dlinsights-optimizer services are started, wait a few minutes after upgrading and verify that these services are in the STARTED state. For details, see Starting services.
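    You can also verify the service state from the command line. A minimal sketch, assuming the egosh CLI is in your PATH and Admin is a valid administrator account:
    egosh user logon -u Admin -x Admin
    egosh service list | grep dlinsights
    Both services should report the STARTED state.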

  • When running a dlicmd task with the PowerAICaffeBVLC framework plugin, an error occurs where the plugin cannot find the /opt/DL/caffe-bvlc/bin/caffe executable.
    To fix this known issue, update the PowerAICaffeBVLC_wrapper.sh file in the DLI_SHARED_FS/tools/dl_plugins directory:
    1. Open and edit the PowerAICaffeBVLC_wrapper.sh file.
      vi DLI_SHARED_FS/tools/dl_plugins/PowerAICaffeBVLC_wrapper.sh
    2. Change the following line:
      /opt/DL/caffe-bvlc/bin/caffe $@
      To:
      /opt/DL/caffe-bvlc/python/bin/caffe $@
    3. Save and close the file.
    4. Rerun the dlicmd task. It should now complete successfully.
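    Alternatively, the same fix can be applied non-interactively. A minimal sketch, assuming the DLI_SHARED_FS environment variable points to your shared file system directory:
    sed -i 's|/opt/DL/caffe-bvlc/bin/caffe|/opt/DL/caffe-bvlc/python/bin/caffe|' $DLI_SHARED_FS/tools/dl_plugins/PowerAICaffeBVLC_wrapper.sh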
  • If you are using elastic distributed training for natural language processing, the training engine might fail to start if the SparkContext in the driver does not initialize in time.
    For example:
    INFO EGOClusterDriverWrapper: Waiting for spark context initialization ... 9
    ERROR EGOClusterDriverWrapper: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
    EGOClusterDriverWrapper: Final app status: 1, exitCode: 63, (reason: Timed out waiting for SparkContext.)
    To resolve this issue, update the spark-env.sh file that is found in the DLI_SHARED_FS/conf directory. Add the following line to the end of this file:
    export SPARK_EGO_CLIENT_CONTEXT_WAITTRIES=1000
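    For example, to append the setting non-interactively (a sketch, assuming the DLI_SHARED_FS environment variable points to your shared file system directory):
    echo 'export SPARK_EGO_CLIENT_CONTEXT_WAITTRIES=1000' >> $DLI_SHARED_FS/conf/spark-env.sh
    If each wait try corresponds to roughly one second, which is consistent with the 100000 ms timeout that the log reports, 1000 tries extends the wait window to roughly 1000 seconds.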
  • Framework plugins do not support a Spark instance group where the SPARK_EGO_APP_SCHEDULE_POLICY is configured for fairshare.

    Plugins support only Spark instance groups that are configured with fifo. To learn more about configuring Spark instance groups, see Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact.
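    To check the current policy, you can use the same console navigation that this page describes for SPARK_EGO_RECLAIM_GRACE_PERIOD: go to Workload > Spark Instance Groups, click the Spark instance group, select Manage > Configure, click Spark configuration, and search for SPARK_EGO_APP_SCHEDULE_POLICY. The value must be set to fifo.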

  • In the cluster management console, the Drivers and Executors link on the Deep Learning pages does not load.

    To see the Drivers and Executors page, make sure that the Spark instance group is running. If the Spark instance group is not running, start the Spark instance group and try again.

  • A deep learning training job fails or runs with errors after a task is killed by an executor. For example:
    INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
    INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
    INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1512136348229 state:KILLED
    To resolve this issue, disable the option to reclaim resources in the Spark instance group consumer.
    1. Select Resources > Consumers.
    2. Select the Spark executor consumer for the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
    3. Click the Consumer Properties tab, and complete the following steps:
      1. Clear the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the same value that is set for the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance group's Spark configuration.
        Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
        1. Go to Workload > Spark Instance Groups.
        2. Click the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
        3. Select Manage > Configure.
        4. Click Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
      Note: Make these changes on all related Spark executor consumers, including both the parent consumer and its children.
    4. Click Apply to save the changes.
    5. Restart the Spark instance group.