Limitations and known issues

The following limitations and known issues exist in WML Accelerator 1.2.3.

Issue found after applying Fix 601552

After applying Fix 601552, no data is displayed in the dlinsights charts. To resolve this issue, update the elk configuration files (a consolidated script sketch follows these steps):
  1. Log on to the primary management host as the cluster administrator and stop the following elk services:
    egosh service stop elk-shipper
    egosh service stop elk-indexer
    egosh service stop elk-manager
    egosh service stop elk-elasticsearch-data
    egosh service stop elk-elasticsearch-master
    egosh service stop elk-elasticsearch
  2. Update the logstash configuration.
    1. Log on to the primary management host as the cluster administrator and upload dlinsights_logstash_worker_cws251.conf to the host. The dlinsights_logstash_worker_cws251.conf file can be obtained from GitHub.
    2. Navigate to $EGO_CONFDIR/../../integration/elk/conf/indexer.
      cd $EGO_CONFDIR/../../integration/elk/conf/indexer
    3. Back up the cws_spark.conf file to cws_spark.conf.bak.
    4. Copy dlinsights_logstash_worker_cws251.conf to $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf.
      cp dlinsights_logstash_worker_cws251.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
    5. Open the cws_spark.conf file and add the following lines:
      #Match Spark 3.3.1
      match => { "message" => "%{DATESTAMP:[@metadata][timestamp]} %{DATA} Executor: [task] [%{DATA:[State]}] taskName:task %{INT:[TaskIndex]:int}.%{INT:[TaskAttempt]:int} in stage %{INT:[StageID]:int}.%{INT:[StageAttempt]:int} (TID %{DATA}) taskId:%{INT:[TaskID]:int} finishTime:%{INT:[Time]:int} duration:%{INT:[Duration]:int} state:%{DATA} resourceType:%{WORD:[ResourceType]}%{GREEDYDATA}" }
    6. Log on to each compute host as the cluster administrator, upload the dlinsights_logstash_worker_cws251.conf file to the host, and repeat steps 2b to 2e.
  3. Log on to the primary management host as the cluster administrator and start the following elk services:
    egosh service start elk-shipper
    egosh service start elk-indexer
    egosh service start elk-manager
    egosh service start elk-elasticsearch-data
    egosh service start elk-elasticsearch-master
    egosh service start elk-elasticsearch
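
For reference, steps 1 and 2b to 2d can be run as a single script on the primary management host. The following sketch is an example only: it assumes that the EGO environment is already sourced (so egosh and $EGO_CONFDIR resolve) and that dlinsights_logstash_worker_cws251.conf was uploaded to /tmp; adjust the path to match where you placed the file. The Spark 3.3.1 match line (step 2e) still must be added by hand, and the elk services should be started (step 3) only after every compute host is updated.
  #!/bin/sh
  # Sketch only: steps 1 and 2b to 2d on the primary management host.
  # Assumes the EGO environment is sourced and the new file was uploaded to /tmp.
  CONF_SRC=/tmp/dlinsights_logstash_worker_cws251.conf
  INDEXER_DIR=$EGO_CONFDIR/../../integration/elk/conf/indexer

  # Step 1: stop the elk services.
  for svc in elk-shipper elk-indexer elk-manager elk-elasticsearch-data \
             elk-elasticsearch-master elk-elasticsearch; do
      egosh service stop $svc
  done

  # Steps 2b to 2d: back up the existing logstash configuration and replace it.
  cp "$INDEXER_DIR/cws_spark.conf" "$INDEXER_DIR/cws_spark.conf.bak"
  cp "$CONF_SRC" "$INDEXER_DIR/cws_spark.conf"

  # Step 2e (manual): edit cws_spark.conf and add the Spark 3.3.1 match line.
  # Step 3 (only after all compute hosts are updated): start each elk service again
  # with egosh service start <service>.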

Issues found after upgrading to IBM Spectrum Conductor 2.5.1

For known issues found in IBM Spectrum Conductor 2.5.1, see Known issues found in IBM Spectrum Conductor 2.5.1.

  • No data displayed in the Insights charts after upgrading to IBM Spectrum Conductor 2.5.1

    After upgrading from IBM Spectrum Conductor version 2.5.0 to 2.5.1, no data is displayed in the Insights charts after a training workload is run.

    To resolve this issue, update the elk configuration files (a script sketch for distributing the filebeat configuration follows these steps):
    1. Log on to the primary management host as the cluster administrator and stop the following elk services:
      egosh service stop elk-shipper
      egosh service stop elk-indexer
      egosh service stop elk-manager
      egosh service stop elk-elasticsearch-data
      egosh service stop elk-elasticsearch-master
      egosh service stop elk-elasticsearch
    2. Update the filebeat configuration.
      1. Log on to each host in the cluster as the cluster administrator and upload dlinsights_shipper_cws251.yml to the host. The dlinsights_shipper_cws251.yml file can be obtained from GitHub.
      2. Navigate to $EGO_TOP/integration/elk/1.4.5/conf/shipper, where $EGO_TOP is the cluster installation top directory:
        cd $EGO_TOP/integration/elk/1.4.5/conf/shipper
      3. Back up the conductor.yml file to conductor.yml.bak.
      4. Copy dlinsights_shipper_cws251.yml to $EGO_TOP/integration/elk/1.4.5/conf/shipper/conductor.yml.
        cp dlinsights_shipper_cws251.yml $EGO_TOP/integration/elk/1.4.5/conf/shipper/conductor.yml
    3. Update the logstash configuration.
      1. Log on to the primary management host as the cluster administrator and upload dlinsights_logstash_worker_cws251.conf to the host. The dlinsights_logstash_worker_cws251.conf file can be obtained from GitHub.
      2. Navigate to $EGO_CONFDIR/../../integration/elk/conf/indexer.
        cd $EGO_CONFDIR/../../integration/elk/conf/indexer
      3. Back up the cws_spark.conf file to cws_spark.conf.bak.
      4. Copy dlinsights_logstash_worker_cws251.conf to $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf.
        cp dlinsights_logstash_worker_cws251.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
      5. Log on to each compute host as the cluster administrator, upload the dlinsights_logstash_worker_cws251.conf file to the host, and repeat steps 3b to 3d.
    4. Log on to the primary management host as the cluster administrator and start the following elk services:
      egosh service start elk-shipper
      egosh service start elk-indexer
      egosh service start elk-manager
      egosh service start elk-elasticsearch-data
      egosh service start elk-elasticsearch-master
      egosh service start elk-elasticsearch
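
    As referenced in the resolution above, step 2 can be automated across the cluster. The following sketch pushes the new filebeat configuration to every host over ssh. It assumes passwordless ssh as the cluster administrator, that $EGO_TOP resolves to the same installation path on every host, and that the host names are listed one per line in a file named hosts.txt; these are assumptions for illustration, not part of the documented procedure.
      #!/bin/sh
      # Sketch only: distribute dlinsights_shipper_cws251.yml to every host (step 2).
      SHIPPER_DIR=$EGO_TOP/integration/elk/1.4.5/conf/shipper

      while read -r host; do
          # Step 2a: upload the new filebeat configuration to the host.
          scp dlinsights_shipper_cws251.yml "$host:/tmp/"
          # Step 2c: back up the existing conductor.yml (-n keeps ssh from consuming the host list).
          ssh -n "$host" "cp $SHIPPER_DIR/conductor.yml $SHIPPER_DIR/conductor.yml.bak"
          # Step 2d: replace conductor.yml with the new configuration.
          ssh -n "$host" "cp /tmp/dlinsights_shipper_cws251.yml $SHIPPER_DIR/conductor.yml"
      done < hosts.txt   # hosts.txt lists one host name per line (assumed)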
  • Cannot access the Deep learning console after upgrading to IBM Spectrum Conductor 2.5.1
    After upgrading from IBM Spectrum Conductor version 2.5.0 to 2.5.1, the following error is displayed on the Workload > Deep Learning page:
    Error 404: javax.servlet.ServletException: java.io.FileNotFoundException: SRVE0190E: File not found: /dlgui/dl/toSparkDeepLearning.controller
    To resolve this issue, update the deep learning configuration files (a consolidated script sketch follows these steps):
    1. Log on to the primary management node as the cluster administrator.
    2. Source the environment by running one of the following commands:
      • For BASH shell, run:
        source $EGO_TOP/profile.platform
      • For CSH shell, run:
        source $EGO_TOP/cshrc.platform
      where $EGO_TOP is the cluster installation top directory.
    3. Copy pmc_DLI_help.xml.
      cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/help/pmc_DLI_help.xml $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml
    4. Copy server_internal.xml.
      cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/webapp/server_internal.xml $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml
    5. Stop the webgui service.
      egosh service stop WEBGUI
    6. Start the webgui service.
      egosh service start WEBGUI
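
    As noted in the resolution above, these steps can be combined into a short script. The following sketch assumes a BASH shell and that EGO_TOP is already set to the cluster installation top directory.
      #!/bin/bash
      # Sketch only: steps 2 to 6 on the primary management node (BASH shell assumed).
      # EGO_TOP must already be set to the cluster installation top directory.
      source $EGO_TOP/profile.platform

      # Steps 3 and 4: copy the deep learning console configuration files.
      cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/help/pmc_DLI_help.xml \
         $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml
      cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/webapp/server_internal.xml \
         $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml

      # Steps 5 and 6: restart the WEBGUI service.
      egosh service stop WEBGUI
      egosh service start WEBGUI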
  • Training jobs remain in running state
    An issue exists where training jobs remain in the running state. The following errors are found in the application's executor log:
    INFO TorrentBroadcast: Reading broadcast variable 0 took 20 ms
    INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 360.0 B, free 1048.8 MiB)
    [E]: Error 111 connecting to lab.ibm.com:6379. Connection refused. {}
    ...
    [I]: Try to publish message again after 10s
    [E]: Failed to publish message
    To resolve this issue, restart the redis service (a verification sketch follows these steps):
    1. Log on to the master node as root.
    2. Source the profile.
      • For BASH shell, run:
        . $EGO_TOP/profile.platform
      • For CSH shell, run:
        source $EGO_TOP/cshrc.platform
    3. Log on to EGO as the cluster administrator (for example, the cluster administrator user account Admin with the default password Admin_Password):
      egosh user logon -u Admin -x Admin_Password
    4. Stop the redis service.
      egosh service stop redis
    5. Verify that the redis service has stopped.
      egosh service list|grep redis
      redis DEFINED /Manage* Manag*
    6. Start the redis service.
      egosh service start redis
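
    As referenced in the resolution above, you can verify the restart by checking the service state and testing the redis port. The following sketch assumes a BASH shell; lab.ibm.com and 6379 stand in for the host and port reported in your executor log.
      # Sketch only: verify the redis service after the restart (BASH assumed).
      egosh service list | grep redis          # the state should change from DEFINED to STARTED

      # Test that the redis port accepts connections; use the host and port from the
      # "Error 111 connecting to ..." message in the executor log.
      REDIS_HOST=lab.ibm.com                   # placeholder host name from the sample error
      REDIS_PORT=6379
      if timeout 5 bash -c "exec 3<>/dev/tcp/$REDIS_HOST/$REDIS_PORT"; then
          echo "redis is reachable on $REDIS_HOST:$REDIS_PORT"
      else
          echo "redis is not reachable on $REDIS_HOST:$REDIS_PORT"
      fi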

Issues found in initial release of WML Accelerator 1.2.3

  • TensorFlow v2 model training fails (error: CUDNN_STATUS_INTERNAL_ERROR).
    TensorFlow training fails with the following error:
    2021-08-17 18:04:35.429072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
    I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
    I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
    E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
    For details, see Limiting GPU memory growth.
    To resolve this issue, add an environment variable to the instance group that runs the TensorFlow workload (a quick manual check follows these steps):
    1. Log on to the WML Accelerator management console as the cluster administrator.
    2. Stop the targeted instance group from the Instance Group list page.
    3. Select the instance group and click Configure. Go to the Spark tab and select Configuration.
    4. Choose Additional Environment Variables from the "All Parameters" drop-down list and add the following variable:
      NAME: TF_FORCE_GPU_ALLOW_GROWTH
      VALUE: true
    5. Save the configuration and click Modify Instance Group.
    6. Start the instance group.
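
    As a quick manual check referenced above, you can confirm that the environment variable resolves the cuDNN error by running the failing training script directly on a GPU host. This is a sketch only; train.py is a placeholder for your own TensorFlow v2 training script.
      # Sketch only: manual check on a GPU host before restarting the instance group.
      # train.py is a placeholder for your own TensorFlow v2 training script.
      export TF_FORCE_GPU_ALLOW_GROWTH=true
      python train.py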
  • Deep learning training jobs fail when submitted to WML Accelerator.

    The VEMKD process dumps core when deep learning training jobs are submitted to WML Accelerator by a user with a long user name. As a result, the deep learning training jobs fail.

    When this occurs, the following is logged to /var/log/messages on the current WML Accelerator master host:
    systemd-coredump[2642553]: Process 2342755 (vemkd) of user 0 dumped core.#012#012Stack trace of thread 2342755:#012#0 0x00007fffbc6035f8 raise (libc.so.6)#012#1 0x00007fffbc5e3a2c abort (libc.so.6)#012#2 0x00007fffbc64f09c __libc_message (libc.so.6)#012#3 0x00007fffbc65a338 malloc_printerr (libc.so.6)#012#4 0x00007fffbc65c66c _int_free (libc.so.6)#012#5 0x000000001022c9f4 xmlFreeNodeList (vemkd)#012#6 0x000000001022c9f4 xmlFreeNodeList (vemkd)#012#7 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#8 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#9 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#10 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#11 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#12 0x0000000010227954 xmlFreeDoc (vemkd)#012#13 0x0000000010088c4c getPolicyTrees (vemkd)#012#012Stack trace of thread 2342809:#012#0 0x00007fffbc6e2d18 __select (libc.so.6)#012#1 0x00000000100e6510 _millisleep_ (vemkd)#012#2 0x000000001002a284 limControlWorker (vemkd)#012#3 0x00007fffbcbd87c8 start_thread (libpthread.so.0)#012#4 0x00007fffbc6f0508 __clone (libc.so.6)

    To resolve this issue, apply iFix sc-2.5-build600328 to your WML Accelerator 1.2.3 cluster.

  • Submitting multiple elastic distributed jobs at the same time can cause one job to hang if a worker is reclaimed too quickly.
    In rare cases, an elastic distributed training job can hang because an executor received a reclaim signal to be killed but did not send the updated kill status to the driver. When this happens, the following information is logged by the executor that failed to send the updated kill status:
    ...
    2020-05-21 02:23:11,929 - root - INFO - receive reclaim signal.
    2020-05-21 02:23:11,929 - root - INFO - Worker was reclaimed before registration. Terminate itself directly!!
    20/05/21 02:27:49 INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
    20/05/21 02:27:49 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason:
    20/05/21 02:27:49 INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1590042469999 state:KILLED
    ...

    The elastic distributed training worker did not start before the driver sent the reclaim signal to kill the executor. As a result, the executor was not killed, which caused the elastic training job to hang. To resolve this issue, manually stop the hanging elastic distributed training job and resubmit the training job.

    Stop the hanging elastic distributed training job in one of the following ways:
    • Using the instance group application list page in the cluster management console.
    • Using the Deep Learning tab in the cluster management console.
    • Using the deep learning CLI.
  • The curl command that is generated from Swagger does not work from the command line.

    An issue exists when you use the curl command that is generated from the Swagger API. When you upload a file, the REST request content type is multipart/form-data, and the curl command must include the -F file= parameter to specify the location of the file to upload. To resolve this issue, ensure that the -F file= parameter is added to the generated curl command. For example:

    curl -k -s -X POST  -H Content-Type:multipart/form-data  -H 'accept: application/json'   -F file=@/tmp/pytorch_mnist_qZog3N.modelDir.tar 'https://wmlahost1:9243/platform/rest/deeplearning/v1/execs/?sigName=mysig&args=--exec-start%20PyTorch%20%20%20%20%20%20%20%20%20%20%20%20--cs-datastore-meta%20type%3Dfs%2Cdata_path%3Dpaietool/%20%20%20%20%20%20%20%20%20%20%20--ig%20mysig%20%20%20%20%20%20%20%20%20%20%20--numWorker%201%20%20%20%20%20%20%20%20%20%20%20--gpuPerWorker%201%20%20%20%20%20%20%20%20%20%20%20--model-dir%20pytorch_mnist%20%20%20%20%20%20%20%20%20%20%20--model-main%20pytorch_mnist.py%20--batch-size%2064%20--lr%200.01%20--epochs%2015'
  • When running elastic distributed training with a PyTorch model, the training hangs and cannot be completed.
    When running a PyTorch elastic distributed training job, the executors hit the following error:
    [ERROR DDL-1-4] A NCCL error has occurred on host xxx.xxx.xx.ibm.com:
    unhandled system error
    xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
    xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
    xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
    The NCCL error occurs when a host has multiple network interfaces and NCCL is trying to connect to an incorrect IP address. To resolve this issue, add the following lines to $DLI_SHARED_FS/conf/spark-env.sh:
    export NCCL_P2P_DISABLE="1"
    export NCCL_SOCKET_IFNAME=nic1,nic2
    where nic1 and nic2 are the correct network interfaces for your cluster's network. A sketch for finding the interface names follows.
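
    The following sketch shows one way to find candidate interface names and append the settings; it assumes the iproute2 ip command is available on the GPU hosts and uses eth0 as a placeholder interface name.
      # Sketch only: list interfaces and their IPv4 addresses to find the cluster network interface.
      ip -o -4 addr show

      # Append the NCCL settings to the shared Spark environment file; replace eth0
      # with the interface name(s) that match your cluster network.
      echo 'export NCCL_P2P_DISABLE="1"' >> $DLI_SHARED_FS/conf/spark-env.sh
      echo 'export NCCL_SOCKET_IFNAME=eth0' >> $DLI_SHARED_FS/conf/spark-env.sh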
  • Elastic distributed training does not save checkpoint information when training a model.

    Ensure that you set checkpoint_freq to 1 in the elastic distributed training model API, for example: model.train(epoch_number, batch_size, checkpoint_freq=1).

    Doing so enables checkpoint information to be saved during training. By setting the checkpoint frequency to 1, a checkpoint is generated for each epoch. For example, if the training runs for 5 epochs, 5 checkpoints are created.

  • After creating a Spark instance group, the Spark master fails if SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK is changed.
    When creating a Spark instance group, you must set SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK appropriately, because this value cannot be changed after the Spark instance group is started. If the value is changed, the Spark instance group batch master fails to start and the following error is logged in the service log (assuming that the Spark instance group name is wmlaedt):
    ERROR EGOSecurityManager: The execution user for consumer /wmlaedt/wmlaedt-sparkexecutor/wmlaedt-sparkexecutor0 does not exist.
  • If using IBM Watson Studio Local with WML Accelerator, a PyTorch elastic distributed training job fails if the following setting is used: csRmqMeta=None

    Ensure that a value is specified for csRmqMeta. If a job has failed as a result of this issue, delete the task, and try again.

  • After a Spark instance group is deleted, any datasets that were previously created are automatically removed, but datasets with the same name cannot be recreated.
    If you delete a Spark instance group, an issue exists where the related datasets and models might not be removed completely. This issue prevents you from recreating datasets and models with the same name. To resolve this issue, run the following cleanup command:
    curl -k -X DELETE -u Admin:Admin <DLPD_REST_BASE_URL_1>deeplearning/v1/admin/cleanup
  • In the cluster management console, the Drivers and Executors link on the Deep Learning pages does not load.

    To see the Drivers and Executors page, make sure that the instance group is running. If the instance group is not running, start the instance group and try again.

  • A deep learning training fails or runs with errors after a task is killed by an executor.
    When this occurs, messages similar to the following are logged by the executor:
    INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
    INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
    INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1512136348229 state:KILLED
    To resolve this issue, disable the option to reclaim resources in the instance group consumer.
    1. Select Resources > Consumers.
    2. Select the Spark executor consumer for the Deep Learning Impact Spark instance group.
    3. Click the Consumer Properties tab, and complete the following steps:
      1. Clear the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the same value as the value set for the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance group's Spark configuration.
        Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
        1. Go to Workload > Spark Instance Groups.
        2. Click the Deep Learning Impact Spark instance group.
        3. Select Manage > Configure.
        4. Click Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
      Note: Make these changes on all of the related Spark executor consumers, including the parent and child consumers.
    4. Click Apply to save the changes.
    5. Restart the Spark instance group.