Limitations and known issues
These limitations and known problems exist in WML Accelerator 1.2.3.
Issue found after applying Fix 601552
- Log on to the primary management host as the cluster administrator and stop the following elk services:
egosh service stop elk-shipper
egosh service stop elk-indexer
egosh service stop elk-manager
egosh service stop elk-elasticsearch-data
egosh service stop elk-elasticsearch-master
egosh service stop elk-elasticsearch
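Optionally, confirm that the services have stopped before you continue. A quick check, assuming the service names listed above, is:
egosh service list | grep elk
A stopped service is shown in a state other than STARTED (for example, DEFINED).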
- Update the logstash configuration.
- Log on to the primary management host as the cluster administrator and upload dlinsights_logstash_worker_cws251.conf to the host. The dlinsights_logstash_worker_cws251.conf file can be obtained from GitHub.
- Navigate to $EGO_CONFDIR/../../integration/elk/conf/indexer:
cd $EGO_CONFDIR/../../integration/elk/conf/indexer
- Back up the cws_spark.conf file to cws_spark.conf.bak.
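One way to create the backup, assuming you are still in the indexer configuration directory from the previous step, is:
cp cws_spark.conf cws_spark.conf.bak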
- Copy dlinsights_logstash_worker_cws251.conf to $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf.
cp dlinsights_logstash_worker_cws251.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
- Open the cws_spark.conf file and add the following lines:
#Match Spark 3.3.1
match => { "message" => "%{DATESTAMP:[@metadata][timestamp]} %{DATA} Executor: [task] [%{DATA:[State]}] taskName:task %{INT:[TaskIndex]:int}.%{INT:[TaskAttempt]:int} in stage %{INT:[StageID]:int}.%{INT:[StageAttempt]:int} (TID %{DATA}) taskId:%{INT:[TaskID]:int} finishTime:%{INT:[Time]:int} duration:%{INT:[Duration]:int} state:%{DATA} resourceType:%{WORD:[ResourceType]}%{GREEDYDATA}" }
- Log on to each compute host as the cluster administrator and complete steps 2b to 3e to upload the dlinsights_logstash_worker_cws251.conf file to that host.
- Log on to the primary management host as the cluster administrator and start the following elk services:
egosh service start elk-shipper
egosh service start elk-indexer
egosh service start elk-manager
egosh service start elk-elasticsearch-data
egosh service start elk-elasticsearch-master
egosh service start elk-elasticsearch
Issues found after upgrading to IBM Spectrum Conductor 2.5.1
For known issues found in IBM Spectrum Conductor 2.5.1, see Known issues found in IBM Spectrum Conductor 2.5.1.
- No data displayed in the Insights charts after upgrading to IBM Spectrum Conductor 2.5.1
After upgrading from IBM Spectrum Conductor version 2.5.0 to 2.5.1, no data is displayed in the Insights charts after a training workload is run.
To resolve this issue, update the elk configuration files:
- Log on to the primary management host as the cluster administrator and stop the following elk services:
egosh service stop elk-shipper
egosh service stop elk-indexer
egosh service stop elk-manager
egosh service stop elk-elasticsearch-data
egosh service stop elk-elasticsearch-master
egosh service stop elk-elasticsearch
- Update the filebeat configuration.
- Log on to each host in the cluster as the cluster administrator and upload dlinsights_shipper_cws251.yml to the host. The dlinsights_shipper_cws251.yml file can be obtained from GitHub.
- Navigate to $EGO_TOP/integration/elk/1.4.5/conf/shipper, where $EGO_TOP is the cluster installation top directory:
cd $EGO_TOP/integration/elk/1.4.5/conf/shipper
- Back up the conductor.yml file to conductor.yml.bak.
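One way to create the backup, assuming you are still in the shipper configuration directory from the previous step, is:
cp conductor.yml conductor.yml.bak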
- Copy dlinsights_shipper_cws251.yml to $EGO_TOP/integration/elk/1.4.5/conf/shipper/conductor.yml.
cp dlinsights_shipper_cws251.yml $EGO_TOP/integration/elk/1.4.5/conf/shipper/conductor.yml
- Update the logstash configuration.
- Log on to the primary management host as the cluster administrator and upload dlinsights_logstash_worker_cws251.conf to the host. The dlinsights_logstash_worker_cws251.conf file can be obtained from GitHub.
- Navigate to $EGO_CONFDIR/../../integration/elk/conf/indexer:
cd $EGO_CONFDIR/../../integration/elk/conf/indexer
- Back up the cws_spark.conf file to cws_spark.conf.bak.
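As in the earlier procedure, one way to create the backup, assuming you are still in the indexer configuration directory, is:
cp cws_spark.conf cws_spark.conf.bak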
- Copy dlinsights_logstash_worker_cws251.conf to $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf.
cp dlinsights_logstash_worker_cws251.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
- Log on to each compute host as the cluster administrator and complete steps 3b to 3d to upload the dlinsights_logstash_worker_cws251.conf file to that host.
- Log on to the primary management host as the cluster administrator and start the following elk services:
egosh service start elk-shipper
egosh service start elk-indexer
egosh service start elk-manager
egosh service start elk-elasticsearch-data
egosh service start elk-elasticsearch-master
egosh service start elk-elasticsearch
- Cannot access the Deep learning console after upgrading to IBM Spectrum Conductor 2.5.1
After upgrading from IBM Spectrum Conductor version 2.5.0 to 2.5.1, the following error is displayed when you try to access the Deep learning console:
Error 404: javax.servlet.ServletException: java.io.FileNotFoundException: SRVE0190E: File not found: /dlgui/dl/toSparkDeepLearning.controller
To resolve this issue, update the deep learning configuration files:
- Log on to the primary management node as the cluster administrator.
- Source the environment by running one of the following commands:
- For BASH shell, run:
source $EGO_TOP/profile.platform
- For CSH shell, run:
source $EGO_TOP/cshrc.platform
- Copy pmc_DLI_help.xml.
cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/help/pmc_DLI_help.xml $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml
- Copy server_internal.xml.
cp $EGO_TOP/gui/activation/dlimgmt-1.2.5/conf/webapp/server_internal.xml $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml
- Stop the webgui service.
egosh service stop WEBGUI
- Start the webgui service.
egosh service start WEBGUI
- Training jobs remain in running state
An issue exists where training jobs remain in the running state. The following error is found in the application's executor log:
INFO TorrentBroadcast: Reading broadcast variable 0 took 20 ms
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 360.0 B, free 1048.8 MiB)
[E]: Error 111 connecting to lab.ibm.com:6379. Connection refused. {}
...
[I]: Try to publish message again after 10s
[E]: Failed to publish message
To resolve this issue, restart the redis service:
- Log on to the master node as root.
- Source the profile.
- If you are using BASH, run:
. $EGO_TOP/profile.platform
- If you are using CSH, run:
source $EGO_TOP/cshrc.platform
- Log on to EGO as the cluster administrator (for example, the cluster administrator user account Admin with the default password Admin_Password):
egosh user logon -u Admin -x Admin_Password
- Stop the redis service.
egosh service stop redis
- Verify that the redis service has stopped.
egosh service list | grep redis
redis            DEFINED     /Manage*    Manag*
- Start the redis service.
egosh service start redis
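Optionally, verify that the service is running again. For example:
egosh service list | grep redis
The redis service should now be reported in a started state (typically STARTED) rather than DEFINED.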
Issues found in initial release of WML Accelerator 1.2.3
- TensorFlow v2 model training fails (error: CUDNN_STATUS_INTERNAL_ERROR)
TensorFlow training fails with the following error:
2021-08-17 18:04:35.429072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
For details, see Limiting GPU memory growth.
To resolve this issue, add an environment variable to the instance group that runs the TensorFlow workload:
- Log on to the WML Accelerator management console as the cluster administrator.
- Stop the targeted instance group from the Instance Group list page.
- Select the instance group and click Configure. Go to the Spark tab and select Configuration.
- Choose Additional Environment Variables from the "All Parameters" drop-down list and add the following:
NAME: TF_FORCE_GPU_ALLOW_GROWTH
VALUE: true
- Save the configuration and click Modify Instance Group.
- Start the instance group.
- Deep learning training jobs fail when submitted to WML Accelerator.
The VEMKD process dumps core when deep learning training jobs are submitted to WML Accelerator using a long username. As a result, the deep learning training jobs fail.
When this occurs, the following is logged to /var/log/messages on the WML Accelerator current master host:
systemd-coredump[2642553]: Process 2342755 (vemkd) of user 0 dumped core.#012#012Stack trace of thread 2342755:#012#0 0x00007fffbc6035f8 raise (libc.so.6)#012#1 0x00007fffbc5e3a2c abort (libc.so.6)#012#2 0x00007fffbc64f09c __libc_message (libc.so.6)#012#3 0x00007fffbc65a338 malloc_printerr (libc.so.6)#012#4 0x00007fffbc65c66c _int_free (libc.so.6)#012#5 0x000000001022c9f4 xmlFreeNodeList (vemkd)#012#6 0x000000001022c9f4 xmlFreeNodeList (vemkd)#012#7 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#8 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#9 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#10 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#11 0x000000001022c7b4 xmlFreeNodeList (vemkd)#012#12 0x0000000010227954 xmlFreeDoc (vemkd)#012#13 0x0000000010088c4c getPolicyTrees (vemkd)#012#012Stack trace of thread 2342809:#012#0 0x00007fffbc6e2d18 __select (libc.so.6)#012#1 0x00000000100e6510 _millisleep_ (vemkd)#012#2 0x000000001002a284 limControlWorker (vemkd)#012#3 0x00007fffbcbd87c8 start_thread (libpthread.so.0)#012#4 0x00007fffbc6f0508 __clone (libc.so.6)
To resolve this issue, apply iFix sc-2.5-build600328 to your WML Accelerator 1.2.3 cluster.
- Submitting multiple elastic distributed jobs at the same time can cause one job to hang if a worker is reclaimed too quickly.
In rare cases, an elastic distributed training job can hang because an executor received a reclaim signal to be killed but did not send the updated kill status to the driver. When this happens, the following information is logged by the executor that failed to send the updated kill status:
...
2020-05-21 02:23:11,929 - root - INFO - receive reclaim signal.
2020-05-21 02:23:11,929 - root - INFO - Worker was reclaimed before registration. Terminate itself directly!!
20/05/21 02:27:49 INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
20/05/21 02:27:49 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason:
20/05/21 02:27:49 INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1590042469999 state:KILLED
...
The elastic distributed training worker did not start before the driver sent the reclaim signal to kill the executor. As a result, the executor was not killed, which caused the elastic training job to hang. To resolve this issue, manually stop the hanging elastic distributed training job and resubmit the training job.
Stop the hanging elastic distributed training job in one of the following ways:
- Using the instance group application list page in the cluster management console.
- Using the Deep Learning tab in the cluster management console.
- Using the deep learning CLI.
- Curl command generated from Swagger does not work from the command line.
An issue exists when using the curl command that is generated from the Swagger API. When you upload a file, the REST request content type is multipart/form-data. When using curl from the command line, you must provide the -F file= parameter to specify the location of the file to upload. To resolve this issue, ensure that the -F file= parameter is added to the curl command. For example:
curl -k -s -X POST -H Content-Type:multipart/form-data -H 'accept: application/json' -F file=@/tmp/pytorch_mnist_qZog3N.modelDir.tar 'https://wmlahost1:9243/platform/rest/deeplearning/v1/execs/?sigName=mysig&args=--exec-start%20PyTorch%20%20%20%20%20%20%20%20%20%20%20%20--cs-datastore-meta%20type%3Dfs%2Cdata_path%3Dpaietool/%20%20%20%20%20%20%20%20%20%20%20--ig%20mysig%20%20%20%20%20%20%20%20%20%20%20--numWorker%201%20%20%20%20%20%20%20%20%20%20%20--gpuPerWorker%201%20%20%20%20%20%20%20%20%20%20%20--model-dir%20pytorch_mnist%20%20%20%20%20%20%20%20%20%20%20--model-main%20pytorch_mnist.py%20--batch-size%2064%20--lr%200.01%20--epochs%2015'
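For readability, the URL-encoded args value in this example decodes to the following arguments (runs of spaces collapsed):
--exec-start PyTorch --cs-datastore-meta type=fs,data_path=paietool/ --ig mysig --numWorker 1 --gpuPerWorker 1 --model-dir pytorch_mnist --model-main pytorch_mnist.py --batch-size 64 --lr 0.01 --epochs 15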
- When running elastic distributed training with a PyTorch model, the training hangs and cannot be completed.
When running a PyTorch elastic distributed training job, the executors hit the following error:
[ERROR DDL-1-4] A NCCL error has occurred on host xxx.xxx.xx.ibm.com: unhandled system error
xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
xxx.xxx.xxx.ibm.com:122368:122368 [0] NCCL INFO Call to connect returned Connection refused, retrying
The NCCL error occurs when a host has multiple network interfaces and NCCL is trying to connect to an incorrect IP address. To resolve this issue, add the following lines to $DLI_SHARED_FS/conf/spark-env.sh:
export NCCL_P2P_DISABLE="1"
export NCCL_SOCKET_IFNAME=nic1,nic2
where nic1 is the correct network interface according to your cluster's network.
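If you are not sure which interface names to use, one way to list the network interfaces available on a host is the standard Linux command:
ip addr show
Pick the interface that carries your cluster's data network.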
- Elastic distributed training does not save checkpoint information when training a model.
Ensure that you set checkpoint_freq to 1 in the elastic distributed training model API, for example:
model.train(epoch_number, batch_size, checkpoint_freq=1)
Doing so enables checkpoint information to be saved during training. By setting the checkpoint frequency to 1, a checkpoint is generated for each epoch. For example, if the training runs for 5 epochs, 5 checkpoints are created.
- After creating a Spark instance group, the Spark master fails if SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK is changed.
When creating a Spark instance group, you must set SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK accordingly, as this value cannot be changed after the Spark instance group is started. If the value is changed, the Spark instance group batch master fails to start and the following error is logged in the service log (assuming the Spark instance group name is wmlaedt):
ERROR EGOSecurityManager: The execution user for consumer /wmlaedt/wmlaedt-sparkexecutor/wmlaedt-sparkexecutor0 does not exist.
- If using IBM Watson Studio Local with WML Accelerator, a PyTorch elastic distributed training job fails if the following setting is used:
csRmqMeta=None
Ensure that a value is specified for csRmqMeta. If a job has failed as a result of this issue, delete the task, and try again.
- After a Spark instance group is deleted, any datasets that were previously created are automatically removed, but datasets using the same name cannot be recreated.
If you delete a Spark instance group, an issue exists where the related datasets and models might not be removed completely. This issue prevents you from recreating datasets and models of the same name. To resolve this issue, run the following cleanup command:
curl -k -X DELETE -u Admin:Admin <DLPD_REST_BASE_URL_1>deeplearning/v1/admin/cleanup
- In the cluster management console, the Drivers and Executors link on the Deep Learning pages does not load.
To see the Drivers and Executors page, make sure that the instance group is running. If the instance group is not running, start the instance group and try again.
- A deep learning training fails or runs with errors after a task is killed by an executor.
When this occurs, the following is logged:
INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1512136348229 state:KILLED
To resolve this issue, disable the option to reclaim resources in the instance group consumer:
- Select .
- Select the Spark executor consumer for the Deep Learning Impact Spark instance group.
- Click the Consumer Properties tab, and complete the following steps:
- Clear the Rebalance when resource plan changes or time interval changes option.
- Set Reclaim grace period to the same value as the value set for the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the SIG Spark configuration.
Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
- Go to .
- Click the Deep Learning Impact Spark instance group.
- Select .
- Click Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
Note: Make these changes on all of the same Spark executor consumers, including the parent and children.
- Click Apply to save the changes.
- Restart the Spark instance group.