Configure IBM Spectrum Conductor Deep Learning Impact

  1. Locate and start the cluster management console. See Locating the cluster management console.
  2. Verify that IBM Spectrum Conductor Deep Learning Impact was installed successfully by selecting the Workload menu and navigating to Spark > Deep Learning. If this option is available, IBM Spectrum Conductor Deep Learning Impact was installed successfully. If the Deep Learning option is unavailable, troubleshoot the installation by using the dlibin_install.log file in the $EGO_TOP/dli/install_log directory. Additionally, see the log files in the $EGO_TOP/dli/1.2.3/dlpd/logs directory.
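    As a quick command-line check, you can also inspect the installation log directly on the master host. The following is a minimal sketch that uses the same log paths listed above:
      # tail -n 50 $EGO_TOP/dli/install_log/dlibin_install.log
      # ls -l $EGO_TOP/dli/1.2.3/dlpd/logs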
  3. Ensure that GPUs are enabled for deep learning workloads. If GPUs are not enabled, enable them by running one of these scripts on the master host (a short verification sketch follows these commands):
    Note: EGO is restarted on all of the hosts in the cluster after running this script.
    • To run with user interaction: # $EGO_TOP/conductorspark/2.3.0/etc/gpuconfig.sh enable
    • To run without user interaction: # $EGO_TOP/conductorspark/2.3.0/etc/gpuconfig.sh enable --quiet -u <username> -x <password>
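    To confirm that the GPU hosts report their devices after EGO restarts, you can run the standard NVIDIA nvidia-smi utility on each GPU host. This is an illustrative check only (it assumes NVIDIA GPUs and drivers are installed) and is not part of gpuconfig.sh; the GPU count it reports should match the ngpus value shown in the cluster management console:
      # nvidia-smi -L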
  4. Create a resource group for GPU executors where the advanced formula is set to ngpus.
    • To create resource groups for slot-based scheduling, see Creating resource groups.
      Create two resource groups: one with CPU hosts (for example, CPUrg); the other with GPU hosts (for example, GPUrg). Do not use the same name for the two resource groups even if they contain the same hosts.
      Note:
      • For the resource group with the GPU hosts, ensure that all of the hosts in the resource group have a numeric value in the ngpus column, and that the number under the Total slots for this group column is equal to the total number of GPUs on all of the hosts in the resource group.
      • If your cluster has non-GPU hosts, make sure to select Static (List of Names) as the Resource Selection Method and select the hosts from the Member hosts list. By default, the ngpus column is hidden but can be added using the Preferences menu.

      After creating the resource groups, update the resource plan to use the resource groups that you created. If required, modify the default resource plan to change how resources are allocated among consumers. For example, you can change the default slot allocation policy from Stacked to Exclusive. For more information, see Creating or modifying resource plans.

    • To create resource groups for multidimensional scheduling, see Flow to configure multidimensional scheduling.

      As part of this workflow, you create multidimensional resource groups, associate the resource groups with a multidimensional resource plan, add a consumer to the resource plan, and update other settings.

  5. If needed, create a resource group for CPU executors. The resource group for CPU executors must contain all of the hosts that are in the GPU executors resource group.
  6. Create a Spark instance group for IBM Spectrum Conductor Deep Learning Impact using the dli-sig-template-2.2.0 template.
    Note: To use IBM Watson Studio Local with WML Accelerator, you must install IBM Watson Studio Local before installing WML Accelerator. If you already have WML Accelerator installed, you must remove it and install it again after installing IBM Watson Studio Local.

    If you have IBM Watson Studio Local installed with IBM Spectrum Conductor 2.3.0, make sure to create a Spark instance group using the wmla-ig-template-2.3.1 template for distributed training or the wmla-ig-edt-template-2.3.1 template for elastic distributed training.

    If you have IBM Watson Studio Local installed with IBM Spectrum Conductor 2.4.0, make sure to create an instance group using the wmla-ig-template-2.3.3 template for distributed training or the wmla-ig-edt-template-2.3.3 template for elastic distributed training.

  7. If you have IBM Watson Studio Local installed, edit the resource plan of each Spark instance group.
    1. Navigate to Resources > Resource Planning (Slot) > Resource Plan.
    2. From the set of available resources, select the GPU resource group previously created, for example: Resource Group: GPUrg.
    3. Expand the Slot allocation policy menu and select Exclusive and Consumer level: each host is only assigned to allocations under the same exclusive consumer.
    4. Select Hybrid policy.
    5. Set your owned slots for each consumer.

      Make sure to select <Spark-instance-group-name>-sparkexecutor2 (for example, wml-ig-sparkexecutor2 and wml-ig-edt-sparkexecutor2) as an exclusive consumer and, in the corresponding Model type: Hybrid column, enable the following slots: <Spark-instance-group-name>-sparkexecutor0, <Spark-instance-group-name>-sparkexecutor1, and <Spark-instance-group-name>-sparkexecutor2.

    6. Click Apply.
  8. Start the Spark instance group for IBM Spectrum Conductor Deep Learning Impact.
    1. Navigate to the Workload tab and select Spark > Spark Instance Groups.
    2. Select the Spark instance group and click Start.
  9. Verify that IBM Spectrum Conductor™ Deep Learning Impact is configured correctly.
    1. Verify that the cluster management console shows no issues. If you see any issues, check the log files on the management host in the $EGO_TOP/gui/logs directory.
    2. Check that the deep learning services have started by running the egosh service list -l command and verifying that all services are in the STARTED state (a quick command-line sketch follows the list of log files below).
      If any services are not started, refer to the corresponding log files on the management host for more information:
      • $EGO_TOP/dli/1.2.3/dlpd/logs/dlpd.log
      • $EGO_TOP/dli/1.2.3/dlpd/dlrest/logs/messages.log
      • $EGO_TOP/dli/1.2.3/dlinsights/logs/start_dlinsights_service.sh.log.monitor
      • $EGO_TOP/dli/1.2.3/dlinsights/logs/start_dlinsights_service.sh.log.optimizer
      • $EGO_TOP/dli/1.2.3/dlinsights/logs/monitor.log.hostname
      • $EGO_TOP/dli/1.2.3/dlinsights/logs/optimizer.log.hostname
      • $EGO_TOP/dli/1.2.3/mongodb/logs/mongod.log
      • $EGO_TOP/integration/elk/log/shipper-err.log.hostname
      • $EGO_TOP/integration/elk/log/shipper-out.log.hostname
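      As a quick command-line check, you can list the services and filter out the ones that are already started. This is a minimal sketch built around the egosh service list -l command mentioned above; the grep filter is only illustrative:
        # egosh service list -l
        # egosh service list -l | grep -v STARTED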
    3. For other issues, refer to IBM Spectrum Conductor troubleshooting information.