Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact

After installing IBM Spectrum Conductor and IBM Spectrum Conductor Deep Learning Impact, start the cluster management console for the first time and configure IBM Spectrum Conductor Deep Learning Impact.

Ensure that you have successfully installed IBM Spectrum Conductor Deep Learning Impact.

  1. Locate and start the cluster management console. See Locating the cluster management console.
  2. Verify that IBM Spectrum Conductor Deep Learning Impact was installed successfully: from the Workload menu, navigate to Spark > Deep Learning. If the Deep Learning option is available, the installation succeeded. If the option is unavailable, troubleshoot the installation using the dlibin_install.log file in the $EGO_TOP/dli/install_log directory and the log files in the $EGO_TOP/dli/1.2.2/dlpd/logs directory.
  3. Ensure that GPUs are enabled for deep learning workloads. If GPUs are not enabled, enable them now. See Enabling GPUs.
  4. Create a resource group for GPU executors where the advanced formula is set to ngpus. See Using resource groups with GPU hosts.
  5. If needed, create a resource group for CPU executors. The resource group for CPU executors must contain all of the hosts that are in the GPU executors resource group.
  6. Create a Spark instance group for IBM Spectrum Conductor Deep Learning Impact using the dli-sig-template-2.2.0 template. See Create a Spark instance group using the dli-sig-template-2.2.0 template.
    Note: If you have IBM Watson Studio Local installed, create the Spark instance group using the wmla-ig-template-2.3.1 template for distributed training or the wmla-ig-edt-template-2.3.1 template for elastic distributed training.
    Note: To use IBM Watson Studio Local with IBM Watson Machine Learning Accelerator, install IBM Watson Studio Local before you install IBM Watson Machine Learning Accelerator. If IBM Watson Machine Learning Accelerator is already installed, remove it and install it again after installing IBM Watson Studio Local.

  7. Edit the consumer properties of the Spark instance group.
    1. Navigate to Resources > Consumers.
    2. Select the <Spark-instance-group-name>-spark consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the value of the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark configuration of the Spark instance group.
        Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
        1. Go to Workload > Spark > Spark Instance Groups.
        2. Click the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
        3. Select Manage > Configure.
        4. Click Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
    3. Select the <Spark-instance-group-name>-sparkexecutor consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the value of the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark configuration of the Spark instance group.
    4. Click Apply.
  8. Start the Spark instance group for IBM Spectrum Conductor Deep Learning Impact.
    1. Navigate to the Workload tab and select Spark > Spark Instance Groups.
    2. Select the Spark instance group and click Start.

IBM Spectrum Conductor Deep Learning Impact is now configured and ready to use. To verify that it is configured correctly, do the following:
  1. Verify that the cluster management console reports no issues. If you see any issues, check the log files on the management host in the $EGO_TOP/gui/logs directory.
  2. Check that the deep learning services have started. Run the egosh service list -l command and confirm that all services are in the STARTED state.
    If any service is not started, refer to the corresponding log files on the management host for more information:
    • $EGO_TOP/dli/1.2.2/dlpd/logs/dlpd.log
    • $EGO_TOP/dli/1.2.2/dlpd/dlrest/logs/messages.log
    • $EGO_TOP/dli/1.2.2/dlinsights/logs/start_dlinsights_service.sh.log.monitor
    • $EGO_TOP/dli/1.2.2/dlinsights/logs/start_dlinsights_service.sh.log.optimizer
    • $EGO_TOP/dli/1.2.2/dlinsights/logs/monitor.log.hostname
    • $EGO_TOP/dli/1.2.2/dlinsights/logs/optimizer.log.hostname
    • $EGO_TOP/dli/1.2.2/mongodb/logs/mongod.log
    • $EGO_TOP/integration/elk/log/shipper-err.log.hostname
    • $EGO_TOP/integration/elk/log/shipper-out.log.hostname
  3. For additional issues, refer to IBM Spectrum Conductor troubleshooting and FAQs.
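
When several services fail to start, it can help to check all of the log locations named in this topic in one pass. The following Python sketch is one possible way to do that. It is not part of the product. The function names (dli_log_paths, suspicious_lines, report) are hypothetical, the keyword list is only a guess at what a problem line might contain, and it assumes that the "hostname" suffix in the file names is the host name that you pass in.

```python
from pathlib import Path

# Version directory as it appears in the paths in this topic; adjust if yours differs.
DLI_VERSION = "1.2.2"


def dli_log_paths(ego_top, hostname):
    """Return the log file paths listed in this topic, resolved under ego_top."""
    top = Path(ego_top)
    dli = top / "dli" / DLI_VERSION
    insights = dli / "dlinsights" / "logs"
    elk = top / "integration" / "elk" / "log"
    return [
        top / "dli" / "install_log" / "dlibin_install.log",
        dli / "dlpd" / "logs" / "dlpd.log",
        dli / "dlpd" / "dlrest" / "logs" / "messages.log",
        insights / "start_dlinsights_service.sh.log.monitor",
        insights / "start_dlinsights_service.sh.log.optimizer",
        # Assumption: the suffix is the host name exactly as passed in.
        insights / f"monitor.log.{hostname}",
        insights / f"optimizer.log.{hostname}",
        dli / "mongodb" / "logs" / "mongod.log",
        elk / f"shipper-err.log.{hostname}",
        elk / f"shipper-out.log.{hostname}",
    ]


def suspicious_lines(path, keywords=("error", "fail", "exception")):
    """Return (line_number, text) pairs whose text contains any keyword.

    The keywords are an assumption, not a documented log format; matching
    is case-insensitive.
    """
    hits = []
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            if any(word in lowered for word in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits


def report(ego_top, hostname):
    """Print one line per log file: 'missing' or the count of suspicious lines."""
    for path in dli_log_paths(ego_top, hostname):
        if not path.is_file():
            print(f"missing  {path}")
            continue
        print(f"{len(suspicious_lines(path)):7d}  {path}")
```

On the management host, calling report with the value of the EGO_TOP environment variable and the host name prints a short summary that shows which files to open first. A file reported as missing may simply mean that the corresponding service has never started on that host.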