Create a Spark instance group using the dli-sig-template-2.3.3 template

Create an instance group using the dli-sig-template-2.3.3 template. This template is only available with IBM Spectrum Conductor 2.4.0.

  1. Ensure that you have completed setting up your resources; see Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact.
  2. Create an instance group for IBM Spectrum Conductor Deep Learning Impact using the dli-sig-template-2.3.3 template.
    1. Select the Workload tab and click Spark > Instance Groups.
    2. In the Instance Group List tab, click New.
    3. Click the Templates button to load the dli-sig-template-2.3.3 template.
      Attention: The dli-sig-template-2.3.3 template creates an instance group that is used for running deep learning workloads. To use elastic distributed training, use this template to create a second instance group dedicated to running elastic distributed training workloads.
    4. Click Use to select and use the dli-sig-template-2.3.3 template.
    5. Provide a name for the instance group.
    6. Provide a directory for the Spark deployment. The egoadmin user must have read, write, and execute permissions on the specified directory and its parent directory. If you use a different execution user, the cluster administrator must have the privileges of the user group that the user belongs to, and the user's umask must be set to 002.
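      As a quick sanity check, you can verify these permissions from a shell before continuing. This is a sketch only; DEPLOY_DIR is a hypothetical example path, not a value from this procedure, so substitute your actual deployment directory and run the check as the execution user (egoadmin by default):

      ```shell
      #!/bin/sh
      # DEPLOY_DIR is a hypothetical example; replace it with your
      # actual Spark deployment directory.
      DEPLOY_DIR=${DEPLOY_DIR:-/tmp/sig-deploy}
      mkdir -p "$DEPLOY_DIR"

      # The deployment directory and its parent both need read, write,
      # and execute permission for the execution user.
      for d in "$DEPLOY_DIR" "$(dirname "$DEPLOY_DIR")"; do
        if [ -r "$d" ] && [ -w "$d" ] && [ -x "$d" ]; then
          echo "$d: OK"
        else
          echo "$d: missing read/write/execute permission"
        fi
      done

      # For a non-egoadmin execution user, this must report 0002 (or 002).
      umask
      ```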
    7. Leave the execution user set to egoadmin.
    8. Provide a Spark version and configure Spark. By default, the template uses Spark version 2.3.3 and is configured for single node training with Python 3.6. If you change the Spark version, these configurations are lost and must be set manually. If you want to use a different training type or a different Python version, configure additional parameters as follows.
      • For single node training, ensure that the following Spark parameters are set:
        • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
        • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
      • For distributed training, ensure that the following Spark parameters are set:
        • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
        • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
        In addition, set the following:
        • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1,2,4.
        • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fifo.
        • SPARK_EGO_ENABLE_PREEMPTION must be set to false.
        • SPARK_EGO_SLOTS_REQUIRED_TIMEOUT must be adjusted to a smaller value so that jobs time out in a reasonable amount of time. If this value is too large, jobs competing for resources can wait too long and are then abruptly stopped by the executor.
      • For elastic distributed training, ensure that the following Spark parameters are set:
        • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
        • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
        • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
        • SPARK_EGO_RECLAIM_GRACE_PERIOD must be set to 200.
        • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
        In addition, set the following:
        • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1.
        • SPARK_EGO_ENABLE_PREEMPTION must be set to true.
        • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fairshare. Elastic distributed training works with both fifo and fairshare; however, fairshare is preferred for performance reasons.
      In addition to these configurations, also consider the following:
      • By default, the Spark instance group uses Python 3.6. If you want to use Python 2.7, set the PYTHON_VERSION environment variable to python2. To add this environment variable, do the following:
        1. In the parameter drop down, select Additional Environment Variables and click Add an Environment Variable.
        2. Set the Name to PYTHON_VERSION and the Value to python2.
        3. Click Save.
      • If you are using Caffe, you must set the JAVA_HOME environment variable to your OpenJDK path. This path must be the same on all hosts. This environment variable is not included in the default dli-sig-template-2.3.3 template. For example, on RHEL 7 this path might be /usr/lib/jvm/java-1.8.0-openjdk/.
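      Taken together, the parameters in this step can be sketched as shell environment variables. The values below come from the lists above, using elastic distributed training as the example mode; /gpfs/dlfs1 and the JAVA_HOME path are assumed example values, not required ones:

      ```shell
      #!/bin/sh
      # Sketch of the step 8 parameters as environment variables.
      # /gpfs/dlfs1 is an assumed example value for DLI_SHARED_FS.
      DLI_SHARED_FS=/gpfs/dlfs1

      # Common to single node, distributed, and elastic distributed training
      SPARK_EGO_EXECUTOR_SLOTS_MAX=1
      SPARK_EGO_EXECUTOR_SLOTS_RESERVE=1
      SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=1
      SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1
      SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=6000
      SPARK_EGO_CONF_DIR_EXTRA="${DLI_SHARED_FS}/conf"

      # Elastic distributed training only
      SPARK_EGO_RECLAIM_GRACE_PERIOD=200
      SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1
      SPARK_EGO_ENABLE_PREEMPTION=true
      SPARK_EGO_APP_SCHEDULE_POLICY=fairshare

      # Optional additional environment variables
      PYTHON_VERSION=python2                      # only to use Python 2.7
      JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/  # required for Caffe; example path

      echo "SPARK_EGO_CONF_DIR_EXTRA=${SPARK_EGO_CONF_DIR_EXTRA}"
      ```

      For distributed (non-elastic) training, the lists above instead call for SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2,4, SPARK_EGO_APP_SCHEDULE_POLICY=fifo, and SPARK_EGO_ENABLE_PREEMPTION=false.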
    9. Under Resource Groups and Plans, enable GPU slot allocation and specify the resource group from which resources are allocated to executors in the Spark instance group.

      Make sure that the CPU executors resource group contains all the CPU and GPU executor hosts. If you do not do this, GPU slots are used for the shuffle service.

      • Select a CPU resource group for use by Spark executors (CPU slots).
      • Select the previously created GPU resource group for use by Spark executors (GPU slots). Ensure that you do not select the resource group used by Spark drivers.
    10. Create the Spark instance group by clicking Create and Deploy Instance Group.
  3. Edit the consumer properties of the Spark instance group.
    1. Navigate to Resources > Consumers.
    2. Select the <Spark-instance-group-name>-spark consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to 99999.
    3. Select the <Spark-instance-group-name>-sparkexecutor consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to 99999.
    4. Click Apply.
The instance group is configured and ready to be started.