Create a Spark instance group using the wmla-ig-edt-template-2.3.3 template

Create a Spark instance group using the wmla-ig-edt-template-2.3.3 template. This template is only available with IBM Spectrum Conductor 2.4.0.

  1. Ensure that you have completed setting up your resources; see Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact.
  2. Create an instance group for IBM Spectrum Conductor Deep Learning Impact using the wmla-ig-edt-template-2.3.3 template.
    Attention: The wmla-ig-edt-template-2.3.3 template creates an instance group that is used for running deep learning workloads. To use elastic distributed training, use this template to create a second instance group dedicated to elastic distributed training workloads.
    1. Select the Workload tab and click Spark > Instance Groups.
    2. In the Instance Group List tab, click New.
    3. Click the Templates button to load the wmla-ig-edt-template-2.3.3 template.
    4. Click Use to select and use the wmla-ig-edt-template-2.3.3 template.
    5. Provide a name for the instance group, for example: wml-ig-edt.
    6. Provide a directory for the Spark deployment. The wml-user user must have read, write, and execute permissions to the directory specified and its parent directory.
    7. Set the execution user to wml-user.
    8. Provide a Spark version and configure Spark. By default, the template uses Spark version 2.3.3 and is configured for single node training with Python 3.6. If you change the Spark version, these configurations are lost and must be set manually. If you want to use a different training type or a different Python version, configure the following additional parameters:
      • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
      • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
      • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1.
      • SPARK_EGO_ENABLE_PREEMPTION must be set to true.
      • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fairshare.
      • When using fairshare, make sure that you:
        • Do not disable reclaim for the executor consumers and do not set SPARK_EGO_RECLAIM_GRACE_PERIOD. Use the default IBM Spectrum Conductor reclaim settings for the consumer.
        • Do not change the SPARK_EGO_SLOTS_REQUIRED_TIMEOUT value for a Spark instance group with fairshare.
      • Framework plugins do not support a Spark instance group where SPARK_EGO_APP_SCHEDULE_POLICY is configured for fairshare; they support only Spark instance groups that are configured with fifo.
      In addition to these configurations, also consider the following:
      • By default, the Spark instance group uses Python 3.6. If you want to use Python 2.7, set the PYTHON_VERSION environment variable to python2. To add this environment variable, do the following:
        1. In the parameter drop-down list, select Additional Environment Variables and click Add an Environment Variable.
        2. Set the Name to PYTHON_VERSION and the Value to python2.
        3. Click Save.
      • If you want to set additional Spark parameters, refer to Creating a Spark instance group to use GPUs.
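For reference, the parameter values described in step 8 can be summarized in spark-env style. This is a hedged sketch for review purposes only; set the values through the instance group configuration as described above, not by sourcing a script:

```shell
# Sketch of the elastic distributed training settings listed above.
# These are normally set in the Spark instance group configuration;
# the export form here is for reference only.
export SPARK_EGO_EXECUTOR_SLOTS_MAX=1
export SPARK_EGO_EXECUTOR_SLOTS_RESERVE=1
export SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=1
export SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1
export SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=6000
# Uses the example value DLI_SHARED_FS=/gpfs/dlfs1 from above.
export SPARK_EGO_CONF_DIR_EXTRA=/gpfs/dlfs1/conf
export SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1
export SPARK_EGO_ENABLE_PREEMPTION=true
export SPARK_EGO_APP_SCHEDULE_POLICY=fairshare
# Optional: use Python 2.7 instead of the default Python 3.6.
# export PYTHON_VERSION=python2
```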
    9. Under Consumers, ensure that the Enable impersonation to have Spark applications run as the submission user option is enabled. This option is required for IBM Watson Studio Local and requires the use of LDAP.
    10. Under Resource Groups and Plans, enable GPU slot allocation and specify the resource group from which resources are allocated to executors in the Spark instance group.

      Make sure that the CPU executors resource group contains all the CPU and GPU executor hosts; otherwise, GPU slots are used for the shuffle service.

      1. Select a CPU resource group for use by Spark executors (CPU slots).
      2. Select the previously created GPU resource group for use by Spark executors (GPU slots).
    11. Create the Spark instance group by clicking Create and Deploy Instance Group.
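The deployment directory permissions required in step 6 can be prepared ahead of time. The following is a minimal sketch that uses a hypothetical path (/tmp/wml-ig-edt); substitute your own deployment directory, and note that changing ownership to wml-user requires root privileges:

```shell
# Hypothetical Spark deployment directory; substitute your own path.
DEPLOY_DIR=/tmp/wml-ig-edt

# wml-user needs read, write, and execute permissions on the
# directory and on its parent directory.
mkdir -p "$DEPLOY_DIR"
chown wml-user "$DEPLOY_DIR" 2>/dev/null || true  # needs root; skipped otherwise
chmod u+rwx "$DEPLOY_DIR"

# Verify the permissions on the directory and its parent.
ls -ld "$DEPLOY_DIR" "$(dirname "$DEPLOY_DIR")"
```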
  3. Edit the consumer properties of the Spark instance group.
    1. Navigate to Resources > Consumers.
    2. Select the <Spark-instance-group-name>-sparkexecutor consumer.
      1. Under the Consumer Properties tab, set Reclaim grace period to the maximum value of 120 hours.
    3. Enable exclusive slots at the consumer level (where free slots from the host can be shared and assigned to any number of allocations, but only amongst a select set of consumers within an exclusive consumer group).
      1. Open the $EGO_CONFDIR/ego.conf file for editing.
      2. Set EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y.
      3. Save your changes.
    4. Click Apply.
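The ego.conf change in step 3 above can also be scripted. This is a sketch only; it defaults EGO_CONFDIR to a local illustration path when the variable is not already set in your environment, and appends the setting only if it is not already present:

```shell
# Enable consumer-level exclusive slots in ego.conf.
# EGO_CONFDIR normally points at the cluster configuration directory;
# a local path is used here purely for illustration.
EGO_CONFDIR=${EGO_CONFDIR:-/tmp/ego-demo/conf}
mkdir -p "$EGO_CONFDIR"
CONF="$EGO_CONFDIR/ego.conf"
touch "$CONF"

# Append the setting only if no existing line defines it.
if ! grep -q '^EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=' "$CONF"; then
    echo 'EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y' >> "$CONF"
fi

# Show the resulting setting.
grep EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE "$CONF"
```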
The instance group is configured and ready to be started.