Create a Spark instance group using the wmla-ig-template-2.3.3 template

Create a Spark instance group by using the wmla-ig-template-2.3.3 template. This template is available only with IBM Spectrum Conductor 2.4.0.
Note: The wmla-ig-template-2.3.3 template should be used only in IBM Watson Machine Learning Accelerator environments where IBM Watson Studio Local is installed.

By default, the wmla-ig-template-2.3.3 template enables impersonation so that Spark applications run as the submission user. Impersonation means that the system runs executables under a designated operating system account. Because the template enables impersonation with authentication, Spark applications must be run by LDAP or operating system execution users.
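To confirm that impersonation behaves as expected, you can check which operating system account Spark tasks actually run under. The following is a minimal sketch, not part of the product documentation, that assumes a PySpark session submitted to the instance group; it prints the effective OS user on the executors, which with impersonation enabled should be the submission user rather than a shared service account.

    # Minimal sketch (assumption): report the OS account that Spark
    # executors run tasks under. With impersonation enabled, this is
    # expected to be the submission user.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("impersonation-check").getOrCreate()
    sc = spark.sparkContext

    def current_user(_):
        # Import on the executor so the check runs where the task runs.
        import getpass
        return getpass.getuser()

    users = sc.parallelize(range(sc.defaultParallelism)).map(current_user).distinct().collect()
    print("Executor OS user(s):", users)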

  1. Ensure that you have finished setting up your resources; see Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact.
  2. Create an instance group for IBM Spectrum Conductor Deep Learning Impact using the wmla-ig-template-2.3.3 template.
    Attention: The wmla-ig-template-2.3.3 template creates a Spark instance group that is used for running deep learning workloads. To use elastic distributed training, use the wmla-ig-edt-template-2.3.3 template to create a second instance group for running elastic distributed training workloads.
    1. Select the Workload tab and click Spark > Instance Groups.
    2. In the Instance Group List tab, click New.
    3. Click the Templates button to load the wmla-ig-template-2.3.3 template.
    4. Click Use to select and use the wmla-ig-template-2.3.3 template.
    5. Provide a name for the instance group, for example: wml-ig.
    6. Provide a directory for the Spark deployment. The wml-user user must have read, write, and execute permissions to the directory specified and its parent directory.
    7. Set the execution user to wml-user.
    8. Provide a Spark version and configure Spark. By default, the template uses Spark version 2.3.3 and is configured for single node training with Python 3.6. If you change the Spark version, these configurations are lost and must be configured manually. Similarly, if you want to use a different training type or a different Python version, you must configure the following additional parameters (collected for reference in the sketch that follows this step's substeps):
      • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
      • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
      • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1,2,4.
      • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fifo.
        Note: IBM Watson Machine Learning Accelerator also supports a hierarchical scheduling policy.
      • SPARK_EGO_ENABLE_PREEMPTION must be set to false.
      • SPARK_EGO_SLOTS_REQUIRED_TIMEOUT must be adjusted to a smaller value to ensure that jobs time out in a reasonable amount of time. If this value is too large, jobs competing for resources can remain stuck waiting too long before they are abruptly stopped by the executor.
      • SPARK_EGO_DISTRIBUTED_WORKLOAD_CONFIG_PATH must be set to an existing shared directory with read and write access open to all. At installation time, the installer creates a directory under DLI_SHARED_FS called distrib_workload_config with the appropriate permissions. By default, the wmla-ig-template-2.3.3 template sets this configuration value.
      • By default, the instance group uses Python 3.6. If you want to use Python 2.7, set the PYTHON_VERSION environment variable to python2. To add this environment variable, do the following:
        1. In the parameter drop down, select Additional Environment Variables and click Add an Environment Variable.
        2. Set the Name to PYTHON_VERSION and the Value to python2.
        3. Click Save.
      • If you want to set additional Spark parameters, refer to Creating a Spark instance group to use GPUs.
    9. Under Consumers, verify that the Enable impersonation to have Spark applications run as the submission user option is enabled. This option is required for IBM Watson Studio Local and requires the use of LDAP.
    10. Under Resource Groups and Plans, enable GPU slot allocation and specify the resource group from which resources are allocated to executors in the instance group.

      Make sure that the CPU executors resource group contains all the CPU and GPU executor hosts; otherwise, GPU slots are used for the shuffle service.

      1. Select a CPU resource group for use by Spark executors (CPU slots).
      2. Select the previously created GPU resource group for use by Spark executors (GPU slots).
    11. Create the Spark instance group by clicking Create and Deploy Instance Group.
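    The following is a minimal sketch, not an official interface, that collects the parameter values from step 8 so that a manually configured instance group can be double-checked. The DLI_SHARED_FS value is a placeholder; substitute your deep learning shared file system, for example /gpfs/dlfs1.

      # Minimal sketch (assumption): the required values from step 8,
      # compared against the current environment. Run it where the
      # instance group's environment variables are visible.
      import os

      DLI_SHARED_FS = "/gpfs/dlfs1"  # placeholder: your shared file system

      required = {
          "SPARK_EGO_EXECUTOR_SLOTS_MAX": "1",
          "SPARK_EGO_EXECUTOR_SLOTS_RESERVE": "1",
          "SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX": "1",
          "SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE": "1",
          "SPARK_EGO_EXECUTOR_IDLE_TIMEOUT": "6000",
          "SPARK_EGO_CONF_DIR_EXTRA": DLI_SHARED_FS + "/conf",
          "SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK": "1,2,4",
          "SPARK_EGO_APP_SCHEDULE_POLICY": "fifo",
          "SPARK_EGO_ENABLE_PREEMPTION": "false",
          "SPARK_EGO_DISTRIBUTED_WORKLOAD_CONFIG_PATH": DLI_SHARED_FS + "/distrib_workload_config",
          # Optional: uncomment to use Python 2.7 instead of the default 3.6.
          # "PYTHON_VERSION": "python2",
      }

      for name, expected in required.items():
          actual = os.environ.get(name)
          status = "OK" if actual == expected else "CHECK (found %r)" % actual
          print("%-45s expected %-30s %s" % (name, expected, status))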
  3. Edit the consumer properties of the Spark instance group.
    1. Navigate to Resources > Consumers.
    2. Select the <Spark-instance-group-name>-sparkexecutor consumer.
      1. Under the Consumer Properties tab, set Reclaim grace period to the maximum value of 596 Hours.
    3. For each child consumer belonging to <Spark-instance-group-name>-sparkexecutor (for example, <Spark-instance-group-name>-sparkexecutor0), complete the following:
      1. Under the Consumer Properties tab, set Reclaim grace period to the maximum value of 596 Hours.
    4. Enable exclusive slots at the consumer level, where free slots from a host can be shared and assigned to any number of allocations, but only among a select set of consumers within an exclusive consumer group. A scripted sketch of this edit follows the procedure.
      1. Open the $EGO_CONFDIR/ego.conf file for editing.
      2. Set EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y.
      3. Save your changes.
    5. Click Apply.
The instance group is configured and ready to be started.
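For administrators who script the exclusive slots change from step 3.4, the following is a minimal sketch, an assumption rather than a supported utility, that idempotently sets EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y in the $EGO_CONFDIR/ego.conf file. Back up ego.conf before editing it programmatically.

    # Minimal sketch (assumption): set EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y
    # in ego.conf, replacing an existing entry or appending a new one.
    import os

    conf_path = os.path.join(os.environ["EGO_CONFDIR"], "ego.conf")
    key = "EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE"

    with open(conf_path) as f:
        lines = f.readlines()

    for i, line in enumerate(lines):
        if line.strip().startswith(key):
            lines[i] = key + "=Y\n"
            break
    else:
        lines.append(key + "=Y\n")

    with open(conf_path, "w") as f:
        f.writelines(lines)
    print("Set %s=Y in %s" % (key, conf_path))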