Create a Spark instance group using the wmla-ig-edt-template-2.3.1 template.
Note: The wmla-ig-edt-template-2.3.1 template should only be used in IBM Watson Machine Learning Accelerator environments where IBM Watson Studio Local is installed.
By default, the wmla-ig-edt-template-2.3.1 template enables impersonation settings. Enabling impersonation has Spark applications run as the submission user; that is, the system runs executables under a designated operating system account. Because the wmla-ig-edt-template-2.3.1 template enables impersonation with authentication, Spark applications must be run by LDAP or OS execution users.
- Ensure that you have completed setting up your resources; see Configuring a Spark instance group after installing IBM Spectrum Conductor Deep Learning Impact.
- Create a Spark instance group for IBM Spectrum Conductor Deep Learning Impact using the wmla-ig-edt-template-2.3.1 template.
- Select the Workload tab and click .
- In the Instance Group List tab, click New.
- Click the Templates button to load the wmla-ig-edt-template-2.3.1 template.
Attention: The wmla-ig-edt-template-2.3.1 template creates a Spark instance group that is used for running deep learning workloads. To use elastic distributed training, use this template to create a second Spark instance group dedicated to running elastic distributed training workloads.
- Click Use to select and use the wmla-ig-edt-template-2.3.1 template.
- Provide a name for the Spark instance group, for example: wml-ig-edt.
- Provide a directory for the Spark deployment. The wml-user user must have read, write, and execute permissions on the specified directory and its parent directory.
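As a rough sketch of this permission requirement, assuming a hypothetical deployment path (/var/tmp/wmla-deploy/wml-ig-edt); on a real cluster you would typically chown the directories to wml-user as root instead of opening the mode:

```shell
# Hypothetical deployment directory; substitute your own path.
DEPLOY_DIR=/var/tmp/wmla-deploy/wml-ig-edt

# Create the deployment directory together with its parent.
mkdir -p "$DEPLOY_DIR"

# Grant read, write, and execute on the directory and its parent.
# In production, prefer: chown wml-user "$DEPLOY_DIR" "$(dirname "$DEPLOY_DIR")"
chmod 777 "$DEPLOY_DIR" "$(dirname "$DEPLOY_DIR")"
```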
- Set the execution user to wml-user.
- Provide a Spark version and configure Spark. By default, the template uses Spark version 2.3.1 and is configured for single node training using Python 2.7. If you change the Spark version, these configurations are lost and must be set manually. If you want to use a different training type or a different Python version, you must configure the following additional parameters.
- SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
- SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
- SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
- SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for deep learning, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1.
- SPARK_EGO_ENABLE_PREEMPTION must be set to true.
- SPARK_EGO_APP_SCHEDULE_POLICY must be set to fairshare.
- SPARK_EGO_RECLAIM_GRACE_PERIOD must be set to 200.
- When using fairshare, make sure that you:
- Do not disable reclaim for the executor consumers and do not set SPARK_EGO_RECLAIM_GRACE_PERIOD. Use the default IBM Spectrum Conductor reclaim settings for the consumer.
- Do not change the SPARK_EGO_SLOTS_REQUIRED_TIMEOUT value for a Spark instance group with fairshare.
- Framework plugins do not support a Spark instance group where SPARK_EGO_APP_SCHEDULE_POLICY is configured for fairshare. Framework plugins support only Spark instance groups that are configured with fifo.
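Taken together, the parameter values listed above can be sketched as environment variable assignments. This is only an illustration of the values; in practice they are entered in the Spark instance group configuration UI, and the /gpfs/dlfs1 path is the example shared file system from the step above:

```shell
# Example shared file system path for deep learning (from the example above).
DLI_SHARED_FS=/gpfs/dlfs1

# Elastic distributed training settings for the Spark instance group.
export SPARK_EGO_EXECUTOR_SLOTS_MAX=1
export SPARK_EGO_EXECUTOR_SLOTS_RESERVE=1
export SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=1
export SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1
export SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=6000
export SPARK_EGO_CONF_DIR_EXTRA=${DLI_SHARED_FS}/conf
export SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1
export SPARK_EGO_ENABLE_PREEMPTION=true
export SPARK_EGO_APP_SCHEDULE_POLICY=fairshare
export SPARK_EGO_RECLAIM_GRACE_PERIOD=200
```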
In addition to these configurations, also consider the following:
- By default, the Spark instance group uses Python 2.7. If you want to use Python 3.6, set the PYTHON_VERSION environment variable to python3. To add this environment variable, do the following:
- In the parameter drop-down list, select Additional Environment Variables and click Add an Environment Variable.
- Set the Name to PYTHON_VERSION and the Value to python3.
- Click Save.
- If you want to set additional Spark parameters, refer to Creating a Spark instance group to use GPUs.
- Under Consumers, the Enable impersonation to have Spark applications run as the submission user option is enabled. This option is required for IBM Watson Studio Local and requires the use of LDAP.
- Under Resource Groups and Plans, enable GPU slot allocation and specify the resource group from which resources are allocated to executors in the Spark instance group. Make sure that the CPU executors resource group contains all the CPU and GPU executor hosts; otherwise, GPU slots are used for the shuffle service.
- Select a CPU resource group for use by Spark executors (CPU slots).
- Select the previously created GPU resource group for use by Spark executors (GPU slots).
- Create the Spark instance group by clicking Create and Deploy Instance Group.
- Edit the consumer properties of the Spark instance group.
- Navigate to .
- Select the <Spark-instance-group-name>-sparkexecutor consumer.
- Under the Consumer Properties tab, set Reclaim grace period to the maximum value of 120 Hours.
- Enable exclusive slots at the consumer level (where free slots from the host can be shared and assigned to any number of allocations, but only among a select set of consumers within an exclusive consumer group).
- Open the $EGO_CONFDIR/ego.conf file for editing.
- Set EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y.
- Save your changes.
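The ego.conf edit above can also be scripted. The following sketch assumes a hypothetical EGO_CONFDIR location for illustration; on a real cluster, EGO_CONFDIR is defined by the EGO environment:

```shell
# Hypothetical configuration directory; on a real cluster use $EGO_CONFDIR.
EGO_CONFDIR=/tmp/ego-conf-demo
mkdir -p "$EGO_CONFDIR"
CONF="$EGO_CONFDIR/ego.conf"

# Rewrite the setting if it is already present, otherwise append it.
if grep -q '^EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=' "$CONF" 2>/dev/null; then
  sed -i 's/^EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=.*/EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y/' "$CONF"
else
  echo 'EGO_ENABLE_CONSUMER_LEVEL_EXCLUSIVE=Y' >> "$CONF"
fi
```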
- Click Apply.
The Spark instance group is configured and ready to be started.