Create an instance group
that allocates GPU slots to Spark executors in the instance group.
About this task
Follow these steps to create an instance group that uses GPU resources to run
its applications. This task calls out the steps only for GPU allocation when you create an instance group. For more information on how
to create an instance group, see Creating instance groups.
Procedure
-
In the Basic Settings tab, click the
Configuration link to customize the Spark version properties for the
following GPU parameters. If you do not make changes, the default values are used.
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK
- Specifies a comma-separated list of SPARK_EGO_GPU_SLOTS_PER_TASK values.
When specified, the Spark master service
is scaled up to accommodate (at a minimum) one service instance for the total number of values
specified. A maximum of five values are supported. This parameter does not affect the Spark notebook master instance. With no list
specified (default), the SPARK_EGO_GPU_SLOTS_PER_TASK value takes effect for all Spark master service instances.
To prevent a
Spark master service instance from
becoming stuck while it waits for executors on another Spark master instance to finish, enable fair
share scheduling for executors or select a multidimensional resource plan for executors for the
instance group. For more information,
see Setting consumers and resource groups for an instance group.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX
- Specifies the maximum number of GPU tasks that can run concurrently in one GPU executor. Default
is Integer.MAX_VALUE.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE
- Specifies the minimum number of GPU slots to reserve after the executor starts. The number of
GPU executor slots is the minimum of the reserved number and the owned number. Default is 1. The
timeout for the reserved GPU slots is set by SPARK_EGO_EXECUTOR_IDLE_TIMEOUT.
- SPARK_EGO_GPU_MODE
- Specifies whether the executor requires exclusive or shared GPUs. Valid values are Exclusive
or Shared. Default is Shared.
- SPARK_EGO_GPU_SLOTS_MAX
- Specifies the maximum number of slots that an application can get for GPU tasks in primary mode. Default is
Integer.MAX_VALUE.
- SPARK_EGO_GPU_SLOTS_PER_TASK
- Specifies the number of slots that are allocated to a GPU task, enabling each task to use
multiple slots. The initial number of tasks of an executor times the
SPARK_EGO_GPU_SLOTS_PER_TASK value must be equal to or less than the number of GPUs
on the host where the executor runs. Valid value is a positive integer. Default is 1.
When this
parameter takes effect for GPU scheduling, the number of slots in the egosh alloc
command output equals the number of running GPU tasks times the value of
SPARK_EGO_GPU_SLOTS_PER_TASK.
- SPARK_EGO_SLOTS_REQUIRED_TIMEOUT
- The time, in seconds, to wait for a Spark application to get the required number of slots,
including CPU and GPU slots, before launching tasks. After this time, any slots that are held are
released and the application fails. Default is Integer.MAX_VALUE.
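The slot arithmetic that the parameters above describe can be sketched as follows. This is an illustrative sketch only, not product code; the function names and the sample numbers are invented for this example.

```python
# Illustrative sketch of the GPU slot arithmetic described above.
# Function names and sample values are hypothetical, not part of the product.

def executor_fits_on_host(initial_tasks, slots_per_task, gpus_on_host):
    """The initial number of tasks of an executor times the
    SPARK_EGO_GPU_SLOTS_PER_TASK value must be equal to or less than
    the number of GPUs on the host where the executor runs."""
    return initial_tasks * slots_per_task <= gpus_on_host

def egosh_alloc_slots(running_gpu_tasks, slots_per_task):
    """When SPARK_EGO_GPU_SLOTS_PER_TASK takes effect for GPU scheduling,
    the number of slots in the egosh alloc command output equals the
    number of running GPU tasks times that value."""
    return running_gpu_tasks * slots_per_task

# Example: an executor with 2 initial tasks and 2 slots per task fits
# on a host with 4 GPUs, but not with 4 initial tasks.
print(executor_fits_on_host(2, 2, 4))  # True
print(executor_fits_on_host(4, 2, 4))  # False

# Example: 3 running GPU tasks at 2 slots each show as 6 slots in egosh alloc.
print(egosh_alloc_slots(3, 2))  # 6
```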
-
If your cluster is installed to a shared file system, decide whether you want to enable the
shuffle service for the instance group.
- If you do not enable the shuffle service, set the spark.local.dir
parameter to a shared directory on the file system.
- If you want to enable the shuffle service, enable and configure it. When
you enable the shuffle service, a new consumer is created by default exclusively for the shuffle
service. If you want to change this default consumer, the shuffle service consumer must be
associated with only two resource groups (or a resource plan): one for CPU scheduling and the other
for GPU scheduling. For more information on enabling the shuffle service in a shared file system,
see Enabling and configuring the Spark shuffle service.
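For the case where you do not enable the shuffle service, the spark.local.dir setting in the instance group's Spark configuration might look like the following. The directory path here is illustrative only; use a directory on your own shared file system.

```
# Point Spark local storage at a shared directory (example path):
spark.local.dir  /sharedfs/spark/local
```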
-
Enable GPU slot allocation and specify the resource group (or multidimensional resource plan)
from which resources are allocated to executors in the instance group.
Make sure that the CPU executors resource group contains all the CPU and GPU executor hosts.
Otherwise, GPU slots are used for the shuffle service.
-
In the Resource Groups and Plans section, select the CPU resource group for use by Spark
executors (CPU slots); for example, the CPUrg resource group that you
created.
-
Select the GPU resource group for use by Spark executors (GPU slots); for example, the
GPUrg resource group that you created. Ensure that you do not select the resource group that is
used by Spark drivers.
Results
The instance group is set up
for GPU allocation.
What to do next
- Create and deploy the instance group. After you start the instance group,
GPU slots (in addition to CPU slots) are allocated to Spark executors in the instance group. See Starting instance groups.
- Submit a Spark application that uses GPUs to the instance group. See either Submitting a Spark application with GPU RDD or Submitting a Spark application without GPU RDD.