Enabling adaptive scheduling

Optionally, enable adaptive scheduling for efficient usage of mixed resources (CPU and GPU) in your cluster. With adaptive scheduling, tasks run first on a portion of the cluster's GPU resources. When GPU resources are no longer available, the remaining tasks run on CPU resources.

Before you begin

  • You can enable adaptive scheduling only with certain Spark versions; Spark 1.5.2 is not supported.
  • You require separate resource groups for CPU and GPU hosts. See Using resource groups with GPU hosts.
  • Your instance group must be enabled to use GPU resources for applications. See Enabling GPUs.

About this task

In many applications, especially traditional machine-learning applications and deep-learning frameworks, tasks are convertible between CPU and GPU. With adaptive scheduling, when your cluster is short on GPU resources, CPU resources can help speed up large-scale machine-learning applications and improve overall resource usage.

Procedure

Follow these steps to enable adaptive scheduling for an instance group. This task calls out only the steps to configure adaptive scheduling when you create an instance group. For more information on how to create an instance group, see Creating instance groups.

  1. In the Basic Settings tab, select the Spark version and click the Configuration link.
  2. In the Session Scheduler list, set the SPARK_EGO_GPU_ADAPTIVE_ENABLE parameter to true.
  3. Configure other settings. If you do not customize the settings, the default values are used.
    • Set the SPARK_EGO_GPU_ADAPTIVE_PERCENTAGE parameter to specify the maximum number of tasks in the GPU stage (as a percentage of total GPU tasks) that can be transferred to CPU resources when GPU resources are no longer available. Valid values are 0 - 1 (decimal values are allowed); the default is 0.1. The number of GPU tasks that are actually transferred depends on a combination of factors: it can be zero or any number up to the percentage value of this parameter, but it never exceeds the specified value. Configure this value according to how many CPU slots the application can acquire; the more CPU slots it can get, the higher this value can be.
    • Set the SPARK_EGO_GPU_ADAPTIVE_EST_RATIO parameter to specify the estimated speedup ratio of CPU tasks to GPU tasks. Valid values are integers starting from 1; the default is 5. The speedup ratio is estimated from the average execution time of CPU and GPU tasks (calculated as average_CPU_task_duration/average_GPU_task_duration). If you set this value to 1, for example, the GPU task duration is treated as almost equal to the CPU task duration, which might not yield the best results; to derive the most benefit from GPUs, set this parameter to a larger value rather than a smaller one. To estimate a suitable value for SPARK_EGO_GPU_ADAPTIVE_EST_RATIO, run the application twice (once in GPU mode and once in CPU mode) to measure average_GPU_task_duration and average_CPU_task_duration.

    For initial scheduling, set the SPARK_EGO_GPU_ADAPTIVE_PERCENTAGE parameter as required for your application. If the percentage value is suitable for your application, GPU tasks are transferred to CPU resources. After a portion of GPU and CPU tasks complete, the SPARK_EGO_GPU_ADAPTIVE_EST_RATIO parameter takes effect. The upper boundary of the percentage can be estimated as: percentage < ratio * (#_of_total_GPU_slots / #_of_GPU_tasks), where the result is capped at 1. For a worked example, see the sketch after these steps.

  4. Click Save.
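
For example, this minimal sketch shows one way to estimate both parameters before you configure them. All figures (task durations, slot and task counts) are hypothetical, and the variable names are illustrative only:

    # Estimate SPARK_EGO_GPU_ADAPTIVE_EST_RATIO and the upper boundary for
    # SPARK_EGO_GPU_ADAPTIVE_PERCENTAGE. All values here are hypothetical.

    # Average task durations, measured by running the application twice
    # (once in CPU mode, once in GPU mode):
    average_CPU_task_duration = 50.0  # seconds
    average_GPU_task_duration = 10.0  # seconds

    # Estimated speedup ratio: an integer, starting from 1
    est_ratio = max(1, int(round(average_CPU_task_duration / average_GPU_task_duration)))  # 5

    # Cluster and workload figures:
    total_GPU_slots = 4
    total_GPU_tasks = 100

    # Upper boundary for the percentage parameter, capped at 1:
    boundary = min(1.0, est_ratio * (float(total_GPU_slots) / total_GPU_tasks))  # 0.2
    print("Set SPARK_EGO_GPU_ADAPTIVE_PERCENTAGE below %s" % boundary)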

Results

Applications in the instance group are set up for adaptive scheduling.

What to do next

  1. Deploy the instance group; then, start it. See Starting instance groups.
  2. Submit a Spark application that uses GPUs to the instance group. See either Submitting a Spark application with GPU RDD or Submitting a Spark application without GPU RDD.
    Note: For adaptive scheduling, the SPARK_EGO_WORKLOAD_TYPE environment variable is set internally when the task runs to indicate the workload type (either GPU or CPU). You can use this variable to define different GPU and CPU processing paths in your application task logic. For example:
    import os

    def feature_extractor(path):
      # SPARK_EGO_WORKLOAD_TYPE is set internally for each task to "GPU" or "CPU"
      if os.environ.get("SPARK_EGO_WORKLOAD_TYPE") == "GPU":
        feature = runGPULogical()  # placeholder for your GPU processing logic
      else:
        feature = runCPULogical()  # placeholder for your CPU processing logic
      return feature

    sc.parallelize(...).gpu().map(lambda path: feature_extractor(path)).collect()
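
    Because the same task function can run on either host type, it is generally best to keep the GPU and CPU branches functionally equivalent, so that a task's result does not depend on whether it was scheduled on a GPU or CPU host.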

    After the application is submitted, drill down from the Spark master web UI to monitor task details. Additionally, you can use the Workload Type column in the task list to check whether tasks are running on GPU or CPU hosts.