Required Spark configuration settings

When you create a Spark instance group for IBM Spectrum Conductor™ Deep Learning Impact and specify a Spark version other than 1.6.1, the default configurations in the dli-sig-template template are lost. In that case, you must set the Spark configuration settings yourself and ensure that they are all correct.

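You must specify all configuration settings in Required Spark configuration settings, as well as the settings in one of these sections, as appropriate:

  • Required Spark configuration settings for distributed training
  • Required Spark configuration settings for elastic distributed training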

For information on additional Spark parameters, see Creating a Spark instance group to use GPUs.

Required Spark configuration settings

  • SPARK_EGO_EXECUTOR_SLOTS_MAX = 1
  • SPARK_EGO_EXECUTOR_SLOTS_RESERVE = 1
  • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX = 1
  • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE = 1
  • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT = 6000
  • SPARK_EGO_CONF_DIR_EXTRA = ${DLI_SHARED_FS}/conf

    For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for the deep learning module installation, then SPARK_EGO_CONF_DIR_EXTRA = /gpfs/dlfs1/conf.
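
The list above can be sketched as a small Python helper that derives SPARK_EGO_CONF_DIR_EXTRA from the shared file system path. This is only an illustration: the function name required_base_settings is hypothetical and is not part of the product, and the settings themselves are still entered in the Spark instance group configuration.

```python
def required_base_settings(dli_shared_fs):
    """Return the required base settings as a name -> value mapping.

    dli_shared_fs is the value of DLI_SHARED_FS (for example, /gpfs/dlfs1).
    This helper is a hypothetical illustration, not a product API.
    """
    return {
        "SPARK_EGO_EXECUTOR_SLOTS_MAX": "1",
        "SPARK_EGO_EXECUTOR_SLOTS_RESERVE": "1",
        "SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX": "1",
        "SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE": "1",
        "SPARK_EGO_EXECUTOR_IDLE_TIMEOUT": "6000",
        # Equivalent to ${DLI_SHARED_FS}/conf; a trailing slash is tolerated.
        "SPARK_EGO_CONF_DIR_EXTRA": f"{dli_shared_fs.rstrip('/')}/conf",
    }


if __name__ == "__main__":
    # Print the settings in NAME=value form for review.
    for name, value in required_base_settings("/gpfs/dlfs1").items():
        print(f"{name}={value}")
```

Printing the values this way makes it easier to compare them against what is configured for the Spark instance group.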

Required Spark configuration settings for distributed training

To support distributed training, set the following:
  • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK = 1,2,4
  • SPARK_EGO_APP_SCHEDULE_POLICY = fifo
  • SPARK_EGO_ENABLE_PREEMPTION = false
  • SPARK_EGO_SLOTS_REQUIRED_TIMEOUT: Decrease this value so that jobs time out in a reasonable amount of time. If the value is too large, jobs that compete for resources can be stuck waiting for too long before they are abruptly stopped by the executor.
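
As a rough sketch, the distributed training settings above can be checked with a few lines of Python before a Spark instance group is created. The function name check_distributed is hypothetical, and the sketch assumes the settings dictionary holds only values that were set explicitly. The timeout value of 300 in the usage example is a placeholder, not a recommended value.

```python
DISTRIBUTED_REQUIRED = {
    "SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK": "1,2,4",
    "SPARK_EGO_APP_SCHEDULE_POLICY": "fifo",
    "SPARK_EGO_ENABLE_PREEMPTION": "false",
}
TIMEOUT_KEY = "SPARK_EGO_SLOTS_REQUIRED_TIMEOUT"


def check_distributed(settings):
    """Return a list of problems found in explicitly set settings."""
    problems = []
    for name, expected in DISTRIBUTED_REQUIRED.items():
        actual = settings.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, found {actual!r}")
    # The timeout has no fixed required value; it only needs to be decreased,
    # so this sketch can only flag the case where it was never set.
    if TIMEOUT_KEY not in settings:
        problems.append(f"{TIMEOUT_KEY}: not set; decrease it from the default")
    return problems


if __name__ == "__main__":
    candidate = {
        "SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK": "1,2,4",
        "SPARK_EGO_APP_SCHEDULE_POLICY": "fairshare",  # wrong for this mode
        "SPARK_EGO_ENABLE_PREEMPTION": "false",
        TIMEOUT_KEY: "300",  # placeholder value, not a recommendation
    }
    for problem in check_distributed(candidate):
        print(problem)
```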

Required Spark configuration settings for elastic distributed training

To support elastic distributed training, set the following:
  • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK = 1
  • SPARK_EGO_APP_SCHEDULE_POLICY = fairshare
    Note: Distributed training with auto-scaling can work with both fifo and fairshare; however, fairshare is preferred. When using fairshare, follow these guidelines:
    • Do not disable reclaim for the executor consumers and do not set SPARK_EGO_RECLAIM_GRACE_PERIOD. Use default reclaim settings for the consumer.
    • Do not change the SPARK_EGO_SLOTS_REQUIRED_TIMEOUT value.
  • SPARK_EGO_ENABLE_PREEMPTION = true
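
The elastic distributed training settings and the fairshare guidelines above can be sketched as a similar check. The function name check_elastic is hypothetical, and the sketch assumes that the settings dictionary contains only values that were set explicitly, so the presence of a key means that its default was changed.

```python
# Settings that the fairshare guidelines say to leave at their defaults.
LEAVE_AT_DEFAULT = (
    "SPARK_EGO_RECLAIM_GRACE_PERIOD",
    "SPARK_EGO_SLOTS_REQUIRED_TIMEOUT",
)


def check_elastic(settings):
    """Return a list of problems found in explicitly set settings."""
    problems = []
    if settings.get("SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK") != "1":
        problems.append("SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be 1")
    if settings.get("SPARK_EGO_ENABLE_PREEMPTION") != "true":
        problems.append("SPARK_EGO_ENABLE_PREEMPTION must be true")

    policy = settings.get("SPARK_EGO_APP_SCHEDULE_POLICY")
    if policy not in ("fairshare", "fifo"):
        problems.append("SPARK_EGO_APP_SCHEDULE_POLICY should be fairshare")
    elif policy == "fairshare":
        # With fairshare, reclaim and timeout settings stay at their defaults.
        for name in LEAVE_AT_DEFAULT:
            if name in settings:
                problems.append(f"{name} should not be set with fairshare")
    return problems


if __name__ == "__main__":
    candidate = {
        "SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK": "1",
        "SPARK_EGO_APP_SCHEDULE_POLICY": "fairshare",
        "SPARK_EGO_ENABLE_PREEMPTION": "true",
    }
    print(check_elastic(candidate) or "no problems found")
```

The sketch accepts fifo without comment because the note above allows it, even though fairshare is preferred.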