Installing on an additional management host

Add a management host to your IBM Spectrum Conductor™ Deep Learning Impact 1.2.3 environment.

Before installing IBM Spectrum Conductor Deep Learning Impact, ensure that all system setup is complete: see Set up your system (Manual install) and Configure a system for IBM Spectrum Conductor Deep Learning Impact.

  1. If the cluster has been started, shut it down:
    1. Log in to the host as root.
    2. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
      • For BASH shell, run: source <install_location>/profile.platform
      • For CSH shell, run: source <install_location>/cshrc.platform
    3. Log on to EGO as a cluster administrator:
      egosh user logon -u user_name -x password
      For example:
      egosh user logon -u Admin -x Admin
    4. Shut down all EGO services and daemons on all hosts in the cluster:
      egoshutdown.sh
  2. Log in to the host (root or sudo to root permission).
  3. Copy the installation files to the management host.
  4. Define the IBM Spectrum Conductor Deep Learning Impact properties by setting the following environment variables. If you do not set the optional environment variables, the default values are used. A consolidated example of the required settings follows this list.
    CLUSTERADMIN Required. Set to any valid operating system user account, which then owns all installation files. For example:
    export CLUSTERADMIN=egoadmin
    DLI_SHARED_FS Required. Set to the working shared file system directory used by IBM Spectrum Conductor Deep Learning Impact for configurations, binaries and data. The default is /gpfs/dlfs1. For example:
    export DLI_SHARED_FS=/gpfs/dlfs1
    Notes:
    • Ensure that the cluster administrator has read, write, and execute permissions on this directory.
    • The shared file system directory specified must already exist. Make sure that the directory is already created before exporting this variable.
    • Authority 755 is required.
    DLI_DATA_FS Set a mount point for user dataset storage. You must set the permissions on this file system share so that the DLI workload submission user can read the files from this directory.

    Usage example:

    export DLI_DATA_FS=/dli_data_fs
    DLI_RESULT_FS Required. Set a mount point for user-specific job data storage, which includes datasets, training, tuning, validation, and hyperparameter optimization data. You must set the following ownership and permissions:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
    Usage example:
    export DLI_RESULT_FS=/dli_results_fs
    Notes:
    • The shared file system directory specified must already exist. Make sure that the directory is already created before exporting this variable.
    • Cannot use the same mount point as DLI_SHARED_FS.
    DLI_CONDA_HOME Required. Sets the Anaconda directory that is used by deep learning frameworks.
    Default value is /opt/anaconda3. For example:
    export DLI_CONDA_HOME=/opt/anaconda3
    DLI_MONGODB_PORT Optional. Specifies the MongoDB server port number. The default port number is 27017. For example:
    export DLI_MONGODB_PORT=27017
    DLI_REDIS_PORT Optional. Specifies the Redis server port number. The default port number is 6379. For example:
    export DLI_REDIS_PORT=6379
    DLI_REDIS_PASSWORD Optional. Specifies the Redis password. The default password is dliredispassword. For example:
    export DLI_REDIS_PASSWORD='dliredispassword'
    DLI_DLPD_REST_PORT Optional. Specifies the deep learning REST service port number. The default port number is 9243. For example:
    export DLI_DLPD_REST_PORT=9243
    DLI_INSIGHTS_MONITOR_PORT Optional. Specifies the deep learning monitoring REST service port number. The default port number is 50000. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_MONITOR_PORT=50000
    DLI_INSIGHTS_OPTIMIZER_PORT Optional. Specifies the deep learning optimizer REST service port number. The default port number is 50001. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_OPTIMIZER_PORT=50001
    DLI_RDMA_ENABLE Optional. Enables remote direct memory access (RDMA) for elastic distributed training. The default value is N.
    For example:
    export DLI_RDMA_ENABLE=N

    If you are installing to a shared environment, you must update the profile.elastic file in the $EGO_TOP/dli/1.2.3/dlpd/conf/ directory by adding RDMA_DEVICE_NAME and RDMA_DEVICE_PORT settings for each compute host. For example:

    RDMA_DEVICE_NAME=compute1.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute1.ibm.com:1
    RDMA_DEVICE_NAME=compute2.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute2.ibm.com:1 
    DLI_RDMA_DEVICE_NAME Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device name. For example:
    export DLI_RDMA_DEVICE_NAME=mlx4_0
    DLI_RDMA_DEVICE_PORT Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device port. The port specified must be available and in PORT_ACTIVE state. For example:
    export DLI_RDMA_DEVICE_PORT=1
    DLI_RDMA_BUFFER_SIZE Optional if DLI_RDMA_ENABLE is enabled. Specifies the RDMA buffer size. If not specified, the default value of 1GB is used.
    Note: Make sure to set this value according to your GPU device settings.
    For example:
    export DLI_RDMA_BUFFER_SIZE=32768
    DLI_ELASTIC_COMMIPNETWORK Optional. Specifies the InfiniBand IPoIB network used for elastic distributed training. The default value is the public network.
    export DLI_ELASTIC_COMMIPNETWORK=10.10.10.0/24
    DLI_METRICS_STREAMING_ENABLED Optional. Enables metric streaming in RabbitMQ. Metric streaming provides training metric information for specific training task submission parameters when a training task is submitted. The default value is N.
    export DLI_METRICS_STREAMING_ENABLED=Y
    DLI_JWT_SECRET_KEY Required if using IBM Watson Studio Local with WML Accelerator. Otherwise, optional. Sets the location of the IBM Spectrum Conductor Deep Learning Impact secret key.
    export DLI_JWT_SECRET_KEY=key_file_location
    DLI_EDI_ENABLE Optional. Enables the inference service. The default value is Y. For example:
    export DLI_EDI_ENABLE=Y
    DLI_EDI_MGT_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service management network. For example:
    export DLI_EDI_MGT_NETWORKID=9.21.52.77/24
    DLI_EDI_VIP_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service base virtual IP network that is used for high availability. For example:
    export DLI_EDI_VIP_NETWORKID=9.21.52.250/24
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_VIRTUAL_HOSTNAME Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service virtual hostname used for high availability.
    export DLI_EDI_VIRTUAL_HOSTNAME=vauto.ibm.com
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_LBD_ENABLE Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Enables load balancing for inference service. The default value is Y. For example:
    export DLI_EDI_LBD_ENABLE=Y
    DLI_EDI_LBD_REST_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) REST port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_REST_PORT=9099
    DLI_EDI_LBD_STREAM_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) streaming port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_STREAM_PORT=9099
    DLI_EDI_IMD_REST_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) REST port. The default port number is 12888. For example:
    export DLI_EDI_IMD_REST_PORT=12888
    DLI_EDI_IMD_MGT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) management port. The default port number is 12889. For example:
    export DLI_EDI_IMD_MGT_PORT=12889
    DLI_EDI_IMD_STREAM_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) streaming port. The default port number is 6003. For example:
    export DLI_EDI_IMD_STREAM_PORT=6003
    DLI_EDI_ETCD_PEER_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd management port. The default port number is 2500. For example:
    export DLI_EDI_ETCD_PEER_PORT=2500
    DLI_EDI_ETCD_CLIENT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd client port. The default port number is 2501. For example:
    export DLI_EDI_ETCD_CLIENT_PORT=2501
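
    As a consolidated reference, the following is a minimal BASH sketch of preparing the shared directories and exporting the variables marked Required in this list, before running the installer in the next step. The account, group, and paths are placeholders based on the defaults above; substitute the values for your environment.

    # Placeholder paths and accounts; adjust for your cluster.
    # The directories must already exist before the variables are exported.
    mkdir -p /gpfs/dlfs1 /dli_results_fs
    chown egoadmin /gpfs/dlfs1
    chmod 755 /gpfs/dlfs1
    chown egoadmin:egoadmin /dli_results_fs
    chmod 733 /dli_results_fs
    chmod o+t /dli_results_fs

    # Required variables, using the same placeholder values.
    export CLUSTERADMIN=egoadmin
    export DLI_SHARED_FS=/gpfs/dlfs1
    export DLI_RESULT_FS=/dli_results_fs
    export DLI_CONDA_HOME=/opt/anaconda3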
  5. Run the IBM Spectrum Conductor Deep Learning Impact installer package.
    Important: Make sure that you install IBM Spectrum Conductor Deep Learning Impact to the same directory as the IBM Spectrum Conductor installation.
    Entitled version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power® install
        sudo ./dli-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dli-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
    Evaluation version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
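    For illustration only, a non-default x86 installation of the entitled package might use hypothetical paths such as the following; the prefix and RPM database locations must match those of the existing IBM Spectrum Conductor installation:
      sudo ./dli-1.2.3.0_x86_64.bin --prefix /shared/ibm/spectrumcomputing --dbpath /shared/ibm/rpmdb --quiet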
  6. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
    • For BASH shell, run: source <install_location>/profile.platform
    • For CSH shell, run: source <install_location>/cshrc.platform
  7. As the root user, log in to the cluster as the cluster administrator user:
    egosh user logon -u username -x password
    For example:
    egosh user logon -u Admin -x Admin
  8. Add the management host to the IBM Spectrum Conductor cluster. See Adding a host to a cluster.
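    After the host is added and the cluster is restarted, one way to confirm that the new management host has joined is to list the cluster resources and services as the cluster administrator (assuming the environment is sourced and you are logged on as in steps 6 and 7):
    egosh resource list -l
    egosh service list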