Installing on the first management host (master host)

Follow these steps to install IBM Spectrum Conductor™ Deep Learning Impact 1.2.3 on the master host.

Before installing IBM Spectrum Conductor Deep Learning Impact, ensure that all system setup is complete: see Set up your system (Manual install) and Configure a system for IBM Spectrum Conductor Deep Learning Impact.

  1. If the cluster has been started, shut it down:
    1. Log in to the host as root.
    2. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
      • For BASH shell, run: source <install_location>/profile.platform
      • For CSH shell, run: source <install_location>/cshrc.platform
    3. Log on to EGO as a cluster administrator:
      egosh user logon -u user_name -x password
      For example:
      egosh user logon -u Admin -x Admin
    4. Shut down all EGO services and daemons on all hosts in the cluster:
      egoshutdown.sh
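Taken together, substeps 1.1 through 1.4 amount to the following session; a sketch assuming a BASH shell, the default installation directory, and the default Admin credentials:

```shell
# Run as root on the master host (default install path assumed)
source /opt/ibm/spectrumcomputing/profile.platform

# Log on to EGO as a cluster administrator (replace credentials as needed)
egosh user logon -u Admin -x Admin

# Shut down all EGO services and daemons on all hosts in the cluster
egoshutdown.sh
```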
  2. Log in to the host as root, or as a user who can sudo to root.
  3. Copy the installation files to the master host.
  4. Define the IBM Spectrum Conductor Deep Learning Impact properties by setting the following environment variables. If you do not set the optional environment variables, the default values are used.
    CLUSTERADMIN Required. Set to any valid operating user account, which then owns all installation files. For example:
    export CLUSTERADMIN=egoadmin
    DLI_SHARED_FS Required. Set to the working shared file system directory used by IBM Spectrum Conductor Deep Learning Impact for configurations, binaries and data. The default is /gpfs/dlfs1. For example:
    export DLI_SHARED_FS=/gpfs/dlfs1
    Notes:
    • Ensure that the cluster administrator has read, write, and execute permissions on this directory.
    • The specified shared file system directory must already exist before you export this variable.
    • Permissions of 755 are required on this directory.
    DLI_DATA_FS Set a mount point for user dataset storage. You must set the permission for this file system share such that the DLI workload submission user can read the files from this directory.

    Usage example:

    export DLI_DATA_FS=/dli_data_fs
    DLI_RESULT_FS Mandatory. Set a mount point for user-specific job data storage, which includes datasets, training, tuning, validation, and hyperparameter optimization. You must set the following ownership and permissions:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
    Usage example:
    export DLI_RESULT_FS=/dli_result_fs
    Notes:
    • The specified shared file system directory must already exist before you export this variable.
    • DLI_RESULT_FS cannot use the same mount point as DLI_SHARED_FS.
    DLI_CONDA_HOME Mandatory. Sets the Anaconda directory that is used by deep learning frameworks.
    Default value is /opt/anaconda3. For example:
    export DLI_CONDA_HOME=/opt/anaconda3
    DLI_MONGODB_PORT Optional. Specifies the MongoDB server port number. The default port number is 27017. For example:
    export DLI_MONGODB_PORT=27017
    DLI_REDIS_PORT Optional. Specifies the Redis server port number. The default port number is 6379. For example:
    export DLI_REDIS_PORT=6379
    DLI_REDIS_PASSWORD Optional. Specifies the Redis password. The default password is dliredispassword. For example:
    export DLI_REDIS_PASSWORD='dliredispassword'
    DLI_DLPD_REST_PORT Optional. Specifies the deep learning REST service port number. The default port number is 9243. For example:
    export DLI_DLPD_REST_PORT=9243
    DLI_INSIGHTS_MONITOR_PORT Optional. Specifies the deep learning monitoring REST service port number. The default port number is 50000. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_MONITOR_PORT=50000
    DLI_INSIGHTS_OPTIMIZER_PORT Optional. Specifies the deep learning optimizer REST service port number. The default port number is 50001. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_OPTIMIZER_PORT=50001
    DLI_RDMA_ENABLE Optional. Enables remote direct memory access (RDMA) for elastic distributed training. The default value is N.
    For example:
    export DLI_RDMA_ENABLE=N

    If installing to a shared environment, you must update the profile.elastic file in the $EGO_TOP/dli/1.2.3/dlpd/conf/ directory by adding the RDMA_DEVICE_NAME and RDMA_DEVICE_PORT settings for each compute host. For example:

    RDMA_DEVICE_NAME=compute1.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute1.ibm.com:1
    RDMA_DEVICE_NAME=compute2.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute2.ibm.com:1 
    DLI_RDMA_DEVICE_NAME Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device name. For example:
    export DLI_RDMA_DEVICE_NAME=mlx4_0
    DLI_RDMA_DEVICE_PORT Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device port. The port specified must be available and in PORT_ACTIVE state. For example:
    export DLI_RDMA_DEVICE_PORT=1
    DLI_RDMA_BUFFER_SIZE Optional if DLI_RDMA_ENABLE is enabled. Specifies the RDMA buffer size. If not specified, the default value of 1GB is used.
    Note: Make sure to set this value according to your GPU device settings.
    For example:
    export DLI_RDMA_BUFFER_SIZE=32768
    DLI_ELASTIC_COMMIPNETWORK Optional. Specifies the InfiniBand IPoIB network used for elastic distributed training. The default value is the public network. For example:
    export DLI_ELASTIC_COMMIPNETWORK=10.10.10.0/24
    DLI_METRICS_STREAMING_ENABLED Optional. Enables metric streaming in RabbitMQ. Metric streaming provides training metric information for specific training task submission parameters when a training task is submitted. The default value is N. For example:
    export DLI_METRICS_STREAMING_ENABLED=Y
    DLI_JWT_SECRET_KEY Required if using IBM Watson Studio Local with WML Accelerator; otherwise, optional. Sets the location of the IBM Spectrum Conductor Deep Learning Impact secret key. For example:
    export DLI_JWT_SECRET_KEY=key_file_location
    DLI_EDI_ENABLE Optional. Enables the inference service. The default value is Y. For example:
    export DLI_EDI_ENABLE=Y
    DLI_EDI_MGT_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service management network. For example:
    export DLI_EDI_MGT_NETWORKID=9.21.52.77/24
    DLI_EDI_VIP_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies inference service base virtual IP network that is used for high availability. For example:
    export DLI_EDI_VIP_NETWORKID=9.21.52.250/24
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_VIRTUAL_HOSTNAME Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service virtual hostname used for high availability. For example:
    export DLI_EDI_VIRTUAL_HOSTNAME=vauto.ibm.com
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_LBD_ENABLE Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Enables load balancing for inference service. The default value is Y. For example:
    export DLI_EDI_LBD_ENABLE=Y
    DLI_EDI_LBD_REST_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) REST port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_REST_PORT=9099
    DLI_EDI_LBD_STREAM_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) streaming port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_STREAM_PORT=9099
    DLI_EDI_IMD_REST_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) REST port. The default port number is 12888. For example:
    export DLI_EDI_IMD_REST_PORT=12888
    DLI_EDI_IMD_MGT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) management port. The default port number is 12889. For example:
    export DLI_EDI_IMD_MGT_PORT=12889
    DLI_EDI_IMD_STREAM_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) streaming port. The default port number is 6003. For example:
    export DLI_EDI_IMD_STREAM_PORT=6003
    DLI_EDI_ETCD_PEER_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd peer port. The default port number is 2500. For example:
    export DLI_EDI_ETCD_PEER_PORT=2500
    DLI_EDI_ETCD_CLIENT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd client port. The default port number is 2501. For example:
    export DLI_EDI_ETCD_CLIENT_PORT=2501
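Before running the installer, the required variables above can be exported together. A minimal sketch using the example values from the table (adjust every path and account name for your site):

```shell
# Required settings (example values from the table above; adjust for your site)
export CLUSTERADMIN=egoadmin            # OS account that will own the installation files
export DLI_SHARED_FS=/gpfs/dlfs1        # shared working directory; must already exist with 755 permissions
export DLI_RESULT_FS=/dli_result_fs     # job result storage; must not share a mount point with DLI_SHARED_FS
export DLI_CONDA_HOME=/opt/anaconda3    # Anaconda directory used by the deep learning frameworks
```

Optional variables keep their documented defaults unless you export them explicitly.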
  5. Run the IBM Spectrum Conductor Deep Learning Impact installer package.
    Important: Make sure that you install IBM Spectrum Conductor Deep Learning Impact to the same directory as the IBM Spectrum Conductor installation.
    Entitled version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power® install
        sudo ./dli-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dli-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
    Evaluation version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
  6. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
    • For BASH shell, run: source <install_location>/profile.platform
    • For CSH shell, run: source <install_location>/cshrc.platform
  7. Entitle IBM Spectrum Conductor Deep Learning Impact.
    1. Log in to your master host as the operating system cluster administrator user:
      su - username
      where username is the operating system cluster administrator (for example: egoadmin).
    2. Source the environment.
      • For BASH shell, run:
        source $EGO_TOP/profile.platform
      • For CSH shell, run:
        source $EGO_TOP/cshrc.platform
    3. Join the host to the cluster:
      egoconfig join master_host
    4. Set IBM Spectrum Conductor Deep Learning Impact entitlement for the IBM Spectrum Conductor cluster.
      egoconfig setentitlement entitlement-file-path
      where entitlement-file-path is the path to your IBM Spectrum Conductor Deep Learning Impact entitlement file.
    5. If IBM Spectrum Conductor is already configured with a shared directory for failover, run:
      egoconfig mghost shared_directory
    6. Start the IBM Spectrum Conductor cluster.
      sudo egosh ego start
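Substeps 7.1 through 7.6 can be sketched as one session; master_host, entitlement-file-path, and shared_directory are placeholders that you replace with your own values:

```shell
# As the cluster administrator OS user (for example: su - egoadmin)
source $EGO_TOP/profile.platform                 # BASH; use cshrc.platform for CSH

egoconfig join master_host                       # join the host to the cluster
egoconfig setentitlement entitlement-file-path   # apply the DLI entitlement

# Only if IBM Spectrum Conductor is already configured with a shared failover directory:
# egoconfig mghost shared_directory

sudo egosh ego start                             # start the cluster
```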
  8. As the root user, log in to the cluster as the cluster administrator user:
    egosh user logon -u username -x password
    For example:
    egosh user logon -u Admin -x Admin
  9. Verify entitlement information.
    egosh entitlement info
    If IBM Spectrum Conductor Deep Learning Impact is entitled correctly, the following output is displayed:
    EGO                                               : Entitled
    IBM Spectrum Conductor                            : Entitled
    IBM Spectrum Conductor Deep Learning Impact       : Entitled
  10. Verify that the IBM Spectrum Conductor Deep Learning Impact services (dlpd, dlinsights-monitor, dlinsights-optimizer, redis, and mongod) are started:
    egosh service list -l |grep -E "(dlpd|dlinsights*|mongod|redis)"
    If any services are not in the STARTED state, restart the service:
    egosh service stop service_name 
    egosh service start service_name
    where service_name is the name of the service (for example, dlpd).
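The check-and-restart in step 10 can be scripted; a sketch that assumes the service name and its state appear as the first two columns of `egosh service list -l` output (verify the column layout on your cluster before relying on it):

```shell
# Restart any IBM Spectrum Conductor Deep Learning Impact service
# that is not in STARTED state. Assumes you are logged on to EGO and
# that `egosh service list -l` prints columns: SERVICE STATE ...
for svc in dlpd dlinsights-monitor dlinsights-optimizer redis mongod; do
  state=$(egosh service list -l | awk -v s="$svc" '$1 == s { print $2 }')
  if [ "$state" != "STARTED" ]; then
    egosh service stop "$svc"
    egosh service start "$svc"
  fi
done
```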
  11. Once services are started, set permissions for DLI_DATA_FS and DLI_RESULT_FS. Ensure that users who submit deep learning workloads can read files from DLI_DATA_FS.
    1. Adjust permissions of DLI_DATA_FS:
      chmod -R 555 $DLI_DATA_FS
    2. Set the following:
      chown ${CLUSTERADMIN}:${CLUSTERADMIN_USERGROUP} $DLI_RESULT_FS
      chmod 733 $DLI_RESULT_FS
      chmod o+t $DLI_RESULT_FS
  12. If IBM Watson Studio Local is installed with a RabbitMQ server enabled for SSL with peer verification, complete the following steps. Note that this requires DLI_METRICS_STREAMING_ENABLED=Y to have been set at installation.
    1. Verify that the RabbitMQ server is enabled for SSL with peer verification, with the following configuration:
      ssl_options.verify = verify_peer
      ssl_options.fail_if_no_peer_cert = true
    2. Create a keystore in PKCS12 format that includes the dlpd server's private key and public certificate. For example:
      keytool -genkeypair -keystore rmq_client.p12 -storetype PKCS12 \
      -storepass MY_PASSWORD -alias KEYSTORE_ENTRY -keyalg RSA -keysize 2048 \
      -dname "CN=My SSL Certificate, OU=My Team, O=My Company, L=My City, ST=My State, C=SA" \
      -ext san=dns:mydomain.com,dns:localhost,ip:127.0.0.1
    3. Create a trust keystore using the RMQ server's certificate.
      keytool -import -alias server1 -file rmq_server_certificate.pem -keystore ./rabbitstore
    4. Edit the dlpd.conf configuration file and configure the following parameters:
      "RMQ_PEER_VERIFY": "true",
      "RMQ_CLIENT_KEYSTORE_PASSPHRASE": "<the pkcs12 format keystore password>",
      "RMQ_CLIENT_KEYSTORE_FILE": "<full file path for the key store file, pkcs12 format >",
      "RMQ_CLIENT_TRUSTSTORE_PASSPHRASE": "<your truststore password>",
      "RMQ_CLIENT_TRUSTSTORE_FILE": "<full file path for the trust store file>",
    5. Restart the dlpd service.
      egosh service stop dlpd
      sleep 5
      egosh service start dlpd

If you want to enable high availability, set up master failover.