Installing the elastic distributed inference package

Install the elastic distributed inference package on the master host and on all compute hosts.

Note: You must install the elastic distributed inference package on the master host before you install it on any compute host.
Before installing the elastic distributed inference package, ensure the following:
  • Obtain and install interim fix 226695 before you install the elastic distributed inference package.
  • Obtain the elastic distributed inference package (dliedi-1.2.3.0.x86_64.rpm or dliedi-1.2.3.0.ppc64le.rpm) from IBM Fix Central.
  • High availability for elastic distributed inference is implemented by using the EGO service virtual IP and is installed in the shared directory for IBM Spectrum Conductor Deep Learning Impact. To ensure that high availability is available for elastic distributed inference, you must ensure the following:
    • A virtual IP address is prepared for the management services (imd, lbd, and etcd).
    • A hostname is configured to use the virtual IP address. The hostname must have the same domain name as the other hosts in the cluster.
    • If SSL is enabled, every host has an entry in /etc/hosts that maps the virtual IP address to the virtual hostname.
  • If a virtual IP is not specified during installation and the cluster has more than one network card, it is recommended that you specify which network card to use after installation. If no network card is specified, the first network card is used by default.
    After installation, to specify which network card is used, update the $EGO_TOP/dlim/conf/etcd.conf file and specify the NETWORKID and HOSTIP to be used for elastic distributed inference. For example:
    NETWORKID=network-IP/24
    HOSTIP=network-IP
    where network-IP is the network IP address that you want to use for elastic distributed inference.
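If SSL is enabled with a virtual hostname, the /etc/hosts mapping described above can be sketched as follows. The IP address and hostname are the examples used on this page; substitute your own values. This is a host-configuration fragment that requires root access, so run it only on a prepared cluster host.

```shell
# Example values from this page; substitute your own virtual IP
# address and virtual hostname.
echo "9.21.52.250 vauto.ibm.com" | sudo tee -a /etc/hosts

# Confirm that the virtual hostname resolves on this host.
getent hosts vauto.ibm.com
```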
To install elastic distributed inference, complete the following steps on each host.
  1. Log in to the host as root, or as a user with sudo permission to root.
  2. Source the environment. For example:
    source /opt/ibm/spectrumcomputing/profile.platform
  3. Copy the elastic distributed inference package files to the host.
  4. Set environment variables. You must set the same variables that you used for the installation of IBM Spectrum Conductor Deep Learning Impact and any additional variables that enable and define elastic distributed inference properties.
    EDI_SHARED_FS Required. Set to the working shared file system directory that is used by elastic distributed inference. If not specified, the directory specified by DL_NFS_PATH (in the dlpd.conf file) is used by default. It is recommended that you use the default; otherwise, you must ensure the following for the directory:
    • Ensure that the cluster administrator has read, write, and execute permissions on this directory.
    • Compute hosts must use the same directory as management hosts.
    • The shared file system directory must already exist before you export this variable.
    • The directory must have 755 permissions.
    For example:
    export EDI_SHARED_FS=/gpfs/dlfs1
    Note: For the test functionality to be available from the cluster management console, you must set EDI_SHARED_FS to the same value as DLI_SHARED_FS. Otherwise, the test functionality is disabled.
    EDI_CONDA_HOME Required. Sets the Anaconda directory that is used by elastic distributed inference for deep learning frameworks. If not specified, the directory specified by DLI_CONDA_HOME (in the dlpd.conf file) is used by default. It is recommended that you use the default; otherwise, you must ensure that the compute hosts use the same directory as the management hosts. For example:
    export EDI_CONDA_HOME=/opt/anaconda3
    DISABLESSL Set to Y to disable SSL. The default value is N (SSL enabled).
    Note: This value must match your IBM Spectrum Conductor installation.
    For example:
    export DISABLESSL=N
    DLI_EDI_MGT_NETWORKID Specifies the inference service management network. For example:
    export DLI_EDI_MGT_NETWORKID=9.21.52.77/24
    DLI_EDI_VIP_NETWORKID Specifies the inference service virtual IP network that is used for high availability. For example:
    export DLI_EDI_VIP_NETWORKID=9.21.52.250/24
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_VIRTUAL_HOSTNAME Specifies the inference service virtual hostname that is used for high availability. For example:
    export DLI_EDI_VIRTUAL_HOSTNAME=vauto.ibm.com
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_LBD_REST_PORT Specifies the load balancing daemon (LBD) REST port for the inference service. The default port number is 9000. For example:
    export DLI_EDI_LBD_REST_PORT=9000
    DLI_EDI_LBD_STREAM_PORT Specifies the load balancing daemon (LBD) streaming port for the inference service. The default port number is 9010. For example:
    export DLI_EDI_LBD_STREAM_PORT=9010
    DLI_EDI_IMD_REST_PORT Specifies the inference management daemon (IMD) REST port. The default port number is 8888. For example:
    export DLI_EDI_IMD_REST_PORT=8888
    DLI_EDI_IMD_MGT_PORT Specifies the inference management daemon (IMD) management port. The default port number is 8889. For example:
    export DLI_EDI_IMD_MGT_PORT=8889
    DLI_EDI_IMD_STREAM_PORT Specifies the inference management daemon (IMD) streaming port. The default port number is 8890. For example:
    export DLI_EDI_IMD_STREAM_PORT=8890
    DLI_EDI_ETCD_PEER_PORT Specifies the etcd management port. The default port number is 2380. For example:
    export DLI_EDI_ETCD_PEER_PORT=2380
    DLI_EDI_ETCD_CLIENT_PORT Specifies the etcd client port. The default port number is 2379. For example:
    export DLI_EDI_ETCD_CLIENT_PORT=2379
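    As a sketch, the variables described above can be exported together before step 5. The values shown are the examples from this page; replace them with the values for your cluster.

    ```shell
    # Example values from this page; substitute your own paths,
    # addresses, and ports.
    export EDI_SHARED_FS=/gpfs/dlfs1        # shared file system (must exist, 755)
    export EDI_CONDA_HOME=/opt/anaconda3    # Anaconda directory
    export DISABLESSL=N                     # must match the IBM Spectrum Conductor setting
    export DLI_EDI_MGT_NETWORKID=9.21.52.77/24
    export DLI_EDI_VIP_NETWORKID=9.21.52.250/24   # set with DLI_EDI_VIRTUAL_HOSTNAME for HA
    export DLI_EDI_VIRTUAL_HOSTNAME=vauto.ibm.com
    export DLI_EDI_LBD_REST_PORT=9000
    export DLI_EDI_LBD_STREAM_PORT=9010
    export DLI_EDI_IMD_REST_PORT=8888
    export DLI_EDI_IMD_MGT_PORT=8889
    export DLI_EDI_IMD_STREAM_PORT=8890
    export DLI_EDI_ETCD_PEER_PORT=2380
    export DLI_EDI_ETCD_CLIENT_PORT=2379
    ```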
  5. Install the elastic distributed inference package. The elastic distributed inference package must be installed to the same installation location as IBM Spectrum Conductor Deep Learning Impact.
    For x86:
    rpm -ivh --prefix $EGO_TOP --dbpath $DB_PATH dliedi-1.2.3.0.x86_64.rpm
    For Power:
    rpm -ivh --prefix $EGO_TOP --dbpath $DB_PATH dliedi-1.2.3.0.ppc64le.rpm
    where:
    • --prefix $EGO_TOP specifies the absolute path to the installation directory. The --prefix parameter is optional. If IBM Spectrum Conductor Deep Learning Impact was installed with the --prefix option, then you must specify the same path that was used during the installation of IBM Spectrum Conductor Deep Learning Impact. If you install without the --prefix option, the default path /opt/ibm/spectrumcomputing is used.
    • --dbpath $DB_PATH sets the path to the RPM database directory. The --dbpath parameter is optional. If IBM Spectrum Conductor Deep Learning Impact was installed with the --dbpath option, then you must specify the same path that was used during the installation of IBM Spectrum Conductor Deep Learning Impact.
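    The install command in step 5 can be sketched as follows. This sketch selects the package that matches the host architecture and adds --dbpath only when DB_PATH is set; the default EGO_TOP path is the one stated on this page and applies only when EGO_TOP is unset.

    ```shell
    # Sketch: build the install command, reusing the same --prefix and
    # --dbpath that were used for the base IBM Spectrum Conductor Deep
    # Learning Impact installation.
    EGO_TOP=${EGO_TOP:-/opt/ibm/spectrumcomputing}   # default install path per this page
    ARCH=$(uname -m)                                 # x86_64 or ppc64le
    CMD="rpm -ivh --prefix $EGO_TOP"
    if [ -n "$DB_PATH" ]; then
        CMD="$CMD --dbpath $DB_PATH"
    fi
    CMD="$CMD dliedi-1.2.3.0.$ARCH.rpm"
    echo "$CMD"   # review the command, then run it as root
    ```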
  6. Restart EGO services.
    egosh service stop all
    egosh ego shutdown all
    egosh ego start all
  7. Log in to the cluster management console to verify that elastic distributed inference is available.
    1. Log in to the cluster management console at https://<webserver_hostname>:8443 (or http://<webserver_hostname>:8080 if SSL is disabled).
    2. Navigate to Workload > Spark > Deep Learning.
    3. Verify that Elastic Distributed Inference is one of the available tabs.
Elastic distributed inference is installed.