Installing on the first management host (master host)

Follow these steps to install IBM Spectrum Conductor™ Deep Learning Impact 1.2.3 on the master host.

Before installing IBM Spectrum Conductor Deep Learning Impact, ensure that all system setup is complete: see Set up your system (Manual install) and Configure a system for IBM Spectrum Conductor Deep Learning Impact.

  1. If the cluster has been started, shut it down:
    1. Log in to the host as root.
    2. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
      • For BASH shell, run: source <install_location>/profile.platform
      • For CSH shell, run: source <install_location>/cshrc.platform
    3. Log on to EGO as a cluster administrator:
      egosh user logon -u user_name -x password
      For example:
      egosh user logon -u Admin -x Admin
    4. Shut down all EGO services and daemons on all hosts in the cluster:
      egoshutdown.sh
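Taken together, substeps 1.1 through 1.4 amount to the following session; a sketch assuming a BASH shell, the default installation directory, and the default Admin credentials:

```shell
# Run as root on the master host (default install path assumed)
source /opt/ibm/spectrumcomputing/profile.platform

# Log on to EGO as a cluster administrator (replace credentials as needed)
egosh user logon -u Admin -x Admin

# Shut down all EGO services and daemons on all hosts in the cluster
egoshutdown.sh
```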
  2. Log in to the host as root, or as a user who can sudo to root.
  3. Copy the installation files to the master host.
  4. Define the IBM Spectrum Conductor Deep Learning Impact properties by setting the following environment variables. If you do not set the optional environment variables, the default values are used.
    CLUSTERADMIN Required. Set to any valid operating user account, which then owns all installation files. For example:
    export CLUSTERADMIN=egoadmin
    DLI_SHARED_FS Required. Set to the working shared file system directory used by IBM Spectrum Conductor Deep Learning Impact for configurations, binaries and data. The default is /gpfs/dlfs1. For example:
    export DLI_SHARED_FS=/gpfs/dlfs1
    Notes:
    • Ensure that the cluster administrator has read, write, and execute permissions on this directory.
    • The specified shared file system directory must already exist before you export this variable.
    • Permissions of 755 are required on this directory.
    DLI_DATA_FS Set a mount point for user dataset storage. You must set the permission for this file system share such that the DLI workload submission user can read the files from this directory.

    Usage example:

    export DLI_DATA_FS=/dli_data_fs
    DLI_RESULT_FS Mandatory. Set a mount point for user-specific job data storage, which includes datasets, training, tuning, validation, and hyperparameter optimization. You must set the following ownership and permissions:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
    Usage example:
    export DLI_RESULT_FS=/dli_result_fs
    Notes:
    • The specified shared file system directory must already exist before you export this variable.
    • DLI_RESULT_FS cannot use the same mount point as DLI_SHARED_FS.
    DLI_CONDA_HOME Mandatory. Sets the Anaconda directory that is used by deep learning frameworks.
    Default value is /opt/anaconda3. For example:
    export DLI_CONDA_HOME=/opt/anaconda3
    DLI_MONGODB_PORT Optional. Specifies the MongoDB server port number. The default port number is 27017. For example:
    export DLI_MONGODB_PORT=27017
    DLI_REDIS_PORT Optional. Specifies the Redis server port number. The default port number is 6379. For example:
    export DLI_REDIS_PORT=6379
    DLI_REDIS_PASSWORD Optional. Specifies the Redis password. The default password is dliredispassword. For example:
    export DLI_REDIS_PASSWORD='dliredispassword'
    DLI_DLPD_REST_PORT Optional. Specifies the deep learning REST service port number. The default port number is 9243. For example:
    export DLI_DLPD_REST_PORT=9243
    DLI_INSIGHTS_MONITOR_PORT Optional. Specifies the deep learning monitoring REST service port number. The default port number is 50000. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_MONITOR_PORT=50000
    DLI_INSIGHTS_OPTIMIZER_PORT Optional. Specifies the deep learning optimizer REST service port number. The default port number is 50001. Make sure to choose an open or unused port. If you specify a different port, you also need to update the firewall with that port. For example:
    export DLI_INSIGHTS_OPTIMIZER_PORT=50001
    DLI_RDMA_ENABLE Optional. Enables remote direct memory access (RDMA) for elastic distributed training. The default value is N.
    For example:
    export DLI_RDMA_ENABLE=N

    If installing to a shared environment, you must update the profile.elastic file in the $EGO_TOP/dli/1.2.3/dlpd/conf/ directory by adding the RDMA_DEVICE_NAME and RDMA_DEVICE_PORT settings for each compute host. For example:

    RDMA_DEVICE_NAME=compute1.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute1.ibm.com:1
    RDMA_DEVICE_NAME=compute2.ibm.com:mlx4_0
    RDMA_DEVICE_PORT=compute2.ibm.com:1 
    DLI_RDMA_DEVICE_NAME Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device name. For example:
    export DLI_RDMA_DEVICE_NAME=mlx4_0
    DLI_RDMA_DEVICE_PORT Required if DLI_RDMA_ENABLE is enabled. Specifies the RDMA device port. The port specified must be available and in PORT_ACTIVE state. For example:
    export DLI_RDMA_DEVICE_PORT=1
    DLI_RDMA_BUFFER_SIZE Optional if DLI_RDMA_ENABLE is enabled. Specifies the RDMA buffer size. If not specified, the default value of 1GB is used.
    Note: Make sure to set this value according to your GPU device settings.
    For example:
    export DLI_RDMA_BUFFER_SIZE=32768
    DLI_ELASTIC_COMMIPNETWORK Optional. Specifies the InfiniBand IPoIB network used for elastic distributed training. The default value is the public network. For example:
    export DLI_ELASTIC_COMMIPNETWORK=10.10.10.0/24
    DLI_METRICS_STREAMING_ENABLED Optional. Enables metric streaming in RabbitMQ. Metric streaming provides training metric information for specific training task submission parameters when a training task is submitted. The default value is N. For example:
    export DLI_METRICS_STREAMING_ENABLED=Y
    DLI_JWT_SECRET_KEY Required if using IBM Watson Studio Local with WML Accelerator; otherwise, optional. Sets the location of the IBM Spectrum Conductor Deep Learning Impact secret key. For example:
    export DLI_JWT_SECRET_KEY=key_file_location
    DLI_EDI_ENABLE Optional. Enables the inference service. The default value is Y. For example:
    export DLI_EDI_ENABLE=Y
    DLI_EDI_MGT_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service management network. For example:
    export DLI_EDI_MGT_NETWORKID=9.21.52.77/24
    DLI_EDI_VIP_NETWORKID Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies inference service base virtual IP network that is used for high availability. For example:
    export DLI_EDI_VIP_NETWORKID=9.21.52.250/24
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_VIRTUAL_HOSTNAME Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference service virtual hostname used for high availability. For example:
    export DLI_EDI_VIRTUAL_HOSTNAME=vauto.ibm.com
    Note: For inference service to have high availability, both DLI_EDI_VIP_NETWORKID and DLI_EDI_VIRTUAL_HOSTNAME must be specified.
    DLI_EDI_LBD_ENABLE Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Enables load balancing for inference service. The default value is Y. For example:
    export DLI_EDI_LBD_ENABLE=Y
    DLI_EDI_LBD_REST_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) REST port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_REST_PORT=9099
    DLI_EDI_LBD_STREAM_PORT Optional if DLI_EDI_LBD_ENABLE is enabled, otherwise not required. Specifies the load balancing daemon (LBD) streaming port for the inference service. The default port number is 9099. For example:
    export DLI_EDI_LBD_STREAM_PORT=9099
    DLI_EDI_IMD_REST_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) REST port. The default port number is 12888. For example:
    export DLI_EDI_IMD_REST_PORT=12888
    DLI_EDI_IMD_MGT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) management port. The default port number is 12889. For example:
    export DLI_EDI_IMD_MGT_PORT=12889
    DLI_EDI_IMD_STREAM_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the inference management daemon (IMD) streaming port. The default port number is 6003. For example:
    export DLI_EDI_IMD_STREAM_PORT=6003
    DLI_EDI_ETCD_PEER_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd peer port. The default port number is 2500. For example:
    export DLI_EDI_ETCD_PEER_PORT=2500
    DLI_EDI_ETCD_CLIENT_PORT Optional if DLI_EDI_ENABLE is enabled, otherwise not required. Specifies the etcd client port. The default port number is 2501. For example:
    export DLI_EDI_ETCD_CLIENT_PORT=2501
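Before running the installer, the required variables above can be exported together. A minimal sketch using the example values from the table (adjust every path and account name for your site):

```shell
# Required settings (example values from the table above; adjust for your site)
export CLUSTERADMIN=egoadmin            # OS account that will own the installation files
export DLI_SHARED_FS=/gpfs/dlfs1        # shared working directory; must already exist with 755 permissions
export DLI_RESULT_FS=/dli_result_fs     # job result storage; must not share a mount point with DLI_SHARED_FS
export DLI_CONDA_HOME=/opt/anaconda3    # Anaconda directory used by the deep learning frameworks
```

Optional variables keep their documented defaults unless you export them explicitly.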
  5. Run the IBM Spectrum Conductor Deep Learning Impact installer package.
    Important: Make sure that you install IBM Spectrum Conductor Deep Learning Impact to the same directory as the IBM Spectrum Conductor installation.
    Entitled version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power® install
        sudo ./dli-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dli-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dli-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
    Evaluation version
    • If IBM Spectrum Conductor was installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --quiet
    • If IBM Spectrum Conductor was not installed with default settings, run this command:
      • Power install
        sudo ./dlieval-1.2.3.0_ppc64le.bin --prefix install_location --dbpath dbpath_location --quiet
      • x86 install
        sudo ./dlieval-1.2.3.0_x86_64.bin --prefix install_location --dbpath dbpath_location --quiet
      • --prefix install_location specifies the absolute path to the installation directory. The --prefix parameter is optional. If you install without the --prefix option, IBM Spectrum Conductor Deep Learning Impact is installed in its default directory: /opt/ibm/spectrumcomputing. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --dbpath dbpath_location sets the RPM database to a directory different from the default /var/lib/rpm. The --dbpath parameter is optional. Ensure that the path is set to the same directory as IBM Spectrum Conductor.
      • --quiet enables silent installation. The --quiet parameter is optional.
  6. Source the environment by running one of the following commands, where install_location is the path to your installation directory. The default directory is /opt/ibm/spectrumcomputing:
    • For BASH shell, run: source <install_location>/profile.platform
    • For CSH shell, run: source <install_location>/cshrc.platform
  7. Entitle IBM Spectrum Conductor Deep Learning Impact.
    1. Log in to your master host as the operating system cluster administrator user:
      su - username
      where username is the operating system cluster administrator (for example: egoadmin).
    2. Source the environment.
      • For BASH shell, run:
        source $EGO_TOP/profile.platform
      • For CSH shell, run:
        source $EGO_TOP/cshrc.platform
    3. Join the host to the cluster:
      egoconfig join master_host
    4. Set IBM Spectrum Conductor Deep Learning Impact entitlement for the IBM Spectrum Conductor cluster.
      egoconfig setentitlement entitlement-file-path
      where entitlement-file-path is the path to your IBM Spectrum Conductor Deep Learning Impact entitlement file.
    5. If IBM Spectrum Conductor is already configured with a shared directory for failover, run:
      egoconfig mghost shared_directory
    6. Start the IBM Spectrum Conductor cluster.
      sudo egosh ego start
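Substeps 7.1 through 7.6 can be sketched as one session; master_host, entitlement-file-path, and shared_directory are placeholders that you replace with your own values:

```shell
# As the cluster administrator OS user (for example: su - egoadmin)
source $EGO_TOP/profile.platform                 # BASH; use cshrc.platform for CSH

egoconfig join master_host                       # join the host to the cluster
egoconfig setentitlement entitlement-file-path   # apply the DLI entitlement

# Only if IBM Spectrum Conductor is already configured with a shared failover directory:
# egoconfig mghost shared_directory

sudo egosh ego start                             # start the cluster
```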
  8. As the root user, log in to the cluster as the cluster administrator user:
    egosh user logon -u username -x password
    For example:
    egosh user logon -u Admin -x Admin
  9. Verify entitlement information.
    egosh entitlement info
    If IBM Spectrum Conductor Deep Learning Impact is entitled correctly, the following output is displayed:
    EGO                                               : Entitled
    IBM Spectrum Conductor                            : Entitled
    IBM Spectrum Conductor Deep Learning Impact       : Entitled
  10. Verify that the IBM Spectrum Conductor Deep Learning Impact services (dlpd, dlinsights-monitor, dlinsights-optimizer, redis, and mongod) are started:
    egosh service list -l |grep -E "(dlpd|dlinsights*|mongod|redis)"
    If any services are not in the STARTED state, restart the service:
    egosh service stop service_name 
    egosh service start service_name
    where service_name is the name of the service (for example, dlpd).
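The check-and-restart in step 10 can be scripted; a sketch that assumes the service name and its state appear as the first two columns of `egosh service list -l` output (verify the column layout on your cluster before relying on it):

```shell
# Restart any IBM Spectrum Conductor Deep Learning Impact service
# that is not in STARTED state. Assumes you are logged on to EGO and
# that `egosh service list -l` prints columns: SERVICE STATE ...
for svc in dlpd dlinsights-monitor dlinsights-optimizer redis mongod; do
  state=$(egosh service list -l | awk -v s="$svc" '$1 == s { print $2 }')
  if [ "$state" != "STARTED" ]; then
    egosh service stop "$svc"
    egosh service start "$svc"
  fi
done
```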
  11. Once services are started, set permissions for DLI_DATA_FS and DLI_RESULT_FS. Ensure that users who submit deep learning workloads can read files from DLI_DATA_FS.
    1. Adjust permissions of DLI_DATA_FS:
      chmod -R 555 $DLI_DATA_FS
    2. Set the following:
      chown ${CLUSTERADMIN}:${CLUSTERADMIN_USERGROUP} $DLI_RESULT_FS
      chmod 733 $DLI_RESULT_FS
      chmod o+t $DLI_RESULT_FS
  12. If IBM Watson Studio Local is installed with a RabbitMQ server enabled for SSL with peer verification, complete the following steps. Note that this requires DLI_METRICS_STREAMING_ENABLED=Y to have been set at installation.
    1. Verify that the RabbitMQ server is enabled for SSL with peer verification, with the following configuration:
      ssl_options.verify = verify_peer
      ssl_options.fail_if_no_peer_cert = true
    2. Create a keystore in PKCS12 format that includes the dlpd server's private key and public certificate. For example:
      keytool -genkeypair -keystore rmq_client.p12 -storetype PKCS12 \
      -storepass MY_PASSWORD -alias KEYSTORE_ENTRY -keyalg RSA -keysize 2048 \
      -dname "CN=My SSL Certificate, OU=My Team, O=My Company, L=My City, ST=My State, C=SA" \
      -ext san=dns:mydomain.com,dns:localhost,ip:127.0.0.1
    3. Create a trust keystore using the RMQ server's certificate.
      keytool -import -alias server1 -file rmq_server_certificate.pem -keystore ./rabbitstore
    4. Edit the dlpd.conf configuration file and configure the following parameters:
      "RMQ_PEER_VERIFY": "true",
      "RMQ_CLIENT_KEYSTORE_PASSPHRASE": "<the pkcs12 format keystore password>",
      "RMQ_CLIENT_KEYSTORE_FILE": "<full file path for the key store file, pkcs12 format >",
      "RMQ_CLIENT_TRUSTSTORE_PASSPHRASE": "<your truststore password>",
      "RMQ_CLIENT_TRUSTSTORE_FILE": "<full file path for the trust store file>",
    5. Restart the dlpd service.
      egosh service stop dlpd
      sleep 5
      egosh service start dlpd

If you want to enable high availability, set up master failover.