Set up your system (Manual install)

Before installing IBM Watson® Machine Learning Accelerator, perform the following setup on your system.

Prerequisites

  • Ensure that you have root access to all hosts running deep learning workloads.
  • Ensure that all hardware and software requirements are met: Hardware and software requirements.
  • Use fully qualified domain names (FQDN) for all hosts in your cluster. Host names must be resolvable through a domain name server (DNS), so that the IP address can be resolved from the host name and the host name can be found from the IP address. To confirm the host names in your cluster, run hostname -f and getent hosts [ip_address] on each host. The host names that these commands return must match your cluster configuration.
  • All hosts in the cluster must use the same clock setting.
  • Python 2.7 must be installed on all hosts.
  • OpenSSL 1.0.1 or later must be installed on all hosts.
  • All hosts require the gettext library to provide globalization support for translated product messages. Without this library, you might encounter a gettext.sh: file not found or gettext: command not found error during installation. Typically, this library is installed with the operating system; however, if it was removed or is not installed, install the gettext package.
  • If you plan to enable SSL communication, install cURL 7.28 or later (required by the Elastic Stack) on all management hosts and on all hosts that will be used to run notebooks.
  • Remote shell (rsh) must be available on each host in the cluster.
Virus scanning: It is recommended that you disable real-time anti-virus software and any defragmentation software. These tools can cause poor performance and instability, especially on management hosts, and can create problems if they lock files while scanning them. If virus scanning is required, schedule it during cluster downtime.
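
You can spot-check several of these prerequisites on each host with a few shell commands. The following is a sketch only; replace the example IP address with the host's real address, and use ntpstat instead of chronyc if your cluster runs ntpd rather than chronyd:

    hostname -f                  # must return the FQDN that matches your cluster configuration
    getent hosts 192.0.2.10      # replace with the host's IP address; must return the same FQDN
    python --version             # expects Python 2.7.x
    openssl version              # expects OpenSSL 1.0.1 or later
    curl --version               # expects cURL 7.28 or later if SSL will be enabled
    chronyc tracking             # confirms that the clock is synchronized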

Install the operating system

The Deep Learning packages require one of the following operating systems:

Red Hat Enterprise Linux (RHEL) 7.6 little endian for POWER8® and POWER9™
  • WML CE can be installed and run directly on a bare-metal RHEL 7.6 system
  • The RHEL installation image and license must be acquired from Red Hat

Red Hat Enterprise Linux (RHEL) 7.6 (Linux 64-bit)

For more information about installing operating systems on IBM Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.

Open necessary ports

If a firewall is enabled, the following default ports must be granted access on all management hosts for IBM Spectrum Conductor Deep Learning Impact: 9243, 9280, 5000, 5001, 27017, and 6379. If you change these ports after installation, make sure to update firewall rules accordingly.
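
For example, with firewalld (the default firewall on RHEL 7), the default ports can be opened on each management host as follows; adjust the list if you use different ports:

    sudo firewall-cmd --permanent --add-port=9243/tcp --add-port=9280/tcp \
        --add-port=5000/tcp --add-port=5001/tcp --add-port=27017/tcp --add-port=6379/tcp
    sudo firewall-cmd --reload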

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor™: Summary of ports used by IBM Spectrum Conductor.

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor Deep Learning Impact: Summary of ports used by IBM Spectrum Conductor Deep Learning Impact.

Ensure user access of client machines to cluster hosts

Spark workload runs on non-management hosts in your cluster. Therefore, the Apache Spark UI and RESTful APIs that are available from Spark applications and the Spark history server must be accessible to your end users. This access is also required for any notebooks that you configure for use with IBM Spectrum Conductor.

If the hosts and ports that are used are not accessible from your client machines, you might encounter errors when you access notebooks and the IBM Spectrum Conductor user interfaces. The management hosts must also be able to access these hosts and ports.

Set the appropriate heap size

The default Elasticsearch installation uses a 2 - 4 GB heap for the Elasticsearch services. Elasticsearch recommends assigning 50 percent of available memory to the Elasticsearch client service, without exceeding 30.5 GB. Based on these recommendations, configure the Elasticsearch client and data services heap in IBM Spectrum Conductor to use 6 - 8 GB. Furthermore, the default garbage collector for Elasticsearch is Concurrent Mark Sweep (CMS). To prevent long stop-the-world pauses, do not configure a heap size larger than what the CMS garbage collector is designed to handle (approximately 6 - 8 GB).

For instructions to change the heap size, see How do I change the heap size for Elasticsearch?.
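
For reference only: a stock Elasticsearch installation reads its heap settings from the jvm.options file, where a 6 GB heap is expressed with the following two JVM flags. IBM Spectrum Conductor keeps this configuration in a product-specific location, so follow the linked topic for the actual procedure:

    # jvm.options (stock Elasticsearch; the file location differs in IBM Spectrum Conductor)
    -Xms6g
    -Xmx6g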

Log in with root permission

The following tasks all require that you log in as a user with root permissions or with sudo access to root.

Create cluster administration accounts

If you set up users on your hosts (both management and compute hosts), the execution user must use the same user ID (UID) and group ID (GID) on all of the hosts. For example, the UID and GID for the CLUSTERADMIN account must be the same on all hosts. By default, CLUSTERADMIN is set to egoadmin.
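
For example, the cluster administrator account can be created with an explicit UID and GID so that the values match on every host. The IDs shown here (1001) are illustrative only; substitute values that are unused on all of your hosts:

    sudo groupadd -g 1001 egoadmin
    sudo useradd -u 1001 -g egoadmin -m egoadmin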

Mount a shared file system

If you are using multiple nodes, you must mount a shared file system. The shared file system is used for user data, such as datasets, tuning data, validation results, trained models, and more. In the steps that follow, the default cluster administrator account (egoadmin) is used and the mount points are /dli_shared_fs and /dli_result_fs. Optionally, /dli_data_fs can be used for additional user data. The shared file system must meet these requirements:

  • The shared file system must be mounted to a clean directory. If you are reinstalling IBM Spectrum Conductor Deep Learning Impact, make sure that the directory specified is empty.
  • The shared file system must have a minimum of 2 GB of free disk space.
  • The cluster administrator account (the account that was specified by the CLUSTERADMIN variable during IBM Spectrum Conductor installation) must have read and write permissions to the shared file system.
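
How you provide the shared file system depends on your storage environment. As one sketch, an NFS-based setup (assuming a hypothetical NFS server named nfs-server that exports /export/dli_shared_fs and /export/dli_result_fs) might look like this on each host:

    sudo mkdir -p /dli_shared_fs /dli_result_fs
    sudo mount -t nfs nfs-server:/export/dli_shared_fs /dli_shared_fs
    sudo mount -t nfs nfs-server:/export/dli_result_fs /dli_result_fs
    # Add matching entries to /etc/fstab so that the mounts persist across restarts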

To verify that you mounted the shared file system correctly, assuming that the cluster administrator account is egoadmin and the mount points are /dli_shared_fs and /dli_result_fs, follow these steps:

  1. Export the environment variables:
    Note: The directory that is specified as the shared file system must exist. Before you export the shared file system environment variables, make sure that the directory exists; if it does not, create it manually.
    export CLUSTERADMIN=egoadmin
    export ADMINGROUP=egoadmin
    export DLI_SHARED_FS=/dli_shared_fs
    export DLI_RESULT_FS=/dli_result_fs
  2. Change the ownership of DLI_SHARED_FS to CLUSTERADMIN:
    chown -Rh $CLUSTERADMIN:$ADMINGROUP $DLI_SHARED_FS
  3. Make sure that DLI_SHARED_FS is owned by CLUSTERADMIN, and remove write access for group and others:
    chmod -R 755 $DLI_SHARED_FS
  4. Set the correct ownership for DLI_RESULT_FS, which is the mount point for shared result data storage:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
  5. Export DLI_DATA_FS:
    export DLI_DATA_FS=/dli_data_fs
    You must set permissions on this shared file system storage so that the deep learning workload submission user can read the files in this directory. If you are using Caffe models, the directory structure also needs to be writable. For example:
    chmod -R 755 $DLI_DATA_FS

Install utilities and packages

All hosts require the following utilities and packages:
  • bind-utils - provides the nslookup tool.
  • iproute - provides the ss utility that the built-in Zeppelin notebook uses.
  • net-tools - the net-tools package is required on RHEL.
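
If any of these packages are missing, they can be installed from the standard RHEL repositories:

    sudo yum -y install bind-utils iproute net-tools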

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.
    IBM POWER8:
    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
    IBM POWER9:
    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
    x86:
    sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-server-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable the Fedora Project EPEL (Extra Packages for Enterprise Linux) repository:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm

System firmware

If you are running on an AC922 system, you must update the system firmware before you install the current NVIDIA GPU driver. The firmware series and fix levels that are required for AC922 systems and the current NVIDIA GPU driver are:

  • 8335-GTG: OP910.30 or higher
  • 8335-GTH: OP920.10 or higher

System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:

  1. Enter 8335-GTG or 8335-GTH as the Product Selector.
  2. Select the appropriate firmware series from the drop-down list.
  3. Click Continue to go to the Select fixes page.
  4. Select the appropriate fix level.
  5. Click Continue to go to the Download options page.

IBM POWER9 specific udev rules

Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly.
Note: If you upgraded from a previous release, repeat this step with RHEL 7.6.

To disable it, follow these steps:

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules.
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file.
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the entire "Memory hotadd request" section and save the change (or use the sed one-liner shown after these steps):
    # Memory hotadd request
    #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
    #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

    #ENV{.state}="online"
    #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
    #ATTR{state}=="offline", ATTR{state}="$env{.state}"

    #LABEL="memory_hotplug_end"
  4. Optionally, delete the first line of the copied file, because the copy in /etc/udev/rules.d is not overwritten on update:
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect.
    sudo reboot
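
If you prefer to script the change from step 3 rather than comment out the lines by hand, a sed one-liner such as the following can be used, assuming that your copy of 40-redhat.rules contains the section exactly as shown above. Verify the resulting file before you restart:

    sudo sed -i '/^SUBSYSTEM!="memory"/,/^LABEL="memory_hotplug_end"/ s/^/#/' /etc/udev/rules.d/40-redhat.rules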

Install the kernel development packages

Load the latest kernel or do a full update:
  • Load the latest kernel:
    sudo yum install kernel-devel
    sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
    sudo reboot
  • Do a full update:
    sudo yum install kernel-devel
    sudo yum update
    sudo reboot

Remove previously installed CUDA and NVIDIA drivers

The CUDA Toolkit, cuDNN, and NCCL are provided as Conda packages and do not require separate installation. The GPU driver must still be installed separately.

Note: If you require the CUDA Toolkit on the host for uses beyond WML CE, consult NVIDIA's CUDA documentation for help upgrading the GPU driver without disturbing your existing Toolkit installation.

Before installing the updated GPU driver, uninstall any previously-installed CUDA and NVIDIA drivers. Follow these steps:

  1. Remove all CUDA Toolkit and GPU driver packages.

    You can display installed CUDA and driver packages by running these commands:

    rpm -qa | egrep 'cuda.*(9-2|10-0)'
    rpm -qa | egrep '(cuda|nvidia).*(396|410)\.'

    Verify the list, and remove the packages with yum remove (see the example after these steps).

  2. Remove any CUDA Toolkit and GPU driver repository packages.

    These should have been included in step 1, but you can confirm with this command:

    rpm -qa | egrep '(cuda|nvidia).*repo'

    Use yum remove to remove any that remain.

  3. Clean the yum repository:
    sudo yum clean all
  4. Remove cuDNN and NCCL:
    sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0
  5. Reboot the system to unload the GPU driver
    sudo shutdown -r now
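
As referenced in step 1, the query and removal can be combined, provided that you have reviewed the query output and every listed package should be removed. Skip a command if its query returns nothing:

    sudo yum remove $(rpm -qa | egrep 'cuda.*(9-2|10-0)')
    sudo yum remove $(rpm -qa | egrep '(cuda|nvidia).*(396|410)\.')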

Install the GPU driver

Many of the deep learning packages require the GPU driver packages to be downloaded from NVIDIA.

Install the GPU driver by following these steps:

  1. Download the NVIDIA GPU driver:
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla.
    • Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
    • Select Product: Tesla P100 or Tesla V100.
    • Select Operating System: Linux POWER LE RHEL 7 for POWER or Linux 64-bit RHEL7 for x86, depending on your cluster architecture. Click Show all Operating Systems if your version is not available.
    • Select CUDA Toolkit: 10.1
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
  2. Install the GPU driver repository and cuda-drivers.
    sudo rpm -ivh nvidia-driver-local-repo-rhel7-418.*.rpm
    sudo yum install nvidia-driver-latest-dkms
  3. Set nvidia-persistenced to start at boot
    sudo systemctl enable nvidia-persistenced
  4. Reboot the system
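
After the system restarts (for example, with sudo reboot), you can confirm that the driver loaded and that all GPUs are visible:

    nvidia-smi    # lists each GPU, the installed driver version, and current utilization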

Configure the required limits for the maximum number of processes and the maximum number of open files

For both the root user and the cluster administrator (egoadmin), you must configure the required limits for the maximum number of processes (nproc) and the maximum number of open files (nofile) on your hosts. The limits for the root user must be 65536 or more, and the limits for the cluster administrator must be 65536. Without these limits, services hang or enter the Error state on cluster startup.

  1. In the /etc/security/limits.conf file, set nproc and nofile to 65536 for root and the cluster administrator. In the following example, the cluster administrator is named egoadmin:
    root   soft    nproc     65536
    root   hard    nproc     65536
    root   soft    nofile    65536
    root   hard    nofile    65536
    egoadmin   soft    nproc     65536
    egoadmin   hard    nproc     65536
    egoadmin   soft    nofile    65536
    egoadmin   hard    nofile    65536
  2. Log out and then log back in to the server for the changes to take effect.
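
To confirm that the new limits are in effect, log in as root and as the cluster administrator and check the values:

    ulimit -u    # maximum number of processes (nproc); expect 65536
    ulimit -n    # maximum number of open files (nofile); expect 65536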

Set the vm.max_map_count kernel value

Set the vm.max_map_count kernel value to 262144 or more:
  1. Set the kernel value dynamically to ensure that the change takes effect immediately:

    sysctl -w vm.max_map_count=262144

  2. Add the following line to the /etc/sysctl.conf file to ensure that the change is still in effect when you restart your host:

    vm.max_map_count=262144
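
After you update /etc/sysctl.conf, you can apply and verify the setting without restarting the host:

    sudo sysctl -p            # reloads settings from /etc/sysctl.conf
    sysctl vm.max_map_count   # should report vm.max_map_count = 262144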