PowerAI system setup
Find information to set up your operating system, repository, and NVIDIA components.
- Operating system
- Red Hat Enterprise Linux operating system and repository setup
- Ubuntu operating system and repository setup
- System firmware
- IBM POWER9 specific udev rules (Red Hat only)
- Install the kernel development packages
- Remove previously installed CUDA and NVIDIA drivers (Red Hat only)
- CUDA, GPU driver, cuDNN, and NCCL (Red Hat only)
- NVIDIA Persistence Daemon (Red Hat only)
- GPU driver, docker, nvidia-docker2 (Ubuntu only)
- Anaconda
- Latest Linux kernel for RHEL 7.5 ALT
You can also run PowerAI in a container on a bare metal system that is running Ubuntu 18.04.
- Recent AC922 system firmware:
  - 8335-GTG: OP910.24
  - 8335-GTH: OP920.02
- NVIDIA GPU driver 410.72 or higher
Operating system
The Deep Learning packages require specific operating systems:
- Red Hat Enterprise Linux (RHEL) 7.5 little endian for IBM® POWER8® and IBM POWER9™
  - PowerAI can be installed and run directly on a bare-metal RHEL 7.5 system.
  - PowerAI can also be run from a container on a RHEL 7.5 system. For more information about setting up a container to run PowerAI, see Using nvidia-docker 2.0 with RHEL 7.
  - The RHEL installation image and license must be acquired from Red Hat.
- Ubuntu 18.04 LTS for IBM Power
  - PowerAI must be run in a container when running on a bare-metal Ubuntu 18.04 system.
  - The Ubuntu installation image can be downloaded from Ubuntu.
Host OS | Container OS |
---|---|
Red Hat Enterprise Linux 7.5 | Ubuntu 18.04 |
Ubuntu 18.04 | Ubuntu 18.04 |
Red Hat Enterprise Linux 7.5 | none (Bare metal) |
For more information about installing operating systems on IBM Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.
Red Hat Enterprise Linux operating system and repository setup
- Enable the common, optional, and extra repo channels.
  IBM POWER8:
  sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
  sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
  sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
  IBM POWER9:
  sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
  sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
  sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
- Install packages needed for the installation.
  sudo yum -y install wget nano bzip2
- Enable the Fedora Project EPEL (Extra Packages for Enterprise Linux) repo:
  wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
  sudo rpm -ihv epel-release-latest-7.noarch.rpm
- Load the latest kernel or do a full update:
  - Load the latest kernel:
    sudo yum update kernel kernel-tools kernel-tools-libs kernel-bootwrapper
    sudo reboot
  - Do a full update:
    sudo yum update
    sudo reboot
  Important: RHEL 7.6 was released at the end of October, but is not yet supported by PowerAI. Running just yum update might upgrade a 7.5 system to 7.6. To avoid this, customers with a standard RHEL subscription might use:
  sudo subscription-manager release --set=7.5
  Customers should consult Red Hat if they’re unsure how to avoid an unintended upgrade.
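After an update, it can be useful to confirm that the system is still on RHEL 7.5. The following is a minimal sketch, not part of the official procedure: `rhel_version_from` is a hypothetical helper, and it assumes the release string has the usual form found in /etc/redhat-release, for example "Red Hat Enterprise Linux Server release 7.5 (Maipo)".

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first "major.minor" version that follows
# the word "release" in the given string, or fails if none is found.
rhel_version_from() {
    local re='release[[:space:]]+([0-9]+\.[0-9]+)'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Assumption: /etc/redhat-release exists on a RHEL host; skip quietly elsewhere.
if [ -r /etc/redhat-release ]; then
    version=$(rhel_version_from "$(cat /etc/redhat-release)") || version="unknown"
    if [ "$version" = "7.5" ]; then
        echo "RHEL $version: supported release"
    else
        echo "Detected release $version: check it against the supported release (7.5)"
    fi
fi
```

Passing the release string as an argument keeps the parsing separate from the file read, so the helper can be checked against a sample string on any machine.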
Ubuntu operating system and repository setup
- Install packages needed for the installation:
  sudo apt-get install -y wget nano apt-transport-https ca-certificates curl software-properties-common
- Load the latest kernel:
  sudo apt-get install linux-headers-$(uname -r)
  sudo reboot
- Or do a full update:
  sudo apt-get update
  sudo apt-get dist-upgrade
  sudo reboot
System firmware
If you are running on an AC922 system, you need to update the firmware. Ensure that the system firmware is updated to at least the following levels before you install the current NVIDIA GPU driver.
The firmware series and fix levels that are required for AC922 for the current NVIDIA GPU driver are:
- 8335-GTG: OP910.24 or higher
- 8335-GTH: OP920.02 or higher
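The minimum levels above can be encoded in a small script. This is a sketch under stated assumptions: `min_firmware_for` and `firmware_at_least` are hypothetical helpers, levels are assumed to have the form "OP<series>.<fix>" with numeric parts, and a higher series number is assumed to count as a higher level. How to read the installed level from the system is not covered here, so the current level is passed in as an argument.

```shell
#!/usr/bin/env bash
# Minimum AC922 firmware levels, taken from the list above.
min_firmware_for() {
    case $1 in
        8335-GTG) echo "OP910.24" ;;
        8335-GTH) echo "OP920.02" ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeeds if level $1 is at or above level $2.
# Assumes "OP<series>.<fix>" with numeric series and fix parts.
firmware_at_least() {
    local have=${1#OP}
    local need=${2#OP}
    local have_series=${have%%.*} have_fix=${have#*.}
    local need_series=${need%%.*} need_fix=${need#*.}
    # 10# forces base 10 so that a fix level such as "02" is not read as octal.
    if (( 10#$have_series != 10#$need_series )); then
        (( 10#$have_series > 10#$need_series ))
        return $?
    fi
    (( 10#$have_fix >= 10#$need_fix ))
}
```

For example, `firmware_at_least OP910.30 "$(min_firmware_for 8335-GTG)"` succeeds, while the same call with `OP910.10` fails.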
System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:
- Enter 8335-GTG or 8335-GTH as the Product Selector.
- Select the appropriate firmware series from the drop-down list.
- Click Continue to go to the Select fixes page.
- Select the appropriate fix level.
- Click Continue to go to the Download options page.
IBM POWER9 specific udev rules (Red Hat only)
Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly. To disable it, follow these steps:
- Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules.
  sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
- Edit the /etc/udev/rules.d/40-redhat.rules file.
  sudo nano /etc/udev/rules.d/40-redhat.rules
- Comment out the following line and save the change:
  SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
- Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten.
  # do not edit this file, it will be overwritten on update
- Restart the system for the changes to take effect.
  sudo reboot
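The manual edit in the steps above could also be scripted. The sketch below is one possible approach, not the documented procedure: `disable_memory_auto_online` is a hypothetical helper, and it assumes the rule starts exactly with `SUBSYSTEM=="memory", ACTION=="add"` as shown above. Lines that are already commented out are left alone, so running it twice is harmless.

```shell
#!/usr/bin/env bash
# Hypothetical helper: comments out the memory auto-onlining rule in the
# rules file given as $1. Assumes the rule text matches the line shown above.
disable_memory_auto_online() {
    local rules_file=$1 tmp_file
    tmp_file=$(mktemp) || return 1
    # Prefix matching lines with "# "; commented lines no longer match "^SUBSYSTEM".
    if ! sed '/^SUBSYSTEM=="memory", ACTION=="add"/ s/^/# /' "$rules_file" > "$tmp_file"; then
        rm -f "$tmp_file"
        return 1
    fi
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp_file" > "$rules_file"
    rm -f "$tmp_file"
}
```

On a real system the function would be run as root against the copied file, /etc/udev/rules.d/40-redhat.rules, followed by the restart described above.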
Install the kernel development packages
Install the kernel development packages for the currently running kernel by running the following command:
- On Red Hat:
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
- On Ubuntu:
sudo apt-get install linux-headers-$(uname -r)
Remove previously installed CUDA and NVIDIA drivers (Red Hat only)
Before installing CUDA 10, uninstall any previous installations of CUDA and NVIDIA drivers. Follow these steps:
- Run the following command:
  sudo yum remove libglvnd*
  Note that removing libglvnd* also uninstalls the NVIDIA drivers.
- Verify that the drivers were uninstalled:
  sudo yum list installed | grep cuda
  If a previous local repo is found, uninstall it:
  sudo rpm -e cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le
- Finally, run:
  sudo yum clean all
  If you get a message to remove the yum cache, run:
  sudo rm -rf /var/cache/yum
CUDA, GPU driver, cuDNN, and NCCL (Red Hat only)
The Deep Learning packages require CUDA, cuDNN, and GPU driver packages from NVIDIA. See the PowerAI prerequisites for the required and recommended versions of these components.
Install the components by following these steps:
- Download NVIDIA CUDA 10.
  - Select Operating System: Linux.
  - Select Architecture: ppc64le.
  - Select Distribution: RHEL.
  - Select Version: 7.
  - Select Installer Type: rpm (network).
  - Follow the Linux on POWER installation instructions in the CUDA Quick Start Guide, including the steps that describe how to set up the CUDA development environment by updating PATH and LD_LIBRARY_PATH.
- Download NVIDIA driver 410.
  - Select Product Type: Tesla
  - Select Product Series: P-Series
  - Select Product: Tesla P100
  - Select Operating System: Linux POWER LE RHEL 7
  - Select CUDA Toolkit: 10.0
  - Click Search to go to the download link.
  Note: See Table 1 for supported and recommended drivers.
- Install CUDA and the GPU driver.
  Note: For AC922 systems, OS and system firmware updates are required before you install the latest GPU driver.
  At a high level, the installation process is:
  - Install the CUDA Base repository rpm
  - Install the GPU driver repository rpm
  - Run sudo yum install cuda to install CUDA and the GPU driver
  - Restart to activate the driver
- Download NVIDIA cuDNN v7.3.1 for CUDA 10.0 (registration in NVIDIA’s Accelerated Computing Developer Program is required).
  - cuDNN v7.3.1 Library for Linux (Power8/Power9)
- Download NVIDIA NCCL v2.3.5 for CUDA 10.0 (registration in NVIDIA’s Accelerated Computing Developer Program is required).
  - NCCL 2.3.5 O/S agnostic and CUDA 10.0 and IBM Power
- Install the cuDNN v7.3.1 and NCCL v2.3.5 packages. Refresh the shared library cache.
  sudo tar -C /usr/local --no-same-owner -xzvf cudnn-10.0-linux-ppc64le-v7.3.1.20.tgz
  sudo tar -C /usr/local/cuda/targets/ppc64le-linux/ --no-same-owner --strip-components=1 -xvf nccl_2.3.5-5+cuda10.0_ppc64le.txz
  sudo ldconfig
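The CUDA development environment mentioned above is set up by updating PATH and LD_LIBRARY_PATH. The following sketch shows one way to do that without adding the same directory twice. `prepend_dir` is a hypothetical helper, and the directories /usr/local/cuda/bin and /usr/local/cuda/lib64 are assumptions based on the default install prefix; the CUDA Quick Start Guide remains the authoritative source for the exact paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the colon-separated list $2 with directory $1
# added at the front, unless $1 is already one of its entries.
prepend_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *) printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

# Assumed default CUDA locations; adjust them if CUDA is installed elsewhere.
PATH=$(prepend_dir /usr/local/cuda/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_dir /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Placing these lines in a shell startup file such as ~/.bashrc would make the change persistent. Because the helper checks for an existing entry, sourcing the file repeatedly does not grow the variables.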
NVIDIA Persistence Daemon (Red Hat only)
The NVIDIA Persistence Daemon may be automatically started for POWER9 installations. Check that it is running with the following command:
systemctl status nvidia-persistenced
If it is not active, run the following command:
sudo systemctl enable nvidia-persistenced
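The check and the follow-up command can be combined into one function. This is a minimal sketch: `ensure_persistenced` is a hypothetical name, and it goes one step beyond the command above by also starting the service, since enabling a unit only arranges for it to start at boot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports whether nvidia-persistenced is active and,
# if it is not, enables it at boot and starts it immediately.
ensure_persistenced() {
    if systemctl is-active --quiet nvidia-persistenced; then
        echo "nvidia-persistenced is active"
    else
        echo "nvidia-persistenced is not active; enabling and starting it"
        # The start command is an addition to the documented enable step.
        sudo systemctl enable nvidia-persistenced &&
            sudo systemctl start nvidia-persistenced
    fi
}
```

Calling `ensure_persistenced` after the driver installation and restart gives a quick confirmation that the daemon is in the expected state.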
GPU driver, docker, nvidia-docker2 (Ubuntu only)
To run PowerAI within docker containers, only the GPU driver needs to be installed on the host.
- Download NVIDIA driver 410.72 from http://www.nvidia.com/Download/index.aspx.
  - Select Product Type: Tesla
  - Select Product Series: P-Series
  - Select Product: Tesla P100
  - Select Operating System: Linux POWER LE Ubuntu 18.04 (If Linux POWER LE Ubuntu 18.04 is not available, click Show all Operating Systems)
  - Select CUDA Toolkit: 10.0
  - Click Search to go to the download link
- Install the GPU driver repository deb package and cuda-drivers.
  sudo dpkg -i nvidia-driver-local-repo-ubuntu1804-410.72_1.0-1_ppc64el.deb
  sudo apt-get update
  sudo apt-get install cuda-drivers
- Edit the nvidia-persistenced file.
  sudo systemctl edit --full nvidia-persistenced
  Replace the contents with the following lines:
  [Unit]
  Description=NVIDIA Persistence Daemon
  Wants=syslog.target
  [Service]
  Type=forking
  PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
  Restart=always
  ExecStart=/usr/bin/nvidia-persistenced --verbose
  ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
  TimeoutSec=300
  [Install]
  WantedBy=multi-user.target
- Set nvidia-persistenced to start at boot:
  sudo systemctl enable nvidia-persistenced
- Restart your system.
- Install docker. For Ubuntu platforms, a Docker runtime must be installed. If there is no Docker runtime installed yet, install Docker-CE on Ubuntu.
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  sudo add-apt-repository "deb [arch=ppc64el] https://download.docker.com/linux/ubuntu bionic stable"
  sudo apt-get update
  sudo apt-get install docker-ce
- Install nvidia-docker 2.
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update
  sudo apt-get install nvidia-docker2
  sudo pkill -SIGHUP dockerd
- Verify the setup.
  nvidia-docker run --rm nvidia/cuda nvidia-smi
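The nvidia-docker 2 step derives the repository name from /etc/os-release by sourcing the file and joining ID and VERSION_ID. The sketch below isolates that logic so it can be inspected before any repository is added. Both function names are hypothetical, and the optional file argument exists only so that the logic can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Prints "$ID$VERSION_ID" from an os-release style file (default: /etc/os-release).
# The file is sourced in a subshell so ID and VERSION_ID do not leak into the caller.
distribution_from() {
    local os_release=${1:-/etc/os-release}
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Hypothetical helper: builds the repository list URL used in the step above.
nvidia_docker_list_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list\n' "$1"
}
```

On Ubuntu 18.04, where /etc/os-release sets ID=ubuntu and VERSION_ID="18.04", `nvidia_docker_list_url "$(distribution_from)"` prints the URL that ends in ubuntu18.04/nvidia-docker.list.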
Anaconda
A number of the Deep Learning frameworks require Anaconda. Anaconda is a platform-agnostic data science distribution with a collection of 1,000+ open source packages with free community support.
Use Anaconda2 with Python 2 to run the Python 2 versions of the Deep Learning frameworks. Anaconda3 with Python 3 is required to run the Python 3 versions of the Deep Learning frameworks.
- Anaconda2, version 5.2.0
md5sum: 479633a95906ea6d41056ebe84a4c47b
- Anaconda3, version 5.2.0
md5sum: cbd1d5435ead2b0b97dba5b3cf45d694
- Download Anaconda:
  wget https://repo.continuum.io/archive/Anaconda2-5.2.0-Linux-ppc64le.sh
- Install Anaconda:
  bash Anaconda2-5.2.0-Linux-ppc64le.sh
  source ~/.bashrc
  - Accept the license agreement.
  - Specify an installation location (default is $HOME/anaconda2).
  - Set the PATH environment variable.
    - For setups that have a single Anaconda instance for multiple users, such as PowerAI Enterprise, reply no when the installer asks whether to update the .bashrc file or .bash_profile. After the installation is complete, export the path with this command:
      export PATH=/opt/anaconda2/bin:$PATH
    - For other PowerAI users, reply yes to allow the installer to update the .bashrc file or .bash_profile. In this case, if multiple users are using the same system, each user should install Anaconda individually.
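The md5sum values listed above can be checked before the installer is run. The following is a minimal sketch: `verify_md5` is a hypothetical helper, and it assumes the `md5sum` utility from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the MD5 checksum of file $1 equals $2.
verify_md5() {
    local file=$1 expected=$2 actual
    actual=$(md5sum "$file") || return 2
    actual=${actual%% *}   # keep only the checksum, drop the file name
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)"
        return 1
    fi
}
```

For the Anaconda2 installer downloaded above, the call would be `verify_md5 Anaconda2-5.2.0-Linux-ppc64le.sh 479633a95906ea6d41056ebe84a4c47b`. A mismatch usually points to an incomplete or corrupted download, in which case the file should be downloaded again before installing.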