Getting started with PyTorch
PyTorch in WML CE includes support for IBM's Distributed Deep Learning (DDL) and Large Model Support (LMS).
This release of WML CE includes PyTorch 1.3.1.
GPU-enabled and CPU-only variants
Overview and top level meta-packages
- GPU-enabled variant
- The GPU-enabled variant pulls in CUDA and other NVIDIA components during install. It has larger installation size and includes support for advanced features that require GPU, such as DDL, LMS, and NVIDIA's Apex.
- CPU-only variant
- The CPU-only variant is built without CUDA and GPU support. It has a smaller installation size, and omits features that would require a GPU. It does not include support for DDL, LMS, or NVIDIA's Apex.
pytorch
- Installs the GPU-enabled variants of PyTorch, torchvision, and Apex, along with torchtext.
pytorch-cpu
- Installs the CPU-only variants of PyTorch and torchvision, along with torchtext.
Packaging details
A brief description of all the PyTorch-family packages included in WML CE follows:
GPU-enabled | CPU-only | Comments
---|---|---
pytorch | pytorch-cpu | Metapackage - Installs the entire PyTorch family but has no actual content.
pytorch-base | pytorch-base | PyTorch package - Includes installer and content.
torchvision | torchvision-cpu | Metapackage - Installs torchvision but has no actual content.
torchvision-base | torchvision-base | Torchvision package - Includes installer and content.
torchtext | torchtext | Torchtext package - Includes installer and content.
apex | N/A | Apex installer and content.
The package-base packages come in both GPU and CPU variants, and include gpu or cpu in the build string. There is also a _pytorch_select package that prevents mixing GPU and CPU packages.
Switching between GPU-enabled and CPU-only installations
To switch from a GPU-enabled installation to CPU-only, or vice versa, several packages must be uninstalled and their counterparts installed. Depending on the version of conda in use, the installer may not be able to find the solution on its own. For example, if the GPU-enabled packages were installed in a wmlce_env environment and you run the following, the conda installer might not find a solution for the request (conda 4.6 likely would; conda 4.7 likely would not):
conda install --prune pytorch-cpu
A workaround for this is to manually uninstall the old variant before installing the new one. You can uninstall the old variant by removing the _pytorch_select package, so the workaround is to run the following:
conda remove _pytorch_select
conda install --prune pytorch-cpu
You can also install the other variant in a separate conda environment from the original installation. GPU and CPU variants cannot coexist in a single environment, but you can create multiple environments, with GPU-enabled packages in some and CPU-only packages in others.
PyTorch examples
The PyTorch package includes a set of examples. A script is provided to copy the sample content into a specified directory:
pytorch-install-samples $HOME/pytorch-samples
PyTorch and DDL
WML CE Distributed Deep Learning is directly integrated into PyTorch as a ddl backend in PyTorch's communication package, torch.distributed.
Find more information at Integration with deep learning frameworks.
PyTorch cpp_extensions tests
The cpp_extensions tests that are run with pytorch-test require NVCC and a C++ compiler with C++11 ABI tagging (similar to g++ version 7). These packages are not listed as dependencies of the pytorch conda packages, however. To run these tests, you must install the cudatoolkit-dev conda package. You also need g++ version 7 installed and referenced either by the CXX environment variable or by a symlink at the c++ command. One way to install a suitable compiler is with conda: depending on your architecture, install version 7 of either gxx_linux-ppc64le or gxx_linux-64. If you do not install cudatoolkit-dev and set up a C++ compiler, pytorch-test prints an informational message and the cpp_extensions tests are skipped.
PyTorch distributed tests
Several of the PyTorch distributed tests require SSH and may fail with a message like the following if SSH is not present or usable:
--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
If you see that message, you can either install SSH or skip the distributed tests by running the following:
pytorch-test -x distributed
TensorBoard and PyTorch
PyTorch has a summary writer API (torch.utils.tensorboard.SummaryWriter) that can be used to export TensorBoard compatible data in much the same way as TensorFlow. For more information, visit: https://pytorch.org/docs/1.3.1/tensorboard.
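The summary writer API can be exercised in a few lines; a minimal sketch, assuming the tensorboard package is installed alongside PyTorch (the log directory name and metric values here are illustrative, not from WML CE):

```python
from torch.utils.tensorboard import SummaryWriter

# Write a dummy decreasing loss curve that TensorBoard can display.
# "runs/demo" is an arbitrary log directory chosen for this example.
writer = SummaryWriter(log_dir="runs/demo")
for step in range(10):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()
```

Pointing TensorBoard at the log directory (tensorboard --logdir runs) then renders the scalar curve, just as it would for TensorFlow-generated event files.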
Large Model Support (LMS)
Large Model Support is a feature provided in WML CE PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out-of-memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.
One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:
- Model depth and complexity
- Base data size (for example, high-resolution images)
- Batch size
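Each of these factors scales GPU memory consumption roughly linearly, which a back-of-the-envelope calculation makes concrete; the layer dimensions below are illustrative, not taken from any particular model:

```python
def activation_bytes(batch, channels, height, width, bytes_per_elem=4):
    """Approximate float32 memory footprint of one activation tensor."""
    return batch * channels * height * width * bytes_per_elem

# Doubling the batch size doubles activation memory for every layer:
assert activation_bytes(256, 64, 112, 112) == 2 * activation_bytes(128, 64, 112, 112)

# A single 64-channel 112x112 feature map at batch size 256 is already 784 MiB,
# so a deep network holding many such activations quickly exhausts a 16 GB GPU.
print(activation_bytes(256, 64, 112, 112) / 2**20)  # 784.0 MiB
```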
Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.
With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.
LMS usage
A PyTorch program enables Large Model Support by calling torch.cuda.set_enabled_lms(True) prior to model creation.
In addition, a pair of tunables is provided to control how GPU memory used for tensors is managed under LMS.
torch.cuda.set_limit_lms(limit)
Defines the soft limit in bytes on GPU memory allocated for tensors (default: 0).
By default, LMS favors GPU memory reuse (moving inactive tensors to host memory) over new allocations. This effectively minimizes GPU memory consumption.
However, when a limit is defined, the algorithm favors allocation of GPU memory up to the limit prior to swapping any tensors out to host memory. This allows the user to control the amount of GPU memory consumed when using LMS.
Tuning this limit to optimize GPU memory utilization, therefore, can reduce data transfers and improve performance. Since the ideal tuning for any given scenario may differ, it is considered a best practice to determine the value experimentally, arriving at the largest value that does not result in an out of memory error.
torch.cuda.set_size_lms(size)
Defines the minimum tensor size in bytes that is eligible for LMS swapping (default: 1 MB).
Any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.
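The size threshold acts as a simple eligibility predicate. A plain-Python illustration of the rule described above follows; the exact behavior at the boundary is an assumption here, not confirmed from the WML CE source:

```python
DEFAULT_SIZE_LMS = 1 * 1024 * 1024  # 1 MB default threshold

def eligible_for_lms(tensor_bytes, size_lms=DEFAULT_SIZE_LMS):
    """Tensors below the threshold are exempt from LMS and stay in GPU memory.
    Boundary behavior (>=) is an assumption for illustration."""
    return tensor_bytes >= size_lms

assert not eligible_for_lms(512 * 1024)    # 0.5 MB tensor stays resident
assert eligible_for_lms(16 * 1024 * 1024)  # 16 MB tensor may be swapped to host
```

Raising the threshold with set_size_lms keeps more small tensors pinned in GPU memory, trading memory headroom for fewer host-device transfers.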
LMS example
The PyTorch imagenet example provides a simple illustration of Large Model Support in action. ResNet-152 is a deep residual network that requires a significant amount of GPU memory.
On a system with a single 16 GB GPU, without LMS enabled, a training attempt with the default batch size of 256 will fail with insufficient GPU memory:
python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
[...]
RuntimeError: CUDA error: out of memory
After enabling LMS, the training proceeds without issue:
git diff
--- a/imagenet/main.py
+++ b/imagenet/main.py
@@ -90,6 +90,7 @@ def main():
world_size=args.world_size)
# create model
+ torch.cuda.set_enabled_lms(True)
if args.pretrained:
print("=> using pre-trained model '{}'".format(args.arch))
model = models.__dict__[args.arch](pretrained=True)
python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
Epoch: [0][0/5005] [...]
Epoch: [0][10/5005] [...]
Epoch: [0][20/5005] [...]
Epoch: [0][30/5005] [...]
Epoch: [0][40/5005] [...]
Epoch: [0][50/5005] [...]
Epoch: [0][60/5005] [...]
[...]
WML CE PyTorch API Extensions for LMS
Large Model Support extends the torch.cuda package to provide the following control and tuning interfaces.
torch.cuda.set_enabled_lms(enable)
- Enable/disable Large Model Support.
Parameters: enable (bool): desired LMS setting.
torch.cuda.get_enabled_lms()
- Returns a bool indicating whether Large Model Support is currently enabled.
torch.cuda.set_limit_lms(limit)
- Sets the allocation limit (in bytes) for LMS.
Parameters: limit (int): soft limit on GPU memory allocated for tensors.
torch.cuda.get_limit_lms()
- Returns the allocation limit (in bytes) for LMS.
torch.cuda.set_size_lms(size)
- Sets the minimum size (in bytes) for LMS.
Parameters: size (int): any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.
torch.cuda.get_size_lms()
- Returns the minimum size (in bytes) for LMS.
Known issues
- When running in a Docker container, pytorch-test or test/test_nn.py might fail with the following error:
libgomp: Thread creation failed: Resource temporarily unavailable
This is due to a default limit on the number of processes available in a Docker container. The error can be avoided by increasing the limit with the --pids-limit option of the docker run command. In testing, a limit of 16384 was found to avoid this issue:
--pids-limit 16384
More information
The PyTorch home page offers a variety of information, including tutorials and a getting started guide.
Additional tutorials and examples are available from the community: PyTorchZeroToAll