Getting started with TensorFlow large model support (TFLMS) V2
TensorFlow large model support (TFLMS) V2 provides an approach to training large models that cannot fit into GPU memory. It takes a computational graph defined by users and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa. The computational graph is modified statically, so the modification must be done before a session starts.
Installing TensorFlow large model support
TensorFlow Large Model Support (TFLMS) V2 can be installed by running this command:
conda install tensorflow-large-model-support
How to use TensorFlow large model support
Enabling LMS for a model depends on how users write their training. The following guidelines cover three ways to train:
Session-based training
To train a model that uses session-based training, add the following code block after the graph is built, but before you start a training session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run(tf.get_default_graph())
For example:
# Import and run the graph modification before running a session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run(tf.get_default_graph())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
For a working example of LMS integration with Session-based training, see the TensorFlow Large Model Support examples.
Estimator-based training
To train a model that uses estimator-based training, follow these steps:
- Import and initialize LMS:
from tensorflow_large_model_support import LMS
lms_hook = LMS()
- Add the LMS object to the Estimator's hook list:
mnist_classifier.train(input_fn=train_input_fn, steps=20000, hooks=[logging_hook, lms_hook])
For a working example of LMS integration with Estimator-based training, see the TensorFlow Large Model Support examples.
tf.keras-based training
To train a model that uses tf.keras-based training, follow these steps:
- Import and initialize LMS:
from tensorflow_large_model_support import LMS
lms_callback = LMS()
- Add the LMS object to the callback list on the Keras fit or fit_generator function:
model.fit_generator(generator=training_gen, callbacks=[lms_callback])
For a working example of LMS integration with tf.keras-based training, see the TensorFlow Large Model Support examples.
Parameters for LMS
- swapout_threshold
The smaller swapout_threshold is, the more tensors are swapped out to the host memory. Default -1 (auto mode).
- swapin_ahead
The larger swapin_ahead is, the earlier a tensor is swapped in to the GPU memory from the host memory. Default -1 (auto mode).
- swapin_groupby
Multiple swap-in operations of the same tensor are grouped or fused into one swap-in operation for better performance if they are close to each other (the distance between them is within swapin_groupby). Default -1 (auto mode).
- sync_mode
Whether to synchronize data transfer and kernel computation. Four modes:
0: Turn off
1: Sync for only swap-out operations
2: Sync for only swap-in operations
3: Sync for both swap-out and swap-in operations
Default 0.
- serialization
Serialize operations at the same level in the topological sort. This option accepts a list of Python slicing strings in which each slicing represents level indices in the topological sort. For example, [1, '3:5', 7] means levels 1, 3, 4, 5, and 7 are serialized. Default [] (turn off).
- gpu_device
The GPU device that this instance of LMS will operate on. When models are written in a multi-tower fashion and operations are assigned to different GPU devices using semantics similar to tf.device('/device:GPU:2'), LMS must be instantiated and run multiple times, one time for each GPU.
- debug
Debug mode for LMS. Default False.
- debug_level
Debug level for LMS (1 or 2). Default 1.
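The serialization format described in the parameter list can be made concrete with a small helper. The following is a minimal sketch and not part of TFLMS: expand_serialization is a hypothetical function that expands a specification into explicit level numbers, following this document's reading that '3:5' covers levels 3, 4, and 5. It assumes zero-based level indices and takes the number of topological sort levels as an argument; how LMS itself parses these strings may differ.

```python
from typing import List, Sequence, Union


def expand_serialization(spec: Sequence[Union[int, str]], num_levels: int) -> List[int]:
    """Expand a serialization spec into a sorted list of level indices.

    Hypothetical helper for illustration only. Slice ends are treated as
    inclusive, matching the documented example where '3:5' means 3, 4, 5.
    Levels are assumed to be numbered 0 .. num_levels - 1.
    """
    levels = set()
    for item in spec:
        if isinstance(item, int):
            levels.add(item)
            continue
        if ":" not in item:
            # A plain numeric string such as '7' names a single level.
            levels.add(int(item))
            continue
        start_text, _, stop_text = item.partition(":")
        start = int(start_text) if start_text else 0
        stop = int(stop_text) if stop_text else num_levels - 1
        levels.update(range(start, stop + 1))
    # Drop anything outside the assumed valid range.
    return sorted(level for level in levels if 0 <= level < num_levels)


if __name__ == "__main__":
    print(expand_serialization([1, "3:5", 7], num_levels=10))
```

Under the same assumptions, an open-ended slice such as '200:' expands to every level from 200 through the last one.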
Automatic parameter tuning
If the swapout_threshold, swapin_ahead, and swapin_groupby parameters are set to the default values, TFLMS automatically finds suitable values for them. Auto tuning also tunes the sync_mode parameter to increase GPU compute and data transfer synchronization when it detects that no amount of swapping will avoid out-of-memory conditions with asynchronous GPU compute and data transfer.
Auto tuning uses simulated training to find suitable values and thus is only suitable for tuning training. If LMS is required for inferencing, prediction, or evaluation, then manual tuning techniques should be used. For repeated training runs, consider using auto tuning for the first run to find working values for the tunable parameters and then modify the code to pass those values rather than running auto tuning for each training run. This will decrease training time.
Auto tuning requires the mini batch size to correctly calculate memory usage. Some models and some methods of invoking the model training do not allow LMS to know the batch size. When auto tuning cannot discover the batch size, it raises an error and the batch size should be specified manually as follows:
from tensorflow_large_model_support import LMS
lms = LMS()
lms.batch_size = 32
lms.run(tf.get_default_graph())
Auto tuning excludes a configurable portion of GPU memory during its simulated training to allow for memory overhead related to things like garbage collection, metrics, temporary memory allocations within operations, or cross-GPU communication. This value can be configured by setting the TF_LMS_SIMULATOR_MEM_RATIO environment variable. Set the environment variable to a floating point value that is used as the fraction of GPU memory available to the simulated training. For example, the default value of TF_LMS_SIMULATOR_MEM_RATIO is 0.9, which directs auto tuning to use 90% of the GPU memory capacity as the maximum for the simulated training. If parameters chosen by auto tuning still result in out-of-memory errors, it may be beneficial to set this ratio lower and reserve more memory for other overhead. Conversely, it may be beneficial to set this ratio to 1.0 for some models if the auto-tuned values at 1.0 do not result in out-of-memory errors and result in faster training time.
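To illustrate what the ratio means in practice, the following minimal sketch sets TF_LMS_SIMULATOR_MEM_RATIO from Python and computes the memory ceiling that the ratio implies. The simulated_memory_limit_mb function is a hypothetical helper, not part of TFLMS, and the accepted range of greater than 0 up to 1.0 is an assumption. The sketch also assumes that the variable must be set before LMS is created; exporting it in the shell before starting the training script is an equally valid approach.

```python
import os


def simulated_memory_limit_mb(gpu_memory_mb: float, ratio: float) -> float:
    """Memory ceiling implied by a given ratio (hypothetical helper)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio is assumed to lie in the range (0, 1]")
    return gpu_memory_mb * ratio


# Reserve more headroom than the documented default of 0.9.
# Assumption: this runs before the LMS object is created.
os.environ["TF_LMS_SIMULATOR_MEM_RATIO"] = "0.85"

ratio = float(os.environ["TF_LMS_SIMULATOR_MEM_RATIO"])
limit_mb = simulated_memory_limit_mb(16384, ratio)  # a 16 GB GPU
print(f"Simulated training may use up to {limit_mb:.1f} MB")
```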
The auto tuning simulator can produce plots of predicted memory usage over an iteration of the model. It produces plots for expected memory usage without TFLMS and for every TFLMS parameter set that succeeds in simulation during auto tuning. To enable auto tune plotting, set the autotune_plot property as follows:
lms = LMS()
lms.autotune_plot = True
Manual parameter tuning
Tuning the LMS parameters manually is an exercise in finding the maximum values of the swapout_threshold, swapin_ahead, and swapin_groupby parameters that allow the model to execute while avoiding out-of-memory conditions. The maximum value for these parameters is the number of topological sort levels in the model. The recommendation is to find the maximum value for swapout_threshold first, followed by swapin_ahead, and lastly swapin_groupby.
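The recommended tuning order can be sketched as a search loop. This is a minimal sketch under stated assumptions, not a TFLMS feature: fits is a hypothetical callback that you would implement to run a short training attempt with the given parameter values and return False on an out-of-memory error. The binary search assumes that once a value fails, every larger value fails too, and the lower bound of 1 for all three parameters is also an assumption. How the parameters that have not been searched yet are set during an attempt is left to the callback.

```python
from typing import Callable, Dict, Optional

Params = Dict[str, int]


def largest_passing(lo: int, hi: int, passes: Callable[[int], bool]) -> Optional[int]:
    """Return the largest value in [lo, hi] that passes, or None.

    Assumes monotonic behavior: if a value fails, all larger values fail.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(mid):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best


def tune_manually(num_levels: int, fits: Callable[[Params], bool]) -> Optional[Params]:
    """Search swapout_threshold, then swapin_ahead, then swapin_groupby."""
    params: Params = {}
    for name in ("swapout_threshold", "swapin_ahead", "swapin_groupby"):
        # Keep the values already found fixed while searching the next one.
        best = largest_passing(1, num_levels, lambda v: fits({**params, name: v}))
        if best is None:
            return None  # even the smallest value ran out of memory
        params[name] = best
    return params


if __name__ == "__main__":
    # Stand-in for real training attempts, used only to exercise the search.
    def fake_fits(p: Params) -> bool:
        return (p.get("swapout_threshold", 1) <= 40
                and p.get("swapin_ahead", 1) <= 10
                and p.get("swapin_groupby", 1) <= 5)

    print(tune_manually(100, fake_fits))
```

When tune_manually returns None, swapping alone is not enough, which is the point at which the sync_mode and serialization parameters become relevant.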
Tuning the sync_mode parameter:
If the model is unable to run within GPU memory while using a swapout_threshold of 1, the next step is to begin enabling higher levels of synchronization between GPU compute and memory transfer. This is accomplished by setting the sync_mode parameter to higher levels.
Tuning serialization:
If the model is still unable to run while using a low swapout_threshold and full compute and memory synchronization (sync_mode=3), the next parameter to investigate is the serialization parameter. This parameter adds synchronization between operations within the same topological sort level. This synchronization allows operations that produce large tensors to run in serial and allows their tensors to be swapped out before the next operation runs, which gives the operations more available GPU memory. When the serialization parameter is used, it changes the number of levels in the topological sort, which can change the behavior of the swapout_threshold parameter.
One convenient method to set serialization is to begin serializing from the end of the model and work backwards through the model until it is able to run without running out of memory. To do this, the string slice form can be used. For example, to serialize the operations from level 200 to the end of the model, you could specify serialization=['200:'].
Care must be taken when using the serialization parameter to ensure that the earlier levels of the model where variable initialization is performed are not serialized. If the variable initialization becomes serialized the model execution will fail or produce unpredictable results.
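The backwards approach, together with the caution about variable initialization, can be sketched as a generator of candidate serialization values. This is a minimal sketch: backward_serialization_specs, step, and first_safe_level are hypothetical names introduced for illustration. The first level that is safe to serialize depends on the model and has to be determined by inspection, and zero-based level numbering is an assumption.

```python
from typing import Iterator, List


def backward_serialization_specs(num_levels: int, step: int,
                                 first_safe_level: int) -> Iterator[List[str]]:
    """Yield serialization values that grow from the end of the model backwards.

    Never reaches below first_safe_level, so the early levels where variable
    initialization happens stay unserialized. Hypothetical helper.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not 0 <= first_safe_level < num_levels:
        raise ValueError("first_safe_level must be a valid level index")
    start = num_levels - step
    while start > first_safe_level:
        yield [f"{start}:"]
        start -= step
    # Last candidate: serialize everything that is safe to serialize.
    yield [f"{first_safe_level}:"]


if __name__ == "__main__":
    for spec in backward_serialization_specs(1000, 200, 300):
        # Each value would be tried as serialization=spec in a training
        # attempt, stopping at the first one that avoids running out of memory.
        print(spec)
```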
Usage tips
- Enable TFLMS informational output
TFLMS logs informational output during graph modification and auto tuning. The TensorFlow logger is used for this and the TensorFlow logger must be set to the INFO or greater level for the TFLMS log statements to appear. To enable TensorFlow INFO logging do the following:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
- Increase the system (CUDA host) memory allocation
TensorFlow sets a limit on the amount of memory that will be allocated on the CUDA host (CPU) side. The limit is often not high enough to act as a tensor swap space when swapping a large amount of data or when using multiple GPUs in a multi-tower fashion with a tower for each GPU as described in the TensorFlow documentation. Failure to set this limit higher will result in out of memory errors such as:
Allocator (cuda_host_bfc) ran out of memory trying to allocate
Note that the cuda_host_bfc allocator is mentioned rather than a GPU allocator.
A good rule of thumb is to start with a value that is 4 times the memory capacity of the GPUs times the number of GPUs that will be used. For example, if you have four 16-GB GPUs in a system and will use all four in a training run, TF_CUDA_HOST_MEM_LIMIT_IN_MB should be set to 262144 and adjusted from there as needed: 4 x 16384 (16 GB as MB) x 4 GPUs = 262144 MB.
- Use NUMA pinning for single GPU use
If you are using a single GPU, it is recommended to use NUMA pinning to pin the process to the CPU and memory that is on the same system socket as the GPU being used. Pinning the process allows the fastest connection paths between system memory and GPU memory, which reduces the training or inferencing time. PowerAI includes the numactl utility that can be used to do this pinning. It can be installed with the conda install numactl command. The following example shows how to specify a single GPU to be used and how to pin the process to use the CPU cores and memory that are on the same socket as the specified GPU:
export CUDA_VISIBLE_DEVICES=0
numactl --cpunodebind=0 --membind=0 python train.py
- Use PowerAI Distributed Deep Learning when using more than one GPU
To achieve better scaling performance with LMS on multiple GPUs, update the training script to use PowerAI Distributed Deep Learning and run the training script with the ddlrun command. For more information about ddlrun, see Using ddlrun tool.
Additionally, if you are running on a single system without an InfiniBand setup, the --mpiarg -pami_noib parameter must be added to the ddlrun command line, for example:
ddlrun --mpiarg -pami_noib python train_model.py
- Invoke LMS on each GPU "tower" when not using PowerAI Distributed Deep Learning
As previously noted, it is highly recommended that you use PowerAI Distributed Deep Learning when using more than one GPU for the best scaling performance. However, if you choose to use a multi-tower graph with a tower for each GPU as described in the TensorFlow documentation, TFLMS can still be used to add swapping nodes to the graph. For multi-tower, multi-GPU models, run TFLMS one time per GPU. For example:
gpu_devices = ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
for gpu in gpu_devices:
    lms = LMS(gpu_device=gpu)
    lms.run(graph)
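The host memory rule of thumb from the usage tips above (4 times the GPU memory capacity times the number of GPUs) can be written as a short calculation. This is a minimal sketch: host_mem_limit_mb is a hypothetical helper, and setting the variable from Python assumes that it happens before TensorFlow is imported. Exporting TF_CUDA_HOST_MEM_LIMIT_IN_MB in the shell before launching the training script avoids that ordering concern.

```python
import os


def host_mem_limit_mb(gpu_memory_gb: int, num_gpus: int, factor: int = 4) -> int:
    """Starting value for TF_CUDA_HOST_MEM_LIMIT_IN_MB (hypothetical helper).

    factor x GPU memory in MB x number of GPUs, per the rule of thumb.
    """
    return factor * gpu_memory_gb * 1024 * num_gpus


# Four 16 GB GPUs: 4 x 16384 MB x 4 GPUs = 262144 MB.
limit_mb = host_mem_limit_mb(gpu_memory_gb=16, num_gpus=4)

# Assumption: TensorFlow has not been imported yet in this process.
os.environ["TF_CUDA_HOST_MEM_LIMIT_IN_MB"] = str(limit_mb)
print(os.environ["TF_CUDA_HOST_MEM_LIMIT_IN_MB"])
```

The result is only a starting point and should be adjusted as needed for the model and the system.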
Example with adjustable image size
The Keras_ResNet50 example, found in the TensorFlow Large Model Support examples, uses synthetic random images with the Keras ResNet50 model to allow users a fast hands-on experience with LMS. The example allows users to change the image size, explore auto-tuning, and manually set the LMS tunable parameters. Advanced users can also use the command line parameters to enable CUDA profiling that can be used with the NVIDIA Visual Profiler to profile and visualize the tensor swapping.
Inference (prediction) considerations
Some TensorFlow operations perform differently during training than during inferencing. These behavior changes drive different portions of the model's neural network graph to be active or inactive during training and inferencing. For this reason, TFLMS needs to know whether the graph is being used for training or inferencing when it adds tensor swapping nodes. The LMS class has an is_training property that directs LMS on which portions of the graph to modify. By default, is_training is set to True. For inferencing, set this property to False before the LMS run method is called. For example:
# build up model with code here
lms = LMS()
lms.is_training = False
lms.run(graph)
# load weights
# predict / inference with model
Some use cases of TFLMS with TensorFlow Keras APIs have additional considerations. If the model needs TFLMS to successfully run inference and the model is built using code rather than the Keras load_model API, then keras=True must be passed to the LMS run method. For example:
model = create_model()
lms = LMS()
lms.is_training = False
lms.run(graph, keras=True)
# load weights
# predict / inference with model
Using LMS with saved models
Both TensorFlow and Keras have various ways to save models. Some of these methods save the model or graph definition and some methods save only the weights. Whether you need to enable large model support on the loaded model depends on several factors: whether you are loading the model for further training or for inferencing, and how the model was saved.
If TensorFlow MetaGraphs or SavedModels are saved after LMS has added swapping nodes to the model, the loaded model will contain swapping nodes. If only the model weights are saved and are restored onto a model that is built using code, then the model will only have LMS swapping nodes if LMS is re-run on the model.
Because TensorFlow MetaGraphs and SavedModels contain the swapping nodes in the graph, if the model uses operations that branch differently for training or inferencing, an LMS-enabled model that was trained, saved as a MetaGraph or SavedModel, and then loaded for inferencing will not swap tensors in the inference branches of those operations. If large model support is required to avoid out-of-memory situations during inferencing, this behavior of MetaGraphs and SavedModels with swapping nodes should be considered. This limitation can be worked around by saving the weights, rebuilding the model with code, running LMS on the model with is_training=False, and then applying the weights.
Keras models saved with tf.keras.models.save_model do not have LMS swapping nodes in them. If swapping is required in the loaded model, pass LMS to the tf.keras.models.load_model API. For example:
from tensorflow_large_model_support import LMS
lms_callback = LMS()
model = tf.keras.models.load_model(filepath, callbacks=[lms_callback])