Tutorial: TensorFlow with DDL
This tutorial explains the steps necessary to enable distributed deep learning (DDL) from within the mnist.py example provided in the WML CE distribution. DDL is indirectly integrated into TensorFlow in the form of a custom operator. The custom operator is provided as a shared library, which can be loaded and invoked from within a Python training script.
Enabling DDL in a TensorFlow program
The DDL TensorFlow operator makes it easy to run TensorFlow programs on a cluster. DDL can be enabled in your TensorFlow program using one of the following two approaches.
For either approach to work most efficiently, split the training data up among the DDL instances. The DDL library provides functions to help with this.
Enable DDL using default options for the current system configuration
In this approach, ddlrun determines the best options to run with and sets the DDL_OPTIONS environment variable. When the DDL library is imported in the TensorFlow program, it is initialized from the DDL_OPTIONS environment variable. You cannot override the environment variable using this approach.
The only change that is required to enable DDL in this manner is to import the DDL Python library:
import ddl
The mnist.py script was modified in this manner (and to split the training data, as described below) to enable DDL. The modified script can be found in $CONDA_PREFIX/ddl-tensorflow/examples/mnist/mnist-env.py.
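Since ddlrun sets DDL_OPTIONS in the environment of each launched process, a script can inspect that variable before importing the library. A minimal sketch; the check_ddl_options helper name is our own, not part of the DDL API:

```python
import os

def check_ddl_options():
    """Report the options ddlrun exported for this process, if any."""
    opts = os.environ.get("DDL_OPTIONS")
    if opts is None:
        print("DDL_OPTIONS is not set -- was the script launched with ddlrun?")
    else:
        print("DDL will initialize with:", opts)
    return opts
```

This can help distinguish a direct `python mnist-env.py` invocation from one launched through ddlrun.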
Enable DDL manually
In this approach, the DDL options are passed to the DDL library manually. When this approach is used, the grads_reduce function must be called explicitly. This approach gives you more control over the use of the DDL library.
The following changes are required to enable DDL in this manner:
- Import the DDL library:
import ddl
- Explicitly initialize DDL:
ddl.init(FLAGS.ddl_options)
- Replace the reduceAll function in the mnist script with the DDL grads_reduce function:
grads_and_vars = ddl.grads_reduce(grads_and_vars, average=True)
objective = optimizer.apply_gradients(grads_and_vars)
The mnist.py script was modified in this manner (and to split the training data, as described below) to enable DDL. The modified script can be found in $CONDA_PREFIX/ddl-tensorflow/examples/mnist/mnist-init.py.
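Conceptually, grads_reduce with average=True combines each gradient across all DDL instances and averages the results, so every instance applies the same update. The sketch below simulates that averaging in plain NumPy for illustration only; the simulate_grads_reduce helper and the gradient values are hypothetical, not part of the DDL API:

```python
import numpy as np

def simulate_grads_reduce(per_instance_grads):
    """Average each gradient tensor across instances, mimicking the
    averaging that ddl.grads_reduce(..., average=True) performs
    across the cluster."""
    return [np.mean(np.stack(grads), axis=0)
            for grads in zip(*per_instance_grads)]

# Two hypothetical instances, each holding gradients for two variables.
inst0 = [np.array([1.0, 3.0]), np.array([2.0])]
inst1 = [np.array([3.0, 5.0]), np.array([4.0])]
avg = simulate_grads_reduce([inst0, inst1])
# avg[0] -> [2.0, 4.0]; avg[1] -> [3.0]
```

In the real script the averaged gradients are then handed to optimizer.apply_gradients, as shown above.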
Split the training data among DDL instances
For both approaches to work most efficiently, the data should be split among the DDL instances. This can be done using the DDL functions size() and rank(). The size() function returns the total number of DDL instances, while rank() returns a unique integer identifying the current DDL instance.
The following code was added to the mnist.py script, in both the default and manual approaches, in order to split up the data:
batch_x, batch_y = mnist.train.next_batch(batch_size * ddl.size())
# Select one of the partitions
batch_x = np.split(batch_x, ddl.size())[ddl.rank()]
batch_y = np.split(batch_y, ddl.size())[ddl.rank()]
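The effect of this split can be checked without DDL by standing in fixed values for size() and rank(); in the sketch below, size=4, rank=1, and the random arrays replacing the MNIST reader are all assumptions chosen for the demonstration:

```python
import numpy as np

batch_size = 8
size, rank = 4, 1  # stand-ins for ddl.size() and ddl.rank()

# Fetch one enlarged batch, as mnist.train.next_batch(batch_size * ddl.size())
# would; random data replaces the MNIST reader here.
batch_x = np.random.rand(batch_size * size, 784)
batch_y = np.random.rand(batch_size * size, 10)

# Select this instance's partition.
batch_x = np.split(batch_x, size)[rank]
batch_y = np.split(batch_y, size)[rank]

print(batch_x.shape)  # (8, 784): each instance trains on batch_size examples
```

Each instance thus sees a disjoint batch_size-sized slice of the enlarged batch, which is what keeps the instances from duplicating work.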
Running a DDL-enabled TensorFlow script
Follow these steps to run a DDL-enabled TensorFlow script:
- Install the ddl-tensorflow conda package
Before running a DDL-enabled TensorFlow script, make sure a conda environment with the ddl-tensorflow package installed is activated. The mnist examples should also be copied into a user directory before being run. This copy can be done by running the following command:
ddl-tensorflow-install-samples <somedir>
- Run the script using ddlrun:
DDL-enabled programs should be launched using the ddlrun tool. See Using the ddlrun tool and ddlrun -h for more information about ddlrun.
- Run using DDL_OPTIONS
The DDL_OPTIONS environment variable should be set with the intended arguments. See DDL options for a list of DDL arguments. The following command launches the mnist-env.py script on host1 and host2:
ddlrun -H host1,host2 python mnist-env.py
- Run without using DDL_OPTIONS
The DDL options can be passed into the mnist-init.py script as an argument. See DDL options for a list of DDL arguments. Run the following command to launch the mnist-init.py script on host1 and host2:
ddlrun --no_ddloptions -H host1,host2 python mnist-init.py --ddl_options="-mode b:4x2"
- Run using