Getting started with Caffe
Find tips and tricks for getting started with Caffe.
Installing Caffe
Caffe is installed with WML CE, or it can be installed separately.
Caffe samples and examples
Each Caffe package includes example scripts and sample models. A script is provided to copy the sample content into a specified directory:
caffe-install-samples <somedir>
Optimizations in IBM enhanced Caffe
The IBM-enhanced Caffe package (caffe-ibm) in WML CE is based on BVLC Caffe and includes optimizations and enhancements from IBM®:
- IBM Watson™ Machine Learning Community Edition Distributed Deep Learning (DDL)
- Large Model Support (LMS)
- Fast R-CNN support from https://github.com/rbgirshick/caffe-fast-rcnn/tree/faster-rcnn-upstream-33f2445
- CPU/GPU layer-wise reduction
- Built with IBM's Mathematical Acceleration Subsystem (MASS) libraries
- CPU/GPU Affinity
- YOLOv2 support
- Limited CPU-only support
Command line options
IBM enhanced Caffe supports all of BVLC Caffe's options and adds a few new ones to control the enhancements. IBM enhanced Caffe options related to Distributed Deep Learning (options that start with the word ddl) only work if DDL is installed.
-bvlc
- Disable CPU/GPU layer-wise reduction.
-threshold
- If the number of parameters for one layer is greater than or equal to threshold, their accumulation on CPU is run in parallel. Otherwise, the accumulation is done by using one thread. It is set to 2,000,000 by default.
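The threshold rule above amounts to a simple per-layer size check; a minimal sketch of that decision (the function name is illustrative, not part of Caffe's API):

```python
# Sketch of the -threshold policy described above: layers with at least
# `threshold` parameters get parallel CPU accumulation; smaller layers
# are accumulated by a single thread. Default threshold per the docs.
THRESHOLD = 2_000_000

def accumulation_strategy(num_params, threshold=THRESHOLD):
    """Return how gradient accumulation for a layer would be scheduled."""
    return "parallel" if num_params >= threshold else "single-thread"

print(accumulation_strategy(4_000_000))  # parallel
print(accumulation_strategy(10_000))     # single-thread
```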
-ddl ["-option1 param -option2 param"]
- Enable Distributed Deep Learning, with an optional space-delimited parameter string. Supported parameters are:
  - mode <mode>
  - dump_iter <N>
  - dev_sync <0, 1, or 2>
  - rebind_iter <N>
  - dbg_level <0, 1, or 2>
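As an illustration only, a DDL invocation with a parameter string might be assembled as follows. The parameter values below are placeholders, not verified DDL settings, and the command is printed rather than executed so it can be adapted to an actual DDL-enabled caffe-ibm install:

```shell
# Build and print a sample DDL command line (values illustrative; run it
# only on a system with the DDL-enabled caffe-ibm package installed).
ddl_opts='-dev_sync 1 -dbg_level 1'
cmd="caffe train -solver solver.prototxt -ddl \"$ddl_opts\""
echo "$cmd"
```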
-ddl_update
- This option instructs Caffe to use a new custom version of the ApplyUpdate function that is optimized for DDL. It is faster, but does not support gradient clipping and is off by default. It can be used in networks that do not use gradient clipping (which is common).
-ddl_align
- This option ensures that the gradient buffers have a length that is a multiple of 256 bytes and have start addresses that are multiples of 256. This ensures cache line alignment on multiple platforms and alignment with NCCL slices. This option is off by default.
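The 256-byte alignment rule can be pictured as rounding a buffer length up to the next multiple of 256; a small sketch (align_up is illustrative, not a Caffe function):

```python
# Pad a gradient buffer length up to the next multiple of 256 bytes,
# mirroring the alignment -ddl_align enforces.
def align_up(nbytes, alignment=256):
    return (nbytes + alignment - 1) // alignment * alignment

print(align_up(1000))  # 1024
print(align_up(256))   # 256
```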
-ddl_database_restart
- This option ensures that every learner always looks at the same data set during an epoch, allowing a system to cache only the pages that are touched by the learners that are contained within it. It can help size the number of learners that are needed for a specific data set size by establishing a known database footprint per system. Do not use this flag while you are running Caffe on several hosts. This option is off by default.
-lms
- Enable Large Model Support. See Large Model Support.
-lms_size_threshold <size in KB>
- Set LMS size threshold. See Large Model Support.
-lms_exclude <size in MB>
- Tune LMS memory usage. See Large Model Support.
-affinity
- Enable CPU/GPU affinity (default). Specify -noaffinity to disable.
Use the command line options as follows:
| Feature | -bvlc | -ddl | -lms | -gpu | -affinity |
| ------------------------------- | ----- | ---- | ----- | ------------- | --------- |
| CPU/GPU layer-wise reduction | N | X | X | multiple GPUs | X |
| Distributed Deep Learning (DDL) | X | Y | X | N | X |
| Large model support | X | X | Y | X | X |
| CPU/GPU affinity | X | X | X | X | Y |
Y: do specify
N: don't specify
X: don't care/matter
LMS is enabled regardless of other options when -lms is specified. For example, you can use DDL and LMS together.
CPU/GPU layer-wise reduction is enabled only if multiple GPUs are specified and layer_wise_reduce: false is set.
Use of multiple GPUs with DDL is specified through the MPI rank file, so the -gpu flag cannot be used to specify multiple GPUs for DDL.
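GPU placement for DDL therefore comes from the MPI rank file. A hypothetical Open MPI-style rank file mapping four learners across two hosts might look like this (the host names and slot layout are illustrative, and tools such as ddlrun may generate this file for you):

```
rank 0=host1 slot=0
rank 1=host1 slot=1
rank 2=host2 slot=0
rank 3=host2 slot=1
```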
While you are running Caffe on several hosts, the use of shared storage for data can lead Caffe to hang.
CPU/GPU layer-wise reduction
This optimization aims to reduce the running time of a multiple-GPU training by using CPUs. In particular, gradient accumulation is offloaded to CPUs and done in parallel with the training. To gain the best performance with IBM enhanced Caffe, close unnecessary applications that use a high percentage of CPU.
If you are using a single GPU, IBM enhanced Caffe and BVLC Caffe have a similar performance.
The optimizations in IBM enhanced Caffe do not change the convergence of a neural network during training. IBM enhanced Caffe and BVLC Caffe should produce the same convergence results.
CPU/GPU layer-wise reduction is enabled unless the -bvlc command line flag is used.
IBM Watson Machine Learning Community Edition Distributed Deep Learning (DDL)
See Getting started with DDL for more information about using IBM Watson Machine Learning Community Edition Distributed Deep Learning.
Large Model Support (LMS)
IBM enhanced Caffe with Large Model Support loads the neural model and data set in system memory
and caches activity to GPU memory only when needed for computation. This action allows models and
training batch size to scale significantly beyond what was previously possible. You can enable Large
Model Support by adding -lms.
The -lms_size_threshold <size in KB> option modifies the minimum memory chunk size that is considered for the LMS cache (default: 1000). Any chunk smaller than this value is exempt from LMS reuse and persists in GPU memory. This value can be used to control the performance tradeoff.
The -lms_exclude <size in MB> option defines a soft limit on the GPU memory that is allocated for the LMS cache (where limit = GPU capacity - value). A value of zero (the default) favors aggressive GPU memory reuse over allocation. A value greater than zero enables aggressive allocation of GPU memory up to the limit. Minimizing this value, while still allowing enough memory for non-LMS allocations, might improve performance by increasing GPU memory utilization and reducing data transfers between system and GPU memory.
For example, the following command line options yield the best training performance for the GoogleNet model with high-resolution image data (crop size 2240x2240, batch size 5) using Tesla P100 GPUs:
caffe train -solver=solver.prototxt -gpu all -lms -lms_size_threshold 1000 -lms_exclude 1400
Ideal tunings for any given scenario can differ depending on the model's network architecture, data size, batch size, and GPU memory capacity. This is particularly true for the -lms_exclude option, so it is a best practice to determine its value experimentally, arriving at the smallest value that does not result in an out-of-memory error.
Combining LMS and DDL
Large Model Support and Distributed Deep Learning can be combined. For example, to run on two hosts that are named host1 and host2:
ddlrun -H host1,host2 caffe train -solver solver-resnet-152.prototxt -lms
CPU-only support
IBM enhanced Caffe includes limited support for CPU-only operation.
Training of large models is much slower without a GPU, so this support is best suited for inferencing (classification) or experimenting with small models.
To use CPU-only mode:
- Do not specify -gpu on the caffe command line
- Code solver_mode: CPU in your solver.prototxt file
- Call caffe.set_mode_cpu() when using Caffe from Python
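For the solver.prototxt change, a minimal fragment might look like this (the net path is illustrative):

```
net: "train_val.prototxt"
solver_mode: CPU
```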
Invoke caffe training using the command line:
caffe train -solver=solver.prototxt
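From Python, the CPU-only step can be sketched as follows. The import is guarded with a try/except so the snippet degrades gracefully on a machine where pycaffe is not installed:

```python
# Select CPU-only mode from Python (assumes pycaffe from the Caffe
# package is available; guarded so the sketch is safe to run anywhere).
try:
    import caffe
    caffe.set_mode_cpu()  # force CPU-only execution for this process
    mode = "cpu"
except ImportError:
    mode = "caffe not installed"
print(mode)
```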
More information
Go to the Caffe website for tutorials and example programs that you can run to get started.
See these example programs:
- LeNet MNIST Tutorial - Train a neural network to understand handwritten digits.
- CIFAR-10 tutorial - Train a convolutional neural network to classify small images.