Getting started with IBM Distributed Accelerated ML library
Find information about getting started with pai4sk and SnapML.
This release of WML CE includes a conda package named pai4sk, which includes the IBM accelerated Machine Learning library.
The main component of this library is the set of SnapML APIs. Snap ML is a library for training
generalized linear models. It is being developed at IBM with the goal of removing training time as a
bottleneck for machine learning applications. Snap ML supports many classical machine learning
models and scales gracefully to data sets with billions of examples or features. Snap ML training
can be performed on a single machine or distributed across a cluster of machines. It also offers GPU
acceleration and supports sparse data structures. The library is exposed through a Python API
compatible with sklearn and can be integrated seamlessly into existing Python applications. The
following APIs are supported:
- LogisticRegression
- LinearRegression
- SupportVectorMachine
- DecisionTreeClassifier (single-threaded CPU version only)
- RandomForestClassifier (CPU version only)
SnapML uses a proprietary data format named snap for efficient data loading in both single-node and
multiple-node training. APIs are provided to load and store datasets in snap format.
Because pai4sk is built upon scikit-learn library version 0.20.1, it can be used as a replacement for
scikit-learn. Some of the APIs are accelerated by making use of SnapML and cuML under the hood. This
module automatically falls back to the original scikit-learn behavior when SnapML or cuML does not
provide the necessary support. The following APIs are accelerated in this way:
- Ridge
- Lasso
- LogisticRegression
- SupportVectorMachine LinearSVC
- Clustering KMeans (depends on cuML)
- Clustering DBSCAN (depends on cuML)
- Decomposition PCA (depends on cuML)
- Decomposition TruncatedSVD (depends on cuML)
- Dataset load_svmlight_file
- Metrics log_loss
- Metrics accuracy_score
- Metrics hinge_loss
- Metrics mean_squared_error
- Metrics Similarity Search
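
The fallback behavior described above can be pictured as a thin dispatch layer: try the accelerated implementation first, and use the original one when the accelerated path reports that it cannot handle the request. The following is a minimal sketch of that pattern only, not the actual pai4sk implementation. The names `UnsupportedByAccelerator`, `accelerated_mean_squared_error`, and `reference_mean_squared_error`, and the rule that the accelerated path rejects sample weights, are hypothetical and chosen purely for illustration.

```python
class UnsupportedByAccelerator(Exception):
    """Hypothetical signal that the accelerated path cannot handle a call."""


def reference_mean_squared_error(y_true, y_pred, sample_weight=None):
    # Plain-Python reference path; stands in for the original scikit-learn behavior.
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    total = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, sample_weight))
    return total / sum(sample_weight)


def accelerated_mean_squared_error(y_true, y_pred, sample_weight=None):
    # Stand-in for an accelerated kernel. Assumption for illustration only:
    # this path does not support sample weights.
    if sample_weight is not None:
        raise UnsupportedByAccelerator("sample_weight is not supported")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred, sample_weight=None):
    # Try the accelerated path first; fall back when it declines the request.
    # The second tuple element records which path produced the result.
    try:
        return accelerated_mean_squared_error(y_true, y_pred, sample_weight), "accelerated"
    except UnsupportedByAccelerator:
        return reference_mean_squared_error(y_true, y_pred, sample_weight), "fallback"


unweighted = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
weighted = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], sample_weight=[1.0, 1.0, 2.0])
print(unweighted)
print(weighted)
```

With a dispatch layer of this kind the caller uses the same function in both cases; only the path taken differs.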
- On a single system without an InfiniBand setup, use the --pami_noib option of mpirun.
- On multiple systems without an InfiniBand setup, use -mca btl tcp,self instead of the -tcp option of mpirun.
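
As an illustration, the two cases above can be encoded in a small helper that assembles the mpirun argument list. This is a sketch under assumptions: the helper name `build_mpirun_command`, the script name `train.py`, and the process count are hypothetical. Only the `--pami_noib` and `-mca btl tcp,self` options come from the text above, and `-np` is the usual mpirun option for the number of processes.

```python
import shlex


def build_mpirun_command(num_processes, program_args, multiple_systems):
    """Assemble an mpirun command line for hosts without InfiniBand."""
    command = ["mpirun", "-np", str(num_processes)]
    if multiple_systems:
        # Multiple systems without InfiniBand: select the TCP and self transports.
        command += ["-mca", "btl", "tcp,self"]
    else:
        # Single system without InfiniBand.
        command += ["--pami_noib"]
    return command + list(program_args)


# Hypothetical application: a script named train.py started with 4 processes.
single = build_mpirun_command(4, ["python", "train.py"], multiple_systems=False)
multi = build_mpirun_command(4, ["python", "train.py"], multiple_systems=True)
print(shlex.join(single))
print(shlex.join(multi))
```

A run across multiple systems would also need a host list, which this sketch leaves out.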
To run pai4sk applications in a distributed way, use snaprun to start the application. The snaprun tool:
- Determines the necessary arguments to pass to MPI based on the current environment and version of MPI.
- Tests connections to the hosts, including the correct setup of ssh keys.
- Verifies that pai4sk is installed across the hosts.
- Detects the hardware configuration of the hosts, including GPU count, and generates a valid topology.
- Generates the necessary rankfile, providing options to specify more specific topology details.
- Constructs, displays, and executes the mpirun command needed to distribute jobs to each node.
Run snaprun -h to get the usage details of this tool.
Example programs for each of the above-mentioned APIs are provided as part of the conda package.
To find out how to run the sample programs, refer to the READMEs placed under
$CONDA_PREFIX/pai4sk/local-examples/ and $CONDA_PREFIX/pai4sk/mpi-examples/.
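
To locate those READMEs from Python, a short helper along the following lines may be convenient. It is a sketch that assumes the READMEs have file names starting with `README` and may sit in nested example folders. The function name `find_example_readmes` is hypothetical, and only the two directory paths come from the text above.

```python
import os
from pathlib import Path


def find_example_readmes(conda_prefix):
    """Return README files under the two pai4sk example directories."""
    readmes = []
    for subdir in ("local-examples", "mpi-examples"):
        examples_dir = Path(conda_prefix) / "pai4sk" / subdir
        if examples_dir.is_dir():
            # READMEs may sit in nested example folders, so search recursively.
            readmes.extend(sorted(examples_dir.rglob("README*")))
    return readmes


conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
    for readme in find_example_readmes(conda_prefix):
        print(readme)
else:
    print("CONDA_PREFIX is not set; activate the conda environment first.")
```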
Sample Jupyter notebooks are provided in this GitHub repository.
cuML is in technology preview for this release.