Create an inference service for model deployment that uses elastic distributed inference.
Before you begin
Make sure that you have the model files for the inference service that you are creating. To create the inference service, you need a kernel file, a model.json file, a readme file, and the model to be deployed.
If you are using a storage volume to access data, an administrator must create the storage volume in Cloud Pak for Data and grant users permissions to the storage volume.
About this task
Elastic distributed inference model deployment consists of three parts:
- Model assets: metadata in the database (model.json, README.md) and your model files on shared storage
- Model service: the client-facing inference interface (available through the REST API and the WML Accelerator console)
- Model kernels: load the model assets and run inference on GPU or CPU
To deploy a model, you create an inference service by uploading your model assets and creating the required model metadata. When you start the model, the service creates model kernel instances on demand, based on the current inference request throughput.
The elastic distributed inference service has three working modes:
- CPU mode: Model kernels run on CPU only
- GPU exclusive mode: Model kernels run on GPU, and one kernel instance uses one GPU
exclusively
- GPU packing mode: Model kernels run on GPU, and multiple kernel instances share GPU
Once published, ensure that the model has been tested and that there are enough CPU, memory, and GPU resources available for the inference service to run.
Note: You can edit the inference service to set the inference working mode only after the model is deployed. See Edit an inference service.
Procedure
- Create a working directory and copy your model files into that directory.
- Create the model.json file for the model and add it to the
directory. Here is an example where GPU is enabled:
{
  "name": "resnet18-pytorch",
  "runtime": "dlipy3",
  "kernel_path": "kernel.py",
  "weight_path": "./",
  "tag": "example,dev-use",
  "attributes": [
    { "key": "foo", "value": "bar" }
  ],
  "mk_environments": [
    { "key": "ENV_FOO", "value": "VALUE_BAR" }
  ],
  "schema_version": "1"
}
Populate your
model.json file with the following parameters:
- schema_version: Set this value to 1.
- name: Specify the name of the deployed model as it will appear in Cloud Pak
for Data. The name specified cannot include the underscore character.
- runtime: Specify the conda environment that runs the model. By default, the following environment is included with Watson Machine Learning Accelerator: dlipy3
- kernel_path: Set this to the name of your kernel file, for example: kernel.py. To learn about kernel files and how to create them, see Create a kernel file for an inference service.
- weight_path: If needed, specify the relative path to the model weight file. This value cannot be an empty string.
- tag: Optional. Any keywords to identify your model.
- attributes: Optional. Set any key-value attributes if
needed. The key-value attributes are passed to the kernel file.
Note: In Version 2.4.1 and later, elastic distributed inference uses attributes with the prefix cpd-volumes-* to mount model data from a Cloud Pak for Data storage volume into the model kernel. For example:
"attributes" :
[
{ "key": "cpd-volumes-<volume instance name>",
"value": "[/<mount_path>:ro]"
},
...
]
- key is set to the volume instance name for the PVC storage volume.
- value is set to the attribute specified in Cloud Pak for Data. Setting this value overrides the value currently specified. Supported values include the mount path and read-only.
To create a storage volume, see Establish access to a volume. After the model is deployed and started, the persistent volume claim defined in the storage volume is mounted to the kernel pod. If the mount path is not specified, the mount path defined in the storage volume metadata is used. When the storage is first mounted, it is set to read-only. Cloud Pak for Data users can change the read/write permission of the storage.
- mk_environments: Optional. Set any key-value
deployment environment variables if needed. The key-value environment variables will be set when the
kernel file is run.
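Before deploying, it can help to confirm that your model.json is well-formed and satisfies the constraints above. The following script is an illustrative sketch, not part of the product; the required-key set reflects the parameters described in this procedure.

```python
import json

# Illustrative sanity check (not a product tool): verify that a model.json
# file parses as JSON and contains the keys this procedure requires.
REQUIRED_KEYS = {"name", "runtime", "kernel_path", "weight_path", "schema_version"}

def validate_model_json(path):
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON (e.g. trailing commas)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"model.json is missing keys: {sorted(missing)}")
    if "_" in cfg["name"]:
        raise ValueError("the model name cannot include the underscore character")
    if not cfg["weight_path"]:
        raise ValueError("weight_path cannot be an empty string")
    return cfg
```

A check like this catches the most common mistakes (trailing commas, smart quotes, an underscore in the model name) before the deploy step fails.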
- Create a README.md file. A readme file describes what the input data
looks like and what type of inference server response is expected. Add this file to your working
directory. For example:
## Input data:
```
{
  "id": 1,
  "inputs": [
    {
      "name": "gpu_0/data",
      "shape": [478, 640],
      "datatype": "BYTES",
      "data": <base64_string of imagefile>
    }
  ],
  "outputs": []
}
```
## The response:
```
{'id': 1, 'outputs': [{'name': 'output0', 'datatype': 'FP32', 'shape': [1, 10], 'data': [0.07383891940116882, ...
```
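A client can build the request body shown in the readme example with a few lines of standard-library Python. This sketch is illustrative only: the field names ("gpu_0/data", the shape, the datatype) come from the example above and must be adjusted to match your own model.

```python
import base64
import json

# Illustrative request builder for the input schema in the README example.
# The "data" field carries the base64 string of the image file.
def build_request(image_bytes, request_id=1):
    payload = {
        "id": request_id,
        "inputs": [
            {
                "name": "gpu_0/data",        # input name from the example; model-specific
                "shape": [478, 640],          # image dimensions from the example
                "datatype": "BYTES",
                "data": base64.b64encode(image_bytes).decode("ascii"),
            }
        ],
        "outputs": [],
    }
    return json.dumps(payload)
```

The resulting JSON string can then be sent to the inference service's REST endpoint.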
- Create a kernel file, see Create a kernel file for an inference service.
The kernel file must be added to your working directory.
- Deploy the model:
dlim model deploy -p working_directory
where working_directory is the full path to your working directory, which includes your model files, kernel.py, model.json, and README.md.
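Because the deploy step expects kernel.py, model.json, and README.md together in the working directory, a quick pre-flight check like the following (illustrative, not part of the dlim tooling) can confirm nothing is missing:

```python
import os

# Illustrative pre-deployment check: the deploy step expects the kernel file,
# model.json, and README.md in the working directory alongside the model files.
REQUIRED_FILES = ["kernel.py", "model.json", "README.md"]

def check_working_directory(path):
    missing = [name for name in REQUIRED_FILES
               if not os.path.isfile(os.path.join(path, name))]
    if missing:
        raise FileNotFoundError(f"working directory is missing: {missing}")
    return True
```

Run this against the directory you pass to dlim model deploy -p before deploying.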
- Start your inference service. See Start an inference service.
Note: Elastic distributed inference creates a model service deployment, and the
cpd-scheduler allocates memory to run the service deployment. There must be
enough GPUs to run the model service for the kernel pods to start.