Create an inference service

Create an inference service for model deployment that uses elastic distributed inference.

Before you begin

Make sure that you have the model files for the model that you want to deploy. To create the inference service, you need a kernel file, a model.json file, a README.md file, and the model files to be deployed.

If you are using a storage volume to access data, an administrator must create the storage volume in Cloud Pak for Data and give users permission to access it.

About this task

Elastic distributed inference has three parts for model deployment:

  • Model assets: metadata in the database (model.json, README.md) and your model files on shared storage
  • Model service: the client-facing inference interface (available through the REST API and the WML Accelerator console)
  • Model kernels: processes that load the model assets and run inference on GPU or CPU

To deploy a model, you create an inference service by uploading your model assets and creating the required model metadata. When you start the model, the service creates model kernel instances on demand, based on the current inference request throughput.

The elastic distributed inference service has three working modes:
  • CPU mode: Model kernels run on CPU only.
  • GPU exclusive mode: Model kernels run on GPU, and each kernel instance uses one GPU exclusively.
  • GPU packing mode: Model kernels run on GPU, and multiple kernel instances share a GPU.
After the model is published, ensure that it has been tested and that there are enough CPU, memory, and GPU resources available for the inference service to run.
Note: You can set the working mode only by editing the inference service after the model is deployed. See Edit an inference service.

Procedure

  1. Create a working directory and copy your model files into that directory.
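    For example, after you complete the remaining steps, a working directory for the resnet18-pytorch model that is shown in the next step might contain the following files (the weight file name is illustrative):
    ```
    resnet18-pytorch/
        kernel.py
        model.json
        README.md
        resnet18.pth
    ```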
  2. Create the model.json file for the model and add it to the directory. Here is an example where GPU is enabled:
    {
        "name": "resnet18-pytorch",
        "runtime": "dlipy3",
        "kernel_path": "kernel.py",
        "weight_path": "./",
        "tag": "example,dev-use",
        "attributes": [
            { "key": "foo", "value": "bar" }
        ],
        "mk_environments": [
            { "key": "ENV_FOO", "value": "VALUE_BAR" }
        ],
        "schema_version": "1"
    }
    Populate your model.json file with the following parameters:
    • schema_version: Set this value to 1.
    • name: Specify the name of the deployed model as it appears in Cloud Pak for Data. The name cannot include the underscore character (_).
    • runtime: Specify the conda environment that runs the model. By default, the following environment is included with Watson Machine Learning Accelerator: dlipy3.
    • kernel_path: Set this value to the name of your kernel file, for example: kernel.py. To learn about kernel files and how to create them, see Create a kernel file for an inference service.
    • weight_path: Specify the relative path to the model weight file. This value cannot be an empty string.
    • tag: Optional. Any keywords that identify your model.
    • attributes: Optional. Set any key-value attributes if needed. The key-value attributes are passed to the kernel file.
      Note: In Version 2.4.1 and later, elastic distributed inference uses attributes with the prefix cpd-volumes-* to make model data from a Cloud Pak for Data storage volume available to the model kernel. For example:
          "attributes" :
          [
              { "key": "cpd-volumes-<volume instance name>",
                "value": "[/<mount_path>:ro]"
              },
              ...
          ]
      • key is set to the volume instance name for the PVC storage volume.
      • value is set to the mount attributes for the volume. Setting this value overrides the attributes that are currently specified in Cloud Pak for Data. Supported values include the mount path and read-only (ro).
      To create a storage volume, see Establish access to a volume.

      After the model is deployed and started, the persistent volume claim that is defined in the storage volume is mounted to the kernel pod. If the mount path is not specified, the mount path that is defined in the storage volume metadata is used. When the storage is first mounted, it is set to read-only. Cloud Pak for Data users can change the read/write permission of the storage.

    • mk_environments: Optional. Set any key-value deployment environment variables if needed. The key-value environment variables are set when the kernel file is run, as shown in the example that follows.
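    For example, a kernel file can read these environment variables at run time with standard Python. This sketch assumes the ENV_FOO key from the model.json example earlier in this step:
    ```
    import os

    # mk_environments entries from model.json are set as environment variables
    # before the kernel file runs, so the kernel can read them with os.environ.
    # ENV_FOO is the example key from the model.json sample above.
    env_foo = os.environ.get("ENV_FOO", "default-value")
    print("ENV_FOO is set to: " + env_foo)
    ```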
  3. Create a README.md file. The readme file describes what the input data looks like and what type of inference server response is expected. Add this file to your working directory. For example:
    ## Input data:
    ```
    {
      "id": 1,
      "inputs": [
        {
          "name": "gpu_0/data",
          "shape": [478, 640],
          "datatype": "BYTES",
          "data": <base64_string of imagefile>
        }
      ],
      "outputs": []
    }
    ```
    ## The response:
    ```
    {'id': 1, 'outputs': [{'name': 'output0', 'datatype': 'FP32', 'shape': [1, 10], 'data': [0.07383891940116882, ...
    ```
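    The input format that is documented in the README corresponds to the JSON body that a client sends to the inference endpoint of the model service. The following sketch shows one way to build such a request in Python; the image file name, the endpoint URL, and any authentication details are placeholders that depend on your deployment:
    ```
    import base64
    import json

    import requests  # third-party HTTP client, used here only for illustration

    # Encode the image as a base64 string, as described in the README example.
    with open("example.jpg", "rb") as f:  # placeholder image file
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Build the request body in the format that the README documents.
    payload = {
        "id": 1,
        "inputs": [
            {
                "name": "gpu_0/data",
                "shape": [478, 640],  # image height and width
                "datatype": "BYTES",
                "data": image_b64,
            }
        ],
        "outputs": [],
    }

    # Placeholder: copy the real inference endpoint for your model service from
    # the WML Accelerator console; authentication headers may also be required.
    url = "https://<your_model_inference_endpoint>"
    response = requests.post(url, json=payload)
    print(json.dumps(response.json(), indent=2))
    ```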
  4. Create a kernel file and add it to your working directory. To learn how, see Create a kernel file for an inference service.
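    The following minimal kernel sketch is for orientation only. The module, class, and method names that it uses (redhareapi, Kernel, on_kernel_start, on_task_invoke, on_kernel_shutdown) are assumptions about a typical elastic distributed inference kernel; the authoritative interface is described in Create a kernel file for an inference service.
    ```
    import json

    # Assumption: the inference runtime provides a Kernel base class in the
    # kernel environment; see Create a kernel file for an inference service
    # for the authoritative module and method names.
    from redhareapi import Kernel

    class ExampleKernel(Kernel):
        def on_kernel_start(self, kernel_context):
            # Called once when the kernel instance starts:
            # load model weights from the deployed model assets here.
            pass

        def on_task_invoke(self, task_context):
            # Called for each inference request: parse the input,
            # run the model, and set the output data.
            request = json.loads(task_context.get_input_data())
            result = {"id": request.get("id"), "outputs": []}
            task_context.set_output_data(json.dumps(result))

        def on_kernel_shutdown(self):
            # Called when the kernel instance stops: release resources here.
            pass

    if __name__ == "__main__":
        ExampleKernel().run()
    ```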
  5. Deploy the model:
    dlim model deploy -p working_directory
    where working_directory is the full path to your working directory, which includes your model files, kernel.py, model.json, and README.md.
  6. Start your inference service. See Start an inference service.
    Note: Elastic distributed inference creates a model service deployment, and the cpd-scheduler allocates memory to run the service deployment. There must be enough GPUs available to run the model service for the kernel pods to start.

What to do next