Create an inference service

Create an inference service for model deployment that uses elastic distributed inference.

Before you begin

Make sure that you have the model files for which you are creating the inference service. To create the inference service, you must also create a kernel file, a model.json file, and a README.md file.

About this task

Elastic distributed inference model deployment has three parts:

  • Model assets: metadata and your model files (kernel.py, model.json, and README.md)
  • Model service: the client-facing inference interface (available through the REST API and the WML Accelerator console)
  • Model kernels: use GPUs to load model assets and run inference

To publish a model, you create an inference service by uploading your model assets and creating the required model metadata. When you start the model, the service creates model kernels based on the current inference request throughput (you can adjust the minimum and maximum number of kernels).

In GPU exclusive mode, one elastic distributed inference model kernel uses one GPU. In GPU shared mode (also called GPU packing mode), multiple kernels of one elastic distributed inference model can share one GPU. After the model is published, ensure that it has been tested and that there are enough GPU resources available for the inference service to run.

Procedure

  1. Create a working directory and copy your model files into that directory.
  2. Create the model.json file for the model and add it to the directory. Here is an example where GPU is enabled:
    {
      "weight_path": "./",
      "schema_version": "1",
      "name": "resnet18-pytorch",
      "runtime": "dlipy3",
      "kernel_path": "kernel.py",
      "model_path": "./"
    }
    Populate your model.json file with the following parameters:
    • weight_path: If needed, specify the relative path to the model weight file.
    • schema_version: Set this value to 1.
    • name: Specify the name of the deployed model as it will appear in Cloud Pak for Data. The name specified cannot include the underscore character.
    • runtime: Specify the conda environment that runs the model. By default, the following environment is included with Watson Machine Learning Accelerator: dlipy3.
    • kernel_path: Set this to the name of your kernel file, for example: kernel.py. To learn about kernel files and how to create them, see Create a kernel file for an inference service.
    • model_path: Specify the relative path to the location of your model files. This location is where model assets are uploaded (a model subdirectory in the EDI model repository PV). It is mounted to the kernel pod through a subPath mount and is read-only.

      The kernel code can get this path from the kernel context only when on_kernel_start is invoked, and model_path is the top-level directory. Other paths, such as kernel attributes (kernel.py), that are specified as relative paths must be located under the directory specified by model_path. (See the kernel file sketch in step 4.)

    Additionally, the following settings can be specified (see the example after this list):
    • GPU: GPU allocation policy. Options include: no, shared, or exclusive. If set to shared, this indicates GPU packing is enabled.

    • Resource plan: Resource plan where the kernel runs. An absolute resource plan path can be specified. Default: sample-project/inference.

    • Resources: CPU and memory resource settings for a kernel pod. Default:
      ncpus=0.5,ncpus_limit=2,mem=1024,mem_limit=4096
      • When GPU packing is enabled (GPU is set to shared), the ncpus_limit and mem_limit values are used to start the kernel pod.
      • When GPU packing is disabled (GPU is set to no or exclusive), the ncpus and mem values are used to start the kernel pod.
      Tip: For running TensorFlow or PyTorch, it is recommended that you increase the default memory value by setting mem to at least 2048 (mem=2048).
    • Accelerator resources: GPU resource settings for a kernel pod when GPU allocation policy is set to shared or exclusive.

    • For more parameters, see Edit an inference service.
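    If you script the creation of model.json, the following minimal Python sketch writes the file with some of these optional settings filled in. The optional key names used here (gpu, resource_plan, resources) are assumptions for illustration and might not match the exact field names in your release; see Edit an inference service for the supported parameters.
    # Sketch: generate model.json with optional settings filled in.
    # The optional key names below ("gpu", "resource_plan", "resources") are
    # assumptions for illustration; see Edit an inference service for the
    # parameters that your release supports.
    import json

    model_config = {
        "weight_path": "./",
        "schema_version": "1",
        "name": "resnet18-pytorch",
        "runtime": "dlipy3",
        "kernel_path": "kernel.py",
        "model_path": "./",
        "gpu": "shared",  # assumed key for the GPU allocation policy: no, shared, or exclusive
        "resource_plan": "sample-project/inference",  # assumed key for the resource plan
        "resources": "ncpus=0.5,ncpus_limit=2,mem=2048,mem_limit=4096",  # assumed key; mem raised per the Tip
    }

    with open("model.json", "w") as f:
        json.dump(model_config, f, indent=2)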
  3. Create a README.md file that describes what the input data looks like and what type of response the inference server returns. Add this file to your working directory. For example (a request-building sketch follows this example):
    ## Input data:
    ```
    {
      "id": 1,
      "inputs": [
        {
          "name": "gpu_0/data",
          "shape": [478, 640],
          "datatype": "BYTES",
          "data": <base64_string of imagefile>
        }
      ],
      "outputs": []
    }
    ```
    ## The response:
    ```
    {'id': 1, 'outputs': [{'name': 'output0', 'datatype': 'FP32', 'shape': [1, 10], 'data': [0.07383891940116882, ...
    ```
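    The following Python sketch builds a request body that matches the input schema above by base64-encoding an image file. The image file name and the endpoint URL are placeholders; obtain the actual REST endpoint for your inference service from the WML Accelerator console.
    # Sketch: build and send an inference request in the format shown in README.md.
    # The image file name and endpoint URL are placeholders (assumptions).
    import base64
    import json

    import requests

    with open("sample.jpg", "rb") as f:  # illustrative input image
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "id": 1,
        "inputs": [
            {
                "name": "gpu_0/data",
                "shape": [478, 640],
                "datatype": "BYTES",
                "data": image_b64,
            }
        ],
        "outputs": [],
    }

    endpoint_url = "https://<inference-service-endpoint>"  # placeholder URL
    response = requests.post(endpoint_url, json=payload)
    print(json.dumps(response.json(), indent=2))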
  4. Create a kernel file; see Create a kernel file for an inference service. Add the kernel file to your working directory.
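    For orientation only, the following sketch shows the general shape of a kernel file: a kernel class that prepares the model when on_kernel_start runs (the only point at which model_path is available from the kernel context) and answers each inference request. The base class and the context accessor names used here are placeholders, not the real API; use the classes and methods documented in Create a kernel file for an inference service.
    # Sketch only: the base class and the kernel/task context accessors below are
    # placeholders (assumptions); use the actual kernel API described in
    # Create a kernel file for an inference service.
    import json
    import os


    class MyModelKernel:  # placeholder for the elastic distributed inference kernel base class
        def on_kernel_start(self, kernel_context):
            # model_path (the top-level directory of the uploaded model assets)
            # is available from the kernel context only at this point.
            model_dir = kernel_context.get_model_path()  # assumed accessor name
            self.weight_file = os.path.join(model_dir, "model.pt")  # illustrative weight file
            # Load your framework-specific model from self.weight_file here.

        def on_task_invoke(self, task_context):  # assumed per-request hook
            request = json.loads(task_context.get_input_data())  # assumed accessor
            # Run inference with the loaded model, then return a response in the
            # structure shown in README.md.
            response = {"id": request.get("id"), "outputs": []}
            task_context.set_output_data(json.dumps(response))  # assumed accessor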
  5. Deploy the model:
    dlim model deploy -p working_directory
    where working_directory is the full path to your working directory, which contains your model files together with kernel.py, model.json, and README.md.
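    For example, if your model files were prepared in /home/user/models/resnet18-pytorch (an illustrative path that contains kernel.py, model.json, README.md, and the model weight files), the command is:
    dlim model deploy -p /home/user/models/resnet18-pytorch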
  6. Start your inference service; see Start an inference service.
    Note: Elastic distributed inference creates a model service deployment, and the cpd-scheduler allocates memory to run it. Enough GPUs must be available to run the model service for the kernel pods to start.

What to do next