Edit an inference service

Edit an inference service in Watson Machine Learning Accelerator 2.3.

Procedure

  • To edit an inference service using the command line:
    1. Before editing the deployed model, stop the inference service.
      dlim model stop model_name
      where model_name is the name of your model.
    2. Get the model json file.
      dlim model viewprofile model_name -j > modelA_profile.json
      where model_name is the name of your model and modelA_profile.json is the name of your JSON file.
    3. Edit the JSON file and save your changes.

      The following parameters are available (an example sketch of an edited profile follows this list):

      Replica: Number of copies of the service to run for a model.

      Kernel delay release time: Time (in seconds) that the system waits, after detecting that more kernels are running than the load requires, before it stops the extra kernels. Must be greater than 0.

      Kernel Min: Minimum number of kernels to always keep running. Must be greater than 0. Do not set higher than the number of slot resources you want to consume all the time, and do not set higher than the total number of slot resources in the resource plan.

      Kernel Max: Maximum number of kernels to scale up to. Specify -1 for unlimited number of kernels or a number greater than or equal to the value set for Kernel Min.

      Schedule interval: How often the service re-evaluates the number of kernels that are running based on current load. Time in seconds. Must be greater than 0.

      Stream discard slow tasks: If true, after newer tasks are completed, older tasks are discarded instead of being returned to the gRPC streaming client. Applies to gRPC streaming clients only; not applicable to REST clients. To enable, specify true; otherwise, set to false.

      Stream number per group: Number of streams per resource group.

      Task Execution Timeout: Time (in seconds) to wait for an individual inference request task to complete. Specify -1 for no timeout or a value greater than 0.

      Task Batch Size: Number of tasks that can be dispatched to a single kernel in a single call. Used with a GPU-enabled kernel that requires a high batch size to reach optimal performance. Must be greater than 0.

      Connection timeout: Time (in seconds) that the service waits for a kernel to report that it started. Must be greater than 0.

      Namespace: Namespace that the kernel is running in. This value cannot be modified.

      Resource plan: Resource plan where the kernel is running. An absolute resource plan path can be specified. Default: sample-project/inference.

      GPU: GPU allocation policy. Options include: no, shared, or exclusive. If set to shared, this indicates GPU packing is enabled.

      Image name: Specifies where kernel images are pulled from. Specify this for a kernel image patch or for a custom image.

      Resources: CPU and memory resource settings for a kernel pod. Default:
      ncpus=0.5,ncpus_limit=2,mem=1024,mem_limit=4096
      • When GPU packing is enabled, where GPU is set to shared, use ncpus_limit and mem_limit to start the kernel pod.
      • When GPU packing is disabled, where GPU is set to no or exclusive, use ncpus and mem to start the kernel pod.
      Tip: For running TensorFlow or PyTorch, it is recommended that you increase the default memory value by setting mem to at least 2048 (mem=2048).

      Accelerator resources: GPU resource settings for a kernel pod when GPU allocation policy is set to shared or exclusive.

      Node selector: Additional query to filter the nodes that a kernel pod can run on.
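
      For illustration, an edited profile for a CPU-only kernel (GPU set to no) might contain entries similar to the following sketch. The key names here are assumptions that map to the parameters described above; keep the keys that actually appear in the JSON file you exported in step 2 and change only their values.
        {
          "replica": 1,
          "kernel_min": 1,
          "kernel_max": 4,
          "kernel_delay_release_time": 60,
          "schedule_interval": 10,
          "task_execution_timeout": -1,
          "task_batch_size": 1,
          "connection_timeout": 300,
          "gpu": "no",
          "resource_plan": "sample-project/inference",
          "resources": "ncpus=0.5,ncpus_limit=2,mem=2048,mem_limit=4096"
        }
      In this sketch, mem is raised to 2048 as suggested in the tip above for TensorFlow or PyTorch kernels.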

    4. Update the model with the latest JSON file.
      dlim model updateprofile model_name -f modelA_profile.json
      where model_name is the name of your model and modelA_profile.json is the name of your JSON file.
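    For example, assuming a deployed model named mymodel (a hypothetical name used only for illustration), the complete command line edit cycle might look like the following. The dlim model start command at the end is assumed as the counterpart of dlim model stop, so that the service is running again after the update; this procedure itself covers only the update.
      dlim model stop mymodel
      dlim model viewprofile mymodel -j > mymodel_profile.json
      # edit mymodel_profile.json as described in step 3
      dlim model updateprofile mymodel -f mymodel_profile.json
      dlim model viewprofile mymodel -j
      dlim model start mymodel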
  • To edit an inference service from the Watson Machine Learning Accelerator console:
    1. Before editing the deployed model, stop the inference service. See Stop an inference service.
    2. From the console, navigate to Workload > Deployed Model.
    3. Select the model that you want to edit.

Results

The inference service is updated.