Run with Kubernetes and KServe ModelMesh Serving
IBM recommends KServe ModelMesh Serving to serve Watson NLP models. KServe is a Kubernetes-based platform for ML model inference. It supports several standard ML model formats, including TensorFlow, PyTorch ScriptModule, ONNX, scikit-learn, XGBoost, LightGBM, and OpenVINO IR.
KServe can also be extended to support custom runtimes with arbitrary model formats, such as the Watson NLP runtime. KServe ModelMesh Serving is a recently added feature intended to increase KServe's scalability. It is designed to handle large volumes of models, where the deployed models change frequently. It loads and unloads models dynamically, aiming to balance responsiveness to users against computational footprint.
The basic setup steps to get running are to:
- Provision external etcd and S3 resources
- Install KServe ModelMesh Serving onto your cluster
- Deploy your Watson NLP ServingRuntime
- Deploy a model upload job for each pretrained model you want to serve
- Deploy an InferenceService for each model
This document is a basic tutorial on getting up and running with KServe ModelMesh Serving. The resulting deployments are for experimentation and demo purposes only and are not suitable for a production environment.
Installing KServe ModelMesh Serving
See the KServe ModelMesh Serving installation instructions for details on how to install KServe with ModelMesh onto your cluster.
Installation requires a Kubernetes cluster, and you will need cluster-admin authority to complete all the prescribed steps. You should also be familiar with the concept of custom resources. A standard installation also assumes you have access to etcd and S3 storage.
Create an image pull secret
You will need to create a pull secret with your entitlement key to pull images from the entitled registry. See Accessing the files and Pull an image from a private registry.
Create an image pull secret named ibm-entitlement-key, and then add a new ServiceAccount that references the pull secret.
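One way to create the secret is with kubectl. This is a minimal sketch, assuming your entitlement key is exported in an IBM_ENTITLEMENT_KEY environment variable and that the target namespace is modelmesh-serving (the entitled registry cp.icr.io typically uses the cp username with the entitlement key as the password):
kubectl create secret docker-registry ibm-entitlement-key \
  --docker-server=cp.icr.io \
  --docker-username=cp \
  --docker-password="$IBM_ENTITLEMENT_KEY" \
  -n modelmesh-serving
The ServiceAccount that references the pull secret looks like this: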
apiVersion: v1
imagePullSecrets:
- name: ibm-entitlement-key
kind: ServiceAccount
metadata:
  name: pull-secret-sa
  namespace: modelmesh-serving
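Save the ServiceAccount manifest and apply it, for example (pull-secret-sa.yaml is just an assumed file name for the manifest above):
kubectl apply -f pull-secret-sa.yaml
kubectl get serviceaccount pull-secret-sa -n modelmesh-serving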
Update the Model Serving config
The modelmesh-serving controller has a number of configuration options specified here.
For this tutorial, we will disable the KServe REST Proxy, as it is not currently compatible with the Watson NLP Runtime. We will also configure the controller to use the new pull-secret-sa service account so that our pods can access the entitled registry.
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    # Sample config overrides
    serviceAccountName: pull-secret-sa
    restProxy:
      enabled: false
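Apply the ConfigMap to the modelmesh-serving namespace and confirm its contents, for example (model-serving-config.yaml is an assumed file name for the manifest above):
kubectl apply -f model-serving-config.yaml -n modelmesh-serving
kubectl get configmap model-serving-config -n modelmesh-serving -o yaml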
Create a serving runtime
A serving runtime is a template for a pod that can serve one or more particular model formats. The following sample will create a simple serving runtime for Watson NLP models.
Notice a few important overrides here:
- The metrics port is changed so that it does not conflict with ModelMesh's puller container
- The command-line arguments are overridden to spin up the gRPC server directly, instead of booting both the gRPC and REST servers
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: watson-nlp-runtime
spec:
  containers:
  - env:
    - name: ACCEPT_LICENSE
      value: "true"
    - name: LOG_LEVEL
      value: info
    - name: CAPACITY
      value: "6000000000"
    - name: DEFAULT_MODEL_SIZE
      value: "500000000"
    - name: METRICS_PORT
      value: "2113"
    args:
    - --
    - python3
    - -m
    - watson_runtime.grpc_server
    image: cp.icr.io/cp/ai/watson-nlp-runtime:1.1.36
    imagePullPolicy: IfNotPresent
    name: watson-nlp-runtime
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
  grpcDataEndpoint: port:8085
  grpcEndpoint: port:8085
  multiModel: true
  storageHelper:
    disabled: false
  supportedModelFormats:
  - autoSelect: true
    name: watson-nlp
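Apply the ServingRuntime in the same namespace, for example (watson-nlp-runtime.yaml is an assumed file name for the manifest above):
kubectl apply -f watson-nlp-runtime.yaml -n modelmesh-serving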
You should be able to see the status of the runtime with:
kubectl get servingruntimes
You will also be able to see the pods spin up for the runtime and inspect them directly for debugging and troubleshooting.
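For example, to list the pods and inspect one of them (note that, with the default ScaleToZero behavior, ModelMesh may only scale up the runtime deployment once the first InferenceService for this runtime is created):
kubectl get pods -n modelmesh-serving
kubectl describe pod <watson-nlp-runtime-pod-name> -n modelmesh-serving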
Upload pretrained models to S3
The pretrained model containers come with the ability to run as S3 upload jobs that reference a KServe storage config secret. For each model that you want to serve, you can deploy an upload job like the following:
apiVersion: batch/v1
kind: Job
metadata:
  name: model-upload
spec:
  template:
    spec:
      containers:
      - name: syntax-izumo-en-stock
        image: cp.icr.io/cp/ai/watson-nlp_syntax_izumo_lang_en_stock:1.3.0
        env:
        - name: UPLOAD
          value: "true"
        - name: ACCEPT_LICENSE
          value: "true"
        - name: S3_CONFIG_FILE
          value: /storage-config/localMinIO
        - name: UPLOAD_PATH
          value: models
        volumeMounts:
        - mountPath: /storage-config
          name: storage-config
          readOnly: true
      volumes:
      - name: storage-config
        secret:
          defaultMode: 420
          secretName: storage-config
      restartPolicy: Never
      serviceAccountName: pull-secret-sa
  backoffLimit: 2
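A sketch of deploying the job and following it to completion (model-upload-job.yaml is an assumed file name for the manifest above):
kubectl apply -f model-upload-job.yaml -n modelmesh-serving
kubectl wait --for=condition=complete job/model-upload -n modelmesh-serving --timeout=10m
kubectl logs job/model-upload -n modelmesh-serving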
Note that this assumes your storage-config secret exists, and that the localMinIO key exists within it. This should have been created by the quickstart install of modelmesh-serving. You can configure other remote storage locations in that secret and reference them here.
See also how UPLOAD_PATH is set to upload to the ${bucket}/models/${model_name} path.
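To check which storage locations are defined, you can inspect the secret directly, for example:
kubectl get secret storage-config -n modelmesh-serving -o jsonpath='{.data.localMinIO}' | base64 --decode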
Create an InferenceService predictor for models
InferenceServices represent a logical endpoint for serving predictions using a particular model. Watson NLP models must be stored in an S3-compatible object store to be served by KServe ModelMesh Serving.
For each model that you want to serve, create an InferenceService like the following:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: syntax-izumo-en
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: watson-nlp
      storage:
        path: models/syntax_izumo_lang_en_stock
        key: localMinIO
Note that the storage config should match the location where the model was uploaded.
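Apply the InferenceService in the same namespace, for example (syntax-izumo-en.yaml is an assumed file name for the manifest above):
kubectl apply -f syntax-izumo-en.yaml -n modelmesh-serving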
Once the model is successfully loaded, you will see that the READY status is True when checked with the following command:
kubectl get inferenceservice
Querying your InferenceService
IMPORTANT: A key difference from other deployment modes is that you must query your models via ModelMesh instead of invoking the runtime API directly. This means that:
- Only the gRPC API is supported
- The mm-vmodel-id metadata key must be supplied with the name of the InferenceService to query, instead of the mm-model-id metadata header
First, port-forward the model-mesh service
kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving
Then use the Python client library to query the new InferenceService. (See instructions for installing the client library here.)
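If you have not installed the client library yet, it is typically available from PyPI; the package name below is assumed to match the import used in the example:
pip install watson-nlp-runtime-client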
import grpc
from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

# No TLS
# Note the 8033 port to talk to model-mesh directly
channel = grpc.insecure_channel("localhost:8033")
stub = common_service_pb2_grpc.NlpServiceStub(channel)

request = common_service_pb2.SyntaxRequest(
    raw_document=syntax_types_pb2.RawDocument(text="This is a test"),
    parsers=("sentence", "token", "part_of_speech", "lemma", "dependency"),
)

# Note the `mm-vmodel-id` header with the name of the InferenceService
response = stub.SyntaxPredict(
    request, metadata=[("mm-vmodel-id", "syntax-izumo-en")]
)
print(response)
Other Resources
To see a tutorial that takes you through the steps to deploy a Watson NLP model to the KServe ModelMesh Serving sandbox environment on IBM Technology Zone (TechZone), check out Deploy a Watson NLP Model to KServe ModelMesh Serving on GitHub.
Once you have your runtime server working, see Accessing client libraries and tools to continue.