Create an inference service
Create an inference service to deploy a model by using elastic distributed inference.
Before you begin
About this task
Elastic distributed inference model deployment has three parts:
- Model assets: metadata and your model files (kernel.py, model.json, and README.md)
- Model service: the client-facing inference interface (available through the REST API and the WML Accelerator console)
- Model kernels: use a GPU to load the model assets and run inference (see the sketch after this list)
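For orientation, the following is a minimal sketch of what a kernel.py could look like. The class name, the callback names, and the `weights` field read from model.json are assumptions for illustration only, not the product API; base your real kernel on the sample kernels that ship with WML Accelerator.

```python
# Illustrative sketch only: the class and callback names below are
# hypothetical placeholders, not the exact elastic distributed inference API.
import json

import torch


class InferenceKernel:
    """Loads the model assets once, then serves inference requests on one GPU."""

    def on_start(self, model_dir):
        # Read the metadata packaged with the model assets.
        with open(f"{model_dir}/model.json") as f:
            self.meta = json.load(f)
        # Load the weights onto the GPU assigned to this kernel.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = torch.jit.load(f"{model_dir}/{self.meta['weights']}",
                                    map_location=self.device)
        self.model.eval()

    def on_invoke(self, request):
        # Run one inference request and return a JSON-serializable result.
        inputs = torch.tensor(request["inputs"], device=self.device)
        with torch.no_grad():
            outputs = self.model(inputs)
        return {"outputs": outputs.cpu().tolist()}
```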
To publish a model, create an inference service by uploading your model assets and creating the required model metadata. When you start the model, the service creates model kernels based on the current inference request throughput (you can adjust the minimum and maximum number of kernels).
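As a sketch of that flow, the example below uploads model assets and then starts the service over REST. The base URL, endpoint paths, and payload fields are hypothetical placeholders, not the documented WML Accelerator REST API; consult the product REST reference for the actual calls.

```python
# Illustrative sketch only: endpoint paths and field names are hypothetical.
import requests

BASE = "https://wmla-host:9243/dlim/v1"   # hypothetical base URL
AUTH = ("admin", "password")              # replace with your credentials or token

# 1. Upload the model assets (kernel.py, model.json, README.md) to publish the model.
with open("my_model.tar.gz", "rb") as assets:
    resp = requests.post(f"{BASE}/models", auth=AUTH,
                         files={"file": assets}, verify=False)
    resp.raise_for_status()

# 2. Start the model; the service then creates kernels automatically based on
#    request throughput, between the configured minimum and maximum.
resp = requests.put(f"{BASE}/models/my_model/start", auth=AUTH, verify=False)
resp.raise_for_status()
```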
In GPU exclusive mode, one elastic distributed inference model kernel uses one GPU. In GPU shared mode (also called GPU packing mode), multiple kernels of a model can share one GPU. After the model is published, ensure that it has been tested and that enough GPU resources are available for the inference service to run.
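To show how the kernel scaling and GPU mode settings fit together, the snippet below collects them in a plain Python dictionary. The attribute names are assumptions for illustration; the real model metadata schema may use different fields.

```python
# Illustrative sketch only: attribute names are hypothetical, not the real schema.
kernel_policy = {
    "kernel_min": 1,          # keep at least one kernel loaded
    "kernel_max": 4,          # scale out to at most four kernels under load
    "gpu_mode": "exclusive",  # one kernel per GPU; "shared" (packing) lets
                              # several kernels of this model share one GPU
}
```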
Procedure
What to do next
- To set additional parameters on your inference service, see Edit an inference service.
- To view deployed models from the Watson Machine Learning Accelerator console, see View deployed models.