Inferencing deployed custom foundation models tuned with PEFT

After deploying a custom foundation model that is fine-tuned with a parameter-efficient fine-tuning (PEFT) technique, you can inference the model by providing input text to generate a text or stream response in real time.

Before you begin

Make sure you have deployed your LoRA adapter model before proceeding with inferencing.
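
The REST API samples that follow also require a bearer token. On IBM Cloud, you can exchange an API key for an IAM access token, as in this minimal Python sketch (the WATSONX_APIKEY environment variable is an assumption for illustration):

import os
import requests

# Exchange an IBM Cloud API key for an IAM bearer token.
# Assumes your API key is stored in the WATSONX_APIKEY environment variable.
response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data={
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": os.environ["WATSONX_APIKEY"],
    },
)
response.raise_for_status()
token = response.json()["access_token"]

The resulting token value replaces the <token> placeholder in the samples below.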

Inferencing the deployed model with the REST API

You can use the watsonx.ai REST API to inference your deployed model and generate predictions in real time.

Generating a text response

To generate a text response from your deployed PEFT model, use the following code sample:

curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
  "input": "What is the boiling point of water?",
  "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
  }
}'

Replace the placeholders with your values, and adjust the parameters to fit your use case.
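
You can issue the same request from Python with the requests library. The following is a minimal sketch that mirrors the curl sample; the token variable and the <HOST> and <deployment_id> placeholders come from the steps above, and the assumption that the output is returned in results[0].generated_text reflects the shape of the watsonx.ai text generation response.

import requests

# Same request as the curl sample above, issued from Python.
# Replace <HOST> and <deployment_id> with your actual values.
url = "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation?version=2024-01-29"
payload = {
    "input": "What is the boiling point of water?",
    "parameters": {
        "max_new_tokens": 200,
        "min_new_tokens": 20
    },
}
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
# The generated text is expected in results[0].generated_text.
print(response.json()["results"][0]["generated_text"])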

Generating a stream response

To generate a stream response from your deployed PEFT model, use the following code sample:

curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation_stream?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
  "input": "What is the boiling point of water?",
  "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
  }
}'

Replace the placeholders with your values, and adjust the parameters to fit your use case.
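
The stream endpoint returns the response incrementally as server-sent events. The following Python sketch reads the event stream and prints each chunk as it arrives; the exact event layout (a data: prefix followed by JSON with the chunk in results[0].generated_text) is an assumption modeled on the non-streaming response.

import json
import requests

# Stream the response as server-sent events, printing chunks as they arrive.
# Replace <HOST> and <deployment_id> with your actual values.
url = "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation_stream?version=2024-01-29"
payload = {
    "input": "What is the boiling point of water?",
    "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
}
with requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=payload,
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Assumption: each event line looks like "data: {...json chunk...}".
        if line and line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):].strip())
            print(chunk["results"][0]["generated_text"], end="", flush=True)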

Parent topic: Deploying fine-tuned custom foundation models