Inferencing deployed custom foundation models tuned with PEFT
After deploying a custom foundation model that is fine-tuned with a Parameter-Efficient Fine-Tuning (PEFT) technique, you can inference the model by sending input data to the deployment to generate text or stream responses in real time.
Before you begin
Make sure you have deployed your LoRA adapter model before proceeding with inferencing.
Inferencing the deployed model with the REST API
You can use the watsonx.ai REST API to inference your deployed model and generate predictions in real time.
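Both requests that follow require a bearer token in the Authorization header. If your deployment runs on IBM Cloud, one way to generate a token is to exchange an IBM Cloud API key at the IAM token endpoint, as in the following sketch; on software deployments, obtain the token from your platform instead:
# Exchange an IBM Cloud API key for an IAM bearer token.
# Assumes an IBM Cloud deployment; software deployments use a different token mechanism.
curl -X POST "https://iam.cloud.ibm.com/identity/token" \
  -H "content-type: application/x-www-form-urlencoded" \
  --data-urlencode "grant_type=urn:ibm:params:oauth:grant-type:apikey" \
  --data-urlencode "apikey=<API_KEY>"
The access_token value in the JSON response is the <token> that is used in the examples that follow.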
Generating a text response
To generate a text response from your deployed PEFT model, use the following code sample:
curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation?version=2024-01-29" \
  -H "Authorization: Bearer <token>" \
  -H "content-type: application/json" \
  --data '{
    "input": "What is the boiling point of water?",
    "parameters": {
      "max_new_tokens": 200,
      "min_new_tokens": 20
    }
  }'
Make sure to replace the placeholders with your actual values and adjust the parameters according to your specific use case.
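If the request succeeds, the endpoint returns a JSON body that resembles the following sketch; the exact fields and values vary by model and service version:
{
  "model_id": "<model_id>",
  "created_at": "2024-01-29T00:00:00.000Z",
  "results": [
    {
      "generated_text": "The boiling point of water is 100 degrees Celsius at sea level.",
      "generated_token_count": 16,
      "input_token_count": 8,
      "stop_reason": "eos_token"
    }
  ]
}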
Generating a stream response
To generate a stream response from your deployed PEFT model, use the following code sample:
curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation_stream?version=2024-01-29" \
  -H "Authorization: Bearer <token>" \
  -H "content-type: application/json" \
  --data '{
    "input": "What is the boiling point of water?",
    "parameters": {
      "max_new_tokens": 200,
      "min_new_tokens": 20
    }
  }'
As with the text generation request, replace the placeholders with your actual values and adjust the parameters to fit your use case.
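Unlike the text endpoint, the stream endpoint returns its output incrementally as server-sent events, with each event's data field carrying a JSON fragment of the response. The fragment below is illustrative only and assumes the same result fields as the non-streaming response:
id: 1
event: message
data: {"results": [{"generated_text": "The boiling", "generated_token_count": 3, "stop_reason": "not_finished"}]}

id: 2
event: message
data: {"results": [{"generated_text": " point of water is 100 degrees Celsius at sea level.", "generated_token_count": 14, "stop_reason": "eos_token"}]}
Concatenating the generated_text values in order reconstructs the full response.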
Parent topic: Deploying fine-tuned custom foundation models