Creating a deployment for a custom foundation model

After a custom foundation model is installed and registered for use with watsonx.ai, you can create a deployment for the model.

Required services: The watsonx.ai service and other supplemental services are not available by default. An administrator must install these services on the IBM Cloud Pak for Data platform. To determine whether a service is installed, open the Services catalog and check whether the service is enabled.

Create a deployment for a custom foundation model so that you can run inference on the model by using the capabilities of watsonx.ai.

After a custom foundation model is uploaded and registered with watsonx.ai, you can deploy the model to make it available for inferencing. Use one of these methods:

Create a deployment from a space
Create a deployment using APIs

Before you begin

Before you create a deployment, plan the hardware specification that the deployment needs. In most cases, you can select a predefined hardware specification. If you are deploying a quantized model, or have other requirements that call for a custom hardware specification, create the hardware specification by using the Python client or the Watson Machine Learning API before you begin the deployment process.

For more information, see Supported hardware, model architectures, and performance settings.

Creating a deployment from a space

To create a deployment of your custom foundation model from a space, first add the model asset to the space, and then create the deployment.

Import the model asset into a space

To import a custom foundation model asset into your deployment space:

From your deployment space, select the Assets tab and click Import assets.
From the Import assets catalog, click Custom foundation model and choose the model to import.
Enter your model details, and click Add.

Note: If the custom foundation model asset was saved to a project, you can promote the model from the project to the space instead of importing the model asset.

Deploy the custom foundation model from a deployment space

Follow these steps to create an online deployment for a custom foundation model from a deployment space:

From the deployment details page, click New deployment.
Enter a name for your deployment and optionally enter a serving name, description, and tags.
Select a hardware specification.

Restriction:
You cannot create a custom hardware specification from the user interface in a deployment space. To learn more about ways to create and select a hardware specification, see Managing hardware specifications for deployments. After you create the custom hardware specification, you can select it programmatically or from the user interface for your custom foundation model.
Optional: If you want to override some of the base model parameters, enter new parameter values:
- Data type: Choose float16 or bfloat16 to specify the data type for your model.
- Max batch size: Enter the maximum batch size for your model.
- Max concurrent requests: Enter the maximum number of concurrent requests that can be made to your model.
- Max new tokens: Enter the maximum number of tokens that your model can generate for an inference request.
- Max sequence length: Enter the maximum sequence length for your model.
Click Create.

Test the deployment

Follow these steps to test your custom foundation model deployment:

From the Deployments tab of your space, click the deployment name.
Click the Test tab to input prompt text and get a response from the deployed asset.
Enter test data in one of the following formats, depending on the type of asset that you deployed:
- Text: Enter text input data to generate a block of text as output.
- Stream: Enter text input data to generate a stream of text as output.
- JSON: Enter JSON input data to generate output in JSON format.
Click Generate to get results based on your prompt.

Retrieve the endpoint for custom foundation model deployments

Follow these steps to retrieve the endpoint URL for your custom foundation model deployment to access the deployment from your applications:

From the Deployments tab of your space, click the deployment name.
In the API Reference tab, find the private and public endpoint links. You can also find cURL commands in code snippets that you can use to include the endpoint details in an application.

Note:

If you added a serving name when you created your online deployment, you see two endpoint URLs. The first URL contains the deployment ID, and the second URL contains your serving name. You can use either of these URLs with your deployment.
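As a rough illustration of the note above, the following Python sketch builds an endpoint URL from either a deployment ID or a serving name, along with a minimal JSON request body. The `/ml/v1/deployments/.../text/generation` path, the `version` value, and the `input` and `parameters` body fields are assumptions that are not stated on this page, and `cluster.example.com` is a hypothetical host. Copy the actual values from the API Reference tab of your deployment.

```python
import json
from urllib.parse import quote, urlencode

# Assumed path template; confirm it against the API Reference tab.
GENERATION_PATH = "/ml/v1/deployments/{identifier}/text/generation"


def endpoint_url(cluster_url: str, identifier: str, version: str = "2024-01-29") -> str:
    """Return an inference URL for a deployment ID or a serving name."""
    path = GENERATION_PATH.format(identifier=quote(identifier, safe=""))
    return f"https://{cluster_url}{path}?{urlencode({'version': version})}"


def generation_body(prompt: str, max_new_tokens: int = 100) -> str:
    """Serialize a minimal request body (field names are assumptions)."""
    return json.dumps({"input": prompt, "parameters": {"max_new_tokens": max_new_tokens}})


if __name__ == "__main__":
    # Either identifier addresses the same deployment.
    print(endpoint_url("cluster.example.com", "hypothetical-deployment-id"))
    print(endpoint_url("cluster.example.com", "test_custom_fm"))
    print(generation_body("Hello"))
```

Keeping the identifier as a single argument reflects the point of the note: the application code does not need to change when you switch from the deployment ID to the serving name.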

Creating a deployment programmatically

To create a deployment programmatically, you must first get the model asset ID and then create the deployment.

List the deployable custom foundation models

Run this code to view the list of available custom foundation models by using the watsonx API:

curl --location 'https://<cluster_url>/ml/v4/custom_foundation_models' \
--header "Authorization: Bearer $TOKEN"

Example output:

{
    "first": {
        "href": "/ml/v4/custom_foundation_models?limit=100"
    },
    "limit": 100,
    "resources": [
        {
            "model_id": "example_model_13b",
            "parameters": [
                {
                    "default": "float16",
                    "display_name": "Data Type",
                    "name": "dtype",
                    "options": [
                        "float16",
                        "bfloat16"
                    ],
                    "type": "string"
                },
                {
                    "default": 256,
                    "display_name": "Max Batch Size",
                    "name": "max_batch_size",
                    "type": "number"
                },
                {
                    "default": 1024,
                    "display_name": "Max Concurrent Requests",
                    "name": "max_concurrent_requests",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max New Tokens",
                    "name": "max_new_tokens",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max Sequence Length",
                    "name": "max_sequence_length",
                    "type": "number"
                }
            ]
        },
        {
            "model_id": "example_model_70b",
            "parameters": [
                {
                    "default": "float16",
                    "display_name": "Data Type",
                    "name": "dtype",
                    "options": [
                        "float16",
                        "bfloat16"
                    ],
                    "type": "string"
                },
                {
                    "default": 256,
                    "display_name": "Max Batch Size",
                    "max": 512,
                    "min": 16,
                    "name": "max_batch_size",
                    "type": "number"
                },
                {
                    "default": 64,
                    "display_name": "Max Concurrent Requests",
                    "max": 128,
                    "min": 0,
                    "name": "max_concurrent_requests",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max New Tokens",
                    "max": 4096,
                    "min": 512,
                    "name": "max_new_tokens",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max Sequence Length",
                    "max": 8192,
                    "min": 256,
                    "name": "max_sequence_length",
                    "type": "number"
                }
            ],
            "tags": [
                "example_model",
                "70b"
            ]
        }
    ],
    "total_count": 2
}

Note:

If you access the model list programmatically, you can see all the parameters that you can set for the selected model. For models deployed through the UI, the parameters are available when you create the online deployment. See the description of parameters for custom foundation models.
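The parameter metadata in the model list can be used to check override values before you create a deployment. The following Python sketch assumes that the response has the shape shown in the example output (`resources`, `parameters`, `options`, `min`, `max`). The helper names `find_model` and `check_overrides` are hypothetical, and the page does not say whether the service enforces these bounds, so treat the check as a convenience only.

```python
from typing import Any


def find_model(listing: dict[str, Any], model_id: str) -> dict[str, Any]:
    """Return the entry in listing["resources"] whose model_id matches."""
    for entry in listing.get("resources", []):
        if entry.get("model_id") == model_id:
            return entry
    raise KeyError(f"model {model_id!r} is not in the list")


def check_overrides(model: dict[str, Any], overrides: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    specs = {p["name"]: p for p in model.get("parameters", [])}
    problems = []
    for name, value in overrides.items():
        spec = specs.get(name)
        if spec is None:
            problems.append(f"{name}: not a parameter of this model")
            continue
        if "options" in spec and value not in spec["options"]:
            problems.append(f"{name}: {value!r} is not one of {spec['options']}")
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: {value} is below the minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: {value} is above the maximum {spec['max']}")
    return problems


if __name__ == "__main__":
    # Trimmed copy of the example_model_70b entry from the example output.
    listing = {"resources": [{"model_id": "example_model_70b", "parameters": [
        {"name": "dtype", "options": ["float16", "bfloat16"], "type": "string"},
        {"name": "max_sequence_length", "min": 256, "max": 8192, "type": "number"},
    ]}]}
    model = find_model(listing, "example_model_70b")
    print(check_overrides(model, {"dtype": "float32", "max_sequence_length": 4096}))
```

Parameters without `min`, `max`, or `options`, such as those of example_model_13b, pass the check unchanged because the list gives nothing to compare against.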

Create the custom foundation model asset

You can create a model asset in two contexts: project context and space context.

If you create a model asset in project context, you can import the model to your project and then promote it to a space.
If you create a model asset in space context, you can import the model and then deploy it online. A model deployed from a space is also accessible from Prompt Lab in the project scope.

To create a model asset for your custom foundation model in space context, use this code:

curl -X POST "https://<cluster_url>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
  "name": "<a meaningful name>",
  "space_id": "<your space id>",
  "foundation_model": {
    "model_id": "<your model id>"
  },
  "type": "custom_foundation_model_1.0",
  "software_spec": {
    "name": "watsonx-cfm-caikit-1.0"
  }
}'

To create a model asset for your custom foundation model in project context, use this code:

curl -X POST "https://<cluster_url>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
  "name": "<a meaningful name>",
  "project_id": "<your project id>",
  "foundation_model": {
    "model_id": "<your model id>"
  },
  "type": "custom_foundation_model_1.0",
  "software_spec": {
    "name": "watsonx-cfm-caikit-1.0"
  }
}'

Note:

The model type must be custom_foundation_model_1.0. The software specification name must be watsonx-cfm-caikit-1.0. You cannot customize the software specification.
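If you build the request body in code rather than by hand, the two fixed values from the note can be kept as constants. The following Python sketch is one way to do that. The function name `model_asset_payload` is hypothetical, and rejecting a request that sets both or neither of `space_id` and `project_id` is an assumption based on the two examples above, each of which uses exactly one context.

```python
from typing import Optional

MODEL_TYPE = "custom_foundation_model_1.0"  # required type, per the note above
SOFTWARE_SPEC = "watsonx-cfm-caikit-1.0"    # required name, cannot be customized


def model_asset_payload(name: str, model_id: str,
                        space_id: Optional[str] = None,
                        project_id: Optional[str] = None) -> dict:
    """Build the JSON body for POST /ml/v4/models in a single context."""
    if (space_id is None) == (project_id is None):
        raise ValueError("set exactly one of space_id or project_id")
    payload = {
        "name": name,
        "foundation_model": {"model_id": model_id},
        "type": MODEL_TYPE,
        "software_spec": {"name": SOFTWARE_SPEC},
    }
    if space_id is not None:
        payload["space_id"] = space_id
    else:
        payload["project_id"] = project_id
    return payload


if __name__ == "__main__":
    import json
    # Hypothetical identifiers, for illustration only.
    print(json.dumps(model_asset_payload("my model", "example_model_13b",
                                         space_id="hypothetical-space-id"), indent=2))
```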

Create the online deployment

This example creates an online deployment and overrides the max_sequence_length parameter. The asset ID is the ID of the Watson Machine Learning model asset for your custom foundation model. In hardware_spec, set either "id" or "name", but not both. Likewise, set either "project_id" or "space_id", but not both.

curl -X POST "https://<cluster_url>/ml/v4/deployments?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
  "asset": {
    "id": "<your custom foundation model asset id>"
  },
  "online": {
    "parameters": {
      "serving_name": "test_custom_fm",
      "foundation_model": {
        "max_sequence_length": 4096
      }
    }
  },
  "hardware_spec": {
    "id": "<your custom hardware spec id>",
    "num_nodes": 1
  },
  "description": "Testing deployment using custom foundation model",
  "name": "custom_fm_deployment",
  "project_id": "<your project id>"
}'
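The two "only one is allowed" rules in the deployment request are easy to violate when the body is assembled by a script. The following Python sketch builds the same body and checks both rules before anything is sent. The function name `deployment_payload` and its keyword arguments are hypothetical, and the sketch omits the optional description field.

```python
from typing import Any, Optional


def deployment_payload(name: str, asset_id: str, *,
                       hardware_spec_id: Optional[str] = None,
                       hardware_spec_name: Optional[str] = None,
                       project_id: Optional[str] = None,
                       space_id: Optional[str] = None,
                       serving_name: Optional[str] = None,
                       overrides: Optional[dict[str, Any]] = None,
                       num_nodes: int = 1) -> dict[str, Any]:
    """Build the JSON body for POST /ml/v4/deployments."""
    if (hardware_spec_id is None) == (hardware_spec_name is None):
        raise ValueError("set exactly one of hardware_spec_id or hardware_spec_name")
    if (project_id is None) == (space_id is None):
        raise ValueError("set exactly one of project_id or space_id")

    hardware_spec: dict[str, Any] = {"num_nodes": num_nodes}
    if hardware_spec_id is not None:
        hardware_spec["id"] = hardware_spec_id
    else:
        hardware_spec["name"] = hardware_spec_name

    # Optional online parameters: serving name and base model overrides.
    parameters: dict[str, Any] = {}
    if serving_name:
        parameters["serving_name"] = serving_name
    if overrides:
        parameters["foundation_model"] = dict(overrides)

    payload: dict[str, Any] = {
        "name": name,
        "asset": {"id": asset_id},
        "online": {"parameters": parameters},
        "hardware_spec": hardware_spec,
    }
    if project_id is not None:
        payload["project_id"] = project_id
    else:
        payload["space_id"] = space_id
    return payload
```

Serializing the result with `json.dumps` always yields valid JSON, which avoids the quoting mistakes that hand-edited request bodies tend to pick up.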

View the status for a deployment

To get the status for an existing deployment, use this command:

curl -X GET "https://<cluster_url>/ml/v4/deployments/<your deployment ID>?version=2024-01-29&project_id=<your project ID>" \
-H "Authorization: Bearer $TOKEN"

Note: The deployed_asset_type is returned as custom_foundation_model.
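A script that creates a deployment usually needs to wait until the deployment is usable. The following Python sketch polls a status response until it reaches a final state. The `entity.status.state` location and the `ready` and `failed` state names are assumptions that this page does not confirm, so check them against an actual response from the status command. The `fetch` argument is a hypothetical callable that performs the GET request and returns the parsed JSON.

```python
import time
from typing import Any, Callable


def deployment_state(response: dict[str, Any]) -> str:
    """Read the state from a status response (entity.status.state is assumed)."""
    return response.get("entity", {}).get("status", {}).get("state", "unknown")


def wait_until_final(fetch: Callable[[], dict[str, Any]], *,
                     attempts: int = 30,
                     delay_seconds: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until the state is final or attempts run out; return the last state."""
    state = "unknown"
    for attempt in range(attempts):
        state = deployment_state(fetch())
        if state in ("ready", "failed"):  # assumed final states
            break
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return state
```

Passing `fetch` and `sleep` as arguments keeps the polling logic independent of any HTTP library and lets you exercise it with canned responses.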

Next steps

Prompting a custom foundation model
