Evaluating prompt templates for non-IBM foundation models outside of watsonx

You can evaluate prompt templates against foundation models that are hosted outside of watsonx to measure how well a model performs its tasks and to understand how the model generates responses.

Use the watsonx.governance features to evaluate prompt templates for externally hosted generative AI models. For example, if you inference an AWS Bedrock LLM, you can use the watsonx.governance APIs to evaluate the saved prompt template with watsonx.governance metrics for performance and reliability.
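To make that flow concrete, the following is a minimal sketch of the first half of the workflow: rendering prompts against an externally hosted model and collecting the responses that you would then score with watsonx.governance metrics. It assumes the boto3 SDK, AWS credentials already configured in your environment, and the anthropic.claude-v2 text-completion request format; the record field names are illustrative, and the metric computation itself follows the sample notebooks described below.

```python
# Minimal sketch (assumptions: boto3 installed, AWS credentials configured,
# anthropic.claude-v2 text-completion request format). Collects
# prompt/response pairs from an externally hosted Bedrock model; the
# watsonx.governance metric computation then follows the sample notebooks.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate(prompt: str) -> str:
    """Send one rendered prompt to Claude v2 on Bedrock and return its completion."""
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]

# Illustrative record layout: the rendered prompt template input and the
# model output that the evaluation will score. Field names are hypothetical;
# match them to the schema used in the sample notebook you follow.
prompts = ["Summarize the following support ticket: ..."]
records = [{"input": p, "generated_text": generate(p)} for p in prompts]
```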

With watsonx.governance, you can evaluate the output of your model against multiple task types, such as summarization, content generation, question answering, and text classification.

Review and use sample Jupyter Notebooks to evaluate models. Each notebook lists learning goals so you can find the one that best meets your needs.

Evaluation samples

Notebooks demonstrate how to evaluate your prompt templates against models hosted by:

  • Google Vertex AI
  • AWS Bedrock
  • Azure OpenAI
Note: Evaluations might also work for models from other providers, but have not been tested.

While each sample notebook demonstrates specific tasks, you can modify any of them to cover task types from the other sample notebooks.

For example, the AWS Bedrock sample demonstrates Summarization, Content Generation, Question answering, and Text classification tasks. You can use this notebook as a guide for adding Question answering to the Google Vertex AI sample notebook: copy the Evaluating Question and Answering output from the Foundation Model section from the AWS Bedrock sample notebook into the Google Vertex AI sample notebook, provide appropriate test data, and run the code cells. A sketch of what such test data might look like follows.
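As a hedged illustration, the test data for a Question answering evaluation is typically a small table of questions, reference answers, and model outputs. The column names below are hypothetical; align them with the schema that the code cells you copied from the AWS Bedrock sample notebook expect.

```python
# Hypothetical test data for a Question answering evaluation. Column names
# are illustrative only; align them with the schema expected by the cells
# copied from the AWS Bedrock sample notebook.
import pandas as pd

qa_test_data = pd.DataFrame({
    "question": [
        "What does watsonx.governance evaluate?",
        "Which providers do the sample notebooks cover?",
    ],
    "reference_answer": [
        "Prompt templates for generative AI models.",
        "Google Vertex AI, AWS Bedrock, and Azure OpenAI.",
    ],
    "generated_text": [
        "It evaluates prompt templates for generative AI models.",
        "The samples cover Vertex AI, Bedrock, and Azure OpenAI.",
    ],
})
```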

| Notebook name | Model | Metrics evaluated | Metrics published |
| --- | --- | --- | --- |
| Use the watsonx.governance monitoring toolkit to evaluate a watsonx.ai model | FLAN-UL2 | Summarization | OpenPages |
| Use the watsonx.governance monitoring toolkit to evaluate AWS Bedrock | AWS anthropic.claude-v2 | Summarization, Content Generation, Question answering, Text classification | |
| Use the watsonx.governance monitoring toolkit to evaluate Google Vertex AI | Google Vertex AI Gemini | Summarization, Custom | OpenPages |
| Use the watsonx.governance monitoring toolkit to evaluate Azure OpenAI | Azure OpenAI GPT-35 | Summarization | OpenPages |

Parent topic: APIs, SDKs and tutorials