Evaluating prompt templates for non-IBM foundation models outside of watsonx

You can evaluate prompt templates against foundation models that are hosted outside of watsonx to measure how well a model performs its tasks and to understand how the model generates responses.

Use the watsonx.governance features to evaluate prompt templates for externally hosted generative AI models. For example, if you inference an AWS Bedrock LLM, you can use the watsonx.governance APIs to evaluate the saved prompt template with watsonx.governance metrics for performance and reliability.
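To make that flow concrete, the following is a minimal sketch of the first half of the workflow: rendering prompts against an externally hosted model and collecting the responses that you would then score with watsonx.governance metrics. It assumes the boto3 SDK, AWS credentials already configured in your environment, and the anthropic.claude-v2 text-completion request format; the record field names are illustrative, and the metric computation itself follows the sample notebooks described below.

```python
# Minimal sketch (assumptions: boto3 installed, AWS credentials configured,
# anthropic.claude-v2 text-completion request format). Collects
# prompt/response pairs from an externally hosted Bedrock model; the
# watsonx.governance metric computation then follows the sample notebooks.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate(prompt: str) -> str:
    """Send one rendered prompt to Claude v2 on Bedrock and return its completion."""
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]

# Illustrative record layout: the rendered prompt template input and the
# model output that the evaluation will score. Field names are hypothetical;
# match them to the schema used in the sample notebook you follow.
prompts = ["Summarize the following support ticket: ..."]
records = [{"input": p, "generated_text": generate(p)} for p in prompts]
```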

With watsonx.governance, you can evaluate the output of your model against multiple task types, such as summarization, content generation, question answering, and text classification.

Review and use sample Jupyter Notebooks to evaluate models. Each notebook lists learning goals so you can find the one that best meets your needs.

Evaluation samples

Notebooks demonstrate how to evaluate your prompt templates against models hosted by:

  • Google Vertex AI
  • AWS Bedrock
  • Azure OpenAI
Note: Evaluations might also work for models from other providers, but have not been tested.

While each sample notebook demonstrates specific tasks, you can modify any of them to cover task types from the other sample notebooks.

For example, the AWS Bedrock sample demonstrates Summarization, Content Generation, Question answering, and Text classification tasks. You can use this notebook as a guide for adding Question answering to the Google Vertex AI sample notebook: copy the Evaluating Question and Answering output from the Foundation Model section from the AWS Bedrock sample notebook into the Google Vertex AI sample notebook, provide appropriate test data, and run the code cells. A sketch of what such test data might look like follows.
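As a hedged illustration, the test data for a Question answering evaluation is typically a small table of questions, reference answers, and model outputs. The column names below are hypothetical; align them with the schema that the code cells you copied from the AWS Bedrock sample notebook expect.

```python
# Hypothetical test data for a Question answering evaluation. Column names
# are illustrative only; align them with the schema expected by the cells
# copied from the AWS Bedrock sample notebook.
import pandas as pd

qa_test_data = pd.DataFrame({
    "question": [
        "What does watsonx.governance evaluate?",
        "Which providers do the sample notebooks cover?",
    ],
    "reference_answer": [
        "Prompt templates for generative AI models.",
        "Google Vertex AI, AWS Bedrock, and Azure OpenAI.",
    ],
    "generated_text": [
        "It evaluates prompt templates for generative AI models.",
        "The samples cover Vertex AI, Bedrock, and Azure OpenAI.",
    ],
})
```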

| Notebook name | Model | Metrics evaluated | Metrics published |
| --- | --- | --- | --- |
| Use the watsonx.governance monitoring toolkit to evaluate a watsonx.ai model | FLAN-UL2 | Summarization | OpenPages |
| Use the watsonx.governance monitoring toolkit to evaluate AWS Bedrock | AWS anthropic.claude-v2 | Summarization, Content Generation, Question answering, Text classification | |
| Use the watsonx.governance monitoring toolkit to evaluate Google Vertex AI | Google Vertex AI Gemini | Summarization, Custom | OpenPages |
| Use the watsonx.governance monitoring toolkit to evaluate Azure OpenAI | Azure OpenAI GPT-35 | Summarization | OpenPages |

Parent topic: APIs, SDKs and tutorials