Evaluating prompt templates for non-IBM foundation models with notebooks

You can use notebooks to evaluate prompt templates against externally hosted foundation models, to measure how well a foundation model performs its tasks, and to understand how the model generates responses.

You can use watsonx.governance to evaluate prompt templates for externally-hosted generative AI models.

With watsonx.governance, you can evaluate the output of your model against multiple task types, such as summarization, content generation, question answering, and text classification.

Each sample Jupyter Notebook for evaluating prompt templates lists its learning goals so that you can find the notebook that best meets your needs.

Notebooks demonstrate how to evaluate your prompt templates against models hosted by:

  • Google Vertex AI
  • AWS Bedrock
  • Azure OpenAI

For example, if you run inference against an LLM that is hosted on AWS Bedrock, you can use the watsonx.governance APIs to evaluate the saved prompt template with watsonx.governance metrics for performance and reliability.
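The following minimal sketch shows the inference side of that flow for an AWS Bedrock model. It assumes that the anthropic.claude-v2 model is enabled in your AWS account and that the region, prompt wording, and record fields are placeholders rather than values from the sample notebook; the collected records are what you would then pass to the watsonx.governance metric computation step shown in the notebook.

```python
# Sketch: call a Bedrock-hosted model and collect prompt/response records
# for a later watsonx.governance evaluation. Region, model ID, prompt text,
# and field names are illustrative assumptions.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt_template = "Summarize the following text in one sentence:\n\n{text}"
test_texts = ["Watsonx.governance evaluates prompt templates for generative AI models."]

records = []
for text in test_texts:
    body = json.dumps({
        # anthropic.claude-v2 expects the Human/Assistant prompt format
        "prompt": f"\n\nHuman: {prompt_template.format(text=text)}\n\nAssistant:",
        "max_tokens_to_sample": 300,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    completion = json.loads(response["body"].read())["completion"]
    # Input/output pairs like these feed the metric computation in the notebook.
    records.append({"input_text": text, "generated_text": completion})
```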

Note: Evaluations might also work for models from other providers, but have not been tested.

Although each sample notebook demonstrates specific tasks, you can modify it to include task types from the other sample notebooks.

For example, the AWS Bedrock sample notebook demonstrates summarization, content generation, question answering, and text classification tasks. You can use it as a guide for adding question answering to the Google Vertex AI sample notebook: copy the "Evaluating Question and Answering output from the Foundation Model" section from the AWS Bedrock sample notebook into the Google Vertex AI sample notebook, provide appropriate test data, and run the code cells.
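The following sketch illustrates the inference side of such a port, assuming your Google Cloud project and a Gemini model are already set up as in the Vertex AI sample notebook. The project ID, location, model name, prompt wording, and test data are placeholders; the question-answering evaluation cells copied from the AWS Bedrock notebook would then run against the records collected here.

```python
# Sketch: generate question-answering outputs with a Vertex AI Gemini model
# so that they can be evaluated by the copied question-answering cells.
# Project, location, model name, and data below are illustrative assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

qa_prompt_template = (
    "Answer the question based on the context.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

test_data = [
    {
        "context": "Watsonx.governance evaluates generative AI models.",
        "question": "What does watsonx.governance evaluate?",
    },
]

records = []
for row in test_data:
    response = model.generate_content(qa_prompt_template.format(**row))
    records.append({**row, "generated_text": response.text})
```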

The following sample notebooks are available:

Notebook: Use the watsonx.governance monitoring toolkit to evaluate a watsonx.ai model
  • Model: FLAN-UL2
  • Metrics evaluated: Summarization
  • Metrics published: OpenPages

Notebook: Use the watsonx.governance monitoring toolkit to evaluate AWS Bedrock
  • Model: AWS anthropic.claude-v2
  • Metrics evaluated: Summarization, Content generation, Question answering, Text classification

Notebook: Use the watsonx.governance monitoring toolkit to evaluate Google Vertex AI
  • Model: Google Vertex AI Gemini
  • Metrics evaluated: Summarization, Custom
  • Metrics published: OpenPages

Notebook: Use the watsonx.governance monitoring toolkit to evaluate Azure OpenAI
  • Model: Azure OpenAI GPT-35
  • Metrics evaluated: Summarization
  • Metrics published: OpenPages

Next steps

You can also evaluate prompt templates for non-IBM foundation models in watsonx.governance deployment spaces and projects.
