Evaluating prompt templates for non-IBM foundation models outside of watsonx
You can evaluate prompt templates against foundation models that are hosted outside of watsonx to measure how well the foundation model performs its tasks and to understand how the model generates responses.
Use the watsonx.governance features to evaluate prompt templates for externally hosted generative AI models. For example, if you inference an AWS Bedrock LLM, you can use the watsonx.governance APIs to evaluate the saved prompt template with watsonx.governance metrics for performance and reliability.
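The evaluation workflow starts with collecting the model's responses. The following sketch shows one way to gather predictions from an externally hosted model so they can be scored afterward; it uses the AWS SDK (boto3) to inference Anthropic Claude v2 on AWS Bedrock. The region, prompt template, and input text are illustrative assumptions, not values taken from the sample notebooks.

```python
# A minimal sketch, assuming AWS credentials are already configured:
# collect predictions from an externally hosted model (Claude v2 on
# AWS Bedrock) so they can be evaluated with watsonx.governance metrics.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical prompt template and input document for illustration.
prompt_template = "Summarize the following text:\n\n{text}\n\nSummary:"
documents = ["Watsonx.governance evaluates prompt templates with quality metrics."]

predictions = []
for doc in documents:
    # Claude v2 on Bedrock expects the Human/Assistant chat framing.
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt_template.format(text=doc)}\n\nAssistant:",
        "max_tokens_to_sample": 200,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    predictions.append(json.loads(response["body"].read())["completion"])
```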
With watsonx.governance, you can evaluate the output of your model against multiple task types, such as summarization, content generation, question answering, and text classification.
Review and use sample Jupyter Notebooks to evaluate models. Each notebook lists learning goals so you can find the one that best meets your needs.
Evaluation samples
Notebooks demonstrate how to evaluate your prompt templates against models hosted by:
- Google Vertex AI
- AWS Bedrock
- Azure OpenAI
While each sample notebook demonstrates specific tasks, you can modify any of the notebooks with task types from the other sample notebooks.
For example, the AWS Bedrock sample demonstrates summarization, content generation, question answering, and text classification tasks. You can use this notebook as a guide for adding question answering to the Google Vertex AI sample notebook: copy the Evaluating Question and Answering output from the Foundation Model section from the AWS Bedrock sample notebook into the Google Vertex AI sample notebook, provide appropriate test data, and run the code cells. The metric computation itself follows the same pattern in every notebook, as the sketch below shows.
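The following sketch outlines a summarization evaluation with the watsonx.governance metrics toolkit (the ibm-watson-openscale and ibm-metrics-plugin packages). Treat the enum names, column names, and configuration schema as assumptions based on the pattern in the sample notebooks; exact identifiers can vary across SDK versions, so verify them against your installed toolkit.

```python
# A minimal sketch of computing summarization metrics with the
# watsonx.governance metrics toolkit. The enum names and configuration
# schema below are assumptions drawn from the sample notebooks; verify
# them against your installed SDK version.
import pandas as pd
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_metrics_plugin.metrics.llm.utils.constants import (
    LLMSummarizationMetrics,
    LLMTextMetricGroup,
)

client = APIClient(authenticator=IAMAuthenticator(apikey="<IBM Cloud API key>"))

# Request the ROUGE score for the summarization metric group.
config_json = {
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
        }
    }
}

# Inputs, model outputs, and gold summaries as single-column DataFrames.
sources = pd.DataFrame({"original_text": ["<input document>"]})
predictions = pd.DataFrame({"generated_text": ["<model summary>"]})
references = pd.DataFrame({"reference_summary": ["<gold summary>"]})

result = client.llm_metrics.compute_metrics(
    config_json,
    sources=sources,
    predictions=predictions,
    references=references,
)
print(result)
```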
Notebook name | Model | Tasks demonstrated |
---|---|---|
Use the watsonx.governance monitoring toolkit to evaluate a watsonx.ai model | FLAN-UL2 | Metrics evaluated: Summarization. Metrics published: OpenPages |
Use the watsonx.governance monitoring toolkit to evaluate AWS Bedrock | AWS anthropic.claude-v2 | Metrics evaluated: Summarization, Content Generation, Question answering, Text classification |
Use the watsonx.governance monitoring toolkit to evaluate Google Vertex AI | Google Vertex AI Gemini | Metrics evaluated: Summarization, Custom. Metrics published: OpenPages |
Use the watsonx.governance monitoring toolkit to evaluate Azure OpenAI | Azure OpenAI GPT-35 | Metrics evaluated: Summarization. Metrics published: OpenPages |
Next steps
- To learn more about using notebook editors, see Notebooks.
- To learn more about working with notebooks, see Coding and running notebooks.
- To learn more about authenticating in a notebook, see Authentication.
Parent topic: APIs, SDKs and tutorials