Reviewing evaluation results in Watson OpenScale

You can select a deployment in Watson OpenScale from the insights dashboard to analyze evaluation results.

The Evaluation window displays the results for evaluations that you configure. Depending on the type of model, either pre-production or production, the Tests run, Tests passed, and Tests failed fields display the following results:

  • For a pre-production model deployment, the results of your test run are displayed.
  • For a production model deployment, the results of the most recent regularly scheduled hourly fairness and quality evaluations and the most recent 3-hour drift evaluation are displayed.

A model deployment evaluation chart is displayed, with the fairness, quality, and drift monitors each showing details for how the model meets the thresholds that you set.

Performing analysis

  • To view details about your model evaluation and manage data, use the Actions menu.
  • To run on-demand evaluations after uploading the data, select Evaluate now from the Actions menu.
  • To upload data for production models, click Upload payload data to upload payload data with a CSV file or click Upload feedback data to upload feedback data with a CSV file.
  • To use endpoints to provide data for your model evaluations, select View endpoints. For more information, see Sending model transactions. You can also perform these actions programmatically, as shown in the sketch after this list.
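
If you prefer to script these actions, the Watson OpenScale Python SDK (ibm_watson_openscale) provides equivalents for storing payload data and running on-demand evaluations. The following is a minimal sketch, assuming an IBM Cloud API key and previously looked-up data set and monitor instance IDs; the placeholder IDs and the sample request and response payloads are illustrative, not values from this documentation.

```python
# Minimal sketch: store payload records and trigger an on-demand evaluation with
# the Watson OpenScale Python SDK. All credentials and IDs are placeholders.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

wos_client = APIClient(authenticator=IAMAuthenticator(apikey="YOUR_CLOUD_API_KEY"))

# Store scoring transactions in the payload logging data set of your subscription.
# Look up the data set ID first, for example with wos_client.data_sets.list().
wos_client.data_sets.store_records(
    data_set_id="PAYLOAD_DATA_SET_ID",  # placeholder
    request_body=[
        PayloadRecord(
            request={"fields": ["age", "income"], "values": [[34, 52000]]},  # illustrative
            response={"fields": ["prediction", "probability"], "values": [["No Risk", [0.9, 0.1]]]},
            response_time=120,
        )
    ],
)

# Run an on-demand evaluation, the SDK equivalent of selecting Evaluate now.
wos_client.monitor_instances.run(
    monitor_instance_id="FAIRNESS_MONITOR_INSTANCE_ID",  # placeholder
    background_mode=False,
)
```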

Pre-production models

  • For pre-production models, you can upload feedback data with a CSV file or you can connect to a CSV file with feedback data that is stored in Cloud Object Storage or Db2.
  • If you want to upload feedback data that is already scored, you can select the Test data includes model output checkbox. Watson OpenScale does not rescore the test data when you select this option.
    • The test data that you upload can also include record_id/transaction_id and record_timestamp columns that are added to the payload logging and feedback tables when the Test data includes model output option is selected.
  • To view a timeline chart, click one of the evaluation tiles. The timeline chart displays aggregated evaluations as data points within the timeframe and Date range metric that you specify. The timestamp of each data point that displays when you hover over the chart does not match the timestamp of the latest evaluation because of the default aggregation behavior. The latest evaluation for the timeframe that you select is displayed during the associated date range. When you view a batch deployment, the chart can also display the following details:
    • The evaluation interval is set to 1 week by default. You can set the evaluation interval to 1 month or 1 year with the Watson OpenScale Python SDK, as sketched after this list.
    • The interval that is specified with the timeframe metric is set to the evaluation interval that you configure for the evaluations.
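
A sketch of changing the evaluation interval with the Python SDK follows. The monitor_instances.update call with a JSON-patch body reflects assumptions about the SDK surface; the /schedule path and the repeat_interval and repeat_unit field names are not confirmed by this documentation, so verify them against the Watson OpenScale API reference before you rely on them.

```python
# Hedged sketch: patch the schedule of an existing monitor instance to change its
# evaluation interval. The patch path and field names are assumptions; check the
# Watson OpenScale Python SDK reference for the exact schedule schema.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

wos_client = APIClient(authenticator=IAMAuthenticator(apikey="YOUR_CLOUD_API_KEY"))

wos_client.monitor_instances.update(
    monitor_instance_id="QUALITY_MONITOR_INSTANCE_ID",  # placeholder
    patch_document=[
        {
            "op": "replace",
            "path": "/schedule",  # assumed patch path
            "value": {"repeat_interval": 1, "repeat_unit": "month"},  # assumed field names
        }
    ],
)
```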

Analyzing fairness

You can click a data point on the chart to view more details about how the fairness scores are calculated. For each monitored attribute and fairness metric, you can view the calculations for the following types of data sets:

  • Balanced: This balanced calculation includes the scoring requests that are received for the selected hour. The calculation also includes more records from previous hours if the minimum number of records that are required for evaluation was not met. It also includes perturbed and synthesized records that are used to test the model's response when the value of the monitored feature changes.
  • Payload: The actual scoring requests that are received by the model for the selected hour.
  • Training: The training data records that are used to train the model.
  • Debiased: The output of the debiasing algorithm after processing the runtime and perturbed data.

To view the balanced data set calculation for batch deployment subscriptions, you must specify a model endpoint when you provide your deployment details. For more information, see Configuring the batch processor in Watson OpenScale. Watson OpenScale does not support debiased data set calculations for batch deployments.
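
If you want to review the same fairness results outside the chart, you can list the metrics that a fairness monitor instance has recorded by using the Python SDK. This is a minimal sketch, assuming that you already know the fairness monitor instance ID; the ID shown is a placeholder.

```python
# Minimal sketch: display the metrics recorded by a fairness monitor instance.
# The monitor instance ID is a placeholder for your own deployment.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

wos_client = APIClient(authenticator=IAMAuthenticator(apikey="YOUR_CLOUD_API_KEY"))

# Print a table of recent fairness metric values for the monitor instance.
wos_client.monitor_instances.show_metrics(
    monitor_instance_id="FAIRNESS_MONITOR_INSTANCE_ID"
)
```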

Next steps

Reviewing fairness evaluation results

Parent topic: Getting insights with Watson OpenScale