Drift in data

In addition to monitoring for drift in model accuracy, the drift monitor can detect drift in data. Data drift occurs when the data that the model receives at runtime deviates from what is standard, normal, or expected. Watson OpenScale detects data drift so that you can make changes to the model.

Understanding drift detection

Drift is the degradation of predictive performance over time because of hidden context. As your data changes over time, the ability of your model to make accurate predictions may deteriorate. Watson OpenScale both detects and highlights drift so that you can take corrective action.

How it works

Watson OpenScale analyzes all transactions to find the ones that contribute to drift. It then groups those records based on the similarity of the data inconsistency patterns that contributed significantly to the drift.

Data drift constraint specification

The constraints schema describes the statistics of the training data as a set of single-column and two-column data boundaries. At runtime, these statistics are used to identify outliers in the input data to the machine learning model. Single-column constraints deal with each column individually, while two-column constraints assume that a relationship might exist between any two columns in the training data.

Constraints JSON Object

The constraints schema itself is specified as a JSON object with two array fields that describe all the columns and the constraints in the training data. The JSON object takes the following format:

{
      "columns": [],
      "constraints": []
}

Column statistics

Each element in the columns array describes the standard statistical properties of a single column.

The data type of the column is indicated by the dtype field, which identifies whether the column holds categorical or numeric values.

A numeric column is described by its standard numerical bounds, such as its minimum, maximum, mean, standard deviation, and its first, second, and third quartile values.
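
As an illustration, a numeric column entry might look like the following sketch. The column name, statistic values, and exact field names are assumptions made for illustration; a generated constraints file may name these fields differently.

{
      "name": "age",
      "dtype": "int",
      "count": 5000,
      "min": 18,
      "max": 90,
      "mean": 41.7,
      "std": 12.3,
      "percentiles": [29.0, 40.0, 52.0]
}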

Common attributes of all constraints:

The name field identifies the concrete type of a constraint; its value depends on the kind of data boundary that the constraint describes.

The id field is an internal field to identify each constraint uniquely. Its value is a UUID.

The kind field identifies whether the constraint is a single-column or a two-column constraint. Allowed values are single_column and two_column.

The columns field is an array of column names. If the constraint deals with a single column, the array contains a single element whose value is the name of the column. If the constraint deals with two columns, the array contains the names of the two columns.
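
Putting these common attributes together, a single-column constraint might begin like the following sketch. The name value, the UUID, and the column name are illustrative assumptions; a real constraint also carries type-specific fields that describe its data boundaries.

{
      "name": "numeric_range_constraint",
      "id": "9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d",
      "kind": "single_column",
      "columns": ["age"]
}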

Single-feature constraints

Single-feature constraints, also known as single-column constraints, cannot be generated in the following instances:

Double-feature constraints

Double-feature constraints, also known as two-column constraints, cannot be generated in the following instances:

Working with large datasets

For data drift to be calculated successfully, very large datasets that consist of more than 1,012 columns must be broken up. Split the dataset into multiple datasets, each with a subset of the columns, and then generate the constraints for each subset, as in the sketch that follows.
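
A minimal sketch of the splitting step, assuming the training data is in a pandas DataFrame; learn_constraints is a hypothetical stand-in for your constraint-generation call, not a Watson OpenScale API.

import pandas as pd

MAX_COLS = 1012  # per-dataset column limit noted above

def split_columns(df: pd.DataFrame, max_cols: int = MAX_COLS):
    """Yield DataFrames whose column counts stay within the limit."""
    cols = list(df.columns)
    for start in range(0, len(cols), max_cols):
        yield df[cols[start:start + max_cols]]

# Hypothetical usage: generate constraints for each column subset.
# for subset in split_columns(training_df):
#     constraints = learn_constraints(subset)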

For datasets that have a large number of columns and use one-hot encoding, it is suggested that you write a wrapper on top of the model and provide Watson OpenScale with a REST API for the scoring endpoint. In this way, Watson OpenScale can accept data that is not one-hot encoded at training time and also while adding payload data.
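
A minimal sketch of such a wrapper, assuming a Flask app, a trained estimator with a predict() method, and a fitted one-hot encoder saved with joblib. The endpoint path, artifact file names, and payload shape are assumptions for illustration, not a Watson OpenScale requirement.

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder artifact paths; load your own trained model and encoder.
model = joblib.load("model.joblib")
encoder = joblib.load("encoder.joblib")

@app.route("/v1/score", methods=["POST"])
def score():
    payload = request.get_json()
    rows = payload["values"]           # raw rows, not one-hot encoded
    encoded = encoder.transform(rows)  # encoding happens inside the wrapper
    predictions = model.predict(encoded)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(port=8080)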

Do the math

Watson OpenScale analyzes each transaction for data inconsistency by comparing the transaction content with the training data patterns. If a transaction violates one or more of the training data patterns, the transaction is marked as drifted. Watson OpenScale then estimates the magnitude of data inconsistency as the fraction of drifted transactions to the total number of transactions analyzed; the sketch after this paragraph works through the arithmetic. Further, Watson OpenScale analyzes all the drifted transactions and groups the transactions that violate similar training data patterns into clusters. For each cluster, Watson OpenScale also estimates the features that played a major role in the data inconsistency and classifies their feature impact as large, some, and small.
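
For example, the fraction works out as in the following sketch; the transaction counts are made up for illustration.

# Drift magnitude = drifted transactions / total transactions analyzed.
drifted = 150      # illustrative count of transactions that violate a pattern
total = 1000       # illustrative count of transactions analyzed
magnitude = drifted / total
print(f"{magnitude:.0%}")  # prints 15%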

Next steps