Data quality score

The InfoSphere® Information Analyzer thin client displays a data quality score for each data set that you analyze.

Data quality dimensions

Data quality scores are computed based on quality dimensions for each individual column in the data set, and then a combined quality score for the entire data set is calculated. The combined score is an average of the scores for all columns.

Rules

Rules represent a type of quality dimension that evaluates or validates specific conditions associated with your data sources. Unlike other quality dimensions, which are identified automatically when a data set analysis is run, rules are defined by you: you can define data rules, rule sets, or quality rules and associate them with the appropriate data sources to customize your analysis results.

You create quality rules in the thin client by binding rule definitions to columns. You must use the Information Analyzer workbench to define and create data rules and rule sets and to associate them with data sources. When an analysis is run in the thin client, Information Analyzer identifies the data rules and rule sets that are associated with the analyzed data sets and determines whether they can be evaluated as part of the data quality score. All quality rules are evaluated as part of the data quality score.

Data rules and rule sets must meet certain criteria to be evaluated as part of a data quality score:

A data rule or rule set must be valid and fully bound.
All of the variables of the data rule or rule set must be bound to columns in the same data set or to constants.
Note: If rule set variables are bound to columns of different data sets, then the whole rule set is discarded from quality score evaluation.
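
The criteria above can be sketched as a small check. This is only an illustration: the Binding and Rule classes, their fields, and the is_scorable function are hypothetical names invented for this example and are not part of any Information Analyzer API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    """Hypothetical model of one variable binding."""
    variable: str
    data_set: Optional[str] = None  # None means the variable is bound to a constant
    column: Optional[str] = None    # None means the variable is bound to a constant


@dataclass
class Rule:
    """Hypothetical model of a data rule or rule set."""
    name: str
    valid: bool
    variables: List[str]
    bindings: List[Binding]


def is_scorable(rule: Rule) -> bool:
    """Return True if the rule meets the criteria listed above."""
    if not rule.valid:
        return False
    # Fully bound: every variable has a binding.
    bound = {b.variable for b in rule.bindings}
    if not set(rule.variables) <= bound:
        return False
    # Column bindings must all point at one data set; constants are ignored.
    data_sets = {b.data_set for b in rule.bindings if b.data_set is not None}
    return len(data_sets) <= 1


same_set = Rule("phone_format", True, ["x", "y"],
                [Binding("x", "CUSTOMERS", "PHONE"), Binding("y")])
mixed = Rule("phone_match", True, ["x", "y"],
             [Binding("x", "CUSTOMERS", "PHONE"), Binding("y", "ORDERS", "PHONE")])
print(is_scorable(same_set), is_scorable(mixed))
```

In this sketch, the rule bound to a column and a constant qualifies, and the rule bound to columns of two different data sets is discarded.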

Data quality confidence

Each potential quality dimension identified at either a value level or a column level is also associated with a confidence number, which indicates the system's certainty that the dimension identified is correct. Confidence is a number between 0.0 and 1.0, with 0.0 being no confidence that the dimension is correct, and 1.0 being absolute confidence that the dimension is correct.

The quality score of a value is computed as the product of (1.0 - confidence) over all quality problems identified for that cell or column. For example, imagine you have a column containing US names. One row contains a name from another country that is unusually long and contains a combination of letters not expected in a column of this data class. That record might be identified as a suspect value with a confidence of 70 percent. In the same column there is another value of "###############1234###############". That format is clearly a suspect domain violation and is identified with a confidence of 100 percent. The score decrease caused by the first value would be 70 percent of the score decrease caused by the second value.
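
The product described above can be written out as a minimal sketch. The function name value_quality_score is an assumption made for this illustration and is not an Information Analyzer API; the confidences are the ones from the names example.

```python
from math import prod
from typing import Iterable


def value_quality_score(confidences: Iterable[float]) -> float:
    """Product of (1.0 - confidence) over all problems found for one value.

    A value with no identified problems keeps a score of 1.0.
    """
    confidences = list(confidences)
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence must be between 0.0 and 1.0, got {c}")
    return prod((1.0 - c for c in confidences), start=1.0)


# Suspect value identified with 70 percent confidence.
unusual_name = value_quality_score([0.7])
# Domain violation identified with 100 percent confidence.
hash_pattern = value_quality_score([1.0])

# Score decrease relative to a clean value (score 1.0).
decrease_name = 1.0 - unusual_name
decrease_hash = 1.0 - hash_pattern
print(round(decrease_name / decrease_hash, 2))
```

The ratio of the two decreases comes out at roughly 0.7, which matches the statement that the first value lowers the score 70 percent as much as the second.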

Data quality score calculation example

The following table provides an example of a data quality score calculation with rules applied to certain columns. The example assumes that all dimensions have been identified with a confidence of 100 percent (or 1.0).

Table 1. Data quality score with rules calculation
Column 1*	Column 1 quality dimensions	Column 2	Column 2 quality dimensions	Column 3	Column 3 quality dimensions
555-123-4567		17 King Way		555-123-4567	Duplicate value
555-456-1234		22 B Street		555-123-4568
4567	Rule violation	45 H Lane		555-123-4569
555-0236-8565		6 Villa Drive		555-123-4520
555-236-8596		4 Parker Road		555-123-4567	Duplicate value
555-897-5632		NULL	Missing value	555-123-4530
Renee	Suspect value Rule violation	NULL	Missing value	NULL	Missing value
NULL	Missing value Rule violation		Missing value	555-123-4545
555-897-8523		555-123-4567	Suspect value	NULL	Missing value
3	Rule violation	09876	Suspect value	555-123-4555

*Column 1 has a data rule applied. The rule requires all rows in this column to contain a 7-digit numeric value.

Column 1: 4/10 values have data quality issues, for a column quality score of 60 percent.
Column 2: 5/10 values have data quality issues, for a column quality score of 50 percent.
Column 3: 4/10 values have data quality issues, for a column quality score of 60 percent.
Data set quality score: (60 + 50 + 60) / 3 = 56.67, which rounds to 57 percent.

To prevent records with multiple quality issues from unnecessarily weighing down the data quality score, a value that is identified with more than one issue counts the same against the quality score as a value with only one issue.
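
The figures for Table 1 can be reproduced with a short sketch. It assumes, as the example does, that every dimension is identified with a confidence of 1.0, so a value either has at least one issue or has none. The function names column_score and data_set_score are hypothetical and chosen only for this illustration.

```python
from typing import List

# Issues per value, transcribed row by row from Table 1.
column_1 = [[], [], ["Rule violation"], [], [], [],
            ["Suspect value", "Rule violation"],
            ["Missing value", "Rule violation"],
            [], ["Rule violation"]]
column_2 = [[], [], [], [], [],
            ["Missing value"], ["Missing value"], ["Missing value"],
            ["Suspect value"], ["Suspect value"]]
column_3 = [["Duplicate value"], [], [], [], ["Duplicate value"], [],
            ["Missing value"], [], ["Missing value"], []]


def column_score(issues_per_value: List[List[str]]) -> float:
    """Percentage of values in the column that have no issue.

    A value with several issues is counted once, like a value with one issue.
    """
    flagged = sum(1 for issues in issues_per_value if issues)
    total = len(issues_per_value)
    return 100.0 * (total - flagged) / total


def data_set_score(columns: List[List[List[str]]]) -> float:
    """Average of the column scores."""
    scores = [column_score(column) for column in columns]
    return sum(scores) / len(scores)


for name, column in [("Column 1", column_1), ("Column 2", column_2), ("Column 3", column_3)]:
    print(name, column_score(column))
print("Data set", round(data_set_score([column_1, column_2, column_3])))
```

Counting flagged values rather than individual issues is what keeps the rows with two issues in Column 1 from lowering the score more than the rows with one.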

The quality score of a data set might change after you migrate to a new release. The change is due to improvements to algorithms for various quality dimensions.