Data quality score

The InfoSphere® Information Analyzer thin client displays a data quality score for each data set that you analyze.

Data quality dimensions

Data quality scores are computed based on quality dimensions for each individual column in the data set, and then a combined quality score for the entire data set is calculated. The combined score is an average of the scores for all columns.

Rules

Rules are a type of quality dimension that evaluates or validates specific conditions associated with your data sources. Other quality dimensions are identified automatically when a data set analysis is run. In contrast, you define data rules, rule sets, or quality rules yourself and associate them with the appropriate data sources to customize your analysis results.

You create quality rules in the thin client by binding rule definitions to columns. To define and create data rules and rule sets and associate them with data sources, you must use the Information Analyzer workbench. When an analysis is run in the thin client, Information Analyzer identifies the data rules and rule sets that are associated with the analyzed data sets and determines whether they can be evaluated as part of the data quality score. All quality rules are evaluated as part of the data quality score.

Data rules and rule sets must meet certain criteria to be evaluated as part of a data quality score:
  • The data rule or rule set must be valid and fully bound.
  • All variables of the data rule or rule set must be bound to columns in the same data set or to constants.
    Note: If rule set variables are bound to columns of different data sets, then the whole rule set is discarded from quality score evaluation.
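
The criteria above can be sketched as a simple eligibility check. This is an illustration only, not how Information Analyzer is implemented: the names BoundRule, ColumnBinding, ConstantBinding, and is_scorable are hypothetical, and a single BoundRule object stands in for either a data rule or a rule set with all of its variables pooled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class ColumnBinding:
    """Hypothetical: a variable bound to a column of a named data set."""
    data_set: str
    column: str


@dataclass(frozen=True)
class ConstantBinding:
    """Hypothetical: a variable bound to a constant value."""
    value: object


Binding = Union[ColumnBinding, ConstantBinding]


@dataclass
class BoundRule:
    """Hypothetical stand-in for a data rule or a rule set."""
    name: str
    is_valid: bool
    # Variable name -> binding; None marks a variable that is not bound.
    bindings: Dict[str, Optional[Binding]]


def is_scorable(rule: BoundRule) -> bool:
    """Return True if the rule meets the criteria listed above."""
    if not rule.is_valid:
        return False
    # Fully bound: no variable may be left without a binding.
    if any(b is None for b in rule.bindings.values()):
        return False
    # All column bindings must point at one data set; constants are ignored.
    data_sets = {
        b.data_set for b in rule.bindings.values()
        if isinstance(b, ColumnBinding)
    }
    return len(data_sets) <= 1


ok_rule = BoundRule("phone_length", True, {
    "phone": ColumnBinding("CUSTOMERS", "PHONE"),
    "digits": ConstantBinding(7),
})
mixed_rule_set = BoundRule("cross_check", True, {
    "phone": ColumnBinding("CUSTOMERS", "PHONE"),
    "order_phone": ColumnBinding("ORDERS", "PHONE"),
})
print(is_scorable(ok_rule), is_scorable(mixed_rule_set))
```

In this sketch, mixed_rule_set is rejected as a whole because its variables are bound to columns of two different data sets, which mirrors the note above.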

Data quality confidence

Each potential quality dimension identified at either a value level or a column level is also associated with a confidence number, which indicates the system's certainty that the dimension identified is correct. Confidence is a number between 0.0 and 1.0, with 0.0 being no confidence that the dimension is correct, and 1.0 being absolute confidence that the dimension is correct.

The quality score of a value is computed as the product of (1.0 - confidence) over all quality problems identified for that cell or column. For example, imagine that you have a column containing US names. One row contains a name from another country that is unusually long and contains a combination of letters not expected in a column of this data class. That record might be identified as a suspect value with a confidence of 70 percent. In the same column there is another value, "###############1234###############". That format is clearly a suspect domain violation and is identified with a confidence of 100 percent. The score decrease caused by the first value would be 70 percent of the score decrease caused by the second value.
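
The calculation described above can be sketched in Python. This is a minimal illustration rather than the product's implementation: the function name value_score and its list-of-confidences input are assumptions made for the example.

```python
from math import prod


def value_score(confidences):
    """Return the quality score of one value.

    `confidences` holds the confidence (0.0 to 1.0) of each quality
    problem identified for the value. The score is the product of
    (1.0 - confidence) over all problems, so a value with no
    identified problems scores 1.0.
    """
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
    return prod((1.0 - c for c in confidences), start=1.0)


# The two values from the example above.
long_name_score = value_score([0.7])    # suspect value, 70 percent confidence
hash_value_score = value_score([1.0])   # domain violation, 100 percent confidence

# Score decrease relative to a value with no problems (score 1.0).
long_name_decrease = 1.0 - long_name_score
hash_value_decrease = 1.0 - hash_value_score
print(long_name_decrease / hash_value_decrease)
```

The printed ratio is approximately 0.7 (subject to floating-point rounding), matching the statement that the first value lowers the score 70 percent as much as the second.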

Data quality score calculation example

The following table provides an example of a data quality score calculation with rules applied to certain columns. The example assumes that all dimensions have been identified with a confidence of 100 percent (or 1.0).
Table 1. Data quality score with rules calculation
| Column 1*     | Column 1 quality dimensions    | Column 2      | Column 2 quality dimensions | Column 3     | Column 3 quality dimensions |
|---------------|--------------------------------|---------------|-----------------------------|--------------|-----------------------------|
| 555-123-4567  |                                | 17 King Way   |                             | 555-123-4567 | Duplicate value             |
| 555-456-1234  |                                | 22 B Street   |                             | 555-123-4568 |                             |
| 4567          | Rule violation                 | 45 H Lane     |                             | 555-123-4569 |                             |
| 555-0236-8565 |                                | 6 Villa Drive |                             | 555-123-4520 |                             |
| 555-236-8596  |                                | 4 Parker Road |                             | 555-123-4567 | Duplicate value             |
| 555-897-5632  |                                | NULL          | Missing value               | 555-123-4530 |                             |
| Renee         | Suspect value, Rule violation  | NULL          | Missing value               | NULL         | Missing value               |
| NULL          | Missing value, Rule violation  |               | Missing value               | 555-123-4545 |                             |
| 555-897-8523  |                                | 555-123-4567  | Suspect value               | NULL         | Missing value               |
| 3             | Rule violation                 | 09876         | Suspect value               | 555-123-4555 |                             |
*Column 1 has a data rule applied. The rule requires all rows in this column to contain a 7-digit numeric value.
  • Column 1: 4 of 10 values have data quality issues, for a column quality score of 60 percent.
  • Column 2: 5 of 10 values have data quality issues, for a column quality score of 50 percent.
  • Column 3: 4 of 10 values have data quality issues, for a column quality score of 60 percent.
  • Data set quality score: (60 + 50 + 60) / 3 = 56.7, or approximately 57 percent.
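
The arithmetic in this example can be reproduced with a short Python sketch. It assumes, as the example does, that every dimension is identified with a confidence of 100 percent, so a column score reduces to the share of values with no identified issue. The names column_score, column_1, column_2, and column_3 are illustrative, and the lists are transcribed by hand from Table 1.

```python
# Quality dimensions reported for each of the 10 values in a column.
# An empty list means that no issue was identified for that value.
column_1 = [
    [], [], ["Rule violation"], [], [], [],
    ["Suspect value", "Rule violation"],
    ["Missing value", "Rule violation"],
    [], ["Rule violation"],
]
column_2 = [
    [], [], [], [], [],
    ["Missing value"], ["Missing value"], ["Missing value"],
    ["Suspect value"], ["Suspect value"],
]
column_3 = [
    ["Duplicate value"], [], [], [], ["Duplicate value"], [],
    ["Missing value"], [], ["Missing value"], [],
]


def column_score(issues_per_value):
    """Percentage of values in the column with no identified issue.

    Each flagged value is counted once, however many dimensions
    were reported for it.
    """
    flagged = sum(1 for issues in issues_per_value if issues)
    total = len(issues_per_value)
    return 100 * (total - flagged) / total


column_scores = [column_score(c) for c in (column_1, column_2, column_3)]
data_set_score = sum(column_scores) / len(column_scores)

print(column_scores)          # [60.0, 50.0, 60.0]
print(round(data_set_score))  # 57
```

Column 1 has six reported dimensions spread over four values, and the sketch counts four flagged values, which is why its score is 60 percent rather than 40 percent.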

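To prevent records with multiple quality issues from unnecessarily weighing down the data quality score, a value that is identified with more than one issue counts the same against the quality score as a value with only one issue. For example, the value Renee in Column 1 has two issues but counts as only one of the four values with issues in that column.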

The quality score of a data set might change after you migrate to a new release. The change is due to improvements to algorithms for various quality dimensions.