Data quality score
The InfoSphere® Information Analyzer thin client displays a data quality score for each data set that you analyze.
Data quality dimensions
Data quality scores are computed based on quality dimensions for each individual column in the data set, and then a combined quality score for the entire data set is calculated. The combined score is an average of the scores for all columns.
Rules
Rules represent a type of quality dimension that evaluates or validates specific conditions associated with your data sources. Other quality dimensions are identified automatically when a data set analysis is run. Rules, by contrast, are defined by you: you can define data rules, rule sets, or quality rules and associate them with the appropriate data sources to customize your analysis results. You create quality rules in the thin client by binding rule definitions to columns. You must use the Information Analyzer workbench to define and create data rules and rule sets and to associate them with data sources. When an analysis is run in the thin client, Information Analyzer identifies the data rules and rule sets that are associated with the analyzed data sets, and determines whether they can be evaluated as part of the data quality score. All quality rules are evaluated as part of the data quality score.
A data rule or rule set is evaluated as part of the data quality score only if it meets the following conditions:
- The data rule or rule set must be valid and fully bound.
- All data rule and rule set variables must be bound to columns in the same data set or to constants.
Note: If rule set variables are bound to columns of different data sets, the whole rule set is discarded from quality score evaluation.
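The conditions above can be sketched as a simple eligibility check. This is only an illustration in Python: the `Binding` and `Rule` classes, their fields, and the `is_scoreable` function are hypothetical names invented for this sketch and are not part of any Information Analyzer API. The sketch also assumes that "the same data set" means the data set that is being analyzed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    """Hypothetical binding of one rule variable to a column or a constant."""
    variable: str
    kind: str                        # "column" or "constant"
    data_set: Optional[str] = None   # owning data set when kind == "column"


@dataclass
class Rule:
    """Hypothetical stand-in for a data rule or a rule set."""
    name: str
    valid: bool
    variables: List[str]
    bindings: List[Binding] = field(default_factory=list)


def is_scoreable(rule: Rule, analyzed_data_set: str) -> bool:
    """Return True if the rule could take part in the quality score."""
    # Condition 1: the rule must be valid and fully bound.
    if not rule.valid:
        return False
    bound_variables = {b.variable for b in rule.bindings}
    if not set(rule.variables) <= bound_variables:
        return False
    # Condition 2: every column binding must point at the same data set
    # (assumed here to be the analyzed one); constants are always allowed.
    for b in rule.bindings:
        if b.kind == "column" and b.data_set != analyzed_data_set:
            return False
    return True


# Example rules with made-up names.
same_set = Rule("phone_format", True, ["phone", "min_len"],
                [Binding("phone", "column", "CUSTOMERS"),
                 Binding("min_len", "constant")])
mixed_sets = Rule("cross_check", True, ["a", "b"],
                  [Binding("a", "column", "CUSTOMERS"),
                   Binding("b", "column", "ORDERS")])
partly_bound = Rule("incomplete", True, ["a", "b"],
                    [Binding("a", "column", "CUSTOMERS")])
```

In this sketch, `same_set` passes the check for the `CUSTOMERS` data set, while `mixed_sets` and `partly_bound` are excluded from scoring.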
Data quality confidence
Each potential quality dimension identified at either a value level or a column level is also associated with a confidence number, which indicates the system's certainty that the dimension identified is correct. Confidence is a number between 0.0 and 1.0, with 0.0 being no confidence that the dimension is correct, and 1.0 being absolute confidence that the dimension is correct.
The quality score of a value is computed as the product of (1.0 - confidence) over all quality problems identified for that cell or column. For example, imagine that you have a column containing US names. One row contains a name from another country that is unusually long and contains a combination of letters not expected in a column of this data class. That value might be identified as a suspect value with a confidence of 70 percent. In the same column there is another value of "###############1234###############". That format is clearly a suspect domain violation and is identified with a confidence of 100 percent. The score decrease caused by the first value would be 70 percent of the score decrease caused by the second value.
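The product rule described above can be illustrated with a short Python sketch. The function name `value_score` and the variable names are assumptions made for this example. The sketch covers only the per-value formula as stated in the text, not the product's actual implementation.

```python
from math import prod


def value_score(confidences):
    """Score of one value: product of (1.0 - confidence) over its problems.

    A value with no identified problems keeps a score of 1.0.
    """
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence must be in [0.0, 1.0], got {c}")
    return prod(1.0 - c for c in confidences)


# The two values from the example in the text.
foreign_name_score = value_score([0.7])   # suspect value, 70 percent confidence
pattern_score = value_score([1.0])        # domain violation, 100 percent confidence
clean_score = value_score([])             # no problems identified

# Score decrease relative to a clean value.
foreign_name_decrease = clean_score - foreign_name_score
pattern_decrease = clean_score - pattern_score
ratio = foreign_name_decrease / pattern_decrease
```

Here `ratio` comes out at approximately 0.7, which matches the statement that the first value lowers the score by 70 percent as much as the second.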
Data quality score calculation example
Column 1* | Column 1 quality dimensions | Column 2 | Column 2 quality dimensions | Column 3 | Column 3 quality dimensions |
---|---|---|---|---|---|
555-123-4567 | | 17 King Way | | 555-123-4567 | Duplicate value |
555-456-1234 | | 22 B Street | | 555-123-4568 | |
4567 | Rule violation | 45 H Lane | | 555-123-4569 | |
555-0236-8565 | | 6 Villa Drive | | 555-123-4520 | |
555-236-8596 | | 4 Parker Road | | 555-123-4567 | Duplicate value |
555-897-5632 | | NULL | Missing value | 555-123-4530 | |
Renee | (one or more issues) | NULL | Missing value | NULL | Missing value |
NULL | (one or more issues) | | Missing value | 555-123-4545 | |
555-897-8523 | | 555-123-4567 | Suspect value | NULL | Missing value |
3 | Rule violation | 09876 | Suspect value | 555-123-4555 | |
- Column 1: Data quality issues are identified for 4 of 10 values, for a column quality score of 60 percent.
- Column 2: Data quality issues are identified for 5 of 10 values, for a column quality score of 50 percent.
- Column 3: Data quality issues are identified for 4 of 10 values, for a column quality score of 60 percent.
- Data set quality score: (60 + 50 + 60) / 3 = 56.7, which rounds to 57 percent.
To prevent records with multiple quality issues from unnecessarily weighing down the data quality score, values that are identified with more than one issue do not weigh differently against the quality score than values with only one.
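As a rough illustration of the calculation example, the following Python sketch counts each flagged value once, no matter how many issues it carries, and then averages the column scores. The function names, the `make_column` helper, and the issue labels are hypothetical. The sketch also assumes that every issue is identified with a confidence of 1.0, as the example appears to do.

```python
def column_score(issues_per_value):
    """Percentage of values in a column that have no identified issue.

    `issues_per_value` holds one list of issue names per value. A value
    with several issues counts once, the same as a value with one issue.
    """
    total = len(issues_per_value)
    flagged = sum(1 for issues in issues_per_value if issues)
    return 100.0 * (total - flagged) / total


def data_set_score(columns):
    """Combined score: the average of the scores for all columns."""
    scores = [column_score(column) for column in columns]
    return sum(scores) / len(scores)


def make_column(flagged, total=10):
    """Hypothetical helper: `flagged` values with one issue, the rest clean."""
    return [["issue"] for _ in range(flagged)] + [[] for _ in range(total - flagged)]


# Flagged counts mirror the example: 4, 5, and 4 values out of 10.
column_1 = make_column(4)
column_2 = make_column(5)
column_3 = make_column(4)

# Hypothetical value with two issues; it still counts as one flagged value.
column_1[0] = ["issue A", "issue B"]

overall = data_set_score([column_1, column_2, column_3])
print(f"Data set quality score: {overall:.0f} percent")
```

Under these assumptions the column scores are 60, 50, and 60 percent, and the combined score rounds to 57 percent, the same as in the example.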
The quality score of a data set might change after you migrate to a new release. The change is due to improvements to algorithms for various quality dimensions.