Probabilistic vs. deterministic matching styles
InfoSphere® MDM suspected duplicate processing can be configured to work with different matching engines and matching styles. The two matching styles are probabilistic matching and deterministic matching.
- Deterministic matching
Deterministic matching, as provided by the MDM Classic Matching Engine, compares the set of values for all of a given party's critical data elements with those of another party. This comparison takes into account the presence, absence, and content of the values, and results in a matching score.
In a deterministic matching comparison, values that are present in both suspected duplicate parties and match create a unique score that is referred to as the match relevancy portion of the score. Values that are present in both parties but do not match create the non-match relevancy portion of the score. Taken together, the match and non-match relevancy scores define the type of suspected duplicate that has been found.
When a value is missing in one or both parties for a particular critical data element, that element is excluded from both the match and the non-match relevancy scores.
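The scoring rules described above can be illustrated with a minimal sketch. It is not the MDM Classic Matching Engine's implementation: the element names, the per-element values in `ELEMENT_VALUES`, the case-insensitive comparison, and the function `relevancy_scores` are all assumptions chosen for illustration. Powers of two are used only so that every combination of elements sums to a distinct score.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-element values (not the product's actual values).
# Powers of two make each combination of elements sum to a distinct score.
ELEMENT_VALUES: Dict[str, int] = {
    "last_name": 1,
    "given_name": 2,
    "birth_date": 4,
    "tax_id": 8,
}

Party = Dict[str, Optional[str]]


def relevancy_scores(a: Party, b: Party) -> Tuple[int, int]:
    """Return (match_relevancy, non_match_relevancy) for two parties."""
    match = 0
    non_match = 0
    for element, value in ELEMENT_VALUES.items():
        left, right = a.get(element), b.get(element)
        # A value missing in one or both parties excludes the element
        # from both scores.
        if not left or not right:
            continue
        # Assumed comparison rule: trimmed, case-insensitive equality.
        if left.strip().lower() == right.strip().lower():
            match += value
        else:
            non_match += value
    return match, non_match


party_a: Party = {"last_name": "Smith", "given_name": "Ann",
                  "birth_date": "1980-01-02", "tax_id": None}
party_b: Party = {"last_name": "smith", "given_name": "Anne",
                  "birth_date": "1980-01-02", "tax_id": "123"}

scores = relevancy_scores(party_a, party_b)
print(scores)  # (5, 2)
```

In this sketch the last name and birth date match (1 + 4), the given name is present in both parties but differs (2), and the tax identifier is ignored because it is missing from one party.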
- Probabilistic matching
Probabilistic matching, as provided by the InfoSphere MDM Probabilistic Matching Engine and the IBM® InfoSphere QualityStage matching engine, generates matching scores that take into consideration the frequency of the occurrence of a data value within a particular distribution.
For example, matching on the last name Smith in North America should yield a lower matching score than matching on the last name DeFillipo. That is, the likelihood that a match on Smith is a true match is lower than the likelihood that a match on DeFillipo is a true match, because Smith is a more common last name in North America.
Because it weighs how distinctive each matching value is, the probabilistic matching approach can be used to improve the accuracy of suspected duplicate matching scores and categories.
In contrast to the deterministic approach, which produces both match and non-match relevancy scores, probabilistic matching engines calculate only one composite weight. This weight is the result of a probabilistic calculation that combines the match and non-match weights for the individual compared fields.
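The idea of frequency-sensitive field weights combined into one composite weight can be sketched with a generic log-likelihood-ratio scheme in the style of Fellegi-Sunter record linkage. This is an illustration of the general technique only, not the algorithm used by the InfoSphere MDM Probabilistic Matching Engine or by QualityStage. The probability `M_PROB`, the frequencies in `LAST_NAME_FREQ`, the default chance-agreement probability, and all function names are invented for the example.

```python
import math
from typing import Dict

# Assumed probability that a field agrees when two records truly match.
M_PROB = 0.95

# Invented relative frequencies of last names in a population.
LAST_NAME_FREQ: Dict[str, float] = {"smith": 0.01, "defillipo": 0.00001}


def field_weight(left: str, right: str, freq: Dict[str, float],
                 default_u: float = 0.001) -> float:
    """Weight for one compared field.

    Agreement: log2(m / u), where u is approximated by the frequency of the
    agreeing value, so rare values earn a larger match weight.
    Disagreement: log2((1 - m) / (1 - u)), a negative non-match weight.
    """
    l, r = left.strip().lower(), right.strip().lower()
    if l == r:
        u = freq.get(l, default_u)
        return math.log2(M_PROB / u)
    return math.log2((1 - M_PROB) / (1 - default_u))


def composite_weight(a: Dict[str, str], b: Dict[str, str],
                     freq_tables: Dict[str, Dict[str, float]]) -> float:
    """Sum the per-field weights into a single composite weight."""
    return sum(field_weight(a[f], b[f], table)
               for f, table in freq_tables.items())


smith = field_weight("Smith", "smith", LAST_NAME_FREQ)
defillipo = field_weight("DeFillipo", "defillipo", LAST_NAME_FREQ)
print(smith < defillipo)  # True: the common name contributes less

tables = {"last_name": LAST_NAME_FREQ, "given_name": {}}
total = composite_weight({"last_name": "Smith", "given_name": "Ann"},
                         {"last_name": "Smith", "given_name": "Bob"}, tables)
```

Here a match on Smith adds less to the composite weight than a match on DeFillipo would, and the disagreeing given name subtracts from the total, leaving a single number to compare against whatever thresholds a deployment defines.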