Probabilistic vs. deterministic matching styles

InfoSphere® MDM suspected duplicate processing can be configured to work with different matching engines and matching styles. The two matching styles are probabilistic matching and deterministic matching.

Deterministic matching

Deterministic matching, as provided by the MDM Classic Matching Engine, compares the values of all of a given party's critical data elements with those of another party. The comparison takes into account the presence, absence, and content of the values, and results in a matching score.

In a deterministic comparison, values that are present in both suspected duplicate parties and that match produce a unique score called the match relevancy score. Values that are present in both parties but do not match produce the non-match relevancy score. Taken together, the match and non-match relevancy scores define the type of suspected duplicate that has been found.

When a value is missing in one or both parties for a particular critical data element, that element is excluded from both the match and the non-match relevancy scores.
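The rules above can be sketched in a few lines of Python. This is only an illustration, not the MDM Classic Matching Engine's implementation: the element names, the power-of-two values assigned to them, and the `relevancy_scores` function are all assumptions made for the example. Powers of two are used so that every combination of elements adds up to a distinct score.

```python
from typing import Dict, Optional, Tuple

# Hypothetical critical data elements. Each gets its own power-of-two value
# (an assumption for this sketch) so any combination sums to a unique score.
ELEMENT_VALUES: Dict[str, int] = {
    "last_name": 1,
    "given_name": 2,
    "birth_date": 4,
    "tax_id": 8,
}

Party = Dict[str, Optional[str]]


def relevancy_scores(a: Party, b: Party) -> Tuple[int, int]:
    """Return (match relevancy, non-match relevancy) for two parties."""
    match_score = 0
    non_match_score = 0
    for element, value in ELEMENT_VALUES.items():
        left, right = a.get(element), b.get(element)
        # An element missing in one or both parties contributes to neither score.
        if not left or not right:
            continue
        if left.strip().lower() == right.strip().lower():
            match_score += value
        else:
            non_match_score += value
    return match_score, non_match_score


party_a = {"last_name": "Smith", "given_name": "Anna",
           "birth_date": "1980-02-01", "tax_id": None}
party_b = {"last_name": "smith", "given_name": "Ann",
           "birth_date": "1980-02-01", "tax_id": "123"}

print(relevancy_scores(party_a, party_b))  # (5, 2)
```

Here the last name and birth date match (1 + 4), the given name is present in both parties but differs (2), and the tax ID is skipped because it is missing in one party.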

Probabilistic matching

Probabilistic matching, as provided by the InfoSphere MDM Probabilistic Matching Engine and the IBM® InfoSphere QualityStage matching engine, generates matching scores that take into account how frequently a data value occurs within a particular distribution.

For example, matching on the last name Smith in North America should yield a lower matching score than matching on the last name DeFillipo. Because Smith is a more common last name in North America, two records that share it are less likely to refer to the same party than two records that share the last name DeFillipo.
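The effect of value frequency can be illustrated with a minimal sketch. It assumes a simple weighting in which the agreement weight of a value is the base-2 logarithm of the inverse of its relative frequency. That formula, the `agreement_weight` function, and the sample of names are assumptions for illustration and are not taken from either matching engine.

```python
import math
from collections import Counter

# Made-up sample of last names, for illustration only.
last_names = ["Smith"] * 90 + ["Jones"] * 9 + ["DeFillipo"] * 1
frequency = Counter(last_names)
total = sum(frequency.values())


def agreement_weight(value: str) -> float:
    """Return log2(1 / relative frequency), so rarer values score higher.

    Assumes the value occurs at least once in the sample.
    """
    return math.log2(total / frequency[value])


smith = agreement_weight("Smith")
defillipo = agreement_weight("DeFillipo")

# Agreement on the rare name carries more weight than agreement on the common one.
print(f"Smith: {smith:.2f}, DeFillipo: {defillipo:.2f}")
```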

The probabilistic matching approach can be used to improve the accuracy of suspected duplicate matching scores and categories.

In contrast to the deterministic approach that produces both match and non-match relevancy scores, probabilistic matching engines calculate only one composite weight. This weight is a result of a probabilistic calculation of the match and non-match weights for the individual compared fields.
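One common way to combine per-field weights into a single composite weight is the Fellegi-Sunter model, sketched below. Whether the engines named above use exactly this calculation is not stated here, so treat it as an assumption. The field names and the m and u probabilities are hypothetical.

```python
import math
from typing import Dict, Tuple

# Hypothetical per-field probabilities (assumptions for this sketch):
#   m = probability that the field agrees when both records are the same party
#   u = probability that the field agrees when the records are different parties
FIELD_PROBABILITIES: Dict[str, Tuple[float, float]] = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "postal_code": (0.90, 0.05),
}


def composite_weight(agreements: Dict[str, bool]) -> float:
    """Sum a match weight or a non-match weight for each compared field."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBABILITIES[field]
        if agrees:
            total += math.log2(m / u)              # match weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # non-match weight (negative)
    return total


all_agree = composite_weight(
    {"last_name": True, "birth_date": True, "postal_code": True})
one_disagrees = composite_weight(
    {"last_name": True, "birth_date": False, "postal_code": True})

print(f"all agree: {all_agree:.2f}, birth date disagrees: {one_disagrees:.2f}")
```

A disagreement on a highly reliable field such as the birth date pulls the composite weight down sharply, which is why a single number can still reflect both matching and non-matching evidence.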

Remember: No matching system can guarantee correct results all of the time. It is therefore critical that you understand and define your business tolerance for both missed matches and false matches. Doing so involves iteratively testing the modifications made to the critical data, match scoring, and survivorship. It also requires your data stewards to inspect critical data, assess match criteria, and provide feedback on the results, to ensure both the accuracy and the completeness of the processed data.
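As a rough illustration of weighing false matches against missed matches, the sketch below counts both kinds of error at several score thresholds. The reviewed sample, the scores, the thresholds, and the `error_counts` function are all hypothetical. In practice the sample would come from pairs that your data stewards have reviewed.

```python
from typing import List, Tuple

# Hypothetical steward-reviewed sample:
# (matching score, steward confirmed the two records are the same party)
reviewed_pairs: List[Tuple[float, bool]] = [
    (12.4, True), (9.1, True), (7.8, False),
    (6.5, True), (3.2, False), (1.0, False),
]


def error_counts(pairs: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false matches, missed matches) at the given score threshold."""
    false_matches = sum(1 for score, same in pairs if score >= threshold and not same)
    missed_matches = sum(1 for score, same in pairs if score < threshold and same)
    return false_matches, missed_matches


for threshold in (5.0, 7.0, 9.0):
    print(threshold, error_counts(reviewed_pairs, threshold))
# 5.0 (1, 0)
# 7.0 (1, 1)
# 9.0 (0, 1)
```

Raising the threshold removes the false match but introduces a missed match. Which trade-off is acceptable is a business decision, not something the code can settle.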

Tip: Engage your data stewards and business analysts in the testing and validation of changes made to critical data, match scoring, and survivorship processing.