It is often necessary to combine data from multiple sources to get a complete picture of the entities we're analyzing. As data scientists, in addition to just linking data, we are also concerned about missing links, duplicate links, and erroneous links. Record linkage methods range from traditional rule-based and probabilistic approaches to more modern approaches using machine learning.
Background and Motivation
The goal of record linkage is to determine whether pairs of records describe the same entity. This is important for removing duplicates from a data source or joining two separate data sources together. Record linkage also goes by other names in various fields, including data matching, merge/purge, duplicate detection, de-duping, reference matching, and co-reference/anaphora. There are several approaches to record linkage, including exact matching, rule-based linking, and probabilistic linking. An example of exact matching is joining records based on social security number. Rule-based matching involves applying a cascading set of rules that reflect domain knowledge about the records being linked. In probabilistic record linkage, linkage weights are calculated from how well the fields of two records agree, and a threshold is applied to decide whether or not to link the records.
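To make the three approaches concrete, here is a minimal sketch in plain Python. The records, field names, fallback rule, per-field probabilities, and threshold are all made-up assumptions for illustration, not values from any real dataset. The probabilistic part uses one common weighting scheme (Fellegi-Sunter style log-ratio weights), which is assumed here as a representative choice rather than the only way to compute linkage weights.

```python
import math

# Hypothetical toy records; all field names and values are invented.
a = {"ssn": "123-45-6789", "first": "Jon", "last": "Smith", "dob": "1980-02-01"}
b = {"ssn": None, "first": "John", "last": "Smith", "dob": "1980-02-01"}


def exact_match(r1, r2, key="ssn"):
    """Exact matching: link only when the key is present and identical."""
    return r1[key] is not None and r1[key] == r2[key]


def rule_based_match(r1, r2):
    """Cascading rules: try the strictest rule first, then fall back."""
    if exact_match(r1, r2, "ssn"):
        return True
    # Fallback rule (an assumed piece of domain knowledge):
    # same last name, same date of birth, and same first initial.
    return (
        r1["last"] == r2["last"]
        and r1["dob"] == r2["dob"]
        and r1["first"][:1] == r2["first"][:1]
    )


# Assumed per-field probabilities (m, u):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {"first": (0.90, 0.05), "last": (0.95, 0.01), "dob": (0.98, 0.001)}


def linkage_weight(r1, r2):
    """Sum log-ratio weights: positive for agreement, negative otherwise."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if r1[field] == r2[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def probabilistic_match(r1, r2, threshold=5.0):
    """Link the pair when its total weight reaches the (assumed) threshold."""
    return linkage_weight(r1, r2) >= threshold


print("exact:        ", exact_match(a, b))
print("rule-based:   ", rule_based_match(a, b))
print("probabilistic:", probabilistic_match(a, b))
```

In this toy pair the social security number is missing from one record, so exact matching fails, while the fallback rule and the weighted comparison both link the records despite the differing first-name spelling. In practice the probabilities and threshold would be estimated from data rather than chosen by hand.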
- The notebook
- DSaPP created a webapp for doing matching