Record Linkage and Data Matching
Advanced Statistical Methods
Connecting Records Across Imperfect Databases
Record linkage identifies records that refer to the same entity across different data sources using probabilistic models and string distance metrics. The Felleg-Sunter model provides the theoretical foundation.
- Public health β Link hospital records to disease registries for longitudinal studies
- Bureau of statistics β Combine administrative datasets while preserving privacy
- Finance β Match customer records across mergers and acquisitions for risk assessment
Record linkage bridges the gap between siloed datasets to unlock richer analytical insights.
Record linkage (also called entity resolution, deduplication, or data matching) identifies records across or within databases that refer to the same real-world entity. When unique identifiers are absent, statistical and probabilistic methods are needed to determine whether two records represent the same entity. This lesson develops the mathematical foundations of record linkage from deterministic rules through the Felleg-Sunter probabilistic framework.
Deterministic vs Probabilistic Linkage
DfRecord Linkage Problem
Given two datasets and , the record linkage problem is to identify all pairs where and refer to the same real-world entity. Formally, define a binary match indicator if and are a true match, and otherwise.
Deterministic Linkage
Deterministic linkage requires exact agreement on a set of key fields (e.g., Social Security Number, date of birth, last name). Records are classified as matches if and only if all key fields agree. This approach is:
- Fast and interpretable when reliable identifiers exist
- Brittle: A single character error in any key field causes a missed match
- Inappropriate for noisy data, name variations, or missing values
DfProbabilistic Linkage
Probabilistic linkage quantifies the strength of evidence that two records match, using field-by-field agreement patterns. The core question: given that two records agree on some fields and disagree on others, what is the probability they represent the same entity?
Felleg-Sunter Model
Felleg-Sunter Framework
The Felleg-Sunter model (1969) compares each pair across fields. For each field , define an agreement indicator:
The match weight for the pair is:
where:
- is the match probability (probability of agreement given a true match)
- is the non-match probability (probability of agreement given a non-match)
The log-likelihood ratio (LLR) for the entire pair is:
Interpreting Match Weights
- Strong match: (positive threshold), e.g., bits
- Strong non-match: (negative threshold), e.g., bits
- Possible match: , requiring clerical review or further analysis
The weight is the contribution of field . Fields with high and low (e.g., exact Social Security Number) contribute large positive weights; fields with (e.g., common last names) contribute near zero.
Felleg-Sunter Assumptions
The Felleg-Sunter model assumes:
- Conditional independence: Agreement indicators are independent given the match status:
- Monotonicity: For any field , (true matches agree more often than non-matches)
- Parameter stability: and are the same for all record pairs
These assumptions are often violated in practice. Violations of conditional independence (e.g., correlated name and address fields) can bias match probability estimates.
String Distance Metrics
Levenshtein Distance
The Levenshtein (edit) distance between strings and is the minimum number of single-character insertions, deletions, or substitutions required to transform into :
where is with the first character removed. This is computed via dynamic programming in time. The normalized Levenshtein distance is .
Jaro-Winkler Similarity
The Jaro similarity between strings and is:
where is the number of matching characters (characters within positions), and is the number of transpositions among matching characters.
The Jaro-Winkler similarity boosts the score for strings that match from the beginning:
where is the scaling factor and is the length of the common prefix (up to ). JW gives higher scores to strings sharing a common prefix, making it well-suited for name matching.
Metric Comparison
| Metric | Range | Handles Transpositions | Prefix Bias | Best For |
|---|---|---|---|---|
| Levenshtein | No | No | General editing | |
| Jaro | Yes | No | Short strings | |
| Jaro-Winkler | Yes | Yes | Names | |
| Soundex | Categorical | N/A | Phonetic | Name phonetics |
| Metaphone | Categorical | N/A | Phonetic | Name phonetics |
Blocking
DfBlocking
Blocking reduces the comparison space by partitioning records into blocks within which comparisons are performed. Records in different blocks are never compared. A good blocking scheme minimizes the number of missed true matches (false negatives) while reducing the number of comparisons.
Blocking Efficiency
The reduction ratio measures the fraction of comparisons eliminated:
The pairs completeness (sensitivity) measures the fraction of true matches retained:
The pairs quality (positive predictive value) is:
Blocking Strategies
- Standard blocking: Exact match on a key (e.g., first 3 characters of last name + ZIP code)
- Sorted neighborhood: Sort records by a key, then compare within a sliding window
- Multi-pass blocking: Multiple blocking passes with different keys, union of candidate pairs
- Canopy clustering: TF-IDF + cosine similarity with thresholds for loose blocking
- Learning-based: Use a classifier to predict if two records should be compared
EM Algorithm for Linkage
DfUnsupervised Linkage Parameters
When true match labels are unknown, the EM algorithm estimates , , and the prior match probability from the comparison patterns alone.
EM for Felleg-Sunter
Let be the comparison vector for the -th pair, and the latent match indicator.
E-step: Compute posterior match probabilities:
M-step: Update parameters:
Iterate until convergence. The EM algorithm finds a local maximum of the likelihood; multiple random initializations are recommended.
EM Convergence Issues
- EM converges to a local maximum, not necessarily the global maximum
- Initialization matters: start with reasonable for all fields
- Labelled data (even a small sample of known matches) dramatically improves estimation
- With many fields, EM can overfit: use regularization or Bayesian priors
- The number of true matches is often very small relative to the total number of pairs, creating severe class imbalance
Privacy-Preserving Record Linkage
DfPrivate Record Linkage
Privacy-preserving record linkage (PPRL) enables linkage without revealing the actual data values. Key approaches include:
- Bloom filters: Encode each field as a Bloom filter; compare using set operations on encoded values
- Homomorphic encryption: Perform comparisons on encrypted data
- Secure multi-party computation: Jointly compute linkage without revealing inputs
- Differential privacy: Add calibrated noise to comparison scores
Bloom Filter Encoding
A Bloom filter is a bit vector representing a set . To insert element :
where are hash functions. The Jaccard similarity of two Bloom filters approximates the similarity of the underlying strings:
False positives occur when bits are set by different strings, but the approximation is sufficient for linkage when Bloom filter size is large enough ().
Evaluation Metrics
Linkage Quality Metrics
Given a set of predicted matches and true matches :
The F-beta score weights recall more heavily when (missing matches is costly). The linkage error rate is:
In practice, sampling-based evaluation estimates these metrics by manually reviewing a stratified sample of pairs (strong matches, possible matches, strong non-matches).
Evaluation Challenges
- Ground truth is usually unavailable for the full dataset; evaluation relies on samples
- Clerical review of uncertain pairs is expensive but necessary for bias estimation
- Linkage-induced bias: Incorrect matches introduce systematic error in downstream analyses
- Sensitivity analysis: Vary thresholds and evaluate the precision-recall tradeoff
- Block-level evaluation: Assess whether blocking schemes miss true matches
Python Implementation
import numpy as np
import pandas as pd
from collections import Counter
np.random.seed(42)
# --- String Distance Metrics ---
def levenshtein_distance(s, t):
"""Compute Levenshtein edit distance via dynamic programming."""
m, n = len(s), len(t)
dp = np.zeros((m + 1, n + 1), dtype=int)
dp[:, 0] = np.arange(m + 1)
dp[0, :] = np.arange(n + 1)
for i in range(1, m + 1):
for j in range(1, n + 1):
cost = 0 if s[i-1] == t[j-1] else 1
dp[i, j] = min(dp[i-1, j] + 1, dp[i, j-1] + 1, dp[i-1, j-1] + cost)
return dp[m, n]
def jaro_winkler(s, t, p=0.1):
"""Compute Jaro-Winkler similarity."""
if s == t: return 1.0
len_s, len_t = len(s), len(t)
match_distance = max(len_s, len_t) // 2 - 1
s_matches = [False] * len_s
t_matches = [False] * len_t
matches = transpositions = 0
for i in range(len_s):
start = max(0, i - match_distance)
end = min(i + match_distance + 1, len_t)
for j in range(start, end):
if t_matches[j] or s[i] != t[j]: continue
s_matches[i] = t_matches[j] = True
matches += 1
break
if matches == 0: return 0.0
k = 0
for i in range(len_s):
if not s_matches[i]: continue
while not t_matches[k]: k += 1
if s[i] != t[k]: transpositions += 1
k += 1
jaro = (matches/len_s + matches/len_t + (matches - transpositions/2)/matches) / 3
# Common prefix (max 4 chars)
prefix = 0
for i in range(min(4, len_s, len_t)):
if s[i] == t[i]: prefix += 1
else: break
return jaro + prefix * p * (1 - jaro)
# --- Test string distances ---
pairs = [
("Smith", "Smyth"), ("Johnson", "Johnsen"),
("Washington", "Washngton"), ("Michael", "Micheal"),
("New York", "New Yrok"), ("Robert", "Robbert"),
]
print("=== String Distance Metrics ===")
print(f"{'Pair':<30} {'Levenshtein':<14} {'Jaro-Winkler':<14}")
print("-" * 58)
for s, t in pairs:
lev = levenshtein_distance(s, t)
jw = jaro_winkler(s, t)
print(f"({s}, {t}){'':>{26-len(s)-len(t)}} {lev:<14} {jw:<14.4f}")
# --- Felleg-Sunter Model ---
def felleg_sunter_weights(match_probs, non_match_probs, agreement_vector):
"""Compute Felleg-Sunter match weight for a pair."""
weight = 0.0
for gamma, m_k, u_k in zip(agreement_vector, match_probs, non_match_probs):
if gamma == 1:
weight += np.log(m_k / u_k)
else:
weight += np.log((1 - m_k) / (1 - u_k))
return weight
# Simulated parameters (estimated from field distributions)
# Fields: [first_name, last_name, DOB, city, ZIP]
m_k = np.array([0.95, 0.98, 0.90, 0.85, 0.92]) # match probs
u_k = np.array([0.01, 0.005, 0.02, 0.03, 0.05]) # non-match probs
print("\n=== Felleg-Sunter Match Weights ===")
test_pairs = [
("John Smith", "John Smith", [1,1,1,1,1]),
("John Smith", "Jon Smith", [0,1,1,1,1]),
("John Smith", "John Smyth", [1,0,1,1,1]),
("John Smith", "Jane Doe", [0,0,0,0,0]),
]
for name1, name2, gamma in test_pairs:
w = felleg_sunter_weights(m_k, u_k, gamma)
agreement = sum(gamma)
print(f"({name1}, {name2}) agree on {agreement}/5 fields: W = {w:.2f} bits")
# --- EM Algorithm for Linkage ---
np.random.seed(42)
n_pairs = 5000
n_true_matches = 200
# Generate comparison vectors
gamma_true = np.random.binomial(1, m_k, size=(n_true_matches, 5))
gamma_false = np.random.binomial(1, u_k, size=(n_pairs - n_true_matches, 5))
gamma_all = np.vstack([gamma_true, gamma_false])
labels = np.concatenate([np.ones(n_true_matches), np.zeros(n_pairs - n_true_matches)])
# EM algorithm
n_pairs_total = len(gamma_all)
n_fields = gamma_all.shape[1]
pi = n_true_matches / n_pairs_total
m_em = np.random.uniform(0.7, 0.95, n_fields)
u_em = np.random.uniform(0.01, 0.1, n_fields)
for iteration in range(50):
# E-step
p_match = pi * np.prod(m_em ** gamma_all * (1 - m_em) ** (1 - gamma_all), axis=1)
p_nonmatch = (1 - pi) * np.prod(u_em ** gamma_all * (1 - u_em) ** (1 - gamma_all), axis=1)
w_post = p_match / (p_match + p_nonmatch + 1e-300)
# M-step
pi_new = np.mean(w_post)
m_em_new = np.average(gamma_all, weights=w_post, axis=0)
u_em_new = np.average(gamma_all, weights=1 - w_post, axis=0)
if np.max(np.abs(m_em - m_em_new)) < 1e-6:
print(f"EM converged after {iteration + 1} iterations")
break
pi, m_em, u_em = pi_new, m_em_new, u_em_new
print(f"\n=== EM Estimates ===")
print(f"Ο (match prior): {pi:.4f} (true: {n_true_matches/n_pairs_total:.4f})")
fields = ['First Name', 'Last Name', 'DOB', 'City', 'ZIP']
print(f"{'Field':<15} {'m_k (est)':<12} {'m_k (true)':<12} {'u_k (est)':<12} {'u_k (true)':<12}")
print("-" * 63)
for i, field in enumerate(fields):
print(f"{field:<15} {m_em[i]:<12.4f} {m_k[i]:<12.4f} {u_em[i]:<12.4f} {u_k[i]:<12.4f}")
# Classification
threshold = 0.5
predicted = (w_post >= threshold).astype(int)
tp = np.sum((predicted == 1) & (labels == 1))
fp = np.sum((predicted == 1) & (labels == 0))
fn = np.sum((predicted == 0) & (labels == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
print(f"\n=== Linkage Evaluation ===")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1: {f1:.4f}")
Summary
Key Takeaways: Record Linkage and Data Matching
- Deterministic linkage requires exact key agreement; probabilistic linkage uses agreement patterns across multiple fields to compute match weights. The Felleg-Sunter model provides the theoretical foundation for probabilistic linkage.
- Felleg-Sunter match weights sum field-specific log-likelihood ratios. The model assumes conditional independence of agreement indicators given match status β a strong assumption often violated in practice.
- String distances β Levenshtein measures edit operations; Jaro-Winkler adds prefix bonus for name matching. Phonetic encodings (Soundex, Metaphone) handle spelling variations. Choice depends on the type of errors expected.
- Blocking reduces the comparison space. Good blocking achieves high pairs completeness (sensitivity) while maintaining manageable comparison volumes. Multi-pass and learned blocking improve over standard approaches.
- EM algorithm estimates , , and without labelled data, but converges to local maxima. Small labelled samples dramatically improve estimation. Privacy-preserving methods (Bloom filters, homomorphic encryption) enable linkage without revealing raw data.