πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Record Linkage and Data Matching

Advanced Statistical MethodsData Methods🟒 Free Lesson

Advertisement

Record Linkage and Data Matching

Advanced Statistical Methods

Connecting Records Across Imperfect Databases

Record linkage identifies records that refer to the same entity across different data sources using probabilistic models and string distance metrics. The Felleg-Sunter model provides the theoretical foundation.

  • Public health β€” Link hospital records to disease registries for longitudinal studies
  • Bureau of statistics β€” Combine administrative datasets while preserving privacy
  • Finance β€” Match customer records across mergers and acquisitions for risk assessment

Record linkage bridges the gap between siloed datasets to unlock richer analytical insights.


Record linkage (also called entity resolution, deduplication, or data matching) identifies records across or within databases that refer to the same real-world entity. When unique identifiers are absent, statistical and probabilistic methods are needed to determine whether two records represent the same entity. This lesson develops the mathematical foundations of record linkage from deterministic rules through the Felleg-Sunter probabilistic framework.


Deterministic vs Probabilistic Linkage

DfRecord Linkage Problem

Given two datasets A\mathcal{A} and B\mathcal{B}, the record linkage problem is to identify all pairs (ai,bj)(a_i, b_j) where ai∈Aa_i \in \mathcal{A} and bj∈Bb_j \in \mathcal{B} refer to the same real-world entity. Formally, define a binary match indicator Mij=1M_{ij} = 1 if aia_i and bjb_j are a true match, and Mij=0M_{ij} = 0 otherwise.

Deterministic Linkage

Deterministic linkage requires exact agreement on a set of key fields (e.g., Social Security Number, date of birth, last name). Records are classified as matches if and only if all key fields agree. This approach is:

  • Fast and interpretable when reliable identifiers exist
  • Brittle: A single character error in any key field causes a missed match
  • Inappropriate for noisy data, name variations, or missing values

DfProbabilistic Linkage

Probabilistic linkage quantifies the strength of evidence that two records match, using field-by-field agreement patterns. The core question: given that two records agree on some fields and disagree on others, what is the probability they represent the same entity?


Felleg-Sunter Model

Felleg-Sunter Framework

The Felleg-Sunter model (1969) compares each pair (ai,bj)(a_i, b_j) across KK fields. For each field kk, define an agreement indicator:

Ξ³k(i,j)={1ifΒ fieldsΒ kΒ agree0otherwise\gamma_k(i, j) = \begin{cases} 1 & \text{if fields } k \text{ agree} \\ 0 & \text{otherwise} \end{cases}

The match weight for the pair is:

W=βˆ‘k=1KΞ³kβ‹…log⁑mkukW = \sum_{k=1}^{K} \gamma_k \cdot \log \frac{m_k}{u_k}

where:

  • mk=P(Ξ³k=1∣M=1)m_k = P(\gamma_k = 1 \mid M = 1) is the match probability (probability of agreement given a true match)
  • uk=P(Ξ³k=1∣M=0)u_k = P(\gamma_k = 1 \mid M = 0) is the non-match probability (probability of agreement given a non-match)

The log-likelihood ratio (LLR) for the entire pair is:

Ξ›(i,j)=log⁑P(Ξ³(i,j)∣M=1)P(Ξ³(i,j)∣M=0)=βˆ‘k=1KΞ³klog⁑mkuk+(1βˆ’Ξ³k)log⁑1βˆ’mk1βˆ’uk\Lambda(i,j) = \log \frac{P(\mathbf{\gamma}(i,j) \mid M = 1)}{P(\mathbf{\gamma}(i,j) \mid M = 0)} = \sum_{k=1}^{K} \gamma_k \log \frac{m_k}{u_k} + (1 - \gamma_k) \log \frac{1 - m_k}{1 - u_k}

Interpreting Match Weights

  • Strong match: W>T+W > T_+ (positive threshold), e.g., W>12W > 12 bits
  • Strong non-match: W<Tβˆ’W < T_- (negative threshold), e.g., W<βˆ’4W < -4 bits
  • Possible match: Tβˆ’β‰€W≀T+T_- \leq W \leq T_+, requiring clerical review or further analysis

The weight log⁑(mk/uk)\log(m_k/u_k) is the contribution of field kk. Fields with high mkm_k and low uku_k (e.g., exact Social Security Number) contribute large positive weights; fields with mkβ‰ˆukm_k \approx u_k (e.g., common last names) contribute near zero.

Felleg-Sunter Assumptions

The Felleg-Sunter model assumes:

  1. Conditional independence: Agreement indicators are independent given the match status:
P(γ∣M)=∏k=1KP(γk∣M)P(\boldsymbol{\gamma} \mid M) = \prod_{k=1}^{K} P(\gamma_k \mid M)
  1. Monotonicity: For any field kk, mkβ‰₯ukm_k \geq u_k (true matches agree more often than non-matches)
  2. Parameter stability: mkm_k and uku_k are the same for all record pairs

These assumptions are often violated in practice. Violations of conditional independence (e.g., correlated name and address fields) can bias match probability estimates.


String Distance Metrics

Levenshtein Distance

The Levenshtein (edit) distance dL(s,t)d_L(s, t) between strings ss and tt is the minimum number of single-character insertions, deletions, or substitutions required to transform ss into tt:

dL(s,t)={∣s∣ifΒ t=βˆ…βˆ£t∣ifΒ s=βˆ…dL(sβ€²,tβ€²)ifΒ s[0]=t[0]1+min⁑{dL(sβ€²,t)(delete)dL(s,tβ€²)(insert)dL(sβ€²,tβ€²)(substitute)otherwised_L(s, t) = \begin{cases} |s| & \text{if } t = \emptyset \\ |t| & \text{if } s = \emptyset \\ d_L(s', t') & \text{if } s[0] = t[0] \\ 1 + \min \begin{cases} d_L(s', t) & \text{(delete)} \\ d_L(s, t') & \text{(insert)} \\ d_L(s', t') & \text{(substitute)} \end{cases} & \text{otherwise} \end{cases}

where sβ€²s' is ss with the first character removed. This is computed via dynamic programming in O(∣sβˆ£β‹…βˆ£t∣)O(|s| \cdot |t|) time. The normalized Levenshtein distance is dL/max⁑(∣s∣,∣t∣)d_L / \max(|s|, |t|).

Jaro-Winkler Similarity

The Jaro similarity between strings ss and tt is:

J(s,t)=13(∣sm∣∣s∣+∣tm∣∣t∣+∣smβˆ£βˆ’T∣sm∣)J(s, t) = \frac{1}{3}\left(\frac{|s_m|}{|s|} + \frac{|t_m|}{|t|} + \frac{|s_m| - T}{|s_m|}\right)

where ∣sm∣|s_m| is the number of matching characters (characters within ⌊max⁑(∣s∣,∣t∣)/2βŒ‹βˆ’1\lfloor\max(|s|, |t|)/2\rfloor - 1 positions), and TT is the number of transpositions among matching characters.

The Jaro-Winkler similarity boosts the score for strings that match from the beginning:

JW(s,t)=J(s,t)+pβ‹…β„“β‹…(1βˆ’J(s,t))JW(s, t) = J(s, t) + p \cdot \ell \cdot (1 - J(s, t))

where p=0.1p = 0.1 is the scaling factor and β„“\ell is the length of the common prefix (up to β„“max⁑=4\ell_{\max} = 4). JW gives higher scores to strings sharing a common prefix, making it well-suited for name matching.

Metric Comparison

MetricRangeHandles TranspositionsPrefix BiasBest For
Levenshtein[0,max⁑(∣s∣,∣t∣)][0, \max(|s|,|t|)]NoNoGeneral editing
Jaro[0,1][0, 1]YesNoShort strings
Jaro-Winkler[0,1][0, 1]YesYesNames
SoundexCategoricalN/APhoneticName phonetics
MetaphoneCategoricalN/APhoneticName phonetics

Blocking

DfBlocking

Blocking reduces the O(N2)O(N^2) comparison space by partitioning records into blocks within which comparisons are performed. Records in different blocks are never compared. A good blocking scheme minimizes the number of missed true matches (false negatives) while reducing the number of comparisons.

Blocking Efficiency

The reduction ratio measures the fraction of comparisons eliminated:

RR=1βˆ’comparisonsΒ withΒ blockingcomparisonsΒ withoutΒ blocking=1βˆ’βˆ‘bnb2N2\text{RR} = 1 - \frac{\text{comparisons with blocking}}{\text{comparisons without blocking}} = 1 - \frac{\sum_b n_b^2}{N^2}

The pairs completeness (sensitivity) measures the fraction of true matches retained:

PC=∣{true matches in same block}∣∣all true matches∣\text{PC} = \frac{|\{\text{true matches in same block}\}|}{|\text{all true matches}|}

The pairs quality (positive predictive value) is:

PQ=∣{true matches in same block}∣∣{all pairs in same block}∣\text{PQ} = \frac{|\{\text{true matches in same block}\}|}{|\{\text{all pairs in same block}\}|}

Blocking Strategies

  • Standard blocking: Exact match on a key (e.g., first 3 characters of last name + ZIP code)
  • Sorted neighborhood: Sort records by a key, then compare within a sliding window
  • Multi-pass blocking: Multiple blocking passes with different keys, union of candidate pairs
  • Canopy clustering: TF-IDF + cosine similarity with thresholds for loose blocking
  • Learning-based: Use a classifier to predict if two records should be compared

EM Algorithm for Linkage

DfUnsupervised Linkage Parameters

When true match labels are unknown, the EM algorithm estimates mkm_k, uku_k, and the prior match probability Ο€=P(M=1)\pi = P(M = 1) from the comparison patterns alone.

EM for Felleg-Sunter

Let γ(l)\boldsymbol{\gamma}^{(l)} be the comparison vector for the ll-th pair, and z(l)∈{0,1}\mathbf{z}^{(l)} \in \{0, 1\} the latent match indicator.

E-step: Compute posterior match probabilities:

w(l)=P(z(l)=1∣γ(l),ΞΈ(t))=Ο€(t)∏k(mk(t))Ξ³k(l)(1βˆ’mk(t))1βˆ’Ξ³k(l)Ο€(t)∏k(mk(t))Ξ³k(l)(1βˆ’mk(t))1βˆ’Ξ³k(l)+(1βˆ’Ο€(t))∏k(uk(t))Ξ³k(l)(1βˆ’uk(t))1βˆ’Ξ³k(l)w^{(l)} = P(z^{(l)} = 1 \mid \boldsymbol{\gamma}^{(l)}, \boldsymbol{\theta}^{(t)}) = \frac{\pi^{(t)} \prod_k (m_k^{(t)})^{\gamma_k^{(l)}} (1 - m_k^{(t)})^{1 - \gamma_k^{(l)}}}{\pi^{(t)} \prod_k (m_k^{(t)})^{\gamma_k^{(l)}} (1 - m_k^{(t)})^{1 - \gamma_k^{(l)}} + (1 - \pi^{(t)}) \prod_k (u_k^{(t)})^{\gamma_k^{(l)}} (1 - u_k^{(t)})^{1 - \gamma_k^{(l)}}}

M-step: Update parameters:

Ο€(t+1)=1Lβˆ‘l=1Lw(l),mk(t+1)=βˆ‘lw(l)Ξ³k(l)βˆ‘lw(l),uk(t+1)=βˆ‘l(1βˆ’w(l))Ξ³k(l)βˆ‘l(1βˆ’w(l))\pi^{(t+1)} = \frac{1}{L} \sum_{l=1}^{L} w^{(l)}, \quad m_k^{(t+1)} = \frac{\sum_l w^{(l)} \gamma_k^{(l)}}{\sum_l w^{(l)}}, \quad u_k^{(t+1)} = \frac{\sum_l (1 - w^{(l)}) \gamma_k^{(l)}}{\sum_l (1 - w^{(l)})}

Iterate until convergence. The EM algorithm finds a local maximum of the likelihood; multiple random initializations are recommended.

EM Convergence Issues

  • EM converges to a local maximum, not necessarily the global maximum
  • Initialization matters: start with reasonable mk>ukm_k > u_k for all fields
  • Labelled data (even a small sample of known matches) dramatically improves estimation
  • With many fields, EM can overfit: use regularization or Bayesian priors
  • The number of true matches is often very small relative to the total number of pairs, creating severe class imbalance

Privacy-Preserving Record Linkage

DfPrivate Record Linkage

Privacy-preserving record linkage (PPRL) enables linkage without revealing the actual data values. Key approaches include:

  • Bloom filters: Encode each field as a Bloom filter; compare using set operations on encoded values
  • Homomorphic encryption: Perform comparisons on encrypted data
  • Secure multi-party computation: Jointly compute linkage without revealing inputs
  • Differential privacy: Add calibrated noise to comparison scores

Bloom Filter Encoding

A Bloom filter is a bit vector b∈{0,1}m\mathbf{b} \in \{0, 1\}^m representing a set SS. To insert element xx:

bhi(x)=1forΒ i=1,…,kb_{h_i(x)} = 1 \quad \text{for } i = 1, \dots, k

where h1,…,hkh_1, \dots, h_k are hash functions. The Jaccard similarity of two Bloom filters approximates the similarity of the underlying strings:

J^(s,t)β‰ˆβˆ£bs∩bt∣∣bsβˆͺbt∣\hat{J}(s, t) \approx \frac{|\mathbf{b}_s \cap \mathbf{b}_t|}{|\mathbf{b}_s \cup \mathbf{b}_t|}

False positives occur when bits are set by different strings, but the approximation is sufficient for linkage when Bloom filter size is large enough (mβ‰₯10km \geq 10k).


Evaluation Metrics

Linkage Quality Metrics

Given a set of predicted matches M^\hat{M} and true matches Mβˆ—M^*:

Precision=∣M^∩Mβˆ—βˆ£βˆ£M^∣,Recall=∣M^∩Mβˆ—βˆ£βˆ£Mβˆ—βˆ£\text{Precision} = \frac{|\hat{M} \cap M^*|}{|\hat{M}|}, \quad \text{Recall} = \frac{|\hat{M} \cap M^*|}{|M^*|}
F1=2β‹…Precisionβ‹…RecallPrecision+Recall\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

The F-beta score weights recall more heavily when Ξ²>1\beta > 1 (missing matches is costly). The linkage error rate is:

LER=FP+FN∣M^∣+∣Mβˆ—βˆ£βˆ’βˆ£M^∩Mβˆ—βˆ£\text{LER} = \frac{FP + FN}{|\hat{M}| + |M^*| - |\hat{M} \cap M^*|}

In practice, sampling-based evaluation estimates these metrics by manually reviewing a stratified sample of pairs (strong matches, possible matches, strong non-matches).

Evaluation Challenges

  • Ground truth is usually unavailable for the full dataset; evaluation relies on samples
  • Clerical review of uncertain pairs is expensive but necessary for bias estimation
  • Linkage-induced bias: Incorrect matches introduce systematic error in downstream analyses
  • Sensitivity analysis: Vary thresholds and evaluate the precision-recall tradeoff
  • Block-level evaluation: Assess whether blocking schemes miss true matches

Python Implementation

import numpy as np
import pandas as pd
from collections import Counter

np.random.seed(42)

# --- String Distance Metrics ---
def levenshtein_distance(s, t):
    """Compute Levenshtein edit distance via dynamic programming."""
    m, n = len(s), len(t)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            dp[i, j] = min(dp[i-1, j] + 1, dp[i, j-1] + 1, dp[i-1, j-1] + cost)
    return dp[m, n]

def jaro_winkler(s, t, p=0.1):
    """Compute Jaro-Winkler similarity."""
    if s == t: return 1.0
    len_s, len_t = len(s), len(t)
    match_distance = max(len_s, len_t) // 2 - 1
    s_matches = [False] * len_s
    t_matches = [False] * len_t
    matches = transpositions = 0

    for i in range(len_s):
        start = max(0, i - match_distance)
        end = min(i + match_distance + 1, len_t)
        for j in range(start, end):
            if t_matches[j] or s[i] != t[j]: continue
            s_matches[i] = t_matches[j] = True
            matches += 1
            break

    if matches == 0: return 0.0

    k = 0
    for i in range(len_s):
        if not s_matches[i]: continue
        while not t_matches[k]: k += 1
        if s[i] != t[k]: transpositions += 1
        k += 1

    jaro = (matches/len_s + matches/len_t + (matches - transpositions/2)/matches) / 3

    # Common prefix (max 4 chars)
    prefix = 0
    for i in range(min(4, len_s, len_t)):
        if s[i] == t[i]: prefix += 1
        else: break

    return jaro + prefix * p * (1 - jaro)

# --- Test string distances ---
pairs = [
    ("Smith", "Smyth"), ("Johnson", "Johnsen"),
    ("Washington", "Washngton"), ("Michael", "Micheal"),
    ("New York", "New Yrok"), ("Robert", "Robbert"),
]
print("=== String Distance Metrics ===")
print(f"{'Pair':<30} {'Levenshtein':<14} {'Jaro-Winkler':<14}")
print("-" * 58)
for s, t in pairs:
    lev = levenshtein_distance(s, t)
    jw = jaro_winkler(s, t)
    print(f"({s}, {t}){'':>{26-len(s)-len(t)}} {lev:<14} {jw:<14.4f}")

# --- Felleg-Sunter Model ---
def felleg_sunter_weights(match_probs, non_match_probs, agreement_vector):
    """Compute Felleg-Sunter match weight for a pair."""
    weight = 0.0
    for gamma, m_k, u_k in zip(agreement_vector, match_probs, non_match_probs):
        if gamma == 1:
            weight += np.log(m_k / u_k)
        else:
            weight += np.log((1 - m_k) / (1 - u_k))
    return weight

# Simulated parameters (estimated from field distributions)
# Fields: [first_name, last_name, DOB, city, ZIP]
m_k = np.array([0.95, 0.98, 0.90, 0.85, 0.92])  # match probs
u_k = np.array([0.01, 0.005, 0.02, 0.03, 0.05])  # non-match probs

print("\n=== Felleg-Sunter Match Weights ===")
test_pairs = [
    ("John Smith", "John Smith", [1,1,1,1,1]),
    ("John Smith", "Jon Smith", [0,1,1,1,1]),
    ("John Smith", "John Smyth", [1,0,1,1,1]),
    ("John Smith", "Jane Doe", [0,0,0,0,0]),
]
for name1, name2, gamma in test_pairs:
    w = felleg_sunter_weights(m_k, u_k, gamma)
    agreement = sum(gamma)
    print(f"({name1}, {name2}) agree on {agreement}/5 fields: W = {w:.2f} bits")

# --- EM Algorithm for Linkage ---
np.random.seed(42)
n_pairs = 5000
n_true_matches = 200

# Generate comparison vectors
gamma_true = np.random.binomial(1, m_k, size=(n_true_matches, 5))
gamma_false = np.random.binomial(1, u_k, size=(n_pairs - n_true_matches, 5))
gamma_all = np.vstack([gamma_true, gamma_false])
labels = np.concatenate([np.ones(n_true_matches), np.zeros(n_pairs - n_true_matches)])

# EM algorithm
n_pairs_total = len(gamma_all)
n_fields = gamma_all.shape[1]
pi = n_true_matches / n_pairs_total
m_em = np.random.uniform(0.7, 0.95, n_fields)
u_em = np.random.uniform(0.01, 0.1, n_fields)

for iteration in range(50):
    # E-step
    p_match = pi * np.prod(m_em ** gamma_all * (1 - m_em) ** (1 - gamma_all), axis=1)
    p_nonmatch = (1 - pi) * np.prod(u_em ** gamma_all * (1 - u_em) ** (1 - gamma_all), axis=1)
    w_post = p_match / (p_match + p_nonmatch + 1e-300)

    # M-step
    pi_new = np.mean(w_post)
    m_em_new = np.average(gamma_all, weights=w_post, axis=0)
    u_em_new = np.average(gamma_all, weights=1 - w_post, axis=0)

    if np.max(np.abs(m_em - m_em_new)) < 1e-6:
        print(f"EM converged after {iteration + 1} iterations")
        break
    pi, m_em, u_em = pi_new, m_em_new, u_em_new

print(f"\n=== EM Estimates ===")
print(f"Ο€ (match prior): {pi:.4f} (true: {n_true_matches/n_pairs_total:.4f})")
fields = ['First Name', 'Last Name', 'DOB', 'City', 'ZIP']
print(f"{'Field':<15} {'m_k (est)':<12} {'m_k (true)':<12} {'u_k (est)':<12} {'u_k (true)':<12}")
print("-" * 63)
for i, field in enumerate(fields):
    print(f"{field:<15} {m_em[i]:<12.4f} {m_k[i]:<12.4f} {u_em[i]:<12.4f} {u_k[i]:<12.4f}")

# Classification
threshold = 0.5
predicted = (w_post >= threshold).astype(int)
tp = np.sum((predicted == 1) & (labels == 1))
fp = np.sum((predicted == 1) & (labels == 0))
fn = np.sum((predicted == 0) & (labels == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"\n=== Linkage Evaluation ===")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1: {f1:.4f}")

Summary

Key Takeaways: Record Linkage and Data Matching

  1. Deterministic linkage requires exact key agreement; probabilistic linkage uses agreement patterns across multiple fields to compute match weights. The Felleg-Sunter model provides the theoretical foundation for probabilistic linkage.
  2. Felleg-Sunter match weights W=βˆ‘kΞ³klog⁑(mk/uk)W = \sum_k \gamma_k \log(m_k/u_k) sum field-specific log-likelihood ratios. The model assumes conditional independence of agreement indicators given match status β€” a strong assumption often violated in practice.
  3. String distances β€” Levenshtein measures edit operations; Jaro-Winkler adds prefix bonus for name matching. Phonetic encodings (Soundex, Metaphone) handle spelling variations. Choice depends on the type of errors expected.
  4. Blocking reduces the O(N2)O(N^2) comparison space. Good blocking achieves high pairs completeness (sensitivity) while maintaining manageable comparison volumes. Multi-pass and learned blocking improve over standard approaches.
  5. EM algorithm estimates mkm_k, uku_k, and Ο€\pi without labelled data, but converges to local maxima. Small labelled samples dramatically improve estimation. Privacy-preserving methods (Bloom filters, homomorphic encryption) enable linkage without revealing raw data.
⭐

Premium Content

Record Linkage and Data Matching

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement