Record Linkage and Data Matching

Advanced Statistical Methods

Connecting Records Across Imperfect Databases

Record linkage identifies records that refer to the same entity across different data sources using probabilistic models and string distance metrics. The Fellegi-Sunter model provides the theoretical foundation.

Public health — Link hospital records to disease registries for longitudinal studies
Bureau of statistics — Combine administrative datasets while preserving privacy
Finance — Match customer records across mergers and acquisitions for risk assessment

Record linkage bridges the gap between siloed datasets to unlock richer analytical insights.

Record linkage (also called entity resolution, deduplication, or data matching) identifies records across or within databases that refer to the same real-world entity. When unique identifiers are absent, statistical and probabilistic methods are needed to determine whether two records represent the same entity. This lesson develops the mathematical foundations of record linkage from deterministic rules through the Fellegi-Sunter probabilistic framework.

Deterministic vs Probabilistic Linkage

Definition: Record Linkage Problem

Given two datasets $\mathcal{A}$ and $\mathcal{B}$, the record linkage problem is to identify all pairs $(a_i, b_j)$ where $a_i \in \mathcal{A}$ and $b_j \in \mathcal{B}$ refer to the same real-world entity. Formally, define a binary match indicator $M_{ij} = 1$ if $a_i$ and $b_j$ are a true match, and $M_{ij} = 0$ otherwise.

Deterministic Linkage

Deterministic linkage requires exact agreement on a set of key fields (e.g., Social Security Number, date of birth, last name). Records are classified as matches if and only if all key fields agree. This approach is:

Fast and interpretable when reliable identifiers exist
Brittle: A single character error in any key field causes a missed match
Inappropriate for noisy data, name variations, or missing values
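The brittleness of exact-key matching is easy to see on a toy example. The sketch below is only an illustration: the records, field names, and the helper `deterministic_link` are hypothetical, and the key fields are the ones named in the text above.

```python
# Hypothetical toy records; ids, field names and values are illustrative only.
registry = [
    {"id": "A1", "ssn": "123-45-6789", "dob": "1980-02-14", "last_name": "Smith"},
    {"id": "A2", "ssn": "987-65-4321", "dob": "1975-09-30", "last_name": "Johnson"},
]
hospital = [
    {"id": "B1", "ssn": "123-45-6789", "dob": "1980-02-14", "last_name": "Smith"},
    # Same person as A2, but the surname contains a one-character typo.
    {"id": "B2", "ssn": "987-65-4321", "dob": "1975-09-30", "last_name": "Johnsen"},
]

KEY_FIELDS = ("ssn", "dob", "last_name")


def deterministic_link(left, right, key_fields=KEY_FIELDS):
    """Return (left_id, right_id) pairs that agree exactly on every key field."""
    # Index the right-hand file by its composite key for constant-time lookup.
    index = {}
    for rec in right:
        key = tuple(rec[f] for f in key_fields)
        index.setdefault(key, []).append(rec["id"])

    links = []
    for rec in left:
        key = tuple(rec[f] for f in key_fields)
        for right_id in index.get(key, []):
            links.append((rec["id"], right_id))
    return links


links = deterministic_link(registry, hospital)
print(links)  # only the A1/B1 pair survives; A2/B2 is missed because of the typo
```

Dropping `last_name` from the key would recover the second pair here, at the price of relying entirely on the remaining identifiers being error-free.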

Definition: Probabilistic Linkage

Probabilistic linkage quantifies the strength of evidence that two records match, using field-by-field agreement patterns. The core question: given that two records agree on some fields and disagree on others, what is the probability they represent the same entity?

Fellegi-Sunter Model

Fellegi-Sunter Framework

The Fellegi-Sunter model (1969) compares each pair $(a_i, b_j)$ across $K$ fields. For each field $k$, define an agreement indicator:

\gamma_k(i, j) = \begin{cases} 1 & \text{if fields } k \text{ agree} \\ 0 & \text{otherwise} \end{cases}

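Each field contributes an agreement weight when it agrees and a disagreement weight when it does not. The match weight for the pair is:

W = \sum_{k=1}^{K} \left[ \gamma_k \log \frac{m_k}{u_k} + (1 - \gamma_k) \log \frac{1 - m_k}{1 - u_k} \right]

where:

$m_k = P(\gamma_k = 1 \mid M = 1)$ is the match probability (probability of agreement given a true match)
$u_k = P(\gamma_k = 1 \mid M = 0)$ is the non-match probability (probability of agreement given a non-match)

Assuming the fields are conditionally independent given match status, $W$ is exactly the log-likelihood ratio (LLR) of the full comparison vector $\boldsymbol{\gamma}(i,j)$:

\Lambda(i,j) = \log \frac{P(\boldsymbol{\gamma}(i,j) \mid M = 1)}{P(\boldsymbol{\gamma}(i,j) \mid M = 0)} = \sum_{k=1}^{K} \left[ \gamma_k \log \frac{m_k}{u_k} + (1 - \gamma_k) \log \frac{1 - m_k}{1 - u_k} \right] = W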

Interpreting Match Weights

Strong match: $W > T_+$ (positive threshold), e.g., $W > 12$ bits
Strong non-match: $W < T_-$ (negative threshold), e.g., $W < -4$ bits
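Possible match: $T_- \leq W \leq T_+$, requiring clerical review or further analysis

The weight $\log(m_k/u_k)$ is the contribution of field $k$ when it agrees; when it disagrees, the field contributes $\log((1 - m_k)/(1 - u_k))$, which is negative whenever $m_k > u_k$. Weights are in bits when the logarithm is taken to base 2. Fields with high $m_k$ and low $u_k$ (e.g., exact Social Security Number) contribute large positive agreement weights; fields whose values are shared by many unrelated records (e.g., sex, or a very common last name) have a larger $u_k$ and contribute much less, approaching zero as $m_k \approx u_k$.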
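The three-way decision rule can be sketched in a few lines. The parameter values below are assumed for illustration rather than estimated from data, the function names are hypothetical, and the thresholds are simply the example values of 12 and -4 bits quoted above.

```python
import math

# Hypothetical m- and u-probabilities for three fields (assumed, not estimated).
M_PROBS = {"last_name": 0.98, "dob": 0.90, "zip": 0.92}
U_PROBS = {"last_name": 0.005, "dob": 0.02, "zip": 0.05}


def match_weight_bits(agreement):
    """Sum agreement and disagreement weights (base-2 logs) over all fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)              # positive agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # negative disagreement weight
    return total


def classify(weight, upper=12.0, lower=-4.0):
    """Apply the upper/lower thresholds to a match weight expressed in bits."""
    if weight > upper:
        return "match"
    if weight < lower:
        return "non-match"
    return "possible match"


for pattern in (
    {"last_name": True, "dob": True, "zip": True},
    {"last_name": False, "dob": True, "zip": True},
    {"last_name": False, "dob": False, "zip": False},
):
    w = match_weight_bits(pattern)
    print(pattern, f"W = {w:.2f} bits ->", classify(w))
```

With these assumed values, a single disagreement on the most discriminating field is enough to move a pair from the match region into the clerical-review band.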

Fellegi-Sunter Assumptions

The Fellegi-Sunter model assumes:

Conditional independence: Agreement indicators are independent given the match status:

P(\boldsymbol{\gamma} \mid M) = \prod_{k=1}^{K} P(\gamma_k \mid M)

Monotonicity: For any field $k$, $m_k \geq u_k$ (true matches agree more often than non-matches)
Parameter stability: $m_k$ and $u_k$ are the same for all record pairs

These assumptions are often violated in practice. Violations of conditional independence (e.g., correlated name and address fields) can bias match probability estimates.

String Distance Metrics

Levenshtein Distance

The Levenshtein (edit) distance $d_L(s, t)$ between strings $s$ and $t$ is the minimum number of single-character insertions, deletions, or substitutions required to transform $s$ into $t$:

d_L(s, t) = \begin{cases} |s| & \text{if } t = \emptyset \\ |t| & \text{if } s = \emptyset \\ d_L(s', t') & \text{if } s[0] = t[0] \\ 1 + \min \begin{cases} d_L(s', t) & \text{(delete)} \\ d_L(s, t') & \text{(insert)} \\ d_L(s', t') & \text{(substitute)} \end{cases} & \text{otherwise} \end{cases}

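where $s'$ is $s$ with the first character removed (and likewise $t'$ for $t$). Evaluated naively the recursion is exponential, so it is computed via dynamic programming in $O(|s| \cdot |t|)$ time. The normalized Levenshtein distance is $d_L / \max(|s|, |t|)$, which lies in $[0, 1]$.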
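Because each row of the dynamic-programming table depends only on the row before it, the full table never needs to be stored. The following is a minimal sketch of that two-row variant together with the normalized distance; the function names are illustrative.

```python
def levenshtein(s, t):
    """Edit distance using two rolling rows of the dynamic-programming table."""
    if len(s) < len(t):
        s, t = t, s  # iterate rows over the longer string, keep rows short
    previous = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from the first i characters of s to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the longer length; 0.0 for two empty strings."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))          # 3
print(normalized_levenshtein("Smith", "Smyth"))  # 0.2
```

Memory use drops from $O(|s| \cdot |t|)$ to $O(\min(|s|, |t|))$ while the running time is unchanged.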

Jaro-Winkler Similarity

The Jaro similarity between strings $s$ and $t$ is:

J(s, t) = \frac{1}{3}\left(\frac{|s_m|}{|s|} + \frac{|t_m|}{|t|} + \frac{|s_m| - T}{|s_m|}\right)

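where $|s_m| = |t_m|$ is the number of matching characters (identical characters whose positions differ by at most $\lfloor\max(|s|, |t|)/2\rfloor - 1$), and $T$ is the number of transpositions, counted as half the number of matching characters that appear in a different order in the two strings. If there are no matching characters, $J(s, t) = 0$.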

The Jaro-Winkler similarity boosts the score for strings that match from the beginning:

JW(s, t) = J(s, t) + p \cdot \ell \cdot (1 - J(s, t))

where $p = 0.1$ is the scaling factor and $\ell$ is the length of the common prefix (up to $\ell_{\max} = 4$). With these values $p \cdot \ell \leq 0.4$, so $JW$ never exceeds 1. JW gives higher scores to strings sharing a common prefix, making it well-suited for name matching.
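A worked example makes the two formulas concrete. The sketch below evaluates them by hand for the frequently used pair "MARTHA" and "MARHTA", assuming the usual convention that $T$ is half the number of matched characters that are out of order.

```python
# Worked example for s = "MARTHA" and t = "MARHTA".
len_s = len_t = 6
matches = 6           # every character finds a partner within the window of 2
out_of_order = 2      # "T" and "H" are matched but appear in swapped order
T = out_of_order / 2  # transpositions = half the out-of-order matches

# Jaro similarity: average of the three ratios.
jaro = (matches / len_s + matches / len_t + (matches - T) / matches) / 3

# Winkler adjustment: shared prefix "MAR" has length 3 (capped at 4).
prefix_len = 3
p = 0.1
jw = jaro + p * prefix_len * (1 - jaro)

print(round(jaro, 4), round(jw, 4))  # 0.9444 0.9611
```

The prefix bonus closes 30% of the remaining gap to 1, which is why names that differ only near the end score so highly.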

Metric Comparison

Metric	Range	Handles Transpositions	Prefix Bias	Best For
Levenshtein	$[0, \max(|s|,|t|)]$	No	No	General editing
Jaro	$[0, 1]$	Yes	No	Short strings
Jaro-Winkler	$[0, 1]$	Yes	Yes	Names
Soundex	Categorical	N/A	Phonetic	Name phonetics
Metaphone	Categorical	N/A	Phonetic	Name phonetics
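Phonetic encodings differ from the numeric metrics in the table in that they map each name to a code, and two names agree only if their codes are equal. The sketch below follows the common American Soundex rules (keep the first letter, encode consonants as digits, collapse repeated codes, ignore H and W between same-coded letters); implementations vary in these details, so it should be read as one reasonable variant rather than a reference implementation.

```python
# Consonant groups and their Soundex digits; vowels, Y, H and W carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (first + "".join(digits) + "000")[:4]


for a, b in (("Robert", "Rupert"), ("Smith", "Smyth"), ("Smith", "Jones")):
    print(a, soundex(a), b, soundex(b), "agree" if soundex(a) == soundex(b) else "differ")
```

Because the output is categorical, a Soundex comparison fits naturally as a binary agreement indicator or as a blocking key.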

Blocking

Definition: Blocking

Blocking reduces the $O(N^2)$ comparison space by partitioning records into blocks within which comparisons are performed. Records in different blocks are never compared. A good blocking scheme minimizes the number of missed true matches (false negatives) while reducing the number of comparisons.

Blocking Efficiency

The reduction ratio measures the fraction of comparisons eliminated:

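\text{RR} = 1 - \frac{\text{comparisons with blocking}}{\text{comparisons without blocking}} = 1 - \frac{\sum_b n_b^2}{N^2}

Here $n_b$ is the number of records in block $b$ and $N$ is the total number of records; $n_b^2$ and $N^2$ approximate the pair counts, which are exactly $n_b(n_b - 1)/2$ and $N(N - 1)/2$ when deduplicating a single file.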

The pairs completeness (sensitivity) measures the fraction of true matches retained:

\text{PC} = \frac{|\{\text{true matches in same block}\}|}{|\text{all true matches}|}

The pairs quality (positive predictive value) is:

\text{PQ} = \frac{|\{\text{true matches in same block}\}|}{|\{\text{all pairs in same block}\}|}
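The three measures can be computed directly once the true entity of each record is known. The sketch below uses a small hypothetical deduplication dataset and an assumed blocking key (first two letters of the surname plus the first three ZIP digits); it counts unordered pairs exactly rather than using the $n_b^2 / N^2$ approximation.

```python
from itertools import combinations

# Hypothetical records: (record_id, surname, zip_code, true_entity_id).
records = [
    (1, "Smith", "10001", "e1"),
    (2, "Smyth", "10001", "e1"),
    (3, "Smith", "10002", "e2"),
    (4, "Jones", "10001", "e3"),
    (5, "Jones", "10001", "e3"),
    (6, "Johnson", "10003", "e4"),
    (7, "Jonson", "20003", "e4"),
]


def blocking_key(rec):
    """Assumed key: first 2 letters of the surname + first 3 digits of the ZIP."""
    return rec[1][:2].upper() + rec[2][:3]


# Partition the records into blocks by key.
blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

# Candidate pairs are all unordered pairs that share a block.
candidate_pairs = {
    (a[0], b[0]) for block in blocks.values() for a, b in combinations(block, 2)
}
all_pairs = list(combinations(records, 2))
true_pairs = {(a[0], b[0]) for a, b in all_pairs if a[3] == b[3]}
retained = candidate_pairs & true_pairs

rr = 1 - len(candidate_pairs) / len(all_pairs)  # reduction ratio
pc = len(retained) / len(true_pairs)            # pairs completeness
pq = len(retained) / len(candidate_pairs)       # pairs quality

print(f"RR = {rr:.3f}, PC = {pc:.3f}, PQ = {pq:.3f}")
```

Here blocking removes 15 of the 21 possible comparisons but loses the pair (6, 7), whose ZIP codes differ in the first digit, which is the trade-off the three measures are meant to expose.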

Blocking Strategies

Standard blocking: Exact match on a key (e.g., first 3 characters of last name + ZIP code)
Sorted neighborhood: Sort records by a key, then compare within a sliding window
Multi-pass blocking: Multiple blocking passes with different keys, union of candidate pairs
Canopy clustering: Group records into overlapping canopies using a cheap similarity (e.g., TF-IDF cosine similarity) with a loose and a tight threshold
Learning-based: Use a classifier to predict if two records should be compared
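Of the strategies above, the sorted neighborhood method is the simplest to sketch. The example below is a minimal illustration with hypothetical records and a hypothetical sorting key (the lower-cased surname); a window of size `w` compares each record with the `w - 1` records that follow it in sorted order.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Return candidate id pairs whose sorted positions are less than `window` apart."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        # Compare with the next (window - 1) records in sorted order.
        for other in ordered[i + 1:i + window]:
            pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


def surname_key(rec):
    """Hypothetical sorting key: the lower-cased surname."""
    return rec["surname"].lower()


people = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Johnson"},
    {"id": 5, "surname": "Smithe"},
]

narrow = sorted_neighbourhood_pairs(people, surname_key, window=2)
wide = sorted_neighbourhood_pairs(people, surname_key, window=3)
print(sorted(narrow))
print(sorted(wide))
```

With a window of 2 the pair Smith/Smyth is missed because "Smithe" sorts between them; widening the window to 3 recovers it at the cost of more comparisons. Running several passes with different keys and taking the union of the candidate pairs is the multi-pass idea from the list above.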

EM Algorithm for Linkage

Definition: Unsupervised Linkage Parameters

When true match labels are unknown, the EM algorithm estimates $m_k$, $u_k$, and the prior match probability $\pi = P(M = 1)$ from the comparison patterns alone.

EM for Fellegi-Sunter

Let $\boldsymbol{\gamma}^{(l)}$ be the comparison vector for the $l$-th pair, and $z^{(l)} \in \{0, 1\}$ the latent match indicator.

E-step: Compute posterior match probabilities:

w^{(l)} = P(z^{(l)} = 1 \mid \boldsymbol{\gamma}^{(l)}, \boldsymbol{\theta}^{(t)}) = \frac{\pi^{(t)} \prod_k (m_k^{(t)})^{\gamma_k^{(l)}} (1 - m_k^{(t)})^{1 - \gamma_k^{(l)}}}{\pi^{(t)} \prod_k (m_k^{(t)})^{\gamma_k^{(l)}} (1 - m_k^{(t)})^{1 - \gamma_k^{(l)}} + (1 - \pi^{(t)}) \prod_k (u_k^{(t)})^{\gamma_k^{(l)}} (1 - u_k^{(t)})^{1 - \gamma_k^{(l)}}}

M-step: Update parameters:

\pi^{(t+1)} = \frac{1}{L} \sum_{l=1}^{L} w^{(l)}, \quad m_k^{(t+1)} = \frac{\sum_l w^{(l)} \gamma_k^{(l)}}{\sum_l w^{(l)}}, \quad u_k^{(t+1)} = \frac{\sum_l (1 - w^{(l)}) \gamma_k^{(l)}}{\sum_l (1 - w^{(l)})}

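Here $L$ is the number of compared record pairs. Iterate the two steps until the parameters change by less than a chosen tolerance. The EM algorithm finds a local maximum of the likelihood; multiple random initializations are recommended.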
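With many fields, the products in the E-step can become very small, so it is often safer to evaluate the posterior in log space. The sketch below is one way to do this for a single comparison vector; the function name and the parameter values are assumptions for illustration, and all probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def posterior_match_prob(gamma, m, u, pi):
    """E-step for one comparison vector, evaluated via log-odds for stability."""
    log_match = math.log(pi)
    log_non = math.log(1 - pi)
    for g, mk, uk in zip(gamma, m, u):
        log_match += math.log(mk) if g else math.log(1 - mk)
        log_non += math.log(uk) if g else math.log(1 - uk)

    # The posterior is the logistic function of the log-odds; branch on the
    # sign so that exp() is only ever called on a non-positive argument.
    log_odds = log_match - log_non
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Assumed illustrative parameters for three fields.
m = [0.95, 0.98, 0.90]
u = [0.01, 0.005, 0.02]
pi = 0.04

w_all = posterior_match_prob([1, 1, 1], m, u, pi)   # agrees on every field
w_none = posterior_match_prob([0, 0, 0], m, u, pi)  # disagrees on every field
print(f"{w_all:.6f} {w_none:.6f}")
```

The M-step is unchanged: it only needs the resulting posterior weights $w^{(l)}$.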

EM Convergence Issues

EM converges to a local maximum, not necessarily the global maximum
Initialization matters: start with reasonable $m_k > u_k$ for all fields
Labelled data (even a small sample of known matches) dramatically improves estimation
With many fields, EM can overfit: use regularization or Bayesian priors
The number of true matches is often very small relative to the total number of pairs, creating severe class imbalance

Privacy-Preserving Record Linkage

Definition: Private Record Linkage

Privacy-preserving record linkage (PPRL) enables linkage without revealing the actual data values. Key approaches include:

Bloom filters: Encode the character $q$-grams of each field into a Bloom filter; compare the resulting bit vectors with set-based similarity measures
Homomorphic encryption: Perform comparisons on encrypted data
Secure multi-party computation: Jointly compute linkage without revealing inputs
Differential privacy: Add calibrated noise to comparison scores

Bloom Filter Encoding

A Bloom filter is a bit vector $\mathbf{b} \in \{0, 1\}^m$ representing a set $S$ . To insert element $x$ :

b_{h_i(x)} = 1 \quad \text{for } i = 1, \dots, k

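where $h_1, \dots, h_k$ are hash functions (in this section $m$ and $k$ denote the filter length and the number of hash functions, not the match probabilities and field index used earlier). In linkage, the inserted elements are typically the character $q$-grams of a field value, so the Jaccard similarity of two Bloom filters approximates the $q$-gram similarity of the underlying strings: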

\hat{J}(s, t) \approx \frac{|\mathbf{b}_s \cap \mathbf{b}_t|}{|\mathbf{b}_s \cup \mathbf{b}_t|}

False positives occur when bits are set by different strings, but the approximation is sufficient for linkage when Bloom filter size is large enough ( $m \geq 10k$ ).
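The encoding and comparison can be sketched with the standard library. The example below is an illustration under several assumptions: elements are padded character bigrams, the $k$ hash functions are simulated by salting SHA-256 with the index $i$, and the filter is stored as the set of positions that are set to 1. Plain unkeyed hashes are used only to keep the sketch short; practical PPRL schemes use keyed hashes with a secret shared by the data holders.

```python
import hashlib


def bigrams(text):
    """Padded, lower-cased character bigrams of a string."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, m=1024, k=4):
    """Return the set of bit positions set to 1 after inserting all bigrams."""
    bits = set()
    for gram in bigrams(text):
        for i in range(k):
            # Salting with i stands in for k independent hash functions.
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def jaccard(a, b):
    """Jaccard similarity of two sets of bit positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


smith, smyth, jones = (bloom_encode(n) for n in ("Smith", "Smyth", "Jones"))
print(f"Smith vs Smyth: {jaccard(smith, smyth):.3f}")
print(f"Smith vs Jones: {jaccard(smith, jones):.3f}")
```

Names that share most of their bigrams set mostly the same bits, while unrelated names overlap only through hash collisions, which is what makes similarity comparison possible without exchanging the names themselves.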

Evaluation Metrics

Linkage Quality Metrics

Given a set of predicted matches $\hat{M}$ and true matches $M^*$ :

\text{Precision} = \frac{|\hat{M} \cap M^*|}{|\hat{M}|}, \quad \text{Recall} = \frac{|\hat{M} \cap M^*|}{|M^*|}

\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

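The F-beta score $F_\beta = (1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall} / (\beta^2 \cdot \text{Precision} + \text{Recall})$ weights recall more heavily when $\beta > 1$ (useful when missing matches is costly). Writing $FP = |\hat{M} \setminus M^*|$ for false links and $FN = |M^* \setminus \hat{M}|$ for missed links, the linkage error rate is: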

\text{LER} = \frac{FP + FN}{|\hat{M}| + |M^*| - |\hat{M} \cap M^*|}

In practice, sampling-based evaluation estimates these metrics by manually reviewing a stratified sample of pairs (strong matches, possible matches, strong non-matches).
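When predicted and true links are both available as sets of record-id pairs, the metrics above reduce to a few set operations. The sketch below uses hypothetical pairs and an illustrative function name.

```python
def linkage_metrics(predicted, truth, beta=1.0):
    """Precision, recall, F-beta and linkage error rate from two sets of pairs."""
    tp = len(predicted & truth)  # correct links
    fp = len(predicted - truth)  # false links
    fn = len(truth - predicted)  # missed links

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0

    # |predicted| + |truth| - |intersection| equals tp + fp + fn.
    union = tp + fp + fn
    ler = (fp + fn) / union if union else 0.0
    return {"precision": precision, "recall": recall, "f_beta": f_beta, "ler": ler}


# Hypothetical predicted and true links.
predicted = {(1, 2), (3, 4), (5, 6), (7, 8)}
truth = {(1, 2), (3, 4), (9, 10)}

f1_scores = linkage_metrics(predicted, truth)
f2_scores = linkage_metrics(predicted, truth, beta=2.0)
print(f1_scores)
print(f2_scores["f_beta"])
```

In this example recall (2/3) is higher than precision (1/2), so the recall-weighted F2 score comes out above F1.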

Evaluation Challenges

Ground truth is usually unavailable for the full dataset; evaluation relies on samples
Clerical review of uncertain pairs is expensive but necessary for bias estimation
Linkage-induced bias: Incorrect matches introduce systematic error in downstream analyses
Sensitivity analysis: Vary thresholds and evaluate the precision-recall tradeoff
Block-level evaluation: Assess whether blocking schemes miss true matches

Python Implementation

import numpy as np

np.random.seed(42)

# --- String Distance Metrics ---
def levenshtein_distance(s, t):
    """Compute Levenshtein edit distance via dynamic programming."""
    m, n = len(s), len(t)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            dp[i, j] = min(dp[i-1, j] + 1, dp[i, j-1] + 1, dp[i-1, j-1] + cost)
    return dp[m, n]

def jaro_winkler(s, t, p=0.1):
    """Compute Jaro-Winkler similarity."""
    if s == t: return 1.0
    len_s, len_t = len(s), len(t)
    match_distance = max(len_s, len_t) // 2 - 1
    s_matches = [False] * len_s
    t_matches = [False] * len_t
    matches = transpositions = 0

    for i in range(len_s):
        start = max(0, i - match_distance)
        end = min(i + match_distance + 1, len_t)
        for j in range(start, end):
            if t_matches[j] or s[i] != t[j]: continue
            s_matches[i] = t_matches[j] = True
            matches += 1
            break

    if matches == 0: return 0.0

    k = 0
    for i in range(len_s):
        if not s_matches[i]: continue
        while not t_matches[k]: k += 1
        if s[i] != t[k]: transpositions += 1
        k += 1

    jaro = (matches/len_s + matches/len_t + (matches - transpositions/2)/matches) / 3

    # Common prefix (max 4 chars)
    prefix = 0
    for i in range(min(4, len_s, len_t)):
        if s[i] == t[i]: prefix += 1
        else: break

    return jaro + prefix * p * (1 - jaro)

# --- Test string distances ---
pairs = [
    ("Smith", "Smyth"), ("Johnson", "Johnsen"),
    ("Washington", "Washngton"), ("Michael", "Micheal"),
    ("New York", "New Yrok"), ("Robert", "Robbert"),
]
print("=== String Distance Metrics ===")
print(f"{'Pair':<30} {'Levenshtein':<14} {'Jaro-Winkler':<14}")
print("-" * 58)
for s, t in pairs:
    lev = levenshtein_distance(s, t)
    jw = jaro_winkler(s, t)
    print(f"({s}, {t}){'':>{26-len(s)-len(t)}} {lev:<14} {jw:<14.4f}")

# --- Fellegi-Sunter Model ---
def fellegi_sunter_weight(match_probs, non_match_probs, agreement_vector):
    """Compute the Fellegi-Sunter match weight (in bits) for a pair."""
    weight = 0.0
    for gamma, m_k, u_k in zip(agreement_vector, match_probs, non_match_probs):
        if gamma == 1:
            weight += np.log2(m_k / u_k)              # agreement weight
        else:
            weight += np.log2((1 - m_k) / (1 - u_k))  # disagreement weight
    return weight

# Simulated parameters (assumed values, standing in for estimates from data)
# Fields: [first_name, last_name, DOB, city, ZIP]
m_k = np.array([0.95, 0.98, 0.90, 0.85, 0.92])  # match probs
u_k = np.array([0.01, 0.005, 0.02, 0.03, 0.05])  # non-match probs

print("\n=== Fellegi-Sunter Match Weights ===")
test_pairs = [
    ("John Smith", "John Smith", [1,1,1,1,1]),
    ("John Smith", "Jon Smith", [0,1,1,1,1]),
    ("John Smith", "John Smyth", [1,0,1,1,1]),
    ("John Smith", "Jane Doe", [0,0,0,0,0]),
]
for name1, name2, gamma in test_pairs:
    w = fellegi_sunter_weight(m_k, u_k, gamma)
    agreement = sum(gamma)
    print(f"({name1}, {name2}) agree on {agreement}/5 fields: W = {w:.2f} bits")

# --- EM Algorithm for Linkage ---
np.random.seed(42)
n_pairs = 5000
n_true_matches = 200

# Generate comparison vectors
gamma_true = np.random.binomial(1, m_k, size=(n_true_matches, 5))
gamma_false = np.random.binomial(1, u_k, size=(n_pairs - n_true_matches, 5))
gamma_all = np.vstack([gamma_true, gamma_false])
labels = np.concatenate([np.ones(n_true_matches), np.zeros(n_pairs - n_true_matches)])

# EM algorithm
n_pairs_total = len(gamma_all)
n_fields = gamma_all.shape[1]
pi = n_true_matches / n_pairs_total  # started at the true rate for illustration; in practice use a rough guess
m_em = np.random.uniform(0.7, 0.95, n_fields)
u_em = np.random.uniform(0.01, 0.1, n_fields)

for iteration in range(50):
    # E-step
    p_match = pi * np.prod(m_em ** gamma_all * (1 - m_em) ** (1 - gamma_all), axis=1)
    p_nonmatch = (1 - pi) * np.prod(u_em ** gamma_all * (1 - u_em) ** (1 - gamma_all), axis=1)
    w_post = p_match / (p_match + p_nonmatch + 1e-300)

    # M-step
    pi_new = np.mean(w_post)
    m_em_new = np.average(gamma_all, weights=w_post, axis=0)
    u_em_new = np.average(gamma_all, weights=1 - w_post, axis=0)

    if np.max(np.abs(m_em - m_em_new)) < 1e-6:
        print(f"EM converged after {iteration + 1} iterations")
        break
    pi, m_em, u_em = pi_new, m_em_new, u_em_new

print(f"\n=== EM Estimates ===")
print(f"π (match prior): {pi:.4f} (true: {n_true_matches/n_pairs_total:.4f})")
fields = ['First Name', 'Last Name', 'DOB', 'City', 'ZIP']
print(f"{'Field':<15} {'m_k (est)':<12} {'m_k (true)':<12} {'u_k (est)':<12} {'u_k (true)':<12}")
print("-" * 63)
for i, field in enumerate(fields):
    print(f"{field:<15} {m_em[i]:<12.4f} {m_k[i]:<12.4f} {u_em[i]:<12.4f} {u_k[i]:<12.4f}")

# Classification
threshold = 0.5
predicted = (w_post >= threshold).astype(int)
tp = np.sum((predicted == 1) & (labels == 1))
fp = np.sum((predicted == 1) & (labels == 0))
fn = np.sum((predicted == 0) & (labels == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"\n=== Linkage Evaluation ===")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1: {f1:.4f}")

Summary

Key Takeaways: Record Linkage and Data Matching

Deterministic linkage requires exact key agreement; probabilistic linkage uses agreement patterns across multiple fields to compute match weights. The Fellegi-Sunter model provides the theoretical foundation for probabilistic linkage.
Fellegi-Sunter match weights sum field-specific log-likelihood ratios: $\log(m_k/u_k)$ for each agreeing field and $\log((1 - m_k)/(1 - u_k))$ for each disagreeing field. The model assumes conditional independence of agreement indicators given match status, a strong assumption often violated in practice.
String distances: Levenshtein counts edit operations; Jaro-Winkler adds a prefix bonus suited to name matching. Phonetic encodings (Soundex, Metaphone) handle spelling variations. The choice depends on the type of errors expected.
Blocking reduces the $O(N^2)$ comparison space. Good blocking achieves high pairs completeness (sensitivity) while maintaining manageable comparison volumes. Multi-pass and learned blocking improve over standard approaches.
EM algorithm estimates $m_k$ , $u_k$ , and $\pi$ without labelled data, but converges to local maxima. Small labelled samples dramatically improve estimation. Privacy-preserving methods (Bloom filters, homomorphic encryption) enable linkage without revealing raw data.
