Survey Sampling and Weighting

Advanced Statistical Methods

Getting Population Answers From Sample Data

Survey sampling uses design-based inference to generalize from samples to populations, with weighting procedures like raking and the Horvitz-Thompson estimator correcting for unequal selection probabilities.

Public opinion polling — Produce nationally representative estimates from stratified samples
Census operations — Adjust for non-response and undercoverage in population counts
Health surveys — Estimate disease prevalence from complex multi-stage sampling designs

Survey weighting ensures every voice counts proportionally, not just the ones that were easy to reach.

Survey statistics provides the mathematical framework for drawing inferences about populations from samples selected through known, probabilistic mechanisms. Unlike model-based inference, design-based inference treats population values as fixed and randomness arises solely from the sampling process. This lesson develops the theory and practice of survey sampling, weighting, and variance estimation for complex survey designs.

Design-Based Inference

DfFinite Population Framework

Let $U = \{1, 2, \dots, N\}$ be a finite population of size $N$ . Each unit $i$ has an unknown value $y_i$ for the variable of interest. A sampling design $p(s)$ assigns a probability to every possible sample $s \subseteq U$ such that $\sum_{s \subseteq U} p(s) = 1$ . The design determines which samples are possible and their selection probabilities.

DfInclusion Probability

The first-order inclusion probability $\pi_i$ is the probability that unit $i$ is included in the sample:

\pi_i = P(i \in s) = \sum_{s \ni i} p(s)

The second-order inclusion probability $\pi_{ij}$ is the probability that both units $i$ and $j$ are included:

\pi_{ij} = P(i \in s \text{ and } j \in s) = \sum_{s \ni i, j} p(s)

A sampling design is first-order balanced if $\pi_i > 0$ for all $i$ .

Horvitz-Thompson Estimator

The Horvitz-Thompson (HT) estimator of the population total $Y = \sum_{i \in U} y_i$ is:

\hat{Y}_{\text{HT}} = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in s} w_i y_i

where $w_i = 1/\pi_i$ is the design weight. The HT estimator is unbiased for any sampling design with $\pi_i > 0$ :

\mathbb{E}_p[\hat{Y}_{\text{HT}}] = \sum_{i \in U} y_i = Y

ThHorvitz-Thompson Unbiasedness

Proof: Let $I_i$ be the inclusion indicator ( $I_i = 1$ if $i \in s$ , 0 otherwise). Then $\mathbb{E}[I_i] = \pi_i$ and:

\mathbb{E}\left[\sum_{i \in U} \frac{I_i y_i}{\pi_i}\right] = \sum_{i \in U} \frac{y_i}{\pi_i} \cdot \pi_i = \sum_{i \in U} y_i = Y \quad \square

HT Variance

The variance of the HT estimator is:

V(\hat{Y}_{\text{HT}}) = \sum_{i \in U} \sum_{j \in U} \frac{(\pi_{ij} - \pi_i \pi_j)}{\pi_i \pi_j} y_i y_j

For stratified sampling with independent strata:

V(\hat{Y}_{\text{HT}}) = \sum_{h=1}^{H} V_h\left(\hat{Y}_h\right) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}

where $S_h^2 = \frac{1}{N_h - 1}\sum_{i \in U_h}(y_i - \bar{Y}_h)^2$ is the population variance in stratum $h$ .

HT Estimator Properties

Unbiased for any $\pi_i > 0$ , regardless of the population distribution
Design-consistent: $V(\hat{Y}_{\text{HT}})/Y^2 \to 0$ as $n \to \infty$ under appropriate conditions
Sensitivity: Large weights (small $\pi_i$ ) lead to high variance; extreme weights can dominate the estimate
Non-negativity: The HT estimator can produce negative estimates for totals when $y_i < 0$ and weights are large

Design Effects

DfDesign Effect (DEFF)

The design effect compares the variance of an estimator under the complex design to its variance under simple random sampling (SRS) of the same size:

\text{DEFF}(\hat{\theta}) = \frac{V_{\text{complex}}(\hat{\theta})}{V_{\text{SRS}}(\hat{\theta})}

$\text{DEFF} > 1$ : Complex design is less efficient than SRS
$\text{DEFF} < 1$ : Complex design is more efficient (e.g., optimal allocation)
$\text{DEFF} = 1$ : Equivalent to SRS

Effective Sample Size

The effective sample size adjusts the actual sample size for design effects:

n_{\text{eff}} = \frac{n}{\text{DEFF}}

For a proportion $\hat{p}$ with design weight $w_i$ :

\text{DEFF} = 1 + \text{CV}^2(w) \cdot \bar{w} \cdot (\text{something specific to the design})

A common approximation for stratified sampling with optimal allocation is $\text{DEFF} \approx 1 - \sum_h W_h^2$ where $W_h = N_h/N$ .

Sources of Design Effects

Clustering: Respondents within clusters are correlated; $\text{DEFF}_{\text{cluster}} \approx 1 + (m-1)\rho$ where $m$ is cluster size and $\rho$ is the intra-class correlation
Stratification: Reduces variance when strata are homogeneous; $\text{DEFF}_{\text{strat}} < 1$ when strata differ in means
Unequal selection probabilities: Increase variance unless compensated by weighting
Post-stratification: Can reduce variance if auxiliary information is correlated with outcomes

Weighting Procedures

Survey weights adjust for unequal selection probabilities and non-response. The final analysis weight is typically a product of several components.

Design Weight

The base weight (design weight) is the inverse of the inclusion probability:

w_i^{\text{design}} = \frac{1}{\pi_i}

For a multi-stage design with $k$ stages, the design weight is the product of conditional selection probabilities at each stage.

Non-Response Adjustment

After accounting for design, weights are adjusted for non-response. The non-response adjustment divides the weight by the estimated response probability:

w_i^{\text{NR}} = \frac{w_i^{\text{design}}}{\hat{P}(R_i = 1 \mid \mathbf{x}_i)}

where $\hat{P}(R_i = 1 \mid \mathbf{x}_i)$ is the estimated probability of response given covariates $\mathbf{x}_i$ . This is estimated via logistic regression on design strata and auxiliary variables. Non-response adjustment cells group units with similar response propensities.

Post-Stratification

Post-stratification calibrates weights so that weighted totals of known population counts match census or administrative data:

w_i^{\text{PS}} = w_i^{\text{NR}} \cdot \frac{N_c}{\sum_{j \in s_c} w_j^{\text{NR}}}

where $N_c$ is the known population count in post-stratum $c$ and $s_c$ is the sample in that post-stratum. This adjusts for coverage and non-response bias when the post-stratification variables are correlated with outcomes.

Raking (Iterative Proportional Fitting)

DfRaking

Raking (iterative proportional fitting, IPF) calibrates weights to match known marginal distributions for multiple categorical variables simultaneously. Given $K$ classification variables with known population margins $\{N_k^{(1)}, \dots, N_k^{(K)}\}$ :

w_i^{(t+1)} = w_i^{(t)} \cdot \prod_{k=1}^{K} \frac{N_k^{(c_{ik})}}{\sum_{j \in s} w_j^{(t)} \cdot \mathbb{1}(c_{jk} = c_{ik})}

where $c_{ik}$ is the category of unit $i$ on variable $k$ . Raking iterates through each margin, adjusting weights proportionally until convergence.

Raking Convergence

Raking converges when:

The number of cells in the cross-classification exceeds the sample size (so exact calibration to all cells is impossible)
The marginal totals are consistent (sum to the same population total across all variables)
Each unit belongs to a unique cell in the cross-classification

Convergence is guaranteed when the log-linear model implied by the margins is faithful. In practice, convergence typically occurs in 10–30 iterations. Weights can become extreme when sample sizes in marginal cells are small.

Calibration Estimator

Raking is a special case of calibration estimation. Given auxiliary variables $\mathbf{x}_i$ with known population total $\mathbf{X} = \sum_{i \in U} \mathbf{x}_i$ , the calibrated weight minimizes:

\min_{\mathbf{w}} \sum_{i \in s} g(w_i) \quad \text{subject to} \quad \sum_{i \in s} w_i \mathbf{x}_i = \mathbf{X}

where $g(w)$ is a distance function. Common choices:

Raking: $g(w) = w \log w$ (Kullback-Leibler divergence)
Linear calibration: $g(w) = (w - w^{\text{old}})^2$ (chi-square distance)
Logit trimming: Bounds weights to $[a, b]$ and re-calibrates

Variance Estimation for Complex Surveys

Complex survey designs violate the independence assumption underlying simple variance formulas. Specialized variance estimation methods account for the design.

Taylor Series Linearization

The Taylor series method approximates the variance of a nonlinear estimator $\hat{\theta} = f(\hat{\mathbf{Y}})$ by linearizing around the population value:

V(\hat{\theta}) \approx \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{n_h} \Delta_h \Delta_h' \cdot (z_{hi} - \bar{z}_h)(z_{hj} - \bar{z}_h)'

where $\Delta_h = \frac{\partial f}{\partial \hat{\mathbf{Y}}_h}\bigg|_{\hat{\mathbf{Y}}_h = \mathbf{Y}_h}$ is the gradient and $z_{hi}$ is the linearized variable. For a weighted mean $\hat{\theta} = \sum w_i y_i / \sum w_i$ :

z_i = w_i(y_i - \hat{\theta}) / \sum_j w_j

Jackknife Variance Estimation

The jackknife estimates variance by systematically deleting portions of the sample. For a stratified design with $n_h$ PSUs per stratum:

Delete PSU $j$ from stratum $h$ : compute estimate $\hat{\theta}_{h(-j)}$ using remaining PSUs
Create pseudo-replicate: $\hat{\theta}_{h(-j)}^{*} = n_h \hat{\theta} - (n_h - 1)\hat{\theta}_{h(-j)}$

V_{\text{JK}}(\hat{\theta}) = \frac{1}{H(n_h - 1)} \sum_{h=1}^{H} \sum_{j=1}^{n_h} (\hat{\theta}_{h(-j)}^{*} - \hat{\theta})^2

The jackknife is model-free and handles any statistic (medians, quantiles, regression coefficients).

Bootstrap Variance Estimation

The bootstrap for complex surveys resamples PSUs within strata:

For replicate $b = 1, \dots, B$ : draw $n_h$ PSUs with replacement from stratum $h$
Compute $\hat{\theta}^{*(b)}$ from the bootstrap sample
Estimate variance:

V_{\text{boot}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} (\hat{\theta}^{*(b)} - \bar{\theta}^*)^2

The bootstrap handles unequal probabilities by weighting resampled units by $1/p_i^{*}$ where $p_i^{*}$ is the resampling probability.

Variance Estimation Comparison

Method	Pros	Cons	Best For
Taylor series	Fast, analytic	Linearization error	Smooth estimators
Jackknife	Exact, handles any statistic	Needs PSUs	Medians, quantiles
Bootstrap	Flexible, handles complex designs	Computationally expensive	Small samples, extreme statistics

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

# --- Simulate Complex Survey Data ---
N = 10000  # population size
n_strata = 4  # number of strata
n_psu_per_stratum = 25  # PSUs per stratum

# Population
strata = np.random.choice(n_strata, N, p=[0.3, 0.3, 0.25, 0.15])
y = 50 + 10 * strata + np.random.randn(N) * 15  # outcome correlated with stratum
x = 2 * strata + np.random.randn(N) * 5  # auxiliary variable

# Inverse probability sampling (higher prob in larger strata)
probs = np.where(strata < 2, 0.1, 0.2)  # oversample small strata
sample_idx = np.random.choice(N, size=1000, replace=False, p=probs / probs.sum())
design_weights = 1.0 / probs[sample_idx]

# --- Horvitz-Thompson Estimator ---
y_sample = y[sample_idx]
w = design_weights

y_ht = np.sum(w * y_sample) / np.sum(w)
y_srs = np.mean(y_sample)

print("=== Horvitz-Thompson Estimator ===")
print(f"Population mean (true): {np.mean(y):.2f}")
print(f"HT estimate: {y_ht:.2f}")
print(f"SRS mean: {y_srs:.2f}")

# --- Design Effect ---
var_srs = np.var(y_sample, ddof=1) / len(y_sample)
# Approximate variance with Taylor linearization
z_i = w * (y_sample - y_ht) / np.sum(w)
var_ht = np.sum(z_i**2) * len(y_sample) / (len(y_sample) - 1)
deff = var_ht / var_srs
n_eff = len(y_sample) / deff

print(f"\n=== Design Effects ===")
print(f"DEFF: {deff:.2f}")
print(f"Effective sample size: {n_eff:.0f}")

# --- Raking / Iterative Proportional Fitting ---
def rake_weights(weights, sample_data, population_margins, max_iter=50, tol=1e-6):
    """Rake weights to match population margins for multiple variables."""
    w = weights.copy()
    for iteration in range(max_iter):
        w_old = w.copy()
        for var_name, margin in population_margins.items():
            categories = sample_data[var_name]
            unique_cats = np.unique(categories)
            for cat in unique_cats:
                mask = categories == cat
                w[mask] *= margin[cat] / np.sum(w[mask])
        if np.max(np.abs(w - w_old)) < tol:
            print(f"  Raking converged after {iteration + 1} iterations")
            break
    return w

# Post-stratification variables
age_group = np.random.choice(['18-34', '35-54', '55+'], len(sample_idx), p=[0.3, 0.4, 0.3])
gender = np.random.choice(['M', 'F'], len(sample_idx), p=[0.48, 0.52])

# Known population margins
pop_margins = {
    'age': {'18-34': 0.28, '35-54': 0.38, '55+': 0.34},
    'gender': {'M': 0.49, 'F': 0.51}
}

sample_df = pd.DataFrame({'age': age_group, 'gender': gender})
w_raked = rake_weights(w, sample_df, pop_margins)
y_raked = np.sum(w_raked * y_sample) / np.sum(w_raked)

print(f"\n=== Raked Estimate ===")
print(f"Raked mean: {y_raked:.2f}")
print(f"Weight range: [{w_raked.min():.2f}, {w_raked.max():.2f}]")

# --- Jackknife Variance Estimation ---
# Simulate PSU structure
n_strata_sample = 4
psu_per_stratum = 25
total_sample = n_strata_sample * psu_per_stratum

# Create fake PSU assignments
psu_ids = np.repeat(np.arange(total_sample), len(y_sample) // total_sample + 1)[:len(y_sample)]
stratum_ids = np.repeat(np.arange(n_strata_sample), psu_per_stratum)[:len(y_sample)]

def theta_hat(y_vals, w_vals):
    return np.sum(w_vals * y_vals) / np.sum(w_vals)

# Delete-one-PSU jackknife
estimates = []
for h in range(n_strata_sample):
    for j in range(psu_per_stratum):
        mask = ~((stratum_ids == h) & (psu_ids == j))
        if mask.sum() > 10:
            theta_jk = theta_hat(y_sample[mask], w[mask])
            estimates.append(theta_jk)

estimates = np.array(estimates)
var_jack = np.var(estimates, ddof=1)
se_jack = np.sqrt(var_jack)

print(f"\n=== Jackknife Variance ===")
print(f"Estimate: {y_ht:.2f}")
print(f"SE (jackknife): {se_jack:.2f}")
print(f"95% CI: [{y_ht - 1.96*se_jack:.2f}, {y_ht + 1.96*se_jack:.2f}]")

Summary

Key Takeaways: Survey Sampling and Weighting

Design-based inference treats population values as fixed; randomness comes from the sampling mechanism. The Horvitz-Thompson estimator $\hat{Y}_{\text{HT}} = \sum_{i \in s} y_i/\pi_i$ is unbiased for any design with known inclusion probabilities $\pi_i > 0$ .
Design effects (DEFF) quantify the efficiency loss from complex designs relative to SRS. Clustering inflates DEFF by factor $1 + (m-1)\rho$ ; stratification reduces it. The effective sample size $n_{\text{eff}} = n/\text{DEFF}$ .
Weighting proceeds in stages: design weights ( $1/\pi_i$ ), non-response adjustment (divide by estimated response probability), and post-stratification (calibrate to known margins). Raking (IPF) iteratively adjusts to multiple marginal distributions.
Calibration estimation provides a unified framework: minimize distance to base weights subject to margin-matching constraints. Raking uses Kullback-Leibler; linear calibration uses chi-square distance.
Variance estimation — Taylor series linearization is fast for smooth estimators; jackknife and bootstrap handle arbitrary statistics (medians, quantiles). For complex surveys, always use design-based variance estimates, not naive formulas.

Survey Sampling and Weighting

Survey Sampling and Weighting

Getting Population Answers From Sample Data

Design-Based Inference

DfFinite Population Framework

DfInclusion Probability

Horvitz-Thompson Estimator

Horvitz-Thompson Estimator

ThHorvitz-Thompson Unbiasedness

HT Variance

Design Effects

DfDesign Effect (DEFF)

Effective Sample Size

Weighting Procedures

Design Weight

Non-Response Adjustment

Post-Stratification

Raking (Iterative Proportional Fitting)

DfRaking

Calibration Estimator

Variance Estimation for Complex Surveys

Taylor Series Linearization

Jackknife Variance Estimation

Bootstrap Variance Estimation

Python Implementation

Summary

Key Takeaways: Survey Sampling and Weighting

Premium Content

Need Expert Statistics Help?