Equivalence Testing

Advanced Statistical Methods

Proving Things Are the Same, Not Just Different

Equivalence testing uses the TOST procedure to demonstrate that two treatments differ by no more than a pre-specified margin, reversing the logic of traditional hypothesis testing. It provides evidence for practical equivalence.

Generic drug approval — Demonstrate bioequivalence of generic and brand-name medications
Manufacturing quality — Verify that a new process produces results equivalent to the established one
Bioequivalence — Show that formulation changes do not alter drug absorption characteristics

Equivalence testing answers the right question: is the difference small enough to not matter?

Definition: Equivalence Testing

Equivalence testing is a statistical framework for demonstrating that two treatments are practically equivalent, rather than merely failing to detect a difference. Instead of the traditional null hypothesis of "no difference," we test whether the true difference lies within a pre-specified equivalence margin $\Delta$.

"Absence of evidence is not evidence of absence." (Altman & Bland, 1995)

The Problem with Traditional Hypothesis Testing

In traditional testing:

$H_0$ : $\mu_T - \mu_R = 0$ (no difference)
$H_1$ : $\mu_T - \mu_R \neq 0$ (some difference)

Failing to reject $H_0$ does not prove equivalence. It merely means we lack evidence of a difference. This is fundamentally different from demonstrating that the treatments are equivalent.
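To make this concrete, here is a minimal sketch using two small made-up samples. The data values and the margin of 0.5 are illustrative assumptions, not results from any study. The t-test is not significant, yet the 90% confidence interval for the difference is far too wide to support a claim of equivalence.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, chosen only for illustration
a = np.array([10.1, 12.3, 9.8, 11.5, 10.9])
b = np.array([11.0, 12.9, 10.2, 12.4, 11.8])
margin = 0.5  # assumed equivalence margin

# Classical two-sample t-test (equal variances assumed)
t_stat, p_value = stats.ttest_ind(a, b)

# 90% CI for the mean difference, based on the pooled SD
n_a, n_b = len(a), len(b)
df = n_a + n_b - 2
sp = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df)
se = sp * np.sqrt(1 / n_a + 1 / n_b)
diff = a.mean() - b.mean()
t_crit = stats.t.ppf(0.95, df)
ci = (diff - t_crit * se, diff + t_crit * se)

not_significant = bool(p_value > 0.05)
inside_margin = bool(-margin < ci[0] and ci[1] < margin)
print(f"p = {p_value:.3f}, 90% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Not significant: {not_significant}; equivalence shown: {inside_margin}")
```

With samples this small, "no significant difference" says little about whether the treatments are actually similar.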

Traditional vs Equivalence

Aspect	Traditional Test	Equivalence Test
Null hypothesis	$\theta = 0$	$|\theta| \geq \Delta$
Alternative	$\theta \neq 0$	$|\theta| < \Delta$
Conclusion if $H_0$ rejected	There is a difference	The treatments are equivalent
What failing to reject means	No evidence of difference	No evidence of equivalence

Two One-Sided Tests (TOST)

Definition: TOST Procedure

The Two One-Sided Tests (TOST) procedure (Schuirmann, 1987) demonstrates equivalence by simultaneously testing:

$H_{01}$ : $\mu_T - \mu_R \leq -\Delta$ (treatment is worse)
$H_{02}$ : $\mu_T - \mu_R \geq \Delta$ (treatment is better)

The composite null hypothesis is $H_0$ : $|\mu_T - \mu_R| \geq \Delta$ , and equivalence is concluded when both one-sided tests reject at significance level $\alpha$ .

Test Statistics

TOST Test Statistics

$$T_1 = \frac{(\bar{X}_T - \bar{X}_R) + \Delta}{S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

$$T_2 = \frac{(\bar{X}_T - \bar{X}_R) - \Delta}{S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

Here $S_p$ is the pooled standard deviation of the two groups. Equivalence is concluded when $T_1 > t_{1-\alpha, \nu}$ and $T_2 < -t_{1-\alpha, \nu}$, where $\nu = n_T + n_R - 2$.
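The two statistics can be computed directly from summary statistics. The sketch below assumes a pooled-variance design; the function name `tost_from_summary` and the input values are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from scipy import stats

def tost_from_summary(mean_t, mean_r, sd_pooled, n_t, n_r, margin, alpha=0.05):
    """Return (T1, T2, critical value, decision) for a pooled-variance TOST."""
    se = sd_pooled * math.sqrt(1 / n_t + 1 / n_r)
    df = n_t + n_r - 2
    t1 = (mean_t - mean_r + margin) / se   # tests H01: diff <= -margin
    t2 = (mean_t - mean_r - margin) / se   # tests H02: diff >= +margin
    t_crit = stats.t.ppf(1 - alpha, df)
    equivalent = bool(t1 > t_crit and t2 < -t_crit)
    return t1, t2, t_crit, equivalent

# Illustrative inputs: observed difference 0.5, pooled SD 4, margin 3
t1, t2, t_crit, equivalent = tost_from_summary(
    mean_t=100.5, mean_r=100.0, sd_pooled=4.0, n_t=30, n_r=30, margin=3.0
)
print(f"T1 = {t1:.3f}, T2 = {t2:.3f}, critical value = {t_crit:.3f}")
print(f"Equivalence concluded: {equivalent}")
```

Both statistics must clear the critical value in their respective directions; a large $T_1$ alone only rules out the lower bound.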

Equivalence to Confidence Interval

TOST is equivalent to checking whether the $(1 - 2\alpha)$ confidence interval for $\mu_T - \mu_R$ falls entirely within $(-\Delta, +\Delta)$ . For a 90% CI, this corresponds to $\alpha = 0.05$ per test.
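This correspondence can be checked numerically. The sketch below compares the p-value decision with the confidence-interval decision over a grid of observed differences; the standard error, degrees of freedom, and margin are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

def tost_p(diff, se, df, margin):
    """TOST p-value: the larger of the two one-sided p-values."""
    p_lower = stats.t.sf((diff + margin) / se, df)
    p_upper = stats.t.cdf((diff - margin) / se, df)
    return max(p_lower, p_upper)

def ci_inside(diff, se, df, margin, alpha):
    """True if the (1 - 2*alpha) CI lies strictly inside (-margin, +margin)."""
    t_crit = stats.t.ppf(1 - alpha, df)
    return bool(diff - t_crit * se > -margin and diff + t_crit * se < margin)

alpha, margin, se, df = 0.05, 1.0, 0.3, 40  # assumed values
diffs = np.linspace(-1.5, 1.5, 61)
agree = all(
    bool(tost_p(d, se, df, margin) < alpha) == ci_inside(d, se, df, margin, alpha)
    for d in diffs
)
print(f"p-value and CI decisions agree on every grid point: {agree}")
```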

Choosing the Equivalence Margin

Definition: Equivalence Margin

The equivalence margin $\Delta$ represents the largest difference that is considered clinically or practically negligible. It is not a statistical parameter — it is a clinical judgment that must be specified before the study.

Common Equivalence Margins

Application	Typical Margin	Rationale
Bioequivalence (AUC, Cmax)	80–125% (ratio scale)	Regulatory standard (FDA, EMA)
Non-inferiority (efficacy)	Pre-specified based on historical data	Must preserve a fraction of standard treatment effect
Diagnostic accuracy	±5% sensitivity/specificity	Clinical non-inferiority
Device comparison	±10% of reference SD	Clinical equivalence

Margin Selection

An overly generous $\Delta$ makes it easy to declare equivalence but may miss clinically meaningful differences. An overly strict $\Delta$ may require infeasibly large sample sizes. Consult clinical experts and use established regulatory precedents.

Bioequivalence

Definition: Average Bioequivalence

Average bioequivalence (ABE) is concluded when the 90% confidence interval for the ratio of geometric means (test/reference) lies entirely within 80–125% for AUC and Cmax. This is equivalent to TOST on the log scale with $\Delta = \ln(1.25) \approx 0.223$ and $\alpha = 0.05$ per test.

Log-Transformed Data

For bioequivalence, we work on the log scale:

Log-Scale Equivalence

$$\ln(\mu_T / \mu_R) = \ln(\mu_T) - \ln(\mu_R)$$

Equivalence when: $-\ln(1.25) < \ln(\mu_T) - \ln(\mu_R) < \ln(1.25)$
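A short sketch of the conversion between the ratio scale and the log scale follows. The helper name `ratio_ci_from_log` and the example interval are hypothetical; the point is that the 80–125% limits are symmetric once logged, because 0.80 = 1/1.25.

```python
import math

lower_ratio, upper_ratio = 0.80, 1.25
log_lower = math.log(lower_ratio)   # about -0.223
log_upper = math.log(upper_ratio)   # about +0.223

def ratio_ci_from_log(log_diff, half_width):
    """Back-transform a log-scale CI (estimate +/- half-width) to a ratio CI."""
    return math.exp(log_diff - half_width), math.exp(log_diff + half_width)

# Illustrative log-scale estimate and CI half-width
lo, hi = ratio_ci_from_log(log_diff=-0.05, half_width=0.10)
within = lower_ratio < lo and hi < upper_ratio
print(f"log limits: ({log_lower:.3f}, {log_upper:.3f})")
print(f"ratio CI: ({lo * 100:.1f}%, {hi * 100:.1f}%), within 80-125%: {within}")
```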

Scaled Average Bioequivalence (SABE)

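Definition: Scaled Average Bioequivalence

For highly variable drugs (within-subject CV of the reference of 30% or more), SABE widens the equivalence margin in proportion to the reference formulation's variability:

$$\frac{(\mu_T - \mu_R)^2}{\sigma_{WR}^2} < \theta_S$$

where $\mu_T$ and $\mu_R$ are log-scale means, $\sigma_{WR}$ is the within-subject standard deviation of the reference, and $\theta_S = k^2$ is the scaled margin. The regulatory constant is $k = \ln(1.25)/0.25 \approx 0.893$ for the FDA and $k = 0.760$ for the EMA, which gives implied limits of $\exp(\pm k\,\sigma_{WR})$ on the ratio scale.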
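To see how scaling widens the limits, the sketch below converts a within-subject CV into a log-scale standard deviation and computes the implied ratio limits $\exp(\pm k\,\sigma_{WR})$. The constants used here, $k = \ln(1.25)/0.25$ for the FDA and $k = 0.760$ for the EMA, are stated from general knowledge and should be verified against current guidance. Real regulatory procedures add further conditions (for example point-estimate constraints and caps on widening) that this sketch ignores.

```python
import math

K_FDA = math.log(1.25) / 0.25   # assumed FDA regulatory constant, about 0.893
K_EMA = 0.760                   # assumed EMA regulatory constant

def widened_limits(cv_wr, k):
    """Implied ratio limits exp(+/- k * s_WR) for a reference within-subject CV."""
    s_wr = math.sqrt(math.log(cv_wr**2 + 1))  # CV -> SD on the log scale
    return math.exp(-k * s_wr), math.exp(k * s_wr)

for cv in (0.30, 0.40, 0.50):
    lo_f, hi_f = widened_limits(cv, K_FDA)
    lo_e, hi_e = widened_limits(cv, K_EMA)
    print(f"CV {cv:.0%}: k=FDA ({lo_f * 100:.1f}%, {hi_f * 100:.1f}%), "
          f"k=EMA ({lo_e * 100:.1f}%, {hi_e * 100:.1f}%)")
```

At a CV of 30% the EMA constant reproduces the conventional 80–125% limits almost exactly, which is why 30% serves as the switching point.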

Power Analysis for Equivalence Testing

Non-Central t-Distribution Approach

Power of TOST

The power of TOST for detecting equivalence when the true difference is $\delta$ is:

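$$\text{Power} = P\left(T_1 > t_{1-\alpha, \nu} \text{ and } T_2 < -t_{1-\alpha, \nu} \mid \delta\right)$$

Using the non-central t-distribution, this joint probability is well approximated by:

$$\text{Power} \approx P\left(t'_{\nu, \lambda_1} > t_{1-\alpha, \nu}\right) + P\left(t'_{\nu, \lambda_2} < -t_{1-\alpha, \nu}\right) - 1$$

The right-hand side is a lower bound on the joint probability. It is accurate for adequately sized studies and is truncated at 0 when the standard error is so large relative to $\Delta$ that the expression goes negative. The non-centrality parameters are:

$$\lambda_1 = \frac{\delta + \Delta}{\sigma \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}, \quad \lambda_2 = \frac{\delta - \Delta}{\sigma \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$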

Power is Maximum When δ = 0

Power for equivalence testing is highest when the true difference is zero and decreases as $|\delta|$ approaches $\Delta$ . This is the opposite of superiority testing, where power increases with the true effect size.
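A Monte Carlo simulation offers an independent check on this behaviour. The sketch below is a simulation-based estimate, not the non-central t calculation; the function name `simulate_tost_power`, the sample size, and the standard deviation are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def simulate_tost_power(delta_true, margin, sigma, n, alpha=0.05,
                        n_sims=4000, seed=0):
    """Monte Carlo estimate of TOST power for two groups of size n."""
    rng = np.random.default_rng(seed)
    x_t = rng.normal(delta_true, sigma, size=(n_sims, n))
    x_r = rng.normal(0.0, sigma, size=(n_sims, n))
    diff = x_t.mean(axis=1) - x_r.mean(axis=1)
    # With equal group sizes the pooled variance is the average variance
    sp = np.sqrt((x_t.var(axis=1, ddof=1) + x_r.var(axis=1, ddof=1)) / 2)
    se = sp * np.sqrt(2 / n)
    t_crit = stats.t.ppf(1 - alpha, 2 * n - 2)
    reject_both = ((diff + margin) / se > t_crit) & ((diff - margin) / se < -t_crit)
    return float(reject_both.mean())

margin = np.log(1.25)
power_zero = simulate_tost_power(0.0, margin, sigma=0.4, n=50)
power_near = simulate_tost_power(0.15, margin, sigma=0.4, n=50)
print(f"Estimated power at delta = 0:    {power_zero:.3f}")
print(f"Estimated power at delta = 0.15: {power_near:.3f}")
```

The estimated power drops sharply as the true difference moves toward the margin, even though the treatments still satisfy $|\delta| < \Delta$.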

Sample Size Formula (Two-Group Design)

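For equal sample sizes $n_T = n_R = n$, a normal-approximation sample size per group is:

Sample Size for Equivalence

$$n = \frac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{(\Delta - |\delta|)^2}$$

where $z_{1-\alpha}$ is the critical value for the one-sided test, $1-\beta$ is the target power, and $\delta \neq 0$ is the assumed true difference. When $\delta = 0$, both one-sided tests contribute equally to the type II error, so $z_{1-\beta}$ is replaced by $z_{1-\beta/2}$.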
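As a worked example under assumed inputs (not taken from any study), take $\sigma = 0.4$, $\Delta = 0.20$, an anticipated true difference $\delta = 0.05$, $\alpha = 0.05$, and 80% power, so that $z_{0.95} \approx 1.645$ and $z_{0.80} \approx 0.842$:

```latex
n = \frac{2\,(0.4)^2\,(1.645 + 0.842)^2}{(0.20 - 0.05)^2}
  \approx \frac{0.32 \times 6.18}{0.0225}
  \approx 87.9
```

Rounding up gives 88 subjects per group. Because the denominator is $(\Delta - |\delta|)^2$, even a small anticipated difference inflates the requirement considerably relative to planning with a difference close to zero.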

Relationship to Confidence Intervals

Definition: Equivalence via Confidence Interval

A $(1 - 2\alpha)$ confidence interval for $\mu_T - \mu_R$ that falls entirely within $(-\Delta, +\Delta)$ is equivalent to rejecting both one-sided null hypotheses at level $\alpha$ .

Decision rules:

CI Outcome	TOST Decision	Interpretation
CI entirely within $(-\Delta, +\Delta)$	Reject both $H_{01}$ and $H_{02}$	Equivalence demonstrated
CI crosses $-\Delta$ or $+\Delta$ (partly inside, partly outside)	Cannot reject at least one	Inconclusive
CI entirely outside $(-\Delta, +\Delta)$	Cannot reject one of them	Non-equivalence: the difference exceeds the margin
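The decision rules in the table can be sketched as a small classifier. The function name `classify_ci` and the label strings are hypothetical; the logic assumes a symmetric margin and a $(1 - 2\alpha)$ interval computed elsewhere.

```python
def classify_ci(ci_lower, ci_upper, margin):
    """Map a (1 - 2*alpha) CI for the difference to a TOST-style conclusion."""
    if -margin < ci_lower and ci_upper < margin:
        return "equivalence demonstrated"
    if ci_lower >= margin or ci_upper <= -margin:
        return "difference exceeds margin"
    return "inconclusive"

# Illustrative intervals against a margin of 1.0
for ci in [(-0.4, 0.6), (-0.3, 1.4), (1.2, 2.0)]:
    print(ci, "->", classify_ci(*ci, margin=1.0))
```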

One-Sided vs Two-Sided CIs

A 90% two-sided CI corresponds to TOST at $\alpha = 0.05$ . A 95% two-sided CI corresponds to TOST at $\alpha = 0.025$ . Always match the CI level to the intended test level.

Non-Inferiority Testing

Definition: Non-Inferiority

Non-inferiority testing is a one-sided version of equivalence testing where we only test whether the new treatment is not unacceptably worse than the reference:

$H_0$ : $\mu_T - \mu_R \leq -\Delta$ (new treatment is worse)
$H_1$ : $\mu_T - \mu_R > -\Delta$ (new treatment is non-inferior)

This requires a historical evidence argument that the standard treatment has a known effect over placebo, and the margin $\Delta$ preserves a fraction (typically 50%) of that effect.
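A minimal sketch of the one-sided test follows, assuming that higher values are better and that a pooled-variance t-test is appropriate. The function name `noninferiority_test`, the one-sided level of 0.025, the margin, and the simulated data are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def noninferiority_test(x_new, x_ref, margin, alpha=0.025):
    """One-sided non-inferiority t-test (higher values assumed better)."""
    x_new = np.asarray(x_new, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    n_new, n_ref = len(x_new), len(x_ref)
    df = n_new + n_ref - 2
    sp = np.sqrt(((n_new - 1) * x_new.var(ddof=1)
                  + (n_ref - 1) * x_ref.var(ddof=1)) / df)
    se = sp * np.sqrt(1 / n_new + 1 / n_ref)
    diff = x_new.mean() - x_ref.mean()
    t_stat = (diff + margin) / se                    # tests H0: diff <= -margin
    p_value = stats.t.sf(t_stat, df)
    lower = diff - stats.t.ppf(1 - alpha, df) * se   # one-sided lower bound
    return {"diff": diff, "p_value": p_value, "lower_bound": lower,
            "non_inferior": bool(p_value < alpha)}

# Simulated outcomes, for illustration only
rng = np.random.default_rng(1)
new = rng.normal(49.5, 5.0, 60)
ref = rng.normal(50.0, 5.0, 60)
res = noninferiority_test(new, ref, margin=3.0)
print(f"Difference: {res['diff']:.2f}, lower bound: {res['lower_bound']:.2f}")
print(f"p = {res['p_value']:.4f}, non-inferior: {res['non_inferior']}")
```

Non-inferiority is concluded exactly when the one-sided lower confidence bound lies above $-\Delta$, which mirrors the confidence-interval view of TOST with only one bound checked.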

Python Implementation

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- TOST for two independent groups ---
def tost_ind(x_treatment, x_reference, margin, alpha=0.05):
    """
    Two one-sided tests (TOST) for two independent groups.
    
    Parameters:
        x_treatment: array-like, treatment group data
        x_reference: array-like, reference group data
        margin: equivalence margin (Delta)
        alpha: significance level per test
    
    Returns:
        dict with test statistics, p-values, CI, and decision
    """
    n_t = len(x_treatment)
    n_r = len(x_reference)
    
    mean_t = np.mean(x_treatment)
    mean_r = np.mean(x_reference)
    diff = mean_t - mean_r
    
    # Pooled standard deviation
    var_t = np.var(x_treatment, ddof=1)
    var_r = np.var(x_reference, ddof=1)
    sp = np.sqrt(((n_t - 1) * var_t + (n_r - 1) * var_r) / (n_t + n_r - 2))
    se = sp * np.sqrt(1/n_t + 1/n_r)
    
    # TOST statistics
    df = n_t + n_r - 2
    t1 = (diff + margin) / se  # Test H01: diff <= -margin
    t2 = (diff - margin) / se  # Test H02: diff >= margin
    
    # P-values (one-sided)
    p1 = stats.t.sf(t1, df)   # P(T > t1) under H01
    p2 = stats.t.cdf(t2, df)  # P(T < t2) under H02
    
    # Combined p-value
    p_tost = max(p1, p2)
    
    # Confidence interval
    t_crit = stats.t.ppf(1 - alpha, df)
    ci_lower = diff - t_crit * se
    ci_upper = diff + t_crit * se
    
    # Decision: rejecting both one-sided nulls means equivalence is concluded
    equivalent = (p1 < alpha) and (p2 < alpha)
    
    return {
        'difference': diff,
        'se': se,
        't1': t1, 't2': t2,
        'p1': p1, 'p2': p2,
        'p_tost': p_tost,
        'ci': (ci_lower, ci_upper),   # (1 - 2*alpha) confidence interval
        'ci_level': 1 - 2 * alpha,
        'equivalent': bool(equivalent),
        'margin': margin,
        'df': df
    }

# Simulate bioequivalence study
np.random.seed(42)
# Log-transformed AUC values
n_subjects = 24
treatment = np.random.normal(6.8, 0.4, n_subjects)   # ln(AUC_T)
reference = np.random.normal(6.85, 0.4, n_subjects)   # ln(AUC_R)
delta = np.log(1.25)  # ~0.223 for 80-125% criteria

result = tost_ind(treatment, reference, margin=delta)
print("=== TOST Bioequivalence Test ===")
print(f"Mean difference (log scale): {result['difference']:.4f}")
print(f"Ratio (T/R): {np.exp(result['difference'])*100:.1f}%")
print(f"{result['ci_level']:.0%} CI for ratio: ({np.exp(result['ci'][0])*100:.1f}%, "
      f"{np.exp(result['ci'][1])*100:.1f}%)")
print(f"T1 statistic: {result['t1']:.3f}, p1 = {result['p1']:.4f}")
print(f"T2 statistic: {result['t2']:.3f}, p2 = {result['p2']:.4f}")
print(f"Equivalence concluded: {result['equivalent']}")

# --- Power curve for TOST ---
def tost_power(delta_true, margin, sigma, n, alpha=0.05):
    """Approximate power of TOST for a given true difference (n per group)."""
    se = sigma * np.sqrt(2 / n)
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha, df)
    
    # Non-centrality parameters
    lambda1 = (delta_true + margin) / se
    lambda2 = (delta_true - margin) / se
    
    power1 = stats.nct.sf(t_crit, df, lambda1)    # P(reject H01)
    power2 = stats.nct.cdf(-t_crit, df, lambda2)  # P(reject H02)
    
    # P(both reject) >= power1 + power2 - 1; clip at 0 because this bound
    # goes negative when the SE is large relative to the margin
    return max(0.0, power1 + power2 - 1)

deltas = np.linspace(-0.3, 0.3, 100)
# sigma=0.4 matches the SD of the simulated log-AUC data above
powers = [tost_power(d, delta, sigma=0.4, n=24) for d in deltas]

plt.figure(figsize=(10, 6))
plt.plot(deltas, powers, 'b-', linewidth=2)
plt.axvline(x=-delta, color='red', linestyle='--', label=f'Δ = ±{delta:.3f}')
plt.axvline(x=delta, color='red', linestyle='--')
plt.axhline(y=0.8, color='gray', linestyle=':', label='80% power')
plt.xlabel('True Mean Difference (δ)')
plt.ylabel('Power')
plt.title('Power Curve for TOST Equivalence Test (n=24 per group)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('tost_power.png', dpi=150)
plt.show()

# --- Sample size calculation ---
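def tost_sample_size(margin, sigma, delta=0, alpha=0.05, power=0.80):
    """Approximate sample size per group for TOST (normal approximation)."""
    if abs(delta) >= margin:
        raise ValueError("|delta| must be smaller than the margin")
    beta = 1 - power
    z_alpha = stats.norm.ppf(1 - alpha)
    # With delta = 0 the type II error is split across both one-sided tests
    z_beta = stats.norm.ppf(1 - beta / 2) if delta == 0 else stats.norm.ppf(power)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / (margin - abs(delta))**2
    return int(np.ceil(n))

n_req = tost_sample_size(delta, sigma=1.0, power=0.80)
print(f"\nRequired sample size per group: {n_req}")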

Key Takeaways

Summary: Equivalence Testing

TOST demonstrates equivalence by showing the true difference lies within $(-\Delta, +\Delta)$ , not just by failing to find a difference.
The equivalence margin $\Delta$ is a clinical judgment, not a statistical artifact — it must be justified a priori.
TOST is equivalent to a confidence interval check: the $(1-2\alpha)$ CI must fall within the equivalence bounds.
Power is highest when δ = 0 and decreases as the true difference approaches the margin.
Bioequivalence uses the 80–125% rule on log-transformed data, with scaled approaches for highly variable drugs.
Non-inferiority testing is a one-sided variant requiring historical evidence of the reference treatment's effect.
Always report effect sizes and confidence intervals alongside the TOST decision for full transparency.
