Equivalence Testing

Proving Things Are the Same, Not Just Different

Equivalence testing uses the TOST procedure to demonstrate that two treatments differ by no more than a pre-specified margin, reversing the logic of traditional hypothesis testing. It provides evidence for practical equivalence.

  • Generic drug approval: Demonstrate bioequivalence of generic and brand-name medications
  • Manufacturing quality: Verify that a new process produces results equivalent to the established one
  • Bioequivalence: Show that formulation changes do not alter drug absorption characteristics

Equivalence testing answers the right question: is the difference small enough to not matter?


Definition: Equivalence Testing

Equivalence testing is a statistical framework for demonstrating that two treatments are practically equivalent, rather than merely failing to detect a difference. Instead of the traditional null hypothesis of "no difference," we test whether the true difference lies within a pre-specified equivalence margin $\Delta$.

"Absence of evidence is not evidence of absence." (Altman & Bland, 1995)


The Problem with Traditional Hypothesis Testing

In traditional testing:

  • $H_0$: $\mu_T - \mu_R = 0$ (no difference)
  • $H_1$: $\mu_T - \mu_R \neq 0$ (some difference)

Failing to reject $H_0$ does not prove equivalence. It merely means we lack evidence of a difference. This is fundamentally different from demonstrating that the treatments are equivalent.
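
To make the point concrete, here is a minimal sketch with made-up measurements (five per group). The function name `difference_test_with_ci` and the data are hypothetical; the pooled-variance t-test is assumed because it matches the setup used later in this lesson.

```python
import numpy as np
from scipy import stats

def difference_test_with_ci(x_t, x_r, conf_level=0.95):
    """Pooled-variance t-test for a difference, plus the CI for mu_T - mu_R."""
    x_t = np.asarray(x_t, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    n_t, n_r = len(x_t), len(x_r)
    df = n_t + n_r - 2
    diff = x_t.mean() - x_r.mean()
    pooled_var = ((n_t - 1) * x_t.var(ddof=1) + (n_r - 1) * x_r.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n_t + 1 / n_r))
    p_value = stats.ttest_ind(x_t, x_r, equal_var=True).pvalue
    t_crit = stats.t.ppf(1 - (1 - conf_level) / 2, df)
    return p_value, (diff - t_crit * se, diff + t_crit * se)

# Made-up measurements, illustrative only
treatment = [10, 14, 8, 16, 12]
reference = [13, 17, 9, 19, 15]

p_value, ci = difference_test_with_ci(treatment, reference)
print(f"p-value for H0 (no difference): {p_value:.3f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The p-value is well above 0.05, so the test does not reject "no difference", yet the interval runs from roughly -7.7 to 2.5. Data this sparse are compatible with a large difference, which is why a non-significant result cannot be read as equivalence.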

Traditional vs Equivalence

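| Aspect | Traditional Test | Equivalence Test |
|---|---|---|
| Null hypothesis | $\theta = 0$ | $\lvert\theta\rvert \geq \Delta$ |
| Alternative | $\theta \neq 0$ | $\lvert\theta\rvert < \Delta$ |
| Conclusion if $H_0$ rejected | There is a difference | The treatments are equivalent |
| What failing to reject means | No evidence of difference | No evidence of equivalence |

Here $\theta = \mu_T - \mu_R$ denotes the true difference between treatment and reference.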

Two One-Sided Tests (TOST)

Definition: TOST Procedure

The Two One-Sided Tests (TOST) procedure (Schuirmann, 1987) demonstrates equivalence by simultaneously testing:

  • $H_{01}$: $\mu_T - \mu_R \leq -\Delta$ (treatment is worse)
  • $H_{02}$: $\mu_T - \mu_R \geq \Delta$ (treatment is better)

The composite null hypothesis is $H_0$: $|\mu_T - \mu_R| \geq \Delta$, and equivalence is concluded when both one-sided tests reject at significance level $\alpha$.

Test Statistics

TOST Test Statistics

$$T_1 = \frac{(\bar{X}_T - \bar{X}_R) + \Delta}{S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

$$T_2 = \frac{(\bar{X}_T - \bar{X}_R) - \Delta}{S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

Equivalence is concluded when $T_1 > t_{1-\alpha, \nu}$ and $T_2 < -t_{1-\alpha, \nu}$, where $\nu = n_T + n_R - 2$ and $S_p$ is the pooled standard deviation of the two groups.
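
As a worked illustration of the decision rule, suppose (hypothetical numbers, chosen only for this example) an observed difference of 0.05, a pooled standard deviation of 0.4, 50 subjects per group, a margin of 0.223, and a per-test level of 0.05:

```latex
\begin{align*}
SE &= S_p \sqrt{\tfrac{1}{n_T} + \tfrac{1}{n_R}} = 0.4 \sqrt{\tfrac{1}{50} + \tfrac{1}{50}} = 0.08 \\
T_1 &= \frac{0.05 + 0.223}{0.08} \approx 3.41 \\
T_2 &= \frac{0.05 - 0.223}{0.08} \approx -2.16 \\
t_{0.95,\,98} &\approx 1.66
\end{align*}
```

Both conditions hold (3.41 > 1.66 and -2.16 < -1.66), so under these assumed values both one-sided nulls are rejected and equivalence would be concluded.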

Equivalence to Confidence Interval

TOST is equivalent to checking whether the $(1 - 2\alpha)$ confidence interval for $\mu_T - \mu_R$ falls entirely within $(-\Delta, +\Delta)$. For a 90% CI, this corresponds to $\alpha = 0.05$ per test.


Choosing the Equivalence Margin

Definition: Equivalence Margin

The equivalence margin $\Delta$ represents the largest difference that is considered clinically or practically negligible. It is not a statistical parameter: it is a clinical judgment that must be specified before the study.

Common Equivalence Margins

| Application | Typical Margin | Rationale |
|---|---|---|
| Bioequivalence (AUC, Cmax) | 80–125% (ratio scale) | Regulatory standard (FDA, EMA) |
| Non-inferiority (efficacy) | Pre-specified based on historical data | Must preserve a fraction of standard treatment effect |
| Diagnostic accuracy | ±5% sensitivity/specificity | Clinical non-inferiority |
| Device comparison | ±10% of reference SD | Clinical equivalence |

Margin Selection

An overly generous $\Delta$ makes it easy to declare equivalence but may miss clinically meaningful differences. An overly strict $\Delta$ may require infeasibly large sample sizes. Consult clinical experts and use established regulatory precedents.


Bioequivalence

Definition: Average Bioequivalence

Average bioequivalence (ABE) is established when the 90% confidence interval for the ratio of population geometric means (test/reference) falls within 80–125% for AUC and Cmax. This is equivalent to TOST on the log scale with $\Delta = \ln(1.25) \approx 0.223$, since $\ln(0.80) = -\ln(1.25)$.

Log-Transformed Data

For bioequivalence, we work on the log scale:

Log-Scale Equivalence

$$\ln(\mu_T / \mu_R) = \ln(\mu_T) - \ln(\mu_R)$$

Equivalence when: $-\ln(1.25) < \ln(\mu_T) - \ln(\mu_R) < \ln(1.25)$
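
The back-transformation from the log scale to the ratio scale can be sketched as follows. The helper name `ratio_ci_from_log` and the example inputs are hypothetical; the function assumes you already have a log-scale mean difference, its standard error, and the degrees of freedom.

```python
import numpy as np
from scipy import stats

LOG_MARGIN = np.log(1.25)  # about 0.223; note that ln(0.80) = -ln(1.25)

def ratio_ci_from_log(diff_log, se_log, df, alpha=0.05):
    """Back-transform a log-scale difference and its (1 - 2*alpha) CI to T/R ratios."""
    t_crit = stats.t.ppf(1 - alpha, df)
    lower = np.exp(diff_log - t_crit * se_log)
    upper = np.exp(diff_log + t_crit * se_log)
    within = bool(lower > 0.80 and upper < 1.25)
    return float(np.exp(diff_log)), (float(lower), float(upper)), within

# Hypothetical summary values on the log scale
ratio, (lo, hi), ok = ratio_ci_from_log(diff_log=-0.02, se_log=0.05, df=22)
print(f"Ratio T/R: {ratio:.1%}, 90% CI: ({lo:.1%}, {hi:.1%}), within 80-125%: {ok}")
```

Checking the interval on the ratio scale against 0.80 and 1.25 gives the same answer as checking the log-scale interval against $\pm\ln(1.25)$, because the exponential is monotone.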

Scaled Average Bioequivalence (SABE)

Definition: Scaled Average Bioequivalence

For highly variable drugs (intra-subject CV > 30%), SABE widens the equivalence margin in proportion to the reference formulation's variability:

$$\frac{(\mu_T - \mu_R)^2}{\sigma_{WR}^2} < \theta_S^2$$

where $\sigma_{WR}$ is the within-subject standard deviation of the reference and $\theta_S$ is the regulatory scaling constant on the standard-deviation scale, so that the log-scale limits become $\pm\theta_S \sigma_{WR}$. Typical values are about 0.893 for the FDA (that is, $\ln(1.25)/0.25$) and 0.760 for the EMA.
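
A minimal sketch of how scaling widens the acceptance limits, assuming an EMA-style rule: the constant is applied on the standard-deviation scale, scaling starts above a 30% CV, and widening stops at a 50% CV. The function name and default arguments are illustrative assumptions, and the real regulatory procedures include further conditions (for example a constraint on the point estimate) that are not modelled here.

```python
import math

def scaled_limits(cv_wr, k=0.760, cv_threshold=0.30, cv_cap=0.50):
    """Widened acceptance limits (as T/R ratios) for a reference within-subject CV.

    k, cv_threshold and cv_cap are assumed defaults for an EMA-style rule.
    """
    if cv_wr <= cv_threshold:
        return 0.80, 1.25                        # conventional limits, no scaling
    cv_used = min(cv_wr, cv_cap)                 # widening stops at the cap
    s_wr = math.sqrt(math.log(cv_used**2 + 1))   # within-subject SD on the log scale
    return math.exp(-k * s_wr), math.exp(k * s_wr)

for cv in (0.25, 0.40, 0.50):
    lower, upper = scaled_limits(cv)
    print(f"CV = {cv:.0%}: limits {lower:.2%} to {upper:.2%}")
```

The conversion from CV to a log-scale standard deviation uses $\sigma_{WR} = \sqrt{\ln(CV^2 + 1)}$, which holds for log-normally distributed data.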


Power Analysis for Equivalence Testing

Non-Central t-Distribution Approach

Power of TOST

The power of TOST to conclude equivalence when the true difference is $\delta$ is:

$$\text{Power} = P\left(T_1 > t_{1-\alpha, \nu} \text{ and } T_2 < -t_{1-\alpha, \nu} \mid \delta\right)$$

Using the non-central t-distribution, this joint probability is approximated by:

$$\text{Power} \approx P\left(t'_{\nu, \lambda_1} > t_{1-\alpha, \nu}\right) + P\left(t'_{\nu, \lambda_2} < -t_{1-\alpha, \nu}\right) - 1$$

This expression is a lower bound on the joint probability. It can be negative when the sample is small or the margin is tight relative to the standard error, in which case the power is taken to be zero. The non-centrality parameters are:

$$\lambda_1 = \frac{\delta + \Delta}{\sigma \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}, \quad \lambda_2 = \frac{\delta - \Delta}{\sigma \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

Power Is Maximum When δ = 0

Power for equivalence testing is highest when the true difference is zero and decreases as $|\delta|$ approaches $\Delta$. This is the opposite of superiority testing, where power increases with the true effect size.
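
One way to check this behaviour without relying on a closed-form expression is a small Monte Carlo simulation. The sketch below is illustrative: the function name `simulate_tost_power` and the settings (standard deviation 0.4, 50 per group, 4000 simulated studies, fixed seed) are assumptions, not values from the lesson.

```python
import numpy as np
from scipy import stats

def simulate_tost_power(delta_true, margin, sigma, n, alpha=0.05, n_sim=4000, seed=0):
    """Estimate TOST power as the fraction of simulated studies concluding equivalence."""
    rng = np.random.default_rng(seed)
    x_t = rng.normal(delta_true, sigma, size=(n_sim, n))
    x_r = rng.normal(0.0, sigma, size=(n_sim, n))
    diff = x_t.mean(axis=1) - x_r.mean(axis=1)
    # Equal group sizes, so the pooled variance is the average of the two sample variances
    pooled_var = (x_t.var(axis=1, ddof=1) + x_r.var(axis=1, ddof=1)) / 2
    se = np.sqrt(pooled_var * 2 / n)
    t_crit = stats.t.ppf(1 - alpha, 2 * n - 2)
    t1 = (diff + margin) / se
    t2 = (diff - margin) / se
    return float(np.mean((t1 > t_crit) & (t2 < -t_crit)))

margin = np.log(1.25)
for d in (0.0, 0.10, margin):
    est = simulate_tost_power(d, margin, sigma=0.4, n=50)
    print(f"delta = {d:.3f}: estimated power = {est:.3f}")
```

The estimated power should fall as the true difference moves from zero toward the margin, and at the margin itself it should be close to the per-test level, since that point lies on the boundary of the null hypothesis.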

Sample Size Formula (Two-Group Design)

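For equal sample sizes $n_T = n_R = n$, the normal approximation gives the required sample size per group:

Sample Size for Equivalence

$$n = \frac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{(\Delta - |\delta|)^2} \quad \text{for } \delta \neq 0$$

$$n = \frac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta/2})^2}{\Delta^2} \quad \text{for } \delta = 0$$

where $z_{1-\alpha}$ is the critical value for each one-sided test and $1 - \beta$ is the target power. When $\delta = 0$, both one-sided tests contribute equally to the type II error, so $\beta$ is split between them and the quantile becomes $z_{1-\beta/2}$.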


Relationship to Confidence Intervals

Definition: Equivalence via Confidence Interval

A $(1 - 2\alpha)$ confidence interval for $\mu_T - \mu_R$ that falls entirely within $(-\Delta, +\Delta)$ is equivalent to rejecting both one-sided null hypotheses at level $\alpha$.

Decision rules:

| CI Outcome | TOST Decision | Interpretation |
|---|---|---|
| CI entirely within $(-\Delta, +\Delta)$ | Reject both $H_{01}$ and $H_{02}$ | Equivalence demonstrated |
| CI includes 0 but extends beyond $\pm\Delta$ | Cannot reject at least one | Inconclusive |
| CI entirely outside $(-\Delta, +\Delta)$ | Cannot reject at least one | Difference detected |

One-Sided vs Two-Sided CIs

A 90% two-sided CI corresponds to TOST at $\alpha = 0.05$. A 95% two-sided CI corresponds to TOST at $\alpha = 0.025$. Always match the CI level to the intended test level.
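
The agreement between the confidence-interval rule and the p-value rule can be checked numerically. This is a small sketch working from summary statistics; `tost_decisions` and the example numbers are hypothetical.

```python
from scipy import stats

def tost_decisions(diff, se, df, margin, alpha=0.05):
    """Return the (1 - 2*alpha) CI and the equivalence decision by CI and by p-values."""
    t_crit = stats.t.ppf(1 - alpha, df)
    ci = (diff - t_crit * se, diff + t_crit * se)
    by_ci = bool(-margin < ci[0] and ci[1] < margin)
    p1 = stats.t.sf((diff + margin) / se, df)    # one-sided test of H01
    p2 = stats.t.cdf((diff - margin) / se, df)   # one-sided test of H02
    by_pvalue = bool(max(p1, p2) < alpha)
    return ci, by_ci, by_pvalue

# Hypothetical summary statistics, evaluated against several margins
for margin in (0.1, 0.2, 0.3, 0.5):
    ci, by_ci, by_pvalue = tost_decisions(diff=0.05, se=0.1, df=30, margin=margin)
    print(f"margin = {margin}: CI = ({ci[0]:.3f}, {ci[1]:.3f}), "
          f"CI rule = {by_ci}, p-value rule = {by_pvalue}")
```

The two rules agree for every margin because $p_1 < \alpha$ is the same event as the lower CI limit exceeding $-\Delta$, and $p_2 < \alpha$ is the same event as the upper limit falling below $+\Delta$.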


Non-Inferiority Testing

Definition: Non-Inferiority

Non-inferiority testing is a one-sided version of equivalence testing where we only test whether the new treatment is not unacceptably worse than the reference:

  • $H_0$: $\mu_T - \mu_R \leq -\Delta$ (new treatment is worse)
  • $H_1$: $\mu_T - \mu_R > -\Delta$ (new treatment is non-inferior)

This requires a historical evidence argument that the standard treatment has a known effect over placebo, and the margin $\Delta$ preserves a fraction (typically 50%) of that effect.
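
A one-sided test of this kind might be sketched as below. The function name, the default one-sided level of 0.025, the pooled-variance assumption, and the data are all illustrative choices; the sketch also assumes that larger values mean a better outcome.

```python
import numpy as np
from scipy import stats

def noninferiority_test(x_new, x_ref, margin, alpha=0.025):
    """One-sided test of H0: mu_T - mu_R <= -margin (higher values assumed better)."""
    x_new = np.asarray(x_new, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    n_t, n_r = len(x_new), len(x_ref)
    df = n_t + n_r - 2
    diff = x_new.mean() - x_ref.mean()
    pooled_var = ((n_t - 1) * x_new.var(ddof=1) + (n_r - 1) * x_ref.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n_t + 1 / n_r))
    t_stat = (diff + margin) / se
    p_value = stats.t.sf(t_stat, df)
    # Lower limit of the one-sided (1 - alpha) CI; non-inferior if it exceeds -margin
    ci_lower = diff - stats.t.ppf(1 - alpha, df) * se
    return {'difference': diff, 't': t_stat, 'p_value': p_value,
            'ci_lower': ci_lower, 'non_inferior': bool(p_value < alpha)}

# Made-up measurements, illustrative only
new = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
ref = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8]

result_ni = noninferiority_test(new, ref, margin=0.5)
print(f"Difference: {result_ni['difference']:.3f}, p = {result_ni['p_value']:.4f}, "
      f"non-inferior: {result_ni['non_inferior']}")
```

Only the lower side is tested, so a new treatment that turns out to be better than the reference still counts as non-inferior. With a much tighter margin the same data may no longer support the claim.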


Python Implementation

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- TOST for two independent groups ---
def tost_ind(x_treatment, x_reference, margin, alpha=0.05):
    """
    Two one-sided tests (TOST) for two independent groups.
    
    Parameters:
        x_treatment: array-like, treatment group data
        x_reference: array-like, reference group data
        margin: equivalence margin (Delta)
        alpha: significance level per test
    
    Returns:
        dict with test statistics, p-values, CI, and decision
    """
    n_t = len(x_treatment)
    n_r = len(x_reference)
    
    mean_t = np.mean(x_treatment)
    mean_r = np.mean(x_reference)
    diff = mean_t - mean_r
    
    # Pooled standard deviation
    var_t = np.var(x_treatment, ddof=1)
    var_r = np.var(x_reference, ddof=1)
    sp = np.sqrt(((n_t - 1) * var_t + (n_r - 1) * var_r) / (n_t + n_r - 2))
    se = sp * np.sqrt(1/n_t + 1/n_r)
    
    # TOST statistics
    df = n_t + n_r - 2
    t1 = (diff + margin) / se  # Test H01: diff <= -margin
    t2 = (diff - margin) / se  # Test H02: diff >= margin
    
    # P-values (one-sided)
    p1 = stats.t.sf(t1, df)   # P(T > t1) under H01
    p2 = stats.t.cdf(t2, df)  # P(T < t2) under H02
    
    # Combined p-value
    p_tost = max(p1, p2)
    
    # Confidence interval at level 1 - 2*alpha (90% when alpha = 0.05)
    t_crit = stats.t.ppf(1 - alpha, df)
    ci_lower = diff - t_crit * se
    ci_upper = diff + t_crit * se

    # Decision: equivalence is concluded only if both one-sided nulls are rejected
    equivalent = (p1 < alpha) and (p2 < alpha)

    return {
        'difference': diff,
        'se': se,
        't1': t1, 't2': t2,
        'p1': p1, 'p2': p2,
        'p_tost': p_tost,
        'ci_90': (ci_lower, ci_upper),
        'equivalent': equivalent,
        'margin': margin,
        'df': df
    }

# Simulate bioequivalence study
np.random.seed(42)
# Log-transformed AUC values
n_subjects = 24
treatment = np.random.normal(6.8, 0.4, n_subjects)   # ln(AUC_T)
reference = np.random.normal(6.85, 0.4, n_subjects)   # ln(AUC_R)
delta = np.log(1.25)  # ~0.223 for 80-125% criteria

result = tost_ind(treatment, reference, margin=delta)
print("=== TOST Bioequivalence Test ===")
print(f"Mean difference (log scale): {result['difference']:.4f}")
print(f"Ratio (T/R): {np.exp(result['difference'])*100:.1f}%")
print(f"90% CI for ratio: ({np.exp(result['ci_90'][0])*100:.1f}%, "
      f"{np.exp(result['ci_90'][1])*100:.1f}%)")
print(f"T1 statistic: {result['t1']:.3f}, p1 = {result['p1']:.4f}")
print(f"T2 statistic: {result['t2']:.3f}, p2 = {result['p2']:.4f}")
print(f"Equivalence concluded: {result['equivalent']}")

# --- Power curve for TOST ---
def tost_power(delta_true, margin, sigma, n, alpha=0.05):
    """Approximate power of TOST for a given true difference (n per group)."""
    se = sigma * np.sqrt(2 / n)
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha, df)

    # Non-centrality parameters
    lambda1 = (delta_true + margin) / se
    lambda2 = (delta_true - margin) / se

    power1 = stats.nct.sf(t_crit, df, lambda1)    # P(reject H01)
    power2 = stats.nct.cdf(-t_crit, df, lambda2)  # P(reject H02)

    # P(both reject) >= power1 + power2 - 1; clip at 0 because the bound can go negative
    return max(0.0, power1 + power2 - 1)

deltas = np.linspace(-0.3, 0.3, 100)
# sigma = 0.4 matches the SD of the simulated log-AUC data above; with sigma = 1.0
# and n = 24 the approximate power is zero across this whole range
powers = [tost_power(d, delta, sigma=0.4, n=24) for d in deltas]

plt.figure(figsize=(10, 6))
plt.plot(deltas, powers, 'b-', linewidth=2)
plt.axvline(x=-delta, color='red', linestyle='--', label=f'Δ = ±{delta:.3f}')
plt.axvline(x=delta, color='red', linestyle='--')
plt.axhline(y=0.8, color='gray', linestyle=':', label='80% power')
plt.xlabel('True Mean Difference (δ)')
plt.ylabel('Power')
plt.title('Power Curve for TOST Equivalence Test (n=24 per group)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('tost_power.png', dpi=150)
plt.show()

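# --- Sample size calculation ---
def tost_sample_size(margin, sigma, delta=0, alpha=0.05, power=0.80):
    """Approximate sample size per group for TOST (normal approximation)."""
    if abs(delta) >= margin:
        raise ValueError("abs(delta) must be smaller than the margin")
    beta = 1 - power
    z_alpha = stats.norm.ppf(1 - alpha)
    # With delta = 0 the type II error is split across the two one-sided tests
    z_beta = stats.norm.ppf(1 - beta / 2) if delta == 0 else stats.norm.ppf(1 - beta)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / (margin - abs(delta))**2
    return int(np.ceil(n))

# sigma = 0.4 matches the SD of the simulated log-AUC data above
n_req = tost_sample_size(delta, sigma=0.4, power=0.80)
print(f"\nRequired sample size per group: {n_req}")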

Key Takeaways

Summary: Equivalence Testing

  1. TOST demonstrates equivalence by showing the true difference lies within $(-\Delta, +\Delta)$, not just by failing to find a difference.
  2. The equivalence margin $\Delta$ is a clinical judgment, not a statistical artifact; it must be justified a priori.
  3. TOST is equivalent to a confidence interval check: the $(1-2\alpha)$ CI must fall within the equivalence bounds.
  4. Power is highest when δ = 0 and decreases as the true difference approaches the margin.
  5. Bioequivalence uses the 80–125% rule on log-transformed data, with scaled approaches for highly variable drugs.
  6. Non-inferiority testing is a one-sided variant requiring historical evidence of the reference treatment's effect.
  7. Always report effect sizes and confidence intervals alongside the TOST decision for full transparency.
