Equivalence Testing
Advanced Statistical Methods
Proving Things Are the Same, Not Just Different
Equivalence testing uses the TOST procedure to demonstrate that two treatments differ by no more than a pre-specified margin, reversing the logic of traditional hypothesis testing. It provides evidence for practical equivalence.
- Generic drug approval: Demonstrate bioequivalence of generic and brand-name medications
- Manufacturing quality: Verify that a new process produces results equivalent to the established one
- Bioequivalence: Show that formulation changes do not alter drug absorption characteristics
Equivalence testing answers the right question: is the difference small enough to not matter?
Definition: Equivalence Testing
Equivalence testing is a statistical framework for demonstrating that two treatments are practically equivalent, rather than merely failing to detect a difference. Instead of the traditional null hypothesis of "no difference," we test whether the true difference lies within a pre-specified equivalence margin $\Delta$.
"Absence of evidence is not evidence of absence." (Altman & Bland, 1995)
The Problem with Traditional Hypothesis Testing
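In traditional testing, with $\delta = \mu_T - \mu_R$ denoting the true difference between the treatment and reference means:
- $H_0$: $\delta = 0$ (no difference)
- $H_1$: $\delta \neq 0$ (some difference)
Failing to reject $H_0$ does not prove equivalence. It merely means we lack evidence of a difference. This is fundamentally different from demonstrating that the treatments are equivalent.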
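To make the distinction concrete, the sketch below runs an ordinary two-sample t-test on a small data set and then inspects the 90% confidence interval for the difference. The measurements and the margin of 1.0 are made-up, illustrative assumptions rather than values from a real study.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements: small samples with substantial spread.
x_new = np.array([10.0, 14.0, 9.0, 15.0, 12.0])
x_ref = np.array([11.0, 16.0, 10.0, 17.0, 13.0])
margin = 1.0  # assumed largest negligible difference

# Traditional two-sided t-test of H0: no difference (pooled variance).
t_stat, p_value = stats.ttest_ind(x_new, x_ref, equal_var=True)

# 90% confidence interval for the mean difference.
n1, n2 = len(x_new), len(x_ref)
diff = x_new.mean() - x_ref.mean()
sp2 = ((n1 - 1) * x_new.var(ddof=1) + (n2 - 1) * x_ref.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.95, n1 + n2 - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

not_significant = bool(p_value > 0.05)
ci_inside_margin = bool(-margin < ci[0] and ci[1] < margin)

print(f"p-value for 'no difference': {p_value:.3f}")
print(f"90% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Non-significant: {not_significant}; CI within the margin: {ci_inside_margin}")
```

The test fails to reject "no difference", yet the interval extends well beyond the margin, so these data are compatible with both a negligible and a meaningful difference.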
Traditional vs Equivalence
| Aspect | Traditional Test | Equivalence Test |
|---|---|---|
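| Null hypothesis | $H_0: \delta = 0$ | $H_0: \lvert\delta\rvert \ge \Delta$ |
| Alternative | $H_1: \delta \neq 0$ | $H_1: \lvert\delta\rvert < \Delta$ |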
| Conclusion if rejected | There is a difference | The treatments are equivalent |
| What failing to reject means | No evidence of difference | No evidence of equivalence |
Two One-Sided Tests (TOST)
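Definition: TOST Procedure
The Two One-Sided Tests (TOST) procedure (Schuirmann, 1987) demonstrates equivalence by simultaneously testing:
- $H_{01}$: $\delta \le -\Delta$ (treatment is worse by at least the margin)
- $H_{02}$: $\delta \ge \Delta$ (treatment is better by at least the margin)
The composite null hypothesis is $H_0$: $\lvert\delta\rvert \ge \Delta$, and equivalence is concluded when both one-sided tests reject at significance level $\alpha$.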
Test Statistics
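TOST Test Statistics
$$T_1 = \frac{\hat{\delta} + \Delta}{SE(\hat{\delta})}, \qquad T_2 = \frac{\hat{\delta} - \Delta}{SE(\hat{\delta})}$$
Equivalence is concluded when $T_1 > t_{1-\alpha,\nu}$ and $T_2 < -t_{1-\alpha,\nu}$, where $\hat{\delta} = \bar{x}_T - \bar{x}_R$, $SE(\hat{\delta}) = s_p\sqrt{1/n_T + 1/n_R}$ with pooled standard deviation $s_p$, and $\nu = n_T + n_R - 2$.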
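As a worked illustration with made-up numbers (assumptions, not study results), take an estimated difference $\hat{\delta} = 0.05$, standard error $0.10$, margin $\Delta = 0.30$ and $\nu = 46$ degrees of freedom, for which the one-sided critical value is $t_{0.95,46} \approx 1.68$. Writing $T_1 = (\hat{\delta} + \Delta)/SE$ and $T_2 = (\hat{\delta} - \Delta)/SE$:

```latex
T_1 = \frac{0.05 + 0.30}{0.10} = 3.5 > 1.68,
\qquad
T_2 = \frac{0.05 - 0.30}{0.10} = -2.5 < -1.68
```

Both one-sided tests reject, so equivalence would be concluded at $\alpha = 0.05$ in this hypothetical case.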
Equivalence to Confidence Interval
TOST is equivalent to checking whether the $(1 - 2\alpha) \times 100\%$ confidence interval for $\delta$ falls entirely within $(-\Delta, \Delta)$. For a 90% CI, this corresponds to $\alpha = 0.05$ per test.
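The correspondence can be checked numerically. The sketch below computes the decision both ways from summary statistics and shows that the two routes agree; the function name `tost_decisions` and the input values are hypothetical, chosen only for illustration.

```python
from scipy import stats

def tost_decisions(diff, se, df, margin, alpha=0.05):
    """Return (decision via p-values, decision via CI) for a mean difference."""
    # Route 1: two one-sided t-tests against -margin and +margin.
    p_lower = stats.t.sf((diff + margin) / se, df)   # H01: delta <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)  # H02: delta >= +margin
    by_pvalues = bool(p_lower < alpha and p_upper < alpha)

    # Route 2: the (1 - 2*alpha) confidence interval lies inside (-margin, +margin).
    t_crit = stats.t.ppf(1 - alpha, df)
    lower, upper = diff - t_crit * se, diff + t_crit * se
    by_ci = bool(-margin < lower and upper < margin)
    return by_pvalues, by_ci

# Hypothetical summary statistics: (estimated difference, standard error).
for diff, se in [(0.05, 0.10), (0.20, 0.10), (0.00, 0.25)]:
    print(diff, se, tost_decisions(diff, se, df=46, margin=0.30))
```

Only the first case, a small difference estimated precisely, leads to a conclusion of equivalence; a zero estimated difference with a large standard error does not.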
Choosing the Equivalence Margin
Definition: Equivalence Margin
The equivalence margin $\Delta$ represents the largest difference that is considered clinically or practically negligible. It is not a statistical parameter; it is a clinical judgment that must be specified before the study.
Common Equivalence Margins
| Application | Typical Margin | Rationale |
|---|---|---|
| Bioequivalence (AUC, Cmax) | 80–125% (ratio scale) | Regulatory standard (FDA, EMA) |
| Non-inferiority (efficacy) | Pre-specified based on historical data | Must preserve a fraction of standard treatment effect |
| Diagnostic accuracy | ±5% sensitivity/specificity | Clinical non-inferiority |
| Device comparison | ±10% of reference SD | Clinical equivalence |
Margin Selection
An overly generous $\Delta$ makes it easy to declare equivalence but may miss clinically meaningful differences. An overly strict $\Delta$ may require infeasibly large sample sizes. Consult clinical experts and use established regulatory precedents.
Bioequivalence
Definition: Average Bioequivalence
Average bioequivalence (ABE) is established when the ratio of population geometric means (test/reference) falls within 80–125% for AUC and Cmax. This is equivalent to testing on the log scale with $\Delta = \ln(1.25) \approx 0.223$.
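Log-Transformed Data
For bioequivalence, we work on the log scale, where $\mu_T$ and $\mu_R$ are the means of the log-transformed data and their difference is the log of the geometric mean ratio (GMR):
$$\delta = \mu_T - \mu_R = \ln(\mathrm{GMR})$$
Log-Scale Equivalence
Equivalence when:
$$\ln(0.80) < \delta < \ln(1.25), \quad \text{i.e.} \quad -0.223 < \delta < 0.223$$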
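The symmetry of the limits on the log scale, and the back-transformation of a confidence interval to the ratio scale, can be sketched as follows. The helper names and the example interval are hypothetical and serve only as an illustration.

```python
import math

# Regulatory limits on the ratio scale (the 80-125% rule).
lower_ratio, upper_ratio = 0.80, 1.25

# On the log scale the limits become symmetric around zero.
lower_log, upper_log = math.log(lower_ratio), math.log(upper_ratio)
print(f"log-scale limits: ({lower_log:.4f}, {upper_log:.4f})")

def ratio_ci_from_log_ci(ci_log):
    """Back-transform a CI for the log-scale difference to a CI for the ratio."""
    return tuple(math.exp(bound) for bound in ci_log)

def is_bioequivalent(ci_log):
    """True when the whole log-scale CI sits inside (ln 0.80, ln 1.25)."""
    return lower_log < ci_log[0] and ci_log[1] < upper_log

# Hypothetical 90% CI for the log-scale difference, for illustration only.
ci_log = (-0.12, 0.08)
print(ratio_ci_from_log_ci(ci_log), is_bioequivalent(ci_log))
```

Because $\ln(0.80) = -\ln(1.25)$, a single margin $\Delta$ describes both bounds, which is why the asymmetric-looking 80–125% rule fits the symmetric TOST framework.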
Scaled Average Bioequivalence (SABE)
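Definition: Scaled Average Bioequivalence
For highly variable drugs (intra-subject CV > 30%), SABE widens the equivalence margin in proportion to the reference formulation's variability:
$$-\theta_s\,\sigma_{WR} \le \mu_T - \mu_R \le \theta_s\,\sigma_{WR}$$
where $\sigma_{WR}$ is the within-subject standard deviation of the reference and $\theta_s$ is the regulatory scaling constant (approximately 0.893 for FDA, that is $\ln(1.25)/0.25$, and 0.760 for EMA).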
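A minimal sketch of how the widened limits grow with variability is shown below. It assumes log-normally distributed data, so that the log-scale within-subject standard deviation follows from the CV as $\sqrt{\ln(1 + CV^2)}$, and it uses the EMA-style constant 0.760. The function names are hypothetical, and regulatory caps on the widening and point-estimate constraints are deliberately left out.

```python
import math

def within_subject_sd(cv):
    """Log-scale within-subject SD implied by a coefficient of variation,
    assuming log-normal data: s = sqrt(ln(1 + CV^2))."""
    return math.sqrt(math.log(1.0 + cv ** 2))

def scaled_limits(cv_reference, theta=0.760):
    """Ratio-scale limits exp(-theta * s_WR) and exp(+theta * s_WR).

    theta=0.760 is the EMA-style constant; pass another value for other
    conventions. Regulatory caps are not modelled here.
    """
    half_width = theta * within_subject_sd(cv_reference)
    return math.exp(-half_width), math.exp(half_width)

for cv in (0.30, 0.40, 0.50):
    lo, hi = scaled_limits(cv)
    print(f"CV = {cv:.0%}: limits {lo:.2%} to {hi:.2%}")
```

At a CV of 30% the scaled limits coincide with the conventional 80–125% range, so the scaling only takes effect for more variable drugs.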
Power Analysis for Equivalence Testing
Non-Central t-Distribution Approach
Power of TOST
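The power of TOST for detecting equivalence when the true difference is $\delta$ is the probability that both TOST statistics $T_1$ and $T_2$ reject:
$$1 - \beta = P\left(T_1 > t_{1-\alpha,\nu} \ \text{and} \ T_2 < -t_{1-\alpha,\nu} \mid \delta\right)$$
Using the non-central t-distribution (an approximation, truncated at zero when the margin is too narrow for the design):
$$1 - \beta \approx P\left(t_{\nu}(\lambda_1) > t_{1-\alpha,\nu}\right) + P\left(t_{\nu}(\lambda_2) < -t_{1-\alpha,\nu}\right) - 1$$
where, for $n$ subjects per group and $\nu = 2n - 2$, the non-centrality parameters are:
$$\lambda_1 = \frac{\delta + \Delta}{\sigma\sqrt{2/n}}, \qquad \lambda_2 = \frac{\delta - \Delta}{\sigma\sqrt{2/n}}$$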
Power is Maximum When δ = 0
Power for equivalence testing is highest when the true difference is zero and decreases as $\lvert\delta\rvert$ approaches $\Delta$. This is the opposite of superiority testing, where power increases with the true effect size.
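This behaviour is easy to see in a simplified setting. The sketch below treats the standard error as known, so that a normal distribution replaces the non-central t; the margin and standard error are hypothetical values picked for illustration.

```python
from scipy.stats import norm

def tost_power_normal(delta_true, margin, se, alpha=0.05):
    """Power of TOST when the standard error is treated as known."""
    z = norm.ppf(1 - alpha)
    # Probability that each one-sided test rejects on its own.
    p_lower = norm.cdf((delta_true + margin) / se - z)
    p_upper = norm.cdf((margin - delta_true) / se - z)
    # Both must reject; the result is zero when the margin is narrower than z * se.
    return max(0.0, p_lower + p_upper - 1.0)

margin, se = 0.223, 0.05  # hypothetical margin and standard error
for d in (0.0, 0.05, 0.10, 0.15, 0.20, margin):
    print(f"delta = {d:.3f}: power = {tost_power_normal(d, margin, se):.3f}")
```

When the true difference sits exactly on the margin, the probability of declaring equivalence falls to $\alpha$, which is the type I error rate of the procedure.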
Sample Size Formula (Two-Group Design)
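For equal sample sizes $n_T = n_R = n$ and common standard deviation $\sigma$:
Sample Size for Equivalence
$$n = \frac{2\sigma^2\,(z_{1-\alpha} + z_{1-\beta})^2}{(\Delta - \lvert\delta\rvert)^2}$$
where $z_{1-\alpha}$ is the critical value for the one-sided test and $z_{1-\beta}$ is the power quantile. This is a normal approximation in which only the bound nearer to $\delta$ limits the power. When $\delta = 0$ both bounds contribute equally to the type II error, and $z_{1-\beta}$ is replaced by $z_{1-\beta/2}$, which gives a larger $n$.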
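As a worked illustration, assume the usual normal-approximation formula $n = 2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2 / (\Delta - \lvert\delta\rvert)^2$ per group and hypothetical inputs $\sigma = 0.3$, $\Delta = \ln 1.25 \approx 0.2231$, $\delta = 0.05$, $\alpha = 0.05$ and 80% power, so that $z_{0.95} \approx 1.6449$ and $z_{0.80} \approx 0.8416$:

```latex
n = \frac{2\,(0.3)^2\,(1.6449 + 0.8416)^2}{(0.2231 - 0.05)^2}
  = \frac{0.18 \times 6.1826}{0.02996}
  \approx 37.1
  \quad\Longrightarrow\quad
  n = 38 \ \text{per group}
```

The result is rounded up because the sample size must be a whole number that still reaches the target power.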
Relationship to Confidence Intervals
Definition: Equivalence via Confidence Interval
A $(1 - 2\alpha) \times 100\%$ confidence interval for $\delta$ that falls entirely within $(-\Delta, \Delta)$ is equivalent to rejecting both one-sided null hypotheses at level $\alpha$.
Decision rules:
| CI Outcome | TOST Decision | Interpretation |
|---|---|---|
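| CI entirely within $(-\Delta, \Delta)$ | Reject both $H_{01}$ and $H_{02}$ | Equivalence demonstrated |
| CI includes 0 but extends beyond $\pm\Delta$ | Cannot reject at least one of $H_{01}$, $H_{02}$ | Inconclusive |
| CI entirely outside $(-\Delta, \Delta)$ | Cannot reject at least one of $H_{01}$, $H_{02}$ | Difference detected |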
One-Sided vs Two-Sided CIs
A 90% two-sided CI corresponds to TOST at $\alpha = 0.05$. A 95% two-sided CI corresponds to TOST at $\alpha = 0.025$. Always match the CI level to the intended test level.
Non-Inferiority Testing
Definition: Non-Inferiority
Non-inferiority testing is a one-sided version of equivalence testing where we only test whether the new treatment is not unacceptably worse than the reference:
- $H_0$: $\delta \le -\Delta$ (new treatment is worse by at least the margin)
- $H_1$: $\delta > -\Delta$ (new treatment is non-inferior)
This requires historical evidence that the standard treatment has a known effect over placebo; the margin is then chosen to preserve a fraction (typically 50%) of that effect.
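A minimal sketch of the one-sided test and of a margin derived from a historical effect follows. It assumes that higher values are better, that the two groups share a common variance, and that a one-sided $\alpha = 0.025$ is wanted; the function names and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def ni_margin(historical_effect, fraction_preserved=0.5):
    """Margin that preserves the given fraction of the reference-vs-placebo effect."""
    return (1.0 - fraction_preserved) * historical_effect

def noninferiority_test(x_new, x_ref, margin, alpha=0.025):
    """One-sided pooled-variance t-test of H0: mu_new - mu_ref <= -margin."""
    x_new, x_ref = np.asarray(x_new, float), np.asarray(x_ref, float)
    n1, n2 = len(x_new), len(x_ref)
    df = n1 + n2 - 2
    diff = x_new.mean() - x_ref.mean()
    sp2 = ((n1 - 1) * x_new.var(ddof=1) + (n2 - 1) * x_ref.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    p_value = stats.t.sf((diff + margin) / se, df)
    # Lower one-sided confidence bound; non-inferiority means it exceeds -margin.
    lower_bound = diff - stats.t.ppf(1 - alpha, df) * se
    return {"diff": diff, "p_value": p_value, "lower_bound": lower_bound,
            "non_inferior": bool(p_value < alpha)}

# Hypothetical setting: the reference beat placebo by 10 units historically.
margin = ni_margin(historical_effect=10.0, fraction_preserved=0.5)
x_new = [48, 52, 50, 47, 53, 51, 49, 50]
x_ref = [50, 53, 51, 49, 54, 52, 50, 51]
result_ni = noninferiority_test(x_new, x_ref, margin)
print(margin, result_ni)
```

Only the lower bound matters here: the new treatment may be better than the reference by any amount without affecting the conclusion.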
Python Implementation
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- TOST for two independent groups ---
def tost_ind(x_treatment, x_reference, margin, alpha=0.05):
    """
    Two one-sided tests (TOST) for two independent groups.

    Parameters:
        x_treatment: array-like, treatment group data
        x_reference: array-like, reference group data
        margin: equivalence margin (Delta)
        alpha: significance level per test

    Returns:
        dict with test statistics, p-values, CI, and decision
    """
    n_t = len(x_treatment)
    n_r = len(x_reference)
    mean_t = np.mean(x_treatment)
    mean_r = np.mean(x_reference)
    diff = mean_t - mean_r

    # Pooled standard deviation
    var_t = np.var(x_treatment, ddof=1)
    var_r = np.var(x_reference, ddof=1)
    sp = np.sqrt(((n_t - 1) * var_t + (n_r - 1) * var_r) / (n_t + n_r - 2))
    se = sp * np.sqrt(1/n_t + 1/n_r)

    # TOST statistics
    df = n_t + n_r - 2
    t1 = (diff + margin) / se  # Test H01: diff <= -margin
    t2 = (diff - margin) / se  # Test H02: diff >= margin

    # P-values (one-sided)
    p1 = stats.t.sf(t1, df)   # P(T > t1) under H01
    p2 = stats.t.cdf(t2, df)  # P(T < t2) under H02

    # Combined p-value
    p_tost = max(p1, p2)

    # (1 - 2*alpha) confidence interval; this is the 90% CI when alpha = 0.05
    t_crit = stats.t.ppf(1 - alpha, df)
    ci_lower = diff - t_crit * se
    ci_upper = diff + t_crit * se

    # Decision: True when both one-sided nulls are rejected,
    # i.e. when equivalence is concluded
    reject = (p1 < alpha) and (p2 < alpha)

    return {
        'difference': diff,
        'se': se,
        't1': t1, 't2': t2,
        'p1': p1, 'p2': p2,
        'p_tost': p_tost,
        'ci_90': (ci_lower, ci_upper),
        'reject_equivalence': reject,
        'margin': margin,
        'df': df
    }

# Simulate bioequivalence study
np.random.seed(42)
# Log-transformed AUC values
n_subjects = 24
treatment = np.random.normal(6.8, 0.4, n_subjects) # ln(AUC_T)
reference = np.random.normal(6.85, 0.4, n_subjects) # ln(AUC_R)
delta = np.log(1.25) # ~0.223 for 80-125% criteria
result = tost_ind(treatment, reference, margin=delta)
print("=== TOST Bioequivalence Test ===")
print(f"Mean difference (log scale): {result['difference']:.4f}")
print(f"Ratio (T/R): {np.exp(result['difference'])*100:.1f}%")
print(f"90% CI for ratio: ({np.exp(result['ci_90'][0])*100:.1f}%, "
f"{np.exp(result['ci_90'][1])*100:.1f}%)")
print(f"T1 statistic: {result['t1']:.3f}, p1 = {result['p1']:.4f}")
print(f"T2 statistic: {result['t2']:.3f}, p2 = {result['p2']:.4f}")
print(f"Equivalence concluded: {result['reject_equivalence']}")
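
# --- Power curve for TOST ---
def tost_power(delta_true, margin, sigma, n, alpha=0.05):
    """Approximate power of TOST for a given true difference (n per group)."""
    se = sigma * np.sqrt(2 / n)
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha, df)

    # Non-centrality parameters
    lambda1 = (delta_true + margin) / se
    lambda2 = (delta_true - margin) / se

    # Probability that each one-sided test rejects on its own
    power1 = stats.nct.sf(t_crit, df, lambda1)
    power2 = stats.nct.cdf(-t_crit, df, lambda2)

    # Both tests must reject (intersection of the rejection regions).
    # The sum minus one goes negative when the margin is too narrow for
    # this sample size, so clip the result at zero.
    return max(0.0, power1 + power2 - 1)
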
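deltas = np.linspace(-0.3, 0.3, 100)
# sigma=0.4 matches the SD used in the simulated data above; with sigma=1.0
# and n=24 the margin is narrower than t_crit * se and the power is zero.
powers = [tost_power(d, delta, sigma=0.4, n=24) for d in deltas]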
plt.figure(figsize=(10, 6))
plt.plot(deltas, powers, 'b-', linewidth=2)
plt.axvline(x=-delta, color='red', linestyle='--', label=f'Δ = ±{delta:.3f}')
plt.axvline(x=delta, color='red', linestyle='--')
plt.axhline(y=0.8, color='gray', linestyle=':', label='80% power')
plt.xlabel('True Mean Difference (δ)')
plt.ylabel('Power')
plt.title('Power Curve for TOST Equivalence Test (n=24 per group)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('tost_power.png', dpi=150)
plt.show()
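
# --- Sample size calculation ---
def tost_sample_size(margin, sigma, delta=0, alpha=0.05, power=0.80):
    """Approximate sample size per group for TOST (normal approximation)."""
    if abs(delta) >= margin:
        raise ValueError("abs(delta) must be smaller than the margin")
    z_alpha = stats.norm.ppf(1 - alpha)
    # At delta = 0 the type II error is split across both one-sided tests,
    # so the power quantile uses beta/2; otherwise the nearer bound dominates.
    beta = 1 - power
    z_beta = stats.norm.ppf(1 - beta / 2) if delta == 0 else stats.norm.ppf(power)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / (margin - abs(delta))**2
    return int(np.ceil(n))
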
n_req = tost_sample_size(delta, sigma=1.0, power=0.80)
print(f"\nRequired sample size per group: {n_req}")
Key Takeaways
Summary: Equivalence Testing
- TOST demonstrates equivalence by showing the true difference lies within $(-\Delta, \Delta)$, not just by failing to find a difference.
- The equivalence margin is a clinical judgment, not a statistical artifact; it must be justified a priori.
- TOST is equivalent to a confidence interval check: the CI must fall within the equivalence bounds.
- Power is highest when δ = 0 and decreases as the true difference approaches the margin.
- Bioequivalence uses the 80–125% rule on log-transformed data, with scaled approaches for highly variable drugs.
- Non-inferiority testing is a one-sided variant requiring historical evidence of the reference treatment's effect.
- Always report effect sizes and confidence intervals alongside the TOST decision for full transparency.