
A/B Testing - The Scientific Way to Compare Models

Learn how to rigorously compare model versions using statistical methods and experimental design.

Statistical significance - ensure results are not due to chance
Experimental design - control variables and measure impact
Online vs offline - when to use each testing approach

In God we trust; all others bring data.

A/B Testing for ML — Complete Guide

A/B testing compares two versions of a model to determine which performs better. It is an essential step in validating ML models against live traffic.

A/B Testing Framework

Definition: A/B Testing

A/B testing is a statistical method for comparing two versions to determine which performs better. Users are randomly assigned to control (A) or treatment (B) groups, and outcomes are measured to determine if differences are statistically significant.

Hypothesis:
- H₀: No difference between A and B
- H₁: B is better than A
Randomization:
- Split users into control (A) and treatment (B)
Metrics:
- Primary: Click-through rate, conversion
- Secondary: Revenue, engagement
Sample size:
- Power analysis determines needed samples
Analysis:
- Statistical test -> p-value -> Decision
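
The randomization step above can be sketched as a deterministic, hash-based assignment. This is a minimal illustration, not a prescribed implementation: the function name `assign_variant`, the experiment name `ranker-v2`, and the user ID format are all hypothetical. Hashing the user ID together with the experiment name keeps each user in the same group across sessions and gives different experiments independent splits.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Salt the hash with the experiment name so that the same user can land
    # in different groups in different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"


# Hypothetical usage: split 10,000 made-up user IDs and count each group.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "ranker-v2")] += 1
print(counts)
```

Because the assignment depends only on the inputs, no lookup table is needed to keep a returning user in the same group.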

[Figure: A/B testing framework diagram]

Sample Size Calculation

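from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()
# With nobs1 left unset, solve_power returns the required size of group 1.
sample_size = analysis.solve_power(
    effect_size=0.05,       # Minimum detectable effect, standardized
                            # (e.g. Cohen's h), not a raw rate difference
    alpha=0.05,             # Significance level
    power=0.80,             # Statistical power
    ratio=1.0,              # Equal-sized control and treatment groups
    alternative='larger'    # One-sided test: B is better than A
)
print(f"Required users per group: {sample_size:.0f}")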

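n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot 2\sigma^2}{\delta^2}

Here n is the sample size per group, \sigma is the standard deviation of the metric, \delta is the minimum detectable difference between groups, Z_{\alpha/2} is the standard normal critical value for a two-sided test at level \alpha, and Z_{\beta} is the standard normal quantile corresponding to power 1-\beta. For a one-sided alternative (B is better than A), replace Z_{\alpha/2} with Z_{\alpha}.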
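
As a rough worked example of the formula, the sketch below evaluates it with the Python standard library only. The helper name `samples_per_group` and the input numbers (a 10% baseline conversion rate and a 1 percentage point minimum detectable difference) are assumptions chosen for illustration. For a conversion metric, \sigma^2 is approximated here by p(1-p) at the baseline rate.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(sigma: float, delta: float, alpha: float = 0.05,
                      power: float = 0.80, two_sided: bool = True) -> int:
    """Per-group sample size from n = (Z_alpha + Z_beta)^2 * 2*sigma^2 / delta^2."""
    z = NormalDist()  # standard normal distribution
    # Critical value for the significance level (alpha/2 when two-sided).
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    # Quantile corresponding to the desired power 1 - beta.
    z_beta = z.inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
    return ceil(n)  # round up: a fractional user is not possible


# Hypothetical inputs: 10% baseline conversion, detect a 1-point absolute change.
baseline = 0.10
sigma = (baseline * (1 - baseline)) ** 0.5
print("two-sided:", samples_per_group(sigma, delta=0.01))
print("one-sided:", samples_per_group(sigma, delta=0.01, two_sided=False))
```

With these inputs the two-sided requirement comes out at roughly fourteen thousand users per group. Halving \delta quadruples n, which is why small effects are expensive to detect.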

[Figure: Sample size vs effect size]

Statistical Significance

Definition: Hypothesis Testing for A/B

Null Hypothesis (H₀): No difference between variants
Alternative (H₁): Treatment is better than control
p-value: Probability of observing a result at least as extreme as the one measured, assuming H₀ is true
Significance level (α): Threshold for rejecting H₀ (typically 0.05)
Power (1-β): Probability of detecting a true effect (typically 0.80)
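
To make these definitions concrete, here is a minimal sketch of a one-sided two-proportion z-test for a conversion metric, written with the standard library only. The function name `two_proportion_ztest` and the conversion counts are hypothetical. The test uses a normal approximation, which is reasonable only when each group has many conversions and non-conversions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """One-sided test of H1: rate_B > rate_A. Returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both groups share one rate, so pool the data to estimate it.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # p-value: probability of a z at least this large if H0 were true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts: 1,000 of 10,000 convert in A; 1,100 of 10,000 in B.
z, p_value = two_proportion_ztest(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Rejecting H₀ when the p-value falls below α caps the false positive rate at α, provided the test is analyzed once at the planned sample size.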

[Figure: Significance testing decision flow]

Key Takeaways

Summary: A/B Testing

A/B testing validates model improvements in production
Random assignment removes selection bias between groups
Sample size calculation prevents underpowered tests
Statistical significance ≠ practical significance
Multi-armed bandits adapt during the test
Online ML continuously optimizes
Guardrail metrics prevent harm
Longer tests capture temporal effects such as weekly cycles and novelty effects
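
Statistical and practical significance are separate questions, and one way to check both is to compare a confidence interval for the lift against a minimum lift worth shipping. The sketch below is illustrative only: the helper names, the verdict labels, the counts, and the 0.5 percentage point threshold are assumptions, not values from this guide.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation interval for the absolute lift rate_B - rate_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def practical_verdict(ci, min_lift):
    """Compare the interval with the smallest lift worth acting on."""
    low, high = ci
    if low >= min_lift:
        return "practically significant"
    if high < min_lift:
        return "not practically significant"
    return "inconclusive"


# Hypothetical large test: the lift is clearly above zero, but tiny.
ci = lift_confidence_interval(1_000_000, 10_000_000, 1_005_000, 10_000_000)
print(f"95% CI for lift: ({ci[0]:.5f}, {ci[1]:.5f})")
print("excludes zero:", ci[0] > 0)
print("verdict:", practical_verdict(ci, min_lift=0.005))
```

In this made-up case the interval excludes zero, so the result is statistically significant, yet the whole interval sits below the assumed 0.5 point threshold, so the change may not justify the cost of shipping.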

