Hypothesis Testing

Why It Matters

Hypothesis testing is the backbone of scientific discovery and data-driven decision making. Whether validating a clinical trial, tuning a machine learning model, or measuring the impact of a new feature, hypothesis testing provides the formal framework to distinguish real effects from random noise. Without it, every observed difference — no matter how small or how likely to occur by chance — could be mistaken for a meaningful finding.

Overview

Every hypothesis test begins by formulating two competing statements about a population parameter. The null hypothesis ($H_0$) is the default assumption of no effect. The alternative hypothesis ($H_1$) is the claim that an effect exists. A test statistic measures how far the observed data deviate from what $H_0$ predicts. The p-value is the probability of seeing results at least as extreme as those observed if $H_0$ is true. We reject $H_0$ when the p-value is at or below the significance level $\alpha$. Two types of errors are possible: Type I (false positive, probability $\alpha$) and Type II (false negative, probability $\beta$). Power ($1 - \beta$) is the probability of detecting a real effect; it increases with effect size, sample size, and $\alpha$.

Key Concepts

Test Statistic (Z-Test)

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Here,

$\bar{x}$ = Sample mean
$\mu_0$ = Hypothesized population mean
$\sigma$ = Population standard deviation
$n$ = Sample size
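
As a minimal sketch, the z statistic can be computed directly from the four quantities defined above. The function name `z_statistic` and the example numbers are illustrative assumptions, not part of the lesson.

```python
from math import sqrt


def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Distance of the sample mean from mu0, measured in standard errors."""
    standard_error = sigma / sqrt(n)  # sigma is the known population SD
    return (sample_mean - mu0) / standard_error


# Hypothetical numbers chosen only for illustration.
z = z_statistic(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(z)  # 2.0
```

The z-test assumes the population standard deviation is known; when only the sample standard deviation is available, a t-test is the usual choice.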

P-Value

$$p = P(\text{data or more extreme} \mid H_0 \text{ is true})$$

Here,

$p$ = Probability of observing results at least this extreme, assuming $H_0$ is true
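
For a test statistic that is standard normal under $H_0$, the p-value is a tail area of the normal distribution. The sketch below uses only the Python standard library. The function name `p_value_from_z` and the `alternative` labels are assumptions made for this example.

```python
from statistics import NormalDist


def p_value_from_z(z: float, alternative: str = "two-sided") -> float:
    """Tail probability of a standard normal test statistic under H0."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    if alternative == "greater":
        return 1 - std_normal.cdf(z)  # upper tail only
    if alternative == "less":
        return std_normal.cdf(z)  # lower tail only
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


p = p_value_from_z(1.96)
print(round(p, 3))  # 0.05
```

"More extreme" depends on the alternative hypothesis: a two-sided test counts both tails, while a one-sided test counts only the tail in the predicted direction.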

Power of a Test

$$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_0 \text{ is false})$$

Here,

$\beta$ = Type II error probability
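
To make the definition concrete, here is a hedged sketch of power under a normal approximation. It assumes a two-sided test comparing two independent groups of equal size, with the effect expressed as Cohen's d. The function name `power_two_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold for |z|
    shift = d * sqrt(n_per_group / 2)           # expected z when the true effect is d
    upper = 1 - std_normal.cdf(z_crit - shift)  # P(z > z_crit | true effect)
    lower = std_normal.cdf(-z_crit - shift)     # P(z < -z_crit | true effect)
    return upper + lower


print(round(power_two_sample_z(d=0.5, n_per_group=63), 2))  # 0.8
```

With `d = 0` the function returns `alpha`, which matches the definitions: when there is no real effect, every rejection is a Type I error.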

Sample Size for Power

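$$n = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$

Here,

$n$ = Required sample size per group when comparing two independent groups (drop the factor of 2 for a one-sample or paired design)
$d$ = Cohen's d (effect size)
$z_{1-\alpha/2}$ = Critical value for a two-tailed test at significance level $\alpha$
$z_{1-\beta}$ = Standard normal quantile for the desired power $1 - \beta$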
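
The normal-approximation sample-size calculation can be sketched in Python for both common designs: the one-sample (or paired) form, $\left((z_{1-\alpha/2} + z_{1-\beta})/d\right)^2$, and the per-group form for two independent groups, which is twice that. The function name `required_n` and the `two_groups` flag are assumptions for this example.

```python
from math import ceil
from statistics import NormalDist


def required_n(d: float, alpha: float = 0.05, power: float = 0.80,
               two_groups: bool = True) -> int:
    """Sample size (per group if two_groups) under the normal approximation."""
    if d == 0:
        raise ValueError("effect size d must be non-zero")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    base = ((z_alpha + z_beta) / abs(d)) ** 2    # one-sample / paired form
    return ceil(2 * base) if two_groups else ceil(base)


print(required_n(0.5))                    # 63 per group
print(required_n(0.5, two_groups=False))  # 32 in total
```

Results are rounded up because a fractional participant is not possible. Exact t-based calculations typically add one or two participants to these figures.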

Cohen's d (Effect Size)

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Here,

$\bar{x}_1, \bar{x}_2$ = Sample means of the two groups
$s_p$ = Pooled standard deviation
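
A minimal sketch of Cohen's d for two independent samples follows. It assumes the usual pooled standard deviation, in which each group's sample variance is weighted by its degrees of freedom. The function name `cohens_d` and the data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1: list[float], group2: list[float]) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical data: means 4 and 3, both sample variances equal to 4.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))  # 0.5
```

Because d is expressed in standard-deviation units, it can be compared across studies that measure outcomes on different scales.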

Error Matrix

	$H_0$ is True	$H_0$ is False
Reject $H_0$	Type I Error ( $\alpha$ ) — false positive	Power ( $1-\beta$ ) — true positive
Fail to Reject $H_0$	Correct — true negative	Type II Error ( $\beta$ ) — false negative
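
The top-left cell of the matrix can be checked by simulation: when $H_0$ is true, a test run at level $\alpha$ should reject in roughly a fraction $\alpha$ of repeated experiments. The sketch below assumes a two-sided z-test on standard normal data with known $\sigma = 1$. The function name, trial count, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_type_i_rate(n: int = 30, alpha: float = 0.05,
                         trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated experiments that reject a true H0 (mu = 0)."""
    rng = random.Random(seed)  # local generator keeps the run reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 is true here
        z = (sum(sample) / n) / (1.0 / sqrt(n))           # z statistic with sigma = 1
        if abs(z) > z_crit:
            rejections += 1                               # a false positive
    return rejections / trials


rate = simulate_type_i_rate()
print(f"Observed Type I error rate: {rate:.3f}")  # should be close to alpha
```

The observed rate fluctuates around 0.05 from run to run; increasing `trials` narrows that fluctuation.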

Effect Size Benchmarks (Cohen's d)

Effect	Cohen's d	Interpretation
Small	0.2	Subtle, hard to detect
Medium	0.5	Noticeable practical effect
Large	0.8	Strong, clearly visible

P-Value Interpretation

P-Value	Evidence Against $H_0$
$p < 0.01$	Very strong
$0.01 \leq p < 0.05$	Strong
$0.05 \leq p < 0.10$	Weak
$p \geq 0.10$	Little or none

Quick Example

One-Sample T-Test

A researcher claims the average response time is 200 ms, so we test $H_0: \mu = 200$ against $H_1: \mu \neq 200$ at $\alpha = 0.05$. Sample: $n = 25$, $\bar{x} = 215$ ms, $s = 30$ ms.

$$t = \frac{215 - 200}{30/\sqrt{25}} = \frac{15}{6} = 2.5$$

With $df = n - 1 = 24$, the two-tailed critical value is $t_{0.025, 24} = 2.064$. Since $|t| = 2.5 > 2.064$, reject $H_0$.

There is sufficient evidence that the mean response time differs from 200 ms.
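
The same calculation can be reproduced from the summary statistics. This sketch uses only the standard library, so the critical value 2.064 is taken from the text rather than computed. The function name `one_sample_t` is an assumption.

```python
from math import sqrt


def one_sample_t(sample_mean: float, mu0: float, s: float, n: int) -> float:
    """t statistic for a one-sample test using the sample standard deviation."""
    return (sample_mean - mu0) / (s / sqrt(n))


T_CRIT = 2.064  # two-tailed critical value for df = 24, alpha = 0.05 (from the text)

t = one_sample_t(sample_mean=215.0, mu0=200.0, s=30.0, n=25)
reject_h0 = abs(t) > T_CRIT
print(t, reject_h0)  # 2.5 True
```

With raw observations rather than summary statistics, a library routine such as `scipy.stats.ttest_1samp` would return both the t statistic and the p-value, assuming SciPy is available.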

Power Analysis

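To detect a medium effect ($d = 0.5$) between two independent groups at $\alpha = 0.05$ (two-tailed) with power = 0.80:

$$n = 2\left(\frac{1.96 + 0.842}{0.5}\right)^2 = 2\left(\frac{2.802}{0.5}\right)^2 = 2 \times 31.4 \approx 62.8$$

Rounding up gives 63 participants per group under the normal approximation; the factor of 2 accounts for comparing two independent groups. Exact calculations based on the t-distribution give the commonly quoted 64 per group.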

Key Takeaways

Summary: Hypothesis Testing

Decision Rule: Reject $H_0$ if the p-value ≤ $\alpha$. Never say "accept $H_0$"; say "fail to reject."
p-value: Probability of results at least this extreme given that $H_0$ is true. NOT the probability that $H_0$ is true.
Type I vs Type II: Type I = false positive ($\alpha$); Type II = false negative ($\beta$). Reducing one increases the other for a fixed $n$ and effect size.
Power: Increases with effect size, sample size, and $\alpha$. Always conduct a power analysis before collecting data.
Effect Size: A tiny effect can be "significant" with a large $n$. Always report Cohen's d alongside p-values.
One vs Two Tailed: Use two-tailed as the default. One-tailed requires a strong a priori directional prediction.
Multiple Comparisons: Running many tests inflates the family-wise error rate. Use a Bonferroni, Holm, or FDR correction.
Statistical vs Practical: Statistical significance ≠ practical significance. Always consider effect size and context.
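
The multiple-comparisons takeaway can be illustrated with two of the corrections it names. The sketch below implements Bonferroni and Holm, both of which control the family-wise error rate. The function names and the p-values are hypothetical.

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: test p-values in ascending order with loosening thresholds."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


p_vals = [0.001, 0.013, 0.04, 0.30]  # hypothetical results from four tests
print(bonferroni(p_vals))  # [True, False, False, False]
print(holm(p_vals))        # [True, True, False, False]
```

Holm never rejects fewer hypotheses than Bonferroni at the same $\alpha$, which is why it is often preferred when the family-wise error rate is the target.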

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Hypothesis Formulation

Null and Alternative Hypothesis — How to formulate $H_0$ and $H_1$ , one-sided vs. two-sided, and common patterns

Errors and Significance

Type I and Type II Errors — Error matrix, trade-off, and real-world consequences
P-Values — Calculation, interpretation, and common misinterpretations
Significance Levels — Choosing $\alpha$ , multiple testing, and when to use 0.01 vs 0.05

Power and Effect Size

Power of a Test — Factors affecting power, a priori power analysis, and underpowered studies
Effect Size — Cohen's d, Hedges' g, eta-squared, and why practical significance matters
