Statistical Testing: Hypothesis, t-tests and Chi-square

Why This Matters

Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let data guide our conclusions.

The Hypothesis Testing Framework

Visual: Rejection Regions

Type I and Type II Errors

The p-value: What It Actually Means

Formal Definition of p-value

p\text{-value} = P(T \geq t_{\text{obs}} \mid H_0)

Here,

$T$ = test statistic, distributed according to the null distribution
$t_{\text{obs}}$ = observed test statistic from the sample
$H_0$ = the null hypothesis, assumed to be true
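
The definition above is a tail area of the null distribution, which can be sketched with SciPy's t distribution. The observed statistic and degrees of freedom here are hypothetical values chosen only for illustration. The formula as written is one-sided; the two-sided version counts both tails.

```python
from scipy import stats

# Hypothetical values, chosen only for illustration
t_obs = 2.1   # observed test statistic
df = 29       # degrees of freedom (n - 1 for a one-sample t-test with n = 30)

# One-sided p-value: P(T >= t_obs | H0), the upper-tail area under the null
p_one_sided = stats.t.sf(t_obs, df)

# Two-sided p-value: P(|T| >= |t_obs| | H0), counting both tails
p_two_sided = 2 * stats.t.sf(abs(t_obs), df)

print(f"one-sided p = {p_one_sided:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
```

Because the t distribution is symmetric, the two-sided p-value is exactly twice the one-sided one.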

❌ WRONG

"p = 0.03 means there's a 3% chance H₀ is true"

✅ RIGHT

"If H₀ were true, there's a 3% chance of seeing data this extreme"

❌ WRONG

"p < 0.05 means the effect is large"

✅ RIGHT

"p < 0.05 means data this extreme would be unlikely if H₀ were true" (A tiny effect can be significant with large n)

❌ WRONG

"p > 0.05 means no effect exists"

✅ RIGHT

"p > 0.05 means we don't have enough evidence to reject H₀"

Statistical vs practical significance: A p-value measures how surprised you should be by the data if H0 were true. It does NOT measure the size or importance of an effect. With n = 1,000,000 per group and a standard deviation of about 7 cm, a height difference of only 0.1 cm (Cohen's d ≈ 0.014) yields p < 0.001. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
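
As a rough sketch of this point, the snippet below feeds assumed summary statistics (two groups of one million people, a standard deviation of 7 cm, means 0.1 cm apart) into SciPy's `ttest_ind_from_stats`. The numbers are invented for illustration, not taken from real data.

```python
from scipy import stats

# Assumed summary statistics, for illustration only
n = 1_000_000                    # observations per group
sd = 7.0                         # standard deviation in cm, same in both groups
mean_a, mean_b = 170.1, 170.0    # group means differ by 0.1 cm

# Welch's t-test computed directly from the summary statistics
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_a, sd, n, mean_b, sd, n, equal_var=False
)

# Effect size: the difference expressed in standard-deviation units
cohens_d = (mean_a - mean_b) / sd

print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
print(f"Cohen's d = {cohens_d:.4f}")
```

The test is overwhelmingly significant, yet the effect size is far below even the conventional "small" threshold of 0.2.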

One-Sample t-test

One-Sample t-Statistic

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

Here,

$\bar{x}$ = sample mean
$\mu_0$ = hypothesized population mean
$s$ = sample standard deviation
$n$ = sample size
$s / \sqrt{n}$ = standard error of the mean
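
To connect the formula to the library call, here is a minimal sketch that computes the statistic by hand and compares it with `stats.ttest_1samp`. The eight height values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Small made-up sample of heights in cm (illustrative, not real data)
x = np.array([171.2, 168.5, 174.9, 169.3, 172.8, 175.1, 170.4, 173.6])
mu_0 = 170.0

n = x.size
x_bar = x.mean()
s = x.std(ddof=1)             # sample standard deviation (n - 1 denominator)
se = s / np.sqrt(n)           # standard error of the mean
t_manual = (x_bar - mu_0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Note the `ddof=1`: NumPy's default `std()` divides by n rather than n - 1 and would not reproduce the t-statistic.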

Effect Size: Cohen's d

Cohen's d Effect Size

d = \frac{\bar{x} - \mu_0}{s}

Here,

$\bar{x}$ = sample mean
$\mu_0$ = hypothesized population mean
$s$ = sample standard deviation

|d| Value	Interpretation
0.2	Small effect
0.5	Medium effect
0.8	Large effect

Complete Example

import numpy as np
from scipy import stats

# Simulate 30 heights (cm) from a normal population whose true mean is 172
np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)

print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std:  {heights.std(ddof=1):.2f} cm")

# H0: the population mean equals 170 cm (two-sided alternative)
mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)

print(f"\nH0: mu = {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Effect size: distance from mu_0 in units of the sample standard deviation
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d:   {cohens_d:.4f}")

# Compare the p-value with the chosen significance level
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")

Assumptions

Assumption	Description	How to Check
Independence	Observations are independent	Study design, random sampling
Normality	Data is approximately normal	Shapiro-Wilk test, Q-Q plot
Continuous	Dependent variable is continuous	Data type inspection

Note: Even if the underlying data are not normal, the sampling distribution of the mean approaches normality as n increases (Central Limit Theorem). As a rule of thumb, for n > 30 the t-test is robust to moderate departures from normality; heavily skewed or heavy-tailed data may need larger samples.
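
The normality checks listed in the table can be sketched as follows. The two samples are simulated (one roughly normal, one deliberately skewed), and `stats.probplot` is used here only for its fit correlation rather than to draw the Q-Q plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=172, scale=8, size=30)   # roughly normal data
skewed_sample = rng.exponential(scale=8, size=30)       # right-skewed data

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    # Shapiro-Wilk: H0 is that the sample comes from a normal distribution
    w_stat, p_value = stats.shapiro(sample)
    # probplot returns the Q-Q points and a least-squares fit;
    # a correlation r close to 1 means the points lie near a straight line
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    print(f"{name:>6}: W = {w_stat:.3f}, p = {p_value:.3f}, Q-Q r = {r:.3f}")
```

A small Shapiro-Wilk p-value is evidence against normality. A large one does not prove normality, especially with small samples, so it is worth looking at the Q-Q plot as well.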

Two-Sample t-test

Welch's t-test (Unequal Variances)

Welch's t-Statistic

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

Here,

$\bar{x}_1, \bar{x}_2$ = sample means of groups 1 and 2
$s_1, s_2$ = sample standard deviations
$n_1, n_2$ = sample sizes

Welch-Satterthwaite Degrees of Freedom

df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}

Here,

$s_1, s_2$ = sample standard deviations
$n_1, n_2$ = sample sizes
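
A minimal sketch of both formulas, checked against `stats.ttest_ind(..., equal_var=False)`. The two groups are simulated with deliberately different spreads and sizes; all values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_1 = rng.normal(loc=175, scale=7, size=30)    # simulated data
group_2 = rng.normal(loc=163, scale=12, size=20)   # different spread and size

n1, n2 = group_1.size, group_2.size
v1 = group_1.var(ddof=1) / n1    # s_1^2 / n_1
v2 = group_2.var(ddof=1) / n2    # s_2^2 / n_2

# Welch's t-statistic and Welch-Satterthwaite degrees of freedom
t_manual = (group_1.mean() - group_2.mean()) / np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(group_1, group_2, equal_var=False)
print(f"Welch df = {df:.2f} (between {min(n1, n2) - 1} and {n1 + n2 - 2})")
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3e}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3e}")
```

The degrees of freedom are generally not an integer. They always fall between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2.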

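# Simulated heights (cm) for two independent groups
np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

# equal_var=False selects Welch's t-test instead of Student's
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Cohen's d with a pooled standard deviation
# (a simple average of the two variances, valid because the groups are the same size)
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d:   {cohens_d:.4f}")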

Paired Samples

# Simulated before/after scores for the same 25 subjects
np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)

# Two-sided test: is the mean change different from zero?
t_stat, p_value = stats.ttest_rel(after, before)
# One-sided test: is the mean change greater than zero?
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')

print("Paired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_value_one:.4f}")
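
A paired t-test is equivalent to a one-sample t-test on the per-subject differences against zero, which the following sketch illustrates on simulated data. The paired effect size shown (mean difference divided by the standard deviation of the differences, often written d_z) is one common convention and is labeled as such.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(loc=65, scale=10, size=25)          # simulated scores
after = before + rng.normal(loc=5, scale=8, size=25)    # same subjects, later

# Per-subject change
diff = after - before

t_paired, p_paired = stats.ttest_rel(after, before)
t_diff, p_diff = stats.ttest_1samp(diff, 0.0)

# One common paired effect size (d_z): mean change / SD of the changes
d_z = diff.mean() / diff.std(ddof=1)

print(f"paired:      t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"differences: t = {t_diff:.4f}, p = {p_diff:.4f}")
print(f"d_z = {d_z:.4f}")
```

This is also why the normality assumption for a paired test applies to the differences, not to the raw before and after scores.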

Chi-Square Test

Chi-Square Test Statistic

\chi^2 = \sum \frac{(O - E)^2}{E}

Here,

$O$ = observed frequency (actual data)
$E$ = expected frequency (if the variables were independent)

import pandas as pd

contingency = pd.DataFrame({
    'Python': [50, 35],
    'Java': [30, 45],
    'R': [20, 20]
}, index=['Male', 'Female'])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value:             {p_value:.4f}")
print(f"Degrees of freedom:  {dof}")
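
To show where the expected frequencies come from, the sketch below rebuilds the statistic by hand for the same contingency table, using E = (row total × column total) / n for each cell, and compares the result with `chi2_contingency`.

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30, 20],    # Male:   Python, Java, R
                     [35, 45, 20]])   # Female: Python, Java, R

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Expected count in each cell if gender and language were independent
expected = row_totals @ col_totals / n

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print(expected)
print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4f}")
```

For this table the statistic is about 5.65 with 2 degrees of freedom, giving p ≈ 0.059, so the association is not significant at the 0.05 level. SciPy applies Yates' continuity correction only when there is 1 degree of freedom, so the manual and library results agree here.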

Effect Size: Cramér's V

Cramér's V Effect Size

V = \sqrt{\frac{\chi^2}{n \cdot (k - 1)}}

Here,

$\chi^2$ = chi-square statistic
$n$ = total sample size
$k$ = min(number of rows, number of columns)

V Value	Interpretation
0.1	Small association
0.3	Medium association
0.5	Large association
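
A short sketch of the formula applied to the gender-by-language table used in the chi-square example. The table is repeated here so the snippet runs on its own.

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30, 20],    # Male:   Python, Java, R
                     [35, 45, 20]])   # Female: Python, Java, R

# correction=False keeps the plain chi-square statistic from the formula
chi2, p_value, dof, _ = stats.chi2_contingency(observed, correction=False)

n = observed.sum()            # total sample size
k = min(observed.shape)       # min(number of rows, number of columns)

cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"chi2 = {chi2:.4f}, n = {n}, k = {k}")
print(f"Cramér's V = {cramers_v:.4f}")
```

Here V is about 0.17, which falls between the "small" and "medium" reference values in the table.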

ANOVA (Analysis of Variance)

Why Not Use Multiple t-tests?

One-Way ANOVA

F-Statistic for One-Way ANOVA

F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}

Here,

$MS_{\text{between}}$ = mean square between groups (signal)
$MS_{\text{within}}$ = mean square within groups (noise)
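
The two mean squares can be computed directly from sums of squares. The sketch below does so on three simulated groups and checks the result against `stats.f_oneway`. Eta squared is added as one common effect size for ANOVA; it is not part of the formula above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Three simulated groups of 30 scores each (illustrative data)
groups = [rng.normal(loc=m, scale=10, size=30) for m in (75, 80, 72)]

k = len(groups)                              # number of groups
n_total = sum(g.size for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sums of squares: group means around the grand mean, and data around group means
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)            # signal
ms_within = ss_within / (n_total - k)        # noise
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

# Eta squared: share of total variance explained by group membership
eta_squared = ss_between / (ss_between + ss_within)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
print(f"eta squared = {eta_squared:.4f}")
```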

np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Result: At least one method differs significantly")

Post-Hoc: Tukey's HSD

from statsmodels.stats.multicomp import pairwise_tukeyhsd

all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30

tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)

Non-Parametric Tests

Parametric	Non-Parametric	When to Use
One-sample t	Wilcoxon signed-rank	Small sample, non-normal
Independent t	Mann-Whitney U	Non-normal or ordinal data, two independent groups
Paired t	Wilcoxon signed-rank (paired)	Paired, non-normal differences
One-way ANOVA	Kruskal-Wallis	Non-normal, 3+ groups
Pearson r	Spearman rho	Non-linear monotonic relationship

# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")

# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
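
The remaining rows of the table can be sketched the same way. The data below are simulated to be skewed (for the Wilcoxon signed-rank test) and monotonic but non-linear (for Spearman's rho); both datasets are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Wilcoxon signed-rank (non-parametric paired t-test) on skewed paired data
before = rng.exponential(scale=10, size=25)
after = before + rng.normal(loc=2, scale=3, size=25)
w_stat, w_p = stats.wilcoxon(after, before)
print(f"Wilcoxon signed-rank: {w_stat:.4f}, p-value: {w_p:.4f}")

# Spearman rho vs Pearson r on a monotonic but non-linear relationship
x = rng.uniform(0, 5, size=40)
y = np.exp(x) + rng.normal(scale=1.0, size=40)
rho, rho_p = stats.spearmanr(x, y)
r, r_p = stats.pearsonr(x, y)
print(f"Spearman rho: {rho:.4f}, p-value: {rho_p:.4g}")
print(f"Pearson r:    {r:.4f}, p-value: {r_p:.4g}")
```

Spearman's rho works on ranks, so it measures how consistently y increases with x regardless of the shape of the curve, whereas Pearson's r measures only the linear component.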

Multiple Comparisons Problem

Solutions

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]

# Bonferroni correction
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected:", pvals_corrected)

# FDR (Benjamini-Hochberg)
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected:", pvals_fdr)

When to use which correction: Use Bonferroni when you have few comparisons (m < 10) and the cost of a false positive is high. Use FDR when you have many comparisons (m > 10) and you want to maximize discoveries (e.g., genomics, exploratory data analysis).
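
To make the trade-off concrete, the following sketch shows how the chance of at least one false positive grows with the number of independent tests, then reproduces both adjustments by hand for the seven p-values used above. It is a simplified illustration; `multipletests` remains the practical choice.

```python
import numpy as np

alpha = 0.05

# Familywise error rate for m independent tests, each run at level alpha
fwer = {m: 1 - (1 - alpha) ** m for m in (1, 3, 10, 20)}
for m, rate in fwer.items():
    print(f"m = {m:>2}: P(at least one false positive) = {rate:.3f}")

p_values = np.array([0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06])
m = p_values.size

# Bonferroni: multiply each p-value by m and cap at 1
p_bonf = np.minimum(p_values * m, 1.0)

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i,
# then enforce monotonicity working down from the largest p-value
order = np.argsort(p_values)
scaled = p_values[order] * m / np.arange(1, m + 1)
scaled = np.minimum.accumulate(scaled[::-1])[::-1]
p_bh = np.empty(m)
p_bh[order] = np.minimum(scaled, 1.0)

print("Bonferroni:", np.round(p_bonf, 3))
print("BH (FDR):  ", np.round(p_bh, 3))
```

With these seven p-values, neither method leaves any adjusted value below 0.05: the smallest adjusted p-value is 0.07 under both. Three of the raw p-values looked significant on their own, which is exactly the situation the corrections guard against.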

Power Analysis

Statistical Power

\text{Power} = 1 - \beta = f(\text{effect size}, \alpha, n)

Here,

$\beta$ = probability of a Type II error
$\alpha$ = significance level (typically 0.05)
$n$ = sample size per group

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Required n per group: {int(np.ceil(n))}")

# What power do we have with n=50 per group?
power = power_analysis.power(effect_size=0.5, alpha=0.05, nobs1=50, ratio=1.0)
print(f"Power with n=50: {power:.2f}")
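
Power can also be estimated by simulation, which is a useful sanity check on the analytic result. This sketch assumes normally distributed data with unit variance, a true effect of d = 0.5, and 64 subjects per group (roughly the sample size the standard calculation gives for 80% power). The number of simulated experiments is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

effect_size = 0.5     # true Cohen's d (assumed)
n_per_group = 64      # subjects per group (assumed)
alpha = 0.05
n_sims = 5000         # number of simulated experiments

# Each row is one simulated experiment
control = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_per_group))
treated = rng.normal(loc=effect_size, scale=1.0, size=(n_sims, n_per_group))

# Run one two-sample t-test per row
_, p_values = stats.ttest_ind(treated, control, axis=1)

# Power = share of experiments that correctly reject H0
power_estimate = np.mean(p_values < alpha)
print(f"Estimated power: {power_estimate:.3f}")
```

The estimate should land close to 0.80, with some Monte Carlo noise. The same approach extends to designs for which no closed-form power formula is available.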

Quick Reference: Which Test to Use

Key Takeaways

Summary: Statistical Testing Deep Dive

Always state H0 and H1 before testing. Hypothesis testing is a structured framework, not a fishing expedition.
p-value is NOT the probability H0 is true. It is the probability of seeing data this extreme if H0 were true.
Statistical significance ≠ practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
Use Welch's t-test by default (safer than Student's). It does not assume equal variances.
Correct for multiple comparisons using Bonferroni (conservative; controls the familywise error rate) or FDR (less conservative; controls the expected share of false discoveries).
Always check assumptions (normality, independence, homoscedasticity). Use non-parametric tests when assumptions fail.
Use power analysis to determine sample size before collecting data.

Practice Exercises

Drug Trial: Blood pressure in 40 patients after a new drug. Historical mean 120 mmHg. Sample mean 115, std=12. Is the drug effective?
A/B Test: Website A conversion 12.3% (n=5000), Website B 13.1% (n=5000). Is B significantly better?
Survey Analysis: Association between education level and preferred news source (n=200). Chi-square test?
Experiment Design: Detect medium effect (d=0.5) with 90% power. How many subjects per group?
