
Statistical Testing: Hypothesis, t-tests and Chi-square

Module 4: Statistics and Probability

Why This Matters

Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real — we quantify the evidence and let data guide our conclusions.

The Hypothesis Testing Framework

  1. State H₀ and H₁
  2. Collect data
  3. Compute the test statistic
  4. Compute the p-value
  5. If p < α, reject H₀; otherwise, fail to reject H₀

Visual: Rejection Regions

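Figure: Rejection regions for a two-tailed test at α = 0.05. The curve is the distribution of the test statistic under H₀ (null is true), centered at z = 0. Reject H₀ when z < −1.96 or z > +1.96 (each tail has area α/2 = 0.025); fail to reject H₀ in between.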
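The ±1.96 cutoffs in the figure come from the standard normal quantile function. A minimal sketch, assuming a two-tailed z-test; the helper name `in_rejection_region` is illustrative and not part of the lesson:

```python
from scipy import stats

alpha = 0.05

# Two-tailed test: split alpha evenly between the two tails
z_crit = stats.norm.ppf(1 - alpha / 2)

def in_rejection_region(z, z_crit=z_crit):
    """Return True when the observed z falls in either tail."""
    return abs(z) > z_crit

print(f"Critical values: ±{z_crit:.2f}")
```

For a one-tailed test the whole of α sits in a single tail, so the cutoff would be `stats.norm.ppf(1 - alpha)` instead.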

Type I and Type II Errors

| Decision | Reality: H₀ True (No Effect) | Reality: H₀ False (Effect Exists) |
|----------|------------------------------|-----------------------------------|
| Reject H₀ | TYPE I ERROR: False Positive (α) | CORRECT ✓ True Positive (Power) |
| Fail to Reject H₀ | CORRECT ✓ True Negative | TYPE II ERROR: False Negative (β) |
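Both error rates can be estimated by simulation. The sketch below assumes normally distributed data, n = 30, and a hypothetical true effect of 0.5 standard deviations for the "H₀ false" case; none of these numbers come from the lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_sims, n = 2000, 30

# H0 is true by construction: every sample has true mean 0.
# Any rejection here is a Type I error.
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
_, p_null = stats.ttest_1samp(null_samples, popmean=0.0, axis=1)
type_i_rate = np.mean(p_null < alpha)

# H0 is false: the true mean is 0.5 (hypothetical effect).
# Any failure to reject here is a Type II error.
alt_samples = rng.normal(loc=0.5, scale=1.0, size=(n_sims, n))
_, p_alt = stats.ttest_1samp(alt_samples, popmean=0.0, axis=1)
type_ii_rate = np.mean(p_alt >= alpha)

print(f"Estimated Type I rate:  {type_i_rate:.3f} (should be near alpha)")
print(f"Estimated Type II rate: {type_ii_rate:.3f}")
print(f"Estimated power:        {1 - type_ii_rate:.3f}")
```

The Type I rate is pinned near α by design, while the Type II rate depends on the effect size and sample size chosen for the simulation.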

The p-value: What It Actually Means

Formal Definition of p-value

$$p\text{-value} = P(T \geq t_{\text{obs}} \mid H_0)$$

Here,

  • $T$ = test statistic under the null distribution
  • $t_{\text{obs}}$ = observed test statistic from the sample
  • $H_0$ = null hypothesis is true

This is the upper-tailed form. For a two-tailed test, use $P(|T| \geq |t_{\text{obs}}| \mid H_0)$.
❌ WRONG
"p = 0.03 means there's a 3% chance H₀ is true"
✅ RIGHT
"If H₀ were true, there's a 3% chance of seeing data this extreme"
❌ WRONG
"p < 0.05 means the effect is large"
✅ RIGHT
"p < 0.05 means data this extreme would be unlikely if H₀ were true" (A tiny effect can be significant with large n)
❌ WRONG
"p > 0.05 means no effect exists"
✅ RIGHT
"p > 0.05 means we don't have enough evidence to reject H₀"

Statistical vs practical significance: A p-value measures how surprised you should be if H0 were true. It does NOT measure the size or importance of an effect. With n = 1,000,000, a 0.01 cm height difference can yield p < 0.001. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
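To see how a negligible effect becomes "significant" at large n, the sketch below computes a one-sample t-test from summary statistics. The sample standard deviation of 2 cm is a hypothetical value chosen for illustration; with a larger spread the same 0.01 cm difference would need an even larger n.

```python
import numpy as np
from scipy import stats

n = 1_000_000
mean_diff = 0.01   # cm, observed mean minus hypothesized mean
s = 2.0            # cm, hypothetical sample standard deviation

# One-sample t-statistic and two-tailed p-value from summary statistics
t_stat = mean_diff / (s / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Effect size: the difference expressed in standard deviations
cohens_d = mean_diff / s

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, Cohen's d = {cohens_d:.4f}")
```

The p-value is far below 0.001 while Cohen's d is 0.005, well under the 0.2 threshold for even a small effect.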

One-Sample t-test

One-Sample t-Statistic

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Here,

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation
  • $n$ = sample size
  • $s / \sqrt{n}$ = standard error of the mean
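The formula can be checked by hand against SciPy. This is a minimal sketch on illustrative simulated data (the seed and sample are assumptions), using the two-tailed p-value with n − 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=172, scale=8, size=30)  # illustrative data
mu_0 = 170

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)

# t = (x_bar - mu_0) / (s / sqrt(n))
t_manual = (sample.mean() - mu_0) / standard_error

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```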

Effect Size: Cohen's d

Cohen's d Effect Size

$$d = \frac{\bar{x} - \mu_0}{s}$$

Here,

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation

| \|d\| Value | Interpretation |
|-------------|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

Complete Example

```python
import numpy as np
from scipy import stats

# Simulate 30 heights from a population with true mean 172 cm
np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)

print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std:  {heights.std(ddof=1):.2f} cm")

# Two-tailed one-sample t-test against the hypothesized mean
mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)

print(f"\nH0: mu = {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Effect size: mean difference in units of the sample standard deviation
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d:   {cohens_d:.4f}")

# Decision at the chosen significance level
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")
```

Assumptions

| Assumption | Description | How to Check |
|------------|-------------|--------------|
| Independence | Observations are independent | Study design, random sampling |
| Normality | Data is approximately normal | Shapiro-Wilk test, Q-Q plot |
| Continuous | Dependent variable is continuous | Data type inspection |

Note: The Central Limit Theorem saves you. Even if the underlying data is not normal, the sampling distribution of the mean approaches normality as n increases. As a rule of thumb, for n > 30 the t-test is robust to mild-to-moderate departures from normality; strongly skewed or heavy-tailed data may need larger samples.
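A quick simulation illustrates the effect. The sketch below assumes exponentially distributed data (strongly right-skewed) and compares the skewness of raw observations with the skewness of means of samples of size 30; the distribution and the sizes are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Raw observations from a skewed (exponential) population
raw = rng.exponential(scale=1.0, size=100_000)

# 5000 sample means, each computed from n = 30 observations
sample_means = rng.exponential(scale=1.0, size=(5000, 30)).mean(axis=1)

skew_raw = stats.skew(raw)
skew_means = stats.skew(sample_means)

print(f"Skewness of raw data:     {skew_raw:.2f}")
print(f"Skewness of sample means: {skew_means:.2f}")
```

The means are much closer to symmetric than the raw data, though some skew remains at n = 30, which is why the n > 30 guideline is only a rule of thumb.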

Two-Sample t-test

Welch's t-test (Unequal Variances)

Welch's t-Statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Here,

  • $\bar{x}_1, \bar{x}_2$ = sample means of groups 1 and 2
  • $s_1, s_2$ = sample standard deviations
  • $n_1, n_2$ = sample sizes

Welch-Satterthwaite Degrees of Freedom

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Here,

  • $s_1, s_2$ = sample standard deviations
  • $n_1, n_2$ = sample sizes
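Both formulas can be verified against SciPy's Welch test. A minimal sketch on illustrative simulated groups of unequal size (the data, seed and variable names are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(loc=175, scale=7, size=30)  # illustrative group 1
g2 = rng.normal(loc=163, scale=6, size=25)  # illustrative group 2

# Per-group squared standard errors: s_i^2 / n_i
v1 = g1.var(ddof=1) / g1.size
v2 = g2.var(ddof=1) / g2.size

# Welch's t-statistic
t_manual = (g1.mean() - g2.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df = (v1 + v2) ** 2 / (v1 ** 2 / (g1.size - 1) + v2 ** 2 / (g2.size - 1))

# Two-tailed p-value
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(g1, g2, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {df:.2f}, p = {p_manual:.3e}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3e}")
```

The degrees of freedom always fall between min(n₁, n₂) − 1 and n₁ + n₂ − 2, the value Student's pooled test would use.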
```python
import numpy as np
from scipy import stats

np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

# equal_var=False selects Welch's t-test instead of Student's pooled test
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Cohen's d using the root mean of the two variances (valid here because n1 == n2)
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d:   {cohens_d:.4f}")
```

Paired Samples

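```python
import numpy as np
from scipy import stats

np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
# Each "after" score is the same subject's "before" score plus a noisy gain
after = before + np.random.normal(loc=5, scale=8, size=25)

# Two-tailed test: is the mean difference nonzero?
t_stat, p_value = stats.ttest_rel(after, before)
# One-tailed test: is "after" greater than "before"?
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')

print("Paired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_value_one:.4f}")
```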

Chi-Square Test

Chi-Square Test Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here,

  • $O$ = observed frequency (actual data)
  • $E$ = expected frequency (if variables were independent)
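Under independence, each expected count is (row total × column total) / n. The sketch below works the statistic out by hand, using the same counts as this lesson's language-preference table (rows Male and Female, columns Python, Java and R); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30, 20],   # Male:   Python, Java, R
                     [35, 45, 20]])  # Female: Python, Java, R

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts if row and column variables were independent
expected = row_totals * col_totals / n

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

print(expected)
print(f"chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
```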
```python
import pandas as pd
from scipy import stats

# Observed counts: preferred language by gender
contingency = pd.DataFrame({
    'Python': [50, 35],
    'Java': [30, 45],
    'R': [20, 20]
}, index=['Male', 'Female'])

# Test of independence between the row and column variables
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value:             {p_value:.4f}")
print(f"Degrees of freedom:  {dof}")
```

Effect Size: Cramér's V

Cramér's V Effect Size

$$V = \sqrt{\frac{\chi^2}{n \cdot (k - 1)}}$$

Here,

  • $\chi^2$ = chi-square statistic
  • $n$ = total sample size
  • $k$ = min(number of rows, number of columns)

| V Value | Interpretation |
|---------|----------------|
| 0.1 | Small association |
| 0.3 | Medium association |
| 0.5 | Large association |
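SciPy's `chi2_contingency` does not return Cramér's V, so it is computed from the chi-square statistic. A minimal sketch using the same counts as the lesson's language-preference table:

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30, 20],
                     [35, 45, 20]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()        # total sample size
k = min(observed.shape)   # min(number of rows, number of columns)

cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.3f}")
```

For this table V is about 0.17, between the "small" and "medium" thresholds.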

ANOVA (Analysis of Variance)

Why Not Use Multiple t-tests?

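  • Multiple t-tests: comparing three groups takes three tests (A vs B, A vs C, B vs C). At α = 0.05 per test, P(≥1 false positive) = 1 − 0.95³ ≈ 14.3%.
  • ANOVA: a single test asks whether ANY groups differ; a post-hoc test then identifies WHICH ones. The family-wise error rate (FWER) stays at α = 0.05.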
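The 14.3% figure follows from 1 − (1 − α)^m with m = 3 tests. The sketch below shows how quickly this grows with the number of groups; it treats the tests as independent, which is a simplifying assumption for pairwise comparisons that share data, and the function name is illustrative.

```python
from math import comb

def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests."""
    return 1 - (1 - alpha) ** m

for n_groups in (3, 4, 5, 10):
    m = comb(n_groups, 2)  # number of pairwise comparisons
    print(f"{n_groups} groups -> {m:2d} tests -> "
          f"FWER = {familywise_error_rate(m):.1%}")
```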

One-Way ANOVA

F-Statistic for One-Way ANOVA

$$F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$

Here,

  • $MS_{\text{between}}$ = mean square between groups (signal)
  • $MS_{\text{within}}$ = mean square within groups (noise)
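The two mean squares are sums of squares divided by their degrees of freedom: k − 1 between groups and N − k within groups. A minimal sketch that computes F by hand and compares it with `stats.f_oneway`, on illustrative simulated groups (the data and seed are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(loc=mu, scale=10, size=30) for mu in (75, 80, 72)]

k = len(groups)                        # number of groups
n_total = sum(g.size for g in groups)  # total observations N
grand_mean = np.concatenate(groups).mean()

# Signal: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```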
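```python
import numpy as np
from scipy import stats

# Simulated test scores under three teaching methods
np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

# One-way ANOVA: H0 is that all group means are equal
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Result: At least one method differs significantly")
else:
    print("Result: No significant difference detected between methods")
```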

Post-Hoc: Tukey's HSD

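```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Reuses method_a, method_b and method_c from the one-way ANOVA example
all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30

# Pairwise comparisons with the family-wise error rate held at alpha
tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)
```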

Non-Parametric Tests

| Parametric | Non-Parametric | When to Use |
|------------|----------------|-------------|
| One-sample t | Wilcoxon signed-rank | Small sample, non-normal |
| Independent t | Mann-Whitney U | Non-normal or ordinal data |
| Paired t | Wilcoxon signed-rank (paired) | Paired, non-normal differences |
| One-way ANOVA | Kruskal-Wallis | Non-normal, 3+ groups |
| Pearson r | Spearman rho | Non-linear monotonic relationship |
```python
from scipy import stats

# Reuses method_a, method_b and method_c from the one-way ANOVA example

# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")

# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
```

Multiple Comparisons Problem

Solutions

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from seven separate tests
p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]

# Bonferroni correction
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected:", pvals_corrected)

# FDR (Benjamini-Hochberg)
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected:", pvals_fdr)
```

When to use which correction: Use Bonferroni when you have few comparisons (m < 10) and the cost of a false positive is high. Use FDR when you have many comparisons (m > 10) and you want to maximize discoveries (e.g., genomics, exploratory data analysis).
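To make the difference concrete, the sketch below reimplements both adjustments with NumPy only. The function names are illustrative; this is a hand-rolled version for understanding, not the statsmodels implementation.

```python
import numpy as np

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

print("Bonferroni:", np.round(bonf, 3))
print("BH (FDR):  ", np.round(bh, 3))
```

With these seven p-values, no Bonferroni-adjusted value falls below 0.05, while the Benjamini-Hochberg values for the four smallest raw p-values are 0.07 each. The FDR procedure is less conservative, but at α = 0.05 neither method rejects anything in this example.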

Power Analysis

Statistical Power

$$\text{Power} = 1 - \beta = f(\text{effect size}, \alpha, n)$$

Here,

  • $\beta$ = probability of Type II error
  • $\alpha$ = significance level (typically 0.05)
  • $n$ = sample size per group
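One common closed form for this relationship is the normal approximation n ≈ 2((z₁₋α/₂ + z_power) / d)² per group for a two-sided, two-sample comparison. The sketch below applies it; the function name is illustrative, and because it ignores the t distribution it tends to come out slightly below what an exact t-based solver reports.

```python
import math
from scipy import stats

def approx_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group (two-sided, two-sample)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the test
    z_power = stats.norm.ppf(power)          # quantile for the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

n_approx = approx_n_per_group(0.5)
print(f"Approximate n per group for d = 0.5: {math.ceil(n_approx)}")
```

The required n scales with 1/d², so halving the effect size roughly quadruples the sample size.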
```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Required n per group: {int(np.ceil(n))}")

# What power do we have with n=50 per group?
power = power_analysis.power(effect_size=0.5, alpha=0.05, nobs1=50, ratio=1.0)
print(f"Power with n=50: {power:.2f}")
```

Quick Reference: Which Test to Use

Which Statistical Test Should I Use?

What are you comparing?

  • 1 group vs known value: normal → One-sample t; non-normal → Wilcoxon signed-rank
  • 2 groups, independent: normal → Welch's t-test; non-normal → Mann-Whitney U
  • 2 groups, paired: normal → Paired t-test; non-normal → Wilcoxon paired
  • 3+ groups, 1 factor: normal → One-way ANOVA; non-normal → Kruskal-Wallis
  • 3+ groups, 2+ factors: normal → Two-way ANOVA; otherwise → Friedman

Quick Reference

  • Categorical data? → Chi-square test
  • Correlation? → Pearson (normal) or Spearman (non-linear)
  • Always check: Independence, Normality, Homoscedasticity

Key Takeaways

Summary: Statistical Testing Deep Dive

  1. Always state H0 and H1 before testing. Hypothesis testing is a structured framework, not a fishing expedition.
  2. p-value is NOT the probability H0 is true. It is the probability of seeing data this extreme if H0 were true.
  3. Statistical significance ≠ practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
  4. Use Welch's t-test by default (safer than Student's). It does not assume equal variances.
  5. Correct for multiple comparisons using Bonferroni (conservative) or FDR (liberal).
  6. Always check assumptions (normality, independence, homoscedasticity). Use non-parametric tests when assumptions fail.
  7. Use power analysis to determine sample size before collecting data.

Practice Exercises

  1. Drug Trial: Blood pressure in 40 patients after a new drug. Historical mean 120 mmHg. Sample mean 115, std=12. Is the drug effective?
  2. A/B Test: Website A conversion 12.3% (n=5000), Website B 13.1% (n=5000). Is B significantly better?
  3. Survey Analysis: Association between education level and preferred news source (n=200). Chi-square test?
  4. Experiment Design: Detect medium effect (d=0.5) with 90% power. How many subjects per group?
