Bayes' Theorem
Probability Theory
How Evidence Updates What You Believe
Bayes' theorem is the mathematical engine of inductive reasoning. It provides a systematic way to update probabilities in light of new evidence.
- Prior to posterior — Start with a belief, see evidence, get an updated belief
- Base rate neglect — The most common Bayesian error; always account for prevalence
- Medical testing — Positive test result does not mean you have the disease
- Naive Bayes — The algorithm that powers spam filters and text classifiers
Within standard probability theory, Bayes' theorem is the coherent rule for updating beliefs with evidence.
What is Bayes' Theorem?
Definition
Bayes' theorem relates the conditional probability $P(A \mid B)$ to its converse $P(B \mid A)$. It tells you how to revise the probability of a hypothesis $A$ once evidence $B$ has been observed.
Derivation
Bayes' Theorem: Derivation
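Starting from the definition of conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
Solving the second for $P(A \cap B)$ and substituting into the first:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Using the law of total probability to expand $P(B)$:
$$P(B) = \sum_{i} P(B \mid A_i)\,P(A_i)$$
where $A_1, \dots, A_n$ partition the sample space.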
Bayes' Theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Here,
- $P(A \mid B)$ = Posterior: updated belief in $A$ after seeing evidence $B$
- $P(B \mid A)$ = Likelihood: probability of the evidence $B$ given hypothesis $A$
- $P(A)$ = Prior: initial belief before seeing the evidence
- $P(B)$ = Marginal likelihood (evidence): total probability of $B$
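The formula can be sketched in a few lines for a binary hypothesis, expanding the marginal likelihood with the law of total probability over $A$ and its complement. The function name `posterior` and the example numbers are illustrative assumptions, not values from the text.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Return P(A | B) for a binary hypothesis A.

    prior             -- P(A)
    likelihood        -- P(B | A)
    likelihood_if_not -- P(B | not A)
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(B) is zero; the posterior is undefined")
    return likelihood * prior / evidence


# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2
p = posterior(0.3, 0.8, 0.2)
print(f"P(A | B) = {p:.4f}")
```

Because the evidence is four times more likely under $A$ than under its complement, the posterior rises above the prior of 0.3.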
The Bayesian Updating Schema
Posterior ∝ Likelihood × Prior
In practice, since $P(B)$ is constant for all hypotheses, we often write:
$$P(A \mid B) \propto P(B \mid A)\,P(A)$$
The posterior is proportional to the likelihood times the prior. Normalization ensures the posterior sums to 1.
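A small sketch of the normalization step, assuming a hypothetical set of three candidate coin biases and a single observed head. The unnormalized products of likelihood and prior are divided by their sum, which plays the role of the marginal likelihood.

```python
import numpy as np

# Hypothetical setup: three candidate coin biases and one observed head.
hypotheses = np.array([0.25, 0.50, 0.75])   # P(heads) under each hypothesis
prior = np.array([1 / 3, 1 / 3, 1 / 3])     # uniform prior over the hypotheses
likelihood = hypotheses                     # P(heads | hypothesis)

unnormalized = likelihood * prior           # likelihood x prior
post = unnormalized / unnormalized.sum()    # divide by P(B) so the result sums to 1

print("unnormalized:", unnormalized)
print("posterior:   ", post, "sum =", post.sum())
```

The ranking of the hypotheses is already fixed by the unnormalized values; normalization only rescales them into probabilities.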
The key insight: the posterior is a compromise between the prior and the likelihood. When data are abundant, the likelihood dominates and the prior's influence fades, provided the prior does not assign zero probability to the values the data support. When data are sparse, the prior has more influence.
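To make the compromise concrete, the sketch below compares two hypothetical Beta priors on a success probability under sparse and abundant data with the same observed rate. It relies on the standard Beta-Binomial result that a Beta(α, β) prior with k successes in n trials gives posterior mean (α + k) / (α + β + n). The specific priors and counts are assumptions chosen for illustration.

```python
def beta_posterior_mean(alpha, beta, successes, trials):
    """Posterior mean of a success probability under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + trials)


# Two hypothetical priors: prior means 0.5 (diffuse) and 0.2 (informative).
priors = {"diffuse Beta(1, 1)": (1, 1), "informative Beta(20, 80)": (20, 80)}

# Same observed success rate (70%) with sparse and with abundant data.
datasets = {"sparse": (7, 10), "abundant": (7000, 10000)}

results = {}
for prior_name, (a, b) in priors.items():
    for data_name, (k, n) in datasets.items():
        mean = beta_posterior_mean(a, b, k, n)
        results[(prior_name, data_name)] = mean
        print(f"{prior_name:26s} {data_name:9s} -> posterior mean {mean:.3f}")
```

With 10 trials the two priors lead to clearly different posteriors; with 10,000 trials both end up close to the observed rate of 0.7.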
Sequential Updating
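Posterior as Prior for Next Observation
When data arrive one at a time, Bayes' theorem enables sequential updating. Writing $\theta$ for the hypothesis (or parameter) and $x_1, x_2, \dots$ for the observations, the posterior after observing $x_1$ becomes the prior for observing $x_2$:
$$P(\theta \mid x_1, x_2) \propto P(x_2 \mid \theta, x_1)\,P(\theta \mid x_1)$$
For i.i.d. observations, this simplifies to:
$$P(\theta \mid x_1, \dots, x_n) \propto P(\theta) \prod_{i=1}^{n} P(x_i \mid \theta)$$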
Each observation multiplies the current belief by the likelihood of that observation.
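The sketch below checks this numerically on a hypothetical coin example: updating one flip at a time gives the same posterior as multiplying the prior by all likelihoods at once. The candidate biases and the flip sequence are made-up values for illustration.

```python
import numpy as np

thetas = np.array([0.2, 0.5, 0.8])     # hypothetical candidate values of P(heads)
prior = np.full(3, 1 / 3)              # uniform prior over the candidates
flips = [1, 1, 0, 1]                   # hypothetical data: 1 = heads, 0 = tails


def likelihood(x, thetas):
    """P(x | theta) for a single Bernoulli observation, for every theta."""
    return thetas if x == 1 else 1 - thetas


# Sequential: the posterior after each flip is the prior for the next one.
belief = prior.copy()
for x in flips:
    belief = belief * likelihood(x, thetas)
    belief = belief / belief.sum()

# Batch: multiply the prior by the product of all likelihoods at once.
batch = prior * np.prod([likelihood(x, thetas) for x in flips], axis=0)
batch = batch / batch.sum()

print("sequential:", belief.round(4))
print("batch:     ", batch.round(4))
```

The two results agree because normalizing at each step only rescales the belief vector and does not change the ratios between hypotheses.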
Medical Testing Revisited
The Disease Testing Problem
Given:
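- Prevalence (prior): $P(D) = 0.01$
- Sensitivity: $P(+ \mid D) = 0.99$
- Specificity: $P(- \mid \neg D) = 0.95$, so $P(+ \mid \neg D) = 0.05$
Step 1 (marginal likelihood):
$$P(+) = P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D) = 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0594$$
Step 2 (posterior):
$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)} = \frac{0.0099}{0.0594} \approx 0.167$$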
Despite 99% sensitivity, only 1 in 6 positive results is a true positive. The low prevalence (1%) causes the majority of positives to be false positives.
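The two steps can be reproduced directly. The prevalence and sensitivity come from the text; the specificity of 95% is an assumption here, chosen because it is the value that yields the 1-in-6 result stated above.

```python
prevalence = 0.01      # P(D), from the text
sensitivity = 0.99     # P(+ | D), from the text
specificity = 0.95     # P(- | not D): assumed value that reproduces the 1-in-6 result

false_positive_rate = 1 - specificity   # P(+ | not D)

# Step 1: marginal likelihood of a positive test
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Step 2: posterior probability of disease given a positive test
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(+)     = {p_positive:.4f}")
print(f"P(D | +) = {p_disease_given_positive:.4f}")
```

In natural frequencies, under the same assumptions: of 10,000 people, 100 have the disease and 99 of them test positive, while 495 of the 9,900 healthy people also test positive, so 99 of 594 positives are true positives.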
Base Rate Neglect
The base rate fallacy (also called base rate neglect) is the tendency to ignore the prior probability and focus only on the test accuracy. This is one of the most common reasoning errors in everyday life and in medical decision-making.
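One way to see why the base rate matters is to hold the test fixed and vary only the prevalence. The sketch below does this with a hypothetical helper `positive_predictive_value`; the default sensitivity (99%) and specificity (95%) and the list of prevalences are illustrative assumptions.

```python
def positive_predictive_value(prevalence, sensitivity=0.99, specificity=0.95):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same test, different base rates (hypothetical prevalence values).
prevalences = [0.001, 0.01, 0.1, 0.5]
ppv = {p: positive_predictive_value(p) for p in prevalences}
for p, value in ppv.items():
    print(f"prevalence {p:>5.1%} -> P(disease | +) = {value:.1%}")
```

The same positive result supports very different conclusions depending on the prior: it is weak evidence for a rare condition and strong evidence for a common one.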
Naive Bayes Classification
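Naive Bayes Classifier
For a classification problem with classes $C_1, \dots, C_K$ and features $x_1, \dots, x_n$, Bayes' theorem gives:
$$P(C_k \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid C_k)\,P(C_k)}{P(x_1, \dots, x_n)}$$
The naive assumption is that features are conditionally independent given the class:
$$P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$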
Despite this often-unrealistic assumption, Naive Bayes classifiers perform surprisingly well in practice (e.g., spam filtering, text classification).
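A from-scratch sketch of the classifier, on a hypothetical toy corpus, shows how the factorized likelihood is used. It works in log space so that the product of per-word probabilities becomes a sum, and it uses Laplace (add-one) smoothing to avoid zero probabilities; the smoothing, the corpus, and the helper names are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (tokens, label)
train = [
    (["win", "money", "now"], "spam"),
    (["free", "prize", "now"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "meeting", "now"], "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}


def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) under the naive independence assumption."""
    log_p = math.log(class_counts[label] / len(train))   # log prior
    total = sum(word_counts[label].values())
    for w in tokens:
        # Laplace-smoothed estimate of P(word | label)
        p_w = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        log_p += math.log(p_w)                           # sum of logs = log of product
    return log_p


def classify(tokens):
    """Pick the class with the largest unnormalized log posterior."""
    return max(class_counts, key=lambda label: log_posterior(tokens, label))


print(classify(["free", "money"]))
print(classify(["meeting", "tomorrow"]))
```

The denominator $P(x_1, \dots, x_n)$ is never computed: it is the same for every class, so it does not affect which class scores highest.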
The Role of the Prior
| Prior Type | Influence on Posterior | When Appropriate |
|---|---|---|
| Diffuse (uniform) | Data dominate | Little prior knowledge |
| Informative | Prior and data compromise | Strong domain expertise |
| Conjugate | Posterior has same family as prior | Mathematical convenience |
| Jeffreys | Non-informative, transformation-invariant | Objective Bayesian analysis |
Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. For example:
- Beta prior + Binomial likelihood -> Beta posterior
- Normal prior on the mean + Normal likelihood (known variance) -> Normal posterior
- Gamma prior + Poisson likelihood -> Gamma posterior
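As a minimal sketch of why conjugacy is convenient, the Gamma-Poisson case reduces the posterior update to two additions. In the shape-rate parameterization, a Gamma(α, β) prior on a Poisson rate combined with observed counts gives a Gamma(α + Σx, β + n) posterior. The function name `gamma_poisson_update`, the prior, and the counts are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Update a Gamma(alpha, beta) prior (shape, rate) on a Poisson rate.

    With Poisson-distributed counts, the posterior is
    Gamma(alpha + sum(counts), beta + len(counts)).
    """
    return alpha + sum(counts), beta + len(counts)


# Hypothetical prior and data: prior mean = 2 / 1 = 2 events per interval.
alpha0, beta0 = 2.0, 1.0
counts = [3, 5, 4, 6]

alpha1, beta1 = gamma_poisson_update(alpha0, beta0, counts)
print(f"posterior: Gamma(shape={alpha1}, rate={beta1}), mean = {alpha1 / beta1:.2f}")
```

No integration or numerical normalization is needed; the posterior is known in closed form as soon as the counts are summed.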
Bayes' Theorem in Machine Learning
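| ML Application | Bayes' Theorem Usage | Why |
|---|---|---|
| Naive Bayes | P(class \| features) ∝ P(features \| class) × P(class) | Fast text classification |
| Bayesian optimization | P(optimal \| data) | Hyperparameter tuning |
| MAP estimation | θ_MAP = argmax_θ P(data \| θ) × P(θ) | The prior acts as a regularizer |
| LLMs | P(next_token \| context) | Autoregressive models such as GPT output this conditional distribution (masked models such as BERT predict masked tokens instead); it is a conditional probability rather than an explicit Bayesian update |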
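The MAP row can be illustrated with linear regression: assuming Gaussian noise and a zero-mean Gaussian prior on the weights, maximizing likelihood × prior is the same as ridge regression with penalty λ = σ²/τ². The sketch below uses synthetic data and assumed values for the noise and prior scales; it checks that the ridge solution is a stationary point of the negative log posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression data: y = X w + Gaussian noise
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.5                              # noise standard deviation (assumed known)
y = X @ w_true + rng.normal(scale=sigma, size=50)

tau = 1.0                                # prior standard deviation: w ~ N(0, tau^2 I)
lam = sigma**2 / tau**2                  # equivalent ridge penalty

# Ridge / MAP solution: minimizes ||y - Xw||^2 + lam * ||w||^2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)


def grad_neg_log_posterior(w):
    """Gradient of -log[P(y | w) P(w)], ignoring additive constants."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2


print("w_MAP:", w_map.round(3))
print("gradient norm at w_MAP:", np.linalg.norm(grad_neg_log_posterior(w_map)))
```

A tighter prior (smaller `tau`) corresponds to a larger penalty and pulls the estimate more strongly toward zero.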
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes: direct application of Bayes' theorem
texts = ["win money now", "free prize guaranteed", "meeting tomorrow at 3",
         "project deadline friday", "claim your reward", "lunch with team"]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=not spam

# Turn each message into a vector of word counts (the features)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit class priors and per-class word likelihoods
model = MultinomialNB()
model.fit(X, labels)

# Predict new message
new_msg = ["free money prize"]
new_X = vectorizer.transform(new_msg)
proba = model.predict_proba(new_X)[0]  # ordered as model.classes_, here [0, 1]

print(f"Message: '{new_msg[0]}'")
print(f"P(not spam) = {proba[0]:.3f}")
print(f"P(spam) = {proba[1]:.3f}")
print(f"Prediction: {'SPAM' if proba[1] > 0.5 else 'NOT SPAM'}")
print("\nThis IS Bayes' theorem in action!")
```
Key Takeaways
Summary: Bayes' Theorem
- Posterior = (Likelihood × Prior) / Marginal Likelihood — the core formula
- Prior = initial belief before seeing data; Posterior = updated belief after seeing data
- The posterior is proportional to likelihood × prior; dividing by the marginal likelihood normalizes it so that it sums to 1
- Sequential updating: each posterior becomes the prior for the next observation — beliefs update incrementally
- Base rate neglect is a common cognitive bias — always consider the prior probability
- Naive Bayes applies Bayes' theorem with conditional independence — surprisingly effective in practice
- Conjugate priors yield closed-form posteriors — essential for analytical tractability