Bayes' Theorem

How Evidence Updates What You Believe

Bayes' theorem is the mathematical engine of inductive reasoning. It provides a systematic way to update probabilities in light of new evidence.

Prior to posterior: start with a belief, see evidence, get an updated belief
Base rate neglect: one of the most common Bayesian errors; always account for prevalence
Medical testing: a positive test result does not necessarily mean you have the disease
Naive Bayes: an algorithm behind many spam filters and text classifiers

Within probability theory, Bayes' theorem is the rule for updating beliefs consistently in light of evidence; it follows directly from the definition of conditional probability.

What is Bayes' Theorem?

Definition

Bayes' theorem relates the probability of a hypothesis $A$ given evidence $B$ to the probability of that evidence given the hypothesis. It provides a systematic way to update probabilities in light of new evidence.

Derivation

Bayes' Theorem: Derivation

Starting from the definition of conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Solving the second for $P(A \cap B)$ and substituting into the first:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

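Using the law of total probability to expand $P(B)$ over a partition $A_1, A_2, \ldots, A_k$ of the sample space, the posterior for any one hypothesis $A_j$ is:

$$P(A_j \mid B) = \frac{P(B \mid A_j) \cdot P(A_j)}{\sum_{i=1}^k P(B \mid A_i) \cdot P(A_i)}$$

where the events $A_1, A_2, \ldots, A_k$ are mutually exclusive and exhaustive.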
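The partition form of the theorem translates almost line by line into code. The sketch below is illustrative only: the function name `bayes_posterior` and the machine/defect numbers are assumptions made up for this example, not part of the original text.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(A_i | B) for every hypothesis A_i in a partition.

    priors[i]      = P(A_i), must sum to 1
    likelihoods[i] = P(B | A_i)
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (the A_i partition the sample space)")

    # Numerator of Bayes' theorem for each hypothesis: P(B | A_i) * P(A_i)
    joint = [lik * p for lik, p in zip(likelihoods, priors)]

    # Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i)
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) is zero; the posterior is undefined")

    return [j / evidence for j in joint]


# Hypothetical example: three machines produce 50%, 30%, 20% of all parts,
# with defect rates of 1%, 2%, 5%. Which machine made a defective part?
posterior = bayes_posterior([0.5, 0.3, 0.2], [0.01, 0.02, 0.05])
print([round(p, 4) for p in posterior])
```

The third machine produces the fewest parts but ends up with the largest posterior, because its defect rate is high enough to outweigh its small prior.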

Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Here,

$P(A \mid B)$ = Posterior: updated belief after seeing evidence $B$
$P(B \mid A)$ = Likelihood: probability of the evidence given hypothesis $A$
$P(A)$ = Prior: initial belief before seeing the evidence
$P(B)$ = Marginal likelihood (evidence): total probability of $B$

The Bayesian Updating Schema

Posterior ∝ Likelihood × Prior

In practice, since $P(B)$ is constant for all hypotheses, we often write:

$$P(A_i \mid B) \propto P(B \mid A_i) \cdot P(A_i)$$

The posterior is proportional to the likelihood times the prior. Normalization ensures the posterior sums to 1.

The key insight: the posterior is a compromise between the prior and the likelihood. When data are abundant, the likelihood dominates and the prior's influence shrinks (provided the prior does not assign zero probability to the values the data support). When data are sparse, the prior has more influence.
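To see the prior's influence fade as data accumulate, the sketch below compares two analysts who see the same data but start from different priors. It assumes a Beta prior on a success probability with binomial data, for which the posterior mean has the closed form (a + successes) / (a + b + trials). The priors, the observed rate of 0.6, and the sample sizes are illustrative assumptions.

```python
def beta_posterior_mean(a, b, successes, trials):
    """Posterior mean of a success probability under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


observed_rate = 0.6  # assumed fraction of successes in the data, for illustration

gaps = []
for n in (10, 100, 10_000):
    successes = round(observed_rate * n)
    flat = beta_posterior_mean(1, 1, successes, n)      # flat prior, Beta(1, 1)
    strong = beta_posterior_mean(20, 80, successes, n)  # strong prior centred on 0.2
    gaps.append(abs(flat - strong))
    print(f"n={n:>6}  flat prior: {flat:.3f}  strong prior: {strong:.3f}")
```

With 10 observations the two posterior means are far apart; with 10,000 they nearly coincide, although the strong prior never vanishes entirely.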

Sequential Updating

Posterior as Prior for Next Observation

When data arrive one at a time, Bayes' theorem enables sequential updating. The posterior after observing $d_1$ becomes the prior for observing $d_2$:

$$P(\theta \mid d_1, d_2) = \frac{P(d_2 \mid \theta, d_1) \cdot P(\theta \mid d_1)}{P(d_2 \mid d_1)}$$

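For observations that are conditionally independent given $\theta$ (for example, i.i.d. draws), $P(d_2 \mid \theta, d_1) = P(d_2 \mid \theta)$, and this simplifies to:

$$P(\theta \mid d_1, d_2) \propto P(d_2 \mid \theta) \cdot P(d_1 \mid \theta) \cdot P(\theta)$$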

Each observation multiplies the current belief by the likelihood of that observation.
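A small numerical check makes the equivalence concrete: updating one observation at a time should give the same posterior as using all observations at once. The sketch below assumes a coin whose heads probability is one of three candidate values; the candidates, the uniform prior, and the flip sequence are hypothetical.

```python
import math


def update(belief, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical setup: three candidate values for a coin's heads probability.
thetas = [0.3, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
flips = [1, 1, 0, 1]  # 1 = heads, 0 = tails

# Sequential: the posterior after each flip becomes the prior for the next.
belief = prior
for flip in flips:
    likelihood = [t if flip == 1 else 1 - t for t in thetas]
    belief = update(belief, likelihood)

# Batch: use the likelihood of all flips at once.
heads = sum(flips)
tails = len(flips) - heads
batch = update(prior, [t**heads * (1 - t) ** tails for t in thetas])

print([round(b, 4) for b in belief])
print(all(math.isclose(s, b) for s, b in zip(belief, batch)))
```

The order of the flips does not matter here either, since the per-flip likelihoods simply multiply.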

Medical Testing Revisited

The Disease Testing Problem

Given:

Prevalence (prior): $P(D) = 0.01$
Sensitivity: $P(+ \mid D) = 0.99$
Specificity: $P(- \mid \neg D) = 0.95$, so $P(+ \mid \neg D) = 0.05$

Step 1 — Marginal likelihood:

$$P(+) = P(+ \mid D)P(D) + P(+ \mid \neg D)P(\neg D) = 0.99(0.01) + 0.05(0.99) = 0.0594$$

Step 2 — Posterior:

$$P(D \mid +) = \frac{0.99 \times 0.01}{0.0594} \approx 0.1667$$

Despite 99% sensitivity, only 1 in 6 positive results is a true positive. The low prevalence (1%) causes the majority of positives to be false positives.
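The two steps can be reproduced with exact arithmetic. The sketch below uses Python's standard `fractions` module and the numbers from the example; the function name `posterior_given_positive` and the cohort size of 10,000 are assumptions chosen for illustration.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(D | +) from prevalence, sensitivity and specificity."""
    false_positive_rate = 1 - specificity
    # Step 1: marginal likelihood P(+) via the law of total probability
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    # Step 2: posterior via Bayes' theorem
    return sensitivity * prevalence / p_positive


p = posterior_given_positive(Fraction(1, 100), Fraction(99, 100), Fraction(95, 100))
print(p, float(p))  # 1/6 0.16666666666666666

# The same result as natural frequencies in a cohort of 10,000 people.
cohort = 10_000
sick = cohort * Fraction(1, 100)                # 100 people have the disease
true_pos = sick * Fraction(99, 100)             # 99 of them test positive
false_pos = (cohort - sick) * Fraction(5, 100)  # 495 healthy people also test positive
print(true_pos, false_pos, true_pos / (true_pos + false_pos))  # 99 495 1/6
```

Counting people rather than multiplying probabilities often makes the result easier to accept: 99 true positives are outnumbered five to one by 495 false positives.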

Base Rate Neglect

The base rate fallacy (also called base rate neglect) is the tendency to ignore the prior probability $P(D)$ and focus only on the test accuracy. This is one of the most common reasoning errors in everyday life and in medical decision-making.
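One way to see why the base rate matters is to hold the test fixed and vary only the prevalence. The sketch below reuses the sensitivity and specificity from the disease testing problem; the function name `ppv` (positive predictive value) and the list of prevalences are illustrative assumptions.

```python
def ppv(prevalence, sensitivity=0.99, specificity=0.95):
    """Positive predictive value P(D | +) for a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same test, different populations (prevalences chosen for illustration).
for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(f"prevalence={prevalence:<6} P(D | +)={ppv(prevalence):.3f}")
```

The test's accuracy never changes, yet the meaning of a positive result ranges from "almost certainly a false alarm" to "very likely a true case" depending on how common the disease is.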

Naive Bayes Classification

Naive Bayes Classifier

For a classification problem with classes $C_1, \ldots, C_k$ and features $X_1, \ldots, X_p$, Bayes' theorem gives:

$$P(C_j \mid X_1, \ldots, X_p) = \frac{P(X_1, \ldots, X_p \mid C_j) \, P(C_j)}{P(X_1, \ldots, X_p)}$$

The naive assumption is that features are conditionally independent given the class:

$$P(X_1, \ldots, X_p \mid C_j) = \prod_{i=1}^p P(X_i \mid C_j)$$

Despite this often-unrealistic assumption, Naive Bayes classifiers perform surprisingly well in practice (e.g., spam filtering, text classification).
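To show how the two formulas combine, here is a minimal from-scratch sketch of a word-count Naive Bayes classifier. It is not a reference implementation: the function names, the add-alpha (Laplace) smoothing, the use of log probabilities, and the four toy documents are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Estimate log P(C) and log P(word | C) with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_proba(doc, log_prior, log_lik):
    """Posterior over classes; words outside the vocabulary are ignored."""
    scores = {}
    for c in log_prior:
        # Naive assumption: per-word log-likelihoods add (probabilities multiply)
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
    # Normalize with the log-sum-exp trick to avoid numerical underflow
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}


docs = ["win money now", "free prize guaranteed", "meeting tomorrow at 3", "lunch with team"]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_lik = train(docs, labels)
probs = predict_proba("free money", log_prior, log_lik)
print({c: round(p, 3) for c, p in probs.items()})
```

Working in log space turns the product over features into a sum, which keeps long documents from underflowing to zero; smoothing keeps a single unseen word from forcing a class probability to zero.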

The Role of the Prior

Prior Type	Influence on Posterior	When Appropriate
Diffuse (uniform)	Data dominate	Little prior knowledge
Informative	Prior and data compromise	Strong domain expertise
Conjugate	Posterior has same family as prior	Mathematical convenience
Jeffreys	Non-informative, transformation-invariant	Objective Bayesian analysis

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. For example:

Beta prior + Binomial likelihood -> Beta posterior
Normal prior on the mean + Normal likelihood with known variance -> Normal posterior
Gamma prior on the rate + Poisson likelihood -> Gamma posterior
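For the Beta-Binomial pair, the conjugate update amounts to adding counts to the prior's parameters. The sketch below checks that shortcut against a brute-force application of Bayes' theorem on a grid; the function names, the Beta(2, 2) prior, and the data (7 successes, 3 failures) are hypothetical.

```python
def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + successes, b + failures


def grid_posterior_mean(a, b, successes, failures, n_grid=2_001):
    """Numerical check: apply Bayes' theorem on a grid of theta values."""
    thetas = [i / (n_grid - 1) for i in range(1, n_grid - 1)]  # skip 0 and 1
    # Unnormalized posterior = Beta(a, b) kernel times binomial likelihood kernel
    weights = [t ** (a - 1 + successes) * (1 - t) ** (b - 1 + failures) for t in thetas]
    total = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / total


# Hypothetical data: Beta(2, 2) prior, then 7 successes and 3 failures.
a_post, b_post = beta_binomial_update(2, 2, 7, 3)
closed_form_mean = a_post / (a_post + b_post)  # mean of Beta(a, b) is a / (a + b)
numeric_mean = grid_posterior_mean(2, 2, 7, 3)
print((a_post, b_post), round(closed_form_mean, 4), round(numeric_mean, 4))
```

The closed form needs two additions, while the grid needs thousands of evaluations; that gap is the practical appeal of conjugacy.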

Bayes' Theorem in Machine Learning

ML Application	Bayes' Theorem Usage	Why
Naive Bayes	P(class | features) ∝ P(features | class) × P(class)	Fast text classification
Bayesian optimization	Posterior over the objective function, P(f | data)	Hyperparameter tuning
MAP estimation	θ_MAP = argmax_θ P(data | θ) × P(θ)	Regularization = prior
LLMs	P(next_token | context)	Autoregressive models such as GPT learn this conditional distribution directly; it is not an explicit Bayes update

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes: direct application of Bayes' theorem
# Toy training set, far too small for a real spam filter
texts = ["win money now", "free prize guaranteed", "meeting tomorrow at 3",
         "project deadline friday", "claim your reward", "lunch with team"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# Bag-of-words counts: one column per vocabulary word
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Estimates P(class) and P(word | class), with additive smoothing (alpha=1.0 by default)
model = MultinomialNB()
model.fit(X, labels)

# Predict a new message; predict_proba columns follow model.classes_, here [0, 1]
new_msg = ["free money prize"]
new_X = vectorizer.transform(new_msg)
proba = model.predict_proba(new_X)[0]
print(f"Message: '{new_msg[0]}'")
print(f"P(not spam) = {proba[0]:.3f}")
print(f"P(spam)     = {proba[1]:.3f}")
print(f"Prediction: {'SPAM' if proba[1] > 0.5 else 'NOT SPAM'}")
print("\nThis IS Bayes' theorem in action!")
```
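The "Regularization = prior" entry in the table can be made concrete with a one-parameter example. The sketch below assumes Gaussian observations with known variance `sigma2` and a zero-mean Gaussian prior with variance `tau2` on the mean; under those assumptions, maximizing P(data | θ) × P(θ) is equivalent to least squares with an L2 penalty of strength `sigma2 / tau2`. The data values and function names are hypothetical.

```python
def map_estimate(xs, sigma2=1.0, tau2=1.0):
    """MAP estimate of a Gaussian mean under a N(0, tau2) prior.

    Maximizing P(data | theta) * P(theta) is the same as minimizing
    sum((x - theta)^2) + lam * theta^2 with lam = sigma2 / tau2.
    """
    lam = sigma2 / tau2
    return sum(xs) / (len(xs) + lam)


def penalized_loss(theta, xs, lam):
    """Squared error plus an L2 penalty (negative log posterior up to scale)."""
    return sum((x - theta) ** 2 for x in xs) + lam * theta ** 2


xs = [2.1, 1.9, 2.4, 2.0]  # hypothetical observations
mle = sum(xs) / len(xs)    # no prior: plain sample mean
theta_map = map_estimate(xs, sigma2=1.0, tau2=0.5)  # lam = 2

# Brute-force check: the MAP estimate minimizes the penalized loss on a fine grid.
grid = [i / 1000 for i in range(0, 3001)]
theta_grid = min(grid, key=lambda t: penalized_loss(t, xs, 2.0))
print(round(mle, 3), round(theta_map, 3), theta_grid)
```

A tighter prior (smaller `tau2`) means a larger penalty and more shrinkage toward zero, which is the sense in which a regularizer encodes a prior belief.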

Key Takeaways

Summary: Bayes' Theorem

Posterior = (Likelihood × Prior) / Marginal Likelihood: the core formula
Prior = initial belief before seeing data; Posterior = updated belief after seeing data
The posterior is proportional to likelihood × prior; dividing by the marginal likelihood normalizes it
Sequential updating: each posterior becomes the prior for the next observation, so beliefs update incrementally
Base rate neglect is a common cognitive bias: always consider the prior probability
Naive Bayes applies Bayes' theorem under a conditional independence assumption and is surprisingly effective in practice
Conjugate priors yield closed-form posteriors, which keeps the analysis tractable
