
Bayes' Theorem — Updating Beliefs with Evidence


How Evidence Updates What You Believe

Bayes' theorem is the mathematical engine of inductive reasoning. It provides a systematic way to update probabilities in light of new evidence.

  • Prior to posterior — Start with a belief, see evidence, get an updated belief
  • Base rate neglect — The most common Bayesian error; always account for prevalence
  • Medical testing — Positive test result does not mean you have the disease
  • Naive Bayes — The algorithm that powers spam filters and text classifiers

Bayes' theorem follows directly from the definition of conditional probability, so any belief update that respects the rules of probability must agree with it.


What is Bayes' Theorem?

Definition

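Bayes' theorem expresses the probability of a hypothesis A given evidence B in terms of the reverse conditional probability, the probability of B given A, and the prior probability of A. It provides a systematic way to update probabilities in light of new evidence.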


Derivation

Bayes' Theorem: Derivation

Starting from the definition of conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Solving the second for $P(A \cap B)$ and substituting into the first:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Using the law of total probability to expand $P(B)$:

$$P(A_j \mid B) = \frac{P(B \mid A_j) \cdot P(A_j)}{\sum_{i=1}^k P(B \mid A_i) \cdot P(A_i)}$$

where $A_1, A_2, \ldots, A_k$ partition the sample space and $A_j$ is the hypothesis of interest.

Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Here,

  • $P(A \mid B)$ = Posterior: updated belief after seeing evidence B
  • $P(B \mid A)$ = Likelihood: probability of evidence given hypothesis A
  • $P(A)$ = Prior: initial belief before seeing evidence
  • $P(B)$ = Marginal likelihood (evidence): total probability of B

The Bayesian Updating Schema

Posterior ∝ Likelihood × Prior

In practice, since $P(B)$ is the same for every hypothesis, we often write:

$$P(A_i \mid B) \propto P(B \mid A_i) \cdot P(A_i)$$

The posterior is proportional to the likelihood times the prior. Normalization ensures the posterior sums to 1.
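
As a minimal sketch of the proportional form, the snippet below multiplies likelihood by prior for three hypotheses and then normalizes. The hypothesis names and all numbers are illustrative assumptions, not values from this lesson.

```python
# Hypothetical example: three candidate coins with different P(heads).
# All numbers are illustrative assumptions.
priors = {"fair": 0.5, "biased_heads": 0.25, "biased_tails": 0.25}
p_heads = {"fair": 0.5, "biased_heads": 0.8, "biased_tails": 0.2}

def posterior_after_heads(priors, p_heads):
    """Return P(hypothesis | heads) using posterior ∝ likelihood × prior."""
    unnormalized = {h: p_heads[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(B), the marginal likelihood
    return {h: u / evidence for h, u in unnormalized.items()}

posterior = posterior_after_heads(priors, p_heads)
for hypothesis, prob in posterior.items():
    print(f"P({hypothesis} | heads) = {prob:.3f}")
```

The division by `evidence` is the normalization step: it is the same number for every hypothesis, so it never changes which hypothesis ranks highest.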

The key insight: the posterior is a compromise between the prior and the likelihood. When data are abundant, the likelihood dominates and the prior's influence shrinks toward zero, provided the prior does not assign zero probability to the true value. When data are sparse, the prior has more influence.
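
One way to see this compromise numerically is a coin-tossing sketch with a Beta prior, whose posterior mean has the closed form (a + heads) / (a + b + tosses). The two priors and the data sizes below are assumptions chosen only for illustration.

```python
def beta_posterior_mean(a, b, heads, tosses):
    """Posterior mean of P(heads) under a Beta(a, b) prior and binomial data."""
    return (a + heads) / (a + b + tosses)

# Two deliberately different priors (illustrative assumptions).
skeptic = (2, 8)    # prior mean 0.2
optimist = (8, 2)   # prior mean 0.8

gaps = []
# Hypothetical data sets with the same 60% heads rate but growing size.
for tosses in (10, 100, 10_000):
    heads = (6 * tosses) // 10
    m1 = beta_posterior_mean(*skeptic, heads, tosses)
    m2 = beta_posterior_mean(*optimist, heads, tosses)
    gaps.append(abs(m1 - m2))
    print(f"n={tosses:>6}: skeptic={m1:.3f}, optimist={m2:.3f}, gap={gaps[-1]:.4f}")
```

The gap between the two posterior means shrinks as the sample grows, which is the sense in which the data come to dominate the prior.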


Sequential Updating

Posterior as Prior for Next Observation

When data arrive one at a time, Bayes' theorem enables sequential updating. The posterior after observing $d_1$ becomes the prior for observing $d_2$:

$$P(\theta \mid d_1, d_2) = \frac{P(d_2 \mid \theta, d_1) \cdot P(\theta \mid d_1)}{P(d_2 \mid d_1)}$$

For observations that are conditionally independent given $\theta$ (as with i.i.d. data), $P(d_2 \mid \theta, d_1) = P(d_2 \mid \theta)$, and this simplifies to:

$$P(\theta \mid d_1, d_2) \propto P(d_2 \mid \theta) \cdot P(d_1 \mid \theta) \cdot P(\theta)$$

Each observation multiplies the current belief by the likelihood of that observation.
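
The sketch below checks this claim on a small grid of candidate values for $\theta$: updating one toss at a time gives the same posterior as processing all tosses at once. The grid, the uniform prior, and the toss sequence are assumptions made for the example.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def update(belief, thetas, observation):
    """One Bayesian update for a single coin toss (1 = heads, 0 = tails)."""
    likelihood = [t if observation == 1 else 1 - t for t in thetas]
    return normalize([l * b for l, b in zip(likelihood, belief)])

thetas = [i / 10 for i in range(1, 10)]   # candidate values of theta
prior = [1 / len(thetas)] * len(thetas)   # uniform prior (an assumption)
data = [1, 0, 1, 1]                       # hypothetical tosses

# Sequential: each posterior becomes the prior for the next toss.
belief = prior
for d in data:
    belief = update(belief, thetas, d)

# Batch: multiply all likelihoods at once, then normalize.
heads = sum(data)
tails = len(data) - heads
batch = normalize([p * t**heads * (1 - t)**tails for p, t in zip(prior, thetas)])

max_diff = max(abs(s - b) for s, b in zip(belief, batch))
print(f"largest difference between sequential and batch: {max_diff:.2e}")
```

Because multiplication is commutative, the order of the observations does not matter here either; only the conditional-independence assumption is needed.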


Medical Testing Revisited

The Disease Testing Problem

Given:

  • Prevalence (prior): $P(D) = 0.01$
  • Sensitivity: $P(+ \mid D) = 0.99$
  • Specificity: $P(- \mid \neg D) = 0.95$, so $P(+ \mid \neg D) = 0.05$

Step 1 — Marginal likelihood:

$$P(+) = P(+ \mid D)P(D) + P(+ \mid \neg D)P(\neg D) = 0.99(0.01) + 0.05(0.99) = 0.0594$$

Step 2 — Posterior:

$$P(D \mid +) = \frac{0.99 \times 0.01}{0.0594} \approx 0.1667$$

Despite 99% sensitivity, only 1 in 6 positive results is a true positive. The low prevalence (1%) causes the majority of positives to be false positives.
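
The two steps above can be sketched as a small function. The first call uses the numbers from this example; the loop over other prevalence values is an added illustration, and those prevalences are assumptions rather than data.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(D | +) from Bayes' theorem with the partition {D, not D}."""
    false_positive_rate = 1 - specificity
    # Step 1: marginal likelihood P(+) via the law of total probability.
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    # Step 2: posterior.
    return sensitivity * prevalence / p_positive

# Values from the worked example.
print(round(posterior_given_positive(0.01, 0.99, 0.95), 4))  # 0.1667

# Same test at other (assumed) prevalences: the posterior depends strongly on the prior.
for prev in (0.001, 0.01, 0.1, 0.5):
    p = posterior_given_positive(prev, 0.99, 0.95)
    print(f"prevalence={prev:<5} -> P(D | +) = {p:.3f}")
```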

Base Rate Neglect

The base rate fallacy (also called base rate neglect) is the tendency to ignore the prior probability $P(D)$ and focus only on the test accuracy. This is one of the most common reasoning errors in everyday life and in medical decision-making.
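
Counting people instead of multiplying probabilities often makes the base rate hard to overlook. The sketch below applies the rates from the disease testing problem to a hypothetical cohort of 10,000 people; the cohort size is an assumption chosen so that the counts come out as whole numbers.

```python
population = 10_000                      # hypothetical cohort size
diseased = population * 1 // 100         # 1% prevalence -> 100 people
healthy = population - diseased          # 9,900 people

true_positives = diseased * 99 // 100    # 99% sensitivity -> 99 people
false_positives = healthy * 5 // 100     # 5% false-positive rate -> 495 people

share_true = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(share_true, 4))  # 99 495 0.1667
```

The 495 false positives outnumber the 99 true positives simply because healthy people are far more numerous, which is exactly the information that base rate neglect throws away.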


Naive Bayes Classification

Naive Bayes Classifier

For a classification problem with classes $C_1, \ldots, C_k$ and features $X_1, \ldots, X_p$, Bayes' theorem gives:

$$P(C_j \mid X_1, \ldots, X_p) = \frac{P(X_1, \ldots, X_p \mid C_j) \, P(C_j)}{P(X_1, \ldots, X_p)}$$

The naive assumption is that features are conditionally independent given the class:

$$P(X_1, \ldots, X_p \mid C_j) = \prod_{i=1}^p P(X_i \mid C_j)$$

Despite this often-unrealistic assumption, Naive Bayes classifiers perform surprisingly well in practice (e.g., spam filtering, text classification).
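
To make the factorization concrete, here is a from-scratch sketch of a word-count Naive Bayes classifier. The training messages are hypothetical, and two choices go beyond the formulas above and are assumptions of this sketch: add-one (Laplace) smoothing to avoid zero probabilities, and summing logarithms instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter

# Tiny hypothetical training set: (tokens, label).
train = [
    (["win", "money", "now"], "spam"),
    (["free", "prize"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "deadline"], "ham"),
]

vocab = {w for tokens, _ in train for w in tokens}
classes = {label for _, label in train}
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior_scores(tokens):
    """Unnormalized log P(C_j) + sum_i log P(X_i | C_j) for each class."""
    scores = {}
    for c in classes:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / len(train))  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing: an assumption of this sketch.
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return scores

def classify(tokens):
    """Pick the class with the highest unnormalized log posterior."""
    scores = log_posterior_scores(tokens)
    return max(scores, key=scores.get)

print(classify(["free", "money"]))
```

The denominator $P(X_1, \ldots, X_p)$ is never computed because it is the same for every class and does not affect which class scores highest.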


The Role of the Prior

| Prior Type | Influence on Posterior | When Appropriate |
| --- | --- | --- |
| Diffuse (uniform) | Data dominate | Little prior knowledge |
| Informative | Prior and data compromise | Strong domain expertise |
| Conjugate | Posterior has same family as prior | Mathematical convenience |
| Jeffreys | Non-informative, transformation-invariant | Objective Bayesian analysis |

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. For example:

  • Beta prior + Binomial likelihood -> Beta posterior
  • Normal prior on the mean + Normal likelihood (known variance) -> Normal posterior
  • Gamma prior + Poisson likelihood -> Gamma posterior
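
As a sketch of the first pairing, the Beta-Binomial update only adds counts to the prior parameters. The prior Beta(2, 2) and the data (7 successes in 10 trials) are assumptions for illustration; the grid computation is included only as a numerical sanity check of the closed form.

```python
def beta_binomial_update(a, b, successes, trials):
    """Beta(a, b) prior + Binomial data -> Beta(a + successes, b + failures)."""
    return a + successes, b + (trials - successes)

a_post, b_post = beta_binomial_update(a=2, b=2, successes=7, trials=10)
closed_form_mean = a_post / (a_post + b_post)

# Numerical check on a grid: posterior ∝ likelihood × prior (constants cancel).
n_grid = 10_000
thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
weights = [
    t ** (2 - 1) * (1 - t) ** (2 - 1)   # Beta(2, 2) prior kernel
    * t ** 7 * (1 - t) ** 3             # binomial likelihood kernel
    for t in thetas
]
grid_mean = sum(t * w for t, w in zip(thetas, weights)) / sum(weights)

print(a_post, b_post)  # 9 5
print(round(closed_form_mean, 4), round(grid_mean, 4))
```

No integration is needed for the conjugate update itself, which is what makes these pairings analytically convenient.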

Bayes' Theorem in Machine Learning

[Figure: Bayes' theorem in machine learning. Naive Bayes: text classification. Bayesian nets: causal inference. MLE/MAP: parameter estimation. LLMs: next token prediction. Caption: Bayes' theorem is the theoretical foundation of probabilistic ML.]

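| ML Application | Bayes' Theorem Usage | Why |
| --- | --- | --- |
| Naive Bayes | $P(\text{class} \mid \text{features}) \propto P(\text{features} \mid \text{class}) \times P(\text{class})$ | Fast text classification |
| Bayesian optimization | $P(\text{optimal} \mid \text{data})$ | Hyperparameter tuning |
| MAP estimation | $\theta_{\text{MAP}} = \arg\max_\theta P(\text{data} \mid \theta) \times P(\theta)$ | Regularization = prior |
| LLMs | $P(\text{next\_token} \mid \text{context})$ | Autoregressive models such as GPT (BERT instead predicts masked tokens) |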

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes: direct application of Bayes' theorem
# Toy training set, far too small for real use
texts = ["win money now", "free prize guaranteed", "meeting tomorrow at 3",
         "project deadline friday", "claim your reward", "lunch with team"]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=not spam

# Convert each message into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Estimate class priors and per-class word likelihoods
# (MultinomialNB applies additive smoothing with alpha=1.0 by default)
model = MultinomialNB()
model.fit(X, labels)

# Predict new message; predict_proba columns follow model.classes_, here [0, 1]
new_msg = ["free money prize"]
new_X = vectorizer.transform(new_msg)
proba = model.predict_proba(new_X)[0]
print(f"Message: '{new_msg[0]}'")
print(f"P(not spam) = {proba[0]:.3f}")
print(f"P(spam)     = {proba[1]:.3f}")
print(f"Prediction: {'SPAM' if proba[1] > 0.5 else 'NOT SPAM'}")
print("\nThis IS Bayes' theorem in action!")
```
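
The "regularization = prior" row of the table can be illustrated with a coin-toss sketch comparing the maximum-likelihood estimate with the MAP estimate under a Beta prior. The Beta(2, 2) prior and the three-toss sample are assumptions for illustration; the MAP formula used is the mode of the Beta posterior, which requires both prior parameters to exceed 1.

```python
def mle(heads, tosses):
    """Maximum-likelihood estimate: the theta that maximizes P(data | theta)."""
    return heads / tosses

def map_estimate(heads, tosses, a, b):
    """Mode of the Beta(a + heads, b + tails) posterior; assumes a > 1 and b > 1."""
    return (heads + a - 1) / (tosses + a + b - 2)

heads, tosses = 3, 3  # hypothetical tiny sample: three heads in a row
print(mle(heads, tosses))                      # 1.0
print(map_estimate(heads, tosses, a=2, b=2))   # 0.8
```

With only three tosses, the MLE commits to "always heads", while the prior pulls the MAP estimate back toward 0.5, much as a penalty term pulls parameters toward zero in regularized regression.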

Key Takeaways

Summary: Bayes' Theorem

  • Posterior = (Likelihood × Prior) / Marginal Likelihood — the core formula
  • Prior = initial belief before seeing data; Posterior = updated belief after seeing data
  • The posterior is proportional to likelihood × prior; dividing by the marginal likelihood normalizes it so that it sums to 1
  • Sequential updating: each posterior becomes the prior for the next observation — beliefs update incrementally
  • Base rate neglect is a common cognitive bias — always consider the prior probability
  • Naive Bayes applies Bayes' theorem with conditional independence — surprisingly effective in practice
  • Conjugate priors yield closed-form posteriors — essential for analytical tractability
