Introduction to Probability
Probability Theory
The Mathematics of Uncertainty
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to an event, measuring how likely that event is to occur.
- Classical interpretation — Equal likelihood of outcomes; the fair coin, the rolled die
- Frequentist interpretation — Long-run relative frequency from infinite repetitions
- Subjective interpretation — Personal degree of belief updated by evidence
- Kolmogorov axioms — The three mathematical rules that make probability work
Probability is not about certainty — it is about quantifying what we do not know.
What is Probability?
Definition
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to an event, measuring how likely that event is to occur. A probability of 0 means the event is impossible; a probability of 1 means it is certain.
The foundations of modern probability were laid by Kolmogorov (1933), who axiomatized probability as a measure on a set of outcomes.
Three Interpretations of Probability
| Interpretation | Definition | Formalization | Example |
|---|---|---|---|
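| Classical | Equal likelihood of outcomes | $P(A) = \frac{\lvert A \rvert}{\lvert \Omega \rvert}$ | Fair coin: $P(\text{heads}) = \frac{1}{2}$ |
| Frequentist | Long-run relative frequency | $P(A) = \lim_{n \to \infty} \frac{n_A}{n}$ | 48 heads in 100 tosses -> $P(\text{heads}) \approx 0.48$ |
| Subjective | Personal degree of belief | Bayesian: $P(A)$ encodes belief | "I'm 80% sure it will rain" |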
The Frequentist Interpretation
The frequentist interpretation defines probability as the limit of relative frequency in an infinite sequence of i.i.d. repetitions. This is formalized by the Strong Law of Large Numbers: if $A$ occurs with probability $p$ in each trial and $n_A$ counts its occurrences in the first $n$ trials, then $n_A / n \to p$ almost surely as $n \to \infty$.
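The convergence of relative frequency can be illustrated with a small simulation. This is a minimal sketch, assuming a fair coin ($p = 0.5$) and Python's standard pseudo-random generator; the function name `relative_frequency` and the fixed seed are choices made here for illustration. A finite simulation can only suggest the limit, it does not prove it.

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials i.i.d. Bernoulli(p) trials and return n_A / n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials

p = 0.5  # fair coin, assumed for illustration
for n in (10, 100, 10_000, 1_000_000):
    freq = relative_frequency(p, n)
    print(f"n = {n:>9,}: relative frequency = {freq:.4f}, |error| = {abs(freq - p):.4f}")
```

The error typically shrinks as the number of trials grows, although it need not decrease monotonically from one row to the next.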
The Axioms of Probability
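Kolmogorov Axioms (1933)
Let $\Omega$ be the sample space (set of all possible outcomes) and $\mathcal{F}$ a $\sigma$-algebra of events. A probability measure $P : \mathcal{F} \to \mathbb{R}$ satisfies:
- Non-negativity: $P(A) \ge 0$ for every event $A \in \mathcal{F}$.
- Normalization: $P(\Omega) = 1$.
- Countable additivity: If $A_1, A_2, \ldots$ are pairwise disjoint events, then:
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$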
Finite Additivity vs Countable Additivity
Kolmogorov's third axiom requires countable (not just finite) additivity. This is necessary for rigorous measure-theoretic probability, particularly when dealing with infinite sample spaces (e.g., the uniform distribution on $[0, 1]$).
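As one illustration of why the countable version matters, suppose $P$ is the uniform (length) measure on the half-open unit interval $(0, 1]$. Both the interval and the dyadic decomposition below are assumptions chosen for this example. Splitting the interval into countably many disjoint pieces, countable additivity recovers a total probability of 1, whereas finite additivity alone only bounds the total below by each partial sum:

```latex
(0, 1] = \bigcup_{n=1}^{\infty} \left( 2^{-n},\; 2^{-(n-1)} \right],
\qquad
P\big((0, 1]\big)
  = \sum_{n=1}^{\infty} \left( 2^{-(n-1)} - 2^{-n} \right)
  = \sum_{n=1}^{\infty} 2^{-n}
  = 1 .
```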
Immediate Consequences of the Axioms
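Basic Properties from Kolmogorov Axioms
From the three axioms, we can derive:
- $P(\emptyset) = 0$ (the empty event has probability zero)
- $P(A^c) = 1 - P(A)$ (complement rule)
- $P(A) \le 1$ (upper bound)
- If $A \subseteq B$, then $P(A) \le P(B)$ (monotonicity)
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion)
- Boole's inequality: $P\left(\bigcup_{i} A_i\right) \le \sum_{i} P(A_i)$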
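These consequences can be checked by hand on a small finite sample space. The sketch below assumes a single roll of a fair six-sided die with the classical (equally likely) measure; the helper `prob` and the event names are hypothetical, introduced only for this check. Exact fractions are used so the equalities hold without floating-point error.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an assumed example).
omega = frozenset(range(1, 7))

def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event & omega), len(omega))

even = frozenset({2, 4, 6})
at_least_four = frozenset({4, 5, 6})
six = frozenset({6})

# Empty event has probability 0; the whole space has probability 1.
assert prob(frozenset()) == 0
assert prob(omega) == 1

# Complement rule: P(A^c) = 1 - P(A)
assert prob(omega - even) == 1 - prob(even)

# Monotonicity: {6} is a subset of {4, 5, 6}
assert six <= at_least_four and prob(six) <= prob(at_least_four)

# Boole's inequality: P(A ∪ B) <= P(A) + P(B); strict here because the events overlap
union = even | at_least_four
assert prob(union) < prob(even) + prob(at_least_four)

print(prob(union), prob(even) + prob(at_least_four))  # prints: 2/3 1
```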
The Addition Rule
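Addition Rule for Two Events
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Here,
- $A \cup B$ = Event A or B (or both)
- $A \cap B$ = Event A and B
- $P(A \cap B)$ = Joint probability
For mutually exclusive events ($A \cap B = \emptyset$):
$$P(A \cup B) = P(A) + P(B)$$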
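A worked instance of the addition rule, as a sketch: drawing one card from a standard 52-card deck (an assumed example), the probability of "heart or king" must subtract the king of hearts, which would otherwise be counted twice. The names `deck`, `hearts`, `kings`, and `spades` are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Classical probability of an event (a set of cards) for a single draw."""
    return Fraction(len(event), len(deck))

hearts = {card for card in deck if card[1] == "hearts"}
spades = {card for card in deck if card[1] == "spades"}
kings = {card for card in deck if card[0] == "K"}

# General case: hearts and kings overlap in the king of hearts.
lhs = prob(hearts | kings)
rhs = prob(hearts) + prob(kings) - prob(hearts & kings)
assert lhs == rhs
print(lhs)  # prints: 4/13  (that is, 16/52)

# Mutually exclusive case: no card is both a heart and a spade.
assert hearts & spades == set()
assert prob(hearts | spades) == prob(hearts) + prob(spades)
```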
Conditional Probability and Independence
Conditional Probability
For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This defines a valid probability measure on $\Omega$ for fixed $B$.
Independence
Events $A$ and $B$ are independent if and only if:
$$P(A \cap B) = P(A)\,P(B)$$
Equivalently, when $P(B) > 0$: $P(A \mid B) = P(A)$ (knowing $B$ occurred does not change the probability of $A$).
Mutual Exclusivity vs Independence
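Mutually exclusive ($A \cap B = \emptyset$) and independent ($P(A \cap B) = P(A)\,P(B)$) are not the same concept. In fact, if $P(A) > 0$ and $P(B) > 0$, then mutual exclusivity implies dependence (since $P(A \cap B) = 0 \ne P(A)\,P(B)$).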
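The difference between the two notions shows up clearly when both are computed on the same sample space. The following sketch assumes two rolls of a fair die (36 equally likely outcomes); the helpers `cond_prob` and `independent` and the event names are hypothetical, written only for this illustration.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # all (first roll, second roll) pairs

def prob(event):
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b); only defined when P(b) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

def independent(a, b):
    """Check the product rule P(a ∩ b) = P(a) P(b) exactly."""
    return prob(a & b) == prob(a) * prob(b)

first_is_six = {w for w in omega if w[0] == 6}
second_is_six = {w for w in omega if w[1] == 6}
sum_is_seven = {w for w in omega if sum(w) == 7}
sum_is_two = {w for w in omega if sum(w) == 2}

print(cond_prob(sum_is_seven, first_is_six))     # prints: 1/6
print(independent(first_is_six, second_is_six))  # prints: True
print(independent(first_is_six, sum_is_seven))   # prints: True
# Disjoint events with positive probability are dependent:
print(first_is_six & sum_is_two == set())        # prints: True
print(independent(first_is_six, sum_is_two))     # prints: False
```

Note that "first roll is six" and "sum is seven" are independent even though they overlap, while "first roll is six" and "sum is two" never occur together and are therefore dependent.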
The Multiplication Rule
Multiplication Rule
$$P(A \cap B) = P(A)\,P(B \mid A)$$
Here,
- $P(A \cap B)$ = Joint probability of A and B
- $P(B \mid A)$ = Conditional probability of B given A
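As a sketch of the multiplication rule in use, consider drawing two cards without replacement from a standard 52-card deck (an assumed example) and asking for the probability that both are aces. The rule gives the answer directly; enumerating every ordered pair of distinct cards confirms it. Encoding the ace as rank 1 is an arbitrary choice made here.

```python
from fractions import Fraction
from itertools import permutations

# 52 cards as (rank, suit); rank 1 stands for the ace (illustrative encoding).
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]

# Multiplication rule: P(A ∩ B) = P(A) * P(B | A)
p_first_ace = Fraction(4, 52)               # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards
p_both_by_rule = p_first_ace * p_second_ace_given_first

# Brute-force check: all ordered draws of two distinct cards are equally likely.
draws = list(permutations(deck, 2))
both_aces = [d for d in draws if d[0][0] == 1 and d[1][0] == 1]
p_both_by_counting = Fraction(len(both_aces), len(draws))

assert p_both_by_rule == p_both_by_counting
print(p_both_by_rule)  # prints: 1/221
```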
Total Probability and Bayes' Theorem
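Law of Total Probability
If $B_1, B_2, \ldots, B_n$ partition the sample space (pairwise disjoint, union = $\Omega$) with each $P(B_i) > 0$, then for any event $A$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
Bayes' Theorem
$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{P(A)}$$
or equivalently:
$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)}$$
Bayes' theorem is the foundation of Bayesian statistics: it updates prior beliefs $P(B_i)$ in light of observed data $A$ to produce posterior beliefs $P(B_i \mid A)$.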
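A small numerical sketch may help make the update concrete. The function `posterior` below is a hypothetical helper that applies Bayes' theorem over a finite partition, using the law of total probability for the denominator. The diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are invented for illustration and do not describe any real test.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(B_i | A) for each hypothesis B_i in a finite partition.

    priors:      dict mapping hypothesis -> P(B_i), must sum to 1
    likelihoods: dict mapping hypothesis -> P(A | B_i)
    """
    if sum(priors.values()) != 1:
        raise ValueError("priors must sum to 1 over the partition")
    # Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    if evidence == 0:
        raise ValueError("observed event has probability zero")
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Illustrative (made-up) numbers for a positive test result.
priors = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}
likelihoods = {"disease": Fraction(95, 100), "healthy": Fraction(5, 100)}

post = posterior(priors, likelihoods)
print(post["disease"], f"(about {float(post['disease']):.3f})")  # prints: 19/118 (about 0.161)
```

Under these assumed numbers, a positive result raises the probability of disease from 1% to roughly 16%, because false positives among the large healthy group outnumber true positives.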
Counting Principles
Fundamental Counting Principle
If task 1 can be done in $n_1$ ways, task 2 in $n_2$ ways, ..., task $k$ in $n_k$ ways, then the total number of ways to perform all tasks is $n_1 \times n_2 \times \cdots \times n_k$.
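Permutations and Combinations
$$P(n, k) = \frac{n!}{(n-k)!}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
Here,
- $P(n, k)$ = Number of ordered arrangements of k items from n
- $\binom{n}{k}$ = Number of unordered selections of k items from n
- $n!$ = n factorial: n × (n−1) × ⋯ × 1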
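The counting formulas can be cross-checked against explicit enumeration. This sketch uses `math.perm` and `math.comb` (available in Python 3.8 and later) together with `itertools`; the five-letter item set, the menu sizes, and the coin-toss question are assumptions chosen to keep the enumeration small.

```python
import math
from fractions import Fraction
from itertools import combinations, permutations, product

n, k = 5, 3
items = "ABCDE"

# Ordered arrangements: n! / (n - k)!
assert math.perm(n, k) == len(list(permutations(items, k))) == 60

# Unordered selections: n! / (k! (n - k)!)
assert math.comb(n, k) == len(list(combinations(items, k))) == 10

# Each unordered selection can be ordered in k! ways.
assert math.perm(n, k) == math.comb(n, k) * math.factorial(k)

# Fundamental counting principle: 3 starters x 4 mains x 2 desserts (made-up menu).
menus = list(product(range(3), range(4), range(2)))
assert len(menus) == 3 * 4 * 2

# Counting turned into a probability: exactly 2 heads in 4 fair coin tosses.
p_two_heads = Fraction(math.comb(4, 2), 2 ** 4)
print(p_two_heads)  # prints: 3/8
```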
Probability in Machine Learning
| ML Application | Probability Usage | Why |
|---|---|---|
| Classification | $P(\text{class} \mid \text{features})$ | Core of supervised learning |
| Naive Bayes | $P(\text{feature} \mid \text{class}) \times P(\text{class})$ | Text classification baseline |
| Bayesian optimization | $P(\text{optimal params} \mid \text{data})$ | Hyperparameter tuning |
| Uncertainty estimation | Confidence intervals | Trustworthy predictions |
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris data and hold out 30% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0  # fixed seed for a repeatable split
)

# Gaussian Naive Bayes: applies Bayes' theorem, assuming the features are
# conditionally independent given the class
model = GaussianNB()
model.fit(X_train, y_train)

# Posterior class probabilities P(class | features) for the first three test samples
proba = model.predict_proba(X_test[:3])
print("Naive Bayes predictions (probability):")
for i, p in enumerate(proba):
    print(f"  Sample {i}: {p.round(3)} → class {np.argmax(p)}")

print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print("ML IS applied probability theory!")
```
Key Takeaways
Summary: Probability Foundations
- Probability quantifies uncertainty — ranges from 0 (impossible) to 1 (certain)
- Three interpretations: classical (equally likely), frequentist (long-run frequency), subjective (belief)
- Kolmogorov axioms form the mathematical foundation: non-negativity, normalization, countable additivity
- Conditional probability is defined as $P(A \mid B) = P(A \cap B) / P(B)$ for $P(B) > 0$, the basis for all inference
- Independence means $P(A \cap B) = P(A)\,P(B)$, which is distinct from mutual exclusivity
- Bayes' theorem updates prior beliefs given observed data — the engine of Bayesian inference
- Counting principles (permutations, combinations) enable computation of probabilities in finite sample spaces