Introduction to Probability

Probability Theory

The Mathematics of Uncertainty

Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to an event, measuring how likely that event is to occur.

Classical interpretation — Equal likelihood of outcomes; the fair coin, the rolled die
Frequentist interpretation — Long-run relative frequency from infinite repetitions
Subjective interpretation — Personal degree of belief updated by evidence
Kolmogorov axioms — The three mathematical rules that make probability work

Probability is not about certainty — it is about quantifying what we do not know.

What is Probability?

Definition

Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to an event, measuring how likely that event is to occur. A probability of 0 means the event is impossible; a probability of 1 means it is certain.

The foundations of modern probability were laid by Kolmogorov (1933), who axiomatized probability as a normalized measure on a collection of events built from a set of outcomes.

Three Interpretations of Probability

Interpretation	Definition	Formalization	Example
Classical	Equal likelihood of outcomes	$P(A) = \frac{n(A)}{n(S)}$	Fair coin: $P(H) = 1/2$
Frequentist	Long-run relative frequency	$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$	48 heads in 100 tosses -> $\hat{P}(H) \approx 0.48$
Subjective	Personal degree of belief	Bayesian: $P(A) \in [0,1]$ encodes belief	"I'm 80% sure it will rain"

The Frequentist Interpretation

The frequentist interpretation defines probability as the limiting relative frequency of an event in an infinite sequence of independent, identically distributed (i.i.d.) repetitions. Within the axiomatic theory, the Strong Law of Large Numbers makes this precise: if $A$ occurs with probability $P(A)$ in each trial and $n_A$ counts the occurrences of $A$ in the first $n$ trials, then $\frac{n_A}{n} \to P(A)$ almost surely as $n \to \infty$.
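The convergence of relative frequency can be seen in a small simulation. This is only a sketch: the helper name `relative_frequency`, the seed, and the sample sizes are illustrative choices, and a finite simulation suggests the limit rather than proving it.

```python
import random

def relative_frequency(n_tosses, p_heads=0.5, seed=0):
    """Simulate n_tosses independent coin tosses; return the fraction of heads."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses

# The relative frequency n_A / n should settle near P(H) = 0.5 as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: relative frequency of heads = {relative_frequency(n):.4f}")
```

For small `n` the estimate can be far from 0.5; the law of large numbers only describes what happens as `n` grows.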

The Axioms of Probability

Kolmogorov Axioms (1933)

Let $\Omega$ be the sample space (set of all possible outcomes) and $\mathcal{F}$ a $\sigma$-algebra of events. A probability measure $P: \mathcal{F} \to [0,1]$ satisfies:

Non-negativity: $P(A) \geq 0$ for every event $A \in \mathcal{F}$.
Normalization: $P(\Omega) = 1$.
Countable additivity: If $A_1, A_2, \ldots$ are pairwise disjoint events, then:

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$
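On a finite sample space the axioms can be checked directly. The sketch below assumes a fair six-sided die with equally likely outcomes; the names `OMEGA` and `prob` are illustrative. On a finite space, countable additivity reduces to finite additivity, so only that case is exercised here.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative choice).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under equally likely outcomes."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))

evens = {2, 4, 6}
low_odds = {1, 3}  # disjoint from evens

print(prob(evens) >= 0)                                        # non-negativity
print(prob(OMEGA) == 1)                                        # normalization
print(prob(evens | low_odds) == prob(evens) + prob(low_odds))  # additivity for disjoint events
```

Using `Fraction` keeps the arithmetic exact, so the equalities are tested without floating-point rounding.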

Finite Additivity vs Countable Additivity

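Kolmogorov's third axiom requires countable, not merely finite, additivity. Finite additivity alone says nothing about limits of events; countable additivity is what lets probabilities pass through countable unions and monotone limits. This is needed for rigorous measure-theoretic probability on infinite sample spaces (e.g., the uniform distribution on $[0,1]$).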
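A standard illustration, not taken from the text above: toss a fair coin until the first head appears, and let outcome $k$ mean "first head on toss $k$", so that $\Omega = \{1, 2, 3, \ldots\}$ with $P(\{k\}) = 2^{-k}$. Both normalization and the probability of "the first head arrives on an even toss" require summing over infinitely many disjoint outcomes, which is exactly what countable additivity permits:

$$
\begin{aligned}
P(\Omega) &= \sum_{k=1}^{\infty} 2^{-k} = 1, \\
P(\text{first head on an even toss}) &= \sum_{j=1}^{\infty} 2^{-2j} = \sum_{j=1}^{\infty} \left(\tfrac{1}{4}\right)^{j} = \frac{1/4}{1 - 1/4} = \frac{1}{3}.
\end{aligned}
$$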

Immediate Consequences of the Axioms

Theorem: Basic Properties from Kolmogorov Axioms

From the three axioms, we can derive:

$P(\emptyset) = 0$ (the empty event has probability zero)
$P(A^c) = 1 - P(A)$ (complement rule)
$P(A) \leq 1$ (upper bound)
If $A \subseteq B$, then $P(A) \leq P(B)$ (monotonicity)
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion)
Boole's inequality: $P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$
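As a sketch of how three of these properties follow, each derivation below splits a set into disjoint pieces and applies additivity (finite additivity follows from the countable version together with $P(\emptyset) = 0$):

$$
\begin{aligned}
1 = P(\Omega) &= P(A \cup A^c) = P(A) + P(A^c) && \Rightarrow\; P(A^c) = 1 - P(A), \\
P(B) &= P(A) + P(B \setminus A) \geq P(A) && \text{for } A \subseteq B, \\
P(A \cup B) &= P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B) && \text{since } P(B) = P(A \cap B) + P(B \setminus A).
\end{aligned}
$$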

The Addition Rule

Addition Rule for Two Events

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Here,

$A \cup B$ = Event A or B (or both)
$A \cap B$ = Event A and B
$P(A \cap B)$ = Joint probability

For mutually exclusive events ($A \cap B = \emptyset$):

$$P(A \cup B) = P(A) + P(B)$$
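A worked example may help here. The sketch below assumes a standard 52-card deck with every card equally likely, and computes the probability of drawing a heart or a face card two ways: by counting the union directly and by the addition rule. The deck construction and the helper `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = set(product(RANKS, SUITS))  # 52 equally likely (rank, suit) pairs

def prob(event):
    """Classical probability: favourable cards over total cards."""
    return Fraction(len(event), len(DECK))

hearts = {card for card in DECK if card[1] == "hearts"}
faces = {card for card in DECK if card[0] in {"J", "Q", "K"}}

p_union = prob(hearts | faces)                               # count the union directly
p_rule = prob(hearts) + prob(faces) - prob(hearts & faces)   # addition rule

print(p_union, p_rule)
```

The subtraction removes the three cards (jack, queen, and king of hearts) that would otherwise be counted twice.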

Conditional Probability and Independence

Definition: Conditional Probability

For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

For fixed $B$, the map $A \mapsto P(A \mid B)$ is itself a valid probability measure on $\Omega$.
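The definition can be applied by direct enumeration. The sketch below assumes two fair dice (36 equally likely ordered outcomes) and asks for the probability that the sum is 8 given that the first die is even; the helper names `prob` and `cond_prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

OMEGA = set(product(range(1, 7), repeat=2))  # ordered rolls of two fair dice

def prob(event):
    """Classical probability under equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)

sum_is_8 = {roll for roll in OMEGA if sum(roll) == 8}
first_is_even = {roll for roll in OMEGA if roll[0] % 2 == 0}

print(prob(sum_is_8))                      # unconditional probability
print(cond_prob(sum_is_8, first_is_even))  # probability after conditioning
print(cond_prob(OMEGA, first_is_even))     # normalization still holds given B
```

Conditioning restricts attention to the 18 outcomes in `first_is_even` and renormalizes, which is why the last line evaluates to 1.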

Definition: Independence

Events $A$ and $B$ are independent if and only if:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently, when $P(B) > 0$: $P(A \mid B) = P(A)$ (knowing that $B$ occurred does not change the probability of $A$).

Mutual Exclusivity vs Independence

Mutually exclusive ($A \cap B = \emptyset$) and independent ($P(A \cap B) = P(A)P(B)$) are not the same concept. In fact, if $P(A) > 0$ and $P(B) > 0$, then mutual exclusivity implies dependence (since $P(A \cap B) = 0 \neq P(A)P(B)$).
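The distinction can be checked on a single roll of a fair die. In this sketch the events and the helper `independent` are illustrative choices: one pair of events overlaps and is independent, the other pair is disjoint and therefore dependent.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # one roll of a fair die (illustrative)

def prob(event):
    """Classical probability of an event under equally likely outcomes."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))

def independent(a, b):
    """True when P(A ∩ B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = frozenset({2, 4, 6})
at_most_four = frozenset({1, 2, 3, 4})
odd = frozenset({1, 3, 5})

# Overlapping events can be independent: P(even ∩ at_most_four) = 1/3 = 1/2 * 2/3.
print(independent(even, at_most_four))
# Disjoint events with positive probability are dependent: 0 != 1/2 * 1/2.
print(independent(even, odd))
```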

The Multiplication Rule

Multiplication Rule

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$$

Here,

$P(A \cap B)$ = Joint probability of A and B
$P(B \mid A)$ = Conditional probability of B given A
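A common use of the rule is sampling without replacement. The sketch below assumes a standard 52-card deck and computes the probability that the first two cards drawn are both aces, first with the multiplication rule and then by brute-force enumeration of all ordered pairs of cards as a cross-check.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(A ∩ B) = P(A) * P(B | A)
p_first_ace = Fraction(4, 52)               # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards
p_both_rule = p_first_ace * p_second_ace_given_first

# Cross-check: enumerate every ordered pair of distinct card positions.
deck = ["ace"] * 4 + ["other"] * 48
pairs = list(permutations(deck, 2))  # 52 * 51 ordered draws
p_both_count = Fraction(sum(pair == ("ace", "ace") for pair in pairs), len(pairs))

print(p_both_rule, p_both_count)
```

The second factor is a conditional probability because removing the first ace changes the composition of the deck.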

Total Probability and Bayes' Theorem

Theorem: Law of Total Probability

If $B_1, B_2, \ldots, B_k$ partition the sample space (pairwise disjoint, with union $\Omega$) and each $P(B_i) > 0$, then for any event $A$:

$$P(A) = \sum_{i=1}^k P(A \cap B_i) = \sum_{i=1}^k P(A \mid B_i) P(B_i)$$

Theorem: Bayes' Theorem

For a partition $B_1, \ldots, B_k$ as in the law of total probability and any event $A$ with $P(A) > 0$:

$$P(B_i \mid A) = \frac{P(A \mid B_i) \, P(B_i)}{\sum_{j=1}^k P(A \mid B_j) \, P(B_j)}$$

or equivalently:

$$P(B_i \mid A) = \frac{P(A \mid B_i) \, P(B_i)}{P(A)}$$

Bayes' theorem is the foundation of Bayesian statistics: it updates prior beliefs $P(B_i)$ in light of observed data $A$ to produce posterior beliefs $P(B_i \mid A)$.
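The two results combine naturally in code: the denominator of Bayes' theorem is the law of total probability. The sketch below uses a diagnostic-test scenario with made-up numbers (1% prevalence, 95% sensitivity, 5% false-positive rate); the function name `posterior` and all the figures are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return [P(B_i | A)] given priors [P(B_i)] and likelihoods [P(A | B_i)]."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A ∩ B_i)
    evidence = sum(joint)                                  # P(A), by total probability
    if evidence == 0:
        raise ValueError("P(A) must be positive")
    return [j / evidence for j in joint]

# Hypothetical numbers for illustration only.
priors = [Fraction(1, 100), Fraction(99, 100)]       # P(disease), P(no disease)
likelihoods = [Fraction(95, 100), Fraction(5, 100)]  # P(positive | disease), P(positive | no disease)

post = posterior(priors, likelihoods)
print(f"P(disease | positive) = {post[0]} ≈ {float(post[0]):.3f}")
```

Under these assumed numbers the posterior stays well below the test's sensitivity, because the low prior means most positive results come from the much larger healthy group.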

Counting Principles

Theorem: Fundamental Counting Principle

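If task 1 can be done in $n_1$ ways, task 2 in $n_2$ ways, ..., and task $k$ in $n_k$ ways, where the number of ways to do each task does not depend on how the earlier tasks were done, then the total number of ways to perform all $k$ tasks in sequence is $n_1 \times n_2 \times \cdots \times n_k$.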
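The principle can be confirmed by listing every combination. In this sketch the wardrobe items are made up for illustration; `itertools.product` enumerates all sequences of choices, and the count is compared with the product of the individual counts.

```python
from itertools import product
from math import prod

# Hypothetical choices for three independent "tasks".
choices = {
    "shirt": ["red", "blue", "green"],
    "trousers": ["jeans", "chinos"],
    "shoes": ["boots", "sneakers"],
}

predicted = prod(len(options) for options in choices.values())  # n_1 * n_2 * n_3
outfits = list(product(*choices.values()))                      # explicit enumeration

print(predicted, len(outfits))
```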

Permutations and Combinations

$$\text{Permutations: } P(n,k) = \frac{n!}{(n-k)!} \qquad \text{Combinations: } \binom{n}{k} = \frac{n!}{k!(n-k)!}$$

Here,

$P(n,k)$ = Number of ordered arrangements of k items from n
$\binom{n}{k}$ = Number of unordered selections of k items from n
$n!$ = n factorial: n × (n−1) × ⋯ × 1, with $0! = 1$
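Python's standard library exposes both counts as `math.perm` and `math.comb` (available from Python 3.8). The sketch below checks them against the factorial formulas and then uses combinations for a probability: the chance that a five-card hand from a standard 52-card deck contains only hearts. The card scenario is an illustrative choice.

```python
from fractions import Fraction
from math import comb, factorial, perm

n, k = 5, 2

# Both library functions agree with the factorial formulas.
print(perm(n, k) == factorial(n) // factorial(n - k))
print(comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k)))

# Each unordered selection corresponds to k! ordered arrangements.
print(perm(n, k) == comb(n, k) * factorial(k))

# Probability that a 5-card hand consists of hearts only:
# favourable hands C(13, 5) over all hands C(52, 5).
p_all_hearts = Fraction(comb(13, 5), comb(52, 5))
print(p_all_hearts, float(p_all_hearts))
```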

Probability in Machine Learning

ML Application	Probability Usage	Why
Classification	P(class | features)	Core of supervised learning
Naive Bayes	P(feature | class) × P(class)	Text classification baseline
Bayesian optimization	P(optimal params | data)	Hyperparameter tuning
Uncertainty estimation	Confidence intervals	Trustworthy predictions

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

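# Load the iris data and hold out 30% for testing; the fixed seed keeps the split reproducible
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Gaussian Naive Bayes: applies Bayes' theorem, assuming the features are
# conditionally independent given the class and Gaussian within each class
model = GaussianNB()
model.fit(X_train, y_train)
# Posterior class probabilities P(class | features) for the first three test samples
proba = model.predict_proba(X_test[:3])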

print("Naive Bayes predictions (probability):")
for i, p in enumerate(proba):
    print(f"  Sample {i}: {p.round(3)} → class {np.argmax(p)}")
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print("ML IS applied probability theory!")

Key Takeaways

Summary: Probability Foundations

Probability quantifies uncertainty — ranges from 0 (impossible) to 1 (certain)
Three interpretations: classical (equally likely), frequentist (long-run frequency), subjective (belief)
Kolmogorov axioms form the mathematical foundation: non-negativity, normalization, countable additivity
Conditional probability is defined as $P(A \mid B) = P(A \cap B)/P(B)$ for $P(B) > 0$: the basis for probabilistic inference
Independence means $P(A \cap B) = P(A)P(B)$ — distinct from mutual exclusivity
Bayes' theorem updates prior beliefs given observed data — the engine of Bayesian inference
Counting principles (permutations, combinations) enable computation of probabilities in finite sample spaces
