ML Foundations

The Mathematical Backbone of Every ML Algorithm

Linear algebra, calculus, and probability form the foundation of all machine learning. Master these concepts to truly understand how algorithms work.

Linear Algebra — Vectors, matrices, and the language of data
Calculus — Derivatives and gradient descent for optimization
Probability and Statistics — Bayes' theorem, distributions, and inference

"Mathematics is the language in which God has written the universe."

Math Foundations for Machine Learning

Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.

Linear Algebra

Vectors and Vector Operations

Definition: Vector

A vector $\mathbf{v} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers that represents a point or direction in $n$-dimensional space. It is written as $\mathbf{v} = [v_1, v_2, \ldots, v_n]^T$.

Vector Addition

\vec{u} + \vec{v} = [u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n]

Here,

$\vec{u}, \vec{v}$ =Two vectors of the same dimension
$\vec{u} + \vec{v}$ =Resultant vector with summed components

Dot Product

\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i w_i = \|\vec{v}\| \|\vec{w}\| \cos\theta

Here,

$\vec{v} \cdot \vec{w}$ =The dot product (scalar result)
$\theta$ =Angle between the two vectors
$\|\vec{v}\|$ =L2 norm (magnitude) of v

Example: Vector Operations

If $\vec{v} = [1, 2, 3]$ and $\vec{w} = [4, 5, 6]$ :

\vec{v} + \vec{w} = [1+4, 2+5, 3+6] = [5, 7, 9]

\vec{v} \cdot \vec{w} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32

\|\vec{v}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74

\|\vec{w}\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{77} \approx 8.77

\cos\theta = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \|\vec{w}\|} = \frac{32}{\sqrt{14} \, \sqrt{77}} \approx 0.9746
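The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch using only the standard library; the helper names `add`, `dot`, `norm`, and `cosine` are illustrative choices, not part of any library.

```python
import math

def add(u, v):
    """Component-wise sum of two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product: sum of pairwise products (a scalar)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """L2 norm, computed as the square root of v . v."""
    return math.sqrt(dot(v, v))

def cosine(u, v):
    """Cosine of the angle between u and v (both must be non-zero)."""
    return dot(u, v) / (norm(u) * norm(v))

v = [1, 2, 3]
w = [4, 5, 6]

vec_sum = add(v, w)
vec_dot = dot(v, w)
v_norm = norm(v)
cos_theta = cosine(v, w)

print(vec_sum, vec_dot, round(v_norm, 2), round(cos_theta, 4))
# [5, 7, 9] 32 3.74 0.9746
```

In practice these operations would usually be delegated to an array library, but the explicit versions make the definitions easy to check by hand.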

Matrices

Definition: Matrix

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a 2D array of $m$ rows and $n$ columns. In ML, the design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ stores $N$ samples with $d$ features each, where row $i$ is the feature vector $\mathbf{x}^{(i)T}$.

Matrix Operations in ML

Matrix Multiplication

(\mathbf{AB})_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

Here,

$\mathbf{A} \in \mathbb{R}^{m \times n}$ =Left matrix
$\mathbf{B} \in \mathbb{R}^{n \times p}$ =Right matrix
$\mathbf{C} = \mathbf{AB} \in \mathbb{R}^{m \times p}$ =Result matrix

Computational Complexity

Naive matrix multiplication is $O(n^3)$ for $n \times n$ matrices. In practice, optimized BLAS/LAPACK implementations and GPU parallelism make this efficient. Neural network forward passes are dominated by matrix multiplications.
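To make the cubic cost concrete, here is a sketch of the naive algorithm written directly from the summation formula. It is for illustration only: the function name `matmul` and the nested-list representation are assumptions, and real workloads would rely on an optimized library instead.

```python
def matmul(A, B):
    """Naive matrix product of an m x n matrix A and an n x p matrix B.

    Matrices are lists of rows. The three nested loops perform m * p * n
    multiply-adds, which is n^3 for square n x n inputs.
    """
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError("inner dimensions must match")
    p = len(B[0])

    C = [[0.0] * p for _ in range(m)]
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3 x 2

C = matmul(A, B)
print(C)
# [[58.0, 64.0], [139.0, 154.0]]
```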

Calculus

Derivatives and Gradients

Definition: Derivative

The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ measures the instantaneous rate of change of $f$ at $x$. Geometrically, it gives the slope of the tangent line at point $x$.

Power Rule for Derivatives

f(x) = x^n \Rightarrow f'(x) = nx^{n-1}

Here,

$f(x)$ =The original function
$f'(x)$ =The derivative of f with respect to x
$n$ =The exponent
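A quick way to build trust in the power rule is to compare it with a numerical estimate. The sketch below uses a central difference rather than the one-sided limit in the definition, because it is more accurate for a fixed step; the step size `h` and the function names are illustrative assumptions.

```python
def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def power_rule(n, x):
    """Analytic derivative of x**n, i.e. n * x**(n - 1)."""
    return n * x ** (n - 1)

n = 3
x0 = 2.0

analytic = power_rule(n, x0)                          # 3 * 2^2 = 12
numeric = numerical_derivative(lambda x: x ** n, x0)  # close to 12

print(analytic, abs(numeric - analytic) < 1e-5)
# 12.0 True
```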

Gradient Descent

Definition: Gradient Descent

Gradient descent is the core optimization algorithm in ML. It iteratively adjusts model parameters to minimize a loss function $L(\theta)$ by moving in the direction opposite to the gradient $\nabla_\theta L(\theta)$.

Gradient Descent Update Rule

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

Here,

$\theta$ =Model parameters (weights and biases)
$\alpha$ =Learning rate (step size), typically $10^{-3}$ to $10^{-1}$
$\nabla_\theta L$ =Gradient of loss function w.r.t. parameters
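The update rule can be sketched on a one-parameter problem. The loss $L(\theta) = (\theta - 3)^2$, the starting point, the learning rate of 0.1, and the step count are all assumptions chosen so that the behavior is easy to verify by hand; they are not recommendations for real models.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeatedly apply theta <- theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

def loss_gradient(theta):
    """Gradient of L(theta) = (theta - 3)^2, which is 2 * (theta - 3)."""
    return 2.0 * (theta - 3.0)

theta_star = gradient_descent(loss_gradient, theta0=0.0)
print(round(theta_star, 6))
# 3.0
```

For this loss each step shrinks the distance to the minimum by a factor of $1 - 2\alpha = 0.8$, so the iterate approaches 3 geometrically. A learning rate above 1.0 would make the same loop diverge.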

Partial Derivatives and the Gradient

Definition: Gradient Vector

For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives:

\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T

The gradient points in the direction of steepest ascent, so $-\nabla f$ points toward steepest descent.
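As a rough illustration, the gradient can be approximated one coordinate at a time with central differences. The test function $f(x, y) = x^2 + 3xy$ is an assumption picked for this sketch; its exact gradient is $[2x + 3y, \; 3x]^T$, which equals $[8, 3]^T$ at the point $(1, 2)$.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one partial derivative per coordinate."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h    # nudge only coordinate i upward
        x_minus[i] -= h   # and downward
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def f(p):
    """Example function f(x, y) = x^2 + 3xy."""
    x, y = p
    return x ** 2 + 3 * x * y

g = numerical_gradient(f, [1.0, 2.0])
print([round(component, 4) for component in g])
# [8.0, 3.0]
```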

Example: Gradient of MSE Loss

For linear regression with MSE loss $L(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - b)^2$:

\frac{\partial L}{\partial w_j} = -\frac{2}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})x_j^{(i)}

\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})

In matrix form: $\nabla_{\mathbf{w}} L = -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w} - b\mathbf{1})$
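The two partial derivatives above translate almost line for line into code. The sketch below fits a toy dataset generated from $y = 2x + 1$; the data, the learning rate of 0.05, and the iteration count are assumptions chosen for illustration.

```python
def predict(w, b, x):
    """Linear model prediction w^T x + b for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def mse_gradients(X, y, w, b):
    """Gradients of the MSE loss with respect to w and b."""
    N = len(X)
    residuals = [yi - predict(w, b, xi) for xi, yi in zip(X, y)]
    grad_w = [
        -2.0 / N * sum(r * xi[j] for r, xi in zip(residuals, X))
        for j in range(len(w))
    ]
    grad_b = -2.0 / N * sum(residuals)
    return grad_w, grad_b

# Toy data from y = 2x + 1 (noise-free, one feature).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

initial_gradients = mse_gradients(X, y, [0.0], 0.0)

w, b = [0.0], 0.0
alpha = 0.05
for _ in range(2000):
    grad_w, grad_b = mse_gradients(X, y, w, b)
    w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
    b = b - alpha * grad_b

print(round(w[0], 4), round(b, 4))
# 2.0 1.0
```

At the starting point $\mathbf{w} = 0$, $b = 0$ the residuals equal the targets, so the gradients are $-\frac{2}{4}(0 + 3 + 10 + 21) = -17$ and $-\frac{2}{4}(16) = -8$, which is what `initial_gradients` holds.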

Chain Rule

Theorem: Chain Rule

If $y = f(g(x))$, then $\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$. For multivariate functions:

\frac{\partial f}{\partial x_i} = \sum_{j} \frac{\partial f}{\partial g_j} \cdot \frac{\partial g_j}{\partial x_i}

This is the foundation of backpropagation in neural networks.
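A single-variable sketch shows the mechanics. The composition $y = \sin(x^2)$ is an assumed example, with inner function $g(x) = x^2$ and outer function $f(u) = \sin(u)$. The product of the two local derivatives is compared with a numerical estimate.

```python
import math

def g(x):
    """Inner function g(x) = x^2."""
    return x ** 2

def dg_dx(x):
    """Local derivative of the inner function."""
    return 2 * x

def f(u):
    """Outer function f(u) = sin(u)."""
    return math.sin(u)

def df_du(u):
    """Local derivative of the outer function."""
    return math.cos(u)

def dy_dx(x):
    """Chain rule: outer derivative evaluated at g(x), times inner derivative."""
    return df_du(g(x)) * dg_dx(x)

x0 = 1.5
h = 1e-6
analytic = dy_dx(x0)
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)

print(abs(analytic - numeric) < 1e-6)
# True
```

Backpropagation applies the same idea to longer compositions: each layer supplies its local derivative and the products are accumulated from the output back to the inputs.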

Probability

Probability Axioms

Definition: Probability Axioms (Kolmogorov)

A probability measure $P$ on a sample space $\Omega$ satisfies:

$P(A) \geq 0$ for all events $A$
$P(\Omega) = 1$
For pairwise disjoint events $A_1, A_2, \ldots$: $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$

Basic Probability

P(A) = \frac{|A|}{|\Omega|}

Here,

$A$ =Event (subset of sample space)
$\Omega$ =Sample space (assumed finite, with all outcomes equally likely)
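The counting formula can be checked by enumerating a small sample space. The example below, the probability that two fair dice sum to 7, is an assumption chosen for illustration; it relies on all 36 outcomes being equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair of faces from two fair dice.
omega = list(product(range(1, 7), repeat=2))

# Event A: the two faces sum to 7.
event = [outcome for outcome in omega if sum(outcome) == 7]

p_seven = Fraction(len(event), len(omega))  # |A| / |Omega|
print(len(event), len(omega), p_seven)
# 6 36 1/6
```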

Conditional Probability and Bayes' Theorem

Definition: Conditional Probability

P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0

Theorem: Bayes' Theorem

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} = \frac{P(B|A) \cdot P(A)}{\sum_{i} P(B|A_i) P(A_i)}

where:

$P(A|B)$ is the posterior probability
$P(B|A)$ is the likelihood
$P(A)$ is the prior probability
$P(B)$ is the evidence (marginal likelihood), obtained by summing over events $A_i$ that partition the sample space

Example: Medical Testing

Given:

$P(\text{Disease}) = 0.01$ (1% prevalence)
$P(\text{Positive}|\text{Disease}) = 0.99$ (sensitivity)
$P(\text{Positive}|\text{No Disease}) = 0.05$ (false positive rate)

Find: $P(\text{Disease}|\text{Positive})$

P(\text{Disease}|\text{Positive}) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times (1 - 0.01)} = \frac{0.0099}{0.0594} \approx 0.1667

Despite 99% sensitivity, only about 16.7% of people who test positive actually have the disease. Overlooking this is the base rate fallacy: with a 1% prior, false positives from the healthy 99% of the population outnumber the true positives five to one.
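The same calculation as a small function, which makes it easy to see how the posterior moves with the prior. This is a sketch for a two-hypothesis case (disease or no disease); the function and parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem for a binary hypothesis."""
    true_positives = sensitivity * prior
    false_positives = false_positive_rate * (1.0 - prior)
    evidence = true_positives + false_positives  # P(Positive)
    return true_positives / evidence

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 4))
# 0.1667

# The same test applied where the disease is more common (assumed 20% prevalence).
p_high_prior = posterior(prior=0.20, sensitivity=0.99, false_positive_rate=0.05)
```

Raising the prior while holding the test fixed raises the posterior, which is the point of the example: the test's quality alone does not determine how much a positive result should be trusted.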

Distributions

Normal Distribution PDF

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Here,

$\mu$ =Mean (location parameter)
$\sigma^2$ =Variance (scale parameter)
$f(x)$ =Probability density at x
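The density formula can be written directly with the standard library. The sketch below also approximates the integral of the standard normal over $[-8, 8]$ with a simple Riemann sum; the interval and step size are assumptions, picked because the tails beyond that range are negligible.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    variance = sigma ** 2
    exponent = -((x - mu) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

peak = normal_pdf(0.0)  # height at the mean, 1 / sqrt(2 * pi)

# Riemann-sum check that the density integrates to roughly 1.
step = 0.001
total = sum(normal_pdf(-8.0 + i * step) * step for i in range(16000))

print(round(peak, 4))
# 0.3989
```

Note that `peak` is a density, not a probability: for small `sigma` the value at the mean exceeds 1, while the area under the curve stays at 1.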

Expectation and Variance

Expected Value

\mathbb{E}[X] = \sum_{x} x \cdot P(X=x) \ \text{(discrete)}, \qquad \mathbb{E}[X] = \int x \cdot f(x) \, dx \ \text{(continuous)}

Here,

$\mathbb{E}[X]$ =Expected value (mean) of random variable X

Variance

\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Here,

$\text{Var}(X)$ =Variance — measures spread
$\mu$ =$\mathbb{E}[X]$, the mean
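For a discrete random variable both variance formulas can be evaluated exactly. The sketch below uses a fair six-sided die as an assumed example, with `Fraction` to avoid rounding error.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum of x * P(X = x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) from the definition E[(X - mu)^2]
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Var(X) from the shortcut E[X^2] - (E[X])^2
second_moment = sum(x ** 2 * p for x, p in pmf.items())
var_shortcut = second_moment - mean ** 2

print(mean, var_definition, var_shortcut)
# 7/2 35/12 35/12
```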

Key Takeaways

Summary: Math Foundations

Vectors and matrices are the data structures of ML — $\mathbf{X} \in \mathbb{R}^{N \times d}$
The affine map $y = Wx + b$, a matrix multiplication plus a bias, is the fundamental operation in neural networks
Derivatives tell us how to improve model parameters: $\nabla_\theta L$
Gradient descent $\theta_{t+1} = \theta_t - \alpha \nabla_\theta L$ is the core optimization algorithm
Chain rule enables backpropagation through computation graphs
Probability underpins classification and generative models
Bayes' theorem $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ is fundamental to probabilistic ML
The normal distribution appears so often partly because of the Central Limit Theorem: sums of many independent, finite-variance effects are approximately normal
You don't need a math PhD — focus on intuition over proofs
Python/NumPy handles the computation — you handle the understanding

What to Learn Next

-> What is Machine Learning? The complete introduction to ML — concepts, types, and workflow.

-> Linear Regression From scatter plots to predictions — the simplest ML algorithm.

-> Logistic Regression Classification with probability — from linear to sigmoid.

-> Dimensionality Reduction Reduce features while preserving information with PCA and t-SNE.

-> Regularization Prevent overfitting with Ridge, Lasso, and Elastic Net.

-> KNN Instance-based learning where your neighbors tell the story.
