
Math Foundations for Machine Learning — Linear Algebra, Calculus, Probability


The Mathematical Backbone of Every ML Algorithm

Linear algebra, calculus, and probability form the foundation of all machine learning. Master these concepts to truly understand how algorithms work.

  • Linear Algebra — Vectors, matrices, and the language of data
  • Calculus — Derivatives and gradient descent for optimization
  • Probability and Statistics — Bayes' theorem, distributions, and inference

"Mathematics is the language in which God has written the universe."

Math Foundations for Machine Learning

Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.


Linear Algebra

Vectors and Vector Operations

Definition: Vector

A vector $\mathbf{v} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers that represents a point or direction in $n$-dimensional space. It is written as $\mathbf{v} = [v_1, v_2, \ldots, v_n]^T$.

[Figure: Vector operations in ℝ² with v = [2, 3], w = [3, 1], and v + w = [5, 4]; a vector [2, 3, 1] in the vector space ℝ³. Vectors encode features, gradients, and embeddings in ML.]

Vector Addition

$$\vec{u} + \vec{v} = [u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n]$$

Here,

  • $\vec{u}, \vec{v}$ = Two vectors of the same dimension
  • $\vec{u} + \vec{v}$ = Resultant vector with summed components

Dot Product

$$\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i w_i = \|\vec{v}\| \|\vec{w}\| \cos\theta$$

Here,

  • $\vec{v} \cdot \vec{w}$ = The dot product (scalar result)
  • $\theta$ = Angle between the two vectors
  • $\|\vec{v}\|$ = L2 norm (magnitude) of $\vec{v}$

Example: Vector Operations

If $\vec{v} = [1, 2, 3]$ and $\vec{w} = [4, 5, 6]$:

$$\vec{v} + \vec{w} = [1+4, 2+5, 3+6] = [5, 7, 9]$$
$$\vec{v} \cdot \vec{w} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32$$
$$\|\vec{v}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74$$
$$\|\vec{w}\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{77} \approx 8.77$$
$$\cos\theta = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \|\vec{w}\|} = \frac{32}{\sqrt{14}\,\sqrt{77}} \approx 0.9746$$
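The same arithmetic can be checked in a few lines of NumPy. This is a minimal sketch; the variable names are illustrative and not part of the lesson.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

v_plus_w = v + w                         # element-wise addition
dot_vw = np.dot(v, w)                    # sum of element-wise products
norm_v = np.linalg.norm(v)               # L2 norm, sqrt(14)
norm_w = np.linalg.norm(w)               # L2 norm, sqrt(77)
cos_theta = dot_vw / (norm_v * norm_w)   # cosine of the angle between v and w

print("v + w     =", v_plus_w)
print("v . w     =", dot_vw)
print("||v||     =", norm_v)
print("||w||     =", norm_w)
print("cos theta =", cos_theta)
```

A cosine close to 1 says the two vectors point in nearly the same direction, which is the idea behind cosine similarity for comparing embeddings.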

Matrices

Definition: Matrix

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a 2D array of $m$ rows and $n$ columns. In ML, the design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ stores $N$ samples with $d$ features each, where row $i$ is the feature vector $\mathbf{x}^{(i)T}$.

[Figure: Matrix Multiplication: The Engine of Neural Networks. A weight matrix W maps an input vector x with three entries to an output vector y with two entries, drawn both as a matrix product and as a neural network layer (inputs x₁, x₂, x₃; hidden units h₁, h₂; output y). y = Wx + b is the fundamental building block of neural networks; each connection weight is a parameter learned via backpropagation.]

Matrix Operations in ML

Matrix Multiplication

$$(\mathbf{AB})_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

Here,

  • $\mathbf{A} \in \mathbb{R}^{m \times n}$ = Left matrix
  • $\mathbf{B} \in \mathbb{R}^{n \times p}$ = Right matrix
  • $\mathbf{C} = \mathbf{AB} \in \mathbb{R}^{m \times p}$ = Result matrix

Computational Complexity

Naive matrix multiplication is $O(n^3)$ for $n \times n$ matrices. In practice, optimized BLAS/LAPACK implementations and GPU parallelism make this efficient. Neural network forward passes are dominated by matrix multiplications.
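To make the $O(n^3)$ cost concrete, here is a sketch of the triple loop that follows the matrix multiplication definition directly, compared against NumPy's `@` operator. The helper name `matmul_naive` and the example matrices are illustrative assumptions.

```python
import numpy as np

def matmul_naive(A, B):
    """Compute C = AB from the definition C[i, j] = sum_k A[i, k] * B[k, j]."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])        # shape (3, 2)

C_naive = matmul_naive(A, B)
C_numpy = A @ B                   # dispatches to an optimized routine

print(C_naive)
print(np.allclose(C_naive, C_numpy))
```

The three nested loops are where the cubic cost comes from. In real code the `@` form is preferable; the loop version is only for reading alongside the formula.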


Calculus

Derivatives and Gradients

Definition: Derivative

The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ measures the instantaneous rate of change of $f$ at $x$. Geometrically, it gives the slope of the tangent line at point $x$.

Power Rule for Derivatives

$$f(x) = x^n \Rightarrow f'(x) = nx^{n-1}$$

Here,

  • $f(x)$ = The original function
  • $f'(x)$ = The derivative of $f$ with respect to $x$
  • $n$ = The exponent

[Figure: two panels. Derivative as Tangent Line Slope: f(x) = x² with the tangent at x = 2, where f'(2) = 4, and a secant line. Gradient Descent on f(x) = x²: iterates x₀ = 4, x₁ = 2.4, x₂ = 1.44, → 0. Gradient descent iterates x_{t+1} = x_t − α·f'(x_t) toward the minimum.]
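The limit definition and the power rule can be compared numerically. The sketch below evaluates the difference quotient of $f(x) = x^2$ at $x = 2$ for shrinking step sizes; the function names and the chosen values of $h$ are illustrative.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x ** 2

step_sizes = (0.1, 0.01, 0.001)
quotients = [difference_quotient(square, 2.0, h) for h in step_sizes]

# Power rule with n = 2: f'(x) = 2 * x^(2 - 1)
exact = 2 * 2.0 ** (2 - 1)

for h, q in zip(step_sizes, quotients):
    print(f"h = {h}: secant slope = {q:.6f}")
print("power rule:", exact)
```

For this function the secant slope works out to $4 + h$, so it approaches the tangent slope 4 as $h$ shrinks.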

Gradient Descent

Definition: Gradient Descent

Gradient descent is the core optimization algorithm in ML. It iteratively adjusts model parameters to minimize a loss function $L(\theta)$ by moving in the direction opposite to the gradient $\nabla_\theta L(\theta)$.

Gradient Descent Update Rule

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

Here,

  • $\theta$ = Model parameters (weights and biases)
  • $\alpha$ = Learning rate (step size), typically $10^{-3}$ to $10^{-1}$
  • $\nabla_\theta L$ = Gradient of loss function w.r.t. parameters
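As a minimal sketch of the update rule, the loop below minimizes $f(x) = x^2$, whose gradient is $f'(x) = 2x$. The starting point $x_0 = 4$ and learning rate $\alpha = 0.2$ are assumptions chosen so that the first iterates are 2.4 and 1.44; the helper name `gradient_descent` is illustrative.

```python
def gradient_descent(grad, x0, alpha, steps):
    """Apply x_{t+1} = x_t - alpha * grad(x_t) and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return xs

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
path = gradient_descent(lambda x: 2 * x, x0=4.0, alpha=0.2, steps=25)

for t, x in enumerate(path[:4]):
    print(f"x_{t} = {x:.4f}")
print(f"after {len(path) - 1} steps: x = {path[-1]:.6f}")
```

With these settings each step multiplies $x$ by $1 - 2\alpha = 0.6$, so the iterates shrink geometrically toward 0. A learning rate above 1 would make the factor exceed 1 in magnitude and the iterates would diverge.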

Partial Derivatives and the Gradient

Definition: Gradient Vector

For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives:

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$

The gradient points in the direction of steepest ascent, so $-\nabla f$ points toward steepest descent.

Example: Gradient of MSE Loss

For linear regression with MSE loss $L(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - b)^2$ and prediction $\hat{y}^{(i)} = \mathbf{w}^T\mathbf{x}^{(i)} + b$:

$$\frac{\partial L}{\partial w_j} = -\frac{2}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$
$$\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})$$

In matrix form: $\nabla_{\mathbf{w}} L = -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w} - b\mathbf{1})$
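The matrix form translates almost directly into NumPy. The sketch below computes the gradients, checks one component against a central finite difference, and then runs plain gradient descent. The synthetic data, the seed, the learning rate 0.1, and the function names are all assumptions made for illustration.

```python
import numpy as np

def mse_loss(w, b, X, y):
    residual = y - X @ w - b                  # y - y_hat, shape (N,)
    return np.mean(residual ** 2)

def mse_gradients(w, b, X, y):
    N = X.shape[0]
    residual = y - X @ w - b
    grad_w = -(2.0 / N) * (X.T @ residual)    # matrix form of dL/dw
    grad_b = -(2.0 / N) * residual.sum()      # dL/db
    return grad_w, grad_b

# Synthetic regression problem (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.3 + 0.1 * rng.normal(size=50)

# Check dL/dw_0 at w = 0, b = 0 against a central finite difference.
w, b, eps = np.zeros(3), 0.0, 1e-6
grad_w, grad_b = mse_gradients(w, b, X, y)
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
fd = (mse_loss(w_plus, b, X, y) - mse_loss(w_minus, b, X, y)) / (2 * eps)
print("analytic:", grad_w[0], "finite difference:", fd)

# Plain gradient descent using the analytic gradients.
w_fit, b_fit = np.zeros(3), 0.0
for _ in range(500):
    g_w, g_b = mse_gradients(w_fit, b_fit, X, y)
    w_fit = w_fit - 0.1 * g_w
    b_fit = b_fit - 0.1 * g_b
print("fitted w:", w_fit, "fitted b:", b_fit)
```

Comparing an analytic gradient with a finite difference is a common sanity check before trusting a hand-derived formula in a training loop.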

Chain Rule

Theorem: Chain Rule

If $y = f(g(x))$, then $\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$. For multivariate functions:

$$\frac{\partial f}{\partial x_i} = \sum_{j} \frac{\partial f}{\partial g_j} \cdot \frac{\partial g_j}{\partial x_i}$$

This is the foundation of backpropagation in neural networks.
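A small worked case may help. Take the hypothetical composition $y = f(g(x))$ with $g(x) = 3x + 1$ and $f(u) = u^2$, so the chain rule gives $\frac{dy}{dx} = 2(3x + 1) \cdot 3$. The sketch below multiplies the two local derivatives, which is the same bookkeeping backpropagation performs layer by layer, and compares the result with a numerical estimate.

```python
def g(x):
    return 3.0 * x + 1.0      # inner function

def f(u):
    return u ** 2             # outer function

def dg_dx(x):
    return 3.0                # local derivative of g

def df_du(u):
    return 2.0 * u            # local derivative of f

def dy_dx(x):
    """Chain rule: derivative of the outer function at g(x) times derivative of g."""
    return df_du(g(x)) * dg_dx(x)

x = 2.0
h = 1e-5
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # central difference

print("chain rule:", dy_dx(x))
print("numeric   :", numeric)
```

At $x = 2$ the inner value is $g(2) = 7$, so the product of local derivatives is $2 \cdot 7 \cdot 3 = 42$.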


Probability

Probability Axioms

Definition: Probability Axioms (Kolmogorov)

A probability measure $P$ on a sample space $\Omega$ satisfies:

  1. $P(A) \geq 0$ for all events $A$
  2. $P(\Omega) = 1$
  3. For disjoint events $A_1, A_2, \ldots$: $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$

Basic Probability

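When the sample space is finite and all outcomes are equally likely:

$$P(A) = \frac{|A|}{|\Omega|}$$

Here,

  • $A$ = Event (subset of sample space)
  • $\Omega$ = Sample space
  • $|A|$, $|\Omega|$ = Number of outcomes in the event and in the sample space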
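For a concrete case, consider one roll of a fair six-sided die, where counting outcomes applies. The sketch below uses Python sets for events and `fractions.Fraction` for exact arithmetic; the die example and the helper name `prob` are illustrative assumptions.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}           # sample space: one roll of a fair die

def prob(event):
    """P(A) = |A| / |Omega| for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

evens = {2, 4, 6}
one = {1}                             # disjoint from evens

p_evens = prob(evens)
p_union = prob(evens | one)

print("P(even)      =", p_evens)
print("P(even or 1) =", p_union)
print("additivity   :", p_union == prob(evens) + prob(one))
print("P(Omega)     =", prob(omega))
```

The last two lines illustrate the additivity and normalization axioms on this small sample space.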

Conditional Probability and Bayes' Theorem

Definition: Conditional Probability

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Theorem: Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} = \frac{P(B|A) \cdot P(A)}{\sum_{i} P(B|A_i) P(A_i)}$$

where the events $A_i$ partition the sample space (with $A$ one of them), so the sum in the denominator expands $P(B)$ by the law of total probability, and:

  • $P(A|B)$ is the posterior probability
  • $P(B|A)$ is the likelihood
  • $P(A)$ is the prior probability
  • $P(B)$ is the evidence (marginal likelihood)

[Figure: Bayes' Theorem: Updating Beliefs with Evidence. Prior P(A): initial belief before seeing data. Likelihood P(B|A): how likely is the evidence if A is true? Posterior P(A|B): updated belief after observing evidence B. The evidence P(B) normalizes so the posterior sums to 1. The key insight: Posterior ∝ Prior × Likelihood.]

Example: Medical Testing

Given:

  • $P(\text{Disease}) = 0.01$ (1% prevalence)
  • $P(\text{Positive}|\text{Disease}) = 0.99$ (sensitivity)
  • $P(\text{Positive}|\text{No Disease}) = 0.05$ (false positive rate)

Find: $P(\text{Disease}|\text{Positive})$

$$P(\text{Disease}|\text{Positive}) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} = \frac{0.0099}{0.0594} \approx 0.1667$$

The second term in the denominator is $P(\text{Positive}|\text{No Disease}) \times P(\text{No Disease}) = 0.05 \times 0.99$. Despite 99% sensitivity, only about 16.7% of people who test positive actually have the disease, because the disease is rare and false positives from the healthy 99% outnumber true positives. Ignoring this is the base rate fallacy: the prior probability strongly influences the posterior probability.
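The calculation is short enough to wrap in a function, which makes it easy to see how the answer moves with the prior. This is a sketch; the function name `posterior` and the second scenario with 10% prevalence are illustrative assumptions.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem for a binary test."""
    # Evidence P(Positive) from the law of total probability.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence

p_rare = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
p_common = posterior(prior=0.10, sensitivity=0.99, false_positive_rate=0.05)

print(f"prevalence  1%: P(Disease | Positive) = {p_rare:.4f}")
print(f"prevalence 10%: P(Disease | Positive) = {p_common:.4f}")
```

With the same test, raising the prevalence from 1% to 10% lifts the posterior from about 0.17 to about 0.69, which shows how strongly the prior drives the result.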

Distributions

[Figure: Key Probability Distributions in ML. Normal (Gaussian): μ = mean, σ² = variance, P(x) = (1/√(2πσ²)) e^{-(x-μ)²/(2σ²)}, with a region marked 68%. Uniform distribution: all values equally likely, P(x) = 1/(b-a) for a ≤ x ≤ b. Bernoulli: binary outcome with P(X=1) = p and P(X=0) = 1-p; foundation of logistic regression.]

Normal Distribution PDF

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Here,

  • $\mu$ = Mean (location parameter)
  • $\sigma^2$ = Variance (its square root $\sigma$, the standard deviation, is the scale parameter)
  • $f(x)$ = Probability density at $x$
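The density formula can be implemented with the standard library alone. The sketch below evaluates it at the mean of a standard normal and numerically integrates it over one standard deviation on each side of the mean; the function name `normal_pdf` and the midpoint-rule step count are illustrative choices.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    variance = sigma ** 2
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * variance))

peak = normal_pdf(0.0)   # density at the mean of a standard normal

# Midpoint-rule integral of the density over [mu - sigma, mu + sigma].
steps = 2000
width = 2.0 / steps
area = sum(normal_pdf(-1.0 + (i + 0.5) * width) for i in range(steps)) * width

print(f"f(0) = {peak:.4f}")
print(f"P(-1 <= X <= 1) = {area:.4f}")
```

The integral comes out near 0.68, the familiar share of probability mass within one standard deviation of the mean. Note that a density value is not itself a probability; only areas under the curve are.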

Expectation and Variance

Expected Value

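$$\mathbb{E}[X] = \sum_{x} x \cdot P(X=x) \quad \text{(discrete)}, \qquad \mathbb{E}[X] = \int x \cdot f(x) \, dx \quad \text{(continuous)}$$

Here,

  • $\mathbb{E}[X]$ = Expected value (mean) of random variable $X$
  • $P(X=x)$, $f(x)$ = Probability mass function (discrete case) and probability density function (continuous case)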

Variance

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Here,

  • $\text{Var}(X)$ = Variance, a measure of spread
  • $\mu$ = $\mathbb{E}[X]$, the mean
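Both variance formulas can be checked on a small discrete example. The sketch below uses a fair six-sided die (an illustrative choice, not from the lesson) and exact fractions, so the two expressions can be compared without rounding.

```python
from fractions import Fraction

outcomes = range(1, 7)     # faces of a fair die
p = Fraction(1, 6)         # each face is equally likely

mean = sum(x * p for x in outcomes)                  # E[X]
second_moment = sum(x ** 2 * p for x in outcomes)    # E[X^2]

var_definition = sum((x - mean) ** 2 * p for x in outcomes)   # E[(X - mu)^2]
var_shortcut = second_moment - mean ** 2                      # E[X^2] - (E[X])^2

print("E[X]   =", mean)
print("E[X^2] =", second_moment)
print("Var(X) =", var_definition, "=", var_shortcut)
```

Both routes give the same value, as the identity requires. With floating-point data the shortcut form can lose precision when the mean is large relative to the spread, so the definition form is often the safer one to compute.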

Key Takeaways

Summary: Math Foundations

  1. Vectors and matrices are the data structures of ML, for example the design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$
  2. Matrix multiplication $y = Wx + b$ is the fundamental operation in neural networks
  3. Derivatives tell us how to improve model parameters: $\nabla_\theta L$
  4. Gradient descent $\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$ is the core optimization algorithm
  5. Chain rule enables backpropagation through computation graphs
  6. Probability underpins classification and generative models
  7. Bayes' theorem $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ is fundamental to probabilistic ML
  8. The normal distribution appears often because, by the Central Limit Theorem, sums of many independent finite-variance random variables are approximately normal
  9. You don't need a math PhD — focus on intuition over proofs
  10. Python/NumPy handles the computation — you handle the understanding

What to Learn Next

-> What is Machine Learning? The complete introduction to ML — concepts, types, and workflow.

-> Linear Regression From scatter plots to predictions — the simplest ML algorithm.

-> Logistic Regression Classification with probability — from linear to sigmoid.

-> Dimensionality Reduction Reduce features while preserving information with PCA and t-SNE.

-> Regularization Prevent overfitting with Ridge, Lasso, and Elastic Net.

-> KNN Instance-based learning where your neighbors tell the story.
