ML Foundations
The Mathematical Backbone of Every ML Algorithm
Linear algebra, calculus, and probability form the foundation of all machine learning. Master these concepts to truly understand how algorithms work.
- Linear Algebra — Vectors, matrices, and the language of data
- Calculus — Derivatives and gradient descent for optimization
- Probability and Statistics — Bayes' theorem, distributions, and inference
"Mathematics is the language in which God has written the universe."
Math Foundations for Machine Learning
Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.
Linear Algebra
Vectors and Vector Operations
Definition: Vector
A vector is an ordered list of real numbers that represents a point or direction in $n$-dimensional space. It is written as $\mathbf{v} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$.
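Vector Addition
$$\mathbf{u} + \mathbf{v} = (u_1 + v_1,\ u_2 + v_2,\ \dots,\ u_n + v_n)$$
Here,
- $\mathbf{u}, \mathbf{v}$ = Two vectors of the same dimension
- $\mathbf{u} + \mathbf{v}$ = Resultant vector with summed components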
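Dot Product
$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = \|\mathbf{u}\| \, \|\mathbf{v}\| \cos\theta$$
Here,
- $\mathbf{u} \cdot \mathbf{v}$ = The dot product (scalar result)
- $\theta$ = Angle between the two vectors
- $\|\mathbf{v}\|$ = L2 norm (magnitude) of $\mathbf{v}$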
Example: Vector Operations
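If $\mathbf{u} = (1, 2, 3)$ and $\mathbf{v} = (4, 5, 6)$ (values chosen here for illustration):
$$\mathbf{u} + \mathbf{v} = (5, 7, 9)$$
$$\mathbf{u} \cdot \mathbf{v} = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$$
$$\|\mathbf{u}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74$$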
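The same operations can be sketched in NumPy. The vectors `u` and `v` are assumed values picked only for illustration, and the angle is recovered from the dot product and the two norms.

```python
import numpy as np

# Illustrative vectors (assumed values, chosen only for this sketch)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

vec_sum = u + v                      # element-wise addition
dot = float(np.dot(u, v))            # sum of element-wise products
norm_u = float(np.linalg.norm(u))    # L2 norm of u
norm_v = float(np.linalg.norm(v))    # L2 norm of v

# Recover the angle from dot = |u| |v| cos(theta)
cos_theta = dot / (norm_u * norm_v)
theta_deg = float(np.degrees(np.arccos(cos_theta)))

print("u + v =", vec_sum)
print("u . v =", dot)
print("angle (degrees) =", round(theta_deg, 2))
```

A small angle here means the two vectors point in nearly the same direction, which is the intuition behind cosine similarity.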
Matrices
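Definition: Matrix
A matrix $A \in \mathbb{R}^{m \times n}$ is a 2D array of $m$ rows and $n$ columns. In ML, the design matrix $X \in \mathbb{R}^{N \times d}$ stores $N$ samples with $d$ features each, where row $i$ is the feature vector $\mathbf{x}_i^\top$.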
Matrix Operations in ML
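Matrix Multiplication
$$C = AB, \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Here,
- $A \in \mathbb{R}^{m \times n}$ = Left matrix
- $B \in \mathbb{R}^{n \times p}$ = Right matrix
- $C \in \mathbb{R}^{m \times p}$ = Result matrix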
Computational Complexity
Naive matrix multiplication is $O(n^3)$ for $n \times n$ matrices. In practice, optimized BLAS libraries and GPU parallelism make it fast, even though the asymptotic cost of the standard algorithm is unchanged. Neural network forward passes are dominated by matrix multiplications.
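To make the cubic cost concrete, here is a minimal sketch of the triple-loop algorithm next to NumPy's `@` operator. The helper name `naive_matmul` and the random test matrices are assumptions made for this example.

```python
import numpy as np

def naive_matmul(A, B):
    """Multiply A (m x n) by B (n x p) with three nested loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):              # rows of A
        for j in range(p):          # columns of B
            for k in range(n):      # shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 2))

C_naive = naive_matmul(A, B)
C_fast = A @ B   # NumPy's built-in product, typically backed by an optimized library

print(np.allclose(C_naive, C_fast))
```

Both give the same result; the difference is speed, which grows quickly with matrix size because the Python loops perform every multiply-add one at a time.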
Calculus
Derivatives and Gradients
Definition: Derivative
The derivative $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$ measures the instantaneous rate of change of $f$ at $x$. Geometrically, it gives the slope of the tangent line at the point $(x, f(x))$.
Power Rule for Derivatives
$$f(x) = x^n \implies f'(x) = n x^{n-1}$$
Here,
- $f(x)$ = The original function
- $f'(x)$ = The derivative of $f$ with respect to $x$
- $n$ = The exponent
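One way to build trust in the power rule is to compare it with a numerical derivative. The sketch below uses a central difference; the function names, the test point `x0 = 2.0`, and the exponent `n = 3` are all assumptions chosen for illustration.

```python
def power(x, n):
    return x ** n

def power_rule(x, n):
    """Analytic derivative of x**n from the power rule."""
    return n * x ** (n - 1)

def central_difference(f, x, h=1e-5):
    """Numerical derivative: (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0, n = 2.0, 3
analytic = power_rule(x0, n)   # 3 * 2**2 = 12
numeric = central_difference(lambda x: power(x, n), x0)

print(analytic, numeric)
```

The two values agree to several decimal places, and the same check is a common way to debug hand-written gradients.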
Gradient Descent
Definition: Gradient Descent
Gradient descent is the core optimization algorithm in ML. It iteratively adjusts model parameters $\theta$ to minimize a loss function $L(\theta)$ by moving in the direction opposite to the gradient $\nabla_\theta L(\theta)$.
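Gradient Descent Update Rule
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
Here,
- $\theta$ = Model parameters (weights and biases)
- $\eta$ = Learning rate (step size), typically $10^{-3}$ to $10^{-1}$
- $\nabla_\theta L(\theta_t)$ = Gradient of the loss function w.r.t. the parameters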
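A minimal sketch of the update rule on a one-parameter problem follows. The quadratic loss $(\theta - 3)^2$, the starting point, and the learning rate of 0.1 are assumptions chosen so the minimum (at $\theta = 3$) is known in advance.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    """Derivative of the loss with respect to theta."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, learning_rate=0.1, steps=100):
    theta = theta0
    history = [loss(theta)]
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate
        theta = theta - learning_rate * grad(theta)
        history.append(loss(theta))
    return theta, history

theta_final, history = gradient_descent(theta0=0.0)
print(theta_final, history[-1])
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step. For this particular loss, a learning rate above 1.0 would make the iterates diverge, which is why step size matters.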
Partial Derivatives and the Gradient
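Definition: Gradient Vector
For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives:
$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^\top$$
The gradient points in the direction of steepest ascent, so $-\nabla f$ points toward steepest descent.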
Example: Gradient of MSE Loss
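For linear regression with MSE loss $L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$ over $N$ samples:
$$\frac{\partial L}{\partial w_j} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i) \, x_{ij}$$
In matrix form:
$$\nabla_{\mathbf{w}} L = -\frac{2}{N} X^\top (\mathbf{y} - X\mathbf{w})$$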
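The matrix-form gradient of the MSE loss, $-\frac{2}{N} X^\top (\mathbf{y} - X\mathbf{w})$, can be checked against finite differences. This is a sketch under stated assumptions: the helper names and the synthetic data (50 samples, 3 features, a made-up `true_w`) are hypothetical.

```python
import numpy as np

def mse_loss(w, X, y):
    residual = y - X @ w
    return float(np.mean(residual ** 2))

def mse_gradient(w, X, y):
    """Matrix-form gradient: -(2/N) X^T (y - Xw)."""
    n_samples = X.shape[0]
    return -(2.0 / n_samples) * X.T @ (y - X @ w)

def numerical_gradient(w, X, y, h=1e-6):
    """Central-difference estimate, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (mse_loss(w + step, X, y) - mse_loss(w - step, X, y)) / (2 * h)
    return grad

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)
w = np.zeros(3)

g_analytic = mse_gradient(w, X, y)
g_numeric = numerical_gradient(w, X, y)
print(np.allclose(g_analytic, g_numeric, atol=1e-5))
```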
Chain Rule
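Theorem: Chain Rule
If $y = f(g(x))$, then $\frac{dy}{dx} = f'(g(x)) \, g'(x)$. For multivariate functions, where $L$ depends on $x_i$ through intermediate variables $u_1, \dots, u_m$:
$$\frac{\partial L}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial L}{\partial u_j} \frac{\partial u_j}{\partial x_i}$$
This is the foundation of backpropagation in neural networks.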
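To show how the chain rule drives backpropagation, here is a small sketch for a single sigmoid unit with squared-error loss. The model, the parameter values, and the function names are assumptions made for this example; the finite-difference lines exist only to confirm the hand-derived gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    z = w * x + b          # linear step
    a = sigmoid(z)         # activation
    loss = (a - y) ** 2    # squared error
    return z, a, loss

def backward(w, b, x, y):
    """Apply the chain rule from the loss back to w and b."""
    z, a, _ = forward(w, b, x, y)
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    dloss_dw = dloss_dz * x        # dz/dw = x
    dloss_db = dloss_dz * 1.0      # dz/db = 1
    return dloss_dw, dloss_db

w, b, x, y = 0.5, -0.2, 1.5, 1.0
dw, db = backward(w, b, x, y)

# Finite-difference check of both gradients
h = 1e-6
dw_numeric = (forward(w + h, b, x, y)[2] - forward(w - h, b, x, y)[2]) / (2 * h)
db_numeric = (forward(w, b + h, x, y)[2] - forward(w, b - h, x, y)[2]) / (2 * h)
print(dw, dw_numeric)
print(db, db_numeric)
```

Each line in `backward` multiplies one more local derivative onto the running product, which is the pattern automatic differentiation libraries apply across a whole computation graph.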
Probability
Probability Axioms
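Definition: Probability Axioms (Kolmogorov)
A probability measure $P$ on a sample space $\Omega$ satisfies:
- $P(A) \geq 0$ for all events $A$
- $P(\Omega) = 1$
- For disjoint events $A_1, A_2, \dots$: $P\left( \bigcup_i A_i \right) = \sum_i P(A_i)$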
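Basic Probability
$$P(A) = \frac{|A|}{|\Omega|}$$
(for a finite sample space with equally likely outcomes)
Here,
- $A$ = Event (subset of sample space)
- $\Omega$ = Sample space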
Conditional Probability and Bayes' Theorem
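Definition: Conditional Probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$
Theorem: Bayes' Theorem
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
where:
- $P(A \mid B)$ is the posterior probability
- $P(B \mid A)$ is the likelihood
- $P(A)$ is the prior probability
- $P(B)$ is the evidence (marginal likelihood)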
Example: Medical Testing
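Given:
- $P(D) = 0.01$ (1% prevalence)
- $P(+ \mid D) = 0.99$ (sensitivity)
- $P(+ \mid \neg D) = 0.05$ (false positive rate)
Find: $P(D \mid +)$
$$P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid \neg D) \, P(\neg D)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} = \frac{0.0099}{0.0594} \approx 0.167$$
Despite 99% sensitivity, only 16.7% of positives actually have the disease! Overlooking this is the base rate fallacy: the prior probability strongly influences the posterior probability.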
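The calculation can be reproduced in a few lines. This sketch assumes a sensitivity of 0.99 and a false positive rate of 0.05, values that are consistent with the 1% prevalence and the 16.7% result quoted above; the function name `posterior` is hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1.0 - prior)
    evidence = true_positive + false_positive   # P(positive)
    return true_positive / evidence

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 3))   # 0.167

# The same test applied where the disease is common gives a very different answer
p_common = posterior(prior=0.5, sensitivity=0.99, false_positive_rate=0.05)
print(round(p_common, 3))
```

Changing only the prior changes the posterior dramatically, which is the point of the example.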
Distributions
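Normal Distribution PDF
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
Here,
- $\mu$ = Mean (location parameter)
- $\sigma^2$ = Variance; its square root $\sigma$ is the standard deviation (scale parameter)
- $f(x)$ = Probability density at $x$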
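Here is a minimal sketch of the normal density using only the standard library, with a rough numerical check that it integrates to about 1. The function name `normal_pdf`, the parameter name `sigma2` (the variance), and the integration grid are assumptions chosen for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of a normal distribution at x; sigma2 is the variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# Riemann-sum check over [-8, 8] that the standard normal density integrates to about 1
step = 0.001
grid = [-8.0 + i * step for i in range(16001)]
area = sum(normal_pdf(x) for x in grid) * step

print(normal_pdf(0.0), area)
```

Note that `normal_pdf` returns a density, not a probability: values can exceed 1 when the variance is small, and only areas under the curve are probabilities.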
Expectation and Variance
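Expected Value
$$\mathbb{E}[X] = \sum_{x} x \, P(X = x) \ \text{(discrete)}, \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx \ \text{(continuous)}$$
Here,
- $\mathbb{E}[X]$ = Expected value (mean) of random variable $X$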
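Variance
$$\mathrm{Var}(X) = \mathbb{E}\left[ (X - \mu)^2 \right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$
Here,
- $\mathrm{Var}(X)$ = Variance, which measures spread
- $\mu$ = $\mathbb{E}[X]$, the mean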
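As a worked case, the sketch below computes the mean and variance of a fair six-sided die with exact fractions. The die is an assumed example distribution. The variance is computed two ways, from the definition $\mathbb{E}[(X - \mu)^2]$ and from the equivalent identity $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6 (illustrative distribution)
outcomes = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# E[X]
mean = sum(x * p for x, p in zip(outcomes, probs))
# E[(X - mu)^2], the definition of variance
var_def = sum((x - mean) ** 2 * p for x, p in zip(outcomes, probs))
# E[X^2] - (E[X])^2, the shortcut form
second_moment = sum(x ** 2 * p for x, p in zip(outcomes, probs))
var_shortcut = second_moment - mean ** 2

print(mean, var_def, var_shortcut)   # 7/2 35/12 35/12
```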
Key Takeaways
Summary: Math Foundations
- Vectors and matrices are the data structures of ML: each sample is a feature vector and the dataset is a design matrix
- Matrix multiplication is the fundamental operation in neural networks
- Derivatives tell us how to improve model parameters: step in the direction opposite to the gradient of the loss
- Gradient descent is the core optimization algorithm
- Chain rule enables backpropagation through computation graphs
- Probability underpins classification and generative models
- Bayes' theorem is fundamental to probabilistic ML
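- The normal distribution appears in many settings, largely because of the Central Limit Theorem: sums of many independent, finite-variance random variables are approximately normal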
- You don't need a math PhD — focus on intuition over proofs
- Python/NumPy handles the computation — you handle the understanding