The Math Behind Deep Learning — Linear Algebra, Calculus, Probability
Every neural network operation — from matrix multiplications to gradient updates — is built on mathematical principles. Mastering these foundations lets you understand why architectures work and debug training failures at the mathematical level.
- Linear Algebra — Vectors, matrices, eigendecomposition, and SVD underpin all neural network operations
- Calculus — The chain rule is the backbone of backpropagation, enabling efficient gradient computation
- Probability — Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models
Math Foundations for Deep Learning
Deep learning is built on linear algebra, calculus, and probability. This tutorial covers the essential math you need to understand how neural networks learn.
Linear Algebra
Vectors and Matrices
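Definition: Vector and Matrix Operations
- Dot Product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$
- Matrix Multiplication: $C = AB$ where $C_{ij} = \sum_{k} A_{ik} B_{kj}$
- Norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i} x_i^2}$
- Transpose: $(A^\top)_{ij} = A_{ji}$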
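The four operations listed above can be tried directly in NumPy. This is a small sketch with made-up example values; the array names `a`, `b`, `A`, and `B` are illustrative and not taken from the tutorial.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b                    # 1*4 + 2*5 + 3*6 = 32
norm_a = np.linalg.norm(a)     # Euclidean norm: sqrt(1 + 4 + 9) = sqrt(14)

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])     # shape (3, 2)
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])  # shape (2, 3)

C = A @ B      # matrix product, shape (3, 3)
A_T = A.T      # transpose, shape (2, 3)
```

The `@` operator covers both the dot product of two vectors and the matrix product, which is why it shows up in most NumPy-based model code.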
Matrix Multiplication
$$C = AB, \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Here,
- $A$ = Input matrix of shape (m x n)
- $B$ = Input matrix of shape (n x p)
- $C$ = Output matrix of shape (m x p)
- $n$ = Inner dimension (must match)
Matrix Operations in Neural Networks
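As one possible illustration, a fully connected layer is a matrix multiplication followed by a bias add and a nonlinearity. The sketch below assumes a batch of 4 inputs with 3 features mapped to 2 outputs and a ReLU activation; these sizes, the names `X`, `W`, `b`, and the choice of ReLU are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_in, n_out = 4, 3, 2            # hypothetical sizes
X = rng.normal(size=(batch, n_in))      # inputs, shape (m x n)
W = rng.normal(size=(n_in, n_out))      # weights, shape (n x p)
b = np.zeros(n_out)                     # bias, broadcast across the batch

Z = X @ W + b                           # pre-activations, shape (m x p)
H = np.maximum(Z, 0.0)                  # ReLU applied element-wise
```

The inner dimension (3 here) must match between `X` and `W`; changing either one alone makes NumPy raise a shape error.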
Eigenvalues and Eigenvectors
Definition: Eigenvalue Decomposition
For a square matrix $A$, an eigenvector $\mathbf{v} \neq \mathbf{0}$ and eigenvalue $\lambda$ satisfy:
$$A\mathbf{v} = \lambda \mathbf{v}$$
The eigendecomposition reveals the matrix's action: stretching along eigenvector directions by eigenvalue factors. This is critical for:
- PCA: Principal components are eigenvectors of the covariance matrix
- Hessian analysis: Each eigenvalue gives the curvature along its eigenvector direction
- Spectral initialization: Eigenvectors of weight matrices
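The defining property of an eigenpair can be checked numerically. The sketch below assumes a small symmetric matrix chosen for the example and uses `np.linalg.eigh`, which is intended for symmetric (Hermitian) matrices such as covariance matrices and Hessians.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric example matrix

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(A)

# Check A v = lambda v for the largest eigenpair.
v = eigvecs[:, 1]
lam = eigvals[1]
residual = np.linalg.norm(A @ v - lam * v)

# For a symmetric matrix, A = Q diag(lambda) Q^T with orthonormal Q.
A_rebuilt = eigvecs @ np.diag(eigvals) @ eigvecs.T
```

For this matrix the eigenvalues are 1 and 3, so the matrix stretches by a factor of 3 along one eigenvector and leaves lengths unchanged along the other.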
Singular Value Decomposition (SVD)
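Definition: SVD
Every matrix $A \in \mathbb{R}^{m \times n}$ can be decomposed as:
$$A = U \Sigma V^\top$$
where $U$ and $V$ are orthogonal, and $\Sigma$ is diagonal with the singular values on its diagonal. SVD is used in weight pruning, low-rank approximation, and understanding network expressivity.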
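A low-rank approximation can be sketched by keeping only the largest singular values. The example below assumes a random 6 x 4 matrix and a target rank of `k = 2`; both are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Reduced SVD: U is (6, 4), s holds 4 singular values (descending), Vt is (4, 4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation

# In the spectral norm, the truncation error equals the first discarded
# singular value (Eckart-Young theorem).
spectral_error = np.linalg.norm(A - A_k, ord=2)
```

Storing `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` instead of `A` is the basic idea behind low-rank weight compression: fewer numbers are kept, at the cost of the error measured by `spectral_error`.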
Calculus for Deep Learning
Gradients
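Definition: Gradient
The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives:
$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^\top$$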
The gradient points in the direction of steepest ascent. Gradient descent moves in the opposite direction to minimize the loss.
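Gradient descent can be sketched on a simple quadratic whose minimum is known in advance. The loss, the `target` vector, the learning rate of 0.1, and the 100 iterations are all assumptions picked to keep the example small.

```python
import numpy as np

target = np.array([3.0, -1.0])   # known minimizer of the example loss

def f(x):
    """Quadratic loss: squared distance to the target."""
    return np.sum((x - target) ** 2)

def grad_f(x):
    """Analytic gradient of f."""
    return 2.0 * (x - target)

x = np.zeros(2)   # starting point
lr = 0.1          # learning rate (step size)
for _ in range(100):
    x = x - lr * grad_f(x)   # step against the gradient
```

Each step shrinks the distance to `target` by a factor of 0.8 for this loss, so `x` ends up very close to `[3, -1]`.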
The Chain Rule
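Definition: Chain Rule
For composite functions $y = f(g(x))$, the chain rule states:
$$\frac{dy}{dx} = f'(g(x)) \, g'(x)$$
For multivariate functions with $z = f(\mathbf{y})$ and $\mathbf{y} = g(\mathbf{x})$:
$$\frac{\partial z}{\partial x_i} = \sum_{j} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$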
This is the foundation of backpropagation.
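To make the link to backpropagation concrete, here is a hand-written forward and backward pass for the composite function sin(x^2), checked against a finite-difference estimate. The function and the evaluation point `x0 = 1.5` are illustrative choices.

```python
import math

def forward(x):
    """Forward pass: return the intermediate value and the output."""
    u = x * x          # inner function g(x) = x^2
    y = math.sin(u)    # outer function f(u) = sin(u)
    return u, y

def backward(x, u):
    """Backward pass: multiply local derivatives, outermost first."""
    dy_du = math.cos(u)    # f'(u)
    du_dx = 2.0 * x        # g'(x)
    return dy_du * du_dx   # chain rule: dy/dx = f'(g(x)) * g'(x)

x0 = 1.5
u0, y0 = forward(x0)
analytic = backward(x0, u0)

# Central finite difference as an independent check.
h = 1e-6
numeric = (forward(x0 + h)[1] - forward(x0 - h)[1]) / (2 * h)
```

The backward pass reuses `u0` from the forward pass instead of recomputing it, which is the same caching idea that makes backpropagation efficient in deep networks.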
Jacobian and Hessian
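Definition: Jacobian Matrix
For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$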
The Jacobian describes how each output component changes with respect to each input component.
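A Jacobian can be approximated column by column with central differences and compared with the hand-derived matrix. The test function `f` and the helper `numerical_jacobian` below are hypothetical names written for this sketch, not library functions.

```python
import numpy as np

def f(x):
    """Example map from R^2 to R^3."""
    return np.array([x[0] * x[1], x[0] + x[1] ** 2, np.sin(x[0])])

def numerical_jacobian(func, x, h=1e-6):
    """Approximate J[i, j] = d func_i / d x_j with central differences."""
    x = np.asarray(x, dtype=float)
    m = func(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0])
J_num = numerical_jacobian(f, x0)

# Hand-derived Jacobian of f at x0 for comparison.
J_exact = np.array([[x0[1], x0[0]],
                    [1.0, 2 * x0[1]],
                    [np.cos(x0[0]), 0.0]])
```

Row `i` of the result describes how output `i` responds to each input, and column `j` describes how every output responds to input `j`.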
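Definition: Hessian Matrix
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the matrix of second partial derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Eigenvalues of the Hessian indicate curvature. If all eigenvalues are positive, the function is locally convex, and a critical point there is a local minimum. If all are negative, a critical point is a local maximum. A mix of positive and negative eigenvalues at a critical point indicates a saddle point.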
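The eigenvalue test can be sketched for two functions whose Hessians are easy to write by hand. The helper `classify` is a hypothetical name for this example, and it assumes the Hessian is evaluated at a critical point (where the gradient is zero).

```python
import numpy as np

# Hessians written out by hand for two example functions.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])   # f(x, y) = x^2 - y^2
H_bowl = np.array([[2.0, 0.0],
                   [0.0, 6.0]])      # g(x, y) = x^2 + 3 y^2

def classify(H, tol=1e-12):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"         # some eigenvalues are (near) zero
```

Here `classify(H_saddle)` reports a saddle point because the curvature is positive along x and negative along y, while `classify(H_bowl)` reports a local minimum.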
Probability Theory
Distributions
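Definition: Gaussian Distribution
The multivariate Gaussian distribution is fundamental to deep learning. For $\mathbf{x} \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$:
$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$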
Used in: weight initialization, variational autoencoders, diffusion models, Bayesian deep learning.
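In practice the density is usually evaluated in log space. The sketch below assumes a positive-definite covariance matrix; `gaussian_logpdf` is a hypothetical helper written for this example rather than a library function.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian (cov assumed positive definite)."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    d = mean.size
    diff = x - mean
    # Solve a linear system instead of forming the inverse covariance.
    maha = diff @ np.linalg.solve(cov, diff)
    # slogdet returns (sign, log|det|) and avoids overflow in the determinant.
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mean = np.zeros(2)
cov = np.eye(2)
logp_at_mean = gaussian_logpdf(mean, mean, cov)   # equals -log(2*pi) for d = 2
```

Working with the log-density keeps values in a safe numeric range, which matters when many such terms are summed in a loss such as a VAE objective.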
Information Theory
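Definition: Entropy and Cross-Entropy
- Entropy: $H(p) = -\sum_{x} p(x) \log p(x)$, a measure of uncertainty
- Cross-Entropy: $H(p, q) = -\sum_{x} p(x) \log q(x)$, the cost when approximating $p$ with $q$
- KL Divergence: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$, an asymmetric measure of difference between $p$ and $q$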
Minimizing cross-entropy loss in classification is equivalent to maximum likelihood estimation under a categorical distribution.
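The three quantities are tied together by the identity: cross-entropy equals entropy plus KL divergence. The sketch below checks this on two made-up distributions `p` and `q`, using natural logarithms and assuming `q` is nonzero wherever `p` is.

```python
import math

def entropy(p):
    """H(p) = -sum p log p, skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log q; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p log(p / q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # example "true" distribution
q = [0.5, 0.3, 0.2]   # example model distribution

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))   # ~0 up to rounding
```

With a one-hot `p`, the entropy term is zero and the cross-entropy reduces to the negative log-probability the model assigns to the correct class, which is the usual classification loss.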
Practical Applications in Deep Learning
Math in Neural Networks
| Math Concept | Deep Learning Application |
|---|---|
| Matrix multiplication | Forward pass, attention mechanism |
| Chain rule | Backpropagation |
| Eigenvalues | Understanding loss landscape curvature |
| SVD | Weight compression, low-rank approximation |
| Gaussian distribution | Weight initialization, VAE, diffusion |
| KL divergence | Knowledge distillation, VAE loss |
| Cross-entropy | Classification loss function |
| Gradient descent | Parameter optimization |
Numerical Stability
When implementing math operations in deep learning:
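- Use the log-sum-exp trick for log-softmax: $\log \mathrm{softmax}(x)_i = (x_i - c) - \log \sum_{j} \exp(x_j - c)$ with $c = \max_j x_j$
- Add a small $\epsilon$ to denominators, for example $\frac{a}{b + \epsilon}$, to avoid division by zero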
- Use mixed precision (FP16/BF16) carefully — overflow/underflow is common
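The log-sum-exp trick from the list above can be sketched in a few lines. The logits are deliberately large so that a direct `np.exp(logits)` would overflow to infinity in floating point; `log_softmax` here is a hand-written helper for a 1-D input, not a library call.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax for a 1-D array of logits."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)   # largest entry becomes 0, so exp() cannot overflow
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([1000.0, 1001.0, 1002.0])
stable = log_softmax(logits)
```

Subtracting the maximum does not change the result, because softmax is invariant to adding a constant to every logit; `stable` matches the log-softmax of `[0, 1, 2]`.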
Summary
- Linear Algebra: Matrix multiplication, eigendecomposition, and SVD are the building blocks of all neural network operations
- Calculus: The chain rule enables efficient gradient computation through backpropagation
- Probability: Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models
- Numerical Stability: Implementing math operations requires care to avoid overflow, underflow, and division by zero