
The Math Behind Deep Learning — Linear Algebra, Calculus, Probability

Every neural network operation — from matrix multiplications to gradient updates — is built on mathematical principles. Mastering these foundations lets you understand why architectures work and debug training failures at the mathematical level.

Linear Algebra — Vectors, matrices, eigendecomposition, and SVD underpin all neural network operations
Calculus — The chain rule is the backbone of backpropagation, enabling efficient gradient computation
Probability — Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models

Math Foundations for Deep Learning

Deep learning is built on linear algebra, calculus, and probability. This tutorial covers the essential math you need to understand how neural networks learn.

Linear Algebra

Vectors and Matrices

Definition: Vector and Matrix Operations

Dot Product: $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$
Matrix Multiplication: $\mathbf{C} = \mathbf{A}\mathbf{B}$ where $C_{ij} = \sum_k A_{ik} B_{kj}$
Norm: $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
Transpose: $(\mathbf{A}^T)_{ij} = A_{ji}$
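As a quick numerical check of the operations listed above, the following NumPy sketch evaluates each one on small arrays. The values of `u`, `v`, and `A` are arbitrary illustrations chosen so the arithmetic is easy to follow; they are not part of the tutorial.

```python
import numpy as np

# Illustrative values (assumptions, not from the tutorial)
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# Dot product: sum_i u_i * v_i = 3*4 + 4*3
dot = u @ v

# p-norms for p = 2 and p = 1
norm_u = np.linalg.norm(u)          # sqrt(3^2 + 4^2)
norm_v = np.linalg.norm(v)
l1_u = np.linalg.norm(u, ord=1)     # |3| + |4|

# Cosine of the angle, rearranged from u . v = ||u|| ||v|| cos(theta)
cos_theta = dot / (norm_u * norm_v)

# Transpose swaps row and column indices: (A^T)_{ij} = A_{ji}
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
A_T = A.T

print(dot, norm_u, l1_u, cos_theta, A_T.shape)
```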

Matrix Multiplication

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$$

Here,

$A$ = input matrix of shape $(m \times n)$
$B$ = input matrix of shape $(n \times p)$
$C$ = output matrix of shape $(m \times p)$
$n$ = inner dimension (the number of columns of $A$ must equal the number of rows of $B$)
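The index formula and the shape rule can be sketched as an explicit triple loop. The function name `matmul_naive` and the example shapes are hypothetical; in practice NumPy's `@` operator does the same job far faster, and the loop is shown only to mirror the summation.

```python
import numpy as np

def matmul_naive(A, B):
    """Compute C_ij = sum_k A_ik * B_kj with explicit loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Example shapes (assumed): (2 x 3) times (3 x 4) gives (2 x 4)
A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
C = matmul_naive(A, B)

print(C.shape)
print(np.allclose(C, A @ B))
```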

Matrix Operations in Neural Networks

Eigenvalues and Eigenvectors

Definition: Eigenvalue Decomposition

For a square matrix $\mathbf{A}$, a nonzero eigenvector $\mathbf{v}$ and its eigenvalue $\lambda$ satisfy:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

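For a symmetric matrix, the eigendecomposition $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$, with orthogonal $\mathbf{Q}$ whose columns are the eigenvectors, reveals the matrix's action: stretching along eigenvector directions by eigenvalue factors. (A general diagonalizable matrix needs $\mathbf{Q}^{-1}$ in place of $\mathbf{Q}^T$.) This is critical for:

PCA: Principal components are eigenvectors of the covariance matrix
Hessian analysis: Eigenvalues give the curvature along the corresponding eigenvector directions
Spectral initialization: Eigenvectors of weight matrices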

Eigendecomposition

$$\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T \quad \text{where} \quad \mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$$
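A minimal NumPy sketch of this factorization, assuming a symmetric matrix so that $\mathbf{Q}$ is orthogonal and $\mathbf{Q}^T$ acts as its inverse. The 2 x 2 matrix is an arbitrary example; `np.linalg.eigh` is NumPy's routine for symmetric (Hermitian) input and returns eigenvalues in ascending order.

```python
import numpy as np

# Example symmetric matrix (an assumption for illustration)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns of Q
eigvals, Q = np.linalg.eigh(A)
Lambda = np.diag(eigvals)

# Rebuild A from its factors: A = Q Lambda Q^T
A_rebuilt = Q @ Lambda @ Q.T

# Check the defining relation A v = lambda v for the largest eigenpair
v = Q[:, -1]
lam = eigvals[-1]

print(eigvals)
print(np.allclose(A_rebuilt, A), np.allclose(A @ v, lam * v))
```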

Singular Value Decomposition (SVD)

Definition: SVD

Every matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix whose non-negative diagonal entries are the singular values. SVD is used in weight pruning, low-rank approximation, and understanding network expressivity.
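To make the low-rank use concrete, here is a sketch that factors a random matrix and keeps only the top `k` singular values. The matrix size, the seed, and `k = 2` are assumptions for illustration. The final check relies on the Eckart-Young result that the Frobenius error of the truncated SVD equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))      # stand-in for a weight matrix

# Full SVD: U is (6, 6), s holds the 4 singular values, Vt is (4, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma is rectangular (6 x 4) with the singular values on its diagonal
Sigma = np.zeros((6, 4))
Sigma[:4, :4] = np.diag(s)
A_rebuilt = U @ Sigma @ Vt

# Rank-k approximation: keep the k largest singular values and vectors
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error of the approximation
err = np.linalg.norm(A - A_k)
expected_err = np.sqrt(np.sum(s[k:] ** 2))

print(np.allclose(A_rebuilt, A), np.linalg.matrix_rank(A_k), err, expected_err)
```

Storing `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` takes `k * (m + n + 1)` numbers instead of `m * n`, which is where the compression comes from when `k` is small.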

Calculus for Deep Learning

Gradients

Definition: Gradient

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of steepest ascent. Gradient descent moves in the opposite direction to minimize the loss.
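A small sketch of gradient descent on a quadratic loss shows the idea. The loss, the `target` vector, the learning rate of 0.1, and the 100 iterations are all assumed values chosen so the minimum is known in advance.

```python
import numpy as np

# Hypothetical target: the loss f(x) = ||x - target||^2 is minimized there
target = np.array([1.0, -2.0])

def loss(x):
    return np.sum((x - target) ** 2)

def grad(x):
    # Partial derivatives: d/dx_i of sum_j (x_j - t_j)^2 = 2 (x_i - t_i)
    return 2.0 * (x - target)

x = np.zeros(2)            # starting point
learning_rate = 0.1        # step size (assumed)
initial_loss = loss(x)

for _ in range(100):
    # Step against the gradient, i.e. in the direction of steepest descent
    x = x - learning_rate * grad(x)

final_loss = loss(x)
print(x, initial_loss, final_loss)
```

With this step size each update shrinks the distance to `target` by a factor of 0.8, so the iterate converges geometrically.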

The Chain Rule

Definition: Chain Rule

For composite functions $f(g(x))$, the chain rule states:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

For multivariate functions with $\mathbf{z} = g(\mathbf{y})$ and $\mathbf{y} = h(\mathbf{x})$:

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$

This is the foundation of backpropagation.
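To see the chain rule acting as backpropagation, consider a one-neuron sketch: a linear map, a tanh activation, and a squared error. The network, the parameter values, and the function names are hypothetical. The analytic gradients are compared against central finite differences as a sanity check.

```python
import math

def forward(w, b, x, t):
    """Compose three steps: linear map, tanh activation, squared error."""
    y = w * x + b
    a = math.tanh(y)
    loss = (a - t) ** 2
    return y, a, loss

def gradients(w, b, x, t):
    """Chain rule: dL/dw = dL/da * da/dy * dy/dw, and likewise for b."""
    _, a, _ = forward(w, b, x, t)
    dL_da = 2.0 * (a - t)
    da_dy = 1.0 - a ** 2          # derivative of tanh(y)
    dy_dw = x
    dy_db = 1.0
    return dL_da * da_dy * dy_dw, dL_da * da_dy * dy_db

# Assumed example values
w, b, x, t = 0.5, -0.1, 2.0, 1.0
grad_w, grad_b = gradients(w, b, x, t)

# Central finite differences as an independent check
eps = 1e-6
numeric_w = (forward(w + eps, b, x, t)[2] - forward(w - eps, b, x, t)[2]) / (2 * eps)
numeric_b = (forward(w, b + eps, x, t)[2] - forward(w, b - eps, x, t)[2]) / (2 * eps)

print(grad_w, numeric_w)
print(grad_b, numeric_b)
```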

Jacobian and Hessian

Definition: Jacobian Matrix

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

The Jacobian describes how each output component changes with respect to each input component.
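The row and column layout can be checked numerically. The sketch below uses a made-up function from $\mathbb{R}^2$ to $\mathbb{R}^3$, writes its Jacobian by hand, and compares it with a central finite-difference estimate. The function, the evaluation point, and the helper names are assumptions for illustration.

```python
import numpy as np

def f(x):
    # Hypothetical example function f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def analytic_jacobian(x):
    # Row i holds the partial derivatives of f_i; column j is with respect to x_j
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0,          2.0 * x[1]]])

def numerical_jacobian(func, x, eps=1e-6):
    m = func(x).shape[0]
    n = x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        # Perturb one input at a time to fill column j
        J[:, j] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return J

x0 = np.array([0.5, -1.5])
J_num = numerical_jacobian(f, x0)
J_exact = analytic_jacobian(x0)

print(J_num.shape)
print(np.allclose(J_num, J_exact, atol=1e-6))
```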

Definition: Hessian Matrix

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the matrix of second partial derivatives:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Eigenvalues of the Hessian indicate curvature. At a critical point, all-positive eigenvalues indicate a local minimum (the function is locally convex), all-negative eigenvalues indicate a local maximum, and eigenvalues of mixed sign indicate a saddle point.
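A sketch of how Hessian eigenvalues are read at a critical point (where the gradient is zero): all positive means a local minimum, all negative a local maximum, and mixed signs a saddle point. The helper `classify_critical_point`, its tolerance, and the two example functions are hypothetical.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"   # some eigenvalue is (numerically) zero

# f(x, y) = x^2 + y^2 has the constant Hessian diag(2, 2)
H_bowl = np.array([[2.0, 0.0],
                   [0.0, 2.0]])

# f(x, y) = x^2 - y^2 has the constant Hessian diag(2, -2)
H_saddle = np.array([[2.0,  0.0],
                     [0.0, -2.0]])

print(classify_critical_point(H_bowl))
print(classify_critical_point(H_saddle))
```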

Probability Theory

Distributions

Definition: Gaussian Distribution

The multivariate Gaussian distribution is fundamental to deep learning:

$$\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Used in: weight initialization, variational autoencoders, diffusion models, Bayesian deep learning.
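The density is usually evaluated in log space. The sketch below is one way to do that in NumPy, assuming a symmetric positive-definite covariance. The function name `gaussian_log_pdf` and the example inputs are hypothetical. It uses `np.linalg.slogdet` and `np.linalg.solve` rather than forming the determinant and inverse explicitly.

```python
import numpy as np

def gaussian_log_pdf(x, mu, Sigma):
    """Log of N(x | mu, Sigma) for a symmetric positive-definite Sigma."""
    d = mu.shape[0]
    diff = x - mu
    # log|Sigma| without computing the determinant directly
    _, logdet = np.linalg.slogdet(Sigma)
    # (x - mu)^T Sigma^{-1} (x - mu) via a linear solve
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Assumed example: diagonal covariance with variances 4 and 9
mu = np.zeros(2)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 9.0]])
x = np.array([1.0, 2.0])
log_p = gaussian_log_pdf(x, mu, Sigma)

# Standard 2-D Gaussian evaluated at its mean: density is 1 / (2 pi)
log_p_standard = gaussian_log_pdf(np.zeros(2), np.zeros(2), np.eye(2))

print(log_p, log_p_standard)
```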

Information Theory

Definition: Entropy and Cross-Entropy

Entropy: $H(p) = -\sum_i p_i \log p_i$, a measure of uncertainty
Cross-Entropy: $H(p, q) = -\sum_i p_i \log q_i$, the cost when approximating $p$ with $q$
KL Divergence: $D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$, an asymmetric measure of difference

Cross-entropy loss in classification is equivalent to maximum likelihood estimation under a categorical distribution.
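The three quantities are linked by $H(p, q) = H(p) + D_{KL}(p \| q)$, so for a fixed $p$, minimizing cross-entropy over $q$ is the same as minimizing the KL divergence. A short sketch checks this numerically, assuming strictly positive probabilities and natural logarithms. The distributions `p` and `q` are arbitrary examples.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

# Example distributions (assumed; all entries strictly positive)
p = np.array([0.5, 0.25, 0.25])
q = np.array([1.0, 1.0, 1.0]) / 3.0

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(h_p, h_pq, kl_pq, kl_qp)
print(np.isclose(h_pq, h_p + kl_pq))
```

The two KL values differ, which is the asymmetry noted in the definition.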

Cross-Entropy Loss

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
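When $y$ is a one-hot label, every term but one vanishes and the loss reduces to the negative log-probability assigned to the true class. The sketch below illustrates this with made-up predictions; `y`, `y_hat`, and `y_hat_better` are assumed values, and the predicted probabilities are taken to be strictly positive.

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """L_CE = -sum_i y_i * log(y_hat_i) for a single example."""
    return -np.sum(y * np.log(y_hat))

# One-hot label for class index 1 out of C = 3 classes
y = np.array([0.0, 1.0, 0.0])

# Two hypothetical predictions; the second puts more mass on the true class
y_hat = np.array([0.2, 0.7, 0.1])
y_hat_better = np.array([0.05, 0.9, 0.05])

loss = cross_entropy_loss(y, y_hat)
loss_better = cross_entropy_loss(y, y_hat_better)

print(loss, -np.log(0.7))
print(loss_better)
```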

Practical Applications in Deep Learning

Math in Neural Networks

Math Concept	Deep Learning Application
Matrix multiplication	Forward pass, attention mechanism
Chain rule	Backpropagation
Eigenvalues	Understanding loss landscape curvature
SVD	Weight compression, low-rank approximation
Gaussian distribution	Weight initialization, VAE, diffusion
KL divergence	Knowledge distillation, VAE loss
Cross-entropy	Classification loss function
Gradient descent	Parameter optimization

Numerical Stability

When implementing math operations in deep learning:

Use the log-sum-exp trick for log-softmax: $\log \sum_i \exp(x_i) = \max(x) + \log \sum_i \exp(x_i - \max(x))$
Add a small $\epsilon$ to denominators: $\frac{1}{x + \epsilon}$ to avoid division by zero
Use mixed precision (FP16/BF16) carefully: FP16 has a narrow dynamic range, so overflow and underflow are common, while BF16 keeps the FP32 exponent range but has fewer mantissa bits
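The log-sum-exp identity above can be sketched in NumPy as follows. The input values are chosen (as an assumption) to be large enough that the naive formula overflows in float64, and the helper names `logsumexp` and `log_softmax` are local to this example.

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): subtract the max before exponentiating."""
    x_max = np.max(x)
    return x_max + np.log(np.sum(np.exp(x - x_max)))

def log_softmax(x):
    return x - logsumexp(x)

# Large logits (assumed) that break the naive computation
x = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows float64, so the result is inf
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))

stable = logsumexp(x)
log_probs = log_softmax(x)

print(naive, stable)
print(np.exp(log_probs).sum())
```

After the shift, the largest exponent is exactly zero, so `np.exp` never sees a positive argument and cannot overflow.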

Summary

Linear Algebra: Matrix multiplication, eigendecomposition, and SVD are the building blocks of all neural network operations
Calculus: The chain rule enables efficient gradient computation through backpropagation
Probability: Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models
Numerical Stability: Implementing math operations requires care to avoid overflow, underflow, and division by zero

