
Math Foundations for Deep Learning — Linear Algebra, Calculus and Probability



The Math Behind Deep Learning — Linear Algebra, Calculus, Probability

Every neural network operation — from matrix multiplications to gradient updates — is built on mathematical principles. Mastering these foundations lets you understand why architectures work and debug training failures at the mathematical level.

  • Linear Algebra — Vectors, matrices, eigendecomposition, and SVD underpin all neural network operations
  • Calculus — The chain rule is the backbone of backpropagation, enabling efficient gradient computation
  • Probability — Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models

Math Foundations for Deep Learning

Deep learning is built on linear algebra, calculus, and probability. This tutorial covers the essential math you need to understand how neural networks learn.


Linear Algebra

Vectors and Matrices

Definition: Vector and Matrix Operations

  • Dot Product: $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$
  • Matrix Multiplication: $\mathbf{C} = \mathbf{A}\mathbf{B}$ where $C_{ij} = \sum_k A_{ik} B_{kj}$
  • Norm: $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
  • Transpose: $(\mathbf{A}^T)_{ij} = A_{ji}$
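
The definitions above can be checked with a few lines of plain Python. This is a minimal sketch, not a library implementation: the helper names `dot`, `p_norm`, and `transpose` and the example vectors are illustrative assumptions, and vectors are represented as ordinary lists.

```python
import math

def dot(u, v):
    """Dot product: sum_i u_i * v_i."""
    return sum(ui * vi for ui, vi in zip(u, v))

def p_norm(x, p=2):
    """p-norm: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def transpose(A):
    """Transpose of a matrix stored as a list of rows: (A^T)_ij = A_ji."""
    return [list(col) for col in zip(*A)]

# Illustrative vectors (assumed for this example).
u, v = [3.0, 4.0], [4.0, 3.0]

# Rearranging u . v = ||u|| ||v|| cos(theta) gives the cosine of the angle.
cos_theta = dot(u, v) / (p_norm(u) * p_norm(v))
```

Here `dot(u, v)` is 24 and both norms are 5, so `cos_theta` is 0.96: the two vectors point in nearly the same direction.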

Matrix Multiplication

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$$

Here,

  • $A$ = Input matrix of shape (m x n)
  • $B$ = Input matrix of shape (n x p)
  • $C$ = Output matrix of shape (m x p)
  • $n$ = Inner dimension (must match)
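
To make the index formula and the shape rule concrete, here is a small sketch in plain Python. The function name `matmul` and the example matrices are assumptions chosen for illustration; real code would normally use an optimized array library.

```python
def matmul(A, B):
    """Compute C = AB with C_ij = sum_k A_ik * B_kj for matrices stored as lists of rows."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        # The inner dimensions must match: (m x n) times (n x p).
        raise ValueError("inner dimensions do not match")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]          # shape (2 x 3)
B = [[7, 8],
     [9, 10],
     [11, 12]]           # shape (3 x 2)
C = matmul(A, B)         # shape (2 x 2)
```

For example, the top-left entry is $1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 58$, and the full result is `[[58, 64], [139, 154]]`.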

Matrix Operations in Neural Networks

[Figure: Neural Network Layer: Matrix Multiplication + Bias. Input x = (x₁, x₂, x₃, x₄), weight matrix W, bias b = (b₁, b₂, b₃), output z = (z₁, z₂, z₃), with z = Wx + b]
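
The layer computation z = Wx + b can be sketched as follows. This is an illustrative example, not the tutorial's own code: the name `dense_layer` and all numeric values are made up, and W is assumed to have shape (3 x 4) so that a 4-component input maps to a 3-component output.

```python
def dense_layer(W, x, b):
    """Affine layer z = Wx + b: one dot product plus one bias term per output unit."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

x = [1.0, 2.0, 3.0, 4.0]            # input with 4 components
W = [[0.1, 0.0, 0.0, 0.0],          # assumed shape (3 x 4): one row per output unit
     [0.0, 0.5, 0.0, 0.5],
     [1.0, 1.0, 1.0, 1.0]]
b = [0.0, 1.0, -10.0]               # one bias per output unit
z = dense_layer(W, x, b)            # output with 3 components
```

Each output component is the dot product of one row of W with x, shifted by the matching bias; here z is approximately `[0.1, 4.0, 0.0]`.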

Eigenvalues and Eigenvectors

Definition: Eigenvalue Decomposition

For a square matrix $\mathbf{A}$, a nonzero eigenvector $\mathbf{v}$ and eigenvalue $\lambda$ satisfy:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

For a real symmetric matrix, the eigendecomposition $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T$, where the columns of the orthogonal matrix $\mathbf{Q}$ are the eigenvectors, reveals the matrix's action: stretching along eigenvector directions by eigenvalue factors. This is critical for:

  • PCA: Principal components are eigenvectors of the covariance matrix
  • Hessian analysis: Eigenvalues give the curvature along the corresponding eigenvector directions
  • Spectral initialization: Eigenvectors of weight matrices

Eigendecomposition

$$\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T \quad \text{where} \quad \mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$$
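
As a small worked check of the decomposition, the sketch below computes the eigenpairs of a symmetric 2 x 2 matrix in closed form and rebuilds the matrix from them. The function names and the example matrix are assumptions for illustration, and the closed form used here requires a nonzero off-diagonal entry.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (A - lam I) v = 0; normalize it to unit length.
        v = (b, lam - a)
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def reconstruct(pairs):
    """Rebuild A = Q diag(lambda) Q^T as the sum of lambda * v v^T terms."""
    return [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(2)]
            for i in range(2)]

pairs = eig_sym_2x2(2.0, 1.0, 2.0)   # A = [[2, 1], [1, 2]]
A_rebuilt = reconstruct(pairs)
```

For this matrix the eigenvalues are 3 and 1, with eigenvectors along (1, 1) and (1, -1), and `A_rebuilt` recovers `[[2, 1], [1, 2]]` up to rounding.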

Singular Value Decomposition (SVD)

Definition: SVD

Every matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix holding the non-negative singular values. SVD is used in weight pruning, low-rank approximation, and understanding network expressivity.

[Figure: Singular Value Decomposition (SVD). A = U (orthogonal) × Σ (diagonal) × Vᵀ (orthogonal)]
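
The low-rank approximation mentioned above can be illustrated without a full SVD routine. The sketch below is one possible approach, not the method used by any particular library: it finds the largest singular value and its vectors by power iteration on $\mathbf{A}^T\mathbf{A}$, then forms the rank-1 approximation $\sigma_1 \mathbf{u}_1 \mathbf{v}_1^T$. The function names, the starting vector, and the example matrix are assumptions.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_triplet(A, iters=100):
    """Largest singular value with its left and right singular vectors."""
    At = transpose(A)
    # Arbitrary start; assumed not orthogonal to the top right-singular vector.
    v = [1.0] + [0.0] * (len(A[0]) - 1)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))          # one power-iteration step on A^T A
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))  # sigma = ||A v||
    u = [x / sigma for x in Av]                # u = A v / sigma
    return u, sigma, v

A = [[3.0, 0.0],
     [4.0, 5.0]]
u, sigma, v = top_singular_triplet(A)

# Rank-1 approximation: keep only the largest singular value.
A_rank1 = [[sigma * ui * vj for vj in v] for ui in u]
```

For this matrix the singular values are $\sqrt{45}$ and $\sqrt{5}$, so `sigma` is about 6.708 and `A_rank1` is approximately `[[1.5, 1.5], [4.5, 4.5]]`. The Frobenius norm of the difference from A equals the discarded singular value, $\sqrt{5}$.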

Calculus for Deep Learning

Gradients

Definition: Gradient

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of steepest ascent. Gradient descent moves in the opposite direction to minimize the loss.
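
A minimal sketch of both ideas: the gradient is estimated one partial derivative at a time with central differences, and gradient descent repeatedly steps against it. The quadratic `loss`, the learning rate, and the step count are illustrative assumptions; in practice gradients come from automatic differentiation rather than finite differences.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative df/dx_i with a central difference."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

def gradient_descent(f, x0, lr=0.1, steps=200):
    """Move against the gradient: x <- x - lr * grad f(x)."""
    x = list(x0)
    for _ in range(steps):
        g = numerical_gradient(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def loss(x):
    # Hypothetical convex loss with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

x_min = gradient_descent(loss, [0.0, 0.0])
```

At the starting point (0, 0) the gradient is (-2, 8), so the first step increases the first coordinate and decreases the second; after 200 steps `x_min` is close to (1, -2).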

The Chain Rule

Definition: Chain Rule

For composite functions $f(g(x))$, the chain rule states:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

For multivariate functions with $\mathbf{z} = g(\mathbf{y})$ and $\mathbf{y} = h(\mathbf{x})$:

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$

This is the foundation of backpropagation.

[Figure: Chain Rule Visualization. Forward pass: x → y → z, with local derivatives ∂y/∂x and ∂z/∂y; ∂z/∂x = (∂z/∂y) · (∂y/∂x)]
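
The forward pass x → y → z and the backward product of local derivatives can be sketched for a scalar example. The specific functions `h` and `g` are assumptions chosen for illustration; the finite-difference value is only a sanity check on the chain-rule result.

```python
import math

def h(x):            # inner function: y = h(x) = x^2 + 1
    return x * x + 1.0

def g(y):            # outer function: z = g(y) = sin(y)
    return math.sin(y)

def dh_dx(x):        # local derivative dy/dx
    return 2.0 * x

def dg_dy(y):        # local derivative dz/dy
    return math.cos(y)

def forward_backward(x):
    """Forward pass stores y; backward pass multiplies the local derivatives."""
    y = h(x)
    z = g(y)
    dz_dx = dg_dy(y) * dh_dx(x)   # chain rule: dz/dx = (dz/dy) * (dy/dx)
    return z, dz_dx

x0 = 0.5
z, dz_dx = forward_backward(x0)

# Central-difference estimate of the same derivative, for comparison.
eps = 1e-6
dz_dx_numeric = (g(h(x0 + eps)) - g(h(x0 - eps))) / (2.0 * eps)
```

The two derivative values agree to within finite-difference error. Backpropagation applies the same pattern layer by layer, reusing the intermediate values stored during the forward pass.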

Jacobian and Hessian

Definition: Jacobian Matrix

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

The Jacobian describes how each output component changes with respect to each input component.
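
A quick way to see the m x n layout is to estimate the Jacobian numerically, one input component at a time. The function `f` below, mapping two inputs to three outputs, is a made-up example, and `numerical_jacobian` is an illustrative helper rather than a library routine.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Estimate J_ij = d f_i / d x_j with central differences (m rows, n columns)."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J

def f(x):
    # Hypothetical map from R^2 to R^3.
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]

J = numerical_jacobian(f, [2.0, 3.0])
```

At the point (2, 3) the exact Jacobian is `[[3, 2], [1, 1], [4, 0]]`: row i holds the partial derivatives of output i, and column j collects the sensitivity of every output to input j.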

Definition: Hessian Matrix

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the matrix of second partial derivatives:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Eigenvalues of the Hessian indicate curvature. If all eigenvalues are positive, the function is locally convex, and a critical point there is a local minimum. If all are negative, it is locally concave, and a critical point is a local maximum. Eigenvalues of mixed sign at a critical point indicate a saddle point.
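
The eigenvalue test can be sketched for the two-variable case, where the eigenvalues of a symmetric 2 x 2 Hessian have a closed form. The function names, the tolerance, and the example Hessians are assumptions for illustration.

```python
import math

def hessian_eigenvalues_2x2(H):
    """Eigenvalues (smallest, largest) of a symmetric 2 x 2 matrix."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    lo, hi = hessian_eigenvalues_2x2(H)
    if lo > tol:
        return "local minimum"      # all eigenvalues positive
    if hi < -tol:
        return "local maximum"      # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"       # mixed signs
    return "inconclusive"           # a zero eigenvalue: the second-order test fails

bowl = classify_critical_point([[2.0, 0.0], [0.0, 2.0]])      # f = x^2 + y^2
saddle = classify_critical_point([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2
```

Here `bowl` is `"local minimum"` and `saddle` is `"saddle point"`, matching the shapes of $x^2 + y^2$ and $x^2 - y^2$ at the origin.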

[Figure: Gradient Flow in a 2D Loss Landscape. Labels: Start (high loss), Minimum; -∇L points toward minimum]

Probability Theory

Distributions

Definition: Gaussian Distribution

The multivariate Gaussian distribution is fundamental to deep learning:

$$\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Used in: weight initialization, variational autoencoders, diffusion models, Bayesian deep learning.
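
As a hedged illustration, the sketch below evaluates the log of the density above for the special case of a diagonal covariance matrix, where the determinant is the product of the variances and the quadratic form splits into per-dimension terms. It also draws Gaussian samples in the style of a weight initialization. The function name, the seed, and the standard deviation of 0.02 are assumptions, not values from this tutorial.

```python
import math
import random

def diag_gaussian_log_pdf(x, mu, var):
    """Log-density of N(x | mu, diag(var)): the sum of independent 1-D Gaussian terms."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
                      for xi, mi, v in zip(x, mu, var))

# Standard normal in one dimension, evaluated at the mean.
log_p = diag_gaussian_log_pdf([0.0], [0.0], [1.0])   # equals -0.5 * log(2 * pi)

# Gaussian weight initialization with an assumed standard deviation of 0.02.
rng = random.Random(0)                               # fixed seed for repeatability
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
sample_std = math.sqrt(sum(w * w for w in weights) / len(weights))
```

Working with the log-density avoids underflow when the density itself is tiny. The sample standard deviation of the drawn weights should land close to the requested 0.02.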

Information Theory

Definition: Entropy and Cross-Entropy

  • Entropy: $H(p) = -\sum_i p_i \log p_i$, a measure of uncertainty
  • Cross-Entropy: $H(p, q) = -\sum_i p_i \log q_i$, the cost when approximating $p$ with $q$
  • KL Divergence: $D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$, an asymmetric measure of difference

Cross-entropy loss in classification is equivalent to maximum likelihood estimation under a categorical distribution.
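
The three quantities are tied together by the identity $H(p, q) = H(p) + D_{KL}(p \| q)$, which follows directly from the definitions. The sketch below checks it numerically and shows the asymmetry of KL divergence; the distributions `p` and `q` are arbitrary examples, and terms with $p_i = 0$ are skipped by the usual convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # example "true" distribution
q = [0.9, 0.1]   # example approximation

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))   # numerically zero
forward_kl = kl_divergence(p, q)    # about 0.511
reverse_kl = kl_divergence(q, p)    # about 0.368
```

Because $H(p)$ does not depend on $q$, minimizing cross-entropy over $q$ is the same as minimizing $D_{KL}(p \| q)$.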

Cross-Entropy Loss

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
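
For a one-hot label, the sum in the loss collapses to the negative log-probability of the correct class. The sketch below assumes the predictions $\hat{y}$ come from a softmax over raw scores, which is a common setup but an assumption here; the logits and label are made-up values.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(y_true, y_pred):
    """L_CE = -sum_i y_i log(y_hat_i); terms with y_i = 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

logits = [2.0, 1.0, 0.1]      # hypothetical scores for C = 3 classes
y_true = [1.0, 0.0, 0.0]      # one-hot label: class 0 is correct
y_pred = softmax(logits)
loss = cross_entropy_loss(y_true, y_pred)   # equals -log(y_pred[0])
```

With these numbers the correct class receives probability about 0.659, giving a loss of about 0.417. The loss approaches 0 as that probability approaches 1.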
[Figure: Probability Distributions in Deep Learning. Gaussian: weight init, VAE, diffusion. Bernoulli (p=0, p=1): dropout, binary classification. Categorical: classification, softmax]

Practical Applications in Deep Learning

Definition: Math in Neural Networks

| Math Concept | Deep Learning Application |
|---|---|
| Matrix multiplication | Forward pass, attention mechanism |
| Chain rule | Backpropagation |
| Eigenvalues | Understanding loss landscape curvature |
| SVD | Weight compression, low-rank approximation |
| Gaussian distribution | Weight initialization, VAE, diffusion |
| KL divergence | Knowledge distillation, VAE loss |
| Cross-entropy | Classification loss function |
| Gradient descent | Parameter optimization |

Numerical Stability

When implementing math operations in deep learning:

  • Use log-sum-exp trick for log-softmax: $\log \sum_i \exp(x_i) = \max(x) + \log \sum_i \exp(x_i - \max(x))$
  • Add small $\epsilon$ to denominators: $\frac{1}{x + \epsilon}$ to avoid division by zero
  • Use mixed precision (FP16/BF16) carefully — overflow/underflow is common
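
The log-sum-exp identity above can be sketched in a few lines. The function names are illustrative, and the large inputs are chosen only to show where a naive implementation fails: `math.exp(1000.0)` overflows in double precision, while the shifted form never exponentiates anything larger than 0.

```python
import math

def logsumexp(xs):
    """log(sum_i exp(x_i)), computed as max(x) + log(sum_i exp(x_i - max(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(xs):
    """log-softmax: x_i - logsumexp(x), without forming the softmax explicitly."""
    lse = logsumexp(xs)
    return [x - lse for x in xs]

big = [1000.0, 1000.0]
stable = logsumexp(big)              # 1000 + log(2)

# The naive formula fails on the same input.
try:
    naive = math.log(sum(math.exp(x) for x in big))
except OverflowError:
    naive = None                     # math.exp(1000.0) raises OverflowError

log_probs = log_softmax([2.0, 1.0, 0.1])
```

Both forms are mathematically equal; only the shifted one stays within floating-point range. Exponentiating `log_probs` gives probabilities that sum to 1.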

Summary

  • Linear Algebra: Matrix multiplication, eigendecomposition, and SVD are the building blocks of all neural network operations
  • Calculus: The chain rule enables efficient gradient computation through backpropagation
  • Probability: Gaussian distributions, cross-entropy, and KL divergence shape loss functions and generative models
  • Numerical Stability: Implementing math operations requires care to avoid overflow, underflow, and division by zero

Next: Backpropagation Algorithm
