๐ŸŽ‰ 75% of content is free forever โ€” Unlock Premium from $10/mo โ†’
CW
Search coursesโ€ฆ
๐Ÿ’ผ Servicesโ„น๏ธ Aboutโœ‰๏ธ ContactView Pricing Plansfrom $10

Matrix Calculus

Linear AlgebraCalculus๐ŸŸข Free Lesson

Advertisement

Matrix Calculus


Why It Matters

Matrix calculus is the mathematical backbone of modern machine learning. Every time a neural network learns, it computes gradients of a loss function with respect to millions of weights โ€” that computation is matrix calculus. Backpropagation is simply the chain rule applied to compositions of matrix functions. Gradient descent, the optimizer that powers everything from linear regression to GPT, updates parameters by moving in the direction of the negative gradient โ€” a vector of partial derivatives. Without matrix calculus, we cannot understand how models learn, why they fail, or how to design better architectures. From computing the gradient of a quadratic form in ridge regression to deriving the backward pass of a transformer, matrix calculus is the language of optimization.


Scalar by Matrix Derivatives

DfScalar Function of a Matrix

Let f:Rmร—nโ†’Rf: \mathbb{R}^{m \times n} \to \mathbb{R} be a scalar-valued function of a matrix AโˆˆRmร—nA \in \mathbb{R}^{m \times n}. The gradient of ff with respect to AA is an mร—nm \times n matrix where each entry is the partial derivative with respect to the corresponding entry of AA:

โˆ‚fโˆ‚A=[โˆ‚fโˆ‚a11โˆ‚fโˆ‚a12โ‹ฏโˆ‚fโˆ‚a1nโˆ‚fโˆ‚a21โˆ‚fโˆ‚a22โ‹ฏโˆ‚fโˆ‚a2nโ‹ฎโ‹ฎโ‹ฑโ‹ฎโˆ‚fโˆ‚am1โˆ‚fโˆ‚am2โ‹ฏโˆ‚fโˆ‚amn]\frac{\partial f}{\partial A} = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix}

Example: Frobenius Norm Squared

Let f(A)=โˆฅAโˆฅF2=โˆ‘i,jaij2f(A) = \|A\|_F^2 = \sum_{i,j} a_{ij}^2. Then:

โˆ‚fโˆ‚aij=2aijโ€…โ€ŠโŸนโ€…โ€Šโˆ‚fโˆ‚A=2A\frac{\partial f}{\partial a_{ij}} = 2a_{ij} \implies \frac{\partial f}{\partial A} = 2A

Example: Trace

Let f(A)=tr(A)=โˆ‘iaiif(A) = \text{tr}(A) = \sum_i a_{ii}. Then:

โˆ‚fโˆ‚aij={1ifย i=j0ifย iโ‰ jโ€…โ€ŠโŸนโ€…โ€Šโˆ‚fโˆ‚A=I\frac{\partial f}{\partial a_{ij}} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \implies \frac{\partial f}{\partial A} = I

Vector by Vector Derivatives

DfJacobian Matrix

For a vector-valued function fโƒ—:Rnโ†’Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m where fโƒ—(xโƒ—)=[f1(xโƒ—)f2(xโƒ—)โ‹ฎfm(xโƒ—)]\vec{f}(\vec{x}) = \begin{bmatrix} f_1(\vec{x}) \\ f_2(\vec{x}) \\ \vdots \\ f_m(\vec{x}) \end{bmatrix}, the Jacobian is the mร—nm \times n matrix of all first-order partial derivatives:

Jij=โˆ‚fiโˆ‚xjJ_{ij} = \frac{\partial f_i}{\partial x_j}
J=[โˆ‚f1โˆ‚x1โˆ‚f1โˆ‚x2โ‹ฏโˆ‚f1โˆ‚xnโˆ‚f2โˆ‚x1โˆ‚f2โˆ‚x2โ‹ฏโˆ‚f2โˆ‚xnโ‹ฎโ‹ฎโ‹ฑโ‹ฎโˆ‚fmโˆ‚x1โˆ‚fmโˆ‚x2โ‹ฏโˆ‚fmโˆ‚xn]J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Jacobian vs Gradient

  • The Jacobian JโˆˆRmร—nJ \in \mathbb{R}^{m \times n} generalizes the derivative to vector-to-vector maps.
  • The gradient โˆ‡fโˆˆRn\nabla f \in \mathbb{R}^n is the Jacobian of a scalar function f:Rnโ†’Rf: \mathbb{R}^n \to \mathbb{R} (a row vector or column vector depending on convention).
  • For a linear map fโƒ—(xโƒ—)=Axโƒ—\vec{f}(\vec{x}) = A\vec{x}, the Jacobian is simply the matrix AA.

Jacobian of a Nonlinear Map

Let fโƒ—:R2โ†’R2\vec{f}: \mathbb{R}^2 \to \mathbb{R}^2 be defined by fโƒ—(x1,x2)=[x1x2sinโก(x1)+x22]\vec{f}(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ \sin(x_1) + x_2^2 \end{bmatrix}.

J=[โˆ‚f1โˆ‚x1โˆ‚f1โˆ‚x2โˆ‚f2โˆ‚x1โˆ‚f2โˆ‚x2]=[x2x1cosโก(x1)2x2]J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} x_2 & x_1 \\ \cos(x_1) & 2x_2 \end{bmatrix}

At (1,0)(1, 0): J=[01cosโก(1)0]J = \begin{bmatrix} 0 & 1 \\ \cos(1) & 0 \end{bmatrix}.


Key Matrix Calculus Rules

ThFundamental Derivative Rules

The following are the most commonly used derivative identities in matrix calculus. Let xโƒ—โˆˆRn\vec{x} \in \mathbb{R}^n, AโˆˆRmร—nA \in \mathbb{R}^{m \times n}, aโƒ—,bโƒ—โˆˆRn\vec{a}, \vec{b} \in \mathbb{R}^n, and BโˆˆRnร—nB \in \mathbb{R}^{n \times n}.

ExpressionDerivativeNotes
โˆ‚โˆ‚xโƒ—(aโƒ—Txโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{a}^T\vec{x})aโƒ—\vec{a}Linear form โ€” the "constant rule"
โˆ‚โˆ‚xโƒ—(xโƒ—Taโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{x}^T\vec{a})aโƒ—\vec{a}Transpose doesn't change the gradient
โˆ‚โˆ‚xโƒ—(Axโƒ—)\frac{\partial}{\partial \vec{x}}(A\vec{x})ATA^TOutput is matrix; Jacobian is ATA^T
โˆ‚โˆ‚xโƒ—(xโƒ—TBxโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{x}^T B \vec{x})(B+BT)xโƒ—(B + B^T)\vec{x}Quadratic form
โˆ‚โˆ‚xโƒ—(xโƒ—Txโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{x}^T \vec{x})2xโƒ—2\vec{x}Special case: B=IB = I
โˆ‚โˆ‚xโƒ—(โˆฅxโƒ—โˆฅ2)\frac{\partial}{\partial \vec{x}}(\|\vec{x}\|^2)2xโƒ—2\vec{x}Same as above
โˆ‚โˆ‚xโƒ—(aโƒ—TBxโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{a}^T B \vec{x})BTaโƒ—B^T \vec{a}Bilinear form
โˆ‚โˆ‚xโƒ—(sinโก(xโƒ—))\frac{\partial}{\partial \vec{x}}(\sin(\vec{x}))diag(cosโก(xโƒ—))\text{diag}(\cos(\vec{x}))Elementwise โ€” Jacobian is diagonal
โˆ‚โˆ‚xโƒ—(ฯƒ(xโƒ—))\frac{\partial}{\partial \vec{x}}(\sigma(\vec{x}))diag(ฯƒ(xโƒ—)โŠ™(1โˆ’ฯƒ(xโƒ—)))\text{diag}(\sigma(\vec{x}) \odot (1 - \sigma(\vec{x})))Sigmoid derivative

Layout Convention

Different textbooks use different numerator-layout vs denominator-layout conventions. In numerator layout, the gradient of a scalar is a row vector; in denominator layout, it is a column vector. This course uses denominator layout (column vector gradient) to match the convention used in deep learning frameworks like PyTorch and TensorFlow.

Deriving the Quadratic Rule

For f(xโƒ—)=xโƒ—TBxโƒ—=โˆ‘i,jbijxixjf(\vec{x}) = \vec{x}^T B \vec{x} = \sum_{i,j} b_{ij} x_i x_j:

โˆ‚fโˆ‚xk=โˆ‘jbkjxj+โˆ‘ibikxi=(Bxโƒ—)k+(BTxโƒ—)k\frac{\partial f}{\partial x_k} = \sum_j b_{kj} x_j + \sum_i b_{ik} x_i = (B\vec{x})_k + (B^T\vec{x})_k

Therefore: โˆ‚fโˆ‚xโƒ—=Bxโƒ—+BTxโƒ—=(B+BT)xโƒ—\frac{\partial f}{\partial \vec{x}} = B\vec{x} + B^T\vec{x} = (B + B^T)\vec{x}.

If BB is symmetric (B=BTB = B^T): โˆ‚fโˆ‚xโƒ—=2Bxโƒ—\frac{\partial f}{\partial \vec{x}} = 2B\vec{x}.


Chain Rule for Matrices

ThMultivariable Chain Rule

For a composition of functions gโƒ—:Rpโ†’Rm\vec{g}: \mathbb{R}^p \to \mathbb{R}^m and fโƒ—:Rnโ†’Rp\vec{f}: \mathbb{R}^n \to \mathbb{R}^p, defining hโƒ—=gโƒ—โˆ˜fโƒ—\vec{h} = \vec{g} \circ \vec{f}, the Jacobian of hโƒ—\vec{h} is the product of Jacobians:

Jhโƒ—=Jgโƒ—โ‹…Jfโƒ—J_{\vec{h}} = J_{\vec{g}} \cdot J_{\vec{f}}

In terms of partial derivatives:

โˆ‚hiโˆ‚xj=โˆ‘kโˆ‚hiโˆ‚fkโ‹…โˆ‚fkโˆ‚xj\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial h_i}{\partial f_k} \cdot \frac{\partial f_k}{\partial x_j}

For a scalar loss L\mathcal{L} and intermediate variables z1,z2,โ€ฆ,zkz_1, z_2, \ldots, z_k:

โˆ‚Lโˆ‚xj=โˆ‘i=1kโˆ‚Lโˆ‚ziโ‹…โˆ‚ziโˆ‚xj\frac{\partial \mathcal{L}}{\partial x_j} = \sum_{i=1}^{k} \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial x_j}

Why This Matters for Deep Learning

Neural networks are compositions of many functions: y^=fL(fLโˆ’1(โ‹ฏf1(xโƒ—)โ‹ฏโ€‰))\hat{y} = f_L(f_{L-1}(\cdots f_1(\vec{x}) \cdots)). The chain rule lets us decompose the gradient computation into local derivatives at each layer. This is the principle behind backpropagation: compute local Jacobians layer by layer, then multiply them together (in reverse order) to get the total derivative.

Chain Rule in Two Layers

Let zโƒ—=W1xโƒ—\vec{z} = W_1 \vec{x}, aโƒ—=ฯƒ(zโƒ—)\vec{a} = \sigma(\vec{z}), y^=W2aโƒ—\hat{y} = W_2 \vec{a}, and L=12(y^โˆ’y)2\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2.

Forward pass: xโƒ—โ†’zโƒ—โ†’aโƒ—โ†’y^โ†’L\vec{x} \to \vec{z} \to \vec{a} \to \hat{y} \to \mathcal{L}

Backward pass (chain rule):

  • โˆ‚Lโˆ‚y^=y^โˆ’y\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y
  • โˆ‚Lโˆ‚W2=(y^โˆ’y)aโƒ—T\frac{\partial \mathcal{L}}{\partial W_2} = (\hat{y} - y) \vec{a}^T
  • โˆ‚Lโˆ‚aโƒ—=W2T(y^โˆ’y)\frac{\partial \mathcal{L}}{\partial \vec{a}} = W_2^T (\hat{y} - y)
  • โˆ‚Lโˆ‚zโƒ—=ฯƒโ€ฒ(zโƒ—)โŠ™W2T(y^โˆ’y)\frac{\partial \mathcal{L}}{\partial \vec{z}} = \sigma'(\vec{z}) \odot W_2^T(\hat{y} - y)
  • โˆ‚Lโˆ‚W1=โˆ‚Lโˆ‚zโƒ—xโƒ—T\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial \vec{z}} \vec{x}^T

Gradient of Loss Functions

DfMean Squared Error (MSE)

For predictions y^i\hat{y}_i and targets yiy_i over NN samples:

LMSE=1Nโˆ‘i=1N(y^iโˆ’yi)2\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

Gradient with respect to predictions:

โˆ‚Lโˆ‚y^i=2N(y^iโˆ’yi)\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)

For vector predictions yโƒ—^โˆˆRd\hat{\vec{y}} \in \mathbb{R}^d:

โˆ‡yโƒ—^L=2N(yโƒ—^โˆ’yโƒ—)\nabla_{\hat{\vec{y}}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})

DfBinary Cross-Entropy Loss

For binary predictions p^iโˆˆ[0,1]\hat{p}_i \in [0, 1] and targets yiโˆˆ{0,1}y_i \in \{0, 1\}:

LBCE=โˆ’1Nโˆ‘i=1N[yilogโก(p^i)+(1โˆ’yi)logโก(1โˆ’p^i)]\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Gradient with respect to predictions:

โˆ‚Lโˆ‚p^i=โˆ’yip^i+1โˆ’yi1โˆ’p^i=p^iโˆ’yip^i(1โˆ’p^i)\frac{\partial \mathcal{L}}{\partial \hat{p}_i} = \frac{-y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i(1 - \hat{p}_i)}

DfCategorical Cross-Entropy Loss

For KK-class predictions p^k\hat{p}_k (softmax outputs) and one-hot targets yky_k:

LCE=โˆ’1Nโˆ‘i=1Nโˆ‘k=1Kyiklogโก(p^ik)\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})

Gradient with respect to logits zkz_k (before softmax):

โˆ‚Lโˆ‚zk=p^kโˆ’yk\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k

This elegant result โ€” the gradient is simply the prediction minus the target โ€” is why cross-entropy + softmax is the standard output layer for classification.

MSE Derivation

Let f(wโƒ—)=1NโˆฅXwโƒ—โˆ’yโƒ—โˆฅ2=1N(Xwโƒ—โˆ’yโƒ—)T(Xwโƒ—โˆ’yโƒ—)f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2 = \frac{1}{N}(X\vec{w} - \vec{y})^T(X\vec{w} - \vec{y}).

Expanding: f=1N(wโƒ—TXTXwโƒ—โˆ’2yโƒ—TXwโƒ—+yโƒ—Tyโƒ—)f = \frac{1}{N}(\vec{w}^T X^T X \vec{w} - 2\vec{y}^T X \vec{w} + \vec{y}^T \vec{y}).

Using the rules from above:

โˆ‡wโƒ—f=1N(2XTXwโƒ—โˆ’2XTyโƒ—)=2NXT(Xwโƒ—โˆ’yโƒ—)\nabla_{\vec{w}} f = \frac{1}{N}(2X^T X \vec{w} - 2X^T \vec{y}) = \frac{2}{N}X^T(X\vec{w} - \vec{y})

Setting to zero gives the normal equation: XTXwโƒ—=XTyโƒ—X^T X \vec{w} = X^T \vec{y}.


Backpropagation Derivation

ThBackpropagation Through a Neural Network

Consider a feedforward network with LL layers. For layer ll:

  • Pre-activation: zโƒ—(l)=W(l)aโƒ—(lโˆ’1)+bโƒ—(l)\vec{z}^{(l)} = W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)}
  • Activation: aโƒ—(l)=ฯƒ(zโƒ—(l))\vec{a}^{(l)} = \sigma(\vec{z}^{(l)})

Forward pass:

aโƒ—(0)=xโƒ—,zโƒ—(l)=W(l)aโƒ—(lโˆ’1)+bโƒ—(l),aโƒ—(l)=ฯƒ(zโƒ—(l))\vec{a}^{(0)} = \vec{x}, \quad \vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)}, \quad \vec{a}^{(l)} = \sigma(\vec{z}^{(l)})

Output: y^=aโƒ—(L)\hat{y} = \vec{a}^{(L)} (for regression) or pโƒ—^=softmax(zโƒ—(L))\hat{\vec{p}} = \text{softmax}(\vec{z}^{(L)}) (for classification).

Backward pass โ€” error signals:

ฮด(L)=โˆ‡aโƒ—(L)LโŠ™ฯƒโ€ฒ(zโƒ—(L))\delta^{(L)} = \nabla_{\vec{a}^{(L)}} \mathcal{L} \odot \sigma'(\vec{z}^{(L)})
ฮด(l)=((W(l+1))Tฮด(l+1))โŠ™ฯƒโ€ฒ(zโƒ—(l))\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(\vec{z}^{(l)})

Weight gradients:

โˆ‚Lโˆ‚W(l)=ฮด(l)(aโƒ—(lโˆ’1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T
โˆ‚Lโˆ‚bโƒ—(l)=ฮด(l)\frac{\partial \mathcal{L}}{\partial \vec{b}^{(l)}} = \delta^{(l)}

Full 2-Layer Network Derivation

Let xโƒ—โ†’W(1)โ†’zโƒ—(1)=W(1)xโƒ—+bโƒ—(1)โ†’aโƒ—(1)=ฯƒ(zโƒ—(1))โ†’W(2)โ†’zโƒ—(2)=W(2)aโƒ—(1)+bโƒ—(2)โ†’y^=zโƒ—(2)\vec{x} \to W^{(1)} \to \vec{z}^{(1)} = W^{(1)}\vec{x} + \vec{b}^{(1)} \to \vec{a}^{(1)} = \sigma(\vec{z}^{(1)}) \to W^{(2)} \to \vec{z}^{(2)} = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)} \to \hat{y} = \vec{z}^{(2)}.

Loss: L=12(y^โˆ’y)2\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2.

Backward pass:

  1. โˆ‚Lโˆ‚y^=y^โˆ’y\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y
  2. โˆ‚Lโˆ‚W(2)=(y^โˆ’y)(aโƒ—(1))T\frac{\partial \mathcal{L}}{\partial W^{(2)}} = (\hat{y} - y)(\vec{a}^{(1)})^T
  3. โˆ‚Lโˆ‚bโƒ—(2)=y^โˆ’y\frac{\partial \mathcal{L}}{\partial \vec{b}^{(2)}} = \hat{y} - y
  4. โˆ‚Lโˆ‚aโƒ—(1)=W(2)T(y^โˆ’y)\frac{\partial \mathcal{L}}{\partial \vec{a}^{(1)}} = W^{(2)T}(\hat{y} - y)
  5. โˆ‚Lโˆ‚zโƒ—(1)=ฯƒโ€ฒ(zโƒ—(1))โŠ™W(2)T(y^โˆ’y)\frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} = \sigma'(\vec{z}^{(1)}) \odot W^{(2)T}(\hat{y} - y)
  6. โˆ‚Lโˆ‚W(1)=โˆ‚Lโˆ‚zโƒ—(1)xโƒ—T\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} \vec{x}^T
  7. โˆ‚Lโˆ‚bโƒ—(1)=โˆ‚Lโˆ‚zโƒ—(1)\frac{\partial \mathcal{L}}{\partial \vec{b}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}}

Hessian Matrices

DfHessian Matrix

For a twice-differentiable scalar function f:Rnโ†’Rf: \mathbb{R}^n \to \mathbb{R}, the Hessian is the nร—nn \times n symmetric matrix of second-order partial derivatives:

Hij=โˆ‚2fโˆ‚xiโˆ‚xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}
H=[โˆ‚2fโˆ‚x12โˆ‚2fโˆ‚x1โˆ‚x2โ‹ฏโˆ‚2fโˆ‚x1โˆ‚xnโˆ‚2fโˆ‚x2โˆ‚x1โˆ‚2fโˆ‚x22โ‹ฏโˆ‚2fโˆ‚x2โˆ‚xnโ‹ฎโ‹ฎโ‹ฑโ‹ฎโˆ‚2fโˆ‚xnโˆ‚x1โˆ‚2fโˆ‚xnโˆ‚x2โ‹ฏโˆ‚2fโˆ‚xn2]H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

By Clairaut's theorem, HH is symmetric: H=HTH = H^T (assuming continuous second partials).

ThHessian and Optimization

  • Positive definite Hโ‰ป0H \succ 0: local minimum (all eigenvalues >0> 0)
  • Negative definite Hโ‰บ0H \prec 0: local maximum (all eigenvalues <0< 0)
  • Indefinite (mixed eigenvalues): saddle point
  • Semi-definite: inconclusive (higher-order test needed)

Newton's method uses the Hessian:

xโƒ—new=xโƒ—oldโˆ’Hโˆ’1โˆ‡f(xโƒ—old)\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1} \nabla f(\vec{x}_{\text{old}})

For a quadratic form f(xโƒ—)=12xโƒ—THxโƒ—+gโƒ—Txโƒ—+cf(\vec{x}) = \frac{1}{2}\vec{x}^T H \vec{x} + \vec{g}^T \vec{x} + c, the Hessian is constant and the optimum is xโƒ—โˆ—=โˆ’Hโˆ’1gโƒ—\vec{x}^* = -H^{-1}\vec{g}.

Hessian of a Quadratic Form

Let f(xโƒ—)=xโƒ—TBxโƒ—f(\vec{x}) = \vec{x}^T B \vec{x}. We showed โˆ‡f=(B+BT)xโƒ—\nabla f = (B + B^T)\vec{x}.

The Hessian is:

H=โˆ‚โˆ‚xโƒ—[(B+BT)xโƒ—]=B+BTH = \frac{\partial}{\partial \vec{x}} \left[(B + B^T)\vec{x}\right] = B + B^T

If BB is symmetric: H=2BH = 2B.

Hessian of MSE

For f(wโƒ—)=1NโˆฅXwโƒ—โˆ’yโƒ—โˆฅ2f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2, the gradient is 2NXT(Xwโƒ—โˆ’yโƒ—)\frac{2}{N}X^T(X\vec{w} - \vec{y}).

The Hessian is:

H=2NXTXH = \frac{2}{N}X^T X

This is always positive semi-definite, confirming MSE has a unique global minimum (when XX has full column rank).


Python Implementation

import numpy as np

# --- Gradient of f(x) = x^T A x ---
def grad_quadratic(x, A):
    return (A + A.T) @ x

A = np.array([[2, 1], [1, 3]])
x = np.array([1.0, 2.0])
print(f"Gradient of x^T A x: {grad_quadratic(x, A)}")
# Output: [4. 7.]

# --- Hessian of f(x) = x^T A x ---
def hessian_quadratic(A):
    return A + A.T

print(f"Hessian:\n{hessian_quadratic(A)}")
# [[4 2]
#  [2 6]]

# --- Manual Jacobian computation ---
def jacobian(f, x, eps=1e-7):
    """Compute Jacobian via finite differences."""
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    fx = f(x)
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += eps
        J[:, j] = (f(x_plus) - fx) / eps
    return J

# Example: f(x) = [x1*x2, sin(x1) + x2^2]
f = lambda x: np.array([x[0]*x[1], np.sin(x[0]) + x[1]**2])
x0 = np.array([1.0, 0.0])
J = jacobian(f, x0)
print(f"Jacobian at (1,0):\n{J}")
# [[0.         1.        ]
#  [0.54030231 0.        ]]

# --- Autograd with JAX ---
import jax
import jax.numpy as jnp
from jax import grad, hessian

def loss_fn(x):
    return jnp.sum(x**2) + jnp.dot(x, jnp.array([1.0, 2.0]))

# Gradient
g = grad(loss_fn)
print(f"JAX gradient: {g(x)}")

# Hessian
H = hessian(loss_fn)
print(f"JAX Hessian:\n{H}")

# --- Backpropagation: 2-layer network ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def forward_pass(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    return z2, z1, a1

def backward_pass(x, y, z2, z1, a1, W2):
    N = len(y)
    dz2 = (z2 - y) * 2 / N       # dL/dz2
    dW2 = np.outer(dz2, a1)       # dL/dW2
    da1 = W2.T @ dz2              # dL/da1
    dz1 = sigmoid_prime(z1) * da1 # dL/dz1
    dW1 = np.outer(dz1, x)        # dL/dW1
    db1 = dz1                     # dL/db1
    return dW1, db1, dW2

# Simple example
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros(1)
x = np.array([1.0, 0.5])
y = np.array([1.0])

z2, z1, a1 = forward_pass(x, W1, b1, W2, b2)
dW1, db1, dW2 = backward_pass(x, y, z2, z1, a1, W2)

print(f"dW1 shape: {dW1.shape}")  # (3, 2)
print(f"dW2 shape: {dW2.shape}")  # (1, 3)
print(f"db1 shape: {db1.shape}")  # (3,)

Applications in AI/ML

Gradient Descent Optimization

The fundamental update rule: wโƒ—t+1=wโƒ—tโˆ’ฮทโˆ‡wโƒ—L(wโƒ—t)\vec{w}_{t+1} = \vec{w}_t - \eta \nabla_{\vec{w}} \mathcal{L}(\vec{w}_t). All variants โ€” SGD, Adam, AdaGrad, RMSProp โ€” modify this basic step using matrix calculus. Adam tracks running averages of both the gradient (first moment) and the gradient squared (second moment), requiring elementwise matrix operations.

Neural Network Training

Each layer performs a linear transformation Wxโƒ—+bโƒ—W\vec{x} + \vec{b} followed by a nonlinear activation ฯƒ(โ‹…)\sigma(\cdot). Backpropagation computes โˆ‚Lโˆ‚W(l)=ฮด(l)(aโƒ—(lโˆ’1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T for every layer โ€” this is a rank-1 outer product, computed LL times per sample. The total computation for a network with NN parameters over BB samples is O(Bโ‹…N)O(B \cdot N) โ€” the backbone of deep learning scalability.

Attention Mechanisms (Transformers)

The attention function Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V requires computing the gradient through softmax (a matrix function) and matrix multiplication. The derivative of softmax produces a full Jacobian matrix, not a diagonal one, making this one of the more complex backprop computations in modern architectures.

Regularization

L2 regularization adds ฮป2โˆฅWโˆฅF2\frac{\lambda}{2}\|W\|_F^2 to the loss, contributing ฮปW\lambda W to the weight gradient โ€” a direct application of โˆ‚โˆ‚WโˆฅWโˆฅF2=2W\frac{\partial}{\partial W}\|W\|_F^2 = 2W. L1 regularization uses the subgradient of โˆฅWโˆฅ1\|W\|_1, leading to sparse solutions.

Generative Models

In GANs, the discriminator's loss involves gradients through a minimax objective. The generator's gradient โˆ‡zโƒ—D(G(zโƒ—))\nabla_{\vec{z}} D(G(\vec{z})) flows through both the generator and discriminator networks, requiring careful application of the chain rule through both networks simultaneously.


Common Mistakes

MistakeCorrection
Confusing โˆ‚โˆ‚x(Ax)=AT\frac{\partial}{\partial x}(Ax) = A^T with AAThe Jacobian of a linear map AxAx is ATA^T in denominator layout
Forgetting to transpose in the chain ruleWhen chaining f(g(x))f(g(x)), multiply Jacobians: Jfโ‹…JgJ_f \cdot J_g, not Jgโ‹…JfJ_g \cdot J_f
Ignoring elementwise vs matrix operationsฯƒ(xโƒ—)\sigma(\vec{x}) applies elementwise; its Jacobian is diagonal, not ฯƒโ‹…I\sigma \cdot I
Using โˆ‡f=โˆ‚fโˆ‚x\nabla f = \frac{\partial f}{\partial x} without specifying layoutAlways state whether gradient is row or column vector
Assuming โˆ‚โˆ‚x(xTAx)=2Ax\frac{\partial}{\partial x}(x^T Ax) = 2Ax for non-symmetric AAThe correct formula is (A+AT)x(A + A^T)x; only when A=ATA = A^T does it simplify to 2Ax2Ax
Forgetting the 1N\frac{1}{N} factor in loss gradientsMSE gradient is 2NXT(Xwโˆ’y)\frac{2}{N}X^T(Xw - y), not 2XT(Xwโˆ’y)2X^T(Xw - y)
Applying scalar chain rule to matrix chain ruleScalar: ddxf(g(x))=fโ€ฒ(g(x))gโ€ฒ(x)\frac{d}{dx}f(g(x)) = f'(g(x))g'(x). Matrix: Jfโˆ˜g=Jfโ‹…JgJ_{f \circ g} = J_f \cdot J_g (matrix multiply, not elementwise)

Interview Questions

Q1: What is the gradient of f(xโƒ—)=xโƒ—TAxโƒ—f(\vec{x}) = \vec{x}^T A \vec{x}?

(A+AT)xโƒ—(A + A^T)\vec{x}. If AA is symmetric, this simplifies to 2Axโƒ—2A\vec{x}. Derivation: expand f=โˆ‘i,jaijxixjf = \sum_{i,j} a_{ij} x_i x_j, differentiate with respect to xkx_k, and collect terms to get (Axโƒ—)k+(ATxโƒ—)k(A\vec{x})_k + (A^T\vec{x})_k.

Q2: Why is backpropagation O(N)O(N) where NN is the number of parameters?

Each layer's gradient is an outer product ฮด(l)(aโƒ—(lโˆ’1))T\delta^{(l)} (\vec{a}^{(l-1)})^T, which costs O(nlโ‹…nlโˆ’1)O(n_l \cdot n_{l-1}) โ€” exactly the number of parameters in W(l)W^{(l)}. Summing over all LL layers gives O(โˆ‘lnlโ‹…nlโˆ’1)=O(N)O(\sum_l n_l \cdot n_{l-1}) = O(N). The backward pass has the same time complexity as the forward pass.

Q3: What is the Jacobian of the softmax function?

For softmax(zโƒ—)i=eziโˆ‘kezk\text{softmax}(\vec{z})_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, the Jacobian is:

Jij=softmax(zi)(ฮดijโˆ’softmax(zj))J_{ij} = \text{softmax}(z_i)(\delta_{ij} - \text{softmax}(z_j))

where ฮดij\delta_{ij} is the Kronecker delta. In matrix form: J=diag(pโƒ—)โˆ’pโƒ—pโƒ—TJ = \text{diag}(\vec{p}) - \vec{p}\vec{p}^T. This is NOT a diagonal matrix โ€” it's a rank-1 update to a diagonal matrix.

Q4: How does the Hessian relate to Newton's method in optimization?

Newton's method uses the Hessian to compute the optimal step: xโƒ—new=xโƒ—oldโˆ’Hโˆ’1โˆ‡f\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1}\nabla f. If Hโ‰ป0H \succ 0, the step always points toward the minimum. However, computing HH costs O(n2)O(n^2) space and O(n3)O(n^3) for the inverse, so quasi-Newton methods (BFGS, L-BFGS) approximate Hโˆ’1H^{-1} instead.

Q5: Derive the gradient of the cross-entropy loss with respect to the logits.

For softmax output p^k=ezkโˆ‘jezj\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}} and cross-entropy L=โˆ’โˆ‘kyklogโก(p^k)\mathcal{L} = -\sum_k y_k \log(\hat{p}_k):

โˆ‚Lโˆ‚zk=โˆ‘jโˆ‚Lโˆ‚p^jโ‹…โˆ‚p^jโˆ‚zk=โˆ‘j(โˆ’yjp^j)โ‹…p^j(ฮดjkโˆ’p^k)=โˆ’yk+p^kโˆ‘jyj=p^kโˆ’yk\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \frac{\partial \mathcal{L}}{\partial \hat{p}_j} \cdot \frac{\partial \hat{p}_j}{\partial z_k} = \sum_j \left(\frac{-y_j}{\hat{p}_j}\right) \cdot \hat{p}_j(\delta_{jk} - \hat{p}_k) = -y_k + \hat{p}_k \sum_j y_j = \hat{p}_k - y_k

(using โˆ‘jyj=1\sum_j y_j = 1 for one-hot targets). This is why cross-entropy + softmax gives such clean gradients.

Q6: What is the difference between the Jacobian and the Hessian?

The Jacobian JโˆˆRmร—nJ \in \mathbb{R}^{m \times n} is the matrix of first derivatives of a vector function fโƒ—:Rnโ†’Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m. The Hessian HโˆˆRnร—nH \in \mathbb{R}^{n \times n} is the matrix of second derivatives of a scalar function f:Rnโ†’Rf: \mathbb{R}^n \to \mathbb{R}. The Hessian is the Jacobian of the gradient: H=Jโˆ‡fH = J_{\nabla f}.


Practice Problems

Problem 1

Compute โˆ‚โˆ‚xโƒ—(xโƒ—TAxโƒ—)\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x}) for A=[1002]A = \begin{bmatrix}1&0\\0&2\end{bmatrix}.

Solution

Since AA is symmetric, โˆ‚โˆ‚xโƒ—(xโƒ—TAxโƒ—)=2Axโƒ—\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x}) = 2A\vec{x}.

2[1002][x1x2]=[2x14x2]2\begin{bmatrix}1&0\\0&2\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} = \begin{bmatrix}2x_1\\4x_2\end{bmatrix}

Verification: f=x12+2x22f = x_1^2 + 2x_2^2, so โˆ‚fโˆ‚x1=2x1\frac{\partial f}{\partial x_1} = 2x_1, โˆ‚fโˆ‚x2=4x2\frac{\partial f}{\partial x_2} = 4x_2. โœ“

Problem 2

Find the Jacobian of fโƒ—:R3โ†’R2\vec{f}: \mathbb{R}^3 \to \mathbb{R}^2 defined by fโƒ—(x,y,z)=[xy+z2exsinโก(y)]\vec{f}(x, y, z) = \begin{bmatrix} xy + z^2 \\ e^x \sin(y) \end{bmatrix}.

Solution

J=[โˆ‚f1โˆ‚xโˆ‚f1โˆ‚yโˆ‚f1โˆ‚zโˆ‚f2โˆ‚xโˆ‚f2โˆ‚yโˆ‚f2โˆ‚z]=[yx2zexsinโก(y)excosโก(y)0]J = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} & \frac{\partial f_1}{\partial z} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} & \frac{\partial f_2}{\partial z} \end{bmatrix} = \begin{bmatrix} y & x & 2z \\ e^x \sin(y) & e^x \cos(y) & 0 \end{bmatrix}

This is a 2ร—32 \times 3 matrix, mapping from R3\mathbb{R}^3 to R2\mathbb{R}^2.

Problem 3

Derive the gradient of L=12โˆฅWxโƒ—โˆ’yโƒ—โˆฅ2\mathcal{L} = \frac{1}{2}\|W\vec{x} - \vec{y}\|^2 with respect to WW.

Solution

Let rโƒ—=Wxโƒ—โˆ’yโƒ—\vec{r} = W\vec{x} - \vec{y}. Then L=12rโƒ—Trโƒ—\mathcal{L} = \frac{1}{2}\vec{r}^T\vec{r}.

By chain rule: โˆ‚Lโˆ‚Wij=โˆ‘kโˆ‚Lโˆ‚rkโ‹…โˆ‚rkโˆ‚Wij\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_k \frac{\partial \mathcal{L}}{\partial r_k} \cdot \frac{\partial r_k}{\partial W_{ij}}.

We have โˆ‚Lโˆ‚rk=rk\frac{\partial \mathcal{L}}{\partial r_k} = r_k and โˆ‚rkโˆ‚Wij=ฮดkixj\frac{\partial r_k}{\partial W_{ij}} = \delta_{ki} x_j.

Therefore: โˆ‚Lโˆ‚Wij=rixj\frac{\partial \mathcal{L}}{\partial W_{ij}} = r_i x_j, which in matrix form is:

โˆ‚Lโˆ‚W=rโƒ—xโƒ—T=(Wxโƒ—โˆ’yโƒ—)xโƒ—T\frac{\partial \mathcal{L}}{\partial W} = \vec{r} \vec{x}^T = (W\vec{x} - \vec{y})\vec{x}^T

Problem 4

For a two-layer network h=W2ฯƒ(W1x+b1)+b2h = W_2 \sigma(W_1 x + b_1) + b_2, derive โˆ‚hโˆ‚W1\frac{\partial h}{\partial W_1}.

Solution

Let zโƒ—1=W1x+b1\vec{z}_1 = W_1 x + b_1, aโƒ—1=ฯƒ(zโƒ—1)\vec{a}_1 = \sigma(\vec{z}_1), zโƒ—2=W2aโƒ—1+b2\vec{z}_2 = W_2 \vec{a}_1 + b_2, h=zโƒ—2h = \vec{z}_2.

Chain rule:

โˆ‚hโˆ‚W1=โˆ‚hโˆ‚aโƒ—1โ‹…โˆ‚aโƒ—1โˆ‚zโƒ—1โ‹…โˆ‚zโƒ—1โˆ‚W1\frac{\partial h}{\partial W_1} = \frac{\partial h}{\partial \vec{a}_1} \cdot \frac{\partial \vec{a}_1}{\partial \vec{z}_1} \cdot \frac{\partial \vec{z}_1}{\partial W_1}
  • โˆ‚hโˆ‚aโƒ—1=W2\frac{\partial h}{\partial \vec{a}_1} = W_2 (vector: W2W_2 applied to scalar output)
  • โˆ‚aโƒ—1โˆ‚zโƒ—1=diag(ฯƒโ€ฒ(zโƒ—1))\frac{\partial \vec{a}_1}{\partial \vec{z}_1} = \text{diag}(\sigma'(\vec{z}_1))
  • โˆ‚zโƒ—1โˆ‚W1=x\frac{\partial \vec{z}_1}{\partial W_1} = x (outer product structure)

Result: โˆ‚hโˆ‚W1=W2โ‹…diag(ฯƒโ€ฒ(zโƒ—1))โ‹…xT\frac{\partial h}{\partial W_1} = W_2 \cdot \text{diag}(\sigma'(\vec{z}_1)) \cdot x^T

In component form: โˆ‚hโˆ‚(W1)ij=โˆ‘k(W2)kฯƒโ€ฒ((z1)k)ฮดkixj=(W2)iฯƒโ€ฒ((z1)i)xj\frac{\partial h}{\partial (W_1)_{ij}} = \sum_k (W_2)_k \sigma'((z_1)_k) \delta_{ki} x_j = (W_2)_i \sigma'((z_1)_i) x_j


Quick Reference

ConceptFormulaKey Insight
Linear gradientโˆ‡xโƒ—(aโƒ—Txโƒ—)=aโƒ—\nabla_{\vec{x}}(\vec{a}^T\vec{x}) = \vec{a}Derivative of linear form is the coefficient
Quadratic gradientโˆ‡xโƒ—(xโƒ—TBxโƒ—)=(B+BT)xโƒ—\nabla_{\vec{x}}(\vec{x}^T B \vec{x}) = (B+B^T)\vec{x}Symmetric: 2Bxโƒ—2B\vec{x}
Frobenius normโˆ‡AโˆฅAโˆฅF2=2A\nabla_A \|A\|_F^2 = 2AUsed in weight decay / regularization
Linear map JacobianJAxโƒ—=ATJ_{A\vec{x}} = A^TDenominator layout convention
Chain ruleJgโˆ˜f=Jgโ‹…JfJ_{g \circ f} = J_g \cdot J_fMultiply Jacobians, not derivatives
JacobianJij=โˆ‚fiโˆ‚xjJ_{ij} = \frac{\partial f_i}{\partial x_j}mร—nm \times n matrix for fโƒ—:Rnโ†’Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m
HessianHij=โˆ‚2fโˆ‚xiโˆ‚xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}Symmetric matrix of second derivatives
Backprop weight gradโˆ‚Lโˆ‚W(l)=ฮด(l)(aโƒ—(lโˆ’1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^TOuter product of error signal and activation
Backprop error signalฮด(l)=(W(l+1)Tฮด(l+1))โŠ™ฯƒโ€ฒ(zโƒ—(l))\delta^{(l)} = (W^{(l+1)T}\delta^{(l+1)}) \odot \sigma'(\vec{z}^{(l)})Error propagated backward, scaled by derivative
MSE gradientโˆ‡y^L=2N(yโƒ—^โˆ’yโƒ—)\nabla_{\hat{y}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})Proportional to prediction error
Cross-entropy gradientโˆ‚Lโˆ‚zk=p^kโˆ’yk\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_kPrediction minus target (for softmax)

Cross-References

โญ

Premium Content

Matrix Calculus

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

๐ŸŽฏEnd-to-end Projects
๐Ÿ’ผInterview Prep
๐Ÿ“œCertificates
๐ŸคCommunity Access

Already a member? Log in

Need Expert Mathematics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement