Matrix Calculus

Why It Matters

Matrix calculus is the mathematical backbone of modern machine learning. Every time a neural network learns, it computes gradients of a loss function with respect to millions of weights — that computation is matrix calculus. Backpropagation is simply the chain rule applied to compositions of matrix functions. Gradient descent, the optimizer that powers everything from linear regression to GPT, updates parameters by moving in the direction of the negative gradient — a vector of partial derivatives. Without matrix calculus, we cannot understand how models learn, why they fail, or how to design better architectures. From computing the gradient of a quadratic form in ridge regression to deriving the backward pass of a transformer, matrix calculus is the language of optimization.

Scalar by Matrix Derivatives

DfScalar Function of a Matrix

Let $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ be a scalar-valued function of a matrix $A \in \mathbb{R}^{m \times n}$ . The gradient of $f$ with respect to $A$ is an $m \times n$ matrix where each entry is the partial derivative with respect to the corresponding entry of $A$ :

\frac{\partial f}{\partial A} = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix}

Example: Frobenius Norm Squared

Let $f(A) = \|A\|_F^2 = \sum_{i,j} a_{ij}^2$ . Then:

\frac{\partial f}{\partial a_{ij}} = 2a_{ij} \implies \frac{\partial f}{\partial A} = 2A

Example: Trace

Let $f(A) = \text{tr}(A) = \sum_i a_{ii}$ . Then:

\frac{\partial f}{\partial a_{ij}} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \implies \frac{\partial f}{\partial A} = I

Vector by Vector Derivatives

DfJacobian Matrix

For a vector-valued function $\vec{f}: \mathbb{R}^n \to \mathbb{R}^m$ where $\vec{f}(\vec{x}) = \begin{bmatrix} f_1(\vec{x}) \\ f_2(\vec{x}) \\ \vdots \\ f_m(\vec{x}) \end{bmatrix}$ , the Jacobian is the $m \times n$ matrix of all first-order partial derivatives:

J_{ij} = \frac{\partial f_i}{\partial x_j}

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Jacobian vs Gradient

The Jacobian $J \in \mathbb{R}^{m \times n}$ generalizes the derivative to vector-to-vector maps.
The gradient $\nabla f \in \mathbb{R}^n$ is the Jacobian of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ (a row vector or column vector depending on convention).
For a linear map $\vec{f}(\vec{x}) = A\vec{x}$ , the Jacobian is simply the matrix $A$ .

Jacobian of a Nonlinear Map

Let $\vec{f}: \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $\vec{f}(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ \sin(x_1) + x_2^2 \end{bmatrix}$ .

J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} x_2 & x_1 \\ \cos(x_1) & 2x_2 \end{bmatrix}

At $(1, 0)$ : $J = \begin{bmatrix} 0 & 1 \\ \cos(1) & 0 \end{bmatrix}$ .

Key Matrix Calculus Rules

ThFundamental Derivative Rules

The following are the most commonly used derivative identities in matrix calculus. Let $\vec{x} \in \mathbb{R}^n$ , $A \in \mathbb{R}^{m \times n}$ , $\vec{a}, \vec{b} \in \mathbb{R}^n$ , and $B \in \mathbb{R}^{n \times n}$ .

Expression	Derivative	Notes
$\frac{\partial}{\partial \vec{x}}(\vec{a}^T\vec{x})$	$\vec{a}$	Linear form — the "constant rule"
$\frac{\partial}{\partial \vec{x}}(\vec{x}^T\vec{a})$	$\vec{a}$	Transpose doesn't change the gradient
$\frac{\partial}{\partial \vec{x}}(A\vec{x})$	$A^T$	Output is matrix; Jacobian is $A^T$
$\frac{\partial}{\partial \vec{x}}(\vec{x}^T B \vec{x})$	$(B + B^T)\vec{x}$	Quadratic form
$\frac{\partial}{\partial \vec{x}}(\vec{x}^T \vec{x})$	$2\vec{x}$	Special case: $B = I$
$\frac{\partial}{\partial \vec{x}}(\\|\vec{x}\\|^2)$	$2\vec{x}$	Same as above
$\frac{\partial}{\partial \vec{x}}(\vec{a}^T B \vec{x})$	$B^T \vec{a}$	Bilinear form
$\frac{\partial}{\partial \vec{x}}(\sin(\vec{x}))$	$\text{diag}(\cos(\vec{x}))$	Elementwise — Jacobian is diagonal
$\frac{\partial}{\partial \vec{x}}(\sigma(\vec{x}))$	$\text{diag}(\sigma(\vec{x}) \odot (1 - \sigma(\vec{x})))$	Sigmoid derivative

Layout Convention

Different textbooks use different numerator-layout vs denominator-layout conventions. In numerator layout, the gradient of a scalar is a row vector; in denominator layout, it is a column vector. This course uses denominator layout (column vector gradient) to match the convention used in deep learning frameworks like PyTorch and TensorFlow.

Deriving the Quadratic Rule

For $f(\vec{x}) = \vec{x}^T B \vec{x} = \sum_{i,j} b_{ij} x_i x_j$ :

\frac{\partial f}{\partial x_k} = \sum_j b_{kj} x_j + \sum_i b_{ik} x_i = (B\vec{x})_k + (B^T\vec{x})_k

Therefore: $\frac{\partial f}{\partial \vec{x}} = B\vec{x} + B^T\vec{x} = (B + B^T)\vec{x}$ .

If $B$ is symmetric ( $B = B^T$ ): $\frac{\partial f}{\partial \vec{x}} = 2B\vec{x}$ .

Chain Rule for Matrices

ThMultivariable Chain Rule

For a composition of functions $\vec{g}: \mathbb{R}^p \to \mathbb{R}^m$ and $\vec{f}: \mathbb{R}^n \to \mathbb{R}^p$ , defining $\vec{h} = \vec{g} \circ \vec{f}$ , the Jacobian of $\vec{h}$ is the product of Jacobians:

J_{\vec{h}} = J_{\vec{g}} \cdot J_{\vec{f}}

In terms of partial derivatives:

\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial h_i}{\partial f_k} \cdot \frac{\partial f_k}{\partial x_j}

For a scalar loss $\mathcal{L}$ and intermediate variables $z_1, z_2, \ldots, z_k$ :

\frac{\partial \mathcal{L}}{\partial x_j} = \sum_{i=1}^{k} \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial x_j}

Why This Matters for Deep Learning

Neural networks are compositions of many functions: $\hat{y} = f_L(f_{L-1}(\cdots f_1(\vec{x}) \cdots))$ . The chain rule lets us decompose the gradient computation into local derivatives at each layer. This is the principle behind backpropagation: compute local Jacobians layer by layer, then multiply them together (in reverse order) to get the total derivative.

Chain Rule in Two Layers

Let $\vec{z} = W_1 \vec{x}$ , $\vec{a} = \sigma(\vec{z})$ , $\hat{y} = W_2 \vec{a}$ , and $\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2$ .

Forward pass: $\vec{x} \to \vec{z} \to \vec{a} \to \hat{y} \to \mathcal{L}$

Backward pass (chain rule):

$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y$
$\frac{\partial \mathcal{L}}{\partial W_2} = (\hat{y} - y) \vec{a}^T$
$\frac{\partial \mathcal{L}}{\partial \vec{a}} = W_2^T (\hat{y} - y)$
$\frac{\partial \mathcal{L}}{\partial \vec{z}} = \sigma'(\vec{z}) \odot W_2^T(\hat{y} - y)$
$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial \vec{z}} \vec{x}^T$

Gradient of Loss Functions

DfMean Squared Error (MSE)

For predictions $\hat{y}_i$ and targets $y_i$ over $N$ samples:

\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

Gradient with respect to predictions:

\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)

For vector predictions $\hat{\vec{y}} \in \mathbb{R}^d$ :

\nabla_{\hat{\vec{y}}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})

DfBinary Cross-Entropy Loss

For binary predictions $\hat{p}_i \in [0, 1]$ and targets $y_i \in \{0, 1\}$ :

\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Gradient with respect to predictions:

\frac{\partial \mathcal{L}}{\partial \hat{p}_i} = \frac{-y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i(1 - \hat{p}_i)}

DfCategorical Cross-Entropy Loss

For $K$ -class predictions $\hat{p}_k$ (softmax outputs) and one-hot targets $y_k$ :

\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})

Gradient with respect to logits $z_k$ (before softmax):

\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k

This elegant result — the gradient is simply the prediction minus the target — is why cross-entropy + softmax is the standard output layer for classification.

MSE Derivation

Let $f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2 = \frac{1}{N}(X\vec{w} - \vec{y})^T(X\vec{w} - \vec{y})$ .

Expanding: $f = \frac{1}{N}(\vec{w}^T X^T X \vec{w} - 2\vec{y}^T X \vec{w} + \vec{y}^T \vec{y})$ .

Using the rules from above:

\nabla_{\vec{w}} f = \frac{1}{N}(2X^T X \vec{w} - 2X^T \vec{y}) = \frac{2}{N}X^T(X\vec{w} - \vec{y})

Setting to zero gives the normal equation: $X^T X \vec{w} = X^T \vec{y}$ .

Backpropagation Derivation

ThBackpropagation Through a Neural Network

Consider a feedforward network with $L$ layers. For layer $l$ :

Pre-activation: $\vec{z}^{(l)} = W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)}$
Activation: $\vec{a}^{(l)} = \sigma(\vec{z}^{(l)})$

Forward pass:

\vec{a}^{(0)} = \vec{x}, \quad \vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)}, \quad \vec{a}^{(l)} = \sigma(\vec{z}^{(l)})

Output: $\hat{y} = \vec{a}^{(L)}$ (for regression) or $\hat{\vec{p}} = \text{softmax}(\vec{z}^{(L)})$ (for classification).

Backward pass — error signals:

\delta^{(L)} = \nabla_{\vec{a}^{(L)}} \mathcal{L} \odot \sigma'(\vec{z}^{(L)})

\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(\vec{z}^{(l)})

Weight gradients:

\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T

\frac{\partial \mathcal{L}}{\partial \vec{b}^{(l)}} = \delta^{(l)}

Full 2-Layer Network Derivation

Let $\vec{x} \to W^{(1)} \to \vec{z}^{(1)} = W^{(1)}\vec{x} + \vec{b}^{(1)} \to \vec{a}^{(1)} = \sigma(\vec{z}^{(1)}) \to W^{(2)} \to \vec{z}^{(2)} = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)} \to \hat{y} = \vec{z}^{(2)}$ .

Loss: $\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2$ .

Backward pass:

$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y$
$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = (\hat{y} - y)(\vec{a}^{(1)})^T$
$\frac{\partial \mathcal{L}}{\partial \vec{b}^{(2)}} = \hat{y} - y$
$\frac{\partial \mathcal{L}}{\partial \vec{a}^{(1)}} = W^{(2)T}(\hat{y} - y)$
$\frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} = \sigma'(\vec{z}^{(1)}) \odot W^{(2)T}(\hat{y} - y)$
$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} \vec{x}^T$
$\frac{\partial \mathcal{L}}{\partial \vec{b}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}}$

Hessian Matrices

DfHessian Matrix

For a twice-differentiable scalar function $f: \mathbb{R}^n \to \mathbb{R}$ , the Hessian is the $n \times n$ symmetric matrix of second-order partial derivatives:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

By Clairaut's theorem, $H$ is symmetric: $H = H^T$ (assuming continuous second partials).

ThHessian and Optimization

Positive definite $H \succ 0$ : local minimum (all eigenvalues $> 0$ )
Negative definite $H \prec 0$ : local maximum (all eigenvalues $< 0$ )
Indefinite (mixed eigenvalues): saddle point
Semi-definite: inconclusive (higher-order test needed)

Newton's method uses the Hessian:

\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1} \nabla f(\vec{x}_{\text{old}})

For a quadratic form $f(\vec{x}) = \frac{1}{2}\vec{x}^T H \vec{x} + \vec{g}^T \vec{x} + c$ , the Hessian is constant and the optimum is $\vec{x}^* = -H^{-1}\vec{g}$ .

Hessian of a Quadratic Form

Let $f(\vec{x}) = \vec{x}^T B \vec{x}$ . We showed $\nabla f = (B + B^T)\vec{x}$ .

The Hessian is:

H = \frac{\partial}{\partial \vec{x}} \left[(B + B^T)\vec{x}\right] = B + B^T

If $B$ is symmetric: $H = 2B$ .

Hessian of MSE

For $f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2$ , the gradient is $\frac{2}{N}X^T(X\vec{w} - \vec{y})$ .

The Hessian is:

H = \frac{2}{N}X^T X

This is always positive semi-definite, confirming MSE has a unique global minimum (when $X$ has full column rank).

Python Implementation

import numpy as np

# --- Gradient of f(x) = x^T A x ---
def grad_quadratic(x, A):
    return (A + A.T) @ x

A = np.array([[2, 1], [1, 3]])
x = np.array([1.0, 2.0])
print(f"Gradient of x^T A x: {grad_quadratic(x, A)}")
# Output: [4. 7.]

# --- Hessian of f(x) = x^T A x ---
def hessian_quadratic(A):
    return A + A.T

print(f"Hessian:\n{hessian_quadratic(A)}")
# [[4 2]
#  [2 6]]

# --- Manual Jacobian computation ---
def jacobian(f, x, eps=1e-7):
    """Compute Jacobian via finite differences."""
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    fx = f(x)
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += eps
        J[:, j] = (f(x_plus) - fx) / eps
    return J

# Example: f(x) = [x1*x2, sin(x1) + x2^2]
f = lambda x: np.array([x[0]*x[1], np.sin(x[0]) + x[1]**2])
x0 = np.array([1.0, 0.0])
J = jacobian(f, x0)
print(f"Jacobian at (1,0):\n{J}")
# [[0.         1.        ]
#  [0.54030231 0.        ]]

# --- Autograd with JAX ---
import jax
import jax.numpy as jnp
from jax import grad, hessian

def loss_fn(x):
    return jnp.sum(x**2) + jnp.dot(x, jnp.array([1.0, 2.0]))

# Gradient
g = grad(loss_fn)
print(f"JAX gradient: {g(x)}")

# Hessian
H = hessian(loss_fn)
print(f"JAX Hessian:\n{H}")

# --- Backpropagation: 2-layer network ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def forward_pass(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    return z2, z1, a1

def backward_pass(x, y, z2, z1, a1, W2):
    N = len(y)
    dz2 = (z2 - y) * 2 / N       # dL/dz2
    dW2 = np.outer(dz2, a1)       # dL/dW2
    da1 = W2.T @ dz2              # dL/da1
    dz1 = sigmoid_prime(z1) * da1 # dL/dz1
    dW1 = np.outer(dz1, x)        # dL/dW1
    db1 = dz1                     # dL/db1
    return dW1, db1, dW2

# Simple example
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros(1)
x = np.array([1.0, 0.5])
y = np.array([1.0])

z2, z1, a1 = forward_pass(x, W1, b1, W2, b2)
dW1, db1, dW2 = backward_pass(x, y, z2, z1, a1, W2)

print(f"dW1 shape: {dW1.shape}")  # (3, 2)
print(f"dW2 shape: {dW2.shape}")  # (1, 3)
print(f"db1 shape: {db1.shape}")  # (3,)

Applications in AI/ML

Gradient Descent Optimization

The fundamental update rule: $\vec{w}_{t+1} = \vec{w}_t - \eta \nabla_{\vec{w}} \mathcal{L}(\vec{w}_t)$ . All variants — SGD, Adam, AdaGrad, RMSProp — modify this basic step using matrix calculus. Adam tracks running averages of both the gradient (first moment) and the gradient squared (second moment), requiring elementwise matrix operations.

Neural Network Training

Each layer performs a linear transformation $W\vec{x} + \vec{b}$ followed by a nonlinear activation $\sigma(\cdot)$ . Backpropagation computes $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T$ for every layer — this is a rank-1 outer product, computed $L$ times per sample. The total computation for a network with $N$ parameters over $B$ samples is $O(B \cdot N)$ — the backbone of deep learning scalability.

Attention Mechanisms (Transformers)

The attention function $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ requires computing the gradient through softmax (a matrix function) and matrix multiplication. The derivative of softmax produces a full Jacobian matrix, not a diagonal one, making this one of the more complex backprop computations in modern architectures.

Regularization

L2 regularization adds $\frac{\lambda}{2}\|W\|_F^2$ to the loss, contributing $\lambda W$ to the weight gradient — a direct application of $\frac{\partial}{\partial W}\|W\|_F^2 = 2W$ . L1 regularization uses the subgradient of $\|W\|_1$ , leading to sparse solutions.

Generative Models

In GANs, the discriminator's loss involves gradients through a minimax objective. The generator's gradient $\nabla_{\vec{z}} D(G(\vec{z}))$ flows through both the generator and discriminator networks, requiring careful application of the chain rule through both networks simultaneously.

Common Mistakes

Mistake	Correction
Confusing $\frac{\partial}{\partial x}(Ax) = A^T$ with $A$	The Jacobian of a linear map $Ax$ is $A^T$ in denominator layout
Forgetting to transpose in the chain rule	When chaining $f(g(x))$ , multiply Jacobians: $J_f \cdot J_g$ , not $J_g \cdot J_f$
Ignoring elementwise vs matrix operations	$\sigma(\vec{x})$ applies elementwise; its Jacobian is diagonal, not $\sigma \cdot I$
Using $\nabla f = \frac{\partial f}{\partial x}$ without specifying layout	Always state whether gradient is row or column vector
Assuming $\frac{\partial}{\partial x}(x^T Ax) = 2Ax$ for non-symmetric $A$	The correct formula is $(A + A^T)x$ ; only when $A = A^T$ does it simplify to $2Ax$
Forgetting the $\frac{1}{N}$ factor in loss gradients	MSE gradient is $\frac{2}{N}X^T(Xw - y)$ , not $2X^T(Xw - y)$
Applying scalar chain rule to matrix chain rule	Scalar: $\frac{d}{dx}f(g(x)) = f'(g(x))g'(x)$ . Matrix: $J_{f \circ g} = J_f \cdot J_g$ (matrix multiply, not elementwise)

Interview Questions

Q1: What is the gradient of $f(\vec{x}) = \vec{x}^T A \vec{x}$ ?

$(A + A^T)\vec{x}$ . If $A$ is symmetric, this simplifies to $2A\vec{x}$ . Derivation: expand $f = \sum_{i,j} a_{ij} x_i x_j$ , differentiate with respect to $x_k$ , and collect terms to get $(A\vec{x})_k + (A^T\vec{x})_k$ .

Q2: Why is backpropagation $O(N)$ where $N$ is the number of parameters?

Each layer's gradient is an outer product $\delta^{(l)} (\vec{a}^{(l-1)})^T$ , which costs $O(n_l \cdot n_{l-1})$ — exactly the number of parameters in $W^{(l)}$ . Summing over all $L$ layers gives $O(\sum_l n_l \cdot n_{l-1}) = O(N)$ . The backward pass has the same time complexity as the forward pass.

Q3: What is the Jacobian of the softmax function?

For $\text{softmax}(\vec{z})_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$ , the Jacobian is:

J_{ij} = \text{softmax}(z_i)(\delta_{ij} - \text{softmax}(z_j))

where $\delta_{ij}$ is the Kronecker delta. In matrix form: $J = \text{diag}(\vec{p}) - \vec{p}\vec{p}^T$ . This is NOT a diagonal matrix — it's a rank-1 update to a diagonal matrix.

Q4: How does the Hessian relate to Newton's method in optimization?

Newton's method uses the Hessian to compute the optimal step: $\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1}\nabla f$ . If $H \succ 0$ , the step always points toward the minimum. However, computing $H$ costs $O(n^2)$ space and $O(n^3)$ for the inverse, so quasi-Newton methods (BFGS, L-BFGS) approximate $H^{-1}$ instead.

Q5: Derive the gradient of the cross-entropy loss with respect to the logits.

For softmax output $\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$ and cross-entropy $\mathcal{L} = -\sum_k y_k \log(\hat{p}_k)$ :

\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \frac{\partial \mathcal{L}}{\partial \hat{p}_j} \cdot \frac{\partial \hat{p}_j}{\partial z_k} = \sum_j \left(\frac{-y_j}{\hat{p}_j}\right) \cdot \hat{p}_j(\delta_{jk} - \hat{p}_k) = -y_k + \hat{p}_k \sum_j y_j = \hat{p}_k - y_k

(using $\sum_j y_j = 1$ for one-hot targets). This is why cross-entropy + softmax gives such clean gradients.

Q6: What is the difference between the Jacobian and the Hessian?

The Jacobian $J \in \mathbb{R}^{m \times n}$ is the matrix of first derivatives of a vector function $\vec{f}: \mathbb{R}^n \to \mathbb{R}^m$ . The Hessian $H \in \mathbb{R}^{n \times n}$ is the matrix of second derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ . The Hessian is the Jacobian of the gradient: $H = J_{\nabla f}$ .

Practice Problems

Problem 1

Compute $\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x})$ for $A = \begin{bmatrix}1&0\\0&2\end{bmatrix}$ .

Solution

Since $A$ is symmetric, $\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x}) = 2A\vec{x}$ .

2\begin{bmatrix}1&0\\0&2\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} = \begin{bmatrix}2x_1\\4x_2\end{bmatrix}

Verification: $f = x_1^2 + 2x_2^2$ , so $\frac{\partial f}{\partial x_1} = 2x_1$ , $\frac{\partial f}{\partial x_2} = 4x_2$ . ✓

Problem 2

Find the Jacobian of $\vec{f}: \mathbb{R}^3 \to \mathbb{R}^2$ defined by $\vec{f}(x, y, z) = \begin{bmatrix} xy + z^2 \\ e^x \sin(y) \end{bmatrix}$ .

Solution

J = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} & \frac{\partial f_1}{\partial z} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} & \frac{\partial f_2}{\partial z} \end{bmatrix} = \begin{bmatrix} y & x & 2z \\ e^x \sin(y) & e^x \cos(y) & 0 \end{bmatrix}

This is a $2 \times 3$ matrix, mapping from $\mathbb{R}^3$ to $\mathbb{R}^2$ .

Problem 3

Derive the gradient of $\mathcal{L} = \frac{1}{2}\|W\vec{x} - \vec{y}\|^2$ with respect to $W$ .

Solution

Let $\vec{r} = W\vec{x} - \vec{y}$ . Then $\mathcal{L} = \frac{1}{2}\vec{r}^T\vec{r}$ .

By chain rule: $\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_k \frac{\partial \mathcal{L}}{\partial r_k} \cdot \frac{\partial r_k}{\partial W_{ij}}$ .

We have $\frac{\partial \mathcal{L}}{\partial r_k} = r_k$ and $\frac{\partial r_k}{\partial W_{ij}} = \delta_{ki} x_j$ .

Therefore: $\frac{\partial \mathcal{L}}{\partial W_{ij}} = r_i x_j$ , which in matrix form is:

\frac{\partial \mathcal{L}}{\partial W} = \vec{r} \vec{x}^T = (W\vec{x} - \vec{y})\vec{x}^T

Problem 4

For a two-layer network $h = W_2 \sigma(W_1 x + b_1) + b_2$ , derive $\frac{\partial h}{\partial W_1}$ .

Solution

Let $\vec{z}_1 = W_1 x + b_1$ , $\vec{a}_1 = \sigma(\vec{z}_1)$ , $\vec{z}_2 = W_2 \vec{a}_1 + b_2$ , $h = \vec{z}_2$ .

Chain rule:

\frac{\partial h}{\partial W_1} = \frac{\partial h}{\partial \vec{a}_1} \cdot \frac{\partial \vec{a}_1}{\partial \vec{z}_1} \cdot \frac{\partial \vec{z}_1}{\partial W_1}

$\frac{\partial h}{\partial \vec{a}_1} = W_2$ (vector: $W_2$ applied to scalar output)
$\frac{\partial \vec{a}_1}{\partial \vec{z}_1} = \text{diag}(\sigma'(\vec{z}_1))$
$\frac{\partial \vec{z}_1}{\partial W_1} = x$ (outer product structure)

Result: $\frac{\partial h}{\partial W_1} = W_2 \cdot \text{diag}(\sigma'(\vec{z}_1)) \cdot x^T$

In component form: $\frac{\partial h}{\partial (W_1)_{ij}} = \sum_k (W_2)_k \sigma'((z_1)_k) \delta_{ki} x_j = (W_2)_i \sigma'((z_1)_i) x_j$

Quick Reference

Concept	Formula	Key Insight
Linear gradient	$\nabla_{\vec{x}}(\vec{a}^T\vec{x}) = \vec{a}$	Derivative of linear form is the coefficient
Quadratic gradient	$\nabla_{\vec{x}}(\vec{x}^T B \vec{x}) = (B+B^T)\vec{x}$	Symmetric: $2B\vec{x}$
Frobenius norm	$\nabla_A \\|A\\|_F^2 = 2A$	Used in weight decay / regularization
Linear map Jacobian	$J_{A\vec{x}} = A^T$	Denominator layout convention
Chain rule	$J_{g \circ f} = J_g \cdot J_f$	Multiply Jacobians, not derivatives
Jacobian	$J_{ij} = \frac{\partial f_i}{\partial x_j}$	$m \times n$ matrix for $\vec{f}: \mathbb{R}^n \to \mathbb{R}^m$
Hessian	$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$	Symmetric matrix of second derivatives
Backprop weight grad	$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T$	Outer product of error signal and activation
Backprop error signal	$\delta^{(l)} = (W^{(l+1)T}\delta^{(l+1)}) \odot \sigma'(\vec{z}^{(l)})$	Error propagated backward, scaled by derivative
MSE gradient	$\nabla_{\hat{y}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})$	Proportional to prediction error
Cross-entropy gradient	$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$	Prediction minus target (for softmax)

Cross-References

Previous topic: 020 - Linear Algebra Basis and Dimension
Next topic: 022 - Linear Algebra Advanced
Related concepts:
- Matrix Multiplication — matrix calculus operations rely on matrix multiplication
- Eigenvalues and Eigenvectors — Hessian eigenvectors define principal curvature directions
- Calculus Chain Rule — scalar chain rule; matrix chain rule generalizes it
- Partial Derivatives — foundation of all matrix calculus
- Optimization — gradient descent and Newton's method

Matrix Calculus

Matrix Calculus

Scalar by Matrix Derivatives

DfScalar Function of a Matrix

Example: Frobenius Norm Squared

Example: Trace

Vector by Vector Derivatives

DfJacobian Matrix

Jacobian of a Nonlinear Map

Key Matrix Calculus Rules

ThFundamental Derivative Rules

Deriving the Quadratic Rule

Chain Rule for Matrices

ThMultivariable Chain Rule

Chain Rule in Two Layers

Gradient of Loss Functions

DfMean Squared Error (MSE)

DfBinary Cross-Entropy Loss

DfCategorical Cross-Entropy Loss

MSE Derivation

Backpropagation Derivation

ThBackpropagation Through a Neural Network

Full 2-Layer Network Derivation

Hessian Matrices

DfHessian Matrix

ThHessian and Optimization

Hessian of a Quadratic Form

Hessian of MSE

Python Implementation

Applications in AI/ML

Gradient Descent Optimization

Neural Network Training

Attention Mechanisms (Transformers)

Regularization

Generative Models

Common Mistakes

Interview Questions

Practice Problems

Problem 1

Solution

Problem 2

Solution

Problem 3

Solution

Problem 4

Solution

Quick Reference

Cross-References

Premium Content

Need Expert Mathematics Help?