Backpropagation — How Neural Networks Actually Learn

Backpropagation is the algorithm used to train nearly all modern neural networks. It efficiently computes the gradient of the loss with respect to every parameter using the chain rule, and gradient descent then uses those gradients to update the parameters.

Forward Pass — Compute outputs layer by layer while storing intermediate values
Backward Pass — Propagate error signals backward to compute parameter gradients
Efficient Gradient Computation — Same complexity as forward pass, reusing intermediate computations

Backpropagation Algorithm — Forward Pass, Backward Pass and Computational Graphs

What Is Backpropagation?

Definition: Backpropagation

Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to each weight in the network. It works by:

Forward pass: Compute the output by propagating input through the network
Compute loss: Measure the error between prediction and target
Backward pass: Propagate the error backward, computing gradients via the chain rule
Update weights: Adjust parameters to reduce the loss

The key insight: backpropagation computes all $\frac{\partial \mathcal{L}}{\partial w}$ in a single backward pass, reusing intermediate computations.
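The four steps above can be sketched on the smallest possible model: a single weight fitted to a single data point. The data point, initial weight, and learning rate below are illustrative assumptions, not values from the text.

```python
# Fit y = w * x to one data point with a squared-error loss.
# x, target, w, and lr are illustrative values chosen for this sketch.
x, target = 2.0, 6.0   # the ideal weight is 3.0
w = 0.0                # initial parameter
lr = 0.05              # learning rate (assumed)

losses = []
for step in range(50):
    # 1. Forward pass
    pred = w * x
    # 2. Compute loss
    loss = (pred - target) ** 2
    losses.append(loss)
    # 3. Backward pass: chain rule, dL/dw = dL/dpred * dpred/dw
    dloss_dpred = 2.0 * (pred - target)
    dpred_dw = x
    grad_w = dloss_dpred * dpred_dw
    # 4. Update weights
    w -= lr * grad_w

print(w, losses[0], losses[-1])
```

With these values each update shrinks the distance between `w` and 3.0 by a constant factor, so the loss falls steadily toward zero.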

Computational Graphs

Definition: Computational Graph

A computational graph is a directed acyclic graph (DAG) where:

Nodes represent operations (add, multiply, activation)
Edges represent data flow (tensors)

Example: $f(x, y) = (x + y) \cdot (x \cdot y)$

Each node computes a local function. Backpropagation traverses this graph in reverse to compute gradients.
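To make the reverse traversal concrete, here is a hand-written sketch of the example graph with one add node, one multiply node, and an output node. The function and variable names are illustrative; the inputs x = 2, y = 3 are chosen only for the demonstration.

```python
def forward(x, y):
    a = x + y        # add node
    b = x * y        # multiply node
    f = a * b        # output node
    return f, (x, y, a, b)

def backward(cache):
    x, y, a, b = cache
    df_df = 1.0              # seed: derivative of the output w.r.t. itself
    df_da = df_df * b        # local derivative of f = a * b w.r.t. a
    df_db = df_df * a        # local derivative of f = a * b w.r.t. b
    # x and y each feed two nodes, so their contributions are summed
    df_dx = df_da * 1.0 + df_db * y
    df_dy = df_da * 1.0 + df_db * x
    return df_dx, df_dy

f, cache = forward(2.0, 3.0)
grads = backward(cache)
print(f, grads)  # 30.0 (21.0, 16.0)
```

The result agrees with differentiating the expanded form $f = x^2 y + x y^2$ directly: $\partial f / \partial x = 2xy + y^2 = 21$ and $\partial f / \partial y = x^2 + 2xy = 16$.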

Forward Pass

Definition: Forward Pass

The forward pass computes the output of each layer sequentially:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$$

$$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

where $\mathbf{h}^{(0)} = \mathbf{x}$ (input), and $\sigma$ is the activation function.

The forward pass also stores all intermediate values (activations, pre-activations) needed for the backward pass.

Forward Pass Equations

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

Here,

$\mathbf{W}^{(l)}$ = weight matrix at layer $l$
$\mathbf{b}^{(l)}$ = bias vector at layer $l$
$\mathbf{z}^{(l)}$ = pre-activation at layer $l$
$\mathbf{h}^{(l)}$ = activation at layer $l$
$\sigma$ = activation function
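The forward pass equations translate almost line for line into NumPy. The sketch below assumes a sigmoid activation and small illustrative layer sizes; the `forward` helper and its cache layout are hypothetical names, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the network output plus the cached z and h of every layer."""
    hs = [x]     # hs[l] holds h^(l); hs[0] is the input x
    zs = []      # zs[l-1] holds z^(l)
    for W, b in zip(weights, biases):
        z = W @ hs[-1] + b       # z^(l) = W^(l) h^(l-1) + b^(l)
        zs.append(z)
        hs.append(sigmoid(z))    # h^(l) = sigma(z^(l))
    return hs[-1], zs, hs

rng = np.random.default_rng(0)
sizes = [4, 5, 3]  # input, hidden, output dimensions (illustrative)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
x = rng.standard_normal(sizes[0])

out, zs, hs = forward(x, weights, biases)
print(out.shape, len(zs), len(hs))
```

Returning `zs` and `hs` alongside the output is the point of the sketch: these are the intermediate values the backward pass reads back.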

Backward Pass

Definition: Backward Pass

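The backward pass computes gradients using the chain rule. For a loss $\mathcal{L}$, the gradient for the weights of layer $l$ factors through that layer's activation and pre-activation:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}}$$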

Starting from the output layer and propagating backward:

Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}}$ at the output
For each layer $l = L, L-1, \ldots, 1$:
- Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \odot \sigma'(\mathbf{z}^{(l)})$, where $\odot$ is the elementwise product
- Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} \cdot (\mathbf{h}^{(l-1)})^T$
- Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$
- Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l-1)}} = (\mathbf{W}^{(l)})^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$
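The layer-by-layer recursion above can be checked numerically. The following NumPy sketch assumes a sigmoid activation and a squared-error loss $\mathcal{L} = \tfrac{1}{2}\|\mathbf{h}^{(L)} - \mathbf{y}\|^2$ (both are choices made for this example, as are the helper names), then compares one backpropagated weight gradient against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    hs, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ hs[-1] + b)
        hs.append(sigmoid(zs[-1]))
    return zs, hs

def loss_fn(x, y, weights, biases):
    _, hs = forward(x, weights, biases)
    return 0.5 * np.sum((hs[-1] - y) ** 2)   # assumed squared-error loss

def backward(x, y, weights, biases):
    zs, hs = forward(x, weights, biases)
    grad_h = hs[-1] - y                      # dL/dh^(L) for the assumed loss
    grads_W, grads_b = [], []
    for i in reversed(range(len(weights))):  # index i corresponds to layer l = i + 1
        s = sigmoid(zs[i])
        grad_z = grad_h * s * (1.0 - s)      # dL/dz = dL/dh (elementwise) sigma'(z)
        grads_W.append(np.outer(grad_z, hs[i]))  # dL/dW = dL/dz (h^(l-1))^T
        grads_b.append(grad_z)               # dL/db = dL/dz
        grad_h = weights[i].T @ grad_z       # dL/dh^(l-1) = W^T dL/dz
    return grads_W[::-1], grads_b[::-1]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(3)]
x, y = rng.standard_normal(4), rng.standard_normal(3)

grads_W, grads_b = backward(x, y, weights, biases)

# Central finite difference on a single weight entry as an independent check
eps = 1e-6
weights[0][1, 2] += eps
loss_plus = loss_fn(x, y, weights, biases)
weights[0][1, 2] -= 2 * eps
loss_minus = loss_fn(x, y, weights, biases)
weights[0][1, 2] += eps                      # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

print(numeric, grads_W[0][1, 2])
```

The two printed numbers should agree to several decimal places. Note that the finite-difference check needs two forward passes per parameter, whereas the backward pass produces every gradient at once.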

Gradient Flow Through Layers

Definition: Gradient Magnitude

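The gradient with respect to the first layer's weights passes through all $L$ layers:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \left( \prod_{l=2}^{L} \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}} \right) \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{W}^{(1)}}$$

where the product is ordered from $l = L$ down to $l = 2$. Each Jacobian is

$$\frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}} = \text{diag}(\sigma'(\mathbf{z}^{(l)})) \cdot \mathbf{W}^{(l)}$$

If the singular values of these Jacobians are consistently below 1, the product shrinks exponentially with depth and gradients vanish. If they are consistently above 1, the product grows exponentially and gradients explode.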
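A scalar chain is enough to see both regimes, because the Jacobian of each layer reduces to the single number $\sigma'(z) \cdot w$. The sketch below is a toy illustration under that assumption; the weights, depth, and input are arbitrary choices, and the identity activation is used only to isolate the effect of the weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never exceeds 0.25

def input_gradient(w, depth, activation, activation_grad, x=0.5):
    """d h^(depth) / d x for the scalar chain h^(l) = activation(w * h^(l-1))."""
    h, grad = x, 1.0
    for _ in range(depth):
        z = w * h
        grad *= activation_grad(z) * w   # scalar version of diag(sigma'(z)) W
        h = activation(z)
    return grad

# Per-layer factor at most 0.25: the product collapses over 20 layers
vanishing = input_gradient(1.0, 20, sigmoid, sigmoid_grad)
# Per-layer factor of exactly 1.5: the product grows as 1.5 ** 20
exploding = input_gradient(1.5, 20, lambda z: z, lambda z: 1.0)

print(vanishing, exploding)
```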

PyTorch Autograd

Definition: Automatic Differentiation

PyTorch's autograd computes gradients automatically using reverse-mode differentiation:

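```python
import torch

# Forward pass: autograd records each operation in a computation graph
x = torch.randn(3, requires_grad=True)
y = x ** 2 + 2 * x
z = y.sum()

# Backward pass: computes dz/dx for every leaf tensor that requires grad
z.backward()

# x.grad now holds dz/dx = 2*x + 2; compare against the analytic result
print(torch.allclose(x.grad, 2 * x.detach() + 2))
```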

Autograd builds a computation graph during the forward pass and traverses it in reverse during backward().

Gradient Computation Modes

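Reverse-mode (backpropagation): Computes the gradient of one scalar output with respect to all inputs in a single backward pass. Efficient when input dimension >> output dimension, which is typical in neural networks (many parameters, one scalar loss).
Forward-mode: Computes the derivative of all outputs with respect to one input direction per pass. Efficient when output dimension >> input dimension.
PyTorch: Uses reverse-mode by default. torch.autograd.grad returns the gradients of chosen outputs with respect to chosen inputs directly instead of accumulating them into .grad.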
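The cost difference between the two modes is easiest to see with dual numbers, a standard way to implement forward-mode differentiation. The minimal `Dual` class below is a hypothetical teaching sketch, not a PyTorch API; it supports only the addition and multiplication needed for the earlier example $f(x, y) = (x + y) \cdot (x \cdot y)$.

```python
class Dual:
    """A value paired with its derivative along one chosen input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def f(x, y):
    return (x + y) * (x * y)

# Forward mode needs one pass per input: seed that input's tangent with 1.
df_dx = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).tangent
df_dy = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).tangent
print(df_dx, df_dy)  # 21.0 16.0
```

Two inputs required two passes here. A network with millions of parameters would need millions of forward-mode passes, which is why reverse mode, with one backward pass for a scalar loss, is the default for training.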

Practical Considerations

Definition: Backpropagation Best Practices

Issue	Solution
Vanishing gradients	Use ReLU, skip connections, proper initialization
Exploding gradients	Gradient clipping: $\mathbf{g} \leftarrow \frac{\mathbf{g}}{\max(1, \|\mathbf{g}\|/\text{threshold})}$
Numerical instability	Use log-sum-exp, stable softmax implementation
Memory usage	Gradient checkpointing: recompute activations during backward
Mixed precision	Use `torch.cuda.amp` (`autocast` with `GradScaler` loss scaling) to run FP16 compute while parameters and their gradients stay in FP32
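Two rows of the table reduce to a few lines of arithmetic. The pure-Python sketch below implements the clipping formula exactly as written and the log-sum-exp trick of subtracting the maximum before exponentiating. The function names are illustrative; in practice a framework's built-in versions would normally be used.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / threshold)   # equals 1 when the norm is already small enough
    return [g / scale for g in grad]

def log_sum_exp(values):
    """Compute log(sum(exp(v))) without overflow by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)   # norm 5 is rescaled to norm 1
small = clip_by_norm([0.3, 0.4], threshold=1.0)     # norm 0.5 is left unchanged
lse = log_sum_exp([1000.0, 1000.0])                 # math.exp(1000.0) alone would overflow

print(clipped, small, lse)
```

Clipping preserves the gradient's direction and changes only its length, which is why it is preferred over truncating individual components when the goal is to tame exploding gradients.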

Common Pitfalls

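Forgetting to zero gradients: call optimizer.zero_grad() before each backward pass, otherwise gradients accumulate across iterations
Not calling .backward() on the loss tensor, which leaves every .grad unpopulated
Modifying in place a tensor that autograd needs for the backward pass
Using .data instead of .detach() to detach from the computation graph: .data bypasses autograd's checks, so incorrect gradients can go unnoticed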
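The first pitfall is easy to reproduce. In this small sketch (tensor values are illustrative), calling backward() twice without resetting doubles the stored gradient, because PyTorch adds new gradients to whatever is already in .grad.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # d/dw sum(w^2) = 2 * w

(w ** 2).sum().backward()       # no reset in between
accumulated = w.grad.clone()    # the second gradient was added to the first

w.grad.zero_()                  # manual reset; optimizer.zero_grad() serves this purpose in a training loop
(w ** 2).sum().backward()
fresh = w.grad.clone()          # matches the first result again

print(first, accumulated, fresh)
```

Accumulation is deliberate: it allows gradients from several mini-batches to be summed before a single optimizer step. It only becomes a bug when the reset is forgotten.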

Summary

Backpropagation efficiently computes gradients using the chain rule
The forward pass computes outputs and stores intermediates; the backward pass computes gradients
Computational graphs represent the sequence of operations for automatic differentiation
Vanishing/exploding gradients are fundamental challenges, mitigated by choices such as ReLU activations, skip connections, proper initialization, and gradient clipping
PyTorch autograd implements reverse-mode automatic differentiation
