DL Foundations
Backpropagation — How Neural Networks Actually Learn
Backpropagation is the algorithm that trains all neural networks. It efficiently computes the gradient of the loss with respect to every parameter using the chain rule, enabling gradient descent optimization.
- Forward Pass — Compute outputs layer by layer while storing intermediate values
- Backward Pass — Propagate error signals backward to compute parameter gradients
- Efficient Gradient Computation — Same complexity as forward pass, reusing intermediate computations
Backpropagation Algorithm — Forward Pass, Backward Pass and Computational Graphs
Backpropagation is the algorithm that enables neural networks to learn from data. It efficiently computes the gradient of the loss with respect to every parameter using the chain rule.
What Is Backpropagation?
DfBackpropagation
Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to each weight in the network. It works by:
- Forward pass: Compute the output by propagating input through the network
- Compute loss: Measure the error between prediction and target
- Backward pass: Propagate the error backward, computing gradients via the chain rule
- Update weights: Adjust parameters to reduce the loss
The key insight: backpropagation computes all in a single backward pass, reusing intermediate computations.
Computational Graphs
DfComputational Graph
A computational graph is a directed acyclic graph (DAG) where:
- Nodes represent operations (add, multiply, activation)
- Edges represent data flow (tensors)
Example:
Each node computes a local function. Backpropagation traverses this graph in reverse to compute gradients.
Forward Pass
DfForward Pass
The forward pass computes the output of each layer sequentially:
where (input), and is the activation function.
The forward pass also stores all intermediate values (activations, pre-activations) needed for the backward pass.
Forward Pass Equations
Here,
- =Weight matrix at layer l
- =Bias vector at layer l
- =Pre-activation at layer l
- =Activation at layer l
- =Activation function
Backward Pass
DfBackward Pass
The backward pass computes gradients using the chain rule. For a loss :
Starting from the output layer and propagating backward:
- Compute at the output
- For each layer :
- Compute
- Compute
- Compute
Gradient Flow Through Layers
DfGradient Magnitude
The gradient through layers is:
Each Jacobian
If eigenvalues of these Jacobians are < 1, gradients vanish. If > 1, gradients explode.
PyTorch Autograd
DfAutomatic Differentiation
PyTorch's autograd computes gradients automatically using reverse-mode differentiation:
import torch
# Forward pass with computation graph
x = torch.randn(3, requires_grad=True)
y = x ** 2 + 2 * x
z = y.sum()
# Backward pass (computes all gradients)
z.backward()
# x.grad contains dz/dx = 2*x + 2
Autograd builds a computation graph during the forward pass and traverses it in reverse during backward().
Gradient Computation Modes
- Reverse-mode (backpropagation): Computes all gradients in one backward pass. Efficient when output dimension >> input dimension (typical in neural networks).
- Forward-mode: Computes gradient of one output w.r.t. all inputs. Efficient when input dimension >> output dimension.
- PyTorch: Uses reverse-mode by default. Use
torch.autograd.gradfor custom gradient computation.
Practical Considerations
DfBackpropagation Best Practices
| Issue | Solution |
|---|---|
| Vanishing gradients | Use ReLU, skip connections, proper initialization |
| Exploding gradients | Gradient clipping: |
| Numerical instability | Use log-sum-exp, stable softmax implementation |
| Memory usage | Gradient checkpointing: recompute activations during backward |
| Mixed precision | Use torch.cuda.amp for FP16 training with FP32 gradients |
Common Pitfalls
- Forgetting to zero gradients:
optimizer.zero_grad()before each backward pass - Not calling
.backward()on loss tensor - Modifying tensors in-place that require gradients
- Using
.datainstead of.detach()for detaching from computation graph
Summary
- Backpropagation efficiently computes gradients using the chain rule
- The forward pass computes outputs and stores intermediates; the backward pass computes gradients
- Computational graphs represent the sequence of operations for automatic differentiation
- Vanishing/exploding gradients are fundamental challenges solved by architectural innovations
- PyTorch autograd implements reverse-mode automatic differentiation
Next: Activation Functions