Backpropagation Algorithm — Forward Pass, Backward Pass and Computational Graphs

Backpropagation — How Neural Networks Actually Learn

Backpropagation is the algorithm used to compute gradients when training neural networks. It applies the chain rule to obtain the gradient of the loss with respect to every parameter efficiently; an optimizer such as gradient descent then uses those gradients to update the parameters.

  • Forward Pass — Compute outputs layer by layer while storing intermediate values
  • Backward Pass — Propagate error signals backward to compute parameter gradients
  • Efficient Gradient Computation — Same complexity as forward pass, reusing intermediate computations

What Is Backpropagation?

Definition: Backpropagation

Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to each weight in the network. Within one training step it is used as follows:

  1. Forward pass: Compute the output by propagating input through the network
  2. Compute loss: Measure the error between prediction and target
  3. Backward pass: Propagate the error backward, computing gradients via the chain rule
  4. Update weights: Adjust parameters to reduce the loss

The key insight: backpropagation computes every $\frac{\partial \mathcal{L}}{\partial w}$ in a single backward pass by reusing intermediate computations.
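
The four steps above can be sketched on the smallest possible case. This is a minimal illustration, assuming a hypothetical one-parameter model `pred = w * x` with a squared-error loss; the function name `training_step`, the learning rate, and the data values are made up for the example.

```python
def training_step(w, x, target, lr=0.1):
    # 1. Forward pass: compute the prediction.
    pred = w * x
    # 2. Compute loss: squared error between prediction and target.
    loss = (pred - target) ** 2
    # 3. Backward pass: chain rule, dL/dw = dL/dpred * dpred/dw.
    dloss_dpred = 2.0 * (pred - target)
    dpred_dw = x
    grad_w = dloss_dpred * dpred_dw
    # 4. Update weights: step against the gradient.
    w_new = w - lr * grad_w
    return w_new, loss

w = 0.0
losses = []
for _ in range(20):
    w, loss = training_step(w, x=2.0, target=6.0)
    losses.append(loss)
# w approaches 3.0, the value for which w * 2.0 == 6.0
```

Only step 3 is backpropagation in the strict sense; steps 1, 2 and 4 are the surrounding training loop.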


Computational Graphs

Definition: Computational Graph

A computational graph is a directed acyclic graph (DAG) where:

  • Nodes represent operations (add, multiply, activation)
  • Edges represent data flow (tensors)

Example: $f(x, y) = (x + y) \cdot (x \cdot y)$

Each node computes a local function. Backpropagation traverses this graph in reverse to compute gradients.

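Figure: Computational graph for $f(x, y) = (x + y) \cdot (x \cdot y)$, with intermediate nodes $a = x + y$ and $b = x \cdot y$ and output $f = a \cdot b$. Backward pass: $\partial f/\partial f = 1$, $\partial f/\partial a = b$, $\partial f/\partial b = a$, $\partial f/\partial x = b + a \cdot y$, $\partial f/\partial y = b + a \cdot x$.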
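
The reverse traversal for this example can be written out by hand. The sketch below is only an illustration (the function name `forward_backward` is hypothetical). Because $x$ and $y$ each feed both the add node and the multiply node, their gradients are the sum of the contributions along both paths, giving $\partial f/\partial x = b + a \cdot y$ and $\partial f/\partial y = b + a \cdot x$.

```python
def forward_backward(x, y):
    # Forward pass: evaluate each node and keep the intermediate values.
    a = x + y          # add node
    b = x * y          # multiply node
    f = a * b          # output node

    # Backward pass: visit the nodes in reverse order.
    df_df = 1.0
    df_da = df_df * b  # local derivative of f = a * b w.r.t. a
    df_db = df_df * a  # local derivative of f = a * b w.r.t. b
    # x and y reach f through both a and b, so the two paths are summed.
    df_dx = df_da * 1.0 + df_db * y
    df_dy = df_da * 1.0 + df_db * x
    return f, df_dx, df_dy

f, df_dx, df_dy = forward_backward(2.0, 3.0)
# f = 30.0, df_dx = 21.0, df_dy = 16.0
```

As a check, expanding $f = x^2 y + x y^2$ gives $\partial f/\partial x = 2xy + y^2 = 21$ and $\partial f/\partial y = x^2 + 2xy = 16$ at $(2, 3)$.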

Forward Pass

Definition: Forward Pass

The forward pass computes the output of each layer sequentially:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

where $\mathbf{h}^{(0)} = \mathbf{x}$ (input), and $\sigma$ is the activation function.

The forward pass also stores all intermediate values (activations, pre-activations) needed for the backward pass.

Forward Pass Equations

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

Here,

  • $\mathbf{W}^{(l)}$ = Weight matrix at layer $l$
  • $\mathbf{b}^{(l)}$ = Bias vector at layer $l$
  • $\mathbf{z}^{(l)}$ = Pre-activation at layer $l$
  • $\mathbf{h}^{(l)}$ = Activation at layer $l$
  • $\sigma$ = Activation function

Figure: Forward pass through a 3-layer network (input to output, computing predictions). Starting from $\mathbf{h}^{(0)} = \mathbf{x}$, each layer $l = 1, 2, 3$ computes $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$ and $\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$; the output is $\hat{\mathbf{y}} = \mathbf{h}^{(3)}$.
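
The forward equations translate almost line for line into NumPy. The following is a minimal sketch, assuming a sigmoid activation and hypothetical layer widths (4, 5, 3, 2); the names `forward`, `zs` and `hs` are chosen for the example. It returns the cached pre-activations and activations that a backward pass would need.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Run the forward pass; return the output and the cached intermediates."""
    h = x
    zs, hs = [], [x]              # hs[0] is h^(0) = x
    for W, b in zip(weights, biases):
        z = W @ h + b             # pre-activation z^(l)
        h = sigmoid(z)            # activation h^(l)
        zs.append(z)
        hs.append(h)
    return h, zs, hs

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]              # hypothetical widths: input 4, output 2
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.standard_normal(4)
y_hat, zs, hs = forward(x, weights, biases)
```

Keeping `zs` and `hs` costs memory proportional to the total number of activations, which is the trade-off that gradient checkpointing later relaxes.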

Backward Pass

Definition: Backward Pass

The backward pass computes gradients using the chain rule. For a loss $\mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{W}^{(l)}}$$

Starting from the output layer and propagating backward:

  1. Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}}$ at the output
  2. For each layer $l = L, L-1, \ldots, 1$:
    • Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \odot \sigma'(\mathbf{z}^{(l)})$ (elementwise product)
    • Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} \cdot (\mathbf{h}^{(l-1)})^T$
    • Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$
    • Compute $\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l-1)}} = (\mathbf{W}^{(l)})^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$

Figure: Backward pass gradient flow through the 3-layer network (output to input, computing gradients). Starting from $\partial \mathcal{L}/\partial \mathbf{h}^{(3)}$, each layer computes $\partial \mathcal{L}/\partial \mathbf{z}^{(l)}$ and $\partial \mathcal{L}/\partial \mathbf{W}^{(l)} = \partial \mathcal{L}/\partial \mathbf{z}^{(l)} \cdot (\mathbf{h}^{(l-1)})^T$, then passes the error to the layer below through $(\mathbf{W}^{(l)})^T$.
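
The layer-by-layer recursion can be checked numerically. The sketch below is one possible NumPy implementation under stated assumptions: a sigmoid activation, a squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert \mathbf{h}^{(L)} - \mathbf{t} \rVert^2$, and hypothetical layer widths. It also returns the bias gradient, which equals $\partial \mathcal{L}/\partial \mathbf{z}^{(l)}$. One analytic gradient entry is compared against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    hs, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ hs[-1] + b)
        hs.append(sigmoid(zs[-1]))
    return zs, hs

def loss_fn(h_out, target):
    return 0.5 * np.sum((h_out - target) ** 2)

def backward(target, weights, zs, hs):
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    dL_dh = hs[-1] - target                   # dL/dh^(L) for squared error
    for i in reversed(range(len(weights))):   # i = l - 1 (0-based layer index)
        s = sigmoid(zs[i])
        dL_dz = dL_dh * s * (1.0 - s)         # elementwise: dL/dh * sigma'(z)
        grads_W[i] = np.outer(dL_dz, hs[i])   # dL/dz (h^(l-1))^T
        grads_b[i] = dL_dz
        dL_dh = weights[i].T @ dL_dz          # pass the error to the layer below
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                             # hypothetical layer widths
weights = [rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(o) for o in sizes[1:]]
x, target = rng.standard_normal(3), rng.standard_normal(2)

zs, hs = forward(x, weights, biases)
grads_W, grads_b = backward(target, weights, zs, hs)

# Finite-difference check on a single entry of the first weight matrix.
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric = (loss_fn(forward(x, W_plus, biases)[1][-1], target)
           - loss_fn(forward(x, W_minus, biases)[1][-1], target)) / (2 * eps)
analytic = grads_W[0][1, 2]
```

The backward loop touches each weight matrix once, so its cost is of the same order as the forward pass, whereas the finite-difference approach needs two forward passes per parameter.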

Gradient Flow Through Layers

Definition: Gradient Magnitude

The gradient with respect to the first-layer weights passes through all $L$ layers:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \left( \prod_{l=2}^{L} \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}} \right) \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{W}^{(1)}}$$

where the product is ordered from $l = L$ down to $l = 2$. Each Jacobian is $\frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}} = \text{diag}(\sigma'(\mathbf{z}^{(l)})) \cdot \mathbf{W}^{(l)}$.

If the singular values of these Jacobians are consistently below 1, the product shrinks exponentially with depth and gradients vanish. If they are consistently above 1, gradients explode.

Figure: Gradient magnitude through layers (bar chart of $\lVert \partial \mathcal{L}/\partial \mathbf{W}^{(l)} \rVert$ for five layers).

Vanishing gradient:

  • Gradients shrink exponentially with depth
  • Early layers learn slowly
  • Caused by: sigmoid, tanh, small initialization
  • Solution: ReLU, skip connections
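
The shrinking effect is easy to reproduce. The sketch below is an illustration under assumed settings (width 50, depth 10, sigmoid activations, weights scaled by $1/\sqrt{n}$, no biases), not a measurement from any real model. It pushes a unit-norm error vector backward through the transposed Jacobians $(\text{diag}(\sigma'(\mathbf{z}^{(l)})) \, \mathbf{W}^{(l)})^T$ and records its norm after each layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, depth = 50, 10                 # hypothetical width and depth
weights = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

# Forward pass, keeping the pre-activations needed for sigma'(z).
h = rng.standard_normal(n)
zs = []
for W in weights:
    z = W @ h
    zs.append(z)
    h = sigmoid(z)

# Backward pass: start from a unit-norm dL/dh^(L) and apply each
# transposed Jacobian, from the last layer down to the first.
g = rng.standard_normal(n)
g /= np.linalg.norm(g)
norms = [np.linalg.norm(g)]
for W, z in zip(reversed(weights), reversed(zs)):
    s = sigmoid(z)
    g = W.T @ (s * (1.0 - s) * g)  # (diag(sigma'(z)) W)^T g
    norms.append(np.linalg.norm(g))
# norms[k] is the gradient norm after passing back through k layers
```

Since $\sigma'(z) \le 0.25$ for the sigmoid, each step scales the norm by at most a quarter of the largest singular value of $\mathbf{W}^{(l)}$, so with this initialization the norm falls at every layer.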

PyTorch Autograd

Definition: Automatic Differentiation

PyTorch's autograd computes gradients automatically using reverse-mode differentiation:

import torch

# Forward pass with computation graph
x = torch.randn(3, requires_grad=True)
y = x ** 2 + 2 * x
z = y.sum()

# Backward pass (computes all gradients)
z.backward()

# x.grad contains dz/dx = 2*x + 2

Autograd builds a computation graph during the forward pass and traverses it in reverse during backward().

Gradient Computation Modes

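  • Reverse-mode (backpropagation): Computes the gradient of one scalar output with respect to all inputs in a single backward pass. Efficient when the number of inputs is much larger than the number of outputs, as in neural networks (many parameters, one scalar loss).
  • Forward-mode: Computes the derivative of all outputs with respect to one input (or one input direction) per pass. Efficient when the number of outputs is much larger than the number of inputs.
  • PyTorch: Uses reverse-mode by default. Use torch.autograd.grad to obtain gradients directly instead of accumulating them into .grad.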
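
As a small illustration of the last point, `torch.autograd.grad` runs the same reverse-mode pass as `backward()` but returns the gradients as a tuple rather than storing them on the tensors. The input values below are arbitrary and chosen only for the example.

```python
import torch

x = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
z = (x ** 2 + 2 * x).sum()

# Reverse-mode pass that returns dz/dx instead of writing to x.grad.
(dz_dx,) = torch.autograd.grad(z, x)

# dz/dx = 2*x + 2 -> tensor([ 4., -2.,  3.]); x.grad is still None
```

This form is convenient when the gradient is itself an intermediate value, for example in a gradient penalty, because nothing has to be read back from or cleared on `.grad`.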

Practical Considerations

Definition: Backpropagation Best Practices

| Issue | Solution |
|---|---|
| Vanishing gradients | Use ReLU, skip connections, proper initialization |
| Exploding gradients | Gradient clipping: $\mathbf{g} \leftarrow \frac{\mathbf{g}}{\max(1, \lVert \mathbf{g} \rVert / \text{threshold})}$ |
| Numerical instability | Use log-sum-exp, stable softmax implementation |
| Memory usage | Gradient checkpointing: recompute activations during the backward pass instead of storing them |
| Mixed precision | Use automatic mixed precision (torch.cuda.amp; torch.amp in newer releases): FP16 computation with FP32 weights, plus loss scaling to avoid gradient underflow |
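
Two of the remedies above are short enough to write out. The sketch below uses only the standard library; the helper names `clip_by_norm` and `log_sum_exp` are hypothetical and are not PyTorch functions. The first implements the clipping formula from the table, the second the max-subtraction trick behind stable softmax and log-sum-exp.

```python
import math

def clip_by_norm(g, threshold):
    """Rescale g to g / max(1, ||g|| / threshold), so that ||g|| <= threshold."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = max(1.0, norm / threshold)
    return [v / scale for v in g]

def log_sum_exp(xs):
    """Stable log(sum(exp(x))): subtract the maximum before exponentiating."""
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is left as is
lse = log_sum_exp([1000.0, 1000.0])                  # math.exp(1000.0) alone would overflow
```

In PyTorch code, `torch.nn.utils.clip_grad_norm_` and `torch.logsumexp` provide the corresponding built-in operations.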

Common Pitfalls

  • Forgetting to zero gradients: PyTorch accumulates gradients across backward passes, so call optimizer.zero_grad() before each one
  • Not calling .backward() on the loss tensor
  • Modifying tensors that require gradients in place
  • Using .data instead of .detach() to detach from the computation graph
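
The first pitfall is easy to see on a single scalar parameter. In this minimal sketch the tensor `w` and the factor 3.0 are arbitrary; `w.grad.zero_()` plays the role that `optimizer.zero_grad()` plays in a full training loop.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()          # 3.0

(3.0 * w).backward()           # no reset: the new gradient is added to the old one
accumulated = w.grad.item()    # 6.0

w.grad.zero_()                 # reset before the next backward pass
(3.0 * w).backward()
after_reset = w.grad.item()    # 3.0
```

Accumulation is deliberate: it lets several backward passes contribute to one update, which is how gradient accumulation over micro-batches works.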

Summary

  • Backpropagation efficiently computes gradients using the chain rule
  • The forward pass computes outputs and stores intermediates; the backward pass computes gradients
  • Computational graphs represent the sequence of operations for automatic differentiation
  • Vanishing/exploding gradients are fundamental challenges, mitigated by choices such as ReLU activations, skip connections, careful initialization, and gradient clipping
  • PyTorch autograd implements reverse-mode automatic differentiation
