Deep Learning

Convolutional Neural Networks — How Computers See Images

Master CNNs and learn how computers extract visual features through convolution, pooling, and learned filters.

Convolution operations — detect edges, textures, and shapes
Pooling layers — reduce spatial dimensions efficiently
Modern architectures — ResNet, EfficientNet, and beyond

A picture is worth a thousand words — and a CNN learns all of them.

Convolutional Neural Networks — Complete Guide

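CNNs exploit the spatial structure of images through two key principles: local connectivity (each neuron connects to a small region) and weight sharing (the same filter is applied everywhere). For a layer mapping $n$ inputs to $n$ outputs with a window of $k$ inputs, local connectivity cuts the $O(n^2)$ weights of a fully connected layer down to $O(k \cdot n)$ connections, and weight sharing then reduces the learned parameters per filter to $O(k)$, independent of the image size.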

Convolution Operation

The discrete 2D convolution (cross-correlation in practice) slides a learnable kernel $K \in \mathbb{R}^{k \times k}$ over the input:

(I * K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n)

For an input of size $n \times n$, kernel $k \times k$, stride $s$, and padding $p$:

\text{Output} = \lfloor (n - k + 2p) / s \rfloor + 1

• No padding (valid): output = n - k + 1 → 5 - 3 + 1 = 3 ✓
• Same padding: output = ⌈n/s⌉ (preserves spatial dims with s=1)
• Stride s=2: output ≈ n/2 (halves spatial dimensions)
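The output-size formula can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_output_size` and the example sizes other than the 5×5 input with a 3×3 kernel are illustrative choices, not taken from the text.

```python
def conv_output_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size for input size n, kernel k, stride s, padding p."""
    if n + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    # Integer division implements the floor in the formula.
    return (n - k + 2 * p) // s + 1


valid = conv_output_size(5, 3)                 # no padding, stride 1
same = conv_output_size(5, 3, s=1, p=1)        # p = (k - 1) / 2 keeps the size at stride 1
strided = conv_output_size(224, 3, s=2, p=1)   # stride 2 roughly halves the size
print(valid, same, strided)
```

With padding `p = (k - 1) / 2` and stride 1 the input size is preserved, which is the "same" case listed above.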

How this diagram works: The convolution operation slides a 3×3 kernel across the 5×5 input grid. At each position, the kernel is multiplied element-wise with the overlapping input region, and the products are summed to produce a single output value. The highlighted yellow region shows the current receptive field: the 3×3 area the kernel is currently processing. The computation panel on the right breaks down how output[0,0] = 4 is calculated: each row of the kernel multiplies the corresponding input row, and the three row-wise partial sums add up to 4. With no padding and stride 1, a 5×5 input with a 3×3 kernel produces a 3×3 output, shrinking each spatial dimension by k-1 = 2 pixels. This is the fundamental building block of every CNN.
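The sliding-window computation described above can be sketched in NumPy. The input and kernel values below are assumptions chosen for illustration (they are not recovered from the figure), and `cross_correlate2d` is a hypothetical helper name. The loop version is written for clarity rather than speed.

```python
import numpy as np


def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    n_rows, n_cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((n_rows - k_rows + 1, n_cols - k_cols + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Receptive field: the k x k window whose top-left corner is (i, j).
            window = image[i:i + k_rows, j:j + k_cols]
            # Element-wise multiply, then sum all products.
            out[i, j] = np.sum(window * kernel)
    return out


# Illustrative 5x5 input and 3x3 kernel (assumed values).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

result = cross_correlate2d(image, kernel)
print(result)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

The output is 3×3, matching the valid-convolution size 5 - 3 + 1 = 3.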

Parameters in Convolution

A convolution layer with $C_{in}$ input channels, $C_{out}$ output channels, kernel size $k \times k$ :

\text{Parameters} = C_{out} \times (C_{in} \times k \times k + 1)

For $C_{in}=3, C_{out}=64, k=3$ : $64 \times (3 \times 9 + 1) = 1,792$ parameters.

Compare to a fully connected layer producing the same $224 \times 224 \times 64$ output: the input has $224 \times 224 \times 3 = 150{,}528$ features, so the layer needs $150{,}528 \times (224 \times 224 \times 64) \approx 4.8 \times 10^{11}$ weights!

Parameter sharing is the key inductive bias that makes CNNs practical.
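The two parameter counts can be reproduced with plain integer arithmetic. This is a small sketch; `conv_params` and `dense_params` are hypothetical helper names, and the dense layer is counted without biases to match the comparison above.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus one bias per output channel for a k x k convolution."""
    return c_out * (c_in * k * k + 1)


def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights (and optional biases) for a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


conv = conv_params(c_in=3, c_out=64, k=3)

n_in = 224 * 224 * 3     # flattened RGB input
n_out = 224 * 224 * 64   # same output volume the convolution would produce
dense_weights = dense_params(n_in, n_out, bias=False)

print(f"conv layer:  {conv:,} parameters")
print(f"dense layer: {dense_weights:,} weights ({dense_weights / conv:.1e}x more)")
```

The convolution's count does not depend on the image size at all, which is the practical effect of weight sharing.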

Pooling

Pooling reduces spatial dimensions, providing translation invariance and reducing computation.

How pooling works: Pooling reduces the spatial size of feature maps while retaining the most important information. Max Pooling (left) divides the 4×4 input into 2×2 regions and keeps only the maximum value from each, which captures the strongest activation (e.g., the strongest edge or texture detected). Average Pooling (center) computes the mean of each region, providing a smoother summary. Global Average Pooling (right) averages each H×W feature map over all spatial positions, turning a C×H×W tensor into one value per channel. In modern architectures like ResNet it replaces the large flatten-and-fully-connected stack, leaving only a small final classifier and drastically reducing parameters. The comparison table shows that max pooling preserves the strongest feature and routes gradients only to the winning neuron, while average pooling distributes gradients evenly. Pooling also provides a degree of translation invariance: small shifts in the input don't change the pooled output significantly.
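The three pooling variants can be sketched in NumPy. The 4×4 feature map values are made up for illustration, and `pool2d` and `global_avg_pool` are hypothetical helper names. The sketch assumes non-overlapping windows and a height and width divisible by the window size.

```python
import numpy as np


def pool2d(x: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Non-overlapping size x size pooling over a 2D feature map."""
    h, w = x.shape
    # Split rows and columns into (block index, offset within block).
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")


def global_avg_pool(x: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) tensor to one value per channel."""
    return x.mean(axis=(1, 2))


fmap = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
], dtype=float)

max_pooled = pool2d(fmap, mode="max")   # [[6, 4], [7, 9]]
avg_pooled = pool2d(fmap, mode="avg")   # [[3.75, 2.25], [4.0, 4.5]]

stack = np.stack([fmap, 2 * fmap])      # two channels, shape (2, 4, 4)
gap = global_avg_pool(stack)            # [3.625, 7.25]
print(max_pooled, avg_pooled, gap, sep="\n")
```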

CNN Architecture

A typical CNN follows the pattern: Conv → ReLU → Pool repeated $N$ times, followed by Flatten → FC → Output.

How this architecture flows: This diagram shows a classic CNN pipeline processing a 32×32×3 color image through progressively deeper layers. The input passes through three convolutional layers (blue), with a max pooling layer (green) after each of the first two that halves the spatial dimensions. Notice the key pattern: as spatial dimensions decrease (32→30→15→13→6→4), channel depth increases (3→32→64→128), trading spatial resolution for feature richness. The flatten layer converts the 4×4×128 feature map into a 1D vector of 2,048 values, which feeds into two fully connected layers (pink) for classification into 10 classes. Early layers learn low-level features (edges, colors), middle layers learn mid-level patterns (textures, shapes), and deep layers learn high-level concepts (objects, parts). The bottom annotations track how dimensions change at each stage.
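The size sequence 32→30→15→13→6→4 can be traced by applying the output-size formula layer by layer. This sketch assumes 3×3 convolutions with no padding and 2×2 max pooling with stride 2, which is what the stated sizes imply. The layer list and helper names are illustrative.

```python
def conv_out(n: int, k: int = 3, s: int = 1, p: int = 0) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (n - k + 2 * p) // s + 1


def pool_out(n: int, k: int = 2, s: int = 2) -> int:
    """Output size of a pooling window along one spatial dimension."""
    return (n - k) // s + 1


# (layer type, output channels); pooling keeps the channel count.
layers = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None), ("conv", 128)]

size, channels = 32, 3
trace = [(size, channels)]
for kind, out_channels in layers:
    if kind == "conv":
        size, channels = conv_out(size), out_channels
    else:
        size = pool_out(size)
    trace.append((size, channels))

flattened = size * size * channels
print(trace)       # [(32, 3), (30, 32), (15, 32), (13, 64), (6, 64), (4, 128)]
print(flattened)   # 2048
```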

ResNet and Skip Connections

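The residual connection addresses the degradation problem: in principle a deeper network should perform at least as well as a shallower one (the extra layers could simply learn identity mappings), yet in practice very deep plain networks reach higher training error. Instead of learning $H(\mathbf{x})$ directly, learn the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$ :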

\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}

How skip connections ease vanishing gradients: The red dashed line is the key innovation. It creates a "shortcut" that copies the input x directly to the addition operation, bypassing the two convolution layers. Instead of learning the full transformation H(x), the network only needs to learn the residual F(x) = H(x) - x. The output becomes y = F(x) + x. This works because if the optimal transformation is close to identity (i.e., the layer doesn't need to change anything), F(x) only has to be near zero, which is much easier than learning an identity mapping from scratch. For backpropagation, the gradient now has two paths: it flows through F(x) and through the identity shortcut, so it is far less likely to vanish, even in networks with 152 or more layers. The identity path acts as a gradient highway, enabling training of networks that would otherwise be very hard to optimize.
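A residual block along these lines might look as follows in PyTorch. This is a sketch under assumptions: the class name `ResidualBlock` is hypothetical, and the batch normalization layers and the ReLU applied after the addition follow the common basic-block design rather than anything stated in the text above. Input and output channel counts are kept equal so the shortcut can be a plain identity.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic residual block: y = ReLU(F(x) + x), with F = conv-BN-ReLU-conv-BN."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 convolutions; padding=1 keeps H and W so shapes match for the add.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: x is added to the output of the residual branch.
        return self.relu(self.residual(x) + x)


block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

If the residual branch outputs zero, the block reduces to `ReLU(x)`, which illustrates why a near-identity mapping is easy for it to represent.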

Why ResNets Work

Without skip connections, gradients must flow through every layer:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \prod_{l=1}^{L} \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}}

With skip connections, gradient flows directly:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(I + \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}}\right)

The identity term $I$ gives the gradient a direct path that does not shrink with depth, which made it possible to train networks with 152 or more layers (He et al., 2016).
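The difference between the two gradient expressions can be illustrated numerically. The sketch below is a toy linear model, not a real network: the per-layer Jacobians are small random matrices chosen as an assumption, and the depth, width, and scale are arbitrary. It compares a product of Jacobians with a product of $(I + J)$ terms, which is what a stack of residual blocks yields.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 50, 8

# Hypothetical per-layer Jacobians with small entries (a stand-in for dF/dx).
jacobians = [0.05 * rng.standard_normal((dim, dim)) for _ in range(depth)]

plain = np.eye(dim)
residual = np.eye(dim)
for J in jacobians:
    plain = J @ plain                          # plain stack: product of Jacobians
    residual = (np.eye(dim) + J) @ residual    # residual stack: product of (I + J)

plain_norm = np.linalg.norm(plain)
residual_norm = np.linalg.norm(residual)
print(f"plain stack gradient norm:    {plain_norm:.3e}")
print(f"residual stack gradient norm: {residual_norm:.3e}")
```

In this setting the plain product collapses toward zero while the residual product stays at a usable scale. With larger Jacobians the residual product can also grow, so the identity term helps against vanishing gradients but does not by itself guarantee stable training.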

PyTorch Implementation

Example: CNN in PyTorch

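import torch.nn as nn

class CNN(nn.Module):
    """Small CNN for 3-channel images: three conv blocks plus an MLP classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels; padding=1 keeps H and W, pooling halves them.
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            # Block 2: 32 -> 64 channels, spatial size halved again.
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            # Block 3: 64 -> 128 channels; adaptive pooling always outputs 4x4,
            # so the classifier input size does not depend on the image size.
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                  # (N, 128, 4, 4) -> (N, 2048)
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),               # active only in training mode
            nn.Linear(256, num_classes)    # raw class scores (logits)
        )

    def forward(self, x):
        # x: (N, 3, H, W) -> logits: (N, num_classes)
        return self.classifier(self.features(x))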

CNN Architecture Comparison

Architecture Evolution

Architecture	Year	Depth (layers)	Params	ImageNet Top-5 Error	Key Innovation
LeNet-5	1998	5	60K	—	First practical CNN
AlexNet	2012	8	60M	16.4%	ReLU, Dropout, GPU training
VGG-16	2014	16	138M	7.3%	Uniform 3×3 filters
GoogLeNet	2014	22	6.8M	6.7%	Inception modules
ResNet-50	2015	50	25.6M	5.3%	Skip connections
EfficientNet-B7	2019	—	66M	2.9%	Compound scaling (depth × width × resolution)

Key Takeaways

Summary: CNNs

CNNs exploit spatial structure via local connectivity and weight sharing
Convolution output size: $\lfloor(n - k + 2p)/s\rfloor + 1$
Pooling reduces spatial dims; Global Average Pooling replaces large FC layers
ResNet skip connections mitigate vanishing gradients in deep networks
Transfer learning with pre-trained models is standard practice
Feature hierarchy: edges → textures → patterns → parts → objects
Modern trend: Vision Transformers (ViT) compete with CNNs

What to Learn Next

-> Vision Transformers: Apply the Transformer architecture to vision tasks.

-> Transfer Learning: Leverage pre-trained models for new tasks.

-> Object Detection: Find and locate objects in images.

-> Neural Networks: Understand the foundation of deep learning.

-> Semantic Segmentation: Classify every pixel in an image.

-> Training Deep Networks: Master optimizers, batch norm, and regularization.
