CNNs for Image Data: Convolution, Pooling and Architectures

Why CNNs for Images?

Fully connected networks treat images as flat vectors, destroying spatial structure. A 224×224×3 image flattened becomes 150,528 features; connecting each to a hidden layer of 1,000 neurons requires about 150 million parameters just for the first layer. CNNs solve this through two key principles:

Parameter Sharing: A single filter (kernel) slides across the entire image, reusing the same weights at every spatial location. A 3×3 kernel has only 9 parameters per channel, regardless of image size.

Translation Equivariance: Because the same filter scans all positions, a feature learned at one location is detected wherever it appears, and shifting the input shifts the feature map by the same amount. The network learns what to detect, not where.
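The parameter counts above can be checked with a few lines of arithmetic. This is a minimal sketch; the choice of 64 filters for the convolutional layer is an assumption made only for illustration and does not come from the text.

```python
# Compare first-layer parameter counts for a 224x224x3 input.
height, width, channels = 224, 224, 3
hidden_units = 1_000   # hidden layer size used in the text
num_filters = 64       # assumed filter count, for illustration only
kernel_size = 3

# Fully connected: every input feature connects to every hidden unit.
input_features = height * width * channels
fc_params = input_features * hidden_units + hidden_units  # weights + biases

# Convolution: each filter has kernel_size * kernel_size * channels weights + 1 bias.
conv_params = num_filters * (kernel_size * kernel_size * channels + 1)

print(f"flattened features:   {input_features:,}")
print(f"fully connected:      {fc_params:,} parameters")
print(f"3x3 conv, 64 filters: {conv_params:,} parameters")
```

The convolutional count does not depend on the image height or width, which is the practical meaning of parameter sharing.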

How this diagram works: This comparison shows why CNNs suit image data. The left side shows a fully connected (FC) network that flattens a 224×224×3 image into 150,528 features, requiring over 150 million parameters just for the first hidden layer, which makes it computationally expensive and prone to overfitting. The right side shows a CNN that uses convolutional layers with small filters, preserving spatial structure while reducing the parameter count to roughly 1-25 million. The key insights box highlights why CNNs work: parameter sharing means the same filter detects features everywhere, translation invariance allows recognizing objects regardless of position, and spatial hierarchies let the network learn increasingly complex patterns layer by layer.

The Convolution Operation

Convolution is the core building block. A small kernel (filter) slides over the input, computing element-wise multiplications and summing the results at each position.

Mathematical Definition

For a 2D input $X$ and kernel $K$ , the output at position $(i, j)$ is:

Y[i, j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X[i+m, \, j+n] \cdot K[m, n]

where $k_h$ and $k_w$ are the kernel height and width.
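The definition translates directly into two nested loops. The sketch below assumes a single-channel input with no padding and stride 1; `conv2d_naive` is a hypothetical helper name. Note that this formula does not flip the kernel (strictly speaking it is cross-correlation), which is also what `torch.nn.functional.conv2d` computes, so the two can be compared.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Direct translation of Y[i, j] = sum_m sum_n X[i+m, j+n] * K[m, n]."""
    k_h, k_w = k.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    y = torch.zeros(out_h, out_w, dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, then sum.
            y[i, j] = (x[i:i + k_h, j:j + k_w] * k).sum()
    return y

x = torch.arange(16, dtype=torch.float32).reshape(4, 4)
k = torch.tensor([[1.0, 0.0], [0.0, -1.0]])

y = conv2d_naive(x, k)
# F.conv2d expects (N, C, H, W) inputs and (C_out, C_in, k_h, k_w) weights.
y_ref = F.conv2d(x[None, None], k[None, None])[0, 0]

print(y)
print(torch.allclose(y, y_ref))
```

The loop version is only for reading; in practice the library call is far faster.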

Stride and Padding

Stride ( $S$ ): How many pixels the kernel shifts at each step.

Padding ( $P$ ): Number of zero-valued pixels added to each side of the input border.

O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1

where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.
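As a quick check of the output-size formula, here is a small sketch. The function name `conv_output_size` and the example layer settings are illustrative assumptions, not part of the original text.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """O = floor((W - K + 2P) / S) + 1 along one spatial dimension."""
    return (w - k + 2 * p) // s + 1

# 3x3 kernel, padding 1, stride 1 keeps the size ("same" padding).
print(conv_output_size(224, k=3, p=1, s=1))
# 5x5 kernel, no padding, on a 32-pixel input.
print(conv_output_size(32, k=5))
# Large strided kernel: 11x11, padding 2, stride 4.
print(conv_output_size(224, k=11, p=2, s=4))
# 2x2 window with stride 2 halves the size.
print(conv_output_size(28, k=2, s=2))
```

The same formula applies to pooling layers, with the pooling window size in place of the kernel size.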

Multi-Channel Convolution

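For an input with $C_{in}$ channels (for example, $C_{in} = 3$ for RGB), each filter has shape $K \times K \times C_{in}$ . The filter slides over all channels simultaneously, producing a single output channel. In the formula, $K$ in the summation limits is the kernel size, $K[m, n, c]$ is a filter weight, and $b$ is the bias: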

Y[i, j] = \sum_{c=0}^{C_{in}-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X[i+m, \, j+n, \, c] \cdot K[m, n, c] + b

A convolution layer with $C_{out}$ filters produces $C_{out}$ channels. Total parameters: $C_{out} \times (C_{in} \times K \times K + 1)$ .
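The parameter formula can be compared against what PyTorch actually allocates. This is a minimal sketch; `conv_param_count` is a hypothetical helper, and the layer sizes (3 input channels, 32 filters, 3×3 kernel) are chosen only as an example.

```python
import torch.nn as nn

def conv_param_count(c_in: int, c_out: int, k: int) -> int:
    """C_out * (C_in * K * K + 1): weights plus one bias per filter."""
    return c_out * (c_in * k * k + 1)

layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)

expected = conv_param_count(3, 32, 3)
actual = sum(p.numel() for p in layer.parameters())

print(tuple(layer.weight.shape))  # (C_out, C_in, K, K)
print(expected, actual)
```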

Pooling Layers

Pooling reduces spatial dimensions, which cuts computation and gives the network a degree of invariance to small translations.

Max Pooling

Y[i, j] = \max_{m \in [0, P_h), \, n \in [0, P_w)} X[i \cdot S + m, \, j \cdot S + n]

Selects the maximum value within each $P_h \times P_w$ pooling window, where $S$ is the stride (here $P$ denotes the window size, not padding). Preserves the strongest feature activations.

Average Pooling

Y[i, j] = \frac{1}{P_h \cdot P_w} \sum_{m=0}^{P_h-1} \sum_{n=0}^{P_w-1} X[i \cdot S + m, \, j \cdot S + n]

Computes the mean value within each window. Global Average Pooling (GAP) averages each entire feature map to a single value, commonly replacing fully connected layers.
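The three pooling variants can be compared on a small hand-made feature map. This sketch assumes a 2×2 window with stride 2; the input values are arbitrary.

```python
import torch
import torch.nn.functional as F

# One 4x4 feature map, reshaped to (N=1, C=1, H=4, W=4).
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 1., 0.],
                  [7., 2., 9., 8.],
                  [3., 4., 6., 5.]])[None, None]

max_out = F.max_pool2d(x, kernel_size=2, stride=2)   # strongest activation per window
avg_out = F.avg_pool2d(x, kernel_size=2, stride=2)   # mean per window
gap_out = F.adaptive_avg_pool2d(x, output_size=1)    # global average pooling

print(max_out[0, 0])   # [[6, 4], [7, 9]]
print(avg_out[0, 0])   # [[3.75, 1.75], [4.00, 7.00]]
print(gap_out.item())  # 4.125
```

Global average pooling always yields one value per channel, whatever the input resolution, which is why it can stand in for a flatten-plus-FC stage.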

CNN Architecture: The Pattern

The canonical CNN follows a repeating pattern:

Architecture Diagram

[Conv → ReLU → Pool] ×N  →  [FC] ×M  →  Output

Convolutional blocks extract hierarchical features:

Early layers: Low-level features (edges, textures, colors)
Middle layers: Mid-level features (patterns, parts, shapes)
Late layers: High-level features (objects, scenes, concepts)

Receptive Field

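The receptive field is the region of the original input that influences a particular neuron. As we stack layers, the receptive field grows (starting from $RF_0 = 1$ , with $K_l$ the kernel size of layer $l$ and $S_i$ the stride of layer $i$ ):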

RF_{l} = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i
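The recurrence is easy to evaluate layer by layer. The sketch below assumes $RF_0 = 1$ (a single input pixel) and treats pooling as a layer with its own kernel size and stride; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf = 1      # RF_0: one input pixel
    jump = 1    # product of the strides of all previous layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions see a 5x5 region.
print(receptive_field([(3, 1), (3, 1)]))
# Conv 3x3 -> 2x2 pool (stride 2) -> conv 3x3.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))
```

Strided layers multiply the growth contributed by every later layer, which is why downsampling enlarges the receptive field so quickly.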

Famous Architectures

LeNet-5 (1998)

Pioneering CNN for handwritten digit recognition. Introduced the Conv→Pool→FC paradigm.

Layer	Output	Kernel	Filters	Parameters
Conv1	28×28×6	5×5	6	156
Pool1	14×14×6	2×2		0
Conv2	10×10×16	5×5	16	1,516
Pool2	5×5×16	2×2		0
FC1	120			48,120
FC2	84			10,164
FC3	10			850

Total: ~60K parameters
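A modern approximation of this layer stack can be written in a few lines of PyTorch. This is a sketch under several assumptions not stated in the table: a 32×32 grayscale input, tanh activations, plain average pooling without trainable coefficients, and a fully connected Conv2. With full connectivity Conv2 has 16 × (6 × 25 + 1) = 2,416 parameters rather than the 1,516 listed above, which reflects the sparse connection scheme between feature maps in the original network.

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(2),                   # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(2),                   # -> 5x5x16
    nn.Flatten(),                      # -> 400
    nn.Linear(16 * 5 * 5, 120),
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),
)

# Parameter count per trainable layer, in order: Conv1, Conv2, FC1, FC2, FC3.
param_counts = [sum(p.numel() for p in m.parameters())
                for m in lenet if isinstance(m, (nn.Conv2d, nn.Linear))]
total = sum(param_counts)

logits = lenet(torch.zeros(1, 1, 32, 32))
print(param_counts, total, tuple(logits.shape))
```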

AlexNet (2012)

Won the ImageNet challenge (ILSVRC 2012) by a large margin. Key innovations: ReLU activation, dropout, data augmentation, GPU training.

5 conv layers + 3 FC layers
~60M parameters
ReLU instead of tanh → faster training
Overlapping pooling (3×3 window, stride 2)

VGG (2014)

Demonstrated that depth matters. Used only 3×3 convolutions with stride 1 and padding 1.

Key insight: Two 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters:

2 \times (3 \times 3 \times C^2) = 18C^2 \quad < \quad 5 \times 5 \times C^2 = 25C^2

VGG-16: 13 conv layers + 3 FC layers = ~138M parameters.
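The weight comparison behind the small-filter insight can be made concrete for a given channel count. This sketch ignores biases and assumes the number of channels stays at $C$ through the stack; `stacked_conv_weights` and the value $C = 64$ are illustrative choices.

```python
def stacked_conv_weights(kernel_size: int, num_layers: int, channels: int) -> int:
    """Weights in num_layers conv layers with C input and C output channels each."""
    return num_layers * kernel_size * kernel_size * channels * channels

C = 64
two_3x3 = stacked_conv_weights(3, 2, C)    # 18 * C^2
one_5x5 = stacked_conv_weights(5, 1, C)    # 25 * C^2
three_3x3 = stacked_conv_weights(3, 3, C)  # 27 * C^2
one_7x7 = stacked_conv_weights(7, 1, C)    # 49 * C^2

print(two_3x3, one_5x5, two_3x3 / one_5x5)
print(three_3x3, one_7x7)
```

The stacked version also inserts an extra nonlinearity between the layers, so it is more expressive as well as cheaper.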

ResNet (2015)

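Introduced skip connections to address the degradation problem: a deeper network should not have higher training error than a shallower one, yet plain deep networks did in practice. Each block learns a residual $F(x)$ and outputs $F(x) + x$ , so an identity mapping is easy to represent.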

Architecture Comparison

Architecture	Year	Depth	Parameters	Top-5 Error	Key Innovation
LeNet-5	1998	7	60K		First practical CNN
AlexNet	2012	8	60M	16.4%	ReLU, dropout, GPU
VGG-16	2014	16	138M	7.3%	Small 3×3 filters
GoogLeNet	2014	22	6.8M	6.7%	Inception modules
ResNet-50	2015	50	25.6M	3.6%	Skip connections
EfficientNet	2019		5.3M	2.9%	Compound scaling

EfficientNet: Compound Scaling

EfficientNet scales three dimensions jointly using a compound coefficient $\phi$ :

\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$ , so total FLOPs grow by roughly $2^\phi$ . This achieves better accuracy-efficiency tradeoffs than scaling any single dimension.
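The scaling rule can be sketched numerically. The constants below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the values reported in the EfficientNet paper for the baseline network; they are not given in the text above, and `compound_scale` is a hypothetical helper name. FLOPs are assumed to scale with depth × width² × resolution².

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # constants reported in the EfficientNet paper

def compound_scale(phi: float):
    """Return (depth, width, resolution, FLOPs) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

constraint = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {constraint:.3f}")

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each unit increase in $\phi$ roughly doubles the compute budget while keeping the three dimensions in a fixed ratio.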

Feature Visualization

CNNs learn interpretable feature hierarchies:

Layer 1: Edge detectors (Gabor-like filters), color blobs
Layer 2: Corners, textures, simple patterns
Layer 3: Object parts (eyes, wheels, textures)
Layer 4: Object-level features (faces, dogs, buildings)
Layer 5: Full objects and scenes

Implementation in PyTorch

Basic CNN

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)       # (N, 128, 1, 1) after global average pooling
        x = torch.flatten(x, 1)    # (N, 128)
        return self.classifier(x)  # (N, num_classes) logits

Residual Block

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

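        # Identity shortcut by default; a strided block needs a 1x1 conv
        # so the shortcut output matches the downsampled main path
        self.shortcut = nn.Sequential()
        if stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 1, stride, bias=False),
                nn.BatchNorm2d(channels)
            )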

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        return torch.relu(out)

Transfer Learning

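import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load ImageNet-pretrained weights (`weights=` replaces the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; the new layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)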

Transfer learning strategy:

Load pretrained weights (ImageNet)
Freeze the pretrained layers (feature extraction)
Replace the final layer for your task
Train the new layer, then optionally unfreeze deeper layers and fine-tune with a small learning rate

Key Takeaways

Summary

CNNs exploit spatial structure through parameter sharing and local connectivity
Convolution extracts features; pooling provides invariance and dimensionality reduction
Deeper networks learn hierarchical features: edges → textures → parts → objects
Skip connections (ResNet) enable training of very deep networks (152+ layers)
Transfer learning from pretrained models is the dominant paradigm in practice
Output size: $O = \lfloor(W - K + 2P) / S\rfloor + 1$
