CNNs for Image Data: Convolution, Pooling and Architectures

Why CNNs for Images?

Fully connected networks treat images as flat vectors, destroying spatial structure. A 224×224×3 image flattened becomes 150,528 features; connecting each to a hidden layer of 1,000 neurons requires about 150 million parameters for the first layer alone. CNNs solve this through two key principles:

Parameter Sharing: A single filter (kernel) slides across the entire image, reusing the same weights at every spatial location. A 3×3 kernel has only 9 parameters per channel, regardless of image size.

Translation Equivariance: Because the same filter scans all positions, a feature detected at one location can be recognized anywhere, and shifting the input shifts the resulting feature map by the same amount. The network learns what to detect, not where.
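To make the parameter-sharing argument concrete, the sketch below counts weights for the two cases. The 1,000-unit hidden layer comes from the text; the 64 filters in the convolutional layer are an illustrative assumption, and biases are left out of both counts.

```python
def fc_weight_count(height, width, channels, hidden_units):
    """Weights in a dense layer fed by a flattened image (biases excluded)."""
    return height * width * channels * hidden_units


def conv_weight_count(kernel_size, in_channels, num_filters):
    """Weights in a conv layer (biases excluded); independent of image size."""
    return kernel_size * kernel_size * in_channels * num_filters


fc = fc_weight_count(224, 224, 3, 1000)
conv = conv_weight_count(3, 3, 64)  # 64 filters is an assumed, typical width
print(f"FC first layer:   {fc:,} weights")
print(f"Conv first layer: {conv:,} weights")
```

The convolutional count does not change if the image grows, because the function never takes the image height or width as input.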

[Figure: FC network (input to hidden layer, 150M+ parameters, no spatial awareness) versus CNN (Conv, Pool, FC; ~1-25M parameters, spatial hierarchy preserved). Key insights: parameter sharing, translation invariance, spatial hierarchies, local connectivity.]

How this diagram works: This comparison shows why CNNs are well suited to image data. The left side shows a fully connected (FC) network that flattens a 224×224×3 image into 150,528 features, requiring over 150 million parameters for the first hidden layer alone, which makes it computationally expensive and prone to overfitting. The right side shows a CNN that uses convolutional layers with small filters, preserving spatial structure while reducing the parameter count to roughly 1 to 25 million. The key insights box highlights why CNNs work: parameter sharing means the same filter detects features everywhere, translation invariance allows recognizing objects regardless of position, and spatial hierarchies let the network learn increasingly complex patterns layer by layer.


The Convolution Operation

Convolution is the core building block. A small kernel (filter) slides over the input, computing element-wise multiplications and summing the results at each position.

Mathematical Definition

For a 2D input $X$ and kernel $K$, the output at position $(i, j)$ is:

$$Y[i, j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X[i+m, \, j+n] \cdot K[m, n]$$

where $k_h$ and $k_w$ are the kernel height and width.

[Figure: Convolution operation with a 5×5 input, 3×3 kernel, stride 1, and valid padding, producing a 3×3 output. The kernel slides across the entire input, computing one output value (element-wise multiply, then sum) per position.]
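The definition of $Y[i, j]$ can be followed step by step in plain Python. This is a minimal sketch assuming stride 1 and no padding; the function name `conv2d_valid` and the toy input are illustrative choices. As in deep learning frameworks, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` (lists of lists) with stride 1 and no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(x) - k_h + 1
    out_w = len(x[0]) - k_w + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Y[i, j] = sum over m, n of X[i+m, j+n] * K[m, n]
            out[i][j] = sum(
                x[i + m][j + n] * kernel[m][n]
                for m in range(k_h)
                for n in range(k_w)
            )
    return out


# Illustrative input: the integers 0..24 arranged in a 5x5 grid
x = [[5 * r + c for c in range(5)] for r in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # 3x3 box filter (sums each patch)
y = conv2d_valid(x, kernel)
for row in y:
    print(row)
```

A 5×5 input and a 3×3 kernel give a 3×3 output, one value per kernel position.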

Stride and Padding

Stride ($S$): How many pixels the kernel shifts at each step.

Padding ($P$): Number of zero-valued pixels added on each side of the input border.

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

where $O$ = output size, $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.

[Figure: Padding and stride effects. Valid (P=0): 7×7 → 5×5 (K=3, S=1). Same (P=1): 7×7 → 7×7 (K=3, S=1). Stride 2 (S=2): 7×7 → 3×3 (K=3, S=2).]
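The output-size formula is easy to check in code. The helper below is a small sketch (the name `conv_output_size` is illustrative) applied to a 7×7 input with a 3×3 kernel under three padding and stride settings.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output length along one spatial dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


print(conv_output_size(7, 3, p=0, s=1))  # valid padding  -> 5
print(conv_output_size(7, 3, p=1, s=1))  # same padding   -> 7
print(conv_output_size(7, 3, p=0, s=2))  # stride 2       -> 3
```

Integer division (`//`) implements the floor, which matters when the stride does not divide `W - K + 2P` evenly.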

Multi-Channel Convolution

For an input with $C_{in}$ channels (for example, $C_{in} = 3$ for RGB), each filter has shape $K \times K \times C_{in}$. The filter slides over all channels simultaneously, producing a single output channel:

$$Y[i, j] = \sum_{c=0}^{C_{in}-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X[i+m, \, j+n, \, c] \cdot K[m, n, c] + b$$

Here $K$ in the summation limits is the kernel size, $K[m, n, c]$ is a kernel weight, and $b$ is the filter's bias.

A convolution layer with $C_{out}$ filters produces $C_{out}$ output channels. Total parameters: $C_{out} \times (C_{in} \times K \times K + 1)$.
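A sketch of the multi-channel case, assuming a square kernel, stride 1, no padding, and an input indexed as `x[row][col][channel]`. The function names and the toy values are illustrative, not taken from any library.

```python
def conv2d_multichannel(x, kernel, bias=0.0):
    """Single-filter convolution over an input indexed as x[row][col][channel]."""
    k = len(kernel)              # square K x K kernel assumed
    c_in = len(kernel[0][0])
    out_h = len(x) - k + 1
    out_w = len(x[0]) - k + 1
    return [
        [
            sum(
                x[i + m][j + n][c] * kernel[m][n][c]
                for c in range(c_in)
                for m in range(k)
                for n in range(k)
            ) + bias
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_layer_params(c_in, c_out, k):
    """C_out * (C_in * K * K + 1): weights plus one bias per filter."""
    return c_out * (c_in * k * k + 1)


# Toy 3x3 input with 2 channels, all ones; 2x2x2 kernel, all ones
x = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
kernel = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]
y = conv2d_multichannel(x, kernel, bias=0.5)
print(y)                            # each entry is 2*2*2 + 0.5
print(conv_layer_params(3, 64, 3))  # 3x3 conv, RGB in, 64 filters (assumed)
```

All input channels collapse into one output channel per filter; stacking $C_{out}$ such filters gives the layer's full output.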


Pooling Layers

Pooling reduces spatial dimensions, providing translational invariance and reducing computation.

Max Pooling

$$Y[i, j] = \max_{m \in [0, P_h), \, n \in [0, P_w)} X[i \cdot S + m, \, j \cdot S + n]$$

Selects the maximum value within each pooling window. Preserves the strongest feature activations.

Average Pooling

$$Y[i, j] = \frac{1}{P_h \cdot P_w} \sum_{m=0}^{P_h-1} \sum_{n=0}^{P_w-1} X[i \cdot S + m, \, j \cdot S + n]$$

Computes the mean value within each window. Global Average Pooling (GAP) averages each entire feature map to a single value, commonly replacing fully connected layers.

[Figure: Pooling with a 4×4 input, 2×2 window, stride 2. Input rows: (1 3 2 4), (5 6 1 2), (3 2 8 1), (4 1 3 5). Max pool output: (6 4), (4 8). Average pool output: (3.8 2.3), (2.5 4.3). Max pooling captures the strongest features; average pooling smooths feature maps. Global average pooling (GAP) reduces 7×7×512 to 1×1×512 = 512 values, replacing FC layers and reducing overfitting.]
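Both pooling definitions fit in one short function. The sketch below assumes a square window and a 2D list input; `pool2d` and `global_avg_pool` are illustrative names. It is run on a 4×4 input with a 2×2 window and stride 2.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D list of numbers."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


def global_avg_pool(x):
    """Average an entire feature map down to a single value."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


x = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 3, 5],
]
max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="avg")
gap = global_avg_pool(x)
print(max_out)  # [[6, 4], [4, 8]]
print(avg_out)  # [[3.75, 2.25], [2.5, 4.25]]
print(gap)      # 3.1875
```

Neither operation has learnable parameters; the window size and stride are fixed hyperparameters.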

CNN Architecture: The Pattern

The canonical CNN follows a repeating pattern:

[Conv → ReLU → Pool] × N  →  [FC] × M  →  Output

Convolutional blocks extract hierarchical features:

  • Early layers: Low-level features (edges, textures, colors)
  • Middle layers: Mid-level features (patterns, parts, shapes)
  • Late layers: High-level features (objects, scenes, concepts)

[Figure: CNN architecture pattern. Input image 224×224×3 → Conv 3×3 (64 filters), ReLU, Max Pool 2×2, S=2 → 112×112×64 → Conv 3×3 (128 filters), ReLU, Max Pool 2×2, S=2 → 56×56×128 → Conv 3×3 (256 filters), ReLU, Max Pool 2×2, S=2 → 28×28×256 → Conv 3×3 (512 filters), ReLU, Max Pool 2×2, S=2 → 14×14×512 → Global Avg Pool → 1×1×512 → FC 1000 → Softmax. Feature hierarchy: edges/textures → patterns/parts → objects/scenes.]
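The shape progression of such a stack can be traced with the output-size formula. This sketch assumes every 3×3 convolution uses padding 1 and stride 1 (so it keeps height and width) and every 2×2 max pool uses stride 2; the helper names are illustrative.

```python
def conv_output_size(w, k, p=0, s=1):
    """floor((W - K + 2P) / S) + 1 along one spatial dimension."""
    return (w - k + 2 * p) // s + 1


def trace_shapes(size, in_channels, filter_counts):
    """Follow (H, W, C) through repeated [3x3 conv (pad 1) -> 2x2 max pool]."""
    shapes = [(size, size, in_channels)]
    for filters in filter_counts:
        size = conv_output_size(size, k=3, p=1, s=1)  # conv keeps H and W
        size = conv_output_size(size, k=2, p=0, s=2)  # pool halves H and W
        shapes.append((size, size, filters))
    return shapes


shapes = trace_shapes(224, 3, [64, 128, 256, 512])
for shape in shapes:
    print(shape)
```

Spatial size halves at every block while the channel count grows, which is the usual trade of resolution for feature depth.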

Receptive Field

The receptive field is the region of the original input that influences a particular neuron. As we stack layers, the receptive field grows:

$$RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i$$

[Figure: Receptive field growth by layer. Layer 1 RF: 3×3, Layer 2 RF: 5×5, Layer 3 RF: 7×7, Layer 4 RF: 9×9, deep layer RF: full image.]
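The recurrence can be evaluated layer by layer. The sketch below assumes a base case of $RF_0 = 1$ (a single input pixel) and takes each layer as a `(kernel_size, stride)` pair; `receptive_fields` is an illustrative name.

```python
def receptive_fields(layers):
    """Return the receptive field after each layer.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf = 1    # RF_0: a single input pixel
    jump = 1  # product of the strides of all earlier layers
    result = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        result.append(rf)
        jump *= stride
    return result


print(receptive_fields([(3, 1)] * 4))              # [3, 5, 7, 9]
print(receptive_fields([(3, 1), (2, 2), (3, 1)]))  # [3, 4, 8]
```

The second call shows why pooling matters here: after a stride-2 layer, each further 3×3 convolution adds 4 pixels to the receptive field instead of 2.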

Famous Architectures

LeNet-5 (1998)

Pioneering CNN for handwritten digit recognition. Introduced the Conv→Pool→FC paradigm.

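| Layer | Output | Kernel | Filters | Parameters |
|-------|----------|--------|---------|------------|
| Conv1 | 28×28×6  | 5×5 | 6  | 156    |
| Pool1 | 14×14×6  | 2×2 | -  | 0      |
| Conv2 | 10×10×16 | 5×5 | 16 | 1,516  |
| Pool2 | 5×5×16   | 2×2 | -  | 0      |
| FC1   | 120      | -   | -  | 48,120 |
| FC2   | 84       | -   | -  | 10,164 |
| FC3   | 10       | -   | -  | 850    |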

Total: ~60K parameters
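The per-layer counts can be reproduced with a few lines of arithmetic. This sketch assumes 5×5 kernels and a single grayscale input channel, as in the classic LeNet-5 description. The Conv2 figure is copied from the table rather than computed, because the original network connects each Conv2 filter to only a subset of the Pool1 maps.

```python
def fc_params(n_in, n_out):
    """Dense layer: one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


conv1 = 6 * (1 * 5 * 5 + 1)       # 6 filters, 5x5 kernel, 1 input channel
conv2 = 1516                      # taken from the table (sparse connections)
fc1 = fc_params(5 * 5 * 16, 120)  # flattened 5x5x16 feature maps
fc2 = fc_params(120, 84)
fc3 = fc_params(84, 10)
total = conv1 + conv2 + fc1 + fc2 + fc3
print(conv1, fc1, fc2, fc3, total)
```

Most of the parameters sit in the first fully connected layer, not in the convolutions.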

AlexNet (2012)

Won ImageNet by a large margin. Key innovations: ReLU activation, dropout, data augmentation, GPU training.

  • 5 conv layers + 3 FC layers
  • ~60M parameters
  • ReLU instead of tanh → faster training
  • Overlapping pooling (3×3, S=2)

VGG (2014)

Demonstrated that depth matters. Used only 3×3 convolutions with stride 1 and padding 1.

Key insight: Two stacked 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters. For $C$ input and output channels, ignoring biases:

$$2 \times (3 \times 3 \times C^2) = 18C^2 \quad < \quad 5 \times 5 \times C^2 = 25C^2$$
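A quick numeric check of this comparison, as a sketch that assumes the same channel count for inputs and outputs and ignores biases. The value of 64 channels is illustrative.

```python
def stacked_conv_weights(kernel_size, channels, num_layers):
    """Weights in `num_layers` convs, each with `channels` in and out (no bias)."""
    return num_layers * kernel_size * kernel_size * channels * channels


c = 64  # illustrative channel count
two_3x3 = stacked_conv_weights(3, c, num_layers=2)
one_5x5 = stacked_conv_weights(5, c, num_layers=1)
print(two_3x3, one_5x5, two_3x3 / one_5x5)
```

The ratio is 18/25 = 0.72 for any channel count, and the stacked version also inserts an extra nonlinearity between the two layers.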

VGG-16: 13 conv layers + 3 FC layers, ~138M parameters in total.

ResNet (2015)

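Introduced skip connections to address the degradation problem: beyond a certain depth, plain networks show higher training error, even though a deeper network should be able to match a shallower one by learning identity mappings in its extra layers.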

[Figure: ResNet residual block. Input x → Conv 3×3 → BN → ReLU → Conv 3×3 → BN → add the skip (identity) connection → ReLU → output F(x) + x. Why skip connections: they address vanishing gradients, enable networks of 152+ layers, let a block fall back to the identity mapping when F(x) → 0, make residuals easier to learn, and resolve the degradation problem.]

Architecture Comparison

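| Architecture | Year | Depth | Parameters | Top-5 Error | Key Innovation |
|--------------|------|-------|------------|-------------|----------------|
| LeNet-5      | 1998 | 7  | 60K   | -     | First practical CNN |
| AlexNet      | 2012 | 8  | 60M   | 16.4% | ReLU, dropout, GPU |
| VGG-16       | 2014 | 16 | 138M  | 7.3%  | Small 3×3 filters |
| GoogLeNet    | 2014 | 22 | 6.8M  | 6.7%  | Inception modules |
| ResNet-50    | 2015 | 50 | 25.6M | 3.6%  | Skip connections |
| EfficientNet | 2019 | -  | 5.3M  | 2.9%  | Compound scaling |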

EfficientNet: Compound Scaling

EfficientNet scales three dimensions jointly using a compound coefficient $\phi$:

$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. This achieves better accuracy-efficiency tradeoffs than scaling any single dimension.
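The scaling rule can be tabulated for a few values of $\phi$. In this sketch the defaults $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the values commonly cited for the EfficientNet baseline; treat them as an illustrative assumption, since the lesson itself does not state them.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_factor(alpha=1.2, beta=1.1, gamma=1.15):
    """alpha * beta^2 * gamma^2, the approximate FLOPs growth per unit of phi."""
    return alpha * beta ** 2 * gamma ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
print(f"constraint value: {flops_factor():.3f}")
```

With these assumed constants the constraint evaluates to about 1.92, so each increment of $\phi$ roughly doubles the compute budget.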


Feature Visualization

CNNs learn interpretable feature hierarchies:

  • Layer 1: Edge detectors (Gabor-like filters), color blobs
  • Layer 2: Corners, textures, simple patterns
  • Layer 3: Object parts (eyes, wheels, textures)
  • Layer 4: Object-level features (faces, dogs, buildings)
  • Layer 5: Full objects and scenes

[Figure: Feature hierarchy visualization. Layer 1: edges; Layer 2: textures; Layer 3: parts; Layer 4: objects.]

Implementation in PyTorch

Basic CNN

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)     # (N, 128, 1, 1) after global average pooling
        x = torch.flatten(x, 1)  # (N, 128)
        return self.classifier(x)

Residual Block

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

        self.shortcut = nn.Sequential()
        if stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 1, stride, bias=False),
                nn.BatchNorm2d(channels)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # add the skip connection: F(x) + x
        return torch.relu(out)

Transfer Learning

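import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your task

# Load ImageNet-pretrained weights; the `weights` argument replaces the
# deprecated `pretrained=True` (torchvision >= 0.13)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; the new layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the parameters of the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)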

Transfer learning strategy:

  1. Load pretrained weights (ImageNet)
  2. Freeze the pretrained layers (feature extraction)
  3. Replace the final layer for your task
  4. Train the new layer, then optionally unfreeze deeper layers and fine-tune with a small learning rate

Key Takeaways

Summary

  • CNNs exploit spatial structure through parameter sharing and local connectivity
  • Convolution extracts features; pooling provides invariance and dimensionality reduction
  • Deeper networks learn hierarchical features: edges → textures → parts → objects
  • Skip connections (ResNet) enable training of very deep networks (152+ layers)
  • Transfer learning from pretrained models is the dominant paradigm in practice
  • Output size: $O = \lfloor (W - K + 2P) / S \rfloor + 1$
