
Convolutional Neural Networks — Complete Guide for Vision


Deep Learning

Convolutional Neural Networks — How Computers See Images

Master CNNs and learn how computers extract visual features through convolution, pooling, and learned filters.

  • Convolution operations — detect edges, textures, and shapes
  • Pooling layers — reduce spatial dimensions efficiently
  • Modern architectures — ResNet, EfficientNet, and beyond

A picture is worth a thousand words — and a CNN learns all of them.


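CNNs exploit the spatial structure of images through two key principles: local connectivity (each neuron connects to a small region) and weight sharing (the same filter is applied at every position). For $n$ inputs and a filter with $k$ weights, local connectivity alone needs $O(k \cdot n)$ weights instead of the $O(n^2)$ of a fully connected layer, and weight sharing reduces this further to $O(k)$ parameters, independent of the input size.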


Convolution Operation

The discrete 2D convolution (implemented as cross-correlation in practice) slides a learnable kernel $K \in \mathbb{R}^{k \times k}$ over the input:

$$(I * K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n)$$
Figure: convolution of a 5×5 input with a 3×3 kernel (element-wise multiply, then sum).

Input (5×5):
1 0 1 0 1
0 1 0 1 0
1 0 1 0 1
0 1 0 1 0
1 0 1 0 1

Kernel (3×3):
1 0 1
0 1 0
1 0 1

Output (3×3):
5 0 5
0 5 0
5 0 5

Element [0,0] computation:
  row 0: 1×1 + 0×0 + 1×1 = 2
  row 1: 0×0 + 1×1 + 0×0 = 1
  row 2: 1×1 + 0×0 + 1×1 = 2
  Sum = 5

Output Size Formula

For input size $n \times n$, kernel $k \times k$, stride $s$, padding $p$:

Output = ⌊(n - k + 2p) / s⌋ + 1

  • No padding (valid): output = n - k + 1 → 5 - 3 + 1 = 3 ✓
  • Same padding: output = ⌈n/s⌉ (preserves spatial dims with s=1)
  • Stride s=2: output ≈ n/2 (halves spatial dimensions)

How this diagram works: The convolution operation slides a 3×3 kernel across the 5×5 input grid. At each position, the kernel is multiplied element-wise with the overlapping input region, and the products are summed to produce a single output value. The highlighted region in the figure is the current receptive field, the 3×3 area the kernel is currently processing. The computation panel breaks down output[0,0]: each kernel row multiplies the corresponding input row, giving partial sums 2 + 1 + 2 = 5. With no padding and stride 1, a 5×5 input with a 3×3 kernel produces a 3×3 output, shrinking each spatial dimension by k - 1 = 2 pixels. This operation is the fundamental building block of every CNN.
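The sliding-window computation and the output size formula can be sketched in a few lines of plain Python. This is only an illustration, not an efficient implementation: the helper names `conv_output_size` and `cross_correlate2d` are made up for this example, and the sketch assumes a square, single-channel input and a square kernel.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size along one dimension: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * padding) // stride + 1


def cross_correlate2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (lists of rows) and sum the element-wise products."""
    n, k = len(image), len(kernel)

    # Zero-pad the square input on all four sides.
    size = n + 2 * padding
    padded = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]

    out_size = conv_output_size(n, k, stride, padding)
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for m in range(k):
                for q in range(k):
                    total += padded[i * stride + m][j * stride + q] * kernel[m][q]
            row.append(total)
        output.append(row)
    return output


# 5x5 checkerboard input and 3x3 kernel, matching the grids in the diagram.
image = [[(r + c + 1) % 2 for c in range(5)] for r in range(5)]
kernel = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]

valid = cross_correlate2d(image, kernel)                         # no padding
same = cross_correlate2d(image, kernel, padding=1)               # "same" padding for k=3
strided = cross_correlate2d(image, kernel, stride=2, padding=1)  # stride 2

for row in valid:
    print(row)
print(len(valid), len(same), len(strided))  # 3 5 3
```

Printing `valid` row by row makes it easy to check individual entries by hand against the element-wise multiply-and-sum rule.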

Parameters in Convolution

A convolution layer with $C_{in}$ input channels, $C_{out}$ output channels, and kernel size $k \times k$ has:

$$\text{Parameters} = C_{out} \times (C_{in} \times k \times k + 1)$$

where the $+1$ is the bias of each output channel.

For $C_{in}=3$, $C_{out}=64$, $k=3$: $64 \times (3 \times 9 + 1) = 1{,}792$ parameters.

Compare to a fully connected layer that produces the same $224 \times 224 \times 64$ output: the input has $224 \times 224 \times 3 = 150{,}528$ features, so the layer needs $150{,}528 \times (224 \times 224 \times 64) \approx 4.8 \times 10^{11}$ parameters!

Parameter sharing is the key inductive bias that makes CNNs practical.
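To make the comparison concrete, the two counts can be computed directly. The sketch below is a plain-Python illustration; `conv_params` and `dense_weights` are hypothetical helper names, and the fully connected count ignores biases, as the comparison above does.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a conv layer: k*k*c_in weights plus one bias per output channel."""
    return c_out * (c_in * k * k + 1)


def dense_weights(n_in, n_out):
    """Weights of a fully connected layer (biases ignored)."""
    return n_in * n_out


conv = conv_params(c_in=3, c_out=64, k=3)
print(conv)  # 1792

n_in = 224 * 224 * 3    # flattened RGB input
n_out = 224 * 224 * 64  # same number of output activations as the conv layer
dense = dense_weights(n_in, n_out)
print(f"{dense:.2e}")

# How many times more weights the fully connected layer needs.
print(dense // conv)
```

Note that the convolution count does not depend on the image size at all, while the fully connected count grows with the square of the number of pixels.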


Pooling

Pooling reduces spatial dimensions, providing translation invariance and reducing computation.

Figure: pooling variants.

Max Pooling (2×2, stride 2): keeps the maximum per region

Input 4×4:
1 3 2 4
5 6 1 2
3 2 8 1
4 1 3 7

Output 2×2:
6 4
4 8

Average Pooling (2×2, stride 2): averages each region

Input 4×4: same as above

Output 2×2:
3.75 2.25
2.5 4.75

Global Average Pooling: input C×H×W feature map → 1 value per channel (the average of H×W). Used in ResNet, EfficientNet.

Method | Preserves | Use Case | Gradient
Max Pool | Strongest feature | Classification | Routes to max only
Avg Pool | Overall tendency | Feature maps | Distributed to all

How pooling works: Pooling reduces the spatial size of feature maps while retaining the most important information. Max Pooling (left) divides the 4×4 input into 2×2 regions and keeps only the maximum value from each — this captures the strongest activation (e.g., the strongest edge or texture detected). Average Pooling (center) computes the mean of each region, providing a smoother summary. Global Average Pooling (right) collapses an entire feature map (C×H×W) into a single value per channel by averaging all spatial positions — this replaces fully connected layers in modern architectures like ResNet, drastically reducing parameters. The comparison table shows that max pooling preserves the strongest feature and routes gradients only to the winning neuron, while average pooling distributes gradients evenly. Pooling also provides translation invariance — small shifts in the input don't change the pooled output significantly.
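The three pooling variants described above differ only in the reduction applied to each window. A minimal plain-Python sketch, using the 4×4 input from the figure (the function names `pool2d` and `mean` are made up for this example):

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(x, size=2, stride=2, reducer=max):
    """Apply `reducer` to each size x size window of a 2D list."""
    rows = (len(x) - size) // stride + 1
    cols = (len(x[0]) - size) // stride + 1
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            out_row.append(reducer(window))
        out.append(out_row)
    return out


x = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 3, 7],
]

print(pool2d(x, reducer=max))   # [[6, 4], [4, 8]]
print(pool2d(x, reducer=mean))  # [[3.75, 2.25], [2.5, 4.75]]

# Global average pooling collapses the whole feature map to one value.
print(mean([v for row in x for v in row]))  # 3.3125
```

In a real network the same reduction is applied independently to every channel, so a C×H×W tensor becomes C values under global average pooling.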


CNN Architecture

A typical CNN follows the pattern: Conv → ReLU → Pool repeated $N$ times, followed by Flatten → FC → Output.

Figure: classic CNN pipeline.

Layer | Output size
Input (32×32×3) | 32×32
Conv 3×3, 32 filters | 30×30
Pool 2×2 | 15×15
Conv 3×3, 64 filters | 13×13
Pool 2×2 | 6×6
Conv 3×3, 128 filters | 4×4
Flatten → 1D | 2048
FC | 256
Output (10 classes) | 10

Spatial: 32×32 → 4×4 (↓). Depth: 3 → 128 → 10.
Key Principle: Spatial dimensions ↓, Channel dimensions ↑
Early layers: edges, textures | Middle layers: patterns, parts | Deep layers: objects, scenes

How this architecture flows: This diagram shows a classic CNN pipeline processing a 32×32×3 color image through progressively deeper layers. The input passes through three convolutional blocks (blue); each of the first two is followed by max pooling (green) that halves the spatial dimensions. Notice the key pattern: as spatial dimensions decrease (32→30→15→13→6→4), channel depth increases (3→32→64→128), trading spatial resolution for feature richness. The flatten layer converts the 4×4×128 feature map into a 1D vector of 2,048 values, which feeds into two fully connected layers (pink) for classification into 10 classes. Early layers learn low-level features (edges, colors), middle layers learn mid-level patterns (textures, shapes), and deep layers learn high-level concepts (objects, parts). The bottom annotations track how dimensions change at each stage.
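The dimension changes in this pipeline follow directly from the output size formula. As a quick check, the plain-Python sketch below traces the spatial size and channel count layer by layer; the `layers` list is a hand-written description of the diagram (unpadded 3×3 convolutions with stride 1, 2×2 pooling with stride 2), not something read from a real model.

```python
def conv_out(n, k, stride=1, padding=0):
    """Output size along one dimension: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * padding) // stride + 1


# (kind, kernel size, output channels); pooling keeps the channel count.
layers = [
    ("conv", 3, 32),
    ("pool", 2, None),
    ("conv", 3, 64),
    ("pool", 2, None),
    ("conv", 3, 128),
]

size, channels = 32, 3  # 32x32 RGB input
trace = [(size, channels)]
for kind, k, c_out in layers:
    if kind == "conv":
        size = conv_out(size, k)            # stride 1, no padding
        channels = c_out
    else:
        size = conv_out(size, k, stride=k)  # non-overlapping pooling
    trace.append((size, channels))

print([s for s, _ in trace])  # [32, 30, 15, 13, 6, 4]
print([c for _, c in trace])  # [3, 32, 32, 64, 64, 128]

flattened = size * size * channels
print(flattened)              # 2048
```

The floor in the formula matters for the second pooling step: a 13×13 map pooled with a 2×2 window and stride 2 gives 6×6, dropping the last row and column.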


ResNet and Skip Connections

The residual connection addresses the degradation problem: a deeper network should perform at least as well as a shallower one, yet in practice very deep plain networks end up with higher training error. Instead of learning $H(\mathbf{x})$ directly, the block learns the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$:

$$\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

Figure: Residual Block. x → Conv 3×3 → BN + ReLU → Conv 3×3 → BN → (+) → ReLU, with a skip / identity connection carrying x to the addition: y = F(x) + x

How skip connections solve vanishing gradients: The red dashed line is the key innovation — it creates a "shortcut" that copies the input x directly to the addition operation, bypassing the two convolution layers. Instead of learning the full transformation H(x), the network only needs to learn the residual F(x) = H(x) - x. The output becomes y = F(x) + x. This works because if the optimal transformation is close to identity (i.e., the layer doesn't need to change anything), F(x) learns to be near zero, which is much easier than learning an identity mapping from scratch. For backpropagation, the gradient now has two paths: it flows through F(x) AND through the identity shortcut, ensuring gradients never vanish completely — even in 152+ layer networks. The identity path acts as a gradient highway, enabling training of networks that would otherwise be impossible.
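The "near-zero residual gives identity" argument can be seen in a toy forward pass. The sketch below is a simplification, not a real ResNet block: it stands in a single linear map `W x` for the two-convolution branch F(x), and the names `matvec` and `residual_block` are made up for this example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def residual_block(x, W):
    """y = F(x) + x, with F(x) = W x standing in for the conv branch."""
    fx = matvec(W, x)
    return [f + v for f, v in zip(fx, x)]


x = [1.0, -2.0, 0.5]

# With all-zero weights the branch outputs 0, so the block is exactly the identity.
zero_W = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zero_W))   # [1.0, -2.0, 0.5]

# With small weights the output is a small perturbation of the input.
small_W = [[0.01 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(residual_block(x, small_W))
```

A plain layer would need its weights to form an identity matrix to pass x through unchanged, whereas the residual block only needs them to stay near zero, which is where common initializations and weight decay already push them.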

Why ResNets Work

Without skip connections, gradients must flow through every layer:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \prod_{l=1}^{L} \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{h}^{(l-1)}}$$

With skip connections, gradient flows directly:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(I + \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}}\right)$$

The identity term $I$ gives the gradient a direct path that does not pass through any weight layer, which keeps it from vanishing and enables training of networks with 152 or more layers (He et al., 2016).
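A scalar toy model shows the difference between the two gradient expressions. The sketch below assumes a stack of 50 layers whose local derivatives are all small (±0.05, chosen arbitrarily for illustration), and it extends the single-block formula to a stack by multiplying one factor $(1 + \partial F_l / \partial x)$ per block; real networks have matrix-valued Jacobians, so this is only a rough picture.

```python
def plain_gradient(local_derivs):
    """Plain stack: the gradient is the product of the per-layer derivatives."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_gradient(local_derivs):
    """Residual stack: each block contributes a factor (1 + dF/dx)."""
    grad = 1.0
    for d in local_derivs:
        grad *= 1.0 + d
    return grad


depth = 50
# Small local derivatives with alternating sign (an arbitrary toy choice).
local_derivs = [0.05 if l % 2 == 0 else -0.05 for l in range(depth)]

plain = plain_gradient(local_derivs)
residual = residual_gradient(local_derivs)

print(f"plain:    {plain:.3e}")     # magnitude is vanishingly small
print(f"residual: {residual:.3f}")  # stays close to 1
```

The identity term does not guarantee a well-scaled gradient in every case (factors consistently above 1 would make it grow instead), but it removes the situation where a long product of small numbers drives the gradient to zero.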


PyTorch Implementation

Example: CNN in PyTorch

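import torch.nn as nn


class CNN(nn.Module):
    """Small CNN classifier. Shape comments assume a 3x32x32 input (an assumption;
    the adaptive pooling layer lets the model accept other input sizes too)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels; padding=1 keeps 32x32, pooling halves to 16x16
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            # Block 2: 32 -> 64 channels; 16x16 -> 8x8 after pooling
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),

            # Block 3: 64 -> 128 channels; adaptive pooling forces a 4x4 output
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                  # 128 x 4 x 4 -> 2048
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),               # regularization, active only in train mode
            nn.Linear(256, num_classes)    # raw class scores (logits)
        )

    def forward(self, x):
        # x: (batch, 3, H, W) -> logits: (batch, num_classes)
        return self.classifier(self.features(x))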

CNN Architecture Comparison

Architecture Evolution

Architecture | Year | Depth | Params | Top-5 Error | Key Innovation
LeNet-5 | 1998 | 5 | 60K | - | First practical CNN
AlexNet | 2012 | 8 | 60M | 16.4% | ReLU, Dropout, GPU training
VGG-16 | 2014 | 16 | 138M | 7.3% | Uniform 3×3 filters
GoogLeNet | 2014 | 22 | 6.8M | 6.7% | Inception modules
ResNet-50 | 2015 | 50 | 25.6M | 5.3% | Skip connections
EfficientNet | 2019 | - | 5.3M | 2.9% | Compound scaling (depth × width × resolution)

Key Takeaways

Summary: CNNs

  • CNNs exploit spatial structure via local connectivity and weight sharing
  • Convolution output size: $\lfloor (n - k + 2p)/s \rfloor + 1$
  • Pooling reduces spatial dims; Global Average Pooling replaces FC layers
  • ResNet skip connections solve vanishing gradients in deep networks
  • Transfer learning with pre-trained models is standard practice
  • Feature hierarchy: edges → textures → patterns → parts → objects
  • Modern trend: Vision Transformers (ViT) compete with CNNs

What to Learn Next

-> Vision Transformers Apply Transformer architecture to vision tasks.

-> Transfer Learning Leverage pre-trained models for new tasks.

-> Object Detection Find and locate objects in images.

-> Neural Networks Understand the foundation of deep learning.

-> Semantic Segmentation Classify every pixel in an image.

-> Training Deep Networks Master optimizers, batch norm, and regularization.
