CNN Architectures — From LeNet to EfficientNet
Convolutional Neural Networks are the foundation of computer vision. This tutorial covers the convolution operation in depth and the evolution from LeNet to modern architectures.
- Convolution is Feature Detection: local patterns are detected with shared parameters, which gives translation equivariance
- Skip Connections Changed Everything: ResNet made 100+ layer networks trainable by addressing the degradation problem and easing gradient flow
- Compound Scaling: EfficientNet scales width, depth, and resolution together with a single compound coefficient
CNN Architecture Deep Dive — LeNet to ResNet to EfficientNet
The Convolution Operation
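Definition: 2D Convolution
Given an input image $X$ (height $H$, width $W$, channels $C_{in}$) and a kernel $K$ (size $k \times k$, $C_{in}$ input channels, $C_{out}$ output channels), the convolution at position $(i, j)$ for output channel $f$ is:
$$Y[i, j, f] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} \sum_{c=0}^{C_{in}-1} X[i+u,\, j+v,\, c] \, K[u, v, c, f]$$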
Each output channel is a different filter applied to all input channels, summed together.
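To make the sum over kernel offsets and input channels concrete, here is a minimal pure-Python sketch. It assumes stride 1, no padding, and no bias, and it computes cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution. The function name `conv2d` and the nested-list layout are illustrative choices, not taken from any library.

```python
def conv2d(x, kernel):
    """Naive valid, stride-1 convolution (cross-correlation form).

    x:      nested list indexed [row][col][in_channel]
    kernel: nested list indexed [u][v][in_channel][out_channel]
    """
    h, w, c_in = len(x), len(x[0]), len(x[0][0])
    k, c_out = len(kernel), len(kernel[0][0][0])
    h_out, w_out = h - k + 1, w - k + 1
    y = [[[0.0] * c_out for _ in range(w_out)] for _ in range(h_out)]
    for i in range(h_out):
        for j in range(w_out):
            for f in range(c_out):
                total = 0.0
                # Sum over the k x k window and over every input channel.
                for u in range(k):
                    for v in range(k):
                        for c in range(c_in):
                            total += x[i + u][j + v][c] * kernel[u][v][c][f]
                y[i][j][f] = total
    return y


# Toy input: a 3x3 single-channel image and one 2x2 filter of ones,
# so each output value is the sum of a 2x2 patch.
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
ones_kernel = [[[[1.0]], [[1.0]]],
               [[[1.0]], [[1.0]]]]
result = conv2d(image, ones_kernel)
print(result)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

The same weights are reused at every position, which is the parameter sharing the tutorial refers to.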
Padding and Stride
Definition: Padding
- Valid: No padding, output is smaller than input
- Same: Pad so the output has the same spatial size as the input ($p = (k-1)/2$ for stride 1 and odd kernel size $k$)
- Zero-padding: Pad with zeros (most common)
Definition: Stride
Stride is the step size of the convolution. A stride of 2 roughly halves each spatial dimension, so a strided convolution acts as a learnable form of downsampling.
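Output Size Formula
$$H_{out} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1$$
The output volume has shape $H_{out} \times W_{out} \times C_{out}$.
Here,
- $H, W$ = Input spatial dimensions
- $k$ = Kernel size
- $p$ = Zero-padding
- $s$ = Stride
- $C_{out}$ = Number of output filters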
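As a quick check of the output size rule, floor((n + 2p - k) / s) + 1 per spatial dimension, the following sketch evaluates it for a few common settings. The helper name `conv_output_size` is a hypothetical choice for illustration.

```python
def conv_output_size(n_in, kernel, padding=0, stride=1):
    """Spatial output size: floor((n_in + 2*padding - kernel) / stride) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1


valid = conv_output_size(32, kernel=5)                           # no padding
same = conv_output_size(32, kernel=3, padding=1)                 # "same" padding
strided = conv_output_size(224, kernel=3, padding=1, stride=2)   # stride-2 downsampling
print(valid, same, strided)  # 28 32 112
```

The number of output filters does not enter this calculation; it only sets the channel dimension of the output.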
Pooling
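Definition: Pooling
Pooling reduces spatial dimensions and provides a degree of local translation invariance. For a feature map $x$ and a pooling window $\mathcal{R}_{i,j}$ at output position $(i, j)$:
- Max Pooling: $y_{i,j} = \max_{(u,v) \in \mathcal{R}_{i,j}} x_{u,v}$
- Average Pooling: $y_{i,j} = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(u,v) \in \mathcal{R}_{i,j}} x_{u,v}$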
- Global Average Pooling: Average each feature map to a single value (replaces FC layers)
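The three pooling variants can be sketched on a single small feature map. This is a minimal illustration on nested lists; the function names `pool2d` and `global_average_pool` and the 2×2 window with stride 2 are assumptions chosen for the example.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2D feature map (list of rows) over size x size windows."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [x[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


def global_average_pool(x):
    """Collapse one feature map to a single value."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


fmap = [[1.0, 3.0, 2.0, 4.0],
        [5.0, 7.0, 6.0, 8.0],
        [9.0, 2.0, 1.0, 0.0],
        [3.0, 4.0, 5.0, 6.0]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
gap = global_average_pool(fmap)
print(max_pooled)  # [[7.0, 8.0], [9.0, 6.0]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
print(gap)         # 4.125
```

None of these operations has trainable weights, which is why global average pooling is attractive as a replacement for large fully connected layers.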
CNN Architecture Evolution
LeNet (1998)
Definition: LeNet-5
LeNet-5 was one of the first successful CNNs, designed for handwritten digit recognition on 32×32 grayscale inputs:
| Layer | Type | Output Shape | Parameters |
|---|---|---|---|
| 1 | Conv 5×5 | 28×28×6 | 156 |
| 2 | AvgPool 2×2 | 14×14×6 | 0 |
| 3 | Conv 5×5 | 10×10×16 | 1,516 |
| 4 | AvgPool 2×2 | 5×5×16 | 0 |
| 5 | FC | 120 | 48,120 |
| 6 | FC | 84 | 10,164 |
| 7 | FC | 10 | 850 |
Total: ~60K parameters. LeNet-5 showed that local receptive fields, weight sharing, and pooling together could learn hierarchical features.
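The parameter column can be reproduced with a few lines of arithmetic. This sketch assumes one bias per output channel or unit and, like the table, counts pooling layers as parameter-free. The 1,516 figure for layer 3 comes from the original design's partial connectivity (6 output maps read 3 input maps, 9 read 4, and 1 reads all 6). A fully connected 5×5 convolution from 6 to 16 channels would have more. The helper names are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel for a k x k convolution."""
    return (k * k * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out


conv1 = conv_params(5, 1, 6)
# Partially connected layer 3: 6 maps see 3 inputs, 9 see 4, 1 sees all 6.
conv3 = 6 * conv_params(5, 3, 1) + 9 * conv_params(5, 4, 1) + conv_params(5, 6, 1)
conv3_dense = conv_params(5, 6, 16)   # what full connectivity would cost
fc5 = fc_params(5 * 5 * 16, 120)
fc6 = fc_params(120, 84)
fc7 = fc_params(84, 10)
total = conv1 + conv3 + fc5 + fc6 + fc7

print(conv1, conv3, conv3_dense)  # 156 1516 2416
print(fc5, fc6, fc7)              # 48120 10164 850
print(total)                      # 60806
```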
AlexNet (2012)
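Definition: AlexNet
AlexNet won the ImageNet challenge (ILSVRC) in 2012 by a large margin, with a top-5 error of about 15.3% against roughly 26.2% for the runner-up, starting the deep learning revolution:
- 8 learned layers (5 convolutional, 3 fully connected), ~60M parameters
- ReLU activations instead of tanh (faster training)
- Dropout (p=0.5) for regularization
- Data augmentation (random crops, horizontal flips, PCA-based color augmentation)
- GPU training (2 GTX 580 GPUs)
- Local Response Normalization (since largely superseded by BatchNorm)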
VGGNet (2014)
Definition: VGGNet
VGG showed that depth matters, using only 3×3 convolutions:
- 16-19 layers deep
- All 3×3 convolutions with stride 1, padding 1
- 2×2 max pooling after each conv block
- Three FC layers at the end
- 138M parameters (VGG-16)
Key insight: Two stacked 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters ($18C^2$ versus $25C^2$ weights for $C$ input and output channels) and an extra non-linearity between them.
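The receptive field and parameter claims can be checked numerically. This sketch assumes stride-1 convolutions, equal input and output channel counts, and ignores biases. The channel count of 64 is an arbitrary illustrative value.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def conv_weights(k, channels, layers=1):
    """Weight count for `layers` k x k convolutions with C -> C channels."""
    return layers * k * k * channels * channels


rf_two_3x3 = stacked_receptive_field([3, 3])
rf_one_5x5 = stacked_receptive_field([5])
weights_two_3x3 = conv_weights(3, 64, layers=2)
weights_one_5x5 = conv_weights(5, 64)

print(rf_two_3x3, rf_one_5x5)                   # 5 5
print(weights_two_3x3, weights_one_5x5)         # 73728 102400
print(weights_two_3x3 / weights_one_5x5)        # 0.72
```

The ratio 18/25 = 0.72 is independent of the channel count, so the saving holds at every stage of the network.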
ResNet (2015)
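Definition: ResNet (Residual Network)
ResNet introduced skip connections to address the degradation problem, where adding layers to a plain deep network makes even the training error worse:
$$y = F(x) + x$$
where $F(x)$ is the residual function learned by the stacked layers and $x$ is carried forward unchanged by the identity shortcut.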
Skip connections allow gradients to flow directly through the network, enabling training of 100+ layer networks.
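A toy version of a residual block shows why the shortcut helps: when the learned branch outputs zero, the block reduces to the identity, so extra layers cannot make the representation worse by default. This sketch is a deliberate simplification. It uses two small dense layers in place of convolutions, omits BatchNorm, and skips the ReLU that real ResNet blocks apply after the addition. All names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = F(x) + x, with F = linear -> ReLU -> linear."""
    fx = linear(relu(linear(x, w1)), w2)
    return [f + a for f, a in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

identity_out = residual_block(x, zeros, zeros)  # F(x) = 0, so y = x
shifted_out = residual_block(x, eye, eye)       # F(x) = relu(x), so y = relu(x) + x
print(identity_out)  # [1.0, -2.0, 3.0]
print(shifted_out)   # [2.0, -2.0, 6.0]
```

Because the shortcut contributes an identity term to the block's Jacobian, the gradient with respect to `x` always includes a path that bypasses `w1` and `w2`.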
Inception Module (GoogLeNet)
Definition: Inception Module
The Inception module applies multiple filter sizes in parallel:
- 1×1 convolution (also placed before the 3×3 and 5×5 branches to reduce channels)
- 3×3 convolution
- 5×5 convolution
- 3×3 max pooling
Outputs are concatenated. This captures features at multiple scales without choosing a single kernel size.
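Two bookkeeping facts follow from this design: the output channel count is the sum of the branch widths, and a 1×1 reduction in front of a large kernel cuts its weight count sharply. The sketch below uses illustrative branch widths chosen to resemble an early GoogLeNet module. Treat the specific numbers as assumptions, not as a specification of the published network. Biases are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


c_in = 192  # assumed input channels
branches = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}  # assumed widths

# Concatenation along the channel axis adds the branch widths.
out_channels = sum(branches.values())

# 5x5 branch applied directly vs. after a 1x1 reduction to 16 channels.
direct_5x5 = conv_weights(5, c_in, 32)
reduced_5x5 = conv_weights(1, c_in, 16) + conv_weights(5, 16, 32)

print(out_channels)             # 256
print(direct_5x5, reduced_5x5)  # 153600 15872
```

With these widths the reduced branch needs roughly a tenth of the weights of the direct one.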
Depthwise Separable Convolution
Definition: Depthwise Separable Convolution
Used in MobileNet and EfficientNet for efficiency:
- Depthwise convolution: Apply one filter per input channel
- Pointwise convolution: 1×1 convolution to combine channels
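Parameters: $k^2 C_{in} + C_{in} C_{out}$ vs $k^2 C_{in} C_{out}$ for standard convolution, where $k$ is the kernel size and $C_{in}$, $C_{out}$ are the input and output channel counts.
For 3×3 kernels: ~8-9× fewer parameters and FLOPs when $C_{out}$ is large.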
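The saving is easy to verify for a concrete layer. This sketch counts weights only (no biases) and uses 128 input and 256 output channels as arbitrary illustrative values. The function names are hypothetical.

```python
def standard_conv_weights(k, c_in, c_out):
    """One k x k filter per (input channel, output channel) pair."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) plus pointwise 1x1."""
    return k * k * c_in + c_in * c_out


std = standard_conv_weights(3, 128, 256)
sep = separable_conv_weights(3, 128, 256)
ratio = std / sep

print(std, sep)         # 294912 33920
print(round(ratio, 2))  # 8.69
```

The ratio equals 1 / (1/C_out + 1/k²), so it approaches k² = 9 as the number of output channels grows.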
EfficientNet: Compound Scaling
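Definition: EfficientNet
EfficientNet scales three dimensions together using a single coefficient:
$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$
subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$, where $\phi$ is a compound coefficient.
The constants $\alpha, \beta, \gamma$ are found by a small grid search on the baseline network. Raising $\phi$ then grows FLOPs by roughly $2^{\phi}$ while keeping depth, width, and resolution in balance for a given compute budget.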
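The scaling rule sets depth, width, and resolution multipliers to α^φ, β^φ, and γ^φ, with α·β²·γ² close to 2 so that each unit of φ roughly doubles FLOPs. The sketch below evaluates it with α = 1.2, β = 1.1, γ = 1.15, the values commonly quoted from the EfficientNet paper for its baseline. Treat them as illustrative constants here. The function name is hypothetical.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed baseline constants


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


flops_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these constants the product α·β²·γ² is about 1.92, which is why FLOPs grow by slightly less than 2× per step.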
Design Principles
Definition: CNN Design Principles
- Start simple: Increase complexity only when needed
- 3×3 kernels: Two stacked 3×3 convolutions beat one 5×5 (more non-linearity, fewer params)
- Increase depth: But use skip connections beyond 20 layers
- Channel progression: Double channels when spatial dims halve
- Global average pooling: Replace FC layers to reduce parameters
- BatchNorm: After every conv, before activation
- Skip connections: Enable deeper training, improve gradient flow
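The channel progression rule can be illustrated with a small stage planner. Halving the resolution cuts the number of spatial positions by 4×, so doubling the channels still halves the activation volume at each stage. The starting point of 56×56 with 64 channels is an assumed example, and `stage_plan` is a hypothetical helper.

```python
def stage_plan(resolution, channels, num_stages):
    """List (resolution, channels) per stage, halving one and doubling the other."""
    plan = []
    for _ in range(num_stages):
        plan.append((resolution, channels))
        resolution //= 2
        channels *= 2
    return plan


plan = stage_plan(56, 64, 4)
activations = [r * r * c for r, c in plan]

print(plan)         # [(56, 64), (28, 128), (14, 256), (7, 512)]
print(activations)  # [200704, 100352, 50176, 25088]
```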
Summary
- Convolution detects local patterns with parameter sharing and translation equivariance
- LeNet → AlexNet → VGG: Deeper networks with simple building blocks
- ResNet: Skip connections enable training of 100+ layer networks
- Inception: Multi-scale features with parallel filter branches
- MobileNet/EfficientNet: Depthwise separable convolutions for efficiency
- Compound scaling: Balance depth, width, and resolution for optimal performance