Computer Vision

CNN Architectures — From LeNet to EfficientNet

Convolutional Neural Networks are the foundation of computer vision. This tutorial covers the convolution operation in depth and the evolution from LeNet to modern architectures.

Convolution is Feature Detection: Local patterns are detected with parameter sharing and translation equivariance
Skip Connections Changed Everything: ResNet made 100+ layer networks trainable by giving gradients a direct path through identity shortcuts
Compound Scaling: EfficientNet scales depth, width, and resolution together using a single coefficient

The Convolution Operation

Definition: 2D Convolution

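Given an input image $\mathbf{I}$ (height $H$, width $W$, channels $C$) and a kernel $\mathbf{K}$ (size $k \times k$, $C$ input channels, $F$ output channels), the convolution at position $(i, j)$ for output channel $f$, with stride 1 and no padding, is:

$$\mathbf{O}(i, j, f) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=0}^{C-1} \mathbf{I}(i+m, j+n, c) \cdot \mathbf{K}(m, n, c, f)$$

Each output channel $f$ has its own filter, which spans all $C$ input channels; the per-channel products are summed into a single value per position. As written, this is cross-correlation (the kernel is not flipped), which is what deep learning frameworks compute under the name convolution.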
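To make the indexing concrete, here is a minimal pure-Python sketch of the sum above. It assumes stride 1, no padding, no bias, and a kernel stored with an explicit output-filter axis; the `conv2d` name and the nested-list layout are illustrative choices, not taken from any library.

```python
def conv2d(image, kernels):
    """Naive valid, stride-1 convolution (cross-correlation form).

    image:   nested lists indexed [row][col][channel], shape H x W x C
    kernels: nested lists indexed [m][n][channel][filter], shape k x k x C x F
    returns: nested lists indexed [i][j][filter], shape (H-k+1) x (W-k+1) x F
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k, F = len(kernels), len(kernels[0][0][0])
    out_h, out_w = H - k + 1, W - k + 1
    output = [[[0.0] * F for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for f in range(F):
                total = 0.0
                # Sum over the k x k window and over all input channels.
                for m in range(k):
                    for n in range(k):
                        for c in range(C):
                            total += image[i + m][j + n][c] * kernels[m][n][c][f]
                output[i][j][f] = total
    return output


# Toy example: 3x3 single-channel image, one 2x2 filter of all ones,
# so each output value is the sum of a 2x2 window.
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
kernels = [[[[1.0]], [[1.0]]],
           [[[1.0]], [[1.0]]]]
result = conv2d(image, kernels)
print(result)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

The six nested loops make the cost visible: every output value touches $k^2 \cdot C$ inputs, and the same filter weights are reused at every position.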

CNN Output Shape

$$\text{Output} = \left\lfloor\frac{H - k + 2p}{s} + 1\right\rfloor \times \left\lfloor\frac{W - k + 2p}{s} + 1\right\rfloor \times F$$

Padding and Stride

Definition: Padding

Valid: No padding; the output is smaller than the input
Same: Pad so the output has the same spatial size as the input ($p = \lfloor k/2 \rfloor$ for stride 1 and odd $k$)
Zero-padding: Fill the padded border with zeros (the most common choice)

Definition: Stride

Stride is the step size with which the kernel moves across the input. A stride of 2 roughly halves each spatial dimension, so a strided convolution acts as a learnable downsampling layer.

Output Size Formula

$$\text{Output size} = \left\lfloor\frac{\text{Input size} - \text{Kernel size} + 2 \times \text{Padding}}{\text{Stride}} + 1\right\rfloor$$

Here,

$H, W$ = Input spatial dimensions
$k$ = Kernel size
$p$ = Zero-padding
$s$ = Stride
$F$ = Number of output filters
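The output size formula is easy to check with a few lines of code. The helper below is a small sketch for one spatial axis; the name `conv_output_size` is hypothetical, and the example sizes are chosen only for illustration.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if input_size + 2 * padding < kernel_size:
        raise ValueError("kernel does not fit inside the padded input")
    # Integer floor division implements the floor in the formula.
    return (input_size - kernel_size + 2 * padding) // stride + 1


# Valid convolution: 32 input, 5x5 kernel, no padding
print(conv_output_size(32, 5))                        # 28
# Same padding for a 3x3 kernel (p = 1) at stride 1 keeps the size
print(conv_output_size(224, 3, padding=1))            # 224
# Stride 2 halves the size
print(conv_output_size(224, 3, padding=1, stride=2))  # 112
```

Applying the function once per axis and appending $F$ gives the full output shape.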

Pooling

Definition: Pooling

Pooling reduces spatial dimensions and provides a degree of invariance to small translations. For a $k \times k$ window with stride $s$, where $m$ and $n$ range over $0, \dots, k-1$:

Max Pooling: $\text{MaxPool}(\mathbf{O})_{i,j} = \max_{m,n} \mathbf{O}(i \cdot s + m, j \cdot s + n)$
Average Pooling: $\text{AvgPool}(\mathbf{O})_{i,j} = \frac{1}{k^2}\sum_{m,n} \mathbf{O}(i \cdot s + m, j \cdot s + n)$
Global Average Pooling: Average each feature map to a single value (often used in place of FC layers)
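The three pooling variants differ only in the reduction applied to each window. The sketch below works on a single 2D feature map stored as nested lists; `pool2d`, `mean`, and `global_average_pool` are illustrative names, and the 4×4 input is made up for the example.

```python
def pool2d(feature_map, k, s, reduce_fn):
    """Apply reduce_fn to every k x k window, moving with stride s."""
    H, W = len(feature_map), len(feature_map[0])
    out_h, out_w = (H - k) // s + 1, (W - k) // s + 1
    return [
        [
            reduce_fn([feature_map[i * s + m][j * s + n]
                       for m in range(k) for n in range(k)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def mean(values):
    return sum(values) / len(values)


def global_average_pool(feature_map):
    """Collapse a whole feature map to one number."""
    return mean([v for row in feature_map for v in row])


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]]

max_pooled = pool2d(feature_map, k=2, s=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, k=2, s=2, reduce_fn=mean)
gap = global_average_pool(feature_map)

print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
print(gap)         # 3.625
```

Note that pooling has no learnable parameters, unlike a strided convolution.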

CNN Architecture Evolution

LeNet (1998)

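Definition: LeNet-5

LeNet-5 was one of the first successful CNNs, designed for handwritten digit recognition. For a 32×32×1 input:

Layer	Type	Output Shape	Parameters
1	Conv 5×5	28×28×6	156
2	AvgPool 2×2	14×14×6	0
3	Conv 5×5	10×10×16	1,516
4	AvgPool 2×2	5×5×16	0
5	FC	120	48,120
6	FC	84	10,164
7	FC	10	850

Total: ~60K parameters. The 1,516 for layer 3 reflects the original sparse connection scheme; connecting every filter to all 6 input maps would give 2,416. LeNet-5 showed that local receptive fields, weight sharing, and pooling together could learn hierarchical features.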
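The parameter column can be reproduced from the layer shapes. The sketch below assumes one bias per output unit and, for the second convolution, the sparse connection pattern of the original LeNet-5 (6 filters see 3 input maps, 9 see 4, and 1 sees all 6); the helper names are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per filter for a dense k x k convolution."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out


conv1 = conv_params(5, 1, 6)

# Second convolution: each of the 16 filters sees only a subset of the 6 maps.
inputs_per_filter = [3] * 6 + [4] * 9 + [6]
conv2_sparse = sum(5 * 5 * n for n in inputs_per_filter) + 16
conv2_dense = conv_params(5, 6, 16)  # for comparison: every filter sees all maps

fc1 = fc_params(5 * 5 * 16, 120)
fc2 = fc_params(120, 84)
fc3 = fc_params(84, 10)

total = conv1 + conv2_sparse + fc1 + fc2 + fc3
print(conv1, conv2_sparse, conv2_dense, fc1, fc2, fc3)  # 156 1516 2416 48120 10164 850
print(total)                                            # 60806
```

Almost 80% of the parameters sit in the first fully connected layer, a pattern that recurs in AlexNet and VGG.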

AlexNet (2012)

Definition: AlexNet

AlexNet won the ImageNet 2012 challenge by a large margin, starting the deep learning revolution in vision:

8 learned layers (5 convolutional, 3 fully connected), about 60M parameters
ReLU activations instead of tanh (faster training)
Dropout (p=0.5) for regularization
Data augmentation (random crops, horizontal flips, color perturbation)
GPU training (two GTX 580 GPUs)
Local Response Normalization (since largely superseded by BatchNorm)

VGGNet (2014)

Definition: VGGNet

VGG showed that depth matters, using only 3×3 convolutions:

16-19 weight layers deep
All 3×3 convolutions with stride 1, padding 1
2×2 max pooling (stride 2) after each conv block
Three FC layers at the end
138M parameters (VGG-16)
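The 138M figure can be checked by summing the layers. The sketch below assumes the standard VGG-16 layout (13 conv layers with the channel widths listed, a 224×224×3 input, two 4096-unit FC layers, and 1000 output classes); these specifics are not stated in the text above and are filled in here as assumptions.

```python
# Output channels of the 13 conv layers in the assumed VGG-16 configuration.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]


def vgg16_param_counts(num_classes=1000, in_channels=3):
    """Return (conv_params, fc_params) including biases."""
    conv_total = 0
    c_in = in_channels
    for c_out in VGG16_CONV_CHANNELS:
        conv_total += 3 * 3 * c_in * c_out + c_out
        c_in = c_out

    # Five 2x2 poolings reduce a 224x224 input to 7x7 with 512 channels.
    fc_sizes = [7 * 7 * 512, 4096, 4096, num_classes]
    fc_total = sum(n_in * n_out + n_out
                   for n_in, n_out in zip(fc_sizes, fc_sizes[1:]))
    return conv_total, fc_total


conv_total, fc_total = vgg16_param_counts()
print(conv_total)             # 14714688
print(fc_total)               # 123642856
print(conv_total + fc_total)  # 138357544
```

Under these assumptions the three FC layers hold roughly 89% of the parameters, which is the motivation for replacing them with global average pooling in later architectures.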

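Key insight: Two stacked 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters ($2 \cdot 9C^2 = 18C^2$ versus $25C^2$ weights for $C$ input and output channels) and an extra non-linearity between them.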

3×3 Convolution Stacking

$$\text{Receptive field: } 3 \times 3 \text{ conv (2 layers)} = 5 \times 5 \text{ effective}$$
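The receptive field of a stack of layers can be computed with a short recurrence: each layer adds $(k - 1)$ times the product of all earlier strides. The sketch below uses that recurrence and also compares weight counts for an illustrative channel width of 64; the function name and the width are assumptions made for the example.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Receptive field (along one axis) of a stack of conv or pooling layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


print(stacked_receptive_field([3, 3]))     # 5
print(stacked_receptive_field([3, 3, 3]))  # 7
print(stacked_receptive_field([5]))        # 5

# Weight counts (biases ignored) for C input and output channels.
C = 64
two_3x3 = 2 * (3 * 3 * C * C)
one_5x5 = 5 * 5 * C * C
print(two_3x3, one_5x5)  # 73728 102400
```

Three stacked 3×3 layers match a 7×7 receptive field in the same way, with $27C^2$ weights instead of $49C^2$.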

ResNet (2015)

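Definition: ResNet (Residual Network)

ResNet used skip (shortcut) connections to address the degradation problem, where adding more layers to a plain deep network increases training error:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

where $\mathcal{F}$ is the residual function learned by the stacked layers and $\mathbf{x}$ is the block input, carried forward unchanged by the identity shortcut.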

Skip connections allow gradients to flow directly through the network, enabling training of 100+ layer networks.
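A toy version of the residual formula shows why the shortcut helps: if the residual branch outputs zero, the block is exactly the identity, so extra blocks cannot make the representation worse by default. The sketch below swaps in small fully connected layers for convolutions to stay short, and it omits BatchNorm and the ReLU that ResNet applies after the addition; all names are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_block(x, W1, W2):
    """y = F(x) + x, with F(x) = W2 * relu(W1 * x)."""
    fx = linear(W2, relu(linear(W1, x)))
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]

# With a zero residual branch the block passes its input through unchanged.
print(residual_block(x, zeros, zeros))        # [1.0, -2.0, 3.0]
# With identity weights, F(x) = relu(x), so y = relu(x) + x.
print(residual_block(x, identity, identity))  # [2.0, -2.0, 6.0]
```

Because $\partial \mathbf{y} / \partial \mathbf{x}$ contains an identity term, the gradient reaching earlier layers always includes an unattenuated copy of the gradient at the block output.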

Inception Module (GoogLeNet)

Definition: Inception Module

The Inception module applies multiple filter sizes in parallel:

1×1 convolution (also placed before the 3×3 and 5×5 branches to reduce channels)
3×3 convolution
5×5 convolution
3×3 max pooling

Outputs are concatenated. This captures features at multiple scales without choosing a single kernel size.
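Two bits of arithmetic make the module easier to reason about: the output depth is the sum of the branch depths, and a 1×1 reduction in front of a large kernel cuts its weight count sharply. The branch widths below (192 input channels; 64, 128, 32, and 32 output channels; a 16-channel reduction before the 5×5) are illustrative assumptions, not values given in this tutorial.

```python
def conv_weights(k, c_in, c_out):
    """Weight count of a dense k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


c_in = 192
branch_out_channels = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}

# Concatenation along the channel axis: depths simply add up.
out_channels = sum(branch_out_channels.values())
print(out_channels)  # 256

# 5x5 branch applied directly to the input versus after a 1x1 reduction to 16.
direct_5x5 = conv_weights(5, c_in, 32)
reduced_5x5 = conv_weights(1, c_in, 16) + conv_weights(5, 16, 32)
print(direct_5x5, reduced_5x5)  # 153600 15872
```

With these numbers the reduced branch needs about a tenth of the weights of the direct one, which is what makes the parallel 5×5 path affordable.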

Depthwise Separable Convolution

Definition: Depthwise Separable Convolution

Used in MobileNet and EfficientNet for efficiency:

Depthwise convolution: Apply one filter per input channel
Pointwise convolution: 1×1 convolution to combine channels

Parameters: $k^2 \cdot C_{\text{in}} + C_{\text{in}} \cdot C_{\text{out}}$ vs $k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$ for standard convolution.

For 3×3 kernels: ~8-9× fewer parameters and FLOPs.
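The two parameter formulas can be compared directly. This is a small sketch with hypothetical helper names; the layer size (128 input channels, 256 output channels) is chosen only as an example, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = standard / separable

print(standard, separable)  # 294912 33920
print(round(ratio, 2))      # 8.69
```

The ratio equals $1 / (1/C_{\text{out}} + 1/k^2)$, so it approaches $k^2 = 9$ for 3×3 kernels as the number of output channels grows.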

EfficientNet: Compound Scaling

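Definition: EfficientNet

EfficientNet scales three dimensions together, all controlled by a single compound coefficient $\phi$:

$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$, so that each unit increase in $\phi$ roughly doubles the FLOPs.

The constants $\alpha, \beta, \gamma$ are found by a small grid search on the baseline network; $\phi$ is then chosen to fit the available compute budget, which keeps depth, width, and resolution in balance as the model grows.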
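A quick calculation shows how the multipliers grow with $\phi$. The defaults below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the constants reported for the EfficientNet-B0 baseline and are used here as an assumption, since the text above does not state them; the function name is hypothetical.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for a given phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these constants the constraint product is about 1.92, so the FLOPs multiplier is slightly under $2^\phi$. In practice the scaled depth, channel counts, and input resolution are rounded to usable integer values.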

Design Principles

Definition: CNN Design Principles

Start simple: Increase complexity only when needed
3×3 kernels: Two 3×3 beats one 5×5 (more non-linearity, fewer params)
Increase depth: Beyond roughly 20 layers, add skip connections
Channel progression: Double channels when spatial dims halve
Global average pooling: Replace FC layers to reduce parameters
BatchNorm: After every conv, before activation
Skip connections: Enable deeper training, improve gradient flow
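The channel progression rule has a simple cost argument behind it: halving height and width divides the number of positions by 4, while doubling both input and output channels multiplies the work per position by 4, so each stage costs about the same. The sketch below checks this for an illustrative four-stage layout; the stage sizes are assumptions, and only multiply-accumulates of a single 3×3 convolution per stage are counted.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution, stride 1, same padding."""
    return h * w * k * k * c_in * c_out


# (spatial size, channels) per stage: size halves, channels double.
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]
macs = [conv_macs(size, size, 3, c, c) for size, c in stages]

for (size, c), m in zip(stages, macs):
    print(f"{size}x{size}x{c}: {m:,} MACs")
# Every stage prints 115,605,504 MACs.
```

Keeping the per-stage cost flat spreads computation evenly across depth while the feature maps trade spatial detail for richer channels.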

Summary

Convolution detects local patterns with parameter sharing and translation equivariance
LeNet → AlexNet → VGG: Deeper networks with simple building blocks
ResNet: Skip connections enable training of 100+ layer networks
Inception: Multi-scale features with parallel filter branches
MobileNet/EfficientNet: Depthwise separable convolutions for efficiency
Compound scaling: Balance depth, width, and resolution under a given compute budget
