
CNN Architecture Deep Dive — LeNet to ResNet to EfficientNet


CNN Architectures — From LeNet to EfficientNet

Convolutional Neural Networks are the foundation of computer vision. This tutorial covers the convolution operation in depth and the evolution from LeNet to modern architectures.

  • Convolution is Feature Detection: local patterns are detected with shared parameters, which gives translation equivariance
  • Skip Connections Changed Everything: ResNet made 100+ layer networks trainable by addressing the degradation problem and improving gradient flow
  • Compound Scaling: EfficientNet scales depth, width, and resolution together through a single compound coefficient



The Convolution Operation

Definition: 2D Convolution

Given an input image $\mathbf{I}$ (height $H$, width $W$, channels $C$) and a kernel $\mathbf{K}$ (size $k \times k$, $C$ input channels, $F$ output channels), the convolution at position $(i, j)$ for output channel $f$ is:

$$\mathbf{O}(i, j, f) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=0}^{C-1} \mathbf{I}(i+m, j+n, c) \cdot \mathbf{K}(m, n, c, f)$$

Each output channel $f$ has its own filter, which is applied across all $C$ input channels; the per-channel products are summed into a single value per position. Strictly speaking this is cross-correlation, since the kernel is not flipped, but deep learning libraries call it convolution.
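The sum above can be sketched directly as nested loops. This is a minimal pure-Python illustration (no padding, stride 1, no bias), not an efficient implementation; the function name `conv2d` and the nested-list layout are assumptions made for this example.

```python
def conv2d(image, kernels):
    """Valid (no padding), stride-1 convolution on nested lists.

    image:   H x W x C
    kernels: k x k x C x F
    returns: (H-k+1) x (W-k+1) x F
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k, F = len(kernels), len(kernels[0][0][0])
    out_h, out_w = H - k + 1, W - k + 1
    out = [[[0.0] * F for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for f in range(F):
                total = 0.0
                # Sum over the k x k window and all input channels.
                for m in range(k):
                    for n in range(k):
                        for c in range(C):
                            total += image[i + m][j + n][c] * kernels[m][n][c][f]
                out[i][j][f] = total
    return out


# 5x5 single-channel input holding 1..25, row by row.
image = [[[float(r * 5 + col + 1)] for col in range(5)] for r in range(5)]
# One 3x3 vertical-edge filter: every row is [-1, 0, 1].
kernels = [[[[float(n - 1)]] for n in range(3)] for _ in range(3)]

output = conv2d(image, kernels)
print(len(output), len(output[0]), len(output[0][0]))  # 3 3 1
print(output[0][0][0])                                  # 6.0
```

Because the input grows by 1 per column, each of the three kernel rows contributes (right minus left) = 2, so every output value is 6.0: the filter responds to the constant horizontal gradient everywhere.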

CNN Output Shape
$$\text{Output} = \left\lfloor\frac{H - k + 2p}{s} + 1\right\rfloor \times \left\lfloor\frac{W - k + 2p}{s} + 1\right\rfloor \times F$$

[Figure: Convolution Operation: Kernel Slides Across Input. A 3×3 kernel slides across a 5×5 input to produce a 3×3 output; each output value is the dot product of the kernel with the local region under it (its receptive field).]

Padding and Stride

Definition: Padding

  • Valid: No padding, output is smaller than input
  • Same: Pad so output has same spatial size as input ($p = \lfloor k/2 \rfloor$ for stride 1 and odd $k$)
  • Zero-padding: Pad with zeros (most common)

Definition: Stride

Stride is the step size with which the kernel moves across the input. A stride of 2 roughly halves each spatial dimension, so a strided convolution acts as a learnable downsampling step.

Output Size Formula

$$\text{Output size} = \left\lfloor\frac{\text{Input size} - \text{Kernel size} + 2 \times \text{Padding}}{\text{Stride}} + 1\right\rfloor$$

Here,

  • $H, W$ = Input spatial dimensions
  • $k$ = Kernel size
  • $p$ = Zero-padding
  • $s$ = Stride
  • $F$ = Number of output filters
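The output size formula is easy to check numerically. The helper below is a small sketch; the name `conv_output_size` is chosen for this example, and the three configurations are illustrative.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one spatial axis: floor((n - k + 2p) / s + 1)."""
    return (n - k + 2 * p) // s + 1


print(conv_output_size(32, 5))              # valid 5x5 conv on 32 -> 28
print(conv_output_size(224, 3, p=1))        # "same" 3x3 conv on 224 -> 224
print(conv_output_size(224, 7, p=3, s=2))   # strided 7x7 conv on 224 -> 112
```

Integer floor division gives the same result as the formula because adding 1 inside or outside the floor is equivalent.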

Pooling

Definition: Pooling

Pooling reduces spatial dimensions and makes the output less sensitive to small translations of the input. With window size $k$ and stride $s$, and $m, n$ ranging over $0, \dots, k-1$:

  • Max Pooling: $\text{MaxPool}(\mathbf{O})_{i,j} = \max_{m,n} \mathbf{O}(i \cdot s + m, j \cdot s + n)$
  • Average Pooling: $\text{AvgPool}(\mathbf{O})_{i,j} = \frac{1}{k^2}\sum_{m,n} \mathbf{O}(i \cdot s + m, j \cdot s + n)$
  • Global Average Pooling: Average each feature map to a single value (often used in place of FC layers before the classifier)
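The three pooling variants above can be sketched on a single feature map as follows. This is a pure-Python illustration; `pool2d` and `global_avg_pool` are names invented for this example, and the 4×4 feature map is made-up data.

```python
def pool2d(fmap, k=2, s=2, mode="max"):
    """Max or average pooling over k x k windows with stride s."""
    H, W = len(fmap), len(fmap[0])
    out_h, out_w = (H - k) // s + 1, (W - k) // s + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [fmap[i * s + m][j * s + n]
                      for m in range(k) for n in range(k)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / (k * k))
        out.append(row)
    return out


def global_avg_pool(fmap):
    """Collapse a whole feature map to one number."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

print(pool2d(fmap, mode="max"))   # [[6, 8], [14, 16]]
print(pool2d(fmap, mode="avg"))   # [[3.5, 5.5], [11.5, 13.5]]
print(global_avg_pool(fmap))      # 8.5
```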

CNN Architecture Evolution

[Figure: CNN Architecture Evolution: Depth vs Performance. Axes: year vs. ImageNet top-1 accuracy. Points: LeNet (1998), AlexNet (2012), VGG-16 (2014), GoogLeNet (2014), ResNet (2015), EfficientNet (2019), ViT (2020). Annotation: skip connections enabled depth.]

LeNet (1998)

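Definition: LeNet-5

LeNet-5 was one of the first successful CNNs, designed for handwritten digit recognition. It takes a 32×32 grayscale input:

| Layer | Type | Output Shape | Parameters |
|---|---|---|---|
| 1 | Conv 5×5 | 28×28×6 | 156 |
| 2 | AvgPool 2×2 | 14×14×6 | 0 |
| 3 | Conv 5×5 | 10×10×16 | 1,516 |
| 4 | AvgPool 2×2 | 5×5×16 | 0 |
| 5 | FC | 120 | 48,120 |
| 6 | FC | 84 | 10,164 |
| 7 | FC | 10 | 850 |

The 1,516 parameters in layer 3 reflect the original sparse connection scheme between its input and output feature maps; a fully connected 5×5 convolution from 6 to 16 channels would have 2,416.

Total: ~60K parameters. LeNet-5 showed that local receptive fields, weight sharing, and pooling together could learn hierarchical features.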
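The parameter counts in the table can be reproduced with two small formulas: a convolution has (k·k·C_in + 1)·C_out parameters and a fully connected layer has (N_in + 1)·N_out, where the +1 is the bias. The sketch below uses helper names invented for this example. The 1,516 figure for the second convolution is taken from the table as-is, since it comes from a sparse connection pattern that the dense formula does not model.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel."""
    return (k * k * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


conv1 = conv_params(5, 1, 6)            # 156
conv3_dense = conv_params(5, 6, 16)     # 2416 if every map were connected
conv3_table = 1516                      # value from the table (sparse scheme)
fc5 = fc_params(5 * 5 * 16, 120)        # 48120
fc6 = fc_params(120, 84)                # 10164
fc7 = fc_params(84, 10)                 # 850

total = conv1 + conv3_table + fc5 + fc6 + fc7
print(total)  # 60806
```

Most of the roughly 60K parameters sit in the first fully connected layer, a pattern that recurs in later architectures.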


AlexNet (2012)

Definition: AlexNet

AlexNet won the ImageNet 2012 challenge by a large margin and set off the deep learning boom in computer vision:

  • 8 learned layers (5 convolutional, 3 fully connected), about 60M parameters
  • ReLU activations instead of tanh (faster training)
  • Dropout (p=0.5) for regularization
  • Data augmentation (random crops, flips, color perturbation)
  • GPU training (2 GTX 580 GPUs)
  • Local Response Normalization (since largely superseded by BatchNorm)
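One way to see why ReLU trains faster than tanh is to compare their derivatives. The sketch below is only an illustration of saturation, not a training experiment; the function names and sample inputs are chosen for this example.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """Derivative of max(0, x) away from zero."""
    return 1.0 if x > 0 else 0.0


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

For large positive inputs the tanh derivative shrinks toward zero (it saturates), while the ReLU derivative stays at 1, so the gradient passing through active ReLU units is not attenuated.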

VGGNet (2014)

Definition: VGGNet

VGG showed that depth matters, using only 3×3 convolutions:

  • 16-19 layers deep
  • All 3×3 convolutions with stride 1, padding 1
  • 2×2 max pooling after each conv block
  • Three FC layers at the end
  • 138M parameters (VGG-16)

Key insight: Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution, but with fewer parameters ($2 \cdot 9C^2 = 18C^2$ versus $25C^2$ weights for $C$ input and output channels) and an extra non-linearity between them.

3×3 Convolution Stacking
$$\text{Receptive field: } 3 \times 3 \text{ conv (2 layers)} = 5 \times 5 \text{ effective}$$
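Both halves of the stacking argument can be checked with a few lines. The sketch assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the helper names and the channel count of 64 are illustrative.

```python
def stacked_receptive_field(num_layers, k=3):
    """Receptive field of num_layers stacked k x k convs, all with stride 1."""
    return 1 + num_layers * (k - 1)


def conv_weights(k, channels):
    """Weights of a k x k conv with `channels` inputs and outputs (no bias)."""
    return k * k * channels * channels


C = 64
two_3x3 = 2 * conv_weights(3, C)   # 18 * C^2
one_5x5 = conv_weights(5, C)       # 25 * C^2

print(stacked_receptive_field(2))  # 5
print(stacked_receptive_field(3))  # 7
print(two_3x3, one_5x5)            # 73728 102400
```

Three stacked 3×3 layers likewise cover a 7×7 region, which is why VGG-style blocks can replace larger kernels throughout.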

ResNet (2015)

Definition: ResNet (Residual Network)

ResNet introduced skip connections to address the degradation problem, where adding layers to an already deep plain network makes even the training error worse:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

where $\mathcal{F}$ is the residual function learned by the block's layers and $\mathbf{x}$ is the block input, passed through unchanged by the identity shortcut.

Skip connections allow gradients to flow directly through the shortcut path, enabling training of 100+ layer networks.

[Figure: ResNet Skip Connection Block. Input x passes through Conv 3×3, BatchNorm, and ReLU, and is also carried by an identity shortcut; the two paths are added to give the output y = F(x) + x.]
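The effect of y = F(x) + x can be illustrated without any framework. The sketch below is a deliberately simplified scalar model, under the assumption that every layer has the same small local derivative; `residual_block` and `chain_gradient` are names invented for this example.

```python
def residual_block(x, residual_fn):
    """y = F(x) + x, applied element-wise to a list of floats."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A block whose layers have learned nothing yet: F(x) = 0."""
    return [0.0] * len(x)


def chain_gradient(local_grad, depth, skip):
    """Product of per-layer derivatives through `depth` layers.

    Plain layer:    dy/dx = F'(x)
    Residual layer: dy/dx = F'(x) + 1
    """
    grad = 1.0
    for _ in range(depth):
        grad *= (local_grad + 1.0) if skip else local_grad
    return grad


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))            # [1.0, -2.0, 3.0]

plain = chain_gradient(0.01, depth=50, skip=False)
resid = chain_gradient(0.01, depth=50, skip=True)
print(plain, resid)
```

With F = 0 the block is exactly the identity, so extra blocks cannot hurt by construction. In the toy gradient model the plain chain collapses to about 1e-100, while the residual chain stays of order 1 because of the +1 contributed by each shortcut.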

Inception Module (GoogLeNet)

Definition: Inception Module

The Inception module applies multiple filter sizes in parallel:

  • 1×1 convolution (reduces channels)
  • 3×3 convolution
  • 5×5 convolution
  • 3×3 max pooling

Outputs are concatenated. This captures features at multiple scales without choosing a single kernel size.
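Since the branch outputs are concatenated along the channel axis, the module's output width is simply the sum of the branch widths. The sketch below also shows why a 1×1 convolution is used to reduce channels before an expensive 5×5 branch. All channel counts here are illustrative values chosen for this example, and the helper names are hypothetical.

```python
def inception_output_channels(c_1x1, c_3x3, c_5x5, c_pool):
    """Channel count after concatenating the four parallel branches."""
    return c_1x1 + c_3x3 + c_5x5 + c_pool


def conv_weights(k, c_in, c_out):
    """Weights of a k x k convolution (no bias)."""
    return k * k * c_in * c_out


# Illustrative branch widths for a module with 192 input channels.
out_channels = inception_output_channels(64, 128, 32, 32)

# 5x5 branch applied directly vs. after a 1x1 reduction to 16 channels.
direct_5x5 = conv_weights(5, 192, 32)
reduced_5x5 = conv_weights(1, 192, 16) + conv_weights(5, 16, 32)

print(out_channels)             # 256
print(direct_5x5, reduced_5x5)  # 153600 15872
```

In this configuration the 1×1 reduction cuts the weights of the 5×5 branch by roughly a factor of ten.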

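[Figure: Inception Module: Multi-Scale Feature Extraction. The input feeds parallel 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 pool branches, with additional 1×1 convolutions; the branch outputs are concatenated into multi-scale features.]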

Depthwise Separable Convolution

Definition: Depthwise Separable Convolution

Used in MobileNet and EfficientNet for efficiency:

  1. Depthwise convolution: Apply one $k \times k$ filter per input channel
  2. Pointwise convolution: 1×1 convolution to combine channels

Parameters: $k^2 \cdot C_{\text{in}} + C_{\text{in}} \cdot C_{\text{out}}$ vs $k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$ for standard convolution.

The ratio of the two is $\frac{1}{C_{\text{out}}} + \frac{1}{k^2}$, so for 3×3 kernels and a large number of output channels this is roughly 8-9× fewer parameters and FLOPs.
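The saving can be checked by plugging numbers into the two parameter formulas. This is a small sketch with bias terms ignored; the function names and the layer size (3×3, 128 to 256 channels) are assumptions for the example.

```python
def standard_conv_params(k, c_in, c_out):
    """k^2 * C_in * C_out"""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise (k^2 * C_in) plus pointwise (C_in * C_out)."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)

print(standard, separable)              # 294912 33920
print(round(standard / separable, 2))   # 8.69
```

The reduction approaches 9× as the number of output channels grows, because the 1/C_out term in the ratio vanishes.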


EfficientNet: Compound Scaling

Definition: EfficientNet

EfficientNet scales three dimensions together using a single compound coefficient $\phi$:

$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$.

FLOPs grow roughly in proportion to $d \cdot w^2 \cdot r^2$, so the constraint means each unit increase in $\phi$ about doubles the compute. The constants $\alpha$, $\beta$, $\gamma$ are chosen once by a small search on the baseline network; $\phi$ then sets the balance between depth, width, and resolution for a given compute budget.
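The scaling rule can be sketched as a small function. The default constants α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper for its baseline network, not something stated in this lesson, so treat them as an assumption; the function name is also invented for this example.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi.

    alpha, beta, gamma are assumed constants (see lead-in); the FLOPs
    multiplier follows from FLOPs being proportional to d * w^2 * r^2.
    """
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these constants α·β²·γ² is about 1.92, so each step in φ roughly doubles the FLOPs, matching the constraint above.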

Design Principles

Definition: CNN Design Principles

  1. Start simple: Increase complexity only when needed
  2. 3×3 kernels: Two 3×3 beats one 5×5 (more non-linearity, fewer params)
  3. Increase depth: But use skip connections beyond 20 layers
  4. Channel progression: Double channels when spatial dims halve
  5. Global average pooling: Replace FC layers to reduce parameters
  6. BatchNorm: After every conv, before activation
  7. Skip connections: Enable deeper training, improve gradient flow
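The channel progression rule in the list above has a simple cost rationale: halving height and width divides the number of positions by 4, while doubling input and output channels multiplies the work per position by 4, so each stage costs about the same. The sketch below checks this for 3×3 convolutions; the starting size of 56×56 with 64 channels and the helper name `conv_macs` are illustrative choices.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates of one k x k conv with 'same' output size."""
    return h * w * k * k * c_in * c_out


stages = []
size, channels = 56, 64
for _ in range(4):
    stages.append((size, channels, conv_macs(size, size, 3, channels, channels)))
    size //= 2       # spatial dims halve ...
    channels *= 2    # ... and channels double

for size, channels, macs in stages:
    print(f"{size}x{size}x{channels}: {macs} MACs")
```

All four stages come out at 115,605,504 multiply-accumulates per layer, which keeps compute evenly spread across the depth of the network.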

Summary

  • Convolution detects local patterns with parameter sharing and translation equivariance
  • LeNet → AlexNet → VGG: Deeper networks with simple building blocks
  • ResNet: Skip connections enable training of 100+ layer networks
  • Inception: Multi-scale features with parallel filter branches
  • MobileNet/EfficientNet: Depthwise separable convolutions for efficiency
  • Compound scaling: Balance depth, width, and resolution for optimal performance
