Deep Learning

Transfer Learning — Stand on the Shoulders of Giants

Learn how to leverage pre-trained models to solve new problems with less data and compute.

Knowledge transfer — reuse learned features from large models
Fine-tuning — adapt pre-trained weights to your task
Data efficiency — achieve great results with small datasets

If I have seen further, it is by standing on the shoulders of giants.

Transfer Learning — Complete Guide

Transfer learning reuses a pre-trained model on a new task, dramatically reducing data and training requirements. This is now the default approach in modern ML — training from scratch is the exception.

Why Transfer Learning?

Training from scratch
Cost: $10K-$100K+
Overfitting risk: Very high with small data
Accuracy (small dataset): 60-70%
Only practical when you have massive data and a large compute budget.

Transfer learning
Data required: 100-10K labeled examples
Compute: Hours on a single GPU
Cost: $1-$100
Overfitting risk: Low (regularized by pre-trained weights)
Accuracy (small dataset): 90-95%+
The default approach in modern ML. Train from scratch only when necessary.

Feature Hierarchy in Pre-trained Models

Transfer Learning Strategies

Implementation

Example: Transfer Learning with ResNet

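import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load ResNet-50 with ImageNet weights
# (the weights= argument replaces the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Strategy 1: Feature extraction (freeze the backbone, train only the new head)
for param in model.parameters():
    param.requires_grad = False
# A newly created layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Strategy 2: Fine-tune with discriminative learning rates
# Start again from a fresh pre-trained model
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
# Unfreeze the last residual stage; the new head below is trainable too
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},  # pre-trained stage: small LR
    {'params': model.fc.parameters(), 'lr': 1e-3},      # new head: larger LR
], weight_decay=1e-4)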
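
The two-group optimizer above can be generalized to one learning rate per layer group. The following is a minimal sketch, not part of the original example: the helper name `layerwise_lrs`, the `decay` factor, and the chosen group list are assumptions for illustration. It only computes the rates and does not touch the model.

```python
def layerwise_lrs(group_names, top_lr=1e-3, decay=0.1):
    """Map each layer group to a learning rate, lowest for the earliest group.

    group_names must be ordered from the input side to the output side.
    The last group gets top_lr; each earlier group is scaled down by `decay`.
    """
    n = len(group_names)
    return {
        name: top_lr * decay ** (n - 1 - i)
        for i, name in enumerate(group_names)
    }


# Hypothetical choice of groups: the last two ResNet stages plus the new head
groups = ["layer3", "layer4", "fc"]
lrs = layerwise_lrs(groups, top_lr=1e-3, decay=0.1)

for name in groups:
    print(f"{name}: {lrs[name]:.0e}")
```

With a torchvision ResNet, each name could then be turned into an optimizer group via `getattr(model, name).parameters()`, provided those groups have been unfrozen first.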

When to Use Transfer Learning

Transfer Learning Decision Framework

Questions to ask:

How much labeled data?
- < 100: Use pre-trained model as feature extractor + simple classifier
- 100-10K: Fine-tune top layers
- > 10K: Full fine-tune (or train from scratch if you have enough compute)
How similar is the domain?
- Similar domain (e.g., ImageNet → photos of pets or everyday objects): Feature extraction works well
- Different domain (e.g., ImageNet → medical or satellite imagery): Fine-tune more layers
What's the task complexity?
- Simple (binary classification): Feature extraction may suffice
- Complex (pixel-level segmentation): Full fine-tuning is usually needed
What pre-trained model to use?
- Vision: ResNet, EfficientNet, ViT
- NLP: BERT, RoBERTa, DeBERTa
- Multi-modal: CLIP, BLIP
- General: GPT, LLaMA
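
The data-size and domain questions above can be folded into a small helper. This is only a sketch of the rules of thumb listed here: the function name `suggest_strategy` and the way the two questions are combined are assumptions, and the thresholds are the rough ones from the list, not hard limits.

```python
def suggest_strategy(n_labeled, similar_domain=True):
    """Suggest a transfer learning strategy from rough rules of thumb.

    n_labeled: number of labeled training examples available.
    similar_domain: True if the target data resembles the pre-training data.
    """
    if n_labeled < 100:
        return "feature extraction + simple classifier"
    if n_labeled <= 10_000:
        # With a different domain, generic early features transfer but
        # later layers need more adaptation, so unfreeze more of them.
        return "fine-tune top layers" if similar_domain else "fine-tune more layers"
    return "full fine-tune"


print(suggest_strategy(50))
print(suggest_strategy(2_000, similar_domain=False))
print(suggest_strategy(50_000))
```

Treat the result as a starting point to validate on held-out data, not as a verdict.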

Catastrophic Forgetting

When fine-tuning on a new task, the model may forget what it learned during pre-training. This is especially severe when:

New dataset is very different from pre-training data
Learning rate is too high
Too many epochs of fine-tuning

Mitigation strategies:

Small learning rate (2e-5 to 5e-5 for BERT)
Early stopping based on validation loss
Regularization: L2/weight decay, dropout
Replay buffers: Mix pre-training data during fine-tuning
Gradual unfreezing: Unfreeze layers progressively
EWC (Elastic Weight Consolidation): Penalize changes to important weights
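
To make the EWC item concrete: EWC adds a quadratic penalty to the new-task loss, (λ/2) · Σ F_i (θ_i − θ*_i)², where θ* are the weights after pre-training and F_i estimates how important weight i was for the original task. The sketch below computes that penalty on plain Python lists. The function name and the toy numbers are made up for illustration, and estimating the importance values (typically from the Fisher information) is not shown.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC penalty: (lam / 2) * sum_i fisher[i] * (params[i] - anchor_params[i])**2.

    params: current weights during fine-tuning.
    anchor_params: weights saved after pre-training.
    fisher: per-weight importance estimates (assumed to be given).
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher)
    )


anchor = [1.0, 1.0]
fisher = [10.0, 0.5]  # toy values: the first weight is far more important

# Moving the unimportant weight by 1.0 is cheap...
cheap = ewc_penalty([1.0, 2.0], anchor, fisher, lam=2.0)
# ...while moving the important weight by the same amount is expensive.
costly = ewc_penalty([2.0, 1.0], anchor, fisher, lam=2.0)
print(cheap, costly)
```

During fine-tuning this value would be added to the task loss, so the optimizer is free to change unimportant weights while important ones stay close to their pre-trained values.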

Key Takeaways

Summary: Transfer Learning

Transfer learning dramatically reduces data and compute needs
Feature extraction: Fastest, least overfitting — use for small datasets
Fine-tuning: Better performance — use small LR to prevent catastrophic forgetting
Discriminative learning rates: Lower for early layers, higher for new layers
Gradual unfreezing: Unfreeze layer by layer during training
ImageNet pre-trained models are a strong starting point for most vision tasks
BERT/GPT-style pre-trained models are a strong starting point for most NLP tasks
Transfer learning is the default approach in modern ML
When to train from scratch: Very large dataset, very different domain, sufficient compute
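
Gradual unfreezing, mentioned in the summary, can be expressed as a simple schedule. The sketch below is one possible version under stated assumptions: the helper name `trainable_groups`, the `epochs_per_stage` parameter, and the group list are illustrative, and the code only decides which groups are trainable without modifying a model.

```python
def trainable_groups(epoch, group_names, epochs_per_stage=1):
    """Return the layer groups that are trainable at `epoch` (0-based).

    group_names is ordered from the input side to the output side.
    Unfreezing starts with the last group (the new head) and moves one
    group toward the input every `epochs_per_stage` epochs.
    """
    n_unfrozen = min(len(group_names), 1 + epoch // epochs_per_stage)
    return group_names[len(group_names) - n_unfrozen:]


# Group names follow torchvision's ResNet attribute names
resnet_groups = ["layer1", "layer2", "layer3", "layer4", "fc"]

for epoch in range(6):
    print(epoch, trainable_groups(epoch, resnet_groups))
```

In a PyTorch training loop, the returned names could drive `requires_grad` at the start of each epoch, so that early layers stay frozen until the later ones have adapted.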

What to Learn Next

-> BERT Apply transfer learning in NLP.

-> GPT Architecture Explore large language models.

-> Transformers Master the foundation of modern AI.

-> CNNs Learn about computer vision models.

-> Fine-tuning LLMs Adapt large models to your specific needs.

-> Training Deep Networks Master optimizers and regularization.
