Transfer Learning — Stand on the Shoulders of Giants
Learn how to leverage pre-trained models to solve new problems with less data and compute.
- Knowledge transfer — reuse learned features from large models
- Fine-tuning — adapt pre-trained weights to your task
- Data efficiency — achieve great results with small datasets
"If I have seen further, it is by standing on the shoulders of giants." (Isaac Newton)
Transfer Learning — Complete Guide
Transfer learning reuses a model pre-trained on a large dataset as the starting point for a new task, which typically cuts both the labeled data and the training time needed. It is now the default approach in modern ML; training from scratch is the exception.
Why Transfer Learning?
Feature Hierarchy in Pre-trained Models
Transfer Learning Strategies
Implementation
Example: Transfer Learning with ResNet
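import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load ResNet-50 with ImageNet weights
# (torchvision >= 0.13; older releases used pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Strategy 1: Feature extraction (freeze the whole backbone)
for param in model.parameters():
    param.requires_grad = False
# The new head is created after freezing, so it is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Strategy 2: Fine-tune with discriminative learning rates
for param in model.parameters():
    param.requires_grad = False
# Unfreeze the last residual stage; the new head below is also trainable
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},  # pre-trained: small LR
    {'params': model.fc.parameters(), 'lr': 1e-3},      # new head: larger LR
], weight_decay=1e-4)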
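A quick way to confirm that freezing behaved as intended is to count trainable and frozen parameters before training starts. The sketch below is illustrative only: `count_parameters` is a hypothetical helper, and a tiny stand-in network replaces ResNet-50 so the example runs without downloading weights.

```python
from typing import Tuple

import torch.nn as nn


def count_parameters(model: nn.Module) -> Tuple[int, int]:
    """Return (trainable, frozen) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


# Stand-in for a pre-trained backbone (assumption: any nn.Module behaves the same way)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
for param in backbone.parameters():
    param.requires_grad = False  # freeze, as in Strategy 1

head = nn.Linear(16, 3)  # new task head, trainable by default
model = nn.Sequential(backbone, head)

trainable, frozen = count_parameters(model)
print(f"trainable={trainable}, frozen={frozen}")  # trainable=51, frozen=416
```

Applied to the ResNet-50 example, Strategy 1 should report only the new `fc` layer as trainable, while Strategy 2 should add the parameters of `layer4`.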
When to Use Transfer Learning
Transfer Learning Decision Framework
Questions to ask:
- How much labeled data?
  - < 100: Use the pre-trained model as a feature extractor with a simple classifier on top
  - 100-10K: Fine-tune the top layers
  - > 10K: Full fine-tuning (or training from scratch if the dataset is very large and compute allows)
- How similar is the domain?
  - Similar domain (e.g., ImageNet → other natural photographs): Feature extraction works well
  - Different domain (e.g., ImageNet → medical or satellite imagery): Fine-tune more layers
- What's the task complexity?
  - Simple (binary classification): Feature extraction may suffice
  - Complex (pixel-level segmentation): Full fine-tuning is usually needed
- What pre-trained model to use?
  - Vision: ResNet, EfficientNet, ViT
  - NLP: BERT, RoBERTa, DeBERTa
  - Multi-modal: CLIP, BLIP
  - General-purpose language models: GPT, LLaMA
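The first two questions can be folded into a small rule of thumb. The function below is a minimal sketch: `recommend_strategy` is a hypothetical name, the thresholds come from the framework above, and the way data size and domain similarity are combined is an illustrative assumption rather than a fixed rule.

```python
def recommend_strategy(n_labeled: int, similar_domain: bool) -> str:
    """Map dataset size and domain similarity to a transfer learning strategy.

    Thresholds follow the decision framework; combining the two questions
    this way is an assumption made for illustration.
    """
    if n_labeled < 100:
        return "feature extraction + simple classifier"
    if n_labeled <= 10_000:
        if similar_domain:
            return "fine-tune top layers"
        return "fine-tune more layers"
    return "full fine-tuning"


for n, similar in [(50, True), (2_000, True), (2_000, False), (50_000, False)]:
    print(n, similar, "->", recommend_strategy(n, similar))
```

Treat the result as a starting point and confirm it on a validation set, since task complexity and model choice also affect the decision.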
Catastrophic Forgetting
When fine-tuning on a new task, gradient updates can overwrite the representations learned during pre-training, so the model loses capabilities it previously had. This is especially severe when:
- New dataset is very different from pre-training data
- Learning rate is too high
- Too many epochs of fine-tuning
Mitigation strategies:
- Small learning rate (2e-5 to 5e-5 for BERT)
- Early stopping based on validation loss
- Regularization: L2/weight decay, dropout
- Replay buffers: Mix pre-training data during fine-tuning
- Gradual unfreezing: Unfreeze layers progressively
- EWC (Elastic Weight Consolidation): Penalize changes to important weights
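Of the strategies above, EWC is the least self-explanatory. It adds a quadratic penalty, (λ/2) · Σ F_i (θ_i − θ*_i)², to the task loss, where θ* are the pre-trained weights and F_i is a per-weight importance score, usually estimated from the diagonal of the Fisher information. The sketch below shows only the penalty term, with plain Python lists standing in for tensors; the function name and toy values are made up for illustration.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    pretrained_params: Sequence[float],
    importance: Sequence[float],
    lam: float,
) -> float:
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (theta - theta_star) ** 2
        for theta, theta_star, f in zip(params, pretrained_params, importance)
    )


# Toy values (made up for illustration)
pretrained = [0.0, 2.0]
current = [1.0, 2.5]
importance = [2.0, 0.0]  # the second weight is "unimportant", so moving it is free

penalty = ewc_penalty(current, pretrained, importance, lam=0.5)
print(penalty)  # 0.5
```

In a real training loop the same expression would be computed over the model's parameter tensors and added to the fine-tuning loss before the backward pass.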
Key Takeaways
Summary: Transfer Learning
- Transfer learning dramatically reduces data and compute needs
- Feature extraction: Fastest and least prone to overfitting; use it for small datasets
- Fine-tuning: Usually better performance; use a small LR to limit catastrophic forgetting
- Discriminative learning rates: Lower for early (pre-trained) layers, higher for newly added layers
- Gradual unfreezing: Unfreeze layer by layer during training, starting from the layers closest to the output
- ImageNet pre-trained models are a strong starting point for most vision tasks
- BERT/GPT-style pre-trained models are a strong starting point for most NLP tasks
- Transfer learning is the default approach in modern ML
- When to train from scratch: Very large dataset, very different domain, sufficient compute
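Gradual unfreezing is easiest to reason about as a schedule. The following sketch is one possible way to express it, not a standard API: `trainable_groups` is a hypothetical helper, and the group names are assumed to match the main blocks of the ResNet-50 example (the stem layers are omitted for brevity).

```python
from typing import List, Sequence


def trainable_groups(
    layer_groups: Sequence[str], epoch: int, epochs_per_stage: int = 1
) -> List[str]:
    """Return the layer groups that should be trainable at a given epoch.

    Groups are ordered from input to output. Training starts with only the
    last group (the new head) and unfreezes one more group every
    `epochs_per_stage` epochs, moving toward the input.
    """
    n_unfrozen = min(len(layer_groups), 1 + epoch // epochs_per_stage)
    return list(layer_groups[-n_unfrozen:])


groups = ["layer1", "layer2", "layer3", "layer4", "fc"]
for epoch in range(4):
    print(epoch, trainable_groups(groups, epoch))
```

At the start of each epoch, one would set `requires_grad = True` on the parameters of the returned groups and make sure the optimizer covers them, ideally with the lower learning rates that discriminative fine-tuning calls for.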
What to Learn Next
- BERT: Apply transfer learning in NLP.
- GPT Architecture: Explore large language models.
- Transformers: Master the foundation of modern AI.
- CNNs: Learn about computer vision models.
- Fine-tuning LLMs: Adapt large models to your specific needs.
- Training Deep Networks: Master optimizers and regularization.