🎯 The Interview Question
"Explain the different strategies for transfer learning, including feature extraction vs fine-tuning. What is domain adaptation and how does it address the domain shift problem? How do you decide which layers to freeze when fine-tuning? What are the best practices for fine-tuning large pre-trained models like BERT or GPT?"
This question matters for practical ML work: starting from a pre-trained model is the default approach in most modern deep learning projects.
📚 Detailed Answer
Transfer Learning: The Core Idea
Transfer learning leverages knowledge from a source task to improve performance on a target task.
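Formal definition: Given a source domain $\mathcal{D}_S$ with task $\mathcal{T}_S$ and a target domain $\mathcal{D}_T$ with task $\mathcal{T}_T$, transfer learning uses knowledge from $\mathcal{D}_S$ and $\mathcal{T}_S$ to improve learning of the target predictive function $f_T$ in $\mathcal{D}_T$.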
Why it works: Features learned on large datasets (ImageNet, Wikipedia) capture general patterns that transfer to new tasks.
Strategy 1: Feature Extraction
Use the pre-trained model as a fixed feature extractor and train only a new classifier head:
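import torch.nn as nn
from torchvision import models

# Load ImageNet weights and freeze every pre-trained parameter
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
# Replace the classifier; the new layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)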
When to use:
- Small target dataset (< 10K samples)
- Similar domain to pre-training
- Limited compute budget
Advantages:
- Fast training (only train classifier head)
- No risk of catastrophic forgetting
- Low memory footprint
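To put rough numbers on the advantages above, the sketch below counts how many parameters actually receive gradients when only a new ResNet-50 head is trained. The backbone size is the torchvision ResNet-50 parameter count without its original 1000-class head; `num_classes = 10` is an assumed example value, not something the text specifies.

```python
# Parameter budget for feature extraction with ResNet-50.
# Assumption: num_classes = 10 is an illustrative target task size.
BACKBONE_PARAMS = 23_508_032   # torchvision ResNet-50 without its fc layer
FEATURE_DIM = 2048             # input width of model.fc


def head_params(num_classes: int) -> int:
    """Weights plus biases of the replacement nn.Linear(2048, num_classes)."""
    return FEATURE_DIM * num_classes + num_classes


def trainable_fraction(num_classes: int) -> float:
    """Share of all parameters that receive gradients when the backbone is frozen."""
    head = head_params(num_classes)
    return head / (BACKBONE_PARAMS + head)


num_classes = 10
print(head_params(num_classes))                  # 20490
print(f"{trainable_fraction(num_classes):.4%}")  # well under 1% of the model
```

Because gradients and optimizer state are only kept for the head, both the backward pass and the optimizer memory shrink accordingly.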
Strategy 2: Fine-Tuning
Update all or part of the pre-trained model:
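import torch
from torchvision import models

# Fine-tune all layers with a small learning rate to limit drift from the pre-trained weights
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)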
When to use:
- Large target dataset
- Different domain from pre-training
- Need maximum performance
Risks:
- Catastrophic forgetting (overwriting useful features)
- Overfitting on small datasets
- Requires careful hyperparameter tuning
Layer Freezing Strategies
Gradual Unfreezing
Start with only the new classifier head trainable, then unfreeze the pre-trained layers in stages, beginning with the block closest to the output:
# `train` stands for your own training loop; it is not defined here.
# Stage 1: Train only the classifier
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)
train(model, epochs=5)
# Stage 2: Unfreeze the last residual block
for param in model.layer4.parameters():
    param.requires_grad = True
train(model, epochs=5, lr=1e-5)
# Stage 3: Unfreeze everything and lower the learning rate again
for param in model.parameters():
    param.requires_grad = True
train(model, epochs=10, lr=1e-6)
💡 Use different learning rates for different layers: lower for early layers (general features), higher for later layers (task-specific features). This is called "discriminative fine-tuning."
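A minimal sketch of discriminative fine-tuning: the last layer group gets the base learning rate, and each earlier group gets the next group's rate divided by a constant factor. The factor 2.6 follows the ULMFiT recipe; the group names, `base_lr`, and the helper `layerwise_lrs` are illustrative assumptions.

```python
def layerwise_lrs(groups, base_lr=1e-3, decay=2.6):
    """Map layer groups (ordered early -> late) to learning rates.

    The last group gets base_lr; each earlier group gets the next
    group's rate divided by `decay`.
    """
    n = len(groups)
    return {name: base_lr / decay ** (n - 1 - i) for i, name in enumerate(groups)}


# Hypothetical ResNet-style grouping, earliest layers first.
groups = ["layer1", "layer2", "layer3", "layer4", "fc"]
lrs = layerwise_lrs(groups)
for name, lr in lrs.items():
    print(f"{name}: {lr:.2e}")
```

In PyTorch, rates like these are usually applied through optimizer parameter groups, i.e. a list of dicts with a `"params"` entry and a per-group `"lr"`.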
Domain Adaptation
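Domain adaptation addresses distribution shift between source and target domains: the task stays the same, but the input distribution changes, so a model trained only on source data degrades on target data. The methods below align the feature distributions of the two domains so that a classifier trained on source features also works on target features.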
Maximum Mean Discrepancy (MMD)
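Minimize the distance between the source and target feature distributions. A simple moment-matching form of this discrepancy is:
$$\mathcal{L}_{\text{MMD}} = \lVert \mu_s - \mu_t \rVert_2^2 + \lVert \Sigma_s - \Sigma_t \rVert_F^2$$
where $\mu_s, \mu_t$ and $\Sigma_s, \Sigma_t$ are the means and covariances of the source/target features.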
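A kernel-based estimate of the squared MMD between two feature samples can be sketched as follows. This is a plain-Python illustration with an RBF kernel and the biased (V-statistic) estimator; the function names, `gamma`, and the toy Gaussian "features" are assumptions made for the example, and a real training loop would compute the same quantity on batched tensors.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(source, target, gamma=1.0):
    """Biased estimate of squared MMD: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


# Toy 2-D features: `same` shares the source distribution, `shifted` does not.
rng = random.Random(0)
source = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
shifted = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(100)]

print(mmd2(source, same))     # small: both samples come from one distribution
print(mmd2(source, shifted))  # clearly larger: the distributions differ
```

Adding this quantity to the task loss penalizes feature extractors whose source and target features are easy to tell apart.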
Adversarial Domain Adaptation
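Train a domain classifier to tell source features from target features, and train the feature extractor to fool it, which pushes the features toward domain invariance:
$$\min_{F} \max_{D} \; \mathbb{E}_{x \sim p_s}[\log D(F(x))] + \mathbb{E}_{x \sim p_t}[\log(1 - D(F(x)))]$$
where $F$ is the feature extractor and $D$ is the domain discriminator.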
DANN (Domain-Adversarial Neural Network)
Input → Feature Extractor → Task Classifier → Task Loss
                ↓
        Domain Classifier → Domain Loss
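A gradient reversal layer sits between the feature extractor and the domain classifier. It is the identity on the forward pass and multiplies the gradient by $-\lambda$ on the backward pass, so the feature extractor is pushed toward features the domain classifier cannot separate, i.e. domain-invariant features.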
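A gradient reversal layer can be sketched in PyTorch with a custom `torch.autograd.Function`. The names `GradientReversal`, `grad_reverse`, and `lambd` are illustrative choices; this is a minimal sketch of the mechanism, not the reference DANN implementation.

```python
import torch
from torch.autograd import Function


class GradientReversal(Function):
    """Identity in the forward pass; scales gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; lambd is a plain float, so it gets None.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)


# Stand-in for features feeding a domain classifier.
features = torch.ones(3, requires_grad=True)
domain_logit = grad_reverse(features, lambd=0.5).sum()
domain_logit.backward()
print(features.grad)  # tensor([-0.5000, -0.5000, -0.5000])
```

Placing `grad_reverse` between the feature extractor and the domain classifier lets a single backward pass train the classifier to separate domains while training the extractor to confuse it.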
Fine-Tuning Large Language Models
Parameter-Efficient Fine-Tuning (PEFT)
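LoRA (Low-Rank Adaptation):
$$W = W_0 + \Delta W = W_0 + BA$$
where $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$.
Only train $A$ and $B$ (typically $r \ll \min(d, k)$), keeping $W_0$ frozen.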
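A minimal sketch of a LoRA-wrapped linear layer in PyTorch, assuming a frozen `nn.Linear` as the base weight. The class name `LoRALinear`, the rank `r = 8`, and the `alpha / r` scaling (taken from the LoRA paper, not from the text above) are assumptions for illustration; libraries such as PEFT handle this wiring for real models.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weight frozen
        # A has shape (r, in_features); B has shape (out_features, r).
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x):
        # B starts at zero, so the layer initially matches the frozen base.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 8192 270848
```

Here only about 3% of the layer's parameters are trainable, and the share shrinks further as the layer gets wider at a fixed rank.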
QLoRA: Quantize frozen weights to 4-bit, train LoRA adapters.
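Adapters: Insert small bottleneck layers between frozen layers:
$$h \leftarrow h + W_{\text{up}}\, \sigma(W_{\text{down}}\, h)$$
where $W_{\text{down}} \in \mathbb{R}^{r \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and $\sigma$ is a nonlinearity.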
Best Practices
Follow-Up Questions
Q: How do you detect catastrophic forgetting? A: Monitor performance on a held-out set from the source task during fine-tuning. If it drops significantly, lower the learning rate, freeze more layers, or add a regularizer such as Elastic Weight Consolidation (EWC).
Q: When is domain adaptation necessary? A: When source and target distributions differ significantly. Examples: synthetic to real images, different camera angles, different text styles.
Q: How does few-shot learning differ from transfer learning? A: Standard fine-tuning usually assumes a reasonably sized labeled target set, whereas few-shot learning has only a handful of labeled examples per class (roughly 1-10). Techniques like Prototypical Networks and MAML are designed for few-shot scenarios.
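For the first follow-up, the EWC regularizer can be sketched as a quadratic penalty that anchors each parameter to its pre-trained value, weighted by how important that parameter was for the source task. The function name `ewc_penalty`, the flat parameter lists, and the toy Fisher values are assumptions for illustration; in practice the Fisher information is estimated from source-task gradients.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


anchor = [1.0, -2.0, 0.5]   # pre-trained parameter values (toy numbers)
fisher = [10.0, 0.1, 1.0]   # per-parameter importance (toy numbers)

print(ewc_penalty(anchor, anchor, fisher))            # 0.0, nothing has moved
print(ewc_penalty([1.5, -2.0, 0.5], anchor, fisher))  # 1.25, important weight moved
print(ewc_penalty([1.0, -1.5, 0.5], anchor, fisher))  # far smaller for the same shift
```

The penalty is added to the target-task loss, so moving an important weight costs far more than moving an unimportant one by the same amount.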