
Transfer Learning — Pre-trained Models Complete Guide


Transfer Learning — Stand on the Shoulders of Giants

Learn how to leverage pre-trained models to solve new problems with less data and compute.

  • Knowledge transfer — reuse learned features from large models
  • Fine-tuning — adapt pre-trained weights to your task
  • Data efficiency — achieve great results with small datasets

If I have seen further, it is by standing on the shoulders of giants.

Transfer Learning — Complete Guide

Transfer learning reuses a pre-trained model on a new task, dramatically reducing data and training requirements. This is now the default approach in modern ML — training from scratch is the exception.


Why Transfer Learning?

Transfer Learning vs Training from Scratch

Training from Scratch
  • Data required: 1M+ labeled images
  • Compute: weeks on a GPU cluster
  • Cost: $10K-$100K+
  • Overfitting risk: very high with small data
  • Accuracy (small dataset): 60-70%
  • Only practical when you have massive data and a large compute budget

Transfer Learning
  • Data required: 100-10K labeled examples
  • Compute: hours on a single GPU
  • Cost: $1-$100
  • Overfitting risk: low (regularized by pre-trained weights)
  • Accuracy (small dataset): 90-95%+
  • The default approach in modern ML. Train from scratch only when necessary.

Feature Hierarchy in Pre-trained Models

What Pre-trained Models Learn (Feature Hierarchy)

  • Early layers: edges, colors, simple textures. Universal, works for any vision task. Action: freeze.
  • Middle layers: patterns, object parts, combinations. Mostly transferable, with slight task-specific tuning. Action: fine-tune (small LR).
  • Deep layers: object-level, semantic concepts. Task-specific, so fine-tune these. Action: fine-tune (small LR).
  • Final layer: task-specific output (1000 ImageNet classes). Action: replace and train a new classification head.

Key insight: Early features (edges, textures) are universal. Deep features are task-specific. This is why transfer learning works.

Transfer Learning Strategies

Transfer Learning Strategies

Strategy 1: Feature Extraction
  • Pre-trained layers: frozen
  • New classification head: trainable
  • Fastest (only the head trains)
  • Least overfitting risk
  • Use when: small dataset, similar domain to pre-training
  • Best for: quick baseline

Strategy 2: Partial Fine-tuning
  • Top layers: small LR
  • Bottom layers: frozen
  • Moderate speed
  • Good balance of speed and quality
  • Use when: medium dataset, moderate domain shift
  • Best for: most practical cases

Strategy 3: Full Fine-tuning
  • All layers: small LR
  • New head: normal LR
  • Best performance
  • Slowest, most compute
  • Use when: large dataset, different domain
  • Best for: maximum accuracy

Decision Matrix
  • Small data + similar domain → Feature extraction
  • Small data + different domain → Fine-tune top layers
  • Large data + similar domain → Full fine-tune
  • Large data + different domain → Fine-tune or train from scratch
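The decision matrix above is small enough to write down as a lookup table. The sketch below is only an illustration: the names Strategy and choose_strategy are hypothetical, and deciding what counts as "large" data or a "similar" domain is left to the caller, since the matrix itself gives no thresholds.

```python
from enum import Enum


class Strategy(Enum):
    FEATURE_EXTRACTION = "feature extraction"
    FINE_TUNE_TOP = "fine-tune top layers"
    FULL_FINE_TUNE = "full fine-tune"
    FINE_TUNE_OR_SCRATCH = "fine-tune or train from scratch"


def choose_strategy(large_data: bool, similar_domain: bool) -> Strategy:
    """Look up the decision matrix: (data size, domain similarity) -> strategy."""
    matrix = {
        (False, True): Strategy.FEATURE_EXTRACTION,
        (False, False): Strategy.FINE_TUNE_TOP,
        (True, True): Strategy.FULL_FINE_TUNE,
        (True, False): Strategy.FINE_TUNE_OR_SCRATCH,
    }
    return matrix[(large_data, similar_domain)]


# Example: a few hundred labeled photos of everyday objects.
print(choose_strategy(large_data=False, similar_domain=True).value)
```

Treat the result as a starting point for experiments rather than a rule; in practice the boundaries between the four cells are fuzzy.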

Implementation

Example: Transfer Learning with ResNet

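import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load ResNet-50 with ImageNet weights (the older pretrained=True flag is deprecated)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Strategy 1: Feature extraction (freeze the backbone, train only a new head)
for param in model.parameters():
    param.requires_grad = False
# A freshly created layer has requires_grad=True, so only the head trains
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Strategy 2: Partial fine-tuning with discriminative learning rates
for param in model.parameters():
    param.requires_grad = False
# Unfreeze the last residual stage (layer4); the new head below is trainable by default
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Small LR for pre-trained weights, larger LR for the randomly initialized head
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-4)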
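
The two parameter groups in the optimizer above can be generalized to one learning rate per block, shrinking from the new head toward the input. The helper below is a minimal sketch of that idea, not part of PyTorch: the name layerwise_lrs and the decay factor of 0.1 are assumptions chosen for illustration.

```python
import math


def layerwise_lrs(group_names, head_lr=1e-3, decay=0.1):
    """Assign one learning rate per parameter group.

    group_names is ordered from the input side to the output side, so the
    last name (the new head) gets head_lr and each earlier group gets the
    rate of the group after it multiplied by decay.
    """
    n = len(group_names)
    return {
        name: head_lr * decay ** (n - 1 - i)
        for i, name in enumerate(group_names)
    }


# ResNet attribute names, ordered from earlier to later in the network.
lrs = layerwise_lrs(["layer3", "layer4", "fc"], head_lr=1e-3, decay=0.1)
for name, lr in lrs.items():
    print(f"{name}: {lr:.0e}")
```

Each entry can then become one optimizer group, for example {'params': getattr(model, name).parameters(), 'lr': lr}, provided the corresponding parameters have been unfrozen first.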

When to Use Transfer Learning

Transfer Learning Decision Framework

Questions to ask:

  1. How much labeled data?

    • < 100: Use pre-trained model as feature extractor + simple classifier
    • 100-10K: Fine-tune top layers
    • > 10K: Full fine-tune (or train from scratch if enough compute)

  2. How similar is the domain?

    • Similar domain (e.g., ImageNet → photos of pets or everyday objects): Feature extraction works well
    • Different domain (e.g., ImageNet → medical or satellite imagery): Fine-tune more layers
  3. What's the task complexity?

    • Simple (binary classification): Feature extraction may suffice
    • Complex (pixel-level segmentation): Full fine-tune needed
  4. What pre-trained model to use?

    • Vision: ResNet, EfficientNet, ViT
    • NLP: BERT, RoBERTa, DeBERTa
    • Multi-modal: CLIP, BLIP
    • Text generation (general-purpose LLMs): GPT, LLaMA

Catastrophic Forgetting

Catastrophic Forgetting

When fine-tuning on a new task, the model may forget what it learned during pre-training. This is especially severe when:

  • New dataset is very different from pre-training data
  • Learning rate is too high
  • Too many epochs of fine-tuning

Mitigation strategies:

  1. Small learning rate (2e-5 to 5e-5 for BERT)
  2. Early stopping based on validation loss
  3. Regularization: L2/weight decay, dropout
  4. Replay buffers: Mix a sample of pre-training data into the fine-tuning batches
  5. Gradual unfreezing: Unfreeze layers progressively
  6. EWC (Elastic Weight Consolidation): Penalize changes to important weights
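
Of the strategies above, EWC is the least self-explanatory, so a small numeric sketch may help. EWC adds a quadratic penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, to the new-task loss, where theta* are the pre-trained weights and F_i is a per-weight importance estimate (the diagonal of the Fisher information). The function name ewc_penalty and the toy numbers below are made up for illustration; in a real model the same sum would run over all parameter tensors.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    if not (len(params) == len(anchor_params) == len(fisher)):
        raise ValueError("params, anchor_params and fisher must have equal length")
    total = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )
    return 0.5 * lam * total


# Toy values: the first weight is "important" (high Fisher value),
# the second is not.
pretrained = [1.0, -2.0]
fisher = [10.0, 0.1]

# Move each weight by the same amount (0.5) and compare the penalties.
moved_important = ewc_penalty([1.5, -2.0], pretrained, fisher)
moved_unimportant = ewc_penalty([1.0, -1.5], pretrained, fisher)
print(moved_important, moved_unimportant)
```

The same-sized change costs far more on the high-importance weight, which is how the penalty steers fine-tuning toward weights the pre-trained model depends on less.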

Key Takeaways

Summary: Transfer Learning

  • Transfer learning dramatically reduces data and compute needs
  • Feature extraction: Fastest, least overfitting — use for small datasets
  • Fine-tuning: Better performance — use small LR to prevent catastrophic forgetting
  • Discriminative learning rates: Lower for early layers, higher for new layers
  • Gradual unfreezing: Unfreeze layer by layer during training
  • ImageNet pre-trained models are a strong starting point for most vision tasks
  • BERT/GPT-style pre-trained models are a strong starting point for most NLP tasks
  • Transfer learning is the default approach in modern ML
  • When to train from scratch: Very large dataset, very different domain, sufficient compute
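
The gradual unfreezing takeaway can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the helper unfreeze_top_groups is hypothetical, three small linear layers stand in for a pre-trained backbone plus a new head, and the one-group-per-epoch schedule is arbitrary.

```python
import torch.nn as nn


def unfreeze_top_groups(groups, n_unfrozen):
    """Make the last n_unfrozen groups trainable and freeze the rest.

    groups is ordered from the input side to the output side.
    """
    first_trainable = max(len(groups) - n_unfrozen, 0)
    for i, group in enumerate(groups):
        trainable = i >= first_trainable
        for p in group.parameters():
            p.requires_grad = trainable


# Toy stand-in for a pre-trained backbone (two blocks) plus a new head.
groups = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 3)]
model = nn.Sequential(groups[0], nn.ReLU(), groups[1], nn.ReLU(), groups[2])

for epoch in range(3):
    # Epoch 0: head only; epoch 1: head + last block; epoch 2: everything.
    unfreeze_top_groups(groups, n_unfrozen=epoch + 1)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"epoch {epoch}: {n_trainable} trainable parameters")
    # ... run one epoch of training here ...
```

If the optimizer was built only from the parameters that were trainable at the start, it needs to be rebuilt (or extended with optimizer.add_param_group) after each unfreeze so the newly trainable weights are actually updated.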

What to Learn Next

-> BERT Apply transfer learning in NLP.

-> GPT Architecture Explore large language models.

-> Transformers Master the foundation of modern AI.

-> CNNs Learn about computer vision models.

-> Fine-tuning LLMs Adapt large models to your specific needs.

-> Training Deep Networks Master optimizers and regularization.
