Loss Functions — The Compass That Guides Neural Network Training
Loss functions quantify how wrong a model's predictions are, defining the objective that optimization algorithms minimize. Choosing the right loss function is critical for effective training and determines what the model learns.
- Cross-Entropy Dominates — The standard loss for classification, equivalent to maximum likelihood estimation
- Focal Loss for Imbalance — Down-weights easy examples to focus on hard, rare cases in object detection
- Huber Loss for Robustness — Combines MSE and MAE to handle outliers in regression tasks
Loss Functions for Deep Learning — MSE, Cross-Entropy, Focal Loss and Beyond
Loss Function Taxonomy
Definition: Types of Loss Functions
| Category | Loss Function | Use Case |
|---|---|---|
| Regression | MSE, MAE, Huber | Continuous output prediction |
| Classification | Cross-Entropy, Focal | Discrete class prediction |
| Ranking | Triplet, Contrastive | Similarity learning |
| Generative | Reconstruction, Adversarial | Data generation |
| Segmentation | Dice, IoU | Pixel-level prediction |
Mean Squared Error (MSE)
Definition: MSE Loss
MSE measures the average squared difference between predictions and targets:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here,
- $y_i$ = true value for instance $i$
- $\hat{y}_i$ = predicted value for instance $i$
- $n$ = number of instances

Key properties:
- Penalizes errors quadratically, so small errors contribute little while large errors (outliers) dominate the loss
- Differentiable everywhere with smooth gradients
- Corresponds to maximum likelihood estimation under Gaussian errors with constant variance
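The outlier sensitivity of MSE can be illustrated with a small pure-Python sketch. The helper names `mse` and `mae` and the toy data are illustrative assumptions, not part of the original text; MAE is included only as a point of comparison.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


targets = [1.0, 2.0, 3.0, 4.0]
spread_preds = [1.5, 2.5, 3.5, 4.5]   # four errors of 0.5 each
outlier_preds = [1.0, 2.0, 3.0, 6.0]  # a single error of 2.0

# Both prediction sets have the same MAE (0.5), but MSE differs:
mse_spread = mse(targets, spread_preds)    # 0.25
mse_outlier = mse(targets, outlier_preds)  # 1.0
```

The total absolute error is identical in both cases, yet concentrating it in one prediction makes the MSE four times larger, which is the quadratic penalty at work.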
Cross-Entropy Loss
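Definition: Cross-Entropy Loss
For multi-class classification with $C$ classes:

$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log(p_c)$$

where $y_c$ is the true label (one-hot) and $p_c$ is the predicted probability (softmax output) for class $c$.
Binary Cross-Entropy (for $C = 2$), with label $y \in \{0, 1\}$ and predicted probability $p$ of the positive class:

$$\mathcal{L}_{\text{BCE}} = -\bigl[\, y \log(p) + (1 - y) \log(1 - p) \,\bigr]$$

Minimizing cross-entropy is equivalent to maximum likelihood estimation under a categorical distribution: with a one-hot target, the loss is the negative log-probability the model assigns to the true class.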
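A minimal sketch of cross-entropy on softmax outputs, using only the standard library. The function names, the `eps` clamp that guards against `log(0)`, and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, eps=1e-12):
    """Negative sum over classes of y_c * log(p_c)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


def binary_cross_entropy(y, p, eps=1e-12):
    """Two-class special case with a single probability p for the positive class."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


logits = [2.0, 1.0, 0.1]   # hypothetical model scores for 3 classes
probs = softmax(logits)
target = [1, 0, 0]         # class 0 is the true class
loss = cross_entropy(target, probs)

# With a one-hot target, the loss is just the negative log-probability
# of the true class, i.e. the negative log-likelihood.
nll_true_class = -math.log(probs[0])
```

Because only the true-class term survives the sum, `loss` and `nll_true_class` agree, which is the maximum likelihood connection in concrete form.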
Focal Loss
Definition: Focal Loss
Focal loss addresses class imbalance by down-weighting easy examples:

$$\text{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

Here,
- $p_t$ = predicted probability for the true class
- $\alpha_t$ = class balancing factor
- $\gamma$ = focusing parameter (typically $\gamma = 2$)
- $(1 - p_t)^{\gamma}$ = modulating factor

When $\gamma = 0$, the modulating factor equals 1 and focal loss reduces to $\alpha_t$-weighted cross-entropy (standard cross-entropy when $\alpha_t = 1$). As $\gamma$ increases, easy examples (those with $p_t$ close to 1) are down-weighted more.
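The effect of the modulating factor can be sketched for a single example as follows. This assumes the common form of focal loss, `-alpha_t * (1 - p_t)**gamma * log(p_t)`; the function names, the default `alpha_t=1.0`, and the probabilities 0.9 and 0.1 are illustrative choices.

```python
import math


def focal_loss(p_t, alpha_t=1.0, gamma=2.0, eps=1e-12):
    """Focal loss for one example, given the probability of its true class."""
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy_term(p_t, eps=1e-12):
    """Plain cross-entropy for one example: -log(p_t)."""
    return -math.log(max(p_t, eps))


easy, hard = 0.9, 0.1  # confident-and-correct vs. badly misclassified

# Ratio of focal loss to cross-entropy is the modulating factor (1 - p_t)^gamma.
ratio_easy = focal_loss(easy) / cross_entropy_term(easy)  # about 0.01
ratio_hard = focal_loss(hard) / cross_entropy_term(hard)  # about 0.81
```

With a focusing parameter of 2, the easy example keeps roughly 1% of its cross-entropy loss while the hard example keeps about 81%, so hard examples dominate the gradient.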
Huber Loss
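Definition: Huber Loss
Huber loss combines MSE and MAE for robustness to outliers. With residual $r = y - \hat{y}$ and threshold $\delta > 0$:

$$L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$$

- Quadratic for small errors (like MSE)
- Linear for large errors (like MAE)
- The parameter $\delta$ controls the transition point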
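A minimal sketch of the piecewise behavior, assuming the common convention of `0.5 * r**2` for residuals up to the threshold and `delta * (|r| - 0.5 * delta)` beyond it. The name `huber` and the default `delta=1.0` are illustrative assumptions.

```python
def huber(y_true, y_pred, delta=1.0):
    """Huber loss for one prediction: quadratic near zero, linear in the tails."""
    r = abs(y_true - y_pred)
    if r <= delta:
        return 0.5 * r * r            # MSE-like region
    return delta * (r - 0.5 * delta)  # MAE-like region


small = huber(0.0, 0.5)      # 0.5 * 0.25 = 0.125
large = huber(0.0, 3.0)      # 1.0 * (3.0 - 0.5) = 2.5
squared = 0.5 * 3.0 ** 2     # the purely quadratic penalty would be 4.5
at_threshold = huber(0.0, 1.0)  # both branches give 0.5 here
```

The two branches meet at the threshold, so the loss stays continuous, and a residual of 3.0 costs 2.5 rather than the 4.5 a purely quadratic penalty would charge.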
Contrastive and Triplet Loss
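Definition: Contrastive Loss
For learning embeddings where similar items are close and dissimilar items are far apart:

$$\mathcal{L} = y \, d^2 + (1 - y) \, \max(0,\; m - d)^2$$

where $d$ is the distance between embeddings, $y = 1$ for similar pairs (and $y = 0$ for dissimilar pairs), and $m$ is the margin.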
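A small sketch of contrastive loss for one pair of embeddings. It assumes Euclidean distance and the form "squared distance for similar pairs, squared hinge on the margin for dissimilar pairs"; some formulations add a factor of one half. All names and vectors here are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = euclidean(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


origin = [0.0, 0.0]
similar_far = contrastive_loss(origin, [3.0, 4.0], similar=True)        # d = 5, loss 25.0
dissimilar_far = contrastive_loss(origin, [3.0, 4.0], similar=False)    # beyond margin, loss 0.0
dissimilar_close = contrastive_loss(origin, [0.0, 0.5], similar=False)  # (1 - 0.5)^2 = 0.25
```

Dissimilar pairs that are already farther apart than the margin contribute nothing, so training effort goes to the pairs that still violate it.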
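Definition: Triplet Loss

$$\mathcal{L} = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)$$

where $a$ is the anchor, $p$ is the positive (same class), $n$ is the negative (different class), $d(\cdot, \cdot)$ is the distance between embeddings, and $m$ is the margin.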
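Triplet loss can be sketched for a single (anchor, positive, negative) triplet as below. Euclidean distance, the hinge form `max(0, d(a, p) - d(a, n) + margin)`, and the toy vectors are assumptions for illustration; some variants use squared distances instead.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

satisfied = triplet_loss(anchor, positive=near, negative=far)  # max(0, 1 - 5 + 1) = 0.0
violated = triplet_loss(anchor, positive=far, negative=near)   # max(0, 5 - 1 + 1) = 5.0
```

Only triplets that violate the margin produce a nonzero loss, which is why practical training pipelines often spend effort on selecting informative triplets.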
Loss Function Comparison
Definition: Loss Function Selection Guide
| Task | Loss Function | Why |
|---|---|---|
| Regression (clean) | MSE | Efficient, differentiable everywhere |
| Regression (outliers) | Huber or MAE | Robust to outliers |
| Binary classification | BCE | Standard for sigmoid output |
| Multi-class classification | CE | Standard for softmax output |
| Class imbalance | Focal loss | Down-weights easy examples |
| Object detection | Focal + Smooth L1 | Classification + box regression |
| Segmentation | Dice + BCE | Handles class imbalance, optimizes overlap |
| Embeddings | Triplet / Contrastive | Learning similarity metrics |
| Generation | Reconstruction + KL | VAE loss, autoencoders |
Label Smoothing
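Definition: Label Smoothing
Label smoothing prevents overconfident predictions by softening one-hot targets:

$$y_c^{\text{LS}} = (1 - \epsilon) \, y_c + \frac{\epsilon}{C}$$

where $y_c$ is the original one-hot target for class $c$, $\epsilon$ is the smoothing parameter (typically 0.1), and $C$ is the number of classes.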
This encourages the model to be less confident, improving generalization.
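A minimal sketch of how smoothed targets are built. It assumes the variant that mixes the one-hot vector with a uniform distribution over all classes, including the true one; the name `smooth_labels` and the four-class example are illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


hard_target = [0, 0, 1, 0]
smoothed = smooth_labels(hard_target, epsilon=0.1)
# True class: 0.9 + 0.1 / 4, roughly 0.925; every other class: 0.1 / 4 = 0.025
```

The smoothed vector still sums to 1, so it remains a valid target distribution for cross-entropy.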
Why Label Smoothing Works
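Hard one-hot targets can be matched exactly only if the true-class logit exceeds all others by an infinite margin, so training keeps pushing logits apart and the model becomes overconfident. Label smoothing gives the loss a finite optimum, which constrains the output distribution and acts as a regularizer. It tends to improve calibration and often improves test accuracy.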
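The argument can be checked numerically with a small sketch. The setup is an assumption chosen for illustration: three classes, logits of the form `[gap, 0, 0]` so that `gap` measures how far the true class leads, and a smoothing parameter of 0.1.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(target, logits):
    """Cross-entropy between a (possibly smoothed) target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def loss_at_gap(target, gap):
    """Loss when the true-class logit leads the other two by `gap`."""
    return soft_cross_entropy(target, [gap, 0.0, 0.0])


hard = [1.0, 0.0, 0.0]
soft = [0.9 + 0.1 / 3, 0.1 / 3, 0.1 / 3]  # label smoothing with epsilon = 0.1, 3 classes

gaps = [1.0, 2.0, 4.0, 8.0, 16.0]
hard_losses = [loss_at_gap(hard, g) for g in gaps]  # keeps shrinking as the gap grows
soft_losses = [loss_at_gap(soft, g) for g in gaps]  # smallest at a moderate gap, then rises
```

With hard targets the loss keeps falling as the gap grows, so there is always an incentive to increase the logits further. With smoothed targets the loss bottoms out at a finite gap and grows again beyond it, which penalizes extreme confidence.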
Summary
- MSE for regression: quadratic penalty, sensitive to outliers
- Cross-entropy for classification: equivalent to maximum likelihood
- Focal loss handles class imbalance by down-weighting easy examples
- Huber loss combines MSE and MAE for robustness
- Triplet/contrastive loss for learning embeddings
- Choose the loss function based on your task, data distribution, and what you want to optimize