Loss Functions for Deep Learning — MSE, Cross-Entropy, Focal Loss and Beyond

Loss Functions — The Compass That Guides Neural Network Training

Loss functions quantify how wrong a model's predictions are, defining the objective that optimization algorithms minimize. Choosing the right loss function is critical for effective training and determines what the model learns.

  • Cross-Entropy Dominates — The standard loss for classification, equivalent to maximum likelihood estimation
  • Focal Loss for Imbalance — Down-weights easy examples to focus on hard, rare cases in object detection
  • Huber Loss for Robustness — Combines MSE and MAE to handle outliers in regression tasks

Loss Function Taxonomy

Definition: Types of Loss Functions

| Category | Loss Function | Use Case |
|---|---|---|
| Regression | MSE, MAE, Huber | Continuous output prediction |
| Classification | Cross-Entropy, Focal | Discrete class prediction |
| Ranking | Triplet, Contrastive | Similarity learning |
| Generative | Reconstruction, Adversarial | Data generation |
| Segmentation | Dice, IoU | Pixel-level prediction |

[Figure: Loss Function Decision Tree. Task type? Regression → MSE / MAE / Huber (outliers? → Huber; symmetric? → MSE; robust? → MAE). Classification → Cross-Entropy / Focal (balanced? → CE; imbalanced? → Focal; label smoothing?). Ranking/Similarity → Triplet / Contrastive.]

Mean Squared Error (MSE)

Definition: MSE Loss

MSE measures the average squared difference between predictions and targets:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

  • Penalizes errors quadratically: small errors contribute little, while large errors (outliers) dominate the loss
  • Differentiable everywhere with smooth gradients
  • Corresponds to maximum likelihood estimation under Gaussian errors with constant variance

In the MSE formula,

  • $y_i$ = true value for instance $i$
  • $\hat{y}_i$ = predicted value for instance $i$
  • $N$ = number of instances
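
As a rough illustration of the MSE formula and its gradient, here is a minimal NumPy sketch. The function names `mse_loss` and `mse_grad` and the sample values are assumptions made for this example, not part of any library.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean of the squared differences over all N instances."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def mse_grad(y_true, y_pred):
    """Gradient of the mean loss w.r.t. each prediction: 2 * (y_pred - y_true) / N."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 2.0 * (y_pred - y_true) / y_true.size

# Hypothetical data: the third prediction is off by 2 and dominates the loss.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 5.0])

loss = mse_loss(y_true, y_pred)   # (0.25 + 0 + 4) / 3
grad = mse_grad(y_true, y_pred)   # positive where the prediction is too high
print(loss, grad)
```

The error of 2 contributes 4 to the sum while the error of 0.5 contributes only 0.25, which is the quadratic penalty described above.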

[Figure: MSE loss landscape. L = (ŷ - y)², with gradient ∂L/∂ŷ = 2(ŷ - y): negative for ŷ < y, zero at the minimum ŷ = y, and positive for ŷ > y.]

Cross-Entropy Loss

Definition: Cross-Entropy Loss

For a single example in multi-class classification with $C$ classes:

$$\mathcal{L}_{\text{CE}} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where $y_i$ is the true label for class $i$ (one-hot) and $\hat{y}_i$ is the predicted probability for class $i$ (softmax output).

Binary Cross-Entropy (for $C = 2$), averaged over $N$ examples:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

Here $i$ indexes examples rather than classes, and $\hat{y}_i$ is the predicted probability of the positive class (sigmoid output).

Minimizing cross-entropy is equivalent to maximum likelihood estimation under a categorical distribution.
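
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the helper names, the clipping constant `eps`, and the sample logits are assumptions chosen for the example. With a one-hot target, the cross-entropy reduces to the negative log-probability of the true class, which is the maximum likelihood connection.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Per-example cross-entropy: -sum_i y_i * log(p_i)."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(y_onehot * np.log(probs), axis=-1)

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy averaged over N examples."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical logits for one example with three classes; class 0 is the true class.
logits = np.array([[2.0, 1.0, 0.1]])
y = np.array([[1.0, 0.0, 0.0]])

probs = softmax(logits)
ce = cross_entropy(y, probs)
nll = -np.log(probs[0, 0])  # negative log-likelihood of the true class

bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2]))
print(ce[0], nll, bce)
```

Subtracting the maximum logit before exponentiating leaves the softmax unchanged but avoids overflow for large logits.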

[Figure: Cross-entropy penalty vs. predicted probability ŷ (from 0 to 1). For y = 1 the loss is -log(ŷ), with a high penalty as ŷ → 0; for y = 0 the loss is -log(1 - ŷ), with a high penalty as ŷ → 1.]

Focal Loss

Definition: Focal Loss

Focal loss addresses class imbalance by down-weighting easy examples:

$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where:

  • $p_t$ is the predicted probability for the true class
  • $\alpha_t$ is the class balancing factor
  • $\gamma$ is the focusing parameter (typically $\gamma = 2$)

When $\gamma = 0$, focal loss reduces to standard cross-entropy weighted by $\alpha_t$. As $\gamma$ increases, easy examples are down-weighted more.
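
A minimal NumPy sketch of the focal loss formula, assuming `p_t` has already been computed as the probability of the true class. The function name `focal_loss`, the default `alpha_t=1.0` (no class re-weighting), and the sample probabilities are assumptions for this illustration.

```python
import numpy as np

def focal_loss(p_t, alpha_t=1.0, gamma=2.0, eps=1e-12):
    """Per-example focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.clip(np.asarray(p_t, dtype=float), eps, 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Hypothetical true-class probabilities: easy, medium, and hard examples.
p_t = np.array([0.9, 0.5, 0.1])

ce = focal_loss(p_t, gamma=0.0)   # gamma = 0 recovers plain cross-entropy
fl = focal_loss(p_t, gamma=2.0)

# The ratio is the modulating factor (1 - p_t)^2: 0.01, 0.25, 0.81
ratio = fl / ce
print(ratio)
```

With $\gamma = 2$, the easy example ($p_t = 0.9$) keeps only 1% of its cross-entropy loss, while the hard example ($p_t = 0.1$) keeps 81%.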

The term $(1 - p_t)^\gamma$ is the modulating factor: it is close to 0 for well-classified examples ($p_t \to 1$) and close to 1 for hard examples ($p_t \to 0$).

[Figure: Focal loss vs. classification confidence (p_t) for γ = 0 (CE), γ = 1, γ = 2 (default), and γ = 5. Easy examples are down-weighted; hard examples keep a high loss.]

Huber Loss

Definition: Huber Loss

Huber loss combines MSE and MAE for robustness to outliers:

$$\mathcal{L}_{\text{Huber}}(\delta) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

  • Quadratic for small errors (like MSE)
  • Linear for large errors (like MAE)
  • Parameter $\delta$ controls the transition point
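
The piecewise definition above can be sketched with `np.where`. The function name `huber_loss`, the choice `delta=1.0`, and the sample errors are assumptions for this example.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Per-example Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * err - 0.5 * delta ** 2
    return np.where(err <= delta, quadratic, linear)

# Hypothetical absolute errors of 0.5, 1.0, and 3.0 against a target of zero.
errors = np.array([0.5, 1.0, 3.0])
losses = huber_loss(errors, np.zeros_like(errors), delta=1.0)
print(losses)  # 0.125, 0.5, 2.5
```

At an error of 3.0 the Huber loss is 2.5, whereas the quadratic branch alone would give 4.5, so a single outlier pulls on the fit less strongly. The two branches meet at $|y - \hat{y}| = \delta$, where both evaluate to $\frac{1}{2}\delta^2$.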

Contrastive and Triplet Loss

Definition: Contrastive Loss

For learning embeddings where similar items are close and dissimilar items are far:

$$\mathcal{L}_{\text{contrastive}} = y \cdot d^2 + (1-y) \cdot \max(0, m - d)^2$$

where $d$ is the distance between embeddings, $y = 1$ for similar pairs ($y = 0$ for dissimilar pairs), and $m$ is the margin.
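
A small NumPy sketch of the contrastive loss for a batch of embedding pairs. Euclidean distance is assumed for $d$ (the text does not fix the metric), and the function name, margin, and sample embeddings are assumptions for this illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """Per-pair contrastive loss with Euclidean distance (assumed metric)."""
    d = np.linalg.norm(emb_a - emb_b, axis=-1)
    similar_term = y * d ** 2                                   # pull similar pairs together
    dissimilar_term = (1 - y) * np.maximum(0.0, margin - d) ** 2  # push dissimilar pairs apart
    return similar_term + dissimilar_term

# Hypothetical 2-D embeddings; pair distances are 0.5, 0.5, and 5.0.
a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1.0, 0.0, 0.0])  # 1 = similar, 0 = dissimilar

losses = contrastive_loss(a, b, y, margin=1.0)
print(losses)  # 0.25, 0.25, 0.0
```

The third pair is dissimilar and already farther apart than the margin, so it contributes no loss.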

Definition: Triplet Loss

$$\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + m)$$

where $a$ is the anchor, $p$ is the positive (same class), $n$ is the negative (different class), and $m$ is the margin.
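
The triplet loss can be sketched the same way. Euclidean distance is again an assumption for $d(\cdot, \cdot)$, and the function name and sample embeddings are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Per-triplet loss: max(0, d(a, p) - d(a, n) + margin), Euclidean distance assumed."""
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_ap - d_an + margin)

# Hypothetical 2-D embeddings for two triplets.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.0, 1.0], [0.0, 1.0]])   # d(a, p) = 1.0 for both
negative = np.array([[3.0, 4.0], [0.0, 1.5]])   # d(a, n) = 5.0 and 1.5

losses = triplet_loss(anchor, positive, negative, margin=1.0)
print(losses)  # 0.0, 0.5
```

The first triplet already satisfies the margin (the negative is far enough away), so its loss is zero; the second negative is only 0.5 farther than the positive, so the loss is 0.5.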


Loss Function Comparison

[Figure: Regression loss functions compared. MSE, MAE, and Huber loss plotted against the error (y - ŷ), each with its minimum at zero error.]

Definition: Loss Function Selection Guide

| Task | Loss Function | Why |
|---|---|---|
| Regression (clean) | MSE | Efficient, differentiable everywhere |
| Regression (outliers) | Huber or MAE | Robust to outliers |
| Binary classification | BCE | Standard for sigmoid output |
| Multi-class classification | CE | Standard for softmax output |
| Class imbalance | Focal loss | Down-weights easy examples |
| Object detection | Focal + Smooth L1 | Classification + box regression |
| Segmentation | Dice + BCE | Handles class imbalance, optimizes overlap |
| Embeddings | Triplet / Contrastive | Learning similarity metrics |
| Generation | Reconstruction + KL | VAE loss, autoencoders |

Label Smoothing

Definition: Label Smoothing

Label smoothing prevents overconfident predictions by softening one-hot targets:

$$y_i^{\text{smooth}} = (1 - \epsilon) y_i + \frac{\epsilon}{C}$$

where $\epsilon$ is the smoothing parameter (typically 0.1) and $C$ is the number of classes.

This encourages the model to be less confident, improving generalization.
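
As a quick worked example of the smoothing formula, here is a minimal NumPy sketch. The function name `smooth_labels` and the four-class target are assumptions for this illustration.

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """Soften one-hot targets: (1 - epsilon) * y + epsilon / C."""
    num_classes = y_onehot.shape[-1]
    return (1.0 - epsilon) * y_onehot + epsilon / num_classes

# Hypothetical one-hot target with C = 4 classes; class 1 is the true class.
y = np.array([[0.0, 1.0, 0.0, 0.0]])
y_smooth = smooth_labels(y, epsilon=0.1)
print(y_smooth)  # true class: 0.925, other classes: 0.025 each
```

The smoothed target still sums to 1, so it remains a valid probability distribution and can be used directly in the cross-entropy formula.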

Why Label Smoothing Works

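A softmax output can match a hard one-hot target exactly only if the gap between the true-class logit and the other logits grows without bound, so training keeps pushing logits toward infinity and the model becomes overconfident. Smoothed targets have a finite optimum, which constrains the output distribution and acts as a regularizer. Label smoothing often improves calibration and can improve accuracy on test data.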


Summary

  • MSE for regression: quadratic penalty, sensitive to outliers
  • Cross-entropy for classification: equivalent to maximum likelihood
  • Focal loss handles class imbalance by down-weighting easy examples
  • Huber loss combines MSE and MAE for robustness
  • Triplet/contrastive loss for learning embeddings
  • Choose the loss function based on your task, data distribution, and what you want to optimize

Next: Optimizers for Deep Learning
