
Loss Functions: Cross-Entropy, Focal Loss, Contrastive Loss — Asked at Meta & Google



Premium Interview Preparation — Loss Function Mastery

🎯 The Interview Question

"Explain the mathematical formulation of cross-entropy loss for multi-class classification. How does focal loss address class imbalance, and what is the focusing parameter? What is contrastive loss and how is it used in metric learning? When would you use each of these loss functions?"

This question tests understanding of how models are optimized, which is critical for Meta (recommendation systems) and Google (search, ads).


📚 Detailed Answer

Cross-Entropy Loss

For a single sample with true label $y$ (one-hot encoded) and predicted probabilities $\hat{y}$:

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

For a batch of $N$ samples:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

With logits $z_{i,c}$ (the form that can be evaluated in a numerically stable way):

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\sum_{c=1}^{C} y_{i,c} \log\left(\frac{e^{z_{i,c}}}{\sum_{j=1}^{C} e^{z_{i,j}}}\right)\right]$$

The log-sum-exp trick prevents overflow:

$$\log\left(\frac{e^{z_c}}{\sum_j e^{z_j}}\right) = z_c - \max(z) - \log\left(\sum_j e^{z_j - \max(z)}\right)$$
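The identity above can be sketched in plain Python. This is a minimal illustration using only the standard library, not PyTorch's implementation; the function names `log_softmax` and `cross_entropy` and the sample logits are assumptions made for the example, and targets are given as integer class indices rather than one-hot vectors.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: shift by max(z) before exponentiating."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_sum for z in logits]

def cross_entropy(batch_logits, targets):
    """Mean cross-entropy over a batch; targets are integer class indices."""
    losses = [-log_softmax(z)[t] for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical logits: a direct math.exp(1000.0) would raise OverflowError,
# but the shifted form only ever exponentiates values <= 0.
batch_logits = [[1000.0, 1001.0, 999.0], [2.0, 0.5, -1.0]]
targets = [1, 0]
loss = cross_entropy(batch_logits, targets)
print(loss)
```

Because the one-hot label selects a single term of the inner sum, indexing the log-softmax output at the target class is equivalent to the full sum over $c$.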

💡

In PyTorch, prefer nn.CrossEntropyLoss(), which expects raw logits and applies log-softmax internally using the log-sum-exp trick. Do not apply softmax before passing values to it: the softmax would effectively be applied twice, and computing log(softmax(z)) by hand can underflow to log(0).

Focal Loss for Class Imbalance

Standard cross-entropy treats all examples equally, which is problematic when classes are imbalanced (e.g., 99% negatives, 1% positives).

Focal Loss down-weights easy examples and focuses on hard ones:

$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where:

  • $p_t$ is the model's estimated probability for the correct class
  • $\alpha_t$ is the balancing factor (typically $\alpha_t = 0.25$ for the rare class)
  • $\gamma$ is the focusing parameter (typically $\gamma = 2$)

Effect:

  • For well-classified examples ($p_t$ high): $(1-p_t)^\gamma$ is close to 0 → loss strongly reduced
  • For misclassified examples ($p_t$ low): $(1-p_t)^\gamma$ is close to 1 → loss nearly unchanged

At $\gamma = 0$, the modulating factor $(1-p_t)^\gamma$ equals 1 and focal loss reduces to standard ($\alpha_t$-weighted) cross-entropy.
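To make the down-weighting concrete, here is a small sketch that evaluates the focal loss formula for a single example and compares it with the $\alpha_t$-weighted cross-entropy. The function names and the sample probabilities are illustrative assumptions; the defaults follow the typical values quoted above.

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss for one example, given the probability of the correct class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def weighted_ce(p_t, alpha_t=0.25):
    """Alpha-weighted cross-entropy: the gamma = 0 special case."""
    return -alpha_t * math.log(p_t)

# Hypothetical probabilities for an easy, a borderline, and a hard example.
for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_t) / weighted_ce(p_t)
    print(f"p_t={p_t}: focal / weighted CE = {ratio:.2f}")
```

The ratio is exactly the modulating factor $(1-p_t)^\gamma$: with $\gamma = 2$, an easy example at $p_t = 0.9$ keeps 1% of its cross-entropy loss, while a hard example at $p_t = 0.1$ keeps 81%.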

Contrastive Loss for Metric Learning

Used to learn embeddings where similar items are close and dissimilar items are far apart.

Contrastive Loss:

$$\mathcal{L} = y \cdot d^2 + (1-y) \cdot \max(0, m - d)^2$$

where:

  • $d$ is the distance between embeddings
  • $y = 1$ if similar, $0$ if dissimilar
  • $m$ is the margin (minimum distance for dissimilar pairs)
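The pairwise formula can be sketched as follows. Euclidean distance is an assumption here (the text only says "distance"), and the helper names and the default margin of 1.0 are chosen purely for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, y, margin=1.0):
    """y = 1 pulls a similar pair together; y = 0 pushes a dissimilar pair
    apart until it is at least `margin` away."""
    d = euclidean(u, v)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1))  # similar but far apart: penalized
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # dissimilar and beyond the margin: no loss
```

Note that dissimilar pairs contribute nothing once they are farther apart than the margin, so only pairs inside the margin produce gradients.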

Triplet Loss:

$$\mathcal{L} = \max(0, d(a, p) - d(a, n) + m)$$

where:

  • $a$ is the anchor, $p$ is the positive (similar), $n$ is the negative (dissimilar)
  • Pushes the anchor-positive distance to be smaller than the anchor-negative distance by at least the margin $m$
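A minimal sketch of the triplet formula, again assuming Euclidean distance for $d$; the function names, the default margin of 0.2, and the sample points are illustrative assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
# Satisfied triplet: positive at distance 1, negative at distance 3.
print(triplet_loss(anchor, [0.0, 1.0], [0.0, 3.0], margin=0.5))
# Violated triplet: the negative is closer to the anchor than the positive.
print(triplet_loss(anchor, [0.0, 3.0], [0.0, 1.0], margin=0.5))
```

Triplets that already satisfy the margin yield zero loss, which is why selecting informative (hard or semi-hard) negatives matters in practice.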

NT-Xent Loss (Normalized Temperature-scaled Cross Entropy):

$$\mathcal{L} = -\log\left(\frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\right)$$

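Here $\text{sim}$ is cosine similarity, $\tau$ is the temperature, and $(i, j)$ is a positive pair among the $2N$ augmented views in a batch. Used in SimCLR for self-supervised learning; MoCo uses the closely related InfoNCE loss.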
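As a rough sketch, the NT-Xent term for a single positive pair can be computed as below, taking sim to be cosine similarity. The function names, the 0-based indices, the temperature of 0.5, and the toy embeddings are assumptions for illustration; a real training loop would average this term over all positive pairs in the batch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(embeddings, i, j, tau=0.5):
    """Loss for the positive pair (i, j); every k != i is in the denominator."""
    logits = [cosine(embeddings[i], embeddings[k]) / tau
              for k in range(len(embeddings)) if k != i]
    positive = cosine(embeddings[i], embeddings[j]) / tau
    # Stable log of the denominator via the log-sum-exp trick.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - positive

# Hypothetical batch: views 0 and 1 come from one image, 2 and 3 from another.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(nt_xent(views, 0, 1))  # true positive pair: lower loss
print(nt_xent(views, 0, 2))  # mismatched pair: higher loss
```

The loss is a cross-entropy over similarities, so it reuses the same max-shift trick as standard cross-entropy on logits.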

Other Important Loss Functions

Label Smoothing

$$\mathcal{L} = -\sum_{c=1}^{C} y_c^{smooth} \log(\hat{y}_c)$$

where $y_c^{smooth} = (1-\epsilon)y_c + \epsilon/C$ and $\epsilon$ is typically 0.1.

Prevents overconfident predictions, improves calibration.
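The smoothed targets and the resulting loss can be sketched like this. The function names and the example probabilities are assumptions for illustration; the target is an integer class index that is expanded to a one-hot vector before smoothing.

```python
import math

def smooth_labels(target, num_classes, epsilon=0.1):
    """Blend the one-hot target with a uniform distribution over classes."""
    return [(1.0 - epsilon) * (1.0 if c == target else 0.0) + epsilon / num_classes
            for c in range(num_classes)]

def smoothed_cross_entropy(probs, target, epsilon=0.1):
    """Cross-entropy between smoothed targets and predicted probabilities."""
    y = smooth_labels(target, len(probs), epsilon)
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, probs))

# With C = 4 and epsilon = 0.1, the true class gets 0.925 and the others 0.025.
print(smooth_labels(0, 4))
# Hypothetical predicted probabilities for a 4-class problem.
print(smoothed_cross_entropy([0.7, 0.1, 0.1, 0.1], target=0))
```

Since every class keeps a small non-zero target, the loss can no longer be driven to zero by pushing one probability toward 1, which is what discourages overconfidence.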

Huber Loss (Smooth L1)

$$\mathcal{L}_\delta(a) = \begin{cases} 0.5 a^2 & \text{if } |a| \leq \delta \\ \delta(|a| - 0.5\delta) & \text{otherwise} \end{cases}$$

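Here $a$ is the residual (prediction minus target). The loss is quadratic like MSE for small errors and linear like MAE for large errors, which makes it robust to outliers; Smooth L1 corresponds to $\delta = 1$. Commonly used for bounding-box regression in object detection.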
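A direct transcription of the piecewise definition, as a small sketch; the function name `huber` and the sample residuals are illustrative assumptions.

```python
def huber(a, delta=1.0):
    """Huber loss of a residual a: quadratic inside [-delta, delta], linear outside."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)

# Hypothetical residuals: a small error and an outlier.
for residual in (0.5, 3.0):
    squared = 0.5 * residual ** 2
    print(f"residual={residual}: huber={huber(residual)}, half squared error={squared}")
```

For the outlier, the Huber value grows linearly with the residual while the squared error grows quadratically, so a single bad target has less influence on the gradient. The two branches agree at $|a| = \delta$, so the loss is continuous.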

KL Divergence

$$\mathcal{L}_{KL} = \sum_{c} p(c) \log \frac{p(c)}{q(c)}$$

Used in VAEs, knowledge distillation, distribution matching.
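A short sketch of the discrete KL divergence, assuming both distributions are given as lists of probabilities over the same classes and that $q(c) > 0$ wherever $p(c) > 0$. The function name and the sample distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p(c) = 0 contribute 0."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

# Hypothetical distributions over two classes.
p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # identical distributions give 0
```

The asymmetry matters when choosing which distribution plays the role of $p$, for example the teacher's outputs in knowledge distillation.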

Practical Selection Guide

Follow-Up Questions

Q: Why is softmax + cross-entropy preferred over sigmoid + binary cross-entropy for multi-class? A: Softmax enforces mutual exclusivity (probabilities sum to 1), while sigmoid allows independent probabilities. Use sigmoid for multi-label, softmax for multi-class.

Q: How does focal loss handle extreme imbalance (1:10000)? A: Focal loss alone may not be enough. Combine with oversampling, undersampling, or class-balanced sampling. Adjust $\alpha$ and $\gamma$ based on imbalance ratio.

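Q: What is the difference between contrastive and triplet loss? A: Contrastive loss uses pairs (positive or negative) and constrains absolute distances. Triplet loss uses triplets (anchor, positive, negative) and constrains only the relative ordering of distances. Triplet loss is often more flexible, but it requires careful mining of hard negatives because most random triplets already satisfy the margin.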
