🎯 The Interview Question
"Explain the mathematical formulation of cross-entropy loss for multi-class classification. How does focal loss address class imbalance, and what is the focusing parameter? What is contrastive loss and how is it used in metric learning? When would you use each of these loss functions?"
This question tests understanding of how models are optimized — critical for Meta (recommendation systems) and Google (search, ads).
📚 Detailed Answer
Cross-Entropy Loss
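For a single sample with true label $y$ (one-hot encoded) and predicted probabilities $\hat{y}$ over $C$ classes:
$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For a batch of $N$ samples:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$
With logits $z$ and true class index $k$ (numerically stable):
$$L = -z_k + \log \sum_{c=1}^{C} e^{z_c}$$
The log-sum-exp trick prevents overflow by subtracting the largest logit $z_{\max} = \max_c z_c$ before exponentiating:
$$\log \sum_{c=1}^{C} e^{z_c} = z_{\max} + \log \sum_{c=1}^{C} e^{z_c - z_{\max}}$$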
💡
Use nn.CrossEntropyLoss() in PyTorch, which takes raw logits, applies log-softmax internally, and uses the log-sum-exp trick. Do not apply softmax yourself before passing values to it: that applies softmax twice, and computing log(softmax(z)) by hand is numerically unstable.
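The log-sum-exp trick can be sketched in plain Python for a single sample. This is a minimal illustration rather than PyTorch's implementation; the function names `log_sum_exp` and `cross_entropy_from_logits` are hypothetical, and `target` is assumed to be the integer index of the true class.

```python
import math

def log_sum_exp(logits):
    """Stable log(sum(exp(z))): subtract the largest logit before exponentiating."""
    z_max = max(logits)
    return z_max + math.log(sum(math.exp(z - z_max) for z in logits))

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample: -z[target] + log(sum(exp(z)))."""
    return log_sum_exp(logits) - logits[target]

# A naive math.exp(1000.0) raises OverflowError; the shifted version stays finite.
extreme_loss = cross_entropy_from_logits([1000.0, 0.0, -1000.0], target=0)

# With four equal logits the prediction is uniform, so the loss is log(4).
uniform_loss = cross_entropy_from_logits([0.0, 0.0, 0.0, 0.0], target=2)
```

Shifting by the maximum leaves the result mathematically unchanged, because the shift cancels between the numerator and denominator of softmax.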
Focal Loss for Class Imbalance
Standard cross-entropy treats all examples equally, which is problematic when classes are imbalanced (e.g., 99% negatives, 1% positives).
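Focal Loss down-weights easy examples and focuses on hard ones:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where:
- $p_t$ is the model's estimated probability for the correct class
- $\alpha_t$ is the balancing factor (typically $\alpha = 0.25$ for rare class)
- $\gamma$ is the focusing parameter (typically $\gamma = 2$)
Effect:
- For well-classified examples ($p_t$ high): $(1 - p_t)^{\gamma}$ is small → reduced loss
- For misclassified examples ($p_t$ low): $(1 - p_t)^{\gamma}$ is close to 1 → nearly full loss
At $\gamma = 0$, focal loss reduces to standard ($\alpha$-weighted) cross-entropy.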
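To make the down-weighting concrete, here is a minimal per-example sketch, assuming the common form $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. The function names and the defaults `alpha_t=0.25` and `gamma=2.0` are illustrative assumptions, not values prescribed by this text.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for one example, given the true-class probability."""
    return -math.log(p_t)

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by alpha_t * (1 - p_t)**gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Ratio of focal loss to cross-entropy with alpha_t = 1, i.e. the modulating
# factor (1 - p_t)**gamma on its own, for an easy, a borderline, and a hard example.
scaling = {p_t: focal_loss(p_t, alpha_t=1.0) / cross_entropy(p_t)
           for p_t in (0.9, 0.5, 0.1)}
```

With `gamma=2.0`, an easy example at `p_t = 0.9` keeps about 1% of its cross-entropy loss, while a hard example at `p_t = 0.1` keeps about 81%.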
Contrastive Loss for Metric Learning
Used to learn embeddings where similar items are close and dissimilar items are far apart.
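Contrastive Loss:
$$L = y \, d^2 + (1 - y) \max(0, m - d)^2$$
where:
- $d = \lVert f(x_1) - f(x_2) \rVert_2$ is the distance between embeddings
- $y = 1$ if similar, $y = 0$ if dissimilar
- $m$ is the margin (minimum distance for dissimilar pairs)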
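A minimal sketch of the pairwise contrastive loss follows. It assumes Euclidean distance, a boolean `similar` flag in place of a numeric label, and no 1/2 scaling factor; the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = euclidean(u, v)
    if similar:
        return d ** 2                      # any distance between similar items is penalized
    return max(0.0, margin - d) ** 2       # dissimilar items are penalized only inside the margin

far_similar = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True)
far_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False)
```

A dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort goes to the pairs that still violate the margin.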
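Triplet Loss:
$$L = \max(0, \, d(a, p) - d(a, n) + m)$$
where:
- $a$ is anchor, $p$ is positive (similar), $n$ is negative (dissimilar)
- Forces the anchor-positive distance $d(a, p)$ to be smaller than the anchor-negative distance $d(a, n)$ by at least the margin $m$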
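The triplet objective can be sketched the same way. This example assumes Euclidean distance and a default margin of 1.0; both choices and the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [1.0, 0.0]

# Easy negative: already more than `margin` farther away than the positive.
easy = triplet_loss(anchor, positive, negative=[3.0, 0.0])

# Hard negative: only slightly farther than the positive, so the margin is violated.
hard = triplet_loss(anchor, positive, negative=[1.5, 0.0])
```

Easy triplets produce zero loss and therefore zero gradient, which is why mining hard negatives matters in practice.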
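NT-Xent Loss (Normalized Temperature-scaled Cross Entropy):
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k) / \tau)}$$
where $\mathrm{sim}$ is cosine similarity, $\tau$ is the temperature, and $(i, j)$ is a positive pair among the $2N$ augmented views in the batch.
Used in SimCLR for self-supervised learning; MoCo uses the closely related InfoNCE loss.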
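As a rough illustration, NT-Xent can be written as a cross-entropy over cosine similarities. The sketch below makes several assumptions that are not from the text: a batch of $2N$ embeddings in which `embeddings[i]` and `embeddings[i + N]` are the two views of the same sample, a default temperature of 0.5, and hypothetical function names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(embeddings, temperature=0.5):
    """Mean NT-Xent loss over all 2N views (assumed pairing: i <-> i + N)."""
    total = len(embeddings)
    n = total // 2
    loss = 0.0
    for i in range(total):
        j = (i + n) % total  # index of the positive view for i
        # Similarities to every other view; the view itself (k == i) is excluded.
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(total) if k != i]
        positive = cosine(embeddings[i], embeddings[j]) / temperature
        # Log of the denominator, computed with the log-sum-exp shift for stability.
        shift = max(logits)
        log_denominator = shift + math.log(sum(math.exp(s - shift) for s in logits))
        loss += log_denominator - positive
    return loss / total

# Views of the same sample point in similar directions: low loss.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])

# The second views are swapped, so each positive pair is nearly orthogonal: higher loss.
misaligned = nt_xent([[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.1]])
```

Every other view in the batch acts as a negative, so the number of negatives grows with batch size without any explicit mining.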
Other Important Loss Functions
Label Smoothing
$$y_c^{LS} = y_c (1 - \epsilon) + \frac{\epsilon}{C}$$
where $C$ is the number of classes and $\epsilon$ is typically 0.1.
Prevents overconfident predictions, improves calibration.
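A small worked example of smoothing a one-hot target, assuming the common formulation that moves $\epsilon$ of the probability mass to a uniform distribution over all classes. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Blend a one-hot target with a uniform distribution over all classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]

# Four classes, true class at index 1: the target becomes
# roughly [0.025, 0.925, 0.025, 0.025] and still sums to 1.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
```

In PyTorch, the same effect is available through the `label_smoothing` argument of `nn.CrossEntropyLoss`, so the targets usually do not need to be smoothed by hand.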
Huber Loss (Smooth L1)
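$$L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{1}{2} \delta \right) & \text{otherwise} \end{cases}$$
where $r = y - \hat{y}$ is the residual and $\delta$ is the threshold between the two regimes.
Behaves like MSE for small errors and like MAE for large errors, which makes it robust to outliers. Used in object detection for bounding-box regression.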
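The two regimes are easy to see in a scalar sketch. The function name and the default threshold `delta=1.0` are illustrative assumptions.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for residuals up to `delta`, linear beyond it."""
    r = abs(y_true - y_pred)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)

small_error = huber_loss(0.0, 0.5)    # quadratic regime: 0.5 * 0.5**2
outlier = huber_loss(0.0, 10.0)       # linear regime: 1.0 * (10.0 - 0.5)
squared_outlier = 0.5 * 10.0 ** 2     # the same residual under a halved squared error
```

The outlier contributes 9.5 under Huber versus 50.0 under the halved squared error, so a single bad target has far less influence on the gradient.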
KL Divergence
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Used in VAEs, knowledge distillation, distribution matching.
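For discrete distributions, KL divergence is a short sum. The sketch below assumes both inputs are probability lists of equal length and that `q` is nonzero wherever `p` is nonzero; the function name is hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P)
```

The two directions give different values, so KL divergence is not symmetric; which distribution goes first matters when it is used as a loss.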
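Practical Selection Guide
- Cross-entropy: standard multi-class classification with roughly balanced classes
- Focal loss: classification with heavy class imbalance
- Contrastive / triplet loss: metric learning, where embeddings of similar items should be close
- NT-Xent: self-supervised contrastive learning
- Label smoothing: when predictions are overconfident or poorly calibrated
- Huber loss: regression with outliers
- KL divergence: VAEs, knowledge distillation, distribution matching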
Follow-Up Questions
Q: Why is softmax + cross-entropy preferred over sigmoid + binary cross-entropy for multi-class? A: Softmax enforces mutual exclusivity (probabilities sum to 1), while sigmoid allows independent probabilities. Use sigmoid for multi-label, softmax for multi-class.
Q: How does focal loss handle extreme imbalance (1:10000)? A: Focal loss alone may not be enough. Combine with oversampling, undersampling, or class-balanced sampling. Adjust $\alpha$ and $\gamma$ based on the imbalance ratio.
Q: What is the difference between contrastive and triplet loss? A: Contrastive loss uses pairs (positive or negative) and constrains absolute distances. Triplet loss uses triplets (anchor, positive, negative) and constrains only the relative ordering of distances. Triplet loss is often more stable, but it requires careful mining of hard negatives.
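For the softmax versus sigmoid question above, a short numeric sketch shows the difference between coupled and independent probabilities. The logits are made-up values chosen for illustration.

```python
import math

def softmax(logits):
    """Softmax with the max-shift for numerical stability."""
    z_max = max(logits)
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid of a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 2.0, -1.0]  # hypothetical scores for three classes

softmax_probs = softmax(logits)                # classes compete: probabilities sum to 1
sigmoid_probs = [sigmoid(z) for z in logits]   # each class is scored independently
```

Softmax splits the mass between the two tied classes, while sigmoid assigns each of them a high probability, which is the behavior a multi-label problem needs.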