Loss Functions: Cross-Entropy, Focal Loss, Contrastive Loss — Asked at Meta & Google

🎯 The Interview Question

"Explain the mathematical formulation of cross-entropy loss for multi-class classification. How does focal loss address class imbalance, and what is the focusing parameter? What is contrastive loss and how is it used in metric learning? When would you use each of these loss functions?"

This question tests understanding of how models are optimized, which is critical at Meta (recommendation systems) and Google (search, ads).

📚 Detailed Answer

Cross-Entropy Loss

For a single sample with one-hot true label $y$ and predicted probabilities $\hat{y}$:

\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

For a batch of $N$ samples:

\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})

With logits $z$ in place of probabilities (softmax folded into the loss):

\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\sum_{c=1}^{C} y_{i,c} \log\left(\frac{e^{z_{i,c}}}{\sum_{j=1}^{C} e^{z_{i,j}}}\right)\right]

The log-sum-exp trick subtracts $\max(z)$ before exponentiating, which prevents overflow:

\log\left(\frac{e^{z_c}}{\sum_j e^{z_j}}\right) = z_c - \max(z) - \log\left(\sum_j e^{z_j - \max(z)}\right)
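The max-shifted form above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names `log_softmax` and `cross_entropy` are chosen here for the example, and labels are assumed to be integer class indices, so the one-hot sum collapses to the log-probability of the true class.

```python
import math

def log_softmax(logits):
    """Log-softmax using the max-shifted log-sum-exp form."""
    m = max(logits)
    # After shifting by the max, every exponent is <= 0, so exp() cannot overflow.
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_sum for z in logits]

def cross_entropy(batch_logits, labels):
    """Mean cross-entropy over a batch; labels are integer class indices."""
    losses = [-log_softmax(z)[y] for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# math.exp(1000.0) raises OverflowError, so a naive softmax fails on the first row.
print(cross_entropy([[1000.0, 1001.0, 999.0], [2.0, 0.5, -1.0]], [1, 0]))
```

With two equal logits the loss is $\log 2$, which is a quick sanity check for any implementation.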

💡

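Use nn.CrossEntropyLoss() in PyTorch and pass it raw logits: it applies log-softmax internally using the log-sum-exp trick. Do not apply softmax yourself first. Feeding probabilities to nn.CrossEntropyLoss applies softmax twice and distorts the loss, and taking the log of a separately computed softmax can underflow to log(0).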

Focal Loss for Class Imbalance

Standard cross-entropy treats all examples equally, which is problematic when classes are imbalanced (e.g., 99% negatives, 1% positives).

Focal Loss down-weights easy examples and focuses on hard ones:

\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)

where:

$p_t$ is the model's estimated probability for the correct class
$\alpha_t$ is the class-balancing weight: $\alpha$ for the positive (rare) class and $1 - \alpha$ for the negative class (the RetinaNet paper used $\alpha = 0.25$)
$\gamma$ is the focusing parameter (typically $\gamma = 2$)

Effect:

For well-classified examples ($p_t$ high): $(1-p_t)^\gamma$ is close to 0 → loss is strongly down-weighted
For misclassified examples ($p_t$ low): $(1-p_t)^\gamma$ is close to 1 → loss is nearly unchanged

At $\gamma = 0$, focal loss reduces to standard ($\alpha$-weighted) cross-entropy.
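To make the down-weighting concrete, the sketch below evaluates the focal loss formula for a single example and compares it with $\alpha$-weighted cross-entropy. The function names and the default values ($\gamma = 2$, $\alpha_t = 0.25$) are assumptions for illustration only.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=0.25):
    """Focal loss for one example, given the probability of its true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def weighted_ce(p_t, alpha_t=0.25):
    """Alpha-weighted cross-entropy, i.e. focal loss with gamma = 0."""
    return -alpha_t * math.log(p_t)

# The ratio equals the modulating factor (1 - p_t) ** gamma.
for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_t) / weighted_ce(p_t)
    print(f"p_t={p_t}: focal / CE = {ratio:.2f}")
```

With $\gamma = 2$, an easy example at $p_t = 0.9$ keeps only 1% of its cross-entropy loss, while a hard example at $p_t = 0.1$ keeps 81%.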

Contrastive Loss for Metric Learning

Used to learn embeddings where similar items are close and dissimilar items are far apart.

Contrastive Loss:

\mathcal{L} = y \cdot d^2 + (1-y) \cdot \max(0, m - d)^2

where:

$d$ is the distance between embeddings
$y = 1$ if similar, $0$ if dissimilar
$m$ is the margin (minimum distance for dissimilar pairs)
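A minimal sketch of the pairwise formula, assuming Euclidean distance for $d$ (the text does not fix the metric) and a hypothetical default margin of 1.0:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, y, margin=1.0):
    """y = 1 pulls a similar pair together; y = 0 pushes a dissimilar pair out to the margin."""
    d = euclidean(u, v)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1))  # similar pair, d = 0.5
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0))  # dissimilar pair inside the margin
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # dissimilar pair beyond the margin
```

Dissimilar pairs that are already farther apart than $m$ contribute zero loss, so the gradient concentrates on pairs that violate the margin.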

Triplet Loss:

\mathcal{L} = \max(0, d(a, p) - d(a, n) + m)

where:

$a$ is anchor, $p$ is positive (similar), $n$ is negative (dissimilar)
Forces positive pairs to be closer than negative pairs by margin $m$
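The triplet formula can be sketched the same way. Euclidean distance and the default margin of 0.5 are assumptions made for this example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, [3.0, 0.0]))  # easy negative: constraint satisfied
print(triplet_loss(anchor, positive, [1.2, 0.0]))  # hard negative: constraint violated
```

Only the second triplet produces a nonzero loss, which is why training usually depends on finding negatives that still violate the margin.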

NT-Xent Loss (Normalized Temperature-scaled Cross Entropy):

\mathcal{L} = -\log\left(\frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\right)

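where $\text{sim}$ is cosine similarity, $\tau$ is the temperature, and $(i, j)$ is a positive pair among the $2N$ augmented views in the batch. Used in SimCLR for self-supervised learning; MoCo uses the closely related InfoNCE loss.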
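The sketch below evaluates the NT-Xent term for a single positive pair $(i, j)$, assuming $\text{sim}$ is cosine similarity. The function names and the toy embeddings are hypothetical; a full training loss would average this term over every positive pair in both orders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(embeddings, i, j, tau=0.5):
    """NT-Xent term for the positive pair (i, j) among all embeddings in the batch."""
    # The denominator covers every k != i, including the positive j.
    scores = [cosine(embeddings[i], embeddings[k]) / tau
              for k in range(len(embeddings)) if k != i]
    positive = cosine(embeddings[i], embeddings[j]) / tau
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - positive

# Four views: indices 0 and 1 come from one item, 2 and 3 from another.
views = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(nt_xent(views, 0, 1))  # correct positive: lower loss
print(nt_xent(views, 0, 2))  # wrong "positive": higher loss
```

Lowering $\tau$ sharpens the softmax over similarities, so the most similar negatives dominate the denominator.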

Other Important Loss Functions

Label Smoothing

\mathcal{L} = -\sum_{c=1}^{C} y_c^{smooth} \log(\hat{y}_c)

where $y_c^{smooth} = (1-\epsilon)y_c + \epsilon/C$ and $\epsilon$ is typically 0.1.

Discourages overconfident predictions and often improves calibration.
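As a small worked example of the smoothing formula, the sketch below builds the smoothed target vector and plugs it into cross-entropy. The function names are chosen for this illustration, and the predicted probabilities are made up.

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Smoothed one-hot targets: (1 - eps) * y_c + eps / C for each class c."""
    return [(1 - epsilon) * (1.0 if c == true_class else 0.0) + epsilon / num_classes
            for c in range(num_classes)]

def smoothed_cross_entropy(probs, true_class, epsilon=0.1):
    """Cross-entropy against the smoothed targets."""
    targets = smooth_labels(true_class, len(probs), epsilon)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

print(smooth_labels(0, 4))  # true class gets about 0.925, each other class about 0.025
print(smoothed_cross_entropy([0.7, 0.1, 0.1, 0.1], true_class=0))
```

Because every class keeps a small target mass, driving the true-class probability toward 1 no longer minimizes the loss.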

Huber Loss (Smooth L1)

\mathcal{L}_\delta(a) = \begin{cases} 0.5 a^2 & \text{if } |a| \leq \delta \\ \delta(|a| - 0.5\delta) & \text{otherwise} \end{cases}

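Here $a$ is the residual (prediction error). Huber loss is quadratic like MSE for small errors and linear like MAE for large ones, which makes it robust to outliers. Smooth L1 is the $\delta = 1$ case; it is commonly used for bounding-box regression in object detection.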
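The piecewise definition translates directly into code. This is a minimal sketch for a single residual, with $\delta = 1$ assumed as the default:

```python
def huber(a, delta=1.0):
    """Huber loss for one residual a: quadratic inside [-delta, delta], linear outside."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)

for residual in (0.5, 1.0, 3.0):
    print(residual, huber(residual), 0.5 * residual ** 2)  # Huber vs. half squared error
```

At a residual of 3.0 the Huber loss is 2.5 while half the squared error is 4.5, which shows how the linear branch limits the influence of outliers.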

KL Divergence

\mathcal{L}_{KL} = \sum_{c} p(c) \log \frac{p(c)}{q(c)}

Used in VAEs, knowledge distillation, distribution matching.
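A minimal sketch of the discrete formula, assuming $q(c) > 0$ wherever $p(c) > 0$ and using the convention that terms with $p(c) = 0$ contribute zero. The distributions are made-up examples.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    # Terms with p(c) = 0 are skipped, since they contribute nothing by convention.
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: KL divergence is not symmetric
print(kl_divergence(p, p))  # zero when the distributions match
```

The asymmetry matters in practice: which distribution plays the role of $p$ changes which mismatches are penalized most.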

Practical Selection Guide

Follow-Up Questions

Q: Why is softmax + cross-entropy preferred over sigmoid + binary cross-entropy for multi-class? A: Softmax enforces mutual exclusivity (probabilities sum to 1), while sigmoid allows independent probabilities. Use sigmoid for multi-label, softmax for multi-class.

Q: How does focal loss handle extreme imbalance (1:10000)? A: Focal loss alone may not be enough. Combine with oversampling, undersampling, or class-balanced sampling. Adjust $\alpha$ and $\gamma$ based on imbalance ratio.

Q: What is the difference between contrastive and triplet loss? A: Contrastive loss scores one pair at a time (positive or negative) against an absolute margin. Triplet loss scores triples (anchor, positive, negative) and constrains only relative distances. That is often easier to optimize, but it requires careful mining of hard negatives.
