🎯 The Interview Question
"Explain the difference between semantic segmentation, instance segmentation, and panoptic segmentation. How does U-Net achieve precise segmentation with skip connections? How does Mask R-CNN extend Faster R-CNN for instance segmentation? What are the modern approaches using Transformers for segmentation?"
This question tests understanding of dense prediction tasks, which are central to products at Google (Maps, Photos) and Meta (AR effects).
📚 Detailed Answer
Segmentation Types
| Type | Output | Example |
|---|---|---|
| Semantic | Class per pixel | "This is a car" (all car pixels) |
| Instance | Class + instance per pixel | "This is car #1, this is car #2" |
| Panoptic | Class per pixel, plus instance ID for countable objects | "This is road, this is sky, this is car #1, this is car #2" |
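Formally, for an image with pixel set $\Omega$ and $C$ classes:
- Semantic: $f: \Omega \to \{1, \dots, C\}$ (one class label per pixel)
- Instance: $f: \Omega \to \{1, \dots, C\} \times \mathbb{N}$ on object pixels (a class label plus an instance ID)
- Panoptic: Combination of both; every pixel gets a class label, and pixels of countable classes also get an instance ID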
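To make the three output types concrete, here is a toy label-map sketch in NumPy. The class IDs and the `class_id * 1000 + instance_id` encoding are assumptions chosen for illustration (it is one common way to pack both values into a single integer), not something the definitions above prescribe.

```python
import numpy as np

# Toy 2x4 image: class 0 = road (stuff), class 1 = car (thing).
semantic = np.array([[0, 1, 1, 0],
                     [0, 1, 0, 1]])

# Instance IDs: 0 means "no instance"; the two separate cars get IDs 1 and 2.
instance = np.array([[0, 1, 1, 0],
                     [0, 1, 0, 2]])

# Hypothetical panoptic encoding: class_id * 1000 + instance_id.
panoptic = semantic * 1000 + instance

num_semantic_labels = len(np.unique(semantic))    # road, car
num_panoptic_segments = len(np.unique(panoptic))  # road, car #1, car #2
```

The semantic map cannot tell the two cars apart, while the panoptic map keeps them as separate segments and still labels the road pixels.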
U-Net: Encoder-Decoder with Skip Connections
Architecture
Encoder (Contracting Path):
Input → Conv→ReLU→Conv→ReLU → MaxPool
Conv→ReLU→Conv→ReLU → MaxPool
Conv→ReLU→Conv→ReLU → MaxPool
Conv→ReLU→Conv→ReLU → MaxPool
Conv→ReLU→Conv→ReLU (Bottleneck)
Decoder (Expanding Path):
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
Conv 1×1 → Output
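The layout above can be sketched in PyTorch as follows. This is a reduced two-level version (the layout above has four levels) with padded convolutions so that skip features and upsampled features have matching sizes; the original U-Net used unpadded convolutions and cropped the skip features instead. The names `TinyUNet`, `double_conv`, and `base` are hypothetical.

```python
import torch
import torch.nn as nn


def double_conv(in_ch, out_ch):
    # Two 3x3 conv + ReLU layers; padding=1 keeps the spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    """Two-level U-Net sketch with concatenating skip connections."""

    def __init__(self, in_ch=1, num_classes=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)
        self.enc2 = double_conv(base, base * 2)
        self.bottleneck = double_conv(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)  # upsampled + skip channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                   # full resolution
        s2 = self.enc2(self.pool(s1))       # 1/2 resolution
        b = self.bottleneck(self.pool(s2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)                # per-pixel class logits


logits = TinyUNet()(torch.randn(1, 1, 64, 64))
```

The output has one logit map per class at the input resolution, so an `argmax` over the channel dimension gives the predicted label map.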
Skip Connections
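Concatenate encoder features with decoder features along the channel dimension:
$$F_{\text{out}} = \text{Conv}\big([F_{\text{up}};\, F_{\text{skip}}]\big)$$
where $F_{\text{up}}$ is the upsampled decoder feature and $F_{\text{skip}}$ is the encoder (skip) feature at the same resolution.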
Why skip connections help:
- Preserve spatial details lost during downsampling
- Enable precise localization
- Help gradient flow
U-Net Loss Function
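A common choice for U-Net-style models combines cross-entropy and Dice loss (the original U-Net paper used a weighted pixel-wise cross-entropy):
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{Dice}}, \qquad \mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}$$
where $p_i$ is the predicted probability and $g_i$ the ground-truth label at pixel $i$, and $\lambda$ is a weighting factor (often 1).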
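A minimal sketch of such a combined loss in PyTorch is shown below. The function name `ce_dice_loss`, the `dice_weight` parameter, and the choice to average Dice over classes are assumptions of this example; implementations differ in how they smooth and reduce the Dice term.

```python
import torch
import torch.nn.functional as F


def ce_dice_loss(logits, target, dice_weight=1.0, eps=1e-6):
    """logits: [B, C, H, W] raw scores; target: [B, H, W] integer class IDs."""
    ce = F.cross_entropy(logits, target)

    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    # Soft Dice per class, summed over batch and spatial dimensions.
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)

    return ce + dice_weight * (1 - dice.mean())
```

Cross-entropy gives smooth per-pixel gradients, while the Dice term measures region overlap and is less dominated by a large background class.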
💡 U-Net was designed for medical imaging where training data is scarce. The skip connections help recover fine details, and data augmentation is crucial. Modern variants use attention gates and dense connections.
Mask R-CNN
Extends Faster R-CNN for instance segmentation:
Architecture
Input → Backbone → FPN → RPN → ROI Align
                                   ↓
            ┌──────────────────────┼──────────────────────┐
            ↓                      ↓                      ↓
     Classification         Box Regression            Mask Head
            ↓                      ↓                      ↓
          Class               Refined Box            Binary Mask
Key Innovation: ROI Align
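ROI Pooling rounds ROI boundaries and bin edges to integer feature-map coordinates. ROI Align keeps them as floating-point values and uses bilinear interpolation over the four nearest feature-map cells to avoid quantization errors:
$$f(x, y) = (1-\Delta x)(1-\Delta y)\, f(x_0, y_0) + \Delta x (1-\Delta y)\, f(x_1, y_0) + (1-\Delta x)\Delta y\, f(x_0, y_1) + \Delta x\, \Delta y\, f(x_1, y_1)$$
where $x_0 = \lfloor x \rfloor$, $x_1 = x_0 + 1$, $\Delta x = x - x_0$, and likewise for $y$.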
This preserves spatial alignment, crucial for pixel-accurate masks.
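The difference between snapping to a grid cell and interpolating can be seen in a small NumPy sketch. `bilinear_sample` is a hypothetical helper for one sampling point, assuming the coordinates lie inside the feature map; a full ROI Align layer takes several such samples per output bin and averages (or max-pools) them.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a continuous location (y, x)."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feature[y0, x0]
            + (1 - dy) * dx * feature[y0, x1]
            + dy * (1 - dx) * feature[y1, x0]
            + dy * dx * feature[y1, x1])


feature = np.arange(16, dtype=float).reshape(4, 4)  # feature[y, x] = 4*y + x

# Quantized lookup snaps (1.5, 2.25) to cell (1, 2); interpolation does not.
snapped = feature[int(1.5), int(2.25)]         # 6.0
aligned = bilinear_sample(feature, 1.5, 2.25)  # 8.25
```

Because this toy feature map is linear in `x` and `y`, the interpolated value equals the exact value `4 * 1.5 + 2.25`, whereas the snapped lookup is off by more than two units.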
Mask Head
Small FCN applied to each ROI:
import torch
import torch.nn as nn


class MaskHead(nn.Module):
    """Per-ROI mask head: four 3x3 convs, 2x upsampling, per-class mask logits."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv3 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv4 = nn.Conv2d(256, 256, 3, padding=1)
        # Transposed conv doubles the spatial resolution (e.g. 14x14 -> 28x28).
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        # 1x1 conv emits one mask logit map per class.
        self.mask = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = torch.relu(self.conv4(x))
        x = torch.relu(self.deconv(x))
        return self.mask(x)  # raw logits; apply sigmoid for mask probabilities
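The head outputs one mask per class for every ROI, and only the channel of the ROI's class is used. The sketch below shows that selection step on random stand-in tensors; the shapes, labels, and variable names are assumptions for illustration rather than the output of a trained model.

```python
import torch
import torch.nn.functional as F

num_rois, num_classes = 4, 80
mask_logits = torch.randn(num_rois, num_classes, 28, 28)    # stand-in for head output
labels = torch.tensor([3, 17, 17, 42])                      # class per ROI (hypothetical)
gt_masks = torch.randint(0, 2, (num_rois, 28, 28)).float()  # toy ground-truth masks

# For each ROI, keep only the mask channel of its class.
selected = mask_logits[torch.arange(num_rois), labels]      # [num_rois, 28, 28]

# Per-pixel sigmoid + binary cross-entropy: classes do not compete per pixel.
mask_loss = F.binary_cross_entropy_with_logits(selected, gt_masks)

# At inference, threshold the probability of the predicted class's channel.
binary_masks = selected.sigmoid() > 0.5
```

During training the class comes from the ground truth; at inference it comes from the classification branch. Decoupling mask prediction from classification in this way is the reason the head uses a sigmoid per class instead of a softmax across classes.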
Panoptic Segmentation
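Combines semantic and instance segmentation: every pixel receives a class label, and pixels that belong to thing classes also receive an instance ID.
- Stuff classes: amorphous regions (sky, grass, road)
- Thing classes: countable objects (person, car, dog)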
Panoptic FPN
Unified architecture for both tasks:
Backbone → FPN → Semantic Head (per-pixel class)
→ Instance Head (Mask R-CNN)
→ Panoptic Fusion
Fusion algorithm:
- Predict semantic segmentation
- Predict instance segmentation
- For overlapping instances, assign each contested pixel to the instance with the higher predicted confidence
- Assign stuff regions to unassigned pixels
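The fusion steps above can be sketched as a greedy paste in NumPy. The function `fuse_panoptic` and its argument names are hypothetical, and the sketch omits refinements that real pipelines typically add, such as dropping low-score instances or very small stuff regions.

```python
import numpy as np


def fuse_panoptic(semantic, inst_masks, inst_classes, inst_scores, stuff_ids):
    """Greedy fusion sketch. Returns (class_map, instance_id_map)."""
    h, w = semantic.shape
    class_map = np.full((h, w), -1, dtype=int)  # -1 = unassigned
    id_map = np.zeros((h, w), dtype=int)        # 0 = no instance

    # Paste instances from most to least confident; earlier ones win overlaps.
    order = np.argsort(inst_scores)[::-1]
    for new_id, i in enumerate(order, start=1):
        free = inst_masks[i] & (class_map == -1)
        class_map[free] = inst_classes[i]
        id_map[free] = new_id

    # Fill the remaining pixels with stuff predictions from the semantic head.
    stuff = (class_map == -1) & np.isin(semantic, stuff_ids)
    class_map[stuff] = semantic[stuff]
    return class_map, id_map


semantic = np.zeros((2, 4), dtype=int)  # semantic head says "road" (class 0) everywhere
masks = np.array([[[1, 1, 0, 0], [0, 0, 0, 0]],
                  [[0, 1, 1, 0], [0, 0, 0, 0]]], dtype=bool)
class_map, id_map = fuse_panoptic(semantic, masks, [1, 1], [0.9, 0.6], stuff_ids=[0])
```

In this toy case the two car masks overlap at one pixel; the instance with score 0.9 keeps it, the second instance keeps only its non-overlapping pixel, and everything else becomes road.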
Transformer-Based Segmentation
SegFormer
Encoder-decoder with Transformer:
- Encoder: Hierarchical Transformer (Mix Transformer, MiT) that outputs multi-scale features, similar in spirit to Swin
- Decoder: Lightweight MLP that aggregates multi-scale features
Advantages:
- Global context (attention captures long-range dependencies)
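- No explicit positional encoding needed (a 3×3 depthwise convolution inside the feed-forward layers supplies position information), so test resolutions can differ from the training resolution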
- Efficient inference
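The lightweight decoder idea can be sketched as follows: project each encoder stage to a common width, upsample everything to the highest-resolution stage, concatenate, and classify. The class name `MLPDecoder`, the channel widths, and the fake encoder features are assumptions for illustration; this is not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPDecoder(nn.Module):
    """All-MLP decoder sketch: project, upsample, concatenate, classify."""

    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=128, num_classes=19):
        super().__init__()
        # A 1x1 conv acts as a per-pixel linear layer.
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_dims])
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]  # highest-resolution stage (1/4 scale)
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(F.relu(self.fuse(torch.cat(ups, dim=1))))


# Fake encoder outputs at 1/4, 1/8, 1/16, 1/32 of a 128x128 input.
feats = [torch.randn(1, c, s, s) for c, s in zip((32, 64, 160, 256), (32, 16, 8, 4))]
out = MLPDecoder()(feats)
```

The decoder has no spatial convolutions larger than 1×1, which is why it stays cheap; the logits come out at 1/4 resolution and are usually upsampled to the input size afterwards.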
MaskFormer
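Reframes segmentation as set prediction: instead of classifying each pixel, the model predicts a fixed-size set of $N$ pairs, each a class distribution and a binary mask:
$$\{(p_i, m_i)\}_{i=1}^{N}, \qquad p_i \in \Delta^{K+1}, \quad m_i \in [0, 1]^{H \times W}$$
where $K$ is the number of classes and the extra class means "no object".
Training uses Hungarian (bipartite) matching to assign predictions to ground-truth segments.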
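At inference time, a set of (class, mask) predictions can be turned back into a per-pixel semantic map by weighting each mask with its class probabilities. The sketch below uses random tensors as stand-ins for model outputs; the shapes and variable names are assumptions, and the Hungarian matching step used in training is not shown.

```python
import torch

num_queries, num_classes, h, w = 5, 3, 8, 8
class_logits = torch.randn(num_queries, num_classes + 1)  # last index = "no object"
mask_logits = torch.randn(num_queries, h, w)

class_probs = class_logits.softmax(dim=-1)[:, :-1]  # drop the "no object" column
mask_probs = mask_logits.sigmoid()

# Semantic score per class and pixel: sum over queries of
# P(class | query) * P(pixel belongs to the query's mask).
sem_scores = torch.einsum("qc,qhw->chw", class_probs, mask_probs)
sem_map = sem_scores.argmax(dim=0)  # [h, w] class index per pixel
```

Because the same set of predictions can also be kept as separate segments, this formulation serves semantic, instance, and panoptic outputs with one model.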
Evaluation Metrics
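IoU (Jaccard Index):
$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$$
Dice Coefficient:
$$\text{Dice} = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN}$$
Here $A$ is the predicted mask, $B$ the ground-truth mask, and $TP$, $FP$, $FN$ are pixel counts.
Panoptic Quality (PQ):
$$\text{PQ} = \frac{\sum_{(p, g) \in TP} \text{IoU}(p, g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}$$
Here $TP$, $FP$, $FN$ are sets of segments: a predicted segment $p$ and a ground-truth segment $g$ form a true positive when $\text{IoU}(p, g) > 0.5$.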
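A small NumPy sketch of the three metrics on boolean masks follows. The function names are hypothetical, and `panoptic_quality` handles a single class with a simple greedy match, which is enough here because with non-overlapping segments an IoU above 0.5 can only occur for one pair.

```python
import numpy as np


def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0


def dice(a, b):
    total = a.sum() + b.sum()
    return 2 * np.logical_and(a, b).sum() / total if total else 0.0


def panoptic_quality(pred_segs, gt_segs, thresh=0.5):
    """PQ for one class; segments are boolean masks."""
    matched_iou, matched_pred = [], set()
    for g in gt_segs:
        for j, p in enumerate(pred_segs):
            score = iou(p, g)
            if j not in matched_pred and score > thresh:
                matched_iou.append(score)
                matched_pred.add(j)
                break
    tp = len(matched_iou)
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_iou) / denom if denom else 0.0


a = np.array([1, 1, 1, 0], dtype=bool)
b = np.array([0, 1, 1, 1], dtype=bool)
gt = [np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool),
      np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=bool)]
pred = [np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool),
        np.array([0, 0, 0, 0, 1, 1, 0, 0], dtype=bool)]
pq = panoptic_quality(pred, gt)
```

For `a` and `b` the IoU is 0.5 and the Dice score is 2/3, which illustrates that Dice is never lower than IoU on the same pair. In the PQ example one segment matches with IoU 0.75, one prediction is a false positive, and one ground-truth segment is missed, giving 0.75 / (1 + 0.5 + 0.5) = 0.375.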
Follow-Up Questions
Q: How do you handle class imbalance in segmentation? A: Use weighted cross-entropy, Dice loss, or focal loss. Apply class-weighted sampling during training.
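As one illustration of the options in this answer, a focal loss with optional class weights can be sketched in PyTorch. The function name and the default `gamma=2.0` are assumptions of this sketch; with `gamma=0` and no weights it reduces to ordinary cross-entropy.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, target, gamma=2.0, class_weights=None):
    """logits: [B, C, H, W]; target: [B, H, W]; class_weights: optional [C] tensor."""
    # Per-pixel cross-entropy, optionally scaled by a per-class weight.
    ce = F.cross_entropy(logits, target, weight=class_weights, reduction="none")
    # Probability the model assigns to the true class at each pixel.
    pt = logits.softmax(dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    # Down-weight easy pixels (pt close to 1) so rare, hard pixels dominate.
    return ((1 - pt) ** gamma * ce).mean()
```

Class weights address imbalance in how often each class appears, while the focal term addresses imbalance between easy and hard pixels; the two can be combined as shown.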
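Q: What is the difference between transposed convolution and bilinear upsampling? A: Transposed convolution (often loosely called deconvolution) learns its upsampling weights; bilinear interpolation is fixed and has no parameters. Transposed convolution can cause checkerboard artifacts, especially when the kernel size is not divisible by the stride; bilinear output is smoother, so it is often followed by a regular convolution.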
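The contrast can be checked directly in PyTorch: both layers double the resolution, but only one has trainable parameters. The tensor shape and channel count below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)

learned = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)  # trainable weights
fixed = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

y_learned, y_fixed = learned(x), fixed(x)
n_learned = sum(p.numel() for p in learned.parameters())  # 8*8*2*2 weights + 8 biases
n_fixed = sum(p.numel() for p in fixed.parameters())      # no parameters
```

With `kernel_size=2` and `stride=2` the kernel footprints do not overlap, which is one way to limit checkerboard patterns when a learned upsampler is preferred.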
Q: How do Transformer-based segmentors compare to CNN-based? A: Transformers capture global context better but need more data. CNNs are more efficient for local patterns. Hybrid models often work best.