Semantic Segmentation: U-Net, Mask R-CNN, Panoptic — Asked at Google & Meta

🎯 The Interview Question

"Explain the difference between semantic segmentation, instance segmentation, and panoptic segmentation. How does U-Net achieve precise segmentation with skip connections? How does Mask R-CNN extend Faster R-CNN for instance segmentation? What are the modern approaches using Transformers for segmentation?"

This question tests understanding of dense prediction tasks — critical for Google (maps, photos) and Meta (AR effects).

📚 Detailed Answer

Segmentation Types

Type	Output	Example
Semantic	One class label per pixel	All car pixels share the label "car"
Instance	Class + mask per detected object	"This is car #1, this is car #2"
Panoptic	Class + instance ID per pixel	Cars #1 and #2 as instances; sky and road as regions

Formally:

Semantic: $f: \mathbb{R}^{H \times W \times 3} \rightarrow \{1, \ldots, K\}^{H \times W}$
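Instance: $f: \mathbb{R}^{H \times W \times 3} \rightarrow \{(c_i, m_i)\}_{i=1}^N$, with class $c_i \in \{1, \ldots, K\}$ and binary mask $m_i \in \{0, 1\}^{H \times W}$
Panoptic: $f: \mathbb{R}^{H \times W \times 3} \rightarrow (\{1, \ldots, K\} \times \mathbb{N})^{H \times W}$, a (class, instance ID) pair per pixel, with the instance ID ignored for stuff classes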
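
To make the three output types concrete, here is a minimal sketch on a toy 2×4 image. The class ids, the `to_panoptic` helper, and the convention that stuff pixels carry instance id 0 are illustrative assumptions, not a standard format.

```python
# Toy 2x4 "image": two cars on a road under the sky (ids are illustrative).
SKY, CAR, ROAD = 0, 1, 2

# Semantic: one class id per pixel; the two cars are indistinguishable.
semantic = [
    [SKY, SKY, SKY, SKY],
    [CAR, ROAD, CAR, CAR],
]

# Instance: one (class, binary mask) pair per countable object.
instances = [
    (CAR, [[0, 0, 0, 0], [1, 0, 0, 0]]),
    (CAR, [[0, 0, 0, 0], [0, 0, 1, 1]]),
]


def to_panoptic(semantic, instances):
    """Merge both outputs into one (class, instance_id) pair per pixel."""
    h, w = len(semantic), len(semantic[0])
    # Start from the semantic map; instance id 0 means "no instance" (stuff).
    panoptic = [[(semantic[y][x], 0) for x in range(w)] for y in range(h)]
    for inst_id, (cls, mask) in enumerate(instances, start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    panoptic[y][x] = (cls, inst_id)
    return panoptic


panoptic = to_panoptic(semantic, instances)
print(panoptic[1])  # [(1, 1), (2, 0), (1, 2), (1, 2)]
```

The bottom row shows the difference: the semantic map only says "car, road, car, car", while the panoptic map also separates car #1 from car #2.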

U-Net: Encoder-Decoder with Skip Connections

Architecture

Encoder (Contracting Path):
Input → Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU (Bottleneck)

Decoder (Expanding Path):
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
Conv 1×1 → Output

Skip Connections

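Concatenate encoder features with decoder features along the channel dimension, then convolve:

$$\mathbf{y}_i = \text{Conv}([\mathbf{u}_i; \mathbf{s}_i])$$

where $\mathbf{u}_i$ is the upsampled decoder feature map and $\mathbf{s}_i$ is the encoder (skip) feature map from the same resolution level.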

Why skip connections help:

Preserve spatial details lost during downsampling
Enable precise localization
Help gradient flow
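
The effect of concatenation on tensor shapes can be traced without a deep learning framework. The sketch below is a bookkeeping helper only, assuming padded 3×3 convolutions (so only pooling and up-convolution change resolution) and channel counts that double at each level; the original U-Net used unpadded convolutions and cropped the skip features instead. The function name `unet_shapes` is hypothetical.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Trace (stage, channels, height) through a U-Net with padded convs."""
    skips = []
    ch, hw = base_channels, size
    for _ in range(depth):
        skips.append((ch, hw))      # features saved just before each MaxPool
        ch, hw = ch * 2, hw // 2    # pool halves resolution, next block doubles channels
    trace = [("bottleneck", ch, hw)]
    for skip_ch, skip_hw in reversed(skips):
        ch, hw = ch // 2, hw * 2    # UpConv halves channels, doubles resolution
        if hw != skip_hw:
            raise ValueError("skip and upsampled maps must match spatially")
        # Channel-wise concatenation [u_i; s_i]; the following convs map
        # these ch + skip_ch channels back down to ch.
        trace.append(("decoder", ch + skip_ch, hw))
    return trace


for stage in unet_shapes():
    print(stage)
# ('bottleneck', 1024, 16)
# ('decoder', 1024, 32)
# ('decoder', 512, 64)
# ('decoder', 256, 128)
# ('decoder', 128, 256)
```

Each decoder stage sees twice as many channels as the up-convolution alone would provide, which is how high-resolution encoder detail re-enters the decoder.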

U-Net Loss Function

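A common training objective combines cross-entropy with Dice loss (the original U-Net paper used a pixel-weighted cross-entropy):

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{Dice}$$

$$\mathcal{L}_{Dice} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$

where $p_i$ is the predicted foreground probability at pixel $i$, $g_i \in \{0, 1\}$ is the ground-truth label, and $\epsilon$ is a smoothing term that avoids division by zero.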
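
As a worked example of the combined loss, the following sketch evaluates it for a binary mask flattened to a list. It is a plain-Python illustration under stated assumptions: binary cross-entropy stands in for the cross-entropy term, and the defaults `eps=1.0` and `lam=1.0` are arbitrary choices, not values from the source.

```python
import math


def dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss over flattened foreground probabilities and 0/1 labels."""
    intersection = sum(p * g for p, g in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def bce_loss(probs, targets):
    """Mean binary cross-entropy; probabilities must lie strictly in (0, 1)."""
    per_pixel = [-(g * math.log(p) + (1 - g) * math.log(1 - p))
                 for p, g in zip(probs, targets)]
    return sum(per_pixel) / len(per_pixel)


def combined_loss(probs, targets, lam=1.0):
    """L = L_CE + lambda * L_Dice."""
    return bce_loss(probs, targets) + lam * dice_loss(probs, targets)


probs = [0.9, 0.8, 0.1, 0.2]
targets = [1, 1, 0, 0]
print(round(dice_loss(probs, targets), 4))  # 0.12
```

Here the intersection is 1.7 and the two sums total 4.0, so the Dice term is 1 - (3.4 + 1) / (4.0 + 1) = 0.12.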

💡

U-Net was designed for medical imaging where training data is scarce. The skip connections help recover fine details, and data augmentation is crucial. Modern variants use attention gates and dense connections.

Mask R-CNN

Extends Faster R-CNN for instance segmentation:

Architecture

Input → Backbone → FPN → RPN → ROI Align
                                    ↓
                    ┌───────────────┼───────────────┐
                    ↓               ↓               ↓
              Classification   Box Regression   Mask Head
                    ↓               ↓               ↓
                 Class          Refined Box      Binary Mask

Key Innovation: ROI Align

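ROI Pooling rounds ROI boundaries and bin edges to integer feature-map coordinates, which misaligns the pooled features with the input. ROI Align keeps the coordinates fractional, samples each bin at a few regularly spaced points, and computes each sample by bilinear interpolation over the feature map $\mathbf{F}$:

$$\text{bilinear}(\mathbf{F}, x, y) = \sum_{i,j} \max(0, 1 - |x - i|) \max(0, 1 - |y - j|) \, F_{ij}$$

The output of each bin is the average (or max) of its samples. This preserves spatial alignment, which is crucial for pixel-accurate masks.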
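
A simplified sketch of the sampling scheme, in plain Python on a single-channel feature map. It assumes pixel centers sit at integer coordinates and that the box is already given in feature-map units; production implementations differ in coordinate conventions, so treat `roi_align` and its parameters as illustrative.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)   # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0][x0] * (1 - dy) * (1 - dx) + feat[y0][x1] * (1 - dy) * dx
            + feat[y1][x0] * dy * (1 - dx) + feat[y1][x1] * dy * dx)


def roi_align(feat, box, out_size=2, samples=2):
    """Average samples x samples bilinear samples in each output bin.

    box = (y1, x1, y2, x2) in feature-map coordinates; never rounded.
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (by + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, yy, xx))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# Feature map whose value equals its x coordinate, so results are easy to check.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
print(roi_align(ramp, box=(0.0, 0.5, 2.0, 2.5)))  # [[1.0, 2.0], [1.0, 2.0]]
```

The box starts at x = 0.5, and the bin averages land exactly on the bin centers 1.0 and 2.0. Rounding the box to integers first, as ROI Pooling does, would shift those values.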

Mask Head

Small FCN applied to each ROI:

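import torch
import torch.nn as nn


class MaskHead(nn.Module):
    """Per-ROI mask branch: four 3x3 convs, one 2x upsampling, one 1x1 classifier."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv3 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv4 = nn.Conv2d(256, 256, 3, padding=1)
        # Transposed conv doubles the spatial resolution (e.g. 14x14 -> 28x28)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        # One mask logit channel per class
        self.mask = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        # x: (num_rois, in_channels, H, W) features from ROI Align
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = torch.relu(self.conv4(x))
        x = torch.relu(self.deconv(x))
        # Returns (num_rois, num_classes, 2H, 2W) logits; apply a per-pixel sigmoid
        return self.mask(x)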
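
The head outputs one mask per class for every ROI. At inference, Mask R-CNN keeps only the channel of the class chosen by the classification branch and thresholds its per-pixel sigmoid. A minimal plain-Python sketch of that selection step follows; the helper name `select_mask` and the nested-list layout are assumptions for illustration.

```python
import math


def select_mask(mask_logits, class_id, threshold=0.5):
    """Pick the predicted class's channel and binarize its sigmoid output.

    mask_logits[k][y][x] is the logit for class k at pixel (y, x) of one ROI.
    """
    channel = mask_logits[class_id]
    return [[1 if 1.0 / (1.0 + math.exp(-v)) > threshold else 0 for v in row]
            for row in channel]


# Two classes, one row of two pixels each.
logits = [
    [[2.0, -1.0]],   # class 0
    [[-3.0, 0.5]],   # class 1
]
print(select_mask(logits, class_id=1))  # [[0, 1]]
```

Because each class has its own sigmoid channel, masks for different classes do not compete with each other, which decouples mask prediction from classification.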

Panoptic Segmentation

Combines semantic and instance segmentation: every pixel receives a class label, pixels of countable objects also receive an instance ID, and segments never overlap.

$$\text{Panoptic} = \{\text{semantic}_i\}_{i \in \text{stuff}} \cup \{\text{instance}_j\}_{j \in \text{things}}$$

Stuff classes: amorphous regions (sky, grass, road)
Thing classes: countable objects (person, car, dog)

Panoptic FPN

Unified architecture for both tasks:

Backbone → FPN → Semantic Head (per-pixel class)
               → Instance Head (Mask R-CNN)
               → Panoptic Fusion

Fusion algorithm:

Predict semantic segmentation
Predict instance segmentation
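Where instances overlap, keep the one with the higher confidence score
Where an instance overlaps a stuff prediction, keep the instance
Assign the remaining pixels to their predicted stuff class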
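
The fusion steps above can be sketched as a simple pasting procedure. This is a simplified illustration, not the reference implementation: it omits score and area thresholds, and the `void` label for leftover pixels predicted as a thing class without any instance is an assumption of this sketch.

```python
def fuse_panoptic(semantic, instances, stuff_classes, void=-1):
    """Merge semantic and instance predictions into a panoptic map.

    semantic: HxW class ids; instances: list of (score, class_id, HxW 0/1 mask).
    Returns an HxW grid of (class_id, instance_id), instance_id 0 for stuff.
    """
    h, w = len(semantic), len(semantic[0])
    panoptic = [[None] * w for _ in range(h)]
    # Paste instances from most to least confident; earlier ones win overlaps.
    ranked = sorted(instances, key=lambda inst: inst[0], reverse=True)
    for inst_id, (_score, cls, mask) in enumerate(ranked, start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x] and panoptic[y][x] is None:
                    panoptic[y][x] = (cls, inst_id)
    # Pixels not claimed by an instance take the semantic label if it is stuff.
    for y in range(h):
        for x in range(w):
            if panoptic[y][x] is None:
                cls = semantic[y][x]
                panoptic[y][x] = (cls, 0) if cls in stuff_classes else (void, 0)
    return panoptic


CAR, ROAD = 1, 2
semantic = [[ROAD, ROAD, CAR, CAR]]
instances = [
    (0.6, CAR, [[0, 0, 1, 1]]),
    (0.9, CAR, [[0, 1, 1, 0]]),
]
print(fuse_panoptic(semantic, instances, stuff_classes={ROAD}))
# [[(2, 0), (1, 1), (1, 1), (1, 2)]]
```

The contested third pixel goes to the 0.9-confidence car, and the second pixel goes to that instance even though the semantic head called it road.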

Transformer-Based Segmentation

SegFormer

Encoder-decoder with Transformer:

Encoder: Hierarchical Transformer (Mix Transformer, MiT) that produces multi-scale features, similar in spirit to Swin
Decoder: Lightweight MLP that aggregates multi-scale features

Advantages:

Global context (attention captures long-range dependencies)
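No positional encoding needed (a 3×3 depthwise convolution in the feed-forward block supplies position information, so test resolution can differ from training resolution)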
Efficient inference

MaskFormer

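Reframes segmentation as set prediction: the model outputs $N$ pairs of a class distribution $p_i$ and a mask $m_i$ instead of classifying each pixel independently. For semantic output, each pixel takes the class with the highest mask-weighted score:

$$\hat{c}(h, w) = \arg\max_{c} \sum_{i=1}^{N} p_i(c) \cdot m_i(h, w)$$

Training uses Hungarian (bipartite) matching to assign predictions to ground-truth segments.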
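
Both ideas fit in a few lines of plain Python. The sketch below combines per-query class probabilities with mask probabilities to label pixels, and finds the one-to-one assignment by brute force over permutations as a stand-in for the Hungarian algorithm (only viable for tiny inputs). The function names, the flattened-pixel layout, and the example cost values are illustrative assumptions.

```python
from itertools import permutations


def semantic_inference(class_probs, mask_probs):
    """Label each pixel with the class of highest mask-weighted score.

    class_probs[i][c]: P(class c) for query i (no-object column dropped).
    mask_probs[i][p]:  P(pixel p belongs to query i), pixels flattened.
    """
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        scores = [sum(cp[c] * mp[p] for cp, mp in zip(class_probs, mask_probs))
                  for c in range(num_classes)]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


def match_queries(cost):
    """Minimum-cost one-to-one assignment of queries to ground-truth segments.

    cost[q][g] is the cost of matching query q to ground truth g; requires at
    least as many queries as ground-truth segments. Returns best[g] = q.
    """
    n_queries, n_gt = len(cost), len(cost[0])
    best = min(permutations(range(n_queries), n_gt),
               key=lambda perm: sum(cost[q][g] for g, q in enumerate(perm)))
    return list(best)


class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [[0.9, 0.8, 0.1], [0.1, 0.3, 0.9]]
print(semantic_inference(class_probs, mask_probs))  # [0, 0, 1]

cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # 3 queries, 2 ground-truth segments
print(match_queries(cost))  # [1, 0]
```

In the matching example, query 1 takes ground truth 0 and query 0 takes ground truth 1 for a total cost of 0.3; the unmatched query 2 would be trained to predict "no object".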

Evaluation Metrics

IoU (Jaccard Index), with $A$ the predicted mask and $B$ the ground-truth mask:

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

Dice Coefficient:

$$\text{Dice} = \frac{2|A \cap B|}{|A| + |B|}$$
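
A small worked example, treating each mask as a set of pixel indices. Returning 1.0 when both masks are empty is a convention assumed here; libraries differ on that edge case.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of pixel ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice coefficient of two masks given as sets of pixel ids."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


pred = {0, 1, 2, 3}
gt = {2, 3, 4, 5}
print(round(iou(pred, gt), 4), dice(pred, gt))  # 0.3333 0.5
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so they rank predictions identically on a single mask, but Dice is always at least as large as IoU.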

Panoptic Quality (PQ):

$$PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_{\text{Segmentation Quality}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Recognition Quality}}$$
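
A sketch of the PQ computation for a single class, with segments given as sets of pixel ids. It assumes the usual matching rule that a predicted and a ground-truth segment form a true positive when their IoU exceeds 0.5; in practice PQ is computed per class and then averaged over classes, which this helper leaves to the caller.

```python
def panoptic_quality(pred_segments, gt_segments, iou_thresh=0.5):
    """Return (PQ, SQ, RQ) for one class; segments are sets of pixel ids."""
    matched_ious = []
    used_preds = set()
    for gt in gt_segments:
        for idx, pred in enumerate(pred_segments):
            if idx in used_preds:
                continue
            union = len(pred | gt)
            score = len(pred & gt) / union if union else 0.0
            if score > iou_thresh:   # counts as a true positive match
                matched_ious.append(score)
                used_preds.add(idx)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp     # unmatched predictions
    fn = len(gt_segments) - tp       # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / denom
    return sq * rq, sq, rq


gt = [{0, 1, 2, 3}, {10, 11}]
pred = [{0, 1, 2}, {20, 21}]
print(panoptic_quality(pred, gt))  # (0.375, 0.75, 0.5)
```

One segment is matched with IoU 0.75, one prediction is spurious, and one ground-truth segment is missed, giving SQ = 0.75, RQ = 1 / (1 + 0.5 + 0.5) = 0.5, and PQ = 0.375.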

Follow-Up Questions

Q: How do you handle class imbalance in segmentation?
A: Use weighted cross-entropy, Dice loss, or focal loss. Apply class-weighted sampling during training.
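
To illustrate two of these options, the sketch below computes inverse-frequency class weights and a per-pixel focal loss. The weighting formula (total / (num_classes × count)) is one common heuristic, and `gamma=2.0` is a frequently used default; both are assumptions here rather than recommendations from the source.

```python
import math


def inverse_frequency_weights(pixel_counts):
    """Weight each class by total / (num_classes * count); rare classes get more."""
    total = sum(pixel_counts.values())
    k = len(pixel_counts)
    return {c: total / (k * n) for c, n in pixel_counts.items()}


def weighted_ce(p_true, weight=1.0):
    """Cross-entropy for one pixel given the probability of its true class."""
    return -weight * math.log(p_true)


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss: scales cross-entropy by (1 - p)^gamma to down-weight easy pixels."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


weights = inverse_frequency_weights({0: 900, 1: 100})
print(round(weights[0], 4), weights[1])  # 0.5556 5.0

easy, hard = 0.9, 0.3
print(focal_loss(easy) < focal_loss(hard))  # True
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy; larger values shrink the contribution of confidently correct pixels, which in segmentation are usually the abundant background.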

Q: What is the difference between transposed convolution and bilinear upsampling?
A: Transposed convolution (deconvolution) learns upsampling; bilinear is fixed. Transposed can cause checkerboard artifacts; bilinear is smoother.
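
The checkerboard effect is easy to reproduce in one dimension. The sketch below implements a bare transposed convolution and feeds it a constant signal with an all-ones kernel, so any variation in the output comes purely from uneven kernel overlap. The helper name and the choice of kernels are illustrative.

```python
def transposed_conv1d(signal, kernel, stride):
    """1D transposed convolution: each input value stamps a scaled kernel copy."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i * stride + j] += value * k
    return out


flat = [1.0, 1.0, 1.0, 1.0]

# Kernel size 3 is not divisible by stride 2: overlap alternates between 1 and 2.
print(transposed_conv1d(flat, [1.0, 1.0, 1.0], stride=2))
# [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0]

# Kernel size 4 is divisible by stride 2: the interior overlap is uniform.
print(transposed_conv1d(flat, [1.0, 1.0, 1.0, 1.0], stride=2))
# [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0]
```

A learned kernel can in principle compensate for the uneven overlap, but a common mitigation is to pick a kernel size divisible by the stride or to upsample with a fixed interpolation followed by a regular convolution.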

Q: How do Transformer-based segmentors compare to CNN-based?
A: Transformers capture global context better but need more data. CNNs are more efficient for local patterns. Hybrid models often work best.
