🎯 The Interview Question
"Explain the GAN training objective mathematically. What is the minimax game between generator and discriminator? Why is GAN training unstable, and what techniques exist to stabilize it? Describe the mode collapse problem and how modern GAN architectures address it."
This question tests understanding of generative models — critical for NVIDIA (image generation) and Meta (content creation).
📚 Detailed Answer
GAN: The Minimax Game
The GAN framework consists of:
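- Generator $G$: maps noise $z \sim p_z$ to fake samples $G(z)$
- Discriminator $D$: classifies real vs. fake samples, outputting the probability $D(x)$ that $x$ is real
Objective (minimax game):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$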
Discriminator maximizes: Correctly classify real as real and fake as fake.
Generator minimizes: Fool discriminator into classifying fake as real.
Training Dynamics
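At the optimal discriminator for a fixed generator, where $p_g$ denotes the distribution of $G(z)$:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
Substituting $D^*$ reduces the generator's objective to $-\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$. The global optimum is $p_g = p_{\text{data}}$, meaning the generator perfectly matches the data distribution and $D^*(x) = 1/2$ everywhere.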
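The optimal-discriminator result can be checked numerically. The following is a minimal sketch on small discrete distributions (plain lists of probabilities, an assumption made for illustration); the helper names `optimal_discriminator`, `value_function`, and `jsd` are hypothetical. It verifies that the value function at the optimal discriminator equals `-log 4 + 2 * JSD`, which becomes `-log 4` when the two distributions coincide.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) at each point of a discrete space."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(d, p_data, p_g):
    """V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))], skipping zero-probability terms."""
    total = 0.0
    for dx, pd, pg in zip(d, p_data, p_g):
        if pd > 0:
            total += pd * math.log(dx)
        if pg > 0:
            total += pg * math.log(1.0 - dx)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hand-picked example distributions over two outcomes (illustrative only).
p_data = [0.8, 0.2]
p_g = [0.3, 0.7]
d_star = optimal_discriminator(p_data, p_g)
v_star = value_function(d_star, p_data, p_g)
print(v_star, -math.log(4) + 2 * jsd(p_data, p_g))  # the two values agree
```

Any other discriminator, for example `[0.6, 0.4]`, gives a lower value on the same pair of distributions, which is what "optimal" means here.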
Training algorithm:
for each training step:
    # Train discriminator
    Sample real batch {x_i}
    Sample noise batch {z_i}
    Generate fake batch: G(z_i)
    Maximize: log D(x_i) + log(1 - D(G(z_i)))
    # Train generator
    Sample noise batch {z_i}
    Minimize: log(1 - D(G(z_i)))
    # OR (non-saturating alternative) maximize: log D(G(z_i))
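The alternating updates above can be sketched in runnable form. This is a toy illustration, not a practical GAN: it assumes 1-D data drawn from a Gaussian with mean 4, a logistic-regression discriminator `D(x) = sigmoid(w * x + c)`, a shift-only generator `G(z) = z + theta`, and hand-derived gradients in place of an autograd framework. The function name `train_toy_gan` and all hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def train_toy_gan(steps, batch=64, lr=0.05, real_mean=4.0, seed=0):
    """Alternate one discriminator step and one generator step per iteration."""
    rng = random.Random(seed)
    w, c, theta = 0.0, 0.0, 0.0
    for _ in range(steps):
        # Discriminator: gradient ascent on log D(x) + log(1 - D(G(z))).
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
        grad_w = grad_c = 0.0
        for x in real:
            g = 1.0 - sigmoid(w * x + c)   # d/da log(sigmoid(a))
            grad_w += g * x
            grad_c += g
        for x in fake:
            g = -sigmoid(w * x + c)        # d/da log(1 - sigmoid(a))
            grad_w += g * x
            grad_c += g
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator: gradient ascent on log D(G(z)), the non-saturating form.
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
        # d/dtheta log D(z + theta) = (1 - D(G(z))) * w
        grad_theta = sum((1.0 - sigmoid(w * x + c)) * w for x in fake) / batch
        theta += lr * grad_theta
    return theta, w, c

theta, w, c = train_toy_gan(steps=1)
print(theta, w, c)
```

After the first iteration the discriminator has learned that larger values look more real (`w > 0`), so the generator's shift `theta` moves toward the data. Over longer runs a toy game like this may circle the equilibrium rather than settle on it.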
⚠️ In practice, the generator is trained by maximizing $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$. The latter saturates when the discriminator is strong.
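The saturation can be seen directly from the derivatives with respect to the discriminator's logit `a`, where `D = sigmoid(a)`. The sketch below assumes a sigmoid-output discriminator; the function names are hypothetical. The derivative of `log(1 - sigmoid(a))` is `-sigmoid(a)`, which shrinks to zero as the discriminator rejects the fake sample, while the derivative of `-log(sigmoid(a))` is `-(1 - sigmoid(a))`, which stays close to -1 in the same regime.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def grad_saturating(logit):
    """d/da of log(1 - sigmoid(a)): the original generator loss."""
    return -sigmoid(logit)

def grad_non_saturating(logit):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# A very negative logit means the discriminator confidently rejects the fake.
for logit in (-8.0, -4.0, 0.0, 4.0):
    print(
        f"D(G(z))={sigmoid(logit):.4f}  "
        f"saturating={grad_saturating(logit):+.4f}  "
        f"non-saturating={grad_non_saturating(logit):+.4f}"
    )
```

Both losses push the generator in the same direction; they differ in how much signal survives when the discriminator is winning.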
Training Instability
Vanishing Gradients
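When the discriminator is too good, $D(G(z)) \approx 0$, and the gradients reaching the generator parameters $\theta_g$ vanish:
$$\nabla_{\theta_g} \log(1 - D(G(z))) \approx 0$$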
Solution: Train discriminator to near-optimal but not perfect.
Mode Collapse
The generator produces a limited variety of outputs, ignoring parts of the data distribution.
Symptoms:
- Generated samples look similar
- Low diversity despite good individual quality
- Discriminator loss oscillates
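On synthetic data with known modes, the diversity symptom can be quantified. The following is a minimal sketch under strong assumptions: samples are scalars, the true mode centers are known, and a sample counts toward a mode if it lies within `radius` of the nearest center. The function `mode_coverage` and the example values are hypothetical; real image data has no such ground truth, which is why proxy metrics are used there.

```python
def mode_coverage(samples, mode_centers, radius=1.0):
    """Return (fraction of modes hit, per-mode counts) for 1-D samples."""
    counts = [0] * len(mode_centers)
    for s in samples:
        distances = [abs(s - c) for c in mode_centers]
        nearest = min(range(len(mode_centers)), key=distances.__getitem__)
        if distances[nearest] <= radius:
            counts[nearest] += 1
    covered = sum(1 for n in counts if n > 0)
    return covered / len(mode_centers), counts

centers = [-4.0, 0.0, 4.0]
collapsed = [3.9, 4.1, 4.0, 3.8]   # every sample sits on one mode
diverse = [-4.2, -3.9, 0.1, 3.8]   # every mode is visited

print(mode_coverage(collapsed, centers))
print(mode_coverage(diverse, centers))
```

The collapsed batch hits one of three modes even though each individual sample is a plausible draw, matching the "low diversity despite good individual quality" symptom.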
Training Oscillation
The generator and discriminator optimize opposing objectives, so alternating gradient updates can cycle around the equilibrium and the losses oscillate without converging.
Stabilization Techniques
1. Wasserstein GAN (WGAN)
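Replaces the Jensen-Shannon divergence with the Wasserstein-1 distance, written in its Kantorovich-Rubinstein dual form with the discriminator $D$ acting as a real-valued critic:
$$W(p_{\text{data}}, p_g) = \sup_{\|D\|_L \le 1} \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$$
with Lipschitz constraint:
$$|D(x_1) - D(x_2)| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2$$
Gradient penalty (WGAN-GP), which enforces the constraint softly on points $\hat{x}$ interpolated between real and generated samples:
$$\lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$$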
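One motivation for the switch is that the Jensen-Shannon divergence is constant whenever the two distributions do not overlap, while the Wasserstein distance still reflects how far apart they are. The sketch below illustrates this on histograms over a shared 1-D grid, which is an assumption chosen so both quantities have simple closed forms; the helper names are hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1(p, q, spacing=1.0):
    """W1 between histograms on a shared, evenly spaced 1-D grid.

    In one dimension W1 equals the integral of |CDF_p - CDF_q|.
    """
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(index, size=6):
    """All probability on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

real = point_mass(0)
for shift in range(1, 5):
    fake = point_mass(shift)
    print(shift, round(jsd(real, fake), 4), wasserstein_1(real, fake))
```

The JSD column stays at `log 2` for every shift, so it offers no direction for the generator to move in, while the W1 column grows with the shift.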
2. Progressive Growing
Train at low resolution first, then progressively increase it:
4×4 → 8×8 → 16×16 → 32×32 → ... → 1024×1024
Each resolution is trained for a period, then the layers for the next resolution are faded in.
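The schedule and the fade-in step can be sketched as follows. This is a simplified illustration that assumes single-channel images stored as nested lists and nearest-neighbor upsampling; the function names are hypothetical. During a transition, the output is a blend of the upsampled lower-resolution image and the new layer's image, with `alpha` ramping from 0 to 1.

```python
def resolution_schedule(start=4, end=1024):
    """Resolutions visited when doubling from start to end."""
    resolutions = []
    r = start
    while r <= end:
        resolutions.append(r)
        r *= 2
    return resolutions

def upsample_nearest(image):
    """Double the height and width of a 2-D list by repeating each pixel."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fade_in(low_res, high_res, alpha):
    """Blend the upsampled old output with the new layer's output.

    alpha = 0 uses only the old layers; alpha = 1 uses only the new layer.
    """
    up = upsample_nearest(low_res)
    return [
        [(1 - alpha) * u + alpha * h for u, h in zip(up_row, high_row)]
        for up_row, high_row in zip(up, high_res)
    ]

print(resolution_schedule())
print(fade_in([[1.0]], [[3.0, 3.0], [3.0, 3.0]], alpha=0.5))
```

Starting at `alpha = 0` means the network's output is unchanged at the moment new layers are added, so the new layers do not disrupt what the lower-resolution layers have already learned.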
3. Style-Based Generator (StyleGAN)
- Mapping network transforms the latent $z$ into an intermediate style vector $w$
- AdaIN (Adaptive Instance Normalization) injects style at each layer
- Enables control over different aspects (pose, hair, background)
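AdaIN itself is a small operation: normalize each channel of a feature map to zero mean and unit standard deviation, then rescale and shift it with style-dependent values. The sketch below handles a single channel flattened to a list; in StyleGAN the scale and bias come from a learned affine transform of the style vector, whereas here they are hand-picked numbers, and the function name `adain` is hypothetical.

```python
import math

def adain(feature_map, style_scale, style_bias, eps=1e-8):
    """AdaIN for one channel: normalize the activations, then apply the style."""
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((v - mean) ** 2 for v in feature_map) / n
    std = math.sqrt(var + eps)
    return [style_scale * (v - mean) / std + style_bias for v in feature_map]

channel = [1.0, 2.0, 3.0, 4.0]   # flattened activations of one channel
styled = adain(channel, style_scale=2.0, style_bias=10.0)

new_mean = sum(styled) / len(styled)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in styled) / len(styled))
print(new_mean, new_std)  # close to the requested bias and scale
```

Because the original statistics are removed before the style is applied, each layer's style overrides the previous one, which is what lets different layers control different aspects of the image.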
Modern GAN Architectures
StyleGAN2/3
- Path length regularization: Encourages smooth latent space
- Skip connections in the generator and residual connections in the discriminator: replace progressive growing
- Lazy regularization: Evaluate regularization terms only every 16 steps to reduce training cost
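Lazy regularization can be sketched as a conditional inside the training loop. The version below is a simplified illustration: `main_loss` and `regularizer` are hypothetical callables standing in for the adversarial loss and a costly term such as a gradient-based penalty, and multiplying the term by `interval` to keep its average weight unchanged is an assumption about how the skipped steps are compensated.

```python
def training_losses(num_steps, main_loss, regularizer, interval=16):
    """Evaluate the regularizer only every `interval` steps, scaled by `interval`."""
    totals = []
    for step in range(num_steps):
        loss = main_loss(step)
        if step % interval == 0:
            # The costly regularizer is computed on these steps only.
            loss += interval * regularizer(step)
        totals.append(loss)
    return totals

reg_calls = []

def toy_regularizer(step):
    """Stand-in regularizer that records when it is evaluated."""
    reg_calls.append(step)
    return 0.5

losses = training_losses(32, main_loss=lambda step: 1.0, regularizer=toy_regularizer)
print(reg_calls)          # steps on which the regularizer ran
print(losses[0], losses[1])
```

Over 32 steps the regularizer runs twice instead of 32 times, which is where the savings come from.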
BigGAN
- Class-conditional generation: Control output class
- Truncation trick: Trade diversity for quality
- Large-scale training: Batch size 2048 on 128 to 512 TPU v3 cores
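The truncation trick amounts to sampling the latent vector from a truncated normal: any component whose magnitude exceeds a threshold is resampled until it falls inside. The sketch below shows only the sampling step with the standard library; the function name `truncated_normal` and the threshold value are illustrative assumptions.

```python
import random

def truncated_normal(n, threshold, rng):
    """Sample n standard-normal values, resampling any with |z| > threshold."""
    values = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        while abs(z) > threshold:
            z = rng.gauss(0.0, 1.0)
        values.append(z)
    return values

rng = random.Random(0)
z_full = [rng.gauss(0.0, 1.0) for _ in range(1000)]
z_trunc = truncated_normal(1000, threshold=0.5, rng=rng)

print(max(abs(z) for z in z_full))   # untruncated samples reach the tails
print(max(abs(z) for z in z_trunc))  # truncated samples stay within 0.5
```

A smaller threshold keeps latents near the high-density center of the prior, which tends to raise sample quality at the cost of variety; a large threshold recovers ordinary sampling.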
VQGAN
- Vector quantization: Discrete latent space
- Autoregressive transformer over codebook indices: Models long-range spatial relationships
- Perceptual loss: Better visual quality
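The vector quantization step replaces each encoder output vector with its nearest entry in a learned codebook, so an image becomes a grid of discrete indices. The following is a minimal sketch of the lookup only, with a tiny hand-written codebook; the names `quantize`, `codebook`, and `encoder_outputs` are hypothetical, and the training of the codebook is not shown.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector` in squared L2."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]

# Illustrative 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Stand-ins for encoder outputs at three spatial positions.
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

indices = [quantize(v, codebook)[0] for v in encoder_outputs]
print(indices)
```

The resulting index sequence is what a sequence model can then be trained on, in place of raw pixels.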
Evaluation Metrics
| Metric | What it Measures | How to Compute |
|---|---|---|
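| FID | Quality + diversity | Fréchet distance between Gaussians fitted to Inception features of real and generated images (lower is better) |
| IS | Quality + class diversity | Exponentiated mean KL divergence between per-image and marginal Inception class predictions (higher is better) |
| LPIPS | Perceptual similarity | Weighted distance between deep network features (e.g. VGG or AlexNet) |
| Precision/Recall | Quality vs. diversity | Fraction of generated samples on the real-data manifold (precision) and of real samples on the generated manifold (recall) |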
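To see why FID responds to both quality and diversity, consider the scalar case. The Fréchet distance between two Gaussians is `||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))`, which for one-dimensional features reduces to `(mu1 - mu2)^2 + (sigma1 - sigma2)^2`. The sketch below uses that reduction on hand-picked scalar "features"; real FID uses high-dimensional Inception features and a matrix square root, and the function name here is hypothetical.

```python
import math

def frechet_distance_1d(real_features, fake_features):
    """Fréchet distance between Gaussians fitted to two sets of scalar features."""
    def mean_std(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / len(values)
        return m, math.sqrt(var)
    mu_r, sd_r = mean_std(real_features)
    mu_f, sd_f = mean_std(fake_features)
    return (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2

real = [0.0, 2.0, 4.0]
shifted = [1.0, 3.0, 5.0]     # same spread, mean moved by 1
collapsed = [2.0, 2.0, 2.0]   # same mean, no diversity

print(frechet_distance_1d(real, real))
print(frechet_distance_1d(real, shifted))
print(frechet_distance_1d(real, collapsed))
```

The shifted set is penalized through the mean term and the collapsed set through the spread term, so a generator cannot score well by matching only one of the two.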
Follow-Up Questions
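Q: How does StyleGAN achieve style mixing? A: By feeding style vectors derived from different latent codes to different layers of the generator. Coarse attributes (pose) are controlled by the early layers and fine attributes (color) by the later layers.
Q: What is the relationship between GANs and VAEs? A: Both are generative models but with different objectives. GANs implicitly minimize a divergence between the generated and data distributions through the adversarial game; VAEs maximize a variational lower bound (ELBO) on the data log-likelihood. GANs tend to produce sharper images; VAEs tend to cover the data distribution better.
Q: Can GANs generate text? A: It is difficult because text is discrete: sampling a token is not differentiable, so the discriminator's gradient cannot flow back to the generator without workarounds such as policy gradients or Gumbel-softmax relaxations. Most text generation uses autoregressive models (GPT) or diffusion models. GANs fare better on continuous signals, for example GANSynth for musical audio.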