GANs: Generator, Discriminator, Training Stability — Asked at NVIDIA & Meta

🎯 The Interview Question

"Explain the GAN training objective mathematically. What is the minimax game between generator and discriminator? Why is GAN training unstable, and what techniques exist to stabilize it? Describe the mode collapse problem and how modern GAN architectures address it."

This question tests understanding of generative models — critical for NVIDIA (image generation) and Meta (content creation).

📚 Detailed Answer

GAN: The Minimax Game

The GAN framework consists of:

Generator $G$: Maps noise $\mathbf{z} \sim p_z$ to fake samples $G(\mathbf{z})$
Discriminator $D$: Outputs the probability $D(\mathbf{x})$ that a sample is real rather than generated

Objective (minimax game):

$$\min_G \max_D \mathcal{L}(G, D) = \mathbb{E}_{\mathbf{x} \sim p_{data}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$

Discriminator maximizes: Correctly classify real as real and fake as fake.

Generator minimizes: Fool discriminator into classifying fake as real.

Training Dynamics

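For a fixed generator with sample distribution $p_g$, the optimal discriminator is:

$$D^*_G(\mathbf{x}) = \frac{p_{data}(\mathbf{x})}{p_{data}(\mathbf{x}) + p_g(\mathbf{x})}$$

Substituting $D^*_G$ back into the objective gives $-\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)$, so the global optimum is $p_g = p_{data}$: the generator matches the data distribution exactly and the discriminator outputs $1/2$ everywhere.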
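The optimal-discriminator formula can be checked numerically on a small discrete example. The sketch below is illustrative only: the three-point distributions and the helper names (`value`, `optimal_discriminator`, `jsd`) are assumptions made for this check, not part of any library. It verifies the known identity that the objective at $D^*_G$ equals $-\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)$, and that moving away from $D^*_G$ lowers the discriminator's value.

```python
import math

def value(p_data, p_g, d):
    """Objective on a finite support: sum_x p_data(x) log D(x) + p_g(x) log(1 - D(x))."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)), computed pointwise."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

d_star = optimal_discriminator(p_data, p_g)
v_star = value(p_data, p_g, d_star)

# Gap between the objective at D* and -log 4 + 2 * JSD (should be ~0).
identity_gap = abs(v_star - (-math.log(4) + 2 * jsd(p_data, p_g)))

# Any other discriminator scores lower, e.g. one shifted by 0.05.
d_perturbed = [min(0.99, dx + 0.05) for dx in d_star]
v_perturbed = value(p_data, p_g, d_perturbed)
```

When `p_g` equals `p_data`, the same functions give a discriminator of 0.5 everywhere and an objective of $-\log 4$.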

Training algorithm:

for each training step:
    # Train discriminator
    Sample real batch {x_i}
    Sample noise batch {z_i}
    Generate fake batch: G(z_i)
    Maximize: log D(x_i) + log(1 - D(G(z_i)))

    # Train generator
    Sample noise batch {z_i}
    Minimize: log(1 - D(G(z_i)))
    # OR (non-saturating alternative: same goal, stronger gradients) maximize: log D(G(z_i))
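The alternating loop above can be made concrete with a deliberately tiny model. The following is a minimal sketch, assuming a 1-D setup invented for illustration: real data drawn from $\mathcal{N}(4, 1)$, a linear generator `G(z) = a*z + b`, and a logistic discriminator `D(x) = sigmoid(w*x + c)`, with gradients written out by hand. The function name `train_toy_gan` and all hyperparameters are hypothetical. The generator step uses the non-saturating variant (maximize `log D(G(z))`).

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(steps=200, batch=64, lr=0.05, real_mean=4.0, seed=0):
    """Alternating updates for G(z) = a*z + b and D(x) = sigmoid(w*x + c)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters
    for _ in range(steps):
        # Discriminator step: gradient ascent on log D(x) + log(1 - D(G(z))).
        xs = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        gs = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        gw = gc = 0.0
        for x, g in zip(xs, gs):
            dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
            gw += (1.0 - dx) * x - dg * g  # d/dw of both log terms
            gc += (1.0 - dx) - dg          # d/dc of both log terms
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient ascent on log D(G(z)) with D held fixed.
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        ga = gb = 0.0
        for z in zs:
            dg = sigmoid(w * (a * z + b) + c)
            ga += (1.0 - dg) * w * z  # chain rule through G(z) = a*z + b
            gb += (1.0 - dg) * w
        a += lr * ga / batch
        b += lr * gb / batch
    return a, b, w, c

a, b, w, c = train_toy_gan()
```

A linear discriminator can only detect a shift in the mean, so this toy is meant to show the update order and the two gradient signals, not to demonstrate convergence.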

⚠️

In practice, the generator is trained to maximize $\log D(G(\mathbf{z}))$ (the non-saturating loss) instead of minimizing $\log(1 - D(G(\mathbf{z})))$. The latter saturates when the discriminator confidently rejects generated samples, which is typical early in training.

Training Instability

Vanishing Gradients

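When the discriminator is too good, $D(G(\mathbf{z})) \approx 0$, and the generator's gradients vanish:

$$\nabla_G \log(1 - D(G(\mathbf{z}))) \approx 0$$

Solution: Use the non-saturating generator loss, and keep the discriminator near-optimal but not perfect so that its gradients stay informative.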
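The saturation effect is easy to see by differentiating both generator losses with respect to the discriminator's pre-sigmoid output. The sketch below assumes $D = \sigma(a)$ for a logit $a$ (an assumption about the discriminator's final layer) and uses the closed-form derivatives; the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the minimax generator loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(sigmoid(logit)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# Very negative logits mean the discriminator confidently rejects the fake.
for logit in (-8.0, -4.0, 0.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  saturating={saturating_grad(logit):+.4f}  "
          f"non-saturating={non_saturating_grad(logit):+.4f}")
```

Both gradients have the same sign, so both losses push the logit upward. The difference is magnitude: as $D(G(\mathbf{z})) \to 0$ the saturating gradient shrinks toward zero while the non-saturating one approaches $-1$.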

Mode Collapse

The generator produces only a limited variety of outputs, ignoring parts of the data distribution:

$$p_g(\mathbf{x}) \neq p_{data}(\mathbf{x}) \text{ (missing modes)}$$

Symptoms:

Generated samples look similar
Low diversity despite good individual quality
Discriminator loss oscillates
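On synthetic data with known modes, low diversity can be quantified by counting how many modes the generated samples actually reach. The following is a rough sketch under assumptions: 2-D points, a hypothetical helper `mode_coverage`, eight mode centers on a unit circle, and a hand-written list standing in for a collapsed generator's output.

```python
import math

def mode_coverage(samples, modes, radius):
    """Count samples that fall within `radius` of their nearest mode center."""
    counts = [0] * len(modes)
    for sx, sy in samples:
        dists = [math.hypot(sx - mx, sy - my) for mx, my in modes]
        nearest = min(range(len(modes)), key=dists.__getitem__)
        if dists[nearest] <= radius:
            counts[nearest] += 1
    covered = sum(1 for n in counts if n > 0)
    return covered, counts

# Eight mode centers evenly spaced on the unit circle.
modes = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
         for k in range(8)]

# Stand-in for a collapsed generator: every sample sits near one of two modes.
collapsed = [(1.0, 0.02), (0.98, -0.01), (0.0, 1.01), (0.01, 0.99)]

covered, counts = mode_coverage(collapsed, modes, radius=0.1)
print(f"modes covered: {covered} of {len(modes)}")
```

For real images the modes are not known in advance, so this kind of count only applies to toy benchmarks; feature-space metrics take over in practice.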

Training Oscillation

Each network is optimized against the other's moving target, so the losses can oscillate indefinitely instead of converging to an equilibrium.

Stabilization Techniques

1. Wasserstein GAN (WGAN)

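Replaces the Jensen-Shannon divergence implicitly minimized by the original GAN with the Wasserstein-1 (earth mover's) distance, which still gives useful gradients when $p_g$ and $p_{data}$ barely overlap. The discriminator becomes a critic with an unbounded real-valued output that maximizes:

$$\mathcal{L}_W = \mathbb{E}_{\mathbf{x} \sim p_{data}}[D(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_z}[D(G(\mathbf{z}))]$$

with Lipschitz constraint: $\|D\|_L \leq 1$. The generator minimizes the same quantity. The original WGAN enforces the constraint by clipping the critic's weights.

Gradient penalty (WGAN-GP), a soft version of the constraint added to the critic loss, with $\hat{\mathbf{x}}$ sampled uniformly along straight lines between real and generated samples:

$$\mathcal{L}_{GP} = \lambda \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{x}}}\left[(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2\right]$$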
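The gradient penalty can be sketched without an autodiff framework by choosing a critic whose input gradient is known in closed form. The example below assumes a toy critic $D(\mathbf{x}) = \tanh(\mathbf{w} \cdot \mathbf{x} + b)$, which is purely illustrative, and evaluates the penalty at points interpolated between paired real and generated samples, $\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1 - \epsilon)\tilde{\mathbf{x}}$ with $\epsilon \sim U[0, 1]$. The default `lam=10.0` is a commonly used setting, not a requirement. In a real model the gradient would come from automatic differentiation instead of `critic_grad`.

```python
import math
import random

def critic_grad(w, b, x):
    """Gradient with respect to x of the toy critic D(x) = tanh(w . x + b)."""
    t = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(1.0 - t * t) * wi for wi in w]

def gradient_penalty(w, b, real, fake, lam=10.0, rng=random):
    """lam * mean((||grad_x D(x_hat)||_2 - 1)^2) over interpolated points x_hat."""
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.random()  # interpolation coefficient in [0, 1)
        x_hat = [eps * xi + (1.0 - eps) * ti for xi, ti in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in critic_grad(w, b, x_hat)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real)

# A critic with zero weights has zero gradient everywhere, so each term is (0 - 1)^2.
flat_penalty = gradient_penalty([0.0, 0.0], 0.0, [[1.0, 2.0]], [[0.0, 0.0]])

# Along this segment w . x_hat = 0, so the gradient norm is exactly ||w|| = 1.
unit_penalty = gradient_penalty([1.0, 0.0], 0.0, [[0.0, 5.0]], [[0.0, -3.0]])
```

The penalty is two-sided: it pushes the gradient norm toward 1 rather than merely below 1, which is why the flat critic is penalized as well.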

2. Progressive Growing

Train at low resolution first, then progressively increase it:

4×4 → 8×8 → 16×16 → 32×32 → ... → 1024×1024

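Each resolution is trained for a period; the layers for the next resolution are then faded in gradually by blending their output with the upsampled output of the previous stage.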
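The fade-in step amounts to a linear blend controlled by a weight that ramps from 0 to 1. Below is a minimal sketch on plain nested lists, assuming single-channel images and nearest-neighbor upsampling; the helper names `upsample_nearest` and `fade_in` are made up for this example.

```python
def upsample_nearest(img):
    """Double the resolution of a 2-D list image by repeating each pixel 2x2."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fade_in(low_res_out, new_layer_out, alpha):
    """Blend (1 - alpha) * upsample(low_res_out) + alpha * new_layer_out."""
    up = upsample_nearest(low_res_out)
    return [[(1.0 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
            for up_row, new_row in zip(up, new_layer_out)]

low = [[1.0, 2.0], [3.0, 4.0]]               # output of the existing 2x2 stage
new = [[0.0] * 4 for _ in range(4)]          # output of the newly added 4x4 layers
halfway = fade_in(low, new, alpha=0.5)
```

With `alpha=0.0` the network behaves exactly like the previous stage, and with `alpha=1.0` only the new layers contribute, so the transition does not shock the already trained weights.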

3. Style-Based Generator (StyleGAN)

Mapping network transforms the latent $\mathbf{z}$ into an intermediate style vector $\mathbf{w}$
AdaIN (Adaptive Instance Normalization) injects the style at each generator layer
Per-layer styles enable control over different aspects (pose, hair, background)
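AdaIN itself is a small operation: normalize each feature channel to zero mean and unit standard deviation, then apply a style-dependent scale and bias. The sketch below handles a single channel stored as a flat list; in StyleGAN the scale and bias are produced from the style vector by a learned affine layer, which is replaced here by plain arguments for illustration.

```python
import math

def adain(feature_map, style_scale, style_bias, eps=1e-8):
    """AdaIN for one channel: normalize, then scale and shift with the style."""
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((v - mean) ** 2 for v in feature_map) / n
    std = math.sqrt(var + eps)
    return [style_scale * (v - mean) / std + style_bias for v in feature_map]

styled = adain([1.0, 2.0, 3.0, 4.0], style_scale=2.0, style_bias=5.0)
```

After the call, the channel's statistics are set by the style rather than by the incoming features: its mean equals `style_bias` and its standard deviation equals `style_scale` (up to the small `eps`).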

Modern GAN Architectures

StyleGAN2/3

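Path length regularization: Encourages a smooth mapping from latent space to images
Skip connections in the generator and residual connections in the discriminator (StyleGAN2): Replace progressive growing
Lazy regularization: Apply regularization terms only every few minibatches (e.g., the R1 penalty every 16) to reduce cost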

BigGAN

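Class-conditional generation: Condition the generator and discriminator on a class label to control the output class
Truncation trick: Sample $\mathbf{z}$ from a truncated normal at inference; a tighter threshold trades diversity for fidelity
Large-scale training: Batch size 2048 on a Google TPU v3 Pod (128 to 512 cores, depending on resolution)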
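The truncation trick can be sketched as rejection sampling on each latent component: draw from a standard normal and resample any value whose magnitude exceeds a threshold. The function name `truncated_normal` and the dimensions below are assumptions for illustration.

```python
import random

def truncated_normal(dim, threshold, rng):
    """Sample z ~ N(0, I), resampling any component with |z_i| > threshold."""
    z = []
    for _ in range(dim):
        v = rng.gauss(0.0, 1.0)
        while abs(v) > threshold:
            v = rng.gauss(0.0, 1.0)
        z.append(v)
    return z

rng = random.Random(0)
z = truncated_normal(dim=128, threshold=0.5, rng=rng)
```

A smaller `threshold` keeps latents near the high-density center of the prior, which tends to raise per-sample fidelity and reduce variety; a very large threshold recovers ordinary sampling.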

VQGAN

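Vector quantization: Encoder outputs are snapped to the nearest entry of a learned codebook, giving a discrete latent space
Transformer prior: An autoregressive transformer models the sequence of codebook indices, capturing long-range spatial relationships
Perceptual and adversarial losses: Train the autoencoder for sharper reconstructions than a pixel-wise loss alone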
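The quantization step is a nearest-neighbor lookup in the codebook. Here is a minimal sketch with a hypothetical three-entry codebook of 2-D codes; real codebooks are learned and far larger, and training additionally needs a gradient workaround that is omitted here.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` in squared L2."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

# Hypothetical codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, code = quantize([0.9, 1.2], codebook)
```

The resulting indices are the discrete tokens that a sequence model can then be trained on.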

Evaluation Metrics

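Metric	What it Measures	How to Compute
FID	Quality + diversity	Fréchet distance between Gaussians fitted to Inception features of real and generated images (lower is better)
IS	Quality + class diversity	Exponentiated mean KL divergence between per-image class predictions and their marginal (higher is better); ignores the real data and collapse within a class
LPIPS	Perceptual similarity	Distance between deep network features (e.g., VGG) of two images
Precision/Recall	Quality vs diversity	Fraction of generated samples inside the real-feature manifold (precision) and of real samples inside the generated one (recall)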
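For reference, the two most commonly quoted metrics can be written out as follows. The notation is introduced here and is not from the original text: $(\boldsymbol{\mu}_r, \Sigma_r)$ and $(\boldsymbol{\mu}_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, and $p(y \mid \mathbf{x})$ is the Inception classifier's label distribution for image $\mathbf{x}$.

```latex
\mathrm{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)

\mathrm{IS} = \exp\!\left(\mathbb{E}_{\mathbf{x} \sim p_g}
  \left[ D_{\mathrm{KL}}\!\left(p(y \mid \mathbf{x}) \,\|\, p(y)\right) \right]\right),
\qquad p(y) = \mathbb{E}_{\mathbf{x} \sim p_g}\left[p(y \mid \mathbf{x})\right]
```

FID compares generated features against real ones, whereas IS looks only at generated samples, which is why the two can disagree.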

Follow-Up Questions

Q: How does StyleGAN achieve style mixing? A: By mixing latent codes at different layers of the generator, you can control different aspects: coarse styles (pose) from early layers, fine styles (color) from later layers.
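As a rough illustration of the answer above, style mixing can be expressed as choosing, per generator layer, which of two style vectors to feed in. The helper `mix_styles`, the layer count, and the two-element style vectors are all hypothetical.

```python
def mix_styles(w_a, w_b, num_layers, crossover):
    """Per-layer styles: w_a for layers [0, crossover), w_b for the remaining layers."""
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]

w_a = [0.1, 0.2]  # style of source A (drives coarse layers)
w_b = [0.9, 0.8]  # style of source B (drives fine layers)
styles = mix_styles(w_a, w_b, num_layers=6, crossover=2)
```

Moving `crossover` later hands more of the image, from pose down to finer detail, to source A.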

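Q: What is the relationship between GANs and VAEs? A: Both are latent-variable generative models but with different objectives. GANs minimize a divergence between the generated and data distributions through an adversarial game; VAEs maximize a variational lower bound (ELBO) on the data likelihood. GANs typically produce sharper images; VAEs typically cover the data distribution better.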

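Q: Can GANs generate text? A: It is difficult because text is discrete: sampling a token is non-differentiable, so the discriminator's gradient cannot flow back into the generator directly. Most text generation uses autoregressive models (GPT) or diffusion models. GANs have fared better on continuous signals, e.g., GANSynth for musical audio.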