Generative Models

GANs Deep Dive — Mastering Generative Adversarial Networks

GANs revolutionized generative modeling by pitting two neural networks against each other in a minimax game. The generator creates realistic samples while the discriminator tries to tell real from fake, driving both to improve until generated data is indistinguishable from reality.

Key point 1: The minimax game reaches its equilibrium when the generator's distribution matches the data distribution (Nash equilibrium)
Key point 2: DCGAN, WGAN, and StyleGAN each address a different training or quality challenge
Key point 3: FID (Fréchet Inception Distance) is the most widely used metric for evaluating generation quality

"In the battle between generator and discriminator, everyone wins."

GANs learn to generate realistic data by pitting two neural networks against each other in a minimax game: a generator creates fake samples, and a discriminator tries to distinguish real from fake.

The GAN Framework

Definition: GAN Framework

A GAN consists of:

Generator $G$: Maps random noise $z \sim p_z(z)$ to fake data $G(z)$
Discriminator $D$: Outputs the probability that its input is real data rather than a generated sample

The two networks compete: $G$ tries to fool $D$, while $D$ tries to correctly classify real vs. fake. In the ideal case, training converges when $G$ produces data that $D$ cannot distinguish from real data.

GAN Minimax Objective

\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Discriminator Objective

\max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Here,

$D(x)$ = Discriminator's estimated probability that $x$ is real
$G(z)$ = Generator's fake sample produced from noise $z$
$p_{\text{data}}$ = Real data distribution
$p_z$ = Prior noise distribution (e.g., Gaussian)

Generator Objective (Non-saturating)

\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]

Here,

$\log D(G(z))$ = Log-probability the discriminator assigns to a fake sample being real; the generator wants $D(G(z)) \to 1$

How this diagram works: This diagram illustrates the adversarial game at the heart of GANs. The generator (green, left) takes random noise $z$ and produces fake data $G(z)$, attempting to mimic real data. The discriminator (red, right) receives both real data $x$ and fake samples $G(z)$, and outputs a probability indicating whether each input is real or generated. The dashed feedback arrows show the competing objectives: the generator loss pushes $G$ to fool $D$ (making $D(G(z)) \to 1$), while the discriminator loss pushes $D$ to correctly classify both real and fake inputs. Training reaches Nash equilibrium when the generator perfectly matches the real data distribution ($p_G = p_{\text{data}}$) and the discriminator can no longer tell them apart, outputting $D(x) = 0.5$ for all inputs.

Nash Equilibrium

Theorem: Global Optimum of GAN

The global optimum of the minimax game is achieved when:

p_G = p_{\text{data}}

and the optimal discriminator is:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}

At this point, $V(D^*, G) = -\log 4$ and the generator perfectly matches the data distribution.
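The value $-\log 4$ can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-outcome distributions and the helper names `optimal_discriminator` and `value` are assumptions made here, and the second expectation is written over $x \sim p_G$ rather than over the noise $z$.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value(p_data, p_g, d):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_G}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d) if pg > 0)
    return real_term + fake_term

p_data = [0.5, 0.3, 0.2]    # toy data distribution (assumed for illustration)
p_g_bad = [0.2, 0.2, 0.6]   # a generator that does not match the data
p_g_opt = list(p_data)      # a generator at the global optimum

d_opt = optimal_discriminator(p_data, p_g_opt)
v_bad = value(p_data, p_g_bad, optimal_discriminator(p_data, p_g_bad))
v_opt = value(p_data, p_g_opt, d_opt)

print(d_opt)                 # 0.5 for every outcome
print(v_bad, v_opt, -math.log(4))
```

With the optimal discriminator plugged in, the mismatched generator gives a value above $-\log 4$, and the matching generator gives exactly $-\log 4$, which is the minimum the generator can reach.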

Interpretation

When $p_G = p_{\text{data}}$, the discriminator cannot distinguish real from fake and outputs $D(x) = 0.5$ everywhere. The game reaches a Nash equilibrium where neither player can improve by changing strategy unilaterally.

Training Challenges

Definition: Mode Collapse

The generator learns to produce only a few types of outputs that fool the discriminator, ignoring the diversity of the real data distribution. This is one of the most common failure modes of GANs.

Symptoms: Generator produces very similar outputs regardless of input noise.

Definition: Training Instability

GAN training is inherently unstable because:

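Non-convergence: Alternating gradient updates are not guaranteed to reach the equilibrium
Vanishing gradients: When $D$ is too good, $D(G(z)) \approx 0$ and $\log(1-D(G(z)))$ saturates, so $G$ receives almost no gradient
Oscillation: $G$ and $D$ may cycle through the same states without converging
Mode collapse: $G$ maps most inputs to the same few outputs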
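The vanishing-gradient point can be made concrete by comparing the two generator losses as functions of the discriminator's pre-sigmoid output. This is a minimal sketch under the assumption that $D(G(z)) = \sigma(a)$ for a logit $a$; the function names are made up for illustration, and the derivatives are taken with respect to $a$ only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss G minimizes in the original minimax game."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating loss G minimizes instead."""
    return sigmoid(a) - 1.0

# Negative logits mean the discriminator confidently rejects the fake sample
for logit in (-10.0, -2.0, 0.0, 2.0):
    print(f"D(G(z))={sigmoid(logit):.5f}  "
          f"|saturating grad|={abs(grad_saturating(logit)):.5f}  "
          f"|non-saturating grad|={abs(grad_non_saturating(logit)):.5f}")
```

When the discriminator is confident that a sample is fake (very negative logit), the saturating gradient is close to zero while the non-saturating one stays close to 1 in magnitude, which is why the non-saturating form is usually preferred early in training.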

DCGAN (Deep Convolutional GAN)

Definition: DCGAN Architecture

DCGAN (Radford et al., 2015) proposed architectural guidelines that make GAN training noticeably more stable:

Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
Use batch normalization in both networks, except at the generator output and the discriminator input
Remove fully connected hidden layers
Use ReLU activations in the generator, with Tanh at the output
Use LeakyReLU activations in the discriminator

Transposed Convolution Output Size

\text{out} = (\text{in} - 1) \times \text{stride} - 2 \times \text{padding} + \text{kernel}

Here,

$\text{in}$ = Input spatial dimension
$\text{stride}$ = Stride of the transposed convolution
$\text{padding}$ = Padding
$\text{kernel}$ = Kernel size
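As a quick check of the formula, the sketch below applies it layer by layer. The helper name `conv_transpose_out` is made up for illustration, the formula is assumed to apply with no output padding and dilation 1, and the layer settings (kernel 4, stride 1 or 2) are the ones used by the generator in this article's DCGAN example.

```python
def conv_transpose_out(size_in, kernel, stride=1, padding=0):
    """Output size of a transposed convolution (no output_padding, dilation = 1)."""
    return (size_in - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for the five generator layers of a 64x64 DCGAN
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # the latent vector is treated as a 1x1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

With kernel 4, stride 2, and padding 1, each layer exactly doubles the spatial size, which is why this combination appears so often in DCGAN-style generators.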

WGAN (Wasserstein GAN)

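Definition: WGAN

WGAN (Arjovsky et al., 2017) replaces the JS divergence with the Wasserstein distance (Earth-Mover distance) for more stable training:

Uses the Wasserstein distance: $W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$, where $\Pi(p_{\text{data}}, p_G)$ is the set of joint distributions with marginals $p_{\text{data}}$ and $p_G$
The discriminator becomes a "critic": it outputs an unbounded scalar score, not a probability
The critic must be 1-Lipschitz, which is enforced by weight clipping (original WGAN) or a gradient penalty (WGAN-GP); WGAN-GP also avoids batch norm in the critic
The critic loss correlates with sample quality, so it is a meaningful signal during training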

WGAN Loss

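\mathcal{L} = \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]

The critic $D$ maximizes $\mathcal{L}$, while the generator $G$ minimizes it, which amounts to maximizing $\mathbb{E}_{z \sim p_z}[D(G(z))]$.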

WGAN-GP Gradient Penalty

Instead of weight clipping (WGAN), use gradient penalty (WGAN-GP):

\mathcal{L}_{\text{GP}} = \lambda \mathbb{E}_{\hat{x}} \left[ \left( \| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1 \right)^2 \right]

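where $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ with $\epsilon \sim U[0, 1]$ is a random interpolation between a real and a fake sample. The penalty is added to the critic loss and softly enforces the 1-Lipschitz constraint by pushing the gradient norm toward 1.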
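One way the penalty term could be computed in PyTorch is sketched below. The function name `gradient_penalty`, the toy critic, and the random tensors standing in for real and generated batches are all assumptions for illustration; `lam=10.0` is a commonly used default rather than a required value.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty: lam * E[(||grad_xhat D(xhat)||_2 - 1)^2]."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over the other dims
    eps_shape = [batch_size] + [1] * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    x_hat = (eps * real.detach() + (1.0 - eps) * fake.detach()).requires_grad_(True)

    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores,
        inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Toy critic and data, for illustration only
torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(8, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
real = torch.randn(4, 8)
fake = torch.randn(4, 8)

# Critic loss to minimize: -(E[D(real)] - E[D(fake)]) + gradient penalty
gp = gradient_penalty(critic, real, fake)
loss_critic = critic(fake).mean() - critic(real).mean() + gp
loss_critic.backward()
```

Passing `create_graph=True` matters here: the penalty depends on the critic's parameters through the input gradient, so the optimizer step needs second-order terms from that graph.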

StyleGAN

Definition: StyleGAN Architecture

StyleGAN (Karras et al., 2019) introduces a style-based generator architecture:

Mapping network: $z \to w$ (8 FC layers) maps the latent code to an intermediate style space
Adaptive instance normalization (AdaIN): Injects the style at each layer
Noise injection: Adds per-pixel noise for stochastic variation
Progressive growing: Trains at gradually increasing resolution

This enables disentangled control over high-level attributes (pose, identity) and stochastic variation (hair, freckles).

AdaIN (Adaptive Instance Normalization)

\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}

Here,

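$x_i$ = $i$-th feature map (channel) of the layer's activations
$y_{s,i}$ = Style scale for channel $i$ (from $w$)
$y_{b,i}$ = Style bias for channel $i$ (from $w$)
$\mu(x_i), \sigma(x_i)$ = Mean and standard deviation of feature map $x_i$ over its spatial positions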
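The AdaIN operation can be sketched in a few lines of PyTorch. This is a simplified illustration, not StyleGAN's actual code: the function name `adain` and the small `eps` for numerical stability are assumptions, and the style scale and bias are passed in directly, whereas in StyleGAN they come from a learned affine transform of $w$.

```python
import torch

def adain(x, y_scale, y_bias, eps=1e-5):
    """AdaIN for a batch of feature maps.

    x:       (N, C, H, W) feature maps
    y_scale: (N, C) per-channel style scales y_s
    y_bias:  (N, C) per-channel style biases y_b
    """
    # Statistics are computed per sample and per channel, over spatial positions
    mu = x.mean(dim=(2, 3), keepdim=True)
    var = ((x - mu) ** 2).mean(dim=(2, 3), keepdim=True)
    x_norm = (x - mu) / torch.sqrt(var + eps)
    return y_scale[:, :, None, None] * x_norm + y_bias[:, :, None, None]

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)
y_s = torch.full((2, 3), 2.0)   # assumed style scale
y_b = torch.full((2, 3), 0.5)   # assumed style bias
out = adain(x, y_s, y_b)

# After AdaIN each channel has mean close to y_b and std close to y_s
out_mean = out.mean(dim=(2, 3))
out_std = ((out - out.mean(dim=(2, 3), keepdim=True)) ** 2).mean(dim=(2, 3)).sqrt()
print(out_mean)
print(out_std)
```

Because the normalization removes each channel's own mean and variance, the statistics of the output are set entirely by the style, which is what lets the style control the appearance at that layer.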

PyTorch Implementation

Example: DCGAN

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # Latent -> 512 x 4 x 4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(True),
            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 64 x 32 x 32 -> 3 x 64 x 64
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, z):
        return self.main(z.view(z.size(0), -1, 1, 1))


class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # 3 x 64 x 64 -> 64 x 32 x 32
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 x 32 x 32 -> 128 x 16 x 16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 x 16 x 16 -> 256 x 8 x 8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 256 x 8 x 8 -> 512 x 4 x 4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # 512 x 4 x 4 -> 1
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
        )

    def forward(self, x):
        return self.main(x).view(-1)


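# Training loop. Assumes `dataloader` (not defined here) yields (images, labels) batches of 64x64 images scaled to [-1, 1].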
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
G = Generator(100, 3).to(device)
D = Discriminator(3).to(device)

opt_G = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
criterion = nn.BCEWithLogitsLoss()

for epoch in range(100):
    for real, _ in dataloader:
        real = real.to(device)
        batch_size = real.size(0)

        # Train Discriminator
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z).detach()
        loss_D = criterion(D(real), torch.ones(batch_size, device=device)) + \
                 criterion(D(fake), torch.zeros(batch_size, device=device))

        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Train Generator
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z)
        loss_G = criterion(D(fake), torch.ones(batch_size, device=device))

        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()

Training Tips

GAN Training Best Practices

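Use one-sided label smoothing: Real labels = 0.9, fake labels stay at 0 (reduces discriminator overconfidence)
Balance the two networks: Train D more often than G (e.g., 5:1 ratio, as in WGAN), or give D and G separate learning rates (two-time-scale update rule, TTUR)
Spectral normalization: Apply to D weights for stable training
Progressive growing: Start with low resolution, increase gradually
Track FID/IS: Fréchet Inception Distance (lower is better) is the most widely used evaluation metric; Inception Score (higher is better) is a common complement
Avoid batch norm in a WGAN-GP critic: Use layer norm or instance norm instead, because the gradient penalty is computed per sample
Adam optimizer: $\beta_1 = 0.5$, $\beta_2 = 0.999$, learning rate $2 \times 10^{-4}$ (the DCGAN settings)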
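Two of these tips, spectral normalization and the Adam settings, can be combined as in the sketch below. The small discriminator for 32x32 RGB inputs is hypothetical and only meant to show where the `spectral_norm` wrapper from `torch.nn.utils` goes.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Hypothetical small discriminator for 32x32 RGB inputs, for illustration only
disc = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),  # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),    # one logit per image
)

# Adam settings from the list above
opt_D = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))

logits = disc(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 1])
```

Spectral normalization rescales each wrapped weight by an estimate of its largest singular value on every forward pass, so no batch statistics are involved in the discriminator.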

Practice Exercises

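Train DCGAN on CIFAR-10: Generate realistic images (CIFAR-10 images are 32x32, so resize them to 64x64 or drop one up/downsampling stage from the example networks). Monitor FID over training.
WGAN-GP implementation: Replace BCE loss with Wasserstein loss + gradient penalty. Compare training stability.
Mode collapse experiment: Train a GAN on MNIST and look for mode collapse (for example, only a few digit classes being generated). Try to mitigate it with minibatch discrimination.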
Style mixing: Implement StyleGAN and experiment with style mixing at different layers.

Key Takeaways

Summary: GANs

GANs consist of a generator $G$ and a discriminator $D$ playing a minimax game
Nash equilibrium: $p_G = p_{\text{data}}$, $D(x) = 0.5$
Non-saturating loss: minimize $-\log D(G(z))$ instead of $\log(1-D(G(z)))$
DCGAN: Architectural guidelines for stable training
WGAN: Wasserstein distance for better training dynamics
StyleGAN: Style-based generation with disentangled controls
Mode collapse and training instability are the main challenges
FID is the most widely used evaluation metric
GANs excel at image synthesis, style transfer, super-resolution
See also: GANs in ML for fundamentals

What to Learn Next

-> Variational Autoencoders Learn probabilistic latent representations with encoder-decoder architectures.

-> Diffusion Models Deep Dive Explore the math behind gradual noising and denoising for image generation.

-> Self-Supervised Learning Learn useful representations from unlabeled data without manual annotation.

-> DL Systems Design Master distributed training, monitoring, and production deployment of deep learning models.

-> Model Compression Make deep learning models fast and efficient for production deployment.

-> Graph Neural Networks Learn from graph-structured data with message passing and attention mechanisms.
