Generative Models
GANs Deep Dive — Mastering Generative Adversarial Networks
GANs revolutionized generative modeling by pitting two neural networks against each other in a minimax game. The generator creates realistic samples while the discriminator tries to tell real from fake, driving both to improve until generated data is indistinguishable from reality.
- Key point 1 — Minimax game converges when generator matches data distribution (Nash equilibrium)
- Key point 2 — DCGAN, WGAN, and StyleGAN each solve different training challenges
- Key point 3 — FID score is the standard metric for evaluating generation quality
"In the battle between generator and discriminator, everyone wins."
GANs learn to generate realistic data by pitting two neural networks against each other in a minimax game: a generator creates fake samples, and a discriminator tries to distinguish real from fake.
The GAN Framework
Definition: GAN Framework
A GAN consists of:
- Generator $G$: Maps random noise $z$ to fake data $G(z)$
- Discriminator $D$: Outputs the probability $D(x) \in [0, 1]$ that input $x$ is real data
The two networks compete: $G$ tries to fool $D$, while $D$ tries to correctly classify real vs. fake. Training converges when $G$ produces data indistinguishable from real data.
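Discriminator Objective
$$\max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Here,
- $D(x)$ = Discriminator's estimate that $x$ is real
- $G(z)$ = Generator's fake sample from noise $z$
- $p_{\text{data}}$ = Real data distribution
- $p_z$ = Prior noise distribution (e.g., Gaussian)
Generator Objective (Non-saturating)
$$\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
Here,
- $D(G(z)) \to 1$ = Generator wants discriminator to output 1 for fakes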
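To see why the non-saturating form is usually preferred, the sketch below compares the gradient that each generator loss sends back through the discriminator's logit. It assumes the discriminator output is a sigmoid of a scalar logit; the function names are illustrative and not part of any library.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    # Generator minimizes log(1 - sigmoid(a)); d/da = -sigmoid(a)
    return -sigmoid(logit)

def non_saturating_grad(logit):
    # Generator minimizes -log(sigmoid(a)); d/da = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(logit))

# A very negative logit means the discriminator is confident the sample is fake,
# which is the typical situation early in training.
for logit in (-6.0, 0.0, 6.0):
    print(f"logit={logit:+.1f}  "
          f"saturating={saturating_grad(logit):+.4f}  "
          f"non-saturating={non_saturating_grad(logit):+.4f}")
```

When the discriminator confidently rejects a fake (logit far below zero), the saturating loss yields a gradient close to zero, while the non-saturating loss keeps it close to -1, so the generator still receives a useful learning signal.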
How this diagram works: This diagram illustrates the adversarial game at the heart of GANs. The generator (green, left) takes random noise $z$ and produces fake data $G(z)$, attempting to mimic real data. The discriminator (red, right) receives both real data $x$ and fake samples $G(z)$, and outputs a probability indicating whether each input is real or generated. The dashed feedback arrows show the competing objectives: the generator loss pushes $G$ to fool $D$ (making $D(G(z)) \to 1$), while the discriminator loss pushes $D$ to correctly classify both real and fake inputs. Training reaches Nash equilibrium when the generator perfectly matches the real data distribution ($p_g = p_{\text{data}}$) and the discriminator can no longer tell them apart, outputting $D(x) = \tfrac{1}{2}$ for all inputs.
Nash Equilibrium
Theorem: Global Optimum of GAN
The global optimum of the minimax game is achieved when:
$$p_g = p_{\text{data}}$$
and the optimal discriminator is:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
At this point, $D^*(x) = \tfrac{1}{2}$ and the generator perfectly matches the data distribution.
Interpretation
When $p_g = p_{\text{data}}$, the discriminator cannot distinguish real from fake and outputs $\tfrac{1}{2}$ everywhere. The game reaches a Nash equilibrium where neither player can improve by changing strategy unilaterally.
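The optimum can be checked numerically on a toy problem. The sketch below assumes a finite sample space with four outcomes and uses the standard closed form for the best discriminator against a fixed generator, $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$; the distributions and function names are made up for illustration.

```python
import math

def optimal_discriminator(p_data, p_g):
    # Best response of D to a fixed generator, evaluated pointwise.
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at_optimum(p_data, p_g):
    # V(D*, G) = E_data[log D*(x)] + E_g[log(1 - D*(x))]
    total = 0.0
    for pd, pg, d in zip(p_data, p_g, optimal_discriminator(p_data, p_g)):
        if pd > 0:
            total += pd * math.log(d)
        if pg > 0:
            total += pg * math.log(1.0 - d)
    return total

p_data = [0.25, 0.25, 0.25, 0.25]
matched = [0.25, 0.25, 0.25, 0.25]    # generator matches the data
collapsed = [1.0, 0.0, 0.0, 0.0]      # generator covers a single outcome

print(optimal_discriminator(p_data, matched))   # [0.5, 0.5, 0.5, 0.5]
print(value_at_optimum(p_data, matched))        # equals -log(4)
print(value_at_optimum(p_data, collapsed))      # larger: D can tell them apart
```

The matched generator drives the value down to $-\log 4$, the lowest the generator can achieve; any mismatch leaves the discriminator with a higher value.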
Training Challenges
Definition: Mode Collapse
The generator learns to produce only a few types of outputs that fool the discriminator, ignoring the diversity of the real data distribution. This is one of the most common failure modes of GANs.
Symptoms: Generator produces very similar outputs regardless of input noise.
Definition: Training Instability
GAN training is inherently unstable because:
- Non-convergence: Alternating optimization may not converge
- Vanishing gradients: When $D$ is too good, $\log(1 - D(G(z)))$ saturates
- Oscillation: $G$ and $D$ may cycle without converging
- Mode collapse: $G$ maps many different inputs to the same output
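The non-convergence and oscillation issues already show up in the simplest possible two-player game. The sketch below is a toy illustration, not a GAN: it assumes a bilinear objective $f(x, y) = xy$ that one player minimizes over $x$ and the other maximizes over $y$, with both players taking simultaneous gradient steps. The function name and step size are arbitrary choices.

```python
import math

def simultaneous_gda(x, y, lr, steps):
    """Gradient descent on x and gradient ascent on y for f(x, y) = x * y."""
    radii = []
    for _ in range(steps):
        grad_x, grad_y = y, x                        # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y      # simultaneous update
        radii.append(math.hypot(x, y))               # distance from (0, 0)
    return radii

radii = simultaneous_gda(x=1.0, y=0.0, lr=0.1, steps=100)
print(f"start: 1.000, after 100 steps: {radii[-1]:.3f}")
```

The unique equilibrium is at $(0, 0)$, yet each step multiplies the distance from it by $\sqrt{1 + \text{lr}^2}$, so the iterates spiral outward instead of converging.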
DCGAN (Deep Convolutional GAN)
Definition: DCGAN Architecture
DCGAN (Radford et al., 2015) established stable GAN training with architectural guidelines:
- Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
- Use batch normalization in both networks
- Remove fully connected hidden layers
- Use ReLU activation in generator (Tanh for output)
- Use LeakyReLU in discriminator
Transposed Convolution Output Size
$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k$$
Here,
- $H_{\text{in}}$ = Input spatial dimension
- $s$ = Stride of transposed convolution
- $p$ = Padding
- $k$ = Kernel size
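As a quick check, the standard relation $H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k$ (assuming no output padding and dilation 1) can be applied to a stack of five transposed convolutions with kernel size 4, the first with stride 1 and padding 0 and the rest with stride 2 and padding 1. The helper name below is illustrative.

```python
def conv_transpose_out(h_in, kernel, stride, padding):
    # Output size of a transposed convolution (output_padding=0, dilation=1).
    return (h_in - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for each layer of a DCGAN-style generator
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1          # the latent vector is treated as a 1 x 1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

Each stride-2 layer with kernel 4 and padding 1 exactly doubles the spatial size, which is why this configuration is a common choice for upsampling from 4 x 4 to 64 x 64.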
WGAN (Wasserstein GAN)
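Definition: WGAN
WGAN (Arjovsky et al., 2017) replaces the JS divergence with Wasserstein distance (Earth-Mover distance) for more stable training:
- Uses Wasserstein distance: $W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$
- Discriminator becomes a "critic": it outputs an unbounded scalar score, not a probability
- Weight clipping or a gradient penalty enforces the Lipschitz constraint on the critic; with a gradient penalty, batch norm is avoided in the critic
- The loss is meaningful: it correlates with sample quality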
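A one-dimensional toy case shows why the Wasserstein distance gives a more useful signal than the JS divergence. The sketch below assumes the real and generated data are point masses a distance `shift` apart; for equal-size 1-D samples the Wasserstein-1 distance is the mean absolute difference of the sorted values. Function names are illustrative.

```python
import math

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions (value -> prob)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(u):
        return sum(prob * math.log(prob / m[k]) for k, prob in u.items() if prob > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

real = [0.0, 0.0, 0.0, 0.0]
for shift in (4.0, 2.0, 1.0):
    fake = [x + shift for x in real]
    w = wasserstein_1d(real, fake)
    js = js_divergence({0.0: 1.0}, {shift: 1.0})
    print(f"shift={shift}: W1={w:.3f}, JS={js:.3f}")
```

As the fake samples move closer to the real ones, the Wasserstein distance shrinks proportionally, while the JS divergence stays at $\log 2$ as long as the supports do not overlap and so provides no gradient toward the data.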
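WGAN-GP Gradient Penalty
Instead of weight clipping (WGAN), use gradient penalty (WGAN-GP):
$$L_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$$
where $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ with $\epsilon \sim U[0, 1]$ is interpolated between real and fake samples. This enforces the Lipschitz constraint softly, as a penalty rather than a hard limit on the weights.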
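A minimal PyTorch sketch of the gradient penalty term follows. It assumes a critic that maps a batch to one scalar score per sample; the tiny linear critic, the tensor shapes, and the coefficient `lam = 10.0` (a commonly used value) are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """Mean of (||grad_x critic(x_hat)||_2 - 1)^2 over random interpolates."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over remaining dims.
    eps = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores,
        inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # keep the graph so the penalty can be backpropagated
    )
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

torch.manual_seed(0)
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))  # placeholder critic
real = torch.randn(4, 3, 8, 8)
fake = torch.randn(4, 3, 8, 8)
lam = 10.0

gp = gradient_penalty(critic, real, fake)
loss_critic = critic(fake).mean() - critic(real).mean() + lam * gp
loss_critic.backward()
```

In a full training loop `fake` would come from the generator with `.detach()` applied during the critic step, and the critic is typically updated several times per generator update.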
StyleGAN
Definition: StyleGAN Architecture
StyleGAN (Karras et al., 2019) introduces a style-based generator architecture:
- Mapping network: $f: \mathcal{Z} \to \mathcal{W}$ (8 FC layers) maps latent $z$ to style space $\mathcal{W}$
- Adaptive instance normalization (AdaIN): Injects style at each layer
- Noise injection: Per-pixel noise for stochastic variation
- Progressive growing: Train with increasing resolution
This enables disentangled control over high-level attributes (pose, identity) and stochastic variation (hair, freckles).
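AdaIN (Adaptive Instance Normalization)
$$\text{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
Here,
- $x_i$ = Feature map at layer $i$
- $y_{s,i}$ = Style scale (from $w$)
- $y_{b,i}$ = Style bias (from $w$)
- $\mu(x_i), \sigma(x_i)$ = Mean and std of feature map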
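A minimal PyTorch sketch of AdaIN follows: each channel of each sample is normalized to zero mean and unit variance over its spatial positions, then rescaled and shifted by the style. The function name, tensor shapes, and the small `eps` added for numerical stability are assumptions for illustration; in StyleGAN the scale and bias would come from a learned affine transform of $w$, which is omitted here.

```python
import torch

def adain(x, y_scale, y_bias, eps=1e-5):
    """x: (N, C, H, W) feature maps; y_scale, y_bias: (N, C) style parameters."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.var(dim=(2, 3), keepdim=True, unbiased=False).add(eps).sqrt()
    normalized = (x - mean) / std
    return y_scale[:, :, None, None] * normalized + y_bias[:, :, None, None]

torch.manual_seed(0)
x = torch.randn(2, 3, 16, 16) * 5.0 + 2.0   # arbitrary input statistics
y_scale = torch.full((2, 3), 0.5)
y_bias = torch.full((2, 3), -1.0)

out = adain(x, y_scale, y_bias)
print(out.mean(dim=(2, 3)))                  # close to y_bias
print(out.std(dim=(2, 3), unbiased=False))   # close to y_scale
```

After AdaIN the per-channel statistics of the output are set by the style alone, regardless of the statistics of the incoming feature map.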
PyTorch Implementation
Example: DCGAN
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # Latent -> 512 x 4 x 4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(True),
            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 64 x 32 x 32 -> 3 x 64 x 64
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, z):
        # Reshape (N, latent_dim) noise to (N, latent_dim, 1, 1)
        return self.main(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # 3 x 64 x 64 -> 64 x 32 x 32
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 x 32 x 32 -> 128 x 16 x 16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 x 16 x 16 -> 256 x 8 x 8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 256 x 8 x 8 -> 512 x 4 x 4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # 512 x 4 x 4 -> 1 (raw logit; the sigmoid is applied inside the loss)
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
        )

    def forward(self, x):
        return self.main(x).view(-1)

# Training loop
# NOTE: `dataloader` is not defined here. It is assumed to yield
# (images, labels) batches of 64 x 64 images scaled to [-1, 1],
# matching the generator's Tanh output range.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
G = Generator(100, 3).to(device)
D = Discriminator(3).to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
criterion = nn.BCEWithLogitsLoss()  # expects raw logits from D

for epoch in range(100):
    for real, _ in dataloader:
        real = real.to(device)
        batch_size = real.size(0)

        # Train Discriminator: real -> 1, fake -> 0
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z).detach()  # detach so no gradients flow into G
        loss_D = (criterion(D(real), torch.ones(batch_size, device=device))
                  + criterion(D(fake), torch.zeros(batch_size, device=device)))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Train Generator (non-saturating loss): make D output 1 for fakes
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z)
        loss_G = criterion(D(fake), torch.ones(batch_size, device=device))
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
Training Tips
GAN Training Best Practices
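- Use label smoothing: Real labels = 0.9 instead of 1.0 (reduces overconfidence). One-sided smoothing, which keeps fake labels at 0, is generally preferred over also raising fake labels to 0.1
- Two-time-scale update rule (TTUR): Use separate learning rates for D and G, typically a larger one for D. A related trick from WGAN is to train D more often than G (e.g., 5:1 ratio)
- Spectral normalization: Apply to D weights for stable training
- Progressive growing: Start with low resolution, increase gradually
- Track FID/IS: Fréchet Inception Distance (lower is better) is the standard evaluation metric; Inception Score is an older alternative
- Avoid batch norm in a WGAN-GP critic: The gradient penalty is computed per sample, so use layer norm or instance norm instead (DCGAN-style discriminators do use batch norm)
- Adam optimizer: $\beta_1 = 0.5$, $\beta_2 = 0.999$, learning rate $2 \times 10^{-4}$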
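The Fréchet distance behind FID compares two Gaussians fitted to feature statistics: $d^2 = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$. The sketch below is a deliberate simplification that assumes diagonal covariances, so no matrix square root is needed; real FID uses full covariance matrices of Inception features, and the function name and toy numbers here are illustrative.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal covariances the trace term reduces to a per-dimension sum.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

real_mu, real_var = [0.0, 0.0], [1.0, 1.0]

print(frechet_distance_diag(real_mu, real_var, [0.0, 0.0], [1.0, 1.0]))  # 0.0
print(frechet_distance_diag(real_mu, real_var, [1.0, 0.0], [1.0, 1.0]))  # 1.0
print(frechet_distance_diag(real_mu, real_var, [0.0, 0.0], [4.0, 1.0]))  # 1.0
```

The distance is zero only when both the means and the variances match, so it penalizes both a shift in the generated features and a loss of diversity.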
Practice Exercises
- Train DCGAN on CIFAR-10: Generate realistic images. Monitor FID over training.
- WGAN-GP implementation: Replace BCE loss with Wasserstein loss + gradient penalty. Compare training stability.
- Mode collapse experiment: Train a GAN on MNIST and observe mode collapse. Fix it with minibatch discrimination.
- Style mixing: Implement StyleGAN and experiment with style mixing at different layers.
Key Takeaways
Summary: GANs
- GANs consist of generator and discriminator in minimax game
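- Nash equilibrium: $p_g = p_{\text{data}}$, $D^*(x) = \tfrac{1}{2}$
- Non-saturating loss: maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$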
- DCGAN: Architectural guidelines for stable training
- WGAN: Wasserstein distance for better training dynamics
- StyleGAN: Style-based generation with disentangled controls
- Mode collapse and training instability are main challenges
- FID score is the standard evaluation metric
- GANs excel at image synthesis, style transfer, super-resolution
- See also: GANs in ML for fundamentals