Generative Models

VAEs — Learning Probabilistic Latent Representations

VAEs combine autoencoders with variational inference to learn smooth, continuous latent spaces. By encoding to distributions instead of points, they enable principled generation, meaningful interpolation, and disentangled representations — all with a stable training objective.

Key point 1 — ELBO loss balances reconstruction quality with latent space regularization
Key point 2 — Reparameterization trick enables backpropagation through stochastic sampling
Key point 3 — Beta-VAE and VQ-VAE extend the framework for disentanglement and discrete latents

"In the latent space, every point tells a story."

Variational Autoencoders — Deep Dive

VAEs are generative models that learn a smooth latent space by combining autoencoders with variational inference. Unlike GANs, they optimize an explicit lower bound on the data log-likelihood, which gives them a principled probabilistic framework for generation.

From Autoencoder to VAE

Definition: Autoencoder

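A standard autoencoder compresses input $x$ into a latent code $z$ (encoder) and reconstructs $x$ from that code (decoder). The mapping is deterministic and nothing constrains how codes are arranged, so the latent space may have gaps: regions that no training input maps to and that decode to implausible outputs. This makes generating new samples from arbitrary latent codes difficult.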

Definition: Variational Autoencoder (VAE)

A VAE (Kingma and Welling, 2014) learns a probabilistic latent space:

Encoder $q_\phi(z|x)$: Maps the input to a distribution (mean $\mu$, variance $\sigma^2$)
Decoder $p_\theta(x|z)$: Reconstructs the input from a sampled latent
Prior $p(z) = \mathcal{N}(0, I)$: Standard Gaussian

Instead of encoding to a point, a VAE encodes to a distribution. Sampling from this distribution during training, together with the KL regularization toward the prior, encourages the latent space to be smooth and continuous.

How this diagram works: The diagram shows the VAE architecture, which differs from a standard autoencoder by encoding inputs to a probability distribution rather than a fixed point. The encoder (green) maps input $x$ to the parameters of a Gaussian distribution: a mean $\mu$ and variance $\sigma^2$. Because a direct sampling step blocks gradients from reaching $\mu$ and $\sigma$, the reparameterization trick computes $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which makes the sample differentiable with respect to the encoder outputs. The decoder (red) then reconstructs the input from this sampled latent vector. The loss combines reconstruction quality with a KL divergence term that regularizes the latent space toward the standard Gaussian prior, keeping it smooth and continuous. This enables meaningful interpolation and generation by sampling from the prior.
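The data flow described above can be sketched in plain Python. This is a minimal illustration, not a trained model: the weight matrices `W_MU`, `W_LOGVAR`, and `W_DEC` are hypothetical fixed values, the encoder and decoder are single linear layers, and having the encoder predict $\log \sigma^2$ instead of $\sigma^2$ is a common convention assumed here, not something the text specifies.

```python
import math
import random

# Hypothetical toy weights: 4-dimensional input, 2-dimensional latent.
# In a real VAE these would be learned neural-network parameters.
W_MU = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2]]
W_LOGVAR = [[0.1, 0.0, -0.1, 0.2], [-0.2, 0.1, 0.0, 0.1]]
W_DEC = [[0.6, -0.4], [0.2, 0.9], [-0.5, 0.3], [0.8, 0.1]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(x):
    """Map input x to the parameters (mu, log sigma^2) of q(z|x)."""
    return matvec(W_MU, x), matvec(W_LOGVAR, x)


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def decode(z):
    """Map latent z to per-dimension Bernoulli means in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_DEC, z)]


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 0.0]

# Reconstruction path: encode, sample, decode.
mu, log_var = encode(x)
x_hat = decode(sample_latent(mu, log_var, rng))

# Generation path: skip the encoder and sample z from the prior N(0, I).
z_prior = [rng.gauss(0.0, 1.0) for _ in range(2)]
x_new = decode(z_prior)
```

Only the decoder and the prior are needed at generation time; the encoder is used during training and for reconstruction.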

Evidence Lower Bound (ELBO)

VAE Loss (Negative ELBO)

$$\mathcal{L}_{\text{VAE}} = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{KL Divergence}}$$

Reconstruction Loss

$$\mathcal{L}_{\text{recon}} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$

Here,

$q_\phi(z|x)$ = Encoder distribution (approximate posterior)
$p_\theta(x|z)$ = Decoder distribution (likelihood)
$\log p_\theta(x|z)$ = Log-likelihood of reconstruction
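What $-\log p_\theta(x|z)$ looks like in code depends on the likelihood chosen for the decoder, which the text leaves open. The sketch below assumes two common choices: independent Bernoulli outputs, where the term becomes binary cross-entropy, and a Gaussian with fixed variance, where it becomes a scaled squared error plus a constant. The function names are illustrative. In practice the expectation over $q_\phi(z|x)$ is usually estimated from a small number of sampled $z$ values, often just one per input.

```python
import math


def bernoulli_nll(x, probs):
    """-log p(x|z) for binary x under independent Bernoulli(probs)."""
    return -sum(xi * math.log(p) + (1 - xi) * math.log(1 - p)
                for xi, p in zip(x, probs))


def gaussian_nll(x, mean, sigma=1.0):
    """-log p(x|z) for real-valued x under N(mean, sigma^2 I)."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return 0.5 * sq_err / sigma ** 2 + 0.5 * d * math.log(2 * math.pi * sigma ** 2)


x = [1.0, 0.0, 1.0]
good = [0.9, 0.1, 0.8]   # decoder output close to x
bad = [0.4, 0.6, 0.5]    # decoder output far from x

nll_good = bernoulli_nll(x, good)
nll_bad = bernoulli_nll(x, bad)
```

A decoder output that assigns higher probability to the observed input yields a lower reconstruction loss, so `nll_good` is smaller than `nll_bad`.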

KL Divergence

$$D_{KL}(q_\phi(z|x) \| p(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Here,

$\mu_j$ = Mean of encoder output for dimension $j$
$\sigma_j^2$ = Variance of encoder output for dimension $j$
$J$ = Latent dimension
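The closed form above can be checked numerically. The sketch below implements it for a diagonal Gaussian parameterized by $\mu$ and $\log \sigma^2$ (the log-variance parameterization is an assumption made for numerical convenience) and compares it against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$. The example values of `mu` and `log_var` are arbitrary.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def kl_monte_carlo(mu, log_var, n_samples, rng):
    """Estimate the same KL as E_q[log q(z) - log p(z)] by sampling z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_var):
            z = m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + (z - m) ** 2 / math.exp(lv))
            log_p = -0.5 * (math.log(2 * math.pi) + z ** 2)
            total += log_q - log_p
    return total / n_samples


mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]

kl_exact = kl_to_standard_normal(mu, log_var)
kl_estimate = kl_monte_carlo(mu, log_var, 20000, random.Random(0))

# When q(z|x) equals the prior (mu = 0, sigma = 1), the KL term is zero.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

The sampled estimate should agree with the closed form up to Monte Carlo noise, which is why implementations typically use the closed form directly.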

ELBO Derivation

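$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$$

Since KL divergence is always non-negative, the first term (the ELBO) is a lower bound on $\log p_\theta(x)$. Writing $p_\theta(x,z) = p_\theta(x|z)\,p(z)$ splits the ELBO into $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$, which is the negative of the VAE loss. Because $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO over $\phi$ shrinks the gap to the true posterior, while maximizing it over $\theta$ raises a lower bound on the data likelihood.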
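The decomposition can be verified exactly on a model small enough to enumerate. The sketch below assumes a hypothetical toy setup with a discrete latent $z \in \{0, 1, 2\}$ and a single observed $x$, so that every expectation is a finite sum. The probability values are arbitrary and chosen only for illustration.

```python
import math

# Hypothetical toy model: discrete latent z in {0, 1, 2}, one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x|z) evaluated at the observed x
q = [0.2, 0.5, 0.3]              # an arbitrary approximate posterior q(z|x)

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                    # p(x)
posterior = [j / evidence for j in joint]                # true posterior p(z|x)

# ELBO = E_q[log p(x, z) - log q(z|x)]
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

# Gap = KL(q(z|x) || p(z|x))
gap = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

log_evidence = math.log(evidence)

# With q equal to the true posterior the gap vanishes and the bound is tight.
elbo_tight = sum(pz * math.log(j / pz) for pz, j in zip(posterior, joint))
```

Here `elbo + gap` reproduces `log_evidence`, and `elbo` stays below it for any `q` other than the true posterior.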

Reparameterization Trick

Theorem: Reparameterization Trick

To backpropagate through a stochastic sampling operation, reparameterize the sampling as a deterministic transformation of noise:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This makes the sampling differentiable with respect to $\mu$ and $\sigma$ while maintaining the stochastic nature through $\epsilon$.

Reparameterization

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Here,

$\mu$ = Mean vector from encoder
$\sigma$ = Standard deviation from encoder
$\epsilon$ = Random noise (sampled in the forward pass and treated as a constant, so no gradient flows through it)
$\odot$ = Element-wise multiplication
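To see why this matters for gradients, the sketch below estimates the derivatives of $\mathbb{E}[f(z)]$ with respect to $\mu$ and $\sigma$ by applying the chain rule through $z = \mu + \sigma \epsilon$. It assumes a scalar latent and the test function $f(z) = z^2$, chosen only because its expectation $\mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$ to compare against. The derivatives are written out by hand here; a real implementation would rely on an automatic differentiation library.

```python
import random


def reparam_gradients(mu, sigma, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2], where z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise: sampled, then treated as a constant
        z = mu + sigma * eps        # deterministic function of mu and sigma
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


mu, sigma = 1.5, 0.8
g_mu, g_sigma = reparam_gradients(mu, sigma, 50000, random.Random(0))

# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
exact_mu, exact_sigma = 2 * mu, 2 * sigma
```

The estimates `g_mu` and `g_sigma` should land close to `exact_mu` and `exact_sigma`, which shows that gradients reach the distribution parameters even though $z$ is random.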
