
Variational Autoencoders — Deep Dive



VAEs — Learning Probabilistic Latent Representations

VAEs combine autoencoders with variational inference to learn smooth, continuous latent spaces. By encoding to distributions instead of points, they enable principled generation, meaningful interpolation, and disentangled representations — all with a stable training objective.

  • Key point 1 — ELBO loss balances reconstruction quality with latent space regularization
  • Key point 2 — Reparameterization trick enables backpropagation through stochastic sampling
  • Key point 3 — Beta-VAE and VQ-VAE extend the framework for disentanglement and discrete latents

"In the latent space, every point tells a story."


VAEs are generative models that learn a smooth latent space by combining autoencoders with variational inference. Unlike GANs, they provide a principled probabilistic framework for generation.


From Autoencoder to VAE

Definition: Autoencoder

A standard autoencoder compresses an input $x$ into a latent code $z$ (encoder) and reconstructs the input from that code (decoder). It learns a deterministic mapping, but the latent space may have gaps, which makes generation difficult.

Definition: Variational Autoencoder (VAE)

A VAE (Kingma and Welling, 2014) learns a probabilistic latent space:

  • Encoder $q_\phi(z|x)$: Maps the input to a distribution (mean $\mu$, variance $\sigma^2$)
  • Decoder $p_\theta(x|z)$: Reconstructs the input from a sampled latent
  • Prior $p(z) = \mathcal{N}(0, I)$: Standard Gaussian

Instead of encoding to a point, a VAE encodes to a distribution. Training on samples from this distribution, while the KL term pulls it toward the prior, encourages a smooth and continuous latent space.

[Figure: Variational Autoencoder (VAE) architecture. Input x → encoder q_φ(z|x) → mean μ and variance σ² → reparameterize z = μ + σ ⊙ ε → latent z → decoder p_θ(x|z) → reconstruction x̂. Loss = reconstruction term −log p_θ(x|z) plus KL term D_KL(q_φ(z|x) ‖ p(z)), which regularizes the latent space. The reparameterization trick makes sampling differentiable, which enables end-to-end training.]

How this diagram works: The diagram shows the VAE architecture, which differs from a standard autoencoder by encoding inputs to a probability distribution rather than a fixed point. The encoder (green) maps the input $x$ to the parameters of a Gaussian distribution: a mean $\mu$ and a variance $\sigma^2$. Instead of sampling directly (which is non-differentiable), the reparameterization trick computes $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, making the process differentiable for backpropagation. The decoder (red) then reconstructs the input from this sampled latent vector. The loss combines reconstruction quality with a KL divergence term that regularizes the latent space toward a standard Gaussian prior. This keeps the latent space smooth and continuous, which enables meaningful interpolation and generation by sampling from the prior.
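The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trainable model: the helper names (`linear`, `vae_forward`), the tiny linear encoder and decoder, and the weight values are all hypothetical, and having the encoder output a log-variance instead of a variance is a common convention assumed here, not something the lesson specifies.

```python
import math
import random

def linear(weights, bias, x):
    """Return W x + b, with W given as a list of rows (hypothetical helper)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def vae_forward(x, params, rng):
    # Encoder: map x to the mean and log-variance of q(z|x).
    mu = linear(params["W_mu"], params["b_mu"], x)
    log_var = linear(params["W_logvar"], params["b_logvar"], x)
    # Reparameterize: z = mu + sigma * eps, with eps ~ N(0, I).
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    # Decoder: map the sampled latent back to input space.
    x_hat = linear(params["W_dec"], params["b_dec"], z)
    return mu, log_var, z, x_hat

rng = random.Random(0)
params = {  # illustrative values only
    "W_mu": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],      # 3-dim input -> 2-dim latent
    "b_mu": [0.0, 0.0],
    "W_logvar": [[0.1, 0.1, 0.1], [-0.2, 0.0, 0.2]],
    "b_logvar": [-1.0, -1.0],
    "W_dec": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],     # 2-dim latent -> 3-dim output
    "b_dec": [0.0, 0.0, 0.0],
}
x = [1.0, 2.0, 3.0]
mu, log_var, z, x_hat = vae_forward(x, params, rng)
print("latent dim:", len(z), "output dim:", len(x_hat))
```

In a real model the linear maps would be neural networks and the parameters would be learned by minimizing the VAE loss.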


Evidence Lower Bound (ELBO)

VAE Loss (Negative ELBO)
$$\mathcal{L}_{\text{VAE}} = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{KL Divergence}}$$

Reconstruction Loss

$$\mathcal{L}_{\text{recon}} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$

Here,

  • $q_\phi(z|x)$ = Encoder distribution (approximate posterior)
  • $p_\theta(x|z)$ = Decoder distribution (likelihood)
  • $\log p_\theta(x|z)$ = Log-likelihood of the reconstruction
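In practice the expectation is usually estimated with a single sample of $z$, and the form of $-\log p_\theta(x|z)$ depends on the decoder distribution one assumes. The sketch below shows two common choices that the lesson does not itself prescribe: a Gaussian decoder with fixed standard deviation (squared error plus a constant) and a Bernoulli decoder (binary cross-entropy). The function names and the example values are hypothetical.

```python
import math

def gaussian_nll(x, x_hat, sigma=1.0):
    """-log N(x; x_hat, sigma^2 I), summed over dimensions."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + 0.5 * ((xi - xh) / sigma) ** 2 for xi, xh in zip(x, x_hat))

def bernoulli_nll(x, probs, eps=1e-12):
    """-log Bernoulli(x; probs), summed over dimensions (binary cross-entropy)."""
    return -sum(
        xi * math.log(p + eps) + (1.0 - xi) * math.log(1.0 - p + eps)
        for xi, p in zip(x, probs)
    )

x = [1.0, 0.0, 1.0]            # a toy binary input
decoded = [0.9, 0.1, 0.8]      # decoder output for one sampled z (made-up values)
print("Gaussian NLL :", gaussian_nll(x, decoded))
print("Bernoulli NLL:", bernoulli_nll(x, decoded))
```

With a fixed `sigma`, minimizing the Gaussian term is equivalent to minimizing squared error, which is why mean squared error often appears as the reconstruction loss.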

KL Divergence

$$D_{KL}(q_\phi(z|x) \| p(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Here,

  • $\mu_j$ = Mean of encoder output for dimension $j$
  • $\sigma_j^2$ = Variance of encoder output for dimension $j$
  • $J$ = Latent dimension
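The closed form above translates directly into code. The sketch below assumes the encoder outputs the log-variance $\log \sigma_j^2$ (a common convention, not stated in the lesson), and the function name is hypothetical. The sum is written with the signs flipped, which is algebraically the same expression.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Equals -0.5 * sum(1 + log_var - mu^2 - exp(log_var)), written here
    as 0.5 * sum(mu^2 + exp(log_var) - log_var - 1).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# The KL term vanishes when q(z|x) already equals the prior...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...and grows as the mean moves away from 0 or the variance away from 1.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
print(kl_to_standard_normal([0.0, 0.0], [math.log(4.0), 0.0]))
```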

ELBO Derivation

$$\log p(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$$

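Since the KL divergence is always non-negative, the first term (the ELBO) is a lower bound on $\log p(x)$. Expanding $p_\theta(x,z) = p_\theta(x|z)\,p(z)$ shows that the ELBO equals $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$, so the VAE loss is the negative ELBO. Maximizing the ELBO raises a lower bound on the data log-likelihood and, for fixed decoder parameters $\theta$, shrinks the gap to the true posterior.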
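The decomposition can be checked numerically on a toy model where every term has a closed form. The model below is an assumption chosen for illustration, not part of the lesson: a scalar latent with prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(z, 1)$, which gives evidence $p(x) = \mathcal{N}(0, 2)$ and true posterior $p(z|x) = \mathcal{N}(x/2, 1/2)$. The approximate posterior is $q(z|x) = \mathcal{N}(m, s^2)$ with arbitrary $m$ and $s$.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for scalar Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

def elbo(x, m, s):
    """E_q[log p(x|z)] - KL(q || p(z)) for the toy model, with q = N(m, s^2)."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    return expected_log_lik - kl_gauss(m, s, 0.0, 1.0)

def log_evidence(x):
    """log p(x) for the toy model, where p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x, m, s = 1.0, 0.3, 0.8
gap = kl_gauss(m, s, x / 2.0, math.sqrt(0.5))   # KL(q || true posterior)
print("log p(x)   :", log_evidence(x))
print("ELBO + gap :", elbo(x, m, s) + gap)
print("ELBO       :", elbo(x, m, s))
```

The first two printed values agree, and the ELBO is smaller than both by exactly the KL gap. Setting `m = x / 2` and `s = math.sqrt(0.5)` makes the gap zero, so the bound becomes tight.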


Reparameterization Trick

Theorem: Reparameterization Trick

To backpropagate through a stochastic sampling operation, reparameterize the sampling as a deterministic transformation of noise:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This makes the sampling differentiable with respect to $\mu$ and $\sigma$ while maintaining the stochastic nature through $\epsilon$.

Reparameterization

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Here,

  • $\mu$ = Mean vector from encoder
  • $\sigma$ = Standard deviation from encoder
  • $\epsilon$ = Random noise (sampled in the forward pass; no gradient flows into it)
  • $\odot$ = Element-wise multiplication
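To see why this helps, note that once $z$ is written as $\mu + \sigma \epsilon$, the derivatives $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$ are available for every sample. The sketch below uses a scalar latent and a toy objective $f(z) = z^2$ (both chosen here for illustration) and averages the per-sample chain-rule gradients. Since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the estimates can be compared against the analytic gradients $2\mu$ and $2\sigma$. The function name is hypothetical.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradients of E[z^2] w.r.t. mu and sigma via z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)       # noise does not depend on mu or sigma
        z = mu + sigma * eps            # deterministic function of (mu, sigma, eps)
        grad_mu += 2.0 * z * 1.0        # d(z^2)/dmu    = 2z * dz/dmu,    dz/dmu = 1
        grad_sigma += 2.0 * z * eps     # d(z^2)/dsigma = 2z * dz/dsigma, dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.7)
print("estimated:", g_mu, g_sigma)
print("analytic :", 2 * 1.5, 2 * 0.7)
```

The estimates land close to the analytic values, up to Monte Carlo noise. Automatic differentiation frameworks apply the same chain rule through the sampled $z$ during backpropagation.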
