πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Bayesian Linear Regression

Advanced Statistical MethodsBayesian Methods🟒 Free Lesson

Advertisement

Bayesian Linear Regression

Advanced Statistical Methods

Regression With Full Uncertainty Quantification

Bayesian linear regression places prior distributions on regression coefficients and computes posterior distributions that capture full parameter uncertainty. Credible intervals provide direct probabilistic statements about parameters.

  • Clinical trials β€” Incorporate prior knowledge and quantify uncertainty in treatment effects
  • Engineering β€” Predict system performance with honest uncertainty bounds for safety-critical decisions
  • Economics β€” Combine historical data with current observations for more stable parameter estimates

Bayesian regression replaces point estimates with complete probability distributions over parameters.


Bayesian linear regression extends classical OLS by placing probability distributions on the regression coefficients and error variance, yielding full posterior distributions rather than point estimates.


The Bayesian Regression Model

DfBayesian Linear Regression Model

Yi=xiTΞ²+Ξ΅i,Ξ΅i∼N(0,Οƒ2)Y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

The Bayesian approach treats Ξ²\boldsymbol{\beta} and Οƒ2\sigma^2 as random variables with prior distributions, updating them via Bayes' theorem to obtain posterior distributions given the data D={(xi,yi)}i=1n\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n.

Posterior Distribution

p(Ξ²,Οƒ2∣D)=p(D∣β,Οƒ2) p(Ξ²,Οƒ2)p(D)p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2)}{p(\mathcal{D})}

Here,

  • p(D∣β,Οƒ2)p(\mathcal{D} \mid \boldsymbol{\beta}, \sigma^2)=Likelihood: product of Gaussian densities
  • p(Ξ²,Οƒ2)p(\boldsymbol{\beta}, \sigma^2)=Joint prior on coefficients and variance
  • p(D)p(\mathcal{D})=Marginal likelihood (evidence) β€” normalizing constant

Why Bayesian Regression?

The Bayesian approach provides: (1) full uncertainty quantification over parameters, (2) natural incorporation of prior knowledge, (3) exact finite-sample posterior distributions (no reliance on asymptotic approximations), and (4) coherent model comparison via marginal likelihoods.


Conjugate Prior: Normal-Inverse-Gamma

DfNormal-Inverse-Gamma Prior

For the linear model Y∣X,Ξ²,Οƒ2∼N(XΞ²,Οƒ2In)Y \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n), the conjugate prior is:

Ξ²βˆ£Οƒ2∼N(ΞΌ0,Οƒ2Ξ›0βˆ’1),Οƒ2∼Inv-Gamma(a0,b0)\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 \boldsymbol{\Lambda}_0^{-1}), \quad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)

where Ξ›0\boldsymbol{\Lambda}_0 is the prior precision matrix, ΞΌ0\boldsymbol{\mu}_0 is the prior mean, and a0,b0a_0, b_0 control the prior on Οƒ2\sigma^2.

ThPosterior under Normal-Inverse-Gamma Prior

Given nn observations, the posterior is also Normal-Inverse-Gamma:

Ξ²βˆ£Οƒ2,D∼N(ΞΌn,Οƒ2Ξ›nβˆ’1)\boldsymbol{\beta} \mid \sigma^2, \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \sigma^2 \boldsymbol{\Lambda}_n^{-1})
Οƒ2∣D∼Inv-Gamma(an,bn)\sigma^2 \mid \mathcal{D} \sim \text{Inv-Gamma}(a_n, b_n)

with updated parameters:

Ξ›n=Ξ›0+XTX,ΞΌn=Ξ›nβˆ’1(Ξ›0ΞΌ0+XTy)\boldsymbol{\Lambda}_n = \boldsymbol{\Lambda}_0 + \mathbf{X}^T\mathbf{X}, \quad \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^T\mathbf{y})
an=a0+n2,bn=b0+12(yTy+ΞΌ0TΞ›0ΞΌ0βˆ’ΞΌnTΞ›nΞΌn)a_n = a_0 + \frac{n}{2}, \quad b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^T\mathbf{y} + \boldsymbol{\mu}_0^T \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^T \boldsymbol{\Lambda}_n \boldsymbol{\mu}_n\right)

Prior Elicitation

Choosing Ξ›0\boldsymbol{\Lambda}_0 encodes prior confidence. A large Ξ›0\boldsymbol{\Lambda}_0 means strong prior belief β€” the posterior will be dominated by the prior. Weakly informative priors (small Ξ›0\boldsymbol{\Lambda}_0) let the data speak. In practice, Ξ›0=Ο„βˆ’1I\boldsymbol{\Lambda}_0 = \tau^{-1}\mathbf{I} with large Ο„\tau is common.


Posterior Inference for Coefficients

Marginal Posterior of Ξ²

β∣D∼t2an(ΞΌn,bnanΞ›nβˆ’1)\boldsymbol{\beta} \mid \mathcal{D} \sim t_{2a_n}\left(\boldsymbol{\mu}_n, \frac{b_n}{a_n} \boldsymbol{\Lambda}_n^{-1}\right)

Here,

  • ΞΌn\boldsymbol{\mu}_n=Posterior mean β€” weighted average of prior mean and OLS estimate
  • bnanLambdanβˆ’1\frac{b_n}{a_n} \boldsymbol{Lambda}_n^{-1}=Posterior scale matrix
  • 2an2a_n=Degrees of freedom of the marginal t-distribution

The marginal posterior of Ξ²\boldsymbol{\beta} is a multivariate tt-distribution. For large nn, this approaches a Gaussian:

Large-Sample Posterior

β∣Dβ‰ˆN(Ξ²^OLS, s2(XTX)βˆ’1)\boldsymbol{\beta} \mid \mathcal{D} \approx \mathcal{N}\left(\hat{\boldsymbol{\beta}}_{\text{OLS}}, \, s^2 (\mathbf{X}^T\mathbf{X})^{-1}\right)

Here,

  • Ξ²^OLS\hat{\boldsymbol{\beta}}_{\text{OLS}}=Ordinary least squares estimate
  • s2s^2=Residual variance estimate SSR/(n-p)

Connection to Frequentist

With a flat (improper) prior Ξ›0β†’0\boldsymbol{\Lambda}_0 \to \mathbf{0}, the Bayesian posterior mean equals the OLS estimate, and the posterior variance equals the frequentist sampling variance. The Bayesian credible interval with appropriate coverage coincides with the frequentist confidence interval.


Credible Intervals vs. Confidence Intervals

DfCredible Interval

A (1βˆ’Ξ±)(1-\alpha) credible interval for a parameter ΞΈ\theta is an interval [L,U][L, U] such that:

P(θ∈[L,U]∣D)=1βˆ’Ξ±P(\theta \in [L, U] \mid \mathcal{D}) = 1 - \alpha

This is a direct probability statement about the parameter β€” unlike the frequentist confidence interval.

PropertyCredible IntervalConfidence Interval
InterpretationP(θ∈[L,U]∣D)=1βˆ’Ξ±P(\theta \in [L,U] \mid \mathcal{D}) = 1-\alphaP(θ∈[L,U])=1βˆ’Ξ±P(\theta \in [L,U]) = 1-\alpha over repeated samples
Depends onPosterior + dataSampling distribution only
Prior informationYesNo
Finite-sample exactnessYes (with correct prior)Only asymptotically

Highest Posterior Density (HPD)

The equal-tailed credible interval uses quantiles of the posterior. The HPD interval is the narrowest interval containing 1βˆ’Ξ±1-\alpha probability β€” it is unique and preferentially used when the posterior is skewed.


Bayesian Prediction

Posterior Predictive Distribution

p(y~∣x0,D)=∫p(y~∣x0,Ξ²,Οƒ2) p(Ξ²,Οƒ2∣D) dβ dΟƒ2p(\tilde{y} \mid \mathbf{x}_0, \mathcal{D}) = \int p(\tilde{y} \mid \mathbf{x}_0, \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D}) \, d\boldsymbol{\beta} \, d\sigma^2

Here,

  • y~\tilde{y}=New observation at xβ‚€
  • p(y~∣x0,Ξ²,Οƒ2)p(\tilde{y} \mid \mathbf{x}_0, \boldsymbol{\beta}, \sigma^2)=Sampling model (Gaussian)
  • p(Ξ²,Οƒ2∣D)p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D})=Posterior over parameters

Under the Normal-Inverse-Gamma prior, the posterior predictive is a tt-distribution:

y~∣x0,D∼t2an(x0TΞΌn,β€…β€Šbnan(1+x0TΞ›nβˆ’1x0))\tilde{y} \mid \mathbf{x}_0, \mathcal{D} \sim t_{2a_n}\left(\mathbf{x}_0^T \boldsymbol{\mu}_n, \; \frac{b_n}{a_n}(1 + \mathbf{x}_0^T \boldsymbol{\Lambda}_n^{-1} \mathbf{x}_0)\right)

The +1+1 inside the parentheses accounts for parameter uncertainty β€” this is wider than the OLS prediction interval when the prior is weak and nn is small.


Bayesian Model Comparison

Bayes Factor

BF10=p(D∣M1)p(D∣M0)=∫p(D∣θ1,M1) p(ΞΈ1∣M1) dΞΈ1∫p(D∣θ0,M0) p(ΞΈ0∣M0) dΞΈ0\text{BF}_{10} = \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_0)} = \frac{\int p(\mathcal{D} \mid \boldsymbol{\theta}_1, \mathcal{M}_1) \, p(\boldsymbol{\theta}_1 \mid \mathcal{M}_1) \, d\boldsymbol{\theta}_1}{\int p(\mathcal{D} \mid \boldsymbol{\theta}_0, \mathcal{M}_0) \, p(\boldsymbol{\theta}_0 \mid \mathcal{M}_0) \, d\boldsymbol{\theta}_0}

Here,

  • p(D∣Mk)p(\mathcal{D} \mid \mathcal{M}_k)=Marginal likelihood (evidence) of model k
  • BF10\text{BF}_{10}=Evidence ratio: how much more likely data under M₁ vs Mβ‚€

Occam's Razor Built In

The Bayes factor naturally penalizes complex models because the marginal likelihood integrates over the entire prior β€” models with many parameters spread their prior mass thinly, incurring an automatic penalty. This is the Bayesian implementation of Occam's razor.


Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# --- Generate synthetic data ---
n = 80
X = np.random.randn(n, 2)
X = np.column_stack([np.ones(n), X])  # Add intercept
true_beta = np.array([1.5, -2.0, 0.8])
sigma_true = 1.5
y = X @ true_beta + np.random.randn(n) * sigma_true

# --- Bayesian Regression with PyMC ---
with pm.Model() as bayes_reg:
    # Priors
    beta = pm.Normal('beta', mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal('sigma', sigma=5)
    
    # Likelihood
    mu = pm.math.dot(X, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    
    # Posterior sampling
    trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True)

# --- Posterior summaries ---
print(az.summary(trace, var_names=['beta', 'sigma']))

# --- Credible intervals ---
for i, name in enumerate(['Intercept', 'X1', 'X2']):
    post_samples = trace.posterior['beta'].values[:, :, i].flatten()
    ci = np.percentile(post_samples, [2.5, 97.5])
    print(f"{name}: {ci[0]:.3f} to {ci[1]:.3f} (true={true_beta[i]:.3f})")

# --- Trace plots ---
az.plot_trace(trace, var_names=['beta', 'sigma'])
plt.tight_layout()
plt.savefig('bayesian_regression_trace.png', dpi=150)
plt.show()

# --- Posterior predictive ---
with bayes_reg:
    ppc = pm.sample_posterior_predictive(trace)
    
y_pred = ppc.posterior_predictive['y_obs'].values.reshape(-1, n)
y_mean = y_pred.mean(axis=0)
y_low = np.percentile(y_pred, 2.5, axis=0)
y_high = np.percentile(y_pred, 97.5, axis=0)

plt.figure(figsize=(10, 6))
order = np.argsort(X[:, 1])
plt.scatter(X[:, 1], y, alpha=0.5, label='Observed')
plt.plot(X[order, 1], y_mean[order], 'r-', label='Posterior mean')
plt.fill_between(X[order, 1], y_low[order], y_high[order], alpha=0.3, label='95% credible band')
plt.xlabel('X₁')
plt.ylabel('y')
plt.title('Bayesian Linear Regression β€” Posterior Predictive')
plt.legend()
plt.savefig('bayesian_regression_ppc.png', dpi=150)
plt.show()

Flat Prior Recovery

With flat priors (Ξ›0β†’0\boldsymbol{\Lambda}_0 \to \mathbf{0}), the Bayesian posterior mean recovers the OLS estimate exactly, and the 95% credible intervals approximate the frequentist confidence intervals for large nn.


Related Topics


Key Takeaways

Summary: Bayesian Linear Regression

  • Bayesian regression treats Ξ²\boldsymbol{\beta} as random β€” the posterior p(β∣D)p(\boldsymbol{\beta} \mid \mathcal{D}) encodes all uncertainty about the coefficients
  • Conjugate Normal-Inverse-Gamma prior yields tractable tt-distributed posteriors for Ξ²\boldsymbol{\beta} and Inverse-Gamma for Οƒ2\sigma^2
  • Credible intervals have a direct probabilistic interpretation: P(θ∈[L,U]∣D)=1βˆ’Ξ±P(\theta \in [L,U] \mid \mathcal{D}) = 1-\alpha
  • Bayesian prediction integrates over parameter uncertainty β€” prediction intervals are naturally wider for small samples
  • Bayes factors provide automatic model comparison with built-in Occam's razor
  • With flat priors, Bayesian and frequentist results coincide: posterior mean = OLS, credible interval β‰ˆ confidence interval
  • Weakly informative priors are preferred in practice β€” they regularize without dominating the likelihood
⭐

Premium Content

Bayesian Linear Regression

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement