Bayesian Linear Regression

Advanced Statistical Methods

Regression With Full Uncertainty Quantification

Bayesian linear regression places prior distributions on regression coefficients and computes posterior distributions that capture full parameter uncertainty. Credible intervals provide direct probabilistic statements about parameters.

Clinical trials — Incorporate prior knowledge and quantify uncertainty in treatment effects
Engineering — Predict system performance with honest uncertainty bounds for safety-critical decisions
Economics — Combine historical data with current observations for more stable parameter estimates

Bayesian regression replaces point estimates with complete probability distributions over parameters.

Bayesian linear regression extends classical OLS by placing probability distributions on the regression coefficients and error variance, yielding full posterior distributions rather than point estimates.

The Bayesian Regression Model

DfBayesian Linear Regression Model

Y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

The Bayesian approach treats $\boldsymbol{\beta}$ and $\sigma^2$ as random variables with prior distributions, updating them via Bayes' theorem to obtain posterior distributions given the data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ .

Posterior Distribution

p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2)}{p(\mathcal{D})}

Here,

$p(\mathcal{D} \mid \boldsymbol{\beta}, \sigma^2)$ =Likelihood: product of Gaussian densities
$p(\boldsymbol{\beta}, \sigma^2)$ =Joint prior on coefficients and variance
$p(\mathcal{D})$ =Marginal likelihood (evidence) — normalizing constant

Why Bayesian Regression?

The Bayesian approach provides: (1) full uncertainty quantification over parameters, (2) natural incorporation of prior knowledge, (3) exact finite-sample posterior distributions (no reliance on asymptotic approximations), and (4) coherent model comparison via marginal likelihoods.

Conjugate Prior: Normal-Inverse-Gamma

DfNormal-Inverse-Gamma Prior

For the linear model $Y \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$ , the conjugate prior is:

\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 \boldsymbol{\Lambda}_0^{-1}), \quad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)

where $\boldsymbol{\Lambda}_0$ is the prior precision matrix, $\boldsymbol{\mu}_0$ is the prior mean, and $a_0, b_0$ control the prior on $\sigma^2$ .

ThPosterior under Normal-Inverse-Gamma Prior

Given $n$ observations, the posterior is also Normal-Inverse-Gamma:

\boldsymbol{\beta} \mid \sigma^2, \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \sigma^2 \boldsymbol{\Lambda}_n^{-1})

\sigma^2 \mid \mathcal{D} \sim \text{Inv-Gamma}(a_n, b_n)

with updated parameters:

\boldsymbol{\Lambda}_n = \boldsymbol{\Lambda}_0 + \mathbf{X}^T\mathbf{X}, \quad \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^T\mathbf{y})

a_n = a_0 + \frac{n}{2}, \quad b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^T\mathbf{y} + \boldsymbol{\mu}_0^T \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^T \boldsymbol{\Lambda}_n \boldsymbol{\mu}_n\right)

Prior Elicitation

Choosing $\boldsymbol{\Lambda}_0$ encodes prior confidence. A large $\boldsymbol{\Lambda}_0$ means strong prior belief — the posterior will be dominated by the prior. Weakly informative priors (small $\boldsymbol{\Lambda}_0$ ) let the data speak. In practice, $\boldsymbol{\Lambda}_0 = \tau^{-1}\mathbf{I}$ with large $\tau$ is common.

Posterior Inference for Coefficients

Marginal Posterior of β

\boldsymbol{\beta} \mid \mathcal{D} \sim t_{2a_n}\left(\boldsymbol{\mu}_n, \frac{b_n}{a_n} \boldsymbol{\Lambda}_n^{-1}\right)

Here,

$\boldsymbol{\mu}_n$ =Posterior mean — weighted average of prior mean and OLS estimate
$\frac{b_n}{a_n} \boldsymbol{Lambda}_n^{-1}$ =Posterior scale matrix
$2a_n$ =Degrees of freedom of the marginal t-distribution

The marginal posterior of $\boldsymbol{\beta}$ is a multivariate $t$ -distribution. For large $n$ , this approaches a Gaussian:

Large-Sample Posterior

\boldsymbol{\beta} \mid \mathcal{D} \approx \mathcal{N}\left(\hat{\boldsymbol{\beta}}_{\text{OLS}}, \, s^2 (\mathbf{X}^T\mathbf{X})^{-1}\right)

Here,

$\hat{\boldsymbol{\beta}}_{\text{OLS}}$ =Ordinary least squares estimate
$s^2$ =Residual variance estimate SSR/(n-p)

Connection to Frequentist

With a flat (improper) prior $\boldsymbol{\Lambda}_0 \to \mathbf{0}$ , the Bayesian posterior mean equals the OLS estimate, and the posterior variance equals the frequentist sampling variance. The Bayesian credible interval with appropriate coverage coincides with the frequentist confidence interval.

Credible Intervals vs. Confidence Intervals

DfCredible Interval

A $(1-\alpha)$ credible interval for a parameter $\theta$ is an interval $[L, U]$ such that:

P(\theta \in [L, U] \mid \mathcal{D}) = 1 - \alpha

This is a direct probability statement about the parameter — unlike the frequentist confidence interval.

Property	Credible Interval	Confidence Interval
Interpretation	$P(\theta \in [L,U] \mid \mathcal{D}) = 1-\alpha$	$P(\theta \in [L,U]) = 1-\alpha$ over repeated samples
Depends on	Posterior + data	Sampling distribution only
Prior information	Yes	No
Finite-sample exactness	Yes (with correct prior)	Only asymptotically

Highest Posterior Density (HPD)

The equal-tailed credible interval uses quantiles of the posterior. The HPD interval is the narrowest interval containing $1-\alpha$ probability — it is unique and preferentially used when the posterior is skewed.

Bayesian Prediction

Posterior Predictive Distribution

p(\tilde{y} \mid \mathbf{x}_0, \mathcal{D}) = \int p(\tilde{y} \mid \mathbf{x}_0, \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D}) \, d\boldsymbol{\beta} \, d\sigma^2

Here,

$\tilde{y}$ =New observation at x₀
$p(\tilde{y} \mid \mathbf{x}_0, \boldsymbol{\beta}, \sigma^2)$ =Sampling model (Gaussian)
$p(\boldsymbol{\beta}, \sigma^2 \mid \mathcal{D})$ =Posterior over parameters

Under the Normal-Inverse-Gamma prior, the posterior predictive is a $t$ -distribution:

\tilde{y} \mid \mathbf{x}_0, \mathcal{D} \sim t_{2a_n}\left(\mathbf{x}_0^T \boldsymbol{\mu}_n, \; \frac{b_n}{a_n}(1 + \mathbf{x}_0^T \boldsymbol{\Lambda}_n^{-1} \mathbf{x}_0)\right)

The $+1$ inside the parentheses accounts for parameter uncertainty — this is wider than the OLS prediction interval when the prior is weak and $n$ is small.

Bayesian Model Comparison

Bayes Factor

\text{BF}_{10} = \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_0)} = \frac{\int p(\mathcal{D} \mid \boldsymbol{\theta}_1, \mathcal{M}_1) \, p(\boldsymbol{\theta}_1 \mid \mathcal{M}_1) \, d\boldsymbol{\theta}_1}{\int p(\mathcal{D} \mid \boldsymbol{\theta}_0, \mathcal{M}_0) \, p(\boldsymbol{\theta}_0 \mid \mathcal{M}_0) \, d\boldsymbol{\theta}_0}

Here,

$p(\mathcal{D} \mid \mathcal{M}_k)$ =Marginal likelihood (evidence) of model k
$\text{BF}_{10}$ =Evidence ratio: how much more likely data under M₁ vs M₀

Occam's Razor Built In

The Bayes factor naturally penalizes complex models because the marginal likelihood integrates over the entire prior — models with many parameters spread their prior mass thinly, incurring an automatic penalty. This is the Bayesian implementation of Occam's razor.

Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# --- Generate synthetic data ---
n = 80
X = np.random.randn(n, 2)
X = np.column_stack([np.ones(n), X])  # Add intercept
true_beta = np.array([1.5, -2.0, 0.8])
sigma_true = 1.5
y = X @ true_beta + np.random.randn(n) * sigma_true

# --- Bayesian Regression with PyMC ---
with pm.Model() as bayes_reg:
    # Priors
    beta = pm.Normal('beta', mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal('sigma', sigma=5)
    
    # Likelihood
    mu = pm.math.dot(X, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    
    # Posterior sampling
    trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True)

# --- Posterior summaries ---
print(az.summary(trace, var_names=['beta', 'sigma']))

# --- Credible intervals ---
for i, name in enumerate(['Intercept', 'X1', 'X2']):
    post_samples = trace.posterior['beta'].values[:, :, i].flatten()
    ci = np.percentile(post_samples, [2.5, 97.5])
    print(f"{name}: {ci[0]:.3f} to {ci[1]:.3f} (true={true_beta[i]:.3f})")

# --- Trace plots ---
az.plot_trace(trace, var_names=['beta', 'sigma'])
plt.tight_layout()
plt.savefig('bayesian_regression_trace.png', dpi=150)
plt.show()

# --- Posterior predictive ---
with bayes_reg:
    ppc = pm.sample_posterior_predictive(trace)
    
y_pred = ppc.posterior_predictive['y_obs'].values.reshape(-1, n)
y_mean = y_pred.mean(axis=0)
y_low = np.percentile(y_pred, 2.5, axis=0)
y_high = np.percentile(y_pred, 97.5, axis=0)

plt.figure(figsize=(10, 6))
order = np.argsort(X[:, 1])
plt.scatter(X[:, 1], y, alpha=0.5, label='Observed')
plt.plot(X[order, 1], y_mean[order], 'r-', label='Posterior mean')
plt.fill_between(X[order, 1], y_low[order], y_high[order], alpha=0.3, label='95% credible band')
plt.xlabel('X₁')
plt.ylabel('y')
plt.title('Bayesian Linear Regression — Posterior Predictive')
plt.legend()
plt.savefig('bayesian_regression_ppc.png', dpi=150)
plt.show()

Flat Prior Recovery

With flat priors ( $\boldsymbol{\Lambda}_0 \to \mathbf{0}$ ), the Bayesian posterior mean recovers the OLS estimate exactly, and the 95% credible intervals approximate the frequentist confidence intervals for large $n$ .

Key Takeaways

Summary: Bayesian Linear Regression

Bayesian regression treats $\boldsymbol{\beta}$ as random — the posterior $p(\boldsymbol{\beta} \mid \mathcal{D})$ encodes all uncertainty about the coefficients
Conjugate Normal-Inverse-Gamma prior yields tractable $t$ -distributed posteriors for $\boldsymbol{\beta}$ and Inverse-Gamma for $\sigma^2$
Credible intervals have a direct probabilistic interpretation: $P(\theta \in [L,U] \mid \mathcal{D}) = 1-\alpha$
Bayesian prediction integrates over parameter uncertainty — prediction intervals are naturally wider for small samples
Bayes factors provide automatic model comparison with built-in Occam's razor
With flat priors, Bayesian and frequentist results coincide: posterior mean = OLS, credible interval ≈ confidence interval
Weakly informative priors are preferred in practice — they regularize without dominating the likelihood

Bayesian Linear Regression

Bayesian Linear Regression

Regression With Full Uncertainty Quantification

The Bayesian Regression Model

DfBayesian Linear Regression Model

Posterior Distribution

Conjugate Prior: Normal-Inverse-Gamma

DfNormal-Inverse-Gamma Prior

ThPosterior under Normal-Inverse-Gamma Prior

Posterior Inference for Coefficients

Marginal Posterior of β

Large-Sample Posterior

Credible Intervals vs. Confidence Intervals

DfCredible Interval

Bayesian Prediction

Posterior Predictive Distribution

Bayesian Model Comparison

Bayes Factor

Python Implementation

Related Topics

Key Takeaways

Summary: Bayesian Linear Regression

Premium Content

Need Expert Statistics Help?