Bayesian Statistics

Why It Matters

Bayesian methods quantify uncertainty about model parameters, which supports decisions whose outcomes are uncertain. Frequentist inference treats parameters as fixed unknowns; Bayesian inference treats them as random variables with probability distributions. The result is a full posterior distribution, from which credible intervals and direct probability statements about parameters follow. These are valuable for risk-aware decision-making in healthcare, finance, and autonomous systems.

Overview

Bayesian inference updates prior beliefs about parameters using observed data via Bayes' rule: posterior ∝ likelihood × prior. The prior $p(\theta)$ encodes beliefs before seeing data. The likelihood $p(D|\theta)$ is the probability of the data given the parameters. The posterior $p(\theta|D)$ is the updated belief after seeing data. Conjugate priors (e.g., Beta-Binomial, Normal-Normal) yield closed-form posteriors, so the update can be done analytically. Maximum a posteriori (MAP) estimation finds the mode of the posterior, which is equivalent to maximum likelihood estimation (MLE) with the log-prior acting as a regularization penalty. For complex models without closed-form posteriors, Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling and Hamiltonian Monte Carlo (HMC) draw samples from the posterior numerically.

Key Concepts

Bayes' Rule

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}

Here,

$p(\theta|D)$ = Posterior: updated belief after seeing data
$p(D|\theta)$ = Likelihood: probability of data given $\theta$
$p(\theta)$ = Prior: belief before seeing data
$p(D)$ = Evidence (normalizing constant)
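
As a minimal sketch of Bayes' rule in action, the following approximates a posterior on a grid of $\theta$ values. The coin-flip likelihood, the flat prior, the grid size, and the names `grid_posterior`, `successes`, and `failures` are illustrative assumptions, not something fixed by the definitions above.

```python
def grid_posterior(successes, failures, prior, grid):
    """Approximate p(theta | D) on a discrete grid using Bayes' rule."""
    # Unnormalized posterior: likelihood * prior at each grid point.
    unnorm = [
        (t ** successes) * ((1 - t) ** failures) * prior(t)
        for t in grid
    ]
    # The evidence p(D) is approximated by the sum over the grid.
    evidence = sum(unnorm)
    return [u / evidence for u in unnorm]


# Hypothetical setup: 7 successes, 3 failures, flat prior on (0, 1).
grid = [(i + 0.5) / 1000 for i in range(1000)]
posterior = grid_posterior(7, 3, lambda t: 1.0, grid)

# Posterior mean as a probability-weighted average over the grid.
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(posterior_mean, 3))
```

Under these assumptions the printed mean should land close to 8/12, the mean of the exact Beta(8, 4) posterior that a flat prior produces. Grid approximation only scales to one or two parameters, which is why it serves here as an illustration rather than a general method.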

MAP Estimator

\hat{\theta}_{MAP} = \arg\max_\theta p(D|\theta)p(\theta)

Here,

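$\hat{\theta}_{MAP}$ = Maximum a posteriori estimate, the mode of the posterior. The evidence $p(D)$ is omitted because it does not depend on $\theta$.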

Beta-Binomial Conjugate

\text{Prior: } \theta \sim \text{Beta}(\alpha, \beta) \implies \text{Posterior: } \theta | D \sim \text{Beta}(\alpha + s, \beta + f)

Here,

$s$ = Number of successes
$f$ = Number of failures
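
The update follows in one line from Bayes' rule. The derivation below is a sketch that assumes a Binomial likelihood with $s$ successes and $f$ failures and drops every factor that does not depend on $\theta$:

```latex
p(\theta \mid D) \;\propto\;
\underbrace{\theta^{s}(1-\theta)^{f}}_{\text{likelihood}}
\cdot
\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}}
\;=\; \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}
```

The right-hand side is the kernel of a $\text{Beta}(\alpha + s, \beta + f)$ density, so $\alpha$ and $\beta$ can be read as pseudo-counts of prior successes and failures.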

Normal-Normal Conjugate

\text{Posterior mean: } \mu_n = \frac{\sigma^2 \mu_0 + n \tau_0^2 \bar{x}}{\sigma^2 + n\tau_0^2}

Here,

$\mu_0$ = Prior mean
$\tau_0^2$ = Prior variance (smaller values mean a stronger prior)
$\sigma^2$ = Data variance (assumed known)
$n$ = Sample size
$\bar{x}$ = Sample mean

Posterior Precision

\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}

Here,

$\tau_n^2$ = Posterior variance
$\tau_0^2$ = Prior variance
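
The Normal-Normal update can be sketched in a few lines of Python. The function name `normal_normal_update`, the data, and the hyperparameter values are hypothetical and chosen only for illustration; the data variance is assumed known.

```python
def normal_normal_update(xs, sigma2, mu0, tau0_2):
    """Posterior mean and variance for a Normal mean with known data variance."""
    n = len(xs)
    xbar = sum(xs) / n
    # Precisions add: 1/tau_n^2 = 1/tau_0^2 + n/sigma^2.
    post_precision = 1.0 / tau0_2 + n / sigma2
    post_var = 1.0 / post_precision
    # Posterior mean is the precision-weighted average of mu0 and xbar.
    post_mean = post_var * (mu0 / tau0_2 + n * xbar / sigma2)
    return post_mean, post_var


# Hypothetical data and hyperparameters, chosen only for illustration.
xs = [4.8, 5.1, 5.4, 4.9, 5.3]
mu_n, tau_n2 = normal_normal_update(xs, sigma2=1.0, mu0=0.0, tau0_2=4.0)
print(mu_n, tau_n2)
```

Working in precisions rather than variances keeps the code close to the posterior precision formula; the precision-weighted mean is algebraically the same as the posterior mean formula given earlier.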

Conjugate Prior Families

Likelihood	Prior	Posterior	Use Case
Bernoulli/Binomial	Beta	Beta	Proportions, click rates
Normal (known $\sigma^2$)	Normal	Normal	Mean estimation
Poisson	Gamma	Gamma	Count data
Normal (unknown $\mu$, $\sigma^2$)	Normal-Inverse-Gamma	Normal-Inverse-Gamma	Full normal model
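
To illustrate one more row of the table, here is a minimal sketch of the Gamma-Poisson update. It assumes the shape-rate parameterization of the Gamma distribution, which the table does not specify; the function name `gamma_poisson_update` and the counts are hypothetical.

```python
def gamma_poisson_update(counts, shape, rate):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior."""
    # Conjugate update: add the total count to the shape and
    # the number of observations to the rate.
    return shape + sum(counts), rate + len(counts)


# Hypothetical event counts and a Gamma(2, 1) prior on the Poisson rate.
counts = [3, 5, 4, 2, 6]
post_shape, post_rate = gamma_poisson_update(counts, shape=2.0, rate=1.0)
post_mean = post_shape / post_rate  # mean of a Gamma(shape, rate)
print(post_shape, post_rate, post_mean)
```

As with the Beta prior, the hyperparameters act as pseudo-data: the prior shape behaves like a prior total count and the prior rate like a prior number of observations.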

Prior Strength Effects

Prior Strength	Effect on Posterior	When to Use
Weak (large $\tau_0^2$)	Posterior dominated by data	Large samples, little prior knowledge
Strong (small $\tau_0^2$)	Posterior dominated by prior	Small samples, strong prior knowledge
Flat (uniform)	Posterior = likelihood (up to constant)	Non-informative analysis
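
For the Normal-Normal model, the rows of this table can be read off a weighted-average form of the posterior mean. The weight symbol $w$ below is introduced here for illustration; the rest follows from the Normal-Normal posterior mean with known $\sigma^2$:

```latex
\mu_n = w\,\mu_0 + (1 - w)\,\bar{x},
\qquad
w = \frac{\sigma^2}{\sigma^2 + n\,\tau_0^2}
```

As $\tau_0^2 \to \infty$ or $n \to \infty$, $w \to 0$ and the posterior mean approaches the sample mean $\bar{x}$. As $\tau_0^2 \to 0$, $w \to 1$ and the posterior mean stays at the prior mean $\mu_0$.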

Quick Example

Beta-Binomial Conjugate

Prior: $\theta \sim \text{Beta}(2, 2)$ (centered at 0.5, carrying the weight of 4 pseudo-observations). Data: 7 successes and 3 failures in 10 trials.

Posterior: $\text{Beta}(2+7, 2+3) = \text{Beta}(9, 5)$.

Posterior mean = $9/14 \approx 0.643$. This lies between the prior mean (0.5) and the sample proportion (0.7), closer to the data because 10 observations outweigh 4 pseudo-observations. With more data, the prior's influence diminishes.
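
The same numbers can be reproduced with the Python standard library. This is a sketch: the function name `beta_binomial_update`, the seed, and the number of draws are arbitrary assumptions, and the credible interval is a Monte Carlo estimate rather than an exact quantile.

```python
import random


def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: add observed counts to the Beta parameters."""
    return alpha + successes, beta + failures


alpha_post, beta_post = beta_binomial_update(2, 2, successes=7, failures=3)
posterior_mean = alpha_post / (alpha_post + beta_post)

# Monte Carlo estimate of the 95% equal-tailed credible interval.
rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
draws = sorted(rng.betavariate(alpha_post, beta_post) for _ in range(20_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(posterior_mean, lower, upper)
```

The interval endpoints vary slightly with the seed and the number of draws; a library quantile function for the Beta distribution would give exact values if one is available.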

MAP = MLE + Regularization

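With a Gaussian prior $\theta \sim N(0, \tau^2)$, the MAP estimate is:

\hat{\theta}_{MAP} = \arg\max_\theta \left[\ell(\theta) - \frac{\theta^2}{2\tau^2}\right]

Here $\ell(\theta) = \log p(D|\theta)$ is the log-likelihood, and the second term is the log of the Gaussian prior up to a constant. This is MLE with an L2 penalty; for linear regression with Gaussian noise it is ridge regression. The prior variance $\tau^2$ controls the regularization strength: a smaller $\tau^2$ means a stronger penalty.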
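
A minimal sketch of this equivalence for the simplest case: estimating a single mean $\theta$ from observations with known noise variance. The model $y_i \sim N(\theta, \sigma^2)$, the function names, and the data are assumptions made for illustration. The closed form shrinks the sample mean toward zero, and a numeric search over the penalized log-likelihood recovers the same value.

```python
def log_posterior(theta, ys, sigma2, tau2):
    """Log-likelihood plus log-prior, up to additive constants."""
    log_lik = -sum((y - theta) ** 2 for y in ys) / (2 * sigma2)
    log_prior = -theta ** 2 / (2 * tau2)  # the L2 penalty term
    return log_lik + log_prior


def map_closed_form(ys, sigma2, tau2):
    """Ridge-style shrinkage of the sample mean toward zero."""
    lam = sigma2 / tau2  # effective regularization strength
    return sum(ys) / (len(ys) + lam)


def map_numeric(ys, sigma2, tau2, lo=-100.0, hi=100.0, iters=200):
    """Ternary search for the maximum of the concave log-posterior."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_posterior(m1, ys, sigma2, tau2) < log_posterior(m2, ys, sigma2, tau2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


ys = [2.1, 1.9, 2.4, 2.0]  # hypothetical observations
theta_mle = sum(ys) / len(ys)
theta_map = map_closed_form(ys, sigma2=1.0, tau2=0.5)
theta_num = map_numeric(ys, sigma2=1.0, tau2=0.5)
print(theta_mle, theta_map, theta_num)
```

In this sketch the effective penalty is $\sigma^2/\tau^2$, so halving the prior variance doubles the shrinkage applied to the MLE.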

Key Takeaways

Summary: Bayesian Statistics

Bayes' Rule: Posterior ∝ Likelihood × Prior. Updates beliefs systematically as data accumulates.
Conjugate Priors: Beta-Binomial, Normal-Normal, Gamma-Poisson yield closed-form posteriors. Convenient for exact inference.
MAP = MLE + Regularization: MAP estimation with a Gaussian prior is equivalent to L2-regularized MLE.
Prior Choice: With little data, the prior dominates. Weakly informative priors regularize the estimate without overwhelming the data.
Posterior Mean: Under squared-error loss, $E[\theta|D]$ is the Bayes-optimal point estimate.
Credible Intervals: A 95% credible interval contains $\theta$ with 95% posterior probability, given the model and prior. Confidence intervals do not permit this direct reading.
MCMC: For complex models without conjugate priors, use Markov Chain Monte Carlo (Gibbs, HMC) to sample from the posterior.
Prior Sensitivity: Always check how sensitive results are to prior choice — especially with small samples.
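
To make the MCMC takeaway concrete, here is a minimal random-walk Metropolis sampler. It is a simpler relative of the Gibbs and HMC samplers named above, not an implementation of either. The target is the Beta(9, 5) posterior from the quick example, so the sampled mean can be checked against the exact value $9/14$. The step size, burn-in length, seed, and function names are arbitrary assumptions.

```python
import math
import random


def log_post(theta, s=7, f=3, alpha=2.0, beta=2.0):
    """Unnormalized log-posterior for a Beta prior with Binomial data."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return (alpha + s - 1) * math.log(theta) + (beta + f - 1) * math.log(1 - theta)


def metropolis(log_target, n_samples, start=0.5, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(log_post, n_samples=20_000)
kept = samples[2_000:]  # discard an arbitrary burn-in period
mcmc_mean = sum(kept) / len(kept)
print(mcmc_mean, 9 / 14)
```

Only the unnormalized posterior is needed, because the evidence $p(D)$ cancels in the acceptance ratio. In practice the chain should also be checked for convergence before its samples are trusted.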

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Bayesian Regression

Bayesian Regression — Full Bayesian treatment of regression with posterior distributions over coefficients

Hierarchical Models

Hierarchical Bayesian — Multi-level models with partial pooling, random effects, and shrinkage

MCMC Diagnostics

MCMC Diagnostics — Convergence checks, trace plots, effective sample size, $\hat{R}$ statistic, and autocorrelation
