πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Hierarchical Bayesian Models

Advanced Statistical MethodsBayesian Methods🟒 Free Lesson

Advertisement

Hierarchical Bayesian Models

Advanced Statistical Methods

Borrowing Strength Across Groups

Hierarchical Bayesian models share information across groups through multilevel priors, producing shrinkage estimates that balance group-specific data with overall patterns. Partial pooling prevents overfitting in small groups.

  • Education β€” Estimate school-level effects by pooling information across thousands of students
  • Clinical meta-analysis β€” Combine trial results while accounting for between-study heterogeneity
  • Sports analytics β€” Rank players by borrowing strength across teammates and opponents

Hierarchical models let small groups learn from the collective, producing better estimates for everyone.


Hierarchical (multilevel) Bayesian models place distributions on the parameters of other distributions, creating a layered structure that naturally captures variation at multiple levels β€” individuals within groups, groups within populations.


The Hierarchical Structure

DfHierarchical Bayesian Model

A two-level hierarchical model has the form:

Level 1 (Observation model):

yij∣θj,Οƒ2∼p(y∣θj,Οƒ2)y_{ij} \mid \theta_j, \sigma^2 \sim p(y \mid \theta_j, \sigma^2)

Level 2 (Prior model for group parameters):

ΞΈj∣μ,Ο„2∼p(θ∣μ,Ο„2)\theta_j \mid \mu, \tau^2 \sim p(\theta \mid \mu, \tau^2)

Level 3 (Hyperprior):

ΞΌ,Ο„2∼p(ΞΌ) p(Ο„2)\mu, \tau^2 \sim p(\mu) \, p(\tau^2)

where i=1,…,nji = 1, \ldots, n_j indexes observations within group j=1,…,Jj = 1, \ldots, J.

Why Hierarchical Models?

Hierarchical models allow groups to share statistical strength β€” estimates for small groups are pulled toward the population mean, while large groups are less affected. This is called partial pooling, and it outperforms both no pooling (independent estimates) and complete pooling (ignoring group structure).


Three Pooling Strategies

StrategyModelProblem
No poolingΞΈj\theta_j estimated independently from each groupOverfitting for small groups; ignores group similarity
Complete poolingSingle ΞΈ\theta for all groupsIgnores genuine group differences
Partial poolingΞΈj∼N(ΞΌ,Ο„2)\theta_j \sim \mathcal{N}(\mu, \tau^2) β€” groups share informationOptimal trade-off; shrinks extreme estimates toward the mean

Partial Pooling Estimate

ΞΈ^jpp=Ξ»jyΛ‰j+(1βˆ’Ξ»j)ΞΌ^\hat{\theta}_j^{\text{pp}} = \lambda_j \bar{y}_j + (1 - \lambda_j) \hat{\mu}

Here,

  • yΛ‰j\bar{y}_j=Group j sample mean (no pooling estimate)
  • ΞΌ^\hat{\mu}=Population mean (complete pooling estimate)
  • Ξ»j\lambda_j=Shrinkage weight: depends on group sample size and within-group variance

The shrinkage weight λj∈[0,1]\lambda_j \in [0, 1] determines how much each group's estimate is pulled toward the population mean:

Shrinkage Weight

Ξ»j=Ο„2Ο„2+Οƒ2/nj\lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j}

Here,

  • Ο„2\tau^2=Between-group variance
  • Οƒ2\sigma^2=Within-group variance
  • njn_j=Number of observations in group j

Shrinkage Dependence

When njn_j is large, λj→1\lambda_j \to 1 and the estimate equals the group mean. When njn_j is small, λj→0\lambda_j \to 0 and the estimate is pulled strongly toward the population mean. This is shrinkage estimation.


Shrinkage Estimation

ThJames-Stein Shrinkage Property

In the normal means problem yj∼N(ΞΈj,Οƒ2)y_j \sim \mathcal{N}(\theta_j, \sigma^2) for j=1,…,Jj = 1, \ldots, J with Jβ‰₯3J \geq 3, the hierarchical Bayesian estimate with ΞΌ=0\mu = 0:

ΞΈ^jhs=(1βˆ’(Jβˆ’2)Οƒ2βˆ‘k=1Jyk2)yj\hat{\theta}_j^{\text{hs}} = \left(1 - \frac{(J-2)\sigma^2}{\sum_{k=1}^J y_k^2}\right) y_j

dominates the MLE ΞΈ^j=yj\hat{\theta}_j = y_j under total squared error loss:

E[βˆ‘j=1J(ΞΈ^jhsβˆ’ΞΈj)2]<E[βˆ‘j=1J(yjβˆ’ΞΈj)2]E\left[\sum_{j=1}^J (\hat{\theta}_j^{\text{hs}} - \theta_j)^2\right] < E\left[\sum_{j=1}^J (y_j - \theta_j)^2\right]

for all θ∈RJ\boldsymbol{\theta} \in \mathbb{R}^J. The improvement is substantial for small JJ and large θ\theta values.

Practical Implication

The James-Stein result shows that shrinkage always helps when estimating 3 or more parameters simultaneously. This is a cornerstone of modern statistics β€” it demonstrates that the MLE is inadmissible for multivariate problems.


Hierarchical Linear Regression

DfHierarchical Linear Regression Model

For JJ groups (e.g., schools, hospitals, regions):

Observation model:

yij=Ξ±j+Ξ²jxij+Ξ΅ij,Ξ΅ij∼N(0,Οƒ2)y_{ij} = \alpha_j + \beta_j x_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

Group-level model (random effects):

(Ξ±jΞ²j)∼N((ΞΌΞ±ΞΌΞ²),(τα2ρτατβρτατβτβ2))\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \begin{pmatrix} \tau_\alpha^2 & \rho \tau_\alpha \tau_\beta \\ \rho \tau_\alpha \tau_\beta & \tau_\beta^2 \end{pmatrix}\right)

The hyperparameters ΞΌΞ±,ΞΌΞ²,τα2,τβ2,ρ\mu_\alpha, \mu_\beta, \tau_\alpha^2, \tau_\beta^2, \rho are estimated from the data.

Random vs. Fixed Effects

  • Fixed effects: group-specific parameters are unknown constants β€” each group gets its own parameter with no information sharing
  • Random effects: group-specific parameters are drawn from a population distribution β€” groups borrow strength from each other
  • The hierarchical Bayesian approach treats random effects as latent variables with full posterior uncertainty

Borrowing Strength Across Groups

ThShrinkage Property of Hierarchical Models

For the hierarchical model ΞΈj∼N(ΞΌ,Ο„2)\theta_j \sim \mathcal{N}(\mu, \tau^2) with likelihood yΛ‰j∣θj∼N(ΞΈj,Οƒ2/nj)\bar{y}_j \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma^2/n_j), the posterior mean is:

E[ΞΈj∣y]=Ξ»jyΛ‰j+(1βˆ’Ξ»j)ΞΌ,Ξ»j=Ο„2Ο„2+Οƒ2/njE[\theta_j \mid \mathbf{y}] = \lambda_j \bar{y}_j + (1 - \lambda_j) \mu, \quad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2/n_j}

Shrinkage magnitude depends on:

  1. Group size njn_j: smaller groups are shrunk more
  2. Between-group variance Ο„2\tau^2: larger Ο„2\tau^2 means less shrinkage
  3. Within-group variance Οƒ2\sigma^2: larger Οƒ2\sigma^2 means more shrinkage

Borrowing Strength

" borrowing strength " means that information from the entire population informs each group's estimate. A school with only 10 students gets an estimate informed by the performance of all schools β€” not just those 10 students.


Posterior Predictive Checks

DfPosterior Predictive Check

Generate replicated data yrep\mathbf{y}^{\text{rep}} from the posterior predictive distribution:

p(yrep∣y)=∫p(yrep∣θ) p(θ∣y) dΞΈp(\mathbf{y}^{\text{rep}} \mid \mathbf{y}) = \int p(\mathbf{y}^{\text{rep}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathbf{y}) \, d\boldsymbol{\theta}

Compare summary statistics T(y)T(\mathbf{y}) of observed vs. replicated data. If the model fits, T(yrep)T(\mathbf{y}^{\text{rep}}) should be similar to T(y)T(\mathbf{y}).

The Bayesian pp-value is:

pB=P(T(yrep)β‰₯T(y)∣y)p_B = P(T(\mathbf{y}^{\text{rep}}) \geq T(\mathbf{y}) \mid \mathbf{y})

Values near 0 or 1 indicate model misfit.


Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simulate grouped data: 8 schools ---
J = 8
true_mu = 5.0
true_tau = 2.0
true_theta = np.random.normal(true_mu, true_tau, J)
n_j = np.array([15, 22, 12, 18, 10, 25, 14, 20])
sigma = 3.0

y_obs = []
for j in range(J):
    y_obs.append(np.random.normal(true_theta[j], sigma, n_j[j]))
    
y_bar = np.array([np.mean(y_obs[j]) for j in range(J)])
se = np.array([sigma / np.sqrt(n_j[j]) for j in range(J)])

# --- Hierarchical Bayesian Model (Eight Schools) ---
with pm.Model() as hierarchical:
    # Hyperpriors
    mu = pm.Normal('mu', mu=0, sigma=10)
    tau = pm.HalfNormal('tau', sigma=5)
    
    # Group-level parameters (random effects)
    theta = pm.Normal('theta', mu=mu, sigma=tau, shape=J)
    
    # Likelihood
    y = pm.Normal('y_obs', mu=theta, sigma=se, observed=y_bar)
    
    # Sample posterior
    trace = pm.sample(3000, tune=1500, chains=4, return_inferencedata=True)

# --- Posterior summaries ---
print(az.summary(trace, var_names=['mu', 'tau', 'theta']))

# --- Shrinkage visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# No pooling vs partial pooling
no_pool = y_bar
partial_pool = trace.posterior['theta'].mean(dim=['chain', 'draw']).values

axes[0].errorbar(range(J), no_pool, yerr=1.96*se, fmt='o', label='No pooling (MLE)', capsize=3)
axes[0].errorbar(range(J), partial_pool, 
                 yerr=1.96*trace.posterior['theta'].std(dim=['chain', 'draw']).values,
                 fmt='s', label='Partial pooling (Bayesian)', capsize=3)
axes[0].axhline(trace.posterior['mu'].mean().values, color='red', linestyle='--', label='Population mean')
axes[0].set_xlabel('School')
axes[0].set_ylabel('Effect estimate')
axes[0].set_title('No Pooling vs. Partial Pooling')
axes[0].legend()

# Shrinkage amount
shrinkage = 1 - (partial_pool - trace.posterior['mu'].mean().values) / (no_pool - trace.posterior['mu'].mean().values)
axes[1].bar(range(J), shrinkage)
axes[1].set_xlabel('School')
axes[1].set_ylabel('Shrinkage toward mean')
axes[1].set_title('Shrinkage Amount by School')
axes[1].axhline(0.5, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.savefig('hierarchical_shrinkage.png', dpi=150)
plt.show()

# --- Compare: no pooling vs hierarchical ---
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(n_j, true_theta, color='black', s=100, zorder=5, label='True ΞΈβ±Ό')
ax.scatter(n_j, no_pool, color='blue', s=80, marker='D', label='No pooling (MLE)')
ax.scatter(n_j, partial_pool, color='red', s=80, marker='s', label='Partial pooling (Bayesian)')
ax.axhline(true_mu, color='gray', linestyle='--', alpha=0.5, label='True ΞΌ')
ax.set_xlabel('Group size (nβ±Ό)')
ax.set_ylabel('Estimate')
ax.set_title('Shrinkage: Smaller Groups Shrunk More')
ax.legend()
plt.savefig('hierarchical_shrinkage_vs_size.png', dpi=150)
plt.show()

Practical Workflow

  1. Start with the simplest hierarchical model (varying intercepts only)
  2. Check convergence (R-hat < 1.01, ESS > 400)
  3. Perform posterior predictive checks
  4. Add complexity (varying slopes, interactions) only if needed
  5. Compare models via WAIC or LOO-CV

Related Topics


Key Takeaways

Summary: Hierarchical Bayesian Models

  • Hierarchical models place priors on parameters of other priors, creating multilevel structure
  • Partial pooling optimally balances no pooling (independent estimates) and complete pooling (ignoring group structure)
  • Shrinkage weight Ξ»j=Ο„2/(Ο„2+Οƒ2/nj)\lambda_j = \tau^2 / (\tau^2 + \sigma^2/n_j) determines how much each group is pulled toward the population mean
  • Small groups are shrunk more β€” this is the key practical advantage of hierarchical models
  • James-Stein theorem proves shrinkage always helps when estimating 3+ parameters simultaneously
  • Posterior predictive checks assess model fit by comparing observed and replicated data
  • Hierarchical models are essential for meta-analysis, educational research, and any grouped data structure
⭐

Premium Content

Hierarchical Bayesian Models

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement