Hierarchical Bayesian Models
Advanced Statistical Methods
Borrowing Strength Across Groups
Hierarchical Bayesian models share information across groups through multilevel priors, producing shrinkage estimates that balance group-specific data with overall patterns. Partial pooling prevents overfitting in small groups.
- Education β Estimate school-level effects by pooling information across thousands of students
- Clinical meta-analysis β Combine trial results while accounting for between-study heterogeneity
- Sports analytics β Rank players by borrowing strength across teammates and opponents
Hierarchical models let small groups learn from the collective, producing better estimates for everyone.
Hierarchical (multilevel) Bayesian models place distributions on the parameters of other distributions, creating a layered structure that naturally captures variation at multiple levels β individuals within groups, groups within populations.
The Hierarchical Structure
DfHierarchical Bayesian Model
A two-level hierarchical model has the form:
Level 1 (Observation model):
Level 2 (Prior model for group parameters):
Level 3 (Hyperprior):
where indexes observations within group .
Why Hierarchical Models?
Hierarchical models allow groups to share statistical strength β estimates for small groups are pulled toward the population mean, while large groups are less affected. This is called partial pooling, and it outperforms both no pooling (independent estimates) and complete pooling (ignoring group structure).
Three Pooling Strategies
| Strategy | Model | Problem |
|---|---|---|
| No pooling | estimated independently from each group | Overfitting for small groups; ignores group similarity |
| Complete pooling | Single for all groups | Ignores genuine group differences |
| Partial pooling | β groups share information | Optimal trade-off; shrinks extreme estimates toward the mean |
Partial Pooling Estimate
Here,
- =Group j sample mean (no pooling estimate)
- =Population mean (complete pooling estimate)
- =Shrinkage weight: depends on group sample size and within-group variance
The shrinkage weight determines how much each group's estimate is pulled toward the population mean:
Shrinkage Weight
Here,
- =Between-group variance
- =Within-group variance
- =Number of observations in group j
Shrinkage Dependence
When is large, and the estimate equals the group mean. When is small, and the estimate is pulled strongly toward the population mean. This is shrinkage estimation.
Shrinkage Estimation
ThJames-Stein Shrinkage Property
In the normal means problem for with , the hierarchical Bayesian estimate with :
dominates the MLE under total squared error loss:
for all . The improvement is substantial for small and large values.
Practical Implication
The James-Stein result shows that shrinkage always helps when estimating 3 or more parameters simultaneously. This is a cornerstone of modern statistics β it demonstrates that the MLE is inadmissible for multivariate problems.
Hierarchical Linear Regression
DfHierarchical Linear Regression Model
For groups (e.g., schools, hospitals, regions):
Observation model:
Group-level model (random effects):
The hyperparameters are estimated from the data.
Random vs. Fixed Effects
- Fixed effects: group-specific parameters are unknown constants β each group gets its own parameter with no information sharing
- Random effects: group-specific parameters are drawn from a population distribution β groups borrow strength from each other
- The hierarchical Bayesian approach treats random effects as latent variables with full posterior uncertainty
Borrowing Strength Across Groups
ThShrinkage Property of Hierarchical Models
For the hierarchical model with likelihood , the posterior mean is:
Shrinkage magnitude depends on:
- Group size : smaller groups are shrunk more
- Between-group variance : larger means less shrinkage
- Within-group variance : larger means more shrinkage
Borrowing Strength
" borrowing strength " means that information from the entire population informs each group's estimate. A school with only 10 students gets an estimate informed by the performance of all schools β not just those 10 students.
Posterior Predictive Checks
DfPosterior Predictive Check
Generate replicated data from the posterior predictive distribution:
Compare summary statistics of observed vs. replicated data. If the model fits, should be similar to .
The Bayesian -value is:
Values near 0 or 1 indicate model misfit.
Python Implementation
import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Simulate grouped data: 8 schools ---
J = 8
true_mu = 5.0
true_tau = 2.0
true_theta = np.random.normal(true_mu, true_tau, J)
n_j = np.array([15, 22, 12, 18, 10, 25, 14, 20])
sigma = 3.0
y_obs = []
for j in range(J):
y_obs.append(np.random.normal(true_theta[j], sigma, n_j[j]))
y_bar = np.array([np.mean(y_obs[j]) for j in range(J)])
se = np.array([sigma / np.sqrt(n_j[j]) for j in range(J)])
# --- Hierarchical Bayesian Model (Eight Schools) ---
with pm.Model() as hierarchical:
# Hyperpriors
mu = pm.Normal('mu', mu=0, sigma=10)
tau = pm.HalfNormal('tau', sigma=5)
# Group-level parameters (random effects)
theta = pm.Normal('theta', mu=mu, sigma=tau, shape=J)
# Likelihood
y = pm.Normal('y_obs', mu=theta, sigma=se, observed=y_bar)
# Sample posterior
trace = pm.sample(3000, tune=1500, chains=4, return_inferencedata=True)
# --- Posterior summaries ---
print(az.summary(trace, var_names=['mu', 'tau', 'theta']))
# --- Shrinkage visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# No pooling vs partial pooling
no_pool = y_bar
partial_pool = trace.posterior['theta'].mean(dim=['chain', 'draw']).values
axes[0].errorbar(range(J), no_pool, yerr=1.96*se, fmt='o', label='No pooling (MLE)', capsize=3)
axes[0].errorbar(range(J), partial_pool,
yerr=1.96*trace.posterior['theta'].std(dim=['chain', 'draw']).values,
fmt='s', label='Partial pooling (Bayesian)', capsize=3)
axes[0].axhline(trace.posterior['mu'].mean().values, color='red', linestyle='--', label='Population mean')
axes[0].set_xlabel('School')
axes[0].set_ylabel('Effect estimate')
axes[0].set_title('No Pooling vs. Partial Pooling')
axes[0].legend()
# Shrinkage amount
shrinkage = 1 - (partial_pool - trace.posterior['mu'].mean().values) / (no_pool - trace.posterior['mu'].mean().values)
axes[1].bar(range(J), shrinkage)
axes[1].set_xlabel('School')
axes[1].set_ylabel('Shrinkage toward mean')
axes[1].set_title('Shrinkage Amount by School')
axes[1].axhline(0.5, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('hierarchical_shrinkage.png', dpi=150)
plt.show()
# --- Compare: no pooling vs hierarchical ---
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(n_j, true_theta, color='black', s=100, zorder=5, label='True ΞΈβ±Ό')
ax.scatter(n_j, no_pool, color='blue', s=80, marker='D', label='No pooling (MLE)')
ax.scatter(n_j, partial_pool, color='red', s=80, marker='s', label='Partial pooling (Bayesian)')
ax.axhline(true_mu, color='gray', linestyle='--', alpha=0.5, label='True ΞΌ')
ax.set_xlabel('Group size (nβ±Ό)')
ax.set_ylabel('Estimate')
ax.set_title('Shrinkage: Smaller Groups Shrunk More')
ax.legend()
plt.savefig('hierarchical_shrinkage_vs_size.png', dpi=150)
plt.show()
Practical Workflow
- Start with the simplest hierarchical model (varying intercepts only)
- Check convergence (R-hat < 1.01, ESS > 400)
- Perform posterior predictive checks
- Add complexity (varying slopes, interactions) only if needed
- Compare models via WAIC or LOO-CV
Related Topics
- See Bayesian Linear Regression for the single-level Bayesian model
- See MCMC Diagnostics for convergence assessment
- See Multilevel Modeling for the frequentist equivalent
- See Ridge Regression for connections to shrinkage estimation
Key Takeaways
Summary: Hierarchical Bayesian Models
- Hierarchical models place priors on parameters of other priors, creating multilevel structure
- Partial pooling optimally balances no pooling (independent estimates) and complete pooling (ignoring group structure)
- Shrinkage weight determines how much each group is pulled toward the population mean
- Small groups are shrunk more β this is the key practical advantage of hierarchical models
- James-Stein theorem proves shrinkage always helps when estimating 3+ parameters simultaneously
- Posterior predictive checks assess model fit by comparing observed and replicated data
- Hierarchical models are essential for meta-analysis, educational research, and any grouped data structure