Hierarchical Bayesian Models

Advanced Statistical Methods

Borrowing Strength Across Groups

Hierarchical Bayesian models share information across groups through multilevel priors, producing shrinkage estimates that balance group-specific data with overall patterns. Partial pooling prevents overfitting in small groups.

Education — Estimate school-level effects by pooling information across thousands of students
Clinical meta-analysis — Combine trial results while accounting for between-study heterogeneity
Sports analytics — Rank players by borrowing strength across teammates and opponents

Hierarchical models let small groups learn from the collective, producing better estimates for everyone.

Hierarchical (multilevel) Bayesian models place distributions on the parameters of other distributions, creating a layered structure that naturally captures variation at multiple levels — individuals within groups, groups within populations.

The Hierarchical Structure

DfHierarchical Bayesian Model

A two-level hierarchical model has the form:

Level 1 (Observation model):

y_{ij} \mid \theta_j, \sigma^2 \sim p(y \mid \theta_j, \sigma^2)

Level 2 (Prior model for group parameters):

\theta_j \mid \mu, \tau^2 \sim p(\theta \mid \mu, \tau^2)

Level 3 (Hyperprior):

\mu, \tau^2 \sim p(\mu) \, p(\tau^2)

where $i = 1, \ldots, n_j$ indexes observations within group $j = 1, \ldots, J$ .

Why Hierarchical Models?

Hierarchical models allow groups to share statistical strength — estimates for small groups are pulled toward the population mean, while large groups are less affected. This is called partial pooling, and it outperforms both no pooling (independent estimates) and complete pooling (ignoring group structure).

Three Pooling Strategies

Strategy	Model	Problem
No pooling	$\theta_j$ estimated independently from each group	Overfitting for small groups; ignores group similarity
Complete pooling	Single $\theta$ for all groups	Ignores genuine group differences
Partial pooling	$\theta_j \sim \mathcal{N}(\mu, \tau^2)$ — groups share information	Optimal trade-off; shrinks extreme estimates toward the mean

Partial Pooling Estimate

\hat{\theta}_j^{\text{pp}} = \lambda_j \bar{y}_j + (1 - \lambda_j) \hat{\mu}

Here,

$\bar{y}_j$ =Group j sample mean (no pooling estimate)
$\hat{\mu}$ =Population mean (complete pooling estimate)
$\lambda_j$ =Shrinkage weight: depends on group sample size and within-group variance

The shrinkage weight $\lambda_j \in [0, 1]$ determines how much each group's estimate is pulled toward the population mean:

Shrinkage Weight

\lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j}

Here,

$\tau^2$ =Between-group variance
$\sigma^2$ =Within-group variance
$n_j$ =Number of observations in group j

Shrinkage Dependence

When $n_j$ is large, $\lambda_j \to 1$ and the estimate equals the group mean. When $n_j$ is small, $\lambda_j \to 0$ and the estimate is pulled strongly toward the population mean. This is shrinkage estimation.

Shrinkage Estimation

ThJames-Stein Shrinkage Property

In the normal means problem $y_j \sim \mathcal{N}(\theta_j, \sigma^2)$ for $j = 1, \ldots, J$ with $J \geq 3$ , the hierarchical Bayesian estimate with $\mu = 0$ :

\hat{\theta}_j^{\text{hs}} = \left(1 - \frac{(J-2)\sigma^2}{\sum_{k=1}^J y_k^2}\right) y_j

dominates the MLE $\hat{\theta}_j = y_j$ under total squared error loss:

E\left[\sum_{j=1}^J (\hat{\theta}_j^{\text{hs}} - \theta_j)^2\right] < E\left[\sum_{j=1}^J (y_j - \theta_j)^2\right]

for all $\boldsymbol{\theta} \in \mathbb{R}^J$ . The improvement is substantial for small $J$ and large $\theta$ values.

Practical Implication

The James-Stein result shows that shrinkage always helps when estimating 3 or more parameters simultaneously. This is a cornerstone of modern statistics — it demonstrates that the MLE is inadmissible for multivariate problems.

Hierarchical Linear Regression

DfHierarchical Linear Regression Model

For $J$ groups (e.g., schools, hospitals, regions):

Observation model:

y_{ij} = \alpha_j + \beta_j x_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

Group-level model (random effects):

\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \begin{pmatrix} \tau_\alpha^2 & \rho \tau_\alpha \tau_\beta \\ \rho \tau_\alpha \tau_\beta & \tau_\beta^2 \end{pmatrix}\right)

The hyperparameters $\mu_\alpha, \mu_\beta, \tau_\alpha^2, \tau_\beta^2, \rho$ are estimated from the data.

Random vs. Fixed Effects

Fixed effects: group-specific parameters are unknown constants — each group gets its own parameter with no information sharing
Random effects: group-specific parameters are drawn from a population distribution — groups borrow strength from each other
The hierarchical Bayesian approach treats random effects as latent variables with full posterior uncertainty

Borrowing Strength Across Groups

ThShrinkage Property of Hierarchical Models

For the hierarchical model $\theta_j \sim \mathcal{N}(\mu, \tau^2)$ with likelihood $\bar{y}_j \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma^2/n_j)$ , the posterior mean is:

E[\theta_j \mid \mathbf{y}] = \lambda_j \bar{y}_j + (1 - \lambda_j) \mu, \quad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2/n_j}

Shrinkage magnitude depends on:

Group size $n_j$ : smaller groups are shrunk more
Between-group variance $\tau^2$ : larger $\tau^2$ means less shrinkage
Within-group variance $\sigma^2$ : larger $\sigma^2$ means more shrinkage

Borrowing Strength

" borrowing strength " means that information from the entire population informs each group's estimate. A school with only 10 students gets an estimate informed by the performance of all schools — not just those 10 students.

Posterior Predictive Checks

DfPosterior Predictive Check

Generate replicated data $\mathbf{y}^{\text{rep}}$ from the posterior predictive distribution:

p(\mathbf{y}^{\text{rep}} \mid \mathbf{y}) = \int p(\mathbf{y}^{\text{rep}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathbf{y}) \, d\boldsymbol{\theta}

Compare summary statistics $T(\mathbf{y})$ of observed vs. replicated data. If the model fits, $T(\mathbf{y}^{\text{rep}})$ should be similar to $T(\mathbf{y})$ .

The Bayesian $p$ -value is:

p_B = P(T(\mathbf{y}^{\text{rep}}) \geq T(\mathbf{y}) \mid \mathbf{y})

Values near 0 or 1 indicate model misfit.

Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simulate grouped data: 8 schools ---
J = 8
true_mu = 5.0
true_tau = 2.0
true_theta = np.random.normal(true_mu, true_tau, J)
n_j = np.array([15, 22, 12, 18, 10, 25, 14, 20])
sigma = 3.0

y_obs = []
for j in range(J):
    y_obs.append(np.random.normal(true_theta[j], sigma, n_j[j]))
    
y_bar = np.array([np.mean(y_obs[j]) for j in range(J)])
se = np.array([sigma / np.sqrt(n_j[j]) for j in range(J)])

# --- Hierarchical Bayesian Model (Eight Schools) ---
with pm.Model() as hierarchical:
    # Hyperpriors
    mu = pm.Normal('mu', mu=0, sigma=10)
    tau = pm.HalfNormal('tau', sigma=5)
    
    # Group-level parameters (random effects)
    theta = pm.Normal('theta', mu=mu, sigma=tau, shape=J)
    
    # Likelihood
    y = pm.Normal('y_obs', mu=theta, sigma=se, observed=y_bar)
    
    # Sample posterior
    trace = pm.sample(3000, tune=1500, chains=4, return_inferencedata=True)

# --- Posterior summaries ---
print(az.summary(trace, var_names=['mu', 'tau', 'theta']))

# --- Shrinkage visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# No pooling vs partial pooling
no_pool = y_bar
partial_pool = trace.posterior['theta'].mean(dim=['chain', 'draw']).values

axes[0].errorbar(range(J), no_pool, yerr=1.96*se, fmt='o', label='No pooling (MLE)', capsize=3)
axes[0].errorbar(range(J), partial_pool, 
                 yerr=1.96*trace.posterior['theta'].std(dim=['chain', 'draw']).values,
                 fmt='s', label='Partial pooling (Bayesian)', capsize=3)
axes[0].axhline(trace.posterior['mu'].mean().values, color='red', linestyle='--', label='Population mean')
axes[0].set_xlabel('School')
axes[0].set_ylabel('Effect estimate')
axes[0].set_title('No Pooling vs. Partial Pooling')
axes[0].legend()

# Shrinkage amount
shrinkage = 1 - (partial_pool - trace.posterior['mu'].mean().values) / (no_pool - trace.posterior['mu'].mean().values)
axes[1].bar(range(J), shrinkage)
axes[1].set_xlabel('School')
axes[1].set_ylabel('Shrinkage toward mean')
axes[1].set_title('Shrinkage Amount by School')
axes[1].axhline(0.5, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.savefig('hierarchical_shrinkage.png', dpi=150)
plt.show()

# --- Compare: no pooling vs hierarchical ---
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(n_j, true_theta, color='black', s=100, zorder=5, label='True θⱼ')
ax.scatter(n_j, no_pool, color='blue', s=80, marker='D', label='No pooling (MLE)')
ax.scatter(n_j, partial_pool, color='red', s=80, marker='s', label='Partial pooling (Bayesian)')
ax.axhline(true_mu, color='gray', linestyle='--', alpha=0.5, label='True μ')
ax.set_xlabel('Group size (nⱼ)')
ax.set_ylabel('Estimate')
ax.set_title('Shrinkage: Smaller Groups Shrunk More')
ax.legend()
plt.savefig('hierarchical_shrinkage_vs_size.png', dpi=150)
plt.show()

Practical Workflow

Start with the simplest hierarchical model (varying intercepts only)
Check convergence (R-hat < 1.01, ESS > 400)
Perform posterior predictive checks
Add complexity (varying slopes, interactions) only if needed
Compare models via WAIC or LOO-CV

Key Takeaways

Summary: Hierarchical Bayesian Models

Hierarchical models place priors on parameters of other priors, creating multilevel structure
Partial pooling optimally balances no pooling (independent estimates) and complete pooling (ignoring group structure)
Shrinkage weight $\lambda_j = \tau^2 / (\tau^2 + \sigma^2/n_j)$ determines how much each group is pulled toward the population mean
Small groups are shrunk more — this is the key practical advantage of hierarchical models
James-Stein theorem proves shrinkage always helps when estimating 3+ parameters simultaneously
Posterior predictive checks assess model fit by comparing observed and replicated data
Hierarchical models are essential for meta-analysis, educational research, and any grouped data structure

Hierarchical Bayesian Models

Hierarchical Bayesian Models

Borrowing Strength Across Groups

The Hierarchical Structure

DfHierarchical Bayesian Model

Three Pooling Strategies

Partial Pooling Estimate

Shrinkage Weight

Shrinkage Estimation

ThJames-Stein Shrinkage Property

Hierarchical Linear Regression

DfHierarchical Linear Regression Model

Borrowing Strength Across Groups

ThShrinkage Property of Hierarchical Models

Posterior Predictive Checks

DfPosterior Predictive Check

Python Implementation

Related Topics

Key Takeaways

Summary: Hierarchical Bayesian Models

Premium Content

Need Expert Statistics Help?