Statistical Decision Theory

Advanced Statistical Methods

Optimal Decisions Under Uncertainty

Decision theory provides a formal framework for choosing actions that minimize expected loss, combining probability models with utility functions. It unifies hypothesis testing, estimation, and prediction under one coherent philosophy.

Medical treatment decisions: balance risks and benefits using loss functions and prior probabilities
Quality control: set acceptance sampling rules that minimize total expected inspection costs
Financial portfolio allocation: optimize investment decisions by maximizing expected utility (equivalently, minimizing expected loss)

Decision theory transforms statistical evidence into optimal actions.

Statistical decision theory provides a framework for choosing between actions (estimators, tests, decisions) by formalizing the consequences of each choice through loss functions and risk.

The Decision Problem

Definition: Statistical Decision Problem

A decision problem consists of:

Parameter space $\Theta$ — the set of possible states of nature
Action space $\mathcal{A}$ — the set of available actions
Loss function $L(\theta, a)$ — the cost of taking action $a$ when the true state is $\theta$
Decision rule $\delta(\mathbf{x})$ — a function from data $\mathbf{x}$ to actions

The goal is to find the decision rule $\delta^*$ that minimizes expected loss.

Loss Functions

Definition: Common Loss Functions

Loss Function	Formula	Property
0-1 loss (classification)	$L(\theta, a) = \mathbb{1}(\theta \neq a)$	Discontinuous; Bayes rule is the posterior mode
Squared error loss (regression)	$L(\theta, a) = (\theta - a)^2$	Symmetric; penalizes large errors heavily; Bayes rule is the posterior mean
Absolute error loss	$L(\theta, a) = |\theta - a|$	Less sensitive to outliers than squared error; Bayes rule is the posterior median
0-inflated loss	$L(\theta, a) = \lambda_0 \mathbb{1}(a \neq 0) + \lambda_1 |\theta - a|$	Encourages sparsity (LASSO connection)
Asymmetric linear loss	$L(\theta, a) = \begin{cases} \lambda_1(\theta - a) & a < \theta \\ \lambda_2(a - \theta) & a > \theta \end{cases}$	Different penalties for underestimation ($\lambda_1$) and overestimation ($\lambda_2$)

Loss Function Choice Drives the Estimator

The choice of loss function determines the optimal estimator:

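Squared error → posterior mean (Bayes) / sample mean (frequentist)
Absolute error → posterior median
0-1 loss → posterior mode (MAP); exact for a discrete parameter, and a limiting case for a continuous one
Asymmetric linear → posterior quantile at level $\lambda_1 / (\lambda_1 + \lambda_2)$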
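The correspondence above can be checked numerically. The following is a minimal sketch, assuming a hypothetical skewed posterior represented by Gamma draws (the distribution, the grid, and the helper name `bayes_action` are illustrative choices, not part of the original text). It minimizes the Monte Carlo posterior expected loss over a grid of actions and compares the minimizer with the posterior mean, median, and quantile.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed posterior, represented by draws (Gamma chosen only for illustration)
draws = rng.gamma(shape=2.0, scale=1.5, size=20_000)

def bayes_action(loss, draws, grid):
    """Return the grid point minimizing the Monte Carlo posterior expected loss."""
    expected = [np.mean(loss(draws, a)) for a in grid]
    return grid[int(np.argmin(expected))]

grid = np.linspace(0.0, 12.0, 1201)  # candidate actions, spacing 0.01

a_sq = bayes_action(lambda t, a: (t - a) ** 2, draws, grid)
a_abs = bayes_action(lambda t, a: np.abs(t - a), draws, grid)

# Asymmetric linear loss: underestimation costs lam1, overestimation costs lam2
lam1, lam2 = 3.0, 1.0
a_asym = bayes_action(
    lambda t, a: np.where(a < t, lam1 * (t - a), lam2 * (a - t)), draws, grid
)

print(f"squared error : {a_sq:.2f}  vs posterior mean   {draws.mean():.2f}")
print(f"absolute error: {a_abs:.2f}  vs posterior median {np.median(draws):.2f}")
print(f"asymmetric    : {a_asym:.2f}  vs {lam1 / (lam1 + lam2):.2f} quantile    "
      f"{np.quantile(draws, lam1 / (lam1 + lam2)):.2f}")
```

Because the posterior is skewed, the three optimal actions differ noticeably, which makes the dependence on the loss function visible. Agreement is only up to the grid spacing and Monte Carlo error.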

Risk Function

Definition: Risk Function

The risk of a decision rule $\delta$ under loss $L$ is the expected loss:

$$R(\theta, \delta) = E_\theta[L(\theta, \delta(\mathbf{X}))] = \int L(\theta, \delta(\mathbf{x})) \, p(\mathbf{x} \mid \theta) \, d\mathbf{x}$$

For squared error loss: $R(\theta, \delta) = E_\theta[(\theta - \delta(\mathbf{X}))^2] = \text{Bias}^2(\delta) + \text{Var}(\delta)$ .

Bias-Variance Decomposition

$$R(\theta, \delta) = \text{Bias}^2(\delta) + \text{Var}(\delta)$$

Here,

$\text{Bias}(\delta) = E[\delta(\mathbf{X})] - \theta$: systematic error of the estimator
$\text{Var}(\delta) = E[(\delta(\mathbf{X}) - E[\delta(\mathbf{X})])^2]$: variability of the estimator

Bias-Variance Tradeoff

Reducing bias often increases variance and vice versa. The optimal estimator minimizes the total risk — the sum of squared bias and variance. Regularization methods (ridge, LASSO) deliberately introduce bias to reduce variance.
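A small worked case may make the tradeoff concrete. The sketch below assumes a hypothetical shrunken sample mean $\delta = c\,\bar{X}$ with $X_i \sim \mathcal{N}(\theta, 1)$ (the values of `n`, `c`, and the two `theta` values are illustrative). Its bias is $(c-1)\theta$ and its variance is $c^2/n$, and the simulation checks that their combination matches the simulated risk.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 10, 0.8            # hypothetical sample size and shrinkage factor
n_sims = 200_000

def risk_terms(theta):
    """Exact squared bias and variance of delta = c * sample mean, X_i ~ N(theta, 1)."""
    bias_sq = ((c - 1.0) * theta) ** 2
    var = c ** 2 / n
    return bias_sq, var

def mc_risk(theta):
    """Monte Carlo estimate of E[(delta - theta)^2]."""
    # The sample mean of n N(theta, 1) draws is N(theta, 1/n), so sample it directly
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=n_sims)
    return np.mean((c * xbar - theta) ** 2)

for theta in (0.5, 2.0):
    bias_sq, var = risk_terms(theta)
    print(f"theta={theta}: bias^2={bias_sq:.3f}, var={var:.3f}, "
          f"sum={bias_sq + var:.3f}, simulated={mc_risk(theta):.3f}, "
          f"unshrunk mean risk={1.0 / n:.3f}")
```

With these numbers the shrunken mean beats the plain sample mean at $\theta = 0.5$ (risk 0.074 against 0.100) and loses at $\theta = 2$ (risk 0.224 against 0.100): the variance reduction is the same in both cases, but the squared bias grows with $\theta^2$.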

Admissibility

Definition: Admissible Estimator

An estimator $\delta_1$ is dominated by $\delta_2$ if $R(\theta, \delta_2) \leq R(\theta, \delta_1)$ for all $\theta$ with strict inequality for some $\theta$ . An estimator is admissible if no other estimator dominates it.
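The dominance condition can be illustrated on a grid of parameter values. This is a minimal sketch, assuming three hypothetical estimators of a normal mean with known closed-form risks (the helper `dominates` and the grid are illustrative). A grid check can only suggest dominance or rule it out, since the definition quantifies over all $\theta$.

```python
import numpy as np

def dominates(risk_a, risk_b, tol=1e-12):
    """True if A is never worse than B on the grid and strictly better somewhere."""
    risk_a, risk_b = np.asarray(risk_a), np.asarray(risk_b)
    return bool(np.all(risk_a <= risk_b + tol) and np.any(risk_a < risk_b - tol))

theta = np.linspace(-3, 3, 61)
n = 10  # hypothetical sample size, X_i ~ N(theta, 1)

risk_mean = np.full_like(theta, 1.0 / n)            # sample mean: variance 1/n, no bias
risk_shifted = np.full_like(theta, 1.0 / n + 1.0)   # sample mean + 1: same variance, bias 1
risk_half = 0.25 / n + 0.25 * theta ** 2            # 0.5 * sample mean: less variance, bias -theta/2

print(dominates(risk_mean, risk_shifted))  # True: the shifted estimator is inadmissible
print(dominates(risk_mean, risk_half))     # False: halving wins near theta = 0
print(dominates(risk_half, risk_mean))     # False: halving loses for large |theta|
```

The last two lines show the typical situation: risk functions cross, so neither estimator dominates the other and admissibility alone cannot choose between them.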

Theorem: Inadmissibility of the MLE in High Dimensions

For the normal means problem $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\theta}, \mathbf{I}_J)$ with $J \geq 3$ :

The MLE $\hat{\boldsymbol{\theta}} = \mathbf{y}$ is inadmissible under squared error loss:

$$E\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2 = J > E\|\delta^{\text{JS}} - \boldsymbol{\theta}\|^2$$

for the James-Stein estimator $\delta^{\text{JS}}$ , which shrinks toward zero.

Practical Consequence

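In the normal means problem with $J \geq 3$, the MLE is dominated under total squared error loss, so shrinkage lowers the total risk at every $\boldsymbol{\theta}$. The gain applies to the sum over coordinates; an individual coordinate can still be estimated worse than by the MLE. This is a fundamental result in modern statistics.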

Bayes Risk

Definition: Bayes Risk

Given a prior $\pi(\theta)$ , the Bayes risk of a decision rule $\delta$ is:

$$r(\pi, \delta) = E^\pi[R(\theta, \delta)] = \int R(\theta, \delta) \, \pi(\theta) \, d\theta = E^\pi E^{\mathbf{x} \mid \theta}[L(\theta, \delta(\mathbf{x}))]$$

The Bayes rule $\delta^*$ minimizes the Bayes risk:

$$\delta^*(\mathbf{x}) = \arg\min_a \int L(\theta, a) \, p(\theta \mid \mathbf{x}) \, d\theta$$

Bayes Rule is Optimal

The Bayes rule is the best estimator given the prior — it achieves the minimum possible Bayes risk. For squared error loss, the Bayes rule is the posterior mean: $\delta^*(\mathbf{x}) = E[\theta \mid \mathbf{x}]$ .
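For a finite problem the Bayes rule reduces to a few array operations: compute the posterior, average the loss over it for each action, and pick the smallest. The sketch below assumes a hypothetical two-state, two-action treatment decision; the prior, test accuracies, and loss values are made-up numbers for illustration only.

```python
import numpy as np

prior = np.array([0.1, 0.9])            # P(disease), P(healthy)

# likelihood[x, state] = P(test result x | state); x = 0 negative, x = 1 positive
likelihood = np.array([[0.1, 0.8],
                       [0.9, 0.2]])

# loss[state, action]; actions: 0 = treat, 1 = do not treat
loss = np.array([[0.0, 10.0],           # disease: untreated disease is costly
                 [1.0, 0.0]])           # healthy: unnecessary treatment has a small cost

def bayes_rule(x):
    """Return (optimal action, posterior, posterior expected loss per action) for result x."""
    post = likelihood[x] * prior
    post = post / post.sum()            # Bayes' theorem
    expected_loss = post @ loss         # average the loss over the posterior, per action
    return int(np.argmin(expected_loss)), post, expected_loss

for x in (0, 1):
    action, post, exp_loss = bayes_rule(x)
    print(f"x={x}: posterior={np.round(post, 3)}, "
          f"expected loss={np.round(exp_loss, 3)}, action={action}")
```

With these numbers a positive test gives a posterior disease probability of only 1/3, yet treating is still optimal (expected loss 0.667 against 3.333) because the loss for missing the disease is ten times the loss for overtreating.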

Minimax Estimators

Definition: Minimax Estimator

The minimax estimator minimizes the worst-case risk:

$$\delta^{\text{mm}} = \arg\min_\delta \max_\theta R(\theta, \delta)$$

A minimax estimator achieves the minimax value: $V = \min_\delta \max_\theta R(\theta, \delta)$ .

Theorem: Minimax Theorem

If an estimator $\delta^*$ has constant (flat) risk and is admissible, then it is minimax. Constant risk is a sufficient condition here, not a necessary one: a minimax estimator need not have a flat risk function.

Bayes minimax connection: If $\delta^*$ is a Bayes rule for prior $\pi^*$ and its risk is constant, then $\delta^*$ is minimax, and the minimax value equals the Bayes risk:

$$V = r(\pi^*, \delta^*)$$
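A standard textbook illustration of this connection, not taken from the text above, is estimating a binomial proportion under squared error loss. Assuming $X \sim \text{Binomial}(n, \theta)$ and a $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior, the posterior mean is $(X + \sqrt{n}/2)/(n + \sqrt{n})$, and its risk does not depend on $\theta$. The sketch below evaluates the exact risk of this rule and of the MLE $X/n$; the helper name `risk_affine` and the choice $n = 25$ are illustrative.

```python
import numpy as np

def risk_affine(theta, n, a, b):
    """Exact squared error risk of delta(X) = a*X + b when X ~ Binomial(n, theta)."""
    var = a ** 2 * n * theta * (1 - theta)
    bias = a * n * theta + b - theta
    return var + bias ** 2

n = 25
root = np.sqrt(n)
theta = np.linspace(0.0, 1.0, 101)

risk_mle = risk_affine(theta, n, 1.0 / n, 0.0)
# Posterior mean under the Beta(sqrt(n)/2, sqrt(n)/2) prior
risk_eq = risk_affine(theta, n, 1.0 / (n + root), 0.5 * root / (n + root))

print(f"MLE:        max risk = {risk_mle.max():.5f}, min risk = {risk_mle.min():.5f}")
print(f"Bayes rule: max risk = {risk_eq.max():.5f}, min risk = {risk_eq.min():.5f}")
print(f"1 / (4 (1 + sqrt(n))^2) = {1.0 / (4 * (1 + root) ** 2):.5f}")
```

The Bayes rule's risk is flat at $1/(4(1+\sqrt{n})^2)$, below the MLE's worst case of $1/(4n)$ at $\theta = 1/2$, although the MLE has lower risk near $\theta = 0$ and $\theta = 1$. This is the sense in which minimax protects against the worst case rather than performing best everywhere.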

When to Use Minimax

Minimax is appropriate when:

No reliable prior is available
The consequence of the worst case is catastrophic
A guarantee on worst-case performance is needed

In practice, minimax is often too conservative — Bayes rules with reasonable priors typically perform better on average.

The James-Stein Estimator

James-Stein Estimator

$$\delta^{\text{JS}} = \left(1 - \frac{(J-2)\sigma^2}{\|\mathbf{y}\|^2}\right) \mathbf{y}$$

Here,

$J$: dimension (number of parameters)
$\sigma^2$: known variance
$\|\mathbf{y}\|^2$: squared norm of the observation vector

Theorem: James-Stein Dominance

For $J \geq 3$ , the James-Stein estimator dominates the MLE under total squared error loss:

$$R(\boldsymbol{\theta}, \delta^{\text{JS}}) < R(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{\text{MLE}}) = J$$

for all $\boldsymbol{\theta} \in \mathbb{R}^J$ . The improvement is greatest when $\|\boldsymbol{\theta}\|^2$ is small (near zero).

The Shrinkage Effect

The James-Stein estimator shrinks $\mathbf{y}$ toward the origin. The shrinkage factor depends on $\|\mathbf{y}\|^2$ — when the observations are large (far from zero), shrinkage is small. When observations are small (near zero), shrinkage is large. This adaptive shrinkage is what gives James-Stein its power.
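To see the adaptive behaviour in numbers, the sketch below evaluates the shrinkage factor $1 - (J-2)\sigma^2/\|\mathbf{y}\|^2$ for three hypothetical observation vectors with $J = 10$ and $\sigma^2 = 1$ (the vectors and the helper name are illustrative).

```python
import numpy as np

def js_shrinkage_factor(y, sigma2=1.0):
    """Multiplier applied to y by the plain James-Stein estimator."""
    y = np.asarray(y, dtype=float)
    J = y.size
    return 1.0 - (J - 2) * sigma2 / np.sum(y ** 2)

y_far = np.full(10, 3.0)    # ||y||^2 = 90
y_near = np.full(10, 1.0)   # ||y||^2 = 10
y_tiny = np.full(10, 0.5)   # ||y||^2 = 2.5

for name, y in [("far", y_far), ("near", y_near), ("tiny", y_tiny)]:
    print(f"{name:>4}: ||y||^2 = {np.sum(y ** 2):5.1f}, factor = {js_shrinkage_factor(y):.3f}")
```

The factor is about 0.911 for the distant vector and 0.2 for the nearby one. For the smallest vector it is negative (-2.2), meaning the plain formula would flip the sign of $\mathbf{y}$; the positive-part variant, which clips the factor at zero, avoids this.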

Pareto Optimality

Definition: Pareto Optimal Risk

A risk point $(R(\theta_1, \delta), R(\theta_2, \delta))$ is Pareto optimal if no other decision rule has risk that is no larger at both $\theta_1$ and $\theta_2$ and strictly smaller at at least one of them.

A decision rule $\delta^*$ is Pareto optimal if no other rule dominates its risk vector in this sense, which is the same notion as admissibility restricted to the parameter values considered.

Pareto Frontier

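The set of Pareto optimal risk points forms the Pareto frontier, the efficient trade-off curve between risks at different parameter values. A Bayes rule is Pareto optimal when it is the unique Bayes rule for its prior or, with finitely many parameter values, when the prior assigns positive probability to each of them.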
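For a finite set of candidate rules, the frontier can be found by direct comparison of risk vectors. This is a minimal sketch, assuming five hypothetical rules evaluated at two parameter values; the risk numbers and the helper `pareto_mask` are illustrative.

```python
import numpy as np

def pareto_mask(risks):
    """Boolean mask of non-dominated rows; risks has shape (n_rules, n_parameter_values)."""
    risks = np.asarray(risks, dtype=float)
    keep = np.ones(risks.shape[0], dtype=bool)
    for i in range(risks.shape[0]):
        no_worse = np.all(risks <= risks[i], axis=1)          # no larger at every parameter value
        strictly_better = np.any(risks < risks[i], axis=1)    # smaller at some parameter value
        keep[i] = not np.any(no_worse & strictly_better)
    return keep

# Columns: risk at theta_1, risk at theta_2 (made-up values)
risks = np.array([[1.0, 4.0],
                  [2.0, 2.0],
                  [4.0, 1.0],
                  [3.0, 3.0],    # dominated by (2, 2)
                  [2.0, 5.0]])   # dominated by (1, 4)

print(pareto_mask(risks))
```

The first three rules form the frontier: each can only be improved at one parameter value by accepting a higher risk at the other.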

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

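# --- Risk calculation for normal means problem ---
def squared_error_risk(theta_hat, theta_true):
    # Total squared error ||theta_hat - theta||^2, summed over coordinates,
    # so that the MLE risk equals J * sigma2 as in the theorem above.
    return np.sum((theta_hat - theta_true) ** 2, axis=-1)

def james_stein_estimator(y, sigma2):
    # Positive-part variant: the factor is clipped at 0 so y never changes sign.
    # Works on a single vector (J,) or a batch of vectors (n_sims, J).
    J = y.shape[-1]
    norm_sq = np.sum(y**2, axis=-1, keepdims=True)
    shrinkage = np.maximum(0.0, 1 - (J - 2) * sigma2 / norm_sq)
    return shrinkage * y

def simulate_risk(true_theta, sigma2=1.0, n_sims=5000):
    # Monte Carlo estimate of the risk of the MLE (y itself) and of James-Stein.
    J = len(true_theta)
    y = np.random.randn(n_sims, J) * np.sqrt(sigma2) + true_theta
    risk_mle = squared_error_risk(y, true_theta)
    risk_js = squared_error_risk(james_stein_estimator(y, sigma2), true_theta)
    return np.mean(risk_mle), np.mean(risk_js)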

# --- Scenario 1: Small theta (near zero) ---
J = 10
theta_small = np.random.randn(J) * 0.3
risk_mle_small, risk_js_small = simulate_risk(theta_small)
print(f"Small θ: MLE risk = {risk_mle_small:.3f}, JS risk = {risk_js_small:.3f}")

# --- Scenario 2: Large theta (far from zero) ---
theta_large = np.random.randn(J) * 3.0
risk_mle_large, risk_js_large = simulate_risk(theta_large)
print(f"Large θ: MLE risk = {risk_mle_large:.3f}, JS risk = {risk_js_large:.3f}")

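# --- Risk as function of ||θ||² ---
norms = np.linspace(0.1, 50, 100)  # grid of target values for ||θ||²
risks_mle = []
risks_js = []
for norm_sq in norms:
    # Rescale a random direction so that ||θ||² equals norm_sq exactly
    direction = np.random.randn(J)
    theta = direction / np.linalg.norm(direction) * np.sqrt(norm_sq)
    rm, rj = simulate_risk(theta)
    risks_mle.append(rm)
    risks_js.append(rj)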

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].plot(norms, risks_mle, 'b-', linewidth=2, label='MLE (y)')
axes[0].plot(norms, risks_js, 'r-', linewidth=2, label='James-Stein')
axes[0].axhline(J, color='blue', linestyle='--', alpha=0.5, label=f'J={J}')
axes[0].set_xlabel('||θ||²')
axes[0].set_ylabel('Risk (MSE)')
axes[0].set_title(f'Risk Comparison (J={J})')
axes[0].legend()

# --- Loss function comparison ---
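theta_range = np.linspace(-3, 3, 201)  # odd number of points so the grid includes zero error
axes[1].plot(theta_range, theta_range**2, 'b-', linewidth=2, label=r'Squared error: $(\theta-a)^2$')
axes[1].plot(theta_range, np.abs(theta_range), 'r-', linewidth=2, label=r'Absolute error: $|\theta-a|$')
# 0-1 loss is 0 only at zero error; isclose guards against floating-point round-off on the grid
axes[1].plot(theta_range, (~np.isclose(theta_range, 0.0)).astype(float), 'g-', linewidth=2, label=r'0-1 loss: $1(\theta \neq a)$')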
axes[1].set_xlabel('θ - a (estimation error)')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss Functions')
axes[1].legend()

# --- Bias-variance tradeoff ---
lambdas = np.linspace(0, 5, 100)
bias_sq = (lambdas * 0.5)**2  # Squared bias increases with λ
variance = 1.0 / (1 + lambdas)  # Variance decreases with λ
total_risk = bias_sq + variance

axes[2].plot(lambdas, bias_sq, 'r--', linewidth=2, label='Bias²')
axes[2].plot(lambdas, variance, 'b--', linewidth=2, label='Variance')
axes[2].plot(lambdas, total_risk, 'k-', linewidth=2, label='Total risk')
min_idx = np.argmin(total_risk)
axes[2].axvline(lambdas[min_idx], color='green', linestyle=':', alpha=0.7, 
                label=f'Optimal λ={lambdas[min_idx]:.2f}')
axes[2].set_xlabel('Regularization parameter λ')
axes[2].set_ylabel('Risk')
axes[2].set_title('Bias-Variance Tradeoff')
axes[2].legend()

plt.tight_layout()
plt.savefig('decision_theory.png', dpi=150)
plt.show()

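# --- Minimax vs Bayes comparison (scalar case: X ~ N(θ, 1)) ---
print("\n=== Minimax vs Bayes ===")
theta_grid = np.linspace(-5, 5, 200)
risk_grid_mle = np.ones_like(theta_grid)  # Risk of the MLE δ(x) = x is constant (σ² = 1)
# Bayes rule with N(0, τ²) prior: δ(x) = c·x with c = τ² / (τ² + 1)
tau2 = 2.0
bayes_shrink = tau2 / (tau2 + 1.0)
# Risk = variance + squared bias = c² + (1 - c)² θ²
risk_grid_bayes = bayes_shrink**2 + (1 - bayes_shrink)**2 * theta_grid**2

print(f"Minimax value (constant MLE risk): {risk_grid_mle.max():.3f}")
print(f"Worst-case risk of the Bayes rule on the grid: {risk_grid_bayes.max():.3f}")
# Averaging the risk over the N(0, τ²) prior gives c² + (1 - c)² τ² = τ² / (τ² + 1)
print(f"Bayes risk under the N(0, τ²) prior: {bayes_shrink:.3f}")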

Key Insight

Decision theory unifies frequentist and Bayesian approaches: the minimax estimator minimizes worst-case frequentist risk, while the Bayes estimator minimizes average Bayesian risk. When a minimax estimator is also a Bayes rule, the two approaches agree.

Key Takeaways

Summary: Statistical Decision Theory

Loss function $L(\theta, a)$ quantifies the cost of decision $a$ when the state is $\theta$; the choice of loss drives the optimal estimator
Risk function $R(\theta, \delta) = E_\theta[L(\theta, \delta(\mathbf{X}))]$ is the expected loss, the frequentist criterion for comparing estimators
Bias-variance decomposition under squared error loss: $R(\theta, \delta) = \text{Bias}^2 + \text{Var}$, the fundamental tradeoff in estimation
Admissible estimators are not dominated by any other estimator; in the normal means problem the MLE is inadmissible for $J \geq 3$
Bayes risk $r(\pi, \delta) = E^\pi[R(\theta, \delta)]$ averages the risk over the prior; the Bayes rule minimizes it
Minimax estimators minimize worst-case risk: conservative, but no prior is required
James-Stein estimator dominates the MLE for $J \geq 3$ under total squared error loss; shrinkage lowers total risk, though not necessarily the error in each coordinate
Pareto optimality characterizes efficient trade-offs between risks at different parameter values
