Statistical Decision Theory
Advanced Statistical Methods
Optimal Decisions Under Uncertainty
Decision theory provides a formal framework for choosing actions that minimize expected loss, combining probability models with utility functions. It unifies hypothesis testing, estimation, and prediction under one coherent philosophy.
- Medical treatment decisions β Balance risks and benefits using loss functions and prior probabilities
- Quality control β Set acceptance sampling rules that minimize total expected inspection costs
- Financial portfolio allocation β Optimize investment decisions by minimizing expected utility
Decision theory transforms statistical evidence into optimal actions.
Statistical decision theory provides a framework for choosing between actions (estimators, tests, decisions) by formalizing the consequences of each choice through loss functions and risk.
The Decision Problem
DfStatistical Decision Problem
A decision problem consists of:
- Parameter space β the set of possible states of nature
- Action space β the set of available actions
- Loss function β the cost of taking action when the true state is
- Decision rule β a function from data to actions
The goal is to find the decision rule that minimizes expected loss.
Loss Functions
DfCommon Loss Functions
| Loss Function | Formula | Property |
|---|---|---|
| 0-1 loss (classification) | Discontinuous; classification | |
| Squared error loss (regression) | Penalizes large errors heavily; symmetric | |
| Absolute error loss | Robust to outliers; median | |
| 0-inflated loss | Encourages sparsity (LASSO connection) | |
| Asymmetric linear loss | Different penalties for over/underestimation |
Loss Function Choice Drives the Estimator
The choice of loss function determines the optimal estimator:
- Squared error β posterior mean (Bayes) / sample mean (frequentist)
- Absolute error β posterior median
- 0-1 loss β posterior mode (MAP)
- Asymmetric linear β posterior quantile
Risk Function
DfRisk Function
The risk of a decision rule under loss is the expected loss:
For squared error loss: .
Bias-Variance Decomposition
Here,
- =Systematic error of the estimator
- =Variability of the estimator
Bias-Variance Tradeoff
Reducing bias often increases variance and vice versa. The optimal estimator minimizes the total risk β the sum of squared bias and variance. Regularization methods (ridge, LASSO) deliberately introduce bias to reduce variance.
Admissibility
DfAdmissible Estimator
An estimator is dominated by if for all with strict inequality for some . An estimator is admissible if no other estimator dominates it.
ThInadmissibility of the MLE in High Dimensions
For the normal means problem with :
The MLE is inadmissible under squared error loss:
for the James-Stein estimator , which shrinks toward zero.
Practical Consequence
In high-dimensional problems, the MLE is always dominated. Shrinkage always helps. This is a fundamental result in modern statistics.
Bayes Risk
DfBayes Risk
Given a prior , the Bayes risk of a decision rule is:
The Bayes rule minimizes the Bayes risk:
Bayes Rule is Optimal
The Bayes rule is the best estimator given the prior β it achieves the minimum possible Bayes risk. For squared error loss, the Bayes rule is the posterior mean: .
Minimax Estimators
DfMinimax Estimator
The minimax estimator minimizes the worst-case risk:
A minimax estimator achieves the minimax value: .
ThMinimax Theorem
An estimator is minimax if and only if its risk function is constant (flat) and no other estimator has uniformly lower risk.
Bayes minimax connection: If is a Bayes rule for prior and its risk is constant, then is minimax, and the minimax value equals the Bayes risk:
When to Use Minimax
Minimax is appropriate when:
- No reliable prior is available
- The consequence of the worst case is catastrophic
- A guarantee on worst-case performance is needed
In practice, minimax is often too conservative β Bayes rules with reasonable priors typically perform better on average.
The James-Stein Estimator
James-Stein Estimator
Here,
- =Dimension (number of parameters)
- =Known variance
- =Squared norm of the observation vector
ThJames-Stein Dominance
For , the James-Stein estimator dominates the MLE under total squared error loss:
for all . The improvement is greatest when is small (near zero).
The Shrinkage Effect
The James-Stein estimator shrinks toward the origin. The shrinkage factor depends on β when the observations are large (far from zero), shrinkage is small. When observations are small (near zero), shrinkage is large. This adaptive shrinkage is what gives James-Stein its power.
Pareto Optimality
DfPareto Optimal Risk
A risk point is Pareto optimal if no other decision rule achieves lower risk for both and simultaneously.
A decision rule is Pareto optimal if its risk vector is not dominated in any component.
Pareto Frontier
The set of Pareto optimal risk points forms the Pareto frontier β the efficient trade-off curve between risks at different parameter values. Any Bayes rule with a proper prior is Pareto optimal.
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(42)
# --- Risk calculation for normal means problem ---
def squared_error_risk(theta_hat, theta_true):
return np.mean((theta_hat - theta_true) ** 2)
def james_stein_estimator(y, sigma2):
J = len(y)
shrinkage = max(0, 1 - (J - 2) * sigma2 / np.sum(y**2))
return shrinkage * y
def simulate_risk(true_theta, sigma2=1.0, n_sims=5000):
J = len(true_theta)
y = np.random.randn(n_sims, J) * np.sqrt(sigma2) + true_theta
risk_mle = np.zeros(n_sims)
risk_js = np.zeros(n_sims)
for i in range(n_sims):
risk_mle[i] = squared_error_risk(y[i], true_theta)
risk_js[i] = squared_error_risk(james_stein_estimator(y[i], sigma2), true_theta)
return np.mean(risk_mle), np.mean(risk_js)
# --- Scenario 1: Small theta (near zero) ---
J = 10
theta_small = np.random.randn(J) * 0.3
risk_mle_small, risk_js_small = simulate_risk(theta_small)
print(f"Small ΞΈ: MLE risk = {risk_mle_small:.3f}, JS risk = {risk_js_small:.3f}")
# --- Scenario 2: Large theta (far from zero) ---
theta_large = np.random.randn(J) * 3.0
risk_mle_large, risk_js_large = simulate_risk(theta_large)
print(f"Large ΞΈ: MLE risk = {risk_mle_large:.3f}, JS risk = {risk_js_large:.3f}")
# --- Risk as function of ||ΞΈ||Β² ---
norms = np.linspace(0.1, 50, 100)
risks_mle = []
risks_js = []
for norm_sq in norms:
theta = np.random.randn(J) * np.sqrt(norm_sq / J)
rm, rj = simulate_risk(theta)
risks_mle.append(rm)
risks_js.append(rj)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
axes[0].plot(norms, risks_mle, 'b-', linewidth=2, label='MLE (y)')
axes[0].plot(norms, risks_js, 'r-', linewidth=2, label='James-Stein')
axes[0].axhline(J, color='blue', linestyle='--', alpha=0.5, label=f'J={J}')
axes[0].set_xlabel('||ΞΈ||Β²')
axes[0].set_ylabel('Risk (MSE)')
axes[0].set_title(f'Risk Comparison (J={J})')
axes[0].legend()
# --- Loss function comparison ---
theta_range = np.linspace(-3, 3, 200)
axes[1].plot(theta_range, theta_range**2, 'b-', linewidth=2, label='Squared error: $(ΞΈ-a)Β²$')
axes[1].plot(theta_range, np.abs(theta_range), 'r-', linewidth=2, label='Absolute error: $|ΞΈ-a|$')
axes[1].plot(theta_range, (theta_range != 0).astype(float), 'g-', linewidth=2, label='0-1 loss: $1(ΞΈβ a)$')
axes[1].set_xlabel('ΞΈ - a (estimation error)')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss Functions')
axes[1].legend()
# --- Bias-variance tradeoff ---
lambdas = np.linspace(0, 5, 100)
bias_sq = (lambdas * 0.5)**2 # Squared bias increases with Ξ»
variance = 1.0 / (1 + lambdas) # Variance decreases with Ξ»
total_risk = bias_sq + variance
axes[2].plot(lambdas, bias_sq, 'r--', linewidth=2, label='BiasΒ²')
axes[2].plot(lambdas, variance, 'b--', linewidth=2, label='Variance')
axes[2].plot(lambdas, total_risk, 'k-', linewidth=2, label='Total risk')
min_idx = np.argmin(total_risk)
axes[2].axvline(lambdas[min_idx], color='green', linestyle=':', alpha=0.7,
label=f'Optimal Ξ»={lambdas[min_idx]:.2f}')
axes[2].set_xlabel('Regularization parameter Ξ»')
axes[2].set_ylabel('Risk')
axes[2].set_title('Bias-Variance Tradeoff')
axes[2].legend()
plt.tight_layout()
plt.savefig('decision_theory.png', dpi=150)
plt.show()
# --- Minimax vs Bayes comparison ---
print("\n=== Minimax vs Bayes ===")
theta_grid = np.linspace(-5, 5, 200)
risk_grid_mle = theta_grid**2 # Risk of MLE (constant = J)
# Bayes rule with N(0, ΟΒ²) prior
tau2 = 2.0
bayes_shrink = tau2 / (tau2 + 1.0)
risk_grid_bayes = bayes_shrink**2 * theta_grid**2 + (1 - bayes_shrink**2)
print(f"Minimax value (MLE risk): {J}")
print(f"Bayes risk (uniform prior): {np.mean(risk_grid_bayes):.3f}")
Key Insight
Decision theory unifies frequentist and Bayesian approaches: the minimax estimator minimizes worst-case frequentist risk, while the Bayes estimator minimizes average Bayesian risk. When a minimax estimator is also a Bayes rule, the two approaches agree.
Related Topics
- See Bayesian Linear Regression for Bayes rules applied to regression
- See Ridge Regression for shrinkage in practice
- See Hypothesis Testing for decision-theoretic view of hypothesis tests
Key Takeaways
Summary: Statistical Decision Theory
- Loss function quantifies the cost of decision when state is β choice of loss drives the optimal estimator
- Risk function is expected loss β the frequentist criterion for comparing estimators
- Bias-variance decomposition: β the fundamental tradeoff in estimation
- Admissible estimators are not dominated by any other estimator β the MLE is inadmissible for
- Bayes risk averages over the prior β the Bayes rule minimizes this
- Minimax estimators minimize worst-case risk β conservative, no prior required
- James-Stein estimator dominates the MLE for β shrinkage always helps in high dimensions
- Pareto optimality characterizes efficient trade-offs between risks at different parameter values