
Statistical Decision Theory


Optimal Decisions Under Uncertainty

Decision theory provides a formal framework for choosing actions that minimize expected loss, combining probability models with utility functions. It unifies hypothesis testing, estimation, and prediction under one coherent philosophy.

  • Medical treatment decisions: balance risks and benefits using loss functions and prior probabilities
  • Quality control: set acceptance sampling rules that minimize total expected inspection costs
  • Financial portfolio allocation: optimize investment decisions by maximizing expected utility (equivalently, minimizing expected loss)

Decision theory transforms statistical evidence into optimal actions.


Statistical decision theory provides a framework for choosing between actions (estimators, tests, decisions) by formalizing the consequences of each choice through loss functions and risk.


The Decision Problem

Definition: Statistical Decision Problem

A decision problem consists of:

  1. Parameter space $\Theta$: the set of possible states of nature
  2. Action space $\mathcal{A}$: the set of available actions
  3. Loss function $L(\theta, a)$: the cost of taking action $a$ when the true state is $\theta$
  4. Decision rule $\delta(\mathbf{x})$: a function from data $\mathbf{x}$ to actions

The goal is to find the decision rule $\delta^*$ that minimizes expected loss.
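
The four ingredients above can be made concrete with a tiny finite problem. The sketch below is illustrative only: the two states, two actions, loss values, and test accuracies are all assumed numbers, not taken from any real application. It enumerates every deterministic decision rule and computes its expected loss in each state.

```python
import itertools
import numpy as np

# Hypothetical two-state problem: theta in {0 = healthy, 1 = diseased},
# actions in {0 = wait, 1 = treat}. All numbers are illustrative assumptions.
loss = np.array([[0.0, 1.0],    # healthy:  wait costs 0, treat costs 1
                 [10.0, 0.0]])  # diseased: wait costs 10, treat costs 0

# p_x_given_theta[theta, x]: probability of test result x (0 = negative, 1 = positive)
p_x_given_theta = np.array([[0.9, 0.1],   # healthy
                            [0.2, 0.8]])  # diseased

def risk(rule, theta):
    """Expected loss in state theta of a rule given as (action if x=0, action if x=1)."""
    return sum(p_x_given_theta[theta, x] * loss[theta, rule[x]] for x in (0, 1))

# Enumerate all four deterministic decision rules delta: {0, 1} -> {0, 1}
rules = list(itertools.product((0, 1), repeat=2))
risk_table = {rule: (risk(rule, 0), risk(rule, 1)) for rule in rules}

for rule, (r_healthy, r_diseased) in risk_table.items():
    print(f"delta = {rule}: R(healthy) = {r_healthy:.2f}, R(diseased) = {r_diseased:.2f}")
```

No single rule has the lowest expected loss in both states, which is why a further criterion (a prior, or a worst-case guarantee) is needed to pick one.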


Loss Functions

Definition: Common Loss Functions

| Loss Function | Formula | Property |
| --- | --- | --- |
| 0-1 loss (classification) | $L(\theta, a) = \mathbb{1}(\theta \neq a)$ | Discontinuous; classification |
| Squared error loss (regression) | $L(\theta, a) = (\theta - a)^2$ | Penalizes large errors heavily; symmetric |
| Absolute error loss | $L(\theta, a) = \lvert \theta - a \rvert$ | Robust to outliers; median |
| 0-inflated loss | $L(\theta, a) = \lambda_0 \mathbb{1}(a \neq 0) + \lambda_1 \lvert \theta - a \rvert$ | Encourages sparsity (LASSO connection) |
| Asymmetric linear loss | $L(\theta, a) = \begin{cases} \lambda_1(\theta - a) & a < \theta \\ \lambda_2(a - \theta) & a > \theta \end{cases}$ | Different penalties for over/underestimation |

Loss Function Choice Drives the Estimator

The choice of loss function determines the optimal estimator:

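  • Squared error → posterior mean (Bayes) / sample mean (frequentist)
  • Absolute error → posterior median
  • 0-1 loss → posterior mode (MAP); exact for discrete $\theta$, a limiting case for continuous $\theta$
  • Asymmetric linear → posterior quantile at level $\lambda_1 / (\lambda_1 + \lambda_2)$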
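
The correspondence in this list can be checked numerically. The sketch below is a rough check under stated assumptions: the Gamma draws stand in for posterior samples of $\theta$, and the penalties `lam1`, `lam2` and the grid of candidate actions are arbitrary illustrative choices. It minimizes the average loss over a grid and compares the minimizer with the mean, median, and quantile of the draws (0-1 loss is omitted because the mode of a continuous posterior needs a limiting argument).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior draws of theta; the skewed Gamma is an arbitrary choice.
posterior_draws = rng.gamma(shape=2.0, scale=1.0, size=20_000)

candidates = np.linspace(0.0, 8.0, 801)  # grid of candidate actions a

def best_action(loss_fn):
    """Return the grid action with the smallest average loss over the draws."""
    expected = [loss_fn(posterior_draws, a).mean() for a in candidates]
    return candidates[int(np.argmin(expected))]

lam1, lam2 = 3.0, 1.0  # assumed penalties: underestimating costs 3x overestimating

a_sq = best_action(lambda t, a: (t - a) ** 2)
a_abs = best_action(lambda t, a: np.abs(t - a))
a_asym = best_action(lambda t, a: np.where(a < t, lam1 * (t - a), lam2 * (a - t)))

q_level = lam1 / (lam1 + lam2)
print(f"squared error : {a_sq:.2f} vs mean     {posterior_draws.mean():.2f}")
print(f"absolute error: {a_abs:.2f} vs median   {np.median(posterior_draws):.2f}")
print(f"asymmetric    : {a_asym:.2f} vs quantile {np.quantile(posterior_draws, q_level):.2f}")
```

Each pair should agree up to the grid spacing of 0.01.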

Risk Function

Definition: Risk Function

The risk of a decision rule $\delta$ under loss $L$ is the expected loss:

$$R(\theta, \delta) = E_\theta[L(\theta, \delta(\mathbf{X}))] = \int L(\theta, \delta(\mathbf{x})) \, p(\mathbf{x} \mid \theta) \, d\mathbf{x}$$

For squared error loss: $R(\theta, \delta) = E_\theta[(\theta - \delta(\mathbf{X}))^2] = \text{Bias}^2(\delta) + \text{Var}(\delta)$.

Bias-Variance Decomposition

$$R(\theta, \delta) = \text{Bias}^2(\delta) + \text{Var}(\delta)$$

Here,

  • $\text{Bias}(\delta) = E[\delta(\mathbf{X})] - \theta$: systematic error of the estimator
  • $\text{Var}(\delta) = E[(\delta(\mathbf{X}) - E[\delta(\mathbf{X})])^2]$: variability of the estimator

Bias-Variance Tradeoff

Reducing bias often increases variance and vice versa. The optimal estimator minimizes the total risk, that is, the sum of squared bias and variance. Regularization methods (ridge, LASSO) deliberately introduce bias to reduce variance.
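
A minimal sketch of the tradeoff, assuming a scaled sample mean $\delta_c = c\,\bar{X}$ with $X_i \sim \mathcal{N}(\theta, \sigma^2)$: its squared bias is $(c-1)^2\theta^2$ and its variance is $c^2\sigma^2/n$, so the risk is minimized at $c^* = \theta^2 / (\theta^2 + \sigma^2/n)$. The values of `theta`, `sigma`, and `n` below are assumptions chosen for illustration, and $c^*$ is an oracle quantity because it depends on the unknown $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n = 1.0, 2.0, 10       # assumed true mean, noise sd, sample size
n_sims = 100_000

# Sample means of n draws from N(theta, sigma^2), simulated directly
xbar = rng.normal(theta, sigma / np.sqrt(n), size=n_sims)

def analytic_risk(c):
    """Risk of delta = c * xbar: squared bias plus variance."""
    bias_sq = ((c - 1.0) * theta) ** 2
    variance = c ** 2 * sigma ** 2 / n
    return bias_sq + variance

c_star = theta ** 2 / (theta ** 2 + sigma ** 2 / n)  # minimizer of analytic_risk

for c in (1.0, c_star, 0.5):
    mc_risk = np.mean((c * xbar - theta) ** 2)
    print(f"c = {c:.3f}: analytic risk = {analytic_risk(c):.4f}, simulated = {mc_risk:.4f}")
```

The unbiased choice $c = 1$ has higher risk than the biased $c^*$, while shrinking too far ($c = 0.5$ here) gives back part of the gain through extra bias.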


Admissibility

Definition: Admissible Estimator

An estimator $\delta_1$ is dominated by $\delta_2$ if $R(\theta, \delta_2) \leq R(\theta, \delta_1)$ for all $\theta$, with strict inequality for some $\theta$. An estimator is admissible if no other estimator dominates it.

Theorem: Inadmissibility of the MLE in High Dimensions

For the normal means problem $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\theta}, \mathbf{I}_J)$ with $J \geq 3$:

The MLE $\hat{\boldsymbol{\theta}} = \mathbf{y}$ is inadmissible under squared error loss:

$$E\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2 = J > E\|\delta^{\text{JS}} - \boldsymbol{\theta}\|^2$$

for the James-Stein estimator $\delta^{\text{JS}}$, which shrinks toward zero.

Practical Consequence

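In the normal means problem with $J \geq 3$, the MLE is dominated under total squared error loss, so some shrinkage estimator has lower total risk at every $\boldsymbol{\theta}$. The gain is large when $\boldsymbol{\theta}$ lies near the shrinkage target and negligible far from it, and the guarantee covers the total risk, not each coordinate separately. This is a fundamental result in modern statistics.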


Bayes Risk

Definition: Bayes Risk

Given a prior $\pi(\theta)$, the Bayes risk of a decision rule $\delta$ is:

$$r(\pi, \delta) = E^\pi[R(\theta, \delta)] = \int R(\theta, \delta) \, \pi(\theta) \, d\theta = E^\pi E^{\mathbf{x} \mid \theta}[L(\theta, \delta(\mathbf{x}))]$$

The Bayes rule $\delta^*$ minimizes the Bayes risk:

$$\delta^*(\mathbf{x}) = \arg\min_a \int L(\theta, a) \, p(\theta \mid \mathbf{x}) \, d\theta$$

Bayes Rule is Optimal

The Bayes rule is the best estimator given the prior, in the sense that it achieves the minimum possible Bayes risk. For squared error loss, the Bayes rule is the posterior mean: $\delta^*(\mathbf{x}) = E[\theta \mid \mathbf{x}]$.
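
As a rough check of this claim, consider the conjugate normal model, with prior $\theta \sim \mathcal{N}(0, \tau^2)$ and a single observation $y \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$. The posterior mean is then $\frac{\tau^2}{\tau^2 + \sigma^2}\, y$ and its Bayes risk is $\frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}$, against $\sigma^2$ for the MLE $y$. The sketch below estimates both by simulating from the joint distribution; `tau2` and `sigma2` are assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2 = 2.0, 1.0            # assumed prior variance and noise variance
n_sims = 200_000

theta = rng.normal(0.0, np.sqrt(tau2), size=n_sims)   # theta ~ N(0, tau2)
y = rng.normal(theta, np.sqrt(sigma2))                # y | theta ~ N(theta, sigma2)

shrink = tau2 / (tau2 + sigma2)    # posterior mean is shrink * y in this model
bayes_risk_mle = np.mean((y - theta) ** 2)
bayes_risk_post_mean = np.mean((shrink * y - theta) ** 2)

print(f"Bayes risk of MLE y         : {bayes_risk_mle:.3f} (theory {sigma2:.3f})")
print(f"Bayes risk of posterior mean: {bayes_risk_post_mean:.3f} "
      f"(theory {sigma2 * tau2 / (sigma2 + tau2):.3f})")
```

The averaging here is over both $\theta$ and $y$, which is what distinguishes Bayes risk from the frequentist risk at a fixed $\theta$.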


Minimax Estimators

Definition: Minimax Estimator

The minimax estimator minimizes the worst-case risk:

$$\delta^{\text{mm}} = \arg\min_\delta \max_\theta R(\theta, \delta)$$

A minimax estimator achieves the minimax value: $V = \min_\delta \max_\theta R(\theta, \delta)$.

Theorem: Minimax Theorem

Constant risk alone does not make an estimator minimax, and a minimax estimator need not have constant risk. Two sufficient conditions are standard: an admissible estimator $\delta^*$ with constant (flat) risk is minimax, and so is a Bayes rule with constant risk.

Bayes minimax connection: If $\delta^*$ is a Bayes rule for a proper prior $\pi^*$ and its risk is constant, then $\delta^*$ is minimax, and the minimax value equals the Bayes risk:

$$V = r(\pi^*, \delta^*)$$
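
A classical instance of the Bayes minimax connection is estimating a binomial proportion $p$ from $X \sim \text{Bin}(n, p)$ under squared error loss. The posterior mean under a $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior, $(X + \sqrt{n}/2)/(n + \sqrt{n})$, has constant risk $1 / (4(\sqrt{n} + 1)^2)$ and is therefore minimax. The sketch below computes exact risks by summing over the binomial distribution; `n = 25` and the grid of $p$ values are assumptions for illustration.

```python
from math import comb

import numpy as np

n = 25                                   # assumed number of Bernoulli trials
x = np.arange(n + 1)                     # possible success counts
binom_coef = np.array([comb(n, int(k)) for k in x], dtype=float)
p_grid = np.linspace(0.0, 1.0, 101)

def exact_risk(estimates, p):
    """Squared-error risk at p: sum over x of (estimate(x) - p)^2 * P(X = x | p)."""
    pmf = binom_coef * p ** x * (1.0 - p) ** (n - x)
    return np.sum((estimates - p) ** 2 * pmf)

mle = x / n
a = np.sqrt(n) / 2                       # Beta(a, a) prior parameter
equalizer = (x + a) / (n + 2 * a)        # posterior mean under that prior

risk_mle = np.array([exact_risk(mle, p) for p in p_grid])
risk_eq = np.array([exact_risk(equalizer, p) for p in p_grid])

print(f"max risk, MLE      : {risk_mle.max():.5f}")   # theory: 1 / (4n), at p = 0.5
print(f"max risk, equalizer: {risk_eq.max():.5f}")    # theory: 1 / (4 (sqrt(n) + 1)^2)
print(f"risk spread of equalizer over p: {risk_eq.max() - risk_eq.min():.2e}")
```

The equalizer rule has lower worst-case risk than the MLE, but the MLE has lower risk for $p$ near 0 or 1, which illustrates the conservatism of the minimax criterion.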

When to Use Minimax

Minimax is appropriate when:

  1. No reliable prior is available
  2. The consequence of the worst case is catastrophic
  3. A guarantee on worst-case performance is needed

In practice, minimax is often too conservative: Bayes rules with reasonable priors typically perform better on average.


The James-Stein Estimator

James-Stein Estimator

$$\delta^{\text{JS}} = \left(1 - \frac{(J-2)\sigma^2}{\|\mathbf{y}\|^2}\right) \mathbf{y}$$

Here,

  • $J$ = Dimension (number of parameters)
  • $\sigma^2$ = Known variance
  • $\|\mathbf{y}\|^2$ = Squared norm of the observation vector

Theorem: James-Stein Dominance

For $J \geq 3$, the James-Stein estimator dominates the MLE under total squared error loss (stated here for $\sigma^2 = 1$):

$$R(\boldsymbol{\theta}, \delta^{\text{JS}}) < R(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{\text{MLE}}) = J$$

for all $\boldsymbol{\theta} \in \mathbb{R}^J$. The improvement is greatest when $\|\boldsymbol{\theta}\|^2$ is small (near zero).
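
A worked risk calculation makes the size of the improvement concrete. The identity below is the standard closed form for the James-Stein risk, written under the assumption $\sigma^2 = 1$ and evaluated at the origin:

```latex
\begin{aligned}
R(\boldsymbol{\theta}, \delta^{\text{JS}})
  &= J - (J-2)^2 \, E_{\boldsymbol{\theta}}\!\left[\frac{1}{\|\mathbf{y}\|^2}\right], \\
R(\mathbf{0}, \delta^{\text{JS}})
  &= J - (J-2)^2 \cdot \frac{1}{J-2} = 2,
  \quad \text{since } \|\mathbf{y}\|^2 \sim \chi^2_J \text{ when } \boldsymbol{\theta} = \mathbf{0}.
\end{aligned}
```

At the origin the risk is 2 whatever the dimension, against $J$ for the MLE. As $\|\boldsymbol{\theta}\|^2$ grows, the expectation of $1/\|\mathbf{y}\|^2$ tends to zero and the two risks converge.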

The Shrinkage Effect

The James-Stein estimator shrinks $\mathbf{y}$ toward the origin. The shrinkage factor depends on $\|\mathbf{y}\|^2$: when the observations are large (far from zero), shrinkage is small, and when they are small (near zero), shrinkage is large. This adaptive shrinkage is what gives James-Stein its power.


Pareto Optimality

Definition: Pareto Optimal Risk

A risk point $(R(\theta_1, \delta), R(\theta_2, \delta))$ is Pareto optimal if no other decision rule has risk at least as low at both $\theta_1$ and $\theta_2$ and strictly lower at one of them.

A decision rule $\delta^*$ is Pareto optimal if no other rule's risk vector dominates its own in this sense, which is the admissibility condition restricted to the parameter values considered.

Pareto Frontier

The set of Pareto optimal risk points forms the Pareto frontier, the efficient trade-off curve between risks at different parameter values. A Bayes rule is Pareto optimal when its prior puts positive weight on every parameter value considered and its Bayes risk is finite.
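
For a finite collection of rules, the frontier can be found by discarding every risk vector that another vector dominates. The sketch below assumes five hypothetical risk points for a two-value parameter space; the numbers and the helper name `pareto_mask` are illustrative only.

```python
import numpy as np

# Hypothetical risk vectors (R(theta_1, delta), R(theta_2, delta)) for five rules
risk_points = np.array([
    [0.0, 10.0],
    [0.1, 2.0],
    [0.9, 8.0],
    [1.0, 0.0],
    [0.5, 2.5],
])

def pareto_mask(points):
    """Mark each row that no other row dominates (<= everywhere, < somewhere)."""
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        dominated_by = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        keep[i] = not dominated_by.any()
    return keep

mask = pareto_mask(risk_points)
print(risk_points[mask])  # rows on the Pareto frontier
```

Here the third and fifth points are dominated by the second, so three points remain on the frontier.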


Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

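# --- Risk calculation for normal means problem ---
def squared_error_risk(theta_hat, theta_true):
    # Total squared error over all J coordinates, so the MLE risk equals J * sigma2
    return np.sum((theta_hat - theta_true) ** 2)

def james_stein_estimator(y, sigma2):
    J = len(y)
    # Positive-part variant: the factor is clipped at 0 so shrinkage never flips signs
    shrinkage = max(0, 1 - (J - 2) * sigma2 / np.sum(y**2))
    return shrinkage * y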

def simulate_risk(true_theta, sigma2=1.0, n_sims=5000):
    J = len(true_theta)
    y = np.random.randn(n_sims, J) * np.sqrt(sigma2) + true_theta
    
    risk_mle = np.zeros(n_sims)
    risk_js = np.zeros(n_sims)
    
    for i in range(n_sims):
        risk_mle[i] = squared_error_risk(y[i], true_theta)
        risk_js[i] = squared_error_risk(james_stein_estimator(y[i], sigma2), true_theta)
    
    return np.mean(risk_mle), np.mean(risk_js)

# --- Scenario 1: Small theta (near zero) ---
J = 10
theta_small = np.random.randn(J) * 0.3
risk_mle_small, risk_js_small = simulate_risk(theta_small)
print(f"Small θ: MLE risk = {risk_mle_small:.3f}, JS risk = {risk_js_small:.3f}")

# --- Scenario 2: Large theta (far from zero) ---
theta_large = np.random.randn(J) * 3.0
risk_mle_large, risk_js_large = simulate_risk(theta_large)
print(f"Large θ: MLE risk = {risk_mle_large:.3f}, JS risk = {risk_js_large:.3f}")

# --- Risk as function of ||θ||² ---
norms = np.linspace(0.1, 50, 100)  # grid of squared norms ||θ||²
risks_mle = []
risks_js = []
for norm_sq in norms:
    # Equal coordinates give ||θ||² = norm_sq exactly; both risks depend on θ only through its norm
    theta = np.full(J, np.sqrt(norm_sq / J))
    rm, rj = simulate_risk(theta)
    risks_mle.append(rm)
    risks_js.append(rj)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].plot(norms, risks_mle, 'b-', linewidth=2, label='MLE (y)')
axes[0].plot(norms, risks_js, 'r-', linewidth=2, label='James-Stein')
axes[0].axhline(J, color='blue', linestyle='--', alpha=0.5, label=f'J={J}')
axes[0].set_xlabel('||θ||²')
axes[0].set_ylabel('Risk (total squared error)')
axes[0].set_title(f'Risk Comparison (J={J})')
axes[0].legend()

# --- Loss function comparison ---
theta_range = np.linspace(-3, 3, 201)  # odd number of points so the grid includes 0
axes[1].plot(theta_range, theta_range**2, 'b-', linewidth=2, label=r'Squared error: $(\theta-a)^2$')
axes[1].plot(theta_range, np.abs(theta_range), 'r-', linewidth=2, label=r'Absolute error: $|\theta-a|$')
# 0-1 loss is 0 only at zero error; isclose guards against floating-point noise at the grid midpoint
axes[1].plot(theta_range, (~np.isclose(theta_range, 0.0)).astype(float), 'g-', linewidth=2, label=r'0-1 loss: $1(\theta \neq a)$')
axes[1].set_xlabel('θ - a (estimation error)')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss Functions')
axes[1].legend()

# --- Bias-variance tradeoff (stylized curves, not derived from a fitted model) ---
lambdas = np.linspace(0, 5, 100)
bias_sq = (lambdas * 0.5)**2  # Squared bias increases with λ
variance = 1.0 / (1 + lambdas)  # Variance decreases with λ
total_risk = bias_sq + variance

axes[2].plot(lambdas, bias_sq, 'r--', linewidth=2, label='Bias²')
axes[2].plot(lambdas, variance, 'b--', linewidth=2, label='Variance')
axes[2].plot(lambdas, total_risk, 'k-', linewidth=2, label='Total risk')
min_idx = np.argmin(total_risk)
axes[2].axvline(lambdas[min_idx], color='green', linestyle=':', alpha=0.7,
                label=f'Optimal λ={lambdas[min_idx]:.2f}')
axes[2].set_xlabel('Regularization parameter λ')
axes[2].set_ylabel('Risk')
axes[2].set_title('Bias-Variance Tradeoff')
axes[2].legend()

plt.tight_layout()
plt.savefig('decision_theory.png', dpi=150)
plt.show()

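# --- Minimax vs Bayes comparison (scalar case: y ~ N(θ, 1)) ---
print("\n=== Minimax vs Bayes ===")
theta_grid = np.linspace(-5, 5, 200)
risk_grid_mle = np.ones_like(theta_grid)  # Risk of the MLE y is constant (= 1)
# Bayes rule under a N(0, τ²) prior: delta(y) = c * y with c = τ² / (τ² + 1)
tau2 = 2.0
bayes_shrink = tau2 / (tau2 + 1.0)
# Frequentist risk of c * y: squared bias (1 - c)² θ² plus variance c²
risk_grid_bayes = (1 - bayes_shrink)**2 * theta_grid**2 + bayes_shrink**2

print(f"Minimax value (max risk of MLE): {risk_grid_mle.max():.3f}")
print(f"Max risk of Bayes rule on grid: {risk_grid_bayes.max():.3f}")
# Averaging the risk over the N(0, τ²) prior gives τ² / (τ² + 1), which equals c
print(f"Bayes risk under N(0, τ²) prior: {bayes_shrink:.3f}")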

Key Insight

Decision theory unifies frequentist and Bayesian approaches: the minimax estimator minimizes worst-case frequentist risk, while the Bayes estimator minimizes average Bayesian risk. When a minimax estimator is also a Bayes rule, the two approaches agree.



Key Takeaways

Summary: Statistical Decision Theory

  • Loss function $L(\theta, a)$ quantifies the cost of decision $a$ when the state is $\theta$; the choice of loss drives the optimal estimator
  • Risk function $R(\theta, \delta) = E_\theta[L(\theta, \delta(\mathbf{X}))]$ is expected loss, the frequentist criterion for comparing estimators
  • Bias-variance decomposition: $R(\theta, \delta) = \text{Bias}^2 + \text{Var}$ under squared error loss, the fundamental tradeoff in estimation
  • Admissible estimators are not dominated by any other estimator; the MLE is inadmissible in the normal means problem for $J \geq 3$
  • Bayes risk $r(\pi, \delta) = E^\pi[R(\theta, \delta)]$ averages over the prior; the Bayes rule minimizes this
  • Minimax estimators minimize worst-case risk: conservative, no prior required
  • James-Stein estimator dominates the MLE for $J \geq 3$; shrinkage lowers total squared error risk, most when $\boldsymbol{\theta}$ is near the shrinkage target
  • Pareto optimality characterizes efficient trade-offs between risks at different parameter values
