AIC and BIC — Information Criteria for Model Selection
Balancing Model Fit Against Complexity
Information criteria penalize models for having too many parameters, which guards against overfitting while still rewarding good fit. AIC targets predictive accuracy; BIC targets identification of the true model, assuming it is among the candidates. Comparing the two shows whether added complexity is justified.
- Time Series: Select the best ARIMA order from competing specifications
- Epidemiology: Choose among risk factor models with different covariate sets
- Ecology: Compare species distribution models with varying environmental predictors
Lower information criteria values indicate models that balance simplicity and accuracy best.
Information criteria balance model fit against complexity to select the best model among candidates. They provide a principled way to avoid overfitting.
Definition: Information Criterion
A metric that penalizes models for having more parameters, balancing goodness-of-fit with parsimony. Lower values indicate better models.
Akaike Information Criterion (AIC)
$$\mathrm{AIC} = -2\ln(\hat{L}) + 2k = D + 2k$$
Here,
- $\hat{L}$ = Maximized likelihood value
- $k$ = Number of estimated parameters
- $D = -2\ln(\hat{L})$ = Deviance (measure of lack of fit)
AIC Interpretation
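AIC estimates relative out-of-sample prediction error (more precisely, the expected Kullback-Leibler information loss, up to a constant shared by all candidate models). Only differences in AIC between models fitted to the same data are meaningful. Among a set of models, the one with the lowest AIC is expected to have the best predictive performance.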
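As a minimal sketch of the formula $\mathrm{AIC} = -2\ln(\hat{L}) + 2k$, the snippet below computes AIC directly from a maximized log-likelihood. The function name `aic` and the two log-likelihood values are hypothetical, chosen only for illustration.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = -2 ln(L-hat) + 2k, given the maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * k

# Hypothetical maximized log-likelihoods for two candidate models
aic_small = aic(log_likelihood=-220.0, k=3)
aic_large = aic(log_likelihood=-219.5, k=5)

# The larger model fits slightly better, but not by enough to
# offset the penalty for its two extra parameters.
print(f"small model AIC = {aic_small:.1f}, large model AIC = {aic_large:.1f}")
```

Here the smaller model has the lower AIC, so it would be preferred on this criterion.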
Bayesian Information Criterion (BIC)
$$\mathrm{BIC} = -2\ln(\hat{L}) + k\ln(n)$$
Here,
- $n$ = Sample size
- $k$ = Number of parameters
AIC vs BIC
- AIC: Optimizes predictive accuracy; tends to select larger models
- BIC: Optimizes model identification (aims to find the true model, assuming it is among the candidates); tends to select smaller models
- BIC penalizes complexity more heavily than AIC when $\ln(n) > 2$, i.e. $n \geq 8$
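The difference between the two criteria comes down to the penalty term: $2k$ for AIC versus $k\ln(n)$ for BIC, so the crossover sits at $n = e^2 \approx 7.39$. The sketch below illustrates this with a hypothetical parameter count; the helper names are not from any library.

```python
import math

def aic_penalty(k: int) -> float:
    """Complexity penalty in AIC: 2k."""
    return 2.0 * k

def bic_penalty(k: int, n: int) -> float:
    """Complexity penalty in BIC: k * ln(n)."""
    return k * math.log(n)

k = 4  # hypothetical number of parameters
for n in (5, 7, 8, 100, 10_000):
    heavier = "BIC" if bic_penalty(k, n) > aic_penalty(k) else "AIC"
    print(f"n={n:>6}: AIC penalty={aic_penalty(k):.2f}, "
          f"BIC penalty={bic_penalty(k, n):.2f} -> heavier: {heavier}")
```

For any realistic sample size the BIC penalty is the larger one, and the gap grows with $n$, which is why BIC tends to pick smaller models.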
Corrected AIC (AICc)
For small samples, AIC can be overly liberal (overfits).
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$
Here,
- $\mathrm{AICc}$ = Corrected AIC
- $n$ = Sample size
- $k$ = Number of parameters
Use AICc When
Use AICc when $n/k < 40$. It converges to AIC as $n \to \infty$.
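To see how quickly the small-sample correction $2k(k+1)/(n-k-1)$ vanishes, the following sketch evaluates it for a fixed, hypothetical AIC value and parameter count at several sample sizes. The function name `aicc` is illustrative.

```python
def aicc(aic: float, k: int, n: int) -> float:
    """AICc = AIC + 2k(k+1)/(n-k-1); only defined when n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

aic_value, k = 100.0, 5  # hypothetical values
corrections = {n: aicc(aic_value, k, n) - aic_value for n in (20, 50, 200, 5000)}
for n, c in corrections.items():
    print(f"n={n:>5}, n/k={n / k:>6.0f}: correction = {c:.3f}")
```

The correction is several AIC units when $n/k$ is small and becomes negligible once $n/k$ is large.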
Deviance Information Criterion (DIC)
For Bayesian models:
$$\mathrm{DIC} = D(\bar{\theta}) + 2p_D$$
Here,
- $D(\bar{\theta})$ = Deviance at posterior mean
- $p_D$ = Effective number of parameters
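The sketch below computes $\mathrm{DIC} = D(\bar{\theta}) + 2p_D$ from posterior draws, assuming the usual definition $p_D = \bar{D} - D(\bar{\theta})$ (posterior mean deviance minus deviance at the posterior mean), which is not spelled out above. The data and the model (a normal mean with known unit variance and a flat prior, so the posterior can be sampled directly) are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical data: y_i ~ Normal(mu, 1) with known unit variance
y = [random.gauss(2.0, 1.0) for _ in range(50)]
n = len(y)
y_bar = sum(y) / n

def deviance(mu: float) -> float:
    """D(mu) = -2 * log-likelihood of the data under Normal(mu, 1)."""
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y)
    return -2.0 * log_lik

# With a flat prior the posterior of mu is Normal(y_bar, 1/n), so draw from it directly
draws = [random.gauss(y_bar, 1.0 / math.sqrt(n)) for _ in range(5000)]

mu_mean = sum(draws) / len(draws)
d_at_mean = deviance(mu_mean)                            # D(theta-bar)
d_mean = sum(deviance(mu) for mu in draws) / len(draws)  # posterior mean deviance
p_d = d_mean - d_at_mean                                 # effective number of parameters
dic = d_at_mean + 2.0 * p_d

print(f"p_D = {p_d:.3f}, DIC = {dic:.2f}")
```

Because this toy model has a single free parameter, $p_D$ should come out close to 1. In practice the draws would come from an MCMC sampler rather than a closed-form posterior.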
Comparing Models
Likelihood Ratio Test
For nested models:
$$\chi^2 = -2\ln\left(\frac{L_0}{L_1}\right) = -2\left(\ln L_0 - \ln L_1\right)$$
Here,
- $L_0$ = Likelihood of simpler (restricted) model
- $L_1$ = Likelihood of more complex model
- $\chi^2$ = Test statistic with $df = k_1 - k_0$
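A minimal sketch of the test for the special case where the two nested models differ by exactly one parameter ($df = 1$). It uses only the standard library, relying on the identity that the chi-square survival function with one degree of freedom equals $\mathrm{erfc}(\sqrt{x/2})$. The function name and the log-likelihood values are hypothetical; for general $df$, `scipy.stats.chi2.sf` is the usual tool.

```python
import math

def lr_test_df1(loglik_restricted: float, loglik_full: float):
    """Return (statistic, p-value) for nested models differing by one parameter."""
    stat = -2.0 * (loglik_restricted - loglik_full)
    # For df = 1: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical maximized log-likelihoods
stat, p_value = lr_test_df1(loglik_restricted=-225.0, loglik_full=-221.0)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the extra parameter improves the fit by more than chance alone would explain.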
Information Criterion Comparison
| Metric | Values | Interpretation |
|--------|--------|---------------|
| $\Delta_i = 0$ | Best model | Strongest support |
| $\Delta_i$ | $0$ to $2$ | Substantial support |
| $\Delta_i$ | $2$ to $4$ | Considerable support |
| $\Delta_i$ | $4$ to $7$ | Much less support |
| $\Delta_i$ | $> 10$ | Essentially no support |
Evidence Ratios
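Akaike Weight
$$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta_r / 2)}, \qquad \Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$$
Here,
- $w_i$ = Akaike weight for model i (probability of being best)
- $\Delta_i$ = Difference from best model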
Interpreting Weights
An Akaike weight of 0.85 means the model has an 85% probability of being the best among the candidate set (given the data and criteria).
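The following sketch computes Akaike weights as $w_i \propto \exp(-\Delta_i/2)$ and, from them, evidence ratios. The evidence ratio is taken here to be the ratio of the best model's weight to another model's weight, which equals $\exp(\Delta_i/2)$; that definition and the AIC values are assumptions made for illustration.

```python
import math

def akaike_weights(aic_values):
    """Return (deltas, weights) for a list of AIC values."""
    best = min(aic_values)
    deltas = [a - best for a in aic_values]
    rel_lik = [math.exp(-d / 2.0) for d in deltas]  # relative likelihoods
    total = sum(rel_lik)
    return deltas, [r / total for r in rel_lik]

aic_values = [100.0, 102.0, 110.0]  # hypothetical candidate set
deltas, weights = akaike_weights(aic_values)

# Evidence ratio of the best model against each candidate: w_best / w_i
best_weight = max(weights)
evidence_ratios = [best_weight / w for w in weights]

for d, w, er in zip(deltas, weights, evidence_ratios):
    print(f"delta = {d:>4.1f}, weight = {w:.3f}, evidence ratio = {er:.1f}")
```

Weights depend on the candidate set: adding or removing a model changes every weight, whereas the evidence ratio between two given models does not.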
Python Implementation
```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

np.random.seed(42)

# Generate data: true model is quadratic
n = 100
X = np.random.uniform(-3, 3, n)
Y = 2 + 1.5*X - 0.8*X**2 + np.random.randn(n) * 1.5

# Fit polynomial models of increasing complexity (degree 1 to 5)
models = {}
for degree in range(1, 6):
    # Columns X, X^2, ..., X^degree; add_constant supplies the intercept
    X_poly = np.column_stack([X**i for i in range(1, degree + 1)])
    X_poly = sm.add_constant(X_poly)
    models[degree] = sm.OLS(Y, X_poly).fit()

# Compare AIC, BIC, and AICc
print("Model Comparison:")
print(f"{'Degree':<10} {'AIC':<10} {'BIC':<10} {'AICc':<10} {'k':<5}")
print("-" * 45)
for deg, m in models.items():
    aic = m.aic
    bic = m.bic
    k = int(m.df_model) + 1  # slope coefficients plus the intercept
    aicc = aic + 2*k*(k+1)/(n - k - 1)
    print(f"{deg:<10} {aic:<10.1f} {bic:<10.1f} {aicc:<10.1f} {k:<5}")

# Akaike weights
aics = np.array([m.aic for m in models.values()])
delta_aics = aics - aics.min()
weights = np.exp(-delta_aics / 2)
weights = weights / weights.sum()
print("\nAkaike Weights:")
for deg, w in zip(models.keys(), weights):
    print(f"  Degree {deg}: {w:.3f}")

# Likelihood ratio test (nested models): degree 1 is nested in degree 2
lr_stat = -2 * (models[1].llf - models[2].llf)
lr_pval = stats.chi2.sf(lr_stat, df=1)  # one extra parameter, so df = 1
print(f"\nLR test (degree 1 vs 2): χ²={lr_stat:.2f}, p={lr_pval:.4f}")
```
Worked Example
Example: Variable Selection in Regression
Comparing models with different predictor sets:
| Model | Variables | k | AIC | BIC | AICc |
|-------|----------|---|-----|-----|------|
| 1 | X1 | 2 | 452.1 | 458.3 | 452.4 |
| 2 | X1, X2 | 3 | 445.3 | 454.5 | 445.7 |
| 3 | X1, X2, X3 | 4 | 447.8 | 460.1 | 448.5 |
| 4 | X1, X2, X3, X4 | 5 | 450.2 | 465.5 | 451.2 |
AIC selects Model 2 (lowest AIC, 445.3)
BIC also selects Model 2 (lowest BIC, 454.5), even though it penalizes the extra parameter more heavily
Conclusion: Model 2 with X1 and X2 is the best model under all three criteria.
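The selection can be checked mechanically. The sketch below copies the values from the table above into a dictionary (the layout and variable names are illustrative) and reports, for each criterion, the model with the lowest value and every model's difference from it.

```python
# Values copied from the worked-example table: model -> (k, AIC, BIC, AICc)
table = {
    1: (2, 452.1, 458.3, 452.4),
    2: (3, 445.3, 454.5, 445.7),
    3: (4, 447.8, 460.1, 448.5),
    4: (5, 450.2, 465.5, 451.2),
}
criteria = {"AIC": 1, "BIC": 2, "AICc": 3}  # column index within each tuple

selected = {}
for name, col in criteria.items():
    # Lower is better, so the selected model minimizes the criterion
    best_model = min(table, key=lambda m: table[m][col])
    best_value = table[best_model][col]
    deltas = {m: round(row[col] - best_value, 1) for m, row in table.items()}
    selected[name] = best_model
    print(f"{name}: selects Model {best_model}; differences from best = {deltas}")
```

Listing the differences alongside the winner shows how close the runners-up are, which a bare "lowest value" summary hides.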
Key Takeaways
Summary: AIC and BIC
- AIC = $-2\ln(\hat{L}) + 2k$: optimizes predictive accuracy
- BIC = $-2\ln(\hat{L}) + k\ln(n)$: optimizes model identification; penalizes complexity more
- AICc adds a correction for small samples
- Lower is better for all information criteria
- Use $\Delta_i$ and Akaike weights for model comparison
- AIC tends to select larger models; BIC selects smaller models
- For nested models, the likelihood ratio test is also appropriate
- Always report multiple criteria (AIC, BIC, AICc) for transparency
Related Topics
- See Cross-Validation for resampling-based model evaluation
- See ARIMA Models for using AIC/BIC in time series
- See ROC and AUC for classification model evaluation