Regression Assumptions: The LINE Framework
Four Assumptions Every Regression Must Meet
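The LINE framework (Linearity, Independence, Normality, Equal Variance) lists the conditions under which OLS estimates are valid and inference is trustworthy. Nonlinearity biases the coefficients; dependent or unequal-variance errors leave them unbiased but inefficient and make the usual standard errors wrong; non-normal errors mainly affect small-sample tests and intervals.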
- Policy Evaluation — Ensure causal estimates from regression models are credible
- Financial Modeling — Validate assumptions before using regression for risk assessment
- Scientific Research — Meet peer-review standards by demonstrating assumption compliance
Check the assumptions before trusting the coefficients the model produces.
For OLS estimates to be valid and inference to be correct, four key assumptions must hold.
Definition: LINE Framework
The four key assumptions for valid OLS inference: Linearity, Independence, Normality, and Equal variance (homoscedasticity).
L — Linearity
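The expected value of Y is a linear function of X. The model must be linear in its coefficients, so a curved relationship can still be handled by adding terms such as X².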
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)
# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()
# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                    (model_quad, 'Quadratic (Linearity Violated)')]):
    # Left column: residuals vs fitted (a curved band signals nonlinearity)
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    # Right column: normal Q-Q plot of the residuals
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')
plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()
I — Independence
The errors are independent across observations. This assumption is commonly violated in:
- Time series data (autocorrelation)
- Clustered data (students within schools)
- Spatial data
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson statistic: near 2 = no first-order autocorrelation, <2 positive, >2 negative
# Only meaningful when the rows have a natural order (e.g. time)
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
print(f"Interpretation: {'No autocorrelation' if 1.5<dw<2.5 else 'Possible autocorrelation'}")
N — Normality of Residuals
Residuals should be approximately normally distributed.
# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")
# Also check Q-Q plot (visual is often more informative for moderate n)
# Normality mainly matters for inference (t-tests, p-values) — less for point estimates
E — Equal Variance (Homoscedasticity)
The variance of residuals should be constant across all levels of X.
# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")
# White's test (more general)
from statsmodels.stats.diagnostic import het_white
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
# The first two return values are the Lagrange multiplier statistic and its p-value
print(f"White's test: LM={wh_stat:.4f}, p={wh_p:.4f}")
Violation Consequences
Heteroscedasticity leaves the OLS coefficients unbiased but makes the usual standard errors too large or too small, so t-tests and confidence intervals become unreliable. Consider using heteroscedasticity-robust standard errors or transforming the response variable.
Key Takeaways
Summary: Regression Assumptions
- Linearity: residuals vs fitted should show no pattern
- Independence: use Durbin-Watson for time-ordered data; use cluster-robust standard errors or multilevel models for grouped data
- Normality: matters mainly for inference; with large samples, tests and intervals are robust via the CLT
- Homoscedasticity: violations leave coefficients unbiased but inflate or deflate standard errors
- Violations: transform Y (log), use robust SEs, or switch to GLMs