Regression Assumptions: The LINE Framework

Four Assumptions Every Regression Must Meet

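The LINE framework (Linearity, Independence, Normality, Equal Variance) lists the conditions under which OLS estimates are valid and inference is trustworthy. The damage depends on which assumption fails: a wrong functional form biases the coefficients, while dependent or unequal-variance errors mainly distort the standard errors.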

Policy Evaluation — Ensure causal estimates from regression models are credible
Financial Modeling — Validate assumptions before using regression for risk assessment
Scientific Research — Meet peer-review standards by demonstrating assumption compliance

Check the assumptions before trusting the coefficients the model produces.

For OLS estimates to be valid and inference to be correct, four key assumptions must hold.

Definition: LINE Framework

The four key assumptions for valid OLS inference: Linearity, Independence, Normality, and Equal variance (homoscedasticity).

L — Linearity

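The conditional mean of Y is a linear function of X: E[Y | X] = β0 + β1·X. When this holds, a plot of residuals against fitted values shows no systematic curve.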

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)

# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()

# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                     (model_quad, 'Quadratic (Linearity Violated)')]):
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')

plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()

I — Independence

The errors are independent across observations. This assumption is commonly violated in:

Time series data (autocorrelation)
Clustered data (students within schools)
Spatial data

from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic: 2 = no autocorrelation, <2 positive, >2 negative
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
print(f"Interpretation: {'No autocorrelation' if 1.5<dw<2.5 else 'Possible autocorrelation'}")

N — Normality of Residuals

Residuals should be approximately normally distributed.

# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")

# Also check Q-Q plot (visual is often more informative for moderate n)
# Normality mainly matters for inference (t-tests, p-values) — less for point estimates

E — Equal Variance (Homoscedasticity)

The variance of residuals should be constant across all levels of X.

# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan

bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")

# White's test (more general)
from statsmodels.stats.diagnostic import het_white
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
# The first two return values are the Lagrange multiplier statistic and its p-value
print(f"White's test: LM={wh_stat:.4f}, p={wh_p:.4f}")

Violation Consequences

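Under heteroscedasticity the OLS coefficients remain unbiased, but the usual standard errors are wrong (too small or too large), so t-tests and confidence intervals become unreliable. Consider using heteroscedasticity-robust standard errors or transforming the response variable.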

Key Takeaways

Summary: Regression Assumptions

Linearity: residuals vs fitted should show no pattern
Independence: use Durbin-Watson for time series; use cluster-robust or multilevel models for grouped data
Normality: matters mainly for inference; with large samples, inference is approximately valid thanks to the CLT
Homoscedasticity: coefficients stay unbiased, but violations inflate or deflate standard errors
Violations: transform Y (log), use robust SEs, or switch to GLMs
