
Regression Assumptions — LINE Framework and Diagnostics


Regression Assumptions: The LINE Framework

Four Assumptions Every Regression Must Meet

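The LINE framework (Linearity, Independence, Normality, Equal Variance) lists the conditions under which OLS coefficient estimates are unbiased and the usual standard errors, t-tests, and confidence intervals can be trusted. Violating linearity biases the coefficients; violating the other assumptions mainly makes the estimates inefficient or the inference unreliable.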

  • Policy Evaluation — Ensure causal estimates from regression models are credible
  • Financial Modeling — Validate assumptions before using regression for risk assessment
  • Scientific Research — Meet peer-review standards by demonstrating assumption compliance

Check the assumptions before trusting the coefficients the model produces.


For OLS estimates to be valid and inference to be correct, four key assumptions must hold.

Definition: LINE Framework

The four key assumptions for valid OLS inference: Linearity, Independence, Normality, and Equal variance (homoscedasticity).

L — Linearity

The expected value of Y is a linear function of X. When this holds, a plot of residuals against fitted values shows no systematic curve.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)

# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()

# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                     (model_quad, 'Quadratic (Linearity Violated)')]):
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')

plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()

I — Independence

The errors are independent across observations. This assumption is typically violated in:

  • Time series data (autocorrelation)
  • Clustered data (students within schools)
  • Spatial data

from statsmodels.stats.stattools import durbin_watson

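# Durbin-Watson statistic ranges from 0 to 4: near 2 = no first-order autocorrelation,
# toward 0 = positive, toward 4 = negative. It is only meaningful when rows are in time order.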
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
print(f"Interpretation: {'No autocorrelation' if 1.5<dw<2.5 else 'Possible autocorrelation'}")

N — Normality of Residuals

Residuals should be approximately normally distributed.

# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")

# Also check Q-Q plot (visual is often more informative for moderate n)
# Normality mainly matters for inference (t-tests, p-values) — less for point estimates

E — Equal Variance (Homoscedasticity)

The variance of residuals should be constant across all levels of X.

# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan

bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")

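# White's test (more general: also picks up nonlinear forms of heteroscedasticity)
from statsmodels.stats.diagnostic import het_white
# het_white returns (LM statistic, LM p-value, F statistic, F p-value)
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
print(f"White's test: LM={wh_stat:.4f}, p={wh_p:.4f}")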

Violation Consequences

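Under heteroscedasticity the OLS coefficients remain unbiased, but the usual standard errors are biased (too small or too large), so t-tests and confidence intervals become unreliable. Consider using heteroscedasticity-robust standard errors or a variance-stabilizing transformation of the response variable, such as a log.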


Key Takeaways

Summary: Regression Assumptions

  • Linearity: residuals vs fitted should show no systematic pattern
  • Independence: use Durbin-Watson for time-ordered data; use clustered standard errors or mixed models for grouped data
  • Normality: matters mainly for inference; in large samples the CLT keeps t-tests and p-values approximately valid
  • Homoscedasticity: violations bias the standard errors but leave the coefficients unbiased
  • Remedies: transform Y (log), use robust SEs, or switch to GLMs
