
Residual Analysis — Diagnosing Regression Problems


Diagnosing What's Wrong With Your Regression

Residual analysis reveals whether your model assumptions hold, identifies influential outliers, and detects non-linearity. It's the diagnostic toolkit that separates trustworthy models from misleading ones.

  • Clinical Studies — Detect influential patients that distort treatment effect estimates

  • Financial Forecasting — Identify time periods where models systematically fail

  • Manufacturing — Spot measurement errors and anomalous production runs

What remains after fitting the model tells you as much as what the model predicts.


Residuals are the differences between observed and predicted values:

Residual

$$e_i = y_i - \hat{y}_i$$

Here,

  • $e_i$ = the residual for observation $i$
  • $y_i$ = observed value
  • $\hat{y}_i$ = predicted (fitted) value

Analyzing residuals reveals whether regression assumptions are met and identifies influential observations.
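
As a minimal sketch of the definition above, the residuals can be computed by hand from a least-squares fit. The data values are hypothetical and chosen only for illustration; `np.linalg.lstsq` stands in for whatever fitting routine you use.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta          # fitted values
residuals = y - y_hat     # e_i = y_i - y_hat_i

print("Fitted values:", np.round(y_hat, 3))
print("Residuals:    ", np.round(residuals, 3))

# With an intercept in the model, OLS residuals sum to zero
# (up to floating-point error)
print("Sum of residuals:", round(float(residuals.sum()), 10))
```

Because the model includes an intercept, the residuals sum to zero and are uncorrelated with the predictor, so any visible pattern in a residual plot points to structure the model missed.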


import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import statsmodels.api as sm

from scipy import stats



np.random.seed(42)



# Build a regression model with potential issues

n = 80

X = np.random.uniform(1, 10, n)

y = 2 + 3*X + np.random.normal(0, 2, n)



# Add some outliers and influential points

X = np.append(X, [1.5, 9.5])

y = np.append(y, [25, 10])  # outliers



X_dm = sm.add_constant(X)

model = sm.OLS(y, X_dm).fit()



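# Types of residuals

raw_resid = model.resid

influence = model.get_influence()

# Externally studentized residuals. outlier_test() returns a plain ndarray

# for NumPy inputs, so its result cannot be indexed by column name.

student_resid = influence.resid_studentized_external

leverage = influence.hat_matrix_diag

cooks_d = influence.cooks_distance[0]

# Internally studentized (standardized) residuals

standardized_resid = influence.resid_studentized_internal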



print("Influence Diagnostics:")

print(f"Max leverage: {leverage.max():.4f} (threshold: {2*(2/len(X)):.4f})")

print(f"Max Cook's D: {cooks_d.max():.4f} (threshold: 4/n = {4/len(X):.4f})")



high_lev = leverage > 2*(2/len(X))

high_cook = cooks_d > 4/len(X)

print(f"High leverage points: {high_lev.sum()}")

print(f"High Cook's D points: {high_cook.sum()}")



# 4-panel diagnostic plot

fig, axes = plt.subplots(2, 2, figsize=(12, 8))



# 1. Residuals vs Fitted

axes[0,0].scatter(model.fittedvalues, raw_resid, alpha=0.6, color='steelblue')

axes[0,0].axhline(0, color='red', linestyle='--', linewidth=2)

axes[0,0].set_xlabel('Fitted Values')

axes[0,0].set_ylabel('Residuals')

axes[0,0].set_title('Residuals vs Fitted')



# 2. Q-Q Plot

stats.probplot(raw_resid, dist='norm', plot=axes[0,1])

axes[0,1].set_title('Normal Q-Q Plot')



# 3. Scale-Location

axes[1,0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.6, color='coral')

axes[1,0].set_xlabel('Fitted Values')

axes[1,0].set_ylabel('√|Standardized Residuals|')

axes[1,0].set_title('Scale-Location (Homoscedasticity)')



# 4. Residuals vs Leverage (Cook's D)

axes[1,1].scatter(leverage, standardized_resid, alpha=0.6, color='orchid')

axes[1,1].axhline(0, color='black', linestyle='--')

axes[1,1].axvline(2*(2/len(X)), color='red', linestyle=':', label='Leverage threshold')

for i, (lev, res, cd) in enumerate(zip(leverage, standardized_resid, cooks_d)):

    if cd > 4/len(X):

        axes[1,1].annotate(f'Obs {i}', (lev, res), fontsize=8)

axes[1,1].set_xlabel('Leverage')

axes[1,1].set_ylabel('Standardized Residuals')

axes[1,1].set_title("Residuals vs Leverage (Cook's D)")

axes[1,1].legend()



plt.tight_layout()

plt.savefig('residual_analysis.png', dpi=150)

plt.show()



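# Outlier test (Bonferroni-corrected)

print("\nOutlier Test (Bonferroni-corrected):")

# With NumPy inputs, outlier_test() returns an ndarray whose columns are the

# studentized residual, the unadjusted p-value, and the Bonferroni-adjusted p-value

outlier_results = pd.DataFrame(model.outlier_test(), columns=['student_resid', 'unadj_p', 'bonf(p)'])

significant_outliers = outlier_results[outlier_results['bonf(p)'] < 0.05]

print(f"Significant outliers: {len(significant_outliers)}")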

Key Diagnostics

| Measure | Formula | Threshold | Indicates |
|---------|---------|-----------|-----------|
| Leverage hᵢᵢ | diag(H) | Greater than 2p/n | Unusual X value |
| Standardized residual rᵢ | eᵢ / (s√(1-hᵢᵢ)) | Absolute value greater than 3 | Outlier in Y |
| Cook's Distance Dᵢ | (rᵢ²/p) · hᵢᵢ/(1-hᵢᵢ) | Greater than 4/n | Influential observation |
| DFFITS | tᵢ√(hᵢᵢ/(1-hᵢᵢ)), with tᵢ the externally studentized residual | Absolute value greater than 2√(p/n) | Influential in fitted value |

Investigate Before Removing

Always investigate flagged points before removing them — they may be the most important observations in your dataset.


Key Takeaways

Summary: Residual Analysis

  • Raw residuals show the overall pattern; standardized residuals put observations on a common scale so they can be compared

  • High leverage = unusual X (far from x̄), which gives a point the potential for high influence

  • High Cook's D = removing the point changes the fitted model substantially, so the observation is influential

  • Always investigate flagged points before removing them (they may be correct!)

  • The 4-panel diagnostic plot (residuals, Q-Q, scale-location, leverage) is standard
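
The link between leverage and distance from x̄ can be made concrete. For simple linear regression with an intercept, leverage has the closed form h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The sketch below, using hypothetical predictor values, checks this against the diagonal of the hat matrix.

```python
import numpy as np

# Hypothetical predictor values; the last one sits far from the rest
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
n = len(x)

# Closed form for simple linear regression with an intercept
x_bar = x.mean()
h_closed = 1 / n + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

# General definition: diagonal of the hat matrix H = X (X'X)^{-1} X'
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_hat = np.diag(H)

print("Leverage (closed form):", np.round(h_closed, 3))
print("Agrees with diag(H):  ", np.allclose(h_closed, h_hat))
print("Sum of leverages (equals p = 2):", round(float(h_hat.sum()), 6))
```

Leverage depends only on the X values, never on y, which is why a high-leverage point has only the potential for influence: it changes the fit substantially only if its y value also departs from the trend.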
