Residual Analysis
Regression Analysis
Diagnosing What's Wrong With Your Regression
Residual analysis reveals whether your model assumptions hold, identifies influential outliers, and detects non-linearity. It's the diagnostic toolkit that separates trustworthy models from misleading ones.
-
Clinical Studies — Detect influential patients that distort treatment effect estimates
-
Financial Forecasting — Identify time periods where models systematically fail
-
Manufacturing — Spot measurement errors and anomalous production runs
What remains after fitting the model tells you as much as what the model predicts.
Residuals are the differences between observed and predicted values:
Residual
Here,
- =The residual for observation i
- =Observed value
- =Predicted (fitted) value
Analyzing residuals reveals whether regression assumptions are met and identifies influential observations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
np.random.seed(42)
# Build a regression model with potential issues
n = 80
X = np.random.uniform(1, 10, n)
y = 2 + 3*X + np.random.normal(0, 2, n)
# Add some outliers and influential points
X = np.append(X, [1.5, 9.5])
y = np.append(y, [25, 10]) # outliers
X_dm = sm.add_constant(X)
model = sm.OLS(y, X_dm).fit()
# Types of residuals
raw_resid = model.resid
student_resid = model.outlier_test()['student_resid']
influence = model.get_influence()
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
standardized_resid = influence.resid_studentized_internal
print("Influence Diagnostics:")
print(f"Max leverage: {leverage.max():.4f} (threshold: {2*(2/len(X)):.4f})")
print(f"Max Cook's D: {cooks_d.max():.4f} (threshold: 4/n = {4/len(X):.4f})")
high_lev = leverage > 2*(2/len(X))
high_cook = cooks_d > 4/len(X)
print(f"High leverage points: {high_lev.sum()}")
print(f"High Cook's D points: {high_cook.sum()}")
# 4-panel diagnostic plot
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# 1. Residuals vs Fitted
axes[0,0].scatter(model.fittedvalues, raw_resid, alpha=0.6, color='steelblue')
axes[0,0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[0,0].set_xlabel('Fitted Values')
axes[0,0].set_ylabel('Residuals')
axes[0,0].set_title('Residuals vs Fitted')
# 2. Q-Q Plot
stats.probplot(raw_resid, dist='norm', plot=axes[0,1])
axes[0,1].set_title('Normal Q-Q Plot')
# 3. Scale-Location
axes[1,0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.6, color='coral')
axes[1,0].set_xlabel('Fitted Values')
axes[1,0].set_ylabel('v|Standardized Residuals|')
axes[1,0].set_title('Scale-Location (Homoscedasticity)')
# 4. Residuals vs Leverage (Cook's D)
axes[1,1].scatter(leverage, standardized_resid, alpha=0.6, color='orchid')
axes[1,1].axhline(0, color='black', linestyle='--')
axes[1,1].axvline(2*(2/len(X)), color='red', linestyle=':', label='Leverage threshold')
for i, (lev, res, cd) in enumerate(zip(leverage, standardized_resid, cooks_d)):
if cd > 4/len(X):
axes[1,1].annotate(f'Obs {i}', (lev, res), fontsize=8)
axes[1,1].set_xlabel('Leverage')
axes[1,1].set_ylabel('Standardized Residuals')
axes[1,1].set_title("Residuals vs Leverage (Cook's D)")
axes[1,1].legend()
plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()
# Outlier test (Bonferroni-corrected)
print("\nOutlier Test (Bonferroni-corrected):")
outlier_results = model.outlier_test()
significant_outliers = outlier_results[outlier_results['bonf(p)'] < 0.05]
print(f"Significant outliers: {len(significant_outliers)}")
Key Diagnostics
| Measure | Formula | Threshold | Indicates |
|---------|---------|-----------|----------|
| Leverage h?? | diag(H) | greater than 2p/n | Unusual X value |
| Standardized residual | e?/sv(1-h??) | |r| greater than 3 | Outlier in Y |
| Cook's Distance | D? | greater than 4/n | Influential observation |
| DFFITS | Scaled leverage | greater than 2v(p/n) | Influential in fitted value |
Investigate Before Removing
Always investigate flagged points before removing them — they may be the most important observations in your dataset.
Key Takeaways
Summary: Residual Analysis
-
Raw residuals show overall pattern; standardized allow comparison across observations
-
High leverage = unusual X (far from x¯) — potential for high influence
-
High Cook's D = removes this point and model changes substantially -> influential
-
Always investigate flagged points before removing them (they may be correct!)
-
The 4-panel diagnostic plot (residuals, Q-Q, scale-location, leverage) is standard