Multicollinearity
Regression Analysis
When Predictors Correlate and Coefficients Become Unreliable
Multicollinearity inflates standard errors, making individual predictors appear insignificant even when the overall model is strong. Detection through VIF and condition numbers is essential before interpreting coefficients.
- Economics — Disentangle effects of correlated macroeconomic indicators
- Genomics — Handle highly correlated gene expression variables
- Policy Analysis — Isolate individual policy impacts when interventions are bundled
High VIF signals that the model cannot distinguish one predictor's effect from another's.
Multicollinearity occurs when two or more predictors are highly correlated with each other. It doesn't bias OLS estimates but inflates standard errors, making individual coefficients unreliable.
Definition: Multicollinearity
A condition in regression where two or more predictor variables are highly correlated, leading to unstable coefficient estimates and inflated standard errors.
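The link between correlated predictors and inflated standard errors can be written out explicitly. The standard textbook expression for the sampling variance of the j-th OLS slope (in a model with an intercept) is sketched below. The notation is assumed here rather than taken from the text above: $\sigma^2$ is the error variance and $R_j^2$ is the R² from regressing predictor $x_j$ on all the other predictors.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

As $R_j^2$ approaches 1, the second factor grows without bound. The coefficient's variance is multiplied by the VIF, so its standard error is multiplied by the square root of the VIF.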
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
n = 100
# Create correlated predictors
z = np.random.normal(0, 1, n)
x1 = z + np.random.normal(0, 0.3, n) # strongly correlated with z
x2 = z + np.random.normal(0, 0.3, n) # also correlated with z
x3 = np.random.normal(0, 1, n) # independent
y = 2*x1 + 1.5*x3 + np.random.normal(0, 1, n)
X = sm.add_constant(pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}))
# Detect multicollinearity: Variance Inflation Factor
vif_data = pd.DataFrame()
vif_data['Feature'] = ['x1','x2','x3']
vif_data['VIF'] = [variance_inflation_factor(X.values, i+1) for i in range(3)]
print("VIF (Variance Inflation Factor):")
print(vif_data)
print("Rule of thumb: VIF > 10 (or >5) indicates problematic multicollinearity")
# Correlation matrix
corr = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}).corr()
print("\nCorrelation matrix:")
print(corr.round(3))
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='RdBu_r', center=0)
plt.title('Predictor Correlation Matrix')
plt.tight_layout()
plt.savefig('multicollinearity.png', dpi=150)
plt.show()
# Show effect: unstable coefficients with multicollinearity
print("\nWith multicollinearity — coefficient instability:")
for seed in [1, 2, 3, 4, 5]:
    np.random.seed(seed)
    # z is reused from above; only the noise terms are redrawn for each seed
    x1s = z + np.random.normal(0, 0.3, n)
    x2s = z + np.random.normal(0, 0.3, n)
    # True model: y depends on x1 (coefficient 2) plus an unobserved term, not on x2
    ys = 2*x1s + 1.5*np.random.normal(0, 1, n) + np.random.normal(0, 1, n)
    Xs = sm.add_constant(pd.DataFrame({'x1': x1s, 'x2': x2s}))
    m = sm.OLS(ys, Xs).fit()
    print(f"  Seed {seed}: β₁={m.params['x1']:.3f}, β₂={m.params['x2']:.3f}")
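The listing above covers VIF and the correlation matrix but not the condition number. The block below is a minimal sketch of one way to compute it, assuming the predictors are standardized first so that scale differences do not dominate. The helper name `condition_number` and the simulated data are illustrative, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Same construction as the main example: x1 and x2 share the latent factor z
z = rng.normal(0, 1, n)
x1 = z + rng.normal(0, 0.3, n)
x2 = z + rng.normal(0, 0.3, n)
x3 = rng.normal(0, 1, n)  # independent predictor


def condition_number(columns):
    """Condition number of the column-standardized predictor matrix.

    Hypothetical helper: ratio of the largest to the smallest singular value.
    """
    X = np.column_stack(columns)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(Xs)


cond_collinear = condition_number([x1, x2, x3])   # includes the correlated pair
cond_independent = condition_number([x1, x3])     # drops x2

print(f"Condition number with x1, x2, x3: {cond_collinear:.2f}")
print(f"Condition number with x1, x3:     {cond_independent:.2f}")
```

A larger value means the predictor matrix is closer to rank-deficient. A commonly cited rule of thumb flags values above roughly 30, but the number depends on how the columns are scaled, so it is best read alongside VIF rather than on its own.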
Solutions
| Solution | When to Use |
|---|---|
| Remove one collinear predictor | If redundant (e.g., two versions of same variable) |
| Create composite (PCA) | When both carry signal |
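| Ridge regression | When all predictors must stay in the model; the penalty shrinks and stabilizes correlated coefficients |
| Center/standardize variables | When the collinearity comes from polynomial or interaction terms |
| Collect more data | When feasible; a larger sample narrows standard errors even if the correlation remains |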
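To make the ridge row concrete, here is a minimal sketch using the closed-form ridge estimate in NumPy. The function `ridge_fit`, the penalty value `alpha=10.0`, and the simulated data are assumptions chosen for illustration. In practice the penalty is usually tuned, for example by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two predictors that share the latent factor z; y depends only on x1
z = rng.normal(0, 1, n)
x1 = z + rng.normal(0, 0.3, n)
x2 = z + rng.normal(0, 0.3, n)
y = 2 * x1 + rng.normal(0, 1, n)

# Standardize predictors and center y so that no intercept term is needed
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate: solve (X'X + alpha * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


beta_ols = ridge_fit(X, yc, alpha=0.0)     # alpha = 0 reduces to OLS
beta_ridge = ridge_fit(X, yc, alpha=10.0)  # illustrative penalty, not tuned

print("OLS coefficients:  ", beta_ols.round(3))
print("Ridge coefficients:", beta_ridge.round(3))
```

The penalty pulls the coefficient vector toward zero, which trades a small amount of bias for lower variance. The ridge coefficients are on the standardized scale, so they are not directly comparable to the unstandardized estimates in the main listing.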
VIF Thresholds
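A VIF above 10 (the predictor shares more than 90% of its variance with the other predictors) suggests serious multicollinearity; a VIF above 5 (more than 80%) warrants attention. These are rules of thumb, not hard cutoffs: a high VIF on a control variable whose coefficient you do not interpret, or on polynomial and interaction terms, is often tolerable.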
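The thresholds are easier to interpret once VIF is computed by hand. The sketch below regresses each predictor on the others and applies VIF = 1 / (1 - R²). The helper `vif` and the simulated data are illustrative, and its results should closely match `variance_inflation_factor` when that function is given a design matrix that includes a constant.

```python
import numpy as np


def vif(X, j):
    """VIF for column j of X (X has no constant column): 1 / (1 - R_j^2)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    # Auxiliary regression of x_j on the remaining predictors, with an intercept
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(0)
n = 200
z = rng.normal(0, 1, n)
x1 = z + rng.normal(0, 0.3, n)
x2 = z + rng.normal(0, 0.3, n)
x3 = rng.normal(0, 1, n)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.2f}")

# R^2 of the auxiliary regression implied by the two common thresholds
r2_at_5 = 1 - 1 / 5     # VIF = 5  corresponds to R^2 = 0.8
r2_at_10 = 1 - 1 / 10   # VIF = 10 corresponds to R^2 = 0.9
```

Read this way, a VIF of 5 means 80% of a predictor's variance is explained by the other predictors, and a VIF of 10 means 90%.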
Key Takeaways
Summary: Multicollinearity
- VIF greater than 10 suggests serious multicollinearity; VIF greater than 5 warrants attention
- Multicollinearity inflates standard errors -> wide CIs, large p-values, unstable coefficients
- Point estimates are still unbiased — only inference is affected
- Perfect collinearity makes XᵀX singular (non-invertible) -> OLS has no unique solution
- Ridge regression is a common remedy when you need to keep all predictors, at the cost of some bias
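The perfect-collinearity case can be checked numerically. In this sketch `x2` is constructed as exactly twice `x1`, an assumption made purely for illustration, so the design matrix loses a rank and two different coefficient vectors produce identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

x1 = rng.normal(0, 1, n)
x2 = 2.0 * x1                      # exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 3.0 * x1 + rng.normal(0, 1, n)

# Three columns, but only two are linearly independent
rank = np.linalg.matrix_rank(X)
print("Columns:", X.shape[1], "Rank:", rank)

# lstsq still returns an answer: the minimum-norm one among infinitely many
beta, _, lstsq_rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Moving along the null direction (0, 2, -1) leaves the fit unchanged,
# because 2*x1 - x2 is identically zero
beta_alt = beta + np.array([0.0, 2.0, -1.0])
same_fit = np.allclose(X @ beta, X @ beta_alt)
print("Different coefficients, same fitted values:", same_fit)
```

Because the data cannot distinguish between these coefficient vectors, software facing exact collinearity typically either drops a column or falls back on a pseudoinverse solution.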