
Instrumental Variables — IV Estimation


Isolating Exogenous Variation to Solve Endogeneity

Instrumental variables exploit external variation that affects the treatment but not the outcome directly. Two-stage least squares uses this exogenous variation to produce consistent causal estimates even when OLS fails.

  • Economics — Estimate returns to education using quarter of birth as an instrument

  • Healthcare — Assess treatment effects when assignment is non-random

  • Political Science — Evaluate policy impacts with institutional instruments

A valid instrument creates a natural experiment that breaks the correlation between treatment and error.


Instrumental variables (IV) methods address endogeneity — when a covariate is correlated with the error term. IV uses an external variable (the instrument) to isolate exogenous variation in the treatment.

Definition: Endogeneity

A variable $X$ is endogenous if it is correlated with the error term: $\text{Cov}(X, \varepsilon) \neq 0$. This causes OLS to be biased and inconsistent.
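As a rough illustration of the definition, the sketch below simulates a regressor that shares an unobserved confounder with the error term and shows how far the OLS slope drifts from the true effect. The coefficients (0.5, 1.5, 2.0), the seed, and the variable names are assumptions chosen for illustration, not values from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                # unobserved confounder (assumed)
x = 0.5 * u + rng.normal(size=n)      # regressor depends on u
eps = 1.5 * u + rng.normal(size=n)    # structural error also contains u
y = 2.0 * x + eps                     # true causal effect of x is 2.0

# Sample covariance between the regressor and the error term
cov_x_eps = np.cov(x, eps)[0, 1]

# OLS slope = Cov(x, y) / Var(x); it converges to 2.0 + Cov(x, eps) / Var(x)
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"Cov(X, eps): {cov_x_eps:.3f}")
print(f"OLS slope:   {ols_slope:.3f} (true effect: 2.0)")
```

Under these assumed coefficients the population covariance is 0.75 and the variance of the regressor is 1.25, so the OLS slope settles near 2.6 rather than 2.0 no matter how large the sample is.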


The IV Approach

Intuition

An instrument $Z$ affects the outcome $Y$ only through the endogenous variable $X$. By using only the variation in $X$ that comes from $Z$ (which is exogenous), we can estimate the causal effect of $X$ on $Y$.


Required Conditions

Relevance

Relevance Condition

$$\text{Cov}(Z, X) \neq 0$$

Here,

  • $Z$ = Instrumental variable
  • $X$ = Endogenous regressor

The instrument must be strongly correlated with the endogenous variable.

Exogeneity (Exclusion Restriction)

Exogeneity Condition

$$\text{Cov}(Z, \varepsilon) = 0$$

Here,

  • $\varepsilon$ = Error term in the structural equation

The instrument must be uncorrelated with the error term: it affects $Y$ only through $X$.

Untestable Assumption

The exclusion restriction cannot be tested directly. It must be justified theoretically or by the research design. This is the main challenge of IV estimation.
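To see why an untestable assumption matters, here is a minimal simulation sketch in which the instrument is given a small direct effect on the outcome. The helper `iv_slope`, the direct-effect size 0.4, and all other coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.normal(size=n)                            # instrument
u = rng.normal(size=n)                            # unobserved confounder
x = 0.8 * z + 0.5 * u + 0.5 * rng.normal(size=n)  # endogenous regressor
noise = rng.normal(size=n)

def iv_slope(z, x, y):
    """Simple IV estimate: Cov(Z, Y) / Cov(Z, X)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Case 1: exclusion restriction holds (z reaches y only through x)
y_valid = 2.0 * x + 1.5 * u + noise

# Case 2: exclusion restriction fails (z has a direct effect of 0.4 on y)
y_invalid = 2.0 * x + 1.5 * u + 0.4 * z + noise

iv_valid = iv_slope(z, x, y_valid)
iv_invalid = iv_slope(z, x, y_invalid)

print(f"IV with valid instrument:   {iv_valid:.3f} (true effect: 2.0)")
print(f"IV with invalid instrument: {iv_invalid:.3f}")
```

With a single instrument, nothing in the observed data distinguishes the two cases. In this setup the violated case converges to roughly 2.0 + 0.4 / 0.8 = 2.5, so the bias is the direct effect scaled up by the inverse of the first-stage coefficient.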


Two-Stage Least Squares (2SLS)

The most common IV estimation method.

Stage 1

Regress $X$ on $Z$ (and any exogenous covariates):

Stage 1

$$X = \pi_0 + \pi_1 Z + \nu$$

Here,

  • $\pi_1$ = First-stage coefficient
  • $\hat{X}$ = Fitted values from first stage

Stage 2

Regress $Y$ on $\hat{X}$:

Stage 2

$$Y = \beta_0 + \beta_1 \hat{X} + \varepsilon$$

Here,

  • $\beta_1$ = IV estimate of the causal effect
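The two stages can be sketched by hand with plain least squares. This is a minimal illustration on simulated data, assuming a single instrument and no extra covariates; the helper `ols` and the data-generating coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

z = rng.normal(size=n)                            # instrument
u = rng.normal(size=n)                            # unobserved confounder
x = 0.8 * z + 0.5 * u + 0.5 * rng.normal(size=n)  # endogenous regressor
y = 2.0 * x + 1.5 * u + rng.normal(size=n)        # outcome, true effect 2.0

def ols(design, target):
    """Least-squares coefficients for target regressed on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Stage 1: regress X on Z (with intercept) and keep the fitted values
z_design = np.column_stack([np.ones(n), z])
pi_hat = ols(z_design, x)
x_hat = z_design @ pi_hat

# Stage 2: regress Y on the fitted values from stage 1
x_hat_design = np.column_stack([np.ones(n), x_hat])
beta_hat = ols(x_hat_design, y)

print(f"First-stage coefficient: {pi_hat[1]:.3f}")
print(f"2SLS estimate:           {beta_hat[1]:.3f} (true effect: 2.0)")
```

Running the second stage manually gives the right point estimate but the wrong standard errors, because the residuals are computed from $\hat{X}$ instead of $X$. For inference, a dedicated routine such as `IV2SLS` from `linearmodels` is the safer choice.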

IV Estimator Formula

IV Estimator

$$\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}$$

Here,

  • $\text{Cov}(Z, Y)$ = Covariance between instrument and outcome
  • $\text{Cov}(Z, X)$ = Covariance between instrument and treatment
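The ratio follows from the two required conditions. A short derivation, assuming the structural equation $Y = \beta_0 + \beta_1 X + \varepsilon$ with one instrument and one endogenous regressor:

```latex
\begin{align*}
Y &= \beta_0 + \beta_1 X + \varepsilon \\
\text{Cov}(Z, Y) &= \beta_1\,\text{Cov}(Z, X) + \text{Cov}(Z, \varepsilon) \\
                 &= \beta_1\,\text{Cov}(Z, X)
                    \qquad \text{(exogeneity: } \text{Cov}(Z, \varepsilon) = 0\text{)} \\
\beta_1 &= \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}
                    \qquad \text{(relevance: } \text{Cov}(Z, X) \neq 0\text{)}
\end{align*}
```

Replacing the population covariances with sample covariances gives $\hat{\beta}_{IV}$. The division in the last step is why relevance is required, and why a denominator close to zero makes the estimator unstable.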

Weak Instruments

Weak Instrument Problem

If $\text{Cov}(Z, X)$ is small (weak first stage), the IV estimator has:

  • Large variance (imprecise estimates)

  • Bias toward OLS in finite samples

  • Invalid confidence intervals
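A small Monte Carlo sketch can make these symptoms visible. It compares the sampling distribution of the simple IV estimator under a strong and a weak first stage; the first-stage coefficients (0.8 and 0.02), sample size, and replication count are assumptions picked for illustration.

```python
import numpy as np

def simulate_iv_estimates(pi1, n=200, reps=2000, seed=0):
    """Draw `reps` samples of size n and return the IV estimate from each."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(reps, n))
    u = rng.normal(size=(reps, n))
    x = pi1 * z + 0.5 * u + 0.5 * rng.normal(size=(reps, n))
    y = 2.0 * x + 1.5 * u + rng.normal(size=(reps, n))
    zc = z - z.mean(axis=1, keepdims=True)
    # Row sums are proportional to Cov(Z, Y) and Cov(Z, X); the factor cancels
    return (zc * y).sum(axis=1) / (zc * x).sum(axis=1)

def median_and_iqr(estimates):
    """Robust centre and spread (the weak-IV estimator has very heavy tails)."""
    q25, q50, q75 = np.percentile(estimates, [25, 50, 75])
    return q50, q75 - q25

strong_median, strong_iqr = median_and_iqr(simulate_iv_estimates(pi1=0.8))
weak_median, weak_iqr = median_and_iqr(simulate_iv_estimates(pi1=0.02))

print(f"Strong first stage: median {strong_median:.2f}, IQR {strong_iqr:.2f}")
print(f"Weak first stage:   median {weak_median:.2f}, IQR {weak_iqr:.2f}")
```

In this setup the true effect is 2.0 and OLS converges to about 3.5 when the instrument is nearly irrelevant. The weak-instrument estimates should be far more dispersed and centred much closer to the OLS value than to 2.0. Medians and interquartile ranges are used because the mean and variance of the weak-instrument estimator are dominated by extreme draws.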

Testing for Weak Instruments

First-Stage F-Statistic

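$$F = \frac{\hat{\pi}_1^2}{\widehat{\text{Var}}(\hat{\pi}_1)}$$

Here,

  • $F$ = First-stage F-statistic for a single instrument (the squared t-statistic on $\hat{\pi}_1$); with several instruments, use the joint F-test that all instrument coefficients are zero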

| F-statistic | Interpretation |
|------------|---------------|
| F > 10 | Rule of thumb: instrument is strong |
| F < 10 | Potentially weak; use weak-IV robust methods |
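For a single instrument, the first-stage F-statistic can be computed directly from the first-stage slope and its estimated variance. The sketch below does this with NumPy on simulated data and cross-checks the result against the equivalent $R^2$ form; the data-generating values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

z = rng.normal(size=n)                            # instrument
u = rng.normal(size=n)                            # unobserved confounder
x = 0.8 * z + 0.5 * u + 0.5 * rng.normal(size=n)  # endogenous regressor

# First-stage OLS of x on z with an intercept
zc = z - z.mean()
pi1_hat = (zc @ x) / (zc @ zc)
pi0_hat = x.mean() - pi1_hat * z.mean()

# Classical (homoskedastic) variance of the slope estimate
resid = x - pi0_hat - pi1_hat * z
sigma2_hat = (resid @ resid) / (n - 2)
var_pi1_hat = sigma2_hat / (zc @ zc)

# With one instrument, F equals the squared t-statistic on the slope
f_stat = pi1_hat**2 / var_pi1_hat

# Equivalent form based on the first-stage R-squared
r2 = np.corrcoef(z, x)[0, 1] ** 2
f_check = r2 / (1.0 - r2) * (n - 2)

print(f"First-stage F: {f_stat:.1f} (R-squared form: {f_check:.1f})")
print(f"Below the rule-of-thumb threshold of 10: {f_stat < 10}")
```

This uses the classical variance formula. With heteroskedastic or clustered data, a robust first-stage statistic would be the appropriate replacement.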


Overidentification

When you have more instruments than endogenous variables, you can test whether the instruments are valid.

Sargan/Hansen J-Test

$$J = n \times R^2_{aux}$$

Here,

  • $R^2_{aux}$ = R-squared from auxiliary regression of 2SLS residuals on all instruments

| Decision | Interpretation |
|---------|---------------|
| Reject $H_0$ | At least one instrument is invalid |
| Fail to reject $H_0$ | No evidence against joint validity (this does not prove the instruments are valid) |
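The $n \times R^2_{aux}$ recipe can be sketched for the smallest overidentified case: two instruments and one endogenous regressor, where the statistic is compared with a chi-square distribution with one degree of freedom (instruments minus endogenous regressors). The function name `sargan_j` and the simulated data are assumptions for illustration.

```python
import math
import numpy as np

def sargan_j(y, x, instruments):
    """Return (J, p-value) for one endogenous regressor and two instruments."""
    n = len(y)
    Z = np.column_stack([np.ones(n), instruments])
    # 2SLS: first stage, then second stage on the fitted values
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    stage2 = np.column_stack([np.ones(n), x_hat])
    b0, b1 = np.linalg.lstsq(stage2, y, rcond=None)[0]
    # 2SLS residuals use the observed x, not the fitted x_hat
    resid = y - b0 - b1 * x
    # Auxiliary regression of the residuals on all instruments
    fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
    r2_aux = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    j_stat = max(n * r2_aux, 0.0)
    # Chi-square(1) upper tail; valid only for df = 2 instruments - 1 regressor
    p_value = math.erfc(math.sqrt(j_stat / 2.0))
    return j_stat, p_value

rng = np.random.default_rng(3)
n = 5000
z1, z2, u = rng.normal(size=(3, n))
x = 0.6 * z1 + 0.6 * z2 + 0.5 * u + 0.5 * rng.normal(size=n)
y_valid = 2.0 * x + 1.5 * u + rng.normal(size=n)
y_invalid = y_valid + 0.5 * z2   # z2 now affects the outcome directly

instruments = np.column_stack([z1, z2])
j_valid, p_valid = sargan_j(y_valid, x, instruments)
j_invalid, p_invalid = sargan_j(y_invalid, x, instruments)

print(f"Both instruments valid: J = {j_valid:.2f}, p = {p_valid:.3f}")
print(f"z2 violates exclusion:  J = {j_invalid:.2f}, p = {p_invalid:.3f}")
```

The test only detects disagreement between instruments. If every instrument is invalid in the same direction, the statistic can stay small, which is why failing to reject is weak evidence at best.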


Hausman Test for Endogeneity

Hausman Endogeneity Test

$$H = (\hat{\beta}_{IV} - \hat{\beta}_{OLS})'\left(\text{Var}(\hat{\beta}_{IV}) - \text{Var}(\hat{\beta}_{OLS})\right)^{-1}(\hat{\beta}_{IV} - \hat{\beta}_{OLS})$$

Here,

  • $H$ = Test statistic ($\chi^2$ under $H_0$: no endogeneity)

The variance difference is taken as IV minus OLS because OLS is efficient under $H_0$, so this difference is positive semi-definite. If $H_0$ is rejected, endogeneity is present and IV is preferred.
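With a single coefficient the quadratic form collapses to a squared difference divided by a variance difference. The sketch below assumes classical standard errors and takes the variance difference as IV minus OLS, which is nonnegative when OLS is efficient under $H_0$. The helper `hausman_scalar` is hypothetical, and the inputs are the OLS and IV figures from the Returns to Education table later in this lesson.

```python
import math

def hausman_scalar(b_ols, se_ols, b_iv, se_iv):
    """Scalar Hausman statistic and its chi-square(1) p-value."""
    var_diff = se_iv**2 - se_ols**2
    if var_diff <= 0:
        raise ValueError("Var(IV) - Var(OLS) must be positive for this form of the test")
    h_stat = (b_iv - b_ols) ** 2 / var_diff
    # Upper tail of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(h_stat / 2.0))
    return h_stat, p_value

# Estimates and standard errors from the worked example table
h_stat, p_value = hausman_scalar(b_ols=0.12, se_ols=0.01, b_iv=0.08, se_iv=0.03)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
```

For those inputs the statistic is about 2.0 with a p-value near 0.16, so by this particular test the gap between OLS and IV is suggestive but not significant at the 5% level. In finite samples the variance difference can come out negative, in which case this simple form is not usable and a regression-based variant is a common alternative.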


Python Implementation


import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from linearmodels.iv import IV2SLS

np.random.seed(42)

# Simulate endogeneity: U drives both X and Y but is never observed
n = 1000
Z = np.random.randn(n)  # Instrument
U = np.random.randn(n)  # Unobserved confounder
X = 0.8 * Z + 0.5 * U + np.random.randn(n) * 0.5  # Endogenous regressor
Y = 2.0 * X + 1.5 * U + np.random.randn(n)  # Outcome (true effect of X is 2.0)

# Naive OLS (biased because U is omitted)
ols = sm.OLS(Y, sm.add_constant(X)).fit()
print(f"OLS estimate: {ols.params[1]:.3f} (true: 2.0)")

# IV/2SLS; classical standard errors keep the Hausman variance difference comparable with OLS
data = pd.DataFrame({'Y': Y, 'X': X, 'Z': Z})
iv_model = IV2SLS.from_formula('Y ~ 1 + [X ~ Z]', data=data).fit(cov_type='unadjusted')
print(f"IV estimate: {iv_model.params['X']:.3f} (true: 2.0)")

# First-stage F-stat
first_stage = sm.OLS(X, sm.add_constant(Z)).fit()
f_stat = first_stage.fvalue
print(f"\nFirst-stage F-statistic: {f_stat:.1f}")
print(f"Weak instrument: {'Yes' if f_stat < 10 else 'No'}")

# Hausman test: OLS is efficient under H0, so use Var(IV) - Var(OLS), which is positive
diff = iv_model.params['X'] - ols.params[1]
var_diff = iv_model.std_errors['X']**2 - ols.bse[1]**2
hausman_stat = diff**2 / var_diff
print(f"\nHausman test: {hausman_stat:.3f} (p ~ {stats.chi2.sf(hausman_stat, 1):.3f})")


Worked Example

Example: Returns to Education

Estimating the causal effect of education on earnings:

  • Endogenous variable: Years of education

  • Instrument: Quarter of birth (in states with compulsory schooling laws)

  • Confounder: Ability (unobserved)

| Method | Estimate | SE |
|--------|----------|-----|
| OLS | 0.12 | 0.01 |
| IV (2SLS) | 0.08 | 0.03 |

First-stage F: 45.2 (strong instrument)

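The IV estimate is smaller than the OLS estimate, suggesting that OLS is biased upward by unobserved ability. The 95% CI for the IV estimate, $0.08 \pm 1.96 \times 0.03 \approx [0.02, 0.14]$, excludes zero, so education has a statistically significant causal effect provided the IV assumptions hold.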


Key Takeaways

Summary: Instrumental Variables

  • IV addresses endogeneity when $\text{Cov}(X, \varepsilon) \neq 0$

  • An instrument must satisfy relevance ($\text{Cov}(Z,X) \neq 0$) and exogeneity ($\text{Cov}(Z,\varepsilon) = 0$)

  • 2SLS is the standard estimation method

  • Check the first-stage F-statistic (>10 indicates strong instrument)

  • The exclusion restriction is untestable — must be justified theoretically

  • Use the Sargan/Hansen test when overidentified (more instruments than endogenous variables)

  • IV estimates are less precise than OLS but consistent when endogeneity is present

