Causal Inference: Beyond Correlation

Correlation vs Causation

Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated.

Architecture Diagram

Correlation:        X ~ Y     (X and Y move together)
Causal:             X --> Y   (X causes Y)
Fork:               X <-- Z --> Y  (Z confounds X and Y)
Collider:           X --> Z <-- Y  (Z is collider, conditioning biases)
Chain:              X --> Z --> Y  (Z mediates X -> Y)

Fundamental Principle

\text{Correlation} \neq \text{Causation}

Three conditions for causation:

Temporal precedence: Cause precedes effect
Covariation: Cause and effect are related
No confounders: The relationship is not explained by a third variable

Collider Bias

Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.

Potential Outcomes Framework

Rubin Causal Model

For individual i, define:

Y_i(1) = outcome if treated (potential outcome under treatment)
Y_i(0) = outcome if not treated (potential outcome under control)

\tau_i = Y_i(1) - Y_i(0)

Y = \alpha + \beta X + \epsilon

Individual Treatment Effect

\tau_i = Y_i(1) - Y_i(0)

Here,

$Ï„_i$ =Treatment effect for individual i
$Y_i(1)$ =Potential outcome under treatment
$Y_i(0)$ =Potential outcome under control

Fundamental Problem of Causal Inference

Observed Outcome

Y_i^{obs} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)

Here,

$D_i$ =Treatment assignment (0 or 1)

The Fundamental Problem

Because we never observe both Y_i(1) and Y_i(0) for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.

Average Treatment Effect (ATE)

ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]

Average Treatment Effect

ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]

Here,

$ATE$ =Average treatment effect

Conditional Average Treatment Effect (CATE)

CATE

CATE(x) = E[Y(1) - Y(0) \mid X = x]

Here,

$CATE(x)$ =Conditional average treatment effect at x

SUTVA (Stable Unit Treatment Value Assumption)

No interference: One unit's treatment does not affect another's outcome
Consistency: Only one version of each treatment level

Propensity Score Matching

Propensity Score

e(X) = P(D = 1 \mid X)

Here,

$e(X)$ =Propensity score
$X$ =Covariates
$D$ =Treatment indicator

Key Property

ThPropensity Score Theorem (Rosenbaum and Rubin, 1983)

If treatment assignment is unconfounded given X, then Y(1), Y(0) âŠ¥ D | e(X). This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.

Matching Methods

Method	Description	Example
Nearest Neighbor	Find closest control by propensity score	T1→C3, T2→C1
Caliper	Reject matches beyond threshold	T1 (d=0.02) âœ“, T2 (d=0.08) ≤
Kernel	Weighted average of all controls	T1 ← weighted sum of C1-C5
Stratification	Divide by propensity score strata	Stratum 1: [0.0-0.2] → ATE_1

Implementation

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class PropensityScoreMatching:
    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)
    
    def fit_propensity(self, X, treatment):
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores
    
    def match(self, treatment, n_neighbors=None):
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
        
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        distances, indices = nn.kneighbors(treated_scores)
        
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        
        self.matches_ = matches
        return matches
    
    def estimate_ate(self, outcome):
        treatment_effects = []
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        
        treatment_effects = np.array(treatment_effects)
        return {
            'ate': np.mean(treatment_effects),
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }

# Example: Job training program effect
np.random.seed(42)
n = 2000

age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)

X = np.column_stack([age, education, income, experience])

propensity_true = 1 / (1 + np.exp(-(-3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)

true_ate = 5000
outcome = 30000 + 500*age + 1000*education + 0.3*income + true_ate*treatment + np.random.normal(0, 5000, n)

# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")

# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)

print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"Number of matches: {result['n_matches']}")

Difference-in-Differences (DiD)

Two-Period Model

Difference-in-Differences Model

Y_{it} = \alpha + \beta \cdot \text{Treat}_i + \gamma \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \epsilon_{it}

Here,

$Δ$ =DiD estimator (treatment effect)
$Treat_i$ =Treatment indicator for unit i
$Post_t$ =Post-treatment indicator

DiD Estimator

\hat{\delta} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})

Here,

$ΔÌ‚$ =Estimated treatment effect

Architecture Diagram

Y
^                    * Treatment (Post)
|                  /
|                / <-- Δ (DiD estimate)
|              /
|      *------* Treatment (Pre)
|    /
|  *----------* Control (Pre and Post)
+-----------------------------------------> Time
  Pre              Post

Parallel Trends Assumption

The key assumption: absent treatment, treatment and control groups would have followed parallel trends. This is untestable but can be partially validated using pre-treatment data.

Implementation

import statsmodels.api as sm

np.random.seed(42)

n_states = 50
n_periods = 20
states = []
for state in range(n_states):
    treated = 1 if state < 25 else 0
    for time in range(n_periods):
        post = 1 if time >= 10 else 0
        true_effect = 500 if treated and post else 0
        state_effect = state * 100
        time_effect = time * 50
        outcome = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        states.append({
            'state': state, 'time': time, 'treated': treated,
            'post': post, 'employment': outcome
        })

df = pd.DataFrame(states)
df['did'] = df['treated'] * df['post']

import statsmodels.formula.api as smf
model = smf.ols('employment ~ treated + post + did', data=df).fit()
print(model.summary())
print(f"\nDiD Estimate: {model.params['did']:.2f}")
print(f"P-value: {model.pvalues['did']:.4f}")

Instrumental Variables (IV)

When to Use IV

When treatment is correlated with unobserved confounders (endogeneity):

Endogeneity

\text{Cov}(X, \epsilon) \neq 0

Here,

$X$ =Treatment variable
$ε$ =Error term

Instrument Requirements

Relevance: Z is correlated with X
Exclusion: Z affects Y only through X
Exogeneity: Z is uncorrelated with confounders

Two-Stage Least Squares (2SLS)

First Stage

X = \pi_0 + \pi_1 Z + \nu

Here,

$Z$ =Instrument
$π₁$ =First-stage coefficient

Second Stage

Y = \beta_0 + \beta_1 \hat{X} + \epsilon

Here,

$XÌ‚$ =Predicted treatment from first stage

IV Estimator

\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}

Here,

$βÌ‚_IV$ =IV estimate of treatment effect

Architecture Diagram

Confounding:

    U (unobserved)
   / \
  v   v
  X   Y     Endogeneity: Cov(X, ε) ≈  0

Instrumental Variable:

    U (unobserved)
   / \
  v   v
  X   Y     IV Z satisfies: Z âŠ¥ U, Z -> X
  ^
  |
  Z (instrument)

Regression Discontinuity Design (RDD)

Sharp RDD

Sharp RDD Treatment Assignment

D_i = \mathbb{1}(X_i \geq c)

Here,

$D_i$ =Treatment indicator
$X_i$ =Running variable
$c$ =Cutoff point

RDD Estimator

\hat{\tau}_{RDD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x]

Here,

$Ï„Ì‚_RDD$ =Estimated treatment effect at cutoff

Architecture Diagram

Y
^         *  *  *
|       *        *  *
|     *    Treatment  *  *
|   *        *          *
|  *   *                 *
| *  *    Control
|*  *
+---------------------------+-----------> X
                     Cutoff (c)

RDD Validity

RDD is considered one of the most credible quasi-experimental methods because it mimics a randomized experiment at the cutoff. The key validity check is whether pre-treatment covariates also show discontinuities at the cutoff.

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

Association: P(Y | X) — Seeing
Intervention: P(Y | do(X)) — Doing
Counterfactual: P(Y_x | X', Y') — Imagining

Causal Hierarchy

Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.

do-Calculus

Backdoor Adjustment Formula

P(Y \mid do(X)) = \sum_z P(Y \mid X, z)P(z) \quad \text{(when backdoor criterion satisfied)}

Here,

$do(X)$ =Intervention on X
$z$ =Confounders satisfying backdoor criterion

Backdoor Criterion

A set Z satisfies the backdoor criterion relative to (X, Y) if:

No node in Z is a descendant of X
Z blocks every path between X and Y that contains an arrow into X

Key Takeaways

Summary: Causal Inference

Correlation ≈ Causation: Always consider confounders — observational data alone cannot establish causation without additional assumptions
Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
Propensity score matching reduces confounding by balancing covariates between treated and control groups
DiD uses variation over time to control for unobserved time-invariant confounders
IV addresses endogeneity using exogenous variation from instruments
RDD exploits arbitrary cutoffs for quasi-experimental identification — one of the most credible quasi-experimental designs
DAGs help visualize and verify causal assumptions formally
No method is perfect: All require untestable assumptions; sensitivity analysis is essential

Practice Exercises

Propensity Score Matching: Perform PSM on job training data and check covariate balance before/after matching
DiD Analysis: Replicate the minimum wage study and test the parallel trends assumption
IV Estimation: Implement IV using quarter of birth as an instrument for education. Check instrument strength
RDD Design: Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths
Discussion: When would you prefer DiD over propensity score matching? How do you validate the exclusion restriction?

Causal Inference: Beyond Correlation

Correlation vs Causation

Fundamental Principle

Potential Outcomes Framework

Rubin Causal Model

Individual Treatment Effect

Fundamental Problem of Causal Inference

Observed Outcome

Average Treatment Effect (ATE)

Conditional Average Treatment Effect (CATE)

CATE

SUTVA (Stable Unit Treatment Value Assumption)

Propensity Score Matching

Propensity Score

Key Property

ThPropensity Score Theorem (Rosenbaum and Rubin, 1983)

Matching Methods

Implementation

Difference-in-Differences (DiD)

Two-Period Model

Difference-in-Differences Model

Parallel Trends Assumption

Implementation

Instrumental Variables (IV)

When to Use IV

Endogeneity

Instrument Requirements

Two-Stage Least Squares (2SLS)

First Stage

Second Stage

Regression Discontinuity Design (RDD)

Sharp RDD

Sharp RDD Treatment Assignment

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

do-Calculus

Backdoor Criterion

Key Takeaways

Summary: Causal Inference

Practice Exercises

Premium Content

Need Expert Data Science Help?