πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Causal Inference: Beyond Correlation

Module 4: Statistics and Probability🟒 Free Lesson

Advertisement

Causal Inference DAGXUYConfounder UX: Treatment | Y: Outcome | U: UnobservedY = Ξ± + Ξ²Β·X+ Ξ΅ (error)

Correlation vs Causation

Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated.

Architecture Diagram
Correlation:        X ~ Y     (X and Y move together)
Causal:             X --> Y   (X causes Y)
Fork:               X <-- Z --> Y  (Z confounds X and Y)
Collider:           X --> Z <-- Y  (Z is collider, conditioning biases)
Chain:              X --> Z --> Y  (Z mediates X -> Y)

Fundamental Principle

Correlation≠Causation\text{Correlation} \neq \text{Causation}

Three conditions for causation:

  1. Temporal precedence: Cause precedes effect
  2. Covariation: Cause and effect are related
  3. No confounders: The relationship is not explained by a third variable

Collider Bias

Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.

Potential Outcomes Framework

Rubin Causal Model

For individual i, define:

  • Y_i(1) = outcome if treated (potential outcome under treatment)
  • Y_i(0) = outcome if not treated (potential outcome under control)
Ο„i=Yi(1)βˆ’Yi(0)\tau_i = Y_i(1) - Y_i(0)
Y=Ξ±+Ξ²X+Ο΅Y = \alpha + \beta X + \epsilon

Individual Treatment Effect

Ο„i=Yi(1)βˆ’Yi(0)\tau_i = Y_i(1) - Y_i(0)

Here,

  • IΒ¨β€žiΓβ€ž_i=Treatment effect for individual i
  • Yi(1)Y_i(1)=Potential outcome under treatment
  • Yi(0)Y_i(0)=Potential outcome under control

Fundamental Problem of Causal Inference

Observed Outcome

Yiobs=Diβ‹…Yi(1)+(1βˆ’Di)β‹…Yi(0)Y_i^{obs} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)

Here,

  • DiD_i=Treatment assignment (0 or 1)

The Fundamental Problem

Because we never observe both Y_i(1) and Y_i(0) for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.

Average Treatment Effect (ATE)

ATE=E[Y(1)βˆ’Y(0)]=E[Y(1)]βˆ’E[Y(0)]ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]
Average Treatment Effect
ATE=E[Y(1)βˆ’Y(0)]=E[Y(1)]βˆ’E[Y(0)]ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]

Here,

  • ATEATE=Average treatment effect

Conditional Average Treatment Effect (CATE)

CATE

CATE(x)=E[Y(1)βˆ’Y(0)∣X=x]CATE(x) = E[Y(1) - Y(0) \mid X = x]

Here,

  • CATE(x)CATE(x)=Conditional average treatment effect at x

SUTVA (Stable Unit Treatment Value Assumption)

  1. No interference: One unit's treatment does not affect another's outcome
  2. Consistency: Only one version of each treatment level

Propensity Score Matching

Propensity Score

e(X)=P(D=1∣X)e(X) = P(D = 1 \mid X)

Here,

  • e(X)e(X)=Propensity score
  • XX=Covariates
  • DD=Treatment indicator

Key Property

ThPropensity Score Theorem (Rosenbaum and Rubin, 1983)

If treatment assignment is unconfounded given X, then Y(1), Y(0) Γ’Ε Β₯ D | e(X). This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.

Matching Methods

MethodDescriptionExample
Nearest NeighborFind closest control by propensity scoreT1β†’C3, T2β†’C1
CaliperReject matches beyond thresholdT1 (d=0.02) Γ’Ε“β€œ, T2 (d=0.08) ≀
KernelWeighted average of all controlsT1 ← weighted sum of C1-C5
StratificationDivide by propensity score strataStratum 1: [0.0-0.2] β†’ ATE_1

Implementation

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class PropensityScoreMatching:
    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)
    
    def fit_propensity(self, X, treatment):
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores
    
    def match(self, treatment, n_neighbors=None):
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
        
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        distances, indices = nn.kneighbors(treated_scores)
        
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        
        self.matches_ = matches
        return matches
    
    def estimate_ate(self, outcome):
        treatment_effects = []
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        
        treatment_effects = np.array(treatment_effects)
        return {
            'ate': np.mean(treatment_effects),
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }

# Example: Job training program effect
np.random.seed(42)
n = 2000

age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)

X = np.column_stack([age, education, income, experience])

propensity_true = 1 / (1 + np.exp(-(-3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)

true_ate = 5000
outcome = 30000 + 500*age + 1000*education + 0.3*income + true_ate*treatment + np.random.normal(0, 5000, n)

# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")

# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)

print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"Number of matches: {result['n_matches']}")

Difference-in-Differences (DiD)

Two-Period Model

Difference-in-Differences Model

Yit=Ξ±+Ξ²β‹…Treati+Ξ³β‹…Postt+Ξ΄β‹…(TreatiΓ—Postt)+Ο΅itY_{it} = \alpha + \beta \cdot \text{Treat}_i + \gamma \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \epsilon_{it}

Here,

  • ΔΔ=DiD estimator (treatment effect)
  • TreatiTreat_i=Treatment indicator for unit i
  • PosttPost_t=Post-treatment indicator
DiD Estimator
Ξ΄^=(YΛ‰T,postβˆ’YΛ‰T,pre)βˆ’(YΛ‰C,postβˆ’YΛ‰C,pre)\hat{\delta} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})

Here,

  • Ξ”IΛ‹β€šΞ”ΓŒβ€š=Estimated treatment effect
Architecture Diagram
Y
^                    * Treatment (Post)
|                  /
|                / <-- Ξ” (DiD estimate)
|              /
|      *------* Treatment (Pre)
|    /
|  *----------* Control (Pre and Post)
+-----------------------------------------> Time
  Pre              Post

Parallel Trends Assumption

Parallel Trends Assumption

The key assumption: absent treatment, treatment and control groups would have followed parallel trends. This is untestable but can be partially validated using pre-treatment data.

Implementation

import statsmodels.api as sm

np.random.seed(42)

n_states = 50
n_periods = 20
states = []
for state in range(n_states):
    treated = 1 if state < 25 else 0
    for time in range(n_periods):
        post = 1 if time >= 10 else 0
        true_effect = 500 if treated and post else 0
        state_effect = state * 100
        time_effect = time * 50
        outcome = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        states.append({
            'state': state, 'time': time, 'treated': treated,
            'post': post, 'employment': outcome
        })

df = pd.DataFrame(states)
df['did'] = df['treated'] * df['post']

import statsmodels.formula.api as smf
model = smf.ols('employment ~ treated + post + did', data=df).fit()
print(model.summary())
print(f"\nDiD Estimate: {model.params['did']:.2f}")
print(f"P-value: {model.pvalues['did']:.4f}")

Instrumental Variables (IV)

When to Use IV

When treatment is correlated with unobserved confounders (endogeneity):

Endogeneity

Cov(X,Ο΅)β‰ 0\text{Cov}(X, \epsilon) \neq 0

Here,

  • XX=Treatment variable
  • ΡΡ=Error term

Instrument Requirements

  1. Relevance: Z is correlated with X
  2. Exclusion: Z affects Y only through X
  3. Exogeneity: Z is uncorrelated with confounders

Two-Stage Least Squares (2SLS)

First Stage

X=Ο€0+Ο€1Z+Ξ½X = \pi_0 + \pi_1 Z + \nu

Here,

  • ZZ=Instrument
  • Ο€1π₁=First-stage coefficient

Second Stage

Y=Ξ²0+Ξ²1X^+Ο΅Y = \beta_0 + \beta_1 \hat{X} + \epsilon

Here,

  • XIΛ‹β€šXΓŒβ€š=Predicted treatment from first stage
IV Estimator
Ξ²^IV=Cov(Z,Y)Cov(Z,X)\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}

Here,

  • Ξ²IΛ‹β€šIVΞ²ΓŒβ€š_IV=IV estimate of treatment effect
Architecture Diagram
Confounding:

    U (unobserved)
   / \
  v   v
  X   Y     Endogeneity: Cov(X, Ξ΅) β‰ˆΒ  0

Instrumental Variable:

    U (unobserved)
   / \
  v   v
  X   Y     IV Z satisfies: Z Γ’Ε Β₯ U, Z -> X
  ^
  |
  Z (instrument)

Regression Discontinuity Design (RDD)

Sharp RDD

Sharp RDD Treatment Assignment

Di=1(Xiβ‰₯c)D_i = \mathbb{1}(X_i \geq c)

Here,

  • DiD_i=Treatment indicator
  • XiX_i=Running variable
  • cc=Cutoff point
RDD Estimator
Ο„^RDD=lim⁑x↓cE[Y∣X=x]βˆ’lim⁑x↑cE[Y∣X=x]\hat{\tau}_{RDD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x]

Here,

  • IΒ¨β€žIΛ‹β€šRDDΓβ€žΓŒβ€š_RDD=Estimated treatment effect at cutoff
Architecture Diagram
Y
^         *  *  *
|       *        *  *
|     *    Treatment  *  *
|   *        *          *
|  *   *                 *
| *  *    Control
|*  *
+---------------------------+-----------> X
                     Cutoff (c)

RDD Validity

RDD is considered one of the most credible quasi-experimental methods because it mimics a randomized experiment at the cutoff. The key validity check is whether pre-treatment covariates also show discontinuities at the cutoff.

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

  1. Association: P(Y | X) β€” Seeing
  2. Intervention: P(Y | do(X)) β€” Doing
  3. Counterfactual: P(Y_x | X', Y') β€” Imagining

Causal Hierarchy

Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.

do-Calculus

Backdoor Adjustment Formula
P(Y∣do(X))=βˆ‘zP(Y∣X,z)P(z)(whenΒ backdoorΒ criterionΒ satisfied)P(Y \mid do(X)) = \sum_z P(Y \mid X, z)P(z) \quad \text{(when backdoor criterion satisfied)}

Here,

  • do(X)do(X)=Intervention on X
  • zz=Confounders satisfying backdoor criterion

Backdoor Criterion

A set Z satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in Z is a descendant of X
  2. Z blocks every path between X and Y that contains an arrow into X

Key Takeaways

Summary: Causal Inference

  1. Correlation β‰ˆΒ  Causation: Always consider confounders β€” observational data alone cannot establish causation without additional assumptions
  2. Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
  3. Propensity score matching reduces confounding by balancing covariates between treated and control groups
  4. DiD uses variation over time to control for unobserved time-invariant confounders
  5. IV addresses endogeneity using exogenous variation from instruments
  6. RDD exploits arbitrary cutoffs for quasi-experimental identification β€” one of the most credible quasi-experimental designs
  7. DAGs help visualize and verify causal assumptions formally
  8. No method is perfect: All require untestable assumptions; sensitivity analysis is essential

Practice Exercises

  1. Propensity Score Matching: Perform PSM on job training data and check covariate balance before/after matching
  2. DiD Analysis: Replicate the minimum wage study and test the parallel trends assumption
  3. IV Estimation: Implement IV using quarter of birth as an instrument for education. Check instrument strength
  4. RDD Design: Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths
  5. Discussion: When would you prefer DiD over propensity score matching? How do you validate the exclusion restriction?
⭐

Premium Content

Causal Inference: Beyond Correlation

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement