Path Analysis
Advanced Statistical Methods
Tracing Cause and Effect Through Systems
Path analysis decomposes relationships among observed variables into direct, indirect, and total effects using path diagrams and structural equations. It reveals how variables influence each other through chains of causation.
- Social science: Quantify how education affects income both directly and through occupational prestige
- Epidemiology: Trace how risk factors contribute to disease through mediating biological pathways
- Organizational behavior: Map how leadership styles influence employee performance through motivation
Path analysis reveals not just whether variables are related, but how they influence each other.
What Is Path Analysis?
Definition: Path Analysis
Path analysis is a multivariate technique for modeling direct and indirect causal relationships among a set of observed variables. It is a special case of structural equation modeling in which all variables are observed (no latent variables). Path analysis allows decomposition of correlations into components attributable to direct causation, indirect causation (mediation), and spurious association.
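As a small numerical sketch of this decomposition, consider a toy model in which X1 and X2 are unit-variance exogenous variables correlated at 0.3, both affect M, and all three affect Y. The coefficient values are illustrative assumptions, not estimates from real data. The covariance between X1 and Y then splits into a direct part, an indirect part through M, and a non-causal part that exists only because X1 is correlated with X2.

```python
# Hypothetical path coefficients (illustrative assumptions)
r12 = 0.3                   # correlation between exogenous X1 and X2 (unit variances)
a1, a2 = 0.5, 0.2           # X1 -> M, X2 -> M
c1, c2, b = 0.3, 0.4, 0.6   # X1 -> Y, X2 -> Y, M -> Y

direct = c1                        # X1 -> Y
indirect = a1 * b                  # X1 -> M -> Y
# Non-causal association: X1 <-> X2, followed by the total effect of X2 on Y
noncausal = r12 * (c2 + a2 * b)

implied_cov = direct + indirect + noncausal

# Cross-check with covariance algebra on the structural equations
cov_x1_m = a1 + a2 * r12                    # Cov(X1, M)
cov_x1_y = c1 + c2 * r12 + b * cov_x1_m     # Cov(X1, Y)

print(f"direct={direct:.3f}  indirect={indirect:.3f}  non-causal={noncausal:.3f}")
print(f"sum of components={implied_cov:.3f}  covariance algebra={cov_x1_y:.3f}")
```

Both routes give 0.756, of which only 0.600 (direct plus indirect) reflects the causal effect of X1 under the assumed model.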
Path analysis was developed by Sewall Wright (1920s) as a method for analyzing causal systems in genetics and economics. It remains foundational in psychology, education, and the social sciences.
Path Diagrams
Definition: Path Diagram
A path diagram is a graphical representation of a path model:
- Boxes represent observed variables
- Single-headed arrows (→) represent direct causal effects (path coefficients)
- Double-headed arrows (↔) represent correlations or covariances
- Residual arrows (e →) represent unexplained variance in endogenous variables
The direction of arrows encodes the assumed causal order: variables at the "top" of the causal chain are exogenous (their causes are not modeled); variables at the "bottom" are endogenous (their causes are within the model).
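One way to make the diagram conventions concrete is to store the single-headed arrows as (cause, effect) pairs and read the exogenous and endogenous variables off that list. The variable names and the diagram below are hypothetical; this is only a sketch of the bookkeeping, not a modeling tool.

```python
# Single-headed arrows as (cause, effect) pairs for a hypothetical diagram:
# X1 -> M -> Y, X1 -> Y, X2 -> Y
paths = [("X1", "M"), ("M", "Y"), ("X1", "Y"), ("X2", "Y")]
# Double-headed arrows: covariances among exogenous variables
covariances = [("X1", "X2")]

variables = sorted({v for edge in paths for v in edge})
# A variable is endogenous if at least one single-headed arrow points into it
endogenous = sorted({effect for _, effect in paths})
exogenous = [v for v in variables if v not in endogenous]

print("Exogenous:", exogenous)    # ['X1', 'X2']
print("Endogenous:", endogenous)  # ['M', 'Y']
```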
Decomposition of Effects
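Theorem: Path Coefficient Decomposition
For a recursive path model with observed variables, the total effect of variable $X_j$ on variable $Y$ decomposes as:
$$T_{Y X_j} = p_{Y X_j} + \sum_{k} p_{M_k X_j}\, p_{Y M_k}$$
where:
- $p_{Y X_j}$ is the direct effect (path coefficient from $X_j$ to $Y$)
- $p_{M_k X_j}\, p_{Y M_k}$ represents an indirect effect through mediator $M_k$
- The sum accounts for all mediating pathways
More generally, for the full system $\mathbf{y} = \mathbf{B}\mathbf{y} + \boldsymbol{\Gamma}\mathbf{x} + \boldsymbol{\zeta}$:
$$\mathbf{T} = (\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\Gamma}$$
Direct, Indirect, and Total Effects
Here,
- $p_{Y X_j}$ = The path coefficient for the direct arrow from $X_j$ to $Y$
- $p_{M_k X_j}\, p_{Y M_k}$ = Product of path coefficients along the mediated pathway
- $T_{Y X_j}$ = Sum of direct and all indirect effects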
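The decomposition can be checked numerically in matrix form. The sketch below assumes the common notation y = By + Γx + ζ, where B holds effects among endogenous variables and Γ holds effects of exogenous on endogenous variables; the coefficient values are hypothetical. Under that notation, total effects of x on y are (I − B)⁻¹Γ, and indirect effects are the total effects minus Γ.

```python
import numpy as np

# Hypothetical coefficients. Endogenous order: [M, Y]; exogenous order: [X1, X2].
# B[i, j]     = direct effect of endogenous variable j on endogenous variable i
# Gamma[i, j] = direct effect of exogenous variable j on endogenous variable i
B = np.array([[0.0, 0.0],
              [0.6, 0.0]])       # M -> Y
Gamma = np.array([[0.5, 0.2],    # X1 -> M, X2 -> M
                  [0.3, 0.4]])   # X1 -> Y, X2 -> Y

total = np.linalg.inv(np.eye(2) - B) @ Gamma
indirect = total - Gamma

# Same result by tracing paths by hand for X1 -> Y: direct + (X1 -> M)(M -> Y)
by_hand_x1_y = 0.3 + 0.5 * 0.6

print("Total effects (rows M, Y; columns X1, X2):\n", total)
print("Indirect effects:\n", indirect)
print("X1 -> Y traced by hand:", by_hand_x1_y)
```

The Y row of the total-effect matrix is [0.60, 0.52], matching the hand-traced sums 0.3 + 0.5·0.6 and 0.4 + 0.2·0.6.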
Identification Rules
Theorem: Identification Conditions for Path Models
A path model is identified (has a unique solution for the path coefficients) under the following conditions:
- Recursive models (no feedback loops): always identified if there is at least one exogenous variable predicting each endogenous variable
- Order condition (necessary): for each endogenous variable, there must be at least as many exogenous variables excluded from its equation as there are endogenous variables included on the right-hand side of its equation
- Rank condition (necessary and sufficient): for each endogenous variable, the matrix of coefficients of excluded exogenous variables must have full row rank
For the standard recursive path model with a set of endogenous variables and a set of exogenous variables, the model is always identified when:
- All exogenous variables are correlated (their covariances are freely estimated)
- Each endogenous variable has at least one exogenous predictor
Order Condition Explained
For endogenous variable $Y_i$, let $g_i$ = number of included endogenous variables (including $Y_i$) and $k_i$ = number of excluded exogenous variables. The order condition requires $k_i \ge g_i - 1$. If $k_i = g_i - 1$ exactly, the equation is just-identified. If $k_i > g_i - 1$, it is over-identified and can be tested for fit.
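The counting rule is easy to mechanize. The helper below is a hypothetical sketch that assumes the usual form of the order condition: the number of excluded exogenous variables must be at least the number of included endogenous variables (counting the dependent variable) minus one. Because the order condition is only necessary, passing it does not by itself establish identification.

```python
def order_condition(n_included_endogenous, n_excluded_exogenous):
    """Classify one equation by the order condition alone.

    n_included_endogenous counts the dependent variable itself.
    The rank condition must still be checked separately.
    """
    needed = n_included_endogenous - 1
    if n_excluded_exogenous < needed:
        return "under-identified"
    if n_excluded_exogenous == needed:
        return "just-identified"
    return "over-identified"

# Hypothetical feedback model: Y1 ~ Y2 + X1 and Y2 ~ Y1 + X2
print(order_condition(2, 1))  # Y1 equation: Y1 and Y2 included, X2 excluded -> just-identified
print(order_condition(2, 0))  # if X2 also entered the Y1 equation -> under-identified
print(order_condition(2, 2))  # with a second excluded exogenous variable -> over-identified
```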
Recursive vs. Non-Recursive Models
Definition: Recursive Path Model
A recursive path model has no feedback loops: the causal flow is unidirectional. All structural errors are uncorrelated (or at least uncorrelated with all predictors of a given endogenous variable). Recursive models are always identified under standard conditions.
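Whether a set of directed paths contains a feedback loop can be checked by attempting a topological sort of the diagram. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the function name and example paths are hypothetical. It checks only the "no feedback loops" part of the definition and says nothing about correlated errors.

```python
from graphlib import CycleError, TopologicalSorter

def has_no_feedback_loops(paths):
    """Return True if the (cause, effect) pairs form an acyclic diagram."""
    predecessors = {}
    for cause, effect in paths:
        predecessors.setdefault(effect, set()).add(cause)
        predecessors.setdefault(cause, set())
    try:
        # static_order() raises CycleError when a directed cycle exists
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False

recursive_paths = [("X1", "M"), ("X2", "M"), ("X1", "Y"), ("M", "Y")]
feedback_paths = recursive_paths + [("Y", "M")]   # adds the loop M -> Y -> M

print(has_no_feedback_loops(recursive_paths))  # True
print(has_no_feedback_loops(feedback_paths))   # False
```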
Definition: Non-Recursive Path Model
A non-recursive model contains feedback loops (e.g., $Y_1 \to Y_2 \to Y_1$) or simultaneous equations. These models require additional identification conditions:
- The order condition must be satisfied
- The rank condition must be satisfied
- Instrumental variables or exclusion restrictions may be needed
- Estimation typically requires 2SLS, 3SLS, or full information ML
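To illustrate why ordinary least squares is not enough here, the following sketch simulates a hypothetical two-equation feedback system and estimates one equation by hand-rolled two-stage least squares with NumPy. The coefficients, sample size, and variable names are assumptions made for the example; a dedicated econometrics or SEM package would normally be used in practice, and standard errors are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical feedback system (coefficients are illustrative assumptions):
#   Y1 = 0.5*Y2 + 1.0*X1 + e1
#   Y2 = 0.3*Y1 + 1.0*X2 + e2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
e1, e2 = rng.normal(size=n), rng.normal(size=n)

# Solve the simultaneous system (I - B) y = Gamma x + e for Y1 and Y2
B = np.array([[0.0, 0.5],
              [0.3, 0.0]])
rhs = np.vstack([x1 + e1, x2 + e2])            # shape (2, n)
y1, y2 = np.linalg.solve(np.eye(2) - B, rhs)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Naive OLS for the Y1 equation: biased because Y2 depends on e1 through the loop
beta_ols = ols(np.column_stack([y2, x1, ones]), y1)[0]

# Stage 1: project Y2 on all exogenous variables (X2 is the excluded instrument)
Z = np.column_stack([x1, x2, ones])
y2_hat = Z @ ols(Z, y2)
# Stage 2: replace Y2 with its fitted values
beta_2sls = ols(np.column_stack([y2_hat, x1, ones]), y1)[0]

print(f"True effect of Y2 on Y1: 0.5")
print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_2sls:.3f}")
```

The Y1 equation is estimable only because X2 is excluded from it, which is the exclusion restriction the order condition asks for.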
Mediation Analysis via Path Analysis
Definition: Mediation
Variable $M$ mediates the effect of $X$ on $Y$ if:
- $X$ affects $M$ (path $a$)
- $M$ affects $Y$ controlling for $X$ (path $b$)
- The total effect of $X$ on $Y$ is partially or fully transmitted through $M$
The indirect effect is $a \times b$. Mediation is present when $ab \neq 0$.
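For a single mediator with linear regressions fitted by OLS on the same sample, the total effect c (from regressing Y on X alone) equals the direct effect c' plus the indirect effect a·b exactly. The sketch below checks this on simulated data; the data-generating coefficients and the helper name `slopes` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)              # true a = 0.5
y = 0.3 * x + 0.6 * m + rng.normal(size=n)    # true c' = 0.3, b = 0.6

def slopes(predictors, outcome):
    """OLS slopes of outcome on the predictors (intercept fitted, then dropped)."""
    X = np.column_stack(predictors + [np.ones(len(outcome))])
    return np.linalg.lstsq(X, outcome, rcond=None)[0][:-1]

(a,) = slopes([x], m)             # path a: X -> M
c_prime, b = slopes([x, m], y)    # direct effect c' and path b (M -> Y given X)
(c,) = slopes([x], y)             # total effect c

print(f"a = {a:.3f}, b = {b:.3f}, indirect a*b = {a * b:.3f}")
print(f"direct c' = {c_prime:.3f}, total c = {c:.3f}")
print(f"c' + a*b  = {c_prime + a * b:.3f}")
```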
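Theorem: Sobel Test for Indirect Effects
The classic test for mediation uses the Sobel statistic:
$$z = \frac{ab}{\sqrt{b^2 s_a^2 + a^2 s_b^2}}$$
where $s_a$ and $s_b$ are the standard errors of $a$ and $b$. Under $H_0: ab = 0$ (no indirect effect), $z \sim N(0, 1)$ approximately. Modern practice prefers bootstrap confidence intervals for the indirect effect, which do not assume normality of the distribution of the product $ab$.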
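A minimal sketch of the Sobel calculation, assuming the first-order standard error sqrt(b²·s_a² + a²·s_b²) and classical OLS standard errors for a and b. The simulated data and the helper `ols_with_se` are hypothetical, and the p-value relies on the normal approximation that the bootstrap avoids.

```python
import numpy as np
from math import erfc, sqrt

def ols_with_se(X, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(42)
n = 300
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.6 * m + 0.3 * x + rng.normal(size=n)
ones = np.ones(n)

# Path a from M ~ X; path b from Y ~ X + M
beta_m, se_m = ols_with_se(np.column_stack([x, ones]), m)
beta_y, se_y = ols_with_se(np.column_stack([x, m, ones]), y)
a, s_a = beta_m[0], se_m[0]
b, s_b = beta_y[1], se_y[1]

z = (a * b) / np.sqrt(b**2 * s_a**2 + a**2 * s_b**2)
p_value = erfc(abs(z) / sqrt(2))   # two-sided p-value under N(0, 1)

print(f"a = {a:.3f} (SE {s_a:.3f}), b = {b:.3f} (SE {s_b:.3f})")
print(f"Sobel z = {z:.2f}, p = {p_value:.2g}")
```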
Python Implementation
Path Analysis with semopy
import numpy as np
import pandas as pd
from semopy import Model, calc_stats
np.random.seed(42)
n = 500
# True path model:
# X1 -> M -> Y
# X2 -> M -> Y
# X2 -> Y (direct)
# X1 -> Y (direct)
# X1 <-> X2 (correlated exogenous)
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
# X1 and X2 correlated
x2 = 0.3 * x1 + np.sqrt(1 - 0.3**2) * x2
# M = 0.5*X1 + 0.2*X2 + error
m = 0.5 * x1 + 0.2 * x2 + np.random.normal(0, 0.8, n)
# Y = 0.3*X1 + 0.4*X2 + 0.6*M + error
y = 0.3 * x1 + 0.4 * x2 + 0.6 * m + np.random.normal(0, 0.7, n)
df = pd.DataFrame({'X1': x1, 'X2': x2, 'M': m, 'Y': y})
# Define path model
spec = """
# Structural equations
M ~ X1 + X2
Y ~ X1 + X2 + M
# Covariance among exogenous variables
X1 ~~ X2
"""
# semopy takes the model description in the constructor and the data in fit()
model = Model(spec)
model.fit(df)
# Parameter estimates
estimates = model.inspect()
print("Path Coefficients:")
# semopy reports estimates in the 'Estimate' and 'Std. Err' columns
print(estimates[['lval', 'op', 'rval', 'Estimate', 'Std. Err', 'p-value']])
# Calculate effects manually from path coefficients
params = estimates.set_index(['op', 'lval', 'rval'])['Estimate']
# Direct effects on Y
direct_x1_y = params[('~', 'Y', 'X1')]
direct_x2_y = params[('~', 'Y', 'X2')]
direct_m_y = params[('~', 'Y', 'M')]
# Direct effects on M
direct_x1_m = params[('~', 'M', 'X1')]
direct_x2_m = params[('~', 'M', 'X2')]
# Indirect effects (through M)
indirect_x1_y = direct_x1_m * direct_m_y
indirect_x2_y = direct_x2_m * direct_m_y
# Total effects
total_x1_y = direct_x1_y + indirect_x1_y
total_x2_y = direct_x2_y + indirect_x2_y
print("\n=== Effect Decomposition ===")
print(f"X1 -> Y: Direct = {direct_x1_y:.4f}, Indirect = {indirect_x1_y:.4f}, Total = {total_x1_y:.4f}")
print(f"X2 -> Y: Direct = {direct_x2_y:.4f}, Indirect = {indirect_x2_y:.4f}, Total = {total_x2_y:.4f}")
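# Model fit. Note: this specification estimates every possible path, so it is
# just-identified and the fit indices carry little information here.
# calc_stats in semopy 2.x reports chi2, CFI, TLI, RMSEA, AIC, BIC (no SRMR column).
stats = calc_stats(model)
print(f"\nChi-square: {stats['chi2'].values[0]:.4f} (df = {stats['DoF'].values[0]})")
print(f"CFI: {stats['CFI'].values[0]:.4f}")
print(f"RMSEA: {stats['RMSEA'].values[0]:.4f}")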
Bootstrap Mediation Test
import numpy as np

def bootstrap_indirect_effect(x, m, y, n_boot=5000):
    """Percentile-bootstrap test for mediation: a*b indirect effect."""
    n = len(x)
    indirects = []
    for _ in range(n_boot):
        # Resample cases with replacement
        idx = np.random.choice(n, size=n, replace=True)
        x_b, m_b, y_b = x[idx], m[idx], y[idx]
        # Path a: M ~ X
        a = np.polyfit(x_b, m_b, 1)[0]
        # Path b: Y ~ M (controlling for X); columns are [X, M, intercept]
        X_bm = np.column_stack([x_b, m_b, np.ones(n)])
        b_path = np.linalg.lstsq(X_bm, y_b, rcond=None)[0][1]
        indirects.append(a * b_path)
    indirects = np.asarray(indirects)
    # 95% percentile confidence interval for the indirect effect
    ci = np.percentile(indirects, [2.5, 97.5])
    return indirects.mean(), indirects.std(ddof=1), ci, indirects

np.random.seed(42)
n = 300
x = np.random.normal(0, 1, n)
m = 0.5 * x + np.random.normal(0, 1, n)
y = 0.6 * m + 0.3 * x + np.random.normal(0, 1, n)

mean_ind, se_ind, ci, indirects = bootstrap_indirect_effect(x, m, y)
print(f"Mean indirect effect (a*b): {mean_ind:.4f}")
print(f"Bootstrap SE: {se_ind:.4f}")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
# The indirect effect is significant at the 5% level if the CI excludes zero
print(f"Significant mediation: {'Yes' if ci[0] > 0 or ci[1] < 0 else 'No'}")
Key Takeaways
Summary: Path Analysis
- Path analysis models direct and indirect causal relationships among observed variables
- Path diagrams encode causal assumptions: arrows = direct effects; double-headed arrows = correlations
- Total effect = Direct effect + Indirect effects; each indirect effect is the product of the path coefficients along its pathway
- Recursive models (no feedback loops) are always identified under standard conditions
- Non-recursive models (feedback loops) require exclusion restrictions or instrumental variables
- Mediation is tested via the indirect effect $ab$; bootstrap CIs are preferred over the Sobel test
- Path analysis is a special case of SEM with no latent variables; use full SEM when measurement error is a concern
- Always assess model fit (CFI, RMSEA, SRMR) and compare alternative path specifications