Structural Equation Modeling (SEM)
Advanced Statistical Methods
Testing Complex Causal Theories Simultaneously
SEM combines factor analysis and path analysis to test entire theoretical models with latent variables, direct effects, and indirect effects in a single unified framework. It answers questions that simpler methods cannot.
- Psychology: Test theories about unobservable constructs like intelligence or anxiety
- Marketing research: Model how brand perception drives customer loyalty through mediating factors
- Education: Evaluate how teaching methods influence learning outcomes through multiple pathways
SEM lets you test the whole theory, not just isolated pieces of it.
What Is SEM?
Definition: Structural Equation Modeling
Structural equation modeling is a multivariate framework that simultaneously estimates structural relationships among latent and observed variables and measurement relationships between latent constructs and their indicators. SEM combines factor analysis (measurement model) with path analysis (structural model) into a single confirmatory framework.
SEM allows researchers to:
- Test complex theoretical models with multiple dependent and independent variables
- Model latent variables (constructs not directly observed)
- Assess both direct and indirect effects simultaneously
- Evaluate overall model fit against the observed covariance matrix
Components of an SEM
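Definition: Measurement Model
The measurement model specifies how latent variables ($\eta$, $\xi$) are measured by observed indicators ($y$, $x$):
$$y = \Lambda_y \eta + \varepsilon, \qquad x = \Lambda_x \xi + \delta$$
where $\Lambda_y$ and $\Lambda_x$ are matrices of factor loadings, and $\varepsilon$, $\delta$ are measurement errors.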
Definition: Structural Model
The structural model specifies relationships among latent variables:
$$\eta = B\eta + \Gamma\xi + \zeta$$
where $B$ captures effects among endogenous latent variables, $\Gamma$ captures effects of exogenous variables, and $\zeta$ is the structural residual.
Full SEM in Matrix Form
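For the endogenous indicators $y$, the model-implied covariance matrix is:
$$\Sigma(\theta) = \Lambda_y (I - B)^{-1} \left( \Gamma \Phi \Gamma^\top + \Psi \right) \left[ (I - B)^{-1} \right]^\top \Lambda_y^\top + \Theta_\varepsilon$$
Here,
- $\Sigma(\theta)$ = Model-implied covariance matrix
- $\Lambda_y$ = Factor loading matrix for endogenous indicators
- $B$ = Matrix of regression coefficients among endogenous latents
- $\Gamma$ = Matrix of effects from exogenous to endogenous latents
- $\Phi$ = Covariance matrix of exogenous latent variables
- $\Psi$ = Covariance matrix of structural residuals
- $\Theta_\varepsilon$ = Covariance matrix of the measurement errors $\varepsilon$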
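As a rough numerical illustration, the sketch below builds the implied covariance of the endogenous indicators using the standard expression $\Lambda_y (I - B)^{-1}(\Gamma\Phi\Gamma^\top + \Psi)[(I - B)^{-1}]^\top \Lambda_y^\top + \Theta_\varepsilon$, where $\Theta_\varepsilon$ is the covariance matrix of the indicator errors. The numbers are the population values from the simulation in the Python Implementation section (one endogenous latent, two exogenous latents). Treating the two exogenous latents as uncorrelated with unit variance is an assumption of this sketch.

```python
import numpy as np

# One endogenous latent (OC) with three indicators; two exogenous latents.
Lambda_y = np.array([[0.75], [0.85], [0.70]])   # loadings of y4, y5, y6
B = np.zeros((1, 1))                            # no paths among endogenous latents
Gamma = np.array([[0.6, 0.4]])                  # effects of the two exogenous latents
Phi = np.eye(2)                                 # assumed: unit variances, no covariance
Psi = np.array([[0.5 ** 2]])                    # structural residual variance (SD 0.5)
Theta_eps = np.eye(3) * 0.4 ** 2                # indicator error variances (SD 0.4)

I = np.eye(B.shape[0])
inv_I_minus_B = np.linalg.inv(I - B)

# Covariance of the endogenous latent(s), then of their indicators.
cov_eta = inv_I_minus_B @ (Gamma @ Phi @ Gamma.T + Psi) @ inv_I_minus_B.T
Sigma_yy = Lambda_y @ cov_eta @ Lambda_y.T + Theta_eps

print("Var(eta):", cov_eta[0, 0])
print("Implied covariance of (y4, y5, y6):")
print(np.round(Sigma_yy, 4))
```

Estimation then amounts to choosing the free entries of these matrices so that the implied covariance is as close as possible to the sample covariance.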
Confirmatory Factor Analysis (CFA)
Definition: Confirmatory Factor Analysis
CFA is a measurement-only SEM that tests whether a predefined factor structure fits the observed data. The model specifies:
$$x = \Lambda \xi + \delta$$
where $\xi$ are latent factors, $\Lambda$ contains factor loadings (some fixed to zero for identification), and $\delta$ are the unique (error) terms, whose variances are the unique variances. CFA is a prerequisite for full SEM: the measurement model must be validated before testing structural paths.
Model Fit Indices
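Theorem: Chi-Square Test of Model Fit
The likelihood-ratio test statistic for SEM:
$$\chi^2 = (N - 1)\, F_{ML}$$
where $F_{ML}$ is the minimum of the maximum likelihood fit function, and $N$ is the sample size. Under correct model specification, $\chi^2$ asymptotically follows a chi-square distribution with $df$ degrees of freedom, where $df = \frac{p(p+1)}{2} - t$ with $p$ observed variables and $t$ free parameters.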
A non-significant p-value indicates acceptable fit (the model-implied covariance matrix is not significantly different from the observed).
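The statistic can be sketched numerically. The block below uses the usual ML discrepancy $F_{ML} = \ln|\Sigma| + \mathrm{tr}(S\Sigma^{-1}) - \ln|S| - p$ and evaluates it for a hypothetical 3x3 sample correlation matrix under the independence model, whose ML solution is simply the diagonal of $S$. The matrix `S` and the sample size are made-up values for illustration.

```python
import numpy as np

def ml_fit_function(S, Sigma):
    """ML discrepancy: ln|Sigma| + tr(S Sigma^{-1}) - ln|S| - p."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

def chi_square_stat(S, Sigma, n_obs, n_free_params):
    """Return (chi2, df) given the fitted implied covariance Sigma."""
    p = S.shape[0]
    chi2 = (n_obs - 1) * ml_fit_function(S, Sigma)
    df = p * (p + 1) // 2 - n_free_params
    return chi2, df

# Hypothetical sample covariance (here a correlation matrix).
S = np.array([
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
])

# Independence model: only the 3 variances are free, all covariances are zero.
Sigma_null = np.diag(np.diag(S))
chi2_null, df_null = chi_square_stat(S, Sigma_null, n_obs=200, n_free_params=3)

print(f"F_ML = {ml_fit_function(S, Sigma_null):.4f}")
print(f"chi2 = {chi2_null:.2f} on {df_null} df")
```

A model that reproduces `S` exactly gives $F_{ML} = 0$ and therefore a chi-square of zero.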
Chi-Square Sensitivity
The $\chi^2$ test is highly sensitive to sample size: because the statistic scales with $N - 1$, in large samples even trivial misspecifications produce significant results. Therefore, researchers rely on approximate fit indices.
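A short arithmetic sketch shows the effect. It holds the degree of misfit fixed (a hypothetical minimized fit function value of 0.02 with 24 degrees of freedom) and only varies the sample size. The critical value 36.415 is the tabulated 0.05 cutoff for a chi-square distribution with 24 degrees of freedom.

```python
# Hypothetical values: the same small misfit evaluated at several sample sizes.
f_ml = 0.02          # minimized ML fit function (assumed constant across N)
df = 24
critical_value = 36.415   # chi-square critical value for df = 24, alpha = 0.05

sample_sizes = (100, 500, 2000, 10000)
rejected = []
for n in sample_sizes:
    chi2 = (n - 1) * f_ml
    reject = chi2 > critical_value
    rejected.append(reject)
    print(f"N={n:>6}: chi2={chi2:8.2f}  reject exact fit: {reject}")
```

The misfit is identical in every row; only the verdict of the exact-fit test changes as the sample grows.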
Comparative Fit Index (CFI)
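$$\text{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_M - df_M,\ \chi^2_0 - df_0,\ 0)}$$
Here,
- $\chi^2_M$ = Chi-square of the specified model
- $df_M$ = Degrees of freedom of the specified model
- $\chi^2_0$ = Chi-square of the null (independence) model
- $df_0$ = Degrees of freedom of the null model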
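The CFI compares the excess of chi-square over degrees of freedom in the specified model with the same excess in the null model. A minimal sketch of that calculation follows. The function name `cfi` and the input values are hypothetical, not output from a fitted model.

```python
def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative Fit Index from model and null-model chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)   # noncentrality of the model
    d_null = max(chi2_null - df_null, 0.0)      # noncentrality of the null model
    denom = max(d_model, d_null)
    if denom == 0.0:
        return 1.0   # neither model shows misfit beyond its df
    return 1.0 - d_model / denom

# Hypothetical fit results.
value = cfi(chi2_model=30.0, df_model=24, chi2_null=1500.0, df_null=36)
print(f"CFI = {value:.4f}")
```

Because the numerator is floored at zero, any model whose chi-square is at or below its degrees of freedom gets a CFI of exactly 1.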
Root Mean Square Error of Approximation (RMSEA)
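$$\text{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\ 0)}{df_M\,(N - 1)}}$$
Here,
- $\text{RMSEA}$ = Approximation error per degree of freedom
- $\chi^2_M$, $df_M$ = Chi-square and degrees of freedom of the specified model
- $N$ = Sample size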
RMSEA Interpretation
- RMSEA $\le 0.05$: close fit
- RMSEA between 0.05 and 0.08: reasonable fit
- RMSEA between 0.08 and 0.10: mediocre fit
- RMSEA $> 0.10$: poor fit
The 90% confidence interval for RMSEA should ideally include values below 0.05. The test of close fit ($H_0$: RMSEA $\le 0.05$) should be non-significant.
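The point estimate and its verbal label can be sketched as below, using the common definition $\sqrt{\max(\chi^2 - df, 0) / (df\,(N - 1))}$ and the conventional bands of 0.05, 0.08, and 0.10. Some software divides by $N$ rather than $N - 1$, so small differences from reported values are possible. The function names and inputs are hypothetical.

```python
import math

def rmsea(chi2_model, df_model, n_obs):
    """RMSEA point estimate from a model chi-square, its df, and sample size."""
    excess = max(chi2_model - df_model, 0.0)
    return math.sqrt(excess / (df_model * (n_obs - 1)))

def describe_rmsea(value):
    """Map an RMSEA value to the conventional verbal label."""
    if value <= 0.05:
        return "close fit"
    if value <= 0.08:
        return "reasonable fit"
    if value <= 0.10:
        return "mediocre fit"
    return "poor fit"

# Hypothetical fit results.
estimate = rmsea(chi2_model=60.0, df_model=24, n_obs=500)
print(f"RMSEA = {estimate:.4f} ({describe_rmsea(estimate)})")
```

This covers only the point estimate; the confidence interval and the close-fit test require the noncentral chi-square distribution and are left to SEM software.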
Standardized Root Mean Square Residual (SRMR)
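$$\text{SRMR} = \sqrt{\frac{2 \sum_{i=1}^{p} \sum_{j=1}^{i} \left( r_{ij} - \hat{r}_{ij} \right)^2}{p(p+1)}}$$
Here,
- $r_{ij}$ = Observed correlation
- $\hat{r}_{ij}$ = Model-implied correlation
- $p$ = Number of observed variables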
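SRMR is the root mean square of the residual correlations over the $p(p+1)/2$ unique elements of the matrix. The sketch below standardizes both the sample and the implied covariance matrices by the sample standard deviations, which is one common convention (software packages differ in details). The matrices are hypothetical.

```python
import numpy as np

def srmr(S, Sigma):
    """Standardized root mean square residual.

    Both matrices are scaled by the sample standard deviations, then the
    residuals in the lower triangle (diagonal included) are averaged.
    """
    sd = np.sqrt(np.diag(S))
    scale = np.outer(sd, sd)
    resid = S / scale - Sigma / scale
    rows, cols = np.tril_indices(S.shape[0])
    return np.sqrt(np.mean(resid[rows, cols] ** 2))

# Hypothetical sample correlations and an independence-model implied matrix.
S = np.array([
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
])
Sigma_null = np.eye(3)

print(f"SRMR (independence model) = {srmr(S, Sigma_null):.4f}")
print(f"SRMR (perfect fit)        = {srmr(S, S):.4f}")
```

Unlike CFI and RMSEA, this index does not depend on the chi-square statistic, so it gives a separate view of how large the residuals actually are.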
Fit Index Benchmarks
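| Index | Excellent | Acceptable | Poor |
|---|---|---|---|
| CFI | $\ge 0.95$ | 0.90 to 0.95 | $< 0.90$ |
| RMSEA | $\le 0.05$ | 0.05 to 0.08 | $> 0.10$ |
| SRMR | $\le 0.05$ | 0.05 to 0.08 | $> 0.10$ |
| TLI | $\ge 0.95$ | 0.90 to 0.95 | $< 0.90$ |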
Identification
Theorem: Identification Rules for SEM
A model is identified if there is a unique solution for the free parameters. The t-rule (necessary condition) states:
$$t \le \frac{p(p+1)}{2}$$
where $t$ is the number of free parameters and $p$ is the number of observed variables. The right side is the number of unique elements in the sample covariance matrix.
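As a worked count, consider the three-factor model used in the Python Implementation section (nine indicators, three latents, two structural paths). The tally below assumes marker-variable scaling (one loading per latent fixed to 1) and a freely estimated covariance between the two exogenous latents. Both are common software defaults but are assumptions here, and the helper `t_rule` is a hypothetical name.

```python
def t_rule(n_observed, n_free_params):
    """Return (unique covariance elements, degrees of freedom, passes t-rule)."""
    unique = n_observed * (n_observed + 1) // 2
    return unique, unique - n_free_params, n_free_params <= unique

# Free parameters under the stated assumptions.
free_params = {
    "factor loadings": 6,                 # 9 indicators, 3 loadings fixed to 1
    "indicator error variances": 9,
    "structural paths": 2,                # OC ~ JS + Leadership
    "exogenous latent variances": 2,      # JS, Leadership
    "exogenous latent covariance": 1,     # assumed free
    "structural residual variance": 1,    # disturbance of OC
}
t = sum(free_params.values())

unique, df, passes = t_rule(n_observed=9, n_free_params=t)
print(f"t = {t}, unique elements = {unique}, df = {df}, t-rule satisfied: {passes}")
```

Passing the t-rule only shows that identification is possible, not that it holds, which is why sufficient conditions are also checked.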
Sufficient conditions:
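- Recursive rule: structural models with no feedback loops and uncorrelated disturbances are identified, provided the measurement model for each latent is itself identified
- The three-indicator rule: each latent has at least 3 indicators, each indicator loads on only one factor, residuals are uncorrelated, and the scale of each latent is fixed (one loading or the factor variance set to 1)
- Non-recursive models (with feedback loops) can be identified when the order and rank conditions hold, for example through instrumental variables, which also make two-stage least squares estimation possible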
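The three-indicator case can be checked by hand. For one factor with unit variance and three indicators, the covariances are $\sigma_{12} = \lambda_1\lambda_2$, $\sigma_{13} = \lambda_1\lambda_3$, and $\sigma_{23} = \lambda_2\lambda_3$, so each loading can be solved for uniquely (up to sign). The sketch below uses the Job Satisfaction loadings and error variance from the simulation in the Python Implementation section and assumes positive loadings.

```python
import numpy as np

# Population values: loadings 0.8, 0.7, 0.9 and error SD 0.3 for each indicator.
lam_true = np.array([0.8, 0.7, 0.9])
theta_true = np.full(3, 0.3 ** 2)

# Implied covariance of a one-factor model with factor variance fixed to 1.
Sigma = np.outer(lam_true, lam_true) + np.diag(theta_true)

s12, s13, s23 = Sigma[0, 1], Sigma[0, 2], Sigma[1, 2]

# Solve the three covariance equations for the three loadings.
lam_hat = np.array([
    np.sqrt(s12 * s13 / s23),
    np.sqrt(s12 * s23 / s13),
    np.sqrt(s13 * s23 / s12),
])
# Error variances follow from the diagonal.
theta_hat = np.diag(Sigma) - lam_hat ** 2

print("recovered loadings:       ", np.round(lam_hat, 4))
print("recovered error variances:", np.round(theta_hat, 4))
```

Six covariance elements determine six parameters, so the model is just-identified. With only two indicators there is a single covariance for two loadings, and no unique solution exists without further constraints.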
Model Modification Indices
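Definition: Modification Index
The modification index (MI) for a fixed parameter estimates the decrease in $\chi^2$ if that parameter were freely estimated:
$$\text{MI} \approx \chi^2_{\text{fixed}} - \chi^2_{\text{freed}}$$
It is usually reported together with the expected parameter change (EPC), the approximate value the parameter would take if freed. Large modification indices (typically $> 3.84$, the $\chi^2_1$ critical value at $\alpha = 0.05$) suggest potentially important misspecifications.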
Prudence with Modification Indices
Modification indices should be used sparingly and only when theoretically justified. Freely adding parameters based purely on statistical fit inflates Type I error and capitalizes on chance. Always validate modifications on a holdout sample.
Python Implementation
SEM with semopy
import numpy as np
import pandas as pd
# semopy is the primary Python package for SEM
from semopy import Model, calc_stats
np.random.seed(42)
n = 500
# Simulate SEM data
# Latent factors
eta1 = np.random.normal(0, 1, n)  # Latent: Job Satisfaction
xi = np.random.normal(0, 1, n)    # Latent: Leadership Quality
# Structural model: eta2 = 0.6*eta1 + 0.4*xi + zeta
zeta = np.random.normal(0, 0.5, n)  # Structural residual (SD 0.5)
eta2_true = 0.6 * eta1 + 0.4 * xi + zeta  # Latent: Organizational Commitment
# Indicators (measurement model)
eps = np.random.normal(0, 0.3, (n, 3))
y1 = 0.8 * eta1 + eps[:, 0] # JS indicator 1
y2 = 0.7 * eta1 + eps[:, 1] # JS indicator 2
y3 = 0.9 * eta1 + eps[:, 2] # JS indicator 3
eps2 = np.random.normal(0, 0.4, (n, 3))
y4 = 0.75 * eta2_true + eps2[:, 0] # OC indicator 1
y5 = 0.85 * eta2_true + eps2[:, 1] # OC indicator 2
y6 = 0.70 * eta2_true + eps2[:, 2] # OC indicator 3
eps3 = np.random.normal(0, 0.35, (n, 3))
x1 = 0.8 * xi + eps3[:, 0] # Leadership indicator 1
x2 = 0.65 * xi + eps3[:, 1] # Leadership indicator 2
x3 = 0.9 * xi + eps3[:, 2] # Leadership indicator 3
df = pd.DataFrame({'y1': y1, 'y2': y2, 'y3': y3,
                   'y4': y4, 'y5': y5, 'y6': y6,
                   'x1': x1, 'x2': x2, 'x3': x3})
# Define SEM model specification (lavaan-like syntax)
spec = """
# Measurement model
JS =~ y1 + y2 + y3
OC =~ y4 + y5 + y6
Leadership =~ x1 + x2 + x3
# Structural model
OC ~ JS + Leadership
"""
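model = Model(spec)  # semopy takes the model description in the constructor
model.fit(df)
# Extract parameter estimates
estimates = model.inspect()
print("Parameter Estimates:")
# Column names as returned by semopy 2.x inspect()
print(estimates[['lval', 'op', 'rval', 'Estimate', 'Std. Err', 'p-value']])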
# Model fit statistics
stats = calc_stats(model)
print("\nModel Fit Statistics:")
print(f" Chi-Square: {stats['chi2'].values[0]:.2f}")
print(f" df: {stats['DoF'].values[0]:.0f}")
print(f" CFI: {stats['CFI'].values[0]:.4f}")
print(f" RMSEA: {stats['RMSEA'].values[0]:.4f}")
print(f" TLI: {stats['TLI'].values[0]:.4f}")
# semopy 2.x calc_stats does not appear to report SRMR; print it only if present
if 'SRMR' in stats.columns:
    print(f" SRMR: {stats['SRMR'].values[0]:.4f}")
Key Takeaways
Summary: Structural Equation Modeling
- SEM combines measurement models (CFA) with structural models (path analysis) into a single framework
- The model-implied covariance matrix is compared to the observed covariance matrix
- CFI $\ge 0.95$, RMSEA $\le 0.05$, and SRMR $\le 0.05$ are commonly taken to indicate excellent fit
- Identification requires $t \le p(p+1)/2$ free parameters (a necessary condition); insufficient indicators cause underidentification
- ML estimation assumes multivariate normality; use robust alternatives (for example WLSMV for ordinal indicators, or robust ML for non-normal continuous data)
- Modification indices can guide model respecification but must be theoretically justified
- Always report multiple fit indices; no single index is sufficient
- SEM requires large samples: $N \ge 200$ is a commonly cited minimum, and larger samples are preferred for complex models