Odds Ratios
Regression Analysis
Interpreting Associations in Binary Outcomes
Odds ratios quantify the strength of association between exposures and binary outcomes. They are the primary effect measure in logistic regression and case-control studies, providing intuitive multiplicative comparisons.
- Epidemiology: Measure risk factor associations in disease studies
- Clinical Trials: Report treatment effects on binary endpoints
- Social Sciences: Quantify how factors like education affect binary decisions
An odds ratio of 2 means the odds of the outcome are doubled: a simple interpretation with far-reaching implications.
The odds ratio (OR) measures the association between a binary predictor and a binary outcome. It is the ratio of two odds.
Odds Ratio
$$\mathrm{OR} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}$$
Here,
- $\mathrm{OR}$ = Odds ratio
- $p_1$ = Probability of outcome in group 1
- $p_2$ = Probability of outcome in group 2
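As a minimal sketch of this definition, the helpers below convert each group's probability to odds and take the ratio. The names `odds` and `odds_ratio` are illustrative, not taken from any library, and the probabilities 0.4 and 0.1 are chosen to match the smoking table used in the listing that follows (80/200 and 30/300).

```python
def odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_ratio(p1: float, p2: float) -> float:
    """Ratio of the odds in group 1 to the odds in group 2 (requires p2 > 0)."""
    return odds(p1) / odds(p2)


# Illustrative values: outcome probability 0.4 in group 1 and 0.1 in group 2
print(f"odds in group 1 = {odds(0.4):.3f}")
print(f"odds in group 2 = {odds(0.1):.3f}")
print(f"odds ratio      = {odds_ratio(0.4, 0.1):.3f}")
```

Working from probabilities or from raw cell counts gives the same odds ratio, because each group's total cancels out of its own odds.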
import numpy as np
import pandas as pd
from scipy import stats
# 2×2 contingency table
# Smoking vs Heart Disease
data = np.array([[80, 120], # Smokers: disease, no disease
[30, 270]]) # Non-smokers: disease, no disease
smoker_odds = data[0,0] / data[0,1]
nonsmoker_odds = data[1,0] / data[1,1]
OR = smoker_odds / nonsmoker_odds
print("Smoking and Heart Disease:")
print(f" Smokers: {data[0,0]} disease, {data[0,1]} no disease -> odds = {smoker_odds:.3f}")
print(f" Non-smokers: {data[1,0]} disease, {data[1,1]} no disease -> odds = {nonsmoker_odds:.3f}")
print(f" Odds Ratio = {OR:.3f}")
print(f" Smokers have {OR:.1f}× the odds of heart disease vs non-smokers")
# 95% CI for OR (log-method)
log_OR = np.log(OR)
SE_log_OR = np.sqrt(sum(1/x for x in data.flatten()))
CI_lower = np.exp(log_OR - 1.96*SE_log_OR)
CI_upper = np.exp(log_OR + 1.96*SE_log_OR)
print(f" 95% CI: ({CI_lower:.3f}, {CI_upper:.3f})")
# Fisher's exact test
oddsratio_fisher, p_fisher = stats.fisher_exact(data)
print(f" Fisher's exact p-value: {p_fisher:.6f}")
# OR from logistic regression
import statsmodels.api as sm
np.random.seed(42)
n = 500
smoking = np.random.binomial(1, 0.4, n)
heart_disease = np.random.binomial(1, 0.1 + 0.2*smoking)
X = sm.add_constant(smoking)
logit_model = sm.Logit(heart_disease, X).fit(disp=False)
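# X is a NumPy array, so params and conf_int() are arrays indexed by position:
# index 0 is the constant, index 1 is the smoking coefficient
or_logit = np.exp(logit_model.params[1])
ci_logit = np.exp(logit_model.conf_int()[1])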
print(f"\nOR from logistic regression: {or_logit:.3f}")
print(f"95% CI: ({ci_logit[0]:.3f}, {ci_logit[1]:.3f})")
# OR vs Risk Ratio
p1 = data[0,0] / data[0].sum()
p2 = data[1,0] / data[1].sum()
RR = p1 / p2
print(f"\nOR = {OR:.3f}, Risk Ratio (RR) = {RR:.3f}")
print("OR overestimates effect when outcome is common (>10%)")
print("Use RR for cohort studies, OR for case-control studies")
OR vs RR
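The OR always lies further from 1 than the RR, so reading it as a risk ratio overstates the effect when the outcome is common (>10%). Use RR for cohort studies and RCTs, where risks can be estimated directly; use OR for case-control studies, where they cannot.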
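To see how the gap between OR and RR grows with outcome frequency, the sketch below uses the algebraic identity RR = OR / (1 - p0 + p0 × OR), where p0 is the risk in the reference group. The function name `rr_from_or` and the chosen baseline risks are illustrative assumptions; the OR of 6 matches the smoking table above.

```python
def rr_from_or(odds_ratio: float, baseline_risk: float) -> float:
    """Risk ratio implied by an odds ratio at a given reference-group risk."""
    if not 0.0 <= baseline_risk < 1.0:
        raise ValueError("baseline_risk must satisfy 0 <= baseline_risk < 1")
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)


# Hold the OR fixed and vary how common the outcome is in the reference group
for p0 in (0.01, 0.10, 0.30, 0.50):
    print(f"baseline risk {p0:.2f}: OR = 6.00 -> RR = {rr_from_or(6.0, p0):.2f}")
```

With a rare outcome the two measures nearly coincide; as the baseline risk rises, the same OR corresponds to a progressively smaller RR.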
Key Takeaways
Summary: Odds Ratios
- OR = 1: no association; OR > 1: positive association; OR < 1: negative association
- OR ≈ RR only when the outcome is rare (<10%)
- A logistic regression coefficient is the log(OR); exponentiate it to get the OR
- A 95% CI that excludes 1 indicates a statistically significant association at the 5% level
- Case-control studies use OR; cohort/RCT studies can use either RR or OR
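The points above can be pulled together in one small helper. This is a sketch under stated assumptions: `summarize_2x2` is a hypothetical name, it uses only the standard library, and it applies the same log-method (Wald) interval as the earlier listing. For a model with a single binary predictor, the `log_or` value it returns is what the fitted logistic regression coefficient would equal.

```python
import math


def summarize_2x2(a: int, b: int, c: int, d: int, z: float = 1.96) -> dict:
    """OR, log(OR), and a log-method CI for the table [[a, b], [c, d]].

    Rows are groups; columns are outcome present / outcome absent.
    All four cells must be positive for the log method to apply.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for the log method")
    or_ = (a * d) / (b * c)                 # cross-product form of the OR
    log_or = math.log(or_)                  # log-odds scale
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return {
        "or": or_,
        "log_or": log_or,
        "ci": (lower, upper),
        "excludes_one": lower > 1.0 or upper < 1.0,
    }


# Smoking table from the listing above: 80/120 smokers, 30/270 non-smokers
result = summarize_2x2(80, 120, 30, 270)
print(f"OR = {result['or']:.3f}, log(OR) = {result['log_or']:.3f}")
print(f"95% CI: ({result['ci'][0]:.3f}, {result['ci'][1]:.3f})")
print(f"CI excludes 1: {result['excludes_one']}")
```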