Ethics in Statistics
Advanced Statistical Methods
The Responsibility That Comes With Analytical Power
Statistical methods are powerful tools that can be misused, intentionally or accidentally, to mislead. The ASA Ethical Guidelines, algorithmic fairness, data privacy, and professional responsibility form the ethical backbone of the discipline.
- Algorithmic fairness: Auditing models for bias across protected groups to ensure equitable outcomes
- Data privacy: Balancing analytical utility with GDPR/CCPA compliance and informed consent
- Professional integrity: Resisting pressure to selectively report or manipulate results for desired outcomes
Ethical statistics means using your analytical power in ways that serve truth and society, not just clients.
Definition: Statistical Ethics
Statistical ethics encompasses the principles, standards, and practices that guide responsible conduct in the collection, analysis, interpretation, and communication of data. It addresses the moral obligations of statisticians to society, their clients, and the integrity of their discipline.
"Statistics is the grammar of science, and like any language, it can be used to illuminate or to deceive." (Adapted from Karl Pearson)
The ASA Ethical Guidelines
Definition: ASA Ethical Guidelines for Statistical Practice
The American Statistical Association (ASA) adopted its Ethical Guidelines in 1989 (revised 2016) to promote ethical practice across all branches of statistics. The summary below groups the guidelines into six principles:
The Six Principles
1. Professional Integrity and Accountability
   - Strive for honesty, objectivity, and transparency
   - Acknowledge limitations and potential biases in analyses
   - Accept responsibility for professional work
2. Integrity of Data and Methods
   - Use appropriate statistical methods
   - Document data processing and analytic decisions
   - Distinguish between exploratory and confirmatory analysis
3. Responsibilities to Clients, Employers, and Others
   - Protect confidential information
   - Disclose potential conflicts of interest
   - Report results accurately and completely
4. Responsibilities Regarding Allegations of Misconduct
   - Address allegations of misconduct promptly
   - Cooperate with investigations
5. Competence and Judgment
   - Practice only in areas of competence
   - Seek statistical expertise when needed
6. Responsibilities to Other Statisticians
   - Respect colleagues' work
   - Acknowledge contributions appropriately
ASA Code of Ethics
The ASA Ethical Guidelines are not enforceable in the way medical licenses are; they serve as aspirational standards. However, many institutions and journals now require adherence to these guidelines as a condition of publication or employment.
Responsible Use of Statistics
P-Hacking and Data Dredging
Definition: P-Hacking
P-hacking (Simmons et al., 2011) is the practice of selectively reporting, analyzing, or modifying data or analyses until a statistically significant result ($p < 0.05$) is obtained. This inflates the false positive rate far beyond the nominal level.
Common forms of p-hacking include:
| Practice | Effect on False Positive Rate |
|---|---|
| Testing multiple outcomes, reporting only significant | Up to 30% FPR (vs. 5% nominal) |
| Stopping data collection when $p < 0.05$ | Uncontrolled FPR |
| Excluding outliers after seeing results | Inflated effect sizes |
| Trying different model specifications | Multiplicity without correction |
| Reporting one-tailed tests when two-tailed planned | Doubles effective alpha |
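The first row of the table can be motivated with a short calculation. As a sketch, assume $k$ outcomes are tested independently, each at level $\alpha$, and that every null hypothesis is true (the independence assumption is ours; correlated outcomes inflate the rate less). The probability of at least one false positive is then $1 - (1 - \alpha)^k$:

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among k independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 3, 5, 7, 10, 20):
    rate = familywise_error_rate(k)
    print(f"k={k:2d} outcomes -> P(at least one p < 0.05) = {rate:.3f}")
```

Under these assumptions the rate passes 30% at about seven outcomes, which is the order of magnitude quoted in the table.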
The Garden of Forking Paths
Gelman & Loken (2013) formalized the problem: even without deliberate p-hacking, the many researcher degrees of freedom in data analysis (variable transformations, subgroup analyses, model specifications) create a multiplicity problem that is invisible to the standard significance-testing framework, because the analysis actually run is only one of many that could have been run had the data looked different.
Replication and Transparency
The Replication Crisis
The replication crisis, the finding that many published results fail to replicate, has been linked partly to unethical statistical practices: p-hacking, HARKing (Hypothesizing After the Results are Known), and publication bias. The Open Science Collaboration (2015) attempted 100 replications of published psychology studies and found that only 36% produced statistically significant results.
Remedies for P-Hacking:
- Pre-registration: Specify hypotheses, methods, and analysis plans before data collection
- Registered Reports: Journals accept papers based on methodology, before results are known
- Open Data and Code: Share analysis code and (where ethical) data
- Bayesian Methods: Shift from binary significance to continuous evidence measures
- Effect Size Reporting: Report practical significance alongside statistical significance
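As an illustration of the last remedy, a standardized effect size can be reported next to any p-value. The sketch below computes Cohen's d with a pooled standard deviation; the function name and the sample values are illustrative, not taken from a real study.

```python
import math
import statistics


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


# Illustrative values only
treatment = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
control = [4.8, 5.0, 5.1, 4.7, 5.3, 4.9]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Reporting d (or another effect size) lets readers judge practical importance even when a large sample makes a trivial difference statistically significant.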
Algorithmic Fairness
Definition: Algorithmic Fairness
Algorithmic fairness addresses the question: when does a decision-making algorithm treat different groups equitably? There are multiple, often incompatible, fairness criteria, and choosing among them is ultimately an ethical, not purely technical, decision.
Formal Fairness Criteria
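Definition: Demographic Parity (Statistical Parity)
A predictor $\hat{Y}$ satisfies demographic parity if, for all groups $a$ and $b$:
$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)$$
where $A$ denotes the protected attribute (e.g., race, gender). The predictor's decisions are independent of the protected attribute.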
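In practice, demographic parity is checked by comparing selection rates, that is, the share of positive predictions within each group. A minimal sketch, with hypothetical toy predictions and a helper name of our choosing:

```python
from collections import defaultdict


def selection_rates(y_pred, group):
    """Return the share of positive predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, g in zip(y_pred, group):
        counts[g][0] += pred
        counts[g][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


# Hypothetical toy predictions for two groups "a" and "b"
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
group = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

rates = selection_rates(y_pred, group)
dp_difference = max(rates.values()) - min(rates.values())
print(rates, f"demographic parity difference = {dp_difference:.2f}")
```

A difference of zero means exact parity; how large a gap is tolerable is a policy question, not something the code can decide.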
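Definition: Equalized Odds
A predictor $\hat{Y}$ satisfies equalized odds if, for all groups $a$ and $b$ and each $y \in \{0, 1\}$:
$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b)$$
True positive rates and false positive rates are equal across groups. This conditions on the true label $Y$.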
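Equalized odds can be audited by computing the true positive rate and false positive rate separately for each group and looking at the largest gaps. The sketch below uses hypothetical toy labels; the function names are ours.

```python
def error_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def equalized_odds_gaps(y_true, y_pred, group):
    """Largest between-group gap in TPR and in FPR."""
    rates = []
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates.append(error_rates([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    tprs = [r[0] for r in rates]
    fprs = [r[1] for r in rates]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Hypothetical toy data: both groups have the same base rate
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, group)
print(f"TPR gap = {tpr_gap:.2f}, FPR gap = {fpr_gap:.2f}")
```

Both gaps must be zero for equalized odds to hold exactly. The sketch assumes every group contains at least one positive and one negative case; otherwise the rates are undefined.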
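Definition: Calibration
A predictor with score $S$ is calibrated if, for every score value $s$ and every group $a$:
$$P(Y = 1 \mid S = s, A = a) = s$$
Among individuals assigned prediction score $s$, the actual positive rate is $s$, regardless of group membership.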
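Calibration within groups is usually checked by binning the scores and comparing the observed positive rate in each bin with the scores in that bin. The sketch below assumes scores in [0, 1] and uses synthetic data that is calibrated by construction (each outcome is drawn with probability equal to its score); the binning scheme and names are illustrative choices.

```python
import random
from collections import defaultdict


def calibration_table(scores, y_true, group, n_bins=5):
    """Observed positive rate for each (group, score bin) cell."""
    cells = defaultdict(lambda: [0, 0])  # (group, bin) -> [positives, count]
    for s, y, g in zip(scores, y_true, group):
        b = min(int(s * n_bins), n_bins - 1)  # scores assumed to lie in [0, 1]
        cells[(g, b)][0] += y
        cells[(g, b)][1] += 1
    return {key: pos / n for key, (pos, n) in sorted(cells.items())}


# Synthetic data: outcome drawn with probability equal to the score
rng = random.Random(0)
scores, y_true, group = [], [], []
for g in ("a", "b"):
    for _ in range(5000):
        s = rng.random()
        scores.append(s)
        y_true.append(1 if rng.random() < s else 0)
        group.append(g)

table = calibration_table(scores, y_true, group)
for (g, b), rate in table.items():
    midpoint = (b + 0.5) / 5
    print(f"group {g}, bin {b} (midpoint {midpoint:.1f}): observed rate {rate:.3f}")
```

For a calibrated score, the observed rate in each bin should sit close to the bin midpoint in every group; systematic departures in one group signal miscalibration for that group.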
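Theorem: Impossibility Theorem (Chouldechova, 2017; Kleinberg et al., 2016)
When base rates differ across groups ($P(Y = 1 \mid A = a) \neq P(Y = 1 \mid A = b)$) and predictions are not perfect, no predictor can simultaneously satisfy demographic parity, equalized odds, and calibration. At least one criterion must be relaxed. In particular, calibration and equalized odds already conflict with each other under these conditions.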
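The conflict can be seen from a confusion-matrix identity used by Chouldechova (2017): with base rate $p$, positive predictive value PPV, and false negative rate FNR, the false positive rate is forced to equal $\frac{p}{1-p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot (1 - \mathrm{FNR})$. The sketch below plugs in hypothetical numbers, using equal PPV as a stand-in for calibration of a binary prediction:

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """FPR implied by the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups that share PPV and FNR but differ in base rate
ppv, fnr = 0.7, 0.3
for name, p in (("A", 0.3), ("B", 0.5)):
    print(f"group {name}: base rate {p:.1f} -> implied FPR = {implied_fpr(p, ppv, fnr):.3f}")
```

With PPV and FNR held equal, the group with the higher base rate necessarily has the higher false positive rate, so equal error rates and equal predictive values cannot both hold unless base rates match or prediction is perfect.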
Choosing Fairness Criteria
The choice of fairness criterion depends on context:
- Criminal justice: Equalized odds may be preferred (equal error rates across groups)
- Hiring: A form of demographic parity may be legally relevant (e.g., the US four-fifths rule compares selection rates across groups)
- Medical diagnosis: Calibration may be most important (probability scores should be meaningful for all groups)

The impossibility theorem means that fairness cannot be reduced to a single technical constraint; it requires ethical judgment.
Bias in Algorithms
Definition: Sources of Algorithmic Bias
| Source | Description | Example |
|---|---|---|
| Historical bias | Training data reflects past discrimination | Hiring algorithms trained on biased hiring data |
| Representation bias | Underrepresentation of certain groups | Facial recognition trained on predominantly white faces |
| Measurement bias | Features measured differently across groups | Credit scores as proxy for creditworthiness |
| Aggregation bias | Model assumes same relationships across groups | Medical model trained on average demographics |
| Evaluation bias | Benchmark datasets not representative | NLP models tested on non-diverse text |
| Deployment bias | Model used in contexts different from design | Risk assessment tool used beyond its scope |
Fairness-Aware Machine Learning
Definition: Pre-processing, In-processing, and Post-processing
Fairness interventions can be applied at three stages:
- Pre-processing: Transform training data to remove bias (e.g., reweighting, resampling, learning fair representations)
- In-processing: Modify the learning algorithm to incorporate fairness constraints (e.g., constrained optimization, adversarial training)
- Post-processing: Adjust model outputs to satisfy fairness criteria (e.g., threshold adjustment per group)
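As one concrete pre-processing example, the sketch below implements a reweighting scheme in the spirit of Kamiran and Calders' "reweighing": each record gets weight $P(A = a)\,P(Y = y) / P(A = a, Y = y)$, so that group and label are independent in the weighted training data. The data and helper names are hypothetical.

```python
from collections import Counter


def reweighing_weights(group, y):
    """Per-record weight P(A=a) * P(Y=y) / P(A=a, Y=y), estimated from counts."""
    n = len(y)
    n_group = Counter(group)
    n_label = Counter(y)
    n_joint = Counter(zip(group, y))
    return [
        n_group[a] * n_label[label] / (n * n_joint[(a, label)])
        for a, label in zip(group, y)
    ]


# Hypothetical training labels: group "a" is 75% positive, group "b" is 25% positive
group = ["a"] * 4 + ["b"] * 4
y = [1, 1, 1, 0, 1, 0, 0, 0]
w = reweighing_weights(group, y)


def weighted_positive_rate(g):
    """Weighted share of positive labels within group g."""
    num = sum(wi for wi, gi, yi in zip(w, group, y) if gi == g and yi == 1)
    den = sum(wi for wi, gi in zip(w, group) if gi == g)
    return num / den


for g in ("a", "b"):
    print(f"group {g}: weighted positive rate = {weighted_positive_rate(g):.2f}")
```

After reweighting, both groups have the same weighted positive rate, so a learner that honors sample weights no longer sees the historical disparity in the labels. Whether that is the right correction depends on why the disparity exists.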
Fairness-Accuracy Tradeoff
Imposing fairness constraints typically reduces predictive accuracy. The magnitude of the tradeoff depends on the dataset, the fairness criterion, and the degree of underlying disparity. There is no free lunch in algorithmic fairness.
Informed Consent
Definition: Informed Consent in Data Collection
Informed consent requires that research participants understand:
- What data is being collected
- How it will be used (primary and secondary uses)
- Who will have access to the data
- Risks and benefits of participation
- Right to withdraw at any time
- Duration of data retention
Ethical Challenges in Modern Data Science
| Challenge | Description | Mitigation |
|---|---|---|
| Big data | Consent for collection is impractical at scale | Opt-out mechanisms, data minimization |
| Re-identification | Anonymized data can be de-anonymized | Differential privacy, k-anonymity |
| Secondary use | Data collected for one purpose used for another | Purpose limitation, consent renewal |
| Children/minors | Cannot provide informed consent | Parental consent, age-appropriate design |
| Vulnerable populations | Power dynamics may compromise autonomy | IRB review, community engagement |
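The k-anonymity mitigation listed for re-identification can be checked mechanically: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. A minimal sketch with hypothetical records and column names:

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest group size among records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical records; "zip3" and "age_band" are illustrative quasi-identifiers
records = [
    {"zip3": "021", "age_band": "30-39", "diagnosis": "flu"},
    {"zip3": "021", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip3": "021", "age_band": "40-49", "diagnosis": "flu"},
    {"zip3": "100", "age_band": "30-39", "diagnosis": "diabetes"},
    {"zip3": "100", "age_band": "30-39", "diagnosis": "flu"},
]

print(k_anonymity(records, ["zip3", "age_band"]))  # one record is unique on both fields
print(k_anonymity(records, ["zip3"]))              # coarser identifiers give larger groups
```

A result of 1 means at least one person is uniquely identifiable from the quasi-identifiers alone. Coarsening or suppressing fields raises k at the cost of analytical detail.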
Data Privacy
Differential Privacy
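Definition: Differential Privacy
A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for all neighboring datasets $D$ and $D'$ (differing in one record) and all subsets $S \subseteq \mathrm{Range}(\mathcal{M})$:
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$$
Smaller $\varepsilon$ means stronger privacy. When $\delta = 0$, the mechanism is $\varepsilon$-differentially private.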
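The definition can be verified by hand for a simple mechanism. The sketch below uses binary randomized response as an example of our choosing: each person reports their true bit with probability $e^{\varepsilon} / (1 + e^{\varepsilon})$ and the flipped bit otherwise. Changing one person's true bit changes the probability of any reported value by a factor of at most $e^{\varepsilon}$, which is the pure ($\delta = 0$) guarantee.

```python
import math


def randomized_response_probs(epsilon: float):
    """Return (P(report = truth), P(report != truth)) for binary randomized response."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return p_truth, 1.0 - p_truth


def worst_case_ratio(epsilon: float) -> float:
    """Largest ratio of output probabilities when one person's true bit is flipped."""
    p_truth, p_flip = randomized_response_probs(epsilon)
    return p_truth / p_flip


for eps in (0.1, 1.0, 3.0):
    print(f"epsilon={eps}: worst-case ratio={worst_case_ratio(eps):.4f}, "
          f"bound e^epsilon={math.exp(eps):.4f}")
```

The ratio meets the bound exactly, so this mechanism spends its whole privacy budget: a smaller epsilon pushes the truthful-report probability toward one half and makes individual answers less informative.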
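Theorem: Composition Theorem for Differential Privacy
If $\mathcal{M}_1$ is $(\varepsilon_1, \delta_1)$-DP and $\mathcal{M}_2$ is $(\varepsilon_2, \delta_2)$-DP, then their sequential composition is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-DP. Under the advanced composition theorem, the composition is $(\varepsilon', k\delta + \delta')$-DP for any $\delta' > 0$, with
$$\varepsilon' = \varepsilon \sqrt{2k \ln(1/\delta')} + k\varepsilon\,(e^{\varepsilon} - 1)$$
for $k$ independent mechanisms, each $(\varepsilon, \delta)$-DP.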
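To see what the two bounds mean numerically, the sketch below compares basic composition (budgets add up) with the standard advanced composition bound $\varepsilon' = \varepsilon\sqrt{2k\ln(1/\delta')} + k\varepsilon(e^{\varepsilon} - 1)$ at total failure probability $k\delta + \delta'$. The parameter values are illustrative.

```python
import math


def basic_composition(eps: float, delta: float, k: int):
    """Total (epsilon, delta) after k mechanisms under basic composition."""
    return k * eps, k * delta


def advanced_composition(eps: float, delta: float, k: int, delta_prime: float):
    """Total (epsilon, delta) after k mechanisms under advanced composition."""
    eps_total = (
        eps * math.sqrt(2 * k * math.log(1 / delta_prime))
        + k * eps * (math.exp(eps) - 1)
    )
    return eps_total, k * delta + delta_prime


# Illustrative setting: 100 queries, each (0.1, 1e-6)-DP
k, eps, delta = 100, 0.1, 1e-6
basic_eps, basic_delta = basic_composition(eps, delta, k)
adv_eps, adv_delta = advanced_composition(eps, delta, k, delta_prime=1e-5)
print(f"basic:    epsilon={basic_eps:.2f}, delta={basic_delta:.1e}")
print(f"advanced: epsilon={adv_eps:.2f}, delta={adv_delta:.1e}")
```

For many queries with small per-query epsilon, the advanced bound grows roughly with the square root of k instead of linearly. For a handful of queries the basic bound can be the tighter of the two, so it is worth computing both.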
Regulatory Frameworks
| Regulation | Jurisdiction | Key Requirements |
|---|---|---|
| GDPR | EU | Explicit consent, right to erasure, data minimization, privacy by design |
| CCPA/CPRA | California | Right to know, right to delete, opt-out of sale, non-discrimination |
| HIPAA | US (health) | Protected health information, minimum necessary standard |
| FERPA | US (education) | Student record privacy, parental access rights |
| PIPEDA | Canada | Consent, limiting collection, accountability |
Privacy-Utility Tradeoff
Differential privacy provides a mathematically rigorous privacy guarantee, but at a cost: noise must be added to query results. The noise magnitude scales as $\Delta f / \varepsilon$ (the query's sensitivity divided by $\varepsilon$), creating a fundamental tension between privacy and accuracy. The privacy budget $\varepsilon$ must be chosen carefully.
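One way to make the budget choice concrete is to work backwards from an accuracy requirement. For Laplace noise with scale $b$, the tail probability is $P(|\text{noise}| > t) = e^{-t/b}$, so a target "error at most $t$ with probability $1 - \beta$" fixes the smallest usable privacy parameter. The sketch below assumes the Laplace mechanism with scale equal to sensitivity divided by epsilon; the function names are ours.

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of the Laplace noise: sensitivity divided by epsilon."""
    return sensitivity / epsilon


def epsilon_for_accuracy(sensitivity: float, max_error: float, failure_prob: float) -> float:
    """Smallest epsilon with P(|noise| > max_error) <= failure_prob under Laplace noise."""
    return sensitivity * math.log(1.0 / failure_prob) / max_error


# Illustrative target: a count query (sensitivity 1) within +/-10 with 95% probability
eps = epsilon_for_accuracy(sensitivity=1, max_error=10, failure_prob=0.05)
b = laplace_scale(1, eps)
print(f"required epsilon = {eps:.3f}, noise scale = {b:.2f}")
print(f"check: P(|noise| > 10) = {math.exp(-10 / b):.3f}")
```

Halving the tolerated error doubles the required epsilon, which is the privacy-utility tension in its simplest form.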
Professional Responsibility
Definition: Statistical Professional Responsibility
Statisticians have obligations to:
- Society: Report findings honestly; avoid misleading the public
- Clients/Employers: Provide competent, unbiased analysis; disclose limitations
- Colleagues: Acknowledge contributions; maintain research integrity
- Research Subjects: Protect privacy; ensure informed consent
- The Discipline: Uphold the reputation and trustworthiness of statistics
Conflicts of Interest
Financial Conflicts of Interest
Industry-funded studies are significantly more likely to report results favorable to the sponsor (Lexchin et al., 2003). Statisticians must disclose funding sources and maintain analytical independence, even when under pressure to produce favorable results.
Landmark Case Studies
Case 1: The Tuskegee Syphilis Study (1932–1972)
Definition: Tuskegee Syphilis Study
The US Public Health Service conducted a 40-year study of untreated syphilis in 399 African American men in Macon County, Alabama. Participants were never told they had syphilis, were told they were receiving "free health care," and were actively prevented from receiving treatment (including penicillin after it became the standard of care in the 1940s).
Ethical violations: No informed consent, deception, withholding treatment, selection of a vulnerable population.
Impact: Led directly to the National Research Act (1974) and the Belmont Report (1979), establishing the modern framework of respect for persons (including informed consent), beneficence, and justice in human subjects research.
Case 2: The Challenger Disaster (1986)
Engineers at Morton Thiokol warned that O-rings could fail at low temperatures. Management overruled them. Statistical analysis of prior launches showed a clear relationship between temperature and O-ring damage, but this analysis was not presented to decision-makers.
Ethical lesson: Statistical evidence must be communicated clearly and forcefully when lives are at stake. The failure was not in the statistics but in the communication of statistical evidence.
Case 3: Algorithmic Bias in Criminal Justice (COMPAS)
Definition: COMPAS Recidivism Tool
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a widely used risk assessment tool in US criminal justice. A 2016 ProPublica investigation found that the tool was biased against Black defendants:
- Black defendants: 45% false positive rate (predicted high risk, did not reoffend)
- White defendants: 23% false positive rate
- White defendants: 48% false negative rate (predicted low risk, did reoffend)
- Black defendants: 28% false negative rate
The Impossibility in Practice
COMPAS illustrates the impossibility theorem in practice: the tool cannot simultaneously satisfy equalized odds (equal error rates) and calibration (equal predictive values) when base rates differ. Northpointe (the developer) calibrated the tool; ProPublica demanded equalized odds. Both claims were technically correct; the two sides were simply applying different fairness criteria.
Case 4: P-Hacking in Psychological Research
Simmons, Nelson, & Simonsohn (2011) demonstrated that common research practices (optional stopping, selective outcome reporting, including/excluding covariates) allow researchers to "find" statistically significant effects with probability up to 61% when the true effect is zero, far exceeding the nominal 5% Type I error rate.
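Optional stopping, one of the practices named above, is easy to simulate. The sketch below is not the design used by Simmons et al.; it is a simplified illustration that assumes a one-sample z-test with known unit variance, a true mean of zero, and a researcher who tests after every 10 observations and stops at the first $p < 0.05$.

```python
import math
import random


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_sims=2000, n_max=100, peek_every=10, alpha=0.05, seed=0):
    """Return (fixed-n FPR, optional-stopping FPR) when the null is true."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        # Honest analysis: one test at the planned sample size
        if z_test_p_value(data) < alpha:
            fixed_hits += 1
        # Optional stopping: test at every interim look, stop at the first "hit"
        looks = range(peek_every, n_max + 1, peek_every)
        if any(z_test_p_value(data[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_fpr, peeking_fpr = false_positive_rates()
print(f"fixed sample size: FPR = {fixed_fpr:.3f}")
print(f"optional stopping: FPR = {peeking_fpr:.3f}")
```

The fixed-sample rate stays near the nominal 5%, while repeated peeking pushes it well above that, even though every individual test is computed correctly.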
Python Implementation
    import numpy as np
    from scipy import stats

    # --- Simulating Algorithmic Bias ---
    np.random.seed(42)


    def simulate_fairness_audit(n=10000):
        """Audit a classifier for fairness violations across groups."""
        # Generate true labels with different base rates
        group_a = np.random.binomial(1, 0.3, n)  # Base rate 30%
        group_b = np.random.binomial(1, 0.5, n)  # Base rate 50%

        # Simulate predictions (deliberately biased model):
        # equal TPR but different FPR across groups
        def predict(true_labels, fpr, tpr):
            # Predict 1 with probability tpr for true positives, fpr for true negatives
            p_positive = np.where(true_labels == 1, tpr, fpr)
            return (np.random.random(len(true_labels)) < p_positive).astype(int)

        pred_a = predict(group_a, fpr=0.15, tpr=0.85)
        pred_b = predict(group_b, fpr=0.30, tpr=0.85)

        # Compute fairness metrics from the confusion matrix
        def metrics(y_true, y_pred):
            tp = np.sum((y_true == 1) & (y_pred == 1))
            fp = np.sum((y_true == 0) & (y_pred == 1))
            fn = np.sum((y_true == 1) & (y_pred == 0))
            tn = np.sum((y_true == 0) & (y_pred == 0))
            return {'tpr': tp / (tp + fn), 'fpr': fp / (fp + tn),
                    'selection_rate': np.mean(y_pred),
                    'precision': tp / (tp + fp) if (tp + fp) > 0 else 0}

        m_a = metrics(group_a, pred_a)
        m_b = metrics(group_b, pred_b)

        print("=== Fairness Audit ===")
        print(f"{'Metric':<20s} {'Group A':>10s} {'Group B':>10s} {'Ratio':>10s}")
        print("-" * 55)
        for key in ['selection_rate', 'tpr', 'fpr', 'precision']:
            va, vb = m_a[key], m_b[key]
            # Ratio of the smaller to the larger value; undefined if both are zero
            ratio = min(va, vb) / max(va, vb) if max(va, vb) > 0 else float('nan')
            print(f"{key:<20s} {va:>10.3f} {vb:>10.3f} {ratio:>10.3f}")

        # Demographic parity and equalized odds violations
        dp_diff = abs(m_a['selection_rate'] - m_b['selection_rate'])
        print(f"\nDemographic parity difference: {dp_diff:.3f}")
        print(f"Equalized odds (TPR difference): {abs(m_a['tpr'] - m_b['tpr']):.3f}")
        print(f"Equalized odds (FPR difference): {abs(m_a['fpr'] - m_b['fpr']):.3f}")
        return m_a, m_b


    simulate_fairness_audit()


    # --- Differential Privacy Simulation ---
    def laplace_mechanism(true_value, sensitivity, epsilon):
        """Add Laplace noise with scale sensitivity/epsilon (epsilon-DP)."""
        noise = np.random.laplace(0, sensitivity / epsilon)
        return true_value + noise


    def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
        """Add Gaussian noise for (epsilon, delta)-DP (classical calibration, epsilon < 1)."""
        sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        return true_value + np.random.normal(0, sigma)


    print("\n=== Differential Privacy Demonstration ===")
    true_count = 5000
    sensitivity = 1  # Adding/removing one person changes a count by at most 1
    for eps in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
        # Release the count itself: sensitivity 1 applies to counts, not to proportions
        estimates = np.array([laplace_mechanism(true_count, sensitivity, eps)
                              for _ in range(1000)])
        bias = estimates.mean() - true_count
        rmse = np.sqrt(np.mean((estimates - true_count) ** 2))
        print(f"  ε={eps:5.1f}: Mean={estimates.mean():.2f}, "
              f"Bias={bias:+.3f}, RMSE={rmse:.3f}")

    # --- P-Hacking Simulation ---
    print("\n=== P-Hacking Simulation ===")


    def simulate_phacking(n_experiments=10000, n_samples=50, true_effect=0):
        """Compare rejection rates of one planned test vs. the best of several analyses.

        With true_effect=0 the rejection rate is the false positive rate;
        with a nonzero effect it is statistical power instead.
        """
        planned_rejections = 0
        hacked_rejections = 0
        for _ in range(n_experiments):
            x = np.random.normal(0, 1, n_samples)
            y = true_effect + np.random.normal(0, 1, n_samples)
            # Four analyses: planned test, log-transformed data,
            # one observation dropped, and a switch to a one-tailed test
            results = [stats.ttest_ind(x, y),
                       stats.ttest_ind(np.log(np.abs(x) + 1), np.log(np.abs(y) + 1)),
                       stats.ttest_ind(x[1:], y[1:]),
                       stats.ttest_ind(x, y, alternative='less')]
            if results[0].pvalue < 0.05:
                planned_rejections += 1
            # P-hacked analysis: report whichever test gives the smallest p-value
            if min(r.pvalue for r in results) < 0.05:
                hacked_rejections += 1
        print(f"  True effect = {true_effect}")
        print(f"  Planned-test rejection rate: {planned_rejections / n_experiments:.3f} "
              f"(nominal alpha: 0.050)")
        print(f"  P-hacked rejection rate:     {hacked_rejections / n_experiments:.3f}")


    simulate_phacking(true_effect=0)    # rejection rate = false positive rate
    simulate_phacking(true_effect=0.3)  # rejection rate = power

    # --- Bayesian Fairness Assessment ---
    print("\n=== Bayesian Perspective on Fairness ===")


    def bayesian_fairness_prior(n_a, pos_a, n_b, pos_b, prior=1):
        """Compare two groups' positive rates under independent Beta(prior, prior) priors."""
        # Beta-Binomial model: posterior is Beta(prior + successes, prior + failures)
        post_a = (prior + pos_a, prior + n_a - pos_a)
        post_b = (prior + pos_b, prior + n_b - pos_b)
        # Monte Carlo comparison of the two posteriors
        samples_a = np.random.beta(post_a[0], post_a[1], 100000)
        samples_b = np.random.beta(post_b[0], post_b[1], 100000)
        p_a_greater = np.mean(samples_a > samples_b)
        diff = np.mean(samples_a - samples_b)
        ci = np.percentile(samples_a - samples_b, [2.5, 97.5])
        print(f"  Group A: {pos_a}/{n_a} = {pos_a/n_a:.3f}")
        print(f"  Group B: {pos_b}/{n_b} = {pos_b/n_b:.3f}")
        print(f"  P(A > B): {p_a_greater:.3f}")
        print(f"  Mean difference: {diff:+.4f}")
        print(f"  95% credible interval for difference: ({ci[0]:.4f}, {ci[1]:.4f})")


    bayesian_fairness_prior(1000, 300, 1000, 350)  # 30% vs 35%
    bayesian_fairness_prior(100, 30, 100, 50)      # 30% vs 50%
Key Takeaways
Summary: Ethics in Statistics
- The ASA Ethical Guidelines establish six principles: professional integrity, data/method integrity, client responsibility, misconduct accountability, competence, and collegial respect.
- P-hacking can inflate false positive rates to 30% or more (up to 61% in Simmons et al., 2011); remedies include pre-registration, registered reports, and open data.
- Algorithmic fairness has multiple incompatible criteria (demographic parity, equalized odds, calibration); the impossibility theorem proves no imperfect predictor can satisfy all three simultaneously when base rates differ across groups.
- Informed consent must address what data is collected, how it is used, who accesses it, and the right to withdraw, all of which are increasingly challenging in big data contexts.
- Differential privacy provides mathematically rigorous privacy guarantees at the cost of accuracy, with the privacy budget governing the tradeoff.
- Landmark cases (Tuskegee, COMPAS, Challenger) demonstrate that ethical failures in statistics have real-world consequences for individuals and public trust.
- Professional responsibility requires disclosure of conflicts, honest communication of uncertainty, and resistance to pressure that compromises analytical integrity.