Ethics in Statistics

Advanced Statistical Methods

The Responsibility That Comes With Analytical Power

Statistical methods are powerful tools that can be misused — intentionally or accidentally — to mislead. The ASA Ethical Guidelines, algorithmic fairness, data privacy, and professional responsibility form the ethical backbone of the discipline.

Algorithmic fairness — Auditing models for bias across protected groups to ensure equitable outcomes
Data privacy — Balancing analytical utility with GDPR/CCPA compliance and informed consent
Professional integrity — Resisting pressure to selectively report or manipulate results for desired outcomes

Ethical statistics means using your analytical power in ways that serve truth and society, not just clients.

Definition: Statistical Ethics

Statistical ethics encompasses the principles, standards, and practices that guide responsible conduct in the collection, analysis, interpretation, and communication of data. It addresses the moral obligations of statisticians to society, their clients, and the integrity of their discipline.

"Statistics is the grammar of science — and like any language, it can be used to illuminate or to deceive." — Adapted from Karl Pearson

The ASA Ethical Guidelines

Definition: ASA Ethical Guidelines for Statistical Practice

The American Statistical Association (ASA) adopted its Ethical Guidelines in 1989 and has revised them since, including in 2016, 2018, and 2022, to promote ethical practice across all branches of statistics. This lesson groups the guidelines under six foundational principles:

The Six Principles

Professional Integrity and Accountability
- Strive for honesty, objectivity, and transparency
- Acknowledge limitations and potential biases in analyses
- Accept responsibility for professional work
Integrity of Data and Methods
- Use appropriate statistical methods
- Document data processing and analytic decisions
- Distinguish between exploratory and confirmatory analysis
Responsibilities to Clients, Employers, and Others
- Protect confidential information
- Disclose potential conflicts of interest
- Report results accurately and completely
Responsibilities Regarding Allegations of Misconduct
- Address allegations of misconduct promptly
- Cooperate with investigations
Competence and Judgment
- Practice only in areas of competence
- Seek statistical expertise when needed
Responsibilities to Other Statisticians
- Respect colleagues' work
- Acknowledge contributions appropriately

ASA Code of Ethics

The ASA Ethical Guidelines are not enforceable in the way a medical license is; they serve as aspirational standards. Even so, some institutions and journals expect adherence to them as a condition of publication or employment.

Responsible Use of Statistics

P-Hacking and Data Dredging

Definition: P-Hacking

P-hacking (Simmons et al., 2011) is the practice of selectively reporting, analyzing, or modifying data or analyses until a statistically significant result ($p < 0.05$) is obtained. This inflates the false positive rate far beyond the nominal level.

Common forms of p-hacking include:

Practice	Consequence
Testing multiple outcomes, reporting only the significant ones	Up to 30% FPR (vs. 5% nominal)
Stopping data collection once $p < 0.05$	FPR no longer controlled at the nominal level
Excluding outliers after seeing results	Inflated effect sizes
Trying different model specifications	Multiplicity without correction
Switching to a one-tailed test after a two-tailed test was planned	Doubles the effective alpha
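
The 30% figure in the first row of the table follows from simple arithmetic if the outcomes are assumed to be independent (an assumption made here for illustration; correlated outcomes inflate the rate less). A minimal sketch, with `familywise_fpr` as a hypothetical helper name:

```python
# Familywise false positive rate when k independent true-null outcomes
# are each tested at level alpha and any significant one is reported.
def familywise_fpr(k, alpha=0.05):
    """P(at least one p < alpha among k independent null tests)."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 5, 7, 10):
    print(f"k={k:2d} outcomes -> FPR = {familywise_fpr(k):.3f}")
```

Under this assumption, seven outcomes are already enough to push the chance of at least one spurious "finding" to roughly 30%.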

The Garden of Forking Paths

Gelman & Loken (2013) described the problem: even without deliberate p-hacking, the many researcher degrees of freedom in data analysis (variable transformations, subgroup analyses, model specifications) create a multiplicity problem that is invisible to the standard $p < 0.05$ framework, because the analysis actually run depends on how the data turned out.

Replication and Transparency

The Replication Crisis

The replication crisis, the finding that many published results fail to replicate, has been linked partly to questionable statistical practices: p-hacking, HARKing (Hypothesizing After the Results are Known), and publication bias. The Open Science Collaboration (2015) repeated 100 published psychology studies and found that only 36% of the replications yielded statistically significant results.

Remedies for P-Hacking:

Pre-registration: Specify hypotheses, methods, and analysis plans before data collection
Registered Reports: Journals accept papers based on methodology, before results are known
Open Data and Code: Share analysis code and (where ethical) data
Bayesian Methods: Shift from binary significance to continuous evidence measures
Effect Size Reporting: Report practical significance alongside statistical significance
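
As an illustration of the last remedy, the sketch below reports Cohen's d next to the p-value. The data are simulated and `cohens_d` is a hypothetical helper, not something from the source; the point is only that, with a very large sample, a practically negligible difference can still be statistically significant.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
# Simulated data: a tiny true difference (0.05 SD) with a very large sample
x = rng.normal(0.05, 1.0, 50_000)
y = rng.normal(0.00, 1.0, 50_000)

t_stat, p_value = stats.ttest_ind(x, y)
d = cohens_d(x, y)
print(f"p-value   = {p_value:.2e}")
print(f"Cohen's d = {d:.3f}")
```

Reporting both numbers lets a reader judge whether a "significant" result is large enough to matter.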

Algorithmic Fairness

Definition: Algorithmic Fairness

Algorithmic fairness addresses the question: when does a decision-making algorithm treat different groups equitably? There are multiple, often incompatible, fairness criteria, and choosing among them is ultimately an ethical, not purely technical, decision.

Formal Fairness Criteria

Definition: Demographic Parity (Statistical Parity)

A predictor $\hat{Y}$ satisfies demographic parity if:

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b) \quad \forall a, b$$

where $A$ denotes the protected attribute (e.g., race, gender). The predictor's decisions are independent of the protected attribute.
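
A minimal sketch of how this criterion might be checked on a set of binary predictions. The function names and the toy data are illustrative assumptions, not part of any particular fairness library.

```python
import numpy as np

def selection_rates(y_pred, group):
    """Return the estimate of P(Y_hat = 1 | A = a) for each group value a."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return {a: float(y_pred[group == a].mean())
            for a in np.unique(group).tolist()}

def demographic_parity_gap(y_pred, group):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(y_pred, group)
    return max(rates.values()) - min(rates.values())

# Toy predictions for two groups (illustrative values only)
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(selection_rates(y_pred, group))         # {'a': 0.75, 'b': 0.25}
print(demographic_parity_gap(y_pred, group))  # 0.5
```

A gap of zero corresponds to exact demographic parity; in practice an audit would set a tolerance rather than demand equality.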

Definition: Equalized Odds

A predictor $\hat{Y}$ satisfies equalized odds if:

$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b) \quad \forall y, a, b$$

That is, true positive rates ($y = 1$) and false positive rates ($y = 0$) are equal across groups. Unlike demographic parity, this criterion conditions on the true label $Y$.
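
The two conditions can be checked separately by computing the true positive rate and false positive rate within each group. The sketch below uses a hypothetical helper and toy labels, and assumes every group contains both positive and negative cases.

```python
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """Return {group: (TPR, FPR)} so both equalized-odds conditions can be compared."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for a in np.unique(group).tolist():
        in_group = group == a
        pos = in_group & (y_true == 1)
        neg = in_group & (y_true == 0)
        tpr = float(y_pred[pos].mean())  # P(Y_hat = 1 | Y = 1, A = a)
        fpr = float(y_pred[neg].mean())  # P(Y_hat = 1 | Y = 0, A = a)
        out[a] = (tpr, fpr)
    return out

# Toy data (illustrative values only)
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
group = ["a"] * 4 + ["b"] * 4
rates = group_error_rates(y_true, y_pred, group)
print(rates)  # {'a': (0.5, 0.0), 'b': (0.5, 0.5)}
```

In this toy example the true positive rates match but the false positive rates do not, so equalized odds is violated even though the weaker "equal opportunity" condition (equal TPR only) holds.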

Definition: Calibration

A predictor with score $\hat{P}$ is calibrated within groups if:

$$P(Y = 1 \mid \hat{P} = p, A = a) = p \quad \forall p, a$$

Among individuals assigned prediction score $p$, the actual positive rate is $p$, regardless of group membership.
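
With finite data, calibration is usually checked by binning the scores and comparing the mean score with the observed positive rate in each bin, separately by group. The sketch below is one way to do this; the binning scheme, the function name, and the simulated data (calibrated by construction) are all assumptions for illustration.

```python
import numpy as np

def calibration_by_group(scores, y_true, group, n_bins=5):
    """Per group and score bin: (mean score, observed positive rate, count)."""
    scores, y_true, group = map(np.asarray, (scores, y_true, group))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1
    bins = np.digitize(scores, edges[1:-1])
    table = {}
    for a in np.unique(group).tolist():
        rows = []
        for b in range(n_bins):
            m = (group == a) & (bins == b)
            if m.any():
                rows.append((float(scores[m].mean()),
                             float(y_true[m].mean()),
                             int(m.sum())))
        table[a] = rows
    return table

# Simulated scores that are calibrated by construction: Y ~ Bernoulli(score)
rng = np.random.default_rng(1)
scores = rng.uniform(0, 1, 20_000)
y_true = rng.binomial(1, scores)
group = rng.choice(["a", "b"], size=scores.size)

table = calibration_by_group(scores, y_true, group)
for a, rows in table.items():
    for mean_score, pos_rate, n in rows:
        print(f"group {a}: mean score {mean_score:.2f}, "
              f"observed rate {pos_rate:.2f}, n={n}")
```

For a calibrated predictor the two columns should agree within sampling error in every bin and for every group.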

Theorem: Impossibility Theorem (Chouldechova, 2017; Kleinberg et al., 2016)

When base rates differ across groups ($P(Y=1 \mid A=a) \neq P(Y=1 \mid A=b)$) and predictions are not perfect, no predictor can simultaneously satisfy demographic parity, equalized odds, and calibration. The cited results show that calibration and equalized odds already conflict with each other under these conditions, so at least one criterion must be relaxed.
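
One way to see the conflict is the identity noted by Chouldechova (2017), which ties the false positive rate to the base rate $p$, the positive predictive value (PPV), and the false negative rate (FNR): $\text{FPR} = \frac{p}{1-p} \cdot \frac{1-\text{PPV}}{\text{PPV}} \cdot (1 - \text{FNR})$. The sketch below holds PPV and FNR equal across two groups and shows that different base rates then force different false positive rates. The numbers are illustrative and do not describe any real tool.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by base rate, PPV and FNR via
    FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and FNR in both groups; base rates chosen for illustration only
ppv, fnr = 0.7, 0.3
fpr_a = implied_fpr(0.3, ppv, fnr)
fpr_b = implied_fpr(0.5, ppv, fnr)
print(f"group A (base rate 0.3): FPR = {fpr_a:.3f}")  # 0.129
print(f"group B (base rate 0.5): FPR = {fpr_b:.3f}")  # 0.300
```

Equal predictive values and equal false negative rates therefore cannot coexist with equal false positive rates unless the base rates happen to match.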

Choosing Fairness Criteria

The choice of fairness criterion depends on context:

Criminal justice: Equalized odds may be preferred (equal error rates across groups)
Hiring: Demographic parity is often the relevant screen, for example the four-fifths rule used to flag adverse impact in US employment guidance
Medical diagnosis: Calibration may be most important (probability scores should be meaningful for all groups)

The impossibility theorem means that fairness cannot be reduced to a single technical constraint; it requires ethical judgment.

Bias in Algorithms

Definition: Sources of Algorithmic Bias

Source	Description	Example
Historical bias	Training data reflects past discrimination	Hiring algorithms trained on biased hiring data
Representation bias	Underrepresentation of certain groups	Facial recognition trained on predominantly white faces
Measurement bias	Features measured differently across groups	Credit scores as proxy for creditworthiness
Aggregation bias	Model assumes same relationships across groups	Medical model trained on average demographics
Evaluation bias	Benchmark datasets not representative	NLP models tested on non-diverse text
Deployment bias	Model used in contexts different from design	Risk assessment tool used beyond its scope

Fairness-Aware Machine Learning

Definition: Pre-processing, In-processing, and Post-processing

Fairness interventions can be applied at three stages:

Pre-processing: Transform training data to remove bias (e.g., reweighting, resampling, learning fair representations)
In-processing: Modify the learning algorithm to incorporate fairness constraints (e.g., constrained optimization, adversarial training)
Post-processing: Adjust model outputs to satisfy fairness criteria (e.g., threshold adjustment per group)
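
As a sketch of the post-processing stage, the example below picks a separate score cutoff for each group so that both groups end up with the same selection rate. The score distributions, the target rate, and the helper name are assumptions for illustration; whether group-specific thresholds are appropriate or lawful depends on the application.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
# Simulated scores: group "b" receives systematically lower scores (assumption)
scores = {"a": rng.beta(5, 3, n), "b": rng.beta(3, 5, n)}

def threshold_for_rate(s, target_rate):
    """Score cutoff that selects approximately `target_rate` of the group."""
    return float(np.quantile(s, 1 - target_rate))

target = 0.30
# Selection rates under one shared threshold of 0.5
single = {g: float((s >= 0.5).mean()) for g, s in scores.items()}
# Group-specific cutoffs chosen to equalize selection rates
cutoffs = {g: threshold_for_rate(s, target) for g, s in scores.items()}
adjusted = {g: float((scores[g] >= cutoffs[g]).mean()) for g in scores}

print("selection rate, single threshold 0.5:", single)
print("group-specific cutoffs:", cutoffs)
print("selection rate, adjusted:", adjusted)
```

The adjustment targets demographic parity only; error rates and calibration within each group would need to be re-checked afterwards.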

Fairness-Accuracy Tradeoff

Imposing fairness constraints typically reduces predictive accuracy. The magnitude of the tradeoff depends on the dataset, the fairness criterion, and the degree of underlying disparity. There is no free lunch in algorithmic fairness.

Informed Consent

Definition: Informed Consent in Data Collection

Informed consent requires that research participants understand:

What data is being collected
How it will be used (primary and secondary uses)
Who will have access to the data
Risks and benefits of participation
Right to withdraw at any time
Duration of data retention

Ethical Challenges in Modern Data Science

Challenge	Description	Mitigation
Big data	Consent for collection is impractical at scale	Opt-out mechanisms, data minimization
Re-identification	Anonymized data can be de-anonymized	Differential privacy, k-anonymity
Secondary use	Data collected for one purpose used for another	Purpose limitation, consent renewal
Children/minors	Cannot provide informed consent	Parental consent, age-appropriate design
Vulnerable populations	Power dynamics may compromise autonomy	IRB review, community engagement
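
To make the k-anonymity mitigation in the re-identification row concrete: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The sketch below computes k for a small set of hypothetical, already-generalized records; the field names and values are invented for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest number of records sharing the same quasi-identifier values."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

# Hypothetical records with generalized age band and ZIP prefix
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
]
k = k_anonymity(records, ["age", "zip"])
print(k)  # 1: the last record is unique on (age, zip)
```

A result of 1 means at least one person is uniquely identifiable from the quasi-identifiers alone, so further generalization or suppression would be needed.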

Data Privacy

Differential Privacy

Definition: Differential Privacy

A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for all neighboring datasets $D$ and $D'$ (differing in one record) and all sets of outputs $S$:

$$P(\mathcal{M}(D) \in S) \leq e^{\varepsilon} \cdot P(\mathcal{M}(D') \in S) + \delta$$

Smaller $\varepsilon$ means stronger privacy. When $\delta = 0$, the mechanism is $\varepsilon$-differentially private.
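
A small worked check of the definition, using binary randomized response as the mechanism (an example chosen here, not one from the text): each respondent reports their true bit with probability 0.75 and the opposite bit otherwise. Two "datasets" that differ in that one answer give output probabilities whose ratio is at most 3, so the mechanism is $\ln 3$-differentially private with $\delta = 0$.

```python
import math

def randomized_response_probs(p_truth):
    """Output distribution of binary randomized response for each true answer.

    The respondent reports the true bit with probability p_truth and the
    opposite bit otherwise.
    """
    return {
        1: {1: p_truth, 0: 1 - p_truth},   # true answer is 1
        0: {1: 1 - p_truth, 0: p_truth},   # true answer is 0
    }

def epsilon_of(probs):
    """Smallest epsilon with P(out | x) <= e^eps * P(out | x') for all outputs."""
    ratios = [probs[x][o] / probs[1 - x][o] for x in (0, 1) for o in (0, 1)]
    return math.log(max(ratios))

eps = epsilon_of(randomized_response_probs(0.75))
print(f"epsilon = {eps:.4f}  (ln 3 = {math.log(3):.4f})")
```

Moving `p_truth` toward 0.5 shrinks $\varepsilon$ (more privacy) while making each individual answer less informative.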

Theorem: Composition Theorem for Differential Privacy

If $\mathcal{M}_1$ is $\varepsilon_1$-DP and $\mathcal{M}_2$ is $\varepsilon_2$-DP, then releasing the outputs of both (sequential composition) is $(\varepsilon_1 + \varepsilon_2)$-DP. Under the advanced composition theorem, the composition of $k$ mechanisms, each $(\varepsilon, \delta)$-DP, satisfies, for any $\delta' > 0$:

$$\left(\varepsilon \sqrt{2k \ln(1/\delta')} + k \varepsilon(e^\varepsilon - 1), \; k\delta + \delta'\right)\text{-DP}$$

For $k$ pure $\varepsilon$-DP mechanisms ($\delta = 0$), the failure probability is just $\delta'$.
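
The sketch below compares the total $\varepsilon$ from basic composition ($k\varepsilon$) with the advanced composition bound in its standard form, where a small slack probability (written `delta_prime` here) is accepted in exchange for a tighter $\varepsilon$. The per-query budget and slack value are arbitrary choices for illustration.

```python
import math

def basic_composition(eps, k):
    """Total epsilon for k sequential eps-DP mechanisms under basic composition."""
    return k * eps

def advanced_composition(eps, k, delta_prime):
    """Total epsilon under advanced composition, at the cost of an extra
    failure probability delta_prime."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta_prime))
            + k * eps * (math.exp(eps) - 1))

eps, delta_prime = 0.1, 1e-5
for k in (10, 100, 1000):
    print(f"k={k:4d}: basic = {basic_composition(eps, k):7.2f}, "
          f"advanced = {advanced_composition(eps, k, delta_prime):7.2f}")
```

With these values the advanced bound is looser than basic composition for $k = 10$ and only becomes the better guarantee as the number of queries grows, which is why the choice of accounting method depends on the workload.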

Regulatory Frameworks

Regulation	Jurisdiction	Key Requirements
GDPR	EU	Explicit consent, right to erasure, data minimization, privacy by design
CCPA/CPRA	California	Right to know, right to delete, opt-out of sale, non-discrimination
HIPAA	US (health)	Protected health information, minimum necessary standard
FERPA	US (education)	Student record privacy, parental access rights
PIPEDA	Canada	Consent, limiting collection, accountability

Privacy-Utility Tradeoff

Differential privacy provides a mathematically rigorous privacy guarantee, but at a cost: noise must be added to query results. For the Laplace mechanism the noise scale is $\Delta f / \varepsilon$, where $\Delta f$ is the query's sensitivity, so the noise grows as $O(1/\varepsilon)$ and privacy trades off directly against accuracy. The privacy budget $\varepsilon$ must therefore be chosen with the required accuracy in mind.
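
To give a sense of scale, the sketch below tabulates the standard deviation of Laplace noise for a counting query (sensitivity 1) at several privacy budgets. The dataset size `n` is a hypothetical value used only to express the noise as a percentage.

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace noise for epsilon-DP: b = sensitivity / epsilon."""
    return sensitivity / epsilon

n = 10_000        # hypothetical dataset size
sensitivity = 1   # a counting query changes by at most 1 per record
for epsilon in (0.01, 0.1, 1.0, 10.0):
    b = laplace_scale(sensitivity, epsilon)
    sd = math.sqrt(2) * b   # standard deviation of Laplace(0, b)
    print(f"epsilon={epsilon:5.2f}: noise sd = {sd:8.2f} "
          f"({100 * sd / n:.3f}% of n={n})")
```

Halving $\varepsilon$ doubles the noise, so very small budgets are affordable only when the query is over a large population or high accuracy is not needed.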

Professional Responsibility

Definition: Statistical Professional Responsibility

Statisticians have obligations to:

Society: Report findings honestly; avoid misleading the public
Clients/Employers: Provide competent, unbiased analysis; disclose limitations
Colleagues: Acknowledge contributions; maintain research integrity
Research Subjects: Protect privacy; ensure informed consent
The Discipline: Uphold the reputation and trustworthiness of statistics

Conflicts of Interest

Financial Conflicts of Interest

Industry-funded studies are significantly more likely to report results favorable to the sponsor (Lexchin et al., 2003). Statisticians must disclose funding sources and maintain analytical independence, even when under pressure to produce favorable results.

Landmark Case Studies

Case 1: The Tuskegee Syphilis Study (1932–1972)

Definition: Tuskegee Syphilis Study

The US Public Health Service conducted a 40-year study of untreated syphilis in 399 African American men in Macon County, Alabama. Participants were never told they had syphilis; they were told they were being treated for "bad blood" and offered free medical care, and they were actively prevented from receiving treatment (including penicillin after it became the standard of care in the 1940s).

Ethical violations: No informed consent, deception, withholding treatment, selection of a vulnerable population.

Impact: Public disclosure of the study led to the National Research Act (1974) and the Belmont Report (1979), which set out the modern framework for human subjects research: respect for persons (including informed consent), beneficence, and justice.

Case 2: The Challenger Disaster (1986)

Engineers at Morton Thiokol warned that O-rings could fail at low temperatures. Management overruled them. The record of prior launches, once flights with no O-ring damage are included, shows a clear relationship between temperature and O-ring damage, but the pre-launch discussion focused on the flights that had damage, so this relationship was not presented to decision-makers.

Ethical lesson: Statistical evidence must be communicated clearly and forcefully when lives are at stake. The failure was not in the statistics but in the communication of statistical evidence.

Case 3: Algorithmic Bias in Criminal Justice (COMPAS)

Definition: COMPAS Recidivism Tool

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a widely used risk assessment tool in US criminal justice. A 2016 ProPublica investigation found that the tool was biased against Black defendants:

Black defendants: 45% false positive rate (predicted high risk, did not reoffend)
White defendants: 23% false positive rate
White defendants: 48% false negative rate (predicted low risk, did reoffend)
Black defendants: 28% false negative rate

The Impossibility in Practice

COMPAS illustrates the impossibility theorem in practice: the tool cannot simultaneously satisfy equalized odds (equal error rates) and calibration (equal predictive values) when base rates differ. Northpointe (the developer) defended the tool as calibrated across groups; ProPublica measured it against equalized odds. Both claims were technically correct; they applied different fairness criteria.

Case 4: P-Hacking in Psychological Research

Simmons, Nelson, & Simonsohn (2011) demonstrated that common research practices (optional stopping, selective outcome reporting, including/excluding covariates) allow researchers to "find" statistically significant effects with probability up to 61% when the true effect is zero — far exceeding the nominal 5% Type I error rate.
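
Optional stopping, one of the practices listed above, is easy to simulate. The sketch below is not a reproduction of the Simmons et al. analysis; the batch sizes, number of simulations, and function name are assumptions. Both groups are drawn from the same distribution, and the analyst either tests once at the final sample size or peeks after every batch and stops at the first significant result.

```python
import numpy as np
from scipy import stats

def optional_stopping_fpr(n_sims=2000, start=20, step=10, max_n=50,
                          alpha=0.05, seed=0):
    """Share of null experiments declared significant when the analyst
    tests after every batch and stops at the first p < alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0, 1, max_n)   # both groups come from the same
        y = rng.normal(0, 1, max_n)   # distribution: the true effect is zero
        for n in range(start, max_n + 1, step):
            if stats.ttest_ind(x[:n], y[:n]).pvalue < alpha:
                hits += 1
                break
    return hits / n_sims

fixed = optional_stopping_fpr(start=50)     # one look, at n = 50
peeking = optional_stopping_fpr(start=20)   # looks at n = 20, 30, 40, 50
print(f"fixed-n FPR: {fixed:.3f}")
print(f"peeking FPR: {peeking:.3f}")
```

Because both runs use the same seed, the peeking analyst sees the same data as the fixed-n analyst plus extra chances to stop early, so the peeking rate can only be higher.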

Python Implementation

import numpy as np

# --- Simulating Algorithmic Bias ---
np.random.seed(42)

def simulate_fairness_audit(n=10000):
    """Audit a classifier for fairness violations across groups."""
    # Generate data with different base rates
    group_a = np.random.binomial(1, 0.3, n)  # Base rate 30%
    group_b = np.random.binomial(1, 0.5, n)  # Base rate 50%

    # Simulate predictions (deliberately biased model)
    # Model has equal TPR but different FPR across groups
    def predict(true_labels, fpr, tpr):
        # Each case is predicted positive with probability TPR (if truly
        # positive) or FPR (if truly negative); vectorized, no Python loop
        p_positive = np.where(true_labels == 1, tpr, fpr)
        return (np.random.random(len(true_labels)) < p_positive).astype(int)

    pred_a = predict(group_a, fpr=0.15, tpr=0.85)
    pred_b = predict(group_b, fpr=0.30, tpr=0.85)

    # Compute fairness metrics
    def metrics(y_true, y_pred):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        return {'tpr': tp/(tp+fn), 'fpr': fp/(fp+tn),
                'selection_rate': np.mean(y_pred),
                'precision': tp/(tp+fp) if (tp+fp) > 0 else 0}

    m_a = metrics(group_a, pred_a)
    m_b = metrics(group_b, pred_b)

    print("=== Fairness Audit ===")
    print(f"{'Metric':<20s} {'Group A':>10s} {'Group B':>10s} {'Ratio':>10s}")
    print("-" * 55)
    for key in ['selection_rate', 'tpr', 'fpr', 'precision']:
        va, vb = m_a[key], m_b[key]
        ratio = min(va, vb) / max(va, vb) if max(va, vb) > 0 else float('inf')
        print(f"{key:<20s} {va:>10.3f} {vb:>10.3f} {ratio:>10.3f}")

    # Demographic parity violation
    dp_diff = abs(m_a['selection_rate'] - m_b['selection_rate'])
    print(f"\nDemographic parity difference: {dp_diff:.3f}")
    print(f"Equalized odds (TPR difference): {abs(m_a['tpr'] - m_b['tpr']):.3f}")
    print(f"Equalized odds (FPR difference): {abs(m_a['fpr'] - m_b['fpr']):.3f}")

    return m_a, m_b

simulate_fairness_audit()

# --- Differential Privacy Simulation ---
def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace noise for differential privacy."""
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_value + noise

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    """Add Gaussian noise for (epsilon, delta)-differential privacy.

    Uses the classical calibration, which is only guaranteed for 0 < epsilon < 1.
    """
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + np.random.normal(0, sigma)

print("\n=== Differential Privacy Demonstration ===")
n_records = 5000
true_proportion = 0.42
# Changing one record moves a proportion over n_records by at most 1/n_records
sensitivity = 1 / n_records

for eps in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    estimates = np.array([laplace_mechanism(true_proportion, sensitivity, eps)
                          for _ in range(1000)])
    bias = estimates.mean() - true_proportion
    rmse = np.sqrt(np.mean((estimates - true_proportion) ** 2))
    print(f"  ε={eps:5.1f}: Mean={estimates.mean():.5f}, "
          f"Bias={bias:+.2e}, RMSE={rmse:.2e}")

# --- P-Hacking Simulation ---
print("\n=== P-Hacking Simulation ===")
from scipy import stats

def simulate_phacking(n_experiments=10000, n_samples=50, true_effect=0):
    """Compare rejection rates of one pre-specified test and a p-hacked analysis.

    With true_effect=0 every rejection is a false positive; with a nonzero
    effect the rates measure power, not the false positive rate.
    """
    def drop_most_extreme(v):
        # "Outlier removal": discard the observation farthest from the sample mean
        return np.delete(v, np.argmax(np.abs(v - v.mean())))

    standard_hits = 0
    hacked_hits = 0
    for _ in range(n_experiments):
        x = np.random.normal(0, 1, n_samples)
        y = true_effect + np.random.normal(0, 1, n_samples)

        # Standard analysis: one pre-specified two-sided t-test
        p_standard = stats.ttest_ind(x, y).pvalue
        standard_hits += int(p_standard < 0.05)

        # P-hacked analysis: try four analyses and report the smallest p-value
        pvals = [p_standard,
                 stats.ttest_ind(np.log(np.abs(x) + 1), np.log(np.abs(y) + 1)).pvalue,
                 stats.ttest_ind(drop_most_extreme(x), drop_most_extreme(y)).pvalue,
                 stats.ttest_ind(x, y, alternative='less').pvalue]
        hacked_hits += int(min(pvals) < 0.05)

    label = "FPR" if true_effect == 0 else "rejection rate (power)"
    print(f"  True effect = {true_effect}")
    print(f"  Standard {label}: {standard_hits/n_experiments:.3f} (nominal alpha: 0.050)")
    print(f"  P-hacked {label}: {hacked_hits/n_experiments:.3f}")

simulate_phacking(true_effect=0)
simulate_phacking(true_effect=0.3)

# --- Bayesian Fairness Assessment ---
print("\n=== Bayesian Perspective on Fairness ===")
def bayesian_fairness_prior(n_a, pos_a, n_b, pos_b, prior=1):
    """Compute posterior probability that groups have different true rates."""
    # Beta-Binomial model
    post_a = (prior + pos_a, prior + n_a - pos_a)
    post_b = (prior + pos_b, prior + n_b - pos_b)

    # Monte Carlo comparison
    samples_a = np.random.beta(post_a[0], post_a[1], 100000)
    samples_b = np.random.beta(post_b[0], post_b[1], 100000)

    p_a_greater = np.mean(samples_a > samples_b)
    diff = np.mean(samples_a - samples_b)
    ci = np.percentile(samples_a - samples_b, [2.5, 97.5])

    print(f"  Group A: {pos_a}/{n_a} = {pos_a/n_a:.3f}")
    print(f"  Group B: {pos_b}/{n_b} = {pos_b/n_b:.3f}")
    print(f"  P(A > B): {p_a_greater:.3f}")
    print(f"  Mean difference: {diff:+.4f}")
    print(f"  95% credible interval for difference: ({ci[0]:.4f}, {ci[1]:.4f})")

bayesian_fairness_prior(1000, 300, 1000, 350)  # 30% vs 35%
bayesian_fairness_prior(100, 30, 100, 50)      # 30% vs 50%

Key Takeaways

Summary: Ethics in Statistics

The ASA Ethical Guidelines establish six principles: professional integrity, data/method integrity, client responsibility, misconduct accountability, competence, and collegial respect.
P-hacking inflates false positive rates to 30%+; remedies include pre-registration, registered reports, and open data.
Algorithmic fairness has multiple incompatible criteria (demographic parity, equalized odds, calibration); the impossibility theorem shows that, when base rates differ and prediction is imperfect, no predictor can satisfy all three simultaneously.
Informed consent must address what data is collected, how it is used, who accesses it, and the right to withdraw — increasingly challenging in big data contexts.
Differential privacy provides mathematically rigorous privacy guarantees at the cost of accuracy, with the privacy budget $\varepsilon$ governing the tradeoff.
Landmark cases (Tuskegee, COMPAS, Challenger) demonstrate that ethical failures in statistics have real-world consequences for individuals and public trust.
Professional responsibility requires disclosure of conflicts, honest communication of uncertainty, and resistance to pressure that compromises analytical integrity.
