
Ethics in Statistics


The Responsibility That Comes With Analytical Power

Statistical methods are powerful tools that can be misused, intentionally or accidentally, to mislead. The ASA Ethical Guidelines, algorithmic fairness, data privacy, and professional responsibility form the ethical backbone of the discipline.

  • Algorithmic fairness: Auditing models for bias across protected groups to ensure equitable outcomes
  • Data privacy: Balancing analytical utility with GDPR/CCPA compliance and informed consent
  • Professional integrity: Resisting pressure to selectively report or manipulate results for desired outcomes

Ethical statistics means using your analytical power in ways that serve truth and society, not just clients.


Definition: Statistical Ethics

Statistical ethics encompasses the principles, standards, and practices that guide responsible conduct in the collection, analysis, interpretation, and communication of data. It addresses the moral obligations of statisticians to society, their clients, and the integrity of their discipline.

"Statistics is the grammar of science, and like any language, it can be used to illuminate or to deceive." (Adapted from Karl Pearson)


The ASA Ethical Guidelines

Definition: ASA Ethical Guidelines for Statistical Practice

The American Statistical Association (ASA) adopted its Ethical Guidelines in 1989 (revised 2016) to promote ethical practice across all branches of statistics. The guidelines are organized around six foundational principles:

The Six Principles

  1. Professional Integrity and Accountability

    • Strive for honesty, objectivity, and transparency
    • Acknowledge limitations and potential biases in analyses
    • Accept responsibility for professional work
  2. Integrity of Data and Methods

    • Use appropriate statistical methods
    • Document data processing and analytic decisions
    • Distinguish between exploratory and confirmatory analysis
  3. Responsibilities to Clients, Employers, and Others

    • Protect confidential information
    • Disclose potential conflicts of interest
    • Report results accurately and completely
  4. Responsibilities Regarding Allegations of Misconduct

    • Address allegations of misconduct promptly
    • Cooperate with investigations
  5. Competence and Judgment

    • Practice only in areas of competence
    • Seek statistical expertise when needed
  6. Responsibilities to Other Statisticians

    • Respect colleagues' work
    • Acknowledge contributions appropriately

ASA Code of Ethics

The ASA Ethical Guidelines are not enforceable in the way medical licenses are; they serve as aspirational standards. However, many institutions and journals now require adherence to these guidelines as a condition of publication or employment.


Responsible Use of Statistics

P-Hacking and Data Dredging

Definition: P-Hacking

P-hacking (Simmons et al., 2011) is the practice of selectively reporting, analyzing, or modifying data or analyses until a statistically significant result ($p < 0.05$) is obtained. This inflates the false positive rate far beyond the nominal level.

Common forms of p-hacking include:

| Practice | Effect on False Positive Rate |
| --- | --- |
| Testing multiple outcomes, reporting only significant ones | Up to 30% FPR (vs. 5% nominal) |
| Stopping data collection when $p < 0.05$ | Uncontrolled FPR |
| Excluding outliers after seeing results | Inflated effect sizes |
| Trying different model specifications | Multiplicity without correction |
| Reporting one-tailed tests when two-tailed tests were planned | Doubles effective alpha |
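The first row of the table can be checked with a short calculation. This is a minimal sketch that assumes the tests are independent and each is run at the same level; real outcomes are often correlated, which changes the exact figure. The function names `familywise_error_rate` and `bonferroni_alpha` are illustrative, not part of any library.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(m: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m


for m in (1, 3, 7, 20):
    print(f"m={m:2d}  FWER={familywise_error_rate(m):.3f}  "
          f"Bonferroni per-test alpha={bonferroni_alpha(m):.4f}")
```

Under these assumptions, seven independent outcomes are already enough to push the chance of at least one false positive to roughly 30%.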

The Garden of Forking Paths

Gelman & Loken (2013) formalized the problem: even without deliberate p-hacking, the many researcher degrees of freedom in data analysis (variable transformations, subgroup analyses, model specifications) create a multiplicity problem that is invisible to the standard $p < 0.05$ framework.

Replication and Transparency

The Replication Crisis

The replication crisis, the finding that many published results fail to replicate, has been linked partly to unethical statistical practices: p-hacking, HARKing (Hypothesizing After the Results are Known), and publication bias. The Open Science Collaboration (2015) found that only 36% of the psychology studies it re-ran produced a statistically significant result on replication.

Remedies for P-Hacking:

  1. Pre-registration: Specify hypotheses, methods, and analysis plans before data collection
  2. Registered Reports: Journals accept papers based on methodology, before results are known
  3. Open Data and Code: Share analysis code and (where ethical) data
  4. Bayesian Methods: Shift from binary significance to continuous evidence measures
  5. Effect Size Reporting: Report practical significance alongside statistical significance
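As a small illustration of remedy 5, the sketch below computes Cohen's d, one common standardized effect size, from two samples. The data are made up for illustration, and the pooled-standard-deviation form used here is one of several conventions.

```python
import math
import statistics


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


# Made-up illustrative scores, not real data
treatment = [5.1, 4.9, 5.6, 5.8, 5.0, 5.4]
control = [4.8, 5.0, 5.2, 4.7, 5.1, 4.9]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Reporting d next to the p-value lets readers judge whether a statistically significant difference is also large enough to matter in practice.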

Algorithmic Fairness

Definition: Algorithmic Fairness

Algorithmic fairness addresses the question: when does a decision-making algorithm treat different groups equitably? There are multiple, often incompatible, fairness criteria, and choosing among them is ultimately an ethical, not purely technical, decision.

Formal Fairness Criteria

Definition: Demographic Parity (Statistical Parity)

A predictor $\hat{Y}$ satisfies demographic parity if:

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b) \quad \forall a, b$$

where $A$ denotes the protected attribute (e.g., race, gender). In other words, the predictor's decisions are independent of the protected attribute.
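In practice the probabilities in this definition are estimated from data as per-group selection rates. The following is a minimal sketch on toy data; the group labels "a" and "b" and the function names are hypothetical.

```python
def selection_rate(y_pred, groups, group):
    """Share of positive predictions in one group: estimates P(Y_hat = 1 | A = group)."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)


def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = [selection_rate(y_pred, groups, g) for g in set(groups)]
    return max(rates) - min(rates)


# Toy data: 1 = positive decision
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(y_pred, groups))
```

Here group "a" is selected 75% of the time and group "b" 25% of the time, so the difference is 0.5.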

Definition: Equalized Odds

A predictor $\hat{Y}$ satisfies equalized odds if:

$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b) \quad \forall y, a, b$$

True positive rates and false positive rates are equal across groups. This conditions on the true label $Y$.
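An empirical check of equalized odds compares the true positive rate and the false positive rate between groups. This sketch uses toy data with hypothetical group labels, and it assumes every group contains at least one example of each true label (otherwise the rate is undefined and the division fails).

```python
def rate(y_true, y_pred, groups, group, label):
    """Estimate P(Y_hat = 1 | Y = label, A = group) from samples."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups)
             if g == group and t == label]
    return sum(preds) / len(preds)


def equalized_odds_gaps(y_true, y_pred, groups, a, b):
    """Absolute TPR and FPR differences between groups a and b."""
    tpr_gap = abs(rate(y_true, y_pred, groups, a, 1) -
                  rate(y_true, y_pred, groups, b, 1))
    fpr_gap = abs(rate(y_true, y_pred, groups, a, 0) -
                  rate(y_true, y_pred, groups, b, 0))
    return tpr_gap, fpr_gap


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(equalized_odds_gaps(y_true, y_pred, groups, "a", "b"))
```

In this toy example the true positive rates match, but group "b" has a higher false positive rate, so equalized odds is violated.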

Definition: Calibration

A predictor with score $\hat{P}$ is calibrated if:

$$P(Y = 1 \mid \hat{P} = p, A = a) = p \quad \forall p, a$$

Among individuals assigned prediction score $p$, the actual positive rate is $p$, regardless of group membership.
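Calibration is usually checked by binning scores and comparing the mean score in each bin with the observed positive rate, separately per group. The sketch below assumes scores lie in [0, 1] and uses equal-width bins; the function name and toy data are illustrative.

```python
from collections import defaultdict


def calibration_table(scores, y_true, groups, n_bins=4):
    """Per (group, bin): (mean predicted score, observed positive rate, count)."""
    cells = defaultdict(list)
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the top bin
        cells[(g, b)].append((s, y))
    table = {}
    for key, items in sorted(cells.items()):
        mean_score = sum(s for s, _ in items) / len(items)
        pos_rate = sum(y for _, y in items) / len(items)
        table[key] = (mean_score, pos_rate, len(items))
    return table


# Toy data: each group has four low-score and four high-score individuals
scores = [0.25] * 4 + [0.75] * 4 + [0.25] * 4 + [0.75] * 4
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
groups = ["a"] * 8 + ["b"] * 8
table = calibration_table(scores, y_true, groups)
for (g, b), (mean_score, pos_rate, n) in table.items():
    print(f"group={g} bin={b} mean score={mean_score:.2f} "
          f"observed rate={pos_rate:.2f} n={n}")
```

Group "a" is calibrated in both bins, while group "b" has an observed rate of 0.50 among individuals scored 0.25, which is a calibration failure for that group.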

Impossibility Theorem (Chouldechova, 2017; Kleinberg et al., 2016)

When base rates differ across groups ($P(Y=1 \mid A=a) \neq P(Y=1 \mid A=b)$) and predictions are not perfect, no predictor can simultaneously satisfy demographic parity, equalized odds, and calibration. In fact, calibration and equalized odds are already incompatible with each other under these conditions, so at least one criterion must be relaxed.
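A short calculation shows where the conflict comes from. From the definitions of the confusion-matrix rates, FPR = p/(1-p) * (1-PPV)/PPV * TPR, where p is the base rate; this is the relation used in Chouldechova (2017). The sketch below uses equal positive predictive value (PPV) as a stand-in for calibration, which is a simplification, and the numeric values are illustrative.

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """FPR forced by a given base rate, PPV, and TPR (confusion-matrix identity)."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr


# Same PPV and TPR in both groups, but different base rates (illustrative values)
fpr_a = implied_fpr(base_rate=0.3, ppv=0.7, tpr=0.8)
fpr_b = implied_fpr(base_rate=0.5, ppv=0.7, tpr=0.8)
print(f"Group A FPR: {fpr_a:.3f}, Group B FPR: {fpr_b:.3f}")
```

With PPV and TPR held equal, the group with the higher base rate is forced to have a higher false positive rate, so equal error rates cannot hold at the same time.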

Choosing Fairness Criteria

The choice of fairness criterion depends on context:

  • Criminal justice: Equalized odds may be preferred (equal error rates across groups)
  • Hiring: Demographic parity may be legally required
  • Medical diagnosis: Calibration may be most important (probability scores should be meaningful for all groups)

The impossibility theorem means that fairness cannot be reduced to a single technical constraint; it requires ethical judgment.

Bias in Algorithms

Definition: Sources of Algorithmic Bias

| Source | Description | Example |
| --- | --- | --- |
| Historical bias | Training data reflects past discrimination | Hiring algorithms trained on biased hiring data |
| Representation bias | Underrepresentation of certain groups | Facial recognition trained on predominantly white faces |
| Measurement bias | Features measured differently across groups | Credit scores as proxy for creditworthiness |
| Aggregation bias | Model assumes same relationships across groups | Medical model trained on average demographics |
| Evaluation bias | Benchmark datasets not representative | NLP models tested on non-diverse text |
| Deployment bias | Model used in contexts different from design | Risk assessment tool used beyond its scope |

Fairness-Aware Machine Learning

Definition: Pre-processing, In-processing, and Post-processing

Fairness interventions can be applied at three stages:

  1. Pre-processing: Transform training data to remove bias (e.g., reweighting, resampling, relabeling)
  2. In-processing: Modify the learning algorithm to incorporate fairness constraints (e.g., constrained optimization, adversarial training)
  3. Post-processing: Adjust model outputs to satisfy fairness criteria (e.g., threshold adjustment per group)
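As one concrete pre-processing example, the sketch below computes reweighting factors in the style usually attributed to Kamiran and Calders: each (group, label) cell gets the weight P(A=a) * P(Y=y) / P(A=a, Y=y), so that group and label look independent in the weighted data. The function name and toy data are assumptions made for illustration.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y), estimated from counts."""
    n = len(labels)
    n_group = Counter(groups)
    n_label = Counter(labels)
    n_joint = Counter(zip(groups, labels))
    return {(a, y): (n_group[a] * n_label[y]) / (n * n_joint[(a, y)])
            for (a, y) in n_joint}


# Toy training data: group "a" has mostly positive labels, group "b" mostly negative
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
for cell, weight in sorted(reweighing_weights(groups, labels).items()):
    print(cell, round(weight, 3))
```

Under-represented cells, such as positives in group "b", receive weights above 1, and the weighted positive rate becomes the same in both groups. The weights would then be passed to a learner that accepts per-sample weights.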

Fairness-Accuracy Tradeoff

Imposing fairness constraints typically reduces predictive accuracy. The magnitude of the tradeoff depends on the dataset, the fairness criterion, and the degree of underlying disparity. There is no free lunch in algorithmic fairness.


Informed Consent

Definition: Informed Consent in Data Collection

Informed consent requires that research participants understand:

  1. What data is being collected
  2. How it will be used (primary and secondary uses)
  3. Who will have access to the data
  4. Risks and benefits of participation
  5. Right to withdraw at any time
  6. Duration of data retention

Ethical Challenges in Modern Data Science

| Challenge | Description | Mitigation |
| --- | --- | --- |
| Big data | Consent for collection is impractical at scale | Opt-out mechanisms, data minimization |
| Re-identification | Anonymized data can be de-anonymized | Differential privacy, k-anonymity |
| Secondary use | Data collected for one purpose used for another | Purpose limitation, consent renewal |
| Children/minors | Cannot provide informed consent | Parental consent, age-appropriate design |
| Vulnerable populations | Power dynamics may compromise autonomy | IRB review, community engagement |
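To make the k-anonymity mitigation concrete: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The sketch below measures k for a small made-up table; the field names and records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest group size among records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


# Made-up records with generalized ZIP codes and age bands
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]
print(k_anonymity(records, ["zip", "age"]))
```

A low k means some individuals are nearly unique on their quasi-identifiers and therefore easier to re-identify by linking with outside data.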

Data Privacy

Differential Privacy

Definition: Differential Privacy

A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for all neighboring datasets $D$ and $D'$ (differing in one record) and all subsets $S$ of possible outputs:

$$P(\mathcal{M}(D) \in S) \leq e^{\varepsilon} \cdot P(\mathcal{M}(D') \in S) + \delta$$

Smaller $\varepsilon$ means stronger privacy. When $\delta = 0$, the mechanism is $\varepsilon$-differentially private.
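Randomized response on a single yes/no answer is perhaps the simplest mechanism that meets this definition with delta equal to 0. In the sketch below each respondent reports the truth with probability e^eps / (1 + e^eps) and the opposite otherwise, so the ratio of output probabilities between the two possible true answers is exactly e^eps. The function names are illustrative.

```python
import math
import random


def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth


def privacy_loss_ratio(epsilon: float) -> float:
    """Worst-case ratio P(output | truth=1) / P(output | truth=0) for this mechanism."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return p_truth / (1.0 - p_truth)


eps = 1.0
# The two printed values should match up to floating-point rounding
print(privacy_loss_ratio(eps), math.exp(eps))
```

Because the ratio never exceeds e^eps, seeing one person's reported answer changes an observer's odds about the true answer by at most that factor.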

Composition Theorem for Differential Privacy

If $\mathcal{M}_1$ is $\varepsilon_1$-DP and $\mathcal{M}_2$ is $\varepsilon_2$-DP, then their sequential composition $\mathcal{M}_1 \circ \mathcal{M}_2$ is $(\varepsilon_1 + \varepsilon_2)$-DP. Under the advanced composition theorem, the composition of $k$ mechanisms, each $\varepsilon$-DP, is

$$\left(\varepsilon \sqrt{2k \ln(1/\delta')} + k \varepsilon(e^{\varepsilon} - 1), \, \delta'\right)\text{-DP}$$

for any $\delta' > 0$. If each mechanism is instead $(\varepsilon, \delta)$-DP, the second component becomes $k\delta + \delta'$.
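The two composition bounds can be compared numerically. This is a minimal sketch: `delta_prime` denotes the small failure probability that appears inside the logarithm of the advanced bound, and the parameter values are illustrative.

```python
import math


def basic_composition(epsilon: float, k: int) -> float:
    """Total epsilon when k epsilon-DP mechanisms are composed sequentially."""
    return k * epsilon


def advanced_composition(epsilon: float, k: int, delta_prime: float) -> float:
    """Epsilon from the advanced composition bound, at the cost of a delta_prime term."""
    return (epsilon * math.sqrt(2 * k * math.log(1 / delta_prime))
            + k * epsilon * (math.exp(epsilon) - 1))


eps, delta_prime = 0.1, 1e-5
for k in (1, 10, 100, 1000):
    print(f"k={k:4d}  basic={basic_composition(eps, k):8.3f}  "
          f"advanced={advanced_composition(eps, k, delta_prime):8.3f}")
```

For a handful of queries the basic bound is tighter, while for many small-epsilon queries the advanced bound grows roughly with the square root of k and gives a noticeably smaller total.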

Regulatory Frameworks

| Regulation | Jurisdiction | Key Requirements |
| --- | --- | --- |
| GDPR | EU | Explicit consent, right to erasure, data minimization, privacy by design |
| CCPA/CPRA | California | Right to know, right to delete, opt-out of sale, non-discrimination |
| HIPAA | US (health) | Protected health information, minimum necessary standard |
| FERPA | US (education) | Student record privacy, parental access rights |
| PIPEDA | Canada | Consent, limiting collection, accountability |

Privacy-Utility Tradeoff

Differential privacy provides a mathematically rigorous privacy guarantee, but at a cost: noise must be added to query results. For a fixed query sensitivity, the noise magnitude scales as $O(1/\varepsilon)$, creating a fundamental tension between privacy and accuracy. The privacy budget $\varepsilon$ must be chosen carefully.


Professional Responsibility

Definition: Statistical Professional Responsibility

Statisticians have obligations to:

  1. Society: Report findings honestly; avoid misleading the public
  2. Clients/Employers: Provide competent, unbiased analysis; disclose limitations
  3. Colleagues: Acknowledge contributions; maintain research integrity
  4. Research Subjects: Protect privacy; ensure informed consent
  5. The Discipline: Uphold the reputation and trustworthiness of statistics

Conflicts of Interest

Financial Conflicts of Interest

Industry-funded studies are significantly more likely to report results favorable to the sponsor (Lexchin et al., 2003). Statisticians must disclose funding sources and maintain analytical independence, even when under pressure to produce favorable results.


Landmark Case Studies

Case 1: The Tuskegee Syphilis Study (1932–1972)

Definition: Tuskegee Syphilis Study

The US Public Health Service conducted a 40-year study of untreated syphilis in 399 African American men in Macon County, Alabama. Participants were never told they had syphilis, were told they were receiving "free health care," and were actively prevented from receiving treatment (including penicillin after it became the standard of care in the 1940s).

Ethical violations: No informed consent, deception, withholding treatment, selection of a vulnerable population.

Impact: Led directly to the National Research Act (1974) and the Belmont Report (1979), establishing the modern framework of respect for persons (including informed consent), beneficence, and justice in human subjects research.

Case 2: The Challenger Disaster (1986)

Engineers at Morton Thiokol warned that O-rings could fail at low temperatures. Management overruled them. Statistical analysis of prior launches showed a clear relationship between temperature and O-ring damage, but this analysis was not presented to decision-makers.

Ethical lesson: Statistical evidence must be communicated clearly and forcefully when lives are at stake. The failure was not in the statistics but in the communication of statistical evidence.

Case 3: Algorithmic Bias in Criminal Justice (COMPAS)

Definition: COMPAS Recidivism Tool

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a widely used risk assessment tool in US criminal justice. A 2016 ProPublica investigation found that the tool was biased against Black defendants:

  • Black defendants: 45% false positive rate (predicted high risk, did not reoffend)
  • White defendants: 23% false positive rate
  • White defendants: 48% false negative rate (predicted low risk, did reoffend)
  • Black defendants: 28% false negative rate

The Impossibility in Practice

COMPAS illustrates the impossibility theorem in practice: the tool cannot simultaneously satisfy equalized odds (equal error rates) and calibration (equal predictive values) when base rates differ. Northpointe (the developer) calibrated the tool; ProPublica demanded equalized odds. Both claims were technically correct: they simply optimized for different fairness criteria.

Case 4: P-Hacking in Psychological Research

Simmons, Nelson, & Simonsohn (2011) demonstrated that common research practices (optional stopping, selective outcome reporting, including/excluding covariates) allow researchers to "find" statistically significant effects with probability up to 61% when the true effect is zero, far exceeding the nominal 5% Type I error rate.
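Optional stopping alone is enough to inflate the error rate, which a small simulation can show. This sketch is not the design used by Simmons et al.: it swaps in a z-test with known unit variance so that it runs on the standard library only, and the peek schedule (every 10 observations up to 50) is an arbitrary assumption.

```python
import math
import random
from statistics import NormalDist


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_sims, peeks, rng):
    """Share of null simulations declared significant at any of the peek points."""
    hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(max(peeks))]  # true effect is zero
        if any(z_test_p(data[:n]) < 0.05 for n in peeks):
            hits += 1
    return hits / n_sims


rng = random.Random(0)
fixed = false_positive_rate(2000, peeks=[50], rng=rng)
peeking = false_positive_rate(2000, peeks=[10, 20, 30, 40, 50], rng=rng)
print(f"Fixed n=50: {fixed:.3f}   Peeking every 10: {peeking:.3f}")
```

Testing once at the planned sample size keeps the rate near the nominal 5%, while stopping at the first significant peek gives the null hypothesis several chances to be rejected by luck.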


Python Implementation

import numpy as np

# --- Simulating Algorithmic Bias ---
np.random.seed(42)

def simulate_fairness_audit(n=10000):
    """Audit a classifier for fairness violations across groups."""
    # True outcome labels for each group, drawn with different base rates
    group_a = np.random.binomial(1, 0.3, n)  # Base rate 30%
    group_b = np.random.binomial(1, 0.5, n)  # Base rate 50%

    # Simulate predictions (deliberately biased model)
    # Model has equal TPR but different FPR across groups
    def predict(true_labels, fpr, tpr):
        # Each case is flagged with probability tpr if truly positive, fpr otherwise
        flag_prob = np.where(true_labels == 1, tpr, fpr)
        return (np.random.random(len(true_labels)) < flag_prob).astype(int)

    pred_a = predict(group_a, fpr=0.15, tpr=0.85)
    pred_b = predict(group_b, fpr=0.30, tpr=0.85)

    # Compute fairness metrics
    def metrics(y_true, y_pred):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        return {'tpr': tp/(tp+fn), 'fpr': fp/(fp+tn),
                'selection_rate': np.mean(y_pred),
                'precision': tp/(tp+fp) if (tp+fp) > 0 else 0}

    m_a = metrics(group_a, pred_a)
    m_b = metrics(group_b, pred_b)

    print("=== Fairness Audit ===")
    print(f"{'Metric':<20s} {'Group A':>10s} {'Group B':>10s} {'Ratio':>10s}")
    print("-" * 55)
    for key in ['selection_rate', 'tpr', 'fpr', 'precision']:
        va, vb = m_a[key], m_b[key]
        # Ratio of the smaller to the larger value; undefined (NaN) if both are zero
        ratio = min(va, vb) / max(va, vb) if max(va, vb) > 0 else float('nan')
        print(f"{key:<20s} {va:>10.3f} {vb:>10.3f} {ratio:>10.3f}")

    # Demographic parity violation
    dp_diff = abs(m_a['selection_rate'] - m_b['selection_rate'])
    print(f"\nDemographic parity difference: {dp_diff:.3f}")
    print(f"Equalized odds (TPR difference): {abs(m_a['tpr'] - m_b['tpr']):.3f}")
    print(f"Equalized odds (FPR difference): {abs(m_a['fpr'] - m_b['fpr']):.3f}")

    return m_a, m_b

simulate_fairness_audit()

# --- Differential Privacy Simulation ---
def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace noise with scale sensitivity/epsilon (epsilon-DP)."""
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_value + noise

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    """Add Gaussian noise for (epsilon, delta)-differential privacy.

    This classical calibration of sigma is derived for 0 < epsilon < 1.
    """
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + np.random.normal(0, sigma)

print("\n=== Differential Privacy Demonstration ===")
true_count = 5000
sensitivity = 1  # Adding/removing one person changes count by at most 1

# Release the count itself: a sensitivity of 1 applies to counts, not proportions
for eps in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    estimates = np.array([laplace_mechanism(true_count, sensitivity, eps)
                          for _ in range(1000)])
    bias = estimates.mean() - true_count
    rmse = np.sqrt(np.mean((estimates - true_count) ** 2))
    print(f"  ε={eps:5.1f}: Mean={estimates.mean():.2f}, "
          f"Bias={bias:+.3f}, RMSE={rmse:.3f}")

# --- P-Hacking Simulation ---
print("\n=== P-Hacking Simulation ===")
from scipy import stats

def simulate_phacking(n_experiments=10000, n_samples=50, true_effect=0):
    """Compare rejection rates of one planned test vs. a p-hacked analysis.

    With true_effect=0 the rejection rate is the false positive rate;
    with a nonzero effect it is the power of the procedure instead.
    """
    # Standard analysis: one pre-specified two-sided t-test
    standard_rejections = 0
    for _ in range(n_experiments):
        x = np.random.normal(0, 1, n_samples)
        y = true_effect + np.random.normal(0, 1, n_samples)
        if stats.ttest_ind(x, y).pvalue < 0.05:
            standard_rejections += 1

    # P-hacked analysis (try multiple tests, report best)
    hacked_rejections = 0
    for _ in range(n_experiments):
        x = np.random.normal(0, 1, n_samples)
        y = true_effect + np.random.normal(0, 1, n_samples)
        # Try 4 analyses: original, log-transformed, largest |value| dropped
        # from each sample, and a one-tailed test (needs SciPy >= 1.6)
        x_trim = np.delete(x, np.argmax(np.abs(x)))
        y_trim = np.delete(y, np.argmax(np.abs(y)))
        tests = [stats.ttest_ind(x, y),
                 stats.ttest_ind(np.log(np.abs(x) + 1), np.log(np.abs(y) + 1)),
                 stats.ttest_ind(x_trim, y_trim),
                 stats.ttest_ind(x, y, alternative='less')]
        if min(t.pvalue for t in tests) < 0.05:
            hacked_rejections += 1

    label = "FPR" if true_effect == 0 else "rejection rate"
    print(f"  True effect = {true_effect}")
    print(f"  Standard {label}: {standard_rejections/n_experiments:.3f} (alpha: 0.050)")
    print(f"  P-hacked {label}: {hacked_rejections/n_experiments:.3f}")

simulate_phacking(true_effect=0)
simulate_phacking(true_effect=0.3)  # nonzero effect: rates reflect power, not FPR

# --- Bayesian Fairness Assessment ---
print("\n=== Bayesian Perspective on Fairness ===")
def bayesian_fairness_prior(n_a, pos_a, n_b, pos_b, prior=1):
    """Estimate the posterior probability that group A's true rate exceeds group B's."""
    # Beta-Binomial model
    post_a = (prior + pos_a, prior + n_a - pos_a)
    post_b = (prior + pos_b, prior + n_b - pos_b)

    # Monte Carlo comparison
    samples_a = np.random.beta(post_a[0], post_a[1], 100000)
    samples_b = np.random.beta(post_b[0], post_b[1], 100000)

    p_a_greater = np.mean(samples_a > samples_b)
    diff = np.mean(samples_a - samples_b)
    ci = np.percentile(samples_a - samples_b, [2.5, 97.5])

    print(f"  Group A: {pos_a}/{n_a} = {pos_a/n_a:.3f}")
    print(f"  Group B: {pos_b}/{n_b} = {pos_b/n_b:.3f}")
    print(f"  P(A > B): {p_a_greater:.3f}")
    print(f"  Mean difference: {diff:+.4f}")
    print(f"  95% credible interval for difference: ({ci[0]:.4f}, {ci[1]:.4f})")

bayesian_fairness_prior(1000, 300, 1000, 350)  # 30% vs 35%
bayesian_fairness_prior(100, 30, 100, 50)      # 30% vs 50%

Key Takeaways

Summary: Ethics in Statistics

  1. The ASA Ethical Guidelines establish six principles: professional integrity, data/method integrity, client responsibility, misconduct accountability, competence, and collegial respect.
  2. P-hacking can inflate false positive rates to 30% or more; remedies include pre-registration, registered reports, and open data.
  3. Algorithmic fairness has multiple incompatible criteria (demographic parity, equalized odds, calibration); the impossibility theorem proves no predictor can satisfy all three simultaneously.
  4. Informed consent must address what data is collected, how it is used, who accesses it, and the right to withdraw, all of which are increasingly challenging in big data contexts.
  5. Differential privacy provides mathematically rigorous privacy guarantees at the cost of accuracy, with the privacy budget $\varepsilon$ governing the tradeoff.
  6. Landmark cases (Tuskegee, COMPAS, Challenger) demonstrate that ethical failures in statistics have real-world consequences for individuals and public trust.
  7. Professional responsibility requires disclosure of conflicts, honest communication of uncertainty, and resistance to pressure that compromises analytical integrity.
