Sampling Bias and Errors — Types, Detection, and Prevention

Why Your Data Might Be Lying to You

Even the most carefully designed study can be undermined by bias. Understanding bias is not pessimism — it is statistical integrity.

Key things this concept helps with:

  • Selection Bias — Recognizing when your sample systematically excludes the people who matter most
  • Nonresponse Bias — Understanding why silence from certain groups distorts the entire picture
  • Survivorship Bias — Seeing the ghosts of the data that disappeared and skewed your conclusions
  • Measurement Bias — Detecting when your instruments measure something other than what you intended

The most dangerous bias is the one you cannot see. Train yourself to ask: who is missing from this data?


What is Sampling Bias?

Definition

Sampling bias occurs when some members of the population are systematically more or less likely to end up in the sample than others, so the sample does not represent the population it is meant to describe.


Types of Error in Statistics

Sampling Error

The unavoidable random difference between a sample statistic and the true population parameter. It decreases with larger samples and can be quantified.

$$\text{Sampling Error} = \bar{x} - \mu$$

Here,

  • $\bar{x}$ = Sample mean
  • $\mu$ = Population mean

This is expected and manageable. Confidence intervals are designed to account for it.
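As a rough illustration of "decreases with larger samples and can be quantified", the sketch below draws repeated samples from a hypothetical normal population (the mean of 100 and standard deviation of 15 are assumptions chosen for the demo) and compares the observed spread of the sampling error with the textbook standard error, sigma / sqrt(n).

```python
import random
import statistics

random.seed(0)

MU, SIGMA = 100.0, 15.0   # hypothetical population mean and std dev
N_REPEATS = 1000          # repeated samples drawn per sample size


def sampling_errors(n, repeats=N_REPEATS):
    """Return x_bar - mu for `repeats` independent samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = [random.gauss(MU, SIGMA) for _ in range(n)]
        errors.append(statistics.fmean(sample) - MU)
    return errors


results = {}
for n in (25, 100, 400):
    errs = sampling_errors(n)
    results[n] = {
        "mean_error": statistics.fmean(errs),  # near 0: no systematic offset
        "spread": statistics.pstdev(errs),     # empirical standard error
        "theory": SIGMA / n ** 0.5,            # sigma / sqrt(n)
    }
    print(f"n={n:4d}  mean error={results[n]['mean_error']:+.3f}  "
          f"spread={results[n]['spread']:.3f}  "
          f"sigma/sqrt(n)={results[n]['theory']:.3f}")
```

Under these assumptions, quadrupling the sample size should roughly halve the spread, while the average error stays close to zero.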

Non-Sampling Error (Bias)

Systematic errors that don't go away with larger samples. A biased survey of 1 million people is still biased.

$$\text{Bias} = E[\hat{\theta}] - \theta$$

Here,

  • $E[\hat{\theta}]$ = Expected value of the estimator
  • $\theta$ = True parameter value
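To make the claim that bias does not shrink with sample size concrete, here is a minimal sketch. It assumes a hypothetical response model in which people with higher values are more likely to respond (`respond_probability` and all of its constants are invented for the demo), and compares the resulting estimate with a simple random sample of the same size.

```python
import random
import statistics

random.seed(1)

TRUE_MEAN, SD = 50.0, 10.0  # hypothetical population parameters


def respond_probability(y):
    """Hypothetical response model: higher values of y respond more often."""
    return min(0.95, max(0.05, 0.5 + 0.02 * (y - TRUE_MEAN)))


def biased_sample(n):
    """Keep drawing people until n of them have responded."""
    sample = []
    while len(sample) < n:
        y = random.gauss(TRUE_MEAN, SD)
        if random.random() < respond_probability(y):
            sample.append(y)
    return sample


def random_sample(n):
    """Simple random sample: everyone is equally likely to be included."""
    return [random.gauss(TRUE_MEAN, SD) for _ in range(n)]


errors = {}
for n in (100, 1_000, 10_000, 100_000):
    biased_err = statistics.fmean(biased_sample(n)) - TRUE_MEAN
    srs_err = statistics.fmean(random_sample(n)) - TRUE_MEAN
    errors[n] = (biased_err, srs_err)
    print(f"n={n:>7,}  biased sample error={biased_err:+.2f}  "
          f"random sample error={srs_err:+.2f}")
```

With these assumed parameters the random-sample error should drift toward zero as n grows, while the biased-sample error should settle at a roughly constant offset (a little under 4 units here). At large n, that single-run error is a reasonable stand-in for the bias defined above.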

Types of Bias

Selection Bias

Certain individuals are more likely to be included in the sample due to how the sample was drawn.

Classic example: The Literary Digest 1936 US Election Poll

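  • Mailed about 10 million surveys to names drawn largely from car registrations and telephone directories
  • Got about 2.4 million responses
  • Predicted Landon would beat Roosevelt 57% to 43%
  • Roosevelt won with about 61% of the popular vote to Landon's 37%

Problem: In 1936, car owners and phone subscribers were wealthier than average and leaned more Republican than the general population. With only about a quarter of the surveys returned, nonresponse likely compounded the error.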

import numpy as np

np.random.seed(42)
n = 10000

# Simulated population with income in USD
population_income = np.random.normal(50000, 20000, n)

# High-income people support policy X less (e.g., a wealth tax):
# 35% support above $60,000 and 65% support otherwise (about 56% overall)
true_support_by_income = np.where(population_income > 60000, 0.35, 0.65)
population_support = np.random.binomial(1, true_support_by_income)

# Selection bias: online survey where higher-income people are more likely to respond
prob_respond = np.clip(0.1 + (population_income - 30000) / 200000, 0.05, 0.95)
responded = np.random.binomial(1, prob_respond)

# Compare the respondents with the full simulated population
selected = population_support[responded == 1]
true_support = population_support.mean()
print(f"True support: {true_support:.3f}")
print(f"Biased sample support: {selected.mean():.3f}")
print(f"Bias: {selected.mean() - true_support:+.3f}")

Nonresponse Bias

Definition

People who don't respond to a survey differ systematically from those who do.

Examples:

  • Happy customers ignore feedback surveys; dissatisfied customers respond
  • Busy people (who may have different characteristics) skip phone surveys
  • Sensitive topics (income, drug use) get more refusals from those actually affected

Detection:

# Simulate a job-satisfaction survey with nonresponse.
# Note: a common detection heuristic compares early vs. late respondents,
# treating late respondents as a proxy for nonrespondents (wave analysis).
# Heckman's correction is a different technique: a two-step selection model.
import numpy as np

np.random.seed(1)
n = 1000
true_job_satisfaction = np.clip(np.random.normal(6.5, 2, n), 1, 10)  # 1-10 scale

# Dissatisfied people are less likely to respond
prob_respond = np.clip(0.3 + 0.08 * true_job_satisfaction, 0.1, 0.95)
responded = np.random.binomial(1, prob_respond)

all_responses = true_job_satisfaction[responded == 1]
bias = all_responses.mean() - true_job_satisfaction.mean()
print(f"True mean satisfaction: {true_job_satisfaction.mean():.3f}")
print(f"Survey mean satisfaction (biased): {all_responses.mean():.3f}")
print(f"Nonresponse bias: {bias:+.3f}")
print("Survey overestimates satisfaction because unhappy people didn't respond!")
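The early-versus-late comparison mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the three contact waves, the `respond_probability` model, and its constants are all hypothetical. The heuristic itself rests on the assumption that people who need more reminders resemble those who never answer, which may not hold in a real survey.

```python
import random
import statistics

random.seed(2)

N_PEOPLE = 20_000
N_WAVES = 3  # initial invitation plus two reminders (hypothetical design)


def respond_probability(score):
    """Hypothetical model: satisfied people are likelier to answer each contact."""
    return min(0.95, max(0.05, 0.05 + 0.05 * score))


waves = {w: [] for w in range(1, N_WAVES + 1)}
nonrespondents = []  # unobservable in a real survey; kept here for comparison

for _ in range(N_PEOPLE):
    score = min(10.0, max(1.0, random.gauss(6.5, 2.0)))  # 1-10 scale
    for w in range(1, N_WAVES + 1):
        if random.random() < respond_probability(score):
            waves[w].append(score)
            break
    else:
        nonrespondents.append(score)  # never answered any contact

wave_means = {w: statistics.fmean(scores) for w, scores in waves.items()}
respondent_mean = statistics.fmean([s for scores in waves.values() for s in scores])
nonrespondent_mean = statistics.fmean(nonrespondents)

for w, m in wave_means.items():
    print(f"Wave {w}: n={len(waves[w]):5d}  mean satisfaction={m:.2f}")
print(f"All respondents: {respondent_mean:.2f}")
print(f"Nonrespondents (normally unseen): {nonrespondent_mean:.2f}")
```

A downward trend in the wave means is a warning sign: it suggests the people who never responded would score lower still, so the respondent mean is probably too high.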

Survivorship Bias

Definition

Analyzing only survivors (successes) while ignoring those that failed and are no longer visible.

Examples:

  • Studying successful startups to find success strategies (ignoring the thousands that failed with the same strategies)
  • WWII plane damage example: armor belongs where the returning planes were not hit, because planes hit in those spots never made it back
  • Investment fund performance: funds that failed were removed from databases

# Survivorship bias in investment funds
import numpy as np

np.random.seed(7)
n_funds = 1000
years = 10

# Simulate annual returns: mean 5% with 20% std dev
returns = np.random.normal(0.05, 0.20, (n_funds, years))

# A fund closes after any year in which it loses more than 5%
# (roughly a 31% chance per year, so about 69% survive each year).
# Survival must depend on performance for the bias to appear; if closures
# were independent of returns, survivors would look like all other funds.
survival = (returns >= -0.05).cumprod(axis=1)

# Surviving funds (still alive at year 10)
survived = survival[:, -1] == 1
n_survived = survived.sum()
print(f"Funds surviving 10 years: {n_survived}/{n_funds} ({100*n_survived/n_funds:.1f}%)")

# Compare average returns (the all-fund figure includes the returns that
# closed funds would have reported had they stayed in the database)
all_fund_returns = returns.mean(axis=1)
survivor_returns = returns[survived].mean(axis=1)

print(f"All funds average return: {all_fund_returns.mean():.3f}")
print(f"Surviving funds average return: {survivor_returns.mean():.3f}")
print(f"Survivorship bias inflates returns by: {survivor_returns.mean() - all_fund_returns.mean():+.3f}")

Measurement Bias

Definition

Systematic errors in how variables are measured, causing values to be consistently too high or too low.

Examples:

  • Self-reported weight: people tend to underreport
  • Social desirability bias: answering to appear socially acceptable
  • Question ordering effects
  • Interviewer effects (different interviewers get different answers)

# Social desirability bias: hours of exercise per week
import numpy as np

np.random.seed(3)
n = 500
true_hours = np.random.exponential(scale=3, size=n)  # true behavior

# People exaggerate to the interviewer
exaggeration = np.random.normal(loc=1.5, scale=0.5, size=n)  # add ~1.5 hours bias
reported_hours = true_hours + exaggeration

print(f"True mean exercise: {true_hours.mean():.2f} hours/week")
print(f"Reported mean exercise: {reported_hours.mean():.2f} hours/week")
print(f"Measurement bias: {reported_hours.mean() - true_hours.mean():+.2f} hours")
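In practice the true values are not available for everyone. One way to estimate measurement bias is a validation subsample in which both the self-report and an objective measurement are recorded. The sketch below assumes such a subsample of 200 people (a hypothetical size, with the same simulated exaggeration as above) and computes the mean paired difference with an approximate 95% confidence interval.

```python
import math
import random
import statistics

random.seed(3)

N_VALIDATION = 200  # hypothetical validation subsample size

# Objective measurement (e.g., from a wearable device) for each person
objective_hours = [random.expovariate(1 / 3) for _ in range(N_VALIDATION)]
# Self-reports for the same people, inflated by about 1.5 hours on average
reported_hours = [h + random.gauss(1.5, 0.5) for h in objective_hours]

# Paired differences: reported minus objective, person by person
diffs = [r - o for r, o in zip(reported_hours, objective_hours)]
mean_diff = statistics.fmean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(N_VALIDATION)

# Normal-approximation 95% confidence interval for the mean difference
ci_low = mean_diff - 1.96 * std_err
ci_high = mean_diff + 1.96 * std_err

print(f"Estimated measurement bias: {mean_diff:+.2f} hours/week")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes zero suggests the instrument is systematically off, and the estimated offset can then inform a correction for the full sample.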

Detecting and Reducing Bias

Bias Type    | Detection                                           | Reduction
Selection    | Compare selected vs. population on known variables  | Probability sampling, weighting
Nonresponse  | Follow-up of nonrespondents, callback analysis      | Maximize response rate, imputation
Survivorship | Check if missing data is random                     | Include all units, censoring analysis
Measurement  | Validate against objective measures                 | Anonymize surveys, indirect questions
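The "weighting" entry for selection bias can be illustrated with post-stratification: each stratum's sample mean is weighted by its known population share rather than by its share of respondents. The sketch below is a minimal example with two hypothetical income strata. The population shares, support levels, and response rates are all invented for the demo, and the correction only works to the extent that respondents resemble nonrespondents within each stratum.

```python
import random

random.seed(4)

# Hypothetical values; in practice POP_SHARE would come from a census or registry
POP_SHARE = {"low": 0.70, "high": 0.30}      # population share of each stratum
SUPPORT = {"low": 0.65, "high": 0.35}        # true support within each stratum
RESPONSE_RATE = {"low": 0.10, "high": 0.30}  # high-income people respond more
N_CONTACTED = 50_000

responses = {"low": [], "high": []}
for _ in range(N_CONTACTED):
    stratum = "high" if random.random() < POP_SHARE["high"] else "low"
    if random.random() < RESPONSE_RATE[stratum]:
        answer = 1 if random.random() < SUPPORT[stratum] else 0
        responses[stratum].append(answer)

n_respondents = sum(len(v) for v in responses.values())

# Unweighted estimate: every respondent counts equally
unweighted = sum(sum(v) for v in responses.values()) / n_respondents

# Post-stratified estimate: stratum means weighted by population shares
weighted = sum(POP_SHARE[s] * sum(v) / len(v) for s, v in responses.items())

true_support = sum(POP_SHARE[s] * SUPPORT[s] for s in POP_SHARE)

print(f"True support:        {true_support:.3f}")
print(f"Unweighted estimate: {unweighted:.3f}")
print(f"Weighted estimate:   {weighted:.3f}")
```

Under these assumptions the unweighted estimate should fall well below the true value because high-income respondents are over-represented, while the weighted estimate should land close to it.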

Sampling Bias in Machine Learning

Figure: Real-world data (diverse population) → Biased training data (missing groups) → Biased model (unfair predictions)

Biased data → Biased model → Unfair outcomes (racial, gender, age discrimination)

Fix: Resampling, reweighting, fairness-aware algorithms

Bias Type         | ML Example                         | Consequence
Selection bias    | Training on data from one region   | Model fails in other regions
Survivorship bias | Only training on active users      | Model ignores churned users
Label bias        | Human annotators' biases           | Model learns those biases
Temporal bias     | Training on old data               | Model outdated for current patterns
Class imbalance   | 99% non-fraud, 1% fraud            | Model predicts "no fraud" always

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

np.random.seed(42)

# Simulate gender bias in hiring data
n = 2000
gender = np.random.choice(['M', 'F'], n, p=[0.6, 0.4])
skill = np.random.normal(50, 15, n)  # true skill (same distribution)
# Historical bias: women with same skill less likely to be hired
hired_prob = np.where(gender == 'M',
                      np.clip(0.3 + 0.01 * skill, 0, 1),
                      np.clip(0.15 + 0.01 * skill, 0, 1))
hired = np.random.binomial(1, hired_prob)

data = pd.DataFrame({'gender': gender, 'skill': skill, 'hired': hired})
print("=== Hiring rates by gender ===")
print(data.groupby('gender')['hired'].mean())

# Train model on biased data — it learns the bias!
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['gender_enc'] = le.fit_transform(data['gender'])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data[['skill', 'gender_enc']], data['hired'])

# Test on balanced data (fair test set)
test_gender = np.repeat(['M', 'F'], 500)
test_skill = np.random.normal(50, 15, 1000)
test_data = pd.DataFrame({'gender': test_gender, 'skill': test_skill})
test_data['gender_enc'] = le.transform(test_data['gender'])

preds = model.predict(test_data[['skill', 'gender_enc']])
test_data['pred'] = preds

print("\n=== Predicted hire rates (same skill distribution) ===")
print(test_data.groupby('gender')['pred'].mean())
print("\nModel learned the historical bias!")
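The class-imbalance row in the table above can be illustrated without any model at all. The sketch below assumes a hypothetical dataset with a 1% fraud rate and scores a trivial predictor that always answers "no fraud". It is meant only to show why accuracy alone is misleading on imbalanced data.

```python
import random

random.seed(5)

N = 10_000
FRAUD_RATE = 0.01  # hypothetical share of fraudulent transactions

# 1 = fraud, 0 = legitimate
labels = [1 if random.random() < FRAUD_RATE else 0 for _ in range(N)]

# Trivial "model": always predict the majority class
predictions = [0] * N

n_fraud = sum(labels)
correct = sum(p == y for p, y in zip(predictions, labels))
true_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

accuracy = correct / N
recall = true_positives / n_fraud if n_fraud else float("nan")

print(f"Fraud cases in data: {n_fraud}")
print(f"Accuracy: {accuracy:.3f}")
print(f"Recall on fraud: {recall:.3f}")
```

Accuracy comes out near 99% while recall on the fraud class is zero, which is why metrics such as recall or precision, together with resampling or reweighting, matter on imbalanced data.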

Key Takeaways

Sampling error is random and shrinks with larger samples — it is manageable with confidence intervals.

Bias is systematic and never disappears with more data — a biased sample of a million is still biased.

Survivorship bias is pervasive in business, medicine, and social science — always ask who is missing.

Probability sampling, not convenience sampling, is the strongest defense against selection bias.

The size of your sample does not correct for bias — it only makes a biased answer more confidently wrong.
