Sampling Bias and Errors
Why Your Data Might Be Lying to You
Even the most carefully designed study can be undermined by bias. Understanding bias is not pessimism — it is statistical integrity.
Key things this concept helps with:
- Selection Bias — Recognizing when your sample systematically excludes the people who matter most
- Nonresponse Bias — Understanding why silence from certain groups distorts the entire picture
- Survivorship Bias — Seeing the ghosts of the data that disappeared and skewed your conclusions
- Measurement Bias — Detecting when your instruments measure something other than what you intended
The most dangerous bias is the one you cannot see. Train yourself to ask: who is missing from this data?
What is Sampling Bias?
Definition
Sampling bias is a systematic error that arises when some members of a population are more likely to be included in a sample than others, so the sample is not representative of the population it is meant to describe.
Types of Error in Statistics
Sampling Error
The unavoidable random difference between a sample statistic and the true population parameter. It decreases with larger samples and can be quantified.
$$\text{Sampling Error} = \bar{x} - \mu$$
Here,
- $\bar{x}$ = Sample mean
- $\mu$ = Population mean
This is expected and manageable. Confidence intervals are designed to account for it.
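A minimal sketch of how sampling error shrinks as the sample grows. The population (mean 100, standard deviation 15) and the sample sizes are illustrative assumptions, not values from a real study. For each sample size it draws 1,000 simple random samples and compares the spread of the sample means with the textbook value $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: mean 100, standard deviation 15 (illustrative values)
population = rng.normal(100.0, 15.0, 200_000)

spreads = {}
for n in (25, 100, 400, 1600):
    # 1,000 independent samples of size n (drawn with replacement), one mean per sample
    sample_means = rng.choice(population, size=(1000, n)).mean(axis=1)
    spreads[n] = sample_means.std()
    theoretical_se = population.std() / np.sqrt(n)
    print(f"n={n:5d}  empirical SE={spreads[n]:.3f}  sigma/sqrt(n)={theoretical_se:.3f}")
```

Each fourfold increase in sample size should roughly halve the spread of the sample means, which is why this kind of error can be quantified and planned for.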
Non-Sampling Error (Bias)
Systematic errors that don't go away with larger samples. A biased survey of 1 million people is still biased.
$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
Here,
- $E[\hat{\theta}]$ = Expected value of the estimator
- $\theta$ = True parameter value
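The contrast with sampling error can be sketched as follows. The lognormal "income" population and the flawed sampling frame (only people above the median can be reached) are assumptions made purely for illustration. As the sample grows, the confidence interval narrows, but it narrows around the wrong value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: incomes (in $1,000s) with a long right tail
population = rng.lognormal(mean=3.9, sigma=0.5, size=500_000)
true_mean = population.mean()

# Assumed flawed sampling frame: only people above the median can be reached
reachable = population[population > np.median(population)]

results = {}
for n in (100, 10_000, 200_000):
    sample = rng.choice(reachable, size=n, replace=False)
    estimate = sample.mean()
    # Approximate 95% confidence interval half-width for the sample mean
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    covers = abs(estimate - true_mean) <= half_width
    results[n] = (estimate, half_width)
    print(f"n={n:7d}  estimate={estimate:6.2f}  CI half-width={half_width:5.2f}  "
          f"covers true mean ({true_mean:.2f}): {covers}")
```

The gap between the estimate and the true mean stays roughly constant while the interval tightens, so a larger biased sample only looks more certain.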
Types of Bias
Selection Bias
Certain individuals are more likely to be included in the sample due to how the sample was drawn.
Classic example: The Literary Digest 1936 US Election Poll
- Sent 10 million surveys to car owners and phone subscribers
- Got 2.4 million responses
- Predicted Landon would beat Roosevelt 57% to 43%
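- Roosevelt won with about 61% of the popular vote to Landon's 37%
Problem: In 1936, car owners and phone subscribers were wealthier than the general population and leaned more Republican. With only about 24% of the surveys returned, nonresponse likely added to the distortion.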
```python
import numpy as np

np.random.seed(42)
n = 10000

# Population income in USD
population_income = np.random.normal(50000, 20000, n)

# High-income people support policy X less (e.g., a wealth tax);
# overall support works out to roughly 55%
support_prob = np.where(population_income > 60000, 0.35, 0.65)
population_support = np.random.binomial(1, support_prob)

# Selection bias: online survey, higher-income people are more likely to respond
prob_respond = np.clip(0.1 + (population_income - 30000) / 200000, 0.05, 0.95)
responded = np.random.binomial(1, prob_respond)

# Compare the respondents with the full population
selected = population_support[responded == 1]
true_support = population_support.mean()
print(f"True support: {true_support:.3f}")
print(f"Biased sample support: {selected.mean():.3f}")
print(f"Bias: {selected.mean() - true_support:+.3f}")
```
Nonresponse Bias
People who don't respond to a survey differ systematically from those who do.
Examples:
- Happy customers ignore feedback surveys; dissatisfied customers respond
- Busy people (who may have different characteristics) skip phone surveys
- Sensitive topics (income, drug use) get more refusals from those actually affected
Detection: compare early and late respondents. Late respondents, who needed reminders before answering, are often treated as a rough proxy for nonrespondents. Heckman's correction is a different technique: a two-step selection model that adjusts estimates for nonrandom selection into the sample.
The simulation below shows how nonresponse shifts the survey mean:
```python
import numpy as np

np.random.seed(0)

# Simulate a survey with nonresponse
n = 1000
# True job satisfaction, clipped to the 1-10 scale
true_job_satisfaction = np.clip(np.random.normal(6.5, 2, n), 1, 10)

# Dissatisfied people are less likely to respond
prob_respond = np.clip(0.3 + 0.08 * true_job_satisfaction, 0.1, 0.95)
responded = np.random.binomial(1, prob_respond)
all_responses = true_job_satisfaction[responded == 1]

print(f"True mean satisfaction: {true_job_satisfaction.mean():.3f}")
print(f"Survey mean satisfaction (biased): {all_responses.mean():.3f}")
print(f"Nonresponse bias: {all_responses.mean() - true_job_satisfaction.mean():+.3f}")
print("Survey overestimates satisfaction because unhappy people didn't respond!")
```
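The early-versus-late comparison mentioned under Detection can be sketched as below. The response model (a per-contact response probability that rises with satisfaction) and the three contact waves are assumptions for illustration. In a real survey the nonrespondents' values are unknown, and late respondents are only an imperfect stand-in for them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical true satisfaction on a 1-10 scale
satisfaction = np.clip(rng.normal(6.5, 2, n), 1, 10)

# Assumed response model: per-contact response probability rises with satisfaction
p_respond = np.clip(0.05 + 0.04 * satisfaction, 0.05, 0.95)

# Three contact attempts (waves); record the wave in which each person first responds
wave = np.zeros(n, dtype=int)  # 0 = never responded
for w in (1, 2, 3):
    responds_now = (wave == 0) & (rng.random(n) < p_respond)
    wave[responds_now] = w

early = satisfaction[wave == 1].mean()
late = satisfaction[wave == 3].mean()
nonrespondents = satisfaction[wave == 0].mean()  # observable only in a simulation

print(f"Early respondents (wave 1): {early:.2f}")
print(f"Late respondents (wave 3):  {late:.2f}")
print(f"Nonrespondents:             {nonrespondents:.2f}")
```

A clear drift from early to late respondents is a warning sign that the people who never answered differ even more.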
Survivorship Bias
Analyzing only survivors (successes) while ignoring those that failed and are no longer visible.
Examples:
- Studying successful startups to find success strategies (ignoring the thousands that failed with the same strategies)
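- WWII aircraft armor (Abraham Wald): damage was only recorded on planes that returned, so the armor belonged where returning planes were not hit, because planes hit in those places never made it back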
- Investment fund performance: funds that failed were removed from databases
```python
# Survivorship bias in investment funds
import numpy as np

np.random.seed(7)
n_funds = 1000
years = 10

# Simulate annual returns: mean 5% with 20% std dev
returns = np.random.normal(0.05, 0.20, (n_funds, years))

# Assumed closure rule: a fund shuts down after any year with a loss worse than -20%.
# Survival must depend on performance, otherwise survivors are just a random subset.
closure_threshold = -0.20
alive = (returns > closure_threshold).cumprod(axis=1)

# Surviving funds (still alive at year 10)
survived = alive[:, -1] == 1
n_survived = survived.sum()
print(f"Funds surviving {years} years: {n_survived}/{n_funds} ({100*n_survived/n_funds:.1f}%)")

# Compare average returns: all funds (including years after closure) vs. survivors only
all_fund_returns = returns.mean(axis=1)
survivor_returns = returns[survived].mean(axis=1)
print(f"All funds average return: {all_fund_returns.mean():.3f}")
print(f"Surviving funds average return: {survivor_returns.mean():.3f}")
print(f"Survivorship bias inflates returns by: {survivor_returns.mean() - all_fund_returns.mean():+.3f}")
```
Measurement Bias
Systematic errors in how variables are measured, causing values to be consistently too high or too low.
Examples:
- Self-reported weight: people tend to underreport
- Social desirability bias: answering to appear socially acceptable
- Question ordering effects
- Interviewer effects (different interviewers get different answers)
```python
# Social desirability bias: hours of exercise per week
import numpy as np

np.random.seed(3)
n = 500
true_hours = np.random.exponential(scale=3, size=n)  # true behavior

# People exaggerate to the interviewer by about 1.5 hours on average
exaggeration = np.random.normal(loc=1.5, scale=0.5, size=n)
reported_hours = true_hours + exaggeration

print(f"True mean exercise: {true_hours.mean():.2f} hours/week")
print(f"Reported mean exercise: {reported_hours.mean():.2f} hours/week")
print(f"Measurement bias: {reported_hours.mean() - true_hours.mean():+.2f} hours")
```
Detecting and Reducing Bias
| Bias Type | Detection | Reduction |
|---|---|---|
| Selection | Compare selected vs. population on known variables | Probability sampling, weighting |
| Nonresponse | Follow-up of nonrespondents, callback analysis | Maximize response rate, imputation |
| Survivorship | Check if missing data is random | Include all units, censoring analysis |
| Measurement | Validate against objective measures | Anonymize surveys, indirect questions |
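The Selection row of the table can be sketched as a simple post-stratification weighting step. The age groups, population shares, and support rates below are hypothetical values chosen for illustration. The sketch first compares the sample's composition with known population shares (detection), then reweights respondents by population share divided by sample share (reduction).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical groups with known population shares (e.g. from a census)
groups = np.array(["18-34", "35-54", "55+"])
population_share = np.array([0.30, 0.35, 0.35])
support_rate = np.array([0.70, 0.50, 0.30])  # assumed true support per group

# Biased sample: the youngest group is heavily over-represented
sample_share = np.array([0.60, 0.25, 0.15])
n = 20_000
group_idx = rng.choice(len(groups), size=n, p=sample_share)
support = rng.binomial(1, support_rate[group_idx])

# Detection: compare sample composition with the known population composition
observed_share = np.bincount(group_idx, minlength=len(groups)) / n
for g, s, p in zip(groups, observed_share, population_share):
    print(f"{g:>6}: sample {s:.2f} vs population {p:.2f}")

# Reduction: weight each respondent by population share / sample share of their group
weights = (population_share / observed_share)[group_idx]
raw = support.mean()
weighted = np.average(support, weights=weights)
true_value = float(population_share @ support_rate)

print(f"True support: {true_value:.3f}  raw: {raw:.3f}  weighted: {weighted:.3f}")
```

Weighting only corrects for the variables used to build the weights. If respondents differ from the population in ways those variables do not capture, the weighted estimate is still biased.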
Sampling Bias in Machine Learning
| Bias Type | ML Example | Consequence |
|---|---|---|
| Selection bias | Training on data from one region | Model fails in other regions |
| Survivorship bias | Only training on active users | Model ignores churned users |
| Label bias | Human annotators' biases | Model learns those biases |
| Temporal bias | Training on old data | Model outdated for current patterns |
| Class imbalance | 99% non-fraud, 1% fraud | Model predicts "no fraud" always |
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

np.random.seed(42)

# Simulate gender bias in hiring data
n = 2000
gender = np.random.choice(['M', 'F'], n, p=[0.6, 0.4])
skill = np.random.normal(50, 15, n)  # true skill (same distribution for both genders)

# Historical bias: women with the same skill are less likely to be hired
hired_prob = np.where(gender == 'M',
                      np.clip(0.3 + 0.01 * skill, 0, 1),
                      np.clip(0.15 + 0.01 * skill, 0, 1))
hired = np.random.binomial(1, hired_prob)
data = pd.DataFrame({'gender': gender, 'skill': skill, 'hired': hired})

print("=== Hiring rates by gender ===")
print(data.groupby('gender')['hired'].mean())

# Train a model on the biased data; it learns the bias
le = LabelEncoder()
data['gender_enc'] = le.fit_transform(data['gender'])
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data[['skill', 'gender_enc']], data['hired'])

# Score a balanced set of applicants drawn from the same skill distribution
test_gender = np.repeat(['M', 'F'], 500)
test_skill = np.random.normal(50, 15, 1000)
test_data = pd.DataFrame({'gender': test_gender, 'skill': test_skill})
test_data['gender_enc'] = le.transform(test_data['gender'])
test_data['pred'] = model.predict(test_data[['skill', 'gender_enc']])

print("\n=== Predicted hire rates (same skill distribution) ===")
print(test_data.groupby('gender')['pred'].mean())
print("\nModel learned the historical bias!")
```
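The class-imbalance row of the table can be illustrated with a short sketch. The 1% fraud rate and the always-"no fraud" predictor are illustrative assumptions. The point is that accuracy looks excellent while the model catches no fraud at all.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical labels: about 1% fraud (1) and 99% non-fraud (0)
y_true = rng.binomial(1, 0.01, 10_000)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
# Recall on the fraud class: share of actual fraud cases that were flagged
fraud_recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.3f}")
print(f"Fraud recall: {fraud_recall:.3f}")
```

This is one reason to report per-class metrics such as recall and precision rather than accuracy alone when classes are heavily imbalanced.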
Key Takeaways
Sampling error is random and shrinks with larger samples — it is manageable with confidence intervals.
Bias is systematic and never disappears with more data — a biased sample of a million is still biased.
Survivorship bias is pervasive in business, medicine, and social science — always ask who is missing.
Probability sampling, not convenience sampling, is the strongest defense against selection bias.
The size of your sample does not correct for bias — it only makes a biased answer more confidently wrong.