Sampling Bias and Errors

Why Your Data Might Be Lying to You

Even the most carefully designed study can be undermined by bias. Understanding bias is not pessimism — it is statistical integrity.

Studying bias helps you with four things:

Selection Bias — Recognizing when your sample systematically excludes the people who matter most
Nonresponse Bias — Understanding why silence from certain groups distorts the entire picture
Survivorship Bias — Seeing the ghosts of the data that disappeared and skewed your conclusions
Measurement Bias — Detecting when your instruments measure something other than what you intended

The most dangerous bias is the one you cannot see. Train yourself to ask: who is missing from this data?

What is Sampling Bias?

Definition

Sampling bias occurs when some members of the population are systematically more likely to end up in the sample than others, so the sample no longer represents the population it was drawn from.

Types of Error in Statistics

Sampling Error

The unavoidable random difference between a sample statistic and the true population parameter. It decreases with larger samples and can be quantified.

$$\text{Sampling Error} = \bar{x} - \mu$$

Here,

$\bar{x}$ = sample mean
$\mu$ = population mean

This is expected and manageable. Confidence intervals are designed to account for it.
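To make the point about sample size concrete, here is a minimal simulation sketch. The population parameters (mean 170, standard deviation 10) and the sample sizes are assumptions chosen for illustration, and the 95% interval uses the usual normal approximation, $\bar{x} \pm 1.96 \cdot s / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed population parameters (illustrative only)
mu, sigma = 170.0, 10.0

half_widths = {}
for n in (25, 400, 10_000):
    sample = rng.normal(mu, sigma, n)
    x_bar = sample.mean()
    # Standard error of the mean, estimated from the sample itself
    se = sample.std(ddof=1) / np.sqrt(n)
    half_widths[n] = 1.96 * se
    print(f"n={n:>6}: sampling error = {x_bar - mu:+.3f}, "
          f"95% CI half-width = {half_widths[n]:.3f}")
```

The half-width shrinks roughly in proportion to $1/\sqrt{n}$, which is why quadrupling the sample size only halves the uncertainty.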

Non-Sampling Error (Bias)

Systematic errors that don't go away with larger samples. A biased survey of 1 million people is still biased.

$$\text{Bias} = E[\hat{\theta}] - \theta$$

Here,

$E[\hat{\theta}]$ = expected value of the estimator
$\theta$ = true parameter value
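The bias formula can be approximated by Monte Carlo: repeat the sampling many times, average the estimates to approximate $E[\hat{\theta}]$, and subtract the true value. The sketch below assumes a hypothetical sampling frame that excludes the lowest 30% of the population; the population, the cutoff, and the helper name `estimate_bias` are all illustrative choices, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population with a known true mean (theta)
population = rng.normal(loc=50.0, scale=10.0, size=200_000)
theta = population.mean()

# Assumed biased sampling frame: only units above the 30th percentile are reachable
frame = population[population > np.quantile(population, 0.30)]

def estimate_bias(n, reps=500):
    """Approximate E[theta_hat] - theta and the spread of theta_hat for sample size n."""
    estimates = np.array([rng.choice(frame, size=n).mean() for _ in range(reps)])
    return estimates.mean() - theta, estimates.std()

results = {n: estimate_bias(n) for n in (100, 1_000, 10_000)}
for n, (bias, spread) in results.items():
    print(f"n={n:>6}: bias = {bias:+.2f}, spread of estimates = {spread:.3f}")
```

The spread of the estimates falls as n grows, but the bias stays at about the same value: a larger sample from a flawed frame gives a more precise estimate of the wrong number.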

Types of Bias

Selection Bias

Certain individuals are more likely to be included in the sample due to how the sample was drawn.

Classic example: The Literary Digest 1936 US Election Poll

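Mailed about 10 million ballots to its own subscribers and to names drawn from automobile registrations and telephone directories
Got 2.4 million responses
Predicted Landon would beat Roosevelt 57% to 43%
Roosevelt won with about 61% of the popular vote to Landon's 37%

Problem: In 1936, car owners and phone subscribers were wealthier than the general population and leaned systematically more Republican. With only about 24% of ballots returned, nonresponse added a second source of bias.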

import numpy as np

np.random.seed(42)
n = 10000

# Population income in USD
population_income = np.random.normal(50000, 20000, n)

# High-income people support policy X less (e.g., a wealth tax).
# With these settings, overall true support works out to roughly 55%.
true_support_by_income = np.where(population_income > 60000, 0.35, 0.65)
population_support = np.random.binomial(1, true_support_by_income)

# Selection bias: in an online survey, higher-income people are more likely to respond
prob_respond = np.clip(0.1 + (population_income - 30000) / 200000, 0.05, 0.95)
responded = np.random.binomial(1, prob_respond)

# The survey only sees the people who responded
selected = population_support[responded == 1]
print(f"True support: {true_support_by_income.mean():.3f}")
print(f"Biased sample support: {selected.mean():.3f}")
print(f"Bias: {selected.mean() - true_support_by_income.mean():.3f}")

Nonresponse Bias

Definition: Nonresponse Bias

People who don't respond to a survey differ systematically from those who do.

Examples:

Happy customers ignore feedback surveys; dissatisfied customers respond
Busy people (who may have different characteristics) skip phone surveys
Sensitive topics (income, drug use) get more refusals from those actually affected

Detection:

# A common real-world check compares early vs late respondents, treating
# late respondents as a proxy for nonrespondents. (Heckman's correction is
# a different, model-based technique.)
# In this simulation the truth is known, so the bias can be measured directly.

# Simulate survey with nonresponse
np.random.seed(0)
n = 1000
true_job_satisfaction = np.random.normal(6.5, 2, n)  # 1-10 scale

# Dissatisfied people less likely to respond
prob_respond = np.clip(0.3 + 0.08 * true_job_satisfaction, 0.1, 0.95)
responded = np.random.binomial(1, prob_respond)

all_responses = true_job_satisfaction[responded == 1]
print(f"True mean satisfaction: {true_job_satisfaction.mean():.3f}")
print(f"Survey mean satisfaction (biased): {all_responses.mean():.3f}")
print(f"Nonresponse bias: {all_responses.mean() - true_job_satisfaction.mean():+.3f}")
print("Survey overestimates satisfaction because unhappy people didn't respond!")
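The early-versus-late comparison mentioned under Detection can be sketched as follows. This is a simplified model under stated assumptions: each person has a per-attempt response probability that rises with satisfaction, the survey makes at most three contact attempts, and anyone who would need more is a nonrespondent. The coefficients and the three-attempt limit are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000
satisfaction = rng.normal(6.5, 2.0, n)  # same scale as the simulation above

# Assumed per-attempt response probability: satisfied people answer more readily
p_per_attempt = np.clip(0.05 + 0.05 * satisfaction, 0.05, 0.90)

# Number of contact attempts each person would need before responding
attempts_needed = rng.geometric(p_per_attempt)
max_attempts = 3  # hypothetical fieldwork limit

early = satisfaction[attempts_needed == 1]
late = satisfaction[(attempts_needed > 1) & (attempts_needed <= max_attempts)]
nonrespondents = satisfaction[attempts_needed > max_attempts]

print(f"Early respondents mean: {early.mean():.2f}")
print(f"Late respondents mean:  {late.mean():.2f}")
print(f"Nonrespondents mean:    {nonrespondents.mean():.2f} (unobservable in practice)")
```

If late respondents differ noticeably from early ones, that is a warning sign that nonrespondents differ even more. In a real survey only the first two groups are observed, so the comparison is suggestive rather than conclusive.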

Survivorship Bias

Definition: Survivorship Bias

Analyzing only survivors (successes) while ignoring those that failed and are no longer visible.

Examples:

Studying successful startups to find success strategies (ignoring the thousands that failed with the same strategies)
WWII bomber damage (Abraham Wald): armor belonged where returning planes showed no bullet holes, because planes hit in those areas never made it back to be counted
Investment fund performance: funds that failed were removed from databases

# Survivorship bias in investment funds
np.random.seed(7)
n_funds = 1000
years = 10

# Simulate annual returns: mean 5% with 20% std dev
returns = np.random.normal(0.05, 0.20, (n_funds, years))

# Survival must depend on performance for the bias to appear.
# Assumption for this simulation: a fund closes after any year
# with a loss worse than -20%.
survival = (returns > -0.20).cumprod(axis=1)

# Surviving funds (still alive at year 10)
survived = survival[:, -1] == 1
n_survived = survived.sum()
print(f"Funds surviving 10 years: {n_survived}/{n_funds} ({100*n_survived/n_funds:.1f}%)")

# Compare average returns.
# For simplicity, closed funds keep their simulated returns for all 10 years.
all_fund_returns = returns.mean(axis=1)
survivor_returns = returns[survived].mean(axis=1)

print(f"All funds average return: {all_fund_returns.mean():.3f}")
print(f"Surviving funds average return: {survivor_returns.mean():.3f}")
print(f"Survivorship bias inflates returns by: {survivor_returns.mean() - all_fund_returns.mean():.3f}")

Measurement Bias

Definition: Measurement Bias

Systematic errors in how variables are measured, causing values to be consistently too high or too low.

Examples:

Self-reported weight: people tend to underreport
Social desirability bias: answering to appear socially acceptable
Question ordering effects
Interviewer effects (different interviewers get different answers)

# Social desirability bias: hours of exercise per week
np.random.seed(3)
n = 500
true_hours = np.random.exponential(scale=3, size=n)  # true behavior

# People exaggerate to the interviewer
exaggeration = np.random.normal(loc=1.5, scale=0.5, size=n)  # add ~1.5 hours bias
reported_hours = true_hours + exaggeration

print(f"True mean exercise: {true_hours.mean():.2f} hours/week")
print(f"Reported mean exercise: {reported_hours.mean():.2f} hours/week")
print(f"Measurement bias: +{reported_hours.mean() - true_hours.mean():.2f} hours")
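One way to detect this kind of bias is to validate reported values against an objective measurement on a subsample. The sketch below assumes a hypothetical validation study in which 60 of the 500 respondents also wore an activity tracker; the subsample size and the simple mean-shift correction are illustrative assumptions, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
true_hours = rng.exponential(scale=3.0, size=n)        # unobserved in a real survey
reported_hours = true_hours + rng.normal(1.5, 0.5, n)   # self-reports, inflated

# Hypothetical validation subsample with an objective measurement
idx = rng.choice(n, size=60, replace=False)
diff = reported_hours[idx] - true_hours[idx]

# Estimated measurement bias with an approximate 95% confidence interval
bias_hat = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))
print(f"Estimated bias: {bias_hat:.2f} hours "
      f"(95% CI {bias_hat - 1.96 * se:.2f} to {bias_hat + 1.96 * se:.2f})")

# Simple correction: shift the full-sample mean by the estimated bias
corrected_mean = reported_hours.mean() - bias_hat
print(f"Reported mean:  {reported_hours.mean():.2f}")
print(f"Corrected mean: {corrected_mean:.2f}")
print(f"True mean:      {true_hours.mean():.2f}")
```

A constant shift is only adequate when the exaggeration does not depend on the true value; if heavy exercisers exaggerate differently from light ones, a regression-based calibration would be a better fit.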

Detecting and Reducing Bias

Bias Type	Detection	Reduction
Selection	Compare selected vs. population on known variables	Probability sampling, weighting
Nonresponse	Follow-up of nonrespondents, callback analysis	Maximize response rate, imputation
Survivorship	Check if missing data is random	Include all units, censoring analysis
Measurement	Validate against objective measures	Anonymize surveys, indirect questions
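The weighting entry in the table can be illustrated with post-stratification: estimate the outcome within each group, then combine the group estimates using the known population shares rather than the shares among respondents. This sketch reuses the setup of the selection bias simulation and assumes the population share of each income group is known (for example, from a census); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

income = rng.normal(50_000, 20_000, n)
high_income = income > 60_000

# Support depends on income group; response probability rises with income
support = rng.binomial(1, np.where(high_income, 0.35, 0.65))
prob_respond = np.clip(0.1 + (income - 30_000) / 200_000, 0.05, 0.95)
responded = rng.binomial(1, prob_respond).astype(bool)

true_mean = support.mean()
raw_estimate = support[responded].mean()

# Post-stratification: weight each stratum mean by its known population share
weighted_estimate = 0.0
for group in (True, False):
    in_group = high_income == group
    population_share = in_group.mean()  # assumed known from external data
    stratum_mean = support[responded & in_group].mean()
    weighted_estimate += population_share * stratum_mean

print(f"True support:        {true_mean:.3f}")
print(f"Raw survey estimate: {raw_estimate:.3f}")
print(f"Weighted estimate:   {weighted_estimate:.3f}")
```

Weighting works here because, within each income group, responding is unrelated to support. When response still depends on the outcome inside a stratum, weighting reduces the bias but does not remove it.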

Sampling Bias in Machine Learning

Bias Type	ML Example	Consequence
Selection bias	Training on data from one region	Model fails in other regions
Survivorship bias	Only training on active users	Model ignores churned users
Label bias	Human annotators' biases	Model learns those biases
Temporal bias	Training on old data	Model outdated for current patterns
Class imbalance	99% non-fraud, 1% fraud	Model predicts "no fraud" always

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

np.random.seed(42)

# Simulate gender bias in hiring data
n = 2000
gender = np.random.choice(['M', 'F'], n, p=[0.6, 0.4])
skill = np.random.normal(50, 15, n)  # true skill (same distribution)
# Historical bias: women with same skill less likely to be hired
hired_prob = np.where(gender == 'M',
                      np.clip(0.3 + 0.01 * skill, 0, 1),
                      np.clip(0.15 + 0.01 * skill, 0, 1))
hired = np.random.binomial(1, hired_prob)

data = pd.DataFrame({'gender': gender, 'skill': skill, 'hired': hired})
print("=== Hiring rates by gender ===")
print(data.groupby('gender')['hired'].mean())

# Train a model on biased data: it learns the bias
le = LabelEncoder()
data['gender_enc'] = le.fit_transform(data['gender'])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data[['skill', 'gender_enc']], data['hired'])

# Test on balanced data (fair test set)
test_gender = np.repeat(['M', 'F'], 500)
test_skill = np.random.normal(50, 15, 1000)
test_data = pd.DataFrame({'gender': test_gender, 'skill': test_skill})
test_data['gender_enc'] = le.transform(test_data['gender'])

preds = model.predict(test_data[['skill', 'gender_enc']])
test_data['pred'] = preds

print("\n=== Predicted hire rates (same skill distribution) ===")
print(test_data.groupby('gender')['pred'].mean())
print("\nModel learned the historical bias!")
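The class imbalance row in the table can be demonstrated without any model at all. The sketch below assumes a 1% fraud rate, as in the table, and scores a degenerate "classifier" that always predicts the majority class; it uses only NumPy so the metrics are computed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed fraud rate of 1%, matching the table row
y_true = rng.binomial(1, 0.01, n)

# Degenerate "model" that always predicts the majority class (no fraud)
y_pred = np.zeros(n, dtype=int)

accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / y_true.sum()  # share of actual fraud cases caught

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall on fraud: {recall:.3f}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why imbalanced problems are usually judged on per-class metrics such as recall and precision rather than on accuracy alone.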

Key Takeaways

Sampling error is random and shrinks with larger samples — it is manageable with confidence intervals.

Bias is systematic and never disappears with more data — a biased sample of a million is still biased.

Survivorship bias is pervasive in business, medicine, and social science — always ask who is missing.

Probability sampling, not convenience sampling, is the strongest defense against selection bias.

The size of your sample does not correct for bias — it only makes a biased answer more confidently wrong.
