Sampling Techniques
Sampling Theory
The Art of Choosing Who Gets Counted
How you sample determines what you can conclude. The wrong method turns a million-dollar study into a confident wrong answer.
Key things this concept helps with:
- Simple Random Sampling — The gold standard when you have a complete list and every voice deserves equal weight
- Stratified Sampling — Guaranteed representation from every subgroup, delivering sharper estimates where it matters
- Cluster Sampling — Practical precision when the population is spread across geography or organizations
- Design Effect — Quantifying exactly how much efficiency you gain or lose with complex designs
Every dataset begins with a sample — choose your sampling method wisely, or your conclusions stand on sand.
What is Sampling?
Definition
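Sampling is the process of selecting a subset of a population in order to estimate characteristics of the whole. Choosing the right sampling method is crucial: different methods trade off statistical efficiency, practical feasibility, and cost.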
Simple Random Sampling (SRS)
Definition: Simple Random Sampling
Every individual in the population has an equal probability of selection, and every possible sample of size n is equally likely to be drawn.
When to use: When you have a complete list (sampling frame) and no reason to believe subgroups differ.
import numpy as np
import pandas as pd
np.random.seed(42)
# Simulate a population of 1,000 employees
population = pd.DataFrame({
    'employee_id': range(1, 1001),
    'department': np.random.choice(['Engineering', 'Sales', 'HR', 'Marketing'], 1000,
                                   p=[0.4, 0.3, 0.15, 0.15]),
    'salary': np.random.normal(75000, 15000, 1000).round(2)
})
# Simple random sample: select 50 employees
srs = population.sample(n=50, random_state=42)
print("SRS — Department distribution:")
print(srs['department'].value_counts(normalize=True).round(3))
print(f"SRS mean salary: ${srs['salary'].mean():,.2f}")
print(f"True mean salary: ${population['salary'].mean():,.2f}")
Advantages: Unbiased, easy to implement, solid theoretical foundation.
Disadvantages: Requires a complete sampling frame and may miss small subgroups.
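Part of what makes SRS theoretically convenient is that the uncertainty of the sample mean can be estimated from the sample itself. The sketch below assumes sampling without replacement and applies the finite population correction; the `salaries` array and the helper name `srs_standard_error` are hypothetical and introduced only for illustration.

```python
import numpy as np

def srs_standard_error(sample, population_size):
    """Estimated SE of the sample mean under SRS without replacement."""
    n = len(sample)
    s = np.std(sample, ddof=1)                # sample standard deviation
    fpc = np.sqrt(1 - n / population_size)    # finite population correction
    return s / np.sqrt(n) * fpc

rng = np.random.default_rng(42)
salaries = rng.normal(75000, 15000, 1000)     # hypothetical population
sample = rng.choice(salaries, size=50, replace=False)

se = srs_standard_error(sample, len(salaries))
low = sample.mean() - 1.96 * se
high = sample.mean() + 1.96 * se
print(f"Sample mean: {sample.mean():,.2f}")
print(f"Estimated SE: {se:,.2f}")
print(f"Approx. 95% CI: ({low:,.2f}, {high:,.2f})")
```

The correction factor shrinks the standard error as the sample covers more of the population, and reaches zero when the whole population is sampled.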
Systematic Sampling
Definition: Systematic Sampling
Select every k-th individual from an ordered list, starting at a random point.
Sampling interval: k = N/n (population size / sample size)
def systematic_sample(df, n):
    """Select every k-th observation from a DataFrame."""
    N = len(df)
    k = N // n  # sampling interval
    start = np.random.randint(0, k)  # random starting point in [0, k)
    indices = range(start, N, k)
    return df.iloc[list(indices)[:n]]

systematic = systematic_sample(population, n=50)
print("\nSystematic Sample:")
print(f"Mean salary: ${systematic['salary'].mean():,.2f}")
print("Dept distribution:")
print(systematic['department'].value_counts(normalize=True).round(3))
Advantages: Simple to execute, good spread across the list.
Disadvantages: Periodicity bias if the list has a pattern that repeats at interval k.
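Periodicity bias is easiest to see with a contrived example. The sketch below assumes a hypothetical list of daily sales in which every seventh day is a peak day, and a sampling interval that happens to equal the weekly period; all names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily sales for 100 weeks: every 7th day is a peak day
N, n = 700, 100
sales = np.full(N, 100.0)
sales[6::7] = 200.0
true_mean = sales.mean()

k = N // n  # sampling interval = 7, identical to the weekly period

# Every possible start lands on the same weekday in every week
systematic_means = [float(sales[start::k][:n].mean()) for start in range(k)]

# SRS of the same size for comparison
srs_means = [float(rng.choice(sales, size=n, replace=False).mean())
             for _ in range(1000)]

print(f"True mean: {true_mean:.2f}")
print(f"Systematic means by start: {systematic_means}")
print(f"SRS means range: {min(srs_means):.2f} to {max(srs_means):.2f}")
```

Six of the seven possible starts return a mean of 100 and one returns 200, so no systematic sample lands near the true mean of about 114.29, while the SRS means scatter around it. Shuffling the list or choosing an interval that is not a multiple of the period avoids the problem.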
Stratified Sampling
Definition: Stratified Sampling
Divide population into strata (subgroups) based on a variable, then sample from each stratum.
Proportional Stratification
Sample size from each stratum ∝ stratum size.
Optimal Stratification
Sample more from strata with higher variability (Neyman allocation).
# Proportional stratified sample by department
n_total = 50
dept_sizes = population['department'].value_counts()
dept_proportions = dept_sizes / len(population)
print("Dept proportions:", dept_proportions.to_dict())

stratified_samples = []
for dept, prop in dept_proportions.items():
    n_dept = round(prop * n_total)  # rounding means the total may differ slightly from n_total
    dept_population = population[population['department'] == dept]
    sample = dept_population.sample(n=min(n_dept, len(dept_population)), random_state=42)
    stratified_samples.append(sample)
stratified = pd.concat(stratified_samples)
print(f"\nStratified sample n={len(stratified)}")
print(f"Mean salary: ${stratified['salary'].mean():,.2f}")

# Compare precision: stratified vs SRS
# Draw many samples with each design and compare the spread of the sample means
srs_means = [population.sample(50)['salary'].mean() for _ in range(1000)]
strat_means = []
for _ in range(1000):
    samples = [population[population['department'] == d].sample(round(p * 50))
               for d, p in dept_proportions.items()]
    strat_means.append(pd.concat(samples)['salary'].mean())
print(f"\nSE of SRS mean: ${np.std(srs_means):,.2f}")
print(f"SE of Stratified mean: ${np.std(strat_means):,.2f}")
# Note: salary here is simulated independently of department, so the two SEs
# will be close; the gain from stratification appears when stratum means differ.
print("Stratified sampling is more efficient when strata differ in means!")
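The code above covers proportional allocation only. As a rough sketch of the Neyman allocation mentioned under Optimal Stratification, each stratum's share of the sample can be made proportional to its size times its standard deviation. The stratum sizes and standard deviations below are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np

def proportional_allocation(stratum_sizes, n_total):
    """Allocate n_total across strata in proportion to stratum size N_h."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    return n_total * sizes / sizes.sum()

def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Allocate n_total across strata in proportion to N_h * S_h."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    sds = np.asarray(stratum_sds, dtype=float)
    weights = sizes * sds
    return n_total * weights / weights.sum()

# Hypothetical strata: sizes and salary SDs are made up for illustration
names = ['Engineering', 'Sales', 'HR', 'Marketing']
sizes = [400, 300, 150, 150]
sds = [20000, 30000, 5000, 10000]

prop = proportional_allocation(sizes, 50)
neyman = neyman_allocation(sizes, sds, 50)
for name, p, q in zip(names, prop, neyman):
    print(f"{name:<12} proportional={p:5.1f}  neyman={q:5.1f}")
```

The highly variable stratum receives more of the sample than its size alone would suggest, and the low-variance stratum receives less. In practice the allocations would be rounded to integers and capped at the stratum size, and the standard deviations usually have to be estimated from a pilot or a previous survey.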
Cluster Sampling
Definition: Cluster Sampling
Divide population into clusters (naturally occurring groups), randomly select clusters, then survey all members within selected clusters.
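Unlike stratified sampling, where each stratum should be internally homogeneous, we want clusters to be heterogeneous (diverse internally), so that each cluster resembles the population as a whole.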
# Cluster sampling: schools in a district
# Population: 100 schools, each with 50 students
n_schools = 100
students_per_school = 50
schools = pd.DataFrame({
    'school_id': range(1, n_schools + 1),
    'district': np.repeat(['North', 'South', 'East', 'West'], 25),
})
# Expand to students
students = schools.loc[schools.index.repeat(students_per_school)].reset_index(drop=True)
students['score'] = np.random.normal(70, 12, len(students))
# Select 10 clusters (schools) at random
selected_schools = np.random.choice(schools['school_id'], size=10, replace=False)
cluster_sample = students[students['school_id'].isin(selected_schools)]
print(f"Cluster sample: {len(cluster_sample)} students from {len(selected_schools)} schools")
print(f"Mean score: {cluster_sample['score'].mean():.2f}")
print(f"True mean score: {students['score'].mean():.2f}")
Advantages: Practical when the population is geographically spread; no complete list of individuals is needed.
Disadvantages: Less precise than SRS (intra-cluster correlation inflates variance); standard errors need a design effect correction.
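To make the variance inflation concrete, the sketch below simulates a hypothetical population in which students in the same school share a common school effect, then compares a naive standard error (treating students as an SRS) with one computed from the school means. It assumes equal-sized clusters and one-stage sampling; the effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 100 schools x 50 students with a shared school effect
n_schools, m = 100, 50
school_effect = rng.normal(0, 6, size=n_schools)      # assumed between-school SD
scores = 70 + school_effect[:, None] + rng.normal(0, 10, size=(n_schools, m))

# One-stage cluster sample: pick 10 schools, keep every student in them
chosen = rng.choice(n_schools, size=10, replace=False)
sample = scores[chosen]                                # shape (10, 50)

# Naive SE treats the 500 students as if they were an SRS
naive_se = sample.std(ddof=1) / np.sqrt(sample.size)

# Design-based SE treats the 10 school means as the sampled units
cluster_means = sample.mean(axis=1)
fpc = np.sqrt(1 - len(chosen) / n_schools)             # finite population correction
cluster_se = cluster_means.std(ddof=1) / np.sqrt(len(chosen)) * fpc

print(f"Estimate: {sample.mean():.2f}")
print(f"Naive SE (ignores clustering): {naive_se:.2f}")
print(f"Cluster-based SE: {cluster_se:.2f}")
```

With a school effect of this size, the cluster-based standard error comes out several times larger than the naive one, which is the cost that the design effect quantifies.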
Comparison Summary
| Method | Cost | Precision | Best When |
|---|---|---|---|
| SRS | Medium | Moderate | Homogeneous population, complete frame available |
| Systematic | Low | Moderate | Ordered list, no periodicity |
| Stratified | Medium | High | Subgroups differ on outcome variable |
| Cluster | Low | Lower | No complete frame, clustered population |
| Multistage | Medium | Moderate | Large national surveys |
Design Effect (DEFF)
Definition: Design Effect
The design effect compares the variance of an estimator from a complex sample to what it would be under SRS:
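$$\text{DEFF} = \frac{\text{Var}_{\text{design}}(\hat{\theta})}{\text{Var}_{\text{SRS}}(\hat{\theta})}$$
Here,
- $\text{DEFF}$ = Design effect ratio
- $\text{Var}_{\text{design}}(\hat{\theta})$ = Variance of the estimator under the actual sampling design
- $\text{Var}_{\text{SRS}}(\hat{\theta})$ = Variance of the estimator under simple random sampling with the same sample size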
- DEFF greater than 1: Complex sample is less efficient than SRS (common in cluster sampling)
- DEFF less than 1: Complex sample is more efficient (common in stratified sampling)
# Design effect example for cluster sampling
# Intraclass correlation (ICC) measures similarity within clusters
def design_effect_cluster(icc, cluster_size):
    """Calculate design effect for cluster sampling."""
    return 1 + (cluster_size - 1) * icc

icc = 0.3  # moderate intraclass correlation
cluster_size = 50  # students per school
deff = design_effect_cluster(icc, cluster_size)
print(f"ICC = {icc}, cluster size = {cluster_size}")
print(f"Design Effect = {deff:.2f}")
print("Effective sample size = actual n / DEFF")
print(f"To get precision of n=500 SRS, need {500*deff:.0f} cluster sample observations")
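The formula 1 + (cluster_size - 1) * ICC can be checked against a simulation. The sketch below builds a hypothetical population with a known ICC, repeatedly draws cluster samples and simple random samples of the same size, and compares the ratio of the two variances with the formula. The population parameters are assumptions chosen for illustration, and the match is only approximate because the population is finite and randomly generated.

```python
import numpy as np

rng = np.random.default_rng(7)

n_schools, m = 200, 20           # hypothetical: 200 schools of 20 students
sigma_b, sigma_w = 5.0, 10.0     # assumed between- and within-school SDs
icc = sigma_b**2 / (sigma_b**2 + sigma_w**2)   # 25 / 125 = 0.2

school_effect = rng.normal(0, sigma_b, size=n_schools)
scores = 70 + school_effect[:, None] + rng.normal(0, sigma_w, size=(n_schools, m))
flat = scores.ravel()

n_clusters = 10
n_obs = n_clusters * m           # same number of students under both designs
cluster_means, srs_means = [], []
for _ in range(3000):
    picked = rng.choice(n_schools, size=n_clusters, replace=False)
    cluster_means.append(scores[picked].mean())
    srs_means.append(rng.choice(flat, size=n_obs, replace=False).mean())

empirical_deff = np.var(cluster_means) / np.var(srs_means)
formula_deff = 1 + (m - 1) * icc

print(f"Formula DEFF:   {formula_deff:.2f}")
print(f"Empirical DEFF: {empirical_deff:.2f}")
```

Both numbers should land in the same neighborhood, which suggests that a cluster sample of 200 students here carries roughly the information of an SRS of about 200 / DEFF students.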
Sampling in Machine Learning
| Sampling Method | ML Use Case | Why |
|---|---|---|
| Simple Random | Train/test split | Unbiased performance estimate |
| Stratified | Classification splits | Preserve class balance |
| Systematic | Time series splits | Respect temporal order |
| Cluster | Group-aware splits (e.g. `GroupKFold`) | Keep correlated records together to avoid leakage |
| Bootstrap | Bagging, Random Forests | Ensemble diversity |
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
import pandas as pd
np.random.seed(42)
# Simulated customer dataset
n = 2000
X = np.random.randn(n, 5)
y = (X[:,0] + X[:,1] > 0).astype(int) # binary classification
# 1. Simple Random Split (standard ML)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("=== Random Split ===")
print(f"Train class balance: {y_train.mean():.3f}")
print(f"Test class balance: {y_test.mean():.3f}")

# 2. Stratified Split (preserves class balance)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("\n=== Stratified Split ===")
print(f"Train class balance: {y_train_s.mean():.3f}")
print(f"Test class balance: {y_test_s.mean():.3f}")

# 3. K-Fold Cross-Validation (repeated sampling)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold+1}: train={len(train_idx)}, val={len(val_idx)}, "
          f"train_balance={y[train_idx].mean():.3f}")
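The table also lists bootstrap sampling, which the code above does not cover. A minimal sketch, assuming a dataset of 2,000 rows identified only by index: a bootstrap sample draws n rows with replacement, so some rows repeat and others are left out ("out-of-bag"). Bagging methods fit each model on a different draw of this kind.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
indices = np.arange(n)

# One bootstrap sample: n draws with replacement
boot_idx = rng.choice(indices, size=n, replace=True)

# Rows never drawn in this bootstrap sample ("out-of-bag")
oob_idx = np.setdiff1d(indices, boot_idx)

unique_frac = np.unique(boot_idx).size / n
print(f"Unique rows in bootstrap sample: {unique_frac:.3f}")
print(f"Out-of-bag rows: {oob_idx.size}")
```

For large n, about 63% of rows appear at least once in a given bootstrap sample (1 - 1/e), and the remaining out-of-bag rows can serve as a built-in validation set.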
Key Takeaways
SRS is the theoretical gold standard — but often impractical when populations are large or spread out.
Stratified sampling boosts precision when subgroups differ on your outcome variable.
Cluster sampling trades precision for practicality — use design effect to quantify the cost.
Always account for your sampling design when computing standard errors: ignoring clustering typically understates uncertainty, while ignoring stratification can overstate it.
The best sampling method is not the one that is easiest — it is the one that gets you closest to the truth with the resources you have.