Population vs Sample
Every Study Starts With One Question: Who Are We Measuring?
Every statistical study starts with a fundamental question: who or what are we studying? The answer determines everything — from which statistics you calculate to what conclusions you can draw.
- Define your population — Know exactly who or what your conclusions apply to
- Understand sampling — Learn why studying a subset is often the only option
- Parameters vs statistics — Master the Greek letters that separate populations from samples
- Census vs sample — Discover why even a full count can be wrong
Get this distinction right, and inferential statistics becomes logical. Get it wrong, and your conclusions stand on sand.
What is Population vs Sample?
Definition
A population is the complete set of all individuals, objects, or measurements of interest. A sample is a subset of that population that is actually observed and measured.
Understanding the distinction between population and sample is the foundation of statistical inference.
[Figure: Population vs Sample diagram]
Population
Definition
A population is the complete set of all individuals, objects, or measurements of interest for a particular study.
| Study | Population | Size |
|---|---|---|
| Approval rating of a president | All eligible voters in the country | ~250 million |
| Average height of NBA players | All current NBA players | ~450 |
| Effectiveness of a drug | All people who could ever take the drug | Hypothetical (effectively unbounded) |
| Quality control of chips | All chips produced by the factory | ~1 million/day |
Populations can be:
- Finite: all 7,500 employees at a company
- Infinite: all possible measurements a machine could produce
- Hypothetical: all people who could take an experimental drug
Sample
Definition
A sample is a subset of the population that is actually observed and measured.
Why sample instead of study the whole population?
| Reason | Example |
|---|---|
| Cost | Surveying 2,000 people costs far less than 2 million |
| Time | Census takes years; a survey takes months |
| Destructive testing | Testing a lightbulb to failure destroys it |
| Infinite population | You cannot measure every future product |
| Practical impossibility | Can't reach every person on Earth |
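Drawing a sample can be sketched in a few lines. The example below assumes a hypothetical company of 7,500 employees identified by made-up integer IDs, and draws a simple random sample of 200 without replacement. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical finite population: employee IDs 1..7500 (illustrative, not real data)
rng = np.random.default_rng(seed=0)
employee_ids = np.arange(1, 7501)

# Simple random sample: every employee has the same chance of selection,
# and replace=False ensures nobody is picked twice
sample_ids = rng.choice(employee_ids, size=200, replace=False)

sampling_fraction = len(sample_ids) / len(employee_ids)
print(f"Population size N = {len(employee_ids)}")
print(f"Sample size n = {len(sample_ids)}")
print(f"Sampling fraction n/N = {sampling_fraction:.3f}")
```

Only the 200 selected employees would be surveyed; conclusions are then extended to all 7,500.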
Parameters vs Statistics
Population Parameters
| Parameter | Symbol | Formula |
|---|---|---|
| Mean | μ | μ = (1/N)Σxᵢ |
| Std Dev | σ | σ = √[(1/N)Σ(xᵢ-μ)²] |
| Proportion | π | π = X/N |
Fixed but unknown. We estimate them using statistics.
Sample Statistics
| Statistic | Symbol | Formula |
|---|---|---|
| Mean | x̄ | x̄ = (1/n)Σxᵢ |
| Std Dev | s | s = √[(1/(n-1))Σ(xᵢ-x̄)²] |
| Proportion | p̂ | p̂ = x/n |
Known but variable. Different samples give different values.
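The proportion row works the same way as the mean. As a minimal sketch, assume a made-up population of 50,000 voters in which each voter approves (1) or does not approve (0). The 0.54 approval rate, the sample size, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population: 1 = approves, 0 = does not approve
N = 50_000
population = (rng.random(N) < 0.54).astype(int)

# Parameter: computed from every unit in the population (normally unknowable)
pi = population.sum() / N

# Statistic: computed only from the n units we actually observed
n = 1_000
sample = rng.choice(population, size=n, replace=False)
p_hat = sample.sum() / n

print(f"Parameter π = {pi:.4f}")
print(f"Statistic p̂ = {p_hat:.4f}")
print(f"Difference  = {abs(p_hat - pi):.4f}")
```

Changing the seed leaves π essentially where it is but moves p̂ around, which is the "fixed vs variable" contrast in miniature.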
Standard Error of the Mean
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Here,
- $SE_{\bar{x}}$ = Standard error of the sample mean
- $s$ = Sample standard deviation
- $n$ = Sample size
Key Insight
Parameters are fixed but unknown. Statistics are known but variable (different samples give different values). The standard error quantifies how much statistics vary across samples.
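To make the standard error concrete, the short sketch below plugs numbers into SE = s/√n. The value s = 10 and the three sample sizes are assumed purely for illustration.

```python
import math

s = 10.0  # assumed sample standard deviation (illustrative)

results = {}
for n in [25, 100, 400]:
    results[n] = s / math.sqrt(n)  # SE = s / sqrt(n)
    print(f"n = {n:3d}: SE = {results[n]:.2f}")
```

With these inputs the standard error is 2.00, 1.00, and 0.50: each fourfold increase in n cuts it in half.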
```python
import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # heights (cm) of 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0 for population

print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1 for sample (unbiased variance estimate)
    se = s / np.sqrt(n)     # standard error of the mean
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")
```
Output:
```
Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s=9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013
```
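Notice: Larger samples give a smaller standard error, so the sample mean tends to land closer to the true parameter. This holds on average rather than for every draw: in the output above, the n=30 sample missed by more than the n=10 sample.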
The Sampling Distribution
Definition
If we draw many different samples and compute a statistic each time, the distribution of those statistics is the sampling distribution.
```python
import matplotlib.pyplot as plt

# Reuses np, population, mu, and sigma from the previous block

# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())
sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title('Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
```
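The same check can be repeated for several sample sizes. The sketch below is self-contained: it rebuilds a synthetic population with the same assumed mean and spread as above, but uses NumPy's `default_rng`, so its numbers will differ from the earlier output. The sample sizes and the 2,000 repetitions per size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
population = rng.normal(loc=170, scale=10, size=10_000)
sigma = population.std(ddof=0)

ratios = {}
for n in [10, 40, 160]:
    # Draw 2,000 samples of size n and record each sample mean
    means = np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(2_000)
    ])
    empirical = means.std(ddof=1)     # spread of the simulated sampling distribution
    theoretical = sigma / np.sqrt(n)  # what the formula predicts
    ratios[n] = empirical / theoretical
    print(f"n={n:3d}: empirical={empirical:.3f}, sigma/sqrt(n)={theoretical:.3f}")
```

Because the samples are drawn without replacement from a finite population, the empirical spread runs very slightly below σ/√n. With n much smaller than N, the difference is negligible.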
Census vs Sample
Definition
A census attempts to measure the entire population.
| | Census | Sample |
|---|---|---|
| Coverage | All units | Subset |
| Cost | Very high | Lower |
| Time | Long | Shorter |
| Accuracy | No sampling error | Sampling error present |
| Feasibility | Limited | Broad |
| Non-response | Larger problem | Manageable |
The US Census Bureau conducts a decennial census. It takes years of planning and billions of dollars, and it still has coverage errors: some people are missed and others are counted more than once.
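The trade-off in the table can be illustrated with a small simulation. Everything here is assumed for illustration: a synthetic population of 100,000 values, an imperfect "census" that fails to reach 40% of the units in the lowest quartile, and a simple random sample of 2,000. It is a sketch of how coverage error can outweigh sampling error, not a model of any real census.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic population (illustrative values, e.g. an income index)
population = rng.normal(loc=50, scale=10, size=100_000)
mu = population.mean()

# Imperfect census: 40% of units in the lowest quartile are never reached
cutoff = np.quantile(population, 0.25)
missed = (population < cutoff) & (rng.random(population.size) < 0.40)
census_mean = population[~missed].mean()

# Simple random sample of 2,000 units with full coverage
sample = rng.choice(population, size=2_000, replace=False)
sample_mean = sample.mean()

census_error = abs(census_mean - mu)
sample_error = abs(sample_mean - mu)
print(f"True mean μ          = {mu:.3f}")
print(f"Census (undercount)  = {census_mean:.3f}, error = {census_error:.3f}")
print(f"Random sample n=2000 = {sample_mean:.3f}, error = {sample_error:.3f}")
```

In this setup the census error is systematic: counting more units of the same kind would not remove it, whereas the sample's error shrinks as n grows.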
Population vs Sample in Machine Learning
| Statistics Term | ML Equivalent | What It Means |
|---|---|---|
| Population | All possible data | Everything the model could ever see |
| Sample | Training set | What the model actually learns from |
| Parameter (μ, σ) | Model weights (W, b) | True values we want to learn |
| Statistic (x̄, s) | Loss/Accuracy on train | What we measure from our sample |
| Sampling error | Generalization gap | Difference between train and test performance |
Example — Train/Test Split as Sampling:
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Population: synthetic data standing in for all house records
np.random.seed(42)
n_total = 1000
X = np.random.randn(n_total, 3)  # 3 features
y = 2*X[:,0] + 3*X[:,1] - X[:,2] + np.random.randn(n_total)*0.5

# Sample: training set (80% of population)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Population size: {n_total}")
print(f"Training sample: {len(X_train)}")
print(f"Test sample: {len(X_test)}")

# Statistics from sample (training set)
print(f"\nSample mean of X[:,0]: {X_train[:,0].mean():.3f}")
print(f"Population mean of X[:,0]: {X[:,0].mean():.3f}")
print(f"Sampling error: {abs(X_train[:,0].mean() - X[:,0].mean()):.3f}")
```
Output:
```
Population size: 1000
Training sample: 800
Test sample: 200

Sample mean of X[:,0]: 0.018
Population mean of X[:,0]: 0.003
Sampling error: 0.015
```
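The last row of the table, the generalization gap, can be sketched with the same kind of synthetic data. The example below assumes a plain least-squares fit via `np.linalg.lstsq` and a manual 80/20 split in place of `train_test_split`. The coefficients 2, 3, and -1 mirror the data-generating line above; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_total = 1000
X = rng.standard_normal((n_total, 3))
true_w = np.array([2.0, 3.0, -1.0])  # same coefficients as the example above
y = X @ true_w + rng.standard_normal(n_total) * 0.5

# Manual 80/20 split: shuffle the row indices, then slice
idx = rng.permutation(n_total)
train_idx, test_idx = idx[:800], idx[800:]

# Estimate the weights from the training sample only
w_hat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

def mse(rows):
    """Mean squared error of the fitted weights on the given rows."""
    residuals = y[rows] - X[rows] @ w_hat
    return float(np.mean(residuals ** 2))

train_mse, test_mse = mse(train_idx), mse(test_idx)
print("Estimated weights:", np.round(w_hat, 3))
print(f"Train MSE = {train_mse:.4f}")
print(f"Test MSE  = {test_mse:.4f}")
print(f"Gap (test - train) = {test_mse - train_mse:.4f}")
```

The estimated weights play the role of statistics: they come from the 800 training rows and would shift slightly under a different split, while `true_w` plays the role of the fixed parameter.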
Key Takeaways
Summary: Population vs Sample
- Population = the complete group of interest; Sample = subset we actually measure
- Parameters describe populations (Greek letters: μ, σ, π); Statistics describe samples (Latin: x̄, s, p̂)
- We use statistics to estimate parameters — the core engine of inferential statistics
- Larger samples -> smaller sampling error, but with diminishing returns: the standard error shrinks as 1/√n, so halving it takes four times as many observations
- The sampling distribution of a statistic tells us how it varies across repeated samples
- Every inference has uncertainty — quantifying that uncertainty is the job of statistics