Population vs Sample

Sampling Theory

Every Study Starts With One Question: Who Are We Measuring?

Every statistical study starts with a fundamental question: who or what are we studying? The answer determines everything — from which statistics you calculate to what conclusions you can draw.

Define your population — Know exactly who or what your conclusions apply to
Understand sampling — Learn why studying a subset is often the only option
Parameters vs statistics — Master the Greek letters that separate populations from samples
Census vs sample — Discover why even a full count can be wrong

Get this distinction right, and inferential statistics becomes logical. Get it wrong, and your conclusions stand on sand.

What is Population vs Sample?

Definition

A population is the complete set of all individuals, objects, or measurements of interest. A sample is a subset of that population that is actually observed and measured.

Understanding the distinction between population and sample is the foundation of statistical inference.

[Figure: Population vs Sample diagram]

Population

Definition

A population is the complete set of all individuals, objects, or measurements of interest for a particular study.

Study	Population	Size
Approval rating of a president	All eligible voters in the country	~250 million
Average height of NBA players	All current NBA players	~450
Effectiveness of a drug	All people who could ever take the drug	Infinite
Quality control of chips	All chips produced by the factory	~1 million/day

Populations can be:

Finite: all 7,500 employees at a company
Infinite: all possible measurements a machine could produce
Hypothetical: all people who could take an experimental drug

Sample

Definition

A sample is a subset of the population that is actually observed and measured.

Why sample instead of study the whole population?

Reason	Example
Cost	Surveying 2,000 people costs far less than 2 million
Time	Census takes years; a survey takes months
Destructive testing	Testing a lightbulb to failure destroys it
Infinite population	You cannot measure every future product
Practical impossibility	Can't reach every person on Earth

Parameters vs Statistics

Population Parameters

Parameter	Symbol	Formula
Mean	μ	μ = (1/N)Σxᵢ
Std Dev	σ	σ = √[(1/N)Σ(xᵢ-μ)²]
Proportion	π	π = X/N

Fixed but unknown. We estimate them using statistics.

Sample Statistics

Statistic	Symbol	Formula
Mean	x̄	x̄ = (1/n)Σxᵢ
Std Dev	s	s = √[(1/(n-1))Σ(xᵢ-x̄)²]
Proportion	p̂	p̂ = x/n

Known but variable. Different samples give different values.
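
The n-1 divisor in the formula for s is there so that s² does not systematically underestimate σ². A minimal simulation sketch of that point follows. The normal population with σ = 10, the sample size n = 5, and the number of repetitions are all illustrative assumptions, not values from a real study.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

sigma = 10.0          # assumed population standard deviation
n = 5                 # small sample size, where the bias is easiest to see
n_repeats = 100_000   # number of simulated samples

# Each row is one sample of size n from an (effectively infinite) normal population
samples = rng.normal(loc=170.0, scale=sigma, size=(n_repeats, n))

var_div_n = samples.var(axis=1, ddof=0).mean()          # divisor n
var_div_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # divisor n - 1

print(f"True variance sigma^2         = {sigma**2:.1f}")
print(f"Average variance, divisor n   = {var_div_n:.1f}")
print(f"Average variance, divisor n-1 = {var_div_n_minus_1:.1f}")
```

With the divisor n, the average lands near (n-1)/n × σ², which is too small. With the divisor n-1, it lands near σ² itself. The correction makes s² unbiased; s is still slightly biased for σ, though the effect fades as n grows.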

Standard Error of the Mean

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Here,

$SE_{\bar{x}}$ = standard error of the sample mean
$s$ = sample standard deviation
$n$ = sample size

Key Insight

Parameters are fixed but unknown. Statistics are known but variable (different samples give different values). The standard error quantifies how much statistics vary across samples.
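
As a worked example with made-up numbers (an assumption for illustration, not data from a real study): suppose a sample of n = 100 heights has a sample standard deviation of s = 10 cm. Plugging into the formula gives:

```latex
SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{100}} = \frac{10}{10} = 1 \text{ cm}
```

So means computed from samples of this size would typically differ from μ by roughly 1 cm.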

import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0 for population
print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1 for sample (unbiased)
    se = s / np.sqrt(n)
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")

Output:

Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s= 9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013

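Notice: larger samples give a smaller standard error, so the sample mean tends to land closer to the true parameter. The trend holds on average, not for every draw: here the n=30 sample happened to miss by more than the n=10 sample.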
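
The shrinking standard error also has diminishing returns, because it falls with the square root of n rather than with n itself. The short sketch below tabulates the theoretical value σ/√n for a few sample sizes. The value σ = 10 cm and the chosen sample sizes are assumptions for illustration.

```python
import math

sigma = 10.0  # assumed population standard deviation in cm

# Theoretical standard error of the mean for each sample size
se_by_n = {n: sigma / math.sqrt(n) for n in [25, 100, 400, 1600]}

previous_se = None
for n, se in se_by_n.items():
    # Ratio to the previous row shows how much each 4x increase in n buys
    note = "" if previous_se is None else f"(x{se / previous_se:.2f} vs previous row)"
    print(f"n={n:5d}  SE={se:.3f} cm  {note}")
    previous_se = se
```

Each row uses four times as much data as the one before it, yet the standard error only halves.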

The Sampling Distribution

Definition

If we draw many different samples and compute a statistic each time, the distribution of those statistics is the sampling distribution.

import matplotlib.pyplot as plt

# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title(f'Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means  = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
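
The same idea applies to the other statistics in the tables above, for example the sample proportion p̂. The sketch below assumes a hypothetical population in which a fraction π = 0.30 has some attribute, then draws many samples of n = 200. It compares the spread of the simulated p̂ values with √(π(1-π)/n), the usual standard error of a proportion, which is a standard result brought in here rather than something derived in this article.

```python
import numpy as np

rng = np.random.default_rng(1)

pi_true = 0.30       # assumed population proportion (hypothetical)
n = 200              # sample size
n_samples = 20_000   # number of repeated samples

# Number of "successes" in each sample, drawn from an effectively infinite population
counts = rng.binomial(n=n, p=pi_true, size=n_samples)
p_hats = counts / n  # one sample proportion per simulated sample

theoretical_se = np.sqrt(pi_true * (1 - pi_true) / n)

print(f"Mean of p-hat values = {p_hats.mean():.4f} (pi = {pi_true})")
print(f"Std of p-hat values  = {p_hats.std(ddof=1):.4f}")
print(f"sqrt(pi(1-pi)/n)     = {theoretical_se:.4f}")
```

As with the sample mean, the simulated values centre on the parameter, and their spread is set by the sample size.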

Census vs Sample

Definition

A census attempts to measure the entire population.

	Census	Sample
Coverage	All units	Subset
Cost	Very high	Lower
Time	Long	Shorter
Accuracy	No sampling error	Sampling error present
Feasibility	Limited	Broad
Non-response	Larger problem	Manageable

The US Census Bureau conducts a decennial census — it takes years and billions of dollars and still has coverage errors.
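
One way to see how a full count can still be wrong is to simulate a census that misses part of the population in a non-random way. Everything in the sketch below is an invented assumption for illustration, including the income distribution and the undercount concentrated entirely in the lowest 15% of incomes. None of it is Census Bureau data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population of 100,000 household incomes (right-skewed)
population = rng.lognormal(mean=10.8, sigma=0.6, size=100_000)
mu = population.mean()

# "Census" with a coverage error: assume the lowest-income 15% are
# never reached, so they are silently missing from the count.
cutoff = np.quantile(population, 0.15)
census = population[population > cutoff]
census_mean = census.mean()

# Simple random sample of 1,000 households from the full population
sample = rng.choice(population, size=1_000, replace=False)
sample_mean = sample.mean()

print(f"True mean            = {mu:,.0f}")
print(f"Census (85% covered) = {census_mean:,.0f}  error = {census_mean - mu:+,.0f}")
print(f"Random sample n=1000 = {sample_mean:,.0f}  error = {sample_mean - mu:+,.0f}")
```

The census error here is systematic: counting more of the households it already reaches would not reduce it. The sample error is random and shrinks as n grows, which is why a small, well-drawn sample can beat a large, incomplete count.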

Population vs Sample in Machine Learning

Statistics Term	ML Equivalent	What It Means
Population	All possible data	Everything the model could ever see
Sample	Training set	What the model actually learns from
Parameter (μ, σ)	Model weights (W, b)	True values we want to learn
Statistic (x̄, s)	Loss/Accuracy on train	What we measure from our sample
Sampling error	Generalization gap	Difference between train and test performance

Example — Train/Test Split as Sampling:

from sklearn.model_selection import train_test_split
import numpy as np

# Population: 1,000 synthetic records (standing in for, e.g., all house data)
np.random.seed(42)
n_total = 1000
X = np.random.randn(n_total, 3)  # 3 features
y = 2*X[:,0] + 3*X[:,1] - X[:,2] + np.random.randn(n_total)*0.5

# Sample: training set (80% of population)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Population size: {n_total}")
print(f"Training sample: {len(X_train)}")
print(f"Test sample: {len(X_test)}")

# Statistics from sample (training set)
print(f"\nSample mean of X[:,0]: {X_train[:,0].mean():.3f}")
print(f"Population mean of X[:,0]: {X[:,0].mean():.3f}")
print(f"Sampling error: {abs(X_train[:,0].mean() - X[:,0].mean()):.3f}")

Output:

Population size: 1000
Training sample: 800
Test sample: 200

Sample mean of X[:,0]: 0.018
Population mean of X[:,0]: 0.003
Sampling error: 0.015
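
The last row of the table, the generalization gap, can be sketched on the same kind of synthetic data. This version fits ordinary least squares with NumPy rather than a scikit-learn model, to keep the example dependency-free. The training size of 20 rows is deliberately tiny so the gap is visible, and results are averaged over 200 random splits. All of these choices are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "population" with the same linear relationship as the example above
n_total = 1000
X = rng.normal(size=(n_total, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n_total)
A = np.column_stack([X, np.ones(n_total)])  # features plus intercept column

def fit_and_score(train_idx, test_idx):
    """Fit least squares on the training rows; return (train MSE, test MSE)."""
    weights, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
    squared_errors = (y - A @ weights) ** 2
    return squared_errors[train_idx].mean(), squared_errors[test_idx].mean()

n_train = 20  # deliberately small so the gap is visible
results = []
for _ in range(200):
    idx = rng.permutation(n_total)  # a fresh random split each time
    results.append(fit_and_score(idx[:n_train], idx[n_train:]))

train_mse, test_mse = np.mean(results, axis=0)
print(f"Average train MSE = {train_mse:.3f}")
print(f"Average test MSE  = {test_mse:.3f}")
print(f"Average gap       = {test_mse - train_mse:.3f}")
```

On average the model looks better on the rows it was fitted to than on the rows it never saw. Any single split can deviate from that pattern, in the same way a single sample mean can.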

Key Takeaways

Summary: Population vs Sample

Population = the complete group of interest; Sample = subset we actually measure
Parameters describe populations (Greek letters: μ, σ, π); Statistics describe samples (Latin: x̄, s, p̂)
We use statistics to estimate parameters — the core engine of inferential statistics
Larger samples -> smaller sampling error, with diminishing returns: the standard error shrinks with √n, so halving it takes four times as much data
The sampling distribution of a statistic tells us how it varies across repeated samples
Every inference has uncertainty — quantifying that uncertainty is the job of statistics
