Population vs Sample — The Foundation of Statistical Inference

Every Study Starts With One Question: Who Are We Measuring?

Every statistical study starts with a fundamental question: who or what are we studying? The answer determines everything — from which statistics you calculate to what conclusions you can draw.

  • Define your population — Know exactly who or what your conclusions apply to
  • Understand sampling — Learn why studying a subset is often the only option
  • Parameters vs statistics — Master the Greek letters that separate populations from samples
  • Census vs sample — Discover why even a full count can be wrong

Get this distinction right, and inferential statistics becomes logical. Get it wrong, and your conclusions stand on sand.


What Are a Population and a Sample?

Definition

A population is the complete set of all individuals, objects, or measurements of interest. A sample is a subset of that population that is actually observed and measured.

Understanding the distinction between population and sample is the foundation of statistical inference.


Population vs Sample Diagram

[Diagram: Population (N = 10,000), all individuals of interest, with parameters μ, σ, π that are usually unknown → Sampling → Sample (n = 100), the subset actually measured, with statistics x̄, s, p̂ that are known from data.]

Population

Definition

A population is the complete set of all individuals, objects, or measurements of interest for a particular study.

Study | Population | Size
Approval rating of a president | All eligible voters in the country | ~250 million
Average height of NBA players | All current NBA players | ~450
Effectiveness of a drug | All people who could ever take the drug | Infinite
Quality control of chips | All chips produced by the factory | ~1 million/day

Populations can be:

  • Finite: all 7,500 employees at a company
  • Infinite: all possible measurements a machine could produce
  • Hypothetical: all people who could take an experimental drug

Sample

Definition

A sample is a subset of the population that is actually observed and measured.

Why sample instead of study the whole population?

Reason | Example
Cost | Surveying 2,000 people costs far less than surveying 2 million
Time | A census takes years; a survey takes months
Destructive testing | Testing a lightbulb to failure destroys it
Infinite population | You cannot measure every future product
Practical impossibility | You can't reach every person on Earth

Parameters vs Statistics

Population Parameters

Parameter | Symbol | Formula
Mean | μ | μ = (1/N)Σxᵢ
Std Dev | σ | σ = √[(1/N)Σ(xᵢ-μ)²]
Proportion | π | π = X/N

Fixed but unknown. We estimate them using statistics.

Sample Statistics

Statistic | Symbol | Formula
Mean | x̄ | x̄ = (1/n)Σxᵢ
Std Dev | s | s = √[(1/(n-1))Σ(xᵢ-x̄)²]
Proportion | p̂ | p̂ = x/n

Known but variable. Different samples give different values.
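The formulas in the two tables above can be checked by hand on a tiny data set. The sketch below is only an illustration: the eight values, the choice of "value above 4" as the success event for the proportion, and the four-value sample are all made up for this example.

```python
import numpy as np

# Hypothetical toy data: treat these 8 values as the ENTIRE population.
population = np.array([2, 4, 4, 4, 5, 5, 7, 9])
N = population.size

mu = population.sum() / N                              # μ = (1/N) Σ xᵢ
sigma = np.sqrt(((population - mu) ** 2).sum() / N)    # σ divides by N
pi = (population > 4).sum() / N                        # π = X/N, "success" = value above 4

# Hypothetical sample: pretend only these 4 values were actually observed.
sample = np.array([4, 5, 7, 9])
n = sample.size

x_bar = sample.sum() / n                               # x̄ = (1/n) Σ xᵢ
s = np.sqrt(((sample - x_bar) ** 2).sum() / (n - 1))   # s divides by n - 1
p_hat = (sample > 4).sum() / n                         # p̂ = x/n

print(f"Parameters: μ = {mu:.3f}, σ = {sigma:.3f}, π = {pi:.3f}")
# Parameters: μ = 5.000, σ = 2.000, π = 0.500
print(f"Statistics: x̄ = {x_bar:.3f}, s = {s:.3f}, p̂ = {p_hat:.3f}")
# Statistics: x̄ = 6.250, s = 2.217, p̂ = 0.750
```

The hand-written formulas agree with NumPy's built-ins: `population.std(ddof=0)` reproduces σ and `sample.std(ddof=1)` reproduces s. A different four-value sample would give different statistics, while the parameters stay the same.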

Standard Error of the Mean

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Here,

  • $SE_{\bar{x}}$ = standard error of the sample mean
  • $s$ = sample standard deviation
  • $n$ = sample size

Key Insight

Parameters are fixed but unknown. Statistics are known but variable (different samples give different values). The standard error quantifies how much statistics vary across samples.

import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0 for population
print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1 for sample (unbiased)
    se = s / np.sqrt(n)
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")

Output:

Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s= 9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013

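Notice: Larger samples -> smaller standard error, so the sample mean tends to land closer to the true parameter. The trend holds on average, not for every draw: in the output above, the n=30 sample happened to miss μ by more than the n=10 sample did.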
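How quickly the standard error shrinks follows directly from SE = s/√n. The short sketch below holds s fixed at an assumed value of 10 cm (a made-up number, chosen to match the scale of the height example) and varies only n; `standard_error` is a helper defined here for illustration, not a library function.

```python
import numpy as np

def standard_error(s: float, n: int) -> float:
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / np.sqrt(n)

s = 10.0  # assumed sample standard deviation, in cm
for n in [25, 100, 400, 1600]:
    print(f"n={n:5d}: SE = {standard_error(s, n):.2f}")
# n=   25: SE = 2.00
# n=  100: SE = 1.00
# n=  400: SE = 0.50
# n= 1600: SE = 0.25
```

Each fourfold increase in sample size only halves the standard error, which is why very large samples buy progressively less precision.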


The Sampling Distribution

Definition

If we draw many different samples and compute a statistic each time, the distribution of those statistics is the sampling distribution.

# Reuses np, population, mu, and sigma from the previous example
import matplotlib.pyplot as plt

# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title(f'Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means  = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
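A sampling distribution exists for any statistic, not only the mean. As a rough sketch, the simulation below builds the sampling distribution of a sample proportion p̂. The population of 10,000 voters with π = 0.30 is hypothetical, and the comparison value √(π(1-π)/n) is the usual textbook approximation for the standard error of a proportion, which this lesson does not derive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 voters, exactly 30% of whom approve (coded 1).
N = 10_000
voters = np.zeros(N, dtype=int)
voters[:3_000] = 1
pi = voters.mean()  # population proportion π = 0.3

# Draw 5,000 samples of size n = 100 and record p̂ for each one.
n = 100
p_hats = np.array([
    rng.choice(voters, size=n, replace=False).mean()
    for _ in range(5_000)
])

# Textbook approximation for the standard error of a proportion.
theory_se = np.sqrt(pi * (1 - pi) / n)

print(f"π = {pi:.3f}")
print(f"Mean of p̂ values = {p_hats.mean():.3f}")
print(f"Std of p̂ values  = {p_hats.std():.4f}")
print(f"√(π(1-π)/n)      = {theory_se:.4f}")
```

The individual p̂ values scatter around π, and their spread should sit close to the approximation, which plays the same role for proportions that σ/√n plays for means.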

Census vs Sample

Definition

A census attempts to measure the entire population.

Aspect | Census | Sample
Coverage | All units | Subset
Cost | Very high | Lower
Time | Long | Shorter
Accuracy | No sampling error | Sampling error present
Feasibility | Limited | Broad
Non-response | Larger problem | Manageable

The US Census Bureau conducts a decennial census — it takes years and billions of dollars and still has coverage errors.
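Why a near-complete count can still be wrong is easier to see in a simulation. The sketch below is a toy model built on assumptions: the two income groups, their sizes and means, the 40% miss rate for the hard-to-reach group, and a random sample that can reach everyone are all invented for illustration and do not describe any real census.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 100,000 incomes (in $1,000s) in two groups.
group_a = rng.normal(loc=50, scale=10, size=80_000)  # easy to reach
group_b = rng.normal(loc=30, scale=10, size=20_000)  # hard to reach
population = np.concatenate([group_a, group_b])
mu = population.mean()

# "Census": tries to count everyone but misses about 40% of group B.
reached = rng.random(group_b.size) < 0.6
census = np.concatenate([group_a, group_b[reached]])

# Simple random sample of 2,000 people drawn from the full population.
sample = rng.choice(population, size=2_000, replace=False)

census_error = census.mean() - mu
sample_error = sample.mean() - mu

print(f"True mean μ          = {mu:.2f}")
print(f"Census (n={census.size:,}) error = {census_error:+.2f}")
print(f"Sample (n={sample.size:,}) error  = {sample_error:+.2f}")
```

Under these assumptions the census counts more than 90,000 people yet overstates the mean, because the people it misses differ systematically from the people it counts. The much smaller random sample has only sampling error, which shrinks as n grows, while coverage error does not.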


Population vs Sample in Machine Learning

[Diagram: All Data (Population), what we want to learn about, split into a Train Set (80% of data), a Validation Set (10% of data), and a Test Set (10% of data). Train on sample → Validate on holdout → Test once at the end.]

Statistics Term | ML Equivalent | What It Means
Population | All possible data | Everything the model could ever see
Sample | Training set | What the model actually learns from
Parameter (μ, σ) | Model weights (W, b) | True values we want to learn
Statistic (x̄, s) | Loss/Accuracy on train | What we measure from our sample
Sampling error | Generalization gap | Difference between train and test performance

Example — Train/Test Split as Sampling:

from sklearn.model_selection import train_test_split
import numpy as np

# Population: all house data
np.random.seed(42)
n_total = 1000
X = np.random.randn(n_total, 3)  # 3 features
y = 2*X[:,0] + 3*X[:,1] - X[:,2] + np.random.randn(n_total)*0.5

# Sample: training set (80% of population)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Population size: {n_total}")
print(f"Training sample: {len(X_train)}")
print(f"Test sample: {len(X_test)}")

# Statistics from sample (training set)
print(f"\nSample mean of X[:,0]: {X_train[:,0].mean():.3f}")
print(f"Population mean of X[:,0]: {X[:,0].mean():.3f}")
print(f"Sampling error: {abs(X_train[:,0].mean() - X[:,0].mean()):.3f}")

Output:

Population size: 1000
Training sample: 800
Test sample: 200

Sample mean of X[:,0]: 0.018
Population mean of X[:,0]: 0.003
Sampling error: 0.015
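The parameter/weights analogy can be pushed one step further by actually fitting a model. The sketch below reuses the lesson's synthetic setup (true weights 2, 3, -1 and noise scale 0.5) but makes some substitutions of its own: a manual NumPy split stands in for `train_test_split`, and a plain least-squares fit with no intercept stands in for a real model. The names `true_w`, `w_hat`, and `mse` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

# The "true" weights play the role of population parameters. They are known
# here only because the data is generated synthetically.
true_w = np.array([2.0, 3.0, -1.0])
X = rng.standard_normal((1000, 3))
y = X @ true_w + rng.standard_normal(1000) * 0.5

# Manual 80/20 split (a NumPy stand-in for train_test_split).
idx = rng.permutation(1000)
train, test = idx[:800], idx[800:]

# Least-squares fit on the training sample only: w_hat estimates true_w.
w_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def mse(X_part, y_part):
    """Mean squared error of the fitted weights on one data split."""
    return np.mean((X_part @ w_hat - y_part) ** 2)

train_mse = mse(X[train], y[train])
test_mse = mse(X[test], y[test])

print(f"True weights:      {true_w}")
print(f"Estimated weights: {np.round(w_hat, 3)}")
print(f"Train MSE = {train_mse:.3f}, Test MSE = {test_mse:.3f}")
print(f"Generalization gap = {test_mse - train_mse:+.3f}")
```

The estimated weights land near the true ones without matching them exactly, just as x̄ lands near μ. A different random split would give slightly different estimates and a slightly different gap.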

Key Takeaways

Summary: Population vs Sample

  1. Population = the complete group of interest; Sample = subset we actually measure
  2. Parameters describe populations (Greek letters: μ, σ, π); Statistics describe samples (Latin: x̄, s, p̂)
  3. We use statistics to estimate parameters — the core engine of inferential statistics
  4. Larger samples -> smaller sampling error, but with diminishing returns: the standard error shrinks with √n, so quadrupling n only halves it
  5. The sampling distribution of a statistic tells us how it varies across repeated samples
  6. Every inference has uncertainty — quantifying that uncertainty is the job of statistics
