
Sampling Methods & Distributions


Why It Matters

In the real world, examining an entire population is almost never feasible: it is too expensive, too time-consuming, or literally impossible. Sampling is the disciplined art of selecting a subset to learn about the whole. Done well, a sample of 1,000 people can estimate vote shares in an election to within about 3 percentage points at 95% confidence. Done poorly, even a million data points can mislead. Understanding sampling methods, bias, and sampling distributions is the bedrock of statistics, A/B testing, clinical trials, and machine learning.


Overview

Every dataset used in machine learning is a sample from some larger data-generating process. Simple random sampling (SRS) gives every subset of a given size the same probability of selection, making it the gold standard for unbiasedness. Stratified sampling divides the population into subgroups (strata) and samples within each, guaranteeing representation and reducing variance when strata differ. Cluster sampling selects whole groups and surveys everyone in them, dramatically reducing cost for geographically dispersed populations. Systematic sampling picks every k-th individual after a random start: simple, but vulnerable to periodicity in the list. The sampling distribution of a statistic describes how it varies across all possible samples, and the standard error ($SE = \sigma/\sqrt{n}$ for the sample mean) quantifies that variability. The Central Limit Theorem guarantees that sample means are approximately normal for large $n$, regardless of the shape of the population distribution, provided its variance is finite.


Key Concepts

Key Estimators

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad \hat{p} = \frac{\text{successes}}{n}$$

Here,

  • $\bar{X}$ = Sample mean, an unbiased estimator of $\mu$
  • $s^2$ = Sample variance (Bessel-corrected), an unbiased estimator of $\sigma^2$
  • $\hat{p}$ = Sample proportion, an unbiased estimator of $\pi$
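The three estimators can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function name `sample_estimators`, the income values, and the "income above 3000" success rule are hypothetical and chosen only for illustration.

```python
import statistics

def sample_estimators(values, is_success):
    """Return (sample mean, Bessel-corrected variance, sample proportion)."""
    n = len(values)
    mean = sum(values) / n
    # statistics.variance divides by n - 1 (Bessel's correction)
    var = statistics.variance(values)
    # Proportion of observations that count as a "success"
    p_hat = sum(1 for v in values if is_success(v)) / n
    return mean, var, p_hat

# Hypothetical data: five monthly incomes; "success" means income above 3000
incomes = [2000, 3000, 4000, 5000, 6000]
mean, var, p_hat = sample_estimators(incomes, lambda v: v > 3000)
print(mean, var, p_hat)
```

Using `statistics.pvariance` instead would divide by $n$ and give the biased (population-style) variance.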

Standard Error of the Mean

$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

Here,

  • $\sigma$ = Population standard deviation
  • $n$ = Sample size
  • $SE$ = Standard deviation of the sampling distribution of $\bar{X}$
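One way to see what this formula means is to draw many samples and measure how much their means spread out. The sketch below does that on a synthetic population; the population parameters (mean 50, SD 10), the seeds, and the helper name `empirical_se` are assumptions made for illustration, and sampling is done with replacement so that $\sigma/\sqrt{n}$ applies without a finite population correction.

```python
import math
import random
import statistics

def empirical_se(population, n, draws=2000, seed=0):
    """Standard deviation of the sample mean across repeated random samples."""
    rng = random.Random(seed)
    # Sample with replacement so the sigma / sqrt(n) formula applies directly
    means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]
    return statistics.stdev(means)

# Synthetic population (an assumption for this sketch)
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]
sigma = statistics.pstdev(population)

n = 64
theoretical = sigma / math.sqrt(n)
simulated = empirical_se(population, n)
print(f"theoretical SE = {theoretical:.3f}, simulated SE = {simulated:.3f}")
```

The two numbers should agree closely, with the remaining gap coming from the finite number of simulated draws.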

Stratified Estimator

$$\bar{X}_{st} = \sum_{h=1}^{H} W_h \bar{X}_h, \quad \text{Var}(\bar{X}_{st}) = \sum_{h=1}^{H} W_h^2 \frac{\sigma_h^2}{n_h}$$

Here,

  • $H$ = Number of strata
  • $W_h = N_h / N$ = Population weight of stratum h (stratum size divided by population size)
  • $\bar{X}_h$ = Sample mean within stratum h
  • $\sigma_h^2$ = Variance of the outcome within stratum h
  • $n_h$ = Sample size allocated to stratum h
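The stratified estimator and its variance translate directly into a weighted sum. This is a minimal sketch: the dictionary layout, the function name, and the two-stratum numbers are hypothetical, and in practice the within-stratum variance would usually be replaced by its sample estimate $s_h^2$.

```python
def stratified_mean_and_variance(strata):
    """Each stratum is a dict with N_h (stratum size), mean (sample mean),
    var (within-stratum variance), and n_h (sample size)."""
    N = sum(s["N_h"] for s in strata)
    mean_st = 0.0
    var_st = 0.0
    for s in strata:
        W_h = s["N_h"] / N                      # population weight of the stratum
        mean_st += W_h * s["mean"]              # weighted sum of stratum means
        var_st += W_h**2 * s["var"] / s["n_h"]  # W_h^2 * sigma_h^2 / n_h
    return mean_st, var_st

# Hypothetical two-stratum population
strata = [
    {"N_h": 6000, "mean": 40.0, "var": 100.0, "n_h": 60},
    {"N_h": 4000, "mean": 70.0, "var": 400.0, "n_h": 40},
]
mean_st, var_st = stratified_mean_and_variance(strata)
print(mean_st, var_st)
```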

Sample Size for Mean

$$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$$

Here,

  • $E$ = Desired margin of error
  • $z_{\alpha/2}$ = Critical value (1.96 for 95%)
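A small helper makes the formula easy to reuse. The sketch below assumes a known $\sigma$ and rounds up, since a fractional respondent is not possible; the function name and the example inputs (an SD of 4,000 and a margin of 500) are illustrative choices, not values prescribed by the lesson.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error for the mean is at most `margin`."""
    # Two-sided critical value z_{alpha/2}, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

n_required = sample_size_for_mean(sigma=4000, margin=500)
n_halved_margin = sample_size_for_mean(sigma=4000, margin=250)
print(n_required, n_halved_margin)
```

Halving the margin roughly quadruples the required sample size, which is the $1/\sqrt{n}$ rate at work.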

Sample Size for Proportion

$$n = \frac{z_{\alpha/2}^2 \cdot \hat{p}(1 - \hat{p})}{E^2}$$

Here,

  • $\hat{p}$ = Prior estimate of proportion (use 0.5 if unknown)
  • $E$ = Desired margin of error
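The proportion version can be sketched the same way. The function name is hypothetical, and the default of `p=0.5` follows the "use 0.5 if unknown" advice above, which gives the largest (most conservative) sample size because $p(1-p)$ peaks at 0.5.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n whose margin of error for a proportion is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin**2)

n_3pt = sample_size_for_proportion(margin=0.03)  # +/- 3 percentage points
n_5pt = sample_size_for_proportion(margin=0.05)  # +/- 5 percentage points
print(n_3pt, n_5pt)
```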

Central Limit Theorem

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Here,

  • $\bar{X}_n$ = Sample mean of n observations
  • $\mu$ = Population mean
  • $\sigma$ = Population standard deviation
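The theorem can be checked empirically on a skewed population. The sketch below, under the assumption of an Exponential(1) population (mean 1, SD 1), standardizes simulated sample means and measures how often they fall within $\pm 1.96$; if the normal approximation holds, that share should approach 95% as $n$ grows. The function names, seed, and number of draws are illustrative.

```python
import math
import random

def standardized_means(n, draws=5000, seed=0):
    """Simulate (xbar - mu) / (sigma / sqrt(n)) for an Exponential(1) population."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and SD of Exponential(rate=1)
    out = []
    for _ in range(draws):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / math.sqrt(n)))
    return out

def coverage(zs, z_crit=1.96):
    """Share of standardized means inside [-z_crit, z_crit]."""
    return sum(abs(z) <= z_crit for z in zs) / len(zs)

for n in (2, 10, 100):
    print(n, round(coverage(standardized_means(n)), 3))
```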

Sampling Methods Comparison

| Method | How It Works | Best For | Key Advantage | Key Disadvantage |
|---|---|---|---|---|
| Simple Random | Equal chance for every subset | Homogeneous populations | Unbiased, easy to analyze | Requires complete sampling frame |
| Stratified | Sample within each known subgroup | Heterogeneous populations | Lower variance than SRS | Requires prior knowledge of strata |
| Cluster | Sample clusters, survey everyone | Geographically dispersed | Cheaper than SRS | Higher variance due to ICC |
| Systematic | Every k-th after random start | Ordered lists | Simple to implement | Vulnerable to periodicity |
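The four methods in the table can be sketched with the standard `random` module. This is an illustrative sketch, not a survey-grade implementation: the function names are hypothetical, the population is just the integers 0 to 99, the stratified version assumes equal allocation per stratum, and the cluster version surveys every unit in each chosen cluster.

```python
import random

def simple_random_sample(units, n, rng):
    # Every subset of size n is equally likely
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    # Group units by stratum, then draw an SRS inside each group
    groups = {}
    for u in units:
        groups.setdefault(stratum_of(u), []).append(u)
    picked = []
    for key in sorted(groups):
        picked.extend(rng.sample(groups[key], n_per_stratum))
    return picked

def cluster_sample(units, cluster_of, n_clusters, rng):
    # Draw whole clusters at random, then keep everyone in them
    groups = {}
    for u in units:
        groups.setdefault(cluster_of(u), []).append(u)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [u for c in chosen for u in groups[c]]

def systematic_sample(units, k, rng):
    # Random start in [0, k), then every k-th unit
    start = rng.randrange(k)
    return units[start::k]

rng = random.Random(42)
units = list(range(100))
srs = simple_random_sample(units, 10, rng)
strat = stratified_sample(units, lambda u: u % 2, 5, rng)  # strata: even / odd
clus = cluster_sample(units, lambda u: u // 10, 2, rng)    # clusters: blocks of 10
syst = systematic_sample(units, 10, rng)
```

Note how the systematic sample inherits any pattern with period 10 in the list order, which is the periodicity risk named in the table.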

Sampling Bias Types

| Type | Description | Example |
|---|---|---|
| Selection bias | Sampling method excludes groups | Voluntary response surveys |
| Non-response bias | Selected individuals decline | Phone surveys missing workers |
| Survivorship bias | Only "surviving" cases observed | Studying successful companies only |
| Undercoverage bias | Some members have zero selection chance | Online-only surveys |
| Convenience sampling | Easiest-to-reach individuals | Surveying friends and family |

Quick Example

Standard Error and Sample Size

The standard deviation of monthly incomes is $4,000. You sample 64 people.

$$SE = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{64}} = \frac{4000}{8} = 500$$

The SE is \$500: the sample mean typically deviates from the true mean by about \$500. To halve the SE (and with it the margin of error) to \$250, you need 4× the sample size (256 people), because $SE \propto 1/\sqrt{n}$.

Non-Response Bias

In a survey, 1,000 people are selected and the response rate is 40%. Respondent mean income = \$55,000. Non-respondent mean = \$38,000.

True population mean: $\mu = 0.4 \times 55000 + 0.6 \times 38000 = 44{,}800$, that is, \$44,800.

Bias = \$55,000 - \$44,800 = \$10,200, a 22.8% overestimate.
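The same arithmetic can be wrapped in a small function to explore other response rates. This is a sketch under the simplifying assumption that the selected sample mirrors the population, so the true mean is the response-rate-weighted average of the two groups; the function name is hypothetical.

```python
def nonresponse_bias(response_rate, respondent_mean, nonrespondent_mean):
    """Return (true mean, absolute bias, relative bias) of a respondents-only estimate."""
    true_mean = (response_rate * respondent_mean
                 + (1 - response_rate) * nonrespondent_mean)
    # Equivalent to (1 - response_rate) * (respondent_mean - nonrespondent_mean)
    bias = respondent_mean - true_mean
    return true_mean, bias, bias / true_mean

true_mean, bias, relative_bias = nonresponse_bias(0.4, 55_000, 38_000)
print(f"{true_mean:.0f} {bias:.0f} {relative_bias:.1%}")
```

The equivalent form in the comment shows that the bias shrinks either when the response rate rises or when respondents and non-respondents become more alike.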


Key Takeaways

Summary: Sampling Methods

  • SE shrinks in proportion to $1/\sqrt{n}$: quadrupling $n$ halves the standard error. This is the fundamental law of statistical precision.
  • Probability sampling (SRS, stratified, cluster, systematic) is required for valid inference. Convenience samples have unknown selection probabilities, so their bias cannot be measured or reliably corrected.
  • Stratified > SRS when strata differ substantially in the outcome. It controls for known heterogeneity and reduces variance.
  • Non-response bias occurs when non-respondents differ systematically from respondents. Track response rates and apply weighting corrections.
  • CLT convergence depends on population skewness: as rough rules of thumb, symmetric distributions need $n \geq 10$; heavily skewed ones may need $n \geq 50$ to $100$.
  • Finite population correction applies when $n/N > 5\%$: $SE_{adj} = SE \cdot \sqrt{(N-n)/(N-1)}$.
  • To halve the margin of error, quadruple $n$: the $1/\sqrt{n}$ rate means precision improvement is expensive.
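The standard error rule and the finite population correction from the list above fit in one helper. This is a minimal sketch: the function name and the population size of 1,000 are hypothetical, and the 5% threshold is applied exactly as stated in the takeaway.

```python
import math

def standard_error(sigma, n, N=None):
    """SE of the mean, with the finite population correction when n/N > 5%."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return se

se_64 = standard_error(4000, 64)            # baseline sample
se_256 = standard_error(4000, 256)          # 4x the sample, half the SE
se_fpc = standard_error(4000, 64, N=1000)   # n/N = 6.4%, correction applies
print(se_64, se_256, round(se_fpc, 2))
```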

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Population and Sample

  • Population vs Sample: Parameters vs. statistics, sampling frames, and the goal of statistical inference

Data Collection

Sampling Techniques

  • Sampling Techniques: SRS, stratified, cluster, and systematic sampling with formulas, examples, and allocation strategies

Bias and Errors

  • Sampling Bias and Errors: Selection bias, non-response bias, survivorship bias, famous polling failures, and mitigation strategies
