
Hypergeometric Distribution — Sampling Without Replacement


The hypergeometric distribution models the number of successes when sampling without replacement from a finite population. Every draw changes the odds.

  • Card games — probability of drawing aces in a poker hand
  • Quality inspection — defective items in a batch sample
  • Ecology — capture-recapture population estimation
  • Legal sampling — random drug testing from a workforce

Unlike the binomial, the hypergeometric accounts for the fact that each draw depletes the population.


Core Concepts

The hypergeometric distribution is the appropriate model for card games, quality inspection of small batches, and any other scenario where the population is depleted as items are drawn.

Definition: Hypergeometric Distribution

A random variable $X$ follows a hypergeometric distribution with parameters $(N, K, n)$, written $X \sim \text{Hyp}(N, K, n)$, if its PMF is:

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}},$$

where $N$ is the population size, $K$ is the number of successes in the population, $n$ is the sample size, and $k$ satisfies $\max(0, n - (N-K)) \leq k \leq \min(n, K)$.
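As a minimal sketch of the definition, the PMF can be evaluated directly with the standard library's `math.comb`. The helper name `hypergeom_pmf` and the parameter values below are illustrative assumptions, chosen so that the lower support bound $\max(0, n - (N-K))$ is not zero.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) for X ~ Hyp(N, K, n); returns 0.0 outside the support."""
    lo = max(0, n - (N - K))   # cannot draw more failures than exist
    hi = min(n, K)             # cannot draw more successes than exist
    if k < lo or k > hi:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Illustrative parameters: 20 items, 12 successes, sample of 10.
# Only 8 failures exist, so at least 2 of the 10 draws must be successes.
N, K, n = 20, 12, 10
support = range(max(0, n - (N - K)), min(n, K) + 1)
pmf = {k: hypergeom_pmf(k, N, K, n) for k in support}
total = sum(pmf.values())

for k, prob in pmf.items():
    print(f"P(X={k}) = {prob:.5f}")
print(f"Sum over support = {total:.12f}")
```

The probabilities over the support sum to 1, and any $k$ outside the support gets probability 0.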

Why This PMF Is Correct

The total number of ways to choose $n$ items from $N$ is $\binom{N}{n}$. The number of ways to choose exactly $k$ successes from the $K$ available and $n-k$ failures from the $N-K$ available is $\binom{K}{k}\binom{N-K}{n-k}$. The ratio gives the probability, since each subset of size $n$ is equally likely.
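The counting argument can be checked by brute force on a small population. The sketch below (with an arbitrarily chosen toy population of 8 items, 3 of them successes) enumerates every subset of size $n$ and compares the tallies with $\binom{K}{k}\binom{N-K}{n-k}$.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Toy population (illustrative): 8 items, the first 3 are successes
N, K, n = 8, 3, 4
population = [1] * K + [0] * (N - K)

# Tally the number of successes in every possible subset of size n
counts = Counter(
    sum(population[i] for i in subset)
    for subset in combinations(range(N), n)
)
total_subsets = comb(N, n)

for k in sorted(counts):
    formula = comb(K, k) * comb(N - K, n - k)
    print(f"k={k}: enumerated {counts[k]}, formula {formula}, "
          f"P = {counts[k]}/{total_subsets}")
```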


Mean and Variance

Hypergeometric Mean and Variance

$$E[X] = n\frac{K}{N}, \quad \text{Var}(X) = n\frac{K}{N}\left(1-\frac{K}{N}\right)\frac{N-n}{N-1}$$

Here,

  • $N$ = population size
  • $K$ = successes in population
  • $n$ = sample size
  • $\frac{N-n}{N-1}$ = finite population correction factor

Derivation of the Mean

Let $X_i$ be the indicator that the $i$-th draw is a success. By linearity of expectation:

$$E[X] = E\!\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = n \cdot \frac{K}{N},$$

since by symmetry each draw has probability $K/N$ of being a success, regardless of its position in the sequence. (Note: this holds even though the $X_i$ are not independent.)
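The symmetry claim is easy to doubt, so here is a small exact check. It is a sketch on an arbitrarily chosen toy population: it enumerates every ordered sequence of draws and confirms that each draw position has marginal success probability $K/N$.

```python
from fractions import Fraction
from itertools import permutations

# Toy population (illustrative): 6 items, the first 2 are successes
N, K, n = 6, 2, 3
is_success = [True] * K + [False] * (N - K)

# Every ordered sequence of n distinct items is equally likely
draws = list(permutations(range(N), n))

# Exact marginal success probability at each draw position
marginals = [
    Fraction(sum(is_success[d[i]] for d in draws), len(draws))
    for i in range(n)
]
expected_count = sum(marginals)  # linearity of expectation

print("P(success on draw i):", [str(m) for m in marginals])
print("E[X] =", expected_count, " n*K/N =", Fraction(n * K, N))
```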

Derivation of the Variance

Using $\text{Var}(X) = \sum_i \text{Var}(X_i) + 2\sum_{i<j}\text{Cov}(X_i, X_j)$:

  • $\text{Var}(X_i) = \frac{K}{N}\left(1 - \frac{K}{N}\right) = \frac{K(N-K)}{N^2}$
  • For $i \neq j$: $\text{Cov}(X_i, X_j) = -\frac{K(N-K)}{N^2(N-1)}$ (negative because drawing a success on draw $i$ makes a success on draw $j$ slightly less likely without replacement).

Therefore:

$$\text{Var}(X) = n\cdot\frac{K(N-K)}{N^2} + n(n-1)\cdot\left(-\frac{K(N-K)}{N^2(N-1)}\right) = n\frac{K(N-K)}{N^2}\left(1 - \frac{n-1}{N-1}\right) = n\frac{K}{N}\left(1-\frac{K}{N}\right)\frac{N-n}{N-1}.$$
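The covariance term is stated without proof above. It follows from $P(X_i = 1, X_j = 1) = \frac{K(K-1)}{N(N-1)}$ for $i \neq j$. The sketch below uses exact rational arithmetic and illustrative parameters to check the covariance, the indicator-based variance, and the closed form against the variance computed directly from the PMF.

```python
from fractions import Fraction
from math import comb

N, K, n = 10, 4, 3  # illustrative parameters
p = Fraction(K, N)

# Cov(X_i, X_j) = P(both successes) - P(success)^2 for i != j
p_both = Fraction(K * (K - 1), N * (N - 1))
cov = p_both - p * p
cov_formula = -Fraction(K * (N - K), N * N * (N - 1))

# Variance assembled from the n variances and n(n-1) ordered covariance pairs
var_from_indicators = n * p * (1 - p) + n * (n - 1) * cov

# Variance computed directly from the PMF
support = range(max(0, n - (N - K)), min(n, K) + 1)
pmf = {k: Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n)) for k in support}
mean = sum(k * q for k, q in pmf.items())
var_from_pmf = sum(k * k * q for k, q in pmf.items()) - mean ** 2

# Closed form with the finite population correction
var_closed_form = n * p * (1 - p) * Fraction(N - n, N - 1)

print("Cov(X_i, X_j) =", cov)
print("Var(X) =", var_from_pmf)
```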

The Finite Population Correction Factor

Theorem: Finite Population Correction

The ratio of hypergeometric to binomial variance is:

$$\frac{\text{Var}_{\text{hyp}}(X)}{\text{Var}_{\text{bin}}(X)} = \frac{N - n}{N - 1}.$$

This factor is always less than 1 for $n \geq 2$, meaning sampling without replacement produces less variability than sampling with replacement.

Why Without Replacement Has Lower Variance

When you sample without replacement and observe a success, the remaining population has one fewer success, making the next draw slightly less likely to succeed. This negative correlation between draws reduces the overall variability of the count. The effect is negligible when $N \gg n$.
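To get a feel for how quickly the correction becomes negligible, the following sketch tabulates the factor for a fixed sample size and growing population. The helper name `fpc` and the chosen values of $N$ and $n$ are illustrative.

```python
def fpc(N, n):
    """Finite population correction (N - n) / (N - 1); requires N >= 2."""
    return (N - n) / (N - 1)

n = 10  # sample size held fixed (illustrative)
for N in (20, 100, 1_000, 100_000):
    print(f"N = {N:>7}: FPC = {fpc(N, n):.5f}")
```

With a single draw ($n = 1$) the factor is exactly 1, which is why the strict inequality needs $n \geq 2$.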


Convergence to Binomial

Theorem: Binomial Approximation

As $N \to \infty$ with $K/N \to p$ and $n$ fixed, the hypergeometric distribution converges to the binomial:

$$\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \longrightarrow \binom{n}{k}p^k(1-p)^{n-k}.$$

Proof Sketch

For fixed $k$ and $n$, expand each binomial coefficient as a falling factorial and split the denominator $N(N-1)\cdots(N-n+1)$ into its first $k$ and last $n-k$ factors:

$$\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} = \binom{n}{k} \cdot \frac{K(K-1)\cdots(K-k+1)}{N(N-1)\cdots(N-k+1)} \cdot \frac{(N-K)(N-K-1)\cdots(N-K-n+k+1)}{(N-k)(N-k-1)\cdots(N-n+1)}.$$

As $N \to \infty$ with $K/N \to p$, each of the $k$ factors $\frac{K-i}{N-i} \to p$ and each of the $n-k$ factors $\frac{N-K-j}{N-k-j} \to 1-p$, yielding $\binom{n}{k}p^k(1-p)^{n-k}$.
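The convergence can also be observed numerically. The sketch below fixes $n$ and the success fraction, then measures the largest pointwise gap between the two PMFs as $N$ grows. The helper names and parameter values are illustrative assumptions.

```python
from math import comb

def hyp_pmf(k, N, K, n):
    # math.comb returns 0 when k > K, so out-of-support terms vanish
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.25  # fixed sample size and success fraction (illustrative)

def max_gap(N):
    """Largest |hypergeometric - binomial| PMF difference over k = 0..n."""
    K = round(p * N)
    return max(
        abs(hyp_pmf(k, N, K, n) - binom_pmf(k, n, K / N))
        for k in range(n + 1)
    )

gaps = {N: max_gap(N) for N in (20, 200, 2_000, 20_000)}
for N, gap in gaps.items():
    print(f"N = {N:>6}: max PMF gap = {gap:.6f}")
```

The gap shrinks roughly in proportion to $1/N$, consistent with the correction factor approaching 1.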


Symmetry Property

Theorem: Hypergeometric Symmetry

The hypergeometric distribution is symmetric under the transformation $k \to n-k$ when $K = N/2$:

$$P(X = k) = P(X = n-k) \quad \text{when} \quad K = N/2.$$

More generally, $X \sim \text{Hyp}(N, K, n)$ has the same distribution as $n - X'$ where $X' \sim \text{Hyp}(N, N-K, n)$: counting successes is equivalent to counting failures.
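Both statements can be verified exactly with rational arithmetic. This is a minimal sketch; the helper `hyp_pmf` and the parameter values are chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def hyp_pmf(k, N, K, n):
    """Exact P(X = k) for X ~ Hyp(N, K, n), zero outside the support."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return Fraction(0)
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, n = 20, 6

# Case K = N/2: the PMF is symmetric about n/2
K_half = N // 2
symmetric = all(
    hyp_pmf(k, N, K_half, n) == hyp_pmf(n - k, N, K_half, n)
    for k in range(n + 1)
)

# General case: k successes with K marked items is the same event as
# n - k "successes" when the N - K unmarked items are counted instead
K = 7
complement = all(
    hyp_pmf(k, N, K, n) == hyp_pmf(n - k, N, N - K, n)
    for k in range(n + 1)
)

print("Symmetric when K = N/2:", symmetric)
print("Success/failure duality:", complement)
```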


Worked Example: Lottery / Card Draw

Example: Drawing Aces from a Deck

A standard deck has $N = 52$ cards, of which $K = 4$ are aces. We draw $n = 5$ cards without replacement. Let $X$ = number of aces drawn.

PMF:

$$P(X = k) = \frac{\binom{4}{k}\binom{48}{5-k}}{\binom{52}{5}}.$$

Compute each value:

  • $P(X=0) = \frac{\binom{4}{0}\binom{48}{5}}{\binom{52}{5}} = \frac{1 \cdot 1{,}712{,}304}{2{,}598{,}960} \approx 0.6588$
  • $P(X=1) = \frac{\binom{4}{1}\binom{48}{4}}{\binom{52}{5}} = \frac{4 \cdot 194{,}580}{2{,}598{,}960} \approx 0.2995$
  • $P(X=2) = \frac{\binom{4}{2}\binom{48}{3}}{\binom{52}{5}} = \frac{6 \cdot 17{,}296}{2{,}598{,}960} \approx 0.0399$
  • $P(X=3) = \frac{\binom{4}{3}\binom{48}{2}}{\binom{52}{5}} = \frac{4 \cdot 1{,}128}{2{,}598{,}960} \approx 0.0017$
  • $P(X=4) = \frac{\binom{4}{4}\binom{48}{1}}{\binom{52}{5}} = \frac{1 \cdot 48}{2{,}598{,}960} \approx 0.0000185$

Mean: $E[X] = 5 \cdot \frac{4}{52} = \frac{20}{52} \approx 0.3846$.

Variance: $\text{Var}(X) = 5 \cdot \frac{4}{52} \cdot \frac{48}{52} \cdot \frac{47}{51} \approx 0.3272$.

Compare with the binomial approximation: $\text{Binomial}(5, 4/52)$ gives $\text{Var} = 5 \cdot \frac{4}{52} \cdot \frac{48}{52} \approx 0.3550$.

The finite population correction factor $47/51 \approx 0.922$ reduces the variance by about $8\%$.
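The hand calculation can be reproduced exactly with the standard library alone. The sketch below builds the PMF as exact fractions and derives the mean and variance from it, as an independent check on the closed-form expressions.

```python
from fractions import Fraction
from math import comb

N, K, n = 52, 4, 5  # deck size, aces, cards drawn

# Exact PMF as fractions over C(52, 5) = 2,598,960 possible hands
pmf = {
    k: Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))
    for k in range(min(n, K) + 1)
}

# Moments computed from the PMF rather than from the closed forms
mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

for k, q in pmf.items():
    print(f"P(X={k}) = {float(q):.7f}")
print(f"Mean     = {mean} = {float(mean):.4f}")
print(f"Variance = {var} = {float(var):.4f}")
```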


Python Implementation

import numpy as np

np.random.seed(42)  # fix the seed so the simulation is reproducible

# Hypergeometric parameters
N, K, n = 52, 4, 5  # deck, aces, draws

# Theoretical mean and variance
mean_theory = n * K / N
var_theory = n * (K/N) * (1 - K/N) * (N - n) / (N - 1)
print(f"Hyp({N}, {K}, {n}):")
print(f"  Theoretical mean: {mean_theory:.4f}")
print(f"  Theoretical variance: {var_theory:.4f}")

# Simulate. Note that NumPy's argument order is (ngood, nbad, nsample),
# i.e. (K, N - K, n), not (N, K, n).
n_sims = 100000
samples = np.random.hypergeometric(K, N - K, n, size=n_sims)
print(f"  Empirical mean:   {np.mean(samples):.4f}")
print(f"  Empirical variance: {np.var(samples, ddof=0):.4f}")

# Compare with binomial (with replacement)
binom_var = n * (K/N) * (1 - K/N)
fpc = (N - n) / (N - 1)
print(f"\n  Binomial variance (with replacement): {binom_var:.4f}")
print(f"  Finite population correction: {fpc:.4f}")
print(f"  Hyp var / Bin var = {var_theory / binom_var:.4f}  (should be {fpc:.4f})")

Python Implementation: Batch Quality Inspection

import numpy as np
from scipy import stats

np.random.seed(42)

# Quality control: batch of 100 items, 8 defective, sample 10
N, K, n = 100, 8, 10
samples = np.random.hypergeometric(K, N - K, n, size=50000)

print(f"Quality Inspection: N={N}, K={K} defective, sample n={n}")
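print("PMF values:")
# SciPy's argument order is pmf(k, M, n, N) with M = population size,
# n = successes in the population, N = sample size, i.e. (k, N, K, n) here.
for k in range(min(n, K) + 1):
    pmf = stats.hypergeom.pmf(k, N, K, n)
    print(f"  P(X={k}) = {pmf:.4f}")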

print(f"\nSimulated mean: {np.mean(samples):.4f} (theoretical: {n*K/N:.4f})")

# Probability of finding at least 2 defectives
prob_ge_2 = 1 - stats.hypergeom.cdf(1, N, K, n)
print(f"P(X >= 2 defectives in sample) = {prob_ge_2:.4f}")

Key Takeaways

Summary: Hypergeometric Distribution

  • Models sampling without replacement from a finite population of size $N$ with $K$ successes
  • PMF: $P(X=k) = \binom{K}{k}\binom{N-K}{n-k}/\binom{N}{n}$
  • Mean: $E[X] = nK/N$ (same as the binomial, because linearity of expectation does not require independence)
  • Variance: $n\frac{K}{N}\left(1-\frac{K}{N}\right)\frac{N-n}{N-1}$, which includes the finite population correction $(N-n)/(N-1)$
  • Negative covariance between draws: $\text{Cov}(X_i, X_j) = -\frac{K(N-K)}{N^2(N-1)}$
  • Converges to $\text{Binomial}(n, p)$ as $N \to \infty$ with $K/N \to p$ and $n$ fixed
  • For $n \geq 2$, the variance is strictly less than its binomial counterpart (less variability without replacement); for $n = 1$ the two are equal
