Variance
Descriptive Statistics
Why Does Data Spread Out?
Variance is the foundation of statistical dispersion — it tells you how far data points wander from the average.
Understanding variance helps you:
- Quantify uncertainty — measure how reliable or volatile a dataset truly is
- Compare datasets — see which group has more consistent behavior
- Build estimators — understand why dividing by n-1 produces unbiased results
- Unlock advanced measures — standard deviation, skewness, and kurtosis all build on variance
Master variance and every other measure of spread becomes a natural extension.
What is Variance?
Definition
Variance measures the average squared deviation from the mean. It quantifies the spread or dispersion of a random variable around its expected value.
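For a random variable $X$ with mean $\mu = E[X]$, the variance is:
Population Variance (Definition)
$$\sigma^2 = \mathrm{Var}(X) = E\left[(X - \mu)^2\right] = \sum_{i=1}^{N} (x_i - \mu)^2 \, p(x_i)$$
Here,
- $\sigma^2$ = Population variance
- $\mu$ = Population mean (expected value)
- $N$ = Population size
- $p(x_i)$ = Probability of outcome $x_i$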
For a finite population of equally likely observations:
Finite Population Variance
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
Here,
- $N$ = Population size
- $x_i$ = The i-th observation
- $\mu$ = Population mean
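To make the definition concrete, here is a minimal sketch that computes the probability-weighted squared deviation directly. The function name `population_variance` and the two example distributions (a fair die and a biased coin) are illustrative assumptions, not part of the original text.

```python
def population_variance(values, probs=None):
    """Population variance from the definition: sum of p_i * (x_i - mu)^2."""
    n = len(values)
    if probs is None:
        # Equally likely observations: every p_i equals 1/N
        probs = [1.0 / n] * n
    mu = sum(p * x for x, p in zip(values, probs))
    return sum(p * (x - mu) ** 2 for x, p in zip(values, probs))


# Fair six-sided die (assumed example): outcomes 1..6, each with probability 1/6
die_var = population_variance([1, 2, 3, 4, 5, 6])
print(die_var)  # about 2.9167, i.e. 35/12

# Biased coin (assumed example): outcome 1 with probability 0.3, else 0
coin_var = population_variance([0, 1], probs=[0.7, 0.3])
print(coin_var)  # about 0.21, i.e. p * (1 - p)
```

Passing `probs=None` gives the finite-population case with equal weights; passing explicit probabilities gives the general discrete case.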
The Shortcut Formula
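Using the identity $(X - \mu)^2 = X^2 - 2\mu X + \mu^2$, we obtain the computationally equivalent form:
Theorem: Computational Formula for Variance
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
Proof sketch. Expand $(X - \mu)^2$ and take expectations. Since $E[X] = \mu$, we get $E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2$.
This form is useful because it requires only one pass through the data (computing $\sum x_i$ and $\sum x_i^2$ simultaneously).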
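The one-pass idea can be sketched as a single loop that accumulates the count, the running sum, and the running sum of squares. The function name `one_pass_variance` and the data are assumptions for illustration. One caveat worth keeping in mind: in floating point, this form can lose precision when the mean is large relative to the spread, because it subtracts two nearly equal numbers.

```python
def one_pass_variance(data):
    """Population variance via E[X^2] - (E[X])^2, using one pass over the data."""
    n = 0
    total = 0.0      # running sum of x_i
    total_sq = 0.0   # running sum of x_i^2
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


result = one_pass_variance([2, 4, 4, 4, 5, 5, 7, 9])
print(result)  # 4.0
```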
Sample Variance and Bessel's Correction
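Given a sample $x_1, \dots, x_n$ drawn i.i.d. from a population with variance $\sigma^2$, the sample variance is:
Sample Variance (Bessel's Correction)
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Here,
- $s^2$ = Sample variance (unbiased estimator of $\sigma^2$)
- $\bar{x}$ = Sample mean
- $n$ = Sample size
- $n - 1$ = Degrees of freedom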
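In NumPy, the choice between the two divisors is controlled by the `ddof` ("delta degrees of freedom") argument of `np.var`, which defaults to 0. The short sketch below, on made-up data, compares both settings against the formulas written out by hand.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data
n = data.size

pop_var = np.var(data)             # ddof=0 (default): divides by n
sample_var = np.var(data, ddof=1)  # ddof=1: divides by n - 1

# The same quantities written out from the formulas
ss = np.sum((data - data.mean()) ** 2)  # sum of squared deviations
print(pop_var, ss / n)
print(sample_var, ss / (n - 1))
```

Because the default is `ddof=0`, calling `np.var` on a sample without setting `ddof=1` silently gives the biased estimator.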
Why $n-1$? Degrees of Freedom
The Intuition Behind n−1
The sample mean $\bar{x}$ is computed from the same data. It locks in one linear constraint: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. This means only $n-1$ of the $n$ deviations are free to vary; the last is determined by the others. Hence we lose one degree of freedom.
More formally: dividing by $n$ systematically underestimates $\sigma^2$ because the sample mean $\bar{x}$ is closer to the sample points than the true $\mu$ is. Bessel's correction, a factor of $\frac{n}{n-1}$, compensates for this bias.
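The lost degree of freedom is easy to see numerically. In this small sketch (the data values are an arbitrary assumption), the deviations from the sample mean sum to zero, so the last deviation can be recovered from the others.

```python
data = [3.0, 5.0, 8.0, 12.0]            # illustrative data
x_bar = sum(data) / len(data)           # 7.0
deviations = [x - x_bar for x in data]  # [-4.0, -2.0, 1.0, 5.0]

# The constraint: deviations from the sample mean always sum to zero
total = sum(deviations)

# So the last deviation is fixed once the first n - 1 are known
recovered_last = -sum(deviations[:-1])
print(total, recovered_last, deviations[-1])  # 0.0 5.0 5.0
```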
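Theorem: Unbiasedness of $s^2$
$$E[s^2] = \sigma^2$$
Proof. For i.i.d. observations $X_1, \dots, X_n$ with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, use the decomposition $\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2$ together with $\mathrm{Var}(\bar{X}) = \sigma^2 / n$:
$$E\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] - n\,E\left[(\bar{X} - \mu)^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2$$
Therefore $E[s^2] = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2$.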
The computational form of the sample variance is:
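Computational Formula for Sample Variance
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)$$
Here,
- $s^2$ = Sample variance
- $x_i$ = The i-th observation
- $n$ = Sample size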
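As a quick check that the computational form agrees with the definition, the sketch below compares it against `statistics.variance` from the Python standard library, which divides by n − 1. The helper name `sample_variance_shortcut` and the data are assumptions for illustration.

```python
import statistics


def sample_variance_shortcut(data):
    """Sample variance from the running sums of x_i and x_i^2."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [3, 5, 8, 12]  # illustrative data
shortcut = sample_variance_shortcut(data)
reference = statistics.variance(data)  # definition-based, divides by n - 1
print(shortcut, reference)  # both about 15.333 (46 / 3)
```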
Algebraic Properties of Variance
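Theorem: Properties of Variance
For constants $a, b$ and independent random variables $X, Y$:
- Non-negativity: $\mathrm{Var}(X) \geq 0$, with equality iff $X$ is constant a.s.
- Translation invariance: $\mathrm{Var}(X + b) = \mathrm{Var}(X)$
- Scaling: $\mathrm{Var}(aX) = a^2 \, \mathrm{Var}(X)$
- Additivity for independent variables: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
- General linear combination: $\mathrm{Var}(aX + bY) = a^2 \, \mathrm{Var}(X) + b^2 \, \mathrm{Var}(Y)$ (if $X \perp Y$)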
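These properties can be checked on simulated data. The sketch below is a rough numerical illustration, not a proof: the distributions, constants, and sample size are arbitrary assumptions. Translation and scaling hold up to rounding for any dataset, while additivity only holds approximately in a finite sample because the sample covariance of independent draws is close to, but not exactly, zero.

```python
import random
import statistics

random.seed(0)
n = 20_000
x = [random.gauss(0, 2) for _ in range(n)]  # Var(X) = 4
y = [random.gauss(1, 3) for _ in range(n)]  # Var(Y) = 9, drawn independently of x

a, b = 3.0, 10.0  # arbitrary constants
var_x = statistics.pvariance(x)
var_y = statistics.pvariance(y)

shifted = statistics.pvariance([xi + b for xi in x])             # Var(X + b)
scaled = statistics.pvariance([a * xi for xi in x])              # Var(aX)
summed = statistics.pvariance([xi + yi for xi, yi in zip(x, y)])  # Var(X + Y)

print(f"Var(X + b) = {shifted:.4f}   Var(X) = {var_x:.4f}")
print(f"Var(aX)    = {scaled:.4f}   a^2 Var(X) = {a * a * var_x:.4f}")
print(f"Var(X + Y) = {summed:.4f}   Var(X) + Var(Y) = {var_x + var_y:.4f}")
```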
Dependence Matters
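When $X$ and $Y$ are not independent, the general formula is:
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$
where $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$ is the covariance. Independence implies $\mathrm{Cov}(X, Y) = 0$, but the converse is not generally true.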
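Both points can be illustrated with a short sketch. The helper `cov` and the data-generating choices (a `y` that depends linearly on `x`, and a second pair where one variable is the square of the other) are assumptions made for the demonstration.

```python
import random


def cov(u, v):
    """Population covariance of two equal-length sequences; cov(u, u) is the variance."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    return sum((ui - mu_u) * (vi - mu_v) for ui, vi in zip(u, v)) / len(u)


random.seed(1)
x = [random.gauss(0, 1) for _ in range(10_000)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]  # y depends on x, so they are correlated
s = [xi + yi for xi, yi in zip(x, y)]

lhs = cov(s, s)                               # Var(X + Y)
rhs = cov(x, x) + cov(y, y) + 2 * cov(x, y)   # general formula
naive = cov(x, x) + cov(y, y)                 # wrongly assumes independence
print(f"Var(X+Y) = {lhs:.4f}, general formula = {rhs:.4f}, naive sum = {naive:.4f}")

# Zero covariance without independence: w is completely determined by u
u = [-2, -1, 0, 1, 2]
w = [ui ** 2 for ui in u]
zero_cov = cov(u, w)
print(zero_cov)  # zero, although w is a function of u
```

The general formula matches the variance of the sum up to rounding, while the naive sum misses the covariance term.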
Variance as a Second Central Moment
The variance is the second central moment of a distribution. The general framework of moments provides:
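Central Moments
$$\mu_k = E\left[(X - \mu)^k\right]$$
Here,
- $\mu_k$ = k-th central moment
- $\mu_1$ = First central moment, which is always 0 (the mean $\mu = E[X]$ is the first raw moment)
- $\mu_2$ = Variance = Var(X)
The skewness uses the third central moment ($\mu_3$), and kurtosis uses the fourth ($\mu_4$). Variance is the foundational building block.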
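A small sketch of the moment framework follows. The function name `central_moment` and the data are illustrative assumptions, and the skewness and kurtosis shown are the plain population versions, $\mu_3 / \mu_2^{3/2}$ and $\mu_4 / \mu_2^2$, without any small-sample bias correction.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)


data = [1, 2, 3, 4, 10]  # illustrative data with a long right tail

m1 = central_moment(data, 1)  # first central moment: zero by construction
m2 = central_moment(data, 2)  # variance
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5  # standardized third moment
kurtosis = m4 / m2 ** 2    # standardized fourth moment

print(m1, m2, m3, m4)
print(f"skewness = {skewness:.4f}, kurtosis = {kurtosis:.4f}")
```

The positive skewness reflects the single large value pulling the right tail out.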
Worked Example
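Given the sample $\{4, 7, 13, 2, 1, 8, 11, 6, 9, 5\}$ with $n = 10$:
Step 1: Compute $\bar{x}$: $\bar{x} = \frac{66}{10} = 6.6$
Step 2: Compute squared deviations:
| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|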
| 4 | −2.6 | 6.76 |
| 7 | 0.4 | 0.16 |
| 13 | 6.4 | 40.96 |
| 2 | −4.6 | 21.16 |
| 1 | −5.6 | 31.36 |
| 8 | 1.4 | 1.96 |
| 11 | 4.4 | 19.36 |
| 6 | −0.6 | 0.36 |
| 9 | 2.4 | 5.76 |
| 5 | −1.6 | 2.56 |
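Step 3: Sum: $\sum_{i=1}^{10} (x_i - \bar{x})^2 = 130.4$
Step 4: Population variance: $\sigma^2 = \frac{130.4}{10} = 13.04$
Step 5: Sample variance: $s^2 = \frac{130.4}{9} \approx 14.49$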
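The steps of the worked example can be reproduced in a few lines. This sketch uses the ten values from the first column of the table and cross-checks the hand computation against the standard library's `statistics` module.

```python
import statistics

sample = [4, 7, 13, 2, 1, 8, 11, 6, 9, 5]  # values from the table's first column
n = len(sample)

x_bar = statistics.mean(sample)              # Step 1: sample mean
ss = sum((x - x_bar) ** 2 for x in sample)   # Steps 2-3: sum of squared deviations
pop_var = ss / n                             # Step 4: divide by n
sample_var = ss / (n - 1)                    # Step 5: divide by n - 1

print(f"mean = {x_bar:.1f}, sum of squares = {ss:.1f}")
print(f"population variance = {pop_var:.2f}, sample variance = {sample_var:.2f}")
```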
The Bias of the Naive Estimator
Simulation: Bias of ÷n vs ÷(n−1)
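Let us verify the bias empirically. Draw repeated samples of size $n$ from a known population and compute both estimators:
| Estimator | Formula | Expectation |
|---|---|---|
| $\hat{\sigma}^2_n$ (biased) | $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | $\frac{n-1}{n}\sigma^2$ |
| $s^2$ (unbiased) | $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | $\sigma^2$ |
The ratio $\frac{n}{n-1}$ is the bias correction factor. For a sample of size $n$, the biased estimator systematically underestimates $\sigma^2$ by a fraction $\frac{1}{n}$ (for example, 10% when $n = 10$).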
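A minimal simulation sketch of this comparison is shown below. The population (normal with variance 4), the sample size of 5, and the number of trials are assumptions chosen for the demonstration; any population with finite variance should show the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0       # true population variance (assumed for the demo)
n = 5              # sample size (assumed for the demo)
trials = 100_000   # number of repeated samples

# Each row is one sample of size n drawn from N(0, sigma2)
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))

biased_mean = samples.var(axis=1, ddof=0).mean()    # average of the divide-by-n estimator
unbiased_mean = samples.var(axis=1, ddof=1).mean()  # average of the divide-by-(n-1) estimator

print(f"true variance:               {sigma2:.3f}")
print(f"mean of divide-by-n:         {biased_mean:.3f} (theory: {(n - 1) / n * sigma2:.3f})")
print(f"mean of divide-by-(n-1):     {unbiased_mean:.3f}")
```

With these settings the divide-by-n average should land near 3.2 and the corrected average near 4.0, up to simulation noise.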
Relationship to Standard Deviation
The standard deviation $\sigma = \sqrt{\mathrm{Var}(X)}$ returns the spread to the original units of the data. While variance is mathematically convenient (additive for independent variables), standard deviation is more interpretable because it shares the units of the data and of the mean.
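The difference in units can be seen in a short sketch. The height measurements below are made-up values in centimetres: the variance comes out in squared centimetres, while the standard deviation is back in centimetres.

```python
import math
import statistics

heights_cm = [170.0, 165.0, 180.0, 175.0, 160.0]  # made-up measurements in cm

var_cm2 = statistics.pvariance(heights_cm)  # units: cm^2
sd_cm = math.sqrt(var_cm2)                  # units: cm, same as the data

print(f"variance = {var_cm2:.1f} cm^2, standard deviation = {sd_cm:.2f} cm")
```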
Variance in Machine Learning
| ML Application | Variance Usage | Why |
|---|---|---|
| Bias-variance tradeoff | High model variance signals overfitting | Complex models change a lot from one training set to another |
| Feature selection | Drop features with near-zero variance | A near-constant feature carries almost no signal |
| Regularization | Penalize large weights | Ridge/Lasso shrink weights, lowering model variance at the cost of some bias |
| Ensemble methods | Bagging reduces variance | Averaging many high-variance models smooths out their errors |
| Tree split criteria | Variance reduction = split quality | Regression trees choose the split that most reduces target variance |
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:, 0] + np.random.randn(n) * 3  # linear signal + Gaussian noise

# Single tree: high variance
tree = DecisionTreeRegressor(max_depth=10, random_state=42)
# The scorer returns negative MSE, so flip the sign first, then take the variance across folds
tree_mse = -cross_val_score(tree, X, y, cv=10, scoring='neg_mean_squared_error')
tree_var = tree_mse.var()
print(f"Single tree MSE variance across folds: {tree_var:.2f}")

# Bagging: reduces variance by averaging (the default base estimator is a decision tree)
bagging = BaggingRegressor(n_estimators=10, random_state=42)
bag_mse = -cross_val_score(bagging, X, y, cv=10, scoring='neg_mean_squared_error')
bag_var = bag_mse.var()
print(f"Bagging MSE variance across folds: {bag_var:.2f}")
print(f"Variance reduction: {(1 - bag_var / tree_var) * 100:.1f}%")
Key Takeaways
- Variance = expected squared deviation from the mean; it quantifies spread
- Bessel's correction (dividing by n−1) makes the sample variance unbiased because the sample mean absorbs one degree of freedom
- Variance is additive for independent variables: Var(X+Y) = Var(X) + Var(Y)
- Scaling: Var(aX) = a²Var(X), so variance scales quadratically with constants
"Variance is the price you pay for randomness." — Every model that ignores spread will be surprised by reality.