Variance

Descriptive Statistics

Why Does Data Spread Out?

Variance is the foundation of statistical dispersion — it tells you how far data points wander from the average.

Understanding variance helps you:

Quantify uncertainty — measure how reliable or volatile a dataset truly is
Compare datasets — see which group has more consistent behavior
Build estimators — understand why dividing by n-1 produces unbiased results
Unlock advanced measures — standard deviation, skewness, and kurtosis all build on variance

Master variance and every other measure of spread becomes a natural extension.

What is Variance?

Definition

Variance measures the average squared deviation from the mean. It quantifies the spread or dispersion of a random variable around its expected value.

For a random variable $X$ with mean $\mu = E[X]$, the variance is:

Population Variance (Definition)

\sigma^2 = \text{Var}(X) = E\left[(X - \mu)^2\right] = \sum_{i=1}^{N} (x_i - \mu)^2 \cdot P(X = x_i)

Here,

$\sigma^2$ = Population variance
$\mu$ = Population mean (expected value)
$N$ = Number of possible outcomes $x_1, \ldots, x_N$
$P(X = x_i)$ = Probability of outcome $x_i$

For a finite population of $N$ equally likely observations:

Finite Population Variance

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

Here,

$N$ = Population size
$x_i$ = The $i$-th observation
$\mu$ = Population mean

The Shortcut Formula

Using the identity $E[(X-\mu)^2] = E[X^2] - (E[X])^2$, we obtain the computationally equivalent form:

Theorem: Computational Formula for Variance

\sigma^2 = E[X^2] - \left(E[X]\right)^2

Proof sketch. Expand $(X - \mu)^2 = X^2 - 2\mu X + \mu^2$ and take expectations to get $E[X^2] - 2\mu E[X] + \mu^2$. Since $E[X] = \mu$, this equals $E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2$. $\square$

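This form is useful because it requires only one pass through the data (accumulating $\sum x_i$ and $\sum x_i^2$ simultaneously). In floating-point arithmetic, however, it subtracts two nearly equal numbers when the mean is large relative to the spread, so it can lose precision.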
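The equivalence of the definition and the shortcut can be checked numerically. The following is a minimal sketch in plain Python; the function names and the data values are illustrative assumptions, not part of the original text.

```python
def population_variance_two_pass(xs):
    """Definition: average squared deviation from the mean (two passes)."""
    n = len(xs)
    mu = sum(xs) / n                      # first pass: the mean
    return sum((x - mu) ** 2 for x in xs) / n  # second pass: squared deviations


def population_variance_one_pass(xs):
    """Shortcut: E[X^2] - (E[X])^2, accumulating both sums in a single loop."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return total_sq / n - (total / n) ** 2


# Illustrative data (hypothetical values chosen so the arithmetic is easy to follow)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
var_two_pass = population_variance_two_pass(data)
var_one_pass = population_variance_one_pass(data)
print(var_two_pass, var_one_pass)
```

For this small dataset both routes agree. With values that have a very large mean and a small spread, the one-pass version is the one more likely to show rounding error.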

Sample Variance and Bessel's Correction

Given a sample $x_1, x_2, \ldots, x_n$ drawn i.i.d. from a population with variance $\sigma^2$, the sample variance is:

Sample Variance (Bessel's Correction)

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Here,

$s^2$ = Sample variance (unbiased estimator of $\sigma^2$)
$\bar{x}$ = Sample mean
$n$ = Sample size
$n-1$ = Degrees of freedom
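As a rough illustration, the sample variance can be coded directly from the formula and compared with `statistics.variance` from the Python standard library, which also divides by $n-1$. The helper name `sample_variance` and the data are assumptions made for this sketch.

```python
import statistics


def sample_variance(xs):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Illustrative sample (hypothetical values)
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(sample)
s2_library = statistics.variance(sample)  # standard-library equivalent
print(s2, s2_library)
```

The guard for `n < 2` reflects that with a single observation there are zero degrees of freedom left, so the estimator is undefined.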

Why $n - 1$? Degrees of Freedom

The Intuition Behind n−1

The sample mean $\bar{x}$ is computed from the same data. It locks in one linear constraint: $\sum(x_i - \bar{x}) = 0$. This means only $n-1$ of the $n$ deviations $(x_i - \bar{x})$ are free to vary; the last is determined by the others. Hence we lose one degree of freedom.

Put differently, dividing by $n$ systematically underestimates $\sigma^2$: the sample mean $\bar{x}$ minimizes $\sum (x_i - c)^2$ over all $c$, so the squared deviations about $\bar{x}$ are never larger than those about the true $\mu$. Bessel's correction multiplies the naive estimator by $\frac{n}{n-1}$ to compensate for this bias.

Theorem: Unbiasedness of $s^2$

E[s^2] = \sigma^2

Proof. For i.i.d. observations with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$:

E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = \sum_{i=1}^n E\left[(X_i - \bar{X})^2\right] = (n-1)\sigma^2

Therefore $E[s^2] = \frac{n-1}{n-1}\sigma^2 = \sigma^2$. $\square$
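The step to $(n-1)\sigma^2$ is stated without detail in the proof. One standard way to fill it in, using only the i.i.d. assumption and the fact that $\text{Var}(\bar{X}) = \sigma^2 / n$, is:

```latex
\begin{aligned}
\sum_{i=1}^n (X_i - \bar{X})^2
  &= \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2, \\
E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]
  &= n\sigma^2 - n\,\text{Var}(\bar{X})
   = n\sigma^2 - n \cdot \frac{\sigma^2}{n}
   = (n-1)\sigma^2.
\end{aligned}
```

The first line follows from writing $X_i - \mu = (X_i - \bar{X}) + (\bar{X} - \mu)$, squaring, and noting that the cross term vanishes because $\sum (X_i - \bar{X}) = 0$.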

The computational form of the sample variance is:

Computational Formula for Sample Variance

s^2 = \frac{\sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2}{n-1}

Here,

$s^2$ = Sample variance
$x_i$ = The $i$-th observation
$n$ = Sample size

Algebraic Properties of Variance

Theorem: Properties of Variance

For constants $a, b$ and random variables $X, Y$ (independence is required only for the last two properties):

Non-negativity: $\text{Var}(X) \geq 0$, with equality iff $X$ is constant a.s.
Translation invariance: $\text{Var}(X + b) = \text{Var}(X)$
Scaling: $\text{Var}(aX) = a^2 \text{Var}(X)$
Additivity for independent variables: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
General linear combination: $\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y)$ (if $X \perp Y$)

Dependence Matters

When $X$ and $Y$ are not independent, the general formula is:

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)

where $\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$ is the covariance. Independence implies $\text{Cov}(X,Y) = 0$, but the converse is not generally true.
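The scaling rule and the covariance formula also hold exactly for the empirical (divide-by-$N$) variance and covariance of paired data, which makes them easy to check. This is a small sketch with hypothetical helper names (`pvar`, `pcov`) and made-up data values.

```python
def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    """Population (divide-by-N) variance."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


def pcov(xs, ys):
    """Population (divide-by-N) covariance of paired observations."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Illustrative paired data (hypothetical values); x and y are correlated here
x = [1.0, 3.0, 4.0, 8.0]
y = [2.0, 2.0, 7.0, 5.0]
a, b = 3.0, 10.0

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a is squared
var_affine = pvar([a * xi + b for xi in x])
var_scaled_rule = a ** 2 * pvar(x)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
var_sum = pvar([xi + yi for xi, yi in zip(x, y)])
var_sum_rule = pvar(x) + pvar(y) + 2 * pcov(x, y)

print(var_affine, var_scaled_rule)
print(var_sum, var_sum_rule)
```

Because the covariance of this data is positive, the variance of the sum exceeds `pvar(x) + pvar(y)`; dropping the covariance term would understate it.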

Variance as a Second Central Moment

The variance is the second central moment of a distribution. The general framework of moments provides:

Central Moments

\mu_k = E\left[(X - \mu)^k\right]

Here,

$\mu_k$ = $k$-th central moment
$\mu$ = Mean $E[X]$ (the first raw moment; the first central moment $\mu_1$ is always 0)
$\mu_2$ = Variance = $\text{Var}(X)$

Skewness uses the third central moment ($\mu_3$) and kurtosis the fourth ($\mu_4$), each standardized by a power of $\sigma$, which is why variance is the foundational building block.
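A short sketch of the moment framework in Python follows. The function `central_moment` and the data are assumptions for illustration, and the skewness and kurtosis lines use the common standardized definitions $\mu_3 / \mu_2^{3/2}$ and $\mu_4 / \mu_2^2$, which the text above does not spell out.

```python
def central_moment(xs, k):
    """k-th central moment of a finite, equally weighted dataset."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)


# Illustrative data with one large value, so the distribution is right-skewed
data = [1.0, 2.0, 3.0, 4.0, 10.0]

m1 = central_moment(data, 1)  # always 0 up to rounding
m2 = central_moment(data, 2)  # population variance
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5     # standardized third moment (assumed convention)
kurtosis = m4 / m2 ** 2       # standardized fourth moment (assumed convention)
print(m1, m2, skewness, kurtosis)
```

Both higher-order measures divide by a power of `m2`, which is the sense in which they build on variance.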

Worked Example

Given the sample: $x = \{4, 7, 13, 2, 1, 8, 11, 6, 9, 5\}$ with $n = 10$:

Step 1: Compute $\bar{x}$:

\bar{x} = \frac{4+7+13+2+1+8+11+6+9+5}{10} = \frac{66}{10} = 6.6

Step 2: Compute squared deviations:

$x_i$	$x_i - \bar{x}$	$(x_i - \bar{x})^2$
4	−2.6	6.76
7	0.4	0.16
13	6.4	40.96
2	−4.6	21.16
1	−5.6	31.36
8	1.4	1.96
11	4.4	19.36
6	−0.6	0.36
9	2.4	5.76
5	−1.6	2.56

Step 3: Sum: $\sum(x_i - \bar{x})^2 = 130.4$

Step 4: Population variance: $\sigma^2 = 130.4 / 10 = 13.04$

Step 5: Sample variance: $s^2 = 130.4 / 9 = 14.4\overline{8}$
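The five steps can be reproduced in a few lines of Python as a check on the arithmetic. The variable names are illustrative; `statistics.pvariance` and `statistics.variance` are the standard-library population and sample variance functions.

```python
import statistics

x = [4, 7, 13, 2, 1, 8, 11, 6, 9, 5]
n = len(x)

x_bar = sum(x) / n                                 # Step 1: sample mean
sum_sq_dev = sum((xi - x_bar) ** 2 for xi in x)    # Steps 2-3: sum of squared deviations
pop_var = sum_sq_dev / n                           # Step 4: divide by n
samp_var = sum_sq_dev / (n - 1)                    # Step 5: divide by n - 1

# Same sum of squares via the computational formula
sum_sq_shortcut = sum(xi * xi for xi in x) - sum(x) ** 2 / n

print(f"mean = {x_bar:.2f}")
print(f"sum of squared deviations = {sum_sq_dev:.2f} (shortcut: {sum_sq_shortcut:.2f})")
print(f"population variance = {pop_var:.4f}, library: {statistics.pvariance(x):.4f}")
print(f"sample variance = {samp_var:.4f}, library: {statistics.variance(x):.4f}")
```

The shortcut route uses $\sum x_i^2 = 566$ and $\left(\sum x_i\right)^2 / n = 435.6$, which gives the same 130.4.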

The Bias of the Naive Estimator

Simulation: Bias of ÷n vs ÷(n−1)

Let us verify the bias empirically. Draw repeated samples of size $n$ from a known population and compute both estimators:

Estimator	Formula	Expectation
$\hat{\sigma}^2_n$ (biased)	$\frac{1}{n}\sum(x_i - \bar{x})^2$	$\frac{n-1}{n}\sigma^2 < \sigma^2$
$s^2$ (unbiased)	$\frac{1}{n-1}\sum(x_i - \bar{x})^2$	$\sigma^2$

The ratio $\frac{n}{n-1}$ is the bias correction factor. For $n = 10$, the expectation of the biased estimator is $\frac{9}{10}\sigma^2$, so it underestimates the true variance by 10% on average.
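The simulation described in this section might look like the sketch below. The population (normal with $\sigma = 2$), the seed, and the number of trials are all assumptions chosen for illustration, and the exact averages depend on them.

```python
import random


def naive_and_bessel(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates for one sample."""
    n = len(sample)
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    return ss / n, ss / (n - 1)


rng = random.Random(0)               # fixed seed so the run is repeatable
n, sigma, trials = 10, 2.0, 20000    # assumed setup; true variance is sigma**2 = 4.0

sum_naive = 0.0
sum_bessel = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    naive, bessel = naive_and_bessel(sample)
    sum_naive += naive
    sum_bessel += bessel

mean_naive = sum_naive / trials      # should land near (n-1)/n * sigma**2 = 3.6
mean_bessel = sum_bessel / trials    # should land near sigma**2 = 4.0

print(f"true variance:          {sigma ** 2:.3f}")
print(f"average of divide-by-n: {mean_naive:.3f}")
print(f"average of s^2:         {mean_bessel:.3f}")
```

Averaged over many samples, the divide-by-$n$ estimator settles near $0.9\sigma^2$ while $s^2$ settles near $\sigma^2$, matching the expectations in the table.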

Relationship to Standard Deviation

The standard deviation $\sigma = \sqrt{\sigma^2}$ returns the spread to the original units of the data. While variance is mathematically convenient (additive for independent variables), standard deviation is more interpretable because it shares the units of the mean.

Variance in Machine Learning

ML Application	Variance Usage	Why
Bias-variance tradeoff	High model variance signals overfitting	Flexible models track noise in the training sample
Feature selection	Drop near-zero-variance features	A nearly constant feature carries little signal
Regularization	Shrink weights to lower model variance	Ridge/Lasso trade a little bias for less variance
Ensemble methods	Bagging reduces variance	Averaging many high-variance models smooths their errors
Regression tree splits	Variance reduction = split quality	Regression trees pick the split that most reduces target variance

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:, 0] + np.random.randn(n) * 3  # linear signal + Gaussian noise

# Single deep tree: a high-variance learner
tree = DecisionTreeRegressor(max_depth=10, random_state=42)
# The scorer returns negated MSE, so flip the sign before taking the variance
tree_mse = -cross_val_score(tree, X, y, cv=10, scoring='neg_mean_squared_error')
tree_var = tree_mse.var()
print(f"Single tree MSE variance across folds: {tree_var:.2f}")

# Bagging: averages 10 trees (the default base estimator), each fit on a bootstrap resample
bagging = BaggingRegressor(n_estimators=10, random_state=42)
bag_mse = -cross_val_score(bagging, X, y, cv=10, scoring='neg_mean_squared_error')
bag_var = bag_mse.var()
print(f"Bagging MSE variance across folds: {bag_var:.2f}")
print(f"Variance reduction: {(1 - bag_var / tree_var) * 100:.1f}%")

Key Takeaways

Variance = expected squared deviation from the mean — it quantifies spread

Bessel's correction (dividing by n−1) makes the sample variance unbiased because the sample mean absorbs one degree of freedom

Variance is additive for independent variables: Var(X+Y) = Var(X) + Var(Y)

Scaling: Var(aX) = a²Var(X) — variance scales quadratically with constants

"Variance is the price you pay for randomness." — Every model that ignores spread will be surprised by reality.
