Standard Deviation
Descriptive Statistics
How Far Are Data Points From the Mean?
Standard deviation translates variance back into the original units of your data — making spread actually meaningful.
Understanding standard deviation helps you:
- Interpret data — compare observations directly to the mean in real units
- Apply the empirical rule — know what percentage of data falls within 1, 2, or 3 standard deviations
- Detect outliers — flag unusual observations with z-scores
- Compare variability — use the coefficient of variation across different scales
If variance is the theory, standard deviation is the practice.
What is Standard Deviation?
Definition
The standard deviation is the square root of variance. It returns the spread to the original units of the data, making it directly interpretable as a measure of typical distance from the mean.
Population Standard Deviation
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Here,
- $\sigma$ = Population standard deviation
- $\mu$ = Population mean
- $N$ = Population size
Sample Standard Deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Here,
- $s$ = Sample standard deviation
- $\bar{x}$ = Sample mean
- $n - 1$ = Degrees of freedom (Bessel's correction)
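The difference between the two definitions is easiest to see on a small dataset. The sketch below uses made-up exam scores (an assumption for illustration) and NumPy's `ddof` argument, which sets the divisor to `N - ddof`.

```python
import numpy as np

# Hypothetical exam scores, used only for illustration.
scores = np.array([65.0, 70.0, 75.0, 80.0, 85.0])

# Population standard deviation: divide by N (NumPy's default, ddof=0).
pop_std = scores.std(ddof=0)

# Sample standard deviation: divide by n - 1 (Bessel's correction, ddof=1).
sample_std = scores.std(ddof=1)

# The same sample value computed step by step from the definition.
deviations = scores - scores.mean()
manual_sample_std = np.sqrt((deviations ** 2).sum() / (len(scores) - 1))

print(f"Population std (ddof=0): {pop_std:.4f}")
print(f"Sample std (ddof=1):     {sample_std:.4f}")
print(f"Manual sample std:       {manual_sample_std:.4f}")
```

The sample value is always slightly larger than the population value for the same data, and the gap shrinks as the number of observations grows.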
Why Square Root?
Units and Interpretability
Variance has squared units (e.g., $\text{points}^2$ for exam scores), making it hard to interpret directly. The standard deviation has the same units as the original data, so it can be compared directly to the mean and to individual observations. For example, if exam scores have mean 75 and standard deviation 10, then a score of 85 is exactly one standard deviation above the mean.
The Empirical Rule (68-95-99.7)
Theorem: Empirical Rule for Normal Distributions
For $X \sim \mathcal{N}(\mu, \sigma^2)$:
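| Range | Exact Probability | Approximation |
|---|---|---|
| $\mu \pm 1\sigma$ | $P(\lvert X - \mu \rvert \le 1\sigma) = 0.6827$ | ≈ 68% |
| $\mu \pm 2\sigma$ | $P(\lvert X - \mu \rvert \le 2\sigma) = 0.9545$ | ≈ 95% |
| $\mu \pm 3\sigma$ | $P(\lvert X - \mu \rvert \le 3\sigma) = 0.9973$ | ≈ 99.7% |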
This rule is the foundation of outlier detection: observations beyond $\mu \pm 3\sigma$ are extremely rare (about 0.3%) under normality.
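The 68-95-99.7 figures can be reproduced from the normal distribution itself. This is a minimal sketch using only the standard library; it relies on the identity $P(\lvert X - \mu \rvert \le k\sigma) = \operatorname{erf}(k/\sqrt{2})$ for a normal variable, and the helper name `prob_within` is made up for this example.

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

# Coverage for 1, 2, and 3 standard deviations around the mean.
coverage = {k: prob_within(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.4%}   outside: {1 - p:.4%}")
```

The "outside" column for three standard deviations is where the roughly 0.3% figure comes from.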
Standardized Scores (Z-Scores)
Z-Score
$$z_i = \frac{x_i - \bar{x}}{s}$$
Here,
- $z_i$ = Standardized value of $x_i$
- $\bar{x}$ = Sample mean
- $s$ = Sample standard deviation
The z-score tells you how many standard deviations an observation is from the mean. It is dimensionless and enables comparison across different scales.
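As a rough illustration, z-scores can be computed in one vectorized step and then used to flag unusual observations. The data and the cutoff of 3 are assumptions chosen for this sketch, not values from a real dataset.

```python
import numpy as np

# Hypothetical measurements with one clearly unusual value at the end.
values = np.array([50.0, 52.0, 49.0, 51.0, 48.0, 50.0,
                   53.0, 47.0, 51.0, 49.0, 50.0, 90.0])

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation

# Number of standard deviations each observation lies from the mean.
z_scores = (values - mean) / std

# Assumed cutoff: flag anything more than 3 standard deviations away.
threshold = 3.0
outliers = values[np.abs(z_scores) > threshold]

print("z-scores:", z_scores.round(2))
print("flagged:", outliers)
```

One caveat: in small samples an extreme value inflates the standard deviation it is measured against, so its z-score is smaller than intuition suggests. Robust alternatives based on the median may be worth considering in that case.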
Coefficient of Variation (CV)
Coefficient of Variation
$$\text{CV} = \frac{\sigma}{\mu}$$
Here,
- $\sigma$ = Standard deviation
- $\mu$ = Mean
The CV enables comparison of variability across datasets with different units or vastly different means. A lower CV indicates less relative variability.
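To make the comparison concrete, here is a small sketch that computes the CV for two variables measured in different units. The heights and weights are invented for illustration, and the helper `coefficient_of_variation` is a hypothetical name that uses the sample standard deviation.

```python
import numpy as np

# Hypothetical measurements in different units.
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weights_kg = np.array([55.0, 62.0, 70.0, 78.0, 85.0])

def coefficient_of_variation(x: np.ndarray) -> float:
    """Sample standard deviation divided by the mean (unitless)."""
    return x.std(ddof=1) / x.mean()

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Changing the unit (cm -> m) rescales both std and mean, so the CV is unchanged.
cv_height_m = coefficient_of_variation(heights_cm / 100)

print(f"CV height: {cv_height:.3f}")
print(f"CV weight: {cv_weight:.3f}")
print(f"CV height in metres: {cv_height_m:.3f}")
```

In this made-up data the weights vary more, relative to their mean, than the heights do. The CV is generally only meaningful for quantities with a true zero and a positive mean.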
Chebyshev's Inequality
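Theorem: Chebyshev's Inequality
For any distribution (not just normal) with finite mean $\mu$ and variance $\sigma^2$:
$$P(\lvert X - \mu \rvert \ge k\sigma) \le \frac{1}{k^2}$$
This holds for all $k > 0$, but it is only useful for $k > 1$ (since $1/k^2 \ge 1$ for $k \le 1$).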
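| $k$ | Upper bound on $P(\lvert X - \mu \rvert \ge k\sigma)$ |
|---|---|
| 2 | $1/4 = 25\%$ |
| 3 | $1/9 \approx 11.1\%$ |
| 4 | $1/16 = 6.25\%$ |
| 5 | $1/25 = 4\%$ |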
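Because the bound makes no assumption about shape, it can be checked on data that is far from normal. The sketch below draws from an exponential distribution (an arbitrary choice for illustration) and compares the observed tail fraction with $1/k^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly right-skewed sample; the distribution is an assumption for this demo.
sample = rng.exponential(scale=1.0, size=100_000)
mu, sigma = sample.mean(), sample.std()

results = {}
for k in (2, 3, 4, 5):
    # Fraction of observations at least k standard deviations from the mean.
    observed = np.mean(np.abs(sample - mu) >= k * sigma)
    bound = 1 / k**2
    results[k] = (observed, bound)
    print(f"k={k}: observed={observed:.4f}  Chebyshev bound={bound:.4f}")
```

The observed fractions sit well below the bound, which is typical: Chebyshev's inequality is a worst-case guarantee, not an estimate.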
Chebyshev vs. Empirical Rule
Chebyshev's inequality gives a universal upper bound that holds for any distribution with finite variance. The empirical rule gives approximate percentages that apply only to normal distributions. For a normal distribution, $P(\lvert X - \mu \rvert \ge 2\sigma) \approx 4.6\%$, much less than the Chebyshev bound of 25%.
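The gap between the two can be tabulated directly. This sketch computes the two-sided normal tail with `math.erfc` and places it next to the Chebyshev bound; the function names are made up for the example.

```python
import math

def normal_tail(k: float) -> float:
    """P(|X - mu| >= k * sigma) when X is normally distributed."""
    return math.erfc(k / math.sqrt(2))

def chebyshev_bound(k: float) -> float:
    """Upper bound on the same probability for any finite-variance distribution."""
    return min(1.0, 1 / k**2)

for k in (2, 3, 4, 5):
    print(f"k={k}: normal tail={normal_tail(k):.6f}  "
          f"Chebyshev bound={chebyshev_bound(k):.6f}")
```

The normal tail shrinks far faster than $1/k^2$, which is why the empirical rule is so much sharper whenever normality is a reasonable assumption.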
Relationship to Other Measures
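| Measure | Formula | Units | Robust? |
|---|---|---|---|
| Variance | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | Squared units | No |
| Standard deviation | $s = \sqrt{s^2}$ | Original units | No |
| IQR | $Q_3 - Q_1$ | Original units | Yes |
| MAD | $\text{median}(\lvert x_i - \text{median}(x) \rvert)$ | Original units | Yes |
| Range | $\max(x) - \min(x)$ | Original units | No |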
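The "Robust?" column can be illustrated by computing each measure before and after a single value is replaced by an outlier. The data below is invented, MAD is taken to mean the median absolute deviation, and `spread_summary` is a hypothetical helper.

```python
import numpy as np

def spread_summary(x: np.ndarray) -> dict:
    """Compute the five spread measures from the table for a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "variance": x.var(ddof=1),
        "std": x.std(ddof=1),
        "iqr": q3 - q1,
        "mad": np.median(np.abs(x - np.median(x))),
        "range": x.max() - x.min(),
    }

# Hypothetical data, then the same data with the last value replaced by 100.
clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 13.0, 11.0])
contaminated = np.append(clean[:-1], 100.0)

before = spread_summary(clean)
after = spread_summary(contaminated)

for name in before:
    print(f"{name:>8}: clean={before[name]:8.2f}  with outlier={after[name]:8.2f}")
```

Variance, standard deviation, and range jump sharply with one bad value, while IQR and MAD move only a little.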
Standard Deviation in Machine Learning
| ML Application | Std Dev Usage | Why |
|---|---|---|
| StandardScaler | x_scaled = (x - μ) / σ | Neural networks train faster |
| Confidence intervals | μ ± 1.96σ/√n | Model uncertainty quantification |
| Weight initialization | Xavier/Glorot: std = √(2/(n_in + n_out)) | Prevents vanishing/exploding gradients |
| Batch normalization | Normalize to mean=0, std=1 | Stabilizes deep learning |
| Anomaly detection | \|z\| > 3 | Such observations are rare (about 0.3%) under normality |
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

# StandardScaler subtracts each column's mean and divides by its standard deviation
data = np.random.randn(100, 3) * [10, 1, 100]  # three features on very different scales
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print(f"Original std: {data.std(axis=0).round(1)}")
print(f"Scaled std: {scaled.std(axis=0).round(3)}")

# Weight initialization (Xavier/Glorot): std = sqrt(2 / (fan_in + fan_out))
layer_sizes = [784, 256, 128, 10]
for i in range(len(layer_sizes) - 1):
    fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
    std_xavier = np.sqrt(2.0 / (fan_in + fan_out))
    weights = np.random.randn(fan_in, fan_out) * std_xavier
    print(f"Layer {i}: {fan_in}→{fan_out}, std={std_xavier:.4f}, "
          f"weight range=[{weights.min():.3f}, {weights.max():.3f}]")
```
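The confidence-interval row of the table can be sketched in the same way. The example below assumes simulated per-example errors from a hypothetical model, estimates σ with the sample standard deviation, and uses the normal critical value 1.96 for an approximate 95% interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated per-example errors; the distribution and its parameters are assumptions.
errors = rng.normal(loc=0.5, scale=2.0, size=400)

n = errors.size
mean = errors.mean()

# Standard error of the mean: sample standard deviation divided by sqrt(n).
std_err = errors.std(ddof=1) / np.sqrt(n)

# Approximate 95% interval using the normal critical value.
z_crit = 1.96
ci_low, ci_high = mean - z_crit * std_err, mean + z_crit * std_err

print(f"mean error = {mean:.3f}")
print(f"95% CI     = [{ci_low:.3f}, {ci_high:.3f}]")
```

For small samples, a t-distribution critical value would usually be a better choice than 1.96.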
Key Takeaways
- Standard deviation is in the same units as the data, so it is directly interpretable as typical distance from the mean
- The 68-95-99.7 rule applies only to normal distributions; never apply it blindly
- CV = SD/mean allows variability comparison across different scales and units
- Chebyshev's inequality provides universal bounds for any distribution: P(|X−μ| ≥ kσ) ≤ 1/k²
"Standard deviation is the measuring stick of uncertainty — without it, you are guessing in the dark."