Measures of Central Tendency
Descriptive Statistics
Three Ways to Find the Center — And Why It Matters
Different measures capture different notions of center, and choosing the right one depends on your data and question.
- Arithmetic mean — Sensitive to every value; the workhorse of statistics
- Median — Robust to outliers; the choice for skewed distributions
- Mode — The only measure that works for categorical data
- Shape dependency — Symmetric data: mean equals median; skewed data: they diverge
The center is often the first number you report about a dataset, and the wrong measure can misrepresent it. Choose the measure that fits your data.
What is Central Tendency?
Definition
A measure of central tendency is a single value that describes where the "center" of a distribution is located.
The Three Main Measures
Arithmetic Mean (x̄)
Sum of all values divided by count. The most common measure.
Arithmetic Mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Here,
- $\bar{x}$ = Sample mean
- $n$ = Number of observations
- $x_i$ = The i-th observation
Properties:
- Uses every data point — sensitive to outliers
- Deviations from the mean sum to zero, Σ(xᵢ - x̄) = 0, and the mean is the only value with this property
- Minimizes the sum of squared deviations
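The last two properties can be checked numerically. The sketch below uses a small made-up sample and the standard-library `statistics` module (chosen here only so it runs without extra dependencies); the helper name `sum_squared_dev` is hypothetical.

```python
from statistics import mean

values = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up sample for illustration
x_bar = mean(values)

# Property: deviations from the mean sum to zero (up to float rounding)
deviation_sum = sum(x - x_bar for x in values)


def sum_squared_dev(center):
    """Sum of squared deviations of `values` from a candidate center."""
    return sum((x - center) ** 2 for x in values)


# Property: shifting the center away from the mean increases the squared error
at_mean = sum_squared_dev(x_bar)
shifted = [sum_squared_dev(x_bar + shift) for shift in (-1.0, -0.1, 0.1, 1.0)]

print(f"mean = {x_bar}")
print(f"sum of deviations = {deviation_sum:.10f}")
print(f"squared error at mean = {at_mean}, at shifted centers = {shifted}")
```

Trying only four shifts is not a proof, but it shows the pattern: every candidate center other than the mean gives a larger sum of squared deviations.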
Median (M)
The middle value when data is sorted. For even n, average the two middle values.
Properties:
- Robust to outliers
- Appropriate for ordinal and continuous data
- Minimizes the sum of absolute deviations
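A short sketch of these median properties, again with the standard-library `statistics` module. The salary figures are invented for illustration and `sum_abs_dev` is a hypothetical helper.

```python
from statistics import mean, median


def sum_abs_dev(data, center):
    """Sum of absolute deviations of `data` from a candidate center."""
    return sum(abs(x - center) for x in data)


odd = [7, 1, 3]      # sorted: 1, 3, 7 -> middle value is 3
even = [7, 1, 3, 5]  # sorted: 1, 3, 5, 7 -> average of 3 and 5 is 4.0

salaries = [40, 42, 45, 47, 50]  # hypothetical salaries, in thousands
with_outlier = salaries + [1000]  # add one extreme value

print(f"median(odd) = {median(odd)}, median(even) = {median(even)}")
print(f"mean:   {mean(salaries)} -> {mean(with_outlier)}")
print(f"median: {median(salaries)} -> {median(with_outlier)}")

# The median gives a smaller sum of absolute deviations than the mean does
abs_dev_at_median = sum_abs_dev(with_outlier, median(with_outlier))
abs_dev_at_mean = sum_abs_dev(with_outlier, mean(with_outlier))
print(f"absolute deviation at median = {abs_dev_at_median}, at mean = {abs_dev_at_mean}")
```

One extreme value moves the mean from 44.8 to 204 while the median only moves from 45 to 46, which is what robustness to outliers means in practice.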
Mode
The most frequently occurring value(s). A distribution can be unimodal, bimodal, or multimodal.
Properties:
- Only measure appropriate for nominal data
- Can have multiple modes (or none in continuous data)
- Not affected by extreme values
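For nominal data, the mode can be found by counting labels. The sketch below uses `collections.Counter` and `statistics.multimode` from the standard library (Python 3.8 or later); the sample lists are made up.

```python
from collections import Counter
from statistics import multimode

colors = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical nominal data
sizes = ["S", "M", "M", "L", "L"]                        # two labels tie for first place
readings = [1.1, 2.2, 3.3]                               # continuous values, no repeats

color_counts = Counter(colors)
color_modes = multimode(colors)      # single mode
size_modes = multimode(sizes)        # bimodal: ties are listed in order of first appearance
reading_modes = multimode(readings)  # every value ties, so no value stands out

print(color_counts.most_common())
print(color_modes, size_modes, reading_modes)
```

When every value occurs once, `multimode` returns all of them, which is one way to see why the mode is usually uninformative for raw continuous data unless the values are binned or rounded first.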
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
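def describe_central(data, label):
    mean = np.mean(data)
    median = np.median(data)
    # Continuous samples rarely repeat a value exactly, so round before counting
    mode_result = stats.mode(np.round(data), keepdims=True)
    mode = mode_result.mode[0]
    # Rough heuristic: treat a gap under 5% of the standard deviation as noise
    diff = mean - median
    if abs(diff) <= 0.05 * np.std(data):
        shape = "(roughly symmetric)"
    elif diff > 0:
        shape = "(right-skewed)"
    else:
        shape = "(left-skewed)"
    print(f"\n{label}:")
    print(f"  Mean   = {mean:.2f}")
    print(f"  Median = {median:.2f}")
    print(f"  Mode   = {mode:.2f}")
    print(f"  Mean - Median = {diff:.2f} {shape}")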
np.random.seed(42)
# Symmetric distribution
sym = np.random.normal(50, 10, 1000)
describe_central(sym, "Symmetric (Normal)")
# Right-skewed (income-like)
right_skew = np.random.lognormal(3.5, 0.8, 1000)
describe_central(right_skew, "Right-Skewed (Income)")
# Left-skewed
left_skew = 100 - np.random.exponential(10, 1000)
describe_central(left_skew, "Left-Skewed")
# With outlier
with_outlier = np.concatenate([np.random.normal(50, 5, 99), [500]])
describe_central(with_outlier, "Data with One Extreme Outlier (n=100)")
Effect of Skewness on Mean vs Median
Right-Skewed: Mode < Median < Mean
Symmetric: Mode ≈ Median ≈ Mean
Left-Skewed: Mean < Median < Mode
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
datasets = {
    'Right-Skewed': np.random.lognormal(3, 0.8, 2000),
    'Symmetric': np.random.normal(50, 10, 2000),
    'Left-Skewed': 100 - np.random.exponential(10, 2000)
}
for ax, (title, data) in zip(axes, datasets.items()):
    ax.hist(data, bins=40, density=True, color='lightblue', edgecolor='gray', alpha=0.7)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', lw=2, linestyle='--', label=f'Mean={mean_val:.1f}')
    ax.axvline(median_val, color='blue', lw=2, linestyle='-', label=f'Median={median_val:.1f}')
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.savefig('central_tendency.png', dpi=150)
plt.show()
When to Use Which Measure
| Situation | Best Measure | Why |
|---|---|---|
| Symmetric distribution, no outliers | Mean | Most efficient, uses all data |
| Skewed distribution | Median | Robust, not pulled by the tail |
| Outliers present | Median | Mean is distorted |
| Nominal (categorical) data | Mode | Mean/median meaningless |
| Ordinal data | Median | Can rank, but intervals unknown |
| Reporting income, housing prices | Median | Right-skewed distributions |
| Reporting temperature, height | Mean | Approximately normal |
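The table above can be read as a small decision rule. The function below is a hypothetical helper written only to make that rule explicit; it is not part of any library, and the scale names it accepts are an assumption of this sketch.

```python
def suggest_measure(scale, skewed=False, has_outliers=False):
    """Hypothetical helper mirroring the table above; not a library function.

    scale: one of "nominal", "ordinal", "interval", "ratio".
    """
    if scale == "nominal":
        return "mode"    # categories cannot be ordered or averaged
    if scale == "ordinal":
        return "median"  # ranks are meaningful, intervals are not
    if scale in ("interval", "ratio"):
        # Skew or outliers distort the mean, so fall back to the median
        return "median" if (skewed or has_outliers) else "mean"
    raise ValueError(f"unknown measurement scale: {scale!r}")


print(suggest_measure("nominal"))             # eye color, product category
print(suggest_measure("ordinal"))             # survey ratings
print(suggest_measure("ratio", skewed=True))  # income, housing prices
print(suggest_measure("interval"))            # temperature
```

A rule like this is only a starting point: looking at a histogram before choosing is usually more reliable than a flag set by hand.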
A Famous Example: Income
Why It Matters
The mean household income in the US is significantly higher than the median household income because a small number of very high earners pull the mean up. The median better represents the typical household.
# Simulate US-like income distribution
np.random.seed(0)
incomes = np.concatenate([
    np.random.lognormal(10.8, 0.7, 9990),  # Most people
    np.random.lognormal(13.5, 1.0, 10),    # Very wealthy (top 0.1%)
])
print(f"Mean income:   ${np.mean(incomes):>12,.0f}")
print(f"Median income: ${np.median(incomes):>12,.0f}")
# The gap reflects the whole right tail, not only the top 0.1%
print(f"The right tail pulls the mean ${np.mean(incomes)-np.median(incomes):,.0f} above the median!")
Central Tendency in Machine Learning
Central tendency is fundamental to ML:
| ML Concept | Central Tendency Connection | Why It Matters |
|---|---|---|
| Mean Squared Error (MSE) | Minimized by predicting the mean | Regression default loss |
| Mean Absolute Error (MAE) | Minimized by predicting the median | Robust to outliers |
| StandardScaler | Centers data to mean=0 | Neural networks train faster |
| Missing value imputation | Fill with mean/median/mode | Choice affects model performance |
| Batch normalization | Uses batch mean | Stabilizes deep learning training |
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
np.random.seed(42)
# Generate data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5
# Add outliers
outlier_idx = np.random.choice(n, 10, replace=False)
y[outlier_idx] += np.random.randn(10) * 50
# Ordinary least squares minimizes squared error (targets the conditional mean),
# so large residuals from outliers carry a lot of weight
model_mse = LinearRegression().fit(X, y)
pred_mse = model_mse.predict(X)
print(f"OLS (MSE) coefficient: {model_mse.coef_[0]:.3f}")
print(f"MSE: {mean_squared_error(y, pred_mse):.2f}")
# HuberRegressor is a robust stand-in for MAE, not an exact MAE fit: its loss is
# quadratic for small residuals and linear (MAE-like) for large ones
model_huber = HuberRegressor().fit(X, y)
pred_huber = model_huber.predict(X)
print(f"\nHuber coefficient: {model_huber.coef_[0]:.3f}")
print(f"MAE: {mean_absolute_error(y, pred_huber):.2f}")
# Compare how far each fitted slope lands from the true slope of 3.0
print(f"\nTrue coefficient: 3.0")
print(f"OLS (MSE) is off by: {abs(model_mse.coef_[0] - 3):.3f}")
print(f"Huber is off by: {abs(model_huber.coef_[0] - 3):.3f}")
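The imputation row of the table can be sketched without any ML library. The columns below are invented, and `impute` is a hypothetical helper that fills missing entries (represented here as `None`) with a statistic of the observed values.

```python
from statistics import mean, median, mode

ages = [22, 25, None, 31, 95, None, 27]          # hypothetical numeric column with gaps
cities = ["Paris", None, "Lima", "Paris", None]  # hypothetical categorical column


def impute(column, strategy):
    """Fill None entries with strategy(observed values); hypothetical helper."""
    observed = [v for v in column if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in column]


ages_mean = impute(ages, mean)      # fill value is pulled up by the 95
ages_median = impute(ages, median)  # fill value stays near the typical age
cities_mode = impute(cities, mode)  # only the mode applies to categories

print(ages_mean)
print(ages_median)
print(cities_mode)
```

With the single large value 95 in the column, mean imputation fills the gaps with 40 while median imputation fills them with 27, so the choice of measure changes what the model sees. scikit-learn's `SimpleImputer` offers comparable `mean`, `median`, and `most_frequent` strategies.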
Key Takeaways
Summary: Measures of Central Tendency
- Mean is best for symmetric data without outliers — it uses all information
- Median is robust: prefer it when data is skewed or has outliers
- Mode is for categorical data and identifying the most common value
- In right-skewed data: Mean > Median (the tail pulls the mean right)
- In left-skewed data: Mean < Median
- Always report the appropriate measure for your data type and distribution shape