Measures of Central Tendency

Descriptive Statistics

Three Ways to Find the Center — And Why It Matters

Different measures capture different notions of center, and choosing the right one depends on your data and question.

Arithmetic mean: sensitive to every value; the workhorse of statistics
Median: robust to outliers; the usual choice for skewed distributions
Mode: the only measure that works for nominal (categorical) data
Shape dependency: in symmetric data the mean and median roughly coincide; in skewed data they diverge

The center of your data is usually the first summary number you will calculate and report, so choose the measure that fits your data.

What is Central Tendency?

Definition

A measure of central tendency is a single value that describes where the "center" of a distribution is located. The mean, median, and mode each define that center differently.

The Three Main Measures

Arithmetic Mean (x̄)

Sum of all values divided by count. The most common measure.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here,

$\bar{x}$ = sample mean
$n$ = number of observations
$x_i$ = the i-th observation

Properties:

Uses every data point, so it is sensitive to outliers
The only value about which the deviations sum to zero: Σ(xᵢ - x̄) = 0
Minimizes the sum of squared deviations
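Both numerical properties above can be checked on a small sample. The sketch below is only an illustration: the data values are made up, `sum_sq_dev` is a hypothetical helper name, and the grid scan stands in for a proper proof.

```python
from statistics import mean

# Illustrative sample (assumed values, not taken from the article)
data = [2.0, 4.0, 4.0, 5.0, 10.0]
x_bar = mean(data)

# Property: deviations from the mean sum to zero
deviation_sum = sum(x - x_bar for x in data)

def sum_sq_dev(center, values):
    """Sum of squared deviations of `values` from `center`."""
    return sum((x - center) ** 2 for x in values)

# Property: among candidate centers 0.0, 0.1, ..., 10.0, the one with the
# smallest sum of squared deviations should be the mean itself
candidates = [i / 10 for i in range(0, 101)]
best_center = min(candidates, key=lambda c: sum_sq_dev(c, data))

print(f"mean = {x_bar}")
print(f"sum of deviations = {deviation_sum}")
print(f"center minimizing squared deviations = {best_center}")
```

The scan only checks centers on a 0.1 grid, so it illustrates the property rather than proving it.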

Median (M)

The middle value when data is sorted. For even n, average the two middle values.

Properties:

Robust to outliers
Appropriate for ordinal and continuous data
Minimizes the sum of absolute deviations
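The median's rules and properties can be seen on a few hand-sized lists. This is a minimal sketch with made-up numbers; `sum_abs_dev` is a hypothetical helper, and the integer scan is an illustration rather than a proof.

```python
from statistics import median

odd = [7, 1, 3, 9, 5]        # sorted: 1 3 5 7 9, middle value is 5
even = [7, 1, 3, 9, 5, 11]   # sorted: 1 3 5 7 9 11, average of 5 and 7

median_odd = median(odd)
median_even = median(even)

# Robustness: swapping the largest value for an extreme one
# changes the mean a lot but leaves the median where it was
with_outlier = [7, 1, 3, 900, 5]
median_outlier = median(with_outlier)

def sum_abs_dev(center, values):
    """Sum of absolute deviations of `values` from `center`."""
    return sum(abs(x - center) for x in values)

# Among the integer centers 0..10, the median of `odd` should give
# the smallest sum of absolute deviations
best_center = min(range(0, 11), key=lambda c: sum_abs_dev(c, odd))

print(median_odd, median_even, median_outlier, best_center)
```

For an even number of observations any point between the two middle values minimizes the absolute deviations equally well; averaging them is the usual convention.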

Mode

The most frequently occurring value(s). A distribution can be unimodal, bimodal, or multimodal.

Properties:

Only measure appropriate for nominal data
A data set can have several modes, or no meaningful mode when continuous values never repeat
Not affected by extreme values
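As a rough illustration of these properties, the standard-library `statistics` module handles nominal data and ties directly. The lists below are invented for the example; `multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import mode, multimode

# Nominal data: no ordering or arithmetic needed, only counting
colors = ["red", "blue", "red", "green", "blue", "red"]
most_common_color = mode(colors)
counts = Counter(colors)

# Bimodal data: multimode returns every value tied for the top count,
# in the order each was first seen
sizes = ["S", "M", "M", "L", "L"]
all_modes = multimode(sizes)

# Continuous measurements that never repeat: every value ties with
# a count of one, so the mode carries no information
measurements = [1.23, 4.56, 7.89]
ties = multimode(measurements)

print(most_common_color, dict(counts))
print(all_modes)
print(ties)
```

When a continuous variable needs a mode, one common workaround is to bin or round the values first and report the most populated bin.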

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

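def describe_central(data, label, tol=0.05):
    """Print the mean, median, and mode of `data` with a rough skew label."""
    mean = np.mean(data)
    median = np.median(data)
    # Continuous samples rarely repeat exact values, so round to whole
    # units first; otherwise stats.mode just returns the smallest value.
    mode_result = stats.mode(np.round(data), keepdims=True)  # keepdims requires SciPy >= 1.9
    mode = mode_result.mode[0]

    # Sample mean and median are almost never exactly equal, so call the
    # data roughly symmetric when they differ by < tol standard deviations.
    gap = mean - median
    if abs(gap) < tol * np.std(data):
        shape = "(roughly symmetric)"
    else:
        shape = "(right-skewed)" if gap > 0 else "(left-skewed)"

    print(f"\n{label}:")
    print(f"  Mean   = {mean:.2f}")
    print(f"  Median = {median:.2f}")
    print(f"  Mode   = {mode:.2f} (after rounding to whole units)")
    print(f"  Mean - Median = {gap:.2f} {shape}")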

np.random.seed(42)

# Symmetric distribution
sym = np.random.normal(50, 10, 1000)
describe_central(sym, "Symmetric (Normal)")

# Right-skewed (income-like)
right_skew = np.random.lognormal(3.5, 0.8, 1000)
describe_central(right_skew, "Right-Skewed (Income)")

# Left-skewed
left_skew = 100 - np.random.exponential(10, 1000)
describe_central(left_skew, "Left-Skewed")

# With outlier
with_outlier = np.concatenate([np.random.normal(50, 5, 99), [500]])
describe_central(with_outlier, "Data with One Extreme Outlier (n=100)")

Effect of Skewness on Mean vs Median

Right-Skewed:        Mode < Median < Mean
Symmetric:           Mode ≈ Median ≈ Mean
Left-Skewed:         Mean < Median < Mode
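The ordering above is a rule of thumb rather than a theorem, so it helps to check the direction of skew directly. The sketch below computes the moment coefficient of skewness (third central moment divided by the cubed standard deviation) next to the mean and median. The sample values are invented and `moment_skewness` is a hypothetical helper name.

```python
from statistics import mean, median, pstdev

def moment_skewness(values):
    """Population moment coefficient of skewness: m3 / sigma**3."""
    m = mean(values)
    sigma = pstdev(values)
    n = len(values)
    return sum((x - m) ** 3 for x in values) / (n * sigma ** 3)

# Illustrative samples: one long right tail, and its mirror image
right = [1, 2, 2, 3, 3, 3, 4, 20]
left = [-x for x in right]

for name, sample in [("right-tailed", right), ("left-tailed", left)]:
    print(f"{name}: mean={mean(sample):.2f}, "
          f"median={median(sample):.2f}, "
          f"skewness={moment_skewness(sample):.2f}")
```

In the right-tailed sample the skewness is positive and the mean sits above the median; mirroring the data flips both signs.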

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

datasets = {
    'Right-Skewed': np.random.lognormal(3, 0.8, 2000),
    'Symmetric': np.random.normal(50, 10, 2000),
    'Left-Skewed': 100 - np.random.exponential(10, 2000)
}

for ax, (title, data) in zip(axes, datasets.items()):
    ax.hist(data, bins=40, density=True, color='lightblue', edgecolor='gray', alpha=0.7)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', lw=2, linestyle='--', label=f'Mean={mean_val:.1f}')
    ax.axvline(median_val, color='blue', lw=2, linestyle='-', label=f'Median={median_val:.1f}')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('central_tendency.png', dpi=150)
plt.show()

When to Use Which Measure

Situation	Best Measure	Why
Symmetric distribution, no outliers	Mean	Most efficient, uses all data
Skewed distribution	Median	Robust, not pulled by the tail
Outliers present	Median	Mean is distorted
Nominal (categorical) data	Mode	Mean/median meaningless
Ordinal data	Median	Can rank, but intervals unknown
Reporting income, housing prices	Median	Right-skewed distributions
Reporting temperature, height	Mean	Approximately normal
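The table can be read as a small decision rule. The function below is one possible encoding of it. `suggest_measure` and its parameter names are hypothetical, and real data usually calls for looking at a plot before trusting any such rule.

```python
def suggest_measure(scale, skewed=False, has_outliers=False):
    """Suggest a measure of center following the table's guidance.

    scale: "nominal", "ordinal", "interval", or "ratio" (assumed labels).
    """
    if scale == "nominal":
        return "mode"      # categories cannot be ranked or averaged
    if scale == "ordinal":
        return "median"    # ranks exist, but interval sizes are unknown
    if scale in ("interval", "ratio"):
        # Numeric data: fall back to the median when the tail or a few
        # extreme values would distort the mean
        return "median" if (skewed or has_outliers) else "mean"
    raise ValueError(f"unknown measurement scale: {scale!r}")

print(suggest_measure("nominal"))                    # eye color
print(suggest_measure("ordinal"))                    # survey rating
print(suggest_measure("ratio", skewed=True))         # income
print(suggest_measure("interval"))                   # temperature
```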

A Famous Example: Income

Why It Matters

The mean household income in the US is significantly higher than the median household income because a small number of very high earners pull the mean up. The median better represents the typical household.

# Simulate US-like income distribution
np.random.seed(0)
incomes = np.concatenate([
    np.random.lognormal(10.8, 0.7, 9990),   # Most people
    np.random.lognormal(13.5, 1.0, 10)      # Very wealthy (top 0.1%)
])

print(f"Mean income:   ${np.mean(incomes):>12,.0f}")
print(f"Median income: ${np.median(incomes):>12,.0f}")
print(f"The right tail pulls the mean ${np.mean(incomes)-np.median(incomes):,.0f} above the median!")

Central Tendency in Machine Learning

Central tendency is fundamental to ML:

ML Concept	Central Tendency Connection	Why It Matters
Mean Squared Error (MSE)	Minimized by predicting the mean	Default regression loss
Mean Absolute Error (MAE)	Minimized by predicting the median	Robust to outliers
StandardScaler	Centers each feature to mean 0 (and scales to unit variance)	Neural networks often train faster on standardized inputs
Missing value imputation	Fill with mean/median/mode	Choice affects model performance
Batch normalization	Uses batch mean and variance	Stabilizes deep learning training
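The imputation row is easy to try by hand. The sketch below fills missing entries (represented here as `None`) with each measure. The data and the `impute` helper are invented for illustration; libraries such as scikit-learn provide their own imputers.

```python
from statistics import mean, median, mode

def impute(values, fill):
    """Return a copy of `values` with every None replaced by `fill`."""
    return [fill if v is None else v for v in values]

# Numeric column with two gaps and one extreme value (assumed data)
ages = [22, 25, None, 27, 24, None, 95]
observed = [a for a in ages if a is not None]

mean_fill = mean(observed)      # pulled upward by the value 95
median_fill = median(observed)  # stays near the bulk of the data

# Categorical column: only the mode is defined
cities = ["Paris", None, "Rome", "Paris"]
mode_fill = mode([c for c in cities if c is not None])

print("mean-imputed:  ", impute(ages, mean_fill))
print("median-imputed:", impute(ages, median_fill))
print("mode-imputed:  ", impute(cities, mode_fill))
```

With the outlier present, mean imputation inserts an age that none of the typical rows are close to, which is why the choice of measure can affect a downstream model.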

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

np.random.seed(42)

# Generate data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5

# Add outliers
outlier_idx = np.random.choice(n, 10, replace=False)
y[outlier_idx] += np.random.randn(10) * 50

# Ordinary least squares minimizes squared error, so it targets the
# conditional mean and is sensitive to outliers
model_mse = LinearRegression().fit(X, y)
pred_mse = model_mse.predict(X)
print(f"OLS (MSE) model coefficient: {model_mse.coef_[0]:.3f}")
print(f"MSE: {mean_squared_error(y, pred_mse):.2f}")

# HuberRegressor minimizes the Huber loss: quadratic for small residuals,
# linear (MAE-like) for large ones. It is a robust stand-in here,
# not a pure MAE (median) fit.
model_huber = HuberRegressor().fit(X, y)
pred_huber = model_huber.predict(X)
print(f"\nHuber model coefficient: {model_huber.coef_[0]:.3f}")
print(f"MAE: {mean_absolute_error(y, pred_huber):.2f}")

# Compare how far each fit lands from the true slope
print("\nTrue coefficient: 3.0")
print(f"OLS is off by:   {abs(model_mse.coef_[0] - 3):.3f}")
print(f"Huber is off by: {abs(model_huber.coef_[0] - 3):.3f}")

Key Takeaways

Summary: Measures of Central Tendency

Mean is best for symmetric data without outliers — it uses all information
Median is robust; prefer it when data is skewed or has outliers
Mode is for categorical data and identifying the most common value
In right-skewed data: Mean > Median (the tail pulls the mean right)
In left-skewed data: Mean < Median
Always report the appropriate measure for your data type and distribution shape
