
Measures of Central Tendency — Mean, Median, Mode Compared


Three Ways to Find the Center — And Why It Matters

Different measures capture different notions of center, and choosing the right one depends on your data and question.

  • Arithmetic mean — Sensitive to every value; the workhorse of statistics
  • Median — Robust to outliers; the choice for skewed distributions
  • Mode — The only measure that works for categorical data
  • Shape dependency — Symmetric data: mean equals median; skewed data: they diverge

The center is usually the first number reported about a data set, so choose the measure deliberately.


What is Central Tendency?

Definition

A measure of central tendency is a single value that describes where the "center" of a distribution is located.


The Three Main Measures

Arithmetic Mean (x̄)

Sum of all values divided by count. The most common measure.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here,

  • $\bar{x}$ = sample mean
  • $n$ = number of observations
  • $x_i$ = the i-th observation

Properties:

  • Uses every data point — sensitive to outliers
  • Deviations from the mean sum to zero, Σ(xᵢ - x̄) = 0, and the mean is the only value with this property
  • Minimizes the sum of squared deviations
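The last two properties can be checked numerically. The sketch below uses a small made-up sample (the array `x` and the grid of candidate centers are illustrative assumptions, not part of the lesson) and a brute-force grid search, so it is a quick sanity check rather than a proof.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 5.0, 10.0])
x_bar = x.mean()

# Property: deviations from the mean sum to zero (up to floating-point error)
deviation_sum = np.sum(x - x_bar)

# Property: the mean minimizes the sum of squared deviations.
# Evaluate the sum of squares for a grid of candidate centers c.
candidates = np.linspace(x.min(), x.max(), 801)
sum_sq = ((x[:, None] - candidates[None, :]) ** 2).sum(axis=0)
best_c = candidates[np.argmin(sum_sq)]

print(f"mean = {x_bar:.2f}")
print(f"sum of deviations = {deviation_sum:.2e}")
print(f"grid minimizer of squared deviations = {best_c:.2f}")
```

The grid minimizer should land on the sample mean to within the grid spacing (0.01 here).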

Median (M)

The middle value when data is sorted. For even n, average the two middle values.

Properties:

  • Robust to outliers
  • Appropriate for ordinal and continuous data
  • Minimizes the sum of absolute deviations
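As a rough illustration of the definition and of the robustness property, here is a hand-rolled median next to NumPy's. The helper name `median_by_hand` and both samples are hypothetical, picked so the arithmetic is easy to follow.

```python
import numpy as np

def median_by_hand(values):
    """Middle value of the sorted data; for even n, average the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

odd_sample = [7, 1, 3, 9, 5]          # sorted: 1 3 5 7 9
even_sample = [7, 1, 3, 9, 5, 100]    # sorted: 1 3 5 7 9 100

print(median_by_hand(odd_sample))     # 5
print(median_by_hand(even_sample))    # 6.0

# Robustness: adding the outlier 100 moves the mean far more than the median
print(np.mean(odd_sample), np.mean(even_sample))

# The median gives a smaller sum of absolute deviations than the mean does
x = np.array(even_sample, dtype=float)
abs_dev_at_median = np.abs(x - np.median(x)).sum()
abs_dev_at_mean = np.abs(x - x.mean()).sum()
print(abs_dev_at_median, abs_dev_at_mean)
```

In practice `np.median` is the better choice; the manual version is only there to make the even-n rule visible.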

Mode

The most frequently occurring value(s). A distribution can be unimodal, bimodal, or multimodal.

Properties:

  • Only measure appropriate for nominal data
  • Can have multiple modes (or none in continuous data)
  • Not affected by extreme values
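For nominal data the mode is just a frequency count. A minimal sketch using the standard library follows; the helper `modes` and the two sample lists are made up for illustration, and the helper returns every tied value so bimodal data is not silently reduced to one answer.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, in first-seen order."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

colors = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "M", "M", "L", "L"]

print(modes(colors))  # ['red']
print(modes(sizes))   # ['M', 'L']
```

Note that this sketch assumes a non-empty input; an empty list would raise an error at `max`.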
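
import numpy as np
import matplotlib.pyplot as plt

def describe_central(data, label):
    mean = np.mean(data)
    median = np.median(data)
    # Continuous samples rarely repeat a value, so an exact mode is not meaningful.
    # Estimate it as the midpoint of the fullest histogram bin instead.
    counts, edges = np.histogram(data, bins='auto')
    peak = np.argmax(counts)
    mode = (edges[peak] + edges[peak + 1]) / 2

    # Heuristic label: treat a gap under 5% of the standard deviation as symmetric
    diff = mean - median
    tol = 0.05 * np.std(data)
    shape = 'symmetric' if abs(diff) < tol else 'right-skewed' if diff > 0 else 'left-skewed'

    print(f"\n{label}:")
    print(f"  Mean   = {mean:.2f}")
    print(f"  Median = {median:.2f}")
    print(f"  Mode   ≈ {mode:.2f} (histogram estimate)")
    print(f"  Mean - Median = {diff:.2f} ({shape})")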

np.random.seed(42)

# Symmetric distribution
sym = np.random.normal(50, 10, 1000)
describe_central(sym, "Symmetric (Normal)")

# Right-skewed (income-like)
right_skew = np.random.lognormal(3.5, 0.8, 1000)
describe_central(right_skew, "Right-Skewed (Income)")

# Left-skewed
left_skew = 100 - np.random.exponential(10, 1000)
describe_central(left_skew, "Left-Skewed")

# With outlier
with_outlier = np.concatenate([np.random.normal(50, 5, 99), [500]])
describe_central(with_outlier, "Data with One Extreme Outlier (n=100)")

Effect of Skewness on Mean vs Median

Typical ordering for unimodal distributions (a rule of thumb, not a guarantee):

Right-Skewed:        Mode < Median < Mean
Symmetric:           Mode ≈ Median ≈ Mean
Left-Skewed:         Mean < Median < Mode

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

datasets = {
    'Right-Skewed': np.random.lognormal(3, 0.8, 2000),
    'Symmetric': np.random.normal(50, 10, 2000),
    'Left-Skewed': 100 - np.random.exponential(10, 2000)
}

for ax, (title, data) in zip(axes, datasets.items()):
    ax.hist(data, bins=40, density=True, color='lightblue', edgecolor='gray', alpha=0.7)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', lw=2, linestyle='--', label=f'Mean={mean_val:.1f}')
    ax.axvline(median_val, color='blue', lw=2, linestyle='-', label=f'Median={median_val:.1f}')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('central_tendency.png', dpi=150)
plt.show()

When to Use Which Measure

| Situation | Best Measure | Why |
| --- | --- | --- |
| Symmetric distribution, no outliers | Mean | Most efficient, uses all data |
| Skewed distribution | Median | Robust, not pulled by the tail |
| Outliers present | Median | Mean is distorted |
| Nominal (categorical) data | Mode | Mean/median meaningless |
| Ordinal data | Median | Can rank, but intervals unknown |
| Reporting income, housing prices | Median | Right-skewed distributions |
| Reporting temperature, height | Mean | Approximately normal |

A Famous Example: Income

Why It Matters

The mean household income in the US is significantly higher than the median household income because a small number of very high earners pull the mean up. The median better represents the typical household.

# Simulate US-like income distribution
np.random.seed(0)
incomes = np.concatenate([
    np.random.lognormal(10.8, 0.7, 9990),   # Most people
    np.random.lognormal(13.5, 1.0, 10)      # Very wealthy (top 0.1%)
])

print(f"Mean income:   ${np.mean(incomes):>12,.0f}")
print(f"Median income: ${np.median(incomes):>12,.0f}")
print(f"The top 0.1% pulls the mean ${np.mean(incomes)-np.median(incomes):,.0f} above the median!")

Central Tendency in Machine Learning

Figure: mean, median, and mode are the building blocks of ML preprocessing and loss functions.

  • Mean imputation: fill missing values
  • Loss functions: MSE → mean, MAE → median
  • Normalization: StandardScaler uses the mean
  • Aggregate embeddings: mean pooling in NLP

Central tendency is fundamental to ML:

| ML Concept | Central Tendency Connection | Why It Matters |
| --- | --- | --- |
| Mean Squared Error (MSE) | Minimizes → predicts the mean | Regression default loss |
| Mean Absolute Error (MAE) | Minimizes → predicts the median | Robust to outliers |
| StandardScaler | Centers data to mean=0 | Neural networks train faster |
| Missing value imputation | Fill with mean/median/mode | Choice affects model performance |
| Batch normalization | Uses batch mean | Stabilizes deep learning training |

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

np.random.seed(42)

# Generate data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5

# Add outliers
outlier_idx = np.random.choice(n, 10, replace=False)
y[outlier_idx] += np.random.randn(10) * 50

# Ordinary least squares minimizes squared error (targets the conditional mean),
# so large residuals from outliers carry a lot of weight
model_mse = LinearRegression().fit(X, y)
pred_mse = model_mse.predict(X)
print(f"OLS (squared loss) coefficient: {model_mse.coef_[0]:.3f}")
print(f"MSE: {mean_squared_error(y, pred_mse):.2f}")

# HuberRegressor does not minimize MAE exactly: Huber loss is quadratic for small
# residuals and linear (MAE-like) for large ones, which limits the pull of outliers
model_huber = HuberRegressor().fit(X, y)
pred_huber = model_huber.predict(X)
print(f"\nHuber coefficient: {model_huber.coef_[0]:.3f}")
print(f"MAE: {mean_absolute_error(y, pred_huber):.2f}")

# Compare each fit against the true slope used to generate the data
print(f"\nTrue coefficient: 3.0")
print(f"OLS is off by:   {abs(model_mse.coef_[0] - 3):.3f}")
print(f"Huber is off by: {abs(model_huber.coef_[0] - 3):.3f}")
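
The missing-value imputation row in the table above can be sketched with NumPy alone. The `feature` array is a hypothetical column with two gaps and one extreme value, invented here to show how the choice of fill value matters; a real pipeline would more likely use a dedicated imputer fitted on training data only.

```python
import numpy as np

# Hypothetical feature column with two missing entries and one extreme value
feature = np.array([12.0, 15.0, np.nan, 14.0, 13.0, 200.0, np.nan, 16.0])

mean_fill = np.nanmean(feature)      # ignores NaN, but is pulled up by 200
median_fill = np.nanmedian(feature)  # ignores NaN, barely affected by 200

# Replace only the missing entries; observed values stay untouched
imputed_mean = np.where(np.isnan(feature), mean_fill, feature)
imputed_median = np.where(np.isnan(feature), median_fill, feature)

print(f"mean fill   = {mean_fill:.2f}")
print(f"median fill = {median_fill:.2f}")
```

Here the mean fill is 45.00 while the median fill is 14.50, so mean imputation would insert values unlike any typical observation in the column.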

Key Takeaways

Summary: Measures of Central Tendency

  1. Mean is best for symmetric data without outliers — it uses all information
  2. Median is robust: prefer it when data is skewed or has outliers
  3. Mode is for categorical data and identifying the most common value
  4. In right-skewed data: Mean > Median (the tail pulls the mean right)
  5. In left-skewed data: Mean < Median
  6. Always report the appropriate measure for your data type and distribution shape
