Median — Calculation, Robustness, and When to Use It


The Median

Descriptive Statistics

The Value That Splits Your Data Exactly in Half

The median is the middle value of a dataset when sorted. It divides the distribution exactly in half — 50% below, 50% above.

  • Robust to outliers — One extreme value cannot pull the median away from center
  • Works for ordinal data — The mean cannot; the median can
  • Minimizes absolute deviations — The mathematically optimal center for absolute loss
  • Income, housing, and skewed data — The median tells the truth when the mean lies

When data is skewed or contaminated with outliers, the median is your most honest summary.
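To make the point about skewed data concrete, here is a minimal sketch using a simulated right-skewed "income" sample. The lognormal distribution and its parameters are assumptions chosen for illustration, not real income data.

```python
import numpy as np

# Hypothetical right-skewed "income" sample; distribution and parameters are illustrative assumptions
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)

mean_income = incomes.mean()
median_income = np.median(incomes)
# Fraction of the sample that earns less than the mean
share_below_mean = (incomes < mean_income).mean()

print(f"Mean:   {mean_income:,.0f}")
print(f"Median: {median_income:,.0f}")
print(f"Share of values below the mean: {share_below_mean:.0%}")
```

In a right-skewed sample like this one, the long upper tail drags the mean above the median, so well over half of the values sit below the mean.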


What is the Median?

Definition

The median is the middle value of a dataset sorted in ascending order. It splits the data in half: at least 50% of the values lie at or below it, and at least 50% lie at or above it.
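Because the definition only needs an ordering, it also applies to ordinal data. The sketch below assumes a hypothetical four-level rating scale and made-up responses; it uses `statistics.median_low`, which always returns an actual data value and so maps cleanly back to a label.

```python
import statistics

# Hypothetical ordered scale and survey responses (illustrative data)
scale = ["poor", "fair", "good", "excellent"]  # lowest to highest
responses = ["good", "poor", "excellent", "fair", "good", "good", "fair"]

# Map each label to its rank on the scale
rank = {label: i for i, label in enumerate(scale)}
ranks = [rank[r] for r in responses]

# median_low picks the lower of the two middle values when n is even,
# so the result is always a valid rank that maps back to a label
median_label = scale[statistics.median_low(ranks)]
print(f"Median rating: {median_label}")
```

Averaging the two middle labels would be meaningless for ordinal categories, which is why this sketch picks the lower middle value instead of interpolating.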


Calculation

For odd n: Median = the middle value (position (n+1)/2)

For even n: Median = average of the two middle values

Median (Even n)

$$M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$

Here,

  • $M$ = The median
  • $x_{(n/2)}$ = The (n/2)-th ordered value
  • $x_{(n/2+1)}$ = The (n/2+1)-th ordered value
import numpy as np
import pandas as pd
from scipy import stats

# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")

# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")

Robustness to Outliers

Key Insight

The median's greatest strength: a single extreme value cannot move it far.

import matplotlib.pyplot as plt

np.random.seed(42)

# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)

# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []

for mult in multipliers:
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))

# One panel is enough for this comparison
fig, ax1 = plt.subplots(figsize=(8, 4))

outlier_values = [50 * m for m in multipliers]
ax1.plot(outlier_values, means, 'r-o', label='Mean')
ax1.plot(outlier_values, medians, 'b-o', label='Median')
ax1.set_xlabel('Outlier Value')
ax1.set_ylabel('Statistic Value')
ax1.set_title('Effect of Outlier on Mean vs Median')
ax1.legend()
ax1.set_xscale('log')  # outlier values span several orders of magnitude
plt.show()

# Show breakdown point
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (any single outlier affects it)")
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
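The 50% breakdown point quoted above can be checked directly. The following is a minimal sketch, assuming a fresh simulated sample and an arbitrary contaminating value of 1e6: it replaces a growing number of points with that value and records both statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 100)  # 100 well-behaved points (illustrative sample)

results = {}
for n_bad in [0, 10, 30, 49, 50, 51]:
    contaminated = clean.copy()
    contaminated[:n_bad] = 1e6  # replace n_bad points with an arbitrary huge value
    results[n_bad] = (np.mean(contaminated), np.median(contaminated))
    print(f"{n_bad:>2} bad points: mean = {results[n_bad][0]:>12,.1f}, "
          f"median = {results[n_bad][1]:>12,.1f}")
```

With 49 of 100 points contaminated, both middle values are still clean observations, so the median stays near 50. At 50 contaminated points one of the two middle values is corrupted and the median jumps, while the mean is already far off with only 10.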

Median for Grouped Data

Median for Grouped Data

$$M = L + \left(\frac{n/2 - F}{f}\right) \times h$$

Here,

  • $L$ = Lower boundary of median class
  • $n$ = Total number of observations
  • $F$ = Cumulative frequency before median class
  • $f$ = Frequency of median class
  • $h$ = Class width
# Frequency table
freq_table = pd.DataFrame({
    'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
    'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)

# Find median class (first class whose cumulative frequency reaches n/2)
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")

# Lower boundary of median class (40-49), treating classes as [40, 50);
# use 39.5 instead if the class limits are inclusive integers
L = 40
# Cumulative frequency BEFORE the median class (0 if the median class is the first one)
F = freq_table.loc[median_class_idx - 1, 'cum_f'] if median_class_idx > 0 else 0
f = freq_table.loc[median_class_idx, 'f']
h = 10  # class width

median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
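The same interpolation can be wrapped in a small reusable function. This is a sketch under stated assumptions: `grouped_median` is a hypothetical helper name, classes are taken to be equal-width, and each class is described by its lower boundary.

```python
import numpy as np

def grouped_median(lower_bounds, freqs, width):
    """Interpolated median for equal-width classes (hypothetical helper)."""
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs)
    n = cum[-1]
    # First class whose cumulative frequency reaches n/2
    idx = int(np.argmax(cum >= n / 2))
    # Cumulative frequency before the median class (0 if it is the first class)
    F = cum[idx - 1] if idx > 0 else 0.0
    return lower_bounds[idx] + (n / 2 - F) / freqs[idx] * width

# Same frequency table as above, with lower boundaries 20, 30, ..., 60
estimate = grouped_median([20, 30, 40, 50, 60], [5, 12, 20, 18, 5], width=10)
print(f"Estimated median = {estimate:.2f}")  # → Estimated median = 46.50
```

The function locates the median class itself, so the lower boundary no longer has to be typed in by hand. The result is still an estimate, since it assumes observations are spread evenly within the median class.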

Quartiles: Generalizing the Median

Quartiles

The median is the 50th percentile. Quartiles extend this:

  • Q1 = 25th percentile
  • Q2 = median = 50th percentile
  • Q3 = 75th percentile
data = np.random.normal(70, 15, 200)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median:   {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
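Quartiles also feed the five-number summary and the IQR-based outlier rule. The sketch below uses a small made-up dataset and the conventional 1.5 × IQR fences; both the data and the 1.5 multiplier are assumptions for illustration.

```python
import numpy as np

values = np.array([12, 15, 14, 10, 18, 16, 13, 95, 11, 17])  # hypothetical data with one extreme value

q1, med, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max
five_number = (values.min(), q1, med, q3, values.max())
print(f"Five-number summary: {five_number}")

# Conventional fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged as outliers: {outliers.tolist()}")
```

Because the fences are built from quartiles, the extreme value does not inflate the threshold that is used to detect it, unlike a mean ± standard deviation rule.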

The Median in Machine Learning

[Figure: the median in ML. Panels: MAE Loss (minimizes to median), Robust Scaler (centers at median), Median Impute (fill missing values), Outlier Removal (IQR-based filter). Caption: the median is the robust alternative to the mean, essential for outlier-heavy ML data.]

In ML, the median is the robust choice:

| ML Application | Why Median? | What Happens with Mean |
| --- | --- | --- |
| MAE / Huber loss | MAE is minimized by the median; Huber loss behaves like MAE for large residuals | MSE is pulled by outliers |
| RobustScaler | Centers at median, scales by IQR | StandardScaler affected by outliers |
| Missing value imputation | Robust to extreme values | Mean imputation distorted by outliers |
| Outlier detection | IQR fence uses quartiles | Mean ± std fails on skewed data |
| Feature ranking | Median test for non-normal data | t-test assumes normality |
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.preprocessing import RobustScaler, StandardScaler

np.random.seed(42)

# Compare MSE vs MAE on data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5

# Add outliers
y[::20] += np.random.randn(10) * 50  # every 20th point is extreme

# Least-squares model (minimizes squared error, mean-like): affected by outliers
mse_model = LinearRegression().fit(X, y)
print(f"MSE model (least squares): coef = {mse_model.coef_[0]:.3f}")

# Huber model: quadratic loss for small residuals, linear (MAE-like) for large ones, so more robust
huber_model = HuberRegressor(epsilon=1.35).fit(X, y)
print(f"Huber model (robust loss): coef = {huber_model.coef_[0]:.3f}")
print(f"True coefficient: 3.0\n")

# RobustScaler vs StandardScaler
data = np.concatenate([np.random.normal(50, 10, 100), [500, -200]])  # with outliers

robust = RobustScaler()  # uses median and IQR
standard = StandardScaler()  # uses mean and std

print(f"Data with outliers: mean={np.mean(data):.1f}, median={np.median(data):.1f}")
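# Report the centers each scaler subtracts (the mean of the transformed data would not show this)
print(f"StandardScaler center (mean): {standard.fit(data.reshape(-1,1)).mean_[0]:.1f}")
print(f"RobustScaler center (median): {robust.fit(data.reshape(-1,1)).center_[0]:.1f}")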
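The claim that MAE is minimized by the median while MSE is minimized by the mean can be checked numerically. The following is a minimal sketch, assuming a tiny made-up target vector and a brute-force grid of constant predictions instead of a real optimizer.

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 8.0, 40.0])   # hypothetical targets with one extreme value
candidates = np.linspace(0, 45, 4501)      # candidate constant predictions, step 0.01

# Total loss of each constant prediction c against every target
residuals = y[:, None] - candidates[None, :]
abs_loss = np.abs(residuals).sum(axis=0)   # sum of absolute errors
sq_loss = (residuals ** 2).sum(axis=0)     # sum of squared errors

best_abs = candidates[abs_loss.argmin()]
best_sq = candidates[sq_loss.argmin()]

print(f"Best constant under absolute loss: {best_abs:.2f} (median = {np.median(y):.2f})")
print(f"Best constant under squared loss:  {best_sq:.2f} (mean   = {y.mean():.2f})")
```

The absolute-loss minimizer lands on the median and ignores how far away the extreme target is, while the squared-loss minimizer lands on the mean and is pulled toward it.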

Key Takeaways

Summary: Median

  1. The median splits the distribution in half by frequency, not by value
  2. Breakdown point of 50%: the median stays bounded as long as fewer than half of the observations are contaminated
  3. For skewed data, income, prices — always report median alongside mean
  4. The median minimizes the sum of absolute deviations (the MAE objective)
  5. Quartiles extend the median concept; together with the minimum and maximum they form the five-number summary
  6. For grouped data, use the interpolation formula — the result is an estimate, not exact
