The Median

Descriptive Statistics

The Value That Splits Your Data Exactly in Half

The median is the middle value of a dataset when sorted. It divides the distribution exactly in half — 50% below, 50% above.

Robust to outliers — One extreme value cannot pull the median away from center
Works for ordinal data — The mean cannot; the median can
Minimizes absolute deviations — The mathematically optimal center for absolute loss
Income, housing, and skewed data — The median tells the truth when the mean lies

When data is skewed or contaminated with outliers, the median is your most honest summary.
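
The ordinal-data point in the list above can be illustrated with a short sketch. The rating scale and the survey responses are hypothetical, invented only for this example. It uses `statistics.median_low` from the Python standard library, which always returns an actual data point, so no two labels ever need to be averaged.

```python
import statistics

# Hypothetical ordinal scale, listed from lowest to highest.
levels = ["poor", "fair", "good", "very good", "excellent"]
rank = {label: i for i, label in enumerate(levels)}

# Hypothetical survey responses (illustrative only).
responses = ["good", "fair", "excellent", "good", "poor", "very good", "fair", "good"]

# Only the ordering of the labels is used; their spacing is never assumed.
ranks = [rank[r] for r in responses]
median_rank = statistics.median_low(ranks)  # lower of the two middle ranks when n is even
median_label = levels[median_rank]

print(f"Median response: {median_label}")
```

A mean of these labels would require treating the gap between "poor" and "fair" as equal to the gap between "good" and "very good", which an ordinal scale does not guarantee.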

What is the Median?

Definition

The median is the middle value of a dataset sorted in ascending order. It splits the data in half: at least 50% of the values are less than or equal to it, and at least 50% are greater than or equal to it.

Calculation

For odd n: Median = the middle value (position (n+1)/2)

For even n: Median = average of the two middle values

Median (Even n)

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

Here,

$M$ = the median
$x_{(n/2)}$ = the (n/2)-th ordered value
$x_{(n/2+1)}$ = the (n/2+1)-th ordered value

import numpy as np
import pandas as pd
from scipy import stats

# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")

# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")

Robustness to Outliers

Key Insight

The median's greatest strength: a single extreme value cannot move it far.

import matplotlib.pyplot as plt

np.random.seed(42)

# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)

# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []

for mult in multipliers:
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))

# A single panel is enough: both statistics share one log-scaled x-axis
fig, ax1 = plt.subplots(figsize=(6, 4))

outlier_values = [50 * m for m in multipliers]
ax1.plot(outlier_values, means, 'r-o', label='Mean')
ax1.plot(outlier_values, medians, 'b-o', label='Median')
ax1.set_xlabel('Outlier Value')
ax1.set_ylabel('Statistic Value')
ax1.set_title('Effect of Outlier on Mean vs Median')
ax1.legend()
ax1.set_xscale('log')
plt.tight_layout()
plt.show()

# Show breakdown point
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (any single outlier affects it)")
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
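
The 50% breakdown point can also be checked directly. The sketch below is a minimal illustration under assumed settings (a seeded normal sample of 100 points and a contamination value of 1e6, both chosen arbitrarily): it replaces a growing number of points with the extreme value and tracks both statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 100)  # 100 well-behaved points centered near 50

medians = {}
for k in [0, 10, 30, 49, 51]:
    contaminated = clean.copy()
    contaminated[:k] = 1e6  # replace the first k points with an extreme value
    medians[k] = np.median(contaminated)
    print(f"{k:>2} of 100 contaminated: "
          f"mean = {contaminated.mean():>10.1f}, median = {medians[k]:>10.1f}")
```

With 49 contaminated points both middle order statistics are still clean values, so the median stays near 50. Once 51 points are contaminated, the middle of the sorted data consists of contaminated values and the median jumps to 1e6. The mean, by contrast, is already far from 50 at 10 contaminated points.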

Median for Grouped Data

M = L + \left(\frac{n/2 - F}{f}\right) \times h

Here,

$L$ = lower boundary of the median class
$n$ = total number of observations
$F$ = cumulative frequency before the median class
$f$ = frequency of the median class
$h$ = class width

# Frequency table
freq_table = pd.DataFrame({
    'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
    'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)

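# Find median class (first class whose cumulative frequency reaches n/2)
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")

# Lower boundary of the median class (40-49). Using 40 treats the class as the
# interval [40, 50); for integer-valued data the boundary 39.5 is also common.
L = 40
# Cumulative frequency BEFORE the median class (0 if it is the first class)
F = freq_table.loc[median_class_idx - 1, 'cum_f'] if median_class_idx > 0 else 0
f = freq_table.loc[median_class_idx, 'f']
h = 10  # class width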

median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
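
The same interpolation can be wrapped in a small reusable function. This is a sketch, not a library routine: the name `grouped_median` and its arguments are hypothetical, and it assumes equal-width classes given by their lower boundaries in ascending order.

```python
def grouped_median(lower_bounds, freqs, width):
    """Estimate the median from a frequency table with equal-width classes.

    lower_bounds: lower boundary L of each class, in ascending order
    freqs:        frequency f of each class
    width:        common class width h
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    half = n / 2
    cum_before = 0  # F: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if cum_before + f >= half:  # first class whose cumulative frequency reaches n/2
            return lower + (half - cum_before) / f * width
        cum_before += f


estimate = grouped_median([20, 30, 40, 50, 60], [5, 12, 20, 18, 5], 10)
print(f"Estimated median = {estimate:.2f}")
```

For the table above this gives 40 + ((30 - 17) / 20) × 10 = 46.5, since n = 60, F = 17, and f = 20.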

Quartiles: Generalizing the Median

Definition: Quartiles

The median is the 50th percentile. Quartiles extend this:

Q1 = 25th percentile
Q2 = median = 50th percentile
Q3 = 75th percentile

data = np.random.normal(70, 15, 200)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median:   {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
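
Quartiles are often combined with the minimum and maximum into a five-number summary, and the IQR is commonly used to flag outliers. The sketch below assumes the conventional 1.5 × IQR fence multiplier (a rule of thumb, not something fixed by the definitions above) and uses a seeded sample with two planted extreme values, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 typical scores plus two planted extremes (illustrative values)
scores = np.append(rng.normal(70, 15, 200), [160.0, -40.0])

# Five-number summary: min, Q1, median, Q3, max
minimum, q1, med, q3, maximum = np.percentile(scores, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (conventional multiplier, an assumption here)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Five-number summary: {minimum:.1f}, {q1:.1f}, {med:.1f}, {q3:.1f}, {maximum:.1f}")
print(f"Fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f"Flagged values: {np.sort(flagged)}")
```

Because the fences are built from quartiles rather than from the mean and standard deviation, the planted extremes do not widen the fences enough to hide themselves.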

The Median in Machine Learning

In ML, the median is the robust choice:

ML Application	Why Median?	What Happens with Mean
MAE / Huber loss	MAE is minimized by the median; Huber loss limits the influence of large residuals	MSE is pulled by outliers
RobustScaler	Centers at median, scales by IQR	StandardScaler affected by outliers
Missing value imputation	Robust to extreme values	Mean imputation distorted by outliers
Outlier detection	IQR fence uses quartiles	Mean ± std fails on skewed data
Feature ranking	Median test for non-normal data	t-test assumes normality
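
The first row of the table can be checked numerically. The sketch below uses a small made-up sample with one extreme value and a brute-force grid search over constant predictions: the constant that minimizes the sum of absolute errors should land on the median, and the constant that minimizes the sum of squared errors should land on the mean.

```python
import numpy as np

values = np.array([2.0, 3.0, 5.0, 8.0, 100.0])  # illustrative sample with one extreme value
candidates = np.linspace(0, 100, 10001)          # constant predictions on a 0.01 grid

# Loss of each candidate constant c, summed over all data points
residuals = values[:, None] - candidates[None, :]
abs_loss = np.abs(residuals).sum(axis=0)
sq_loss = (residuals ** 2).sum(axis=0)

best_abs = candidates[abs_loss.argmin()]
best_sq = candidates[sq_loss.argmin()]

print(f"Minimizer of absolute loss: {best_abs:.2f} (median = {np.median(values):.2f})")
print(f"Minimizer of squared loss:  {best_sq:.2f} (mean   = {values.mean():.2f})")
```

The grid search is deliberately naive; it is there to make the result visible, not to suggest how these minimizers are computed in practice.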

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.preprocessing import RobustScaler, StandardScaler

np.random.seed(42)

# Compare least squares (MSE) vs Huber loss on data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5

# Add outliers
y[::20] += np.random.randn(10) * 50  # every 20th point is extreme

# MSE model (least squares, fits the conditional mean): affected by outliers
mse_model = LinearRegression().fit(X, y)
print(f"MSE model (least squares): coef = {mse_model.coef_[0]:.3f}")

# Huber model: quadratic loss for small residuals, absolute (MAE-like) loss for
# large ones, so outliers have limited influence. It is not a pure MAE/median fit.
huber_model = HuberRegressor(epsilon=1.35).fit(X, y)
print(f"Huber model (robust): coef = {huber_model.coef_[0]:.3f}")
print(f"True coefficient: 3.0\n")

# RobustScaler vs StandardScaler
data = np.concatenate([np.random.normal(50, 10, 100), [500, -200]])  # with outliers

robust = RobustScaler()  # uses median and IQR
standard = StandardScaler()  # uses mean and std

X_col = data.reshape(-1, 1)
standard.fit(X_col)
robust.fit(X_col)

print(f"Data with outliers: mean={np.mean(data):.1f}, median={np.median(data):.1f}")
# Report the fitted centers themselves, not the mean of the transformed data
print(f"StandardScaler center (mean): {standard.mean_[0]:.1f}")
print(f"RobustScaler center (median): {robust.center_[0]:.1f}")
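
The missing-value imputation row of the table can be sketched the same way. The income figures below are hypothetical, chosen so that one extreme value dominates the mean; `fillna`, `mean`, and `median` are standard pandas `Series` methods, and both statistics skip missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with two missing entries and one extreme value
income = pd.Series([32_000, 35_000, np.nan, 38_000, 41_000, np.nan, 2_500_000])

mean_filled = income.fillna(income.mean())      # fill value pulled up by the extreme income
median_filled = income.fillna(income.median())  # fill value stays among the typical incomes

print(f"Mean used for imputation:   {income.mean():,.0f}")
print(f"Median used for imputation: {income.median():,.0f}")
```

Here the mean-based fill value is larger than every observed income except the extreme one, while the median-based fill value is one of the typical incomes.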

Key Takeaways

Summary: Median

The median splits the distribution in half by frequency, not by value
Breakdown point of 50%: the median stays bounded as long as fewer than half of the observations are contaminated
For skewed data, income, prices — always report median alongside mean
The median minimizes the sum of absolute deviations, which makes it the optimal constant under absolute (MAE) loss
Quartiles extend the median concept: together with the minimum and maximum, Q1, the median, and Q3 form the five-number summary
For grouped data, use the interpolation formula — the result is an estimate, not exact
