The Median
Descriptive Statistics
The Value That Splits Your Data Exactly in Half
The median is the middle value of a dataset when sorted. It divides the distribution exactly in half — 50% below, 50% above.
- Robust to outliers — One extreme value cannot pull the median away from center
- Works for ordinal data — The mean cannot; the median can
- Minimizes absolute deviations — The mathematically optimal center for absolute loss
- Income, housing, and skewed data — The median tells the truth when the mean lies
When data is skewed or contaminated with outliers, the median is your most honest summary.
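The absolute-deviation property listed above can be checked numerically. The following is a minimal sketch, not a proof: it evaluates the total absolute deviation on a grid of candidate centers for a small illustrative dataset and compares the best candidate with `np.median`. The data values and the helper name `sum_abs_dev` are assumptions made for this example.

```python
import numpy as np

def sum_abs_dev(data, c):
    """Total absolute deviation of the data from a candidate center c."""
    return float(np.sum(np.abs(np.asarray(data, dtype=float) - c)))

data = [3, 7, 12, 8, 5, 15, 9]  # illustrative values, n is odd

# Try every candidate center on a grid with step 0.01 between min and max
candidates = np.linspace(min(data), max(data), 1201)
losses = [sum_abs_dev(data, c) for c in candidates]
best = float(candidates[int(np.argmin(losses))])

print(f"Grid minimizer of absolute loss: {best:.2f}")
print(f"np.median: {np.median(data)}")
print(f"Loss at median: {sum_abs_dev(data, np.median(data)):.2f}")
print(f"Loss at mean:   {sum_abs_dev(data, np.mean(data)):.2f}")
```

With an odd number of observations the minimizer is unique, so the grid search should land on the median; with an even number, any value between the two middle observations gives the same minimum loss.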
What is the Median?
Definition
The median is the middle value of a dataset sorted in ascending order. It splits the data in half: at least 50% of the values are less than or equal to it, and at least 50% are greater than or equal to it.
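Because the definition relies only on sort order, it also applies to ordered categories, where a mean is not meaningful. A small sketch, assuming a hypothetical five-level rating scale and an odd number of responses so that a single middle category exists:

```python
import pandas as pd

# Hypothetical ordinal scale, listed from lowest to highest
levels = ["poor", "fair", "good", "very good", "excellent"]
ratings = pd.Categorical(
    ["good", "poor", "excellent", "fair", "good", "very good", "fair"],
    categories=levels,
    ordered=True,
)

# Each category maps to its integer rank (0 = "poor", 4 = "excellent")
codes = sorted(int(c) for c in ratings.codes)
median_code = codes[len(codes) // 2]  # middle rank, valid because n is odd
median_rating = levels[median_code]

print(f"Sorted ranks: {codes}")
print(f"Median rating: {median_rating}")
```

For an even number of responses the two middle categories may differ, and averaging them is not defined for ordinal data; reporting the lower or upper middle category is one common workaround.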
Calculation
For odd n: Median = the middle value (position (n+1)/2)
For even n: Median = average of the two middle values
Median (Even n)
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
Here,
- $\tilde{x}$ = The median
- $x_{(n/2)}$ = The (n/2)-th ordered value
- $x_{(n/2+1)}$ = The (n/2+1)-th ordered value
import numpy as np
import pandas as pd
from scipy import stats
# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")
# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")
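Averaging the two middle values is a convention, and it can produce a number that never occurs in the data. As a brief illustration, Python's standard `statistics` module offers the averaged median as well as variants that return an actual observation. The dataset below reuses the even-length values from the example above.

```python
import statistics

data_even = [3, 7, 12, 8, 5, 15, 9, 11]  # sorted: 3, 5, 7, 8, 9, 11, 12, 15

mid = statistics.median(data_even)        # average of the two middle values
low = statistics.median_low(data_even)    # lower of the two middle values
high = statistics.median_high(data_even)  # upper of the two middle values

print(f"median      = {mid}")
print(f"median_low  = {low}")
print(f"median_high = {high}")
```

`median_low` or `median_high` can be preferable when the result must be a value that actually appears in the data, for example with discrete counts.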
Robustness to Outliers
Key Insight
The median's greatest strength: a single extreme value cannot move it far.
import matplotlib.pyplot as plt
np.random.seed(42)
# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)
# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []
for mult in multipliers:
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))
fig, ax1 = plt.subplots(figsize=(8, 4))
outlier_values = [50 * m for m in multipliers]
ax1.plot(outlier_values, means, 'r-o', label='Mean')
ax1.plot(outlier_values, medians, 'b-o', label='Median')
ax1.set_xlabel('Outlier Value')
ax1.set_ylabel('Statistic Value')
ax1.set_title('Effect of Outlier on Mean vs Median')
ax1.legend()
ax1.set_xscale('log')  # outlier values span several orders of magnitude
plt.tight_layout()
plt.show()
# Show breakdown point
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (any single outlier affects it)")
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
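The 50% breakdown point can also be illustrated directly. The sketch below replaces a growing number of observations with an arbitrarily large value and tracks the median. The helper name `contaminated_median`, the sample, and the bad value of 1e6 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 100)  # 100 well-behaved observations around 50

def contaminated_median(data, k, bad_value=1e6):
    """Median after replacing the first k observations with bad_value."""
    corrupted = data.copy()
    corrupted[:k] = bad_value
    return float(np.median(corrupted))

for k in (1, 10, 30, 49, 51):
    med = contaminated_median(clean, k)
    print(f"{k:>3} of 100 contaminated: median = {med:,.2f}")
```

As long as fewer than half of the values are replaced, the median remains inside the range of the clean data. Once more than half are replaced, it jumps to the contaminating value.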
Median for Grouped Data
$$\text{Median} = L + \left(\frac{n/2 - F}{f}\right) \times h$$
Here,
- $L$ = Lower boundary of median class
- $n$ = Total number of observations
- $F$ = Cumulative frequency before median class
- $f$ = Frequency of median class
- $h$ = Class width
# Frequency table
freq_table = pd.DataFrame({
'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)
# Find the median class: the first class whose cumulative frequency reaches n/2
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")
L = 40 # lower boundary of median class (40-49), treating the class as the interval [40, 50)
# Cumulative frequency BEFORE the median class (0 if the median class is the first one)
F = freq_table.loc[median_class_idx - 1, 'cum_f'] if median_class_idx > 0 else 0
f = freq_table.loc[median_class_idx, 'f']
h = 10 # class width
median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
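The interpolation steps above can be wrapped in a small reusable function so that the lower boundary is looked up rather than hard-coded. This is a sketch under stated assumptions: the function name `grouped_median` is hypothetical, all classes share one width, and each class is treated as the interval starting at its lower bound.

```python
def grouped_median(lower_bounds, freqs, width):
    """Estimate the median from a frequency table by linear interpolation.

    lower_bounds: lower boundary of each class, in ascending order
    freqs:        frequency of each class
    width:        common class width
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cum_before = 0  # cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if cum_before + f >= n / 2:  # first class that reaches n/2
            return lower + (n / 2 - cum_before) / f * width
        cum_before += f
    raise ValueError("median class not found")

estimate = grouped_median([20, 30, 40, 50, 60], [5, 12, 20, 18, 5], 10)
print(f"Estimated median = {estimate:.2f}")
```

For the table above this gives 40 + (30 - 17) / 20 × 10 = 46.5. The formula assumes observations are spread evenly within the median class, which is why the result is only an estimate.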
Quartiles: Generalizing the Median
Definition: Quartiles
The median is the 50th percentile. Quartiles extend this:
- Q1 = 25th percentile
- Q2 = median = 50th percentile
- Q3 = 75th percentile
data = np.random.normal(70, 15, 200)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median: {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
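Quartiles are also the basis of the five-number summary and of the common 1.5 × IQR outlier rule (Tukey's fences). A minimal sketch on a small illustrative dataset follows. The values are made up for this example, and the 1.5 multiplier is the conventional default rather than a fixed requirement.

```python
import numpy as np

values = np.array([10, 12, 13, 15, 16, 18, 19, 21, 95])  # illustrative data

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, q2, q3 = np.percentile(values, [25, 50, 75])
summary = (values.min(), q1, q2, q3, values.max())
print(f"Five-number summary: {summary}")

# Tukey's fences: flag points more than 1.5 * IQR outside the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged as outliers: {outliers.tolist()}")
```

Note that `np.percentile` uses linear interpolation by default; other quantile conventions can give slightly different quartiles on small samples.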
The Median in Machine Learning
In ML, median-based methods are the usual choice when robustness to outliers matters:
| ML Application | Why Median? | What Happens with Mean |
|---|---|---|
| MAE loss | The constant that minimizes MAE is the median; Huber loss blends MAE and MSE for robustness | MSE is minimized by the mean, which outliers pull |
| RobustScaler | Centers at median, scales by IQR | StandardScaler affected by outliers |
| Missing value imputation | Robust to extreme values | Mean imputation distorted by outliers |
| Outlier detection | IQR fence uses quartiles | Mean ± std fails on skewed data |
| Feature ranking | Median test for non-normal data | t-test assumes normality |
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.preprocessing import RobustScaler, StandardScaler
np.random.seed(42)
# Compare least squares (MSE) with a robust Huber fit on data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5
# Add outliers
y[::20] += np.random.randn(10) * 50 # every 20th point gets large extra noise
# Least squares minimizes squared error (mean-like), so outliers can pull the fit
mse_model = LinearRegression().fit(X, y)
print(f"MSE model (least squares): coef = {mse_model.coef_[0]:.3f}")
# Huber loss is quadratic for small residuals and linear (MAE-like) for large ones
huber_model = HuberRegressor(epsilon=1.35).fit(X, y)
print(f"Huber model (robust loss): coef = {huber_model.coef_[0]:.3f}")
print(f"True coefficient: 3.0\n")
# RobustScaler vs StandardScaler
data = np.concatenate([np.random.normal(50, 10, 100), [500, -200]]) # with outliers
robust = RobustScaler() # uses median and IQR
standard = StandardScaler() # uses mean and std
print(f"Data with outliers: mean={np.mean(data):.1f}, median={np.median(data):.1f}")
standard.fit(data.reshape(-1, 1))
robust.fit(data.reshape(-1, 1))
print(f"StandardScaler center (mean): {standard.mean_[0]:.1f}, scale (std): {standard.scale_[0]:.1f}")
print(f"RobustScaler center (median): {robust.center_[0]:.1f}, scale (IQR): {robust.scale_[0]:.1f}")
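The missing-value row of the table can be illustrated with plain pandas. This is a small sketch on made-up numbers (a hypothetical income column in thousands, with one extreme value), comparing what gets filled in under median and mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes in thousands; 1000.0 is an extreme value
incomes = pd.Series([42.0, 38.0, np.nan, 45.0, 40.0, 1000.0, np.nan])

# Series.median() and Series.mean() skip NaN values by default
filled_with_median = incomes.fillna(incomes.median())
filled_with_mean = incomes.fillna(incomes.mean())

print(f"Median used for imputation: {incomes.median()}")
print(f"Mean used for imputation:   {incomes.mean()}")
```

Here the median fill value stays close to the typical incomes, while the mean fill value is dominated by the single extreme observation.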
Key Takeaways
Summary: Median
- The median splits the distribution in half by frequency, not by value
- Breakdown point of 50%: the median stays bounded as long as fewer than half of the observations are contaminated
- For skewed data, income, prices — always report median alongside mean
- The median minimizes the sum of absolute deviations (the L1, or least absolute deviations, criterion)
- Quartiles extend the median concept; with the minimum and maximum, they form the five-number summary
- For grouped data, use the interpolation formula — the result is an estimate, not exact