Five-Number Summary
Descriptive Statistics
Five Numbers That Describe Any Distribution
The five-number summary provides a compact non-parametric description of any dataset — no assumptions about shape required.
- Minimum and Maximum — The boundaries of your data
- Q1 and Q3 — The edges of the middle 50%
- Median — The center that splits data exactly in half
- Box plot foundation — These five numbers draw every box plot
- IQR from Q1 and Q3 — The robust measure of spread comes directly from this summary
Five numbers. One complete picture. The five-number summary is the Swiss Army knife of descriptive statistics.
What is the Five-Number Summary?
Definition
The five-number summary consists of five descriptive statistics that split a sorted dataset into four parts, each holding roughly 25% of the observations: Minimum, Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile), and Maximum.
| Statistic | Description |
|---|---|
| Minimum | Smallest value (box plots often report the smallest non-outlier value instead) |
| Q1 | 25th percentile (lower quartile) |
| Median | 50th percentile |
| Q3 | 75th percentile (upper quartile) |
| Maximum | Largest value (box plots often report the largest non-outlier value instead) |
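As a quick check of the definition, the sketch below computes the five numbers for a small made-up list. The helper name `five_number_summary` is illustrative, not a library function, and the quartiles rely on NumPy's default linear interpolation; other quartile conventions can give slightly different Q1 and Q3 values.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using NumPy's default linear interpolation."""
    arr = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(arr, [25, 50, 75])
    return float(arr.min()), float(q1), float(med), float(q3), float(arr.max())

# Toy data, chosen so the result is easy to verify by hand
data = [1, 3, 5, 7, 9, 11, 13]
summary = five_number_summary(data)
print(summary)  # (1.0, 4.0, 7.0, 10.0, 13.0)
```

Here the true minimum and maximum are returned; the non-outlier variant used for box-plot whiskers needs the fences as well.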
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# seaborn's example tips dataset (downloaded on first use)
tips = sns.load_dataset('tips')
print("=== Five-Number Summary: Total Bill ===")
bill = tips['total_bill']
# Quartiles and interquartile range
q1, med, q3 = np.percentile(bill, [25, 50, 75])
iqr = q3 - q1
# 1.5*IQR fences: values outside them are treated as outliers
lower_fence = q1 - 1.5*iqr
upper_fence = q3 + 1.5*iqr
not_outlier = bill[(bill >= lower_fence) & (bill <= upper_fence)]
print(f"Min (non-outlier): ${not_outlier.min():.2f}")
print(f"Q1: ${q1:.2f}")
print(f"Median: ${med:.2f}")
print(f"Q3: ${q3:.2f}")
print(f"Max (non-outlier): ${not_outlier.max():.2f}")
print(f"IQR: ${iqr:.2f}")
print(f"Lower fence: ${lower_fence:.2f}")
print(f"Upper fence: ${upper_fence:.2f}")
outliers = bill[(bill < lower_fence) | (bill > upper_fence)]
# tolist() converts NumPy scalars to plain floats so the list prints cleanly
print(f"Outliers: {sorted(outliers.tolist())}")
IQR Outlier Fences
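$$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{Upper fence} = Q_3 + 1.5 \times \text{IQR}$$
Here,
- $Q_1$ = First quartile (25th percentile)
- $Q_3$ = Third quartile (75th percentile)
- $\text{IQR}$ = Interquartile range = $Q_3 - Q_1$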
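A minimal sketch of the fence rule on a small made-up array, so the arithmetic can be followed by hand. The function name `iqr_fences` and the parameter `k` are illustrative choices rather than part of any library; `k=1.5` matches the multiplier used in this section.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the k*IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy data with one obviously extreme value
data = np.array([10, 12, 13, 14, 15, 16, 18, 40])
lower, upper = iqr_fences(data)

# Anything strictly outside the fences is flagged as an outlier
outliers = data[(data < lower) | (data > upper)]

print(lower, upper)       # 7.125 22.125
print(outliers.tolist())  # [40]
```

With Q1 = 12.75 and Q3 = 16.5, the IQR is 3.75, so the fences sit 5.625 below Q1 and above Q3.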
pandas describe() — Extended Summary
print(tips.describe().round(2))
# Shows count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), and max for every numeric column
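To pull only the five-number rows out of the `describe()` output, the row labels can be selected with `.loc`. The sketch below uses a tiny made-up frame (the column names `bill` and `tip` are placeholders) so that it runs without downloading a dataset.

```python
import pandas as pd

# Toy frame with hypothetical column names
df = pd.DataFrame({
    "bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# describe() labels the quartiles '25%', '50%', and '75%' by default
five_num = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_num)
```

The result keeps one column per numeric feature, which makes it easy to scan several features at once.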
Comparing Groups with Five-Number Summaries
fig, ax = plt.subplots(figsize=(10, 5))
# observed=True keeps only the day categories that actually appear in the data
groups = tips.groupby('day', observed=True)['total_bill']
day_labels = []
for i, (day, group) in enumerate(groups):
    day_labels.append(str(day))
    q1, med, q3 = np.percentile(group, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers end at the most extreme points still inside the 1.5*IQR fences
    whisker_lo = group[group >= q1-1.5*iqr].min()
    whisker_hi = group[group <= q3+1.5*iqr].max()
    outliers = group[(group < whisker_lo) | (group > whisker_hi)]
    # Draw box, median line, whiskers, and outlier points
    ax.barh(i, q3-q1, left=q1, height=0.4, color='steelblue', alpha=0.7)
    ax.plot([med, med], [i-0.2, i+0.2], color='red', lw=2)
    ax.plot([whisker_lo, q1], [i, i], color='black', lw=1)
    ax.plot([q3, whisker_hi], [i, i], color='black', lw=1)
    ax.scatter(outliers, [i]*len(outliers), color='red', zorder=5, s=20)
    print(f"{day}: Min={whisker_lo:.1f} Q1={q1:.1f} Med={med:.1f} Q3={q3:.1f} Max={whisker_hi:.1f}")
# Label each row with the group key it was drawn from
ax.set_yticks(range(len(day_labels)))
ax.set_yticklabels(day_labels)
ax.set_xlabel('Total Bill ($)')
ax.set_title('Five-Number Summary: Total Bill by Day')
plt.tight_layout()
plt.savefig('five_num_summary.png', dpi=150)
plt.show()
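When a table is enough and no plot is needed, the grouped comparison can also be sketched with `groupby` and `quantile`. The data below is made up, with group labels `A` and `B` standing in for days; note that this version reports the true minimum and maximum of each group rather than the non-outlier whisker ends.

```python
import pandas as pd

# Toy data: two groups with hand-checkable quartiles
df = pd.DataFrame({
    "day": ["A"] * 5 + ["B"] * 5,
    "total_bill": [10, 20, 30, 40, 50, 5, 15, 25, 35, 45],
})

# One row per group, one column per summary statistic
summary = (
    df.groupby("day")["total_bill"]
      .quantile([0, 0.25, 0.5, 0.75, 1.0])
      .unstack()
)
summary.columns = ["Min", "Q1", "Median", "Q3", "Max"]
print(summary)
```

Each row can then be read across and compared with the row above or below it.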
Five-Number Summary in Machine Learning
| ML Application | 5-Number Usage | Why |
|---|---|---|
| EDA | Quick data understanding | First step before modeling |
| Box plots | Visual model diagnostics | Compare model errors |
| Data validation | Check for data issues | Pipeline monitoring |
| Feature profiling | Summary statistics per feature | Automated EDA reports |
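The data validation row can be illustrated with a rough sketch: compare the five-number summary of a new batch against a reference sample and report how far any of the five numbers has moved. The function names `summarize` and `summary_shift` and the choice to scale by the reference IQR are assumptions made for illustration, not an established monitoring API.

```python
import numpy as np

def summarize(values):
    """Five-number summary as an array: min, Q1, median, Q3, max."""
    return np.percentile(values, [0, 25, 50, 75, 100])

def summary_shift(reference, batch):
    """Largest change in any of the five numbers, in units of the reference IQR.

    Assumes the reference IQR is non-zero.
    """
    ref = summarize(reference)
    new = summarize(batch)
    ref_iqr = ref[3] - ref[1]
    return float(np.max(np.abs(new - ref)) / ref_iqr)

# Toy data: an identical batch and a batch shifted upward by 50
reference = np.arange(1, 101)
same_batch = np.arange(1, 101)
shifted_batch = reference + 50

print(summary_shift(reference, same_batch))     # 0.0
print(summary_shift(reference, shifted_batch))  # larger than 1 reference IQR
```

A pipeline might flag a batch when this value crosses a threshold chosen for the feature in question; what counts as too large depends on the data.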
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
# Five-number summary for EDA
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print("Five-Number Summary for Iris Dataset:")
for col in df.columns:
    q1, med, q3 = df[col].quantile([0.25, 0.5, 0.75])
    print(f" {col:25s}: Min={df[col].min():.1f}, Q1={q1:.1f}, "
          f"Med={med:.1f}, Q3={q3:.1f}, Max={df[col].max():.1f}")
Key Takeaways
Summary: Five-Number Summary
- Min, Q1, Median, Q3, Max define a box plot — the five-number summary IS a box plot
- No distributional assumptions required — works for any shape
- Outliers are defined by the 1.5×IQR rule, not by the min/max
- pandas describe() adds count, mean, and std to the five-number summary
- Compare distributions across groups side by side with grouped five-number summaries
- Skewness is visible: a median closer to Q1 suggests right-skewed data; a median closer to Q3 suggests left-skewed data
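The skewness takeaway can be turned into a number with Bowley's quartile skewness, (Q3 + Q1 - 2 × Median) / (Q3 - Q1), which is positive when the median sits closer to Q1 and negative when it sits closer to Q3. This measure is brought in here as an illustration; the helper name `quartile_skewness` and the toy lists are assumptions.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1).

    Assumes Q3 > Q1, i.e. a non-zero IQR.
    """
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return float((q3 + q1 - 2 * med) / (q3 - q1))

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 5, 8, 13, 21]

print(quartile_skewness(symmetric))     # 0.0
print(quartile_skewness(right_skewed))  # positive: median sits closer to Q1
```

Because it uses only the quartiles, this measure is not affected by extreme values in the tails.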