Box Plots

Data Visualization

The Five-Number Summary in One Powerful Diagram

A box plot compactly summarizes a distribution using the five-number summary and clearly shows outliers. It is one of the most powerful tools for comparing distributions across groups.

Five-number summary: min, Q1, median, Q3, and max at a glance
Outlier detection: points beyond the whiskers are plotted individually
Group comparison: side-by-side boxes reveal differences between populations
Skewness indicator: a centered median and equal whiskers suggest symmetry; an off-center median or unequal whiskers suggest skew

When you need to compare several distributions quickly, a box plot is often the most compact option.

What is a Box Plot?

Definition

A box plot (or box-and-whisker plot) draws a box from Q1 to Q3 with a line at the median, whiskers that extend to the most extreme non-outlier values, and individual points for any outliers.

The Five-Number Summary

Statistic	Symbol	Description
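Minimum	Min	Smallest value (on a box plot, the lower whisker ends at the smallest non-outlier value)
First Quartile	Q1	25th percentile
Median	Q2	50th percentile
Third Quartile	Q3	75th percentile
Maximum	Max	Largest value (on a box plot, the upper whisker ends at the largest non-outlier value)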

Interquartile Range

$IQR = Q3 - Q1$

Here,

$Q3$ = third quartile (75th percentile)
$Q1$ = first quartile (25th percentile)
$IQR$ = interquartile range (spread of the middle 50% of data)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Dataset: salary data for three departments
data = {
    'Engineering': np.random.normal(95000, 15000, 50),
    'Marketing': np.random.normal(75000, 12000, 50),
    'Operations': np.random.normal(65000, 10000, 50)
}

# Compute five-number summaries
print("Five-Number Summaries:")
print(f"{'Dept':<15} {'Min':>8} {'Q1':>8} {'Median':>8} {'Q3':>8} {'Max':>8} {'IQR':>8}")
print("-" * 65)
for dept, salaries in data.items():
    q1, med, q3 = np.percentile(salaries, [25, 50, 75])
    iqr = q3 - q1
    # Whisker bounds (1.5 * IQR rule)
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    min_val = salaries[salaries >= lower].min()
    max_val = salaries[salaries <= upper].max()
    outliers = salaries[(salaries < lower) | (salaries > upper)]
    print(f"{dept:<15} {min_val:>8,.0f} {q1:>8,.0f} {med:>8,.0f} {q3:>8,.0f} {max_val:>8,.0f} {iqr:>8,.0f}")
    if len(outliers) > 0:
        print(f"  -> {len(outliers)} outlier(s): {[f'${o:,.0f}' for o in outliers]}")

Outlier Detection: The 1.5×IQR Rule

Definition: 1.5×IQR Outlier Rule

A point is an outlier if it falls:

Below Q1 − 1.5 × IQR (lower fence)
Above Q3 + 1.5 × IQR (upper fence)
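
As a hand-checkable illustration of the rule, the sketch below applies it to a small made-up sample (the values are illustrative and are not taken from the salary data). It assumes NumPy's default linear-interpolation percentiles; other quantile conventions can give slightly different fences.

```python
import numpy as np

# Small made-up sample, chosen so the quartiles are easy to verify by hand
sample = np.array([2, 4, 4, 5, 7, 8, 9, 10, 30])

q1, q3 = np.percentile(sample, [25, 75])   # 4.0 and 9.0
iqr = q3 - q1                              # 5.0
lower_fence = q1 - 1.5 * iqr               # 4 - 7.5 = -3.5
upper_fence = q3 + 1.5 * iqr               # 9 + 7.5 = 16.5

is_outlier = (sample < lower_fence) | (sample > upper_fence)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", sample[is_outlier])
print("Whisker ends:", sample[~is_outlier].min(), sample[~is_outlier].max())
```

Only the value 30 lies outside the fences, so the whiskers would end at 2 and 10.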

def detect_outliers(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return outliers, lower_fence, upper_fence

# Add some outliers to data
dept_data = np.concatenate([data['Engineering'], [200000, 25000]])  # two outliers
outliers, lower, upper = detect_outliers(dept_data)
print(f"Lower fence: ${lower:,.0f}")
print(f"Upper fence: ${upper:,.0f}")
print(f"Outliers found: {outliers}")

Creating Box Plots in Python

fig, axes = plt.subplots(1, 3, figsize=(15, 6))

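# Long-format DataFrame (one row per salary) for the seaborn plots below
df_long = pd.DataFrame({
    'Salary': np.concatenate(list(data.values())),
    'Department': np.repeat(list(data.keys()), 50)
})

# 1. matplotlib basic box plot
# Note: newer Matplotlib versions (3.9+) prefer `tick_labels` and may warn about `labels`
axes[0].boxplot([data[d] for d in data.keys()],
                labels=list(data.keys()),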
                patch_artist=True,
                boxprops=dict(facecolor='lightblue', color='navy'),
                medianprops=dict(color='red', linewidth=2),
                flierprops=dict(marker='o', markerfacecolor='red', markersize=6))
axes[0].set_title('Department Salaries\n(Box Plot)')
axes[0].set_ylabel('Salary ($)')
axes[0].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 2. Seaborn box plot (easier, prettier)
sns.boxplot(data=df_long, x='Department', y='Salary', ax=axes[1],
            palette='Set2', width=0.5)
axes[1].set_title('Seaborn Box Plot')
axes[1].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 3. Violin + Box (best of both worlds)
sns.violinplot(data=df_long, x='Department', y='Salary', ax=axes[2],
               palette='Set3', inner='box')
axes[2].set_title('Violin + Box Plot\n(shows full distribution)')
axes[2].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.tight_layout()
plt.savefig('box_plots.png', dpi=150)
plt.show()
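
To read the drawn values off as numbers rather than from the picture, one option is `matplotlib.cbook.boxplot_stats`, which computes the statistics matplotlib uses when drawing a box plot. The sketch below uses a small made-up sample instead of the salary data so the result is easy to check by hand.

```python
import numpy as np
from matplotlib import cbook

# Small made-up sample (illustrative values only)
sample = np.array([2, 4, 4, 5, 7, 8, 9, 10, 30])

# boxplot_stats returns one dict per dataset; whis=1.5 is the 1.5×IQR rule
stats_list = cbook.boxplot_stats(sample, whis=1.5)
box = stats_list[0]

# Whisker ends, quartiles, and median as matplotlib would draw them
for key in ("whislo", "q1", "med", "q3", "whishi"):
    print(f"{key:>7}: {box[key]}")
print(" fliers:", box["fliers"])  # points plotted individually as outliers
```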

Reading a Box Plot

Box Plot Structure

[Figure: a box spanning the lower quartile (25.00) to the upper quartile (75.00), with the center line at 50.00]

Box: spans the IQR from Q1 to Q3 (middle 50% of data)
Center line: median
Whiskers: extend to the most extreme data points within 1.5×IQR of the box edges
Points beyond whiskers: outliers (plotted individually)
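
The position of the median within the box also hints at skew. One way to put a number on it, offered here as a sketch rather than as part of the lesson's own method, is the quartile (Bowley) skewness, $(Q3 + Q1 - 2 \cdot Q2) / (Q3 - Q1)$. It is 0 when the median sits at the center of the box and positive when the median sits closer to Q1 (right skew). The helper name `quartile_skewness` and the sample values are illustrative.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley skewness: 0 if the median is centered between Q1 and Q3."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * med) / (q3 - q1)

# Made-up samples: one symmetric, one with a long right tail
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
right_skewed = np.array([1, 2, 2, 3, 3, 4, 8, 15, 40])

print(quartile_skewness(symmetric))     # 0.0: median centered in the box
print(quartile_skewness(right_skewed))  # positive: median nearer Q1
```

Because it uses only the quartiles, this measure ignores how far the outliers themselves extend.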

Box Plots in Machine Learning

In ML, box plots are critical for:

ML Use Case	What to Box Plot	What to Look For
Outlier detection	Each feature	Points beyond whiskers
Model comparison	Error metrics across folds	Consistency of performance
Data drift	Feature distributions, train vs. test	Shifted distributions
Residual analysis	Residuals (actual minus predicted)	Symmetric spread around zero
Class separation	Feature values by class	Non-overlapping boxes = good feature

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Compare multiple models using box plots of cross-validation scores

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GBM': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    cv_scores[name] = scores
    print(f"{name:15s}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Box plot comparison (dict views converted to lists for matplotlib)
plt.figure(figsize=(10, 5))
plt.boxplot(list(cv_scores.values()), labels=list(cv_scores.keys()))
plt.title('Model Comparison (5-Fold CV R² Scores)')
plt.ylabel('R² Score')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
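
The "Outlier detection" row of the table above can be sketched in the same spirit: apply the 1.5×IQR rule column by column to a feature matrix and count the flagged points. The function name `count_outliers_per_feature` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def count_outliers_per_feature(X):
    """Count points outside the 1.5×IQR fences in each column of X."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)   # per-column quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((X < lower) | (X > upper)).sum(axis=0)

# Toy feature matrix: column 0 has one extreme value, column 1 has none
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [4.0, 13.0],
    [100.0, 14.0],
])
print(count_outliers_per_feature(X_toy))
```

A feature with many flagged points is a candidate for closer inspection, not automatic removal.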

Key Takeaways

Summary: Box Plots

Box plots are ideal for comparing multiple groups at a glance
The IQR is robust: a few extreme values in the tails barely change it
The 1.5×IQR rule flags potential outliers; it does not prove that a point is wrong
Violin plots add distribution shape to box plots; use them when n is large enough for a stable density estimate
Symmetric distributions have the median centered in the box; skewed data has it off-center
Always investigate flagged outliers: they might be data errors or genuinely important observations
