Box Plots — Five-Number Summary, IQR, and Outlier Detection

The Five-Number Summary in One Powerful Diagram

A box plot compactly summarizes a distribution using the five-number summary and clearly shows outliers. It is one of the most powerful tools for comparing distributions across groups.

  • Five-number summary — Min, Q1, median, Q3, max in one glance
  • Outlier detection — Points beyond the whiskers are flagged instantly
  • Group comparison — Side-by-side boxes reveal differences between populations
  • Skewness indicator — Symmetric boxes mean symmetric data; lopsided boxes tell a different story

When you need to compare distributions quickly, nothing beats the box plot.


What is a Box Plot?

Definition

A box plot (or box-and-whisker plot) draws a box from the first quartile (Q1) to the third quartile (Q3), marks the median with a line inside the box, and extends whiskers to the most extreme values that are not flagged as outliers. Outliers are plotted as individual points beyond the whiskers.


The Five-Number Summary

Statistic       | Symbol | Description
Minimum         | Min    | Smallest non-outlier value
First Quartile  | Q1     | 25th percentile
Median          | Q2     | 50th percentile
Third Quartile  | Q3     | 75th percentile
Maximum         | Max    | Largest non-outlier value

Interquartile Range

IQR = Q3 − Q1

Here,

  • Q3 = Third quartile (75th percentile)
  • Q1 = First quartile (25th percentile)
  • IQR = Interquartile range (middle 50% of data)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Dataset: salary data for three departments
data = {
    'Engineering': np.random.normal(95000, 15000, 50),
    'Marketing': np.random.normal(75000, 12000, 50),
    'Operations': np.random.normal(65000, 10000, 50)
}

# Compute five-number summaries
print("Five-Number Summaries:")
print(f"{'Dept':<15} {'Min':>8} {'Q1':>8} {'Median':>8} {'Q3':>8} {'Max':>8} {'IQR':>8}")
print("-" * 65)
for dept, salaries in data.items():
    q1, med, q3 = np.percentile(salaries, [25, 50, 75])
    iqr = q3 - q1
    # Whisker bounds (1.5 * IQR rule)
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    min_val = salaries[salaries >= lower].min()
    max_val = salaries[salaries <= upper].max()
    outliers = salaries[(salaries < lower) | (salaries > upper)]
    print(f"{dept:<15} {min_val:>8,.0f} {q1:>8,.0f} {med:>8,.0f} {q3:>8,.0f} {max_val:>8,.0f} {iqr:>8,.0f}")
    if len(outliers) > 0:
        print(f"  -> {len(outliers)} outlier(s): {[f'${o:,.0f}' for o in outliers]}")
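
As a quick check on the definitions above, here is a minimal sketch on a small hand-picked dataset. The nine values are an illustrative assumption, not data from this lesson. It relies on NumPy's default linear-interpolation percentiles; other quartile conventions can give slightly different Q1 and Q3 on other datasets.

```python
import numpy as np

# Nine values chosen so the quartiles land exactly on data points
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Q1, median, and Q3 are the 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

print(f"Min={values.min()}, Q1={q1}, Median={median}, Q3={q3}, Max={values.max()}")
print(f"IQR = Q3 - Q1 = {q3} - {q1} = {iqr}")  # IQR = 7.0 - 3.0 = 4.0
```

Because no value lies far from the rest here, the smallest and largest non-outlier values are simply the overall minimum and maximum.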

Outlier Detection: The 1.5×IQR Rule

Definition: 1.5×IQR Outlier Rule

A point is an outlier if it falls:

  • Below Q1 − 1.5 × IQR (lower fence)
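  • Above Q3 + 1.5 × IQR (upper fence)

def detect_outliers(values):
    """Return (outliers, lower_fence, upper_fence) using the 1.5×IQR rule."""
    values = np.asarray(values)  # accept lists as well as arrays
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return outliers, lower_fence, upper_fence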

# Add some outliers to data
dept_data = np.concatenate([data['Engineering'], [200000, 25000]])  # two outliers
outliers, lower, upper = detect_outliers(dept_data)
print(f"Lower fence: ${lower:,.0f}")
print(f"Upper fence: ${upper:,.0f}")
print(f"Outliers found: {outliers}")
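
To make the fence arithmetic concrete, the following sketch works the 1.5×IQR rule through on eight hand-picked values. The dataset is an illustrative assumption, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Illustrative values: seven clustered points and one extreme point
values = np.array([10, 12, 13, 14, 15, 16, 18, 50])

q1, q3 = np.percentile(values, [25, 75])  # 12.75 and 16.5
iqr = q3 - q1                             # 3.75

lower_fence = q1 - 1.5 * iqr              # 12.75 - 5.625 = 7.125
upper_fence = q3 + 1.5 * iqr              # 16.5 + 5.625 = 22.125

# Anything outside [lower_fence, upper_fence] is flagged
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```

Only 50 falls outside the fences, so it is the single flagged point; the whiskers would end at 10 and 18.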

Creating Box Plots in Python

fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Long-format DataFrame for the seaborn plots
df_long = pd.DataFrame({
    'Salary': np.concatenate(list(data.values())),
    'Department': np.repeat(list(data.keys()), 50)
})

# 1. matplotlib basic box plot
axes[0].boxplot([data[d] for d in data],
                patch_artist=True,
                boxprops=dict(facecolor='lightblue', color='navy'),
                medianprops=dict(color='red', linewidth=2),
                flierprops=dict(marker='o', markerfacecolor='red', markersize=6))
# Set tick labels separately: the `labels` keyword is deprecated in newer matplotlib
axes[0].set_xticklabels(list(data.keys()))
axes[0].set_title('Department Salaries\n(Box Plot)')
axes[0].set_ylabel('Salary ($)')
axes[0].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 2. Seaborn box plot (easier, prettier)
sns.boxplot(data=df_long, x='Department', y='Salary', ax=axes[1],
            palette='Set2', width=0.5)
axes[1].set_title('Seaborn Box Plot')
axes[1].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 3. Violin + Box (best of both worlds)
sns.violinplot(data=df_long, x='Department', y='Salary', ax=axes[2],
               palette='Set3', inner='box')
axes[2].set_title('Violin + Box Plot\n(shows full distribution)')
axes[2].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.tight_layout()
plt.savefig('box_plots.png', dpi=150)
plt.show()

Reading a Box Plot

Box Plot Structure

  • Box: spans the IQR, from Q1 to Q3 (middle 50% of data)
  • Center line: median
  • Whiskers: extend to the most extreme data points within 1.5×IQR of the box edges
  • Points beyond whiskers: outliers (plotted individually)
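
One way to see the numbers behind each drawn element is `matplotlib.cbook.boxplot_stats`, which computes box plot statistics without drawing anything. The sketch below is a minimal illustration on assumed toy data, not part of the lesson's salary example.

```python
import numpy as np
from matplotlib import cbook

# Illustrative values: 1 through 9 plus one extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# boxplot_stats returns one dict per dataset; whis=1.5 is the 1.5×IQR rule
stats = cbook.boxplot_stats(values, whis=1.5)[0]

print("Box edges (Q1, Q3):", stats['q1'], stats['q3'])   # 3.25 7.75
print("Center line (median):", stats['med'])             # 5.5
print("Whisker ends:", stats['whislo'], stats['whishi'])  # 1 9
print("Outliers:", stats['fliers'])                       # [30]
```

The upper whisker stops at 9, the largest value inside the upper fence, rather than at the fence itself.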

Box Plots in Machine Learning

Figure: box plot uses in ML (outlier detection with the 1.5×IQR rule, feature comparison of distributions by class, model errors via residual spread, data drift between train and test distributions). Box plots are essential for outlier detection and model diagnostics in ML.

In ML, box plots are critical for:

ML Use Case        | What to Box Plot                       | What to Look For
Outlier detection  | Each feature                           | Points beyond whiskers
Model comparison   | Error metrics across folds             | Consistency of performance
Data drift         | Feature distributions, train vs. test  | Shifted distributions
Residual analysis  | Residuals (actual minus predicted)     | Symmetric spread around zero
Class separation   | Feature values by class                | Little overlap between boxes suggests a useful feature

# Compare multiple models using box plots of cross-validation scores
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GBM': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    cv_scores[name] = scores
    print(f"{name:15s}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Box plot comparison
plt.figure(figsize=(10, 5))
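plt.boxplot(list(cv_scores.values()))
# Label the boxes via xticks; the `labels` keyword is deprecated in newer matplotlib
plt.xticks(range(1, len(cv_scores) + 1), list(cv_scores.keys()))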
plt.title('Model Comparison (5-Fold CV R² Scores)')
plt.ylabel('R² Score')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
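
For the "Outlier detection" use case, the same fences can be applied to every feature at once before plotting. This is a minimal sketch assuming a small hypothetical feature table; the column names `feature_a` and `feature_b` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical feature table: feature_a contains one extreme value
features = pd.DataFrame({
    'feature_a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 30],
    'feature_b': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Per-column quartiles and fences (1.5×IQR rule)
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask of points outside the fences, compared column by column
outside = features.lt(lower, axis='columns') | features.gt(upper, axis='columns')
outlier_counts = outside.sum()
print(outlier_counts)
```

Features with a high count are the ones worth inspecting first in a box plot.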

Key Takeaways

Summary: Box Plots

  1. Box plots are ideal for comparing multiple groups at a glance
  2. The IQR is robust: a few extreme values barely change it
  3. The 1.5×IQR rule flags potential outliers; it does not prove that a point is wrong
  4. Violin plots add distribution shape to box plots; use them when n is large
  5. Symmetric distributions have the median centered in the box; skewed data has it off-center
  6. Always investigate outliers, since they might be data errors or genuinely important observations
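
The robustness claim in takeaway 2 can be checked numerically. The sketch below uses assumed toy values and compares how the IQR and the standard deviation react when a single extreme value is appended.

```python
import numpy as np

def iqr_of(values):
    """Interquartile range using NumPy's default percentiles."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

base = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_extreme = np.append(base, 1000)  # add one extreme value

iqr_before, iqr_after = iqr_of(base), iqr_of(with_extreme)
std_before, std_after = base.std(), with_extreme.std()

print(f"IQR: {iqr_before} -> {iqr_after}")
print(f"Std: {std_before:.2f} -> {std_after:.2f}")
```

The IQR moves from 4.0 to 4.5, while the standard deviation grows by more than a factor of 100.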
