Percentiles and Quartiles
Descriptive Statistics
Where Does Any Value Stand Relative to the Rest?
Percentiles tell you the relative standing of any value within a dataset. Quartiles are special percentiles that divide data into four equal parts.
- Percentile rank — "You scored better than 85% of test takers"
- Quartiles — Q1, Q2 (median), Q3 split data into four equal groups
- Deciles — Ten equal groups for finer-grained comparison
- Interpolation methods — Different calculators give slightly different answers; know why
Percentiles turn raw scores into meaningful rankings. They are the language of standardized testing and performance evaluation.
What are Percentiles and Quartiles?
Definition
The pth percentile is the value below which p% of observations fall. Quartiles (Q1=25th, Q2=50th, Q3=75th) are special cases.
Percentile Rank
$$\text{PR}(x) = \frac{\text{number of observations below } x}{n} \times 100$$
Here,
- $x$ = The value being ranked
- $n$ = Total number of observations
import numpy as np
from scipy import stats
import pandas as pd
data = np.array([15, 20, 35, 40, 50, 12, 27, 45, 38, 22, 18, 55, 30, 42, 25])
sorted_d = np.sort(data)
print(f"Sorted: {sorted_d}")
for p in [10, 25, 50, 75, 90]:
    print(f"P{p:2d}: {np.percentile(data, p):.2f}")
NumPy Interpolation Methods
d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for method in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    # NumPy >= 1.22 names this keyword `method`; the older `interpolation` keyword is deprecated
    val = np.percentile(d, 50, method=method)
    print(f" method='{method}': {val}")
Interpolation Methods
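NumPy supports several interpolation methods for percentiles, including linear (default), lower, higher, midpoint, and nearest. They differ only when the requested percentile falls between two data points: 'lower' and 'higher' return the neighboring data point on that side, 'midpoint' averages the two neighbors, 'nearest' returns the closer one, and 'linear' interpolates in proportion to the fractional position. The default 'linear' method is appropriate for most use cases.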
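To see why different methods (and different calculators) can disagree, it can help to compute a percentile by hand. The sketch below assumes the default 'linear' rule: the p-th percentile sits at fractional index (n - 1) * p / 100 in the sorted data, and the result is interpolated between the two neighboring values. `percentile_linear` is a hypothetical helper written for illustration, not a NumPy function.

```python
import numpy as np

def percentile_linear(values, p):
    """Hand-rolled p-th percentile with linear interpolation (illustrative helper)."""
    s = np.sort(np.asarray(values, dtype=float))
    pos = (len(s) - 1) * p / 100.0      # fractional index into the sorted data
    lo = int(np.floor(pos))             # index of the neighbor at or below pos
    hi = min(lo + 1, len(s) - 1)        # index of the neighbor above pos
    frac = pos - lo                     # how far pos lies between the two neighbors
    return s[lo] + frac * (s[hi] - s[lo])

d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for p in [25, 50, 75]:
    by_hand = percentile_linear(d, p)
    print(f"P{p}: by hand = {by_hand:.2f}, np.percentile = {np.percentile(d, p):.2f}")
```

With an even number of points the median position falls between two values, which is exactly the case where 'lower', 'higher', and 'linear' give different answers.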
Five-Number Summary
def five_num(data, label=''):
    # Quartiles, IQR, and the 1.5 * IQR outlier fences
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    if label:
        print(f"\n=== {label} ===")
    print(f"Min: {data.min():.2f} Q1: {q1:.2f} Median: {q2:.2f} Q3: {q3:.2f} Max: {data.max():.2f}")
    print(f"IQR: {iqr:.2f} Fences: [{lower:.2f}, {upper:.2f}]")
np.random.seed(42)
exam = np.random.normal(75, 12, 200).clip(0, 100)
five_num(exam, "Exam Scores")
| Quartile | Percentile | Description |
|---|---|---|
| Q1 | 25th | Lower quartile — 25% of data falls below this |
| Q2 | 50th | Median — middle value of the dataset |
| Q3 | 75th | Upper quartile — 75% of data falls below this |
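The fences computed from Q1, Q3, and the IQR are commonly used to flag potential outliers. The following is a minimal sketch, assuming the usual 1.5 × IQR rule; `tukey_outliers` is a hypothetical helper and the sample values (including the extreme value 120) are made up for illustration.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)   # True where a value lies outside the fences
    return values[mask], (lower, upper)

# Made-up sample with one deliberately extreme value
sample = np.array([12, 15, 18, 20, 22, 25, 27, 30, 35, 38, 40, 42, 45, 50, 120])
outliers, (lower, upper) = tukey_outliers(sample)
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
print(f"Flagged as outliers: {outliers}")
```

A flagged value is only a candidate for inspection; whether to drop, cap, or keep it depends on the data.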
Percentile Rank
score = 88
rank = stats.percentileofscore(exam, score, kind='weak')
print(f"Score of {score} is at the {rank:.1f}th percentile")
print(f"{rank:.1f}% of students scored at or below {score}")
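The `kind` argument matters whenever the score ties with values in the data. As a small sketch on made-up scores: 'strict' counts only values below the score, 'weak' counts values at or below it, 'mean' averages those two, and 'rank' averages the percentage rankings of the tied values.

```python
from scipy import stats

scores = [60, 70, 70, 80, 90]   # made-up scores with a tie at 70

ranks = {}
for kind in ['strict', 'weak', 'mean', 'rank']:
    ranks[kind] = stats.percentileofscore(scores, 70, kind=kind)
    print(f"kind='{kind}': {ranks[kind]:.1f}")
```

When reporting a percentile rank, it helps to state which convention was used, since "better than" corresponds to 'strict' and "at or below" corresponds to 'weak'.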
Deciles
deciles = np.percentile(exam, range(10, 100, 10))
for i, val in enumerate(deciles, 1):
    print(f"D{i} ({i*10}th pct): {val:.1f}")
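Decile groups are equal-frequency groups, not equal-width intervals. One way to see the difference, assuming pandas is available, is to compare `pd.qcut` (edges at quantiles) with `pd.cut` (edges evenly spaced over the value range). The data below is synthetic and deliberately right-skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3, sigma=1, size=200)   # synthetic right-skewed sample

# Equal-frequency groups: bin edges sit at the deciles
decile_group = pd.qcut(x, q=10, labels=False)
freq_counts = np.bincount(decile_group, minlength=10)

# Equal-width groups: the value range is split into 10 intervals of the same width
width_group = pd.cut(x, bins=10, labels=False)
width_counts = np.bincount(width_group, minlength=10)

print("Equal-frequency counts:", freq_counts)
print("Equal-width counts:    ", width_counts)
```

On skewed data the equal-width bins pile most observations into the first few intervals, while the decile groups each hold the same number of observations.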
Percentiles in Machine Learning
| ML Application | Percentile Usage | Why |
|---|---|---|
| Quantile regression | Predict percentiles, not mean | Robust to skewed targets |
| Feature binning | Cut into quantile bins | Discretize continuous features |
| Performance metrics | P95 latency, P99 response time | SLA monitoring |
| Data preprocessing | Clip outliers at percentiles | Robust scaling |
import numpy as np
from sklearn.preprocessing import QuantileTransformer
np.random.seed(42)
# Quantile binning for feature engineering
data = np.random.lognormal(3, 1, 1000)
bins = np.percentile(data, [0, 25, 50, 75, 100])
binned = np.digitize(data, bins[1:-1])
print(f"Quantile bins: {bins.round(1)}")
print(f"Counts per bin: {np.bincount(binned)}")
# QuantileTransformer for normality
qt = QuantileTransformer(n_quantiles=100, output_distribution='normal')
transformed = qt.fit_transform(data.reshape(-1,1)).flatten()
print(f"\nOriginal skewness: {float(np.mean(((data-data.mean())/data.std())**3)):.3f}")
print(f"Transformed skewness: {float(np.mean(((transformed-transformed.mean())/transformed.std())**3)):.3f}")
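Two other rows of the table, tail-latency metrics and percentile clipping, can be sketched in a few lines. The example below uses synthetic log-normal "latencies" (an assumption for illustration, not measured data) and clips at the 1st and 99th percentiles; those cutoffs are a common but arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
latency_ms = rng.lognormal(mean=3, sigma=1, size=1000)   # synthetic, right-skewed

# Tail-latency metrics: the mean hides the slow requests that P95/P99 expose
p50, p95, p99 = np.percentile(latency_ms, [50, 95, 99])
print(f"mean={latency_ms.mean():.1f}  P50={p50:.1f}  P95={p95:.1f}  P99={p99:.1f}")

# Percentile clipping: cap values outside the [P1, P99] range
lo, hi = np.percentile(latency_ms, [1, 99])
clipped = np.clip(latency_ms, lo, hi)
print(f"max before={latency_ms.max():.1f}  max after={clipped.max():.1f}")
```

Clipping keeps every row and only limits the influence of extreme values, which is often preferable to dropping observations before scaling.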
Key Takeaways
Summary: Percentiles and Quartiles
- P50 = median — percentiles generalize the median to any fraction
- Quartiles divide data into 4 equal-frequency groups (not equal-width intervals)
- IQR = Q3 − Q1 covers the middle 50% and drives outlier fences
- Percentile rank answers "where does this value fall in the distribution?"
- NumPy's default method='linear' is appropriate for most cases
- Percentiles are non-parametric — no distributional assumptions needed