Cumulative Frequency
Descriptive Statistics
Running Totals That Unlock Percentiles
Cumulative frequency is the running total of frequencies up to a given value, showing how many observations fall at or below each point.
- Percentile estimation — Read off any percentile directly from the cumulative curve
- Ogive plots — The cumulative frequency polygon visualizes the entire distribution
- ECDF — The empirical cumulative distribution function is the non-parametric version
- Median and quartiles — Find the 50th percentile (median) by reading the ogive at 50%
The cumulative frequency curve is the most direct path from raw data to percentiles.
What is Cumulative Frequency?
Definition
For values sorted in ascending order, the cumulative frequency of a value is the sum of its own frequency and the frequencies of all smaller values, that is, the number of observations at or below it. Dividing by the total number of observations gives the cumulative relative frequency, which runs from 0 to 1.
Cumulative Frequency
$$CF(x_k) = \sum_{i=1}^{k} f_i$$
Here,
- $f_i$ = Frequency of the i-th value
- $x_k$ = The k-th value (sorted)
- $CF(x_k)$ = Cumulative frequency up to $x_k$
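As a small worked example of the running-total definition, the sketch below tabulates frequencies and cumulative frequencies for a handful of values. The `data` array is hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical sample (illustration only): goals scored in 10 matches
data = np.array([2, 0, 1, 2, 3, 1, 2, 0, 1, 2])

# Distinct values in ascending order and how often each occurs
values, freq = np.unique(data, return_counts=True)

# Running total of the frequencies
cum_freq = np.cumsum(freq)

for x, f, cf in zip(values, freq, cum_freq):
    print(f"x = {x}: frequency = {f}, cumulative frequency = {cf}")
```

The last cumulative frequency always equals the number of observations, which is a quick sanity check on any cumulative table.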
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(42)
# Generate test scores
scores = np.random.normal(75, 12, 200).clip(0, 100).astype(int)
# Compute frequency distribution
bins = np.arange(0, 105, 5)
freq, edges = np.histogram(scores, bins=bins)
cum_freq = np.cumsum(freq)
df = pd.DataFrame({
    'Class Interval': [f'{edges[i]}-{edges[i+1]}' for i in range(len(freq))],
    'Frequency': freq,
    'Cumulative Frequency': cum_freq,
    'Cumulative Relative Frequency': (cum_freq / cum_freq[-1]).round(4)
})
print(df.head(10))
Ogive Plot (Cumulative Frequency Graph)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
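# Ogive: cumulative frequency plotted at each class boundary,
# starting from 0 at the lowest boundary
ogive_y = np.concatenate(([0], cum_freq))
axes[0].plot(edges, ogive_y, marker='o', linewidth=2)
axes[0].fill_between(edges, ogive_y, alpha=0.3)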
axes[0].set_title('Ogive (Cumulative Frequency Graph)')
axes[0].set_xlabel('Score')
axes[0].set_ylabel('Cumulative Frequency')
axes[0].grid(True, alpha=0.3)
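# ECDF (Empirical CDF): proportion of observations <= each sorted value
sorted_scores = np.sort(scores)
ecdf = np.arange(1, len(sorted_scores) + 1) / len(sorted_scores)
# where='post' makes each step jump at the observed value, as an ECDF should
axes[1].step(sorted_scores, ecdf, where='post', linewidth=2)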
axes[1].set_title('Empirical CDF (ECDF)')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Cumulative Probability')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('cumulative-frequency.png', dpi=150)
plt.show()
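The ECDF can also be evaluated at a single point without plotting. The sketch below is a minimal illustration; the helper name `ecdf_at` and the sample values are hypothetical, not part of the code above.

```python
import numpy as np

def ecdf_at(data, x):
    """Proportion of observations in `data` that are <= x."""
    sorted_data = np.sort(np.asarray(data))
    # side='right' counts every element <= x, including values equal to x
    return np.searchsorted(sorted_data, x, side="right") / sorted_data.size

sample = [3, 1, 2, 2]  # hypothetical values
print(ecdf_at(sample, 2))  # 3 of the 4 values are <= 2
```

Using `side="right"` matters when the data contain repeated values: it makes the function count observations "at or below" x, which matches the definition of cumulative frequency.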
Percentile Estimation from Ogive
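# Estimate median (50th percentile) from cumulative frequency.
# searchsorted finds the first class whose cumulative frequency reaches the target;
# its upper boundary is a coarse estimate (no interpolation within the class).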
target_freq = 0.5 * cum_freq[-1]
median_idx = np.searchsorted(cum_freq, target_freq)
estimated_median = edges[1:][median_idx]
print(f"Estimated median from ogive: {estimated_median}")
print(f"Actual median: {np.median(scores)}")
# Estimate quartiles
for q, name in [(0.25, 'Q1'), (0.50, 'Median'), (0.75, 'Q3')]:
    target = q * cum_freq[-1]
    idx = np.searchsorted(cum_freq, target)
    est = edges[1:][idx]
    print(f"{name}: Estimated = {est}, Actual = {np.percentile(scores, q*100):.1f}")
Reading an Ogive
To find a percentile from an ogive: (1) locate the desired percentile on the y-axis, (2) draw a horizontal line to the curve, (3) drop a vertical line to the x-axis, (4) read the value.
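The four reading steps amount to linear interpolation along the ogive. The sketch below is one way to express that for grouped data, assuming observations are spread evenly within each class; the function name `ogive_percentile` and the example table are hypothetical.

```python
import numpy as np

def ogive_percentile(edges, freq, p):
    """Estimate the p-th quantile (0 < p <= 1) by linear interpolation on the ogive.

    edges: class boundaries (length = number of classes + 1)
    freq:  frequency of each class
    """
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    edges = np.asarray(edges, dtype=float)
    freq = np.asarray(freq, dtype=float)
    cum = np.cumsum(freq)

    # Step 1: locate the target height on the y-axis
    target = p * cum[-1]
    # Step 2: find the class where the ogive first reaches that height
    k = int(np.searchsorted(cum, target))
    cum_before = cum[k - 1] if k > 0 else 0.0
    # Steps 3-4: move along the straight segment inside that class and read x
    width = edges[k + 1] - edges[k]
    return edges[k] + (target - cum_before) / freq[k] * width

# Hypothetical grouped data: 10 observations in four classes of width 10
edges = [0, 10, 20, 30, 40]
freq = [2, 3, 4, 1]
print(ogive_percentile(edges, freq, 0.50))
print(ogive_percentile(edges, freq, 0.75))
```

Unlike taking the upper boundary of the class, interpolation returns a value inside the class, so the estimate changes smoothly as the requested percentile changes.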
Cumulative Frequency in Machine Learning
| ML Application | Cumulative Frequency Usage | Why |
|---|---|---|
| ROC curves | Cumulative TPR vs FPR | Model threshold selection |
| Calibration | Predicted vs observed cumulative | Reliability diagrams |
| Survival analysis | Kaplan-Meier curves | Time-to-event prediction |
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(random_state=42).fit(X, y)
# Scores are computed on the training data here, so the AUC below is optimistic
y_proba = model.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, y_proba)
roc_auc = auc(fpr, tpr)
print(f"AUC-ROC: {roc_auc:.3f}")
# First ROC point whose FPR is at least 0.1
print(f"TPR at FPR>=0.1: {tpr[np.searchsorted(fpr, 0.1)]:.3f}")
print("TPR and FPR are cumulative proportions of positives and negatives scored at or above the threshold")
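To make the link to cumulative frequency concrete, the sketch below builds ROC points from two running totals. It is a simplified illustration, not scikit-learn's implementation: the function name `roc_from_cumsum` is hypothetical, labels are assumed to be 0/1 with both classes present, and scores are assumed to have no ties.

```python
import numpy as np

def roc_from_cumsum(y_true, y_score):
    """ROC points from cumulative counts (assumes 0/1 labels and no tied scores)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    # Sort observations from highest to lowest score
    order = np.argsort(-y_score, kind="stable")
    y_sorted = y_true[order]

    # Running totals of positives and negatives as the threshold is lowered
    tp = np.cumsum(y_sorted == 1)
    fp = np.cumsum(y_sorted == 0)

    # Divide by the class totals and prepend the (0, 0) starting point
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr

# Hypothetical labels and scores
fpr_demo, tpr_demo = roc_from_cumsum([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(fpr_demo)
print(tpr_demo)
```

Each axis is a cumulative relative frequency: TPR accumulates over the positive class and FPR over the negative class, which is why both curves are non-decreasing.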
Key Takeaways
Summary: Cumulative Frequency
- Cumulative frequency = running total of frequencies from lowest to highest
- Ogive plots show cumulative frequency as a graph — used to estimate percentiles
- ECDF (Empirical CDF) plots cumulative proportion (0 to 1) from the raw data as a step function; the relative-frequency ogive is its grouped-data approximation
- Percentile estimation: find the desired percentage on the y-axis, read the corresponding x-value
- Median = value where cumulative relative frequency reaches 50% (half the total count)
- Ogives are non-decreasing — they never go down as you move right