Frequency Distributions
Descriptive Statistics
From Raw Numbers to Understandable Patterns
A frequency distribution organizes raw data into a table or chart showing how often each value occurs. It transforms an unreadable list of numbers into an interpretable summary.
- Absolute frequency — Count observations in each category to see the raw totals
- Relative frequency — Convert counts to proportions for comparison across datasets
- Cumulative frequency — Track running totals to answer "at or below" questions
- Grouped distributions — Handle continuous data by binning into intervals
Before you calculate any statistic, organize your data. Frequency distributions are the first step to understanding.
What is a Frequency Distribution?
Definition
A frequency distribution organizes raw data into a table or chart showing how often each value (or range of values) occurs. It transforms an unreadable list of numbers into an interpretable summary.
Absolute Frequency
The count of observations in each category or interval.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Example: Final exam scores of 50 students
np.random.seed(42)
scores = np.random.normal(72, 12, 50).clip(0, 100).round(0).astype(int)
# --- Ungrouped frequency table (one row per distinct score) ---
from collections import Counter
freq = Counter(scores)
freq_df = pd.DataFrame({'Score': sorted(freq.keys()),
                        'Frequency': [freq[k] for k in sorted(freq.keys())]})
print("First 10 rows of frequency table:")
print(freq_df.head(10))
Grouped Frequency Distribution
When data is continuous or has many unique values, we group into class intervals (bins).
Steps to build:
- Find range = max − min
- Choose number of classes (typically 5–20; Sturges' rule: k = 1 + 3.322 × log₁₀(n))
- Class width = Range / k (round up)
- Create non-overlapping, equal-width intervals
- Count observations in each interval
# Sturges' rule for number of bins
n = len(scores)
k_sturges = int(np.ceil(1 + 3.322 * np.log10(n)))
print(f"Sturges' rule: k = {k_sturges} bins")
# Build grouped frequency table
min_score, max_score = int(scores.min()), int(scores.max())
# Round the Sturges width up to a multiple of 10 so the intervals are easy to read
bin_width = int(np.ceil((max_score - min_score) / k_sturges / 10) * 10)
# Align the edges to multiples of bin_width; the last bin must cover the maximum
start = (min_score // bin_width) * bin_width
stop = (max_score // bin_width + 1) * bin_width
bins = range(start, stop + 1, bin_width)  # e.g. [40,50), [50,60), ...
labels = [f"{b}-{b + bin_width - 1}" for b in bins[:-1]]
score_series = pd.Series(scores)
grouped = pd.cut(score_series, bins=list(bins), right=False, labels=labels)
freq_table = (grouped.value_counts(sort=False)
              .rename_axis('Interval')
              .reset_index(name='Frequency'))
# Add relative and cumulative frequency
freq_table['Relative Freq'] = freq_table['Frequency'] / n
freq_table['Relative %'] = (freq_table['Relative Freq'] * 100).round(1)
freq_table['Cumulative Freq'] = freq_table['Frequency'].cumsum()
freq_table['Cumulative %'] = (freq_table['Cumulative Freq'] / n * 100).round(1)
print("\nGrouped Frequency Distribution:")
print(freq_table.to_string(index=False))
Example output (illustrative counts):
Grouped Frequency Distribution:
Interval  Frequency  Relative Freq  Relative %  Cumulative Freq  Cumulative %
   40-49          2          0.040         4.0                2           4.0
   50-59          7          0.140        14.0                9          18.0
   60-69         14          0.280        28.0               23          46.0
   70-79         16          0.320        32.0               39          78.0
   80-89         10          0.200        20.0               49          98.0
   90-99          1          0.020         2.0               50         100.0
Relative Frequency
$$\text{Relative Frequency}_i = \frac{f_i}{n}$$
Here,
- $f_i$ = Frequency of the i-th class
- $n$ = Total number of observations
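Relative frequency divides each class frequency by the total number of observations n, so the values across all classes sum to 1. This makes it useful for comparing distributions across datasets of different sizes.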
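As a rough illustration of why proportions help when sample sizes differ, the sketch below compares two small grade lists. The data and the names `class_a` and `class_b` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical grade lists for two classes of different sizes
class_a = pd.Series(["A", "B", "B", "C", "C", "C", "B", "A", "C", "B"])  # n = 10
class_b = pd.Series(["A", "B", "C", "C"])                                # n = 4

# Absolute frequencies (raw counts per grade)
abs_a = class_a.value_counts().sort_index()
abs_b = class_b.value_counts().sort_index()

# Relative frequencies: normalize=True divides each count by n
rel_a = class_a.value_counts(normalize=True).sort_index()
rel_b = class_b.value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({"Count A": abs_a, "Count B": abs_b,
                           "Rel A": rel_a, "Rel B": rel_b})
print(comparison)
```

Class A has twice as many C grades as class B by count (4 versus 2), yet the share of C grades is higher in class B (0.5 versus 0.4). Only the relative frequencies make that visible.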
Cumulative Frequency
A running total of frequencies. Divided by n, it answers: "What fraction of observations fall at or below this value?"
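A minimal sketch of the running-total idea, using a small hypothetical list of quiz scores (`quiz_scores` is not part of the exam data used elsewhere in this article):

```python
import pandas as pd

# Hypothetical quiz scores, used only for illustration
quiz_scores = pd.Series([3, 5, 5, 6, 7, 7, 7, 8, 9, 10])

# Absolute frequency of each distinct score, ordered by score
freq = quiz_scores.value_counts().sort_index()

# Running total of counts, then the same totals as proportions of n
cum_freq = freq.cumsum()
cum_rel = cum_freq / len(quiz_scores)

summary = pd.DataFrame({"Frequency": freq,
                        "Cumulative Freq": cum_freq,
                        "Cumulative Prop": cum_rel})
print(summary)

# "How many scores are at or below 7?"
at_or_below_7 = int(cum_freq.loc[7])
print(at_or_below_7)  # 7 of the 10 scores
```

The lookup works here because 7 is one of the observed values. For an arbitrary threshold, a comparison such as `(quiz_scores <= threshold).sum()` would be the more general approach.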
# Three views of the same data: histogram, relative frequency histogram, and ogive
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Histogram
axes[0].hist(scores, bins=list(bins), edgecolor='black', color='steelblue', alpha=0.7)
axes[0].set_title('Histogram\n(Absolute Frequency)')
axes[0].set_xlabel('Score')
axes[0].set_ylabel('Frequency')
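# Relative frequency histogram (weights of 1/n turn bar heights into proportions;
# density=True would instead divide by bin width as well)
axes[1].hist(scores, bins=list(bins), weights=np.ones(n) / n, edgecolor='black', color='coral', alpha=0.7)
axes[1].set_title('Relative Frequency Histogram')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Relative Frequency')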
# Cumulative frequency (Ogive)
sorted_scores = np.sort(scores)
cumulative = np.arange(1, n+1) / n
axes[2].plot(sorted_scores, cumulative, 'b-', linewidth=2)
axes[2].set_title('Ogive (Cumulative Freq)')
axes[2].set_xlabel('Score')
axes[2].set_ylabel('Cumulative Proportion')
axes[2].axhline(0.5, color='red', linestyle='--', label='50% (median)')
axes[2].legend()
plt.tight_layout()
plt.savefig('frequency_distributions.png', dpi=150)
plt.show()
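Reading a percentile off the ogive amounts to finding where the cumulative curve first reaches a given proportion. The sketch below does that numerically on a small hypothetical sample; the helper name `read_percentile` is an assumption made for this example, not a library function.

```python
import numpy as np

# Hypothetical sample of 10 scores (not the 50 exam scores above)
sample = np.array([48, 55, 61, 64, 68, 72, 75, 79, 83, 94])

sorted_sample = np.sort(sample)
# Cumulative proportion at each sorted score: 0.1, 0.2, ..., 1.0
cum_prop = np.arange(1, len(sorted_sample) + 1) / len(sorted_sample)

def read_percentile(p):
    """Return the smallest score whose cumulative proportion is >= p (0 < p <= 1)."""
    idx = np.searchsorted(cum_prop, p, side="left")
    return int(sorted_sample[idx])

print(read_percentile(0.5))  # 68
print(read_percentile(0.9))  # 83
```

This picks an observed value without interpolating, so it can differ from `np.percentile`, which interpolates linearly between neighbouring values by default.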
Frequency Distributions in Machine Learning
In ML, frequency distributions are critical for:
| ML Use Case | Frequency Technique | Why It Matters |
|---|---|---|
| Classification targets | Class frequency table | Detect class imbalance → resample |
| Feature engineering | Histogram of features | Decide binning for continuous variables |
| NLP tokenization | Word frequency (Zipf's law) | Stop words, vocabulary pruning |
| Recommendation systems | User-item frequency | Sparse matrix handling |
| Fraud detection | Event frequency | Extreme class imbalance (99.9% legit) |
import numpy as np
import pandas as pd
# ML example: class imbalance detection
np.random.seed(42)
n = 10000
# Simulated fraud dataset (99% legit, 1% fraud)
labels = np.random.choice(['legit', 'fraud'], n, p=[0.99, 0.01])
# Frequency distribution reveals the problem
from collections import Counter
freq = Counter(labels)
print("=== Class Frequency Distribution ===")
for cls, count in sorted(freq.items()):
    print(f" {cls}: {count} ({count/n:.1%})")
# One fix: randomly oversample the minority class (with replacement) to match the majority
fraud_idx = np.where(labels == 'fraud')[0]
legit_idx = np.where(labels == 'legit')[0]
fraud_oversampled = np.random.choice(fraud_idx, size=len(legit_idx), replace=True)
balanced_labels = np.concatenate([labels[legit_idx], labels[fraud_oversampled]])
print(f"\nAfter oversampling: {Counter(balanced_labels)}")
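The word-frequency row of the table can be sketched in the same way. The example below counts tokens in a short made-up sentence and prunes rare words; the sentence, the whitespace tokenizer, and the `min_count` threshold are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical text; a real pipeline would use a proper tokenizer
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()

# Absolute and relative frequency of each word
word_freq = Counter(tokens)
total = sum(word_freq.values())
for word, count in word_freq.most_common(3):
    print(f"{word}: {count} ({count / total:.1%})")

# Vocabulary pruning: keep only words seen at least min_count times
min_count = 2
vocab = sorted(w for w, c in word_freq.items() if c >= min_count)
print(vocab)  # ['on', 'sat', 'the']
```

The most frequent word here is "the", the kind of high-frequency token that often ends up on a stop-word list.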
Key Takeaways
Summary: Frequency Distributions
- Frequency distributions transform raw data into interpretable summaries
- Absolute frequency = counts; Relative frequency = proportions; Cumulative = running total
- Grouped distributions are needed for continuous data — bin width matters for interpretation
- Sturges' rule (k = 1 + 3.322 log₁₀n) is a starting point for number of bins
- The ogive (cumulative frequency curve) allows you to read off percentiles
- Different bin widths reveal different features — always try several widths
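To see how bin width changes the picture, the sketch below bins the same data three ways. The data are simulated as two clusters under an arbitrary seed, so the exact counts are not meaningful; only the change in shape across widths is.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal data: one cluster near 60, another near 80
data = np.concatenate([rng.normal(60, 4, 200), rng.normal(80, 4, 200)])

histograms = {}
for width in (20, 10, 5):
    # Equal-width bin edges from 40 to 100
    edges = np.arange(40, 100 + width, width)
    counts, _ = np.histogram(data, bins=edges)
    histograms[width] = counts
    print(f"width={width:>2}: {counts.tolist()}")
```

With the widest bins, the gap between the two clusters tends to be hidden inside a few large bars. Narrower bins should show two separate peaks, at the cost of noisier counts per bin.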