Relative Frequency
Descriptive Statistics
From Counts to Proportions — The Bridge to Probability
Relative frequency converts raw counts into proportions, revealing how often each category occurs relative to the whole. It is the empirical bridge between data and probability.
- Probability estimation — Use observed proportions as estimates of true probabilities
- Cross-dataset comparison — Compare distributions of different sizes on equal footing
- Law of Large Numbers — Relative frequency converges to true probability as n grows
- Foundation for histograms — Relative-frequency histograms plot proportions on the y-axis; density histograms additionally divide each proportion by the bin width
When you divide every count by the total, you unlock the connection between data and probability.
What is Relative Frequency?
Definition
Relative frequency is the proportion (or percentage) of times a value occurs in a dataset compared to the total number of observations. It estimates the probability of that category.
Relative Frequency Formula
$$f_i = \frac{n_i}{N}$$
Here,
- $n_i$ = Frequency (count) of the i-th category
- $N$ = Total number of observations
- $\sum_i f_i$ = Sum of all relative frequencies = 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(42)
# Generate sample data
colors = np.random.choice(['Red', 'Blue', 'Green', 'Yellow'], size=100, p=[0.3, 0.25, 0.25, 0.2])
# Compute relative frequencies
value_counts = pd.Series(colors).value_counts()
relative_freq = value_counts / len(colors)
print("Absolute and Relative Frequencies:")
print(pd.DataFrame({
    'Count': value_counts,
    'Relative Frequency': relative_freq.round(4),
    'Percentage': (relative_freq * 100).round(1).astype(str) + '%'
}))
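Because relative frequencies are normalized by the total, samples of different sizes can be compared directly. The sketch below uses two hypothetical groups of survey responses (the names `small_group` and `large_group` and the data are invented for illustration) to show how raw counts and proportions can tell different stories.

```python
import pandas as pd

# Hypothetical survey responses from two groups of different sizes
small_group = pd.Series(['yes'] * 12 + ['no'] * 8)       # 20 responses
large_group = pd.Series(['yes'] * 450 + ['no'] * 550)    # 1000 responses

# Raw counts mostly reflect how large each group is
counts = pd.DataFrame({
    'small': small_group.value_counts(),
    'large': large_group.value_counts(),
})

# Relative frequencies put both groups on the same 0-to-1 scale
proportions = pd.DataFrame({
    'small': small_group.value_counts(normalize=True),
    'large': large_group.value_counts(normalize=True),
})

print(counts)
print(proportions)
```

The large group has far more "yes" responses in absolute terms (450 vs 12), yet the "yes" proportion is higher in the small group (0.60 vs 0.45).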
Cumulative Relative Frequency
# Cumulative relative frequency
cumulative_freq = relative_freq.cumsum()
print("\nCumulative Relative Frequency:")
print(cumulative_freq.round(4))
Cumulative Relative Frequency Formula
$$F_k = \sum_{i=1}^{k} \frac{n_i}{N}$$
Here,
- $F_k$ = Cumulative relative frequency up to category k
- $n_i$ = Frequency of the i-th category
- $N$ = Total number of observations
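Cumulative relative frequency is easiest to interpret when the categories have a natural order, because "up to category k" then has a clear meaning. Note that `value_counts` sorts by count, so a plain `cumsum` accumulates in count order. The following is a minimal sketch, assuming hypothetical 1-to-5 ratings, that re-sorts by category value before accumulating.

```python
import pandas as pd

# Hypothetical 1-to-5 satisfaction ratings (ordered categories)
ratings = pd.Series([5, 4, 4, 3, 5, 2, 4, 1, 3, 4])

# Relative frequencies, re-sorted by rating value rather than by count
rel_freq = ratings.value_counts(normalize=True).sort_index()

# Running total: proportion of ratings less than or equal to each value
cum_rel_freq = rel_freq.cumsum()

print(cum_rel_freq)
```

Here `cum_rel_freq.loc[3]` is the share of ratings that are 3 or lower, which is 0.4 for this sample.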
Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
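# Bar chart of relative frequencies
# (colors are looked up by category name, because value_counts orders the bars by count)
color_map = {'Red': '#e74c3c', 'Blue': '#3498db', 'Green': '#2ecc71', 'Yellow': '#f39c12'}
relative_freq.plot(kind='bar', color=[color_map[c] for c in relative_freq.index], ax=axes[0])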
axes[0].set_title('Relative Frequency Distribution')
axes[0].set_ylabel('Relative Frequency')
axes[0].set_ylim(0, 0.4)
# Frequency polygon: a line through the same heights (x-axis order follows value_counts, i.e. descending count)
relative_freq.plot(kind='line', marker='o', ax=axes[1])
axes[1].set_title('Frequency Polygon')
axes[1].set_ylabel('Relative Frequency')
plt.tight_layout()
plt.savefig('relative-frequency.png', dpi=150)
plt.show()
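For numeric data, the same idea carries over to histograms, where it helps to distinguish relative frequency (bar heights sum to 1) from density (bar areas sum to 1). The sketch below uses `np.histogram` on a hypothetical normal sample; the sample and the choice of 16 bins are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=0.0, scale=1.0, size=1000)   # hypothetical numeric sample

# Counts per bin; the bin range defaults to [min, max], so every value is included
counts, edges = np.histogram(values, bins=16)
widths = np.diff(edges)

# Relative frequency: bar heights sum to 1
rel_freq = counts / counts.sum()

# Density: relative frequency divided by bin width, so bar areas sum to 1
density, _ = np.histogram(values, bins=16, density=True)

print(rel_freq.sum())
print((density * widths).sum())
print(np.allclose(density, rel_freq / widths))
```

With matplotlib, passing `density=True` to `plt.hist` produces the density version of the plot.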
Probability Estimation from Data
# Using relative frequency as probability estimate
print("Probability Estimates:")
for color, freq in relative_freq.items():
    print(f" P({color}) ≈ {freq:.4f}")
# Verify sum equals 1
print(f"\nSum of probabilities: {relative_freq.sum():.4f}")
Law of Large Numbers
As the number of independent, identically distributed observations increases, the relative frequency of an event converges to its true probability. This is the foundation of the frequentist interpretation of probability.
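The convergence can be seen in a small simulation. The sketch below assumes an event with true probability 0.3 (an arbitrary value chosen for illustration) and tracks the running relative frequency over a growing number of independent trials.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3          # assumed true probability of the event
n_max = 100_000

# Simulate independent trials: True where the event occurred
outcomes = rng.random(n_max) < true_p

# Running relative frequency after 1, 2, ..., n_max trials
running_freq = np.cumsum(outcomes) / np.arange(1, n_max + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = running_freq[n - 1]
    print(f"n = {n:>7,}: relative frequency = {estimate:.4f}, "
          f"error = {abs(estimate - true_p):.4f}")
```

The error tends to shrink as n grows, roughly in proportion to 1/sqrt(n), although it does not have to decrease at every step.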
Relative Frequency in Machine Learning
| ML Application | Relative Frequency Usage | Why |
|---|---|---|
| Class balance | Check target distribution | Detect imbalance |
| NLP | Word frequency (Zipf's law) | Tokenization strategy |
| Feature engineering | Frequency encoding | Replace categories with their relative frequencies |
| Data validation | Expected vs observed proportions | Detect data drift |
import numpy as np
import pandas as pd
np.random.seed(42)
# Class imbalance detection
y = np.random.choice(['fraud', 'legit'], 10000, p=[0.02, 0.98])
freq = pd.Series(y).value_counts(normalize=True)
print("Relative frequency (class balance):")
print(freq.round(4))
print(f"\nFraud rate: {freq['fraud']:.2%} — extreme imbalance!")
print("Solutions: SMOTE, class weights, undersampling")
# Frequency encoding
categories = np.random.choice(['A', 'B', 'C', 'D'], 1000, p=[0.5, 0.3, 0.15, 0.05])
freq_map = pd.Series(categories).value_counts(normalize=True).to_dict()
encoded = [freq_map[c] for c in categories]
print(f"\nFrequency encoding: {freq_map}")
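The data validation row in the table above (expected vs observed proportions) can be sketched in the same way. The example below compares hypothetical training-time proportions with a hypothetical new batch using total variation distance; the metric, the data, and the `DRIFT_THRESHOLD` value are illustrative assumptions, not a prescribed method.

```python
import pandas as pd

# Hypothetical category proportions seen at training time
expected = pd.Series({'A': 0.50, 'B': 0.30, 'C': 0.15, 'D': 0.05})

# Hypothetical batch of new data with a shifted category mix
new_batch = pd.Series(['A'] * 350 + ['B'] * 300 + ['C'] * 250 + ['D'] * 100)
observed = new_batch.value_counts(normalize=True)

# Align on the expected labels; a category absent from the batch counts as 0
observed = observed.reindex(expected.index, fill_value=0.0)

# Total variation distance: half the sum of absolute differences
# (0 means identical distributions, 1 means no overlap)
tvd = 0.5 * (observed - expected).abs().sum()

DRIFT_THRESHOLD = 0.1   # illustrative cutoff, not a standard value
print(f"Total variation distance: {tvd:.3f}")
print("Drift suspected" if tvd > DRIFT_THRESHOLD else "No drift flagged")
```

Categories that appear in the new batch but not in `expected` are dropped by the `reindex` step here, so a real check would need to handle unseen categories separately. A formal hypothesis test, such as a chi-square goodness-of-fit test, is another common option.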
Key Takeaways
Summary: Relative Frequency
- Relative frequency = count / total — converts counts to proportions
- Sum of all relative frequencies = 1 — represents the entire sample
- Cumulative relative frequency shows the running total of proportions
- Relative frequency estimates probability — more data -> better estimates
- pandas value_counts(normalize=True) computes relative frequencies directly
- Frequency polygons visualize the shape of the distribution, and are most informative when the categories are ordered or are numeric bins