Histograms
The Shape of Your Data Reveals Everything
A histogram groups numerical data into bins and shows how frequently values fall into each range. Unlike bar charts, histograms have no gaps between bars because the underlying data is continuous.
- Distribution shape — Symmetric, skewed, bimodal, or uniform — the histogram shows it all
- Center and spread — See where data clusters and how far it stretches
- Outlier detection — Spot unusual values that sit far from the main body
- Bin width sensitivity — Too few bins hide structure; too many create noise
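Plot a histogram before computing summary statistics. The shape tells you which ones are meaningful: the mean and standard deviation summarize a symmetric distribution well, while a skewed one is better described by the median and IQR.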
What is a Histogram?
Definition
A histogram is a bar graph that shows the distribution of numerical data by grouping it into bins (intervals). Unlike bar charts (for categorical data), histograms have no gaps between bars — because the data is continuous.
Anatomy of a Histogram
- X-axis: The range of the variable, divided into equal-width bins
- Y-axis: Frequency (count), relative frequency, or density
- Bar height: Count of observations in that bin (or relative frequency or density, depending on the y-axis)
- Bar width: The bin width (class interval)
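The anatomy above maps directly onto what `np.histogram` returns. The following is a minimal sketch, assuming NumPy is available; the sample values and variable names are illustrative and chosen so the bins can be checked by hand.

```python
import numpy as np

# Illustrative sample (hypothetical values, not from a real dataset)
values = np.array([1.0, 1.5, 2.0, 2.5, 2.7, 3.1, 3.3, 3.8, 4.2, 5.0])

# Four equal-width bins spanning [0, 8]; each bin is 2.0 wide
counts, edges = np.histogram(values, bins=4, range=(0.0, 8.0))
widths = np.diff(edges)

# Relative frequency: fraction of observations per bin (sums to 1)
rel_freq = counts / counts.sum()

# Density: relative frequency divided by bin width, so the bar areas sum to 1
density = rel_freq / widths

print("edges:   ", edges)
print("counts:  ", counts)
print("rel freq:", rel_freq)
print("density: ", density)
print("total area:", (density * widths).sum())
```

The three y-axis choices differ only in scale: the bar heights keep the same proportions, and only the density version has bars whose areas sum to 1.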
Building a Histogram in Python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.random.seed(42)
# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')
# 2. Too few bins (underfitting)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n-> Hides structure')
# 3. Too many bins (overfitting)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n-> Too noisy')
# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()
# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')
# 6. Compare two distributions
data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
axes[1,2].hist(data_a, bins=20, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=20, alpha=0.6, color='orange', label='Method B', density=True)
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()
plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()
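In panel 6 above, each `hist` call derives its own 20 bins from its own data range, so the two sets of bars do not line up exactly. One option, sketched here under the assumption that NumPy is available, is to compute a single set of edges from the pooled data with `np.histogram_bin_edges` and reuse it for both groups. The names `rng` and `shared_edges` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data_a = rng.normal(35, 8, 200)   # same parameters as Method A above
data_b = rng.normal(42, 6, 200)   # same parameters as Method B above

# One set of 20 equal-width bins covering the pooled range of both groups
shared_edges = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=20)

counts_a, edges_a = np.histogram(data_a, bins=shared_edges)
counts_b, edges_b = np.histogram(data_b, bins=shared_edges)

# Both histograms now use identical bin edges, so bars are directly comparable
print("same edges:", np.array_equal(edges_a, edges_b))
print("bin width: ", round(float(np.diff(shared_edges)[0]), 2))
```

The same array can be passed as `bins=shared_edges` to both `axes[1,2].hist(...)` calls, since Matplotlib accepts a sequence of edges for `bins`.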
Common Distribution Shapes
Symmetric / Bell-Shaped (Normal)
Both tails are mirror images. Mean ≈ Median ≈ Mode.
Right-Skewed (Positive Skew)
Long right tail. Typically Mean > Median > Mode. Common in: income, wait times.
Left-Skewed (Negative Skew)
Long left tail. Typically Mean < Median < Mode. Common in: age at death, exam scores on an easy test.
Bimodal
Two peaks. Often indicates two distinct subpopulations mixed together.
Uniform
Roughly equal frequency across all values. Uniform random number generators (for example np.random.uniform) produce this.
# Visualize all shapes
fig, axes = plt.subplots(1, 5, figsize=(18, 4))
np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30, 5, 500),
                                                  np.random.normal(70, 5, 500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}
# One panel per shape, with mean and median marked for comparison
for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)
plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()
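The mean and median orderings described for each shape can also be checked numerically. Below is a rough sketch, assuming NumPy only; `sample_skewness` is a hypothetical helper that computes the standardized third moment, and the samples mirror the ones plotted above but are drawn with a separate generator.

```python
import numpy as np

def sample_skewness(x):
    """Standardized third moment: > 0 for a long right tail, < 0 for a long left tail."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(50, 10, 10_000),
    "right-skewed": rng.lognormal(3, 0.8, 10_000),
    "left-skewed": 100 - rng.exponential(10, 10_000),
}

for name, data in samples.items():
    print(f"{name:>12}: mean={data.mean():7.2f}  median={np.median(data):7.2f}  "
          f"skew={sample_skewness(data):+.2f}")
```

A skewness near zero goes with a mean close to the median; a clearly positive or negative value goes with the mean being pulled toward the long tail.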
Choosing the Right Number of Bins
| Rule | Formula (k = number of bins, h = bin width) | Best For |
|---|---|---|
| Sturges | k = ⌈1 + log₂(n)⌉ | Normal-ish, small n |
| Scott | h = 3.49σ/n^(1/3) | Normal data |
| Freedman-Diaconis | h = 2·IQR/n^(1/3) | Skewed or outlier-prone |
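def optimal_bins(data):
    """Print and return bin counts from the Sturges, Scott, and Freedman-Diaconis rules."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    data_range = data.max() - data.min()

    # Sturges: bin count from the sample size alone
    sturges = int(np.ceil(1 + np.log2(n)))

    # Scott: bin width from the standard deviation; fall back to Sturges if the width is zero
    scott_width = 3.49 * np.std(data) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width)) if scott_width > 0 else sturges

    # Freedman-Diaconis: bin width from the IQR, which is robust to outliers
    fd_width = 2 * iqr / n**(1/3)
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges

    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins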
print("Task times data:")
optimal_bins(task_times)
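NumPy also ships these rules as named estimators, so they do not have to be coded by hand. The sketch below assumes NumPy is available and uses `np.histogram_bin_edges` with the string options `'sturges'`, `'scott'`, `'fd'`, and `'auto'` (which combines the Sturges and Freedman-Diaconis estimates). The sample is regenerated here with a separate generator, so its numbers will not match the `task_times` array above exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
# Illustrative right-skewed sample, drawn with the same parameters as task_times above
task_times = rng.lognormal(mean=3.5, sigma=0.4, size=200)

for rule in ("sturges", "scott", "fd", "auto"):
    edges = np.histogram_bin_edges(task_times, bins=rule)
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]   # actual spacing of the returned edges
    print(f"{rule:>8}: {n_bins:3d} bins, width = {width:.2f}")

# Worked check of Sturges' rule for n = 200:
# k = ceil(1 + log2(200)) = ceil(8.64) = 9
sturges_k = int(np.ceil(1 + np.log2(len(task_times))))
print("Sturges by hand:", sturges_k)
```

The same strings can be passed as `bins=` to `plt.hist`, which hands them on to NumPy.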
Histograms in Machine Learning
In ML, histograms are everywhere:
| ML Application | What to Histogram | What to Look For |
|---|---|---|
| Feature engineering | Each input feature | Skewness → log transform |
| Model evaluation | Residuals (y - ŷ) | Normal → valid confidence intervals |
| Data drift detection | Feature distributions over time | Shifts between train/test |
| Training diagnostics | Per-example loss at a given epoch | Long tail of hard or mislabeled examples |
| Probability calibration | Predicted probabilities, binned and compared with observed outcome rates | Observed rate ≈ predicted probability in each bin |
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
np.random.seed(42)
# Generate a skewed feature; the target is linear in log(feature),
# so a linear model fits best after the log transform
n = 500
X = np.random.lognormal(3, 1, (n, 1))
y = 50 + 5 * np.log(X[:, 0]) + np.random.normal(0, 5, n)
# Before training: check feature distribution
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Raw feature — skewed
axes[0].hist(X[:,0], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Raw Feature (Skewed)\n→ Model struggles')
axes[0].set_xlabel('Feature Value')
# Log transform — now symmetric
X_log = np.log(X)
axes[1].hist(X_log[:,0], bins=30, color='green', edgecolor='black', alpha=0.7)
axes[1].set_title('Log Transformed (Symmetric)\n→ Model performs better')
axes[1].set_xlabel('Log(Feature)')
# Residuals after training
X_train, X_test, y_train, y_test = train_test_split(X_log, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred
axes[2].hist(residuals, bins=25, color='coral', edgecolor='black', alpha=0.7)
axes[2].axvline(0, color='red', linewidth=2, linestyle='--')
axes[2].set_title('Residuals (Normal-ish)\n→ Valid confidence intervals')
axes[2].set_xlabel('Residual')
plt.tight_layout()
plt.savefig('ml_histograms.png', dpi=150)
plt.show()
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Residual mean: {residuals.mean():.4f} (should be ~0)")
print(f"Residual skew: {float(np.mean(((residuals - residuals.mean())/residuals.std())**3)):.3f}")
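For the probability calibration row in the table above, a histogram of predicted probabilities is usually paired with a per-bin comparison against observed outcomes. The following is a minimal sketch, assuming NumPy only and synthetic predictions that are calibrated by construction; every name in it is illustrative rather than taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Synthetic predictions that are calibrated by construction:
# each label is positive with exactly the predicted probability.
y_prob = rng.uniform(0, 1, n)
y_true = (rng.uniform(0, 1, n) < y_prob).astype(float)

# Ten equal-width probability bins; digitize gives 1-based bin numbers
edges = np.linspace(0, 1, 11)
bin_idx = np.clip(np.digitize(y_prob, edges) - 1, 0, 9)

counts = np.bincount(bin_idx, minlength=10)   # histogram of predictions
mean_pred = np.bincount(bin_idx, weights=y_prob, minlength=10) / counts
frac_pos = np.bincount(bin_idx, weights=y_true, minlength=10) / counts

for lo, c, p, f in zip(edges[:-1], counts, mean_pred, frac_pos):
    print(f"[{lo:.1f}, {lo + 0.1:.1f}): n={c:5d}  predicted={p:.3f}  observed={f:.3f}")

max_gap = float(np.max(np.abs(mean_pred - frac_pos)))
print(f"largest |predicted - observed| gap: {max_gap:.3f}")
```

A well-calibrated model shows small gaps in every bin regardless of how the predictions themselves are distributed; the `counts` histogram mainly indicates which bins have enough data to trust.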
Key Takeaways
Summary: Histograms
- Histograms reveal the shape, center, spread, and gaps in data — always plot one first
- Bin width is a critical choice — too wide hides structure; too narrow creates noise
- Shape tells you which statistics to use: symmetric -> mean; skewed -> median
- Bimodal distributions often signal mixed populations that should be analyzed separately
- Use density (not count) on y-axis when comparing groups of different sizes
- Add KDE (kernel density estimate) to smooth the histogram for a cleaner shape estimate
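The point about density versus count can be checked with two samples from the same distribution but of very different sizes. This is a small sketch, assuming NumPy only; the sample sizes and bin edges are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(50, 10, 200)     # 200 observations
large = rng.normal(50, 10, 5_000)   # same distribution, 25x more observations

edges = np.linspace(10, 90, 17)     # shared bins, each 5 units wide

count_small, _ = np.histogram(small, bins=edges)
count_large, _ = np.histogram(large, bins=edges)
dens_small, _ = np.histogram(small, bins=edges, density=True)
dens_large, _ = np.histogram(large, bins=edges, density=True)

# Raw counts differ by roughly the ratio of the sample sizes...
print("peak count:  ", count_small.max(), "vs", count_large.max())
# ...while densities are on the same scale and can be overlaid directly
print("peak density:", round(float(dens_small.max()), 3),
      "vs", round(float(dens_large.max()), 3))
```

On a count axis the smaller group would be nearly invisible next to the larger one; on a density axis both histograms have total area 1 and their shapes can be compared.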