Histograms


The Shape of Your Data Reveals Everything

A histogram groups numerical data into bins and shows how frequently values fall into each range. Unlike bar charts, histograms have no gaps between bars because the underlying data is continuous.

Distribution shape — Symmetric, skewed, bimodal, or uniform — the histogram shows it all
Center and spread — See where data clusters and how far it stretches
Outlier detection — Spot unusual values that sit far from the main body
Bin width sensitivity — Too few bins hide structure; too many create noise

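Plot a histogram before you compute summary statistics. The shape tells you which ones describe the data well: the mean and standard deviation suit a roughly symmetric distribution, while the median and IQR are safer for a skewed one.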

What is a Histogram?

Definition

A histogram is a bar-style plot that shows the distribution of numerical data by grouping it into bins (intervals). Unlike a bar chart, which compares separate categories, a histogram has no gaps between bars because adjacent bins cover a continuous numeric range.

Anatomy of a Histogram

[Figure: example histogram of scores from 50 to 100 (x-axis: Score; y-axis: Frequency)]

X-axis: The range of the variable, divided into equal-width bins
Y-axis: Frequency (count), relative frequency, or density
Bar height: The count of observations in that bin (or its relative frequency or density, matching the y-axis)
Bar width: The bin width (class interval)
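The three y-axis options listed above can be computed by hand from the same bin counts. The sketch below uses a small set of made-up exam scores (the values are invented for illustration, not taken from the figure) and NumPy's `np.histogram`.

```python
import numpy as np

# Hypothetical exam scores, invented for illustration.
scores = np.array([52, 58, 61, 64, 67, 70, 71, 73, 75, 76,
                   78, 79, 81, 83, 85, 88, 90, 93, 96, 99])

# Five equal-width bins covering 50..100 (bin width = 10).
edges = np.linspace(50, 100, 6)
counts, _ = np.histogram(scores, bins=edges)

bin_width = np.diff(edges)
relative_freq = counts / counts.sum()   # bar heights sum to 1
density = relative_freq / bin_width     # bar areas (height * width) sum to 1

# np.histogram can also return the density directly.
density_np, _ = np.histogram(scores, bins=edges, density=True)

# Bins are half-open [lo, hi), except the last one, which also includes its right edge.
for lo, hi, c, rf, d in zip(edges[:-1], edges[1:], counts, relative_freq, density):
    print(f"{lo:.0f}-{hi:.0f}: count={c}, relative={rf:.2f}, density={d:.3f}")
```

Counts depend on sample size, relative frequencies do not, and densities also stay comparable when bin widths differ.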

Building a Histogram in Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')

# 2. Too few bins (underfitting)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n-> Hides structure')

# 3. Too many bins (overfitting)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n-> Too noisy')

# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()

# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')

# 6. Compare two distributions
data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
axes[1,2].hist(data_a, bins=20, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=20, alpha=0.6, color='orange', label='Method B', density=True)
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()

plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()

Common Distribution Shapes

Symmetric / Bell-Shaped (Normal)

[Figure: histogram of a normal (symmetric) distribution (y-axis: Frequency)]

Both tails are mirror images. Mean ≈ Median ≈ Mode.

Right-Skewed (Positive Skew)

[Figure: histogram of a right-skewed distribution (y-axis: Frequency)]

Long right tail. Typically mean > median > mode. Common in: income, wait times, house prices.

Left-Skewed (Negative Skew)

[Figure: histogram of a left-skewed distribution (y-axis: Frequency)]

Long left tail. Typically mean < median < mode. Common in: age at death, exam scores on an easy test.
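The mean and median orderings described for symmetric and skewed shapes can be checked numerically. This is a minimal sketch on simulated samples; the distributions and parameters are chosen for illustration, and `scipy.stats.skew` reports the sample skewness (positive for a long right tail, negative for a long left tail).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated samples, one per shape (parameters are illustrative assumptions).
samples = {
    "symmetric": rng.normal(50, 10, 5000),
    "right-skewed": rng.lognormal(3, 0.8, 5000),
    "left-skewed": 100 - rng.exponential(10, 5000),
}

summary = {}
for name, data in samples.items():
    summary[name] = {
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "skew": float(stats.skew(data)),
    }
    s = summary[name]
    print(f"{name:>12}: mean={s['mean']:.1f}  median={s['median']:.1f}  skew={s['skew']:+.2f}")
```

In the skewed samples the mean is pulled toward the long tail, which is why the median is usually the safer measure of center there.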

Bimodal

[Figure: histogram of a bimodal distribution (y-axis: Frequency)]

Two peaks. Often indicates two distinct subpopulations mixed together.
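One possible way to separate two mixed subpopulations is a two-component Gaussian mixture. This is a sketch, not the only approach: it assumes each subpopulation is roughly normal, and the simulated data (peaks near 30 and 70) is an illustrative assumption. It uses scikit-learn's `GaussianMixture`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated bimodal data: two groups mixed together (illustrative values).
mixed = np.concatenate([rng.normal(30, 5, 500), rng.normal(70, 5, 500)])

# Fit a two-component mixture; scikit-learn expects a 2D array of shape (n, 1).
gmm = GaussianMixture(n_components=2, random_state=0).fit(mixed.reshape(-1, 1))
labels = gmm.predict(mixed.reshape(-1, 1))
component_means = np.sort(gmm.means_.ravel())

# Summarize each recovered group separately.
for k in range(2):
    group = mixed[labels == k]
    print(f"component {k}: n={len(group)}, mean={group.mean():.1f}, std={group.std():.1f}")
```

A mean computed over the whole mixed sample would land near 50, a value that few observations actually take; the per-group summaries are more informative.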

Uniform

[Figure: histogram of a uniform distribution (y-axis: Frequency)]

Roughly equal frequency across all values. A uniform random number generator such as np.random.uniform produces this shape.

# Visualize all shapes
fig, axes = plt.subplots(1, 5, figsize=(18, 4))

np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30,5,500),
                                                    np.random.normal(70,5,500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}

for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)

plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()

Choosing the Right Number of Bins

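Rule	Formula (k = number of bins, h = bin width)	Best For
Sturges	k = ⌈1 + log₂(n)⌉	Normal-ish, small n
Scott	h = 3.49·σ/n^(1/3)	Normal data
Freedman-Diaconis	h = 2·IQR/n^(1/3)	Skewed or outlier-prone

Here n is the sample size, σ is the standard deviation, and IQR is the interquartile range. For the two width-based rules, the number of bins is the data range divided by h, rounded up.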

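def optimal_bins(data):
    """Print and return bin counts from the Sturges, Scott, and Freedman-Diaconis rules."""
    data = np.asarray(data, dtype=float)  # accept lists as well as arrays
    n = len(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    data_range = data.max() - data.min()

    # Sturges gives a bin count directly
    sturges = int(np.ceil(1 + np.log2(n)))

    # Scott and Freedman-Diaconis give a bin width; convert it to a bin count.
    # Fall back to Sturges if a width is zero (e.g., constant data).
    scott_width = 3.49 * np.std(data) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width)) if scott_width > 0 else sturges
    fd_width = 2 * iqr / n**(1/3)
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges

    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins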

print("Task times data:")
optimal_bins(task_times)
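NumPy also ships these rules as named estimators, so a hand-written helper is not strictly needed. The sketch below regenerates a lognormal sample with the same parameters as the task-time example (using a separate random generator, so the values differ from `task_times`) and asks `np.histogram_bin_edges` for the edges under each rule.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=3.5, sigma=0.4, size=200)

bin_counts = {}
for rule in ("sturges", "scott", "fd", "auto"):
    # Returns the bin edges; the number of bins is one less than the number of edges.
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins, width = {edges[1] - edges[0]:.2f}")
```

The same strings can be passed straight to Matplotlib, for example `plt.hist(sample, bins='fd')`. NumPy's `'auto'` rule uses the narrower of the Sturges and Freedman-Diaconis widths.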

Histograms in Machine Learning

In ML, histograms are everywhere:

ML Application	What to Histogram	What to Look For
Feature engineering	Each input feature	Strong right skew → consider a log transform
Model evaluation	Residuals (y - ŷ)	Roughly normal and centered at 0 → normal-theory confidence intervals are reasonable
Data drift detection	Feature distributions over time	Shifts between training data and newer data
Training diagnostics	Per-sample loss values	Heavy right tail → hard or mislabeled examples
Probability calibration	Predicted probabilities	How confident the predictions are; pair with a reliability diagram to judge calibration

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# Generate skewed feature data
n = 500
X = np.random.lognormal(3, 1, (n, 1))
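# Target depends linearly on log(X), so the log transform below matches the true relationship
y = 50 + 5 * np.log(X[:, 0]) + np.random.normal(0, 5, n)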

# Before training: check feature distribution
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Raw feature — skewed
axes[0].hist(X[:,0], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Raw Feature (Skewed)\n→ Model struggles')
axes[0].set_xlabel('Feature Value')

# Log transform — now symmetric
X_log = np.log(X)
axes[1].hist(X_log[:,0], bins=30, color='green', edgecolor='black', alpha=0.7)
axes[1].set_title('Log Transformed (Symmetric)\n→ Model performs better')
axes[1].set_xlabel('Log(Feature)')

# Residuals after training
X_train, X_test, y_train, y_test = train_test_split(X_log, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

axes[2].hist(residuals, bins=25, color='coral', edgecolor='black', alpha=0.7)
axes[2].axvline(0, color='red', linewidth=2, linestyle='--')
axes[2].set_title('Residuals (Normal-ish)\n→ Valid confidence intervals')
axes[2].set_xlabel('Residual')

plt.tight_layout()
plt.savefig('ml_histograms.png', dpi=150)
plt.show()

print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Residual mean: {residuals.mean():.4f} (should be ~0)")
print(f"Residual skew: {float(np.mean(((residuals - residuals.mean())/residuals.std())**3)):.3f}")
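The data drift row in the table can be sketched the same way: histogram the training values and the newer values of a feature on shared bin edges, then measure how far apart the two histograms are. The data below is simulated, and the choice of total variation distance plus a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) is one reasonable option rather than a standard prescribed by this lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated feature values: the newer data has a shifted mean (illustrative assumption).
train_feature = rng.normal(35, 8, 2000)
live_feature = rng.normal(42, 8, 2000)

# Use shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([train_feature, live_feature]), bins=20)
train_frac = np.histogram(train_feature, bins=edges)[0] / len(train_feature)
live_frac = np.histogram(live_feature, bins=edges)[0] / len(live_feature)

# Total variation distance between the binned distributions (0 = identical, 1 = disjoint).
tv_distance = 0.5 * np.abs(train_frac - live_frac).sum()

# A two-sample KS test on the raw values gives a complementary check.
ks_stat, p_value = stats.ks_2samp(train_feature, live_feature)

print(f"Total variation distance: {tv_distance:.3f}")
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.2e}")
```

What counts as a large enough distance to act on depends on the application, so any alert threshold would need to be chosen from experience with the feature in question.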

Key Takeaways

Summary: Histograms

Histograms reveal the shape, center, spread, and gaps in data — always plot one first
Bin width is a critical choice — too wide hides structure; too narrow creates noise
Shape tells you which statistics to use: symmetric -> mean; skewed -> median
Bimodal distributions often signal mixed populations that should be analyzed separately
Use density (not count) on y-axis when comparing groups of different sizes
Add KDE (kernel density estimate) to smooth the histogram for a cleaner shape estimate
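The point about density versus count can be seen with two groups drawn from the same distribution but with different sizes. This is a minimal sketch on simulated data; the group sizes and bin edges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same underlying distribution, very different sample sizes.
big_group = rng.normal(35, 8, 2000)
small_group = rng.normal(35, 8, 200)

edges = np.linspace(0, 70, 15)  # 14 shared bins of width 5

big_counts, _ = np.histogram(big_group, bins=edges)
small_counts, _ = np.histogram(small_group, bins=edges)
big_density, _ = np.histogram(big_group, bins=edges, density=True)
small_density, _ = np.histogram(small_group, bins=edges, density=True)

# On a count axis the small group is dwarfed; on a density axis the shapes line up.
print("Tallest count bar:  ", big_counts.max(), "vs", small_counts.max())
print("Tallest density bar:", round(big_density.max(), 3), "vs", round(small_density.max(), 3))
```

With counts, the larger group's bars are roughly ten times taller even though both groups follow the same distribution. With density, each histogram has total area 1, so the shapes can be compared directly.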
