
Histograms — Construction, Interpretation, and Common Shapes


The Shape of Your Data Reveals Everything

A histogram groups numerical data into bins and shows how frequently values fall into each range. Unlike bar charts, histograms have no gaps between bars because the underlying data is continuous.

  • Distribution shape — Symmetric, skewed, bimodal, or uniform — the histogram shows it all
  • Center and spread — See where data clusters and how far it stretches
  • Outlier detection — Spot unusual values that sit far from the main body
  • Bin width sensitivity — Too few bins hide structure; too many create noise

Plot a histogram before computing summary statistics. The shape tells you which ones are meaningful: the mean summarizes symmetric data well, while the median is the safer choice for skewed data.


What is a Histogram?

Definition

A histogram is a bar graph that shows the distribution of numerical data by grouping it into bins (intervals). Unlike bar charts (for categorical data), histograms have no gaps between bars — because the data is continuous.
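
To make the binning step concrete, here is a minimal sketch that counts how many values fall into each interval. The scores are made up for illustration, and `np.histogram` is used only to do the counting; no plot is drawn.

```python
import numpy as np

# Hypothetical exam scores, invented for this illustration
scores = np.array([52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 88, 91, 97])

# Five equal-width bins covering 50 to 100
edges = np.arange(50, 101, 10)  # [50, 60, 70, 80, 90, 100]
counts, edges = np.histogram(scores, bins=edges)

# Each bin is half-open [left, right); NumPy closes only the last bin on the right
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {'#' * int(count)} ({count})")
```

Each row of the text output corresponds to one bar: the bin edges set its position and width, and the count sets its height.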


Anatomy of a Histogram

[Figure: Anatomy of a Histogram. X-axis: Score (50 to 100); Y-axis: Frequency (0 to 16).]

  • X-axis: The range of the variable, divided into equal-width bins
  • Y-axis: Frequency (count), relative frequency, or density
  • Bar height: Frequency of observations in that bin
  • Bar width: The bin width (class interval)
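
The three y-axis choices listed above differ only by a scaling factor. The sketch below, using synthetic data and variable names chosen for this example, computes all three from the same bin counts and checks the manual density against NumPy's `density=True` option.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=70, scale=10, size=500)  # synthetic data

counts, edges = np.histogram(values, bins=10)
widths = np.diff(edges)  # bin widths (all equal here)

frequency = counts                           # raw count per bin
relative_frequency = counts / counts.sum()   # fractions that sum to 1
density = counts / (counts.sum() * widths)   # bar areas that sum to 1

# Compare the manual density with NumPy's built-in normalization
density_np, _ = np.histogram(values, bins=edges, density=True)
print("Matches NumPy density:", np.allclose(density, density_np))
print("Sum of relative frequencies:", relative_frequency.sum())
print("Total bar area under density:", (density * widths).sum())
```

Density is the scaling to use when a histogram is overlaid with a KDE or another probability density curve, because both then integrate to 1.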

Building a Histogram in Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')

# 2. Too few bins (underfitting)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n-> Hides structure')

# 3. Too many bins (overfitting)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n-> Too noisy')

# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()

# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')

# 6. Compare two distributions (shared bin edges keep the bars comparable)
data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
shared_bins = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=20)
axes[1,2].hist(data_a, bins=shared_bins, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=shared_bins, alpha=0.6, color='orange', label='Method B', density=True)
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()

plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()

Common Distribution Shapes

Symmetric / Bell-Shaped (Normal)

[Figure: histogram of a normal (symmetric) distribution. Y-axis: Frequency.]

Both tails are mirror images. Mean ≈ Median ≈ Mode.

Right-Skewed (Positive Skew)

[Figure: histogram of a right-skewed distribution. Y-axis: Frequency.]

Long right tail. Typically Mean > Median > Mode. Common in: income, wait times.

Left-Skewed (Negative Skew)

[Figure: histogram of a left-skewed distribution. Y-axis: Frequency.]

Long left tail. Typically Mean < Median < Mode. Common in: age at death, exam scores on an easy test.
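
The mean and median orderings for skewed data can be checked numerically. The following sketch draws synthetic samples (the distributions and parameters are assumptions chosen for illustration) and reports the mean, median, and sample skewness from `scipy.stats.skew`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples; parameters are illustrative only
samples = {
    "right-skewed": rng.lognormal(mean=3.0, sigma=0.8, size=5000),
    "left-skewed": 100 - rng.exponential(scale=10, size=5000),
    "symmetric": rng.normal(loc=50, scale=10, size=5000),
}

summary = {}
for name, data in samples.items():
    # Store (mean, median, skewness) for each sample
    summary[name] = (np.mean(data), np.median(data), stats.skew(data))
    mean, median, skew = summary[name]
    print(f"{name:>12}: mean={mean:6.2f}  median={median:6.2f}  skew={skew:+.2f}")
```

The mean is pulled toward the long tail, so it lands above the median for right skew and below it for left skew, while the sign of the skewness statistic matches the direction of the tail.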

Bimodal

[Figure: histogram of a bimodal distribution. Y-axis: Frequency.]

Two peaks. Often indicates two distinct subpopulations mixed together.

Uniform

[Figure: histogram of a uniform distribution. Y-axis: Frequency.]

Roughly equal frequency across all values. A uniform random number generator such as np.random.uniform produces this shape.

# Visualize all shapes
fig, axes = plt.subplots(1, 5, figsize=(18, 4))

np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30,5,500),
                                                    np.random.normal(70,5,500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}

for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)

plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()

Choosing the Right Number of Bins

Rule              | Formula            | Best For
Sturges           | k = 1 + log₂(n)    | Normal-ish, small n
Scott             | h = 3.49σ/n^(1/3)  | Normal data
Freedman-Diaconis | h = 2·IQR/n^(1/3)  | Skewed or outlier-prone

def optimal_bins(data):
    n = len(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    data_range = data.max() - data.min()
    
    sturges = int(np.ceil(1 + np.log2(n)))
    scott_width = 3.49 * np.std(data) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width))
    fd_width = 2 * iqr / n**(1/3)
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges
    
    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins

print("Task times data:")
optimal_bins(task_times)
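
NumPy also ships these rules as named estimators, so in practice the formulas rarely need to be coded by hand. The sketch below passes the rule names to `np.histogram_bin_edges`; the data is a synthetic stand-in generated for this example, and the resulting bin counts can differ slightly from a hand calculation because NumPy spreads equal-width bins across the full data range.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=3.5, sigma=0.4, size=200)  # synthetic stand-in data

bin_counts = {}
for rule in ("sturges", "scott", "fd"):
    # The string selects NumPy's built-in bin-width estimator
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1
    width = edges[1] - edges[0]
    print(f"{rule:>8}: {bin_counts[rule]} bins (width = {width:.2f})")
```

The same strings are accepted by the `bins` argument of `plt.hist`, for example `plt.hist(data, bins='fd')`.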

Histograms in Machine Learning

  • Feature distribution: skew? outliers?
  • Target distribution: class balance?
  • Residual distribution: normal errors?
  • Prediction distribution: confidence?

Histograms are among the most useful diagnostic tools in ML: always plot before you model.

In ML, histograms are everywhere:

ML Application          | What to Histogram               | What to Look For
Feature engineering     | Each input feature              | Skewness → log transform
Model evaluation        | Residuals (y - ŷ)               | Normal → valid confidence intervals
Data drift detection    | Feature distributions over time | Shifts between train/test
Loss curves             | Training loss per epoch         | Convergence behavior
Probability calibration | Predicted probabilities         | Concentration near 0 or 1 (calibration itself needs a reliability curve against observed outcomes)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# Generate skewed feature data
n = 500
X = np.random.lognormal(3, 1, (n, 1))
# Target depends on log(X), so a log transform of the feature linearizes the relationship
y = 50 + 5 * np.log(X[:,0]) + np.random.normal(0, 5, n)

# Before training: check feature distribution
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Raw feature — skewed
axes[0].hist(X[:,0], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Raw Feature (Skewed)\n→ Model struggles')
axes[0].set_xlabel('Feature Value')

# Log transform — now symmetric
X_log = np.log(X)
axes[1].hist(X_log[:,0], bins=30, color='green', edgecolor='black', alpha=0.7)
axes[1].set_title('Log Transformed (Symmetric)\n→ Model performs better')
axes[1].set_xlabel('Log(Feature)')

# Residuals after training
X_train, X_test, y_train, y_test = train_test_split(X_log, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

axes[2].hist(residuals, bins=25, color='coral', edgecolor='black', alpha=0.7)
axes[2].axvline(0, color='red', linewidth=2, linestyle='--')
axes[2].set_title('Residuals (Normal-ish)\n→ Valid confidence intervals')
axes[2].set_xlabel('Residual')

plt.tight_layout()
plt.savefig('ml_histograms.png', dpi=150)
plt.show()

print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Residual mean: {residuals.mean():.4f} (should be ~0)")
print(f"Residual skew: {float(np.mean(((residuals - residuals.mean())/residuals.std())**3)):.3f}")
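
For the data drift row in the table above, one simple way to put a number on a shift is to bin two samples on shared edges and compare the bin shares. The sketch below uses the total variation distance between the two normalized histograms; the helper name `histogram_shift` and the synthetic data are assumptions made for this example, not part of any library.

```python
import numpy as np

def histogram_shift(reference, current, bins=20):
    """Hypothetical helper: total variation distance between two binned samples.

    Returns a value in [0, 1]; 0 means identical bin shares.
    """
    # Shared edges so both samples are counted in the same intervals
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_share = ref_counts / ref_counts.sum()
    cur_share = cur_counts / cur_counts.sum()
    return 0.5 * np.abs(ref_share - cur_share).sum()

rng = np.random.default_rng(7)
train_feature = rng.normal(0.0, 1.0, 2000)    # reference sample
same_feature = rng.normal(0.0, 1.0, 2000)     # fresh draw, same distribution
drifted_feature = rng.normal(1.0, 1.0, 2000)  # mean shifted by one unit

no_drift = histogram_shift(train_feature, same_feature)
with_drift = histogram_shift(train_feature, drifted_feature)
print(f"Same distribution:    {no_drift:.3f}")
print(f"Shifted distribution: {with_drift:.3f}")
```

Even two draws from the same distribution give a small nonzero distance because of sampling noise, so any alert threshold would have to be calibrated against that baseline.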

Key Takeaways

Summary: Histograms

  1. Histograms reveal the shape, center, spread, and gaps in data — always plot one first
  2. Bin width is a critical choice — too wide hides structure; too narrow creates noise
  3. Shape tells you which statistics to use: symmetric -> mean; skewed -> median
  4. Bimodal distributions often signal mixed populations that should be analyzed separately
  5. Use density (not count) on y-axis when comparing groups of different sizes
  6. Add KDE (kernel density estimate) to smooth the histogram for a cleaner shape estimate
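
Takeaway 5 can be seen with two synthetic groups drawn from the same distribution but with very different sizes. In this sketch (group sizes and bin edges are assumptions chosen for illustration), raw counts differ by more than an order of magnitude while densities land on a similar scale.

```python
import numpy as np

rng = np.random.default_rng(3)
small_group = rng.normal(50, 10, 100)    # 100 observations
large_group = rng.normal(50, 10, 5000)   # 5000 observations, same distribution

edges = np.linspace(10, 90, 17)  # shared bins of width 5

small_counts, _ = np.histogram(small_group, bins=edges)
large_counts, _ = np.histogram(large_group, bins=edges)
small_density, _ = np.histogram(small_group, bins=edges, density=True)
large_density, _ = np.histogram(large_group, bins=edges, density=True)

print("Peak count:  ", small_counts.max(), "vs", large_counts.max())
print("Peak density:", round(small_density.max(), 3), "vs", round(large_density.max(), 3))
```

On a count axis the small group would be nearly invisible next to the large one; on a density axis the two shapes can be compared directly.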
