πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Entropy

Information TheoryEntropy🟒 Free Lesson

Advertisement

Entropy

Why It Matters

Entropy is the cornerstone of information theory. Introduced by Claude Shannon in 1948, it quantifies the fundamental uncertainty or "surprise" in a random variable. Every time you compress a file, train a neural network with cross-entropy loss, or split a decision tree, entropy is at work. Understanding entropy deeply gives you intuition for information gain, coding theory, loss function design, and the theoretical limits of compression.


Historical Context

Shannon's Breakthrough

Before Shannon, "information" was an intuitive concept. Shannon formalized it: the information gained from observing an event with probability pp is βˆ’log⁑2p-\log_2 p. Rare events carry more information. Entropy is the expected information content across all possible outcomes of a random variable. This single idea unified communication, compression, and cryptography under a mathematical framework.


Core Definitions

DfShannon Entropy

The Shannon entropy of a discrete random variable XX with probability mass function p(x)p(x) is:

H(X)=βˆ’βˆ‘x∈Xp(x)log⁑2p(x)H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

where the sum is over all possible values of XX, and we adopt the convention 0log⁑20=00 \log_2 0 = 0. Entropy is measured in bits when using log⁑2\log_2.

DfDifferential Entropy

For a continuous random variable XX with probability density function f(x)f(x):

h(X)=βˆ’βˆ«βˆ’βˆžβˆžf(x)log⁑2f(x) dxh(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx

Unlike discrete entropy, differential entropy can be negative and is not directly interpretable as "information content." However, differences in differential entropy (e.g., in rate-distortion theory) are meaningful.


Key Formulas

Shannon Entropy

H(X)=βˆ’βˆ‘xp(x)log⁑2p(x)H(X) = -\sum_{x} p(x) \log_2 p(x)

Here,

  • H(X)H(X)=Entropy of random variable X (in bits)
  • p(x)p(x)=Probability of outcome x
  • log2log_2=Logarithm base 2 (gives units in bits)

Entropy in Nats

H(X)=βˆ’βˆ‘xp(x)ln⁑p(x)H(X) = -\sum_{x} p(x) \ln p(x)

Here,

  • lnln=Natural logarithm (gives units in nats)
  • 1nat1 nat== 1 / ln(2) β‰ˆ 1.4427 bits

Joint Entropy

H(X,Y)=βˆ’βˆ‘x,yp(x,y)log⁑2p(x,y)H(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)

Here,

  • H(X,Y)H(X, Y)=Joint entropy of X and Y
  • p(x,y)p(x,y)=Joint probability of x and y

Conditional Entropy

H(Y∣X)=βˆ’βˆ‘x,yp(x,y)log⁑2p(y∣x)H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x)

Here,

  • H(Y∣X)H(Y|X)=Entropy of Y given X
  • p(y∣x)p(y|x)=Conditional probability of y given x

Relative Entropy (KL Divergence)

DKL(Pβˆ₯Q)=βˆ‘xp(x)log⁑p(x)q(x)D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Here,

  • DKLD_{KL}=Kullback-Leibler divergence
  • P,QP, Q=Two probability distributions

Properties and Theorems

ThNon-negativity

H(X)β‰₯0H(X) \geq 0 for all discrete random variables XX. Equality holds if and only if XX is deterministic (one outcome has probability 1).

ThMaximum Entropy

For a discrete random variable with ∣X∣|\mathcal{X}| possible outcomes:

H(X)≀log⁑2∣X∣H(X) \leq \log_2 |\mathcal{X}|

Equality holds when XX is uniformly distributed over its support.

ThChain Rule for Entropy

H(X,Y)=H(X)+H(Y∣X)=H(Y)+H(X∣Y)H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

The joint entropy decomposes into one marginal entropy plus one conditional entropy.

ThSubadditivity

H(X,Y)≀H(X)+H(Y)H(X, Y) \leq H(X) + H(Y)

Equality holds if and only if XX and YY are independent.

ThConditioning Reduces Entropy

H(Y∣X)≀H(Y)H(Y|X) \leq H(Y)

Knowing XX cannot increase uncertainty about YY. Equality holds iff XX and YY are independent.

ThShannon's Source Coding Theorem

No lossless compression scheme can compress a source with entropy H(X)H(X) to fewer than H(X)H(X) bits per symbol on average. Conversely, there exist codes that achieve average length arbitrarily close to H(X)H(X).


Worked Examples

Example 1: Fair Coin

Problem: Compute the entropy of a fair coin flip.

Solution: Fair Coin

A fair coin has p(heads)=p(tails)=0.5p(\text{heads}) = p(\text{tails}) = 0.5.

H(X)=βˆ’[0.5log⁑2(0.5)+0.5log⁑2(0.5)]=βˆ’[0.5(βˆ’1)+0.5(βˆ’1)]=1.0Β bitH(X) = -\left[0.5 \log_2(0.5) + 0.5 \log_2(0.5)\right] = -\left[0.5(-1) + 0.5(-1)\right] = 1.0 \text{ bit}

This is the maximum entropy for a binary variable.

Example 2: Biased Coin

Problem: Compute the entropy of a coin with p(heads)=0.9p(\text{heads}) = 0.9.

Solution: Biased Coin

H(X)=βˆ’[0.9log⁑2(0.9)+0.1log⁑2(0.1)]=βˆ’[0.9(βˆ’0.152)+0.1(βˆ’3.322)]=0.469Β bitsH(X) = -\left[0.9 \log_2(0.9) + 0.1 \log_2(0.1)\right] = -\left[0.9(-0.152) + 0.1(-3.322)\right] = 0.469 \text{ bits}

The biased coin has lower entropy because the outcome is more predictable.

Example 3: Loaded Die

Problem: Compute the entropy of a loaded die with probabilities [1/2,1/4,1/8,1/16,1/32,1/32][1/2, 1/4, 1/8, 1/16, 1/32, 1/32].

Solution: Loaded Die

H=βˆ’βˆ‘i=16pilog⁑2(pi)H = -\sum_{i=1}^{6} p_i \log_2(p_i)
=βˆ’[12(βˆ’1)+14(βˆ’2)+18(βˆ’3)+116(βˆ’4)+132(βˆ’5)+132(βˆ’5)]= -\left[\frac{1}{2}(-1) + \frac{1}{4}(-2) + \frac{1}{8}(-3) + \frac{1}{16}(-4) + \frac{1}{32}(-5) + \frac{1}{32}(-5)\right]
=0.5+0.5+0.375+0.25+0.15625+0.15625=1.9375Β bits= 0.5 + 0.5 + 0.375 + 0.25 + 0.15625 + 0.15625 = 1.9375 \text{ bits}

Compare with a fair die: H=log⁑2(6)β‰ˆ2.585H = \log_2(6) \approx 2.585 bits.

Example 4: Joint and Conditional Entropy

Problem: Let X,YX, Y be binary with joint distribution: p(0,0)=0.4p(0,0) = 0.4, p(0,1)=0.1p(0,1) = 0.1, p(1,0)=0.1p(1,0) = 0.1, p(1,1)=0.4p(1,1) = 0.4. Compute H(X)H(X), H(Y∣X)H(Y|X), and H(X,Y)H(X,Y).

Solution: Joint and Conditional

Marginals: p(X=0)=0.5p(X=0) = 0.5, p(X=1)=0.5p(X=1) = 0.5. So H(X)=1.0H(X) = 1.0 bit.

H(Y∣X=0)=βˆ’[0.8log⁑2(0.8)+0.2log⁑2(0.2)]=0.722H(Y|X=0) = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] = 0.722 bits.

H(Y∣X=1)=βˆ’[0.2log⁑2(0.2)+0.8log⁑2(0.8)]=0.722H(Y|X=1) = -[0.2 \log_2(0.2) + 0.8 \log_2(0.8)] = 0.722 bits.

H(Y∣X)=βˆ‘xp(x)H(Y∣X=x)=0.5(0.722)+0.5(0.722)=0.722H(Y|X) = \sum_x p(x) H(Y|X=x) = 0.5(0.722) + 0.5(0.722) = 0.722 bits.

H(X,Y)=H(X)+H(Y∣X)=1.0+0.722=1.722H(X,Y) = H(X) + H(Y|X) = 1.0 + 0.722 = 1.722 bits.


Python Implementation

import numpy as np
from collections import Counter

def entropy(probs):
    """Compute Shannon entropy in bits from a probability distribution."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

def entropy_from_samples(samples):
    """Estimate entropy from a list of samples."""
    counter = Counter(samples)
    total = len(samples)
    probs = np.array([count / total for count in counter.values()])
    return entropy(probs)

def joint_entropy(joint_probs):
    """Compute joint entropy from a 2D joint probability matrix."""
    flat = joint_probs.flatten()
    return entropy(flat)

def conditional_entropy(joint_probs):
    """Compute H(Y|X) from a joint probability matrix."""
    h_y_given_x = 0.0
    for i in range(joint_probs.shape[0]):
        p_x = joint_probs[i].sum()
        if p_x > 0:
            h_y_given_x += p_x * entropy(joint_probs[i] / p_x)
    return h_y_given_x

# --- Examples ---
# Fair coin
print(f"Fair coin: {entropy([0.5, 0.5]):.3f} bits")          # 1.000

# Biased coin
print(f"Biased (0.9,0.1): {entropy([0.9, 0.1]):.3f} bits")  # 0.469

# Fair die
print(f"Fair die: {entropy([1/6]*6):.3f} bits")              # 2.585

# Loaded die
loaded = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(f"Loaded die: {entropy(loaded):.3f} bits")              # 1.938

# Joint entropy example
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
print(f"Joint entropy: {joint_entropy(joint):.3f} bits")      # 1.722
print(f"Conditional H(Y|X): {conditional_entropy(joint):.3f}")  # 0.722

# Entropy from samples
np.random.seed(42)
samples = np.random.choice(['A', 'B', 'C', 'D'], size=10000, p=[0.5, 0.25, 0.125, 0.125])
print(f"Estimated entropy: {entropy_from_samples(samples):.3f}")  # ~1.75

Applications in AI/ML

Decision Trees

Decision trees use information gain to choose splits. For a feature AA splitting dataset DD into subsets DvD_v:

IG(D,A)=H(D)βˆ’βˆ‘v∣Dv∣∣D∣H(Dv)IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)

The feature with highest IG is chosen. This is directly derived from entropy.

Cross-Entropy Loss

Neural networks for classification minimize:

L=βˆ’1Nβˆ‘i=1Nβˆ‘c=1Cyiclog⁑p^ic\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{p}_{ic}

where yicy_{ic} is the true label and p^ic\hat{p}_{ic} is the predicted probability. This is the cross-entropy between the true distribution and the model's predicted distribution.

Compression

Shannon's source coding theorem guarantees that the optimal average code length for a source is H(X)H(X) bits per symbol. Huffman coding and arithmetic coding approach this limit.

Feature Selection

Mutual information I(X;Y)=H(Y)βˆ’H(Y∣X)I(X; Y) = H(Y) - H(Y|X) measures how much a feature XX reduces uncertainty about target YY. Features with higher MI are more informative.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using log⁑\log without specifying baseUnits depend on baseUse log⁑2\log_2 for bits, ln⁑\ln for nats
Forgetting 0log⁑0=00 \log 0 = 0 conventionCauses NaN errorsFilter zero probabilities before computing
Confusing entropy with informationEntropy is expected information, not instantaneousEntropy averages over all outcomes
Assuming entropy is always positive for continuous RVsDifferential entropy can be negativeDifferential entropy is not directly comparable
Treating conditional entropy as "entropy minus something"H(Yβˆ₯X)β‰ H(Y)βˆ’H(X)H(Y\|X) \neq H(Y) - H(X)Use the chain rule: H(Yβˆ₯X)=H(X,Y)βˆ’H(X)H(Y\|X) = H(X,Y) - H(X)

Interview Questions

Q1: Why does a fair coin have maximum entropy? A: A fair coin has p=0.5p = 0.5 for both outcomes, which maximizes βˆ’βˆ‘plog⁑p-\sum p \log p subject to βˆ‘p=1\sum p = 1. This is proven by Jensen's inequality or Lagrange multipliers.

Q2: Can entropy be negative? A: Discrete entropy H(X)β‰₯0H(X) \geq 0 always. Differential entropy h(X)h(X) can be negative (e.g., a very concentrated Gaussian with small variance).

Q3: What's the relationship between entropy and compression? A: Shannon's source coding theorem: you cannot compress below H(X)H(X) bits per symbol losslessly. Entropy is the theoretical minimum average code length.

Q4: How is entropy used in decision trees? A: Information gain IG(D,A)=H(D)βˆ’βˆ‘v∣Dv∣∣D∣H(Dv)IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v) measures how much a feature reduces uncertainty. The feature with highest IG is selected for splitting.

Q5: Why use log⁑2\log_2 instead of ln⁑\ln? A: log⁑2\log_2 gives entropy in bits, which is intuitive for binary decisions. ln⁑\ln gives nats. The choice doesn't affect which distribution has higher entropy, only the units.


Practice Problems

Problem 1: Minimum Entropy

Problem: What is the minimum possible entropy for a binary random variable? Under what condition is it achieved?

Solution: Minimum Entropy

The minimum is H(X)=0H(X) = 0, achieved when p=0p = 0 or p=1p = 1 (deterministic outcome). One outcome has probability 1, so there is no uncertainty.

Problem 2: Entropy Ordering

Problem: Rank the following distributions by entropy (lowest to highest): (a) [0.99, 0.01], (b) [0.5, 0.5], (c) [0.7, 0.3], (d) [0.25, 0.25, 0.25, 0.25].

Solution: Entropy Ordering

(a) Hβ‰ˆ0.081H \approx 0.081 bits, (c) Hβ‰ˆ0.881H \approx 0.881 bits, (b) H=1.0H = 1.0 bit, (d) H=2.0H = 2.0 bits.

Ordering: (a) < (c) < (b) < (d). The more concentrated the distribution, the lower the entropy. The uniform distribution over 4 outcomes has the highest.

Problem 3: Chain Rule Application

Problem: H(X)=3H(X) = 3 bits, H(X,Y)=5H(X,Y) = 5 bits. What is H(Y∣X)H(Y|X)? What is H(Y∣X)H(Y|X) if XX and YY are independent?

Solution: Chain Rule

By the chain rule: H(Y∣X)=H(X,Y)βˆ’H(X)=5βˆ’3=2H(Y|X) = H(X,Y) - H(X) = 5 - 3 = 2 bits.

If XX and YY are independent: H(Y∣X)=H(Y)H(Y|X) = H(Y). We need H(Y)=H(X,Y)βˆ’H(X)=2H(Y) = H(X,Y) - H(X) = 2 bits only if independent. But we're told H(X,Y)=5H(X,Y) = 5 and H(X)=3H(X) = 3, so H(Y∣X)=2H(Y|X) = 2 regardless. If additionally independent, H(Y)=2H(Y) = 2 and H(X,Y)=H(X)+H(Y)=5H(X,Y) = H(X) + H(Y) = 5 βœ“.


Quick Reference

QuantityFormulaUnits
Shannon EntropyH(X)=βˆ’βˆ‘xp(x)log⁑2p(x)H(X) = -\sum_x p(x) \log_2 p(x)bits
Joint EntropyH(X,Y)=βˆ’βˆ‘x,yp(x,y)log⁑2p(x,y)H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)bits
Conditional EntropyH(Yβˆ₯X)=H(X,Y)βˆ’H(X)H(Y\|X) = H(X,Y) - H(X)bits
Max Entropylog⁑2βˆ₯Xβˆ₯\log_2 \|\mathcal{X}\|bits
Information GainIG=H(D)βˆ’βˆ‘vβˆ₯Dvβˆ₯βˆ₯Dβˆ₯H(Dv)IG = H(D) - \sum_v \frac{\|D_v\|}{\|D\|} H(D_v)bits

Relationship to Other Quantities

Entropy as the Foundation

Entropy is the building block for many other information-theoretic quantities:

  • Mutual Information: I(X;Y)=H(X)βˆ’H(X∣Y)I(X;Y) = H(X) - H(X|Y) β€” measures shared information.
  • KL Divergence: DKL(Pβˆ₯Q)=H(P,Q)βˆ’H(P)D_{KL}(P\|Q) = H(P,Q) - H(P) β€” measures distribution mismatch.
  • Cross-Entropy: H(P,Q)=H(P)+DKL(Pβˆ₯Q)H(P,Q) = H(P) + D_{KL}(P\|Q) β€” the loss function for classification.
  • Information Gain: IG(D,A)=H(D)βˆ’H(D∣A)IG(D,A) = H(D) - H(D|A) β€” used in decision trees.
  • Perplexity: PPL=2H(P,Q)\text{PPL} = 2^{H(P,Q)} β€” used in language modeling.

ThData Processing Inequality

If Xβ†’Yβ†’ZX \to Y \to Z forms a Markov chain, then H(Z∣X)β‰₯H(Z∣Y)H(Z|X) \geq H(Z|Y). Processing data cannot decrease uncertainty about a downstream variable more than the original data already does.

ThStrong Subadditivity

For three random variables X,Y,ZX, Y, Z:

H(X,Y,Z)+H(Y)≀H(X,Y)+H(Y,Z)H(X, Y, Z) + H(Y) \leq H(X, Y) + H(Y, Z)

This is the information-theoretic analog of the triangle inequality.

Entropy in Physics

In statistical mechanics, entropy S=kBln⁑WS = k_B \ln W measures the number of microstates WW consistent with a macrostate. Shannon entropy H=βˆ’βˆ‘plog⁑pH = -\sum p \log p generalizes this to arbitrary probability distributions. The connection: Boltzmann entropy is Shannon entropy when all microstates are equally likely.


Common Confusions

Entropy vs Information

Entropy is the expected information content, not the information content of a single event. A single rare event has high information (βˆ’log⁑p-\log p), but entropy averages over all events weighted by their probabilities.

Entropy vs Randomness

Entropy measures uncertainty, not randomness in the colloquial sense. A fair coin has high entropy (1 bit), while a biased coin (p=0.99p = 0.99) has low entropy (0.08 bits), even though both are "random" in the sense that outcomes are unpredictable.


Cross-References

  • 082 - Mutual Information β€” I(X;Y)=H(X)βˆ’H(X∣Y)I(X;Y) = H(X) - H(X|Y) β€” measures shared information between variables.
  • 083 - KL Divergence: DKL(Pβˆ₯Q)=H(P,Q)βˆ’H(P)D_{KL}(P\|Q) = H(P,Q) - H(P) β€” measures distribution mismatch.
  • 084 - Cross-Entropy β€” H(P,Q)=H(P)+DKL(Pβˆ₯Q)H(P,Q) = H(P) + D_{KL}(P\|Q) β€” used as loss in classification.
  • 085 - Applications β€” Information gain in decision trees, feature selection with MI, VAE loss functions.

Summary

Key Takeaways

  • Shannon Entropy: H(X)=βˆ’βˆ‘xp(x)log⁑2p(x)H(X) = -\sum_{x} p(x) \log_2 p(x) measures the average information content (in bits) of a random variable. Higher entropy means more uncertainty and more information per observation.

  • Entropy Bounds: Entropy is always non-negative H(X)β‰₯0H(X) \geq 0, equals 0 for deterministic outcomes, and is maximized at log⁑2∣X∣\log_2 |X| when all outcomes are equally likely (uniform distribution).

  • Conditional Entropy: H(Y∣X)=βˆ’βˆ‘x,yp(x,y)log⁑p(y∣x)H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x) measures remaining uncertainty about YY after observing XX. Conditioning reduces entropy: H(Y∣X)≀H(Y)H(Y|X) \leq H(Y).

  • Chain Rule: H(X,Y)=H(X)+H(Y∣X)H(X,Y) = H(X) + H(Y|X). The joint entropy of two variables equals the entropy of XX plus the conditional entropy of YY given XX. This decomposes complex systems.

  • Units Matter: Using log⁑2\log_2 gives entropy in bits; using ln⁑\ln gives nats; using log⁑10\log_{10} gives bans/hartleys. Bits are standard in information theory and machine learning.

  • Source Coding Theorem: Shannon proved that the entropy H(X)H(X) is the theoretical minimum average number of bits needed to encode messages from a source. No compression scheme can do better.

  • Practical Application: Entropy drives decision tree splitting (information gain), loss function design (cross-entropy loss in neural networks), compression limits (Shannon's source coding theorem), and feature selection in ML pipelines.

⭐

Premium Content

Entropy

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Mathematics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement