Entropy¶
Entropy quantifies the average amount of surprise or uncertainty in a random variable. A fair coin has maximum entropy because each flip is maximally unpredictable, while a heavily biased coin has lower entropy because one outcome is nearly certain. Introduced by Claude Shannon in 1948, entropy provides the theoretical foundation for data compression: it gives the minimum average number of bits needed to encode samples from a distribution. This section defines Shannon entropy, establishes its key properties, and extends the concept to continuous distributions.
Definition¶
Let \(X\) be a discrete random variable taking values in a finite set \(\mathcal{X}\) with probability mass function \(p(x) = P(X = x)\). The Shannon entropy of \(X\) is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$
where the logarithm is base 2 (giving entropy in bits) or base \(e\) (giving entropy in nats). By convention, \(0 \log 0 = 0\), which is justified by the limit \(\lim_{p \to 0^+} p \log p = 0\).
Each term \(-\log p(x)\) represents the information content (or surprise) of outcome \(x\): rare events carry more information than common ones. Entropy averages this surprise over all possible outcomes.
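The definition translates directly into code. The sketch below (the function name `shannon_entropy` is illustrative, not from the text) implements the convention \(0 \log 0 = 0\) by skipping zero-probability outcomes:

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum_x p(x) log p(x); zero-probability terms are
    skipped, implementing the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin: maximally unpredictable, 1 bit of entropy.
print(shannon_entropy([0.5, 0.5]))    # 1.0
# A heavily biased coin: one outcome nearly certain, entropy is low.
print(shannon_entropy([0.99, 0.01]))  # ≈ 0.081
```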
Properties¶
Shannon entropy satisfies several fundamental properties.
Non-negativity. For any discrete random variable \(X\),

$$H(X) \geq 0,$$
with equality if and only if \(X\) is deterministic (one outcome has probability 1). This follows because \(-p(x) \log p(x) \geq 0\) for all \(p(x) \in [0, 1]\).
Maximum entropy. Among all distributions on a finite set \(\mathcal{X}\) with \(|\mathcal{X}| = n\), the uniform distribution \(p(x) = 1/n\) maximizes entropy:

$$H(X) \leq \log n,$$
with equality if and only if \(X\) is uniformly distributed. This result follows from the non-negativity of KL divergence between \(p\) and the uniform distribution.
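The maximum-entropy property is easy to check numerically. In this sketch, the skewed distribution is an arbitrary choice for comparison:

```python
import math

def shannon_entropy(probs):
    # Entropy in bits; zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]  # any non-uniform distribution on 4 outcomes
print(shannon_entropy(uniform))  # log2(4) = 2.0 bits, the maximum
print(shannon_entropy(skewed))   # ≈ 1.357 bits, strictly less
```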
Concavity. Entropy is a concave function of the probability distribution. For any two distributions \(p\) and \(q\) on the same space and \(\lambda \in [0, 1]\),

$$H(\lambda p + (1 - \lambda) q) \geq \lambda H(p) + (1 - \lambda) H(q).$$
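Concavity can also be verified numerically. In the sketch below, the two distributions and the mixing weight are arbitrary choices; mixing them never drops the entropy below the corresponding mixture of their entropies:

```python
import math

def shannon_entropy(probs):
    # Entropy in bits; zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.9, 0.1]
q = [0.2, 0.8]
lam = 0.3
mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

# Concavity: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q)
lhs = shannon_entropy(mix)
rhs = lam * shannon_entropy(p) + (1 - lam) * shannon_entropy(q)
print(lhs >= rhs)  # True
```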
Entropy of a Binary Random Variable
Consider a coin with probability \(p\) of heads and \(1 - p\) of tails. Its entropy is the binary entropy function:

$$H_b(p) = -p \log p - (1 - p) \log(1 - p).$$
At \(p = 1/2\) (fair coin), \(H = \log 2 = 1\) bit. At \(p = 0\) or \(p = 1\) (deterministic outcome), \(H = 0\).
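A minimal sketch of the binary entropy function (in bits; the name `binary_entropy` is illustrative), reproducing the endpoint and fair-coin values above:

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1-p) log2 (1-p), with H_b(0) = H_b(1) = 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: a fair coin
print(binary_entropy(0.9))  # ≈ 0.469 bits: biased, more predictable
print(binary_entropy(1.0))  # 0.0: deterministic
```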
Differential Entropy¶
For a continuous random variable \(X\) with probability density function \(p(x)\), the differential entropy is

$$h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx.$$
Differential entropy shares some properties with its discrete counterpart, but there are important differences:
- Can be negative. For example, if \(X \sim \text{Uniform}(0, a)\) with \(a < 1\), then \(h(X) = \log a < 0\).
- Not invariant under changes of variable. If \(Y = g(X)\) for a smooth bijection \(g\), then \(h(Y) \neq h(X)\) in general. For instance, scaling by a constant \(a > 0\) gives \(h(aX) = h(X) + \log a\).
- Maximum entropy. Among all continuous distributions on \(\mathbb{R}\) with a given mean and variance, the Gaussian distribution maximizes differential entropy.
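The first two points above can be illustrated with the uniform distribution, whose differential entropy has the closed form \(h = \log a\) (a sketch; the function name is illustrative):

```python
import math

def uniform_diff_entropy(a):
    """Differential entropy (in nats) of Uniform(0, a): h = log a."""
    return math.log(a)

# Differential entropy can be negative: Uniform(0, 1/2) has h = log(1/2) < 0.
print(uniform_diff_entropy(0.5))  # ≈ -0.693 nats

# Not invariant under change of variable: scaling X by c > 0 maps
# Uniform(0, a) to Uniform(0, c*a), shifting entropy by log c.
a, c = 0.5, 4.0
print(uniform_diff_entropy(c * a) - uniform_diff_entropy(a))  # log(4) ≈ 1.386
```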
Gaussian Differential Entropy
If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then

$$h(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right).$$
This result shows that differential entropy grows logarithmically with the variance.
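The closed form can be cross-checked by numerically integrating \(-p(x)\log p(x)\) for the Gaussian density (a sketch; the grid bounds and resolution are arbitrary choices, and entropies are in nats):

```python
import math

def gaussian_diff_entropy(sigma):
    """Closed form: h(X) = 0.5 * log(2 * pi * e * sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def numeric_entropy(sigma, mu=0.0, lo=-50.0, hi=50.0, n=200_000):
    """Midpoint-rule approximation of -integral p(x) log p(x) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        if p > 0:  # skip underflowed tail values
            total -= p * math.log(p) * dx
    return total

print(gaussian_diff_entropy(1.0))  # ≈ 1.4189 nats
print(numeric_entropy(1.0))        # agrees closely with the closed form
```

Doubling \(\sigma\) adds \(\log 2\) nats, matching the logarithmic growth in variance noted above.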
Summary¶
Shannon entropy \(H(X) = -\sum_x p(x) \log p(x)\) measures the average uncertainty in a discrete random variable. It is non-negative, maximized by the uniform distribution, and concave. Differential entropy extends this concept to continuous distributions but can be negative and is not invariant under changes of variable.