Entropy¶
Entropy quantifies the average amount of surprise or uncertainty in a random variable. A fair coin has maximum entropy because each flip is maximally unpredictable, while a heavily biased coin has lower entropy because one outcome is nearly certain. Introduced by Claude Shannon in 1948, entropy provides the theoretical foundation for data compression: it gives the minimum average number of bits needed to encode samples from a distribution. This section defines Shannon entropy, establishes its key properties, and extends the concept to continuous distributions.
Definition¶
Let \(X\) be a discrete random variable taking values in a finite set \(\mathcal{X}\) with probability mass function \(p(x) = P(X = x)\). The Shannon entropy of \(X\) is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$
where the logarithm is base 2 (giving entropy in bits) or base \(e\) (giving entropy in nats). By convention, \(0 \log 0 = 0\), which is justified by the limit \(\lim_{p \to 0^+} p \log p = 0\).
Each term \(-\log p(x)\) represents the information content (or surprise) of outcome \(x\): rare events carry more information than common ones. Entropy averages this surprise over all possible outcomes.
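The definition translates directly into code. The sketch below (the function name `shannon_entropy` is illustrative, not from the text) implements the convention \(0 \log 0 = 0\) by skipping zero-probability outcomes:

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum_x p(x) log p(x); zero-probability terms are
    skipped, implementing the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin: maximally unpredictable, 1 bit of entropy.
print(shannon_entropy([0.5, 0.5]))    # 1.0
# A heavily biased coin: one outcome nearly certain, entropy is low.
print(shannon_entropy([0.99, 0.01]))  # ≈ 0.081
```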
Properties¶
Shannon entropy satisfies several fundamental properties.
Non-negativity. For any discrete random variable \(X\),

$$H(X) \geq 0,$$
with equality if and only if \(X\) is deterministic (one outcome has probability 1). This follows because \(-p(x) \log p(x) \geq 0\) for all \(p(x) \in [0, 1]\).
Maximum entropy. Among all distributions on a finite set \(\mathcal{X}\) with \(|\mathcal{X}| = n\), the uniform distribution \(p(x) = 1/n\) maximizes entropy:

$$H(X) \leq \log n,$$
with equality if and only if \(X\) is uniformly distributed. This result follows from the non-negativity of KL divergence between \(p\) and the uniform distribution.
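The maximum-entropy property is easy to check numerically. In this sketch, the skewed distribution is an arbitrary choice for comparison:

```python
import math

def shannon_entropy(probs):
    # Entropy in bits; zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]  # any non-uniform distribution on 4 outcomes
print(shannon_entropy(uniform))  # log2(4) = 2.0 bits, the maximum
print(shannon_entropy(skewed))   # ≈ 1.357 bits, strictly less
```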
Concavity. Entropy is a concave function of the probability distribution. For any two distributions \(p\) and \(q\) on the same space and \(\lambda \in [0, 1]\),

$$H(\lambda p + (1 - \lambda) q) \geq \lambda H(p) + (1 - \lambda) H(q).$$
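Concavity can also be verified numerically. In the sketch below, the two distributions and the mixing weight are arbitrary choices; mixing them never drops the entropy below the corresponding mixture of their entropies:

```python
import math

def shannon_entropy(probs):
    # Entropy in bits; zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.9, 0.1]
q = [0.2, 0.8]
lam = 0.3
mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

# Concavity: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q)
lhs = shannon_entropy(mix)
rhs = lam * shannon_entropy(p) + (1 - lam) * shannon_entropy(q)
print(lhs >= rhs)  # True
```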
Entropy of a Binary Random Variable
Consider a coin with probability \(p\) of heads and \(1 - p\) of tails. Its entropy is the binary entropy function:

$$H_b(p) = -p \log p - (1 - p) \log(1 - p).$$
At \(p = 1/2\) (fair coin), \(H = \log 2 = 1\) bit. At \(p = 0\) or \(p = 1\) (deterministic outcome), \(H = 0\).
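A minimal sketch of the binary entropy function (in bits; the name `binary_entropy` is illustrative), reproducing the endpoint and fair-coin values above:

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1-p) log2 (1-p), with H_b(0) = H_b(1) = 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: a fair coin
print(binary_entropy(0.9))  # ≈ 0.469 bits: biased, more predictable
print(binary_entropy(1.0))  # 0.0: deterministic
```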
Differential Entropy¶
For a continuous random variable \(X\) with probability density function \(p(x)\), the differential entropy is

$$h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx.$$
Differential entropy shares some properties with its discrete counterpart, but there are important differences:
- Can be negative. For example, if \(X \sim \text{Uniform}(0, a)\) with \(a < 1\), then \(h(X) = \log a < 0\).
- Not invariant under changes of variable. If \(Y = g(X)\) for a smooth bijection \(g\), then \(h(Y) \neq h(X)\) in general. For instance, scaling by a constant \(a > 0\) gives \(h(aX) = h(X) + \log a\).
- Maximum entropy. Among all continuous distributions on \(\mathbb{R}\) with a given mean and variance, the Gaussian distribution maximizes differential entropy.
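The first two points above can be illustrated with the uniform distribution, whose differential entropy has the closed form \(h = \log a\) (a sketch; the function name is illustrative):

```python
import math

def uniform_diff_entropy(a):
    """Differential entropy (in nats) of Uniform(0, a): h = log a."""
    return math.log(a)

# Differential entropy can be negative: Uniform(0, 1/2) has h = log(1/2) < 0.
print(uniform_diff_entropy(0.5))  # ≈ -0.693 nats

# Not invariant under change of variable: scaling X by c > 0 maps
# Uniform(0, a) to Uniform(0, c*a), shifting entropy by log c.
a, c = 0.5, 4.0
print(uniform_diff_entropy(c * a) - uniform_diff_entropy(a))  # log(4) ≈ 1.386
```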
Gaussian Differential Entropy
If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then

$$h(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right).$$
This result shows that differential entropy grows logarithmically with the variance.
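The closed form can be cross-checked by numerically integrating \(-p(x)\log p(x)\) for the Gaussian density (a sketch; the grid bounds and resolution are arbitrary choices, and entropies are in nats):

```python
import math

def gaussian_diff_entropy(sigma):
    """Closed form: h(X) = 0.5 * log(2 * pi * e * sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def numeric_entropy(sigma, mu=0.0, lo=-50.0, hi=50.0, n=200_000):
    """Midpoint-rule approximation of -integral p(x) log p(x) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        if p > 0:  # skip underflowed tail values
            total -= p * math.log(p) * dx
    return total

print(gaussian_diff_entropy(1.0))  # ≈ 1.4189 nats
print(numeric_entropy(1.0))        # agrees closely with the closed form
```

Doubling \(\sigma\) adds \(\log 2\) nats, matching the logarithmic growth in variance noted above.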
Summary¶
Shannon entropy \(H(X) = -\sum_x p(x) \log p(x)\) measures the average uncertainty in a discrete random variable. It is non-negative, maximized by the uniform distribution, and concave. Differential entropy extends this concept to continuous distributions but can be negative and is not invariant under changes of variable.