Variation of Information¶
Mental Model
Variation of information measures the distance between two clusterings using information theory: how many bits you lose going from one clustering to another, plus how many you gain. It is a true metric (unlike mutual information), and computing it efficiently on large datasets requires sparse matrices because the joint distribution over cluster pairs is almost always sparse.
Background¶
The variation of information (VI) is a metric on the space of clusterings, introduced by Marina Meila. It measures how much information is lost and gained when moving from one clustering to another. Unlike many clustering comparison measures, VI is a true metric: it satisfies non-negativity, symmetry, and the triangle inequality.
VI is defined in terms of conditional entropies:
where \(H(X \mid Y)\) is the conditional entropy of clustering \(X\) given clustering \(Y\). When two clusterings are identical, \(\mathrm{VI} = 0\).
Computing VI efficiently for large datasets requires sparse matrix operations because the joint probability matrix between cluster labels is typically very sparse.
Objective¶
This example demonstrates how to:
- build sparse matrices from cluster assignment data using COO format
- convert between sparse formats (COO to CSR)
- compute marginal and conditional probabilities from sparse joint distributions
- use
sparse.diagsfor efficient diagonal matrix operations - implement the variation of information metric using sparse matrix arithmetic
Setup¶
Given two clusterings \(X\) and \(Y\) of \(n\) data points, each point receives a cluster label. The joint distribution \(p(x, y)\) counts how many points have label \(x\) in the first clustering and label \(y\) in the second, then normalizes by \(n\). This joint distribution is naturally sparse because most cluster pairs have no shared points.
The marginal distributions are:
The conditional entropy \(H(Y \mid X)\) is:
where \(p(y \mid x) = p(x, y) / p(x)\).
Code¶
Building Sparse Matrices from Coordinate Data¶
The COO (Coordinate) format stores entries as (row, col, data) triples,
making it the natural choice for constructing a joint distribution from
paired cluster labels.
```python import numpy as np from scipy import sparse
x = np.array([1, 0, 2, 1, 1, 2]) # cluster labels from first clustering y = np.array([0, 1, 1, 2, 0, 2]) # cluster labels from second clustering data = np.ones(len(x)) # count 1 for each data point
coo = sparse.coo_matrix((data, (x, y)), shape=(3, 3)) print(coo) ```
(1, 0) 1.0
(0, 1) 1.0
(2, 1) 1.0
(1, 2) 1.0
(1, 0) 1.0
(2, 2) 1.0
Converting to CSR and Normalizing¶
CSR (Compressed Sparse Row) format supports efficient arithmetic operations. Converting and normalizing produces the joint probability distribution.
python
csr = coo.tocsr()
pxy = csr.copy()
pxy.data /= np.sum(pxy.data)
print("Sum:", pxy.sum()) # 1.0
Computing Marginal Probabilities¶
Row sums give \(p(x)\) and column sums give \(p(y)\).
python
px = np.array(pxy.sum(axis=1)).ravel() # p(x): sum over y
py = np.array(pxy.sum(axis=0)).ravel() # p(y): sum over x
print("p(x):", px) # [0.1667, 0.5, 0.3333]
print("p(y):", py) # [0.3333, 0.3333, 0.3333]
Diagonal Matrices and Safe Operations¶
The sparse.diags function creates diagonal matrices efficiently. Two helper
functions handle the numerical edge cases where \(0 \cdot \log 0 = 0\) by
convention and division by zero must be avoided.
```python def safe_inverse(arr): """Invert array elements, leaving zeros as zeros.""" result = arr.copy() nz = np.nonzero(arr) result[nz] = 1 / arr[nz] return result
def xlogx(arr): """Compute x * log2(x), treating 0 * log(0) = 0.""" result = arr.copy() nz = np.nonzero(arr) result[nz] = arr[nz] * np.log2(arr[nz]) return result
px_inv = sparse.diags([safe_inverse(px)], [0]) py_inv = sparse.diags([safe_inverse(py)], [0]) ```
Variation of Information¶
Combining all the pieces, the VI computation uses sparse matrix products to compute conditional probabilities and entropies.
```python def variation_of_information(x, y): """ Compute Variation of Information between two clusterings.
VI(X;Y) = H(X|Y) + H(Y|X)
Parameters
----------
x, y : array-like
Cluster assignment arrays of the same length.
Returns
-------
float
VI score. Lower is better; 0 means identical clusterings.
"""
pxy = sparse.coo_matrix(
(np.ones(x.size), (x.ravel(), y.ravel())), dtype=float
).tocsr()
pxy.data /= np.sum(pxy.data)
px = np.array(pxy.sum(axis=1)).ravel()
py = np.array(pxy.sum(axis=0)).ravel()
px_inv = sparse.diags([safe_inverse(px)], [0])
py_inv = sparse.diags([safe_inverse(py)], [0])
# H(Y|X) via p(y|x) = diag(1/p(y)) @ p(x,y)
hygx = -(px * xlogx(py_inv.dot(pxy)).sum(axis=1)).sum()
# H(X|Y) via p(x|y) = p(x,y) @ diag(1/p(x))
hxgy = -(py * xlogx(pxy.dot(px_inv)).sum(axis=1)).sum()
return hygx + hxgy
c1 = np.array([0, 0, 1, 1, 2, 2]) c2 = np.array([0, 0, 1, 1, 2, 2]) c3 = np.array([0, 1, 0, 1, 0, 1])
print("VI(identical):", variation_of_information(c1, c2)) # ~0 print("VI(different):", variation_of_information(c1, c3)) # > 0 ```
VI(identical): 0.0
VI(different): 1.459
Interpretation¶
The variation of information captures two complementary types of mismatch between clusterings:
- \(H(X \mid Y)\) measures the information about \(X\) that is not explained by \(Y\). If \(Y\) perfectly predicts \(X\), this term is zero.
- \(H(Y \mid X)\) measures the information about \(Y\) that is not explained by \(X\).
When both conditional entropies are zero, the clusterings are identical and \(\mathrm{VI} = 0\). A larger VI indicates more disagreement.
Key Insight
Unlike the Rand index or mutual information, VI is a true metric on the space of clusterings. This means you can use it to define distances between clusterings in algorithms that require metric properties, such as computing medians or building clustering dendrograms.
Statistical Insight¶
The sparse matrix approach is essential for scalability. If there are \(k_1\) clusters in \(X\) and \(k_2\) clusters in \(Y\), the joint distribution matrix has shape \(k_1 \times k_2\). For fine-grained clusterings of large datasets, most entries are zero because most cluster pairs share no points. Sparse storage reduces memory from \(O(k_1 k_2)\) to \(O(\min(n, k_1 k_2))\) and sparse arithmetic avoids wasted computation on zero entries.
Common Mistake
Do not use np.log on sparse matrix data without handling zeros. The
convention \(0 \log 0 = 0\) must be enforced explicitly, otherwise NaN
values propagate through the computation.
The key sparse patterns used here generalize to many information-theoretic computations:
| Pattern | Sparse operation |
|---|---|
| Normalize to probabilities | Divide .data by total sum |
| Compute marginals | sum(axis=0) or sum(axis=1) |
| Conditional probabilities | Multiply by sparse.diags(1/marginal) |
| Element-wise \(x \log x\) | Apply only to .data (non-zero entries) |
Extensions¶
- Compute the normalized variation of information \(\mathrm{NVI} = \mathrm{VI} / \log n\) to obtain a value between 0 and 1
- Implement mutual information \(I(X; Y) = H(X) + H(Y) - H(X, Y)\) using the same sparse framework
- Compare VI against the adjusted Rand index on synthetic clustering benchmarks
- Scale the implementation to clusterings with millions of points and observe the memory advantage of sparse formats
Exercises¶
Exercise 1. Given the joint probability matrix below in dense form, compute \(H(X \mid Y)\) by hand using base-2 logarithms.
| \(y=0\) | \(y=1\) | |
|---|---|---|
| \(x=0\) | 0.25 | 0.25 |
| \(x=1\) | 0.125 | 0.375 |
Solution to Exercise 1
First compute \(p(y)\): \(p(y=0) = 0.375\), \(p(y=1) = 0.625\).
Conditional distribution \(p(x \mid y=0)\): \(p(x=0 \mid y=0) = 0.25/0.375 = 2/3\), \(p(x=1 \mid y=0) = 0.125/0.375 = 1/3\).
Conditional distribution \(p(x \mid y=1)\): \(p(x=0 \mid y=1) = 0.25/0.625 = 2/5\), \(p(x=1 \mid y=1) = 0.375/0.625 = 3/5\).
Exercise 2. Prove that \(\mathrm{VI}(X; Y) = 0\) if and only if \(X = Y\) (i.e., the two clusterings assign the same label to every point, up to relabeling).
Solution to Exercise 2
\(\mathrm{VI}(X; Y) = H(X \mid Y) + H(Y \mid X)\). Since conditional entropy is non-negative, \(\mathrm{VI} = 0\) requires both \(H(X \mid Y) = 0\) and \(H(Y \mid X) = 0\).
\(H(X \mid Y) = 0\) means that \(X\) is a deterministic function of \(Y\): knowing the \(Y\)-label of a point completely determines its \(X\)-label. Similarly, \(H(Y \mid X) = 0\) means \(Y\) is a deterministic function of \(X\).
If \(X\) is a function of \(Y\) and \(Y\) is a function of \(X\), then the two clusterings define the same partition of the data (they may use different label names, but the grouping is identical). Conversely, if \(X = Y\) up to relabeling, then both conditional entropies are zero. \(\square\)
Exercise 3. Write a function that computes the mutual information \(I(X; Y)\) using the same sparse matrix framework. Verify that \(I(X; Y) = H(X) - H(X \mid Y)\).
Solution to Exercise 3
```python def mutual_information(x, y): pxy = sparse.coo_matrix( (np.ones(x.size), (x.ravel(), y.ravel())), dtype=float ).tocsr() pxy.data /= np.sum(pxy.data)
px = np.array(pxy.sum(axis=1)).ravel()
py = np.array(pxy.sum(axis=0)).ravel()
# H(X)
hx = -np.sum(xlogx(px))
# H(Y)
hy = -np.sum(xlogx(py))
# H(X, Y) from joint distribution
hxy = -np.sum(xlogx(pxy.data))
return hx + hy - hxy
c1 = np.array([0, 0, 1, 1, 2, 2]) c2 = np.array([0, 0, 1, 1, 2, 2]) print(mutual_information(c1, c2)) # H(X), since X = Y ```
To verify \(I(X; Y) = H(X) - H(X \mid Y)\): from the definitions,
and \(H(X \mid Y) = H(X, Y) - H(Y)\), so \(H(X) - H(X \mid Y) = H(X) - H(X, Y) + H(Y) = I(X; Y)\).
Exercise 4. Explain why COO format is preferred for constructing the joint distribution matrix, while CSR format is preferred for the subsequent arithmetic. Under what conditions would CSC format be more efficient than CSR for the VI computation?
Solution to Exercise 4
COO for construction: COO stores each entry as an independent
(row, col, value) triple, so duplicate entries (multiple data points with
the same cluster pair) are simply appended. The conversion to CSR then
sums duplicates automatically. Building directly in CSR would require
pre-sorting or incremental insertion, both of which are slower.
CSR for arithmetic: CSR stores rows contiguously, making row-based
operations (row slicing, row sums via sum(axis=1), and left-multiplication
by diagonal matrices) efficient. The VI computation relies heavily on these
row operations.
When CSC would be better: if the computation were restructured to
primarily use column operations (column sums via sum(axis=0), right-multiplication
by diagonal matrices), CSC would be faster. In the current VI implementation,
column sums are computed once for \(p(y)\), but the dominant cost is in the
matrix products, which favor CSR when the left factor is the sparse joint
distribution.