Likelihood for Logistic Regression¶
Setup¶
Given \(n\) observations \((A[i,:],\; y^{(i)})\) where \(y^{(i)}\in\{0,1\}\) and \(A\) is the \(n\times(p+1)\) design matrix (with a column of ones for the intercept), the predicted probability for observation \(i\) is

\[
\sigma^{(i)} \;=\; \sigma\!\left(A[i,:]\,\boldsymbol{\theta}\right) \;=\; \frac{1}{1 + e^{-A[i,:]\,\boldsymbol{\theta}}},
\]

where \(\boldsymbol{\theta}\in\mathbb{R}^{p+1}\) is the parameter vector and \(\sigma(\cdot)\) is the logistic sigmoid.
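As a concrete illustration, here is a minimal NumPy sketch of this setup; the numeric values of `A`, `y`, and `theta` below are made up purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: n = 4 observations, p = 2 features plus an intercept column.
A = np.array([[1.0,  0.5, -1.2],
              [1.0,  2.0,  0.3],
              [1.0, -0.7,  1.1],
              [1.0,  0.1,  0.0]])
y = np.array([1, 1, 0, 0])
theta = np.array([0.2, 1.0, -0.5])   # hypothetical parameter vector

sigma = sigmoid(A @ theta)           # predicted probabilities, one per row of A
```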
Bernoulli Likelihood¶
Each label \(y^{(i)}\) is modeled as a Bernoulli random variable with success probability \(\sigma^{(i)}\). Assuming the observations are independent, the likelihood of the entire dataset is

\[
\mathcal{L}(\boldsymbol{\theta}) \;=\; \prod_{i=1}^{n} \left(\sigma^{(i)}\right)^{y^{(i)}} \left(1 - \sigma^{(i)}\right)^{1 - y^{(i)}}.
\]
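Continuing the sketch above (reusing `sigma` and `y`), the likelihood is a product of \(n\) Bernoulli factors. For realistic \(n\) this product underflows quickly, which is one practical reason to work with its logarithm instead:

```python
# Product of Bernoulli probabilities, one factor per observation.
likelihood = np.prod(sigma**y * (1.0 - sigma)**(1 - y))
```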
Cross-Entropy Loss (Negative Log-Likelihood)¶
Taking the negative logarithm gives the cross-entropy loss (also called the log-loss):

\[
\ell(\boldsymbol{\theta}) \;=\; -\log \mathcal{L}(\boldsymbol{\theta}) \;=\; -\sum_{i=1}^{n} \left[\, y^{(i)} \log \sigma^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \sigma^{(i)}\right) \right].
\]
Minimizing \(\ell\) with respect to \(\boldsymbol{\theta}\) is equivalent to maximizing \(\mathcal{L}\).
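A direct, not yet numerically hardened, translation of the loss, again reusing `sigma` and `y` from the setup sketch:

```python
# Negative log-likelihood / cross-entropy loss, summed over observations.
loss = -np.sum(y * np.log(sigma) + (1 - y) * np.log(1.0 - sigma))
```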
Why Cross-Entropy?¶
The cross-entropy loss has two important properties that make it preferable to, say, squared error for classification:
- Convexity. \(\ell\) is convex in \(\boldsymbol{\theta}\), so every local minimum is a global minimum (spot-checked numerically in the sketch after this list).
- Information-theoretic motivation. The cross-entropy between the true distribution \(y\) and the model distribution \(\hat{y}=\sigma\) measures the expected number of bits needed to encode labels with a code optimized for the model distribution; the excess over the entropy of the true distribution is exactly the additional cost of using the model distribution instead of the true one (made precise below).
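Convexity can be spot-checked numerically: for any two parameter vectors, the loss at their midpoint should not exceed the average of the two losses. The random data and the helper `loss_fn` below are purely illustrative; this is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_fn(theta, A, y):
    """Cross-entropy loss as a function of the parameter vector."""
    s = 1.0 / (1.0 + np.exp(-(A @ theta)))
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1.0 - s))

# Small synthetic dataset: intercept column plus two random features.
A_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_demo = rng.integers(0, 2, size=50)

for _ in range(5):
    t1, t2 = rng.normal(size=3), rng.normal(size=3)
    mid = loss_fn((t1 + t2) / 2, A_demo, y_demo)
    avg = (loss_fn(t1, A_demo, y_demo) + loss_fn(t2, A_demo, y_demo)) / 2
    assert mid <= avg + 1e-9   # midpoint convexity holds for every pair tested
```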
Behavior at the Extremes¶
When \(y^{(i)}=1\) and \(\sigma^{(i)}\to 0\), the term \(-\log\sigma^{(i)}\to+\infty\), heavily penalizing a confident wrong prediction; the same happens symmetrically when \(y^{(i)}=0\) and \(\sigma^{(i)}\to 1\). This unbounded penalty on confident mistakes, contrasted with a penalty near zero for confident correct predictions, is exactly what drives the model toward well-calibrated probabilities.
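A quick numeric illustration of the blow-up for a positive label (\(y^{(i)}=1\)); the printed values use natural logarithms:

```python
import numpy as np

# Penalty -log(sigma) for a positive label as the predicted probability shrinks.
for s in [0.5, 0.1, 0.01, 1e-6]:
    print(f"sigma = {s:>8}: loss = {-np.log(s):.3f}")
# sigma =      0.5: loss = 0.693
# sigma =      0.1: loss = 2.303
# sigma =     0.01: loss = 4.605
# sigma =    1e-06: loss = 13.816
```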
Connection to Information Theory¶
If \(q\) denotes the model distribution and \(p\) the empirical (true) distribution, the cross-entropy is

\[
H(p, q) \;=\; -\sum_{x} p(x) \log q(x).
\]
The cross-entropy decomposes as \(H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)\), where \(H(p)\) is the entropy of the true distribution and \(D_{\mathrm{KL}}\) is the Kullback–Leibler divergence. Since \(H(p)\) is constant with respect to \(\boldsymbol{\theta}\), minimizing cross-entropy is the same as minimizing the KL divergence.
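The decomposition is easy to verify numerically for a small discrete distribution; the two-point distributions `p` and `q` below are arbitrary examples.

```python
import numpy as np

p = np.array([0.7, 0.3])                   # "true" distribution
q = np.array([0.4, 0.6])                   # model distribution

H_p  = -np.sum(p * np.log(p))              # entropy of p
H_pq = -np.sum(p * np.log(q))              # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log(p / q))          # KL divergence D_KL(p || q)

assert np.isclose(H_pq, H_p + D_kl)        # H(p, q) = H(p) + D_KL(p || q)
```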
Numerical Stability¶
In practice a small constant \(\varepsilon\) (e.g. \(10^{-6}\)) is added inside the logarithm to avoid \(\log 0\):

\[
\ell(\boldsymbol{\theta}) \;=\; -\sum_{i=1}^{n} \left[\, y^{(i)} \log\left(\sigma^{(i)} + \varepsilon\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma^{(i)} + \varepsilon\right) \right].
\]
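In the running sketch this could look as follows (the value of `eps` is a typical but arbitrary choice, and `sigma`, `y` come from the setup sketch):

```python
eps = 1e-6
safe_loss = -np.sum(y * np.log(sigma + eps) + (1 - y) * np.log(1.0 - sigma + eps))
```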
Alternatively, many frameworks compute the loss directly from the logits \(z^{(i)} = A[i,:]\,\boldsymbol{\theta}\) using the numerically stable identity:

\[
-\left[\, y^{(i)} \log \sigma\!\left(z^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma\!\left(z^{(i)}\right)\right) \right]
\;=\; \max\!\left(z^{(i)}, 0\right) - y^{(i)} z^{(i)} + \log\!\left(1 + e^{-\left|z^{(i)}\right|}\right).
\]
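A sketch of the logit-based form, with a comparison against the naive computation at an extreme logit where the naive version breaks down; the function names are illustrative, not taken from any particular framework.

```python
import numpy as np

def stable_bce_from_logits(z, y):
    """Per-observation cross-entropy computed from the logit z, avoiding overflow."""
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))

def naive_bce(z, y):
    """Same loss computed through the sigmoid; fails for large |z|."""
    s = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(s) + (1 - y) * np.log(1.0 - s))

z, y_true = 40.0, 0.0
print(stable_bce_from_logits(z, y_true))   # 40.0 -- correct
print(naive_bce(z, y_true))                # inf (with a warning): sigma rounds to 1.0, so log(1 - sigma) = log(0)
```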