Cross-Entropy Loss¶
Definition¶
For \(n\) observations with \(C\) classes, the categorical cross-entropy loss is

\[
J = -\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic}\,\log \hat{y}_{ic},
\]

where \(\mathbf{Y}\) is the \(n\times C\) one-hot label matrix and \(\hat{\mathbf{Y}}\) is the \(n\times C\) matrix of predicted probabilities from the softmax.
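As a quick sanity check of the formula, here is a minimal NumPy sketch; the helper names softmax and cross_entropy are illustrative, not from any library:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of an (n, C) logit matrix."""
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))  # shift rows for stability
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(Y, Y_hat, eps=1e-12):
    """J = -sum_i sum_c y_ic * log(y_hat_ic) for one-hot Y."""
    return -np.sum(Y * np.log(Y_hat + eps))

# n = 2 observations, C = 3 classes
Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])          # logits Z^o
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])          # one-hot labels
print(cross_entropy(Y, softmax(Z)))      # total loss over both rows
```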
Derivation of the Gradient \(\partial J/\partial \mathbf{Z}^o\)¶
This gradient is the starting point of backpropagation through the softmax layer and has a beautifully simple form.
Step 1 — Rewrite the Loss¶
Since \(\hat{y}_{ic} = e^{z_{ic}^o}\big/\sum_{c'}e^{z_{ic'}^o}\):

\[
J = -\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic}\left(z_{ic}^o - \log\sum_{c'=1}^{C} e^{z_{ic'}^o}\right).
\]

Using \(\sum_c y_{ic}=1\) (one-hot), the log-partition term appears exactly once per observation:

\[
J = \sum_{i=1}^{n}\left(\log\sum_{c'=1}^{C} e^{z_{ic'}^o} - \sum_{c=1}^{C} y_{ic}\,z_{ic}^o\right).
\]
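The rewrite is easy to verify numerically; this self-contained sketch compares the original form \(-\sum y_{ic}\log\hat{y}_{ic}\) against the rewritten one on random logits (naive exponentials are fine at this scale):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 5))                    # logits for n=4, C=5
Y = np.eye(5)[rng.integers(0, 5, size=4)]      # random one-hot labels

P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax probabilities
lhs = -np.sum(Y * np.log(P))                           # original form of J
rhs = np.sum(np.log(np.exp(Z).sum(axis=1)) - np.sum(Y * Z, axis=1))  # rewritten form
print(np.allclose(lhs, rhs))                           # True
```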
Step 2 — Differentiate¶

Differentiating with respect to a single logit \(z_{ik}^o\), only the \(i\)-th observation contributes:

\[
\frac{\partial J}{\partial z_{ik}^o}
= \frac{e^{z_{ik}^o}}{\sum_{c'} e^{z_{ic'}^o}} - y_{ik}
= \hat{y}_{ik} - y_{ik}.
\]

Matrix Form¶

Collecting all \(n\times C\) partial derivatives:

\[
\frac{\partial J}{\partial \mathbf{Z}^o} = \hat{\mathbf{Y}} - \mathbf{Y}.
\]
This is the same "prediction minus target" residual that appears in binary logistic regression — the softmax + cross-entropy combination produces a clean gradient regardless of the number of classes.
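One way to confirm the closed form is against automatic differentiation. A sketch using PyTorch, where reduction='sum' matches the unnormalized loss above:

```python
import torch
import torch.nn.functional as F

Z = torch.randn(4, 5, requires_grad=True)          # logits Z^o
targets = torch.randint(0, 5, (4,))                # class indices
Y = F.one_hot(targets, num_classes=5).float()      # one-hot label matrix Y

loss = F.cross_entropy(Z, targets, reduction='sum')
loss.backward()                                    # autograd gradient in Z.grad

analytic = (torch.softmax(Z, dim=1) - Y).detach()  # closed form Y_hat - Y
print(torch.allclose(Z.grad, analytic, atol=1e-6)) # True
```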
Relationship to KL Divergence¶
The cross-entropy decomposes as

\[
H(\mathbf{y}_i, \hat{\mathbf{y}}_i) = H(\mathbf{y}_i) + D_{\mathrm{KL}}\!\left(\mathbf{y}_i \,\|\, \hat{\mathbf{y}}_i\right).
\]

For one-hot labels \(H(\mathbf{y}_i)=0\), so minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions.
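Because one-hot labels make the entropy term vanish, the decomposition is easier to see with soft labels. A small NumPy check of \(H(p,q)=H(p)+D_{\mathrm{KL}}(p\,\|\,q)\) on an illustrative distribution:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # "true" distribution (soft labels)
q = np.array([0.5, 0.3, 0.2])        # predicted distribution

H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
H_p = -np.sum(p * np.log(p))         # entropy H(p)
KL = np.sum(p * np.log(p / q))       # KL divergence D_KL(p || q)
print(np.allclose(H_pq, H_p + KL))   # True
```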
Numerical Stability¶
In practice the loss is computed from the logits \(\mathbf{z}\) directly using the log-sum-exp trick:

\[
\log\sum_{c} e^{z_{ic}} = m_i + \log\sum_{c} e^{z_{ic}-m_i},
\]

where \(m_i=\max_c z_{ic}\). Subtracting the row maximum keeps every exponent at or below zero, which avoids overflow in the exponentials and loss of precision in the logarithm. PyTorch's nn.CrossEntropyLoss and TensorFlow's tf.nn.softmax_cross_entropy_with_logits implement this automatically.
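A minimal NumPy sketch of the trick, contrasted with the naive computation that overflows on large logits (the helper name is illustrative):

```python
import numpy as np

def ce_from_logits(Z, Y):
    """Cross-entropy from logits via the log-sum-exp trick."""
    m = Z.max(axis=1, keepdims=True)                   # m_i = max_c z_ic
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))  # stable log-sum-exp
    return np.sum(lse - np.sum(Y * Z, axis=1))

Z = np.array([[1000.0, 0.0, -1000.0]])   # extreme logits
Y = np.array([[1.0, 0.0, 0.0]])

# Naive route: exp(1000) overflows to inf, and the loss comes out nan.
naive = -np.sum(Y * np.log(np.exp(Z) / np.exp(Z).sum()))
print(naive)                  # nan (plus an overflow warning)
print(ce_from_logits(Z, Y))   # ~0.0, the correct value
```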