Multinomial Logistic Regression

From Binary to Multiclass
Logistic regression models a binary response. When the response has \(C>2\) categories, we generalize to multinomial logistic regression (also called softmax regression). Instead of a single weight vector \(\boldsymbol{\theta}\), we learn a weight matrix \(\mathbf{W}\) and a bias vector \(\mathbf{b}\) that map each input to a vector of \(C\) real-valued scores (logits), one per class.
Model Architecture (Single Layer)

For a dataset with \(n\) observations and \(p\) features, arranged as a design matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\), the single-layer softmax model computes

\[
\hat{\mathbf{Y}} = \operatorname{softmax}(\mathbf{X}\mathbf{W} + \mathbf{b}), \qquad \mathbf{W}\in\mathbb{R}^{p\times C},\ \mathbf{b}\in\mathbb{R}^{C},
\]

where the softmax is applied row-wise,

\[
\operatorname{softmax}(\mathbf{z})_j = \frac{\exp(z_j)}{\sum_{k=1}^{C}\exp(z_k)},
\]

so that each row of \(\hat{\mathbf{Y}}\) is a probability distribution over the \(C\) classes.
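The single-layer forward pass can be sketched in a few lines of NumPy. This is a minimal illustration, not the text's implementation; the function name `softmax` and the toy shapes (4 observations, 10 classes) are our own choices, and the max-subtraction trick is a standard numerical-stability safeguard.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating; this leaves the
    # result unchanged but avoids overflow for large logits.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, p, C = 4, 784, 10               # toy sizes: 4 observations, 784 features, 10 classes
X = rng.random((n, p))             # design matrix
W = rng.normal(scale=0.01, size=(p, C))
b = np.zeros(C)

Y_hat = softmax(X @ W + b)         # logits -> class probabilities
print(Y_hat.shape)                 # (4, 10); each row sums to 1
```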
Two-Layer Model (Hidden Layer + Softmax)

Adding a hidden layer with the logistic activation gives a shallow neural network, the architecture used in the MNIST examples below:

\[
\mathbf{H} = \sigma\!\left(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right), \qquad
\hat{\mathbf{Y}} = \operatorname{softmax}\!\left(\mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}\right).
\]

The logistic (sigmoid) activation is

\[
\sigma(z) = \frac{1}{1 + e^{-z}},
\]

applied element-wise to the hidden pre-activations.
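The two-layer forward pass chains the pieces above. A sketch under assumed toy shapes (the hidden width of 128 is our choice, not from the text):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, applied element-wise; outputs lie in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # row-wise max subtraction for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, p, h, C = 4, 784, 128, 10               # 128 hidden units (hypothetical)
X = rng.random((n, p))
W1, b1 = rng.normal(scale=0.01, size=(p, h)), np.zeros(h)
W2, b2 = rng.normal(scale=0.01, size=(h, C)), np.zeros(C)

H = sigmoid(X @ W1 + b1)                   # hidden layer, shape (n, h)
Y_hat = softmax(H @ W2 + b2)               # class probabilities, shape (n, C)
```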
MNIST Data

The MNIST dataset of handwritten digits is the canonical benchmark for this model family: 60,000 training images and 10,000 test images, each labeled with one of \(C = 10\) digit classes. Each image is \(28\times 28\) grayscale pixels, flattened to a 784-dimensional vector, and pixel values are scaled to \([0,1]\).
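The flatten-and-scale preprocessing can be shown without downloading anything; the random `images` array below is a stand-in for a batch of MNIST images with integer pixel intensities in \([0, 255]\).

```python
import numpy as np

# Stand-in for a small batch of MNIST images (uint8 intensities 0-255).
rng = np.random.default_rng(2)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image to a 784-vector and scale pixels to [0, 1].
X = images.reshape(len(images), -1).astype(np.float64) / 255.0
print(X.shape)   # (5, 784)
```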
Relationship to Binary Logistic Regression

When \(C=2\), multinomial logistic regression reduces to ordinary logistic regression. The two-class softmax produces the same decision boundary as the sigmoid model because the log-ratio of the class probabilities is linear in the features:

\[
\log\frac{P(y=1\mid\mathbf{x})}{P(y=0\mid\mathbf{x})}
= (\mathbf{w}_1 - \mathbf{w}_0)^{\top}\mathbf{x} + (b_1 - b_0),
\]

where \(\mathbf{w}_0, \mathbf{w}_1\) are the two columns of \(\mathbf{W}\). This matches the binary model with \(\boldsymbol{\theta} = \mathbf{w}_1 - \mathbf{w}_0\); the softmax parameters are identified only up to a common shift.
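The equivalence is easy to verify numerically. A sketch with arbitrary random parameters (all names below are ours): the two-class softmax probability of class 1 equals the sigmoid applied to the difference of the two class scores.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
x = rng.random(5)                            # one input with 5 features
w0, w1 = rng.normal(size=5), rng.normal(size=5)
b0, b1 = 0.2, -0.4

# Two-class softmax probability of class 1 ...
logits = np.array([[x @ w0 + b0, x @ w1 + b1]])
p_softmax = softmax(logits)[0, 1]

# ... equals the sigmoid of the score difference (w1 - w0)^T x + (b1 - b0).
p_sigmoid = sigmoid((w1 - w0) @ x + (b1 - b0))

assert np.isclose(p_softmax, p_sigmoid)
```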