Maximum Likelihood Estimation¶
Gradient of the Cross-Entropy Loss¶
The cross-entropy loss is

\[L(\boldsymbol{\theta}) = -\sum_{i=1}^{n}\left[y^{(i)}\log\sigma^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\sigma^{(i)}\right)\right], \qquad \sigma^{(i)} = \sigma\!\left(\mathbf{x}^{(i)T}\boldsymbol{\theta}\right).\]
Using the sigmoid derivative \(\sigma'(z)=\sigma(z)(1-\sigma(z))\), the gradient with respect to \(\boldsymbol{\theta}\) simplifies to a clean matrix expression.
Element-wise Derivation¶
Differentiating the loss with respect to a single component \(\theta_j\) and applying the chain rule gives

\[\frac{\partial L}{\partial \theta_j} = -\sum_{i=1}^{n}\left[\frac{y^{(i)}}{\sigma^{(i)}} - \frac{1-y^{(i)}}{1-\sigma^{(i)}}\right]\sigma^{(i)}\left(1-\sigma^{(i)}\right)x_j^{(i)}.\]

The sigmoid terms cancel, leaving

\[\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{n}\left(\sigma^{(i)} - y^{(i)}\right)x_j^{(i)}.\]
Matrix Form¶
Writing the residuals as a column vector \(\boldsymbol{\sigma}-\mathbf{y}\) and stacking the design-matrix rows:

\[\nabla_{\boldsymbol{\theta}} L = A^T(\boldsymbol{\sigma}-\mathbf{y}).\]
This is the same form as the gradient for linear regression with squared loss — except the residual \(\hat{\mathbf{y}}-\mathbf{y}\) is replaced by \(\boldsymbol{\sigma}-\mathbf{y}\).
Hessian¶
Differentiating the gradient a second time:

\[\nabla^2_{\boldsymbol{\theta}} L = A^T B A,\]

where \(B\) is the \(n\times n\) diagonal matrix

\[B = \operatorname{diag}\!\left(\sigma^{(1)}\left(1-\sigma^{(1)}\right),\,\ldots,\,\sigma^{(n)}\left(1-\sigma^{(n)}\right)\right).\]
Since every diagonal entry of \(B\) satisfies \(0<\sigma(1-\sigma)\le\tfrac14\), \(B\) is positive definite, so the Hessian \(A^TBA\) is positive semi-definite for every \(\boldsymbol{\theta}\), confirming that the cross-entropy loss is convex.
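As a sanity check, the convexity claim can be verified numerically: at any parameter vector, the eigenvalues of \(A^TBA\) should be non-negative. The random design matrix below is an illustrative assumption, not part of the derivation.

```python
import numpy as np

# Sketch: confirm numerically that the Hessian A^T B A is positive
# semi-definite at a few random parameter vectors.
rng = np.random.default_rng(0)
A = np.column_stack([np.ones(50), rng.normal(size=50)])  # toy design matrix

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5):
    theta = rng.normal(size=2)
    s = sigmoid(A @ theta)
    B = np.diag(s * (1 - s))      # diagonal weight matrix
    H = A.T @ B @ A               # Hessian of the cross-entropy loss
    eigenvalues = np.linalg.eigvalsh(H)
    assert np.all(eigenvalues >= -1e-10)  # PSD up to round-off
```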
Gradient Descent¶
The first-order update is

\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\, A^T(\boldsymbol{\sigma}-\mathbf{y}),\]
where \(\alpha\) is the learning rate.
Newton's Method and IRLS¶
The Newton (second-order) update uses the Hessian:

\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \left(A^TBA\right)^{-1}A^T(\boldsymbol{\sigma}-\mathbf{y}).\]
This can be rewritten as a weighted least-squares normal equation. Define the working response

\[\mathbf{z} = A\boldsymbol{\theta} - B^{-1}(\boldsymbol{\sigma}-\mathbf{y}).\]

Then the update becomes

\[\boldsymbol{\theta} \leftarrow \left(A^TBA\right)^{-1}A^TB\,\mathbf{z}.\]
This is exactly the solution to the weighted least-squares problem \(\min_{\boldsymbol{\theta}}\|B^{1/2}(\mathbf{z}-A\boldsymbol{\theta})\|^2\).
Iteratively Reweighted Least Squares (IRLS)¶
Because the weight matrix \(B\) depends on the current parameter vector \(\boldsymbol{\theta}\) (through \(\boldsymbol{\sigma}\)), we must apply the normal equations iteratively, recomputing \(B\) and \(\mathbf{z}\) at each step. This procedure is known as iteratively reweighted least squares (IRLS) (Rubin, 1983).
IRLS Algorithm
- Initialize \(\boldsymbol{\theta}_0\).
- Compute \(\boldsymbol{\sigma} = \sigma(A\boldsymbol{\theta}_0)\).
- Form \(B = \operatorname{diag}(\sigma^{(i)}(1-\sigma^{(i)}))\).
- Compute working response \(\mathbf{z} = A\boldsymbol{\theta}_0 - B^{-1}(\boldsymbol{\sigma}-\mathbf{y})\).
- Solve \(\boldsymbol{\theta} = (A^TBA)^{-1}A^TB\mathbf{z}\).
- Set \(\boldsymbol{\theta}_0 \leftarrow \boldsymbol{\theta}\) and repeat until convergence.
IRLS typically converges in a small number of iterations and is the algorithm behind many classical logistic regression solvers.
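The loop above can be sketched in a few lines of NumPy. The synthetic data, the fixed iteration count, and the small ridge term added for numerical stability are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def irls(A, y, n_iter=25, ridge=1e-8):
    """Sketch of IRLS for logistic regression on design matrix A.

    `ridge` is a small diagonal term for numerical stability
    (an implementation choice, not part of the derivation).
    """
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        s = sigmoid(A @ theta)
        w = np.clip(s * (1 - s), 1e-12, None)  # diagonal of B, underflow guard
        z = A @ theta - (s - y) / w            # working response
        # Solve the weighted normal equations (A^T B A) theta = A^T B z.
        AtB = A.T * w                          # A^T B without forming B
        theta = np.linalg.solve(AtB @ A + ridge * np.eye(A.shape[1]),
                                AtB @ z)
    return theta

# Toy example: one feature plus an intercept column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < sigmoid(0.5 + 2.0 * x)).astype(float)
A = np.column_stack([np.ones_like(x), x])
theta = irls(A, y)
```

At convergence the gradient \(A^T(\boldsymbol{\sigma}-\mathbf{y})\) evaluated at the returned `theta` is essentially zero, reflecting Newton's fast local convergence.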
Implementation: Logistic Regression with Gradient Descent¶
The following NumPy implementation trains logistic regression from scratch using first-order gradient descent.
Data Loading¶
import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(seed=1):
    url = ('https://raw.githubusercontent.com/codebasics/py/'
           'master/ML/7_logistic_reg/insurance_data.csv')
    df = pd.read_csv(url)
    x = df[['age']].values.reshape((-1, 1))
    y = df.bought_insurance.values.reshape((-1,))
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, train_size=0.5, random_state=seed)
    return x_train, x_test, y_train, y_test
Model Class¶
import numpy as np

class LogisticRegression:
    def __init__(self, x, y, lr=2e-4, epochs=100_000, theta=None):
        self.x = x
        self.y = y
        self.lr = lr
        self.epochs = epochs
        self.theta = (theta if theta is not None
                      else np.random.normal(size=(x.shape[1] + 1, 1)))

    @staticmethod
    def design_matrix(x):
        # Prepend a column of ones for the intercept term.
        ones = np.ones((x.shape[0], 1))
        return np.concatenate((ones, x), axis=1)

    @staticmethod
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def predict_proba(self, x):
        A = self.design_matrix(x)
        z = A @ self.theta
        return self.sigmoid(z).reshape((-1,))

    def predict(self, x):
        p = self.predict_proba(x)
        return (p > 0.5).astype(float)

    def loss(self):
        # Cross-entropy loss; eps guards against log(0).
        p = self.predict_proba(self.x)
        eps = 1e-6
        return -np.mean(
            self.y * np.log(p + eps) + (1 - self.y) * np.log(1 - p + eps))

    def gradient(self):
        # Matrix-form gradient A^T (sigma - y).
        A = self.design_matrix(self.x)
        p = self.predict_proba(self.x).reshape((-1, 1))
        y = self.y.reshape((-1, 1))
        return A.T @ (p - y)

    def train(self):
        # Plain first-order gradient descent.
        for _ in range(self.epochs):
            self.theta -= self.lr * self.gradient()
Training¶
x_train, x_test, y_train, y_test = load_data()
model = LogisticRegression(x_train, y_train)
model.train()
y_pred = model.predict(x_test)
y_prob = model.predict_proba(x_test)
Implementation: Logistic Regression with Scikit-Learn¶
For comparison, the same task using sklearn:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
y_prob = model.predict_proba(x_test)[:, 1]
Scikit-learn's LogisticRegression uses the L-BFGS quasi-Newton solver by default, which builds a low-rank approximation to the Hessian instead of forming and inverting it explicitly. The solver='newton-cg' option instead takes Newton steps, solving each Newton system by conjugate gradient, which brings it close to the IRLS update described above.
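One caveat when comparing against the from-scratch model: scikit-learn applies L2 regularization by default (C=1.0), so its coefficients will not exactly match an unregularized maximum-likelihood fit. A sketch of the effect on synthetic data; the data and the true parameters (0.5, 2.0) are illustrative assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D logistic data with known parameters (0.5, 2.0).
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = (rng.random(300) < p).astype(int)

# With a very large C the penalty is negligible, so the fit approaches
# the unregularized maximum-likelihood solution.
weak = LogisticRegression(solver='lbfgs', C=1e6).fit(x, y)
default = LogisticRegression(solver='lbfgs').fit(x, y)  # C=1.0, L2 penalty
# The default fit shrinks the slope toward zero relative to the
# weak-penalty fit.
```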