ROC Curve and AUC¶

Receiver Operating Characteristic (ROC) Curve¶

The ROC curve (Receiver Operating Characteristic) is a powerful tool for evaluating binary classifiers across all possible classification thresholds. It plots:

x-axis: False Positive Rate (FPR) = $1 - \text{Specificity}$
y-axis: True Positive Rate (TPR) = Sensitivity / Recall

Computing the ROC Curve¶

For a classifier that outputs probability scores, we vary the decision threshold $\tau$ from 0 to 1:

For each threshold $\tau$: classify as positive if $\hat{p} \geq \tau$.
Compute FPR and TPR:

$$\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$

Plot the (FPR, TPR) pair.
Repeat for all threshold values; connect the points to form the curve.

Interpretation¶

The ROC curve visualizes the tradeoff between sensitivity and specificity:

(0, 1): Perfect classifier (TPR = 1, FPR = 0)
(0, 0): Threshold so high that we predict positive for nothing
(1, 1): Threshold so low that we predict positive for everything
Main diagonal (y = x): Random classifier with no discrimination ability

A classifier above the diagonal is better than random; one below is worse than random (reverse predictions).

Example: Loan Default ROC Curve¶

For the logistic regression model on loan data:

At threshold = 0.5:
TPR = 14,336 / 22,671 ≈ 0.6323
FPR = 8,148 / 22,671 ≈ 0.3594

Varying the threshold from 0 to 1 produces a curve that typically starts near (0, 0) and ends near (1, 1), bulging upward for a good classifier.

Area Under the Curve (AUC)¶

The AUC is the area under the ROC curve, ranging from 0 to 1:

\[ \text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR}) \]

In practice, AUC is computed numerically using the trapezoidal rule:

\[ \text{AUC} \approx \sum_{i=1}^{m} \frac{\text{TPR}_i + \text{TPR}_{i-1}}{2} \cdot (\text{FPR}_i - \text{FPR}_{i-1}) \]

Interpretation of AUC¶

AUC Value	Interpretation
0.5	Random classifier; no discrimination ability
0.6–0.7	Poor to fair discrimination
0.7–0.8	Acceptable discrimination
0.8–0.9	Excellent discrimination
0.9–1.0	Outstanding discrimination
1.0	Perfect classifier

For our loan default model: AUC ≈ 0.6917, indicating fair to acceptable discrimination.

Probabilistic Interpretation of AUC¶

An elegant interpretation of AUC comes from rank statistics:

AUC = Probability that the model ranks a random positive instance higher than a random negative instance.

In other words, if you randomly sample one default and one non-default loan, AUC is the probability that the model assigns a higher probability to the default. An AUC of 0.5 means the rankings are random; an AUC of 1.0 means the model always ranks positives higher than negatives.

Advantages of AUC¶

Threshold-independent: Summarizes performance across all thresholds in a single number.
Handles class imbalance well: Unlike accuracy, AUC is not biased by imbalanced datasets.
Probabilistic interpretation: Has a clear statistical meaning.
Useful for ranking tasks: Directly applicable to scoring and ranking problems.

Comparison: ROC vs. Precision-Recall¶

Aspect	ROC Curve	PR Curve
Focus	Sensitivity vs. Specificity	Precision vs. Recall
Threshold-independent	Yes	Yes
Class imbalance	Less sensitive to imbalance	More sensitive; better for rare events
Use case	Balanced classes	Imbalanced classes (rare positives)
Metric	AUC (0 to 1)	Average Precision (0 to 1)

For datasets with severe class imbalance (e.g., fraud detection with 1% positives), the precision-recall curve often reveals model behavior more clearly than the ROC curve.

Threshold Selection¶

The ROC curve helps select an optimal threshold for deployment. Common criteria include:

Youden's J Statistic: $J = \text{TPR} - \text{FPR}$; choose the threshold maximizing $J$.
Cost-based: Incorporate misclassification costs: $\min_\tau [c_{FP} \cdot \text{FPR} + c_{FN} \cdot (1 - \text{TPR})]$
Application-specific: Choose based on business requirements (e.g., target a specific recall level).

For the loan default model using Youden's J, the optimal threshold might differ significantly from the default 0.5, depending on the relative costs of false positives and false negatives.