
Supervised Learning (Prediction with Labels)

Overview

In supervised learning, the model is trained on a labeled dataset: each input \(x\) is paired with a known output \(y\). The goal is for the model to learn the mapping \(f: X \to Y\) and make accurate predictions on new, unseen data.

Key Characteristics

  • Labeled data: The training set consists of input–output pairs \(\{(x_i, y_i)\}_{i=1}^{n}\).
  • Clear objective: Minimize a loss function that measures the discrepancy between predictions \(\hat{y}_i\) and true labels \(y_i\) (a minimal sketch follows this list).
  • Evaluation is straightforward: Accuracy, MSE, AUC, and other metrics can be computed on held-out test data.
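
To make the loss-minimization objective concrete, here is a minimal sketch with made-up numbers, computing two common losses by hand: mean squared error for regression and log loss (binary cross-entropy) for classification.

import numpy as np

# Hypothetical labels and predictions, purely for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: average squared discrepancy between y_hat and y
mse = np.mean((y_true - y_hat) ** 2)
print(f"MSE: {mse:.3f}")  # 0.375

# Log loss for predicted class probabilities p = P(y = 1)
p = np.array([0.9, 0.2, 0.8])
y = np.array([1, 0, 1])
ll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"Log loss: {ll:.3f}")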

Two Main Tasks

Regression

The target variable \(Y\) is continuous. The goal is to predict a numerical value.

\[ \hat{y} = f(x) \quad \text{where } y \in \mathbb{R} \]

Examples:

  • Predicting house prices from features (size, location, age).
  • Forecasting next-quarter revenue from macroeconomic indicators.
  • Estimating option prices from underlying asset characteristics.

Common methods: Linear regression, Ridge/LASSO, decision trees, random forests, gradient boosting, neural networks.
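
As a quick illustration of the regression workflow, a minimal sketch that fits a linear model to simulated house-price data (the features, coefficients, and noise level are invented for the example):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulated data: price driven by size (sq ft) and age (years), plus noise
n = 500
size = rng.normal(1500, 400, n)
age = rng.uniform(0, 50, n)
price = 50_000 + 120 * size - 800 * age + rng.normal(0, 20_000, n)

X = np.column_stack([size, age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(f"Test RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:,.0f}")
print(f"Coefficients: {reg.coef_.round(1)}")  # should roughly recover [120, -800]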

Classification

The target variable \(Y\) is categorical. The goal is to assign an input to one of \(K\) classes.

\[ \hat{y} = f(x) \quad \text{where } y \in \{1, 2, \ldots, K\} \]

Examples:

  • Classifying emails as spam or not spam (binary).
  • Recognizing handwritten digits 0–9 (multiclass).
  • Predicting whether a borrower will default on a loan (binary).

Common methods: Logistic regression, support vector machines, decision trees, random forests, gradient boosting, neural networks.
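
The same fit/predict pattern extends directly to multiclass problems. A minimal sketch on scikit-learn's built-in handwritten-digits dataset (\(K = 10\) classes):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 8x8 grayscale digit images flattened to 64 features, labels 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Multinomial logistic regression handles K > 2 classes directly
clf = LogisticRegression(max_iter=5_000)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")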

The Supervised Learning Workflow

Training Data {(x_i, y_i)}
        │
        ▼
  Choose Model Family
        │
        ▼
  Train (minimize loss)
        │
        ▼
  Validate (tune hyperparameters)
        │
        ▼
  Test (evaluate on held-out data)
        │
        ▼
  Deploy for Prediction
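
The validate step deserves emphasis: hyperparameters are tuned on data the final test set never sees. A sketch using cross-validated grid search (the synthetic data and the grid of C values are arbitrary choices for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labeled training data
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Validate: tune the regularization strength C by 5-fold cross-validation
search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Test: evaluate the tuned model once, on held-out data
print(f"Best C: {search.best_params_['C']}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")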

Example: Predicting Default

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)

# Simulate data: income and debt-to-income ratio → default (0/1)
n = 1_000
income = np.random.normal(60, 20, n).clip(10)         # income, floored at 10
dti = np.random.normal(0.3, 0.15, n).clip(0.01, 1.0)  # debt-to-income ratio, bounded
# True relationship: lower income and higher DTI raise the log-odds of default
log_odds = -3 + 0.01 * (50 - income) + 5 * (dti - 0.3)
prob = 1 / (1 + np.exp(-log_odds))                    # logistic link: log-odds → probability
default = np.random.binomial(1, prob)                 # draw binary outcomes

X = np.column_stack([income, dti])
y = default

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=["No Default", "Default"]))
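
Credit decisions usually hinge on the predicted default probability rather than the hard 0/1 label. Continuing the example above, a short sketch; the 10% review threshold is an arbitrary choice:

# Predicted probability of default for each borrower in the test set
prob_default = model.predict_proba(X_test)[:, 1]
print(f"Mean predicted default probability: {prob_default.mean():.3f}")

# Decisions can use a custom threshold instead of the default 0.5,
# e.g. flag any applicant above a 10% default probability for review
flagged = prob_default > 0.10
print(f"Share of applicants flagged: {flagged.mean():.1%}")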

Key Takeaways

  • Supervised learning requires labeled data and optimizes a well-defined loss function.
  • The two main tasks are regression (continuous target) and classification (categorical target).
  • Model evaluation on held-out data is essential to assess generalization.
  • In finance, supervised learning powers credit scoring, algorithmic trading signals, fraud detection, and many other applications.