Handling Imbalanced Data¶
The Problem: Class Imbalance¶
In many real-world classification problems, the classes are not equally represented. Examples include:
- Loan default: Default rate ~5-20%, most loans pay off
- Fraud detection: Frauds typically <1% of all transactions
- Disease diagnosis: Rare diseases <5% prevalence
- Spam detection: Spam usually <10-20% of email
A naive classifier trained on imbalanced data tends to predict the majority class too often, achieving high accuracy while poorly identifying the rare (positive) class. For example, a model that always predicts "paid off" on loan data with 81% paid-off rate achieves 81% accuracy but catches 0 defaults.
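This baseline effect is easy to demonstrate with scikit-learn's `DummyClassifier`; the labels below are synthetic, with a 19% positive rate chosen to mirror the loan example:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels: 1 = default (~19%), 0 = paid off (~81%)
y = (rng.random(10_000) < 0.19).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant to the dummy model

# Always predict the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, y_pred):.2f}")          # ~0.81
print(f"recall (defaults caught): {recall_score(y, y_pred):.2f}")  # 0.00
```

High accuracy, zero recall on the class that matters.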
The Prevalence Ratio Problem¶
When training on imbalanced data, the model learns class probabilities from the empirical distribution. If the training set has a different prevalence than the target population, predicted probabilities become poorly calibrated for the minority class.
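When the training and target prevalences are both known, predicted probabilities can be corrected post hoc by rescaling the predicted odds by the ratio of target odds to training odds (the standard prior-shift correction). A minimal sketch; the function name `adjust_for_prevalence` is illustrative:

```python
import numpy as np

def adjust_for_prevalence(p, train_prev, target_prev):
    """Rescale predicted positive probabilities from the training-set
    prevalence to the target population prevalence (prior shift)."""
    odds = p / (1 - p)
    ratio = (target_prev / (1 - target_prev)) / (train_prev / (1 - train_prev))
    adjusted_odds = odds * ratio
    return adjusted_odds / (1 + adjusted_odds)

# A model trained on 50-50 resampled data that outputs 0.5 (the base rate)
# maps back to 0.05 in a population where positives are actually 5%
print(adjust_for_prevalence(np.array([0.5]), 0.5, 0.05))  # approximately [0.05]
```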
Strategy 1: Adjustment Through Weighting¶
One approach is to increase the cost (weight) of minority class errors during training, minimizing a weighted loss:

\(L(\theta) = \sum_{i=1}^{n} w_i \, \ell(y_i, \hat{y}_i)\)

where \(w_i\) is a weight assigned to observation \(i\).
Class Weights¶
A common choice is inverse class frequency weighting:

\(w_c = \dfrac{n}{K \cdot n_c}\)

where \(n\) is the total sample size, \(K\) the number of classes, and \(n_c\) the count of class \(c\); or simply: \(w_{\text{minority}} = 1, w_{\text{majority}} = p_{\text{minority}} / p_{\text{majority}}\).
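As a sketch, inverse-frequency weights can be computed directly from class counts (the counts below mirror the loan example; scikit-learn's `class_weight='balanced'` applies this same `n / (K * n_c)` rule):

```python
# Illustrative class counts matching the loan example (81.1% / 18.9%)
counts = {"paid_off": 81_105, "default": 18_895}
n = sum(counts.values())
K = len(counts)

# Inverse class frequency: w_c = n / (K * n_c)
weights = {c: n / (K * n_c) for c, n_c in counts.items()}
print(weights)  # default ~2.65, paid_off ~0.62
```

Note the weight ratio (minority over majority) equals \(p_{\text{majority}} / p_{\text{majority}}\)'s inverse, about 4.3 here, regardless of normalization.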
Example: Loan Data¶
With 18.9% default rate and 81.1% paid-off rate:
| Unweighted | Weighted |
|---|---|
| Predicted defaults: 0.98% | Predicted defaults: 61.8% |
| Model undershoots defaults by 20× | Much closer to actual prevalence |
Implementation¶
In scikit-learn:
```python
from sklearn.linear_model import LogisticRegression

# Option 1: Automatic balancing (weights inversely proportional to class frequency)
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Option 2: Custom per-sample weights (5.3 is roughly 1 / 0.189)
weights = [1.0 if y == 'paid_off' else 5.3 for y in y_train]
model = LogisticRegression()  # fresh model, so the two corrections don't stack
model.fit(X_train, y_train, sample_weight=weights)
```
Advantages:
- Simple to implement
- Computationally efficient
- Preserves original dataset size

Disadvantages:
- Requires choosing an appropriate weight ratio
- Can still bias probability estimates
Strategy 2: Resampling¶
Undersampling (Downsampling)¶
Undersampling removes majority class instances to balance the dataset:
```
Original:      81,105 paid off + 18,895 default
Undersampled:  18,895 paid off + 18,895 default
```
Advantages:
- Simple; balances classes perfectly
- Reduces training time

Disadvantages:
- Loss of information: discards majority class data
- Increases variance; less stable estimates
- Biases probability estimates toward 50-50
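A minimal undersampling sketch in NumPy (the helper name `undersample` is illustrative; imblearn's `RandomUnderSampler` performs the same job):

```python
import numpy as np

def undersample(X, y, majority_label, seed=0):
    """Randomly drop majority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]

# 81,105 paid off (0) + 18,895 default (1) -> 18,895 of each
y = np.array([0] * 81_105 + [1] * 18_895)
X = np.arange(len(y)).reshape(-1, 1)
X_bal, y_bal = undersample(X, y, majority_label=0)
print(np.bincount(y_bal))  # [18895 18895]
```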
Oversampling (Upsampling)¶
Oversampling duplicates minority class instances:
```
Original:     81,105 paid off + 18,895 default
Oversampled:  81,105 paid off + 81,105 default (via replication)
```
Disadvantages:
- Creates redundant copies
- Can lead to overfitting
- Inflates dataset size
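The mirror-image sketch for naive oversampling (again, `oversample` is an illustrative helper; imblearn's `RandomOverSampler` does this for you):

```python
import numpy as np

def oversample(X, y, minority_label, seed=0):
    """Duplicate minority-class rows (sampling with replacement)
    until the classes are balanced."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

y = np.array([0] * 81_105 + [1] * 18_895)
X = np.arange(len(y)).reshape(-1, 1)
X_bal, y_bal = oversample(X, y, minority_label=1)
print(np.bincount(y_bal))  # [81105 81105]
```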
Strategy 3: Synthetic Minority Over-Sampling (SMOTE)¶
SMOTE generates synthetic samples in the minority class by interpolating between neighboring minority instances. Rather than exact replication, it creates new synthetic points along the line segment connecting nearby minority examples.
How SMOTE Works¶
For each minority sample \(x_i\):
- Find \(k\) nearest neighbors in the minority class (typically \(k=5\)).
- Randomly select one neighbor \(x_{\text{neighbor}}\).
- Generate a synthetic point:

\(x_{\text{synthetic}} = x_i + \lambda (x_{\text{neighbor}} - x_i)\)

where \(\lambda \in [0, 1]\) is random.
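The steps above can be sketched in NumPy (a toy single-point generator, not the full imblearn implementation; `smote_point` is an illustrative name):

```python
import numpy as np

def smote_point(X_minority, i, k=5, seed=0):
    """Generate one synthetic sample from minority sample i: pick one of
    its k nearest minority neighbors, then interpolate between them."""
    rng = np.random.default_rng(seed)
    x_i = X_minority[i]
    dists = np.linalg.norm(X_minority - x_i, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # exclude the point itself
    x_nb = X_minority[rng.choice(neighbors)]
    lam = rng.random()  # lambda drawn uniformly from [0, 1)
    return x_i + lam * (x_nb - x_i)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [2.0, 2.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_point(X_min, i=0, k=3)
print(synthetic)  # lies on a segment between X_min[0] and one of its neighbors
```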
Example: Loan Data with SMOTE¶
```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Result: 50-50 split of defaults and paid-offs (synthetic defaults added)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_resampled, y_resampled)
```
Advantages:
- Creates realistic synthetic samples via interpolation
- Avoids exact duplication (less overfitting than naive oversampling)
- Preserves local neighborhood structure
- Widely used in practice

Disadvantages:
- More complex than simple resampling
- Computational overhead for finding neighbors
- Assumes continuous features (discrete/categorical features need adaptation)
Variants of SMOTE¶
- BorderlineSMOTE: Focuses on minority samples near the decision boundary
- ADASYN (Adaptive Synthetic Sampling): Generates more samples for harder-to-learn minority instances
- SVMSMOTE: Uses SVM decision boundary to guide synthetic sample generation
```python
from imblearn.over_sampling import BorderlineSMOTE, ADASYN

# BorderlineSMOTE: oversample near the decision boundary
X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X_train, y_train)

# ADASYN: adaptively generate more samples for harder instances
X_resampled, y_resampled = ADASYN().fit_resample(X_train, y_train)
```
Strategy 4: Threshold Adjustment¶
Rather than changing the data or loss function, adjust the decision threshold post-hoc to match the desired operating point:
- For imbalanced data, the default threshold of 0.5 is often suboptimal
- Use ROC/PR curves to select a threshold that achieves desired precision/recall tradeoff
- Adjust based on business costs and constraints
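A sketch of threshold selection on synthetic imbalanced data, picking the lowest threshold that meets a precision target via `precision_recall_curve` (the 0.8 target and data parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# Pick the lowest threshold whose precision is at least 0.8
# (precision/recall have one more entry than thresholds; drop the last)
precision, recall, thresholds = precision_recall_curve(y_val, proba)
ok = precision[:-1] >= 0.8
threshold = thresholds[ok][0]
print(f"threshold={threshold:.2f}, recall at that point={recall[:-1][ok][0]:.2f}")
```

Replacing the precision target with an expected-cost criterion is the usual cost-based variant.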
Example¶
If training data has 20% defaults but the deployment population has 10%, raise the threshold so the model predicts fewer positives, restoring calibration at the deployment prevalence.
Comparison and Recommendations¶
| Method | Simplicity | Data Efficiency | Computation | Calibration | Use Case |
|---|---|---|---|---|---|
| No adjustment | High | High | Low | Poor | Balanced data only |
| Weighting | High | High | Low | Reasonable | When computational efficiency matters |
| Undersampling | High | Low | Low | Poor | Small datasets; extreme imbalance |
| Oversampling | High | Medium | Low | Poor | Risk of overfitting |
| SMOTE | Medium | High | Medium | Good | Recommended for most cases |
| Threshold tuning | High | High | Low | Good | Combined with other methods |
Best Practice: Combined Approach¶
A robust strategy often combines multiple techniques:
- Use SMOTE to generate balanced training data
- Fit the model on resampled data
- Optionally keep class weights for extra emphasis on the minority class
- Tune the decision threshold on a validation set that preserves the true class distribution
- Evaluate on test data with the original imbalance using precision/recall/ROC curves
Example workflow:
```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Step 1: Resample training data
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

# Step 2: Train with class weights for extra minority emphasis
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train_resampled, y_train_resampled)

# Step 3: Tune threshold on validation set (with original imbalance)
y_val_proba = model.predict_proba(X_val)[:, 1]
print("validation AUC:", roc_auc_score(y_val, y_val_proba))
fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
# Select a threshold, e.g. via Youden's J (or a cost-based method)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# Step 4: Evaluate on test set using the selected threshold
y_test_proba = model.predict_proba(X_test)[:, 1]
y_test_pred = (y_test_proba >= optimal_threshold).astype(int)
```
Key Takeaway¶
Class imbalance requires explicit handling. Simply increasing accuracy is insufficient; focus on recall (catching rare events) and calibration (realistic probability estimates) using the appropriate combination of techniques for your domain.