Regularization Path¶
Overview¶
The regularization path traces how model coefficients change as the regularization parameter (lambda, \(\lambda\)) varies from large (high regularization, sparse solutions) to small (low regularization, close to OLS). Understanding the regularization path is essential for:
- Visualizing the bias-variance tradeoff — See how coefficients shrink as regularization increases
- Feature selection — Identify which features are selected at different regularization levels
- Understanding model complexity — Observe how many features are "active" (non-zero) at each lambda
- Choosing the optimal lambda — Combined with cross-validation, select the best regularization strength
The Regularization Path for Lasso¶
For Lasso regression, the problem solved at each value of \(\lambda\) is:

\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \]
As \(\lambda\) increases:
- More coefficients are shrunk exactly to zero
- The model becomes sparser (fewer non-zero coefficients)
- Bias increases, but variance decreases
- Predictive error follows a U-shaped curve, with its minimum at the optimal \(\lambda\)
Key Properties of the Lasso Path¶
Feature Activation Order¶
Because the L1 penalty acts like soft-thresholding on each coefficient (the operator is shown after this list), the Lasso path has a characteristic activation structure:
- At very high \(\lambda\), all coefficients are zero
- As \(\lambda\) decreases, features enter the model (become non-zero) one by one
- The order of entry reflects feature importance under L1 regularization
- At very small \(\lambda\), all features are active (close to OLS solution)
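The building block behind these exact zeros is the soft-thresholding operator applied coordinate-wise:

\[ S(z, \lambda) = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0) \]

which maps any \(|z| \le \lambda\) exactly to zero.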
Monotonicity of Feature Selection¶
Once a feature becomes non-zero in the Lasso path, it typically remains non-zero as \(\lambda\) decreases (with some exceptions for highly correlated predictors). This creates a "homotopy" structure useful for computation.
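This structure is what homotopy methods such as LARS exploit. A minimal sketch using scikit-learn's lars_path (here X is an assumed standardized design matrix and y a centered response):

from sklearn.linear_model import lars_path

# LARS with the 'lasso' modification follows the homotopy path,
# returning the breakpoint lambdas at which the active set changes
alphas, active, coefs = lars_path(X, y, method='lasso')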
Degrees of Freedom¶
At any point on the path, the effective degrees of freedom can be estimated, without bias, by the number of non-zero coefficients:

\[ \widehat{\mathrm{df}}(\lambda) = \#\{\, j : \hat{\beta}_j(\lambda) \neq 0 \,\} \]

This simple counting relationship, which Ridge regression does not share, makes Lasso particularly useful for model selection.
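In code, the estimate is just a count of non-zeros at each point on the path (lasso_coefs here is the [n_lambdas, n_features] coefficient grid used in the plotting section below):

import numpy as np

# Unbiased df estimate at each lambda: the number of non-zero coefficients
df_path = np.count_nonzero(lasso_coefs, axis=1)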
Computing the Regularization Path¶
Algorithm: Coordinate Descent with Warm Starts¶
Modern implementations (e.g., glmnet, scikit-learn) use coordinate descent with "warm starts"; with standardized features, each coordinate update reduces to a single soft-thresholding step:
Initialize: β = 0
For λ in decreasing order (λ_max to λ_min):
Initialize β from previous solution (warm start)
For each coordinate j:
Compute residual: r_j = y - X β + X_j β_j
Apply soft-thresholding: β_j = soft_threshold(X_j^T r_j / n, λ)
Continue until convergence
The warm start leverages the previous solution to speed computation—solutions at nearby lambdas are similar, so we start close to the optimum.
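A minimal NumPy sketch of this pathwise procedure (an illustration of the idea rather than glmnet's implementation; it assumes the columns of X are standardized, y is centered, and lambdas is a decreasing grid):

import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_path_cd(X, y, lambdas, n_iter=100, tol=1e-6):
    n, p = X.shape
    beta = np.zeros(p)                    # all-zero solution at lambda_max
    coefs = np.zeros((len(lambdas), p))
    for k, lam in enumerate(lambdas):     # lambdas in decreasing order
        for _ in range(n_iter):
            beta_old = beta.copy()
            for j in range(p):
                # partial residual with feature j's contribution removed
                r_j = y - X @ beta + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j] @ r_j / n, lam)
            if np.max(np.abs(beta - beta_old)) < tol:
                break
        coefs[k] = beta                   # warm start carries over to the next lambda
    return coefs

A common grid starts at \(\lambda_{\max} = \max_j |X_j^\top y| / n\), the smallest value at which every coefficient is zero, and decreases geometrically from there.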
Complexity¶
- Single lambda: O(np) per iteration
- Full path (M lambdas): O(M × np × iterations) if computed sequentially
- Efficient implementation: with warm starts, the entire path can often be computed in little more time than a single fit
Practical Interpretation of the Path¶
Example: Housing Prices¶
Consider predicting house prices with 12 features. The Lasso path might look like:
| λ (scaled) | # Non-Zero | Selected Features | RMSE |
|---|---|---|---|
| 10.0 | 0 | (none) | 340,000 |
| 5.0 | 1 | SqFtTotLiving | 280,000 |
| 2.0 | 4 | SqFtTotLiving, BldgGrade, YrBuilt, Bathrooms | 240,000 |
| 1.0 | 7 | + SqFtLot, Bedrooms, NbrLivingUnits | 225,000 |
| 0.5 | 9 | + SqFtFinBasement, YrRenovated | 220,000 |
| 0.1 | 12 | All features (approaching OLS) | 218,000 |
Insights:
- The most important predictor (SqFtTotLiving) enters first
- Building grade and year built are the next most important
- Less important features (YrRenovated) enter only at small lambda
- RMSE continues decreasing, but improvements slow after about 7 features
Cross-Validation on the Path¶
Rather than tuning each lambda in isolation, compute the CV error for every \(\lambda\) along the same grid used for the path:
For each lambda on the path:
For each CV fold:
Fit model on training fold
Predict on validation fold
Record error
Average errors across folds
Select lambda with minimum CV error
Within each fold, the fits along the grid reuse warm starts, so this is much faster than fitting each lambda independently.
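With scikit-learn, the loop above is a single call; a hedged sketch (X and y are the assumed standardized design matrix and response, and the variable names match the plots below):

from sklearn.linear_model import LassoCV

# Fits the full path within each fold, then picks the lambda with minimum mean CV error
model = LassoCV(n_alphas=100, cv=5).fit(X, y)

lambdas = model.alphas_         # the lambda grid, in decreasing order
cv_mse_path = model.mse_path_   # per-fold MSE, shape: (n_lambdas, n_folds)
lambda_opt = model.alpha_       # lambda minimizing mean CV error

Note that scikit-learn calls the regularization parameter alpha rather than lambda; the two play the same role here.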
Visualization of the Path¶
Plot 1: Coefficients vs Lambda¶
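The coefficient grid for this plot can come from scikit-learn's lasso_path; a sketch that reuses the lambda grid from the LassoCV fit above (feature_names is assumed to hold the column names of X):

from sklearn.linear_model import lasso_path

# Coefficients along the path; lasso_path returns them with shape (n_features, n_lambdas)
lambdas, coefs, _ = lasso_path(X, y, alphas=lambdas)
lasso_coefs = coefs.T           # shape: (n_lambdas, n_features)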
import matplotlib.pyplot as plt
import numpy as np
# After computing lasso_coefs (shape: [n_lambdas, n_features]) and the matching lambdas
n_lambdas, n_features = lasso_coefs.shape
fig, ax = plt.subplots(figsize=(10, 6))
for j in range(n_features):
ax.plot(np.log10(lambdas), lasso_coefs[:, j], label=feature_names[j])
ax.axvline(np.log10(lambda_opt), color='red', linestyle='--', label='Optimal λ')
ax.set_xlabel('log₁₀(Lambda)')
ax.set_ylabel('Coefficient Value')
ax.set_title('Lasso Regularization Path')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
This plot reveals:
- Feature selection: which features are active at each lambda
- Shrinkage direction: how coefficients change along the path
- Sparsity: the order and timing of feature entry
Plot 2: Cross-Validation Error¶
# Plot CV error across the path; cv_mse_path has shape (n_lambdas, n_folds)
fig, ax = plt.subplots(figsize=(10, 6))
cv_mean = cv_mse_path.mean(axis=1)
cv_std = cv_mse_path.std(axis=1)
ax.plot(np.log10(lambdas), np.sqrt(cv_mean), 'o-', label='CV RMSE')
ax.fill_between(np.log10(lambdas),
np.sqrt(cv_mean - cv_std),
np.sqrt(cv_mean + cv_std),
alpha=0.2)
ax.axvline(np.log10(lambda_opt), color='red', linestyle='--', label='Optimal λ')
ax.set_xlabel('log₁₀(Lambda)')
ax.set_ylabel('Cross-Validation RMSE')
ax.set_title('Lambda Selection via Cross-Validation')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
This reveals the optimal regularization strength: CV error typically traces a U-shape, reaching its minimum at an intermediate \(\lambda\) and rising again once excessive regularization introduces too much bias.
Comparing Regularization Methods: Paths and Trade-offs¶
Lasso vs Ridge vs Elastic Net¶
| Property | Lasso (L1) | Ridge (L2) | Elastic Net |
|---|---|---|---|
| Path sparsity | Yes (exact zeros) | No (all non-zero) | Yes (exact zeros, grouped) |
| Feature selection | Automatic | Manual (thresholding) | Automatic (grouping) |
| Computational cost | Medium (warm start) | Low (closed form) | Medium |
| Correlated predictors | Tends to pick one of the group | Keeps all, shrinks together | Tends to select the group together |
| Interpretability | High (sparse) | Medium | High |
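To see the sparsity difference concretely, the Lasso and Elastic Net paths can be computed side by side and their active-set sizes compared; a sketch on the same assumed X and y:

import numpy as np
from sklearn.linear_model import lasso_path, enet_path

alphas_l1, coefs_l1, _ = lasso_path(X, y, n_alphas=100)
alphas_en, coefs_en, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=100)

# Number of active (non-zero) features at each regularization level
active_l1 = np.count_nonzero(coefs_l1, axis=0)   # Lasso: sparse active sets
active_en = np.count_nonzero(coefs_en, axis=0)   # Elastic Net: correlated features tend to enter together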
Practical Guidelines¶
Choosing Lambda via Cross-Validation¶
Standard approach (1-SE rule):
1. Compute CV error for each lambda
2. Find λ* with minimum CV error
3. Often use λ = λ_1SE: the largest lambda whose CV error is within one standard error of the minimum
   - Provides a simpler model with similar CV error
   - More conservative against overfitting
# Find lambda_1SE: the largest lambda whose CV error is within one SE of the minimum
best_idx = np.argmin(cv_mean)
threshold = cv_mean[best_idx] + cv_std[best_idx]
lambda_1se = lambdas[cv_mean <= threshold].max()
Interpreting the Path for Model Selection¶
- Number of features — For sparse interpretability, choose lambda where only 5-15 features are active
- Stability — Prefer lambdas where small changes don't drastically alter the model
- Domain knowledge — Features should align with domain understanding; if not, investigate multicollinearity
- Prediction vs interpretation — Higher lambda (fewer features) may sacrifice accuracy for simplicity
Computational Considerations¶
- Modern packages (scikit-learn, glmnet): Compute path for ~100 lambdas nearly as fast as single fit
- Warm starts: Essential for efficiency; don't solve each lambda independently
- Standardization: Always standardize features before fitting Lasso; the L1 penalty is not scale-invariant, so coefficients on different scales are penalized unequally (a pipeline sketch follows below)
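A minimal scikit-learn sketch of the standardize-then-fit workflow (the pipeline and parameter values are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Standardize the features, then fit the Lasso path and choose lambda by internal CV
pipeline = make_pipeline(StandardScaler(), LassoCV(n_alphas=100, cv=5))
pipeline.fit(X, y)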
Summary¶
The regularization path provides a complete picture of the bias-variance tradeoff:
- Visualizes how coefficients and feature selection change with regularization
- Enables efficient computation via warm starts and coordinate descent
- Guides lambda selection through cross-validation
- Supports interpretability by revealing feature importance and stability
- Bridges the gap between complex models (all features) and simple models (few features)
Understanding the regularization path transforms lambda selection from a "black box" into an informed, principled choice aligned with your data and goals.