# Principal Components Regression (PCR)

## Overview
Principal Components Regression (PCR) combines dimensionality reduction with regression. Instead of regressing the response directly on all predictors, PCR first extracts principal components (linear combinations of predictors that capture most variance) and then regresses the response on these components.
PCR is particularly valuable when:

- Multicollinearity is severe (high correlation among predictors)
- \(p\) is large relative to \(n\) (many more predictors than observations)
- Interpretability is less critical than prediction accuracy
## The PCR Algorithm

### Step 1: Standardize Predictors
Standardize each predictor to have mean 0 and standard deviation 1:

\[
X_{\text{scaled},j} = \frac{X_j - \bar{X}_j}{s_j}, \qquad j = 1, \ldots, p
\]
This is essential because PCA is sensitive to scale. Without standardization, predictors with large variances dominate the components.
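A quick sketch of this effect on synthetic data (the scales and seed here are illustrative): when one predictor has a much larger variance, the first loading vector points almost entirely at it, and standardization restores balance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
x1 = rng.normal(scale=100, size=500)             # large-scale predictor
x2 = x1 / 100 + rng.normal(scale=0.5, size=500)  # correlated, small-scale predictor
X = np.column_stack([x1, x2])

# PCA on raw data: the first loading vector is dominated by x1
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# PCA on standardized data: both predictors contribute equally
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = PCA(n_components=1).fit(X_scaled).components_[0]

print(np.round(np.abs(pc1_raw), 2))   # ~[1.00, 0.01]
print(np.round(np.abs(pc1_std), 2))   # ~[0.71, 0.71]
```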
### Step 2: Compute Principal Components
Apply PCA to the standardized predictors \(X_{\text{scaled}}\) to compute principal components:

\[
Z_k = X_{\text{scaled}} V_k
\]

where:

- \(V_k\) is the \(p \times k\) matrix of eigenvectors (loadings) of the covariance matrix
- \(Z_k\) is the matrix of the first \(k\) principal components
- Each principal component is a linear combination: \(Z_j = \sum_{i=1}^p v_{ij} X_i\)

The components are ordered by the variance they explain:

\[
\text{Var}(Z_1) \geq \text{Var}(Z_2) \geq \cdots \geq \text{Var}(Z_p)
\]
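To make this step concrete, here is a minimal sketch (synthetic data) that computes the components directly from the eigendecomposition of the covariance matrix and checks that their variances equal the eigenvalues, in decreasing order:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.9 * X[:, 0]                 # induce correlation between predictors

# Step 1: standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix of the standardized data
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder: largest eigenvalue first
eigvals, V = eigvals[order], eigvecs[:, order]

# Components Z = X_scaled V; Var(Z_j) equals the j-th eigenvalue
Z = X_scaled @ V
print(np.allclose(np.var(Z, axis=0, ddof=1), eigvals))  # True
```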
### Step 3: Regress on Principal Components
Perform standard linear regression using the first \(M\) principal components as predictors:

\[
y = \gamma_0 + \gamma_1 Z_1 + \cdots + \gamma_M Z_M + \varepsilon
\]
where \(M \leq p\) is selected by cross-validation (Step 4).
### Step 4: Choose M via Cross-Validation
The number of components \(M\) is a tuning parameter:
- Too few components (\(M\) small): Underfitting; lose information from excluded predictors
- Too many components (\(M\) close to \(p\)): Overfitting; noisy components increase variance
- Optimal \(M^*\): Minimizes cross-validation error
Use \(k\)-fold cross-validation:
- For each candidate \(M \in \{1, 2, \ldots, p\}\):
  - For each fold:
    - Fit PCA on the training fold (compute components)
    - Fit the regression on the first \(M\) components
    - Predict on the validation fold
  - Compute the average CV error
- Select \(\hat{M} = \arg\min_M \text{CV}(M)\)
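The procedure above can be sketched with an explicit fold loop on synthetic data (the data and fold counts here are illustrative). Note that the scaler and PCA are refit on each training fold, so the validation fold never influences the components:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=120)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_errors = {}
for M in range(1, X.shape[1] + 1):
    fold_mse = []
    for train_idx, val_idx in kfold.split(X):
        # Fit scaler and PCA on the training fold only (no leakage)
        scaler = StandardScaler().fit(X[train_idx])
        pca = PCA(n_components=M).fit(scaler.transform(X[train_idx]))
        Z_train = pca.transform(scaler.transform(X[train_idx]))
        Z_val = pca.transform(scaler.transform(X[val_idx]))
        reg = LinearRegression().fit(Z_train, y[train_idx])
        fold_mse.append(np.mean((reg.predict(Z_val) - y[val_idx]) ** 2))
    cv_errors[M] = np.mean(fold_mse)

M_hat = min(cv_errors, key=cv_errors.get)   # argmin over candidate M
```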
## Advantages and Disadvantages

### Advantages
- Handles multicollinearity — Uncorrelated components eliminate multicollinearity problems
- Works when \(p > n\) — Dimensionality reduction makes regression feasible
- Automatic feature combination — Components are data-driven linear combinations of all predictors
- Computational efficiency — Solving least squares on \(M < p\) components is faster than on all \(p\) predictors
- Reduces overfitting — Using fewer components acts as implicit regularization
### Disadvantages
- Unsupervised dimension reduction — PCA ignores the response \(y\); components may not align with predicting \(y\)
  - Contrast with PLS: Partial Least Squares uses the response to guide component construction
- Loss of interpretability — Components are linear combinations of original predictors; harder to interpret
- Standardization required — Must standardize predictors; predictions can be sensitive to scaling choices
- Model complexity — Must store the loading matrix \(V\) to apply model to new data
- Not for feature selection — All original features may be used, even if only a few truly matter
## PCR vs. Ridge Regression
Both PCR and Ridge regression address multicollinearity, but differ fundamentally:
| Aspect | PCR | Ridge |
|---|---|---|
| Approach | Unsupervised dimension reduction (PCA) | Shrinkage of all coefficients |
| Components retained | Only first \(M\) components | All predictors, shrunk |
| Parameter | Number of components \(M\) | Regularization strength \(\lambda\) |
| Bias-variance | Discards \(p - M\) components (bias ↑, variance ↓) | Shrinks all coefficients (bias ↑, variance ↓) |
| When to use | \(p\) large, severe multicollinearity, \(p > n\) | Moderate multicollinearity, moderate \(p\) |
Key insight: Ridge uses a continuous shrinkage mechanism, while PCR uses a discrete selection mechanism.
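This contrast can be made concrete along the principal directions of the design matrix: writing \(d_j\) for its singular values, ridge multiplies the \(j\)-th component of the least-squares fit by \(d_j^2/(d_j^2+\lambda)\), while PCR multiplies it by 1 (kept) or 0 (dropped). A sketch with illustrative singular values:

```python
import numpy as np

# Illustrative singular values of a hypothetical design matrix
d = np.array([10.0, 5.0, 2.0, 0.5, 0.1])
lam = 1.0   # ridge penalty
M = 3       # number of components retained by PCR

ridge_factor = d**2 / (d**2 + lam)                   # continuous shrinkage per direction
pcr_factor = (np.arange(len(d)) < M).astype(float)   # keep-or-drop per direction

for j, (r, k) in enumerate(zip(ridge_factor, pcr_factor), 1):
    print(f"direction {j}: ridge factor {r:.3f}, PCR factor {k:.0f}")
```

Ridge shrinks low-variance directions heavily but never to exactly zero; PCR truncates them outright.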
## PCR vs. Partial Least Squares (PLS)
Both are dimensionality reduction methods for regression, but differ in how they construct components:
| Aspect | PCR | PLS |
|---|---|---|
| Component construction | Unsupervised (PCA): maximize variance of \(X\) | Supervised: maximize covariance of \(X\) and \(y\) |
| Components aligned with | Explaining variance in predictors | Predicting the response |
| Typical performance | Depends on whether high-variance directions align with \(y\) | Often better, since components are built to predict \(y\) |
| Interpretability | Same limitation: linear combinations of \(X\) | Same limitation: linear combinations of \(X\) |
In practice, PLS often outperforms PCR because it uses information about \(y\) when building components.
## Mathematical Details

### Variance Explained
The proportion of variance explained by the first \(k\) principal components is:

\[
\text{PVE}(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j}
\]
where \(\lambda_j\) are the eigenvalues of the covariance matrix \(\text{Cov}(X_{\text{scaled}})\), ordered from largest to smallest.
A scree plot visualizes this: it shows eigenvalues (or cumulative variance explained) vs. component number. An "elbow" indicates the number of components capturing most variation.
### Regression Coefficients in Original Scale
PCR estimates coefficients in terms of principal components; mapping them back to the standardized predictors gives

\[
\hat{\beta}_{\text{PCR}} = V_M \hat{\gamma}
\]

where:

- \(V_M\) is the \(p \times M\) matrix of loadings for the first \(M\) components
- \(\hat{\gamma}\) is the vector of regression coefficients on the components

To make predictions on new data with original features \(x_{\text{new}}\), standardize using the training means and standard deviations, then apply either form:

\[
\hat{y}_{\text{new}} = \hat{\gamma}_0 + \hat{\gamma}^\top V_M^\top x_{\text{new,scaled}} = \hat{\gamma}_0 + \hat{\beta}_{\text{PCR}}^\top x_{\text{new,scaled}}
\]
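A quick check on synthetic data that mapping the component coefficients back through the loadings reproduces the fitted model on the standardized scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

M = 3
scaler = StandardScaler().fit(X)
Xs = scaler.transform(X)
pca = PCA(n_components=M).fit(Xs)
Z = pca.transform(Xs)
reg = LinearRegression().fit(Z, y)

# Map component coefficients back: beta = V_M @ gamma (standardized scale)
V_M = pca.components_.T        # p x M loading matrix
beta = V_M @ reg.coef_
intercept = reg.intercept_     # Xs is already centered

pred_direct = reg.predict(Z)
pred_beta = Xs @ beta + intercept
print(np.allclose(pred_direct, pred_beta))  # True
```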
## Python Implementation

### Step-by-Step Example
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold

# Load data
X = pd.DataFrame(...)  # Features
y = pd.Series(...)     # Response

# Step 1: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Check variance explained
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
print(f"Variance explained by each component:\n{pca.explained_variance_ratio_}")
print(f"Cumulative variance:\n{cumsum_var}")

# Steps 3 & 4: Choose M via cross-validation.
# The pipeline refits the scaler and PCA on each training fold,
# so the validation fold never leaks into the components.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
mse_scores = []
for M in range(1, X.shape[1] + 1):
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=M)),
        ('reg', LinearRegression()),
    ])
    cv_score = cross_val_score(
        pipe, X, y,
        cv=kfold,
        scoring='neg_mean_squared_error'
    )
    mse = -cv_score.mean()
    mse_scores.append(mse)
    print(f"M={M:2d}: CV MSE = {mse:,.0f}")

# Find optimal M
M_opt = np.argmin(mse_scores) + 1
print(f"\nOptimal number of components: M = {M_opt}")
print(f"Variance explained: {cumsum_var[M_opt-1]:.4f} ({cumsum_var[M_opt-1]*100:.2f}%)")

# Fit final model on all data using the first M_opt components
pcr_model = LinearRegression()
pcr_model.fit(X_pca[:, :M_opt], y)

# Make predictions (in-sample)
y_pred = pcr_model.predict(X_pca[:, :M_opt])
```
### Visualization Example
```python
# Plot 1: Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Variance explained by each component
ax1.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, 'o-', linewidth=2, markersize=6)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Variance Explained')
ax1.set_title('Scree Plot')
ax1.grid(True, alpha=0.3)

# Cumulative variance
ax2.plot(range(1, len(cumsum_var) + 1), cumsum_var, 'o-', linewidth=2, markersize=6)
ax2.axhline(0.9, color='red', linestyle='--', label='90% threshold')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Variance Explained')
ax2.set_title('Cumulative Variance Explained')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Plot 2: CV error vs. number of components
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(1, len(mse_scores) + 1), np.sqrt(mse_scores), 'o-', linewidth=2, markersize=6)
ax.axvline(M_opt, color='red', linestyle='--', label=f'Optimal M = {M_opt}')
ax.set_xlabel('Number of Components (M)')
ax.set_ylabel('CV RMSE')
ax.set_title('Cross-Validation Error vs. Number of Components')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
## When to Use PCR
PCR is a good choice when:
- Multicollinearity is severe — Components are uncorrelated
- \(p\) is very large — Dimensionality reduction needed
- \(p > n\) — OLS has no unique solution; PCR makes regression possible
- Interpretability of original features is not critical — Willing to work with components
- Prediction accuracy is the primary goal — Doesn't require understanding individual features
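A minimal sketch of the \(p > n\) case on synthetic data with a low-dimensional latent structure (the component count is fixed by hand here rather than chosen by CV, for brevity):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, p = 50, 200                                   # p > n: OLS has no unique solution
latent = rng.normal(size=(n, 3))                 # 3 latent factors drive all 200 predictors
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.1 * rng.normal(size=n)

# PCR with 3 components recovers the latent structure and predicts well
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
r2 = cross_val_score(pcr, X, y, cv=5, scoring='r2').mean()
print(f"PCR CV R^2 with 3 components: {r2:.2f}")
```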
Consider alternatives if:

- Feature selection is important — use Lasso or elastic net instead
- Interpretability is critical — linear regression with a subset of features may be preferable
- PLS might work better — if the goal is prediction (PLS uses the response in component construction)
## Comparison with Other Methods
| Method | Multicollinearity | Feature Selection | Interpretability | When to Use |
|---|---|---|---|---|
| OLS | Poor | No | High | Small \(p\), low correlation |
| Ridge | Good | No | High | Moderate \(p\), moderate correlation |
| Lasso | Good | Yes | High | Feature selection important |
| Elastic Net | Good | Yes | High | Balance of Ridge + Lasso |
| PCR | Excellent | No | Low | Large \(p\) or \(p > n\) |
| PLS | Excellent | No | Low | Large \(p\), prediction focus |
## Summary
Principal Components Regression combines the unsupervised dimensionality reduction of PCA with linear regression:
- Standardize predictors
- Extract principal components (linear combinations of predictors)
- Select the number of components \(M\) via cross-validation
- Regress response on the first \(M\) components
PCR effectively addresses multicollinearity and high-dimensionality, making it valuable for prediction in challenging settings where \(p\) is large or correlation among predictors is severe. However, the loss of interpretability and unsupervised nature of component selection are trade-offs to consider.