sklearn vs statsmodels Comparison¶
Overview¶
Python offers two primary libraries for regression modeling: scikit-learn (sklearn) and statsmodels. They serve different purposes and are best suited for different workflows.
At a Glance¶
| Feature | statsmodels | sklearn |
|---|---|---|
| Primary focus | Statistical inference | Prediction and machine learning |
| Model summary | Detailed (coefficients, \(p\)-values, \(R^2\), AIC, BIC) | Minimal (must compute manually) |
| Hypothesis testing | Built-in (\(t\)-tests, \(F\)-tests, Wald tests) | Not available |
| Confidence intervals | Built-in for coefficients and predictions | Not available |
| Diagnostics | Extensive (VIF, influence plots, residual tests) | Limited |
| Cross-validation | Not built-in | Built-in (`cross_val_score`, pipelines) |
| Regularization | Limited | Ridge, Lasso, ElasticNet built-in |
| Intercept handling | Must add manually with `add_constant()` | Automatic (default `fit_intercept=True`) |
| Formula interface | Yes (`smf.ols('y ~ x1 + x2', data=df)`) | No |
When to Use Each¶
Use statsmodels when:¶
- You need coefficient inference: \(p\)-values, confidence intervals, significance tests.
- You are performing model diagnostics: residual analysis, heteroscedasticity tests, multicollinearity checks.
- You want to compare models using AIC or BIC.
- You need a detailed regression summary table for reporting.
- You are working in a classical statistics or econometrics context.
Use sklearn when:¶
- The primary goal is prediction and generalization to new data.
- You need cross-validation and train/test splitting.
- You are building a pipeline with preprocessing steps (scaling, encoding, feature selection).
- You need regularized models (Ridge, Lasso, ElasticNet).
- You want a consistent API across different model types (regression, classification, clustering).
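The prediction-and-generalization emphasis in the list above can be sketched with a held-out test set (synthetic data; the split ratio and seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=300)

# Hold out 25% of the data to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Score on data the model has never seen
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"test R²: {test_r2:.3f}")
```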
Side-by-Side Example¶
statsmodels¶
```python
import statsmodels.api as sm
import pandas as pd

X = sm.add_constant(df[['RM', 'LSTAT', 'PTRATIO']])
y = df['PRICE']
model = sm.OLS(y, X).fit()
print(model.summary())

# Access specific results
print(f"R²: {model.rsquared:.4f}")
print(f"Adj. R²: {model.rsquared_adj:.4f}")
print(f"AIC: {model.aic:.2f}")
print(f"BIC: {model.bic:.2f}")
print(model.pvalues)
print(model.conf_int())
```
sklearn¶
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

X = df[['RM', 'LSTAT', 'PTRATIO']]
y = df['PRICE']
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# In-sample metrics (computed on the training data)
print(f"R²: {r2_score(y, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.4f}")

# Cross-validation gives an out-of-sample estimate
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² (mean ± std): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
Using Both Together¶
In practice, many analysts use both libraries in the same project:
- Explore and diagnose with statsmodels: fit an OLS model, examine the summary, check VIF, test residual assumptions.
- Predict and validate with sklearn: use cross-validation, regularized models, and pipelines for production-ready prediction.
```python
# Step 1: Statistical analysis with statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_sm = sm.add_constant(df[['RM', 'LSTAT', 'PTRATIO']])
model_sm = sm.OLS(df['PRICE'], X_sm).fit()
print(model_sm.summary())

# Check VIF (the constant column is included; its VIF can be ignored)
vif = pd.DataFrame({
    'Feature': X_sm.columns,
    'VIF': [variance_inflation_factor(X_sm.values, i) for i in range(X_sm.shape[1])]
})
print(vif)

# Step 2: Prediction pipeline with sklearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
X_sk = df[['RM', 'LSTAT', 'PTRATIO']]
cv_scores = cross_val_score(pipe, X_sk, df['PRICE'], cv=5, scoring='r2')
print(f"Ridge CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
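A natural next step, not shown in the original workflow, is tuning `alpha` inside the same pipeline with `GridSearchCV`; this sketch uses synthetic stand-in data and an arbitrary alpha grid:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for df
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])

# The 'ridge__' prefix routes the parameter to the named pipeline step
search = GridSearchCV(pipe, {'ridge__alpha': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, f"best CV R²: {search.best_score_:.3f}")
```

Because scaling happens inside the pipeline, each cross-validation fold is scaled using only its own training data, avoiding leakage.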
Summary¶
statsmodels excels at statistical inference and diagnostics, while sklearn excels at prediction and model deployment. The two libraries are complementary, and using both in a regression workflow provides the most complete analysis.