sklearn vs statsmodels Comparison¶
Overview¶
Python offers two primary libraries for regression modeling: scikit-learn (sklearn) and statsmodels. They serve different purposes and are best suited for different workflows.
At a Glance¶
| Feature | statsmodels | sklearn |
|---|---|---|
| Primary focus | Statistical inference | Prediction and machine learning |
| Model summary | Detailed (coefficients, \(p\)-values, \(R^2\), AIC, BIC) | Minimal (must compute manually) |
| Hypothesis testing | Built-in (\(t\)-tests, \(F\)-tests, Wald tests) | Not available |
| Confidence intervals | Built-in for coefficients and predictions | Not available |
| Diagnostics | Extensive (VIF, influence plots, residual tests) | Limited |
| Cross-validation | Not built-in | Built-in (`cross_val_score`, pipelines) |
| Regularization | Limited | Ridge, Lasso, ElasticNet built-in |
| Intercept handling | Must add manually with `add_constant()` | Automatic (default `fit_intercept=True`) |
| Formula interface | Yes (`smf.ols('y ~ x1 + x2', data=df)`) | No |
When to Use Each¶
Use statsmodels when:¶
- You need coefficient inference: \(p\)-values, confidence intervals, significance tests.
- You are performing model diagnostics: residual analysis, heteroscedasticity tests, multicollinearity checks.
- You want to compare models using AIC or BIC.
- You need a detailed regression summary table for reporting.
- You are working in a classical statistics or econometrics context.
Use sklearn when:¶
- The primary goal is prediction and generalization to new data.
- You need cross-validation and train/test splitting.
- You are building a pipeline with preprocessing steps (scaling, encoding, feature selection).
- You need regularized models (Ridge, Lasso, ElasticNet).
- You want a consistent API across different model types (regression, classification, clustering).
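The prediction-and-generalization emphasis in the list above can be sketched with a held-out test set (synthetic data; the split ratio and seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=300)

# Hold out 25% of the data to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Score on data the model has never seen
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"test R²: {test_r2:.3f}")
```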
Side-by-Side Example¶
statsmodels¶
```python
import statsmodels.api as sm
import pandas as pd

X = sm.add_constant(df[['RM', 'LSTAT', 'PTRATIO']])
y = df['PRICE']
model = sm.OLS(y, X).fit()
print(model.summary())

# Access specific results
print(f"R²: {model.rsquared:.4f}")
print(f"Adj. R²: {model.rsquared_adj:.4f}")
print(f"AIC: {model.aic:.2f}")
print(f"BIC: {model.bic:.2f}")
print(model.pvalues)
print(model.conf_int())
```
sklearn¶
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

X = df[['RM', 'LSTAT', 'PTRATIO']]
y = df['PRICE']
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# In-sample metrics (computed on the training data)
print(f"R²: {r2_score(y, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.4f}")

# Cross-validation gives an out-of-sample estimate
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R² (mean ± std): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
Using Both Together¶
In practice, many analysts use both libraries in the same project:
- Explore and diagnose with statsmodels: fit an OLS model, examine the summary, check VIF, test residual assumptions.
- Predict and validate with sklearn: use cross-validation, regularized models, and pipelines for production-ready prediction.
```python
# Step 1: Statistical analysis with statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_sm = sm.add_constant(df[['RM', 'LSTAT', 'PTRATIO']])
model_sm = sm.OLS(df['PRICE'], X_sm).fit()
print(model_sm.summary())

# Check VIF (the constant column is included; its VIF can be ignored)
vif = pd.DataFrame({
    'Feature': X_sm.columns,
    'VIF': [variance_inflation_factor(X_sm.values, i) for i in range(X_sm.shape[1])]
})
print(vif)

# Step 2: Prediction pipeline with sklearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
X_sk = df[['RM', 'LSTAT', 'PTRATIO']]
cv_scores = cross_val_score(pipe, X_sk, df['PRICE'], cv=5, scoring='r2')
print(f"Ridge CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```
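A natural next step, not shown in the original workflow, is tuning `alpha` inside the same pipeline with `GridSearchCV`; this sketch uses synthetic stand-in data and an arbitrary alpha grid:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for df
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])

# The 'ridge__' prefix routes the parameter to the named pipeline step
search = GridSearchCV(pipe, {'ridge__alpha': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, f"best CV R²: {search.best_score_:.3f}")
```

Because scaling happens inside the pipeline, each cross-validation fold is scaled using only its own training data, avoiding leakage.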
Summary¶
statsmodels excels at statistical inference and diagnostics, while sklearn excels at prediction and model deployment. The two libraries are complementary, and using both in a regression workflow provides the most complete analysis.