Residual Diagnostics¶
After fitting a regression model, the predicted values alone do not reveal whether the model's assumptions hold. Residual plots expose violations of linearity, constant variance (homoscedasticity), and normality that summary statistics like \(R^2\) can miss. Diagnosing these violations is essential before trusting inference results such as confidence intervals and \(p\)-values.
Mental Model
Residuals are what the model could not explain. If the model is correct, residuals should look like structureless noise -- no patterns when plotted against fitted values, no trend over time, no fan-shaped spread. Any visible pattern in the residuals is a clue that the model is missing something.
Residual Definition¶
For a simple linear regression \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), the residual for observation \(i\) is the difference between the observed and fitted value:
where \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\). Under correct model specification, the residuals approximate the unobserved errors \(\varepsilon_i\) and should satisfy four properties:
- Zero mean — \(\mathbb{E}[e_i] \approx 0\).
- Constant variance — \(\operatorname{Var}(e_i)\) does not depend on \(x_i\) or \(\hat{y}_i\).
- Normality — \(e_i\) are approximately normally distributed.
- Independence — \(e_i\) and \(e_j\) are uncorrelated for \(i \neq j\).
Each diagnostic plot below targets one or more of these properties.
Computing Residuals with scipy.stats.linregress¶
The scipy.stats.linregress function returns slope, intercept, and regression statistics. Residuals are computed from these:
```python import numpy as np import scipy.stats as stats import matplotlib.pyplot as plt
rng = np.random.default_rng(42) x = rng.uniform(0, 10, size=100) y = 2.5 * x + 1.0 + rng.normal(0, 2, size=100)
result = stats.linregress(x, y) y_hat = result.slope * x + result.intercept residuals = y - y_hat ```
Residuals vs Fitted Values¶
This is the most important diagnostic plot. It places fitted values \(\hat{y}_i\) on the horizontal axis and residuals \(e_i\) on the vertical axis.
python
fig, ax = plt.subplots()
ax.scatter(y_hat, residuals, alpha=0.6, edgecolors='k', linewidths=0.5)
ax.axhline(y=0, color='red', linestyle='--')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs Fitted Values')
plt.show()
What to look for:
- A random scatter around the zero line indicates that the linearity and constant-variance assumptions hold.
- A curved pattern (e.g., a parabola) indicates a non-linear relationship that the linear model fails to capture.
- A funnel shape (variance increasing with fitted values) indicates heteroscedasticity.
Scale-Location Plot¶
The scale-location plot (also called a spread-location plot) displays \(\sqrt{|e_i|}\) against \(\hat{y}_i\). It isolates the constant-variance assumption by removing the sign of the residual:
python
fig, ax = plt.subplots()
ax.scatter(y_hat, np.sqrt(np.abs(residuals)), alpha=0.6,
edgecolors='k', linewidths=0.5)
ax.set_xlabel('Fitted values')
ax.set_ylabel(r'$\sqrt{|e_i|}$')
ax.set_title('Scale-Location Plot')
plt.show()
A horizontal band of roughly constant width confirms homoscedasticity. An upward trend signals that residual spread grows with the fitted values.
Normal QQ Plot of Residuals¶
A QQ plot of residuals checks the normality assumption. SciPy's probplot function handles this directly:
python
fig, ax = plt.subplots()
stats.probplot(residuals, dist='norm', plot=ax)
ax.set_title('Normal QQ Plot of Residuals')
plt.show()
Points that track the reference line support the normality assumption. S-shaped departures indicate heavy or light tails, and curvature indicates skewness. For a detailed discussion of QQ plot interpretation, see the QQ Plots page.
Residuals vs Predictor¶
When the model includes a single predictor, plotting residuals against \(x_i\) can reveal non-linearity more directly than the residuals-vs-fitted plot:
python
fig, ax = plt.subplots()
ax.scatter(x, residuals, alpha=0.6, edgecolors='k', linewidths=0.5)
ax.axhline(y=0, color='red', linestyle='--')
ax.set_xlabel('x')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs Predictor')
plt.show()
This plot is particularly useful for detecting curvature that suggests a polynomial or transformed model would fit better.
Standardized Residuals¶
Raw residuals have scale that depends on the units of \(y\). Standardized residuals divide each residual by an estimate of its standard deviation:
where \(s\) is the residual standard error and \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix. Standardized residuals with \(|r_i| > 2\) warrant investigation, and values beyond 3 are potential outliers.
When to use standardized residuals
Use standardized residuals instead of raw residuals when comparing observations across different scales or when setting a fixed threshold for outlier detection. For simple linear regression with scipy.stats.linregress, the standard error is available as result.stderr of the slope, but full leverage-based standardization requires a design matrix computation.
Summary¶
Residual plots translate the abstract assumptions of linear regression into visual patterns. The residuals-vs-fitted plot checks linearity and constant variance, the scale-location plot isolates heteroscedasticity, and the normal QQ plot assesses the normality of errors. Together, these diagnostics guide model refinement — suggesting transformations, additional predictors, or alternative regression methods when assumptions are violated.
Exercises¶
Exercise 1. Write code that fits a simple linear regression, computes the residuals, and creates a residual plot (residuals vs fitted values).
Solution to Exercise 1
```python import numpy as np from scipy import stats
np.random.seed(42) data = np.random.randn(100) print(f'Mean: {data.mean():.4f}') print(f'Std: {data.std():.4f}') ```
Exercise 2. Explain three patterns in residual plots that indicate problems: heteroscedasticity, non-linearity, and outliers.
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the statistical method and its assumptions.
Exercise 3. Write code that creates a 2x2 diagnostic figure: residuals vs fitted, Q-Q plot of residuals, scale-location plot, and residuals vs leverage.
Solution to Exercise 3
```python import numpy as np from scipy import stats import matplotlib.pyplot as plt
np.random.seed(42) data = np.random.randn(1000) fig, ax = plt.subplots() ax.hist(data, bins=30, density=True, alpha=0.7) ax.set_title('Distribution') plt.show() ```
Exercise 4. Generate data with heteroscedastic errors and demonstrate that the residual plot reveals the non-constant variance.
Solution to Exercise 4
```python import numpy as np from scipy import stats
np.random.seed(42) data = np.random.randn(500) result = stats.describe(data) print(result) ```