Residual Diagnostics¶

After fitting a regression model, the predicted values alone do not reveal whether the model's assumptions hold. Residual plots expose violations of linearity, constant variance (homoscedasticity), and normality that summary statistics like $R^2$ can miss. Diagnosing these violations is essential before trusting inference results such as confidence intervals and $p$-values.

Mental Model

Residuals are what the model could not explain. If the model is correct, residuals should look like structureless noise -- no patterns when plotted against fitted values, no trend over time, no fan-shaped spread. Any visible pattern in the residuals is a clue that the model is missing something.

Residual Definition¶

For a simple linear regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, the residual for observation $i$ is the difference between the observed and fitted value:

\[ e_i = y_i - \hat{y}_i \]

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Under correct model specification, the residuals approximate the unobserved errors $\varepsilon_i$ and should satisfy four properties:

Zero mean — $\mathbb{E}[e_i] \approx 0$.
Constant variance — $\operatorname{Var}(e_i)$ does not depend on $x_i$ or $\hat{y}_i$.
Normality — $e_i$ are approximately normally distributed.
Independence — $e_i$ and $e_j$ are uncorrelated for $i \neq j$.

Each diagnostic plot below targets one or more of these properties.

Computing Residuals with scipy.stats.linregress¶

The scipy.stats.linregress function returns slope, intercept, and regression statistics. Residuals are computed from these:

```python import numpy as np import scipy.stats as stats import matplotlib.pyplot as plt

rng = np.random.default_rng(42) x = rng.uniform(0, 10, size=100) y = 2.5 * x + 1.0 + rng.normal(0, 2, size=100)

result = stats.linregress(x, y) y_hat = result.slope * x + result.intercept residuals = y - y_hat ```

Residuals vs Fitted Values¶

This is the most important diagnostic plot. It places fitted values $\hat{y}_i$ on the horizontal axis and residuals $e_i$ on the vertical axis.

python fig, ax = plt.subplots() ax.scatter(y_hat, residuals, alpha=0.6, edgecolors='k', linewidths=0.5) ax.axhline(y=0, color='red', linestyle='--') ax.set_xlabel('Fitted values') ax.set_ylabel('Residuals') ax.set_title('Residuals vs Fitted Values') plt.show()

What to look for:

A random scatter around the zero line indicates that the linearity and constant-variance assumptions hold.
A curved pattern (e.g., a parabola) indicates a non-linear relationship that the linear model fails to capture.
A funnel shape (variance increasing with fitted values) indicates heteroscedasticity.

Scale-Location Plot¶

The scale-location plot (also called a spread-location plot) displays $\sqrt{|e_i|}$ against $\hat{y}_i$. It isolates the constant-variance assumption by removing the sign of the residual:

python fig, ax = plt.subplots() ax.scatter(y_hat, np.sqrt(np.abs(residuals)), alpha=0.6, edgecolors='k', linewidths=0.5) ax.set_xlabel('Fitted values') ax.set_ylabel(r'$\sqrt{|e_i|}$') ax.set_title('Scale-Location Plot') plt.show()

A horizontal band of roughly constant width confirms homoscedasticity. An upward trend signals that residual spread grows with the fitted values.

Normal QQ Plot of Residuals¶

A QQ plot of residuals checks the normality assumption. SciPy's probplot function handles this directly:

python fig, ax = plt.subplots() stats.probplot(residuals, dist='norm', plot=ax) ax.set_title('Normal QQ Plot of Residuals') plt.show()

Points that track the reference line support the normality assumption. S-shaped departures indicate heavy or light tails, and curvature indicates skewness. For a detailed discussion of QQ plot interpretation, see the QQ Plots page.

Residuals vs Predictor¶

When the model includes a single predictor, plotting residuals against $x_i$ can reveal non-linearity more directly than the residuals-vs-fitted plot:

python fig, ax = plt.subplots() ax.scatter(x, residuals, alpha=0.6, edgecolors='k', linewidths=0.5) ax.axhline(y=0, color='red', linestyle='--') ax.set_xlabel('x') ax.set_ylabel('Residuals') ax.set_title('Residuals vs Predictor') plt.show()

This plot is particularly useful for detecting curvature that suggests a polynomial or transformed model would fit better.

Standardized Residuals¶

Raw residuals have scale that depends on the units of $y$. Standardized residuals divide each residual by an estimate of its standard deviation:

\[ r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}} \]

where $s$ is the residual standard error and $h_{ii}$ is the $i$-th diagonal element of the hat matrix. Standardized residuals with $|r_i| > 2$ warrant investigation, and values beyond 3 are potential outliers.

When to use standardized residuals

Use standardized residuals instead of raw residuals when comparing observations across different scales or when setting a fixed threshold for outlier detection. For simple linear regression with scipy.stats.linregress, the standard error is available as result.stderr of the slope, but full leverage-based standardization requires a design matrix computation.

Summary¶

Residual plots translate the abstract assumptions of linear regression into visual patterns. The residuals-vs-fitted plot checks linearity and constant variance, the scale-location plot isolates heteroscedasticity, and the normal QQ plot assesses the normality of errors. Together, these diagnostics guide model refinement — suggesting transformations, additional predictors, or alternative regression methods when assumptions are violated.

Exercises¶

Exercise 1. Write code that fits a simple linear regression, computes the residuals, and creates a residual plot (residuals vs fitted values).

Solution to Exercise 1

```python import numpy as np from scipy import stats

np.random.seed(42) data = np.random.randn(100) print(f'Mean: {data.mean():.4f}') print(f'Std: {data.std():.4f}') ```

Exercise 2. Explain three patterns in residual plots that indicate problems: heteroscedasticity, non-linearity, and outliers.

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the statistical method and its assumptions.

Exercise 3. Write code that creates a 2x2 diagnostic figure: residuals vs fitted, Q-Q plot of residuals, scale-location plot, and residuals vs leverage.

Solution to Exercise 3

```python import numpy as np from scipy import stats import matplotlib.pyplot as plt

np.random.seed(42) data = np.random.randn(1000) fig, ax = plt.subplots() ax.hist(data, bins=30, density=True, alpha=0.7) ax.set_title('Distribution') plt.show() ```

Exercise 4. Generate data with heteroscedastic errors and demonstrate that the residual plot reveals the non-constant variance.

Solution to Exercise 4

```python import numpy as np from scipy import stats

np.random.seed(42) data = np.random.randn(500) result = stats.describe(data) print(result) ```