Assumptions and Diagnostics for Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. However, for the results of a linear regression model to be valid, certain assumptions must be met. These assumptions ensure that the model is appropriately specified and that the statistical inferences made from the model are reliable.

The Four Key Assumptions

The four key assumptions underlying linear regression are commonly summarized by the acronym LINE:

Linearity: The expected value of the dependent variable is a straight-line (linear) function of each independent variable, holding the others fixed.
Independence: The errors of the regression model, which the residuals estimate, are independent of each other.
Normality: The errors are normally distributed with a mean of zero.
Equal Variance (Homoscedasticity): The variance of the errors is constant across all levels of the independent variables.
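
The four assumptions can also be written compactly in symbols. The notation below is an assumption made for illustration and is not taken from the text above: p independent variables x_{i1}, ..., x_{ip}, coefficients \beta_0, ..., \beta_p, an error term \varepsilon_i for observation i, and a common error variance \sigma^2.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

Read against LINE: the linear mean function expresses Linearity, the "iid" statement expresses Independence, the normal distribution expresses Normality, and the single \sigma^2 shared by all observations expresses Equal Variance.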

Why These Assumptions Matter

Understanding and verifying these assumptions is essential because:

Violated linearity leads to biased predictions and invalid inferences, because the fitted line systematically overestimates the response in some ranges of the independent variables and underestimates it in others.
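Violated independence produces incorrect standard errors and invalid hypothesis tests, and makes the coefficient estimates inefficient. The estimates themselves generally remain unbiased unless the model includes lagged values of the dependent variable. This is particularly critical in time-series data, where autocorrelation may be present.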
Violated homoscedasticity (heteroscedasticity) yields inefficient coefficient estimates and biased standard errors, ultimately compromising hypothesis tests.
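Violated normality undermines the validity of confidence intervals, hypothesis tests (t-tests for coefficients, F-tests for overall significance), and prediction intervals. The effect is strongest in small samples: in large samples, tests and confidence intervals for the coefficients remain approximately valid by the central limit theorem, while prediction intervals still depend on the shape of the error distribution.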
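
The heteroscedasticity consequence listed above can be illustrated with a small simulation. This is a minimal sketch rather than part of the original material: the true intercept (1.0) and slope (2.0), the fixed design on [0, 10], and the error standard deviation of 0.1 * x**2 are all hypothetical choices made only to produce a clear violation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n, reps = 200, 2000
beta0, beta1 = 1.0, 2.0          # hypothetical true coefficients
x = np.linspace(0.0, 10.0, n)    # fixed design shared by every simulated sample
sigma = 0.1 * x**2               # error SD grows with x (heteroscedastic)

# Simulate `reps` data sets at once: each row of y is one sample of size n.
eps = rng.normal(size=(reps, n)) * sigma
y = beta0 + beta1 * x + eps

# Closed-form OLS estimates for simple regression, computed row by row.
xc = x - x.mean()
sxx = np.sum(xc**2)
slopes = (y @ xc) / sxx
intercepts = y.mean(axis=1) - slopes * x.mean()

# Classical standard error of the slope, which assumes constant variance.
resid = y - intercepts[:, None] - slopes[:, None] * x
s2 = np.sum(resid**2, axis=1) / (n - 2)
classical_se = np.sqrt(s2 / sxx)

mean_slope = slopes.mean()               # compare with beta1
empirical_sd = slopes.std(ddof=1)        # actual sampling spread of the slope
mean_classical_se = classical_se.mean()  # spread the classical formula reports

print(f"mean slope estimate : {mean_slope:.3f}")
print(f"empirical SD of slope: {empirical_sd:.3f}")
print(f"mean classical SE    : {mean_classical_se:.3f}")
```

Under these assumed settings, the average slope should land close to the true value while the actual spread of the slope estimates exceeds what the classical formula reports, which is the sense in which the standard errors are biased.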

Diagnostic Workflow

A typical diagnostic workflow for checking regression assumptions proceeds as follows:

Fit the regression model and compute residuals.
Use visual diagnostics (scatterplots, residual plots, Q-Q plots, histograms) for initial assessment.
Apply formal statistical tests (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk, etc.) for rigorous evaluation.
If violations are detected, apply remedies such as variable transformations, weighted least squares, robust regression, or alternative model specifications.

The following sections provide detailed coverage of each assumption, including diagnostic methods, formal tests, and remedies for violations.

Section Overview

Section	Topic	Key Diagnostics
Linearity	Straight-line relationship assumption	Scatterplots, residual plots
Independence	Uncorrelated residuals assumption	Durbin-Watson test, residual plots
Homoscedasticity	Constant variance assumption	Residual vs. fitted plot, Breusch-Pagan test
Normality	Normal residuals assumption	Q-Q plot, Shapiro-Wilk test
Checking Linearity	Methods for linearity assessment	Scatterplots, CPR plots, polynomial terms
Checking Independence	Methods for independence assessment	Durbin-Watson, Breusch-Godfrey, clustering
Checking Homoscedasticity	Methods for homoscedasticity assessment	Breusch-Pagan, White test, scale-location plot
Checking Normality	Methods for normality assessment	Histogram, Q-Q plot, Shapiro-Wilk, Anderson-Darling, Jarque-Bera
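
Two of the statistics named in the table, Durbin-Watson and Jarque-Bera, are simple enough to compute directly from a residual vector. The sketch below uses their standard textbook definitions; the function names are hypothetical, and a library implementation would normally be preferred in practice.

```python
import math

import numpy as np


def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid**2))


def jarque_bera_stat(resid):
    """Return (JB, p-value), with JB = n/6 * (S^2 + (K - 3)^2 / 4)."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered**2)
    skew = np.mean(centered**3) / m2**1.5   # sample skewness S
    kurt = np.mean(centered**4) / m2**2     # sample kurtosis K
    jb = n / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)
    # JB is compared with a chi-square distribution with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return float(jb), p_value


# Residuals that alternate in sign push the statistic above 2.
print(durbin_watson_stat([1.0, -1.0, 1.0, -1.0]))  # 3.0
```

Durbin-Watson values range from 0 to 4: values well below 2 suggest positive autocorrelation and values well above 2 suggest negative autocorrelation. The chi-square reference for Jarque-Bera is an asymptotic approximation, so its p-value is rough for small samples.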