Normality Assumption
Definition
The normality assumption states that the errors of the regression model are normally distributed, with a mean of zero:
\[\varepsilon_i \sim N(0, \sigma^2)\]
The errors themselves are unobserved, so the assumption is checked on the residuals: a histogram of the residuals should look roughly bell-shaped (Gaussian) and be centered at zero.
Importance
Normality matters because the exact finite-sample inference procedures used in linear regression are derived under this assumption:
- t-tests for coefficients — Testing whether individual regression coefficients are significantly different from zero requires normally distributed errors.
- F-tests for overall significance — The overall F-test for model significance assumes normal errors.
- Confidence intervals — The construction of confidence intervals for coefficients uses the t-distribution, which is derived under the normality assumption.
- Prediction intervals — Prediction intervals for new observations require normality to be valid.
Important caveat: The normality assumption is not required for unbiased estimation of regression coefficients. The OLS estimator \(\hat{\beta}\) is unbiased whenever the errors have mean zero given the predictors, whatever their distribution. Without normality, however, the exact t and F distributions no longer hold in finite samples; the corresponding tests and intervals are then only approximately valid, and the approximation is justified asymptotically by the Central Limit Theorem.
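The unbiasedness claim can be checked with a small simulation. The sketch below is illustrative only: the helper names (`ols_slope`, `mean_slope_under_skewed_errors`), the true coefficients, the sample size, and the choice of shifted exponential errors are all assumptions made for this example, not part of the original text. It fits a simple regression many times with strongly skewed, mean-zero errors and averages the estimated slopes.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def mean_slope_under_skewed_errors(true_slope=2.0, n=10, reps=5000, seed=0):
    """Average OLS slope over many simulated data sets with non-normal errors."""
    rng = random.Random(seed)
    x = list(range(n))  # fixed design: 0, 1, ..., n-1
    slopes = []
    for _ in range(reps):
        # Exponential(1) shifted to mean zero: right-skewed, clearly not normal.
        errors = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [1.0 + true_slope * xi + e for xi, e in zip(x, errors)]
        slopes.append(ols_slope(x, y))
    return sum(slopes) / reps


average_slope = mean_slope_under_skewed_errors()
print(f"average estimated slope: {average_slope:.3f} (true slope is 2.0)")
```

If the estimator is unbiased, the average should sit very close to the true slope even though each sample has only ten observations and the errors are far from Gaussian.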
When Normality Matters Most
| Situation | Importance of normality |
|---|---|
| Small sample size (roughly \(n < 30\)) | Critical: the CLT approximation is not yet reliable |
| Large sample size (roughly \(n > 100\)) | Less critical: the CLT makes the test statistics approximately normal |
| Constructing prediction intervals | Always important, regardless of sample size, because the interval depends on the distribution of an individual error |
| Heavy-tailed data | Important: outliers can strongly influence OLS estimates |
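The prediction-interval row can be illustrated with a simulation. The sketch below is a hedged example under stated assumptions: the function names, the coefficients, the shifted exponential error distribution, and the use of the large-sample quantile 1.96 (which ignores the small estimation-error term in the interval width) are all choices made here for illustration. It fits a line on a large sample and then counts how often new observations fall below or above a nominal 95% normal-theory prediction interval.

```python
import math
import random


def fit_line(x, y):
    """Return intercept, slope, and residual standard error of a simple OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return intercept, slope, s


def prediction_interval_miss_rates(n=500, n_new=20000, seed=1):
    """Share of new observations below and above a nominal 95% interval."""
    rng = random.Random(seed)

    def draw(size):
        x = [rng.uniform(0.0, 10.0) for _ in range(size)]
        # Mean-zero but right-skewed errors (shifted exponential).
        y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        return x, y

    x, y = draw(n)
    intercept, slope, s = fit_line(x, y)
    z = 1.96  # large-sample normal quantile for a 95% interval
    x_new, y_new = draw(n_new)
    below = above = 0
    for xi, yi in zip(x_new, y_new):
        center = intercept + slope * xi
        if yi < center - z * s:
            below += 1
        elif yi > center + z * s:
            above += 1
    return below / n_new, above / n_new


below_rate, above_rate = prediction_interval_miss_rates()
print(f"missed below: {below_rate:.3%}, missed above: {above_rate:.3%}")
```

Under normal errors the two miss rates would each be near 2.5%. With skewed errors the misses should pile up on one side, and a larger training sample does not change that, since the imbalance comes from the error distribution rather than from estimation noise.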
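Diagnostics
- Q-Q Plot (Quantile-Quantile Plot): Compares the quantiles of the residuals to the quantiles of a normal distribution. If the points fall approximately along a straight line, the residuals are consistent with a normal distribution; systematic curvature points to skewness or heavy tails.
- Shapiro-Wilk Test: A formal statistical test whose null hypothesis is that the data are normally distributed. A non-significant result means there is no evidence against normality, not proof of it, and in large samples the test can flag departures too small to matter in practice.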
- Histogram of Residuals: A simple visual check — the histogram should resemble a bell curve centered at zero.
- Jarque-Bera Test: Tests whether the skewness and kurtosis of the residuals match a normal distribution.
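Two of these diagnostics are simple enough to sketch with the Python standard library alone. The code below is a minimal illustration, not a reference implementation: the function names and the simulated "residuals" are assumptions made for this example. It computes the Jarque-Bera statistic from the sample skewness and kurtosis (with a p-value from the asymptotic chi-square distribution with 2 degrees of freedom) and the coordinates of a Q-Q plot, summarized by the correlation between theoretical and observed quantiles. Shapiro-Wilk is omitted because its coefficients are not practical to hand-code; SciPy provides it as `scipy.stats.shapiro`.

```python
import math
import random
from statistics import NormalDist


def jarque_bera(residuals):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square with 2 df
    return jb, p_value


def qq_points(residuals):
    """(theoretical normal quantile, ordered residual) pairs for a Q-Q plot."""
    n = len(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))


def qq_correlation(residuals):
    """Correlation of the Q-Q points; values near 1 mean a nearly straight line."""
    pairs = qq_points(residuals)
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_r = sum((r - mean_r) ** 2 for _, r in pairs)
    return cov / math.sqrt(var_t * var_r)


# Simulated stand-ins for residuals: one normal sample, one right-skewed sample.
rng = random.Random(2)
normal_resid = [rng.gauss(0.0, 1.0) for _ in range(300)]
skewed_resid = [rng.expovariate(1.0) - 1.0 for _ in range(300)]

for label, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    jb, p = jarque_bera(resid)
    print(f"{label}: JB = {jb:.2f}, p = {p:.3g}, Q-Q corr = {qq_correlation(resid):.4f}")
```

In practice the residuals would come from the fitted model, and the Q-Q points would be plotted rather than reduced to a single correlation; the number is only a compact way to show the contrast between the two samples.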
Remedies for Non-Normality
- Transformations: Applying transformations to the dependent variable, such as log, square root, or Box-Cox transformations, can sometimes correct non-normality.
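- Robust Regression: When heavy tails or outliers drive the non-normality, robust regression techniques (for example, M-estimation with a Huber loss) limit their impact by down-weighting observations with large residuals.
- Bootstrapping: Bootstrap methods approximate the sampling distribution of the estimates by resampling the data, so the resulting confidence intervals and tests do not rely on normally distributed errors.
- Larger samples: With sufficiently large samples, the Central Limit Theorem ensures that test statistics are approximately normally distributed even if the errors are not. This does not rescue prediction intervals, which depend on the error distribution itself.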
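As an illustration of the bootstrap remedy, the sketch below builds a percentile confidence interval for a regression slope by resampling (x, y) pairs with replacement. It is one of several bootstrap variants and is shown only as an example: the function names, the simulated data, the number of resamples, and the simple index-based percentile rule are all assumptions made here.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Approximate percentile interval for the slope from a pairs bootstrap."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        # Resample whole observations so each x stays paired with its y.
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    slopes.sort()
    alpha = 1.0 - level
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated data with skewed, mean-zero errors (true slope 2.0).
rng = random.Random(3)
x = [rng.uniform(0.0, 10.0) for _ in range(50)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

slope_hat = ols_slope(x, y)
ci_low, ci_high = bootstrap_slope_ci(x, y)
print(f"slope = {slope_hat:.3f}, 95% bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Resampling pairs rather than residuals avoids assuming that the errors share one distribution across all values of x, at the cost of treating the predictors as random. The interval still relies on the observations being independent draws from the same population.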
For detailed diagnostic methods, see Checking Normality.