Normality Assumption

Definition

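The normality assumption asserts that the errors of the regression model are normally distributed with mean zero and constant variance \(\sigma^2\). Because the errors are unobserved, the assumption is checked on the residuals: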

\[ \epsilon_i \sim N(0, \sigma^2) \]

This means that a histogram of the residuals should be roughly bell-shaped (Gaussian) and centered on zero.

Importance

Normality of the errors matters because the exact finite-sample inference procedures in linear regression rely on this assumption:

  • t-tests for coefficients: The test of whether an individual regression coefficient differs from zero follows an exact t-distribution only when the errors are normally distributed.
  • F-tests for overall significance: The overall F-test for model significance has an exact F-distribution only under normal errors.
  • Confidence intervals: Confidence intervals for coefficients use critical values from the t-distribution, which is derived under the normality assumption.
  • Prediction intervals: Prediction intervals for new observations require normal errors to achieve their stated coverage.
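
For reference, the quantities listed above usually take the following standard forms. The notation is an assumption introduced here, not taken from the text: \(n\) is the number of observations, \(p\) the number of estimated coefficients (including the intercept), \(X\) the design matrix, \(s\) the residual standard error, and \(x_0\) the predictor vector of a new observation.

\[ t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)} \sim t_{n-p} \quad \text{under } H_0: \beta_j = 0 \]

\[ \hat{\beta}_j \pm t_{n-p,\,1-\alpha/2} \, \operatorname{SE}(\hat{\beta}_j) \]

\[ \hat{y}_0 \pm t_{n-p,\,1-\alpha/2} \, s \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0} \]

The first line is the t-statistic for a single coefficient, the second the confidence interval for that coefficient, and the third the prediction interval for a new observation. The exact \(t_{n-p}\) reference distribution in each case is what the normality assumption provides.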

Important caveat: The normality assumption is not required for unbiased estimation of regression coefficients. The OLS estimator \(\hat{\beta}\) is unbiased whenever the errors have mean zero given the predictors, whatever their distribution. Without normality, however, the t- and F-statistics no longer follow their exact finite-sample distributions; the tests are then only approximately valid, and the approximation improves with sample size via the Central Limit Theorem.
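
The unbiasedness claim can be checked with a small simulation. The sketch below is illustrative only: the function names, the sample size, the true coefficients, and the choice of a centered exponential error distribution are all assumptions made for this example. It fits a simple regression on many samples with strongly skewed, mean-zero errors and averages the slope estimates.

```python
import random
import statistics


def ols_slope(x, y):
    """Return the OLS slope for a simple regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n=20, reps=2000, beta0=1.0, beta1=2.0, seed=0):
    """Fit OLS on many samples whose errors are skewed but have mean zero."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]  # fixed design on [0, 1]
    slopes = []
    for _ in range(reps):
        # expovariate(1.0) has mean 1, so subtracting 1 centers the errors at zero
        y = [beta0 + beta1 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes


slopes = simulate_slopes()
mean_slope = statistics.fmean(slopes)
print(f"true slope = 2.0, average estimate = {mean_slope:.3f}")
```

The average estimate should sit close to the true slope even though the errors are far from normal. What the simulation does not show is whether a nominal 95% t-interval built from each small sample would achieve 95% coverage, which is the part that depends on normality.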

When Normality Matters Most

| Situation | Normality Importance |
|---|---|
| Small sample size (\(n < 30\)) | Critical: the CLT does not provide a sufficient approximation |
| Large sample size (\(n > 100\)) | Less critical: the CLT ensures approximate normality of test statistics |
| Constructing prediction intervals | Always important, regardless of sample size |
| Heavy-tailed data | Important: outliers can strongly influence OLS estimates |

Diagnostics

  • Q-Q Plot (Quantile-Quantile Plot): Compares the quantiles of the residuals to the quantiles of a normal distribution. If the points fall approximately along a straight line, the residuals are likely normally distributed.
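  • Shapiro-Wilk Test: A formal statistical test whose null hypothesis is that the residuals are normally distributed. A non-significant result means the test found no evidence against normality; it does not prove that the residuals are normal, and in very large samples the test flags even trivial departures.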
  • Histogram of Residuals: A simple visual check — the histogram should resemble a bell curve centered at zero.
  • Jarque-Bera Test: Tests whether the skewness and kurtosis of the residuals match those of a normal distribution (skewness 0, kurtosis 3). Its p-value relies on a large-sample approximation.
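
Two of the diagnostics above, the Q-Q plot and the Jarque-Bera test, can be sketched with the Python standard library alone. This is a minimal illustration, not a replacement for a statistics package: the helper names, the plotting positions \((i - 0.5)/n\), and the simulated residuals are assumptions made for the example. The p-value uses the asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is \(e^{-x/2}\).

```python
import math
import random
import statistics


def qq_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    std_normal = statistics.NormalDist()
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n are one common convention among several
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered


def pearson_r(a, b):
    """Pearson correlation, used here to summarize how straight the Q-Q plot is."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # survival function of chi-square with 2 df
    return jb, p_value


# Simulated stand-ins for residuals: one normal set, one right-skewed set
rng = random.Random(1)
normal_resid = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed_resid = [rng.expovariate(1.0) - 1.0 for _ in range(500)]

results = {}
for label, resid in (("normal", normal_resid), ("skewed", skewed_resid)):
    theoretical, ordered = qq_points(resid)
    jb, p_value = jarque_bera(resid)
    results[label] = (pearson_r(theoretical, ordered), jb, p_value)
    print(f"{label}: Q-Q correlation = {results[label][0]:.4f}, "
          f"JB = {jb:.2f}, p = {p_value:.3g}")
```

Plotting `theoretical` against `ordered` gives the Q-Q plot itself. The correlation is only a rough numeric summary of how closely the points follow a straight line, so the plot remains worth inspecting directly.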

Remedies for Non-Normality

  • Transformations: Applying transformations to the dependent variable, such as log, square root, or Box-Cox transformations, can sometimes correct non-normality.
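  • Robust Regression: When non-normality comes from heavy tails or outliers, robust regression techniques down-weight extreme observations so that they have less influence on the fitted coefficients.
  • Bootstrapping: Bootstrap methods estimate the sampling distribution of the coefficients by resampling the data, so inference does not rely on the errors being normal. Other assumptions, such as independent observations, still apply.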
  • Larger samples: With sufficiently large samples, the Central Limit Theorem ensures that test statistics are approximately normally distributed even if the errors are not.
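
As an illustration of the bootstrapping remedy, the sketch below computes a percentile confidence interval for a regression slope by resampling (x, y) pairs. It is one of several bootstrap schemes, chosen here for simplicity; the function names, the simulated data, and the number of resamples are assumptions made for this example.

```python
import random
import statistics


def ols_slope(x, y):
    """Return the OLS slope for a simple regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def bootstrap_slope_ci(x, y, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # slope is undefined when every resampled x is identical
        slopes.append(ols_slope(xs, ys))
    slopes.sort()
    m = len(slopes)
    lower = slopes[int((alpha / 2) * m)]
    upper = slopes[int((1 - alpha / 2) * m) - 1]
    return lower, upper


# Simulated data with skewed, mean-zero errors (true slope 2.0)
rng = random.Random(42)
x = [i / 10 for i in range(40)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

slope_hat = ols_slope(x, y)
lower, upper = bootstrap_slope_ci(x, y)
print(f"slope = {slope_hat:.3f}, 95% bootstrap CI = ({lower:.3f}, {upper:.3f})")
```

Resampling pairs makes no assumption about the shape of the error distribution, but it does treat the observations as independent. Percentile intervals can also be inaccurate in very small samples.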

For detailed diagnostic methods, see Checking Normality.