Normality Visualization¶
Many statistical methods -- \(t\)-tests, ANOVA, linear regression -- assume that data or residuals follow a normal distribution. While formal tests like Shapiro-Wilk provide a binary decision, visual methods reveal the nature and severity of departures from normality. A histogram might show skewness, a QQ plot might show heavy tails, and these patterns inform which remedies (transformations, nonparametric alternatives) are appropriate. This page demonstrates the primary visual techniques for assessing normality.
Histogram with Normal Overlay¶
The most intuitive check overlays the fitted normal density \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\) on a normalized histogram of the data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=300)
# Fit normal distribution
mu, sigma = stats.norm.fit(data)
x = np.linspace(data.min() - 5, data.max() + 5, 200)
pdf = stats.norm.pdf(x, mu, sigma)
plt.hist(data, bins=25, density=True, alpha=0.6, label='Data')
plt.plot(x, pdf, 'r-', linewidth=2,
label=f'Normal($\\mu$={mu:.1f}, $\\sigma$={sigma:.1f})')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram with Fitted Normal Density')
plt.legend()
plt.show()
Interpreting the Histogram
Look for symmetry around the peak, bell-shaped curvature, and tails that taper smoothly. Skewness appears as asymmetry; heavy tails appear as excess density far from the center.
Normal Probability Plot (QQ Plot)¶
The normal QQ plot plots sample quantiles against theoretical normal quantiles. If the data are normally distributed, the points follow a straight line. scipy.stats.probplot computes both the quantiles and a fitted line.
fig, ax = plt.subplots()
stats.probplot(data, dist="norm", plot=ax)
ax.set_title('Normal QQ Plot')
plt.show()
Common deviation patterns:
| Pattern | Interpretation |
|---|---|
| S-shaped curve | Heavy tails (leptokurtic) |
| Inverted S-shape | Light tails (platykurtic) |
| Points curve upward at both ends | Right skew |
| Points curve downward at both ends | Left skew |
| Points follow the line | Approximately normal |
ECDF vs Normal CDF¶
Comparing the empirical CDF against the theoretical normal CDF provides a non-binned view of normality. The maximum vertical gap between the two curves is the Kolmogorov-Smirnov statistic.
data_sorted = np.sort(data)
ecdf = np.arange(1, len(data) + 1) / len(data)
cdf_normal = stats.norm.cdf(data_sorted, loc=mu, scale=sigma)
plt.step(data_sorted, ecdf, where='post', label='Empirical CDF')
plt.plot(data_sorted, cdf_normal, 'r-', label='Normal CDF')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('ECDF vs Normal CDF')
plt.legend()
plt.show()
# Quantify the maximum deviation
ks_stat, p_value = stats.kstest(data, 'norm', args=(mu, sigma))
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
Box Plot¶
A box plot summarizes the distribution through its quartiles and flags observations beyond the whiskers as potential outliers. For normal data, the box is approximately symmetric around the median, and few points appear beyond the whiskers.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Normal data
axes[0].boxplot(data, vert=True)
axes[0].set_title('Normal Data')
# Skewed data for contrast
skewed = stats.lognorm.rvs(s=0.5, size=300, random_state=42)
axes[1].boxplot(skewed, vert=True)
axes[1].set_title('Right-Skewed Data')
plt.tight_layout()
plt.show()
Skewness and Kurtosis Indicators¶
Quantitative measures of shape complement the visual assessment. For a normal distribution, skewness is \(0\) and excess kurtosis is \(0\).
skew = stats.skew(data)
kurt = stats.kurtosis(data) # Excess kurtosis (Fisher)
print(f"Skewness: {skew:.3f} (normal = 0)")
print(f"Excess kurtosis: {kurt:.3f} (normal = 0)")
# D'Agostino-Pearson test combines skewness and kurtosis
stat, p_value = stats.normaltest(data)
print(f"D'Agostino-Pearson: statistic={stat:.3f}, p={p_value:.4f}")
Kurtosis Convention
scipy.stats.kurtosis returns excess kurtosis by default (Fisher's definition), so the expected value for a normal distribution is \(0\), not \(3\). Set fisher=False to get the Pearson definition where the normal value is \(3\).
Comparing Normal vs Non-Normal Data¶
Viewing multiple diagnostic plots side by side clarifies how departures from normality manifest across different visualization methods.
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for col, (label, sample) in enumerate([
('Normal', np.random.normal(0, 1, 500)),
('Right-skewed', stats.lognorm.rvs(s=0.8, size=500, random_state=1)),
('Heavy-tailed', stats.t.rvs(df=3, size=500, random_state=2)),
]):
# Histogram
axes[0, col].hist(sample, bins=30, density=True, alpha=0.6)
mu_fit, sigma_fit = stats.norm.fit(sample)
x = np.linspace(sample.min(), sample.max(), 100)
axes[0, col].plot(x, stats.norm.pdf(x, mu_fit, sigma_fit), 'r-')
axes[0, col].set_title(f'{label} - Histogram')
# QQ plot
stats.probplot(sample, dist="norm", plot=axes[1, col])
axes[1, col].set_title(f'{label} - QQ Plot')
plt.tight_layout()
plt.show()
Summary¶
Visual normality assessment reveals not just whether data are non-normal but how they deviate. Histograms show overall shape, QQ plots diagnose tail behavior and skewness with high sensitivity, and ECDF comparisons provide a complete non-binned view. Box plots highlight asymmetry and outliers at a glance, while skewness and kurtosis values quantify what the plots reveal. Using these methods together gives a thorough picture that guides the choice between parametric methods that assume normality and robust or nonparametric alternatives that do not.