F Distribution¶
Overview¶
The F distribution is the distribution of the ratio of two independent chi-square random variables, each divided by its degrees of freedom. It is fundamental for comparing the variances of two populations and for analysis of variance (ANOVA).
Definition¶
Let \(X_1^2 \sim \chi^2_{d_1}\) and \(X_2^2 \sim \chi^2_{d_2}\) be independent. Then:
\[
F = \frac{X_1^2 / d_1}{X_2^2 / d_2} \sim F(d_1, d_2)
\]
where \(d_1\) is the numerator degrees of freedom and \(d_2\) is the denominator degrees of freedom.
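Since each \(\chi^2\) variable is a sum of squared independent standard normals, for independent \(Z_1, \dots, Z_{d_1}, W_1, \dots, W_{d_2} \sim N(0, 1)\):
\[
F = \frac{\left(Z_1^2 + \cdots + Z_{d_1}^2\right) / d_1}{\left(W_1^2 + \cdots + W_{d_2}^2\right) / d_2}
\]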
Degrees of Freedom¶
The F distribution depends on two degrees-of-freedom parameters, which distinguishes it from the chi-square distribution.
Numerator (d_1) and Denominator (d_2)¶
- Low \(d_1\) and \(d_2\): Highly right-skewed.
- Increasing \(d_1\) and \(d_2\): The distribution becomes less skewed and concentrates around 1.
- Special case: If \(T \sim t_{d_2}\), then \(T^2 \sim F(1, d_2)\), because \(F(1, d_2) = \frac{\chi^2(1)/1}{\chi^2(d_2)/d_2}\) is a squared standard normal divided by an independent \(\chi^2(d_2)/d_2\).
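The link between \(F(1, d_2)\) and the squared \(t\) variable can be checked numerically. The sketch below compares critical values: the squared two-sided \(t\) critical value should match the upper-tail \(F(1, d_2)\) critical value. The choices `d2 = 10` and `alpha = 0.05` are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

d2 = 10       # denominator degrees of freedom (illustrative choice)
alpha = 0.05  # significance level (illustrative choice)

# Two-sided t critical value, squared: P(T^2 <= c) = P(|T| <= sqrt(c))
t_crit_sq = stats.t(d2).ppf(1 - alpha / 2) ** 2

# Upper-tail critical value of F(1, d2)
f_crit = stats.f(1, d2).ppf(1 - alpha)

print(f"t^2 critical value: {t_crit_sq:.6f}")
print(f"F critical value:   {f_crit:.6f}")
print("Match:", np.isclose(t_crit_sq, f_crit))
```

This is why a two-sided \(t\)-test and the corresponding F-test with one numerator degree of freedom lead to the same decision.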
Comparison with Chi-Square¶
The chi-square distribution has a single degrees-of-freedom parameter controlling its shape. The F distribution has two, one for each of the independent variance-like quantities in the ratio, which gives it a richer family of shapes.
Properties¶
- Non-negativity: \(F \geq 0\) (ratio of non-negative quantities).
- Asymmetry: Positively skewed, especially for small degrees of freedom.
- Mean: \(\frac{d_2}{d_2 - 2}\) for \(d_2 > 2\).
- Mode: \(\frac{d_1 - 2}{d_1} \cdot \frac{d_2}{d_2 + 2}\) for \(d_1 > 2\).
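The mean and mode can be checked against `scipy.stats.f`. This is a minimal sketch assuming \(d_1 = 5\) and \(d_2 = 10\) (the same illustrative values used in the sampling examples); it uses the standard closed form \(\frac{d_1 - 2}{d_1} \cdot \frac{d_2}{d_2 + 2}\) for the mode and locates the PDF maximum on a grid for comparison.

```python
import numpy as np
from scipy import stats

d1, d2 = 5, 10  # illustrative degrees of freedom
dist = stats.f(d1, d2)

# Closed-form values (mean requires d2 > 2, mode requires d1 > 2)
mean_formula = d2 / (d2 - 2)
mode_formula = (d1 - 2) / d1 * d2 / (d2 + 2)

# Locate the mode numerically as the grid point with the highest density
x = np.linspace(0.001, 5, 50_001)
mode_numeric = x[np.argmax(dist.pdf(x))]

print(f"Mean: formula = {mean_formula:.4f}, scipy = {dist.mean():.4f}")
print(f"Mode: formula = {mode_formula:.4f}, grid search = {mode_numeric:.4f}")
```

The mode lies below the mean, which is consistent with the positive skew noted above.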
Random Samples¶
Direct Sampling¶
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(0)

d1, d2 = 5, 10
data = stats.f(d1, d2).rvs(10_000)

fig, ax = plt.subplots(figsize=(12, 3))
bins = np.linspace(0, 5, 100)
ax.hist(data, bins=bins, density=True, alpha=0.7, label='F Samples')
ax.plot(bins, stats.f(d1, d2).pdf(bins), '--r', lw=3, label=f'F({d1},{d2}) PDF')
ax.legend()
plt.show()
```
Sampling from Definition (Ratio of Chi-Squares)¶
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(0)

d1, d2, n = 5, 10, 10_000
f_data = (stats.chi2(d1).rvs(n) / d1) / (stats.chi2(d2).rvs(n) / d2)

fig, ax = plt.subplots(figsize=(12, 3))
bins = np.linspace(0, 5, 100)
ax.hist(f_data, bins=bins, density=True, alpha=0.7, label='χ²-Ratio Samples')
ax.plot(bins, stats.f(d1, d2).pdf(bins), '--r', lw=3, label=f'F({d1},{d2}) PDF')
ax.legend()
plt.show()
```
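A histogram overlay is a visual check only. As a rough numerical complement, the sketch below runs a Kolmogorov-Smirnov test of the chi-square-ratio samples against the \(F(d_1, d_2)\) CDF. Using `np.random.default_rng` instead of the global seed is a choice made here for reproducibility, not part of the examples above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d1, d2, n = 5, 10, 10_000

# Build F samples from the definition: ratio of scaled chi-squares
chi2_num = stats.chi2(d1).rvs(n, random_state=rng)
chi2_den = stats.chi2(d2).rvs(n, random_state=rng)
f_data = (chi2_num / d1) / (chi2_den / d2)

# Compare the empirical distribution with the theoretical F(d1, d2) CDF
ks_stat, p_value = stats.kstest(f_data, stats.f(d1, d2).cdf)

print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")
```

A small KS statistic means the empirical CDF of the ratio samples stays close to the F CDF everywhere, which is what the definition predicts.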
Why F?¶
The F distribution arises naturally when comparing the variances of two independent normal populations.
Step 1: Distribution of Scaled Variances¶
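For independent samples of sizes \(n_1\) and \(n_2\) from normal populations with variances \(\sigma_1^2\) and \(\sigma_2^2\), the sample variances \(S_1^2\) and \(S_2^2\) satisfy:
\[
\frac{(n_1 - 1) S_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1}, \qquad \frac{(n_2 - 1) S_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}
\]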
These are independent because the two samples are independent.
Step 2: Ratio of Chi-Squares¶
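Dividing each scaled variance by its degrees of freedom, \(n_1 - 1\) and \(n_2 - 1\), and taking the ratio:
\[
\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F(n_1 - 1, n_2 - 1)
\]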
Step 3: Under the Null Hypothesis¶
If we test \(H_0: \sigma_1^2 = \sigma_2^2\), the population variances cancel:
\[
F = \frac{S_1^2}{S_2^2} \sim F(n_1 - 1, n_2 - 1)
\]
This is the F-test for equality of two variances.
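The three steps can be sketched as a small function. `variance_f_test` is a hypothetical helper written for this illustration, not a SciPy function, and the simulated samples (standard deviations 1 and 3, sizes 25 and 30) are assumptions chosen so that the null hypothesis is false. The two-sided p-value doubles the smaller tail probability, which is one common convention.

```python
import numpy as np
from scipy import stats

def variance_f_test(x, y):
    """Two-sided F-test of H0: var(x) == var(y), assuming normal data.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    df1, df2 = x.size - 1, y.size - 1

    # Under H0 the ratio of sample variances follows F(df1, df2)
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)

    # Two-sided p-value: double the smaller tail, capped at 1
    dist = stats.f(df1, df2)
    p_value = min(1.0, 2 * min(dist.cdf(f_stat), dist.sf(f_stat)))
    return f_stat, df1, df2, p_value

rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=0.0, scale=1.0, size=25)  # sigma_1 = 1
sample_2 = rng.normal(loc=0.0, scale=3.0, size=30)  # sigma_2 = 3

f_stat, df1, df2, p_value = variance_f_test(sample_1, sample_2)
print(f"F({df1}, {df2}) = {f_stat:.4f}, p-value = {p_value:.4g}")
```

Note the `ddof=1` argument: NumPy's default variance divides by \(n\), whereas the chi-square result applies to the unbiased sample variance that divides by \(n - 1\).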
Why Not Something Else?¶
By definition, the F distribution is the distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom, so under normality it is the exact reference distribution for \(S_1^2 / S_2^2\). The same logic extends to ANOVA, where F ratios measure whether between-group variability is significantly larger than within-group variability.
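To make the ANOVA connection concrete, the sketch below computes the one-way ANOVA F ratio by hand (between-group mean square over within-group mean square) and compares it with `scipy.stats.f_oneway`. The three simulated groups, their means, and their sizes are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three illustrative groups with different means and a common variance
groups = [rng.normal(loc=mu, scale=1.0, size=20) for mu in (0.0, 0.5, 1.0)]

k = len(groups)
n_total = sum(g.size for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F ratio of mean squares, with (k - 1, n_total - k) degrees of freedom
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f(k - 1, n_total - k).sf(f_manual)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.4g}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```

Both mean squares estimate the common variance when all group means are equal, so the ratio follows \(F(k - 1, N - k)\) under that null hypothesis, with \(k\) groups and \(N\) total observations.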
Limitations¶
The F distribution result is exact only under normality:
- Normality required: The chi-square results for each \(S^2\) depend on the population being normal. Without normality, the distribution of \(S_1^2/S_2^2\) deviates from \(F\).
- Small samples, non-normal: The F-test is unreliable. For skewed or heavy-tailed populations, the true Type I error rate can be much higher than the nominal level.
- Robust alternatives: Levene's test, Brown–Forsythe test, and Fligner–Killeen test maintain validity under broader distributional conditions and are widely preferred in practice.
| Scenario | F-Test Validity |
|---|---|
| Normal populations | Exact |
| Large samples, mild non-normality | Approximately valid |
| Small samples, skewed/heavy-tailed | Unreliable; use robust alternatives |
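The sensitivity to non-normality can be illustrated with a small simulation. In this sketch both samples always come from the same population, so \(H_0\) is true and the rejection rate estimates the Type I error. `f_test_rejection_rate` is a hypothetical helper, and the sample size, number of simulations, and choice of an exponential population are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def f_test_rejection_rate(sampler, n=20, n_sims=5_000, alpha=0.05):
    """Fraction of simulated two-sided variance F-tests that reject H0.

    `sampler(size)` draws from one population, so H0 (equal variances)
    holds in every simulated pair. Hypothetical helper for illustration.
    """
    x = sampler((n_sims, n))
    y = sampler((n_sims, n))
    f_stats = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

    # Two-sided rejection region from the F(n - 1, n - 1) quantiles
    dist = stats.f(n - 1, n - 1)
    lower, upper = dist.ppf(alpha / 2), dist.ppf(1 - alpha / 2)
    return np.mean((f_stats < lower) | (f_stats > upper))

rng = np.random.default_rng(0)
rate_normal = f_test_rejection_rate(lambda size: rng.normal(size=size))
rate_exponential = f_test_rejection_rate(lambda size: rng.exponential(size=size))

print(f"Rejection rate, normal data:      {rate_normal:.3f}")
print(f"Rejection rate, exponential data: {rate_exponential:.3f}")
```

For the robust alternatives, SciPy provides `scipy.stats.levene` (with `center='median'` giving the Brown–Forsythe variant) and `scipy.stats.fligner`.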
PPF Example¶
```python
from scipy import stats

# 95th percentile of F(5, 20)
f_95 = stats.f(5, 20).ppf(0.95)
print(f"F_0.95(5, 20) = {f_95:.4f}")
```
Key Takeaways¶
- The F distribution is the distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom.
- It governs comparisons of variances and is the foundation of ANOVA.
- Both numerator and denominator degrees of freedom affect the shape of the distribution.
- The exactness of the F-test depends critically on normality; robust alternatives are often preferred in practice.