Sampling Distribution of the Difference of Two Sample Proportions
Overview
When comparing proportions from two independent populations (e.g., treatment vs control, brand A vs brand B), the relevant statistic is \(\hat{p}_1 - \hat{p}_2\). Its sampling distribution enables confidence intervals and hypothesis tests for the difference \(p_1 - p_2\).
Mathematical Formulation
Let \(\hat{p}_1\) and \(\hat{p}_2\) be sample proportions from two independent samples of sizes \(n_1\) and \(n_2\), from populations with true proportions \(p_1\) and \(p_2\).
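Properties
Expected value:
\[E[\hat{p}_1 - \hat{p}_2] = p_1 - p_2\]
Variance (the variances add because the two samples are independent):
\[\text{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\]
Standard error:
\[\text{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]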
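As a sanity check on the mean and standard error of \(\hat{p}_1 - \hat{p}_2\) (the values listed in the Summary table), the following is a minimal simulation sketch. The population values \(p_1 = 0.60\), \(p_2 = 0.45\), \(n_1 = n_2 = 200\), the seed, and the helper name `simulate_diff` are assumptions chosen for illustration only.

```python
import numpy as np

def simulate_diff(p1, p2, n1, n2, reps=100_000, seed=0):
    """Simulate `reps` values of p1_hat - p2_hat from two independent binomial samples."""
    rng = np.random.default_rng(seed)
    p1_hat = rng.binomial(n1, p1, size=reps) / n1
    p2_hat = rng.binomial(n2, p2, size=reps) / n2
    return p1_hat - p2_hat

# Assumed (illustrative) population parameters
p1, p2, n1, n2 = 0.60, 0.45, 200, 200
diffs = simulate_diff(p1, p2, n1, n2)

# Theoretical standard error from the formula for independent samples
theoretical_se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(f"Simulated mean: {diffs.mean():.4f}  (theory: {p1 - p2:.4f})")
print(f"Simulated SD:   {diffs.std(ddof=1):.4f}  (theory: {theoretical_se:.4f})")
```

The simulated mean and standard deviation should land close to the theoretical values, with small discrepancies due to Monte Carlo error.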
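Normal Approximation
For sufficiently large \(n_1\) and \(n_2\) (with \(n_i p_i \geq 5\) and \(n_i(1-p_i) \geq 5\) for both \(i\)), the difference is approximately normal with the mean and variance given above:
\[\hat{p}_1 - \hat{p}_2 \;\overset{\text{approx}}{\sim}\; N\!\left(p_1 - p_2,\; \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right)\]
In practice, since \(p_1\) and \(p_2\) are unknown, we substitute the sample proportions to obtain the estimated standard error:
\[\widehat{\text{SE}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]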
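One way to see how well the normal approximation works is to standardize simulated differences and compare them with \(N(0, 1)\). The sketch below does this under assumed population values (\(p_1 = 0.60\), \(p_2 = 0.45\), \(n_1 = n_2 = 200\)); the parameters and seed are illustrative, not part of the method.

```python
import numpy as np
from scipy import stats

# Assumed (illustrative) population values; not taken from real data
p1, p2, n1, n2 = 0.60, 0.45, 200, 200
rng = np.random.default_rng(1)
reps = 100_000

# Simulate the difference of sample proportions from independent samples
diffs = rng.binomial(n1, p1, size=reps) / n1 - rng.binomial(n2, p2, size=reps) / n2

# Standardize with the true mean and the true standard error
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (diffs - (p1 - p2)) / se

# Under the normal approximation, about 95% of z-values fall within +/- z_star
z_star = stats.norm.ppf(0.975)
within = np.mean(np.abs(z) <= z_star)
print(f"Share of |z| <= {z_star:.2f}: {within:.3f}")
```

Because the simulated differences are discrete, the share will not equal 0.95 exactly, but it should be close when the sample-size conditions hold.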
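Confidence Interval
A \((1 - \alpha)\) confidence interval for \(p_1 - p_2\):
\[(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \widehat{\text{SE}}\]
where \(z^* = z_{\alpha/2}\) is the upper \(\alpha/2\) critical value of the standard normal distribution and \(\widehat{\text{SE}} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}\).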
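A minimal sketch of this interval as a reusable function, assuming the data arrive as success counts and sample sizes. The function name `two_proportion_ci` and the counts in the loop are hypothetical, chosen only to show how the interval widens with the confidence level.

```python
import numpy as np
from scipy import stats

def two_proportion_ci(x1, n1, x2, n2, conf_level=0.95):
    """Wald interval for p1 - p2 from success counts x1, x2 and sample sizes n1, n2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Estimated (unpooled) standard error of p1_hat - p2_hat
    se_hat = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    alpha = 1 - conf_level
    z_star = stats.norm.ppf(1 - alpha / 2)
    diff = p1_hat - p2_hat
    return diff - z_star * se_hat, diff + z_star * se_hat

# Hypothetical counts, for illustration only
for level in (0.90, 0.95, 0.99):
    lower, upper = two_proportion_ci(45, 150, 30, 120, conf_level=level)
    print(f"{level:.0%} CI: ({lower:.3f}, {upper:.3f})")
```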
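Hypothesis Testing
For General \(H_0: p_1 - p_2 = d_0\)
Use the estimated standard error:
\[z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\]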
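The general test can be sketched as follows. The function name `two_proportion_z_test`, the two-sided alternative, the illustrative counts (120 of 200 and 90 of 200), and the null value \(d_0 = 0.05\) are all assumptions made for this example.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(x1, n1, x2, n2, d0=0.0):
    """Two-sided z-test of H0: p1 - p2 = d0 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se_hat = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = ((p1_hat - p2_hat) - d0) / se_hat
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Illustrative counts and a hypothetical null difference d0 = 0.05
z, p_value = two_proportion_z_test(120, 200, 90, 200, d0=0.05)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

For a one-sided alternative, the p-value would instead be `stats.norm.sf(z)` or `stats.norm.cdf(z)`, depending on the direction.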
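For \(H_0: p_1 = p_2\) (Special Case)
Under the null, \(p_1 = p_2 = p\). Use the pooled proportion:
\[\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}\]
where \(x_1\) and \(x_2\) are the numbers of successes in the two samples. The test statistic is
\[z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\]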
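A minimal sketch of the pooled test, assuming success counts as input and a two-sided alternative. The function name `pooled_z_test` and the counts used (120 of 200 and 90 of 200) are illustrative.

```python
import numpy as np
from scipy import stats

def pooled_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Pooled estimate of the common proportion under H0
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pool
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Illustrative counts
z, p_value = pooled_z_test(120, 200, 90, 200)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

The pooled standard error is used only for testing \(H_0: p_1 = p_2\); the confidence interval still uses the unpooled estimate.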
Conditions for Validity
The normal approximation requires all four of the following:
\[n_1 p_1 \geq 5, \quad n_1(1-p_1) \geq 5, \quad n_2 p_2 \geq 5, \quad n_2(1-p_2) \geq 5\]
When these conditions are not met, exact methods (Fisher's exact test) or simulation-based approaches should be used.
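A sketch of how such a check and fallback might look in code. It assumes the observed success and failure counts are used as plug-ins for \(n_i p_i\) and \(n_i(1-p_i)\), since the true proportions are unknown; the helper name `normal_approx_ok` and the small-sample counts are hypothetical. `scipy.stats.fisher_exact` is applied to the 2×2 table of successes and failures.

```python
from scipy import stats

def normal_approx_ok(x1, n1, x2, n2, threshold=5):
    """Check the four success/failure counts, using observed counts as plug-ins for n_i * p_i."""
    counts = [x1, n1 - x1, x2, n2 - x2]
    return all(c >= threshold for c in counts)

# Hypothetical small-sample data, for illustration only
x1, n1, x2, n2 = 3, 10, 7, 12

if normal_approx_ok(x1, n1, x2, n2):
    print("Counts are large enough for the normal approximation.")
else:
    # Rows: groups; columns: successes, failures
    table = [[x1, n1 - x1], [x2, n2 - x2]]
    odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
    print(f"Fisher's exact test p-value: {p_value:.4f}")
```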
Example
Problem. In a study, 120 out of 200 patients in group 1 responded to treatment (\(\hat{p}_1 = 0.60\)), while 90 out of 200 patients in group 2 responded (\(\hat{p}_2 = 0.45\)). Find a 95% confidence interval for \(p_1 - p_2\).
Solution.
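```python
import numpy as np
from scipy import stats

# Sample proportions and sample sizes from the problem statement
p1_hat, p2_hat = 0.60, 0.45
n1, n2 = 200, 200

# Point estimate and estimated (unpooled) standard error
diff = p1_hat - p2_hat
se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# Critical value for a 95% interval (upper 2.5% point of N(0, 1))
z_star = stats.norm.ppf(0.975)
ci_lower = diff - z_star * se
ci_upper = diff + z_star * se

print(f"Difference: {diff:.2f}")
print(f"SE: {se:.4f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
```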
Since the 95% confidence interval, approximately \((0.053, 0.247)\), does not contain 0, the data provide statistically significant evidence at the 5% level that the treatment response rates differ between the two groups.
Summary
| Property | Result |
|---|---|
| \(E[\hat{p}_1 - \hat{p}_2]\) | \(p_1 - p_2\) |
| \(\text{SE}(\hat{p}_1 - \hat{p}_2)\) | \(\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\) |
| Distribution (large \(n\)) | Approximately \(N(0, 1)\) after standardization |
| CI formula | \((\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \widehat{\text{SE}}\) |
| Validity condition | \(n_i p_i \geq 5\) and \(n_i(1-p_i) \geq 5\) for both groups |