The Bootstrap Resampling Method
Overview
The bootstrap is a powerful, distribution-free method for estimating the sampling distribution of a statistic without making parametric assumptions. By repeatedly resampling (with replacement) from a single sample, we can approximate the sampling distribution and compute standard errors, confidence intervals, and other inferential quantities.
Core Idea
The bootstrap rests on a simple principle: the empirical distribution of our sample is a reasonable estimate of the true population distribution. If we repeatedly sample from our observed sample (with replacement), the variability in our computed statistics approximates the true sampling variability.
Key insight: The bootstrap trades computational effort for avoiding strong distributional assumptions.
Bootstrap Algorithm
Given a sample of \(n\) observations:
1. Resample: Draw \(n\) observations from the sample with replacement, creating a bootstrap sample
2. Compute: Calculate the statistic of interest on the bootstrap sample
3. Repeat: Perform steps 1-2 a large number of times \(B\) (typically 500-10,000)
4. Analyze: The collection of computed statistics forms the bootstrap distribution
Original Sample: X₁, X₂, ..., Xₙ
↓
├→ Bootstrap Sample 1* → Statistic θ₁*
├→ Bootstrap Sample 2* → Statistic θ₂*
├→ Bootstrap Sample 3* → Statistic θ₃*
└→ Bootstrap Sample B* → Statistic θ_B*
↓
Bootstrap Distribution: {θ₁*, θ₂*, ..., θ_B*}
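The resample, compute, repeat, analyze loop above can be sketched as a small generic helper. This is a minimal illustration, not a reference implementation: the function name `bootstrap_distribution`, its parameters, and the ten data values are assumptions made up for this sketch.

```python
import numpy as np

def bootstrap_distribution(sample, statistic, n_boot=1000, seed=0):
    """Return n_boot bootstrap replicates of statistic(sample)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample: draw n indices with replacement
        idx = rng.integers(0, n, size=n)
        # Compute: evaluate the statistic on the bootstrap sample
        replicates[b] = statistic(sample[idx])
    # Analyze: the collected replicates form the bootstrap distribution
    return replicates

# Illustrative data (made up for this sketch)
data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9])
boot = bootstrap_distribution(data, np.median, n_boot=2000)
print(f"Observed median: {np.median(data):.2f}")
print(f"Bootstrap SE:    {boot.std(ddof=1):.2f}")
```

Passing the statistic as a function keeps the resampling loop independent of what is being estimated, so the same helper could be reused with `np.mean` or a custom estimator.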
Example: Bootstrap Distribution of the Median
The median is particularly useful when data contain outliers or aren't normally distributed. Unlike the mean, whose standard error is simply \(s/\sqrt{n}\), the median has no simple closed-form standard error: its large-sample formula depends on the unknown population density at the median. The bootstrap sidesteps this problem.
import numpy as np
import pandas as pd
from sklearn.utils import resample
# Set random seed for reproducibility
np.random.seed(seed=1)
# Simulate income data (realistic for illustrating robustness of median);
# wrap the array in a Series so that .median() is available
loans_income = pd.Series(np.random.exponential(scale=50000, size=5000) + 20000)
# Compute original sample median
original_median = loans_income.median()
print(f"Original sample median: ${original_median:,.0f}")
print()
# Bootstrap procedure: resample 1000 times
bootstrap_medians = []
for nrepeat in range(1000):
    # Resample with replacement
    bootstrap_sample = resample(loans_income)  # n_samples defaults to len(loans_income)
    # Compute median of bootstrap sample
    bootstrap_medians.append(bootstrap_sample.median())
bootstrap_medians = pd.Series(bootstrap_medians)
# Compute bootstrap statistics
bootstrap_mean = bootstrap_medians.mean()
bootstrap_std = bootstrap_medians.std()
bias = bootstrap_mean - original_median
print(f"Bootstrap Statistics:")
print(f" Mean of bootstrap distribution: ${bootstrap_mean:,.0f}")
print(f" Standard error of median: ${bootstrap_std:,.0f}")
print(f" Bias of median estimator: ${bias:,.0f}")
print()
# The bootstrap distribution can be visualized
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(bootstrap_medians, bins=40, color='steelblue', edgecolor='black', alpha=0.7)
ax.axvline(original_median, color='darkred', linestyle='--', linewidth=2.5, label=f'Original median: ${original_median:,.0f}')
ax.axvline(bootstrap_mean, color='green', linestyle='--', linewidth=2.5, label=f'Bootstrap mean: ${bootstrap_mean:,.0f}')
ax.set_xlabel('Median Income ($)', fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title('Bootstrap Distribution of the Sample Median', fontsize=12, fontweight='bold')
ax.legend(fontsize=10)
ax.spines[['top', 'right']].set_visible(False)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
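The overview mentions confidence intervals as one use of the bootstrap distribution. A minimal sketch of the simplest variant, the percentile interval, follows. It regenerates its own simulated incomes with NumPy's `default_rng`, so the numbers will not match the example above exactly; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated incomes with the same shape as the example data (an assumption)
income = rng.exponential(scale=50000, size=5000) + 20000

# Bootstrap replicates of the median
n = len(income)
boot_medians = np.array([
    np.median(rng.choice(income, size=n, replace=True))
    for _ in range(1000)
])

# 95% percentile interval: the 2.5th and 97.5th percentiles of the replicates
ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"Sample median:     ${np.median(income):,.0f}")
print(f"95% percentile CI: (${ci_lower:,.0f}, ${ci_upper:,.0f})")
```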
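Interpreting Bootstrap Results
Standard Error
The standard deviation of the bootstrap distribution estimates the standard error of the statistic:
\[\widehat{\mathrm{SE}}_{\text{boot}} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\theta_b^* - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \theta_b^*\]
In the example above, this estimates the typical variation in the median across different samples from the population.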
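Bias
If the mean of the bootstrap distribution differs from the original statistic, the difference estimates the bias of the estimator:
\[\widehat{\mathrm{bias}}_{\text{boot}} = \bar{\theta}^* - \hat{\theta}\]
where \(\bar{\theta}^*\) is the mean of the bootstrap statistics and \(\hat{\theta}\) is the statistic computed on the original sample. In the example above, a small negative value would suggest the sample median is slightly downward-biased (though this depends on the specific data).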
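To make the bias estimate concrete, the sketch below uses an estimator whose bias is known to be negative: the plug-in variance (dividing by \(n\) rather than \(n-1\)). The sample, the helper name `plug_in_var`, and the simple correction `theta_hat - bias_hat` are illustrative assumptions, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=20)  # illustrative sample

def plug_in_var(a):
    # Variance with divisor n (ddof=0), which underestimates on average
    return np.var(a)

theta_hat = plug_in_var(x)

# Bootstrap replicates of the plug-in variance
boot = np.array([
    plug_in_var(rng.choice(x, size=len(x), replace=True))
    for _ in range(5000)
])

# Bootstrap bias estimate: mean of replicates minus the original statistic
bias_hat = boot.mean() - theta_hat
# A simple bias-corrected estimate subtracts the estimated bias
corrected = theta_hat - bias_hat

print(f"Plug-in variance:        {theta_hat:.3f}")
print(f"Bootstrap bias estimate: {bias_hat:.3f}")
print(f"Bias-corrected estimate: {corrected:.3f}")
```

Bias correction tends to add variance, so whether to apply it is a judgment call that depends on how large the estimated bias is relative to the standard error.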
Bootstrap Distribution Shape
The shape of the bootstrap distribution reveals:
- Skewness: Non-symmetry indicates asymmetric sampling distribution
- Heavy tails: Suggests the statistic is sensitive to outliers
- Multimodality: Can indicate clustered data or multiple population modes
Advantages of Bootstrap
1. Distribution-Free
No assumption of normality or specific distributional form required. The bootstrap works for:
- Non-normal data
- Skewed distributions
- Heavy-tailed distributions (with care: the bootstrap of the mean can fail when the population variance is infinite)
- Most other population shapes
2. General Applicability
Works for a wide range of statistics, not just the mean:
- Median
- Correlation coefficient
- Ratio statistics
- Quantiles and percentiles
- Custom estimators
3. Intuitive and Transparent
The procedure directly estimates sampling variability without requiring theoretical derivations. This makes it:
- Easy to understand conceptually
- Easy to explain to non-statisticians
- Easy to implement and verify
Comparison with Parametric Approaches
| Aspect | Bootstrap | Parametric (Theory-based) |
|---|---|---|
| Assumptions | Minimal (IID sample) | Strong (normality, known variance, etc.) |
| Applicability | Most statistics, including those with no standard error formula | Limited to standard statistics |
| Computation | Resampling (intensive) | Formulas (fast) |
| Validity | Asymptotic in \(n\); Monte Carlo error shrinks with \(B\) | Exact or asymptotic |
| Implementation | Simple code | Requires mathematical knowledge |
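Where a theory-based formula exists, the two approaches can be checked against each other. The sketch below does this for the sample mean, whose standard error has the textbook formula \(s/\sqrt{n}\). The skewed sample and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=10.0, size=200)  # skewed illustrative sample

# Theory-based standard error of the mean: s / sqrt(n)
se_formula = x.std(ddof=1) / np.sqrt(len(x))

# Bootstrap standard error: standard deviation of the resampled means
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean()
    for _ in range(2000)
])
se_boot = boot_means.std(ddof=1)

print(f"Formula SE:   {se_formula:.3f}")
print(f"Bootstrap SE: {se_boot:.3f}")
```

The two values should agree closely here. The practical advantage of the bootstrap shows up for statistics such as the median, where no comparably simple formula is available.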
Practical Considerations
Sample Size Requirements
The bootstrap requires that the original sample be reasonably representative of the population. It works poorly when:
- Sample size is very small (\(n < 30\)) relative to population variability
- Extreme values are missing from the sample
- The sample is biased or non-random
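Number of Bootstrap Replications
Rule of thumb: Use \(B \geq 500\) for standard errors and \(B \geq 1000\) for confidence intervals. Increasing \(B\) only reduces the Monte Carlo noise of the resampling; it does not compensate for a small or unrepresentative original sample.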
For extreme quantiles (e.g., 99th percentile), use \(B \geq 5000\).
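# Standard error with different B values
import numpy as np
np.random.seed(1)
# Regenerate the simulated income data so this snippet runs on its own
loans_income = np.random.exponential(scale=50000, size=5000) + 20000
for B in [100, 500, 1000, 5000]:
    # B bootstrap medians, each from a resample of the full sample size
    boot_medians = np.array([
        np.median(np.random.choice(loans_income, size=len(loans_income), replace=True))
        for _ in range(B)
    ])
    se = boot_medians.std(ddof=1)
    print(f"B = {B:5d}: SE = ${se:8,.0f}")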
Computational Cost
Modern computers can easily handle 10,000 bootstrap replications for standard statistics. For more computationally intensive operations (e.g., fitting complex models), start with \(B = 500\) and increase if necessary.
Limitations and Pitfalls
- Bootstrap can't estimate extremes well: For the maximum of a sample, the bootstrap max is always ≤ the observed max
- Dependent data: The standard bootstrap assumes independent observations; time series or clustered data require modifications (block bootstrap)
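- Bias is estimated, not removed: The bootstrap can estimate the bias of an estimator, but a biased estimator remains biased in every bootstrap sample, and standard errors and percentile intervals are not corrected for bias automatically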
- Small samples: With very small samples, the empirical distribution may poorly represent the population
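The first limitation is easy to see numerically. In the sketch below (illustrative uniform data, made-up variable names), no bootstrap maximum exceeds the observed maximum, and a large share of replicates equal it exactly. For a sample with a unique maximum that share is \(1 - (1 - 1/n)^n\), roughly 0.63 for moderate \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)  # illustrative sample
observed_max = x.max()

# Bootstrap replicates of the sample maximum
boot_max = np.array([
    rng.choice(x, size=len(x), replace=True).max()
    for _ in range(5000)
])

# Share of replicates that reproduce the observed maximum exactly
share_equal = np.mean(boot_max == observed_max)
theory = 1 - (1 - 1 / len(x)) ** len(x)

print(f"Largest bootstrap max:      {boot_max.max():.4f}")
print(f"Observed max:               {observed_max:.4f}")
print(f"Share equal to observed:    {share_equal:.3f}")
print(f"Theory, 1 - (1 - 1/n)^n:    {theory:.3f}")
```

A bootstrap distribution with this much mass on a single point is a poor stand-in for the smooth sampling distribution of the true maximum.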
Extensions and Variants
Block Bootstrap
For time series or clustered data:
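import numpy as np
def block_bootstrap(data, block_size, n_bootstrap):
    """Non-overlapping block bootstrap: resample contiguous blocks with replacement."""
    data = np.asarray(data)
    n = len(data)
    # Split the series into consecutive blocks (the last block may be shorter)
    blocks = [data[i:(i + block_size)] for i in range(0, n, block_size)]
    bootstrap_samples = []
    for _ in range(n_bootstrap):
        # Draw block indices with replacement, then stitch the chosen blocks together
        resampled_blocks = np.random.choice(len(blocks), size=len(blocks), replace=True)
        bootstrap_sample = np.concatenate([blocks[i] for i in resampled_blocks])[:n]
        bootstrap_samples.append(bootstrap_sample)
    return bootstrap_samples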
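To see why resampling blocks matters, the sketch below compares the ordinary bootstrap with a non-overlapping block bootstrap on a simulated autocorrelated series. The AR(1) parameters, the block size of 50, and the helper names `iid_boot_se` and `block_boot_se` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series x_t = phi * x_{t-1} + e_t (illustrative parameters)
n, phi = 2000, 0.8
noise = rng.normal(size=n)
x = np.empty(n)
x[0] = noise[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

def iid_boot_se(series, n_boot=500):
    # Ordinary bootstrap: resamples single observations, ignoring time order
    m = len(series)
    means = [series[rng.integers(0, m, size=m)].mean() for _ in range(n_boot)]
    return np.std(means, ddof=1)

def block_boot_se(series, block_size, n_boot=500):
    # Non-overlapping blocks; assumes len(series) is a multiple of block_size
    blocks = series.reshape(-1, block_size)
    k = len(blocks)
    means = [blocks[rng.integers(0, k, size=k)].mean() for _ in range(n_boot)]
    return np.std(means, ddof=1)

se_iid = iid_boot_se(x)
se_block = block_boot_se(x, block_size=50)
print(f"SE of mean, ordinary bootstrap: {se_iid:.4f}")
print(f"SE of mean, block bootstrap:    {se_block:.4f}")
```

With positive autocorrelation the ordinary bootstrap understates the standard error of the mean, because it destroys the dependence between neighbouring observations. The block size is a tuning choice: blocks need to be long enough to capture the dependence but numerous enough to resample from.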
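Percentile-t Bootstrap
The percentile-t (studentized) bootstrap often gives more accurate confidence intervals than the plain percentile method, provided each bootstrap sample comes with its own standard error estimate:
# Inputs (computed beforehand): bootstrap_statistics and bootstrap_ses are arrays holding
# the statistic and its standard error for each bootstrap sample; original_statistic and
# original_se are the same quantities for the observed sample.
# Studentize each replicate, then use quantiles of the bootstrap t-distribution
bootstrap_t_stats = (bootstrap_statistics - original_statistic) / bootstrap_ses
# The upper t-quantile sets the lower limit, and the lower t-quantile the upper limit
ci_lower = original_statistic - np.percentile(bootstrap_t_stats, 97.5) * original_se
ci_upper = original_statistic - np.percentile(bootstrap_t_stats, 2.5) * original_se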
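A runnable version of this idea needs a statistic whose standard error can be computed inside every bootstrap sample. The sketch below uses the sample mean, with \(s/\sqrt{n}\) as the standard error. The data and the helper `mean_and_se` are illustrative assumptions; for statistics without a formula, the per-sample standard error would typically need a nested bootstrap.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=50)  # skewed illustrative sample
n = len(x)

def mean_and_se(a):
    # Sample mean and its formula-based standard error s / sqrt(n)
    return a.mean(), a.std(ddof=1) / np.sqrt(len(a))

theta_hat, se_hat = mean_and_se(x)

# Studentized statistic for each bootstrap sample
n_boot = 2000
t_stats = np.empty(n_boot)
for b in range(n_boot):
    xb = rng.choice(x, size=n, replace=True)
    theta_b, se_b = mean_and_se(xb)
    t_stats[b] = (theta_b - theta_hat) / se_b

# Invert the bootstrap t-quantiles around the original estimate
t_lo, t_hi = np.percentile(t_stats, [2.5, 97.5])
ci_lower = theta_hat - t_hi * se_hat
ci_upper = theta_hat - t_lo * se_hat

print(f"Sample mean:          {theta_hat:.2f}")
print(f"95% percentile-t CI:  ({ci_lower:.2f}, {ci_upper:.2f})")
```

Because the t-quantiles come from the data rather than from a symmetric reference distribution, the resulting interval need not be symmetric around the estimate.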
Summary
The bootstrap is a foundational tool in modern statistics:
- Intuitive method based on resampling from the data
- Widely applicable across statistics and distributions
- Computationally accessible with modern computing power
- Distribution-free and makes minimal assumptions
It trades computational effort for avoiding strong parametric assumptions, making it invaluable when theory-based methods are inadequate or unavailable.