Percentiles and Quantiles
Describing a dataset by its mean and standard deviation captures only the center and spread. Percentiles and quantiles reveal the full shape of a distribution by specifying the value below which a given fraction of observations falls. These measures underpin box plots, confidence intervals, and many nonparametric methods. This page defines percentiles and quantiles formally, explains the interpolation methods used to compute them, and demonstrates their use with NumPy and SciPy.
Mental Model
A percentile answers the question "what value separates the bottom \(k\)% from the top?" Sort your data, walk \(k\)% of the way through, and read off the value. The median is the 50th percentile, quartiles split the data into fourths, and together these landmarks sketch the distribution's shape without any normality assumption.
Definitions
Quantile Function
For a random variable \(X\) with cumulative distribution function \(F(x) = P(X \le x)\), the quantile function is the generalized inverse

\[
Q(p) = F^{-1}(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}, \qquad 0 < p < 1.
\]
The value \(Q(p)\) is called the \(p\)-th quantile (or \(100p\)-th percentile) of the distribution.
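As a rough illustration, the generalized inverse can be applied to the empirical CDF of a small sample by reading \(Q(p)\) as the smallest observation whose cumulative fraction reaches \(p\). The function name empirical_quantile and the sample values below are illustrative assumptions, not part of any library.

```python
import numpy as np

def empirical_quantile(sample, p):
    """Smallest observation x whose empirical CDF value F(x) is at least p."""
    xs = np.sort(np.asarray(sample))
    n = xs.size
    # Empirical CDF evaluated at each order statistic: 1/n, 2/n, ..., 1
    ecdf = np.arange(1, n + 1) / n
    # Index of the first order statistic with ecdf >= p
    idx = np.searchsorted(ecdf, p, side="left")
    return xs[min(idx, n - 1)]

sample = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # illustrative values
for p in (0.25, 0.50, 0.75):
    print(f"Q({p}) = {empirical_quantile(sample, p)}")
```

This rule always returns one of the observed values. The interpolating estimators discussed later on this page may instead return a value between two observations.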
Percentile
A percentile is a quantile expressed on a 0-to-100 scale. The \(k\)-th percentile is the value at or below which \(k\%\) of the observations fall:

\[
P_k = Q\!\left(\frac{k}{100}\right), \qquad 0 < k < 100.
\]
Common Named Quantiles
| Name | Quantile | Percentile |
|---|---|---|
| Median | \(Q(0.5)\) | 50th |
| Quartiles | \(Q(0.25),\; Q(0.5),\; Q(0.75)\) | 25th, 50th, 75th |
| Deciles | \(Q(0.1),\; Q(0.2),\; \ldots,\; Q(0.9)\) | 10th, 20th, ..., 90th |
Computing Quantiles from a Sample
Given an ordered sample \(x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\), the empirical quantile at probability \(p\) typically requires interpolation because \(np\) is not always an integer.
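Linear Interpolation (Default)
NumPy uses the following linear interpolation by default, as does scipy.stats.scoreatpercentile. For a desired probability \(p\), compute the virtual index

\[
h = (n - 1)\,p + 1
\]

and interpolate between adjacent order statistics:

\[
\hat{Q}(p) = x_{(\lfloor h \rfloor)} + \bigl(h - \lfloor h \rfloor\bigr)\bigl(x_{(\lfloor h \rfloor + 1)} - x_{(\lfloor h \rfloor)}\bigr).
\]

When \(h\) is an integer the fractional part vanishes and \(\hat{Q}(p) = x_{(h)}\); in particular \(\hat{Q}(0) = x_{(1)}\) and \(\hat{Q}(1) = x_{(n)}\).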
```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# Quartiles using default linear interpolation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(f"Q1 = {q1}, Median = {q2}, Q3 = {q3}")

# Equivalent using np.quantile with fractions
print(np.quantile(data, [0.25, 0.50, 0.75]))
```
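The linear interpolation rule can also be worked through by hand. The sketch below uses a 0-based fractional position \((n - 1)p\) in the sorted array, which matches Python indexing, and compares the result with np.quantile. The helper name linear_quantile is a hypothetical one chosen for this example.

```python
import numpy as np

def linear_quantile(sample, p):
    """Linearly interpolated quantile at a 0-based fractional position."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    h = (n - 1) * p                 # fractional position in the sorted array
    lo = int(np.floor(h))           # index of the lower bracketing value
    hi = min(lo + 1, n - 1)         # index of the upper bracketing value
    frac = h - lo                   # how far h lies between lo and hi
    return xs[lo] + frac * (xs[hi] - xs[lo])

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
for p in (0.25, 0.50, 0.75):
    print(p, linear_quantile(data, p), np.quantile(data, p))
```

For \(p = 0.25\) the position is \(9 \times 0.25 = 2.25\), so the estimate lies a quarter of the way from 36 to 39, giving 36.75.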
Interpolation Methods
NumPy supports several interpolation strategies through the method parameter:
```python
# Compare the first quartile under each interpolation method
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
for m in methods:
    val = np.percentile(data, 25, method=m)
    print(f"  {m:10s}: Q1 = {val}")
```
| Method | Rule |
|---|---|
| linear | Linear interpolation between adjacent order statistics |
| lower | Take the lower of the two bracketing values |
| higher | Take the higher of the two bracketing values |
| midpoint | Average of the lower and higher values |
| nearest | Take the nearest order statistic |
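To make the rules in the table concrete, the following sketch locates the two bracketing values for the 25th percentile by hand and applies four of the rules directly. It assumes a NumPy version that accepts the method keyword; nearest is left out because its tie-breaking behaviour is not covered here.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
xs = np.sort(data)
h = (xs.size - 1) * 0.25            # fractional position for the 25th percentile
lower = xs[int(np.floor(h))]        # bracketing value below h
higher = xs[int(np.ceil(h))]        # bracketing value above h

by_hand = {
    "lower": lower,
    "higher": higher,
    "midpoint": (lower + higher) / 2,
    "linear": lower + (h - np.floor(h)) * (higher - lower),
}
for name, value in by_hand.items():
    numpy_value = np.percentile(data, 25, method=name)
    print(f"{name:9s} by hand = {value}, NumPy = {numpy_value}")
```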
Five-Number Summary
The five-number summary consists of the minimum, \(Q_1\), median, \(Q_3\), and maximum. Together with the interquartile range \(\text{IQR} = Q_3 - Q_1\), these values provide a robust sketch of the data distribution.
```python
summary = np.percentile(data, [0, 25, 50, 75, 100])
labels = ['Min', 'Q1', 'Median', 'Q3', 'Max']
for label, val in zip(labels, summary):
    print(f"  {label:7s}: {val}")

# IQR
iqr = stats.iqr(data)
print(f"  IQR    : {iqr}")
```
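One common use of the IQR is to set outlier fences at 1.5 IQRs beyond the quartiles. The sketch below applies that convention to the same sample; the 1.5 multiplier is a rule of thumb rather than a fixed requirement.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences; the multiplier is a rule of thumb
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged values: {outliers}")
```

With this sample the two smallest values, 7 and 15, fall below the lower fence of 27.75 and would be drawn as individual points on a box plot.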
SciPy Quantile Functions
scipy.stats.mstats.mquantiles
The mquantiles function supports multiple quantile estimation methods parametrized by plotting positions:
```python
from scipy.stats.mstats import mquantiles

# Default (Cunnane method, alphap=0.4, betap=0.4)
q = mquantiles(data, prob=[0.25, 0.5, 0.75])
print("Cunnane quartiles:", q)

# Hazen method (alphap=0.5, betap=0.5)
q_hazen = mquantiles(data, prob=[0.25, 0.5, 0.75], alphap=0.5, betap=0.5)
print("Hazen quartiles:  ", q_hazen)
```
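To see how alphap and betap enter the estimate, here is a minimal sketch of the plotting-position rule as it is described in the SciPy documentation for mquantiles: the fractional position is \(np + m\) with \(m = \text{alphap} + p\,(1 - \text{alphap} - \text{betap})\). The helper plotting_position_quantiles is hypothetical, and the edge-case clipping is an assumption about how the library treats the tails.

```python
import numpy as np
from scipy.stats.mstats import mquantiles

def plotting_position_quantiles(sample, prob, alphap=0.4, betap=0.4):
    """Quantile estimate from the (alphap, betap) plotting-position family."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    p = np.asarray(prob, dtype=float)
    m = alphap + p * (1.0 - alphap - betap)
    aleph = n * p + m                        # 1-based fractional position
    j = np.floor(np.clip(aleph, 1, n - 1)).astype(int)
    gamma = np.clip(aleph - j, 0.0, 1.0)     # interpolation weight
    return (1.0 - gamma) * xs[j - 1] + gamma * xs[j]

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
probs = [0.25, 0.5, 0.75]
for a, b in [(0.4, 0.4), (0.5, 0.5)]:
    by_hand = plotting_position_quantiles(data, probs, a, b)
    library = np.asarray(mquantiles(data, prob=probs, alphap=a, betap=b))
    print(f"alphap={a}, betap={b}: by hand {by_hand}, mquantiles {library}")
```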
scipy.stats.scoreatpercentile
```python
# Percentile score (older API, still available)
p90 = stats.scoreatpercentile(data, 90)
print(f"90th percentile: {p90}")

# Inverse: what percentile does a given value correspond to?
pct = stats.percentileofscore(data, 40)
print(f"Score 40 is at the {pct:.1f}th percentile")
```
Percentile Ranks
The percentile rank of a value \(v\) in a dataset of \(n\) observations is the percentage of observations that are less than or equal to \(v\):

\[
\text{PR}(v) = \frac{100}{n}\,\#\{\,i : x_i \le v\,\}.
\]
```python
# Percentile rank of specific values
for value in [30, 40, 50]:
    rank = stats.percentileofscore(data, value, kind='weak')
    print(f"  Value {value}: {rank:.1f}th percentile")
```
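The kind argument changes how observations equal to the value are counted. As a small check, the sketch below computes the "less than or equal" and "strictly less than" percentages by hand for the value 40 and prints what percentileofscore reports for each kind.

```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
value = 40

weak_by_hand = 100 * np.mean(data <= value)    # share of values <= 40
strict_by_hand = 100 * np.mean(data < value)   # share of values < 40
print(f"by hand: weak = {weak_by_hand:.1f}, strict = {strict_by_hand:.1f}")

for kind in ["weak", "strict", "mean", "rank"]:
    rank = stats.percentileofscore(data, value, kind=kind)
    print(f"{kind:6s}: {rank:.1f}")
```

Because 40 appears in the sample, the weak and strict ranks differ by one observation's worth of percentage (here 10 points); for a value not in the sample they coincide.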
Quantiles of Standard Distributions
For parametric distributions, the quantile function (percent point function) is available directly through the .ppf() method:
```python
# Normal distribution quantiles
z_90 = stats.norm.ppf(0.90)
z_95 = stats.norm.ppf(0.95)
z_975 = stats.norm.ppf(0.975)
print(f"Normal z-values: z_90={z_90:.4f}, z_95={z_95:.4f}, z_975={z_975:.4f}")

# t-distribution quantile (used in confidence intervals)
t_crit = stats.t.ppf(0.975, df=9)
print(f"t critical (df=9, 97.5%): {t_crit:.4f}")
```
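As a sketch of how such a critical value is used, the example below builds a two-sided 95% confidence interval for the mean of the ten-value sample and checks that ppf inverts cdf. The interval assumes roughly normal data, which this small skewed sample may not satisfy, so treat the numbers as illustrative only.

```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
n = data.size
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)        # standard error of the mean

t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")

# ppf is the inverse of cdf for continuous distributions
round_trip = stats.norm.cdf(stats.norm.ppf(0.975))
print(f"cdf(ppf(0.975)) = {round_trip:.4f}")
```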
Summary
Percentiles and quantiles characterize how data values distribute across the range of observations. The quantile function \(Q(p)\) generalizes the median to arbitrary probability levels, while percentile ranks perform the inverse mapping from values to probabilities. NumPy provides flexible computation with multiple interpolation methods, and SciPy extends this with mquantiles for alternative estimation approaches and .ppf() for theoretical distribution quantiles. These tools form the foundation for box plots, outlier fences, confidence intervals, and distribution diagnostics throughout statistical analysis.
Exercises
Exercise 1.
Generate 200 samples from an exponential distribution with \(\lambda = 0.5\). Compute the 10th, 25th, 50th, 75th, and 90th percentiles. Compare the 50th percentile with np.median().
Solution to Exercise 1
```python
import numpy as np
from scipy import stats

np.random.seed(42)
# lambda = 0.5 corresponds to scale = 1 / lambda = 2
data = stats.expon.rvs(scale=2, size=200)
percs = np.percentile(data, [10, 25, 50, 75, 90])
print(f"Percentiles: {percs.round(4)}")
print(f"Median (np.median): {np.median(data):.4f}")
print(f"50th percentile: {percs[2]:.4f}")
```
Exercise 2. Compute the five-number summary (min, Q1, median, Q3, max) for 300 samples from a \(t\)-distribution with 5 degrees of freedom. Compute the IQR and identify the lower and upper fence values for outlier detection.
Solution to Exercise 2
```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = stats.t.rvs(df=5, size=300)
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Five-number: {data.min():.3f}, {q1:.3f}, {median:.3f}, {q3:.3f}, {data.max():.3f}")
print(f"IQR: {iqr:.3f}, Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
```
Exercise 3.
Compare the linear and lower interpolation methods in np.percentile() for a small dataset of 10 values. Show that the methods produce different results for quantiles that fall between observed values.
Solution to Exercise 3
```python
import numpy as np

data = np.array([2, 5, 7, 8, 12, 15, 18, 22, 25, 30])
# The 37th percentile falls between the observed values 8 and 12
for method in ['linear', 'lower']:
    p = np.percentile(data, 37, method=method)
    print(f"37th percentile ({method}): {p}")
```