Percentiles and Quantiles¶
Describing a dataset by its mean and standard deviation captures only the center and spread. Percentiles and quantiles reveal the full shape of a distribution by specifying the value below which a given fraction of observations falls. These measures underpin box plots, confidence intervals, and many nonparametric methods. This page defines percentiles and quantiles formally, explains the interpolation methods used to compute them, and demonstrates their use with NumPy and SciPy.
Definitions¶
Quantile Function¶
For a random variable \(X\) with cumulative distribution function \(F(x) = P(X \le x)\), the quantile function is the generalized inverse
The value \(Q(p)\) is called the \(p\)-th quantile (or \(100p\)-th percentile) of the distribution.
Percentile¶
A percentile is a quantile expressed on a 0-to-100 scale. The \(k\)-th percentile is the value below which \(k\%\) of the observations fall:
Common Named Quantiles¶
| Name | Quantile | Percentile |
|---|---|---|
| Median | \(Q(0.5)\) | 50th |
| Quartiles | \(Q(0.25),\; Q(0.5),\; Q(0.75)\) | 25th, 50th, 75th |
| Deciles | \(Q(0.1),\; Q(0.2),\; \ldots,\; Q(0.9)\) | 10th, 20th, ..., 90th |
Computing Quantiles from a Sample¶
Given an ordered sample \(x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\), the empirical quantile at probability \(p\) typically requires interpolation because \(np\) is not always an integer.
Linear Interpolation (Default)¶
NumPy and SciPy use the following linear interpolation by default. For a desired probability \(p\), compute the virtual index
and interpolate between adjacent order statistics:
import numpy as np
from scipy import stats
data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
# Quartiles using default linear interpolation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(f"Q1 = {q1}, Median = {q2}, Q3 = {q3}")
# Equivalent using np.quantile with fractions
print(np.quantile(data, [0.25, 0.50, 0.75]))
Interpolation Methods¶
NumPy supports several interpolation strategies through the method parameter:
p = 0.25
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
for m in methods:
val = np.percentile(data, 25, method=m)
print(f" {m:10s}: Q1 = {val}")
| Method | Rule |
|---|---|
linear |
Linear interpolation between adjacent order statistics |
lower |
Take the lower of the two bracketing values |
higher |
Take the higher of the two bracketing values |
midpoint |
Average of the lower and higher values |
nearest |
Take the nearest order statistic |
Five-Number Summary¶
The five-number summary consists of the minimum, \(Q_1\), median, \(Q_3\), and maximum. Together with the interquartile range \(\text{IQR} = Q_3 - Q_1\), these values provide a robust sketch of the data distribution.
summary = np.percentile(data, [0, 25, 50, 75, 100])
labels = ['Min', 'Q1', 'Median', 'Q3', 'Max']
for label, val in zip(labels, summary):
print(f" {label:7s}: {val}")
# IQR
iqr = stats.iqr(data)
print(f" IQR : {iqr}")
SciPy Quantile Functions¶
scipy.stats.mstats.mquantiles¶
The mquantiles function supports multiple quantile estimation methods parametrized by plotting positions:
from scipy.stats.mstats import mquantiles
# Default (Cunnane method, alphap=0.4, betap=0.4)
q = mquantiles(data, prob=[0.25, 0.5, 0.75])
print("Cunnane quartiles:", q)
# Hazen method (alphap=0.5, betap=0.5)
q_hazen = mquantiles(data, prob=[0.25, 0.5, 0.75],
alphap=0.5, betap=0.5)
print("Hazen quartiles: ", q_hazen)
scipy.stats.scoreatpercentile¶
# Percentile score (older API, still available)
p90 = stats.scoreatpercentile(data, 90)
print(f"90th percentile: {p90}")
# Inverse: what percentile does a given value correspond to?
pct = stats.percentileofscore(data, 40)
print(f"Score 40 is at the {pct:.1f}th percentile")
Percentile Ranks¶
The percentile rank of a value \(v\) in a dataset is the percentage of observations that are less than or equal to \(v\):
# Percentile rank of specific values
for value in [30, 40, 50]:
rank = stats.percentileofscore(data, value, kind='weak')
print(f" Value {value}: {rank:.1f}th percentile")
Quantiles of Standard Distributions¶
For parametric distributions, the quantile function (percent point function) is available directly through the .ppf() method:
# Normal distribution quantiles
z_90 = stats.norm.ppf(0.90)
z_95 = stats.norm.ppf(0.95)
z_975 = stats.norm.ppf(0.975)
print(f"Normal z-values: z_90={z_90:.4f}, z_95={z_95:.4f}, z_975={z_975:.4f}")
# t-distribution quantile (used in confidence intervals)
t_crit = stats.t.ppf(0.975, df=9)
print(f"t critical (df=9, 97.5%): {t_crit:.4f}")
Summary¶
Percentiles and quantiles characterize how data values distribute across the range of observations. The quantile function \(Q(p)\) generalizes the median to arbitrary probability levels, while percentile ranks perform the inverse mapping from values to probabilities. NumPy provides flexible computation with multiple interpolation methods, and SciPy extends this with mquantiles for alternative estimation approaches and .ppf() for theoretical distribution quantiles. These tools form the foundation for box plots, outlier fences, confidence intervals, and distribution diagnostics throughout statistical analysis.