
Percentiles and Quantiles

Describing a dataset by its mean and standard deviation captures only the center and spread. Percentiles and quantiles reveal the full shape of a distribution by specifying the value below which a given fraction of observations falls. These measures underpin box plots, confidence intervals, and many nonparametric methods. This page defines percentiles and quantiles formally, explains the interpolation methods used to compute them, and demonstrates their use with NumPy and SciPy.

Mental Model

A percentile answers the question "what value separates the bottom \(k\)% from the top?" Sort your data, walk \(k\)% of the way through, and read off the value. The median is the 50th percentile, quartiles split the data into fourths, and together these landmarks sketch the distribution's shape without any normality assumption.


Definitions

Quantile Function

For a random variable \(X\) with cumulative distribution function \(F(x) = P(X \le x)\), the quantile function is the generalized inverse

\[ Q(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}, \quad 0 < p < 1 \]

The value \(Q(p)\) is called the \(p\)-th quantile (or \(100p\)-th percentile) of the distribution.
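The infimum in this definition matters most when \(F\) has jumps, as it does for a discrete distribution. The following is a minimal sketch, assuming a fair six-sided die as the example distribution; the helper name `generalized_inverse` is hypothetical and introduced only for illustration.

```python
import numpy as np

def generalized_inverse(support, probs, p):
    """Return Q(p) = inf{x : F(x) >= p} for a discrete distribution.

    `support` lists the possible values in increasing order and `probs`
    their probabilities. Hypothetical helper, for illustration only.
    """
    support = np.asarray(support, dtype=float)
    cdf = np.cumsum(probs)  # F evaluated at each support point
    # Index of the first support point with F(x) >= p. The small tolerance
    # guards against floating-point round-off in the cumulative sum.
    idx = np.searchsorted(cdf, p - 1e-12, side="left")
    return support[idx]

# A fair six-sided die: F jumps by 1/6 at each of 1, 2, ..., 6.
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)

for p in (0.1, 0.5, 0.9):
    print(f"Q({p}) = {generalized_inverse(faces, probs, p)}")
```

Because \(F(3) = 0.5\) already satisfies \(F(x) \ge 0.5\), the definition picks 3 as the median of the die rather than any value between 3 and 4.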

Percentile

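A percentile is a quantile expressed on a 0-to-100 scale. The \(k\)-th percentile is the value at or below which \(k\%\) of the observations fall:

\[ P_k = Q\!\left(\frac{k}{100}\right), \quad 0 < k < 100 \]

For a finite sample, the 0th and 100th percentiles are conventionally taken to be the minimum and maximum.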

Common Named Quantiles

| Name | Quantile | Percentile |
|---|---|---|
| Median | \(Q(0.5)\) | 50th |
| Quartiles | \(Q(0.25),\; Q(0.5),\; Q(0.75)\) | 25th, 50th, 75th |
| Deciles | \(Q(0.1),\; Q(0.2),\; \ldots,\; Q(0.9)\) | 10th, 20th, ..., 90th |

Computing Quantiles from a Sample

Given an ordered sample \(x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\), the empirical quantile at probability \(p\) typically requires interpolation because \(np\) is not always an integer.

Linear Interpolation (Default)

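NumPy's np.percentile and np.quantile use the following linear interpolation by default; SciPy's mquantiles, covered later on this page, uses different default plotting positions. With the order statistics indexed from 1, compute the virtual index for a desired probability \(p\):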

\[ h = (n - 1)\,p + 1 \]

and interpolate between adjacent order statistics:

\[ \hat{Q}(p) = x_{(\lfloor h \rfloor)} + (h - \lfloor h \rfloor)\bigl(x_{(\lceil h \rceil)} - x_{(\lfloor h \rfloor)}\bigr) \]

```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# Quartiles using default linear interpolation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(f"Q1 = {q1}, Median = {q2}, Q3 = {q3}")

# Equivalent using np.quantile with fractions
print(np.quantile(data, [0.25, 0.50, 0.75]))
```
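The virtual-index formula can be checked by hand against NumPy. This is a sketch under the assumption that order statistics are 1-based, as in the formula; the helper `quantile_linear` is a hypothetical name, not a library function.

```python
import math
import numpy as np

def quantile_linear(sample, p):
    """Empirical quantile via h = (n - 1) p + 1 (hypothetical helper)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    h = (n - 1) * p + 1  # 1-based virtual index
    lo, hi = math.floor(h), math.ceil(h)
    # Subtract 1 to map 1-based order statistics onto 0-based array indices.
    return x[lo - 1] + (h - lo) * (x[hi - 1] - x[lo - 1])

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
for p in (0.25, 0.50, 0.75):
    print(f"p = {p}: by hand = {quantile_linear(data, p)}, "
          f"NumPy = {np.quantile(data, p)}")
```

For \(p = 0.25\) and \(n = 10\), the virtual index is \(h = 3.25\), so the estimate lies a quarter of the way from \(x_{(3)} = 36\) to \(x_{(4)} = 39\), which gives 36.75.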

Interpolation Methods

NumPy supports several interpolation strategies through the method parameter (named interpolation before NumPy 1.22):

```python
# Compare how each method estimates the first quartile (25th percentile)
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
for m in methods:
    val = np.percentile(data, 25, method=m)
    print(f"  {m:10s}: Q1 = {val}")
```

| Method | Rule |
|---|---|
| linear | Linear interpolation between adjacent order statistics |
| lower | Take the lower of the two bracketing values |
| higher | Take the higher of the two bracketing values |
| midpoint | Average of the lower and higher values |
| nearest | Take the nearest order statistic |
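To make the rules in the table concrete, the sketch below reproduces each one by hand for the 25th percentile of the example data and compares it with NumPy. It assumes NumPy 1.22 or later (for the method keyword), and the tie-breaking rule for nearest is deliberately left out because no tie occurs here.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
x = np.sort(data)

h = (len(x) - 1) * 0.25 + 1          # 1-based virtual index, here 3.25
lower = x[int(np.floor(h)) - 1]      # x_(3), the lower bracketing value
higher = x[int(np.ceil(h)) - 1]      # x_(4), the higher bracketing value
frac = h - np.floor(h)               # fractional part of the index

by_hand = {
    "linear": lower + frac * (higher - lower),
    "lower": lower,
    "higher": higher,
    "midpoint": (lower + higher) / 2,
    "nearest": lower if frac < 0.5 else higher,  # ties not handled in this sketch
}

for name, expected in by_hand.items():
    numpy_val = np.percentile(data, 25, method=name)
    print(f"{name:10s} by hand = {expected}, NumPy = {numpy_val}")
```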

Five-Number Summary

The five-number summary consists of the minimum, \(Q_1\), median, \(Q_3\), and maximum. Together with the interquartile range \(\text{IQR} = Q_3 - Q_1\), these values provide a robust sketch of the data distribution.

```python
summary = np.percentile(data, [0, 25, 50, 75, 100])
labels = ['Min', 'Q1', 'Median', 'Q3', 'Max']
for label, val in zip(labels, summary):
    print(f"  {label:7s}: {val}")

# IQR
iqr = stats.iqr(data)
print(f"  IQR    : {iqr}")
```
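One common use of the IQR is to flag potential outliers with fences placed 1.5 IQR beyond the quartiles, the same rule used in Exercise 2. The following is a small sketch on the example data; the 1.5 multiplier is a convention rather than a derived constant.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
print(f"Flagged values: {outliers}")
```

Here the two smallest observations, 7 and 15, fall below the lower fence of 27.75, while nothing exceeds the upper fence of 51.75.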


SciPy Quantile Functions

scipy.stats.mstats.mquantiles

The mquantiles function supports multiple quantile estimation methods parametrized by plotting positions:

```python
from scipy.stats.mstats import mquantiles

# Default (Cunnane method, alphap=0.4, betap=0.4)
q = mquantiles(data, prob=[0.25, 0.5, 0.75])
print("Cunnane quartiles:", q)

# Hazen method (alphap=0.5, betap=0.5)
q_hazen = mquantiles(data, prob=[0.25, 0.5, 0.75], alphap=0.5, betap=0.5)
print("Hazen quartiles:  ", q_hazen)
```
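The plotting-position parameters also connect mquantiles to NumPy's default. As a sketch, setting alphap=1 and betap=1 should reproduce the linear interpolation rule \(h = (n - 1)p + 1\) described earlier, so the two libraries agree on the example data:

```python
import numpy as np
from scipy.stats.mstats import mquantiles

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
probs = [0.25, 0.5, 0.75]

# alphap = betap = 1 corresponds to plotting positions (k - 1) / (n - 1),
# which is the rule behind NumPy's default 'linear' method.
q_scipy = np.asarray(mquantiles(data, prob=probs, alphap=1.0, betap=1.0))
q_numpy = np.quantile(data, probs)

print("mquantiles (alphap=1, betap=1):", q_scipy)
print("np.quantile (default):         ", q_numpy)
```

With the default alphap=0.4 and betap=0.4, the estimates generally differ slightly from NumPy's, which is worth keeping in mind when comparing results across tools.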

scipy.stats.scoreatpercentile

```python
# Percentile score (older API, still available)
p90 = stats.scoreatpercentile(data, 90)
print(f"90th percentile: {p90}")

# Inverse: what percentile does a given value correspond to?
pct = stats.percentileofscore(data, 40)
print(f"Score 40 is at the {pct:.1f}th percentile")
```
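The two functions are only approximately inverse to each other on a finite sample. The sketch below first checks that scoreatpercentile agrees with np.percentile under their default settings, then runs a round trip from a value to its percentile and back:

```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# With default settings both functions interpolate linearly.
p90_scipy = stats.scoreatpercentile(data, 90)
p90_numpy = np.percentile(data, 90)
print(f"scoreatpercentile: {p90_scipy}, np.percentile: {p90_numpy}")

# Round trip: value -> percentile -> value.
pct = stats.percentileofscore(data, 40)      # default kind='rank'
back = stats.scoreatpercentile(data, pct)
print(f"40 -> {pct:.1f}th percentile -> {back}")
```

The value 40 maps to the 50th percentile, but the 50th percentile of the sample is the interpolated median 40.5, so the round trip does not return exactly to 40.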


Percentile Ranks

The percentile rank of a value \(v\) in a dataset is the percentage of observations that are less than or equal to \(v\):

\[ R(v) = \frac{|\{i : x_i \le v\}|}{n} \times 100 \]

```python
# Percentile rank of specific values
for value in [30, 40, 50]:
    rank = stats.percentileofscore(data, value, kind='weak')
    print(f"  Value {value}: {rank:.1f}th percentile")
```
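The formula for \(R(v)\) is short enough to compute directly. The sketch below uses a hypothetical helper, `percentile_rank`, to confirm that it matches kind='weak', and contrasts it with kind='strict', which counts only observations strictly less than \(v\):

```python
import numpy as np
from scipy import stats

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

def percentile_rank(sample, v):
    """R(v): percentage of observations <= v (hypothetical helper)."""
    sample = np.asarray(sample)
    return 100.0 * np.count_nonzero(sample <= v) / sample.size

for value in [30, 40, 50]:
    by_hand = percentile_rank(data, value)
    weak = stats.percentileofscore(data, value, kind='weak')
    strict = stats.percentileofscore(data, value, kind='strict')
    print(f"Value {value}: by hand = {by_hand:.1f}, "
          f"weak = {weak:.1f}, strict = {strict:.1f}")
```

The weak and strict ranks differ only when \(v\) coincides with an observation, as it does for 40 in this dataset.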


Quantiles of Standard Distributions

For parametric distributions, the quantile function (percent point function) is available directly through the .ppf() method:

```python
# Normal distribution quantiles
z_90 = stats.norm.ppf(0.90)
z_95 = stats.norm.ppf(0.95)
z_975 = stats.norm.ppf(0.975)
print(f"Normal z-values: z_90={z_90:.4f}, z_95={z_95:.4f}, z_975={z_975:.4f}")

# t-distribution quantile (used in confidence intervals)
t_crit = stats.t.ppf(0.975, df=9)
print(f"t critical (df=9, 97.5%): {t_crit:.4f}")
```
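Two short checks illustrate how .ppf() is used in practice. The sketch below confirms that .ppf() inverts .cdf() for the normal distribution, then uses the \(t\) critical value in a 95% confidence interval for a mean. The interval assumes roughly normal data, which is an assumption made here for illustration and not something established for this small sample.

```python
import numpy as np
from scipy import stats

# ppf is the inverse of cdf for a continuous distribution.
p = 0.975
z = stats.norm.ppf(p)
print(f"cdf(ppf({p})) = {stats.norm.cdf(z):.6f}")

# 95% t-based confidence interval for the mean (assumes near-normal data).
data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
n = data.size
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)        # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value
ci = (mean - t_crit * sem, mean + t_crit * sem)
print(f"mean = {mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The 0.975 level appears because a two-sided 95% interval leaves 2.5% in each tail.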


Summary

Percentiles and quantiles characterize how data values distribute across the range of observations. The quantile function \(Q(p)\) generalizes the median to arbitrary probability levels, while percentile ranks perform the inverse mapping from values to probabilities. NumPy provides flexible computation with multiple interpolation methods, and SciPy extends this with mquantiles for alternative estimation approaches and .ppf() for theoretical distribution quantiles. These tools form the foundation for box plots, outlier fences, confidence intervals, and distribution diagnostics throughout statistical analysis.


Exercises

Exercise 1. Generate 200 samples from an exponential distribution with \(\lambda = 0.5\). Compute the 10th, 25th, 50th, 75th, and 90th percentiles. Compare the 50th percentile with np.median().

Solution to Exercise 1
```python
import numpy as np
from scipy import stats

np.random.seed(42)
# scale = 1 / lambda, so lambda = 0.5 corresponds to scale = 2
data = stats.expon.rvs(scale=2, size=200)
percs = np.percentile(data, [10, 25, 50, 75, 90])
print(f"Percentiles: {percs.round(4)}")
print(f"Median (np.median): {np.median(data):.4f}")
print(f"50th percentile:    {percs[2]:.4f}")
```

Exercise 2. Compute the five-number summary (min, Q1, median, Q3, max) for 300 samples from a \(t\)-distribution with 5 degrees of freedom. Compute the IQR and identify the lower and upper fence values for outlier detection.

Solution to Exercise 2
```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = stats.t.rvs(df=5, size=300)
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Five-number: {data.min():.3f}, {q1:.3f}, {median:.3f}, {q3:.3f}, {data.max():.3f}")
print(f"IQR: {iqr:.3f}, Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
```

Exercise 3. Compare the linear and lower interpolation methods in np.percentile() for a small dataset of 10 values. Show that the methods produce different results for quantiles that fall between observed values.

Solution to Exercise 3
```python
import numpy as np

data = np.array([2, 5, 7, 8, 12, 15, 18, 22, 25, 30])
for method in ['linear', 'lower']:
    # Use the method keyword; interpolation is its deprecated older name
    p = np.percentile(data, 37, method=method)
    print(f"37th percentile ({method}): {p}")
```