Skip to content

Summary Statistics

Summary statistics provide concise descriptions of data distributions using measures of central tendency, spread, and shape computed efficiently with scipy.stats functions.

Mental Model

Summary statistics compress an entire dataset into a handful of numbers. The mean, median, and mode tell you where the data clusters; the variance and IQR tell you how far it spreads; and skewness and kurtosis tell you how its shape deviates from a symmetric bell curve.


Central Tendency

1. Mean

```python import numpy as np from scipy import stats

data = [23, 25, 27, 24, 26, 22, 28, 25, 24, 26]

Arithmetic mean

print(np.mean(data)) # 25.0 print(stats.tmean(data)) # Same, but can trim

Trimmed mean (exclude outliers)

print(stats.tmean(data, limits=(23, 27))) # Exclude <23 or >27 ```

2. Median

```python

50th percentile

print(np.median(data)) # 25.0

More robust than mean to outliers

data_with_outlier = data + [100] print(np.mean(data_with_outlier)) # 31.82 (pulled up) print(np.median(data_with_outlier)) # 25.0 (stable) ```

3. Mode

```python

Most frequent value

mode_result = stats.mode(data, keepdims=True) print(f"Mode: {mode_result.mode[0]}, Count: {mode_result.count[0]}")

For continuous data, use binning

continuous_data = np.random.normal(0, 1, 1000) hist, bin_edges = np.histogram(continuous_data, bins=30) mode_bin = bin_edges[np.argmax(hist)] ```

4. Geometric Mean

```python

Useful for growth rates, ratios

returns = [1.1, 1.05, 0.95, 1.08, 1.12] # Return factors geom_mean = stats.gmean(returns) print(f"Geometric mean: {geom_mean:.4f}") # 1.0582

Equivalent to

print(np.prod(returns)**(1/len(returns))) ```

5. Harmonic Mean

```python

Useful for rates, speeds

speeds = [60, 40, 50] # km/h for equal distances harm_mean = stats.hmean(speeds) print(f"Harmonic mean: {harm_mean:.2f}") # 48.65 km/h average ```

6. Trimmed Mean

```python

Exclude extreme values

data = np.array([1, 2, 3, 4, 5, 100]) # Outlier at 100

Trim 20% from each tail

trimmed = stats.trim_mean(data, proportiontocut=0.2) print(f"Trimmed mean: {trimmed:.2f}") # More robust ```

7. Weighted Mean

```python

Different weights for observations

values = [10, 20, 30, 40] weights = [1, 2, 3, 4] # Last observation more important

weighted_mean = np.average(values, weights=weights) print(f"Weighted mean: {weighted_mean}") # 30.0 ```


Measures of Spread

1. Variance

```python

Sample variance (Bessel correction)

print(np.var(data, ddof=1)) # Unbiased estimator print(stats.tvar(data)) # Same

Population variance

print(np.var(data, ddof=0)) ```

2. Standard Deviation

```python print(np.std(data, ddof=1)) # Sample std

Interpretation: ~68% within 1σ, ~95% within 2σ

```

3. Range

```python data_range = np.ptp(data) # Peak-to-peak print(f"Range: {data_range}")

Or manually

print(f"Range: {np.max(data) - np.min(data)}") ```

4. Interquartile Range (IQR)

```python q75, q25 = np.percentile(data, [75, 25]) iqr = q75 - q25 print(f"IQR: {iqr}")

Using scipy

print(stats.iqr(data))

Outlier detection: values beyond 1.5*IQR from quartiles

lower_fence = q25 - 1.5 * iqr upper_fence = q75 + 1.5 * iqr outliers = [x for x in data if x < lower_fence or x > upper_fence] ```

5. Mean Absolute Deviation

```python

Average absolute deviation from mean

mad_mean = np.mean(np.abs(data - np.mean(data))) print(f"MAD (from mean): {mad_mean:.2f}")

From median (more robust)

mad_median = stats.median_abs_deviation(data) print(f"MAD (from median): {mad_median:.2f}") ```

6. Standard Error

```python

Standard error of mean

se = stats.sem(data) print(f"SE: {se:.3f}")

Equivalent to

print(np.std(data, ddof=1) / np.sqrt(len(data)))

Used for confidence intervals

```

7. Coefficient of Variation

```python

Relative variability (std/mean)

cv = stats.variation(data) print(f"CV: {cv:.3f}")

Equivalent to

print(np.std(data, ddof=1) / np.mean(data))

Useful for comparing variability across different scales

```


Quantiles

1. Percentiles

```python

Specific percentiles

print(np.percentile(data, 25)) # Q1 print(np.percentile(data, 50)) # Median print(np.percentile(data, 75)) # Q3

Multiple at once

percentiles = np.percentile(data, [25, 50, 75]) ```

2. Quartiles

```python

Using mquantiles

quartiles = stats.mstats.mquantiles(data, prob=[0.25, 0.5, 0.75]) print(f"Q1: {quartiles[0]}, Q2: {quartiles[1]}, Q3: {quartiles[2]}") ```

3. Deciles

```python

10 equal groups

deciles = np.percentile(data, np.arange(10, 100, 10)) print("Deciles:", deciles) ```

4. Quantile Function

```python

Custom quantiles

quantiles = np.quantile(data, [0.1, 0.9]) # 10th and 90th print(f"80% central range: [{quantiles[0]}, {quantiles[1]}]") ```

5. Five-Number Summary

```python

Min, Q1, Median, Q3, Max

summary = [np.min(data), *np.percentile(data, [25, 50, 75]), np.max(data)] print(f"Five-number summary: {summary}")

Or use describe

print(stats.describe(data)) ```

6. Confidence Intervals

```python

CI for mean

mean = np.mean(data) se = stats.sem(data) ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=se) print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]") ```

7. Bootstrap Percentiles

```python

Bootstrap confidence intervals

n_boot = 1000 boot_means = []

for _ in range(n_boot): sample = np.random.choice(data, size=len(data), replace=True) boot_means.append(np.mean(sample))

ci_boot = np.percentile(boot_means, [2.5, 97.5]) print(f"Bootstrap 95% CI: {ci_boot}") ```


Describe Functions

1. describe()

```python

Comprehensive summary

desc = stats.describe(data) print(f"n: {desc.nobs}") print(f"Min/Max: ({desc.minmax[0]}, {desc.minmax[1]})") print(f"Mean: {desc.mean:.2f}") print(f"Variance: {desc.variance:.2f}") print(f"Skewness: {desc.skewness:.2f}") print(f"Kurtosis: {desc.kurtosis:.2f}") ```

2. describe with NaN

```python

Handle missing values

data_with_nan = np.array([1, 2, np.nan, 4, 5]) desc = stats.describe(data_with_nan, nan_policy='omit') print(desc) ```

3. Pandas describe()

```python import pandas as pd

df = pd.DataFrame({ 'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50] })

print(df.describe())

Shows count, mean, std, min, 25%, 50%, 75%, max

```

4. describeby()

```python

Not in scipy, but useful pattern

groups = np.array([0, 0, 1, 1, 1]) data_arr = np.array([10, 15, 20, 25, 30])

for g in [0, 1]: group_data = data_arr[groups == g] print(f"Group {g}: mean={np.mean(group_data):.1f}") ```

5. Summary Statistics Table

```python

Create summary table

summary_stats = { 'Mean': np.mean(data), 'Median': np.median(data), 'Std': np.std(data, ddof=1), 'Min': np.min(data), 'Max': np.max(data), 'Q1': np.percentile(data, 25), 'Q3': np.percentile(data, 75), 'IQR': stats.iqr(data) }

for stat, value in summary_stats.items(): print(f"{stat:10s}: {value:.2f}") ```

6. Binned Statistics

```python

Statistics within bins

x = np.random.randn(1000) y = x**2 + np.random.randn(1000) * 0.5

Mean y within x bins

bin_means, bin_edges, binnumber = stats.binned_statistic( x, y, statistic='mean', bins=10 ) print("Bin means:", bin_means) ```

7. Multi-dimensional Statistics

```python

For 2D data

data_2d = np.random.randn(100, 5)

Column-wise statistics

print("Means:", np.mean(data_2d, axis=0)) print("Stds:", np.std(data_2d, axis=0, ddof=1))

Row-wise

print("Row means:", np.mean(data_2d, axis=1)) ```


Robust Statistics

1. Median Absolute Deviation

```python

Robust measure of spread

mad = stats.median_abs_deviation(data) print(f"MAD: {mad:.2f}")

Scaling factor for normal consistency

mad_scaled = mad * 1.4826 # Estimates σ print(f"MAD (σ estimate): {mad_scaled:.2f}") ```

2. Trimmed Statistics

```python

Trim proportion from each tail

trimmed_mean = stats.trim_mean(data, 0.1) # Remove 10% from each tail trimmed_var = stats.tvar(data, limits=(np.percentile(data, 10), np.percentile(data, 90))) ```

3. Winsorized Statistics

```python

Cap extreme values instead of removing

from scipy.stats.mstats import winsorize

data_wins = winsorize(data, limits=[0.1, 0.1]) # Cap at 10th/90th percentiles print("Winsorized mean:", np.mean(data_wins)) ```

4. Biweight Midvariance

```python

Robust variance estimator (less sensitive to outliers)

Not directly in scipy, but conceptually important

```

5. Hodges-Lehmann Estimator

```python

Median of pairwise averages (robust location)

from itertools import combinations

pairwise_avgs = [(a + b) / 2 for a, b in combinations(data, 2)] hl_estimator = np.median(pairwise_avgs) print(f"Hodges-Lehmann estimator: {hl_estimator:.2f}") ```

6. Q-Q Plot

```python

Visual check for normality

stats.probplot(data, dist="norm", plot=plt) plt.title("Q-Q Plot") plt.show() ```

7. Outlier Detection

```python

Z-score method

z_scores = np.abs(stats.zscore(data)) outliers_zscore = data[z_scores > 3]

IQR method

q1, q3 = np.percentile(data, [25, 75]) iqr = q3 - q1 outliers_iqr = data[(data < q1 - 1.5iqr) | (data > q3 + 1.5iqr)]

print(f"Outliers (Z-score): {outliers_zscore}") print(f"Outliers (IQR): {outliers_iqr}") ```


Summary

Central tendency:

  • Mean: Average value (sensitive to outliers)
  • Median: Middle value (robust)
  • Mode: Most frequent (for categorical/discrete)

Spread:

  • Std: Average deviation from mean
  • IQR: Range of middle 50% (robust)
  • MAD: Median absolute deviation (very robust)

Key insight: Summary statistics reduce data to key numerical values, with robust measures (median, MAD, IQR) preferred when outliers are present, while classical measures (mean, std) are more efficient for clean, normally-distributed data.


Exercises

Exercise 1. Generate 200 samples from a gamma distribution with shape 3 and scale 2. Compute the mean, median, standard deviation, and IQR using NumPy and SciPy functions.

Solution to Exercise 1
import numpy as np
from scipy import stats

np.random.seed(42)
data = stats.gamma.rvs(a=3, scale=2, size=200)
print(f"Mean:   {np.mean(data):.4f}")
print(f"Median: {np.median(data):.4f}")
print(f"Std:    {np.std(data, ddof=1):.4f}")
print(f"IQR:    {np.percentile(data, 75) - np.percentile(data, 25):.4f}")

Exercise 2. Use scipy.stats.describe() on an array of 500 standard normal samples. Print all returned fields (nobs, minmax, mean, variance, skewness, kurtosis) and interpret each.

Solution to Exercise 2
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(size=500)
result = stats.describe(data)
print(f"nobs:     {result.nobs}")
print(f"minmax:   {result.minmax}")
print(f"mean:     {result.mean:.4f}")
print(f"variance: {result.variance:.4f}")
print(f"skewness: {result.skewness:.4f}")
print(f"kurtosis: {result.kurtosis:.4f}")

Exercise 3. Create two datasets: one from a normal distribution and one from an exponential distribution (both with 300 samples). Compare their describe() outputs and identify which statistics reveal the difference in distribution shape.

Solution to Exercise 3
import numpy as np
from scipy import stats

np.random.seed(42)
normal_data = np.random.normal(5, 2, size=300)
exp_data = np.random.exponential(5, size=300)

for name, data in [("Normal", normal_data), ("Exponential", exp_data)]:
    d = stats.describe(data)
    print(f"{name}: skew={d.skewness:.3f}, kurt={d.kurtosis:.3f}")