Outlier Detection¶
A single anomalous observation can shift the sample mean by several standard deviations, inflate variance estimates, and distort regression coefficients. Identifying such outliers before analysis protects downstream inferences. This page covers the main statistical methods for detecting outliers using scipy.stats and NumPy, progressing from simple threshold rules to formal hypothesis tests.
Z-Score Method¶
The Z-score measures how many standard deviations an observation lies from the sample mean. For a dataset \(x_1, x_2, \ldots, x_n\) with sample mean \(\bar{x}\) and sample standard deviation \(s\), the Z-score of observation \(x_i\) is

\[
z_i = \frac{x_i - \bar{x}}{s}.
\]

Observations with \(|z_i| > 3\) are commonly flagged as outliers under the assumption of approximate normality.
import numpy as np
from scipy import stats
data = np.array([12, 14, 13, 15, 14, 100, 13, 14, 12, 15])
z_scores = stats.zscore(data)
print("Z-scores:", np.round(z_scores, 2))
# Flag observations beyond 3 standard deviations.
# Caveat: stats.zscore uses ddof=0, and because the mean and std are
# computed from the same sample, no |z| can exceed sqrt(n - 1) = 3 for
# n = 10 -- so even the value 100 narrowly escapes the |z| > 3 rule here.
outliers = data[np.abs(z_scores) > 3]
print(f"Outliers (|z| > 3): {outliers}")  # empty array
Sensitivity to Masking
The Z-score method uses the mean and standard deviation, both of which are themselves affected by outliers. A single extreme value inflates \(s\), which shrinks every Z-score and can mask other outliers. Worse, because \(\bar{x}\) and \(s\) are computed from the same sample, no observation can attain \(|z_i| > \sqrt{n-1}\) (with ddof=0, as stats.zscore uses by default), so the \(|z| > 3\) rule is powerless for samples of ten or fewer points. For small datasets or datasets with multiple outliers, prefer the modified Z-score method below.
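The masking effect is easy to reproduce. The sketch below uses made-up data: eighteen well-behaved observations plus the outliers 100 and 300. Alone, 100 is flagged comfortably; once 300 joins the sample, the inflated standard deviation pulls 100 back under the threshold.

```python
import numpy as np
from scipy import stats

# 18 well-behaved observations (hypothetical values)
base = np.array([12]*5 + [13]*4 + [14]*5 + [15]*4, dtype=float)

# One outlier: 100 is flagged comfortably.
one = np.append(base, 100.0)
z_one = stats.zscore(one)
print(f"z of 100 (single outlier):  {z_one[-1]:.2f}")  # well above 3

# Two outliers: 300 inflates the standard deviation so much
# that 100 drops below the |z| > 3 threshold -- it is masked.
two = np.append(one, 300.0)
z_two = stats.zscore(two)
print(f"z of 100 (with 300 added): {z_two[-2]:.2f}")   # below 3
print(f"z of 300:                  {z_two[-1]:.2f}")   # above 3
```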
Modified Z-Score (MAD-Based)¶
The modified Z-score replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), producing a robust alternative. For observation \(x_i\), the modified Z-score is

\[
M_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}},
\]

where \(\tilde{x}\) is the sample median and

\[
\text{MAD} = \operatorname{median}\bigl(|x_i - \tilde{x}|\bigr).
\]

The constant \(0.6745\) is the 75th percentile of the standard normal distribution; dividing the MAD by it yields a consistent estimate of the standard deviation under normality. Observations with \(|M_i| > 3.5\) are flagged as outliers.
median = np.median(data)
mad = stats.median_abs_deviation(data)
modified_z = 0.6745 * (data - median) / mad
print("Modified Z-scores:", np.round(modified_z, 2))
outliers_mad = data[np.abs(modified_z) > 3.5]
print(f"Outliers (|M| > 3.5): {outliers_mad}")
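As a convenience, `median_abs_deviation` accepts `scale='normal'`, which applies the normal-consistency constant internally so the explicit 0.6745 factor is unnecessary. A quick sketch confirming the two computations agree:

```python
import numpy as np
from scipy import stats

data = np.array([12, 14, 13, 15, 14, 100, 13, 14, 12, 15])
median = np.median(data)

# Manual normalization with the explicit constant
mad = stats.median_abs_deviation(data)
manual = 0.6745 * (data - median) / mad

# scale='normal' divides the MAD by ~0.6745 internally
scaled_mad = stats.median_abs_deviation(data, scale='normal')
auto = (data - median) / scaled_mad

print(np.allclose(manual, auto, rtol=1e-3))  # True
```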
IQR Method¶
The interquartile range (IQR) method defines outlier boundaries using the first quartile \(Q_1\) and third quartile \(Q_3\). The IQR is

\[
\text{IQR} = Q_3 - Q_1.
\]

An observation \(x_i\) is classified as an outlier if

\[
x_i < Q_1 - 1.5 \times \text{IQR} \quad \text{or} \quad x_i > Q_3 + 1.5 \times \text{IQR}.
\]

The multiplier \(1.5\) identifies mild outliers; replacing it with \(3.0\) targets extreme outliers only.
q1, q3 = np.percentile(data, [25, 75])
iqr = stats.iqr(data)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers_iqr = data[(data < lower_fence) | (data > upper_fence)]
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f"Outliers (IQR): {outliers_iqr}")
Why 1.5 Times the IQR
For a normal distribution, the interval \([Q_1 - 1.5 \times \text{IQR},\; Q_3 + 1.5 \times \text{IQR}]\) captures approximately 99.3% of the data. The 1.5 multiplier therefore flags roughly the most extreme 0.7% of observations under normality.
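This coverage figure can be verified directly from the standard normal distribution; a short sketch:

```python
from scipy import stats

# Quartiles of the standard normal
q1 = stats.norm.ppf(0.25)
q3 = stats.norm.ppf(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # about -2.698 sigma
upper = q3 + 1.5 * iqr   # about +2.698 sigma

# Probability mass inside the fences
coverage = stats.norm.cdf(upper) - stats.norm.cdf(lower)
print(f"Fences at ±{upper:.3f} sigma, coverage = {coverage:.4f}")  # ~0.9930
```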
Grubbs' Test¶
Grubbs' test is a formal hypothesis test for detecting a single outlier in a univariate dataset assumed to come from a normal distribution. The test statistic is

\[
G = \frac{\max_i |x_i - \bar{x}|}{s}.
\]

Under \(H_0\) (no outliers), the critical value at significance level \(\alpha\) for sample size \(n\) is

\[
G_{\text{crit}} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}},
\]

where \(t_{\alpha/(2n),\,n-2}\) is the upper critical value of the \(t\)-distribution with \(n - 2\) degrees of freedom at tail probability \(\alpha/(2n)\).
def grubbs_test(data, alpha=0.05):
    """Perform Grubbs' test for a single outlier."""
    n = len(data)
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    # Test statistic: largest absolute deviation in standard-deviation units
    abs_deviations = np.abs(data - mean)
    G = np.max(abs_deviations) / std
    suspect_idx = np.argmax(abs_deviations)
    # Critical value from the t-distribution
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    is_outlier = G > G_crit
    return data[suspect_idx], G, G_crit, is_outlier
value, G, G_crit, is_outlier = grubbs_test(data)
print(f"Suspect value: {value}")
print(f"G = {G:.3f}, G_crit = {G_crit:.3f}")
print(f"Outlier detected: {is_outlier}")
Single Outlier Assumption
Grubbs' test detects at most one outlier per application. To check for multiple outliers, apply the test iteratively: remove the detected outlier, then re-test the reduced dataset. Stop when the test no longer rejects \(H_0\).
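A minimal sketch of that iterative procedure, self-contained so the statistic and critical value are recomputed on each reduced sample (the function name `grubbs_iterative` and the `max_outliers` cap are illustrative choices, not a scipy API):

```python
import numpy as np
from scipy import stats

def grubbs_iterative(data, alpha=0.05, max_outliers=5):
    """Repeatedly apply Grubbs' test, removing one outlier per pass."""
    data = np.asarray(data, dtype=float)
    outliers = []
    for _ in range(max_outliers):
        n = len(data)
        if n < 3:
            break
        abs_dev = np.abs(data - data.mean())
        G = abs_dev.max() / data.std(ddof=1)
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if G <= G_crit:
            break  # H0 not rejected: stop iterating
        idx = abs_dev.argmax()
        outliers.append(data[idx])
        data = np.delete(data, idx)
    return outliers, data

values = np.array([12, 14, 13, 15, 14, 100, 13, 14, 12, 15])
found, cleaned = grubbs_iterative(values)
print("Outliers removed:", found)
```

Note that iterative Grubbs can still suffer masking when two similar outliers are present simultaneously; the generalized ESD test is designed for that situation.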
Percentile-Based Clipping¶
Percentile-based clipping caps or removes observations beyond specified percentile thresholds. Unlike the IQR method, it does not assume symmetry and works directly with empirical quantiles.
# Winsorize: cap the most extreme 10% in each tail
from scipy.stats.mstats import winsorize
# Note: winsorize truncates limit * n to an integer count of observations,
# so with n = 10 a limit of 0.05 would cap int(0.5) = 0 points and change
# nothing; 0.1 caps exactly one observation per tail.
data_winsorized = winsorize(data, limits=[0.1, 0.1])
print("Original:  ", data)
print("Winsorized:", data_winsorized)
# Trim: remove observations beyond thresholds
lower, upper = np.percentile(data, [5, 95])
data_trimmed = data[(data >= lower) & (data <= upper)]
print("Trimmed: ", data_trimmed)
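The same capping idea can be written without the masked-array machinery, using `np.clip` on empirical percentiles. A sketch (not byte-for-byte identical to `winsorize` on small samples, since `np.percentile` interpolates between order statistics):

```python
import numpy as np

data = np.array([12, 14, 13, 15, 14, 100, 13, 14, 12, 15])

# Cap values at the 5th and 95th empirical percentiles
lower, upper = np.percentile(data, [5, 95])
data_capped = np.clip(data, lower, upper)
print(f"Caps: [{lower:.2f}, {upper:.2f}]")
print("Capped:", data_capped)
```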
Comparing Methods¶
Each method has different strengths depending on the data characteristics.
| Method | Assumes Normality | Robust to Multiple Outliers | Formal Test |
|---|---|---|---|
| Z-score | Yes | No | No |
| Modified Z-score | No | Yes | No |
| IQR | No | Yes | No |
| Grubbs' test | Yes | No | Yes |
| Percentile clipping | No | Yes | No |
# Side-by-side comparison on the same dataset
data = np.array([12, 14, 13, 15, 14, 100, 13, 14, 12, 15])
# Z-score (finds nothing here: with n = 10 no |z| can reach 3 -- see the
# masking note above)
z = stats.zscore(data)
print("Z-score outliers:    ", data[np.abs(z) > 3])
# Modified Z-score
med = np.median(data)
mad = stats.median_abs_deviation(data)
mz = 0.6745 * (data - med) / mad
print("Modified Z outliers: ", data[np.abs(mz) > 3.5])
# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("IQR outliers: ",
data[(data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr)])
Summary¶
Outlier detection protects statistical analyses from the disproportionate influence of anomalous observations. The Z-score method offers simplicity but is sensitive to the very outliers it seeks to find. The modified Z-score using MAD provides robustness against masking effects. The IQR method is distribution-free and widely used in exploratory analysis. Grubbs' test provides formal hypothesis testing for a single outlier under normality. In practice, applying multiple methods and comparing their results gives the most reliable identification of genuine anomalies.
Runnable Example: outlier_detection_methods.py¶
"""
Data Cleaning & Preprocessing - Tutorial 04
Topic: Outlier Detection and Treatment (Intermediate)
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
# =============================================================================
# Main
# =============================================================================
if __name__ == "__main__":
    warnings.filterwarnings('ignore')
    np.random.seed(42)

    print("="*80)
    print("TUTORIAL 04: OUTLIER DETECTION AND TREATMENT")
    print("="*80)

    # Generate data with outliers
    data = np.concatenate([np.random.normal(50, 10, 95), [150, 160, 155, -20, -15]])
    df = pd.DataFrame({'value': data})
    print("\nDataset statistics:")
    print(df.describe())

    # Method 1: Z-score
    print("\n--- Z-Score Method ---")
    z_scores = np.abs((df['value'] - df['value'].mean()) / df['value'].std())
    outliers_z = df[z_scores > 3]
    print(f"Outliers (|z| > 3): {len(outliers_z)} found")
    print(outliers_z)

    # Method 2: IQR
    print("\n--- IQR Method ---")
    Q1 = df['value'].quantile(0.25)
    Q3 = df['value'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers_iqr = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
    print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"Outliers: {len(outliers_iqr)} found")
    print(outliers_iqr)

    # Method 3: Modified Z-score (MAD)
    print("\n--- Modified Z-Score (MAD) ---")
    median = df['value'].median()
    mad = np.median(np.abs(df['value'] - median))
    modified_z = 0.6745 * (df['value'] - median) / mad
    outliers_mad = df[np.abs(modified_z) > 3.5]
    print(f"Outliers (robust method): {len(outliers_mad)} found")

    # Treatment strategies
    print("\n--- Treatment Strategies ---")

    # Strategy 1: Remove outliers
    df_removed = df[z_scores <= 3]
    print(f"1. Removal: {len(df)} → {len(df_removed)} rows")

    # Strategy 2: Cap/Winsorize
    df_capped = df.copy()
    df_capped['value'] = df_capped['value'].clip(lower=lower_bound, upper=upper_bound)
    print(f"2. Capping: outliers clipped to [{lower_bound:.2f}, {upper_bound:.2f}]")

    # Strategy 3: Log transformation
    df_log = df[df['value'] > 0].copy()  # Only positive values
    df_log['value_log'] = np.log1p(df_log['value'])
    print(f"3. Log transform applied to {len(df_log)} positive values")

    print("\nKEY TAKEAWAYS:")
    print("- Z-score: assumes normal distribution, sensitive to extreme outliers")
    print("- IQR: robust, works for any distribution")
    print("- MAD: most robust, resistant to outliers")
    print("- Always investigate outliers before removing!")

    print("\nNEXT: 05_intermediate_encoding.py")
    print("="*80)