Data Normalization and Preprocessing¶
Why Normalize Data?¶
Real-world data rarely comes in a pristine, analysis-ready form. When working with biological datasets—such as gene expression measurements from RNA-sequencing—technical variation often obscures the signal you're trying to detect. Normalization techniques separate technical variation from biological variation, making your downstream analysis more reliable.
Technical Variation vs. Real Variation¶
Consider a gene expression experiment with samples processed on different days or with different reagent batches. Even if the true biological signal is identical, you might observe:
- Sample 1 (Day 1): average expression = 100
- Sample 2 (Day 2): average expression = 150
These differences could reflect: 1. Technical effects: differences in processing, library preparation quality, sequencing depth 2. Biological effects: real differences in gene activity between cell types or conditions
Normalization aims to remove (1) while preserving (2).
Distribution Shifts Across Samples¶
Normalization becomes essential when samples have different distributions of values. A simple visualization reveals this problem:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Simulate raw count data from 3 samples
np.random.seed(42)
X = np.random.negative_binomial(5, 0.3, size=(1000, 3))
X[:, 1] *= 1.5 # Sample 2 has systematically higher values
X[:, 2] *= 0.7 # Sample 3 has systematically lower values
# Plot density for each column
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for i in range(3):
kde = stats.gaussian_kde(X[:, i])
x_range = np.linspace(X[:, i].min(), X[:, i].max(), 200)
axes[i].plot(x_range, kde(x_range))
axes[i].set_title(f'Sample {i+1}')
axes[i].set_xlabel('Expression Value')
axes[i].set_ylabel('Density')
plt.tight_layout()
plt.show()
Note
Visual Diagnosis Tool: Plotting per-column density curves (using scipy.stats.gaussian_kde) is an excellent first step to diagnose whether your data needs normalization. If distributions look substantially different, normalization is warranted.
When to Apply Normalization¶
Apply normalization when: - Samples have different median values or distributions - You're combining data from multiple experiments or batches - Technical artifacts could interfere with biological conclusions - You plan to compare expression levels across samples
Log Transformation¶
Before applying quantile normalization, log transformation is often valuable for count data.
Why Log Transform?¶
Count data (like RNA-seq reads, bacterial colony counts) typically: - Have a skewed (right-tailed) distribution - Show variance proportional to mean (heteroscedasticity) - Range over multiple orders of magnitude
The log transformation compresses this range and stabilizes variance.
Implementation: The Log(X+1) Pattern¶
Because log(0) is undefined, we add a pseudocount (typically 1):
# Log transformation with pseudocount
X_log = np.log(X + 1)
Tip
Adding 1: The pseudocount (usually 1) ensures that: - Zero values map to log(1) = 0 instead of undefined - Small counts aren't over-compressed relative to large counts - The transformation is invertible (approximately): \(\text{exp}(X_{\text{log}}) - 1 \approx X\)
Effect on Variance¶
The log transformation has a stabilizing effect. For count data where variance is proportional to the mean, the log scale (approximately) equalizes variance across different expression levels:
# Before and after variance comparison
print(f"Variance before log: {np.var(X, axis=0)}")
print(f"Variance after log: {np.var(np.log(X + 1), axis=0)}")
Quantile Normalization Algorithm¶
Quantile normalization forces all samples to have identical distributions. This is particularly useful for removing systematic biases while preserving sample-to-sample differences that are consistent across genes.
The Three-Step Process¶
Step 1: Sort values along each column
For each sample (column), sort all expression values in ascending order.
Step 2: Compute reference distribution
Average each row of sorted values. This reference distribution represents the "typical" expression pattern after sorting.
Step 3: Replace quantiles
Replace each sorted value with the corresponding reference quantile value, then unsort to restore original order.
Mathematical Intuition¶
If sample A has high variance and sample B has low variance:
After sorting and averaging each row:
Both samples now have identical values: 20.5, 47.5, 75 (in different original positions).
Implementation¶
import numpy as np
from scipy import stats
def quantile_normalize(X):
"""
Apply quantile normalization to a matrix.
Parameters
----------
X : ndarray of shape (n_features, n_samples)
Data matrix where columns are samples.
Returns
-------
Xn : ndarray of same shape as X
Quantile-normalized data.
Notes
-----
This function assumes column-wise samples and row-wise features.
"""
# Step 1: Compute reference distribution (mean of sorted rows)
quantiles = np.mean(np.sort(X, axis=0), axis=1)
# Step 2: Get rank of each value in its column
# rankdata returns 1-indexed ranks
ranks = np.apply_along_axis(stats.rankdata, 0, X)
rank_indices = ranks.astype(int) - 1
# Step 3: Replace each value with its quantile value
Xn = quantiles[rank_indices]
return Xn
Example Usage¶
# Create sample data: 5 features × 3 samples
np.random.seed(42)
X = np.array([
[1, 40, 200],
[2, 45, 250],
[5, 50, 300],
[10, 48, 280],
[20, 55, 350]
], dtype=float)
print("Original data:")
print(X)
Xn = quantile_normalize(X)
print("\nQuantile-normalized data:")
print(Xn)
# Verify all columns have identical values
print("\nColumn 0:", Xn[:, 0])
print("Column 1:", Xn[:, 1])
print("Column 2:", Xn[:, 2])
Note
Key NumPy Concepts:
- np.sort(axis=0): Sort within each column
- np.mean(..., axis=1): Average across columns (row-wise)
- np.apply_along_axis(): Apply function along each column
- stats.rankdata(): Compute ranks (handles ties)
- Fancy indexing with quantiles[rank_indices]: vectorized lookup
Feature Selection by Variance¶
High-dimensional datasets (thousands of features) can benefit from dimensionality reduction. A simple and effective approach: select features with the highest variance.
Rationale¶
- High-variance features carry more information and are more likely to be biologically relevant
- Low-variance features add noise without signal
- Reducing feature count speeds up downstream analysis (clustering, visualization) and reduces overfitting risk
Implementation¶
def most_variable_rows(X, n_features=100):
"""
Select the most variable features (rows).
Parameters
----------
X : ndarray of shape (n_features, n_samples)
Data matrix.
n_features : int, optional
Number of features to select. Default: 100.
Returns
-------
X_selected : ndarray of shape (n_features, n_samples)
Data containing only selected features.
indices : ndarray
Indices of selected features in original matrix.
"""
# Compute variance for each feature (row)
variances = np.var(X, axis=1)
# Find indices of top-n most variable features
indices = np.argsort(variances)[-n_features:]
# Return selected features and their indices
return X[indices, :], indices
Example¶
# Load and normalize data
np.random.seed(42)
X = np.random.negative_binomial(5, 0.3, size=(5000, 20))
X_log = np.log(X + 1)
X_norm = quantile_normalize(X_log)
# Select top 500 most variable genes
X_selected, gene_indices = most_variable_rows(X_norm, n_features=500)
print(f"Original shape: {X_norm.shape}")
print(f"Selected shape: {X_selected.shape}")
print(f"Selected gene indices: {gene_indices}")
Tip
Choosing n_features: Use domain knowledge and downstream goals: - Clustering: 100-1000 features often sufficient - Classification: May need more features for accurate prediction - Visualization: Keep it small (50-200) for interpretable plots
Hierarchical Clustering Preview¶
After normalization and feature selection, hierarchical clustering reveals structure in your data by grouping similar samples and features.
Basic Linkage Computation¶
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
# Compute hierarchical clustering
Z = linkage(X_selected.T, method='ward') # Ward linkage on samples
# Generate dendrogram
dendrogram(Z)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering of Samples')
plt.show()
Biclustering: Clustering Rows AND Columns¶
A more sophisticated approach clusters both features (rows) and samples (columns) simultaneously, revealing blocks of co-regulated genes:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, leaves_list
# Cluster rows (features/genes)
Z_rows = linkage(X_selected, method='ward')
row_order = leaves_list(Z_rows)
# Cluster columns (samples)
Z_cols = linkage(X_selected.T, method='ward')
col_order = leaves_list(Z_cols)
# Reorder data matrix
X_reordered = X_selected[row_order, :][:, col_order]
# Visualize with dendrograms
fig = plt.figure(figsize=(10, 8))
gs = fig.add_gridspec(2, 2, width_ratios=[0.2, 1], height_ratios=[1, 0.2])
# Heatmap
ax_heatmap = fig.add_subplot(gs[0, 1])
im = ax_heatmap.imshow(X_reordered, aspect='auto', cmap='RdBu_r')
ax_heatmap.set_xlabel('Sample (clustered)')
ax_heatmap.set_ylabel('Gene (clustered)')
plt.colorbar(im, ax=ax_heatmap)
# Row dendrogram
ax_row_dend = fig.add_subplot(gs[0, 0])
dendrogram(Z_rows, ax=ax_row_dend, orientation='left')
ax_row_dend.set_yticks([])
# Column dendrogram
ax_col_dend = fig.add_subplot(gs[1, 1])
dendrogram(Z_cols, ax=ax_col_dend)
ax_col_dend.set_xticks([])
plt.tight_layout()
plt.show()
Note
Biclustering interpretation: The heatmap shows gene expression, with genes and samples reordered by similarity. Dark red = high expression, dark blue = low expression. Visual blocks indicate co-regulated genes in specific sample groups.
Why This Matters¶
Hierarchical clustering after normalization: 1. Removes batch effects via normalization (technical noise) 2. Reveals biological structure (cell types, conditions, disease states) 3. Identifies outliers (unusual samples or genes) 4. Provides visual feedback on data quality and preprocessing success
Complete Preprocessing Pipeline¶
Here's a full workflow combining all techniques: load → log transform → normalize → select features → cluster.
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
import matplotlib.pyplot as plt
def quantile_normalize(X):
"""Quantile normalize a matrix."""
quantiles = np.mean(np.sort(X, axis=0), axis=1)
ranks = np.apply_along_axis(stats.rankdata, 0, X)
rank_indices = ranks.astype(int) - 1
return quantiles[rank_indices]
def most_variable_rows(X, n_features=100):
"""Select most variable features."""
variances = np.var(X, axis=1)
indices = np.argsort(variances)[-n_features:]
return X[indices, :], indices
# Step 1: Load data
np.random.seed(42)
X_raw = np.random.negative_binomial(5, 0.3, size=(2000, 30))
# Step 2: Log transformation
X_log = np.log(X_raw + 1)
# Step 3: Quantile normalization
X_norm = quantile_normalize(X_log)
# Step 4: Feature selection (keep top 500 genes)
X_selected, selected_idx = most_variable_rows(X_norm, n_features=500)
# Step 5: Hierarchical clustering
Z = linkage(X_selected.T, method='ward')
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Before normalization
axes[0].hist(X_raw.flatten(), bins=50)
axes[0].set_title('Raw Data Distribution')
axes[0].set_xlabel('Count')
# After normalization
axes[1].hist(X_norm.flatten(), bins=50)
axes[1].set_title('After Quantile Normalization')
axes[1].set_xlabel('Normalized Value')
# Clustering dendrogram
axes[2].remove()
ax_dend = fig.add_subplot(133)
dendrogram(Z, ax=ax_dend, leaf_rotation=90)
ax_dend.set_title('Sample Clustering')
ax_dend.set_xlabel('Sample')
ax_dend.set_ylabel('Distance')
plt.tight_layout()
plt.show()
Best Practices and Common Pitfalls¶
Best Practices¶
- Always visualize before and after
- Density plots, histograms, and heatmaps reveal whether normalization worked
-
Save plots for your analysis report
-
Document your normalization choice
- Different analyses may require different approaches
- Quantile normalization assumes technical shifts only (no true biological rank changes)
-
Log transformation assumes counts; avoid on already-normalized data
-
Be cautious with feature selection
- Features selected on training data can introduce bias
- Use cross-validation or hold-out test sets when building predictive models
-
Variance-based selection is exploratory; use statistical tests for inference
-
Check for over-normalization
- Quantile normalization forces artificial similarity
- Use it when you're confident technical effects dominate
- Consider milder approaches (e.g., median centering) if you want to preserve more biological signal
Common Pitfalls¶
- Normalizing twice
- Don't apply quantile normalization to already-normalized data
-
It will create artificial patterns
-
Forgetting batch effects
- If data comes from different batches, normalize within batches or use batch-aware methods
-
Visual clustering by batch before preprocessing is a good diagnostic
-
Selecting features on biased data
- Always perform feature selection AFTER normalization, not before
-
Pre-selection on non-normalized data can amplify technical artifacts
-
Ignoring the biological question
- Choose normalization that preserves signal relevant to your hypothesis
- Over-normalization can erase true biological differences
Tip
Gold Standard: Compare multiple preprocessing approaches on a small set of known positive/negative controls. The approach that best separates true signals from noise is your best choice.
Summary¶
Normalization and preprocessing are critical foundations for reliable data analysis:
- Log transformation compresses count data and stabilizes variance
- Quantile normalization forces identical distributions across samples, removing systematic biases
- Feature selection reduces dimensionality and focuses on informative signals
- Hierarchical clustering reveals structure while providing visual quality control
Master these techniques, and you'll spend less time fighting artifacts and more time discovering biological insights.