Performance Traps¶
pandas is built for vectorized operations. Using Python loops or incorrect patterns can make code orders of magnitude slower.
The Golden Rule¶
Vectorized operations > apply() > itertuples() > iterrows() > for loops
Trap 1: Row-by-Row Iteration¶
The Problem¶
import pandas as pd
import numpy as np
import time
# Create test data
n = 100_000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n)
})
# SLOW: Using iterrows
def slow_sum():
    result = []
    for idx, row in df.iterrows():
        result.append(row['A'] + row['B'])
    return pd.Series(result)
start = time.time()
slow_result = slow_sum()
print(f"iterrows: {time.time() - start:.2f}s")
The Solution: Vectorize¶
# FAST: Vectorized
start = time.time()
fast_result = df['A'] + df['B']
print(f"Vectorized: {time.time() - start:.4f}s")
Speedup: 100-1000x faster
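If a computation truly cannot be vectorized, itertuples() is the least-bad iteration option: it yields lightweight namedtuples instead of building a Series per row the way iterrows() does. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(1000), 'B': np.random.randn(1000)})

# itertuples yields namedtuples, so row.A is a cheap attribute lookup
# rather than iterrows' per-row Series construction and label indexing.
result = pd.Series([row.A + row.B for row in df.itertuples(index=False)])
```

This is still far slower than `df['A'] + df['B']`, but typically several times faster than the iterrows() version above.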
Trap 2: Growing DataFrames in a Loop¶
The Problem¶
# SLOW: Appending in a loop
df = pd.DataFrame(columns=['A', 'B'])
for i in range(10000):
    df = pd.concat([df, pd.DataFrame({'A': [i], 'B': [i*2]})])
    # Each concat creates a new DataFrame!
The Solution: Build List, Then DataFrame¶
# FAST: Collect data, then create DataFrame
data = []
for i in range(10000):
    data.append({'A': i, 'B': i*2})
df = pd.DataFrame(data)
Speedup: 100x+ faster
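An equivalent pattern, when the columns are computed independently, is to build one Python list per column and construct the frame from a dict at the end. A sketch of that variant:

```python
import pandas as pd

a_vals, b_vals = [], []
for i in range(10_000):
    a_vals.append(i)
    b_vals.append(i * 2)

# One allocation at the end instead of 10,000 intermediate DataFrames
df = pd.DataFrame({'A': a_vals, 'B': b_vals})
```

Either form works; the key point is that the DataFrame is created exactly once.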
Trap 3: Using apply() When Vectorized Option Exists¶
The Problem¶
# SLOW: Using apply
df['C'] = df['A'].apply(lambda x: x ** 2)
# SLOW: Row-wise apply
df['D'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
The Solution: Use Built-in Vectorized Operations¶
# FAST: Vectorized
df['C'] = df['A'] ** 2
# FAST: Vectorized
df['D'] = df['A'] + df['B']
Speedup: 10-100x faster
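Before reaching for apply() on a scalar function, it is worth checking whether a NumPy ufunc or a pandas method already covers it. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0, 9.0]})

# NumPy ufuncs operate element-wise on a whole Series at C speed
df['sqrt'] = np.sqrt(df['A'])          # instead of df['A'].apply(math.sqrt)
df['clipped'] = df['A'].clip(upper=5)  # instead of apply(lambda x: min(x, 5))
```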
Trap 4: String Operations in Loops¶
The Problem¶
df = pd.DataFrame({'text': ['hello', 'world', 'python'] * 10000})
# SLOW: Loop
result = []
for text in df['text']:
    result.append(text.upper())
df['upper'] = result
The Solution: Use str Accessor¶
# FAST: Vectorized string operation
df['upper'] = df['text'].str.upper()
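The str accessor covers most common string transforms and the calls can be chained. A sketch assuming a small sample frame:

```python
import pandas as pd

df = pd.DataFrame({'text': ['  Hello ', 'WORLD', 'py thon']})

# Chained vectorized string operations: strip, lowercase, remove spaces
df['clean'] = (
    df['text'].str.strip().str.lower().str.replace(' ', '', regex=False)
)

# Vectorized membership test returns a boolean Series
mask = df['clean'].str.contains('o')
```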
Trap 5: Conditional Assignment in Loops¶
The Problem¶
df = pd.DataFrame({'value': np.random.randn(100000)})
# SLOW: Loop with condition
for idx in df.index:
    if df.loc[idx, 'value'] > 0:
        df.loc[idx, 'category'] = 'positive'
    else:
        df.loc[idx, 'category'] = 'negative'
The Solution: Vectorized Conditional¶
# FAST: np.where
df['category'] = np.where(df['value'] > 0, 'positive', 'negative')
# FAST: .loc with boolean indexing
df.loc[df['value'] > 0, 'category'] = 'positive'
df.loc[df['value'] <= 0, 'category'] = 'negative'
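When there are more than two buckets, np.select (shown in Trap 5's spirit later in this page) or pd.cut keeps the assignment vectorized. A sketch using pd.cut with assumed bin edges:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [-2.0, -0.5, 0.5, 2.0]})

# pd.cut assigns every value to a labeled bin in one vectorized pass
df['category'] = pd.cut(
    df['value'],
    bins=[-np.inf, -1, 1, np.inf],
    labels=['low', 'mid', 'high'],
)
```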
Trap 6: Using Object Dtype for Numbers¶
The Problem¶
# Accidental object dtype
df = pd.DataFrame({'values': ['1', '2', '3', '4', '5']})
print(df['values'].dtype) # object
# Object dtype stores boxed Python objects, so operations on it are slow
df['values'].astype(float).sum()  # and need an explicit conversion first
The Solution: Use Correct Dtypes¶
# Convert once, use many times
df['values'] = pd.to_numeric(df['values'])
print(df['values'].dtype) # int64 or float64
# Operations are now fast
df['values'].sum()
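The same dtype-level thinking applies to repeated string values: converting them to the category dtype stores each unique string once plus small integer codes. A sketch:

```python
import pandas as pd

s = pd.Series(['red', 'green', 'blue'] * 100_000)

# Categorical keeps one copy of each unique label plus integer codes
cat = s.astype('category')

# Memory drops substantially; comparisons and groupbys get faster too
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```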
Trap 7: Inefficient GroupBy¶
The Problem¶
# SLOW: Multiple separate groupby calls
mean_by_group = df.groupby('group')['value'].mean()
std_by_group = df.groupby('group')['value'].std()
count_by_group = df.groupby('group')['value'].count()
The Solution: Single GroupBy with agg¶
# FAST: Single groupby with multiple aggregations
stats = df.groupby('group')['value'].agg(['mean', 'std', 'count'])
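Named aggregation (pandas ≥ 0.25) does the same single pass but lets you choose the output column names. A sketch with assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 2.0, 6.0],
})

# One groupby pass, explicitly named output columns
stats = df.groupby('group')['value'].agg(
    avg='mean',
    spread='std',
    n='count',
)
```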
Trap 8: Large DataFrame Copies¶
The Problem¶
# SLOW: Unnecessary copies
df_copy = df.copy()
df_copy['new_col'] = df_copy['A'] * 2
# Then only use df_copy briefly
The Solution: Modify In-Place or Use Views¶
# Option 1: Modify original if appropriate
df['new_col'] = df['A'] * 2
# Option 2: Use assign for method chaining (returns a new DataFrame without mutating df)
result = df.assign(new_col=lambda x: x['A'] * 2)
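assign pays off most in method chains, where intermediate frames would otherwise each get a name and an explicit copy. A sketch with assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [4.0, 5.0, 6.0]})

# One readable chain; df itself is left untouched
result = (
    df
    .assign(total=lambda d: d['A'] + d['B'])
    .query('A > 0')
    .sort_values('total', ascending=False)
)
```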
Trap 9: Using Python max/min/sum¶
The Problem¶
# SLOW: Python built-ins
result = sum(df['A']) # Python sum, iterates one by one
result = max(df['A']) # Python max
The Solution: Use pandas/NumPy Methods¶
# FAST: pandas methods
result = df['A'].sum() # Vectorized
result = df['A'].max() # Vectorized
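Note that NumPy's own reduction functions also hit the fast path when given a Series; it is specifically the Python built-ins that fall back to element-by-element iteration. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000, dtype='float64'))

# All three dispatch to compiled code;
# plain sum(s) would loop over a million boxed Python floats.
total = s.sum()
total_np = np.sum(s)
total_arr = s.to_numpy().sum()
```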
Benchmarking Your Code¶
import time
def benchmark(func, name, n_runs=3):
    """Simple benchmarking function."""
    times = []
    for _ in range(n_runs):
        start = time.time()
        func()
        times.append(time.time() - start)
    avg_time = sum(times) / n_runs
    print(f"{name}: {avg_time:.4f}s (avg of {n_runs} runs)")
    return avg_time
# Compare approaches
df = pd.DataFrame({'A': np.random.randn(100000)})
benchmark(lambda: df['A'].apply(lambda x: x**2), "apply")
benchmark(lambda: df['A'] ** 2, "vectorized")
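The standard library's timeit module does the same job with warm-up and repeat handling built in; the minimum over repeats is usually the most stable estimate. A sketch:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100_000)})

# repeat() returns one total time per round; take the minimum
apply_t = min(timeit.repeat(
    lambda: df['A'].apply(lambda x: x ** 2), number=5, repeat=3))
vec_t = min(timeit.repeat(
    lambda: df['A'] ** 2, number=5, repeat=3))
print(f"apply: {apply_t:.4f}s  vectorized: {vec_t:.4f}s")
```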
When apply() is Acceptable¶
Sometimes apply() is necessary:
# Complex logic that can't be vectorized
def complex_business_logic(row):
    if row['status'] == 'A' and row['value'] > 100:
        return row['value'] * 1.1
    elif row['status'] == 'B':
        return row['value'] * 0.9
    else:
        return row['value']
# This is acceptable (but try to vectorize if possible)
df['adjusted'] = df.apply(complex_business_logic, axis=1)
Better alternative with np.select:
conditions = [
    (df['status'] == 'A') & (df['value'] > 100),
    df['status'] == 'B',
]
choices = [
    df['value'] * 1.1,
    df['value'] * 0.9,
]
df['adjusted'] = np.select(conditions, choices, default=df['value'])
Summary: Performance Hierarchy¶
| Approach | Relative Speed | Use When |
|---|---|---|
| Vectorized (NumPy/pandas) | 1x (baseline) | Always prefer |
| Boolean indexing | ~1x | Conditional assignment |
| str/dt accessor | ~2-5x slower | String/datetime ops |
| apply() on column | ~10-100x slower | Last resort for complex logic |
| apply() on rows (axis=1) | ~100-500x slower | Very last resort |
| itertuples() | ~50-200x slower | Need row access, can't vectorize |
| iterrows() | ~500-2000x slower | Avoid |
| Python for loop | ~1000-5000x slower | Never for data ops |