Performance Traps¶

pandas is built for vectorized operations. Using Python loops or incorrect patterns can make code orders of magnitude slower.

Mental Model

The performance hierarchy is: vectorized ops > apply() > itertuples() > iterrows() > Python loops. Every time you write a loop over DataFrame rows, ask "can I express this as a column operation instead?" Other common traps include growing a DataFrame row-by-row (use a list of dicts then construct once) and calling .apply() when a built-in method exists.

The Golden Rule¶

Vectorized operations > apply() > itertuples() > iterrows() > for loops

Trap 1: Row-by-Row Iteration¶

The Problem¶

```python import pandas as pd import numpy as np import time

Create test data¶

n = 100_000 df = pd.DataFrame({ 'A': np.random.randn(n), 'B': np.random.randn(n) })

SLOW: Using iterrows¶

def slow_sum(): result = [] for idx, row in df.iterrows(): result.append(row['A'] + row['B']) return pd.Series(result)

start = time.time() slow_result = slow_sum() print(f"iterrows: {time.time() - start:.2f}s") ```

The Solution: Vectorize¶

```python

FAST: Vectorized¶

start = time.time() fast_result = df['A'] + df['B'] print(f"Vectorized: {time.time() - start:.4f}s") ```

Speedup: 100-1000x faster

Trap 2: Growing DataFrames in a Loop¶

The Problem¶

```python

SLOW: Appending in a loop¶

df = pd.DataFrame(columns=['A', 'B'])

for i in range(10000): df = pd.concat([df, pd.DataFrame({'A': [i], 'B': [i*2]})]) # Each concat creates a new DataFrame! ```

The Solution: Build List, Then DataFrame¶

```python

FAST: Collect data, then create DataFrame¶

data = [] for i in range(10000): data.append({'A': i, 'B': i*2})

df = pd.DataFrame(data) ```

Speedup: 100x+ faster

Trap 3: Using apply() When Vectorized Option Exists¶

The Problem¶

```python

SLOW: Using apply¶

df['C'] = df['A'].apply(lambda x: x ** 2)

SLOW: Row-wise apply¶

df['D'] = df.apply(lambda row: row['A'] + row['B'], axis=1) ```

The Solution: Use Built-in Vectorized Operations¶

```python

FAST: Vectorized¶

df['C'] = df['A'] ** 2

FAST: Vectorized¶

df['D'] = df['A'] + df['B'] ```

Speedup: 10-100x faster

Trap 4: String Operations in Loops¶

The Problem¶

```python df = pd.DataFrame({'text': ['hello', 'world', 'python'] * 10000})

SLOW: Loop¶

result = [] for text in df['text']: result.append(text.upper()) df['upper'] = result ```

The Solution: Use str Accessor¶

```python

FAST: Vectorized string operation¶

df['upper'] = df['text'].str.upper() ```

Trap 5: Conditional Assignment in Loops¶

The Problem¶

```python df = pd.DataFrame({'value': np.random.randn(100000)})

SLOW: Loop with condition¶

for idx in df.index: if df.loc[idx, 'value'] > 0: df.loc[idx, 'category'] = 'positive' else: df.loc[idx, 'category'] = 'negative' ```

The Solution: Vectorized Conditional¶

```python

FAST: np.where¶

df['category'] = np.where(df['value'] > 0, 'positive', 'negative')

FAST: .loc with boolean indexing¶

df.loc[df['value'] > 0, 'category'] = 'positive' df.loc[df['value'] <= 0, 'category'] = 'negative' ```

Trap 6: Using Object Dtype for Numbers¶

The Problem¶

```python

Accidental object dtype¶

df = pd.DataFrame({'values': ['1', '2', '3', '4', '5']}) print(df['values'].dtype) # object

Operations on object dtype are slow¶

df['values'].astype(float).sum() # Conversion on every operation ```

The Solution: Use Correct Dtypes¶

```python

Convert once, use many times¶

df['values'] = pd.to_numeric(df['values']) print(df['values'].dtype) # int64 or float64

Operations are now fast¶

df['values'].sum() ```

Trap 7: Inefficient GroupBy¶

The Problem¶

```python

SLOW: Multiple separate groupby calls¶

mean_by_group = df.groupby('group')['value'].mean() std_by_group = df.groupby('group')['value'].std() count_by_group = df.groupby('group')['value'].count() ```

The Solution: Single GroupBy with agg¶

```python

FAST: Single groupby with multiple aggregations¶

stats = df.groupby('group')['value'].agg(['mean', 'std', 'count']) ```

Trap 8: Large DataFrame Copies¶

The Problem¶

```python

SLOW: Unnecessary copies¶

df_copy = df.copy() df_copy['new_col'] = df_copy['A'] * 2

Then only use df_copy briefly¶

```

The Solution: Modify In-Place or Use Views¶

```python

Option 1: Modify original if appropriate¶

df['new_col'] = df['A'] * 2

Option 2: Use assign for method chaining (creates copy only when needed)¶

result = df.assign(new_col=lambda x: x['A'] * 2) ```

Trap 9: Using Python max/min/sum¶

The Problem¶

```python

SLOW: Python built-ins¶

result = sum(df['A']) # Python sum, iterates one by one result = max(df['A']) # Python max ```

The Solution: Use pandas/NumPy Methods¶

```python

FAST: pandas methods¶

result = df['A'].sum() # Vectorized result = df['A'].max() # Vectorized ```

Benchmarking Your Code¶

```python import time

def benchmark(func, name, n_runs=3): """Simple benchmarking function.""" times = [] for _ in range(n_runs): start = time.time() func() times.append(time.time() - start) avg_time = sum(times) / n_runs print(f"{name}: {avg_time:.4f}s (avg of {n_runs} runs)") return avg_time

Compare approaches¶

df = pd.DataFrame({'A': np.random.randn(100000)})

benchmark(lambda: df['A'].apply(lambda x: x**2), "apply") benchmark(lambda: df['A'] ** 2, "vectorized") ```

When apply() is Acceptable¶

Sometimes apply() is necessary:

```python

Complex logic that can't be vectorized¶

def complex_business_logic(row): if row['status'] == 'A' and row['value'] > 100: return row['value'] * 1.1 elif row['status'] == 'B': return row['value'] * 0.9 else: return row['value']

This is acceptable (but try to vectorize if possible)¶

df['adjusted'] = df.apply(complex_business_logic, axis=1) ```

Better alternative with np.select:

python conditions = [ (df['status'] == 'A') & (df['value'] > 100), df['status'] == 'B' ] choices = [ df['value'] * 1.1, df['value'] * 0.9 ] df['adjusted'] = np.select(conditions, choices, default=df['value'])

Summary: Performance Hierarchy¶

Approach	Relative Speed	Use When
Vectorized (NumPy/pandas)	1x (baseline)	Always prefer
Boolean indexing	~1x	Conditional assignment
str/dt accessor	~2-5x slower	String/datetime ops
apply() on column	~10-100x slower	Last resort for complex logic
apply() on rows (axis=1)	~100-500x slower	Very last resort
itertuples()	~50-200x slower	Need row access, can't vectorize
iterrows()	~500-2000x slower	Avoid
Python for loop	~1000-5000x slower	Never for data ops

Exercises¶

Exercise 1. Explain why growing a DataFrame row-by-row in a loop with pd.concat() inside the loop is slow. What is the correct alternative?

Solution to Exercise 1

```python import pandas as pd import numpy as np

Solution for the specific exercise¶

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```

Exercise 2. Write code showing the performance difference between iterrows() and vectorized operations for computing a column.

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.

Exercise 3. Explain why apply() with a Python function is slower than using built-in Pandas/NumPy vectorized methods.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```

Exercise 4. Write code that collects results in a list and creates a DataFrame at the end, instead of appending to a DataFrame in a loop.

Solution to Exercise 4

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```