Vectorization

Vectorization is the practice of applying operations to entire arrays at once, rather than iterating through elements. Vectorized operations in pandas are typically 10-100x faster than loops.

Mental Model

A Python for loop processes one value per iteration with full interpreter overhead. A vectorized operation hands the entire array to a compiled C/Fortran routine that processes all values in a tight, optimized loop. The rule is simple: if you can express the operation without a Python loop, do so -- the speedup is typically 10-100x.

Why Vectorization Matters

The Problem with Loops

```python
import pandas as pd
import numpy as np
import time

df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000)
})

# Slow: iterating with a loop
def slow_sum(df):
    result = []
    for i in range(len(df)):
        result.append(df.iloc[i]['A'] + df.iloc[i]['B'])
    return result

# Even slower: iterrows
def slower_sum(df):
    result = []
    for idx, row in df.iterrows():
        result.append(row['A'] + row['B'])
    return result
```

The Vectorized Solution

```python
# Fast: vectorized operation
df['C'] = df['A'] + df['B']
```

Performance Comparison

```python
n = 100000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n)
})

# Method 1: iterrows (slowest)
# time.perf_counter() is used because it has finer resolution than time.time()
start = time.perf_counter()
result = []
for idx, row in df.iterrows():
    result.append(row['A'] + row['B'])
iterrows_time = time.perf_counter() - start

# Method 2: apply (slow)
start = time.perf_counter()
result = df.apply(lambda row: row['A'] + row['B'], axis=1)
apply_time = time.perf_counter() - start

# Method 3: vectorized (fast)
start = time.perf_counter()
result = df['A'] + df['B']
vector_time = time.perf_counter() - start

print(f"iterrows: {iterrows_time:.3f}s")
print(f"apply: {apply_time:.3f}s")
print(f"vectorized: {vector_time:.6f}s")
```

Typical results (exact timings depend on hardware and pandas version):

    iterrows:   4.521s
    apply:      1.234s
    vectorized: 0.001s
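Single timer readings like the ones above are noisy, especially for an operation that finishes in well under a millisecond. The following is a minimal sketch of a steadier measurement, assuming the standard-library `timeit` module is an acceptable replacement for the manual timers. The helper name `best_time` is hypothetical, and the frame is deliberately smaller than in the comparison above so that the slow variant finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

# A smaller frame than in the comparison above, so the slow variant stays quick.
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.standard_normal(2_000), 'B': rng.standard_normal(2_000)})


def best_time(func, repeat=3, number=1):
    """Hypothetical helper: best per-call wall-clock time in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# Row-wise apply is slow, so one call per repeat is enough.
apply_time = best_time(lambda: df.apply(lambda row: row['A'] + row['B'], axis=1))

# The vectorized version is very fast, so average over many calls per repeat.
vector_time = best_time(lambda: df['A'] + df['B'], number=100)

print(f"apply:      {apply_time:.6f}s")
print(f"vectorized: {vector_time:.6f}s")
```

Taking the minimum over several repeats reduces the influence of background activity on the machine; the absolute numbers will still differ from system to system.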

Common Vectorized Operations

Arithmetic

```python
# Element-wise operations
df['sum'] = df['A'] + df['B']
df['diff'] = df['A'] - df['B']
df['product'] = df['A'] * df['B']
df['ratio'] = df['A'] / df['B']
df['power'] = df['A'] ** 2
```

Comparisons

```python
# Returns boolean Series
mask = df['A'] > 0
mask = df['A'] >= df['B']
mask = (df['A'] > 0) & (df['B'] < 0)
mask = (df['A'] > 0) | (df['B'] > 0)
```
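A boolean mask is usually an intermediate step: it is then used to select or count rows. The sketch below uses a small, made-up four-row frame to show both uses. The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Small illustrative frame (values chosen by hand for this sketch)
df = pd.DataFrame({'A': [1.5, -0.5, 2.0, -3.0], 'B': [-1.0, 0.5, 3.0, -2.0]})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df['A'] > 0) & (df['B'] < 0)

selected = df[mask]          # keep only the rows where the mask is True
n_matches = int(mask.sum())  # True counts as 1, so sum() counts matching rows

print(selected)
print(n_matches)  # 1
```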

String Operations (with .str accessor)

```python
df = pd.DataFrame({'text': ['hello', 'world', 'python']})

# Vectorized string operations
df['upper'] = df['text'].str.upper()
df['length'] = df['text'].str.len()
df['contains_o'] = df['text'].str.contains('o')
```

Datetime Operations (with .dt accessor)

```python
df = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=100)})

# Vectorized datetime operations
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.dayofweek
```
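The `.dt` attributes combine naturally with vectorized comparisons. As a small sketch (the column name `is_weekend` and the two-week date range are chosen only for illustration), weekends can be flagged without looping over the dates:

```python
import pandas as pd

# Two weeks of daily dates starting on Monday, 2024-01-01
df = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=14)})

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
df['is_weekend'] = df['date'].dt.dayofweek >= 5

print(df.loc[df['is_weekend'], 'date'].dt.strftime('%Y-%m-%d').tolist())
# ['2024-01-06', '2024-01-07', '2024-01-13', '2024-01-14']
```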

Replacing Loops with Vectorization

Conditional Assignment

```python
# Instead of:
for i in range(len(df)):
    if df.loc[i, 'A'] > 0:
        df.loc[i, 'result'] = 'positive'
    else:
        df.loc[i, 'result'] = 'negative'

# Use:
df['result'] = np.where(df['A'] > 0, 'positive', 'negative')
```

Multiple Conditions

```python
# Instead of complex if-elif-else in loop:
conditions = [
    df['A'] > 1,
    df['A'] > 0,
    df['A'] > -1
]
choices = ['high', 'medium', 'low']
df['category'] = np.select(conditions, choices, default='very_low')
```
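The order of the conditions matters: `np.select` takes the choice for the first condition that is true for each element, just as an if-elif chain stops at the first matching branch. A short sketch with four hand-picked values makes this visible:

```python
import numpy as np
import pandas as pd

# Hand-picked values, one for each category
df = pd.DataFrame({'A': [2.0, 0.5, -0.5, -2.0]})

conditions = [df['A'] > 1, df['A'] > 0, df['A'] > -1]
choices = ['high', 'medium', 'low']

# 2.0 satisfies all three conditions, but only the first match ('high') is used.
# -2.0 satisfies none of them, so it receives the default.
df['category'] = np.select(conditions, choices, default='very_low')

print(df['category'].tolist())  # ['high', 'medium', 'low', 'very_low']
```

Listing the conditions from most to least restrictive, as above, reproduces the behavior of the if-elif-else chain.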

Cumulative Operations

```python
# Instead of loop with running total:
df['cumsum'] = df['A'].cumsum()
df['cumprod'] = df['A'].cumprod()
df['cummax'] = df['A'].cummax()
df['cummin'] = df['A'].cummin()
```
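To make the correspondence with a running-total loop concrete, the sketch below computes the same result both ways on a tiny made-up Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Explicit loop that keeps a running total
running, total = [], 0
for value in s:
    total += value
    running.append(total)

# Vectorized equivalent
vectorized = s.cumsum().tolist()

print(running)     # [1, 3, 6, 10]
print(vectorized)  # [1, 3, 6, 10]
```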

Shifting and Differencing

```python
# Instead of loop comparing to previous row:
df['prev_A'] = df['A'].shift(1)
df['change'] = df['A'].diff()
df['pct_change'] = df['A'].pct_change()
```
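All three methods compare each row with the one before it, so the first row has nothing to compare against. A minimal sketch on three hand-picked values shows what each column contains:

```python
import pandas as pd

df = pd.DataFrame({'A': [10.0, 12.0, 9.0]})

df['prev_A'] = df['A'].shift(1)          # value from the previous row
df['change'] = df['A'].diff()            # A - prev_A
df['pct_change'] = df['A'].pct_change()  # (A - prev_A) / prev_A

print(df)
# The first row has no predecessor, so all three new columns start with NaN.
# The second row holds prev_A = 10.0, change = 2.0 and pct_change of about 0.2.
```

Code that consumes these columns typically needs to handle the leading NaN, for example with `dropna()` or `fillna()`.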

Rolling Operations

```python
# Instead of loop calculating moving average:
df['rolling_mean'] = df['A'].rolling(window=5).mean()
df['rolling_std'] = df['A'].rolling(window=5).std()
df['rolling_sum'] = df['A'].rolling(window=5).sum()
```
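By default a rolling window produces a value only once it is full, so the first `window - 1` results are NaN. The sketch below, on a tiny made-up Series with a window of 3, contrasts the default with `min_periods=1`, which uses whatever values are available so far:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result appears only once the window holds 3 values.
full_window = s.rolling(window=3).mean()

# min_periods=1 averages whatever is available at the start of the series.
partial_window = s.rolling(window=3, min_periods=1).mean()

print(full_window.tolist())     # [nan, nan, 2.0, 3.0, 4.0]
print(partial_window.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```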

When apply() Is Acceptable

Sometimes apply() is unavoidable, but first check whether the logic inside the function can be expressed with vectorized operations:

```python
# Slow: complex logic in apply
def complex_calculation(row):
    if row['A'] > 0 and row['B'] > 0:
        return row['A'] * row['B']
    elif row['A'] < 0 and row['B'] < 0:
        return -row['A'] * row['B']
    else:
        return 0

# Faster: vectorize the logic
mask1 = (df['A'] > 0) & (df['B'] > 0)
mask2 = (df['A'] < 0) & (df['B'] < 0)

# Use a float default so the float results assigned below match the column dtype
df['result'] = 0.0
df.loc[mask1, 'result'] = df.loc[mask1, 'A'] * df.loc[mask1, 'B']
df.loc[mask2, 'result'] = -df.loc[mask2, 'A'] * df.loc[mask2, 'B']
```
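The same branching logic can also be written as a single `np.select` call. This is a sketch of one possible alternative rather than the only approach. It runs on freshly generated random data and cross-checks the result against the row-wise function; the names `both_positive`, `both_negative`, and `product` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'A': rng.standard_normal(1_000), 'B': rng.standard_normal(1_000)})


def complex_calculation(row):
    """Row-wise reference implementation, used here only for the cross-check."""
    if row['A'] > 0 and row['B'] > 0:
        return row['A'] * row['B']
    elif row['A'] < 0 and row['B'] < 0:
        return -row['A'] * row['B']
    else:
        return 0.0


both_positive = (df['A'] > 0) & (df['B'] > 0)
both_negative = (df['A'] < 0) & (df['B'] < 0)
product = df['A'] * df['B']

# One np.select call replaces the default assignment plus the two .loc updates.
df['result'] = np.select([both_positive, both_negative], [product, -product], default=0.0)

# Cross-check against the row-wise version on the same data.
expected = df.apply(complex_calculation, axis=1)
print(np.allclose(df['result'], expected))  # True
```

Keeping the slow row-wise function around as a reference is a cheap way to test a vectorized rewrite before deleting the original.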

Using NumPy Functions

NumPy functions work directly on pandas objects:

```python
# Math functions
df['abs_A'] = np.abs(df['A'])
df['sqrt_abs'] = np.sqrt(np.abs(df['A']))
df['log_abs'] = np.log(np.abs(df['A']) + 1)
df['exp_A'] = np.exp(df['A'])

# Trigonometric
df['sin_A'] = np.sin(df['A'])
df['cos_A'] = np.cos(df['A'])

# Rounding
df['floor'] = np.floor(df['A'])
df['ceil'] = np.ceil(df['A'])
df['round'] = np.round(df['A'], 2)
```

Practical Example: Financial Calculations

```python
# Stock price data
np.random.seed(42)
prices = pd.DataFrame({
    'close': 100 + np.cumsum(np.random.randn(10000) * 0.5)
})

# All vectorized calculations
prices['return'] = prices['close'].pct_change()
prices['log_return'] = np.log(prices['close'] / prices['close'].shift(1))
prices['sma_20'] = prices['close'].rolling(20).mean()
prices['sma_50'] = prices['close'].rolling(50).mean()
prices['std_20'] = prices['return'].rolling(20).std()
# std_20 is the volatility of returns, so it is scaled by the price
# to express the band width in price units
prices['upper_band'] = prices['sma_20'] + 2 * prices['std_20'] * prices['close']
prices['lower_band'] = prices['sma_20'] - 2 * prices['std_20'] * prices['close']
# While sma_50 is still NaN (first 49 rows) the comparison is False, so signal is -1
prices['signal'] = np.where(prices['sma_20'] > prices['sma_50'], 1, -1)
```

Performance Tips

1. Avoid iterrows()

```python
# Never do this for large DataFrames
for idx, row in df.iterrows():
    # process row
    pass
```

2. Minimize apply()

```python
# Try to replace apply with vectorized operations

# Bad
df['result'] = df.apply(lambda x: x['A'] + x['B'], axis=1)

# Good
df['result'] = df['A'] + df['B']
```

3. Use eval() for Complex Expressions

```python
# For complex arithmetic on large DataFrames
# (assumes df has numeric columns A, B, C, D and E)
df.eval('result = (A + B) * (C - D) / E', inplace=True)
```
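The expression above refers to columns `C`, `D`, and `E`, which the earlier examples never create. The following self-contained sketch builds a frame with hypothetical columns `A` to `E` and checks that `eval` gives the same values as ordinary column arithmetic. Any speed benefit depends on the frame being large and, as far as the pandas documentation describes it, on the optional `numexpr` package being installed, so it is worth measuring before relying on it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns A-E; E is kept away from zero to avoid dividing by zero.
df = pd.DataFrame({
    'A': rng.standard_normal(1_000),
    'B': rng.standard_normal(1_000),
    'C': rng.standard_normal(1_000),
    'D': rng.standard_normal(1_000),
    'E': rng.uniform(1.0, 2.0, 1_000),
})

# Without inplace=True, eval returns a new DataFrame with the extra column.
with_result = df.eval('result = (A + B) * (C - D) / E')

# The same arithmetic written with ordinary column operations.
expected = (df['A'] + df['B']) * (df['C'] - df['D']) / df['E']

print(np.allclose(with_result['result'], expected))  # True
```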

4. Batch Operations

```python
# Instead of multiple separate operations
df['A'] = df['A'] * 2
df['B'] = df['B'] + 1
df['C'] = df['A'] + df['B']

# Consider eval for batch
# (an alternative to the three lines above, not a follow-up step)
df.eval('''
    A = A * 2
    B = B + 1
    C = A + B
''', inplace=True)
```

Summary

| Method | Speed | Use Case |
| --- | --- | --- |
| Vectorized ops | ⚡ Fastest | Arithmetic, comparisons |
| NumPy functions | ⚡ Fast | Math operations |
| eval() | ⚡ Fast | Complex expressions |
| apply(axis=0) | 🔶 Moderate | Column-wise operations |
| apply(axis=1) | 🔴 Slow | Row-wise operations |
| iterrows() | 🔴 Slowest | Avoid if possible |

Rule of thumb: If you're writing a loop over DataFrame rows, there's almost always a vectorized alternative.


Runnable Example: performance_tutorial.py

```python
"""
Pandas Tutorial: Performance Optimization.

Covers techniques for faster data processing.
"""

import pandas as pd
import numpy as np
import time

# =============================================================================
# Main
# =============================================================================

if __name__ == "__main__":

    print("="*70)
    print("PERFORMANCE OPTIMIZATION")
    print("="*70)

    # Create large dataset
    n = 100000
    np.random.seed(42)
    df = pd.DataFrame({
        'A': np.random.randint(0, 100, n),
        'B': np.random.randint(0, 100, n),
        'C': np.random.choice(['X', 'Y', 'Z'], n),
        'D': np.random.random(n)
    })

    print(f"\nDataFrame with {len(df):,} rows")
    print(df.head())

    # 1. Vectorized operations vs loops
    print("\n1. Vectorized operations vs loops:")

    # Loop (slow)
    # perf_counter has finer resolution than time.time, which keeps vec_time above zero
    start = time.perf_counter()
    result_loop = []
    for val in df['A']:
        result_loop.append(val * 2)
    loop_time = time.perf_counter() - start

    # Vectorized (fast)
    start = time.perf_counter()
    result_vec = df['A'] * 2
    vec_time = time.perf_counter() - start

    print(f"Loop time: {loop_time:.4f}s")
    print(f"Vectorized time: {vec_time:.4f}s")
    print(f"Speedup: {loop_time/vec_time:.2f}x")

    # 2. Use categorical for repeated strings
    print("\n2. Memory optimization with categorical:")
    print(f"Original memory: {df['C'].memory_usage(deep=True):,} bytes")
    df['C_cat'] = df['C'].astype('category')
    print(f"Categorical memory: {df['C_cat'].memory_usage(deep=True):,} bytes")

    # 3. Query method (can be faster than boolean indexing on large datasets; measure to confirm)
    print("\n3. Query method:")
    start = time.perf_counter()
    result1 = df[(df['A'] > 50) & (df['B'] < 30)]
    bool_time = time.perf_counter() - start

    start = time.perf_counter()
    result2 = df.query('A > 50 and B < 30')
    query_time = time.perf_counter() - start

    print(f"Boolean indexing: {bool_time:.4f}s")
    print(f"Query method: {query_time:.4f}s")

    # 4. Use appropriate dtypes
    print("\n4. Optimize data types:")
    # Both columns must have the same length, so the float list is trimmed to n values
    df_types = pd.DataFrame({
        'int_col': [1, 2, 3, 4, 5] * 20000,
        'float_col': ([1.5, 2.5, 3.5] * 33334)[:n]
    })

    print("\nBefore optimization:")
    print(df_types.dtypes)
    print(f"Memory: {df_types.memory_usage(deep=True).sum():,} bytes")

    # Downcast to smaller types
    df_types['int_col'] = pd.to_numeric(df_types['int_col'], downcast='integer')
    df_types['float_col'] = pd.to_numeric(df_types['float_col'], downcast='float')

    print("\nAfter optimization:")
    print(df_types.dtypes)
    print(f"Memory: {df_types.memory_usage(deep=True).sum():,} bytes")

    # 5. Avoid chained indexing
    print("\n5. Avoid chained indexing:")
    print("❌ Bad: df[df['A'] > 50]['B'] = 100  (chained)")
    print("✅ Good: df.loc[df['A'] > 50, 'B'] = 100")

    # 6. Use inplace when appropriate (though not always faster)
    print("\n6. Inplace operations:")
    df_temp = df.copy()
    start = time.perf_counter()
    df_temp = df_temp.drop(columns=['D'])
    drop_time = time.perf_counter() - start

    df_temp2 = df.copy()
    start = time.perf_counter()
    df_temp2.drop(columns=['D'], inplace=True)
    inplace_time = time.perf_counter() - start

    print(f"Regular drop: {drop_time:.4f}s")
    print(f"Inplace drop: {inplace_time:.4f}s")

    print("\nKEY TAKEAWAYS:")
    print("1. Use vectorized operations instead of loops")
    print("2. Convert repeated strings to categorical")
    print("3. Use query() for complex boolean indexing")
    print("4. Optimize data types (downcast integers/floats)")
    print("5. Avoid chained indexing - use loc/iloc")
    print("6. Consider chunking for very large files")
    print("7. Use eval() for complex expressions")
    print("8. Avoid apply() when vectorization is possible")
```


Exercises

Exercise 1. Write code that computes a new column using vectorized operations (e.g., df['c'] = df['a'] + df['b']) and compare the timing with a for-loop approach using %%timeit or time.

Solution to Exercise 1

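```python
import time
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(100_000), 'B': np.random.randn(100_000)})

# Vectorized: a single call into compiled code
start = time.perf_counter()
df['C'] = df['A'] + df['B']
vec_time = time.perf_counter() - start

# For loop: one Python-level addition per row
start = time.perf_counter()
loop_result = [a + b for a, b in zip(df['A'], df['B'])]
loop_time = time.perf_counter() - start

print(f"vectorized: {vec_time:.6f}s, loop: {loop_time:.6f}s")
```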


Exercise 2. Explain why vectorized operations are faster than iterating with iterrows() in Pandas.

Solution to Exercise 2

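A loop over iterrows() runs in the Python interpreter: every iteration builds a new Series for the row, looks up values by label, and performs the arithmetic on individual Python objects. A vectorized expression such as df['A'] + df['B'] makes a single call that hands the underlying arrays to a compiled routine, which processes all values in one tight, optimized loop. The interpreter and object-creation overhead is therefore paid once per column operation instead of once per row.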


Exercise 3. Write code that replaces a for loop that applies a conditional transformation with a vectorized np.where() call.

Solution to Exercise 3

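```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)})

# Loop version: label each value one at a time
labels = []
for value in df['A']:
    labels.append('positive' if value > 0 else 'negative')

# Vectorized version: one np.where call over the whole column
df['label'] = np.where(df['A'] > 0, 'positive', 'negative')
print((df['label'] == labels).all())  # both versions agree
```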


Exercise 4. Demonstrate the performance difference between df.apply(func) and a vectorized alternative for a simple mathematical operation.

Solution to Exercise 4

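```python
import time
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(100_000)})

# apply: calls the Python function once per element
start = time.perf_counter()
applied = df['A'].apply(lambda x: x ** 2 + 1)
apply_time = time.perf_counter() - start

# Vectorized: the same arithmetic on the whole column at once
start = time.perf_counter()
vectorized = df['A'] ** 2 + 1
vector_time = time.perf_counter() - start

print(f"apply: {apply_time:.4f}s, vectorized: {vector_time:.4f}s")
```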