Performance vs Loops¶

Vectorization is the single most important performance concept in NumPy.

Python Loop Cost¶

1. Loop Overhead¶

A pure Python loop pays interpreter overhead on every iteration, which makes element-by-element work on arrays slow.

import numpy as np

def add_with_loop(a, b):
    res = []
    for i in range(len(a)):
        res.append(a[i] + b[i])
    return res

2. Sources of Slowdown¶

Each iteration involves:

Interpreter overhead
Repeated attribute lookups
Python object creation
Type checking
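The object-creation cost can be made visible with a small sketch. It assumes a float64 array on CPython; the exact size of the boxed object is platform-dependent, so it is printed rather than stated.

```python
import sys
import numpy as np

a = np.arange(5, dtype=np.float64)

# Indexing a single element wraps the raw 8-byte value in a new
# NumPy scalar object on every access.
elem = a[2]

print(type(elem).__name__)    # float64
print(a.itemsize)             # bytes per element inside the array buffer
print(sys.getsizeof(elem))    # bytes for the boxed scalar object (larger)
```

A loop such as `add_with_loop` performs this boxing twice per iteration, once for `a[i]` and once for `b[i]`, before any arithmetic happens.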

3. Accumulating Cost¶

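These per-element costs grow linearly with array size, so a loop over millions of elements spends far more time on interpreter bookkeeping than on arithmetic.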

Vectorized NumPy¶

1. Single Expression¶

import numpy as np

def add_vectorized(a, b):
    return a + b

2. Under the Hood¶

A single call dispatches to compiled code, which runs:

In optimized C loops
With contiguous memory access, when the arrays are stored contiguously
Without Python-level iteration
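As a quick check of the points above, the sketch below shows that `a + b` is equivalent to one call to the `np.add` ufunc and that freshly created arrays are stored contiguously. The array values are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# np.add is a ufunc: a compiled function that loops over elements in C.
print(isinstance(np.add, np.ufunc))          # True

# The + operator on arrays gives the same result as calling the ufunc directly.
print(np.array_equal(a + b, np.add(a, b)))   # True

# Arrays created this way occupy one contiguous block of memory.
print(a.flags["C_CONTIGUOUS"])               # True
```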

3. Hardware Benefits¶

Vectorized code can leverage:

CPU cache prefetching
SIMD instructions
Memory locality
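Memory locality can be observed by comparing a contiguous row with a strided column of the same 2-D array. This is a rough sketch: the array shape is an arbitrary choice, and the timings depend on hardware, so they are printed without an expected value.

```python
import timeit
import numpy as np

m = np.random.rand(2000, 2000)   # C order: each row is contiguous
print(m.strides)                 # (16000, 8)

row = m[0, :]   # neighbouring elements are 8 bytes apart
col = m[:, 0]   # neighbouring elements are 16000 bytes apart

# Best of several repeats, 1000 calls each.
t_row = min(timeit.repeat(lambda: row.sum(), number=1000, repeat=5))
t_col = min(timeit.repeat(lambda: col.sum(), number=1000, repeat=5))
print(f"row sum: {t_row:.4f} s, column sum: {t_col:.4f} s")
```

Both sums touch 2000 elements; any difference in time comes from how those elements are laid out in memory.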

Speedup Magnitude¶

1. Typical Speedups¶

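Speedups vary with the operation, dtype, and hardware. As a rough guide, replacing an element-wise Python loop with a vectorized expression is often: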

10×–100× faster for small arrays
100×–1000× faster for large arrays

2. Array Size Effect¶

import numpy as np
import time

def main():
    for n in [1_000, 100_000, 10_000_000]:
        a = np.random.randn(n)

        # Loop
        start = time.perf_counter()
        result = [x ** 2 for x in a]
        loop_time = time.perf_counter() - start

        # Vectorized
        start = time.perf_counter()
        result = a ** 2
        vec_time = time.perf_counter() - start

        print(f"n={n:>10}: {loop_time/vec_time:>6.0f}x speedup")

if __name__ == "__main__":
    main()
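A single `perf_counter` measurement can be noisy. A more robust sketch uses the standard-library `timeit` module and keeps the best of several repeats. The helper name `best_time` and the repeat counts are illustrative choices, not part of NumPy.

```python
import timeit
import numpy as np

def best_time(func, repeat=5, number=3):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

a = np.random.randn(100_000)

loop_time = best_time(lambda: [x ** 2 for x in a])   # element-wise Python loop
vec_time = best_time(lambda: a ** 2)                 # single vectorized call

print(f"loop:       {loop_time * 1e3:.3f} ms")
print(f"vectorized: {vec_time * 1e3:.3f} ms")
print(f"speedup:    {loop_time / vec_time:.0f}x")
```

Taking the minimum reduces the influence of background activity on the machine, which tends to inflate individual runs.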

3. Why Essential¶

Speedups of this size are why NumPy is the standard foundation for numerical work in Python.

Memory Tradeoffs¶

1. Temporary Arrays¶

A compound vectorized expression allocates a full-size temporary array for each intermediate result.

import numpy as np

a, b, c = np.random.randn(3, 1_000_000)  # three example arrays

# Creates a temporary for (a + b) before multiplying
result = (a + b) * c
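One way to avoid the extra allocation is to pass an output buffer to the ufuncs through their `out` parameter. The sketch below uses small example arrays so the result can be checked by hand.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = np.array([2.0, 2.0, 2.0])

out = np.empty_like(a)         # one buffer, allocated once
np.add(a, b, out=out)          # out = a + b
np.multiply(out, c, out=out)   # out = out * c, written into the same buffer

print(out)  # [10. 14. 18.]
```

This trades readability for fewer allocations, so it is mainly worth doing for large arrays in code that profiling has shown to be hot.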

2. Peak Memory¶

Temporaries are as large as the operands, so peak memory during a compound expression can be a multiple of the final result's size.
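When a reduction over a very large array would need an equally large temporary, one option is to process the data in blocks. The function name `sum_of_squares_chunked` and the default chunk size below are illustrative assumptions.

```python
import numpy as np

def sum_of_squares_chunked(arr, chunk_size=100_000):
    """Sum arr**2 block by block so each temporary holds at most chunk_size items."""
    total = 0.0
    for start in range(0, arr.size, chunk_size):
        block = arr[start:start + chunk_size]   # slicing gives a view, not a copy
        total += np.sum(block ** 2)             # temporary is only chunk_size long
    return total

arr = np.random.randn(1_000_000)
print(np.isclose(sum_of_squares_chunked(arr), np.sum(arr ** 2)))  # True
```

The Python loop here runs once per block rather than once per element, so its overhead stays small while the size of each temporary is bounded.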

3. In-Place Operations¶

Use in-place operations when appropriate:

import numpy as np

def main():
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    a += b  # Modifies a in place
    print(a)

if __name__ == "__main__":
    main()
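In-place operations have two consequences worth checking before using them: every other reference to the array sees the change, and the result must fit the existing dtype. A minimal sketch with example values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

alias = a          # second name for the same array
a += b             # in place: alias sees the change
print(alias)       # [5. 7. 9.]

a = a + b          # new array: a is rebound, alias is unchanged
print(alias)       # [5. 7. 9.]
print(a)           # [ 9. 12. 15.]

ints = np.array([1, 2, 3])
try:
    ints += 0.5    # a float result cannot be cast into an integer array
except TypeError as exc:
    print("TypeError:", exc)
```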

Practical Guidance¶

1. Avoid Python Loops¶

Never loop over array elements when vectorization is possible.
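As an illustration, a running total written as a loop can be replaced by a single call to `np.cumsum`. The input values are arbitrary.

```python
import numpy as np

values = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Loop version: running total built element by element.
running = []
total = 0.0
for v in values:
    total += v
    running.append(total)

# Vectorized version: one call.
running_vec = np.cumsum(values)

print(running_vec)  # [ 3.  4.  8.  9. 14.]
```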

2. Think in Arrays¶

Reformulate problems as array operations.
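For example, a nested loop that fills a table of pairwise distances can be restated as one broadcast subtraction. This sketch assumes 1-D points for brevity.

```python
import numpy as np

points = np.array([0.0, 1.0, 3.0])

# Nested-loop formulation: dist[i][j] = |points[i] - points[j]|
# Array formulation: broadcast a column (3, 1) against a row (1, 3).
dist = np.abs(points[:, None] - points[None, :])

print(dist)
# [[0. 1. 3.]
#  [1. 0. 2.]
#  [3. 2. 0.]]
```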

3. Profile First¶

Measure before micro-optimizing.

import numpy as np
import time

def main():
    arr = np.random.randn(1_000_000)

    start = time.perf_counter()
    result = np.sum(arr ** 2)
    elapsed = time.perf_counter() - start

    print(f"Time: {elapsed:.6f} sec")

if __name__ == "__main__":
    main()
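Wall-clock timing shows how long a block takes but not where the time goes. The standard-library `cProfile` module can break that down by function. The helper `profile_call` and the function `sum_of_squares` below are illustrative names.

```python
import cProfile
import io
import pstats
import numpy as np

def sum_of_squares(arr):
    return np.sum(arr ** 2)

def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)   # top 5 entries
    return result, stream.getvalue()

arr = np.random.randn(100_000)
value, report = profile_call(sum_of_squares, arr)
print(report)
```

The report lists each function with its call count and cumulative time, which helps decide which part of a program is worth vectorizing first.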

Runnable Example: `performance_tutorial.py`¶

"""
performance_tutorial.py - Writing Fast NumPy Code

Key: Vectorization eliminates Python loops!
"""

import numpy as np
import time

if __name__ == "__main__":

    print("="*80)
    print("PERFORMANCE OPTIMIZATION")
    print("="*80)

    # ============================================================================
    # Vectorization Example
    # ============================================================================

    print("\nVectorization: Eliminate Loops!")
    print("="*80)

    n = 1000000
    arr = np.random.rand(n)

    # SLOW: Python loop
    print(f"\nProcessing {n:,} elements...")
    start = time.perf_counter()  # perf_counter is the high-resolution clock for timing
    result = np.zeros(n)
    for i in range(n):
        result[i] = arr[i] ** 2 + 2 * arr[i] + 1
    slow_time = time.perf_counter() - start

    # FAST: Vectorized
    start = time.perf_counter()
    result_fast = arr**2 + 2*arr + 1
    fast_time = time.perf_counter() - start

    print(f"Python loop: {slow_time*1000:.0f} ms")
    print(f"Vectorized:  {fast_time*1000:.0f} ms")
    print(f"Speedup: {slow_time/fast_time:.0f}x faster!")

    print("""
    \nWhy vectorized is faster:
    1. NumPy can use CPU SIMD instructions
    2. No Python interpreter overhead per element
    3. Contiguous memory (cache friendly)
    4. Optimized C code underneath
    """)

    # ============================================================================
    # Memory Efficiency
    # ============================================================================

    print("\n" + "="*80)
    print("Memory Efficiency (Topic #24)")
    print("="*80)

    # Use appropriate dtype
    arr_default = np.arange(1000)
    arr_int16 = np.arange(1000, dtype=np.int16)

    print(f"Default int: {arr_default.nbytes:,} bytes")
    print(f"int16: {arr_int16.nbytes:,} bytes")
    print(f"Savings: {100*(1 - arr_int16.nbytes/arr_default.nbytes):.0f}%")

    # Reuse arrays instead of creating new ones
    print("\nReuse arrays (avoid allocations):")
    rng = np.random.default_rng()
    result = np.zeros(1000)
    for i in range(100):
        rng.random(out=result)  # Fill the existing buffer, no new array
        result *= 2             # Scale in place
    print("  Use out= and in-place operators to reuse memory")

    print("""
    \n🎯 PERFORMANCE TIPS:
    1. Use vectorization (eliminate loops!)
    2. Choose appropriate dtype
    3. Reuse arrays when possible
    4. Use views instead of copies
    5. Profile before optimizing!
    """)
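Tip 4 is not demonstrated in the script above. A short sketch of the difference between a view and a copy, using `np.shares_memory` to check whether two arrays share a buffer:

```python
import numpy as np

arr = np.arange(10)

view = arr[2:6]            # basic slicing returns a view
copy = arr[[2, 3, 4, 5]]   # fancy (integer-list) indexing returns a copy

print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False

view[0] = 99               # writes through to the original array
print(arr[2])              # 99
```

A view costs no extra memory for the data, but because it writes through to the original, an explicit `.copy()` is the safer choice when the original must stay unchanged.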