
Performance vs Loops

Vectorization is the single most important performance concept in NumPy.

Python Loop Cost

1. Loop Overhead

A pure Python loop pays interpreter overhead on every element it touches.

import numpy as np

def add_with_loop(a, b):
    # Element-by-element addition in interpreted Python
    res = []
    for i in range(len(a)):
        res.append(a[i] + b[i])  # index, add, and append on every pass
    return res

2. Sources of Slowdown

Each iteration involves:

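  • Interpreter overhead (decoding and dispatching bytecode)
  • Repeated attribute lookups (e.g. res.append on every pass)
  • Python object creation (a new boxed scalar for each element read and each sum)
  • Dynamic type checking to decide what + means for the operands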
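The object-creation cost is easy to observe. In this small sketch (the array and variable names are illustrative), indexing a NumPy array with a scalar index returns a freshly boxed `numpy.float64` object rather than the raw 8-byte value, and that object carries a full Python object header.

```python
import sys

import numpy as np

a = np.arange(5, dtype=np.float64)

# Scalar indexing boxes the raw double into a NumPy scalar object
elem = a[0]
print(type(elem))  # <class 'numpy.float64'>

# The raw value occupies itemsize bytes inside the array buffer,
# while the boxed object also carries a Python object header
print(f"in buffer: {a.itemsize} bytes, boxed: {sys.getsizeof(elem)} bytes")
```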

3. Accumulating Cost

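These costs are paid once per element, so total overhead grows linearly with array length; for arrays with millions of elements it dominates the running time.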

Vectorized NumPy

1. Single Expression

import numpy as np

def add_vectorized(a, b):
    return a + b
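As a quick check, both functions compute the same sums, but they return different types: the loop builds a Python list of scalars, while the vectorized version returns a new `ndarray`. The sample inputs below are arbitrary.

```python
import numpy as np

def add_with_loop(a, b):
    # One interpreted iteration (index, add, append) per element
    res = []
    for i in range(len(a)):
        res.append(a[i] + b[i])
    return res

def add_vectorized(a, b):
    # A single call into NumPy's compiled element-wise add
    return a + b

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

loop_result = add_with_loop(a, b)
vec_result = add_vectorized(a, b)

print(type(loop_result).__name__, loop_result)
print(type(vec_result).__name__, vec_result)
```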

2. Under the Hood

This executes:

  • In optimized C loops
  • With sequential memory access over contiguous (or regularly strided) buffers
  • Without Python-level iteration
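For arrays, the `+` operator dispatches to the universal function (ufunc) `np.add`, which selects a compiled inner loop based on the input dtypes. This sketch makes that visible; `np.add.types` lists the type signatures that have a compiled loop, with `'dd->d'` meaning float64 + float64 gives float64.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# The operator and the explicit ufunc call produce the same result
print(np.array_equal(a + b, np.add(a, b)))  # True

# np.add is a ufunc with one compiled loop per supported type signature
print(isinstance(np.add, np.ufunc))  # True
print("dd->d" in np.add.types)  # True: float64 + float64 -> float64
```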

3. Hardware Benefits

Vectorized code can leverage:

  • CPU cache prefetching
  • SIMD instructions
  • Memory locality
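Memory locality depends on layout. In a C-ordered (row-major) 2-D array, the elements of a row sit next to each other, while consecutive elements of a column are a whole row apart. The sketch below inspects `strides` (bytes to step along each axis) to show this; the 1000 × 1000 shape is arbitrary.

```python
import numpy as np

m = np.zeros((1000, 1000))  # float64, C (row-major) order by default

# Bytes to move one step along axis 0 (next row) and axis 1 (next column)
print(m.strides)  # (8000, 8)

row = m[0, :]  # neighbours are 8 bytes apart: cache friendly
col = m[:, 0]  # neighbours are 8000 bytes apart: poorer locality
print(row.flags["C_CONTIGUOUS"], col.flags["C_CONTIGUOUS"])  # True False

# A contiguous copy of the column restores unit-stride access
col_copy = np.ascontiguousarray(col)
print(col_copy.strides)  # (8,)
```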

Speedup Magnitude

1. Typical Speedups

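Vectorization typically provides:

  • 10×–100× faster for small arrays
  • 100×–1000× faster for large arrays

These are rough orders of magnitude; actual figures depend on the hardware, dtype, and operation. For very small arrays (a handful of elements), NumPy's fixed per-call overhead narrows the gap considerably.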

2. Array Size Effect

import numpy as np
import time

def main():
    for n in [1_000, 100_000, 10_000_000]:
        a = np.random.randn(n)

        # Loop
        start = time.perf_counter()
        result = [x ** 2 for x in a]
        loop_time = time.perf_counter() - start

        # Vectorized
        start = time.perf_counter()
        result = a ** 2
        vec_time = time.perf_counter() - start

        print(f"n={n:>10}: {loop_time/vec_time:>6.0f}x speedup")

if __name__ == "__main__":
    main()
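Single `perf_counter` measurements are noisy, especially for the fast vectorized case. A somewhat more robust variant, sketched below on the assumption that the minimum of several repeats is a fair estimate, uses the standard-library `timeit` module; `best_time` and `speedups` are illustrative helper names, not part of NumPy.

```python
import timeit

import numpy as np

def best_time(stmt, number, repeat=5):
    # Minimum over several repeats filters out scheduler and OS noise;
    # dividing by number gives seconds per call
    return min(timeit.repeat(stmt, number=number, repeat=repeat)) / number

def speedups(sizes, number=3):
    rng = np.random.default_rng(0)
    out = {}
    for n in sizes:
        a = rng.standard_normal(n)
        loop_t = best_time(lambda: [x ** 2 for x in a], number)
        vec_t = best_time(lambda: a ** 2, number)
        out[n] = loop_t / vec_t
    return out

if __name__ == "__main__":
    for n, s in speedups([1_000, 100_000]).items():
        print(f"n={n:>10}: {s:>6.0f}x speedup")
```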

3. Why Essential

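Because per-element interpreter overhead dominates loop-based numerical code, moving the work into NumPy's compiled loops is what makes Python practical for numerical computing.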

Memory Tradeoffs

1. Temporary Arrays

Vectorization may allocate temporary arrays.

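import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = np.array([2.0, 2.0, 2.0])

# (a + b) is materialized as a temporary array before the multiply
result = (a + b) * c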

2. Peak Memory

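Each temporary is a full-size array, so chained expressions on large arrays can temporarily raise peak memory usage by one full array per intermediate result, even though the temporaries are freed once the expression finishes.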
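A back-of-the-envelope estimate shows why this matters. Assuming `float64` data (8 bytes per element) and a hypothetical size of ten million elements, every input, temporary, and result is a separate 80 MB buffer:

```python
import numpy as np

n = 10_000_000  # hypothetical array length
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

bytes_per_array = n * itemsize
print(f"{bytes_per_array / 1e6:.0f} MB per array")  # 80 MB per array

# (a + b) * c: three inputs, the (a + b) temporary, and the result,
# in the worst case where NumPy cannot reuse the temporary's buffer
worst_case = 5 * bytes_per_array
print(f"up to {worst_case / 1e6:.0f} MB live at once")  # up to 400 MB live at once
```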

3. In-Place Operations

When the original values are no longer needed, in-place operations update an existing array without allocating a new one:

import numpy as np

def main():
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    a += b  # Modifies a in place
    print(a)

if __name__ == "__main__":
    main()
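Ufuncs also accept an `out=` argument, which lets a chained expression such as `(a + b) * c` reuse one preallocated buffer instead of allocating a temporary. One caveat of in-place updates: the result must be castable to the target's dtype, so adding floats into an integer array in place raises an error rather than silently upcasting. A sketch with arbitrary sample values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = np.array([2.0, 2.0, 2.0])

# Evaluate (a + b) * c into a single preallocated buffer
buf = np.empty_like(a)
np.add(a, b, out=buf)         # buf = a + b
np.multiply(buf, c, out=buf)  # buf = (a + b) * c, no second allocation
print(buf)  # [10. 14. 18.]

# In-place ops keep the target's dtype: float results cannot be cast to int
ints = np.array([1, 2, 3])
refused = False
try:
    ints += 0.5
except TypeError:  # NumPy raises a TypeError subclass for the bad cast
    refused = True
print("in-place float add into int array refused:", refused)
```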

Practical Guidance

1. Avoid Python Loops

Never loop over array elements when vectorization is possible.

2. Think in Arrays

Reformulate problems as array operations.
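For example, a loop with a per-element branch can usually be rewritten as a single array expression. The sketch below (the clipping rule and function names are illustrative) replaces an `if`/`else` inside a loop with `np.where`, which evaluates both branches over the whole array and picks element-wise using a boolean mask.

```python
import numpy as np

def relu_loop(x):
    # Element-by-element branch in interpreted Python
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def relu_vectorized(x):
    # Boolean mask + np.where: both branches computed array-wide
    return np.where(x > 0, x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu_vectorized(x))
```

For this particular rule `np.maximum(x, 0.0)` would also work; `np.where` is shown because it generalizes to arbitrary two-branch conditions.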

3. Profile First

Measure before micro-optimizing.

import numpy as np
import time

def main():
    arr = np.random.randn(1_000_000)

    start = time.perf_counter()
    result = np.sum(arr ** 2)
    elapsed = time.perf_counter() - start

    print(f"Time: {elapsed:.6f} sec")

if __name__ == "__main__":
    main()
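Once a measurement shows where time goes, alternatives can be compared the same way. This sketch uses the standard-library `timeit` module to compare two vectorized formulations of the sum of squares; `np.dot(arr, arr)` avoids materializing the `arr ** 2` temporary. Which one is faster depends on the machine and NumPy build, so the timings are something to measure, not an expected result. The `compare` helper is a hypothetical name.

```python
import timeit

import numpy as np

def compare(n=1_000_000, repeat=5, number=10):
    rng = np.random.default_rng(0)
    arr = rng.standard_normal(n)

    candidates = {
        "np.sum(arr ** 2)": lambda: np.sum(arr ** 2),  # allocates a temporary
        "np.dot(arr, arr)": lambda: np.dot(arr, arr),  # no temporary array
    }
    # Both formulations must agree before their speed is worth comparing
    values = [fn() for fn in candidates.values()]
    assert np.isclose(values[0], values[1])

    timings = {}
    for name, fn in candidates.items():
        # Best of several repeats, reported as seconds per call
        timings[name] = min(timeit.repeat(fn, number=number, repeat=repeat)) / number
    return timings

if __name__ == "__main__":
    for name, t in compare().items():
        print(f"{name:<18} {t * 1e3:.3f} ms")
```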

Runnable Example: performance_tutorial.py

"""
01_performance.py - Writing Fast NumPy Code

Key: Vectorization eliminates Python loops!
"""

import numpy as np
import time

if __name__ == "__main__":

    print("="*80)
    print("PERFORMANCE OPTIMIZATION")
    print("="*80)

    # ============================================================================
    # Vectorization Example
    # ============================================================================

    print("\nVectorization: Eliminate Loops!")
    print("="*80)

    n = 1000000
    arr = np.random.rand(n)

    # SLOW: Python loop
    print(f"\nProcessing {n:,} elements...")
    start = time.perf_counter()  # high-resolution timer (time.time() can read 0 ms)
    result = np.zeros(n)
    for i in range(n):
        result[i] = arr[i] ** 2 + 2 * arr[i] + 1
    slow_time = time.perf_counter() - start

    # FAST: Vectorized (same polynomial, computed array-wide)
    start = time.perf_counter()
    result_fast = arr**2 + 2*arr + 1
    fast_time = time.perf_counter() - start

    # Sanity check: both approaches must agree
    assert np.allclose(result, result_fast)

    print(f"Python loop: {slow_time*1000:.0f} ms")
    print(f"Vectorized:  {fast_time*1000:.1f} ms")
    print(f"Speedup: {slow_time/fast_time:.0f}x faster!")

    print("""
    \nWhy vectorized is faster:
    1. NumPy uses CPU SIMD instructions
    2. No Python interpreter overhead per element
    3. Contiguous memory (cache friendly)
    4. Optimized C code underneath
    """)

    # ============================================================================
    # Memory Efficiency
    # ============================================================================

    print("\n" + "="*80)
    print("Memory Efficiency (Topic #24)")
    print("="*80)

    # Use appropriate dtype (the default integer dtype is platform
    # dependent, commonly int64)
    arr_default = np.arange(1000)
    arr_int16 = np.arange(1000, dtype=np.int16)  # 0..999 fits in int16

    print(f"Default int: {arr_default.nbytes:,} bytes")
    print(f"int16: {arr_int16.nbytes:,} bytes")
    print(f"Savings: {100*(1 - arr_int16.nbytes/arr_default.nbytes):.0f}%")

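    # Reuse arrays instead of creating new ones
    print("\nReuse arrays (avoid allocations):")
    rng = np.random.default_rng()
    result = np.zeros(1000)
    for _ in range(100):
        rng.random(out=result)  # fill the existing buffer in place
        result *= 2             # scale in place, no temporary array
    print("  Use out= arguments and in-place operators to reuse memory")
    print("  (arr[:] = expr reuses arr, but expr itself still allocates)")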

    print("""
    \n🎯 PERFORMANCE TIPS:
    1. Use vectorization (eliminate loops!)
    2. Choose appropriate dtype
    3. Reuse arrays when possible
    4. Use views instead of copies
    5. Profile before optimizing!
    """)