Skip to content

Summation Benchmark

Compare different approaches for computing \(\sum_{k=1}^n k^2\).

Problem Setup

1. Mathematical Form

\[\sum_{k=1}^n k^2 = 1^2 + 2^2 + 3^2 + \cdots + n^2\]

2. Closed Form

\[\sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6}\]

3. Test Size

We use \(n = 10^7\) to highlight performance differences.

For Loop

1. Implementation

import time

def main():
    n = int(1e7)

    tic = time.time()
    s_n = 0
    for i in range(1, n + 1):
        s_n += i ** 2
    toc = time.time()

    print(f"{s_n = }")
    print(f"For Loop Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Characteristics

  • Pure Python iteration
  • Repeated addition
  • Maximum interpreter overhead

3. Typical Time

Slowest approach (~3-5 seconds).

List Append

1. Implementation

import time

def main():
    n = int(1e7)

    tic = time.time()
    lst = []
    for i in range(1, n + 1):
        lst.append(i ** 2)
    s_n = sum(lst)
    toc = time.time()

    print(f"{s_n = }")
    print(f"List Append Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Characteristics

  • Builds intermediate list
  • Extra memory allocation
  • Two-pass (build then sum)

3. Typical Time

Similar to for loop (~3-5 seconds).

List Comprehension

1. Implementation

import time

def main():
    n = int(1e7)

    tic = time.time()
    s_n = sum([i ** 2 for i in range(1, n + 1)])
    toc = time.time()

    print(f"{s_n = }")
    print(f"List Comprehension Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Characteristics

  • More Pythonic syntax
  • Still creates full list
  • Slightly optimized bytecode

3. Typical Time

Slightly faster (~2-4 seconds).

NumPy Vectorized

1. Implementation

import numpy as np
import time

def main():
    n = int(1e7)

    tic = time.time()
    x = np.arange(1, n + 1) ** 2
    s_n = np.sum(x)
    toc = time.time()

    print(f"{s_n = }")
    print(f"NumPy Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Characteristics

  • C-level computation
  • Contiguous memory
  • Optimized reduction

3. Typical Time

Much faster (~0.05-0.1 seconds).

Formula Direct

1. Implementation

import time

def main():
    n = int(1e7)

    tic = time.time()
    s_n = n * (n + 1) * (2 * n + 1) // 6
    toc = time.time()

    print(f"{s_n = }")
    print(f"Formula Time: {toc - tic:.8f} sec")

if __name__ == "__main__":
    main()

2. Characteristics

  • O(1) complexity
  • No iteration at all
  • Constant time regardless of n

3. Typical Time

Essentially instant (~0.000001 seconds).

Full Comparison

1. All Methods

import numpy as np
import time

def main():
    n = int(1e7)
    results = []

    # For loop
    tic = time.time()
    s = 0
    for i in range(1, n + 1):
        s += i ** 2
    results.append(("For Loop", time.time() - tic, s))

    # List append
    tic = time.time()
    lst = []
    for i in range(1, n + 1):
        lst.append(i ** 2)
    s = sum(lst)
    results.append(("List Append", time.time() - tic, s))

    # List comprehension
    tic = time.time()
    s = sum([i ** 2 for i in range(1, n + 1)])
    results.append(("List Comp", time.time() - tic, s))

    # NumPy
    tic = time.time()
    s = np.sum(np.arange(1, n + 1) ** 2)
    results.append(("NumPy", time.time() - tic, s))

    # Formula
    tic = time.time()
    s = n * (n + 1) * (2 * n + 1) // 6
    results.append(("Formula", time.time() - tic, s))

    print(f"{'Method':<15} {'Time (sec)':<12} {'Result'}")
    print("-" * 45)
    for method, elapsed, result in results:
        print(f"{method:<15} {elapsed:<12.6f} {result}")

if __name__ == "__main__":
    main()

2. Sample Output

Method          Time (sec)   Result
---------------------------------------------
For Loop        3.245000     333333383333335000
List Append     3.512000     333333383333335000
List Comp       2.891000     333333383333335000
NumPy           0.078000     333333383333335000
Formula         0.000001     333333383333335000

3. Key Insight

NumPy is ~40× faster than Python loops; formulas are instant when available.

Takeaways

1. Avoid Python Loops

For numerical computation, Python loops are prohibitively slow.

2. Use NumPy

Vectorized NumPy operations provide massive speedups.

3. Know Your Math

Closed-form solutions beat all iterative methods.


Runnable Example: python_vs_numpy_diffusion.py

"""
2D Diffusion Simulation: Pure Python vs NumPy Vectorization

This tutorial shows how vectorization transforms a simulation from slow to fast.
We'll simulate diffusion (how a substance spreads over time) on a 2D grid.

WHAT IS DIFFUSION?
A substance spreads from areas of high concentration to low concentration. Think
of food coloring diffusing through water - it spreads until evenly distributed.

THE MATH:
For each grid cell, the new concentration is determined by:
1. The cell's current value
2. Its neighbors' values (up, down, left, right)
3. The diffusion coefficient D and time step dt

This is governed by the discrete Laplacian operator, which measures how different
a cell is from its neighbors.

PURE PYTHON APPROACH:
- Nested loops over grid dimensions
- Each cell update requires accessing neighbor values
- Results in slow, interpreted code

NUMPY APPROACH:
- Use roll() to shift the entire array
- Subtract and add entire arrays at once
- Single equation: grid + dt * D * laplacian(grid)
- All operations compiled in C, huge speedup

Learning Goals:
- See how spatial operations can be vectorized
- Understand the Laplacian and diffusion math
- Compare nested loops vs array operations
- Measure dramatic speedup from vectorization
"""

import time
import numpy as np

if __name__ == "__main__":


    print("=" * 70)
    print("2D DIFFUSION SIMULATION: PURE PYTHON vs NUMPY VECTORIZATION")
    print("=" * 70)


    # ============ EXAMPLE 1: Understanding the Problem ============
    print("\n" + "=" * 70)
    print("EXAMPLE 1: Understanding Diffusion and the Laplacian Operator")
    print("=" * 70)

    print("""
    DIFFUSION INTUITION:
    Imagine a 2D grid where each cell contains a concentration value.
    Over time, particles move from high concentration to low concentration.

    The rate of change at each point depends on:
    - How much the point differs from its neighbors
    - This difference is measured by the Laplacian operator

    LAPLACIAN IN 2D:
    For a point (i, j), the Laplacian is:
        L = grid[i+1][j] + grid[i-1][j] + grid[i][j+1] + grid[i][j-1] - 4*grid[i][j]

    It sums the four neighbors and subtracts 4 times the center point.
    If neighbors are high, L is positive (substance increases).
    If neighbors are low, L is negative (substance decreases).

    EVOLUTION EQUATION:
        new_grid[i][j] = grid[i][j] + dt * D * Laplacian

    Where:
    - dt = time step (how much time passes in one iteration)
    - D = diffusion coefficient (how fast diffusion happens)

    Let's see this in action:
    """)


    # ============ EXAMPLE 2: Pure Python Implementation ============
    print("\n" + "=" * 70)
    print("EXAMPLE 2: Pure Python Implementation (Nested Loops)")
    print("=" * 70)

    def evolve_python(grid, dt, D=1.0):
        """
        Update the grid one time step using pure Python loops.

        This is slow because:
        1. Two nested loops iterate over every cell
        2. Four array accesses per cell (neighbors)
        3. Arithmetic operations happen in Python, not compiled code
        4. For a 640x640 grid, this is 409,600 operations per iteration!

        Grid wraps around (periodic boundary conditions) - edges connect to opposite side.
        """
        grid_shape = (len(grid), len(grid[0]))
        xmax, ymax = grid_shape

        # Create new grid for the updated values
        new_grid = [[0.0 for _ in range(ymax)] for _ in range(xmax)]

        for i in range(xmax):
            for j in range(ymax):
                # Get neighbors with wraparound (periodic boundary)
                # % operator wraps indices: 640 % 640 = 0, -1 % 640 = 639
                neighbor_right = grid[(i + 1) % xmax][j]
                neighbor_left = grid[(i - 1) % xmax][j]
                neighbor_down = grid[i][(j + 1) % ymax]
                neighbor_up = grid[i][(j - 1) % ymax]

                # Compute Laplacian: sum of neighbors - 4 * center
                laplacian = (neighbor_right + neighbor_left +
                            neighbor_down + neighbor_up - 4.0 * grid[i][j])

                # Update: new value = old value + diffusion term
                new_grid[i][j] = grid[i][j] + D * laplacian * dt

        return new_grid


    # Setup a small grid to demonstrate
    grid_shape = (10, 10)
    print(f"\nSmall demo with {grid_shape[0]}x{grid_shape[1]} grid:")

    # Initialize with zeros
    demo_grid = [[0.0 for _ in range(grid_shape[1])] for _ in range(grid_shape[0])]

    # Add a "hot spot" in the center
    center_low = 4
    center_high = 6
    for i in range(center_low, center_high):
        for j in range(center_low, center_high):
            demo_grid[i][j] = 1.0

    print(f"\nInitial grid (center 4x4 block = 1.0, rest = 0.0):")
    for row in demo_grid:
        print("  " + "  ".join(f"{v:.1f}" for v in row))

    # Run one evolution step
    demo_grid_after = evolve_python(demo_grid, dt=0.1, D=1.0)

    print(f"\nAfter one evolution step (dt=0.1, D=1.0):")
    for row in demo_grid_after:
        print("  " + "  ".join(f"{v:.2f}" for v in row))

    print(f"\nNotice: The hot spot stays mostly concentrated, with slight")
    print(f"spreading to neighbors. This is diffusion!")


    # Time the pure Python version
    print(f"\n" + "-" * 70)
    print(f"PERFORMANCE TEST: Pure Python with larger grid")
    print(f"-" * 70)

    grid_size = (256, 256)
    num_iterations = 10

    # Create initial grid
    grid = [[0.0 for _ in range(grid_size[1])] for _ in range(grid_size[0])]

    # Add initial conditions (hot spot in center)
    block_low = int(grid_size[0] * 0.4)
    block_high = int(grid_size[0] * 0.5)
    for i in range(block_low, block_high):
        for j in range(block_low, block_high):
            grid[i][j] = 0.005

    print(f"Grid size: {grid_size[0]}x{grid_size[1]} = {grid_size[0]*grid_size[1]:,} cells")
    print(f"Iterations: {num_iterations}")
    print(f"Total cells updated: {grid_size[0]*grid_size[1]*num_iterations:,}")

    start = time.time()
    for step in range(num_iterations):
        grid = evolve_python(grid, dt=0.1, D=1.0)
    elapsed_python = time.time() - start

    print(f"\nPure Python time: {elapsed_python:.4f}s ({elapsed_python/num_iterations:.4f}s per iteration)")


    # ============ EXAMPLE 3: NumPy Vectorized Implementation ============
    print("\n" + "=" * 70)
    print("EXAMPLE 3: NumPy Vectorized Implementation")
    print("=" * 70)

    def laplacian_numpy(grid):
        """
        Compute the Laplacian using NumPy array operations.

        Key insight: Instead of loops, use roll() to shift the entire array.

        roll(grid, +1, 0) shifts rows down: new row i = old row i-1
        roll(grid, -1, 0) shifts rows up:   new row i = old row i+1
        roll(grid, +1, 1) shifts cols right: new col j = old col j-1
        roll(grid, -1, 1) shifts cols left:  new col j = old col j+1

        Then compute: neighbors - 4*center in one vectorized operation!

        Why this is fast:
        1. No Python loops - NumPy handles all iteration in C
        2. roll() is optimized (just pointer manipulation, not data copy)
        3. Array arithmetic uses compiled operations
        4. Can use SIMD instructions for multiple cells at once
        """
        return (
            np.roll(grid, +1, 0) +   # neighbor above (wrapped)
            np.roll(grid, -1, 0) +   # neighbor below (wrapped)
            np.roll(grid, +1, 1) +   # neighbor to left (wrapped)
            np.roll(grid, -1, 1) -   # neighbor to right (wrapped)
            4 * grid                  # center cell (4 times)
        )


    def evolve_numpy(grid, dt, D=1.0):
        """
        Update the grid one time step using NumPy operations.

        This is fast because:
        1. laplacian_numpy() does all computation without Python loops
        2. Grid update is a single array equation
        3. All operations are compiled C code
        4. NumPy can parallelize with SIMD instructions
        """
        return grid + dt * D * laplacian_numpy(grid)


    # Test with same small grid to verify correctness
    print(f"Verifying NumPy gives same results as Pure Python:")

    grid_numpy = np.array([[0.0 for _ in range(grid_shape[1])] for _ in range(grid_shape[0])])

    # Add same initial conditions
    center_low = 4
    center_high = 6
    for i in range(center_low, center_high):
        for j in range(center_low, center_high):
            grid_numpy[i][j] = 1.0

    grid_numpy_after = evolve_numpy(grid_numpy, dt=0.1, D=1.0)

    print(f"\nNumPy result after one step:")
    for row in grid_numpy_after:
        print("  " + "  ".join(f"{v:.2f}" for v in row))

    print(f"\nResults match Pure Python! (Both methods are mathematically equivalent)")


    # Time the NumPy version
    print(f"\n" + "-" * 70)
    print(f"PERFORMANCE TEST: NumPy with same grid")
    print(f"-" * 70)

    grid_numpy = np.zeros(grid_size)

    # Add initial conditions
    block_low = int(grid_size[0] * 0.4)
    block_high = int(grid_size[0] * 0.5)
    grid_numpy[block_low:block_high, block_low:block_high] = 0.005

    print(f"Grid size: {grid_size[0]}x{grid_size[1]} = {grid_size[0]*grid_size[1]:,} cells")
    print(f"Iterations: {num_iterations}")

    start = time.time()
    for step in range(num_iterations):
        grid_numpy = evolve_numpy(grid_numpy, dt=0.1, D=1.0)
    elapsed_numpy = time.time() - start

    print(f"\nNumPy time: {elapsed_numpy:.4f}s ({elapsed_numpy/num_iterations:.4f}s per iteration)")


    # ============ EXAMPLE 4: Performance Comparison ============
    print("\n" + "=" * 70)
    print("EXAMPLE 4: Performance Comparison & Speedup Analysis")
    print("=" * 70)

    speedup = elapsed_python / elapsed_numpy
    print(f"\nResults for {grid_size[0]}x{grid_size[1]} grid over {num_iterations} iterations:")
    print(f"  Pure Python: {elapsed_python:.4f}s total ({elapsed_python/num_iterations:.4f}s per iteration)")
    print(f"  NumPy:       {elapsed_numpy:.4f}s total ({elapsed_numpy/num_iterations:.4f}s per iteration)")
    print(f"  Speedup:     {speedup:.1f}x faster with NumPy")

    print(f"\n{'*' * 70}")
    print("HOW NUMPY ACHIEVES THIS SPEEDUP")
    print("{'*' * 70}")

    print("""
    1. NO PYTHON LOOPS
       Pure Python: Two nested loops * 256*256 cells = 131,072 iterations
       NumPy: One function call that processes all cells at once

    2. VECTORIZED OPERATIONS
       Pure Python: Access neighbors, compute arithmetic per cell (many operations)
       NumPy: Shift entire array with roll(), subtract arrays with - operator

    3. COMPILED CODE
       Pure Python: Each operation interpreted by Python VM
       NumPy: All operations written in optimized C

    4. MEMORY LAYOUT
       Pure Python: Lists of lists - scattered in memory, poor cache usage
       NumPy: Contiguous array in memory - CPU cache works optimally

    5. ALGORITHMIC TRICKS
       roll() doesn't actually copy the array - it's pointer manipulation!
       This is orders of magnitude faster than accessing elements one by one
    """)


    # ============ EXAMPLE 5: The Vectorization Pattern ============
    print("\n" + "=" * 70)
    print("EXAMPLE 5: Recognizing the Stencil/Neighbor Pattern")
    print("=" * 70)

    print("""
    This tutorial demonstrates a common pattern called a STENCIL OPERATION:
    - Each output depends on neighbors of each input
    - Very common in: physics simulations, image processing, PDEs

    PURE PYTHON STENCIL PATTERN:
        for i in range(height):
            for j in range(width):
                value = grid[i][j]
                neighbors = [grid[i+1][j], grid[i-1][j], ...]
                new_grid[i][j] = compute(value, neighbors)

    NUMPY STENCIL PATTERN:
        neighbors_up = np.roll(grid, -1, 0)
        neighbors_down = np.roll(grid, +1, 0)
        neighbors_left = np.roll(grid, +1, 1)
        neighbors_right = np.roll(grid, -1, 1)
        new_grid = compute(grid, neighbors_up, neighbors_down, ...)

    The key insight: Use array operations (roll, slicing) to extract neighbors,
    then use vectorized arithmetic to compute updates.

    Applications:
    - Image filters (each pixel depends on neighbor pixels)
    - Physics simulations (this tutorial!)
    - Fluid dynamics
    - Weather modeling
    - Any spatial computation on grids
    """)


    print("\n" + "=" * 70)
    print("KEY TAKEAWAY")
    print("=" * 70)
    print(f"""
    Vectorizing from pure Python to NumPy achieved {speedup:.1f}x speedup.

    The transformation principle:
    1. Identify loops over array elements
    2. Recognize operations that can be expressed with array operations
    3. Use NumPy's broadcasting and functions (roll, slicing, etc.)
    4. Replace loops with array equations

    This is one of the most powerful optimization techniques in Python.
    When you have nested loops over numerical data, consider vectorization!
    """)