Skip to content

Loop Hoisting

Loop hoisting moves invariant computations outside inner loops for better performance.

Concept

1. What is Hoisting

Moving calculations that don't depend on the loop variable outside the loop.

2. Why It Helps

Avoids redundant recomputation on every iteration.

3. Common Pattern

# Before hoisting
for i in range(n):
    for j in range(m):
        val = expensive(i)  # Doesn't depend on j
        result[i, j] = val + something(j)

# After hoisting
for i in range(n):
    val = expensive(i)  # Computed once per i
    for j in range(m):
        result[i, j] = val + something(j)

Image Normalization

1. Problem Setup

Normalize each row of an image by its mean and standard deviation.

import numpy as np

def main():
    # Simulated image: (height, width, channels)
    img = np.random.randint(0, 256, size=(100, 150, 3), dtype=np.uint8)
    print(f"Image shape: {img.shape}")

if __name__ == "__main__":
    main()

2. Normalization Formula

For each pixel \((i, j)\):

\[\text{normalized}_{i,j} = \frac{\text{pixel}_{i,j} - \mu_i}{\sigma_i + \epsilon}\]

where \(\mu_i\) and \(\sigma_i\) are the mean and std of row \(i\).

3. Key Observation

Row statistics \(\mu_i\) and \(\sigma_i\) don't depend on column \(j\).

Naive Implementation

1. Code

import numpy as np
import time

def main():
    np.random.seed(42)
    img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8)
    img_float = img.astype(np.float64)
    result = np.empty_like(img_float)

    ep = 1e-6

    tic = time.time()
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            row_mean = np.mean(img[i, :, :], axis=0)  # Recomputed!
            row_std = np.std(img[i, :, :], axis=0)   # Recomputed!
            result[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
    toc = time.time()

    print(f"Naive Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Problem

row_mean and row_std are recalculated for every pixel in the row.

3. Redundant Work

For a 400×600 image, statistics are computed 600× more than necessary per row.

Loop Hoisting

1. Code

import numpy as np
import time

def main():
    np.random.seed(42)
    img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8)
    img_float = img.astype(np.float64)
    result = np.empty_like(img_float)

    ep = 1e-6

    tic = time.time()
    for i in range(img.shape[0]):
        row_mean = np.mean(img[i, :, :], axis=0)  # Hoisted!
        row_std = np.std(img[i, :, :], axis=0)   # Hoisted!
        for j in range(img.shape[1]):
            result[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
    toc = time.time()

    print(f"Hoisted Time: {toc - tic:.4f} sec")

if __name__ == "__main__":
    main()

2. Improvement

Statistics computed once per row, not once per pixel.

3. Typical Speedup

5-10× faster than naive implementation.

Full Vectorization

1. Code

import numpy as np
import time

def main():
    np.random.seed(42)
    img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8)
    img_float = img.astype(np.float64)

    ep = 1e-6

    tic = time.time()
    # keepdims=True enables broadcasting
    mean = np.mean(img_float, axis=1, keepdims=True)  # (400, 1, 3)
    std = np.std(img_float, axis=1, keepdims=True)    # (400, 1, 3)
    result = (img_float - mean) / (std + ep)
    toc = time.time()

    print(f"Vectorized Time: {toc - tic:.6f} sec")
    print(f"mean shape: {mean.shape}")
    print(f"std shape: {std.shape}")

if __name__ == "__main__":
    main()

2. Key Technique

keepdims=True preserves dimensions for broadcasting.

3. Typical Speedup

100-1000× faster than naive, 10-100× faster than hoisted.

Comparison

1. Full Benchmark

import numpy as np
import time

def main():
    np.random.seed(42)
    img = np.random.randint(0, 256, size=(200, 300, 3), dtype=np.uint8)
    img_float = img.astype(np.float64)
    ep = 1e-6

    # Naive
    result1 = np.empty_like(img_float)
    tic = time.time()
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            row_mean = np.mean(img[i, :, :], axis=0)
            row_std = np.std(img[i, :, :], axis=0)
            result1[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
    naive_time = time.time() - tic

    # Hoisted
    result2 = np.empty_like(img_float)
    tic = time.time()
    for i in range(img.shape[0]):
        row_mean = np.mean(img[i, :, :], axis=0)
        row_std = np.std(img[i, :, :], axis=0)
        for j in range(img.shape[1]):
            result2[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
    hoisted_time = time.time() - tic

    # Vectorized
    tic = time.time()
    mean = np.mean(img_float, axis=1, keepdims=True)
    std = np.std(img_float, axis=1, keepdims=True)
    result3 = (img_float - mean) / (std + ep)
    vec_time = time.time() - tic

    print(f"Naive:      {naive_time:.4f} sec")
    print(f"Hoisted:    {hoisted_time:.4f} sec (×{naive_time/hoisted_time:.1f})")
    print(f"Vectorized: {vec_time:.6f} sec (×{naive_time/vec_time:.0f})")

    # Verify results match
    print(f"\nResults match: {np.allclose(result1, result3)}")

if __name__ == "__main__":
    main()

2. Sample Output

Naive:      8.2341 sec
Hoisted:    1.0234 sec (×8.0)
Vectorized: 0.0089 sec (×925)

Results match: True

3. Lessons Learned

  • Loop hoisting: Easy optimization, moderate speedup
  • Full vectorization: Best performance, requires rethinking the problem

General Strategy

1. Identify Invariants

Find computations that don't depend on inner loop variables.

2. Hoist First

Move invariants outside inner loops as a quick win.

3. Vectorize Fully

Eliminate loops entirely using NumPy broadcasting and keepdims.