Loop Hoisting¶
Loop hoisting moves invariant computations outside inner loops for better performance.
Mental Model
If a computation inside a loop produces the same result every iteration, compute it once before the loop starts. This "hoisting" technique is simple but high-impact: it turns O(n) redundant work into O(1). In NumPy code, hoisting often means precomputing arrays or constants that are reused across iterations.
In the optimization hierarchy, hoisting is a loop optimization step that precedes full vectorization. When you cannot eliminate a loop entirely, hoisting invariants out of it is the next best thing — it reduces the cost per iteration, making the remaining loop as cheap as possible before you attempt to replace it with array operations.
Concept¶
1. What is Hoisting¶
Moving calculations that don't depend on the loop variable outside the loop.
2. Why It Helps¶
Avoids redundant recomputation on every iteration.
3. Common Pattern¶
```python
Before hoisting¶
for i in range(n): for j in range(m): val = expensive(i) # Doesn't depend on j result[i, j] = val + something(j)
After hoisting¶
for i in range(n): val = expensive(i) # Computed once per i for j in range(m): result[i, j] = val + something(j) ```
Image Normalization¶
1. Problem Setup¶
Normalize each row of an image by its mean and standard deviation.
```python import numpy as np
def main(): # Simulated image: (height, width, channels) img = np.random.randint(0, 256, size=(100, 150, 3), dtype=np.uint8) print(f"Image shape: {img.shape}")
if name == "main": main() ```
2. Normalization Formula¶
For each pixel \((i, j)\):
where \(\mu_i\) and \(\sigma_i\) are the mean and std of row \(i\).
3. Key Observation¶
Row statistics \(\mu_i\) and \(\sigma_i\) don't depend on column \(j\).
Naive Implementation¶
1. Code¶
```python import numpy as np import time
def main(): np.random.seed(42) img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8) img_float = img.astype(np.float64) result = np.empty_like(img_float)
ep = 1e-6
tic = time.time()
for i in range(img.shape[0]):
for j in range(img.shape[1]):
row_mean = np.mean(img[i, :, :], axis=0) # Recomputed!
row_std = np.std(img[i, :, :], axis=0) # Recomputed!
result[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
toc = time.time()
print(f"Naive Time: {toc - tic:.4f} sec")
if name == "main": main() ```
2. Problem¶
row_mean and row_std are recalculated for every pixel in the row.
3. Redundant Work¶
For a 400×600 image, statistics are computed 600× more than necessary per row.
Loop Hoisting¶
1. Code¶
```python import numpy as np import time
def main(): np.random.seed(42) img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8) img_float = img.astype(np.float64) result = np.empty_like(img_float)
ep = 1e-6
tic = time.time()
for i in range(img.shape[0]):
row_mean = np.mean(img[i, :, :], axis=0) # Hoisted!
row_std = np.std(img[i, :, :], axis=0) # Hoisted!
for j in range(img.shape[1]):
result[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
toc = time.time()
print(f"Hoisted Time: {toc - tic:.4f} sec")
if name == "main": main() ```
2. Improvement¶
Statistics computed once per row, not once per pixel.
3. Typical Speedup¶
5-10× faster than naive implementation.
Full Vectorization¶
1. Code¶
```python import numpy as np import time
def main(): np.random.seed(42) img = np.random.randint(0, 256, size=(400, 600, 3), dtype=np.uint8) img_float = img.astype(np.float64)
ep = 1e-6
tic = time.time()
# keepdims=True enables broadcasting
mean = np.mean(img_float, axis=1, keepdims=True) # (400, 1, 3)
std = np.std(img_float, axis=1, keepdims=True) # (400, 1, 3)
result = (img_float - mean) / (std + ep)
toc = time.time()
print(f"Vectorized Time: {toc - tic:.6f} sec")
print(f"mean shape: {mean.shape}")
print(f"std shape: {std.shape}")
if name == "main": main() ```
2. Key Technique¶
keepdims=True preserves dimensions for broadcasting.
3. Typical Speedup¶
100-1000× faster than naive, 10-100× faster than hoisted.
Comparison¶
1. Full Benchmark¶
```python import numpy as np import time
def main(): np.random.seed(42) img = np.random.randint(0, 256, size=(200, 300, 3), dtype=np.uint8) img_float = img.astype(np.float64) ep = 1e-6
# Naive
result1 = np.empty_like(img_float)
tic = time.time()
for i in range(img.shape[0]):
for j in range(img.shape[1]):
row_mean = np.mean(img[i, :, :], axis=0)
row_std = np.std(img[i, :, :], axis=0)
result1[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
naive_time = time.time() - tic
# Hoisted
result2 = np.empty_like(img_float)
tic = time.time()
for i in range(img.shape[0]):
row_mean = np.mean(img[i, :, :], axis=0)
row_std = np.std(img[i, :, :], axis=0)
for j in range(img.shape[1]):
result2[i, j, :] = (img_float[i, j, :] - row_mean) / (row_std + ep)
hoisted_time = time.time() - tic
# Vectorized
tic = time.time()
mean = np.mean(img_float, axis=1, keepdims=True)
std = np.std(img_float, axis=1, keepdims=True)
result3 = (img_float - mean) / (std + ep)
vec_time = time.time() - tic
print(f"Naive: {naive_time:.4f} sec")
print(f"Hoisted: {hoisted_time:.4f} sec (×{naive_time/hoisted_time:.1f})")
print(f"Vectorized: {vec_time:.6f} sec (×{naive_time/vec_time:.0f})")
# Verify results match
print(f"\nResults match: {np.allclose(result1, result3)}")
if name == "main": main() ```
2. Sample Output¶
``` Naive: 8.2341 sec Hoisted: 1.0234 sec (×8.0) Vectorized: 0.0089 sec (×925)
Results match: True ```
3. Lessons Learned¶
- Loop hoisting: Easy optimization, moderate speedup
- Full vectorization: Best performance, requires rethinking the problem
General Strategy¶
1. Identify Invariants¶
Find computations that don't depend on inner loop variables.
2. Hoist First¶
Move invariants outside inner loops as a quick win.
3. Vectorize Fully¶
Eliminate loops entirely using NumPy broadcasting and keepdims.
Exercises¶
Exercise 1. Write a vectorized NumPy solution and a pure Python loop solution for the same computation. Measure and compare their performance using time.perf_counter().
Solution to Exercise 1
```python import numpy as np import time
n = 1_000_000 data = np.random.default_rng(42).random(n)
Python loop¶
start = time.perf_counter() result_py = [x ** 2 for x in data] py_time = time.perf_counter() - start
NumPy vectorized¶
start = time.perf_counter() result_np = data ** 2 np_time = time.perf_counter() - start
print(f"Python: {py_time:.4f}s, NumPy: {np_time:.6f}s") print(f"Speedup: {py_time / np_time:.0f}x") ```
Exercise 2. Identify a potential performance pitfall in the following code and rewrite it using NumPy vectorization:
python
result = []
for i in range(len(data)):
result.append(data[i] ** 2 + 2 * data[i] + 1)
Solution to Exercise 2
```python import numpy as np
data = np.random.default_rng(42).random(100000)
Vectorized (fast)¶
result = data ** 2 + 2 * data + 1 ```
The loop version creates Python objects for each element and calls append repeatedly. The vectorized version computes everything in compiled C code on contiguous memory.
Exercise 3. Explain why NumPy vectorized operations are faster than Python loops. Reference memory layout, type checking overhead, and SIMD instructions in your answer.
Solution to Exercise 3
NumPy vectorized operations are faster because:
- Contiguous memory: NumPy arrays store elements in a contiguous block, enabling efficient CPU cache usage.
- No type checking: Python loops check types at each iteration; NumPy knows the dtype in advance.
- Compiled C loops: The actual computation runs in compiled C/Fortran code, not interpreted Python.
- SIMD instructions: Modern CPUs can process multiple array elements simultaneously using SIMD (Single Instruction, Multiple Data).
Exercise 4. Apply the concepts from this page to a practical problem: given a large array of temperatures in Celsius, convert them all to Fahrenheit and find the maximum. Compare vectorized and loop approaches.
Solution to Exercise 4
```python import numpy as np import time
rng = np.random.default_rng(42) celsius = rng.uniform(-40, 50, 1_000_000)
Vectorized¶
start = time.perf_counter() fahrenheit = celsius * 9/5 + 32 max_f = fahrenheit.max() vec_time = time.perf_counter() - start
Loop¶
start = time.perf_counter() max_f_loop = max(c * 9/5 + 32 for c in celsius) loop_time = time.perf_counter() - start
print(f"Vectorized: {vec_time:.6f}s, max={max_f:.1f}F") print(f"Loop: {loop_time:.4f}s, max={max_f_loop:.1f}F") ```